Performance characteristics of Python's tarfile
When working with a Python codebase I stumbled upon an awkward performance bug. Let’s take a look at how to reproduce it, the initial debugging steps, and the workaround.
Background
To understand the cause of this bug, let’s get some context on the problem the code was written to solve. The snippet in question took a set of binary files and compressed them into a gzipped tarfile. There were about five binary input files, each around a gigabyte in size.
The code snippet for compressing the tarfiles looked roughly like this:
```python
# compress.py
import tarfile

tarfiles = [f"tar{x!s}" for x in range(5)]

if __name__ == "__main__":
    # to get compress-with-6.py just add compresslevel=6
    # as a kwarg to the open call
    with tarfile.open("final.tar.gz", "w:gz") as tar:
        for name in tarfiles:
            tar.add(name)
```
Benchmarking
Let’s create some test data by running this script:
```bash
#!/usr/bin/env bash
for i in {0..4}
do
    head -c 1G </dev/urandom > tar${i}
done
```
Let’s see how long it takes compress.py to archive our files (all code in this post was tested on an idle AWS c6i.4xlarge EC2 instance):
```
$ time python3 compress.py

real    2m37.690s
user    2m33.203s
sys     0m3.755s
```
`real` is wall-clock time, while `user` + `sys` is CPU time. Because the `real` and `user` times are pretty close, we can guess that Python is performing the compression on a single thread. Since compression is a parallelizable task, we should be able to improve on this. Let’s dig in and see if we can speed this step up!
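The real/user distinction can be seen concretely with a small sketch (not from the original post): for single-threaded, CPU-bound work, wall-clock time and CPU time come out nearly equal, mirroring the close `real` and `user` values above.

```python
import time

def cpu_bound_work(n=2_000_000):
    # a simple single-threaded busy loop
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.perf_counter()  # wall-clock time, like `real`
cpu_start = time.process_time()   # CPU time of this process, like `user` + `sys`
cpu_bound_work()
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# For single-threaded CPU-bound code, wall and cpu are nearly equal;
# a parallel workload would instead show cpu much larger than wall.
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```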
Analysis
Searching Google for similar symptoms brings us to this Stack Overflow post that described the same problem I was seeing. Some answers mentioned that Python compresses the data more than similar tools do, which could be leading to the slowdown. Let’s explore that idea.
The Python documentation says that the default `compresslevel` is 9 for the `tarfile` module. `gzip`’s default level is 6, so let’s try a compression level of 6 in the Python implementation to see if we can get a speed improvement.
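For reference, the only change needed for compress-with-6.py is the `compresslevel` keyword argument. Here is a minimal self-contained sketch, using small placeholder files (`demo0.bin`, `demo1.bin`) instead of the gigabyte inputs:

```python
import os
import tarfile

# Create small placeholder inputs so this sketch runs standalone
for x in range(2):
    with open(f"demo{x}.bin", "wb") as f:
        f.write(os.urandom(1024))

# tarfile defaults to compresslevel=9; 6 matches the gzip CLI default
with tarfile.open("demo.tar.gz", "w:gz", compresslevel=6) as tar:
    for x in range(2):
        tar.add(f"demo{x}.bin")
```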
Let’s also use hyperfine for the comparison to remove some noise (`hyperfine` will run the command multiple times and give an averaged measurement):
```
$ hyperfine 'python3 compress-with-6.py' 'python3 compress.py'
Benchmark 1: python3 compress-with-6.py
  Time (mean ± σ):     158.204 s ±  1.295 s    [User: 153.267 s, System: 3.763 s]
  Range (min … max):   157.261 s … 160.656 s    10 runs

Benchmark 2: python3 compress.py
  Time (mean ± σ):     157.848 s ±  1.441 s    [User: 153.279 s, System: 3.690 s]
  Range (min … max):   157.086 s … 161.888 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  python3 compress.py ran
    1.00 ± 0.01 times faster than python3 compress-with-6.py
```
These results seem to show that a compression level of 9 isn’t significantly slower than a compression level of 6 for this data set. Setting the compression level in this case was a red herring.
Although setting the compression level did not have any impact when archiving these binaries, I did notice that a compression level of 6 was marginally faster on the data used in the original Python codebase. This intuitively makes sense, because at level 6 `gzip` will not attempt to compress the data as hard as it does at level 9. It’s possible that we didn’t observe any speedup here because our binaries are just noise, while the binaries in the other project were not (but this is just speculation).
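That speculation is easy to sanity-check with the standard-library `gzip` module: random bytes are essentially incompressible, while repetitive data collapses almost entirely. A quick sketch:

```python
import gzip
import os

random_data = os.urandom(1_000_000)      # pure noise, like our test files
repetitive_data = b"A" * 1_000_000       # highly compressible

ratio_random = len(gzip.compress(random_data, compresslevel=9)) / len(random_data)
ratio_repetitive = len(gzip.compress(repetitive_data, compresslevel=9)) / len(repetitive_data)

# Random data stays at roughly its original size (or slightly larger),
# while the repetitive data shrinks to a tiny fraction of its size.
print(f"random: {ratio_random:.4f}, repetitive: {ratio_repetitive:.4f}")
```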
Using an alternative implementation
Let’s try some command line utilities to see if they fare better. `tar`¹ paired with `pigz` should be able to take advantage of the high core count on this AWS instance.

To compress a set of files into a tarfile, we run `tar cf - tar* | pigz > final.tar.gz`. Let’s look at the performance benchmarks:
```
$ hyperfine 'python3 compress-with-6.py' 'python3 compress.py' 'tar cf - tar* | pigz > final.tar.gz'
Benchmark 1: python3 compress-with-6.py
  Time (mean ± σ):     158.439 s ±  1.822 s    [User: 153.565 s, System: 3.694 s]
  Range (min … max):   157.374 s … 163.444 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: python3 compress.py
  Time (mean ± σ):     158.084 s ±  1.129 s    [User: 153.550 s, System: 3.712 s]
  Range (min … max):   157.236 s … 161.074 s    10 runs

Benchmark 3: tar cf - tar* | pigz > final.tar.gz
  Time (mean ± σ):      47.122 s ±  1.753 s    [User: 212.975 s, System: 10.280 s]
  Range (min … max):    42.134 s …  47.702 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  tar cf - tar* | pigz > final.tar.gz ran
    3.35 ± 0.13 times faster than python3 compress.py
    3.36 ± 0.13 times faster than python3 compress-with-6.py
```
We can see that in these experiments, using `tar` in conjunction with `pigz` offers a ~3x speedup compared to using the built-in Python standard library. We also see that the `user` + `sys` time is much higher than the wall-clock time, which means that `tar` + `pigz` is using multiple cores to archive the files.
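If you want to keep the orchestration in Python, the same pipeline can be driven with `subprocess`. Below is a hedged sketch (not the original codebase's code) that replicates `tar cf - ... | pigz > out.tar.gz`, falling back to single-threaded `gzip` when `pigz` is not installed; it creates small placeholder files so it runs standalone:

```python
import os
import shutil
import subprocess

# Small placeholder inputs so the sketch runs standalone
files = []
for x in range(3):
    name = f"demo{x}.bin"
    with open(name, "wb") as f:
        f.write(os.urandom(4096))
    files.append(name)

# Prefer pigz for parallel compression; fall back to gzip if it is absent
compressor = shutil.which("pigz") or shutil.which("gzip")

with open("out.tar.gz", "wb") as out:
    # tar writes the archive to stdout, which the compressor reads on stdin
    tar_proc = subprocess.Popen(["tar", "cf", "-", *files], stdout=subprocess.PIPE)
    gz_proc = subprocess.Popen([compressor], stdin=tar_proc.stdout, stdout=out)
    tar_proc.stdout.close()  # let tar receive SIGPIPE if the compressor exits early
    gz_proc.wait()
    tar_proc.wait()
```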
Final thoughts
These benchmarks are similar to the improvement I saw in the other Python codebase I was working on, though that workload was larger and the EC2 instance had more cores. In the original codebase, a 6-minute archive task was cut down to a 15-second operation, which turned a 30-minute Continuous Integration (CI) job into a 24-minute job. This is a prime example of a massive performance improvement achieved with a small code change.
However, to generalize from this performance debugging journey: developers need insight into their system to know what to optimize. Without that visibility, they are not empowered to make performance-related changes. In our case, I was able to catch this performance bug because we were logging the elapsed time of each step executed in our CI environment. Although elapsed time is a simple performance metric, in this case it proved to be a valuable one.
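The kind of lightweight step timing that surfaced this bug takes only a few lines. Here is a minimal sketch (the names `timed_step` and `timings` are illustrative, not from the original CI setup):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed_step(name):
    # Record and log the elapsed wall-clock time of a named step
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start
        print(f"[timing] {name}: {timings[name]:.2f}s")

with timed_step("archive"):
    time.sleep(0.1)  # stand-in for the real archiving work
```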
1. Beware of the differences between BSD `tar` and GNU `tar`. They are not compatible, and if users are expecting archives created by one and you provide archives created by the other, they will run into issues untarring your files. macOS uses a BSD-style `tar` implementation by default, while most Linux systems use a GNU-style implementation.