Performance characteristics of Python's tarfile
When working with a Python codebase I stumbled upon an awkward performance bug. Let’s take a look at how to reproduce it, the initial debugging steps, and the workaround.
Background
To understand the cause of this bug, let’s get some context on the problem the code was written to solve. The snippet in question took a set of binary files and compressed them into a gzipped tarfile. There were about five binary input files, each around a gigabyte in size.
The code snippet for compressing the tarfiles looked roughly like this:
```python
# compress.py
import tarfile

tarfiles = [f"tar{x!s}" for x in range(5)]

if __name__ == "__main__":
    # to get compress-with-6.py just add compresslevel=6
    # as a kwarg to the open call
    with tarfile.open("final.tar.gz", "w:gz") as tar:
        for name in tarfiles:
            tar.add(name)
```
Benchmarking
Let’s create some test data by running this script:
```bash
#!/usr/bin/env bash
for i in {0..4}
do
    head -c 1G </dev/urandom > tar${i}
done
```
Let’s see how long it takes compress.py to archive our files (all code in this post was tested on an idle AWS c6i.4xlarge EC2 instance):
```
$ time python3 compress.py

real    2m37.690s
user    2m33.203s
sys     0m3.755s
```
`real` is wall-clock time, while `user` + `sys` is CPU time. Because the `real` and `user` times are pretty close, we can guess that Python is performing the compression on a single thread. Since compression is a parallelizable task, we should be able to improve on this. Let’s dig in and see if we can speed this step up!
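The real/user distinction can be seen concretely with a small sketch (not from the original post): for single-threaded, CPU-bound work, wall-clock time and CPU time come out nearly equal, mirroring the close `real` and `user` values above.

```python
import time

def cpu_bound_work(n=2_000_000):
    # a simple single-threaded busy loop
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.perf_counter()  # wall-clock time, like `real`
cpu_start = time.process_time()   # CPU time of this process, like `user` + `sys`
cpu_bound_work()
wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

# For single-threaded CPU-bound code, wall and cpu are nearly equal;
# a parallel workload would instead show cpu much larger than wall.
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```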
Analysis
Searching Google for similar symptoms brings us to this Stack Overflow post that described the same problem I was seeing. Some answers mentioned that Python compresses the data more than similar tools do, which could be leading to the slowdown. Let’s explore that idea.
The Python documentation says that the default `compresslevel` is 9 for the `tarfile` module. `gzip`’s default level is 6, so let’s try a compression level of 6 in the Python implementation to see if we can get a speed improvement.
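For reference, the only change needed for compress-with-6.py is the `compresslevel` keyword argument. Here is a minimal self-contained sketch, using small placeholder files (`demo0.bin`, `demo1.bin`) instead of the gigabyte inputs:

```python
import os
import tarfile

# Create small placeholder inputs so this sketch runs standalone
for x in range(2):
    with open(f"demo{x}.bin", "wb") as f:
        f.write(os.urandom(1024))

# tarfile defaults to compresslevel=9; 6 matches the gzip CLI default
with tarfile.open("demo.tar.gz", "w:gz", compresslevel=6) as tar:
    for x in range(2):
        tar.add(f"demo{x}.bin")
```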
Let’s also use hyperfine for the comparison to remove some noise (`hyperfine` will run the command multiple times and give an averaged measurement):
```
$ hyperfine 'python3 compress-with-6.py' 'python3 compress.py'
Benchmark 1: python3 compress-with-6.py
  Time (mean ± σ):     158.204 s ±  1.295 s    [User: 153.267 s, System: 3.763 s]
  Range (min … max):   157.261 s … 160.656 s    10 runs

Benchmark 2: python3 compress.py
  Time (mean ± σ):     157.848 s ±  1.441 s    [User: 153.279 s, System: 3.690 s]
  Range (min … max):   157.086 s … 161.888 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  python3 compress.py ran
    1.00 ± 0.01 times faster than python3 compress-with-6.py
```
These results seem to show that a compression level of 9 isn’t significantly slower than a compression level of 6 for this data set. Setting the compression level in this case was a red herring.
Although setting the compression level did not have any impact when archiving these binaries, I did notice that a compression level of 6 was marginally faster on the data used in the original Python codebase. This intuitively makes sense, because at level 6 `gzip` will not attempt to compress the data as hard as it does at level 9. It’s possible that we didn’t observe any speedup here because our binaries are just noise, while the binaries in the other project were not (but this is just speculation).
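That speculation is easy to sanity-check with the standard-library `gzip` module: random bytes are essentially incompressible, while repetitive data collapses almost entirely. A quick sketch:

```python
import gzip
import os

random_data = os.urandom(1_000_000)      # pure noise, like our test files
repetitive_data = b"A" * 1_000_000       # highly compressible

ratio_random = len(gzip.compress(random_data, compresslevel=9)) / len(random_data)
ratio_repetitive = len(gzip.compress(repetitive_data, compresslevel=9)) / len(repetitive_data)

# Random data stays at roughly its original size (or slightly larger),
# while the repetitive data shrinks to a tiny fraction of its size.
print(f"random: {ratio_random:.4f}, repetitive: {ratio_repetitive:.4f}")
```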
Using an alternative implementation
Let’s try some command line utilities to see if they fare better. `tar`¹ paired with `pigz` should be able to take advantage of the high core count on this AWS instance.

To compress a set of files into a tarfile, we run `tar cf - tar* | pigz > final.tar.gz`. Let’s look at the performance benchmarks:
```
$ hyperfine 'python3 compress-with-6.py' 'python3 compress.py' 'tar cf - tar* | pigz > final.tar.gz'
Benchmark 1: python3 compress-with-6.py
  Time (mean ± σ):     158.439 s ±  1.822 s    [User: 153.565 s, System: 3.694 s]
  Range (min … max):   157.374 s … 163.444 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: python3 compress.py
  Time (mean ± σ):     158.084 s ±  1.129 s    [User: 153.550 s, System: 3.712 s]
  Range (min … max):   157.236 s … 161.074 s    10 runs

Benchmark 3: tar cf - tar* | pigz > final.tar.gz
  Time (mean ± σ):      47.122 s ±  1.753 s    [User: 212.975 s, System: 10.280 s]
  Range (min … max):    42.134 s …  47.702 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  tar cf - tar* | pigz > final.tar.gz ran
    3.35 ± 0.13 times faster than python3 compress.py
    3.36 ± 0.13 times faster than python3 compress-with-6.py
```
We can see that in these experiments, using `tar` in conjunction with `pigz` offers a ~3x speedup compared to using the built-in Python standard library. We also see that the `user` + `sys` time is much higher than the wall-clock time, which means that `tar` + `pigz` is using multiple cores to archive the files.
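If you want to keep the orchestration in Python, the same pipeline can be driven with `subprocess`. Below is a hedged sketch (not the original codebase's code) that replicates `tar cf - ... | pigz > out.tar.gz`, falling back to single-threaded `gzip` when `pigz` is not installed; it creates small placeholder files so it runs standalone:

```python
import os
import shutil
import subprocess

# Small placeholder inputs so the sketch runs standalone
files = []
for x in range(3):
    name = f"demo{x}.bin"
    with open(name, "wb") as f:
        f.write(os.urandom(4096))
    files.append(name)

# Prefer pigz for parallel compression; fall back to gzip if it is absent
compressor = shutil.which("pigz") or shutil.which("gzip")

with open("out.tar.gz", "wb") as out:
    # tar writes the archive to stdout, which the compressor reads on stdin
    tar_proc = subprocess.Popen(["tar", "cf", "-", *files], stdout=subprocess.PIPE)
    gz_proc = subprocess.Popen([compressor], stdin=tar_proc.stdout, stdout=out)
    tar_proc.stdout.close()  # let tar receive SIGPIPE if the compressor exits early
    gz_proc.wait()
    tar_proc.wait()
```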
Final thoughts
These benchmarks are similar to the improvement I saw in the other Python codebase I was working on, though that workload was larger and the EC2 instance had more cores. In the original codebase, a 6-minute archive task was cut down to a 15-second operation, which turned a 30-minute Continuous Integration (CI) job into a 24-minute job. This is a prime example of a massive performance improvement achieved with a small code change.
However, to generalize from this performance debugging journey: developers need insight into their system to know what to optimize. Without that visibility, they are not empowered to make performance-related changes. In our case, I was able to catch this performance bug because we were logging the elapsed time of each step executed in our CI environment. Although elapsed time is a simple performance metric, in this case it proved to be a valuable one.
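The kind of lightweight step timing that surfaced this bug takes only a few lines. Here is a minimal sketch (the names `timed_step` and `timings` are illustrative, not from the original CI setup):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed_step(name):
    # Record and log the elapsed wall-clock time of a named step
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start
        print(f"[timing] {name}: {timings[name]:.2f}s")

with timed_step("archive"):
    time.sleep(0.1)  # stand-in for the real archiving work
```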
1. Beware of the differences between BSD `tar` and GNU `tar`. They are not compatible, and if users are expecting archives created by one and you provide archives created by the other, they will run into issues untarring your files. macOS uses a BSD-style `tar` implementation by default, while most Linux systems use a GNU-style implementation.