Advent 2021: Zstandard
This blog post is part of the 24-post series "Advent 2021":
- Advent 2021: Intro (December 01, 2021)
- Advent 2021: C++ (December 02, 2021)
- Advent 2021: C# (December 03, 2021)
- Advent 2021: Python (December 04, 2021)
- Advent 2021: Go (December 05, 2021)
- Advent 2021: TypeScript (December 06, 2021)
- Advent 2021: CMake (December 07, 2021)
- Advent 2021: Django (December 08, 2021)
- Advent 2021: Angular (December 09, 2021)
- Advent 2021: Flask (December 10, 2021)
- Advent 2021: gRPC (December 11, 2021)
- Advent 2021: GraphQL (December 12, 2021)
- Advent 2021: XML & JSON (December 13, 2021)
- Advent 2021: Matplotlib, Pandas & Numpy (December 14, 2021)
- Advent 2021: Linux (December 15, 2021)
- Advent 2021: Ansible (December 16, 2021)
- Advent 2021: SQLite (December 17, 2021)
- Advent 2021: Catch2 (December 18, 2021)
- Advent 2021: Zstandard (December 19, 2021)
- Advent 2021: ZFS (December 20, 2021)
- Advent 2021: Thunderbird (December 21, 2021)
- Advent 2021: Visual Studio Code (December 22, 2021)
- Advent 2021: Blender (December 23, 2021)
- Advent 2021: Open source (December 24, 2021)
Yet another library that has appeared recently and really changed how I do things is Zstandard – a compression library. Compression is an interesting area of computer science which hasn't seen much innovation for a long time. When I started programming, there were dozens of competing algorithms and tools: RAR, UHARC, ACE, ARJ, ZIP, and a few more exotic ones. After a while, things seemed to settle down on ZIP and 7z, with ZIP being the clear leader. (ZIP uses deflate compression – ZIP itself is just the container, but it's pretty much universally used with deflate. The canonical library for that is zlib.) More recently, a few new algorithms started to crop up: Brotli, LZ4, and Snappy. They attacked the compression problem from new angles, and I ended up using LZ4 a lot for compressing data sets as it is incredibly easy to add, but they all remained fairly niche.
Fast forward a few years and a new contender entered the ring: Zstd. Zstd provided a great blend of compression quality and performance, being as good as standard ZIP at much higher speed, or being dramatically faster at slightly lower compression ratios. Especially in the age of NVMe drives which can read at GiB/s rates, you really don’t want a compression algorithm which can’t decompress at that speed. Additionally, it’s scalable: It can go extremely fast at reduced compression, or achieve very high compression rates at significantly higher cost.
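To give an idea of how that scalability looks in practice, here's a minimal sketch using libzstd's one-shot API. The single compression level parameter is the knob that moves you between "extremely fast" and "very high compression"; the buffer handling and error checking are kept to the bare minimum for illustration.

```c
// Minimal sketch of the libzstd one-shot API (zstd.h).
// The compression level trades speed for ratio.
#include <zstd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char* input = "Some example payload that we want to compress.";
    const size_t inputSize = strlen(input) + 1;

    // ZSTD_compressBound gives the worst-case compressed size for a buffer.
    const size_t bound = ZSTD_compressBound(inputSize);
    void* compressed = malloc(bound);

    // Level 1 is very fast, the default is 3, and ZSTD_maxCLevel()
    // spends a lot more CPU time for the best ratio.
    const int level = 19;
    const size_t compressedSize =
        ZSTD_compress(compressed, bound, input, inputSize, level);
    if (ZSTD_isError(compressedSize)) {
        fprintf(stderr, "Compression failed: %s\n", ZSTD_getErrorName(compressedSize));
        return 1;
    }

    // Zstd frames store the original size by default, so we can query it back.
    const unsigned long long originalSize =
        ZSTD_getFrameContentSize(compressed, compressedSize);
    void* decompressed = malloc((size_t)originalSize);
    const size_t decompressedSize =
        ZSTD_decompress(decompressed, (size_t)originalSize, compressed, compressedSize);
    if (ZSTD_isError(decompressedSize)) {
        fprintf(stderr, "Decompression failed: %s\n", ZSTD_getErrorName(decompressedSize));
        return 1;
    }

    printf("%zu bytes -> %zu bytes at level %d\n", inputSize, compressedSize, level);
    free(decompressed);
    free(compressed);
    return 0;
}
```

Build it with something like `cc example.c -lzstd`, assuming the zstd development package is installed.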
For me, Zstd has replaced most uses of LZ4 and zlib by now. I compress data sets using Zstd as it's much faster and has no downsides, simply because it's a better algorithm. I also use it for backups and other long-term archival storage as it doesn't seem to go away anytime soon. The next frontier here will hopefully be a replacement for ZIP (the package format!), as that one hasn't aged very well but is still ubiquitous and oftentimes the only supported built-in format. I'm optimistic though – there's a good reason to develop a new on-disk format now (if only to support the latest advances in compression) and I wouldn't be surprised if we saw a ZIP replacement in the coming years. Till then, I'll continue to use Zstd where I already can!
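For larger data sets and backups I don't want to hold everything in memory at once, and libzstd's streaming API covers that case. The sketch below follows the usual streaming-compression pattern; the file names and the chosen level are placeholders for illustration, not a recommendation.

```c
// Sketch of streaming compression with libzstd, suitable for large files.
// "input.bin" and "output.bin.zst" are placeholder names for this example.
#include <zstd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE* fin = fopen("input.bin", "rb");
    FILE* fout = fopen("output.bin.zst", "wb");
    if (!fin || !fout) { return 1; }

    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    // Pick a moderate level; add a checksum so corruption is caught on decompress.
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 9);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 1);

    const size_t inSize = ZSTD_CStreamInSize();   // recommended input chunk size
    const size_t outSize = ZSTD_CStreamOutSize(); // recommended output buffer size
    void* inBuf = malloc(inSize);
    void* outBuf = malloc(outSize);

    for (;;) {
        const size_t read = fread(inBuf, 1, inSize, fin);
        // The last (short) chunk ends the frame and flushes everything.
        const ZSTD_EndDirective mode = (read < inSize) ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer input = { inBuf, read, 0 };
        size_t remaining;
        do {
            ZSTD_outBuffer output = { outBuf, outSize, 0 };
            remaining = ZSTD_compressStream2(cctx, &output, &input, mode);
            if (ZSTD_isError(remaining)) {
                fprintf(stderr, "Error: %s\n", ZSTD_getErrorName(remaining));
                return 1;
            }
            fwrite(outBuf, 1, output.pos, fout);
        } while (mode == ZSTD_e_end ? (remaining != 0) : (input.pos < input.size));
        if (mode == ZSTD_e_end) { break; }
    }

    ZSTD_freeCCtx(cctx);
    free(inBuf);
    free(outBuf);
    fclose(fin);
    fclose(fout);
    return 0;
}
```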