A typical Linux* OS offers many options for reducing the storage space of data. The typical list of compression options includes things like zlib, xz, and bzip2 as well as lz4 and Snappy. Several of these compression algorithms provide a tunable, called "level", a number from 0 to 9 that changes the behavior of the algorithm. The result is many choices, and this blog post tries to show the differences between them.

About data compression

First, let us have a quick word about data compression algorithms. Data compression falls in the realm of information theory and has been thoroughly studied since Claude Shannon started research in this area in 1948. I am going to spare you the hard sciency bits for the rest of the article, but a quick high-level overview will be useful.

Data compression algorithms take advantage of the redundancy in the files they are compressing to reduce the size of the files and thus the storage/transmission requirements. Generally, compression algorithms work in two steps: the algorithm uses information from earlier in the file to find repeating patterns, and the patterns are then turned into an efficient stream of bits and bytes on disk - the so-called entropy coding.

Various scientists have proven that an optimal compression algorithm exists. For example, Context Tree Weighting was the first algorithm proven to be optimal for very large data streams with a fixed-size startup overhead. I've had the pleasure of being in a classroom watching Professor Ziv prove that the LZ77 algorithm is likewise theoretically optimal. The LZ77 family of algorithms forms the basis of most of the data compression implementations used in this blog post.

If multiple algorithms are proven to be theoretically optimal, why does this blog post exist? The catch is that this optimality only holds for very large data sets and, generally, without regard for the computation time and memory consumption the optimal algorithm would take to implement. The key is the first step described before: finding the repeating patterns. The various implementations I will compare in this blog post differ primarily in how far back they search and how efficient their search algorithms are.

No blog post about data compression implementations can exist without talking about the zlib algorithm, called Deflate, and its implementation. The zlib algorithm and file format are standardized in internet standards (RFC 1950 and RFC 1951) and have been extremely widely used since 1995 - partly because they avoided many of the legal pitfalls that hit other algorithms in the litigious era of the 1990s.

Zlib provides 9 levels of compression plus a "level 0" which just does a 1:1 uncompressed store of the data. The amount of compute time spent on searching for repeating patterns varies from level to level, but the format on disk is the same for all levels, shielding a decompressor from the level differences. To show the differences between these levels, I have measured both the compression ratio - how much smaller the data got - and the time it took to compress a reference file. The reference file I used is the source code tar file of the Linux 4.9 kernel, which is 664 megabytes in size.

As you can see in figure 1, as the compression level goes up, so does the time to compress, and at the higher levels (7-9) the time goes up a lot for a modest gain in extra compression. Utilities like "gzip" generally use level 6 as the default level, avoiding the steep end of the curve.

Now, regarding the performance of zlib, details matter: various Linux distributions end up with different performance, and, specifically, the Clear Linux* OS distribution (a project I work on) ships with an optimized implementation of zlib. More information can be found in the whitepaper that Intel published about these optimizations.

Adding XZ (LZMA*)

As much as zlib is the default standard algorithm, you likely have seen a lot of files with the xz extension. The xz implementation is based on the earlier lzma implementation and in many ways modernizes the zlib algorithm. Where the zlib algorithm and file format are limited to a history or "window" of 32 kilobytes, showing its 1995 age, the xz implementation uses a much larger history size. This means the algorithm is going to find more repeating patterns than zlib does and can thus gain a higher compression ratio. At the same time, this ability to look further back also consumes more CPU cycles and memory. The xz implementation has 10 levels (0-9) of compression, and the compression ratio vs. compression time trade-off again varies with the level.
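The level sweep described above can be sketched with Python's built-in zlib and lzma modules. This is a quick illustration on small synthetic data, not a reproduction of the article's kernel-tarball benchmark, so the absolute numbers will differ; the shape of the curve (more time, diminishing ratio gains at high levels) is what it shows.

```python
# Sketch of the level-sweep measurement described above, using Python's
# built-in zlib and lzma modules on small synthetic data (the article's
# benchmark used the 664 MB Linux 4.9 kernel source tarball instead).
import lzma
import time
import zlib

# Redundant sample data so the pattern-matching step has work to do.
data = b"the quick brown fox jumps over the lazy dog\n" * 20000

print(f"uncompressed: {len(data)} bytes")
for level in range(10):  # zlib level 0 is a 1:1 uncompressed store
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"zlib  level {level}: {len(out):8d} bytes  {elapsed:.4f}s")

for preset in range(10):  # xz/lzma exposes presets 0-9 the same way
    start = time.perf_counter()
    out = lzma.compress(data, preset=preset)
    elapsed = time.perf_counter() - start
    print(f"lzma preset {preset}: {len(out):8d} bytes  {elapsed:.4f}s")
```

Note that, as the article says, the on-disk format is the same for every level, so `zlib.decompress()` and `lzma.decompress()` need no level argument at all.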
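The history-window difference is easy to demonstrate directly. In this sketch (my own toy construction, not from the article), a 64 KB block is repeated roughly a megabyte later: the second copy is far outside zlib's 32 KB window, so Deflate has to store it again, while xz/lzma's much larger history lets it encode the repeat as a match.

```python
# Illustration of the history-window difference described above: a block
# repeated ~1 MiB apart is out of reach of zlib's 32 KB window, but well
# within xz/lzma's much larger history.
import lzma
import os
import zlib

block = os.urandom(64 * 1024)   # 64 KB of incompressible random data
gap = bytes(1024 * 1024)        # 1 MiB of zeros between the two copies
data = block + gap + block      # the second copy repeats the first

zl = zlib.compress(data, 9)
xz = lzma.compress(data, preset=9)

# zlib can no longer "see" the first copy of `block` when it reaches the
# second one, so it stores the random bytes twice (~128 KB); lzma finds
# the long-range match and stores them roughly once (~64 KB).
print(f"zlib: {len(zl)} bytes, lzma: {len(xz)} bytes")
```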