wimlib supports compression and decompression in all of the compression formats known to be used in WIM archives:
wimlib's compressors for these formats usually outperform and outcompress their closed-source Microsoft equivalents. See the benchmarks.
Since these formats are not well known outside of Microsoft, some extra information is provided below.
wimlib's XPRESS and LZX decompressors are also reused in the NTFS-3G system compression plugin by the same author, which extends NTFS-3G with support for reading files compressed with the "system compression" feature Microsoft introduced in Windows 10.
The Huffman variant of the XPRESS compression format uses LZ77-style dictionary compression combined with Huffman coding. It is designed for fast compression and decompression with a small dictionary size (typically 4096 to 65,536 bytes). The uncompressed data is represented as a sequence of literals and matches. A literal is a single byte, whereas a match consists of a (length, offset) pair which tells the decompressor to copy 3 or more bytes from data it has already decompressed. A single alphabet of 512 symbols combines literals (256 possible values) and match headers (256 more possible values) and is Huffman-coded.
Example uses of XPRESS:
wimlib's XPRESS compressor supports different compression levels (the default is level 50):
The LZX format is basically a slightly improved version of the more commonly used DEFLATE format (the format used in ZIP files, and by zlib and gzip). It combines LZ77-style dictionary compression with Huffman coding and is intended to be used with a small-to-medium dictionary size (32,768 to 2,097,152 bytes). Uncompressed data is represented a sequence of literals and matches. Literals and match headers are combined into a single Huffman-coded "main alphabet". The lengths of long matches are Huffman-coded using the "length alphabet". The compressor can choose to divide the input into multiple "blocks", each of which uses separate Huffman codes. In addition, the LZX format includes repeat offset codes, the option to entropy-encode the bottom three bits of match offsets, and x86 machine code preprocessing. These characteristics allow "binary" data to be compressed better with LZX than with DEFLATE. Other improvements over DEFLATE are: no coding space is wasted on an end-of-block symbol; and the main alphabet captures length-offset correlation.
Example uses of LZX:
wimlib's LZX compressor supports different compression levels (the default is level 50):
Normal/slow (level ≥ 35): "Near-optimal" parsing with matches found using BT4 matchfinder (binary trees with 4 bytes hashing, plus tables with 3 and 2 bytes hashing), along with repeat offset match searching. The compression ratio is better than that of Microsoft's LZX compressor, while still being significantly faster. Because this is the default compression mode of wimlib-imagex, this is the most carefully engineered compressor available in wimlib. However, the LZX format does not allow as high a compression ratio as LZMS if a sufficiently large dictionary is used.
The compressor starts by preprocessing the input buffer. Then, it advances through the buffer and finds LZ77 matches. For each byte position, the 2, 3, and 4-byte sequences beginning at that position are hashed using a multiplicative hash function. The 2 and 3-byte hashes identify entries in the hash2 and hash3 tables, respectively. If either entry contains a position value, that position is checked for a length 2 or 3 match with the current position, respectively; then the entry is replaced with the current position. Assuming no hash collisions, this finds the closest length 2 and length 3 matches. Then, the 4-byte hash identifies a binary tree containing prior sequences (or positions). The sequence beginning with the 4-byte sub-sequence is searched for in the tree, and matches are reported along the search path. At the same time, the binary tree is re-rooted at the current sequence, which ensures that small offsets remain close to the root of the tree.
At some point, the compressor will decide it is time to end the current block. This happens when either the end of the input is reached, a maximum block size is reached, or when a heuristic determines that the type of data being compressed has changed significantly. Once this decision is made, the compressor runs a minimum-cost path graph search algorithm to find a concise compressed representation of the block. The (conceptual) graph being searched consists of a node per uncompressed byte and an edge per match or literal which can represent the sequence of uncompressed data between the two connected nodes. The cost of each edge is the number of bits required to represent that match or literal. A path through this graph therefore represents a valid compressed representation of the block's data with some cost, and we want the one with the minimum cost (or at least a low cost). The minimum-cost path algorithm can be thought of as similar to Dijkstra's algorithm, but it does not need to be as general (no priority queue is needed) because the graph is already topologically sorted by virtue of nodes being associated with byte positions.
At the default compression level (50), the minimum-cost path algorithm is actually run twice. The first pass uses edge costs derived from a mix of static default values and statistics collected about the block, whereas the second pass uses edge costs derived from the Huffman codes that were computed after the first pass. At higher compression levels, more than two passes are used for greater accuracy. This is effective in practice but is not optimal because the final Huffman codeword lengths cannot be predicted exactly, and more generally it is unknown which codeword lengths would actually produce the optimal solution.
Repeat offset matches are handled specially. Because the state of the recent offsets queue cannot be predicted in advance, repeat offset matches are searched for during the graph search step, not during the main matchfinding step. This also implies that the graph search must proceed in the forwards direction rather than in the backwards direction, since some edges aren't known in advance.
The compressor uses various heuristics to avoid an exhaustive search. For example, it only considers the smallest offset at which each distinct match length is available (e.g. if there is a length 3 match at offsets 100 and 1000, then it will only consider the one at offset 100). The compressor is also subject to cut-offs in cases where there are too many matches or the matches are very long.
Once the compressor has chosen the match/literal sequence for each block, it outputs the actual compressed bits, using Huffman encoding when required.
LZMS is an undocumented compression format that Microsoft released in 2012 or 2013. It perhaps could be described as Microsoft's answer to LZMA (i.e. the format used in 7z and xz files), although LZMS usually produces a worse compression ratio than LZMA. Like LZMA, LZMS is an LZ77-based algorithm. It achieves a relatively high compression ratio by relying on a large LZ77 dictionary size (up to 67,108,864 bytes) and statistically modelling the LZ77 stream of literals and matches. Unlike LZMA but like some LZMA competitors such as LZHAM, LZMS uses Huffman coding in addition to the more concise arithmetic coding, presumably to make decompression faster. The Huffman codes are rebuilt periodically and are not stored with the compressed data. The format includes a preprocessing step for x86 and x86_64 machine code. It does not include a "delta" filter for multimedia data but rather allows a special "delta" match type in addition to the traditional LZ77 match type.
Example uses of LZMS:
wimlib's LZMS compressor supports different compression levels but is primarily optimized for the default level (50). Similar to commonly used LZMA compressors, wimlib's LZMS compressor performs a heuristic graph search over small regions of the uncompressed data to choose a concise compressed representation, considering many different possibilities and choosing the one which requires the fewest bits. LZ77-style matches are found using a string search algorithm which handles very large buffers (tens or hundreds of megabytes in size) relatively efficiently. First, the string search algorithm builds the suffix array of the input data using libdivsufsort by Yuta Mori. Second, the string search algorithm uses the suffix array to build a compact representation of the lcp-interval tree. Third, for each input buffer position in sequential order, the string search algorithm reports a match for each lcp-interval that contains that position's rank, was previously visited, and has the greatest lcp value among all lcp-intervals under consideration which were last visited equally recently. The reported match length is the lcp value of the lcp-interval, and the reported match offset is the distance to the position at which that lcp-interval was last visited. More information can be found in the source code, Kasai et al (2001), Abouelhoda et al (2004), and Chen et al (2008). LZMS "Delta" matches are found using simpler data structure (a hash table).
As mentioned, wimlib's compressors for XPRESS, LZX, and LZMS usually outperform and outcompress their Microsoft equivalents. In addition, more compression settings are provided. Although results will vary depending on the data being compressed, the table below shows results for a common use case: creating an x86 Windows PE image ("boot.wim"). Each row shows the compression type, the size of the resulting WIM archive in bytes, and the time it took to create the file. When possible, the results with the Microsoft equivalent are included.
|Compression||wimlib (v1.10.0)||WIMGAPI (Windows 10)|
|LZMS (solid) ||88,114,666 in 122.2s||88,772,520 in 201.4s|
|LZMS (non-solid) ||116,147,378 in 60.0s||N/A|
|LZX (slow) ||125,514,977 in 48.4s||N/A|
|LZX (normal) ||125,929,467 in 27.5s||127,295,034 in 45.9s|
|LZX (quick) ||129,997,269 in 4.5s||N/A|
|XPRESS (slow) ||135,147,924 in 20.8s||N/A|
|XPRESS ||137,953,928 in 3.6s||140,459,585 in 8.5s|
|"WIMBoot" (slow) ||165,005,928 in 18.5s||N/A|
|"WIMBoot" ||166,893,119 in 3.7s||169,111,375 in 10.5s|
|None ||361,315,118 in 1.4s||361,315,144 in 5.4s|
wimlib-imagex's --compress option also accepts the "fast", "maximum", and "recovery" aliases for XPRESS, LZX, and LZMS, respectively.
The compression ratio provided by wimlib is also competitive with commonly used archive formats. Below are file sizes that result when the Canterbury corpus is compressed with wimlib (v1.10.0), WIMGAPI (Windows 10), and some other formats/programs:
|7z (7-zip, -9)||483,239|
|7z (7-zip, default)||484,700|
|tar.xz (xz, -9)||486,904|
|tar.xz (xz, default)||486,916|
|WIM (wimlib, LZMS solid)||515,800|
|WIM (WIMGAPI, LZMS solid)||521,366|
|tar.bz2 (bzip, -9)||565,008|
|tar.bz2 (bzip, default)||565,008|
|WIM (wimlib, LZMS non-solid)||581,046|
|WIM (wimlib, LZX slow)||618,766|
|WIM (wimlib, LZX normal)||621,898|
|WIM (WIMGAPI, LZX)||651,866|
|WIM (wimlib, LZX quick)||686,420|
|ZIP (Info-ZIP, -9)||732,297|
|tar.gz (gzip, -9)||733,971|
|ZIP (Info-ZIP, default)||735,334|
|tar.gz (gzip, default)||738,796|
|WIM (wimlib, XPRESS)||787,356|
|WIM (WIMGAPI, XPRESS)||825,536|
|WIM (wimlib, None)||2,814,216|
|WIM (WIMGAPI, None)||2,814,254|
WIM does even better on directory trees containing duplicate files, which the Canterbury corpus doesn't have.
wimlib.net (website) Copyright © 2015-2021 Eric Biggers
wimlib (software) Copyright © 2012-2021 Eric Biggers