Compression algorithms

wimlib supports compression and decompression in all of the compression formats known to be used in WIM archives:

wimlib's compressors for these formats usually outperform and outcompress their closed-source Microsoft equivalents. See the benchmarks.

Since these formats are not well known outside of Microsoft, some extra information is provided below.

wimlib's XPRESS and LZX decompressors are also reused in the NTFS-3G system compression plugin by the same author, which extends NTFS-3G with support for reading files compressed with the "system compression" feature Microsoft introduced in Windows 10.

XPRESS

The Huffman variant of the XPRESS compression format uses LZ77-style dictionary compression combined with Huffman coding. It is designed for fast compression and decompression with a small dictionary size (typically 4096 to 65,536 bytes). The uncompressed data is represented as a sequence of literals and matches. A literal is a single byte, whereas a match consists of a (length, offset) pair which tells the decompressor to copy 3 or more bytes from data it has already decompressed. A single alphabet of 512 symbols combines literals (256 possible values) and match headers (256 more possible values) and is Huffman-coded.

Example uses of XPRESS:

wimlib's XPRESS compressor supports different compression levels (the default is level 50):

LZX

The LZX format is basically a slightly improved version of the more commonly used DEFLATE format (the format used in ZIP files, and by zlib and gzip). It combines LZ77-style dictionary compression with Huffman coding and is intended to be used with a small-to-medium dictionary size (32,768 to 2,097,152 bytes). Uncompressed data is represented a sequence of literals and matches. Literals and match headers are combined into a single Huffman-coded "main alphabet". The lengths of long matches are Huffman-coded using the "length alphabet". The compressor can choose to divide the input into multiple "blocks", each of which uses separate Huffman codes. In addition, the LZX format includes repeat offset codes, the option to entropy-encode the bottom three bits of match offsets, and x86 machine code preprocessing. These characteristics allow "binary" data to be compressed better with LZX than with DEFLATE. Other improvements over DEFLATE are: no coding space is wasted on an end-of-block symbol; and the main alphabet captures length-offset correlation.

Example uses of LZX:

wimlib's LZX compressor supports different compression levels (the default is level 50):

LZMS

LZMS is an undocumented compression format that Microsoft released in 2012 or 2013. It perhaps could be described as Microsoft's answer to LZMA (i.e. the format used in 7z and xz files), although LZMS usually produces a worse compression ratio than LZMA. Like LZMA, LZMS is an LZ77-based algorithm. It achieves a relatively high compression ratio by relying on a large LZ77 dictionary size (up to 67,108,864 bytes) and statistically modelling the LZ77 stream of literals and matches. Unlike LZMA but like some LZMA competitors such as LZHAM, LZMS uses Huffman coding in addition to the more concise arithmetic coding, presumably to make decompression faster. The Huffman codes are rebuilt periodically and are not stored with the compressed data. The format includes a preprocessing step for x86 and x86_64 machine code. It does not include a "delta" filter for multimedia data but rather allows a special "delta" match type in addition to the traditional LZ77 match type.

Example uses of LZMS:

wimlib's LZMS compressor supports different compression levels but is primarily optimized for the default level (50). Similar to commonly used LZMA compressors, wimlib's LZMS compressor performs a heuristic graph search over small regions of the uncompressed data to choose a concise compressed representation, considering many different possibilities and choosing the one which requires the fewest bits. LZ77-style matches are found using a string search algorithm which handles very large buffers (tens or hundreds of megabytes in size) relatively efficiently. First, the string search algorithm builds the suffix array of the input data using libdivsufsort by Yuta Mori. Second, the string search algorithm uses the suffix array to build a compact representation of the lcp-interval tree. Third, for each input buffer position in sequential order, the string search algorithm reports a match for each lcp-interval that contains that position's rank, was previously visited, and has the greatest lcp value among all lcp-intervals under consideration which were last visited equally recently. The reported match length is the lcp value of the lcp-interval, and the reported match offset is the distance to the position at which that lcp-interval was last visited. More information can be found in the source code, Kasai et al (2001), Abouelhoda et al (2004), and Chen et al (2008). LZMS "Delta" matches are found using simpler data structure (a hash table).

Benchmarks

As mentioned, wimlib's compressors for XPRESS, LZX, and LZMS usually outperform and outcompress their Microsoft equivalents. In addition, more compression settings are provided. Although results will vary depending on the data being compressed, the table below shows results for a common use case: creating an x86 Windows PE image ("boot.wim"). Each row shows the compression type, the size of the resulting WIM archive in bytes, and the time it took to create the file. When possible, the results with the Microsoft equivalent are included.

Compressionwimlib (v1.10.0)WIMGAPI (Windows 10)
LZMS (solid) [1] 88,114,666 in 122.2s88,772,520 in 201.4s
LZMS (non-solid) [2] 116,147,378 in 60.0s N/A
LZX (slow) [3] 125,514,977 in 48.4s N/A
LZX (normal) [4] 125,929,467 in 27.5s127,295,034 in 45.9s
LZX (quick) [5] 129,997,269 in 4.5s N/A
XPRESS (slow) [6] 135,147,924 in 20.8s N/A
XPRESS [7] 137,953,928 in 3.6s140,459,585 in 8.5s
"WIMBoot" (slow) [8] 165,005,928 in 18.5s N/A
"WIMBoot" [9] 166,893,119 in 3.7s169,111,375 in 10.5s
None [10] 361,315,118 in 1.4s361,315,144 in 5.4s

Notes:

  1. --solid for wimlib-imagex; /compress:recovery for DISM. Compression chunk size in solid resources defaults to 67108864 bytes in both cases.
  2. --compress=LZMS for wimlib-imagex. Compression chunk size defaults to 131072 bytes.
  3. --compress=LZX:100 for wimlib-imagex. Compression chunk size defaults to 32768 bytes.
  4. --compress=LZX:50 or --compress=LZX or no option for wimlib-imagex; /compress:maximum for DISM. Compression chunk size defaults to 32768 bytes in both cases.
  5. --compress=LZX:20 for wimlib-imagex. Compression chunk size defaults to 32768 bytes.
  6. --compress=XPRESS:80 for wimlib-imagex. Compression chunk size defaults to 32768 bytes.
  7. --compress=XPRESS:50 or --compress=XPRESS for wimlib-imagex; /compress:fast for DISM. Compression chunk size defaults to 32768 bytes in both cases.
  8. --wimboot --compress=XPRESS:80 for wimlib-imagex. Same format as [9], but trying harder to get a good compression ratio.
  9. --wimboot for wimlib-imagex; /wimboot for DISM. Equivalent to --compress=XPRESS --chunk-size=4096
  10. --compress=none for wimlib-imagex; /compress:none for DISM.

wimlib-imagex's --compress option also accepts the "fast", "maximum", and "recovery" aliases for XPRESS, LZX, and LZMS, respectively.

Testing environment:

The compression ratio provided by wimlib is also competitive with commonly used archive formats. Below are file sizes that result when the Canterbury corpus is compressed with wimlib (v1.10.0), WIMGAPI (Windows 10), and some other formats/programs:

Format Size (bytes)
7z (7-zip, -9) 483,239
7z (7-zip, default) 484,700
tar.xz (xz, -9) 486,904
tar.xz (xz, default) 486,916
WIM (wimlib, LZMS solid) 515,800
WIM (WIMGAPI, LZMS solid) 521,366
tar.bz2 (bzip, -9) 565,008
tar.bz2 (bzip, default) 565,008
WIM (wimlib, LZMS non-solid)581,046
WIM (wimlib, LZX slow) 618,766
WIM (wimlib, LZX normal) 621,898
WIM (WIMGAPI, LZX) 651,866
WIM (wimlib, LZX quick) 686,420
ZIP (Info-ZIP, -9) 732,297
tar.gz (gzip, -9) 733,971
ZIP (Info-ZIP, default) 735,334
tar.gz (gzip, default) 738,796
WIM (wimlib, XPRESS) 787,356
WIM (WIMGAPI, XPRESS) 825,536
WIM (wimlib, None) 2,814,216
WIM (WIMGAPI, None) 2,814,254
tar 2,826,240

WIM does even better on directory trees containing duplicate files, which the Canterbury corpus doesn't have.