- * Algorithms
- * ==========
- *
- * There are actually two distinct overall algorithms implemented here. We
- * shall refer to them as the "slow" algorithm and the "fast" algorithm. The
- * "slow" algorithm spends more time compressing to achieve a higher compression
- * ratio compared to the "fast" algorithm. More details are presented below.
- *
- * Slow algorithm
- * --------------
- *
- * The "slow" algorithm to generate LZX-compressed data is roughly as follows:
- *
- * 1. Preprocess the input data to translate the targets of x86 call
- * instructions to absolute offsets.
- *
- * 2. Build the suffix array and inverse suffix array for the input data. The
- * suffix array contains the indices of all suffixes of the input data,
- * sorted lexicographically by the corresponding suffixes.  The "position" of
- * a suffix is the index of that suffix in the original string, whereas the
- * "rank" of a suffix is the index at which that suffix's position is found
- * in the suffix array.
- *
- * 3. Build the longest common prefix array corresponding to the suffix array.
- *
- * 4. For each suffix, find the highest lower ranked suffix that has a lower
- * position, the lowest higher ranked suffix that has a lower position, and
- * the length of the common prefix shared between each. This information is
- * later used to link suffix ranks into a doubly-linked list for searching
- * the suffix array.
- *
- * 5. Set a default cost model for matches/literals.
- *
- * 6. Determine the lowest cost sequence of LZ77 matches ((offset, length)
- * pairs) and literal bytes to divide the input into. Raw match-finding is
- * done by searching the suffix array using a linked list to avoid
- * considering any suffixes that start after the current position. Each run
- * of the match-finder returns the approximate lowest-cost longest match as
- * well as any shorter matches that have even lower approximate costs. Each
- * such run also adds the suffix rank of the current position into the linked
- * list being used to search the suffix array. Parsing, or match-choosing,
- * is solved as a minimum-cost path problem using a forward "optimal parsing"
- * algorithm based on the Deflate encoder from 7-Zip. This algorithm moves
- * forward calculating the minimum cost to reach each byte until either a
- * very long match is found or until a position is found at which no matches
- * start or overlap.
- *
- * 7. Build the Huffman codes needed to output the matches/literals.
- *
- * 8. Up to a certain number of iterations, use the resulting Huffman codes to
- * refine a cost model and go back to Step #6 to determine an improved
- * sequence of matches and literals.
- *
- * 9. Output the resulting block using the match/literal sequences and the
- * Huffman codes that were computed for the block.
- *
- * Note: the algorithm does not yet attempt to split the input into multiple LZX
- * blocks, instead using a series of blocks of LZX_DIV_BLOCK_SIZE bytes.
- *
- * Fast algorithm
- * --------------
- *
- * The fast algorithm (and the only one available in wimlib v1.5.1 and earlier)
- * spends much less time on the main bottlenecks of the compression process ---
- * that is, the match finding and match choosing. Matches are found and chosen
- * with hash chains using a greedy parse with one position of look-ahead. No
- * block splitting is done; only compressing the full input into an aligned
- * offset block is considered.
- *
- * API
- * ===
- *
- * The old API (retained for backward compatibility) consists of just one
- * function:
- *
- * wimlib_lzx_compress()
- *
- * The new compressor has more potential parameters and needs more memory, so
- * the new API ties up memory allocations and compression parameters into a
- * context:
- *
- * wimlib_lzx_alloc_context()
- * wimlib_lzx_compress2()
- * wimlib_lzx_free_context()
- * wimlib_lzx_set_default_params()
- *
- * Both wimlib_lzx_compress() and wimlib_lzx_compress2() are designed to
- * compress an in-memory buffer of up to the window size, which can be any power
- * of two between 2^15 and 2^21 inclusively. However, by default, the WIM
- * format uses 2^15, and this is seemingly the only value that is compatible
- * with WIMGAPI. In any case, the window is not a true "sliding window" since
- * no data is ever "slid out" of the window. This is needed for the WIM format,
- * which is designed such that chunks may be randomly accessed.
- *
- * Both wimlib_lzx_compress() and wimlib_lzx_compress2() return 0 if the data
- * could not be compressed to less than the size of the uncompressed data.
- * Again, this is suitable for the WIM format, which stores such data chunks
- * uncompressed.
- *
- * The functions in this LZX compression API are exported from the library,
- * although with the possible exception of wimlib_lzx_set_default_params(), this
- * is only in case other programs happen to have uses for it other than WIM
- * reading/writing as already handled through the rest of the library.
- *
- * Acknowledgments
- * ===============
- *
- * Acknowledgments to several open-source projects and research papers that made
- * it possible to implement this code:
- *
- * - divsufsort (author: Yuta Mori), for the suffix array construction code,
- * located in a separate directory (divsufsort/).
- *
- * - "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its
- * Applications" (Kasai et al. 2001), for the LCP array computation.
- *
- * - "LPF computation revisited" (Crochemore et al. 2009) for the prev and next
- * array computations.
- *
- * - 7-Zip (author: Igor Pavlov) for the algorithm for forward optimal parsing
- * (match-choosing).
- *
- * - zlib (author: Jean-loup Gailly and Mark Adler), for the hash table
- * match-finding algorithm (used in lz77.c).
- *
- * - lzx-compress (author: Matthew T. Russotto), on which some parts of this
- * code were originally based.
+ * ----------------------------------------------------------------------------
+ *
+ * Algorithmic Overview
+ *
+ * At a high level, any implementation of LZX compression must operate as
+ * follows:
+ *
+ * 1. Preprocess the input data to translate the targets of 32-bit x86 call
+ * instructions to absolute offsets. (Actually, this is required for WIM,
+ * but might not be in other places LZX is used.)
+ *
+ * 2. Find a sequence of LZ77-style matches and literal bytes that expands to
+ * the preprocessed data.
+ *
+ * 3. Divide the match/literal sequence into one or more LZX blocks, each of
+ * which may be "uncompressed", "verbatim", or "aligned".
+ *
+ * 4. Output each LZX block.
+ *
+ * Step (1) is fairly straightforward. It requires looking for 0xe8 bytes in
+ * the input data and performing a translation on the 4 bytes following each
+ * one.
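The translation in step (1) can be sketched as follows. This is only an illustration of the relative-to-absolute rewrite: the real LZX/WIM filter has additional rules (a limit on the region that is translated and special handling near the end of the buffer) that are omitted here, and little-endian byte order is assumed.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of e8 preprocessing: convert the 32-bit relative displacement
 * that follows each 0xe8 (call) opcode byte into an absolute offset.
 * The bounds and wrap-around rules of the real LZX filter are omitted. */
static void
e8_translate_sketch(uint8_t *data, size_t size)
{
	for (size_t i = 0; i + 4 < size; i++) {
		if (data[i] != 0xe8)
			continue;
		int32_t rel;
		memcpy(&rel, &data[i + 1], 4);       /* little-endian assumed */
		int32_t abs_off = rel + (int32_t)i;  /* relative -> absolute  */
		memcpy(&data[i + 1], &abs_off, 4);
		i += 4;                              /* skip the displacement */
	}
}
```

Decompression performs the inverse translation on the same byte positions, which is why compressor and decompressor must agree exactly on the rules.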
+ *
+ * Step (4) is complicated, but it is mostly determined by the LZX format. The
+ * only real choice we have is what algorithm to use to build the length-limited
+ * canonical Huffman codes. See lzx_write_all_blocks() for details.
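As an illustration of the canonical-code convention mentioned above, here is the standard length-to-codeword assignment (the same scheme DEFLATE uses). The function name is hypothetical, and the length-limiting step itself (e.g. package-merge or heuristic adjustment of overlong codes) is not shown:

```c
#include <stdint.h>

/* Canonical code assignment (illustrative; the name is hypothetical).
 * Given each symbol's code length, assign codewords so that shorter
 * codes come first and codes of equal length are consecutive in symbol
 * order, as the canonical-Huffman convention requires.  Lengths of 0
 * mean "symbol unused".  Assumes max_len <= 15. */
static void
assign_canonical_codes(const uint8_t *lens, uint32_t *codes,
		       unsigned num_syms, unsigned max_len)
{
	uint32_t len_counts[16] = {0};
	uint32_t next_code[16];
	uint32_t code = 0;

	for (unsigned i = 0; i < num_syms; i++)
		len_counts[lens[i]]++;
	len_counts[0] = 0;
	/* First codeword of each length, in increasing length order. */
	for (unsigned len = 1; len <= max_len; len++) {
		code = (code + len_counts[len - 1]) << 1;
		next_code[len] = code;
	}
	for (unsigned i = 0; i < num_syms; i++)
		codes[i] = lens[i] ? next_code[lens[i]]++ : 0;
}
```

Because the codes are fully determined by the lengths, only the lengths need to be transmitted in the block header.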
+ *
+ * That leaves steps (2) and (3) as where all the hard stuff happens. Focusing
+ * on step (2), we need to do LZ77-style parsing on the input data, or "window",
+ * to divide it into a sequence of matches and literals. Each position in the
+ * window might have multiple matches associated with it, and we need to choose
+ * which one, if any, to actually use. Therefore, the problem can really be
+ * divided into two areas of concern: (a) finding matches at a given position,
+ * which we shall call "match-finding", and (b) choosing whether to use a
+ * match or a literal at a given position, and if using a match, which one (if
+ * there is more than one available). We shall call this "match-choosing". We
+ * first consider match-finding, then match-choosing.
+ *
+ * ----------------------------------------------------------------------------
+ *
+ * Match-finding
+ *
+ * Given a position in the window, we want to find LZ77-style "matches" with
+ * that position at previous positions in the window. With LZX, the minimum
+ * match length is 2 and the maximum match length is 257. The only restriction
+ * on offsets is that LZX does not allow the last 2 bytes of the window to match
+ * the beginning of the window.
+ *
+ * Depending on how good a compression ratio we want (see the "Match-choosing"
+ * section), we may want to find: (a) all matches, or (b) just the longest
+ * match, or (c) just some "promising" matches that we are able to find quickly,
+ * or (d) just the longest match that we're able to find quickly. Below we
+ * introduce the match-finding methods that the code currently uses or has
+ * previously used:
+ *
+ * - Hash chains. Maintain a table that maps hash codes, computed from
+ * fixed-length byte sequences, to linked lists containing previous window
+ * positions. To search for matches, compute the hash for the current
+ * position in the window and search the appropriate hash chain. When
+ * advancing to the next position, prepend the current position to the
+ * appropriate hash list. This is a good approach for producing matches with
+ * strategy (d) and is useful for fast compression.  Therefore, we provide an
+ * option to use this method for LZX compression. See lz_hash.c for the
+ * implementation.
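A minimal sketch of the hash-chain idea described above (names, sizes, and the hash function are illustrative, not wimlib's actual lz_hash.c code):

```c
#include <stdint.h>
#include <stddef.h>

#define HC_HASH_BITS 13
#define HC_HASH_SIZE (1 << HC_HASH_BITS)
#define HC_NO_POS    ((uint32_t)-1)

/* hash_tab[h] holds the most recent window position whose 3-byte prefix
 * hashed to h; prev_tab[pos] links each position to the previous one in
 * the same chain.  hash_tab entries must be initialized to HC_NO_POS. */
struct hc_matchfinder {
	uint32_t hash_tab[HC_HASH_SIZE];
	uint32_t prev_tab[1 << 15];  /* one entry per window position */
};

static uint32_t
hc_hash3(const uint8_t *p)
{
	/* Multiplicative hash of the next 3 bytes. */
	uint32_t v = p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
	return (v * 0x9E3779B1u) >> (32 - HC_HASH_BITS);
}

/* Return the length of the longest match found for position 'pos',
 * following at most 'max_search' chain links, then insert 'pos' into
 * its chain so that later positions can match against it. */
static uint32_t
hc_longest_match(struct hc_matchfinder *mf, const uint8_t *win,
		 uint32_t pos, uint32_t win_size, unsigned max_search,
		 uint32_t *offset_ret)
{
	uint32_t best_len = 0;

	if (pos + 3 > win_size)
		return 0;
	uint32_t h = hc_hash3(&win[pos]);
	uint32_t cur = mf->hash_tab[h];
	while (cur != HC_NO_POS && max_search--) {
		uint32_t len = 0, limit = win_size - pos;
		while (len < limit && win[cur + len] == win[pos + len])
			len++;
		if (len > best_len) {
			best_len = len;
			*offset_ret = pos - cur;
		}
		cur = mf->prev_tab[cur];
	}
	mf->prev_tab[pos] = mf->hash_tab[h];
	mf->hash_tab[h] = pos;
	return best_len;
}
```

Capping 'max_search' is what makes this strategy (d): the chain walk stops early, trading match quality for speed.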
+ *
+ * - Binary trees. Similar to hash chains, but each hash bucket contains a
+ * binary tree of previous window positions rather than a linked list. This
+ * is a good approach for producing matches with strategy (c) and is useful for
+ * achieving a good compression ratio. Therefore, we provide an option to use
+ * this method; see lz_bt.c for the implementation.
+ *
+ * - Suffix arrays. This code previously used this method to produce matches
+ * with strategy (c), but I've dropped it because it was slower than the binary
+ * trees approach, used more memory, and did not improve the compression ratio
+ * enough to compensate. Download wimlib v1.6.2 if you want the code.
+ * However, the suffix array method was basically as follows. Build the
+ * suffix array for the entire window. The suffix array contains each
+ * possible window position, sorted by the lexicographic order of the strings
+ * that begin at those positions. Find the matches at a given position by
+ * searching the suffix array outwards, in both directions, from the suffix
+ * array slot for that position. This produces the longest matches first, but
+ * "matches" that actually occur at later positions in the window must be
+ * skipped. To do this skipping, use an auxiliary array with dynamically
+ * constructed linked lists. Also, use the inverse suffix array to quickly
+ * find the suffix array slot for a given position without doing a binary
+ * search.
+ *
+ * ----------------------------------------------------------------------------
+ *
+ * Match-choosing
+ *
+ * Usually, choosing the longest match is best because it encodes the most data
+ * in that one item. However, sometimes the longest match is not optimal
+ * because (a) choosing a long match now might prevent using an even longer
+ * match later, or (b) more generally, what we actually care about is the number
+ * of bits it will ultimately take to output each match or literal, which is
+ * dependent on the entropy encoding used by the underlying
+ * compression format. Consequently, a longer match usually, but not always,
+ * takes fewer bits to encode than multiple shorter matches or literals that
+ * cover the same data.
+ *
+ * This problem of choosing the truly best match/literal sequence is probably
+ * impossible to solve efficiently when combined with entropy encoding. If we
+ * knew how many bits it takes to output each match/literal, then we could
+ * choose the optimal sequence using shortest-path search a la Dijkstra's
+ * algorithm. However, with entropy encoding, the chosen match/literal sequence
+ * affects its own encoding. Therefore, we can't know how many bits it will
+ * take to actually output any one match or literal until we have actually
+ * chosen the full sequence of matches and literals.
+ *
+ * Notwithstanding the entropy encoding problem, we also aren't guaranteed to
+ * choose the optimal match/literal sequence unless the match-finder (see
+ * section "Match-finding") provides the match-chooser with all possible matches
+ * at each position. However, this is not computationally efficient. For
+ * example, there might be many matches of the same length, and usually (but not
+ * always) the best choice is the one with the smallest offset. So in practice,
+ * it's fine to only consider the smallest offset for a given match length at a
+ * given position. (Actually, for LZX, it's also worth considering repeat
+ * offsets.)
+ *
+ * In addition, as mentioned earlier, in LZX we have the choice of using
+ * multiple blocks, each of which resets the Huffman codes. This expands the
+ * search space even further. Therefore, to simplify the problem, we currently
+ * don't attempt to actually choose the LZX blocks based on the data.
+ * Instead, we just divide the data into fixed-size blocks of LZX_DIV_BLOCK_SIZE
+ * bytes each, and always use verbatim or aligned blocks (never uncompressed).
+ * A previous version of this code recursively split the input data into
+ * equal-sized blocks, up to a maximum depth, and chose the lowest-cost block
+ * divisions. However, this made compression much slower and did not actually
+ * help very much. It remains an open question whether a sufficiently fast and
+ * useful block-splitting algorithm is possible for LZX. Essentially the same
+ * problem also applies to DEFLATE. The Microsoft LZX compressor seemingly does
+ * do block splitting, although I don't know how fast or useful it is,
+ * specifically.
+ *
+ * Now, back to the entropy encoding problem. The "solution" is to use an
+ * iterative approach to compute a good, but not necessarily optimal,
+ * match/literal sequence. Start with a fixed assignment of symbol costs and
+ * choose an "optimal" match/literal sequence based on those costs, using
+ * shortest-path search a la Dijkstra's algorithm.  Then, for each iteration of
+ * the optimization, update the costs based on the entropy encoding of the
+ * current match/literal sequence, then choose a new match/literal sequence
+ * based on the updated costs. Usually, the actual cost to output the current
+ * match/literal sequence will decrease in each iteration until it converges on
+ * a fixed point. This result may not be the truly optimal match/literal
+ * sequence, but it usually is much better than one chosen by doing a "greedy"
+ * parse where we always choose the longest match.
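The shortest-path idea can be illustrated with a toy forward pass. This sketch assumes a fixed cost model and at most one candidate match per position; the real parser considers many match lengths and repeat offsets, and re-runs this pass after re-deriving costs from the Huffman code lengths of the previous pass:

```c
#include <stdint.h>

/* Toy forward shortest-path parse (illustrative, not wimlib's code).
 * cost[i] is the cheapest number of bits needed to encode the first i
 * bytes; prev[i] records where the cheapest edge into i came from, so
 * walking prev[] back from n recovers the match/literal sequence.
 * Assumes n <= 256. */
static uint32_t
optimal_parse_sketch(const uint32_t *lit_cost,   /* bits for literal at i     */
		     const uint32_t *match_len,  /* longest match at i (or 0) */
		     const uint32_t *match_cost, /* bits for that match       */
		     uint32_t n, uint32_t *prev)
{
	uint32_t cost[257];

	for (uint32_t i = 0; i <= n; i++)
		cost[i] = UINT32_MAX;
	cost[0] = 0;
	for (uint32_t i = 0; i < n; i++) {
		if (cost[i] == UINT32_MAX)
			continue;
		/* Relax the literal edge i -> i+1. */
		if (cost[i] + lit_cost[i] < cost[i + 1]) {
			cost[i + 1] = cost[i] + lit_cost[i];
			prev[i + 1] = i;
		}
		/* Relax the match edge i -> i+len, if any. */
		uint32_t len = match_len[i];
		if (len >= 2 && i + len <= n &&
		    cost[i] + match_cost[i] < cost[i + len]) {
			cost[i + len] = cost[i] + match_cost[i];
			prev[i + len] = i;
		}
	}
	return cost[n];
}
```

Repeating this pass with updated costs is the iterative scheme described above: the parse and the cost model take turns refining each other.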
+ *
+ * An alternative to both greedy parsing and iterative, near-optimal parsing is
+ * "lazy" parsing. Briefly, "lazy" parsing considers just the longest match at
+ * each position, but it waits to choose that match until it has also examined
+ * the next position. This is actually a useful approach; it's used by zlib,
+ * for example. Therefore, for fast compression we combine lazy parsing with
+ * the hash chain match-finder.  For normal/high compression we combine
+ * near-optimal parsing with the binary tree match-finder.
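The one-position look-ahead decision at the heart of lazy parsing can be sketched as follows (an illustrative helper, not wimlib's code):

```c
#include <stdint.h>

/* Lazy-parse decision: take the match found at the current position
 * only if the match found at the next position is not longer;
 * otherwise emit a literal now and defer the choice by one position. */
static int
lazy_choose_match(uint32_t cur_len, uint32_t next_len, uint32_t min_len)
{
	if (cur_len < min_len)
		return 0; /* no usable match here: emit a literal */
	if (next_len > cur_len)
		return 0; /* defer: literal now, longer match next */
	return 1;         /* take the current match */
}
```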
+ *
+ * Anyway, if you've read through this comment, you hopefully should have a
+ * better idea of why things are done in a certain way in this LZX compressor,
+ * as well as in other compressors for LZ77-based formats (including third-party
+ * ones). In my opinion, the phrase "compression algorithm" is often mis-used
+ * in place of "compression format", since there can be many different
+ * algorithms that all generate compressed data in the same format. The
+ * challenge is to design an algorithm that is efficient but still gives a good
+ * compression ratio.