- * Given a position in the window, we want to find LZ77-style "matches":
- * sequences of bytes at previous window positions that are identical to the
- * bytes beginning at the given position. With LZX, the minimum
- * match length is 2 and the maximum match length is 257. The only restriction
- * on offsets is that LZX does not allow the last 2 bytes of the window to match
- * the beginning of the window.
- *
- * Depending on how good a compression ratio we want (see the "Match-choosing"
- * section), we may want to find: (a) all matches, or (b) just the longest
- * match, or (c) just some "promising" matches that we are able to find quickly,
- * or (d) just the longest match that we're able to find quickly. Below we
- * introduce the match-finding methods that the code currently uses or has
- * previously used:
- *
- * - Hash chains. Maintain a table that maps hash codes, computed from
- * fixed-length byte sequences, to linked lists containing previous window
- * positions. To search for matches, compute the hash for the current
- * position in the window and search the appropriate hash chain. When
- * advancing to the next position, prepend the current position to the
- * appropriate hash list. This is a good approach for producing matches with
- * strategy (d) and is useful for fast compression. Therefore, we provide an
- * option to use this method for LZX compression; see lz_hash.c for the
- * implementation and the simplified sketch following this list.
- *
- * - Binary trees. Similar to hash chains, but each hash bucket contains a
- * binary tree of previous window positions rather than a linked list. This
- * is a good approach for producing matches with strategy (c) and is useful for
- * achieving a good compression ratio. Therefore, we provide an option to use
- * this method; see lz_bt.c for the implementation.
- *
- * - Suffix arrays. This code previously used this method to produce matches
- * with strategy (c), but I've dropped it because it was slower than the binary
- * trees approach, used more memory, and did not improve the compression ratio
- * enough to compensate. Download wimlib v1.6.2 if you want the code.
- * However, the suffix array method was basically as follows. Build the
- * suffix array for the entire window. The suffix array contains each
- * possible window position, sorted by the lexicographic order of the strings
- * that begin at those positions. Find the matches at a given position by
- * searching the suffix array outwards, in both directions, from the suffix
- * array slot for that position. This produces the longest matches first, but
- * "matches" that actually occur at later positions in the window must be
- * skipped. To do this skipping, use an auxiliary array with dynamically
- * constructed linked lists. Also, use the inverse suffix array to quickly
- * find the suffix array slot for a given position without doing a binary
- * search.
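- *
- * As promised above, here is a simplified sketch of the hash-chain method.
- * It is illustrative only, not the lz_hash.c code: the table sizes, the
- * 3-byte hash, and the names (NO_POS, WINDOW_SIZE, NICE_LEN, hash_tab,
- * prev_tab) are all hypothetical, and table initialization to NO_POS and
- * end-of-buffer handling are omitted:
- *
- *     #define NO_POS     UINT32_MAX
- *     #define HASH_ORDER 15
- *     #define NICE_LEN   32  // stop searching when a match reaches this
- *
- *     static uint32_t hash_tab[1 << HASH_ORDER]; // newest pos per hash
- *     static uint32_t prev_tab[WINDOW_SIZE];     // next-older pos links
- *
- *     static inline uint32_t
- *     hash3(const uint8_t *p)    // hash the 3 bytes beginning at p
- *     {
- *         uint32_t v = ((uint32_t)p[0] << 16) | (p[1] << 8) | p[2];
- *         return (v * 2654435761u) >> (32 - HASH_ORDER);
- *     }
- *
- *     // Search pos's hash chain for the longest match, visiting at most
- *     // max_depth previous positions and capping lengths at max_len.
- *     static unsigned
- *     longest_match(const uint8_t *win, uint32_t pos, unsigned max_len,
- *                   unsigned max_depth, uint32_t *offset_ret)
- *     {
- *         uint32_t cur = hash_tab[hash3(&win[pos])];
- *         unsigned best_len = 0;
- *
- *         for (unsigned d = 0; cur != NO_POS && d < max_depth; d++) {
- *             unsigned len = 0;
- *             while (len < max_len && win[cur + len] == win[pos + len])
- *                 len++;
- *             if (len > best_len) {
- *                 best_len = len;
- *                 *offset_ret = pos - cur;
- *                 if (len >= NICE_LEN)
- *                     break;    // good enough; stop early
- *             }
- *             cur = prev_tab[cur];
- *         }
- *         return best_len;
- *     }
- *
- *     // When advancing past 'pos', prepend it to its hash chain so that
- *     // later positions can match against it.
- *     static void
- *     insert_pos(const uint8_t *win, uint32_t pos)
- *     {
- *         uint32_t h = hash3(&win[pos]);
- *         prev_tab[pos] = hash_tab[h];
- *         hash_tab[h] = pos;
- *     }
- *
- * The binary tree method keeps the same hashing front end but replaces each
- * such linked list with a binary tree of positions sorted by the strings
- * beginning there, so the search can descend lexicographically instead of
- * scanning every previous position linearly.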
- *
- * ----------------------------------------------------------------------------
- *
- * Match-choosing
- *
- * Usually, choosing the longest match is best because it encodes the most data
- * in that one item. However, sometimes the longest match is not optimal
- * because (a) choosing a long match now might prevent using an even longer
- * match later, or (b) more generally, what we actually care about is the number
- * of bits it will ultimately take to output each match or literal, which is
- * dependent on the entropy encoding used by the underlying
- * compression format. Consequently, a longer match usually, but not always,
- * takes fewer bits to encode than multiple shorter matches or literals that
- * cover the same data.
- *
- * This problem of choosing the truly best match/literal sequence is probably
- * impossible to solve efficiently when combined with entropy encoding. If we
- * knew how many bits it takes to output each match/literal, then we could
- * choose the optimal sequence using shortest-path search a la Dijkstra's
- * algorithm. However, with entropy encoding, the chosen match/literal sequence
- * affects its own encoding. Therefore, we can't know how many bits it will
- * take to actually output any one match or literal until we have actually
- * chosen the full sequence of matches and literals.
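- *
- * To see why known costs would make this tractable, consider the following
- * hedged sketch of the resulting forward dynamic program. Everything named
- * here (literal_cost(), match_cost(), get_matches(), MAX_MATCHES) is an
- * assumed interface for illustration, not code from this file:
- *
- *     struct lz_match { uint32_t len; uint32_t offset; };
- *
- *     // cost[i] = fewest bits needed to encode win[0..i-1];
- *     // item_len[i] = length of the last item on that path (1 = literal).
- *     static void
- *     min_cost_parse(const uint8_t *win, uint32_t n,
- *                    uint32_t *cost, uint32_t *item_len)
- *     {
- *         cost[0] = 0;
- *         for (uint32_t i = 1; i <= n; i++)
- *             cost[i] = UINT32_MAX;
- *
- *         for (uint32_t i = 0; i < n; i++) {
- *             // Literal edge: covers exactly one byte.
- *             uint32_t c = cost[i] + literal_cost(win[i]);
- *             if (c < cost[i + 1]) {
- *                 cost[i + 1] = c;
- *                 item_len[i + 1] = 1;
- *             }
- *             // Match edges: one per match reported at position i.
- *             struct lz_match m[MAX_MATCHES];
- *             unsigned num = get_matches(win, i, m);
- *             for (unsigned j = 0; j < num; j++) {
- *                 c = cost[i] + match_cost(&m[j]);
- *                 if (c < cost[i + m[j].len]) {
- *                     cost[i + m[j].len] = c;
- *                     item_len[i + m[j].len] = m[j].len;
- *                 }
- *             }
- *         }
- *         // Walk item_len[] backwards from n to recover the parse.
- *     }
- *
- * Since every edge goes forward and positions are relaxed in increasing
- * order, this single left-to-right pass gives the same result as Dijkstra's
- * algorithm would, without needing a priority queue.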
- *
- * Notwithstanding the entropy encoding problem, we also aren't guaranteed to
- * choose the optimal match/literal sequence unless the match-finder (see
- * section "Match-finder") provides the match-chooser with all possible matches
- * at each position. However, this is not computationally efficient. For
- * example, there might be many matches of the same length, and usually (but not
- * always) the best choice is the one with the smallest offset. So in practice,
- * it's fine to only consider the smallest offset for a given match length at a
- * given position. (Actually, for LZX, it's also worth considering repeat
- * offsets.)
- *
- * In addition, as mentioned earlier, in LZX we have the choice of using
- * multiple blocks, each of which resets the Huffman codes. This expands the
- * search space even further. Therefore, to simplify the problem, we currently
- * don't attempt to actually choose the LZX blocks based on the data.
- * Instead, we just divide the data into fixed-size blocks of LZX_DIV_BLOCK_SIZE
- * bytes each, and always use verbatim or aligned blocks (never uncompressed).
- * A previous version of this code recursively split the input data into
- * equal-sized blocks, up to a maximum depth, and chose the lowest-cost block
- * divisions. However, this made compression much slower and did not actually
- * help very much. It remains an open question whether a sufficiently fast and
- * useful block-splitting algorithm is possible for LZX. Essentially the same
- * problem also applies to DEFLATE. The Microsoft LZX compressor does appear
- * to do block splitting, although I don't know specifically how fast or
- * useful it is.
- *
- * Now, back to the entropy encoding problem. The "solution" is to use an
- * iterative approach to compute a good, but not necessarily optimal,
- * match/literal sequence. Start with a fixed assignment of symbol costs and
- * choose an "optimal" match/literal sequence based on those costs, using
- * shortest-path search a la Dijkstra's algorithm. Then, for each iteration of
- * the optimization, update the costs based on the entropy encoding of the
- * current match/literal sequence, then choose a new match/literal sequence
- * based on the updated costs. Usually, the actual cost to output the current
- * match/literal sequence will decrease in each iteration until it converges on
- * a fixed point. This result may not be the truly optimal match/literal
- * sequence, but it usually is much better than one chosen by doing a "greedy"
- * parse where we always choose the longest match.
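- *
- * In outline, the iteration might look like the following sketch, reusing
- * the hypothetical min_cost_parse() from above and assuming that
- * literal_cost() and match_cost() read from 'costs'; the other helpers are
- * likewise assumptions, not functions from this file:
- *
- *     set_default_costs(&costs);
- *     for (unsigned pass = 0; pass < num_passes; pass++) {
- *         // Choose a parse that is optimal under the current costs.
- *         min_cost_parse(win, n, cost, item_len);
- *         // Tally how often each symbol occurs in that parse, build
- *         // Huffman codeword lengths from the tallies, and adopt those
- *         // lengths as the bit costs for the next pass.
- *         tally_parse(item_len, n, &freqs);
- *         make_huffman_code_lengths(&freqs, &costs);
- *     }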
- *
- * An alternative to both greedy parsing and iterative, near-optimal parsing is
- * "lazy" parsing. Briefly, "lazy" parsing considers just the longest match at
- * each position, but it waits to choose that match until it has also examined
- * the next position. This is actually a useful approach; it's used by zlib,
- * for example. Therefore, for fast compression we combine lazy parsing with
- * the hash chain match-finder. For normal/high compression we combine
- * near-optimal parsing with the binary tree match-finder.
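- *
- * Here is a hedged sketch of the lazy heuristic, reusing the hypothetical
- * longest_match() and insert_pos() from the hash-chain sketch above;
- * output_literal() and output_match() are also assumed, the peeked match is
- * recomputed rather than cached for simplicity, and end-of-buffer handling
- * is again omitted:
- *
- *     uint32_t i = 0;
- *     while (i < n) {
- *         uint32_t off, next_off;
- *         unsigned len = longest_match(win, i, max_len, depth, &off);
- *         insert_pos(win, i);
- *         if (len < 2) {         // below LZX's minimum match length
- *             output_literal(win[i]);
- *             i++;
- *             continue;
- *         }
- *         // Peek at the next position; if a strictly longer match begins
- *         // there, emit a literal now and re-decide at i + 1.
- *         unsigned next_len =
- *             longest_match(win, i + 1, max_len, depth, &next_off);
- *         if (next_len > len) {
- *             output_literal(win[i]);
- *             i++;
- *         } else {
- *             output_match(len, off);
- *             // A real implementation would also insert positions
- *             // i + 1 through i + len - 1 into the hash chains here.
- *             i += len;
- *         }
- *     }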
- *
- * Anyway, if you've read through this comment, you should hopefully have a
- * better idea of why things are done in a certain way in this LZX compressor,
- * as well as in other compressors for LZ77-based formats (including third-party
- * ones). In my opinion, the phrase "compression algorithm" is often mis-used
- * in place of "compression format", since there can be many different
- * algorithms that all generate compressed data in the same format. The
- * challenge is to design an algorithm that is efficient but still gives a good
- * compression ratio.