- * Since the match-finders are not specific to LZX, I will not explain them in
- * detail here. Instead, see lz_hash_chains.c and lz_binary_trees.c.
- *
- * ----------------------------------------------------------------------------
- *
- * Match-choosing
- *
- * Usually, choosing the longest match is best because it encodes the most data
- * in that one item. However, sometimes the longest match is not optimal
- * because (a) choosing a long match now might prevent using an even longer
- * match later, or (b) more generally, what we actually care about is the number
- * of bits it will ultimately take to output each match or literal, which
- * depends on the entropy encoding used by the underlying compression format.
- * Consequently, a longer match usually, but not always, takes fewer bits to
- * encode than multiple shorter matches or literals that cover the same data.
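- *
- * For example, with made-up costs: suppose a length-100 match would cost 20
- * bits, a literal would cost 9 bits, and a length-150 match beginning one
- * byte later would cost 24 bits. Greedily taking the length-100 match spends
- * 20 bits but covers only 100 bytes, leaving 51 more bytes to encode, whereas
- * one literal followed by the length-150 match covers all 151 bytes in just
- * 33 bits.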
- *
- * This problem of choosing the truly best match/literal sequence is probably
- * impossible to solve efficiently when combined with entropy encoding. If we
- * knew how many bits it takes to output each match/literal, then we could
- * choose the optimal sequence using shortest-path search a la Dijkstra's
- * algorithm. However, with entropy encoding, the chosen match/literal sequence
- * affects its own encoding. Therefore, we can't know how many bits it will
- * take to output any one match or literal until we have actually chosen the
- * full sequence of matches and literals.
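- *
- * To make the shortest-path formulation concrete, here is a minimal sketch of
- * a minimum-cost parse under known, fixed per-item bit costs. The struct and
- * all helpers (literal_cost(), match_cost(), get_matches()) are hypothetical
- * stand-ins, not the actual interfaces used in this file:
- *
- *	#include <stdint.h>
- *
- *	struct lz_match {
- *		uint32_t length;
- *		uint32_t offset;
- *	};
- *
- *	// Hypothetical cost model and match source
- *	uint32_t literal_cost(uint8_t byte);
- *	uint32_t match_cost(uint32_t length, uint32_t offset);
- *	// Assumed to return at most 64 in-bounds matches at 'pos'
- *	unsigned get_matches(uint32_t pos, struct lz_match *matches);
- *
- *	// Forward pass over a DAG whose nodes are buffer positions and whose
- *	// edges are literals and matches. cost[i] ends up holding the minimum
- *	// number of bits needed to encode bytes 0..i-1, and item_len[i] the
- *	// length of the last item on that path, so the chosen sequence can be
- *	// recovered by walking item_len[] backwards from 'size'.
- *	void min_cost_parse(const uint8_t *data, uint32_t size,
- *			    uint32_t *cost, uint32_t *item_len)
- *	{
- *		struct lz_match matches[64];
- *
- *		cost[0] = 0;
- *		for (uint32_t i = 1; i <= size; i++)
- *			cost[i] = UINT32_MAX;
- *
- *		for (uint32_t i = 0; i < size; i++) {
- *			// Literal edge: covers exactly one byte
- *			uint32_t c = cost[i] + literal_cost(data[i]);
- *			if (c < cost[i + 1]) {
- *				cost[i + 1] = c;
- *				item_len[i + 1] = 1;
- *			}
- *			// Match edges: one per candidate match at position i
- *			unsigned num = get_matches(i, matches);
- *			for (unsigned j = 0; j < num; j++) {
- *				uint32_t end = i + matches[j].length;
- *
- *				c = cost[i] + match_cost(matches[j].length,
- *							 matches[j].offset);
- *				if (c < cost[end]) {
- *					cost[end] = c;
- *					item_len[end] = matches[j].length;
- *				}
- *			}
- *		}
- *	}
- *
- * Because every edge points forward and has nonnegative cost, visiting the
- * positions in increasing order suffices; unlike general Dijkstra, no
- * priority queue is needed.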
- *
- * Even setting aside the entropy encoding problem, we also aren't guaranteed
- * to choose the optimal match/literal sequence unless the match-finder (see
- * section "Match-finder") provides the match-chooser with all possible matches
- * at each position. However, finding all possible matches is not
- * computationally efficient. For
- * example, there might be many matches of the same length, and usually (but not
- * always) the best choice is the one with the smallest offset. So in practice,
- * it's fine to only consider the smallest offset for a given match length at a
- * given position. (Actually, for LZX, it's also worth considering repeat
- * offsets.)
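- *
- * As an illustrative sketch of that filtering (reusing the hypothetical
- * lz_match struct from the earlier sketch; this helper is not the actual
- * match-finder interface):
- *
- *	// Among the candidate matches at one position, keep for each
- *	// distinct length only the candidate with the smallest offset.
- *	// Returns the number of matches kept.
- *	unsigned keep_smallest_offsets(struct lz_match *matches, unsigned num)
- *	{
- *		unsigned kept = 0;
- *
- *		for (unsigned i = 0; i < num; i++) {
- *			unsigned j;
- *
- *			// Look for an already-kept match of the same length
- *			for (j = 0; j < kept; j++)
- *				if (matches[j].length == matches[i].length)
- *					break;
- *
- *			if (j == kept)
- *				matches[kept++] = matches[i];
- *			else if (matches[i].offset < matches[j].offset)
- *				matches[j].offset = matches[i].offset;
- *		}
- *		return kept;
- *	}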
- *
- * In addition, as mentioned earlier, in LZX we have the choice of using
- * multiple blocks, each of which resets the Huffman codes. This expands the
- * search space even further. Therefore, to simplify the problem, we
- * currently don't attempt to choose the LZX blocks based on the data.
- * Instead, we just divide the data into fixed-size blocks of LZX_DIV_BLOCK_SIZE
- * bytes each, and always use verbatim or aligned blocks (never uncompressed).
- * A previous version of this code recursively split the input data into
- * equal-sized blocks, up to a maximum depth, and chose the lowest-cost block
- * divisions. However, this made compression much slower and did not actually
- * help very much. It remains an open question whether a sufficiently fast and
- * useful block-splitting algorithm is possible for LZX. Essentially the same
- * problem also applies to DEFLATE. The Microsoft LZX compressor seemingly does
- * do block splitting, although I don't know how fast or useful it is,
- * specifically.
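- *
- * A sketch of the fixed-size division mentioned above (LZX_DIV_BLOCK_SIZE is
- * the real constant; in_data, in_size, and compress_block() are hypothetical
- * stand-ins for this file's actual variables and block-output step):
- *
- *	for (size_t pos = 0; pos < in_size; pos += LZX_DIV_BLOCK_SIZE) {
- *		size_t block_size = in_size - pos;
- *
- *		// The last block may be shorter than LZX_DIV_BLOCK_SIZE
- *		if (block_size > LZX_DIV_BLOCK_SIZE)
- *			block_size = LZX_DIV_BLOCK_SIZE;
- *		compress_block(&in_data[pos], block_size);
- *	}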
- *
- * Now, back to the entropy encoding problem. The "solution" is to use an
- * iterative approach to compute a good, but not necessarily optimal,
- * match/literal sequence. Start with a fixed assignment of symbol costs and
- * choose an "optimal" match/literal sequence based on those costs, using
- * shortest-path search a la Dijkstra's algorithm. Then, for each iteration of
- * the optimization, update the costs based on the entropy encoding of the
- * current match/literal sequence, then choose a new match/literal sequence
- * based on the updated costs. Usually, the actual cost to output the current
- * match/literal sequence will decrease in each iteration until it converges on
- * a fixed point. This result may not be the truly optimal match/literal
- * sequence, but it usually is much better than one chosen by doing a "greedy"
- * parse where we always choose the longest match.
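- *
- * In outline, the iteration looks like the following, where every name is a
- * hypothetical stand-in for the corresponding step rather than an actual
- * function in this file:
- *
- *	set_default_costs(&costs);
- *	for (unsigned pass = 0; pass < num_passes; pass++) {
- *		// Choose a minimum-cost item sequence under the current costs
- *		choose_items(data, size, &costs, &seq);
- *		// Count the symbol frequencies of that sequence, build
- *		// Huffman codes from them, and derive new costs from the
- *		// resulting codeword lengths
- *		make_huffman_codes(&seq, &codes);
- *		update_costs(&costs, &codes);
- *	}
- *	output_block(&seq, &codes);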
- *
- * An alternative to both greedy parsing and iterative, near-optimal parsing is
- * "lazy" parsing. Briefly, "lazy" parsing considers just the longest match at
- * each position, but it waits to choose that match until it has also examined
- * the next position. This is actually a useful approach; it's used by zlib,
- * for example. Therefore, for fast compression we combine lazy parsing with
- * the hash chain match-finder. For normal/high compression we combine
- * near-optimal parsing with the binary tree match-finder.
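- *
- * For reference, a minimal sketch of the lazy heuristic, where data, size,
- * longest_match(), output_literal(), and output_match() are hypothetical
- * stand-ins:
- *
- *	uint32_t pos = 0;
- *
- *	while (pos < size) {
- *		uint32_t cur_offset, next_offset;
- *		uint32_t cur_len = longest_match(pos, &cur_offset);
- *
- *		if (cur_len < 2) {	// LZX's minimum match length is 2
- *			output_literal(data[pos++]);
- *			continue;
- *		}
- *		// Peek at the next position; if it has a longer match, emit
- *		// a literal now and defer the decision by one byte.
- *		if (pos + 1 < size &&
- *		    longest_match(pos + 1, &next_offset) > cur_len) {
- *			output_literal(data[pos++]);
- *		} else {
- *			output_match(cur_len, cur_offset);
- *			pos += cur_len;
- *		}
- *	}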