From: Eric Biggers Date: Sun, 17 Nov 2013 22:12:49 +0000 (-0600) Subject: Merge experimental LZX compressor X-Git-Tag: v1.5.2~3 X-Git-Url: https://wimlib.net/git/?p=wimlib;a=commitdiff_plain;h=2254a0fc3f1d7af1151ee83f3458f44339b5028b Merge experimental LZX compressor --- diff --git a/NEWS b/NEWS index 79554aa4..c86a155e 100644 --- a/NEWS +++ b/NEWS @@ -1,6 +1,14 @@ Only the most important changes more recent than version 0.6 are noted here. Version 1.5.2: + Added a new experimental LZX compressor which can be enabled by passing + '--compress-slow' to `wimlib-imagex capture' or `wimlib-imagex + optimize'. (The latter is only applicable if the WIM is already + LZX-compressed and the '--recompress' option is also given.) The + experimental compressor is much slower but compresses the data slightly + more --- currently usually to within a fraction of a percent of the + results from WIMGAPI/ImageX. + An alignment bug that caused a crash in the LZX decompressor on some builds was fixed. diff --git a/README b/README index 88b1521f..febaf267 100644 --- a/README +++ b/README @@ -24,7 +24,7 @@ A Windows Imaging (WIM) file is an archive designed primarily for archiving Windows filesystems. However, it can be used on other platforms as well, with some limitations. Like some other archive formats such as ZIP, files in WIM archives may be compressed. WIM files support two compression formats: LZX and -XPRESS. Both are supported by wimlib. +XPRESS *. Both are supported by wimlib. A WIM file consists of one or more "images". Each image is an independent top-level directory structure and is logically separate from all other images in @@ -35,6 +35,10 @@ images. A WIM file may be either stand-alone or split into multiple parts. Split WIMs are read-only and cannot be modified. +* Note: The Windows 8 WIMGAPI apparently adds a third format, LZMS, but it is + not documented and is incompatible with ImageX and Dism. It is unclear if + this new format is actually being used for anything. + IMAGEX IMPLEMENTATION wimlib itself is a C library, and it provides a documented public API (See: @@ -62,10 +66,10 @@ commands and their syntax. For additional documentation: wimlib (and wimlib-imagex) can create XPRESS or LZX compressed WIM archives. Currently, the XPRESS compression ratio is slightly better than that provided by -Microsoft's software, while the LZX compression ratio is approaching that of -Microsoft's software but is not quite there yet. Running time is as good as or -better than Microsoft's software, especially with multithreaded compression, -available in wimlib v1.1.0 and later. +Microsoft's software, while by default the LZX compression ratio is approaching +that of Microsoft's software but is not quite there yet. Running time is as +good as or better than Microsoft's software, especially with multithreaded +compression, available in wimlib v1.1.0 and later. The following tables compare the compression ratio and performance for creating a compressed x86_64 Windows PE image. Note: these timings were done on Windows @@ -84,6 +88,12 @@ even better performance on Linux. wimlib-imagex (v1.4.0, 2 threads): 18 sec 51 sec Microsoft imagex.exe: 25 sec 93 sec +The above LZX values are using the default LZX compressor. wimlib v1.5.2 +introduced a new experimental LZX compressor which can be enabled by passing +'--compress-slow' to `wimlib-imagex capture' or `wimlib-imagex optimize'. 
This +compressor is much slower but compresses the data slightly more --- currently +usually to within a fraction of a percent of the results from imagex.exe. + NTFS SUPPORT WIM images may contain data, such as alternate data streams and diff --git a/doc/imagex-capture.1.in b/doc/imagex-capture.1.in index 18d07ed6..08ac4226 100644 --- a/doc/imagex-capture.1.in +++ b/doc/imagex-capture.1.in @@ -184,7 +184,7 @@ for \fB@IMAGEX_PROGNAME@ capture\fR, since the compression mode for is automatically set as such). \fITYPE\fR may be "none", "fast", or "maximum". By default, it is "maximum". This default behavior is different from Microsoft's ImageX, where the default is -"fast". \fB@IMAGEX_PROGNAME@ capture\fR instead gives you the best compression +"fast". \fB@IMAGEX_PROGNAME@ capture\fR instead gives you the better compression ratio by default and makes up for the slightly slower compression by being faster than Microsoft's software in the first place and using multiple CPUs when available. @@ -192,6 +192,15 @@ available. You may also specify the actual names of the compression algorithms, "XPRESS" and "LZX", instead of "fast" and "maximum", respectively. .TP +\fB--compress-slow\fR +Spend even more time compressing the data in order to achieve a higher +compression ratio. Currently, this only has an effect with LZX ("maximum") +compression. Depending on the data, compressing with this option will take +around 10 to 15 times longer and produce a LZX-compressed WIM about 1% to 5% +smaller than one produced with the default LZX compressor. Depending on the +data, the resulting WIM may be approximately the same size (typically no more +than 0.4% different) as a LZX-compressed WIM produced by WIMGAPI. +.TP \fB--threads\fR=\fINUM_THREADS\fR Number of threads to use for compressing data. Default: autodetect (number of available CPUs). diff --git a/doc/imagex-optimize.1.in b/doc/imagex-optimize.1.in index fd5a6a2d..8ccbaa7d 100644 --- a/doc/imagex-optimize.1.in +++ b/doc/imagex-optimize.1.in @@ -39,6 +39,15 @@ a slightly worse LZX compression ratio than Microsoft's software. So, you may not want to specify \fB--recompress\fR when optimizing a LZX-compressed WIM created on Windows with Microsoft's ImageX. .TP +\fB--compress-slow\fR +With \fB--recompress\fR: Spend even more time compressing the data in order to +achieve a higher compression ratio. Currently, this only has an effect with LZX +("maximum") compression. Depending on the data, compressing with this option +will take around 10 to 15 times longer and produce a LZX-compressed WIM about 1% +to 5% smaller than one produced with the default LZX compressor. Depending on +the data, the resulting WIM may be approximately the same size (typically no +more than 0.4% different) as a LZX-compressed WIM produced by WIMGAPI. +.TP \fB--threads\fR=\fINUM_THREADS\fR Number of threads to use for compressing data. Default: autodetect (number of processors). This parameter is only meaningful when \fB--recompress\fR is also diff --git a/doc/imagex.1.in b/doc/imagex.1.in index 0b1089fc..4028390e 100644 --- a/doc/imagex.1.in +++ b/doc/imagex.1.in @@ -125,12 +125,6 @@ mode, similar to Microsoft's ImageX. wimlib supports multithreaded compression, which can make it much faster to create compressed WIM files. .IP \[bu] -wimlib's XPRESS compressor is slightly better than Microsoft's (in terms of -compression ratio). -.IP \[bu] -wimlib's LZX compressor is slightly worse than Microsoft's (in terms of -compression ratio), but it's still better than XPRESS compression. 
-.IP \[bu] \fB@IMAGEX_PROGNAME@ capture\fR defaults to LZX ("maximum") compression for new WIMs, as opposed to Microsoft's software which defaults to XPRESS ("fast") compression. diff --git a/include/wimlib.h b/include/wimlib.h index 081dfca8..d5194c17 100644 --- a/include/wimlib.h +++ b/include/wimlib.h @@ -1501,6 +1501,13 @@ typedef int (*wimlib_iterate_lookup_table_callback_t)(const struct wimlib_resour * already implied for wimlib_overwrite(). */ #define WIMLIB_WRITE_FLAG_STREAMS_OK 0x00000400 +/** Use the slow LZX compression algorithm (rather than the default fast LZX + * compression algorithm) to try to achieve a higher compression ratio. Only + * has an effect if the WIM uses LZX compression; not to be confused with "fast" + * (XPRESS) compression. This can be combined with + * ::WIMLIB_WRITE_FLAG_RECOMPRESS. */ +#define WIMLIB_WRITE_FLAG_COMPRESS_SLOW 0x00000800 + /** @} */ /** @ingroup G_general * @{ */ @@ -1642,6 +1649,97 @@ struct wimlib_extract_command { int extract_flags; }; +/** @} */ +/** @ingroup G_compression + * @{ */ + +/** LZX compression parameters to pass to wimlib_lzx_alloc_context(). */ +struct wimlib_lzx_params { + /** Must be set to the size of this structure, in bytes. */ + uint32_t size_of_this; + + /** Relatively fast LZX compression algorithm with a decent compression + * ratio; the suggested default. */ +#define WIMLIB_LZX_ALGORITHM_FAST 0 + + /** Slower LZX compression algorithm that provides a better compression + * ratio. */ +#define WIMLIB_LZX_ALGORITHM_SLOW 1 + + /** Algorithm to use to perform the compression: either + * ::WIMLIB_LZX_ALGORITHM_FAST or ::WIMLIB_LZX_ALGORITHM_SLOW. The + * format is still LZX; this refers to the method the code will use to + * perform LZX-compatible compression. */ + uint32_t algorithm : 3; + + /** If set to 1, the default parameters for the specified algorithm are + * used rather than the ones specified in the following union. */ + uint32_t use_defaults : 1; + + union { + /** Parameters for the fast algorithm. */ + struct wimlib_lzx_fast_params { + uint32_t fast_reserved1[10]; + } fast; + + /** Parameters for the slow algorithm. */ + struct wimlib_lzx_slow_params { + /** If set to 1, the compressor can output length 2 + * matches. If set 0, the compressor only outputs + * matches of length 3 or greater. Suggested value: 1 + */ + uint32_t use_len2_matches : 1; + + uint32_t slow_reserved1 : 31; + + /** This is the maximum match length to return from the + * binary tree match-finder. Any match reaching this + * limit is still extended as far as possible. Must be + * at least 3 and no more than 257. Suggested value: + * 32. */ + uint32_t num_fast_bytes; + + /** Number of passes to compute a match/literal sequence + * for each LZX block. This is for an iterative + * algorithm that attempts to minimize the cost of the + * match/literal sequence by using a cost model provided + * by the previous iteration. Must be at least 1. + * Suggested value: 3. */ + uint32_t num_optim_passes; + + /** The number of times to attempt to recursively split + * each LZX block. Up to (2**(num_split_passes) + * sub-blocks can be created for a given input. This + * parameter can be 0, in which case the full input is + * always output as one block. Suggested value: 3. + */ + uint32_t num_split_passes; + + uint32_t slow_reserved2[4]; + + /** Assumed cost of a main symbol with zero frequency. + * Must be at least 1 and no more than 16. Suggested + * value: 15. */ + uint8_t main_nostat_cost; + + /** Assumed cost of a length symbol with zero frequency. 
+ * Must be at least 1 and no more than 16. Suggested + * value: 15. */ + uint8_t len_nostat_cost; + + /** Assumed cost of an aligned symbol with zero + * frequency. Must be at least 1 and no more than 8. + * Suggested value: 7. */ + uint8_t aligned_nostat_cost; + + uint8_t slow_reserved3[5]; + } slow; + }; +}; + +/** Opaque LZX compression context. */ +struct wimlib_lzx_context; + /** @} */ /** @ingroup G_general * @{ */ @@ -2614,6 +2712,56 @@ wimlib_join(const wimlib_tchar * const *swms, extern unsigned wimlib_lzx_compress(const void *chunk, unsigned chunk_size, void *out); +/** + * @ingroup G_compression + * + * Equivalent to wimlib_lzx_compress(), but uses the specified compression + * context, allocated by wimlib_lzx_alloc_context(). + */ +extern unsigned +wimlib_lzx_compress2(const void *chunk, unsigned chunk_size, void *out, + struct wimlib_lzx_context *ctx); + +/** + * @ingroup G_compression + * + * Allocate a LZX compression context using the specified parameters. + * + * This function is exported for convenience only and should only be used by + * library clients looking to make use of wimlib's compression code for another + * purpose. + * + * @param params + * Compression parameters to use, or @c NULL to use the default algorithm + * and parameters. + * + * @param ctx_ret + * A pointer to either @c NULL or an existing ::wimlib_lzx_context. If + * *ctx_ret == NULL, the new context is allocated. If + * *ctx_ret != NULL, the existing context is re-used if + * possible. Alternatively, this argument can itself be @c NULL to + * indicate that only parameter validation is to be performed. + * + * @return 0 on success; nonzero on error. + * + * @retval ::WIMLIB_ERR_INVALID_PARAM + * The compression parameters were invalid. + * @retval ::WIMLIB_ERR_NOMEM + * Not enough memory to allocate the compression context. + */ +extern int +wimlib_lzx_alloc_context(const struct wimlib_lzx_params *params, + struct wimlib_lzx_context **ctx_pp); + +/** + * @ingroup G_compression + * + * Free the specified LZX compression context, allocated with + * wimlib_lzx_alloc_context(). + */ +extern void +wimlib_lzx_free_context(struct wimlib_lzx_context *ctx); + /** * @ingroup G_compression * diff --git a/include/wimlib/compress.h b/include/wimlib/compress.h index 60b4c0ea..b6485a0d 100644 --- a/include/wimlib/compress.h +++ b/include/wimlib/compress.h @@ -69,8 +69,8 @@ struct lz_params { unsigned too_far; }; -typedef unsigned (*lz_record_match_t)(unsigned, unsigned, void *, void *); -typedef unsigned (*lz_record_literal_t)(u8, void *); +typedef u32 (*lz_record_match_t)(unsigned, unsigned, void *, void *); +typedef u32 (*lz_record_literal_t)(u8, void *); extern unsigned lz_analyze_block(const u8 uncompressed_data[], @@ -96,8 +96,8 @@ flush_output_bitstream(struct output_bitstream *ostream); extern void make_canonical_huffman_code(unsigned num_syms, unsigned max_codeword_len, - const freq_t freq_tab[], - u8 lens[], - u16 codewords[]); + const freq_t freq_tab[restrict], + u8 lens[restrict], + u16 codewords[restrict]); #endif /* _WIMLIB_COMPRESS_H */ diff --git a/include/wimlib/lzx.h b/include/wimlib/lzx.h index a3db788a..6061e730 100644 --- a/include/wimlib/lzx.h +++ b/include/wimlib/lzx.h @@ -7,8 +7,10 @@ //#define ENABLE_LZX_DEBUG #ifdef ENABLE_LZX_DEBUG # define LZX_DEBUG DEBUG +# define LZX_ASSERT wimlib_assert #else # define LZX_DEBUG(format, ...) +# define LZX_ASSERT(...) #endif /* Constants, most of which are defined by the LZX specification: */ @@ -73,6 +75,9 @@ * different as well. 
*/ #define LZX_WIM_MAGIC_FILESIZE 12000000 +#define LZX_BLOCKTYPE_NBITS 3 +#define LZX_BLOCKSIZE_NBITS 16 + #define USE_LZX_EXTRA_BITS_ARRAY #ifdef USE_LZX_EXTRA_BITS_ARRAY @@ -97,7 +102,7 @@ lzx_get_num_extra_bits(unsigned position_slot) extern const u32 lzx_position_base[LZX_NUM_POSITION_SLOTS]; /* Least-recently used queue for match offsets. */ -struct lru_queue { +struct lzx_lru_queue { u32 R0; u32 R1; u32 R2; diff --git a/include/wimlib/resource.h b/include/wimlib/resource.h index e030e336..1d257a56 100644 --- a/include/wimlib/resource.h +++ b/include/wimlib/resource.h @@ -94,7 +94,8 @@ put_resource_entry(const struct resource_entry *entry, /* wimlib internal flags used when reading or writing resources. */ #define WIMLIB_WRITE_RESOURCE_FLAG_RECOMPRESS 0x00000001 -#define WIMLIB_WRITE_RESOURCE_FLAG_PIPABLE 0x00000002 +#define WIMLIB_WRITE_RESOURCE_FLAG_COMPRESS_SLOW 0x00000002 +#define WIMLIB_WRITE_RESOURCE_FLAG_PIPABLE 0x00000004 #define WIMLIB_WRITE_RESOURCE_MASK 0x0000ffff #define WIMLIB_READ_RESOURCE_FLAG_RAW_FULL 0x80000000 @@ -136,14 +137,16 @@ read_resource_prefix(const struct wim_lookup_table_entry *lte, extern int write_wim_resource(struct wim_lookup_table_entry *lte, struct filedes *out_fd, int out_ctype, struct resource_entry *out_res_entry, - int write_resource_flags); + int write_resource_flags, + struct wimlib_lzx_context **comp_ctx); extern int write_wim_resource_from_buffer(const void *buf, size_t buf_size, int reshdr_flags, struct filedes *out_fd, int out_ctype, struct resource_entry *out_res_entry, - u8 *hash_ret, int write_resource_flags); + u8 *hash_ret, int write_resource_flags, + struct wimlib_lzx_context **comp_ctx); /* Functions to extract a resource. */ diff --git a/include/wimlib/wim.h b/include/wimlib/wim.h index 541d67ae..d1e09713 100644 --- a/include/wimlib/wim.h +++ b/include/wimlib/wim.h @@ -42,6 +42,9 @@ struct WIMStruct { /* Temporary field */ void *private; + /* LZX compression context for default thread */ + struct wimlib_lzx_context *lzx_context; + struct list_head subwims; struct list_head subwim_node; diff --git a/programs/imagex.c b/programs/imagex.c index 54fc4cf2..f23f0438 100644 --- a/programs/imagex.c +++ b/programs/imagex.c @@ -123,6 +123,7 @@ enum { IMAGEX_COMMAND_OPTION, IMAGEX_COMMIT_OPTION, IMAGEX_COMPRESS_OPTION, + IMAGEX_COMPRESS_SLOW_OPTION, IMAGEX_CONFIG_OPTION, IMAGEX_DEBUG_OPTION, IMAGEX_DELTA_FROM_OPTION, @@ -188,6 +189,7 @@ static const struct option capture_or_append_options[] = { {T("no-check"), no_argument, NULL, IMAGEX_NOCHECK_OPTION}, {T("nocheck"), no_argument, NULL, IMAGEX_NOCHECK_OPTION}, {T("compress"), required_argument, NULL, IMAGEX_COMPRESS_OPTION}, + {T("compress-slow"), no_argument, NULL, IMAGEX_COMPRESS_SLOW_OPTION}, {T("config"), required_argument, NULL, IMAGEX_CONFIG_OPTION}, {T("dereference"), no_argument, NULL, IMAGEX_DEREFERENCE_OPTION}, {T("flags"), required_argument, NULL, IMAGEX_FLAGS_OPTION}, @@ -281,6 +283,7 @@ static const struct option optimize_options[] = { {T("nocheck"), no_argument, NULL, IMAGEX_NOCHECK_OPTION}, {T("no-check"), no_argument, NULL, IMAGEX_NOCHECK_OPTION}, {T("recompress"), no_argument, NULL, IMAGEX_RECOMPRESS_OPTION}, + {T("compress-slow"), no_argument, NULL, IMAGEX_COMPRESS_SLOW_OPTION}, {T("threads"), required_argument, NULL, IMAGEX_THREADS_OPTION}, {T("pipable"), no_argument, NULL, IMAGEX_PIPABLE_OPTION}, {T("not-pipable"), no_argument, NULL, IMAGEX_NOT_PIPABLE_OPTION}, @@ -1694,6 +1697,9 @@ imagex_capture_or_append(int argc, tchar **argv, int cmd) if (compression_type == 
WIMLIB_COMPRESSION_TYPE_INVALID) goto out_err; break; + case IMAGEX_COMPRESS_SLOW_OPTION: + write_flags |= WIMLIB_WRITE_FLAG_COMPRESS_SLOW; + break; case IMAGEX_FLAGS_OPTION: flags_element = optarg; break; @@ -3217,6 +3223,9 @@ imagex_optimize(int argc, tchar **argv, int cmd) case IMAGEX_RECOMPRESS_OPTION: write_flags |= WIMLIB_WRITE_FLAG_RECOMPRESS; break; + case IMAGEX_COMPRESS_SLOW_OPTION: + write_flags |= WIMLIB_WRITE_FLAG_COMPRESS_SLOW; + break; case IMAGEX_THREADS_OPTION: num_threads = parse_num_threads(optarg); if (num_threads == UINT_MAX) @@ -3630,12 +3639,12 @@ static const tchar *usage_strings[] = { [CMD_APPEND] = T( " %"TS" (DIRECTORY | NTFS_VOLUME) WIMFILE\n" -" [IMAGE_NAME [IMAGE_DESCRIPTION]] [--boot] [--check]\n" -" [--nocheck] [--flags EDITION_ID] [--verbose]\n" -" [--dereference] [--config=FILE] [--threads=NUM_THREADS]\n" -" [--rebuild] [--unix-data] [--source-list] [--no-acls]\n" -" [--strict-acls] [--rpfix] [--norpfix] [--pipable]\n" -" [--not-pipable] [--update-of=[WIMFILE:]IMAGE]\n" +" [IMAGE_NAME [IMAGE_DESCRIPTION]] [--boot]\n" +" [--check] [--nocheck] [--compress-slow]\n" +" [--flags EDITION_ID] [--verbose] [--dereference]\n" +" [--config=FILE] [--threads=NUM_THREADS] [--source-list]\n" +" [--no-acls] [--strict-acls] [--rpfix] [--norpfix]\n" +" [--unix-data] [--pipable] [--update-of=[WIMFILE:]IMAGE]\n" ), [CMD_APPLY] = T( @@ -3648,12 +3657,13 @@ T( [CMD_CAPTURE] = T( " %"TS" (DIRECTORY | NTFS_VOLUME) WIMFILE\n" -" [IMAGE_NAME [IMAGE_DESCRIPTION]] [--boot] [--check]\n" -" [--nocheck] [--compress=TYPE] [--flags EDITION_ID]\n" -" [--verbose] [--dereference] [--config=FILE]\n" -" [--threads=NUM_THREADS] [--unix-data] [--source-list]\n" -" [--no-acls] [--strict-acls] [--norpfix] [--pipable]\n" -" [--update-of=[WIMFILE:]IMAGE] [--delta-from=WIMFILE]\n" +" [IMAGE_NAME [IMAGE_DESCRIPTION]] [--boot]\n" +" [--check] [--nocheck] [--compress=TYPE] [--compress-slow]\n" +" [--flags EDITION_ID] [--verbose] [--dereference]\n" +" [--config=FILE] [--threads=NUM_THREADS] [--source-list]\n" +" [--no-acls] [--strict-acls] [--rpfix] [--norpfix]\n" +" [--unix-data] [--pipable] [--update-of=[WIMFILE:]IMAGE]\n" +" [--delta-from=WIMFILE]\n" ), [CMD_DELETE] = T( @@ -3707,7 +3717,8 @@ T( [CMD_OPTIMIZE] = T( " %"TS" WIMFILE [--check] [--nocheck] [--recompress]\n" -" [--threads=NUM_THREADS] [--pipable] [--not-pipable]\n" +" [--compress-slow] [--threads=NUM_THREADS]\n" +" [--pipable] [--not-pipable]\n" ), [CMD_SPLIT] = T( diff --git a/src/compress.c b/src/compress.c index b25938f1..dfaabe6b 100644 --- a/src/compress.c +++ b/src/compress.c @@ -237,9 +237,9 @@ huffman_tree_compute_path_lengths(HuffmanNode *base_node, u16 cur_len) void make_canonical_huffman_code(unsigned num_syms, unsigned max_codeword_len, - const freq_t * restrict freq_tab, - u8 * restrict lens, - u16 * restrict codewords) + const freq_t freq_tab[restrict], + u8 lens[restrict], + u16 codewords[restrict]) { /* We require at least 2 possible symbols in the alphabet to produce a * valid Huffman decoding table. 
It is allowed that fewer than 2 symbols diff --git a/src/integrity.c b/src/integrity.c index 6d007fef..0bfb8050 100644 --- a/src/integrity.c +++ b/src/integrity.c @@ -363,7 +363,8 @@ write_integrity_table(WIMStruct *wim, WIMLIB_COMPRESSION_TYPE_NONE, &wim->hdr.integrity, NULL, - 0); + 0, + &wim->lzx_context); FREE(new_table); out_free_old_table: FREE(old_table); diff --git a/src/lookup_table.c b/src/lookup_table.c index 4ad48beb..98433ae0 100644 --- a/src/lookup_table.c +++ b/src/lookup_table.c @@ -678,7 +678,8 @@ static int write_wim_lookup_table_from_stream_list(struct list_head *stream_list, struct filedes *out_fd, struct resource_entry *out_res_entry, - int write_resource_flags) + int write_resource_flags, + struct wimlib_lzx_context **comp_ctx) { size_t table_size; struct wim_lookup_table_entry *lte; @@ -712,7 +713,8 @@ write_wim_lookup_table_from_stream_list(struct list_head *stream_list, WIMLIB_COMPRESSION_TYPE_NONE, out_res_entry, NULL, - write_resource_flags); + write_resource_flags, + comp_ctx); FREE(table_buf); return ret; } @@ -803,7 +805,8 @@ write_wim_lookup_table(WIMStruct *wim, int image, int write_flags, return write_wim_lookup_table_from_stream_list(stream_list, &wim->out_fd, out_res_entry, - write_resource_flags); + write_resource_flags, + &wim->lzx_context); } diff --git a/src/lz77.c b/src/lz77.c index 2d85ae8f..5ce8d89f 100644 --- a/src/lz77.c +++ b/src/lz77.c @@ -1,5 +1,5 @@ /* - * lz.c + * lz77.c * * This file provides the code to analyze a buffer of uncompressed data for * matches, as per the LZ77 algorithm. It uses a hash table to accelerate the diff --git a/src/lzx-compress.c b/src/lzx-compress.c index c876a182..ddfe1bbd 100644 --- a/src/lzx-compress.c +++ b/src/lzx-compress.c @@ -1,12 +1,10 @@ /* * lzx-compress.c * - * LZX compression routines, originally based on code written by Matthew T. - * Russotto (liblzxcomp), but heavily modified. + * LZX compression routines */ /* - * Copyright (C) 2002 Matthew T. Russotto * Copyright (C) 2012, 2013 Eric Biggers * * This file is part of wimlib, a library for working with WIM files. @@ -27,34 +25,130 @@ /* - * This file provides wimlib_lzx_compress(), a function to compress an in-memory - * buffer of data using LZX compression, as used in the WIM file format. - * - * Please see the comments in lzx-decompress.c for more information about this - * compression format. - * - * One thing to keep in mind is that there is no sliding window, since the - * window is always the entirety of a WIM chunk, which is at most WIM_CHUNK_SIZE - * ( = 32768) bytes. - * - * The basic compression algorithm used here should be familiar if you are - * familiar with Huffman trees and with other LZ77 and Huffman-based formats - * such as DEFLATE. Otherwise it can be quite tricky to understand. Basically - * it is the following: - * - * - Preprocess the input data (LZX-specific) - * - Go through the input data and determine matches. This part is based on - * code from zlib, and a hash table of 3-character strings is used to - * accelerate the process of finding matches. - * - Build the Huffman trees based on the frequencies of symbols determined - * while recording matches. - * - Output the block header, including the Huffman trees; then output the - * compressed stream of matches and literal characters. - * - * It is possible for a WIM chunk to include multiple LZX blocks, since for some - * input data this will produce a better compression ratio (especially since - * each block can include new Huffman codes). 
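As a rough sketch of how this context plumbing is intended to be used (the helper below is hypothetical and not part of this patch; only wimlib_lzx_alloc_context() and wimlib_lzx_compress2() are real), a caller such as write_wim_resource() passes &wim->lzx_context so that one context is lazily allocated and then reused for every chunk:

	/* Hypothetical sketch: compress one WIM chunk, reusing the LZX
	 * compression context stored in the WIMStruct.  On the first call
	 * *ctx_p is NULL and wimlib_lzx_alloc_context() allocates a context
	 * with default parameters; on later calls it re-uses *ctx_p.  */
	static int
	compress_chunk(const void *chunk, unsigned chunk_size, void *out,
		       struct wimlib_lzx_context **ctx_p, unsigned *csize_ret)
	{
		int ret = wimlib_lzx_alloc_context(NULL, ctx_p);
		if (ret)
			return ret;  /* e.g. WIMLIB_ERR_NOMEM */

		/* Returns 0 if the chunk did not compress to fewer than
		 * chunk_size bytes; such chunks are stored uncompressed.  */
		*csize_ret = wimlib_lzx_compress2(chunk, chunk_size, out, *ctx_p);
		return 0;
	}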
However, producing multiple LZX - * blocks from one input chunk is not yet implemented. + * This file contains a compressor for the LZX compression format, as used in + * the WIM file format. + * + * Format + * ====== + * + * First, the primary reference for the LZX compression format is the + * specification released by Microsoft. + * + * Second, the comments in lzx-decompress.c provide some more information about + * the LZX compression format, including errors in the Microsoft specification. + * + * Do note that LZX shares many similarities with DEFLATE, the algorithm used by + * zlib and gzip. Both LZX and DEFLATE use LZ77 matching and Huffman coding, + * and certain other details are quite similar, such as the method for storing + * Huffman codes. However, some of the main differences are: + * + * - LZX preprocesses the data before attempting to compress it. + * - LZX uses a "main" alphabet which combines literals and matches, with the + * match symbols containing a "length header" (giving all or part of the match + * length) and a "position footer" (giving, roughly speaking, the order of + * magnitude of the match offset). + * - LZX does not have static Huffman blocks; however it does have two types of + * dynamic Huffman blocks ("aligned offset" and "verbatim"). + * - LZX has a minimum match length of 2 rather than 3. + * - In LZX, match offsets 0 through 2 actually represent entries in a LRU queue + * of match offsets. + * + * Algorithms + * ========== + * + * There are actually two distinct overall algorithms implemented here. We + * shall refer to them as the "slow" algorithm and the "fast" algorithm. The + * "slow" algorithm spends more time compressing to achieve a higher compression + * ratio compared to the "fast" algorithm. More details are presented below. + * + * Slow algorithm + * -------------- + * + * The "slow" algorithm to generate LZX-compressed data is roughly as follows: + * + * 1. Preprocess the input data to translate the targets of x86 call instructions + * to absolute offsets. + * + * 2. Determine the best known sequence of LZ77 matches ((offset, length) pairs) + * and literal bytes to divide the input into. Raw match-finding is done + * using a very clever binary tree search based on the "Bt3" algorithm from + * 7-Zip. Parsing, or match-choosing, is solved essentially as a + * minimum-cost path problem, but using a heuristic forward search based on + * the Deflate encoder from 7-Zip rather than a more intuitive backward + * search, the latter of which would naively require that all matches be + * found. This heuristic search, as well as other heuristics such as limits + * on the matches considered, considerably speed up this part of the + * algorithm, which is the main bottleneck. Finally, after matches and + * literals are chosen, the needed Huffman codes needed to output them are + * built. + * + * 3. Up to a certain number of iterations, use the resulting Huffman codes to + * refine a cost model and go back to Step #2 to determine an improved + * sequence of matches and literals. + * + * 4. Up to a certain depth, try splitting the current block to see if the + * compression ratio can be improved. This may be the case if parts of the + * input differ greatly from each other and could benefit from different + * Huffman codes. + * + * 5. Output the resulting block(s) using the match/literal sequences and the + * Huffman codes that were computed for each block. 
+ * + * Fast algorithm + * -------------- + * + * The fast algorithm (and the only one available in wimlib v1.5.1 and earlier) + * spends much less time on the main bottlenecks of the compression process --- + * that is the match finding, match choosing, and block splitting. Matches are + * found and chosen with hash chains using a greedy parse with one position of + * look-ahead. No block splitting is done; only compressing the full input into + * an aligned offset block is considered. + * + * API + * === + * + * The old API (retained for backward compatibility) consists of just one function: + * + * wimlib_lzx_compress() + * + * The new compressor has more potential parameters and needs more memory, so + * the new API ties up memory allocations and compression parameters into a + * context: + * + * wimlib_lzx_alloc_context() + * wimlib_lzx_compress2() + * wimlib_lzx_free_context() + * + * Both wimlib_lzx_compress() and wimlib_lzx_compress2() are designed to + * compress an in-memory buffer of up to 32768 bytes. There is no sliding + * window. This is suitable for the WIM format, which uses fixed-size chunks + * that are seemingly always 32768 bytes. If needed, the compressor potentially + * could be extended to support a larger and/or sliding window. + * + * Both wimlib_lzx_compress() and wimlib_lzx_compress2() return 0 if the data + * could not be compressed to less than the size of the uncompressed data. + * Again, this is suitable for the WIM format, which stores such data chunks + * uncompressed. + * + * The functions in this API are exported from the library, although this is + * only in case other programs happen to have uses for it other than WIM + * reading/writing as already handled through the rest of the library. + * + * Acknowledgments + * =============== + * + * Acknowledgments to several other open-source projects that made it possible + * to implement this code: + * + * - 7-Zip (author: Igor Pavlov), for the binary tree match-finding + * algorithm, the heuristic near-optimal forward match-choosing + * algorithm, and the block splitting algorithm. + * + * - zlib (author: Jean-loup Gailly and Mark Adler), for the hash table + * match-finding algorithm. + * + * - lzx-compress (author: Matthew T. Russotto), on which some parts of this + * code were originally based. */ #ifdef HAVE_CONFIG_H @@ -67,27 +161,266 @@ #include "wimlib/lzx.h" #include "wimlib/util.h" -#include +#ifdef ENABLE_LZX_DEBUG +# include +#endif + #include +/* Experimental parameters not exposed through the API */ +#define LZX_PARAM_OPTIM_ARRAY_SIZE 1024 +#define LZX_PARAM_ACCOUNT_FOR_LRU 1 +#define LZX_PARAM_DONT_SKIP_MATCHES 0 +#define LZX_PARAM_USE_EMPIRICAL_DEFAULT_COSTS 1 + +/* Currently, this constant can't simply be changed because the code currently + * uses a static number of position slots. 
 */
+#define LZX_MAX_WINDOW_SIZE 32768
+
+/* This may be WIM-specific */
+#define LZX_DEFAULT_BLOCK_SIZE 32768
+
+#define LZX_LZ_HASH_BITS 15
+#define LZX_LZ_HASH_SIZE (1 << LZX_LZ_HASH_BITS)
+#define LZX_LZ_HASH_MASK (LZX_LZ_HASH_SIZE - 1)
+#define LZX_LZ_HASH_SHIFT 5
+
+/* Codewords for the LZX main, length, and aligned offset Huffman codes */
+struct lzx_codewords {
+	u16 main[LZX_MAINTREE_NUM_SYMBOLS];
+	u16 len[LZX_LENTREE_NUM_SYMBOLS];
+	u16 aligned[LZX_ALIGNEDTREE_NUM_SYMBOLS];
+};
+
+/* Lengths for the LZX main, length, and aligned offset Huffman codes */
+struct lzx_lens {
+	u8 main[LZX_MAINTREE_NUM_SYMBOLS];
+	u8 len[LZX_LENTREE_NUM_SYMBOLS];
+	u8 aligned[LZX_ALIGNEDTREE_NUM_SYMBOLS];
+};
 
-/* Structure to contain the Huffman codes for the main, length, and aligned
- * offset trees. */
+/* The LZX main, length, and aligned offset Huffman codes */
 struct lzx_codes {
-	u16 main_codewords[LZX_MAINTREE_NUM_SYMBOLS];
-	u8 main_lens[LZX_MAINTREE_NUM_SYMBOLS];
+	struct lzx_codewords codewords;
+	struct lzx_lens lens;
+};
+
+/* Tables for tallying symbol frequencies in the three LZX alphabets */
+struct lzx_freqs {
+	freq_t main[LZX_MAINTREE_NUM_SYMBOLS];
+	freq_t len[LZX_LENTREE_NUM_SYMBOLS];
+	freq_t aligned[LZX_ALIGNEDTREE_NUM_SYMBOLS];
+};
+
+/* LZX intermediate match/literal format */
+struct lzx_match {
+	/* Bit     Description
+	 *
+	 * 31      1 if a match, 0 if a literal.
+	 *
+	 * 30-25   position slot.  This can be at most 50, so it will fit in 6
+	 *         bits.
+	 *
+	 * 8-24    position footer.  This is the offset of the real formatted
+	 *         offset from the position base.  This can be at most 17 bits
+	 *         (since lzx_extra_bits[LZX_NUM_POSITION_SLOTS - 1] is 17).
+	 *
+	 * 0-7     length of match, minus 2.  This can be at most
+	 *         (LZX_MAX_MATCH - 2) == 255, so it will fit in 8 bits.  */
+	u32 data;
+};
+
+/* Raw LZ match/literal format: just a length and offset.
+ *
+ * The length is the number of bytes of the match, and the offset is the number
+ * of bytes back in the input the match is from the matched text.
+ *
+ * If @len < LZX_MIN_MATCH, then it's really just a literal byte.  */
+struct raw_match {
+	u16 len;
+	u16 offset;
+};
+
+/* Specification for a LZX block */
+struct lzx_block_spec {
+
+	/* Set to 1 if this block has been split (in two --- we only consider
+	 * binary splits).  In such cases the rest of the fields are
+	 * unimportant, since the relevant information is rather in the
+	 * structures for the sub-blocks.  */
+	u8 is_split : 1;
 
-	u16 len_codewords[LZX_LENTREE_NUM_SYMBOLS];
-	u8 len_lens[LZX_LENTREE_NUM_SYMBOLS];
+	/* One of the LZX_BLOCKTYPE_* constants indicating the type of this
+	 * block.  */
+	u8 block_type : 2;
 
-	u16 aligned_codewords[LZX_ALIGNEDTREE_NUM_SYMBOLS];
-	u8 aligned_lens[LZX_ALIGNEDTREE_NUM_SYMBOLS];
+	/* 0-based position in the window at which this block starts.  */
+	u16 window_pos;
+
+	/* The number of bytes of uncompressed data this block represents.  */
+	u16 block_size;
+
+	/* The position in the 'chosen_matches' array in the `struct
+	 * lzx_compressor' at which the match/literal specifications for
+	 * this block begin.  */
+	unsigned chosen_matches_start_pos;
+
+	/* The number of match/literal specifications for this block.  */
+	u16 num_chosen_matches;
+
+	/* Huffman codes for this block.  */
+	struct lzx_codes codes;
+};
+
+/*
+ * An array of these structures is used during the match-choosing algorithm.
+ * They correspond to consecutive positions in the window and are used to keep
+ * track of the cost to reach each position, and the match/literal choices that
+ * need to be made to reach that position.
+ */
+struct lzx_optimal {
+	/* The approximate minimum cost, in bits, to reach this position in the
+	 * window which has been found so far.  */
+	u32 cost;
+
+	/* The union here is just for clarity, since the fields are used in two
+	 * slightly different ways.  Initially, the @prev structure is filled in
+	 * first, and links go from later in the window to earlier in the
+	 * window.  Later, @next structure is filled in and links go from
+	 * earlier in the window to later in the window.  */
+	union {
+		struct {
+			/* Position of the start of the match or literal that
+			 * was taken to get to this position in the approximate
+			 * minimum-cost parse.  */
+			u16 link;
+
+			/* Offset, relative to its starting position, of the
+			 * match or literal that was taken to get to this
+			 * position in the approximate minimum-cost parse.  */
+			u16 match_offset;
+		} prev;
+		struct {
+			/* Position at which the match or literal starting at
+			 * this position ends in the minimum-cost parse.  */
+			u16 link;
+
+			/* Offset, relative to its starting position, of the
+			 * match or literal starting at this position in the
+			 * approximate minimum-cost parse.  */
+			u16 match_offset;
+		} next;
+	};
+#if LZX_PARAM_ACCOUNT_FOR_LRU
+	struct lzx_lru_queue queue;
+#endif
 };
 
-struct lzx_freq_tables {
-	freq_t main_freq_table[LZX_MAINTREE_NUM_SYMBOLS];
-	freq_t len_freq_table[LZX_LENTREE_NUM_SYMBOLS];
-	freq_t aligned_freq_table[LZX_ALIGNEDTREE_NUM_SYMBOLS];
+/* State of the LZX compressor */
+struct lzx_compressor {
+
+	/* The parameters that were used to create the compressor.  */
+	struct wimlib_lzx_params params;
+
+	/* The buffer of data to be compressed.
+	 *
+	 * 0xe8 byte preprocessing is done directly on the data here before
+	 * further compression.
+	 *
+	 * Note that this compressor does *not* use a sliding window!!!!
+	 * It's not needed in the WIM format, since every chunk is compressed
+	 * independently.  This is by design, to allow random access to the
+	 * chunks.
+	 *
+	 * We reserve a few extra bytes to potentially allow reading off the end
+	 * of the array in the match-finding code for optimization purposes.
+	 */
+	u8 window[LZX_MAX_WINDOW_SIZE + 12];
+
+	/* Number of bytes of data to be compressed, which is the number of
+	 * bytes of data in @window that are actually valid.  */
+	unsigned window_size;
+
+	/* The current match offset LRU queue.  */
+	struct lzx_lru_queue queue;
+
+	/* Space for sequence of matches/literals that were chosen.
+	 *
+	 * Each LZX_MAX_WINDOW_SIZE-sized portion of this array is used for a
+	 * different block splitting level.  */
+	struct lzx_match *chosen_matches;
+
+	/* Structures used during block splitting.
+	 *
+	 * This can be thought of as a binary tree.  block_specs[(1) - 1]
+	 * represents the top-level block (root node), and block_specs[(i*2)
+	 * - 1] and block_specs[(i*2+1) - 1] represent the sub-blocks (child
+	 * nodes) resulting from a binary split of the block represented by
+	 * block_specs[(i) - 1].
+	 */
+	struct lzx_block_spec *block_specs;
+
+	/* This is simply filled in with zeroes and used to avoid special-casing
+	 * the output of the first compressed Huffman code, which conceptually
+	 * has a delta taken from a code with all symbols having zero-length
+	 * codewords.  */
+	struct lzx_codes zero_codes;
+
+	/* Slow algorithm only: The current cost model.
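+	 * The costs are approximated by the codeword lengths, in bits, of the
+	 * Huffman codes computed on the previous optimization pass, so the
+	 * model is refined from one pass to the next.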
*/ + struct lzx_lens costs; + + /* Slow algorithm only: Table that maps the hash codes for 3 character + * sequences to the most recent position that sequence (or a sequence + * sharing the same hash code) appeared in the window. */ + u16 *hash_tab; + + /* Slow algorithm only: Table that maps 2-character sequences to the + * most recent position that sequence appeared in the window. */ + u16 *digram_tab; + + /* Slow algorithm only: Table that contains the logical child pointers + * in the binary trees in the match-finding code. + * + * child_tab[i*2] and child_tab[i*2+1] are the left and right pointers, + * respectively, from the binary tree root corresponding to window + * position i. */ + u16 *child_tab; + + /* Slow algorithm only: Matches that were already found and are saved in + * memory for subsequent queries (e.g. when block splitting). */ + struct raw_match *cached_matches; + + /* Slow algorithm only: Next position in 'cached_matches' to either + * return or fill in. */ + unsigned cached_matches_pos; + + /* Slow algorithm only: %true if reading from 'cached_matches'; %false + * if writing to 'cached_matches'. */ + bool matches_already_found; + + /* Slow algorithm only: Position in window of next match to return. */ + unsigned match_window_pos; + + /* Slow algorithm only: No matches returned shall reach past this + * position. */ + unsigned match_window_end; + + /* Slow algorithm only: Temporary space used for match-choosing + * algorithm. + * + * The size of this array must be at least LZX_MAX_MATCH but otherwise + * is arbitrary. More space simply allows the match-choosing algorithm + * to find better matches (depending on the input, as always). */ + struct lzx_optimal *optimum; + + /* Slow algorithm only: Variables used by the match-choosing algorithm. + * + * When matches have been chosen, optimum_cur_idx is set to the position + * in the window of the next match/literal to return and optimum_end_idx + * is set to the position in the window at the end of the last + * match/literal to return. */ + u32 optimum_cur_idx; + u32 optimum_end_idx; }; /* Returns the LZX position slot that corresponds to a given formatted offset. @@ -99,7 +432,7 @@ struct lzx_freq_tables { * numbers in the lzx_position_base array to calculate the slot directly from * the formatted offset without actually looking at the array. */ -static inline unsigned +static unsigned lzx_get_position_slot(unsigned formatted_offset) { #if 0 @@ -122,146 +455,72 @@ lzx_get_position_slot(unsigned formatted_offset) * increases starting at the 655360 entry, and it is >= 2 * because the below calculation fails if the most significant * bit is lower than the 2's place. */ - wimlib_assert(formatted_offset >= 2 && formatted_offset < 655360); + LZX_ASSERT(2 <= formatted_offset && formatted_offset < 655360); unsigned mssb_idx = bsr32(formatted_offset); return (mssb_idx << 1) | ((formatted_offset >> (mssb_idx - 1)) & 1); } } -static u32 -lzx_record_literal(u8 literal, void *_main_freq_tab) +/* Compute the hash code for the next 3-character sequence in the window. 
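+ * The window bytes are folded in one at a time, each combined with XOR after a
+ * left shift of LZX_LZ_HASH_SHIFT (5) bits, and the result is masked down to
+ * LZX_LZ_HASH_BITS (15) bits.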
*/ +static unsigned +lzx_lz_compute_hash(const u8 *window) { - freq_t *main_freq_tab = _main_freq_tab; - main_freq_tab[literal]++; - return literal; + unsigned hash; + + hash = window[0]; + hash <<= LZX_LZ_HASH_SHIFT; + hash ^= window[1]; + hash <<= LZX_LZ_HASH_SHIFT; + hash ^= window[2]; + return hash & LZX_LZ_HASH_MASK; } -/* Constructs a match from an offset and a length, and updates the LRU queue and - * the frequency of symbols in the main, length, and aligned offset alphabets. - * The return value is a 32-bit number that provides the match in an - * intermediate representation documented below. */ -static u32 -lzx_record_match(unsigned match_offset, unsigned match_len, - void *_freq_tabs, void *_queue) +/* Build the main, length, and aligned offset Huffman codes used in LZX. + * + * This takes as input the frequency tables for each code and produces as output + * a set of tables that map symbols to codewords and lengths. */ +static void +lzx_make_huffman_codes(const struct lzx_freqs *freqs, + struct lzx_codes *codes) { - struct lzx_freq_tables *freq_tabs = _freq_tabs; - struct lru_queue *queue = _queue; - unsigned position_slot; - unsigned position_footer = 0; - u32 match; - u32 len_header; - u32 len_pos_header; - unsigned len_footer; - unsigned adjusted_match_len; - - wimlib_assert(match_len >= LZX_MIN_MATCH && match_len <= LZX_MAX_MATCH); - wimlib_assert(match_offset != 0); - - /* If possible, encode this offset as a repeated offset. */ - if (match_offset == queue->R0) { - position_slot = 0; - } else if (match_offset == queue->R1) { - swap(queue->R0, queue->R1); - position_slot = 1; - } else if (match_offset == queue->R2) { - swap(queue->R0, queue->R2); - position_slot = 2; - } else { - /* Not a repeated offset. */ - - /* offsets of 0, 1, and 2 are reserved for the repeated offset - * codes, so non-repeated offsets must be encoded as 3+. The - * minimum offset is 1, so encode the offsets offset by 2. */ - unsigned formatted_offset = match_offset + LZX_MIN_MATCH; - - queue->R2 = queue->R1; - queue->R1 = queue->R0; - queue->R0 = match_offset; - - /* The (now-formatted) offset will actually be encoded as a - * small position slot number that maps to a certain hard-coded - * offset (position base), followed by a number of extra bits--- - * the position footer--- that are added to the position base to - * get the original formatted offset. */ - - position_slot = lzx_get_position_slot(formatted_offset); - position_footer = formatted_offset & - ((1 << lzx_get_num_extra_bits(position_slot)) - 1); - } - - adjusted_match_len = match_len - LZX_MIN_MATCH; - - /* Pack the position slot, position footer, and match length into an - * intermediate representation. - * - * bits description - * ---- ----------------------------------------------------------- - * - * 31 1 if a match, 0 if a literal. - * - * 30-25 position slot. This can be at most 50, so it will fit in 6 - * bits. - * - * 8-24 position footer. This is the offset of the real formatted - * offset from the position base. This can be at most 17 bits - * (since lzx_extra_bits[LZX_NUM_POSITION_SLOTS - 1] is 17). - * - * 0-7 length of match, offset by 2. This can be at most - * (LZX_MAX_MATCH - 2) == 255, so it will fit in 8 bits. */ - match = 0x80000000 | - (position_slot << 25) | - (position_footer << 8) | - (adjusted_match_len); - - /* The match length must be at least 2, so let the adjusted match length - * be the match length minus 2. - * - * If it is less than 7, the adjusted match length is encoded as a 3-bit - * number offset by 2. 
Otherwise, the 3-bit length header is all 1's - * and the actual adjusted length is given as a symbol encoded with the - * length tree, offset by 7. - */ - if (adjusted_match_len < LZX_NUM_PRIMARY_LENS) { - len_header = adjusted_match_len; - } else { - len_header = LZX_NUM_PRIMARY_LENS; - len_footer = adjusted_match_len - LZX_NUM_PRIMARY_LENS; - freq_tabs->len_freq_table[len_footer]++; - } - len_pos_header = (position_slot << 3) | len_header; - - wimlib_assert(len_pos_header < LZX_MAINTREE_NUM_SYMBOLS - LZX_NUM_CHARS); - - freq_tabs->main_freq_table[len_pos_header + LZX_NUM_CHARS]++; + make_canonical_huffman_code(LZX_MAINTREE_NUM_SYMBOLS, + LZX_MAX_CODEWORD_LEN, + freqs->main, + codes->lens.main, + codes->codewords.main); - /* Equivalent to: - * if (lzx_extra_bits[position_slot] >= 3) */ - if (position_slot >= 8) - freq_tabs->aligned_freq_table[position_footer & 7]++; + make_canonical_huffman_code(LZX_LENTREE_NUM_SYMBOLS, + LZX_MAX_CODEWORD_LEN, + freqs->len, + codes->lens.len, + codes->codewords.len); - return match; + make_canonical_huffman_code(LZX_ALIGNEDTREE_NUM_SYMBOLS, 8, + freqs->aligned, + codes->lens.aligned, + codes->codewords.aligned); } /* - * Writes a compressed literal match to the output. + * Output a LZX match. * - * @out: The output bitstream. - * @block_type: The type of the block (LZX_BLOCKTYPE_ALIGNED or LZX_BLOCKTYPE_VERBATIM) - * @match: The match, encoded as a 32-bit number. - * @codes: Pointer to a structure that contains the codewords for the - * main, length, and aligned offset Huffman codes. + * @out: The bitstream to write the match to. + * @block_type: The type of the LZX block (LZX_BLOCKTYPE_ALIGNED or LZX_BLOCKTYPE_VERBATIM) + * @match: The match. + * @codes: Pointer to a structure that contains the codewords for the + * main, length, and aligned offset Huffman codes. */ -static int +static void lzx_write_match(struct output_bitstream *out, int block_type, - u32 match, const struct lzx_codes *codes) + struct lzx_match match, const struct lzx_codes *codes) { /* low 8 bits are the match length minus 2 */ - unsigned match_len_minus_2 = match & 0xff; + unsigned match_len_minus_2 = match.data & 0xff; /* Next 17 bits are the position footer */ - unsigned position_footer = (match >> 8) & 0x1ffff; /* 17 bits */ + unsigned position_footer = (match.data >> 8) & 0x1ffff; /* 17 bits */ /* Next 6 bits are the position slot. */ - unsigned position_slot = (match >> 25) & 0x3f; /* 6 bits */ + unsigned position_slot = (match.data >> 25) & 0x3f; /* 6 bits */ unsigned len_header; unsigned len_footer; unsigned len_pos_header; @@ -269,7 +528,6 @@ lzx_write_match(struct output_bitstream *out, int block_type, unsigned num_extra_bits; unsigned verbatim_bits; unsigned aligned_bits; - int ret; /* If the match length is less than MIN_MATCH (= 2) + * NUM_PRIMARY_LENS (= 7), the length header contains @@ -301,22 +559,16 @@ lzx_write_match(struct output_bitstream *out, int block_type, main_symbol = len_pos_header + LZX_NUM_CHARS; /* Output main symbol. */ - ret = bitstream_put_bits(out, codes->main_codewords[main_symbol], - codes->main_lens[main_symbol]); - if (ret != 0) - return ret; + bitstream_put_bits(out, codes->codewords.main[main_symbol], + codes->lens.main[main_symbol]); /* If there is a length footer, output it using the * length Huffman code. 
*/ if (len_footer != (unsigned)(-1)) { - ret = bitstream_put_bits(out, codes->len_codewords[len_footer], - codes->len_lens[len_footer]); - if (ret != 0) - return ret; + bitstream_put_bits(out, codes->codewords.len[len_footer], + codes->lens.len[len_footer]); } - wimlib_assert(position_slot < LZX_NUM_POSITION_SLOTS); - num_extra_bits = lzx_get_num_extra_bits(position_slot); /* For aligned offset blocks with at least 3 extra bits, output the @@ -326,109 +578,40 @@ lzx_write_match(struct output_bitstream *out, int block_type, if ((block_type == LZX_BLOCKTYPE_ALIGNED) && (num_extra_bits >= 3)) { verbatim_bits = position_footer >> 3; - ret = bitstream_put_bits(out, verbatim_bits, - num_extra_bits - 3); - if (ret != 0) - return ret; + bitstream_put_bits(out, verbatim_bits, + num_extra_bits - 3); aligned_bits = (position_footer & 7); - ret = bitstream_put_bits(out, - codes->aligned_codewords[aligned_bits], - codes->aligned_lens[aligned_bits]); - if (ret != 0) - return ret; + bitstream_put_bits(out, + codes->codewords.aligned[aligned_bits], + codes->lens.aligned[aligned_bits]); } else { /* verbatim bits is the same as the position * footer, in this case. */ - ret = bitstream_put_bits(out, position_footer, num_extra_bits); - if (ret != 0) - return ret; + bitstream_put_bits(out, position_footer, num_extra_bits); } - return 0; } -/* - * Writes all compressed literals in a block, both matches and literal bytes, to - * the output bitstream. - * - * @out: The output bitstream. - * @block_type: The type of the block (LZX_BLOCKTYPE_ALIGNED or LZX_BLOCKTYPE_VERBATIM) - * @match_tab[]: The array of matches that will be output. It has length - * of @num_compressed_literals. - * @num_compressed_literals: Number of compressed literals to be output. - * @codes: Pointer to a structure that contains the codewords for the - * main, length, and aligned offset Huffman codes. - */ -static int -lzx_write_compressed_literals(struct output_bitstream *ostream, - int block_type, - const u32 match_tab[], - unsigned num_compressed_literals, - const struct lzx_codes *codes) +static unsigned +lzx_build_precode(const u8 lens[restrict], + const u8 prev_lens[restrict], + unsigned num_syms, + freq_t precode_freqs[restrict LZX_PRETREE_NUM_SYMBOLS], + u8 output_syms[restrict num_syms], + u8 precode_lens[restrict LZX_PRETREE_NUM_SYMBOLS], + u16 precode_codewords[restrict LZX_PRETREE_NUM_SYMBOLS], + unsigned * num_additional_bits_ret) { - unsigned i; - u32 match; - int ret; - - for (i = 0; i < num_compressed_literals; i++) { - match = match_tab[i]; - - /* High bit of the match indicates whether the match is an - * actual match (1) or a literal uncompressed byte (0) */ - if (match & 0x80000000) { - /* match */ - ret = lzx_write_match(ostream, block_type, match, - codes); - if (ret != 0) - return ret; - } else { - /* literal byte */ - wimlib_assert(match < LZX_NUM_CHARS); - ret = bitstream_put_bits(ostream, - codes->main_codewords[match], - codes->main_lens[match]); - if (ret != 0) - return ret; - } - } - return 0; -} - -/* - * Writes a compressed Huffman tree to the output, preceded by the pretree for - * it. - * - * The Huffman tree is represented in the output as a series of path lengths - * from which the canonical Huffman code can be reconstructed. The path lengths - * themselves are compressed using a separate Huffman code, the pretree, which - * consists of LZX_PRETREE_NUM_SYMBOLS (= 20) symbols that cover all possible code - * lengths, plus extra codes for repeated lengths. 
The path lengths of the - * pretree precede the path lengths of the larger code and are uncompressed, - * consisting of 20 entries of 4 bits each. - * - * @out: The bitstream for the compressed output. - * @lens: The code lengths for the Huffman tree, indexed by symbol. - * @num_symbols: The number of symbols in the code. - */ -static int -lzx_write_compressed_tree(struct output_bitstream *out, - const u8 lens[], unsigned num_symbols) -{ - /* Frequencies of the length symbols, including the RLE symbols (NOT the - * actual lengths themselves). */ - freq_t pretree_freqs[LZX_PRETREE_NUM_SYMBOLS]; - u8 pretree_lens[LZX_PRETREE_NUM_SYMBOLS]; - u16 pretree_codewords[LZX_PRETREE_NUM_SYMBOLS]; - u8 output_syms[num_symbols * 2]; unsigned output_syms_idx; unsigned cur_run_len; unsigned i; unsigned len_in_run; unsigned additional_bits; signed char delta; - u8 pretree_sym; + unsigned num_additional_bits = 0; - ZERO_ARRAY(pretree_freqs); + memset(precode_freqs, 0, + LZX_PRETREE_NUM_SYMBOLS * sizeof(precode_freqs[0])); /* Since the code word lengths use a form of RLE encoding, the goal here * is to find each run of identical lengths when going through them in @@ -444,9 +627,9 @@ lzx_write_compressed_tree(struct output_bitstream *out, */ output_syms_idx = 0; cur_run_len = 1; - for (i = 1; i <= num_symbols; i++) { + for (i = 1; i <= num_syms; i++) { - if (i != num_symbols && lens[i] == lens[i - 1]) { + if (i != num_syms && lens[i] == lens[i - 1]) { /* Still in a run--- keep going. */ cur_run_len++; continue; @@ -469,7 +652,8 @@ lzx_write_compressed_tree(struct output_bitstream *out, while (cur_run_len >= 20) { additional_bits = min(cur_run_len - 20, 0x1f); - pretree_freqs[18]++; + num_additional_bits += 5; + precode_freqs[18]++; output_syms[output_syms_idx++] = 18; output_syms[output_syms_idx++] = additional_bits; cur_run_len -= 20 + additional_bits; @@ -480,7 +664,8 @@ lzx_write_compressed_tree(struct output_bitstream *out, * follows the magic length. */ while (cur_run_len >= 4) { additional_bits = min(cur_run_len - 4, 0xf); - pretree_freqs[17]++; + num_additional_bits += 4; + precode_freqs[17]++; output_syms[output_syms_idx++] = 17; output_syms[output_syms_idx++] = additional_bits; cur_run_len -= 4 + additional_bits; @@ -494,7 +679,7 @@ lzx_write_compressed_tree(struct output_bitstream *out, * nonzeroes, where n is a literal bit that follows the * magic length, and where the value of the lengths in * the run is given by an extra length symbol, encoded - * with the pretree, that follows the literal bit. + * with the precode, that follows the literal bit. * * The extra length symbol is encoded as a difference * from the length of the codeword for the first symbol @@ -502,11 +687,13 @@ lzx_write_compressed_tree(struct output_bitstream *out, * */ while (cur_run_len >= 4) { additional_bits = (cur_run_len > 4); - delta = -(signed char)len_in_run; + num_additional_bits += 1; + delta = (signed char)prev_lens[i - cur_run_len] - + (signed char)len_in_run; if (delta < 0) delta += 17; - pretree_freqs[19]++; - pretree_freqs[(unsigned char)delta]++; + precode_freqs[19]++; + precode_freqs[(unsigned char)delta]++; output_syms[output_syms_idx++] = 19; output_syms[output_syms_idx++] = additional_bits; output_syms[output_syms_idx++] = delta; @@ -517,41 +704,88 @@ lzx_write_compressed_tree(struct output_bitstream *out, /* Any remaining lengths in the run are outputted without RLE, * as a difference from the length of that codeword in the * previous tree. 
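	 * (For example, a length of 5 whose codeword had length 7 in the
	 * previous tree is sent as the delta (7 - 5) mod 17 = 2, encoded
	 * with the precode.)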
*/ - while (cur_run_len--) { - delta = -(signed char)len_in_run; + while (cur_run_len > 0) { + delta = (signed char)prev_lens[i - cur_run_len] - + (signed char)len_in_run; if (delta < 0) delta += 17; - pretree_freqs[(unsigned char)delta]++; + precode_freqs[(unsigned char)delta]++; output_syms[output_syms_idx++] = delta; + cur_run_len--; } cur_run_len = 1; } - wimlib_assert(output_syms_idx < ARRAY_LEN(output_syms)); - - /* Build the pretree from the frequencies of the length symbols. */ + /* Build the precode from the frequencies of the length symbols. */ make_canonical_huffman_code(LZX_PRETREE_NUM_SYMBOLS, LZX_MAX_CODEWORD_LEN, - pretree_freqs, pretree_lens, - pretree_codewords); + precode_freqs, precode_lens, + precode_codewords); + + if (num_additional_bits_ret) + *num_additional_bits_ret = num_additional_bits; - /* Write the lengths of the pretree codes to the output. */ + return output_syms_idx; +} + +/* + * Writes a compressed Huffman code to the output, preceded by the precode for + * it. + * + * The Huffman code is represented in the output as a series of path lengths + * from which the canonical Huffman code can be reconstructed. The path lengths + * themselves are compressed using a separate Huffman code, the precode, which + * consists of LZX_PRETREE_NUM_SYMBOLS (= 20) symbols that cover all possible + * code lengths, plus extra codes for repeated lengths. The path lengths of the + * precode precede the path lengths of the larger code and are uncompressed, + * consisting of 20 entries of 4 bits each. + * + * @out: Bitstream to write the code to. + * @lens: The code lengths for the Huffman code, indexed by symbol. + * @prev_lens: Code lengths for this Huffman code, indexed by symbol, + * in the *previous block*, or all zeroes if this is the + * first block. + * @num_syms: The number of symbols in the code. + */ +static void +lzx_write_compressed_code(struct output_bitstream *out, + const u8 lens[restrict], + const u8 prev_lens[restrict], + unsigned num_syms) +{ + freq_t precode_freqs[LZX_PRETREE_NUM_SYMBOLS]; + u8 output_syms[num_syms]; + u8 precode_lens[LZX_PRETREE_NUM_SYMBOLS]; + u16 precode_codewords[LZX_PRETREE_NUM_SYMBOLS]; + unsigned i; + unsigned num_output_syms; + u8 precode_sym; + + num_output_syms = lzx_build_precode(lens, + prev_lens, + num_syms, + precode_freqs, + output_syms, + precode_lens, + precode_codewords, + NULL); + + /* Write the lengths of the precode codes to the output. */ for (i = 0; i < LZX_PRETREE_NUM_SYMBOLS; i++) - bitstream_put_bits(out, pretree_lens[i], + bitstream_put_bits(out, precode_lens[i], LZX_PRETREE_ELEMENT_SIZE); - /* Write the length symbols, encoded with the pretree, to the output. */ + /* Write the length symbols, encoded with the precode, to the output. 
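+	 * Symbols 17 and 18 are each followed by extra bits (4 and 5,
+	 * respectively) giving the length of a run of zeroes; symbol 19 is
+	 * followed by 1 extra bit and a precode-encoded length delta for a
+	 * run of identical nonzero lengths.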
*/ - i = 0; - while (i < output_syms_idx) { - pretree_sym = output_syms[i++]; + for (i = 0; i < num_output_syms; ) { + precode_sym = output_syms[i++]; - bitstream_put_bits(out, pretree_codewords[pretree_sym], - pretree_lens[pretree_sym]); - switch (pretree_sym) { + bitstream_put_bits(out, precode_codewords[precode_sym], + precode_lens[precode_sym]); + switch (precode_sym) { case 17: bitstream_put_bits(out, output_syms[i++], 4); break; @@ -561,217 +795,2046 @@ lzx_write_compressed_tree(struct output_bitstream *out, case 19: bitstream_put_bits(out, output_syms[i++], 1); bitstream_put_bits(out, - pretree_codewords[output_syms[i]], - pretree_lens[output_syms[i]]); + precode_codewords[output_syms[i]], + precode_lens[output_syms[i]]); i++; break; default: break; } } - return 0; } -/* Builds the canonical Huffman code for the main tree, the length tree, and the - * aligned offset tree. */ -static void -lzx_make_huffman_codes(const struct lzx_freq_tables *freq_tabs, - struct lzx_codes *codes) +static unsigned +lzx_huffman_code_output_cost(const u8 lens[restrict], + const freq_t freqs[restrict], + unsigned num_syms) { - make_canonical_huffman_code(LZX_MAINTREE_NUM_SYMBOLS, - LZX_MAX_CODEWORD_LEN, - freq_tabs->main_freq_table, - codes->main_lens, - codes->main_codewords); + unsigned cost = 0; - make_canonical_huffman_code(LZX_LENTREE_NUM_SYMBOLS, - LZX_MAX_CODEWORD_LEN, - freq_tabs->len_freq_table, - codes->len_lens, - codes->len_codewords); + for (unsigned i = 0; i < num_syms; i++) + cost += (unsigned)lens[i] * (unsigned)freqs[i]; - make_canonical_huffman_code(LZX_ALIGNEDTREE_NUM_SYMBOLS, 8, - freq_tabs->aligned_freq_table, - codes->aligned_lens, - codes->aligned_codewords); + return cost; } -static void -do_call_insn_translation(u32 *call_insn_target, int input_pos, - s32 file_size) +/* Return the number of bits required to output the lengths for the specified + * Huffman code in compressed format (encoded with a precode). */ +static unsigned +lzx_code_cost(const u8 lens[], const u8 prev_lens[], unsigned num_syms) { - s32 abs_offset; - s32 rel_offset; + u8 output_syms[num_syms]; + freq_t precode_freqs[LZX_PRETREE_NUM_SYMBOLS]; + u8 precode_lens[LZX_PRETREE_NUM_SYMBOLS]; + u16 precode_codewords[LZX_PRETREE_NUM_SYMBOLS]; + unsigned cost = 0; + unsigned num_additional_bits; - rel_offset = le32_to_cpu(*call_insn_target); - if (rel_offset >= -input_pos && rel_offset < file_size) { - if (rel_offset < file_size - input_pos) { - /* "good translation" */ - abs_offset = rel_offset + input_pos; + /* Account for the lengths of the precode itself. */ + cost += LZX_PRETREE_NUM_SYMBOLS * LZX_PRETREE_ELEMENT_SIZE; + + lzx_build_precode(lens, prev_lens, num_syms, + precode_freqs, output_syms, + precode_lens, precode_codewords, + &num_additional_bits); + + /* Account for all precode symbols output. */ + cost += lzx_huffman_code_output_cost(precode_lens, precode_freqs, + LZX_PRETREE_NUM_SYMBOLS); + + /* Account for additional bits. */ + cost += num_additional_bits; + + return cost; +} + +/* + * Writes all compressed matches and literal bytes in a LZX block to the + * output bitstream. + * + * @ostream: The output bitstream. + * @block_type: The type of the block (LZX_BLOCKTYPE_ALIGNED or LZX_BLOCKTYPE_VERBATIM) + * @match_tab[]: The array of matches that will be output. It has length + * of @match_count. + * @match_count: Number of matches and literals to be output. + * @codes: Pointer to a structure that contains the codewords for the + * main, length, and aligned offset Huffman codes.
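
For reference, the payload that follows each of the three special precode symbols, as handled by the switch above, can be summarized in a small sketch (hypothetical helper, not part of the patch):

        /* Payload bits following each special precode symbol; symbols
         * 0..16 are plain length deltas with no payload (sketch). */
        static unsigned
        precode_extra_bits(unsigned precode_sym)
        {
                switch (precode_sym) {
                case 17: return 4;      /* run of 4..19 zero lengths */
                case 18: return 5;      /* run of 20..51 zero lengths */
                case 19: return 1;      /* run of 4..5 equal lengths, then a
                                           delta symbol */
                default: return 0;
                }
        }
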
+ */ +static void +lzx_write_matches_and_literals(struct output_bitstream *ostream, + int block_type, + const struct lzx_match match_tab[], + unsigned match_count, + const struct lzx_codes *codes) +{ + for (unsigned i = 0; i < match_count; i++) { + struct lzx_match match = match_tab[i]; + + /* High bit of the match indicates whether the match is an + * actual match (1) or a literal uncompressed byte (0) */ + if (match.data & 0x80000000) { + /* match */ + lzx_write_match(ostream, block_type, + match, codes); } else { - /* "compensating translation" */ - abs_offset = rel_offset - file_size; + /* literal byte */ + bitstream_put_bits(ostream, + codes->codewords.main[match.data], + codes->lens.main[match.data]); + } + } +} + + +static void +lzx_assert_codes_valid(const struct lzx_codes * codes) +{ +#ifdef ENABLE_LZX_DEBUG + unsigned i; + + for (i = 0; i < LZX_MAINTREE_NUM_SYMBOLS; i++) + LZX_ASSERT(codes->lens.main[i] <= LZX_MAX_CODEWORD_LEN); + + for (i = 0; i < LZX_LENTREE_NUM_SYMBOLS; i++) + LZX_ASSERT(codes->lens.len[i] <= LZX_MAX_CODEWORD_LEN); + + for (i = 0; i < LZX_ALIGNEDTREE_NUM_SYMBOLS; i++) + LZX_ASSERT(codes->lens.aligned[i] <= 8); + + const unsigned tablebits = 10; + u16 decode_table[(1 << tablebits) + + (2 * max(LZX_MAINTREE_NUM_SYMBOLS, LZX_LENTREE_NUM_SYMBOLS))] + _aligned_attribute(DECODE_TABLE_ALIGNMENT); + LZX_ASSERT(0 == make_huffman_decode_table(decode_table, + LZX_MAINTREE_NUM_SYMBOLS, + tablebits, + codes->lens.main, + LZX_MAX_CODEWORD_LEN)); + LZX_ASSERT(0 == make_huffman_decode_table(decode_table, + LZX_LENTREE_NUM_SYMBOLS, + tablebits, + codes->lens.len, + LZX_MAX_CODEWORD_LEN)); + LZX_ASSERT(0 == make_huffman_decode_table(decode_table, + LZX_ALIGNEDTREE_NUM_SYMBOLS, + min(tablebits, 6), + codes->lens.aligned, + 8)); +#endif /* ENABLE_LZX_DEBUG */ +} + +/* Write a LZX aligned offset or verbatim block to the output. */ +static void +lzx_write_compressed_block(int block_type, + unsigned block_size, + struct lzx_match * chosen_matches, + unsigned num_chosen_matches, + const struct lzx_codes * codes, + const struct lzx_codes * prev_codes, + struct output_bitstream * ostream) +{ + unsigned i; + + LZX_ASSERT(block_type == LZX_BLOCKTYPE_ALIGNED || + block_type == LZX_BLOCKTYPE_VERBATIM); + LZX_ASSERT(block_size <= LZX_MAX_WINDOW_SIZE); + LZX_ASSERT(num_chosen_matches <= LZX_MAX_WINDOW_SIZE); + lzx_assert_codes_valid(codes); + + /* The first three bits indicate the type of block and are one of the + * LZX_BLOCKTYPE_* constants. */ + bitstream_put_bits(ostream, block_type, LZX_BLOCKTYPE_NBITS); + + /* The next bit indicates whether the block size is the default (32768), + * indicated by a 1 bit, or whether the block size is given by the next + * 16 bits, indicated by a 0 bit. */ + if (block_size == LZX_DEFAULT_BLOCK_SIZE) { + bitstream_put_bits(ostream, 1, 1); + } else { + bitstream_put_bits(ostream, 0, 1); + bitstream_put_bits(ostream, block_size, LZX_BLOCKSIZE_NBITS); + } + + /* Write out lengths of the main code. Note that the LZX specification + * incorrectly states that the aligned offset code comes after the + * length code, but in fact it is the very first tree to be written + * (before the main code). */ + if (block_type == LZX_BLOCKTYPE_ALIGNED) + for (i = 0; i < LZX_ALIGNEDTREE_NUM_SYMBOLS; i++) + bitstream_put_bits(ostream, codes->lens.aligned[i], + LZX_ALIGNEDTREE_ELEMENT_SIZE); + + LZX_DEBUG("Writing main code..."); + + /* Write the pre-tree and lengths for the first LZX_NUM_CHARS symbols in + * the main code, which are the codewords for literal bytes. 
*/ + lzx_write_compressed_code(ostream, + codes->lens.main, + prev_codes->lens.main, + LZX_NUM_CHARS); + + /* Write the pre-tree and lengths for the rest of the main code, which + * are the codewords for match headers. */ + lzx_write_compressed_code(ostream, + codes->lens.main + LZX_NUM_CHARS, + prev_codes->lens.main + LZX_NUM_CHARS, + LZX_MAINTREE_NUM_SYMBOLS - LZX_NUM_CHARS); + + LZX_DEBUG("Writing length code..."); + + /* Write the pre-tree and lengths for the length code. */ + lzx_write_compressed_code(ostream, + codes->lens.len, + prev_codes->lens.len, + LZX_LENTREE_NUM_SYMBOLS); + + LZX_DEBUG("Writing matches and literals..."); + + /* Write the actual matches and literals. */ + lzx_write_matches_and_literals(ostream, block_type, + chosen_matches, num_chosen_matches, + codes); + + LZX_DEBUG("Done writing block."); +} + +/* Write the LZX block of index @block_number, or recurse to its children + * recursively if it is a split block. + * + * Return a pointer to the Huffman codes for the last block written. + */ +static struct lzx_codes * +lzx_write_block_recursive(struct lzx_compressor *ctx, + unsigned block_number, + struct lzx_codes * prev_codes, + struct output_bitstream *ostream) +{ + struct lzx_block_spec *spec = &ctx->block_specs[block_number - 1]; + + if (spec->is_split) { + prev_codes = lzx_write_block_recursive(ctx, block_number * 2 + 0, + prev_codes, ostream); + prev_codes = lzx_write_block_recursive(ctx, block_number * 2 + 1, + prev_codes, ostream); + } else { + LZX_DEBUG("Writing block #%u (type=%d, size=%u, num_chosen_matches=%u)...", + block_number, spec->block_type, spec->block_size, + spec->num_chosen_matches); + lzx_write_compressed_block(spec->block_type, + spec->block_size, + &ctx->chosen_matches[spec->chosen_matches_start_pos], + spec->num_chosen_matches, + &spec->codes, + prev_codes, + ostream); + prev_codes = &spec->codes; + } + return prev_codes; +} + +/* Write out the LZX blocks that were computed. */ +static void +lzx_write_all_blocks(struct lzx_compressor *ctx, struct output_bitstream *ostream) +{ + lzx_write_block_recursive(ctx, 1, &ctx->zero_codes, ostream); +} + +static u32 +lzx_record_literal(u8 literal, void *_freqs) +{ + struct lzx_freqs *freqs = _freqs; + + freqs->main[literal]++; + + return (u32)literal; +} + +/* Constructs a match from an offset and a length, and updates the LRU queue and + * the frequency of symbols in the main, length, and aligned offset alphabets. + * The return value is a 32-bit number that provides the match in an + * intermediate representation documented below. */ +static u32 +lzx_record_match(unsigned match_offset, unsigned match_len, + void *_freqs, void *_queue) +{ + struct lzx_freqs *freqs = _freqs; + struct lzx_lru_queue *queue = _queue; + unsigned position_slot; + unsigned position_footer = 0; + u32 len_header; + u32 len_pos_header; + unsigned len_footer; + unsigned adjusted_match_len; + + LZX_ASSERT(match_len >= LZX_MIN_MATCH && match_len <= LZX_MAX_MATCH); + + /* If possible, encode this offset as a repeated offset. */ + if (match_offset == queue->R0) { + position_slot = 0; + } else if (match_offset == queue->R1) { + swap(queue->R0, queue->R1); + position_slot = 1; + } else if (match_offset == queue->R2) { + swap(queue->R0, queue->R2); + position_slot = 2; + } else { + /* Not a repeated offset. */ + + /* offsets of 0, 1, and 2 are reserved for the repeated offset + * codes, so non-repeated offsets must be encoded as 3+. The + * minimum offset is 1, so encode the offsets offset by 2. 
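
The repeated-offset test above can be read in isolation as a sketch (hypothetical helper; swap() and struct lzx_lru_queue as used in the patch):

        /* Try to encode @offset as one of the three most recent offsets,
         * updating the queue exactly as the code above does. Returns
         * position slot 0..2 on a repeat hit, or 3 to mean "not repeated",
         * in which case the caller pushes the offset onto the queue and
         * derives the real slot from the formatted offset (sketch). */
        static unsigned
        try_repeat_offset(struct lzx_lru_queue *queue, unsigned offset)
        {
                if (offset == queue->R0)
                        return 0;
                if (offset == queue->R1) {
                        swap(queue->R0, queue->R1);
                        return 1;
                }
                if (offset == queue->R2) {
                        swap(queue->R0, queue->R2);
                        return 2;
                }
                return 3;
        }
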
*/ + unsigned formatted_offset = match_offset + 2; + + queue->R2 = queue->R1; + queue->R1 = queue->R0; + queue->R0 = match_offset; + + /* The (now-formatted) offset will actually be encoded as a + * small position slot number that maps to a certain hard-coded + * offset (position base), followed by a number of extra bits--- + * the position footer--- that are added to the position base to + * get the original formatted offset. */ + + position_slot = lzx_get_position_slot(formatted_offset); + position_footer = formatted_offset & + ((1 << lzx_get_num_extra_bits(position_slot)) - 1); + } + + adjusted_match_len = match_len - LZX_MIN_MATCH; + + + /* The match length must be at least 2, so let the adjusted match length + * be the match length minus 2. + * + * If it is less than 7, the adjusted match length is encoded as a 3-bit + * number offset by 2. Otherwise, the 3-bit length header is all 1's + * and the actual adjusted length is given as a symbol encoded with the + * length tree, offset by 7. + */ + if (adjusted_match_len < LZX_NUM_PRIMARY_LENS) { + len_header = adjusted_match_len; + } else { + len_header = LZX_NUM_PRIMARY_LENS; + len_footer = adjusted_match_len - LZX_NUM_PRIMARY_LENS; + freqs->len[len_footer]++; + } + len_pos_header = (position_slot << 3) | len_header; + + freqs->main[len_pos_header + LZX_NUM_CHARS]++; + + /* Equivalent to: + * if (lzx_extra_bits[position_slot] >= 3) */ + if (position_slot >= 8) + freqs->aligned[position_footer & 7]++; + + /* Pack the position slot, position footer, and match length into an + * intermediate representation. + * + * bits description + * ---- ----------------------------------------------------------- + * + * 31 1 if a match, 0 if a literal. + * + * 30-25 position slot. This can be at most 50, so it will fit in 6 + * bits. + * + * 8-24 position footer. This is the offset of the real formatted + * offset from the position base. This can be at most 17 bits + * (since lzx_extra_bits[LZX_NUM_POSITION_SLOTS - 1] is 17). + * + * 0-7 length of match, offset by 2. This can be at most + * (LZX_MAX_MATCH - 2) == 255, so it will fit in 8 bits. */ + return 0x80000000 | + (position_slot << 25) | + (position_footer << 8) | + (adjusted_match_len); +} + +/* Set the cost model @ctx->costs from the Huffman codeword lengths specified in + * @lens. + * + * These are basically the same thing, except that Huffman codewords with length + * 0 corresponds to symbols with zero frequency. These need to be assigned + * actual costs. The specific values assigned are arbitrary, but they should be + * fairly high (near the maximum codeword length) to take into account the fact + * that uses of these symbols are expected to be rare. 
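
Decoding the intermediate representation tabulated above is a matter of shifts and masks; a minimal sketch (hypothetical helper, not part of the patch):

        /* Unpack the 32-bit intermediate match representation; bit 31 must
         * be set (a match, not a literal). */
        static void
        lzx_unpack_match(u32 data, unsigned *position_slot,
                         unsigned *position_footer, unsigned *adjusted_len)
        {
                *position_slot   = (data >> 25) & 0x3f;   /* 6 bits, <= 50 */
                *position_footer = (data >> 8) & 0x1ffff; /* up to 17 bits */
                *adjusted_len    = data & 0xff;           /* match length - 2 */
        }
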
+ */ +static void +lzx_set_costs(struct lzx_compressor * ctx, const struct lzx_lens * lens) +{ + unsigned i; + + memcpy(&ctx->costs, lens, sizeof(struct lzx_lens)); + + for (i = 0; i < LZX_MAINTREE_NUM_SYMBOLS; i++) + if (ctx->costs.main[i] == 0) + ctx->costs.main[i] = ctx->params.slow.main_nostat_cost; + + for (i = 0; i < LZX_LENTREE_NUM_SYMBOLS; i++) + if (ctx->costs.len[i] == 0) + ctx->costs.len[i] = ctx->params.slow.len_nostat_cost; + + for (i = 0; i < LZX_ALIGNEDTREE_NUM_SYMBOLS; i++) + if (ctx->costs.aligned[i] == 0) + ctx->costs.aligned[i] = ctx->params.slow.aligned_nostat_cost; +} + +static u32 +lzx_literal_cost(u8 c, const struct lzx_lens * costs) +{ + return costs->main[c]; +} + +/* Given a (length, offset) pair that could be turned into a valid LZX match as + * well as costs for the codewords in the main, length, and aligned Huffman + * codes, return the approximate number of bits it will take to represent this + * match in the compressed output. */ +static unsigned +lzx_match_cost(unsigned length, unsigned offset, const struct lzx_lens *costs + +#if LZX_PARAM_ACCOUNT_FOR_LRU + , struct lzx_lru_queue *queue +#endif + ) +{ + unsigned position_slot, len_header, main_symbol; + unsigned cost = 0; + + /* Calculate position slot and length header, then combine them into the + * main symbol. */ + +#if LZX_PARAM_ACCOUNT_FOR_LRU + if (offset == queue->R0) { + position_slot = 0; + } else if (offset == queue->R1) { + swap(queue->R0, queue->R1); + position_slot = 1; + } else if (offset == queue->R2) { + swap(queue->R0, queue->R2); + position_slot = 2; + } else +#endif + position_slot = lzx_get_position_slot(offset + 2); + + len_header = min(length - LZX_MIN_MATCH, LZX_NUM_PRIMARY_LENS); + main_symbol = ((position_slot << 3) | len_header) + LZX_NUM_CHARS; + + /* Account for main symbol. */ + cost += costs->main[main_symbol]; + + /* Account for extra position information. */ + unsigned num_extra_bits = lzx_get_num_extra_bits(position_slot); + if (num_extra_bits >= 3) { + cost += num_extra_bits - 3; + cost += costs->aligned[(offset + LZX_MIN_MATCH) & 7]; + } else { + cost += num_extra_bits; + } + + /* Account for extra length information. */ + if (length - LZX_MIN_MATCH >= LZX_NUM_PRIMARY_LENS) + cost += costs->len[length - LZX_MIN_MATCH - LZX_NUM_PRIMARY_LENS]; + + return cost; +} + +/* This procedure effectively creates a new binary tree corresponding to the + * current string at the same time that it searches the existing tree nodes for + * matches. 
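
The match-finder below keeps, for every window position, two u16 links in child_tab; a sketch of the implied accessors (hypothetical helpers, not patch code):

        /* Links for the binary search trees used by the match-finder:
         * window position @pos owns two u16 slots in child_tab (sketch). */
        static inline u16 *
        lt_child(u16 child_tab[], unsigned pos)
        {
                return &child_tab[pos * 2];     /* lexicographically lesser subtree */
        }

        static inline u16 *
        gt_child(u16 child_tab[], unsigned pos)
        {
                return &child_tab[pos * 2 + 1]; /* lexicographically greater subtree */
        }
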
*/ +static unsigned +lzx_lz_get_matches(const u8 window[restrict], + const unsigned bytes_remaining, + const unsigned strstart, + const unsigned max_length, + u16 child_tab[restrict], + unsigned cur_match, + const unsigned prev_len, + struct raw_match * const matches) +{ + u16 *new_tree_lt_ptr = &child_tab[strstart * 2]; + u16 *new_tree_gt_ptr = &child_tab[strstart * 2 + 1]; + + u16 longest_lt_match_len = 0; + u16 longest_gt_match_len = 0; + + /* Maximum number of nodes to walk down before stopping */ + unsigned depth = max_length; + + /* Length of longest match found so far */ + unsigned longest_match_len = prev_len; + + /* Maximum length of match to return */ + unsigned len_limit = min(bytes_remaining, max_length); + + /* Number of matches found so far */ + unsigned num_matches = 0; + + for (;;) { + + /* Stop if too many nodes were traversed or if there is no next + * node */ + if (depth-- == 0 || cur_match == 0) { + *new_tree_gt_ptr = 0; + *new_tree_lt_ptr = 0; + return num_matches; + } + + /* Load the pointers to the children of the binary tree node + * corresponding to the current match */ + u16 * const cur_match_ptrs = &child_tab[cur_match * 2]; + + /* Set up pointers to the current match and to the current + * string */ + const u8 * const matchptr = &window[cur_match]; + const u8 * const strptr = &window[strstart]; + + u16 len = min(longest_lt_match_len, + longest_gt_match_len); + + if (matchptr[len] == strptr[len]) { + while (++len != len_limit) + if (matchptr[len] != strptr[len]) + break; + + if (len > longest_match_len) { + longest_match_len = len; + matches[num_matches].len = len; + matches[num_matches].offset = strstart - cur_match; + num_matches++; + + if (len == len_limit) { + /* Length limit was reached. Link left pointer + * in the new tree with left subtree of current + * match tree, and link the right pointer in the + * new tree with the right subtree of the + * current match tree. This in effect deletes + * the node for the currrent match, which is + * desirable because the current match is the + * same as the current string up until the + * length limit, so in subsequent queries it + * will never be preferable to the current + * position. */ + *new_tree_lt_ptr = cur_match_ptrs[0]; + *new_tree_gt_ptr = cur_match_ptrs[1]; + return num_matches; + } + } + } + + if (matchptr[len] < strptr[len]) { + /* Case 1: The current match is lexicographically less + * than the current string. + * + * Since we are searching the binary tree structures, we + * need to walk down to the *right* subtree of the + * current match's node to get to a match that is + * lexicographically *greater* than the current match + * but still lexicographically *lesser* than the current + * string. + * + * At the same time, we link the entire binary tree + * corresponding to the current match into the + * appropriate place in the new binary tree being built + * for the current string. */ + *new_tree_lt_ptr = cur_match; + new_tree_lt_ptr = &cur_match_ptrs[1]; + cur_match = *new_tree_lt_ptr; + longest_lt_match_len = len; + } else { + /* Case 2: The current match is lexicographically + * greater than the current string. + * + * This is analogous to Case 1 above, but everything + * happens in the other direction. + */ + *new_tree_gt_ptr = cur_match; + new_tree_gt_ptr = &cur_match_ptrs[0]; + cur_match = *new_tree_gt_ptr; + longest_gt_match_len = len; + } + } +} + +/* Equivalent to lzx_lz_get_matches(), but only updates the tree and doesn't + * return matches. See that function for details (including comments). 
*/ +static void +lzx_lz_skip_matches(const u8 window[restrict], + const unsigned bytes_remaining, + const unsigned strstart, + const unsigned max_length, + u16 child_tab[restrict], + unsigned cur_match, + const unsigned prev_len) +{ + u16 *new_tree_lt_ptr = &child_tab[strstart * 2]; + u16 *new_tree_gt_ptr = &child_tab[strstart * 2 + 1]; + + u16 longest_lt_match_len = 0; + u16 longest_gt_match_len = 0; + + unsigned depth = max_length; + + unsigned longest_match_len = prev_len; + + unsigned len_limit = min(bytes_remaining, max_length); + + for (;;) { + if (depth-- == 0 || cur_match == 0) { + *new_tree_gt_ptr = 0; + *new_tree_lt_ptr = 0; + return; + } + + u16 * const cur_match_ptrs = &child_tab[cur_match * 2]; + + const u8 * const matchptr = &window[cur_match]; + const u8 * const strptr = &window[strstart]; + + u16 len = min(longest_lt_match_len, + longest_gt_match_len); + + if (matchptr[len] == strptr[len]) { + while (++len != len_limit) + if (matchptr[len] != strptr[len]) + break; + + if (len > longest_match_len) { + longest_match_len = len; + + if (len == len_limit) { + *new_tree_lt_ptr = cur_match_ptrs[0]; + *new_tree_gt_ptr = cur_match_ptrs[1]; + return; + } + } + } + + if (matchptr[len] < strptr[len]) { + *new_tree_lt_ptr = cur_match; + new_tree_lt_ptr = &cur_match_ptrs[1]; + cur_match = *new_tree_lt_ptr; + longest_lt_match_len = len; + } else { + *new_tree_gt_ptr = cur_match; + new_tree_gt_ptr = &cur_match_ptrs[0]; + cur_match = *new_tree_gt_ptr; + longest_gt_match_len = len; + } + } +} + +static unsigned +lzx_lz_get_matches_caching(struct lzx_compressor *ctx, + struct raw_match **matches_ret); + +/* Tell the match-finder to skip the specified number of bytes (@n) in the + * input. */ +static void +lzx_lz_skip_bytes(struct lzx_compressor *ctx, unsigned n) +{ + +#if LZX_PARAM_DONT_SKIP_MATCHES + /* Option 1: Still cache the matches from the positions skipped. They + * will then be available in later passes. */ + struct raw_match *matches; + while (n--) + lzx_lz_get_matches_caching(ctx, &matches); +#else + /* Option 2: Simply mark the positions skipped as having no matches + * available. */ + LZX_ASSERT(n <= ctx->match_window_end - ctx->match_window_pos); + if (ctx->matches_already_found) { + while (n--) { + LZX_ASSERT(ctx->cached_matches[ctx->cached_matches_pos].offset == + ctx->match_window_pos); + ctx->cached_matches_pos += ctx->cached_matches[ctx->cached_matches_pos].len + 1; + ctx->match_window_pos++; + } + } else { + while (n--) { + if (ctx->params.slow.use_len2_matches && + ctx->match_window_end - ctx->match_window_pos >= 2) { + unsigned c1 = ctx->window[ctx->match_window_pos]; + unsigned c2 = ctx->window[ctx->match_window_pos + 1]; + unsigned digram = c1 | (c2 << 8); + ctx->digram_tab[digram] = ctx->match_window_pos; + } + if (ctx->match_window_end - ctx->match_window_pos >= 3) { + unsigned hash; + unsigned cur_match; + + hash = lzx_lz_compute_hash(&ctx->window[ctx->match_window_pos]); + + cur_match = ctx->hash_tab[hash]; + ctx->hash_tab[hash] = ctx->match_window_pos; + + lzx_lz_skip_matches(ctx->window, + ctx->match_window_end - ctx->match_window_pos, + ctx->match_window_pos, + ctx->params.slow.num_fast_bytes, + ctx->child_tab, + cur_match, 1); + } + ctx->cached_matches[ctx->cached_matches_pos].len = 0; + ctx->cached_matches[ctx->cached_matches_pos].offset = ctx->match_window_pos; + ctx->cached_matches_pos++; + ctx->match_window_pos++; + } + } +#endif /* !LZX_PARAM_DONT_SKIP_MATCHES */ +} + +/* Retrieve a list of matches available at the next position in the input. 
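
Both the skipping code above and the match retrieval below rely on lzx_lz_compute_hash(), which is defined elsewhere in the file. A typical 3-byte multiplicative hash of the kind assumed here looks like the following sketch (the multiplier and the hash_bits parameter are illustrative assumptions, not the patch's actual constants):

        /* Hypothetical 3-byte position hash in the style the code assumes. */
        static inline unsigned
        hash3(const u8 *p, unsigned hash_bits)
        {
                u32 seq = (u32)p[0] | ((u32)p[1] << 8) | ((u32)p[2] << 16);

                /* multiplicative (golden-ratio) hash */
                return (seq * 0x9E3779B1u) >> (32 - hash_bits);
        }
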
+ * + * The return value is the number of matches found, and a pointer to them is + * written to @matches_ret. The matches will be sorted in order by length. + * + * This is essentially a wrapper around lzx_lz_get_matches() that caches its + * output the first time and also performs the needed hashing. + */ +static unsigned +lzx_lz_get_matches_caching(struct lzx_compressor *ctx, + struct raw_match **matches_ret) +{ + unsigned num_matches; + struct raw_match *matches; + + LZX_ASSERT(ctx->match_window_end >= ctx->match_window_pos); + + matches = &ctx->cached_matches[ctx->cached_matches_pos + 1]; + + if (ctx->matches_already_found) { + num_matches = ctx->cached_matches[ctx->cached_matches_pos].len; + LZX_ASSERT(ctx->cached_matches[ctx->cached_matches_pos].offset == ctx->match_window_pos); + + for (int i = (int)num_matches - 1; i >= 0; i--) { + if (ctx->match_window_pos + matches[i].len > ctx->match_window_end) + matches[i].len = ctx->match_window_end - ctx->match_window_pos; + else + break; + } + } else { + unsigned prev_len = 1; + struct raw_match * matches_ret = &ctx->cached_matches[ctx->cached_matches_pos + 1]; + num_matches = 0; + + if (ctx->params.slow.use_len2_matches && + ctx->match_window_end - ctx->match_window_pos >= 3) { + unsigned c1 = ctx->window[ctx->match_window_pos]; + unsigned c2 = ctx->window[ctx->match_window_pos + 1]; + unsigned digram = c1 | (c2 << 8); + unsigned cur_match; + + cur_match = ctx->digram_tab[digram]; + ctx->digram_tab[digram] = ctx->match_window_pos; + if (cur_match != 0 && + ctx->window[cur_match + 2] != ctx->window[ctx->match_window_pos + 2]) + { + matches_ret->len = 2; + matches_ret->offset = ctx->match_window_pos - cur_match; + matches_ret++; + num_matches++; + prev_len = 2; + } + } + if (ctx->match_window_end - ctx->match_window_pos >= 3) { + unsigned hash; + unsigned cur_match; + + hash = lzx_lz_compute_hash(&ctx->window[ctx->match_window_pos]); + + cur_match = ctx->hash_tab[hash]; + ctx->hash_tab[hash] = ctx->match_window_pos; + num_matches += lzx_lz_get_matches(ctx->window, + ctx->match_window_end - ctx->match_window_pos, + ctx->match_window_pos, + ctx->params.slow.num_fast_bytes, + ctx->child_tab, + cur_match, + prev_len, + matches_ret); + } + + ctx->cached_matches[ctx->cached_matches_pos].len = num_matches; + ctx->cached_matches[ctx->cached_matches_pos].offset = ctx->match_window_pos; + + if (num_matches) { + struct raw_match *longest_match_ptr = + &ctx->cached_matches[ctx->cached_matches_pos + 1 + + num_matches - 1]; + u16 len = longest_match_ptr->len; + + /* If the longest match returned by the match-finder + * reached the number of fast bytes, extend it as much + * as possible. 
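
The cache layout used above is a header entry whose .len field holds the match count and whose .offset field records the window position, followed by that many match entries; a sketch of reading one record (hypothetical helper):

        /* Read one record from the match cache (sketch of the layout above). */
        static unsigned
        cache_read(const struct raw_match cache[], unsigned pos,
                   const struct raw_match **matches_ret)
        {
                unsigned num_matches = cache[pos].len; /* header: match count */

                *matches_ret = &cache[pos + 1];  /* matches, sorted by length */
                return num_matches;              /* caller advances @pos by
                                                    num_matches + 1 */
        }
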
*/ + if (len == ctx->params.slow.num_fast_bytes) { + const unsigned maxlen = + min(ctx->match_window_end - ctx->match_window_pos, + LZX_MAX_MATCH); + + const u8 * const matchptr = + &ctx->window[ctx->match_window_pos - longest_match_ptr->offset]; + + const u8 * const strptr = + &ctx->window[ctx->match_window_pos]; + + while (len < maxlen && matchptr[len] == strptr[len]) + len++; + } + longest_match_ptr->len = len; + } + } + ctx->cached_matches_pos += num_matches + 1; + *matches_ret = matches; + +#if 0 + printf("\n"); + for (unsigned i = 0; i < num_matches; i++) + { + printf("Len %u Offset %u\n", matches[i].len, matches[i].offset); + } +#endif + + for (unsigned i = 0; i < num_matches; i++) { + LZX_ASSERT(matches[i].len <= LZX_MAX_MATCH); + if (matches[i].len >= LZX_MIN_MATCH) { + LZX_ASSERT(matches[i].offset <= ctx->match_window_pos); + LZX_ASSERT(matches[i].len <= ctx->match_window_end - ctx->match_window_pos); + LZX_ASSERT(!memcmp(&ctx->window[ctx->match_window_pos], + &ctx->window[ctx->match_window_pos - matches[i].offset], + matches[i].len)); + } + } + + ctx->match_window_pos++; + return num_matches; +} + +/* + * Reverse the linked list of near-optimal matches so that they can be returned + * in forwards order. + * + * Returns the first match in the list. + */ +static struct raw_match +lzx_lz_reverse_near_optimal_match_list(struct lzx_compressor *ctx, + unsigned cur_pos) +{ + unsigned prev_link, saved_prev_link; + unsigned prev_match_offset, saved_prev_match_offset; + + ctx->optimum_end_idx = cur_pos; + + saved_prev_link = ctx->optimum[cur_pos].prev.link; + saved_prev_match_offset = ctx->optimum[cur_pos].prev.match_offset; + + do { + prev_link = saved_prev_link; + prev_match_offset = saved_prev_match_offset; + + saved_prev_link = ctx->optimum[prev_link].prev.link; + saved_prev_match_offset = ctx->optimum[prev_link].prev.match_offset; + + ctx->optimum[prev_link].next.link = cur_pos; + ctx->optimum[prev_link].next.match_offset = prev_match_offset; + + cur_pos = prev_link; + } while (cur_pos != 0); + + ctx->optimum_cur_idx = ctx->optimum[0].next.link; + + return (struct raw_match) + { .len = ctx->optimum_cur_idx, + .offset = ctx->optimum[0].next.match_offset, + }; +} + +/* + * lzx_lz_get_near_optimal_match() - + * + * Choose the "best" match or literal to use at the next position in the input. + * + * Unlike a "greedy" parser that always takes the longest match, or even a + * parser with one match/literal look-ahead like zlib, the algorithm used here + * may look ahead many matches/literals to determine the best match/literal to + * output next. The motivation is that the compression ratio is improved if the + * compressor can do things like use a shorter-than-possible match in order to + * allow a longer match later, and also take into account the Huffman code cost + * model rather than simply assuming that longer is better. It is not a true + * "optimal" parser, however, since some shortcuts can be taken; for example, if + * a match is very long, the parser just chooses it immediately before too much + * time is wasting considering many different alternatives that are unlikely to + * be better. + * + * This algorithm is based on that used in 7-Zip's deflate encoder. + * + * Each call to this function does one of two things: + * + * 1. Build a near-optimal sequence of matches/literals, up to some point, that + * will be returned by subsequent calls to this function, then return the + * first one. + * + * OR + * + * 2. 
Return the next match/literal previously computed by a call to this + * function. + * + * This function relies on the following state in the compressor context: + * + * ctx->window (read-only: preprocessed data being compressed) + * ctx->costs (read-only: cost model to use) + * ctx->optimum (internal state; leave uninitialized) + * ctx->optimum_cur_idx (must set to 0 before first call) + * ctx->optimum_end_idx (must set to 0 before first call) + * ctx->hash_tab (must set to 0 before first call) + * ctx->cached_matches (internal state; leave uninitialized) + * ctx->cached_matches_pos (initialize to 0 before first call; save and + * restore value if restarting parse from a + * certain position) + * ctx->match_window_pos (must initialize to position of next match to + * return; subsequent calls return subsequent + * matches) + * ctx->match_window_end (must initialize to limit of match-finding region; + * subsequent calls use the same limit) + * + * The return value is a (length, offset) pair specifying the match or literal + * chosen. + */ +static struct raw_match +lzx_lz_get_near_optimal_match(struct lzx_compressor * ctx) +{ +#if 0 + /* Testing: literals only */ + ctx->match_window_pos++; + return (struct raw_match) { .len = 0 }; +#elif 0 + /* Testing: greedy parsing */ + struct raw_match *matches; + unsigned num_matches; + struct raw_match match = {.len = 0}; + + num_matches = lzx_lz_get_matches_caching(ctx, &matches); + if (num_matches) { + match = matches[num_matches - 1]; + lzx_lz_skip_bytes(ctx, match.len - 1); + } + return match; +#else + unsigned num_possible_matches; + struct raw_match *possible_matches; + struct raw_match match; + unsigned longest_match_len; + unsigned len, match_idx; + + if (ctx->optimum_cur_idx != ctx->optimum_end_idx) { + /* Case 2: Return the next match/literal already found. */ + match.len = ctx->optimum[ctx->optimum_cur_idx].next.link - + ctx->optimum_cur_idx; + match.offset = ctx->optimum[ctx->optimum_cur_idx].next.match_offset; + + ctx->optimum_cur_idx = ctx->optimum[ctx->optimum_cur_idx].next.link; + return match; + } + + /* Case 1: Compute a new list of matches/literals to return. */ + + ctx->optimum_cur_idx = 0; + ctx->optimum_end_idx = 0; + + /* Get matches at this position. */ + num_possible_matches = lzx_lz_get_matches_caching(ctx, &possible_matches); + + /* If no matches found, return literal. */ + if (num_possible_matches == 0) + return (struct raw_match){ .len = 0 }; + + /* The matches that were found are sorted by length. Get the length of + * the longest one. */ + longest_match_len = possible_matches[num_possible_matches - 1].len; + + /* Greedy heuristic: if the longest match that was found is greater + * than LZX_PARAM_NUM_FAST_BYTES, return it immediately; don't bother + * doing more work. */ + if (longest_match_len > ctx->params.slow.num_fast_bytes) { + lzx_lz_skip_bytes(ctx, longest_match_len - 1); + return possible_matches[num_possible_matches - 1]; + } + + /* Calculate the cost to reach the next position by outputting a + * literal. */ +#if LZX_PARAM_ACCOUNT_FOR_LRU + ctx->optimum[0].queue = ctx->queue; + ctx->optimum[1].queue = ctx->optimum[0].queue; +#endif + ctx->optimum[1].cost = lzx_literal_cost(ctx->window[ctx->match_window_pos], + &ctx->costs); + ctx->optimum[1].prev.link = 0; + + /* Calculate the cost to reach any position up to and including that + * reached by the longest match, using the shortest (i.e. closest) match + * that reaches each position.
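
The loop that follows is, at heart, a shortest-path relaxation over modeled bit costs; schematically (a sketch only: the element type of ctx->optimum[] is named hypothetically here, and the LRU queue bookkeeping is omitted):

        /* Schematic cost relaxation: position @pos plus one edge (a literal
         * or match covering @len bytes) costing @edge_cost modeled bits.
         * Unreached positions hold cost ~(u32)0. */
        static void
        relax_edge(struct lzx_optimal optimum[], unsigned pos,
                   unsigned len, u32 edge_cost)
        {
                u32 new_cost = optimum[pos].cost + edge_cost;

                if (new_cost < optimum[pos + len].cost) {
                        /* found a cheaper path to position pos + len */
                        optimum[pos + len].cost = new_cost;
                        optimum[pos + len].prev.link = pos;
                }
        }
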
*/ + match_idx = 0; + BUILD_BUG_ON(LZX_MIN_MATCH != 2); + for (len = LZX_MIN_MATCH; len <= longest_match_len; len++) { + + LZX_ASSERT(match_idx < num_possible_matches); + + #if LZX_PARAM_ACCOUNT_FOR_LRU + ctx->optimum[len].queue = ctx->optimum[0].queue; + #endif + ctx->optimum[len].prev.link = 0; + ctx->optimum[len].prev.match_offset = possible_matches[match_idx].offset; + ctx->optimum[len].cost = lzx_match_cost(len, + possible_matches[match_idx].offset, + &ctx->costs + #if LZX_PARAM_ACCOUNT_FOR_LRU + , &ctx->optimum[len].queue + #endif + ); + if (len == possible_matches[match_idx].len) + match_idx++; + } + + unsigned cur_pos = 0; + + /* len_end: greatest index forward at which costs have been calculated + * so far */ + unsigned len_end = longest_match_len; + + + for (;;) { + /* Advance to next position. */ + cur_pos++; + + if (cur_pos == len_end || cur_pos == LZX_PARAM_OPTIM_ARRAY_SIZE) + return lzx_lz_reverse_near_optimal_match_list(ctx, cur_pos); + + /* retrieve the number of matches available at this position */ + num_possible_matches = lzx_lz_get_matches_caching(ctx, + &possible_matches); + + unsigned new_len = 0; + + if (num_possible_matches != 0) { + new_len = possible_matches[num_possible_matches - 1].len; + + /* Greedy heuristic: if we found a match greater than + * LZX_PARAM_NUM_FAST_BYTES, stop immediately. */ + if (new_len > ctx->params.slow.num_fast_bytes) { + + /* Build the list of matches to return and get + * the first one. */ + match = lzx_lz_reverse_near_optimal_match_list(ctx, cur_pos); + + /* Append the long match to the end of the list. */ + ctx->optimum[cur_pos].next.match_offset = + possible_matches[num_possible_matches - 1].offset; + ctx->optimum[cur_pos].next.link = cur_pos + new_len; + ctx->optimum_end_idx = cur_pos + new_len; + + /* Skip over the remaining bytes of the long match. */ + lzx_lz_skip_bytes(ctx, new_len - 1); + + /* Return first match in the list */ + return match; + } + } + + /* Consider proceeding with a literal byte. */ + u32 cur_cost = ctx->optimum[cur_pos].cost; + u32 cur_plus_literal_cost = cur_cost + + lzx_literal_cost(ctx->window[ctx->match_window_pos - 1], + &ctx->costs); + if (cur_plus_literal_cost < ctx->optimum[cur_pos + 1].cost) { + ctx->optimum[cur_pos + 1].cost = cur_plus_literal_cost; + ctx->optimum[cur_pos + 1].prev.link = cur_pos; + #if LZX_PARAM_ACCOUNT_FOR_LRU + ctx->optimum[cur_pos + 1].queue = ctx->optimum[cur_pos].queue; + #endif + } + + if (num_possible_matches == 0) + continue; + + /* Consider proceeding with a match. */ + + while (len_end < cur_pos + new_len) + ctx->optimum[++len_end].cost = ~(u32)0; + + match_idx = 0; + for (len = LZX_MIN_MATCH; len <= new_len; len++) { + LZX_ASSERT(match_idx < num_possible_matches); + #if LZX_PARAM_ACCOUNT_FOR_LRU + struct lzx_lru_queue q = ctx->optimum[cur_pos].queue; + #endif + u32 cost = cur_cost + lzx_match_cost(len, + possible_matches[match_idx].offset, + &ctx->costs + #if LZX_PARAM_ACCOUNT_FOR_LRU + , &q + #endif + ); + + if (cost < ctx->optimum[cur_pos + len].cost) { + ctx->optimum[cur_pos + len].cost = cost; + ctx->optimum[cur_pos + len].prev.link = cur_pos; + ctx->optimum[cur_pos + len].prev.match_offset = + possible_matches[match_idx].offset; + #if LZX_PARAM_ACCOUNT_FOR_LRU + ctx->optimum[cur_pos + len].queue = q; + #endif + } + + if (len == possible_matches[match_idx].len) + match_idx++; } - *call_insn_target = cpu_to_le32(abs_offset); } +#endif } -/* This is the reverse of undo_call_insn_preprocessing() in lzx-decompress.c. 
- * See the comment above that function for more information. */ +/* Account for extra bits in the main symbols. */ static void -do_call_insn_preprocessing(u8 uncompressed_data[], int uncompressed_data_len) +lzx_update_mainsym_match_costs(int block_type, + u8 main_lens[LZX_MAINTREE_NUM_SYMBOLS]) { - for (int i = 0; i < uncompressed_data_len - 10; i++) { - if (uncompressed_data[i] == 0xe8) { - do_call_insn_translation((u32*)&uncompressed_data[i + 1], - i, - LZX_WIM_MAGIC_FILESIZE); - i += 4; + unsigned i; + + LZX_ASSERT(block_type == LZX_BLOCKTYPE_ALIGNED || + block_type == LZX_BLOCKTYPE_VERBATIM); + + for (i = LZX_NUM_CHARS; i < LZX_MAINTREE_NUM_SYMBOLS; i++) { + unsigned position_slot = (i >> 3) & 0x1f; + + /* If it's a verbatim block, add the number of extra bits + * corresponding to the position slot. + * + * If it's an aligned block and there would normally be at least + * 3 extra bits, count 3 less because they will be output as an + * aligned offset symbol instead. */ + unsigned num_extra_bits = lzx_get_num_extra_bits(position_slot); + + if (block_type == LZX_BLOCKTYPE_ALIGNED && num_extra_bits >= 3) + num_extra_bits -= 3; + main_lens[i] += num_extra_bits; + } +} + +/* + * Compute the costs, in bits, to output a compressed block as aligned offset + * and verbatim. + * + * @block_size + * Number of bytes of uncompressed data this block represents. + * @codes + * Huffman codes that will be used to output the block. + * @prev_codes + * Huffman codes for the previous block, or all zeroes if this is the first + * block. + * @freqs + * Frequencies of Huffman symbol that will be output in the block. + * @aligned_cost_ret + * Cost of aligned block will be returned here. + * @verbatim_cost_ret + * Cost of verbatim block will be returned here. + */ +static void +lzx_compute_compressed_block_costs(unsigned block_size, + const struct lzx_codes *codes, + const struct lzx_codes *prev_codes, + const struct lzx_freqs *freqs, + unsigned * aligned_cost_ret, + unsigned * verbatim_cost_ret) +{ + unsigned common_cost = 0; + unsigned aligned_cost = 0; + unsigned verbatim_cost = 0; + + u8 updated_main_lens[LZX_MAINTREE_NUM_SYMBOLS]; + + /* Account for cost of block header. */ + common_cost += LZX_BLOCKTYPE_NBITS; + if (block_size == LZX_DEFAULT_BLOCK_SIZE) + common_cost += 1; + else + common_cost += LZX_BLOCKSIZE_NBITS; + + /* Account for cost of outputting aligned offset code. */ + aligned_cost += LZX_ALIGNEDTREE_NUM_SYMBOLS * LZX_ALIGNEDTREE_ELEMENT_SIZE; + + /* Account for cost of outputting main and length codes. */ + common_cost += lzx_code_cost(codes->lens.main, + prev_codes->lens.main, + LZX_NUM_CHARS); + common_cost += lzx_code_cost(codes->lens.main + LZX_NUM_CHARS, + prev_codes->lens.main + LZX_NUM_CHARS, + LZX_MAINTREE_NUM_SYMBOLS - LZX_NUM_CHARS); + common_cost += lzx_code_cost(codes->lens.len, + prev_codes->lens.len, + LZX_LENTREE_NUM_SYMBOLS); + + /* Account for cost to output main, length, and aligned symbols, taking + * into account extra position bits. 
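
A main match symbol packs a position slot and a length header, as set up in lzx_record_match(); unpacking one, which the loop above does inline with (i >> 3) & 0x1f, can be sketched as a hypothetical helper:

        /* Split a main symbol back into position slot and length header;
         * the inverse of the len_pos_header packing in lzx_record_match()
         * (sketch). */
        static void
        main_symbol_split(unsigned main_symbol,
                          unsigned *position_slot, unsigned *len_header)
        {
                /* match header symbols follow the 256 literal symbols */
                unsigned sym = main_symbol - LZX_NUM_CHARS;

                *position_slot = sym >> 3;  /* at most 50 */
                *len_header = sym & 7;      /* 7 means "length footer follows" */
        }
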
*/ + + memcpy(updated_main_lens, codes->lens.main, LZX_MAINTREE_NUM_SYMBOLS); + lzx_update_mainsym_match_costs(LZX_BLOCKTYPE_VERBATIM, updated_main_lens); + verbatim_cost += lzx_huffman_code_output_cost(updated_main_lens, + freqs->main, + LZX_MAINTREE_NUM_SYMBOLS); + memcpy(updated_main_lens, codes->lens.main, LZX_MAINTREE_NUM_SYMBOLS); + lzx_update_mainsym_match_costs(LZX_BLOCKTYPE_ALIGNED, updated_main_lens); + aligned_cost += lzx_huffman_code_output_cost(updated_main_lens, + freqs->main, + LZX_MAINTREE_NUM_SYMBOLS); + + common_cost += lzx_huffman_code_output_cost(codes->lens.len, + freqs->len, + LZX_LENTREE_NUM_SYMBOLS); + + aligned_cost += lzx_huffman_code_output_cost(codes->lens.aligned, + freqs->aligned, + LZX_ALIGNEDTREE_NUM_SYMBOLS); + + *aligned_cost_ret = aligned_cost + common_cost; + *verbatim_cost_ret = verbatim_cost + common_cost; +} + +/* Prepare a (nonsplit) compressed block. */ +static unsigned +lzx_prepare_compressed_block(struct lzx_compressor *ctx, unsigned block_number, + struct lzx_codes *prev_codes) +{ + struct lzx_block_spec *spec = &ctx->block_specs[block_number - 1]; + unsigned orig_cached_matches_pos = ctx->cached_matches_pos; + struct lzx_lru_queue orig_queue = ctx->queue; + struct lzx_freqs freqs; + unsigned cost; + + /* Here's where the real work happens. The following loop runs one or + * more times, each time using a cost model based on the Huffman codes + * computed from the previous iteration (the first iteration uses a + * default model). Each iteration of the loop uses a heuristic + * algorithm to divide the block into near-optimal matches/literals from + * beginning to end. */ + LZX_ASSERT(ctx->params.slow.num_optim_passes >= 1); + spec->num_chosen_matches = 0; + for (unsigned pass = 0; pass < ctx->params.slow.num_optim_passes; pass++) + { + LZX_DEBUG("Block %u: Match-choosing pass %u of %u", + block_number, pass + 1, + ctx->params.slow.num_optim_passes); + + /* Reset frequency tables. */ + memset(&freqs, 0, sizeof(freqs)); + + /* Reset match offset LRU queue. */ + ctx->queue = orig_queue; + + /* Reset match-finding position. */ + ctx->cached_matches_pos = orig_cached_matches_pos; + ctx->match_window_pos = spec->window_pos; + ctx->match_window_end = spec->window_pos + spec->block_size; + + /* Set cost model. */ + lzx_set_costs(ctx, &spec->codes.lens); + + unsigned window_pos = spec->window_pos; + unsigned end = window_pos + spec->block_size; + + while (window_pos < end) { + struct raw_match match; + struct lzx_match lzx_match; + + match = lzx_lz_get_near_optimal_match(ctx); + + if (match.len >= LZX_MIN_MATCH) { + + /* Best to output a match here. */ + + LZX_ASSERT(match.len <= LZX_MAX_MATCH); + LZX_ASSERT(!memcmp(&ctx->window[window_pos], + &ctx->window[window_pos - match.offset], + match.len)); + + /* Tally symbol frequencies. */ + lzx_match.data = lzx_record_match(match.offset, + match.len, + &freqs, + &ctx->queue); + + window_pos += match.len; + } else { + /* Best to output a literal here. */ + + /* Tally symbol frequencies. */ + lzx_match.data = lzx_record_literal(ctx->window[window_pos], + &freqs); + + window_pos += 1; + } + + /* If it's the last pass, save the match/literal in + * intermediate form. */ + if (pass == ctx->params.slow.num_optim_passes - 1) { + ctx->chosen_matches[spec->chosen_matches_start_pos + + spec->num_chosen_matches] = lzx_match; + + spec->num_chosen_matches++; + } } + LZX_ASSERT(window_pos == end); + + /* Build Huffman codes using the new frequencies. 
*/ + lzx_make_huffman_codes(&freqs, &spec->codes); + + /* The first time we get here is when the full input has been + * processed, so the match-finding is done. */ + ctx->matches_already_found = true; + } + + LZX_DEBUG("Block %u: saved %u matches/literals @ %u", + block_number, spec->num_chosen_matches, + spec->chosen_matches_start_pos); + + unsigned aligned_cost; + unsigned verbatim_cost; + + lzx_compute_compressed_block_costs(spec->block_size, + &spec->codes, + prev_codes, + &freqs, + &aligned_cost, + &verbatim_cost); + + /* Choose whether to make the block aligned offset or verbatim. */ + if (aligned_cost < verbatim_cost) { + spec->block_type = LZX_BLOCKTYPE_ALIGNED; + cost = aligned_cost; + LZX_DEBUG("Using aligned block (cost %u vs %u for verbatim)", + aligned_cost, verbatim_cost); + } else { + spec->block_type = LZX_BLOCKTYPE_VERBATIM; + cost = verbatim_cost; + LZX_DEBUG("Using verbatim block (cost %u vs %u for aligned)", + verbatim_cost, aligned_cost); } + + LZX_DEBUG("Block %u is %u => %u bytes unsplit.", + block_number, spec->block_size, cost / 8); + + return cost; } +/* + * lzx_prepare_block_recursive() - + * + * Given a (possibly nonproper) sub-sequence of the preprocessed input, compute + * the LZX block(s) that it should be output as. + * + * This function initially considers the case where the given sub-sequence of + * the preprocessed input is output as a single block. This block is calculated + * and its cost (number of bits required to output it) is computed. + * + * Then, if @max_split_level is greater than zero, a split into two evenly sized + * subblocks is considered. The block is recursively split in this way, + * potentially up to the depth specified by @max_split_level. The cost of the + * split block is compared to the cost of the single block, and the lower cost + * solution is used. + * + * For each compressed output block computed, the sequence of matches/literals + * and the corresponding Huffman codes for the block are produced and saved. + * + * The return value is the approximate number of bits the block (or all + * subblocks, in the case that the split block has lower cost) will take up + * when written to the compressed output. + */ +static unsigned +lzx_prepare_block_recursive(struct lzx_compressor * ctx, + unsigned block_number, + unsigned max_split_level, + struct lzx_codes **prev_codes_p) +{ + struct lzx_block_spec *spec = &ctx->block_specs[block_number - 1]; + unsigned cost; + unsigned orig_cached_matches_pos; + struct lzx_lru_queue orig_queue, nonsplit_queue; + struct lzx_codes *prev_codes = *prev_codes_p; + + LZX_DEBUG("Preparing block %u...", block_number); + + /* Save positions of chosen and cached matches, and the match offset LRU + * queue, so that they can be restored if splitting is attempted. */ + orig_cached_matches_pos = ctx->cached_matches_pos; + orig_queue = ctx->queue; + + /* Consider outputting the input subsequence as a single block. */ + spec->is_split = 0; + cost = lzx_prepare_compressed_block(ctx, block_number, prev_codes); + nonsplit_queue = ctx->queue; + + *prev_codes_p = &spec->codes; + + /* If the maximum split level is at least one, consider splitting the + * block in two. */ + if (max_split_level--) { + + LZX_DEBUG("Calculating split of block %u...", block_number); + + struct lzx_block_spec *spec1, *spec2; + unsigned split_cost; + + ctx->cached_matches_pos = orig_cached_matches_pos; + ctx->queue = orig_queue; + + /* Prepare and get the cost of the first sub-block.
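
The block splitting addresses an implicit binary tree purely by block number; a sketch of the index arithmetic used throughout (hypothetical helpers, not patch code):

        /* Index arithmetic for the implicit binary tree of blocks.
         * Block numbers are 1-based; block n's children are 2n and 2n + 1
         * (sketch). */
        static inline struct lzx_block_spec *
        block_spec(struct lzx_compressor *ctx, unsigned block_number)
        {
                return &ctx->block_specs[block_number - 1];
        }

        static inline unsigned
        left_child(unsigned block_number)
        {
                return block_number * 2;  /* right child is block_number * 2 + 1 */
        }
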
*/ + spec1 = &ctx->block_specs[block_number * 2 - 1]; + spec1->codes.lens = spec->codes.lens; + spec1->window_pos = spec->window_pos; + spec1->block_size = spec->block_size / 2; + spec1->chosen_matches_start_pos = spec->chosen_matches_start_pos + + LZX_MAX_WINDOW_SIZE; + split_cost = lzx_prepare_block_recursive(ctx, + block_number * 2, + max_split_level, + &prev_codes); + + /* Prepare and get the cost of the second sub-block. */ + spec2 = spec1 + 1; + spec2->codes.lens = spec->codes.lens; + spec2->window_pos = spec->window_pos + spec1->block_size; + spec2->block_size = spec->block_size - spec1->block_size; + spec2->chosen_matches_start_pos = spec1->chosen_matches_start_pos + + spec1->block_size; + split_cost += lzx_prepare_block_recursive(ctx, + block_number * 2 + 1, + max_split_level, + &prev_codes); + + /* Compare the cost of the whole block with that of the split + * block. Choose the lower cost solution. */ + if (split_cost < cost) { + LZX_DEBUG("Splitting block %u is worth it " + "(%u => %u bytes).", + block_number, cost / 8, split_cost / 8); + spec->is_split = 1; + cost = split_cost; + *prev_codes_p = prev_codes; + } else { + LZX_DEBUG("Splitting block %u is NOT worth it " + "(%u => %u bytes).", + block_number, cost / 8, split_cost / 8); + ctx->queue = nonsplit_queue; + } + } -static const struct lz_params lzx_lz_params = { + return cost; +} - /* LZX_MIN_MATCH == 2, but 2-character matches are rarely useful; the - * minimum match for compression is set to 3 instead. */ - .min_match = 3, +/* Empirical averages */ +static const u8 lzx_default_mainsym_costs[LZX_MAINTREE_NUM_SYMBOLS] = { + 7, 9, 9, 10, 9, 10, 10, 10, 9, 10, 9, 10, 10, 9, 10, 10, 9, 10, 10, 11, + 10, 10, 10, 11, 10, 11, 11, 11, 10, 11, 11, 11, 8, 11, 9, 10, 9, 10, 11, + 11, 9, 9, 11, 10, 10, 9, 9, 9, 8, 8, 8, 8, 8, 9, 9, 9, 8, 8, 9, 9, 9, 9, + 10, 10, 10, 8, 9, 8, 8, 8, 8, 9, 9, 9, 10, 10, 8, 8, 9, 9, 8, 10, 9, 8, + 8, 9, 8, 9, 9, 10, 10, 10, 9, 10, 11, 9, 10, 8, 9, 8, 8, 8, 8, 9, 8, 8, + 9, 9, 8, 8, 8, 8, 8, 10, 8, 8, 7, 8, 9, 9, 9, 9, 10, 11, 10, 10, 11, 11, + 10, 11, 11, 10, 10, 11, 11, 11, 10, 10, 11, 10, 11, 10, 11, 11, 10, 11, + 11, 12, 11, 11, 11, 12, 11, 11, 11, 11, 11, 11, 11, 12, 10, 11, 11, 11, + 11, 11, 11, 12, 11, 11, 11, 11, 11, 12, 11, 11, 10, 11, 11, 11, 11, 11, + 11, 11, 10, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 11, 11, 11, 11, 11, + 10, 11, 11, 11, 11, 11, 11, 11, 10, 11, 11, 11, 11, 12, 11, 11, 10, 11, + 11, 11, 11, 12, 11, 11, 10, 11, 11, 11, 10, 12, 11, 11, 10, 10, 11, 10, + 10, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 11, 10, 11, 11, 11, 10, 11, + 10, 9, 8, 7, 10, 10, 11, 10, 11, 7, 9, 9, 11, 11, 11, 12, 11, 9, 10, 10, + 12, 12, 13, 13, 12, 11, 10, 12, 12, 14, 14, 14, 13, 12, 9, 12, 13, 14, + 14, 14, 14, 14, 9, 10, 13, 14, 14, 14, 14, 14, 9, 9, 11, 11, 13, 13, 13, + 14, 9, 9, 11, 12, 12, 13, 13, 13, 8, 8, 11, 11, 12, 12, 12, 11, 9, 9, + 10, 11, 12, 12, 12, 11, 8, 9, 10, 10, 11, 12, 11, 10, 9, 9, 10, 11, 11, + 12, 11, 10, 8, 9, 10, 10, 11, 11, 11, 9, 9, 9, 10, 11, 11, 11, 11, 9, 8, + 8, 10, 10, 11, 11, 11, 9, 9, 9, 10, 10, 11, 11, 11, 9, 9, 8, 9, 10, 11, + 11, 11, 9, 10, 9, 10, 11, 11, 11, 11, 9, 14, 9, 9, 10, 10, 11, 10, 9, + 14, 9, 10, 11, 11, 11, 11, 9, 14, 9, 10, 10, 11, 11, 11, 9, 14, 10, 10, + 11, 11, 12, 11, 10, 14, 10, 10, 10, 11, 11, 11, 10, 14, 11, 11, 11, 11, + 12, 12, 10, 14, 10, 11, 11, 11, 12, 11, 10, 14, 11, 11, 11, 12, 12, 12, + 11, 15, 11, 11, 11, 12, 12, 12, 11, 14, 12, 12, 12, 12, 13, 12, 11, 15, + 12, 12, 12, 13, 13, 13, 12, 15, 14, 13, 14, 14, 14, 14, 13, +}; - .max_match = 
LZX_MAX_MATCH, - .good_match = LZX_MAX_MATCH, - .nice_match = LZX_MAX_MATCH, - .max_chain_len = LZX_MAX_MATCH, - .max_lazy_match = LZX_MAX_MATCH, - .too_far = 4096, +/* Empirical averages */ +static const u8 lzx_default_lensym_costs[LZX_LENTREE_NUM_SYMBOLS] = { + 5, 5, 5, 5, 5, 6, 5, 5, 6, 7, 7, 7, 8, 8, 7, 8, 9, 9, 9, 9, 10, 9, 9, + 10, 9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12, 11, 12, 12, + 12, 12, 12, 12, 13, 12, 12, 12, 13, 12, 13, 13, 12, 12, 13, 12, 13, 13, + 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 13, 14, 13, 14, 13, + 14, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, + 14, 14, 14, 14, 14, 14, 14, 14, 14, 10, }; -/* API function documented in wimlib.h */ -WIMLIBAPI unsigned -wimlib_lzx_compress(const void *_uncompressed_data, unsigned uncompressed_len, - void *compressed_data) +/* + * Set default symbol costs. + */ +static void +lzx_set_default_costs(struct lzx_lens * lens) { - struct output_bitstream ostream; - u8 uncompressed_data[uncompressed_len + 8]; - struct lzx_freq_tables freq_tabs; - struct lzx_codes codes; - u32 match_tab[uncompressed_len]; - struct lru_queue queue; - unsigned num_matches; - unsigned compressed_len; unsigned i; - int ret; - int block_type = LZX_BLOCKTYPE_ALIGNED; - wimlib_assert(uncompressed_len <= 32768); +#if LZX_PARAM_USE_EMPIRICAL_DEFAULT_COSTS + memcpy(&lens->main, lzx_default_mainsym_costs, LZX_MAINTREE_NUM_SYMBOLS); + memcpy(&lens->len, lzx_default_lensym_costs, LZX_LENTREE_NUM_SYMBOLS); - if (uncompressed_len < 100) - return 0; +#else + /* Literal symbols */ + for (i = 0; i < LZX_NUM_CHARS; i++) + lens->main[i] = 8; - memset(&freq_tabs, 0, sizeof(freq_tabs)); - queue.R0 = 1; - queue.R1 = 1; - queue.R2 = 1; + /* Match header symbols */ + for (; i < LZX_MAINTREE_NUM_SYMBOLS; i++) + lens->main[i] = 10; - /* The input data must be preprocessed. To avoid changing the original - * input, copy it to a temporary buffer. */ - memcpy(uncompressed_data, _uncompressed_data, uncompressed_len); - memset(uncompressed_data + uncompressed_len, 0, 8); + /* Length symbols */ + for (i = 0; i < LZX_LENTREE_NUM_SYMBOLS; i++) + lens->len[i] = 8; +#endif - /* Before doing any actual compression, do the call instruction (0xe8 - * byte) translation on the uncompressed data. */ - do_call_insn_preprocessing(uncompressed_data, uncompressed_len); + /* Aligned offset symbols */ + for (i = 0; i < LZX_ALIGNEDTREE_NUM_SYMBOLS; i++) + lens->aligned[i] = 3; +} - /* Determine the sequence of matches and literals that will be output, - * and in the process, keep counts of the number of times each symbol - * will be output, so that the Huffman trees can be made. */ +/* + * lzx_prepare_blocks() - + * + * Calculate the blocks to split the preprocessed data into. 
+ * + * Input --- the preprocessed data: + * + * ctx->window[] + * ctx->window_size + * + * Working space: + * Match finding: + * ctx->hash_tab + * ctx->child_tab + * ctx->cached_matches + * ctx->cached_matches_pos + * ctx->matches_already_found + * + * Block cost modeling: + * ctx->costs + * ctx->block_specs (also an output) + * + * Match choosing: + * ctx->optimum + * ctx->optimum_cur_idx + * ctx->optimum_end_idx + * ctx->chosen_matches (also an output) + * + * Output --- the block specifications and the corresponding match/literal data: + * + * ctx->block_specs[] + * ctx->chosen_matches[] + * + * The return value is the approximate number of bits the compressed data will + * take up. + */ +static unsigned +lzx_prepare_blocks(struct lzx_compressor * ctx) +{ + /* This function merely does some initializations, then passes control + * to lzx_prepare_block_recursive(). */ + + /* 1. Initialize match-finding variables. */ + + /* Zero all entries in the hash table, indicating that no length-3 + * character sequences have been discovered in the input yet. */ + memset(ctx->hash_tab, 0, LZX_LZ_HASH_SIZE * 2 * sizeof(ctx->hash_tab[0])); + if (ctx->params.slow.use_len2_matches) + memset(ctx->digram_tab, 0, 256 * 256 * sizeof(ctx->digram_tab[0])); + /* Note: ctx->child_tab need not be initialized. */ + + /* No matches have been found and cached yet. */ + ctx->cached_matches_pos = 0; + ctx->matches_already_found = false; + + /* 2. Initialize match-choosing variables. */ + ctx->optimum_cur_idx = 0; + ctx->optimum_end_idx = 0; + /* Note: ctx->optimum need not be initialized. */ + ctx->block_specs[0].chosen_matches_start_pos = 0; + + /* 3. Set block 1 (index 0) to represent the entire input data. */ + ctx->block_specs[0].block_size = ctx->window_size; + ctx->block_specs[0].window_pos = 0; + + /* 4. Set up a default Huffman symbol cost model for block 1 (index 0). + * The model will be refined later. */ + lzx_set_default_costs(&ctx->block_specs[0].codes.lens); + + /* 5. Initialize the match offset LRU queue. */ + ctx->queue = (struct lzx_lru_queue){1, 1, 1}; + + /* 6. Pass control to recursive procedure. */ + struct lzx_codes * prev_codes = &ctx->zero_codes; + return lzx_prepare_block_recursive(ctx, 1, + ctx->params.slow.num_split_passes, + &prev_codes); +} - num_matches = lz_analyze_block(uncompressed_data, uncompressed_len, - match_tab, lzx_record_match, - lzx_record_literal, &freq_tabs, - &queue, freq_tabs.main_freq_table, +/* + * This is the fast version of lzx_prepare_blocks(), which "quickly" prepares a + * single compressed block containing the entire input. See the description of + * the "Fast algorithm" at the beginning of this file for more information. + * + * Input --- the preprocessed data: + * + * ctx->window[] + * ctx->window_size + * + * Working space: + * ctx->queue + * + * Output --- the block specifications and the corresponding match/literal data: + * + * ctx->block_specs[] + * ctx->chosen_matches[] + */ +static void +lzx_prepare_block_fast(struct lzx_compressor * ctx) +{ + unsigned num_matches; + struct lzx_freqs freqs; + struct lzx_block_spec *spec; + + /* Parameters to hash chain LZ match finder */ + static const struct lz_params lzx_lz_params = { + /* LZX_MIN_MATCH == 2, but 2-character matches are rarely + * useful; the minimum match for compression is set to 3 + * instead. 
*/ + .min_match = 3, + .max_match = LZX_MAX_MATCH, + .good_match = LZX_MAX_MATCH, + .nice_match = LZX_MAX_MATCH, + .max_chain_len = LZX_MAX_MATCH, + .max_lazy_match = LZX_MAX_MATCH, + .too_far = 4096, + }; + + /* Initialize symbol frequencies and match offset LRU queue. */ + memset(&freqs, 0, sizeof(struct lzx_freqs)); + ctx->queue = (struct lzx_lru_queue){ 1, 1, 1 }; + + /* Determine series of matches/literals to output. */ + num_matches = lz_analyze_block(ctx->window, + ctx->window_size, + (u32*)ctx->chosen_matches, + lzx_record_match, + lzx_record_literal, + &freqs, + &ctx->queue, + &freqs, &lzx_lz_params); - lzx_make_huffman_codes(&freq_tabs, &codes); - /* Initialize the output bitstream. */ - init_output_bitstream(&ostream, compressed_data, uncompressed_len - 1); + /* Set up block specification. */ + spec = &ctx->block_specs[0]; + spec->is_split = 0; + spec->block_type = LZX_BLOCKTYPE_ALIGNED; + spec->window_pos = 0; + spec->block_size = ctx->window_size; + spec->num_chosen_matches = num_matches; + spec->chosen_matches_start_pos = 0; + lzx_make_huffman_codes(&freqs, &spec->codes); +} + +static void +do_call_insn_translation(u32 *call_insn_target, int input_pos, + s32 file_size) +{ + s32 abs_offset; + s32 rel_offset; - /* The first three bits tell us what kind of block it is, and are one - * of the LZX_BLOCKTYPE_* values. */ - bitstream_put_bits(&ostream, block_type, 3); + rel_offset = le32_to_cpu(*call_insn_target); + if (rel_offset >= -input_pos && rel_offset < file_size) { + if (rel_offset < file_size - input_pos) { + /* "good translation" */ + abs_offset = rel_offset + input_pos; + } else { + /* "compensating translation" */ + abs_offset = rel_offset - file_size; + } + *call_insn_target = cpu_to_le32(abs_offset); + } +} - /* The next bit indicates whether the block size is the default (32768), - * indicated by a 1 bit, or whether the block size is given by the next - * 16 bits, indicated by a 0 bit. */ - if (uncompressed_len == 32768) { - bitstream_put_bits(&ostream, 1, 1); - } else { - bitstream_put_bits(&ostream, 0, 1); - bitstream_put_bits(&ostream, uncompressed_len, 16); +/* This is the reverse of undo_call_insn_preprocessing() in lzx-decompress.c. + * See the comment above that function for more information. */ +static void +do_call_insn_preprocessing(u8 data[], int size) +{ + for (int i = 0; i < size - 10; i++) { + if (data[i] == 0xe8) { + do_call_insn_translation((u32*)&data[i + 1], i, + LZX_WIM_MAGIC_FILESIZE); + i += 4; + } } +} - /* Write out the aligned offset tree. Note that M$ lies and says that - * the aligned offset tree comes after the length tree, but that is - * wrong; it actually is before the main tree. */ - if (block_type == LZX_BLOCKTYPE_ALIGNED) - for (i = 0; i < LZX_ALIGNEDTREE_NUM_SYMBOLS; i++) - bitstream_put_bits(&ostream, codes.aligned_lens[i], - LZX_ALIGNEDTREE_ELEMENT_SIZE); +/* API function documented in wimlib.h */ +WIMLIBAPI unsigned +wimlib_lzx_compress2(const void * const restrict uncompressed_data, + unsigned const uncompressed_len, + void * const restrict compressed_data, + struct wimlib_lzx_context * const restrict lzx_ctx) +{ + struct lzx_compressor *ctx = (struct lzx_compressor*)lzx_ctx; + struct output_bitstream ostream; + unsigned compressed_len; - /* Write the pre-tree and lengths for the first LZX_NUM_CHARS symbols in the - * main tree. 
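
do_call_insn_translation() above maps a relative call target in [-input_pos, file_size) to an absolute one in the same range; the inverse applied at decompression time by undo_call_insn_preprocessing() in lzx-decompress.c can be sketched as follows (a sketch of the reverse mapping, not the actual decompressor code):

        /* Undo the e8 translation: recover the relative call target from
         * the absolute one stored in the compressed stream (sketch). */
        static void
        undo_call_insn_translation(u32 *call_insn_target, int input_pos,
                                   s32 file_size)
        {
                s32 abs_offset = le32_to_cpu(*call_insn_target);

                if (abs_offset >= 0) {
                        if (abs_offset < file_size) {
                                /* reverse of the "good translation" */
                                *call_insn_target =
                                        cpu_to_le32(abs_offset - input_pos);
                        }
                } else if (abs_offset >= -input_pos) {
                        /* reverse of the "compensating translation" */
                        *call_insn_target = cpu_to_le32(abs_offset + file_size);
                }
        }
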
*/ - ret = lzx_write_compressed_tree(&ostream, codes.main_lens, - LZX_NUM_CHARS); - if (ret) + if (uncompressed_len < 100) { + LZX_DEBUG("Too small to bother compressing."); return 0; + } - /* Write the pre-tree and symbols for the rest of the main tree. */ - ret = lzx_write_compressed_tree(&ostream, codes.main_lens + - LZX_NUM_CHARS, - LZX_MAINTREE_NUM_SYMBOLS - - LZX_NUM_CHARS); - if (ret) + if (uncompressed_len > 32768) { + LZX_DEBUG("Only up to 32768 bytes of uncompressed data are supported."); return 0; + } - /* Write the pre-tree and symbols for the length tree. */ - ret = lzx_write_compressed_tree(&ostream, codes.len_lens, - LZX_LENTREE_NUM_SYMBOLS); - if (ret) - return 0; + wimlib_assert(lzx_ctx != NULL); - /* Write the compressed literals. */ - ret = lzx_write_compressed_literals(&ostream, block_type, - match_tab, num_matches, &codes); - if (ret) - return 0; + LZX_DEBUG("Attempting to compress %u bytes...", uncompressed_len); + + /* The input data must be preprocessed. To avoid changing the original + * input, copy it to a temporary buffer. */ + memcpy(ctx->window, uncompressed_data, uncompressed_len); + ctx->window_size = uncompressed_len; + + /* This line is unnecessary; it just avoids inconsequential accesses of + * uninitialized memory that would show up in memory-checking tools such + * as valgrind. */ + memset(&ctx->window[ctx->window_size], 0, 12); + + LZX_DEBUG("Preprocessing data..."); + + /* Before doing any actual compression, do the call instruction (0xe8 + * byte) translation on the uncompressed data. */ + do_call_insn_preprocessing(ctx->window, ctx->window_size); + + LZX_DEBUG("Preparing blocks..."); + + /* Prepare the compressed data. */ + if (ctx->params.algorithm == WIMLIB_LZX_ALGORITHM_FAST) + lzx_prepare_block_fast(ctx); + else + lzx_prepare_blocks(ctx); + + LZX_DEBUG("Writing compressed blocks..."); + + /* Generate the compressed data. */ + init_output_bitstream(&ostream, compressed_data, ctx->window_size - 1); + lzx_write_all_blocks(ctx, &ostream); - ret = flush_output_bitstream(&ostream); - if (ret) + LZX_DEBUG("Flushing bitstream..."); + if (flush_output_bitstream(&ostream)) { + /* If the bitstream cannot be flushed, then the output space was + * exhausted. */ + LZX_DEBUG("Data did not compress to less than original length!"); return 0; + } + /* Compute the length of the compressed data. */ compressed_len = ostream.bit_output - (u8*)compressed_data; -#ifdef ENABLE_VERIFY_COMPRESSION - /* Verify that we really get the same thing back when decompressing. */ + LZX_DEBUG("Done: compressed %u => %u bytes.", + uncompressed_len, compressed_len); + +#if defined(ENABLE_LZX_DEBUG) || defined(ENABLE_VERIFY_COMPRESSION) + /* Verify that we really get the same thing back when decompressing. 
*/ { u8 buf[uncompressed_len]; + int ret; + unsigned i; + ret = wimlib_lzx_decompress(compressed_data, compressed_len, buf, uncompressed_len); - if (ret != 0) { - ERROR("lzx_compress(): Failed to decompress data we compressed"); - abort(); + if (ret) { + ERROR("Failed to decompress data we " + "compressed using LZX algorithm"); + wimlib_assert(0); + return 0; } + bool bad = false; + const u8 * udata = uncompressed_data; for (i = 0; i < uncompressed_len; i++) { - if (buf[i] != *((u8*)_uncompressed_data + i)) { - ERROR("lzx_compress(): Data we compressed didn't " - "decompress to the original data (difference at " - "byte %u of %u)", i + 1, uncompressed_len); - abort(); + if (buf[i] != udata[i]) { + bad = true; + ERROR("Data we compressed using LZX algorithm " + "didn't decompress to original " + "(difference at idx %u: c %#02x, u %#02x)", + i, buf[i], udata[i]); } } + if (bad) { + wimlib_assert(0); + return 0; + } } #endif return compressed_len; } + +static bool +lzx_params_compatible(const struct wimlib_lzx_params *oldparams, + const struct wimlib_lzx_params *newparams) +{ + return 0 == memcmp(oldparams, newparams, sizeof(struct wimlib_lzx_params)); +} + +/* API function documented in wimlib.h */ +WIMLIBAPI int +wimlib_lzx_alloc_context(const struct wimlib_lzx_params *params, + struct wimlib_lzx_context **ctx_pp) +{ + + LZX_DEBUG("Allocating LZX context..."); + + struct lzx_compressor *ctx; + + static const struct wimlib_lzx_params fast_default = { + .size_of_this = sizeof(struct wimlib_lzx_params), + .algorithm = WIMLIB_LZX_ALGORITHM_FAST, + .use_defaults = 0, + .fast = { + }, + }; + static const struct wimlib_lzx_params slow_default = { + .size_of_this = sizeof(struct wimlib_lzx_params), + .algorithm = WIMLIB_LZX_ALGORITHM_SLOW, + .use_defaults = 0, + .slow = { + .use_len2_matches = 1, + .num_fast_bytes = 32, + .num_optim_passes = 3, + .num_split_passes = 3, + .main_nostat_cost = 15, + .len_nostat_cost = 15, + .aligned_nostat_cost = 7, + }, + }; + + if (params == NULL) { + LZX_DEBUG("Using default algorithm and parameters."); + params = &slow_default; + } + + if (params->algorithm != WIMLIB_LZX_ALGORITHM_SLOW && + params->algorithm != WIMLIB_LZX_ALGORITHM_FAST) + { + LZX_DEBUG("Invalid algorithm."); + return WIMLIB_ERR_INVALID_PARAM; + } + + if (params->use_defaults) { + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW) + params = &slow_default; + else + params = &fast_default; + } + + if (params->size_of_this != sizeof(struct wimlib_lzx_params)) { + LZX_DEBUG("Invalid parameter structure size!"); + return WIMLIB_ERR_INVALID_PARAM; + } + + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW) { + if (params->slow.num_fast_bytes < 3 || + params->slow.num_fast_bytes > 257) + { + LZX_DEBUG("Invalid number of fast bytes!"); + return WIMLIB_ERR_INVALID_PARAM; + } + + if (params->slow.num_optim_passes < 1) + { + LZX_DEBUG("Invalid number of optimization passes!"); + return WIMLIB_ERR_INVALID_PARAM; + } + + if (params->slow.main_nostat_cost < 1 || + params->slow.main_nostat_cost > 16) + { + LZX_DEBUG("Invalid main_nostat_cost!"); + return WIMLIB_ERR_INVALID_PARAM; + } + + if (params->slow.len_nostat_cost < 1 || + params->slow.len_nostat_cost > 16) + { + LZX_DEBUG("Invalid len_nostat_cost!"); + return WIMLIB_ERR_INVALID_PARAM; + } + + if (params->slow.aligned_nostat_cost < 1 || + params->slow.aligned_nostat_cost > 8) + { + LZX_DEBUG("Invalid aligned_nostat_cost!"); + return WIMLIB_ERR_INVALID_PARAM; + } + } + + if (ctx_pp == NULL) { + LZX_DEBUG("Check parameters only."); + return 0; + } + + ctx 
= *(struct lzx_compressor**)ctx_pp; + + if (ctx && lzx_params_compatible(&ctx->params, params)) + return 0; + + LZX_DEBUG("Allocating memory."); + + ctx = MALLOC(sizeof(struct lzx_compressor)); + if (ctx == NULL) + goto err; + + size_t block_specs_length; + + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW) + block_specs_length = ((1 << (params->slow.num_split_passes + 1)) - 1); + else + block_specs_length = 1; + ctx->block_specs = MALLOC(block_specs_length * sizeof(ctx->block_specs[0])); + if (ctx->block_specs == NULL) + goto err_free_ctx; + + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW) { + ctx->hash_tab = MALLOC((LZX_LZ_HASH_SIZE + 2 * LZX_MAX_WINDOW_SIZE) * + sizeof(ctx->hash_tab[0])); + if (ctx->hash_tab == NULL) + goto err_free_block_specs; + ctx->child_tab = ctx->hash_tab + LZX_LZ_HASH_SIZE; + } else { + ctx->hash_tab = NULL; + ctx->child_tab = NULL; + } + + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW && + params->slow.use_len2_matches) + { + ctx->digram_tab = MALLOC(256 * 256 * sizeof(ctx->digram_tab[0])); + if (ctx->digram_tab == NULL) + goto err_free_hash_tab; + } else { + ctx->digram_tab = NULL; + } + + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW) { + ctx->cached_matches = MALLOC(10 * LZX_MAX_WINDOW_SIZE * + sizeof(ctx->cached_matches[0])); + if (ctx->cached_matches == NULL) + goto err_free_digram_tab; + } else { + ctx->cached_matches = NULL; + } + + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW) { + ctx->optimum = MALLOC((LZX_PARAM_OPTIM_ARRAY_SIZE + LZX_MAX_MATCH) * + sizeof(ctx->optimum[0])); + if (ctx->optimum == NULL) + goto err_free_cached_matches; + } else { + ctx->optimum = NULL; + } + + size_t chosen_matches_length; + if (params->algorithm == WIMLIB_LZX_ALGORITHM_SLOW) + chosen_matches_length = LZX_MAX_WINDOW_SIZE * + (params->slow.num_split_passes + 1); + else + chosen_matches_length = LZX_MAX_WINDOW_SIZE; + + ctx->chosen_matches = MALLOC(chosen_matches_length * + sizeof(ctx->chosen_matches[0])); + if (ctx->chosen_matches == NULL) + goto err_free_optimum; + + memcpy(&ctx->params, params, sizeof(struct wimlib_lzx_params)); + memset(&ctx->zero_codes, 0, sizeof(ctx->zero_codes)); + + LZX_DEBUG("Successfully allocated new LZX context."); + + wimlib_lzx_free_context(*ctx_pp); + *ctx_pp = (struct wimlib_lzx_context*)ctx; + return 0; + +err_free_optimum: + FREE(ctx->optimum); +err_free_cached_matches: + FREE(ctx->cached_matches); +err_free_digram_tab: + FREE(ctx->digram_tab); +err_free_hash_tab: + FREE(ctx->hash_tab); +err_free_block_specs: + FREE(ctx->block_specs); +err_free_ctx: + FREE(ctx); +err: + LZX_DEBUG("Ran out of memory."); + return WIMLIB_ERR_NOMEM; +} + +/* API function documented in wimlib.h */ +WIMLIBAPI void +wimlib_lzx_free_context(struct wimlib_lzx_context *_ctx) +{ + struct lzx_compressor *ctx = (struct lzx_compressor*)_ctx; + + if (ctx) { + FREE(ctx->chosen_matches); + FREE(ctx->optimum); + FREE(ctx->cached_matches); + FREE(ctx->digram_tab); + FREE(ctx->hash_tab); + FREE(ctx->block_specs); + FREE(ctx); + } +} + +/* API function documented in wimlib.h */ +WIMLIBAPI unsigned +wimlib_lzx_compress(const void * const restrict uncompressed_data, + unsigned const uncompressed_len, + void * const restrict compressed_data) +{ + int ret; + struct wimlib_lzx_context *ctx; + unsigned compressed_len; + + ret = wimlib_lzx_alloc_context(NULL, &ctx); + if (ret) { + wimlib_assert(ret != WIMLIB_ERR_INVALID_PARAM); + WARNING("Couldn't allocate LZX compression context: %"TS"", + wimlib_get_error_string(ret)); + return 0; + } + + compressed_len = 
wimlib_lzx_compress2(uncompressed_data, + uncompressed_len, + compressed_data, + ctx); + + wimlib_lzx_free_context(ctx); + + return compressed_len; +} diff --git a/src/lzx-decompress.c b/src/lzx-decompress.c index 96332dd8..d25d188b 100644 --- a/src/lzx-decompress.c +++ b/src/lzx-decompress.c @@ -326,7 +326,7 @@ lzx_read_block_header(struct input_bitstream *istream, unsigned *block_size_ret, unsigned *block_type_ret, struct lzx_tables *tables, - struct lru_queue *queue) + struct lzx_lru_queue *queue) { int ret; unsigned block_type; @@ -335,15 +335,15 @@ lzx_read_block_header(struct input_bitstream *istream, unsigned i; unsigned len; - ret = bitstream_ensure_bits(istream, 4); + ret = bitstream_ensure_bits(istream, LZX_BLOCKTYPE_NBITS + 1); if (ret) { - DEBUG("LZX input stream overrun"); + LZX_DEBUG("LZX input stream overrun"); return ret; } /* The first three bits tell us what kind of block it is, and are one * of the LZX_BLOCKTYPE_* values. */ - block_type = bitstream_read_bits_nocheck(istream, 3); + block_type = bitstream_read_bits_nocheck(istream, LZX_BLOCKTYPE_NBITS); /* The next bit indicates whether the block size is the default (32768), * indicated by a 1 bit, or whether the block size is given by the next @@ -353,7 +353,7 @@ lzx_read_block_header(struct input_bitstream *istream, if (s) { block_size = 32768; } else { - ret = bitstream_read_bits(istream, 16, &block_size); + ret = bitstream_read_bits(istream, LZX_BLOCKSIZE_NBITS, &block_size); if (ret) return ret; block_size = le16_to_cpu(block_size); @@ -380,8 +380,8 @@ lzx_read_block_header(struct input_bitstream *istream, tables->alignedtree_lens, 8); if (ret) { - DEBUG("lzx_decompress(): Failed to make the decode " - "table for the aligned offset tree"); + LZX_DEBUG("Failed to make the decode table for the " + "aligned offset tree"); return ret; } @@ -398,9 +398,8 @@ lzx_read_block_header(struct input_bitstream *istream, ret = lzx_read_code_lens(istream, tables->maintree_lens, LZX_NUM_CHARS); if (ret) { - DEBUG("lzx_decompress(): Failed to read the code " - "lengths for the first 256 elements of the " - "main tree"); + LZX_DEBUG("Failed to read the code lengths for the " + "first 256 elements of the main tree"); return ret; } @@ -413,9 +412,8 @@ lzx_read_block_header(struct input_bitstream *istream, tables->maintree_lens + LZX_NUM_CHARS, LZX_MAINTREE_NUM_SYMBOLS - LZX_NUM_CHARS); if (ret) { - DEBUG("lzx_decompress(): Failed to read the path " - "lengths for the remaining elements of the main " - "tree"); + LZX_DEBUG("Failed to read the path lengths for the " + "remaining elements of the main tree"); return ret; } @@ -428,8 +426,8 @@ lzx_read_block_header(struct input_bitstream *istream, tables->maintree_lens, LZX_MAX_CODEWORD_LEN); if (ret) { - DEBUG("lzx_decompress(): Failed to make the decode " - "table for the main tree"); + LZX_DEBUG("Failed to make the decode " + "table for the main tree"); return ret; } @@ -437,8 +435,8 @@ lzx_read_block_header(struct input_bitstream *istream, ret = lzx_read_code_lens(istream, tables->lentree_lens, LZX_LENTREE_NUM_SYMBOLS); if (ret) { - DEBUG("lzx_decompress(): Failed to read the path " - "lengths for the length tree"); + LZX_DEBUG("Failed to read the path " + "lengths for the length tree"); return ret; } @@ -449,8 +447,7 @@ lzx_read_block_header(struct input_bitstream *istream, tables->lentree_lens, LZX_MAX_CODEWORD_LEN); if (ret) { - DEBUG("lzx_decompress(): Failed to build the length " - "Huffman tree"); + LZX_DEBUG("Failed to build the length Huffman tree"); return ret; } /* The bitstream 
of compressed literals and matches for this @@ -466,16 +463,16 @@ lzx_read_block_header(struct input_bitstream *istream, * the next 16 bits. */ if (istream->bitsleft == 0) { if (istream->data_bytes_left < 14) { - DEBUG("lzx_decompress(): Insufficient length in " - "uncompressed block"); + LZX_DEBUG("Insufficient length in " + "uncompressed block"); return -1; } istream->data += 2; istream->data_bytes_left -= 2; } else { if (istream->data_bytes_left < 12) { - DEBUG("lzx_decompress(): Insufficient length in " - "uncompressed block"); + LZX_DEBUG("Insufficient length in " + "uncompressed block"); return -1; } istream->bitsleft = 0; @@ -490,7 +487,7 @@ lzx_read_block_header(struct input_bitstream *istream, * be read in lzx_decompress(). */ break; default: - DEBUG("lzx_decompress(): Found invalid block"); + LZX_DEBUG("Found invalid block"); return -1; } *block_type_ret = block_type; @@ -538,7 +535,7 @@ lzx_decode_match(unsigned main_element, int block_type, unsigned bytes_remaining, u8 *window, unsigned window_pos, const struct lzx_tables *tables, - struct lru_queue *queue, + struct lzx_lru_queue *queue, struct input_bitstream *istream) { unsigned length_header; @@ -658,15 +655,16 @@ lzx_decode_match(unsigned main_element, int block_type, * position. */ if (match_len > bytes_remaining) { - DEBUG("lzx_decode_match(): Match of length %u bytes overflows " - "uncompressed block size", match_len); + LZX_DEBUG("Match of length %u bytes overflows " + "uncompressed block size", match_len); return -1; } if (match_offset > window_pos) { - DEBUG("lzx_decode_match(): Match of length %u bytes references " - "data before window (match_offset = %u, window_pos = %u)", - match_len, match_offset, window_pos); + LZX_DEBUG("Match of length %u bytes references " + "data before window (match_offset = %u, " + "window_pos = %u)", + match_len, match_offset, window_pos); return -1; } @@ -767,7 +765,7 @@ lzx_decompress_block(int block_type, unsigned block_size, u8 *window, unsigned window_pos, const struct lzx_tables *tables, - struct lru_queue *queue, + struct lzx_lru_queue *queue, struct input_bitstream *istream) { unsigned main_element; @@ -810,7 +808,7 @@ wimlib_lzx_decompress(const void *compressed_data, unsigned compressed_len, { struct lzx_tables tables; struct input_bitstream istream; - struct lru_queue queue; + struct lzx_lru_queue queue; unsigned window_pos; unsigned block_size; unsigned block_type; @@ -853,9 +851,9 @@ wimlib_lzx_decompress(const void *compressed_data, unsigned compressed_len, block_size, window_pos); if (block_size > uncompressed_len - window_pos) { - DEBUG("lzx_decompress(): Expected a block size of at " - "most %u bytes (found %u bytes)", - uncompressed_len - window_pos, block_size); + LZX_DEBUG("Expected a block size of at " + "most %u bytes (found %u bytes)", + uncompressed_len - window_pos, block_size); return -1; } @@ -881,10 +879,10 @@ wimlib_lzx_decompress(const void *compressed_data, unsigned compressed_len, case LZX_BLOCKTYPE_UNCOMPRESSED: LZX_DEBUG("LZX_BLOCKTYPE_UNCOMPRESSED"); if (istream.data_bytes_left < block_size) { - DEBUG("Unexpected end of input when " - "reading %u bytes from LZX bitstream " - "(only have %u bytes left)", - block_size, istream.data_bytes_left); + LZX_DEBUG("Unexpected end of input when " + "reading %u bytes from LZX bitstream " + "(only have %u bytes left)", + block_size, istream.data_bytes_left); return -1; } memcpy(&((u8*)uncompressed_data)[window_pos], istream.data, diff --git a/src/metadata_resource.c b/src/metadata_resource.c index 25590369..006871c6 
100644 --- a/src/metadata_resource.c +++ b/src/metadata_resource.c @@ -299,7 +299,8 @@ write_metadata_resource(WIMStruct *wim, int image, int write_resource_flags) wim->compression_type, &imd->metadata_lte->output_resource_entry, imd->metadata_lte->hash, - write_resource_flags); + write_resource_flags, + &wim->lzx_context); /* Original checksum was overridden; set a flag so it isn't used. */ imd->metadata_lte->dont_check_metadata_hash = 1; diff --git a/src/wim.c b/src/wim.c index e1cda332..bf6ebbf0 100644 --- a/src/wim.c +++ b/src/wim.c @@ -775,6 +775,7 @@ wimlib_free(WIMStruct *wim) if (filedes_valid(&wim->out_fd)) filedes_close(&wim->out_fd); + wimlib_lzx_free_context(wim->lzx_context); free_lookup_table(wim->lookup_table); diff --git a/src/write.c b/src/write.c index 3ddf0142..0639fd76 100644 --- a/src/write.c +++ b/src/write.c @@ -68,6 +68,42 @@ # include /* for `struct iovec' */ #endif +static int +alloc_lzx_context(int write_resource_flags, struct wimlib_lzx_context **ctx_pp) +{ + struct wimlib_lzx_params params; + params.size_of_this = sizeof(params); + if (write_resource_flags & WIMLIB_WRITE_RESOURCE_FLAG_COMPRESS_SLOW) + params.algorithm = WIMLIB_LZX_ALGORITHM_SLOW; + else + params.algorithm = WIMLIB_LZX_ALGORITHM_FAST; + params.use_defaults = 1; + return wimlib_lzx_alloc_context(¶ms, ctx_pp); +} + +static unsigned +compress_chunk(const void * uncompressed_data, + unsigned uncompressed_len, + void *compressed_data, + int out_ctype, + struct wimlib_lzx_context *comp_ctx) +{ + switch (out_ctype) { + case WIMLIB_COMPRESSION_TYPE_XPRESS: + return wimlib_xpress_compress(uncompressed_data, + uncompressed_len, + compressed_data); + case WIMLIB_COMPRESSION_TYPE_LZX: + return wimlib_lzx_compress2(uncompressed_data, + uncompressed_len, + compressed_data, + comp_ctx); + default: + wimlib_assert(0); + return 0; + } +} + /* Chunk table that's located at the beginning of each compressed resource in * the WIM. (This is not the on-disk format; the on-disk format just has an * array of offsets.) */ @@ -158,41 +194,6 @@ chunk_tab_record_chunk(struct chunk_table *chunk_tab, unsigned out_chunk_size) } } -/* - * compress_func_t- Pointer to a function to compresses a chunk - * of a WIM resource. This may be either - * wimlib_xpress_compress() (xpress-compress.c) or - * wimlib_lzx_compress() (lzx-compress.c). - * - * @chunk: Uncompressed data of the chunk. - * @chunk_size: Size of the uncompressed chunk, in bytes. - * @out: Pointer to output buffer of size at least (@chunk_size - 1) bytes. - * - * Returns the size of the compressed data written to @out in bytes, or 0 if the - * data could not be compressed to (@chunk_size - 1) bytes or fewer. - * - * As a special requirement, the compression code is optimized for the WIM - * format and therefore requires (@chunk_size <= 32768). - * - * As another special requirement, the compression code will read up to 8 bytes - * off the end of the @chunk array for performance reasons. The values of these - * bytes will not affect the output of the compression, but the calling code - * must make sure that the buffer holding the uncompressed chunk is actually at - * least (@chunk_size + 8) bytes, or at least that these extra bytes are in - * mapped memory that will not cause a memory access violation if accessed. 
- */ -typedef unsigned (*compress_func_t)(const void *chunk, unsigned chunk_size, - void *out); - -static compress_func_t -get_compress_func(int out_ctype) -{ - if (out_ctype == WIMLIB_COMPRESSION_TYPE_LZX) - return wimlib_lzx_compress; - else - return wimlib_xpress_compress; -} - /* Finishes a WIM chunk table and writes it to the output file at the correct * offset. */ static int @@ -284,7 +285,8 @@ finalize_and_check_sha1(SHA_CTX *sha_ctx, struct wim_lookup_table_entry *lte) } struct write_resource_ctx { - compress_func_t compress; + int out_ctype; + struct wimlib_lzx_context *comp_ctx; struct chunk_table *chunk_tab; struct filedes *out_fd; SHA_CTX sha_ctx; @@ -305,15 +307,17 @@ write_resource_cb(const void *chunk, size_t chunk_size, void *_ctx) out_chunk = chunk; out_chunk_size = chunk_size; - if (ctx->compress) { + if (ctx->out_ctype != WIMLIB_COMPRESSION_TYPE_NONE) { void *compressed_chunk; unsigned compressed_size; /* Compress the chunk. */ compressed_chunk = alloca(chunk_size); - compressed_size = (*ctx->compress)(chunk, chunk_size, - compressed_chunk); + compressed_size = compress_chunk(chunk, chunk_size, + compressed_chunk, + ctx->out_ctype, + ctx->comp_ctx); /* Use compressed data if compression to less than input size * was successful. */ if (compressed_size) { @@ -371,9 +375,11 @@ error: * On success, this is filled in with the offset, flags, compressed size, * and uncompressed size of the resource in the output WIM. * - * @write_resource_flags: + * @resource_flags: * * WIMLIB_WRITE_RESOURCE_FLAG_RECOMPRESS to force data to be recompressed even * if it could otherwise be copied directly from the input; + * * WIMLIB_WRITE_RESOURCE_FLAG_COMPRESS_SLOW to compress the data as much + * as possible; * * WIMLIB_WRITE_RESOURCE_FLAG_PIPABLE if writing a resource for a pipable WIM * (and the output file descriptor may be a pipe). * @@ -387,7 +393,8 @@ int write_wim_resource(struct wim_lookup_table_entry *lte, struct filedes *out_fd, int out_ctype, struct resource_entry *out_res_entry, - int resource_flags) + int resource_flags, + struct wimlib_lzx_context **comp_ctx) { struct write_resource_ctx write_ctx; off_t res_start_offset; @@ -430,15 +437,23 @@ write_wim_resource(struct wim_lookup_table_entry *lte, read_size = lte->resource_entry.original_size; } + /* If the output resource is to be compressed, initialize the chunk * table and set the function to use for chunk compression. Exceptions: * no compression function is needed if doing a raw copy; also, no chunk * table is needed if doing a *full* (not per-chunk) raw copy. 
*/ - write_ctx.compress = NULL; + write_ctx.out_ctype = WIMLIB_COMPRESSION_TYPE_NONE; write_ctx.chunk_tab = NULL; if (out_ctype != WIMLIB_COMPRESSION_TYPE_NONE) { - if (!(resource_flags & WIMLIB_READ_RESOURCE_FLAG_RAW)) - write_ctx.compress = get_compress_func(out_ctype); + if (!(resource_flags & WIMLIB_READ_RESOURCE_FLAG_RAW)) { + write_ctx.out_ctype = out_ctype; + if (out_ctype == WIMLIB_COMPRESSION_TYPE_LZX) { + ret = alloc_lzx_context(resource_flags, comp_ctx); + if (ret) + goto out; + } + write_ctx.comp_ctx = *comp_ctx; + } if (!(resource_flags & WIMLIB_READ_RESOURCE_FLAG_RAW_FULL)) { ret = begin_wim_resource_chunk_tab(lte, out_fd, &write_ctx.chunk_tab, @@ -514,7 +529,7 @@ try_write_again: goto out_free_chunk_tab; out_ctype = WIMLIB_COMPRESSION_TYPE_NONE; FREE(write_ctx.chunk_tab); - write_ctx.compress = NULL; + write_ctx.out_ctype = WIMLIB_COMPRESSION_TYPE_NONE; write_ctx.chunk_tab = NULL; write_ctx.doing_sha = false; goto try_write_again; @@ -550,7 +565,8 @@ write_wim_resource_from_buffer(const void *buf, size_t buf_size, int reshdr_flags, struct filedes *out_fd, int out_ctype, struct resource_entry *out_res_entry, - u8 *hash_ret, int write_resource_flags) + u8 *hash_ret, int write_resource_flags, + struct wimlib_lzx_context **comp_ctx) { /* Set up a temporary lookup table entry to provide to * write_wim_resource(). */ @@ -570,7 +586,7 @@ write_wim_resource_from_buffer(const void *buf, size_t buf_size, } ret = write_wim_resource(<e, out_fd, out_ctype, out_res_entry, - write_resource_flags); + write_resource_flags, comp_ctx); if (ret) return ret; if (hash_ret) @@ -671,7 +687,8 @@ shared_queue_get(struct shared_queue *q) struct compressor_thread_params { struct shared_queue *res_to_compress_queue; struct shared_queue *compressed_res_queue; - compress_func_t compress; + int out_ctype; + struct wimlib_lzx_context *comp_ctx; }; #define MAX_CHUNKS_PER_MSG 2 @@ -689,12 +706,18 @@ struct message { }; static void -compress_chunks(struct message *msg, compress_func_t compress) +compress_chunks(struct message *msg, int out_ctype, + struct wimlib_lzx_context *comp_ctx) { for (unsigned i = 0; i < msg->num_chunks; i++) { - unsigned len = compress(msg->uncompressed_chunks[i], - msg->uncompressed_chunk_sizes[i], - msg->compressed_chunks[i]); + unsigned len; + + len = compress_chunk(msg->uncompressed_chunks[i], + msg->uncompressed_chunk_sizes[i], + msg->compressed_chunks[i], + out_ctype, + comp_ctx); + void *out_chunk; unsigned out_len; if (len) { @@ -722,12 +745,11 @@ compressor_thread_proc(void *arg) struct compressor_thread_params *params = arg; struct shared_queue *res_to_compress_queue = params->res_to_compress_queue; struct shared_queue *compressed_res_queue = params->compressed_res_queue; - compress_func_t compress = params->compress; struct message *msg; DEBUG("Compressor thread ready"); while ((msg = shared_queue_get(res_to_compress_queue)) != NULL) { - compress_chunks(msg, compress); + compress_chunks(msg, params->out_ctype, params->comp_ctx); shared_queue_put(compressed_res_queue, msg); } DEBUG("Compressor thread terminating"); @@ -791,6 +813,7 @@ do_write_streams_progress(struct write_streams_progress_data *progress_data, struct serial_write_stream_ctx { struct filedes *out_fd; int out_ctype; + struct wimlib_lzx_context **comp_ctx; int write_resource_flags; }; @@ -800,7 +823,8 @@ serial_write_stream(struct wim_lookup_table_entry *lte, void *_ctx) struct serial_write_stream_ctx *ctx = _ctx; return write_wim_resource(lte, ctx->out_fd, ctx->out_ctype, <e->output_resource_entry, - 
ctx->write_resource_flags); + ctx->write_resource_flags, + ctx->comp_ctx); } @@ -898,6 +922,7 @@ do_write_stream_list_serial(struct list_head *stream_list, struct wim_lookup_table *lookup_table, struct filedes *out_fd, int out_ctype, + struct wimlib_lzx_context **comp_ctx, int write_resource_flags, struct write_streams_progress_data *progress_data) { @@ -905,6 +930,7 @@ do_write_stream_list_serial(struct list_head *stream_list, .out_fd = out_fd, .out_ctype = out_ctype, .write_resource_flags = write_resource_flags, + .comp_ctx = comp_ctx, }; return do_write_stream_list(stream_list, lookup_table, @@ -920,6 +946,8 @@ write_flags_to_resource_flags(int write_flags) if (write_flags & WIMLIB_WRITE_FLAG_RECOMPRESS) resource_flags |= WIMLIB_WRITE_RESOURCE_FLAG_RECOMPRESS; + if (write_flags & WIMLIB_WRITE_FLAG_COMPRESS_SLOW) + resource_flags |= WIMLIB_WRITE_RESOURCE_FLAG_COMPRESS_SLOW; if (write_flags & WIMLIB_WRITE_FLAG_PIPABLE) resource_flags |= WIMLIB_WRITE_RESOURCE_FLAG_PIPABLE; return resource_flags; @@ -930,6 +958,7 @@ write_stream_list_serial(struct list_head *stream_list, struct wim_lookup_table *lookup_table, struct filedes *out_fd, int out_ctype, + struct wimlib_lzx_context **comp_ctx, int write_resource_flags, struct write_streams_progress_data *progress_data) { @@ -945,6 +974,7 @@ write_stream_list_serial(struct list_head *stream_list, lookup_table, out_fd, out_ctype, + comp_ctx, write_resource_flags, progress_data); } @@ -994,6 +1024,7 @@ struct main_writer_thread_ctx { struct filedes *out_fd; off_t res_start_offset; int out_ctype; + struct wimlib_lzx_context **comp_ctx; int write_resource_flags; struct shared_queue *res_to_compress_queue; struct shared_queue *compressed_res_queue; @@ -1215,7 +1246,8 @@ receive_compressed_chunks(struct main_writer_thread_ctx *ctx) ctx->out_fd, WIMLIB_COMPRESSION_TYPE_NONE, &cur_lte->output_resource_entry, - ctx->write_resource_flags); + ctx->write_resource_flags, + ctx->comp_ctx); if (ret) return ret; } else { @@ -1254,6 +1286,7 @@ receive_compressed_chunks(struct main_writer_thread_ctx *ctx) ctx->lookup_table, ctx->out_fd, ctx->out_ctype, + ctx->comp_ctx, ctx->write_resource_flags, ctx->progress_data); if (ret) @@ -1352,6 +1385,7 @@ main_writer_thread_finish(void *_ctx) ctx->lookup_table, ctx->out_fd, ctx->out_ctype, + ctx->comp_ctx, ctx->write_resource_flags, ctx->progress_data); } @@ -1447,6 +1481,7 @@ write_stream_list_parallel(struct list_head *stream_list, struct wim_lookup_table *lookup_table, struct filedes *out_fd, int out_ctype, + struct wimlib_lzx_context **comp_ctx, int write_resource_flags, struct write_streams_progress_data *progress_data, unsigned num_threads) @@ -1488,21 +1523,36 @@ write_stream_list_parallel(struct list_head *stream_list, if (ret) goto out_destroy_res_to_compress_queue; - struct compressor_thread_params params; - params.res_to_compress_queue = &res_to_compress_queue; - params.compressed_res_queue = &compressed_res_queue; - params.compress = get_compress_func(out_ctype); + struct compressor_thread_params *params; + + params = CALLOC(num_threads, sizeof(params[0])); + if (params == NULL) { + ret = WIMLIB_ERR_NOMEM; + goto out_destroy_compressed_res_queue; + } + + for (unsigned i = 0; i < num_threads; i++) { + params[i].res_to_compress_queue = &res_to_compress_queue; + params[i].compressed_res_queue = &compressed_res_queue; + params[i].out_ctype = out_ctype; + if (out_ctype == WIMLIB_COMPRESSION_TYPE_LZX) { + ret = alloc_lzx_context(write_resource_flags, + ¶ms[i].comp_ctx); + if (ret) + goto out_free_params; + } + } 
compressor_threads = MALLOC(num_threads * sizeof(pthread_t)); if (!compressor_threads) { ret = WIMLIB_ERR_NOMEM; - goto out_destroy_compressed_res_queue; + goto out_free_params; } for (unsigned i = 0; i < num_threads; i++) { DEBUG("pthread_create thread %u of %u", i + 1, num_threads); ret = pthread_create(&compressor_threads[i], NULL, - compressor_thread_proc, ¶ms); + compressor_thread_proc, ¶ms[i]); if (ret != 0) { ret = -1; ERROR_WITH_ERRNO("Failed to create compressor " @@ -1523,6 +1573,7 @@ write_stream_list_parallel(struct list_head *stream_list, ctx.lookup_table = lookup_table; ctx.out_fd = out_fd; ctx.out_ctype = out_ctype; + ctx.comp_ctx = comp_ctx; ctx.res_to_compress_queue = &res_to_compress_queue; ctx.compressed_res_queue = &compressed_res_queue; ctx.num_messages = queue_size; @@ -1558,6 +1609,10 @@ out_join: } } FREE(compressor_threads); +out_free_params: + for (unsigned i = 0; i < num_threads; i++) + wimlib_lzx_free_context(params[i].comp_ctx); + FREE(params); out_destroy_compressed_res_queue: shared_queue_destroy(&compressed_res_queue); out_destroy_res_to_compress_queue: @@ -1571,6 +1626,7 @@ out_serial_quiet: lookup_table, out_fd, out_ctype, + comp_ctx, write_resource_flags, progress_data); @@ -1584,7 +1640,9 @@ out_serial_quiet: static int write_stream_list(struct list_head *stream_list, struct wim_lookup_table *lookup_table, - struct filedes *out_fd, int out_ctype, int write_flags, + struct filedes *out_fd, int out_ctype, + struct wimlib_lzx_context **comp_ctx, + int write_flags, unsigned num_threads, wimlib_progress_func_t progress_func) { struct wim_lookup_table_entry *lte; @@ -1653,6 +1711,7 @@ write_stream_list(struct list_head *stream_list, lookup_table, out_fd, out_ctype, + comp_ctx, write_resource_flags, &progress_data, num_threads); @@ -1662,6 +1721,7 @@ write_stream_list(struct list_head *stream_list, lookup_table, out_fd, out_ctype, + comp_ctx, write_resource_flags, &progress_data); if (ret == 0) @@ -1984,6 +2044,7 @@ write_wim_streams(WIMStruct *wim, int image, int write_flags, wim->lookup_table, &wim->out_fd, wim->compression_type, + &wim->lzx_context, write_flags, num_threads, progress_func); @@ -2044,7 +2105,8 @@ write_wim_metadata_resources(WIMStruct *wim, int image, int write_flags, &wim->out_fd, wim->compression_type, &imd->metadata_lte->output_resource_entry, - write_resource_flags); + write_resource_flags, + &wim->lzx_context); } if (ret) return ret; @@ -2869,6 +2931,7 @@ overwrite_wim_inplace(WIMStruct *wim, int write_flags, wim->lookup_table, &wim->out_fd, wim->compression_type, + &wim->lzx_context, write_flags, num_threads, progress_func); diff --git a/src/xml.c b/src/xml.c index ffb25e16..b22bc094 100644 --- a/src/xml.c +++ b/src/xml.c @@ -1515,7 +1515,8 @@ write_wim_xml_data(WIMStruct *wim, int image, u64 total_bytes, WIMLIB_COMPRESSION_TYPE_NONE, out_res_entry, NULL, - write_resource_flags); + write_resource_flags, + &wim->lzx_context); FREE(xml_data); return ret; }
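
As background for lzx_read_block_header() above, a WIM-LZX block header is 3 bits of block type (LZX_BLOCKTYPE_NBITS) followed by one flag bit: a 1 bit selects the default block size of 32768, while a 0 bit is followed by a 16-bit explicit size (LZX_BLOCKSIZE_NBITS). The sketch below shows the writer's side of that layout using a toy MSB-first bit accumulator; wimlib's real output_bitstream also packs the bits into 16-bit little-endian coding units, which this sketch deliberately omits.

#include <stdint.h>

/* Toy bit accumulator -- illustration only; it never flushes, so it can only
 * hold headers shorter than 32 bits. */
struct toy_bitwriter {
	uint32_t bits;		/* accumulated bits; most recent in the LSBs */
	unsigned count;		/* number of valid bits in 'bits' */
};

static void
toy_put_bits(struct toy_bitwriter *w, uint32_t value, unsigned n)
{
	w->bits = (w->bits << n) | (value & ((1u << n) - 1));
	w->count += n;
}

/* Emit a block header with the layout lzx_read_block_header() parses:
 * 3 bits of block type, then the block-size encoding. */
static void
toy_write_block_header(struct toy_bitwriter *w,
		       unsigned block_type, unsigned block_size)
{
	toy_put_bits(w, block_type, 3);		 /* LZX_BLOCKTYPE_NBITS */
	if (block_size == 32768) {
		toy_put_bits(w, 1, 1);		 /* default-size flag */
	} else {
		toy_put_bits(w, 0, 1);
		toy_put_bits(w, block_size, 16); /* LZX_BLOCKSIZE_NBITS */
	}
}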
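
The call instruction preprocessing in do_call_insn_preprocessing() is only safe because it can be undone exactly. The self-contained sketch below restates the forward translation from the patch and shows one way to realize the inverse that undo_call_insn_preprocessing() in lzx-decompress.c performs. The byte-wise little-endian accessors are a simplification of the real code's le32_to_cpu()/cpu_to_le32() pointer casts, and the magic file size is assumed to be 12000000, the value wimlib uses for WIM resources (LZX_WIM_MAGIC_FILESIZE).

#include <stdint.h>

#define MAGIC_FILESIZE 12000000	/* assumed value of LZX_WIM_MAGIC_FILESIZE */

static int32_t
get_le32(const uint8_t *p)
{
	/* two's-complement reinterpretation, as the real code's cast assumes */
	return (int32_t)((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
			 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

static void
put_le32(uint8_t *p, int32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Forward pass (same logic as do_call_insn_translation() above): relative
 * call targets become absolute, so identical call sites produce identical
 * byte patterns that the match finder can exploit. */
static void
e8_forward(uint8_t *data, int size)
{
	for (int i = 0; i < size - 10; i++) {
		if (data[i] == 0xe8) {
			int32_t rel = get_le32(&data[i + 1]);
			if (rel >= -i && rel < MAGIC_FILESIZE) {
				int32_t abs_off;
				if (rel < MAGIC_FILESIZE - i)
					abs_off = rel + i;	/* "good" */
				else
					abs_off = rel - MAGIC_FILESIZE; /* compensating */
				put_le32(&data[i + 1], abs_off);
			}
			i += 4;
		}
	}
}

/* Inverse pass: recover the original relative offsets. */
static void
e8_inverse(uint8_t *data, int size)
{
	for (int i = 0; i < size - 10; i++) {
		if (data[i] == 0xe8) {
			int32_t abs_off = get_le32(&data[i + 1]);
			if (abs_off >= -i && abs_off < MAGIC_FILESIZE) {
				int32_t rel;
				if (abs_off >= 0)
					rel = abs_off - i;	/* undo "good" */
				else
					rel = abs_off + MAGIC_FILESIZE; /* undo compensating */
				put_le32(&data[i + 1], rel);
			}
			i += 4;
		}
	}
}

The round trip is exact: a translated target always lands in [-i, MAGIC_FILESIZE) with nonnegative values marking "good" translations, while any target the forward pass skipped falls outside that interval, so the inverse pass never rewrites the wrong bytes. Both passes also skip the same 4 operand bytes after each 0xe8, so their positions stay in lockstep.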
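
Several places above reset ctx->queue to (struct lzx_lru_queue){1, 1, 1}. This is LZX's recent-offsets queue: the format lets a match reuse one of the three most recently used offsets R0-R2 at a much smaller coded cost than an explicit offset. The sketch below shows the queue discipline as the LZX format defines it; the type and function names are illustrative rather than wimlib's actual ones, and the "offset + 2" formatted-offset mapping comes from the format specification, not from this patch.

/* The three most recently used match offsets, all initialized to 1 at the
 * start of a compressed stream (hence the {1, 1, 1} above). */
struct toy_lru_queue {
	unsigned R[3];
};

/* Convert a real match offset into an LZX "formatted offset" and update the
 * queue: values 0-2 select a repeat offset, anything else codes the offset
 * explicitly as offset + 2. */
static unsigned
toy_format_offset(struct toy_lru_queue *q, unsigned offset)
{
	unsigned tmp;

	if (offset == q->R[0])
		return 0;		/* R0 match: queue unchanged */

	if (offset == q->R[1]) {
		tmp = q->R[0];		/* R1 match: swap R0 and R1 */
		q->R[0] = q->R[1];
		q->R[1] = tmp;
		return 1;
	}

	if (offset == q->R[2]) {
		tmp = q->R[0];		/* R2 match: swap R0 and R2 */
		q->R[0] = q->R[2];
		q->R[2] = tmp;
		return 2;
	}

	/* Explicit offset: push onto the queue, evicting R2. */
	q->R[2] = q->R[1];
	q->R[1] = q->R[0];
	q->R[0] = offset;
	return offset + 2;
}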
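
Finally, a usage sketch of the public API added by this patch: wimlib_lzx_alloc_context(), wimlib_lzx_compress2(), and wimlib_lzx_free_context(). The context pointer handed to wimlib_lzx_alloc_context() must start out NULL (or point to a context from a previous call), since the implementation above reads and frees whatever *ctx_pp holds. Including <wimlib.h> assumes installed headers matching this commit.

#include <string.h>
#include <wimlib.h>

/* Compress one WIM chunk (at most 32768 bytes, per the check in
 * wimlib_lzx_compress2()) with the experimental slow LZX compressor.
 * Returns the compressed size, or 0 if the data did not compress to fewer
 * than in_len bytes -- the caller would then store the chunk uncompressed. */
static unsigned
compress_one_chunk(const void *in, unsigned in_len,
		   void *out /* must hold at least in_len - 1 bytes */)
{
	struct wimlib_lzx_params params;
	struct wimlib_lzx_context *ctx = NULL;	/* must start out NULL */
	unsigned out_len;

	memset(&params, 0, sizeof(params));
	params.size_of_this = sizeof(params);
	params.algorithm = WIMLIB_LZX_ALGORITHM_SLOW;
	params.use_defaults = 1;	/* take the built-in slow-mode defaults */

	if (wimlib_lzx_alloc_context(&params, &ctx))
		return 0;

	out_len = wimlib_lzx_compress2(in, in_len, out, ctx);

	wimlib_lzx_free_context(ctx);
	return out_len;
}

This example allocates and frees the context per call only for brevity; wimlib itself caches the context on the WIMStruct (wim->lzx_context) and reuses it across chunks, as the write.c changes above show, since wimlib_lzx_alloc_context() keeps a compatible existing context rather than reallocating.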