Adapting wimlib to CAB LZX

ncommander
Posts: 1
Joined: Sat Dec 31, 2022 3:21 pm

Adapting wimlib to CAB LZX

Post by ncommander »

Hi,

I've been trying to use wimlib to compress data in the LZX format, but I need the compressed output to be readable by libraries like mspack.
As far as I can tell, there are no other CAB LZX compressor implementations out there besides lzxcomp by Matthew T. Russotto, which performs poorly (over a minute to compress a 50 MB file, with no compression levels or other options available).

So far, I've modified how the block headers are written to match the CAB format, and I now write the LZX stream header that indicates whether Intel E8 translation is used.
I compress the data 32KB at a time (LZX frame) and mspack can decode the output perfectly!
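For reference, here's roughly what that stream header looks like per the LZXD spec: one bit for the E8 flag, then the 32-bit translation size (high word first) if the flag is set. The bit writer below is a made-up helper, not wimlib's actual output bitstream API.

```c
/* Rough sketch of the start of a CAB LZXD stream: one bit saying whether
 * Intel E8 call translation is enabled, and if so the 32-bit translation
 * size, high 16 bits first.  "struct bitwriter" and write_bits() are
 * hypothetical helpers, not wimlib's real bitstream code. */
#include <stdint.h>

struct bitwriter;                                                 /* hypothetical */
extern void write_bits(struct bitwriter *out, uint32_t value, unsigned nbits);

static void
write_cab_lzx_stream_header(struct bitwriter *out, uint32_t e8_file_size)
{
    if (e8_file_size != 0) {
        write_bits(out, 1, 1);                      /* E8 translation enabled      */
        write_bits(out, e8_file_size >> 16, 16);    /* translation size, high word */
        write_bits(out, e8_file_size & 0xFFFF, 16); /* translation size, low word  */
    } else {
        write_bits(out, 0, 1);                      /* E8 translation disabled     */
    }
}
```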
There's just one issue: the decoder needs to be reset for each LZX frame, or the MAINTREE and LENGTH tables fail to build. It seems the code lengths read in one frame carry over into the next. Do you have any idea how this could be fixed, and if so, whether the changes to wimlib's aligned/verbatim block compression would run deep? Alternatively, as a workaround, is there a valid LZX block that would clear these Huffman code lengths?
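To make the question about extra state concrete, this is roughly what I imagine would need to persist across frames: the previous block's code lengths that the new lengths are delta-encoded against, reset once per CAB folder rather than once per 32 KB call. All names below are hypothetical, not wimlib's internals.

```c
/* Sketch of the cross-frame state the CAB variant appears to need.  A CAB
 * decoder keeps the previous code lengths for the whole folder, whereas a
 * per-32 KB compression call that zeroes them each time produces deltas the
 * decoder can't rebuild the trees from. */
#include <stdint.h>
#include <string.h>

#define LZX_MAINCODE_NUM_SYMBOLS  (256 + 8 * 30)  /* 30 position slots for a 32 KB window */
#define LZX_LENCODE_NUM_SYMBOLS   249

struct cab_lzx_stream_state {
    uint8_t prev_main_lens[LZX_MAINCODE_NUM_SYMBOLS];
    uint8_t prev_len_lens[LZX_LENCODE_NUM_SYMBOLS];
};

/* Reset only once per CAB folder, NOT once per 32 KB frame. */
static void
cab_lzx_stream_init(struct cab_lzx_stream_state *s)
{
    memset(s->prev_main_lens, 0, sizeof(s->prev_main_lens));
    memset(s->prev_len_lens, 0, sizeof(s->prev_len_lens));
}
```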

I totally understand if you can't provide any answers, since the topic clearly falls outside the scope of wimlib, but I would appreciate any pointers to which functions would need changing, whether additional state would have to be introduced, or an alternative to wimlib for the task.

Thanks for your time! :)
synchronicity
Site Admin
Posts: 472
Joined: Sun Aug 02, 2015 10:31 pm

Re: Adapting wimlib to CAB LZX

Post by synchronicity »

The LZX compression code in wimlib has been heavily optimized for the case where the data to compress is in a single buffer, **and** matches can be made with the whole buffer. That's what the WIM format uses.

In CAB, there is instead a long stream (potentially gigabytes, I think?), where matches can be made with the last 32KiB only. That's quite a bit different.

The minimal change to get it working would be to support compressing such a stream in a single buffer, with the matchfinder changed to find matches in the last 32KiB only. This is feasible (it's what I do for DEFLATE in https://github.com/ebiggers/libdeflate), but it would still be a significant change.
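Roughly, the matchfinder change amounts to capping the match offset, as in this simplified hash-chain sketch (illustrative only, not wimlib's or libdeflate's actual code):

```c
/* Simplified hash-chain matchfinder sketch showing the essential change for
 * a CAB-style window: reject candidates whose offset exceeds 32 KiB. */
#include <stddef.h>
#include <stdint.h>

#define MAX_MATCH_OFFSET  32768  /* CAB LZX: matches limited to the last 32 KiB */

static uint32_t
longest_match(const uint8_t *data, size_t pos, size_t data_len,
              const uint32_t *hash_chain, uint32_t cand,
              uint32_t *best_offset)
{
    uint32_t best_len = 0;

    /* Walk the chain of earlier positions that share the same hash. */
    while (cand != UINT32_MAX) {
        size_t offset = pos - cand;
        if (offset > MAX_MATCH_OFFSET)
            break;  /* candidate fell out of the 32 KiB window; stop here */

        size_t len = 0;
        while (pos + len < data_len && data[cand + len] == data[pos + len])
            len++;

        if (len > best_len) {
            best_len = (uint32_t)len;
            *best_offset = (uint32_t)offset;
        }
        cand = hash_chain[cand];  /* next-older candidate with the same hash */
    }
    return best_len;
}
```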

To get it working **properly** would require adding streaming support, so that data can be fed in incrementally and gigabytes of memory aren't potentially needed. That would be a much larger change, and I've stayed away from that sort of thing in all the compressors I've written.
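For comparison, streaming support would mean exposing something like the following kind of interface, which wimlib's compressors currently don't have (all names here are made up, purely to illustrate the shape of it):

```c
/* Hypothetical streaming interface, shown only to illustrate what
 * "streaming support" would mean; wimlib does not provide this. */
#include <stddef.h>

struct lzx_cab_cstream;  /* opaque compressor state kept across calls */

struct lzx_cab_cstream *lzx_cab_cstream_create(unsigned window_order);

/* Feed the next piece of input; may be called many times, so the whole
 * multi-gigabyte stream never has to sit in memory at once. */
size_t lzx_cab_cstream_compress(struct lzx_cab_cstream *s,
                                const void *in, size_t in_size,
                                void *out, size_t out_capacity);

void lzx_cab_cstream_free(struct lzx_cab_cstream *s);
```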