Eric Biggers [Sat, 7 Jun 2014 21:46:39 +0000 (16:46 -0500)]
lzx-compress.c: Honor cache_limit
This ensures the match cache is never overrun. If for some reason we
average more than LZX_CACHE_PER_POS (currently 8) matches per position,
_excluding skipped positions_, just don't return any more matches.
Eric Biggers [Mon, 2 Jun 2014 03:01:14 +0000 (22:01 -0500)]
Switch from suffix array match-finder to binary tree match-finder
This uses less memory (8 bytes overhead per position vs. 14), is faster,
requires less code (no libdivsufsort), and in some tests actually results
in a better compression ratio.
A binary tree match-finder was used in wimlib v1.5.2 but it didn't seem
as good then, probably because it was combined with a slow block division
algorithm for LZX.
Repeat offsets are now handled differently. The binary tree match-finder
itself doesn't have any logic for repeat offsets at all; instead, the
match-choosing code (now implemented separately for LZX and LZMS) now
does special checks for matches at repeat offsets.
Since less memory is used for match-finding now, increase the default
LZMS pack chunk size from 2^25 (33554432) to 2^26 (67108864).
Eric Biggers [Sat, 31 May 2014 15:12:28 +0000 (10:12 -0500)]
inode_fixup.c: Fix check for directory hard links
We shouldn't assume that the attributes are consistent, so we should
check both ways for directory hard links. Specifically, in the case
where the being-inserted dentry is not marked as a directory but for some
reason it shares an inode number with a dentry marked as a directory, we
want to detect that as a directory hard link.
Eric Biggers [Thu, 29 May 2014 05:42:54 +0000 (00:42 -0500)]
Optimize Huffman code generation
This commit significantly improves the performance of length-limited
canonical Huffman code generation by introducing several optimizations
based on the 7-Zip implementation.
Altlhough Huffman code generation is not the main bottleneck of any of
the compression algorithms implemented here, an optimized implementation
can still improve overall performance by several percent. In addition,
it significantly speeds up LZMS decompression, which requires frequent
rebuilding of adaptive Huffman codes.
Some peformance comparisons:
- The average time taken to generate all three LZX codes (main,
length, and aligned offset) when compressing a Windows PE
filesystem with LZX decreased from 73.9 us to 17.3 us.
- The time taken to compress a Windows PE filesystem with LZX, using
the "fast" LZX algorithm (hash chain match finder and lazy
parsing), decreased from 12.3 s to 11.6 s (a 5.7% improvement).
- The time taken to decompress a Windows PE filesystem with LZMS,
using solid blocks, decreased from 12.3 s to 9.0 s (a 27%
improvement).
If the position footer is unconditionally calculated as the match offset
minus the position base value, the (ultimately unused) position footer
for repeat matches can overflow the number of bits in which it is stored
in the intermediate representation used by this implementation. For now,
use the old version, which would set the position footers of repeat
matches to 0.
Eric Biggers [Tue, 27 May 2014 21:51:24 +0000 (16:51 -0500)]
Faster Huffman symbol decoding
When decoding a codeword short enough for a direct mapping, we can read
the codeword length at the same time we read the symbol itself. This
speeds up Huffman decoding slightly.
This commit also updates and improves the comments for
make_huffman_decode_table().
Eric Biggers [Tue, 27 May 2014 16:03:07 +0000 (11:03 -0500)]
LZX, XPRESS decompression: Return 0 bits on overrun
If the compressed data is invalid such that the compressed data buffer is
overrun, it's simpler to just return 0 bits instead of explicitly
checking the return value at every call site of bitstream_read_bits() and
read_huffsym().
This doesn't necessarily mean that invalid data will go undetected. Just
for LZX decompression, chances are there will be another problem if all
0's start being returned (e.g. invalid match or invalid Huffman tree).
For WIM operations like extraction, the uncompressed data is checked with
SHA-1 message digests anyway, so it's virtually impossible for corruption
to go undetected.
Eric Biggers [Tue, 27 May 2014 02:22:23 +0000 (21:22 -0500)]
resource.c: Don't call lseek() if not necessary
To be reading a pipable resource from a pipe, is_pipable must be set in
the 'struct wim_resource_spec', so check that first before calling
filedes_is_seekable().
Eric Biggers [Mon, 26 May 2014 04:41:18 +0000 (23:41 -0500)]
lzx-compress.c: Disable verification by default
The algorithm seems to be sufficiently well tested now. And the data is
checked with SHA-1 message digests anyway. This slightly improves LZX
compression performance.
Eric Biggers [Sat, 24 May 2014 02:43:19 +0000 (21:43 -0500)]
Update mount implementation
- Improve comments and documentation
- Speed up fd allocation by tracking lower bound on next available slot
- Open staging files by directory-relative names
- Simplify, and hopefully improve, how unmounting images works.
Eric Biggers [Fri, 23 May 2014 01:39:31 +0000 (20:39 -0500)]
Recognize tagged metadata items and use for UNIX data
This is undocumented, but the Microsoft implementation seems to allow
variable-length, tagged metadata items to be appended to WIM dentries.
Currently it uses them for Object IDs, which DISM (Windows 8.1) will
backup and restore by default.
This commit adds support for these items, so they can be read and written
unmodified.
In addition, for our UNIX data extension, instead of storing the UNIX
data in the reserved fields of the dentry or in alternate data streams,
store it in a custom tagged item with a randomly chosen tag. This is
perhaps the best choice compatibility-wise.
Eric Biggers [Thu, 22 May 2014 15:50:47 +0000 (10:50 -0500)]
Update progress functions
- Register progress function with WIM instead of explicitly specifying it
with each function call
- Add support for user-supplied context
- Add support for aborting the current operation by returning a status
from the progress function
Eric Biggers [Thu, 22 May 2014 03:21:20 +0000 (22:21 -0500)]
New format for UNIX data
This moves the UNIX data into reserved fields of the WIM dentry.
This is theoretically less extensible than the previous format, which was
to add a specially-named alternate data stream entry to each file with
UNIX data. However, having the UNIX data present in the metadata
resource is simpler and avoids problems when doing sequential extraction.
For now, this also seems to maintain compatibility with the MS
implementation, since it seems simply ignore the reserve fields.
Eric Biggers [Mon, 19 May 2014 17:05:11 +0000 (12:05 -0500)]
Merge branch 'new_extract'
This rewrites much of the extraction code to give more freedom to
extraction backends about how they extract the files. This has
advantages for all three backends:
UNIX: Now does a lot fewer path lookups and should exhibit better data
locality. Final "set timestamps" pass only processes directories, since
nondirectory timestamps are set before that point.
NTFS-3g: Now completely based on inode numbers; no paths are ever
created, unless a printable path is needed for an error message. Got rid
of final "set attributes" and "set timestamps" passes, since all
attributes and timestamps are set before that point.
Windows: Now uses NT paths relative to the target directory, performs
fewer path lookups, extracts encrypted files without individually reading
ranges from the WIM (expensive for ESD files), and a few other changes.
A lot of code previously in the generic file extract.c actually could be
moved here.
Also, there are no longer any temporary files needed. All three backends
simply write all the instances of each stream at the same time, which is
more efficient.
Still TODO:
- New format for UNIX data
- Add fallbacks for when number of open files gets too high
- Add fallback for huge encrypted files
- If possible add hardlink and symlink modes back, however these
are pretty useless.
Eric Biggers [Sat, 17 May 2014 06:09:06 +0000 (01:09 -0500)]
inode_fixup.c: Further simplify MS bug workaround
Based on what WIMGAPI does, and assuming it sufficiently works around its
own [presumably fixed] bug, the workaround can be simplified to just
"don't hard link dentries that have different unnamed data streams".
Eric Biggers [Sat, 17 May 2014 03:08:16 +0000 (22:08 -0500)]
inode_fixup.c: Simplify logic
Most of the logic in this file is working around a Microsoft bug, but
there must be an easier way to do it. This commits strips the code down
to something more logical that still seems to "do the right thing" on a
Windows 7 WIM affected by the Microsoft bug.
Eric Biggers [Fri, 16 May 2014 01:13:34 +0000 (20:13 -0500)]
Rewrite code for capture rpfix
The previous code had many potential problems on the Windows side.
Targets of Windows native symbolic links (except those with
SYMBOLIC_LINK_RELATIVE set) and junction points are stored in the reparse
points as NT namespace paths. Therefore they can validly include
prefixes like '\??\', '\DosDevices\', or '\Device\HardDiskVolume1'. They
potentially do not use drive letters at all, and they may be over
MAX_PATH characters. Such paths should be accessed using the native API
(NtOpenFile), not Win32 (CreateFile). This commit changes the code to do
so.