process stays at 99% sometime and write a lot to disk

Gehard
Posts: 18
Joined: Wed Mar 23, 2016 7:52 am

process stays at 99% sometime and write a lot to disk

Post by Gehard »

On one of our folders (containing a 40 GB partition dump made with dd, without compression, and some small checksum files), wimlib was really fast until 99%. At that point, wimlib's CPU usage dropped to 1%, but the source and destination HDDs kept reading and writing for quite a long time. We waited 5 minutes and then killed the process, because it wouldn't complete and we weren't sure where all this data was being written. The file size shown in Explorer wasn't growing anymore, even though Resource Monitor reported wimlib reading and writing at 60 MB/sec, i.e. about 18 GB over 5 minutes, without it being visible anywhere (free space on the partition wasn't changing either).
The operation was done on Win 7 x64 SP1 with default options.
synchronicity
Site Admin
Posts: 474
Joined: Sun Aug 02, 2015 10:31 pm

Re: process stays at 99% sometime and write a lot to disk

Post by synchronicity »

Was this on a system with a lot of memory? It's possible the operating system was just catching up with previously buffered writes. On modern operating systems, data written by applications is usually stored in memory and flushed to disk asynchronously. Applications do not have much, if any, visibility into this.

If you are concerned that wimlib is doing unnecessary work regardless, please use Process Monitor (https://technet.microsoft.com/en-us/sys ... nitor.aspx) to dump the list of system calls being made by wimlib-imagex.exe during the relevant time period.
Gehard
Posts: 18
Joined: Wed Mar 23, 2016 7:52 am

Re: process stays at 99% sometime and write a lot to disk

Post by Gehard »

Yes, the system has 16 GB of memory. We will look into it in more depth.
Gehard
Posts: 18
Joined: Wed Mar 23, 2016 7:52 am

Re: process stays at 99% sometime and write a lot to disk

Post by Gehard »

OK, Process Monitor revealed it's due to a big .7z file. Of course this file was the last one, so it took more than a day to find it :D
1) It appears that compression stops when archiving 7z files (8% CPU used), which is a really good idea since they are not further compressible. But .wim archives are being recompressed (they use 99% CPU), which seems strange.
2) Process Monitor showed that it was reading and writing at increasing offsets over time, so this time we waited, and it completed.
3) But the strange thing is that it took more than an hour for the last GB. I don't know if that's really the case, or if the "xxx of yyy" progress includes the file currently being processed. That would explain why it looked like it was taking an hour for 1 GB when in fact it still had a hundred GB to process. On the other hand, when files already present in the wim are being processed, the numbers are not updated; this suggests that the processed GB count is only updated after a file has been fully processed...
What is wimlib expected to do in 1) and 3)?
synchronicity
Site Admin
Posts: 474
Joined: Sun Aug 02, 2015 10:31 pm

Re: process stays at 99% sometime and write a lot to disk

Post by synchronicity »

1.) The first thing probably appears to take little CPU time because that particular workload happens to be I/O bound. I don't understand what you mean by "wim archives are being recompressed".

3.) The "Archiving file data" progress is reported in terms of bytes, not files. So it will continue to be updated while a large file is being archived.

Is it possible that your 7z file is incredibly fragmented and that is slowing down the process, regardless of what wimlib (or any other application, for that matter) does? Potentially useful: how long does it take to 'cat' or 'type' the file's contents for the first time after rebooting?
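A full sequential read can be timed with a short sketch like the following (Python for illustration; `time_full_read` is a made-up name, and on Windows `type bigfile.7z > NUL` right after a reboot gives the same information):

```python
import time

def time_full_read(path, bufsize=1024 * 1024):
    """Read the whole file sequentially and time it. A heavily fragmented
    file will read far more slowly than the drive's sequential throughput
    would suggest, regardless of which application reads it."""
    start = time.monotonic()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            total += len(chunk)
    elapsed = time.monotonic() - start
    return total, elapsed
```

If the measured throughput is far below the drive's rated sequential speed, fragmentation (or the drive itself) is the bottleneck, not wimlib.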
Gehard
Posts: 18
Joined: Wed Mar 23, 2016 7:52 am

Re: process stays at 99% sometime and write a lot to disk

Post by Gehard »

1) The read/write speed is about the same (40 MB/sec in both directions) during the whole process. Only the CPU usage changes significantly when a .7z file is being processed (comparable to archiving without compression). What I mean by "wim archives being recompressed" is that an already-compressed .wim archive file which is added to a wim archive will use 100% CPU, so wimlib compresses the already-compressed data. This does not seem to be the case for the .7z file, where CPU usage drops to about 8%.

Our test case:
- A folder containing a 40 GB .7z archive compressed with LZMA2 at maximum level using 7-Zip 15.14, and a 20 GB wim archive with default settings (LZX compression).
- We archive this folder with default settings (LZX compression).
- While processing the .7z archive, the processed-data figure doesn't update until the archive has been completely processed (checked with Process Monitor), and CPU usage is low (about 8%).
- While processing the wim, CPU usage is at 99%.

3) In our test, the figures are only updated after a new file has been processed. It has to be a new file; if it's a duplicate, nothing is updated. (We don't use --update-of as it's not reliable; we missed many updated files using it in our first tests, so we have a lot of duplicates and the figures stay the same for a long time.)
synchronicity
Site Admin
Posts: 474
Joined: Sun Aug 02, 2015 10:31 pm

Re: process stays at 99% sometime and write a lot to disk

Post by synchronicity »

wimlib tries to compress all files, regardless of file extension or file format, since it cannot predict, in general, whether the data will be compressible or not. Note that if the data is incompressible, then the compression speed will be much faster than if the data is moderately compressible.

In addition, if a file's data does not compress to less than its original size, then the data is re-written uncompressed. This work is not represented in the progress information.

Furthermore, since a WIM archive contains each distinct file contents only once, it is preferable to know whether each file's contents are needed in the archive before compressing and writing them. Consequently, if a file's contents do not have a unique size among all known file contents, then wimlib will compute the file's SHA-1 message digest and only archive the file if that digest does not match a known SHA-1 message digest. This work is not represented in the progress information.
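This size-then-digest deduplication can be sketched roughly as follows (Python, with a whole-file read for brevity; wimlib itself is written in C and streams the data, and `select_unique_contents` is an illustrative name, not a wimlib API):

```python
import hashlib
import os
from collections import Counter

def select_unique_contents(paths):
    """Return the subset of paths whose contents actually need storing.
    A file with a unique size cannot be a duplicate of anything; only
    files that share a size need the extra SHA-1 checksumming pass."""
    sizes = Counter(os.path.getsize(p) for p in paths)
    stored, seen_digests = [], set()
    for p in paths:
        if sizes[os.path.getsize(p)] == 1:
            stored.append(p)            # unique size => definitely unique
            continue
        with open(p, "rb") as f:        # potential duplicate: checksum pass
            digest = hashlib.sha1(f.read()).digest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            stored.append(p)
    return stored
```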

Therefore, in the worst case, archiving a file's contents could consist of the following passes:

1.) Read and checksum
2.) Read and write compressed
3.) Read and write uncompressed

Since the progress information is only updated for pass (2) and you are working with very large files, this may explain the confusing progress information you are encountering.
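A minimal sketch of those three passes, and of where the progress is counted, might look like this (Python, with zlib standing in for wimlib's LZX compressor; `archive_one` and `report` are illustrative names, not wimlib APIs):

```python
import hashlib
import zlib

def archive_one(data, seen_digests, report):
    """Worst-case three passes over one file's data (illustrative only).
    Pass 1: read + SHA-1; skip the file entirely if it's a duplicate.
    Pass 2: compress, reporting progress for these bytes only.
    Pass 3: if compression didn't shrink the data, store it raw."""
    digest = hashlib.sha1(data).digest()      # pass 1 (no progress shown)
    if digest in seen_digests:
        return None
    seen_digests.add(digest)
    compressed = zlib.compress(data)          # pass 2
    report(len(data))                         # progress counted here only
    if len(compressed) < len(data):
        return compressed
    return data                               # pass 3 (no progress shown)
```

With a single huge file, passes 1 and 3 do a lot of I/O while the progress figure stands still, which matches the behavior described above.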

I do not think there is an easy way to report accurate progress information for these large files, since it cannot be predicted ahead of time whether the file will need one pass, two passes, or three passes. One idea would be to report progress information during each pass, but only for one-third of the bytes. This would keep the progress meter moving, but in the common case (1 or 2 passes needed) it would require that the progress suddenly jump forward as each file is completed.

In theory, pass (1) can be skipped, provided that unneeded data is truncated from the output file. However, if there is a significant amount of duplicate data then this would be a performance loss overall.

Another idea is that there could be a stronger heuristic around not compressing files which are likely to be incompressible. This could be helpful and may seem "obvious" for something like a 20 GB 7z file, but heuristics are not guaranteed to be correct and can be harmful.
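Such a heuristic could be as simple as this sketch (Python/zlib; the sample size and threshold are arbitrary assumptions, and as noted it can misfire, e.g. on a file whose first blocks happen to be compressible while the rest is not):

```python
import zlib

def probably_incompressible(data, sample_size=64 * 1024, threshold=0.98):
    """Heuristic: quickly compress a leading sample at a fast level; if it
    barely shrinks, assume the rest of the file won't compress either.
    This is a guess, not a guarantee."""
    sample = data[:sample_size]
    if not sample:
        return False
    ratio = len(zlib.compress(sample, level=1)) / len(sample)
    return ratio > threshold
```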

Additional note: the --update-of option being unreliable is a known issue which has been reported on these forums: viewtopic.php?f=1&t=270. Unfortunately, it is caused by a bug in Windows which cannot, to my knowledge, be worked around.
Gehard
Posts: 18
Joined: Wed Mar 23, 2016 7:52 am

Re: process stays at 99% sometime and write a lot to disk

Post by Gehard »

Thank you for the very good explanation.
In the case where the file being added has a unique size (no file that has already been processed has the same size), would it be safe to compute the checksum and compress at the same time, since only files of the same size can have the same contents, if I'm correct?
No problem for the update bug, we all know Microsoft :D Wimlib is already a pleasure to work with.
synchronicity
Site Admin
Posts: 474
Joined: Sun Aug 02, 2015 10:31 pm

Re: process stays at 99% sometime and write a lot to disk

Post by synchronicity »

Yes, by definition the contents of two files can only be duplicates if they have the same size. Just before starting to write to the output file, wimlib quickly marks each file contents as "definitely unique" (unique size) or "potentially duplicate" (not unique size). For definitely unique file contents, the SHA-1 checksum is computed incrementally while the file is being read and written, so no separate pass is needed. For potentially duplicate file contents, the extra checksumming pass is needed. Note that when looking for duplicates, wimlib must consider file contents being newly added as well as file contents already in the archive, or in a referenced archive in the case of writing a "delta WIM".
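The "definitely unique" case, where the SHA-1 is folded into the single read/write pass, could be sketched like this (Python; `copy_with_digest` is an illustrative name, not a wimlib function):

```python
import hashlib

def copy_with_digest(src, dst, bufsize=1024 * 1024):
    """Copy a 'definitely unique' file while computing its SHA-1 in the
    same read pass, so no separate checksumming pass over the data is
    needed."""
    h = hashlib.sha1()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while chunk := fin.read(bufsize):
            h.update(chunk)
            fout.write(chunk)
    return h.hexdigest()
```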
Gehard
Posts: 18
Joined: Wed Mar 23, 2016 7:52 am

Re: process stays at 99% sometime and write a lot to disk

Post by Gehard »

If that's already implemented, the behavior I reported is strange. We archived a folder containing only one .7z file, so the low-CPU-usage phase is difficult to understand, because only a pure checksum phase without compression could lead to such low CPU usage, if I'm correct.
In our test, even with two very fast HDDs and default options, your algorithm is so fast that the HDDs remain the limiting factor. So avoiding the pure checksum phase would make it even faster. As this is not true with LZX + solid or LZMS, it might be best to have it as an option, but in many cases it would further improve performance.