checksum collision and incremental backups

bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

checksum collision and incremental backups

Post by bliblubli »

Hi Synchronicity,
You did a really nice job! I have some questions:
1) As some new file systems on Linux have had problems with automatic deduplication due to checksum collisions, I would like to know how thoroughly wimlib has been tested. Have you or someone you know used it to create incremental backups of huge server file systems? Or is there some theoretical proof that the SHA-1 checksum is "safe enough"?
2) If I understood it right, wimlib can do incremental backups that keep older versions of a tree, including files that have since been deleted. How can I restore the directory tree as it was on dd-mm-yyyy? Is there something like zpaq extract e:\backup.zpaq c:\Users\Bob -to tmp -until 2015-10-30?
3) Is there an efficient way to clean up older images by creating a new "full backup" at a later point in time and removing the older one? After a year, files that have been deleted are safe to remove.
Your lib really solves the problems zpaq has, like missing hardlink/symlink/junction support; I just would like to be sure that it also is (or becomes) as easy to use and as reliable as zpaq before switching :)
synchronicity
Site Admin
Posts: 474
Joined: Sun Aug 02, 2015 10:31 pm

Re: checksum collision and incremental backups

Post by synchronicity »

1) As some new file systems on Linux have had problems with automatic deduplication due to checksum collisions, I would like to know how thoroughly wimlib has been tested. Have you or someone you know used it to create incremental backups of huge server file systems? Or is there some theoretical proof that the SHA-1 checksum is "safe enough"?
Currently, SHA-1 is good enough in practice. It's designed to be a cryptographic checksum and has 2^160 (1461501637330902918203684832716283019655932542976) possible values, which makes random collisions infeasible. And as of this writing, no one has publicly reported *any* two inputs that produce the same SHA-1 message digest. Eventually there will be attacks... but for collisions costing thousands of dollars in computing resources, the worry is about forging SSL certificates, not tricking random file archiving programs.
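As a rough back-of-the-envelope estimate (my own arithmetic, not a measurement): by the birthday bound, the chance of any accidental collision among N distinct files is roughly N^2 / 2^161. Even with N = 10^12 files, that is about 10^24 / 2.9*10^48, i.e. around 3*10^-25, which is vanishingly small.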

You should also be aware that WIM archives only support whole-file deduplication. There isn't any deduplication done within files.
2) If I understood it right, wimlib can do incremental backups that keep older versions of a tree, including files that have since been deleted. How can I restore the directory tree as it was on dd-mm-yyyy? Is there something like zpaq extract e:\backup.zpaq c:\Users\Bob -to tmp -until 2015-10-30?
For incremental backups, you would simply be adding each version of your directory tree to a WIM archive as an image. So to restore a particular version of the directory tree, you would need to extract it from the corresponding image. You have to specify the image by name or index (timestamp isn't supported), so naming your images by date is probably a good idea.
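For example, something along these lines (a sketch; the paths and image names are made up, the commands are the usual wimlib-imagex ones):

wimlib-imagex capture /home/bob backup.wim "2015-10-30"
wimlib-imagex append /home/bob backup.wim "2015-11-30"
wimlib-imagex apply backup.wim "2015-10-30" /tmp/restore

The first command creates the archive with an initial image, each later run of append adds the current state of the tree as a new image, and apply extracts whichever image you name.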
3) Is there an efficient way to clean up older images by creating a new "full backup" at a later point in time and removing the older one? After a year, files that have been deleted are safe to remove. Your lib really solves the problems zpaq has, like missing hardlink/symlink/junction support; I just would like to be sure that it also is (or becomes) as easy to use and as reliable as zpaq before switching
You could delete old images from the WIM archive and optimize (rebuild) it. Then you'd end up with a minimally-sized archive that contains just the image(s) you want to retain.
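Roughly like this (again a sketch with made-up names):

wimlib-imagex delete backup.wim "2015-10-30"
wimlib-imagex optimize backup.wim

After the rebuild, the archive contains only the data still referenced by the remaining images.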
bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

Re: checksum collision and incremental backups

Post by bliblubli »

For 1) and 2), ok :)
For 3): what I would delete to save space is the first image (the full backup), keeping only the increments. Would rebuilding still work then? For files that haven't been modified in a year, and are therefore only stored in the first image, how could wimlib rebuild the tree correctly without the first image?
synchronicity
Site Admin
Posts: 474
Joined: Sun Aug 02, 2015 10:31 pm

Re: checksum collision and incremental backups

Post by synchronicity »

If you keep all images in a single WIM archive, then wimlib will never remove files that are still referenced. If you keep your images in multiple archives and plan to delete old archives manually, then you will need to export the image(s) you want before deleting the old archives.
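For example (a sketch; file and image names are made up):

wimlib-imagex export old.wim "2015-12-31" new.wim

This copies that image, together with all the file data it references, into new.wim; once it has succeeded, old.wim can be deleted.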
bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

Re: checksum collision and incremental backups

Post by bliblubli »

The plan was to have it all in one archive, so I can safely remove the first image. Nice :)
Looks like a viable alternative to zpaq; I will test it now.
bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

Re: checksum collision and incremental backups

Post by bliblubli »

First tests are really good :) Note, however, that I used the 1.8.4 beta 2:

1) The doc says wimlib defaults to LZX compression, but in my case, it was uncompressed if I didn't specify the --compress option.
2) Could you add an option to automatically extract duplicated files as hardlinks (even if they were not in the original tree)? It would help save some space when deploying images.

LZX is damn fast, and with automatic deduplication it's just impressive!
synchronicity
Site Admin
Posts: 474
Joined: Sun Aug 02, 2015 10:31 pm

Re: checksum collision and incremental backups

Post by synchronicity »

1) The doc says wimlib defaults to LZX compression, but in my case, it was uncompressed if I didn't specify the --compress option.
This shouldn't happen. Can you share the commands you used?
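For reference, explicitly asking for LZX during capture would look something like this (paths and image name are made up):

wimlib-imagex capture /data backup.wim "2015-10-30" --compress=LZX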
2) Could you add an option to automatically extract duplicated files as hardlinks (even if they were not in the original tree)? It would help save some space when deploying images.
This used to be supported, but was probably a mistake. In general, hardlinks are not a substitute for proper data deduplication. If you hard link all "copies" of a file, then anyone who changes one copy would, incorrectly, be changing all copies. There can also be metadata differences that make it impossible to know whether two copies of a file "should" be hardlinked or not, even if they have the same contents (hardlinked files share all metadata).
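To illustrate the first point with plain shell (nothing wimlib-specific, just how hard links behave):

echo original > file1
ln file1 file2
echo changed > file1
cat file2   # prints "changed": both names refer to the same data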
bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

Re: checksum collision and incremental backups

Post by bliblubli »

1) was a misinterpretation on my part. It prints "xxx MiB of yyy MiB (uncompressed)", but the line above it says it is compressed with LZX. So everything is ok :)
2) Well, I would propose it as an option, and of course the user has to know when to use it and when not. I can understand that some people prefer full copies, but keeping it as a possibility would help in many cases. Wimlib also saves the metadata, so it should be possible to ensure that only files with the same content and the same metadata are hardlinked. Symlinks can have different metadata, if I'm right, so they could be a good complement to the hardlink approach.