option for allowed character set

bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

option for allowed character set

Post by bliblubli »

Hi @synchronicity,
the --include-invalid-names option is very helpful, but its effect differs depending on the OS one uses. I couldn't find a way to ensure only legal characters when applying an image from Linux to a partition used under Windows. Would it be possible to either
- make its effect dependent on the target partition's file system, or
- add an option which ensures the same result when running the same command (like --charset=win and --charset=posix)?
synchronicity
Site Admin
Posts: 473
Joined: Sun Aug 02, 2015 10:31 pm

Re: option for allowed character set

Post by synchronicity »

Well, the NTFS filesystem (even Microsoft's implementation of it, theoretically) does technically support characters like '?' that Windows itself doesn't allow. That's why libntfs-3g allows them. An option to consider them invalid when extracting could make sense. But then there is also the question of names such as "NUL" or names with a trailing dot or space, which Windows allows with \\?\-prefixed paths but doesn't allow with regular paths. That makes them incompatible with most Windows software. So, there are multiple levels of how "invalid" a filename is on Windows, which would complicate adding a new option...
bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

Re: option for allowed character set

Post by bliblubli »

True. Then maybe just make the smallest denominator available, like --charset=win_explorer, so that at least the Windows UI can copy/move/delete folders and chkdsk doesn't delete files when checking the file system?
synchronicity
Site Admin
Posts: 473
Joined: Sun Aug 02, 2015 10:31 pm

Re: option for allowed character set

Post by synchronicity »

The NTFS on-disk format allows filenames to contain any characters other than '\0' and '/' as long as the filenames are marked as being in the "POSIX namespace", which NTFS-3G does. I thought that chkdsk doesn't impose any additional restrictions; did you observe otherwise?

The reality is that filenames on Windows are complicated, so "smallest denominator" is not a clear-cut thing. There are many different aspects:

* Case sensitivity. NTFS actually supports both case sensitivity and case insensitivity, but by default the Windows kernel is configured to treat paths as case insensitive. This can be changed via a registry setting. But, paths used by Linux binaries running under the Windows Subsystem for Linux are treated as case sensitive regardless.
* Special characters. Some characters such as '?' aren't normally allowed on Windows, even with \\?\-prefixed paths. (They might work with the Windows Subsystem for Linux though; I'm not sure.)
* Special names. Windows treats some filenames such as "nul" or "con" specially; they refer to devices instead of files on the filesystem. However, Windows applications can get around this by using pseudo-NT-namespace path syntax, i.e. \\?\-prefixed paths.
* Special name formats. On Windows, trailing dots and spaces are normally stripped from paths when used. However, as above Windows applications can get around this by using \\?\-prefixed paths.
* Long names. Paths on Windows are normally limited to 260 characters (MAX_PATH), and individual path components to 255 characters. However, as above, Windows applications can get around the path limit by using \\?\-prefixed paths.
* Invalid Unicode. Many applications assume that filenames are valid Unicode (UTF-16LE in the case of Windows) and will get confused when they are not. But, Windows doesn't actually enforce that filenames are valid Unicode.

(And probably others I forgot... note that besides filenames there can be other fun "problems" such as files whose ACL does not grant anyone permissions, even the Administrator.)
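To make the per-name checks in the list above concrete, here is a minimal Python sketch. It is only an illustration, not what wimlib or NTFS-3G does: the function name `filename_problems`, the labels it returns, and the exact reserved-name and forbidden-character sets are assumptions. Case collisions and overall path length are left out because they depend on more than a single name.

```python
from typing import List

# Assumed set of reserved device names (the list above only mentions "nul" and "con").
RESERVED_NAMES = {"CON", "PRN", "AUX", "NUL"}
RESERVED_NAMES |= {f"COM{i}" for i in range(1, 10)}
RESERVED_NAMES |= {f"LPT{i}" for i in range(1, 10)}

# Assumed set of characters rejected by ordinary Windows path handling.
FORBIDDEN_CHARS = set('<>:"/\\|?*') | {chr(c) for c in range(32)}


def has_invalid_unicode(name: str) -> bool:
    """Return True if the name contains a lone surrogate (not encodable as UTF-16)."""
    try:
        name.encode("utf-16-le")
    except UnicodeEncodeError:
        return True
    return False


def filename_problems(name: str) -> List[str]:
    """Classify a single filename (not a full path) by the problem categories above."""
    problems = []
    if any(ch in FORBIDDEN_CHARS for ch in name):
        problems.append("special character")
    # Device names are matched on the part before the first dot, ignoring case.
    stem = name.split(".")[0].rstrip(" ").upper()
    if stem in RESERVED_NAMES:
        problems.append("special name")
    if name.endswith((".", " ")):
        problems.append("trailing dot or space")
    if has_invalid_unicode(name):
        problems.append("invalid Unicode")
    return problems


if __name__ == "__main__":
    for candidate in ["report.txt", "what?.txt", "nul", "notes. "]:
        print(repr(candidate), filename_problems(candidate))
```

A name can fall into several categories at once, which is part of why a single "valid/invalid" switch is hard to define.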

The default behavior of NTFS-3G, and the behavior of wimlib when extracting using libntfs-3g, is simple: all the names are extracted as-is. That is permitted by the NTFS on-disk format, and in general there's no way to tell which filenames might be "unwanted" -- e.g. maybe those "weird" names are for use by an application that uses \\?\-prefixed paths, or maybe they are intended to be accessed through the Windows Subsystem for Linux, or perhaps even just with NTFS-3G again. And I don't always know which subsets of these filename problems actually apply to specific Windows programs, e.g. Windows Explorer.

So, tl;dr: while it may make sense to add an option to control the extraction behavior, it's hard to decide what it should do, exactly...
bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

Re: option for allowed character set

Post by bliblubli »

Hi,
regarding chkdsk, create a file named "*" or ":" in a folder. On Windows, try to copy that folder in Explorer (it will abort) or run chkdsk on the partition (it will delete that file).
Regarding the smallest denominator, it's clearly Windows Explorer. This regex should work:
^(?!^(?:PRN|AUX|CLOCK\$|NUL|CON|COM\d|LPT\d)(?:\..+)?$)(?:\.*?(?!\.))[^\x00-\x1f\\?*:\";|\/<>]+(?<![\s.])$
You can read https://stackoverflow.com/questions/295 ... d-filename and https://stackoverflow.com/questions/627 ... er-windows for more variants.
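For anyone who wants to try the regex above, here is a minimal Python sketch of how it could be applied to single file names. The helper name `is_explorer_safe` is made up for illustration, and `re.IGNORECASE` is an assumption added so that "nul" is rejected as well as "NUL"; the pattern itself is copied unchanged.

```python
import re

# The regex from the post above, copied verbatim. re.IGNORECASE is an added
# assumption so that reserved names are rejected regardless of case.
EXPLORER_SAFE = re.compile(
    r'^(?!^(?:PRN|AUX|CLOCK\$|NUL|CON|COM\d|LPT\d)(?:\..+)?$)'
    r'(?:\.*?(?!\.))[^\x00-\x1f\\?*:\";|\/<>]+(?<![\s.])$',
    re.IGNORECASE,
)


def is_explorer_safe(name: str) -> bool:
    """Return True if a single file name (not a path) passes the regex."""
    # fullmatch avoids '$' also matching just before a trailing newline.
    return EXPLORER_SAFE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["photo.jpg", ".hidden", "NUL", "a?b", "trailing.", "*"]:
        print(repr(candidate), is_explorer_safe(candidate))
```

The check is applied per path component, so a full path would need to be split first.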

There are two use cases. The first is to back up and restore a partition managed exclusively under Windows. In this case, no option for the path is needed. The other is to back up a partition/tree managed by Linux, with all the paths that Linux supports, for usage under Windows (you copy some of your data onto a USB stick for your grandmother or whatever; she's not going to use \\?\-prefixed paths, she's already happy to know how to copy and paste a folder). In this case, many file names can make Explorer unusable.

Of course, you could check every file name manually, or install Linux on your grandmother's computer, or just use Windows yourself because being a Linux user in a 95% Windows world and family is a nightmare. But a very conservative regex could help a lot. (In this use case, it's not very important if some unnecessary renaming is done, like ".photo of your 80th birthday.jpg" to "photo of your 80th birthday.jpg" although the former would have been OK, as long as someone with only very basic IT knowledge can access all the photos.)
synchronicity
Site Admin
Posts: 473
Joined: Sun Aug 02, 2015 10:31 pm

Re: option for allowed character set

Post by synchronicity »

Hmm, I'm considering going with two new extraction options, --windows-names and --ignore-case. --windows-names would make names matching the special characters, special names, or special name formats (see my list above) be considered invalid. That would be similar to your suggested regex and would also be the same rules applied by the ntfs-3g mount option 'windows_names'. Separately, --ignore-case would make all case-insensitively-matching filenames except the first be considered invalid. Then, to be consistent with current behavior, the existing option --include-invalid-names would also have to be given in order to extract the invalid names under replacement names rather than ignoring them.

Unfortunately that ignores the other problems I described such as long paths and invalid Unicode, but it might be good enough for some use cases...

Note that there is already a way to control case insensitivity in wimlib-imagex via an environment variable, but it's primarily meant to control how paths in WIM images are interpreted; I don't think it should control extraction behavior too.
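As a rough illustration of the --ignore-case rule proposed above ("all case-insensitively-matching filenames except the first"), here is a minimal Python sketch. It is not wimlib code: the function name is hypothetical, "first" is taken to mean first in input order, and `str.upper()` stands in for whatever case-folding table an actual implementation would use.

```python
from typing import Iterable, List


def case_insensitive_duplicates(names: Iterable[str]) -> List[str]:
    """Return the names that collide, ignoring case, with an earlier name.

    Assumption: names are the entries of a single directory, in the order
    they would be extracted; the first of each colliding group is kept.
    """
    seen = set()
    invalid = []
    for name in names:
        key = name.upper()  # simplified stand-in for real case folding
        if key in seen:
            invalid.append(name)
        else:
            seen.add(key)
    return invalid


if __name__ == "__main__":
    entries = ["Readme.txt", "README.TXT", "notes", "readme.txt"]
    print(case_insensitive_duplicates(entries))
```

Which name counts as "the first" matters to the user, since it decides which file keeps its original name.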
bliblubli
Posts: 88
Joined: Thu Dec 31, 2015 10:45 am

Re: option for allowed character set

Post by bliblubli »

Sounds good. So if I understand correctly, extracting with "--windows-names --ignore-case --include-invalid-names" will extract all files with cleaned names?
The long path and invalid Unicode problems are OK. In the scenario I describe, those won't happen. And the long path problem can be solved pretty easily by renaming some of the top-level folders.