Error : filename is not valid UTF-8

Comments, questions, bug reports, etc.
Skyblue
Posts: 23
Joined: Fri Apr 08, 2016 7:12 am

Error : filename is not valid UTF-8

Post by Skyblue »

Hi synchronicity,

The latest release of wimlib (v1.13.4) reports the following error message while capturing a folder on Oracle Linux 5.8 32-bit:

[ERROR] "/appdata/krbapp/prodappl/xxkrb/11.5.0/bin/file▒200.xls": filename is not valid UTF-8.  This is not supported.
Is there a way to archive those non-UTF-8 files with wimlib? Thanks.
synchronicity
Site Admin
Posts: 472
Joined: Sun Aug 02, 2015 10:31 pm

Re: Error : filename is not valid UTF-8

Post by synchronicity »

I'm afraid not, as the WIM file format stores filenames as Windows-style wide character strings (UTF-16LE, with unpaired surrogates allowed). As a result there is no way to represent a UNIX-style arbitrary byte sequence filename unless it is valid UTF-8 (with unpaired surrogates allowed).

Edit: in principle filenames with a well-defined encoding other than UTF-8, say ISO-8859-1, could be mapped to UTF-16 as well. Almost everyone uses UTF-8 now though, so there hasn't been a need to support this.
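For anyone who wants to locate such files before capturing, here is a minimal Python sketch. It walks a tree using byte paths and reports names that do not decode as UTF-8. The function names `is_wim_representable` and `find_bad_names` are made up for this illustration and are not part of wimlib. The `surrogatepass` error handler is used only to approximate the "unpaired surrogates allowed" rule described above; wimlib's exact validation may differ.

```python
import os
from typing import Iterator


def is_wim_representable(name: bytes) -> bool:
    """Return True if a raw filename decodes as UTF-8.

    'surrogatepass' lets encoded lone surrogates through, as an
    approximation of "valid UTF-8 with unpaired surrogates allowed".
    This is an assumption, not wimlib's actual check.
    """
    try:
        name.decode("utf-8", errors="surrogatepass")
        return True
    except UnicodeDecodeError:
        return False


def find_bad_names(root: bytes) -> Iterator[bytes]:
    """Yield full paths (as bytes) whose final component is not valid UTF-8."""
    # Passing bytes to os.walk makes it return raw byte names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_wim_representable(name):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Scan the current directory; substitute the folder being captured.
    for path in find_bad_names(b"."):
        print(repr(path))
```

The offending files can then be renamed or removed before running the capture.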
Skyblue
Posts: 23
Joined: Fri Apr 08, 2016 7:12 am

Re: Error : filename is not valid UTF-8

Post by Skyblue »

Hi synchronicity,

Thanks for the info. Luckily, I had the option to delete the offending files and wimlib did the job.
chungy
Posts: 30
Joined: Mon Feb 15, 2016 3:40 am

Re: Error : filename is not valid UTF-8

Post by chungy »

synchronicity wrote: Thu Jul 08, 2021 9:37 pm Edit: in principle filenames with a well-defined encoding other than UTF-8, say ISO-8859-1, could be mapped to UTF-16 as well. Almost everyone uses UTF-8 now though, so there hasn't been a need to support this.
It should even be possible to have a flag that interprets all file names as ISO-8859-1, so that every possible file name can be captured, with a corresponding flag on apply/extract. This has a particular advantage: the first 256 code points in Unicode are exactly the ISO-8859-1 character set, so the conversion is pretty simple.
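To illustrate why the conversion is simple, here is a rough Python sketch of the round trip. It is only a demonstration of the byte-to-code-point mapping; the helper names are hypothetical and no such flag exists in wimlib as far as this thread says.

```python
def latin1_name_to_utf16le(name: bytes) -> bytes:
    """Interpret raw filename bytes as ISO-8859-1 and encode as UTF-16LE.

    Each byte value 0x00-0xFF becomes the Unicode code point with the
    same number, so this never fails, whatever the input bytes are.
    """
    return name.decode("latin-1").encode("utf-16-le")


def utf16le_to_latin1_name(data: bytes) -> bytes:
    """Reverse the mapping on extraction.

    Raises UnicodeEncodeError if the stored name contains a code point
    above U+00FF, i.e. it was not captured with the Latin-1 mapping.
    """
    return data.decode("utf-16-le").encode("latin-1")


if __name__ == "__main__":
    raw = b"file\xe9200.xls"  # hypothetical non-UTF-8 name
    stored = latin1_name_to_utf16le(raw)
    print(stored)
    print(utf16le_to_latin1_name(stored) == raw)
```

The catch is that a name captured this way and then extracted on Windows (or without the flag) would show the Latin-1 interpretation of the bytes, which is only the intended text if the original encoding really was ISO-8859-1.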

I'd generally agree that assuming UTF-8 is a safe default (especially given how long it took for this issue to arise), and it causes the fewest surprises with old archives, transfers to and from Windows, etc.