Wednesday, May 27, 2009

Pathname handling, continued

So, here's the solution:

On systems other than Windows NT, I read and manage all pathnames as raw bytes, basically ignoring whatever encoding the system thinks it's using. Since the common case is to restore files to the same sort of system they were backed up from, that ensures that every file name can be backed up and restored.

But what about the less common case, where the restore is to a different sort of system, with a different encoding? To address that, in addition to the raw bytes of the pathname, I also store the file system pathname encoding -- that tells me the way the system claims to interpret the raw bytes of its names. When doing a restore, the restore process checks to see if the target system uses the same encoding as the source system. If so, it just writes out the raw bytes. If not, then it decodes the names with the source system decoder and encodes them with the destination system encoder.
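A rough sketch of that restore-time check, in Python. The function name and its parameters are made up for illustration and are not taken from the actual backup code; it also assumes both encodings are ones Python has codecs for.

```python
import codecs


def restore_name(raw, source_encoding, target_encoding):
    """Return the bytes to write on the target system (hypothetical helper)."""
    # Normalize codec names so aliases such as 'UTF8' and 'utf-8' compare equal.
    source = codecs.lookup(source_encoding).name
    target = codecs.lookup(target_encoding).name
    if source == target:
        # Same encoding on both ends: write the stored bytes untouched,
        # even if they are not valid in that encoding.
        return raw
    # Different encodings: decode with the source codec, encode with the target.
    return raw.decode(source_encoding).encode(target_encoding)


# A UTF-8 name restored onto a latin-1 system gets re-encoded.
print(restore_name(b"caf\xc3\xa9", "utf-8", "latin-1"))
```

Note that this sketch uses strict decoding, so an invalid name raises an error when the encodings differ.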

If there is a "bad" name (a name that is invalid per the source system's encoding rules), that decoding may fail, or it may just produce garbage. If it fails, the restore system logs the error and retries the decoding with a special mode that replaces invalid characters with an "unknown" symbol, so that the decoding and the restore both succeed, even though the file name is damaged.
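The log-and-retry behavior could look something like the following in Python. This is only a sketch: the function name is hypothetical, and it assumes the "special mode" corresponds to Python's `errors="replace"` handler, which substitutes U+FFFD (the Unicode replacement character) for each undecodable sequence.

```python
import logging

log = logging.getLogger("restore")


def decode_name(raw, source_encoding):
    """Decode a stored name, falling back to a lossy decode (hypothetical helper)."""
    try:
        # First attempt: strict decoding, which raises on invalid input.
        return raw.decode(source_encoding)
    except UnicodeDecodeError as err:
        # Log the problem, then retry so the restore can still proceed.
        log.warning("bad name %r in %s: %s", raw, source_encoding, err)
        return raw.decode(source_encoding, errors="replace")


# 0xFF can never appear in valid UTF-8, so this name takes the fallback path.
print(decode_name(b"ab\xffcd", "utf-8"))
```

The retry only covers names where decoding fails outright. A name that decodes "successfully" into garbage passes straight through, as described above.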

However, the original byte string is still in the backup data, so theoretically I could someday make tools that allow the user to specify the encoding, or that try a bunch of different encodings, or whatever. The data exists to recover the correct pathname, assuming that's even possible.

There's one more issue, though: The method I'm using to write file names into the backup log only takes valid Unicode (and encodes it with UTF-8). So I want to store and manage raw byte strings, but I need to convert them to Unicode. To do that, I "decode" the byte strings with the latin1 decoder. That's a widely used encoding standard that has the convenient property that it maps every possible byte value to a unique Unicode value. The resulting text is probably complete garbage to a human reader, but that's okay, because I only use this latin1 encoding as a transport method. During restore, I decode the UTF-8 names to get Unicode, then encode them with latin1, which gives me back the original byte string.
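The latin1 round trip is easy to demonstrate in Python. The two helper names below are invented for this example; the point is just that every one of the 256 byte values survives the trip through Unicode and UTF-8.

```python
def to_log(raw):
    """Raw pathname bytes -> UTF-8 bytes suitable for the log (hypothetical helper)."""
    # latin-1 maps byte value N to code point U+00NN, so this never fails.
    return raw.decode("latin-1").encode("utf-8")


def from_log(stored):
    """UTF-8 bytes from the log -> the original raw pathname bytes."""
    # Every code point produced by to_log is below U+0100, so latin-1 can encode it.
    return stored.decode("utf-8").encode("latin-1")


# Check the round trip on every possible byte value.
every_byte = bytes(range(256))
assert from_log(to_log(every_byte)) == every_byte
```

The cost is size: each byte from 0x80 up takes two bytes in the UTF-8 form, so `to_log(b"\xff")` is `b"\xc3\xbf"`.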
