Monday, May 25, 2009

GridBackup pathname handling

(BTW, this post also serves as an announcement that I'm resuming work on GridBackup, barring any more crises)

After struggling with various approaches for months, I've finally settled on a method for handling file and directory names. You would think this would be easy, but it's not.

In a nutshell, representing names is a hard problem, and one that's been bothering computer scientists and software engineers for decades. To a computer, everything is ultimately a number, so finding a way to store the word "hello" requires converting the text to a series of numbers.

No problem, we just assign a number to each letter (h = 104, e = 101, etc.) and store that list of numbers, right? Well, sure, but the problem is that there are a LOT of symbols that may be in names. All of the first major computer systems were invented in the US, and so they had no ability to handle any symbols other than the ones we use -- basically A-Z (upper and lowercase), 0-9 and assorted punctuation characters.
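That letter-to-number mapping is easy to see directly. A quick sketch in Python (these are the ASCII values; any language exposes the same idea):

```python
# Convert "hello" to the list of numbers a computer actually stores,
# then back again.
codes = [ord(c) for c in "hello"]
print(codes)   # [104, 101, 108, 108, 111]

text = "".join(chr(n) for n in codes)
print(text)    # hello
```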

When Europeans got involved, they needed to expand this set a bit. They needed various accented characters, plus a few modifications like the French ç (note the little tail on the bottom). Okay, so define numbers for each of those, right?

Oh, but then we have the Cyrillic alphabet. And Indian Devanagari, Gujarati, etc. And Arabic script. And Korean Hangul script. And then there are the various Chinese and Japanese writing systems, some of which don't even have alphabets, but are ideographic.

The net result is that there are a lot of writing systems in the world, and computers ultimately need to support all of them that are used by any significant number of people. And not only do they have different sets of symbols, but they flow in different directions (left to right, right to left, top to bottom) and many have various complex ways of merging or joining characters.
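That "merging or joining" shows up even in a simple accented letter: Unicode allows é to be written either as one symbol or as a plain e plus a separate accent mark, and the two don't compare equal without normalization. A Python sketch:

```python
import unicodedata

# "é" can be stored as one precomposed code point, or as a plain "e"
# followed by a combining accent mark -- two different number sequences
# for what looks like the same character on screen.
precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)   # False -- different code points

# Unicode normalization (NFC here) maps them to a common form:
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```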

There is now a system for defining numbers for all of those many, many symbols, and specifying how they connect together, etc. It's called Unicode. Now, if only the whole world used it (and all used the same encoding of it -- there are two major encodings, UTF-16 and UTF-8, plus some more minor ones), then we could all pass files around and everyone would see the names correctly, whether they could make sense of them or not.
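To see how the two major encodings differ, here's the same name encoded both ways (illustrative only -- the point is that identical characters become different bytes):

```python
# The same name, encoded two ways. The characters are identical;
# the bytes on disk are not.
name = "résumé"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")   # little-endian, no byte-order mark

print(utf8)    # b'r\xc3\xa9sum\xc3\xa9'
print(utf16)   # b'r\x00\xe9\x00s\x00u\x00m\x00\xe9\x00'

# Decoded with the right encoding, both come back identical:
print(utf8.decode("utf-8") == utf16.decode("utf-16-le"))   # True
```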

Actually, there is one major segment of the computing world that does use Unicode religiously (specifically UTF-16): Microsoft Windows NT, which includes Windows 2000, XP and Vista. This is one thing that MS got right, mainly because NT came about in the mid 90s when all this stuff was pretty well understood. On Windows, every file name is guaranteed to be correct, valid Unicode.

Mac and Unix systems (Mac lately is Unix, though not traditionally) came earlier and used different systems, as did other versions of Windows before NT. Specifically, they used a wide variety of encoding systems developed in different parts of the world, each appropriate for the writing system of that area. In order to make the software customizable for different areas of the world, they provided a way to set a "locale" or a "codepage" which basically said "On this computer, we represent file names using CP949", or similar (CP949 is a numbering of Korean characters).

That works, but it means if you lift a file from a Vietnamese computer and drop it on an American computer, the name is complete gibberish. I don't mean gibberish in the sense of "not understandable if you don't know Vietnamese", I mean "not understandable, period", because the numbers used to represent the Vietnamese characters would be interpreted by the American computer as whatever those numbers mean in the American character set. In many cases, the numbers representing a name in one system may be invalid on another computer. The computer won't just display garbage, it'll give you an error.
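Here's a Python sketch of exactly that failure, using the CP949 codepage mentioned above (Latin-1 and UTF-8 stand in for the other computer's character set -- the exact garbage depends on which codepage it uses):

```python
# Korean text encoded under the CP949 codepage:
raw = "한글".encode("cp949")   # the word "Hangul"
print(raw)                     # b'\xc7\xd1\xb1\xdb'

# ...decoded on a machine whose codepage is Latin-1: pure gibberish.
print(raw.decode("latin-1"))   # ÇÑ±Û

# ...and on a machine expecting UTF-8, it's not even gibberish -- it's
# an outright error:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid:", err)
```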

Even worse, many of these systems allowed users to specify their own "locale", so one computer may have one user using a German character set while another uses Russian.

So, the upshot of this is that on Mac, Linux, Unix and older Windows systems, there may be files with names that make absolutely no sense when interpreted the way everything else on that computer is interpreted.

The question is: How should a backup program deal with this?

I ultimately settled on these key goals:
  1. The backup should ALWAYS succeed, regardless of whether or not the file names are "valid".
  2. The backup should preserve enough information that it's just as possible to figure out the correct characters for a name from the backup as it is from the original system.

In addition, I have a constraint: the format I'm putting the file name information into in the backup system (JSON) only accepts Unicode. So I have to somehow make sure that everything I store is valid Unicode, even though I don't necessarily have any idea what characters are in the name.
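To make that constraint concrete -- and this is only an illustration of one well-known escape hatch (Python 3's surrogateescape error handler, PEP 383), not a preview of my actual approach -- arbitrary bytes can be smuggled through a Unicode string and JSON losslessly:

```python
import json

# Hypothetical raw filename bytes that are NOT valid UTF-8:
raw_name = b"report\xff\xfe.txt"

# surrogateescape maps each undecodable byte to a lone surrogate code
# point, so the original bytes survive a round trip through a Unicode
# string...
as_text = raw_name.decode("utf-8", errors="surrogateescape")
restored = as_text.encode("utf-8", errors="surrogateescape")
print(restored == raw_name)   # True

# ...and through JSON, which only accepts Unicode text:
stored = json.dumps({"name": as_text})
loaded = json.loads(stored)["name"]
print(loaded.encode("utf-8", errors="surrogateescape") == raw_name)   # True
```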

It's late. Having explained the problem, I'll describe my solution tomorrow.

1 comment:

  1. Yes, it was late, you should have come to bed instead of staying up half the night. Glad you are back to something you're enjoying.

