Thursday, May 28, 2009

Distributed vs Centralized Version Control

I've been meaning to write about this for a while, because I get asked about it on a regular basis, and because I wanted to try writing it all out to help me clarify for myself exactly why I find distributed version control so much better. The time to write it and the motivation to do so finally coincided, so here it is.

The basic differences between distributed and centralized version control are:

1. DVCS doesn't necessarily include any notion of a central repository. Every developer has a full repository with all version history, etc. on his local machine. Often, teams choose to designate one repository as the "main" one, just for convenience, but it isn't necessary.

2. Sets of changes can be pulled easily and flexibly between repositories, and there are powerful tools for picking out particular change sets.

Those are the core differences, though git (and most other DVCSs) also handles version control at the whole-tree level, rather than per-file. Due to the way change sets are pushed and pulled between repositories, it becomes important that all of the file changes related to a logical change (bugfix, feature, etc.) form a single commit that is thereafter managed as a unit.

The different approach to version control results in different interactions between developers, just as locking-based VCSs are used differently from merge-based VCSs. With a centralized repository, every developer pulls updates from the repository and pushes changes to it. That same approach can be used with a DVCS, but often is not, especially with OSS projects.

A DVCS also lends itself to a "request-pull" model. A developer fixes a bug or implements a feature and commits it to his repository, then notifies other developers (usually the project leader or his delegate) that the change is available to be "pulled". That person can then pull the changes, vet them for appropriateness, style, functionality, etc. and, if he likes them, commit the changes to his repository.

Very large projects (hundreds of developers) normally end up organized in a tree structure, with a central "official" repository managed by a project leader plus a second layer of "lieutenants" (typically responsible for major subsystems), who often have one or more layers of trusted assistants themselves. New code is developed by a "leaf" developer in his local repository and when it's sufficiently clean/stable/etc., he notifies an appropriate individual who decides whether or not to accept the changes. That individual aggregates changes from many people, and then notifies the next layer up in the tree, and so on.

The most visible project of that sort is, of course, the Linux kernel, which has over 500 regular contributors and thousands of occasional contributors. With the DVCS model, Linus Torvalds is able to "manage" this huge development team, and to successfully vet, test, and integrate megabytes of changes every month.

Smaller projects adopt different models depending on the level of team cohesiveness and trust. Within corporate teams, cohesiveness and trust are typically high, and the DVCS is used in a centralized mode, with every developer pushing changes into the primary repository without going through another individual. Open source projects are often a hybrid, with a small number of core developers who have direct access to the "official" repository, and a larger number who don't, and must ask one of those core developers to pick up their changes.

Increasingly, though, even small open source projects are abandoning that mode and shifting to almost entirely decentralized operation. "Almost" because there still ends up being an official repository which is the one from which releases are pulled to be made available to non-developers. Often, though, the owner of that repository is purely a "release manager", who may not even be a programmer.

Github really facilitates this very loose model. If you want to modify a project hosted on Github, you do it by "forking" the project. Github tracks those forks, though, and when you commit changes, those changes are made visible to the project owner, who has the ability to pull them. It also provides a one-click mechanism for requesting that the owner pull a change set, as well as very nice tools for handling the merges.

In my opinion, DVCSs offer two main advantages over centralized VCSs:

First, the organizational flexibility. Whatever sort of structure makes sense for your project and your team, you can implement it, simply by deciding who pulls from whom and, less commonly, who is allowed to push into whose repository without oversight.

The second is lower-level, more pragmatic: It's just really, really convenient to be working out of your own, purely local repository. Want to work on an airplane? No problem: you have full version history and the ability to commit, roll back, branch, merge, etc. Almost all DVCSs also provide the ability to consolidate, divide, and reorder commits to "clean up" the version history. That may sound like a bad thing, but it's not: it allows the developer to present change sets as logical, cohesive, coherent wholes, in spite of the fact that the true development process involved many false starts and backtracking. That not only makes moving change sets between repositories easier and cleaner, it makes the version history MORE USEFUL.

Finally, there's one more aspect to this practical benefit, which accrues most prominently to git: speed. Linus Torvalds (primary author of git) likes to point out that beyond a certain point, performance isn't just a pleasant reduction in thumb-twiddling time: once an operation is so fast that it takes no human-perceptible time, it changes the way developers work.

Perhaps the best example is branching. Many CVCSs provide branching and merging tools, but they're often cumbersome to use, and slow. Creating a new branch on a large project may take a minute -- or sometimes much more. Switching between branches is similar.

With git, for example, branch creation is instantaneous, and shifting between branches, even on trees with tens of thousands of source files, takes well under a second. As a result, git developers tend to use branches for everything they do. Any new little project, bugfix, feature request, etc. spawns a new branch. If the change works out, it's trivially pulled into the main line (and pushed/pulled from there to other developers). If not, deleting a branch is as trivial as creating one (though no commit is actually lost, so anything can be brought back if it's really needed).
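
As a rough sketch of that branch-per-task workflow, here's what it might look like driven from a Python script. The branch name, commit message, and "master" main-line name are made up for illustration:

    import subprocess

    def git(*args):
        # Run a git command, raising an exception if it fails.
        subprocess.run(["git", *args], check=True)

    git("checkout", "-b", "fix-widget-crash")  # effectively instantaneous
    # ... edit, test ...
    git("commit", "-a", "-m", "Fix crash when the widget list is empty")
    git("checkout", "master")                  # back to the main line
    git("merge", "fix-widget-crash")           # fold the fix in
    git("branch", "-d", "fix-widget-crash")    # drop the label; the commits
                                               # themselves aren't lost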

With traditional CVCSs (and even some of the DVCSs), developers end up with many copies of their source tree, in an effort to keep different streams of work separated. One copy may be for a new feature that's under development on the current in-development version. Another might be a copy of that tree, used for a risky, experimental approach to the new feature. Another might contain the released version of the software, used for testing. Yet another might contain an older version, with changes to fix a bug. With git (and a good build system that doesn't rebuild code unnecessarily), there's no need for more than one.

So, to summarize, DVCSs are better because they can do everything CVCSs can do, and more, and faster. If you want to use a DVCS as a better CVCS -- or even as a front end to a real CVCS; git-svn makes Subversion hugely better -- you can do that. If you need a different model, you can do that. And you can do it all faster and, once you learn how, more easily.

Wednesday, May 27, 2009

Pathname handling, continued

So, here's the solution:

On systems other than Windows NT, I read and manage all pathnames as raw bytes, basically ignoring whatever encoding the system thinks it's using. Since the common case is to restore files to the same sort of system they were backed up from, that ensures that every file name can be backed up and restored.

But what about the less common case, where the restore is to a different sort of system, with a different encoding? To address that, in addition to the raw bytes of the pathname, I also store the file system pathname encoding -- that tells me the way the system claims to interpret the raw bytes of its names. When doing a restore, the restore process checks to see if the target system uses the same encoding as the source system. If so, it just writes out the raw bytes. If not, then it decodes the names with the source system decoder and encodes them with the destination system encoder.

If there is a "bad" name (a name that is invalid per the source system's encoding rules), that decoding may fail, or it may just produce garbage. If it fails, the restore system logs the error and retries the decoding with a special mode that replaces invalid characters with an "unknown" symbol, so that the decoding and restore succeeds, even though the file name is damaged.

However, the original byte string is still in the backup data, so theoretically I could someday make tools that allow the user to specify the encoding, or that try a bunch of different encodings, or whatever. The data exists to recover the correct pathname, assuming that's even possible.

There's one more issue, though: The method I'm using to write file names into the backup log only takes valid Unicode (and encodes it with UTF-8). So I want to store and manage raw byte strings, but I need to convert them to Unicode. To do that, I "decode" the byte strings with the latin1 decoder. That's a widely used encoding standard with the convenient property that it maps every possible byte value to a unique Unicode value. The results are probably complete garbage, but that's okay, because I only use this latin1 encoding as a transport method. During restore, I decode the UTF-8 names to get Unicode, then encode them with latin1, which gives me back the original byte string.
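
In Python, for example, the round trip looks like this (the byte string is just an example name):

    import json

    raw = b"caf\xe9"                  # raw name bytes, actual encoding unknown
    transport = raw.decode("latin1")  # always succeeds: one byte, one code point
    record = json.dumps({"name": transport})  # stored as valid Unicode/UTF-8

    # At restore time, reverse the trip to recover the exact original bytes.
    recovered = json.loads(record)["name"].encode("latin1")
    assert recovered == raw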

Monday, May 25, 2009

GridBackup pathname handling

(BTW, this post also serves as an announcement that I'm resuming work on GridBackup, barring any more crises)

After struggling with various approaches for months, I've finally settled on a method for handling file and directory names. You would think this would be easy, but it's not.

In a nutshell, representing names is a hard problem, and one that's been bothering computer scientists and software engineers for decades. To a computer, everything is ultimately a number, so finding a way to store the word "hello" requires converting the text to a series of numbers.

No problem, we just assign a number to each letter (h = 104, e = 101, etc.) and store that list of numbers, right? Well, sure, but the problem is that there are a LOT of symbols that may be in names. The first major computer systems were invented in the US, and so they had no ability to handle any symbols other than the ones we use -- basically A-Z (upper and lowercase), 0-9 and assorted punctuation characters.
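
You can see those numbers for yourself in Python, for instance:

    >>> [ord(letter) for letter in "hello"]
    [104, 101, 108, 108, 111]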

When Europeans got involved, they needed to expand this set a bit. They needed various accented characters, plus a few modifications like the French ç (note the little tail on the bottom). Okay, so define numbers for each of those, right?

Oh, but then we have the Cyrillic alphabet. And Indian Devanagari, Gujarati, etc. And Arabic script. And Korean Hangul script. And then there are the various Chinese and Japanese writing systems, some of which don't even have alphabets, but are ideographic.

The net result is that there are a lot of writing systems in the world, and computers ultimately need to support all of them that are used by any significant number of people. And not only do they have different sets of symbols, but they flow in different directions (left to right, right to left, top to bottom) and many have various complex ways of merging or joining characters.

There is now a system for defining numbers for all of those many, many symbols, and specifying how they connect together, etc. It's called Unicode. Now, if only the whole world used it (and all used the same encoding of it -- there are two major encodings, UTF-16 and UTF-8, plus some more minor ones), then we could all pass files around and everyone would see the names correctly, whether they could make sense of them or not.

Actually, there is one major segment of the computing world that does use Unicode religiously (specifically UTF-16): Microsoft Windows NT, which includes Windows 2000, XP and Vista. This is one thing that MS got right, mainly because NT came about in the mid-90s, when all this stuff was pretty well understood. On Windows, every file name is guaranteed to be correct, valid Unicode.

Mac and Unix systems (Mac lately is Unix, though not traditionally) came earlier and used different systems, as did other versions of Windows before NT. Specifically, they used a wide variety of encoding systems developed in different parts of the world, each appropriate for the writing of that area. In order to make the software customizable for different areas of the world, they provided a way to set a "locale" or a "codepage" which basically said "On this computer, we represent file names using CP949", or similar (CP949 is a numbering of Korean characters).

That works, but it means if you lift a file from a Vietnamese computer and drop it on an American computer, the name is complete gibberish. I don't mean gibberish in the sense of "not understandable if you don't know Vietnamese", I mean "not understandable, period", because the numbers used to represent the Vietnamese characters would be interpreted by the American computer as whatever those numbers mean in the American character set. In many cases, the numbers representing a name in one system may be invalid on another computer. The computer won't just display garbage, it'll give you an error.
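
Both failure modes are easy to demonstrate in Python (the encodings here are just examples):

    >>> "café".encode("utf-8").decode("cp1252")  # wrong decoder: garbage
    'cafÃ©'
    >>> b"caf\xe9".decode("utf-8")               # invalid bytes: an error
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: unexpected end of data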

Even worse, many of these systems allowed users to specify their own "locale", so one computer may have one user using a German character set while another uses Russian.

So, the upshot of this is that on Mac, Linux, Unix and older Windows systems, there may be files with names that make absolutely no sense when interpreted the way everything else on that computer is interpreted.

The question is: How should a backup program deal with this?

I ultimately settled on these key goals:
  1. The backup should ALWAYS succeed, regardless of whether or not the file names are "valid".
  2. The backup should preserve enough information that it's just as possible to figure out the correct characters for a name from the backup as it is from the original system.
In addition, I have a constraint: The format I'm using to store file name information in the backup system (JSON) only accepts Unicode. So I have to somehow make sure that everything I store is valid Unicode, even though I don't necessarily have any idea what characters are in the name.
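
For instance, here's how (modern) Python's JSON library reacts if you hand it raw bytes:

    >>> import json
    >>> json.dumps({"name": b"caf\xe9"})
    TypeError: Object of type bytes is not JSON serializable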

It's late. Having explained the problem, I'll describe my solution tomorrow.
