Thursday, May 28, 2009

Distributed vs Centralized Version Control

I've been meaning to write about this for a while, because I get asked about it on a regular basis, and because I wanted to try writing it all out to help me clarify for myself exactly why I find distributed version control so much better. The time to write it and the motivation to do so finally coincided, so here it is.

The basic differences between distributed and centralized version control are:

1. DVCS doesn't necessarily include any notion of a central repository. Every developer has a full repository with all version history, etc. on his local machine. Often, teams choose to designate one repository as the "main" one, just for convenience, but it isn't necessary.

2. Sets of changes can be pulled easily and flexibly between repositories, and there are powerful tools for picking out particular change sets.

Those are the core differences, though git (and most other DVCSs) also handle version control at a whole-tree level, rather than per-file. Due to the way change sets are pushed and pulled between repositories, it becomes important that all of the file changes related to a logical change (bugfix, feature, etc.) are a single commit that is thereafter managed as a unit.

The different approach to version control results in different interactions between developers, just as locking-based VCSs are used differently from merge-based VCSs. With a centralized repository, every developer pulls updates from the repository and pushes changes to it. That same approach can be used with a DVCS, but often is not, especially with OSS projects.

A DVCS also lends itself to a "request-pull" model. A developer fixes a bug or implements a feature and commits it to his repository, then notifies other developers (usually the project leader or his delegate) that the change is available to be "pulled". That person can then pull the changes, vet them for appropriateness, style, functionality, etc. and, if he likes them, commit them to his own repository.
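To make the request-pull flow concrete, here is a minimal sketch in Python. It is a toy model, not how git or any other DVCS is implemented: the `Commit` and `Repo` classes, their method names, and the commit ids are all hypothetical, and it assumes the simplest case, where the person pulling has made no commits of his own in the meantime, so no merge is needed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """Hypothetical, simplified commit: an id, a message, and a parent id."""
    cid: str
    message: str
    parent: Optional[str] = None


@dataclass
class Repo:
    """Toy repository: every repo holds its own full copy of the history."""
    owner: str
    commits: Dict[str, Commit] = field(default_factory=dict)
    head: Optional[str] = None

    def commit(self, cid: str, message: str) -> None:
        # Record a new commit on top of the current head, entirely locally.
        self.commits[cid] = Commit(cid, message, self.head)
        self.head = cid

    def clone(self, new_owner: str) -> "Repo":
        # A clone copies the whole history, not just the latest files.
        return Repo(new_owner, dict(self.commits), self.head)

    def missing_from(self, other: "Repo") -> List[str]:
        # Ids of commits that `other` has and this repo lacks, oldest first.
        found = []
        cid = other.head
        while cid is not None and cid not in self.commits:
            found.append(cid)
            cid = other.commits[cid].parent
        return list(reversed(found))

    def pull(self, other: "Repo") -> List[str]:
        # Assumption: fast-forward only. Copy the missing commits, move head.
        new_ids = self.missing_from(other)
        for cid in new_ids:
            self.commits[cid] = other.commits[cid]
        if new_ids:
            self.head = other.head
        return new_ids


leader = Repo("leader")
leader.commit("c1", "initial import")

dev = leader.clone("dev")        # the developer gets the full history
dev.commit("c2", "fix bug")      # committed locally, no server involved

to_review = leader.missing_from(dev)   # the leader vets what would arrive
pulled = leader.pull(dev)              # and pulls only if he likes it
```

The point of the sketch is that the developer never writes to the leader's repository. He only announces that something is available, and the leader decides whether to take it.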

Very large projects (hundreds of developers) normally end up organized in a tree structure, with a central "official" repository managed by a project leader plus a second layer of "lieutenants" (typically responsible for major subsystems), who often have one or more layers of trusted assistants themselves. New code is developed by a "leaf" developer in his local repository and when it's sufficiently clean/stable/etc., he notifies an appropriate individual who decides whether or not to accept the changes. That individual aggregates changes from many people, and then notifies the next layer up in the tree, and so on.

The most visible project of that sort is, of course, the Linux kernel, which has over 500 regular contributors and thousands of occasional contributors. With the DVCS model, Linus Torvalds is able to "manage" this huge development team, and to successfully vet, test, and integrate megabytes of changes every month.

Smaller projects adopt different models depending on the level of team cohesiveness and trust. Within corporate teams, cohesiveness and trust are usually high, and the DVCS is often used in a centralized mode, with every developer pushing changes into the primary repository without going through another individual. Open source projects are often a hybrid, with a small number of core developers who have direct access to the "official" repository, and a larger number who don't and must ask one of those core developers to pick up their changes.

Increasingly, though, even small open source projects are abandoning that mode and shifting to almost entirely decentralized operation. "Almost" because there still ends up being an official repository which is the one from which releases are pulled to be made available to non-developers. Often, though, the owner of that repository is purely a "release manager", who may not even be a programmer.

GitHub really facilitates this very loose model. If you want to modify a project hosted on GitHub, you do it by "forking" the project. GitHub tracks those forks, though, and when you commit changes, those changes are made visible to the project owner, who has the ability to pull them. It also provides a one-click mechanism for requesting that the owner pull a change set, as well as very nice tools for handling the merges.

In my opinion, DVCSs offer two main advantages over centralized VCSs:

First, the organizational flexibility. Whatever sort of structure makes sense for your project and your team, you can implement it, simply by deciding who pulls from whom and, less commonly, who is allowed to push into whose repository without oversight.

The second is lower-level and more pragmatic: It's just really, really convenient to be working out of your own, purely local repository. Want to work on an airplane? No problem, you have full version history, the ability to commit, roll back, branch, merge, etc. Almost all DVCSs also give you the ability to consolidate, divide and reorder commits to "clean up" the version history. That may sound like a bad thing, but it's not: It allows the developer to present change sets as logical, cohesive, coherent wholes, in spite of the fact that the true development process involved many false starts and backtrackings. That not only makes moving change sets between repositories easier and cleaner, it makes the version history MORE USEFUL.

Finally, there's one more aspect to this practical benefit, which accrues most prominently to git: Speed. Linus Torvalds (primary author of git) likes to point out that beyond a certain point performance isn't just a pleasant reduction in thumb-twiddling time. Once an operation is so fast that it takes no human-perceptible time, it changes the way developers work.

Perhaps the best example is branching. Many CVCSs provide branching and merging tools, but they're often cumbersome to use, and slow. Creating a new branch on a large project may take a minute -- or sometimes much more. Switching between branches is similar.

With git, for example, branch creation is instantaneous, and shifting between branches, even on trees with tens of thousands of source files, takes well under a second. As a result, git developers tend to use branches for everything they do. Any new little project, bugfix, feature request, etc., spawns a new branch. If the change works out, it's trivially pulled into the main line (and pushed/pulled from there to other developers). If not, deleting a branch is as trivial as creating one (and the commits on it are not discarded right away, so the work can still be brought back if really needed).
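Branch creation can be that cheap because a git branch is essentially a named pointer to a commit, so creating one copies no files and no history. The Python sketch below illustrates only that idea. `TinyRepo` and its methods are hypothetical names for a toy model, not git's actual data structures or commands.

```python
class TinyRepo:
    """Toy model: commits live in one store, branches are just names."""

    def __init__(self):
        self.commits = {}                 # commit id -> parent commit id
        self.branches = {"main": None}    # branch name -> commit id
        self.current = "main"

    def commit(self, cid):
        # A new commit points at the old branch tip; the branch then moves.
        self.commits[cid] = self.branches[self.current]
        self.branches[self.current] = cid

    def branch(self, name):
        # Creating a branch writes one pointer. Nothing else is copied.
        self.branches[name] = self.branches[self.current]

    def switch(self, name):
        # The toy model only tracks which branch is current; a real tool
        # would also update the files in the working tree here.
        self.current = name

    def delete_branch(self, name):
        # Deleting removes only the pointer; the commits stay in the store.
        del self.branches[name]


repo = TinyRepo()
repo.commit("a1")

repo.branch("bugfix")      # cheap: one new dictionary entry
repo.switch("bugfix")
repo.commit("b1")          # experimental work lands on the branch

repo.switch("main")
repo.delete_branch("bugfix")

still_there = "b1" in repo.commits   # the abandoned commit was not erased
```

One caveat on the toy model: it keeps abandoned commits forever, whereas real git eventually prunes commits that nothing refers to during garbage collection, so recovery is something to do sooner rather than later.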

With traditional CVCSs (and even some of the DVCSs), developers end up with many copies of their source tree, in an effort to keep different streams of work separated. One copy may be for a new feature that's under development on the current in-development version. Another might be a copy of that tree, used for a risky, experimental approach to the new feature. Another might contain the released version of the software, used for testing. Yet another might contain an older version, with changes to fix a bug. With git (and a good build system that doesn't rebuild code unnecessarily), there's no need for more than one.

So, to summarize, DVCSs are better because they can do everything CVCSs can do, and more, and faster. If you want to use a DVCS as a better CVCS -- or even as a front end to a real CVCS; git-svn makes Subversion hugely better -- you can do that. If you need a different model, you can do that. And you can do it all faster and, once you learn how, more easily.
