Friday, February 20, 2009

PeerBackup status

The file system scanner is pretty much complete. I'm satisfied that it doesn't miss anything and that it runs fast enough. The initial run on a file system is slow, of course, because it has to hash every file on the computer. On my desktop machine, it plows through 182 GiB in just over two hours. After that initial run, though, it's pretty quick. It does a complete scan in about five minutes; a little more if there are some big files that have changed. Not too bad. I imagine there is still some performance tuning I can do to squeeze a little bit out of that time (I notice that a little-has-changed run maxes my CPU, so there's probably some inefficiency there), but I'll defer that until after I have the basic system working.
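To make the approach concrete, here's an illustrative sketch of an incremental scan along those lines: hash everything on the first pass, then skip files whose size and mtime match the cached values. The cache layout and the choice of SHA-1 are just for illustration, not necessarily what the real scanner does.

```python
import hashlib
import os

def scan(root, cache):
    """Walk the tree and rehash only files whose size or mtime changed.
    `cache` maps path -> (size, mtime, digest); a hypothetical structure."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # vanished or unreadable; skip it
            cached = cache.get(path)
            if cached and cached[0] == st.st_size and cached[1] == st.st_mtime:
                continue  # looks unchanged since the last scan; no rehash
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            cache[path] = (st.st_size, st.st_mtime, h.hexdigest())
```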

On the upload job management side, I have a good start. I've implemented a workable system for prioritizing uploads and come up with a potential default prioritization scheme that balances three factors: user vs. system files, file age, and file size.

Some files are more important to back up than others. Ideally, we want a fairly fine-grained mechanism for specifying classes of files, with each class backed up completely before the next is begun. So, for example, I want my personal finance information backed up before anything else, followed by my photos, followed by my work projects, followed by my personal projects, followed by system configuration information (/etc and some stuff in /var), followed by everything else in my home directory, followed by locally-installed applications (/opt and /usr/local), followed by everything else.

I ultimately want to allow that kind of fine-grained control, but I don't expect many of my target users will understand enough about their computers to set it up. I'm also not sure I know enough about Windows or OS X to define good approaches for those platforms.

So, for now, I'm starting simple. My algorithm strongly prefers files in /home, /Users or C:\Documents and Settings, whichever of those paths exists, and doesn't prioritize beyond that. This essentially creates three classes: user files, other files, and files that shouldn't be backed up at all (implemented by specifying exclusions on the scanning process).
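As an illustrative sketch, that classification boils down to a prefix check against the user-file roots (with exclusions applied separately by the scanner); the exact check in the real code may differ:

```python
import os

# Roots treated as "user files"; only the ones that exist on a given system matter.
USER_ROOTS = ['/home', '/Users', r'C:\Documents and Settings']

def is_user_file(path):
    """True if the path lives under one of the user-file roots."""
    path = os.path.normpath(path)
    return any(path.startswith(os.path.normpath(root) + os.sep)
               for root in USER_ROOTS)
```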

The next prioritization element is modification time. It seems like a good idea to back up recently-modified files before old files, on the theory that they're of greater interest to the user. I chose the function so that files modified in the last minute get maximum time priority, files that are a week old get 50%, and it trails off from there, dropping to 10% after a year.
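Here's one illustrative function that roughly hits those anchor points (about 1.0 for a minute-old file, 0.5 at a week, 0.1 at a year); the exponent is fitted for illustration rather than taken from the actual code:

```python
import time

WEEK = 7 * 24 * 3600

def age_factor(mtime, now=None):
    """Illustrative decay: ~1.0 for just-modified files, 0.5 at one week,
    roughly 0.1 at one year. The exponent 0.556 is fitted to those anchor
    points for the sake of the example."""
    now = time.time() if now is None else now
    age_weeks = max(now - mtime, 0.0) / WEEK
    return 1.0 / (1.0 + age_weeks ** 0.556)
```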

The last element is size. If you have to pick between backing up a whole bunch of small files or a few large ones, it's probably better to get the many small files in. The function I chose is at its maximum for empty files (though those won't actually be backed up, obviously), at 50% for a 1 MiB file, and trails off from there. Oh, and the size value used is the lesser of the file size and the size of any cached delta for the file, so small deltas get high priority. I'm not sure if that's a good idea or not, since it will tend to favor adding more revisions to already-backed-up files over getting files that haven't been backed up yet into the grid.
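The same caveat applies to the exact curve; an illustrative function that is 1.0 for empty files, 0.5 at 1 MiB, and trails off from there, including the "lesser of file size and cached delta size" rule:

```python
MIB = 1 << 20

def size_factor(file_size, delta_size=None):
    """Illustrative curve: 1.0 for empty files, 0.5 at 1 MiB, trailing off.
    Uses the smaller of the full size and any cached delta."""
    effective = file_size if delta_size is None else min(file_size, delta_size)
    return 1.0 / (1.0 + float(effective) / MIB)
```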

All three elements are weighted equally, on a scale from 0 to 1 million. But the "user files first" element is boolean -- either the file gets the full million or else it gets nothing -- while the age and size factors will almost never give full value for a file, and tend to trail off very quickly. I've only done rudimentary testing, but it looks like only very young and very small system files end up prioritized over user files. It'll need tuning, but I think it's a good start.
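Putting the three together, an illustrative scoring function might look like the following; whether the factors are summed or each independently scaled to the 0-to-1-million range is left open above, so summing here is just one reading:

```python
SCALE = 1000000

def priority(path, file_size, mtime, delta_size=None):
    """Combine the three factors, each scaled to 0..1,000,000 and weighted
    equally. Reuses is_user_file, age_factor and size_factor from the
    sketches above; summing rather than averaging is an assumption."""
    user = SCALE if is_user_file(path) else 0                # all or nothing
    age = int(age_factor(mtime) * SCALE)                     # decays with age
    size = int(size_factor(file_size, delta_size) * SCALE)   # favors small uploads
    return user + age + size
```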

Job queue scanning is implemented, but I have realized another piece is required. Although I can create a priority queue containing all the jobs (and it's reasonably fast, even on a big queue), I need to add some logic to detect multiple jobs referring to the same path, because some of them may have dependencies on others.

If a file has multiple jobs that are each deltas from a previous revision (which means one full revision is already in the grid), it doesn't make sense to upload them out of order, because later deltas are useless without their predecessors. And because the scanner won't cache deltas for files that haven't already been successfully uploaded, the question of uploading a delta whose full-revision basis hasn't been uploaded should never arise.

So, I need to add a mechanism to allow me to identify when an upload job is one of several referring to a file, and then decide how to handle them.
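One illustrative way to handle it: group jobs by path and only release the oldest pending delta for each file into the priority queue, holding later deltas back until their predecessor finishes. This is a sketch of a possible mechanism, not a description of what's actually in the code; jobs are assumed to have `.path` and `.priority` attributes.

```python
import heapq
from collections import defaultdict, deque

class JobQueue(object):
    """Priority queue that releases at most one job per path at a time,
    so deltas for the same file always upload in order. Hypothetical sketch."""

    def __init__(self):
        self._heap = []                     # (-priority, seq, job) entries
        self._blocked = defaultdict(deque)  # path -> later jobs, oldest first
        self._active = set()                # paths with a job already queued
        self._seq = 0

    def add(self, job):
        if job.path in self._active:
            self._blocked[job.path].append(job)  # wait for its predecessor
        else:
            self._push(job)

    def _push(self, job):
        self._active.add(job.path)
        heapq.heappush(self._heap, (-job.priority, self._seq, job))
        self._seq += 1

    def pop(self):
        """Return the highest-priority job that is not blocked."""
        _, _, job = heapq.heappop(self._heap)
        return job

    def completed(self, job):
        """Call when an upload finishes; releases the next job for that path."""
        self._active.discard(job.path)
        if self._blocked[job.path]:
            self._push(self._blocked[job.path].popleft())
```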

I'm hoping (again) to get an alpha out this weekend, but I think I'll be snowboarding with the kids on Saturday, so we'll see how it goes.

2 comments:

  1. I am glad this stuff makes you happy. I only wish it made you stinking rich!

  2. Maybe this one will. It's actually a very good idea. And a rather good use of high bandwidth and large capacity drives and p2p systems.

