Sunday, October 4, 2009

GridBackup Re-architecture

GridBackup still isn't fully functional, but I think I'm going to change directions already. So far, it does scanning and backups pretty well, and has a basic backup verifier so you can check that your files really are safely backed up to the grid. It doesn't have a restore tool yet, but I could slap a simple one together in a few hours if it were needed.

However, there are some problems with the way it works now.

First, it doesn't address laptop users well at all. I have set it up for my brother Dirk, but because his backup server is just a machine sitting in the corner, there's really nothing for it to back up. All of his important files are on laptops. I set up a folder on the server, accessible via Samba (Windows file sharing) so that he can drop important files in a place where they can get backed up, but, predictably, having an extra step like that means that backups don't get done.

Second, there are problems with the implementation. The sqlite database shared by the GridBackup, GridUpload and GridVerify scripts doesn't handle concurrent access well, so you can generally only run one of them at a time. Even then you have to be a little careful (in ways that I only know from experience, and am not sure I could explain) or you can corrupt the database when stopping one program to start another. The uploader really needs to be a daemon ('service' in Windows terminology), just running in the background all of the time. Ideally, it would be integrated into Tahoe, so it starts and stops with Tahoe, but I don't want to build it into Tahoe just yet, because I want more freedom to experiment. However, given the Twisted application plugin system, I may be able to write it as a plugin that can be added to Tahoe.
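
As a stopgap before any Tahoe integration, the uploader could run as an ordinary Twisted service under twistd. Here is a minimal sketch of that idea; service.Application and TimerService are real Twisted APIs, while upload_pending() is a made-up stand-in for whatever code actually drains the queue of files waiting to go to the grid:

    # uploader.tac -- run in the foreground with: twistd -ny uploader.tac
    # Minimal sketch of the uploader as a background Twisted service.
    from twisted.application import service, internet

    def upload_pending():
        # hypothetical: look up queued files and push them into the grid
        pass

    application = service.Application("gridbackup-uploader")
    # Re-run the drain function every 60 seconds for as long as twistd runs.
    internet.TimerService(60, upload_pending).setServiceParent(application)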

In the new architecture, the "uploader" becomes the "backup server". Its purpose is to accept backup jobs delivered to it by a backup client (i.e. the "scanner"). The client runs wherever the files are (e.g. on your laptop) and delivers any changed files to the server as fast as it can. The intention is for the server to store those files itself until it can get them safely uploaded into the grid. From the client's point of view, once it delivers the files to the backup server they are backed up, though it may take some time for them to actually reach grid storage.
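
To make the division of labour concrete, here is a rough sketch of the server side. Two things in it are assumptions rather than anything decided: that a job arrives as a (source path, file contents) pair, and that the eventual grid upload goes through the standard "tahoe put" command. What it illustrates is the decoupling: accepting a job only means spooling the bytes locally, and the grid upload happens later, at the server's own pace.

    import os
    import subprocess

    SPOOL_DIR = "/var/spool/gridbackup"   # hypothetical spool location

    def accept_job(source_path, file_bytes):
        """Spool the file locally and acknowledge it as backed up."""
        name = source_path.strip("/").replace("/", "_")
        spooled = os.path.join(SPOOL_DIR, name)
        with open(spooled, "wb") as f:
            f.write(file_bytes)
        return spooled   # returning == the ack the client is waiting for

    def upload_spooled():
        """Push every spooled file into the grid, then drop the local copy."""
        for name in os.listdir(SPOOL_DIR):
            path = os.path.join(SPOOL_DIR, name)
            subprocess.check_call(["tahoe", "put", path])   # real Tahoe CLI
            os.remove(path)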

One issue this approach raises is that the backup server needs room to temporarily store all of the files sent to it. This shouldn't be a huge problem in most cases, because the assumption is that the backup server runs on the same machine as the Tahoe node, which has to have a lot of storage available anyway -- usually 2-3 times as much as what the clients that use it wish to back up.

However, in some cases users may not be using their Tahoe node for storage, and so may not have a lot of storage available. In that case, they'll want to run the backup client and server on one computer, configure the server to restrict the amount of storage it uses, and have it read files from their "original" locations rather than its own storage when it doesn't have enough room to hold everything. There's still value in having it store recently-modified files, on the grounds that if they've changed recently, they'll probably change again soon, and if the backup server doesn't keep its own copies they may change again before it gets around to uploading them to the grid.
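
Here is a sketch of that policy, with made-up numbers for the quota and for what counts as "recently modified": a file that changed in the last day and fits under the quota gets snapshotted into the server's own spool; anything else is just remembered by path and read from its original location at upload time.

    import os
    import shutil
    import time

    SPOOL_DIR = "/var/spool/gridbackup"   # hypothetical
    SPOOL_QUOTA = 2 * 1024 ** 3           # hypothetical 2 GB cap
    RECENT = 24 * 60 * 60                 # "recently modified" = last 24 hours

    def spool_usage():
        return sum(os.path.getsize(os.path.join(SPOOL_DIR, n))
                   for n in os.listdir(SPOOL_DIR))

    def queue_for_upload(path, queue):
        changed_recently = time.time() - os.path.getmtime(path) < RECENT
        fits = spool_usage() + os.path.getsize(path) <= SPOOL_QUOTA
        if changed_recently and fits:
            # Snapshot it now, before the user edits it again.
            copy = os.path.join(SPOOL_DIR, path.strip("/").replace("/", "_"))
            shutil.copy2(path, copy)
            queue.append(copy)
        else:
            # Over quota, or unlikely to change soon: upload from the original.
            queue.append(path)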
