Friday, February 20, 2009

PeerBackup status

The file system scanner is pretty much complete. I'm satisfied that it doesn't miss anything and that it runs fast enough. The initial run on a file system is slow, of course, because it has to hash every file on the computer. On my desktop machine, it plows through 182 GiB in just over two hours. After that initial run, though, it's pretty quick. It does a complete scan in about five minutes; a little more if there are some big files that have changed. Not too bad. I imagine there is still some performance tuning I can do to squeeze a little bit out of that time (I notice that a little-has-changed run maxes my CPU, so there's probably some inefficiency there), but I'll defer that until after I have the basic system working.
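The post doesn't show the scanner's code, but the incremental behavior described (hash everything once, then skip unchanged files on later runs) can be sketched roughly like this. This is a minimal illustration, not PeerBackup's actual implementation; the cache structure and function names are hypothetical.

```python
import hashlib
import os

def _hash_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't load into memory at once."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def scan(root, cache):
    """Walk the tree, yielding (path, digest) for new or changed files.

    Re-hashes a file only when its size or mtime differs from the cached
    entry (hypothetical cache dict: path -> (size, mtime, digest)), which
    is what makes the second and later runs fast.
    """
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            cached = cache.get(path)
            if cached and cached[0] == st.st_size and cached[1] == st.st_mtime:
                continue  # unchanged; skip the expensive hash
            digest = _hash_file(path)
            cache[path] = (st.st_size, st.st_mtime, digest)
            yield path, digest
```

Even on a fast disk, the initial run is I/O-bound on reading every byte; the follow-up runs only pay for a stat per file, which matches the "two hours, then five minutes" behavior described above.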

On upload job management, I have a good start. I've implemented a workable system for prioritizing uploads and come up with a potential default prioritization scheme that balances three factors: user vs. system files, file age, and file size.

Some files are more important to back up than others. Ideally, what we really want is a fairly fine-grained mechanism for allowing the specification of classes of files and then each class should be completed before the next is begun. So, for example, I want my personal finance information backed up before anything else, followed by my photos, followed by my work projects, followed by my personal projects, followed by system configuration information (/etc and some stuff in /var) followed by everything else in my home directory, followed by locally-installed applications (/opt and /usr/local), followed by everything else.

I ultimately want to allow such fine-grained control, but I don't expect very many in my target audience will understand enough about their computers to set it up. I'm also not sure I know enough about Windows or OS X to define good approaches for those platforms.

So, for now, I'm starting simple. My algorithm strongly prefers files in /home, /Users or C:\Documents and Settings, whichever of those paths exists, and doesn't prioritize beyond that. This essentially creates three classes: user files, other files, and files that shouldn't be backed up at all (implemented by specifying exclusions on the scanning process).

The next prioritization element is modification time. It seems like a good idea to back up recently-modified files before old files, on the theory that they're of greater interest to the user. The function was selected so that files modified in the last minute get maximum time prioritization, files that are a week old get 50% and it trails off from there, getting down to 10% after a year.

The last element is size. If you have to pick between getting a whole bunch of small files backed up or a few large ones, it's probably better to get the many small files. The function I chose is designed to be at maximum for empty files (though those won't actually be backed up, obviously), at 50% for 1 MiB and trail off from there. Oh, and the size value used is the lesser of the file size and the size of any cached delta for the file, so small deltas will get high priority. I'm not sure if that's a good idea or not, since it will tend to favor adding more revisions to backed-up files over getting files that haven't been backed up yet into the grid.

All three elements are weighted equally, on a scale from 0 to 1 million. But the "user files first" element is boolean -- either the file gets the full million or else it gets nothing -- while the age and size factors will almost never give full value for a file, and tend to trail off very quickly. I've only done rudimentary testing, but it looks like only very young and very small system files end up prioritized over user files. It'll need tuning, but I think it's a good start.
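The post doesn't give the exact curves, but a function of the form 1/(1 + (x/x0)^k) matches the stated anchor points: 50% at one week for age (with the exponent tuned so it falls to about 10% at a year) and 50% at 1 MiB for size. Here's a sketch of the whole scoring scheme under those assumptions; the function names and exact shapes are my guesses, not PeerBackup's actual code.

```python
SCALE = 1_000_000          # each factor contributes 0..1,000,000
WEEK = 7 * 24 * 3600       # seconds
YEAR = 52 * WEEK
MIB = 1 << 20

# Chosen so the age curve hits ~10% at one year:
# (YEAR/WEEK) ** K == 9  =>  K = log(9)/log(52) ~= 0.556
AGE_EXPONENT = 0.556

def age_factor(age_seconds):
    """~1.0 for just-modified files, 0.5 at one week, ~0.1 at one year."""
    return 1.0 / (1.0 + (max(age_seconds, 0) / WEEK) ** AGE_EXPONENT)

def size_factor(size_bytes):
    """1.0 for empty files, 0.5 at 1 MiB, trailing off beyond that."""
    return 1.0 / (1.0 + size_bytes / MIB)

def priority(is_user_file, age_seconds, size_bytes, delta_size=None):
    """Sum of three equally-weighted components.

    The user-file component is boolean (full million or nothing); the
    size component uses the smaller of the file size and any cached
    delta size, as described above.
    """
    effective = min(size_bytes, delta_size) if delta_size is not None else size_bytes
    return (SCALE * (1 if is_user_file else 0)
            + SCALE * age_factor(age_seconds)
            + SCALE * size_factor(effective))
```

With these shapes, a system file needs both a very recent mtime and a very small size to outscore even an old, large user file, which is consistent with the rudimentary testing described above.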

Job queue scanning is implemented, but I have realized another piece is required. Although I can create a priority queue containing all the jobs (and it's reasonably fast, even on a big queue), I need to add some logic to detect multiple jobs referring to the same path, because some of them may have dependencies on others.

If a file has multiple jobs that are each deltas from a previous revision (which means that one full revision is in the grid already), it doesn't make sense to upload them out of order, because later deltas are useless without their predecessors. Because the scanner won't cache deltas for files that haven't already been successfully uploaded, the question about uploading a delta whose full revision basis hasn't been uploaded should never arise.

So, I need to add a mechanism to allow me to identify when an upload job is one of several referring to a file, and then decide how to handle them.
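One way to handle this (a sketch of my own, not necessarily what I'll end up implementing) is to group jobs by path and keep only the oldest job per path eligible in the priority queue; when an upload completes, the next job for that path is released. The job representation here is hypothetical.

```python
import heapq
from collections import defaultdict, deque

class UploadQueue:
    """Priority queue with at most one eligible job per path at a time,
    so deltas for the same file always upload in order."""

    def __init__(self):
        self._heap = []                      # (-priority, seq, path)
        self._pending = defaultdict(deque)   # path -> (priority, job), oldest first
        self._seq = 0                        # tie-breaker for equal priorities

    def add(self, path, priority, job):
        self._pending[path].append((priority, job))
        if len(self._pending[path]) == 1:
            self._push(path, priority)       # first job for this path is eligible

    def _push(self, path, priority):
        heapq.heappush(self._heap, (-priority, self._seq, path))
        self._seq += 1

    def pop(self):
        """Return (path, job) for the highest-priority eligible job, or None."""
        if not self._heap:
            return None
        _, _, path = heapq.heappop(self._heap)
        _, job = self._pending[path][0]
        return path, job

    def mark_done(self, path):
        """Call when an upload finishes; releases the next job for the path."""
        self._pending[path].popleft()
        if self._pending[path]:
            self._push(path, self._pending[path][0][0])
        else:
            del self._pending[path]
```

A high-priority delta thus stays invisible to the scheduler until its predecessor for the same path has gone up, which sidesteps the out-of-order problem entirely.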

I'm hoping (again) to get an alpha out this weekend, but I think I'll be snowboarding with the kids on Saturday, so we'll see how it goes.

Thursday, February 19, 2009


The PirateBay trial is generating a lot of discussion all over the world. I've posted my thoughts in bits and pieces on a couple of different forums, but I want to pull them together into one place.

First of all, I want to make clear that I support copyright. I think it's a fundamentally good idea when applied for its original purpose (its original purpose in the US, at least; the original original purpose was censorship by the British Crown). That purpose is expressed in Article I, Section 8 of the US Constitution this way: Congress is given authority to pass laws to "promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries".

Note that the purpose stated is to "promote progress", not to compensate authors and inventors (and musicians and filmmakers and programmers and...). The Framers recognized no inherent right of people to get paid for creating stuff, and certainly no right to control and profit forever from a single piece of work.

There is no natural limit to the flow of ideas and expressions, and therefore no inherent sort of "ownership". As Thomas Jefferson put it, "He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me." The natural state of "intellectual property" is that it doesn't exist, and there's no obvious reason it should, because my obtaining a copy of something in no way takes anything from anyone else.

However, the Framers also recognized that much of value to society might lie forever hidden because there was no motivation to publish it. Publishing books was expensive, and if someone else could copy them there might be no way to recoup those costs. In the 18th century publication created a significant barrier to the flow of material into the public domain, and so Congress was authorized to do some things to motivate people to cross that hurdle. They didn't concern themselves so much with things that might never be created for lack of motivation, but that's an issue as well. A talented author who must do other work to feed himself won't be able to write as much, and that's a bad thing.

Enter the notions of copyrights and patents (and trade secrets and trademarks as well, though they're a little different). Copyrights and patents work very differently, but the basic purpose of both is the same: To encourage publication, so that ideas and expressions can enter our culture, and to do it by granting artificially-created and enforced limited monopolies.

It's important to remember the artificial nature of these monopolies, and the fact that society spends significant amounts of money on enforcing them. The reason that we do this is not to benefit creators (though it does, and that's a nice side effect), but to benefit society, to enable the production and distribution of material to the public, and to ensure that the material will one day (soon, hopefully), enter the public domain and become a basis for the creation of even more.

With that understanding of copyright, it should be clear that as publication gets cheaper and easier, there is less and less reason for society to provide motivation to people to publish. In fact, arguably, the publication barrier has disappeared entirely. I can write this little treatise on copyright law and publish it to the world at absolutely no cost other than my own time to type it. I could make a movie and publish it on youtube, or via bittorrent if youtube isn't sufficiently high-quality.

Technology has also lowered the cost of production of many kinds of copyrighted materials, but that tends more to raise the bar in terms of expected quality than to lower the cost. Talented people work hard to make things, and society benefits if there's a way they can get paid to put their talents to work.

Okay, with that as background, here's my take on the PirateBay situation:

First, from a technical perspective, there is no way to argue that the operators of the PirateBay engaged in any sort of copyright infringement. They basically provide a specialized version of Google, one that provides links to information hosted in lots of other places around the world. Further, they don't even provide the links; they just operate a service where users can post the links and other users can find them.

This means that any sort of action against them is over actions that are a couple of steps removed from the real lawbreaking. "Contributory infringement", "conspiracy to make available infringing copies", and the like are the phrases we have to use. I just don't think it holds water. If they're convicted, then we also have to force Google to verify that none of its links point to infringing material -- which is really hard given that whether or not a certain piece of audio, video or text is even infringement is sometimes a subject of debate even among lawyers.

In general, although I think copyright is a good idea, I don't think these sorts of generations-removed charges are a good idea, just as I don't think it's a good idea to hold gun manufacturers liable for what bad people choose to do with the tools they make. Tools have good and bad uses, and the PirateBay hosts a lot of links to perfectly legitimate content as well as links to stuff that is copyright infringement in most jurisdictions.

The bottom line is that if the motion picture studios and record labels want to shut down infringement, they need to go after the infringers, not the people who made the tools the infringers chose to use.

But there's a deeper issue here, and this is why I went into the origin of copyright first.

The deeper question is whether or not it should be illegal to share movies, music and books. It's generally been understood for years (and even upheld in court) that sharing stuff with your friends is not actionable copyright infringement. In general, in the past the line between infringement that mattered and infringement that didn't was whether or not anyone was making money. Want to tape a song off the radio? Fine. It's infringement, but non-commercial and non-impacting. Now make 10,000 copies of that tape and start selling it on the street, and you'll have police and lawyers knocking on your door.

In the past, that worked fine. The big media companies may have grumbled behind closed doors, but the non-commercial sharing was both hidden and moderately small-volume.

Now, however, non-commercial sharing has exploded. The Internet makes it possible for me to share a CD I bought with tens of thousands of people around the world. To media execs, who mentally count every copy shared as a sale lost, that adds up to billions in lost revenue. It's the kind of thing about which Something Must Be Done.

As a shareholder in some of those companies, I agree. As a member of society, I wonder. The purpose of copyright was to motivate people to overcome the publishing barrier, but that barrier is gone. At this point, the only remaining purpose of copyright is to help ensure that stuff gets created. Banning filesharing is probably good for Viacom's bottom line, but does it really help to motivate artists, musicians, authors, etc.?

I don't think so. The different industries have different dynamics, so let's look at each one in turn.

Book publishing first, just because it's the industry that seems to contain some of the most forward-thinking and enlightened people. I suppose it shouldn't be surprising that authors tend to be a little more deep-thinking than musicians or moviemakers. And I suppose it shouldn't be surprising that Science Fiction authors and their publishers are a little more forward-thinking than others in their industry, which leads me to Baen. A few years ago, Jim Baen decided to try an experiment. He offered the authors he publishes the option of putting a few of their books on-line in a FREE library. Anyone could download them, and everyone was encouraged to share them with whoever they wanted. There were no restrictions on copying, and they were provided in every electronic format Baen could think of, so they could be read on any computer, electronic book reader, cellphone... whatever. Or you could even print a copy and read it on paper.

Everyone in the industry scoffed, of course, and some were angry because they thought Baen's decision to give books away for free would hurt their sales as well as Baen's. Jim Baen and his authors -- some of the current top-selling Sci-Fi authors -- stood firm, though, and carried out their experiment.

It's actually incorrect to call Baen's Free Library an "experiment", though, because Baen didn't see it that way at all. He saw it as a way to prove a point and -- even more importantly -- a way to make some money. He fully expected that he and his authors would profit handsomely from giving stuff away for free.

Eric Flint, one of the authors in question, likes to point out that the book he decided to put up, Mother of Demons, was his first novel, and far from his best, and both its quality and his obscurity showed abundantly in its poor sales. It was a profitable book, but not what you'd call a success. Not until it became the first book on the Free Library, anyway. It is now Baen's best-selling backlist title, and one of Flint's top few.

That result surprised even Baen and Flint. They expected that giving away some of an author's work for free would increase the sales of other works. They calculated that their expected profit would come primarily from giving the first book in a series away, and then selling the rest. What even their bright, inquisitive science-fiction author minds never expected was that the books they chose to give away would also see significant new sales. But it was consistently true. So much so that Baen has since made a habit of including CDs in their hardcover editions, which contain completely unencumbered copies of dozens of books, with a label on the front that says "please share".

Why does this work? I have a number of theories, but they all boil down to this, at bottom: People like to buy stuff that they like. So if you're selling entertainment, the main problem you have is helping people to find out that they like your stuff, and giving things away for free is a good way to do it.

Turning to the music industry, the first thing we should take note of is that the industry has been giving their product away for free for decades. And not just "giving", either, they've actually been paying people to give their product away for free. They want so much to pay people to give their product away that we've had Congressional investigations and criminal prosecutions specifically to stop all this paying, and yet the industry has continued to do it, becoming ever more creative in the ways they pay people to give away their product for free.

I'm talking about radio, obviously. The record labels long ago understood that the very best way to increase sales was to increase airplay. Why? Because people rarely buy what they haven't heard, and never buy what they haven't heard of.

For some reason they think file sharing is different. They don't see a tune played on the radio as a lost sale, but they do see a downloaded MP3 as a lost sale. Are they right? Is it different? Not according to every formal, published study that's been done. There have been plenty of them conducted around the world and they invariably find a few things:
  • The people that download the most buy the most.
  • Increased on-line sharing is correlated with increased sales.
  • There is no concrete evidence that filesharing decreases sales.
This all seems to make no sense, until you remember that people like to buy stuff they like. How many music fans don't want to own the actual albums of their favorite band, with cover art and inserts and all? My kids have complained about me giving them music purchased as downloads for birthdays and Christmas. Why? The music is the same, so what does it matter? Because it's NOT the same as owning the "real" CD. And the fact is that publicity is what music needs for success, and filesharing provides that publicity.

Now, there is some question as to whether or not this will continue to be true, as we migrate more and more to digital file-based players (iPods) as the primary way we listen to music. Is there a difference between buying a song on iTunes or downloading it from a filesharing system? Perhaps not, but I think there is. The difference is that people who have the money like to buy stuff they like, and they feel good about having done it.

Another finding of several of the filesharing studies has been that individuals' entertainment budget is a relatively fixed quantity, and that they'll spend all of it. At most, the ability to get some things for free may shift where they spend it, but they'll still spend it. This indicates that a sale "lost" to filesharing (meaning a case where someone downloads a song and doesn't buy a copy also) isn't really lost. Maybe it's shifted. Most likely it wouldn't have happened at all anyway.

But let's suppose a future world where no one listens to music on anything but iPods, no one buys it in any way except downloads, and all downloading has migrated to free sites. Will music disappear? Will musicians have no way to make a living?

Consider Jonathan Coulton. He's a talented singer and songwriter from New York who writes and performs offbeat, fairly geeky music. Stuff like a song about zombies titled Re: Your Brains, with the chorus "All I wanna do is eat your brains, we're at an impasse here, maybe we can compromise", or a song about a love-struck computer programmer called Code Monkey. Coulton writes, sings, performs and records all of his music himself. He gives many of his songs away for free through his web site, and sells others for $1 each. He not only makes no bones about fans sharing his music (even the songs he doesn't give away for free), but he licenses all of it under a Creative Commons license which allows other people to use his stuff for free to make other stuff, like fan-made music videos.

Coulton has no record label deal, and doesn't want one. He quit his job as a computer programmer and spent six months writing and recording music, publishing a new song for free in every edition of a weekly podcast. By the end of a year, he was not only making a living doing his music, but a significantly better living than when he was working a normal job before -- and all of this in spite of the fact that it's trivial to get all of his music without paying a penny.

He doesn't have gold toilet seats, but he does have a comfortable income well into six figures, doing what he wants to do, and this in spite of the fact that his music is rather niche, with a very narrow market appeal. Personally, I think that's a fantastic model for music in the future, and a friend of mine pointed out that by allowing musicians to earn relatively "normal" incomes, rather than raking in millions, we may cut down on the number of them that crash their Ferraris or fry their brains with drugs, and actually get more good music out of them.

But, in case you may be thinking that this approach only works for niche acts, consider Radiohead. They recently performed an experiment with their new CD, allowing listeners to download it and then pay whatever they think it's worth. I downloaded it and hated it, so I didn't give them a penny. They haven't released exact figures, but apparently the experiment was a huge success. Regular CD sales of the album were the highest Radiohead has ever experienced, the album going platinum in its first week on the shelves, and word is that their on-line sales were just as good, in spite of the fact that it was basically an on-line "tip jar", with no obligation to pay anything. According to the band, most on-line buyers paid pretty close to the normal retail price.

So, while obscure Jonathan Coulton makes a fine white collar-class living by not bothering himself about copyright, Radiohead is making millions doing the same thing.

With both books and music, it's quite clear that aggressive copyright enforcement, lawsuits against file sharers, lobbying for draconian criminal penalties for infringement, etc., are all both bad for society and bad for authors and musicians.

But what about movies? While one guy in a home studio can write, perform, record and distribute music, making a movie costs millions of dollars. Often hundreds of millions of dollars. Yeah, Blair Witch only cost a few tens of thousands, but that's not the rule, and not the quality of movie we usually want.

Were file-based movie watching to become the dominant form, that might be a real problem. Luckily, at present I don't see any indication that movie theaters are going away. Again, it comes down to the fact that people like going out to the movies, and that experience can't be replicated by file sharing. File sharing could conceivably cut into DVD sales, but even if it wiped them out (which is highly unlikely), all that means is that moviemakers have to focus on the box office as their primary revenue source. That's what they did for decades until the advent of VCRs, and there's no reason the model can't work again. In actual fact, I expect that cinephiles, like audiophiles and bibliophiles, will continue buying because they like to. Especially since the moviemakers often include additional benefits, such as nice boxed sets, posters, extra content on the media, etc.

So, given that all of these industries seem well-positioned to prosper in the presence of file sharing, what value is there to society in investing a lot in limiting its members' freedom to share when they want to? Note that the focus is on costs and benefits to society, NOT costs and benefits to those who create the media. Their well-being is a part of the calculation, but only insofar as the fact that if they can't make a living making the stuff we want, they'll have to spend their time doing something else, which we probably consider less valuable.

Bringing this back to PirateBay, I think that they should get off because they didn't do anything wrong under the law. But not only did they not do anything illegal, they didn't even do anything wrong. Copyright is an artificial construct which we prop up through legal means in order to achieve an end goal. Even if the PirateBay facilitates infringement of copyright, our copyright law itself is currently badly broken, because it hasn't incorporated the new Internet reality. A copyright law adapted to achieve the maximum societal benefit at minimum societal cost will allow non-commercial file sharing.

First we need to stop bugging these people for doing what they're doing, and then we need to fix our broken laws. Unfortunately, the large media companies have a tremendous amount of influence over our laws, and filesharing costs them money. In fact, as far as the music industry goes, there's good reason to believe that the Internet makes the record labels about as necessary as buggy whip makers. More and more musicians are connecting directly to their fan bases and not bothering with a label at all, so the people who are the focal points of the industry are fighting a desperate rearguard action for their own survival, precisely because they're not essential.

We need to help our legislators understand these issues, or the media industry is going to drive laws that will fill our prisons with college students whose only crime was to do something that harmed no one.

Saturday, February 14, 2009

Unit testing vs defensive coding

While working towards releasing the first alpha of my PeerBackup application, I decided to check out the code coverage on my unit tests. In the process of addressing some deficiencies, I noticed an unexpected tension between defensive coding and unit testing.

I want to get my test coverage as close to 100% as possible, because I really like the confidence it gives me that the code is working correctly. This is particularly important since I'm using Python, and without any sort of static validation, if a line doesn't get executed I have absolutely no idea whether or not it is at all functional. It could reference nonexistent variables or functions; there may even be some syntax errors that won't be found until the line is actually run.

So 100% coverage is a Good Thing. But... I learned years ago that defensive coding is also a Good Thing. I often write a few lines of code to address cases that I'm pretty sure are impossible, and which I definitely can't think of a way to create. Since I can't think of any way to create the case, I can't write a unit test to cover it -- which means that my unit test can't be 100% unless I remove the defensive code.

As a result, I find myself tempted to remove defensive code that is probably a really good idea, because there just might, after all, be a way to trigger it. Or even if there isn't, perhaps some future change will create a way to trigger it.

On balance, I'll leave the code in there and accept less than perfect coverage. I don't like it, though.
