Wednesday, October 7, 2009

Software Copyrights

I've written about this various places before, but I thought I'd put it here, mostly so I can link to it from other places rather than retyping the arguments.

I'm a big fan of copyright law. In the abstract, at least -- there are a lot of problems with our current law. And computer software is a huge part of my life. It's not only my job, it's one of my major passions and lately my biggest hobby. So it's natural that I should be interested in how copyright applies to software, and I think that the way we use copyright for software is very, very broken.

To explain why, first I have to give a little background on copyright.

The idea of modern copyright is a pretty simple one: Society grants creators of all sorts of useful and artistic intellectual works control for a limited period of time over who is allowed to produce copies of their work. There's a little more to it, and lots of corner cases and caveats, but that's the basic idea. Other than the "limited period" part of it, pretty much everybody understands that if an author writes a book, you can't make copies of it and sell them on the street without permission.

But, why not? Why do we do this? Thomas Jefferson argued that copyright made no sense. He said that ideas are naturally infinite, and that as they're passed from person to person, everyone is enriched. He compared an idea to a candle flame, pointing out that a man who lights his candle from mine has obtained light, and I have lost nothing.

So why should society invest large amounts of money and time in enforcing copyright laws, which restrict the natural freedom and urge to share?

There's a really good reason, and it's the one described in Article I Section 8 of the US Constitution: "To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries". Copyright law was created as a means to convince creators of intellectual works to publish them, to make them available to the world. The way it works is that society removes the freedom of everyone but the creator to make copies, for a time, so that the creator can benefit from his or her work. But any benefit to the creator of the work is just a pleasant side effect, because the real goal is to get that work published, into the hands of as many people as possible where it can spark new ideas and inspire new creativity. In other words, where it can Promote Progress.

Oh, it also promotes progress by motivating people to write, sing, etc. But the real goal of copyright was to promote publication and dissemination, because that's where the real progress is made, when ideas build on other ideas.

So, how does this relate to software?

Well, copyright law is the primary legal tool used to control the distribution of software, both by individuals and corporations who are trying to make money, and even by the Free Software movement, who have other goals. But Congress never really sat down and thought hard about how copyright should apply to software. Congress did change the law in order to balance out the short-term and long-term advantages to society when other technological changes came about, but they didn't do it fast enough for software, so the courts ended up deciding for them.

The courts basically decided that software is sufficiently expressive and creative to qualify for copyright protection. I completely agree. I have seen some truly beautiful code in my life, and even written a little. I hope someday to take a photo that's as beautiful as the best of the code I've written.

But in a crucial oversight, the courts failed to distinguish between "source code" and "compiled binaries".

Source code is what programmers write. Reading or writing it requires some training, but it's designed to be human-readable, and computers can't make any direct use of it. Instead, the source code must first be processed by another program called a "compiler" (which itself was written in source code and processed by a compiler). The compiler turns the source code into "machine code", the actual pile of computer instructions that the computer reads and follows, like a very sophisticated cake recipe.
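The gulf between readable source and machine-consumable instructions is easy to demonstrate even in a language like Python, which compiles source to bytecode rather than native machine code. A small illustrative sketch (my example, not from any particular program):

```python
import dis

# Human-readable source: the intent is clear at a glance.
source = """
def area(width, height):
    return width * height
"""

# The compiler turns it into a code object full of opaque instructions.
code = compile(source, "<example>", "exec")

# The raw bytecode is just a string of bytes, meaningless without tools.
print(code.co_consts[0].co_code)

# A disassembler can list the instructions, but the names, comments,
# and structure that made the source readable are largely gone.
dis.dis(code.co_consts[0])
```

Going the other direction, from those bytes back to readable source, is the laborious "shredder reassembly" described below.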

Source code is also what programmers read. Just as authors of poetry and novels hone their craft by reading the writing of other authors, the best way for a programmer to learn new ideas and new techniques is to read the code written by other programmers. The progress of software is promoted by making sure that programmers can read what other programmers have written: not so they can copy those programs directly (authors don't make word-for-word copies of each other's works either), but so they can pick up ideas. Structure, phrasing, word choice, dialogue, character development, plot arcs... all of these are things that authors learn from other authors, and there are analogous concepts to all of them in software.

The difference between a book and a software program, though, under those first court rulings that extended copyright to software, comes down to this: it's impossible for an author to publish a book to the world and simultaneously keep secret the words he used to write it. In order for an author to reap the benefits of publication, he also has to allow other authors to read his words and learn from them. How could it be otherwise?

With software, it is otherwise. Programmers can read and learn from source code, but once that source code has been fed through the grinding maw of the compiler, turning it into an opaque mass of machine instructions, it is extremely difficult to examine the result and determine how it does what it does. Not impossible, but very difficult. It's perhaps akin to taking a book and running it through a paper shredder, then piecing it laboriously back together in order to read it.

But copyright law, as currently applied, protects that ground-up version just as much as the readable version. And with software, the ground-up version is the one that has value to non-programmers. So, individuals and companies can produce software, publish the opaque binaries on CDs, in boxes on the shelf in the local computer store, and never have to reveal the ideas they used to create it. And yet they get the full weight of the legal system standing behind their copyrights, even though they have sidestepped the whole purpose of copyright law, to promote progress by disseminating ideas.

Moreover, although copyright is supposed to last for a limited time, after which copyrighted material falls into the public domain and becomes available for anyone to use for any purpose they wish, the source code of software published in binary-only form will never see the light of day. The binaries will fall into the public domain, but the source code was never published and will be lost.

It's not a total loss, of course. Other programmers can often infer interesting things about the structure of software from the behavior of binary copies. And some 'reverse engineering' (figuring out how it works by poking through the binary) does take place. But progress is hugely slowed by the predominance of 'closed' software, software for which no source code is available.

The Free Software movement is really a reaction to that limitation on progress. And it's a significant testament to the progress that is enabled by openness that Free Software constructed by ad-hoc groups of volunteers around the world often not only compares with, but bests, similar "closed" software constructed by large, well-paid and focused corporate teams.

I think the solution to this problem is very simple, though politically challenging: Software makers should be required to publish source code in order to receive copyright protection. It would still be illegal for programmers to copy this copyrighted source code, and illegal to copy the ground-up binaries as well, but other programmers could read the code and learn from the ideas, satisfying the progress-promoting goal of copyright law.

There would be practical benefits as well. As a purchaser of a software package, you would have some assurances that you do not now have. For example, should the company that sold you the package collapse, you could still hire a programmer to fix any defects you find in the software (the legalities of that would have to be worked out, but at least the ability would be present). Even before you buy it, you would probably have the ability to ask others who've purchased it what they thought of the program -- and not only its outward behavior, but also its inward structure. A good programmer can tell a lot about the quality and reliability of a software package by examining its source code. This is similar to a mechanic taking a look under the hood to see if an automobile is sound. But with closed-source software, the hood is welded shut.

A non-obvious benefit that I'm convinced we would see is a reduction in the amount of code copied illegally between programs. How can that be? Doesn't publishing code only in binary form prevent illegal copying entirely? Not really. There are still people who see the source code, and they can still copy it. Lots of programmers (illegally) take a copy of their work with them when they change employers. I personally have witnessed a couple of cases of stolen code incorporated into closed-source software (not while at IBM; IBM is exceptionally cautious about this).

So illegal copying of source code happens now, but how would making source code more widely available reduce it? Simple: copied code would be easier to find. Right now, companies that illegally copy source code usually get away with it, because they distribute only opaque binaries and it's difficult for anyone to recognize their act. But if everyone published source code, finding illegally-copied code would in most cases be a simple matter of scanning. There are tools right now that can scan a body of source code to see if it contains any code taken from the thousands and thousands of open source programs in the world. Companies use these tools to verify that their programmers haven't lifted some open source and dropped it into the company's software as a time-saver. It's easy to see that if most commercial source code were available, this approach would be easily extended to cover it as well.
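The scanning tools mentioned above are sophisticated commercial products, but the core idea can be shown with a toy fingerprinting scheme (my own illustrative sketch, not how any particular tool works): normalize each line, hash overlapping windows of lines, and count fingerprints shared between two code bases.

```python
import hashlib

def fingerprints(source, window=3):
    """Hash every run of `window` consecutive normalized lines."""
    # Normalize: strip whitespace and drop blank lines, so trivial
    # reformatting doesn't hide a copy.
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    prints = set()
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        prints.add(hashlib.sha1(chunk.encode()).hexdigest())
    return prints

def shared_fraction(suspect, reference):
    """Fraction of the suspect's fingerprints also present in the
    reference body of code -- a crude signal of possible copying."""
    s, r = fingerprints(suspect), fingerprints(reference)
    return len(s & r) / len(s) if s else 0.0
```

Real tools are far more robust (token-level matching, resistance to renamed variables), but the principle is the same: once both bodies of code are visible, mechanical comparison is cheap.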

There might still be some companies who have such important and novel ideas in their source code that they dare not publish it. They would also have an option to protect their assets, through another facet of intellectual property law: Trade Secrets. They could classify their source as a trade secret, and sell their software only to customers who are willing to sign a contract committing them not to make copies. The contract would protect the binaries from being freely redistributed, and the source code would stay a secret. Any employee or other person who knowingly divulged the secret source code would be guilty of a crime.

I'm convinced that if obtaining copyright protection for software required publishing the source code, we would see an explosion in software progress. The quality and capability of the software packages we all use every day would grow by leaps and bounds. Software technology researchers would have access to a huge body of code to analyze and learn from, to help cull the best techniques and processes to help all programmers be able to do a better job. Tools would improve. Quality and reliability would improve. Security would improve. There would still be lots of problems, of course, because software is inherently, fundamentally hard. But applying copyright law in accordance with its underlying principles would better serve society's interests as a whole.

Sunday, October 4, 2009

GridBackup Re-architecture

GridBackup still isn't fully functional, but I think I'm going to change directions already. So far, it does scanning and backups pretty well, and has a basic backup verifier so you can check that your files really are safely backed up to the grid. It doesn't have a restore tool yet, but I could slap a simple one together in a few hours if it were needed.

However, there are some problems with the way it works now.

First, it doesn't address laptop users well at all. I have set it up for my brother Dirk, but because his backup server is just a machine sitting in the corner, there's really nothing for it to back up. All of his important files are on laptops. I set up a folder on the server, accessible via Samba (Windows file sharing) so that he can drop important files in a place where they can get backed up, but, predictably, having an extra step like that means that backups don't get done.

Second, there are problems with the implementation. The sqlite database used by the GridBackup, GridUpload and GridVerify scripts doesn't handle concurrent access well, so you can generally only run one of them at a time. But you have to be a little careful (in ways that I only know from experience, and am not sure I could explain) or you can corrupt the sqlite database when stopping one program to start another. The uploader really needs to be a daemon ('service' in Windows terminology), just running in the background all of the time. Ideally, it should really be integrated into Tahoe, so it starts and stops with Tahoe. I don't think I want to do it in Tahoe, at least just yet, because I want more freedom to work. However, given the Twisted application plugin system, I may be able to write it as a plugin that can be added to Tahoe.

In the new architecture, the "uploader" becomes the "backup server". Its purpose is to accept backup jobs delivered to it by a backup client (i.e. the "scanner"). The client should be run wherever the files are (i.e. on your laptop), and would deliver any changed files to the server as fast as it can. The intention is for the server to store those files itself until it can get them safely uploaded into the grid. From the client's point of view, once it delivers the files to the backup server, they are backed up, though it may take some time for the files to actually be delivered to grid storage.

One issue this approach raises is that it requires the backup server to have room to temporarily store all of the files sent to it. This shouldn't be a huge problem in most cases, because the assumption is the backup server is the same machine as the Tahoe node, and it has to have a lot of storage available -- usually 2-3 times as much as what the clients that use that server wish to back up.

However, in some cases users may not be using their Tahoe node for storage, and so may not have a lot of storage available. In that case, they'll want to run the backup client and server on one computer, configure the server to restrict the amount of storage it uses, and have it fetch files from their "original" locations rather than its own storage when it doesn't have room to hold everything. There's still value in having it store recently-modified files, on the grounds that if they've changed recently, they'll probably change again soon, and if the backup server doesn't store its own copies they may change before it gets around to uploading them to the grid.
