Non-Scalable, Anti-Social (NSAS) Backup System
This article presents a non-scalable, anti-social backup system. It's non-scalable, in that if everyone did it, it wouldn't work. It's anti-social, because it completely perverts the intended use of the medium.
So why am I writing it? To me, it's interesting from a socio-technological point of view: how a certain technology has evolved to where it is today, how that technology has been repurposed for uses quite different from its original one, what the reaction of the legal system is, and the impact of it all.
The technology I'm talking about is USENET. Originally (in the days of the early “Internet”), USENET was used to propagate ideas, conversations, discussions, etc., between loosely coupled members of the community.
Allow me to paint you a picture of what that looked like “in the day.” Imagine, if you will, that your computer was a stand-alone island with no fixed connectivity, and performed all communications via a dial-up (modem) connection operating at 2400 baud. Most computers in private hands were like this; some computers were connected to each other constantly via a network, but these were mostly at universities and large corporations.
All traffic between you and other people around the world was based on your computer making a phone call to another computer and transferring data. Sometimes, the other computer would call you, and transfer data; in that case, at the end of the conversation, the other computer would ask if you had anything to transfer, and your computer would send its data.
Data that wasn't destined for the computer that called you would be passed along by that computer to the “next hop” computer, and so on, until it reached its destination. Your computer was expected to do the same: it would pass any data it received that wasn't destined for it on to the next computer in your calling tree.
The USENET news hierarchy was born as a means of transferring articles (news) between interested parties. Since it would be difficult to know which users at which sites would be interested in which particular articles, the convention was that entire hierarchies would be transferred from site to site, with perhaps the vast majority of articles going unread. (For some more information on this, please read my article, “Improving USENET News Performance” on this site.)
As the Internet connected more and more machines together directly, the peer-to-peer transfer mechanism was gradually replaced by centralized servers: so-called “news providers”, such as Newshosting.com and Giganews.com, for example. The types of articles also began to change: people started posting large binaries, and hence the alt.binaries hierarchies were born.
Fast forward to today; the bulk of the data on news servers is pirated software, television shows, music, and movies. You might ask, “so why aren't the news hosting companies being shut down?” The answer is simple: they are “common carriers”, meaning that they do not examine the traffic going through them, they merely provide a conduit for that traffic. Think of your phone provider. They don't monitor your calls (the NSA does that for them), so they aren't legally obligated to monitor your conversation and cut you off should you discuss taboo subjects. Same with the news providers; since they don't monitor the traffic, they are allowed to live.
As of 2014-02, Wikipedia indicates that about 15 TB of news is posted daily! And some news providers are advertising over 2,000 days' worth of storage. Doing the math (and ignoring the steady rise in traffic), 15 TB per day times 2,000 days means that 30 PB of storage is required to hold all this data. And that's just one copy: first off, each news provider would have backups, and secondly, there are many providers. This is a ton of data!
The Backup System
All I want to do is back up a few files in case of disaster. So I had this idea, you see. Why couldn't I simply post my backups to USENET, and have the world's news providers back them up for me for 2,000+ days (over 5 years), in nice air-conditioned data centers, located in geographically diverse areas?
I came up with two major problems: it was an abuse of the system, and they (the news providers) would pull my articles, destroying my backups. Well, interestingly, neither of those holds water. The current USENET is itself a perversion (evolution?) of the original USENET as intended (for sharing text discussions). And, since the news providers are acting as “common carriers,” they wouldn't really be in a position to monitor my posts, let alone pull them. And finally, there's even a newsgroup called alt.binaries.backups :-)
So, I came up with the idea of posting backups to the USENET.
There are a few technical issues to be solved:
The simplest way to make your posts secure is to encrypt them. I use AES-256 in CBC mode, with a random block of data as the seed. First, I take whatever I'm going to store and put it into a tar file. This is an easy and elegant way of preserving file name, ownership, and permissions. Next, I bzip2 the contents, first because the tar file has a large header of mostly zero data, and second because the data itself might in fact be compressible. (I test whether the compression has in fact saved any space; if not, I throw the compressed version away and use the uncompressed one.) Then, I add 64 bytes of random garbage in front of the (possibly compressed) tar file. This is intended to prevent “known plaintext” attacks. Finally, I encrypt the result in AES-256 CBC mode with a secure password.
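As a rough illustration of the tar, bzip2, and random-prefix steps, here is a minimal Python sketch. It is not the tooling actually used for this system: the names build_plaintext and recover_archive are made up for the example, and the AES-256 CBC step is deliberately left out, since it needs a cryptography library outside the standard library.

```python
import bz2
import io
import os
import tarfile

PREFIX_LEN = 64  # bytes of random padding, as described in the text


def build_plaintext(paths):
    """Tar the given paths, bzip2 the archive if that saves space,
    and prepend random bytes. Returns (payload, compressed_flag).
    The payload is what would then be handed to the encryption step."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path in paths:
            tar.add(path)  # keeps name, ownership, and permissions
    raw = buf.getvalue()

    packed = bz2.compress(raw)
    compressed = len(packed) < len(raw)  # keep bzip2 output only if smaller
    body = packed if compressed else raw

    return os.urandom(PREFIX_LEN) + body, compressed


def recover_archive(payload, compressed):
    """Undo build_plaintext: drop the random prefix, then decompress."""
    body = payload[PREFIX_LEN:]
    return bz2.decompress(body) if compressed else body
```

The compressed flag has to be remembered somewhere (in the metadata, for instance) so that the restore side knows whether to decompress.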
Another aspect of security is what to call the encrypted file when you post it. Obviously a subject line of “Rob Krten's Backup of /etc/passwd” is a bad idea.
Hashes to the rescue! I take the full pathname of the file being backed up, add a known “initialization vector” to the front of it, and compute a SHA-1 hash of the result. This means that instead of the “obvious” filename I showed above, it'll be more like d030eed9d9fbc496968d926a4ab1416fe2ebcb66, which is useless to anyone except me (and useless to me if I forget what initialization vector I used, or what the full path of the file I'm looking for is. LOL!)
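The naming scheme can be sketched in a few lines of Python. This is an illustration under assumptions, not the original code: the function name obscured_name is hypothetical, and simple string concatenation with UTF-8 encoding is just one plausible way to combine the secret prefix with the path.

```python
import hashlib


def obscured_name(secret_prefix: str, full_path: str) -> str:
    """Return the SHA-1 hex digest of the secret prefix followed by the path.

    The same prefix and path always give the same 40-character name,
    so the name can be recomputed later instead of being stored.
    """
    material = (secret_prefix + full_path).encode("utf-8")
    return hashlib.sha1(material).hexdigest()
```

Calling obscured_name("my secret prefix", "/etc/passwd") yields a 40-character hex string; changing either argument changes the result.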
How do I ensure that the data isn't corrupted after posting? Since I can't necessarily post a single article that's really big (say anything over 1 MB in size), I need to break it up into parts, and post the parts individually. What if one of those parts gets corrupted, or lost, or ... ? There's a wonderful tool out there called PAR2, which generates recovery blocks for a data set using Reed-Solomon Error Correction (q.v.). Along with my 100 MB backup file, I can also post some additional amount of redundant data. This way, if anything gets damaged, I can use the redundant data to recover my original files.
Another aspect of reliability is, “is my backup intact?” You'd be amazed at how many people and companies “back up” their data, only to find that, at the worst possible point in time, the backups are unreadable. It would be funny if it weren't so sad.
I solve this by periodically downloading the data and checking that it matches what I expect to be there. This can be easily automated.
Finally, we need to talk about getting the data back in case of disaster. It's fine that we periodically checked it to make sure that it wasn't corrupted, and that it was still present, but retrieving a large data set in case of ultimate disaster can be difficult, especially if the metadata about the data set is gone (as would be the case, for example, if your disk was toasted, and it included the metadata).
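The splitting step might look something like the following Python sketch. The part size, the dictionary layout, and the function names are assumptions made for the example. The per-part SHA-256 digest only detects damage; actually repairing a damaged part is the job of the PAR2 recovery data, which is generated by the external tool and is not shown here.

```python
import hashlib

PART_SIZE = 512 * 1024  # assumed part size, comfortably under 1 MB


def split_into_parts(data: bytes, part_size: int = PART_SIZE):
    """Cut data into numbered parts, recording a digest for each one."""
    parts = []
    for offset in range(0, len(data), part_size):
        chunk = data[offset:offset + part_size]
        parts.append({
            "number": len(parts) + 1,  # 1-based part number
            "sha256": hashlib.sha256(chunk).hexdigest(),
            "data": chunk,
        })
    return parts


def join_parts(parts):
    """Reassemble parts in order, refusing any part that fails its digest."""
    ordered = sorted(parts, key=lambda part: part["number"])
    for part in ordered:
        if hashlib.sha256(part["data"]).hexdigest() != part["sha256"]:
            raise ValueError(f"part {part['number']} is damaged")
    return b"".join(part["data"] for part in ordered)
```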
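One way such a periodic check could be automated is sketched below in Python. Everything here is an assumption for illustration: verify_backup is a hypothetical name, the expected digests are assumed to have been recorded at posting time, and fetch stands in for whatever code actually downloads an article from the news server, returning None when the article is gone.

```python
import hashlib


def verify_backup(expected, fetch):
    """Check every posted article against the digest recorded at post time.

    expected: dict mapping message ID -> SHA-256 hex digest
    fetch:    callable taking a message ID, returning bytes or None
    Returns a list of (message_id, problem) tuples; empty means all is well.
    """
    problems = []
    for message_id, digest in expected.items():
        data = fetch(message_id)
        if data is None:
            problems.append((message_id, "missing"))
        elif hashlib.sha256(data).hexdigest() != digest:
            problems.append((message_id, "corrupt"))
    return problems
```

Run from a scheduled job, a non-empty result is the signal to repost the affected parts while the local copy still exists.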
An obvious solution is to mail the metadata to your Gmail account: it's not that much data, is very compressible, and the mailing can be automated along with your backup.
Alternatively, drop the metadata in with the backup set to USENET as well, and just be damn sure that you remember the message ID so that you are able to retrieve it later. (Or mail yourself just the metadata message IDs to your Gmail account.) Of course, I'd also email myself the source code for stitching all the stuff back together, to cut down on the amount of work needed in case of disaster.
There are two use cases I see for backups. One is for disaster recovery, where you want to ensure that some set of data is periodically updated on the web. The second is for live monitoring, such as data from cameras around the premises. The two differ slightly in implementation, so it's worth talking about that for a bit.
The periodic backup use case presents as a complete data set that needs to be stored somewhere. All of the inputs are known, and it's just a matter of encrypting the data, computing the recovery data set, and splitting the original file into postable-sized pieces. An NZB file is a good container for holding all of these bits together, and is in fact the standard way that such things are done.
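For reference, a minimal NZB index for one posted file could be generated along these lines. This is a sketch based on the commonly used NZB layout (a file element with poster, date, and subject attributes, a list of groups, and numbered segments whose text is the message ID); the function name build_nzb and the sample values in the usage note are made up.

```python
import xml.etree.ElementTree as ET

NZB_NS = "http://www.newzbin.com/DTD/2003/nzb"


def build_nzb(subject, poster, date, group, segments):
    """Build an NZB document describing one file.

    date:     posting time as a Unix timestamp
    segments: list of (message_id, byte_count), in posting order
    """
    root = ET.Element("nzb", xmlns=NZB_NS)
    file_el = ET.SubElement(root, "file",
                            poster=poster, date=str(date), subject=subject)
    groups_el = ET.SubElement(file_el, "groups")
    ET.SubElement(groups_el, "group").text = group
    segments_el = ET.SubElement(file_el, "segments")
    for number, (message_id, size) in enumerate(segments, start=1):
        segment = ET.SubElement(segments_el, "segment",
                                bytes=str(size), number=str(number))
        segment.text = message_id  # message ID without angle brackets
    return ET.tostring(root, encoding="unicode")
```

For example, build_nzb("d030eed9d9fbc496968d926a4ab1416fe2ebcb66", "backup@example.invalid", 1402617600, "alt.binaries.backups", [("part1@example.invalid", 500000)]) returns the XML as a string, ready to be saved or mailed.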
The problem with streaming backups is that you may never get a chance to finish them! For example, consider the premises monitoring camera case. You have the camera because you are trying to catch an image of the burglar that's coming to steal your system :-) It is not a good idea to wait until midnight, tar up all the images from that day, and post them as if it were a periodic backup. Your system will have been sold for parts long before that time!
Therefore, what you need to do is continuously post images as they come from the camera. The main trick here is coming up with a naming convention for the message IDs so that you can retrieve them even in the absence of a completed metadata “index” file (like the one we used in the periodic backup case, above). Naming your message IDs to include a sequence number counted from the start of the day, such as 2014-06-13-IMG-0012-Front.jpg or something like it, means that you can linearly search through the newsgroups by “fishing” for the images you want. (Using the complete date and time as the image name would be harder, as you'd need to know when each image was taken, and if you use motion-sensitive equipment, you won't necessarily have a continuous stream of images. Or, if you generate lots of images per second, and the timestamps are down to the millisecond, you'd potentially have to attempt to retrieve 1,000 messages for each second of footage in order to have the complete snapshot of the data set. Etc.)
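The “fishing” idea can be sketched in Python as follows. The details are assumptions for illustration: the domain part of the message ID, numbering from 1, and the rule of giving up after a few consecutive misses are all invented here, and fetch again stands in for the code that actually asks the news server for an article, returning None when it does not exist.

```python
import datetime


def image_message_id(day, sequence, camera, domain="backup.example.invalid"):
    """Build a predictable message ID from the day, a counter, and a camera name."""
    return f"{day.isoformat()}-IMG-{sequence:04d}-{camera}.jpg@{domain}"


def fish_for_images(day, camera, fetch, max_misses=3):
    """Walk the sequence numbers for one day and collect what still exists.

    Stops after max_misses consecutive missing articles, so a few lost
    posts in the middle of the day do not end the search early.
    """
    found = []
    sequence, misses = 1, 0
    while misses < max_misses:
        message_id = image_message_id(day, sequence, camera)
        data = fetch(message_id)
        if data is None:
            misses += 1
        else:
            misses = 0
            found.append((message_id, data))
        sequence += 1
    return found
```

Because the IDs are computable from the date alone, recovery needs nothing from the stolen machine except the naming convention itself.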
Yes, you could then, at midnight, construct a metadata archive that indicates all of that day's images — at that point, you've reduced the problem to the periodic backup case as above (because your machine didn't end up getting stolen that night after all — tomorrow is another day, however).
I hope this rambling piece has given you some ideas on how to perform non-scalable, anti-social backups :-)