A couple of weeks ago, my laptop’s hard drive died and I ended up purchasing a new one. I hadn’t taken a backup in about 3 weeks, so I lost some progress on my research. This got me thinking about having a backup method with less friction than my current one.
My existing backup routine consists of plugging in my external hard drive, mounting it, running my backup script, taking a full image of my boot drive, and unmounting the drive. It’s not entirely awful (most of it is automated, so it’s mostly just waiting), but there’s still friction in the initial steps of plugging in the drive and mounting it. Additionally, taking a full system image means keeping a read/write LVM snapshot around, which becomes a problem if it fills up completely. I therefore need to sit and wait for the system image to finish: if I leave it for too long, writes to the snapshot accumulate, the snapshot fills up, and bad things happen.
I have long resisted backing up to the cloud because of the inherent privacy and security issues. As the saying goes, the “cloud” is just someone else’s computer, and storing your files there gives its owner access should they want it. You’re also at their mercy regarding security practices: if they’re breached, so is your data. At the same time, the golden rule for backups is 3-2-1: 3 total copies of your data, 2 on-site and 1 off-site. The 2 on-site copies are satisfied by my laptop and external hard drive, but an off-site copy was not something I had ever seriously considered.
After this latest incident, I decided to investigate inexpensive cloud storage providers — given that my biggest usage will be storage (and I hope I never have to download the data), I don’t mind some tiny charges on downloads and other transactions. I found Backblaze B2, which offers storage for $0.005/GB/month, or $5/TB/month, after the first 10 GB free. While there are other charges (for example, $0.01/GB download charges after 1 GB free per day), my main concern is storage charges.
My next concern was privacy and security. Unlike their “Personal Backup” solution (which offers client-side encryption), B2 is just a set of servers waiting to receive files. They offer extensive APIs to allow third-party software to integrate with it, and I’ll come back to that later. Fundamentally, though, I realized that if I was going to have client-side encryption, I’d have to deal with it myself.
My main requirements were as follows:
- Encryption with GPG — I already use many other tools that integrate seamlessly with GPG, and managing yet another set of keys simply for retrieving my backups is impractical.
- Obfuscation of file and directory structure with a way to restore it — Just as important as the file contents is the file metadata — how large it is, what type of file it is, how many other files are like it in that directory, and so on. Even if the filenames and directory names are obfuscated, if the directory hierarchy is left intact, it leaks tons of metadata.
- Compression — This one should be obvious — I’m paying per GB, so I better compress stuff to minimize storage costs.
- The state of the backup should be recoverable — I should be able to download files from my last backup and continue my backup where I left off.
There are many different backup programs which fit some of these requirements, but none seemed to fit all. Some encrypt file and directory names, but leave the hierarchy intact. Others use GPG and tar to effectively obscure the file and directory structure, but I’d still have to manage the names of those tar archives to obfuscate them while still allowing me to determine which directory it is a backup of.
After looking for a bit, I decided to roll my own script to deal with this problem. The program is called “bkup” and can be found at https://gitlab.com/chiraag-nataraj/bkup. The design is as follows:
- When you pass a directory to bkup to back up, it first resolves the path of the directory relative to $HOME and creates that directory hierarchy under ~/.local/share/bkup/ and ~/.cache/bkup/. Let’s call the first directory $datadir and the second $cachedir for short.
- It then generates a salt ($salt) and writes it to $datadir/.salt.gpg. At the same time, it writes the value of $datadir to $datadir/.name.gpg. Both of these files are also hard linked to $cachedir.
- The encrypted name of the archive is derived from the SHA1 sum of $datadir and $salt. This means that once the salt has been written, and as long as it is not erased, the encrypted name of the archive remains stable. At the same time, one can always change the encrypted name simply by deleting $datadir/.salt.gpg.
- The program then proceeds to generate the archive. Since this is the first time the program is being run on this directory, it finds all files modified after the epoch and adds them to one giant tar archive. This tar archive is then passed through zstd (a compressor) and subsequently split into 50 MB chunks (configurable via a variable in the script). Finally, each chunk is encrypted before being written out to $cachedir/$hash-$date.tar.zst.nnnnnn.gpg, where nnnnnn is a six-digit suffix indicating the chunk number.
- Once the backup finishes successfully (which can take a long time during the initial run), it writes out the date and time at which the backup started to $datadir/.date.gpg and hard links it to $cachedir/.date.gpg.
- On subsequent runs, bkup will read the salt from $datadir/.salt.gpg and the date of the last backup from $datadir/.date.gpg. It will then only look for files modified after that date and time, ensuring that backups after the first one are incremental.
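The steps above can be sketched in shell. This is a simplified illustration, not bkup’s actual code: the target path and the KEYID placeholder are assumptions, and the tar/zstd/split/gpg pipeline is shown in comments rather than executed.

```shell
#!/bin/sh
# Simplified sketch of bkup's naming and incremental-selection logic.
# Paths and the KEYID placeholder are illustrative, not the real script.

datadir="$HOME/.local/share/bkup/documents"   # hypothetical target

# Generate a salt (first run only; later runs would read .salt.gpg).
salt="$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')"

# The archive name is the SHA1 sum of $datadir and $salt, so it stays
# stable until the salt file is deleted.
hash="$(printf '%s%s' "$datadir" "$salt" | sha1sum | cut -d ' ' -f 1)"
date="$(date '+%Y-%m-%d-%H%M%S')"

# First run: back up everything modified since the epoch. Later runs
# would read the saved timestamp out of .date.gpg instead.
since='1970-01-01 00:00:00'

# The archive pipeline would then look roughly like this (not run here):
#   find "$HOME/documents" -type f -newermt "$since" -print0 \
#     | tar --null -T - -cf - \
#     | zstd \
#     | split -b 50M -d -a 6 \
#         --filter='gpg -e -r KEYID -o "$FILE.gpg"' \
#         - "$hash-$date.tar.zst."
echo "$hash"
```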
The design of this program ensures that once you complete an initial backup and push it to the cloud, you never have to do a full backup again. You can merely download the .date.gpg, .salt.gpg, and .name.gpg files from your backup to determine which hash corresponds to which directory, recreate the proper hierarchy within ~/.local/share/bkup, and you’re good to go. This is especially important when you pay to download large quantities of files (as is the case with B2) – being able to minimize the amount of data you need to download in order to figure out which folder is what is quite convenient.
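To make that recovery step concrete, here is a hedged sketch of rebuilding the local state from the downloaded metadata files. It is not bkup’s code: the staging-directory layout (one subdirectory per hash) and the DECRYPT_CMD override are hypothetical, and a real run would simply call gpg.

```shell
#!/bin/sh
# Hedged sketch of recovering backup state from downloaded metadata.
# DECRYPT_CMD is a hypothetical hook for illustration; in practice this
# would just be gpg. The staging-directory layout is also an assumption:
# one subdirectory per hash, each holding the three metadata files.
decrypt="${DECRYPT_CMD:-gpg --quiet --decrypt}"

recover_state() {    # $1 = staging directory of downloaded hash dirs
    for d in "$1"/*/; do
        # .name.gpg decrypts to the original $datadir path.
        name="$($decrypt "$d.name.gpg")"
        mkdir -p "$name"
        cp "$d.name.gpg" "$d.salt.gpg" "$d.date.gpg" "$name/"
    done
}
```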
The syntax and an example run are provided in the README of the GitLab repository, which should hopefully clear up any confusion.
I have already started using this program and am in the middle of my initial backup to B2. Once that is done, I will probably set up automatic incremental backups and push them to the cloud with rclone (an amazing piece of software).
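For the automated version, something like the following crontab entry could work. The schedule, the “b2remote” remote name, and the bucket path are placeholders, not a tested configuration.

```crontab
# Run an incremental backup of ~/documents nightly at 02:30, then push
# the encrypted chunks to B2 with rclone. "b2remote" and "mybucket"
# are placeholder names for an rclone remote and its bucket.
30 2 * * * bkup ~/documents && rclone sync ~/.cache/bkup b2remote:mybucket/bkup
```

Since bkup only ever emits new chunk files plus the small metadata files, rclone sync has very little to re-upload on each run.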