Hacking, Code & Open Source Reads

BorgBackup - Encrypted, Deduplicated Backups Done Right

Christian Lehnert · 2016-11-08 · ~7 min read

Backup is the part of your stack that everyone claims to take seriously and almost nobody actually does. The standard pattern — rsync -a to an external drive once a week, maybe — is not backup. It is file copying, and it has none of the properties that make backup actually useful when you need it: integrity verification, point-in-time recovery, encryption at rest, and the ability to keep many versions without paying linearly in disk.

BorgBackup, which reached 1.0 in March of this year and has been stable through the 1.0.x updates since, gives you all of those properties out of the box. This post covers what Borg actually does, why it matters, and how to deploy it on a Debian 8 (Jessie) host in roughly fifteen minutes.

What Borg does that rsync does not

Three things, all of them critical:

Deduplication. Borg splits files into variable-length chunks using a content-defined rolling hash (Buzhash). Identical chunks are stored exactly once across the entire repository, regardless of which archive contains them. This means that backing up the same 100 GB filesystem twenty times in a row consumes roughly 100 GB plus a small overhead per archive — not 2 TB. If a single 4 GB VM image changes by a few hundred megabytes between backups, only the changed chunks are written. The math collapses storage cost to a fraction of what naive snapshots demand.

Encryption. All data and metadata in a Borg repository can be encrypted with AES-256 in CTR mode, authenticated with HMAC-SHA256. The encryption is end-to-end: the repository server, if you push to a remote, never sees plaintext. Your backups are useless to anyone who steals the disk.

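Compression. Optional and selected per run with --compression: LZ4 (fast), zlib, or LZMA. Borg 1.0 stores chunks uncompressed unless you ask for one of these. Compression is applied per chunk after deduplication, so it stacks with dedup rather than competing with it.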

None of these are properties you can bolt onto rsync. Together they make Borg structurally a different kind of tool.
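To make the chunk, deduplicate, then compress pipeline concrete, here is a toy model in Python. It is a sketch under loud assumptions, not Borg's implementation: the window size, mask, minimum chunk size, and the `Repo` class are all made up for illustration, chunk IDs are plain SHA-256 (Borg keys its IDs), zlib stands in for whichever compressor you pick, and encryption is left out entirely.

```python
import hashlib
import random
import zlib

WINDOW = 48            # bytes covered by the rolling hash (illustrative value)
MASK = (1 << 11) - 1   # cut when the low 11 bits are zero: chunks of roughly 2 KiB
MIN_SIZE = 256         # never emit chunks smaller than this

# Buzhash needs a table of random 32-bit values, one per possible byte.
_table_rng = random.Random(0)
TABLE = [_table_rng.getrandbits(32) for _ in range(256)]


def _rotl(value, count):
    """Rotate a 32-bit value left by count bits."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


def split(data):
    """Split data at positions chosen by the content, not by fixed offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = _rotl(h, 1) ^ TABLE[byte]
        if i >= WINDOW:  # drop the byte that just left the window
            h ^= _rotl(TABLE[data[i - WINDOW]], WINDOW)
        if i + 1 - start >= MIN_SIZE and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class Repo:
    """Toy repository: every unique chunk is stored once, compressed."""

    def __init__(self):
        self.chunks = {}    # chunk id -> compressed chunk
        self.archives = {}  # archive name -> ordered list of chunk ids

    def create(self, name, data):
        ids = []
        for piece in split(data):
            chunk_id = hashlib.sha256(piece).hexdigest()
            if chunk_id not in self.chunks:   # dedup first ...
                self.chunks[chunk_id] = zlib.compress(piece)  # ... then compress
            ids.append(chunk_id)
        self.archives[name] = ids

    def extract(self, name):
        return b"".join(zlib.decompress(self.chunks[i]) for i in self.archives[name])

    def stored_bytes(self):
        return sum(len(blob) for blob in self.chunks.values())


# Two "nightly" backups of the same data, with a small insertion near the start.
words = [b"borg", b"backup", b"chunk", b"archive", b"repository", b"prune", b"key", b"hash"]
data_rng = random.Random(1)
original = b" ".join(data_rng.choice(words) for _ in range(40_000))
modified = original[:1000] + b"one small edit near the start" + original[1000:]

repo = Repo()
repo.create("monday", original)
chunks_after_monday = len(repo.chunks)
repo.create("tuesday", modified)
new_chunks = len(repo.chunks) - chunks_after_monday

print(len(repo.archives["tuesday"]), "chunks referenced,", new_chunks, "newly stored")
print(len(original) + len(modified), "bytes backed up,", repo.stored_bytes(), "bytes on disk")
```

The insertion shifts every later byte, which would invalidate every block of a fixed-size chunker. Because cut points here depend only on the last few dozen bytes of content, the boundaries line up again shortly after the edit and the second archive mostly references chunks that are already stored.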

Installation on Debian 8

Jessie itself ships only Attic, Borg's predecessor, but jessie-backports carries Borg 1.0.x, which is what you want:

sudo sh -c 'echo "deb http://ftp.debian.org/debian jessie-backports main" \
    > /etc/apt/sources.list.d/backports.list'
sudo apt-get update
sudo apt-get install -t jessie-backports borgbackup

Verify:

borg --version

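If you prefer to pin an exact upstream release, install from PyPI into a virtualenv (append ==<version> to the package name to pin it). The build needs a compiler and the Python, OpenSSL, ACL, and LZ4 headers:

sudo apt-get install python3-pip python3-venv python3-dev build-essential \
    libssl-dev libacl1-dev libacl1 liblz4-dev
python3 -m venv ~/borg-venv
~/borg-venv/bin/pip install borgbackup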

For most operators, the backports package is the right choice. It is upgraded through apt along with the rest of the system and integrates with the system Python.

Initializing a repository

A Borg repository is a directory containing all the data and metadata for one or more archives. Initialize once:

borg init --encryption=repokey /backup/repo

repokey stores the encryption key inside the repository itself, encrypted with a passphrase you choose at init time. The alternative, keyfile, stores the key on the client machine separately. repokey is simpler and is fine for most cases — the key is encrypted with your passphrase, so the repository alone is not enough to decrypt anything.

You will be prompted for a passphrase. Choose one that is long and stored somewhere you will not lose it, because if you lose this passphrase, your backups are unrecoverable. There is no recovery mechanism. This is a feature, not a bug — it is what makes the encryption meaningful — but it means you must treat the passphrase with the same seriousness as the data itself.
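One way to get a passphrase that is long enough is to let the machine generate it. A minimal sketch using Python's standard secrets module; the function name and the 32-byte default are illustrative choices, not anything Borg prescribes.

```python
import secrets


def generate_passphrase(nbytes=32):
    """Return a random URL-safe passphrase built from nbytes of OS randomness."""
    return secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    # Print once, then store it in a password manager or on paper in a safe place.
    print(generate_passphrase())
```

A generated passphrase is only useful if it is written down somewhere that survives the machine being backed up.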

Creating an archive

An archive is a single point-in-time snapshot stored in the repository:

borg create \
    -v --stats --progress \
    --compression lz4 \
    --exclude '/home/*/.cache' \
    --exclude '/var/www/*/cache' \
    /backup/repo::"{hostname}-{now:%Y-%m-%dT%H:%M:%S}" \
    /etc /home /var/www /root

The {now:...} placeholder is expanded by Borg at archive creation, so each run produces a uniquely named archive. The first run will write all the data. Subsequent runs will write only changed chunks, which on a typical home or small server is a few hundred megabytes per night even on a multi-hundred-gigabyte filesystem.
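The placeholder syntax behaves much like Python's own str.format, so you can preview an archive name without touching a repository. This is a sketch of the idea only: the `expand` helper, the host name web01, and the fixed timestamp are hypothetical, and Borg fills in the real values itself.

```python
from datetime import datetime


def expand(template, hostname, now):
    """Expand an archive-name template the way str.format would."""
    return template.format(hostname=hostname, now=now)


# A fixed timestamp keeps the example reproducible.
name = expand(
    "{hostname}-{now:%Y-%m-%dT%H:%M:%S}",
    hostname="web01",
    now=datetime(2016, 11, 8, 3, 0, 0),
)
print(name)  # web01-2016-11-08T03:00:00
```

Names built this way sort chronologically, which makes the output of a listing easy to scan.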

To see what's in the repository:

borg list /backup/repo
borg info /backup/repo::archive-name

To restore a single file:

mkdir /tmp/restore && cd /tmp/restore
borg extract /backup/repo::archive-name etc/nginx/nginx.conf

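Borg stores each path as it was given at backup time with the leading slash stripped, so /etc/nginx/nginx.conf is archived as etc/nginx/nginx.conf. Extract recreates that relative path under the current directory, which is why you cd into a scratch directory first.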
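The path rule is easy to get wrong on a first restore, so here it is as two small Python helpers. Both function names are hypothetical and only model the naming; they do not call Borg.

```python
import posixpath


def archived_path(original):
    """Path as it appears inside the archive: leading slashes removed."""
    return original.lstrip("/")


def extract_target(cwd, original):
    """Where the file lands when extract is run from the directory cwd."""
    return posixpath.join(cwd, archived_path(original))


print(archived_path("/etc/nginx/nginx.conf"))
print(extract_target("/tmp/restore", "/etc/nginx/nginx.conf"))
```

Restoring into a scratch directory and copying the file into place afterwards avoids overwriting the live copy by accident.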

Pruning — keeping the repository finite

A repository where you create an archive every night and never delete anything will eventually fill the disk. Borg's prune command applies retention policies and deletes archives that fall outside them:

borg prune --list /backup/repo \
    --keep-daily=7 \
    --keep-weekly=4 \
    --keep-monthly=12

This keeps the most recent archive from each of the last 7 days that have one, then from each of 4 further weeks, then from each of 12 further months. The rules are applied from daily to monthly, and an archive already kept by an earlier rule does not count toward a later one, so with enough history the repository settles at 23 archives, sliding through time. A younger repository simply holds fewer.
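Retention rules are easier to trust once you have watched them pick archives. The following Python sketch models the selection under assumptions taken from my reading of Borg's prune documentation: rules run in order from daily to monthly, each rule keeps the newest archive per period, and an archive kept by an earlier rule does not count toward a later one. The `prune` function and `RULES` table are illustrative, and archives are reduced to one date each.

```python
from datetime import date, timedelta

# (rule name, how many to keep, function mapping a date to its period)
RULES = [
    ("daily", 7, lambda d: (d.year, d.month, d.day)),
    ("weekly", 4, lambda d: d.isocalendar()[:2]),   # (ISO year, ISO week)
    ("monthly", 12, lambda d: (d.year, d.month)),
]


def prune(archive_dates, rules=RULES):
    """Return the dates that survive, oldest first."""
    kept = []
    for _name, count, period_of in rules:
        last_period = None
        kept_by_rule = 0
        for d in sorted(archive_dates, reverse=True):   # newest first
            if kept_by_rule == count:
                break
            period = period_of(d)
            if period != last_period:
                last_period = period
                # Only the newest archive of a period is a candidate; if an
                # earlier rule already kept it, this rule moves on.
                if d not in kept:
                    kept.append(d)
                    kept_by_rule += 1
    return sorted(kept)


# Two years of nightly archives ending on the day this post was written.
nightly = [date(2016, 11, 8) - timedelta(days=i) for i in range(730)]
survivors = prune(nightly)
print(len(survivors), "archives kept")
for d in survivors:
    print(d.isoformat())
```

Running the real command with --dry-run added is the safer way to check a policy against an actual repository before anything is deleted.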

On Borg 1.0 and 1.1 no further step is needed: prune frees the disk space itself when it commits. From Borg 1.2 onward, prune only marks data as deleted, and the space is reclaimed by a separate, explicit compact step:

borg compact /backup/repo

If you upgrade to 1.2 or later, add that command after prune in any automation.

Automation with systemd

A systemd timer is cleaner than cron and gives you proper journal logging. Two units in /etc/systemd/system/:

borg-backup.service:

[Unit]
Description=Borg backup
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
Environment="BORG_PASSPHRASE=YOUR_PASSPHRASE_HERE"
Environment="BORG_REPO=/backup/repo"
# With BORG_REPO set, "::name" refers to an archive in that repository.
# %H is the host name; %% is a literal percent sign in a unit file.
ExecStart=/usr/bin/borg create \
    --compression lz4 \
    ::%H-{now:%%Y-%%m-%%dT%%H:%%M:%%S} \
    /etc /home /var/www /root
ExecStartPost=/usr/bin/borg prune \
    --keep-daily=7 \
    --keep-weekly=4 \
    --keep-monthly=12

borg-backup.timer:

[Unit]
Description=Daily Borg backup

[Timer]
OnCalendar=daily
Persistent=true
RandomizedDelaySec=30min

[Install]
WantedBy=timers.target

Enable:

systemctl daemon-reload
systemctl enable borg-backup.timer
systemctl start borg-backup.timer
systemctl list-timers

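Persistent=true means a missed backup (laptop closed at the scheduled time, server rebooted) runs as soon as the system is back. RandomizedDelaySec jitters the start time, which matters if multiple machines back up to the same remote. It needs systemd 229 or newer, so Jessie's systemd 215 logs a warning and ignores that line. For the same reason, enable and start the timer as separate commands on Jessie: the combined enable --now form only arrived in systemd 220.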

Storing the passphrase in the systemd unit is a compromise. Unit files are world-readable by default, so restrict this one to root (chmod 600), and even then a sufficiently privileged attacker on the host can read it. The alternatives are BORG_PASSCOMMAND (Borg 1.1 and later) pointing at a script that retrieves the passphrase from a secrets manager, or running the backup as a dedicated user with a tightly-permissioned environment file. For most home and small-server setups, the unit-file approach is acceptable; for anything more sensitive, do better.
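The tightly-permissioned file idea can be sketched as a small Python wrapper that refuses to use a passphrase file other users could read. Everything here is an assumption for illustration: the helper names, the file location in the usage note, and the choice to pass the secret through BORG_PASSPHRASE in the child process environment.

```python
import os
import stat


def load_passphrase(path):
    """Return the passphrase in path, refusing files open to group or others."""
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must be accessible by its owner only")
    with open(path, encoding="utf-8") as handle:
        return handle.read().rstrip("\n")


def borg_environment(passphrase_file, repo):
    """Copy the current environment and add the variables Borg reads."""
    env = dict(os.environ)
    env["BORG_PASSPHRASE"] = load_passphrase(passphrase_file)
    env["BORG_REPO"] = repo
    return env
```

A backup script would then call something like subprocess.run(["borg", "create", "::name", "/etc"], env=borg_environment("/root/.borg-passphrase", "/backup/repo"), check=True), where the passphrase path is a hypothetical location. The secret still sits on the host, but it no longer lives in a unit file.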

Remote repositories over SSH

Borg can use any SSH-accessible host as a remote repository. On the remote, install the same Borg version (the protocol is not always backwards-compatible across major versions). Then:

borg init --encryption=repokey backup-user@backup-host:/backups/this-machine
borg create backup-user@backup-host:/backups/this-machine::archive-name /etc /home

Use a dedicated SSH key for the backup user, restrict it on the remote with a command= directive in authorized_keys to allow only Borg operations, and consider enabling Borg's append-only mode on the remote to prevent a compromised client from deleting historical archives:

command="borg serve --append-only --restrict-to-path /backups/this-machine",no-port-forwarding,no-X11-forwarding,no-pty ssh-ed25519 AAAA...

Append-only is critical if you take threat modeling seriously. Without it, an attacker who compromises the client machine can use its SSH key to destroy the historical record, and with the encryption passphrase can read your backups as well. With append-only, the server never removes existing data: a delete or prune sent by the client is only recorded as a new transaction, which you can roll back. The worst an attacker can do is fill the repository with garbage, which you can clean up later from a trusted client once you have checked what was changed.

What it does not do

Borg is not a synchronization tool. It does not replicate your repository to another location automatically; if your backup disk fails, your backups go with it. Treat the repository as data that itself needs to be replicated — to a second remote, to cloud object storage, to a tape rotation. The 3-2-1 rule (three copies, two media, one offsite) still applies. Borg is the first copy; you are responsible for the others.

Borg also does not version-control file metadata in the way a real version control system does. It snapshots filesystem state. If you need diff-tracking and change attribution at a per-file level over years, Borg is not a substitute for Git for the things Git is good at.

And Borg has historically had compatibility breaks between release series (1.0 to 1.1, 1.1 to 1.2) that can require manual upgrade steps. Read the upgrade notes before pulling new versions, and upgrade your client and server in lockstep.

The summary

Borg makes encrypted, deduplicated, compressed backups boring and routine. The setup is fifteen minutes. The ongoing cost is one systemd timer and, after the initial archive, disk growth that tracks how much data actually changes rather than how many archives you keep. There is no longer a credible reason to run backups any other way for a single-host or small-fleet environment.

If you do nothing else this week, set up Borg on the host you would most regret losing. The alternative — discovering during an actual incident that your "backups" were rsync to a disk that was sitting plugged in the whole time — is a kind of failure that is entirely avoidable, and entirely embarrassing when it happens.

Tagged:
#linux #backup #selfhosted #security