Creating a mirror, local or remote, efficiently

A real backup should still be usable when your file system, your entire disk, your entire computer, or (God forbid) your entire building goes down, or up in flames :-) This means the backup should ideally be remote. (Again, I'm not counting tape backup systems and suchlike here -- this is for individual users, as I said at the start of the first article.)

The first real tool we will discuss is called rsync, which is a mirroring tool. Usually, rsync mirrors are created on another machine, over a secure channel like ssh. If you are mirroring to the same machine, it should at least be on a different hard disk in order to get some protection from hardware failures. However, even mirroring to the same hard disk is of some use -- at least it helps against human error or a software bug wiping out some file.

rsync is extremely efficient in its use of network bandwidth. See the additional notes section at the bottom of this article for details.

How is it used?

(The following assumes a directory called sources, sitting in your current directory on the local machine, which is being mirrored to your home directory on the remote machine.)
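
As a rough illustration of that setup, here is a minimal Python sketch that builds one plausible rsync command line. The remote name user@backuphost and the helper build_rsync_command are made up for this example, and the flag selection (-a archive mode, -z compression, --delete to remove files that no longer exist locally, -e ssh for the transport, --dry-run to preview) is just one reasonable choice, not necessarily the invocation this article originally showed.

```python
import shlex


def build_rsync_command(source="sources", remote="user@backuphost", dry_run=True):
    """Build an rsync argument list that mirrors `source` into the remote home directory.

    `remote` is a hypothetical placeholder; substitute your own user and host.
    """
    cmd = ["rsync", "-az", "--delete", "-e", "ssh"]
    if dry_run:
        # Only report what would be transferred or deleted; change nothing.
        cmd.append("--dry-run")
    # No trailing slash on the source: rsync creates "sources" inside the target.
    # A bare "host:" destination means the remote user's home directory.
    cmd += [source, f"{remote}:"]
    return cmd


# Print the command so it can be inspected before anything is run.
print(shlex.join(build_rsync_command()))
```

Once the dry run looks right, the same list built with dry_run=False can be handed to subprocess.run(cmd, check=True). Be careful with --delete: it makes the remote copy an exact mirror, so anything removed locally is removed there too.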

When is it useful?

What are the downsides?

Additional notes -- rsync's algorithm

The rsync algorithm is an absolutely wonderful piece of work, and is actually part of Andrew Tridgell's (of the Samba team) PhD thesis. It's one of those algorithms that give you a warm glow of satisfaction when you read and understand it!

Just think about this: you have a large file locally, and an older version of the same file on a remote machine. rsync manages to send only the changes (plus a little administrative overhead) without having both copies in one place to actually do a "diff" operation!

What it's really doing is a "diff" of two files on different computers, without completely copying either of the files to the other side! Until you read the algorithm, this doesn't even seem possible :-)
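
To make the idea a little more concrete, here is a toy Python sketch of the general technique: the receiver checksums fixed-size blocks of its old file, the sender slides a cheap rolling checksum over the new file to find those blocks at any offset, and only the unmatched bytes travel as literal data. This is a simplification under stated assumptions, not rsync's actual code: the 4-byte block size, the function names, and the use of SHA-256 as the strong checksum are all choices made for this illustration.

```python
import hashlib

BLOCK = 4        # tiny block size for the demo; real tools use far larger blocks
MOD = 1 << 16


def weak_sum(data):
    """Cheap Adler-style checksum of a block, returned as two 16-bit halves."""
    a = sum(data) % MOD
    b = sum((len(data) - i) * byte for i, byte in enumerate(data)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, size):
    """Slide the checksum window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - size * out_byte + a) % MOD
    return a, b


def signatures(old):
    """Receiver side: map weak checksum -> [(strong hash, block index)]."""
    sigs = {}
    for index in range(len(old) // BLOCK):
        block = old[index * BLOCK:(index + 1) * BLOCK]
        entry = (hashlib.sha256(block).digest(), index)
        sigs.setdefault(weak_sum(block), []).append(entry)
    return sigs


def delta(new, sigs):
    """Sender side: describe `new` as references to old blocks plus literal bytes."""
    ops, literal, pos, window = [], bytearray(), 0, None
    while pos + BLOCK <= len(new):
        if window is None:
            window = weak_sum(new[pos:pos + BLOCK])
        match = None
        for strong, index in sigs.get(window, []):
            # The weak checksum can collide, so confirm with the strong hash.
            if hashlib.sha256(new[pos:pos + BLOCK]).digest() == strong:
                match = index
                break
        if match is not None:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += BLOCK
            window = None
        else:
            literal.append(new[pos])
            if pos + BLOCK < len(new):
                window = roll(*window, new[pos], new[pos + BLOCK], BLOCK)
            pos += 1
    literal += new[pos:]          # whatever is left is shorter than one block
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old, ops):
    """Receiver side: rebuild the new file from the old file and the delta."""
    out = bytearray()
    for kind, value in ops:
        out += old[value * BLOCK:(value + 1) * BLOCK] if kind == "copy" else value
    return bytes(out)


old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick brown cat jumps over the lazy dog"
ops = delta(new, signatures(old))
sent = sum(len(value) for kind, value in ops if kind == "data")
assert patch(old, ops) == new
print(f"literal bytes sent: {sent} of {len(new)}")
```

Notice that neither side ever holds both files: the receiver ships only small checksums, and the sender ships only block numbers and the bytes it could not match. The rolling update is what makes it affordable to test every byte offset of the new file, so a match is still found when an insertion or deletion shifts everything after it.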

As far as I know, no other algorithm comes close to its efficiency when you have large files with only small changes (that is, when the number of bytes changed is far, far smaller than the size of the file). As a result, almost all decent "online" backup programs now use this algorithm -- two of the three programs discussed next in this series (rdiff-backup and unison) certainly do. [The third one is an offline-capable backup solution, so it cannot use this algorithm anyway -- the rsync algorithm needs both the old and new versions to be online, even if they are not both on the same machine.]

Despite all this, I use rsync only when I want an exact mirror, with no extra files. For backup purposes, rdiff-backup (next article) is much better, because it not only gives you a mirror, but also maintains older versions as compressed reverse diffs, in an extra directory called rdiff-backup-data.

What next?

Well, the next article: Beyond rsync -- rdiff-backup