Answer: Amanda and Rsync.
You work in a typical IT environment with lots of machines, lots of apps and lots of data. While the NAS and SAN vendors may paint a rosy picture of all data residing on centralized data storage, the reality in most environments is lots of server-attached storage -- in other words, disks installed inside machines.
However, even though the data resides locally, that doesn't mean it's not important. It might be the corporate e-mail, or the accounting system, or perhaps the ERP database. All of it is critical, and all of it needs to be backed up.
The decentralized nature of the data makes backing it up a real challenge. System administrators can end up shuttling back and forth between machines doing individual backups -- kind of an updated version of the old pre-networking days of sneakernet.
Obviously, there must be a better way. The ideal would be to back up each machine, automatically, to a centralized place where all the data could be placed onto disk for quick access backup and also, if desired, backed up onto tape for permanent offsite storage.
There are a couple of open source products that can really bring this vision to life: Amanda and Rsync. While somewhat different, they both will take data from remote machines and bring it to a central location for convenient backup storage. Both are relatively straightforward to install and can help reduce the administrative burden of backup.
Originally developed at the University of Maryland, Amanda is a client-server system that backs data up to a central location and can even write it out to tape. "Amanda" is an acronym for "Advanced Maryland Automatic Network Disk Archiver."
Amanda has a clever backup algorithm with which it can do both full and incremental backups -- including partial full backups (I know it sounds like an oxymoron), in which only some of the file systems receive a full backup on a given run while the rest receive incrementals -- thus reducing the network load of a complete full backup. If this scheme sounds too confusing, Amanda can be configured to use the more common standard full/incremental scheme.
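To make the idea of spreading full backups concrete, here is a minimal Python sketch. It is not Amanda's planner, which is considerably more sophisticated; it simply assigns each file system one full backup per cycle in round-robin order and incrementals on the other days. The function name, the host:path labels and the four-day cycle are all made up for illustration.

```python
def plan_cycle(filesystems, cycle_days):
    """Return one plan per day, mapping each file system to 'full' or 'incremental'.

    Simplified illustration only: every file system gets exactly one full
    backup per cycle, on a day chosen round-robin by its position in the list.
    """
    schedule = []
    for day in range(cycle_days):
        plan = {}
        for index, fs in enumerate(filesystems):
            plan[fs] = "full" if index % cycle_days == day else "incremental"
        schedule.append(plan)
    return schedule


if __name__ == "__main__":
    # Hypothetical host:path entries, invented for this example.
    disks = ["mail:/var/spool", "db:/var/lib/db", "web:/srv/www", "files:/home"]
    for day, plan in enumerate(plan_cycle(disks, cycle_days=4), start=1):
        fulls = [fs for fs, level in plan.items() if level == "full"]
        print(f"day {day}: full backup of {fulls}, incrementals for the rest")
```

On any single night only a fraction of the data travels as a full backup, which is the point of the scheme.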
Installing Amanda requires software to be installed on the central server (the location where the data will be stored, whether disk or tape) as well as on each client machine. The server contains the very important files amanda.conf, disklist and tapelist. Amanda.conf is, as you might guess, the location of the overall configuration of the system, while the other two files list what resources are to be backed up and what tapes the data should be written to. Amanda.conf is a little too complicated to discuss in an article of this length, but there are a number of very good tutorials available on the Web.
The clients are called by the server in a polling fashion, according to the scheduling information contained in amanda.conf. Each client is called according to the configuration (multiple polling sessions can run simultaneously), and its data is written to the appropriate location as defined in the tapelist file.
Amanda is most useful in an environment with lots of data located on fairly static machines. A data center is the ideal environment for Amanda, particularly as it does not support Windows machines. It is likely that if you wanted to fool around with Cygwin, you could get Amanda to back up Windows machines as well. Overall, Amanda is a great choice if you want to ensure your data center is backed up to tape without an expensive software purchase.
In contrast to Amanda's focus on supporting tape backup, rsync is focused on syncing data from one disk location to another. It was created by Andrew Tridgell, a member of Samba's core team.
Rather than using a full/incremental scheme like Amanda, rsync runs a full backup each time it performs synchronization. That may seem wasteful, but rsync cleverly forwards only the changed portions of files, so it is actually very lightweight. Rsync ordinarily uses SSH as its transfer protocol, so the data is safe in transit -- making it ideal for syncing data to a remote machine outside the firewall -- thereby providing offsite backup.
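The idea of sending only what changed can be sketched in a few lines of Python. This is a simplified illustration, not rsync's actual implementation: real rsync pairs a cheap rolling checksum with a stronger hash so it can slide through a file efficiently, whereas the sketch below simply hashes every window with MD5. The function names and the tiny block size are assumptions chosen to keep the example readable.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


def block_signatures(old: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Map the checksum of each full-size block of the receiver's copy to its offset."""
    signatures = {}
    for offset in range(0, len(old), block_size):
        block = old[offset:offset + block_size]
        if len(block) == block_size:
            signatures.setdefault(hashlib.md5(block).hexdigest(), offset)
    return signatures


def make_delta(new: bytes, signatures: dict, block_size: int = BLOCK_SIZE) -> list:
    """Describe `new` as ('copy', offset) references to old blocks plus ('data', bytes) literals."""
    delta = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        window = new[pos:pos + block_size]
        offset = None
        if len(window) == block_size:
            offset = signatures.get(hashlib.md5(window).hexdigest())
        if offset is not None:
            # The receiver already has this block: flush pending literals, then reference it.
            if literal:
                delta.append(("data", bytes(literal)))
                literal.clear()
            delta.append(("copy", offset))
            pos += block_size
        else:
            # No match at this position: this byte must travel over the wire.
            literal.append(new[pos])
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file on the receiving side from the old copy and the delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value:value + block_size]
        else:
            out += value
    return bytes(out)


if __name__ == "__main__":
    old = b"the quick brown fox jumps"
    new = b"the quick red fox jumps"
    delta = make_delta(new, block_signatures(old))
    sent = sum(len(value) for kind, value in delta if kind == "data")
    print(f"literal bytes sent: {sent} of {len(new)}")
    print("rebuilt correctly:", apply_delta(old, delta) == new)
```

Only the literal data and the short block references cross the network; everything else is reassembled from the copy the receiver already holds.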
Since the popular rsync is included in Linux distros, you can avoid the installation process. The most typical configuration of rsync operates in a client/server setup: The client machines contact the rsync server, which makes rsync a very good choice for dynamic environments. For example, rsync is a very good choice for backing up laptops that connect to the network intermittently. Of course, rsync can be configured to work in a polling fashion as well; in fact, it can be configured to work in a two-way fashion, enabling two machines to back up one another.
Best of all, rsync is very straightforward to implement. A simple configuration file indicates which files should be backed up and the location to which they should be backed up.
A sample configuration for Rsync, taken from Michael Holve's Everything Linux site, is as follows:
motd file = /etc/rsyncd.motd
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsync.lock

[simple_path_name]
   path = /rsync_files_here
   comment = My Very Own Rsync Server
   uid = nobody
   gid = nobody
   read only = no
   list = yes
   auth users = username
   secrets file = /etc/rsyncd.scrt
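The first four settings are housekeeping: they tell rsync where to find its message-of-the-day file and where to write its log, PID and lock files. The simple path name in square brackets is a nickname or shorthand name for a particular set of backups. The path is the directory on the server where that set of files is stored. The remainder of the configuration covers permissions and security: the user and group the transfer runs as, whether clients may write to the location, whether it appears in listings, and which users may connect, with their passwords kept in the secrets file. For those of you familiar with Samba, the resemblance in configuration style is quite obvious.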
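On the client side, a backup to this server is a single rsync command. The Python sketch below builds that command for the sample configuration above; the host name, source directory and password file location are assumptions made up for illustration, while "simple_path_name" and "username" come from the sample. The double colon in the destination tells rsync to talk to an rsync server rather than a remote shell.

```python
import subprocess


def build_push_command(source_dir, server, module, user, password_file):
    """Build an rsync command that uploads source_dir to a module on an rsync server."""
    return [
        "rsync",
        "-av",                                # archive mode, verbose output
        f"--password-file={password_file}",   # file holding the password for `user`
        source_dir.rstrip("/") + "/",         # trailing slash: send the directory's contents
        f"{user}@{server}::{module}/",        # double colon selects a server-side module
    ]


def run_backup(command):
    """Run the transfer and raise CalledProcessError if rsync reports failure."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical values: adjust the host, directory and password file to your setup.
    command = build_push_command(
        source_dir="/home/alice/documents",
        server="backup.example.com",
        module="simple_path_name",
        user="username",
        password_file="/home/alice/.rsync_pass",
    )
    print(" ".join(command))
    # run_backup(command)  # uncomment once the values point at a real server
```

Scheduling this script from cron on each client gives the client-initiated setup described earlier, which suits machines that come and go.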
That's it! With this information you're ready to back up. As you can see, rsync is powerful but easy to use. Rsync can even operate on Windows if you install Cygwin. There's an even easier way to get Rsync up and running on Windows: a combined Rsync and Cygwin implementation called cwRsync is available. It comes ready to implement with an example batch file that can be modified for your circumstances.
Amanda and Rsync: Backup heaven
With the availability of these open source products, there is no excuse for time-wasting individual backups -- or, worse yet, not performing backups. Rsync is the easier of the two to get up and running, but neither implementation is really onerous. One last warning: The reason to back up is so you'll be able to restore your data if something terrible should happen. Be sure to test your backups occasionally to ensure that the data can be recovered. Otherwise, all you've got is a big pile of write-only bits.
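One simple way to test a backup is to restore it to a scratch directory and compare checksums against the live data. The Python sketch below does exactly that; the function names are made up for illustration, and it assumes both trees are reachable as local paths and that the live data has not changed since the backup ran.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> dict:
    """Map each file's path relative to root onto its digest."""
    return {
        str(path.relative_to(root)): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_trees(original: Path, restored: Path) -> list:
    """Return the relative paths that are missing from, or differ in, the restored tree."""
    want = tree_digests(original)
    got = tree_digests(restored)
    return sorted(name for name, digest in want.items() if got.get(name) != digest)
```

An empty list from compare_trees means every file in the original tree came back intact; anything else names the files to investigate.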
Bernard Golden is CEO of Navica Inc., a systems integrator based in San Carlos, Calif. He is the author of Succeeding with Open Source (Addison-Wesley) and the creator of the Open Source Maturity Model, a formalized method of locating, assessing and implementing open source software.
This was first published in August 2005