[lug] More on robust backups

Daniel Webb lists at danielwebb.us
Tue May 3 01:01:03 MDT 2005


[This is a draft of a document I'm writing on very robust backup.  I'm not an
expert; I expect many or most of the people on this forum know more about this
than I do.  I just wrote this down as a way to organize my thoughts and figure
out how to do things better.  Please correct or suggest improvements, I figure
I'll put this on my web site when I'm done.]

The "How data can get corrupted and what to do about it" HOWTO
==============================================================

Ways data can be corrupted:

  1 During read or write to disk
    (filesystem driver bug, hardware bug, etc)
  2 While sitting on the disk doing nothing
  3 Application bug or memory error 
    (cosmic ray, telekinesis, God hates you, etc)
  4 Malicious cracker deletes anything that's connected to the internet
    (or you accidentally do something like: "rm -rf .*")

Backup goals:

  1 Maximum one day lost for user data
  2 Recovery of all user data should be possible, regardless of the cause
    of corruption above
  3 Corruption of user or system files should be detected as soon as
    possible to avoid their spread[1]

To detect type 1 corruption, I'm guessing that the application would
need to write checksum information as it's writing the data.  From what
I can tell, all modern databases do this (please correct if this is
wrong).  I've definitely seen Subversion flip out if I change a bit in
the repository database with a hex editor.
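
For Subversion specifically, you can have it re-read the whole
repository and verify its internal checksums (the path is just an
example):

  svnadmin verify /var/svn/repos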

To detect type 2 corruption, you can store checksums along with mtime
timestamps, then check for files whose checksums have changed even
though their mtimes have not.  This kind of corruption will also often
show up as an "I/O error" or other system error when the file is read.
An even better approach would be to verify that all files with unchanged
mtimes still match the copies in the previous night's backup, but I
don't know of an archiver that has this feature.  Suggestions?
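
Here is the sort of thing I have in mind, as an untested sketch: record
a checksum and mtime for every file one night, then compare against the
previous night's list.  It assumes GNU find and chokes on filenames with
spaces.

  # Each line: MTIME CHECKSUM PATH
  find /home -type f -printf '%T@ ' -exec md5sum {} \; > sums.new

  # A file whose mtime is unchanged but whose checksum differs has been
  # silently corrupted.
  awk 'NR==FNR { mtime[$3]=$1; sum[$3]=$2; next }
       ($3 in mtime) && $1==mtime[$3] && $2!=sum[$3] { print "CORRUPT: " $3 }' \
      sums.old sums.new

  mv sums.new sums.old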

There's not really much to be done to prevent type 3 corruption, so if
you want to guarantee you won't lose anything, you have to use a
revision control system like Subversion, or an incremental mirror backup
(rdiff-backup[2] is the only one I know of).  That way you can go back
in time to before the corruption occurred and fix it manually.
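
With rdiff-backup, going back in time looks something like this (the
paths and the "1D" age are just examples):

  rdiff-backup --list-increments /mnt/backup/home
  rdiff-backup -r 1D /mnt/backup/home/thesis.tex /tmp/thesis.tex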

Type 4 corruption risk is reduced through good security practices, but
you can never really be sure, especially if you run network-facing
servers such as Apache or BIND.  For the scenario where someone out to
get you wipes all your filesystems, you need offline backups.

Online Backups
--------------

  Subversion and/or rdiff-backup
    
    I keep all my code in Subversion, and everything else, including the
    Subversion database dumps, is mirrored to another hard drive using
    rdiff-backup.  rdiff-backup is a mirroring incremental archiver.
    That means you have a true mirror, but at the same time you can go
    back to any previous version in the past because it stores a diff
    for each new revision of each file that changes.  This is sort of
    like using a revision control system, but the implementation is much
    simpler, it's more robust to damage, and it gives you a mirror.  
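
    A nightly cron job for this setup could be as simple as the
    following (paths are examples, not my real layout):

      #!/bin/sh
      # Dump the Subversion repository so the dump file gets mirrored too
      svnadmin dump -q /var/svn/repos > /home/backup/repos.dump
      # Mirror /home to the second drive, keeping increments
      rdiff-backup /home /mnt/backup/home
      # Optionally prune increments older than a year
      rdiff-backup --remove-older-than 1Y /mnt/backup/home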

  ???

    I need something to make sure that files that haven't changed are
    still good.  Suggestions?  Is something like Tripwire a good way to
    do this?

  Unison

    For large collections of files that don't change much, such as music
    and movies, Subversion and rdiff-backup are not good solutions.
    Unison works well for this kind of data, and has the added benefit
    that you can make changes to either side of the mirror. 
    Corruption is likely to be detected if you manually run unison and
    notice that files you didn't change want to be propagated to the
    other side.  It can't go back in time, but that's not really
    important for this type of data.
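
    Running it interactively is the whole point here; something like
    this (the roots are examples):

      unison /home/daniel/music /mnt/backup/music

    will list every change it wants to propagate and wait for
    confirmation, which is your chance to notice files changing that
    shouldn't be.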

Offline Backups
---------------

The best affordable ways to do this are to burn to CD or DVD, or buy a
USB drive enclosure and another hard drive (~$150 for 160 gig).[3]

Burning to CDs has the advantage that you don't have to worry about
whether older snapshots are still good, because every snapshot is
self-contained.  Then the question becomes how to make a CD backup as
robust as possible.

First, delete both tar and gzip from your computer.  They are the cause
of more misery around backups than should be allowed.  Because of the
way gzip works, a single corrupt bit near the beginning of a file
destroys the entire file.  If you have used tar -z, you can lose an
entire backup due to a single corrupt bit.  Why people do this anyway,
I'll never understand.  dump is the old-fashioned way to create a
reliable backup, but there are disadvantages[4].  I think the best way
is to use the cpio file format.  This format uses uncompressed, ASCII
text headers.  It turns a filesystem, which is a complex, hierarchical
data structure, into a clean linear stream that is easy to recover if
things don't go right.  
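
Creating and reading a cpio archive is simple enough (the "newc" format
is the portable ASCII-header one; paths are examples):

  cd /home && find . -depth -print | cpio -o -H newc > /tmp/home.cpio

  cpio -t < /tmp/home.cpio    # list the contents
  cpio -id < /tmp/home.cpio   # extract, creating directories as needed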

It is best not to use compression, because compression always amplifies
corruption loss.  If you must use compression, use bzip2 with the -1
option[5], combined with afio.  afio is an archiver that uses the cpio
format, but has the advantage that it can transparently compress files.
The difference between afio and tar with compression is critical: tar -j
creates the archive and then bzip2s the whole thing, while afio bzip2s
each file individually and adds it to the archive.  With tar -j, you're
likely to lose the whole archive past the first corruption, but with
afio you just lose the one file.
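
I'm going from memory on the afio options, so check the man page, but
per-file bzip2 -1 compression should look roughly like:

  find . -depth -print | afio -o -v -Z -P bzip2 -Q -1 /tmp/home.afio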

When you burn the cpio/afio archive, write it directly to the disc
without a filesystem.  In other words, just use cdrecord; don't pipe
mkisofs into cdrecord.  For a backup, a filesystem gains
you nothing.  When reading a CD with corruption, dd_rescue can be used
to recover everything except the corrupt areas of the CD.  Without a
filesystem, it's very straightforward to rescue a corrupt CD with a cpio
archive.  dd_rescue lets you know which parts of the file are corrupt,
and you can then chop those files out of the cpio archive.
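
Something like this is what I mean (the dev= value is an example; use
cdrecord -scanbus to find yours):

  # Burn the archive as a single raw data track, no filesystem
  cdrecord dev=/dev/cdrw -v -data /tmp/home.cpio

  # Later, if the disc develops errors, copy off everything readable,
  # then carve the damaged files out of the resulting cpio stream
  dd_rescue /dev/cdrom /tmp/rescued.cpio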

Things to add
-------------
- I need to learn how to use LVM snapshots (rough sketch below).
  Archiving a live read/write filesystem is probably right up there with
  tar -z on the dumb-scale, but that's what I've been doing.
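
  A snapshot-based run would presumably look something like this (the
  volume group, names, and sizes are made up):

    lvcreate --size 1G --snapshot --name homesnap /dev/vg0/home
    mount -o ro /dev/vg0/homesnap /mnt/snap
    # ...run rdiff-backup / cpio / afio against /mnt/snap instead of /home...
    umount /mnt/snap
    lvremove -f /dev/vg0/homesnap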

=======================================================================
[1] User file corruption will propagate into backups, while system file
corruption could cause undetected user file corruption.

[2] The way rdiff-backup does this is by keeping, for each revision, a
diff against the neighboring revision, so restoring an old version means
walking back through every diff in between.  Unfortunately, this means
that if the diff file between revision (NOW-1) and revision (NOW) gets
corrupted, you lose all previous revisions.  These diff files are static
once created, so they should be checked against stored checksums daily.

[3] My experience with tapes over the years is that they are expensive
and extremely reliable... until you need them, at which time they have
somehow ended up completely destroyed.

[4] dump only works with ext2/ext3, and behaves very badly if the
filesystem is mounted.

[5] bzip2 -1 uses 100k compression blocks, so any corruption only
destroys the 100k block it falls in.


