Extended discussion on databases for backup indexes: WAS: Re: [lug] Bacula

Bear Giles bgiles at coyotesong.com
Wed Sep 21 07:44:50 MDT 2005


Another possibility is a two-phase update.  Write your 'live'
capture to a Berkeley DB, then copy the data into your
relational database later for reports and ad-hoc queries.
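
A rough sketch of the second phase, assuming the snapshot is a
dbm file of JSON-encoded lstat() fields keyed by pathname (as
sketched after the next paragraph) and SQLite as the reporting
database; the file names and schema here are made up:

    import dbm, json, sqlite3

    def load_snapshot(dbm_path="snapshot.db", sqlite_path="reports.db"):
        """Copy the key/value snapshot into a relational table
        so it can be used for reports and ad-hoc queries."""
        conn = sqlite3.connect(sqlite_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS files (
                            pathname TEXT PRIMARY KEY,
                            size INTEGER, mtime REAL, mode INTEGER)""")
        with dbm.open(dbm_path, "r") as snap:
            for key in snap.keys():
                rec = json.loads(snap[key])   # value was stored as JSON
                conn.execute("INSERT OR REPLACE INTO files VALUES (?,?,?,?)",
                             (key.decode(), rec["size"],
                              rec["mtime"], rec["mode"]))
        conn.commit()
        conn.close()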

If you want a quick snapshot of the system, you can use key:
pathname, value: the lstat() results.
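
A minimal sketch of that snapshot pass, using Python's dbm
module as a stand-in for Berkeley DB and storing a few lstat()
fields as JSON (the field selection and file name are just
illustrative):

    import dbm, json, os

    def snapshot(root, dbm_path="snapshot.db"):
        """Walk the tree and record lstat() results keyed by pathname."""
        with dbm.open(dbm_path, "c") as snap:
            for dirpath, dirnames, filenames in os.walk(root):
                for name in dirnames + filenames:
                    path = os.path.join(dirpath, name)
                    st = os.lstat(path)       # lstat: don't follow symlinks
                    snap[path] = json.dumps({"size": st.st_size,
                                             "mtime": st.st_mtime,
                                             "mode": st.st_mode})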

You can get much fancier, of course.  Use read locks to get some
idea of whether a file is being modified.  It's not 100%
accurate, and some well-behaved programs use dot-locks instead of
flock(), but it's a start.  You can also append one or more hash
values of the file contents to the stored value.
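
A sketch of that check, assuming flock()-style locking and SHA-1
for the content hash; the digest and the 'maybe busy' flag could
then be appended to the record stored for that pathname:

    import fcntl, hashlib

    def hash_with_lock_check(path):
        """Return (sha1_hex, maybe_busy).  Failing to take a shared
        lock suggests a writer holds an exclusive flock(); programs
        that use dot-locks won't be detected this way."""
        maybe_busy = False
        with open(path, "rb") as f:
            try:
                fcntl.flock(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
                locked = True
            except OSError:
                maybe_busy, locked = True, False
            h = hashlib.sha1()
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
            if locked:
                fcntl.flock(f, fcntl.LOCK_UN)
            return h.hexdigest(), maybe_busy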

I agree that it's easy to capture the contents of a directory
and then sort the entries before processing them.  It's also easy
to do a two-level sort, e.g., directories first, then files.
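
For instance, a two-level sort of one directory in Python,
subdirectories first and then files, each group ordered by name:

    import os

    def sorted_entries(dirpath):
        """Return (dirs, files) for one directory, each sorted by name;
        is_dir(follow_symlinks=False) keeps lstat() semantics."""
        entries = list(os.scandir(dirpath))
        dirs  = sorted(e.name for e in entries
                       if e.is_dir(follow_symlinks=False))
        files = sorted(e.name for e in entries
                       if not e.is_dir(follow_symlinks=False))
        return dirs, files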

Some file systems support extended attributes.  They shouldn't be
ignored -- if the FS supports it, things like SSH and PGP keys
should be marked "no dump" and the archive software should
respect that flag.  To do this right you need to use statfs() to
determine the filesystem in use.
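
A Linux-only sketch of both checks: scanning /proc/mounts stands
in for statfs() (which Python doesn't expose directly), and
os.listxattr() shows whether a file carries extended attributes
at all:

    import os

    def fs_type(path):
        """Find the filesystem type of the mount point containing
        'path' by scanning /proc/mounts (naive longest-prefix match)."""
        path = os.path.realpath(path)
        best, fstype = "", None
        with open("/proc/mounts") as mounts:
            for line in mounts:
                _dev, mnt, typ = line.split()[:3]
                if path.startswith(mnt) and len(mnt) > len(best):
                    best, fstype = mnt, typ
        return fstype

    def xattr_names(path):
        """List extended attribute names, or [] if unsupported."""
        try:
            return os.listxattr(path, follow_symlinks=False)
        except OSError:
            return []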

Some things can't safely be backed up as ordinary files.
Databases, in particular, should be dumped instead of archived as
files -- either that, or be absolutely sure that the database is
shut down first.
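
As one example of "dump, don't copy the files", a pre-backup
step that calls pg_dump; the database name and output directory
are placeholders, and other databases have their own equivalent
dump tools:

    import datetime, subprocess

    def dump_postgres(dbname, outdir="/var/backups"):
        """Dump a live PostgreSQL database to a custom-format archive
        so the file-level backup gets a consistent snapshot instead
        of raw data files."""
        stamp = datetime.date.today().isoformat()
        outfile = f"{outdir}/{dbname}-{stamp}.dump"
        subprocess.run(["pg_dump", "--format=custom",
                        "--file", outfile, dbname],
                       check=True)
        return outfile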

Finally, it should go without saying, but if you're rolling your
own find you should use lstat(), not follow symbolic links, and
not attempt to archive or checksum anything other than regular
files.
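
A minimal version of that filter, using lstat() and checking for
regular files only:

    import os, stat

    def should_checksum(path):
        """Only regular files get archived/checksummed; lstat() avoids
        following symlinks, so links, devices, FIFOs and sockets are
        all skipped."""
        try:
            st = os.lstat(path)
        except OSError:
            return False                      # vanished or unreadable
        return stat.S_ISREG(st.st_mode)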

Bear


