[lug] [Slightly OT] File Management?

Matthew Beldyk matt at beldyk.org
Tue Mar 24 12:15:59 MDT 2009


I work on a fairly modest archive of data: dozens of terabytes, exponential
growth, and all that.  Basically, data comes in daily, hourly, or monthly
(it depends on the site).  Backups used to be a nightmare: we would do
incremental backups of the entire archive, which took a long time and
required downtime of the entire filesystem while we wrote the tapes.

The issue we saw was that once a file was written, the only change ever
made to it was a deletion (which was rare, but we still had to scan the
entire archive to do the backup).  And whenever a file was deleted, it was
because it had been replaced by a better copy of that data, or it was the
removal of a duplicate, or something like that; a file was never modified
in place.

We now have a process where we create a new directory for every month.

Basically, we have a hash process written into our archival software that
places files based on the date.  We can update a constant in the hash
process to cut a new directory if we find this month's directory is
becoming too large ("too large" being defined as unable to fit on a tape).
We call these monthly directories "runs" ("runs" was a joke in the past,
and I don't remember exactly what the joke was).
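
As a rough sketch, the run-selection logic amounts to something like the
following (the RUN_SUFFIX constant, the /proc-stack path, and the
$incoming_file variable are made-up names for illustration, not our actual
code):

    # Map an incoming file into this month's run directory.
    # RUN_SUFFIX is the constant we bump (e.g. to "b") to cut a new
    # directory early when the current run outgrows a tape.
    RUN_SUFFIX=""
    run="$(date +%Y%m)${RUN_SUFFIX}"
    mkdir -p "/proc-stack/runs/$run"
    mv "$incoming_file" "/proc-stack/runs/$run/"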

A couple of weeks to a month after we move to a new run (once we've done
most of the deletions that will be done), we mark the run read-only and do
a final rsync of it from the processing stacks to the archival stack.  I
run some scripts to verify that everything within the run agrees with the
database.  We then write a couple of tapes of that run and ship them to
our offsite failover.
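
In shell terms, the freeze-and-archive step looks roughly like this (a
sketch only; the stack paths and the tape device are placeholders):

    run=200902
    # Final sync of the finished run from processing to the archive stack.
    rsync -a --delete "/proc-stack/runs/$run/" "/archive-stack/runs/$run/"
    # Mark the run read-only; nothing in it should ever change again.
    chmod -R a-w "/archive-stack/runs/$run"
    # Write the frozen run to tape for the offsite failover.
    tar -cf /dev/st0 -C /archive-stack/runs "$run"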

The local copies of the runs on the processing stacks are removed, and we
mount the data from the archive stack.  When we need to delete data, we
just mark it in the database as deleted and write our queries to ignore
the deleted files (except for my checking scripts, whose job is to notice
files on the filesystem that the database doesn't know about).
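
A checking script of that sort can be as simple as diffing sorted file
lists; this sketch assumes a hypothetical Postgres "files" table that
stores the same absolute paths the filesystem uses:

    run=200902
    find "/archive-stack/runs/$run" -type f | sort > /tmp/fs.lst
    psql -At -c "SELECT path FROM files WHERE run = '$run'" | sort > /tmp/db.lst
    # Anything printed here exists on only one side: a file the database
    # doesn't know about, or a database entry with no file behind it.
    comm -3 /tmp/fs.lst /tmp/db.lst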

This works fairly well for us; frankly, I'm going to guess it will scale
until we are dealing with petabytes over NFS (though I'm sure we'll be
using new technologies for some things in that far-off future of petabytes
of data).

The important thing to take away from this is that we are able to declare
that certain areas of the filesystem will no longer change (which avoids
the days-long rsync scan of the filesystem).  We can write a tape that
will never change.

I can elaborate on anything if you have specific questions; I've glossed
over a large number of details specific to our setup that would make this
email even longer.  For example, we also do nightly rsyncs of the current
processing runs to the archive that aren't used for anything other than
being able to roll back to yesterday if we have a catastrophic failure
today.
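
That nightly sync is just an ordinary cron job along these lines (paths
hypothetical):

    # 2am snapshot of the in-progress runs; only ever read for rollback.
    0 2 * * * rsync -a /proc-stack/runs/current/ /archive-stack/nightly/current/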

-Matt Beldyk
NOTE: `rsync --delete` is a very scary command, especially if the source
filesystem is having issues and you rsync an empty directory over your
backup copy.
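
One cheap guard against that failure mode is to refuse to sync unless the
source looks sane, e.g. by requiring a sentinel file (a sketch; the paths
and sentinel name are made up):

    SRC=/proc-stack/runs/current
    # If the source filesystem failed to mount, the sentinel won't exist
    # and we abort instead of rsync-deleting the whole backup copy.
    [ -e "$SRC/.backup-sentinel" ] || { echo "source looks wrong; aborting" >&2; exit 1; }
    rsync -a --delete "$SRC/" /backup/current/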



On Tue, Mar 24, 2009 at 11:21 AM, Lee Woodworth <blug-mail at duboulder.com> wrote:

> Rob Nagler wrote:
>
>> Lee Woodworth writes:
>>
>>> It sounds like you have looked at rsync and decided it isn't useful
>>>
>>
>> We use rsync to mirror the data.
>>
>
> You must be doing humongous backups with a large percentage of the data
> changing between backups.
>
> Is it a hard requirement to keep point-in-time views of the whole
> backup set for each day? If not, then maybe this is something to
> consider (this is essentially what my backup process is):
>
> for d in /root /etc /data ....; do
>    /bin/mkdir -p /backup/history/server1/20090324/$d
>    /bin/chmod 0700 /backup/history/server1/20090324/$d
>
>    # Mirror $d into the full backup copy; files deleted or changed since
>    # the last run are moved into today's history tree instead of lost.
>    /usr/bin/rsync -v --archive --stats --delete --backup \
>        --backup-dir=/backup/history/server1/20090324/$d \
>        $d/ /backup/server1/$d/
>
>    # Record the list of files as they stood at backup time.
>    /usr/bin/find $d -ls >> /backup/history/server1/20090324/files.lst
>
>    # Remove old history dirs
>    /bin/rm -rf --preserve-root /backup/history/server1/20090220/$d
> done
>
> This keeps the complete backup copy in sync with the master and keeps a
> per-day directory tree of the files that changed between backups.  It
> allows for handling "oops, deleted file ...." and gives you a list of the
> files as they stood at the time of each backup.
>
> For my small backups (56GB, ~225k files with a variety of sizes, including
> lots of small files), this runs in about 3 minutes to a USB disk (but only
> 0.5% of the files change between runs).