[lug] Bacula

Alan Robertson alanr at unix.sh
Sat Sep 17 10:10:49 MDT 2005


George Sexton wrote:
>>-----Original Message-----
>>From: lug-bounces at lug.boulder.co.us 
>>[mailto:lug-bounces at lug.boulder.co.us] On Behalf Of Alan Robertson
>>Sent: Wednesday, September 14, 2005 10:50 PM
>>To: Boulder (Colorado) Linux Users Group -- General Mailing List
>>Subject: Re: [lug] Bacula
>>
>>
>>A simple example:
>>	If you want to back up 10 million small files, that means
>>	probably 10 million inserts to a database - which is a HUGE
>>	amount of work - and incredibly slow when compared to writing a
>>	10 million records to a flat file.
>>	And when you recycle a backup, that means 10 million deletions.
>>
> 
> Actually, for the single backup case, your argument is sound. For the second
> backup, it's wrong. Consider a table structure:
> 
> BackedUpFiles
> -------------
> File_ID
> FileName
> Size
> Owner
> Attributes
> Date
> MD5SUM

AND very importantly you need a reference count - or you can never 
delete anything from this relation.  And this relation has to be 
indexed by anything you want to search on.  That includes AT LEAST the 
File_ID and the combination of (FileName, Size, Owner, Attributes, Date 
and MD5sum).  Or you could add another field, which you might call 
attribute_sum, which is the md5 sum of all the fields except the 
reference count.
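
To make that concrete, here's roughly what the relation would have to 
look like with the reference count and that index added.  This is just 
a sketch in SQLite syntax (via Python) - not Bacula's actual catalog 
schema, and the index name is made up:

    import sqlite3

    # A sketch of the proposed BackedUpFiles relation, with the reference
    # count and the search index it would need.  Illustration only.
    SCHEMA = """
    CREATE TABLE BackedUpFiles (
        File_ID    INTEGER PRIMARY KEY,          -- indexed automatically
        FileName   TEXT    NOT NULL,
        Size       INTEGER NOT NULL,
        Owner      TEXT    NOT NULL,
        Attributes TEXT    NOT NULL,
        Date       INTEGER NOT NULL,
        MD5SUM     TEXT    NOT NULL,
        RefCount   INTEGER NOT NULL DEFAULT 0    -- how many tapes still hold this file
    );
    -- Without this, every "have we seen this exact file?" lookup is a table scan.
    -- Alternatively: add an attribute_sum column (md5 over everything except
    -- RefCount) and index that single column instead.
    CREATE INDEX BackedUpFiles_attrs
        ON BackedUpFiles (FileName, Size, Owner, Attributes, Date, MD5SUM);
    """

    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)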

> Tape
> ------------
> Tape_ID
> InitialUseDate
> UseCount
> Etc.
> 
> FilesTape
> ---------
> File_ID
> Tape_ID
> 
> It becomes pretty trivial to enumerate the tapes a specific file is on. For
> the 2nd and subsequent backups, the space usage is vastly more efficient.
> 
> I'll give you that removing 10 million entries from FilesTape is a job, but
> the advantage of being able to quickly enumerate the tapes a specific file
> is on vastly outweighs that issue.

Plus the 10 million inserts.  If each averages a lightning-fast 10ms 
(which it won't), then updating the indexes will take about 28 hours 
(10 million x 10ms = 100,000 seconds) - 14 of which typically have to 
happen before the backup can start.  This may be a bit problematic for 
your average set of fulls over the weekend.  And this has been my 
personal experience in exactly such circumstances.  Writing the 
corresponding flat files happens faster than the tapes can be written.  
Deleting a 30MB file takes a fraction of a second.

And, on average, you never refer to these records again.

> From a space perspective, for the second backup a database would use
> something like 30MB additional to store the contents, while the flat file
> would probably use something like:
> 
> 10^6*(80(avg file name length)+6(owner and attributes)+8(file
> size)+8(MD5SUM)+8(file date)+10(delimiters and CRLF))


I've never seen a schema like yours used in practice.  You also have to 
keep a reference count in the BackedUpFiles relation, and you have to 
hope you keep it correct - across system crashes, etc.  So every time 
you crash during a backup or a recycle operation, you have to walk 
through the whole set and audit the reference counts to make sure 
they're still consistent.
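
That audit pass amounts to recomputing every reference count from 
FilesTape and comparing.  A sketch against the hypothetical schema 
above (the function name is made up for illustration):

    import sqlite3

    def audit_ref_counts(db: sqlite3.Connection):
        """Return every BackedUpFiles row whose stored RefCount disagrees
        with the number of FilesTape rows that actually reference it.
        Note that it has to read both relations in full to do so."""
        return db.execute("""
            SELECT b.File_ID, b.RefCount, COUNT(ft.File_ID) AS actual
              FROM BackedUpFiles b
              LEFT JOIN FilesTape ft ON ft.File_ID = b.File_ID
             GROUP BY b.File_ID, b.RefCount
            HAVING b.RefCount <> COUNT(ft.File_ID)
        """).fetchall()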

To recycle, here's what you have to do (a code sketch follows the list):
        walk through the FilesTape relation by Tape key and delete 10
           million tuples
        do 10 million lookups on File_ID in BackedUpFiles, and
          decrement 10 million reference counts (fortunately this
          doesn't affect the indexes)
        delete any tuples in BackedUpFiles whose reference count
          is now zero (updating >= two indexes).
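
In rough code, again assuming the hypothetical schema above (a sketch, 
not how any particular backup program actually does it):

    import sqlite3

    def recycle_tape(db: sqlite3.Connection, tape_id: int) -> None:
        """Recycle one tape: decrement the reference counts of its files,
        drop its FilesTape rows, then purge any file entries nobody
        references any more.  Each step touches one row per file that
        was on the tape - ~10 million rows for a big full."""
        with db:  # one (very large) transaction
            # Decrement the reference count of every file on this tape.
            db.execute("""
                UPDATE BackedUpFiles
                   SET RefCount = RefCount - 1
                 WHERE File_ID IN (SELECT File_ID FROM FilesTape
                                    WHERE Tape_ID = ?)
            """, (tape_id,))
            # Delete the tape's FilesTape rows.
            db.execute("DELETE FROM FilesTape WHERE Tape_ID = ?", (tape_id,))
            # Purge unreferenced files (this updates every index on the table).
            db.execute("DELETE FROM BackedUpFiles WHERE RefCount = 0")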

To back up, here's what you have to do for every file (again, a code 
sketch follows the list):
        Before each file is backed up, you have to look up its
          (FileName, Size, Owner, Attributes, Date, MD5sum)
          in BackedUpFiles to see if you already have a File_ID for it.
        If you don't, then get a new File_ID (probably increment a
          counter and sync that counter out to disk, in case of a crash)
          and insert a new record into BackedUpFiles.
        If you already have one, update the reference count in the
          BackedUpFiles relation.
        Either way, insert a record into the FilesTape relation.
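
Per file, that looks something like this (a sketch against the 
hypothetical schema; the function name is made up):

    import sqlite3

    def record_file(db: sqlite3.Connection, tape_id: int, name: str,
                    size: int, owner: str, attrs: str, date: int,
                    md5: str) -> None:
        """Catalog one backed-up file: one indexed lookup plus one or two
        writes - repeated ~10 million times per full backup."""
        row = db.execute("""
            SELECT File_ID FROM BackedUpFiles
             WHERE FileName=? AND Size=? AND Owner=? AND Attributes=?
               AND Date=? AND MD5SUM=?
        """, (name, size, owner, attrs, date, md5)).fetchone()
        if row is None:
            # Never seen this exact file before: new File_ID, new tuple.
            cur = db.execute("""
                INSERT INTO BackedUpFiles
                       (FileName, Size, Owner, Attributes, Date, MD5SUM, RefCount)
                VALUES (?, ?, ?, ?, ?, ?, 1)
            """, (name, size, owner, attrs, date, md5))
            file_id = cur.lastrowid
        else:
            # Seen it before: just bump the reference count.
            file_id = row[0]
            db.execute("UPDATE BackedUpFiles SET RefCount = RefCount + 1"
                       " WHERE File_ID = ?", (file_id,))
        db.execute("INSERT INTO FilesTape (File_ID, Tape_ID) VALUES (?, ?)",
                   (file_id, tape_id))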

So... if nothing new is being backed up (best case), then
   For the recycle:
       walk sequentially through 10 million tuples in FilesTape
       do 10 million searches by File_ID in BackedUpFiles
       do 10 million in-place updates in BackedUpFiles (ref count)
       plus 10 million deletes in FilesTape
   For backing up:
       do 10 million searches on the complex attribute key in BackedUpFiles
       plus 10 million in-place updates in BackedUpFiles (ref count)
       plus 10 million inserts in FilesTape

If everything being backed up has at least one attribute changed every 
time you back up (worst case), then you will do this:
    For the recycle:
       walk sequentially through 10 million tuples in FilesTape
       do 10 million searches by File_ID in BackedUpFiles
       do 10 million deletes in BackedUpFiles
       do 10 million deletes in FilesTape
    For backing up:
       do 10 million searches on the complex attribute key in BackedUpFiles
       plus 10 million inserts in BackedUpFiles
       plus 10 million inserts in FilesTape

If you keep 6 weeks of fulls online, then FilesTape will have 60 million 
records in it (best or worst case) - making massive inserts and deletes 
into it really expensive.

In the best case from above, you have 10 million tuples in 
BackedUpFiles.  In the worst case, you have a relation of 60 million tuples.


Best case:
       read 10 million tuples sequentially by key
       do 10 million searches by File_ID
       do 10 million replace operations
       10 million deletes

       10 million searches by the attribute key
       10 million more record updates
       10 million inserts

If you make the recycle a transaction and writing a single tape a 
transaction, then  you have some extremely large transactions which will 
take a very long time to commit.

Let's say the sequential reads take 0.1 ms each, the searches by key 
take 3 ms each, the replaces take 3 ms each, inserts take 4 ms, and 
deletes take 5 ms; then here's the time:

       read 10 million tuples sequentially by key:       1000 s.
       do 10 million searches by File_ID:               30000 s.
       do 10 million replace operations:                30000 s.
       10 million deletes:                              50000 s.

       10 million searches by the attribute key:        30000 s.
       10 million more replace operations:              30000 s.
       10 million inserts:                              40000 s.

This is about 58 hours - for the best case - where nothing has really 
changed.


In the worst case, where everything has changed:

    For the recycle:
       read 10 million tuples sequentially by key:       1000 s.
       do 10 million searches by File_ID:               30000 s.
       do 10 million deletes in BackedUpFiles:          50000 s.
       10 million more deletes in FilesTape:            50000 s.

    For backing up:
       do 10 million searches by the attribute key:     30000 s.
       plus 10 million inserts in BackedUpFiles:        40000 s.
       plus 10 million inserts in FilesTape:            40000 s.

This is nearly 70 hours.
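
For anyone who wants to check the arithmetic on those two totals, here 
it is (the per-operation times are just the assumed figures above):

    # Assumed per-operation costs from above, in seconds.
    SEQ_READ, SEARCH, REPLACE, INSERT, DELETE = 0.0001, 0.003, 0.003, 0.004, 0.005
    N = 10_000_000  # tuples touched per full backup

    best = N * (SEQ_READ + SEARCH + REPLACE + DELETE    # recycle
                + SEARCH + REPLACE + INSERT)            # backup
    worst = N * (SEQ_READ + SEARCH + DELETE + DELETE    # recycle
                 + SEARCH + INSERT + INSERT)            # backup

    print(f"best case:  {best:.0f} s = {best / 3600:.1f} hours")    # 211000 s = 58.6 hours
    print(f"worst case: {worst:.0f} s = {worst / 3600:.1f} hours")  # 241000 s = 66.9 hours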

If you make the recycle one transaction and the backup one transaction, 
then most of the delay for committing the backup transaction comes after 
the backup is completed.  If you break it up into one transaction per 
tape, it's a bit better - but this database work will still likely be 
going on for several days into the week - so the files backed up won't 
be available until Tuesday or so...

Notice that this doesn't scale very well.  It's primarily bound by disk 
write speed - and this is very hard to speed up in reality.

You've made your backup problem into an enterprise-class database 
problem - just for the indexes - which are rarely used.  Now, if you 
talk about the backups themselves - they are much easier to scale.


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce


