[lug] Robust storage

Dean Brissinger Dean.Brissinger at vexcel.com
Mon May 2 00:25:09 MDT 2005


On Sun, 2005-05-01 at 14:00 -0600, Daniel Webb wrote:

> I just noticed I have file corruption in an old mailing list archive (gzip
> fails about 1/4 of the way through).  It's not one I care about much,
> but it's got me thinking about the issue in general.  I have no idea
> when this corruption happened, sometime in the last two years.  Here are
> some questions I have for the experts out there:
> 
>  * How can I know if files have been corrupted through hardware errors?
>    Would Linux software RAID have prevented this?


In general I avoid software RAID for valuable data storage.  Hardware
RAID solutions are built for situations like a power failure: they
prevent data loss by using an on-board battery to preserve the write
cache until power returns.  Even with a UPS, the hardware cache is
better because it ensures write-through integrity if a disk fails.
Hardware RAID solutions are optimized for what they do and for getting
attention when there is a problem.  SAN-style RAIDs (with their own
internal controllers) add the ability to recover after a host failure
as well.

Just for reference, some products, like SnapAppliance and Ciprico's
Linux-embedded line, have found other ways to solve these problems--not
that I recommend either.

I also suggest a matched system.  Buy the same brand for the complete
system and confirm the manufacturer has tested the pieces to work
together.  This gives you a baseline of what to expect.  For example, a
Dell 1750 server with a Dell RAID is more trustworthy than a Dell
server with a third-party RAID.  The difference is in the timing of I/O
operations between devices and in the behavior during power failures
and other problems.  It's always good to check independent reviews when
purchasing--some manufacturers don't hold up to their claims.


>  * How can I know if files have been corrupted by bugs in the low-level
>    block drivers (the filesystem drivers or in my case drbd)?
>    Would Linux software RAID have prevented this?  What happens if the
>    corruption is caused by the RAID driver?


I doubt the hardware is to blame unless you had an unclean shutdown
(particularly with RAID).  More likely you had a filesystem, bus, or
network stall, or the original data was bad.  ext2/3 are likely
culprits of minor filesystem corruption; I have mostly found problems
with ext* when restoring very large filesystems.  My current favorite
solution is SuSE Enterprise Server 9 with reiserfs.  I have tested it
by verifying the filesystem after pulling the plug 30 times, including
a power failure right after the filesystem is made read/write but
before the journal has been started (which causes a bad sync).  That is
an easy way to break a lot of filesystems, including big names like
Solaris's journaled UFS and Apple's HFS+.
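
Whatever the cause, a cheap way to catch this kind of damage early
(instead of two years later) is to read the archives end to end on a
schedule.  A rough sketch of the idea in Python (paths and output are
just an example): walk a tree and flag any .gz file that no longer
decompresses cleanly.

#!/usr/bin/env python3
# Walk a directory tree and try to fully decompress every .gz file.
# Damaged archives surface as exceptions (bad CRC, truncated stream, ...).
import gzip
import os
import sys
import zlib

root = sys.argv[1] if len(sys.argv) > 1 else "."
bad = []
for dirpath, _dirs, filenames in os.walk(root):
    for name in filenames:
        if not name.endswith(".gz"):
            continue
        path = os.path.join(dirpath, name)
        try:
            with gzip.open(path, "rb") as f:
                while f.read(1 << 20):   # read to EOF, discard the data
                    pass
        except (OSError, EOFError, zlib.error) as err:
            bad.append((path, err))
            print("CORRUPT: %s (%s)" % (path, err))
if not bad:
    print("all .gz files decompressed cleanly")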

If your problems are reproducible, check with the hardware manufacturer
for known issues.  For example, Seagate is now claiming that 10% of all
Cheetah 6 drives will show problems and eventually fail due to a bad
bearing design.  I've already lost four such disks (including the
replacements from Seagate).


>  * What are some inexpensive solutions to this problem?


Search Google for open source backup solutions.  You are looking for
good file integrity verification features, even bit-wise verification
if you so desire.  The value of the data should determine how many
hoops you are willing to jump through to ensure integrity.  Even then,
if the original data is corrupt you're plain out of luck, which could
be the case in your situation unless you have verified that the
originals were clean.
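
If you want that bit-wise check without pulling in a full backup
package, here is a rough sketch of the idea in Python (the script name
and usage are made up for the example): record a checksum for every
file while the data is still known-good, then re-verify against that
manifest later.

#!/usr/bin/env python3
# Build or verify a simple checksum manifest ("<sha256>  <path>" per line)
# so silent corruption shows up the next time the data is checked.
import hashlib
import os
import sys

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build(root, manifest):
    with open(manifest, "w") as out:
        for dirpath, _dirs, files in os.walk(root):
            for name in sorted(files):
                path = os.path.join(dirpath, name)
                out.write("%s  %s\n" % (sha256(path), path))

def verify(manifest):
    ok = True
    for line in open(manifest):
        digest, path = line.rstrip("\n").split("  ", 1)
        try:
            if sha256(path) != digest:
                print("MISMATCH: %s" % path)
                ok = False
        except OSError:
            print("MISSING:  %s" % path)
            ok = False
    return ok

if __name__ == "__main__":
    # usage: checkfiles.py build <root> <manifest>
    #        checkfiles.py verify <manifest>
    if sys.argv[1] == "build":
        build(sys.argv[2], sys.argv[3])
    else:
        sys.exit(0 if verify(sys.argv[2]) else 1)

md5sum -c from coreutils does essentially the same job; whichever tool
you use, the important part is to record the checksums while you still
trust the data and to keep the manifest somewhere other than the disk
it describes.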