[lug] Reliable SATA server?

Rob Nagler nagler at bivio.biz
Sun May 6 10:17:27 MDT 2012


Hi Sean,

> Because we're using ZFS we can in general get by with fewer discs, so we

I just bought a Dell 2950 with 6 x WD2003FYYS drives.  I've had very
good luck with Dell refurbs, even when buying from third parties.
I'll report back after a year if I'm still seeing the problems.  The
box has 8GB and 8 cores, so it's way overpowered, but that may help
with performance.  I decided to go with SATA after reading around
some more; my problems likely have more to do with the low-end
drives than anything else.  We have never had anything serious go
wrong on our Dells, so I'm going with my gut on this one.

> Why are you using this new-fangled ext3 stuff?  It's far more complex and
> has many more moving parts than tarring your backups out to the raw disc
> device...  :-)

Well, once upon a time, that's exactly what I did.  :)  I just trust
ext3 now, since it has been mainstream for 10 years(?).
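
(For anyone who never saw that era: the approach was literally a tar
stream written straight onto an unpartitioned disk, no filesystem in
between.  A rough sketch; the device name here is hypothetical, so
check it twice before pointing tar at real hardware:)

    # Write /home as a raw tar stream onto a spare disk (no filesystem).
    tar -cf /dev/sdb /home

    # Restore by reading the stream back off the device.
    tar -xf /dev/sdb -C /restore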

> think is relevant...  Based on the description, it sounds like "cp -l" is
> causing your system to thrash, which can cause exactly what you were
> describing with the system taking a very long time to respond to commands.

My concern isn't slowness when the machine is busy.  Rather, it is
the machine locking up completely, requiring a hard reset.  It isn't
at all clear that the lockups are load-related.  They happen
infrequently enough that they're hard to debug; the one thing I was
able to determine is that my Intel motherboards failed with 2 CPUs
and didn't fail with one.  That was a lucky find, because we had
some boxes with only one CPU (for cost reasons), and they worked.  I
do know that I had some immediate failures with various brands of
controllers (go back in the archives for my posts about this) and
ended up with 3ware.
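
(For reference, the "cp -l" pattern in question is presumably the
classic hardlink-snapshot rotation.  A minimal sketch with
hypothetical paths; real scripts also rotate the older snapshots,
which is omitted here:)

    # Make daily.1 a tree of hardlinks to yesterday's snapshot.  Cheap
    # on space, but on a large tree it issues millions of small
    # metadata writes -- the I/O storm under discussion.
    cp -al /backup/daily.0 /backup/daily.1

    # Then refresh daily.0 in place; rsync breaks the hardlinks only
    # for files that actually changed.
    rsync -a --delete /home/ /backup/daily.0/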

> I have, personally, run into the same problem and resolved it by switching
> away from using "cp -l", so really all I could do was say what you
> interpreted as "it's the users fault".

A system shouldn't thrash doing simple operations like this.
Thrashing means running out of resources.  If the I/Os are merely
queued, they just take longer, but the system keeps operating
normally.  If the system can't handle the load, it's because of
poorly written software (kernel drivers, "firmware", not cp) at some
level.

Drive manufacturers spend a lot of time making their drives "slow"
so they can use the same hardware to keep costs low and differentiate
pricing.  It's what IBM used to do with mainframes.  These delay
loops seem to cause serious problems with RAID controllers.  I don't
understand exactly why, but I suspect the controllers aren't designed
with fast-fail in mind.  Rather, the programmers probably assume they
can "work around" the problem in software, and things get really
mucked up.  Writing hard real-time systems is pretty tricky stuff.
Many RAID controllers are dual-processor nowadays, so I suspect the
problem is only going to get worse.
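
(One way to tell "busy" from "broken" while a big cp -l runs is to
watch the queues.  A sketch using standard tools:)

    # si/so columns show whether the box is actually swapping
    # (true thrashing).
    vmstat 2

    # await and avgqu-sz show whether queued I/Os are still
    # completing, just slowly.  A wedged controller shows a queue
    # that never drains.
    iostat -x 2

If the numbers keep moving, the system is just loaded; if they freeze
along with the shell, that points at the controller or driver, not cp.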

Rob


