[lug] Reliable SATA server?

Lee Woodworth blug-mail at duboulder.com
Sat Apr 21 14:57:48 MDT 2012


On 04/21/12 09:03, Rob Nagler wrote:
> Over the years we've struggled to find a reliable SATA server for our
> backups.  I have tried numerous versions of SATA computers, and they
> all seem to fail under our loads, always.  The failure is usually in
> the form of the computer hanging in such a bad state that it requires
> a power cycle.

My heavy-load case is of course very different from yours, but it
reliably makes disks run hot, especially laptop disks: a complete source
rebuild of a Gentoo install (1100 packages, 24hrs). When the disks get
hot enough they hang and lock up the I/O system. This is a multi-core,
multi-process parallel make, so there are often many processes doing
sequential reads and writes to separate files/directories and also
doing rm -r cleanups of temp dirs.

Doing this rebuild on SATA-based DELL rack-mount servers has never been
a problem, even though the servers have more cores and more processes running
at the same time (there can be 7 compiles running at once). I do, however,
use the higher-end lines of SATA disks for these systems: WD Caviar Black
(5yr warranty vs. the minimal warranty on consumer models) and some more
'enterprise-level' Seagates. The cases are also probably better at
transferring heat away from the drives than typical towers.

Another bit of evidence that drive temp may matter: in ancient times I ran a
production data-warehouse loader on a whitebox, dual-CPU, Pentium-based
motherboard with 4 consumer IDE disks. This process absolutely bashed the drives
continuously for 24+ hours with lots of seeks. I never had a temperature
or lockup problem with the drives or the system -- it was a big case with a
fan in front of each drive. Those drives took a year of abuse and lasted
3 more years of normal use after that. Maybe it was luck, but dual-CPU boards
and IDE weren't considered so reliable in those days.

I would compare the internal disk temps (maximums, if recorded) of the
less stable systems with those of the more stable ones, if you can.
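One way to collect those numbers on Linux is smartmontools' "smartctl -A /dev/sdX", which on most ATA drives reports the current temperature as SMART attribute 194 (Temperature_Celsius); some drives use 190 (Airflow_Temperature_Cel) instead, and the exact layout is vendor-dependent. Below is a minimal Python sketch, assuming the usual ten-column attribute table where RAW_VALUE is the last field and starts with the temperature. The helper names and the SAMPLE text are made up for illustration, not output from any machine in this thread.

```python
import re
import subprocess
from typing import Optional

# Attribute IDs that commonly carry drive temperature in `smartctl -A` output.
# 194 = Temperature_Celsius, 190 = Airflow_Temperature_Cel (vendor-dependent).
TEMP_ATTR_IDS = ("194", "190")


def parse_drive_temp(smartctl_output: str) -> Optional[int]:
    """Return the current temperature (C) from `smartctl -A` text, or None."""
    found = {}
    for line in smartctl_output.splitlines():
        # Split into at most 10 fields; RAW_VALUE may itself contain spaces,
        # e.g. "33 (Min/Max 18/52)".
        parts = line.split(None, 9)
        if len(parts) == 10 and parts[0] in TEMP_ATTR_IDS:
            m = re.match(r"\d+", parts[9])
            if m:
                found[parts[0]] = int(m.group())
    for attr_id in TEMP_ATTR_IDS:  # prefer 194 over 190 when both exist
        if attr_id in found:
            return found[attr_id]
    return None


def read_drive_temp(device: str) -> Optional[int]:
    """Run smartctl (needs root and smartmontools installed) and parse it."""
    # smartctl's exit status is a bitmask, so don't treat nonzero as fatal.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return parse_drive_temp(result.stdout)


# Hypothetical sample of `smartctl -A` output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       8760
194 Temperature_Celsius     0x0022   114   099   000    Old_age   Always       -       33 (Min/Max 18/52)
"""

if __name__ == "__main__":
    print(parse_drive_temp(SAMPLE))
```

Polling read_drive_temp() from cron during a heavy run and keeping the highest value gives a rough maximum to compare across machines. Drives that support SCT may also keep their own history, readable with "smartctl -l scttemphist".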

> 
> We do have a fairly stable server these days, but now I'm finding the
> disk slots are failing -- along with the disks.  The server isn't that
> old, but it has been relatively reliable (only the occasional hang).
> 
> I've attached a vmstat when the server is busy.  I can't do much on
> the server at this time, e.g. tab completion of a command takes a few
> seconds.  This is very different from a benchmark load so the numbers
> are deceptive.
> 
> At this stage, I'm tempted to bite the bullet and go with SAS.  We
> have SCSI3 machines running with heavy (but dissimilar) loads without
> a problem.

As for SAS vs. SATA, I think the case and the drive can make as much
difference as the interface. Your SCSI3 disks are probably designed for
enterprise (continuous) use. The consumer drives whose duty-cycle ratings
I have looked up have always been rated below 50%, some at only 20%.
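For a rough sense of scale, here is a small Python sketch that treats a duty-cycle rating as the fraction of a 24-hour day a drive is expected to spend actively reading/writing. That reading is an assumption: vendors define duty cycle differently, and some quote a workload rating (e.g. TB/year) instead. It also counts all 24 hours of a long run as busy time, which is an upper bound. The function names are hypothetical.

```python
HOURS_PER_DAY = 24


def rated_busy_hours(duty_cycle_pct: float) -> float:
    """Busy hours per day implied by a duty-cycle rating given in percent.

    Assumes the rating means "fraction of a 24-hour day spent actively
    reading/writing"; not every vendor defines it this way.
    """
    if not 0 < duty_cycle_pct <= 100:
        raise ValueError("duty cycle must be in (0, 100] percent")
    return HOURS_PER_DAY * duty_cycle_pct / 100


def overload_factor(busy_hours: float, duty_cycle_pct: float) -> float:
    """How many times a workload's busy hours exceed the rated allowance."""
    return busy_hours / rated_busy_hours(duty_cycle_pct)


if __name__ == "__main__":
    # A 24-hour run (like the Gentoo rebuild or the old loader job),
    # checked against the 20% and 50% ratings mentioned above.
    for pct in (20, 50):
        print(f"{pct}% rating: {rated_busy_hours(pct):.1f} h/day allowed, "
              f"a 24 h load is {overload_factor(24, pct):.1f}x that")
```

Under those assumptions, a drive rated at 20% running flat out for a day is at about five times its rating, and one rated at 50% is at about twice its rating.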

The DELL servers we use have the MPT Fusion controller with a SATA backplane.
The drive temps seem to be running around 32C. No problems, even though the
server does the rebuild 3 times as fast as an HP consumer tower.

On the other hand, laptop drives often idle at 40C. On our least stable laptop,
the temp is 48C and can go up to 61C during the Gentoo rebuild. Above that
is where failures happen on this machine.


> 
> I've asked this question on this list before, and not gotten a
> satisfactory answer.  Nobody seems to run the same type of loads that
> we do, and all their servers run just fine.  However, if you do have
> experience with non-sequential writing of large amounts of data to
> a SATA server, please let me know.
> 
> Thanks,
> Rob
> 
> # vmstat 1
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
>  0  2    144  15672 328852 826992    0    0   300   184    9    3  2  1 91  6  0
>  0  2    144  16136 328436 827052    0    0  3684     0 1544 1361  3  2 55 39  0
>  0  2    144  19764 326164 826964    0    0 10948     0 1270 1253 10  2 52 37  0
>  2  1    144  26640 321224 826268    0    0 27828  2948 1295 2364 21  3 50 26  0
>  0  2    144  26504 320472 827268    0    0  5832     0 1375 1150 10  1 54 35  0
>  0  3    144  29356 319048 826360    0    0  4688   784 1371  984 13  2 52 33  0
>  0  3    144  29232 319048 826528    0    0    32  5624 1114   90  0  0 62 37  0
>  0  2    144  29232 319048 826508    0    0     8   748 1119   72  0  0 54 46  0
>  0  3    144  29232 319052 826532    0    0     8    36 1094   56  0  0 26 74  0
>  0  2    144  29608 318708 826432    0    0  3244     4 1446 1138  3  2 52 43  0
>  0  2    144  29884 318204 826952    0    0  2344  1000 1294  656  2  1 55 42  0
>  1  2    144  32428 316516 826648    0    0  6388     0 1167  732 10  2 49 40  0
>  1  2    144  35916 314220 825756    0    0  7476     0 1205  837 11  2 49 38  0
>  0  2    144  37940 312616 825584    0    0  6484     0 1228  835 10  2 49 40  0
>  0  2    144  39108 311712 825816    0    0  4184 14048 1235  553  5  1 50 44  0
>  0  3    144  39428 310848 826544    0    0  4528  4956 1333  939  6  2 53 40  0
>  0  3    144  39428 310852 826676    0    0     8   784 1131   80  0  0 51 49  0
>  0  2    144  39428 310852 826644    0    0     8   468 1099   78  0  0 50 50  0
>  0  2    144  40328 310620 825876    0    0   912     0 1338  694  0  0 52 48  0
>  0  2    144  39600 310476 826804    0    0  3092   104 1492 1285  1  2 60 37  0
>  0  2    144  40972 309896 826492    0    0  3636  1052 1312  806  3  1 54 42  0
> _______________________________________________
> Web Page:  http://lug.boulder.co.us
> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety



