[lug] HW Problem Finally Solved
George Sexton
gsexton at mhsoftware.com
Sat Dec 15 12:14:10 MST 2007
I recently solved a problem that I have had with one of my servers. In
this case it took me a REALLY long time. Aside from the interest of the
issue, this tale sheds some light on the effectiveness (or lack thereof)
of HW burn-in protocols which is a topic that comes up here regularly.
********
History
********
I have a SuperMicro 5015P-TRB server with a pair of 200GB SATA drives
configured for RAID 1 that I bought in September 2005. After putting it
in production, I ended up removing it from production 3 months later
because of file system corruption. Initially, I thought the problem was
a 64-bit kernel/ReiserFS issue. Here's a thread I posted at the start:
http://archive.lug.boulder.co.us/Week-of-Mon-20060123/031529.html
Over the next 18 months I would try various things. I would do a
burn-in and then put the machine in production and then some time later
pull it back out because of FS corruption. For burn-in testing, I used
badblocks, memtest86+, and then I would run the SETI@home client, along
with periodic bonnie++ jobs to stress the system. In all my testing,
things never showed a problem.
Eventually, I decided that the Disk I/O sub-system had problems but I
didn't want to throw out a $1900 computer, so I bought a 3Ware
9550SXU-4LP HW Raid controller. As normal, I did my burn-in procedure,
and things looked good. I put the server in production and boom, the
very first night in service it rebooted itself. From the timing, I KNEW
it was the backup job that had caused it. Sure enough, I manually
started the backup job and the system again powered itself off, and 10
seconds later powered itself on.
I've seen instances before where 3Ware controllers can push the disk
sub-system so hard that marginal power supplies cannot keep up. At this
point I decided to see if I could get the manufacturer to look at
things. I knew that if I had originally gone back to them and said
"Well, I get file system corruption every few weeks..." I wouldn't have
gotten very far. OTOH a machine that turns itself off when 2 drives are
used heavily is a little more dramatic and repeatable.
So, I did a fresh install of Linux and wrote everything up. The exercise
routine to cause a power-off was just starting two shells doing an
md5sum of all files copied from the SUSE install DVD. After starting the
two shells, the system would consistently reboot in less than one minute.
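For reference, the exercise routine amounted to something like the
following. The paths and sizes here are illustrative; in the real test
the file tree was a copy of the SUSE DVD contents, and the point was
simply two readers hammering the disks at once:

```shell
#!/bin/sh
# Reproduce the power-off: two concurrent readers hammering the disk
# sub-system by checksumming a file tree in parallel. In the real test
# the tree was a copy of the SUSE install DVD; here a small tree is
# generated so the sketch is self-contained.
TREE="${1:-/tmp/stress-tree}"
if [ ! -d "$TREE" ]; then
    mkdir -p "$TREE"
    i=0
    while [ "$i" -lt 16 ]; do
        dd if=/dev/urandom of="$TREE/f.$i" bs=64k count=16 2>/dev/null
        i=$((i + 1))
    done
fi
# Two shells' worth of md5sum work, run concurrently.
find "$TREE" -type f -print0 | xargs -0 md5sum | sort > /tmp/sums.1 &
find "$TREE" -type f -print0 | xargs -0 md5sum | sort > /tmp/sums.2 &
wait
# On the broken board the machine powered off before reaching this point.
cmp -s /tmp/sums.1 /tmp/sums.2 && echo "passes agree"
```

On a healthy machine the two passes agree and the script just finishes;
on the broken board it never got that far.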
I was pretty sure it was a power supply issue, but swapping in multiple
units had no effect. Luckily, I got a sympathetic tech at
SuperMicro and they had me ship it in.
After testing various components, they were able to reproduce the issue
using a different motherboard of the same model and revision. A slightly
updated motherboard (1.01->1.02) did not have the problem. The
technicians agreed with me that whatever the issue was in the MB with
the 3Ware controller, the same issue was causing the corruption using
the onboard SATA controller.
They replaced the motherboard with the updated model, and shipped the
unit back to me. My total cost was freight out.
**************************
Lessons Learned
**************************
The obvious lesson, "SuperMicro sucks, don't buy them," isn't it. I've got 5 units
that I've never had a single issue with. Additionally, I've had
infinitely worse experiences with Dell servers. I'm pleased that even
though the unit was pretty old, they still fixed it at no charge. When I
called their SuperServer support, I got a pretty knowledgeable guy that
didn't read from a script. Overall, it was a pretty decent tech support
experience.
I guess the biggest lesson is that it's not enough to burn-in test a
disk sub-system by doing a lot of I/O to it. The testing really needs to
verify the accuracy of the reads/writes done to the disk sub-system. It
seems likely that if I had done something like that I could have saved
TONS of grief. The obvious "badblocks -wv" command just isn't enough by
itself. It's all sequential I/O, and in my case it just didn't stress
things enough to show a problem. Manufacturer disk tests probably
wouldn't help either, since they also test the media sequentially.
If anyone knows of a disk sub-system test routine that VERIFIES data and
stresses a system I would be very interested in it. I've done a little
searching on Google, and I can't seem to find anything that does both
I/O benchmarking as well as verifying the integrity of the data during
the test.
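In the meantime, the kind of test I mean can be approximated with a
short script: write files full of random data, record their checksums,
then read everything back and compare. This is only a sketch (the
directory, file count, and sizes are arbitrary placeholders), and it
exercises the filesystem rather than the raw device:

```shell
#!/bin/sh
# Verify-while-stressing sketch: write files of random data, record
# their checksums, then re-read and verify them. Point DIR at the
# filesystem under test; COUNT and SIZE_MB are arbitrary defaults.
DIR="${1:-./verify-test}"; COUNT="${2:-8}"; SIZE_MB="${3:-4}"
mkdir -p "$DIR"
i=0
while [ "$i" -lt "$COUNT" ]; do
    dd if=/dev/urandom of="$DIR/blk.$i" bs=1M count="$SIZE_MB" 2>/dev/null
    i=$((i + 1))
done
( cd "$DIR" && md5sum blk.* > SUMS )
# Ideally drop the page cache here (needs root) so reads hit the disk:
#   echo 3 > /proc/sys/vm/drop_caches
( cd "$DIR" && md5sum -c SUMS ) && echo "verification passed"
```

Run several copies concurrently against different directories to get
the random, mixed I/O that badblocks' sequential passes don't produce.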
--
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL: http://www.mhsoftware.com/