[lug] HW Problem Finally Solved
George Sexton
gsexton at mhsoftware.com
Sat Dec 15 12:14:10 MST 2007
I recently solved a problem that I have had with one of my servers. In
this case it took me a REALLY long time. Aside from the interest of the
issue, this tale sheds some light on the effectiveness (or lack thereof)
of HW burn-in protocols which is a topic that comes up here regularly.
********
History
********
I have a SuperMicro 5015P-TRB server with a pair of 200GB SATA drives
configured for RAID 1 that I bought in September 2005. After putting it
in production, I ended up removing it from production 3 months later
because of file system corruption. Initially, I thought the problem was
a 64-bit kernel/ReiserFS issue. Here's a thread I posted at the start:
http://archive.lug.boulder.co.us/Week-of-Mon-20060123/031529.html
Over the next 18 months I would try various things. I would do a
burn-in and then put the machine in production and then some time later
pull it back out because of FS corruption. For burn-in testing, I used
badblocks, memtest86+, and then I would run the SETI@home client, along
with periodic bonnie++ jobs to stress the system. In all my testing,
things never showed a problem.
Eventually, I decided that the Disk I/O sub-system had problems but I
didn't want to throw out a $1900 computer, so I bought a 3Ware
9550SXU-4LP HW Raid controller. As normal, I did my burn-in procedure,
and things looked good. I put the server in production and boom, the
very first night in service it rebooted itself. From the timing, I KNEW
it was the backup job that had caused it. Sure enough, I manually
started the backup job and the system again powered itself off, and 10
seconds later powered itself on.
I've seen instances before where 3Ware controllers can push the disk
sub-system so hard that marginal power supplies cannot keep up. At this
point I decided to see if I could get the manufacturer to look at
things. I knew that if I had originally gone back to them and said
"Well, I get file system corruption every few weeks..." I wouldn't have
gotten very far. OTOH a machine that turns itself off when 2 drives are
used heavily is a little more dramatic and repeatable.
So, I did a fresh install of Linux and wrote everything up. The exercise
routine to cause a power-off was just starting two shells doing an
md5sum of all files copied from the SUSE install DVD. After starting the
two shells, the system would consistently reboot in less than one minute.
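For reference, the exercise routine amounted to something like the
following. The paths and sizes here are illustrative; in the real test
the file tree was a copy of the SUSE DVD contents, and the point was
simply two readers hammering the disks at once:

```shell
#!/bin/sh
# Reproduce the power-off: two concurrent readers hammering the disk
# sub-system by checksumming a file tree in parallel. In the real test
# the tree was a copy of the SUSE install DVD; here a small tree is
# generated so the sketch is self-contained.
TREE="${1:-/tmp/stress-tree}"
if [ ! -d "$TREE" ]; then
    mkdir -p "$TREE"
    i=0
    while [ "$i" -lt 16 ]; do
        dd if=/dev/urandom of="$TREE/f.$i" bs=64k count=16 2>/dev/null
        i=$((i + 1))
    done
fi
# Two shells' worth of md5sum work, run concurrently.
find "$TREE" -type f -print0 | xargs -0 md5sum | sort > /tmp/sums.1 &
find "$TREE" -type f -print0 | xargs -0 md5sum | sort > /tmp/sums.2 &
wait
# On the broken board the machine powered off before reaching this point.
cmp -s /tmp/sums.1 /tmp/sums.2 && echo "passes agree"
```

On a healthy machine the two passes agree and the script just finishes;
on the broken board it never got that far.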
I was pretty sure it was a power supply issue, but swapping in multiple
units had no effect. Luckily, I got a sympathetic tech at
SuperMicro and they had me ship it in.
After testing various components, they were able to reproduce the issue
using a different motherboard of the same model and revision. A slightly
updated motherboard (1.01->1.02) did not have the problem. The
technicians agreed with me that whatever the issue was in the MB with
the 3Ware controller, the same issue was causing the corruption using
the onboard SATA controller.
They replaced the motherboard with the updated model, and shipped the
unit back to me. My total cost was freight out.
**************************
Lessons Learned
**************************
The obvious lesson, "SuperMicro sucks, don't buy them," isn't it. I've got 5 units
that I've never had a single issue with. Additionally, I've had
infinitely worse experiences with Dell servers. I'm pleased that even
though the unit was pretty old, they still fixed it at no charge. When I
called their SuperServer support, I got a pretty knowledgeable guy that
didn't read from a script. Overall, it was a pretty decent tech support
experience.
I guess the biggest lesson is that it's not enough to burn-in test a
disk sub-system by doing a lot of I/O to it. The testing really needs to
verify the accuracy of the reads/writes done to the disk sub-system. It
seems likely that if I had done something like that I could have saved
TONS of grief. The obvious "badblocks -wv" command just isn't enough by
itself. It's all sequential I/O, and in my case it just didn't stress
things enough to show a problem. Manufacturer disk tests probably
wouldn't help either, since they also test the media sequentially.
If anyone knows of a disk sub-system test routine that VERIFIES data and
stresses a system I would be very interested in it. I've done a little
searching on Google, and I can't seem to find anything that does both
I/O benchmarking as well as verifying the integrity of the data during
the test.
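In the meantime, the kind of test I mean can be approximated with a
short script: write files full of random data, record their checksums,
then read everything back and compare. This is only a sketch (the
directory, file count, and sizes are arbitrary placeholders), and it
exercises the filesystem rather than the raw device:

```shell
#!/bin/sh
# Verify-while-stressing sketch: write files of random data, record
# their checksums, then re-read and verify them. Point DIR at the
# filesystem under test; COUNT and SIZE_MB are arbitrary defaults.
DIR="${1:-./verify-test}"; COUNT="${2:-8}"; SIZE_MB="${3:-4}"
mkdir -p "$DIR"
i=0
while [ "$i" -lt "$COUNT" ]; do
    dd if=/dev/urandom of="$DIR/blk.$i" bs=1M count="$SIZE_MB" 2>/dev/null
    i=$((i + 1))
done
( cd "$DIR" && md5sum blk.* > SUMS )
# Ideally drop the page cache here (needs root) so reads hit the disk:
#   echo 3 > /proc/sys/vm/drop_caches
( cd "$DIR" && md5sum -c SUMS ) && echo "verification passed"
```

Run several copies concurrently against different directories to get
the random, mixed I/O that badblocks' sequential passes don't produce.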
--
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL: http://www.mhsoftware.com/