[lug] Server Problem Update

Thu Mar 30 08:57:46 MST 2006

I thought I'd continue to update everyone on the status of my problem
server. As you may recall, it was dying about once per week in production as
a web and mail server.

I put a new machine into production about 3 weeks ago and everything with it
is going fine. 

I brought the old machine with the dual-core Pentium D down here and set it
up running BOINC/SetiAtHome. I also set up a cron job to run Bonnie++ every 20
minutes. The system ran for 3 weeks without a problem.

Yesterday, I added a second cron job that runs another copy of Bonnie++ at 5
and 37 minutes after the hour, in addition to the existing run every 20
minutes. When I came downstairs this morning, the system was locked up.
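
To see when the two Bonnie++ schedules can collide, here is a rough Python
sketch. It assumes "every 20 minutes" means starts at :00, :20 and :40, and
that each run lasts a fixed number of minutes. The run length is a guess on
my part (it was not measured), and the function name is made up for
illustration.

```python
# Minute-of-hour start times for the two cron jobs.
FIRST_JOB = [0, 20, 40]   # assumed meaning of "every 20 minutes"
SECOND_JOB = [5, 37]      # 5 and 37 minutes after the hour


def overlapping_starts(run_minutes, first=FIRST_JOB, second=SECOND_JOB):
    """Return (first_start, second_start) pairs whose runs overlap.

    Two runs overlap when one starts less than run_minutes after the
    other, counting around the hour boundary.
    """
    pairs = []
    for a in first:
        for b in second:
            gap = min((b - a) % 60, (a - b) % 60)
            if gap < run_minutes:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    for minutes in (3, 4, 6):
        print(minutes, overlapping_starts(minutes))
```

With a 4-minute run, only the :37 and :40 starts collide. With a 6-minute
run, the :00 and :05 starts collide as well, so the box would see two
copies writing at once at least twice an hour.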

So, it looks to me like there is a race condition in either ReiserFS or the
MD subsystem when running on this machine. The race condition only seems to
be triggered when multiple processes are writing to the disk subsystem at
the same time.
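
For anyone who wants to generate that kind of load without Bonnie++, a
minimal Python sketch follows. It is not what I ran. It uses threads
rather than separate processes to keep the example short, and the function
names, file names and sizes are all made up. The read-back will mostly come
from the page cache, so treat it as a load generator more than a data
integrity test.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

MIB = 1024 * 1024


def write_and_verify(path, size_mb, seed):
    """Write size_mb MiB of a fixed byte pattern, then re-read and compare."""
    block = bytes([seed % 256]) * MIB
    expected = hashlib.sha256()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
            expected.update(block)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to disk
    actual = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(MIB), b""):
            actual.update(chunk)
    return expected.hexdigest() == actual.hexdigest()


def run_writers(directory, writers=4, size_mb=8):
    """Run several writers at once; return one True/False per writer."""
    paths = [os.path.join(directory, f"burnin_{i}.dat") for i in range(writers)]
    with ThreadPoolExecutor(max_workers=writers) as pool:
        futures = [
            pool.submit(write_and_verify, p, size_mb, i)
            for i, p in enumerate(paths)
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(run_writers(scratch))
```

Pointing the directory at a filesystem on the suspect array and running
this in a loop would be the obvious next step.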

I'm curious about what burn-in procedures people use. Would your burn-in
procedure have run enough processes and load to have caught this?

My next course of action will be to put a 32-bit version of Linux on the
machine and repeat the BOINC/Bonnie++ regimen to see what happens.

George Sexton
MH Software, Inc.
http://www.mhsoftware.com/
Voice: 303 438 9585