[lug] HW Problem Finally Solved

George Sexton gsexton at mhsoftware.com
Thu Dec 20 09:23:01 MST 2007



Nate Duehr wrote:
> 
> On Dec 15, 2007, at 12:14 PM, George Sexton wrote:
> 
> [snipped good story of in-depth but time-intensive troubleshooting...]
> 
>> They replaced the motherboard with the updated model, and shipped the 
>> unit back to me. My total cost was freight out.
> 
> 
> Actually -- just to be fully accurate here, your total cost was freight 
> out AND all that time you spent.  Your company may or may not have had 
> other "more useful" things they might have done with you during all that 
> time if the box did not have an intermittent problem, right?  I know -- 
> it's hard to truly quantify this, but call it "time lost".

This is a valid point.

> I think it would have been interesting to see if that experience had 
> been the same if you'd contacted them earlier during the "once every few 
> weeks file system corruption happens" stage.

I know better. No vendor is going to help you with this kind of problem 
unless you're such a huge customer that they would simply replace the 
equipment. And if they replace it with the same model, and the issue is 
a design flaw in the MB, as it was in this case, a replacement won't 
get you anywhere.

> Again, time vs. cost.  Is there an option to have SuperMicro (or any 
> other vendor) do this stress testing BEFORE they ship the box to you?  
> Nowadays, I doubt it -- but if you're busy doing more valuable things or 
> you're short-staffed, it could really be worth having someone else ship 
> you the machine and a test report...

And you're trusting that they didn't just fake the test report and take 
the money...

> Another thing that would help, if you have money to burn (I know most of 
> us don't, but I mention it because I've worked in environments that did 
> -- they had cash but everyone was madly trying to keep up with the 
> workload) is "Platinum" type service contracts.  At the first sign of 
> trouble (or maybe the second if they don't believe you at first) someone 
> confirms what you're seeing via a phone call (or in really expensive 
> ones, on-site service) and the sick box is removed from your site (or 
> shipped away for free) and a new one magically appears, very few 
> questions asked.

Again, you're trusting that the issue isn't a design flaw in the device. 
I had one of the very first Dell 486s, and one of the very first Dell 
Pentiums, and believe me, they were just pure shit. Replacement parts 
don't help if the design is defective.

A lot of vendors advertise this kind of service, but from what I've 
read, very few actually follow through with the kind of support that is 
contracted for. Again, the quality of support usually varies depending 
upon the size of the company. If you're a really important customer you 
get your butt kissed. If you're not, you get Rajnish in Mumbai reading a 
script.

> 
> How it goes for us:
> 
> [snip] really long detailed maintenance procedure.
>

This really all comes down to risk management. The essential element of 
risk management is the break-even point for what you spend:

amount worth spending on mitigation = (estimated loss * probability of loss)

If someone has a really large estimated loss, then these kinds of 
procedures make sense.
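
As a rough illustration of the rule of thumb above, here is a minimal
Python sketch. The function names and the dollar figures are made up for
the example; they are assumptions, not numbers from this thread.

```python
def expected_loss(estimated_loss: float, probability: float) -> float:
    """Expected loss = estimated loss * probability of the loss occurring."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return estimated_loss * probability


def mitigation_worthwhile(mitigation_cost: float,
                          estimated_loss: float,
                          probability: float) -> bool:
    """True when the mitigation costs no more than the expected loss."""
    return mitigation_cost <= expected_loss(estimated_loss, probability)


if __name__ == "__main__":
    # Hypothetical figures: a 40,000 loss with a 25% chance of happening.
    print(expected_loss(40_000, 0.25))
    # Hypothetical 2,500 burn-in procedure against that exposure.
    print(mitigation_worthwhile(2_500, 40_000, 0.25))
```

With a small estimated loss or a low probability, the same check comes
out the other way, which is why elaborate procedures only pay off for
people with a lot at stake.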

> 
> 
>> If anyone knows of a disk sub-system test routine that VERIFIES data 
>> and stresses a system I would be very interested in it. I've done a 
>> little searching on Google, and I can't seem to find anything that 
>> does both I/O benchmarking as well as verifying the integrity of the 
>> data during the test.
> 
> http://www.coker.com.au/bonnie++/

Actually, I used bonnie++, examined the source code, and read the 
documentation. It doesn't VERIFY the data. The closest it comes is 
checking that the correct number of bytes went out and the correct 
number of bytes came in. This is where I got killed: I used bonnie++, 
but the disk corruption was never detected. Bonnie++ will not detect 
disk corruption caused by bad memory, OS bugs, driver bugs, or hardware 
bugs that corrupt the data in transit on the bus.

I'm in the process of writing my own implementation in another 
programming language that does verify the data that was written. In some 
ways, it won't be quite as nice because it's going to be OS portable, 
but OTOH, it verifies the data which is what I now demand.
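
To make the write-then-verify idea concrete, here is a minimal Python
sketch. It is not the tool described above, just one way the check could
look: every block is generated deterministically from a seed and a block
index, so the reader can regenerate the expected bytes and compare them
with what comes back from disk. The names make_block, write_pattern, and
verify_pattern are hypothetical. Note that a small file read back right
away is likely to be served from the OS cache rather than the disk, so a
real stress test would need to write more data than the machine has RAM
(or otherwise get around the cache), and it does no benchmarking at all.

```python
import hashlib
import os

BLOCK_SIZE = 1 << 20  # 1 MiB per block; an arbitrary choice for the sketch


def make_block(seed: int, index: int, size: int = BLOCK_SIZE) -> bytes:
    """Deterministically derive a block's contents from (seed, index)."""
    return hashlib.shake_256(f"{seed}:{index}".encode()).digest(size)


def write_pattern(path: str, seed: int, blocks: int,
                  size: int = BLOCK_SIZE) -> None:
    """Write `blocks` pattern blocks to `path` and flush them to the device."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(make_block(seed, i, size))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, seed: int, blocks: int,
                   size: int = BLOCK_SIZE) -> list[int]:
    """Re-read the file; return indices of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            # A short or altered read both show up as a mismatch.
            if f.read(size) != make_block(seed, i, size):
                bad.append(i)
    return bad
```

Usage would be something like write_pattern("testfile.bin", seed=1,
blocks=4096) followed by verify_pattern with the same arguments; an empty
list means every block read back exactly as written.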

-- 
George Sexton
MH Software, Inc.
Voice: +1 303 438 9585
URL:   http://www.mhsoftware.com/


