[lug] HW Problem Finally Solved
Nate Duehr
nate at natetech.com
Thu Dec 20 00:43:53 MST 2007
On Dec 15, 2007, at 12:14 PM, George Sexton wrote:
[snipped good story of in-depth but time-intensive troubleshooting...]
> They replaced the motherboard with the updated model, and shipped
> the unit back to me. My total cost was freight out.
Actually -- just to be fully accurate here, your total cost was
freight out AND all the time you spent. Your company might have had
other "more useful" things for you to do during all that time if the
box hadn't had an intermittent problem, right? I know -- it's hard to
truly quantify this, but call it "time lost".
TCO has to include some kind of metric for the admin's time and/or
the opportunity cost to the organization while troubleshooting is
taking place. But even the biggest management brains on the planet
don't really seem to have a good metric for this. It's all very
subjective and specific to each admin's and company's situation.
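Even a crude back-of-envelope number beats pretending the time was
free. In shell terms (both numbers below are made up, just to show the
shape of the math):

    # back-of-envelope only -- both numbers are made-up placeholders
    hours_lost=20     # hypothetical hours spent chasing the intermittent fault
    loaded_rate=75    # hypothetical fully-loaded cost of the admin, in $/hour
    echo "hidden cost: \$$(( hours_lost * loaded_rate ))"
    # prints: hidden cost: $1500

Add something like that to the freight bill and the "free" repair
starts to look a little less free.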
> The obvious "SuperMicro sucks don't buy them" isn't it. I've got 5
> units that I've never had a single issue with. Additionally, I've
> had infinitely worse experiences with Dell servers. I'm pleased that
> even though the unit was pretty old, they still fixed it at no
> charge. When I called their SuperServer support, I got a pretty
> knowledgeable guy that didn't read from a script. Overall, it was a
> pretty decent tech support experience.
I think it would have been interesting to see if that experience had
been the same if you'd contacted them earlier during the "once every
few weeks file system corruption happens" stage.
> I guess the biggest lesson is that it's not enough to burn-in test a
> disk sub-system by doing a lot of I/O to it. The testing really
> needs to verify the accuracy of the reads/writes done to the disk
> sub-system. It seems likely that if I had done something like that I
> could have saved TONS of grief. The obvious "badblocks -wv" command
> just isn't enough by itself. It's all sequential I/O, and in my case
> it just didn't stress things enough to show a problem. Manufacturer
> disk tests probably wouldn't help as well because they would test
> the media sequentially as well.
Again, time vs. cost. Is there an option to have SuperMicro (or any
other vendor) do this stress testing BEFORE they ship the box to you?
Nowadays, I doubt it -- but if you're busy doing more valuable things
or you're short-staffed, it could really be worth having someone else
ship you the machine and a test report...
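And if you end up doing the burn-in yourself, it really does need the
verify step you describe. As a rough sketch of the idea (NOT a
polished tool -- the device name, block count, and pass count are
placeholders, and it WILL destroy whatever is on the target disk),
something like this scatters writes around the disk and reads each
block back to compare checksums:

    #!/bin/bash
    # Rough sketch only: random-offset write-then-verify burn-in.
    # Run it against a scratch disk you can afford to wipe.
    DEV=/dev/sdX       # disk under test (placeholder)
    BS=1M              # write size
    BLOCKS=10000       # usable size of $DEV, in $BS-sized blocks
    PASSES=5000

    for pass in $(seq 1 $PASSES); do
        off=$(( (RANDOM * 32768 + RANDOM) % BLOCKS ))  # pseudo-random offset

        # build a random chunk and remember its checksum
        dd if=/dev/urandom of=/tmp/chunk bs=$BS count=1 2>/dev/null
        want=$(md5sum < /tmp/chunk | awk '{print $1}')

        # write it at the random offset and force it out to the disk
        dd if=/tmp/chunk of=$DEV bs=$BS seek=$off conv=fsync 2>/dev/null

        # read the same block back (bypassing the page cache, if your
        # dd supports iflag=direct) and compare
        got=$(dd if=$DEV bs=$BS skip=$off count=1 iflag=direct 2>/dev/null \
              | md5sum | awk '{print $1}')

        [ "$got" = "$want" ] || echo "MISMATCH at block $off (pass $pass)"
    done

Left running overnight, a single MISMATCH line might have flagged your
problem a lot earlier than the occasional filesystem corruption did.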
Another thing that would help, if you have money to burn (I know most
of us don't, but I mention it because I've worked in environments that
did -- they had cash but everyone was madly trying to keep up with the
workload), is a "Platinum"-type service contract. At the first sign of
trouble (or maybe the second, if they don't believe you at first)
someone confirms what you're seeing via a phone call (or, on the
really expensive ones, an on-site visit), the sick box is removed from
your site (or shipped away for free), and a new one magically appears,
very few questions asked.
Of course, those cost a lot of money because the hardware vendor
basically covers their costs up-front. You could literally set the box
on fire, call up, and they'd send a replacement -- and still have made
a little money.
It's always worth beating up any vendor's sales staff a bit if you're
buying a LOT of product (I know, most of us here aren't...) and
twisting their arms into giving you their top-tier support contract
for free for a while, or at a significantly reduced cost. Always try,
no matter what. All they can do is say no, as the saying goes.
Once you've had 4-hour on-site hardware replacement contracts, you can
get VERY spoiled, and they only make economic sense for the most
critical of applications -- but I've worked on a few in telco: systems
that, if they stay down, cost the owners something like $40,000 an
hour in direct cash revenue at peak traffic. That kind of cash cow
demands huge amounts of redundancy and fail-safe mechanisms in
hardware and software, plus that 4-hour contract from a depot in that
city... but those kinds of applications are rare.
On those systems, you don't have time to troubleshoot -- you have time
to swap hardware, and that's it. If you have to recover from backups
for any reason, it takes many hours, it's a major incident, and
someone's going to want to know why the redundant stuff failed. MOST
of the time, real outages on that kind of system are caused by human
error, and it's common to see large amounts of "process" around even
the simplest of system changes. Outages not caused by human error are
opened as a Priority 1 case on our side, which alerts the CEO of our
company. "System down" isn't in our vocabulary when it comes to telco
carrier-class gear and software.
How it goes for us:
I have one customer who ONLY does maintenance activities in a window
from 11PM to 2AM Eastern Time on Friday nights. Any planned
maintenance outside of those hours requires Director-level approval on
their side. Unplanned maintenance usually requires notifying their VP.
Every single command that's going to be typed into the production Sun
servers, except for the login and password, MUST be documented in a
written procedure and reviewed by both us (we're the software/system
integration vendor) and their engineering staff in a meeting at least
a day before the planned maintenance window. All maintenance is done
with both sides' technicians required to have a paper copy of the
planned activities in front of them at all times, cross-checking each
other's progress against the steps in the document.
Usually there's at least one manager (from either company, sometimes
both) monitoring a conference call on an 800 number that must be
active during any maintenance activities.
Any deviation from the written document, or any unforeseen problem
encountered, triggers an instant call-out during the maintenance
conference to a manager on our side and often to both a manager and a
Director-level person on the customer's side for approval to deviate
from the written procedure. If any significant risk of data loss is
foreseen, every effort must be made to stop and take a full backup of
all critical data before proceeding. Usually this also triggers a call
to a VP.
Root access is never available without Management approval of what is
going to be done with it (the written document), and is turned on/off
through a tracking system by one of four or five people nationwide.
Access to the system is via a special dial-up number or an
RSA-key-fob-based VPN.
(Fun, huh? Heh heh. This particular customer is the toughest, but at
least two others implement MOST of the above procedures also.)
The work is interesting, but it can also be boring between windows.
You can read/review logs, query the DB, and do pretty much anything
"read-only" to troubleshoot, or even request a special "we need root
to run X, Y and Z diagnostics" maintenance window after 11PM Eastern
on most weeknights -- but making changes outside of Friday night
almost never happens.
Approval to do a major upgrade that required 8 hours of downtime per
system, with maintenance windows that ran overnight into Saturday
mornings, was a year-long scheduling process, and only one system at a
time could be upgraded, with at least a week between systems. Small
glitches in the documentation were found early on, delaying the third
system's activities by a month while the procedure was re-written.
Having to write down every command before it's issued on a production
box, and pre-test it on a non-production lab system, is something few
sysadmins ever encounter. It should be done more, but it's expensive.
All of this can only be afforded because the systems make a lot of
cold, hard cash for their owners.
Sadly, I don't see any of that cash. LOL! Ha! It pays decently, but
we've had a lot of "traditional" admins come and go. The rigid
environment drives your average sysadmin nuts. They can't just "try"
something real quick in a shell or perl, and they can't even log into
most Production systems as anything other than a standard user that's
in a group that has access to the application and system log files.
> If anyone knows of a disk sub-system test routine that VERIFIES data
> and stresses a system I would be very interested in it. I've done a
> little searching on Google, and I can't seem to find anything that
> does both I/O benchmarking as well as verifying the integrity of the
> data during the test.
http://www.coker.com.au/bonnie++/
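For reference, an invocation like this is roughly where I'd start
(flags from memory, so double-check the man page; the directory, size,
and user below are placeholders):

    # run as root and have bonnie++ drop privileges to "nobody";
    # -d scratch directory, -s file size (use at least 2x RAM so the
    # page cache can't hide the disk), -n small-file count for the
    # creation tests, -x number of times to repeat the whole run
    bonnie++ -d /mnt/scratch -s 8g -n 128 -u nobody -x 10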
--
Nate Duehr
nate at natetech.com