[lug] HW Problem Finally Solved
Nate Duehr
nate at natetech.com
Thu Dec 20 00:43:53 MST 2007
On Dec 15, 2007, at 12:14 PM, George Sexton wrote:
[snipped good story of in-depth but time-intensive troubleshooting...]
> They replaced the motherboard with the updated model, and shipped
> the unit back to me. My total cost was freight out.
Actually -- just to be fully accurate here, your total cost was
freight out AND all the time you spent. Your company might have had
other "more useful" things for you to do during all that time if the
box hadn't had an intermittent problem, right? I know -- it's hard to
truly quantify this, but call it "time lost".
TCO has to include some kind of metric for the admin's time and/or
the opportunity cost to the organization while troubleshooting is
taking place. But even the biggest management brains on the planet
don't really seem to have a good metric for this. It's all very
subjective and specific to each admin's and company's situation.
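Even a crude back-of-envelope number beats pretending the time was
free. In shell terms (both numbers below are made up, just to show the
shape of the math):

    # back-of-envelope only -- both numbers are made-up placeholders
    hours_lost=20     # hypothetical hours spent chasing the intermittent fault
    loaded_rate=75    # hypothetical fully-loaded cost of the admin, in $/hour
    echo "hidden cost: \$$(( hours_lost * loaded_rate ))"
    # prints: hidden cost: $1500

Add something like that to the freight bill and the "free" repair
starts to look a little less free.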
> The obvious "SuperMicro sucks don't buy them" isn't it. I've got 5
> units that I've never had a single issue with. Additionally, I've
> had infinitely worse experiences with Dell servers. I'm pleased that
> even though the unit was pretty old, they still fixed it at no
> charge. When I called their SuperServer support, I got a pretty
> knowledgeable guy that didn't read from a script. Overall, it was a
> pretty decent tech support experience.
I think it would have been interesting to see if that experience had
been the same if you'd contacted them earlier during the "once every
few weeks file system corruption happens" stage.
> I guess the biggest lesson is that it's not enough to burn-in test a
> disk sub-system by doing a lot of I/O to it. The testing really
> needs to verify the accuracy of the reads/writes done to the disk
> sub-system. It seems likely that if I had done something like that I
> could have saved TONS of grief. The obvious "badblocks -wv" command
> just isn't enough by itself. It's all sequential I/O, and in my case
> it just didn't stress things enough to show a problem. Manufacturer
> disk tests probably wouldn't help as well because they would test
> the media sequentially as well.
Again, time vs. cost. Is there an option to have SuperMicro (or any
other vendor) do this stress testing BEFORE they ship the box to you?
Nowadays, I doubt it -- but if you're busy doing more valuable things
or you're short-staffed, it could really be worth having someone else
ship you the machine and a test report...
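And if you end up doing the burn-in yourself, it really does need the
verify step you describe. As a rough sketch of the idea (NOT a
polished tool -- the device name, block count, and pass count are
placeholders, and it WILL destroy whatever is on the target disk),
something like this scatters writes around the disk and reads each
block back to compare checksums:

    #!/bin/bash
    # Rough sketch only: random-offset write-then-verify burn-in.
    # Run it against a scratch disk you can afford to wipe.
    DEV=/dev/sdX       # disk under test (placeholder)
    BS=1M              # write size
    BLOCKS=10000       # usable size of $DEV, in $BS-sized blocks
    PASSES=5000

    for pass in $(seq 1 $PASSES); do
        off=$(( (RANDOM * 32768 + RANDOM) % BLOCKS ))  # pseudo-random offset

        # build a random chunk and remember its checksum
        dd if=/dev/urandom of=/tmp/chunk bs=$BS count=1 2>/dev/null
        want=$(md5sum < /tmp/chunk | awk '{print $1}')

        # write it at the random offset and force it out to the disk
        dd if=/tmp/chunk of=$DEV bs=$BS seek=$off conv=fsync 2>/dev/null

        # read the same block back (bypassing the page cache, if your
        # dd supports iflag=direct) and compare
        got=$(dd if=$DEV bs=$BS skip=$off count=1 iflag=direct 2>/dev/null \
              | md5sum | awk '{print $1}')

        [ "$got" = "$want" ] || echo "MISMATCH at block $off (pass $pass)"
    done

Left running overnight, a single MISMATCH line might have flagged your
problem a lot earlier than the occasional filesystem corruption did.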
Another thing that would help, if you have money to burn (I know most
of us don't, but I mention it because I've worked in environments that
did -- they had cash but everyone was madly trying to keep up with the
workload), is a "Platinum"-type service contract. At the first sign of
trouble (or maybe the second, if they don't believe you at first)
someone confirms what you're seeing via a phone call (or, on the
really expensive ones, an on-site visit), the sick box is removed from
your site (or shipped away for free), and a new one magically appears,
very few questions asked.
Of course, those cost a lot of money because the hardware vendor
basically covers their costs up-front. You could literally set the box
on fire, call up, and they'd send a replacement -- and still have made
a little money.
It's always worth beating up any vendor's sales staff a bit if you're
buying a LOT of product (I know, most of us here aren't...) and
twisting their arms into giving you their top-tier support contract
for free for a while, or at a significantly reduced cost. Always try,
no matter what. All they can do is say no, as the saying goes.
Once you've had 4-hour on-site hardware replacement contracts, you can
get VERY spoiled, and they only make economic sense for the most
critical of applications -- but I've worked on a few in telco: systems
that, if they stay down, cost the owners something like $40,000 an
hour in direct cash revenue at peak traffic. That kind of cash cow
demands huge amounts of redundancy and fail-safe mechanisms in
hardware and software, plus that 4-hour contract from a depot in that
city... but those kinds of applications are rare.
On those systems, you don't have time to troubleshoot -- you have time
to swap hardware, and that's it. If you have to recover from backups
for any reason, it takes many hours, it's a major incident, and
someone's going to want to know why the redundant stuff failed. MOST
of the time, real outages on that kind of system are caused by human
error, and it's common to see large amounts of "process" around even
the simplest of system changes. Outages not caused by human error are
opened as a Priority 1 case on our side, which alerts the CEO of our
company. "System down" isn't in our vocabulary when it comes to telco
carrier-class gear and software.
How it goes for us:
I have one customer who ONLY does maintenance activities in a window
from 11PM to 2AM Eastern Time on Friday nights. Any planned
maintenance outside of those hours requires Director-level approval on
their side. Unplanned maintenance usually requires notifying their VP.
Every single command that's going to be typed into the production Sun
servers, except for the login and password, MUST be documented in a
written procedure and reviewed by both us (we're the software/system
integration vendor) and their engineering staff in a meeting at least
a day before the planned maintenance window. All maintenance is done
with both sides' technicians required to have a paper copy of the
planned activities in front of them at all times, cross-checking each
other's progress against the steps in the document.
Usually there's at least one manager (from either company, sometimes
both) monitoring a conference call on an 800 number that must be
active during any maintenance activities.
Any deviation from the written document, or any unforeseen problem
encountered, triggers an instant call-out during the maintenance
conference to a manager on our side and often to both a manager and a
Director-level person on the customer's side for approval to deviate
from the written procedure. If any significant risk of data loss is
foreseen, every effort must be made to stop and take a full backup of
all critical data before proceeding. Usually this also triggers a call
to a VP.
Root access is never available without Management approval of what is
going to be done with it (the written document), and is turned on/off
through a tracking system by one of four or five people nationwide.
Access to the system is via a special dial-up number or an
RSA-key-fob-based VPN.
(Fun, huh? Heh heh. This particular customer is the toughest, but at
least two others implement MOST of the above procedures also.)
The work is interesting, but it can also be boring between windows.
You can read/review logs, query the DB, and do pretty much anything
"read-only" to troubleshoot, or even request a special "we need root
to run X, Y and Z diagnostics" maintenance window after 11PM Eastern
on most weeknights -- but making changes outside of Friday night
almost never happens.
Approval to do a major upgrade that required 8 hours of downtime per
system, with maintenance windows that ran overnight into Saturday
mornings, was a year-long scheduling process, and only one system at a
time could be upgraded, with at least a week between systems. Small
glitches in the documentation were found early on, delaying the third
system's activities by a month while the procedure was re-written.
Having to write down every command before it's issued on a production
box, and pre-test it on a non-production lab system, is something few
sysadmins ever encounter. It should be done more, but it's expensive.
All of this can only be afforded because the systems make a lot of
cold, hard cash for their owners.
Sadly, I don't see any of that cash. LOL! Ha! It pays decently, but
we've had a lot of "traditional" admins come and go. The rigid
environment drives your average sysadmin nuts. They can't just "try"
something real quick in a shell or perl, and they can't even log into
most Production systems as anything other than a standard user that's
in a group that has access to the application and system log files.
> If anyone knows of a disk sub-system test routine that VERIFIES data
> and stresses a system I would be very interested in it. I've done a
> little searching on Google, and I can't seem to find anything that
> does both I/O benchmarking as well as verifying the integrity of the
> data during the test.
http://www.coker.com.au/bonnie++/
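For reference, an invocation like this is roughly where I'd start
(flags from memory, so double-check the man page; the directory, size,
and user below are placeholders):

    # run as root and have bonnie++ drop privileges to "nobody";
    # -d scratch directory, -s file size (use at least 2x RAM so the
    # page cache can't hide the disk), -n small-file count for the
    # creation tests, -x number of times to repeat the whole run
    bonnie++ -d /mnt/scratch -s 8g -n 128 -u nobody -x 10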
--
Nate Duehr
nate at natetech.com