[lug] server spec

Nate Duehr nate at natetech.com
Thu Jun 14 15:17:49 MDT 2007


Sean Reifschneider wrote:
> On Wed, Jun 13, 2007 at 06:59:50PM -0600, Lee Woodworth wrote:
>>> cooling WILL fail.  Never been in any data center where it didn't,
>>> eventually.
>> Even when there are backups for the AC?
> 
> Yeah, this is weird.  I've never seen a data center cooling system fail.  I
> mean, yes, one of a pair of redundant units may fail, but overall the
> system has continued working.  Not in a class A facility.  I've only heard
> about it once, the data center for a national warehouse related to one of
> the railroads, IIRC.  This was back in the '80s.

Speaking of "class A" facilities, that's another problem for most folks 
-- finding those.  Which facilities in Colorado would you consider 
"class A" that are open to the public?

Level 3?  SunGard?  Ummm... ViaWest (their downtown site is only what, 
about 10 feet above the Platte River? - ever seen the Platte flood?  It 
will.  Eventually.)  Qwest's various co-los?

The failures I've seen... (all of these were in person or over the 
phone, and in each case I was directly involved in the recovery work)...

- Shared cooling tower failure.  Multiple AC units, but all of them were 
using the same cooling tower.  Whoops.  Coolant (glycol) everywhere; 
quite a mess.

- Rainwater infiltration to AC control panel area, AC had to be shut down.

- Extended power failure at a site that had enough generator capacity to 
run the equipment but not the AC.

Other "fun" failures... non-AC...

- Both diesel generators failed to start.  Always entertaining. 
(The battery plant usually doesn't run the air conditioning, depending 
on the site.)

- The inverter that generated AC power off the telco-style battery plant 
literally blew up after a high-voltage filter-capacitor array failed. 
The damage to the array had been done during a Florida hurricane two 
weeks prior to the failure.  No one was hurt, but the replacement 
inverter was unavailable/on backorder, so all non-essential equipment 
was load-shed from its "twin" that was powering the other AC loads, and 
emergency electrical work was done to (carefully) attach the rest of the 
AC load to the remaining inverter.  The replacement arrived via flatbed 
truck the next week and took a crack crew five days to install.

- Diesel generator blew an oil hose, lost all oil pressure, and shut 
down.  The UPS system tripped off-line completely during input-side 
surges caused by the hiccuping generator.  Site down.

- An electrical contractor took the EPO switch off the wall and was 
digging around behind it, without authorization, while his escort wasn't 
looking; he shorted the EPO terminals while putting it back on the wall. 
(Yeah, we couldn't believe it either, and I got to watch some VERY 
pissed-off managers throw him into the hallway and ask him never EVER to 
return.)

And the best one of all...

- A new-hire technician couldn't get a conduit cover at floor level (the 
conduit came up out of the raised floor and ran up the wall) to go back 
on after adding multiple -48 VDC electrical runs through it.  Kicking it 
was the best idea he could come up with.  That rubbed the insulation off 
two of the 6-gauge wires already active in the conduit, shorting that 
-48 VDC circuit to ground.  The breaker on the -48 VDC master 
distribution blew before the breaker on that circuit (which shouldn't 
happen, but it did, and was investigated later), and all -48 VDC 
equipment in the facility went down.

[This one was the one where I was the data center "engineer" and man was 
I pissed.  We had a "come to Jesus meeting" that afternoon where I 
described how far up someone's ass my foot was going to become lodged if 
I ever heard about anyone kicking a power conduit in my data center ever 
again... after we'd recovered all the customer gear that had dropped 
off-line.  Also had a little "chat" with the person who was supposed to 
be supervising the new-hire.  That was probably the maddest I've ever 
been in my professional career, because not only had they caused the 
outage - they did it while I was on a lunch break and didn't call me.  I 
came back to chaos.  Oh man, I was HOT.]

Two other "fun and almost deadly" -48 VDC stories; both involved techs 
doing the STUPIDEST thing of all: pulling LIVE cables.  NEVER EVER EVER 
do this.

In one, I was on the phone with the second tech at the site when loud 
popping and sparking was heard, just before we, 25 miles away, saw all 
of our data terminals drop dead.  (They were Wyse 50s and 60s, and the 
DACS/Mux equipment we were using at the time would spew garbage, 
including CTRL-G (bell) characters, whenever it lost DS1 
synchronization.  The entire call center started beeping, and I was the 
only person who knew why, because I'd heard the zapping and the cussing 
through the phone.)  The tech, in a hurry, had pulled a live -48 VDC 
cable and let it hit the frame of the DACS/Mux equipment.  Two cards 
were blown; the rest of the system survived after a complete 
power-cycle.  Dumb.  And dangerous.

In the other, I came in one morning to find various techs excited about 
an early-morning incident: the overnight tech had started pulling 
cables, not realizing he'd accidentally made them live when he threw the 
wrong breaker.  The cable "zapped" over to the top of a grounded cabinet 
while he was standing on a ladder, and he had the foresight to pull the 
cable AWAY from the cabinet with a wooden broomstick.  But then he was 
stuck.  Up a ladder, alone, with a live cable on the end of a broomstick 
held up in the air.  He was there a while until someone found him and 
safetied the circuit properly.  We had a number of conversations 
afterward about the wisdom of having people work on power alone, and the 
need for anyone doing so to carry a phone or other communications device.

> Of course, there have been class A power outages in the news recently, but
> those seem to have been caused by someone intentionally pressing the EPO.
> That happened to Live Journal twice in the last several years.

EPO power-downs are boring compared to what happens out there in the 
real world.  Data center companies will go to GREAT lengths to keep 
these stories from getting out.

It's very useful to have your OWN temperature monitoring in your 
cabinet, with an alarm when it crosses certain thresholds... I've seen 
telco co-los lose all AC for a day or two and never tell any customers 
who didn't ask why the temp went up.
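Something like the rough sketch below is all it takes (this assumes a 
1-Wire temperature sensor exposed by the Linux w1 driver and a local 
MTA accepting mail on localhost; the sensor ID, threshold, and e-mail 
addresses are placeholders, not anything from a real site):

#!/usr/bin/env python
# Minimal cabinet temperature alarm sketch.  Assumes a 1-Wire sensor
# (e.g. a DS18B20) under the Linux w1 driver; the device ID, threshold,
# and e-mail addresses below are placeholders.
import time
import smtplib
from email.mime.text import MIMEText

SENSOR_FILE = "/sys/bus/w1/devices/28-000000000000/w1_slave"  # placeholder ID
ALARM_C = 32.0       # alarm threshold in degrees C - tune for your gear
CHECK_EVERY = 60     # seconds between polls
ALERT_TO = "oncall@example.com"               # placeholder address

def read_temp_c():
    """Parse the millidegree 't=' value the w1_slave file reports."""
    with open(SENSOR_FILE) as f:
        data = f.read()
    return int(data.rsplit("t=", 1)[1]) / 1000.0

def send_alert(temp_c):
    """Mail an alarm through the local MTA."""
    msg = MIMEText("Cabinet temperature is %.1f C (threshold %.1f C)"
                   % (temp_c, ALARM_C))
    msg["Subject"] = "CABINET TEMP ALARM"
    msg["From"] = "cabinet-monitor@example.com"   # placeholder address
    msg["To"] = ALERT_TO
    s = smtplib.SMTP("localhost")
    s.sendmail(msg["From"], [ALERT_TO], msg.as_string())
    s.quit()

if __name__ == "__main__":
    while True:
        temp = read_temp_c()
        if temp >= ALARM_C:
            send_alert(temp)
        time.sleep(CHECK_EVERY)

Point the alert at a pager address and it pays for itself the first time 
the co-lo loses cooling and doesn't bother to call you.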

In general, it's hard to get across in e-mail the undercurrent of levity 
that goes with these stories.  They're the "war stories" of telco, and 
as long as you weren't the idiot who caused the outage, they're usually 
chuckled about.  A good laugh is had by all once things are restored, 
the meetings are long over, and the customers aren't yelling anymore.

But if you work in either CO (central office) or data center 
environments for long enough (and in enough of them; they come in all 
shapes, sizes, and capability levels), you WILL see these failures.  As 
the old guys retire, one of the things being lost in telco is the 
level-headedness and cool response to big outages at COs.  The new guys 
panic, and/or do things that put themselves or others in danger while 
trying to effect the repairs.

Sometimes, when the electrical transfer autoswitch hasn't thrown and 
there are two feet of standing water that just came in through the roof 
of the electrical distribution room... you don't play hero and try to 
turn things back on.  It takes some experience to know when you're 
risking a life (yours or anyone else's) versus a couple of million in 
lost revenue, and to know that the life is worth more than the job.

Nate


