[lug] Automatic removal of cron job by cron script

Nate Duehr nate at natetech.com
Fri Nov 2 23:54:04 MDT 2007


On Nov 2, 2007, at 5:24 PM, Michael J. Hammel wrote:

> I would think you'd want to have the alert send messages for as long as
> the error is detected, up until the error is cleared.

You've never been paged by a monitoring system during a data-center  
power outage that takes down over 400 monitored items, have you?  :-)

If there's one thing I've learned being on-call 24/7 for almost (oh  
lord, has it been that long?) 11 years now...

You ALWAYS want a way to SHUT IT UP remotely... not from a command  
line, but from the device you're being plastered with.

(At one point in my past the bosses even paid for an IVR system that  
was scriptable with events and we could both have it call us for  
really bad stuff and announce it in voice, as well as call into it on  
an 800 number and tell it to run certain scripts... one of those  
scripts was the "SHUT IT ALL UP" script.  But even though it was
relatively cost-effective, I haven't seen too many places go that far
since then.  It was kinda cool... it could also call out for escalated
tickets in the ticket system.)

(I suppose a properly scripted Asterisk box -- especially if you were  
already using Asterisk anyway -- could do much of what that very  
proprietary system could do back then... cheaper, today.)

2-way pagers can be the best and the worst in the monitoring  
environment... most carriers actually store ALL messages and guarantee  
delivery.  And most devices will only hold 20 at a time and "block" on
receiving any more until you clear the 20.

SMS is a mixed bag.  Some carriers hold some messages but have an
upper limit.  Some drop anything over a specific rate of messages from  
the same source.  Never saw one that stored/guaranteed everything...  
lots of messages go to the bit-bucket in the SMS/cell phone world, but  
they try not to... it'll really tick you off when the most important  
system-generated message of the day doesn't hit anyone's phone.  I
don't trust SMS alone with my "life", that's for sure.
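
One way to stay under both the device cap and a carrier's rate limits is
to collapse a flood into a handful of digest messages before anything is
sent.  A minimal sketch, assuming alerts arrive as plain strings;
`build_pages` and its limits of 5 messages of 10 alerts each are
illustrative choices, not anything a carrier or pager vendor specifies.

```python
def build_pages(alerts, max_pages=5, per_page=10):
    """Collapse alert strings into at most max_pages digest messages.

    Assumes max_pages >= 1. Duplicate alerts are merged, and anything
    that does not fit is reported as a count on the last message.
    """
    unique = sorted(set(alerts))
    # Split the unique alerts into groups of per_page items.
    chunks = [unique[i:i + per_page] for i in range(0, len(unique), per_page)]
    pages = ["; ".join(chunk) for chunk in chunks[:max_pages]]
    # Count the alerts that were cut off so the reader knows the scale.
    leftover = sum(len(chunk) for chunk in chunks[max_pages:])
    if leftover:
        pages[-1] += "; (+%d more)" % leftover
    return pages
```

With 400 items down, that is 5 messages instead of 400, and the last one
still tells you how much is missing from the list.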

Blackberries with real push e-mail as well as Treos (when you can
keep them from spontaneously rebooting all on their own every few  
days) work great, and usually can be used for SSH or other forms of  
"reply/control", too.

Mobile broadband cards and never logging out of your laptop ever are  
also an option, I suppose.  Ha... not one that I'd want but hey, a  
free company-sponsored mobile broadband card for key personnel
actually would be more sane than most VPN setups I've seen for support  
people over the years.  Trying to remain "near your house" to have  
broadband access is highly annoying when you're never really "not on  
call"... and more than one network element has been fixed via stolen  
802.11, I'm sure!

Data center outages are a great stress test for your chosen text  
messaging system.  You can learn a lot about how your carrier handles  
floods of text messages in the post-mortem.  :-)

Anyway, just my two cents... always have a "MAKE THAT DAMNED THING  
SHUT UP" button that is EASY to "push" remotely for large outages.
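
As a rough sketch of what such a button could look like on the sending
side: the alert script checks for a mute file before paging anyone, and
whatever remote mechanism you have (a reply handler, an IVR script, SSH
from a phone) only needs to create that file.  The path, the four-hour
expiry, and the names `is_muted` and `send_alert` are all made up for
illustration, not taken from any particular monitoring package.

```python
import os
import time

# Hypothetical path; any location the on-call person can touch remotely works.
MUTE_FILE = "/var/run/alerts.muted"
# Assumed expiry (seconds) so a mute is never left on by accident.
MUTE_MAX_AGE = 4 * 3600


def is_muted(path=MUTE_FILE, max_age=MUTE_MAX_AGE, now=None):
    """Return True if the mute file exists and is younger than max_age."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return False  # no mute file: alerts flow normally
    return age < max_age


def send_alert(message, send, path=MUTE_FILE):
    """Call send(message) unless alerts are muted; return True if sent."""
    if is_muted(path):
        return False
    send(message)
    return True
```

The expiry is there so a forgotten mute doesn't quietly eat the alerts
for the next outage.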

Multiple techs (who of course, already know about the problem)  
standing in front of racks in the data center reaching down to clear  
20 messages at a time from their pagers for an hour while also trying  
to type and bring things back online is quite comical, for about the  
first 10 minutes of it.

--
Nate Duehr
nate at natetech.com





