[lug] Automatic removal of cron job by cron script
Nate Duehr
nate at natetech.com
Fri Nov 2 23:54:04 MDT 2007
On Nov 2, 2007, at 5:24 PM, Michael J. Hammel wrote:
> I would think you'd want to have the alert send messages for as long as
> the error is detected, up until the error is cleared.
You've never been paged by a monitoring system during a data-center
power outage that takes down over 400 monitored items, have you? :-)
If there's one thing I've learned being on-call 24/7 for almost (oh
lord, has it been that long?) 11 years now...
You ALWAYS want a way to SHUT IT UP remotely... not from a command
line, but from the device you're being plastered with.
(At one point in my past the bosses even paid for an IVR system that
was scriptable with events and we could both have it call us for
really bad stuff and announce it in voice, as well as call into it on
an 800 number and tell it to run certain scripts... one of those
scripts was the "SHUT IT ALL UP" script. But even though it was
relatively cost-effective, I haven't seen too many places go that far
ever since then. It was kinda cool... it could also call on escalated
ticket system tickets.)
(I suppose a properly scripted Asterisk box -- especially if you were
already using Asterisk anyway -- could do much of what that very
proprietary system could do back then... cheaper, today.)
2-way pagers can be the best and the worst in the monitoring
environment... most carriers actually store ALL messages and guarantee
delivery. And most devices will only hold 20 at a time and "block" on
receiving any more until you clear the 20.
SMS is a mixed bag. Some carriers hold some messages but have an
upper limit. Some drop anything over a specific rate of messages from
the same source. Never saw one that stored/guaranteed everything...
lots of messages go to the bit-bucket in the SMS/cell phone world, but
they try not to... it'll really tick you off when the most important
system-generated message of the day doesn't hit anyone's phone. I
don't trust SMS alone with my "life", that's for sure.
Blackberries with real push e-mail as well as Treos (when you can
keep them from spontaneously rebooting all on their own every few
days) work great, and usually can be used for SSH or other forms of
"reply/control", too.
Mobile broadband cards and never logging out of your laptop ever are
also an option, I suppose. Ha... not one that I'd want but hey, a
free company sponsored mobile broadband card for key personnel
actually would be more sane than most VPN setups I've seen for support
people over the years. Trying to remain "near your house" to have
broadband access is highly annoying when you're never really "not on
call"... and more than one network element has been fixed via stolen
802.11, I'm sure!
Data center outages are a great stress test for your chosen text
messaging system. You can learn a lot about how your carrier handles
floods of text messages in the post-mortem. :-)
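
One way to keep a flood from filling a 20-message pager, or tripping a
carrier's rate limit, is to collapse it before it is sent. This is only a
rough sketch: the function name summarize() and the wording of the summary
line are made up for illustration, and the limit of 20 is just the device
limit mentioned above.

```python
def summarize(alerts, max_messages=20):
    """Collapse a list of alert strings into at most max_messages pages.

    Hypothetical helper: keeps the first alerts as-is and replaces the
    remainder with a single summary line, so the receiving device never
    gets more than max_messages at once.
    """
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    if len(alerts) <= max_messages:
        return list(alerts)
    # Reserve the last slot for the summary line.
    kept = list(alerts[:max_messages - 1])
    suppressed = len(alerts) - len(kept)
    kept.append(f"plus {suppressed} more alerts suppressed")
    return kept


if __name__ == "__main__":
    flood = [f"host{n} DOWN" for n in range(400)]
    for page in summarize(flood):
        print(page)
```

With 400 items down, that sends 19 individual alerts and one summary line
instead of 400 pages.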
Anyway, just my two cents... always have a "MAKE THAT DAMNED THING
SHUT UP" button that is EASY to "push" remotely for large outages.
Multiple techs (who of course, already know about the problem)
standing in front of racks in the data center reaching down to clear
20 messages at a time from their pagers for an hour while also trying
to type and bring things back online is quite comical, for about the
first 10 minutes of it.
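
For what it's worth, the "shut up" button can be as simple as a flag the
alert script checks before it pages anyone. A minimal sketch, assuming a
flag file holds the time until which pages are suppressed; the path
/tmp/alerts.silenced and the function names are hypothetical, and
whatever sets the flag (a reply from the pager, an IVR script, an SSH
one-liner) is left out.

```python
import os
import time

SILENCE_FILE = "/tmp/alerts.silenced"  # hypothetical path, pick your own


def silence(minutes, path=SILENCE_FILE, now=None):
    """Suppress pages for the given number of minutes."""
    now = time.time() if now is None else now
    with open(path, "w") as f:
        f.write(str(now + minutes * 60))


def unsilence(path=SILENCE_FILE):
    """Remove the flag so pages flow again."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def is_silenced(path=SILENCE_FILE, now=None):
    """True while the flag file holds a time that is still in the future."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            until = float(f.read().strip())
    except (OSError, ValueError):
        # Missing or garbled flag file: fail open and keep paging.
        return False
    return now < until


def send_page(message, deliver, path=SILENCE_FILE, now=None):
    """Call deliver(message) unless pages are silenced; return True if sent."""
    if is_silenced(path, now):
        return False
    deliver(message)
    return True
```

The silence expires on its own, so nobody has to remember to turn the
pages back on after the outage.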
--
Nate Duehr
nate at natetech.com