[lug] Open source tools to monitor distributed services

Vince Dean vdean at ucar.edu
Thu Nov 30 18:10:32 MST 2006


I am managing a distributed system that depends
on services running on several Unix hosts. I've found that one
of the best ways to monitor the health of the system is to
run independent tests:  to check that a port is open on a given
host, a file has been modified in the last two hours, an
HTTP URL can be retrieved, and so on.  There are a few dozen
conditions, distributed among eight machines, that I want to 
check every few minutes.
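
As a rough illustration, the three kinds of independent test named above
might be sketched in Python as follows. The function names, the timeouts,
and the two-hour default are assumptions made for the example; they are
not the actual scripts in use.

```python
import http.client
import os
import socket
import time
import urllib.request


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def file_is_fresh(path, max_age_seconds=2 * 60 * 60, now=None):
    """Return True if path exists and was modified within max_age_seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # Missing or unreadable file counts as a failed check.
        return False
    return (now - mtime) <= max_age_seconds


def url_is_retrievable(url, timeout=10.0):
    """Return True if fetching url completes with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; ValueError
        # covers a malformed URL.
        return False
```

Each function returns a plain True or False, so a cron job or a
long-running loop can treat every check the same way.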

I'm using ad-hoc scripts and cron jobs, but I feel the 
need for a more general, configurable solution.  I suspect that 
this is a well-studied problem.  Are there any solutions
that you can recommend?

My ideal solution:
- is open source
- runs on Linux, but preferably is portable to other 
     Unix systems and Windows 
     (i.e. written in Java, Python, Ruby, or Perl)
- is easily configured for some standard types of tests:
   - FTP server is running
   - HTTP URL can be retrieved
   - a given port is open on a given host
   - a given file exists and has been recently modified
   - a process is running with a given name
   - etc.
- is easily extended by custom code to check for 
     application-specific conditions
- notifies by email and/or writes messages to a log file when a test fails
- checks at a configurable interval and suppresses redundant messages
     (doesn't tell me the same service is down every minute)
- notifies me when a service is back up
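
The last three requirements in the list above amount to reporting only
when a check changes state. A minimal sketch of that idea in Python is
below; the class name TransitionNotifier, the run_once helper, and the
message wording are hypothetical, and the notify callable stands in for
whatever sends email or writes to a log.

```python
class TransitionNotifier:
    """Report a check only when its result changes, not on every run."""

    def __init__(self, notify):
        self.notify = notify    # callable taking one message string
        self.last_ok = {}       # check name -> last observed result

    def record(self, name, ok):
        # Assumption: a check is treated as "up" before its first run,
        # so an initial success produces no message.
        previous = self.last_ok.get(name, True)
        self.last_ok[name] = ok
        if previous and not ok:
            self.notify(f"DOWN: {name}")
        elif ok and not previous:
            self.notify(f"UP again: {name}")


def run_once(checks, notifier):
    """Run every check once; checks maps a name to a zero-argument callable."""
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False          # a crashing check counts as a failure
        notifier.record(name, ok)
```

Calling run_once from cron would need the last_ok state saved between
runs; calling it in a loop with a sleep of the configured interval keeps
the state in memory. With notify set to print, a service that fails three
runs in a row produces one DOWN message and one UP message on recovery.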

I'll be grateful for any suggestions.

Vince
-- 
Vince Dean
University of Colorado 
Center for Lower Atmospheric Studies
3450 Mitchell Lane, Rm FL0-2514
Boulder, CO 80301
Phone: (303) 497-8077




