[lug] monitoring jobs on linux

Davide Del Vento davide.del.vento at gmail.com
Fri Mar 16 13:16:17 MDT 2012


This is in fact what we will do (for other reasons).
However, we would like to have that information on the running jobs
*before* we can have the scheduler installed and configured and the
users (there are several hundred of them!) trained and convinced to use it.
Any other idea?
Thanks,
Dav

On Fri, Mar 16, 2012 at 12:37, Will Sterling <will.sterling at gmail.com> wrote:
> Install a job scheduler, then have your users submit their jobs using
> the scheduler.  You will then be able to run canned reports for all
> kinds of info you never knew you were missing.
>
> On Mar 16, 2012, at 12:31 PM, Davide Del Vento
> <davide.del.vento at gmail.com> wrote:
>
>> Hi,
>> we have a server where users have shell access, and they usually
>> submit nohupped background jobs (or cron jobs). I would like to
>> monitor what users are doing: at the bare minimum, how long the jobs
>> last on average and what the distribution looks like. Better yet if I
>> can get more details, such as when those jobs run (e.g. is the
>> distribution changing during the weekends? is there any particular
>> user doing something very different from the others? etc.). I am
>> particularly interested in long-running stuff, so sampling would work
>> fine, even at low frequency (e.g. every 1-10 minutes).
>>
>> None of this is rocket science: filtering the output of ps from a
>> cron job every 5 minutes or so would do the trick. However, I don't
>> want to do this myself, since there are many small details that would
>> make this a serious project and not a quick test to collect some data
>> to slap on a manager's desk. For example: what if the PID rolls over?
>> What about spawned processes? I care only about the "top level" jobs
>> submitted by the user, so if the system is running only a single
>> 10-hour bash script calling ten 1-hour things, I want an easy way to
>> find the information I want, which is "the average running time is
>> 10 hours", and not the quick answer "the average running time is
>> 1.8 hours" (since there have been one 10h + ten 1h processes running).
>> Again, since ps can do some parent-child stuff this is possible...
>>
>> But instead of reinventing the wheel, I'm wondering if such a tool
>> exists (maybe within Nagios and/or Ganglia, which are already running
>> on the system - I can just go to the system administrators and ask for
>> what I need). I didn't find anything on Google, but that's probably
>> because I am not a system administrator so I asked the "wrong"
>> question (and Google is not smart enough to accept very elaborate
>> queries like this by email :-)
>>
>> Thanks,
>> Davide
>> _______________________________________________
>> Web Page:  http://lug.boulder.co.us
>> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
>> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety
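
For reference, the ps-sampling idea from the original message could be
sketched roughly as below. This is only a minimal illustration, not a
finished tool, and it rests on several assumptions: a procps-style ps
that supports the lstart column, a "top level" heuristic (the parent
belongs to another user or is missing from the snapshot), and helper
names (parse_ps, top_level, record, mean_runtime_hours, take_sample)
that are made up for this example. Keying each job on (PID, start time)
is one way to cope with PID rollover. Note that an interactive login
shell would itself count as a top-level job under this heuristic, so
real use would likely need extra filtering.

```python
import subprocess
from datetime import datetime

# Repeated -o options with empty headers suppress the header line (procps ps).
PS_CMD = ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "user=",
          "-o", "lstart=", "-o", "args="]


def parse_ps(text):
    """Turn ps output into a list of dicts, one per process."""
    procs = []
    for line in text.splitlines():
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # blank line or process with no command text
        # lstart prints five tokens, e.g. "Fri Mar 16 02:00:00 2012"
        started = datetime.strptime(" ".join(fields[3:8]),
                                    "%a %b %d %H:%M:%S %Y")
        procs.append({"pid": int(fields[0]), "ppid": int(fields[1]),
                      "user": fields[2], "started": started,
                      "args": fields[8]})
    return procs


def top_level(procs):
    """Heuristic (an assumption): a job is top level if its parent belongs
    to another user or is absent from the snapshot."""
    owner = {p["pid"]: p["user"] for p in procs}
    return [p for p in procs if owner.get(p["ppid"]) != p["user"]]


def record(seen, procs, now):
    """Merge one sample into seen, keyed by (pid, start time) so that a
    recycled PID is counted as a new job."""
    for p in top_level(procs):
        seen[(p["pid"], p["started"])] = {
            "user": p["user"], "args": p["args"], "last_seen": now}


def mean_runtime_hours(seen, user=None):
    """Average of (last_seen - started) in hours, optionally for one user."""
    hours = [(job["last_seen"] - started).total_seconds() / 3600
             for (_pid, started), job in seen.items()
             if user is None or job["user"] == user]
    return sum(hours) / len(hours) if hours else 0.0


def take_sample(seen):
    """Run ps once and merge the result; meant to be called every few minutes."""
    out = subprocess.run(PS_CMD, capture_output=True, text=True,
                         check=True).stdout
    record(seen, parse_ps(out), datetime.now())
```

With this approach, the single 10-hour bash script from the example
would be the only job recorded for that user, because its ten 1-hour
children have a parent owned by the same user and are filtered out. The
resolution of each measured runtime is limited by the sampling interval,
and the seen dictionary would have to be saved between cron runs (for
instance to a file), which is left out here.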


