[lug] monitoring jobs on linux

Will will.sterling at gmail.com
Fri Mar 16 15:54:17 MDT 2012


Turn on process accounting (the acct/psacct package) and you can track
resource consumption, commands run, etc.
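
If accounting can't be switched on right away, the ps-sampling idea from
the original question could be sketched roughly like this. It is only a
sketch under assumptions, not a tested tool: it assumes a procps-style ps
that supports the etimes and lstart fields, the function names and the
skip_users parameter are made up for illustration, and it treats a process
as "top level" when its parent is not owned by the same user. That means an
interactive login shell would count as the job while the user stays logged
in, and a child whose parent exits gets counted once it is reparented.

```python
from collections import namedtuple

# One row of: ps -eo pid,ppid,user,etimes,lstart --no-headers
Proc = namedtuple("Proc", "pid ppid user etimes lstart")


def parse_ps(text):
    """Parse ps output; lstart is the trailing multi-word start timestamp."""
    procs = []
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # skip blank or truncated lines
        pid, ppid, user, etimes, lstart = parts
        procs.append(Proc(int(pid), int(ppid), user, int(etimes), lstart.strip()))
    return procs


def top_level(procs, skip_users=("root",)):
    """Keep processes whose parent is not owned by the same user."""
    owner = {p.pid: p.user for p in procs}
    return [p for p in procs
            if p.user not in skip_users and owner.get(p.ppid) != p.user]


def record_sample(seen, procs):
    """Remember the longest elapsed time seen for each top-level job.

    Keying on (user, pid, lstart) keeps a reused PID from being
    mistaken for the earlier job after a PID roll-over.
    """
    for p in top_level(procs):
        key = (p.user, p.pid, p.lstart)
        seen[key] = max(seen.get(key, 0), p.etimes)


def mean_hours(seen):
    """Average running time, in hours, over all jobs recorded so far."""
    return sum(seen.values()) / len(seen) / 3600.0 if seen else 0.0


# Made-up sample: a 10-hour script (pid 4000) running a 1-hour child.
SAMPLE = """\
    1     0 root   900000 Tue Mar  6 05:54:17 2012
 4000     1 dav     36000 Fri Mar 16 05:54:17 2012
 4010  4000 dav      3600 Fri Mar 16 14:54:17 2012
"""

seen = {}
record_sample(seen, parse_ps(SAMPLE))
print(mean_hours(seen))  # only pid 4000 is counted as a job
```

Run from cron every 5 minutes or so, with the dictionary saved between
runs, this would report the 10-hour figure rather than averaging in the
children.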

On Fri, Mar 16, 2012 at 1:16 PM, Davide Del Vento <
davide.del.vento at gmail.com> wrote:

> This in fact is what we will do (for other reasons).
> However, we would like to have that information on the running jobs
> *before* we can have the scheduler installed and configured and the
> users (there are several hundred of them!) trained and convinced to use it.
> Any other ideas?
> Thanks,
> Dav
>
> On Fri, Mar 16, 2012 at 12:37, Will Sterling <will.sterling at gmail.com>
> wrote:
> > Install a job scheduler then have your users submit their jobs using
> > the scheduler.  You will then be able to run canned reports for all
> > kinds of info you never knew you were missing.
> >
> > On Mar 16, 2012, at 12:31 PM, Davide Del Vento
> > <davide.del.vento at gmail.com> wrote:
> >
> >> Hi,
> >> we have a server where users have shell access, and they usually
> >> submit nohupped background jobs (or cron jobs). I would like to
> >> monitor what users are doing. At the bare minimum how long the jobs
> >> last on average and what the distribution looks like. Better yet if I
> >> can get more details, such as when those jobs run (e.g. is the
> >> distribution changing during the weekends? is there any particular
> >> user doing something much off the others? etc.) I am particularly
> >> interested in long-running stuff, so a sampling would work fine, even
> >> at low frequency (e.g. 1-10 minutes)
> >>
> >> None of this is rocket science, filtering the output of ps happening
> >> in a cron every 5m or so would do the trick. However I don't want to
> >> do this myself, since there are many small details that would make
> >> this a serious project and not a quick test to collect some data to
> >> slap on a manager's desk. For example: what if PID rolls over? What
> >> about spawned processes? I care only about the "top level" jobs
> >> submitted by the user, so if in the system there is only a single
> > >> 10-hour bash script calling 10 1-hour things, I want an easy way to
> >> be able to find the information I want which is "the average running
> >> time is 10 hours", and not the quick answer "the average running time
> >> is 1.8 hours" (since there have been 1 10h + 10 1h processes running).
> >> Again, since ps can do some parent-child stuff this is possible....
> >>
> >> But instead of reinventing the wheel, I'm wondering if such a tool
> > >> exists (maybe within Nagios and/or Ganglia which are already running
> >> on the system - I can just go to the system administrators and ask for
> >> what I need). I didn't find anything on Google, but that's probably
> >> because I am not a system administrator so I asked the "wrong"
> >> question (and Google is not smart enough to accept very elaborate
> >> queries like this by email :-)
> >>
> >> Thanks,
> >> Davide
> >> _______________________________________________
> >> Web Page:  http://lug.boulder.co.us
> >> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> > >> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety