[lug] monitoring jobs on linux

Fri Mar 16 15:53:13 MDT 2012

Davide, for starters, take a look at the venerable GNU process
accounting package, "acct", almost certainly pre-packaged for your
distribution.

http://www.gnu.org/software/acct/
http://www.cyberciti.biz/tips/howto-log-user-activity-using-process-accounting.html

It may not do everything you want out of the box, but it can likely be
used as the basis for further customization.
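
As one example of that kind of customization, here is a minimal Python
sketch of the "top-level jobs only" rollup Davide asks about below. It is
not part of acct. It assumes you sample with
`ps -eo pid,ppid,user,etimes,comm --no-headers` (the `etimes` column,
elapsed seconds, may be missing from older versions of ps), and it
assumes a process counts as "top level" when its parent belongs to a
different user or is absent from the snapshot. The names parse_ps,
top_level_jobs and mean_hours, and the sample data, are made up for
illustration.

```python
from collections import namedtuple

# One row of the assumed `ps -eo pid,ppid,user,etimes,comm` output.
Proc = namedtuple("Proc", "pid ppid user etimes comm")


def parse_ps(text):
    """Parse header-less ps output with columns pid, ppid, user, etimes, comm."""
    procs = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        pid, ppid, user, etimes, comm = fields
        procs.append(Proc(int(pid), int(ppid), user, int(etimes), comm))
    return procs


def top_level_jobs(procs, users):
    """Keep processes of the given users whose parent is not owned by the same user.

    Assumption: a nohupped job is reparented to init (root) and a cron job is
    started by cron (root), so "parent has a different owner" marks the top.
    """
    owner = {p.pid: p.user for p in procs}
    return [p for p in procs
            if p.user in users and owner.get(p.ppid) != p.user]


def mean_hours(procs):
    """Average elapsed time, in hours, over the given processes."""
    return sum(p.etimes for p in procs) / len(procs) / 3600.0


# Hypothetical snapshot: a 10-hour script currently running a 1-hour step.
SAMPLE = """\
    1     0 root     900000 init
 4242     1 davide    36000 job.sh
 4300  4242 davide     3600 step10
"""

procs = parse_ps(SAMPLE)
top = top_level_jobs(procs, {"davide"})
print([p.comm for p in top], mean_hours(top))
```

Note that while a user is still logged in, their own sshd or login shell
would show up as the "top level" process under this rule, so you would
probably want to filter those out by command name as well.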

On Fri, Mar 16, 2012 at 1:16 PM, Davide Del Vento
<davide.del.vento at gmail.com> wrote:
> This in fact is what we will do (for other reasons).
> However, we would like to have that information on the running jobs
> *before* we can have the scheduler installed and configured and the
> users (there are several hundred of them!) trained and convinced to use it.
> Any other ideas?
> Thanks,
> Dav
>
> On Fri, Mar 16, 2012 at 12:37, Will Sterling <will.sterling at gmail.com> wrote:
>> Install a job scheduler, then have your users submit their jobs using
>> the scheduler.  You will then be able to run canned reports for all
>> kinds of info you never knew you were missing.
>>
>> On Mar 16, 2012, at 12:31 PM, Davide Del Vento
>> <davide.del.vento at gmail.com> wrote:
>>
>>> Hi,
>>> we have a server where users have shell access, and they usually
>>> submit nohupped background jobs (or cron jobs). I would like to
>>> monitor what users are doing. At the bare minimum how long the jobs
>>> last on average and what the distribution looks like. Better yet if I
>>> can get more details, such as when those jobs run (e.g. is the
>>> distribution changing during the weekends? is there any particular
>>> user doing something far out of line with the others? etc.). I am
>>> particularly interested in long-running stuff, so sampling would work
>>> fine, even at low frequency (e.g. every 1-10 minutes).
>>>
>>> None of this is rocket science; filtering the output of ps from a cron
>>> job every 5 minutes or so would do the trick. However, I don't want to
>>> do this myself, since there are many small details that would make
>>> this a serious project and not a quick test to collect some data to
>>> slap on a manager's desk. For example: what if PID rolls over? What
>>> about spawned processes? I care only about the "top level" jobs
>>> submitted by the user, so if in the system there is only a single
>>> 10-hour bash script calling 10 1-hour things, I want an easy way to
>>> be able to find the information I want which is "the average running
>>> time is 10 hours", and not the quick answer "the average running time
>>> is 1.8 hours" (since there have been 1 10h + 10 1h processes running).
>>> Again, since ps can do some parent-child stuff this is possible....
>>>
>>> But instead of reinventing the wheel, I'm wondering if such a tool
>>> exists (maybe within Nagios and/or Ganglia which are already running
>>> on the system - I can just go to the system administrators and ask for
>>> what I need). I didn't find anything on Google, but that's probably
>>> because I am not a system administrator so I asked the "wrong"
>>> question (and Google is not smart enough to accept very elaborate
>>> queries like this by email :-)
>>>
>>> Thanks,
>>> Davide
>>> _______________________________________________
>>> Web Page:  http://lug.boulder.co.us
>>> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
>>> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety