[PEAK] Use cases/requirements for process monitoring

Wed Dec 3 17:09:50 EST 2003

Scope
-----

These notes deal with web applications using peak.running's tools to manage 
multiple-process FastCGI applications, whether or not peak.web is used.  We 
assume a need to manage multi-server clusters, which may be serving one or 
more FastCGI applications, with or without load balancing.  Multiple 
clusters or subclusters may need to be monitored simultaneously, or with 
the ability to switch between monitored clusters.  We also assume that the 
applications use a database of some kind.

Until now, we have used a proprietary, application-specific monitoring tool 
that Ty wrote for this sort of thing.  However, now that we are starting to 
use the PEAK process supervisor tool, the existing monitor is both less 
appropriate to the new runtime environment and less useful than it could 
be.

So, we'd like to create a new, generic open source monitoring tool for 
applications run under the process supervisor.  Then, as we develop 
additional applications, we won't have to create new monitoring tools for 
them, and we'll get the benefit of third party bugfixes, as we do for other 
tools.

Users
-----

The broad types of users we want to address are:

Help Desk/Monitoring Center -- these folks mainly want to glance at 
something to see if everything's okay, or there's an obvious problem.  If 
there's a problem, they mainly want to know who they should call -- is the 
database down?  Web server problem?  Software problem?  They aren't going 
to do detailed troubleshooting or performance tuning.

Application Administrator -- they need troubleshooting and performance 
data, ranging from current SQL being executed in the case of a database 
app, to per-application CPU utilization, queueing and processing times, 
memory usage, and so on.  They'll also need to be able to compare these 
things to historical levels, i.e., is this measurement "out of line" with 
long term averages, and is there an upward or downward trend?

In addition to this diagnostic information, it would be desirable for an 
application administrator to be able to access controls to start, stop, or 
put an application "on hold" (i.e. change it to a "we're sorry, this app is 
down" page), and obtain contact information for users who are using the 
application at that time.

Note that many of these functions might be generic (e.g. stop/start), but 
others are application specific (contact info for user, currently executing 
SQL or other queries, "on hold" page, etc.)

Application Developer -- Similar to the administrator, only more so.  Wants 
not only app-wide statistics, but also the ability to drill down into 
historical data, based on information about the request, to identify 
application hotspots.

Monitoring Model
----------------

Our previous monitor used only point-in-time measurements: snapshots of the 
current process states.  This is not very helpful when dealing with 
transient issues, or for understanding the true "state" of a machine or 
application, since it measures only some per-request data, and only for 
requests that are in progress at the very moment you ask for the 
information.

This was a limitation of our previous runtime environment, which used the 
process manager supplied by Apache's mod_fastcgi to control server 
processes.  Now that the limitation has been lifted, we'd like to take 
advantage of per-request and over-time measurements, for completed as well 
as in-progress requests.
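
As a rough illustration of the difference, the sketch below keeps a bounded 
history of completed requests alongside the in-progress ones, so that a 
snapshot can report both.  The class and field names are hypothetical; 
nothing here is an existing peak.running API.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    """One request as seen by one application process (hypothetical fields)."""
    request_id: int
    started: float                    # when processing began, in seconds
    finished: Optional[float] = None  # None while the request is in progress

    def elapsed(self, now: float) -> float:
        # Completed requests report their final duration; in-progress
        # requests report the time spent so far.
        end = self.finished if self.finished is not None else now
        return end - self.started


class RequestLog:
    """In-progress requests plus a bounded history of completed ones."""

    def __init__(self, history_size: int = 1000):
        self.in_progress = {}
        self.completed = deque(maxlen=history_size)  # oldest entries drop off

    def begin(self, request_id, now=None):
        now = time.time() if now is None else now
        self.in_progress[request_id] = RequestRecord(request_id, now)

    def end(self, request_id, now=None):
        now = time.time() if now is None else now
        record = self.in_progress.pop(request_id)
        record.finished = now
        self.completed.append(record)

    def snapshot(self, now=None):
        """Point-in-time view that also includes completed requests."""
        now = time.time() if now is None else now
        return {
            "in_progress": [(r.request_id, r.elapsed(now))
                            for r in self.in_progress.values()],
            "completed": [(r.request_id, r.elapsed(now))
                          for r in self.completed],
        }
```

A monitor limited to the "in_progress" half of the snapshot is roughly what 
the previous tool offered; the "completed" half is what makes over-time 
measurements possible.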

Scenarios and Measurements
--------------------------

* Database is slow -- all apps dependent on that database have increased 
response times, requests get backlogged.

* App is chewing CPU -- high CPU time as a percentage of request processing 
time

* Backlog/heavy load -- queue (non-processing) time for requests is high; 
maximum number of children are spawned, but connections are pending on the 
application socket.

* Software error (parent) -- no child processes, no active server process

* Software error (child) -- parent lives, but children keep dying

Some measurements we might therefore want to use:

* Total request time -- the time elapsed from Apache initially receiving 
the request, to the app telling Apache that its response is finished.  This 
is roughly proportional to the response time perceived by the application user.

* Processing time -- the time elapsed from when the application receives 
the request data from Apache, until the app tells Apache that its response 
is finished.

* Queuing time -- Total request time minus processing time

* CPU % per request (CPU time used, as a % of processing time)

* Number of requests

* Number of cancelled requests (User hit stop, connection lost, etc.)

The above would all be useful as averages over various time scales, e.g. 1 
minute, 5 minutes, an hour, etc., and per application process as well as 
per application (either by web server, or for the cluster as a whole).
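
To make the arithmetic concrete, here is a minimal sketch of how the 
per-request measurements listed above relate to one another, and how they 
might be averaged over a time window.  The record layout and the function 
name summarize() are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    """Timestamps are in seconds; all field names are hypothetical."""
    received_by_apache: float  # Apache initially receives the request
    received_by_app: float     # app receives the request data from Apache
    finished: float            # app tells Apache its response is finished
    cpu_seconds: float         # CPU time used while processing
    cancelled: bool = False

    @property
    def total_time(self):
        return self.finished - self.received_by_apache

    @property
    def processing_time(self):
        return self.finished - self.received_by_app

    @property
    def queuing_time(self):
        # Total request time minus processing time
        return self.total_time - self.processing_time

    @property
    def cpu_percent(self):
        # CPU time used, as a % of processing time
        if self.processing_time <= 0:
            return 0.0
        return 100.0 * self.cpu_seconds / self.processing_time


def summarize(requests, now, window_seconds):
    """Average the measurements for requests finished within the window."""
    recent = [r for r in requests
              if now - window_seconds < r.finished <= now]
    count = len(recent)

    def average(values):
        return sum(values) / count if count else 0.0

    return {
        "requests": count,
        "cancelled": sum(1 for r in recent if r.cancelled),
        "avg_total_time": average([r.total_time for r in recent]),
        "avg_processing_time": average([r.processing_time for r in recent]),
        "avg_queuing_time": average([r.queuing_time for r in recent]),
        "avg_cpu_percent": average([r.cpu_percent for r in recent]),
    }
```

Calling summarize() with window_seconds of 60, 300, and 3600 over the same 
list of records would give the 1 minute, 5 minute, and hourly views; 
filtering the list by process or by server first would give the narrower 
breakdowns.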

Other useful measurements would include process size (i.e. memory usage).
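
One way to connect the scenarios above to these measurements is a simple 
rule-based check, which is roughly what a help desk display would need.  
The sketch below is only that: the status fields, the threshold values, and 
the "low CPU but slow processing suggests the database" heuristic are all 
assumptions, not requirements stated in these notes.

```python
from dataclasses import dataclass


@dataclass
class SupervisorStatus:
    """Summary for one supervisor; every field name is hypothetical."""
    parent_alive: bool
    child_count: int
    max_children: int
    recent_child_deaths: int
    pending_connections: int
    avg_processing_time: float  # seconds
    avg_cpu_percent: float      # CPU time as a % of processing time


# Hypothetical thresholds; real values would need tuning per application.
MAX_CHILD_DEATHS = 3
HIGH_PROCESSING_SECONDS = 5.0
HIGH_CPU_PERCENT = 80.0


def diagnose(status):
    """Return the list of scenario labels that match the given status."""
    if not status.parent_alive and status.child_count == 0:
        # No child processes, no active server process
        return ["software error (parent)"]

    findings = []
    if status.recent_child_deaths >= MAX_CHILD_DEATHS:
        # Parent lives, but children keep dying
        findings.append("software error (child)")
    if (status.child_count >= status.max_children
            and status.pending_connections > 0):
        # Maximum children spawned, but connections still pending
        findings.append("backlog/heavy load")
    if status.avg_cpu_percent >= HIGH_CPU_PERCENT:
        findings.append("app is chewing CPU")
    elif status.avg_processing_time >= HIGH_PROCESSING_SECONDS:
        # Assumed heuristic: slow requests that are not CPU-bound are
        # probably waiting on something external, such as the database.
        findings.append("database may be slow")
    return findings
```

Returning a list rather than a single label reflects the fact that the 
scenarios overlap: a slow database will usually produce a backlog as well.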

Conceptual Model (rough cut)
----------------------------

"Requests" are performed by "processes" which are part of an "application" 
and run on a "server" which is part of a "cluster".  Each "server" also has 
a "supervisor process" for each "application" running on that server.

Most of our measurements are based on "requests", summarized along various 
dimensions such as process, application, server, and time.  "Requests" may 
be in-progress or completed, and a snapshot of in-progress requests' 
statistics may be included as part of a point-in-time summary.

Some other measurements, by object type:

* Process: memory usage, age, idle time, time between requests

* Application: number of processes

* Process supervisor: number of child processes

Of course, any per-process statistic may be summarized per supervisor or 
per application.
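
A minimal sketch of that kind of summarization follows, assuming each 
request is recorded as a dictionary tagged with the objects it belongs to.  
The key names and the summarize_by() helper are hypothetical.  A supervisor 
is identified here by the (server, application) pair, since each server has 
one supervisor per application.

```python
from collections import defaultdict


def summarize_by(records, dimensions, value):
    """Group request records by the given keys and average one statistic.

    records    -- iterable of dicts, one per request
    dimensions -- tuple of keys to group on, e.g. ("server", "application")
    value      -- key of the numeric statistic to average
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in dimensions)
        groups[key].append(record[value])
    return {
        key: {"requests": len(values), "average": sum(values) / len(values)}
        for key, values in groups.items()
    }


records = [
    {"server": "web1", "application": "orders", "process": 101,
     "processing_time": 1.0},
    {"server": "web1", "application": "orders", "process": 102,
     "processing_time": 3.0},
    {"server": "web2", "application": "orders", "process": 201,
     "processing_time": 5.0},
]

# Per supervisor: obtainable from a single machine.
per_supervisor = summarize_by(
    records, ("server", "application"), "processing_time")

# Per application: needs records gathered from every server in the cluster.
per_application = summarize_by(
    records, ("application",), "processing_time")
```

The two calls differ only in the grouping keys, but the second one requires 
the records from every server to be available in one place.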

In general, "per application" statistics are harder to obtain in a cluster 
monitoring scenario, and "per supervisor" may be sufficient for most 
purposes, as they are easy to obtain from a single machine.  But, we 
should still give some thought as to how we might collect that information 
in a central location in the future.

In addition to measurements, there are also actions we may wish to take, 
such as:

* Application -- start, shutdown, reload configuration, update software, 
put "on hold"

* Process -- hard kill to stop a problem, "friendly kill" to request the 
process stop its current request and give the user an error message.  (I 
can dream, can't I?)

* Supervisor -- same as application, possibly substituting for 
per-application capabilities if we don't have a good cross-cluster 
mechanism for this.
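
For the two process-level actions, one possible mapping on a POSIX system 
is sketched below.  The choice of signals is an assumption: these notes do 
not say how either kind of kill would be delivered, and a real "friendly 
kill" would also need the application to catch the signal and return an 
error page to the user.

```python
import os
import signal

# Hypothetical mapping from the action names used above to POSIX signals.
STOP_SIGNALS = {
    "hard": signal.SIGKILL,      # cannot be caught; the process dies at once
    "friendly": signal.SIGTERM,  # can be caught, so the process may clean up
}


def stop_process(pid, kind="friendly"):
    """Send the signal for the requested kind of kill to the given pid."""
    os.kill(pid, STOP_SIGNALS[kind])
```

An unknown kind raises KeyError before any signal is sent, so a typo in the 
action name cannot kill anything by accident.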

User Interface
--------------

I'm mostly going to leave the subject of UI for another time.  I am 
assuming, though, that this will use peak.web, and need security to 
distinguish user levels (and possibly distinguish who has rights to what 
apps and/or servers).  It should depend on as few other resources as 
possible though, since if an outage that wipes out an application also 
wipes out the monitor, one of your key troubleshooting resources will have 
gone AWOL.

Our previous monitor was pure CGI, to avoid dependence on FastCGI, but I'm 
not sure we need this any more.  One reason was that we had some odd 
outages in the past that appeared to be caused by Apache running out of 
file handles, perhaps from creating and destroying lots of sockets for 
FastCGI child processes.  The new process manager should drastically reduce 
the opportunity for such a problem, and, more importantly, if Apache 
doesn't serve FastCGI requests, the monitor disappearing on that box will 
make it bloody obvious that Apache is broken on that box.  IIRC, if Apache 
breaks in this way currently, the FastCGI processes on that box will simply 
appear to be idling "between requests".  There's no way to tell from the 
monitor that something's wrong.

An interesting side effect, by the way, of running the process monitor 
itself as a supervised process, is that it would then be able to monitor 
and control itself!  Of course, a downside is that you might shut it down, 
and then be unable to restart it.  :)  However, if the monitor is run as a 
dynamic FastCGI (such that Apache starts the monitor's supervisor process), 
then this is not really an issue.

Another interesting UI possibility is that if applications are persistent 
objects referenced by the monitor, then it might be possible to have a 
"deployment UI" for configuring new or existing applications.  This would 
be quite handy in many environments besides our own.