[PEAK] Use cases/requirements for process monitoring
Phillip J. Eby
pje at telecommunity.com
Wed Dec 3 17:09:50 EST 2003
Scope
-----
These notes deal with web applications using peak.running's tools to manage
multiple-process FastCGI applications, whether or not peak.web is used. We
assume a need to manage multi-server clusters, which may be serving one or
more FastCGI applications, with or without load balancing. Multiple
clusters or subclusters may need to be monitored simultaneously, or with
the ability to switch between monitored clusters. We also assume that the
applications use a database of some kind.
Until now, we have used a proprietary, application-specific monitoring tool
that Ty wrote for this sort of thing. However, now that we are starting to
use the PEAK process supervisor tool, the existing monitor is both less
appropriate to the new runtime environment and less useful than it could be.
So, we'd like to create a new, generic open source monitoring tool for
applications run under the process supervisor. Then, as we develop
additional applications, we won't have to create new monitoring tools for
them, and we'll get the benefit of third party bugfixes, as we do for other
tools.
Users
-----
The broad types of users we want to address are:
Help Desk/Monitoring Center -- these folks mainly want to glance at
something to see if everything's okay, or there's an obvious problem. If
there's a problem, they mainly want to know who they should call -- is the
database down? Web server problem? Software problem? They aren't going
to do detailed troubleshooting or performance tuning.
Application Administrator -- they need troubleshooting and performance
data, ranging from current SQL being executed in the case of a database
app, to per-application CPU utilization, queueing and processing times,
memory usage, and so on. They'll also need to be able to compare these
things to historical levels, i.e., is this measurement "out of line" with
long term averages, and is there an upward or downward trend?
In addition to this diagnostic information, it would be desirable for an
application administrator to be able to access controls to start, stop, or
put an application "on hold" (i.e. change it to a "we're sorry, this app is
down" page), and obtain contact information for users who are using the
application at that time.
Note that many of these functions might be generic (e.g. stop/start), but
others are application-specific (contact info for user, currently executing
SQL or other queries, "on hold" page, etc.).
Application Developer -- Similar to the administrator, only more so. Wants
not only app-wide statistics, but also the ability to drill down on
historical data, based on information about the request, to identify
application hotspots.
Monitoring Model
----------------
Our previous monitor used only point-in-time measurements, i.e. snapshots
of current process states. This is not very helpful when dealing with
transient issues, or for understanding the true "state" of a machine or
application, since it captures only some per-request data, and only for a
request that is in progress at the very moment you ask for the information.
This was a limitation of our previous runtime environment, which used the
process manager supplied by Apache's mod_fastcgi to control server
processes. Now that the limitation has been lifted, we'd like to take
advantage of per-request and over-time measurements, for completed as well
as in-progress requests.
Scenarios and Measurements
--------------------------
* Database is slow -- all apps dependent on that database have increased
response times, requests get backlogged.
* App is chewing CPU -- high CPU time as a percentage of request processing
time
* Backlog/heavy load -- queue (non-processing) time for requests is high;
maximum number of children are spawned, but connections are pending on the
application socket.
* Software error (parent) -- no child processes, no active server process
* Software error (child) -- parent lives, but children keep dying
Some measurements we might therefore want to use:
* Total request time -- the time elapsed from Apache initially receiving
the request, to the app telling Apache that its response is finished. This
is roughly proportional to the response time perceived by the application user.
* Processing time -- the time elapsed from when the application receives
the request data from Apache, until the app tells Apache that its response
is finished.
* Queuing time -- Total request time minus processing time
* CPU % per request (CPU time used, as a % of processing time)
* Number of requests
* Number of cancelled requests (User hit stop, connection lost, etc.)
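
As a rough illustration of how the per-request measurements above relate to
one another, here is a minimal Python sketch. The class and field names are
made up for this example (they are not part of peak.running), and it assumes
we can somehow obtain the three timestamps and the CPU time for a request.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timings for one request, in seconds. All names are hypothetical."""
    received_by_apache: float  # Apache initially receives the request
    received_by_app: float     # app receives the request data from Apache
    finished: float            # app tells Apache its response is finished
    cpu_seconds: float         # CPU time used while processing
    cancelled: bool = False    # user hit stop, connection lost, etc.

    @property
    def total_time(self) -> float:
        return self.finished - self.received_by_apache

    @property
    def processing_time(self) -> float:
        return self.finished - self.received_by_app

    @property
    def queuing_time(self) -> float:
        # Total request time minus processing time
        return self.total_time - self.processing_time

    @property
    def cpu_percent(self) -> float:
        # CPU time used, as a % of processing time
        if self.processing_time <= 0:
            return 0.0
        return 100.0 * self.cpu_seconds / self.processing_time
```

For example, a request received by Apache at t=10, handed to the app at t=12,
and finished at t=16 has spent 2 seconds queued and 4 seconds processing.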
The above would all be useful as averages over various time scales, e.g. 1
minute, 5 minutes, an hour, etc., and per application process as well as
per application (either by web server, or for the cluster as a whole).
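
One simple way to get averages over several time scales from the same data is
to keep timestamped samples for the longest window of interest and average
whichever slice is asked for. The sketch below is only one possible approach,
with hypothetical names; a real implementation might prefer fixed-size
buckets to bound memory use.

```python
from collections import deque


class WindowedAverage:
    """Averages one measurement over trailing time windows (hypothetical)."""

    def __init__(self, max_window_seconds):
        self.max_window = max_window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Discard samples too old for even the longest window.
        cutoff = timestamp - self.max_window
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def average(self, now, window_seconds):
        """Mean of samples from the last window_seconds, or None if empty."""
        cutoff = now - window_seconds
        values = [v for t, v in self.samples if t >= cutoff]
        return sum(values) / len(values) if values else None
```

One instance per measurement and per process (or per application) would then
answer "last minute", "last 5 minutes", and "last hour" from a single stream
of completed-request data.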
Other useful measurements would include process size (i.e. memory usage).
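
To show how the scenarios listed earlier might be told apart using these
measurements, here is a hedged sketch of a "who should I call?" check for a
single supervisor. The field names and all thresholds are invented for
illustration. The "database is slow" scenario is left out, since spotting it
means comparing several applications that share a database.

```python
from dataclasses import dataclass


@dataclass
class SupervisorSnapshot:
    """Summary for one supervisor; every field name is hypothetical."""
    parent_alive: bool
    child_count: int
    max_children: int
    child_deaths_last_minute: int
    pending_connections: int
    avg_queue_seconds: float
    avg_cpu_percent: float


def diagnose(s, queue_limit=2.0, cpu_limit=80.0, death_limit=3):
    """Return a rough scenario label. The default limits are made up."""
    if not s.parent_alive and s.child_count == 0:
        return "software error (parent)"
    if s.parent_alive and s.child_deaths_last_minute >= death_limit:
        return "software error (child)"
    if (s.child_count >= s.max_children and s.pending_connections > 0
            and s.avg_queue_seconds > queue_limit):
        return "backlog/heavy load"
    if s.avg_cpu_percent > cpu_limit:
        return "app is chewing CPU"
    return "ok"
```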
Conceptual Model (rough cut)
----------------------------
"Requests" are performed by "processes" which are part of an "application"
and run on a "server" which is part of a "cluster". Each "server" also has
a "supervisor process" for each "application" running on that server.
Most of our measurements are based on "requests", summarized along various
dimensions such as process, application, server, and time. "Requests" may
be in-progress or completed, and a snapshot of in-progress requests'
statistics may be included as part of a point-in-time summary.
Some other measurements, by object type:
* Process: memory usage, age, idle time, time between requests
* Application: number of processes
* Process supervisor: number of child processes
Of course, any per-process statistic may be summarized per supervisor or
per application.
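
To make "summarized along various dimensions" a bit more concrete, here is a
small sketch that groups request records by any combination of dimensions
and averages one statistic. The record layout and the function name are
assumptions for this example only, not a proposed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One completed request, tagged with its dimensions (hypothetical)."""
    cluster: str
    server: str
    application: str
    process_id: int
    processing_time: float


def summarize(records, *dimensions):
    """Average processing time, grouped by the named record attributes."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(getattr(record, name) for name in dimensions)
        groups[key].append(record.processing_time)
    return {key: sum(times) / len(times) for key, times in groups.items()}
```

Calling summarize(records, "server", "application") gives roughly the "per
supervisor" view, while summarize(records, "application") gives the
cluster-wide "per application" view, which assumes the records from every
server have already been collected in one place.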
In general, "per application" statistics are harder to obtain in a cluster
monitoring scenario, and "per supervisor" may be sufficient for most
purposes, as they are easy to obtain from a single machine. But we
should still give some thought as to how we might collect that information
in a central location in the future.
In addition to measurements, there are also actions we may wish to take,
such as:
* Application -- start, shutdown, reload configuration, update software,
put "on hold"
* Process -- hard kill to stop a problem, "friendly kill" to request the
process stop its current request and give the user an error message. (I
can dream, can't I?)
* Supervisor -- same as application, possibly substituting for
per-application capabilities if we don't have a good cross-cluster
mechanism for this.
User Interface
--------------
I'm mostly going to leave the subject of UI for another time. I am
assuming, though, that this will use peak.web, and need security to
distinguish user levels (and possibly distinguish who has rights to what
apps and/or servers). It should depend on as few other resources as
possible though, since if an outage that wipes out an application also
wipes out the monitor, one of your key troubleshooting resources will have
gone AWOL.
Our previous monitor was pure CGI, to avoid dependence on FastCGI, but I'm
not sure we need this any more. One reason was that we had some odd
outages in the past that appeared to be caused by Apache running out of
file handles, perhaps from creating and destroying lots of sockets for
FastCGI child processes. The new process manager should drastically reduce
the opportunity for such a problem, and, more importantly, if Apache
doesn't serve FastCGI requests, the monitor disappearing on that box will
make it bloody obvious that Apache is broken on that box. IIRC, if Apache
breaks in this way currently, the FastCGI processes on that box will simply
appear to be idling "between requests". There's no way to tell from the
monitor that something's wrong.
An interesting side effect, by the way, of running the process monitor
itself as a supervised process, is that it would then be able to monitor
and control itself! Of course, a downside is that you might shut it down,
and then be unable to restart it. :) However, if the monitor is run as a
dynamic FastCGI (such that Apache starts the monitor's supervisor process),
then this is not really an issue.
Another interesting UI possibility is that if applications are persistent
objects referenced by the monitor, then it might be possible to have a
"deployment UI" for configuring new or existing applications. This would
be quite handy in many environments besides our own.