[PEAK] Organizing documentation of Python code
Phillip J. Eby
pje at telecommunity.com
Wed Sep 22 14:13:01 EDT 2004
My personal inclination is to prefer to document Python code in-line,
rather than writing separate documentation, wherever possible. It's easier
to update when you change something, and there's less chance of forgetting
to update things.
But, Python has some limitations when it comes to producing this kind of
documentation, compared to say, Perl. Perl's POD format makes it easy to
use a more "literate programming" style, where you group elements together
logically, interspersing code and documentation to tell the "story" of the
module.
In principle, you can do this with Python, but in practice, it only works
if you read the original source code. This is because both kinds of
documentation generation tools for Python suck at literate
programming. :)
Specifically, the two kinds are object-extracting tools and source-reading
tools. Object-extracting tools like 'pydoc' and 'epydoc' are the most
common. They basically import the module(s) to be documented, and then
inspect the objects to pull out documentation. Their typical flaw is that
they only understand relatively few kinds of objects, and produce ludicrous
results when trying to document PEAK's descriptors, metaclasses,
interfaces, and so on. So, they lack flexibility and extensibility.
A second, perhaps even more fatal flaw of these tools is that they document
things "as-is" from the point of view of Python. By that, I mean that they
are limited by the structure of the objects formed by the code, and do not
see the underlying organization of a module or class. I usually group
related classes together in a module, and related methods or attributes in
a class. This information is lost when you extract data from a module or
class dictionary, instead of reading the source code.
But source-reading tools are few and far between, because you have to be
able to write a parser or figure out how to use Python's internal parser in
order to write one. Really, HappyDoc is the only practical source-reading
tool out there that I know of. But even though it reads the source, it
*also* discards the ordering information, and doesn't understand
descriptors or interfaces or metaclasses.
Finally, neither kind of tool understands PEAK's API organization, where
users are generally discouraged from directly importing from a package's
contained modules. In PEAK, we don't necessarily want to document where an
API is coded, but rather where it should be used from.
For a long time, I've considered writing a documentation tool, but I've
always been hung up on the issue of writing a suitable source parser,
because I assumed that was the only way to get the sequence/grouping
information needed. But, a source-reading tool is also inherently limited
in flexibility by what it can parse. So, I've never really taken any steps
to implement it.
What I realized from today's exchange with Duncan about a "quick reference"
generator, is that there's a third alternative: create a documentation API
that's invoked within a module, to bypass the limitations of today's
object-extracting tools.
Specifically, Python module and class dictionaries don't preserve sequence
or grouping, and some objects currently can't be documented. (For example,
you can't document integer or string constants in a module, except by
putting text in the module docstring.) Last, but not least, such tools
localize documentation in the "defining" module, rather than the
"exporting" module (e.g. PEAK's API modules), and can't handle esoteric
relationships such as "X implements Y" or "A requires B".
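
To make the "undocumentable constant" problem concrete, here is a tiny illustration. The name MAX_RETRIES is made up for the example; the behavior shown is that of plain Python integers.

```python
MAX_RETRIES = 3

# The constant has no docstring of its own; it reports the one for int.
shares_type_doc = MAX_RETRIES.__doc__ == int.__doc__

# Nor can a docstring be attached to it after the fact.
try:
    MAX_RETRIES.__doc__ = "Upper bound on reconnection attempts."
    attachable = True
except AttributeError:
    attachable = False
```

So an object-extracting tool that only looks at the object has nothing specific to report for such a name.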
So here's the idea: create a "documentation kernel" API that does the
following things:
* Allows you to register docstrings for otherwise-undocumentable items in a
namespace
* Allows you to create categorized indexes of a namespace's contents
* Allows you to nest indexes (e.g. class within a module)
* Allows you to record arbitrary metadata and relationships between items
(using synthetic keys to represent the objects, to avoid object lifetime/GC
issues)
* Allows you to direct the documentation for a given namespace to actually
appear in another namespace, while retaining identity as to the defining
namespace
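
As a rough illustration of the list above, a kernel of this kind might look something like the following. This is only a sketch under my own assumptions: DocIndex, its methods, and the mypkg names are all hypothetical and do not correspond to any existing PEAK API.

```python
class DocIndex:
    """Hypothetical documentation index for one namespace (module or class)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.docs = {}           # item name -> docstring
        self.categories = {}     # category -> item names, in registration order
        self.children = {}       # nested indexes, e.g. a class within a module
        self.relations = []      # (subject key, relation, target key) triples
        self.exported_from = {}  # item name -> key of the defining namespace

    def key(self, item=None):
        """Return a synthetic dotted-name key, so no live object is retained."""
        prefix = self.name if self.parent is None else self.parent.key(self.name)
        return prefix if item is None else prefix + "." + item

    def document(self, item, text, category="uncategorized"):
        """Register a docstring for an item and file it under a category."""
        self.docs[item] = text
        self.categories.setdefault(category, []).append(item)

    def nested(self, name):
        """Return the child index for a nested namespace, creating it if needed."""
        if name not in self.children:
            self.children[name] = DocIndex(name, parent=self)
        return self.children[name]

    def relate(self, subject, relation, target):
        """Record an arbitrary relationship between two synthetic keys."""
        self.relations.append((subject, relation, target))

    def export(self, item, source):
        """Make an item defined in `source` appear here, remembering its origin."""
        self.docs[item] = source.docs[item]
        self.exported_from[item] = source.key()


# Hypothetical usage: an implementation module and the API module that exports it.
impl = DocIndex("mypkg.impl")
impl.document("NOT_FOUND", "Sentinel returned when a lookup fails.", "constants")
widget = impl.nested("Widget")
widget.document("size", "Width and height in pixels.", "attributes")
impl.relate(widget.key(), "implements", "mypkg.interfaces.IWidget")

api = DocIndex("mypkg.api")
api.export("NOT_FOUND", impl)
```

Because categories are plain lists filled in call order, the sequence and grouping chosen by the author survive, which is exactly what a module or class dictionary could not be relied on to give us.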
Then, build a "convenience API" that lets you easily use short function
calls in a module body to classify the module or class contents, attach doc
strings to attributes and constants, indicate that certain items are
part of a given interface's implementation, and so on.
Then, expand that convenience API by *invoking it from PEAK's own APIs*,
so that documentation for a PEAK component can include, for example, what
configuration keys it offers or requires. In other words, when a PEAK
component is defined, it would invoke the documentation API to record this
sort of metadata.
Because this approach is based on a simple kernel API that just records
information, it should be easy to add in to any new APIs. For example,
peak.security APIs could record security documentation, and so on.
Finally, we could then write documentation formatters that simply take
data from the documentation indexes, and output it in different
formats. Such formatters would need to be configurable as to what metadata
they extract, of course, and what they do with it, especially for
relationship data (e.g. inheritance trees).
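
A formatter in this sense could be very small. The sketch below assumes, purely for illustration, that an index is a dictionary mapping a category name to a list of (name, docstring) pairs; format_text and its categories parameter are hypothetical.

```python
def format_text(index, categories=None):
    """Render selected categories of an index as indented plain text."""
    lines = []
    for category, entries in index.items():
        # The configuration decides which metadata is extracted at all.
        if categories is not None and category not in categories:
            continue
        lines.append(category.upper())
        for name, text in entries:
            lines.append("  " + name + ": " + text)
    return "\n".join(lines)


# Hypothetical index contents.
index = {
    "constants": [("MAX_RETRIES", "Upper bound on reconnection attempts.")],
    "functions": [("connect", "Open a link to the server.")],
}
quick_reference = format_text(index, categories={"constants"})
```

An HTML or reStructuredText formatter would walk the same data and differ only in what it emits.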
For some objects, it's probably better not to insert documentation API
calls into them directly, but instead to process them only when
documentation is needed. I think that to do this, you'd basically go
through all the modules you were documenting, and adapt each target object
to a "documentation contributor" interface, and ask the object to update
the appropriate indexes.
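
A sketch of that "adapt and ask" step follows. It uses a plain type-keyed registry as a stand-in for real adaptation, and every name in it (contributor_for, adapt, update_index, build_index) is hypothetical; the index is just a dictionary for the sake of the example.

```python
import types

_CONTRIBUTORS = {}  # type -> factory producing a contributor for that type


def contributor_for(kind):
    """Register a contributor factory for objects of the given type."""
    def register(factory):
        _CONTRIBUTORS[kind] = factory
        return factory
    return register


def adapt(obj):
    """Find a contributor for obj by walking the MRO of its type."""
    for base in type(obj).__mro__:
        factory = _CONTRIBUTORS.get(base)
        if factory is not None:
            return factory(obj)
    return None


@contributor_for(types.FunctionType)
class FunctionContributor:
    def __init__(self, func):
        self.func = func

    def update_index(self, index):
        entry = (self.func.__name__, self.func.__doc__ or "")
        index.setdefault("functions", []).append(entry)


@contributor_for(type)
class ClassContributor:
    def __init__(self, cls):
        self.cls = cls

    def update_index(self, index):
        entry = (self.cls.__name__, self.cls.__doc__ or "")
        index.setdefault("classes", []).append(entry)
        # Relationships are read off the live objects, only when asked.
        for base in self.cls.__bases__:
            relation = (self.cls.__name__, "inherits", base.__name__)
            index.setdefault("relations", []).append(relation)


def build_index(namespace):
    """Ask every adaptable object in a namespace to update the index."""
    index = {}
    for obj in namespace.values():
        contributor = adapt(obj)
        if contributor is not None:
            contributor.update_index(index)
    return index


# Hypothetical objects to document.
def greet():
    """Say hello."""


class Base:
    """A base class."""


class Child(Base):
    """A derived class."""


index = build_index({"greet": greet, "Base": Base, "Child": Child})
```

Objects with no registered contributor are simply skipped, so nothing has to be inserted into them ahead of time.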
Indeed, that approach seems particularly useful for documenting object
relationships, since the extent of a given set of relationships can't be
known until all the relevant objects have been imported. (For example, the
fact that Adapter X adapts from type Y to interface Z can't be documented
unless X, Y, and Z are all in memory.) Also, using the actual objects to
manage relationships would mean we wouldn't need synthetic keys, or
otherwise have to deal with relationship data unless a documentation tool
is actually being run.
Also, if we used generic functions instead of adapters to do this
generation, it would be possible to define optional new kinds of
metadata. For example, one could add an extra generation rule
to scan a function's bytecode or source code for exceptions raised, and
then create a link between the function object and the exception object in
the relationship index. However, the documentation formatter's
configuration could determine whether this additional rule would ever be
used, e.g.:
    [when("cfg.allows('function-raises') and item in FunctionType")]
    def generate_doc(cfg, item, index):
        # etc.
...thus avoiding doing complex generation of unneeded data.
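
For readers who want something runnable, here is a rough approximation of that idea using ordinary predicates instead of a real generic-function library. Everything here is an assumption made for illustration: the when decorator is a simple stand-in, cfg is just a set of enabled rule names, and the "raises" detection is a crude heuristic that reports exception classes a function merely refers to by name, not ones it provably raises.

```python
import builtins
from types import FunctionType

_RULES = []  # (predicate, rule) pairs, tried in registration order


def when(predicate):
    """Register a generation rule guarded by a predicate on (cfg, item)."""
    def register(rule):
        _RULES.append((predicate, rule))
        return rule
    return register


def generate_doc(cfg, item, index):
    """Run every rule whose predicate accepts this configuration and item."""
    for predicate, rule in _RULES:
        if predicate(cfg, item):
            rule(cfg, item, index)


@when(lambda cfg, item: "function-raises" in cfg and isinstance(item, FunctionType))
def document_raises(cfg, item, index):
    # Resolve the names used by the bytecode against globals and builtins.
    scope = dict(vars(builtins))
    scope.update(item.__globals__)
    for name in item.__code__.co_names:
        target = scope.get(name)
        if isinstance(target, type) and issubclass(target, BaseException):
            index.append((item.__name__, "may raise", target.__name__))


# Hypothetical function to document.
def parse_port(text):
    if not text.isdigit():
        raise ValueError("not a port: " + text)
    return int(text)


enabled, disabled = [], []
generate_doc({"function-raises"}, parse_port, enabled)
generate_doc(set(), parse_port, disabled)
```

With the rule switched off in the configuration, the predicate fails first and the bytecode is never inspected.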
Hm. Really, I guess the pre-generation API only needs to deal with data
that can't reasonably be extracted after-the-fact, like sequencing,
grouping, and undocumentables.
Anyway, the basic idea here is that I think this is a way to get some of
the features in a documentation system, that I previously thought could
only be obtained by parsing source code. I'm going to have to give some
more thought to the specifics of the indexing scheme, particularly with
respect to how categories should be grouped at various levels, what are
global (systemwide) vs. local (package-specific) categories, and so on. To
do this, I'll need to think about different kinds of output we'd want to
have, and then determine what kind of indexes are needed to support them.
I think it's also possible we'll want a way to incorporate external text
files into a larger documentation scheme, such that a combination
developer's guide and API reference could be generated from the code plus
the external files.
Such a tool would be useful not only for PEAK itself, but also for applications
developed with PEAK. After all, an "enterprise" toolkit ought to make it
easy for teams within an organization to share and maintain code.
Of course, as I mentioned earlier, none of these ideas should stop us from
getting a quick-and-dirty "quick reference" generator up and running,
especially since I have so many other items before this one on my to-do
list. I just wanted to record this idea for future reference.