Parsing and formatting based on production rules
This module provides some special parsing and formatting features not found
in other Python parsing libraries. Specifically, it:

* supports reversing the production process to convert parsed output back
  to a canonical version of the input
* supports "smart tokenizing" based on contextually-available delimiters
* does not separate "lexing" or "tokenizing" from "parsing", so lexical
  analysis can be parse-context dependent
* is designed to accommodate parsing things other than strings (e.g. object
  streams, SAX event lists, ...?)
* allows the definition of arbitrary "rule", "input" and "state" objects
  that can be fitted into the framework to handle specialized input types,
  context passing, etc.

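The first two features in the list above can be illustrated with a small, self-contained sketch. It does not use this module's classes: Literal, Field, Sequence, parse_text and format_data are hypothetical names invented for the illustration. Each rule knows how to parse and how to format, and a field ends at whatever delimiter the surrounding sequence expects next, so no per-field regular expression is needed.

```python
# Illustrative sketch only; these classes are NOT the module's real API.

class Literal:
    """Matches a fixed string; formatting writes it back unchanged."""
    def __init__(self, text):
        self.text = text

    def parse(self, s, pos, out):
        if s.startswith(self.text, pos):
            return pos + len(self.text)
        raise ValueError("expected %r at position %d" % (self.text, pos))

    def format(self, data, write):
        write(self.text)


class Field:
    """Named field that runs up to a delimiter supplied by its context."""
    def __init__(self, name, canonical=str):
        self.name = name
        self.canonical = canonical  # normalizes the parsed text

    def parse(self, s, pos, out, terminators=""):
        end = pos
        while end < len(s) and s[end] not in terminators:
            end += 1
        out[self.name] = self.canonical(s[pos:end])
        return end

    def format(self, data, write):
        write(data[self.name])


class Sequence:
    """Applies sub-rules in order, for parsing and for formatting."""
    def __init__(self, *rules):
        self.rules = rules

    def parse(self, s, pos, out):
        for i, rule in enumerate(self.rules):
            if isinstance(rule, Field):
                # Context-dependent tokenizing: the field stops at the
                # first character of the literal that follows it, if any.
                following = self.rules[i + 1:i + 2]
                terminators = ""
                if following and isinstance(following[0], Literal):
                    terminators = following[0].text[:1]
                pos = rule.parse(s, pos, out, terminators)
            else:
                pos = rule.parse(s, pos, out)
        return pos

    def format(self, data, write):
        for rule in self.rules:
            rule.format(data, write)


def parse_text(text, rule):
    out = {}
    end = rule.parse(text, 0, out)
    if end != len(text):
        raise ValueError("unparsed input at position %d" % end)
    return out


def format_data(data, rule):
    parts = []
    rule.format(data, parts.append)
    return "".join(parts)


syntax = Sequence(Field("host", canonical=str.lower), Literal(":"), Field("port"))
data = parse_text("Example.COM:8080", syntax)
canonical = format_data(data, syntax)
```

In this sketch the same syntax object drives both directions, so formatting the parsed data yields the canonical spelling of the input rather than the spelling that was typed.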
The first three features were critical requirements for PEAK's URL-parsing
tools. We wanted to make it super-easy to create robust URL syntaxes that
would produce canonical representations of URLs from input data as well as
sensibly parse input strings. And part of "super-easy" meant not having
to write bazillions of regular expressions to parse every field in a URL.
Limitations:

* The framework isn't designed for "formatting" to non-strings.
  Specifically, most rules assume that their sub-rules will only
  write() things that can be joined with "".join() when formatting.
* Some parts of the framework may not be 100% Unicode-safe, even if a
  UnicodeInput type were implemented. Code review and patches appreciated.

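The first limitation can be seen in a minimal sketch of the collect-then-join pattern it describes. The helper and the two writer functions below are hypothetical and are not part of this module; they only show why a sub-rule that writes a non-string breaks formatting.

```python
# Illustrative sketch only; format_sequence and the writers are hypothetical.

def format_sequence(sub_formatters, data):
    """Collect each sub-rule's output with write(), then join it as text."""
    parts = []
    for fmt in sub_formatters:
        fmt(data, parts.append)
    return "".join(parts)  # requires every written item to be a string


def write_scheme(data, write):
    write(data["scheme"] + ":")


def write_port_as_int(data, write):
    write(data["port"])  # writes an int, which "".join() cannot handle


ok = format_sequence([write_scheme], {"scheme": "http"})

try:
    format_sequence([write_scheme, write_port_as_int],
                    {"scheme": "http", "port": 80})
    failed = False
except TypeError:
    failed = True
```

Under this pattern, a rule that produces non-string values would need to convert them to strings before calling write().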
TODO:

* Docstrings, docstrings, docstrings... and a test suite!
* ParseError should provide line/column info about the error location, not
  just offset, and it should be provided by input.error() rather than by
  the rule signalling the error. Of course, all the rules should be
  calling input.error() instead of creating ParseError instances...
* Perform timing tests and investigate parsing speedups for:
  - "Compiling" rules to regular expressions + postprocessors
  - "Optimizing" rules (e.g. convert Optional(user, "@") to something
    that forward-scans for "@" before trying to match user).
  - Moving speed-critical parts to C

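As a rough illustration of the line/column item in the TODO list above, the following sketch derives a 1-based line and column from a character offset. The names line_and_column and LocatedParseError are hypothetical; this is not how the module's ParseError or input.error() currently behave, only one way the information could be computed for string input.

```python
# Illustrative sketch only; these names are not part of the module.

def line_and_column(text, offset):
    """Return 1-based (line, column) for a 0-based character offset."""
    line = text.count("\n", 0, offset) + 1
    last_newline = text.rfind("\n", 0, offset)
    # rfind() returns -1 when no newline precedes the offset, so the
    # subtraction also gives the right column on the first line.
    column = offset - last_newline
    return line, column


class LocatedParseError(Exception):
    """Hypothetical error type carrying offset, line and column."""
    def __init__(self, message, text, offset):
        line, column = line_and_column(text, offset)
        Exception.__init__(
            self, "%s at line %d, column %d" % (message, line, column)
        )
        self.offset = offset
        self.line = line
        self.column = column


err = LocatedParseError("Expected EOF", "ab\ncd", 4)
```

Computing the position inside the input object, as the TODO suggests, would let non-string inputs report locations in whatever terms make sense for them.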
Imported modules
from peak.binding.once import Make
from peak.util.symbols import NOT_GIVEN
import re
Functions
format
parse
uniquechars
format
format ( aDict, syntax )
parse
parse ( input, rule )
Exceptions
ParseError("Expected EOF, found %r at position %d in %r" % (input[state:], state, input))
state
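The exception message listed above suggests that parse() rejects input that the rule does not consume completely. The sketch below shows that kind of end-of-input check in isolation. It is not the module's implementation: the ParseError class here is a stand-in, parse_all and match_key_value are hypothetical names, and returning a dict is an assumption made for the example.

```python
# Illustrative sketch only; not the module's parse() implementation.
import re


class ParseError(Exception):
    """Stand-in error type; only the message format comes from the docs."""


def parse_all(text, match_prefix):
    """Run match_prefix(text) and insist that it consumed the whole input.

    match_prefix is a hypothetical callable returning (pairs, end_offset).
    """
    pairs, state = match_prefix(text)
    if state != len(text):
        raise ParseError(
            "Expected EOF, found %r at position %d in %r"
            % (text[state:], state, text)
        )
    return dict(pairs)  # assumption: results are returned as a dict


def match_key_value(text):
    """Hypothetical prefix matcher for a leading key=value pair."""
    m = re.match(r"(\w+)=(\w+)", text)
    if m is None:
        raise ParseError("Expected key=value at position 0 in %r" % (text,))
    return [("key", m.group(1)), ("value", m.group(2))], m.end()


result = parse_all("a=1", match_key_value)

try:
    parse_all("a=1;junk", match_key_value)
    message = None
except ParseError as exc:
    message = str(exc)
```

Here the trailing ";junk" is reported together with its offset, which mirrors the information carried by the message format shown in the Exceptions entry.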
uniquechars
uniquechars ( s )
Classes