Using SCALE's DSL Parsing Library

>>> from scale import dsl

Tokenizing Text

Most of the DSL parsing API operates on sequences or iterators of tokens, as generated by the standard library tokenize module. You can use that module directly, or you can use these convenience functions to do the tokenizing:

tokenize_string(text)
Yield the tokens of text
tokenize_stream(file)
Yield the tokens found in the open iterable stream file
tokenize_file(filename)
Open filename for text reading, and yield its tokens

All of these functions support source encoding comments and BOM markers as prescribed by PEP 263. However, if you supply a unicode string to tokenize_string() or a unicode stream to tokenize_stream(), any PEP 263 source encoding information will be ignored, since it is assumed that you have already done any necessary decoding.

Example usage:

>>> from tokenize import tok_name

>>> [(tok_name[t],v) for t,v,s,e,line in dsl.tokenize_string("1+2")]
[('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('ENDMARKER', '')]
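
tokenize_stream() accepts anything you can iterate over line by line. Here's a minimal sketch that feeds it a StringIO object in place of a real file; the StringIO usage is just for illustration, and the resulting token tuples should match those produced by tokenize_string():

>>> from StringIO import StringIO
>>> stream_tokens = list(dsl.tokenize_stream(StringIO("1+2")))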

Converting Tokens Back to Text

The detokenize() function converts an iterable of tokens back into a string:

>>> print dsl.detokenize(dsl.tokenize_string("1+2"))
1+2
>>> print dsl.detokenize(dsl.tokenize_string("1+ 2   #foo"))
1+ 2   #foo

The resulting string will have every token on its original line in the input:

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+  \\
...    \\
... 2"""))
1+  \
\
2

But the tokens will be shifted to the left such that the first non-whitespace, non-comment token is in the first column of the output:

>>> print dsl.detokenize(dsl.tokenize_string("""
...     print '''foo
...     bar''' + '''spam
...     baz''';"""))
<BLANKLINE>
print '''foo
    bar''' + '''spam
    baz''';

unless you use the optional indent parameter to change the default indentation:

>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=2)
'  print foo'
>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=4)
'    print foo'

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+  \\
...    \\
... 2"""), indent=4)
    1+  \
    \
    2

But note that re-indentation doesn't affect the contents of multi-line strings, such as docstrings. That is, after reindenting, the string has the same value it did before, even if it makes the contents of the string look odd:

>>> print dsl.detokenize(dsl.tokenize_string("""\
...         # a comment that's oddly indented - or is it?
...     def x():
...         '''more than one
...            line in the docstring'''
... """), indent=12)
            # a comment that's oddly indented - or is it?
            def x():
                '''more than one
           line in the docstring'''
...

Notice also that any comments occurring before the first non-whitespace token in the token stream are formatted flush left to the indent column, regardless of their position in the input. (This is because detokenize() doesn't know how far to offset the input lines from their starting positions until it encounters a non-comment token.)

You can strip whitespace tokens like indents, dedents, comments, and newlines from a token list with strip_ws(), which yields all the non-whitespace tokens in a sequence:

>>> dsl.detokenize(dsl.strip_ws(dsl.tokenize_string("123 #xyz")))
'123'

strip_ws() is intended to make parsing individual statements easier. But you should not use it on token streams that span more than one logical line, because the NEWLINE whitespace token separates logical lines, and the INDENT and DEDENT tokens are used to identify blocks. With these tokens removed, parsing blocks into statements and nested blocks becomes impossible. Therefore, if you are doing block parsing you should only strip whitespace from individual statements, not from the input to parse_block().
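
For example, here is a minimal sketch of a helper (not part of the dsl API) that strips whitespace from each individual statement of an already-parsed block, while leaving the block structure itself intact:

>>> def strip_statements(block):
...     # Rebuild the (statement, block) tree with whitespace tokens removed
...     # from each statement; nested blocks are processed recursively, so
...     # the block boundaries themselves are preserved.
...     return [
...         (list(dsl.strip_ws(stmt)), strip_statements(blk))
...         for stmt, blk in block
...     ]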

Splitting Token Sequences

Many parsing operations need to split a sequence of tokens into subsequences based on the presence of a particular token. For this, the dsl module provides two tokenlist-splitting functions:

partition(tokens, sep)

Return a 3-tuple (before, sep, after), such that before is a list of the tokens occurring before the first occurrence of sep in tokens, and after is an iterator that will yield the portion of tokens that is after the separator. (This means that you can keep partitioning the "after" portion without incurring an O(N^2) performance penalty for copying the same list items over and over.)

If the separator is found, the returned sep is a 1-element list containing the actual token that matched sep. If the separator is not found, the returned sep is an empty list, and before will contain all of tokens.

rpartition(tokens, sep)

This is just like partition(), except that the split is done at the last occurrence of sep in tokens instead of the first, and all three return values are lists, rather than two lists and an iterator. (Note that this means repeated right-partitioning is an O(N^2) operation, so you should try to use partition() if you need to keep repartitioning the before value.)

The sep value for these functions can be a string, in which case the token value must exactly match that string, or else it can be one of the tokenize module constants like tokenize.OP or tokenize.NAME, in which case it will match any token of that type.

Examples:

>>> before, sep, after = dsl.partition(dsl.tokenize_string("1+2"), "+")
>>> print dsl.detokenize(before)
1
>>> print dsl.detokenize(sep)
+
>>> after
<generator object at ...>

>>> print dsl.detokenize(after)
2

>>> before, sep, after = dsl.partition(dsl.tokenize_string("1*2"), "+")
>>> print dsl.detokenize(before)
1*2
>>> sep     # empty if not found
[]
>>> list(after)     # remember, it's an iterator
[]

>>> from tokenize import OP     # now let's split on any operator
>>> before, sep, after = dsl.partition(dsl.tokenize_string("1*2"), OP)
>>> print dsl.detokenize(before)
1
>>> print dsl.detokenize(sep)
*
>>> print dsl.detokenize(after)
2

>>> before, sep, after = dsl.rpartition(
...     dsl.tokenize_string("class when class foo"), "class"
... )
>>> before      # all three are lists
[(...'class'...), (...'when'...)]
>>> sep
[(...'class'...)]
>>> after
[(...'foo'...)]

>>> before, sep, after = dsl.rpartition(
...     dsl.tokenize_string("class when class foo"), "spam"
... )   # now let's try a not-found separator

>>> before
[]
>>> sep
[]
>>> dsl.detokenize(after)
'class when class foo'
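
Because the "after" value from partition() is an iterator, you can keep partitioning it without copying. Here's a minimal sketch (not part of the dsl API) of a splitter built on that idea, breaking a token sequence into pieces around every occurrence of a separator:

>>> def split_all(tokens, sep):
...     # Split 'tokens' into a list of token lists, one piece for each
...     # stretch of tokens between occurrences of 'sep'.  Repartitioning
...     # the iterator returned as "after" keeps this roughly O(N) overall.
...     pieces = []
...     while True:
...         before, found, tokens = dsl.partition(tokens, sep)
...         pieces.append(before)
...         if not found:   # an empty list means the separator wasn't found
...             return pieces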

Block Parsing

The parse_block(tokens) function turns an iterable of tokens into a block, which is a list of statements and the blocks that appear indented under those statements. More specifically, it is a list of two-item "(statement, block)" tuples, where statement is a list of the tokens representing a single statement, and block is a (possibly-empty) nested list of "(statement, block)" pairs:

>>> dsl.parse_block(dsl.tokenize_string("1+2"))
[([(...'1'...), (...'+'...), (...'2'...)], [])]

Blocks can be flattened back into a token sequence using flatten_block(), so you can then detokenize the result back into a string:

>>> print dsl.detokenize(
...     dsl.flatten_block(dsl.parse_block(dsl.tokenize_string("1+2")))
... )
1+2

Thus, you can parse a file into a block, then traverse the statement tree and turn sub-blocks back into strings at whatever indentation level you like. This is especially useful for creating parser generators or other tools that implement a high-level language with embedded blocks of Python code that must be incorporated into their output.
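
As a minimal sketch of that kind of usage (the helper below is illustrative only, not part of the dsl API), here is a function that looks for the first top-level statement whose whitespace-stripped text starts with a given prefix and returns its nested block as flush-left source text:

>>> def extract_block(block, prefix, indent=0):
...     # Scan the top-level statements of 'block' for one whose stripped
...     # text starts with 'prefix', and return its nested block's source
...     # text, re-indented to the requested column.
...     for stmt, blk in block:
...         if dsl.detokenize(dsl.strip_ws(stmt)).startswith(prefix):
...             return dsl.detokenize(dsl.flatten_block(blk), indent)
...     return None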

Using Statements

In addition to creating a tree of blocks and statements, parse_block() also turns each statement into a tree of nested subexpressions, so that parentheses, brackets, and braces are paired and their contents nested into a single token:

>>> block = dsl.parse_block(dsl.tokenize_string("foo(bar[baz])"))
>>> stmt,block = block[0]   # get the first statement

>>> [(tok_name[t],v) for t,v,s,e,line in stmt]
[('NAME', 'foo'), ('SUBEXPR', [...])]

>>> stmt[1][0] == dsl.SUBEXPR
True

As you can see, the statement consists of a NAME token and a SUBEXPR pseudo-token (whose token type is the numeric constant dsl.SUBEXPR). Instead of a string in the normal position for the token's value, there's a list. Let's expand it:

>>> subexpr = stmt[1][1]    # get the nested token tree
>>> print dsl.detokenize(subexpr)
(bar[baz])

>>> [(tok_name[t],v) for t,v,s,e,line in subexpr]
[('OP', '('), ('NAME', 'bar'), ('SUBEXPR', [...]), ('OP', ')')]

As you can see, the parentheses surrounding the subexpression are contained in the nested token list. And there's still another nested subexpression:

>>> subexpr = subexpr[2][1]
>>> print dsl.detokenize(subexpr)
[baz]

>>> [(tok_name[t],v) for t,v,s,e,line in subexpr]
[('OP', '['), ('NAME', 'baz'), ('OP', ']')]

This process of nesting subexpressions makes it easier to parse statements by looking for keywords or punctuation that may have different meaning when found in a subexpression. For example, the ":" operator in Python has different meaning in a dictionary display than it does at the top level of a statement, where it might introduce a block or lambda expression. However, if for some reason you want to flatten out the nesting, you can use flatten_stmt():

>>> [v for (t,v,s,e,line) in dsl.flatten_stmt(stmt)]
['foo', '(', 'bar', '[', 'baz', ']', ')']
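
A useful side effect of this nesting is that partition() will only split a statement at a separator that appears at the statement's top level: a SUBEXPR token's value is a nested list rather than a string, so a ":" (or any other separator) buried inside parentheses, brackets, or braces can never match. Here is a minimal sketch (not part of the dsl API) that relies on this to split a statement at its top-level ":":

>>> def split_at_top_level_colon(stmt):
...     # Colons nested inside SUBEXPR tokens can't match the separator
...     # string, so only a top-level ":" splits the statement here.
...     before, colon, after = dsl.partition(stmt, ":")
...     return before, colon, list(after)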

Using Blocks

Now that we've seen what a statement looks like inside, let's look at blocks in more detail:

>>> block = dsl.parse_block(dsl.tokenize_string("""\
... def foo():
...     pass
... def bar(baz,spam):
...     whee()
... """))

The block we just parsed has two statements in it:

>>> len(block)
2

We'll print them, stripping whitespace so that they don't end with line feeds:

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.strip_ws(stmt))
def foo():
def bar(baz,spam):

Now let's print the blocks nested beneath the two statements, indenting them to match their original positions:

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.flatten_block(blk), indent=4)
    pass
...
    whee()
...

We can also print the whole block by flattening and detokenizing it:

>>> print dsl.detokenize(dsl.flatten_block(block))
def foo():
    pass
def bar(baz,spam):
    whee()
...

Now, let's create a simple code reformatter that realigns a block and its children to a uniform indentation width:

>>> def reindent(block, indent_by=4, start=0):
...     out = []
...     for stmt,blk in block:
...         out.append(dsl.detokenize(stmt, indent=start))
...         if blk:
...             out.append(reindent(blk, indent_by, start+indent_by))
...     return ''.join(out)

>>> print reindent(block)
def foo():
    pass
def bar(baz,spam):
    whee()
...

>>> print reindent(block, 1)
def foo():
 pass
def bar(baz,spam):
 whee()
...

>>> print reindent(block, 7, 3)
   def foo():
          pass
   def bar(baz,spam):
          whee()
...

Parsing Declarations

The SCALE language consists entirely of "declaration" statements. A declaration is an expression with optional assignment prefixes, an optional "context" clause, and an optional nested block. Here are some examples of valid SCALE declarations:

>>> samples = dsl.parse_block(dsl.tokenize_string("""
...
... foo(gee):
...     zing from Zang
...     bar = baz = squidge() from spammity>=1.2:
...         answer = 42
...
... "whiz" = "oh boy" = 123 * 456
...
... something = me from:
...     me = 789
...
... """))

The parse_declarations() function yields 4-element tuples containing the assigned names, value expression, context clause, and nested block for each statement in a given block:

>>> def print_decl(names, expr, context, block):
...     if names:
...         print "Assigned to:", names
...     print "Expression:", `dsl.detokenize(expr)`
...     if context is not None:
...         print "Context:", `dsl.detokenize(context)`
...     if block:
...         print "Block:"
...         print dsl.detokenize(dsl.flatten_block(block),8).rstrip()
...     print

>>> for names, expr, context, block in dsl.parse_declarations(samples):
...     print_decl(names, expr, context, block)
...
Expression: 'foo(gee)'
Block:
        zing from Zang
        bar = baz = squidge() from spammity>=1.2:
            answer = 42
...
Assigned to: ['whiz', 'oh boy']
Expression: '123 * 456'
...
Assigned to: ['something']
Expression: 'me'
Context: ''
Block:
        me = 789
...

>>> for names,expr,context,block in dsl.parse_declarations(samples[0][1]):
...     print_decl(names, expr, context, block)
...
Expression: 'zing'
Context: 'Zang'
...
Assigned to: ['bar', 'baz']
Expression: 'squidge()'
Context: 'spammity>=1.2'
Block:
        answer = 42
...

As you can see, the context of a declaration is either a token list or None, depending on whether a "from" clause appeared in the declaration. names is a (possibly empty) list of the names to which the expression is being assigned, with quoted strings being treated as if they were normal identifiers. expr, meanwhile, is the token list for the declaration's value expression, and block is the nested block beneath the statement, if any. The expr and context token lists are stripped of whitespace tokens at the top level, and do not contain the trailing ':' if one was present at the end of the declaration.
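
Since each nested block is itself a block of declarations, you can process a whole SCALE source tree by recursing into each declaration's block. Here is a minimal sketch of such a walker (illustrative only, not part of the dsl API):

>>> def walk_declarations(block, depth=0):
...     # Yield (depth, names, expr, context) for each declaration in
...     # 'block', then recurse into the block nested under it (if any).
...     for names, expr, context, blk in dsl.parse_declarations(block):
...         yield depth, names, expr, context
...         for item in walk_declarations(blk, depth+1):
...             yield item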

If a parsed line is not a valid declaration, a TokenError is raised. Valid declarations must end with a ':' if and only if an indented block follows:

>>> def try_parsing(s):
...     list(
...         dsl.parse_declarations(dsl.parse_block(dsl.tokenize_string(s)))
...     )

>>> try_parsing("xyz:")
Traceback (most recent call last):
  ...
TokenError: ("Expected indented block following ':'", (1, 4))

>>> try_parsing("xyz\n  foo")
Traceback (most recent call last):
  ...
TokenError: ("Expected ':' before indented block", (1, 3))

And the "from" clause of a declaration can only be empty if it introduces a block:

>>> try_parsing("xyz from:\n  foo")     # okay
>>> try_parsing("xyz from 123")         # also okay
>>> try_parsing("xyz from")             # not okay
Traceback (most recent call last):
  ...
TokenError: ("Expected context or ':' after 'from'", (1, 8))

It's also not valid to omit the expression if the declaration assigns to any names:

>>> try_parsing("from foo")         # okay
>>> try_parsing("foo = from bar")   # not okay
Traceback (most recent call last):
  ...
TokenError: ('Expected expression', (1, 5))
>>> try_parsing("foo =")   # also not okay
Traceback (most recent call last):
  ...
TokenError: ('Expected expression', (1, 5))

It's also an error to assign to something that's not an identifier or quoted string:

>>> try_parsing("abc = '123' = 456")    # okay, assigns 'abc' and '123'
>>> try_parsing("123 = 456")            # can't assign to number!
Traceback (most recent call last):
  ...
TokenError: ("Expected name or string before '='", (1, 4))

It's important to note, however, that parse_declarations() does not prescribe or guarantee any particular syntax for the expr and context clauses. So, they are not guaranteed to be valid Python expressions. This means that you can create SCALE dialects with different expression or context syntax, but it also means that you should be prepared for the possibility of syntax errors within the expression or context clause.
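
If your dialect does expect standard Python syntax in those clauses, one simple approach (just a sketch, using only the built-in compile() function) is to check each clause yourself after detokenizing it:

>>> def is_python_expr(tokens):
...     # parse_declarations() makes no promise that the clause is valid
...     # Python, so try compiling the detokenized text as an expression.
...     try:
...         compile(dsl.detokenize(tokens), '<expr>', 'eval')
...     except SyntaxError:
...         return False
...     return True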

