Using SCALE's DSL Parsing Library

>>> from scale import dsl

Tokenizing Text

Most of the DSL parsing API operates on sequences or iterators of tokens, as generated by the standard library tokenize module. You can use that module directly, or you can use these convenience functions to do the tokenizing:

tokenize_string(text)
    Yield the tokens of text
tokenize_stream(file)
    Yield the tokens found in the open iterable stream file
Open filename for text reading, and yield its tokens

All of these functions support source encoding comments and BOM markers as prescribed by PEP 263. However, if you supply a unicode string to tokenize_string() or a unicode stream to tokenize_stream(), any PEP 263 source encoding information will be ignored, because it is assumed that you have already done any necessary decoding.
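For comparison, the standard library's tokenize module draws the same line between bytes and already-decoded text. The sketch below is not SCALE's implementation. It assumes Python 3's tokenize module (the examples in this document use Python 2 syntax) and only illustrates that an encoding comment is honored for byte input but treated as an ordinary comment for text input.

```python
import io
import tokenize

# Source bytes carrying a PEP 263 encoding comment on the first line.
source = b"# -*- coding: latin-1 -*-\nx = 1\n"

# detect_encoding() reads at most two lines and returns the normalized
# encoding name plus the raw lines it consumed while looking for it.
encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)

# tokenize.tokenize() takes bytes and reports the encoding as its first token.
first_from_bytes = next(tokenize.tokenize(io.BytesIO(source).readline))

# generate_tokens() takes text; the encoding comment is just a COMMENT token.
text = source.decode(encoding)
first_from_text = next(tokenize.generate_tokens(io.StringIO(text).readline))

print(encoding, tokenize.tok_name[first_from_bytes.type],
      tokenize.tok_name[first_from_text.type])
```

The standard library normalizes "latin-1" to "iso-8859-1", so the reported name can differ from the spelling used in the comment.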

Example usage:

>>> from tokenize import tok_name

>>> [(tok_name[t],v) for t,v,s,e,line in dsl.tokenize_string("1+2")]
[('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('ENDMARKER', '')]
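If you prefer to use the tokenize module directly, as mentioned above, a rough equivalent of the example looks like the sketch below. It assumes Python 3, and tokenize_text is a hypothetical helper name used only for illustration. It is not part of SCALE.

```python
import io
import tokenize

def tokenize_text(text):
    """Yield standard library tokens for a str (hypothetical stand-in)."""
    # generate_tokens() wants a readline callable that returns str lines.
    return tokenize.generate_tokens(io.StringIO(text).readline)

# Reduce each token to a (type name, string) pair, as in the example above.
pairs = [(tokenize.tok_name[tok.type], tok.string)
         for tok in tokenize_text("1+2")]
print(pairs)
```

Recent Python 3 releases also emit a NEWLINE token before ENDMARKER when the input lacks a trailing newline, so the full list may be one item longer than the output shown above.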

Converting Tokens Back to Text

The detokenize() function converts an iterable of tokens back into a string:

>>> print dsl.detokenize(dsl.tokenize_string("1+2"))
1+2
>>> print dsl.detokenize(dsl.tokenize_string("1+ 2   #foo"))
1+ 2   #foo

The resulting string will have every token on its original line in the input:

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+  \\
...    \\
... 2"""))
1+  \

But the tokens will be shifted to the left such that the first non-whitespace, non-comment token is in the first column of the output:

>>> print dsl.detokenize(dsl.tokenize_string("""
...     print '''foo
...     bar''' + '''spam
...     baz''';"""))
print '''foo
    bar''' + '''spam
    baz''';

unless you use the optional indent parameter to change the default indentation:

>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=2)
'  print foo'
>>> dsl.detokenize(dsl.tokenize_string("print foo"), indent=4)
'    print foo'

>>> print dsl.detokenize(dsl.tokenize_string("""\
... 1+  \\
...    \\
... 2"""), indent=4)
    1+  \

But note that re-indentation doesn't affect the contents of multi-line strings, such as docstrings. That is, after reindenting, the string has the same value it did before, even if it makes the contents of the string look odd:

>>> print dsl.detokenize(dsl.tokenize_string("""\
...         # a comment that's oddly indented - or is it?
...     def x():
...         '''more than one
...            line in the docstring'''
... """), indent=12)
            # a comment that's oddly indented - or is it?
            def x():
                '''more than one
           line in the docstring'''

Notice also that any comments occurring before the first non-whitespace token in the token stream are formatted flush left to the indent column, regardless of their position in the input. (This is because detokenize() doesn't know how far to offset the input lines from their starting positions until it encounters a non-comment token.)
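To see why the contents of multi-line strings are left alone, consider what a naive, line-by-line re-indenter would do to the docstring from the example. This sketch uses only the Python 3 standard library and does not call SCALE. It shows that prefixing every line changes the value of the string.

```python
import ast
import textwrap

# The docstring from the example above, as source text.
source = "'''more than one\n   line in the docstring'''"
original = ast.literal_eval(source)

# A naive re-indenter prefixes every physical line, including the line
# that sits inside the string literal.
shifted = textwrap.indent(source, "    ").lstrip()
naive = ast.literal_eval(shifted)

print(repr(original))
print(repr(naive))
```

The second value gains four extra spaces after the newline, which is exactly the kind of change that leaving string contents untouched avoids.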

You can strip whitespace tokens like indents, dedents, comments, and newlines from a token list with strip_ws(), which yields all the non-whitespace tokens in a sequence:

>>> dsl.detokenize(dsl.strip_ws(dsl.tokenize_string("123 #xyz")))
'123'

strip_ws() is intended to make parsing individual statements easier. But you should not use it on token streams that span more than one logical line, because the NEWLINE whitespace token separates logical lines, and the INDENT and DEDENT tokens are used to identify blocks. With these tokens removed, parsing blocks into statements and nested blocks becomes impossible. Therefore, if you are doing block parsing you should only strip whitespace from individual statements, not from the input to parse_block().
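The filtering idea, and the reason it is unsafe across logical lines, can be sketched with the standard library alone. The set of token types below is an assumption based on the description above, and strip_ws_sketch is a hypothetical name. SCALE's own strip_ws() may treat some tokens (ENDMARKER, for instance) differently. The code assumes Python 3.

```python
import io
import tokenize

# Token types treated as whitespace in this sketch (an assumption).
WS_TYPES = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT}

def strip_ws_sketch(tokens):
    """Yield only the tokens whose type is not in WS_TYPES."""
    for tok in tokens:
        if tok.type not in WS_TYPES:
            yield tok

def toks(text):
    return tokenize.generate_tokens(io.StringIO(text).readline)

# A single statement: the comment and the line ending disappear.
kept = [tokenize.tok_name[t.type] for t in strip_ws_sketch(toks("123 #xyz\n"))]

# Several logical lines: the block structure is lost in the flat result.
flat = [t.string for t in strip_ws_sketch(toks("if x:\n    y\nz\n"))]
print(kept, flat)
```

In the second case nothing in the stripped sequence says where the "if" statement ends or that "y" was nested under it.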

Block Parsing

The parse_block(tokens) function turns an iterable of tokens into a block, which is a list of statements and the blocks that appear indented under those statements. More specifically, it is a list of two-item "(statement, block)" tuples, where statement is a list of the tokens representing a single statement, and block is a (possibly-empty) nested list of "(statement, block)" pairs:

>>> dsl.parse_block(dsl.tokenize_string("1+2"))
[([(...'1'...), (...'+'...), (...'2'...)], [])]
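The nested "(statement, block)" shape can be reproduced with a short recursive function over standard library tokens. This is a simplified sketch under stated assumptions, not SCALE's parse_block(): the name parse_block_sketch is hypothetical, the code targets Python 3, it drops INDENT, DEDENT, and ENDMARKER tokens, and it makes no attempt to place comment-only lines sensibly.

```python
import io
import tokenize

def parse_block_sketch(tokens):
    """Group tokens into a list of (statement, block) pairs."""
    return _parse(iter(tokens))

def _parse(tokens):
    block = []   # (statement, block) pairs at this nesting level
    stmt = []    # tokens of the statement currently being collected
    for tok in tokens:
        if tok.type == tokenize.INDENT:
            if not block:
                # Indented input with no header statement: use an empty one.
                block.append(([], []))
            # The indented lines form the body of the latest statement.
            block[-1][1].extend(_parse(tokens))
        elif tok.type in (tokenize.DEDENT, tokenize.ENDMARKER):
            break   # this level is finished
        else:
            stmt.append(tok)
            if tok.type == tokenize.NEWLINE:
                # NEWLINE ends a logical line, and so ends the statement.
                block.append((stmt, []))
                stmt = []
    if stmt:
        block.append((stmt, []))
    return block

source = "def foo():\n    pass\ndef bar(baz,spam):\n    whee()\n"
block = parse_block_sketch(tokenize.generate_tokens(io.StringIO(source).readline))
for stmt, body in block:
    print([t.string for t in stmt][:2], len(body))
```

The recursion shares one iterator between levels, so each DEDENT token returns control to exactly one enclosing level.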

Blocks can be flattened back into a token sequence using flatten_block(), so you can then detokenize the result back into a string:

>>> print dsl.detokenize(
...     dsl.flatten_block(dsl.parse_block(dsl.tokenize_string("1+2")))
... )
1+2

Thus, you can parse a file into a block, then traverse the statement tree and turn sub-blocks back into strings at whatever indentation level you like. This is especially useful for parser generators and other tools whose high-level language embeds blocks of Python code that must be carried over into their output.
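Traversing such a tree needs nothing beyond plain recursion over the "(statement, block)" pairs. The sketch below uses a hand-built tree in which each token is a plain string standing in for a real token tuple, so it runs without SCALE. The walk function and the sample data are hypothetical, and the code assumes Python 3.

```python
def walk(block, depth=0):
    """Yield (depth, statement) for every statement, outermost first."""
    for stmt, body in block:
        yield depth, stmt
        # Statements nested under this one sit one level deeper.
        yield from walk(body, depth + 1)

# Hand-built stand-in for the result of parsing two small functions.
tree = [
    (["def", "foo():"], [(["pass"], [])]),
    (["def", "bar(baz,spam):"], [(["whee()"], [])]),
]

# Render each statement at two spaces per nesting level.
lines = ["  " * depth + " ".join(stmt) for depth, stmt in walk(tree)]
print("\n".join(lines))
```

Because the depth is computed during the walk rather than stored in the tokens, the caller is free to choose any indentation width when rendering.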

Here's a more detailed example. First, let's parse a block:

>>> block = dsl.parse_block(dsl.tokenize_string("""\
... def foo():
...     pass
... def bar(baz,spam):
...     whee()
... """))

The block has two statements in it:

>>> len(block)
2

We'll print them, stripping whitespace so that they don't end with line feeds:

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.strip_ws(stmt))
def foo():
def bar(baz,spam):

Now let's print the bodies of the statements, indenting them to match their original positions:

>>> for stmt,blk in block:
...     print dsl.detokenize(dsl.flatten_block(blk), indent=4)

Or, to print the whole block, we can simply flatten and detokenize it:

>>> print dsl.detokenize(dsl.flatten_block(block))
def foo():
def bar(baz,spam):

Now, let's create a simple code reformatter that realigns a block and its children to a uniform indentation width:

>>> def reindent(block, indent_by=4, start=0):
...     out = []
...     for stmt,blk in block:
...         out.append(dsl.detokenize(stmt, indent=start))
...         if blk:
...             out.append(reindent(blk, indent_by, start+indent_by))
...     return ''.join(out)

>>> print reindent(block)
def foo():
def bar(baz,spam):

>>> print reindent(block, 1)
def foo():
def bar(baz,spam):

>>> print reindent(block, 7, 3)
   def foo():
   def bar(baz,spam):

