Analyzing and Transforming Abstract Syntax

You manage a large Python codebase that interacts with another vendor's library. A new version of this library is not backwards compatible: it has switched from lists to iterators throughout its API. This is a problem because your codebase uses slice notation on list values throughout, which must now be replaced with calls to islice. You are tasked with quantifying the extent to which the codebase under your management must be modified, and with designing and/or implementing a process for performing the updates across this codebase. How can you avoid having to manually identify and rewrite every occurrence of slice notation? Is there a way both to measure the extent of the problem and to perform the necessary code transformations in an automated way?

More generally, the ability to work directly with the abstract syntax data structures of a language can be useful in at least two kinds of scenarios.

  • It may be necessary to analyze code (possibly a very large amount of code) in an automated way to obtain some useful information (e.g., whether patterns that may represent vulnerabilities or errors exist, where and how often a language feature is used, the portion of the code for which tests and documentation are present, and so on).
  • It may be necessary to transform code (again, possibly a very large amount of code) in some way (e.g., to update its compatibility with a newer version of an API, to transpile it to another language, and so on).

Python's built-in libraries include powerful modules that make it possible to retrieve and operate over parse trees or abstract syntax trees (ASTs) of modules, classes, and functions/methods. This article provides an overview of the Python syntax definition, how to build and operate on abstract syntax trees, and some use cases. With this information in hand, you should be ready to start writing Python code that examines and transforms other Python code.

Abstract Syntax Grammar and Data Structure

For any given programming language, the concrete syntax corresponds to the actual strings of characters that make up a program, and the abstract syntax corresponds to the data structure instances that a parser for that language generates from a string of characters (and which may subsequently be processed by an interpreter or a compiler). An instance of the abstract syntax data structure is sometimes called a parse tree, an abstract syntax tree, or an AST.

The definitions of abstract syntax grammar for Python is presented in the documentation for the ast module. The ast module itself includes an implementation of the data structure corresponding to the grammar definition, as well as associated functionalities (such as features that enable parsing a concrete syntax string into an AST instance and performing transformations on an AST instance).

In [1]:
import ast

# Programmatically build an AST for the
# Python concrete syntax expression `-5`.
node =\
  ast.UnaryOp(
    ast.USub(),
    ast.Num(5, lineno=0, col_offset=0),
    lineno=0,
    col_offset=0
  )

The dump function returns a string representation of the data structure.

In [2]:
ast.dump(node)
Out[2]:
'UnaryOp(op=USub(), operand=Constant(value=5))'

You can use the inspect module to retrieve the concrete syntax of the body of a Python function definition.

In [3]:
def f(x):
    return x + 2

import inspect
inspect.getsource(f)
Out[3]:
'def f(x):\n    return x + 2\n'

You can then parse the concrete syntax using ast.parse() to generate an instance of the abstract syntax data structure.

In [4]:
t = ast.parse(inspect.getsource(f))
ast.dump(t)
Out[4]:
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[], args=[arg(arg='x', annotation=None, type_comment=None)], vararg=None, kwonlyargs=[], kw_defaults=[], kwarg=None, defaults=[]), body=[Return(value=BinOp(left=Name(id='x', ctx=Load()), op=Add(), right=Constant(value=2, kind=None)))], decorator_list=[], returns=None, type_comment=None)], type_ignores=[])"

How can this data structure be inspected and traversed using the ast API? Every instance of the data structure is also a node in the tree that represents the root of the module, statement, or expression that was parsed to build it. You can determine what fields exist in the node using the _fields attribute. For example, you can retrieve the body of the function definition above in the following way. Notice that the body field is a list of AST nodes in which the first (and only) node is the function definition.

In [5]:
if 'body' in t._fields and len(t.body) > 0:
    stmt_fun = t.body[0]
    print(isinstance(stmt_fun, ast.FunctionDef))
True

You can retrieve the single statement in the body of the function by referencing the abstract syntax definition and working your way down through the various parameters of the data structure.

In [6]:
if 'body' in stmt_fun._fields and len(stmt_fun.body) > 0:
    stmt_return = stmt_fun.body[0]
    print(isinstance(stmt_return, ast.Return))
True

Going deeper, you can retrieve the expression inside the return statement in the following way.

In [7]:
if 'value' in stmt_return._fields:
    expr_add = stmt_return.value
    print(isinstance(expr_add, ast.BinOp))
True

Finally, you can extract the expression operator and its individual arguments.

In [8]:
print(expr_add.left)
print(expr_add.left.id)
print(expr_add.op)
print(expr_add.right)
print(expr_add.right.n)
<_ast.Name object at 0x05A85100>
x
<_ast.Add object at 0x03567A78>
<_ast.Constant object at 0x05A85118>
2

Analyzing an Abstract Syntax Tree

The ast API allows you to programmatically write you own functions for operating on Python ASTs. The example below collects all the variables in an AST (as long as the AST consists of the specific kinds of nodes that are handled by the various if branches in the body of the function).

In [9]:
def collect_variables(a):
    if type(a) is ast.Module:
        return [v for s in a.body for v in collect_variables(s)]
    
    elif type(a) is ast.FunctionDef:
        vs = [v for s in a.body for v in collect_variables(s)]
        return [a.name] + vs

    elif type(a) is ast.Assign:
        vs = [v for s in a.targets for v in collect_variables(s)]
        return vs + collect_variables(a.value)
    
    elif type(a) is ast.Return:
        return collect_variables(a.value)

    elif type(a) is ast.Name:
        return [a.id]

    elif type(a) is ast.BinOp:
        return\
            collect_variables(a.left) + collect_variables(a.right)
    
    else:
        print(type(a)) # Display trees not captured by cases above.
        return []

Invoking the collect_variables function on an example yields a list of all the variables encountered during the recursive traversal. Note that because there was no explicit case handling constants, they are displayed by the print statement at the end of the function.

In [10]:
def f(x, y):
    a = b
    u = 1 + 2
    z = x + y
    return z

a = ast.parse(inspect.getsource(f))
collect_variables(a)
<class '_ast.Constant'>
<class '_ast.Constant'>
Out[10]:
['f', 'a', 'b', 'u', 'z', 'x', 'y', 'z']

You can instead use the ast.walk() function to traverse a tree if you are not worried about its internal structure but wish to see and process all of its nodes (usually, to produce some kind of less granular aggregate analysis). The example below performs the same task as the function collect_variables above.

In [11]:
vs = []
for node in ast.walk(a):
    if isinstance(node, ast.FunctionDef):
        vs.append(node.name)
    if isinstance(node, ast.Name):
        vs.append(node.id)
vs
Out[11]:
['f', 'a', 'b', 'u', 'z', 'z', 'x', 'y']

Returning to the motivating example in the first paragraph of this article, this approach can be used to find all occurrences of slice notation.

In [12]:
def slice_occurrences(a):
    occurs = 0
    for node in ast.walk(a):
        if isinstance(node, ast.Subscript):
            if isinstance(node.slice, ast.Slice):
                occurs += 1
    return occurs

Using the above function on some examples yields the expected results.

In [13]:
def f(xs):
    return xs[0:10] + xs[20:30]

slice_occurrences(ast.parse(inspect.getsource(f)))
Out[13]:
2
In [14]:
def g(xs):
    return [xs[0], xs[1]]

slice_occurrences(ast.parse(inspect.getsource(g)))
Out[14]:
0

Transforming an Abstract Syntax Tree

You can use the ast.NodeVisitor and ast.NodeTransformer base classes to create your own algorithms that traverse or modify ASTs. This allows your implementation to have a more structured approach than one that uses ast.walk, while also allowing you to avoid explicitly handling every possible kind of AST node in your implementation as was done in the first example involving collect_variables.

In the example below, the DisplayVariables class visits every node and, if it is a function definition or a variable occurrence, displays its name. The call to generic_visit is necessary to ensure that, despite our replacing the particular methods for the Name and FunctionDef node types, the ast.NodeVisitor class can continue to recursively traverse the subtrees of these node types.

In [15]:
class DisplayVariables(ast.NodeVisitor):
    def visit_Name(self, node):
        print(node.id)
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        print(node.name)
        self.generic_visit(node) # Needed for overriding method.

Any class derived from NodeVisitor can be applied to an AST instance in the following way.

In [16]:
def f(x, y):
    a = b
    u = 1 + 2
    z = x + y
    return z

a = ast.parse(inspect.getsource(f))
DisplayVariables().visit(a)
f
a
b
u
z
x
y
z

In the example below, the ChangeVariable class is a transformer that visits every node. If the node is a variable (i.e., x in the Python concrete syntax), the variable node is replaced with a subtree corresponding to the Python concrete syntax for a dictionary access (i.e., data['x'] in the Python concrete syntax).

In [17]:
class ChangeVariables(ast.NodeTransformer):
    def visit_Name(self, node): # Only transform `Name` nodes.
        a =\
          ast.copy_location(
            ast.Subscript(
              value=ast.Name(id='data', ctx=ast.Load()),
              slice=ast.Index(value=ast.Str(s=node.id)),
              ctx=node.ctx
            ),
            node
          )
        return a

As with NodeVisitor, any class derived from NodeTransformer can be applied to an AST instance in the following way.

In [18]:
def f(x, y):
    return x + y

a = ChangeVariables().visit(ast.parse(inspect.getsource(f)))

You can use the astor module to convert an AST back into concrete syntax. This is a straightforward way to write transformed code back to a file.

In [19]:
import astor
print(astor.to_source(a))
def f(x, y):
    return data['x'] + data['y']

Returning to the motivating example in the first paragraph of this article, the ChangeSlices transformer below replaces all subtrees that consist of slice notation with a subtree that consists of a call to islice. Notice that it is necessary to preserve the subtrees corresponding to the upper and lower bounds of the slice range and make them the new arguments to islice. This example only handles cases where both the upper and lower bounds are explicit in the original slice notation.

In [20]:
class ChangeVariables(ast.NodeTransformer):
    def visit_Subscript(self, node):
        a =\
          ast.copy_location(
            ast.Call(
              func=ast.Name(id='islice', ctx=ast.Load()),
              args=[
                ast.Name(id=node.value.id, ctx=ast.Load()),
                node.slice.lower,
                node.slice.upper
              ],
              ctx=node.ctx,
              keywords=[]
            ),
            node
          )
        return a

Below, the ChangeSlices transformer is called on an example input.

In [21]:
def f(xs):
    return xs[0:10]

a = ChangeVariables().visit(ast.parse(inspect.getsource(f)))
print(astor.to_source(a))
def f(xs):
    return islice(xs, 0, 10)

Notice that this approach works even if the parameter subexpressions within each instance of slice notation are highly complex because the subtrees in the AST are simply inserted into the new node that invokes islice. Thus, this is both a general and a scalable approach for replacing instances of slice notation with islice throughout a large codebase.

Further Reading

This article briefly introduced some concepts involved in defining and implementing a programming language (such as abstract syntax and abstract syntax trees), as well as a number of tools for writing Python programs that can analyze and transform other Python programs. These features allow Python to support families of techniques such as reflection), source-to-source compilation, and static analysis, and it can be worthwhile to learn more about these in order to expand the set of tools available to you on your next project. This article also provided a brief overview of the Python language definition; you can explore the full language definition (including all the various constructs and the concrete and abstract syntax representations) in the Python Language Reference.