You manage a large Python codebase that interacts with another vendor's library. A new version of this library is not backwards compatible: it has switched from lists to iterators throughout its API. This is a problem because your codebase uses slice notation on list values throughout, which must now be replaced with calls to islice
. You are tasked with quantifying the extent to which the codebase under your management must be modified, and with designing and/or implementing a process for performing the updates across this codebase. How can you avoid having to manually identify and rewrite every occurrence of slice notation? Is there a way both to measure the extent of the problem and to perform the necessary code transformations in an automated way?
More generally, the ability to work directly with the abstract syntax data structures of a language can be useful in at least two kinds of scenarios.
Python's built-in libraries include powerful modules that make it possible to retrieve and operate over parse trees or abstract syntax trees (ASTs) of modules, classes, and functions/methods. This article provides an overview of the Python syntax definition, how to build and operate on abstract syntax trees, and some use cases. With this information in hand, you should be ready to start writing Python code that examines and transforms other Python code.
For any given programming language, the concrete syntax corresponds to the actual strings of characters that make up a program, and the abstract syntax corresponds to the data structure instances that a parser for that language generates from a string of characters (and which may subsequently be processed by an interpreter or a compiler). An instance of the abstract syntax data structure is sometimes called a parse tree, an abstract syntax tree, or an AST.
The definitions of abstract syntax grammar for Python is presented in the documentation for the ast
module. The ast
module itself includes an implementation of the data structure corresponding to the grammar definition, as well as associated functionalities (such as features that enable parsing a concrete syntax string into an AST instance and performing transformations on an AST instance).
import ast
# Programmatically build an AST for the
# Python concrete syntax expression `-5`.
node =\
ast.UnaryOp(
ast.USub(),
ast.Num(5, lineno=0, col_offset=0),
lineno=0,
col_offset=0
)
The dump
function returns a string representation of the data structure.
ast.dump(node)
You can use the inspect
module to retrieve the concrete syntax of the body of a Python function definition.
def f(x):
return x + 2
import inspect
inspect.getsource(f)
You can then parse the concrete syntax using ast.parse()
to generate an instance of the abstract syntax data structure.
t = ast.parse(inspect.getsource(f))
ast.dump(t)
How can this data structure be inspected and traversed using the ast
API? Every instance of the data structure is also a node in the tree that represents the root of the module, statement, or expression that was parsed to build it. You can determine what fields exist in the node using the _fields
attribute. For example, you can retrieve the body of the function definition above in the following way. Notice that the body
field is a list of AST nodes in which the first (and only) node is the function definition.
if 'body' in t._fields and len(t.body) > 0:
stmt_fun = t.body[0]
print(isinstance(stmt_fun, ast.FunctionDef))
You can retrieve the single statement in the body of the function by referencing the abstract syntax definition and working your way down through the various parameters of the data structure.
if 'body' in stmt_fun._fields and len(stmt_fun.body) > 0:
stmt_return = stmt_fun.body[0]
print(isinstance(stmt_return, ast.Return))
Going deeper, you can retrieve the expression inside the return
statement in the following way.
if 'value' in stmt_return._fields:
expr_add = stmt_return.value
print(isinstance(expr_add, ast.BinOp))
Finally, you can extract the expression operator and its individual arguments.
print(expr_add.left)
print(expr_add.left.id)
print(expr_add.op)
print(expr_add.right)
print(expr_add.right.n)
The ast
API allows you to programmatically write you own functions for operating on Python ASTs. The example below collects all the variables in an AST (as long as the AST consists of the specific kinds of nodes that are handled by the various if
branches in the body of the function).
def collect_variables(a):
if type(a) is ast.Module:
return [v for s in a.body for v in collect_variables(s)]
elif type(a) is ast.FunctionDef:
vs = [v for s in a.body for v in collect_variables(s)]
return [a.name] + vs
elif type(a) is ast.Assign:
vs = [v for s in a.targets for v in collect_variables(s)]
return vs + collect_variables(a.value)
elif type(a) is ast.Return:
return collect_variables(a.value)
elif type(a) is ast.Name:
return [a.id]
elif type(a) is ast.BinOp:
return\
collect_variables(a.left) + collect_variables(a.right)
else:
print(type(a)) # Display trees not captured by cases above.
return []
Invoking the collect_variables
function on an example yields a list of all the variables encountered during the recursive traversal. Note that because there was no explicit case handling constants, they are displayed by the print
statement at the end of the function.
def f(x, y):
a = b
u = 1 + 2
z = x + y
return z
a = ast.parse(inspect.getsource(f))
collect_variables(a)
You can instead use the ast.walk()
function to traverse a tree if you are not worried about its internal structure but wish to see and process all of its nodes (usually, to produce some kind of less granular aggregate analysis). The example below performs the same task as the function collect_variables
above.
vs = []
for node in ast.walk(a):
if isinstance(node, ast.FunctionDef):
vs.append(node.name)
if isinstance(node, ast.Name):
vs.append(node.id)
vs
Returning to the motivating example in the first paragraph of this article, this approach can be used to find all occurrences of slice notation.
def slice_occurrences(a):
occurs = 0
for node in ast.walk(a):
if isinstance(node, ast.Subscript):
if isinstance(node.slice, ast.Slice):
occurs += 1
return occurs
Using the above function on some examples yields the expected results.
def f(xs):
return xs[0:10] + xs[20:30]
slice_occurrences(ast.parse(inspect.getsource(f)))
def g(xs):
return [xs[0], xs[1]]
slice_occurrences(ast.parse(inspect.getsource(g)))
You can use the ast.NodeVisitor
and ast.NodeTransformer
base classes to create your own algorithms that traverse or modify ASTs. This allows your implementation to have a more structured approach than one that uses ast.walk
, while also allowing you to avoid explicitly handling every possible kind of AST node in your implementation as was done in the first example involving collect_variables
.
In the example below, the DisplayVariables
class visits every node and, if it is a function definition or a variable occurrence, displays its name. The call to generic_visit
is necessary to ensure that, despite our replacing the particular methods for the Name
and FunctionDef
node types, the ast.NodeVisitor
class can continue to recursively traverse the subtrees of these node types.
class DisplayVariables(ast.NodeVisitor):
def visit_Name(self, node):
print(node.id)
self.generic_visit(node)
def visit_FunctionDef(self, node):
print(node.name)
self.generic_visit(node) # Needed for overriding method.
Any class derived from NodeVisitor
can be applied to an AST instance in the following way.
def f(x, y):
a = b
u = 1 + 2
z = x + y
return z
a = ast.parse(inspect.getsource(f))
DisplayVariables().visit(a)
In the example below, the ChangeVariable
class is a transformer that visits every node. If the node is a variable (i.e., x
in the Python concrete syntax), the variable node is replaced with a subtree corresponding to the Python concrete syntax for a dictionary access (i.e., data['x']
in the Python concrete syntax).
class ChangeVariables(ast.NodeTransformer):
def visit_Name(self, node): # Only transform `Name` nodes.
a =\
ast.copy_location(
ast.Subscript(
value=ast.Name(id='data', ctx=ast.Load()),
slice=ast.Index(value=ast.Str(s=node.id)),
ctx=node.ctx
),
node
)
return a
As with NodeVisitor
, any class derived from NodeTransformer
can be applied to an AST instance in the following way.
def f(x, y):
return x + y
a = ChangeVariables().visit(ast.parse(inspect.getsource(f)))
You can use the astor
module to convert an AST back into concrete syntax. This is a straightforward way to write transformed code back to a file.
import astor
print(astor.to_source(a))
Returning to the motivating example in the first paragraph of this article, the ChangeSlices
transformer below replaces all subtrees that consist of slice notation with a subtree that consists of a call to islice
. Notice that it is necessary to preserve the subtrees corresponding to the upper and lower bounds of the slice range and make them the new arguments to islice
. This example only handles cases where both the upper and lower bounds are explicit in the original slice notation.
class ChangeVariables(ast.NodeTransformer):
def visit_Subscript(self, node):
a =\
ast.copy_location(
ast.Call(
func=ast.Name(id='islice', ctx=ast.Load()),
args=[
ast.Name(id=node.value.id, ctx=ast.Load()),
node.slice.lower,
node.slice.upper
],
ctx=node.ctx,
keywords=[]
),
node
)
return a
Below, the ChangeSlices
transformer is called on an example input.
def f(xs):
return xs[0:10]
a = ChangeVariables().visit(ast.parse(inspect.getsource(f)))
print(astor.to_source(a))
Notice that this approach works even if the parameter subexpressions within each instance of slice notation are highly complex because the subtrees in the AST are simply inserted into the new node that invokes islice
. Thus, this is both a general and a scalable approach for replacing instances of slice notation with islice
throughout a large codebase.
This article briefly introduced some concepts involved in defining and implementing a programming language (such as abstract syntax and abstract syntax trees), as well as a number of tools for writing Python programs that can analyze and transform other Python programs. These features allow Python to support families of techniques such as reflection), source-to-source compilation, and static analysis, and it can be worthwhile to learn more about these in order to expand the set of tools available to you on your next project. This article also provided a brief overview of the Python language definition; you can explore the full language definition (including all the various constructs and the concrete and abstract syntax representations) in the Python Language Reference.