
Analyzing Python with the AST Package

22 Aug 2021 | CPOL | 9 min read
This article explains how to analyze Python code using the open-source ast package
The open-source ast package provides a wealth of capabilities for parsing, analyzing, and generating Python code. This article discusses these capabilities and demonstrates how they can be used in code. It also introduces the capabilities of the astor package.

1. Introduction

As StackOverflow makes clear, Python's popularity has risen dramatically in recent years. As a result, more software tools need to be able to read and analyze Python code. The ast package, which is part of Python's standard library, provides many capabilities for this purpose, and the goal of this article is to introduce its features.

AST stands for abstract syntax tree, and a later section will explain what these trees are and why they're important. The ast package makes it possible to read Python code into ASTs, and each type of node in an AST is represented by a different Python class.

The last part of this article discusses the astor (AST observe/rewrite) package. This provides helpful functions for reading and writing ASTs.

2. Two Fundamental Functions

Before I get into the details of Python analysis, I'd like to start with a simple example that shows why the ast package is so useful. There are two fundamental functions to know:

  • parse(code_str) - creates an abstract syntax tree from a string containing Python code
  • dump(node, annotate_fields=True, include_attributes=False, *, indent=None) - converts an abstract syntax tree into a string

To demonstrate how these functions are used, the following code calls ast.parse to create an abstract syntax tree for a simple for loop. Then it calls ast.dump to convert the tree into a string. (The indent argument of ast.dump requires Python 3.9 or later.)

Python
import ast
tree = ast.parse("for i in range(10):\n\tprint('Hi there!')")
print(ast.dump(tree, indent=4))

When this code is executed, it produces the following result:

Module(
    body=[
        For(
            target=Name(id='i', ctx=Store()),
            iter=Call(
                func=Name(id='range', ctx=Load()),
                args=[
                    Constant(value=10)],
                keywords=[]),
            body=[
                Expr(
                    value=Call(
                        func=Name(id='print', ctx=Load()),
                        args=[
                            Constant(value='Hi there!')],
                        keywords=[]))],
            orelse=[])],
    type_ignores=[])

To a casual coder, this may look like a complete mess. But it's important to anyone trying to build tools that analyze Python code. This output identifies the structure of the code, and the structure is given in the form of an abstract syntax tree.

3. Abstract Syntax Trees (ASTs)

To make sense of the parser's result, it's important to understand abstract syntax trees, or ASTs. These trees embody the structure of a document's content, whether it's written in a programming language like Python or a natural language like English.

This section explains what ASTs are and then presents the classes that represent nodes of an AST. But before I introduce abstract syntax trees, I'd like to take a step back and explain what tree structures are.

3.1 Tree Structures

When data elements form a hierarchy beginning with a single element, the elements and their relationships can be expressed as a tree. Common trees include organization charts, file navigators, and family trees. Tree structures are frequently encountered in software development, particularly in networking, graphics, and text analysis.

When working with trees, developers rely on a common set of terms:

  • Each element in a tree is called a node.
  • The topmost element is called the root node.
  • If a node is connected to nodes below it, the first node is called a parent node and the connected nodes are the children of the parent.
  • Every node except the root has a parent node. A node with one or more children is a branch node and a node without children is called a leaf node.

Figure 1 depicts a simple tree. Node E is the root node and Nodes B, C, and D are its children. Nodes A and F are the children of B and Node G is the child of D. Nodes A, F, C, and G have no children, so they're leaf nodes. The other nodes are branch nodes.

Image 1

Figure 1: A Simple Tree Hierarchy

Each node in the tree has a depth value that identifies how many connections separate it from the root. In this example, Node E has a depth of 0, Node C has a depth of 1, and Node G has a depth of 2.
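
As a rough sketch of these terms, the tree in Figure 1 can be written as a dictionary that maps each parent to its children. The names children, depths, node_depths, and leaves are my own choices for illustration, not part of any library.

```python
# Children of each branch node in Figure 1 (leaf nodes have no entry)
children = {"E": ["B", "C", "D"], "B": ["A", "F"], "D": ["G"]}

def depths(node, depth=0):
    """Return a dict mapping `node` and its descendants to their depths."""
    result = {node: depth}
    for child in children.get(node, []):
        result.update(depths(child, depth + 1))
    return result

node_depths = depths("E")   # start at the root, which has depth 0
leaves = sorted(n for n in node_depths if n not in children)

print(node_depths)
print(leaves)
```

The recursion mirrors the definition of depth: each child sits one connection farther from the root than its parent.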

3.2 Abstract Syntax Trees (ASTs)

When I was in grade school, we had to analyze sentences using sentence trees. The root node represents the entire sentence and has two children: one for the subject and one for the predicate. In a simple sentence, subjects are represented by noun phrases and predicates are represented by verb phrases. Figure 2 presents the tree for the sentence: This sentence is simple.

Image 2

Figure 2: Example Sentence Tree

In this tree, the leaf nodes contain the individual strings that make up the text. The branch nodes identify the purpose of each leaf node and the role it plays in the sentence.

If you can see how sentence trees represent English sentences, you won't have any trouble understanding how abstract syntax trees represent code written in Python. When ast.parse analyzes Python code, the root node takes one of four forms, depending on the mode argument:

  • module - collection of statements (mode='exec', the default)
  • function type - type comment describing a function's signature (mode='func_type')
  • interactive - collection of statements in an interactive session (mode='single')
  • expression - single expression (mode='eval')
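
To see which root node a given call produces, the following sketch passes each value of the mode argument to ast.parse and records the class name of the result. It assumes Python 3.8 or later (for the func_type mode); the samples and roots dictionaries are just names I picked for the example.

```python
import ast

# One small source string per parse mode
samples = {
    "exec": "x = 1",                     # the default mode
    "eval": "x + 1",                     # a single expression
    "single": "x = 1",                   # an interactive statement
    "func_type": "(int, str) -> bool",   # a signature type comment
}

# Map each mode to the class name of the root node it produces
roots = {mode: type(ast.parse(src, mode=mode)).__name__
         for mode, src in samples.items()}
print(roots)
```

Because mode defaults to 'exec', a plain ast.parse(code_str) call returns a Module.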

Figure 3 illustrates the AST for the simple Python for loop presented earlier. The root node is a module.

Image 3

Figure 3: Example Python AST

Almost every Python AST I've encountered has a module as its root node. A module is made up of one or more statements, and most types of statements are made up of one or more expressions. The following discussions explore the topics of statements and expressions.

3.2.1 Statements

In the preceding AST, the module contains a single statement that represents a for loop. In addition to for loops, AST statements can represent function definitions, class definitions, while loops, if statements, return statements, and import statements.
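
As a quick illustration, the sketch below parses a short script and lists the class name of each top-level statement. The sample source and the top_level variable are made up for this example.

```python
import ast

source = """
import math

def area(r):
    return math.pi * r * r

for i in range(3):
    if i:
        print(area(i))
"""

tree = ast.parse(source)
# tree.body holds the module's top-level statements in source order
top_level = [type(stmt).__name__ for stmt in tree.body]
print(top_level)
```

Only three statements appear because the return and if statements are nested inside the function definition and the for loop.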

Each statement node has one or more children, and the number and types of its children depend on the statement's type. For example, a function definition has at least four children: an identifier, arguments, a decorator list, and a set of statements that form its body. To see this, the following code parses Python code that defines a function named foo.

Python
import ast
tree = ast.parse("def foo():\n\tprint('Hello!')")
print(ast.dump(tree, indent=4))

The second line creates the following string from the AST:

Module(
    body=[
        FunctionDef(
            name='foo',
            args=arguments(
                posonlyargs=[],
                args=[],
                kwonlyargs=[],
                kw_defaults=[],
                defaults=[]),
            body=[
                Expr(
                    value=Call(
                        func=Name(id='print', ctx=Load()),
                        args=[
                            Constant(value='Hello!')],
                        keywords=[]))],
            decorator_list=[])],
    type_ignores=[])

Reading from top to bottom, it's clear that the root node is a module and its child is a function definition. The dump shows four children of the function definition, and the child representing the body has one child because the function's body contains one statement.
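
The same children can be read directly from the node instead of from the dumped string. This is a minimal sketch; the variable name func is my own.

```python
import ast

tree = ast.parse("def foo():\n\tprint('Hello!')")
func = tree.body[0]             # the FunctionDef node

print(func.name)                # identifier of the function
print(len(func.args.args))      # number of positional parameters
print(len(func.body))           # number of statements in the body
print(func.decorator_list)      # empty list because foo has no decorators
```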

Class definitions are particularly important, and each has five children: a name, zero or more base classes, zero or more keywords, zero or more statements, and zero or more decorators. Each method in a class is represented by a function definition statement.

To demonstrate this, consider the following simple class definition:

Python
class Example:
    def __init__(self):
        self.prop = 4
        
    def printProp(self):
        print(self.prop)

The following code parses this class definition to obtain an AST.

Python
import ast
tree = ast.parse("class Example:\n\tdef __init__(self):\n\t\tself.prop = 4"
    "\n\n\tdef printProp(self):\n\t\tprint(self.prop)")

Rather than print out the entire AST, Figure 4 illustrates its top-level nodes.

Image 4

Figure 4: AST for a Class Definition
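
For readers who prefer code to a diagram, the following sketch pulls the same top-level information out of the class definition node. The variable names cls and methods are only for illustration.

```python
import ast

source = (
    "class Example:\n"
    "    def __init__(self):\n"
    "        self.prop = 4\n"
    "\n"
    "    def printProp(self):\n"
    "        print(self.prop)\n"
)

cls = ast.parse(source).body[0]     # the ClassDef node
# Each method appears in the class body as a FunctionDef statement
methods = [node.name for node in cls.body if isinstance(node, ast.FunctionDef)]

print(cls.name, cls.bases, cls.keywords, cls.decorator_list)
print(methods)
```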

Many statements, such as return statements and import statements, are very simple. But other statements, such as if statements and assignment statements, are composed of child structures called expressions. I'll discuss these next.

3.2.2 Expressions

We're all familiar with mathematical expressions like 2+2 and 8*9, but expressions in a Python AST are harder to pin down. The line between a statement and an expression can seem blurry because an expression can serve as a statement, in which case the AST wraps it in an Expr node. In a Python AST, an expression can take one of several different forms, including the following:

  • binary, unary, and boolean operations
  • comparisons involving values and containers
  • function calls (not function definitions)
  • containers (lists, tuples, dicts, sets)
  • attributes, subscripts, and slices
  • constants and names (strings)

The last bullet is important. Apart from context and operator markers such as Load() and Add(), almost every leaf node in an AST will be a name or a constant, so it's important to distinguish between the two expressions. A name is an identifier, such as a function name, class name, or variable name. A constant is a literal value, such as a number, a string, None, True, or False.

To see how expressions are parsed, it helps to look at an example. The following code parses a simple mathematical expression and prints its AST.

Python
import ast
tree = ast.parse("(x+3)*5")
print(ast.dump(tree, indent=4))

The printed AST is given as follows:

Module(
    body=[
        Expr(
            value=BinOp(
                left=BinOp(
                    left=Name(id='x', ctx=Load()),
                    op=Add(),
                    right=Constant(value=3)),
                op=Mult(),
                right=Constant(value=5)))],
    type_ignores=[])

This module contains a single statement, and that statement is an expression. The expression consists of two binary operations: addition and multiplication. The variable x is identified by a Name node and the two numeric values are identified by Constant nodes.
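
One way to collect these nodes programmatically is ast.walk, which yields every node in a tree. The sketch below gathers the names, constants, and operators from the same expression; the results are sorted because ast.walk does not promise a particular order.

```python
import ast

tree = ast.parse("(x+3)*5")

# Identifiers are Name nodes; literal values are Constant nodes
names = sorted(n.id for n in ast.walk(tree) if isinstance(n, ast.Name))
constants = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
# Each BinOp stores its operator in the op field
operators = sorted(type(n.op).__name__ for n in ast.walk(tree)
                   if isinstance(n, ast.BinOp))

print(names, constants, operators)
```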

3.3 AST Classes

Every node type in the Python AST has a corresponding class in the ast package. Modules are represented by instances of the Module class, and an expression used as a statement is wrapped in an Expr instance. Function definitions are represented by FunctionDef instances and class definitions are represented by ClassDef instances.

Every child of a node corresponds to a property of the corresponding class. In Figure 4, the class definition node has children named name, body, bases, keywords, and decorator list. To store this information, the ClassDef class has properties named name, body, bases, keywords, and decorator_list.

Every node class extends from the central AST class. Each class has a _fields attribute, and statement and expression nodes also carry attributes that record where the node appears in the source:

  • _fields - a tuple containing the names of the node's children (which correspond to class properties)
  • lineno - first line number containing the node
  • end_lineno - last line number containing the node
  • col_offset - offset of the node's first character within its first line (as a UTF-8 byte offset)
  • end_col_offset - offset just past the node's last character within its last line (as a UTF-8 byte offset)

For example, the following code lists the children of an if statement:

Python
import ast
print(ast.If._fields)

The printed output is ('test', 'body', 'orelse').
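
The location of a node can be read in a similar way. In the standard library, the position attributes are spelled lineno, col_offset, end_lineno, and end_col_offset (the end_ attributes assume Python 3.8 or later). The sample source and the variable names below are only for illustration.

```python
import ast

tree = ast.parse("total = 1\nif total:\n    print(total)")
if_node = tree.body[1]          # the If statement that starts on line 2

# Line numbers start at 1; column offsets start at 0
location = (if_node.lineno, if_node.col_offset,
            if_node.end_lineno, if_node.end_col_offset)
print(location)
```

The statement starts at the beginning of line 2 and ends where its body ends on line 3.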

There isn't a lot of documentation on the node classes and their constructors. But you can see how a node is constructed by looking at the output of the dump function. To convert this output into a constructor, simply preface each node class with the ast prefix. For example, the following code relies on the preceding output to define an expression containing two binary operations:

Python
import ast

# Inner operation: x + 3
firstOp = ast.BinOp(left=ast.Name(id='x', ctx=ast.Load()), 
    op=ast.Add(), right=ast.Constant(value=3))

# Outer operation: (x + 3) * 5
secondOp = ast.BinOp(left=firstOp, op=ast.Mult(), right=ast.Constant(value=5))

e = ast.Expr(value=secondOp)

Once you understand how to instantiate node classes, you can programmatically construct ASTs. Then you can generate Python code from an AST using the ASTOR package, which I'll discuss next.
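
One way to check that a hand-built tree is well formed is to compile and evaluate it. The sketch below is one possible approach rather than a required step: it wraps the (x + 3) * 5 expression in an ast.Expression root, calls ast.fix_missing_locations to fill in the line and column attributes that the compiler needs, and evaluates the result with x set to 2. The names inner, outer, root, and result are my own.

```python
import ast

# Build (x + 3) * 5 by hand
inner = ast.BinOp(left=ast.Name(id='x', ctx=ast.Load()),
                  op=ast.Add(), right=ast.Constant(value=3))
outer = ast.BinOp(left=inner, op=ast.Mult(), right=ast.Constant(value=5))

# compile() in 'eval' mode expects an Expression root with location info
root = ast.fix_missing_locations(ast.Expression(body=outer))
code = compile(root, filename="<ast>", mode="eval")

result = eval(code, {"x": 2})
print(result)
```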

4. Using ASTOR

To augment the capabilities of the ast package, Berker Peksag released astor, which stands for AST Observe/Rewrite. If you have pip available, you can install astor with the command pip install astor. As of this writing, the current version is 0.8.1.

astor provides a number of useful classes and functions that simplify working with Python ASTs. Table 1 lists six of these functions and provides a description of each.

Table 1: Functions of the astor Package

Function | Description
to_source(ast, indent_with=' '*4, add_line_information=False) | Convert an AST to Python code
code_to_ast(codeobj) | Recompile a module into an AST and extract a sub-AST for the function
parse_file(file) | Parse a Python file into an AST
dump_tree(node, name=None, initial_indent='', indentation=' ', maxline=120, maxmerged=80) | Pretty print an AST with indentation
strip_tree(node) | Recursively remove attributes from an AST
iter_node(node, unknown=None) | Iterates over an AST node

The first function, to_source, is particularly helpful because it accepts an AST (or a node) and returns the corresponding Python code as a string. To demonstrate this, the following code calls ast.parse to obtain an AST for a function definition. Then it calls astor.to_source to convert the AST to Python code.

Python
import ast
import astor
tree = ast.parse("def foo():\n\tprint('Hello!')")
print(astor.to_source(tree))

The output of the second line is given as follows:

Python
def foo():
    print('Hello!')

In this manner, a Python script can generate Python code programmatically. This can be very helpful when you need to translate text from one language into Python.
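
If installing astor isn't an option, the standard library offers a similar function: ast.unparse, available in Python 3.9 and later. The following sketch is an alternative to astor.to_source, not a description of how astor works. It converts a tree back to source and confirms that the generated code parses to an equivalent tree; the names generated and round_trip_ok are only for illustration.

```python
import ast

tree = ast.parse("def foo():\n\tprint('Hello!')")

# ast.unparse requires Python 3.9 or later
generated = ast.unparse(tree)
print(generated)

# The generated source should parse back to an equivalent tree
round_trip_ok = ast.dump(ast.parse(generated)) == ast.dump(tree)
print(round_trip_ok)
```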

5. History

  • 22nd August, 2021: Initial publication
  • 24th August, 2021: Added link to ASTOR

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States
I've been a programmer, engineer, and author for over 20 years.
