• data structure to represent the structure of a program or code snippet.
  • Each node of the tree denotes a construct occuring in the text
    • e.g. an if-condition-then statement may be denoted by means of a single node with three branches. (condition, if-body, else-body)

Example

while b ≠ 0:
    if a > b:
        a := a - b
    else:
        b := b - a
return a
  • Abstract syntax tree for the above code

Usage in ML

  • As described in Qwen2.5-Coder Technical Report, one can use an AST to parse the code snippets and extract the basic logic blocks as the middle code to infill (when using FIM).
    • By traversing and manipulating the AST, we can randomly extract the nodes of multiple levels and use the code context of the same file to uncover the masked node

Properties

  • Can be edited and enhanced with information such as properties and annotations for every element it contains.

    • Such editing and annotation is impossible with the source code of a program, since it would imply changing it.
  • an AST does not include inessential punctuation and delimiters (braces, semicolons, parentheses, etc.).

  • An AST usually contains extra information about the program, due to the consecutive stages of analysis by the compiler.

    • For example, it may store the position of each element in the source code, allowing the compiler to print useful error messages.