- data structure to represent the structure of a program or code snippet.
- Each node of the tree denotes a construct occuring in the text
- e.g. an if-condition-then statement may be denoted by means of a single node with three branches. (condition, if-body, else-body)
Example
while b ≠ 0:
if a > b:
a := a - b
else:
b := b - a
return a
- Abstract syntax tree for the above code
Usage in ML
- As described in Qwen2.5-Coder Technical Report, one can use an AST to parse the code snippets and extract the basic logic blocks as the middle code to infill (when using FIM).
- By traversing and manipulating the AST, we can randomly extract the nodes of multiple levels and use the code context of the same file to uncover the masked node
Properties
-
Can be edited and enhanced with information such as properties and annotations for every element it contains.
- Such editing and annotation is impossible with the source code of a program, since it would imply changing it.
-
an AST does not include inessential punctuation and delimiters (braces, semicolons, parentheses, etc.).
-
An AST usually contains extra information about the program, due to the consecutive stages of analysis by the compiler.
- For example, it may store the position of each element in the source code, allowing the compiler to print useful error messages.