Abstract Syntax Tree

The representation of SourceCode as a tree of nodes representing constants or variables (leaves) and operators or statements (inner nodes). Also called a "parse tree". An AbstractSyntaxTree is often the output of a parser (or the "parse stage" of a compiler), and forms the input to semantic analysis and code generation (this assumes a phased compiler; many compilers interleave the phases in order to conserve memory).

Note that the AbstractSyntaxTree reveals the lexical/syntactical structure of the program text - what blocks and statements are lexically contained within in what. This may - or may not - be related to the semantic structure of the program. For example, in most OO languages with inheritance, the inheritance hierarchy is not revealed by examining the arrangement of the AST; you have to further and "match names" (to be imprecise). OTOH, in C and C++, lexical scopes mean next to nothing (other than being important for public/private in C++), all named C++ scopes are "flattened" and the lexical structure has little runtime significance. In other languages with LexicalScoping or things resembling it (Java InnerClasses, SmalltalkBlocks?), the lexical structure has greater significance.

Unlike concrete syntax, which consists of a linear sequence of characters and/or tokens, along with a set of rules for parsing them, abstract syntax doesn't (generally) have to worry about issues such as parser ambiguity, operator precedence, etc.

AbstractSyntaxTrees are a common intermediate form during compilation of SourceCode.

See http://en.wikipedia.org/wiki/Abstract_syntax_tree


TopMind wrote:

Note that the data representation does not necessarily have to be a tree. It can also be nodes (records) with references to other nodes. And, a "syntax tree" is often not a "pure" tree because leaves may reference the same information, such as the same function name or variable name. Thus, "tree" is used in kind of a loose sense.

''Sorta correct, but not quite. The AST is (typically) the output of a parse phase; and it is structured as a tree. Why? That's what happens when you parse a ContextFreeGrammar? -

3.if the grammar cannot be represented with a tree, it isn't context-free. This by no means limits the expressibility of the underlying language (as context-sensitive features are introduced in the semantic analysis level of most languages), but use of CFGs makes parsing much easier to deal with. Via indirection (such as provided by variables), you are correct that non-tree-like relationships can be expressed; and some optimizations such as CommonSubexpressionElimination? might be done on an AST, making it a DAG shape. But prior to any such semantic analysis; the output of a parser is generally tree-shaped.''

Being representable as a tree and being a tree are two different things. Any graph can be represented as a tree (IIRC) if we duplicate stuff.

["Tree" means, in this context, a conceptual arrangement of data relationships. Regardless of how you represent it, a program that conforms to a context-free grammar forms such a tree. Be careful to separate implementation details from logical concepts. An AST is a logical concept, so that's the appropriate terminological context here. Digressing into discussion of possible representations for the same conceptual information is irrelevant unless the topic is implementation of ASTs.]

I don't see where I mixed up logical and implementation details. Anyhow, this is about "syntax" and not necessarily context-free grammars. I don't see the connection.

Then - not to be rude - you have some reading to do. This is fundamental ComputerScience, top.

Are we talking about any syntax tree here, or a specific kind that goes with CFG's? We should make such clear to the reader.

A syntax tree is a syntax tree. Sometimes one finds them as the output of a parse stage; sometimes one synthesizes one from scratch for purposes of discussion or analysis. Even in languages (such as C/C++) who have non-context-free elements that the parser must be aware of (different syntax reduction rules are used depending on whether a given term has previously been defined as a type or not; a "feature" that makes C code more difficult to parse than other languages), the output of the parse phase is a tree. There is little advantage in having it otherwise; more complicated relationships can be introduced by the semantic analysis phase; doing this sort of thing at parse time is a bad idea - it complicates the grammar terribly; and can produce very slow parsers.

The important point. An AST does not constrain, in any way, the "shape" of the resulting program. SQL can be turned into an AST, as can TutorialDee, BusinessSystemTwelve, or any other relational language you care to think of. Perhaps an advanced graphical ImageBasedLanguage might not have need of an AST but anything that involves converting text will need a parser; and outputting an AST is the most straightforward way to do it.

One more important point. In some languages, the AST is highly representative of the runtime structure of the code. Many HomoIconic? languages are like this - EssExpressions have a trivial translation into abstract form; and simple low-level lists are used to represent the tree, rather than creating a special AST_NODE datatype. In many other languages, the AST is simply the input to the next phase of translation; and it is then discarded. (In some implementations of such languages, an AST is never generated; SemanticAnalysis? is interleaved with parsing, and the output of the combined parse/analyze phase is not a tree).


See also RegisterTransferLanguage?, ControlFlowGraph, StaticSingleAssignmentForm.


Reinvention of ASTs, moved from CodeAsTrees?:

"Reinvention"? Nobody claimed it original that I know of.

This code:

   sub x(a,b,c) {
     i = a
     while (i < 20) {
       if odd(i) {
         i = i + b
       } else {
         i = i + c
         print("in else")
       }
       print("in loop")
     }  // end-while
     print(i)
   }
Can be represented as a tree:

                           sub x(*)
                           --------
                             * * *
                            *  *  *
                           *   *   *
                          *    *    *
                         *     *     *
                        *      *      *
                    i = a   while(*)  print(i)
                    -----    -----    --------
                              * *
                             *   *
                            *     *
                           if(*)  print("in loop")
                           --     ----------------
                           **
                          *  *
                         *    *
                        *      *
                      i=i+b    else
                      -----    ----
                               *  *
                              *    *
                          i=i+c   print("in else")
                          -----   ---------------------

(*) = full expression not shown
Actually, code is already tree-shaped, just in "outline format" instead of "traditional".

Example 2 - Outline Format

Expression: (a + b) - (c * (d / e))

A tree-based query interface example is given in BusinessRulesMetabase.
See Also: StepwiseRefinement
CeeLanguageFamilyFrontEnd (Clang) contains an implementation for CeeLanguage, CeePlusPlus and ObjectiveCee.
CategoryCompilers CategorySyntax

EditText of this page (last edited January 4, 2014) or FindPage with title or text search