Should type information be encoded in parse tree?

https://stackoverflow.com/questions/23265473

08-07-2023
|

Pregunta

I am working on a project including a small DSL. Lexing and parsing a string in this language results in a parse tree, implemented as an abstract class called Expr, which then has many of the usual derived classes such as AssignmentExpr, InvokeExpr, AdditionExpr, et cetera, corresponding to parse tree nodes which are assignments, function invocations, additions and so forth. The project is implemented in C#.

I am currently considering the implementation of type inference for this DSL. This means that I would like to be able to take an instance of the Expr class and return something encoding information about the types of the different nodes in the tree. This type information depends on a symbol table (types of variables) and a function table (function signatures). Thus, I would like to do something like:

TypedExpr typedExpr = inferTypes(expr, symbolTable, functionTable)

Here, TypedExpr would ideally be like Expr, except with a Type property giving the type of the expression. This, however, presents the following design problems:

It would make sense for TypedExpr to inherit from Expr and simply implement an additional property, Type. However, this would create two parallel inheritance hierarchies, one for TypedExpr (TypedAssignmentExpr, TypedInvokeExpr et cetera) and one for Expr (AssignmentExpr, InvokeExpr, et cetera). This is inconvenient to maintain, and the problem expands if further extensions of parse trees are required. I am not sure how this can be mitigated. One possibility would be the bridge design pattern, but I don't think this is capable of entirely solving the problem.
Alternatively, Expr could simply implement a Type property, which is then null at the time of construction from the parser, and later filled out by the type inference algorithm. However, passing around objects with null fields invites NullReferenceExceptions. The TypedExpr idea would have mitigated this. Furthermore, given that the idea of the Expr class is to express a parse tree, type information is not really a part of the tree: typing is context-sensitive, and requires particular symbol and function tables.
Third, the type inference method could also simply return a Dictionary< Expr, Type> which encodes type information about all nodes. This would mean that Expr remains representative of just the parse tree. The drawback of this is that the dictionary object constructed does not have any obvious properties showing that it is linked specifically to the Expr object passed to the type inference method.

I am not entirely satisfied with either of the three solutions given above.

My question is: What are the benefits and drawbacks of various approaches to this problem? Should type information be encoded directly in the parse tree, or should a parallel tree class be used? Or is the Dictionary solution the best? Is there an accepted "best practice" solution?

Solución

Go ahead with option two. This is what can be considered a “best practice”.

The reason is that a compiler usually works in many passes (stages, phases). Parsing being the first one, type resolution another one. You can later add an optimization pass, a code generation pass etc. Usually, a single data structure, an abstract syntax tree (AST; or parse tree), is maintianed allong these passes.

The idea that “passing around objects with null fields invites NullReferenceExceptions” is just false worries. You have to handle invalid cases a introduce counter-measures to validate inputs / outputs anyway. Compilers, including simple expression processors, are pretty complex things driven by complicated rules, which involve high degrees of data structure complexity and application logic you can't simply avoid.

It is very normal for an AST to have uninitialized data. Each compilation pass, besides initial construction of the AST by the parser, then manipulates the AST, computes more information (like your type resolution phase). The AST may even change substantially, i.e. due to an optimization pass.

Side note: modern compilers, such as the latest C# compiler, employ a non-mutability policy over ASTs and other internal data structures. In that case each pass builds its own new data structure. You could then design a new set of data structures for each pass, but that may turn into an overly complex code to maintain. Someone from the C# compiler team could elaborate more on this topic.

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow