Lecture 10 - 480/580 (compiler construction) Symbol tables. In order to give meaning to programs we must know what individual symbols mean. Some symbols, such as keyword tokens or punctuation, are given meaning by the grammar, and are thus handled implicitly by the parser. Other symbols, such as user identifiers, do not have an implicit meaning, and must be handled more directly. A problem with user identifiers is that they do not necessarily exhibit textual locality; that is, the place we learn something about an identifier (the declaration statement) may be far away from where that information is necessary for understanding the program (an assignment statement or an expression). Because of this lack of locality, purely syntactic techniques for maintaining symbol table information is difficult. That is, it is hard to write a grammar that will assure that a variable declared as a scalar integer will not be used as an array, or a pointer, or a real. (You can, in fact, prove that such a task is impossible for CFL grammars. Various other forms of grammars have been proposed, with a notable lack of success or enthusiasm on the part of compiler writers). So instead of syntax, we must use semantic checks to detect this. Even worse, the place where we learn about the symbol (the declaration part), cannot easily be made close to the place where this information is needed (the identifier node). To see this, try drawing a typical parse tree, starting a subprogram_declaration. [ draw tree ] remember that information can only flow UP the parse tree. Thus the fact that this symbol information is setting up high off in some ``cousin'' branch (that is, not a direct ancestor branch) makes it very difficult to access using attributes. Thus, we must go outside the attribute structure, placing the symbol table in a global variable that is accessed by the routines that generate the identifier node. [ as an exercise, try rewriting the grammar so that declarations are lower in the parse three than the nodes that use the declarations. ] It is this global variable that contains the symbol information that is generalled called by the name symbol table. Let us start by considering the nature of symbols, and what the information we would like to maintain should include. Once we understand the nature of the problem, we can then go on to propose various data structures that can be used to hold the information. The following attributes of a symbol would seem to be important: NAME - how a symbol will appear when textually used. SCOPE - in what portions of a program the symbol can legally be used. USE - how a symbol is being used; as a variable, as a field name, as a keyword, and so on. TYPE - a description of the set of values a symbol can be bound to. LIFETIME - a description of what period during run time a variable will denote a legal value. SIZE - the amount of memory the values associated with a variable may occupy. LOCATION - where in memory the value associated with the symbol will be found. Of course, not all symbol tables must maintain all these values; some may be implicit in the definition of the language. In Algol-like languages, (such as Pascal), for example, scope and lifetime are often implicitly defined because of the block structured nature of the language; a variable scope is local to a procedure (and any procedure contained in it), and exists as long as that procedure is being called. Size can often be inferred indirectly from type, and so on. But this need not be the case in all languages. Now, one problem we see immediately with symbol tables is that much of this information is variable length (names and types) and much is shared (types). Let us therefore consider the layout for an individual record for a symbol table (regardless of how groups of records are combined). There are two choices: * fixed length records with truncation (maximum length on variable names), * arbitrary length records with pointers and sharing. Give advantages and disadvantages of both. Now let us consider how symbol tables might be organized. There are various conflicting requirements that drive the design of a symbol table: * It must solve the problem at hand * It must not use too many resources such as memory * It must not be too slow. * It should be easy to use, as this makes for better reliability. Let us start with a very simple design, the one used in your assignment, and then give examples of how it could be extended. In your assignment there are various simplifications that greatly make our task easier. Among these are the following: * The language is block structured. * There are two levels of procedure, which we can call global and local. * Keywords are reserved, and thus need not appear in the symbol table. * Once we start to see local variables, new global variables cannot be introduced until we are completely finished with all local variables. * We are not making a production system, and thus are interested more in ease of use than in speed. Now, let us start to add features to the system and see how things would change. Add 1) nested procedures (not hard, have to redefine our local/global notion). 2) Labels and gotos (add more type information in table). 3) Structures and Records (much more complex, must have different symbol tables). 4) An emphasis on speed (array lookup too slow, use hash tables).