Lecture 12 - 480/580 (compiler construction) Internal representation Almost no compiler produces code without first converting a program into some intermediate representation that is used just inside the compiler. This intermediate represetation is called by various names: * internal representation * intermediate representation * intermediate language and sometimes by the form the intermediate language takes: * tuples * triples * abstract syntax trees. We will use for our intermediate representation abstract syntax trees. Generally an internal form is kept around only until the compiler generates code; then it can be discarded. Another difference in compilers is how much of the program is kept in internal form; this is related to the question of how much of the program the compiler looks at before it starts to generate code. There is a wide spectrum of choices: * generate code totally on the fly; as you read an expression in generate code for it (this requires a very simple internal form). * generate code as each statement is recognized. That is, build an internal representation for each statement, and then when we have totally recognized the statement, generate code for it. * generate code as each basic block is recognized. (we will talk about basic blocks a lot more when we get to optimization, but simply put a basic block is a block of assignment statements- statements with no control flow). This requires that we keep an entire basic block in some internal form before we start generating code. * generate code as each procedure is recognized. This requires an internal form that can hold an entire procedure. * generate code when an entire file has been parsed. * generate code when an entire program (which may be many files) has been parsed. The first choice and the last choice are very seldom encountered, but all the others are relatively common. For our programming assignment we will be doing the second choice; recognizing a single statement and then immediately generating code for it. However in class I will frequently mention other possibilities. Since we are generating code immediately after we have seen a statement, our internal form pretty much only needs to hold expressions (which may include such things as function calls). Thus the representation we will use is a simple tree (usually binary), the abstract syntax tree. I will start a bottom-up description of the abstract syntax trees we will use; that is, I will start at the lowest nonterminal (factor or variable) and move upwards. What must be included in the abstract syntax tree is everything we need to generate code; this tree is the ONLY thing the code generator will see. Thus type checking, subscripts, variable addressing, and so on must all be explicitly present in the trees. We will talk about these in the next few days. Let us start with variables. The first thing to note about variables is that there are two characteristics that are important--the address and the contents of the address. Usually we don't think about this, and conventional programming languages don't force the difference upon us. But consider, when I say something like: a := b What I am saysing is "take the CONTENTS associated with the name b, and move them to the ADDRESS associated with the name a". I know of only one language (BLISS) which forces the programmer to make this explicit; almost all other languages (that have assignment) adopt this as a convention, and so it becomes second nature and the programmer seldom thinks about it. Well, this sloppiness will not do when we are talking about code generation. We need to be much more precise. Therefore, we will adopt the convention that in our abstract syntax trees we will always refer to ADDRESSES, unless we explicitly say otherwise. We say otherwise by introducing a new unary operator, called DEREFERENCE, written @. We will see the use of dereference in a few moments. Let us return to simple variables. Remember we said we would keep only local variables on the activation record; global variables would be kept in static areas (thus we can get away with no display and no static chain). Thus we need two different types of variable nodes; for global variables, all the code generator needs to know is the name. Thus we have a node that has only a name field. For local variables (and call by value parameters) we know the address of the variable as an offset relative to the activation frame pointer (register 6). Thus we can discover the ADDRESS of the variable by making a tree representing the SUM of the register and the offset: + / \ r6 offset Note that there are THREE nodes in the abstract syntax tree. The left node is a register, the right node is a constant; exactly the SAME type of node as would be generated for an explicit constant in the program. The third node is a plus, exactly the SAME type of node that would be generated for an addition operator in the program. Question: How would we make a tree that represents the ADDRESS where a call-by-reference parameter is stored? (don't worry, all our parameters will be call by value). Now lets work up to subscripts. How are arrays stored in memory? Just as a run of storage locations; [ location 0, location 1, location 2, ...., location n] So how do we find the i'th entry in the array? By getting the address of the start of the block and adding i. Problems: lower bounds are often something other than 0, and sizes of each element are often something other than 1. So, in full glory, the computation that must take place when we want to reference a[i] is as follows (address of a) + ((value of i) - lowerbound) * (size of element). Let's take an example, suppose a is a local variable that starts at location 12, the array lowerbound is 1, the array is an array of integers (each 4 bytes long), i is a local variable that starts at location 8, we have the following (r6 + 12) + (@(r6 + 8) - 1) * 4 Note the dereference, we want the VALUE of i, not the ADDRESS of i. (Yes, this can be optimized; and yes, we are going to optimize this later). Lets look at some of the other possiblities for factor. When a variable appears as a factor, we want the value, not the address, right? So we must produce an explicit dereference node. Constants just generate a constant node (either integer or floating point). We have already seen a constant node in the variables. What about function calls? Well, the problem here is that functions can in general have any number of arguments, and thus are hard to represent as binary trees. To get around this using a trick, we treat a comma as a binary operator; and thus function calls have two attributes of interest, their name and the tree representing their arguments. The latter can either be nil (no arguments), and expression (one argument), or a comma (two or more argumnents). Unary and binary nodes are easy, then just generate unary and binary tree nodes with the right operator. We have already seen the plus. Others are similar. Note that we now must use semantics to make sure that or is not used as a unary operator. (call yyerror if it is, with some appropriate message). Next lecture we will discuss how to add type checking to our nodes, this will be followed by a sequence of lectures on how we are going to generate code.