Lecture 13 - 480/580 (compiler construction) Type Checking Today we are going to discuss type checking. Before we do, we need to first overview the nature of types in conventional programming languages. Why do we have types in programming languages at all? Basically, there are three main reasons: 1) documentation 2) error checking 3) optimization Documentation types help document exactly what a section of code is doing. Types can convey information often even better than procedural code. It has often been said ``if I see your data structure declarations, I don't even need to know your code - I can tell what your program will do''. This is an exagaration, of course, but it does indicate the communicating power of types. but of course, THIS use of types, important as it is, has nothing to do with the compiler. Error Checking When the compiler can determine types a large number of errors are often catchable at compile time, rather than at run time. Examples include assigning incompatable types, passing incorrect types as arguments, and so on. Thus development time can be cut, at the expense of having to enter all those type declaration statements....... Optimization For the compiler, this is probably the most important reason to have types. If a compiler can determine that a ``+'' sign represents integer addition, it can produce only the integer addition code and not the code that, at run time, checks the types of the arguments and does the right thing. If types can be known at compile time, memory allocation can be simplified. For example, if the size of all local variables can be determined at compile time we can simply hack off a single large block for local variables. --------------------------------------------------------------- So, if we look at these uses for types, what forms seem most useful. The obvious types for optimization point of view are those that map easily on to the basic machine types - for example integers and reals. Of course these types can represent lots of things. A character is just a small integer, a real could be either a degree celicius or a degree farenheit. But of course if an integer is representing a character you don't want to perform a multiplication on it, or if a real number is a degree-C you don't want to add a degree-F to it. So from the point of view of error checking and optimization you want to add some ``syntactic suger''. Types such as character boolean enumerated datatypes color: (orange, red, purple) subranges pointers are just fancy coverings for more basic types. They do, however, permit better error checking. In this sense types are acting as a type of shield, to protect the raw bits from improper use by the programmer. The implementation of most conventional languages gives us some rather strong constraints a) the legality of operations should be checkable (as much as possible) at compile time b) the size of variables must be determinable at compile time. Think about some of the implications of the latter. A fixed size array of fixed size elements is OK, but an array with unknown bounds is not. A record structure is ok, but a set of indefinite size is not. I'm sure there are lots of types you can think of that WOULD be useful in programming languages, but aren't usually found simply because of this size restriction. Notice that requiriment (a) cannot always be satisfied. Array bounds checking is a good example, or pointer dereferencing. These require run time checks to insure things happen the right way. Some languages, for example PASCAL, take this to extreme lengths. For example in Pascal the bounds on an array are part of the array type - so you cannot write a sorting procedure which will work for arbitrary length arrays. Which brings us to talking about how to tell if one thing is the same type as another. When we had only a few types it was relatively easy to tell if an expression had the same type as another expression. 3 and 47, for example, were both clearly integers. But as types became more complex, it became more and more difficult to answer even simple questions. This problem became compounded when Pascal and ALGOL-68 introduced the idea of naming types and of subranges: type w : 1 .. 10 x : array [ w ] of integer; y : 3 .. 5; var a : x; b : array [ 1 .. 10 ] of integer; c : integer; d : y; e : 1 .. 10; f : w; Do the variables a and b have the same type? Can either be indexed by c? by d? by e? or by f? There are several issues here. structural equivalence vs. name equivalence narrowing and widening It is perhaps surprising that it took many years to hammer out these issues, and there is still a great deal of debate that can be generated. Now, there are two important concepts related to types that you should know. These are overloading and polymorphism: A function (or operator) is said to be overloaded when the same symbol is used for two or more quite different meanings. For example, the symbol "+" is overloaded in most programming languages. This can mean integer addition or floating point addition (which generate quite different code segments) or even set union (in Pascal or Setl, for example). So overloading means one text sequence, many code sequences. A polymorphic function, on the other hand, is one function (code sequence) that can work for many different types. Pascal is hindered by its inability to write polymorphic functions. This situation was compounded by the decision in Pascal to make the index into an array be a separate type. Thus it is impossible in Pascal to even do something so simple as to write a routine to sort an array of numbers, where you don't know the limits of the array. This routine, if you could write it, would be polymorphic; it would have one code fragment that could work on multiple types. Newer languages (Modula, Simula, Smalltalk, ML) have given much more support to polymorphic functions. (depending upon time, cite virtues of polymorphic functions). Related to these concepts is the notion of coercion: A coercion is the conversion of an expression from one type to another. I actually have already mentioned widening and narrowing, which are two kinds of coercion which don't change the representation of values. Other coercions, however, can change representations. When I type: 4 + 7.5 for example, the 4 (normally an integer) is converted into a floating point value before the addition. We say it is coerced into being of type float. Notice the programmer has done nothing to indicate this coercion, we say it is *implicit*. So; how do I relate this all to your programming assignment? Well, we must keep track of types as we build the abstract syntax tree nodes. How do we do this? Well, there is a field in the node structure to keep track of type information. Types of constants are self evident. Types of ID's can be discovered from the symbol table (notice we don't need to kept track of type info for lower level nodes, such as registers and so on). Types of operators depend upon the operator. For mod and div the types must be integer (give an error if not), for +, - and * it depends upon the types of the arguments; we may have to change one or the other if the computation if real. Thus we introduce a new unary operator, inttoreal, which converts an integer into a real. What about other static checks? Make sure arguments to or and and are relationals, make sure subscript expressions are integer (we don't have subranges, and aren't doing subscript range checking at the moment). Arrays are used properly (subscripts not applied to non arrays, arrays not used in arithmetic operations).