Lecture 19 - 480/580 (compiler construction) Data-Flow Analysis I touched on a topic in the last lecture that I want to expand on today, that that is the idea of data-flow analysis. Basically, the idea behind data-flow analysis is to understand the relationship between where a variable is defined and where it is used. Once we have this information, all sorts of optimizations become possible. Recall how we built a flow graph for our program. Each node in the graph consisted of a basic block (perhaps empty) and a control transfer statement (perhaps just the default falling into the next node). We first note that, barring side effects (such as function calls), control transfer statements do not change the values of any variables. We are therefore just concerned with the changes brought about by the statements in each of the basic blocks. In each of these blocks, we can define two sets: Gen : the set of values generated (that is, given values) in this block (plus we also keep a pointer to the expression representing the value being assigned) Kill : the names of identifiers that are killed in this block - that is, identifiers that do not have the same value leaving the block that they did coming into the block. For example, in the following code: i := m - 1 a[i] := 17 i := 1 + m The variable i is generated, and also killed. We also say that the variable a is both genrated and killed. The value m is neither generated nor killed, but it is used. (A value that is generated is also killed, but a value can be killed but not generated; for example pointers in C will kill values but not generate new ones). Two more sets can be defined which describe how nodes in the flow graph relate to each other: in : the set of identifiers and symbolic values that potentially are available at the start of a node in the flow graph. out : the set of identifiers and symbolic values that are available at the end of a node in the flow graph. Relationships between these values can be described by equations, such as the following: out = gen union (in minus kill) which states that the values available on output from a node are those generated by the node plus those available on input minus those killed by the node. We will imagine a new value, called ``undefined'', and set all values to undefined at the beginning of the program. Thus every value is available when we begin. I will simplify a bit, but basically the idea behind dataflow is the following iterative procedure: compute gen and kill for each node. starting either from the top (which has no input) or the bottom (which has no output), traverse every arc updating the in and out sets. Repeat this (which may require going over the graph many times), until no changes are found. Now let us assume that we have performed data-flow analysis; and for every use of a variable we know the *set* of possible places the variable might have been defined, and for every generation of a variable we know the *set* of places it might potentially be used. What can we do with this information? What happens if we have a use and can't find a previous setting? * detection of undefined variables * detection of variables set and not used. Another powerful use is: * constant propagation. Suppose we have a statement such as upperlimit := 105; and nowhere else do we ever assign to upperlimit (such statements are common in languages where we don't have symbolic constants). We can replace each occurrence of upperlimit by the value 105; which can possibly then be used in further optimizations. * dead code elimination Performing the above may allow us to determine some conditions at compile time, rather than run time. For example: debug = false ... if (debug) ... If we can we may be able to find "dead code", code which can never be executed. If we can, we can eliminate the code entirely. * copy propagation Suppose we have a statement such as: x := y and later we can detect that a use of the variable x can only proceed from this statement and y has not been changed. We can change an occurence of x to be y. (This may allow further optimizations, for example if y is part of a common subexpression). We have already mentioned (last lecture): * detection of loop invariant computations * detection of induction variables (variables changing by arithmetic progressions) and the subsequent reduction in strength. * available expression As long as we know where variables are modified, we can perform a form of common subexpression elimination between blocks. For example, suppose we have the following: c[i, j] := x + y; if (x < 42) then c[i, k] := 10; recall that the first subscript was involved in a fairly significant computation (subtract off the lower bound, multiply by the size); we can save that computation and not have to regenerate it. * busy expressions Even if a value is not available, we might want to perform code motion to make it available if it is used in two different subsequent blocks, for example: if (x < 42) then c[i, k] := 10 else c[i, j] := 42; this reduces the amount of code generated, but not the speed. * live-dead analysis Finally, in some of the code generation techniques we will discuss next week, it is important to know when and how often a variable is being used, and more importantly when it is no longer being used. For example, if we are holding a variable in a register we will certainly want to free up the register when the variable is no longer used. ----------------------------------------------------------------- It should be clear that performing all of these transformations may result in an object program that has only little resemblence to the original source program. While these may make the program run faster, they also make it considerably more difficult to debug. Some (and often times, myself) would argue that this doesn't matter. People programming in a high level language often, and shouldn't know the assembly language generated anyway. So who cares if it has been moved around and changed? On the other hand, sometimes it does matter. Sometimes we do want to look at the bits and somehow relate them back to the original program. This is a problem that has not been totally and adequately addressed. For example, under Unix if you want to use the symbolic debugger sdb, you have to tell the COMPILER of this fact and all optimizations are automatically turned off. The problem of how to relate optimized code back to the source level is very difficult, currently the stuff of PhD disserations, and will probably not be solved for a few years (by which time improvments in machines, for example RISCness, will have made the problem irrelevant).