Lecture 18 - 480/580 (compiler construction) Flow Graphs and Optimization In previous lectures I have noted the term Basic Block, which describes a sequence of assignment statements (or, more formally, a sequence of statements that do not have control flow, alternatively a sequence of statements that have the property that if any one of them are executed they all must be executed). Given the style of code generation I described in a previous lecture, it is clear that a basic block must be followed immediately by either a: 1) unconditional transfer 2) conditional transfer (including indexing into an array of labels and braching to the label found there) 3) function or procedure exit (basically a fancy type of branch). When attempting to optimize an entire function, it is useful to group the code into a *flow graph*. Nodes in the flow graph consist of a basic block (which may be empty) and the next control flow statement. Arcs point to the possible directions of control flow. Let us reconsider the program we looked at last time: for i := 1 to n for j := 1 to m c [i,j] := 0 for k := 1 to p c[i,j] := c[i,j] + a[i,k] * b[k,j] Remember, we performed: for i := 1 to n as: temp := n i := 1 L1: if (i > temp) goto L2 ... i := i + 1 goto L1 That being the case, we have the following flow graph. (I will expand the DAG into temporary variables, just because it is hard to type a DAG on a character-oriented terminal. In actual fact, it would probably continue to be represented internally by a DAG). Node 1 temp1 <- n i <- 1 (implicit fall into node 2) Node 2 If i > temp1 goto node 10 Node 3 temp2 <- m j <- 1 Node 4 if j > temp2 goto node 9 Node 5 (c + ((i*4-4) * m + (j*4-4))) <- 0 temp3 <- p k <- i Node 6 If k > temp3 goto node 8 Node 7 temp4 <- i*4-4 temp5 <- j*4-4 temp7 <- temp4 * m + temp5 temp8 <- k*4-4 (c + temp7) <- deref(c+temp7) + (a + (temp4 * p + temp8)) * (b + (temp8 * m + temp5)) k <- k + 1 goto node 6 Node 8 j <- j + 1 goto Node 4 Node 9 i <- i + 1 goto Node 2 Node 10 (procedure exit or whatever) Of course, we could represent the basic block in each of these as a DAG, and thereby detect common subexpressions. Anybody see any room for improvement in this code yet? Although by reducing the graph to simple GOTO structures we seem to have lost some information, in reality we can still talk about such things as loops. What does a loop look like in terms of flow graphs? A loop has a single entry node, which we call the header. The header must dominate the loop, that is, all paths that can reach any node in the body of the loop must pass through the header. A loop must also loop, i.e., it must have some path that goes back to the header for the loop. Finally, a loop must have a body. This will be any node that can be reached between two vists to the header. What is the smallest loop we have? Once we have a loop, we can ask questions such as the following: * what variables are potentially modified in any iteration of the loop? Let us see how we might answer this question. We certainly can, for each node, construct the set of variables that are modified BY THAT NODE (this is simply the set of assignments statements; although if our language was richer and has functions with side effects or pointers or what have you building the set might be harder). A simple algorithm would then be to take the union of all the generated values for nodes in the body of the loop. (There are more sophisticated techniques that give us more detailed information, but this will do for now). What variables get modified in our smallest loop? Lets look at the DAG we have for this statement in light of the information on what variables get modified. We see that big sections of the DAG will produce the same result EVERY TIME we go through the loop. We say such code is *loop invariant*. Well, that seems like a bit of a waste. Why recompute the same thing over and over again. How about we perform *code motion* and move the loop invariant code out of the loop, putting it in temporary variables. (Note, if only one node can go to the header, we can put it in that node. Otherwise, we may have to create a new node just prior to the header). Thus we have Node 5 temp3 <- p k <- 1 temp4 <- i*4 - 4 temp5 <- j*4 - 4 temp6 <- temp4 * p temp7 <- temp4 * m + temp5 (c + temp7) <- 0 Node 7 temp8 <- k * 4 - 4 (c + temp7) <- deref(c + temp7) + (a + (temp6 + temp8)) * (b + ((temp8 * m) + temp5)) k <- k + 1 goto node 6 Does this look good? What about the next outermost loop? What is its header? Its body? What variables are changed in its body? What computations are loop invariant? Node 1 temp1 <- n temp2 <- m temp3 <- p i <- 1 Node 3 j <- 1 temp4 <- i*4 - 4 temp6 <- temp4 * p temp9 <- temp4 * m Node 5 k <- 1 temp5 <- j*4 - 4 temp7 <- temp9 + temp5 (c + temp7) <- 0 Does this look good? See any further optimizations we could make? How about considering this: On many machines, multiplication is considerably more expensive (takes more cycles) than addition. What do we know about the variables i, j and k? Consider the innermost loop, we know not only that it is changing each iteration through the loop, but we know HOW it is changing. Particularly, we know that each iteration through the loop gives us: new k <- old k + constant How do we know this? Well, we can modify the procedure we used previously for telling what variables are defined within the loop. Now, however, we need to keep a little more information. We need to know: 1) the fact that a variable is changed, 2) the computation used to change it 3) and whether that change will ever, always, or never be executed each time through the loop. Let me illustrate the third one with an example; suppose we have: Node a a <- a + 1 if (b < c) goto node c Node b b <- b + 1 goto node a Node c a <- b + 1 goto node a Now, these three nodes form the body of a loop, but it is NOT true that a is modified by adding one to it EACH time we execute the loop. The problem is in node C, where the assignment is said to KILL the previous assignment. In order to detect how a variable is changed each time through the loop it requires we keep careful track of how variables are generated (created by assignment) and killed (overwritten). The book gives these algorithms in tremendous detail (it is dataflow analysis). For our purposes it is only necessary to know that this can be done. Now, let us get back to our program. Somehow, we discover that each time through the inner loop k is having one added to it. What is that doing to the expression k*4-4? If we detect this, we can replace the multiplication by an addition: Node 5 k <- 1 temp8 <- 0 (this is k*4 - 4 at this point) temp5 <- j*4 - 4 temp7 <- temp9 + temp5 (c + temp7) <- 0 Node 7 ... k <- k + 1 temp8 <- temp8 + 4 goto node6 We can do a similar trick for variables i and j. This technique is called *reduction in strength*, because we are replacing a stronger operator (multiplication) by a weaker one (addition). But if temp8 is being changed by 4 each time we go through the loop, how is the computation of temp8*m being changed? Since it is loop invariant, we can precompute the term 4*m once, in the initial block: Node 1 temp10 <- m * 4 temp2 <- temp10 - 4 ... Node 5 temp8 <- 0 temp11 <- 0 (temp8 * m) (c + temp7) <- 0 Node 7 (c + temp7) <- deref(c + temp7) + (a + (temp6 + temp8)) * (b + (temp11 + temp5) temp8 <- k * 4 - 4 temp11 <- temp11 + temp10 k <- k + 1 goto node 6 Other reductions in strength could be replacing exponentiation by addition. If we do this, note that we have almost completely removed any references to variable k. The only occurrences are now: * setting it to 1 * testing it against the upper limit p * incrementing it by one each time through the loop. We can remove all of these by testing temp8, rather than k. This only means that we have to modify the upper bound (temp3) So we now have: Node 1 temp3 <- p*4 - 4 Node 5 temp8 <- 0 ... Node 6 if temp8 > temp3 goto node 8 Finally, note that again we might have interference with optimization techniques; we either have to have similar rules for replacing left shifts with addition, or defer replacing multiplications with shifts until after we have passed this point. Here is the final code with all loop invariant code being moved and all reductions in strength: Node 1 temp10 <- m * 4 temp1 <- n * 4 - 4 temp2 <- temp10 - 4 temp12 <- p*4 temp3 <- temp12 - 4 temp4 <- 0 temp6 <- 0 (temp4 * p) temp9 <- 0 (temp4 * m) Node 2 if temp4 > temp1 goto node 10 Node 3 temp5 <- 0 Node 4 if temp5 > temp2 goto node 9 Node 5 temp8 <- 0 temp11 <- 0 temp7 <- temp9 + temp5 (c + temp7) <- 0 Node 6 if temp8 > temp3 goto node 8 Node 7 (c + temp7) <- deref(c+temp7) + (a + temp6 + temp8)) * (b + temp11 + temp5) temp8 <- temp8 + 4 temp11 <- temp11 + temp10 goto node 6 Node 8 temp5 <- temp5 + 4 goto node 4 Node 9 temp4 <- temp4 + 4 temp6 <- temp6 + temp12 temp9 <- temp9 + temp10 goto Node 2 Node 10 whatever Note that we have eliminated almost all the multiplications. There are still a couple optimizations we might want to consider (further removing the loop invariants c+temp7, a+temp6, and b+temp5, for example), but this is pretty good.