Lecture 9 - 480/580 (compiler construction) Bottom Up parsing (part 3) You may have noticed that my introduction of LR parsing differs a bit from the book. Let me tell you why; LR parsing is probably the hardest topic we will discuss in this class, and it has been my observation in the past that most students have an extremely difficult time understanding it. I have tried to understand why this is so, and have tried to see if I can come up with a different way of presenting the material that makes it easier. My conclusion has been that the traditional approach (the one used by our textbook, and almost all other textbooks), presents the traditional LR parsing table construction algorithm without giving you any idea where it came from. So I have tried something different - giving you the complete story where it came from, and building up to the traditional algorithm. My hope is that by doing so I will take the mystery out of LR parsing, and most of you will have a better understanding of how it works. (Since I have not seen any textbook that presents the material in this way, if I think this approach is successful I may just have to sit down and write my own book!) As with all matters, if you have any comments or suggestions on this or any other topic related to the course, I would be glad to hear them. Now let us relate the approach taken in lectures 07 and 08 to the material in the book. An "item" is simply a production with a marker (cursor) in it. Remember that we built NDFAs for each nonterminal (put these back on board). First note that we can associate with each state a set of items. (I should perhaps have done this from the start, back in lecture 07). For example, for E, we had the following states in our NDFA E + T E -> @ E + T =====> E -> E @ + R =======> E -> E + @ T =======> E -> E + T@ E -> @ T \ \ T \=========> E -> T@ (do all the rest as well). Recall the epsilon arcs tied two states together for us. We can build these states directly from the grammar, without going through the NDFA. That is, for this particular construction, we can go directly to the DFA. How do we do this? Start from the E start state. Go through closure algorithm. Notice that the closure gives us exactly the same set of states that we get by the epsilon arcs. That is, closure corresponds to our tying our NDFA's together by epsilon arcs. Go over GOTO. This is just the finite automata GOTO. Go quickly over the table representation. This is just an encoding of the FS automaton. ==================================================================== Remember the shift/reduce and reduce/reduce problems? Here are various ways to solve them. The algorithm I gave in lecture 08 always reduced. This algorithm is simple, but turns out to be useful only on very simple grammars. The simplest algorithm that is practically useful is a bit more complex, is called SLR, and can be described as follows: SLR RULES: Reduce when the current token is in the follow set for the nonterminal being reduced to. Otherwise shift if you can. This is easy to do, although it does mean we have to compute those dreaded follow sets. However there are grammars for which this doesn't work, such as the one given on page 229. Go through this example. The problem is that the follow set does not provide a fine enough filter. It can happen that some nonterminal can be followed by another symbol, but not in all cases. We want to know not that a nonterminal can be followed by some symbol, but that it can be followed by that symbol in this particular case. We can compute this information by augmenting our items with a set of tokens that we expect to follow the production, in this particular case. ( Go over the LR(1) construction for the grammar on page 229 - note particularly that we have to duplicate the FA for L because it has two different follow sets). Augmenting the FA in this manner is called cannonical LR parsing. The parsing rules are then as follows: LR(1) parsing rules: Shift whenever you can Reduce only when the next token is in the LR(1) item, that is, when it can follow the nonterminal IN THIS SITUATION. Unfortunately, you will notice that we had to duplicate the FA for nonterminal L, which makes the table bigger, which makes the parser bigger. In a relatistic programming language, such as Pascal, this can cause serious trouble (many thousands or states, instead of hundreds for SLR). However, we can compromise, and so something a little bit better than SLR, not as good as LR(1), but small. Suppose instead of duplicating the FA's, we just joined all the information together. (Go through this example). Now, the DFA we get will be no bigger than the one for SLR (do you see why?) But the rule for when to reduce is a bit stronger: LALR parsing rules: Shift whenever you can Reduce only when the next token is in the LALR item, that is, when it can (possibly) follow the nonterminal IN THIS SITUATION. LALR parsing is the most common, for example it is the algorithm used by Yacc. It works almost all of the time, and it produces relatively small parsers.