Lecture 9 - 480/580 (compiler construction)
Bottom Up parsing (part 3)
You may have noticed that my introduction of LR parsing differs a bit from
the book. Let me tell you why: LR parsing is probably the hardest topic we
will discuss in this class, and it has been my observation in the past that
most students have an extremely difficult time understanding it. I have
tried to understand why this is so, and have tried to see if I can come up
with a different way of presenting the material that makes it easier. My
conclusion has been that the traditional approach (the one used by our
textbook, and almost all other textbooks) presents the traditional LR
parsing table construction algorithm without giving you any idea where it
came from. So I have tried something different - giving you the complete
story of where it came from, and building up to the traditional algorithm.
My hope is that by doing so I will take the mystery out of LR parsing, and
most of you will have a better understanding of how it works. (Since I
have not seen any textbook that presents the material in this way, if I
think this approach is successful I may just have to sit down and write my
own book!)
As with all matters, if you have any comments or suggestions on this or any
other topic related to the course, I would be glad to hear them.
Now let us relate the approach taken in lectures 07 and 08 to the material
in the book.
An "item" is simply a production with a marker (cursor) in it.
Remember that we built NDFAs for each nonterminal (put these back on
board). First note that we can associate with each state a set of items.
(I should perhaps have done this from the start, back in lecture 07).
For example, for E, we had the following states in our NDFA
                 E                   +                     T
E -> @ E + T =====> E -> E @ + T =======> E -> E + @ T =======> E -> E + T @
E -> @ T \
          \ T
           \=========> E -> T @
(do all the rest as well).
Recall the epsilon arcs tied two states together for us.
We can build these states directly from the grammar, without going
through the NDFA. That is, for this particular construction, we can go
directly to the DFA.
How do we do this? Start from the E start state. Go through
closure algorithm. Notice that the closure gives us exactly the same set
of states that we get by the epsilon arcs. That is, closure corresponds to
our tying our NDFA's together by epsilon arcs.
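A minimal sketch of the closure computation, assuming items are encoded as (lhs, rhs, dot) triples and the grammar as a dict from nonterminal to right-hand sides; the toy grammar below (E -> E + T | T, T -> id) extends our running example with a production for T, so this encoding is my own and not the book's:

```python
GRAMMAR = {"E": [("E", "+", "T"), ("T",)], "T": [("id",)]}

def closure(items, grammar=GRAMMAR):
    """Whenever a cursor sits just before a nonterminal, pull in that
    nonterminal's fresh productions (cursor at the left end) -- exactly
    the states the epsilon arcs tied in for us."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                new = (rhs[dot], prod, 0)
                if new not in result:
                    result.add(new)
                    work.append(new)
    return frozenset(result)

start = closure({("E", ("E", "+", "T"), 0)})
# The closure pulls in E -> @ T, and then T -> @ id through it:
assert ("E", ("T",), 0) in start
assert ("T", ("id",), 0) in start
```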
Go over GOTO. This is just the finite automata GOTO.
Go quickly over the table representation. This is just an encoding of the
finite-state automaton.
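In the same sketch style, GOTO is just the ordinary finite-automaton transition: move the cursor over one symbol in every item that allows it, then close the result. (The closure helper and toy grammar are repeated so the sketch stands on its own; the table is just a tabulation of what this function returns for each state/symbol pair.)

```python
GRAMMAR = {"E": [("E", "+", "T"), ("T",)], "T": [("id",)]}

def closure(items, grammar=GRAMMAR):
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                new = (rhs[dot], prod, 0)
                if new not in result:
                    result.add(new)
                    work.append(new)
    return frozenset(result)

def goto(state, symbol, grammar=GRAMMAR):
    # Advance the cursor over `symbol` wherever possible, then close.
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in state
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved, grammar)

start = closure({("E", ("E", "+", "T"), 0), ("E", ("T",), 0)})
after_E = goto(start, "E")
assert ("E", ("E", "+", "T"), 1) in after_E    # E -> E @ + T
```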
====================================================================
Remember the shift/reduce and reduce/reduce problems?
Here are various ways to solve them.
The algorithm I gave in lecture 08 always reduced. This algorithm is
simple, but turns out to be useful only on very simple grammars.
The simplest algorithm that is practically useful is a bit more complex.
It is called SLR, and can be described as follows:
SLR RULES:
Reduce when the current token is in the follow set for
the nonterminal being reduced to.
Otherwise shift if you can.
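The two rules can be restated directly in code. This sketch assumes the FOLLOW sets have already been computed; the sets below are hand-computed for the toy grammar E -> E + T | T, T -> id, with $ as the end-of-input marker, and items are (lhs, rhs, dot) triples as before:

```python
FOLLOW = {"E": {"+", "$"}, "T": {"+", "$"}}

def slr_action(state, token):
    """Reduce by a completed item when the token is in FOLLOW of the
    nonterminal being reduced to; otherwise shift if some item has the
    token just after its cursor."""
    for lhs, rhs, dot in state:
        if dot == len(rhs) and token in FOLLOW[lhs]:
            return ("reduce", lhs, rhs)
    for lhs, rhs, dot in state:
        if dot < len(rhs) and rhs[dot] == token:
            return ("shift",)
    return ("error",)

done = {("E", ("E", "+", "T"), 3)}             # E -> E + T @
assert slr_action(done, "+") == ("reduce", "E", ("E", "+", "T"))
assert slr_action(done, "id") == ("error",)    # "id" is not in FOLLOW(E)
```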
This is easy to do, although it does mean we have to compute those dreaded
follow sets. However, there are grammars for which this doesn't work,
such as the one given on page 229. Go through this example.
The problem is that the follow set does not provide a fine enough filter.
It can happen that some nonterminal can be followed by another symbol, but
not in all cases. We want to know not that a nonterminal can be followed
by some symbol, but that it can be followed by that symbol in this
particular case.
We can compute this information by augmenting our items with a set of
tokens that we expect to follow the production, in this particular case.
( Go over the LR(1) construction for the grammar on page 229 - note
particularly that we have to duplicate the FA for L because it has two
different follow sets).
Augmenting the FA in this manner is called canonical LR parsing.
The parsing rules are then as follows:
LR(1) parsing rules:
Shift whenever you can
Reduce only when the next token is in the LR(1) item, that is,
when it can follow the nonterminal IN THIS SITUATION.
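In code, the only change from the SLR sketch is that the lookahead set travels with the item (a four-tuple here, my own encoding) instead of coming from a global FOLLOW table:

```python
def lr1_action(state, token):
    """Shift whenever you can; reduce only when the token is in the
    completed item's own lookahead set."""
    for lhs, rhs, dot, _ in state:
        if dot < len(rhs) and rhs[dot] == token:
            return ("shift",)
    for lhs, rhs, dot, lookaheads in state:
        if dot == len(rhs) and token in lookaheads:
            return ("reduce", lhs, rhs)
    return ("error",)

# Same completed production as before, but this item says only "$" may
# follow HERE, even though "+" is in the global FOLLOW(E).
done = {("E", ("E", "+", "T"), 3, frozenset({"$"}))}
assert lr1_action(done, "$") == ("reduce", "E", ("E", "+", "T"))
assert lr1_action(done, "+") == ("error",)
```

This is exactly the finer filter we wanted: the same production can reduce in one state and not in another.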
Unfortunately, you will notice that we had to duplicate the FA for
nonterminal L, which makes the table bigger, which makes the parser bigger.
In a realistic programming language, such as Pascal, this can cause
serious trouble (many thousands of states, instead of hundreds for SLR).
However, we can compromise, and do something a little better than SLR,
not as good as LR(1), but small. Suppose instead of duplicating the FA's,
we just joined all the information together. (Go through this example).
Now, the DFA we get will be no bigger than the one for SLR (do you see
why?) But the rule for when to reduce is a bit stronger:
LALR parsing rules:
Shift whenever you can
Reduce only when the next token is in the LALR item, that is,
when it can (possibly) follow the nonterminal IN THIS SITUATION.
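The joining step can be sketched as follows, assuming LR(1) states are encoded as frozensets of (lhs, rhs, dot, lookaheads) four-tuples (my own encoding): states whose items differ only in their lookaheads (the same "core") are merged by unioning the lookahead sets. The example merges two duplicated states for L from the page-229 grammar, one expecting = and one expecting $:

```python
from collections import defaultdict

def merge_lalr(lr1_states):
    """Group LR(1) states by core; union the lookaheads in each group."""
    by_core = defaultdict(lambda: defaultdict(set))
    for state in lr1_states:
        core = frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)
        for lhs, rhs, dot, lookaheads in state:
            by_core[core][(lhs, rhs, dot)] |= set(lookaheads)
    return [frozenset((lhs, rhs, dot, frozenset(la))
                      for (lhs, rhs, dot), la in items.items())
            for items in by_core.values()]

s1 = frozenset({("L", ("id",), 1, frozenset({"="}))})   # L -> id @, {=}
s2 = frozenset({("L", ("id",), 1, frozenset({"$"}))})   # L -> id @, {$}
merged = merge_lalr([s1, s2])
assert len(merged) == 1
assert ("L", ("id",), 1, frozenset({"=", "$"})) in merged[0]
```

Since merging never creates a core that SLR didn't already have, the resulting automaton is no bigger than the SLR one; the price is that the unioned lookahead sets are coarser than canonical LR(1)'s.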
LALR parsing is the most common; for example, it is the algorithm used by Yacc.
It works almost all of the time, and it produces relatively small parsers.