Oregon State University CS 480 Home Page
CS 480
-- Translators
General Information -- Winter Term 2004
http://classes.engr.oregonstate.edu/eecs/winter2004/csHandouts/
Calendar
See also the detailed timetable at end of this web page
Instructor
Lectures:
Prof. Timothy Budd,
Office: Dearborn 218, Phone: 737-5581,
E-mail: budd@cs.orst.edu
TA: Dan Moffitt and Christoph Neumann
E-mail: moffitt@engr.orst.edu and
neumann@engr.orst.edu
Office Hours
My office hours (Professor Budd) will be MWF 11:00-12:00, in my office
(Dearborn 218).
TA office hours:
- Dan: Monday and Tuesday 4-6 and Wednesday 2-4 in Hovland 108,
and Tuesday Thursday 12-2 in Hovland 01A (downstairs).
- Christoph: Monday 1-3 in Hovland 108.
Important Dates and Times
(see also the more detailed schedule).
- Class Time
- MWF 10:00 - 11:00 Owen 103
- Holiday
- Monday, January 19, no class
- Midterm
- Wednesday, Feb 11, in class.
- Final
- Tuesday, March 16, 9:30 AM.
Purpose of this Course
Introduction to translators (specifically compilers, but the techniques
apply to other types of translators), development of large software
projects, working in programming teams.
Textbooks
The textbook we will use this term is
Compilers, Principles, Techniques and Tools.
It is considered to be the definitive text on this topic, and
covers much more than we will be able to explore in the ten week
period.
I will also distribute my
lecture notes, which are very rough.
I will be revising my lecture notes this term to deal with more modern
topics, such as object-oriented programming.
I've rewritten some of the lecture notes to be more
Java specific, and in the process cleaned them up and put then
in book format. These will be distributed with the lecture notes.
If I find time, I may be writing more chapters as the term
progresses.
Since you will be doing your programming assignments in Java,
you may want to pick up a java tutorial as an optional text.
I'll give you some names if you ask.
I'll also be handing out introductory java material as needed.
Course Learning Objectives
On completion of the course, students must demonstrate the ability to:
- Describe a wide variety of different forms of translators
(XML->html, Tex->dvi, hardware description languages, and so on)
- Describe the various phases of a compiler
- Use regular expressions and context free languages to define a language.
- Create a grammar for a simple context free language
- Implement a lexical analyser to recognise tokens defined by
regular expressions
- Implement a parser, using either top down (recursive descent)
or bottom up (LR) techniques
- Generate working target language for simple programming constructs.
Grading
There will be six graded homeworks, and as many programming assignments.
I have scheduled the dates for these already. There will be one midterm,
and the final.
Grading will be based on a weighted formula which results in the
programming assignments and homework being given roughly the same weight as
the two exams. I will periodically post class standings and averages.
Completing all work correctly is sufficient to obtain a grade of B.
The A grade will be reserved for work that is above average in creativity
or completness. A grade of C or below is given for work that is below
standard.
If you are taking the classs S/U, be aware that this fact is not
reported to me by the administration. A grade of C- or better is
necessary to obtain the grade of S.
Grades (including class standings) are posted on-line
Old Exams
I have placed a number of old exams on-line, in both postscript and pdf format:
Languages and Machines
You will be doing your programming assignments in the language
Java. I assume that many of you will have prior experience in Java,
but I will be giving some introductory material in lectures.
Because we are using java, you hava a wide flexibility in the range of
machines you can use. There are free Java systems for the mac, PC's,
and on the HP machines. I personally do all my development on my mac
at home.
You will be GRADED on the HP systems. It would therefore be prudent
to make sure everything you hand in will run on those systems.
I will distribute a lot of source code
from the web.
You are welcome to copy these files to your own machine.
Note that the compiled programs are applications, not applets, so you
cannot run them from your web browser.
Programming Assignments
There are six programming assignments, on the following topics:
-
Writing a lexical analyzer.
-
Writing a parser.
-
Adding symbol table routines to your parser.
-
Creating abstract syntax trees.
-
Optimization.
-
Code Generation.
You are encouraged to work in programming teams of up to four people,
and hand in team solutions. When you work in a team, put the names of all
team members in your source code, as well as in the message box provided
on the web-based submission form.
The following links provide access to web-based copies of the
assignments:
Programming assignment 4
homework assignment 5
Programming assignment 5
Files for assignment 5
Programming assignment 6
Files for assignment 6
Late Assignments
I will generally hand out a working solution to the programming assignments
a day or two after the due date.
Because of this,
under no circumstances will assignments be accepted that are more than one
week late.
Assignments that are past the due date, but less than a week
late, will be accepted if previous arrangements have been agreed upon,
and if you can produce a listing showing the state of the solution prior to
the due date.
However I reserve the right to deduct points for being late.
Cheating
You are encouraged to discuss problems and programming assignments with
each other. Helping others learn is often the most powerful way of
mastering material yourself.
However, taking somebody elses solution
without their knowledge or consent is cheating and will be punished.
Do not leave copies of the programming assignments in the trash can in
a public place -- throw them away at home or some other private place.
Also do not leave your directories unprotected.
(Last year we had several situations where students were completing their
programming assignments using the trash can rather than the computer,
or were taking assignments from unprotected directories).
The College rules on this issue are harsh and are enforced.
Departmental
Policy on Dishonesty
Mail List
We will be using a mail list for this course.
The web page for the mail list is
here.
From the web page you can access an archive, as well as make various
modifications to your subscription.
You are allowed, even encouraged, to post message to the mail list if
you think they will
be of iterest to the rest of the class.
I will also often post information to the mail list.
Otherwise, please use e-mail to
communicate directly with the professor or the TA.
Information that is longer than a paragraph or two I'll probably type up as
a web page, adding the links here:
Detailed Schedule
- -------------Week 1
- M
Lecture 01,
Read textbook chapter 1 up to and including section 1.5.
Section 1.6 is interesting, but not essential.
summary: General introduction to topic.
study questions
- W
Lecture 02 and 03,
read chapter 3 of book, paying special attention
to sections 3.1, 3.3, 3.6 and 3.7.
I am assuming that all the discussion of converting RE into FA, and NDFA into
DFA, is all review for you. and will spend very little time on it.
The book advocates the use of double pointers for solving the buffering
problem, while I've found that a push-back buffer is both easier to implement
and to understand.
summary: A lexer needs buffering because you don't know when to stop until you
have gone too far. In this lecture I discussed how one would build a lexical
analyzer by hand. This is one of the few areas where I think the discusssion
in the book is weak -- if you are building a lexer by hand you probably don't
want to use regular expressions, as they are very hard to debug.
study questions
- F , HW 0 due, Lecture 02 and 03, continued, again, be reading chapter 3.
Understand in general terms how Lex goes about contructing a recognizer.
You don't need to know all the details about how the finite state table is encoded,
or the techniques used to minimize the size of the tables.
summary: In this lecture I discussed how an automatic tool, such as LEX, goes about
creating a lexical analyzer. The programmar gives lex a series of
regular expression-action pairs. LEX builds a huge DFA from all of these, and when
prompted simulates the action of the FA until it can go no further, then backs up to
the most recent final state for one of the FA, and executes the associated action.
study questions
- ------------- Week 2
- M - HW 1 due, PA 1 due
Lecture 4, Read Chapter 4, sections 4.1, 4.2, and 4.3.
This should be review material for most of you.
summary: Introduced parse trees, AST's, manipulation of grammars, removal of left
recursion, etc.
study questions
- W
Lecture 5, Book section 4.4. (I'll go into more detail than the book).
summary: recursive descent is the method of choice if you have to build a parser
by hand. Using recursive descent, every nonterminal is recognized by a
simple procedure. Using this technique, grammars are not allowed to have left
recursion. A problem is how to handle the situation where there are two (or more)
alternatives for a given nonterminal. One solution is to try each in turn,
backing up the input when you try a different selection. Another possibility is
to require the grammar be LL(1), which means it is only necessary to look at the
next symbol to determine which alternative is the correct one to be pursued.
The latter is the most common, and allows us to write relatively simple parsing
routines. While these are easy to create, they are also big. Towards the
end of the lecture I introduced an entirely different technique, which is
table driven and thus smaller.
study questions
- F
Lecture 6, Book section 4.4
First and follow sets. What they are, how to compute them. I'll give an intuitive
description that is slightly easier to understand than the formalizm of the book.
Will then try to explain once more how we could use these to build a table
driven parser that was smaller, albeit harder to debug or generate by hand.
study questions
- ------------- Week 3
- M - MLK day, no class
- W - HW 2 due, PA 2 due
Lecture 7, read Section 4.5, 4.7 and 4.8, you can skip 4.6. Also read
the chapter 1
(pdf) in the manuscript I handed out.
I'll be discussing these sections over several days. I think the book is
a little weak on presenting where LR parsing comes from, or why it works,
so I'll be going at the idea from a slightly different angle.
Continuing the thumbnail description of parsers.
LR parsers: pro - removes LL restriction, grammars can be much more general,
con - (almost) impossible to get right by hand, requires a parser generator (such as
YACC).
study questions
- F
Lecture 8, Continue reading Chapter 4
Lecture 9, although I'm omitting the discussion of cannonical LR(1) parsing
I'm going to skip the discussion of cannonical LR(1) parsing, and jump directly
from SLR to LALR parsing. Cannonical LR(1) is mostly just an academic exercise,
not used in practice.
First topic -- building the DFA directly using the idea of items, which are basically
productions with cursors (locations).
Second topic -- note that the SRL rule, which uses the follow set, is too general.
The follow set basically forms the union of ALL uses for a nonterminal, whereas
nonterminals are often used in different ways in different parts of the grammar.
In our sample grammar, for example, the nonterminal E could either be the start,
or it could be part of a parenthesized expression. Remember that these got mapped onto
DIFFERENT states when we did the conversion to DFA, but the follow set notion doesn't
recognize this.
Better rule, keep track of what we are hoping to see next, IN THE CURRENT CONTEXT.
Initially, what we hope to see after the start symbol is the end of input.
From there, use the following variation on closure:
If there is a nonterminal X->alpha dot Y beta , a
then we form the set
given by first(beta), and if epsilon is in this set add a as well (intuition, this
is what we are expecting to see next after we have seen the Y). Let d represent
this set. Then add Y -> dot gamma , d
to the closure set.
Revised rule (the LALR parsing rule). When we are at the end of a production,
reduce only if the next symbol is in the lookahead set. This is called LALR
parsing, and resolves many more shift/reduce conflicts correctly than does SLR.
study questions
- ------------- Week 4
- M
- W, HW 3 due
Lecture 10, read chapter 7 for this week, this lecture is in 7.6
Also read my chapter on
symbol tables
(pdf)
in the lecture notes I handed out.
The book spends a lot of time talking about how symbols can be
represented internally (hash tables and the like). I don't think
this is necessary (you have all had a data structures class), so I'll
skip over it.
In my lecture I'll concentrate more on what type of information is
stored in a symbol table, and how tables might be organized, and less
on the actual representation technique.
study questions
- F
Lecture 11, part 1.
Chapter 7 of book,
chapter 3
(pdf)
of my manuscript.
(This lecture will follow the manuscript more closely than the
printed lecture 11).
Run time representation.
I start by describing the general categories of memory -- global
memory, including code, the activation record stack, and the heap.
They each have their own requirements. Global values must be (a) unique
and (b) have a known fixed size. Lots of different items, not all global
in the sense of programming languages, have this property. Global values
may or may not be initialized, and may or may not be changable.
The activation record stores values that are tied to procedure entry
and exit. They also must have a size known at compile time.
The heap is for everything else.
I then discuss the structure of the activation record, and the
process of calling a procedure. Various parameter passing mechanisms
are discussed.
study questions
- ------------- Week 5
- M
More on uplevel addressing
Section 7.4 of book,
section 5.3
(pdf)
of my notes.
The book calls static links access links, for
reasons I don't understand, as the term static link is
used almost universally by everybody else.
Review: Activation Record (AR) is run-time memory for an active procedure.
Current AR is pointed to by a register, called the Frame Pointer (FP).
AR holds three types of items, local variables, parameters,
and bookkeeping stuff. Among the most notable of the latter is
the old FP, the return address, saved registers, and (if static
links are used) the old static link.
Problem of up-level address is how to get hold of non-local,
non-global variable values. The run-time structure of the AR's
along the static links mirrors the compile-time structure of the
symbol tables. We can use this fact to construct the address of
variables as we search the symbol table.
The caller is the only one who can set the static chain correctly
(the callee doesn't know from what level it was called). Again,
the caller can discover the correct value automatically as the
symbol tables are searched.
study questions
- W, PA 3 due
Functions as values. Call by name.
The book doesn't cover this, so I'll be handing out
chapters from yet another book
(pdf)
I wrote that discusses these issues.
Interesting things happen when you allow functions as argument
or functions as return values.
Functions as arguments must remember their context,
so code and context are packaged together as something called
the closure.
Functions as values must also ensure their activation record
has not disappeared. This can be done either by not popping
the activation record when a procedure returns, or by saving
the activation record on the heap.
Call by name is an interesting parameter passing mechanism,
that uses a hidden function as a parameter. This function is
invoked each time the parameter value is requested.
study questions
- F
Implementation of object-oriented features.
The textbook discusses this minimally, a better discussion
is in a chapter from
my
most recent book.
There are four features of OOP languages I wish to discuss.
The first is the invocation of member functions.
When a member function is executed, the receiver (this) is
passed as the first argument, a hidden argument.
This allows access to the member data fields (pun intended).
The second is the inheritance of member functions. What property
is it that allows a member function to work on a variety of different
types of data values? The key insight is that a child class is
always an extension of the parent class.
The third is the idea of polymorphic variables. If a variable
is allowed to hold child class types (an important feature of
OOP languages) what is the impact on memory layout. Discuss
the slicing problem, and the pointer solution. Java uses pointers.
The forth is how member functions are implemented. This must
be a decision made at run-time (in some cases, not all). Also
would be nice if it were fast. The virtual method dispatch
table solves this problem.
If there is time I might mention multiple inheritance
study questions
- ------------- Week 6
- M -- Review for Midterm
- W - MIDTERM
See the results: (will be posted after the exam)
- F
Lecture 12, Chapter 8 of book, section 8.1, although you can ignore the
discussion of 3 and 4 address code, which we won't use.
An intermediate representation is used between parsing and
code generation. The intermediate representation is both more
concise then machine code, and also processor independent.
This allows many optimizations to be performed on the
intermediate form, before we generate code.
The choice of intermediate representation depends in part on
at what level you generate code -- which can be anything
from as each statement is seen to delying until an entire file
is processed.
The internal form we will use is AST's. The AST must encode
everything the code generator will need, as well as everything
the type checker will use. Each AST node has a type field.
The address calculation used to get hold of a variable is
spelled out explicitly in the AST, as in the address calculation
used to get array elements, field or class references, and so on.
study questions
- -------------Week 7
- M
Lecture 13
Chapter 6 of text
What are types? Why do we have types in programming languages?
Types are used for documentation, error checking, and optimization
(by that I mean generating efficient code). Note that in the latter
an tradeoff has been made. Types have been designed so that
the compiler writers life is easier, even at the expense of making
the programmers life more difficult. Examples: you must
type every identifier, data structures can have only fixed sizes,
and so on. (There are languages that go the other direction,
and try to make the programmers life easier, even if it makes
the compiler writers life more difficulty).
In this lecture we investigate the meanings of types, and
issues such as telling if two types represent the same thing.
study questions
- W (now eliminated)
Lecture 13a, Section 4.2 in my chapter on types
More on types. Discuss the interaction between parameters
and memory model and arrays. Sets. Lists.
Type inference.
- W
Lecture 14
Section 8.4 of book
Everything must eventually be mapped on to a machine that
has very simple instructions, basically just
unconditional branches (including procedure calls), and
conditional branches based on simple conditions (comparisons,
for example). Lecture 14 deals with the flow of control
part of this process; lecture 15 deals with the boolean
expression part.
We examine if statements, both with and without the else
branch, loops, the need for temporary variables,
and switch (or case) statements.
(I suppose if this were being written today it would also
talk about exceptions.. )
study questions
- F
Lecture 15, read section 10.4.4 in
my manuscript
(pdf).
Book actually doesn't do a very good job of
describing this, but the details are not complex.
Again, I must own up to obsolete lecture notes.
In my lecture nodes, I use a single procedure
genCond(expr, tl, fl) where expr is an AST
and tl and fl are labels, but ONE OF THESE MUST BE
ZERO. Subsequent to writing the notes, I decided
the logic would be made more clear (at the cost of
only a small increase in code size) if I separated
these into two different functions, namely
branchIfTrue and branchIfFalse.
This is the way I describe it in Section 10.4.4 of my
manuscript, so you should read that for an alternative
discussion.
Lecture basically covers:
- Not operator, which merely inverts the test
and does not generate any code
- Logical operators, which produce short circuit
evaluation using control flow, not operators.
- Relational operators, which (on some machines)
may need to invert the test, if you can only
branch on true.
- Conditional expression operator.
study questions
- -------------Week 8
- M , pa 4 due
Lecture 16
Although the book devotes chapter 10 to optimizations
(and you should start reading this), the specific techniuque
I discuss today, and which you use in your programming
assignments, is given short shrift by the text. It was
explored in more detail by a friend of mine,
Dave Hanson (now at Princeton University), in a paper
entitiled ``low-cost and high yield optimizations''.
The idea is to have a sequence of algebraic transformations
that preserve the value, but for which the resulting
tree can be more efficiently evaluated than the original
tree. Basically the rule is to try to move constants to
the right and up the tree (so that they can hopefully
be combined with other constants at compile time),
and to replace costly operations with less costly ones
(such as replace multiplication with addition or with
left shifts).
It is an easy technique, that is in practice
surprizingly effective.
study questions
- W
Lecture 17
Introduce idea of DAG (Directed Acyclic Graph) and
Basic block (sequence of assignment statements treated
as one unit). Dags are mainly used to find common
subexpressions. Again, finding these may be a big win
because a lot of the calculation that occurs as a result
of arrays or records is ``hidden'', can not be manipulated
in the high level language. Finding a common subexpression
saves us from calculating the same thing twice.
study questions
- F
Lecture 18
Today I want to start talking about optimizations
that could be applied to loops. First, we need a new
IR if we are going to do this optimizations. Just as
DAGS replaced ASTs when we wanted to optimize basic blocks
to find common subexpressions, now we will use control flow
graphs to replace DAGs. A Control flow graph is a graph
made up of nodes, each of which is a DAG followed by
a transfer of control (which could be conditional,
unconditional, or procedure return). Even in a graph like
form, we can identify things like a loop, and then
talk about algorithms that remove loop invariant code.
study questions
- -------------Week 9
- M
Lecture 18, 2nd part (reduction in strength)
The goal of reduction in strength is to replace
multiplications, which are relatively costly, with
additions, which on almost all machines are much less costly.
The key insight is to find an induction variable,
which is a variable that changes (increases or decresases)
by a constant amount every time we go through the loop.
Such values readily arise in for loops, but can also
develop in other ways. Having found an induction variable,
we then look for expressions of the form
var*const + const; again, these readily arise in
subscript calculations, but can arise in other ways as
well. Such an expression will also change by only a
constant amount each time we go through the loop, hence
we can introduce a temporary that will track the value of
the expression, without needing to do the multiplication.
study questions
- W
Lecture 19, Data Flow Analysis
This is discussed in the text in sections 10.5 to 10.7.
Section 10.5 is an overview, 10.6 presents the algorithms
in much more detail then I will do, and 10.7 discusses
what types of optimizations can be performed once we have
done data flow analysis.
The basic idea of data flow analysis is to tie variable
uses to the set of possible places the variables
might have been defined. There is a systematic way to do
this -- for each node in the control flow graph you
compute the set of values that are defined, the set of values
that can potentially be overwritten (the same KILL set we
have been discussing over several lectures). Using these
you can write a set of equations that describe now information
can flow through a node -- these are similar to the equations
we wrote back in the beginning of the term for first and
follow sets -- then you refine the calculations over and over
until they no longer change.
Once we have the information provided by data flow
analysis, there is a host of optimizations, as well as
error checking, that we can perform.
study questions
- F
Optional class. I'm trying to see if I can get Career Services
to give a talk.
- -------------Week 10
- M
Lecture 20
Code generation is almost as difficult and ad-hoc as
optimization, and unfortunately the discussion in the
book reflects this. Code generation is described
in chapter 9, but there is no good overview, like
I will endevor to present in lecture. Instead the book
plows ahead into far more detail than I will require you
to understand. So you should read chapter 9 once,
and then concentrate more on my lecture notes.
Basically, the main point I want to get
across today is found near the end of lecture 20.
This is
- Generation of code for stack-style evaluation is easy,
permits the handling of arbitrarily complex expressions.
However, every instruction must then go all the way back to
main memory (caching may sometimes help this), the code is
not very efficient (it is longer than necessary), it is
difficult to handle common subexpressions efficiently.
- Code generation that makes use of registers can produce
more efficient code, but it is more difficult, you never have
enough registers, and when you have too many registers you must
pay the cost of saving them all on a procedure call.
Real bottom line, the compiler writer is always facing making
compromises, and needs to know the trade-offs.
study questions
- W March 13
Lectures 21 and 22 (two for the price of one)
The idea of peephole optimization is to take an existing
sequence of machine instructions, and find a simpler sequence
that has the same effect. This can come from unnecessary
loads (common if you compile code statement at a time),
branch chains, or machine idioms.
In the second part of the lecture I will talk about
the RISC and CISC revolution. Mainly I want to emphasize
the connections between assumptions about programmers
and hardware and instruction set design.
study questions
- F
Review for exam.
- -------------
- Tuesday March 16 - 480 FINAL, 9:30AM
Prepared by Tim Budd,
budd@cs.orst.edu,
1/5/2004.