Lecture 2 - 480/580 (compiler construction) Review from lecture 1. Computer science is important becuase the computer permits one to solve problems that would otherwise be difficult or impossible. Also good at doing tedious tasks that bore people. Problem is that machines work on a very low level, people on a very abstract high level. Translators allow users to solve problems in a manner closer to the way they conceive of them. In this class, we will look at *compilers*, a very special form of translators. We will restrict ourselves (for the sake of simplicity) to translators that take textual input. Since we are concerned with text, and must impose some structure on the input, *languages* are important. Recall the language hierarchy 1. regular expressions 2. context free languages 3. context sensitive languages We will defer further discussion on the latter two until later (next week or so), lets talk about regular expressions. Recall steps in translator (from last lecture) 1. recognize when input is available. 2. break input into individual components 3. merge individual pieces into meaningful structures 4. process structures 5. produce output note this may not be sequential, may be in a loop or even in parallel We can ignore 1 in this class (read a line of input, don't care where it comes from). Lets look at 2. When I write The Brown Fox most people will look at it as 3 symbols, not 13. We have internally grouped input into larger symbolic units, not simply characters. Similarly, when I write if (i !=32) then j := 12 I have written 10 symbols. Observations 1. are spaces important? (they seem to be here - then followed by j) 2. space do not necessarily separate tokens ( !=32, for example) As part of the description of a language, we need a set of rules telling us how to form tokens. No all languages are the same in this regard last payment = 132 DO 10 I = 1.15 (give Venus probe story) Lets look at assignment 1 (also see some of the difficulty in formal specifications) See page 748. 1. Does this mean a program cannot begin with a comment? 2. What about if( or array[1 - try to come up with something better. How about blanks between tokens are optional, with the exception that keywords and identifiers must be surrounded by non-digit, non-letter characters. 3. What is this strange notation? Rewrite as regular expression. English text may seem straightforward - there are pitfalls. What is a letter? (one of .... regular expressions) 'a' | 'b' | 'c' | ... too bulky [abc ... ] notational shorthand [a-z] even better 4. similar trick for numbers 5. keywords are reserved? what does this mean? 6,7,8,9 giving other tokens. SO - we can define a set of tokens, all as regular expressions array begin if ( ) [ ] , ; [a-zA-Z][a-zA-Z0-9]* [0-9]+ [0-9]+(.[0-9]+)?(E[+-]?[0-9]+)? oops, what about spaces, comments? [ \t\n]+ { not a { } * - talk about notation Well - what next? write a procedure that accepts characters and produces tokens (a type of mini-translator in itself). picture here how might something like this work? read the next character, its an "i" could be int, if or an identifier. read next character, "f" could be if, could still be an identifier, read next character "(" oops. we have gone too far, it is keyword if, push back symbol "(" two observations: 1. we may start out with many possibilities, must stop when correct one has been completely recognized. 2. need some sort of buffer, in particular need to distinguish between characters we have read and accepted (never want to see again) and those we have only looked at (left paren in example). Lets solve second problem first two general solutions: 1. two pointers into buffer (next real character, look ahead character) 2. push back buffer (let me see this again) option 1 has conceptual simplicity, hard to implement. option 2 is easy to do, most often used (many languages, such as Pascal, need only push back 1 character, others often need more). Now consider the first problem, that of too many alternatives. Return to our regular expressions. Recall that regular expressions can be recognized by finite automata. (build automata for numbers and identifiers) 1. Build finite automata for all patterns, 2. connect start states together, giving a huge NDFA 3. simplify to make a DFA 4. Use the following algorithm: when asked for a new token, start in (combined) start state. read characters, advancing state, until you cannot advance further. push back last character (one couldn't advance on) back up until last finish state, pushing back characters. return token associated with last finish state. Although conceptually not hard, steps 1, 2 and 3 are tedious and error- prone. Recall this was one of the reasons I stated for using computers to solve problems, thus we would like a computer solution. We are lucky. There is one.