Lecture 3 - 480/580 (compiler construction) Review from last time - steps in building a lexical analyzer 1. defining tokens in terms of regular expressions 2. make a finite automata for each RE 3. connect the start states for these FA, giving a huge NDFA 4. Make a DFA out of the NDFA 5. when asked for a token, simulate the DFA until you can't go any futher, then back up to the last final state. Problems: step 1 is easy, steps 2-5 are not hard, but they are tedious and error-prone. Far better to let a computer do them. LEX basically, a program that takes regular expressions and action pairs, and builds a recognizer. Why actions? want more than just text of token, want values of integers, names of symbols, etc etc. pattern action ... ... go through actions in assignment. Notes - 1. LEX makes NO assumptions about tokens, must recognize ALL of input, including spaces, comments even illegal symbols ('.' default action). 2. Multiple rules can match the same input. Ambiguity is resolved using the following two rules: A) choose the pattern which matches the longest input string B) if two patterns match the same size string, choose that which was listed first. 3. can trade size of actions for code size (keywords and identifiers, one char symbols) 4. If you were buillding this by hand we would do it differently. (more in a moment) Beauty of LEX is that YOU tell it what to return (or even if to return), you tell it WHAT to do, not HOW to do it. If you don't have a return statement, none takes place (spaces, comments) It provides you automatically with text of token (in yytext) You can return a value, in addition. Called yyval. Gets type from Yacc (more later) - go through integer action. Can, on occasion, build very large recognizers. Can trade code size for ease in construction. Decide what is important - your time or the computers'. BUILDING IT BY HAND would do it differently. Use classes of tokens with the first characters. built on top of push back buffer (if time, go through buffer). skip over blanks, tabs, newlines and comments switch on first character case letter: read until end of identifier, push back last character. check to see if it is a keyword, else return identifier case digit: read until end of integer, if '.' or 'E', read until end of real. else return integer. push back last character. case '<' if next char is = return less than or equal, else push back char and return less than case '(' return left paren etc. etc.