TDT4205 Recitation 3: Lexical analysis


SLIDE 1

TDT4205 Recitation 3 Lexical analysis

  • Last week:
    – Make and makefiles
    – Text filters inside and out
    – Some C, idiomatically
  • Today: problem set 2
    – We've raced through the preliminaries, time for compiler stuff (yay!)
    – Analysis by hand, and by generated analyzer
    – (This lecture is given both Monday and Thursday, to keep everyone on board even with the off-beat timing)

SLIDE 2

Today's tedious practical matter

  • The exercises are part of your evaluation
    – I'm not the one holding the ultimate responsibility that your evaluation is fair
    – Thus, I can't decide on any kind of differential treatment
    – In plain English, I cannot extend your deadlines
    – No, not even for world tours, moon landings or funerals
    – Where it says “contact the instructor”, that's Dr. Elster
    – (Generally, after Feb. 15th the deadlines harden)

SLIDE 3

Worthwhile questions in plenary

  • (This one is from rec. 1, but I gave a somewhat foofy answer at the time...)
  • Does it make a difference whether main ends in “return int” or “exit ( int )”?
    – As it turns out, no.
    – The reason I hesitated was that one can register function calls to happen at program exit (with function pointers and the atexit function).
    – This mechanism is required to behave the same in both cases, so it's really a clear case. (Live and learn...)
  • For myself, I'll keep writing exit for “stop the program” and return for “stop the function” (unless there turns out to be a good reason why it's silly).

SLIDE 4

Where we're at

  • Things needed to
    – Submit homework (pdfs and tarballs)
    – Build programs (make, cc)
    – Build scanners (Lex/flex)
    – Build parsers (Yacc/bison)
    – Build symbol tables (hashtables/libghthash)
    – Assemble machine code (as)
  • ...but first, a bit of handwaving
SLIDE 5

The science of computer languages: even experts reach for magical metaphors

  • Battle with a ferocious dragon
  • “The spirit which lives in the computer”

SLIDE 6

My humble perspective on the subject

  • Compiler construction lies at an intersection between
    – Hardware architecture (very practical things)
    – Software architecture (very complicated things)
    – Complexity theory (very theoretical things)
    – Theories of language (very non-computery things)
  • What's cool about it is that handling the resulting calamity in the middle is a success story of computer science
  • Even so, the dragon's sweater reads “complexity of compiler design”, and the knight's sword is a parser generator
  • Moral: bring tools to the job
    – Dragons find hero programmers crunchy, and good with ketchup

SLIDE 7

General terminology: bits and bobs of languages

  • Different language models are suitable depending on what you want to look at:
    – Lexical models say what goes into a word, and where words are separated from each other
    – Syntactic models tell which roles a given word can play in a statement it is part of
    – Semantics speak of what it means when a given word appears playing a given role
  • There's a whole heap of other stuff which isn't usually applied to programming languages (morphology, pragmatics, …)
  • What we're after today is lumping characters into words.
SLIDE 8

Lexical analysis, the ad-hoc way

  • Say that we want to recognize an arbitrary fraction; should be easy, <ctype.h> is full of functions to classify characters...
    – read character
    – while ( isdigit ( character ) ) { read another }
    – if ( character != '/' ) { die horribly }
    – while ( isdigit ( character ) ) { keep reading }
  • First loop takes an integer numerator
  • Second loop takes an integer denominator
  • Condition in the middle requires that they're separated by what we expect.
  • This works if you only have a few different words in your care.
SLIDE 9

The automaton way I

  • DFAs are good too, they chew a character at a time
  • Looking at the state diagram, each state has a finite number of transitions...
  • ...so we can code them up in a finite amount of time.
  • Here goes:
    – if ( state==1 and c=='a', or state==1 and c=='b', or... ) { state = 14; /* lowercase letters go to 14 */ }
    – else if ( state==1 and c=='0', or state==1 and c=='1', or... ) { state = 42; /* digits in state 1... */ }
    – else if ( ... ) else if ( ... )
  • (I'm beginning to think this wasn't such a fantastic idea after all)
SLIDE 10

The automaton way II

  • A DFA can be tabulated.
    – First, punch in the table
    – Next, set a start state
    – Loop while state isn't known to accept or reject:
      • Next state = table [ this state ] [ character ]
  • (A recipe like this is in the Dragon book, p. 151)
  • Wonder of wonders, one algorithm will work for any DFA, just change the table!
  • This is pretty much get_token in Task 2, it's not that hard.
SLIDE 11

Surrounding logic of vm.c

  • Basically, it's like last week's text filter description, but with tokens:

    T = token_get();
    while ( T != wrong ) {
        do_something ( T );
        T = token_get();
    }

  • 'token_get' is a little more involved than 'readchar()', but it's still just an integer to branch on
  • The 'do_something' is already in place, you won't have to write that
SLIDE 12

Inside token_get: where did I leave my chars?

  • DFAs have horribly short-term memory, they barely know where they are.
  • When the time comes to accept:
    – What are we accepting? (Answer is the token)
    – Why did we accept this? (Answer is the lexeme)
  • At the accept state in the PS2 diagram, neither is known
  • To do this, impart a sense of history to your code:
    – The 2nd-to-last state determines the token; it can be saved then, to be recalled on reaching the accept state
    – There's a buffer 'lexeme' to plop each char into as you go along, to tell “127” from “74” even though they both match to integer tokens

SLIDE 13

A few more notes on vm.c

  • The table is all set up, table[state]['a'] gives the transition from state on 'a'
    – Initially, all lead to state -1, which works for 'reject'
    – We'll assume transitions not noted lead there
    – Table is (much) bigger than it has to be, for the convenience of indexing with character codes
    – There's a macro T(state,c) which expands to table[state][c], this is just to save on the typing.
  • The language def. isn't splendidly clean (mixes in whitespace for good measure), but the intention is (hopefully) clear
  • The 'lexeme' buffer is fixed-length, and can easily be overrun with long integers. We could fix it, but it's kind of beside the point at the moment, let's assume input is friendly.
  • (The stack is finite too, so it won't do long programs.)
SLIDE 14

Testing

  • There are two files included, one for checking just the tokenizer, and one small program
  • “./vm -t” will drop execution; this is used to test with an included list of lexemes
  • (In a few cases, the input is broken through 'sed', to see if errors come out. Sed is just a text filter which can apply reg.exp. substitutions. It's a handy tool.)
  • Just starting “./vm -t” without any pipeline will take stdin from the keyboard. (On most terminals, Ctrl + D will signal end-of-file.)

SLIDE 15

The bridge to Lex

  • What we just saw is exactly what Lex does:
    – Take some regular expressions
    – Write out the mother lode of a table
    – Implement traversal
  • The names are a little different:
    – 'token_get (stdin)' is called “yylex()”
    – The lexeme buffer is called yytext
  • Major win: the tabulation is automated; less tedious, far less prone to mistakes

SLIDE 16

Lex specifications: declarations

  • The declarations section contains initializer C code, some directives, and optionally, some named regular expressions
    – “TABSTRING [\t]+” will define a symbolic name TABSTRING for the expression (which here matches a sequence of at least one tab character)
    – References to these names can go into other expressions in the rules section: {TABSTRING}123 will match a string of at least one tab, followed by '123'
    – Not necessary, but a boon for readability when expressions grow complicated
  • Anything enclosed between '%{' and '%}' in this section will go in verbatim at the start of the generated C code
  • There's a nasty macro in there, which gets more attention in a minute
SLIDE 17

Lex specifications: rules

  • The rules section is just regular expressions annotated with basic blocks (semantic actions):
    – a(a|b)* { return SOMETHING; }
      will see the yylex() function return “SOMETHING” whenever what appears on stdin matches an 'a' followed by zero or more ('a' or 'b')-s
    – Any code could go into the semantic action, it's just a block of C. If it's empty, the reg.exp. will strip text from the input.
    – A set of token values to return is already included in “parser.h”, so you don't have to invent token values
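Putting the two sections together, a minimal spec might look like the sketch below. The names TABSTRING and INTEGER and the token value are mine for illustration; in PS2 the token values come from parser.h.

```lex
%{
/* Verbatim block: copied to the top of the generated scanner. */
#include <stdio.h>
#define INTEGER 258   /* hypothetical token value; PS2 gets these from parser.h */
%}

TABSTRING [\t]+

%%

{TABSTRING}     { /* empty action: strip tabs from the input */ }
[0-9]+          { return INTEGER;    /* yytext now holds the lexeme */ }
.               { return yytext[0];  /* single-char tokens: '+', '}', ... */ }

%%

/* Auxiliary function definitions could go here; we don't need any. */
```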

SLIDE 18

Gritty details

  • The one rule already implemented in scanner.l is “. { RETURN(yytext[0]); }”, which matches any one character and returns its ASCII code as a token value.
  • Keep this rule (as the last one), it will save us from defining long symbolic names for single-char tokens like '+' and '}' (...even though this overlaps the lexeme with the token value...)
  • The RETURN() macro is a hack, but a useful one:
    – #ifdef-d on DUMP_TOKENS, it not only returns a token value, but also prints it along with its lexeme. Thus, we can define DUMP_TOKENS and test the scanner without plopping a greater frame around it.
    – When we're done, dropping DUMP_TOKENS will give us a well-behaved scanner which just returns values.

SLIDE 19

Bringing it all together

  • The specification file consists of declarations, rules and function definitions (in that order), separated by %%
  • We won't need the function definitions, but you can stuff auxiliary functions in there. (If you implement main there, you can make a standalone program in a Lex spec.)

SLIDE 20

Outline of the vslc directory

  • 'src' is for keeping handwritten sources.
  • 'obj' fills up with object code on every build
  • 'work' fills up with generated source code
  • 'bin' contains the compiler binary
  • 'vsl_programs' contains little examples in the language we're compiling, for testing purposes
  • There's a separate makefile under vsl_programs, to manage the testing stuff separately
  • 'clean' and 'purge' simply wipe out all the generated material

SLIDE 21

Outline of the vslc directory

  • All this separation is a little over-engineered, but the idea is to keep the building blocks apart (since we're dismantling the compilation process anyway).
  • Test cases are there to verify things at the end; noodling around, inventing your own input in such a way that you can verify it, is still invaluable. Reminders:
    – echo “FUNC main () { VAR a,b RETURN 0 }” | bin/vslc
    – From file, 'cat myfile.vsl | bin/vslc' or 'bin/vslc -f myfile.vsl'
  • Testing little bits at a time and trying to predict the outcome is likely to get things finished quicker than trying to write the entire scanner spec. correctly on the first go (even if that is still possible here...)
SLIDE 22

Notes which didn't fit anywhere else

  • Lex regular expressions support a few more conveniences than the Dragon flavor. I found one nice reference by going to Google with “flex regular expression cheatsheet” (gives a 1st hit to a likeable one, there are hundreds of things like this)
  • Mind the string literals; most escape sequences will take care of themselves (we'll interpolate them with printf later), but '“' has to be escaped by us because it's the delimiter of the strings themselves.
  • “#define MACRO(p) do { a(p); b(p,0); p = 42; } while ( false )” may look a little redundant, but it's a/the way to wrap multiple statements in macros so that they look like single statements.
    – “#define MACRO(p) a(p); b(p,0); p=42;” fails as an 'if' body
    – “#define MACRO(p) {a(p); b(p,0); p=42;}” won't allow ';' after it before an 'else'