

slide-1
SLIDE 1

Recursive-Descent Parsing

slide-2
SLIDE 2

First, a digression on lexing

Let’s assume the get-token function will give me the next token

slide-3
SLIDE 3

; Assumed imports: the lexer form and the :-prefixed regex operators
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))

(define lex
  (lexer
   ; skip spaces:
   [#\space (lex input-port)]
   ; skip newline:
   [#\newline (lex input-port)]
   [#\+ 'plus]
   [#\- 'minus]
   [#\* 'times]
   [#\/ 'div]
   [(:: (:? #\-) (:+ (char-range #\0 #\9)))
    (string->number lexeme)]
   ; an actual character:
   [any-char (string-ref lexeme 0)]))

slide-4
SLIDE 4

Assume the current token is curtok

(accept c) matches character c

slide-5
SLIDE 5

(define curtok (next-tok))

(define (accept c)
  (if (not (equal? curtok c))
      (raise 'unexpected-token)
      (begin
        (printf "Accepting ~a\n" c)
        (set! curtok (next-tok)))))
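To try accept without a real lexer, next-tok can be stubbed out. The sketch below assumes (this is not from the slides) that the tokens are simply the characters of a fixed string, and that the symbol 'eof marks the end of input. The printf is dropped to keep the output quiet.

```racket
#lang racket
;; Assumption: tokens are the characters of a fixed string; 'eof ends the input.
(define input (string->list "ab"))

;; Hypothetical stand-in for the lexer: pop the next character, or return 'eof.
(define (next-tok)
  (if (null? input)
      'eof
      (begin0 (car input)
              (set! input (cdr input)))))

(define curtok (next-tok))

;; Same logic as accept on the slide, without the printf.
(define (accept c)
  (if (not (equal? curtok c))
      (raise 'unexpected-token)
      (set! curtok (next-tok))))
```

Calling (accept #\a) and then (accept #\b) walks through the whole input and leaves curtok at 'eof. Any other call raises 'unexpected-token.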

slide-6
SLIDE 6

L: Left to right

L: Leftmost derivation

1 token of lookahead

slide-7
SLIDE 7

Let’s say I want to parse the following grammar

S -> aSa | bb

slide-8
SLIDE 8

First, a few questions

S -> aSa | bb

If I were matching the string bb, what would my derivation look like?

If I were matching the string abba, what would my derivation look like?

Is this grammar ambiguous?

slide-9
SLIDE 9

First, a few questions

S -> aSa | bb

Key idea: if I look at the next input, at most one of these productions can “fire”

If I see an a I know that I must use the first production

If I see a b, I know I must be in second production

slide-10
SLIDE 10

This is called a predictive parser. It uses lookahead to determine which production to choose.

(My friend Tom points out that "predictive" is a dumb name, because the parser is really determining the production. There is no guessing.)

slide-11
SLIDE 11

In this class, we’ll restrict ourselves to grammars that require only one character of lookahead.

Generalizing to k characters is straightforward.

slide-12
SLIDE 12

S -> aaS | abS | c

I need two characters of lookahead

S -> aaaS | aabS | c

I need three characters of lookahead

S -> aaaaS | aaabS | c

I need four characters of lookahead …
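As a rough illustration of the first grammar, here is a sketch of a recognizer for S -> aaS | abS | c that inspects two characters before committing to a production. The function names and the list-of-characters input are assumptions made for the example.

```racket
#lang racket
;; S -> aaS | abS | c
;; Takes a list of characters and returns what is left after one S.
;; The first two clauses need two characters of lookahead to tell aa from ab.
(define (parse-S chars)
  (match chars
    [(list* #\a #\a more) (parse-S more)]
    [(list* #\a #\b more) (parse-S more)]
    [(list* #\c more) more]
    [_ (raise 'unexpected-token)]))

;; Hypothetical driver: #t when the whole string is an S.
(define (recognize? str)
  (with-handlers ([symbol? (lambda (e) #f)])
    (null? (parse-S (string->list str)))))
```

With only one character of lookahead, the parser would see an a and could not tell which of the first two productions to use.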

slide-13
SLIDE 13

Slight transformation..

S -> A | B
A -> aSa
B -> bb

slide-14
SLIDE 14

Slight transformation..

Now, I write out one function to parse each nonterminal

S -> A | B
A -> aSa
B -> bb

slide-15
SLIDE 15

Intuition: when I see a, I call parse-A

when I see b, I call parse-B

S -> A | B
A -> aSa
B -> bb

slide-16
SLIDE 16

(define (parse-S)
  (match curtok
    ; S -> A, with A -> aSa written inline
    [#\a (begin
           (accept #\a)
           (parse-S)
           (accept #\a))]
    ; S -> B
    [#\b (parse-B)]))

slide-17
SLIDE 17

(define (parse-B)
  (begin
    (accept #\b)
    (accept #\b)))
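Put together, the parser for this grammar might look like the following sketch, with one function per nonterminal. The token source (a list of characters ending in 'eof) and the recognize? driver are assumptions added so the example runs on its own.

```racket
#lang racket
;; Assumption: tokens are characters taken from a list; 'eof ends the input.
(define tokens '())
(define curtok 'eof)

(define (next-tok)
  (if (null? tokens)
      'eof
      (begin0 (car tokens) (set! tokens (cdr tokens)))))

(define (accept c)
  (if (equal? curtok c)
      (set! curtok (next-tok))
      (raise 'unexpected-token)))

;; S -> A | B: one token of lookahead picks the production.
(define (parse-S)
  (match curtok
    [#\a (parse-A)]
    [#\b (parse-B)]
    [_ (raise 'unexpected-token)]))

;; A -> aSa
(define (parse-A)
  (accept #\a)
  (parse-S)
  (accept #\a))

;; B -> bb
(define (parse-B)
  (accept #\b)
  (accept #\b))

;; Hypothetical driver: #t when the whole string derives from S.
(define (recognize? str)
  (set! tokens (string->list str))
  (set! curtok (next-tok))
  (with-handlers ([symbol? (lambda (e) #f)])
    (parse-S)
    (eq? curtok 'eof)))
```

For example, (recognize? "abba") succeeds by going through parse-S, parse-A, parse-S, parse-B, which mirrors the derivation S -> A -> aSa -> aBa -> abba.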

slide-18
SLIDE 18

Livecoding this parser in class

slide-19
SLIDE 19

Three parsing-related pieces of trivia

slide-20
SLIDE 20

FIRST(A)

FIRST(A) is the set of terminals that could occur first when I recognize A

slide-21
SLIDE 21

NULLABLE

NULLABLE is the set of nonterminals that can derive ε

slide-22
SLIDE 22

FOLLOW(A)

FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some sentential form

slide-23
SLIDE 23

Why learn these? A: They help your intuition for building parsers (as we’ll see)

slide-24
SLIDE 24

What is FIRST for each nonterminal?

What is NULLABLE for the grammar?

What is FOLLOW for each nonterminal?

S -> A | B
A -> aSa
B -> bb

slide-25
SLIDE 25

E -> T E'
E' -> + T E'
E' -> ε
T -> F T'
T' -> * F T'
T' -> ε
F -> ( E )
F -> id

What is FIRST for each nonterminal?

What is NULLABLE for the grammar?

What is FOLLOW for each nonterminal?

More practice…
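One way to check answers for the expression grammar is to compute the three sets mechanically. The sketch below uses a simple fixed-point iteration. The grammar encoding is an assumption made for the example: E* and T* stand for E' and T', lp and rp stand for the parentheses, and 'eof stands for end of input. ε is tracked through NULLABLE rather than stored inside FIRST.

```racket
#lang racket
;; Each production is (lhs rhs-symbol ...). E* / T* stand for E' / T'.
(define grammar
  '((E  T E*)
    (E* + T E*)
    (E*)
    (T  F T*)
    (T* * F T*)
    (T*)
    (F  lp E rp)
    (F  id)))

(define nonterminals (remove-duplicates (map car grammar)))
(define (nonterminal? sym) (and (memq sym nonterminals) #t))

;; Apply step repeatedly until the result stops changing.
(define (fixpoint step start)
  (let loop ([current start])
    (define next (step current))
    (if (equal? next current) current (loop next))))

;; NULLABLE: a nonterminal is nullable if some right-hand side is all nullable.
(define nullable
  (fixpoint
   (lambda (known)
     (for/fold ([acc known]) ([prod grammar])
       (if (andmap (lambda (sym) (set-member? acc sym)) (cdr prod))
           (set-add acc (car prod))
           acc)))
   (set)))

(define (all-nullable? syms)
  (andmap (lambda (sym) (set-member? nullable sym)) syms))

;; FIRST of a symbol sequence, given a table of FIRST sets for nonterminals.
(define (first-of-seq syms table)
  (cond
    [(null? syms) (set)]
    [(not (nonterminal? (car syms))) (set (car syms))]
    [(set-member? nullable (car syms))
     (set-union (hash-ref table (car syms) (set))
                (first-of-seq (cdr syms) table))]
    [else (hash-ref table (car syms) (set))]))

;; Union extra into the set stored under key.
(define (table-add table key extra)
  (hash-update table key (lambda (old) (set-union old extra)) (set)))

(define first-table
  (fixpoint
   (lambda (table)
     (for/fold ([acc table]) ([prod grammar])
       (table-add acc (car prod) (first-of-seq (cdr prod) acc))))
   (hash)))

;; FOLLOW: for each nonterminal X in a right-hand side, add FIRST of what
;; comes after X, plus FOLLOW(lhs) when everything after X is nullable.
(define follow-table
  (fixpoint
   (lambda (table)
     (for/fold ([acc table]) ([prod grammar])
       (let walk ([syms (cdr prod)] [acc acc])
         (cond
           [(null? syms) acc]
           [(nonterminal? (car syms))
            (define after (cdr syms))
            (define extra
              (set-union (first-of-seq after first-table)
                         (if (all-nullable? after)
                             (hash-ref acc (car prod) (set))
                             (set))))
            (walk after (table-add acc (car syms) extra))]
           [else (walk (cdr syms) acc)]))))
   (hash 'E (set 'eof)))) ; the start symbol is followed by end of input
```

Under this encoding, NULLABLE comes out as {E', T'}, FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E') = {+}, and FIRST(T') = {*}. FOLLOW(E) = FOLLOW(E') = {), eof}, FOLLOW(T) = FOLLOW(T') = {+, ), eof}, and FOLLOW(F) = {*, +, ), eof}.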

slide-26
SLIDE 26

We use the FIRST set to help us design our recursive-descent parser!

slide-27
SLIDE 27

LL(1)

A grammar is LL(1) if we only have to look at the next token to decide which production will match!

I.e., if S -> A | B, FIRST(A) ∩ FIRST(B) must be empty
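The disjointness condition is easy to check once the FIRST sets are known. In the small sketch below the FIRST sets are written out by hand rather than computed, and the names are made up for the example.

```racket
#lang racket
;; S -> A | B with A -> aSa and B -> bb: the alternatives start differently.
(define first-A (set #\a))
(define first-B (set #\b))

;; S -> aaS | abS | c: the first two alternatives both start with a.
(define first-aaS (set #\a))
(define first-abS (set #\a))

;; Two alternatives can be told apart with one token when their
;; FIRST sets share no terminal.
(define (disjoint? s1 s2)
  (set-empty? (set-intersect s1 s2)))
```

Here (disjoint? first-A first-B) holds, so one token of lookahead is enough for the first grammar. (disjoint? first-aaS first-abS) does not hold, which is why the second grammar needed two characters.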

slide-28
SLIDE 28

Recursive-descent is called top-down parsing because you build a parse tree from the root down to the leaves

slide-29
SLIDE 29

There are also bottom-up parsers, which produce the rightmost derivation (in reverse)

We won’t talk about them. In general they’re very hard to write and understand by hand, but easy to use through a parser generator

slide-30
SLIDE 30
slide-31
SLIDE 31

Basically everyone uses lex and yacc to write real parsers

Recursive-descent is easy to implement, but requires lots of messing around with the grammar

slide-32
SLIDE 32

More practice with parsers

slide-33
SLIDE 33

Plus -> num MoreNums
MoreNums -> + num MoreNums | ε

This one is more tricky!! How would you do it? (Hint: Think about NULLABLE)

slide-34
SLIDE 34

Code up collectively….

slide-35
SLIDE 35

(define (parse-Plus)
  (begin
    (parse-num)
    (parse-MoreNums)))

(define (parse-MoreNums)
  (match curtok
    ; MoreNums -> + num MoreNums
    ['plus (begin
             (accept 'plus)
             (parse-num)
             (parse-MoreNums))]
    ; MoreNums -> ε, chosen when the next token is in FOLLOW(MoreNums)
    ['eof (void)]))

slide-36
SLIDE 36

Key rule: at each step of the way, if I see some token next, which production must I choose?

slide-37
SLIDE 37

Now yet another…. This will use the intuition from FOLLOW

slide-38
SLIDE 38

Add -> Term MoreTerms
MoreTerms -> + Term MoreTerms
MoreTerms -> ε
Term -> num MoreNums
MoreNums -> * num MoreNums | ε

slide-39
SLIDE 39

Add -> Term MoreTerms
MoreTerms -> + Term MoreTerms
MoreTerms -> ε
Term -> num MoreNums
MoreNums -> * num MoreNums | ε

Consider how we would implement MoreTerms

slide-40
SLIDE 40

Add -> Term MoreTerms
MoreTerms -> + Term MoreTerms
MoreTerms -> ε
Term -> num MoreNums
MoreNums -> * num MoreNums | ε

If you’re at the beginning of MoreTerms you have to see a +

slide-41
SLIDE 41

Add -> Term MoreTerms
MoreTerms -> + Term MoreTerms
MoreTerms -> ε
Term -> num MoreNums
MoreNums -> * num MoreNums | ε

If you’ve just seen a + you have to see FIRST(Term)

slide-42
SLIDE 42

Add -> Term MoreTerms
MoreTerms -> + Term MoreTerms
MoreTerms -> ε
Term -> num MoreNums
MoreNums -> * num MoreNums | ε

After Term you recognize something in FOLLOW(Term)

slide-43
SLIDE 43

Add -> Term MoreTerms
MoreTerms -> + Term MoreTerms
MoreTerms -> ε
Term -> num MoreNums
MoreNums -> * num MoreNums | ε

Because MoreTerms is NULLABLE, have to account for null

slide-44
SLIDE 44

Code up collectively….
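One possible outcome of that exercise is sketched below as a recognizer (it only accepts or rejects). The token representation is an assumption: numbers stand for num, the symbols 'plus and 'times stand for + and *, and 'eof ends the input. The ε cases are chosen by checking the FOLLOW sets.

```racket
#lang racket
;; Assumption: tokens are numbers, 'plus, 'times; 'eof ends the input.
(define tokens '())
(define curtok 'eof)

(define (next-tok)
  (if (null? tokens)
      'eof
      (begin0 (car tokens) (set! tokens (cdr tokens)))))

(define (accept c)
  (if (equal? curtok c)
      (set! curtok (next-tok))
      (raise 'unexpected-token)))

(define (parse-num)
  (if (number? curtok)
      (set! curtok (next-tok))
      (raise 'unexpected-token)))

;; Add -> Term MoreTerms
(define (parse-Add)
  (parse-Term)
  (parse-MoreTerms))

;; MoreTerms -> + Term MoreTerms | ε
(define (parse-MoreTerms)
  (match curtok
    ['plus (accept 'plus) (parse-Term) (parse-MoreTerms)]
    ['eof (void)] ; ε: FOLLOW(MoreTerms) = {eof}
    [_ (raise 'unexpected-token)]))

;; Term -> num MoreNums
(define (parse-Term)
  (parse-num)
  (parse-MoreNums))

;; MoreNums -> * num MoreNums | ε
(define (parse-MoreNums)
  (match curtok
    ['times (accept 'times) (parse-num) (parse-MoreNums)]
    [(or 'plus 'eof) (void)] ; ε: FOLLOW(MoreNums) = FOLLOW(Term) = {+, eof}
    [_ (raise 'unexpected-token)]))

;; Hypothetical driver: #t when the token list is a valid Add.
(define (recognize? toks)
  (set! tokens toks)
  (set! curtok (next-tok))
  (with-handlers ([symbol? (lambda (e) #f)])
    (parse-Add)
    (eq? curtok 'eof)))
```

Note how parse-MoreNums has to return quietly on 'plus: a + after a Term belongs to the enclosing MoreTerms, which is exactly what FOLLOW(Term) says.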

slide-45
SLIDE 45

Let’s say I want to generate an AST

slide-46
SLIDE 46

(struct add (left right) #:transparent)
(struct times (left right) #:transparent)

Model my AST…
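With those two structs, each parse function can return a tree instead of just consuming tokens. The following is a sketch for the Add grammar, under the assumption that tokens are numbers, 'plus, and 'times, with 'eof at the end. It builds the tree in the same shape as the derivation, so repeated operators nest to the right.

```racket
#lang racket
(struct add (left right) #:transparent)
(struct times (left right) #:transparent)

;; Assumption: tokens are numbers, 'plus, 'times; 'eof ends the input.
(define tokens '())
(define curtok 'eof)

(define (next-tok)
  (if (null? tokens)
      'eof
      (begin0 (car tokens) (set! tokens (cdr tokens)))))

(define (accept c)
  (if (equal? curtok c)
      (set! curtok (next-tok))
      (raise 'unexpected-token)))

;; num: return the number itself as a leaf.
(define (parse-num)
  (define n curtok)
  (unless (number? n) (raise 'unexpected-token))
  (set! curtok (next-tok))
  n)

;; Add -> Term MoreTerms
(define (parse-Add)
  (parse-MoreTerms (parse-Term)))

;; MoreTerms -> + Term MoreTerms | ε (left is the tree parsed so far)
(define (parse-MoreTerms left)
  (match curtok
    ['plus (accept 'plus)
           (add left (parse-MoreTerms (parse-Term)))]
    [_ left]))

;; Term -> num MoreNums
(define (parse-Term)
  (parse-MoreNums (parse-num)))

;; MoreNums -> * num MoreNums | ε
(define (parse-MoreNums left)
  (match curtok
    ['times (accept 'times)
            (times left (parse-MoreNums (parse-num)))]
    [_ left]))

;; Hypothetical driver: parse a whole token list into a tree.
(define (parse toks)
  (set! tokens toks)
  (set! curtok (next-tok))
  (define tree (parse-Add))
  (unless (eq? curtok 'eof) (raise 'unexpected-token))
  tree)
```

For instance, (parse '(1 plus 2 times 3)) gives (add 1 (times 2 3)), so multiplication binds tighter than addition simply because Term sits below Add in the grammar.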

slide-47
SLIDE 47

More Recursive-descent practice…

slide-48
SLIDE 48

Write recursive-descent parsers for the following….

slide-49
SLIDE 49

A grammar for S-Expressions

slide-50
SLIDE 50

datum ::= number | string | identifier | 'SExpr
SExpr ::= (SExprs) | datum
SExprs ::= SExpr SExprs | ε

Parsing mini-Racket / Scheme
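A sketch of a recursive-descent parser for this grammar follows. The token representation is an assumption made for the example: the characters #\(, #\) and #\' stand for the punctuation, Racket numbers, strings, and symbols stand for the three kinds of atom, and the eof object marks the end of input. The result is an ordinary Racket list.

```racket
#lang racket
;; Assumption: punctuation tokens are the characters ( ) and ';
;; atoms are Racket numbers, strings, and symbols; eof ends the input.
(define tokens '())
(define curtok eof)

(define (next-tok)
  (if (null? tokens)
      eof
      (begin0 (car tokens) (set! tokens (cdr tokens)))))

(define (accept c)
  (if (equal? curtok c)
      (set! curtok (next-tok))
      (raise 'unexpected-token)))

(define (atom? t)
  (or (number? t) (string? t) (symbol? t)))

;; SExpr ::= (SExprs) | datum
(define (parse-SExpr)
  (match curtok
    [#\( (accept #\()
         (begin0 (parse-SExprs)
                 (accept #\)))]
    [_ (parse-datum)]))

;; SExprs ::= SExpr SExprs | ε
;; ε is chosen on ), the only token in FOLLOW(SExprs).
;; Racket evaluates the arguments of cons left to right.
(define (parse-SExprs)
  (match curtok
    [#\) '()]
    [_ (cons (parse-SExpr) (parse-SExprs))]))

;; datum ::= number | string | identifier | 'SExpr
(define (parse-datum)
  (cond
    [(equal? curtok #\') (accept #\') (list 'quote (parse-SExpr))]
    [(atom? curtok) (begin0 curtok (set! curtok (next-tok)))]
    [else (raise 'unexpected-token)]))

;; Hypothetical driver: parse exactly one S-expression.
(define (parse toks)
  (set! tokens toks)
  (set! curtok (next-tok))
  (begin0 (parse-SExpr)
          (unless (eof-object? curtok) (raise 'unexpected-token))))
```

Parsing the tokens for (a 1 ("s") 'b) returns the list (a 1 ("s") (quote b)), and an unclosed parenthesis raises 'unexpected-token when the parser reaches the end of input.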

slide-51
SLIDE 51

S -> a C H | b H C
H -> b H | d
C -> e C | f C

slide-52
SLIDE 52

E -> A
E -> L
A -> n
A -> i
L -> ( S )
S -> E S’
S’ -> , S
S’ -> ε

slide-53
SLIDE 53

So far, I’ve given you grammars that are amenable to LL(1) parsers… (Many grammars are not) (But you can manipulate them to be!)

slide-54
SLIDE 54

What about this grammar?

E -> E - T | T
T -> number

slide-55
SLIDE 55

This grammar is left recursive

E -> E - T | T
T -> number

What happens if we try to write a recursive-descent parser?

slide-56
SLIDE 56

This grammar is left recursive

E -> E - T | T
T -> number

slide-57
SLIDE 57

We really want this grammar, because it corresponds to the correct notion of associativity

slide-58
SLIDE 58

5 - 3 - 1

E -> E - T | T
T -> number

slide-59
SLIDE 59

Infinite loop!

slide-60
SLIDE 60

5 - 3 - 1

E -> E - T | T
T -> number

A recursive-descent parser will first call parse-E, which calls parse-E again without consuming any input, and so on, and then crash

slide-61
SLIDE 61

5 - 3 - 1

Draw the rightmost derivation for this string

E -> E - T | T
T -> number

slide-62
SLIDE 62

If we could only have the rightmost derivation, our problem would be solved

slide-63
SLIDE 63

The problem is, a recursive-descent parser needs to look at the next input immediately

slide-64
SLIDE 64

Recursive-descent parsers work by looking at the next token and making a decision / prediction

Rightmost derivations require us to delay making choices about the input until later

As humans, we naturally guess which derivation to use (for small examples)

Thus, LL(k) parsers cannot generate rightmost derivations :(

slide-65
SLIDE 65

We can remove left recursion

slide-66
SLIDE 66

E -> E - T | T
T -> number

E -> T E’
E’ -> - T E’
E’ -> ε

Factor!

slide-67
SLIDE 67

In general, if we have

A -> Aa | bB

Rewrite to…

A -> bB A’
A’ -> a A’ | ε

This generalizes even further:

https://en.wikipedia.org/wiki/LL_parser#Left_Factoring

slide-68
SLIDE 68

But this still doesn’t give us what we want!!!

E -> T E’
E’ -> - T E’
E’ -> ε

E -> T E’
  -> T - T E’
  -> T - T - T E’
  -> T - T - T
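
To see the problem concretely, here is a sketch that builds a tree directly in the shape of this derivation and then evaluates it. The sub struct, the token representation (numbers and the symbol 'minus, ending in 'eof), and the names parse-E-rest and evaluate are assumptions made for the example. parse-E-rest plays the role of E’.

```racket
#lang racket
(struct sub (left right) #:transparent) ; hypothetical AST node for "-"

;; Assumption: tokens are numbers and 'minus; 'eof ends the input.
(define tokens '())
(define curtok 'eof)

(define (next-tok)
  (if (null? tokens)
      'eof
      (begin0 (car tokens) (set! tokens (cdr tokens)))))

(define (accept c)
  (if (equal? curtok c)
      (set! curtok (next-tok))
      (raise 'unexpected-token)))

;; T -> number
(define (parse-T)
  (define n curtok)
  (unless (number? n) (raise 'unexpected-token))
  (set! curtok (next-tok))
  n)

;; E -> T E'
(define (parse-E)
  (parse-E-rest (parse-T)))

;; E' -> - T E' | ε, nesting the rest of the input under the right child
(define (parse-E-rest left)
  (match curtok
    ['minus (accept 'minus)
            (sub left (parse-E-rest (parse-T)))]
    [_ left]))

(define (parse toks)
  (set! tokens toks)
  (set! curtok (next-tok))
  (parse-E))

;; Evaluate a tree of sub nodes and numbers.
(define (evaluate tree)
  (match tree
    [(sub l r) (- (evaluate l) (evaluate r))]
    [n n]))
```

For 5 - 3 - 1 this produces (sub 5 (sub 3 1)), which evaluates to 3. The left-associative reading, (5 - 3) - 1, should give 1.
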
slide-69
SLIDE 69

So how do we get left associativity?

Answer: Basically, hack in implementation

slide-70
SLIDE 70

Sub -> num Sub’
Sub’ -> + num Sub’ | ε

Is basically…

Sub -> num (+ num)*

slide-71
SLIDE 71

Intuition: treat this as while loop, then when building parse tree, put in left-associative order

Sub -> num (+ num)*
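A sketch of that loop is shown below. It assumes the operator is subtraction, as in the earlier 5 - 3 - 1 example, with tokens given as numbers and the symbol 'minus and 'eof at the end. The sub struct and evaluate are hypothetical helpers for the example.

```racket
#lang racket
(struct sub (left right) #:transparent) ; hypothetical AST node for "-"

;; Assumption: tokens are numbers and 'minus; 'eof ends the input.
(define tokens '())
(define curtok 'eof)

(define (next-tok)
  (if (null? tokens)
      'eof
      (begin0 (car tokens) (set! tokens (cdr tokens)))))

(define (accept c)
  (if (equal? curtok c)
      (set! curtok (next-tok))
      (raise 'unexpected-token)))

(define (parse-num)
  (define n curtok)
  (unless (number? n) (raise 'unexpected-token))
  (set! curtok (next-tok))
  n)

;; Sub -> num (- num)*
;; The loop keeps the tree built so far in left and wraps it each time
;; another "- num" is seen, so the tree grows to the left.
(define (parse-Sub)
  (let loop ([left (parse-num)])
    (cond
      [(eq? curtok 'minus)
       (accept 'minus)
       (loop (sub left (parse-num)))]
      [else left])))

(define (parse toks)
  (set! tokens toks)
  (set! curtok (next-tok))
  (parse-Sub))

;; Evaluate a tree of sub nodes and numbers.
(define (evaluate tree)
  (match tree
    [(sub l r) (- (evaluate l) (evaluate r))]
    [n n]))
```

Now 5 - 3 - 1 parses to (sub (sub 5 3) 1) and evaluates to 1. The grammar did not change. Only the way the tree is assembled did.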

slide-72
SLIDE 72

Sub -> num Sub’
Sub’ -> + num Sub’ | ε

slide-73
SLIDE 73

If you want to get a rightmost derivation, you need to use an LR parser

slide-74
SLIDE 74

input:    /* empty */
        | input line
;

line:     '\n'
        | exp '\n'  { printf ("\t%.10g\n", $1); }
;

exp:      NUM             { $$ = $1;           }
        | exp exp '+'     { $$ = $1 + $2;      }
        | exp exp '-'     { $$ = $1 - $2;      }
        | exp exp '*'     { $$ = $1 * $2;      }
        | exp exp '/'     { $$ = $1 / $2;      }
        /* Exponentiation */
        | exp exp '^'     { $$ = pow ($1, $2); }
        /* Unary minus */
        | exp 'n'         { $$ = -$1;          }
;

slide-75
SLIDE 75

Parsing is lame, it’s 2017

slide-76
SLIDE 76
slide-77
SLIDE 77

If you can, just use something like JSON / protobufs / etc…

Inventing your own format is probably wrong

For small / prototypical things, recursive-descent

For real things, use yacc / bison / ANTLR