

slide-1
SLIDE 1

Tools and Analyses for Ambiguous Input Streams

Andrew Begel and Susan L. Graham University of California, Berkeley LDTA Workshop - April 3, 2004

slide-5
SLIDE 5

Harmonia: Language-aware Editing

  • Programming by Voice
    – Code dictation
    – Voice-based editing commands
  • Program Transformations
    – Transformation actions
    – Pattern-matching constructs

Input streams: Human Speech, Embedded Languages

Each kind of input stream ambiguity requires new language analyses

slide-6
SLIDE 6

Speech Example

Code: for (int i = 0; i < 10; i++ ) { ❙ }

Spoken: for int i equals zero i less than ten i plus plus

slide-8
SLIDE 8

Ambiguities

Intended code: for (int i = 0; i < 10; i++ ) { ❙ }

Recognized words: 4 int eye equals 0 aye less then 10 i plus plus

Questions raised: KW or #? ID spelling? KW or ID?

slide-9
SLIDE 9

Another Utterance

for times ate equals zero two plus equals one

slide-10
SLIDE 10

Many Valid Parses!

for times ate equals zero two plus equals one

  • 4 * 8 = zero; to += won ❙
  • for (times; ate == 0; to += 1) { ❙ }
  • fore.times(8).equalsZero(2, plus == 1) ❙
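The number of candidate readings grows multiplicatively with each homophone. A minimal sketch, assuming a small hand-written homophone table (the table, the category names, and the rule that unknown words default to ID are all illustrative assumptions, not part of Harmonia):

```python
from itertools import product

# Hypothetical homophone table: spoken word -> possible (spelling, category)
# readings. The entries are illustrative assumptions only.
HOMOPHONES = {
    "for": [("for", "KW"), ("4", "NUM"), ("fore", "ID")],
    "ate": [("ate", "ID"), ("8", "NUM")],
    "zero": [("0", "NUM"), ("zero", "ID")],
    "two": [("to", "ID"), ("2", "NUM")],
    "one": [("1", "NUM"), ("won", "ID")],
}

def alternatives(word):
    """Return every (spelling, category) reading of one spoken word."""
    # Simplification: a word without an entry is treated as a plain ID.
    return HOMOPHONES.get(word, [(word, "ID")])

def readings(utterance):
    """Enumerate every token sequence the utterance could stand for."""
    per_word = [alternatives(word) for word in utterance.split()]
    return list(product(*per_word))

utterance = "for times ate equals zero two plus equals one"
all_readings = readings(utterance)
print(len(all_readings))  # 3 * 2 * 2 * 2 * 2 = 48 candidate token sequences
```

Enumerating every sequence up front is only practical for short utterances; it is shown here to make the size of the search space concrete.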

slide-12
SLIDE 12

Embedded Language Example

  • C and Regexps embedded in Flex

Flex Rule for Identifiers

[_a-zA-Z]([_a-zA-Z0-9])* i++; RETURN_TOKEN(ID);

  • Why not this interpretation?

[_a-zA-Z]([_a-zA-Z0-9])* i++; RETURN_TOKEN(ID);

slide-16
SLIDE 16

Legacy Language Example

Fortran

  • Do Loop

DO 57 I = 3,10

  • Assignment

DO 57 I = 3

slide-17
SLIDE 17

Legacy Language Example

Fortran

  • Do Loop

DO 57 I = 3,10

  • Assignment

DO57I = 3
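Because blanks are not significant here, the two statements differ only in the comma after the equals sign. A minimal sketch of the conventional whole-statement check, for contrast with an approach that keeps both readings alive (the regular expressions and the `classify` helper are illustrative assumptions, and commas nested inside parentheses are deliberately not handled):

```python
import re

# Blanks are ignored, so the patterns are applied to the squeezed statement.
# Simplification: a comma inside parentheses would be misread as a DO-loop comma.
DO_LOOP = re.compile(r"^DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)$")
ASSIGNMENT = re.compile(r"^([A-Z][A-Z0-9]*)=(.+)$")

def classify(statement):
    """Classify a statement as a DO loop or an assignment (simplified)."""
    text = statement.replace(" ", "").upper()
    match = DO_LOOP.match(text)
    if match:
        label, variable, start, stop = match.groups()
        return ("do-loop", label, variable, start, stop)
    match = ASSIGNMENT.match(text)
    if match:
        return ("assignment", match.group(1), match.group(2))
    return ("unknown", text)

print(classify("DO 57 I = 3,10"))  # ('do-loop', '57', 'I', '3', '10')
print(classify("DO 57 I = 3"))     # ('assignment', 'DO57I', '3')
```

This check has to see the whole statement before it can name the first token, which is the kind of lookahead a forking parser avoids by carrying both lexings forward.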

slide-19
SLIDE 19

Legacy Language Example

PL/I

  • Non-reserved Keywords

IF IF = THEN THEN THEN = ELSE ELSE ELSE = END END

KW ID ID ID

slide-21
SLIDE 21

Input Stream Classification

                             Single Spelling                                    Multiple Spellings
Single Lexical Category      Unambiguous                                        Lexical misspellings; Homophone IDs
Multiple Lexical Categories  Non-reserved keywords; Ambiguous interpretations   Homophones

Embedded Languages Fall in all Four Categories!

slide-23
SLIDE 23

GLR Analysis Architecture

Input: for (i = 0; i < 10; i++ ) { ❙ }

[Diagram: Lexer → GLR Parser → Semantics, with the tokens FOR, I, ( passed from stage to stage]

Handles syntactic ambiguities

slide-25
SLIDE 25

Our Contribution: XGLR Analysis Architecture

Input: for i equals zero ...

[Diagram: Lexer → XGLR Parser → Semantics, with the alternative tokens FOR, I, 4, EYE passed from stage to stage]

Handles input stream ambiguities

slide-26
SLIDE 26

LR Parsing

Parse Table

State   ID    KW    #
1       S2    S3    Err
2       R1    S4    Err
3       S9    R3    S7

Input Stream: FOR (KW), I (ID), = (KW), #

Parse Stack: 1

slide-28
SLIDE 28

LR Parsing

Parse Table

State   ID    KW    #
1       S2    S3    Err
2       R1    S4    Err
3       S9    R3    S7

Input Stream: I (ID), = (KW), #

Parse Stack: 1, FOR (KW), 3

slide-29
SLIDE 29

GLR Parsing

Parse Table

State   ID       KW       #
1       S2       S3/R5    Err
2       R1/R2    S4       Err
3       S9       R3       S7

Input Stream: FOR (KW), I (ID), = (KW), #

Parse Stack: 1

slide-31
SLIDE 31

GLR Parsing

Parse Table

State   ID       KW       #
1       S2       S3/R5    Err
2       R1/R2    S4       Err
3       S9       R3       S7

Input Stream: FOR (KW), I (ID), = (KW), #

Parse Stack: 1, forked on the S3/R5 entry; the reduce branch (rule 5) leads to state 2

slide-32
SLIDE 32

GLR Parsing

Parse Table

State   ID       KW       #
1       S2       S3/R5    Err
2       R1/R2    S4       Err
3       S9       R3       S7

Input Stream: I (ID), = (KW), #

Parse Stacks:
  • 1, FOR (KW), 3
  • 1, 5, 2, FOR (KW), 4

slide-33
SLIDE 33

XGLR in Action

                             Single Spelling   Multiple Spellings
Single Lexical Category      Not Shown         Example 1
Multiple Lexical Categories  Example 2         Example 1

slide-34
SLIDE 34

Parsing Homophones

[Diagram: a parser in state 23; input FOR, BAR]

slide-35
SLIDE 35

XGLR Extension: Multiple Spellings, Single and Multiple Lexical Categories

[Diagram: parser in state 23; lexer alternatives FOR, FORE, 4, FOUR with lexical categories KW, ID, NUM; next input BAR]

slide-36
SLIDE 36

XGLR Extension: Parsers fork due to input ambiguity

[Diagram: three parsers, each in state 23, one per lexical alternative]

slide-37
SLIDE 37

Each parser shifts its now unambiguous input

[Diagram: the three parsers move from state 23 to states 26, 29, and 35]

slide-38
SLIDE 38

The next input is lexed unambiguously

[Diagram: BAR is lexed as ID for all three parsers]

slide-39
SLIDE 39

ID is only a valid lookahead for two parsers

[Diagram: two parsers shift BAR and enter states 42 and 49; the third parser is dropped]
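The fork, shift, and prune sequence on the homophone slides can be sketched as one step function over a set of parsers. This is a minimal sketch, not the Harmonia implementation: the `SHIFT` table is hypothetical, and although the state numbers echo the slides, which lexical category leads to which state is an assumption.

```python
# Hypothetical shift table: (state, lexical category) -> next state.
# The category-to-state mapping is assumed for illustration.
SHIFT = {
    (23, "KW"): 26,
    (23, "ID"): 29,
    (23, "NUM"): 35,
    (26, "ID"): 42,
    (29, "ID"): 49,
}

def step(parsers, alternatives):
    """Advance every parser over every lexical alternative it can accept."""
    survivors = []
    for state in parsers:
        for _spelling, category in alternatives:
            next_state = SHIFT.get((state, category))
            if next_state is not None:
                survivors.append(next_state)
            # Otherwise the lookahead is invalid here and this fork dies.
    return survivors

parsers = [23]
parsers = step(parsers, [("for", "KW"), ("fore", "ID"), ("4", "NUM")])
print(parsers)  # [26, 29, 35]
parsers = step(parsers, [("bar", "ID")])
print(parsers)  # [42, 49]
```

Only shifts are modelled; a fuller version would also apply reductions before deciding that a lookahead is invalid.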

slide-41
SLIDE 41

Parsing Embedded Languages

Example BNF Grammar Contains Languages L and W

b_L → loop_L d_W END_L
loop_L → LOOP_L | ε
d_W → WHILE_W NUM_W do_W
do_W → DO_W | ε

Example inputs:

LOOP WHILE 34 END
WHILE 56 DO END
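When a parse state may be followed by tokens from either language, the same input position has to be lexed once per language. A minimal sketch, assuming regular-expression token definitions for L and W (the `LEXERS` table and its token sets are assumptions made for illustration; the slides do not give the lexical descriptions):

```python
import re

# Hypothetical lexical descriptions. Language L has only keywords; language W
# also has numbers and identifiers. Patterns are tried in the order listed.
LEXERS = {
    "L": [("KW", re.compile(r"(?:LOOP|END)\b"))],
    "W": [("KW", re.compile(r"(?:WHILE|DO)\b")),
          ("NUM", re.compile(r"\d+")),
          ("ID", re.compile(r"[A-Za-z_]\w*"))],
}

def lex(language, text, pos):
    """Return (category, lexeme) for the token at pos, or None if none matches."""
    for category, pattern in LEXERS[language]:
        match = pattern.match(text, pos)
        if match:
            return (category, match.group())
    return None

def lex_all(languages, text, pos):
    """Lex the same input position once per candidate lexical language."""
    results = {}
    for language in languages:
        token = lex(language, text, pos)
        if token is not None:
            results[language] = token
    return results

source = "LOOP WHILE 34 END"
print(lex_all(["L", "W"], source, 0))  # {'L': ('KW', 'LOOP'), 'W': ('ID', 'LOOP')}
print(lex_all(["L", "W"], source, 5))  # {'W': ('KW', 'WHILE')}
```

The first call yields one spelling with two lexical categories, so a parser would fork; the second yields a single reading because WHILE has no token in language L.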

slide-46
SLIDE 46

Parsing Embedded Languages

[Diagram: parser in start state S; input LOOP WHILE 34]

slide-47
SLIDE 47

Current parse state has ambiguous lexical language

[Diagram: parser in start state S; input LOOP WHILE 34]

slide-48
SLIDE 48

XGLR Extension: Fork parsers, assign one to each lexical language

[Diagram: start state S forked into one parser for language W and one for language L; input LOOP WHILE 34]

slide-49
SLIDE 49

XGLR Extension: Single spelling, Multiple lexical categories

Lex lookahead both in language L and W

[Diagram: LOOP is lexed in both languages, as KW in language L and as ID in language W; remaining input WHILE 34]

slide-50
SLIDE 50

Only LOOP_L is valid lookahead, and is shifted

[Diagram: the parser shifts LOOP_L and enters state 4; remaining input WHILE 34]

slide-51
SLIDE 51

XGLR Extension: State 4 has lexer lookaheads only in language W

slide-52
SLIDE 52

Lex lookahead in language W

[Diagram: WHILE is lexed in language W as KW; remaining input 34]

slide-53
SLIDE 53

REDUCE by rule 2 and GOTO state 1

[Diagram: LOOP is reduced to loop_L; the parser is in state 1]

slide-55
SLIDE 55

Shift into state 2

[Diagram: WHILE_W is shifted; the parser is in state 2; remaining input 34]

slide-56
SLIDE 56

XGLR Extension: Lex lookahead in language W

[Diagram: 34 is lexed in language W as NUM]

slide-58
SLIDE 58

Shift into state 3

slide-59
SLIDE 59

Shift into state 3, which has ambiguous lexical language

slide-60
SLIDE 60

XGLR Extension: Single spelling, Multiple lexical categories

Fork parsers, assign one to each lexical language

[Diagram: state 3 forked into one parser for language W and one for language L]

slide-61
SLIDE 61

GLR Ambiguity Support

1. Fork parser on shift-reduce conflict
2. Fork parser on reduce-reduce conflict
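Both kinds of fork come from a parse-table cell that holds more than one action. A minimal sketch of the forking step, assuming a table shaped like the one on the GLR Parsing slides (the cell contents and the helper names `fork` and `apply_shift` are illustrative assumptions; reduce actions are returned but not applied, because the slides do not list the grammar rules):

```python
# Parse table in which a cell may hold several actions. Cell contents are
# illustrative: "Sn" shifts into state n, "Rn" reduces by rule n.
TABLE = {
    (1, "ID"): ["S2"], (1, "KW"): ["S3", "R5"],   # shift-reduce conflict
    (2, "ID"): ["R1", "R2"], (2, "KW"): ["S4"],   # reduce-reduce conflict
    (3, "ID"): ["S9"], (3, "KW"): ["R3"], (3, "#"): ["S7"],
}

def fork(stack, category):
    """Return one (stack copy, action) pair per action in the table cell.

    An empty list means the parser has no valid action and is dropped.
    """
    actions = TABLE.get((stack[-1], category), [])
    return [(list(stack), action) for action in actions]

def apply_shift(stack, action):
    """Apply a shift action 'Sn' by pushing state n onto a copy of the stack."""
    if not action.startswith("S"):
        raise ValueError("not a shift action: " + action)
    return stack + [int(action[1:])]

forks = fork([1], "KW")
print(forks)                     # [([1], 'S3'), ([1], 'R5')]
print(apply_shift(*forks[0]))    # [1, 3]
```

Each fork receives its own copy of the stack here for clarity; practical GLR implementations share stack prefixes instead of copying them.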

slide-63
SLIDE 63

XGLR Ambiguity Support

1. Fork parser on shift-reduce conflict
2. Fork parser on reduce-reduce conflict
3. Fork parsers on ambiguous lexical language
  • Single spelling, Multiple lexical categories
4. Fork parsers on ambiguous lexical lookahead
  • Single/Multiple Spellings, Multiple lexical categories
  • Shift-shift conflict resolution

slide-65
SLIDE 65

XGLR Ambiguities

  • Many GLR programming language specs have a small, finite number of ambiguities
  • XGLR language specs also have a finite, but slightly larger, number of ambiguities
    – Lexical ambiguity due to ambiguous input does result in more ambiguous parse forests
  • Ambiguity causes parsers to fork
  • GLR maintains efficiency by merging parsers when ambiguity is over

slide-66
SLIDE 66

Parser Merging

  • GLR: Parsers merge when in same parse state

[Diagram: forked parse stacks over the tokens DO (KW) and 57 (#)]

slide-68
SLIDE 68

Parser Merging

  • XGLR: Parsers merge when in same parse state and same lexical state

[Diagram: the same parse stacks, with each parse state annotated with its lexical state (A or W)]

slide-73
SLIDE 73

Out of Sync Parsers

Input: DO57I=3

  • XGLR: Parsers merge when in same parse state and same lexical state and same input position

[Diagram sequence: one parser lexes DO as KW (lexical state A) and goes on to lex 57 and then I; the other lexes DO57I as a single ID (lexical state W) and goes on to lex = and then 3. After the first token the remaining input is 57I=3 for the first parser but only =3 for the second, so the two parsers are at different input positions.]
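The difference between the GLR and XGLR merge rules comes down to the key used to decide that two parsers are equivalent. A minimal sketch, assuming a simple `Parser` record (the record, its field names, and the state number 5 are hypothetical; the lexical states A and W follow the slides):

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Parser:
    parse_state: int
    lexical_state: str
    input_pos: int
    histories: list = field(default_factory=list)  # alternative token histories

def glr_key(parser):
    """GLR merges on the parse state alone."""
    return parser.parse_state

def xglr_key(parser):
    """XGLR also requires the same lexical state and input position."""
    return (parser.parse_state, parser.lexical_state, parser.input_pos)

def merge(parsers, key):
    """Merge parsers that share a key, pooling their histories."""
    groups = defaultdict(list)
    for parser in parsers:
        groups[key(parser)].append(parser)
    merged = []
    for group in groups.values():
        first = group[0]
        pooled = [h for parser in group for h in parser.histories]
        merged.append(Parser(first.parse_state, first.lexical_state,
                             first.input_pos, pooled))
    return merged

# Two parsers over DO57I=3 that have consumed different amounts of input.
a = Parser(5, "A", 2, [["DO"]])      # consumed DO
w = Parser(5, "W", 5, [["DO57I"]])   # consumed DO57I
print(len(merge([a, w], glr_key)))   # 1
print(len(merge([a, w], xglr_key)))  # 2
```

Under the GLR key the two parsers would be merged even though they disagree about how much input has been read, which is the situation the stricter key rules out.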

slide-81
SLIDE 81

Implementation

  • Keep map: lookahead → parser to use when looking for parsers to merge with
  • Sort parsers by position of lookahead in the input
    – Enables pruning of map as parsers move past a particular input location
    – Extra memory required is bounded by dynamic separation between first and last parsers
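One way to realize the two points above is a priority queue ordered by input position plus a map of merge candidates that is pruned as the rearmost parser advances. This is a sketch under stated assumptions, not the Harmonia implementation: the `ParserPool` class and its method names are hypothetical, and it assumes a parser is never re-added at a position earlier than the one it was popped from.

```python
import heapq

class ParserPool:
    """Parsers ordered by input position, with a prunable merge map."""

    def __init__(self):
        self._heap = []   # entries: (input_pos, sequence number, parser)
        self._seq = 0     # tie-breaker so parser records are never compared
        self.by_pos = {}  # input_pos -> {(parse_state, lex_state): parser}

    def add(self, pos, parse_state, lex_state):
        """Add a parser, or return the existing one it should merge with.

        Returns (parser, merged) where merged is True if a match existed.
        """
        slot = self.by_pos.setdefault(pos, {})
        key = (parse_state, lex_state)
        if key in slot:
            return slot[key], True
        parser = {"pos": pos, "state": parse_state, "lex": lex_state}
        slot[key] = parser
        heapq.heappush(self._heap, (pos, self._seq, parser))
        self._seq += 1
        return parser, False

    def pop_rearmost(self):
        """Remove the parser furthest behind and prune unreachable map entries."""
        pos, _, parser = heapq.heappop(self._heap)
        # Every remaining parser is at pos or later, so earlier slots are dead.
        for stale in [p for p in self.by_pos if p < pos]:
            del self.by_pos[stale]
        return parser

pool = ParserPool()
pool.add(2, 5, "A")
pool.add(5, 5, "W")
_, merged = pool.add(5, 5, "W")   # same state, lexical state, and position
first = pool.pop_rearmost()       # the parser at position 2
second = pool.pop_rearmost()      # the parser at position 5; slot 2 is pruned
print(merged, first["pos"], second["pos"], sorted(pool.by_pos))
```

The map never holds slots for positions behind the rearmost parser, so its size tracks the distance between the first and last parsers rather than the length of the input.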

slide-82
SLIDE 82

Related Work

  • GLR Parsing Algorithm
    – Tomita [1985]
    – Farshi [1991]
    – Rekers [1992]
    – Johnstone et al. [2002]
  • Incremental GLR
    – Wagner [1997]
  • GLR Implementations (that I heard of before today)
    – ASF+SDF [1993]
    – Elkhound [2004]
    – Bison [2003]
    – DParser [2002]
    – Aycock and Horspool [1999]
  • Scannerless Parsing (or Context-Free Scanning)
    – Salomon and Cormack [1989]
    – Visser [1997]
    – van den Brand [2002]
  • Ambiguous Input Streams
    – Aycock and Horspool [2001]
  • Embedded Languages
    – ASF+SDF [1997]
    – Van de Vanter and Boshernitsan (CodeProcessor) [2000]

slide-83
SLIDE 83

Future Work

  • Semantic Analysis of Embedded Languages
  • Automated Semantic Disambiguation

slide-84
SLIDE 84

Contributions

1. Generalized GLR to handle input stream ambiguities
2. Classified input stream ambiguities into four categories
3. Implemented XGLR algorithm in Harmonia framework
4. Constructed combined lexer and parser generator to support embedded languages and lexical ambiguities at each stage of analysis
5. Enabled analysis of embedded languages, programming by voice, and legacy languages
