Parsing DMS BNF git XCK JS Pascal Inkscape LCF Assembly - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CSS CodeSurfer Erlang Unlambda Assembly Pascal Delphi SVG LaTeX DITA sh PCRE Promela EMF Ecore ATL Ada Turbo Vision Haskell Rascal Java C# Ruby Python make git C++ Scheme PHP ANTLR Jenkins XSLT C ksh DHTML JS Perl CGI Matlab Maple Prolog XML XSD DTD LCF LDF XLDF BGF XBGF EDD BNF EBNF FST ΞBGF ASF SDF GDK GRK Smalltalk COBOL GWBasic QBasic VB Flash GIF Inkscape Eclipse GrammarLab LCI Grammar Hunter DMS MetaEnvironment HTML XHTML Django Zope Wordpress MediaWiki Wikidot Wikia Markdown bibTeX IDA WinIce M3 JSON 80x86 SQL SPARQL XCK FPU yED GraphML SPIN SoftIce DeGlucker TSR Graphviz dot Subversion CVS Grammar Hawk PDG DCG jQuery phpbb CRC Blowfish HASP OS/400 JCL JAXB

Parsing with Grammars

  • Dr. Vadim Zaytsev, Universiteit van Amsterdam

UvA MSc SE: Software Construction 2015

slide-2
SLIDE 2

Grammars & parsing are among the most established areas of CS/SE

slide-3
SLIDE 3
N. Chomsky, Syntactic Structures, 1957

slide-4
SLIDE 4
N. Chomsky, Aspects of the Theory of Syntax, 1965

slide-5
SLIDE 5

A.V. Aho & J.D. Ullman, The Theory of Parsing, Translation and Compiling, Volumes I + II, 1972

slide-6
SLIDE 6

A.V. Aho, R. Sethi, J.D. Ullman, Compilers: Principles, Techniques and Tools, 1986

slide-7
SLIDE 7
D. Grune, C.J.H. Jacobs, Parsing Techniques: A Practical Guide, 2 ed, 2008

slide-8
SLIDE 8

Why are grammars and parsing relevant?

slide-9
SLIDE 9

Language

  • Programming languages: C, Java, C#, JavaScript
  • Markup languages: HTML, XML, TeX, Markdown, wikis
  • Domain-specific languages: BibTeX, CSS, SQL, QL
  • Data formats: JSON, log files, protocol data, bytecode
  • (formally: a set of strings)
slide-10
SLIDE 10

How to define a language?

  • List all the sentences!
  • Infinite languages?
  • Finite recipes = grammars
  • Infinite grammars?
  • Two level grammars
slide-11
SLIDE 11

Example

  • Valid sentences/programs/instances:
  • Alice
  • Alice and Bob
  • Alice, Bob and Coen
  • Alice, Bob, Coen and Daenerys
  • How to define a recipe?
slide-12
SLIDE 12

ABCD Grammar

Name → Alice
Name → Bob
Name → Coen
Name → Daenerys
Sentence → List End
List → Name
List → List , Name
, Name End → and Name

slide-16
SLIDE 16

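ABCD Grammar

Name → Alice
Name → Bob
Name → Coen
Name → Daenerys
Sentence → List End
List → Name
List → List , Name
, Name End → and Name

  • Terminal symbols: Alice, Bob, Coen, Daenerys, the comma and "and"
  • Nonterminal symbols: Name, Sentence, List, End
  • Starting symbol: Sentence
  • Production rules: each line of the form left-hand side → right-hand side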

slide-17
SLIDE 17
Using ABCD

  • Alice and Bob
  • Alice and Bob → Name and Bob → Name and Name → Name , Name End → List , Name End → List End → Sentence
  • (analytic semantics)
  • Alice and Bob
  • Sentence → List End → List , Name End → Name , Name End → Name and Name → Alice and Name → Alice and Bob
  • Production (generative semantics)
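
The generative reading can be tried out mechanically. The following is a minimal sketch, assuming symbols are separated by spaces and sentential forms are explored breadth-first up to a length bound; the function name `generate` and the bound are illustrative choices, not part of the slides. Note that with the rules exactly as transcribed here, a single name such as "Alice" is not derivable, because End can only be removed together with a preceding ", Name".

```python
from collections import deque

NAMES = ["Alice", "Bob", "Coen", "Daenerys"]

# The ABCD rules as (left-hand side, right-hand side) tuples of symbols.
RULES = [(("Name",), (name,)) for name in NAMES] + [
    (("Sentence",), ("List", "End")),
    (("List",), ("Name",)),
    (("List",), ("List", ",", "Name")),
    ((",", "Name", "End"), ("and", "Name")),
]
NONTERMINALS = {"Name", "Sentence", "List", "End"}


def generate(max_symbols):
    """Return every sentence of at most max_symbols symbols derivable from Sentence."""
    start = ("Sentence",)
    seen = {start}
    queue = deque([start])
    sentences = set()
    while queue:
        form = queue.popleft()
        if not NONTERMINALS.intersection(form):
            # Only terminals left: this sentential form is a sentence.
            if len(form) <= max_symbols:
                sentences.add(" ".join(form))
            continue
        for lhs, rhs in RULES:
            n = len(lhs)
            for i in range(len(form) - n + 1):
                if form[i:i + n] == lhs:
                    new = form[:i] + rhs + form[i + n:]
                    # The last rule shrinks a form by one symbol, so allow one extra.
                    if len(new) <= max_symbols + 1 and new not in seen:
                        seen.add(new)
                        queue.append(new)
    return sentences
```

Calling `generate(5)` yields sentences such as "Alice and Bob" and "Alice , Bob and Coen" (the comma is a separate symbol, hence the space before it).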

slide-18
SLIDE 18

Notations

  • Name → Alice | Bob | Coen | Daenerys
  • Name → "Alice" | "Bob" | "Coen" | "Daenerys"
  • Name → “Alice” | “Bob” | “Coen” | “Daenerys”
  • ⟨Name⟩ → Alice | Bob | Coen | Daenerys
  • ⟨Name⟩ → “Alice” | “Bob” | “Coen” | “Daenerys”
slide-19
SLIDE 19

Notations

  • List → List , Name
  • ⟨List⟩ ::= ⟨List⟩ "," ⟨Name⟩ ;
  • List "," Name -> List
  • List <- List ',' Name
  • define List [List] , [Name] end define
  • syntax List = List "," Name;
  • List -> List ',' Name : ['$1'|'$3'].
slide-20
SLIDE 20

Common metaconstructs

  • Optional symbols: A?, [A]
  • Zero or more (Kleene star): A*, {A}
  • One or more: A+
  • Choice (disjunction): A | B, A / B
  • Less common (careful!): conjunction, negation, exact repetition, reference naming, priorities
slide-21
SLIDE 21

Chomsky-Schützenberger hierarchy

  • Type-0: Recursively enumerable
  • Rules: α → β (unrestricted)
  • Type-1: Context-sensitive
  • Rules: αAβ → αγβ
  • Type-2: Context-free
  • Rules: A → γ
  • Type-3: Regular
  • Rules: A → a and A → aB

Noam Chomsky. On Certain Formal Properties of Grammars, Information & Control 2(2):137–167, 1959.

slide-22
SLIDE 22

CFG for ABCD

⟨Name⟩ → “Alice” | “Bob” | “Coen” | “Daenerys”
⟨Sentence⟩ → ⟨Name⟩ | ⟨List⟩ “and” ⟨Name⟩
⟨List⟩ → ⟨Name⟩ “,” ⟨List⟩ | ⟨Name⟩
⟨List⟩ → ⟨Name⟩ (“,” ⟨Name⟩)*
⟨List⟩ → {⟨Name⟩ “,”}+

slide-23
SLIDE 23

Regexp for ABCD

^\w+((, \w+)* and \w+)?$

  • S. C. Kleene, Representation of Events in Nerve Nets and Finite Automata. In Automata Studies, pp. 3–42, 1956.

photo from: Konrad Jacobs, S. C. Kleene, 1978, MFO.
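
The regular expression above can be checked directly with Python's `re` module. This is a small sketch; the helper name `is_abcd` is an assumption made for illustration. Note that `\w+` accepts any word, so the pattern describes the shape of the list rather than the four specific names.

```python
import re

# Pattern copied from the slide; \w+ matches any word, not only the four names.
ABCD = re.compile(r"^\w+((, \w+)* and \w+)?$")


def is_abcd(sentence):
    """Return True for 'A', 'A and B', 'A, B and C', and so on."""
    return ABCD.match(sentence) is not None
```

For instance, `is_abcd("Alice, Bob and Coen")` holds, while `is_abcd("Alice, Bob")` does not, because the final name must be introduced by "and".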

slide-24
SLIDE 24

Rose by Arwen Grune; p.58 of Grune/Jacobs’ “Parsing Techniques”, 2008

slide-25
SLIDE 25

Finite world

  • Explicitly given lists
  • Acyclic automata
  • Finite choice grammars (non-recursive, non-iterating)
  • e.g., users, keywords, postcodes
slide-26
SLIDE 26

Regular world

  • Regular expressions
  • Finite automata
  • Grammars:
  • A → a
  • A → aB
  • e.g., substring search, substring replace, counting
slide-27
SLIDE 27

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r \n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)? [ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?: (?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r \n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;: \\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)? [ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)? [ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)* \](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\ [\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r \n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\ [\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)? [ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)? [ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)* \](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\ [\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)? [ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r \n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*) (?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r \n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^ \"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(? =[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)? 
[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*) (?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r \n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\ \.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\ [([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\ ["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r \n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)? [ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

“somewhat pushes the limits of what it is sensible to do with regular expressions”

Jeff Atwood, Regex use vs. Regex abuse, 16 Feb 2005. RFC822. Paul Warren, Mail::RFC822::Address: regexp-based address validation, 17/09/2012.

slide-28
SLIDE 28

Context-free world

  • Grammarware and software languages
  • Nondeterministic pushdown automata
  • Grammars:
  • A → γ
  • A → BC
  • A → a
  • A → ε
  • e.g., parsing, pretty-printing, etc.

Jochgem, https://commons.wikimedia.org/wiki/File:Pushdown-overview.svg, CC-BY-SA, 2008.

slide-29
SLIDE 29

Context-sensitive world

  • Computer
  • Linear-bounded automata
  • Grammars:
  • αAβ → αγβ
  • e.g., anything practical

http://www.legoturingmachine.org

slide-30
SLIDE 30

Unrestricted world

  • Imaginary machines
  • Turing machine, λ-calculus, semi-Thue rewriting systems, Lindenmayer systems, Markov algorithm, …

  • Grammars:
  • α → β
  • e.g., almost anything

parsing is impossible

recognising is impossible

slide-31
SLIDE 31

In practice…

  • Regular grammars are used for lexical analysis
  • keywords
  • constants
  • comments (if not nested)
  • Context-free grammars: for structured/nested constructs
  • class declaration
  • if statement
  • Everything else: annotations + hacking
slide-32
SLIDE 32

A sentence

position := initial + rate * 60

slide-33
SLIDE 33

Lexical tokens
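
As an illustration of lexical analysis with regular expressions, here is a minimal tokeniser sketch for the sentence from the previous slide. The token class names (ASSIGN, NUMBER, IDENT, OP) are assumptions chosen for this example; the original slide showed the tokens as a picture.

```python
import re

# Each token class is described by a regular expression (names are illustrative).
TOKEN_SPEC = [
    ("ASSIGN", r":="),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+*]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def tokenize(text):
    """Split text into (class, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Applied to `position := initial + rate * 60`, this produces seven tokens: three identifiers, the assignment sign, two operators and one number.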

slide-34
SLIDE 34

Parse tree

slide-35
SLIDE 35

PEG

  • Parsing Expression Grammars, introduced in 2004
  • Analytic grammars; top-down unambiguous recognition
  • Explicit backtracking, ordered disjunction
  • Linear parsing time (with memoisation)
  • Do not fit into the Chomsky-Schützenberger hierarchy
  • Conjunction and negation are purely lookahead-based
slide-36
SLIDE 36

PEG example

  • S ← &X 'a'+ Y
  • X ← Z 'c'
  • Z ← 'a' Z? 'b'
  • Y ← 'b' Y? 'c'

{aⁿbⁿcⁿ | n>0}

fixed from https://en.wikipedia.org/wiki/Parsing_expression_grammar
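
The PEG above can be hand-translated into a recogniser. The sketch below assumes each rule becomes a function that takes a position and returns the position after a successful match, or None on failure; the function names and the final end-of-input check in `recognise` are additions for illustration and are not part of the grammar on the slide.

```python
def lit(s, i, ch):
    """Match a single character ch at position i."""
    return i + 1 if i < len(s) and s[i] == ch else None


def Z(s, i):
    """Z <- 'a' Z? 'b'"""
    j = lit(s, i, "a")
    if j is None:
        return None
    k = Z(s, j)            # Z? is greedy: keep the match if there is one
    if k is not None:
        j = k
    return lit(s, j, "b")


def Y(s, i):
    """Y <- 'b' Y? 'c'"""
    j = lit(s, i, "b")
    if j is None:
        return None
    k = Y(s, j)
    if k is not None:
        j = k
    return lit(s, j, "c")


def X(s, i):
    """X <- Z 'c'"""
    j = Z(s, i)
    return None if j is None else lit(s, j, "c")


def S(s, i):
    """S <- &X 'a'+ Y"""
    if X(s, i) is None:    # &X: lookahead only, consumes nothing
        return None
    j = lit(s, i, "a")
    if j is None:
        return None
    while (k := lit(s, j, "a")) is not None:
        j = k
    return Y(s, j)


def recognise(s):
    """Accept s only if S matches the whole input (assumed end-of-input check)."""
    return S(s, 0) == len(s)
```

The lookahead &X ensures that the a's and b's balance, and Y then ensures that the b's and c's balance, which is how the conjunction of two context-free conditions is expressed.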

slide-37
SLIDE 37

What is a parser?

(recogniser, compiler, …)

slide-38
SLIDE 38

Recogniser

slide-39
SLIDE 39

Parser

slide-40
SLIDE 40

Interpreter

slide-41
SLIDE 41

Compiler

slide-42
SLIDE 42

Parsing

slide-43
SLIDE 43

Bottom-up parsing

  • Reduce the input back to the start symbol
  • Recognise terminals
  • Replace terminals by nonterminals
  • Replace terminals and nonterminals by left-hand side of rule
  • LR, LR(0), LR(1), LR(k), LALR, SLR, GLR, SGLR, CYK, …
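
The reduction steps listed above can be sketched as a toy shift-reduce recogniser. This is only an illustration, assuming the small grammar Expr -> Expr + Term | Term, Term -> digit and a hand-written reduction order instead of a generated LR table; the function name `shift_reduce` is hypothetical.

```python
def shift_reduce(tokens):
    """Recognise Expr -> Expr + Term | Term, Term -> digit.

    Returns (accepted, trace), where trace lists the stack after each step.
    """
    stack, trace = [], []
    for tok in tokens:
        stack.append(tok)                          # shift
        trace.append(list(stack))
        while True:                                # reduce while a rule applies
            if stack[-1].isdigit():
                stack[-1] = "Term"                 # Term -> digit
            elif stack[-3:] == ["Expr", "+", "Term"]:
                stack[-3:] = ["Expr"]              # Expr -> Expr + Term
            elif stack == ["Term"]:
                stack[:] = ["Expr"]                # Expr -> Term (bottom of stack only)
            else:
                break
            trace.append(list(stack))
    return stack == ["Expr"], trace
```

The input is accepted when the whole stack has been reduced to the start symbol; a real LR parser makes the same shift and reduce moves but chooses between them with a precomputed table.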
slide-44
SLIDE 44

Top-down parsing

  • Imitate the production process by rederivation
  • Each nonterminal is a goal
  • Replace each goal by subgoals (= elements of its rule)
  • Parse tree is built from top to bottom
  • LL, LL(1), LL(k), LL(*), GLL, DCG, rec. descent, Packrat, Earley
slide-45
SLIDE 45

How to parse with a grammar?

  • Write a parser manually
  • good error handling
  • possible fine-tuning
  • a lot of work (seriously, years)
  • Generate with a parser generator
  • less work (maybe)
  • complex, rigid, idiosyncratic frameworks
  • difficult error handling
slide-46
SLIDE 46

Manually: recursive descent

  • Grammar:
  • A ::= x B C;
  • Parser:
  • A() { match('x'); B(); C(); }
slide-47
SLIDE 47

Manually: recursive descent

  • Grammar:
  • A ::= x B C | y D | ε ;
  • Parser:
  • A() { if (lookahead=='x') { match('x'); B(); C(); }
        else if (lookahead=='y') { match('y'); D(); }
        else ; }
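
A runnable version of this parser might look as follows. It is a sketch under stated assumptions: the slide does not define B, C and D, so the rules B ::= b, C ::= c and D ::= d are hypothetical placeholders, and the `Parser` class and `accepts` helper are illustrative names.

```python
class Parser:
    """Recursive descent for A ::= x B C | y D | ε, one method per nonterminal."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    @property
    def lookahead(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def match(self, ch):
        if self.lookahead != ch:
            raise SyntaxError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def A(self):
        if self.lookahead == "x":
            self.match("x")
            self.B()
            self.C()
        elif self.lookahead == "y":
            self.match("y")
            self.D()
        else:
            pass  # ε: consume nothing

    # Hypothetical single-character rules standing in for B, C and D.
    def B(self):
        self.match("b")

    def C(self):
        self.match("c")

    def D(self):
        self.match("d")


def accepts(text):
    """True if A derives the whole text."""
    parser = Parser(text)
    try:
        parser.A()
    except SyntaxError:
        return False
    return parser.pos == len(text)
```

One symbol of lookahead is enough here because the alternatives start with different terminals.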

slide-48
SLIDE 48

Manually: recursive descent

  • Grammar:
  • A ::= x B C | x D ;
  • Parser:
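  • A() { if (lookahead=='x') { match('x'); B(); C(); }
        else { match('x'); D(); } }
  • Both alternatives start with x, so one symbol of lookahead cannot choose between them; solved by backtracking: try an alternative and, if it fails, return and try the next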
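
Backtracking can be sketched with small parser-building functions. The combinator names `lit`, `seq` and `alt`, and the rules B ::= b, C ::= c, D ::= d, are assumptions made for this example. Each parser returns the position after a match or None, and `alt` restarts every alternative from the same position, committing to the first that succeeds.

```python
def lit(ch):
    """Parser for a single terminal."""
    def parse(text, pos):
        return pos + 1 if text.startswith(ch, pos) else None
    return parse


def seq(*parsers):
    """Run parsers one after another; fail as soon as one fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def alt(*parsers):
    """Try each alternative from the same start position (backtracking)."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse


# Hypothetical definitions of B, C and D for the sake of a runnable example.
B, C, D = lit("b"), lit("c"), lit("d")
A = alt(seq(lit("x"), B, C), seq(lit("x"), D))


def accepts(text):
    return A(text, 0) == len(text)
```

On the input "xd" the first alternative fails at B, the position is reset, and the second alternative succeeds.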

slide-49
SLIDE 49

Left factoring

  • For some parsers, this is inefficient:
  • S -> if E then S else S | if E then S
  • Can be rewritten as this:
  • S -> if E then S P
  • P -> else S | ε

sometimes the only way

slide-50
SLIDE 50

Manually: recursive descent

  • Grammar:
  • Expr ::= Expr '+' Expr ;
  • Parser:
  • Expr() { Expr(); match('+'); Expr(); }

Expr() calls itself before consuming any input, so it never terminates; solved by ‘optimising’ the grammar

slide-51
SLIDE 51

Left recursion removal

  • Before (left-recursive):
  • Expr -> Expr + Term
  • Expr -> Expr - Term
  • Expr -> Term
  • Term -> [0-9]
  • After:
  • Expr -> Term Rest
  • Rest -> + Term Rest | - Term Rest | ε
  • Term -> [0-9]
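
The rewritten grammar maps directly onto a recursive descent parser. The sketch below assumes the parser also evaluates the expression, passing the value computed so far into Rest so that the result stays left-associative; the function name `parse_expr` is illustrative.

```python
def parse_expr(text):
    """Parse and evaluate Expr -> Term Rest over single-digit terms."""
    pos = 0

    def term():
        # Term -> [0-9]
        nonlocal pos
        if pos < len(text) and text[pos] in "0123456789":
            value = int(text[pos])
            pos += 1
            return value
        raise SyntaxError(f"digit expected at position {pos}")

    def rest(left):
        # Rest -> + Term Rest | - Term Rest | ε
        nonlocal pos
        if pos < len(text) and text[pos] in "+-":
            op = text[pos]
            pos += 1
            right = term()
            return rest(left + right if op == "+" else left - right)
        return left  # ε

    value = rest(term())
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return value
```

Threading the left operand through `rest` is what keeps `9-3-2` grouped as `(9-3)-2`, even though the rewritten grammar is right-recursive.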
slide-52
SLIDE 52

Ambiguity:

  • one sentence, several possible trees

sometimes possible to annotate the grammar with priorities

slide-53
SLIDE 53

  • Fig. 3.2. Spurious ambiguity: no change in semantics (two parse trees for 3 + 5 + 1, both evaluating to 9)
  • Fig. 3.3. Essential ambiguity: the semantics differ (two parse trees for the same sentence, evaluating to −1 and −3)

p.64 of Grune/Jacobs’ “Parsing Techniques”, 2008

slide-54
SLIDE 54

Disambiguation

  • Expr -> Expr + Atom
  • Expr -> Expr – Atom
  • Expr -> Atom
  • Atom -> Number
  • Atom -> Variable
  • Expr -> Expr + Term
  • Expr -> Term
  • Term -> Term * Primary
  • Term -> Primary
  • Primary -> Number
  • Primary -> Variable
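
The layered Expr/Term/Primary grammar can be sketched as a parser that builds a tree. This example assumes the left-recursive rules are implemented as loops (a common hand-written equivalent) and that trees are nested tuples of the form (operator, left, right); the function name `parse` is illustrative.

```python
import re


def parse(text):
    """Build a tree for Expr -> Expr + Term | Term, Term -> Term * Primary | Primary."""
    # Characters that are not part of a token are silently skipped.
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[+*]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def primary():
        # Primary -> Number | Variable
        nonlocal pos
        tok = peek()
        if tok is None or tok in ("+", "*"):
            raise SyntaxError("number or variable expected")
        pos += 1
        return tok

    def term():
        # Term -> Term * Primary | Primary, written as a loop
        nonlocal pos
        node = primary()
        while peek() == "*":
            pos += 1
            node = ("*", node, primary())
        return node

    def expr():
        # Expr -> Expr + Term | Term, written as a loop
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("+", node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected token " + repr(tokens[pos]))
    return tree
```

Because multiplication lives one level below addition, `1+2*3` comes out as `("+", "1", ("*", "2", "3"))` without any priority annotations.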
slide-55
SLIDE 55
slide-56
SLIDE 56

ALL “optimisations” are parsing tech-specific!

(and make grammars uglier)

slide-57
SLIDE 57

Recall: parser

slide-58
SLIDE 58

Parser generator

slide-59
SLIDE 59

Parser generators: ↑

  • LALR(1): Beaver; YACC, byacc, bison, etc; Eli; Irony; SableCC; yecc
  • GLR: bison, DMS, GDK, Tom
  • SGLR: ASF+SDF MetaEnv; Spoofax, Stratego/XT
slide-60
SLIDE 60

Parser generators: ↓

  • LL(k): JavaCC
  • LL(*): ANTLR
  • Earley: Marpa, ModelCC
  • GLL: Rascal (SGTDBF), gll-combinators in Scala
  • Packrat: Rats!, OMeta, PetitParser
  • Others: TXL
slide-61
SLIDE 61

Summary of parsing techniques

  • Top-down
  • predict-match and variants/improvements
  • backtracking is good; memoisation is good
  • left recursion is generally problematic
  • Bottom-up
  • shift-reduce and variants
  • deterministic CFLs in linear time
slide-62
SLIDE 62

Conclusion

  • “Making a linear-time parser for an arbitrary given grammar is 10% hard work; the other 90% can be done by computer”. [Grune/Jacobs, Parsing Techniques, 2008, p.81]

  • Every notation/metatool defines a class of grammars/languages
  • Parsing should never be more complex than O(n³)
  • Many books/papers exist; beware of bullshit!
slide-63
SLIDE 63

Credits

  • Given on the bottom of each slide
  • unless self-made or public domain
  • Font
  • Intuitive by Bruno de Souza Leão, OFL, http://openfontlibrary.org/en/font/intuitive

  • Feedback
  • http://grammarware.net, http://grammarware.github.io, …