
CS 1622: Syntax Analysis
Jonathan Misurda
jmisurda@cs.pitt.edu
9/12/2012

Parsing
Input: Sequence of tokens
Output: Abstract Syntax Tree
Example: IF ( ID(‘x’) > NUM(‘3’) ) { ID(‘y’) INCREMENT ; }

if-statement
    cond_expr
        >
            x
            3
    stmt_list
        post-inc
            y

Parsing
The lexing phase has left us with a sequence of tokens.
We now need to determine the role of those tokens in context.
We’ll use a parser to produce a parse tree that represents the structure of the input.
A tree is used because the rules of a programming language are usually recursive.
For example:
    if-statement = if ( condition ) statement;
    statement = if-statement | while-statement | …

Can We Use REs for Parsing?
Quintessential example of the lack of power of REs: matching parentheses.
Alphabet: ( and )
Language: All strings that contain properly matched and nested parentheses
Describe strings with the pattern (^i )^i (i ≥ 1), that is, i opening parentheses followed by i closing parentheses.
Our finite automaton would need a state for each possible number of currently open parentheses. (That is, a state for “(”, “((”, “(((”, …)
That number is unbounded, but REs are converted into finite state automata, which have only finitely many states. This is a contradiction.

More Power
If regular expressions and finite state automata are insufficient for parsing, we will need a more powerful formalism.
To do this, we will use the concept of a Context Free Language.
Now that we have multiple categories of languages, let us generalize this notion first.

Grammar
Recall the definition of a language:
Language: set of strings over an alphabet
Alphabet: finite set of symbols
Null string: ε
Sentences: strings in the language
It is possible to describe a language using a grammar
• Define English using English grammar (as we learn in school)
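The matched-parentheses argument can be made concrete with a small sketch. The function name `is_balanced` is hypothetical and not from the slides. The recognizer below keeps one integer counter of currently open parentheses. That counter can grow without bound, which is exactly the memory a finite automaton lacks. Following the i ≥ 1 condition in the pattern, the sketch assumes the empty string is not in the language.

```python
def is_balanced(s: str) -> bool:
    """Recognize non-empty strings of properly matched, nested parentheses."""
    depth = 0  # number of currently open parentheses; unbounded in general
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared with no matching "("
                return False
        else:
            return False  # symbol outside the alphabet { (, ) }
    # Every "(" must be closed; the empty string is rejected (assumption).
    return depth == 0 and s != ""


print(is_balanced("((()))"))  # True
print(is_balanced("(()"))     # False
```

A fixed-size automaton can only simulate this counter up to some maximum depth, so it would accept or reject incorrectly once the nesting goes deeper than its number of states allows.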

Grammars
A grammar consists of 4 components (T, N, s, δ):
T: set of terminal symbols
• Essentially tokens; they appear in the input string
N: set of non-terminal symbols
• Categories of strings that impose hierarchical language structure
• Useful for analysis. Examples: declaration, statement, loop, ...
s: a special non-terminal start symbol; every sentence is derivable from it
δ: a set of production rules “LHS → RHS”: left-hand-side produces right-hand-side

Derivation
“LHS → RHS”
• Replace LHS with RHS
• Specifies how to transform one string to another
α ⇒ β: string α derives β
• α ⇒ β: 1 step
• α ⇒* β: 0 or more steps
• α ⇒+ β: 1 or more steps

Example
Language L = { any string with “00” at the end } ( /0{2}$/ )
Grammar G = (T, N, s, δ)
T = {0, 1}
N = {A, B}
s = A
δ = { A → 0A | 1A | 0B,
      B → 0 }
[Figure: finite automaton with states A, B, and C and transitions labeled 0 and 1]
Derivation: from grammar to language
• A ⇒ 0A ⇒ 00B ⇒ 000
• A ⇒ 1A ⇒ 10B ⇒ 100
• A ⇒ 0A ⇒ 00A ⇒ 000B ⇒ 0000
• A ⇒ 0A ⇒ 01A ⇒ ...

Chomsky Hierarchy of Languages
A classification of languages based on the form of grammar rules
• Classify not based on how complex the language is
• Classify based on how complex the grammar (that describes the language) is
Four types of grammars:
• Type 0: unrestricted (recursively enumerable) grammar
• Type 1: context sensitive grammar
• Type 2: context free grammar
• Type 3: regular grammar

Regular Languages
Form of rules:
    A → α or A → αB
    where A, B ∈ N, α ∈ T
Regular grammars define REs.
Example:
    A → 1A
    A → 0

Context Free Languages
Form of rules:
    A → γ
    where A ∈ N, γ ∈ (N ∪ T)+
A can be replaced by γ at any time.
Proper CFLs have no “erase rule” where a non-terminal is replaced by ε.
• If there are rules deriving the empty string, rewrite the grammar to remove them (such as in Chomsky Normal Form)
Example:
    S → SS
    S → ( S )
    S → ε
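To make the derivation example for the language of strings ending in “00” concrete, here is a minimal sketch that stores G = (T, N, s, δ) as plain Python data and enumerates every sentence up to a length bound by repeatedly replacing the leftmost non-terminal. The names `RULES`, `START`, and `derive` are illustrative assumptions, not part of the slides. The length-based pruning is only safe because no rule in this grammar has an empty right-hand side, so sentential forms never shrink.

```python
from collections import deque

# Grammar G from the example: strings over {0, 1} that end in "00".
T = {"0", "1"}                      # terminal symbols
N = {"A", "B"}                      # non-terminal symbols
START = "A"                         # start symbol s
RULES = {                           # production rules (delta)
    "A": ["0A", "1A", "0B"],
    "B": ["0"],
}


def derive(max_len):
    """Return every sentence of length <= max_len derivable from START."""
    sentences = set()
    seen = {START}
    queue = deque([START])
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, or None if form is a sentence.
        idx = next((i for i, sym in enumerate(form) if sym in N), None)
        if idx is None:
            sentences.add(form)
            continue
        for rhs in RULES[form[idx]]:
            # One derivation step: replace the LHS non-terminal with the RHS.
            new_form = form[:idx] + rhs + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sentences


print(sorted(derive(3)))  # ['00', '000', '100']
```

The strings 000 and 100 are the results of the first two derivations listed in the example, and every enumerated sentence ends in “00”.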

Context Sensitive Languages
Form of rules:
    αAβ → αγβ
    where A ∈ N; α, β ∈ (N ∪ T)*; γ ∈ (N ∪ T)+; |A| ≤ |γ|
Replace A by γ only if found in the context of α and β.
No erase rule.
Example:
    aAB → aCB

Unrestricted (Recursively Enumerable) Languages
Form of rules:
    α → β
    where α ∈ (N ∪ T)+, β ∈ (N ∪ T)*
The erase rule is allowed.
No restrictions on the form of grammar rules.
Example:
    aAB → aCD
    aAB → aB
    A → ε

Are CFGs enough for PLs?
We’ve determined that, because of nesting and recursive relationships in programming languages, REs (type 3 grammars) are insufficient.
What about Context Free (type 2) grammars?
Imagine we want to describe the grammar of valid C or Java programs that declare a variable before its use:
    S → DU
    D → int identifier;
    U → identifier ‘=’ expr;

Are CFGs enough for PLs?
The CFG allows for the following derivations:
    S ⇒ DU ⇒ int x; x=0;
    S ⇒ DU ⇒ int x; y=0;
    S ⇒ DU ⇒ int y; x=0;
    S ⇒ DU ⇒ int y; y=0;
The second and third derivations use a variable that was never declared, yet the grammar accepts them.
You would need a Context Sensitive grammar (type 1) to match the definition to the use.
So why do we seem to want to use CFGs?
• Some PL constructs are context free: if-statements, declarations
• Many are not: def-before-use, matching formal/actual parameters, etc.
• We like CFGs because they are powerful and easily understood.
• But we’ll need to add the checks that CFGs miss in later phases of the compiler.

Language Classification Summary
Regular Grammar ⊆ CFG ⊆ CSG ⊆ Unrestricted Grammar
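The def-before-use check that the CFG misses can be sketched as a later compiler pass over the parsed statements. This is an assumed, simplified representation: each statement is a `(kind, name)` pair where `kind` is `"decl"` or `"use"`. The function name `check_def_before_use` is hypothetical and does not come from the slides.

```python
def check_def_before_use(statements):
    """Return the names that are used before any declaration of them."""
    declared = set()   # a very small stand-in for a symbol table
    errors = []
    for kind, name in statements:
        if kind == "decl":
            declared.add(name)
        elif kind == "use" and name not in declared:
            errors.append(name)
    return errors


# "int x; x=0;" passes the check, while "int x; y=0;" does not,
# even though the grammar S -> DU derives both.
print(check_def_before_use([("decl", "x"), ("use", "x")]))  # []
print(check_def_before_use([("decl", "x"), ("use", "y")]))  # ['y']
```

Keeping this check outside the grammar lets the parser stay context free while the context-dependent rule is enforced separately.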

