automatic scanner generation in minijava symbol class
play

Automatic scanner generation in MiniJava Symbol class We use the - PDF document

Automatic scanner generation in MiniJava Symbol class We use the jflex tool to automatically create a scanner from a Lexemes are represented as instances of class Symbol specification file, Scanner/minijava.jflex class Symbol { (We use the CUP


  1. Automatic scanner generation in MiniJava Symbol class We use the jflex tool to automatically create a scanner from a Lexemes are represented as instances of class Symbol specification file, Scanner/minijava.jflex class Symbol { (We use the CUP tool to automatically create a parser from a int sym; // which token class? specification file, Parser/minijava.cup , which also Object value; // any extra data for this lexeme generates all the code for the token classes used in the ... scanner, via the Symbol class.) } The MiniJava Makefile automatically rebuilds the scanner A different integer constant is defined for each token class, in the (or parser) whenever its specification file changes sym helper class class sym { static int CLASS = 1; static int IDENTIFIER = 2; static int COMMA = 3; ... } Can use this in printing code for Symbol s • see symbolToString in minijava.jflex Craig Chambers 40 CSE 401 Craig Chambers 41 CSE 401 Token declarations jflex token specifications Declare new token classes in Parser/minijava.cup , Helper definitions for character classes and regular expressions using terminal declarations letter = [a-zA-Z] • include Java type if Symbol stores extra data eol = [\r\n] Examples: (Simple) token definitions are of the form: regexp { Java stmt } /* reserved words: */ terminal CLASS, PUBLIC, STATIC, EXTENDS; regexp can be (at least): ... • a string literal in double-quotes, e.g. "class" , "<=" /* operators: */ • a reference to a named helper, in braces, e.g. {letter} terminal PLUS, MINUS, STAR, SLASH, EXCLAIM; • a character list or range, in square brackets, e.g. [a-zA-Z] ... • a negated character list or range, e.g. [^\r\n] /* delimiters: */ • . (which matches any single character) terminal OPEN_PAREN, CLOSE_PAREN; • regexp regexp , regexp | regexp , regexp * , regexp + , regexp ? , ( regexp ) terminal EQUALS, SEMICOLON, COMMA, PERIOD; ... Java stmt (the accept action) is typically: /* tokens with values: */ • return symbol(sym. CLASS ); for a simple token terminal String IDENTIFIER; • return symbol(sym. CLASS , yytext()); for a token terminal Integer INT_LITERAL; with extra data based on the lexeme string yytext() • empty for whitespace Craig Chambers 42 CSE 401 Craig Chambers 43 CSE 401

  2. Syntactic Analysis / Parsing Context-free grammars (CFG’s) Purpose: stream of tokens ⇒ abstract syntax tree (AST) Syntax specified using CFG’s • RE’s not powerful enough AST: • can’t handle nested, recursive structure • general grammars (GG’s) too powerful • captures hierarchical structure of input program • not decidable ⇒ parser might run forever! • primary representation of program for rest of compiler CFG’s: convenient compromise • capture important structural & nesting characteristics Plan: • some properties checked later during semantic analysis • study how grammars can specify syntax • study algorithms for constructing ASTs from token streams • study MiniJava implementation Common notation for CFG’s: Extended Backus-Naur Form (EBNF) Craig Chambers 44 CSE 401 Craig Chambers 45 CSE 401 CFG terminology EBNF description of initial MiniJava syntax Program ::= MainClassDecl {ClassDecl} MainClassDecl::= class ID { Terminals : alphabet of language defined by CFG public static void main ( String [ ] ID ) { {Stmt} } } ClassDecl ::= class ID [ extends ID] { Nonterminals : symbols defined in terms of terminals and {ClassVarDecl} {MethodDecl} } nonterminals ClassVarDecl ::= Type ID ; MethodDecl ::= public Type ID Production : rule for how a nonterminal (l.h.s.) is defined in ( [Formal { , Formal}] ) { {Stmt} return Expr ; } terms of a finite, possibly empty sequence of terminals & Formal ::= Type ID nonterminals Type ::= int | boolean | ID • recursive productions allowed! Stmt ::= Type ID ; | { {Stmt} } Can have multiple productions for same nonterminal | if ( Expr ) Stmt else Stmt | while ( Expr ) Stmt • alternatives | System.out.println ( Expr ) ; | ID = Expr ; Start symbol : root symbol defining language Expr ::= Expr Op Expr | ! Expr | Expr . ID ( [Expr { , Expr}] ) | ID | this Program ::= Stmt | Integer | true | false Stmt ::= if ( Expr ) Stmt else Stmt | ( Expr ) Stmt ::= while ( Expr ) Stmt Op ::= + | - | * | / | < | <= | >= | > | == | != | && Craig Chambers 46 CSE 401 Craig Chambers 47 CSE 401

  3. Transition diagrams Derivations and parse trees “Railroad diagrams” Derivation: sequence of expansion steps, beginning with start symbol, • another, more graphical notation for CFG’s leading to a string of terminals • look like FSA’s, where arcs can be labelled with nonterminals as well as terminals Parsing: inverse of derivation • given target string of terminals (a.k.a. tokens), want to recover nonterminals representing structure Can represent derivation as a parse tree • concrete syntax tree Craig Chambers 48 CSE 401 Craig Chambers 49 CSE 401 Example grammar Ambiguity E ::= E Op E | - E | ( E ) | id Some grammars are ambiguous : Op ::= + | - | * | / • multiple distinct parse trees with same final string Structure of parse tree captures much of meaning of program; ambiguity ⇒ multiple possible meanings for same program Craig Chambers 50 CSE 401 Craig Chambers 51 CSE 401

  4. Famous ambiguities: “dangling else” Resolving the ambiguity Stmt ::= ... | Option 1: add a meta-rule e.g. “ else associates with closest previous if ” if ( Expr ) Stmt | if ( Expr ) Stmt else Stmt • works, keeps original grammar intact • ad hoc and informal “ if ( e 1 ) if ( e 2 ) s 1 else s 2 ” Craig Chambers 52 CSE 401 Craig Chambers 53 CSE 401 Option 2: rewrite the grammar to resolve ambiguity explicitly Option 3: redesign the language to remove the ambiguity Stmt ::= ... | Stmt ::= MatchedStmt | UnmatchedStmt if Expr then Stmt end | MatchedStmt ::= ... | if Expr then Stmt else Stmt end if ( Expr ) MatchedStmt else MatchedStmt • formal, clear, elegant UnmatchedStmt ::= if ( Expr ) Stmt | if ( Expr ) MatchedStmt • allows sequence of Stmt s in then and else branches, no { , } needed else UnmatchedStmt • extra end required for every if • formal, no additional rules beyond syntax • sometimes obscures original grammar Craig Chambers 54 CSE 401 Craig Chambers 55 CSE 401

  5. Another famous ambiguity: expressions Resolving the ambiguity E ::= E Op E | - E | ( E ) | id Option 1: add some meta-rules, e.g. precedence and associativity rules Op ::= + | - | * | / “ a + b * c ” Example: E ::= E Op E | - E | E ++ | ( E ) | id Op ::= + | - | * | / | % | ** | == | < | && | || operator precedence associativity postfix ++ highest left prefix - right ** (expon.) right * , / , % left + , - left == , < none && left || lowest left Craig Chambers 56 CSE 401 Craig Chambers 57 CSE 401 Option 2: modify the grammar to explicitly resolve the ambiguity Example, redone Strategy: E ::= E0 • create a nonterminal for each precedence level E0 ::= E0 || E1 | E1 left associative • expr is lowest precedence nonterminal, E1 ::= E1 && E2 | E2 left associative each nonterminal can be rewritten with higher E2 ::= E3 ( == | < ) E3 non associative precedence operator, E3 ::= E3 ( + | - ) E4 | E4 left associative highest precedence operator includes atomic exprs E4 ::= E4 ( * | / | % ) E5 | E5 left associative • at each precedence level, use: E5 ::= E6 ** E5 | E6 right associative • left recursion for left-associative operators E6 ::= - E6 | E7 right associative • right recursion for right-associative operators E7 ::= E7 ++ | E8 left associative • no recursion for non-associative operators E8 ::= id | ( E ) Craig Chambers 58 CSE 401 Craig Chambers 59 CSE 401

  6. Designing a grammar Parsing algorithms Concerns: Given grammar, want to parse input programs • accuracy • check legality • unambiguity • produce AST representing structure • formality • be efficient • readability, clarity • ability to be parsed by particular parsing algorithm Kinds of parsing algorithms: • top-down parser ⇒ LL(k) grammar • top-down • bottom-up parser ⇒ LR(k) grammar • bottom-up • ability to be implemented using a particular strategy • by hand • by automatic tools Craig Chambers 60 CSE 401 Craig Chambers 61 CSE 401 Top-down parsing Predictive parsing Build parse tree for input program from the top (start symbol) Predictive parser: down to leaves (terminals) top-down parser that can select correct rhs looking at at most k input tokens (the lookahead ) Basic issue: Efficient: • when "expanding" a nonterminal with some r.h.s., how to pick which r.h.s.? • no backtracking needed • linear time to parse E.g. Stmt ::= Call | Assign | If | While Implementation of predictive parsers: Call ::= Id ( Expr { , Expr } ) • recursive-descent parser Assign ::= Id = Expr ; • each nonterminal parsed by a procedure If ::= if Test then Stmts end | • call other procedures to parse sub-nonterminals, recursively if Test then Stmts else Stmts end • typically written by hand While ::= while Test do Stmts end • table-driven parser • PDA: like table-driven FSA, plus stack to do recursive FSA calls Solution: look at input tokens to help decide • typically generated by a tool from a grammar specification Craig Chambers 62 CSE 401 Craig Chambers 63 CSE 401

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend