Compiler Construction Lecture 4: Lexical Analysis III (Practical Aspects)

  1. Compiler Construction, Lecture 4: Lexical Analysis III (Practical Aspects). Thomas Noll, Lehrstuhl für Informatik 2 (Software Modeling and Verification), noll@cs.rwth-aachen.de, http://moves.rwth-aachen.de/teaching/ss-14/cc14/, Summer Semester 2014

  2. Outline
     1. Recap: First-Longest-Match Analysis
     2. Time Complexity of First-Longest-Match Analysis
     3. First-Longest-Match Analysis with NFA
     4. Longest Match in Practice
     5. Regular Definitions
     6. Generating Scanners Using [f]lex
     7. Preprocessing

  3. The Extended Matching Problem
     Definition: Let $n \ge 1$ and $\alpha_1, \ldots, \alpha_n \in RE_\Omega$ with $\varepsilon \notin \llbracket \alpha_i \rrbracket \ne \emptyset$ for every $i \in [n]$ ($= \{1, \ldots, n\}$). Let $\Sigma := \{T_1, \ldots, T_n\}$ be an alphabet of corresponding tokens and $w \in \Omega^+$. If $w_1, \ldots, w_k \in \Omega^+$ such that $w = w_1 \ldots w_k$ and for every $j \in [k]$ there exists $i_j \in [n]$ such that $w_j \in \llbracket \alpha_{i_j} \rrbracket$, then $(w_1, \ldots, w_k)$ is called a decomposition and $(T_{i_1}, \ldots, T_{i_k})$ is called an analysis of $w$ w.r.t. $\alpha_1, \ldots, \alpha_n$.
     Problem (Extended matching problem): Given $\alpha_1, \ldots, \alpha_n \in RE_\Omega$ and $w \in \Omega^+$, decide whether there exists a decomposition of $w$ w.r.t. $\alpha_1, \ldots, \alpha_n$ and determine a corresponding analysis.
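The definition can be explored with a small brute-force sketch that enumerates every decomposition and analysis of a word. It is only an illustration: the function name `analyses`, the token names and the example patterns are assumptions, and Python's `re` syntax stands in for $RE_\Omega$.

```python
import re

def analyses(patterns, w):
    """Yield every (decomposition, analysis) pair of w.

    patterns: list of (token, regex) pairs (hypothetical names);
    each regex is assumed not to match the empty word.
    """
    if w == "":
        yield (), ()
        return
    for end in range(1, len(w) + 1):
        prefix = w[:end]
        for token, regex in patterns:
            if re.fullmatch(regex, prefix):
                # prefix is a lexeme of this token: analyse the rest recursively
                for lexemes, tokens in analyses(patterns, w[end:]):
                    yield (prefix,) + lexemes, (token,) + tokens

PATTERNS = [("T1", "a+"), ("T2", "a|b")]
result = list(analyses(PATTERNS, "aa"))
```

For `"aa"` the sketch finds two decompositions, `("a", "a")` and `("aa",)`, and several analyses for the first one, which is the ambiguity that the uniqueness principles are meant to remove.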

  4. Ensuring Uniqueness
     Two principles:
     1. Principle of the longest match (“maximal munch tokenization”)
        - for uniqueness of decomposition
        - make lexemes as long as possible
        - motivated by applications: e.g., every (non-empty) prefix of an identifier is also an identifier
     2. Principle of the first match
        - for uniqueness of analysis
        - choose first matching regular expression (in the given order)
        - therefore: arrange keywords before identifiers (if keywords protected)
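A minimal sketch of how the two principles interact, assuming Python regular expressions in place of the $\alpha_i$. The helper names `longest_prefix` and `flm_analysis` and the example patterns are hypothetical; prefixes are tested with `re.fullmatch` from longest to shortest because `re.match` alone does not guarantee a longest match for alternations.

```python
import re

def longest_prefix(regex, w, pos):
    """Length of the longest prefix of w[pos:] in the language of regex (0 if none)."""
    for end in range(len(w), pos, -1):
        if re.fullmatch(regex, w[pos:end]):
            return end - pos
    return 0

def flm_analysis(patterns, w):
    """First-longest-match analysis; patterns is an ordered list of (token, regex)."""
    result, pos = [], 0
    while pos < len(w):
        best_token, best_len = None, 0
        for token, regex in patterns:
            n = longest_prefix(regex, w, pos)
            if n > best_len:  # strict '>' keeps the earlier pattern on ties (first match)
                best_token, best_len = token, n
        if best_token is None:
            raise ValueError(f"lexical error at position {pos}")
        result.append((best_token, w[pos:pos + best_len]))
        pos += best_len
    return result

# Keyword listed before identifiers, as the first-match principle requires.
PATTERNS = [("if", "if"), ("id", "[a-z][a-z0-9]*"), ("ws", " +")]
```

With these patterns, `flm_analysis(PATTERNS, "if iffy")` classifies `if` as a keyword (tie broken by order) and `iffy` as an identifier (longest match wins over the keyword prefix).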

  5. Implementation of FLM Analysis
     Algorithm (FLM analysis: overview)
     Input: expressions $\alpha_1, \ldots, \alpha_n \in RE_\Omega$, tokens $\{T_1, \ldots, T_n\}$, input word $w \in \Omega^+$
     Procedure:
       1. for every $i \in [n]$, construct $A_i \in DFA_\Omega$ such that $L(A_i) = \llbracket \alpha_i \rrbracket$ (see DFA method; Algorithm 2.9)
       2. construct the product automaton $A \in DFA_\Omega$ such that $L(A) = \bigcup_{i=1}^{n} \llbracket \alpha_i \rrbracket$
       3. partition the set of final states of $A$ to follow the first-match principle
       4. extend the resulting DFA to a backtracking DFA which implements the longest-match principle
       5. let the backtracking DFA run on $w$
     Output: FLM analysis of $w$ (if existing)
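Steps 2 and 3 might be sketched as follows. The encoding is an assumption made for illustration: each DFA is a triple `(delta, start, finals)` with a partial transition dictionary, a missing entry stands for a dead state (`None`), and every remaining state of the component DFAs is assumed to be productive. `product_dfa` and the two example automata are hypothetical names.

```python
from collections import deque

def product_dfa(dfas):
    """Build the reachable part of the product automaton.

    Returns (delta, start, label), where label[state] is the index i of the
    first component DFA that accepts in that state (first-match principle).
    """
    alphabet = sorted({a for d, _, _ in dfas for (_, a) in d})
    start = tuple(s for _, s, _ in dfas)
    delta, label = {}, {}
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        for i, (_, _, finals) in enumerate(dfas):
            if state[i] in finals:
                label[state] = i  # smallest index wins
                break
        for a in alphabet:
            succ = tuple(d.get((q, a)) for (d, _, _), q in zip(dfas, state))
            if all(q is None for q in succ):
                continue  # all components dead: non-productive, left out
            delta[(state, a)] = succ
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return delta, start, label

# A_1 accepts the keyword "if"; A_2 accepts non-empty words over {i, f}.
KEYWORD_IF = ({(0, "i"): 1, (1, "f"): 2}, 0, {2})
IDENT = ({(0, "i"): 1, (0, "f"): 1, (1, "i"): 1, (1, "f"): 1}, 0, {1})
delta, start, label = product_dfa([KEYWORD_IF, IDENT])
```

The product state `(2, 1)` is final in both components and is assigned to index 0 (the keyword), while `(1, 1)` belongs to index 1 only.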

  6. (4) The Backtracking DFA
     Definition (Backtracking DFA)
     The set of configurations of $B$ is given by
       $(\{N\} \uplus \Sigma) \times \Omega^* \cdot Q \cdot \Omega^* \times \Sigma^* \cdot \{\varepsilon, \mathrm{lexerr}\}$
     The initial configuration for an input word $w \in \Omega^+$ is $(N, q_0 w, \varepsilon)$.
     The transitions of $B$ are defined as follows (where $q' := \delta(q, a)$):
     - normal mode: look for initial match
       $(N, qaw, W) \vdash \begin{cases} (T_i, q'w, W) & \text{if } q' \in F^{(i)} \\ (N, q'w, W) & \text{if } q' \in P \setminus F \\ \text{output: } W \cdot \mathrm{lexerr} & \text{if } q' \notin P \end{cases}$
     - backtrack mode: look for longest match
       $(T, vqaw, W) \vdash \begin{cases} (T_i, q'w, W) & \text{if } q' \in F^{(i)} \\ (T, vaq'w, W) & \text{if } q' \in P \setminus F \\ (N, q_0 vaw, WT) & \text{if } q' \notin P \end{cases}$
     - end of input
       $(T, q, W) \vdash \text{output: } WT$ if $q \in F$
       $(N, q, W) \vdash \text{output: } W \cdot \mathrm{lexerr}$ if $q \in P \setminus F$
       $(T, vaq, W) \vdash (N, q_0 va, WT)$ if $q \in P \setminus F$
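The configuration rules can be simulated directly. The following is a sketch under assumptions, not the lecture's implementation: the DFA is given as a partial transition dictionary, a missing entry plays the role of $q' \notin P$, `label` maps each final state to its token (the partition $F^{(i)}$), and the lookahead $v$ is represented implicitly by the gap between `match_end` and `pos`. The name `run_backtracking_dfa` and the example automaton are hypothetical.

```python
def run_backtracking_dfa(delta, q0, label, w):
    """Return the recognised token sequence W, with "lexerr" appended on failure."""
    output = []
    start = 0  # position where the current lexeme begins
    while start < len(w):
        q, pos = q0, start
        mode, match_end = "N", start  # mode N: no match found yet
        while pos < len(w) and (q, w[pos]) in delta:
            q = delta[(q, w[pos])]
            pos += 1
            if q in label:  # q' in F^(i): switch to mode T_i, lookahead v becomes empty
                mode, match_end = label[q], pos
        if mode == "N":
            return output + ["lexerr"]
        output.append(mode)  # W := W T
        start = match_end  # restart in q0 and re-read the lookahead v
    return output

# DFA for alpha_1 = a (token T1) and alpha_2 = a*b (token T2).
DELTA = {(0, "a"): 1, (0, "b"): 3, (1, "a"): 2, (1, "b"): 3, (2, "a"): 2, (2, "b"): 3}
LABEL = {1: "T1", 3: "T2"}
```

On `"aab"` the run first records a `T1` match after one symbol, keeps reading in backtrack mode, and finally replaces it by the longer `T2` match; on `"aa"` it has to back up and emits `T1` twice.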

  8. Time Complexity of FLM Analysis
     Lemma 4.1: The worst-case time complexity of FLM analysis using the backtracking DFA on input $w \in \Omega^+$ is $O(|w|^2)$.
     Proof.
     - lower bound: $\alpha_1 = a$, $\alpha_2 = a^* b$, $w = a^m$ requires $\Theta(m^2)$ transitions
     - upper bound: each run from mode $N$ to $T \in \Sigma$ consumes at least one input symbol (and possibly reads all input symbols), involving at most $\sum_{i=1}^{|w|} i = \frac{|w|(|w|+1)}{2}$ transitions; if no $\Sigma$-mode is reached, lexerr is reported after $\le |w|$ transitions
     Remark: possible improvement by tabular method (similar to Knuth-Morris-Pratt Algorithm for pattern matching in strings)
     Literature: Th. Reps: “Maximal-Munch” Tokenization in Linear Time, ACM TOPLAS 20(2), 1998, 259–273
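The quadratic behaviour of the example in the proof can be checked by counting transitions. This is a sketch with an assumed hand-built DFA for $\alpha_1 = a$, $\alpha_2 = a^* b$ (states and the name `count_transitions` are hypothetical): on $w = a^m$ every pass reads to the end of the input hoping for a `b`, then backs up to the one-symbol match.

```python
# States: 0 initial, 1 after "a" (final for a), 2 after "aa..." (not final), 3 after "...b" (final).
DELTA = {(0, "a"): 1, (0, "b"): 3, (1, "a"): 2, (1, "b"): 3, (2, "a"): 2, (2, "b"): 3}
FINAL = {1, 3}

def count_transitions(w):
    """Number of DFA transitions a backtracking scanner performs on w."""
    steps, start = 0, 0
    while start < len(w):
        q, pos, match_end = 0, start, None
        while pos < len(w) and (q, w[pos]) in DELTA:
            q = DELTA[(q, w[pos])]
            pos += 1
            steps += 1
            if q in FINAL:
                match_end = pos
        if match_end is None:
            break  # lexical error: stop counting
        start = match_end  # back up to the end of the longest match
    return steps
```

For `"a" * m` the count equals `m * (m + 1) // 2`, matching the upper bound in the lemma, whereas `"a" * m + "b"` is scanned in a single pass of `m + 1` transitions.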

  10. A Backtracking NFA
      A similar construction is possible for the NFA method:
      1. $A_i = \langle Q_i, \Omega, \delta_i, q_0^{(i)}, F_i \rangle \in NFA_\Omega$ ($i \in [n]$) by NFA method
      2. “Product” automaton: $Q := \{q_0\} \uplus \biguplus_{i=1}^{n} Q_i$
         [Diagram: new initial state $q_0$ with $\varepsilon$-transitions to $A_1, \ldots, A_n$]
      3. Partitioning of final states:
         - $M \subseteq Q$ is called a $T_i$-matching if $M \cap F_i \ne \emptyset$ and for all $j \in [i-1]$: $M \cap F_j = \emptyset$; yields set of $T_i$-matchings $F^{(i)} \subseteq 2^Q$
         - $M \subseteq Q$ is called productive if there exists a productive $q \in M$; yields productive state sets $P \subseteq 2^Q$
      4. Backtracking automaton: similar to DFA case
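A sketch of the NFA variant, simulating state sets on the fly instead of building the powerset automaton. Assumptions made for illustration: the component NFAs have no internal $\varepsilon$-transitions, all of their states are productive (so a state set is productive exactly when it is non-empty), and the encoding `(token, delta, start, finals)` and the name `flm_with_nfas` are hypothetical.

```python
def flm_with_nfas(nfas, w):
    """nfas: ordered list of (token, delta, start, finals);
    delta maps (state, symbol) to a set of successor states."""
    output, begin = [], 0
    while begin < len(w):
        # epsilon-closure of the new initial state q0: the start states of all A_i
        current = {(i, start) for i, (_, _, start, _) in enumerate(nfas)}
        token, match_end, pos = None, begin, begin
        while pos < len(w) and current:
            a = w[pos]
            current = {(i, q2) for (i, q) in current
                       for q2 in nfas[i][1].get((q, a), ())}
            pos += 1
            # T_i-matching: the smallest i whose final states meet the current set
            hits = [i for (i, q) in current if q in nfas[i][3]]
            if hits:
                token, match_end = nfas[min(hits)][0], pos
        if token is None:
            return output + ["lexerr"]
        output.append(token)
        begin = match_end
    return output

LETTERS = "abcdefghijklmnopqrstuvwxyz"
KEYWORD = ("if", {(0, "i"): {1}, (1, "f"): {2}}, 0, {2})
IDENT = ("id", {(s, c): {1} for s in (0, 1) for c in LETTERS}, 0, {1})
```

After reading `if`, the current set contains final states of both automata and counts as an `if`-matching because the keyword comes first; one more letter leaves only the identifier automaton alive.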

  12. Longest Match in Practice
      In general: lookahead of arbitrary length required
      - that is, $|v|$ unbounded in configurations $(T, vqw, W)$
      - see Lemma 4.1: $\alpha_1 = a$, $\alpha_2 = a^* b$, $w = a \ldots a$
      “Modern” programming languages (Pascal, Java, ...): lookahead of one or two characters sufficient
      - separation of keywords, identifiers, etc. by spaces
      - Pascal: two-character lookahead required to distinguish 1.5 (real number) from 1..5 (integer range)
      However: principle of longest match not always applicable!
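The Pascal case can be illustrated with a hand-written number scanner that peeks two characters ahead. This is a simplified sketch (no exponents); the function name `scan_number` is hypothetical, and `s[pos]` is assumed to be a digit.

```python
def scan_number(s, pos):
    """Scan an integer or real starting at s[pos]; returns (token, lexeme, new_pos)."""
    end = pos
    while end < len(s) and s[end].isdigit():
        end += 1
    # Two-character lookahead: a '.' starts a fraction only if a digit follows it.
    if end + 1 < len(s) and s[end] == "." and s[end + 1].isdigit():
        end += 1
        while end < len(s) and s[end].isdigit():
            end += 1
        return "real", s[pos:end], end
    return "int", s[pos:end], end
```

`scan_number("1.5", 0)` yields a real, while `scan_number("1..5", 0)` stops after `1` and leaves `..` for the range operator.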

  13. Inadequacy of Longest Match I
      Example 4.2 (Longest match in FORTRAN)
      1. Relational expressions
         - valid lexemes: .EQ. (relational operator), EQ (identifier), 12 (integer), 12. , .12 (reals)
         - input string: 12␣.EQ.␣12 → 12.EQ.12 (ignoring blanks!)
         - intended analysis: (int, 12)(relop, eq)(int, 12)
         - LM yields: (real, 12.0)(id, EQ)(real, 0.12) ⇒ wrong interpretation
      2. DO loops
         - (correct) input string: DO␣5␣I␣=␣1,␣20 → DO5I=1,20
           intended analysis: (do, )(label, 5)(id, I)(gets, )(int, 1)(comma, )(int, 20)
           LM analysis (wrong): (id, DO5I)(gets, )(int, 1)(comma, )(int, 20)
         - (erroneous) input string: DO␣5␣I␣=␣1.␣20 → DO5I=1.20
           LM analysis (correct): (id, DO5I)(gets, )(real, 1.2)
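Both FORTRAN effects can be reproduced with a generic longest-match scanner. The sketch below assumes Python regular expressions for the lexeme classes; the pattern list and the name `longest_match_scan` are hypothetical, and the output shows lexemes rather than attribute values such as 12.0.

```python
import re

def longest_match_scan(patterns, w):
    """Longest match first, earliest pattern on ties; returns (token, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(w):
        # Candidates are (length, -priority, name): max() prefers longer, then earlier.
        best = max(
            ((end - pos, -i, name)
             for i, (name, rx) in enumerate(patterns)
             for end in range(pos + 1, len(w) + 1)
             if re.fullmatch(rx, w[pos:end])),
            default=None)
        if best is None:
            raise ValueError(f"lexical error at position {pos}")
        tokens.append((best[2], w[pos:pos + best[0]]))
        pos += best[0]
    return tokens

FORTRAN = [
    ("do", r"DO"),
    ("relop", r"\.EQ\."),
    ("real", r"[0-9]+\.[0-9]*|\.[0-9]+"),
    ("int", r"[0-9]+"),
    ("id", r"[A-Z][A-Z0-9]*"),
    ("gets", r"="),
    ("comma", r","),
]
```

On `12.EQ.12` the scanner greedily takes `12.` as a real, so `.EQ.` is never seen as an operator; on `DO5I=1,20` the identifier `DO5I` is longer than the keyword `DO`.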

  14. Inadequacy of Longest Match II
      Example 4.3 (Longest match in C)
      - valid lexemes:
          x (identifier)
          = (assignment)
          =- (subtractive assignment; K&R/ANSI-C: -=)
          1, -1 (integers)
      - input string: x=-1
      - intended analysis: (id, x)(gets, )(int, −1)
      - LM yields: (id, x)(dec, )(int, 1) ⇒ wrong interpretation
      Possible solutions:
      - Hand-written (non-FLM) scanners
      - FLM with lookahead (later)

  16. Regular Definitions I
      Goal: modularizing the representation of regular sets by introducing additional identifiers
      Definition 4.4 (Regular definition): Let $\{R_1, \ldots, R_n\}$ be a set of symbols disjoint from $\Omega$. A regular definition (over $\Omega$) is a sequence of equations
        $R_1 = \alpha_1$
        $\vdots$
        $R_n = \alpha_n$
      such that, for every $i \in [n]$, $\alpha_i \in RE_{\Omega \uplus \{R_1, \ldots, R_{i-1}\}}$.
      Remark: since recursion is not involved, every $R_i$ can (iteratively) be substituted by a regular expression $\alpha \in RE_\Omega$ (otherwise ⇒ context-free languages)
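The iterative substitution mentioned in the remark might look as follows. This sketch assumes a notation in which a right-hand side refers to an earlier symbol as `{Name}` and uses Python regex syntax for the expressions; `expand` is a hypothetical helper.

```python
import re

def expand(definitions):
    """definitions: ordered list of (name, rhs); returns name -> plain regex.

    A reference to a symbol that is not defined earlier (including the symbol
    itself, i.e. recursion) is rejected.
    """
    expanded = {}
    for name, rhs in definitions:
        def substitute(m):
            ref = m.group(1)
            if ref not in expanded:
                raise ValueError(f"{name} refers to {ref}, which is not defined earlier")
            return "(?:" + expanded[ref] + ")"
        expanded[name] = re.sub(r"\{([A-Za-z]+)\}", substitute, rhs)
    return expanded

defs = expand([
    ("Letter", "[A-Za-z]"),
    ("Digit", "[0-9]"),
    ("Id", "{Letter}({Letter}|{Digit})*"),
])
```

Because each $\alpha_i$ may only mention $R_1, \ldots, R_{i-1}$, a single left-to-right pass suffices; a definition such as `R = a{R}` raises an error instead of looping.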

  17. Regular Definitions II
      Example 4.5 (Symbol classes in Pascal)
      Identifiers:
        Letter = A | ... | Z | a | ... | z
        Digit = 0 | ... | 9
        Id = Letter (Letter | Digit)*
      Numerals (unsigned):
        Digits = Digit+
        Empty = ∅*
        OptFrac = . Digits | Empty
        OptExp = E (+ | - | Empty) Digits | Empty
        Num = Digits OptFrac OptExp
      Rel. operators:
        RelOp = < | <= | = | <> | > | >=
      Keywords:
        If = if
        Then = then
        Else = else
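The same definitions can be written as composed Python regular expressions. This is a sketch in Python's `re` syntax: the constant names are assumptions, and `Empty` is expressed through the optional operator `?` rather than $\emptyset^*$.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"
ID = f"{LETTER}(?:{LETTER}|{DIGIT})*"

DIGITS = f"{DIGIT}+"
OPT_FRAC = rf"(?:\.{DIGITS})?"      # . Digits | Empty
OPT_EXP = rf"(?:E[+-]?{DIGITS})?"   # E (+ | - | Empty) Digits | Empty
NUM = DIGITS + OPT_FRAC + OPT_EXP

REL_OP = "<=|<>|<|=|>=|>"
```

For instance, `re.fullmatch(NUM, "1.5E-3")` succeeds, while `"1."` is rejected because a fraction needs at least one digit.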

  19. The [f]lex Tool
      Usage of [f]lex (“[fast] lexical analyzer generator”):
        spec.l ([f]lex specification) --[f]lex--> lex.yy.c (scanner in C) --cc--> a.out (executable)
        Program --a.out--> Symbol sequence
      A [f]lex specification is of the form
        Definitions (optional)
        %%
        Rules
        %%
        Auxiliary procedures (optional)
