9/25/17 1
CSCI 2320
Lexical Analysis
Ref: Ch 3 + Handout (Nishimura)
MOHAMMAD T. IRFAN
Plan
Chomsky Hierarchy
Lexical Analysis
Chomsky Hierarchy
Regular grammar (bottom of hierarchy)
Context-free grammar (CFG/BNF)
Context-sensitive grammar
Unrestricted grammar (top of hierarchy)
Toward the top of the hierarchy: more expressive power
Toward the bottom of the hierarchy: faster computation

Notation: A, B ∈ N; ω ∈ T*; α, β ∈ (T ∪ N)*
Regular grammar: A → ω B | ω
Context-free grammar (CFG/BNF): A → β
Context-sensitive grammar: α → β, where |α| ≤ |β|
Unrestricted grammar: α → β
Pros
Cons
Regular grammar: A, B ∈ N; ω ∈ T*; A → ω B | ω
Pros
Cons
Context-free grammar: A ∈ N; β ∈ (T ∪ N)*; A → β
Pros
Cons
derived from a given context-sensitive grammar
Context-sensitive grammar: α, β ∈ (T ∪ N)*; α → β, where |α| ≤ |β|
Pros
Cons
Unrestricted grammar: α, β ∈ (T ∪ N)*; α → β
Lexical Analysis
Input: Lexemes (typed ASCII characters)
Output: Tokens (sequence of characters having a collective meaning)
Discard: whitespace, comments
Example: int count = 10;

Lexeme    Token
int       keyword
count     identifier
=         operator
10        intLiteral
;         separator
Simpler, faster grammar for parsing
75% of time spent in lexical analysis
RegExpr   Meaning
x         a character x
\x        an escaped character, e.g., \n
{ Z }     a reference to a reg expr Z
M | N     M or N, where M and N are reg expr
M N       M followed by N
M*        zero or more occurrences of M
M+        one or more occurrences of M
M?        zero or one occurrence of M
RegExpr   Meaning
[aeiou]   the set of vowels
[0-9]     the set of digits
.         any single character

Special symbols: ^ means not (e.g., [^aeiouAEIOU] is a non-vowel)
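The operators in the two tables above can be tried out directly in Python. This is only a sketch: the `EXAMPLES` list and the `check` helper are made up for illustration, and each row pairs a pattern with one string it should match in full and one it should not. Two caveats about Python's `re` module: it has no `{ Z }` reference syntax (sub-expressions have to be spliced in as strings), and `.` does not match a newline unless `re.DOTALL` is given.

```python
import re

# (pattern, a string it should match in full, a string it should not)
EXAMPLES = [
    (r"a|b", "a", "c"),             # M | N
    (r"ab", "ab", "a"),             # M N
    (r"a*", "", "b"),               # M*
    (r"a+", "aaa", ""),             # M+
    (r"a?", "", "aa"),              # M?
    (r"\n", "\n", "n"),             # escaped character
    (r"[aeiou]", "e", "x"),         # the set of vowels
    (r"[0-9]", "7", "a"),           # the set of digits
    (r".", "x", "xy"),              # any single character
    (r"[^aeiouAEIOU]", "x", "A"),   # ^ means not
]

def check(pattern, text):
    """True when the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

for pattern, yes, no in EXAMPLES:
    print(repr(pattern), check(pattern, yes), check(pattern, no))
```

`re.fullmatch` is used rather than `re.search` so that each pattern is tested against the entire string, which is how the slide definitions read.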
Category     Definition
AnyChar      [ -~]        From space (ASCII 32) to tilde (ASCII 126)
Letter       [a-zA-Z]
Digit        [0-9]
Whitespace   [ \t]        Space and tab
Eol          \n
Category     Definition
Keyword      bool | char | else | false | float | if | int | main | true | while
Identifier   {Letter}({Letter} | {Digit})*
IntegerLit   {Digit}+
FloatLit     {Digit}+\.{Digit}+
CharLit      '{AnyChar}'
Operator     = | || | && | == | != | < | <= | > | + | - | * | / | ! | [ | ]
Separator    : | . | { | } | ( | )
Comment      // ({AnyChar} | {Whitespace})* {Eol}
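As a rough illustration of how these category definitions could drive a lexer, here is a minimal Python sketch. The names `TOKEN_SPEC`, `KEYWORDS`, and `tokenize` are hypothetical, and the code makes two assumptions beyond the table: `;` is accepted as a separator (the table does not list it, but `int count = 10;` needs it), and keywords are recognized by first matching an identifier and then looking it up in a set. Python tries alternatives left to right rather than picking the longest match, so longer operators and FloatLit are listed before their prefixes.

```python
import re

KEYWORDS = {"bool", "char", "else", "false", "float",
            "if", "int", "main", "true", "while"}

# Order matters: the first alternative that matches wins.
TOKEN_SPEC = [
    ("Comment",    r"//[ -~\t]*\n"),
    ("FloatLit",   r"[0-9]+\.[0-9]+"),
    ("IntegerLit", r"[0-9]+"),
    ("CharLit",    r"'[ -~]'"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Operator",   r"\|\||&&|==|!=|<=|<|>|=|\+|-|\*|/|!|\[|\]"),
    ("Separator",  r"[:;.{}()]"),   # ';' is an assumption, see lead-in
    ("Whitespace", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (category, lexeme) pairs, discarding whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("Whitespace", "Comment"):
            continue
        if kind == "Identifier" and lexeme in KEYWORDS:
            kind = "Keyword"
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("int count = 10; // initial value\n"))
```

Looking keywords up after matching an identifier avoids misreading a name such as `intx` as the keyword `int` followed by `x`.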
https://docs.python.org/3/library/re.html
import re  # regex
re.split(...)  # Use regex argument to split a string into parts

Common string matching regex:
Symbol   Definition
\d       [0-9]
\D       [^0-9]
\w       [a-zA-Z0-9_]
\W       [^a-zA-Z0-9_]
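A short sketch of `re.split` and the shorthand classes in use; the sample strings and variable names are invented for illustration. One detail to keep in mind: for ordinary (Unicode) strings, `\d` and `\w` also match non-ASCII digits and letters, so the definitions in the table hold exactly only when the `re.ASCII` flag is passed.

```python
import re

# Split on runs of spaces and tabs.
parts = re.split(r"[ \t]+", "int  count\t= 10;")

# \d+ picks out maximal runs of digits.
numbers = re.findall(r"\d+", "a1b22c333", flags=re.ASCII)

# \w+ picks out maximal runs of letters, digits, and underscores.
words = re.findall(r"\w+", "count_1 = x2;", flags=re.ASCII)

# \D matches every non-digit, so removing those leaves only the digits.
digits_only = re.sub(r"\D", "", "9/25/17", flags=re.ASCII)

print(parts, numbers, words, digits_only)
```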
1. 0(0|1)+0
2. ((ε|0)1*)*
3. 0*10*10*10*
4. (00|11)*
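One way to get a feel for these four expressions is to list the short binary strings each one matches and look for the pattern. The sketch below does that in Python; `binary_strings` and `matches` are hypothetical helper names, and ε is written as an empty alternative, `(|0)`, because Python's `re` has no ε symbol.

```python
import re
from itertools import product

PATTERNS = {
    "0(0|1)+0":    r"0(0|1)+0",
    "((ε|0)1*)*":  r"((|0)1*)*",      # ε written as an empty alternative
    "0*10*10*10*": r"0*10*10*10*",
    "(00|11)*":    r"(00|11)*",
}

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def matches(pattern, max_len=4):
    """All binary strings up to max_len that the pattern matches in full."""
    return [s for s in binary_strings(max_len) if re.fullmatch(pattern, s)]

for label, pattern in PATTERNS.items():
    print(label, matches(pattern))
```

Raising `max_len` gives more evidence, at the cost of an exponentially longer list.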
1. All strings of lowercase letters, where letters appear in ascending order.
2. All strings of letters containing vowels in order.
Coming Thursday, Sept 28
Start of class (30 min)
Up to today's class
BEHIND THE SCENE OF REGULAR EXPRESSIONS
Σ: Input alphabet + unique end symbol ($)
Set of states
State transition function
There is at most one outgoing arc from any state for any particular input symbol
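A deterministic automaton of this kind can be simulated with a dictionary lookup per input symbol. The sketch below is an assumption-laden illustration: the state names `even` and `odd`, the `DFA` dictionary layout, and `dfa_accepts` are all invented, the machine shown accepts binary strings that end in 1, and the end of the Python string stands in for the end symbol `$`. Because there is at most one arc per state and symbol, a missing dictionary entry simply means the input is rejected.

```python
# Hypothetical DFA: accepts binary strings ending in 1 (odd binary numbers).
DFA = {
    "start": "even",
    "accept": {"odd"},
    "delta": {
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "even",
        ("odd", "1"): "odd",
    },
}

def dfa_accepts(dfa, text):
    """Follow one arc per symbol; reject if no arc exists for the symbol."""
    state = dfa["start"]
    for symbol in text:
        key = (state, symbol)
        if key not in dfa["delta"]:
            return False
        state = dfa["delta"][key]
    return state in dfa["accept"]

print(dfa_accepts(DFA, "101"), dfa_accepts(DFA, "110"))
```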
Allows multiple outgoing arcs from a state for the same input symbol
Allows transitions on empty string (ε)

1. DFA → regular expression
2. Regular expression → NFA
3. NFA → DFA

Language designer → implementation (parsing)
DFA → Regex → NFA → DFA
All 3 are equivalent!
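To make the two NFA features concrete, the following Python sketch simulates an NFA by tracking the set of states it could be in. Everything named here is hypothetical: the three-state machine (`s`, `a`, `b`) is one possible NFA for binary strings ending in 1, with an ε arc from `s` to `a` and two arcs out of `a` on the symbol 1, and the empty string `""` is used as the dictionary key for ε.

```python
EPS = ""  # key used for ε-transitions in this sketch

NFA = {
    "start": "s",
    "accept": {"b"},
    "delta": {
        ("s", EPS): {"a"},        # transition on the empty string
        ("a", "0"): {"a"},
        ("a", "1"): {"a", "b"},   # two outgoing arcs on the same symbol
    },
}

def eps_closure(nfa, states):
    """All states reachable from `states` using only ε arcs."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in nfa["delta"].get((q, EPS), ()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def nfa_accepts(nfa, text):
    """Track every state the NFA could be in after each symbol."""
    current = eps_closure(nfa, {nfa["start"]})
    for symbol in text:
        moved = set()
        for q in current:
            moved |= nfa["delta"].get((q, symbol), set())
        current = eps_closure(nfa, moved)
    return bool(current & nfa["accept"])

print(nfa_accepts(NFA, "0101"), nfa_accepts(NFA, "10"))
```

Tracking a set of states instead of guessing a single path is the same idea that underlies the NFA → DFA conversion.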
Odd binary number
Regex → NFA → DFA → Regex
(0|1)*1 → ? → ? → ?
Idea: (More details on next slide)
Idea: tabulate all possible sets of NFA states that you can reach on 0 and 1 transitions.
State elimination algorithm (More details soon)
Scott, Programming Languages (2000)
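The tabulation idea for NFA → DFA can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the construction from the slides or from Scott: the NFA below is a hypothetical three-state machine for (0|1)*1, and `closure` and `subset_construction` are invented names. Each DFA state is a frozenset of NFA states, and the table records where each set goes on 0 and on 1.

```python
EPS = ""  # key used for ε-transitions in this sketch

# Hypothetical NFA for (0|1)*1: s --ε--> a, a loops on 0 and 1, a --1--> b.
NFA_DELTA = {("s", EPS): {"a"}, ("a", "0"): {"a"}, ("a", "1"): {"a", "b"}}
NFA_START, NFA_ACCEPT = "s", {"b"}

def closure(states):
    """ε-closure of a set of NFA states, as a hashable frozenset."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in NFA_DELTA.get((q, EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return frozenset(seen)

def subset_construction(alphabet="01"):
    """Tabulate every reachable set of NFA states and its 0/1 successors."""
    start = closure({NFA_START})
    table, worklist = {}, [start]
    while worklist:
        current = worklist.pop()
        if current in table:
            continue
        table[current] = {}
        for symbol in alphabet:
            reached = {r for q in current for r in NFA_DELTA.get((q, symbol), ())}
            target = closure(reached)
            table[current][symbol] = target
            worklist.append(target)
    accepting = {s for s in table if s & NFA_ACCEPT}
    return start, table, accepting

start, table, accepting = subset_construction()
for state, row in table.items():
    print(sorted(state), {sym: sorted(t) for sym, t in row.items()})
```

A set of NFA states is accepting in the resulting DFA whenever it contains at least one accepting NFA state.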
How to preserve all paths after deleting a node? For each node to be deleted:
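One common textbook way to preserve paths, which may differ in detail from the version presented in class, is: for every predecessor p and successor q of the deleted node k, add a bypass arc labelled (p → k)(k → k)*(k → q), merging it by union with any existing p → q arc. The Python sketch below assumes that variant, adds a fresh start and final state first, and uses invented names throughout (`eliminate_states`, `union`, `_S`, `_F`, and the `even`/`odd` DFA for odd binary numbers). Arc labels are regex strings and must be single symbols or already parenthesized.

```python
import re
from itertools import product

def union(r, s):
    """Regex for 'r or s'; None stands for 'no arc yet'."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def eliminate_states(states, start, accepts, arcs):
    """arcs maps (source, target) to the regex labelling that arc."""
    R = dict(arcs)
    new_start, new_final = "_S", "_F"   # hypothetical fresh state names
    R[(new_start, start)] = ""          # the empty regex plays the role of ε
    for a in accepts:
        R[(a, new_final)] = ""
    for k in states:
        loop = R.get((k, k))
        star = f"(?:{loop})*" if loop is not None else ""
        preds = [p for (p, q) in R if q == k and p != k]
        succs = [q for (p, q) in R if p == k and q != k]
        for p in preds:
            for q in succs:
                bypass = R[(p, k)] + star + R[(k, q)]
                R[(p, q)] = union(R.get((p, q)), bypass)
        # Delete k together with every arc that touches it.
        R = {(p, q): r for (p, q), r in R.items() if k not in (p, q)}
    return R.get((new_start, new_final))

ODD_DFA_ARCS = {
    ("even", "even"): "0", ("even", "odd"): "1",
    ("odd", "odd"): "1", ("odd", "even"): "0",
}
regex = eliminate_states(["even", "odd"], "even", ["odd"], ODD_DFA_ARCS)

# Sanity check: compare with (0|1)*1 on every binary string up to length 6.
agree = all(
    bool(re.fullmatch(regex, "".join(bits))) == bool(re.fullmatch(r"(0|1)*1", "".join(bits)))
    for n in range(7) for bits in product("01", repeat=n)
)
print(regex, agree)
```

The resulting expression is usually longer than a hand-written one, and it depends on the order in which states are eliminated; the brute-force comparison only checks agreement on short strings, not full equivalence.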
Do the following for binary numbers with an even number of 0s: Regular expression → NFA → DFA → Regular expression.