
9/25/17

CSCI 2320

Lexical Analysis

Ref: Ch 3 + Handout (Nishimura)

MOHAMMAD T. IRFAN

Plan

  • Chomsky Hierarchy
  • Lexical Analysis


Chomsky Hierarchy

  • Regular grammar (bottom of hierarchy: faster computation)
  • Context-free grammar (CFG/BNF)
  • Context-sensitive grammar
  • Unrestricted grammar (top of hierarchy: more expressive power)

Chomsky Hierarchy

A, B ∈ N; ω ∈ T*; α, β ∈ (T ∪ N)*

  • Regular grammar: A → ω B | ω
  • Context-free grammar (CFG/BNF): A → β
  • Context-sensitive grammar: α → β, where |α| <= |β|
  • Unrestricted grammar: α → β


Regular grammar: pros and cons

Pros

  • Can do the first layer of abstraction in PL syntax
      • Integer → 0 Integer | 1 Integer | ... | 9 Integer | 0 | 1 | ... | 9
  • Note: the following is not a regular grammar (why?)
      • Integer → Integer Digit
      • Digit → 0 | 1 | ... | 9

Cons

  • Cannot check balanced parentheses, braces, etc.
  • Cannot represent {a^n b^n | n >= 1}

A, B ∈ N; ω ∈ T*; A → ω B | ω
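A minimal Python sketch of the Integer grammar above, assuming each derivation step consumes exactly one digit (A → ω B with |ω| = 1). The function name derives_integer is hypothetical. The comparison with the regular expression [0-9]+ is only meant to illustrate that the grammar and the expression accept the same strings.

```python
import re

DIGITS = "0123456789"

def derives_integer(text):
    """Follow Integer -> d Integer | d, where d is any single digit."""
    if text == "" or text[0] not in DIGITS:
        return False  # no production can produce this first symbol
    # Either stop with Integer -> d, or continue with Integer -> d Integer.
    return len(text) == 1 or derives_integer(text[1:])

def matches_regex(text):
    """Same language written as a regular expression."""
    return re.fullmatch(r"[0-9]+", text) is not None

for sample in ["2320", "7", "", "12a"]:
    assert derives_integer(sample) == matches_regex(sample)
```

Each recursive call needs to remember only the current nonterminal, which is why a grammar of this shape can be handled without a stack.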

CFG/BNF/EBNF: pros and cons

Pros

  • Can do all layers of abstractions in PL syntax
      • Assignment → Identifier = Expression;

Cons

  • Can't do lots of semantic-type things
      • Variable declared before use?
      • Operand and operator compatible?
  • Can't represent languages like {ww | w ∈ T+}
      • Can do equality checking (a^n b^n), but can't detect repetition

A ∈ N; β ∈ (T ∪ N)*; A → β
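For contrast with regular grammars, here is a sketch of a recognizer for {a^n b^n | n >= 1} that follows the context-free productions S → a S b | a b. The start symbol S and the function name derives_anbn are assumptions made for illustration; they do not come from the slides.

```python
def derives_anbn(text):
    """Follow S -> a S b | a b by peeling one 'a' and one 'b' per step."""
    if text == "ab":
        return True  # base production S -> a b
    # Recursive production S -> a S b: match the outer pair, recurse inside.
    return (
        len(text) >= 4
        and text[0] == "a"
        and text[-1] == "b"
        and derives_anbn(text[1:-1])
    )

print([s for s in ["ab", "aabb", "aab", "abab", ""] if derives_anbn(s)])
```

The nesting of the recursion plays the role of the counting that a regular grammar cannot do.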


Context-sensitive: pros and cons

Pros

  • Can represent languages like {a^n b^n c^n | n >= 1}

Cons

  • Deciding whether a given sentence ω can be derived from a given context-sensitive grammar is decidable (since |α| <= |β|), but PSPACE-complete in general
  • No efficient general parsing algorithm is known
  • Not a practical basis for a compiler

A, B ∈ N; ω ∈ T*; α, β ∈ (T ∪ N)*; α → β, where |α| <= |β|

Unrestricted: pros and cons

Pros

  • Equivalent to Turing machine
      • That is, can compute any computable function

Cons

  • Can we do parsing?

A, B ∈ N; ω ∈ T*; α, β ∈ (T ∪ N)*; α → β


Plan

  • Chomsky Hierarchy
  • Lexical Analysis

Lexical Analysis

Input: Lexemes (typed ASCII characters)

Output: Tokens (sequence of characters having a collective meaning)

Discard: whitespace, comments

Lexemes: int count = 10;

Tokens:

  • int (keyword)
  • count (identifier)
  • = (operator)
  • 10 (intLiteral)
  • ; (separator)


Why do lexical analysis separately?

Simpler, faster grammar for parsing

  • Next: how?

75% of time spent in lexical analysis

Def. Regular Expressions

RegExpr    Meaning
x          a character x
\x         an escaped character, e.g., \n
{ Z }      a reference to a reg expr Z
M | N      M or N, where M and N are reg expr
M N        M followed by N
M*         zero or more occurrences of M
M+         one or more occurrences of M
M?         zero or one occurrence of M


Def. Regular Expressions

RegExpr    Meaning
[aeiou]    the set of vowels
[0-9]      the set of digits
.          any single character

Special symbols: ^ means not (e.g., [^aeiouAEIOU] is a non-vowel)
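The notation in the two tables can be tried out with Python's re module. This is a small sketch, not part of the original slides: the helper name matches and the sample strings are made up for illustration. Python has no { Z } reference syntax, so an f-string is used here as a stand-in for it. Note also that in Python "." does not match a newline unless the re.DOTALL flag is given.

```python
import re

def matches(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
EXAMPLES = [
    (r"a|b", "b", True),             # M | N
    (r"ab", "ab", True),             # M N
    (r"a*", "", True),               # zero or more
    (r"a+", "", False),              # one or more
    (r"a?", "aa", False),            # zero or one
    (r"[aeiou]", "e", True),         # set of vowels
    (r"[0-9]+", "2320", True),       # set of digits, repeated
    (r"[^aeiouAEIOU]", "e", False),  # ^ inside brackets means "not"
    (r"\n", "\n", True),             # escaped character
    (r".", "\n", False),             # "." skips newline by default in Python
]

# Stand-in for the { Z } reference notation: reuse a named sub-expression.
DIGIT = r"[0-9]"
INTEGER = rf"{DIGIT}+"

for pattern, text, expected in EXAMPLES:
    assert matches(pattern, text) == expected, (pattern, text)
assert matches(INTEGER, "17") and not matches(INTEGER, "1a")
```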

CLite regular definition

Category      Definition    Notes
AnyChar       [ -~]         from space (ASCII 32) to tilde (ASCII 126)
Letter        [a-zA-Z]
Digit         [0-9]
Whitespace    [ \t]         space and tab
Eol           \n


Category      Definition
Keyword       bool | char | else | false | float | if | int | main | true | while
Identifier    {Letter}({Letter} | {Digit})*
IntegerLit    {Digit}+
FloatLit      {Digit}+\.{Digit}+
CharLit       '{AnyChar}'
Operator      = | || | && | == | != | < | <= | > | + | - | * | / | ! | [ | ]
Separator     ; | , | { | } | ( | )
Comment       // ({AnyChar} | {Whitespace})* {Eol}
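The categories above can be turned into a small lexer with Python's re module. The following is a sketch under stated assumptions, not the course's reference implementation: the names TOKEN_SPEC, MASTER, and tokenize are hypothetical, the separator set is assumed to be ; , { } ( ), only the operators listed in the table are recognized, and keywords are detected by looking up identifiers in a set rather than by a separate pattern. Python tries alternatives left to right instead of picking the longest match, so longer patterns (FloatLit, ==, <=) are listed before their prefixes.

```python
import re

KEYWORDS = {"bool", "char", "else", "false", "float",
            "if", "int", "main", "true", "while"}

# Order matters: earlier entries win when several could match.
TOKEN_SPEC = [
    ("Comment",    r"//[ -~\t]*\n"),
    ("FloatLit",   r"[0-9]+\.[0-9]+"),
    ("IntegerLit", r"[0-9]+"),
    ("CharLit",    r"'[ -~]'"),
    ("Identifier", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("Operator",   r"\|\||&&|==|!=|<=|<|>|=|\+|-|\*|/|!|\[|\]"),
    ("Separator",  r"[;,{}()]"),
    ("Skip",       r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (category, lexeme) pairs, dropping whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"illegal character {source[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind in ("Skip", "Comment"):
            continue  # discarded, as in the lexical analysis slide
        if kind == "Identifier" and lexeme in KEYWORDS:
            kind = "Keyword"
        tokens.append((kind, lexeme))
    return tokens

for token in tokenize("int count = 10; // a comment\n"):
    print(token)
```

Looking keywords up after matching an identifier avoids splitting a name such as integer into the keyword int followed by eger.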


Implementation Using Python

Python's re package

https://docs.python.org/3/library/re.html

import re  # regex
re.split(...)  # use a regex argument to split a string into parts

Common string matching regex:

Symbol    Definition
\d        [0-9]
\D        [^0-9]
\w        [a-zA-Z0-9_]
\W        [^a-zA-Z0-9_]
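A short sketch of these calls, using made-up sample strings. One caveat that the table glosses over: for Python 3 str patterns, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is passed, so the bracket forms in the table describe the ASCII behavior.

```python
import re

line = "int count = 10;"

# Split on runs of whitespace.
words = re.split(r"\s+", line)

# A capturing group keeps the matched delimiter in the result; whitespace
# matches contribute None and adjacent delimiters contribute "", so both
# are filtered out.
pieces = [p for p in re.split(r"\s+|([=;])", line) if p]

digit_runs = re.findall(r"\d+", "x1 = 42 + y7")   # runs of digits
non_word = re.findall(r"\W", "a-b c")             # non-word characters
digits_only = re.sub(r"\D", "", "9/25/17")        # delete every non-digit

print(words, pieces, digit_runs, non_word, digits_only)

# \d matches the Arabic-Indic digit three unless re.ASCII is given.
assert re.fullmatch(r"\d", "\u0663") is not None
assert re.fullmatch(r"\d", "\u0663", re.ASCII) is None
```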


Describe the language:

1. 0(0|1)+0
2. ((ε|0)1*)*
3. 0*10*10*10*
4. (00|11)*

Write regular expression for:

1. All strings of lowercase letters, where letters appear in ascending order.
2. All strings of letters containing vowels in order.
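One way to explore the first set of exercises is to list every short string that an expression accepts and look for the pattern. The helper below is a hypothetical sketch (language_sample and PATTERNS are not from the slides). It assumes the binary alphabet {0, 1} and writes ε as an empty alternative, which Python's re accepts.

```python
import re
from itertools import product

def language_sample(pattern, alphabet="01", max_len=4):
    """All strings over alphabet, up to max_len symbols, that fully match pattern."""
    compiled = re.compile(pattern)
    sample = []
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            candidate = "".join(symbols)
            if compiled.fullmatch(candidate):
                sample.append(candidate)
    return sample

PATTERNS = {
    1: r"0(0|1)+0",
    2: r"((|0)1*)*",     # (ε|0) written as an empty alternative
    3: r"0*10*10*10*",
    4: r"(00|11)*",
}

for number, pattern in PATTERNS.items():
    print(number, language_sample(pattern, max_len=3))
```

Enumeration only suggests a description of the language; it does not prove one.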


Exam 1

  • Coming Thursday, Sept 28
  • Start of class (30 min)
  • Up to today's class

Finite State Automata (FSA)

BEHIND THE SCENE OF REGULAR EXPRESSIONS


Finite State Automata (FSA)

Σ: Input alphabet + unique end symbol ($)

Set of states

  • Represented by nodes
  • Unique start state
  • One or more final states

State transition function

  • Labelled (using alphabet) arcs in graph

Deterministic F.A. (DFA)

There is at most one outgoing arc from any state for any particular input symbol

  • Easy to parse: does x belong to L_G?
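A DFA can be simulated with a lookup table keyed by (state, input symbol). The sketch below is an illustration under assumptions rather than material from the slides: it encodes a DFA for identifiers of the form Letter (Letter | Digit)*, groups characters into hypothetical classes "letter", "digit", and "other", and treats a missing table entry as rejection instead of modelling the end symbol $ explicitly.

```python
def char_class(ch):
    """Map an input character to the label used on the DFA arcs."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

START = "start"
FINAL_STATES = {"ident"}

# At most one target per (state, label) pair, which is what makes this a DFA.
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
}

def dfa_accepts(text):
    """Follow one arc per character; accept if the walk ends in a final state."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False  # no outgoing arc for this symbol
    return state in FINAL_STATES

print([s for s in ["count", "x1", "1x", "a_b", ""] if dfa_accepts(s)])
```

Because each step is a single dictionary lookup, the membership test runs in time proportional to the length of the input.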

Non-deterministic F.A. (NFA)

Allows multiple outgoing arcs from a state for the same input symbol

Allows transitions on empty string (ε)

  • Easy to express a language
  • But difficult to parse

Known algorithms

1. DFA → regular expression
2. Regular expression → NFA
3. NFA → DFA

Language designer → implementation (parsing)

DFA → Regex → NFA → DFA

All 3 are equivalent!


Example

Odd binary number

Regex → NFA → DFA → Regex

(0|1)*1 → ? → ? → ?

Regex → NFA idea (more details on next slide):

  • For |, symbols will be on the same arc
  • For concatenation, create new state
  • For *, use self-loop

NFA → DFA idea:

  • Start with the NFA start state and tabulate all possible sets of NFA states that you can reach on 0 and 1 transitions.
  • Each set of NFA states is a DFA state.
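The tabulation described above can be sketched in Python for the odd binary number example (0|1)*1. The NFA used here is an assumption chosen for illustration: state q0 loops on 0 and 1, and also moves to the accepting state q1 on 1. It has no ε transitions, so no ε-closure step is needed. The names subset_construction and dfa_accepts are hypothetical.

```python
from collections import deque

# NFA arcs: (state, symbol) -> set of possible next states.
NFA = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},
}
NFA_START, NFA_FINALS = "q0", {"q1"}

def subset_construction(nfa, start, finals, alphabet="01"):
    """Build a DFA whose states are sets of NFA states."""
    start_set = frozenset([start])
    dfa, seen, queue = {}, {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of everything reachable from any member of the current set.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_finals = {state for state in seen if state & finals}
    return dfa, start_set, dfa_finals

def dfa_accepts(dfa, start, finals, text):
    state = start
    for symbol in text:
        state = dfa[(state, symbol)]
    return state in finals

DFA, DFA_START, DFA_FINALS = subset_construction(NFA, NFA_START, NFA_FINALS)
for (state, symbol), target in DFA.items():
    print(sorted(state), symbol, "->", sorted(target))
```

For this NFA only two sets are reachable, {q0} and {q0, q1}, so the resulting DFA has two states.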

DFA → Regex: state elimination algorithm (more details soon)

  • Nishimura handout: + means |

Regex → NFA

Scott, Programming Languages (2000)


NFA/DFA → Regex: State elimination algorithm

How to preserve all paths after deleting a node?

For each node to be deleted:

  • Match each incoming arc with every outgoing arc
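The matching of incoming and outgoing arcs can be sketched in Python. This is a hedged illustration, not the handout's presentation: arcs are stored as a dictionary from (source, target) to a regex label, a fresh start state S and final state F are added with ε arcs (ε is the empty string), and all names (eliminate, to_regex, ODD_BINARY) are hypothetical. It assumes single-character alphabet symbols with no special regex meaning, at most one label per pair of states in the input, and that F is reachable from S. The example is the two-state DFA for odd binary numbers, where A is the start state and B the accepting state.

```python
import re
from itertools import product

def group(r):
    """Parenthesize r unless it is a single symbol or ε (the empty string)."""
    return r if len(r) <= 1 else f"(?:{r})"

def star(r):
    """Label contributed by a self-loop; None or ε means there is no loop."""
    return "" if not r else group(r) + "*"

def union(r, s):
    """Merge a new arc s with an existing parallel arc r (None if absent)."""
    return s if r is None else f"{r}|{s}"

def eliminate(arcs, victim):
    """Delete one state, pairing each incoming arc with every outgoing arc."""
    loop = star(arcs.get((victim, victim)))
    incoming = [(p, r) for (p, q), r in arcs.items() if q == victim and p != victim]
    outgoing = [(q, r) for (p, q), r in arcs.items() if p == victim and q != victim]
    kept = {pq: r for pq, r in arcs.items() if victim not in pq}
    for p, r_in in incoming:
        for q, r_out in outgoing:
            bypass = group(r_in) + loop + group(r_out)  # in, loop*, out
            kept[(p, q)] = union(kept.get((p, q)), bypass)
    return kept

def to_regex(arcs, order, start="S", final="F"):
    for state in order:
        arcs = eliminate(arcs, state)
    return arcs[(start, final)]

ODD_BINARY = {
    ("S", "A"): "", ("A", "A"): "0", ("A", "B"): "1",
    ("B", "A"): "0", ("B", "B"): "1", ("B", "F"): "",
}
ODD_REGEX = to_regex(ODD_BINARY, order=["A", "B"])
print(ODD_REGEX)

# Sanity check against the definition: odd binary numbers end in 1.
for length in range(7):
    for bits in product("01", repeat=length):
        text = "".join(bits)
        assert (re.fullmatch(ODD_REGEX, text) is not None) == text.endswith("1")
```

The resulting expression depends on the elimination order and is usually longer than a hand-written one, but it describes the same language.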

Class Participation 3

Do the following for binary numbers with an even number of 0s: Regular expression → NFA → DFA → Regular expression.