SLIDE 1

Introduction to Lexical Analysis

SLIDE 2

Compiler Design 1 (2011) 2

Outline

  • Informal sketch of lexical analysis

– Identifies tokens in input string

  • Issues in lexical analysis

– Lookahead
– Ambiguities

  • Specifying lexers

– Regular expressions
– Examples of regular expressions

SLIDE 3

Lexical Analysis

  • What do we want to do? Example:

if (i == j) then
    z = 0;
else
    z = 1;

  • The input is just a string of characters:

\tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;

  • Goal: Partition input string into substrings

– Where the substrings are tokens
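As a rough sketch of this partitioning step (not part of the original slides), the Python snippet below splits the example string into consecutive substrings using the standard `re` module. The pattern and the name `partition` are illustrative assumptions; at this point the pieces are not yet classified.

```python
import re

# Hypothetical pattern: a word (letters/digits starting with a letter),
# a run of digits, a run of whitespace, the two-character operator ==,
# or any other single character. "==" is listed before "." so that it
# is kept together as one piece.
PIECE = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|\s+|==|.", re.DOTALL)

def partition(text):
    """Split text into consecutive substrings that cover it completely."""
    return PIECE.findall(text)

source = "\tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;"
pieces = partition(source)
print(pieces)
```

Joining the pieces back together reproduces the input exactly, which is what "partition" means here.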

SLIDE 4

What’s a Token?

  • A syntactic category

– In English:

noun, verb, adjective, …

– In a programming language:

Identifier, Integer, Keyword, Whitespace, …

SLIDE 5

Tokens

  • Tokens correspond to sets of strings.
  • Identifier: strings of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: “else” or “if” or “begin” or …
  • Whitespace: a non-empty sequence of blanks, newlines, and tabs
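One way to read "tokens correspond to sets of strings" is as a membership test: a string belongs to a token class exactly when it matches that class's description in full. The sketch below assumes ASCII letters and only the three keywords named on the slide; `TOKEN_CLASSES` and `classes_of` are hypothetical names.

```python
import re

# Each token class is described by a pattern. The keyword list is
# deliberately limited to the three keywords spelled out in the text.
TOKEN_CLASSES = {
    "Keyword": r"else|if|begin",
    "Identifier": r"[A-Za-z][A-Za-z0-9]*",
    "Integer": r"[0-9]+",
    "Whitespace": r"[ \n\t]+",
}

def classes_of(s):
    """Return the names of all token classes whose string set contains s."""
    return [name for name, pattern in TOKEN_CLASSES.items()
            if re.fullmatch(pattern, s)]

for s in ["x1", "42", "if", "1x", " \n"]:
    print(repr(s), classes_of(s))
```

Note that "if" lands in two classes under these definitions, so the sets alone do not settle which token it is.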

SLIDE 6

What are Tokens used for?

  • Classify program substrings according to role
  • Output of lexical analysis is a stream of tokens . . .

  • . . . which is input to the parser
  • Parser relies on token distinctions

– An identifier is treated differently than a keyword

SLIDE 7

Designing a Lexical Analyzer: Step 1

  • Define a finite set of tokens

– Tokens describe all items of interest
– Choice of tokens depends on language, design of parser

  • Recall

\tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;

  • Useful tokens for this expression:

Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;

SLIDE 8

Designing a Lexical Analyzer: Step 2

  • Describe which strings belong to each token
  • Recall:

– Identifier: strings of letters or digits, starting with a letter
– Integer: a non-empty string of digits
– Keyword: “else” or “if” or “begin” or …
– Whitespace: a non-empty sequence of blanks, newlines, and tabs

SLIDE 9

Lexical Analyzer: Implementation

An implementation must do two things:

1. Recognize substrings corresponding to tokens
2. Return the value or lexeme of the token

– The lexeme is the substring

SLIDE 10

Example

  • Recall:

\tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;

  • Token-lexeme groupings:

– Identifier: i, j, z
– Keyword: if, then, else
– Relation: ==
– Integer: 0, 1
– (, ), =, ;: single character of the same name
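The groupings above can be reproduced with a small lexer. This is a minimal sketch under stated assumptions, not the course's implementation: the names `TOKEN_SPEC`, `MASTER` and `lex` are hypothetical, keywords are limited to the three that occur in the example, and ordering the alternatives is just one simple way to give keywords priority over identifiers.

```python
import re
from collections import defaultdict

# Hypothetical token specification. Order matters because Python's `re`
# takes the first alternative that matches at a given position.
TOKEN_SPEC = [
    ("Keyword", r"(?:if|then|else)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer", r"[0-9]+"),
    ("Relation", r"=="),
    ("Whitespace", r"[ \n\t]+"),
    ("Punct", r"[()=;]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Yield (token, lexeme) pairs left to right; fail on unknown characters."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        # (, ), = and ; are tokens named after their single character.
        yield (lexeme if kind == "Punct" else kind, lexeme)
        pos = m.end()

source = "\tif (i == j)\nthen\n\tz = 0;\n\telse\n\t\tz = 1;"
groups = defaultdict(list)
for token, lexeme in lex(source):
    # Whitespace is recognized but left out of the summary.
    if token != "Whitespace" and lexeme not in groups[token]:
        groups[token].append(lexeme)
print(dict(groups))
```

Each pair carries both pieces of information an implementation must return: the token class and the lexeme.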

SLIDE 11

Why do Lexical Analysis?

  • Dramatically simplify parsing

– The lexer usually discards “uninteresting” tokens that don’t contribute to parsing

  • E.g. Whitespace, Comments

– Converts data early

  • Separate out logic to read source files

– Potentially an issue on multiple platforms
– Can optimize reading code independently of parser

SLIDE 12

True Crimes of Lexical Analysis

  • Is it as easy as it sounds?
  • Not quite!
  • Look at some programming language history . . .

SLIDE 13

Lexical Analysis in FORTRAN

  • FORTRAN rule: Whitespace is insignificant
  • E.g., VAR1 is the same as VA R1
  • Footnote: FORTRAN whitespace rule was motivated by inaccuracy of punch card operators

SLIDE 14

A terrible design! Example

  • Consider

– DO 5 I = 1,25
– DO 5 I = 1.25

  • The first is DO 5 I = 1 , 25 (a DO statement)
  • The second is DO5I = 1.25 (an assignment to the variable DO5I)
  • Reading left-to-right, cannot tell if DO5I is a variable or DO stmt. until after “,” is reached
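The lookahead problem can be made concrete with a deliberately simplified sketch. The function `classify_fortran` is a hypothetical helper, not real FORTRAN tooling: it removes blanks (they are insignificant) and then has to look past the "=" for a "," before it can decide what the leading DO means. Parentheses, strings and other details of real FORTRAN are ignored.

```python
def classify_fortran(stmt):
    """Guess whether a statement that starts with DO is a loop or an assignment.

    Blanks carry no meaning, so they are removed first. The decision then
    requires lookahead: only a ',' after the '=' reveals a DO statement.
    """
    s = stmt.replace(" ", "")
    if not s.startswith("DO") or "=" not in s:
        return "other"
    right_hand_side = s.split("=", 1)[1]
    return "DO loop" if "," in right_hand_side else "assignment"

print(classify_fortran("DO 5 I = 1,25"))
print(classify_fortran("DO 5 I = 1.25"))
```

The two inputs differ in a single character near the end, yet the first two letters are tokenized differently in each case.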

SLIDE 15

Lexical Analysis in FORTRAN. Lookahead.

Two important points:

1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
2. “Lookahead” may be required to decide where one token ends and the next token begins

– Even our simple example has lookahead issues

i vs. if
= vs. ==
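Both lookahead cases from the simple example fit in a few lines of a hand-written scanner. The helpers `scan_equals` and `scan_word` are hypothetical names for this sketch; `str.isalnum` is used as a stand-in for "letter or digit" and also accepts non-ASCII letters.

```python
def scan_equals(text, pos):
    """Return the operator starting at text[pos], which must be '='.

    One character of lookahead decides between '=' and '=='.
    """
    if text[pos] != "=":
        raise ValueError("scan_equals must start on '='")
    if pos + 1 < len(text) and text[pos + 1] == "=":
        return "=="
    return "="

def scan_word(text, pos):
    """Return the longest run of letters/digits starting at text[pos].

    The scanner cannot stop after 'i': it has to look at the next
    character to learn whether the word continues (as in 'if').
    """
    end = pos
    while end < len(text) and text[end].isalnum():
        end += 1
    return text[pos:end]

print(scan_equals("i == j", 2), scan_equals("z = 0", 2))
print(scan_word("if (i == j)", 0), scan_word("i == j", 0))
```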

SLIDE 16

Another Great Moment in Scanning

  • PL/1: Keywords can be used as identifiers:

IF THEN THEN THEN = ELSE; ELSE ELSE = IF

can be difficult to determine how to label lexemes

SLIDE 17

More Modern True Crimes in Scanning

  • Nested template declarations in C++

vector<vector<int>> myVector
vector < vector < int >> myVector
(vector < (vector < (int >> myVector)))
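A longest-match tokenizer shows where the trouble comes from. In the sketch below (the pattern `CPP_TOKEN` and the function `tokenize` are assumptions covering only the handful of symbols in this example), ">>" is tried before ">", so two adjacent closing brackets come out as a single shift operator unless a space separates them.

```python
import re

# Longest-match sketch: '>>' is listed before '>' so that two adjacent
# closing brackets are read as one token.
CPP_TOKEN = re.compile(r"\s*(>>|[<>()]|[A-Za-z_][A-Za-z0-9_]*)")

def tokenize(text):
    """Return the list of tokens in text, skipping whitespace."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = CPP_TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(tokenize("vector<vector<int>> myVector"))
print(tokenize("vector<vector<int> > myVector"))
```

Writing "> >" with a space was the usual workaround before C++11 changed the rule for closing template brackets.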

SLIDE 18

Review

  • The goal of lexical analysis is to

– Partition the input string into lexemes (the smallest program units that are individually meaningful)
– Identify the token of each lexeme

  • Left-to-right scan ⇒ lookahead sometimes required

SLIDE 19

Next

  • We still need

– A way to describe the lexemes of each token
– A way to resolve ambiguities

  • Is if two variables i and f?
  • Is == two equal signs = =?

SLIDE 20

Regular Languages

  • There are several formalisms for specifying tokens
  • Regular languages are the most popular

– Simple and useful theory
– Easy to understand
– Efficient implementations

SLIDE 21

Languages

  • Def. Let Σ be a set of characters. A language Λ over Σ is a set of strings of characters drawn from Σ (Σ is called the alphabet of Λ)
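To make the definition tangible, the sketch below builds a finite slice of all strings over a two-character alphabet and then picks out one language as a subset. The alphabet, the length bound and the "no bb" rule are arbitrary assumptions chosen for illustration.

```python
from itertools import product

def strings_over(alphabet, max_len):
    """All strings over `alphabet` of length 0..max_len."""
    return {"".join(chars)
            for n in range(max_len + 1)
            for chars in product(sorted(alphabet), repeat=n)}

sigma = {"a", "b"}
universe = strings_over(sigma, 3)

# Hypothetical language over sigma: strings that do not contain "bb".
language = {s for s in universe if "bb" not in s}

print(len(universe), len(language))
print(sorted(language, key=lambda s: (len(s), s)))
```

A language is simply a set: some strings over the alphabet are in it and the rest are not.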

SLIDE 22

Examples of Languages

  • Alphabet = English characters
  • Language = English sentences
  • Not every string of English characters is an English sentence

  • Alphabet = ASCII
  • Language = C programs
  • Note: ASCII character set is different from English character set

SLIDE 23

Notation

  • Languages are sets of strings
  • Need some notation for specifying which sets of strings we want our language to contain
  • The standard notation for regular languages is regular expressions

SLIDE 24

Atomic Regular Expressions

  • Single character: 'c' = { "c" }
  • Epsilon: ε = { "" }

SLIDE 25

Compound Regular Expressions

  • Union: A + B = { s | s ∈ A or s ∈ B }
  • Concatenation: AB = { ab | a ∈ A and b ∈ B }
  • Iteration: A* = ⋃_{i ≥ 0} A^i, where A^i = A ... A (i times)
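Union, concatenation and iteration translate directly into operations on Python sets of strings. This is a sketch for finite languages only: A* is infinite whenever A contains a non-empty string, so `star_up_to` (a hypothetical helper) stops at a chosen bound n.

```python
def union(A, B):
    """A + B = { s | s in A or s in B }"""
    return A | B

def concat(A, B):
    """AB = { ab | a in A and b in B }"""
    return {a + b for a in A for b in B}

def power(A, i):
    """A^i = A A ... A (i times); A^0 contains only the empty string."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result

def star_up_to(A, n):
    """Finite approximation of A*: the union of A^0, A^1, ..., A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(A, i)
    return result

A, B = {"a", "ab"}, {"b", ""}
print(sorted(union(A, B)))
print(sorted(concat(A, B)))
print(sorted(star_up_to({"a"}, 3)))
```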

SLIDE 26

Regular Expressions

  • Def. The regular expressions over Σ are the smallest set of expressions including

ε
'c' where c ∈ Σ
A + B where A, B are rexp over Σ
AB where A, B are rexp over Σ
A* where A is a rexp over Σ

SLIDE 27

Syntax vs. Semantics

  • To be careful, we should distinguish syntax and semantics (meaning) of regular expressions

L(ε) = { "" }
L('c') = { "c" }
L(A + B) = L(A) ∪ L(B)
L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
L(A*) = ⋃_{i ≥ 0} L(A^i)

SLIDE 28

Example: Keyword

Keyword: “else” or “if” or “begin” or …

'else' + 'if' + 'begin' + ...

Note: 'else' abbreviates 'e''l''s''e'

SLIDE 29

Example: Integers

Integer: a non-empty string of digits

digit = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
integer = digit digit*

Abbreviation: A⁺ = AA*
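The integer definition and its abbreviation can be checked with Python's `re` module, whose `*` and `+` operators have the same meaning as iteration and non-empty iteration here (union is written `|` or as a character class instead of `+`). The sample strings are arbitrary.

```python
import re

digit = r"[0-9]"                 # '0' + '1' + ... + '9'
integer = f"{digit}{digit}*"     # digit digit*
integer_plus = f"{digit}+"       # the abbreviated form, digit+

samples = ["0", "2011", "", "12a", "007"]
for s in samples:
    long_form = re.fullmatch(integer, s) is not None
    short_form = re.fullmatch(integer_plus, s) is not None
    # The two forms should agree on every string.
    assert long_form == short_form
    print(repr(s), long_form)
```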

SLIDE 30

Example: Identifier

Identifier: strings of letters or digits, starting with a letter

letter = 'A' + ... + 'Z' + 'a' + ... + 'z'
identifier = letter (letter + digit)*

Is (letter* + digit*) the same?
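The question on this slide can be explored by testing both expressions on a few strings. The sketch assumes ASCII letters; `in_lang` is a hypothetical helper, and union is written `|` in Python's regex syntax.

```python
import re

letter, digit = r"[A-Za-z]", r"[0-9]"
identifier = f"{letter}(?:{letter}|{digit})*"   # letter (letter + digit)*
other = f"(?:{letter}*|{digit}*)"                # letter* + digit*

def in_lang(pattern, s):
    """True when the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

for s in ["x1", "abc", "123", ""]:
    print(repr(s), in_lang(identifier, s), in_lang(other, s))
```

The strings "x1", "123" and the empty string are each accepted by only one of the two expressions, so they do not describe the same language.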

SLIDE 31

Example: Whitespace

Whitespace: a non-empty sequence of blanks, newlines, and tabs

(' ' + '\n' + '\t')⁺

SLIDE 32

Example 1: Phone Numbers

  • Regular expressions are all around you!
  • Consider +46(0)18-471-1056

Σ = digits ∪ {+, -, (, )}
country = digit digit
city = digit digit
univ = digit digit digit
extension = digit digit digit digit
phone_num = '+' country '(' '0' ')' city '-' univ '-' extension
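The phone-number expression carries over to Python's `re` syntax almost literally. The variable names mirror the slide; characters that are special in `re` ("+", "(" and ")") are escaped. This matches only numbers of exactly this shape, as the slide's definition does.

```python
import re

digit = r"[0-9]"
country = digit * 2      # digit digit
city = digit * 2         # digit digit
univ = digit * 3         # digit digit digit
extension = digit * 4    # digit digit digit digit
phone_num = rf"\+{country}\(0\){city}-{univ}-{extension}"

print(re.fullmatch(phone_num, "+46(0)18-471-1056") is not None)
print(re.fullmatch(phone_num, "+46(0)18-471-105") is not None)
```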

SLIDE 33

Example 2: Email Addresses

  • Consider kostis@it.uu.se

Σ = letters ∪ { ., @ }
name = letter⁺
address = name '@' name '.' name '.' name
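The same exercise works for the email example. The sketch assumes ASCII letters and follows the slide's deliberately narrow shape: exactly three dot-separated names after the "@". It is not a general email validator.

```python
import re

letter = r"[A-Za-z]"
name = f"{letter}+"                              # letter+
address = rf"{name}@{name}\.{name}\.{name}"      # name '@' name '.' name '.' name

print(re.fullmatch(address, "kostis@it.uu.se") is not None)
print(re.fullmatch(address, "kostis@uu.se") is not None)
```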

SLIDE 34

Summary

  • Regular expressions describe many useful languages
  • Regular languages are a language specification

– We still need an implementation

  • Next time: Given a string s and a regular expression R, is s ∈ L(R)?