Lexical Analysis Problem: Want to break input into meaningful units

9/5/2012

CS 1622: Lexical Analysis
Jonathan Misurda
jmisurda@cs.pitt.edu

Lexical Analysis
Problem: Want to break input into meaningful units of information
Input: a string of characters
Output: a set of partitions of the input string (tokens)
Example:
    if(x==y) { z=1; } else { z=0; }
    “if(x==y){\n\tz=1;\n} else {\n\tz=0;\n}”

Tokens
Token: A sequence of characters that can be treated as a single logical entity.
Tokens in English:
• noun, verb, adjective, ...
Tokens in a programming language:
• identifier, integer, keyword, whitespace, ...
Tokens correspond to sets of strings:
• Identifier: strings of letters and digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else”, “if”, “while”, ...
• Whitespace: a non-empty sequence of blanks, newlines, and tabs

Why Tokens?
We need to classify substrings of our source according to their role.
Since a parser takes a list of tokens as input, the parser relies on token distinctions:
• For example, a keyword is treated differently than an identifier

Design of a Lexer
1. Define a finite set of tokens
• Describe all items of interest
• Depends on the language and the design of the parser
recall “if(x==y){\n\tz=1;\n} else {\n\tz=0;\n}”
• Keyword, identifier, integer, whitespace
• Should “==” be one token or two tokens?
2. Describe which string belongs to which token
Lexical analysis looks easy but there are problems

Lexer Implementation
An implementation must do two things:
1. Recognize substrings corresponding to tokens
2. Return the value or lexeme of the token
A token is a tuple (type, lexeme):
“if(x==y){\n\tz=1;\n} else {\n\tz=0;\n}”
• Identifier: (id, ‘x’), (id, ‘y’), (id, ‘z’)
• Keywords: if, else
• Integer: (int, 0), (int, 1)
• Single character of the same name: ( ) = ;
• The lexer usually discards “non-interesting” tokens that don’t contribute to parsing, e.g., whitespace, comments
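The two jobs of a lexer implementation (recognize substrings, return a (type, lexeme) tuple) can be sketched as a small hand-written scanner. This is only an illustrative sketch, not the course's implementation: the class name SimpleLexer, the Token class, the keyword subset, and the choice to treat “==” as one token are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: all names here are illustrative, not from the slides.
class SimpleLexer {
    // A token is a (type, lexeme) pair.
    static final class Token {
        final String type;
        final String lexeme;
        Token(String type, String lexeme) { this.type = type; this.lexeme = lexeme; }
        @Override public String toString() { return "(" + type + ", '" + lexeme + "')"; }
    }

    // Assumed keyword subset, taken from the token examples on the slide.
    private static final Set<String> KEYWORDS =
            new HashSet<String>(Arrays.asList("if", "else", "while"));

    static List<Token> tokenize(String src) {
        List<Token> tokens = new ArrayList<Token>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n') {
                i++; // whitespace is recognized but discarded
            } else if (Character.isLetter(c)) {
                // identifier or keyword: a letter followed by letters and digits
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                String word = src.substring(start, i);
                tokens.add(new Token(KEYWORDS.contains(word) ? "keyword" : "id", word));
            } else if (Character.isDigit(c)) {
                // integer: a non-empty string of digits
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(new Token("int", src.substring(start, i)));
            } else if (c == '=' && i + 1 < src.length() && src.charAt(i + 1) == '=') {
                tokens.add(new Token("==", "==")); // assumed design choice: "==" is one token
                i += 2;
            } else if ("(){}=;".indexOf(c) >= 0) {
                // single-character tokens are named after the character itself
                tokens.add(new Token(String.valueOf(c), String.valueOf(c)));
                i++;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("if(x==y){\n\tz=1;\n} else {\n\tz=0;\n}")) {
            System.out.println(t);
        }
    }
}
```

Checking for “==” before “=” is one way to answer the “one token or two” question; a lexer that instead emitted two “=” tokens would push that decision onto the parser.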

Lexer Challenges
FORTRAN compilation rule: whitespace is insignificant
• Rule was motivated by the inaccuracy of card punching by operators
Consider:
• DO 5I=1,25
• DO 5I=1.25
• The first: a DO loop in which I iterates from 1 to 25 (the 5 is the label of the statement that ends the loop)
• The second: an assignment of 1.25 to the variable DO5I
Reading left-to-right, cannot tell if DO5I is a variable or the start of a DO statement until , or . is reached.

Lexer Challenges
C++ template syntax:
    vector<student>
C++ stream syntax:
    cin >> var
The problem:
    vector<vector<student>>
(the closing >> can be mistaken for the stream operator)

Lexer Implementation
Two important observations:
• The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time.
• Lookahead may be required to decide where one token ends and the next one begins.
To describe tokens, we adopt a formalism based upon Regular Languages:
• Simple and useful theory
• Easy to understand
• Efficient implementations

Languages
Definition: Let Σ be a set of characters. A language over Σ is a set of strings of the characters drawn from Σ.
Examples:
Alphabet = English characters
Language = English sentences
Alphabet = ASCII
Language = C programs
Not every string of English characters is an English sentence
Not all ASCII strings are valid C programs

Notation
Languages are sets of strings.
Need some notation for specifying which set we want to designate a language.
• Regular languages are those with some special properties.
• The standard notation for a regular language is a regular expression

Regular Expressions
A single character denotes a set containing the single character itself:
    ‘x’ = { “x” }
Epsilon (ε) denotes an empty string (not the empty set):
    ε = { “” }
Empty set is { } = ∅
    size(∅) = 0
    size(ε) = 1
    length(ε) = 0
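The FORTRAN example shows why lookahead is needed: the decision about “DO” cannot be made until the scanner has looked past the “=”. The sketch below is a deliberately simplified illustration that only covers statements shaped like the two slide examples; the class name FortranLookahead and the method classify are hypothetical, and real FORTRAN lexing must also handle parentheses, strings, and other statement forms.

```java
// Hypothetical sketch: names are illustrative; only the shape of the two slide examples is handled.
class FortranLookahead {
    /** Returns "DO loop" or "assignment" for a statement such as "DO 5I=1,25". */
    static String classify(String stmt) {
        // FORTRAN rule from the slide: blanks are insignificant, so drop them first.
        String s = stmt.replace(" ", "");
        int eq = s.indexOf('=');
        if (!s.startsWith("DO") || eq < 0) {
            return "assignment"; // simplification: everything else is treated as an assignment
        }
        // A DO statement needs a numeric label and then a loop variable before the '='.
        int i = 2;
        while (i < eq && Character.isDigit(s.charAt(i))) i++;
        boolean hasLabelAndVar = i > 2 && i < eq && Character.isLetter(s.charAt(i));
        // Lookahead: only a ',' after the '=' makes this a DO loop.
        boolean commaAfterEq = s.indexOf(',', eq + 1) >= 0;
        return (hasLabelAndVar && commaAfterEq) ? "DO loop" : "assignment";
    }

    public static void main(String[] args) {
        System.out.println(classify("DO 5I=1,25")); // prints "DO loop"
        System.out.println(classify("DO 5I=1.25")); // prints "assignment"
    }
}
```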

Compound REs
Alternation: if A and B are REs, then:
    A | B = { s | s ∈ A or s ∈ B }
Concatenation of sets/strings:
    AB = { ab | a ∈ A and b ∈ B }
Repetition (Kleene closure):
    A* = ⋃ (i ≥ 0) A^i, where A^i = A...A (i times)
    A* = { ε } + A + AA + AAA + ... (zero or more As)

Convenient Abbreviations
One or more:
    A+ = A + AA + AAA + ... = A A* (one or more As)
Zero or one:
    A? = A | ε
Character class:
    [abcd] = a | b | c | d
Wildcard:
    . (dot) matches any character (sometimes excluding newline)

Examples
Regular expressions to determine Java keywords:
    if | else | while | for | int | …
A literal string like “if” is shorthand for the concatenation of each letter
Integer literal:
    digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    digit = [0123456789]
    digit = [0-9]
    integer = digit digit*
    integer = digit+

Examples
Whitespace:
    whitespace = [ \t\n]
C identifiers:
Start with a letter or underscore
Allow letters or underscores or numbers after the first letter
Cannot be a keyword
    id = [a-zA-Z_][a-zA-Z_0-9]*
Is this good enough?

Examples
Valid Email Addresses:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Java RegEx Support
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();
Or:
    boolean b = Pattern.matches("a*b", "aaaaab");
String class:
    String s = new String("aaaaab");
    boolean b = s.matches("a*b");
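The id and integer expressions above can be tried directly with java.util.regex. The following is a minimal sketch: the class name TokenPatterns, its method names, and the keyword subset are assumptions for illustration. It also shows one possible way to handle the “cannot be a keyword” rule, which the id expression alone does not enforce: check a keyword set after the match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: class name, method names, and keyword subset are illustrative.
class TokenPatterns {
    // id = [a-zA-Z_][a-zA-Z_0-9]*
    static final Pattern ID = Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");
    // integer = digit+
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // Assumed keyword subset, taken from the keyword example on the slide.
    static final Set<String> KEYWORDS =
            new HashSet<String>(Arrays.asList("if", "else", "while", "for", "int"));

    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches(); // matches() requires the whole string to match
    }

    static boolean isIdentifier(String s) {
        // The regular expression accepts keywords too, so they are filtered out separately.
        return ID.matcher(s).matches() && !KEYWORDS.contains(s);
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("_tmp1")); // prints "true"
        System.out.println(isIdentifier("while")); // prints "false"
        System.out.println(isInteger("007"));      // prints "true"
    }
}
```

Note that integer = digit+ accepts leading zeros such as 007; whether that is acceptable depends on the language being lexed.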

Predefined Patterns in Java

Pattern     Description
[abc]       a, b, or c (simple class)
[^abc]      Any character except a, b, or c (negation)
\d          A digit: [0-9]
\D          A non-digit: [^0-9]
\s          A whitespace character: [ \t\n\x0B\f\r]
\S          A non-whitespace character: [^\s]
\w          A word character: [a-zA-Z_0-9]
\W          A non-word character: [^\w]
^           The beginning of a line
$           The end of a line
\b          A word boundary
\B          A non-word boundary
X{n}        X, exactly n times
X{n,}       X, at least n times
X{n,m}      X, at least n but not more than m times
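A short sketch of how a few of these predefined patterns behave in practice. The class name PredefinedPatternDemo and its methods are hypothetical. In Java source, each backslash in a pattern has to be written twice inside a string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: class and method names are illustrative.
class PredefinedPatternDemo {
    // \w+ : one or more word characters; find() returns each match in turn.
    static List<String> words(String text) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // \d{2,4} : at least 2 but not more than 4 digits; matches() must cover the whole input.
    static boolean twoToFourDigits(String s) {
        return Pattern.matches("\\d{2,4}", s);
    }

    public static void main(String[] args) {
        System.out.println(words("if(x==y)"));      // prints "[if, x, y]"
        System.out.println(twoToFourDigits("123")); // prints "true"
        System.out.println(twoToFourDigits("1"));   // prints "false"
    }
}
```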
