Compiler Development (CMPSC 401) Lexical Analysis Janyl Jumadinova

SLIDE 1

Compiler Development (CMPSC 401)

Lexical Analysis
Janyl Jumadinova
January 24, 2019

Janyl Jumadinova Compiler Development (CMPSC 401) January 24, 2019 1 / 26

SLIDE 2

Outline

Quick overview of basic concepts of formal grammars.
Lexical specification of programming languages.
Scanners and tokens.
Regular expressions.

SLIDE 3

Programming Language Specifications

Since the 1960s, the syntax of every significant programming language has been specified by a formal grammar. The idea was borrowed from the linguistics community (Chomsky).

SLIDE 7

Overview of Formal Languages and Automata Theory

Starring Mr. Pig
Alphabet: a finite set of symbols and characters. E.g., i, k, n, o, !
String: a finite, possibly empty sequence of symbols from the alphabet. E.g., “oink”
Language: a set of strings (possibly empty or infinite). E.g., “oink!”, “oink oink!”, “oink oink oink!”, ...

SLIDE 9

Finite Specifications of Possibly Infinite Languages

Automaton - a recognizer; a machine that accepts all strings in a language (and rejects all other strings). E.g., a pig detector: accepts all sequences of “oink”s, rejects “moo”s or “baa”s.

Grammar - a generator that produces all strings in the language (and nothing else).
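
As a minimal sketch of the recognizer idea, the pig detector can be written as a small state machine in Python. It assumes the pig language from the earlier slide (one or more “oink”s separated by single spaces, with “!” after the last one); the function name `is_pig_talk` and the word-by-word state encoding are illustrative choices, not part of the lecture.

```python
def is_pig_talk(text: str) -> bool:
    """Accept "oink!", "oink oink!", ...; reject every other string."""
    # States: "start" (expecting a word), "accept" (saw the final "oink!"),
    # "reject" (dead state: no way back to acceptance).
    state = "start"
    for word in text.split(" "):
        if state == "start" and word == "oink":
            state = "start"    # more oinks may follow
        elif state == "start" and word == "oink!":
            state = "accept"   # the final oink
        else:
            state = "reject"   # a non-oink word, or input after "oink!"
    return state == "accept"
```

For example, `is_pig_talk("oink oink!")` is accepted, while `is_pig_talk("moo")` and `is_pig_talk("oink")` (no final “!”) are rejected.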

SLIDE 10

Language (Chomsky) hierarchy

Regular (Type-3) languages are specified by regular expressions/grammars and finite automata (FAs) ← SCANNING
Context-free (Type-2) languages are specified by context-free grammars and pushdown automata (PDAs) ← PARSING
Context-sensitive (Type-1) languages
Recursively-enumerable (Type-0) languages are specified by general grammars and Turing machines

SLIDE 14

Example: Grammar for Pigese (or Pigish?)

A formal grammar for our pig language could be:

PigTalk ::= oink PigTalk (rule 1)
          | oink! (rule 2)

PigTalk can then generate, for example:

1. PigTalk ::= oink! (Rule 2)

2. PigTalk ::= oink PigTalk (Rule 1)
           ::= oink oink! (Rule 2)

3. PigTalk ::= oink PigTalk (Rule 1)
           ::= oink oink PigTalk (Rule 1)
           ::= oink oink oink! (Rule 2)
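
The derivations above all follow one pattern: apply rule 1 some number of times, then finish with rule 2. A short Python sketch makes that mechanical; the function name `pig_talk` and its `rule1_count` parameter are assumptions introduced for illustration.

```python
def pig_talk(rule1_count: int) -> str:
    """Derive a sentence: apply rule 1 `rule1_count` times, then rule 2."""
    form = ["PigTalk"]  # sentential form, starting from the start symbol
    for _ in range(rule1_count):
        # Rule 1: PigTalk ::= oink PigTalk
        form = form[:-1] + ["oink", "PigTalk"]
    # Rule 2: PigTalk ::= oink!
    form = form[:-1] + ["oink!"]
    return " ".join(form)
```

Here `pig_talk(0)` corresponds to derivation 1 and `pig_talk(2)` to derivation 3. Because `rule1_count` can be any non-negative integer, two rules are enough to describe an infinite language.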

SLIDE 19

More formally

The rules of a grammar are called productions. Rules contain:

Non-terminal symbols: grammar variables (program, statement, id, etc.)
Terminal symbols: concrete syntax that appears in programs (a, b, c, 0, 1, if, =, (, ), ...)

nonterminal ::= <sequence of terminals and nonterminals>

In a derivation, an instance of nonterminal can be replaced by the sequence of terminals and nonterminals on the right of the production. Often there are several productions for a nonterminal; a derivation can choose any of them.

SLIDE 20

Alternative Notations

There are several syntax notations for productions in common use; all mean the same thing. E.g.:

ifStmt ::= if ( expr ) statement
ifStmt → if ( expr ) statement
<ifStmt> ::= if ( <expr> ) <statement>

SLIDE 21

A small but more realistic example

program ::= statement | program statement
statement ::= assignStmt | ifStmt
assignStmt ::= id = expr ;
ifStmt ::= if ( expr ) statement
expr ::= id | int | expr + expr
id ::= a | b | c | i | j | k | n | x | y | z
int ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
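
To see how a derivation proceeds in this grammar, the productions can be stored as data and applied one at a time. The sketch below is only an illustration: the `GRAMMAR` dictionary layout and the `leftmost_derivation` helper are assumptions, and each entry of `choices` is simply the index of the alternative to apply to the leftmost nonterminal.

```python
# Each nonterminal maps to its list of alternatives (right-hand sides).
GRAMMAR = {
    "program":    [["statement"], ["program", "statement"]],
    "statement":  [["assignStmt"], ["ifStmt"]],
    "assignStmt": [["id", "=", "expr", ";"]],
    "ifStmt":     [["if", "(", "expr", ")", "statement"]],
    "expr":       [["id"], ["int"], ["expr", "+", "expr"]],
    "id":         [[c] for c in "abcijknxyz"],
    "int":        [[d] for d in "0123456789"],
}

def leftmost_derivation(start, choices):
    """Replace the leftmost nonterminal once per entry in `choices`.

    Returns every sentential form, beginning with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # The leftmost symbol that still has productions is a nonterminal.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form = form[:i] + GRAMMAR[form[i]][choice] + form[i + 1:]
        steps.append(" ".join(form))
    return steps

# program -> statement -> assignStmt -> id = expr ; -> a = expr ;
#         -> a = int ; -> a = 1 ;
steps = leftmost_derivation("program", [0, 0, 0, 0, 1, 1])
```

Once no nonterminals remain, the final form `a = 1 ;` is a sentence of the language. A parser performs the reverse task: given the sentence, it reconstructs which choices were made.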

SLIDE 23

Parsing and Scanning

Scanner: translates source code to tokens (e.g., <int>, +, <id>). Reports lexical errors like illegal characters and illegal symbols.
Parser: reads the token stream and reconstructs the derivation. Reports parsing errors, i.e., source that is not derivable from the grammar. E.g., mismatched parentheses/braces, nonsensical statements (x = 1 +;)

SLIDE 25

Why Separate the Scanner and the Parser?

Standard arguments about splitting functionality into independent pieces:
Simplicity and separation of concerns
Efficiency

SLIDE 28

But...

Not always possible to separate cleanly.

Example: C/C++/Java type vs. identifier. In C, for instance, whether a name is a type or an ordinary identifier depends on earlier declarations (typedef). Things are even uglier in Fortran 77. E.g., blanks are not significant, so myvar and my var (with any spacing) are the same identifier; keywords are not reserved; etc. Tokenizing requires context.

So we hack around it somehow ...

Either use a simpler grammar and disambiguate later, or communicate between scanner and parser (with some semantic analysis mixed in). Real world: often ends up very complex and hard to follow. Compiler front-ends are sometimes referred to as “black magic”.

SLIDE 33

Typical Tokens in Programming Languages

Operators and Punctuation

+ - * / ( ) [ ] ; : :: < <= == = != ! ...
Each of these is a distinct lexical class

Keywords

if while for goto return switch void ...
Each of these is also a distinct lexical class (not a string)

Identifiers (variables)

A single ID lexical class, but parameterized by actual identifier (often a pointer into a symbol table)

Integer constants

A single INT lexical class, but parameterized by numeric value

Other constants (string, floating point, boolean, ...), etc.

SLIDE 35

Principle of Longest Match

In most languages (exception: Fortran 77), the scanner should pick the longest possible string to make up the next token if there is a choice.

return maybe != iffy;

This is 5 tokens: RETURN ID(maybe) NEQ ID(iffy) SCOLON
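
The longest-match rule can be sketched as a tiny hand-written scanner in Python. This is an illustration under stated assumptions, not the lecture's implementation: the token names RETURN, ID, NEQ and SCOLON come from the slide, while IF, NOT, ASSIGN, the `TOKEN_SPEC` table and the `scan` function are hypothetical additions. At each position every pattern is tried and the longest match wins; ties go to the class listed first, which is how a keyword beats ID.

```python
import re

# Token classes in priority order (earlier entries win ties).
TOKEN_SPEC = [
    ("RETURN", r"return"),
    ("IF",     r"if"),
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NEQ",    r"!="),
    ("NOT",    r"!"),
    ("ASSIGN", r"="),
    ("SCOLON", r";"),
]

def scan(source):
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():  # white space only separates tokens
            pos += 1
            continue
        best_kind, best_text = None, ""
        for kind, pattern in TOKEN_SPEC:
            m = re.compile(pattern).match(source, pos)
            # Keep only a strictly longer match, so ties keep the earlier class.
            if m and len(m.group()) > len(best_text):
                best_kind, best_text = kind, m.group()
        if best_kind is None:
            raise ValueError(f"illegal character {source[pos]!r} at {pos}")
        # ID is parameterized by its spelling; the other classes stand alone.
        tokens.append((best_kind, best_text) if best_kind == "ID" else (best_kind,))
        pos += len(best_text)
    return tokens

tokens = scan("return maybe != iffy;")
```

Without longest match, `iffy` could be split into IF followed by ID(fy), and `!=` into NOT followed by ASSIGN. With it, the scanner produces the five tokens listed on the slide.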

SLIDE 36

Lexical Complications

Most modern languages are free-form

Layout doesn’t matter
White space separates tokens

Alternatives

Haskell, Python - indentation and layout can imply grouping

SLIDE 37

Regular Expressions used for Scanning

Defined over some alphabet Σ.

For programming languages, alphabet is usually ASCII or Unicode.

If re is a regular expression, L(re) is the language (set of strings) generated by re.

SLIDE 38

Fundamentals of Regular Expressions (REs)

These are the basic building blocks that other REs are built from.

SLIDE 40

Operations on REs

Precedence, from highest to lowest: (R), R*, R1R2, R1|R2. Parentheses can be used to group REs as needed.
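
Python's `re` module uses the same precedence for these operators, so it can serve as a quick way to check how an expression groups. In this small sketch the helper name `matches` is an assumption; `re.fullmatch` tests whether the entire string belongs to the language of the pattern.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the whole of `text` is in L(pattern)."""
    return re.fullmatch(pattern, text) is not None

# ab*|c groups as (a(b*))|c: star binds tightest, then concatenation,
# then alternation.
star_binds_to_b = matches("ab*|c", "abbb") and not matches("ab*|c", "abab")

# Parentheses override the default grouping.
star_over_group = matches("(ab)*|c", "abab")
alt_inside_concat = matches("a(b*|c)", "ac") and not matches("ab*|c", "ac")
```

All three flags evaluate to True, which confirms that `*` applies only to `b` in `ab*|c` unless parentheses say otherwise.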

SLIDE 41

Examples

SLIDE 42

Abbreviations on REs

There are common abbreviations used for convenience.

SLIDE 43

Abbreviations on REs

Many systems allow abbreviations to make writing and reading definitions or specifications easier. Restriction: abbreviations may not be circular (recursive), either directly or indirectly (otherwise the definition could describe a language that is not regular).

digit ::= [0-9] is okay
number ::= digit number is not okay

SLIDE 44

Example

Possible syntax for numeric constants

digit ::= [0-9]
digits ::= digit+
number ::= digits ( . digits )? ( [eE] (+ | -)? digits )?

Notice that this allows (unnecessary) leading 0s, e.g., 00045.6. (The 0s in 0 or 0.14 are necessary.)

SLIDE 45

Example

Possible syntax for numeric constants

digit ::= [0-9]
nonzero_digit ::= [1-9]
digits ::= digit+
number ::= (0 | nonzero_digit digits?) ( . digits )? ( [eE] (+ | -)? digits )?
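
As a rough translation into Python's `re` syntax, each abbreviation becomes a string that is substituted textually into the next definition, which is exactly why abbreviations must not be recursive. The constant names and the `is_number` helper are assumptions made for this sketch; the literal `.` is escaped because an unescaped dot matches any character in `re`.

```python
import re

DIGIT = r"[0-9]"
NONZERO_DIGIT = r"[1-9]"
DIGITS = rf"{DIGIT}+"

# number ::= (0 | nonzero_digit digits?) ( . digits )? ( [eE] (+ | -)? digits )?
NUMBER = re.compile(
    rf"(0|{NONZERO_DIGIT}(?:{DIGITS})?)(\.{DIGITS})?([eE][+-]?{DIGITS})?"
)

def is_number(text: str) -> bool:
    """True if the whole of `text` is a numeric constant under this definition."""
    return NUMBER.fullmatch(text) is not None
```

With this definition `is_number("0.14")` and `is_number("45.6")` hold, while `is_number("00045.6")` does not, because the integer part must be either a lone 0 or start with a nonzero digit.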

SLIDE 46

Recognizing Regular Expressions

Finite automata can be used to recognize languages generated by regular expressions. They can be built by hand or automatically: tools like Lex, Flex (for compilers written in C++), and JFlex (for compilers written in Java) do this automatically, given a set of REs.

SLIDE 47

RE Practice: https://regexone.com/
Next time: Building finite automata that recognize regular expressions, and how they can be used to build scanners.
