Compiler Construction Lecture 2: Compiler Structure and Lexical Analysis

SLIDE 1

Compiler Construction

Lecture 2: Compiler Structure and Lexical Analysis 2020-01-10 Michael Engel

Includes material by Jan Christian Meyer

SLIDE 2

Compiler Construction 02: Compiler Structure, Scanning

.org

Theoretical and practical exercises

  • TA: Lahiru Rasnayake
  • Six problem sets, one every two weeks
  • Theoretical questions on scanning, parsing, optimization…
  • Practical: build parts of your own small compiler (in C)
  • Get your own software project running

  • Solutions need to be handed in on time
  • Better to hand in an empty solution than a plagiarized one
  • Only the final two will be graded
  • 20% of the final grade (80% exam)
  • More details next week
SLIDE 3

Overview

  • Overview: definition and tasks of a compiler
  • Structure and stages of a typical compiler
  • Deterministic finite automata (DFA)
  • Lexical analysis (scanning)
SLIDE 4

Compilers are everywhere

  • Original idea: enable programming of computers in higher-level abstractions than machine language

– Zuse's Plankalkül (1940s), FORTRAN, LISP, A-0 (1950s)

  • Today:

– Many different source languages and target platforms

  • Additional uses of compilers:

– Static analysis and verification
– Hardware synthesis
– Source-to-source transformations
– Just-in-time (JIT) compilation

SLIDE 5

What does a compiler do?

  • Compiler: “Tool that translates software written in one language into another language”
– must understand both the form, or syntax, and content, or meaning (semantics), of the input language
– and understand the rules that govern syntax and meaning in the output language
– needs a scheme for mapping content from the source language to the target language
  • Requirements:
– must preserve the meaning of the program being compiled
– must improve the input program in some discernible way

SLIDE 6

The compilation process black box

Input (source code):

 int factorial(int n) {
     int fact = 1;
     while (n > 1) {
         fact = fact * n;
         n--;
     }
     return fact;
 }

 ➙ ? ➙

Output (machine code):

 . . . 0xE59F1010 0xE59F0008 0xE0815000 0xE59F5008 . . .

SLIDE 7

Compilation process in detail

source code in high-level language (.c)
 ➙ preprocessor ➙ preprocessed code
 ➙ compiler ➙ assembler code (.s)
 ➙ assembler ➙ machine (“object”) code (.o)
 ➙ linker ➙ executable code
 ➙ loader

Also shown in the diagram: libraries, debugger

SLIDE 8

Structure of a compiler (1)

Source code ➙ Frontend ➙ Backend ➙ Target program (frontend and backend together form the compiler)

  • Frontend: “understand both the form, or syntax, and content, or meaning (semantics), of the input language”
  • Backend: “understand the rules that govern syntax and meaning in the output language”
  • Compiler as a whole: “scheme for mapping content from the source language to the target language”

SLIDE 9

Structure of a compiler (2)

Source code ➙ Frontend ➙ IR ➙ Optimizer ➙ IR ➙ Backend ➙ Target program (all three stages together form the compiler)

  • Frontend: “understand both the form, or syntax, and content, or meaning (semantics), of the input language”
  • Optimizer: “must improve the input program in some discernible way”
  • Backend: “understand the rules that govern syntax and meaning in the output language”
  • Compiler as a whole: “scheme for mapping content from the source language to the target language”

SLIDE 10

Intermediate representation (IR)

  • Early compilers directly generated machine code
  • n source languages, m targets: n × m compilers required!
  • Idea: use a common description format: “Intermediate Representation” (IR)
– Transform source to IR (front end) and IR to target code (back end)
– Only n + m compilers (n front ends plus m back ends) required now
  • Additional advantages of using intermediate representations:
– Easy to change source or target language
– Easier optimizations: developed only for the intermediate representation
– Intermediate representation can be directly interpreted

Figure: without an IR, each of the 5 source languages (Java, ML, Pascal, C, C++) needs its own compiler for each of the 4 targets (Sparc, MIPS, Pentium, Itanium), 5 × 4 = 20 in total; with a shared IR, 5 front ends and 4 back ends suffice, 5 + 4 = 9.

SLIDE 11

Stages of a compiler (1)

Lexical analysis (scanning):
– Split source code into lexical units
– Recognize tokens (using regular expressions/automata)
– Token: character sequence relevant to source language grammar

Figure: compiler pipeline from source code to machine-level program, with the stages lexical analysis, syntax analysis, semantic analysis, code generation and code optimization.

In: character stream, e.g. x = y + 42
Out: token sequence, e.g. id(x) op(=) id(y) op(+) number(42)
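
As a rough illustration of this stage, the following C sketch turns the example input x = y + 42 into the token sequence shown on the slide. The names token_kind, struct token and next_token are assumptions made for this example and are not part of the lecture material; only identifiers, unsigned integers and a few single-character operators are handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token classes for this sketch (names are not from the lecture). */
enum token_kind { TOK_ID, TOK_NUMBER, TOK_OP, TOK_END, TOK_ERROR };

struct token {
    enum token_kind kind;
    char text[32];              /* the lexeme, truncated if it does not fit */
};

/* Reads one token starting at *p and advances *p past it. */
struct token next_token(const char **p)
{
    struct token t = { TOK_END, "" };
    const char *s = *p;
    size_t n = 0;

    while (isspace((unsigned char)*s))      /* skip whitespace between tokens */
        s++;

    if (*s == '\0') {
        t.kind = TOK_END;                   /* end of input, empty lexeme */
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        t.kind = TOK_ID;
        while (isalnum((unsigned char)*s) || *s == '_') {
            if (n < sizeof t.text - 1) t.text[n++] = *s;
            s++;
        }
    } else if (isdigit((unsigned char)*s)) {
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*s)) {
            if (n < sizeof t.text - 1) t.text[n++] = *s;
            s++;
        }
    } else if (strchr("=+-*/", *s) != NULL) {
        t.kind = TOK_OP;                    /* single-character operators only */
        t.text[n++] = *s++;
    } else {
        t.kind = TOK_ERROR;                 /* character outside this sketch's alphabet */
        t.text[n++] = *s++;
    }
    t.text[n] = '\0';
    *p = s;
    return t;
}
```

Calling next_token repeatedly on a pointer to "x = y + 42" yields id(x), op(=), id(y), op(+), number(42) and then TOK_END.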

SLIDE 12

Stages of a compiler (2)

Syntax analysis (parsing):
– Uses grammar of the source language
– Decides if input token sequence can be derived from the grammar

In: token sequence, e.g. id(x) op(=) id(y) op(+) number(42)
Out: syntax tree

SLIDE 13

Stages of a compiler (3)

Semantic analysis:
– Name analysis (check definition and scope of symbols)
– Type analysis (check correct type of expressions)
– Creation of symbol tables (map identifiers to their types and positions in the source code)

In: syntax tree
Out: IR

SLIDE 14

Stages of a compiler (5)

Code optimization:
– Finds patterns of redundancy in the program and rewrites them
– e.g., a store of a variable followed by a load of it
– Often, different stages/levels of optimization with different intermediate representations are applied

In: IR
Out: IR

SLIDE 15

Stages of a compiler (4)

Code generation:
– Determines and outputs equivalent machine instructions for components of the IR (instruction selection)
– Determines correct instruction order with respect to pipeline constraints and exploitation of instruction-level parallelism (instruction scheduling)
– Assigns variables to registers (register allocation) and memory locations

In: IR
Out: machine code

SLIDE 16

Lexical analysis (scanning)

  • The compiler input is simply a stream (sequence) of bytes:

 72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, ...

  • By convention (here: ASCII encoding), these are mapped to letters, digits, etc.:

 ‘H’, ‘e’, ‘l’, ‘l’, ‘o’, ‘ ’, ‘w’, ‘o’, ‘r’, ‘l’, ‘d’, ...

  • Other mappings (encodings) exist
– e.g. Unicode UTF-8, EBCDIC
  • On this level, the input program is just a lot of bytes without any structure
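
To make the byte view concrete, here is a small C sketch that formats the byte values of a string as decimal numbers. The function name dump_bytes is hypothetical, and the values in the example assume an ASCII-compatible execution character set.

```c
#include <stdio.h>
#include <stddef.h>

/* Writes the byte values of `text` into `out` as space-separated decimals.
   Stops early if `out` (capacity `cap`) is too small.
   Returns the number of bytes of `text` that were formatted. */
size_t dump_bytes(const char *text, char *out, size_t cap)
{
    size_t used = 0, i;

    if (cap == 0) return 0;
    out[0] = '\0';
    for (i = 0; text[i] != '\0'; i++) {
        int n = snprintf(out + used, cap - used, "%s%d",
                         i ? " " : "", (unsigned char)text[i]);
        if (n < 0 || (size_t)n >= cap - used) {
            out[used] = '\0';       /* drop the partially written number */
            break;
        }
        used += (size_t)n;
    }
    return i;
}
```

With a sufficiently large buffer, dump_bytes("Hello world", buf, sizeof buf) fills buf with the sequence from the first bullet: 72 101 108 108 111 32 119 111 114 108 100.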

SLIDE 17

Lexical analysis (scanning)

  • Naive approach to scanning:

Read letters one by one, e.g., for the keyword “while”:

 w (119), h (104), i (105), l (108), e (101)

  • Writing a compiler that has to detect this pattern every time the programmer wants to start a loop is inconvenient:
  • A programmer might choose to call a variable 'whilf':

 w (119), h (104), i (105), l (108), (looking good so far…)
 f (102) (oh no, start from scratch, that’s not a loop)

SLIDE 18

Identifying syntactical units

  • Better approach:

Group letters into meaningful units and operate on those:

 ‘i’, ‘f’, ‘(’, ‘w’, ‘h’, ‘i’, ‘l’, ‘f’, ‘=’, ‘=’, ‘2’, ‘)’, ‘{’, ‘x’, ‘=’, ‘5’, ‘;’, ‘}’

 if ( whilf == 2 ) { x = 5; }

  • Here, we use color coding to identify the various units:
– keywords and punctuation (here: if and ;)
– delimiters of groups (here: ( ) { })
– variables (here: whilf, x)
– operators (here: ==, =)
– numbers (here: 2, 5)

SLIDE 19

Deriving code structure

  • What use is the coloring of our units?

We've already seen this one:

 if ( whilf == 2 ) { x = 5; }

How would we color that line?

 while ( a < 42 ) { a += 2; }

Using the same coloring rules, both lines get the same sequence of classes:

 keyword ( variable operator number ) { variable operator number ; }

  • These two statements have completely different meanings but share the same (syntactic) structure (here: sequence of colors)
  • We’ll talk about structure later
  • Today, we will look at lexical analysis

SLIDE 20

Useful definitions

  • Lexeme
– Lexemes are units of lexical analysis, words
– They’re like entries in the dictionary: “house”, “walk”, “smooth”
  • Token
– Tokens are units of syntactic analysis
– They are like units of a sentence: “noun”, “verb”, “adjective”
  • Semantics
– The meaning of something (there is no sensible unit)
– Similar to explanations in the dictionary:
– house: “a building which someone inhabits”
– walk: “the act of putting one foot in front of the other”
– smooth: “the property of a surface which offers little resistance”

SLIDE 21

Classes of lexemes

  • Lexemes with a fixed meaning
– keywords or reserved words
– “if”, “while”, “for”, “==”, …
– Most languages forbid the use of these as identifiers (variable/function/… names)
– Source is easier to parse, less ambiguous code
  • Classes with countably infinite instances
– e.g. 1, 2, 3, … 65535, …
– All of these are specific cases of the class “integer number”
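
The distinction between fixed lexemes and open-ended classes can be sketched in C as follows. The keyword list, the enum lexeme_class and the function classify are assumptions for this example; the point is that a keyword is matched against the complete lexeme, so 'whilf' ends up as an ordinary identifier.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum lexeme_class { LEX_KEYWORD, LEX_INTEGER, LEX_IDENTIFIER, LEX_UNKNOWN };

/* Lexemes with a fixed meaning; a short illustrative list, not a full language. */
static const char *const keywords[] = { "if", "while", "for" };

/* Classifies one complete lexeme (no surrounding whitespace). */
enum lexeme_class classify(const char *lexeme)
{
    size_t i;

    if (lexeme[0] == '\0')
        return LEX_UNKNOWN;

    /* Fixed lexemes: compare against the whole word, not a prefix. */
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return LEX_KEYWORD;

    /* Class "integer number": one or more digits. */
    if (isdigit((unsigned char)lexeme[0])) {
        for (i = 1; lexeme[i] != '\0'; i++)
            if (!isdigit((unsigned char)lexeme[i]))
                return LEX_UNKNOWN;
        return LEX_INTEGER;
    }

    /* Class "identifier": letter or '_' followed by letters, digits or '_'. */
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_') {
        for (i = 1; lexeme[i] != '\0'; i++)
            if (!isalnum((unsigned char)lexeme[i]) && lexeme[i] != '_')
                return LEX_UNKNOWN;
        return LEX_IDENTIFIER;
    }
    return LEX_UNKNOWN;
}
```

Here classify("while") gives LEX_KEYWORD, classify("whilf") gives LEX_IDENTIFIER, and classify("65535") gives LEX_INTEGER.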

SLIDE 22

Finite automata

  • Required: mechanism to identify classes of words (not just single words)
  • Example: mechanism to recognize real numbers
  • Informal description:

“A real number starts with one or more digits, optionally followed by a decimal point followed by zero or more digits”

  • Formal approach: Deterministic Finite Automaton (DFA)
– example given as a directed graph here (easy to follow)

DFA graph: s1 ➙ s2 on [0-9]; s2 ➙ s2 on [0-9]; s2 ➙ s3 on '.'; s3 ➙ s3 on [0-9]

SLIDE 23

DFA structure

DFAs are often represented as a directed graph G = (V, E)

  • Nodes (vertices V) = states (here: s1, s2, s3)
  • Edges E = transitions (annotated with conditions)
  • The automaton starts in s1
  • States s2, s3 are accepting states (double outline)

DFA graph: s1 ➙ s2 on [0-9]; s2 ➙ s2 on [0-9]; s2 ➙ s3 on '.'; s3 ➙ s3 on [0-9]

SLIDE 24

DFA formal definition

Formal definition: DFA = 5-tuple (Q, Σ, δ, q0, F)

  • Q is a finite set called the states,
  • Σ is a finite set called the alphabet,
  • δ: Q×Σ → Q is the transition function,
  • q0 ∈ Q is the start state, and
  • F ⊆ Q is the set of accepting states

For the example DFA (s1 ➙ s2 on [0-9]; s2 ➙ s2 on [0-9]; s2 ➙ s3 on '.'; s3 ➙ s3 on [0-9]):

 Q = {s1, s2, s3}
 Σ = {0,1,2,3,4,5,6,7,8,9,.}
 q0 = s1
 F = {s2, s3}
 δ = ???

SLIDE 25

Transition function of a DFA

Give the subsequent state for each state and each possible input, commonly as a table (rows: current state, columns: input character):

 δ  | 0  1  2  3  4  5  6  7  8  9  .
 s1 | s2 s2 s2 s2 s2 s2 s2 s2 s2 s2
 s2 | s2 s2 s2 s2 s2 s2 s2 s2 s2 s2 s3
 s3 | s3 s3 s3 s3 s3 s3 s3 s3 s3 s3

The '.' column is empty for s1 and s3: the graph has no such transition.

SLIDE 26

Example DFA transition

Input character sequence: 4 2 . 2 3

Start: in state s1
Read 1st char: '4' ➙ change to s2
Read 2nd char: '2' ➙ stay in s2
Read 3rd char: '.' ➙ change to s3
Read 4th char: '2' ➙ stay in s3
Read 5th char: '3' ➙ stay in s3
End of sequence in accepting state ✔
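
The same walk through the automaton can be reproduced with a short C sketch that records the state reached after every input character. The names step and run_dfa and the explicit ERR state are assumptions made for this example, not the lecture's implementation.

```c
#include <stddef.h>

enum state { S1, S2, S3, ERR };

/* Transition function of the real-number DFA.
   ERR is an explicit sink state for all transitions not drawn in the graph. */
static enum state step(enum state q, char c)
{
    int is_digit = (c >= '0' && c <= '9');

    switch (q) {
    case S1: return is_digit ? S2 : ERR;
    case S2: return is_digit ? S2 : (c == '.' ? S3 : ERR);
    case S3: return is_digit ? S3 : ERR;
    default: return ERR;        /* once in ERR, stay there */
    }
}

/* Runs the DFA over `input`, storing the state reached after each character
   in `trace` (at most `cap` entries). Returns 1 if the input is accepted. */
int run_dfa(const char *input, enum state *trace, size_t cap)
{
    enum state q = S1;          /* start state */
    size_t i;

    for (i = 0; input[i] != '\0'; i++) {
        q = step(q, input[i]);
        if (i < cap)
            trace[i] = q;
    }
    return q == S2 || q == S3;  /* accepting states */
}
```

For the input "42.23" the recorded states are S2, S2, S3, S3, S3 and the function returns 1, matching the trace above; an input such as "4..2" ends in ERR and is rejected.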

SLIDE 27

Error handling

  • What happens when a character '.' is read in state s1 or s3?

 δ  | 0  1  2  3  4  5  6  7  8  9  .
 s1 | s2 s2 s2 s2 s2 s2 s2 s2 s2 s2 ???
 s2 | s2 s2 s2 s2 s2 s2 s2 s2 s2 s2 s3
 s3 | s3 s3 s3 s3 s3 s3 s3 s3 s3 s3 ???

In the graph, an explicit Error state is added: s1 ➙ Error on '.'; s3 ➙ Error on '.'

The error state is often omitted in DFA descriptions. Implied: all non-indicated characters ➙ error

SLIDE 28

Implementing a DFA in C the hard way

 #include <stdio.h>

 enum {error = 0, success};

 int scan_real_number(void) {
     int c;  // int rather than char, so that EOF is distinct from every valid character
     enum states {s1, s2, s3};
     enum states cur = s1;

     while (1) {
         c = getchar();          // get next char
         if (c == EOF) break;    // end?
         switch (cur) {
         case s1:
             if (c >= '0' && c <= '9') cur = s2;
             else return error;
             break;
         case s2:
             if (c >= '0' && c <= '9') cur = s2;
             else if (c == '.') cur = s3;
             else return error;
             break;
         case s3:
             if (c >= '0' && c <= '9') cur = s3;
             else return error;
             break;
         } // switch
     } // while

     // check for accepting state
     if (cur != s2 && cur != s3) return error;
     else return success;
 }

SLIDE 29

Implementing a table-driven DFA in C

 enum {error = 0, success};
 enum states {s1, s2, s3, er};
 enum states cur = s1;

 char alphabet[] = { '0', '1', '2', '3', '4',
                     '5', '6', '7', '8', '9', '.' };

 // next state for each char in alphabet (columns)
 struct scanner {
     enum states next[sizeof(alphabet)];
 };

 // rows of the transition table
 struct scanner delta[sizeof(enum states)] = {
     // 0   1   2   3   4   5   6   7   8   9   .
     {s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, er}, // s1
     {s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s3}, // s2
     {s3, s3, s3, s3, s3, s3, s3, s3, s3, s3, er}, // s3
     {er, er, er, er, er, er, er, er, er, er, er}, // er
 };

 int scan_real_number(void) {
     char c;
     while (1) {
         c = getchar();          // get next char
         if (c==EOF) break;      // end?
         cur = delta[cur].next[lookup(c)];
     } // while

     // check for accepting state
     if (cur!=s2 && cur!=s3)
         return error;
     else
         return success;
 }

What is the task of the function call lookup(c) here and how would you implement it?

Beware: there's a subtle but potentially dangerous bug in the code! Can you find it?
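
One possible reading of the lookup(c) question, given only as a sketch: the function maps an input character to its column index in the transition table by searching the alphabet array. The return value -1 for characters outside the alphabet is an assumption of this example; the slide does not say how such characters should be treated, so a caller would have to check for -1 before indexing the table.

```c
#include <stddef.h>

/* Same column order as the transition table: '0'..'9', then '.' */
static const char alphabet[] = { '0', '1', '2', '3', '4',
                                 '5', '6', '7', '8', '9', '.' };

/* Returns the column index of c in the transition table,
   or -1 if c is not part of the alphabet (assumed convention). */
int lookup(int c)
{
    size_t i;

    for (i = 0; i < sizeof alphabet; i++)
        if (c == alphabet[i])
            return (int)i;
    return -1;
}
```

For example, lookup('0') is 0, lookup('.') is 10, and lookup('x') is -1. A linear search keeps the sketch short; for a larger alphabet a direct 256-entry index table would avoid the loop.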

SLIDE 30

Scanner generators

  • Programming a word-class recognizer (lexical analyzer, or scanner) with ad-hoc logic is complicated and error-prone
  • Writing one using tables is a bit easier, but it requires punching in a bunch of boring table entries to represent specific DFAs
  • Can we generate code for a scanner automatically from a simple description?
– Specify word classes as regular expressions
– Let a program write a large table of states that includes all of the expressions
  • More on this next week!
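
To give a feeling for how compact a regular-expression description is, the sketch below states the real-number word class from the earlier slides as a pattern and checks strings against it with the POSIX regex functions. This is not a scanner generator: the pattern is interpreted by the C library at run time, and the function name matches_real_number is made up for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` as a whole is a real number in the sense of the
   informal description, 0 if not, and -1 if the pattern fails to compile. */
int matches_real_number(const char *text)
{
    regex_t re;
    int rc;

    /* one or more digits, optionally followed by '.' and zero or more digits */
    if (regcomp(&re, "^[0-9]+(\\.[0-9]*)?$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this pattern, "42.23", "42." and "7" match, while ".5" and "4..2" do not, which agrees with the DFA that has s2 and s3 as accepting states.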