CSE 3341: Principles of Programming Languages
Syntax
Jeremy Morris
Syntax vs. Semantics
Syntax:
What kinds of symbols are allowed in a language?
Semantics:
What do the symbols in a language mean?
Language Terminology
Alphabet
Finite set of symbols
String
Sequence of symbols
Language
Set of strings over an alphabet
Grammar
Rules that define which strings over an alphabet are in the
language and which ones are not
Terminology Example
Consider the Java programming language
Alphabet
The tokens in the Java language.
if, else, while, do, >, <, String, variable names, etc.
Note: Not the individual characters
- Not your intuitive understanding of the term “alphabet”.
String
A sequence of tokens from the alphabet
Language
The set of all syntactically correct Java programs.
Grammar
The rules for producing syntactically correct Java programs.
https://docs.oracle.com/javase/specs/jls/se8/html/index.html
(It’s a nearly 800 page book – you don’t need to read it)
Language Terminology
We typically talk about languages in mathematical terms
as sets
Alphabet – finite set of symbols
Often denoted as Σ
String – finite sequence of symbols
Empty string: ε – a sequence of length 0
Σ* - the set of all strings over Σ (including ε)
The * represents the “Kleene closure” – we’ll discuss this more later
Σ+ - the set of all non-empty strings over Σ
The + represents “one or more” where the * represents “zero or more”
Language – set of strings
Language L ⊆ Σ*
Defined by a grammar
Probably will not contain everything in Σ*
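As a rough illustration of Σ* and Σ+, the sketch below lists the strings over a small alphabet up to a length bound (Σ* itself is infinite, so only a finite slice can be printed). The alphabet {a, b} and the names SIGMA and strings_up_to are assumptions made up for this example.

```python
from itertools import product

SIGMA = ["a", "b"]  # a small example alphabet (assumed for illustration)

def strings_up_to(alphabet, max_len):
    """Return every string over `alphabet` of length 0..max_len, shortest first."""
    result = []
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            result.append("".join(symbols))
    return result

sigma_star_slice = strings_up_to(SIGMA, 2)                    # finite slice of Σ*
sigma_plus_slice = [s for s in sigma_star_slice if s != ""]   # same slice of Σ+

print(sigma_star_slice)  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(sigma_plus_slice)  # ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```

A language L ⊆ Σ* would then be any subset of the full (unbounded) list, typically picked out by a grammar rather than written down string by string.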
Syntax - Specification
We use syntax rules to specify the syntax of a language
Language – set of all strings that the rules allow
Some rules for non-negative integers:
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
With these we can specify any non-negative integer.
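One possible way to try these two rules out is to write them as a regular expression. This is only a sketch using Python's re module; the function name is_number is made up for the example.

```python
import re

# number → digit digit*   with   digit → 0 | 1 | ... | 9
NUMBER = re.compile(r"[0-9][0-9]*")

def is_number(text):
    """True if the whole string can be derived from `number`."""
    return NUMBER.fullmatch(text) is not None

print(is_number("655"))  # True
print(is_number(""))     # False: at least one digit is required
print(is_number("6a5"))  # False: 'a' is not a digit
```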
Syntax Rule Terminology
Terminal symbol
Any symbol that represents a member of the alphabet for the
language
i.e. Any symbol that is in the set of all possible tokens for the language
Will only appear on the right hand side of a syntax rule
(At least for our purposes – not strictly true)
Non-terminal symbol
Any symbol that represents a rule to be expanded
Non-terminal – meaning “we need to keep going”
Can appear on either the left or the right hand side of a syntax rule
Meta-symbols
Symbols used to write the rules, but not part of the alphabet or
non-terminals
→, |, *, etc.
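The distinction between terminals and non-terminals can be sketched in code by storing a grammar as data. The representation below (a dictionary from each non-terminal to its alternatives) and the helper non-terminal digits, which spells out digit* without a meta-symbol, are assumptions of this example and not part of the slides.

```python
# Grammar stored as: non-terminal -> list of alternatives (each a list of symbols).
# The meta-symbols (→, |, *) do not appear in the data: they are replaced by
# the dictionary structure and by the helper non-terminal `digits`.
GRAMMAR = {
    "number": [["digit", "digits"]],
    "digits": [[], ["digit", "digits"]],   # zero or more digits
    "digit": [[d] for d in "0123456789"],
}

def classify(grammar):
    """Split the symbols of `grammar` into (non_terminals, terminals)."""
    non_terminals = set(grammar)           # everything with a rule to expand
    terminals = set()
    for alternatives in grammar.values():
        for alternative in alternatives:
            for symbol in alternative:
                if symbol not in non_terminals:
                    terminals.add(symbol)  # appears only on a right hand side
    return non_terminals, terminals

non_terminals, terminals = classify(GRAMMAR)
print(sorted(non_terminals))  # ['digit', 'digits', 'number']
print(sorted(terminals))      # ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
```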
Terminology Example
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Which of these are terminal symbols? Non-terminal? Meta?
Syntax – Types of Grammars
Chomsky Hierarchy
Classifies formal languages by how complex their rules are
Type-0 – Unrestricted (aka Recursively enumerable)
Type-1 – Context-sensitive
Type-2 – Context-free
Type-3 – Regular
We will focus on those last two
Regular Languages (aka Regular Expressions)
The simplest kind of grammar
Requires only 3 kinds of rules:
Concatenation
Join two things together
Alternation
Select between two choices
“Kleene closure”
Repeat something zero or more times.
No recursion is allowed
If we allow recursion, then we get Context-free grammars
Regular Languages (aka Regular Expressions)
Assume an alphabet Σ. A regular expression over Σ is:
∅ – the empty set
ε – the empty string
Any member of Σ (i.e. each a ∈ Σ is a regular expression denoting { a })
Concatenation
If R and S are both regular expressions over Σ, then so is RS
RS = { rs | r ∈ R and s ∈ S }
Alternation
If R and S are both regular expressions over Σ, then so is R ∪ S
Written as R|S – choose between R or S
“Kleene closure”
If R is a regular expression over Σ, then so is R*
R repeated 0 or more times – R concatenated with itself
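The three operations can be sketched as operations on sets of strings. This is an illustration only: the function names are invented here, and because R* is infinite whenever R contains a non-empty string, star takes a bound on the number of repetitions.

```python
def concat(R, S):
    """RS: every string of R followed by every string of S."""
    return {r + s for r in R for s in S}

def alternate(R, S):
    """R|S: the union of the two languages."""
    return R | S

def star(R, max_repeats):
    """Finite approximation of R*: R concatenated with itself 0..max_repeats times."""
    result = {""}        # zero repetitions gives the empty string ε
    current = {""}
    for _ in range(max_repeats):
        current = concat(current, R)
        result |= current
    return result

R = {"a"}
S = {"b", "c"}
print(sorted(concat(R, S)))     # ['ab', 'ac']
print(sorted(alternate(R, S)))  # ['a', 'b', 'c']
print(sorted(star(R, 3)))       # ['', 'a', 'aa', 'aaa']
```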
Regular Languages
In syntax rules we can define a regular language like this:
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Another way of saying:
Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
number = { d d* | d ∈ Σ }
(There might be a problem with this definition of a natural number – can you spot it?)
Regular Languages
Another example (from the textbook)
Numeric constants
number → integer | real
integer → digit digit*
real → integer exp | decimal (exp | ε)
decimal → digit* (. digit | digit .) digit*
exp → (e | E) (+ | - | ε) integer
digit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0
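Since none of these rules is recursive, each one can be substituted into the rules that use it, which yields a single regular expression. The sketch below does that substitution with Python string formatting; the constant and function names are assumptions of this example.

```python
import re

# Each constant mirrors one rule of the numeric-constant grammar.
DIGIT = r"[0-9]"
INTEGER = rf"{DIGIT}{DIGIT}*"
EXP = rf"(?:e|E)(?:\+|-|)(?:{INTEGER})"                # empty alternative plays the role of ε
DECIMAL = rf"{DIGIT}*(?:\.{DIGIT}|{DIGIT}\.){DIGIT}*"
REAL = rf"(?:{INTEGER}{EXP}|{DECIMAL}(?:{EXP}|))"
NUMBER = re.compile(rf"(?:{INTEGER}|{REAL})")

def is_numeric_constant(text):
    """True if the whole string is in the language of `number`."""
    return NUMBER.fullmatch(text) is not None

print(is_numeric_constant("655"))   # True
print(is_numeric_constant("6.55"))  # True
print(is_numeric_constant("e5"))    # False: an exponent needs a number in front
```

The same function can be used to check other candidate strings against the rules.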
Derivations
Using syntax rules we can derive strings that are in our
language
Using the previous set of rules, can we show that 655 is in our language of “numeric constants”?
number ⇒ integer ⇒ digit digit* ⇒ 6 digit* ⇒ 6 5 digit* ⇒ 6 5 5 digit* ⇒ 6 5 5
Derivations Example
Using the rules on the previous slide, determine if the
following strings are in the language for numeric constants:
10e5
.65e30
.65e0.30
10.0e5.0
10.0e-5
Context-Free Languages
The Chomsky Hierarchy mentioned above is a containment hierarchy
All Regular Languages are also Context-Free, but not all Context-
Free Languages are Regular
Consider the language L = { aⁿbⁿ | n ≥ 0 }
Empty string, ab, aabb, aaabbb, etc. are all in this language
aabbb, aaabb, a, etc. are not.
Can we derive the rules for this language using only the rules set out for regular languages?
No, as it turns out.
- You can prove this mathematically using a theorem known as the
pumping lemma, but that’s outside the scope of this class
- see CSE 3321 – Formal Languages and Automata
But if we allow recursion we can do it easily
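A minimal sketch of how recursion handles this language, assuming the usual recursive rule S → a S b | ε (the rule and the function name in_anbn are not from the slides):

```python
def in_anbn(text):
    """Recognize { a^n b^n | n >= 0 } by following S → a S b | ε."""
    if text == "":
        return True                   # S → ε
    if len(text) >= 2 and text[0] == "a" and text[-1] == "b":
        return in_anbn(text[1:-1])    # S → a S b: peel one 'a' and one 'b'
    return False

print(in_anbn(""))        # True
print(in_anbn("aaabbb"))  # True
print(in_anbn("aabbb"))   # False
print(in_anbn("a"))       # False
```

Each recursive call matches one a with one b, which is exactly the counting that concatenation, alternation, and Kleene closure alone cannot express.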
Context-Free Grammars (CFGs)
A grammar that defines a Context-Free language has
the same properties as a Regular grammar…
Concatenation, Alternation, Kleene Closure
…but allows for recursion in its rules
Either immediate recursion – the non-terminal on both the right
and left hand side of the same rule
We’ll see an example of this on the next slide
Or mutual recursion – a non-terminal on the left expands to a rule
that eventually expands back to that same non-terminal
We’ll see an example of this in a moment – hang in there
Context Free Grammars (CFGs)
The following grammar is not Regular, but is Context-Free:
expr → number | expr op expr | ( expr )
op → + | - | / | *
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Note the recursion in the rule for expanding expr
This grammar is problematic…
Let’s derive 1+3*2 using the previous rules
Context-Free Grammars
We can represent a derivation graphically as a parse
tree or syntax tree
The root of the tree is the start symbol for the grammar
The internal nodes are non-terminal symbols
The leaf nodes are terminal symbols
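[Figure: parse tree for 1+3*2. The root expr expands to expr op expr with op = +; the left expr is the number 1, and the right expr expands to expr op expr for 3 * 2.]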
Context-Free Grammars
Consider these two trees, both derived from the above
grammar:
[Figure: two parse trees for 1+3*2. In the first, + is at the root and the right subtree is 3 * 2. In the second, * is at the root and the left subtree is 1 + 3.]
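Why the two trees matter can be sketched by evaluating each one. The tuple encoding (left, op, right) and the function name evaluate are assumptions of this example; the grouping follows the two trees for 1+3*2.

```python
def evaluate(tree):
    """Evaluate a parse tree given as an int or a (left, op, right) tuple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a / b   # the only remaining op in the grammar is "/"

plus_at_root = (1, "+", (3, "*", 2))    # first tree: 3*2 grouped together
times_at_root = ((1, "+", 3), "*", 2)   # second tree: 1+3 grouped together

print(evaluate(plus_at_root))   # 7
print(evaluate(times_at_root))  # 8
```

The same string gets two different values depending on the tree, which is what makes an ambiguous grammar a problem for a compiler or interpreter.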
Context-Free Grammars
A better, unambiguous grammar:
expr → term | expr add_op term
term → factor | term mult_op factor
factor → number | ( expr )
mult_op → * | /
add_op → + | -
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Still not Regular, but Context-Free
Recursion is still there
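A small evaluator can show how the layered rules (expr, term, factor) force multiplication to bind tighter than addition. This is a sketch under stated assumptions, not a parsing method from the slides: it uses one function per non-terminal, replaces the left recursion in expr and term with loops (reading expr as "term followed by zero or more add_op term"), and computes a value instead of building a tree. All function names are invented for the example.

```python
import re

def tokenize(text):
    """Split text into numbers, operators, and parentheses.
    Characters outside the grammar's alphabet are simply skipped."""
    return re.findall(r"[0-9]+|[-+*/()]", text)

def parse_expr(tokens, pos):
    """expr → term | expr add_op term, read as: term (add_op term)*"""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        value = value + right if op == "+" else value - right
    return value, pos

def parse_term(tokens, pos):
    """term → factor | term mult_op factor, read as: factor (mult_op factor)*"""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        value = value * right if op == "*" else value / right
    return value, pos

def parse_factor(tokens, pos):
    """factor → number | ( expr )"""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected ')'")
        return value, pos + 1
    if token.isdigit():
        return int(token), pos + 1
    raise ValueError(f"unexpected token {token!r}")

def calc(text):
    """Evaluate a whole expression; reject leftover tokens."""
    tokens = tokenize(text)
    value, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return value

print(calc("1+3*2"))    # 7
print(calc("(1+3)*2"))  # 8
print(calc("8-3-2"))    # 3
```

Because a term must be fully parsed before an add_op is considered, 3*2 is always grouped first, so only one reading of 1+3*2 is possible.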
Languages in Compilers & Interpreters
Stream of characters → Tokenizer/Scanner → Stream of tokens → Parser → Parse tree → Next steps
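The first stage of this pipeline can be sketched as a function that turns a stream of characters into a stream of tokens. The three token kinds (number, name, symbol), the set of symbols, and the function name scan are assumptions chosen for this illustration; a real scanner would follow the token definitions of its language.

```python
import re

# One named group per token kind; leading whitespace is skipped.
TOKEN = re.compile(
    r"\s*(?:(?P<number>[0-9]+)|(?P<name>[A-Za-z]+)|(?P<symbol>[-+*/()=;]))"
)

def scan(text):
    """Turn a string of characters into a list of (kind, text) tokens."""
    tokens = []
    pos = 0
    text = text.rstrip()   # so trailing whitespace is not reported as an error
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind = match.lastgroup             # name of the group that matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens

print(scan("X = 25;"))
# [('name', 'X'), ('symbol', '='), ('number', '25'), ('symbol', ';')]
```

The parser then works on this list of tokens, which is why the alphabet of the language is the set of tokens and not the set of characters.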
Syntax - Specification
The previous syntax rules are one type of notation for a syntax.
number → digit digit*
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Here’s another:
<number> ::= <digit> | <digit> <number>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Backus-Naur Form (aka Backus normal form aka BNF)
Note that pure BNF does not use Kleene-star or Kleene-plus
Other extensions provide shorthand for these, but leaving them out does not reduce expressiveness (see above for how recursion replaces the Kleene star)
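The claim that recursion can stand in for the Kleene star can be checked informally by writing both versions and comparing them on a few strings. The function names below are invented for this sketch.

```python
import re

DIGITS = set("0123456789")

def is_digit(text):
    """<digit> ::= 0 | 1 | ... | 9"""
    return text in DIGITS

def is_number_bnf(text):
    """<number> ::= <digit> | <digit> <number>   (recursion, no Kleene star)"""
    if is_digit(text):
        return True
    return len(text) > 1 and is_digit(text[0]) and is_number_bnf(text[1:])

def is_number_star(text):
    """number → digit digit*   (Kleene star, no recursion)"""
    return re.fullmatch(r"[0-9][0-9]*", text) is not None

for sample in ["7", "655", "", "6a5"]:
    print(repr(sample), is_number_bnf(sample), is_number_star(sample))
```

Both functions agree on every sample, which matches the note above: the star is a convenience, not extra power.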
BNF Specification
<number> ::= <digit> | <digit> <number>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Special symbols: <, >, | and ::=
Reserved (or ‘meta’) symbols
Non-terminals
Wrapped in <> tags - <digit> or <number>
Indicate rules that need to be expanded
Terminals
Not wrapped in <> tags
Indicate “terminal” symbols – no more expansion
CORE: Imperative Language
<prog> ::= program <decl-seq> begin <stmt-seq> end
<decl-seq> ::= <decl> | <decl> <decl-seq>
<stmt-seq> ::= <stmt> | <stmt> <stmt-seq>
<decl> ::= int <id-list>;
<id-list> ::= <id> | <id>, <id-list>
<stmt> ::= <assign> | <if> | <loop> | <in> | <out>
<assign> ::= <id> = <exp>;
<if> ::= if <cond> then <stmt-seq> end;
       | if <cond> then <stmt-seq> else <stmt-seq> end;
<loop> ::= while <cond> loop <stmt-seq> end;
<in> ::= read <id-list>;
<out> ::= write <id-list>;
CORE: Imperative Language
<cond> ::= <comp> | !<cond> | [ <cond> and <cond> ] | [ <cond> or <cond> ]
<comp> ::= ( <fac> <comp-op> <fac> )
<exp> ::= <term> | <term> + <exp> | <term> - <exp>
<term> ::= <fac> | <fac> * <term>
<fac> ::= <int> | <id> | ( <exp> )
<comp-op> ::= != | == | < | > | <= | >=
<id> ::= <let-seq> | <let-seq><int>
<let-seq> ::= <let> | <let><let-seq>
<let> ::= A | B | C | ... | X | Y | Z
<int> ::= <digit> | <digit><int>
<digit> ::= 0 | 1 | 2 | 3 | ... | 9
CORE syntax tree practice
program int X;
begin
  X = 25;
  write X;
end
Concrete Syntax Tree
[Figure: concrete syntax tree for the program above, rooted at <prog>. Every token of the program (program, int, X, ;, begin, =, 25, write, end) appears as a leaf.]
Abstract Syntax Tree
[Figure: abstract syntax tree for the same program, rooted at <prog>. The keyword and punctuation tokens (program, int, begin, end, write, =, ;) are omitted, leaving the non-terminals and the leaves X and 25.]
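Going from the concrete tree to the abstract one can be sketched as dropping the leaves that carry no meaning of their own. The (label, children) tuple encoding, the DROPPED set, and the function name to_abstract are assumptions of this example; as in the figures, <id> and <int> are shown going straight to X and 25 without the intermediate <let-seq> and <digit> nodes.

```python
# Keyword and punctuation tokens that appear in the concrete tree only.
DROPPED = {"program", "begin", "end", "int", "write", "=", ";"}

def to_abstract(node):
    """Copy a (label, children) tree, omitting keyword/punctuation leaves."""
    if isinstance(node, str):
        return node                      # a kept leaf such as "X" or "25"
    label, children = node
    kept = [to_abstract(child) for child in children
            if not (isinstance(child, str) and child in DROPPED)]
    return (label, kept)

# Concrete subtree for the statement  X = 25;
concrete_assign = ("<assign>", [
    ("<id>", ["X"]),
    "=",
    ("<exp>", [("<term>", [("<fac>", [("<int>", ["25"])])])]),
    ";",
])

print(to_abstract(concrete_assign))
```

The result keeps <id> and <exp> as the two children of <assign>; the = and ; leaves are gone because the node label already says what kind of statement it is.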
CORE parse tree practice
program int Y,Z;
begin
  Y = 20;
  Z = 5;
  Y = Y - Z;
  write Y;
end
CORE parse tree practice
program int X,Y,Z;
begin
  Y = 20;
  Z = 5;
  X = 21;
  if (Y < Z) then
    if (Y < X) then
      Y = Z;
    else
      Y = X;
    end;
  end;
  write Y;
end
Takeaways
Syntax vs. Semantics
Regular Languages vs. Context Free Languages
Parsing and ambiguity
Abstract vs. Concrete Parse Trees
Readings
Chapter 2.1 – Syntax
For next time: Chapter 2.3 – Parsing
Skim Chapter 2.2 – Scanning