Lecture Slides for MAT-73006 Theoretical Computer Science, PART Ib: Automata and Languages. Context-Free Languages



SLIDE 1

Lecture Slides for MAT-73006 Theoretical Computer Science
PART Ib: Automata and Languages. Context-Free Languages

Henri Hansen, January 26, 2015

SLIDE 2

Context-free languages

  • There are several very simple languages that are not regular, such as {0^n 1^n | n ≥ 0}.
  • They are "simple" to describe mathematically, but computationally the situation is different.
  • An important class of such languages is the class of context-free languages.
  • We shall explore a way of describing these languages, called context-free grammars.

SLIDE 3

  • An important area of application for these grammars is found in programming languages.

SLIDE 4

Context-free grammar

  • Let us start with an example of a grammar:

    A → 0A1
    A → B
    B → #

  • These three rules are substitution rules. The left-hand side of each rule contains a variable, and the right-hand side contains a string consisting of variables and terminal symbols.

SLIDE 5

  • Terminal symbols are symbols of the language that is being defined, i.e., Σ is the set of terminal symbols.
  • A grammar describes a language by generating the strings in the language. This happens by the following procedure:
  • 1. Write down the start variable. Unless otherwise stated, it is the left-hand side of the topmost rule.
  • 2. Find a variable that has been written down, and a rule that has this variable as its left-hand side. Replace the written-down variable with the right-hand side of the rule.
  • 3. Repeat step 2 until no variables remain.
SLIDE 6

  • For example, the example grammar can generate the string 000#111.
  • The sequence of substitutions that results in the string is called a derivation.
  • A derivation can also have a graphic representation as a parse tree.
  • The set of strings that can be generated by a given grammar is called the language of the grammar.
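The generation procedure above can be sketched in a few lines of Python (a hypothetical encoding, not from the slides: rules live in a dictionary, and a list of rule choices drives the leftmost substitutions):

```python
RULES = {"A": ["0A1", "B"], "B": ["#"]}   # the example grammar

def derive(choices, start="A"):
    """Run the written-down procedure: repeatedly replace the leftmost
    variable, using the rule alternative picked by `choices`, until no
    variables remain.  Returns every intermediate string."""
    s = start
    steps = [s]
    for c in choices:
        i = next(k for k, ch in enumerate(s) if ch in RULES)   # leftmost variable
        s = s[:i] + RULES[s[i]][c] + s[i + 1:]
        steps.append(s)
    return steps

# A => 0A1 => 00A11 => 000A111 => 000B111 => 000#111
print(derive([0, 0, 0, 1, 0]))
```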

SLIDE 7

A more complicated example:

    SENTENCE → NOUN-PHRASE VERB-PHRASE
    NOUN-PHRASE → CMPLX-NOUN | CMPLX-NOUN PREP-PHRASE
    VERB-PHRASE → CMPLX-VERB | CMPLX-VERB PREP-PHRASE
    PREP-PHRASE → PREP CMPLX-NOUN
    CMPLX-NOUN → ARTICLE NOUN
    CMPLX-VERB → VERB | VERB NOUN-PHRASE
    ARTICLE → a | the
    NOUN → boy | girl | flower
    VERB → likes | sees | touches
    PREP → with

SLIDE 8

Formal definition of CFG

  • A context-free grammar is a 4-tuple (V, Σ, R, S), where
  • 1. V is a finite set called the variables
  • 2. Σ is a finite set, disjoint from V, called the terminals (AKA the alphabet)
  • 3. R is a finite set of rules, a rule being a pair (v, σ) where v is a variable and σ is a string of variables and terminals; also written as v → σ
  • 4. S ∈ V is the starting variable

SLIDE 9

  • If u, v and w are strings of variables and terminals, and A → w is a rule of the grammar, then uAv yields the string uwv, written uAv ⇒ uwv.
  • We say that u derives v, written u ⇒* v, if u = v or if there is some sequence u ⇒ u1 ⇒ u2 ⇒ · · · ⇒ uk ⇒ v.
  • The language of the grammar is the set {w ∈ Σ* | S ⇒* w}.
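The definition S ⇒* w suggests a direct, if naive, membership test: breadth-first search over sentential forms. A Python sketch, assuming a grammar with no ε-rules so that forms longer than the target can be pruned:

```python
from collections import deque

def generates(rules, start, target):
    """Decide S =>* target by breadth-first search over sentential forms.
    Assumes no ε-rules, so any form longer than target can be discarded."""
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        if s == target:
            return True
        for i, ch in enumerate(s):
            if ch not in rules:
                continue
            for rhs in rules[ch]:
                t = s[:i] + rhs + s[i + 1:]
                if len(t) <= len(target) and t not in seen:
                    seen.add(t)
                    queue.append(t)
    return False

RULES = {"A": ["0A1", "B"], "B": ["#"]}
print(generates(RULES, "A", "00#11"))   # True
print(generates(RULES, "A", "0#11"))    # False
```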

SLIDE 10

Examples of CFGs

  • Often we write a CFG by simply giving the rules; the variables are the symbols that appear on left-hand sides and the others are terminals.
  • S → aSb | SS | ε (think of a as "(" and b as ")")
  • E → E + T | T
    T → T × F | F
    F → (E) | n

SLIDE 11

Here the alphabet is {n, +, ×, (, )}.

  • A compiler of a programming language translates code into another form; CFGs are used, for instance, in describing programming language syntax.
  • The process by which the meaning of a string is found by relating it to a grammar is known as parsing.
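As a small illustration of parsing, here is a sketch of a recursive-descent recognizer for the expression grammar above. Since recursive descent cannot handle left-recursive rules directly, each left recursion is unrolled into iteration (E → T ('+' T)*), and the ASCII '*' stands in for ×; the function names are my own:

```python
def parse_expr(tokens):
    """Recognizer for E -> E+T | T, T -> T*F | F, F -> (E) | n, with the
    left recursion unrolled into iteration."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def expr():            # E -> T ('+' T)*
        term()
        while peek() == "+":
            eat("+"); term()

    def term():            # T -> F ('*' F)*
        factor()
        while peek() == "*":
            eat("*"); factor()

    def factor():          # F -> (E) | n
        if peek() == "(":
            eat("("); expr(); eat(")")
        else:
            eat("n")

    expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True

print(parse_expr(list("n+n*(n+n)")))   # True
```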

SLIDE 12

Ambiguity

  • Consider the grammar E → E + E | E × E | (E) | a. There are several derivations for strings such as a + a × a.
  • Definition: A grammar is ambiguous if there are two or more leftmost derivations (equivalently, parse trees) for some string of its language.
  • Ambiguity makes (unique) parsing impossible, so obviously one should strive to describe languages unambiguously whenever possible.
  • Some languages are inherently ambiguous, i.e., all grammars that generate them are ambiguous.
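For a fixed string, the ambiguity of the grammar E → E + E | E × E | (E) | a can be checked mechanically by enumerating leftmost derivations (each parse tree corresponds to exactly one leftmost derivation). A small Python sketch, with '*' standing in for ×:

```python
RULES = ["E+E", "E*E", "(E)", "a"]   # alternatives for the single variable E

def leftmost_derivations(s, target, steps, out):
    """Collect every leftmost derivation of `target` from sentential form s."""
    i = s.find("E")
    if i == -1:                        # no variables left
        if s == target:
            out.append(steps)
        return
    # prune: forms never shrink, and the terminal prefix is already fixed
    if len(s) > len(target) or not target.startswith(s[:i]):
        return
    for rhs in RULES:
        t = s[:i] + rhs + s[i + 1:]
        leftmost_derivations(t, target, steps + [t], out)

out = []
leftmost_derivations("E", "a+a*a", [], out)
print(len(out))   # 2: a+a*a has two leftmost derivations, i.e. two parse trees
```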

SLIDE 13

Pushdown automata

  • Regular languages were defined as the languages that are recognized by some finite automaton.
  • Context-free languages can similarly be recognized by a certain kind of automaton; due to the recursive nature of context-free languages, some form of memory is needed.
  • Informally, pushdown automata are like nondeterministic finite automata, but instead of simply moving from one state to another, they use a stack to store information about what the automaton has done in the past, and this information affects what the automaton does next.

SLIDE 14

  • When a pushdown automaton is in a given state, it responds to the symbol that is read from the input, and to the symbol that is on top of the stack.
  • Let us write Σε for the set Σ ∪ {ε} (and similarly Γε for Γ ∪ {ε}).
  • Formally: A pushdown automaton is a 6-tuple (Q, Σ, Γ, δ, q0, F), where
  • 1. Q is the (finite) set of states
  • 2. Σ is the input alphabet
  • 3. Γ is the stack alphabet
SLIDE 15

  • 4. δ : Q × Σε × Γε → 2^(Q × Γε) is the nondeterministic transition function
  • 5. q0 ∈ Q is the start state
  • 6. F ⊆ Q is the set of accept states
  • A pushdown automaton (PDA) M = (Q, Σ, Γ, δ, q0, F) accepts an input a1 · · · an (where ai ∈ Σε) if and only if there is some sequence of states q0 q1 · · · qn and a sequence of strings g0, g1, . . . , gn ∈ Γ* such that the following conditions are met:
  • 1. g0 = ε, i.e., the automaton starts with an empty stack
SLIDE 16

  • 2. for 0 ≤ i ≤ n − 1 we have (qi+1, x) ∈ δ(qi, ai+1, y), gi = yt and gi+1 = xt for some t ∈ Γ*; i.e., the content of the stack is the same after the move, except possibly the topmost element
  • 3. qn ∈ F
  • To understand the transition function: if (qi+1, x) ∈ δ(qi, ai+1, y), then this transition can be executed if y is on top of the stack, the automaton is in state qi, and the next input symbol read is ai+1. After it is executed, y is removed from the stack, x is put on top, and the automaton has moved to state qi+1.
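The acceptance condition can be turned into a small nondeterministic search over configurations (state, input position, stack). This is an illustrative sketch, not the textbook construction; the stack-depth bound is an ad-hoc cutoff that keeps ε-pushing loops finite, and all names are my own:

```python
def pda_accepts(delta, start, finals, word, max_stack=None):
    """Search over PDA configurations (state, input position, stack).
    `delta` maps (q, a, y) -> set of (q2, x), where a ∈ Σ∪{ε} and
    y, x ∈ Γ∪{ε}, with ε written as "".  The stack is a string whose
    top is at index 0."""
    if max_stack is None:
        max_stack = len(word) + 2
    seen = set()
    frontier = [(start, 0, "")]
    while frontier:
        q, i, st = frontier.pop()
        if (q, i, st) in seen:
            continue
        seen.add((q, i, st))
        if i == len(word) and q in finals:
            return True                  # input consumed and qn ∈ F
        reads = [""] + ([word[i]] if i < len(word) else [])
        tops = [""] + ([st[0]] if st else [])
        for a in reads:                  # ε-move or read one input symbol
            for y in tops:               # ignore the stack or pop its top
                for q2, x in delta.get((q, a, y), ()):
                    new = x + (st[1:] if y else st)
                    if len(new) <= max_stack:
                        frontier.append((q2, i + len(a), new))
    return False

# PDA for {0^n 1^n | n >= 0}: push $, push one "0" per 0, pop one per 1.
DELTA = {
    ("q0", "", ""):   {("q1", "$")},
    ("q1", "0", ""):  {("q1", "0")},
    ("q1", "", "$"):  {("q3", "")},   # empty-word case
    ("q1", "1", "0"): {("q2", "")},
    ("q2", "1", "0"): {("q2", "")},
    ("q2", "", "$"):  {("q3", "")},
}
print(pda_accepts(DELTA, "q0", {"q3"}, "0011"))   # True
print(pda_accepts(DELTA, "q0", {"q3"}, "0010"))   # False
```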

SLIDE 17

Example

  • Consider the language {a^i b^j c^k | i = j or i = k}, i.e., either the number of b's or the number of c's is the same as the number of a's.
  • Informally, it is relatively easy to construct a PDA that accepts the language: first read all a's, pushing a counter onto the stack for each one. Then nondeterministically choose to count either the b's or the c's and match their number against the a's.

SLIDE 18

[State diagram of the PDA for {a^i b^j c^k | i = j or i = k}: states q0–q6, with transitions of the forms ε, ε → $; a, ε → a; b, a → ε; b, ε → ε; c, a → ε; c, ε → ε; and ε, $ → ε on the two accepting branches, matching the a's against either the b's or the c's.]
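The PDA's nondeterministic "guess" (match the a's against either the b's or the c's) can be mirrored by a direct membership check; a hypothetical helper, not part of the slides:

```python
def in_language(w):
    """Membership in {a^i b^j c^k | i = j or i = k}: count the three
    blocks, reject anything not of the shape a^i b^j c^k."""
    i = len(w) - len(w.lstrip("a"))
    rest = w[i:]
    j = len(rest) - len(rest.lstrip("b"))
    tail = rest[j:]
    k = len(tail) - len(tail.lstrip("c"))
    if tail[k:]:                  # leftover symbols -> not of shape a^i b^j c^k
        return False
    return i == j or i == k

print(in_language("aabbc"))    # True  (i = j = 2)
print(in_language("aabcc"))    # True  (i = k = 2)
print(in_language("aabbbc"))   # False
```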

SLIDE 19

Equivalence

  • Pushdown automata and context-free grammars are equivalent in the same way as regular expressions and finite automata are:
  • Theorem: A language is context-free if and only if there is a pushdown automaton that recognizes it.
  • First we explain how to prove one direction. Let A be a context-free language. By definition it then has a CFG, say G, that generates it.

SLIDE 20

  • The idea of the proof is as follows: we construct a nondeterministic PDA that, when reading an input, "guesses" what substitutions are needed for the given string.
  • 1. Initially, the PDA puts the start variable on the stack.
  • 2. After this, the automaton always looks at the top symbol of the stack. If it is a variable, then it nondeterministically chooses a rule to apply, removes the variable and pushes the rule's right-hand side in its place (in reverse order, so that its first symbol ends up on top).
  • 3. If the top symbol is a terminal, then it compares it to the next input symbol. If the symbols differ, this branch rejects; otherwise the top symbol is simply removed.
SLIDE 21

  • 4. If the stack is empty when the input ends, the automaton accepts.
  • Please verify that the automaton accepts exactly the strings that are generated by the grammar!
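Steps 1–4 can be simulated directly: the stack is a string with its top at index 0, variables are expanded nondeterministically, and terminals are matched against the input. A hedged sketch with an ad-hoc stack bound (it can fail for grammars whose derivations need deeper sentential forms):

```python
def cfg_pda_accepts(rules, start, word):
    """Simulate the PDA built from a CFG: the stack starts holding the
    start variable; a variable on top is replaced nondeterministically
    by some rule's right-hand side, and a terminal on top must match
    the next input symbol."""
    limit = len(word) + 2
    seen = set()

    def run(i, stack):
        if (i, stack) in seen or len(stack) > limit:
            return False
        seen.add((i, stack))
        if not stack:
            return i == len(word)        # step 4: empty stack at end of input
        top, rest = stack[0], stack[1:]
        if top in rules:                 # step 2: expand a variable
            return any(run(i, rhs + rest) for rhs in rules[top])
        # step 3: compare a terminal against the next input symbol
        return i < len(word) and word[i] == top and run(i + 1, rest)

    return run(0, start)

RULES = {"A": ["0A1", "B"], "B": ["#"]}
print(cfg_pda_accepts(RULES, "A", "00#11"))   # True
print(cfg_pda_accepts(RULES, "A", "00#1"))    # False
```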

  • The other direction is proven by generating a context-free grammar from the transition relation of a PDA.
  • Given a PDA P, three modifications are made:
  • 1. It will contain only one accepting state, qa. This is not a problem, because nondeterminism is allowed.

SLIDE 22

  • 2. The automaton only accepts after it has emptied the stack. This is not a restriction either.
  • 3. Every transition either pushes a symbol (but does not remove one) or removes a symbol (but does not push one). Again, this is not a restriction, because such transitions can be obtained by "splitting" a transition into two.
  • The PDA is then used as a recipe for creating a grammar that generates exactly the language accepted by the PDA; let p be the first state and q be the last state (the unique accept state).
  • When P is computing on a string, say x, conditions 2 and 3 require that the first operation adds and the last operation

SLIDE 23

removes a symbol from the stack. If the symbols are different, then the stack must have been empty at some point in between (why?)

  • If the symbols are the same, we create the rule Apq → aArsb, where a is the input read at the first move and b at the last move.
  • If the symbols are not the same, then there is some state r in which the stack is empty; we create a rule Apq → AprArq, and so on.
  • To formalize the proof, let (Q, Σ, Γ, δ, q0, {qa}) be the PDA (after the modifications).

SLIDE 24

  • 1. For each p, q, r, s ∈ Q, u ∈ Γ and a, b ∈ Σε: if δ(p, a, ε) contains (r, u) and δ(s, b, u) contains (q, ε), put the rule Apq → aArsb in G.
  • 2. For each p, q, r ∈ Q, put the rule Apq → AprArq in G.
  • 3. Finally, for each p ∈ Q, put the rule App → ε in G.
  • Lemma: If Apq generates x, then P has an execution from p (with empty stack) to q (with empty stack) reading x.
  • This can be proven by induction:
  • 1. If the derivation of x happens in one step, then the right-hand side contains no variables, only terminals. The only such rule generated by this construction is App → ε; hence x must be the empty string.

SLIDE 25

  • 2. Assume the claim holds for all derivations with at most k steps. If Apq ⇒* x in k + 1 steps, the first step is either Apq → aArsb or Apq → AprArq. Both cases reduce to derivations of length at most k.
  • Lemma: If P has an execution reading x from p to q (with empty stack at both ends), then Apq generates x.
  • This again is done by induction:
  • 1. If the computation contains 0 steps, the automaton cannot read any symbols, so x is the empty string and the automaton stays in state p; App → ε generates x.

SLIDE 26

  • 2. The inductive step is as before.
SLIDE 27

Non-context-free languages

  • There are languages that are neither regular nor context-free.
  • There is a lemma, similar to the pumping lemma, for context-free languages:
  • If A is a context-free language, then there is a number p such that, if s ∈ A with |s| ≥ p, then s can be divided into 5 parts s = wvxyz such that
  • 1. w v^i x y^i z ∈ A for every i ≥ 0,
  • 2. |vy| > 0, and

SLIDE 28

  • 3. |vxy| ≤ p
  • Proof: Let A be a CFL. Then it has a grammar G that generates it. Let s be a "very long" string of the language.
  • Because s is "very long" (longer than p), its derivation will use (at least) one of the variable symbols more than once on (at least) one branch of the derivation tree (please compare to the pumping lemma!). Let this variable be called R.
  • Let x be the string that is derived from the last occurrence of R, and let the occurrence before the last derive vxy.
SLIDE 29

  • Then we can replace the last occurrence of R with exactly the same subtree as the second-to-last occurrence.
  • Therefore, instead of vxy, we derive vvxyy.
  • This can be done arbitrarily many times over.
SLIDE 30

Examples of non-context-free languages

  • The language {a^n b^n c^n | n ≥ 0} is not context-free.
  • The language {a^i b^j c^k | 0 ≤ i ≤ j ≤ k} is not context-free.
  • The language {ww | w ∈ {0, 1}*} is not context-free.
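The pumping-lemma argument for {a^n b^n c^n | n ≥ 0} can be checked by brute force for a small pumping length: try every split s = wvxyz with |vxy| ≤ p and |vy| > 0, and test a few pumping exponents. A sketch (checking i only up to 3, which happens to suffice to rule every split out here):

```python
def in_L(s):
    """Membership in {a^n b^n c^n | n >= 0}."""
    n = len(s) // 3
    return s == "a" * n + "b" * n + "c" * n

def pumpable(s, p):
    """Is there a split s = w v x y z with |vxy| <= p and |vy| > 0 such
    that w v^i x y^i z stays in the language for i = 0..3?"""
    n = len(s)
    for a in range(n + 1):                      # vxy occupies s[a:b]
        for b in range(a, min(a + p, n) + 1):
            for c in range(a, b + 1):           # v = s[a:c]
                for d in range(c, b + 1):       # x = s[c:d], y = s[d:b]
                    w, v, x, y, z = s[:a], s[a:c], s[c:d], s[d:b], s[b:]
                    if not v + y:
                        continue
                    if all(in_L(w + v * i + x + y * i + z) for i in range(4)):
                        return True
    return False

p = 3
print(pumpable("a" * p + "b" * p + "c" * p, p))   # False: no split pumps
```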

SLIDE 31

Deterministic CFLs

  • Deterministic and nondeterministic finite automata are equivalent, but the same does not hold for pushdown automata.
  • To formalize the theory, let us begin with the definition of a deterministic PDA, or DPDA.
  • A deterministic pushdown automaton is a 6-tuple (Q, Σ, Γ, δ, q0, F) such that
  • 1. Q is a finite set of states
  • 2. Σ is the (input) alphabet

SLIDE 32

  • 3. Γ is the stack alphabet
  • 4. δ : Q × Σε × Γε → (Q × Γε) ∪ {∅} is the transition function
  • 5. q0 ∈ Q is the start state
  • 6. F ⊆ Q is the set of accept states
  • The transition function is furthermore required to be nonempty for exactly one of the values δ(q, a, x), δ(q, a, ε), δ(q, ε, x), δ(q, ε, ε), for every q ∈ Q, a ∈ Σ, and x ∈ Γ.

SLIDE 33

  • In other words, the automaton either reads an input symbol and moves (the first two cases) or moves without reading (the last two), and either way it behaves in a unique manner.
  • A language accepted by a DPDA is called a deterministic context-free language.

SLIDE 34

Examples

  • The language {0^n 1^n | n ≥ 0} is deterministic: the automaton reads input 0s, pushing a counter token each time until the first 1, after which it removes a counter every time it reads a 1.
  • The language {a^i b^j c^k | i = j ∨ i = k} is not deterministic.
  • The language of palindromes is not deterministic.
  • Proving that a language is deterministic is relatively easy: simply give the deterministic PDA.
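The deterministic PDA for {0^n 1^n | n ≥ 0} is easy to transcribe directly; a sketch where the stack stores one token per unmatched 0:

```python
def dpda_0n1n(w):
    """Deterministic PDA sketch for {0^n 1^n | n >= 0}: push one token
    per 0, pop one per 1; reject out-of-order or unmatched symbols."""
    stack = []
    seen_one = False
    for ch in w:
        if ch == "0":
            if seen_one:
                return False      # a 0 after the first 1
            stack.append("x")
        elif ch == "1":
            seen_one = True
            if not stack:
                return False      # more 1s than 0s
            stack.pop()
        else:
            return False          # symbol outside the alphabet
    return not stack              # accept iff every 0 was matched

print(dpda_0n1n("0011"))   # True
print(dpda_0n1n("0101"))   # False
```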

SLIDE 35

  • Proving that a language is not deterministic is much harder, and for that we need some more theory.

SLIDE 36

Properties of deterministic CFLs

  • Lemma: Every deterministic PDA has an equivalent automaton that always reads the entire input string.
    – There are two ways in which a DPDA might fail to read the whole input: hanging, where the automaton is forced to pop an empty stack, and looping, where the automaton makes an endless loop of ε-reads.
    – Hanging is prevented by putting a special symbol into the stack before the automaton starts; popping this symbol before the input ends results in reading the rest of the input and rejecting.

SLIDE 37

    – Looping is solved by identifying loops structurally: an ε-loop is replaced by reading the entire input and rejecting.
    – The exception is situations where the whole input has already been read: if accept states are visited in such situations, the automaton should accept.
  • Theorem: The class of deterministic CFLs is closed under complementation.
    – Swapping accept and non-accept states works for DFAs.
    – DPDAs need to solve an additional problem: if the automaton enters both accepting and non-accepting states

SLIDE 38

at the end of an input, it accepts even after complementation. This is solved by requiring that only states which read input are allowed to accept.

    – Swapping accept/non-accept states in such a DPDA complements the language accepted.
  • This yields at least one test for nondeterminism: if the complement of a given CFL is not context-free, then the language is not deterministic.
  • Sometimes it is easier to look at a modified language. Let A be a language, and let ⊥ be a symbol not in the alphabet. We call A⊥ = {w⊥ | w ∈ A} the end-marked language.

SLIDE 39

  • Theorem: A is a deterministic CFL if and only if A⊥ is a deterministic CFL.
    – Proof of "only if": accept states of a DPDA are replaced by a transition reading ⊥ and accepting.
    – Proof of "if": let P⊥ accept A⊥. Construct P as follows: if P⊥ would accept after reading ⊥ without looking at the stack, simply accept immediately. For other situations, the stack contains "two stacks" as a memory. When ⊥ would be read (and possibly accepted, depending on the stack), the behaviour of P⊥ is simulated and accepted accordingly, but if P⊥ would reject, then the stack is "reverted".

SLIDE 40

Deterministic CFGs

  • Deterministic PDAs have a counterpart in grammars, called deterministic context-free grammars.
  • Deterministic CFGs and deterministic languages have some attractive properties and restrictions on how strings can be derived.
  • A reduce step is a substitution in reverse: for example, if R → xyz is a rule, then xyz is reduced into R, and xyz is the reducing string. The reverse of a derivation of a string is called a reduction.

SLIDE 41

  • When a rule T → h is used backwards on a string xhy to produce xTy, we write xhy ↪ xTy.
  • A reduction from u is a sequence u = u1 ↪ u2 ↪ · · · ↪ uk = S, with S the start symbol.
  • The reduction is a leftmost reduction if each reducing string is reduced only after all other reducing strings that appear to its left.
  • If the rule T → h is used in a leftmost reduction to produce ui ↪ ui+1, then h (with this rule) is called the handle of ui.

SLIDE 42

  • A string that appears in a leftmost reduction (for instance, ui) is called a valid string.
  • If v = xhy is a valid string and h is its handle, we say that h is a forced handle if h is the unique handle of every valid string of the form xhz, where z ∈ Σ*.
  • A context-free grammar is deterministic iff every valid string has a forced handle.
  • In other words, in deterministic grammars, reduction depends only on the leftmost part of the string.
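A leftmost reduction can be sketched as a greedy loop that repeatedly reduces the leftmost matching right-hand side. (Real parsers select handles; plain leftmost matching happens to suffice for the earlier example grammar A → 0A1 | B, B → #.)

```python
RULES = [("A", "0A1"), ("A", "B"), ("B", "#")]   # (variable, right-hand side)

def leftmost_reduction(s, start="A"):
    """Reverse a derivation: repeatedly reduce the leftmost matching
    right-hand side (preferring the longer one on ties) until only the
    start symbol remains, recording every step."""
    steps = [s]
    while s != start:
        best = None
        for var, rhs in RULES:
            i = s.find(rhs)
            if i >= 0 and (best is None or i < best[0] or
                           (i == best[0] and len(rhs) > len(best[2]))):
                best = (i, var, rhs)
        if best is None:
            raise ValueError("stuck: no reducing string found")
        i, var, rhs = best
        s = s[:i] + var + s[i + len(rhs):]
        steps.append(s)
    return steps

print(leftmost_reduction("000#111"))
# ['000#111', '000B111', '000A111', '00A11', '0A1', 'A']
```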

SLIDE 43

  • This does not immediately give us a way of detecting determinism, but there is one test that we can derive from it.

SLIDE 44

The DK-test

  • For any CFG G we can construct a deterministic finite automaton DK that identifies handles. Specifically, DK accepts z if
  • 1. z is the prefix of some valid string v = zy, and
  • 2. z ends with a handle of v.
  • We first define a nondeterministic automaton, K:
  • 1. Let J be an NFA that accepts any string that ends with the right-hand side of some grammar rule.

SLIDE 45

  • 2. In any accepting run of J, it "follows" the right-hand side of a rule. Let us denote this so-called "rule-state" by B → u′v when the automaton has read u and v has not yet been read. Then the rule-state B → uv′ is accepting.
  • 3. K works like J but with slight modifications:
  • 4. For every rule-state B → u′Cv there is an ε-transition to a rule-state with C as the left-hand side that has not read anything yet.
  • Lemma: K may enter state T → u′v on reading z if and only if z = xu and xuvy is a valid string with handle uv and reducing rule T → uv, for some y ∈ Σ*.

SLIDE 46

  • The proof should be obvious from the construction.
  • Corollary: K may enter accept state T → h′ on input z if and only if z = xh and h is a handle of some valid string xhy with reducing rule T → h.
  • This gives us the DK-test: make K deterministic and check that every accept state contains
  • 1. exactly one completed rule-state, and
  • 2. no rule-state in which a terminal symbol immediately follows the read part, i.e., no rule-state of the form B → u′av, for some a ∈ Σ.
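The rule-states of K and the subset construction behind DK correspond to LR(0) items with their closure and goto operations; a sketch for a small hypothetical grammar (the names and the grammar are my own, not from the slides):

```python
RULES = {"S": ["aSb", "ab"]}   # hypothetical grammar: S -> aSb | ab

def closure(items, rules):
    """Rule-states B -> u′v represented as (variable, rhs, dot position).
    Mirrors K's ε-transitions: whenever the dot stands before a variable
    C, add C's rules with the dot at position 0."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for var, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in rules:
                for r in rules[rhs[dot]]:
                    if (rhs[dot], r, 0) not in items:
                        items.add((rhs[dot], r, 0))
                        changed = True
    return items

def goto(items, symbol, rules):
    """Advance the dot over `symbol`: one transition of the determinized K."""
    moved = {(v, r, d + 1) for v, r, d in items
             if d < len(r) and r[d] == symbol}
    return closure(moved, rules)

start = closure({("S'", "S", 0)}, RULES)   # augmented start rule, dot at 0
state = goto(start, "a", RULES)
print(len(state))   # 4 rule-states after reading "a"
```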

SLIDE 47

  • Theorem: G passes the DK-test iff G is deterministic.
  • If G is nondeterministic, there is some valid string with a handle that is not forced. If DK is run on such a string, DK must enter an accept state at the end of the handle. Because the handle is not forced, it is not unique, so the accept state contains another accepting rule-state, or some continuation of the current string leads to an accepting state, and the test fails.
  • If the DK-test fails, then there is a valid string with two handles: either the second handle is complete, or there is a continuation of the valid string with a different handle.
SLIDE 48

Practical applications of the theory

  • Deterministic CFLs are very important in practice, because parsing of deterministic CFGs is efficient. That is why the syntax of most programming languages is given as a deterministic CFG.
  • The requirement of forced handles is, however, sometimes too restrictive, because it restricts the use of intuition in designing grammars: it is not always easy to make sure all handles are forced.
  • There is a slightly broader class of grammars, however, that is both practical and intuitive.

SLIDE 49

  • The so-called LR(k) grammars use lookahead. The idea is that you are allowed to have nondeterminism, as long as you can resolve it by looking ahead no more than k symbols of the input before choosing the handle.
  • Formally: if h is the handle of v = xhy, then we say that h is forced by a lookahead of k if h is the unique handle of every valid string xhz where y and z agree on their first k symbols.
  • LR(0) grammars are exactly the deterministic CFGs.
  • LR(k) grammars are those for which the handle of every valid string is forced by a lookahead of k.