
Taaltheorie en Taalverwerking

BSc Artificial Intelligence

Raquel Fernández Institute for Logic, Language, and Computation

Winter 2012, lecture 2a

Raquel Fernández TtTv 2012 - lecture 2a 1 / 23


http://www.youtube.com/watch?v=f4cYun7i01c&hd=1


Outline

We have looked into regular expressions, FSAs and FSTs, and into how these tools can be used for search and to model aspects of the morphology and spelling of languages.

Today:

  • beyond regular languages
  • context-free grammars
  • the Chomsky Hierarchy

Next Lecture:

  • natural language syntax
  • implementation of grammars in Prolog


Are all formal languages regular?

Can we use regular expressions/FSAs to define any formal language? → given an alphabet Σ, are all subsets of Σ∗ regular?

  • Let us consider the formal language L = {a^n b^n | n ≥ 0} over Σ = {a, b}. This language is made up of all strings that contain some number n of a’s followed by the same number of b’s.

  • If L is regular, we should be able to build an FSA for it.

      ∗ there must be at least one state where that FSA stops accepting a’s and starts accepting b’s
      ∗ but it needs to remember how many a’s it has seen in order to accept the same number of b’s
      ∗ since n is unbounded, remembering the count would require a distinct state for every possible value of n, i.e. an infinite number of states, and therefore we cannot build an FSA for L.

There are languages which cannot be defined with the limited expressive power of regular expressions/FSAs.
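To make the counting argument concrete, here is a minimal Python sketch of a recognizer for L = {a^n b^n | n ≥ 0}. The function name `in_anbn` is a hypothetical helper, not something from the lecture. The point of the sketch is the variable `count`: it can grow without bound, which is exactly the kind of memory a finite-state automaton does not have.

```python
def in_anbn(s: str) -> bool:
    """Return True iff s consists of n a's followed by exactly n b's (n >= 0)."""
    count = 0  # unmatched a's seen so far; unbounded, unlike the states of an FSA
    i = 0
    # Phase 1: read a's and count them.
    while i < len(s) and s[i] == "a":
        count += 1
        i += 1
    # Phase 2: read b's, cancelling one counted a per b.
    while i < len(s) and s[i] == "b" and count > 0:
        count -= 1
        i += 1
    # Accept only if the whole input was consumed and every a was matched.
    return i == len(s) and count == 0


if __name__ == "__main__":
    for word in ["", "ab", "aabb", "aab", "abb", "ba"]:
        print(repr(word), in_anbn(word))
```

Any FSA can only simulate this counter up to some fixed bound, so it would necessarily misclassify strings with more a's than it has states to count them.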


Grammars

To specify more complex languages, we introduce the notion of a grammar. Formally, a grammar can be specified by 4 parameters:

  • Σ: a finite alphabet of terminal symbols
  • N: a finite set of non-terminal symbols
  • S: a special symbol S ∈ N called the start symbol
  • R: a set of rules or productions consisting of:

      ∗ a sequence of terminal or non-terminal symbols
      ∗ the symbol ‘→’
      ∗ another sequence of terminal or non-terminal symbols.

For now we’ll focus on a particular class of grammars, so-called context-free grammars, whose rules have the following form:

    A → γ

where A is a single non-terminal element from N and γ is a sequence of symbols belonging to the infinite set (Σ ∪ N)∗.
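The four parameters map naturally onto a small data structure. The following Python sketch is one possible encoding, not the lecture's own: the names `Grammar` and `is_context_free` are illustrative assumptions, and rules are stored as pairs of symbol tuples. The check simply tests whether every rule has the context-free shape A → γ.

```python
from dataclasses import dataclass

# A rule is a pair (left-hand side, right-hand side), each a tuple of symbols.
Rule = tuple[tuple[str, ...], tuple[str, ...]]


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset[str]      # Σ
    nonterminals: frozenset[str]   # N
    start: str                     # S, must be a member of N
    rules: tuple[Rule, ...]        # R


def is_context_free(g: Grammar) -> bool:
    """True iff every rule has the form A -> γ with A in N and γ in (Σ ∪ N)*."""
    symbols = g.terminals | g.nonterminals
    return all(
        len(lhs) == 1
        and lhs[0] in g.nonterminals
        and all(sym in symbols for sym in rhs)
        for lhs, rhs in g.rules
    )


# S -> ε and S -> a S b: both rules have a single non-terminal on the left.
anbn = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    start="S",
    rules=((("S",), ()), (("S",), ("a", "S", "b"))),
)

# a S -> S a has two symbols on the left, so it is not context-free.
not_cf = Grammar(
    terminals=frozenset({"a"}),
    nonterminals=frozenset({"S"}),
    start="S",
    rules=((("a", "S"), ("S", "a")),),
)

if __name__ == "__main__":
    print(is_context_free(anbn), is_context_free(not_cf))
```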


Context-Free Grammars

A couple of examples:

Example 1:

    N = {A, B}    Σ = {a, b, c}

    A → a B
    B → b
    B → c

Example 2:

    N = {NP, D, N, N′, PP, P}    Σ = {cat, hat, the, in}

    NP → D N′       N → cat
    PP → P NP       N → hat
    N′ → N          D → the
    N′ → N PP       P → in

We shall specify grammars by their set of rules and assume that:

  • the left-hand side symbol of the first rule is the start symbol.
  • symbols with uppercase letters are non-terminals
  • symbols with lowercase letters are terminals


Derivations and Trees

What is the formal language specified by a grammar?

  • The language generated by a grammar is the set of strings composed of terminal symbols that can be derived from the grammar’s start symbol by the application of grammar rules.

  • Each sequence of rule applications that produces a string of the language is called a derivation. Derivations are typically represented as trees.

      ∗ the first rule to be applied must have the start symbol on its left-hand side.
      ∗ to apply a rule, we “rewrite” the left symbol with the right sequence.
      ∗ the derivation finishes when only terminal symbols are left.
      ∗ the resulting string of terminal symbols is a string in the language defined by the grammar.


Our earlier grammar allows two possible derivations:

  1. A → a B
  2. B → b
  3. B → c

First we apply rule 1, then rule 2:

      A
     / \
    a   B
        |
        b

First we apply rule 1, then rule 3:

      A
     / \
    a   B
        |
        c

The language defined by the grammar therefore is {ab, ac}.

Derivations can also be represented in “flat” form:

    A ⇒ aB ⇒ ab
    A ⇒ aB ⇒ ac

What kind of strings can be derived with our other example grammar?

  1. NP → D N′
  2. PP → P NP
  3. N′ → N
  4. N′ → N PP
  5. D → the
  6. P → in
  7. N → cat
  8. N → hat

Recall that the start symbol is the left non-terminal of the first rule.
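One way to explore what a grammar derives is to enumerate leftmost derivations mechanically. The sketch below is an illustration under stated assumptions, not the lecture's method: `generate` is a hypothetical helper, rules are (left-hand side, right-hand side) pairs, and the length-based pruning is only sound because neither example grammar has ε-rules. Rule 4 of the second grammar is taken to be N′ → N PP, which is what the regular expression given for this grammar in the lecture requires.

```python
from collections import deque


def generate(rules, start, max_len):
    """Return all terminal strings of at most max_len symbols derivable from start.

    Assumes no rule has an empty right-hand side, so sentential forms never shrink.
    """
    nonterminals = {lhs for lhs, _ in rules}
    results, seen = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Locate the leftmost non-terminal, if there is one.
        idx = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if idx is None:
            results.add(" ".join(form))  # only terminals left: a string of the language
            continue
        for lhs, rhs in rules:
            if lhs == form[idx]:
                new = form[:idx] + tuple(rhs) + form[idx + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return results


grammar1 = [("A", ["a", "B"]), ("B", ["b"]), ("B", ["c"])]

grammar2 = [
    ("NP", ["D", "N'"]),
    ("PP", ["P", "NP"]),
    ("N'", ["N"]),
    ("N'", ["N", "PP"]),  # assumption: rule 4 read as N' -> N PP
    ("D", ["the"]),
    ("P", ["in"]),
    ("N", ["cat"]),
    ("N", ["hat"]),
]

if __name__ == "__main__":
    print(sorted(generate(grammar1, "A", 2)))
    print(sorted(generate(grammar2, "NP", 5)))
```

With a larger `max_len` the second grammar keeps producing longer noun phrases such as "the cat in the hat in the cat", since the NP inside a PP can itself contain a PP.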


CFGs and RegExp/FSAs

Can we use context-free grammars to specify languages that cannot be specified by regular expressions and FSAs?

→ note that our two example grammars are equivalent to regular expressions: a(b|c) and the (cat|hat) (in the (cat|hat))∗, respectively.

  • We’ve seen that the language L = {a^n b^n | n ≥ 0} is not regular.
  • But we can easily specify it with a context-free grammar (CFG):

        S → ε
        S → a S b

When a formalism or grammar can define a formal language that another one cannot define, we say that it has greater generative power or greater complexity.

⇒ CFGs have greater generative power than regular expressions/FSAs
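The two rules S → ε and S → a S b translate almost literally into a recursive membership test. The Python sketch below is illustrative only: `derives_S` is a hypothetical name, and the pattern `a*b*` is included merely as the obvious regular approximation, which accepts too much because it cannot tie the two counts together.

```python
import re


def derives_S(s: str) -> bool:
    """True iff s can be derived from S using S -> ε and S -> a S b."""
    if s == "":
        return True  # rule S -> ε
    if len(s) >= 2 and s[0] == "a" and s[-1] == "b":
        return derives_S(s[1:-1])  # rule S -> a S b: strip one a and one b
    return False


# A regular approximation: any number of a's, then any number of b's.
approx = re.compile(r"a*b*")

if __name__ == "__main__":
    for word in ["", "ab", "aaabbb", "aab", "abb"]:
        print(repr(word), derives_S(word), approx.fullmatch(word) is not None)
```

The recursion nests each a with its matching b, mirroring the tree structure of the derivation; the regular pattern has no way to express that pairing.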


Regular languages are a subset of context-free languages:

  • regular expressions can only describe a limited set of simple languages, the so-called regular languages.

  • CFGs can describe a bigger set of languages: regular and also context-free languages.

Every regular expression or FSA has an equivalent CFG (but the reverse does not hold!)

Regular expression:

    me(o)∗w

FSA (q0 is the start state, q3 the accepting state):

    q0 --m--> q1 --e--> q2 --w--> q3, with a loop q2 --o--> q2

CFG:

    Q0 → m Q1
    Q1 → e Q2
    Q2 → o Q2
    Q2 → w Q3
    Q3 → ε

The above regular expression, FSA, and CFG are equivalent: they all generate the same formal language: {mew, meow, meoow, meooow, . . .}
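The correspondence between the FSA and the grammar can be spelled out as a mechanical translation: one rule per transition, plus an ε-rule per accepting state. The Python sketch below assumes the automaton has transitions q0 -m-> q1, q1 -e-> q2, q2 -o-> q2 and q2 -w-> q3 with q3 accepting, as the grammar rules suggest; `fsa_to_grammar` and `derives` are hypothetical helpers written for this illustration.

```python
import re

# Transition table of the assumed FSA for me(o)*w.
transitions = {
    ("q0", "m"): "q1",
    ("q1", "e"): "q2",
    ("q2", "o"): "q2",
    ("q2", "w"): "q3",
}
accepting = {"q3"}


def fsa_to_grammar(transitions, accepting):
    """One rule Q -> x Q' per transition, plus Q -> ε for each accepting state."""
    rules = [(q.upper(), (sym, q2.upper())) for (q, sym), q2 in transitions.items()]
    rules += [(q.upper(), ()) for q in sorted(accepting)]
    return rules


def derives(rules, nonterminal, s):
    """True iff s can be derived from nonterminal in the right-linear grammar."""
    for lhs, rhs in rules:
        if lhs != nonterminal:
            continue
        if rhs == ():
            if s == "":
                return True  # ε-rule finishes the derivation
        elif s.startswith(rhs[0]) and derives(rules, rhs[1], s[len(rhs[0]):]):
            return True  # consume one terminal, continue from the next non-terminal
    return False


rules = fsa_to_grammar(transitions, accepting)

if __name__ == "__main__":
    for word in ["mew", "meow", "meooow", "me", "meowo"]:
        by_grammar = derives(rules, "Q0", word)
        by_regex = re.fullmatch(r"me(o)*w", word) is not None
        print(word, by_grammar, by_regex)
```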


Right-Linear Grammars

  • Every regular expression or FSA has an equivalent CFG whose rules share the same properties, as in our example:

        me(o)∗w

        Q0 → m Q1
        Q1 → e Q2
        Q2 → o Q2
        Q2 → w Q3
        Q3 → ε

      ∗ the left-hand side of each rule is a single non-terminal symbol
      ∗ the right-hand side of each rule has at most one non-terminal, which (if present) is in the right-most position.

  • CFGs of this kind are called right-linear grammars and are equivalent to regular expressions and FSAs.

  • Right-linear grammars, also called regular grammars, are a subset of CFGs (and, as mentioned, regular languages are a subset of context-free languages).


The Chomsky Hierarchy

Noam Chomsky (1959): “On certain formal properties of grammars”, Information and Control, 2(2):137–167.

Chomsky proposed to classify formal grammars into 4 types, which differ in their generative capacity – in the complexity of the languages they are able to generate.


The different types of grammars within the hierarchy differ with respect to the types of grammar rules they allow:

Let A and B be non-terminal symbols, x be an arbitrary string of terminal symbols, and α, β, γ be arbitrary strings of terminal and non-terminal symbols.

  • Type 0: Recursively enumerable or unrestricted grammars.
    Rule skeleton: α → β (α ≠ ε)

  • Type 1: Context-sensitive grammars.
    Rule skeleton: αAβ → αγβ (γ ≠ ε)

  • Type 2: Context-free grammars.
    Rule skeleton: A → γ

  • Type 3: Regular grammars.
    Rule skeleton: A → xB or A → x
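The rule skeletons can be read as a classification procedure. The Python sketch below is an illustration under stated assumptions: `rule_type` and `grammar_type` are hypothetical helpers, symbols are plain strings, and each rule is assigned the most restrictive type whose skeleton it fits exactly as written above (so A → ε counts as Type 3 with x empty).

```python
def rule_type(lhs, rhs, nonterminals):
    """Return the most restrictive type (3, 2, 1 or 0) whose skeleton the rule fits."""
    lhs, rhs = tuple(lhs), tuple(rhs)
    if not lhs:
        raise ValueError("the left-hand side must not be empty (α ≠ ε)")
    if len(lhs) == 1 and lhs[0] in nonterminals:
        nt_positions = [i for i, s in enumerate(rhs) if s in nonterminals]
        # Type 3: A -> x (no non-terminal) or A -> xB (one non-terminal, right-most).
        if not nt_positions or nt_positions == [len(rhs) - 1]:
            return 3
        return 2  # Type 2: A -> γ
    # Type 1: αAβ -> αγβ with γ non-empty, for some choice of A inside the left side.
    for i, sym in enumerate(lhs):
        if sym in nonterminals:
            alpha, beta = lhs[:i], lhs[i + 1:]
            if (
                len(rhs) >= len(alpha) + len(beta) + 1
                and rhs[:len(alpha)] == alpha
                and rhs[len(rhs) - len(beta):] == beta
            ):
                return 1
    return 0  # Type 0: anything else


def grammar_type(rules, nonterminals):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(lhs, rhs, nonterminals) for lhs, rhs in rules)


if __name__ == "__main__":
    print(rule_type(["Q2"], ["o", "Q2"], {"Q2"}))           # right-linear rule
    print(rule_type(["S"], ["a", "S", "b"], {"S"}))         # context-free rule
    print(rule_type(["a", "A", "b"], ["a", "c", "b"], {"A"}))  # context-sensitive rule
    print(rule_type(["a", "A"], ["A", "a"], {"A"}))         # unrestricted rule
```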


This classification acts as a subsumption hierarchy: the set of languages described by grammars of greater power subsumes the set of languages described by grammars of less power.

⇒ Balance between expressibility and computational efficiency: grammars with less expressive power are computationally more tractable.


Formal Language Theory and Computational Linguistics

The tools to specify formal languages can be used to model aspects of natural languages.

  • we have seen that FSAs and FSTs can be used to model aspects of morphology or spelling: to model which strings of morphemes or letters correspond to actual words in a natural language.

  • from the perspective of formal language theory, syntax can be seen as the sub-field of linguistics which studies which strings of words constitute well-formed sentences of a natural language.

  • what type of formal grammars can be used to describe the syntax of natural languages?


The Pumping Lemma

How to prove that a language is not regular?

Pumping Lemma: Let L be an infinite regular language. Then the following “pumping” condition applies: there must be at least one string s in L which can be segmented into x · y · z, with y non-empty, such that when substring y is “pumped” into y^n (for any n ≥ 0), the string s′ = x · y^n · z also belongs to L. If the pumping condition does not hold, then L is not regular.

Note that the Pumping Lemma can only be used to prove that a language is not regular; it cannot be used to prove that a language is regular. You can easily use the Pumping Lemma to show that the formal language a^n b^n is not regular. . .


Consider L = {a^n b^n | n ≥ 0}. There is no decomposition of the strings in L that satisfies the pumping condition:

  • take the shortest string in L that can be segmented into three non-empty substrings x · y · z, namely aabb:

        a · a · bb     a a^n bb ∉ L
        aa · b · b     aa b^n b ∉ L
        a · ab · b     a (ab)^n b ∉ L

    (taking, for example, n = 2)

  • if the middle substring “is pumped”, the resulting strings don’t belong to L. The same is true for all other strings in L.
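The case analysis above can be checked by brute force for small strings. The Python sketch below is only an illustration: `pumpable_splits` is a hypothetical helper that tries every segmentation x · y · z with y non-empty and pumps y for n = 0 .. `max_n`. Because it tests finitely many values of n, an empty result is conclusive for that string, while a non-empty result is only evidence that the split pumps.

```python
import re


def in_L(s):
    """Membership test for L = {a^n b^n | n >= 0}."""
    n = len(s) // 2
    return s == "a" * n + "b" * n


def pumpable_splits(s, member, max_n=3):
    """Return the splits (x, y, z), y non-empty, with x + y*n + z in the language for n = 0..max_n."""
    good = []
    for i in range(len(s) + 1):
        for j in range(i + 1, len(s) + 1):
            x, y, z = s[:i], s[i:j], s[j:]
            if all(member(x + y * n + z) for n in range(max_n + 1)):
                good.append((x, y, z))
    return good


def in_meow(s):
    """Membership test for the regular language me(o)*w, used as a contrast."""
    return re.fullmatch(r"me(o)*w", s) is not None


if __name__ == "__main__":
    print(pumpable_splits("aabb", in_L))       # no split survives pumping
    print(pumpable_splits("meow", in_meow))    # the o can be pumped
```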


Are Natural Languages Regular?

Let’s assume that the following sentences are grammatical (although hard to understand) and that English allows an indefinite number of these structures, called center-embeddings.

    (1) The cat died.
    (2) The cat that the dog chased died.
    (3) The cat that the dog that the rat bit chased died.
    (4) The cat that the dog that the rat that the duck admired bit chased died.
    . . .

These sentences are of the following form:

    (the Noun) (that the Noun)^n (Verb)^n died

As we did with a^n b^n, we can use the Pumping Lemma to show that an infinite set of sentences of this kind cannot be modelled as a regular language.
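The pattern behind these sentences can be made explicit with a small generator. This Python sketch is purely illustrative (`center_embedded` is a hypothetical helper): each embedded noun is paired with one verb, and the verbs are emitted in reverse order because the innermost clause is closed first, which is the nesting that mirrors a^n b^n.

```python
def center_embedded(nouns, verbs):
    """Build 'the N0 (that the Ni)^n (Vi)^n died', where verbs[i] belongs to nouns[i + 1]."""
    if len(nouns) != len(verbs) + 1:
        raise ValueError("need exactly one more noun than embedded verbs")
    head = f"the {nouns[0]}"
    relatives = "".join(f" that the {noun}" for noun in nouns[1:])
    # The verb of the innermost clause comes first, so reverse the verb order.
    closing = "".join(f" {verb}" for verb in reversed(verbs))
    return f"{head}{relatives}{closing} died"


if __name__ == "__main__":
    print(center_embedded(["cat"], []))
    print(center_embedded(["cat", "dog"], ["chased"]))
    print(center_embedded(["cat", "dog", "rat"], ["chased", "bit"]))
    print(center_embedded(["cat", "dog", "rat", "duck"], ["chased", "bit", "admired"]))
```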


We have seen that L = (the Noun) (that the Noun)^n (Verb)^n died is not regular. But what about English as a whole? We can reason as follows:

Regular languages are closed under intersection. That is, if A and B are regular languages then A ∩ B is also a regular language.

  • L′ = (the Noun) (that the Noun)∗ (Verb)∗ died is a regular language.
  • L = (the Noun) (that the Noun)^n (Verb)^n died can be seen as the intersection of L′ and English.
  • If English were regular, then L would also be.
  • Since L is not, we can conclude that English is not a regular language either.


Are Natural Languages Context-Free?

There has been a lot of discussion regarding whether natural languages are context-free or not.

  • many languages, including English, seem to be context-free,
  • but some languages, like Swiss German, have been proved not to be.

There is another version of the Pumping Lemma that can be used to show that a language is not context-free.

  • since it is rather mathematically complicated, we will not look into the details of this here.

  • suffice it to say that this lemma would allow us to show that the following language is not context-free:

        a^n b^m c^n d^m

  • The key feature that makes this language non-context-free is the cross-serial dependencies it exhibits (as opposed to center-embedding).
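To see what "cross-serial" means operationally, consider a direct membership test for a^n b^m c^n d^m. The Python sketch below is an illustration, with `in_cross_serial` as a hypothetical name. It needs two independent counts that are compared across each other (a's with c's, b's with d's), so the dependencies cross rather than nest as they do in a^n b^n.

```python
import re

# Four blocks in a fixed order; the regular expression checks order only, not counts.
_BLOCKS = re.compile(r"(a*)(b*)(c*)(d*)")


def in_cross_serial(s: str) -> bool:
    """True iff s = a^n b^m c^n d^m for some n, m >= 0."""
    match = _BLOCKS.fullmatch(s)
    if match is None:
        return False  # symbols out of order, or symbols outside {a, b, c, d}
    n_a, n_b, n_c, n_d = (len(group) for group in match.groups())
    # The two dependencies cross: a...c spans over b, and b...d spans over c.
    return n_a == n_c and n_b == n_d


if __name__ == "__main__":
    for word in ["", "abcd", "aabccd", "aabbccd", "abdc"]:
        print(repr(word), in_cross_serial(word))
```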


It has been shown that Swiss German includes constructions that are equivalent to the non-context-free language a^n b^m c^n d^m.

  • the number of accusative NPs must equal the number of verbs selecting for an accusative object,
  • the number of dative NPs must equal the number of verbs selecting for a dative object,
  • the order of NPs and verbs must be the same.

Shieber (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333–343.


Note that this is similar in Dutch. But in this case, since the cross-serial dependencies are not visible at the level of the strings (because there is no case marking), from the point of view of formal language theory these Dutch constructions are equivalent to a^n b^n, which is context-free.

Bresnan, Kaplan, Peters, and Zaenen (1982). Cross-serial dependencies in Dutch. Linguistic Inquiry, 13(4):613–635.

Manaster-Ramer (1987). Dutch as a formal language. Linguistics and Philosophy, 10(2):221–246.


Summing Up

  • Grammars are formalisms to define formal languages.

  • The Chomsky hierarchy classifies grammars according to their generative capacity

      ∗ grammars can be distinguished by the type of rules they allow
      ∗ it is a subsumption hierarchy
      ∗ grammars with more expressive power are computationally more costly

  • The Pumping Lemma can be used to show that a particular language does not belong to a level in the hierarchy.

      ∗ we have only seen the Pumping Lemma for regular languages

  • The syntax of natural languages is not regular and it does not seem to be context-free either.

  • However, many aspects of natural language syntax can be modelled with CFGs or mildly context-sensitive grammars.
