Grammar Scott Farrar CLMA, University of Washington farrar@uw.edu - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Linguistic Structure Formal grammars Treebanks, Grammars, Corpora Practical Grammar Writing Computing, Homework

Grammar

Scott Farrar CLMA, University of Washington farrar@uw.edu January 3, 2010

Scott Farrar CLMA, University of Washington farrar@uw.edu Grammar

slide-2
SLIDE 2

Today’s lecture

1 Linguistic Structure
    Syntax
2 Formal grammars
    Formal language theory
    Context-free grammars
3 Treebanks, Grammars, Corpora
4 Practical Grammar Writing
    Word classes
    Clause/Phrase classes
    Production Rules
    Writing small grammars
5 Computing, Homework

slide-3
SLIDE 3


[continued from Monday’s lecture]


slide-4
SLIDE 4

What is syntax?

Definition: Syntax is the study of how the parts of an utterance are arranged in relation to one another.

Some questions for syntax:
  • Do all languages behave the same way?
  • Can the structure of yet un-analyzed languages be predicted?
  • How is syntax learned by children (with little negative evidence)?

Definition: A basic construct of syntax is the structural description, a structure that shows word order, syntactic constituency, and labels for the constituents.

slide-9
SLIDE 9


Structural description: tree


slide-10
SLIDE 10

Structural description: bracketed structure

[This [is [a simple bracketed structure]]]
...with no labels.

[S This [VP [V is] [NP a simple bracketed structure]]]
The labeled bracketed structure is equivalent to the tree structure.
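The equivalence between a labeled bracketing and a tree can be made concrete with a short sketch. Nothing here comes from the lecture itself: the function names `parse_bracketed` and `leaves`, and the choice to represent a tree as a `(label, children)` pair, are assumptions made for illustration. The sketch also assumes that each label is separated from the following word by whitespace.

```python
def tokenize(text):
    """Split a bracketed string into '[', ']', and label/word tokens."""
    return text.replace("[", " [ ").replace("]", " ] ").split()


def parse_bracketed(text):
    """Parse '[S This [VP ...]]' into nested (label, children) pairs.

    The first token after '[' is read as the constituent label;
    bare words become leaf strings.
    """
    tokens = tokenize(text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ']'
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the closing bracket")
    return tree


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]


tree = parse_bracketed("[S This [VP [V is] [NP a simple bracketed structure]]]")
print(tree[0])                 # label of the root node
print(" ".join(leaves(tree)))  # the original word string
```

Reading the leaves back off the parsed structure recovers the word order, which is one way to see that the bracketing and the tree carry the same information.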

slide-12
SLIDE 12

Tree structures for linguistics

Definition: A tree is a graphical way of representing a structural description; trees for NL are, more precisely, ordered directed trees with nodes, labels, and arcs.

Node: a component or unit of a tree.
Root node: the node with no ancestors (labeled by the start symbol).
Nonterminal node: a node with descendants.
Terminal/Leaf node: a node with no descendants (corresponding to the strings, or “words”, of the language).

slide-17
SLIDE 17

Tree structures for linguistics

Preterminal node: a node with only a single leaf as its descendant. In NL grammar, these are the part-of-speech nodes.
Arc: shows the constituency relation, but is untyped.
Label: a symbol identifying the category of some node.
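The node types defined above can be read directly off a small data structure. The following is a minimal sketch, not part of the lecture: the `Node` class and the `classify` and `walk` helpers are hypothetical names, and the example tree is hand-built for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                                  # category symbol or word
    children: List["Node"] = field(default_factory=list)  # ordered arcs


def classify(node, is_root=False):
    """Name the kind of node, following the definitions above."""
    if is_root:
        return "root"
    if not node.children:
        return "leaf"
    if len(node.children) == 1 and not node.children[0].children:
        return "preterminal"
    return "nonterminal"


def walk(node, is_root=True):
    """Yield (label, kind) pairs top-down, left to right."""
    yield node.label, classify(node, is_root)
    for child in node.children:
        yield from walk(child, is_root=False)


tree = Node("S", [
    Node("NP", [Node("ProperNoun", [Node("Sam")])]),
    Node("VP", [Node("Verb", [Node("understands")])]),
])

for label, kind in walk(tree):
    print(f"{label:12} {kind}")
```

Note that NP here is a plain nonterminal rather than a preterminal: its single child, ProperNoun, is not a leaf. A preterminal is also a nonterminal in the sense that it has a descendant; the sketch simply reports the most specific kind.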

slide-20
SLIDE 20

What is syntax?

Definition: In another sense, syntax is the set of rules by which well-formed utterances are formed. (The term grammar is more general and refers to all aspects of language.)

We can formalize the notion of syntax using ideas from formal language theory.

slide-22
SLIDE 22

Formal language theory

Since natural language is rule-governed, not random, a grammar can be constructed to parse natural language, just as with compilers and machine languages.

Definition: Formal language theory, or the study of the properties of formal languages, gives a firm conceptual framework from which to study natural language.

slide-23
SLIDE 23

Chomsky hierarchy

Definition: The Chomsky Hierarchy describes four classes of formal grammars that generate four corresponding classes of languages.

1 Type 0: Turing equivalent language/grammar
2 Type 1: context sensitive language/grammar
3 Type 2: context free language/grammar
4 Type 3: regular language/grammar

(We only talk about CFGs in Ling571, but see J&M Chap. 16 for an overview.)

slide-28
SLIDE 28

Formal grammar

Definition (Generalized formal grammar): A grammar is defined as G = ⟨N, Σ, P, S⟩ where:
  • N is a set of non-terminal symbols, typically S, A, B, ...
  • Σ is a set of terminals, typically x, y, z, ...
  • P is a set of production rules
  • S is the starting or goal variable from N, i.e., S ∈ N

slide-33
SLIDE 33

Sample grammar

S -> NP VP
NP -> Det Noun
NP -> ProperNoun
VP -> Verb
VP -> Verb NP
Det -> the|a|that
Noun -> lamp|pig|dirt
ProperNoun -> Washington|Sam
Verb -> understands|chases

Washington understands Sam
Sam chases that pig
*understands Sam
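To check which of the three strings the sample grammar accepts, the rules can be written as a dictionary and run through a naive top-down recognizer. This is a sketch under stated assumptions, not the parsing method taught in the course: the names `GRAMMAR`, `match`, `match_sequence`, and `accepts` are hypothetical, and the approach only terminates because this particular grammar has no left-recursive rules.

```python
# The sample grammar: each symbol maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Noun"], ["ProperNoun"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "Det": [["the"], ["a"], ["that"]],
    "Noun": [["lamp"], ["pig"], ["dirt"]],
    "ProperNoun": [["Washington"], ["Sam"]],
    "Verb": [["understands"], ["chases"]],
}


def match(symbol, words, start):
    """Yield every position where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:  # a terminal: must equal the next word
        if start < len(words) and words[start] == symbol:
            yield start + 1
        return
    for rhs in GRAMMAR[symbol]:  # a non-terminal: try each production
        yield from match_sequence(rhs, words, start)


def match_sequence(symbols, words, start):
    """Yield every end position for a sequence of symbols beginning at `start`."""
    if not symbols:
        yield start
        return
    for middle in match(symbols[0], words, start):
        yield from match_sequence(symbols[1:], words, middle)


def accepts(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.split()
    return any(end == len(words) for end in match("S", words, 0))


for sentence in ["Washington understands Sam",
                 "Sam chases that pig",
                 "understands Sam"]:
    print(sentence, "->", accepts(sentence))
```

The starred string fails because no NP rule can consume "understands", so S → NP VP never gets started.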

slide-37
SLIDE 37

Context-free grammar

Definition: A CFG is formally defined as G = ⟨N, Σ, P, S⟩ where:
  • N is a set of non-terminal symbols, typically S, A, B, ...
  • S is the starting or goal symbol from N, i.e., S ∈ N
  • Σ is a set of terminal symbols, typically x, y, z, ..., disjoint from N
  • P is a set of production rules

The productions P are of the form A → β, where:
  • A is a non-terminal, A ∈ N
  • β is a string of symbols from (Σ ∪ N)
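The conditions in this definition can be checked mechanically. The sketch below is illustrative only: the `CFG` class, its field names, and the `problems` method are assumptions introduced here, and the two toy grammars are made up for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Production = Tuple[str, Tuple[str, ...]]  # (A, beta) for a rule A -> beta


@dataclass(frozen=True)
class CFG:
    nonterminals: FrozenSet[str]          # N
    terminals: FrozenSet[str]             # Sigma
    productions: Tuple[Production, ...]   # P
    start: str                            # S

    def problems(self):
        """Return a list of ways this 4-tuple violates the definition."""
        found = []
        if self.start not in self.nonterminals:
            found.append(f"start symbol {self.start!r} is not in N")
        overlap = self.nonterminals & self.terminals
        if overlap:
            found.append(f"N and Sigma overlap: {sorted(overlap)}")
        vocabulary = self.nonterminals | self.terminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                found.append(f"LHS {lhs!r} is not a non-terminal")
            for symbol in rhs:
                if symbol not in vocabulary:
                    found.append(f"RHS symbol {symbol!r} is in neither N nor Sigma")
        return found


rules = (
    ("S", ("NP", "VP")),
    ("NP", ("Sam",)),
    ("VP", ("chases", "NP")),
)
good = CFG(frozenset({"S", "NP", "VP"}), frozenset({"Sam", "chases"}), rules, "S")
bad = CFG(frozenset({"S", "NP", "VP"}), frozenset({"Sam", "chases"}), rules, "Top")

print(good.problems())
print(bad.problems())
```

Requiring the left-hand side to be exactly one non-terminal is what makes the grammar context-free: a rule applies to A regardless of the symbols around it.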

slide-45
SLIDE 45

Unpacking the definition

  • A nonterminal symbol labels a syntactic part (constituent): NP, VP, PP, (Noun, Verb, Det)
  • A starting symbol indicates which symbol has to come first; it labels the largest constituent or biggest part: S, Root, or Top
  • A terminal symbol labels the smallest part, the actual strings of the language: man, they, swim

slide-48
SLIDE 48

Unpacking the definition

  • A production rule (a.k.a. re-write rule) indicates when one symbol is to be rewritten (→) as one or more others: NP → Det Noun
  • The resulting symbols are, thus, derived from the parent.
  • A production rule captures the notion of syntactic constituency.
  • ‘LHS’ is used to indicate the left-hand side of the →, and likewise for ‘RHS’.
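Rewriting can be shown step by step as a derivation. The sketch below is an illustration rather than anything from the lecture: `rewrite_leftmost` is a hypothetical helper that always rewrites the leftmost non-terminal, and the rule sequence is chosen by hand using rules from the sample grammar shown earlier.

```python
def rewrite_leftmost(form, nonterminals, lhs, rhs):
    """Apply the rule lhs -> rhs to the leftmost non-terminal in `form`."""
    for i, symbol in enumerate(form):
        if symbol in nonterminals:
            if symbol != lhs:
                raise ValueError(f"leftmost non-terminal is {symbol}, not {lhs}")
            # Replace the LHS symbol with the RHS symbols, keeping the rest.
            return form[:i] + list(rhs) + form[i + 1:]
    raise ValueError("no non-terminal left to rewrite")


NONTERMINALS = {"S", "NP", "VP", "Det", "Noun", "Verb"}

# A hand-picked sequence of (LHS, RHS) rules.
steps = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "Noun"]),
    ("Det", ["the"]),
    ("Noun", ["pig"]),
    ("VP", ["Verb"]),
    ("Verb", ["chases"]),
]

form = ["S"]
history = [form]
for lhs, rhs in steps:
    form = rewrite_leftmost(form, NONTERMINALS, lhs, rhs)
    history.append(form)

for line in history:
    print(" ".join(line))
```

Each printed line is one sentential form; the derivation stops when only terminals remain.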

slide-51
SLIDE 51

Is this a valid CFG?

NP -> (Det) Nom
Nom -> (Adj) Noun
VP -> VB NP
Det -> the | a
Noun -> colonel | chicken
Adj -> fried | baked
VB -> ate | likes

the colonel ate fried chicken

Needs a start symbol and associated rule: S → NP VP

slide-53
SLIDE 53

Grammars and treebanks

Definition: A linguistic corpus is a collection of naturally occurring human language. A treebank is a linguistic corpus annotated for syntactic structure.

Thus, a treebank implicitly contains a grammar (or grammars) of the language it contains.

Examples of treebanks:
  • The Penn Treebank is a tagged, parsed corpus of English (the most well-known treebank)
  • Penn Chinese Treebank Project
  • The Tübingen Treebank of Written German
  • Arabic Treebank
  • Korean Treebank
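One way to see how a treebank implicitly contains a grammar is to read a production rule off every non-leaf node of every tree. The sketch below assumes a tiny hand-built "treebank" of two trees, each written as a `(label, children)` pair with plain strings as leaves; the names `productions` and `TOY_TREEBANK` are hypothetical, and real treebank files would first need to be parsed into such a structure.

```python
from collections import Counter


def productions(tree):
    """Yield one (lhs, rhs) rule for every non-leaf node of a tree."""
    label, children = tree
    # The RHS is the sequence of child labels (or words, for leaves).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


# Two hand-built trees standing in for annotated treebank sentences.
TOY_TREEBANK = [
    ("S", [("NP", [("ProperNoun", ["Sam"])]),
           ("VP", [("Verb", ["chases"]),
                   ("NP", [("Det", ["that"]), ("Noun", ["pig"])])])]),
    ("S", [("NP", [("ProperNoun", ["Washington"])]),
           ("VP", [("Verb", ["understands"]),
                   ("NP", [("ProperNoun", ["Sam"])])])]),
]

rule_counts = Counter(
    rule for tree in TOY_TREEBANK for rule in productions(tree)
)

for (lhs, rhs), count in sorted(rule_counts.items()):
    print(f"{count}  {lhs} -> {' '.join(rhs)}")
```

The distinct rules form a grammar that covers the annotated sentences, and the counts show how often each rule was used.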

slide-59
SLIDE 59

Linguistic Structure Formal grammars Treebanks, Grammars, Corpora Practical Grammar Writing Computing, Homework

Corpora on Patas

stiv@patas:/corpora$ ls birkbeck Conll europarl-old JRC-Acquis.3.0 nltk treebanks coconut europarl ICAME LDC TREC wordnet

Scott Farrar CLMA, University of Washington farrar@uw.edu Grammar

slide-60
SLIDE 60

Linguistic Structure Formal grammars Treebanks, Grammars, Corpora Practical Grammar Writing Computing, Homework

Corpora on Patas

    stiv@patas:/corpora$ ls
    birkbeck  Conll     europarl-old  JRC-Acquis.3.0  nltk  treebanks
    coconut   europarl  ICAME         LDC             TREC  wordnet

  • The largest collection is the Linguistic Data Consortium (LDC) corpora.
  • The NLTK also comes with many corpus fragments.
  • See the compling database to search for specific corpora: https://pongo.ling.washington.edu/db/index.php
  • Perform a search for ‘Treebank’.

slide-61
SLIDE 61

Penn Treebank (2 and 3)

    cd /corpora/LDC/LDC99T42/RAW                      (see readme.1st)
    cd /corpora/LDC/LDC99T42/RAW/parsed/mrg/wsj/04    (see wsj_0432.mrg)

slide-62
SLIDE 62

Word classes

The number of word classes (pre-terminals) depends on the task and on how finely you want to cut the pie. (The tagged Brown corpus has 87 pre-terminal tags; the Penn Treebank uses a 49-item pre-terminal tagset.) There’s no right answer for NLP.

Definition: Closed class word: a function word in a grammar; there are relatively few of these in a language, though their frequency is very high.

Definition: Open class word: a content word in a grammar; there is an open-ended set of these, but their frequencies may be very low (cf. home with octogenarian).
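The frequency contrast between the two classes can be checked with a quick count. The sketch below is only illustrative: the closed-class list is a tiny hand-picked assumption, and the text is a made-up toy string rather than corpus data.

```python
from collections import Counter

# Hypothetical, deliberately tiny closed-class list; a real one would be longer.
CLOSED_CLASS = {"the", "a", "an", "of", "in", "and", "it", "he", "she", "to", "was", "by"}

# Toy text invented for illustration.
toy_text = (
    "the fisherman caught the fish and the octogenarian "
    "was bitten by a tiger in the home of the fisherman"
)
counts = Counter(toy_text.split())

# Tokens = running words; types = distinct words.
closed_tokens = sum(n for w, n in counts.items() if w in CLOSED_CLASS)
open_tokens = sum(n for w, n in counts.items() if w not in CLOSED_CLASS)
closed_types = sum(1 for w in counts if w in CLOSED_CLASS)
open_types = sum(1 for w in counts if w not in CLOSED_CLASS)

print("closed class: %d tokens over %d types" % (closed_tokens, closed_types))
print("open class:   %d tokens over %d types" % (open_tokens, open_types))
```

On a real corpus the gap is far larger than in a two-line toy text, since the set of open-class types keeps growing while the closed-class list stays fixed.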

slide-66
SLIDE 66

Nouns

Recall the grade school definition:

Definition: A noun is a person, place, thing, or idea.

“You shall know a word by the company it keeps.” J. R. Firth (d. 1960)

In other words, syntactic word categories are defined based on their distribution:

Definition: Noun is a class of lexical items that occur after determiners (the, a, ...) or adjectives, and can be subjects of sentences. Nouns often represent a person, place, thing, or idea.

slide-70
SLIDE 70

Nouns

NN    a singular common noun, occurring after adjectives and determiners
      the [NN fisherman] caught it
NNS   a plural common noun, occurring alone or after adjectives and determiners
      [NNS fish] swim well
NNP   a proper noun or name, occurring alone in a noun phrase; does not (usually) occur after a determiner
      [NNP Jack] knows
NNPS  a plural proper noun
      the [NNPS Simpsons] know the [NNP Jones]

slide-74
SLIDE 74

Verbs

Definition: A verb describes states or events.

The forms of English verbs predict where they will occur. Consider these verb labels (based on the WSJ corpus):

VBD   a past tense form; occurs alone
      the Earl [VBD ate] a sandwich
VBZ   a third person form; occurs after a singular (pro)noun
      she [VBZ runs] two marathons a year
VBN   a participle form; occurs after was, were, has, had, have, got, get, etc.
      he was [VBN bitten] by a tiger

slide-78
SLIDE 78

Adjectives

Definition: Adjectives ascribe properties to nouns. They occur before nouns or after verbs.

JJ    a simple adjective
      the [JJ metamorphic] rock, the rock is [JJ metamorphic]
JJR   a comparative adjective
      the [JJR bigger] rock
JJS   a superlative adjective
      the [JJS biggest] one

slide-79
SLIDE 79

Adverbs

Definition: Adverbs modify verbs (and adjectives) to specify time, manner, place, or direction of the event.

RB    an adverb; can occur around the verb phrase or at the beginning/end of the clause
      fast, quickly, really, here,

slide-80
SLIDE 80

Other categories

DT    determiner          a(n), the, that, those
MD    modal               do, can, may
PRP   pronoun             she, her, him, he, we
EX    existential there   there are many fish
CD    cardinal number     one, two, three

(see list in front cover of J&M)
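How fine to cut the pie is a choice you can change after tagging. The sketch below collects the tags from the preceding slides in a lookup table and collapses them into coarser classes by prefix. The glosses paraphrase the slides; the grouping scheme and the name `coarse_class` are assumptions made for illustration, not part of any standard tagset.

```python
# Tags and glosses paraphrased from the slides.
PTB_TAGS = {
    "NN": "singular common noun",
    "NNS": "plural common noun",
    "NNP": "proper noun",
    "NNPS": "plural proper noun",
    "VBD": "verb, past tense form",
    "VBZ": "verb, third person form",
    "VBN": "verb, participle form",
    "JJ": "adjective",
    "JJR": "comparative adjective",
    "JJS": "superlative adjective",
    "RB": "adverb",
    "DT": "determiner",
    "MD": "modal",
    "PRP": "pronoun",
    "EX": "existential there",
    "CD": "cardinal number",
}

# Assumed grouping: tags sharing a prefix fall into one coarse class.
COARSE_PREFIXES = (("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"), ("RB", "adverb"))


def coarse_class(tag):
    """Map a fine-grained tag to a coarse word class."""
    for prefix, name in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return PTB_TAGS.get(tag, "unknown")


# Hand-tagged toy sequence for illustration.
tagged = [("the", "DT"), ("fisherman", "NN"), ("caught", "VBD"), ("two", "CD"), ("fish", "NNS")]
coarse = [(word, coarse_class(tag)) for word, tag in tagged]
print(coarse)
print(len(PTB_TAGS), "fine tags ->", len({coarse_class(t) for t in PTB_TAGS}), "coarse classes")
```

A coarser tagset means fewer pre-terminals and fewer rules, at the cost of distinctions (such as number on nouns) that some tasks need.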

slide-81
SLIDE 81

Other common abbreviations

Symbol   Meaning
Det      determiner
Noun     noun
Nom      nominal
Pro      pronoun
Aux      auxiliary
Card     cardinal number
Ord      ordinal number
Quant    quantifier
NP       noun phrase
VP       verb phrase
AP       adjective phrase
PP       prepositional phrase

slide-82
SLIDE 82

PTB phrase types

NP: noun phrase, including all constituents that depend on the noun head
VP: verb phrase, including all constituents that depend on the verb head
PP: prepositional phrase
ADJP: adjective phrase, headed by an adjective
ADVP: adverb phrase, headed by an adverb
CONJP: used to mark multi-word conjunctions
QP: quantifier phrase, used inside NPs
. . .

slide-83
SLIDE 83

PTB Clause types

The number of non-terminals (excluding pre-terminals) is generally small. In the Penn Treebank, there are, for example, 29 basic tags for syntactic constituents, including 5 basic clause types and 21 phrase-level constituents.

S       declaratives, passives, imperatives, questions with declarative order, (embedded) infinitive clauses, gerund clauses
SINV    inverted clauses
SBAR    relative and subordinate clauses
SBARQ   Wh-questions
SQ      Y/N-questions, inside SBARQ
S-CLF   it-cleft clauses
FRAG    stand-alone clauses, phrases without a predicate argument structure

slide-84
SLIDE 84

Rules

Practical treebanks have a large number of rules.

  • The Penn Treebank has more than 17,000 rules!
  • Most of them are flat, and tailored for very specific sentences.
  • The number of rules seems to grow at a constant rate as the corpus grows. Not good, but we’re stuck with it (see the Gaizauskas paper).
  • The largest number of rules is for S, NP, and VP, but with reduced ambiguity.
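One way to watch the rule inventory grow is to count distinct phrase-level rules after each new tree. The sketch below does this for four hand-built toy trees, stored as nested tuples. The trees and the helper name `phrase_rules` are assumptions for illustration, so the numbers say nothing about the Penn Treebank itself.

```python
# Each tree is (label, child, child, ...); a pre-terminal is (tag, word).
toy_treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "fisherman")), ("VP", ("VBD", "slept"))),
    ("S", ("NP", ("NNP", "Jack")), ("VP", ("VBZ", "knows"))),
    ("S", ("NP", ("DT", "the"), ("JJ", "big"), ("NN", "rock")), ("VP", ("VBD", "fell"))),
    ("S", ("NP", ("PRP", "she")),
          ("VP", ("VBZ", "runs"), ("NP", ("CD", "two"), ("NNS", "marathons")))),
]


def phrase_rules(tree):
    """Yield (lhs, rhs) for every phrase-level node, skipping lexical rules."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return  # pre-terminal over a word: not a phrase-level rule
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from phrase_rules(child)


seen = set()
growth = []  # number of distinct rules after each additional tree
for tree in toy_treebank:
    seen.update(phrase_rules(tree))
    growth.append(len(seen))

print(growth)
```

Note that the flat rule NP -> DT JJ NN counts as new even though NP -> DT NN is already known, which is one reason flat annotation inflates the rule count.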

slide-85
SLIDE 85

Strategy

The task in grammar writing is to choose the best elements for N (the non-terminals) and P (the production rules). Consider constructing a grammar for named entities (noun phrases), or for a large corpus of sentences (1M+ words).

1 Settle on a tagset for pre-terminals (part of speech)
2 Tag data for part of speech
3 Identify larger clause patterns; come up with tags
4 Identify each phrase type; come up with tags
5 Fill in details for each phrase type
6 Identify major clause types
7 Address problematic cases
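As a small illustration of choosing N and P, the sketch below writes a toy grammar as Python data and checks whether a sentence can be derived from S. The rules, the lexicon, and the names `spans` and `accepts` are all assumptions made for this example. The recognizer is a naive top-down search that would loop forever on a left-recursive rule such as NP -> NP PP.

```python
# P: production rules for the phrase-level non-terminals (N = the dict keys).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"], ["NNP"], ["PRP"]],
    "VP": [["VBD", "NP"], ["VBZ"]],
}

# Pre-terminal assignments; one tag per word keeps the example simple.
LEXICON = {
    "the": "DT", "fisherman": "NN", "caught": "VBD",
    "it": "PRP", "Jack": "NNP", "knows": "VBZ",
}


def spans(symbol, tokens, start):
    """Yield each end index such that tokens[start:end] derives from symbol."""
    if symbol not in GRAMMAR:  # pre-terminal: must match the next word's tag
        if start < len(tokens) and LEXICON.get(tokens[start]) == symbol:
            yield start + 1
        return
    for rhs in GRAMMAR[symbol]:
        ends = [start]
        for child in rhs:  # thread every possible end index through the rule
            ends = [e2 for e in ends for e2 in spans(child, tokens, e)]
        yield from ends


def accepts(sentence):
    """True if the whole sentence can be derived from S."""
    tokens = sentence.split()
    return len(tokens) in spans("S", tokens, 0)


for s in ["the fisherman caught it", "Jack knows", "caught the fisherman"]:
    print(s, "->", accepts(s))
```

Steps 1 and 2 of the strategy correspond to fixing `LEXICON`; steps 3 through 6 correspond to filling in `GRAMMAR`.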

slide-86
SLIDE 86

Instructions for homework

[see hw1 pdf on website]
