A Bayesian Approach to Learning the Structure of Human Languages (PowerPoint presentation)


SLIDE 1

A Bayesian Approach to Learning the Structure of Human Languages

Phil Blunsom

University of Oxford

SLIDE 2

Grammar Induction

The proposal would undermine effectiveness managers contend

?

Grammar Induction research pursues two main aims:

  • to produce testable models of human language acquisition,
  • to implement unsupervised parsers capable of reducing the reliance on annotated treebanks in Natural Language Processing.

SLIDE 3

Language Acquisition

The proposal would undermine effectiveness managers contend

?

  • The empirical success or otherwise of weak-bias models of grammar induction impacts on the viability of the Argument from the Poverty of the Stimulus.
  • This contrasts with the strong-bias hypothesis of Universal Grammar.

SLIDE 4

Machine Translation

Grammar Induction for Machine Translation

[Figure: the sentence "I wanted to read this book".]

SLIDE 5

Machine Translation

Learn the syntactic part-of-speech categories of words

[Figure: the tagged sentence I/PRP wanted/Verb to/TO read/Verb this/DT book/Noun.]

SLIDE 6

Machine Translation

Learn the grammatical structure of the sentences

[Figure: a dependency structure over I/PRP wanted/Verb to/TO read/Verb this/DT book/Noun.]

SLIDE 7

Machine Translation

Learn syntactic reorderings from Subject-Verb-Object to Subject-Object-Verb

[Figure: the tagged sentence reordered from Subject-Verb-Object to Subject-Object-Verb order.]

SLIDE 8

Machine Translation

Learn to translate

[Figure: "I wanted to read this book" aligned with its German translation "Ich wollte dieses Buch lesen".]

SLIDE 9

Dependency Grammar Induction

Of/IN course/NN the/DT health/NN of/IN the/DT economy/NN will/MD be/VB threatened/VBN if/IN the/DT market/NN continues/VBZ to/TO dive/VB this/DT week/NN

The Dependency Grammar formalism has provided one of the most promising avenues for this research.

SLIDE 10

Dependency Grammar Induction

Of/IN course/NN the/DT health/NN of/IN the/DT economy/NN will/MD be/VB threatened/VBN if/IN the/DT market/NN continues/VBZ to/TO dive/VB this/DT week/NN

We induce two probabilistic models:

  1. a model of the syntactic part-of-speech categories of the tokens (Noun, Verb, etc.),
  2. a model of the dependency derivations of the text given these syntactic categories.

SLIDE 11

Weak Bias: Power Laws

SLIDE 12

Weak Bias: Pitman-Yor Process Priors

In a Pitman-Yor Process (PYP) unigram language model, words (w_1, ..., w_n) are generated as follows:

G | a, b, P_0 ∼ PYP(a, b, P_0)
w_i | G ∼ G

  • G is a distribution over an infinite set of words,
  • P_0 is the probability that a word will be in the support of G,
  • a and b control the power-law behaviour of the PYP.

One way of understanding the predictions made by the PYP model is through the Chinese Restaurant Process (CRP) ...

SLIDE 13

The Chinese Restaurant Process

[Figure: the first customer, 'the', enters an empty restaurant (n0 = 0).]

Customers (words) enter a restaurant and choose a table, one of the K occupied tables or a new one, according to the distribution:

\[
P(z_i = k \mid w_i = w, \mathbf{z}^{-}) \propto
\begin{cases}
1_{w_k}(w)\,(n^{-}_{k} - a) & 0 \le k < K \\
(Ka + b)\,P_0(w) & k = K \ \text{(new table)}
\end{cases}
\]

where 1_{w_k}(w) is 1 if table k already serves word w (and 0 otherwise), and n^-_k is the number of customers seated at table k.
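To make the seating rule concrete, here is a minimal Python sketch of the choice above (not from the talk; the function names, the toy customer sequence and the values a = 0.5, b = 1.0 are illustrative assumptions):

    import random

    def seating_weights(word, tables, a, b, p0):
        """Unnormalised seating weights for a customer labelled `word`.
        `tables` is a list of [label, count] pairs for the K occupied tables;
        a table already serving `word` gets weight (count - a), and the
        new table gets weight (K*a + b) * p0(word)."""
        K = len(tables)
        weights = [(count - a) if label == word else 0.0
                   for label, count in tables]
        weights.append((K * a + b) * p0(word))   # open a new table
        return weights

    def seat(word, tables, a, b, p0):
        """Sample a table for `word` and update the seating in place."""
        weights = seating_weights(word, tables, a, b, p0)
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([word, 1])
        else:
            tables[k][1] += 1
        return k

    # toy run over the slides' customer sequence, with a uniform base
    # distribution over a 1,000-word vocabulary
    tables = []
    for w in ["the", "cats", "cats", "the", "the", "meow", "the"]:
        seat(w, tables, a=0.5, b=1.0, p0=lambda _: 1.0 / 1000)
    print(tables)   # e.g. [['the', 3], ['cats', 2], ['the', 1], ['meow', 1]]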

SLIDE 14

The Chinese Restaurant Process

[Figure: 'the' is seated at table 0 (n0 = 1); the next customer, 'cats', enters and may open a new table (n1 = 0). The same seating distribution applies at every step.]

SLIDE 15

The Chinese Restaurant Process

[Figure: 'cats' opens table 1: 'the' (n0 = 1), 'cats' (n1 = 1).]

SLIDE 16

The Chinese Restaurant Process

[Figure: a second 'cats' joins table 1: 'the' (n0 = 1), 'cats' (n1 = 2); the next customer, 'the', enters.]

SLIDE 17

The Chinese Restaurant Process

[Figure: the second 'the' joins table 0: 'the' (n0 = 2), 'cats' (n1 = 2); another 'the' enters and may open a new table (n2 = 0).]

SLIDE 18

The Chinese Restaurant Process

[Figure: the third 'the' opens table 2: 'the' (n0 = 2), 'cats' (n1 = 2), 'the' (n2 = 1); the customer 'meow' enters and may open a new table (n3 = 0).]

SLIDE 19

The Chinese Restaurant Process

[Figure: seating after six customers: 'the' (n0 = 2), 'cats' (n1 = 2), 'the' (n2 = 1), 'meow' (n3 = 1).]

The 7th customer, 'the', enters the restaurant and chooses a table from those already seating 'the', or opens a new table:

P(z_6 = 0 | w_6 = the, z^{-6}) ∝ 2 − a

SLIDE 20

The Chinese Restaurant Process

[Figure: the same seating as above.]

The 7th customer, 'the', may instead join table 2:

P(z_6 = 2 | w_6 = the, z^{-6}) ∝ 1 − a

SLIDE 21

The Chinese Restaurant Process

[Figure: the same seating as above, plus a potential new table (n4 = 0).]

Or the 7th customer, 'the', may open a new table:

P(z_6 = 4 | w_6 = the, z^{-6}) ∝ (4a + b) P_0(the)

SLIDE 22

Outline

  1. Inducing the syntactic categories of words
  2. Inducing the syntactic structure of sentences

SLIDE 23

Unsupervised PoS Tagging

A/DT simple/JJ example/NN   ↔   A/5 simple/6 example/1

Unsupervised part-of-speech tagging aims to learn a partitioning of tokens corresponding to syntactic equivalence classes.

SLIDE 24

Unsupervised PoS Tagging

A/DT simple/JJ example/NN   ↔   A/5 simple/6 example/1

Previous research has followed two paradigms:

  • word class induction, popular for language modelling and Machine
  • Translation. All tokens of a type must have the same class.
  • syntactic models, generally based on HMMs, allow multiple tags per

type and evaluate against an annotated treebank. For both paradigms most models optimise the likelihood of the training corpus, though more recently Bayesian approaches have become popular.

Inducing the syntactic categories of words 10/35

slide-25
SLIDE 25

A Hierarchical Pitman-Yor HMM

[Figure: trigram HMM graphical model: tags t1, t2, t3, ... emit words w1, w2, w3, ..., with transition parameters Tri_ij and emission parameters Em_j.]

t_l | t_{l-1}, t_{l-2}, Tri ∼ Tri_{t_{l-1}, t_{l-2}}
w_l | t_l, Em ∼ Em_{t_l}

SLIDE 26

A Hierarchical Pitman-Yor HMM

[Figure: the same graphical model, with the trigram transition distributions backing off to bigram parameters B_j.]

Tri_ij | a_Tri, b_Tri, B_j ∼ PYP(a_Tri, b_Tri, B_j)
Em_j | a_Em, b_Em ∼ PYP(a_Em, b_Em, Uniform)

SLIDE 27

A Hierarchical Pitman-Yor HMM

[Figure: the full back-off chain: trigram Tri_ij → bigram B_j → unigram Uni → Uniform.]

B_j | a_Bi, b_Bi, Uni ∼ PYP(a_Bi, b_Bi, Uni)
Uni | a_Uni, b_Uni ∼ PYP(a_Uni, b_Uni, Uniform)
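The hierarchy can be read as a stack of CRP restaurants in which each level uses the predictive distribution of the level below as its base distribution. A minimal Python sketch of that back-off (not the talk's implementation; the class, its attributes and the hyperparameter values are assumptions, and the bookkeeping that actually seats customers is omitted):

    from collections import defaultdict

    class PYPRestaurant:
        """Predictive distribution of a single PYP/CRP restaurant."""
        def __init__(self, a, b, base):
            self.a, self.b, self.base = a, b, base   # discount, strength, back-off
            self.counts = defaultdict(int)           # customers per tag
            self.tables = defaultdict(int)           # tables per tag
            self.total_count = 0
            self.total_tables = 0

        def prob(self, tag):
            # customers already seated (minus the discount) plus the back-off
            # term weighted by (a * total_tables + b)
            return ((self.counts[tag] - self.a * self.tables[tag]
                     + (self.a * self.total_tables + self.b) * self.base(tag))
                    / (self.total_count + self.b))

    NUM_TAGS = 50
    uni = PYPRestaurant(0.5, 1.0, base=lambda t: 1.0 / NUM_TAGS)   # Uni -> Uniform
    bi, tri = {}, {}                                 # one restaurant per context

    def transition_prob(tag, prev2, prev1):
        """P(t_l = tag | t_{l-2}, t_{l-1}) via the Tri -> Bi -> Uni chain."""
        b_rest = bi.setdefault(prev1, PYPRestaurant(0.5, 1.0, uni.prob))
        t_rest = tri.setdefault((prev2, prev1),
                                PYPRestaurant(0.5, 1.0, b_rest.prob))
        return t_rest.prob(tag)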

SLIDE 28

Unsupervised PoS Tagging

We perform inference in this model using Gibbs sampling, an MCMC technique:

  • the tagging of one token, conditioned on all others, is considered at each sampling step,
  • we employ a hierarchical Chinese Restaurant analogy in which trigrams are considered as customers sitting at restaurant tables (a minimal sketch of the sampling loop follows).
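A minimal sketch of one such sweep (illustrative only; cond_prob, remove and add are assumed helpers that score a candidate tag at position l and update the restaurant counts, they are not part of the talk):

    import random

    def gibbs_sweep(tags, tagset, cond_prob, remove, add):
        """Resample every token's tag, one at a time, given all the others."""
        for l in range(len(tags)):
            remove(l, tags[l])                           # unseat the old customers
            weights = [cond_prob(l, t) for t in tagset]  # trigram windows x emission
            tags[l] = random.choices(tagset, weights=weights)[0]
            add(l, tags[l])                              # reseat with the new tag
        return tags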

SLIDE 29

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/?

[Figure: the restaurant Tri(DT,JJ), with tables serving JJ, JJ, NNP, NN and NNS; the tag of 'example' has been removed and is about to be resampled.]

SLIDE 30

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The probability of seating the sampled tag NN at an existing table of the Tri(DT,JJ) restaurant:

\[
P_{\mathrm{Tri}}(t_l = \mathrm{NN},\, z_l \le \mathrm{tables} \mid z^{-l}, t^{-l}) \propto
\frac{\mathrm{count}^{-}_{(DT,JJ,NN)} - a_{\mathrm{Tri}}\,\mathrm{tables}^{-}_{(DT,JJ,NN)}}{\mathrm{count}^{-}_{(DT,JJ)} + b_{\mathrm{Tri}}}
\]

[Figure: the Tri(DT,JJ) restaurant with tables serving JJ, JJ, NNP, NN and NNS; the new NN customer joins an existing NN table.]

SLIDE 31

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The probability of the NN customer opening a new table in the Tri(DT,JJ) restaurant, with its label drawn from the bigram back-off model:

\[
P_{\mathrm{Tri}}(t_l = \mathrm{NN},\, z_l = \mathrm{tables}+1 \mid z^{-l}, t^{-l}) \propto
\frac{\left(a_{\mathrm{Tri}}\,\mathrm{tables}^{-}_{(DT,JJ)} + b_{\mathrm{Tri}}\right) P_{\mathrm{Bi}}(\mathrm{NN} \mid z^{-l}, t^{-l})}{\mathrm{count}^{-}_{(DT,JJ)} + b_{\mathrm{Tri}}}
\]

[Figure: the Tri(DT,JJ) restaurant with a new table opened for NN.]

SLIDE 32

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The bigram back-off probability of NN joining an existing table of the Bi(JJ) restaurant:

\[
P_{\mathrm{Bi}}(t_l = \mathrm{NN},\, z_l \le \mathrm{tables} \mid z^{-l}, t^{-l}) \propto
\frac{\mathrm{count}^{-}_{(JJ,NN)} - a_{\mathrm{Bi}}\,\mathrm{tables}^{-}_{(JJ,NN)}}{\mathrm{count}^{-}_{(JJ)} + b_{\mathrm{Bi}}}
\]

[Figure: the Tri(DT,JJ) restaurant and the Bi(JJ) back-off restaurant (tables serving JJ, NN, NNP and NNS); the NN customer joins an existing NN table in Bi(JJ).]

SLIDE 33

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

The probability of NN opening a new table in the Bi(JJ) restaurant, with its label drawn from the unigram back-off model:

\[
P_{\mathrm{Bi}}(t_l = \mathrm{NN},\, z_l = \mathrm{tables}+1 \mid z^{-l}, t^{-l}) \propto
\frac{\left(a_{\mathrm{Bi}}\,\mathrm{tables}^{-}_{(JJ)} + b_{\mathrm{Bi}}\right) P_{\mathrm{Uni}}(\mathrm{NN} \mid z^{-l}, t^{-l})}{\mathrm{count}^{-}_{(JJ)} + b_{\mathrm{Bi}}}
\]

[Figure: the Tri(DT,JJ) and Bi(JJ) restaurants, with a new NN table opened in Bi(JJ).]

SLIDE 34

A Hierarchical Pitman-Yor HMM

A/DT simple/JJ example/NN

Finally, the unigram probability of NN joining an existing table of the Uni restaurant:

\[
P_{\mathrm{Uni}}(t_l = \mathrm{NN},\, z_l \le \mathrm{tables} \mid z^{-l}, t^{-l}) \propto
\frac{\mathrm{count}^{-}_{\mathrm{NN}} - a_{\mathrm{Uni}}\,\mathrm{tables}^{-}_{\mathrm{NN}}}{\mathrm{count}^{-} + b_{\mathrm{Uni}}}
\]

[Figure: the Tri(DT,JJ), Bi(JJ) and Uni restaurants.]

SLIDE 35

PYP HMM Results

Model                                                 Many-to-1
Dirichlet Bigram HMM (Goldwater and Griffiths 2007)   63.2
Bigram PYP-HMM                                        66.9
Trigram PYP-HMM                                       69.8

SLIDE 36

PYP HMM Results

The hierarchical PYP priors give a substantial improvement in accuracy.

SLIDE 37

PYP HMM Results

Adding trigram conditioning leads to further improvements, countering previous work which has found decreases in performance with the increased complexity of trigrams.

SLIDE 38

PYP HMM Results

Previous research has shown that restricting the model to one class per word type provides the best performance. Next we incorporate this constraint into our PYP model.

SLIDE 39

One tag per type HMM

A/DT simple/? example/NN for/IN a/DT simple/? talk/NN

We modify the sampler to simultaneously sample a single tag assignment for every token corresponding to a word type.

  • we only consider taggings in which all tokens of a type receive the same tag,
  • note that we don't change the model, just what we are sampling.

SLIDE 40

One tag per type HMM

A/DT simple/? example/NN for/IN a/DT simple/? talk/NN

The simultaneous sampling of multiple token tags induces correlations:

  • calculating the probability of the type 'simple' taking the tag JJ requires enumerating all seating configurations after adding the trigram (DT,JJ,NN) twice,
  • the complexity of this calculation grows exponentially in the number of tokens being resampled.

This calculation is intractable, so we approximate it using fractional table counts (a minimal sketch of the type-level sampler follows).
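A rough Python sketch of the resulting type-level move (this shows only the control flow; score_token, add_approx, remove and the exact form of the fractional table counts are assumptions, not the talk's code):

    import math, random

    def resample_type(positions, tagset, score_token, add_approx, remove):
        """Sample one shared tag for every token of a word type."""
        for l in positions:                      # remove all tokens of the type
            remove(l)
        log_scores = []
        for tag in tagset:                       # score each candidate tag
            logp = 0.0
            for l in positions:
                logp += math.log(score_token(l, tag))  # approximate score using
                add_approx(l, tag)                     # fractional table counts
            for l in positions:                  # undo before the next candidate
                remove(l)
            log_scores.append(logp)
        m = max(log_scores)
        weights = [math.exp(s - m) for s in log_scores]
        tag = random.choices(tagset, weights=weights)[0]
        for l in positions:                      # commit the sampled tag
            add_approx(l, tag)
        return tag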

SLIDE 41

Results: 1 class per word type

Model                                                 M-1
Dirichlet Bigram HMM (Goldwater and Griffiths 2007)   63.2
Trigram PYP-HMM                                       69.8
Trigram PYP-1HMM                                      76.0

SLIDE 42

Incorporating Morphological Information

[Figure: the PYP-HMM graphical model with parameters Uni, B_j, Tri_ij and Em_j; so far, words are emitted directly from the tag-specific emission distributions.]

w_l | t_l, Em ∼ Em_{t_l}

SLIDE 43

Incorporating Morphological Information

[Figure: the emission of w3 = 'running' is generated character by character (r, u, n, n, i, n, g) by a tag-specific character bigram model EB.]

Em_j | a_Em, b_Em, EB_j ∼ PYP(a_Em, b_Em, EB_j)
w_{l,k} | w_{l,k-1}, t_l, EB ∼ EB_{t_l, w_{l,k-1}}

SLIDE 44

Incorporating Morphological Information

[Figure: the character bigram model EB backs off to a character unigram model EUni and then to a uniform distribution over characters.]

EB_{j,k} | a_EBi, b_EBi, EUni_j ∼ PYP(a_EBi, b_EBi, EUni_j)
EUni_j | a_EUni, b_EUni ∼ PYP(a_EUni, b_EUni, Uniform)
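Read as a generative story, the emission back-off scores a word character by character under its tag. A small illustrative Python sketch (the start/end-of-word symbols and the helper char_bigram_prob, standing in for the EB model, are assumptions):

    def word_base_prob(word, tag, char_bigram_prob, bos="^", eos="$"):
        """Probability of emitting `word` from `tag` as a product of
        tag-specific character-bigram probabilities."""
        p, prev = 1.0, bos
        for ch in list(word) + [eos]:
            p *= char_bigram_prob(tag, prev, ch)   # EB_{tag, prev}(ch)
            prev = ch
        return p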

SLIDE 45

Results: 1 class per word type

Model                                                 M-1
Dirichlet Bigram HMM (Goldwater and Griffiths 2007)   63.2
Trigram PYP-HMM                                       69.8
Trigram PYP-1HMM                                      76.0
Trigram PYP-1HMM-Morph                                77.5

SLIDE 46

Multilingual Results: CoNLL-X Shared Task

Language     PYP-HMM   PYP-1HMM   PYP-1HMM-Morph   Best pub.
Arabic       57.1      62.7       67.5             -
Bulgarian    67.8      69.7       73.2             -
Czech        62.0      66.3       70.1             -
Danish       69.9      73.9       76.2             66.7
Dutch        66.6      68.7       70.4             67.3
Hungarian    65.9      69.0       73.0             -
Japanese     76.8      81.7       81.9             -
Portuguese   72.1      73.5       78.5             75.3
Spanish      71.6      74.7       78.8             73.2
Swedish      66.6      67.0       68.6             60.6

SLIDE 47

Tag Confusion Matrix

[Figure: tag confusion matrices for the MLE and PYP models over the Penn Treebank tagset (NN, IN, NNP, DT, JJ, NNS, etc.).]

SLIDE 48

Outline

  1. Inducing the syntactic categories of words
  2. Inducing the syntactic structure of sentences

SLIDE 49

Dependency Grammar

The benchmark model for dependency grammar induction is parameterised in terms of p(child|head) multinomial distributions:

[Figure: the dependency tree for "The proposal would undermine effectiveness managers contend" is built head-outward, starting from ROOT.]

SLIDE 50

Dependency Grammar

[Figure: contend/VBP is attached to ROOT.]

SLIDE 51

Dependency Grammar

[Figure: managers/NNS is attached to contend/VBP.]

SLIDE 52

Dependency Grammar

[Figure: would/MD is added to the tree.]

SLIDE 53

Dependency Grammar

[Figure: proposal/NN is added to the tree.]

SLIDE 54

Dependency Grammar

[Figure: The/DT is attached to proposal/NN.]

SLIDE 55

Dependency Grammar

[Figure: undermine/VB is added to the tree.]

SLIDE 56

Dependency Grammar

[Figure: effectiveness/NN is added, completing the tree over "The proposal would undermine effectiveness managers contend".]

SLIDE 57

Dependency Grammar

[Figure: the completed dependency tree for The/DT proposal/NN would/MD undermine/VB effectiveness/NN managers/NNS contend/VBP, rooted at ROOT.]

This basic model was the first to beat a trivial right-attachment baseline.
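Concretely, a candidate tree is scored (ignoring the valence/STOP factors of the full DMV, which are omitted here) as a product of p(child tag | head tag) terms. A hedged Python sketch with illustrative names:

    def tree_prob(heads, tags, p_child_given_head):
        """`heads[i]` is the index of token i's head, or -1 for an attachment
        to ROOT; the score is the product over edges of p(child | head)."""
        p = 1.0
        for child, head in enumerate(heads):
            head_tag = "ROOT" if head == -1 else tags[head]
            p *= p_child_given_head(tags[child], head_tag)
        return p

    # toy usage: "managers contend", with contend/VBP headed by ROOT
    print(tree_prob(heads=[1, -1], tags=["NNS", "VBP"],
                    p_child_given_head=lambda c, h: 0.1))   # dummy multinomial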

SLIDE 58

Dependency Grammar

English syntax often allows a degree of flexibility in how dependents of a verb are organised:

[Figure: the sentence's dependency tree, with one arrangement of the verb's dependents.]

SLIDE 59

Dependency Grammar

[Figure: the same dependencies with the verb's dependents arranged differently.]

SLIDE 60

Dependency Grammar

However, there are definite limitations:

[Figure: two further rearrangements of the same dependents, illustrating these limitations.]

SLIDE 61

Dependency Grammar

Such limitations give strong clues to unsupervised grammar induction systems as to where noun phrases attach:

[Figure: the dependency tree, highlighting where a noun phrase attaches.]

SLIDE 62

Dependency Grammar

[Figure: the same tree, highlighting another noun phrase attachment.]

SLIDE 63

Dependency Grammar

Not all dependency fragments are created equal; we want to learn a preference for some over others:

[Figure: several competing dependency fragments over the tags MD, VBP and NNS.]

SLIDE 64

Model

[Figure: an example elementary tree over MD and VBP.]

Distribution over elementary trees e, given root non-terminal c:

G_c | a_c, b_c, P_cfg ∼ PYP(a_c, b_c, P_cfg(·|c))
e | c ∼ G_c

  • P_cfg(·|c) is the base distribution,
  • a_c is the discount parameter,
  • b_c is the concentration parameter.
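For reference, the predictive probability of an elementary tree e rooted in c then takes the usual PYP form (the counts n_e, t_e, n_c, T_c are notation introduced here for illustration, mirroring the tagging slides: customers and tables for e, and their totals in the restaurant for c):

\[
P(e \mid c) \;=\; \frac{n_e - a_c\, t_e}{n_c + b_c} \;+\; \frac{a_c\, T_c + b_c}{n_c + b_c}\, P_{\mathrm{cfg}}(e \mid c)
\]

so frequently reused tree fragments receive more probability mass than the PCFG base distribution alone would give them.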

SLIDE 65

Base Distribution (Unlexicalised)

Elementary trees → dependency edges → dependents

[Figure: an elementary tree over MD and VBP decomposed first into dependency edges (scored by P_cfg) and then into individual dependents (scored by P_sh).]

SLIDE 66

Inference

We have no observed trees, so we need to reason over the space of latent trees:

  • the space of TSG derivations is exponential,
  • we use a Metropolis-Hastings sampler (Cohn and Blunsom, 2010).

We also sample the hyperparameters a_c, b_c and s_c. Net effect: no free parameters.

SLIDE 67

Results

Directed Attachment Accuracy on WSJ23

Model                                  |w| ≤ 10        |w| ≤ ∞
Attach-Right                           38.4            31.7
EM (Klein 2004)                        46.1            35.9
Viterbi EM (Spitkovsky et al. 2010)    65.3            47.9
TSG-DMV                                65.1 (σ=2.2)    51.5 (σ=2.0)
LexTSG-DMV                             67.7 (σ=1.5)    55.7 (σ=2.0)
Supervised MLE                         84.5            68.8

SLIDE 68

Example Parse

[Figure: the induced parse of "Speculators are calling for a degree of liquidity that is not there in the market".]

SLIDE 69

Putting the two models together

Directed Attachment Accuracy on WSJ23

Model                      |w| ≤ ∞
Attach-Right               31.7
EM (Klein 2004)            35.9
LexTSG-DMV                 55.7
LexTSG-DMV + PYP-1HMM      28.5
LexTSG-DMV + PYP-HMM       35.8
Supervised MLE             68.8

SLIDE 70

Interpretation

The proposal would undermine effectiveness managers contend

?

What contribution does our model make to the language acquisition debate?

  • we provide empirical evidence that dependency grammars are learnable with a weak, non-language-specific bias,
  • our model's bias comes from Pitman-Yor Process priors that favour power-law distributions,
  • such distributions are common across many cognitive processes, not just language.

SLIDE 71

Conclusions

Summary:

  • Bayesian model of part-of-speech and dependency grammar induction
  • hierarchical prior allows expressive back-off
  • learns latent linguistic phenomena

Contributions:

  • state-of-the-art grammar induction
  • empirical evidence for the learnability of dependency grammar
  • unsupervised parse trees approaching a level usable in practical applications

SLIDE 72

The End

Questions?