SLIDE 1

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning

Dan Garrette (U. Washington)
Chris Dyer (CMU)
Jason Baldridge (UT-Austin)
Noah A. Smith (CMU)

SLIDE 2

Contributions

  • 1. A new generative model for learning CCG parsers from weak supervision
  • 2. A way to select Bayesian priors that capture properties of CCG
  • 3. A Bayesian inference procedure to learn the parameters of our model

SLIDE 3

Type-Level Supervision

  • Unannotated text
  • Incomplete tag dictionary: word → {tags} (sketched below)
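For concreteness, a minimal sketch of what this supervision looks like as data, following the running example (entries illustrative):

```python
# Type-level supervision: raw sentences plus an incomplete tag dictionary
# mapping some word types to their allowed CCG supertags (as strings here).
unannotated_text = [
    ["the", "lazy", "dogs", "wander"],
]
tag_dictionary = {
    "the":  {"np/n"},
    "lazy": {"n/n"},
    "dogs": {"np", "n"},
    # "wander" is absent: unknown words may take any supertag
}
```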
SLIDE 4

Type-Level Supervision

[Figure: the example sentence "the lazy dogs wander" with tag-dictionary supertags shown beneath each word: np/n, n/n, np, n, (s\np)/np]

SLIDE 5

Type-Level Supervision

[Figure: the same example, now also listing candidate supertags for words not in the tag dictionary: n, n/n, np/n, s\np, …]

SLIDE 6

Type-Level Supervision

[Figure: the same example with all candidate supertags; which analysis is correct is unknown (?)]

SLIDE 7

PCFG: Local Decisions

SLIDE 8

PCFG: Local Decisions

[Figure: a tree is built top-down, starting from the root A]

SLIDE 9

PCFG: Local Decisions

[Figure: A expands to children B and C]

SLIDE 10

PCFG: Local Decisions

[Figure: A expands to children B and C]

SLIDE 11

PCFG: Local Decisions

[Figure: the full tree: A → B C, B → D E, C → F G]

SLIDE 12

PCFG: Local Decisions

[Figure: the tree annotated with its local rule probabilities P(D E | B) and P(F G | C)]

SLIDE 13

PCFG: Local Decisions

[Figure: same as slide 12; each production is scored independently of everything outside it]
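To make the locality concrete, here is a minimal sketch (ours, not the talk's) of how a PCFG scores a tree: the probability is a product of per-production probabilities, each conditioned only on the parent label, never on the surrounding context.

```python
import math

# A minimal PCFG tree scorer: each expansion is scored conditioned only
# on its own label. Rule probabilities are illustrative.
RULE_PROBS = {
    ("A", ("B", "C")): 1.0,
    ("B", ("D", "E")): 0.7,
    ("C", ("F", "G")): 0.4,
}

def tree_log_prob(label, children):
    """children: list of (label, children) pairs; [] marks a leaf."""
    if not children:
        return 0.0  # leaves contribute nothing in this sketch
    rhs = tuple(lbl for lbl, _ in children)
    logp = math.log(RULE_PROBS[(label, rhs)])  # a purely local decision
    for lbl, kids in children:
        logp += tree_log_prob(lbl, kids)
    return logp

tree = ("A", [("B", [("D", []), ("E", [])]),
              ("C", [("F", []), ("G", [])])])
print(tree_log_prob(*tree))  # log(1.0 * 0.7 * 0.4)
```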

SLIDE 14

A New Generative Model

[Figure: the same tree, scored with the usual rule probability P(D E | B)]

SLIDE 15

A New Generative Model

[Figure: in addition to P(D E | B), the constituent B also generates the supertag to its right: × P_R(F | B)]

SLIDE 16

A New Generative Model

[Figure: B likewise generates the supertag to its left, here the sentence-start marker: × P_L(⟨S⟩ | B) × P_R(F | B)]

SLIDE 17

A New Generative Model

[Figure: the root A generates the sentence boundaries ⟨S⟩ and ⟨E⟩ as its context; below it, the tree B C → D E F G]

(This makes inference tricky… we'll come back to that)
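The extra factors can be read off the spans of the tree. A sketch of that context term, extending the toy scorer above (the distributions and helper names here are our own, purely illustrative):

```python
import math

# Illustrative context distributions P_L(t_left | A) and P_R(t_right | A);
# in the real model these are learned, with priors favoring combinability.
P_LEFT  = {("B", "<S>"): 0.6, ("C", "E"): 0.5, ("A", "<S>"): 0.6}
P_RIGHT = {("B", "F"): 0.5, ("C", "<E>"): 0.6, ("A", "<E>"): 0.6}

def context_log_prob(spans, supertags):
    """spans: (label, start, end) constituents over the supertag sequence;
    every constituent generates the supertags flanking its span."""
    padded = ["<S>"] + list(supertags) + ["<E>"]
    logp = 0.0
    for label, start, end in spans:
        left, right = padded[start], padded[end + 2]
        logp += math.log(P_LEFT[(label, left)])    # generate left context
        logp += math.log(P_RIGHT[(label, right)])  # generate right context
    return logp

# Tree A -> B C over supertags D E F G: B spans 0-1, C spans 2-3.
print(context_log_prob([("A", 0, 3), ("B", 0, 1), ("C", 2, 3)],
                       ["D", "E", "F", "G"]))
```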

SLIDE 18

Why CCG?

  • The grammar formalism itself can be used to guide learning
  • Given any two categories, we always know whether they are combinable
  • Adjacent categories tend to be combinable
  • So we can extract a priori context preferences before we even look at the data (see the sketch below)
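A minimal sketch of that combinability check (our simplification, not the paper's code), with categories as nested tuples: an atom is a string like "np"; a complex category is (result, slash, argument), so ("np", "/", "n") is np/n. It covers forward/backward application and composition:

```python
# Combinability of two adjacent CCG categories under application and
# composition. A simplified sketch; real CCG has more combinators.

def combinable(left, right):
    """True iff `left right` can combine by application or composition."""
    # Forward application: X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return True
    # Backward application: Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return True
    # Forward composition: X/Y  Y/Z  =>  X/Z
    if (isinstance(left, tuple) and left[1] == "/"
            and isinstance(right, tuple) and right[1] == "/"
            and left[2] == right[0]):
        return True
    # Backward composition: Y\Z  X\Y  =>  X\Z
    if (isinstance(left, tuple) and left[1] == "\\"
            and isinstance(right, tuple) and right[1] == "\\"
            and right[2] == left[0]):
        return True
    return False

print(combinable(("np", "/", "n"), "n"))              # forward application
print(combinable(("np", "/", "n"), ("n", "/", "n")))  # forward composition
print(combinable("n", ("s", "\\", "np")))             # False: not combinable
```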

SLIDE 19

Why CCG?

[Figure: "buy the book" analyzed two ways. With CCG categories (np/n, n, np, s, …), how the labels relate is intrinsic to the categories themselves: universal grammar properties. With POS/constituent labels (VB, DT, NN, NP, VP, S), the labels say nothing about how they relate: all relationships must be learned]

SLIDE 20

CCG Parsing

[Figure: a derivation of "the lazy dog sleeps" with categories np/n, n/n, n, s\np, combining by forward application (FA)]

SLIDE 21

CCG Parsing

[Figure: an alternative derivation in which forward composition (FC) first combines np/n and n/n into np/n]

SLIDE 22

Supertag Context

[Figure: the same derivation, highlighting a constituent together with the supertags adjacent to it]

SLIDE 23

Supertag Context

[Figure: the same derivation, another constituent and its neighboring supertags highlighted]

SLIDE 24

Supertag Context

[Figure: the same derivation, another constituent and its neighboring supertags highlighted]

SLIDE 25

Supertag Context

[Figure: the same derivation, another constituent and its neighboring supertags highlighted]

SLIDE 26

Supertag Context

[Figure: the same derivation, another constituent and its neighboring supertags highlighted]
SLIDE 27

Constituent Context [Klein & Manning 2002]

  • Klein & Manning showed the value of modeling context with the Constituent Context Model (CCM)

[Figure: the sentence "the lazy dog sleeps"]

SLIDE 28

Constituent Context [Klein & Manning 2002]

[Figure: a constituent labeled (JJ NN) shown inside the context DT ( ) VBZ]

SLIDE 29

Constituent Context [Klein & Manning 2002]

[Figure: "lazy dog" fills the context DT ( ) VBZ: "substitutability"]

SLIDE 30

Constituent Context [Klein & Manning 2002]

[Figure: "dog" (NN) fills the same context DT ( ) VBZ: "substitutability"]

SLIDE 31

Constituent Context [Klein & Manning 2002]

[Figure: "big lazy dog" (JJ JJ NN) fills the same context DT ( ) VBZ: "substitutability"]

SLIDE 32

Constituent Context [Klein & Manning 2002]

[Figure: whatever fills the context DT ( ) VBZ behaves like a noun: ~Noun]

SLIDE 33

Constituent Context [Klein & Manning 2002]

[Figure: the bare context DT ( ) VBZ as a signal of substitutability]

SLIDE 34

Supertag Context

[Figure: the analogous picture with supertags: the constituent "lazy dog" (categories n/n n, yielding n) inside the supertag context np/n ( ) s\np]

SLIDE 35

Supertag Context

[Figure: the context np/n ( ) s\np around a constituent labeled n]

  • We know the constituent label
  • We know if it's a fitting context, even before looking at the data

SLIDE 36

This Paper

  • 1. A new generative model for learning CCG parsers from weak supervision
  • 2. A way to select Bayesian priors that capture properties of CCG
  • 3. A Bayesian inference procedure to learn the parameters of our model

SLIDE 37

Supertag-Context Parsing

[Figure: a chart over words w1 w2 w3 w4 with supertags t1 t2 t3 t4 and chart items A_{0,4}, A_{0,3}, A_{1,3}]

Standard PCFG parameters: P(A_root) and P(A → A_left A_right or w_i)

SLIDE 38

Supertag-Context Parsing

[Figure: the same chart, now flanked by the boundary markers ⟨s⟩ and ⟨e⟩]

With context: P(A_root) and P(A → A_left A_right or w_i), plus the context terms P(A → t_left) and P(A → t_right)

SLIDE 39

Prior on Categories [Garrette, Dyer, Baldridge, and Smith, 2015]

[Figure: two analyses of "the lazy dog": one with simple categories (np/n, n/n, n, building np) and one with complex categories such as np\(np/n) and (np\(np/n))/n; the prior on categories prefers the simpler analysis]
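One way such a prior can be realized, sketched here under our own simplifying assumptions (the cited paper defines its own recursive process over categories): generate categories recursively, so every extra slash multiplies in a penalty and complex categories become exponentially less probable.

```python
# Sketch of a recursive prior over CCG categories, represented as nested
# tuples: an atom is a string; a complex category is (result, slash, arg).
# Constants are illustrative, not the paper's.
P_ATOM = 0.7                      # probability of stopping at an atom
ATOM_PROBS = {"s": 0.2, "np": 0.4, "n": 0.4}

def category_prior(cat):
    if isinstance(cat, str):      # atomic category
        return P_ATOM * ATOM_PROBS[cat]
    res, _slash, arg = cat        # complex: pay (1 - P_ATOM), pick / or \
    return (1 - P_ATOM) * 0.5 * category_prior(res) * category_prior(arg)

print(category_prior("n"))                       # simple: relatively likely
print(category_prior(("np", "/", "n")))          # np/n: less likely
print(category_prior((("np", "\\", ("np", "/", "n")), "/", "n")))
# (np\(np/n))/n: exponentially smaller
```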

SLIDE 40

Supertag-Context Prior

[Figure: a constituent labeled A within "the lazy dog sleeps", flanked by context supertags t_left and t_right]

P_L-prior(t_left | A) ∝ 10^5 if t_left can combine with A, 1 otherwise

SLIDE 41

Supertag-Context Prior

[Figure: the same setup for a constituent labeled n]

P_R-prior(t_right | A) ∝ 10^5 if A can combine with t_right, 1 otherwise
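Putting the pieces together, a sketch (our construction, reusing the `combinable` check from the earlier sketch) of how combinability becomes a prior over context supertags:

```python
# Combinability-based priors over context supertags: combinable tags get
# a large weight, everything else weight 1, then normalize. The boost
# value mirrors the strong-vs-weak (10^5-vs-1) preference on the slides.

def left_context_prior(category, supertag_inventory, boost=1e5):
    """P_L-prior(t_left | A): favor tags that can combine with A."""
    weights = {t: (boost if combinable(t, category) else 1.0)
               for t in supertag_inventory}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def right_context_prior(category, supertag_inventory, boost=1e5):
    """P_R-prior(t_right | A): favor tags that A can combine with."""
    weights = {t: (boost if combinable(category, t) else 1.0)
               for t in supertag_inventory}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

inventory = ["n", "np", ("np", "/", "n"), ("n", "/", "n"), ("s", "\\", "np")]
# np can combine with s\np (backward application), so it gets nearly all
# of the prior mass for the left context of s\np:
print(left_context_prior(("s", "\\", "np"), inventory))
```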

SLIDE 42

This Paper

  • 1. A new generative model for learning CCG parsers from weak supervision
  • 2. A way to select Bayesian priors that capture properties of CCG
  • 3. A Bayesian inference procedure to learn the parameters of our model

SLIDE 43

Type-Level Supervision

[Figure: the example "the lazy dogs wander" with its candidate supertags; which analysis is correct is unknown (?)]

SLIDE 44

Type-Supervised Learning

  • unlabeled corpus
  • tag dictionary
  • universal properties of the CCG formalism
slide-45
SLIDE 45

Posterior Inference

  • A Bayesian inference procedure will make

use of our linguistically-informed priors

  • But we can’t do sampling like a PCFG
  • Can’t compute the inside chart, even

with dynamic programming.

SLIDE 46

Sampling via Metropolis-Hastings

Idea:

  • Sample a tree from an efficient proposal distribution (the PCFG parameters; Johnson et al. 2007)
  • Accept according to the full distribution (the context parameters), as sketched below
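A compact sketch of that accept/reject step (our code; `sample_tree_from_pcfg` and `context_log_prob` are hypothetical stand-ins for the proposal sampler and the context scorer). Because the proposal is the PCFG part of the model, the PCFG terms cancel and only the context term decides acceptance:

```python
import math
import random

def mh_step(current_tree, sample_tree_from_pcfg, context_log_prob):
    """One Metropolis-Hastings step with a PCFG proposal distribution."""
    proposed_tree = sample_tree_from_pcfg()          # efficient proposal
    # Acceptance ratio reduces to P_context(new) / P_context(old),
    # since the PCFG proposal terms cancel out of the full MH ratio.
    log_ratio = context_log_prob(proposed_tree) - context_log_prob(current_tree)
    z = random.random()                              # z ~ uniform(0, 1)
    if z < math.exp(min(log_ratio, 0.0)):
        return proposed_tree                         # accept the new tree
    return current_tree                              # reject: keep the old
```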
SLIDE 47

Posterior Inference

[Figure: the example "the lazy dogs wander" with its tag-dictionary supertags; the priors (prefer connections) feed into the model]

SLIDE 48

Posterior Inference

[Figure: same as slide 47]

SLIDE 49

Posterior Inference

[Figure: candidate supertags for unknown words (n, n/n, np/n, s\np, …) are added]

SLIDE 50

Posterior Inference

[Figure: an inside chart is computed over the candidate analyses]

SLIDE 51

Posterior Inference

[Figure: a tree is sampled from the chart]

SLIDE 52

Metropolis-Hastings

[Figure: the priors (prefer connections) and the model]

SLIDE 53

Metropolis-Hastings

[Figure: the existing tree and a newly proposed tree, side by side]

SLIDE 54

Metropolis-Hastings

[Figure: the existing tree and the new tree are compared under the model]

SLIDE 55

Metropolis-Hastings

[Figure: the existing tree and the new tree are compared under the model]

SLIDE 56

Metropolis-Hastings

[Figure: the priors (prefer connections) and the model]

SLIDE 57

Posterior Inference

[Figure: the priors (prefer connections) and the model]

SLIDE 58

Metropolis-Hastings

  • Sample a tree based only on the PCFG parameters
  • Accept based only on the context parameters
  • New tree worse than old => less likely to accept

SLIDE 59

Experimental Results

SLIDE 60

Experimental Question

  • When supervision is incomplete, does modeling context, and biasing toward combinable contexts, help learn better parsing models?

SLIDE 61

English Results

[Chart: parsing accuracy (y-axis, ticks at 25/50/75) vs. the size of the corpus from which the tag dictionary is drawn]

  TD corpus size:            250k  200k  150k  100k  50k  25k
  no context:                  60    61    60    59   56   55
  +context, combinability:     65    64    64    63   60   58

SLIDE 62

Experimental Results

[Chart: parsing accuracy (y-axis, ticks at 20/40/60) with a 25k-token TD corpus]

  Language:                  English  Italian  Chinese
  no context:                     55       52       29
  +context, combinability:        58       54       34

SLIDE 63

Conclusion

Under weak supervision, we can use universal grammatical knowledge about context to find trees with a better global structure.

SLIDE 64

Deficiency

  • The generative story has a "throw away" step if the context-generated nonterminals don't match the tree
  • We sample only over the space of valid trees (condition on well-formed structures)
  • This is a benefit of the Bayesian formulation
  • See Smith 2011
SLIDE 65

Metropolis-Hastings

y = current tree, y′ = new tree, z ∼ uniform(0, 1)

accept if z < [P_full(y′) / P_full(y)] · [P_pcfg(y) / P_pcfg(y′)]
            = [P_pcfg(y′) · P_context(y′) / (P_pcfg(y) · P_context(y))] · [P_pcfg(y) / P_pcfg(y′)]
            = P_context(y′) / P_context(y)