

SLIDE 1

Sparsity in Dependency Grammar Induction

Jennifer Gillenwater¹, Kuzman Ganchev¹, João Graça², Ben Taskar¹, Fernando Pereira³

¹ Computer & Information Science, University of Pennsylvania
² L²F INESC-ID, Lisboa, Portugal
³ Google, Inc.

July 12, 2010

SLIDES 2–7

Outline

- A generative dependency parsing model
- The ambiguity problem this model faces
- Previous attempts to reduce ambiguity
- How posteriors provide a good measure of ambiguity
- Applying posterior regularization to the likelihood objective
- Success with respect to EM and parameter prior baselines

SLIDES 8–10

Dependency model with valence
(Klein and Manning, ACL 2004)

Example: the sentence x = "Regularization creates sparse grammars", tagged N V ADJ N, with dependency tree y rooted at the verb. The model scores the tree as a product of root, stop, and child-attachment factors:

p_θ(x, y) = θ_root(V)
          · θ_stop(nostop | V, right, false) · θ_child(N | V, right)
          · θ_stop(stop | V, right, true)
          · θ_stop(nostop | V, left, false) · θ_child(N | V, left)
          · ...
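To make the factorization concrete, here is a minimal sketch in Python (not the authors' code) that multiplies out the root, stop, and child factors for this tree; all θ values are hypothetical, and θ_stop(nostop | ·) is taken to be 1 − θ_stop(stop | ·).

```python
# Minimal sketch of the DMV factorization p_theta(x, y); the theta values
# below are made up for illustration (the real model learns them from data).
theta_root = {"V": 0.5, "N": 0.3, "ADJ": 0.2}
theta_stop = {   # (head tag, direction, already-has-child-in-direction) -> P(stop)
    ("V", "right", False): 0.3, ("V", "right", True): 0.8,
    ("V", "left", False): 0.4,  ("V", "left", True): 0.9,
    ("N", "right", False): 0.7, ("N", "right", True): 0.9,
    ("N", "left", False): 0.5,  ("N", "left", True): 0.9,
    ("ADJ", "right", False): 0.9, ("ADJ", "right", True): 0.95,
    ("ADJ", "left", False): 0.9,  ("ADJ", "left", True): 0.95,
}
theta_child = {  # (head tag, direction) -> distribution over child tags
    ("V", "right"): {"N": 0.7, "ADJ": 0.2, "V": 0.1},
    ("V", "left"):  {"N": 0.8, "ADJ": 0.1, "V": 0.1},
    ("N", "left"):  {"ADJ": 0.6, "N": 0.3, "V": 0.1},
}

def dmv_tree_prob(tags, heads):
    """p_theta(x, y) for a tree given POS tags and a head index per token
    (-1 marks the root); nostop probabilities are 1 - theta_stop[...]."""
    prob = 1.0
    for i, head in enumerate(heads):
        if head == -1:
            prob *= theta_root[tags[i]]          # root factor
    for i in range(len(tags)):                   # every token makes stop decisions
        for direction, kids in (
            ("left", [j for j in range(len(tags)) if heads[j] == i and j < i]),
            ("right", [j for j in range(len(tags)) if heads[j] == i and j > i]),
        ):
            has_child = False
            for j in kids:
                prob *= 1 - theta_stop[(tags[i], direction, has_child)]   # nostop
                prob *= theta_child[(tags[i], direction)][tags[j]]        # attach child
                has_child = True
            prob *= theta_stop[(tags[i], direction, has_child)]           # stop
    return prob

# "Regularization creates sparse grammars", tagged N V ADJ N, rooted at the verb:
# creates -> Regularization, creates -> grammars, grammars -> sparse.
print(dmv_tree_prob(["N", "V", "ADJ", "N"], [1, -1, 3, 1]))
```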

SLIDES 11–13

Traditional objective optimization

Traditional objective: marginal log likelihood

max_θ L(θ) = E_X[log p_θ(x)] = E_X[log Σ_y p_θ(x, y)]

Optimization method: expectation maximization (EM)

Problem: EM may learn a very ambiguous grammar
- Too many non-zero probabilities
- Ex: V → N should have non-zero probability, but V → DET, V → JJ, V → PRP$, etc. should be 0
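As a toy numeric illustration of this objective (all probabilities below are made up; the real model sums over every projective tree with dynamic programming rather than over a short list):

```python
import math

# Hypothetical joint probabilities p_theta(x, y) for a few candidate trees y of
# two toy sentences x; in the real model this sum runs over all projective trees.
joint_probs = [
    [0.020, 0.012, 0.003],   # sentence 1: three candidate trees
    [0.015, 0.009],          # sentence 2: two candidate trees
]

# Marginal log likelihood L(theta): average over sentences of log sum_y p_theta(x, y).
L = sum(math.log(sum(row)) for row in joint_probs) / len(joint_probs)
print(L)

# EM's E-step normalizes each row into posteriors p_theta(y | x); the M-step then
# re-estimates theta from expected rule counts under those posteriors. Nothing in
# L(theta) itself pushes the posteriors toward few parent-child pair types, which
# is the ambiguity problem noted on this slide.
posteriors = [[p / sum(row) for p in row] for row in joint_probs]
```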

SLIDES 14–16

Previous approaches to improving performance

- Structural annealing¹
- L(θ′): Model extension²
- L(θ) + log p(θ): Parameter regularization³

These approaches tend to reduce the unique number of children per parent (sparsity in the parameters θ_child(Y | X, dir)) rather than directly reducing the number of unique parent–child pairs (sparsity in the posteriors of X → Y edges).

¹ Smith and Eisner, ACL 2006
² Headden et al., NAACL 2009
³ Liang et al., EMNLP 2007; Johnson et al., NIPS 2007; Cohen et al., NIPS 2008, NAACL 2009

SLIDES 17–22

Ambiguity measure using posteriors: L1/∞

Intuition: the true number of unique parent tags for a child tag is small.

Grid of candidate parent → child tag pairs:
N → N, V → N, ADJ → N
N → V, V → V, ADJ → V
N → ADJ, V → ADJ, ADJ → ADJ

Example corpus with gold trees: "Sparsity is working" (tagged N V V) and "Use good grammars" (tagged V ADJ N). Each gold edge places a 1 in the grid cell for its parent–child tag pair, and across the two sentences only three distinct pairs are used. Taking the max down each column of the grid and then summing gives L1/∞ = 1 + 1 + 1 = 3: on gold trees, the measure simply counts the distinct parent–child pair types.
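A minimal sketch of the L1/∞ computation (not the authors' code): each child-token instance is a dict from candidate (parent tag, child tag) pairs to probabilities, which are 0/1 indicators for gold trees. The specific attachments below are one plausible gold analysis of the two slide sentences, assumed for illustration since the slide draws the trees graphically.

```python
from collections import defaultdict

def l1_inf(edge_posteriors):
    """edge_posteriors: one dict per child-token instance, mapping a candidate
    (parent tag, child tag) pair to its probability under the gold or model
    distribution. Returns the sum over pair types of the max over instances."""
    column_max = defaultdict(float)
    for instance in edge_posteriors:
        for pair, prob in instance.items():
            column_max[pair] = max(column_max[pair], prob)
    return sum(column_max.values())

# Gold trees (assumed analysis): "Sparsity is working" with is -> Sparsity and
# is -> working; "Use good grammars" with Use -> grammars and grammars -> good.
gold = [
    {("V", "N"): 1.0},    # Sparsity attaches to is       (V -> N)
    {("V", "V"): 1.0},    # working attaches to is        (V -> V)
    {("V", "N"): 1.0},    # grammars attaches to Use      (V -> N)
    {("N", "ADJ"): 1.0},  # good attaches to grammars     (N -> ADJ)
]
print(l1_inf(gold))  # 3.0: only three distinct parent-child pair types are used
```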

SLIDES 23–28

Measuring ambiguity on distributions over trees

For a distribution p_θ(y | x) instead of gold trees, the same grid of parent → child tag pairs is filled with posterior edge probabilities rather than 0/1 counts: each word's attachment may be split across several candidate parents (the slide shows splits such as 0.4 / 0.6 and 0.7 / 0.3 for "Sparsity is working" and "Use good grammars"). Taking the max of each pair type over all instances and then summing now gives 1 + .3 + .4 + .6 + .4 + .6 = 3.3, compared with 3 for the gold trees: the more ambiguous the posteriors, the more parent–child pair types receive mass and the larger the L1/∞ value.
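Reusing the l1_inf helper from the previous sketch on soft posteriors: the per-edge numbers below are illustrative, chosen only so that the column maxes come out to 1, .3, .4, .6, .4, .6 as on the slide; the slide's actual per-edge posteriors may be split differently.

```python
# Each instance now spreads its mass over candidate parents (illustrative values).
soft = [
    {("V", "N"): 1.0},                       # one attachment is unambiguous
    {("V", "V"): 0.4, ("N", "V"): 0.6},      # parent tag split between V and N
    {("V", "N"): 0.7, ("ADJ", "N"): 0.3},
    {("N", "ADJ"): 0.4, ("V", "ADJ"): 0.6},
]
print(l1_inf(soft))  # ~3.3: ambiguity spreads mass over more pair types than the gold 3
```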

SLIDES 29–31

Minimizing ambiguity through posterior regularization

Standard EM E-step:

q_t(y | x) = argmin_{q(y | x)} KL(q ‖ p_{θ_t})

so q_t simply copies the model posteriors. The slide shows q for an example sentence tagged D N V N as two tables: root-attachment probabilities q(root → x_i) and parent–child probabilities q(x_i → x_j), with probability spread over many cells.

Posterior regularization applies an E-step penalty, L1/∞ on the posteriors q(y | x), to induce sparsity (Graça et al., NIPS 2007 & 2009):

q_t(y | x) = argmin_{q(y | x)} KL(q ‖ p_{θ_t}) + σ · L1/∞(q(y | x))

On the slide, the penalized q(x_i → x_j) table is visibly sparser, with mass concentrated on fewer parent–child pairs.
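For intuition, here is a toy sketch of the penalized projection, not the authors' solver: the real E-step optimizes a dual with projected gradient over all projective trees, whereas this sketch brute-forces the primal objective KL(q ‖ p) + σ · L1/∞ over a hand-enumerated set of three candidate trees for one sentence. The sentence, trees, probabilities, and σ are all made up.

```python
# Toy sketch of the penalized E-step: project the model posterior p(y|x) onto a
# distribution q that trades off KL(q || p) against sigma * L1/inf of the
# expected parent-child features. Everything below is hypothetical.
import numpy as np
from scipy.optimize import minimize

tags = ["N", "V", "V"]                    # toy sentence "Sparsity is working"
trees = [
    [1, -1, 1],   # is -> Sparsity, is -> working        (pairs V->N, V->V)
    [2, 2, -1],   # working -> Sparsity, working -> is   (pairs V->N, V->V)
    [1, -1, 0],   # is -> Sparsity, Sparsity -> working  (pairs V->N, N->V)
]
p = np.array([0.5, 0.3, 0.2])             # hypothetical model posteriors p(y|x)
sigma = 2.0                               # penalty strength (assumed)

def l1_inf_penalty(q):
    """Sum over parent-child tag pairs of the max, over child tokens, of the
    posterior probability that the token takes a parent with that tag."""
    expectation = {}                       # (child token, parent tag) -> E_q[phi]
    for prob, heads in zip(q, trees):
        for child, head in enumerate(heads):
            if head == -1:                 # root attachments not penalized in this toy
                continue
            key = (child, tags[head])
            expectation[key] = expectation.get(key, 0.0) + prob
    col_max = {}
    for (child, parent_tag), val in expectation.items():
        pair = (parent_tag, tags[child])
        col_max[pair] = max(col_max.get(pair, 0.0), val)
    return sum(col_max.values())

def objective(z):
    q = np.exp(z - z.max()); q /= q.sum()  # softmax keeps q on the simplex
    kl = float(np.sum(q * np.log((q + 1e-12) / p)))
    return kl + sigma * l1_inf_penalty(q)

res = minimize(objective, np.zeros(len(trees)), method="Nelder-Mead")
q = np.exp(res.x - res.x.max()); q /= q.sum()
print("p:", p)
print("q:", np.round(q, 3))  # mass moves away from the tree that adds the extra N->V pair
```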


SLIDES 32–34

Experimental results

English from Penn Treebank: state-of-the-art accuracy. Accuracy on test sentences of length ≤ 10, ≤ 20, and all lengths:

  Learning Method              ≤ 10          ≤ 20   all
  PR (σ = 140)                 62.1          53.8   49.1
  LN families                  59.3          45.1   39.0
  SLN TieV & N                 61.3          47.4   41.4
  PR (σ = 140, λ = 1/3)        64.4          55.2   50.5
  DD (α = 1, λ learned)        65.0 (±5.7)

11 other languages from CoNLL-X:
- Dirichlet prior baseline: 1.5% average gain over EM
- Posterior regularization: 6.5% average gain over EM

Come see the poster for more details