Viterbi Training Improves Unsupervised Dependency Parsing

Valentin I. Spitkovsky
with Hiyan Alshawi (Google Inc.), Daniel Jurafsky (Stanford University), and Christopher D. Manning (Stanford University)

CoNLL (2010-07-15)


slide-11
SLIDE 11

Outline

1 Viterbi EM

— faster, simpler and more accurate — easy state-of-the-art results

2 Interpretation

— machine learning and linguistic perspectives — practical insights (some theoretical underpinning)

3 Core Issue

— provably wrong objective functions — theoretical insights (mathematically sound)

Spitkovsky et al. (Stanford & Google) Viterbi EM CoNLL (2010-07-15) 2 / 26

Problem: Unsupervised Learning of Parsing

Input: Raw Text (Sentences, Tokens and POS-tags)
... By most measures, the nation’s industrial sector is now growing very slowly — if at all. Factory payrolls fell in September. So did the Federal Reserve ...

Example (POS tags over words; ♦ marks the root):
NN      NNS      VBD  IN NN        ♦
Factory payrolls fell in September .

Output: Syntactic Structures (and a Probabilistic Grammar)

Disclaimer: Your Mileage May Vary...

Our scope is a very specific problem, but the high-level ideas may generalize.
Classic EM: “focus across the board” (hard to see the trees for the forest)
Viterbi EM: zoom in on likeliest tree

Scoring: Directed Dependency Accuracy

NN      NNS      VBD  IN NN        ♦
Factory payrolls fell in September .

Directed score: 3/5 = 60% (right/left-branching baselines: 2/5 = 40%).
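
For concreteness, here is a minimal sketch (ours, not the talk's) of this metric: a token is scored correct iff its predicted head matches the gold head. The gold heads encode the parse drawn above; the predicted heads are an invented parse that happens to get 3 of 5 right.

```python
def directed_accuracy(pred_heads, gold_heads):
    """Fraction of tokens whose predicted head matches the gold head.

    Heads are 1-indexed token positions; 0 denotes the root (the ♦ wall).
    """
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)

# "Factory payrolls fell in September" (punctuation excluded from scoring):
gold = [2, 3, 0, 3, 4]  # Factory<-payrolls<-fell(root); fell->in->September
pred = [2, 3, 0, 5, 3]  # an illustrative parse with two wrong heads
print(directed_accuracy(pred, gold))  # 0.6
```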

State-of-the-Art: Dependency Model with Valence

a head-outward model, with word classes and valence/adjacency (Klein and Manning, 2004)

[diagram: a head h spawns arguments a1, a2 outward, then generates STOP]

$$P(t_h) = \prod_{dir \in \{L,R\}} P_{\text{STOP}}(c_h, dir, \text{adj} = \mathbf{1}_{n=0}) \prod_{i=1}^{n} P(t_{a_i})\, P_{\text{ATTACH}}(c_h, dir, c_{a_i})\, \big(1 - P_{\text{STOP}}(c_h, dir, \text{adj} = \mathbf{1}_{i=1})\big),$$

where $n = |\text{args}(h, dir)|$.
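
A small recursive sketch (ours) of this factorization, under an assumed toy encoding: each subtree is a (class, left-arguments, right-arguments) triple, and the parameters live in plain dictionaries.

```python
import math

def dmv_log_prob(head_cls, left, right, params):
    """Log P(t_h) for the subtree rooted at a head of class head_cls.

    left, right: the head's arguments on each side, ordered inside-out;
        each argument is a (cls, left, right) triple for its own subtree.
    params: dicts params["p_stop"][(cls, dir, adj)] and
        params["p_attach"][(head_cls, dir, arg_cls)], with adj a boolean.
    """
    logp = 0.0
    for direction, args in (("L", left), ("R", right)):
        for i, (a_cls, a_left, a_right) in enumerate(args):
            adj = (i == 0)  # adjacent decision: no argument taken yet
            logp += math.log(1.0 - params["p_stop"][(head_cls, direction, adj)])
            logp += math.log(params["p_attach"][(head_cls, direction, a_cls)])
            logp += dmv_log_prob(a_cls, a_left, a_right, params)  # recurse: P(t_a)
        # final STOP: adjacent context iff no arguments were generated
        logp += math.log(params["p_stop"][(head_cls, direction, len(args) == 0)])
    return logp
```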

Learning: EM, via inside-outside re-estimation

sentences {s}, legal parse trees t ∈ T(s), and a gold t∗
non-convex objective — very sensitive to initialization
maximizing the probability of data (sentence strings):

$$\hat{\theta}_{\text{UNS}} = \arg\max_\theta \prod_s \sum_{t \in T(s)} P_\theta(t) = \arg\max_\theta \prod_s P_\theta(s)$$

supervised objective would be convex (counting):

$$\hat{\theta}_{\text{SUP}} = \arg\max_\theta \prod_s P_\theta(t^*(s))$$
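
The two E-steps differ only in how they weight trees; this toy sketch (our encoding, not the talk's implementation) makes the contrast concrete, with each parse forest T(s) enumerated explicitly and each tree a bag of (distribution, outcome) rule uses. A real DMV learner would use inside-outside charts rather than enumerating T(s).

```python
from collections import Counter

def tree_prob(tree, theta):
    """P_theta(t): product of rule probabilities, one factor per rule use.
    A tree is a Counter mapping (distribution, outcome) rules to use counts."""
    p = 1.0
    for rule, n in tree.items():
        p *= theta[rule] ** n
    return p

def classic_e_step(forests, theta):
    """Classic EM: credit every parse t in T(s), weighted by P_theta(t | s)."""
    counts = Counter()
    for forest in forests:                            # one forest per sentence s
        z = sum(tree_prob(t, theta) for t in forest)  # P_theta(s)
        for t in forest:
            w = tree_prob(t, theta) / z               # w_t = P_theta(t | s)
            for rule, n in t.items():
                counts[rule] += w * n
    return counts

def viterbi_e_step(forests, theta):
    """Viterbi EM: credit only the likeliest parse (w_t = 1 for the winner)."""
    counts = Counter()
    for forest in forests:
        counts.update(max(forest, key=lambda t: tree_prob(t, theta)))
    return counts

def m_step(counts):
    """Re-estimate theta by normalizing counts within each distribution."""
    totals = Counter()
    for (dist, _), c in counts.items():
        totals[dist] += c
    return {(dist, out): c / totals[dist] for (dist, out), c in counts.items()}
```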

Standard Corpus: WSJk

The Wall Street Journal section of the Penn Treebank Project (Marcus et al., 1993)
◮ ... stripped of punctuation, etc.
◮ ... rid of sentences left with more than k POS tags;
◮ ... and converted to reference dependencies — {t∗}, using “head percolation rules” (Collins, 1999).

Training: traditionally, WSJ10 (Klein, 2005); Evaluation: Section 23 of WSJ∞ (all sentences).
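
The WSJk cutoff amounts to a one-line filter; this sketch (ours) assumes sentences are already punctuation-stripped lists of POS tags.

```python
def wsj_k(tagged_sentences, k):
    """WSJk: keep sentences left with at most k POS tags after preprocessing."""
    return [s for s in tagged_sentences if len(s) <= k]

# e.g., the traditional training set would be wsj_k(wsj_sentences, 10)
```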

[plot: WSJk corpus size, in sentences (1,000s) and tokens (1,000s), as a function of k for k = 5 ... 45]
Classic EM: The Lay of the Land

[plot: directed dependency accuracy (%) on WSJ40 versus the WSJk training set (k = 5 ... 40), comparing Uninformed and Oracle initialization, with K&M∗ marked for reference]

Viterbi EM: Results!

[plot: same axes, with Viterbi EM curves added; Oracle, Uninformed, and K&M∗ shown for comparison]

State-of-the-Art

Directed accuracy:                                   Section 23 of WSJ∞   Brown100
Right-Branching Baseline (Klein and Manning, 2004)          32%
DMV with Classic EM (Klein and Manning, 2004)               34%
DMV with Classic EM (Spitkovsky et al., 2010)               45%            43%
DMV with Viterbi EM with Smoothing                          45%            48% (+5%)
  + Clever Initialization                                   48%            51% (+3%)

Interpretation: Why Does Viterbi EM Work?

in theory, Viterbi is a quick-and-dirty approximation
in theory, Communism works...
in practice, EM emulates supervised learning: s → {t} = T(s)
Classic EM: $w_t = P_\theta(t \mid s)$
clearly, this is redistribution of wealth/mass
— also, resembles an omniscient central planner (knows the true value of everything at all times)
— could work, given a very powerful model θ...

Interpretation: How Does Classic EM Fail?

Our model is quite weak (e.g., doesn’t handle agreement)
reserves a lot of mass for ludicrous parse trees...
— each entitled to non-trivial support by the distribution
at small scales, this is not a problem (short sentences)
— only so many possible parses → few free-loaders
eventually, exponentially many trees (unwashed masses)
result: a dog of a probability distribution...
... wagged by its very long tail

Interpretation: Ideological Difference!

Viterbi EM is powered by greed (much like Capitalism)
does not require ability to properly value all parse trees
so long as it can spot a decent one (winner-take-all)
different (weaker?) requirement on models: (like IR)
— θ needs to be just discriminative enough! (ranking)
at small scales, data are too sparse (markets are illiquid)
improves with more data (statistics become efficient)
— really, what we want from unsupervised learners!

Interpretation: Summary

Viterbi EM: focus on the individual best parse trees
— given a decent estimate, makes rapid progress (the rich get richer)
Classic EM: integrates over the collective forests
— given a bad (uniform) estimate, makes little progress (all trees remain equally poor)
— given a great (supervised) estimate, cuts down the better trees (Dekulakization)

Interpretation: Connections

“learning by doing” — (unsupervised) self-training (Clark et al., 2003; Ng and Cardie, 2003; McClosky et al., 2006)
— relevance to understanding language acquisition?
— human probabilistic parsing models massively pruned (Jurafsky, 1996; Chater et al., 1998; Lewis and Vasishth, 2005)
synchronizing approximation across learning and inference — it’s a parser, not a language model! (Wainwright, 2006)
annealing of objective functions (Smith and Eisner, 2004)
— $w_t \propto P_\theta(t \mid s)^\beta$, with β ∈ [0, 1] interpolating from Uniform (β = 0) to Classic EM (β = 1)
— Viterbi EM is the limit β → ∞
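
A small sketch (ours) of the annealed weighting over an explicit list of parse probabilities, showing that β = 0 gives uniform weights, β = 1 the Classic EM posterior, and large β approaches the Viterbi winner-take-all weighting:

```python
def annealed_weights(tree_probs, beta):
    """w_t proportional to P_theta(t|s)**beta, normalized over one sentence's parses."""
    scores = [p ** beta for p in tree_probs]
    z = sum(scores)
    return [w / z for w in scores]

probs = [0.5, 0.3, 0.2]                # toy P_theta(t) for three parses of s
print(annealed_weights(probs, 0.0))    # uniform weights: [1/3, 1/3, 1/3]
print(annealed_weights(probs, 1.0))    # Classic EM posterior: [0.5, 0.3, 0.2]
print(annealed_weights(probs, 100.0))  # ~Viterbi: almost all mass on the best parse
```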

Three Objective Functions

supervised objective (convex):
$$\hat{\theta}_{\text{SUP}} = \arg\max_\theta \prod_s P_\theta(t^*(s))$$

unsupervised objective (non-convex):
$$\hat{\theta}_{\text{UNS}} = \arg\max_\theta \prod_s \sum_{t \in T(s)} P_\theta(t) = \arg\max_\theta \prod_s P_\theta(s)$$

another unsupervised objective (also non-convex):
$$\hat{\theta}_{\text{VIT}} = \arg\max_\theta \prod_s \max_{t \in T(s)} P_\theta(t)$$
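
To see that the two unsupervised objectives can pull in different directions, here is a toy computation (ours): for a fixed candidate θ we score each sentence by its total parse mass (the UNS objective) versus its single best parse (the VIT objective). The parse probabilities are made-up numbers.

```python
from math import prod

# Made-up parse probabilities P_theta(t) over each sentence's forest T(s),
# under two candidate models theta_a and theta_b.
forests_a = [[0.30, 0.30, 0.30], [0.40, 0.40]]  # mass spread across parses
forests_b = [[0.70, 0.05, 0.05], [0.60, 0.10]]  # mass concentrated on one parse

def uns_objective(forests):
    """prod over s of sum over t in T(s) of P_theta(t): the Classic EM objective."""
    return prod(sum(f) for f in forests)

def vit_objective(forests):
    """prod over s of max over t in T(s) of P_theta(t): the Viterbi EM objective."""
    return prod(max(f) for f in forests)

print(uns_objective(forests_a), vit_objective(forests_a))  # ~0.72  ~0.12
print(uns_objective(forests_b), vit_objective(forests_b))  # ~0.56  ~0.42
# theta_a wins on the UNS objective; theta_b wins on the VIT objective.
```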

Potential Disconnects

classic unsupervised parsers:
— train with respect to sentence strings (learning)
— parse with respect to one-best trees (inference)
— judged against external references (evaluation)
the true generative model θ∗:
— may not yield the most discriminating parser
— may assign suboptimal mass to strings
Viterbi EM fixes one of these...
— but both flavors of EM walk away from the supervised optimum

Reminder: Accuracy vs. θ∗ = θ̂_SUP

maximizing likelihood may degrade accuracy (Pereira and Schabes, 1992; Elworthy, 1994; Merialdo, 1994)
simple example: optimize the wrong model (e.g., make incorrect independence assumptions)
fitting the (supervised) DMV to contrived symmetries:

[diagrams (i)–(v): five dependency structures over the three-token string “a a a”]

expected accuracy for θ̂_SUP: 40% (20% for exact trees)
— yet could achieve 50% (for both) deterministically

More Subtle: θ∗ = θ̂_SUP vs. θ̂_UNS vs. θ̂_VIT

this time, an organic example:
NP     : NNP NNP   — Marvin Alisky.
S      : NNP VBD   (Braniff declined).
NP-LOC : NNP NNP   Victoria, Texas

— the right model, DMV factors the parameters
— no unwarranted independence assumptions
— exact calculations (no numerical instabilities)
— issue persists with infinite data
can again find a more deterministic θ̃ than θ∗:
— assigns zero probability to the truth
— attains higher likelihood on both unsupervised metrics
— has the same expected (but lower variance) accuracy
— and is a fixed point for both flavors of EM
... “fun” exercise, left to the readers! :)
Classic EM known for local deterministic attractors
— Viterbi EM suggested as a remedy (de Marcken, 1995)
— but problem with objectives not confined to EM!

Conclusion

need stronger models and better objective functions
— but this pulls us back towards central planning...
grammar induction is inherently underdetermined
in general, unsupervised learning is underconstrained
alternative: introduce application-specific constraints
— encourage equilibria that share our values (regulation!)

1. partial bracketings (Pereira and Schabes, 1992)
2. synchronous grammar induction (Alshawi and Douglas, 2000)
3. linear-time parsing, skewness, Zipf’s Law... (Seginer, 2007)
4. sparse posterior regularization (Ganchev et al., 2009)
5. mining structure from web mark-up (Spitkovsky et al., 2010)

Summary

Viterbi EM well-suited to unsupervised parsing:

faster to run
— no outside charts (each iteration is faster)
— quicker to converge (4-10x fewer iterations)
scales better
— efficiently handles larger data sets
— performs gracefully with more complex data
simpler algorithm
— easier to code up, debug, and understand...
— invites more flexible modeling techniques!
achieves state-of-the-art results!

Thanks!

Questions?