SLIDE 1 (July 13, 2006)

Novel Estimation Methods

for Unsupervised Discovery

of Latent Structure

in Natural Language Text

Noah A. Smith

Hertz Foundation Fellow, Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University
Assistant Professor, Language Technologies Institute / Machine Learning Department, School of Computer Science, Carnegie Mellon University

Advisor: Jason Eisner

SLIDE 2

Situating the Thesis

  • Too much information in the world!
  • Most information is represented linguistically.

– Most of us can understand one language or more.

  • How can computers help?
  • Can NLP systems “build themselves”?
SLIDE 3

Modern NLP

Machine Learning / Statistics: build models empirically from data; language learning and processing are inference.
Linguistics / Cognitive Science: symbolic formalisms for elegance, efficiency, and intelligibility.
Natural Language Processing draws on both.

SLIDE 4

An Example: Parsing

[Diagram: Sentence → Model + Dynamic Programming Algorithm (discrete search) → Parse Tree]

SLIDE 5

Is Parsing Useful?

  • Speech recognition (Chelba & Jelinek, 1998)
  • Text correction (Shieber & Tao, 2003)
  • Machine translation (Chiang, 2005)
  • Information extraction (Viola and Narasimhan, 2005)
  • NL interfaces to databases (Zettlemoyer & Collins, 2005)

Different parsers for different problems, and learning depends on the task.

SLIDE 6

The Current Bottleneck

  • Empirical methods are great when you have enough of the right data.
  • Reliable unsupervised learning would let us more cheaply:

– Build models for new domains
– Train systems for new languages
– Explore new representations (hidden structures)
– Focus more on applications

SLIDE 7

Central Practical Problem of the Thesis

  • How far can we get with unsupervised estimation?

[Diagram: Sentence → Model + Dynamic Programming Algorithm (discrete search) → Parse Tree]

SLIDE 8

Deeper Problem

  • How far can we get with unsupervised estimation?

[Diagram: many structured inputs → Model → structured output]

SLIDE 9

Outline of the Talk

Learning To Parse (Chapters 1, 2)
Learning = Optimizing a Function (Chapter 3): Maximum Likelihood by EM
Improving the Function: Contrastive Estimation
Improving the Optimizer: Deterministic Annealing
Improving the Function and the Optimizer: Structural Annealing
(the three "Improving" parts are Chapters 4, 5, 6)
Multilingual Experiments (Chapter 7):

  • German
  • English
  • Bulgarian
  • Mandarin
  • Turkish
  • Portuguese
SLIDE 10

Dependency Parsing

  • Underlies many linguistic theories
  • Simple model & algorithms (Eisner, 1996)
  • Projectivity constraint → context-free (cf. McDonald et al., 2005)
  • Unsupervised learning:
– Carroll & Charniak (1992)
– Yuret (1998)
– Paskin (2002)
– Klein & Manning (2004)

Applications:
  • Relation extraction: Culotta & Sorensen (2004)
  • Machine translation: Ding & Palmer (2005)
  • Language modeling: Chelba & Jelinek (1998)
  • All kinds of lexical learning: Lin & Pantel (2001), inter alia
  • Semantic role labeling: Carreras & Màrquez (2004)
  • Textual entailment: Raina et al. (2005), inter alia

SLIDE 11

A Dependency Tree

SLIDE 12

Our Model A (“DMV”)

  • Expressible as a SCFG
  • Can be viewed as a log-linear model with these features:

– Root tag is U.
– Tag U has a child tag V in direction D.
– Tag U has no children in direction D.
– Tag U has at least one child in direction D.
– Tag U has only one child in direction D.
– Tag U has a non-first child in direction D.

SLIDE 13

Example Derivation of the Model

(Klein & Manning, 2004)

Example sentence (tagged): Mr./NNP Smith/NNP, 39/CD, retains/VBZ the/DT title/NN of/IN chief/JJ financial/JJ officer/NN.

Features that fire in this derivation include:
– Root tag is VBZ.
– VBZ has a right child.
– VBZ has only 1 right child.
– VBZ has NN as right child.
– VBZ has a left child.
– VBZ has NNP as left child.
– VBZ has only 1 left child.
– NNP has a right child.
– NNP has CD as right child.

SLIDE 14

Stochastic and Log-linear CFGs

Model: a context-free grammar (production rules) with rule weights \theta.

Stochastic CFG view: the probability of a (sentence, tree) pair (x, y) is a product over the rule tokens in its derivation,

  p_\theta(x, y) = \prod_{\text{rule tokens } r} e^{\theta_r} = \prod_{\text{rules } r} e^{\theta_r f_r(x, y)} = \exp \sum_r \theta_r f_r(x, y)

Log-linear CFG view: the same score, globally normalized,

  p_\theta(x, y) = \exp\left( \sum_r \theta_r f_r(x, y) \right) / Z_\theta(W)

where f_r(x, y) counts how many times rule r is used in the derivation, and W is the set of all sentences paired with their trees (Z_\theta(W) sums the numerator over it).
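To make the two views concrete, here is a minimal Python sketch (mine, not the thesis code) that counts DMV-style features for a toy dependency tree and scores it as exp(θ·f); the feature names, the toy tree, and the weights are illustrative assumptions.

```python
import math
from collections import Counter

def dmv_features(tags, parents):
    """Count DMV-style features for a dependency tree.

    tags[i] is the POS tag of word i; parents[i] is the index of word i's
    parent, or -1 if word i is the root.  Feature names are illustrative.
    """
    f = Counter()
    n = len(tags)
    for i, p in enumerate(parents):
        if p == -1:
            f[("root", tags[i])] += 1
        else:
            direction = "right" if i > p else "left"
            f[("child", tags[p], tags[i], direction)] += 1
    # Valence-style features: does each head have 0, >=1, or exactly 1 child per side?
    for h in range(n):
        for direction in ("left", "right"):
            kids = [i for i in range(n) if parents[i] == h and
                    ((i < h) if direction == "left" else (i > h))]
            if not kids:
                f[("no-child", tags[h], direction)] += 1
            else:
                f[("has-child", tags[h], direction)] += 1
                if len(kids) == 1:
                    f[("one-child", tags[h], direction)] += 1
    return f

def log_linear_score(features, theta):
    """Unnormalized score exp(sum_r theta_r * f_r), as in the product-of-rule-weights view."""
    return math.exp(sum(theta.get(name, 0.0) * count for name, count in features.items()))

# Toy example: "Mr./NNP Smith/NNP retains/VBZ the/DT title/NN"
tags = ["NNP", "NNP", "VBZ", "DT", "NN"]
parents = [1, 2, -1, 4, 2]        # Mr.->Smith, Smith->retains, retains is root, the->title, title->retains
theta = {("root", "VBZ"): 0.5}    # illustrative weights; unseen features default to weight 0
feats = dmv_features(tags, parents)
print(feats)
print(log_linear_score(feats, theta))
```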

SLIDE 15

Model A is Very Simple!

  • Connected, directed trees over tags.
– Tag-tag relationships
– Affine valency model

  • No sister effects, even on same side of parent.
  • No grandparent effects.
  • No lexical selection, subcategorization, anything.
  • No distance effects.

Parsing: O(n^5) naïve; O(n^3) (Eisner & Satta, 1999)

SLIDE 16

Evaluation

Treebank tree (gold standard) hypothesis tree

✖ ✖ ✖ ✖ ✖ ✖ ✔ ✔ ✔

Accuracy = 3 / (3 + 6) = 33.3%

SLIDE 17

Evaluation

Treebank tree (gold standard) hypothesis tree

✖ ✔ ✖ ✔ ✖ ✖ ✔ ✔ ✔

Undirected Accuracy = 5 / (5 + 4) = 55.5%
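For concreteness, a small sketch of the metric as I understand it from these two slides (the exact conventions, such as how the root attachment is counted, are my assumptions): directed accuracy requires the hypothesized parent to match the gold parent, while undirected accuracy also credits an arc drawn in the wrong direction.

```python
def attachment_accuracy(gold_parents, hyp_parents):
    """Directed and undirected attachment accuracy.

    gold_parents[i] / hyp_parents[i]: index of word i's parent (-1 for root).
    Undirected scoring also counts an arc as correct if it connects the same
    two words in the opposite direction.
    """
    assert len(gold_parents) == len(hyp_parents)
    gold_arcs = {(p, i) for i, p in enumerate(gold_parents)}
    directed = undirected = 0
    for i, p in enumerate(hyp_parents):
        if (p, i) in gold_arcs:
            directed += 1
            undirected += 1
        elif (i, p) in gold_arcs:   # right link, wrong direction
            undirected += 1
    n = len(gold_parents)
    return directed / n, undirected / n

# Toy 9-word case mirroring the running example: 3 of 9 arcs exactly right,
# 2 more right only undirected -> (0.333..., 0.555...)
gold = [2, 2, -1, 4, 2, 4, 8, 8, 5]
hyp  = [2, 0, -1, 2, 3, 4, 5, 6, 7]
print(attachment_accuracy(gold, hyp))
```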

SLIDE 18

Fixed Grammar, Learned Weights

Model: a context-free grammar (production rules) with rule weights \theta.

All dependency trees on all tag sequences can be derived. How do we learn the weights \theta?

SLIDE 19

Maximum Likelihood Estimation

  \max_\theta \; p_\theta(\text{observed data})

Supervised training: "observed data" are sentences with trees.

  \max_\theta \prod_{i=1}^{n} p_\theta(x_i, y_i)

(Independence among examples; for PCFGs there is a closed-form solution.)

Unsupervised training: "observed data" are sentences only; marginalize over trees.

  \max_\theta \prod_{i=1}^{n} \sum_{y} p_\theta(x_i, y)

(Requires numerical optimization.)
SLIDE 20

Expectation-Maximization

  • Hillclimber for the likelihood function.
  • Quality of the estimate depends on the starting point.

[Figure: the likelihood p_\theta(x) plotted against the rule weights \theta; hillclimbing from different starting points can stop at different optima]

SLIDE 21

EM for Stochastic Grammars

  • E step

Compute expected rule counts for each sentence:

  • M step

Renormalize counts into multinomial distributions.

E step (computed with the dynamic programming algorithm):

  c_r \leftarrow c_r + E_{p_{\theta^{(i)}}}\left[ f_r(x_j, Y) \right] \quad \text{for each sentence } x_j

M step:

  \theta_r^{(i+1)} = \log\left( c_r / Z \right)

where Z renormalizes the counts within each multinomial.
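A minimal sketch of this E/M loop, assuming some routine such as Inside-Outside supplies the expected rule counts per sentence; expected_counts and the grouping of rules into multinomials are placeholders, not the thesis implementation.

```python
import math
from collections import defaultdict

def em_step(sentences, theta, expected_counts, groups):
    """One EM iteration for a stochastic grammar.

    theta: dict rule -> log weight (log-probabilities within each multinomial).
    expected_counts(sentence, theta) -> dict rule -> E[f_r(x, Y)] under p_theta
        (in practice computed by the Inside-Outside dynamic program).
    groups: dict rule -> key of the multinomial the rule belongs to
        (e.g., all rules expanding the same nonterminal share a key).
    """
    # E step: accumulate expected rule counts over the corpus
    c = defaultdict(float)
    for x in sentences:
        for rule, count in expected_counts(x, theta).items():
            c[rule] += count

    # M step: renormalize counts within each multinomial, theta_r = log(c_r / Z)
    totals = defaultdict(float)
    for rule, count in c.items():
        totals[groups[rule]] += count
    return {rule: math.log(count / totals[groups[rule]])
            for rule, count in c.items() if count > 0.0}

if __name__ == "__main__":
    # Toy check with a fake E step that returns fixed counts.
    fake = lambda x, th: {("S", "a"): 1.0, ("S", "b"): 3.0}
    groups = {("S", "a"): "S", ("S", "b"): "S"}
    print(em_step(["x1", "x2"], {}, fake, groups))   # -> log 0.25, log 0.75
```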

SLIDE 22

Experiment

  • WSJ10: 5300 part-of-speech sequences of length ≤10
  • Words ignored, punctuation stripped
  • Three initializers:

– Zero: all weights set to zero
– K&M: Klein and Manning (2004), roughly
– Local: slight variation on K&M, more smoothed

  • 530 test sentences
SLIDE 23

Experimental Results: MLE/EM

                        Accuracy (%)   Undirected Acc. (%)   Iterations   Cross-Entropy
  Attach-Left           22.6           62.1
  Attach-Right          39.5           62.1
  MLE/EM (Zero init.)   22.7           58.8                  49           26.07
  MLE/EM (K&M init.)    41.7           62.1                  62           25.16
  MLE/EM (Local init.)  22.8           58.9                  49           26.07

SLIDE 24

Dirichlet Priors for PCFG Multinomials

  • Simplest conceivable smoothing: add-λ
  • Slight change to M step:

  \theta_r^{(i+1)} = \log\left( (c_r + \lambda) / Z \right)

As if we saw each event an additional λ times. This is Maximum a Posteriori estimation, or "MLE with a prior." How to pick λ?
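The corresponding change to the M step, sketched with the same illustrative grouping of rules into multinomials as in the EM sketch above:

```python
import math
from collections import defaultdict

def map_m_step(counts, groups, lam):
    """M step with add-lambda (Dirichlet) smoothing: theta_r = log((c_r + lambda) / Z)."""
    totals = defaultdict(float)
    for rule in counts:
        totals[groups[rule]] += counts[rule] + lam
    return {rule: math.log((counts[rule] + lam) / totals[groups[rule]]) for rule in counts}

print(map_m_step({("S", "a"): 0.0, ("S", "b"): 6.0}, {("S", "a"): "S", ("S", "b"): "S"}, lam=1.0))
# with lambda = 1: probabilities 1/8 and 7/8, so no rule gets probability zero
```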

SLIDE 25

Model Selection

Supervised selection: best accuracy on annotated development data (presented in talk).

Unsupervised selection: best likelihood on unannotated development data (given in thesis).

[Diagram: one set of rule weights trained per value of λ; the model that does best on the development dataset is selected]

SLIDE 26

Model Selection

Advantages:

  • Can re-select later for different applications/datasets.

Disadvantages:

  • Lots of models to train!
  • Still have to decide which λ values to train with.

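A sketch of the selection loop these two slides describe; train_model, dev_accuracy, and dev_likelihood are stand-ins for MAP/EM training and development-set evaluation, not actual thesis code.

```python
def select_model(lambdas, train_model, dev_accuracy, dev_likelihood, supervised=True):
    """Train one model per lambda and keep the best by the chosen criterion.

    supervised=True : best accuracy on annotated development data.
    supervised=False: best likelihood on unannotated development data.
    """
    score = dev_accuracy if supervised else dev_likelihood
    candidates = [(lam, train_model(lam)) for lam in lambdas]
    return max(candidates, key=lambda pair: score(pair[1]))

best_lam, best_model = select_model(
    lambdas=[0.1, 1.0, 10.0],
    train_model=lambda lam: {"lam": lam},           # stand-in for MAP/EM training
    dev_accuracy=lambda m: -abs(m["lam"] - 1.0),    # pretend lambda = 1 is best on dev
    dev_likelihood=lambda m: 0.0,
)
print(best_lam)   # -> 1.0
```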

SLIDE 27

Experimental Results: MAP/EM

                                Accuracy (%)   Undirected Acc. (%)   Iterations   Cross-Entropy
  Attach-Right                  39.5           62.1
  MLE/EM (Zero init.)           22.7           58.8                  49           26.07
  MLE/EM (K&M init.)            41.7           62.1                  62           25.16
  MLE/EM (Local init.)          22.8           58.9                  49           26.07
  MAP/EM (sel. λ, initializer)  41.6           62.2                  49           25.54

SLIDE 28

“Typical” Trees

[Figure: example trees from the treebank and from the learned model]

SLIDE 29

Good and Bad News About Likelihood

SLIDE 30

Selection over Random Initializers

SLIDE 31

On Aesthetics

  • Hyperparameters should be interpretable.
  • Reasonable initializers should perform reasonably.
  • These are a form of domain knowledge that should help, not hurt performance.
  • If all else fails, the "Zero" (maxent) initializer should perform well.

Can we have both?

SLIDE 32

Where are we?

Learning To Parse
Learning = Optimizing a Function
→ Improving the Function

SLIDE 33

Likelihood as Teacher

Red leaves don’t hide blue jays. Mommy doesn’t love you. Dishwashers are a dime a dozen. Dancing granola doesn’t hide blue jays.

SLIDE 34

Probability Allocation

[Diagram: the space Σ* of all strings; the observed sentences are a small subset]

SLIDE 35

What We’d Like

  • Focus the model on the properties of the data that will lead to an explanation of syntax.

Red leaves don't hide blue jays.
*Jays blue hide don't leaves red.
*Blue don't hide jays leaves red.
*Hide don't blue jays red leaves.

  • Idea: train the model to explain order but not content.
SLIDE 36

Contrastive Estimation

(Smith & Eisner, 2005)

[Diagram: within Σ*, the observed sentences surrounded by implicitly negative sentences]

SLIDE 37

Maximum Likelihood Estimation vs. Contrastive Estimation

  • MLE/MAP: observed data are sentences, neighborhood is Σ*.

  \max_\theta \prod_{i=1}^{n} \sum_{y} p_\theta(x_i, y)

  • CE: observed data are sentences, neighborhood is ... ?

  \max_\theta \prod_{i=1}^{n} \frac{ \sum_{y} p_\theta(x_i, y) }{ \sum_{x \in N(x_i)} \sum_{y} p_\theta(x, y) } \;=\; \max_\theta \prod_{i=1}^{n} p_\theta\left( X = x_i \mid X \in N(x_i) \right)

Both require numerical optimization.
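A toy numeric sketch contrasting the two objectives; each sentence's marginal score and its neighborhood's total score are given as plain numbers here, whereas in the real model they come from dynamic programming sums over trees.

```python
import math

def mle_log_likelihood(sentence_scores):
    """MLE: sum_i log p_theta(x_i) (scores already marginalized over trees)."""
    return sum(math.log(s) for s in sentence_scores)

def ce_log_likelihood(sentence_scores, neighborhood_scores):
    """CE: sum_i log [ p_theta(x_i) / p_theta(N(x_i)) ], i.e. log p(X = x_i | X in N(x_i))."""
    return sum(math.log(s / n) for s, n in zip(sentence_scores, neighborhood_scores))

# Two observed sentences; each neighborhood's score includes the sentence itself.
scores = [0.020, 0.005]
neigh  = [0.025, 0.050]
print(mle_log_likelihood(scores))
print(ce_log_likelihood(scores, neigh))   # close to 0 when the model prefers x_i to its perturbations
```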
SLIDE 38

Partition Neighborhood = Conditional EM

[Diagram: Σ* partitioned into neighborhoods; observed sentences and implicitly negative sentences]

SLIDE 39

Riezler’s (1999) Approximation

[Diagram: Σ* and the observed sentences]

SLIDE 40

Analogy to Conditional Estimation (Supervised)

[Diagram: Σ* and Y]

SLIDE 41

CE for Syntax

[Diagram: within Σ*, each observed sentence's neighborhood: same content, syntactically ill-formed]

SLIDE 42

CE as Teacher

Red leaves don’t hide blue jays. Leaves red don’t hide blue jays. Red don’t leaves hide blue jays. Red leaves hide don’t blue jays.

SLIDE 43

Optimizing Contrastive Likelihood

  F(\theta) = \sum_{i=1}^{n} \left( \log p_\theta(X = x_i) - \log p_\theta(X \in N(x_i)) \right)

  \frac{\partial F}{\partial \theta_r} = \sum_{i=1}^{n} \left( E_{p_\theta}\left[ f_r(x_i, Y) \right] - E_{p_\theta}\left[ f_r(X, Y) \mid X \in N(x_i) \right] \right)

The first term is the expected count of rule r in sentence i; the second is the expected count of rule r in neighborhood i. Optimize by gradient ascent, conjugate gradient, or LMVM/L-BFGS.

What about the simplex constraints? How to make the second term efficient?
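A sketch of a single ascent step on this objective, assuming the two expectation routines (over a sentence and over its neighborhood lattice) are provided by the Inside-Outside machinery; the function names are placeholders, and plain gradient ascent stands in for the conjugate-gradient / LMVM optimizers mentioned above.

```python
def ce_gradient_step(theta, sentences, expect_in_sentence, expect_in_neighborhood, step_size=0.1):
    """One step of gradient ascent on F(theta) = sum_i [log p(x_i) - log p(N(x_i))].

    d F / d theta_r = sum_i ( E[f_r(x_i, Y)] - E[f_r(X, Y) | X in N(x_i)] ).
    """
    grad = {r: 0.0 for r in theta}
    for x in sentences:
        pos = expect_in_sentence(x, theta)        # expected rule counts in the sentence
        neg = expect_in_neighborhood(x, theta)    # expected rule counts in its neighborhood
        for r in theta:
            grad[r] += pos.get(r, 0.0) - neg.get(r, 0.0)
    return {r: theta[r] + step_size * grad[r] for r in theta}
```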

SLIDE 44

Getting Rid of Simplex Constraints

  • PCFGs represent distributions p(tree, sentence).
  • So do some WCFGs - if you can normalize.

(Requires a finite sum over all derivation scores.)

PCFGs and WCFGs represent the same family.

  • PCFGs represent p(tree | sentence).
  • So do some WCFGs - if you can normalize.

(Requires a finite sum over all sentence derivations.)

PCFGs and WCFGs represent the same conditional family.

Chi (1999); Abney et al. (1999); Smith and Johnson (2005)

The two normalizers: Z_\theta(W), a sum over all (sentence, tree) pairs, vs. Z_\theta(x), a sum over all derivations of the given sentence x.

SLIDE 45

Optimizing Contrastive Likelihood

  F(\theta) = \sum_{i=1}^{n} \left( \log p_\theta(X = x_i) - \log p_\theta(X \in N(x_i)) \right)

  \frac{\partial F}{\partial \theta_r} = \sum_{i=1}^{n} \left( E_{p_\theta}\left[ f_r(x_i, Y) \right] - E_{p_\theta}\left[ f_r(X, Y) \mid X \in N(x_i) \right] \right)

(expected count of rule r in sentence i, minus expected count of rule r in neighborhood i)

What about the simplex constraints? How to make the second term efficient?

SLIDE 46

Summing over N(x)

  • Dynamic programming saves the day again!
  • If the set N(x) is represented as a lattice, we can

apply the usual Inside-Outside algorithm with a slight change.

[Diagram: the strings in N(x) (variants of "a b c") encoded as a lattice and fed to the dynamic programming algorithm]

SLIDE 47

Original Idea: Word Order

N(x) = all permutations of x

  • Up to |x|! reorderings; requires a lattice with O(2^|x|) arcs
  • Tradeoff: we want
– A small lattice
– A neighborhood that includes as many conceivable negative examples as possible
– A neighborhood that has few false negative examples

SLIDE 48

Crude Lattice Neighborhoods

  • Mangle the syntax of the sentence by locally reordering and/or deleting some tags.

[Figure: example lattices for the Transpose1, Dynasearch, and Delete1 neighborhoods]
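For intuition, a sketch that materializes the Trans1 and Del1 neighborhoods of a short tag sequence as explicit sets; the thesis encodes them compactly as lattices so the dynamic program can sum over them.

```python
def trans1(tags):
    """All sequences obtained by transposing one adjacent pair (plus the original)."""
    out = {tuple(tags)}
    for i in range(len(tags) - 1):
        swapped = list(tags)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        out.add(tuple(swapped))
    return out

def del1(tags):
    """All sequences obtained by deleting one tag (plus the original)."""
    out = {tuple(tags)}
    for i in range(len(tags)):
        out.add(tuple(tags[:i] + tags[i + 1:]))
    return out

x = ["DT", "NN", "VBZ"]
print(sorted(trans1(x)))
print(sorted(del1(x)))
print(sorted(trans1(x) | del1(x)))   # Del1OrTrans1 as a set union
```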

SLIDE 49

Midpoint Joke

SLIDE 50

CE Computation

[Diagram: one dynamic programming pass over the observed sentence and one over its neighborhood lattice]

SLIDE 51

Experimental Results: CE

                                 Accuracy (%)   Undirected Acc. (%)
  Attach-Right                   39.5           62.1
  Length (sel. σ², init.)        45.5           64.9
  MAP/EM (sel. λ, initializer)   41.6           62.2
  Trans1 (sel. σ², init.)        41.2           62.5
  Del1OrTrans1 (sel. σ², init.)  57.6           69.0
  Del1 (sel. σ², init.)          39.7           53.5
  Dynasearch (sel. σ², init.)    47.6           65.3

SLIDE 52

Experimental Results: Del1OrTrans1

                             Zero init.           K&M init.            Local init.
                             Dir.(%)  Undir.(%)   Dir.(%)  Undir.(%)   Dir.(%)  Undir.(%)
  Attach-Right               39.5     62.1        39.5     62.1        39.5     62.1
  MLE/EM                     22.7     58.8        41.7     62.2        22.8     58.9
  MAP/EM (sel. λ)            23.8     58.9        41.6     62.2        24.4     59.4
  Del1OrTrans1 (sel. σ²)     36.4     62.2        48.6     64.9        57.6     69.0
  Del1OrTrans1 (unreg.)      35.8     61.8        48.4     65.4        57.6     69.0

SLIDE 53

“Typical” Trees

[Figure: example trees from the treebank, MAP/EM, and CE]

SLIDE 54

Cause for Concern?

SLIDE 55

Bonus!

  • Log-linear grammars can model more features.
  • Smith & Eisner (2005): in HMM estimation from unlabeled data, spelling features can make up for worse dictionaries.
  • In thesis: Model U
– Not representable as a stochastic model (only log-linear)
– Improvement with spelling features (poor man's lexicalization)

SLIDE 56

Where are we?

Learning To Parse
Learning = Optimizing a Function
Improving the Function
→ Improving the Optimizer

SLIDE 57

Expectation-Maximization

  • Hillclimber for the likelihood function.
  • Quality of the estimate depends on the starting point.

[Figure: the likelihood p_\theta(x) over the rule weights \theta, as before]

  • Can we improve the search procedure to avoid getting stuck on local optima?

SLIDE 58

Deterministic Annealing

Rose et al. (1990) Ueda and Nakano (1998)

SLIDE 59

EM as Coordinate Ascent

Neal and Hinton (1998)

SLIDE 60

Deterministic Annealing

[Diagram: a sequence of models over time as β goes from ≈ 0 (high entropy required) to β = 1 (no entropy constraint)]
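A toy sketch of the annealing idea: posteriors over hidden structures are flattened by raising them to the power β and renormalizing, with β moved from near 0 toward 1 on a schedule. The discrete posterior and the schedule below are illustrative only.

```python
def anneal_posterior(posterior, beta):
    """Raise a discrete posterior to the power beta and renormalize (beta near 0 flattens it)."""
    scaled = {y: p ** beta for y, p in posterior.items()}
    z = sum(scaled.values())
    return {y: p / z for y, p in scaled.items()}

posterior = {"tree1": 0.7, "tree2": 0.2, "tree3": 0.1}
for beta in [0.1, 0.4, 0.7, 1.0]:   # illustrative schedule; DA runs EM to convergence at each beta
    print(beta, anneal_posterior(posterior, beta))
```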

SLIDE 61

Skewed Deterministic Annealing (Smith and Eisner, 2004)

Clever initializer

SLIDE 62

Skewed Deterministic Annealing

[Diagram: a sequence of models over time as β goes from ≈ 0 (low divergence from the initializer) to β = 1 (no divergence constraint)]

SLIDE 63

Optimizers of Likelihood

              Tries to avoid local optima   Can exploit good initializer   Accuracy (s-sel.; %)   Cross-entropy (training)
  EM          ✖                             ✔                              41.6                   26.07
  DA          ✔                             ✖                              34.8                   22.12
  Skewed DA   ✔                             ✔                              46.7                   27.92

Supervised selection applied across initializers, λ (for EM), and schedule (for DA, SDA).

SLIDE 64

Summary So Far

  • EM just barely outperforms Attach-Right.
  • CE training does better with good initializers.
– Bonus: log-linear models, so new features can be added
– Concern: performance gain not consistent on random models
  • DA does its job (better likelihood) but doesn't help accuracy!
  • SDA can outperform EM, but not because it avoided a local optimum. (Either luck, or an effect of the search trajectory.)

Objective matters. Search matters.

SLIDE 65

Where are we?

Learning To Parse
Learning = Optimizing a Function
Improving the Function
Improving the Optimizer
→ Improving the Function and the Optimizer

SLIDE 66

A Different Approach

  • CE: domain knowledge defines the neighborhood
– Define what structure is supposed to "explain"
  • DA/SDA: "managed" difficulty improves search
– Easy function → difficult function
  • Structural Annealing:
– Domain knowledge informs our ideas about search difficulty
– Easy structures → difficult structures

SLIDE 67

Short Dependency Preference

[Figure: an example dependency tree with each attachment labeled by its length: 1, 1, 1, 1, 1, 2, 2, 2, 3]

SLIDE 68

Dependency Length Distribution

SLIDE 69

A Locality Feature (Model L)

  p_{\theta,\delta}(x, y) \;\propto\; p_\theta(x, y) \cdot \exp\left( \delta \sum_{i=1}^{|x|} | i - y(i) | \right)

where y(i) is the position of word i's parent. A global sum-of-lengths feature that nonetheless factors locally; δ is the locality bias.

[Plot: accuracy as a function of the locality bias δ]
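A sketch of the locality-biased score: the Model A score is multiplied by exp(δ · total dependency length). model_a_score is a stand-in for the real model's score.

```python
import math

def total_dependency_length(parents):
    """Sum of |i - parent(i)| over non-root words (parents[i] = -1 marks the root)."""
    return sum(abs(i - p) for i, p in enumerate(parents) if p != -1)

def model_l_score(parents, delta, model_a_score=1.0):
    """Unnormalized Model L score: Model A score times exp(delta * sum of dependency lengths)."""
    return model_a_score * math.exp(delta * total_dependency_length(parents))

parents = [1, 2, -1, 4, 2]                  # toy tree; lengths 1, 1, 1, 2 -> total 5
print(model_l_score(parents, delta=-0.6))   # strong penalty for long attachments
print(model_l_score(parents, delta=0.0))    # no penalty: back to Model A
```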

SLIDE 70

Structural Annealing

  • Early: big penalty for long attachments (δ << 0)
  • ... gradually increase δ ...
  • Later: no penalty (δ = 0)
  • (Keep going, using development data to decide when to stop.)
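A sketch of the annealing loop, assuming run_em_with_bias trains Model L to convergence at a fixed δ and dev_score evaluates on development data (both placeholders); the default hyperparameters echo the selected values reported later in the talk (δ0 = -0.6, Δδ = 0.1, final δ = 0.1).

```python
def structural_annealing(theta0, run_em_with_bias, dev_score,
                         delta0=-0.6, delta_step=0.1, delta_max=0.1):
    """Train with a strong locality bias first, then relax it, keeping the dev-best model."""
    theta, best = theta0, None
    delta = delta0
    while delta <= delta_max + 1e-9:
        theta = run_em_with_bias(theta, delta)   # warm-start from the previous solution
        score = dev_score(theta)
        if best is None or score > best[0]:
            best = (score, delta, theta)
        delta += delta_step
    return best   # (dev score, delta at which it was reached, model)
```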

SLIDE 71

Two Views of SA

  • Search view: we start with an easier objective and move to a harder one.
  • Objective function view:
– We added a feature to the model, during training.
– Its weight is trained in a different way, because we know roughly what it should be.
– Adding a feature changes the objective.

SLIDE 72

Experimental Results: SA

                                                    Accuracy (%)   Undirected Acc. (%)   Hyperparameters
  Attach-Right                                      39.5           62.1
  MAP/EM (sel. λ, initializer)                      41.6           62.2                  λ = 10^(-2/3), K&M
  CE/Del1OrTrans1 (sel. σ², init.)                  57.6           69.0                  σ² = ∞, Local
  Locality Bias (sel. λ, δ, init.)                  61.8           69.4                  λ = 10, δ = -0.6, Zero
  Structural Annealing (sel. λ, δ0, Δδ, δf, init.)  66.7           73.1                  λ = 10, δ0 = -0.6, Δδ = 0.1, δf = 0.1, Zero

SLIDE 73

Structural Annealing Performance

Zero initializer, λ = 10

SLIDE 74

“Typical” Trees

[Figure: example trees from the treebank, MAP/EM, and MAP/SA]

SLIDE 75

Path Analysis

SLIDE 76

Path Analysis

[Plot: distribution over the distance from a tag to its true parent in the hypothesized (undirected) tree, for Attach-Right, CE/Del1OrTrans1, MAP/EM, and MAP/SA]

SLIDE 77

CE and SA

  search \ objective    MAP            CE (Del1OrTrans1)
  No bias               41.6 / 62.2    57.6 / 69.0
  Fixed bias            61.8 / 69.4    63.5 / 71.5
  Annealed bias         66.7 / 73.1    65.5 / 72.3

(each cell: directed / undirected accuracy, %)

SLIDE 78

Another Structural Feature

  • “Model S” - just like Model A, but allows broken trees

(roots modeled by unigram distribution).

  • Gradually in

crease bias toward connectedness.

  • Decode with Model A.

Undirected (%) Directed (%) 68.8 58.4 (anneal β) 67.0 55.6 Model S (fix β) 62.2 41.6 Model A (MAP/EM)

SLIDE 79

Decoding under Model S

SLIDE 80

On Supervision

[Plot: directed accuracy (%) vs. size of the development set; use SA if you have < 50 trees]

SLIDE 81

Where are we?

Learning To Parse
Learning = Optimizing a Function
Improving the Function
Improving the Optimizer
Improving the Function and the Optimizer
→ Multilingual Experiments

SLIDE 82

Experimental Setup

  • Similar to English:
– Part-of-speech tags only, sequences of ≤ 10 tags after stripping punctuation
– ≈ 500 development, ≈ 500 test sentences
  • Training:
– 8K German (Tiger)
– 5K English (WSJ) & Bulgarian (BulTreeBank)
– 3K Mandarin (Penn Chinese) & Turkish (METU-Sabanci)
– 2K Portuguese (Bosque)
  • Supervised model selection
SLIDE 83

Multilingual Experiments

                 German   English   Bulgarian   Mandarin   Turkish   Portuguese
  Attach-Left     8.2      22.6      37.2        13.1       6.6       36.2
  Attach-Right   47.0      39.5      23.8        42.9      61.8       29.5
  MAP/EM         54.4      41.6      45.6        50.0      48.0       42.3
  CE             63.4      57.6      40.5        41.1      59.0       71.8
  MAP/δ          61.3      61.8      49.2        51.1      62.3       50.4
  MAP/SA         71.8      66.7      58.7        58.0      62.3       50.5
  supervised     83.7      82.5      79.2        72.3      72.5       86.5

(directed accuracy, %)

SLIDE 84

Multilingual Experiments

SLIDE 85

Future Work

  • Hyperparameter selection should be part of optimization.
– More Bayesian (and expensive) approach: optimize hyperparameters, integrating out the parameters!
  • Better models that can capture lexical effects.
– "Anneal" from Model A into such models?
  • Learning & testing on longer sentences.
– Structural annealing might be even more helpful!
  • Better or more task-focused CE neighborhoods?
  • Other kinds of structure
– Cross-lingual structure (word alignments, trees, etc.)
– Morphology, semantics, discourse, tertiary protein structure ...

SLIDE 86

Conclusion

  • Explored two key dimensions of unsupervised structure learning:
– What do you optimize? (objective function)
– How do you optimize it? (search)
Both are important!
  • Five-fold increase in the "labeled data threshold."
  • State-of-the-art performance on all 6 languages tested.
  • Two clean ways to improve unsupervised modeling using domain knowledge: CE, SA

slide-87
SLIDE 87

July 13, 2006 87

Notes of Appreciation

  • Hertz Foundation (esp. Lowell Wood)
  • Jason Eisner, Dale Schuurmans, Paul Smolensky, David Yarowsky
  • Markus Dreyer, Ben Klemens, David Smith, Roy Tromble
  • Eric Brill, Bill Byrne, Eugene Charniak, Michael Collins, Bob Frank, Joshua Goodman, Keith Hall, Rebecca Hwa, Fred Jelinek, Mark Johnson, Damianos Karakos, Sanjeev Khudanpur, Dan Klein, John Lafferty, Chris Manning, Dan Melamed, Philip Resnik, Dan Roth, Giorgio Satta, Zak Shafran
  • Geetu, John, Silviu, Jia, Sourin, Yonggang, Elliott, Trish, Ahmad, Erin, Hans, Nikesh, Arnab, Eric, Shankar, Gideon, Lambert, Paul, Charles, Yi, Veera, Paola, Chris, Rich, Jun, Peng, Lisa
  • Laura Graham, Eiwe Lingfors, Sue Porterfield, Steve Rifkin, Linda Rorke
  • Kay Dixon, Gene Granger, Lorie Smith, Maria Smith, Wayne Smith
  • Karen Thickman

SLIDE 88

Key Contributions

  • Novel generalization of partial-data MLE to incorporate implicit negative evidence (CE).
– Bonus: easier training of log-linear models (with arbitrary features)
  • Novel generalization of deterministic annealing to exploit good initializers (SDA).
  • Novel parameter search technique allowing the use of domain knowledge to start simple and gradually push the model toward difficult structures (SA).
  • Significant accuracy improvements on weighted grammar induction in six diverse languages.

SLIDE 89

Other Contributions Not in Thesis

  • WCFG = SCFG (as conditional distributions) (Smith & Johnson, in review)
  • Vine grammar: regular dependency grammars (Eisner & Smith, 2005)
  • Multilingual NLP:
– Korean/English parsing (Smith & Smith, 2004)
– State-of-the-art morphological disambiguation for Korean, Arabic, and Czech (Smith, Smith, & Tromble, 2005)
– Fast, precise vine parsing for 13 languages (Dreyer, Smith, & Smith, 2006)
  • Contributor to:
– Dyna language for weighted dynamic programming (Eisner, Goldlust, & Smith, 2004, 2005)
– STRAND bilingual text mining system (Resnik, 1999; Resnik & Smith, 2003)
– Egypt statistical machine translation toolkit (Al-Onaizan et al., 1999; Smith & Jahr, 2000)
SLIDE 90

Model A, Supervised

  • MLE: 82.5% accuracy, 84.8% undirected
  • MAP (oracle λ): 82.8%, 85.1%
  • MCLE (unreg.): 83.9%, 86.6%
  • MLE (train on Sections 2-21): 70.4% (Section 23)

– With distance model: 75.6% (Eisner & Smith, 2005)

McDonald et al. (2006): 91.5%

SLIDE 91

Motivation

  • Goal of NLP: build software that does useful things with language.
– Transcribe spoken language.
– Digitize printed language.
– Find & present information from text & speech databases.
– Translate between languages.
  • Does this have anything to do with human intelligence? Maybe. Success will have everything to do with understanding language.

SLIDE 92

7-fold cross-validation

SLIDE 93