SLIDE 1

Learning as Search Optimization:

Approximate Large Margin Methods for Structured Prediction

Hal Daumé III and Daniel Marcu

Information Sciences Institute University of Southern California {hdaume,marcu}@isi.edu

SLIDE 2

Structured Prediction 101

Learn a function mapping inputs to complex outputs:

f : X → Y

[Figure: example mappings from the input space, through decoding, to the output space. Sequence labeling and parsing: candidate tag sequences for "I can can a can" (e.g., Pro Md Vb Dt Nn). Coreference resolution: Bill Clinton, Clinton, Al Gore, Gore, he, the President. Machine translation: "Mary did not slap the green witch ." / "Mary no daba una botefada a la bruja verda ."]

SLIDE 3

Problem Decomposition

Divide problem into regions

Express both the loss function and the features in terms of regions:

Example: "I can can a can" labeled Pro Md Vb Dt Nn

Decoding:

Tractable using dynamic programming when regions are simple (max-product algorithm); a small decoding sketch follows below

Parameter estimation (linear models – CRF, M3N, SVMISO, etc.):

Tractable using dynamic programming when regions are simple (sum-product algorithm)
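
As a sketch of the decoding bullet above: a minimal Viterbi (max-product) decoder, assuming regions are adjacent tag pairs and that emission and transition scores are supplied as plain arrays; the names and signatures are illustrative, not from the slides.

    # Minimal Viterbi (max-product) decoding sketch for a tag sequence.
    # Regions are assumed to be (previous tag, current tag) pairs; emit[i][t]
    # scores tag t at position i, trans[s][t] scores the transition s -> t.

    def viterbi(emit, trans, num_tags):
        n = len(emit)
        score = [[float("-inf")] * num_tags for _ in range(n)]
        back = [[0] * num_tags for _ in range(n)]
        score[0] = list(emit[0])
        for i in range(1, n):
            for t in range(num_tags):
                # Best previous tag under the max-product (Viterbi) recursion.
                best_s = max(range(num_tags), key=lambda s: score[i - 1][s] + trans[s][t])
                back[i][t] = best_s
                score[i][t] = score[i - 1][best_s] + trans[best_s][t] + emit[i][t]
        # Recover the best tag sequence by following back-pointers.
        t = max(range(num_tags), key=lambda t: score[n - 1][t])
        tags = [t]
        for i in range(n - 1, 0, -1):
            t = back[i][t]
            tags.append(t)
        return list(reversed(tags))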

SLIDE 4

Problem

In many (most?) problems, decoding is hard:

Coreference resolution

Machine translation

Automatic document summarization

Even joint sequence labeling!

Example: "I can can a can" with joint POS tags (Pro Md Vb Dt Nn) and chunk labels (NP VP NP)

Suboptimal heuristic search

Even if estimation were tractable, optimality is gone

[Figure: the objective plotted over the output space; a suboptimal heuristic search explores only part of the space, leaving an unsearched region.]

Want weights that are optimal for a suboptimal search procedure

SLIDE 5

Generic Search Formulation

Search Problem:

Search space

Operators

Goal-test function

Path-cost function

Search Variable:

Enqueue function

nodes := MakeQueue(S0)
while nodes is not empty
    node := RemoveFront(nodes)
    if node is a goal state, return node
    next := Operators(node)
    nodes := Enqueue(nodes, next)
fail

Varying the Enqueue function can give us DFS, BFS, beam search, A* search, etc.
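
A minimal runnable sketch of this generic loop, with the enqueue strategy passed in as a function; all names (generic_search, enqueue_dfs, ...) are illustrative stand-ins, not an API from the paper.

    from collections import deque

    # Generic search loop from this slide; the search problem is supplied
    # as plain functions, and only the enqueue strategy varies.

    def generic_search(start, operators, is_goal, enqueue):
        nodes = deque([start])              # nodes := MakeQueue(S0)
        while nodes:                        # while nodes is not empty
            node = nodes.popleft()          # node := RemoveFront(nodes)
            if is_goal(node):
                return node                 # goal state reached
            nodes = enqueue(nodes, operators(node))
        return None                         # fail

    # Two example enqueue functions: depth-first vs. breadth-first.
    def enqueue_dfs(nodes, successors):
        return deque(list(successors) + list(nodes))

    def enqueue_bfs(nodes, successors):
        nodes.extend(successors)
        return nodes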

SLIDE 6

Exact (DP) Search

[Figure: exact (dynamic-programming) search expands the entire space reachable from the start state S0.]

SLIDE 7

Beam Search

[Figure: beam search expands only the highest-scoring frontier nodes reachable from the start state S0.]

SLIDE 8

Inspecting Enqueue

Generally, we sort nodes by:

f n = gn  hn

Node value Path cost Future cost Assume this is given Assume this is a linear function of features:

gn =  w

T x ,n

SLIDE 9

Formal Specification

Given:

An input space X, output space Y, and search space S

A parameter function Φ : X × S → ℝ^D

A loss function l : X × Y × Y → ℝ≥0 that decomposes over search (monotonicity)

Find weights w to minimize:

L = Σ_{m=1}^{M} l(x_m, y_m, ŷ = search(x_m; w))
  ≤ Σ_{m=1}^{M} Σ_{n ∈ ŷ} [ l(x_m, y_m, n) − l(x_m, y_m, par(n)) ] + regularization term

Monotonicity (not absolutely necessary):
l(x, y, ỹ) ≤ l(x, y, n) for all nodes n (the correct output has minimal loss)
l(x, y, n) ≤ l(x, y, n′) for every node n on the path to n′ (loss never decreases along a search path)

We focus on 0/1 loss.

SLIDE 10

Online Learning Framework (LaSO)

nodes := MakeQueue(S0)
while nodes is not empty
    node := RemoveFront(nodes)
    if none of {node} ∪ nodes is y-good, or node is a goal & not y-good
        sibs := siblings(node, y)
        w := update(w, x, sibs, {node} ∪ nodes)
        nodes := MakeQueue(sibs)
    else
        if node is a goal state, return w
        next := Operators(node)
        nodes := Enqueue(nodes, next)

Monotonicity: for any node, we can tell whether it can lead to the correct solution or not. If we erred: where should we have gone? Update our weights based on the good and the bad choices, then continue the search.
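
A minimal sketch of this monitored search as one training pass, assuming the problem-specific pieces (operators, enqueue, is_goal, is_y_good, siblings_of, update) are supplied by the caller; the names are illustrative, not the paper's API.

    from collections import deque

    def laso_train_example(w, start, x, y, operators, enqueue,
                           is_goal, is_y_good, siblings_of, update):
        # Run the search for (x, y); whenever every y-good node has fallen out
        # of the queue (or a wrong goal is reached), update w and restart the
        # queue from the y-good siblings of the error point.
        nodes = deque([start])
        while nodes:
            node = nodes.popleft()
            frontier = [node] + list(nodes)
            if (not any(is_y_good(n, y) for n in frontier)
                    or (is_goal(node) and not is_y_good(node, y))):
                sibs = siblings_of(node, y)                 # where we should have gone
                w = update(w, x, good=sibs, bad=frontier)   # learn from good vs. bad choices
                nodes = deque(sibs)                         # continue the search from the good nodes
            else:
                if is_goal(node):
                    return w                                # reached a y-good goal: no update needed
                nodes = enqueue(nodes, operators(node))
        return w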

SLIDE 11

Search-based Margin

The margin γ is the amount by which we are correct: for a unit weight vector u, every y-good node g (e.g., g1, g2) should outscore every y-bad node b (e.g., b1, b2) that is enqueued at the same time,

u⊤ Φ(x, g) ≥ u⊤ Φ(x, b) + γ

Note that the margin, and hence linear separability, is also a function of the search algorithm!

SLIDE 12

Update Methods:

Perceptron updates [Rosenblatt 1958; Freund+Schapire 1999; Collins 2002]:

w ← w + Δ,   where   Δ = [ Σ_{n ∈ good} Φ(x, n) / |good| ] − [ Σ_{n ∈ bad} Φ(x, n) / |bad| ]

Approximate large margin updates [Gentile 2001]:

w ← proj( w + (C / √k) · proj(Δ) ),   where   proj(u) = u / max{1, ‖u‖}   (projection into the unit sphere)

Also downweight y-good nodes by (1 − α) B / √k

Here k is the generation of the weight vector; C and B are nuisance parameters (use C = √2 and B = 1/α); α is the ratio of the desired margin.
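
The two updates, sketched in Python under the same assumptions as before (NumPy feature vectors from a user-supplied phi); the constants C = √2 and B = 1/α follow the slide, everything else is illustrative.

    import numpy as np

    def delta(phi, x, good, bad):
        # Averaged difference of y-good and y-bad node features (the Δ above).
        g = sum(phi(x, n) for n in good) / len(good)
        b = sum(phi(x, n) for n in bad) / len(bad)
        return g - b

    def perceptron_update(w, phi, x, good, bad):
        # Perceptron-style update: w <- w + Δ.
        return w + delta(phi, x, good, bad)

    def project(u):
        # Projection into the unit sphere: u / max{1, ||u||}.
        return u / max(1.0, float(np.linalg.norm(u)))

    def large_margin_update(w, phi, x, good, bad, k, alpha):
        # ALMA-style update: w <- proj(w + (C / sqrt(k)) * proj(Δ)), with C = sqrt(2).
        # The slide additionally downweights y-good node features by
        # (1 - alpha) * B / sqrt(k) with B = 1/alpha; that refinement is omitted here.
        C = np.sqrt(2.0)
        return project(w + (C / np.sqrt(k)) * project(delta(phi, x, good, bad)))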

SLIDE 13

Convergence Theorems

For linearly separable data:

For perceptron updates, K ≤ γ⁻²

For large margin updates, K ≤ (2/γ²)(2/α − 1)² + 8/α − 4   (= 2γ⁻² + 4 when α = 1)

K is the number of updates; similar bounds hold for the inseparable case.

[Rosenblatt 1958; Freund+Schapire 1999; Collins 2002] [Gentile 2001]
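
As a small worked instance of the large-margin bound (the numbers here are my own illustration, not from the slides), take α = 1 and a separation margin of γ = 0.1:

    K \le \frac{2}{\gamma^{2}}\left(\frac{2}{\alpha}-1\right)^{2} + \frac{8}{\alpha} - 4
      \;=\; \frac{2}{(0.1)^{2}} \cdot 1 + 8 - 4
      \;=\; 204 .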

SLIDE 14

Experimental Results

Two related tasks:

Syntactic chunking (exact search + estimation is possible)

Joint chunking + part-of-speech tagging (search + estimation intractable)

Data from CoNLL 2000 data set

8936 training sentences (212k words)

2012 test sentences (47k words)

The usual suspects as features:

Chunk length, word identity (+lower-cased, +stemmed), case pattern, {1,2,3}-letter prefix and suffix

Membership in lists of names, locations, abbreviations, stop words, etc.

Applied in a window of 3

For syntactic chunking, we also use output of Brill's tagger as POS information

[Sutton + McCallum 2004]

SLIDE 15

Syntactic Chunking

Search:

Left-to-right, hypothesizes entire chunk at a time:

Enqueue functions:

Beam search: sort by cost, keep only top k hypotheses after each step

An error occurs exactly when none of the beam elements are good

Exact search: store costs in dynamic programming lattice

An error occurs only when the fully-decoded sequence is wrong

Updates are made by summing over the entire lattice

This is nearly the same as the CRF/M3N/SVMISO updates, but with evenly weighted errors

Example: [Great American]NP [said]VP [it]NP [increased]VP [its loan-loss reserves]NP [by]PP [$ 93 million]NP [after]PP [reviewing]VP [its loan portfolio]NP , ...

Δ = [ Σ_{n ∈ good} Φ(x, n) / |good| ] − [ Σ_{n ∈ bad} Φ(x, n) / |bad| ]

SLIDE 16

Syntactic Chunking Results

[Figure: F-score vs. training time (minutes) for syntactic chunking, comparing the presented approach against [Collins 2002], [Zhang+Damerau+Johnson 2002] (timing unknown), and [Sarawagi+Cohen 2004]; training times shown: 33 min, 22 min, 24 min, and 4 min.]

SLIDE 17

Joint Tagging + Chunking

Search: left-to-right, hypothesizes POS and BIO-chunk labels jointly

Previous approach: Sutton+McCallum use belief propagation algorithms (e.g., tree-based reparameterization) to perform inference in a double-chained CRF (13.6 hours to train on 5% of the data: 400 sentences)

Enqueue: beam search

Example:
  Great  American  said  it    increased  its   loan-loss  reserves  by    ...
  NNP    NNP       VBD   PRP   VBD        PRP$  NN         NNS       IN    ...
  B-NP   I-NP      B-VP  B-NP  B-VP       B-NP  I-NP       I-NP      B-PP  ...

SLIDE 18

Joint T+C Results

[Figure: joint tagging/chunking accuracy vs. training time (hours, log scale), comparing the presented approach against [Sutton+McCallum 2004]; training times shown: 23 min, 7 min, 3 min, and 1 min.]

SLIDE 19

Variations on a Beam

Observation:

We needn't use the same beam size for training and decoding

Varying these values independently yields:

Chunking F-score by training beam (rows) and decoding beam (columns):

  train \ decode     1      5     10     25     50
         1         93.9   92.8   91.9   91.3   90.9
         5         90.5   94.3   94.4   94.1   94.1
        10         89.5   94.3   94.4   94.2   94.2
        25         88.7   94.2   94.5   94.3   94.3
        50         88.4   94.2   94.4   94.2   94.4

SLIDE 20

Conclusions

Problem:

Solving most problems is intractable

How can we learn effectively for these problems?

Solution:

Integrate learning with search and learn parameters that are both good for identifying correct hypotheses and guiding search

Results: State-of-the-art performance at low computational cost

Current work:

Apply this framework to more complex problems

Explore alternative loss functions

Better formalize the optimization problem

Connection to CRFs, M3Ns and SVMISOs

Reductionist strategy