
Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction



1. Learning as Search Optimization: Approximate Large Margin Methods for Structured Prediction
   Hal Daumé III and Daniel Marcu
   Information Sciences Institute, University of Southern California
   {hdaume,marcu}@isi.edu

2. Structured Prediction 101
➢ Learn a function f : X → Y mapping inputs to complex outputs
➢ [Figure: input space → decoding → output space, illustrated with sequence labeling (candidate tag sequences over Pro/Md/Vb/Dt/Nn for "I can can a can"), machine translation ("Mary no daba una bofetada a la bruja verde." → "Mary did not slap the green witch."), coreference resolution ("Bill Clinton", "the President", "Clinton", "he", "Al Gore", "Gore"), and parsing]

3. Problem Decomposition
➢ Divide the problem into regions (e.g., the tag sequence Pro Md Vb Dt Nn over "I can can a can")
➢ Express both the loss function and the features in terms of regions
➢ Decoding: tractable using dynamic programming when regions are simple (max-product algorithm); see the sketch below
➢ Parameter estimation (linear models: CRF, M3N, SVMSO, etc.): tractable using dynamic programming when regions are simple (sum-product algorithm)
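When the regions form a simple chain, the max-product decoding step is just the Viterbi algorithm. The sketch below is a minimal, self-contained illustration; the toy tag set, the unary/pairwise scoring functions, and the example sentence are hypothetical stand-ins, not the paper's actual model or features.

```python
# Minimal Viterbi (max-product) decoder for a chain-structured model.
# The scoring functions here are illustrative placeholders.

def viterbi(words, tags, unary, pairwise):
    """Return the highest-scoring tag sequence under a chain model."""
    n = len(words)
    # best[i][t] = best score of any tag prefix ending with tag t at position i
    best = [{t: unary(words, 0, t) for t in tags}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in tags:
            score, prev = max(
                (best[i - 1][p] + pairwise(p, t) + unary(words, i, t), p)
                for p in tags
            )
            best[i][t] = score
            back[i][t] = prev
    # Recover the argmax sequence by following back-pointers.
    last = max(tags, key=lambda t: best[n - 1][t])
    seq = [last]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

if __name__ == "__main__":
    TAGS = ["Pro", "Md", "Vb", "Dt", "Nn"]          # toy tag set from the slide
    sent = "I can can a can".split()
    # Toy scoring functions (hypothetical): weak lexical preferences only.
    prefs = {"I": "Pro", "a": "Dt", "can": "Vb"}
    unary = lambda ws, i, t: 1.0 if prefs.get(ws[i]) == t else 0.0
    pairwise = lambda p, t: 0.5 if (p, t) in {("Md", "Vb"), ("Dt", "Nn")} else 0.0
    print(viterbi(sent, TAGS, unary, pairwise))
```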

4. Problem
➢ In many (most?) problems, decoding is hard:
  ➢ Coreference resolution
  ➢ Machine translation
  ➢ Automatic document summarization
  ➢ Even joint sequence labeling!
➢ These rely on suboptimal heuristic search, so we want weights that are optimal for a suboptimal search procedure
➢ Even if estimation were tractable, optimality is gone
➢ [Figure: a labeling of "I can can a can" (NP VP NP), and an output space with an unsearched region relative to the objective]

5. Generic Search Formulation
➢ Search Problem:
  ➢ Search space
  ➢ Operators
  ➢ Goal-test function
  ➢ Path-cost function
➢ Search Variable:
  ➢ Enqueue function
➢ Algorithm:
  nodes := MakeQueue(S0)
  while nodes is not empty
    node := RemoveFront(nodes)
    if node is a goal state return node
    next := Operators(node)
    nodes := Enqueue(nodes, next)
  fail
➢ Varying the Enqueue function can give us DFS, BFS, beam search, A* search, etc. (see the sketch below)
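A direct transcription of this loop, with the Enqueue function passed in as a parameter, could look like the following sketch. The node representation, operators, goal test, and scoring function are assumed, problem-specific callables (hypothetical names).

```python
# Generic search driver: only the enqueue function varies between strategies.

def search(s0, operators, is_goal, enqueue):
    nodes = [s0]                       # MakeQueue(S0)
    while nodes:                       # while nodes is not empty
        node = nodes.pop(0)            # node := RemoveFront(nodes)
        if is_goal(node):              # goal test
            return node
        nexts = operators(node)        # expand the node
        nodes = enqueue(nodes, nexts)  # nodes := Enqueue(nodes, next)
    return None                        # fail

# Different Enqueue functions give different search strategies:
enqueue_bfs = lambda nodes, nexts: nodes + list(nexts)   # FIFO queue -> BFS
enqueue_dfs = lambda nodes, nexts: list(nexts) + nodes   # LIFO stack -> DFS

def make_enqueue_beam(score, width):
    """Beam search: keep only the `width` best nodes by score."""
    def enqueue(nodes, nexts):
        return sorted(nodes + list(nexts), key=score, reverse=True)[:width]
    return enqueue
```

Only the queue discipline changes between strategies; the driver itself never does.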

6. Exact (DP) Search
➢ [Figure: exact dynamic-programming search expanding the entire search space from S0]

7. Beam Search
➢ [Figure: beam search expanding only a bounded frontier of the search space from S0]

8. Inspecting Enqueue
➢ Generally, we sort nodes by:  f(n) = g(n) + h(n)
  ➢ f(n) is the node value, g(n) the path cost, h(n) the future cost
  ➢ Assume h(n) is given
➢ Assume g(n) is a linear function of features:  g(n) = w⊤ Φ(x, n)  (see the scoring sketch below)
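Concretely, the path-cost part of the sort key can be a dot product between the weight vector and sparse node features. A minimal sketch, assuming Φ returns a sparse feature dict and the future-cost estimate h is supplied by the problem (both hypothetical names):

```python
# Linear node score g(n) = w . Phi(x, n); f(n) = g(n) + h(n) is used for sorting.
# `features` (Phi) and `future_cost` (h) are assumed problem-specific callables.

def g(w, x, node, features):
    """Path cost as a linear function of sparse features."""
    return sum(w.get(name, 0.0) * value for name, value in features(x, node).items())

def f(w, x, node, features, future_cost):
    """Node value: path cost plus the (assumed given) future-cost estimate."""
    return g(w, x, node, features) + future_cost(x, node)
```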

9. Formal Specification
➢ Given:
  ➢ An input space X, an output space Y, and a search space S
  ➢ A parameter (feature) function Φ : X × S → ℝ^D
  ➢ A loss function l : X × Y × Y → ℝ≥0 that decomposes over search (a toy example follows below):
    ➢ monotonicity: l(x, y, n) ≤ l(x, y, n′) whenever n′ extends n, so the loss of a node never decreases as search continues
    ➢ a second condition relating node losses to the losses of their completed outputs (not absolutely necessary)
➢ Find weights w to minimize:
  L(w) = Σ_{m=1..M} l(x_m, y_m, ŷ_m) + regularization term,  where ŷ_m = search(x_m; w)
       ≤ Σ_{m=1..M} Σ_{n on the path to ŷ_m} [ l(x_m, y_m, n) − l(x_m, y_m, par(n)) ]
➢ We focus on 0/1 loss
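For intuition, a prefix Hamming loss (and the 0/1 loss the talk focuses on) satisfies the monotonicity requirement: extending a partial labeling can never remove a mistake that has already been committed. A minimal sketch, assuming nodes are label prefixes:

```python
# Prefix Hamming loss: the number of already-committed labels that disagree
# with the truth. It can only grow as search extends the prefix (monotonicity).

def prefix_hamming_loss(y_true, node_prefix):
    return sum(1 for guess, gold in zip(node_prefix, y_true) if guess != gold)

def zero_one_loss(y_true, node_prefix):
    # The 0/1 loss: any mistake anywhere in the committed prefix costs 1.
    return 1 if prefix_hamming_loss(y_true, node_prefix) > 0 else 0
```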

10. Online Learning Framework (LaSO)
➢ Monotonicity: for any node, we can tell whether it can lead to the correct solution or not
➢ Algorithm (transcribed in the sketch below):
  nodes := MakeQueue(S0)
  while nodes is not empty
    node := RemoveFront(nodes)
    if none of {node} ∪ nodes is y-good, or node is a goal and not y-good
      (we erred: where should we have gone?)
      sibs := siblings(node, y)
      (update our weights based on the good and the bad choices)
      w := update(w, x, sibs, {node} ∪ nodes)
      nodes := MakeQueue(sibs)
    else
      (continue the search)
      if node is a goal state return w
      next := Operators(node)
      nodes := Enqueue(nodes, next)
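The loop above can be transcribed almost line for line. In the sketch below, s0, operators, is_goal, siblings, is_good, enqueue, and update are assumed problem-specific callables (hypothetical names), and the function processes a single training example (x, y):

```python
# Sketch of one LaSO pass over a single example (x, y), mirroring the pseudocode.

def laso_example(w, x, y, s0, operators, is_goal, siblings, is_good, enqueue, update):
    nodes = [s0]
    while nodes:
        node = nodes.pop(0)                        # node := RemoveFront(nodes)
        frontier = [node] + nodes                  # {node} U nodes
        goodness = [is_good(x, y, n) for n in frontier]
        if not any(goodness) or (is_goal(node) and not goodness[0]):
            # We erred: update toward the y-good siblings, away from the frontier,
            # then restart the queue from the good siblings.
            sibs = siblings(node, y)
            w = update(w, x, good=sibs, bad=frontier)
            nodes = list(sibs)
        else:
            if is_goal(node):
                return w                           # reached a y-good goal state
            nodes = enqueue(nodes, operators(node))
    return w
```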

11. Search-based Margin
➢ The margin γ is the amount by which we are correct: for a unit-norm weight vector u, we want
  u⊤Φ(x, g) ≥ u⊤Φ(x, b) + γ  for every y-good node g and y-bad node b
➢ [Figure: good nodes g1, g2 separated from bad nodes b1, b2 by the margin under u]
➢ Note that the margin, and hence linear separability, is also a function of the search algorithm!

12. Update Methods (both rules are sketched below)
➢ Perceptron updates [Rosenblatt 1958; Freund+Schapire 1999; Collins 2002]:
  w ← w + Δ,  where Δ = (1/|good|) Σ_{n ∈ good} Φ(x, n) − (1/|bad|) Σ_{n ∈ bad} Φ(x, n)
➢ Approximate large margin updates [Gentile 2001]:
  w ← ℘( w + (C/√k) ℘(Δ) )
  ➢ ℘(u) = u / max{1, ‖u‖} projects into the unit sphere
  ➢ k is the generation of the weight vector; C is a nuisance parameter (use √2)
  ➢ Also downweight the scores of y-good nodes by a margin term involving a second nuisance parameter B, the ratio of the desired margin
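Both update rules can be sketched over sparse feature dicts (here operating directly on the nodes' feature vectors rather than on nodes). The perceptron update follows the formula above; the large-margin variant only illustrates the projection and decaying-step-size idea (C and the generation counter k as above) and omits the good-node downweighting, so it is an approximation of the slide's rule rather than a faithful ALMA implementation:

```python
import math

def sparse_add(a, b, scale=1.0):
    """Return a + scale * b for sparse feature dicts."""
    out = dict(a)
    for name, value in b.items():
        out[name] = out.get(name, 0.0) + scale * value
    return out

def mean_feats(feature_dicts):
    """Mean of a list of sparse feature dicts."""
    out = {}
    for fd in feature_dicts:
        out = sparse_add(out, fd, 1.0 / len(feature_dicts))
    return out

def update_direction(good_feats, bad_feats):
    """Delta = mean(good features) - mean(bad features)."""
    return sparse_add(mean_feats(good_feats), mean_feats(bad_feats), -1.0)

def perceptron_update(w, good_feats, bad_feats):
    """Perceptron-style update: w <- w + Delta."""
    return sparse_add(w, update_direction(good_feats, bad_feats))

def project_unit(u):
    """Project into the unit sphere: u / max(1, ||u||)."""
    norm = math.sqrt(sum(v * v for v in u.values()))
    return {name: v / max(1.0, norm) for name, v in u.items()}

def large_margin_update(w, good_feats, bad_feats, k, C=math.sqrt(2)):
    """Illustrative large-margin step: w <- Proj(w + (C / sqrt(k)) * Proj(Delta))."""
    delta = project_unit(update_direction(good_feats, bad_feats))
    return project_unit(sparse_add(w, delta, C / math.sqrt(k)))
```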

13. Convergence Theorems
➢ For linearly separable data:
  ➢ For perceptron updates, the number of updates is bounded by K ≤ 1/γ², where γ is the separation margin [Rosenblatt 1958; Freund+Schapire 1999; Collins 2002]
  ➢ For large margin updates, a comparable bound holds with an additional constant-factor dependence on the margin-approximation parameter [Gentile 2001]
➢ Similar bounds hold for the inseparable case

14. Experimental Results
➢ Two related tasks:
  ➢ Syntactic chunking (exact search + estimation is possible)
  ➢ Joint chunking + part-of-speech tagging [Sutton + McCallum 2004] (search + estimation intractable)
➢ Data from the CoNLL 2000 data set:
  ➢ 8936 training sentences (212k words)
  ➢ 2012 test sentences (47k words)
➢ The usual suspects as features (a rough extraction sketch follows below):
  ➢ Chunk length, word identity (+ lower-cased, + stemmed), case pattern, {1,2,3}-letter prefixes and suffixes
  ➢ Membership on lists of names, locations, abbreviations, stop words, etc.
  ➢ Applied in a window of 3
➢ For syntactic chunking, we also use the output of Brill's tagger as POS information
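A rough sketch of how such per-position features might be generated in a window of 3 (positions i-1, i, i+1). The real system also uses a stemmer, chunk-length features, list memberships, and Brill POS tags; the template names here are hypothetical.

```python
# Hypothetical per-position feature templates in a +/-1 window (window of 3).

def case_pattern(word):
    return "".join("A" if c.isupper() else "a" if c.islower() else
                   "9" if c.isdigit() else "-" for c in word)

def word_features(words, i):
    feats = {}
    for offset in (-1, 0, 1):                        # window of 3
        j = i + offset
        if 0 <= j < len(words):
            w = words[j]
            feats[f"w[{offset}]={w}"] = 1.0          # word identity
            feats[f"lc[{offset}]={w.lower()}"] = 1.0 # lower-cased word
            feats[f"case[{offset}]={case_pattern(w)}"] = 1.0
            for k in (1, 2, 3):                      # {1,2,3}-letter prefixes/suffixes
                feats[f"pre{k}[{offset}]={w[:k]}"] = 1.0
                feats[f"suf{k}[{offset}]={w[-k:]}"] = 1.0
    return feats
```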

15. Syntactic Chunking
➢ Search: left-to-right, hypothesizes an entire chunk at a time (a successor-function sketch follows below):
  [Great American] NP [said] VP [it] NP [increased] VP [its loan-loss reserves] NP [by] PP [$ 93 million] NP [after] PP [reviewing] VP [its loan portfolio] NP , ...
➢ Enqueue functions:
  ➢ Beam search: sort by cost, keep only the top k hypotheses after each step
    ➢ An error occurs exactly when none of the beam elements is good
  ➢ Exact search: store costs in a dynamic programming lattice
    ➢ An error occurs only when the fully decoded sequence is wrong
    ➢ Updates are made by summing over the entire lattice
    ➢ This is nearly the same as the CRF/M3N/SVMISO updates, but with evenly weighted errors:
      Δ = (1/|good|) Σ_{n ∈ good} Φ(x, n) − (1/|bad|) Σ_{n ∈ bad} Φ(x, n)
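The chunk-at-a-time search can be pictured as a successor function that, from a partial segmentation covering the first i words, proposes every (label, length) extension. A minimal sketch, with an illustrative label set and a hypothetical cap on chunk length:

```python
# Hypothetical successor function for chunk-at-a-time search: a node is a list
# of (label, start, end) chunks covering a prefix of the sentence.

CHUNK_LABELS = ["NP", "VP", "PP", "O"]    # illustrative label set
MAX_CHUNK_LEN = 4                          # illustrative cap on chunk length

def chunk_operators(node, n_words):
    covered = node[-1][2] if node else 0   # index of the next uncovered word
    successors = []
    for label in CHUNK_LABELS:
        for length in range(1, MAX_CHUNK_LEN + 1):
            if covered + length <= n_words:
                successors.append(node + [(label, covered, covered + length)])
    return successors

def is_goal(node, n_words):
    return bool(node) and node[-1][2] == n_words   # every word is covered
```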

16. Syntactic Chunking Results
➢ [Figure: F-score vs. training time in minutes. Data points at 4 min and 24 min (this work); comparisons: [Collins 2002] (22 min), [Sarawagi+Cohen 2004] (33 min), [Zhang+Damerau+Johnson 2002] (timing unknown)]

17. Joint Tagging + Chunking
➢ Search: left-to-right, hypothesizes the POS tag and the BIO-chunk tag together:
  Words:  Great American said it increased its loan-loss reserves by ...
  POS:    NNP NNP VBD PRP VBD PRP$ NN NNS IN ...
  Chunks: B-NP I-NP B-VP B-NP B-VP B-NP I-NP I-NP B-PP ...
➢ Previous approach: Sutton+McCallum use belief propagation algorithms (e.g., tree-based reparameterization) to perform inference in a double-chained CRF (13.6 hours to train on 5% of the data: 400 sentences)
➢ Enqueue: beam search

18. Joint T+C Results
➢ [Figure: joint tagging/chunking accuracy vs. training time in hours (log scale). Data points at 1, 3, 7, and 23 minutes (this work); comparison: [Sutton+McCallum 2004]]

19. Variations on a Beam
➢ Observation: we needn't use the same beam size for training and decoding
➢ Varying these values independently yields:

                        Decoding Beam
  Training Beam      1      5     10     25     50
              1   93.9   92.8   91.9   91.3   90.9
              5   90.5   94.3   94.4   94.1   94.1
             10   89.5   94.3   94.4   94.2   94.2
             25   88.7   94.2   94.5   94.3   94.3
             50   88.4   94.2   94.4   94.2   94.4

20. Conclusions
➢ Problem:
  ➢ Exact decoding for most structured prediction problems is intractable
  ➢ How can we learn effectively for these problems?
➢ Solution: integrate learning with search, and learn parameters that are both good for identifying correct hypotheses and for guiding search
➢ Results: state-of-the-art performance at low computational cost
➢ Current work:
  ➢ Apply this framework to more complex problems
  ➢ Explore alternative loss functions
  ➢ Better formalize the optimization problem
  ➢ Connection to CRFs, M3Ns and SVMSOs
  ➢ Reductionist strategy
