SLIDE 1

Learning as Search Optimization:

Approximate Large Margin Methods for Structured Prediction

Hal Daumé III and Daniel Marcu

Information Sciences Institute University of Southern California {hdaume,marcu}@isi.edu

SLIDE 2

Structured Prediction 101

Learn a function mapping inputs to complex outputs:

f : X → Y

[Figure: example mappings from the input space, through decoding, to the output space. Sequence labeling and parsing: candidate tag sequences for "I can can a can" (e.g., Pro Md Vb Dt Nn). Coreference resolution: Bill Clinton, Clinton, Al Gore, Gore, he, the President. Machine translation: "Mary did not slap the green witch ." / "Mary no daba una botefada a la bruja verda ."]

SLIDE 3

Problem Decomposition

Divide problem into regions

Express both the loss function and the features in terms of regions:

Example: "I can can a can" labeled Pro Md Vb Dt Nn

Decoding:

Tractable using dynamic programming when regions are simple (max-product algorithm); a small decoding sketch follows below

Parameter estimation (linear models – CRF, M3N, SVMISO, etc.):

Tractable using dynamic programming when regions are simple (sum-product algorithm)
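
As a sketch of the decoding bullet above: a minimal Viterbi (max-product) decoder, assuming regions are adjacent tag pairs and that emission and transition scores are supplied as plain arrays; the names and signatures are illustrative, not from the slides.

    # Minimal Viterbi (max-product) decoding sketch for a tag sequence.
    # Regions are assumed to be (previous tag, current tag) pairs; emit[i][t]
    # scores tag t at position i, trans[s][t] scores the transition s -> t.

    def viterbi(emit, trans, num_tags):
        n = len(emit)
        score = [[float("-inf")] * num_tags for _ in range(n)]
        back = [[0] * num_tags for _ in range(n)]
        score[0] = list(emit[0])
        for i in range(1, n):
            for t in range(num_tags):
                # Best previous tag under the max-product (Viterbi) recursion.
                best_s = max(range(num_tags), key=lambda s: score[i - 1][s] + trans[s][t])
                back[i][t] = best_s
                score[i][t] = score[i - 1][best_s] + trans[best_s][t] + emit[i][t]
        # Recover the best tag sequence by following back-pointers.
        t = max(range(num_tags), key=lambda t: score[n - 1][t])
        tags = [t]
        for i in range(n - 1, 0, -1):
            t = back[i][t]
            tags.append(t)
        return list(reversed(tags))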

SLIDE 4

Problem

In many (most?) problems, decoding is hard:

Coreference resolution

Machine translation

Automatic document summarization

Even joint sequence labeling!

Example: "I can can a can" with joint POS tags (Pro Md Vb Dt Nn) and chunk labels (NP VP NP)

Suboptimal heuristic search

Even if estimation were tractable, optimality is gone

[Figure: the objective plotted over the output space; a suboptimal heuristic search explores only part of the space, leaving an unsearched region.]

Want weights that are optimal for a suboptimal search procedure

SLIDE 5

Generic Search Formulation

Search Problem:

Search space

Operators

Goal-test function

Path-cost function

Search Variable:

Enqueue function

nodes := MakeQueue(S0)
while nodes is not empty
    node := RemoveFront(nodes)
    if node is a goal state, return node
    next := Operators(node)
    nodes := Enqueue(nodes, next)
fail

Varying the Enqueue function can give us DFS, BFS, beam search, A* search, etc.
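
A minimal runnable sketch of this generic loop, with the enqueue strategy passed in as a function; all names (generic_search, enqueue_dfs, ...) are illustrative stand-ins, not an API from the paper.

    from collections import deque

    # Generic search loop from this slide; the search problem is supplied
    # as plain functions, and only the enqueue strategy varies.

    def generic_search(start, operators, is_goal, enqueue):
        nodes = deque([start])              # nodes := MakeQueue(S0)
        while nodes:                        # while nodes is not empty
            node = nodes.popleft()          # node := RemoveFront(nodes)
            if is_goal(node):
                return node                 # goal state reached
            nodes = enqueue(nodes, operators(node))
        return None                         # fail

    # Two example enqueue functions: depth-first vs. breadth-first.
    def enqueue_dfs(nodes, successors):
        return deque(list(successors) + list(nodes))

    def enqueue_bfs(nodes, successors):
        nodes.extend(successors)
        return nodes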

SLIDE 6

Exact (DP) Search

[Figure: exact (dynamic-programming) search expands the entire space reachable from the start state S0.]

SLIDE 7

Beam Search

[Figure: beam search expands only the highest-scoring frontier nodes reachable from the start state S0.]

SLIDE 8

Inspecting Enqueue

Generally, we sort nodes by:

f n = gn  hn

Node value Path cost Future cost Assume this is given Assume this is a linear function of features:

gn =  w

T x ,n

SLIDE 9

Formal Specification

Given:

An input space X, output space Y, and search space S

A parameter function Φ : X × S → ℝ^D

A loss function l : X × Y × Y → ℝ≥0 that decomposes over search (monotonicity)

Find weights w to minimize:

L = Σ_{m=1}^{M} l(x_m, y_m, ŷ = search(x_m; w))
  ≤ Σ_{m=1}^{M} Σ_{n ∈ ŷ} [ l(x_m, y_m, n) − l(x_m, y_m, par(n)) ] + regularization term

Monotonicity (not absolutely necessary):
l(x, y, ỹ) ≤ l(x, y, n) for all nodes n (the correct output has minimal loss)
l(x, y, n) ≤ l(x, y, n′) for every node n on the path to n′ (loss never decreases along a search path)

We focus on 0/1 loss.

SLIDE 10

Online Learning Framework (LaSO)

nodes := MakeQueue(S0)
while nodes is not empty
    node := RemoveFront(nodes)
    if none of {node} ∪ nodes is y-good, or node is a goal & not y-good
        sibs := siblings(node, y)
        w := update(w, x, sibs, {node} ∪ nodes)
        nodes := MakeQueue(sibs)
    else
        if node is a goal state, return w
        next := Operators(node)
        nodes := Enqueue(nodes, next)

Monotonicity: for any node, we can tell whether it can lead to the correct solution or not. If we erred: where should we have gone? Update our weights based on the good and the bad choices, then continue the search.
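
A minimal sketch of this monitored search as one training pass, assuming the problem-specific pieces (operators, enqueue, is_goal, is_y_good, siblings_of, update) are supplied by the caller; the names are illustrative, not the paper's API.

    from collections import deque

    def laso_train_example(w, start, x, y, operators, enqueue,
                           is_goal, is_y_good, siblings_of, update):
        # Run the search for (x, y); whenever every y-good node has fallen out
        # of the queue (or a wrong goal is reached), update w and restart the
        # queue from the y-good siblings of the error point.
        nodes = deque([start])
        while nodes:
            node = nodes.popleft()
            frontier = [node] + list(nodes)
            if (not any(is_y_good(n, y) for n in frontier)
                    or (is_goal(node) and not is_y_good(node, y))):
                sibs = siblings_of(node, y)                 # where we should have gone
                w = update(w, x, good=sibs, bad=frontier)   # learn from good vs. bad choices
                nodes = deque(sibs)                         # continue the search from the good nodes
            else:
                if is_goal(node):
                    return w                                # reached a y-good goal: no update needed
                nodes = enqueue(nodes, operators(node))
        return w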

SLIDE 11

Search-based Margin

The margin γ is the amount by which we are correct: for a unit weight vector u, every y-good node g (e.g., g1, g2) should outscore every y-bad node b (e.g., b1, b2) that is enqueued at the same time,

u⊤ Φ(x, g) ≥ u⊤ Φ(x, b) + γ

Note that the margin, and hence linear separability, is also a function of the search algorithm!

SLIDE 12

Update Methods:

Perceptron updates [Rosenblatt 1958; Freund+Schapire 1999; Collins 2002]:

w ← w + Δ,   where   Δ = [ Σ_{n ∈ good} Φ(x, n) / |good| ] − [ Σ_{n ∈ bad} Φ(x, n) / |bad| ]

Approximate large margin updates [Gentile 2001]:

w ← proj( w + (C / √k) · proj(Δ) ),   where   proj(u) = u / max{1, ‖u‖}   (projection into the unit sphere)

Also downweight y-good nodes by (1 − α) B / √k

Here k is the generation of the weight vector; C and B are nuisance parameters (use C = √2 and B = 1/α); α is the ratio of the desired margin.
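
The two updates, sketched in Python under the same assumptions as before (NumPy feature vectors from a user-supplied phi); the constants C = √2 and B = 1/α follow the slide, everything else is illustrative.

    import numpy as np

    def delta(phi, x, good, bad):
        # Averaged difference of y-good and y-bad node features (the Δ above).
        g = sum(phi(x, n) for n in good) / len(good)
        b = sum(phi(x, n) for n in bad) / len(bad)
        return g - b

    def perceptron_update(w, phi, x, good, bad):
        # Perceptron-style update: w <- w + Δ.
        return w + delta(phi, x, good, bad)

    def project(u):
        # Projection into the unit sphere: u / max{1, ||u||}.
        return u / max(1.0, float(np.linalg.norm(u)))

    def large_margin_update(w, phi, x, good, bad, k, alpha):
        # ALMA-style update: w <- proj(w + (C / sqrt(k)) * proj(Δ)), with C = sqrt(2).
        # The slide additionally downweights y-good node features by
        # (1 - alpha) * B / sqrt(k) with B = 1/alpha; that refinement is omitted here.
        C = np.sqrt(2.0)
        return project(w + (C / np.sqrt(k)) * project(delta(phi, x, good, bad)))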

SLIDE 13

Convergence Theorems

For linearly separable data:

For perceptron updates, K ≤ γ⁻²

For large margin updates, K ≤ (2/γ²)(2/α − 1)² + 8/α − 4   (= 2γ⁻² + 4 when α = 1)

K is the number of updates; similar bounds hold for the inseparable case.

[Rosenblatt 1958; Freund+Schapire 1999; Collins 2002] [Gentile 2001]
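
As a small worked instance of the large-margin bound (the numbers here are my own illustration, not from the slides), take α = 1 and a separation margin of γ = 0.1:

    K \le \frac{2}{\gamma^{2}}\left(\frac{2}{\alpha}-1\right)^{2} + \frac{8}{\alpha} - 4
      \;=\; \frac{2}{(0.1)^{2}} \cdot 1 + 8 - 4
      \;=\; 204 .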

SLIDE 14

Experimental Results

Two related tasks:

Syntactic chunking (exact search + estimation is possible)

Joint chunking + part-of-speech tagging (search + estimation intractable)

Data from CoNLL 2000 data set

8936 training sentences (212k words)

2012 test sentences (47k words)

The usual suspects as features:

Chunk length, word identity (+lower-cased, +stemmed), case pattern, {1,2,3}-letter prefix and suffix

Membership in lists of names, locations, abbreviations, stop words, etc.

Applied in a window of 3

For syntactic chunking, we also use output of Brill's tagger as POS information

[Sutton + McCallum 2004]

SLIDE 15

Syntactic Chunking

Search:

Left-to-right, hypothesizes entire chunk at a time:

Enqueue functions:

Beam search: sort by cost, keep only top k hypotheses after each step

An error occurs exactly when none of the beam elements are good

Exact search: store costs in dynamic programming lattice

An error occurs only when the fully-decoded sequence is wrong

Updates are made by summing over the entire lattice

This is nearly the same as the CRF/M3N/SVMISO updates, but with evenly weighted errors

Example: [Great American]NP [said]VP [it]NP [increased]VP [its loan-loss reserves]NP [by]PP [$ 93 million]NP [after]PP [reviewing]VP [its loan portfolio]NP , ...

Δ = [ Σ_{n ∈ good} Φ(x, n) / |good| ] − [ Σ_{n ∈ bad} Φ(x, n) / |bad| ]

SLIDE 16

Syntactic Chunking Results

[Figure: F-score vs. training time (minutes) for syntactic chunking, comparing the presented approach against [Collins 2002], [Zhang+Damerau+Johnson 2002] (timing unknown), and [Sarawagi+Cohen 2004]; training times shown: 33 min, 22 min, 24 min, and 4 min.]

SLIDE 17

Joint Tagging + Chunking

Search: left-to-right, hypothesizes POS and BIO-chunk labels jointly

Previous approach: Sutton+McCallum use belief propagation algorithms (e.g., tree-based reparameterization) to perform inference in a double-chained CRF (13.6 hours to train on 5% of the data: 400 sentences)

Enqueue: beam search

Example:
  Great  American  said  it    increased  its   loan-loss  reserves  by    ...
  NNP    NNP       VBD   PRP   VBD        PRP$  NN         NNS       IN    ...
  B-NP   I-NP      B-VP  B-NP  B-VP       B-NP  I-NP       I-NP      B-PP  ...

SLIDE 18

Joint T+C Results

[Figure: joint tagging/chunking accuracy vs. training time (hours, log scale), comparing the presented approach against [Sutton+McCallum 2004]; training times shown: 23 min, 7 min, 3 min, and 1 min.]

SLIDE 19

Variations on a Beam

Observation:

We needn't use the same beam size for training and decoding

Varying these values independently yields:

Chunking F-score by training beam (rows) and decoding beam (columns):

  train \ decode     1      5     10     25     50
         1         93.9   92.8   91.9   91.3   90.9
         5         90.5   94.3   94.4   94.1   94.1
        10         89.5   94.3   94.4   94.2   94.2
        25         88.7   94.2   94.5   94.3   94.3
        50         88.4   94.2   94.4   94.2   94.4

SLIDE 20

Conclusions

Problem:

Solving most problems is intractable

How can we learn effectively for these problems?

Solution:

Integrate learning with search and learn parameters that are both good for identifying correct hypotheses and guiding search

Results: State-of-the-art performance at low computational cost

Current work:

Apply this framework to more complex problems

Explore alternative loss functions

Better formalize the optimization problem

Connection to CRFs, M3Ns and SVMISOs

Reductionist strategy