
Structured Output Learning with Indirect Supervision
Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser and Dan Roth
Computer Science Department, University of Illinois at Urbana-Champaign


1. Geometric Interpretation for SSVM

   Decision function: arg max_{h ∈ H(x_i)} w^T Φ(x_i, h)
   Feasible structures for x_1: { Φ(x_1, h) | h ∈ H(x_1) }
   Gold structure: Φ(x_1, h*_1); prediction: Φ(x_1, ĥ)

   Training intuition: given an example (x_i, h_i), find a w such that the gold structure h_i has the highest score!

2. Structural SVM

   min_w  ||w||^2 / 2  +  C_1 Σ_{i ∈ S} L_S(x_i, h_i, w)

   - Regularization ||w||^2 / 2: measures the model complexity.
   - S is the set of structured labeled examples.
   - Structural loss L_S(x_i, h_i, w): measures "the distance" between the current best prediction and the gold structure h_i.
   - L_S can use hinge, squared hinge, or other loss functions.
   - This is a convex optimization problem.

   Now, add supervision from the companion task!
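To make the structural loss concrete, here is a minimal Python sketch of L_S with a (squared) hinge ℓ. Everything in it is an illustrative assumption rather than the slides' setup: a tiny binary tag set, a block feature map phi, Hamming distance as Δ, and brute-force loss-augmented inference (a real system would use Viterbi-style dynamic programming).

```python
import itertools
import numpy as np

TAGS = (0, 1)   # assumed toy tag set
DIM = 3         # assumed per-token feature dimension

def phi(x, h):
    """Joint feature map Phi(x, h): one DIM-sized block per tag."""
    f = np.zeros(DIM * len(TAGS))
    for x_t, h_t in zip(x, h):
        f[h_t * DIM:(h_t + 1) * DIM] += x_t
    return f

def hamming(h, h_gold):
    """Structural distance Delta(h, h_gold)."""
    return sum(a != b for a, b in zip(h, h_gold))

def structural_loss(w, x, h_gold, squared=True):
    """L_S(x, h_gold, w): margin-rescaled (squared) hinge structural loss.

    Loss-augmented inference is done by brute-force enumeration here;
    a real implementation would use dynamic programming instead.
    """
    gold_score = w @ phi(x, h_gold)
    best = max(hamming(h, h_gold) + w @ phi(x, h)
               for h in itertools.product(TAGS, repeat=len(x)))
    violation = max(0.0, best - gold_score)
    return violation ** 2 if squared else violation

# Tiny usage example on random data.
rng = np.random.default_rng(0)
x = [rng.normal(size=DIM) for _ in range(4)]
h_gold = (0, 1, 1, 0)
w = rng.normal(size=DIM * len(TAGS))
print(structural_loss(w, x, h_gold))
```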

3. The role of binary labeled data

   (Figure: a word pair, e.g. "Israel" / "Italy", with the character alignment as the structured output problem and a Yes/No label as the companion binary output problem.)

   Companion task: does this example possess a good structure?

   x_1 is positive: there must exist a good structure that justifies the positive label: ∃ h, w^T Φ(x_1, h) ≥ 0.
   x_2 is negative: no structure is good enough: ∀ h, w^T Φ(x_2, h) ≤ 0.

4. Why is binary labeled data useful?

   x_1 is positive: there exists a good structure: ∃ h, w^T Φ(x_1, h) ≥ 0, equivalently max_h w^T Φ(x_1, h) ≥ 0.
   x_2 is negative: no structure is good enough: ∀ h, w^T Φ(x_2, h) ≤ 0, equivalently max_h w^T Φ(x_2, h) ≤ 0.

   (Figure: the feasible sets { Φ(x_1, h) | h ∈ H(x_1) } and { Φ(x_2, h) | h ∈ H(x_2) }, the gold structure Φ(x_1, h*_1), the prediction Φ(x_1, ĥ), the weight vector w learned by SSVM, and the weight vector learned by SSVM + indirect supervision.)

5. Outline

   1. Motivation
   2. Structured Output Prediction and Its Companion Task
   3. Joint Learning with Indirect Supervision
   4. Optimization
   5. Experiments

6. Binary and structured labeled data

   Direct supervision S (target task):
     An example: (x_i, h_i)
     Goal: w^T Φ(x_i, h_i) ≥ max_{h ∈ H(x_i)} w^T Φ(x_i, h)
     Structural loss: L_S

   Indirect supervision B (companion task):
     An example: (x_i, y_i)
     Goal: y_i max_{h ∈ H(x_i)} w^T Φ(x_i, h) ≥ 0
     Binary loss: L_B

   Both L_S and L_B can use hinge, squared hinge, logistic, ...
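A matching sketch of the binary loss L_B, reusing the same toy feature map and brute-force inference as the structural-loss sketch above; the hinge on 1 − y · max_h w^T Φ(x, h) mirrors the companion-task goal, and all names are illustrative. (The model may use a separate feature map Φ_B for the binary task; this sketch simply reuses Φ.)

```python
import itertools
import numpy as np

TAGS = (0, 1)   # same illustrative toy setup as the structural-loss sketch
DIM = 3

def phi(x, h):
    f = np.zeros(DIM * len(TAGS))
    for x_t, h_t in zip(x, h):
        f[h_t * DIM:(h_t + 1) * DIM] += x_t
    return f

def best_structure_score(w, x):
    """max_{h in H(x)} w^T Phi(x, h), by brute-force enumeration."""
    return max(w @ phi(x, h) for h in itertools.product(TAGS, repeat=len(x)))

def binary_loss(w, x, y, squared=True):
    """L_B(x, y, w) = l(1 - y * max_h w^T Phi(x, h)) with (squared) hinge l.

    y = +1: some structure should score well (existential constraint).
    y = -1: every structure should score badly (universal constraint).
    """
    violation = max(0.0, 1.0 - y * best_structure_score(w, x))
    return violation ** 2 if squared else violation

# Tiny usage example on random data.
rng = np.random.default_rng(1)
x_pos = [rng.normal(size=DIM) for _ in range(3)]
x_neg = [rng.normal(size=DIM) for _ in range(3)]
w = rng.normal(size=DIM * len(TAGS))
print(binary_loss(w, x_pos, +1), binary_loss(w, x_neg, -1))
```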

7. Joint Learning with Indirect Supervision (J-LIS)

   min_w  ||w||^2 / 2  +  C_1 Σ_{i ∈ S} L_S(x_i, h_i, w)  +  C_2 Σ_{i ∈ B} L_B(x_i, y_i, w)

   - Regularization: measures the model complexity.
   - Direct supervision: structured labeled data S = { (x, h) }.
   - Indirect supervision: binary labeled data B = { (x, y) }.

   Shared weight vector w: the same weight vector is used for both the structured labeled data and the binary labeled data.
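A minimal sketch of the joint objective as a single function. It assumes structural_loss and binary_loss helpers shaped like the earlier sketches (w a numpy weight vector); it is an illustration, not the released JLIS implementation.

```python
def jlis_objective(w, S, B, structural_loss, binary_loss, C1=1.0, C2=1.0):
    """||w||^2 / 2 + C1 * sum_{(x,h) in S} L_S + C2 * sum_{(x,y) in B} L_B.

    S: structured labeled examples (x_i, h_i).
    B: binary labeled examples (x_i, y_i), with y_i in {+1, -1}.
    The same weight vector w scores both kinds of data.
    """
    regularizer = 0.5 * float(w @ w)
    direct = C1 * sum(structural_loss(w, x, h) for x, h in S)
    indirect = C2 * sum(binary_loss(w, x, y) for x, y in B)
    return regularizer + direct + indirect
```

With B = ∅ only the first two terms remain, which is exactly the Structural SVM objective above.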

8. Outline

   1. Motivation
   2. Structured Output Prediction and Its Companion Task
   3. Joint Learning with Indirect Supervision
   4. Optimization
   5. Experiments

9. Convexity Properties

   min_w  ||w||^2 / 2  +  C_1 Σ_{i ∈ S} L_S(x_i, h_i, w)  +  C_2 Σ_{i ∈ B} L_B(x_i, y_i, w)

   L_S(x_i, h_i, w) = ℓ( max_h [ Δ(h, h_i) − w^T Φ(x_i, h_i) + w^T Φ(x_i, h) ] )     (1)
   L_B(x_i, y_i, w) = ℓ( 1 − y_i max_{h ∈ H(x_i)} w^T Φ_B(x_i, h) )                  (2)

   Splitting the binary data into B⁻ (negative) and B⁺ (positive):

   min_w  ||w||^2 / 2  +  C_1 Σ_{i ∈ S} L_S(x_i, h_i, w)  +  C_2 Σ_{i ∈ B⁻} L_B(x_i, y_i, w)  +  C_2 Σ_{i ∈ B⁺} L_B(x_i, y_i, w)

   Regularization, direct supervision, and negative data B⁻: convex parts.
   Positive data B⁺: neither convex nor concave.

10. JLIS: optimization procedure

   Algorithm:
   1. Find the best structures for the positive examples.
   2. Find the weight vector using the structures found in Step 1. (Inference is still needed for the structured examples and the negative examples.)
   3. Repeat!

   This algorithm converges when ℓ is monotonically increasing and convex.

   Properties of the algorithm:
   - Asymmetric nature.
   - Converts a non-convex problem into a series of smaller convex problems.
   - Inference allows incorporating constraints on the output space (Chang, Goldwasser, Roth, and Srikumar, NAACL 2010).
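A compact sketch of this alternating loop. The inference routine and the convex solver are passed in as callables with assumed, illustrative signatures; the actual package's API may differ.

```python
def train_jlis(w0, S, B_pos, B_neg, best_structure, solve_convex, rounds=10):
    """Alternate between fixing structures for the positive binary examples
    and solving the resulting convex problem in w.

    best_structure(w, x): argmax_{h in H(x)} w^T Phi(x, h)   (inference)
    solve_convex(w, S, B_pos_fixed, B_neg): minimizes the convex surrogate
        in which each positive example keeps its currently fixed structure;
        inference for S and B_neg still happens inside this solver.
    """
    w = w0
    for _ in range(rounds):
        # Step 1: complete the latent structures of the positive examples.
        B_pos_fixed = [(x, best_structure(w, x)) for x, _ in B_pos]
        # Step 2: solve the convex sub-problem with those structures fixed.
        w = solve_convex(w, S, B_pos_fixed, B_neg)
        # Step 3: repeat until the structures (and hence w) stop changing.
    return w
```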

11. Solving the convex sub-problem

   min_w  ||w||^2 / 2  +  C_1 Σ_{i ∈ S} L_S(x_i, h_i, w)  +  C_2 Σ_{i ∈ B⁻} L_B(x_i, y_i, w)  +  C_2 Σ_{i ∈ B⁺} L_B(x_i, y_i, w)   [with fixed structures for B⁺]

   Cutting plane method:
   - Find the "best structure" for the examples in S and B⁻ with the current w.
   - Add the chosen structure to the cache and solve again!

   Dual coordinate descent method:
   - Simple implementation with the squared (L2) hinge loss.
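An illustrative sketch of the cutting-plane idea for this sub-problem: repeatedly add the most violated structure for each example in S and B⁻ to a working set (the cache) and re-solve over it. The helper signatures here are assumptions, not the actual solver interface.

```python
def cutting_plane(w0, examples, most_violated, solve_over_cache,
                  rounds=20, tol=1e-3):
    """Working-set (cutting-plane) loop for the convex sub-problem.

    examples: the structured examples S plus the negative binary examples B-.
    most_violated(w, ex) -> (structure, violation): loss-augmented inference
        returning the currently most violated structure and its violation.
    solve_over_cache(w, cache) -> new w, optimizing only over the cached cuts.
    """
    w = w0
    cache = {i: set() for i in range(len(examples))}   # working set per example
    for _ in range(rounds):
        added = 0
        for i, ex in enumerate(examples):
            h_hat, violation = most_violated(w, ex)
            # Assumes structures are hashable (e.g. tag tuples).
            if violation > tol and h_hat not in cache[i]:
                cache[i].add(h_hat)
                added += 1
        if added == 0:
            break                        # no new violated constraints: done
        w = solve_over_cache(w, cache)   # re-solve over the enlarged cache
    return w
```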

12. Outline

   1. Motivation
   2. Structured Output Prediction and Its Companion Task
   3. Joint Learning with Indirect Supervision
   4. Optimization
   5. Experiments

13. Experimental Setting

   Tasks:
   - Task 1: Phonetic alignment
   - Task 2: Part-of-speech tagging
   - Task 3: Information extraction (citation recognition, advertisement field recognition)

   Companion tasks:
   - Phonetic alignment: is this a transliteration pair or not?
   - POS tagging: does this sentence have a legitimate POS tag sequence or not?
   - IE: is this a legitimate citation/advertisement or not?

14. Experimental Results

   (Bar chart: accuracy of Structural SVM vs. Joint Learning with Indirect Supervision on the four tasks PA, POS, Citation, and ADS. PA: phonetic alignment; ADS: advertisement field recognition.)

15. Impact of negative examples

   J-LIS takes advantage of both positively and negatively labeled data.

   (Plot: accuracy of Structural SVM vs. J-LIS as the number of tokens in the negative examples grows from 100 through 25.6k to "all".)

16. Comparison to other learning frameworks

   Generalization over several frameworks:
   - B = ∅ ⇒ Structural SVM (Tsochantaridis, Hofmann, Joachims, and Altun 2004).
   - S = ∅ ⇒ Latent SVM/LR (Felzenszwalb, Girshick, McAllester, and Ramanan 2009; Chang, Goldwasser, Roth, and Srikumar NAACL 2010).

   Semi-supervised learning methods:
   - Transductive Structural SVM (Zien, Brefeld, and Scheffer 2007); co-Structural SVM (Brefeld and Scheffer 2006).
   - Unlike these, J-LIS uses "negative" examples.

   Compared to Contrastive Estimation: conceptually related (more discussion in the appendix slides below).

17. Conclusions

   It is possible to use binary labeled data for learning structures!
   J-LIS gains from both direct and indirect supervision.
   Similarly, structured labeled data can help the binary task.
   Inference allows the use of constraints on structures.

   Many exciting new directions:
   - Using existing labeled datasets as structured-task supervision.
   - How to generate good "negative" examples?
   - Other forms of indirect supervision?

18. Thank you!

   Our learning code is available as the JLIS package: http://l2r.cs.uiuc.edu/~cogcomp/software.php

19. Compared to Contrastive Estimation: I

   Contrastive Estimation (CE): unsupervised learning with log-linear models; maximize log P(x).

   Standard model:
   P(x) = Σ_h exp(w^T Φ(x, h)) / Σ_{x̂, h} exp(w^T Φ(x̂, h))

   CE:
   P(x) = Σ_h exp(w^T Φ(x, h)) / Σ_{x̂ ∈ N(x), h} exp(w^T Φ(x̂, h))
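For concreteness, a small sketch of the CE probability under the same toy feature map used in the earlier sketches. The neighborhood N(x) is passed in explicitly and is assumed to contain x itself; this only illustrates the formula, not the CE authors' implementation.

```python
import itertools
import numpy as np

TAGS = (0, 1)   # same illustrative toy setup as the earlier sketches
DIM = 3

def phi(x, h):
    f = np.zeros(DIM * len(TAGS))
    for x_t, h_t in zip(x, h):
        f[h_t * DIM:(h_t + 1) * DIM] += x_t
    return f

def log_sum_scores(w, x):
    """log sum_h exp(w^T Phi(x, h)), enumerating all structures of x."""
    scores = np.array([w @ phi(x, h)
                       for h in itertools.product(TAGS, repeat=len(x))])
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def ce_log_prob(w, x, neighborhood):
    """log P(x) = log sum_h exp(w^T Phi(x, h))
                  - log sum_{x' in N(x), h} exp(w^T Phi(x', h)).

    `neighborhood` is N(x), assumed to contain x itself.
    """
    terms = np.array([log_sum_scores(w, xn) for xn in neighborhood])
    m = terms.max()
    log_denominator = m + np.log(np.exp(terms - m).sum())
    return log_sum_scores(w, x) - log_denominator
```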

20. Compared to Contrastive Estimation: II

   CE: P(x) = Σ_h exp(w^T Φ(x, h)) / Σ_{x̂ ∈ N(x), h} exp(w^T Φ(x̂, h))

   CE vs. J-LIS:
   - Supervision type: "neighbors" (CE) vs. structured + binary (J-LIS).
   - Inference problem: sum (CE) vs. max (J-LIS).
   - Using existing data: CE needs to know the relationship between "neighbors" of the input x; J-LIS can use existing binary labeled data.

   Comparing J-LIS and CE without using labeled data (part-of-speech tagging experiments, same features and dataset):
   - Random baseline: 35%
   - EM: 60.9% (62.1%)
   - CE: 74.7% (79.0%)
   - J-LIS: 70.1%; J-LIS + 5 labeled examples: 79.1%

21. Joint learning: Results

   (Plot: accuracy on the binary classification task vs. the size of the binary training data |B|, from 100 to 1600, for |S| = 10 and |S| = 20, comparing initialization-only with joint training.)

   Impact of structured labeled data when binary classification is the target: results for transliteration identification show that joint training with direct and indirect supervision significantly improves performance, especially when direct supervision is scarce.
