SLIDE 1

Tractable Semi-Supervised Learning of Complex Structured Prediction Models

Kai-Wei Chang
University of Illinois at Urbana-Champaign
(Work conducted while interning at Microsoft)

Joint work with Sundararajan S (Microsoft Research) and Sathiya Keerthi S (Microsoft CISL)

September 24, 2013

SLIDE 2

Structured Prediction Problems (examples)

Sequence learning (e.g., input: a sentence; output: POS tags):

  The  President  came  to  the  office
  DT   N          V     P   DT   N

Multi-label classification (e.g., a document belongs to more than one class, such as finance and politics): Object → {cl1, cl2, ..., clK}

In this paper, we consider general structures. Characteristics:
◮ Exponential number of output combinations for a given input (e.g., 2^K in a K-output multi-label classification problem)
◮ Label dependency across the outputs

SLIDE 3

Semi-supervised Learning (SSL)

Manual labeling is expensive
Unlabeled data is freely available (e.g., web pages, emails)
Additional domain knowledge or side information is often available:
◮ Label distribution in the unlabeled data (e.g., 80% positive examples)
◮ Label correlation (e.g., in multi-label classification problems)

For SSL, we need an inference engine that can handle domain constraints
Using unlabeled data together with domain knowledge or side information to constrain the solution space yields improved performance

SLIDE 4

SSL of Complex Structured Prediction Models

Most prior work assumes that the output structure is simple (e.g., Dhillon et al. 12, Chang et al. 12) ⇒ cannot handle problems with complex structure

Contributions: we propose an approximate semi-supervised learning algorithm that
◮ uses piecewise training for estimating the model weights
◮ uses the dual decomposition method for the inference problem
⇒ extending SSL to general structured prediction problems

Our inference engine can be applied to various SSL frameworks

SLIDE 5

Outline

Background
◮ Semi-supervised Learning Problem Setting
◮ Decomposable Scoring Function

Semi-supervised Learning for Structured Predictions
◮ Composite Likelihood - approximate learning
◮ Dual Decomposition Method - approximate inference (with constraints)

Experimental Results

Conclusion



SLIDE 9

SSL Problem

Input space X, output space Y
A small set of labeled examples X_L = {x_i}_{i=1}^n, Y_L = {y_i}_{i=1}^n
A large set of unlabeled examples X_U = {x_j}_{j=1}^m
Domain knowledge or a set of constraints C

Learning problem: learn a scoring function s(x, y; θ) = θ · f(x, y), where θ denotes the model parameters and f(·) is the feature function

Inference problem: y* = argmax_{y ∈ Y} s(x, y; θ)


SLIDE 12

SSL Problem (2)

(Exact) likelihood model (using the scoring function s(·)):

  p(y|x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))

Supervised learning: max_θ S(θ) = R(θ) + L(Y_L; X_L, θ)
◮ Regularization: R(θ) = −||θ||² / (2σ²) (σ²: regularization parameter)
◮ Log-likelihood function: L(Y; X, θ) = (1/n) log p(Y|X; θ) = (1/n) Σ_i log p(y_i|x_i; θ)

Semi-supervised learning:

  max_{θ, Y_U} S(θ) + L(Y_U; X_U, θ) s.t. label constraints on Y_U
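Collecting the slide's formulas in one place (with the unlabeled term normalized by 1/m, by analogy with the labeled term), the semi-supervised training problem reads:

```latex
\max_{\theta,\,Y_U}\;
-\frac{\|\theta\|^2}{2\sigma^2}
+\frac{1}{n}\sum_{i=1}^{n}\log p(y_i\mid x_i;\theta)
+\frac{1}{m}\sum_{j=1}^{m}\log p(y_j\mid x_j;\theta)
\quad\text{s.t. label constraints on } Y_U,
\qquad\text{where}\quad
p(y\mid x;\theta)=\frac{\exp(s(x,y;\theta))}{\sum_{y'}\exp(s(x,y';\theta))}.
```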


SLIDE 15

Decomposable Scoring Function

Learning the probabilistic model is intractable (except for simple models):

  p(y|x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))

◮ Partition function (sum over an exponential number of label combinations)

The inference involved in SSL is also intractable
◮ Number of output combinations is exponentially large

Decomposable scoring function s(·):

  s(y; x, θ) = Σ_c φ_c(y_πc), where c is a component
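To make the decomposition concrete, here is a minimal runnable sketch of such a score for K binary output labels, with one unary component per label and one pairwise component per label pair; the particular parametrization and the synthetic weights are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

# K binary output labels; the score decomposes into one unary component
# per label and one pairwise component per label pair.

K, D = 4, 10                                 # number of labels, feature dim
rng = np.random.default_rng(0)
theta_u = rng.normal(size=(K, 2, D))         # unary weights theta_p(y_p)
theta_pw = rng.normal(size=(K, K, 2, 2, D))  # pairwise weights theta_pq(y_p, y_q)

def score(x, y):
    """s(y; x, theta) = sum_c phi_c(y_pi_c): unary plus pairwise components."""
    s = sum(theta_u[p, y[p]] @ x for p in range(K))
    s += sum(theta_pw[p, q, y[p], y[q]] @ x
             for p in range(K) for q in range(p + 1, K))
    return s

x = rng.normal(size=D)
print(score(x, np.array([1, 0, 1, 1])))
```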

SLIDE 16

Decomposable Scoring Function (2)

Decomposable scoring function s(·):

  s(y; x, θ) = Σ_c φ_c(y_πc), where c is a component

Can we use a simplified likelihood model to learn the model parameters efficiently?
◮ Composite likelihood approach: compose the likelihood from the likelihoods of individual components

Can we use popular decomposition methods to solve inference problems with domain constraints efficiently?
◮ e.g., dual decomposition

SLIDE 17

Outline

Background
◮ Semi-supervised Learning Problem Setting
◮ Decomposable Scoring Function

Semi-supervised Learning for Structured Predictions
◮ Composite Likelihood - approximate learning
◮ Dual Decomposition Method - approximate inference (with constraints)

Experimental Results

Conclusion


SLIDE 20

Composite Likelihood

Composite (log) likelihood:

  L̃(y; x, θ) = Σ_c L_c(y_πc; x, θ) = Σ_c φ_c(y_πc) − Σ_c log Z_c

where πc ⊂ {1, ..., N} is the index set associated with component c.
Key: the partition function Z_c of each component is easy to compute

Example: let y = (y1, y2, ..., yK), yk ∈ {+, −}, and decompose the likelihood function using K spanning trees (each involving all variables y):
◮ Score of the k-th tree: φ_k(y_πk) = (1/K) Σ_p θ_p(y_p) · x + (1/2) Σ_{q≠k} θ_kq(y_k, y_q) · x

Related to composite marginal maximization (Lindsay 88) and piecewise training methods (Sutton and McCallum 09)
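A minimal runnable sketch of the composite log-likelihood, using one component per label pair for brevity; in the slide's spanning-tree example, log Z_c would instead be computed by belief propagation on each tree. Weights and data below are synthetic:

```python
import itertools
import numpy as np

# Composite log-likelihood L~(y; x, theta) = sum_c [phi_c(y_pi_c) - log Z_c],
# with one component per label pair. Each Z_c is cheap because it sums over
# the component's small label space only.

K, D = 4, 10
rng = np.random.default_rng(0)
theta_pw = rng.normal(size=(K, K, 2, 2, D)) * 0.1

def phi(p, q, yp, yq, x):
    return theta_pw[p, q, yp, yq] @ x

def composite_loglik(y, x):
    total = 0.0
    for p, q in itertools.combinations(range(K), 2):
        # Z_c for a pair component: only 4 labelings to enumerate.
        log_zc = np.logaddexp.reduce(
            [phi(p, q, a, b, x) for a in (0, 1) for b in (0, 1)])
        total += phi(p, q, y[p], y[q], x) - log_zc
    return total

x = rng.normal(size=D)
print(composite_loglik(np.array([1, 0, 1, 1]), x))
```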


SLIDE 24

Model Parameter (θ) Learning

  max_{θ, Y_U} S(θ) + L(Y_U; X_U, θ) s.t. label constraints on Y_U

Alternately update θ and Y_U
The objective function is non-concave ⇒ annealing technique: gradually increase the effect of the unlabeled data

Keeping Y_U fixed, optimize over θ:

  max_θ S(θ) + L(Y_U; X_U, θ)

◮ The composite likelihood makes the log-likelihood and its gradient computation tractable
◮ Standard unconstrained optimizers can be used (e.g., L-BFGS)

Note: the composite likelihood does not simplify our inference problem!
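The loop below is a runnable toy sketch of this alternating scheme on synthetic data with independent binary labels (so the likelihood decomposes exactly); the 80%-positive constraint, the annealing schedule, and all data are illustrative stand-ins rather than the paper's setup:

```python
import numpy as np
from scipy.optimize import minimize

# Toy alternating optimization: (1) fix Y_U, maximize the annealed objective
# over theta with L-BFGS; (2) fix theta, re-infer Y_U under the constraint.

rng = np.random.default_rng(0)
D, n, m = 5, 20, 200
w_true = rng.normal(size=D)
X_L, X_U = rng.normal(size=(n, D)), rng.normal(size=(m, D))
Y_L = (X_L @ w_true > 0).astype(float)

def neg_obj(theta, Y_U, gamma, sigma2=10.0):
    def nll(X, Y):                        # per-label logistic log-loss
        z = X @ theta
        return np.mean(np.log1p(np.exp(-z)) + (1 - Y) * z)
    def grad(X, Y):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        return (X.T @ (p - Y)) / len(Y)
    val = nll(X_L, Y_L) + gamma * nll(X_U, Y_U) + theta @ theta / (2 * sigma2)
    g = grad(X_L, Y_L) + gamma * grad(X_U, Y_U) + theta / sigma2
    return val, g

def constrained_inference(theta, pos_frac=0.8):
    # labels maximizing the score subject to "80% positive examples":
    # the top-scoring fraction gets the positive label
    z = X_U @ theta
    Y = np.zeros(m)
    Y[np.argsort(-z)[: int(pos_frac * m)]] = 1.0
    return Y

theta = np.zeros(D)
Y_U = constrained_inference(theta)
for t in range(10):
    gamma = (t + 1) / 10.0                # annealing: grow unlabeled weight
    theta = minimize(neg_obj, theta, args=(Y_U, gamma), jac=True,
                     method="L-BFGS-B").x
    Y_U = constrained_inference(theta)
print(theta)
```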


SLIDE 28

Solving the Inference Problem

Constrained inference during learning: fix θ, optimize over Y_U:

  max_{Y_U} L(Y_U; X_U, θ) s.t. label constraints on Y_U

Basic form without constraints:

  y* = argmax_{y ∈ Y} s(x, y; θ) = argmax_{y ∈ Y} Σ_c φ_c(y_πc)

where each c is a component (e.g., a tree)

Assumption: efficient inference is feasible for each component c


SLIDE 30

Reformulation

Introduce a new set of auxiliary variables y^(c)_πc for each tree c
(note: variables are shared across the trees c)
Constrain the shared variables to take the same values (i.e., y^(c)_πc = y_πc)

New optimization problem:

  max_{y ∈ Y, y^(c)_πc ∈ Y_πc} Σ_c φ_c(y^(c)_πc) s.t. y^(c)_πc = y_πc for all c

Can be rewritten using a set of binary variables (one per label combination); we use the same notation here for simplicity

SLIDE 31

Inference via Dual Decomposition (e.g., Komodakis et al. 07)

Lagrangian dual problem:

  min_ν Σ_c max_{y^(c)_πc} [ φ_c(y^(c)_πc) + Σ_{j ∈ πc} ν_{c,j} (y^(c)_πc)_j ] s.t. Σ_{c ∈ π⁻¹(j)} ν_{c,j} = 0 ∀j

Iterative master-slave approach:
◮ Given the current dual variables, each slave solves its sub-problem and communicates the solution to the master
◮ The master receives the solutions, updates the dual variables, and communicates them back to the slaves

The inner maximization is not differentiable ⇒ use the projected sub-gradient method (iterative)
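A self-contained toy sketch of this master-slave loop for two components sharing K binary variables; the slaves maximize by enumeration as a stand-in for efficient per-component (e.g., tree) inference, and the 1/(t+1) step size is one common choice:

```python
import itertools
import numpy as np

# Master-slave dual decomposition: each slave maximizes its own score plus
# the dual terms; the master takes a projected sub-gradient step toward
# agreement. Scores and iteration count are illustrative.

K = 3
rng = np.random.default_rng(0)
labelings = list(itertools.product((0, 1), repeat=K))
phi = [dict(zip(labelings, rng.normal(size=2 ** K))) for _ in range(2)]

def slave(c, nu_c):
    # argmax_y  phi_c(y) + sum_j nu_c[j] * y_j
    return max(labelings,
               key=lambda y: phi[c][y] + sum(nu_c[j] * y[j] for j in range(K)))

nu = np.zeros((2, K))               # dual variables; kept summing to 0 over c
for t in range(100):
    ys = np.array([slave(c, nu[c]) for c in range(2)])
    if (ys[0] == ys[1]).all():      # slaves agree -> solution recovered
        break
    step = 1.0 / (t + 1)
    nu -= step * (ys - ys.mean(axis=0))   # projected sub-gradient update
print(ys)
```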

SLIDE 32

Dual Decomposition with Constraints

Constraints bring in new dual variables {η_m}
Alternate optimization of {ν_c} and {η_m}
The special form of the constraint functions helps maintain efficient sub-gradient computations
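To make the extra dual variables concrete, the sketch below adds a single count constraint Σ_j y_j ≤ B to the previous toy problem and dualizes it with one multiplier η ≥ 0; the equal split of the constraint term across slaves and the simultaneous (rather than alternating) updates are illustrative simplifications:

```python
import itertools
import numpy as np

# Same toy problem, now with a count constraint sum_j y_j <= B dualized by
# one multiplier eta >= 0; each slave absorbs an equal share (eta / 2) of
# the constraint term, and eta is projected to stay nonnegative.

K, B = 3, 1
rng = np.random.default_rng(0)
labelings = list(itertools.product((0, 1), repeat=K))
phi = [dict(zip(labelings, rng.normal(size=2 ** K))) for _ in range(2)]

def slave(c, nu_c, eta):
    return max(labelings,
               key=lambda y: phi[c][y] +
               sum((nu_c[j] - eta / 2) * y[j] for j in range(K)))

nu, eta = np.zeros((2, K)), 0.0
for t in range(200):
    ys = np.array([slave(c, nu[c], eta) for c in range(2)])
    step = 1.0 / (t + 1)
    nu -= step * (ys - ys.mean(axis=0))                       # agreement duals
    eta = max(0.0, eta + step * (ys.mean(axis=0).sum() - B))  # constraint dual
print(ys, eta)
```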

SLIDE 33

Outline

Background
◮ Semi-supervised Learning Problem Setting
◮ Decomposable Scoring Function

Semi-supervised Learning for Structured Predictions
◮ Composite Likelihood - approximate learning
◮ Dual Decomposition Method - approximate inference (with constraints)

Experimental Results

Conclusion

SLIDE 34

Datasets (Multi-label Classification)

  Dataset     No. Classes   n       d
  EMOTION     6             593     72
  SCENE       6             2403    294
  YEAST       14            2417    103
  RCV         30*           3000    47236
  SIAM 2007   22            28096   30438

n, d: total number of examples and features
Training split: 49%; validation split: 21%; test split: 30%
Labeled examples (%): 1, 2, 4, 8
10 random partitions

SLIDE 35

Performance Over Iterations

[Figure: performance over iterations on (a) rcv1 and (b) emotions]

SLIDE 36

Comparison of Semi-supervised and Supervised Learning Algorithms

  Degree of labeling:   1%                      4%
  Dataset     Semi-Sup    Sup         Semi-Sup    Sup
  scene       55.2±3.7    51.5±2.7    63.8±1.4    61.8±2.5
  yeast       42.9±2.1    42.5±1.8    45.2±1.2    44.5±1.0
  emotions    51.3±5.9    49.9±4.5    58.8±3.8    58.3±4.3
  rcv1        30.5±2.1    27.8±2.0    36.4±0.9    34.2±1.4
  tmc2007     42.0±1.2    41.3±1.2    44.1±0.6    41.4±1.7


SLIDE 38

Incorporating the Inference Engine into Other Methods

  Degree of labeling:      1%                      4%
  Dataset  Method       Semi-Sup    Sup         Semi-Sup    Sup
  Scene    TSVM+        45.7±4.0    42.1±2.6    60.4±1.5    55.9±1.9
           CoDL+        50.2±9.1    33.8±6.8    60.8±1.9    45.4±4.7
           Our Method   55.2±3.7    51.5±2.7    63.8±1.4    61.8±2.5
  Yeast    TSVM+        40.1±2.1    41.1±2.2    41.2±1.5    43.3±1.2
           CoDL+        40.9±3.3    39.6±3.0    40.9±1.7    43.2±1.9
           Our Method   42.9±2.1    42.5±1.8    45.2±1.2    44.5±1.0

Other experiments in the paper:
◮ exact vs. approximate
◮ different sets of constraints


SLIDE 40

Conclusion

We presented an effective SSL framework for structured prediction problems with complex output structure
Our inference engine can be applied to various SSL frameworks
The approach can be applied to problems other than multi-label classification

Thanks