

1. Tractable Semi-Supervised Learning of Complex Structured Prediction Models

   Kai-Wei Chang, University of Illinois at Urbana-Champaign
   (work conducted while interning at Microsoft)
   Joint work with Sundararajan S (Microsoft Research) and Sathiya Keerthi S (Microsoft CISL)
   September 24, 2013

2. Structured Prediction Problems (examples)

   - Sequence learning (e.g., input: a sentence; output: POS tags):
       The  President  came  to  the  office
       DT   N          V     P   DT   N
   - Multi-label classification (e.g., a document belongs to more than one
     class - finance, politics): Object → {cl_1, cl_2, ..., cl_K}
   - In this paper, we consider general structures
   - Characteristics:
       - Exponential number of output combinations for a given input, e.g.,
         2^K in a K-output multi-label classification problem (illustrated in
         the sketch below)
       - Label dependency across the outputs
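To make the exponential blow-up concrete, here is a minimal Python sketch (an illustration, not from the slides) that enumerates the output space of a toy multi-label problem; even K = 20 already yields over a million label combinations.

```python
from itertools import product

K = 4  # number of binary labels in a toy multi-label problem

# The output space contains every combination of the K labels,
# so |Y| = 2^K grows exponentially in K; exhaustive enumeration
# like this is feasible only for tiny toy problems.
all_outputs = list(product([+1, -1], repeat=K))
print(len(all_outputs))  # 16 == 2**4
```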

3. Semi-supervised Learning (SSL)

   - Manual labeling is expensive
   - Unlabeled data is freely available (e.g., web pages, mails)
   - Additional domain knowledge or side information is available:
       - label distribution in the unlabeled data (e.g., 80% positive examples)
       - label correlation (e.g., in a multi-label classification problem)
   - For SSL, we need an inference engine that can handle domain constraints
   - Making use of unlabeled data with domain knowledge or side information to
     constrain the solution space improves performance

4. SSL of Complex Structured Prediction Models

   - Most works assume that the output structure is simple (e.g., Dhillon et
     al. 2012, Chang et al. 2012)
     ⇒ cannot handle problems with complex structure
   - Contributions: we propose an approximate semi-supervised learning
     algorithm that
       - uses piecewise training for estimating the model weights
       - uses the dual decomposition method for the inference problem
     ⇒ extends SSL to general structured prediction problems
   - Our inference engine can be applied to various SSL frameworks

5. Outline

   - Background
       - Semi-supervised learning problem setting
       - Decomposable scoring function
   - Semi-supervised learning for structured predictions
       - Composite likelihood - approximate learning
       - Dual decomposition method - approximate inference (with constraints)
   - Experimental results
   - Conclusion

6. Outline (repeated from slide 5)

7-9. SSL Problem

   - Input space X, output space Y
   - A small set of labeled examples: X_L = {x_i}_{i=1}^n, Y_L = {y_i}_{i=1}^n
   - A large set of unlabeled examples: X_U = {x_j}_{j=1}^m
   - Domain knowledge or a set of constraints C
   - Learning problem: learn a scoring function s(x, y; θ) = θ · f(x, y),
     where θ denotes the model parameters and f(·) is the feature function
   - Inference problem: y* = argmax_{y ∈ Y} s(x, y; θ)
     (both problems are sketched in code below)
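The learning and inference problems above fit in a few lines of code. The sketch below is an illustration, not the authors' implementation; the joint feature map f is a hypothetical stand-in. It scores outputs with s(x, y; θ) = θ · f(x, y) and performs argmax inference by brute force, which is exactly what becomes infeasible for structured outputs.

```python
import numpy as np
from itertools import product

def score(x, y, theta, f):
    """Linear scoring function s(x, y; theta) = theta . f(x, y)."""
    return theta @ f(x, y)

def argmax_inference(x, theta, f, K):
    """Brute-force y* = argmax_{y in Y} s(x, y; theta).

    Feasible only for tiny K; for real structures this loop is
    replaced by a tractable solver (e.g., Viterbi for sequences)
    or an approximate method such as dual decomposition."""
    best_y, best_s = None, -np.inf
    for y in product([+1, -1], repeat=K):
        s = score(x, np.array(y), theta, f)
        if s > best_s:
            best_y, best_s = y, s
    return best_y, best_s

# Hypothetical joint feature map: outer product of labels and input.
f = lambda x, y: np.outer(y, x).ravel()
x = np.array([0.5, -1.0, 2.0])
theta = np.random.default_rng(0).standard_normal(2 * 3)
print(argmax_inference(x, theta, f, K=2))
```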

10-12. SSL Problem (2)

   - (Exact) likelihood model (using the scoring function s(·)):
       p(y | x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))
   - Supervised learning: max_θ S(θ) = R(θ) + L(Y_L; X_L, θ), with
       - regularization: R(θ) = -||θ||² / (2σ²)  (σ²: regularization parameter)
       - log-likelihood function:
         L(Y; X, θ) = (1/n) log p(Y | X; θ) = (1/n) Σ_i log p(y_i | x_i; θ)
     (both pieces are sketched below)
   - Semi-supervised learning:
       max_{θ, Y_U} S(θ) + L(Y_U; X_U, θ)  s.t. label constraints on Y_U
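As a sanity check on the objective, here is a small sketch (assuming binary labels and the same hypothetical feature map f as in the earlier sketch) that evaluates the exact log-likelihood with a brute-force partition function and assembles the supervised objective S(θ). The sum over all y' is the part that becomes intractable for complex structures.

```python
import numpy as np
from itertools import product

def log_likelihood(x, y, theta, f, K):
    """Exact log p(y | x; theta); the brute-force partition function
    sums over all 2^K outputs, which is what makes exact training
    intractable beyond simple models."""
    s_y = theta @ f(x, np.array(y))
    all_scores = [theta @ f(x, np.array(yp))
                  for yp in product([+1, -1], repeat=K)]
    log_Z = np.logaddexp.reduce(all_scores)  # stable log-sum-exp
    return s_y - log_Z

def supervised_objective(XL, YL, theta, f, K, sigma2=1.0):
    """S(theta) = R(theta) + L(Y_L; X_L, theta),
    with R(theta) = -||theta||^2 / (2 sigma^2)."""
    R = -theta @ theta / (2.0 * sigma2)
    L = np.mean([log_likelihood(x, y, theta, f, K)
                 for x, y in zip(XL, YL)])
    return R + L
```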

13-15. Decomposable Scoring Function

   - Learning the probabilistic model is intractable (except for simple models):
       p(y | x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))
     The partition function sums over an exponential number of label combinations
   - The inference involved in SSL is also intractable: the number of output
     combinations is exponentially large
   - Decomposable scoring function s(·): s(y; x, θ) = Σ_c φ_c(y_{π_c}),
     where c is a component (sketched below)
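The decomposition s(y; x, θ) = Σ_c φ_c(y_{π_c}) is easy to express in code. Below is a minimal sketch in which the pairwise components are made-up stand-ins for the φ_c of the slides: each component sees only the label subset indexed by its π_c.

```python
import numpy as np

def decomposed_score(x, y, components):
    """s(y; x, theta) = sum_c phi_c(y_{pi_c}): each component c
    scores only the label subset indexed by pi_c."""
    return sum(phi_c(x, y[pi_c]) for pi_c, phi_c in components)

# Hypothetical pairwise components over labels (0,1) and (1,2); a
# local partition function for either one sums over just 2^2 = 4
# assignments instead of 2^K.
components = [
    (np.array([0, 1]), lambda x, yc: float(yc[0] * yc[1]) * x.sum()),
    (np.array([1, 2]), lambda x, yc: float(yc[0] + yc[1]) * x[0]),
]
x = np.array([0.3, -0.7])
y = np.array([+1, -1, +1])
print(decomposed_score(x, y, components))
```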

16. Decomposable Scoring Function (2)

   - Decomposable scoring function s(·): s(y; x, θ) = Σ_c φ_c(y_{π_c}),
     where c is a component
   - Can we use a simplified likelihood model to learn the model parameters
     efficiently? Composite likelihood approach - compose the likelihood
     using the likelihoods of the individual components
   - Can we use popular decomposition methods for solving inference problems
     with domain constraints efficiently? (e.g., dual decomposition; a generic
     sketch follows)
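The slide only poses the dual decomposition question here; the method itself comes later in the talk. Its mechanics can still be sketched generically: relax the agreement constraint between two tractable subproblems with Lagrange multipliers and take subgradient steps on the dual. Everything below (solve_a, solve_b, the toy unary scores) is a hypothetical illustration, not the authors' algorithm.

```python
import numpy as np

def dual_decomposition(solve_a, solve_b, n_shared, iters=100, eta0=1.0):
    """Generic subgradient dual decomposition sketch.

    Two tractable subproblems share n_shared binary variables.
    Lagrange multipliers lam price disagreement: subproblem A is
    solved with its scores shifted by +lam, subproblem B by -lam,
    and lam moves along the subgradient (ya - yb)."""
    lam = np.zeros(n_shared)
    ya = None
    for t in range(1, iters + 1):
        ya = solve_a(+lam)          # argmax of s_a(y) + lam . y
        yb = solve_b(-lam)          # argmax of s_b(y) - lam . y
        if np.array_equal(ya, yb):  # agreement certifies optimality
            return ya
        lam -= (eta0 / t) * (ya - yb)
    return ya                       # no agreement: heuristic fallback

# Toy usage: each "subproblem" thresholds its adjusted unary scores.
ua, ub = np.array([0.5, -0.2]), np.array([-0.4, 0.3])
solve_a = lambda adj: (ua + adj > 0).astype(float)
solve_b = lambda adj: (ub + adj > 0).astype(float)
print(dual_decomposition(solve_a, solve_b, n_shared=2))
```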

17. Outline (repeated from slide 5)

18-19. Composite Likelihood

   - Composite (log-)likelihood:
       L̃(y; x, θ) = Σ_c L_c(y_{π_c}; x, θ) = Σ_c φ_c(y_{π_c}) - Σ_c log Z_c,
     where π_c ⊂ {1, ..., N} is the index set associated with component c
   - Key: the partition function of each component is easy to compute
     (see the sketch below)
   - Example: let y = (y_1, y_2, ..., y_K), y_k ∈ {+, -}, and decompose the
     likelihood function using K spanning trees (involving all variables y)
       - Score of each tree:
         φ_k(y_{π_k}) = (1/K) Σ_p θ_p(y_p) · x + (1/2) Σ_{q≠k} θ_{kq}(y_{kq}) · x
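A minimal sketch of the composite likelihood above, reusing the hypothetical pairwise components from the earlier decomposable-scoring sketch: each term pays only a local log Z_c computed over that component's own (small) label space.

```python
import numpy as np
from itertools import product

def composite_log_likelihood(x, y, components):
    """L~(y; x, theta) = sum_c [ phi_c(y_{pi_c}) - log Z_c ].

    Z_c sums only over assignments of the labels in pi_c, so every
    term stays tractable even when the global partition function
    (over all 2^K outputs) is not."""
    total = 0.0
    for pi_c, phi_c in components:
        s_c = phi_c(x, y[pi_c])
        local = [phi_c(x, np.array(yc))   # terms of the local Z_c
                 for yc in product([+1, -1], repeat=len(pi_c))]
        total += s_c - np.logaddexp.reduce(local)
    return total

# Same hypothetical pairwise components as in the earlier sketch.
components = [
    (np.array([0, 1]), lambda x, yc: float(yc[0] * yc[1]) * x.sum()),
    (np.array([1, 2]), lambda x, yc: float(yc[0] + yc[1]) * x[0]),
]
print(composite_log_likelihood(np.array([0.3, -0.7]),
                               np.array([+1, -1, +1]), components))
```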
