  1. Factorization of the Label Conditional Distribution for Multi-Label Classification
     ECML PKDD 2015, International Workshop on Big Multi-Target Prediction
     Maxime Gasse, Alex Aussem, Haytham Elghazel
     LIRIS Laboratory, UMR 5205 CNRS, University of Lyon 1, France
     September 11, 2015

  2. Outline
     ◮ Multi-label classification
       ◮ Unified probabilistic framework
       ◮ Hamming loss vs subset 0/1 loss
     ◮ Factorization of the joint conditional distribution of the labels
       ◮ Irreducible label factors
       ◮ The ILF-Compo algorithm
     ◮ Experimental results
       ◮ Toy problem
       ◮ Benchmark data sets
     This work was recently presented at ICML (Gasse, Aussem, and Elghazel 2015).

  3–6. Unified probabilistic framework
     Find a mapping h from a space of features X to a space of labels Y:
       x ∈ ℝ^d, y ∈ {0, 1}^c, h : X → Y.
     The risk-minimizing model h⋆ with respect to a loss function L is defined over p(X, Y) as
       h⋆ = arg min_h E_{X,Y}[L(Y, h(X))].
     The point-wise best prediction requires only p(Y | X):
       h⋆(x) = arg min_y E_{Y|x}[L(Y, y)].
     The current trend is to exploit label dependence to improve MLC... but under which loss function?
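
A minimal sketch (my own illustration, not from the slides) of the point-wise Bayes-optimal rule: if p(Y | x) is available as an explicit table over label vectors, the best prediction under any loss L is found by brute-force expected-loss minimization.

    import itertools

    def bayes_optimal(p_y_given_x, loss, c):
        """Point-wise Bayes-optimal prediction: arg min_y E_{Y|x}[L(Y, y)].

        p_y_given_x: dict mapping label tuples y in {0,1}^c to probabilities.
        loss: function L(y_true, y_pred) -> float.
        """
        candidates = itertools.product([0, 1], repeat=c)
        return min(candidates,
                   key=lambda y: sum(p * loss(y_true, y)
                                     for y_true, p in p_y_given_x.items()))

This enumerates all 2^c candidate label vectors, which is only feasible for small c; the factorization presented later is what makes such reasoning tractable.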

  7–9. Hamming loss vs subset 0/1 loss
     Hamming loss:     L_H(y, h(x)) = (1/c) Σ_{i=1}^{c} 1(y_i ≠ h_i(x))
     Subset 0/1 loss:  L_S(y, h(x)) = 1(y ≠ h(x))
     For the Hamming loss, BR (Binary Relevance) is optimal, with c parameters:
       h⋆_H(x) = (arg max_{y_i} p(y_i | x))_{i=1}^{c}.
     For the subset 0/1 loss, LP (Label Powerset) is optimal, with 2^c parameters:
       h⋆_S(x) = arg max_y p(y | x).
     p(Y | x) is much harder to estimate than the marginals p(Y_i | x)... can we use the label dependencies to better model p(Y | x)?
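
A small sketch (assumed helper names, not from the talk) contrasting the two optimal predictors when p(Y | x) is given as an explicit table: the Hamming-optimal prediction takes each label's marginal mode, while the subset-0/1-optimal prediction takes the joint mode.

    def hamming_optimal(p_y_given_x, c):
        """Per-label marginal mode: the BR-style optimum for Hamming loss."""
        pred = []
        for i in range(c):
            p_i1 = sum(p for y, p in p_y_given_x.items() if y[i] == 1)
            pred.append(1 if p_i1 >= 0.5 else 0)
        return tuple(pred)

    def subset_optimal(p_y_given_x):
        """Joint mode: the LP-style optimum for subset 0/1 loss."""
        return max(p_y_given_x, key=p_y_given_x.get)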

  10–11. Hamming loss vs subset 0/1 loss
     A quick example: who is in the picture?

       Jean   René   p(J, R | x)
       0      0      0.02
       0      1      0.10
       1      0      0.13
       1      1      0.75

     HLoss optimal: J = 1, R = 1 (88%, 85%). SLoss optimal: J = 1, R = 1 (75%). Both losses agree.

       Jean   René   p(J, R | x)
       0      0      0.02
       0      1      0.46
       1      0      0.44
       1      1      0.08

     HLoss optimal: J = 1, R = 1 (52%, 54%). SLoss optimal: J = 0, R = 1 (46%). The two losses now disagree.
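
Running the earlier sketch on the second table reproduces the disagreement:

    p2 = {(0, 0): 0.02, (0, 1): 0.46, (1, 0): 0.44, (1, 1): 0.08}
    print(hamming_optimal(p2, c=2))  # (1, 1): marginals p(J=1|x)=0.52, p(R=1|x)=0.54
    print(subset_optimal(p2))        # (0, 1): the joint mode, with p=0.46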

  12–14. Factorization of the joint conditional distribution
     Depending on the dependency structure between the labels and the features, the problem of modeling the joint conditional distribution may actually decompose into a product of label factors:
       p(Y | X) = Π_{Y_LF ∈ P_Y} p(Y_LF | X),
       arg max_y p(y | x) = Π_{Y_LF ∈ P_Y} arg max_{y_LF} p(y_LF | x),
     with P_Y a partition of Y.

     Definition. We say that Y_LF ⊆ Y is a label factor iff Y_LF ⊥⊥ Y \ Y_LF | X. Additionally, Y_LF is said to be irreducible iff none of its non-empty proper subsets is a label factor.

     We seek a factorization into (unique) irreducible label factors, ILF.
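
A sketch (illustrative, with an assumed data layout) of why the factorization pays off: once a partition into label factors is known, the joint mode is assembled from per-factor modes, replacing one search over 2^c label vectors with several searches over much smaller spaces.

    def factorized_map(factor_distributions, c):
        """Joint MAP assembled from per-factor conditionals.

        factor_distributions: list of (label_indices, table) pairs, where
        table maps value tuples over those labels to p(y_LF | x).
        """
        y = [None] * c
        for indices, table in factor_distributions:
            mode = max(table, key=table.get)  # arg max over this factor only
            for pos, value in zip(indices, mode):
                y[pos] = value
        return tuple(y)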

  15–17. Graphical characterization
     Theorem. Let G be an undirected graph whose nodes correspond to the random variables in Y and in which two nodes Y_i and Y_j are adjacent iff ∃ Z ⊆ Y \ {Y_i, Y_j} such that {Y_i} ⊥̸⊥ {Y_j} | X ∪ Z. Then, two labels Y_i and Y_j belong to the same irreducible label factor iff a path exists between Y_i and Y_j in G.

     This requires O(c² 2^c) pairwise tests of conditional independence to characterize the irreducible label factors. It becomes much easier if we assume the Composition property.
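
A direct implementation of this characterization is exponential; the following sketch makes the cost explicit. Here ci_test is an assumed black-box conditional-independence test, not something specified in the talk.

    import itertools

    def ilf_graph_exhaustive(labels, ci_test):
        """Edges of G: Y_i ~ Y_j iff some Z ⊆ Y ∖ {Y_i, Y_j} makes them
        dependent given X ∪ Z. Runs O(c^2 2^c) tests.

        ci_test(i, j, Z) -> True iff Y_i ⊥⊥ Y_j | X ∪ Z (assumed black box).
        """
        edges = set()
        for i, j in itertools.combinations(labels, 2):
            rest = [l for l in labels if l not in (i, j)]
            subsets = itertools.chain.from_iterable(
                itertools.combinations(rest, r) for r in range(len(rest) + 1))
            if any(not ci_test(i, j, set(Z)) for Z in subsets):
                edges.add((i, j))
        return edges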

  18–20. The Composition property
     The dependency of a whole implies the dependency of some part:
       X ⊥̸⊥ Y ∪ W | Z ⇒ (X ⊥̸⊥ Y | Z) ∨ (X ⊥̸⊥ W | Z).
     This is a weak assumption: several existing methods and algorithms assume the Composition property (e.g. forward feature selection).
     Typical counter-example: the exclusive OR relationship,
       A = B ⊕ C ⇒ {A} ⊥̸⊥ {B, C} ∧ {A} ⊥⊥ {B} ∧ {A} ⊥⊥ {C}.
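
The XOR counter-example is easy to verify empirically. A quick sketch (my own illustration) estimates mutual information from samples: A is independent of B alone and of C alone, yet fully determined by the pair.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.integers(0, 2, 100_000)
    C = rng.integers(0, 2, 100_000)
    A = B ^ C  # exclusive OR

    def mutual_info(x, y):
        """Empirical mutual information (in nats) between discrete arrays."""
        mi = 0.0
        for a in np.unique(x):
            for b in np.unique(y):
                p_xy = np.mean((x == a) & (y == b))
                p_x, p_y = np.mean(x == a), np.mean(y == b)
                if p_xy > 0:
                    mi += p_xy * np.log(p_xy / (p_x * p_y))
        return mi

    print(mutual_info(A, B))          # ~0:      {A} ⊥⊥ {B}
    print(mutual_info(A, C))          # ~0:      {A} ⊥⊥ {C}
    print(mutual_info(A, 2 * B + C))  # ~log 2:  {A} ⊥̸⊥ {B, C}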

  21–23. Graphical characterization, assuming Composition
     Theorem. Suppose p supports the Composition property. Let G be an undirected graph whose nodes correspond to the random variables in Y and in which two nodes Y_i and Y_j are adjacent iff {Y_i} ⊥̸⊥ {Y_j} | X. Then, two labels Y_i and Y_j belong to the same irreducible label factor iff a path exists between Y_i and Y_j in G.

     This requires O(c²) pairwise tests only. Moreover,

     Theorem. Suppose p supports the Composition property, and consider M_i, an arbitrary Markov blanket of Y_i in X. Then {Y_i} ⊥̸⊥ {Y_j} | X holds iff {Y_i} ⊥̸⊥ {Y_j} | M_i.

  24. ILF-Compo algorithm
     Generic procedure (sketched below):
     ◮ For each label Y_i, compute M_i, a Markov boundary of Y_i in X.
     ◮ For each pair of labels (Y_i, Y_j), check {Y_i} ⊥̸⊥ {Y_j} | M_i to build G.
     ◮ Extract the partition ILF = {Y_LF1, ..., Y_LFm} from G.
     ◮ Decompose the multi-label problem into a series of independent multi-class problems.
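
A minimal end-to-end sketch of this procedure, under stated assumptions: markov_boundary and ci_test stand in for the paper's statistical machinery, and the connected components of G are taken as the irreducible label factors.

    import itertools

    def ilf_compo(labels, markov_boundary, ci_test):
        """Generic ILF-Compo procedure (illustrative sketch).

        markov_boundary(i) -> M_i, a Markov boundary of Y_i within the features X.
        ci_test(i, j, M)   -> True iff Y_i ⊥⊥ Y_j | M (assumed black box).
        Returns the partition of the labels into irreducible label factors.
        """
        M = {i: markov_boundary(i) for i in labels}

        # Build G with a union-find: Y_i ~ Y_j iff dependent given M_i.
        parent = {i: i for i in labels}
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i
        for i, j in itertools.combinations(labels, 2):
            if not ci_test(i, j, M[i]):
                parent[find(i)] = find(j)

        # Connected components of G are the irreducible label factors.
        factors = {}
        for i in labels:
            factors.setdefault(find(i), set()).add(i)
        return list(factors.values())

Each recovered factor Y_LF can then be treated as one multi-class problem over its joint label states, LP-style, and the per-factor predictions combined as in the factorized MAP sketch above.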
