Factorization of the Label Conditional Distribution for Multi-Label Classification

ECML PKDD 2015 International Workshop on Big Multi-Target Prediction

Maxime Gasse, Alex Aussem, Haytham Elghazel
LIRIS Laboratory, UMR 5205 CNRS, University of Lyon
Outline
◮ Multi-label classification
    ◮ Unified probabilistic framework
    ◮ Hamming loss vs Subset 0/1 loss
◮ Factorization of the joint conditional distribution of the labels
    ◮ Irreducible label factors
    ◮ The ILF-Compo algorithm
◮ Experimental results
    ◮ Toy problem
    ◮ Benchmark data sets
This work was recently presented at ICML (Gasse, Aussem, and Elghazel 2015).
Unified probabilistic framework

Find a mapping h from a space of features X to a space of labels Y:

    x ∈ R^d,  y ∈ {0, 1}^c,  h : X → Y.

The risk-minimizing model h⋆ with respect to a loss function L is defined over p(X, Y) as

    h⋆ = arg min_h E_{X,Y}[L(Y, h(X))].

The point-wise best prediction requires only p(Y | X):

    h⋆(x) = arg min_y E_{Y|x}[L(Y, y)].

The current trend is to exploit label dependence to improve MLC... under which loss function?
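To make the point-wise rule concrete, here is a minimal Python sketch (not from the talk) that recovers arg min_y E_{Y|x}[L(Y, y)] by brute-force enumeration of the 2^c label vectors, assuming p(Y | x) is available as an explicit table; the function and variable names are illustrative.

from itertools import product

def bayes_optimal(p_y_given_x, loss, c):
    """Return arg min over y in {0,1}^c of E_{Y|x}[loss(Y, y)].

    p_y_given_x: dict mapping each label tuple y' to p(y' | x).
    loss: callable loss(y_true, y_pred) -> float.
    c: number of labels.
    """
    def expected_loss(y_pred):
        return sum(p * loss(y_true, y_pred) for y_true, p in p_y_given_x.items())
    return min(product((0, 1), repeat=c), key=expected_loss)

# The two losses compared on the next slide:
subset_01 = lambda y, h: float(y != h)
hamming = lambda y, h: sum(yi != hi for yi, hi in zip(y, h)) / len(y)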
Hamming loss vs Subset 0/1 loss

Hamming loss:

    LH(y, h(x)) = (1/c) Σ_{i=1..c} 1(yi ≠ hi(x))

BR (Binary Relevance) is optimal for the Hamming loss, with c parameters:

    h⋆H,i(x) = arg max_{yi} p(yi | x),  for i = 1, ..., c.

Subset 0/1 loss:

    LS(y, h(x)) = 1(y ≠ h(x))

LP (Label Powerset) is optimal for the Subset 0/1 loss, with 2^c parameters:

    h⋆S(x) = arg max_y p(y | x).

p(Y | x) is much harder to estimate than p(Yi | x)... can we use the label dependencies to better model p(Y | x)?
Hamming loss vs Subset 0/1 loss

A quick example: who is in the picture?

    Jean   René   p(J, R | x)
      0      0       0.02
      0      1       0.10
      1      0       0.13
      1      1       0.75

HLoss optimal: J = 1, R = 1 (88%, 85%)     SLoss optimal: J = 1, R = 1 (75%)

    Jean   René   p(J, R | x)
      0      0       0.02
      0      1       0.46
      1      0       0.44
      1      1       0.08

HLoss optimal: J = 1, R = 1 (52%, 54%)     SLoss optimal: J = 0, R = 1 (46%)
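As a quick check (illustrative Python, not part of the slides), the second table indeed yields different risk minimizers for the two losses: the Hamming-optimal prediction maximizes each marginal separately, while the Subset 0/1-optimal prediction maximizes the joint.

# Joint conditional distribution p(J, R | x) from the second table above.
p = {(0, 0): 0.02, (0, 1): 0.46, (1, 0): 0.44, (1, 1): 0.08}

# Hamming-loss optimal: arg max of each marginal separately.
p_j1 = sum(v for (j, r), v in p.items() if j == 1)   # p(J = 1 | x) = 0.52
p_r1 = sum(v for (j, r), v in p.items() if r == 1)   # p(R = 1 | x) = 0.54
h_hamming = (int(p_j1 > 0.5), int(p_r1 > 0.5))       # (1, 1)

# Subset 0/1-loss optimal: arg max of the joint.
h_subset = max(p, key=p.get)                          # (0, 1)

print(h_hamming, h_subset)  # (1, 1) vs (0, 1): the two risk minimizers disagree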
Factorization of the joint conditional distribution

Depending on the dependency structure between the labels and the features, the problem of modeling the joint conditional distribution may actually be decomposed into a product of label factors:

    p(Y | X) = Π_{YLF ∈ PY} p(YLF | X),

    arg max_y p(y | x) = ×_{YLF ∈ PY} arg max_{yLF} p(yLF | x),

with PY a partition of Y (the joint maximizer is the concatenation of the per-factor maximizers).

Definition
We say that YLF ⊆ Y is a label factor iff YLF ⊥⊥ Y \ YLF | X. Additionally, YLF is said to be irreducible iff none of its non-empty proper subsets is a label factor.

We seek a factorization into (unique) irreducible label factors, ILF.
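An illustrative Python check (assumed example, not from the talk) of the decomposition of the arg max: with Y = {Y1, Y2, Y3} and the partition PY = {{Y1, Y2}, {Y3}}, concatenating the per-factor modes gives the mode of the joint.

# Hypothetical conditional distributions of the two label factors given X = x.
p_factor_12 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.1}
p_factor_3 = {(0,): 0.4, (1,): 0.6}

# Joint mode obtained by concatenating the per-factor modes ...
mode = max(p_factor_12, key=p_factor_12.get) + max(p_factor_3, key=p_factor_3.get)

# ... matches the brute-force arg max over the product distribution p(Y | x).
joint = {y12 + y3: p12 * p3
         for y12, p12 in p_factor_12.items()
         for y3, p3 in p_factor_3.items()}
assert mode == max(joint, key=joint.get)
print(mode)  # (0, 1, 1)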
Graphical characterization

Theorem
Let G be an undirected graph whose nodes correspond to the random variables in Y and in which two nodes Yi and Yj are adjacent iff ∃Z ⊆ Y \ {Yi, Yj} such that {Yi} ⊥̸⊥ {Yj} | X ∪ Z. Then, two labels Yi and Yj belong to the same irreducible label factor iff a path exists between Yi and Yj in G.

O(c² 2^c) pairwise tests of conditional independence are needed to characterize the irreducible label factors.

Much easier if we assume the Composition property.
The Composition property

The dependency of a whole implies the dependency of some part:

    X ⊥̸⊥ Y ∪ W | Z  ⇒  X ⊥̸⊥ Y | Z ∨ X ⊥̸⊥ W | Z

Weak assumption: several existing methods and algorithms assume the Composition property (e.g. forward feature selection).

Typical counter-example
The exclusive OR relationship, A = B ⊕ C, with B and C independent fair coin flips:

    {A} ⊥̸⊥ {B, C}  ∧  {A} ⊥⊥ {B}  ∧  {A} ⊥⊥ {C}
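A small simulation of the counter-example (illustrative Python, not from the slides): with B and C independent fair coins and A = B ⊕ C, A looks independent of B alone and of C alone, yet is a deterministic function of the pair.

import random

random.seed(0)
data = [(b, c, b ^ c) for b, c in
        ((random.randint(0, 1), random.randint(0, 1)) for _ in range(100_000))]

def p_a1(**given):
    """Empirical p(A = 1 | B and/or C fixed to the given values)."""
    rows = [a for b, c, a in data
            if given.get("B", b) == b and given.get("C", c) == c]
    return sum(rows) / len(rows)

print(p_a1(B=0), p_a1(B=1))              # both ≈ 0.5: A ⊥⊥ B
print(p_a1(C=0), p_a1(C=1))              # both ≈ 0.5: A ⊥⊥ C
print(p_a1(B=1, C=0), p_a1(B=1, C=1))    # 1.0 and 0.0: A depends on {B, C}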
Graphical characterization - assuming Composition

Theorem
Suppose p satisfies the Composition property. Let G be an undirected graph whose nodes correspond to the random variables in Y and in which two nodes Yi and Yj are adjacent iff {Yi} ⊥̸⊥ {Yj} | X. Then, two labels Yi and Yj belong to the same irreducible label factor iff a path exists between Yi and Yj in G.

O(c²) pairwise tests only. Moreover,

Theorem
Suppose p satisfies the Composition property and consider Mi, an arbitrary Markov blanket of Yi in X. Then, {Yi} ⊥⊥ {Yj} | X holds iff {Yi} ⊥⊥ {Yj} | Mi.
ILF-Compo algorithm

Generic procedure (a sketch follows below)
◮ For each label Yi, compute Mi, a Markov boundary of Yi in X.
◮ For each pair of labels (Yi, Yj), test {Yi} ⊥⊥ {Yj} | Mi and add an edge to G when independence is rejected.
◮ Extract the partition ILF = {YLF1, . . . , YLFm} from the connected components of G.
◮ Decompose the multi-label problem into a series of independent multi-class problems.

Experimental setup
◮ IAMB, a constraint-based Markov boundary learning algorithm (Tsamardinos, Aliferis, and Statnikov 2003);
◮ Mutual information-based test of independence (α = 10^-3) (Tsamardinos and Borboudakis 2010);
◮ Random Forest classifier.
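A minimal Python sketch of the generic procedure (our reading, not the authors' implementation): the Markov boundary learner (IAMB in the experiments) and the conditional independence test are passed in as functions, and the label factors are read off the connected components of the pairwise dependence graph G. All names and signatures are illustrative.

import numpy as np

def ilf_compo_partition(Y, markov_boundaries, dependent):
    """Partition the label indices into irreducible label factors.

    Y: (n, c) binary label matrix.
    markov_boundaries: list of c arrays; markov_boundaries[i] holds the columns
        of X that form a Markov boundary Mi of Yi (e.g. learned with IAMB).
    dependent: conditional (in)dependence test, dependent(yi, yj, m) -> bool,
        True when Yi and Yj are judged dependent given m (e.g. an MI-based test).
    """
    c = Y.shape[1]
    adjacency = {i: set() for i in range(c)}
    # Build the pairwise dependence graph G over the labels.
    for i in range(c):
        for j in range(i + 1, c):
            if dependent(Y[:, i], Y[:, j], markov_boundaries[i]):
                adjacency[i].add(j)
                adjacency[j].add(i)
    # Extract the connected components of G by depth-first search.
    seen, factors = set(), []
    for i in range(c):
        if i in seen:
            continue
        stack, component = [i], set()
        while stack:
            u = stack.pop()
            if u not in component:
                component.add(u)
                stack.extend(adjacency[u] - component)
        seen |= component
        factors.append(sorted(component))
    return factors

# Each factor then defines one independent multi-class (Label Powerset) problem,
# e.g. class = tuple(Y[row, factor]), trained with any base learner (Random Forest here).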
Experiment on toy problem

[Figure: generic toy DAG (Bayesian network) over X and Y1, ..., Y5.]

We build 5 distinct irreducible factorizations:

◮ DAG 1: ILF = {{Y1}, {Y2}, {Y3}, {Y4}, {Y5}};
◮ DAG 2: ILF = {{Y1, Y2}, {Y3, Y4}, {Y5}};
◮ DAG 3: ILF = {{Y1, Y2, Y3}, {Y4, Y5}};
◮ DAG 4: ILF = {{Y1, Y2, Y3, Y4}, {Y5}};
◮ DAG 5: ILF = {{Y1, Y2, Y3, Y4, Y5}}.
Experiment on toy problem - results

[Figures, one per DAG 1-5: subset 0/1 loss vs. training set size (50 to 5000, log scale), averaged over 1000 random distributions, comparing BR, LP and ILF-Compo, alongside the decomposition graph over Y1, ..., Y5 obtained by ILF-Compo.]
Experiment on benchmark data sets

Mean Subset 0/1 loss on the original benchmark (5x2 CV):

Dataset     ILF-Compo    LP     BR
emotions       64.5      64.3   70.0
image          52.3      52.6   69.5
scene          36.7      36.2   45.9
yeast          73.9      73.6   84.5
slashdot       57.6      54.7   64.5
genbase         3.4       3.8    3.4
medical        34.5      31.1   37.5
enron          84.0      84.5   89.5
bibtex         86.2      78.0   88.4
corel5k        97.1      97.0   99.8

[Figure: decomposition obtained with ILF-Compo on slashdot, over the labels Apache, Apple, AskSlashdot, BookReviews, BSD, Developers, Entertainment, Games, Hardware, Idle, Interviews, IT, Linux, Main, Meta, Mobile, News, Politics, Science, Search, Technology, YourRightsOnline.]

Not statistically different from LP.
Experiment on benchmark data sets - duplicated

We duplicate each data set and permute the rows on the duplicated variables. By design, the resulting data set contains at least two irreducible label factors.

Mean Subset 0/1 loss on the duplicated benchmark (5x2 CV):

Dataset      ILF-Compo    LP     BR
emotions2       89.3      95.2   94.0
image2          79.0      88.0   94.6
scene2          49.7      64.8   78.9
yeast2          94.2      97.7   98.5
slashdot2       81.8      91.1   89.8
genbase2         6.9      30.9    6.7
medical2        72.2      79.4   79.4
enron2          97.5      99.4   99.2
bibtex2         99.5      99.2   99.4
corel5k2        99.9      99.9   99.9

[Figure: decomposition obtained with ILF-Compo on slashdot2, over the original labels (Apache, Apple, ..., YourRightsOnline) and their duplicates (Apache.1, Apple.1, ..., YourRightsOnline.1).]
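For concreteness, a sketch of how such a duplicated data set can be built (our reading of the protocol; whether the features are duplicated as well is an assumption here, not stated on the slide):

import numpy as np

def duplicate_benchmark(X, Y, seed=0):
    """Concatenate the original data set with a row-permuted copy of itself.

    Because the copy is permuted independently of the original rows, the
    duplicated labels are independent of the original ones given the features,
    so the resulting label set contains at least two irreducible label factors
    by construction. (Assumed protocol for illustration; duplicating X is our guess.)
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    X2 = np.hstack([X, X[perm]])   # original features + permuted duplicate
    Y2 = np.hstack([Y, Y[perm]])   # original labels + permuted duplicate
    return X2, Y2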
Conclusion

◮ The MLC problem under Subset 0/1 loss was formulated within a unified probabilistic framework.
◮ An optimal factorization method was proposed for a subclass of distributions satisfying the Composition property.
◮ A straightforward instantiation showed that significant improvements can be obtained over LP when the conditional distribution of the labels exhibits several irreducible factors.