
XLII Conference on Mathematical Statistics
CCnet: joint multi-label classification and feature selection using classifier chains and elastic net regularization
Paweł Teisseyre, Institute of Computer Science, Polish Academy of Sciences


  1. XLII Conference on Mathematical Statistics. CCnet: joint multi-label classification and feature selection using classifier chains and elastic net regularization. Paweł Teisseyre, Institute of Computer Science, Polish Academy of Sciences.

  2. Outline
     ◮ Multi-label classification.
     ◮ Novel method: CCnet.
     ◮ Theoretical results.
     ◮ Results of experiments.

  3. Single-label (binary) classification:

     | $x_1$ | $x_2$ | ... | $x_p$ | $y$ |
     |-------|-------|-----|-------|-----|
     | 1.0 | 2.2 | ... | 4.2 | 1 |
     | 2.4 | 1.3 | ... | 3.1 | 1 |
     | 0.9 | 1.4 | ... | 3.2 | 0 |
     | ... | ... | ... | ... | ... |
     | 1.7 | 3.5 | ... | 4.2 | 0 |
     | 3.9 | 2.5 | ... | 4.1 | ? |

     Table: Single-label classification.

     ◮ $y \in \{0, 1\}$ - target variable (label).
     ◮ $x = (x_1, \ldots, x_p)^T$ - vector of explanatory variables (features).

     TASK: build a model which predicts $y$ using $x$.

  4. Multi-label classification:

     | $x_1$ | $x_2$ | ... | $x_p$ | $y_1$ | $y_2$ | ... | $y_K$ |
     |-------|-------|-----|-------|-------|-------|-----|-------|
     | 1.0 | 2.2 | ... | 4.2 | 1 | 0 | ... | 1 |
     | 2.4 | 1.3 | ... | 3.1 | 1 | 0 | ... | 1 |
     | 0.9 | 1.4 | ... | 3.2 | 0 | 0 | ... | 1 |
     | ... | ... | ... | ... | ... | ... | ... | ... |
     | 1.7 | 3.5 | ... | 4.2 | 0 | 1 | ... | 0 |
     | 3.9 | 2.5 | ... | 4.1 | ? | ? | ... | ? |

     Table: Multi-label classification.

     ◮ $y = (y_1, \ldots, y_K)^T$ - vector of target variables (labels).
     ◮ $x = (x_1, \ldots, x_p)^T$ - vector of explanatory variables (features).

     TASK: build a model which predicts $y$ using $x$.

  5. Motivating example: modelling multimorbidity [1]

     | BMI | Weight | Glucose | ... | Diabetes | Hypotension | Liver disease | ... |
     |-----|--------|---------|-----|----------|-------------|---------------|-----|
     | 31 | 84 | 10 | ... | 1 | 0 | 1 | ... |
     | 26 | 63 | 6 | ... | 1 | 0 | 0 | ... |
     | 27 | 60 | 7 | ... | 0 | 0 | 0 | ... |

     Features x: characteristics of patients. Labels y: occurrences of diseases.

     ◮ Task 1: predict which diseases are likely to occur based on the patients' characteristics (PREDICTION).
     ◮ Task 2: select the features that influence the occurrence of diseases (FEATURE SELECTION).

     [1] Multimorbidity: co-occurrence of two or more diseases in one patient.

  6. Multi-label classification. Standard approach:

     1. Estimate the posterior probability $p(y \mid x)$.
     2. Make a prediction for a new observation $x_0$:

        $$\hat{y}(x_0) = \arg\max_{y \in \{0,1\}^K} \hat{p}(y \mid x_0),$$

        where $\hat{p}(y \mid x_0)$ is the estimated probability.

     Both steps are more difficult than for single-label classification.
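
The prediction step on this slide can be illustrated with a minimal sketch that searches all $2^K$ label vectors exhaustively. The function name `predict_map` and the toy probabilities are assumptions made for illustration, not part of the presentation.

```python
from itertools import product

def predict_map(p_hat, K):
    """Return the label vector in {0,1}^K with the highest estimated probability.

    p_hat is a hypothetical callable mapping a tuple of K labels to the
    estimated probability of that label vector for a fixed new observation x0.
    """
    # Exhaustive search visits all 2**K candidates, so its cost grows
    # exponentially with the number of labels.
    return max(product((0, 1), repeat=K), key=p_hat)

# Made-up estimated distribution over K = 2 labels.
toy = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
y_hat = predict_map(toy.get, 2)
```

In the single-label case the same search compares only two candidates; with $K$ labels the candidate set has $2^K$ elements.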

  7. Posterior probability estimation. Possible approaches:

     1. Direct modelling of the posterior probability: assume some parametric form of $p(y \mid x)$, e.g. the Ising model.
     2. Binary Relevance: assume conditional independence of the labels:

        $$p(y \mid x) = p(y_1, \ldots, y_K \mid x) = \prod_{k=1}^{K} p(y_k \mid x)$$

        and estimate the marginal probabilities.
     3. Classifier chains: use the chain rule

        $$p(y \mid x) = p(y_1, \ldots, y_K \mid x) = p(y_1 \mid x) \prod_{k=2}^{K} p(y_k \mid x, y_1, \ldots, y_{k-1})$$

        and estimate the conditional probabilities.
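
To make the difference between approaches 2 and 3 concrete, the sketch below evaluates both factorizations on a made-up joint distribution of $K = 2$ dependent labels for one fixed $x$. All names and numbers are illustrative assumptions. The chain rule reproduces the joint probability exactly, while the Binary Relevance product of marginals does not when the labels are dependent.

```python
# Made-up joint distribution p(y1, y2 | x) for one fixed x; the labels are dependent.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(k, value):
    """p(y_k = value | x), obtained by summing the joint over the other label."""
    return sum(p for y, p in joint.items() if y[k] == value)

def prefix_prob(prefix):
    """p(y_1, ..., y_m | x) for a prefix of length m (equals 1 for the empty prefix)."""
    m = len(prefix)
    return sum(p for y, p in joint.items() if y[:m] == prefix)

def binary_relevance(y):
    """Product of marginals: exact only under conditional independence."""
    prob = 1.0
    for k, value in enumerate(y):
        prob *= marginal(k, value)
    return prob

def chain_rule(y):
    """Product of conditionals p(y_k | x, y_1, ..., y_{k-1}): always exact."""
    prob = 1.0
    for k in range(len(y)):
        prob *= prefix_prob(y[:k + 1]) / prefix_prob(y[:k])
    return prob

br = binary_relevance((1, 1))  # product of the two marginals
cc = chain_rule((1, 1))        # matches joint[(1, 1)]
```

In practice neither the marginals nor the conditionals are known; both have to be estimated from data, which is what the two methods do.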

  8. CCnet

     1. Use the chain rule for the posterior probability:

        $$p(y \mid x, \theta) = p(y_1 \mid x, \theta_1) \prod_{k=2}^{K} p(y_k \mid y_{-k}, x, \theta_k),$$

        where $y_{-k} = (y_1, \ldots, y_{k-1})^T$ and $\theta = (\theta_1, \ldots, \theta_K)^T$.
     2. Assume that the conditional probabilities are of the logistic form.
     3. Use the penalized maximum likelihood method (with elastic-net penalty) to estimate the parameters $\theta_k$:

        $$\hat{\theta}_k = \arg\min_{\theta_k} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log p\big(y_k^{(i)} \mid y_{-k}^{(i)}, x^{(i)}, \theta_k\big) + \lambda_1 \|\theta_k\|_1 + \lambda_2 \|\theta_k\|_2^2 \right\},$$

        for $k = 1, \ldots, K$, based on training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$. For $k = 1$ there are no preceding labels, so the model conditions on $x$ only.
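
The three steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the optimizer (proximal gradient descent with soft-thresholding) is a choice made here and is not named in the slides, the models have no intercept, the data are made up, and all function names are hypothetical.

```python
import math
from itertools import product

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def soft_threshold(v, t):
    # Proximal operator of t * |v|: shrinks v towards zero by t.
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_logistic_enet(Z, y, lam1, lam2, step=0.5, n_iter=300):
    """Minimise -(1/n) sum_i log p(y_i | z_i, theta) + lam1*||theta||_1 + lam2*||theta||_2^2."""
    n, d = len(Z), len(Z[0])
    theta = [0.0] * d
    for _ in range(n_iter):
        # Gradient of the smooth part (mean log-loss plus squared L2 term).
        grad = [2.0 * lam2 * t for t in theta]
        for z, target in zip(Z, y):
            resid = sigmoid(dot(theta, z)) - target
            for j in range(d):
                grad[j] += resid * z[j] / n
        # Gradient step, then soft-thresholding accounts for the L1 term.
        theta = [soft_threshold(t - step * g, step * lam1)
                 for t, g in zip(theta, grad)]
    return theta

def fit_chain(X, Y, lam1, lam2):
    """Fit one penalised logistic model per label; model k sees x and y_1..y_{k-1}."""
    thetas = []
    for k in range(len(Y[0])):
        Z = [list(x) + list(y[:k]) for x, y in zip(X, Y)]
        targets = [y[k] for y in Y]
        thetas.append(fit_logistic_enet(Z, targets, lam1, lam2))
    return thetas

def joint_prob(thetas, x, y):
    """Chain-rule probability p(y | x, theta) under the fitted logistic models."""
    prob = 1.0
    for k, theta in enumerate(thetas):
        p1 = sigmoid(dot(theta, list(x) + list(y[:k])))
        prob *= p1 if y[k] == 1 else 1.0 - p1
    return prob

# Made-up toy data: n = 6 observations, p = 2 features, K = 2 labels.
X = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0], [0.8, 0.3], [0.0, 0.9]]
Y = [[1, 1], [1, 1], [0, 0], [0, 0], [1, 0], [0, 1]]

thetas = fit_chain(X, Y, lam1=0.01, lam2=0.01)
total = sum(joint_prob(thetas, X[0], y) for y in product((0, 1), repeat=2))
sparse = fit_chain(X, Y, lam1=10.0, lam2=0.01)  # heavy L1 penalty
```

The model for label $k$ has $p + k - 1$ coefficients, one per feature and one per preceding label. A coefficient shrunk exactly to zero by the $\ell_1$ term marks a feature (or preceding label) that is dropped from that model, which is how the penalty performs feature selection; with the heavy penalty in `sparse` every coefficient is zero.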

  9. Theoretical results
     ◮ Stability of CCnet with respect to the subset loss: a small perturbation of the training data changes the value of the loss function only slightly.
     ◮ Generalization error bound for CCnet: we use an idea described in Bousquet & Elisseeff (JMLR 2002) to show that the difference between the expected error and the empirical error can be bounded by a term related to the stability.

  10. Loss functions. Let:

      $$g(x, y, \theta) = p(y \mid x, \theta) - \max_{y' \neq y} p(y' \mid x, \theta).$$

      ◮ Subset loss (equal to 0 if all labels are predicted correctly and 1 otherwise):

      $$l(x, y, \theta) = \begin{cases} 1 & \text{if } g(x, y, \theta) < 0, \\ 0 & \text{if } g(x, y, \theta) \geq 0. \end{cases} \tag{1}$$

      ◮ Modification of the subset loss:

      $$l_\gamma(x, y, \theta) = \begin{cases} 1 & \text{if } g(x, y, \theta) < 0, \\ 1 - g(x, y, \theta)/\gamma & \text{if } 0 \leq g(x, y, \theta) < \gamma, \\ 0 & \text{if } g(x, y, \theta) \geq \gamma. \end{cases}$$
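
The two losses can be checked numerically with a short sketch. The probability table below is made up, and the function names are illustrative assumptions; `p` stands for $p(\cdot \mid x, \theta)$ at one fixed $x$.

```python
from itertools import product

def margin(p, y, K):
    """g = p(y) - max over y' != y of p(y'), for a probability function p on {0,1}^K."""
    best_other = max(p(c) for c in product((0, 1), repeat=K) if c != y)
    return p(y) - best_other

def subset_loss(g):
    """1 if the true label vector is not the most probable one, 0 otherwise."""
    return 1.0 if g < 0 else 0.0

def gamma_loss(g, gamma):
    """Modified subset loss: decreases linearly from 1 to 0 as g goes from 0 to gamma."""
    if g < 0:
        return 1.0
    if g < gamma:
        return 1.0 - g / gamma
    return 0.0

# Made-up probabilities over K = 2 labels for one fixed x.
table = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

g_correct = margin(table.get, (1, 1), 2)  # true labels are the most probable vector
g_wrong = margin(table.get, (1, 0), 2)    # true labels are not the most probable vector
```

With these numbers `g_correct` is about 0.1, so the subset loss is 0 while `gamma_loss(g_correct, 0.2)` is about 0.5: the modified loss still penalizes a correct prediction whose margin is smaller than $\gamma$.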

