SLIDE 1

XLII Conference on Mathematical Statistics CCnet: joint multi-label classification and feature selection using classifier chains and elastic net regularization

Paweł Teisseyre

Institute of Computer Science, Polish Academy of Sciences

SLIDE 2

Outline

◮ Multi-label classification.
◮ Novel method: CCnet.
◮ Theoretical results.
◮ Results of experiments.

SLIDE 3

Single-label (binary) classification:

x1    x2    ...   xp    y
1.0   2.2   ...   4.2   1
2.4   1.3   ...   3.1   1
0.9   1.4   ...   3.2
...   ...   ...   ...
1.7   3.5   ...   4.2
3.9   2.5   ...   4.1   ?

Table: Single-label classification.

◮ y ∈ {0, 1} - target variable (label).
◮ x = (x1, . . . , xp)^T - vector of explanatory variables (features).

TASK: build a model which predicts y using x.

SLIDE 4

Multi-label classification:

x1    x2    ...   xp    y1   y2   ...   yK
1.0   2.2   ...   4.2   1         ...   1
2.4   1.3   ...   3.1   1         ...   1
0.9   1.4   ...   3.2             ...   1
...   ...   ...   ...
1.7   3.5   ...   4.2   1         ...
3.9   2.5   ...   4.1   ?    ?    ...   ?

Table: Multi-label classification.

◮ y = (y1, . . . , yK)^T - vector of target variables (labels).
◮ x = (x1, . . . , xp)^T - vector of explanatory variables (features).

TASK: build a model which predicts y using x.

SLIDE 5

Motivating example: modelling multimorbidity¹

BMI   Weight   Glucose   ...   Diabetes   Hypotension   Liver disease   ...
31    84       10        ...   1          1                             ...
26    63       6         ...   1                                        ...
27    60       7         ...                                            ...

Features x: characteristics of patients. Labels y: occurrences of diseases.

◮ Task 1: predict which diseases are likely to occur based on patients' characteristics (PREDICTION).

◮ Task 2: select features that influence the occurrence of diseases (FEATURE SELECTION).

¹ Co-occurrence of two or more diseases in one patient.

SLIDE 6

Multi-label classification

Standard approach:

1. Estimate the posterior probability p(y|x).

2. Make a prediction for a new observation x0:

   ŷ(x0) = argmax_{y ∈ {0,1}^K} p̂(y|x0),

   where p̂(y|x0) is the estimated probability.

Both steps are more difficult than in single-label classification.
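Step 2 is a discrete optimization over all 2^K candidate label vectors, so for small K it can be done by brute force. A minimal sketch, assuming a fitted estimator p_hat(y, x) as a hypothetical stand-in for any joint model:

```python
from itertools import product

import numpy as np

def predict_subset(p_hat, x0, K):
    """Exact arg-max of the estimated posterior over all 2**K label vectors.

    p_hat(y, x) is assumed to return the estimated probability p^(y|x);
    it is a placeholder for any fitted joint model.
    """
    candidates = [np.array(y) for y in product([0, 1], repeat=K)]
    return max(candidates, key=lambda y: p_hat(y, x0))
```

For large K this enumeration is infeasible, which is one reason the factorizations on the next slide matter.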

SLIDE 7

Posterior probability estimation

Possible approaches:

1. Direct modelling of the posterior probability: assume some parametric form of p(y|x), e.g. the Ising model.

2. Binary Relevance: assume conditional independence of the labels,

   p(y|x) = p(y1, . . . , yK|x) = ∏_{k=1}^{K} p(yk|x),

   and estimate the marginal probabilities.

3. Classifier chains: use the chain rule,

   p(y|x) = p(y1, . . . , yK|x) = p(y1|x) ∏_{k=2}^{K} p(yk|x, y1, . . . , yk−1),

   and estimate the conditional probabilities.
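To make the chain-rule decomposition concrete, here is a sketch of evaluating p̂(y|x) from K per-label conditional models. It assumes each models[k] is a binary classifier exposing a scikit-learn-style predict_proba with classes ordered as [0, 1], trained on x augmented with y1, . . . , yk−1:

```python
import numpy as np

def chain_probability(models, x, y):
    """p^(y|x) = prod_k p^(y_k | x, y_1..y_{k-1}) for a classifier chain."""
    prob = 1.0
    for k, model in enumerate(models):
        z = np.concatenate([x, y[:k]]).reshape(1, -1)  # features + earlier labels
        p1 = model.predict_proba(z)[0, 1]              # P(y_k = 1 | z), class order [0, 1] assumed
        prob *= p1 if y[k] == 1 else 1.0 - p1
    return prob
```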

SLIDE 8

CCnet

1. Use the chain rule for the posterior probability:

   p(y|x, θ) = p(y1|x, θ1) ∏_{k=2}^{K} p(yk|y−k, x, θk),

   where y−k = (y1, . . . , yk−1)^T and θ = (θ1, . . . , θK)^T.

2. Assume that the conditional probabilities are of the logistic form.

3. Use the penalized maximum likelihood method (with the elastic-net penalty) to estimate the parameters θk:

   θ̂k = argmin_{θk} { −(1/n) ∑_{i=1}^{n} log p(yk^{(i)}|y−k^{(i)}, x^{(i)}, θk) + λ1‖θk‖1 + λ2‖θk‖2² },

   for k = 1, . . . , K, based on the training data (x^{(i)}, y^{(i)}), i = 1, . . . , n.
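A minimal fitting sketch of this scheme using scikit-learn's elastic-net logistic regression (the saga solver supports the mixed penalty). The translation from (λ1, λ2) to scikit-learn's C/l1_ratio parametrization below is an assumption for illustration, not the paper's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ccnet(X, Y, lam1=0.01, lam2=0.01):
    """Fit CCnet: one elastic-net logistic model per label, each
    conditioned on the features plus all earlier labels in the chain."""
    n, K = X.shape[0], Y.shape[1]
    # sklearn puts a 1/2 factor on the l2 term, hence 2*lam2 (assumed mapping).
    total = lam1 + 2 * lam2
    C, l1_ratio = 1.0 / (n * total), lam1 / total
    models = []
    for k in range(K):
        Z = np.hstack([X, Y[:, :k]])          # z_k = (x, y_1, ..., y_{k-1})
        clf = LogisticRegression(penalty="elasticnet", solver="saga",
                                 C=C, l1_ratio=l1_ratio, max_iter=10000)
        clf.fit(Z, Y[:, k])                   # each label must take both values
        models.append(clf)
    return models
```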

SLIDE 9

Theoretical results

◮ Stability of CCnet with respect to the subset loss: a small perturbation of the training data does not affect the value of the subset loss function.

◮ Generalization error bound for CCnet: we use an idea described in Bousquet & Elisseeff (JMLR 2002) to show that the difference between the expected error and the empirical error can be bounded by a term related to the stability.

SLIDE 10

Loss functions

Let g(x, y, θ) = p(y|x, θ) − max_{y′ ≠ y} p(y′|x, θ).

◮ Subset loss (equal to 0 if all labels are predicted correctly and 1 otherwise):

  l(x, y, θ) = 1 if g(x, y, θ) < 0,
               0 if g(x, y, θ) ≥ 0.        (1)

◮ Modified subset loss:

  lγ(x, y, θ) = 1                   if g(x, y, θ) < 0,
                1 − g(x, y, θ)/γ    if 0 ≤ g(x, y, θ) < γ,
                0                   if g(x, y, θ) ≥ γ.
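Both losses are functions of the margin g alone, so they transcribe directly; a small numpy sketch:

```python
import numpy as np

def subset_loss(g):
    """0/1 subset loss (1): 1 iff the top-scoring label vector is not the true one."""
    return np.where(g < 0, 1.0, 0.0)

def modified_subset_loss(g, gamma):
    """Lipschitz surrogate: linear ramp from 1 down to 0 on [0, gamma]."""
    return np.where(g < 0, 1.0, np.where(g < gamma, 1.0 - g / gamma, 0.0))
```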

SLIDE 12

Figure: Subset loss l(x, y, θ) (a) and modified subset loss lγ(x, y, θ) (b), plotted as functions of g(x, y, θ).

SLIDE 13

Stability of CCnet

◮ Original training data: D = (x^{(i)}, y^{(i)}), i = 1, . . . , n.

◮ Modified training data D^l: the l-th observation of D is replaced by an independent copy.

◮ Solutions of CCnet: θ̂ and θ̂^l, calculated based on D and D^l, respectively.

Theorem (stability of CCnet)

Assume that ‖x‖2 ≤ L and let λ2 > 0. For l = 1, . . . , n we have:

  |lγ(x, y, θ̂) − lγ(x, y, θ̂^l)| ≤ 4K(L² + K) / (λ2 n γ).
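The stability constant is fully explicit, so it can be evaluated directly; a small sketch with arbitrary illustrative values for K, L, λ2, n, γ (not taken from the paper):

```python
def stability_constant(K, L, lam2, n, gamma):
    """Uniform stability bound from the theorem above:
    |l_gamma(theta_hat) - l_gamma(theta_hat^l)| <= 4K(L^2 + K) / (lam2 * n * gamma)."""
    return 4 * K * (L**2 + K) / (lam2 * n * gamma)

# Illustrative values: the bound shrinks at rate O(1/n).
print(stability_constant(K=4, L=1.0, lam2=0.5, n=10000, gamma=0.1))  # 0.16
```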

SLIDE 15

Generalization error bound for CCnet

◮ Expected error:

  err(θ̂) = E_{x,y} lγ(x, y, θ̂).

◮ Empirical error:

  Err(θ̂) = (1/n) ∑_{i=1}^{n} lγ(x^{(i)}, y^{(i)}, θ̂).

Theorem (generalization error bound)

Assume that ‖x‖2 ≤ L and λ2 > 0. The following inequality holds with probability at least 1 − δ:

  err(θ̂) − Err(θ̂) ≤ 8K(L² + K)/(λ2 n γ) + (16K(L² + K)/(λ2 γ) + 1) · √(log(1/δ)/(2n)).
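Likewise the right-hand side of the generalization bound is explicit; the helper below evaluates it for illustrative values of the constants (again not taken from the paper):

```python
import math

def generalization_gap_bound(K, L, lam2, n, gamma, delta):
    """err - Err <= 8K(L^2+K)/(lam2*n*gamma)
       + (16K(L^2+K)/(lam2*gamma) + 1) * sqrt(log(1/delta) / (2n))."""
    c = K * (L**2 + K)
    return (8 * c / (lam2 * n * gamma)
            + (16 * c / (lam2 * gamma) + 1) * math.sqrt(math.log(1 / delta) / (2 * n)))

# Illustrative values; the bound is loose for small n and tightens as n grows.
print(generalization_gap_bound(K=4, L=1.0, lam2=0.5, n=10000, gamma=0.1, delta=0.05))
```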

SLIDE 17

Experiments

Two main goals of the experiments:

1. Experiment 1: compare the prediction performance of CCnet and other state-of-the-art methods.

2. Experiment 2: analyse the stability of CCnet with respect to the order of fitting the models in the chain.

SLIDE 18

Experiment 1

Methods:

1. BRlogit
2. BRtree
3. BRnet (lasso penalty)
4. CClogit
5. CCtree
6. CCnet (lasso penalty)
7. MLKNN
8. LPtree

SLIDE 19

Experiment 1

Dataset    CClogit  CCtree  CCnet  BRlogit  BRtree  BRnet  LPtree  MLKNN
music      0.215    0.221   0.275  0.186    0.191   0.257  0.275   0.265
yeast      0.214    0.168   0.184  0.156    0.048   0.123  0.182   0.231
scene      0.473    0.467   0.629  0.385    0.337   0.416  0.565   0.622
birds      0.349    0.375   0.532  0.332    0.375   0.535  0.543   0.450
flags      0.227    0.139   0.216  0.124    0.072   0.165  0.263   0.087
medical    0.181    0.690   0.760  0.180    0.634   0.752  0.763   0.579
cal500     0.008    0.006   0.020  0.012    0.004   0.008  0.002   0.002
genbase    0.989    0.985   0.986  0.989    0.985   0.989  0.994   0.500
mediamill  0.223    0.096   0.197  0.192    0.057   0.155  0.124   0.212
enron      0.037    0.170   0.139  0.038    0.130   0.095  0.123   0.152
nuswide    0.363    0.045   0.338  0.350    0.028   0.330  0.305   0.208
bookmarks  0.108    0.287   0.754  0.067    0.292   0.754  0.753   0.689
bibtex     0.361    0.404   0.780  0.359    0.414   0.780  0.760   0.617
avg rank   5.962    4.615   9.154  4.038    2.923   6.808  7.808   6.115

Table: Subset measure. The larger the rank, the better. The number in bold corresponds to the winning method.

SLIDE 20

Experiment 1

Dataset    CClogit  CCtree  CCnet  BRlogit  BRtree  BRnet  LPtree  MLKNN
music      71       69      36     71       71      34     47      71
yeast      103      95      70     103      103     54     13      103
scene      294      177     161    294      190     160    51      294
birds      260      34      44     260      33      45     9       260
flags      19       19      16     19       19      6      15      19
medical    1449     80      33     1449     75      31     57      1449
cal500     68       68      5      68       68      2      1       68
genbase    1186     28      13     1186     29      14     44      1186
mediamill  120      81      75     120      94      72     24      120
enron      1001     303     86     1001     374     78     25      1001
nuswide    128      1       94     128      1       104    1       128
bookmarks  2150     450     56     2150     453     57     13      2150
bibtex     1836     171     108    1836     169     109    36      1836
avg rank   10       6       3      10       6       3      2       10

Table: Number of selected variables. The parameter λ was chosen using the BIC criterion. The smaller the rank, the better. The number in bold corresponds to the winning method.

SLIDE 21

Experiment 2

Selected features:

  S = ⋃_{k=1}^{K} {1 ≤ r ≤ p : θ̂_{k,r} ≠ 0}

(a computation sketch follows after the remarks).

Remarks:

◮ The order of fitting the models in the chain may influence the results.

◮ We would like to verify how the chosen subset of relevant features depends on the order of fitting the models in the chain.

◮ We repeat the experiments with CCnet for 100 random permutations of the labels and report which features are selected.
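Assuming models fitted per label as in the earlier CCnet sketch (scikit-learn estimators exposing coef_), S is a union of supports. The restriction to the first p coordinates, keeping feature coefficients and dropping label coefficients, is this sketch's assumption about the coefficient layout:

```python
import numpy as np

def selected_features(models, p, tol=1e-10):
    """S = union over k of {r : theta_hat_{k,r} != 0}, restricted to the
    p feature coordinates (coordinates for earlier labels are excluded)."""
    S = set()
    for clf in models:
        coefs = clf.coef_.ravel()[:p]            # feature part of theta_hat_k
        S |= set(np.flatnonzero(np.abs(coefs) > tol))
    return S
```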

SLIDE 22

Experiment 2

Artificial data:

1. First we draw x from a 50-dimensional standard Gaussian distribution.

2. Then we generate the labels y1, . . . , y4 sequentially using the formula

   p(yk|zk, θk) = exp(θk^T zk yk) / (1 + exp(θk^T zk)),        (2)

   with zk = (y−k, x)^T and y−k = (y1, . . . , yk−1)^T (for k = 1 we set zk = x); see the generator sketch after this list.

3. The coordinates of θk corresponding to y−k are always equal to 0.1.

4. Moreover, the coordinates of θk corresponding to the first 3 features x1, x2, x3 are equal to b (b is a parameter whose value varies in the experiment).

5. The coordinates corresponding to the remaining features are zero.

6. Only 3 features, x1, x2, x3, are relevant.
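A sketch of this generator under the stated settings (p = 50, K = 4); it assumes no intercept and orders zk as (x, y−k), which leaves the model in (2) unchanged:

```python
import numpy as np

def generate_data(n, b, p=50, K=4, seed=0):
    """Artificial data of this slide: x ~ N(0, I_p); labels generated
    sequentially from the logistic model (2). Only x1, x2, x3 are relevant."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                       # step 1
    theta_x = np.zeros(p)
    theta_x[:3] = b                                       # steps 4-5
    Y = np.zeros((n, K))
    for k in range(K):
        Z = np.hstack([X, Y[:, :k]])                      # z_k = (x, y_1..y_{k-1})
        theta_k = np.concatenate([theta_x, np.full(k, 0.1)])  # step 3
        prob = 1.0 / (1.0 + np.exp(-(Z @ theta_k)))       # eq. (2) with y_k = 1
        Y[:, k] = (rng.random(n) < prob).astype(float)    # step 2
    return X, Y
```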
SLIDE 23

Experiment 2

◮ T: set of true relevant features (in our case T = {1, 2, 3}).

◮ S: set of features chosen by CCnet.

Performance measures:

◮ Positive Selection Rate:

  PSR(T, S) = |S ∩ T| / |T|,

◮ False Discovery Rate:

  FDR(T, S) = |S \ T| / |S|.
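Both measures transcribe directly to code; the sets below are hypothetical, with T = {1, 2, 3} as in the experiment:

```python
def psr(T, S):
    """Positive Selection Rate: share of truly relevant features recovered."""
    return len(S & T) / len(T)

def fdr(T, S):
    """False Discovery Rate: share of selected features that are irrelevant."""
    return len(S - T) / len(S) if S else 0.0

T = {1, 2, 3}                  # true relevant features
S = {1, 2, 7}                  # a hypothetical CCnet selection
print(psr(T, S), fdr(T, S))    # 0.666..., 0.333...
```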

SLIDE 24

Figure: Positive Selection Rate (PSR) (a) and False Discovery Rate (FDR) (b) as functions of the parameter b, for n = 100. The lines correspond to different orders of fitting the models in the chain.

SLIDE 25

Figure: Positive Selection Rate (PSR) (a) and False Discovery Rate (FDR) (b) as functions of the sample size n, for b = 0.5. The lines correspond to different orders of learning the models in the chain.

SLIDE 26

Conclusions

1. CCnet is a combination of classifier chains and logistic regression with elastic-net regularization.

2. We prove the stability of the elastic net and bound the generalization error.

3. The empirical experiments show that CCnet outperforms other state-of-the-art methods with respect to subset accuracy.

4. CCnet is stable with respect to the order of fitting the models in the chain.

SLIDE 27

References:

1. J. Read, B. Pfahringer, G. Holmes, E. Frank. Classifier chains for multi-label classification. Machine Learning.

2. H. Zou, T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B.

3. P. Teisseyre. CCnet: joint multi-label classification and feature selection using classifier chains and elastic net regularization. Submitted.