XLII Conference on Mathematical Statistics

CCnet: joint multi-label classification and feature selection using classifier chains and elastic net regularization

Paweł Teisseyre
Institute of Computer Science, Polish Academy of Sciences

Outline

- Multi-label classification.
- Novel method: CCnet.
- Theoretical results.
- Results of experiments.
Single-label (binary) classification:

| x1  | x2  | ... | xp  | y |
|-----|-----|-----|-----|---|
| 1.0 | 2.2 | ... | 4.2 | 1 |
| 2.4 | 1.3 | ... | 3.1 | 1 |
| 0.9 | 1.4 | ... | 3.2 | 0 |
| ... | ... | ... | ... | ... |
| 1.7 | 3.5 | ... | 4.2 | 0 |
| 3.9 | 2.5 | ... | 4.1 | ? |

Table: Single-label classification.

- $y \in \{0, 1\}$: target variable (label).
- $x = (x_1, \ldots, x_p)^T$: vector of explanatory variables (features).

TASK: build a model which predicts y using x.
Multi-label classification:

| x1  | x2  | ... | xp  | y1 | y2 | ... | yK |
|-----|-----|-----|-----|----|----|-----|----|
| 1.0 | 2.2 | ... | 4.2 | 1  | 0  | ... | 1  |
| 2.4 | 1.3 | ... | 3.1 | 1  | 0  | ... | 1  |
| 0.9 | 1.4 | ... | 3.2 | 0  | 0  | ... | 1  |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1.7 | 3.5 | ... | 4.2 | 1  | 0  | ... | 0  |
| 3.9 | 2.5 | ... | 4.1 | ?  | ?  | ... | ?  |

Table: Multi-label classification.

- $y = (y_1, \ldots, y_K)^T$: vector of target variables (labels).
- $x = (x_1, \ldots, x_p)^T$: vector of explanatory variables (features).

TASK: build a model which predicts y using x.
Motivation example: modelling multimorbidity¹

| BMI | Weight | Glucose | ... | Diabetes | Hypotension | Liver disease | ... |
|-----|--------|---------|-----|----------|-------------|---------------|-----|
| 31  | 84     | 10      | ... | 1        | 1           | 0             | ... |
| 26  | 63     | 6       | ... | 1        | 0           | 0             | ... |
| 27  | 60     | 7       | ... | 0        | 0           | 0             | ... |

Features x: characteristics of patients. Labels y: occurrences of diseases.

- Task 1: predict which diseases are likely to occur based on patients' characteristics (PREDICTION).
- Task 2: select features that influence the occurrence of diseases (FEATURE SELECTION).

¹ Multimorbidity: the co-occurrence of two or more diseases in one patient.
Multi-label classification

Standard approach:

1. Estimate the posterior probability $p(y|x)$.
2. Make a prediction for a new observation $x_0$:
   $\hat{y}(x_0) = \arg\max_{y \in \{0,1\}^K} \hat{p}(y|x_0)$,
   where $\hat{p}(y|x_0)$ is the estimated probability (see the sketch below).

Both steps are more difficult than in single-label classification.
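As a toy illustration of step 2, the prediction can be computed by brute force when K is small: enumerate all $2^K$ label vectors and keep the most probable one. This is a minimal sketch; the callable `p_hat` is an assumed stand-in for any fitted estimate of $p(y|x)$, not part of the original talk.

```python
# Exhaustive argmax over {0,1}^K, feasible only for small K.
from itertools import product
import numpy as np

def predict_exhaustive(p_hat, x0, K):
    """y_hat(x0) = argmax over y in {0,1}^K of p_hat(y, x0)."""
    candidates = product((0, 1), repeat=K)
    return max(candidates, key=lambda y: p_hat(y, x0))

# Toy (made-up) posterior that favours all-ones label vectors:
p_hat = lambda y, x0: np.prod([0.7 if yk == 1 else 0.3 for yk in y])
print(predict_exhaustive(p_hat, x0=None, K=3))  # -> (1, 1, 1)
```

The search space grows as $2^K$, which is one concrete reason why both steps are harder than in the single-label case.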
Posterior probability estimation

Possible approaches:

1. Direct modelling of the posterior probability: assume some parametric form of $p(y|x)$, e.g. the Ising model.
2. Binary Relevance: assume conditional independence of the labels,
   $p(y|x) = p(y_1, \ldots, y_K|x) = \prod_{k=1}^{K} p(y_k|x)$,
   and estimate the marginal probabilities.
3. Classifier chains: use the chain rule,
   $p(y|x) = p(y_1, \ldots, y_K|x) = p(y_1|x) \prod_{k=2}^{K} p(y_k|x, y_1, \ldots, y_{k-1})$,
   and estimate the conditional probabilities (a sketch follows this list).
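The chain-rule decomposition is available off the shelf in scikit-learn. The snippet below is a minimal sketch with a randomly generated (assumed) dataset; it is not the talk's experimental setup.

```python
# Classifier chains: each link k models p(y_k | x, y_1, ..., y_{k-1});
# the previous labels are appended to the feature vector of the next link.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(0)
n, p, K = 200, 10, 3
X = rng.normal(size=(n, p))                      # features
Y = (rng.random(size=(n, K)) < 0.4).astype(int)  # K binary labels

chain = ClassifierChain(LogisticRegression(max_iter=1000), order=None)
chain.fit(X, Y)
Y_hat = chain.predict(X[:5])  # greedy prediction along the chain
```

With `order=None` the labels are chained in their given order; passing a permutation changes the fitting order, which is exactly the sensitivity studied in Experiment 2 below.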
CCnet

1. Use the chain rule for the posterior probability:
   $p(y|x, \theta) = p(y_1|x, \theta_1) \prod_{k=2}^{K} p(y_k|y_{-k}, x, \theta_k)$,
   where $y_{-k} = (y_1, \ldots, y_{k-1})^T$ and $\theta = (\theta_1, \ldots, \theta_K)^T$.
2. Assume that the conditional probabilities are of the logistic form.
3. Use the penalized maximum likelihood method (with the elastic-net penalty) to estimate the parameters $\theta_k$:
   $\hat{\theta}_k = \arg\min_{\theta_k} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \log p(y_k^{(i)}|y_{-k}^{(i)}, x^{(i)}, \theta_k) + \lambda_1 \|\theta_k\|_1 + \lambda_2 \|\theta_k\|_2^2 \right\}$,
   for $k = 1, \ldots, K$, based on the training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$ (a sketch of this step follows below).
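A minimal sketch of this estimation step, fitting one elastic-net penalized logistic regression per link of the chain. Note that scikit-learn parametrizes the penalty via `C` and `l1_ratio` rather than the pair $(\lambda_1, \lambda_2)$, so the hyperparameter values below are illustrative assumptions; `fit_ccnet` and `predict_ccnet` are hypothetical helper names, not the talk's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ccnet(X, Y, C=1.0, l1_ratio=0.5):
    """Fit K chained elastic-net logistic regressions (one per label)."""
    K = Y.shape[1]
    models = []
    for k in range(K):
        # Augment the features with the preceding labels y_1, ..., y_{k-1}.
        Z = np.hstack([X, Y[:, :k]])
        clf = LogisticRegression(penalty="elasticnet", solver="saga",
                                 C=C, l1_ratio=l1_ratio, max_iter=5000)
        clf.fit(Z, Y[:, k])
        models.append(clf)
    return models

def predict_ccnet(models, X):
    """Greedy prediction: feed each predicted label to the next link."""
    Y_hat = np.zeros((X.shape[0], len(models)), dtype=int)
    for k, clf in enumerate(models):
        Y_hat[:, k] = clf.predict(np.hstack([X, Y_hat[:, :k]]))
    return Y_hat
```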
Theoretical results

- Stability of CCnet with respect to the subset loss: a small perturbation of the training data does not affect the value of the subset loss function much.
- Generalization error bound for CCnet: we use an idea described in Bousquet & Elisseeff (JMLR 2002) to show that the difference between the expected error and the empirical error can be bounded by a term related to the stability.
Loss functions

Let $g(x, y, \theta) = p(y|x, \theta) - \max_{y' \neq y} p(y'|x, \theta)$.

- Subset loss (equal to 0 if all labels are predicted correctly and 1 otherwise):
  $l(x, y, \theta) = \begin{cases} 1 & \text{if } g(x, y, \theta) < 0, \\ 0 & \text{if } g(x, y, \theta) \geq 0. \end{cases}$  (1)
- Modification of the subset loss:
  $l_\gamma(x, y, \theta) = \begin{cases} 1 & \text{if } g(x, y, \theta) < 0, \\ 1 - g(x, y, \theta)/\gamma & \text{if } 0 \leq g(x, y, \theta) < \gamma, \\ 0 & \text{if } g(x, y, \theta) \geq \gamma. \end{cases}$
Figure: Subset loss $l(x, y, \theta)$ (a) and modified subset loss $l_\gamma(x, y, \theta)$ (b).
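A direct implementation of the two losses, useful for checking the definitions. This is a sketch: `posterior` is an assumed callable returning the estimated $p(y|x)$ for a full label vector, and the maximum over $y' \neq y$ is taken by brute force.

```python
from itertools import product

def margin_g(posterior, x, y, K):
    """g(x, y) = p(y|x) - max over y' != y of p(y'|x), with y' in {0,1}^K."""
    p_y = posterior(x, y)
    p_best_other = max(posterior(x, y2)
                       for y2 in product((0, 1), repeat=K)
                       if tuple(y2) != tuple(y))
    return p_y - p_best_other

def subset_loss(g):
    return 1.0 if g < 0 else 0.0

def smoothed_subset_loss(g, gamma):
    if g < 0:
        return 1.0
    if g < gamma:
        return 1.0 - g / gamma  # linear ramp on [0, gamma)
    return 0.0
```

The modified loss $l_\gamma$ replaces the jump of $l$ at $g = 0$ with a linear ramp of width $\gamma$, which is what the stability argument below relies on.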
Stability of CCnet

- Original training data: $D = (x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$.
- Modified training data $D^l$: the $l$-th observation from $D$ is replaced by its independent copy.
- Solutions of CCnet: $\hat{\theta}$ and $\hat{\theta}^l$, calculated based on $D$ and $D^l$, respectively.

Theorem (stability of CCnet)
Assume that $\|x\|_2 \leq L$ and let $\lambda_2 > 0$. For $l = 1, \ldots, n$ we have:
$|l_\gamma(x, y, \hat{\theta}) - l_\gamma(x, y, \hat{\theta}^l)| \leq \frac{4K(L^2 + K)}{\lambda_2 n \gamma}.$
Generalization error bound for CCnet

- Expected error: $\mathrm{err}(\hat{\theta}) = E_{x,y}\, l_\gamma(x, y, \hat{\theta})$.
- Empirical error: $\mathrm{Err}(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} l_\gamma(x^{(i)}, y^{(i)}, \hat{\theta})$.

Theorem (generalization error bound)
Assume that $\|x\|_2 \leq L$ and $\lambda_2 > 0$. The following inequality holds with probability at least $1 - \delta$:
$\mathrm{err}(\hat{\theta}) - \mathrm{Err}(\hat{\theta}) \leq \frac{8K(L^2 + K)}{\lambda_2 n \gamma} + \left( \frac{16K(L^2 + K)}{\lambda_2 \gamma} + 1 \right) \sqrt{\frac{\log(1/\delta)}{2n}}.$
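For intuition, both right-hand sides are easy to evaluate numerically. The values of K, L, λ2, γ, n and δ below are illustrative assumptions only.

```python
import math

def stability_bound(K, L, lam2, gamma, n):
    """RHS of the stability theorem: 4K(L^2 + K) / (lambda_2 * n * gamma)."""
    return 4 * K * (L**2 + K) / (lam2 * n * gamma)

def generalization_bound(K, L, lam2, gamma, n, delta):
    """RHS of the generalization error bound."""
    a = 8 * K * (L**2 + K) / (lam2 * n * gamma)
    b = 16 * K * (L**2 + K) / (lam2 * gamma) + 1
    return a + b * math.sqrt(math.log(1 / delta) / (2 * n))

# Both bounds shrink as n grows (the second term at rate ~ 1/sqrt(n)):
for n in (10**4, 10**6, 10**8):
    print(n, generalization_bound(K=4, L=1.0, lam2=1.0, gamma=0.5, n=n, delta=0.05))
```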
Experiments

Two main goals of the experiments:

1. Experiment 1: compare the prediction performance of CCnet and other state-of-the-art methods.
2. Experiment 2: analyse the stability of CCnet with respect to the order of fitting the models in the chain.
Experiment 1

Methods:

1. BRlogit
2. BRtree
3. BRnet (lasso penalty)
4. CClogit
5. CCtree
6. CCnet (lasso penalty)
7. MLKNN
8. LPtree
Experiment 1

| Dataset   | CClogit | CCtree | CCnet     | BRlogit | BRtree | BRnet | LPtree | MLKNN |
|-----------|---------|--------|-----------|---------|--------|-------|--------|-------|
| music     | 0.215   | 0.221  | 0.275     | 0.186   | 0.191  | 0.257 | 0.275  | 0.265 |
| yeast     | 0.214   | 0.168  | 0.184     | 0.156   | 0.048  | 0.123 | 0.182  | 0.231 |
| scene     | 0.473   | 0.467  | 0.629     | 0.385   | 0.337  | 0.416 | 0.565  | 0.622 |
| birds     | 0.349   | 0.375  | 0.532     | 0.332   | 0.375  | 0.535 | 0.543  | 0.450 |
| flags     | 0.227   | 0.139  | 0.216     | 0.124   | 0.072  | 0.165 | 0.263  | 0.087 |
| medical   | 0.181   | 0.690  | 0.760     | 0.180   | 0.634  | 0.752 | 0.763  | 0.579 |
| cal500    | 0.008   | 0.006  | 0.020     | 0.012   | 0.004  | 0.008 | 0.002  | 0.002 |
| genbase   | 0.989   | 0.985  | 0.986     | 0.989   | 0.985  | 0.989 | 0.994  | 0.500 |
| mediamill | 0.223   | 0.096  | 0.197     | 0.192   | 0.057  | 0.155 | 0.124  | 0.212 |
| enron     | 0.037   | 0.170  | 0.139     | 0.038   | 0.130  | 0.095 | 0.123  | 0.152 |
| nuswide   | 0.363   | 0.045  | 0.338     | 0.350   | 0.028  | 0.330 | 0.305  | 0.208 |
| bookmarks | 0.108   | 0.287  | 0.754     | 0.067   | 0.292  | 0.754 | 0.753  | 0.689 |
| bibtex    | 0.361   | 0.404  | 0.780     | 0.359   | 0.414  | 0.780 | 0.760  | 0.617 |
| avg rank  | 5.962   | 4.615  | **9.154** | 4.038   | 2.923  | 6.808 | 7.808  | 6.115 |

Table: Subset measure. The larger the rank, the better. The number in bold corresponds to the winning method.
Experiment 1

| Dataset   | CClogit | CCtree | CCnet | BRlogit | BRtree | BRnet | LPtree | MLKNN |
|-----------|---------|--------|-------|---------|--------|-------|--------|-------|
| music     | 71      | 69     | 36    | 71      | 71     | 34    | 47     | 71    |
| yeast     | 103     | 95     | 70    | 103     | 103    | 54    | 13     | 103   |
| scene     | 294     | 177    | 161   | 294     | 190    | 160   | 51     | 294   |
| birds     | 260     | 34     | 44    | 260     | 33     | 45    | 9      | 260   |
| flags     | 19      | 19     | 16    | 19      | 19     | 6     | 15     | 19    |
| medical   | 1449    | 80     | 33    | 1449    | 75     | 31    | 57     | 1449  |
| cal500    | 68      | 68     | 5     | 68      | 68     | 2     | 1      | 68    |
| genbase   | 1186    | 28     | 13    | 1186    | 29     | 14    | 44     | 1186  |
| mediamill | 120     | 81     | 75    | 120     | 94     | 72    | 24     | 120   |
| enron     | 1001    | 303    | 86    | 1001    | 374    | 78    | 25     | 1001  |
| nuswide   | 128     | 1      | 94    | 128     | 1      | 104   | 1      | 128   |
| bookmarks | 2150    | 450    | 56    | 2150    | 453    | 57    | 13     | 2150  |
| bibtex    | 1836    | 171    | 108   | 1836    | 169    | 109   | 36     | 1836  |
| avg rank  | 10      | 6      | 3     | 10      | 6      | 3     | **2**  | 10    |

Table: Number of selected variables. The parameter λ was chosen using the BIC criterion. The smaller the rank, the better. The number in bold corresponds to the winning method.
Experiment 2

Selected features: $S = \bigcup_{k=1}^{K} \{1 \leq r \leq p : \hat{\theta}_{k,r} \neq 0\}$.

Remarks:

- The order of fitting the models in the chain may influence the results.
- We would like to verify how the chosen subset of relevant features depends on the order of fitting the models in the chain.
- We repeat the experiment with CCnet for 100 random permutations of the labels and report which features are selected (a sketch of this protocol follows the list).
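A sketch of this protocol, reusing the hypothetical `fit_ccnet` helper from the earlier sketch; `selected_features` and `permutation_experiment` are illustrative names, not the talk's code.

```python
import numpy as np

def selected_features(models, p):
    """S = union over k of {r : theta_hat_{k,r} != 0}, restricted to the
    first p coordinates (the x-features; label coefficients are ignored)."""
    S = set()
    for clf in models:
        S |= set(np.flatnonzero(clf.coef_[0][:p]))
    return S

def permutation_experiment(X, Y, n_perm=100, seed=0):
    """Refit the chain under random label orders; collect each selected S."""
    rng = np.random.default_rng(seed)
    K, p = Y.shape[1], X.shape[1]
    selections = []
    for _ in range(n_perm):
        order = rng.permutation(K)          # random order of the labels
        models = fit_ccnet(X, Y[:, order])  # refit the whole chain
        selections.append(selected_features(models, p))
    return selections
```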
Experiment 2

Artificial data (a generating sketch follows the list):

1. First we draw $x$ from the 50-dimensional standard Gaussian distribution.
2. Then we generate the labels $y_1, \ldots, y_4$ sequentially using the formula
   $p(y_k|z_k, \theta_k) = \frac{\exp(\theta_k^T z_k y_k)}{1 + \exp(\theta_k^T z_k)}$,  (2)
   with $z_k = (y_{-k}, x)^T$ and $y_{-k} = (y_1, \ldots, y_{k-1})^T$ (for $k = 1$ we set $z_k = x$).
3. The coordinates of $\theta_k$ corresponding to $y_{-k}$ are always equal to 0.1.
4. Moreover, the coordinates of $\theta_k$ corresponding to the first 3 features $x_1, x_2, x_3$ are equal to $b$ ($b$ is a parameter whose value varies in the experiment).
5. The coordinates corresponding to the remaining features are zero.
6. Only 3 features, $x_1, x_2, x_3$, are relevant.
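A minimal sketch of this generator under the assumptions listed above (the function name and seed are illustrative):

```python
import numpy as np

def generate_data(n, b, p=50, K=4, seed=0):
    """x ~ N(0, I_p); labels y_1..y_K drawn sequentially from model (2)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    Y = np.zeros((n, K), dtype=int)
    for k in range(K):
        # theta_k: 0.1 on the k previous labels, b on x1..x3, 0 elsewhere.
        theta = np.concatenate([np.full(k, 0.1), np.full(3, b), np.zeros(p - 3)])
        Z = np.hstack([Y[:, :k], X])               # z_k = (y_{-k}, x)
        prob = 1.0 / (1.0 + np.exp(-Z @ theta))    # p(y_k = 1 | z_k)
        Y[:, k] = rng.random(n) < prob
    return X, Y
```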
Experiment 2

- $T$: set of truly relevant features (in our case $T = \{1, 2, 3\}$).
- $S$: set of features chosen by CCnet.

Performance measures (implemented in the snippet below):

- Positive Selection Rate: $PSR(T, S) = \frac{|S \cap T|}{|T|}$,
- False Discovery Rate: $FDR(T, S) = \frac{|S \setminus T|}{|S|}$.
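Both measures are one-liners over Python sets; the example values below are made up:

```python
def psr(T, S):
    """Positive Selection Rate: fraction of truly relevant features found."""
    return len(S & T) / len(T)

def fdr(T, S):
    """False Discovery Rate: fraction of selected features that are spurious."""
    return len(S - T) / len(S) if S else 0.0

T = {1, 2, 3}                 # truly relevant features
S = {1, 3, 17}                # an example selection
print(psr(T, S), fdr(T, S))   # 0.666..., 0.333...
```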
Figure: Positive Selection Rate (PSR) and False Discovery Rate (FDR) with respect to the parameter b, for n = 100. The lines correspond to different orders of fitting the models in the chain.
Figure: Positive Selection Rate (PSR) and False Discovery Rate (FDR) with respect to the sample size n, for b = 0.5. The lines correspond to different orders of fitting the models in the chain.
Conclusions

1. CCnet is a combination of classifier chains and logistic regression with elastic-net regularization.
2. We prove the stability of CCnet and bound its generalization error.
3. The empirical experiments show that CCnet outperforms other state-of-the-art methods with respect to subset accuracy.
4. CCnet is stable with respect to the order of fitting the models in the chain.
References:

1. J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Machine Learning.
2. H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society B.
3. P. Teisseyre, CCnet: joint multi-label classification and feature selection using classifier chains and elastic net regularization.
4. O. Bousquet, A. Elisseeff, Stability and generalization, Journal of Machine Learning Research, 2002.