Consensual Aggregation of Clusters based on Bregman Divergences to Improve Predictive Models
Sothea Has, Sorbonne Université
LPSM, Université Paris-Diderot
Mathilde Mougeot, Aurélie Fischer
sothea.has@lpsm.paris
2 April 2019
Figure – Input data with 3 clusters; a different model on each cluster.
Setting: $(X, Z) \in \mathcal{X} \times \mathcal{Z}$: input-output data.
$\mathcal{X} = \mathbb{R}^d$: input space.
$\mathcal{Z} = \mathbb{R}$: regression; $\mathcal{Z} = \{0, 1\}$: binary classification.
$\mathcal{T}_n = \{(x_i, z_i)\}_{i=1}^n$: iid learning data.
Objective: construct a good predictive model for regression or classification.
Assumptions: $\mathcal{X}$ is composed of more than one group or cluster; the number of clusters $K$ is available; there exist different underlying models on these clusters.
There are 3 important steps:
1. Cluster the input data by K-means with different Bregman divergences.
2. Construct an estimator on each cluster cell.
3. Consensually aggregate the candidate estimators.
Let $\phi : \mathcal{C} \subset \mathbb{R}^d \to \mathbb{R}$ be strictly convex and of class $C^1$. Then for any $(x, y) \in \mathcal{C} \times \operatorname{int}(\mathcal{C})$ (points of the input space $\mathcal{X}$),
$$d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla \phi(y) \rangle.$$

Figure – Graphical interpretation of Bregman divergences: $d_\phi(x, y)$ is the vertical gap at $x$ between $\phi(x)$ and the tangent $\phi(y) + \langle x - y, \nabla \phi(y) \rangle$.
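This definition translates directly into code. A minimal sketch (the function names are my assumptions, not the authors' code):

```python
import numpy as np

# A generic Bregman divergence, determined entirely by the generator phi and
# its gradient: d_phi(x, y) = phi(x) - phi(y) - <x - y, grad_phi(y)>.
def bregman_divergence(phi, grad_phi, x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# Sanity check: phi(x) = ||x||^2 recovers the squared Euclidean distance.
phi = lambda v: np.dot(v, v)
grad_phi = lambda v: 2.0 * v
print(bregman_divergence(phi, grad_phi, [1.0, 2.0], [0.0, 1.0]))  # 2.0 = ||x - y||^2
```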
$X$ is a member of an exponential family $\mathcal{E}_\psi$ if
$$f(x \mid \theta) = h(x) \exp(\langle \theta, T(x) \rangle - \psi(\theta)), \quad \theta \in \Theta.$$
Examples:
Continuous cases: exponential, normal, gamma, beta, ...
Discrete cases: Bernoulli, Poisson, binomial, multinomial, ...
Theorem [Banerjee et al., 2005]. If $X$ is a member of an exponential family $\mathcal{E}_\psi$ and if $\phi$ is the convex conjugate of $\psi$, defined by
$$\phi(x) = \sup_y \{\langle x, y \rangle - \psi(y)\},$$
then there exists a unique Bregman divergence $d_\phi$ such that
$$f(x \mid \theta) = h(x) \exp\big(-d_\phi(T(x), \mathbb{E}[T(X)]) + \phi(T(x))\big).$$
Examples:
Exponential distribution: $d_\phi(x, y) = \dfrac{x}{y} - \log\dfrac{x}{y} - 1$ (Itakura–Saito).
Poisson distribution: $d_\phi(x, y) = x \log\dfrac{x}{y} - (x - y)$ (generalized Kullback–Leibler).
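As a worked instance of the theorem (a sketch added here, not on the original slide), the Poisson case can be derived explicitly:

```latex
% Poisson(\lambda): natural parameter \theta = \log\lambda, T(x) = x,
% log-partition \psi(\theta) = e^{\theta}, so E[T(X)] = \psi'(\theta) = \lambda.
\[
\phi(x) = \sup_{\theta}\,\{x\theta - e^{\theta}\} = x\log x - x,
\qquad \nabla\phi(y) = \log y,
\]
\[
d_\phi(x, y) = \phi(x) - \phi(y) - (x - y)\log y
             = x\log\frac{x}{y} - (x - y),
\]
% which is exactly the generalized Kullback-Leibler divergence quoted above.
```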
Step 1: perform the K-means algorithm with $M$ options of Bregman divergences. Each $\mathrm{BD}_\ell$ gives an associated partition $S^\ell = \{S_k^\ell\}_{k=1}^K$.

[Diagram: $\mathrm{BD}_1, \mathrm{BD}_2, \dots, \mathrm{BD}_M \to S^1, S^2, \dots, S^M$ (Step 1).]
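A minimal sketch of this step (assumed interface, not the authors' implementation). It uses the fact, proved in [Banerjee et al., 2005], that the arithmetic mean is the optimal cell representative for any Bregman divergence, so only the assignment step depends on the divergence:

```python
import numpy as np

# K-means with a pluggable Bregman divergence. `divergence(X, c)` maps an
# (n, d) array and a center c to the n divergence values d_phi(x_i, c);
# concrete examples are given after the list of divergences below.
def bregman_kmeans(X, K, divergence, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment: send each point to the closest center w.r.t. d_phi.
        D = np.stack([divergence(X, c) for c in centers], axis=1)  # (n, K)
        labels = D.argmin(axis=1)
        # Update: the cell mean; keep the old center if a cell empties out.
        centers = np.stack([X[labels == k].mean(axis=0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return labels, centers
```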
Step 2: suppose that for all $\ell, k$, the cell $S_k^\ell \in S^\ell$ contains enough data points. For all $\ell, k$, construct an estimator $m_k^\ell$ on $S_k^\ell$. Then $m^\ell = \{m_k^\ell\}_{k=1}^K$ is the candidate estimator associated with $\mathrm{BD}_\ell$.

[Diagram: $\mathrm{BD}_1, \dots, \mathrm{BD}_M \to S^1, \dots, S^M \to m^1, \dots, m^M$ (Steps 1–2).]
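A sketch of one candidate $m^\ell$ (my assumptions: scikit-learn's LinearRegression as the local model, and new points routed to the cell of the nearest center under the same divergence used for clustering):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_candidate(X, z, labels, K):
    # One local model per cell S_k^ell: together they form the candidate m^ell.
    return [LinearRegression().fit(X[labels == k], z[labels == k])
            for k in range(K)]

def predict_candidate(models, centers, divergence, X_new):
    # Route each new point to its nearest center w.r.t. the same divergence.
    cells = np.stack([divergence(X_new, c) for c in centers], axis=1).argmin(axis=1)
    return np.array([models[k].predict(x[None, :])[0]
                     for x, k in zip(X_new, cells)])
```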
Why consensual aggregation? Neither the distribution nor the clustering structure of the input data is available, so it is not easy to choose the "best" one among $\{m^\ell\}_{\ell=1}^M$.

[Diagram: $\mathrm{BD}_1, \dots, \mathrm{BD}_M \to S^1, \dots, S^M \to m^1, \dots, m^M \to$ Aggregation (Steps 1–3).]
Example: suppose we have 4 classifiers $m = (m_1, m_2, m_3, m_4)$ and an observation $x$ with predictions $(1, 1, 0, 1)$. The combined classifier looks for training points whose prediction vector agrees with that of $x$ and aggregates their labels $z$:

ID | m1 | m2 | m3 | m4 | z
 1 |  1 |  1 |  1 |  1 | 1
 2 |  1 |  – |  – |  – | –
 3 |  1 |  1 |  1 |  – | –
 4 |  1 |  1 |  1 |  1 | –
 5 |  1 |  1 |  1 |  1 | –

Table – Table of predictions.
Based on the following works:
[Mojirsheibani, 1999]: classical method (Mo1).
[Mojirsheibani, 2000]: a kernel-based method (Mo2).
[Fischer and Mougeot, 2019]: MixCOBRA.
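A sketch of the classical combination rule Mo1 (my reading of [Mojirsheibani, 1999]; the fallback rule is an assumption):

```python
import numpy as np

# Mo1: predict at x by a majority vote over training points whose M
# predictions all match the predictions at x.
def mo1_combine(preds_train, z_train, preds_x):
    # preds_train: (n, M) 0/1 predictions on the training points.
    # z_train: (n,) labels. preds_x: (M,) predictions at the query point x.
    match = (preds_train == preds_x).all(axis=1)
    if not match.any():
        # Fallback (an assumption): majority vote of the M classifiers at x.
        return int(preds_x.mean() >= 0.5)
    return int(z_train[match].mean() >= 0.5)
```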
The aggregation takes the following form:
$$\mathrm{Agg}_n(x) = \sum_{i=1}^{n} W_{n,i}(x)\, z_i.$$
[Biau et al., 2016]: 0–1 weights (COBRA),
$$W_{n,i}(x) = \frac{\prod_{\ell=1}^{M} \mathbb{1}_{\{|m^\ell(x_i) - m^\ell(x)| < \varepsilon\}}}{\sum_{j=1}^{n} \prod_{\ell=1}^{M} \mathbb{1}_{\{|m^\ell(x_j) - m^\ell(x)| < \varepsilon\}}}.$$
Kernel-based variant of COBRA (kernel-based weights).
[Fischer and Mougeot, 2019]: MixCOBRA.
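A sketch of the 0–1 weight scheme above ($\varepsilon$ and the matrices of candidate predictions are assumed given; not the authors' code):

```python
import numpy as np

# COBRA aggregation with 0-1 weights (Biau et al., 2016).
def cobra_aggregate(preds_train, z_train, preds_x, eps=0.1):
    # preds_train: (n, M) candidate predictions m^ell(x_i) on training points.
    # preds_x: (M,) candidate predictions m^ell(x) at the query point x.
    # A training point gets weight 1 iff ALL M candidates predict it within
    # eps of their value at x (the product of indicators), then normalize.
    keep = (np.abs(preds_train - preds_x) < eps).all(axis=1).astype(float)
    if keep.sum() == 0.0:
        return z_train.mean()  # fallback when no training point qualifies
    return np.dot(keep, z_train) / keep.sum()
```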
Bregman divergences used:
Euclidean: for all $x \in \mathcal{C} = \mathbb{R}^d$, $\phi(x) = \|x\|_2^2 = \sum_{i=1}^d x_i^2$, and $d_\phi(x, y) = \|x - y\|_2^2$.
Generalized Kullback–Leibler (GKL): $\phi(x) = \sum_{i=1}^d x_i \log x_i$, $\mathcal{C} = (0, +\infty)^d$, and $d_\phi(x, y) = \sum_{i=1}^d \left[x_i \log\frac{x_i}{y_i} - (x_i - y_i)\right]$.
Logistic: $\phi(x) = \sum_{i=1}^d [x_i \log x_i + (1 - x_i)\log(1 - x_i)]$, $\mathcal{C} = (0, 1)^d$, and $d_\phi(x, y) = \sum_{i=1}^d \left[x_i \log\frac{x_i}{y_i} + (1 - x_i)\log\frac{1 - x_i}{1 - y_i}\right]$.
Itakura–Saito: $\phi(x) = -\sum_{i=1}^d \log x_i$, $\mathcal{C} = (0, +\infty)^d$, and $d_\phi(x, y) = \sum_{i=1}^d \left[\frac{x_i}{y_i} - \log\frac{x_i}{y_i} - 1\right]$.

With M = 4 and K = 3:
Figure – K-means with Bregman divergences on some simulated data.
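The four divergences above written as vectorized functions (a sketch; the names are mine, and inputs are assumed to lie in the stated domains $\mathcal{C}$). Any of them can be passed as the `divergence` argument of the `bregman_kmeans` sketch earlier:

```python
import numpy as np

# Each function maps an (n, d) array X and a center y to n divergence values.
def euclidean(X, y):       # C = R^d
    return np.sum((X - y) ** 2, axis=1)

def gkl(X, y):             # C = (0, +inf)^d
    return np.sum(X * np.log(X / y) - (X - y), axis=1)

def logistic(X, y):        # C = (0, 1)^d
    return np.sum(X * np.log(X / y) + (1 - X) * np.log((1 - X) / (1 - y)), axis=1)

def itakura_saito(X, y):   # C = (0, +inf)^d
    return np.sum(X / y - np.log(X / y) - 1.0, axis=1)
```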
With 20 replications of each case.

             | K = 1          | candidate estimators mℓ                                   | kernel used in Wn,i(x)
Distribution | Single         | Euclid | GKL | Logit | Ita                                | Unif | Epan | Gaus | Triang | Bi-wgt | Tri-wgt
Exp     | 18.86 (0.89) | 8.58 (0.94) | 7.42 (0.88) | 4.09 (0.94) | 3.92 (0.91) | 3.49 (0.91) | 3.51 (1.70) | 3.46 (1.77) | 3.51 (1.55) | 3.56 (1.08) | 3.56 (1.15)
        |              |             |             |             |             | 2.91 (0.81) | 2.63 (0.70) | 2.49 (0.74) | 2.70 (0.75) | 2.56 (0.63) | 2.46 (0.66)
Pois    | 46.93 (1.37) | 9.19 (1.46) | 8.45 (1.47) | 13.33 (1.46) | 10.15 (1.47) | 8.59 (1.49) | 8.51 (3.35) | 8.51 (1.27) | 8.51 (1.24) | 8.52 (1.84) | 8.52 (1.47)
        |              |             |             |              |              | 8.51 (1.28) | 8.46 (1.11) | 8.44 (1.17) | 8.42 (1.15) | 8.57 (1.28) | 8.44 (1.13)
Geom    | 19.90 (1.15) | 12.57 (1.16) | 4.71 (1.16) | 3.94 (1.15) | 8.12 (1.16) | 3.61 (1.16) | 3.60 (2.07) | 3.60 (2.39) | 3.61 (2.37) | 3.60 (1.15) | 3.60 (1.57)
        |              |              |             |             |             | 3.76 (0.92) | 3.52 (1.11) | 2.94 (0.93) | 3.48 (1.09) | 3.47 (1.11) | 3.40 (1.06)
2D Gaus | 49.00 (1.60) | 12.37 (1.59) | 12.40 (1.56) | 14.14 (1.57) | 13.05 (1.57) | 12.87 (1.60) | 12.82 (2.52) | 12.80 (1.55) | 12.84 (1.50) | 12.84 (1.44) | 12.87 (1.61)
        |              |              |              |              |              | 12.02 (1.30) | 12.11 (1.24) | 12.06 (1.35) | 12.11 (1.27) | 12.09 (1.23) | 12.10 (1.22)
3D Gaus | 43.39 (1.58) | 10.77 (1.52) | 10.99 (1.50) | 11.74 (1.50) | 11.56 (1.57) | 11.08 (1.55) | 11.01 (2.52) | 11.00 (1.40) | 11.00 (1.44) | 11.04 (1.45) | 11.03 (1.51)
        |              |              |              |              |              | 10.23 (1.40) | 9.93 (1.47) | 9.76 (1.53) | 10.04 (1.47) | 9.83 (1.61) | 9.84 (1.61)

Table – Average testing misclassification error (1 unit = 10⁻²), standard deviations in parentheses.
             | K = 1          | candidate estimators mℓ                                   | kernel used in Wn,i(x)
Distribution | Single         | Euclid | GKL | Logit | Ita                                | Unif | Epan | Gaus | Triang | Bi-wgt | Tri-wgt
Exp     | 107.73 (15.85) | 69.82 (13.31) | 58.93 (14.40) | 44.54 (13.12) | 44.46 (13.74) | 55.11 (14.41) | 51.14 (7.13) | 40.21 (6.84) | 52.99 (7.37) | 50.24 (7.37) | 50.64 (10.96)
        |                |               |               |               |               | 56.34 (17.48) | 52.62 (17.82) | 39.12 (14.98) | 51.31 (19.55) | 51.20 (19.69) | 51.98 (20.12)
Pois    | 26.76 (1.65) | 10.16 (1.98) | 8.22 (2.18) | 16.72 (2.06) | 12.15 (2.03) | 8.88 (2.03) | 9.18 (1.11) | 8.43 (1.91) | 8.85 (2.25) | 8.84 (1.61) | 8.76 (1.86)
        |              |              |             |              |              | 9.73 (2.25) | 9.61 (1.86) | 9.13 (1.92) | 9.64 (1.91) | 9.40 (1.86) | 9.43 (1.93)
Geom    | 70.45 (13.81) | 29.99 (13.49) | 18.33 (11.79) | 22.94 (14.31) | 31.94 (13.51) | 36.39 (12.21) | 32.49 (4.52) | 21.51 (5.95) | 31.48 (7.34) | 31.44 (6.21) | 30.89 (5.19)
        |               |               |               |               |               | 31.83 (12.88) | 27.90 (14.20) | 17.82 (12.58) | 26.82 (13.28) | 28.45 (14.02) | 24.58 (13.21)
2D Gaus | 21.98 (2.55) | 5.63 (1.78) | 6.46 (0.49) | 19.36 (1.72) | 9.38 (1.76) | 7.09 (1.75) | 6.57 (1.20) | 5.57 (1.26) | 6.20 (1.81) | 6.41 (1.11) | 6.33 (1.86)
        |              |             |             |              |             | 9.75 (1.30) | 7.70 (2.24) | 6.42 (1.49) | 7.45 (2.42) | 7.47 (2.28) | 7.34 (2.31)
3D Gaus | 53.55 (3.42) | 19.89 (3.45) | 20.93 (4.06) | 23.71 (3.41) | 22.96 (3.50) | 18.16 (3.49) | 18.20 (1.74) | 16.94 (3.49) | 18.25 (2.97) | 18.05 (2.70) | 18.00 (2.74)
        |              |              |              |              |              | 19.24 (3.54) | 18.52 (4.02) | 17.51 (3.64) | 18.64 (4.37) | 18.19 (3.91) | 18.42 (3.68)

Table – Average testing RMSE, standard deviations in parentheses.
Air compressor
Given by [Cadet et al., 2005]. Six predictors: air temperature, input pressure, output pressure, flow and water temperature. Response variable: power consumption.
Here, K is not available!
For K = 1: RMSE = 178.67.

K | Euclid | GKL | Logistic | Ita | COBRA | MixCOBRA*
2 | 158.85 (6.42) | 158.90 (6.48) | 159.35 (6.71) | 158.96 (6.41) | 153.34 (6.72) | 116.69 (5.86)
3 | 157.38 (6.95) | 157.24 (6.84) | 156.99 (6.65) | 157.24 (6.85) | 153.69 (6.64) | 117.45 (5.55)
4 | 154.33 (6.69) | 153.96 (6.74) | 153.99 (6.45) | 154.07 (7.01) | 152.09 (6.58) | 117.16 (5.99)
5 | 153.18 (6.91) | 153.19 (6.77) | 152.95 (6.57) | 152.25 (6.70) | 151.05 (6.76) | 117.55 (5.90)
6 | 151.16 (6.91) | 151.67 (6.96) | 151.89 (6.62) | 151.75 (6.57) | 150.27 (6.82) | 117.74 (5.86)
7 | 151.08 (6.77) | 150.99 (6.84) | 152.81 (7.11) | 151.85 (6.61) | 150.46 (6.87) | 117.58 (6.15)
8 | 151.27 (7.17) | 151.09 (7.01) | 152.07 (6.65) | 150.90 (6.96) | 150.21 (7.03) | 117.91 (5.83)

Table – RMSE on the air compressor data, standard deviations in parentheses.
* Consensual aggregation method integrating the input X into the weights [Fischer and Mougeot, 2019].
Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749.
Biau, G., Fischer, A., Guedj, B., and Malley, J. D. (2016). COBRA: a combined regression strategy. Journal of Multivariate Analysis, 146:18–28.
Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217.
Cadet, O., Harper, C., and Mougeot, M. (2005). Monitoring energy performance of compressors with an innovative auto-adaptive approach. In Instrumentation System and Automation (ISA), Chicago.
Fischer, A., Has, S., and Mougeot, M. (2018). Consensual aggregation of clusters based on Bregman divergences to improve predictive models.
Fischer, A. and Mougeot, M. (2019). Aggregation using input-output trade-off. Journal of Statistical Planning and Inference, 200:1–19.
Mojirsheibani, M. (1999). Combined classifiers via discretization. Journal of the American Statistical Association, 94(446):600–609.
Mojirsheibani, M. (2000). A kernel-based combined classification rule. Statistics and Probability Letters, 48(4):411–419.