 
Consensual Aggregation of Clusters based on Bregman Divergences to Improve Predictive Models
Sothea HAS
Sorbonne Université
LPSM, Université Paris-Diderot
Mathilde Mougeot
Aurélie Fischer
sothea.has@lpsm.paris
2 April 2019
Overview
A. Introduction
B. Construction of a predictive model
   1. K-means algorithm with Bregman divergences
   2. Construction of candidate estimators
   3. Consensual aggregation
C. Applications
   1. Simulated data
   2. Real data
Consider an example...
Input data with 3 clusters.
A different model on each cluster.
Introduction
Setting:
  $(X, Z) \in \mathcal{X} \times \mathcal{Z}$: input-output data.
  $\mathcal{X} = \mathbb{R}^d$: input space.
  $\mathcal{Z} = \mathbb{R}$ (regression) or $\mathcal{Z} = \{0, 1\}$ (binary classification).
  $T_n = \{(x_i, z_i)\}_{i=1}^n$: iid learning data.
Objective: construct a good predictive model for regression or classification.
Assumptions:
  The input data $X$ is composed of more than one group or cluster.
  The number of clusters $K$ is available.
  There exist different underlying models on these clusters.
Construction of a predictive model
There are 3 important steps:
1. K-means algorithm with Bregman divergences
2. Construction of candidate estimators
3. Consensual aggregation
Bregman divergences (BD) [Bregman, 1967]
Let $\phi : \mathcal{C} \subset \mathbb{R}^d \to \mathbb{R}$ be strictly convex and of class $C^1$. Then for any $(x, y) \in \mathcal{C} \times \mathrm{int}(\mathcal{C})$ (points of the input space $\mathcal{X}$),
$$d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla \phi(y) \rangle.$$
Figure – Graphical interpretation of Bregman divergences: $d_\phi(x, y)$ is the gap at $x$ between $\phi(x)$ and the tangent of $\phi$ at $y$, $\phi(y) + \langle x - y, \nabla \phi(y) \rangle$.
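
The definition above translates almost literally into code. The following is a minimal sketch, not code from the talk: the helper name `bregman_divergence` and the choice to pass $\phi$ and its gradient as two callables are assumptions made for illustration. It checks that $\phi(x) = \|x\|_2^2$ gives back the squared Euclidean distance.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)> (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# phi(x) = ||x||^2, whose gradient is 2x.
def sq_norm(v):
    return float(np.dot(v, v))

def sq_norm_grad(v):
    return 2.0 * v

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
d_xy = bregman_divergence(sq_norm, sq_norm_grad, x, y)
sq_dist = float(np.sum((x - y) ** 2))  # squared Euclidean distance, for comparison
```

Passing the gradient explicitly keeps the sketch free of automatic differentiation; any other strictly convex, differentiable $\phi$ can be plugged in the same way.
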
Exponential families (EF)
$X$ is a member of an exponential family $\mathcal{E}_\psi$ if
$$f(x \mid \theta) = h(x) \exp(\langle \theta, T(x) \rangle - \psi(\theta)), \quad \theta \in \Theta.$$
Examples:
  Continuous cases: exponential, normal, gamma, beta...
  Discrete cases: Bernoulli, Poisson, binomial, multinomial...
Relationship between BD and EF
Theorem [Banerjee et al., 2005]
If $X$ is a member of an exponential family $\mathcal{E}_\psi$ and if $\phi$ is the convex conjugate of $\psi$, defined by $\phi(x) = \sup_y \{\langle x, y \rangle - \psi(y)\}$, then there exists a unique Bregman divergence $d_\phi$ such that
$$f(x \mid \theta) = h(x) \exp(-d_\phi(T(x), E[T(X)]) + \phi(T(x))).$$
Examples:
  Exponential distribution: $d_\phi(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1$ (Itakura-Saito).
  Poisson distribution: $d_\phi(x, y) = x \log\frac{x}{y} - (x - y)$ (General Kullback-Leibler).
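
As a worked check of the theorem in the Poisson case, the derivation below is a sketch under the usual parametrisation, which is an assumption here since the slide does not spell it out: $T(x) = x$, $\theta = \log\lambda$, $\psi(\theta) = e^\theta$, $h(x) = 1/x!$.

```latex
\begin{align*}
f(x \mid \theta) &= \frac{1}{x!}\exp\bigl(x\theta - e^{\theta}\bigr),
  \qquad \theta = \log\lambda,\quad E[T(X)] = \lambda, \\
\phi(x) &= \sup_{y}\{xy - e^{y}\} = x\log x - x
  \qquad (\text{attained at } y = \log x), \\
d_\phi(x, \lambda) &= \phi(x) - \phi(\lambda) - (x - \lambda)\,\phi'(\lambda) \\
  &= (x\log x - x) - (\lambda\log\lambda - \lambda) - (x - \lambda)\log\lambda \\
  &= x\log\frac{x}{\lambda} - (x - \lambda), \\
h(x)\exp\bigl(-d_\phi(x, \lambda) + \phi(x)\bigr)
  &= \frac{1}{x!}\exp\bigl(x\log\lambda - \lambda\bigr)
   = \frac{\lambda^{x} e^{-\lambda}}{x!}.
\end{align*}
```

The last line is the Poisson probability mass function, so the General Kullback-Leibler divergence is the one attached to this family. The linear term $-x$ in $\phi$ does not affect $d_\phi$, so using $\phi(x) = x\log x$ yields the same divergence.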
Step 1: K-means algorithm with Bregman divergences
Perform the K-means algorithm with $M$ options of Bregman divergences.
Each BD$_\ell$ gives an associated partition $S^\ell = \{S^\ell_k\}_{k=1}^K$.
Diagram (Step 1): BD$_1$ → $S^1$, BD$_2$ → $S^2$, ..., BD$_M$ → $S^M$.
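
Step 1 can be sketched as a Lloyd-style loop in which only the assignment rule depends on the divergence. This is an illustrative sketch, not the authors' implementation: the function names, the random initialisation, the stopping rule and the toy Poisson data are all assumptions. It relies on the known result of Banerjee et al. (2005) that the arithmetic mean of a cluster minimises the total Bregman divergence to a centroid, whatever the divergence.

```python
import numpy as np

def sq_euclid(X, c):
    """Squared Euclidean divergence between each row of X and centroid c."""
    return np.sum((X - c) ** 2, axis=1)

def gkl(X, c):
    """General Kullback-Leibler divergence; requires strictly positive entries."""
    return np.sum(X * np.log(X / c) - (X - c), axis=1)

def assign(X, centroids, divergence):
    """Index of the centroid minimising d_phi(x_i, c_k) for each row x_i."""
    dist = np.stack([divergence(X, c) for c in centroids], axis=1)
    return dist.argmin(axis=1)

def bregman_kmeans(X, K, divergence, n_iter=100, seed=0):
    """K-means with a Bregman divergence (hypothetical helper, random init)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centroids, divergence)
        # Update step: the cluster mean is optimal for every Bregman divergence.
        # An empty cluster keeps its previous centroid (an arbitrary choice).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(X, centroids, divergence), centroids

# Toy positive data: three Poisson groups, shifted by 1 so that GKL is defined.
rng = np.random.default_rng(1)
X = np.vstack([rng.poisson(lam, size=(50, 2)) + 1.0 for lam in (2, 10, 30)])

# One partition per divergence (M = 2 here).
partitions = {
    name: bregman_kmeans(X, K=3, divergence=div)
    for name, div in {"Euclid": sq_euclid, "GKL": gkl}.items()
}
```

Each entry of `partitions` holds the labels and centroids for one divergence, which is the input needed to build the candidate estimators in the next step.
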
Step 2: Construction of candidate estimators
Suppose that $\forall \ell, k$: $S^\ell_k \in S^\ell$ contains enough data points.
$\forall \ell, k$: construct an estimator $m^\ell_k$ on $S^\ell_k$.
$m^\ell = \{m^\ell_k\}_{k=1}^K$ is the candidate estimator associated with BD$_\ell$.
Diagram (Steps 1 and 2): BD$_\ell$ → $S^\ell$ → $m^\ell$, for $\ell = 1, \dots, M$.
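
For a single partition, Step 2 might look like the sketch below. The slides do not say which local model is used, so ordinary least squares per cluster is an assumption, as are the helper names `fit_local_models` and `predict` and the rule of routing a new point to the cluster with the closest centroid.

```python
import numpy as np

def sq_euclid(X, c):
    """Squared Euclidean divergence between each row of X and centroid c."""
    return np.sum((X - c) ** 2, axis=1)

def fit_local_models(X, z, labels, K):
    """Fit one least-squares model (with intercept) per cluster."""
    models = []
    for k in range(K):
        Xk = X[labels == k]
        A = np.column_stack([np.ones(len(Xk)), Xk])
        coef, *_ = np.linalg.lstsq(A, z[labels == k], rcond=None)
        models.append(coef)
    return np.array(models)

def predict(X_new, centroids, models, divergence):
    """Route each point to its closest centroid, then apply that cluster's model."""
    dist = np.stack([divergence(X_new, c) for c in centroids], axis=1)
    nearest = dist.argmin(axis=1)
    A = np.column_stack([np.ones(len(X_new)), X_new])
    return np.sum(A * models[nearest], axis=1)

# Toy data: two well-separated clusters, each with its own noise-free linear model.
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(40, 2))
X1 = rng.normal(10.0, 1.0, size=(40, 2))
X = np.vstack([X0, X1])
z = np.concatenate([1 + 2 * X0[:, 0] - X0[:, 1],
                    -3 + 0.5 * X1[:, 0] + 4 * X1[:, 1]])
labels = np.repeat([0, 1], 40)  # stands in for the output of Step 1
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

models = fit_local_models(X, z, labels, K=2)
pred = predict(X, centroids, models, sq_euclid)
```

Repeating this for each of the $M$ partitions gives the $M$ candidate estimators that Step 3 aggregates.
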
Step 3: Consensual aggregation
Why consensual aggregation?
  Neither the distribution nor the clustering structure of the input data is available.
  It is not easy to choose the “best” one among $\{m^\ell\}_{\ell=1}^M$.
Diagram (Steps 1 to 3): BD$_\ell$ → $S^\ell$ → $m^\ell$ for $\ell = 1, \dots, M$, then all $m^\ell$ → Aggregation.
Classification
Example: Suppose we have 4 classifiers: $m = (m^1, m^2, m^3, m^4)$.
An observation $x$ with predictions: (1, 1, 0, 1).
ID   m^1   m^2   m^3   m^4   z
1    1     1     0     1     1
2    0     0     0     1     0
3    1     1     0     1     0
4    1     0     1     1     1
5    1     1     0     1     1
Table – Table of predictions.
Based on the following works:
  [Mojirsheibani, 1999]: Classical method (Mo1).
  [Mojirsheibani, 2000]: A kernel-based method (Mo2).
  [Fischer and Mougeot, 2019]: MixCOBRA.
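
The example can be worked through in a few lines, in the spirit of the classical method (Mo1): keep the training points on which all classifiers give exactly the same predictions as on $x$, then take a majority vote over their labels. This is a sketch; the function name `consensus_vote`, the tie-breaking rule and the behaviour when no training point agrees are assumptions, not taken from the cited papers.

```python
import numpy as np

def consensus_vote(train_preds, z, query_preds):
    """Majority vote over training points whose predictions all match the query's."""
    agree = np.all(train_preds == query_preds, axis=1)
    if not agree.any():
        return None  # no consensus point; how to handle this is left open here
    # Ties are broken in favour of class 1 (an arbitrary choice for this sketch).
    return int(z[agree].mean() >= 0.5)

# Table of predictions from the slide: one row per training point, columns m^1..m^4.
train_preds = np.array([
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
])
z = np.array([1, 0, 0, 1, 1])
query = np.array([1, 1, 0, 1])  # predictions of the 4 classifiers at x

matching_ids = np.flatnonzero(np.all(train_preds == query, axis=1)) + 1
label = consensus_vote(train_preds, z, query)
```

Points 1, 3 and 5 share the prediction pattern of $x$; their labels are 1, 0 and 1, so the vote returns class 1.
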
Regression
The aggregation takes the following form:
$$\mathrm{Agg}_n(x) = \sum_{i=1}^n W_{n,i}(x) z_i$$
[Biau et al., 2016]: with 0-1 weights (COBRA),
$$W_{n,i}(x) = \frac{\prod_{\ell=1}^M \mathbb{1}_{\{|m^\ell(x_i) - m^\ell(x)| < \varepsilon\}}}{\sum_{j=1}^n \prod_{\ell=1}^M \mathbb{1}_{\{|m^\ell(x_j) - m^\ell(x)| < \varepsilon\}}}$$
Kernel-based version of COBRA (kernel-based weights).
[Fischer and Mougeot, 2019]: MixCOBRA.
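
The 0-1 weights amount to averaging the responses of the training points on which every candidate estimator predicts within $\varepsilon$ of its prediction at $x$. The sketch below assumes the unanimous (product of indicators) form of COBRA; the function name `cobra_predict`, the toy numbers and the choice to return NaN when no training point qualifies are illustrative assumptions.

```python
import numpy as np

def cobra_predict(train_preds, z, query_preds, eps):
    """Aggregate with 0-1 weights: average z_i over unanimously eps-close points."""
    close = np.abs(train_preds - query_preds) < eps   # shape (n, M)
    unanimous = close.all(axis=1).astype(float)       # product of the M indicators
    total = unanimous.sum()
    if total == 0:
        return float("nan")  # assumption: no convention for 0/0 is imposed here
    weights = unanimous / total                       # W_{n,i}(x)
    return float(weights @ z)

# Rows: training points; columns: predictions of M = 2 candidate estimators.
train_preds = np.array([
    [1.0, 1.2],
    [2.0, 2.1],
    [1.1, 0.9],
    [5.0, 5.0],
])
z = np.array([1.0, 2.0, 1.4, 5.0])
query_preds = np.array([1.05, 1.0])  # predictions of the two estimators at x

agg = cobra_predict(train_preds, z, query_preds, eps=0.3)
```

Only the first and third training points are within 0.3 of the query on both estimators, so the aggregate is the mean of their responses, (1.0 + 1.4) / 2.
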
Applications
Bregman divergences
Euclidean: for all $x \in \mathcal{C} = \mathbb{R}^d$, $\phi(x) = \|x\|_2^2 = \sum_{i=1}^d x_i^2$,
$$d_\phi(x, y) = \|x - y\|_2^2$$
General Kullback-Leibler (GKL): $\phi(x) = \sum_{i=1}^d x_i \log(x_i)$, $\mathcal{C} = (0, +\infty)^d$,
$$d_\phi(x, y) = \sum_{i=1}^d \left[ x_i \log\frac{x_i}{y_i} - (x_i - y_i) \right]$$
Logistic: $\phi(x) = \sum_{i=1}^d [x_i \log(x_i) + (1 - x_i) \log(1 - x_i)]$, $\mathcal{C} = (0, 1)^d$,
$$d_\phi(x, y) = \sum_{i=1}^d \left[ x_i \log\frac{x_i}{y_i} + (1 - x_i) \log\frac{1 - x_i}{1 - y_i} \right]$$
Itakura-Saito: $\phi(x) = -\sum_{i=1}^d \log(x_i)$, $\mathcal{C} = (0, +\infty)^d$,
$$d_\phi(x, y) = \sum_{i=1}^d \left[ \frac{x_i}{y_i} - \log\frac{x_i}{y_i} - 1 \right]$$
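
The four closed forms can be written down directly. The sketch below uses hypothetical function names and two example points chosen inside $(0, 1)^d$ so that all four divergences are defined; it only illustrates that each divergence vanishes at $x = y$, is positive otherwise, and is in general not symmetric.

```python
import numpy as np

def euclidean(x, y):
    return float(np.sum((x - y) ** 2))

def gkl(x, y):
    # General Kullback-Leibler, defined on (0, +inf)^d.
    return float(np.sum(x * np.log(x / y) - (x - y)))

def logistic(x, y):
    # Logistic loss divergence, defined on (0, 1)^d.
    return float(np.sum(x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))))

def itakura_saito(x, y):
    # Itakura-Saito, defined on (0, +inf)^d.
    return float(np.sum(x / y - np.log(x / y) - 1))

divergences = {
    "Euclid": euclidean,
    "GKL": gkl,
    "Logistic": logistic,
    "Ita": itakura_saito,
}

x = np.array([0.2, 0.5])
y = np.array([0.4, 0.3])

forward = {name: d(x, y) for name, d in divergences.items()}   # d_phi(x, y)
backward = {name: d(y, x) for name, d in divergences.items()}  # d_phi(y, x)
```

Comparing `forward` and `backward` shows that only the Euclidean case is symmetric on this pair, which is why the order of the arguments matters when assigning points to centroids.
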
Simulated data
$M = 4$ and $K = 3$.
Figure – K-means with Bregman divergences on some simulated data.
Classification: numerical results
With 20 replications of each case.
$m^\ell$   $K = 1$   kernel used in $W_{n,i}(x)$
Distribution   Single   Euclid   GKL   Logit   Ita   Unif   Epan   Gaus   Triang   Bi-wgt   Tri-wgt
3.49 3.51 3.51 3.56 3.56 3.46
(0.89) (0.94) (0.88) (0.94) (0.91) (0.91)
18.86 8.58 7.42 4.09 3.92
Exp
(1.70) (1.77) (1.55) (1.08) (1.15)
2.91 2.63 2.49 2.70 2.56 2.46
(0.81) (0.70) (0.74) (0.75) (0.63) (0.66)
8.59 8.51 8.51 8.51 8.52 8.52
46.93 9.19 13.33 10.15
(1.37) (1.46) (1.47) (1.46) (1.47) (1.49)
8.45
Pois
(3.35) (1.27) (1.24) (1.84) (1.47)
8.51 8.46 8.44 8.42 8.57 8.44
(1.28) (1.11) (1.17) (1.15) (1.28) (1.13)
3.61 3.60 3.60 3.61 3.60 3.60
8.12
(1.15) (1.16) (1.16) (1.15) (1.16) (1.16)
19.90 12.57 4.71 3.94
Geom
(2.07) (2.39) (2.37) (1.15) (1.57)
3.76 3.52 2.94 3.48 3.47 3.40
(0.92) (1.11) (0.93) (1.09) (1.11) (1.06)
12.87 12.82 12.80 12.84 12.84 12.87
13.05
(1.60) (1.59) (1.56) (1.57) (1.57) (1.60)
49.00 12.37 12.40 14.14
2D Gaus
(2.52) (1.55) (1.50) (1.44) (1.61)
12.02 12.11 12.06 12.11 12.09 12.10
(1.30) (1.24) (1.35) (1.27) (1.23) (1.22)
11.08 11.01 11.04 11.03 11.00 11.00
(1.58) (1.52) (1.50) (1.50) (1.57) (1.55)
43.39 10.99 11.74 11.56 10.77
3D Gaus
(2.52) (1.40) (1.44) (1.45) (1.51)
10.23 9.93 10.04 9.83 9.84 9.76
(1.40) (1.47) (1.53) (1.47) (1.61) (1.61)
Table – Table of average testing misclassification error (1 unit = $10^{-2}$).
Regression: numerical results
$m^\ell$   $K = 1$   kernel used in $W_{n,i}(x)$
Distribution   Single   Euclid   GKL   Logit   Ita   Unif   Epan   Gaus   Triang   Bi-wgt   Tri-wgt
55.11 51.14 40.21 52.99 50.24 50.64
44.46
(15.85) (13.31) (14.40) (13.12) (13.74) (14.41)
107.73 69.82 58.93 44.54
Exp
(7.13) (6.84) (7.37) (7.37) (10.96)
56.34 52.62 39.12 51.31 51.20 51.98
(17.48) (17.82) (14.98) (19.55) (19.69) (20.12)
8.88 9.18 8.43 8.85 8.84 8.76
12.15
(1.65) (1.98) (2.18) (2.06) (2.03) (2.03)
26.76 10.16 8.22 16.72
Pois
(1.11) (1.91) (2.25) (1.61) (1.86)
9.73 9.61 9.13 9.64 9.40 9.43
(2.25) (1.86) (1.92) (1.91) (1.86) (1.93)
36.39 32.49 21.51 31.48 31.44 30.89
(13.81) (13.49) (11.79) (14.31) (13.51) (12.21)
70.45 29.99 22.94 31.94 18.33
Geom
(4.52) (5.95) (7.34) (6.21) (5.19)
31.83 27.90 26.82 28.45 24.58 17.82
(12.88) (14.20) (12.58) (13.28) (14.02) (13.21)
7.09 6.57 5.57 6.20 6.41 6.33
9.38
(2.55) (1.78) (0.49) (1.72) (1.76) (1.75)
21.98 5.63 6.46 19.36
2D Gaus
(1.20) (1.26) (1.81) (1.11) (1.86)
9.75 7.70 6.42 7.45 7.47 7.34
(1.30) (2.24) (1.49) (2.42) (2.28) (2.31)
18.16 18.20 16.94 18.25 18.05 18.00
22.96
(3.42) (3.45) (4.06) (3.41) (3.50) (3.49)
53.55 19.89 20.93 23.71
3D Gaus
(1.74) (3.49) (2.97) (2.70) (2.74)
19.24 18.52 17.51 18.64 18.19 18.42
(3.54) (4.02) (3.64) (4.37) (3.91) (3.68)
Table – Table of average testing RMSE.
Real data
Air compressor
Given by [Cadet et al., 2005].
Six predictors: air temperature, input pressure, output pressure, flow and water temperature.
Response variable: power consumption.
$K$ is not available!
Results of air compressor data
For $K = 1$: RMSE = 178.67.
K    Euclid    GKL       Logistic  Ita       COBRA     MixCOBRA*
2    158.85    158.90    159.35    158.96    153.34    116.69
     (6.42)    (6.48)    (6.71)    (6.41)    (6.72)    (5.86)
3    157.38    157.24    156.99    157.24    153.69    117.45
     (6.95)    (6.84)    (6.65)    (6.85)    (6.64)    (5.55)
4    154.33    153.96    153.99    154.07    152.09    117.16
     (6.69)    (6.74)    (6.45)    (7.01)    (6.58)    (5.99)
5    153.18    153.19    152.95    152.25    151.05    117.55
     (6.91)    (6.77)    (6.57)    (6.70)    (6.76)    (5.90)
6    151.16    151.67    151.89    151.75    150.27    117.74
     (6.91)    (6.96)    (6.62)    (6.57)    (6.82)    (5.86)
7    151.08    150.99    152.81    151.85    150.46    117.58
     (6.77)    (6.84)    (7.11)    (6.61)    (6.87)    (6.15)
8    151.27    151.09    152.07    150.90    150.21    117.91
     (7.17)    (7.01)    (6.65)    (6.96)    (7.03)    (5.83)
Table – RMSE of air compressor data (standard deviations in parentheses).
* Consensual aggregation method integrating input $X$ into the weight [Fischer and Mougeot, 2019].
Thank you
Questions?