SLIDE 1

Consensual Aggregation of Clusters based on Bregman Divergences to Improve Predictive Models

Sothea Has
Sorbonne Université, LPSM, Université Paris-Diderot
Mathilde Mougeot, Aurélie Fischer
sothea.has@lpsm.paris

2 April 2019

1/21

SLIDE 2

Overview

  • A. Introduction
  • B. Construction of a predictive model
      1. K-means algorithm with Bregman divergences
      2. Construction of candidate estimators
      3. Consensual aggregation
  • C. Applications
      1. Simulated data
      2. Real data

2/21

SLIDE 3

Consider an example...

[Figure, two panels: input data with 3 clusters; a different model on each cluster.]

3/21

SLIDE 4

Introduction

Setting: (X, Z) ∈ 𝒳 × 𝒵: input-output data.

𝒳 = ℝ^d: input space.
𝒵 = ℝ for regression, {0, 1} for binary classification.

T_n = {(x_i, z_i)}_{i=1}^n: iid learning data.

Objective: construct a good predictive model for regression or classification.

Assumptions: 𝒳 is composed of more than one group or cluster; the number of clusters K is available; there exist different underlying models on these clusters.

4/21

SLIDE 5

Construction of a predictive model

There are 3 important steps:

  • 1. K-means algorithm with Bregman divergences
  • 2. Construction of candidate estimators
  • 3. Consensual aggregation

5/21

SLIDE 6

Bregman divergences (BD) [Bregman, 1967]

φ : C ⊂ ℝ^d → ℝ, strictly convex and of class C¹; then for any (x, y) ∈ C × int(C),

d_φ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩

[Plot of φ with its tangent line at y, φ(y) + ⟨x − y, ∇φ(y)⟩; d_φ(x, y) is the vertical gap at x between φ(x) and the tangent.]

Figure – Graphical interpretation of Bregman divergences.
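As an illustration (a minimal sketch, not from the slides), the definition translates directly into code; the check below uses φ(x) = ‖x‖₂², whose divergence is the squared Euclidean distance:

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# phi(x) = ||x||^2 recovers the squared Euclidean distance.
phi = lambda v: np.dot(v, v)
grad_phi = lambda v: 2.0 * v

x, y = np.array([1.0, 2.0]), np.array([0.0, 3.0])
assert np.isclose(bregman_divergence(phi, grad_phi, x, y), np.sum((x - y) ** 2))
```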

6/21

SLIDE 7

Exponential families (EF)

X is a member of an exponential family E_ψ if

f(x|θ) = h(x) exp(⟨θ, T(x)⟩ − ψ(θ)), θ ∈ Θ.

Examples:
Continuous cases: exponential, normal, gamma, beta, ...
Discrete cases: Bernoulli, Poisson, binomial, multinomial, ...

7/21

SLIDE 8

Relationship between BD and EF

Theorem [Banerjee et al., 2005]

If X is a member of an exponential family E_ψ and if φ is the convex conjugate of ψ, defined by

φ(x) = sup_y {⟨x, y⟩ − ψ(y)},

then there exists a unique Bregman divergence d_φ such that

f(x|θ) = h(x) exp(−d_φ(T(x), E[T(X)]) + φ(T(x))).

Examples:
Exponential distribution: d_φ(x, y) = x/y − log(x/y) − 1 (Itakura-Saito).
Poisson distribution: d_φ(x, y) = x log(x/y) − (x − y) (generalized Kullback-Leibler).
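As a concrete check of the theorem (my own illustration, with the convention φ(x) = −log x, which makes the carrier h the constant e⁻¹), the exponential density λe^{−λx} factors through the Itakura-Saito divergence evaluated at the mean E[X] = 1/λ:

```python
import numpy as np

d_is = lambda x, y: x / y - np.log(x / y) - 1.0   # Itakura-Saito divergence
phi = lambda x: -np.log(x)                        # its generator (up to a constant)

lam = 2.5                 # rate parameter of the exponential distribution
mu = 1.0 / lam            # mean E[X]
x = np.linspace(0.1, 5.0, 50)

density = lam * np.exp(-lam * x)                             # f(x | lambda)
bregman_form = np.exp(-1.0) * np.exp(-d_is(x, mu) + phi(x))  # h(x) = e^{-1}
assert np.allclose(density, bregman_form)
```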

8/21

SLIDE 9

Step 1: K-means algorithm with Bregman divergences

Perform the K-means algorithm with M options of Bregman divergences. Each BD_ℓ gives an associated partition S^ℓ = {S^ℓ_k}_{k=1}^K.

[Diagram: BD_1, BD_2, ..., BD_M each yielding a partition S^1, S^2, ..., S^M (Step 1).]
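A minimal sketch of Step 1 (my own code and naming; the slides do not give an implementation). By [Banerjee et al., 2005], the Bregman centroid of a cell is always the arithmetic mean, so only the assignment step of Lloyd's algorithm changes with the divergence:

```python
import numpy as np

def kmeans_bregman(X, K, div, n_iter=100, seed=None):
    """Lloyd's algorithm with a Bregman divergence div(x, c) in the
    assignment step; centers stay arithmetic means (Banerjee et al., 2005)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment: each point goes to its closest center in divergence.
        dists = np.array([[div(x, c) for c in centers] for x in X])
        labels = dists.argmin(axis=1)
        # Update: the minimizer of the within-cell divergence is the mean.
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example: GKL divergence on positive 2-d data, K = 3 clusters.
gkl = lambda x, y: np.sum(x * np.log(x / y) - (x - y))
X = np.random.default_rng(0).gamma(shape=2.0, scale=1.0, size=(200, 2))
labels, centers = kmeans_bregman(X, K=3, div=gkl, seed=0)
```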

9/21

SLIDE 10

Step 2: Construction of candidate estimators

Suppose that for all ℓ, k, the cell S^ℓ_k ∈ S^ℓ contains enough data points.

For all ℓ, k: construct an estimator m^ℓ_k on S^ℓ_k.

m^ℓ = {m^ℓ_k}_{k=1}^K is the candidate estimator associated with BD_ℓ.

[Diagram: BD_1, ..., BD_M → S^1, ..., S^M → m^1, ..., m^M (Steps 1 and 2).]
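Step 2 as a sketch (again my own illustration: the slides leave the local estimators unspecified, so scikit-learn's LinearRegression stands in for m^ℓ_k). One model is fitted per cell, and a new point is routed to the nearest center under the same divergence:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_candidate(X, z, labels, K):
    """Fit one local model per cell S_k; returns the family m = {m_k}."""
    return [LinearRegression().fit(X[labels == k], z[labels == k])
            for k in range(K)]

def predict_candidate(models, centers, div, x):
    """Route x to its nearest center (same divergence as Step 1), then predict."""
    k = int(np.argmin([div(x, c) for c in centers]))
    return models[k].predict(x.reshape(1, -1))[0]

# Toy usage: a Euclidean partition with a different linear model per cell.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 4.0, size=(300, 2))
z = np.where(X[:, 0] < 2.0, 3.0 * X[:, 1], -2.0 * X[:, 1]) + rng.normal(0, 0.1, 300)
centers = np.array([[1.0, 2.0], [3.0, 2.0]])
labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
euclid = lambda x, y: np.sum((x - y) ** 2)
models = fit_candidate(X, z, labels, K=2)
print(predict_candidate(models, centers, euclid, np.array([0.5, 1.0])))
```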

10/21

SLIDE 11

Step 3: Consensual aggregation

Why consensual aggregation? Neither the distribution nor the clustering structure of the input data is available, so it is not easy to choose the "best" estimator among {m^ℓ}_{ℓ=1}^M.

[Diagram: BD_1, ..., BD_M → S^1, ..., S^M → m^1, ..., m^M → aggregation (Steps 1, 2, and 3).]
11/21

SLIDE 12

Classification

Example: suppose we have 4 classifiers m = (m_1, m_2, m_3, m_4) and an observation x with predictions (1, 1, 0, 1).

[Table: predictions m_1(x_i), ..., m_4(x_i) and true labels z_i for five training observations.]

Table – Table of predictions.

Based on the following works:
[Mojirsheibani, 1999]: classical method (Mo1), sketched below.
[Mojirsheibani, 2000]: a kernel-based method (Mo2).
[Fischer and Mougeot, 2019]: MixCOBRA.
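A minimal sketch of the classical rule Mo1 of [Mojirsheibani, 1999] as I read it; the training predictions below are hypothetical stand-ins for the table above. A query point keeps the training points on which all M classifiers output the same labels as at x, then takes a majority vote:

```python
import numpy as np

def mojirsheibani_combine(preds_train, z_train, preds_x):
    """Mo1: majority vote over training points whose M predicted labels
    all coincide with the predictions at the query point x.

    preds_train: (n, M) array of m_l(x_i); preds_x: (M,) array of m_l(x)."""
    agree = np.all(preds_train == preds_x, axis=1)
    if not agree.any():                 # no unanimous match: plain majority vote
        agree = np.ones(len(z_train), dtype=bool)
    return np.bincount(z_train[agree]).argmax()

# Hypothetical training predictions; x has predictions (1, 1, 0, 1) as on the slide.
preds_train = np.array([[1, 1, 0, 1],
                        [1, 1, 1, 1],
                        [1, 1, 0, 1],
                        [0, 1, 0, 1],
                        [1, 1, 0, 1]])
z_train = np.array([1, 0, 1, 0, 1])
print(mojirsheibani_combine(preds_train, z_train, np.array([1, 1, 0, 1])))  # -> 1
```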

12/21

SLIDE 13

Regression

The aggregation takes the following form:

Agg_n(x) = Σ_{i=1}^n W_{n,i}(x) z_i

[Biau et al., 2016]: 0-1 weights (COBRA),

W_{n,i}(x) = Π_{ℓ=1}^M 1{|m^ℓ(x_i) − m^ℓ(x)| < ε} / Σ_{j=1}^n Π_{ℓ=1}^M 1{|m^ℓ(x_j) − m^ℓ(x)| < ε}.

A kernel-based variant of COBRA uses kernel-based weights instead.
[Fischer and Mougeot, 2019]: MixCOBRA.
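The 0-1-weight COBRA aggregate, in code (a sketch; ε and the toy data are my choices). The weights reduce to a uniform average of z_i over the training points whose M candidate predictions all fall within ε of the predictions at x:

```python
import numpy as np

def cobra_aggregate(preds_train, z_train, preds_x, eps):
    """Agg_n(x) = sum_i W_{n,i}(x) z_i with unanimity 0-1 weights: a training
    point counts only if |m_l(x_i) - m_l(x)| < eps for every l = 1..M."""
    close = np.all(np.abs(preds_train - preds_x) < eps, axis=1)  # (n,)
    if not close.any():
        return z_train.mean()          # degenerate case: empty consensus set
    return z_train[close].mean()       # uniform weights over retained points

# preds_train[i, l] = m_l(x_i); preds_x[l] = m_l(x).
rng = np.random.default_rng(0)
preds_train = rng.normal(size=(100, 4))
z_train = preds_train.mean(axis=1) + rng.normal(0, 0.05, 100)
print(cobra_aggregate(preds_train, z_train, preds_train[0], eps=0.5))
```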

13/21

SLIDE 14

Applications

Bregman divergences

  • Euclidean: φ(x) = ‖x‖₂² = Σ_{i=1}^d x_i², C = ℝ^d,
    d_φ(x, y) = ‖x − y‖₂².
  • Generalized Kullback-Leibler (GKL): φ(x) = Σ_{i=1}^d x_i log x_i, C = (0, +∞)^d,
    d_φ(x, y) = Σ_{i=1}^d [x_i log(x_i/y_i) − (x_i − y_i)].
  • Logistic: φ(x) = Σ_{i=1}^d [x_i log x_i + (1 − x_i) log(1 − x_i)], C = (0, 1)^d,
    d_φ(x, y) = Σ_{i=1}^d [x_i log(x_i/y_i) + (1 − x_i) log((1 − x_i)/(1 − y_i))].
  • Itakura-Saito: φ(x) = −Σ_{i=1}^d log x_i, C = (0, +∞)^d,
    d_φ(x, y) = Σ_{i=1}^d [x_i/y_i − log(x_i/y_i) − 1].

These four divergences are written out in code after this list.
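The four divergences as vectorized functions (a minimal sketch; checking that inputs lie in the stated domain C is left to the caller):

```python
import numpy as np

# Each d(x, y) sums coordinatewise terms; x, y must lie in the stated domain C.
def euclidean(x, y):                     # C = R^d
    return np.sum((x - y) ** 2)

def gkl(x, y):                           # C = (0, +inf)^d
    return np.sum(x * np.log(x / y) - (x - y))

def logistic(x, y):                      # C = (0, 1)^d
    return np.sum(x * np.log(x / y)
                  + (1 - x) * np.log((1 - x) / (1 - y)))

def itakura_saito(x, y):                 # C = (0, +inf)^d
    return np.sum(x / y - np.log(x / y) - 1)
```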
14/21
SLIDE 15

Simulated data

M = 4 and K = 3.

Figure – K-means with Bregman divergences on some simulated data.

15/21

SLIDE 16

Classification: numerical results

With 20 replications of each case.

Columns: Single (K = 1); candidate estimators m^ℓ (Euclid, GKL, Logit, Ita); kernels used in W_{n,i}(x) (Unif, Epan, Gaus, Triang, Bi-wgt, Tri-wgt). Standard deviations in parentheses; each distribution has a second row of results for the kernel columns.

| Distribution | Single | Euclid | GKL | Logit | Ita | Unif | Epan | Gaus | Triang | Bi-wgt | Tri-wgt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Exp | 18.86 (0.89) | 8.58 (0.94) | 7.42 (0.88) | 4.09 (0.94) | 3.92 (0.91) | 3.49 (0.91) | 3.51 (1.70) | 3.46 (1.77) | 3.51 (1.55) | 3.56 (1.08) | 3.56 (1.15) |
| | | | | | | 2.91 (0.81) | 2.63 (0.70) | 2.49 (0.74) | 2.70 (0.75) | 2.56 (0.63) | 2.46 (0.66) |
| Pois | 46.93 (1.37) | 9.19 (1.46) | 8.45 (1.47) | 13.33 (1.46) | 10.15 (1.47) | 8.59 (1.49) | 8.51 (3.35) | 8.51 (1.27) | 8.51 (1.24) | 8.52 (1.84) | 8.52 (1.47) |
| | | | | | | 8.51 (1.28) | 8.46 (1.11) | 8.44 (1.17) | 8.42 (1.15) | 8.57 (1.28) | 8.44 (1.13) |
| Geom | 19.90 (1.15) | 12.57 (1.16) | 4.71 (1.16) | 3.94 (1.15) | 8.12 (1.16) | 3.61 (1.16) | 3.60 (2.07) | 3.60 (2.39) | 3.61 (2.37) | 3.60 (1.15) | 3.60 (1.57) |
| | | | | | | 3.76 (0.92) | 3.52 (1.11) | 2.94 (0.93) | 3.48 (1.09) | 3.47 (1.11) | 3.40 (1.06) |
| 2D Gaus | 49.00 (1.60) | 12.37 (1.59) | 12.40 (1.56) | 14.14 (1.57) | 13.05 (1.57) | 12.87 (1.60) | 12.82 (2.52) | 12.80 (1.55) | 12.84 (1.50) | 12.84 (1.44) | 12.87 (1.61) |
| | | | | | | 12.02 (1.30) | 12.11 (1.24) | 12.06 (1.35) | 12.11 (1.27) | 12.09 (1.23) | 12.10 (1.22) |
| 3D Gaus | 43.39 (1.58) | 10.77 (1.52) | 10.99 (1.50) | 11.74 (1.50) | 11.56 (1.57) | 11.08 (1.55) | 11.01 (2.52) | 11.00 (1.40) | 11.00 (1.44) | 11.04 (1.45) | 11.03 (1.51) |
| | | | | | | 10.23 (1.40) | 9.93 (1.47) | 9.76 (1.53) | 10.04 (1.47) | 9.83 (1.61) | 9.84 (1.61) |

Table – Average testing misclassification error (1 unit = 10⁻²).

16/21

SLIDE 17

Regression: numerical results

Columns as in the classification table: Single (K = 1); candidate estimators m^ℓ; kernels used in W_{n,i}(x). Standard deviations in parentheses; each distribution has a second row of results for the kernel columns.

| Distribution | Single | Euclid | GKL | Logit | Ita | Unif | Epan | Gaus | Triang | Bi-wgt | Tri-wgt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Exp | 107.73 (15.85) | 69.82 (13.31) | 58.93 (14.40) | 44.54 (13.12) | 44.46 (13.74) | 55.11 (14.41) | 51.14 (7.13) | 40.21 (6.84) | 52.99 (7.37) | 50.24 (7.37) | 50.64 (10.96) |
| | | | | | | 56.34 (17.48) | 52.62 (17.82) | 39.12 (14.98) | 51.31 (19.55) | 51.20 (19.69) | 51.98 (20.12) |
| Pois | 26.76 (1.65) | 10.16 (1.98) | 8.22 (2.18) | 16.72 (2.06) | 12.15 (2.03) | 8.88 (2.03) | 9.18 (1.11) | 8.43 (1.91) | 8.85 (2.25) | 8.84 (1.61) | 8.76 (1.86) |
| | | | | | | 9.73 (2.25) | 9.61 (1.86) | 9.13 (1.92) | 9.64 (1.91) | 9.40 (1.86) | 9.43 (1.93) |
| Geom | 70.45 (13.81) | 29.99 (13.49) | 18.33 (11.79) | 22.94 (14.31) | 31.94 (13.51) | 36.39 (12.21) | 32.49 (4.52) | 21.51 (5.95) | 31.48 (7.34) | 31.44 (6.21) | 30.89 (5.19) |
| | | | | | | 31.83 (12.88) | 27.90 (14.20) | 17.82 (12.58) | 26.82 (13.28) | 28.45 (14.02) | 24.58 (13.21) |
| 2D Gaus | 21.98 (2.55) | 5.63 (1.78) | 6.46 (0.49) | 19.36 (1.72) | 9.38 (1.76) | 7.09 (1.75) | 6.57 (1.20) | 5.57 (1.26) | 6.20 (1.81) | 6.41 (1.11) | 6.33 (1.86) |
| | | | | | | 9.75 (1.30) | 7.70 (2.24) | 6.42 (1.49) | 7.45 (2.42) | 7.47 (2.28) | 7.34 (2.31) |
| 3D Gaus | 53.55 (3.42) | 19.89 (3.45) | 20.93 (4.06) | 23.71 (3.41) | 22.96 (3.50) | 18.16 (3.49) | 18.20 (1.74) | 16.94 (3.49) | 18.25 (2.97) | 18.05 (2.70) | 18.00 (2.74) |
| | | | | | | 19.24 (3.54) | 18.52 (4.02) | 17.51 (3.64) | 18.64 (4.37) | 18.19 (3.91) | 18.42 (3.68) |

Table – Average testing RMSE.

17/21

SLIDE 18

Real data

Air compressor

Data given by [Cadet et al., 2005]. Six predictors, including air temperature, input pressure, output pressure, flow, and water temperature. Response variable: power consumption.

K is not available!

18/21

SLIDE 19

Results of air compressor data

For K = 1: RMSE = 178.67.

| K | Euclid | GKL | Logistic | Ita | COBRA | MixCOBRA* |
|---|---|---|---|---|---|---|
| 2 | 158.85 (6.42) | 158.90 (6.48) | 159.35 (6.71) | 158.96 (6.41) | 153.34 (6.72) | 116.69 (5.86) |
| 3 | 157.38 (6.95) | 157.24 (6.84) | 156.99 (6.65) | 157.24 (6.85) | 153.69 (6.64) | 117.45 (5.55) |
| 4 | 154.33 (6.69) | 153.96 (6.74) | 153.99 (6.45) | 154.07 (7.01) | 152.09 (6.58) | 117.16 (5.99) |
| 5 | 153.18 (6.91) | 153.19 (6.77) | 152.95 (6.57) | 152.25 (6.70) | 151.05 (6.76) | 117.55 (5.90) |
| 6 | 151.16 (6.91) | 151.67 (6.96) | 151.89 (6.62) | 151.75 (6.57) | 150.27 (6.82) | 117.74 (5.86) |
| 7 | 151.08 (6.77) | 150.99 (6.84) | 152.81 (7.11) | 151.85 (6.61) | 150.46 (6.87) | 117.58 (6.15) |
| 8 | 151.27 (7.17) | 151.09 (7.01) | 152.07 (6.65) | 150.90 (6.96) | 150.21 (7.03) | 117.91 (5.83) |

Table – RMSE of air compressor data.

* Consensual aggregation method integrating the input X into the weights [Fischer and Mougeot, 2019].

19/21

SLIDE 20

Thank you! Questions?

20/21

SLIDE 21

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749.

Biau, G., Fischer, A., Guedj, B., and Malley, J. D. (2016). COBRA: a combined regression strategy. Journal of Multivariate Analysis, 146:18–28.

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217.

Cadet, O., Harper, C., and Mougeot, M. (2005). Monitoring energy performance of compressors with an innovative auto-adaptive approach. In Instrumentation, Systems and Automation (ISA), Chicago.

Fischer, A., Has, S., and Mougeot, M. (2018). Consensual aggregation of clusters based on Bregman divergences to improve predictive models.

Fischer, A. and Mougeot, M. (2019). Aggregation using input-output trade-off. Journal of Statistical Planning and Inference, 200:1–19.

Mojirsheibani, M. (1999). Combining classifiers via discretization. Journal of the American Statistical Association, 94(446):600–609.

Mojirsheibani, M. (2000). A kernel-based combined classification rule. Statistics & Probability Letters, 48(4):411–419.

21/21