

Slide 1

COMPSTAT 2010 - Paris - August 22-27

Ordinary Least Squares for Histogram Data based on Wasserstein Distance

Rosanna Verde Antonio Irpino

Dipartimento di Studi Europei e Mediterranei Seconda Università degli Studi di Napoli (ITALY) [rosanna.verde] [antonio.irpino]@unina2.it

Slide 2

Outline

- Histogram data
- A regression model for histogram variables
- Properties of the Wasserstein distance
- Ordinary Least Squares fitting
- Tools for the interpretation
- An application on real data

Slide 3

Sources of histogram data

- Results of summary/clustering procedures:
  - from surveys
  - from large databases
  - from sensors (temperatures, pollutant concentration, network activity)
- Data streams: descriptions of time windows
- Image analysis: color bandwidths
- Confidentiality data: summary (non-punctual) data

[Figure: two example histograms with bins and relative frequencies]

Slide 4

Histogram data as a particular case of modal symbolic descriptions [Bock and Diday (2000) ]

Histogram data are a kind of symbolic representation that describes an individual by means of a histogram. In Bock and Diday (2000), the histogram variable is one of the three definitions of modal numerical variables:

- [Histogram variable] The description is a classic histogram, where the support is partitioned into intervals and each interval is weighted by its empirical density;
- [Empirical distribution function variable] The description is given by an empirical distribution function;
- [Model of distribution variable] The description is given by a predefined model of a random variable.

Slide 5

Histogram variable

Let Y be a continuous variable defined on a finite support $[y_{min}, y_{max}]$, where $y_{min}$ and $y_{max}$ are the minimum and maximum values of the variable domain. The support is partitioned into a set of $H$ contiguous intervals (bins) $I_1, \ldots, I_H$. Given $n$ observations of the variable Y, each semi-open interval $I_h = [a_h, a_{h+1})$ is associated with its empirical frequency $\pi_h$ (the proportion of the $n$ observations falling into the bin). A histogram of Y is the representation in which each pair $(I_h, \pi_h)$, for $h = 1, \ldots, H$, is drawn as a vertical bar with base interval $I_h$ along the horizontal axis and area proportional to $\pi_h$.
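As an illustration, the pairs $(I_h, \pi_h)$ can be computed from raw data in a few lines. This is a minimal sketch: the function name and the equal-width binning rule are illustrative choices, not prescribed by the slides.

```python
def histogram(sample, h=4):
    """Partition the support [min, max] into h equal-width bins and return
    (interval, weight) pairs, where each weight pi_h is the empirical
    relative frequency of the bin."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / h
    counts = [0] * h
    for v in sample:
        j = min(int((v - lo) / width), h - 1)  # the maximum falls into the last bin
        counts[j] += 1
    n = len(sample)
    return [((lo + j * width, lo + (j + 1) * width), counts[j] / n)
            for j in range(h)]

print(histogram([1, 2, 2, 3, 4, 4, 4, 5], h=2))
# -> [((1.0, 3.0), 0.375), ((3.0, 5.0), 0.625)]
```

The weights always sum to one, so the output is a valid histogram description of the sample.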

Slide 6

A Regression model for histogram variables

In order to study the dependence of a histogram variable Y (dependent) on another histogram variable X (independent), we introduce a new regression approach based on the Ordinary Least Squares estimation method. Consistently with the nature of the variables, we propose to compute the squared deviations in the least squares function using the Wasserstein distance.

Slide 7

A Regression model for histogram variables

Data = Model Fit + Residual

Linear regression is a general method for estimating/describing the association between a continuous outcome variable (dependent) and one or more predictors in a single equation. This is an easy conceptual task with classic data, but what does it mean when dealing with histogram data?

Slide 8

Simple linear regression

[Figure: simple linear regression with classic data (left) and histogram data (right)]

Slide 9

Regression between histograms: a proposal

A solution was given by Billard and Diday (2006):

- The model fits a linear regression line through the mixture of the n bivariate distributions.
- Given a point value of X, it is possible to predict the point value of Y.

Slide 10

Regression between histograms: our approach

Given a histogram description for X, we search for a linear transformation of the description which allows us to predict the histogram description of Y.

For example: given the temperature histogram observed in a region during a month, is it possible to predict the distribution of the temperature of another month using a linear transformation of the histogram variable?

A histogram by a histogram.

Slide 11

Wasserstein distance

We propose to use the Wasserstein-Kantorovich metric in the least squares function, specifically the derived L2 Mallows' distance between two quantile functions:

$d_W(x_i, x_j) = \sqrt{\int_0^1 \left( F_i^{-1}(t) - F_j^{-1}(t) \right)^2 dt}$
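Numerically, this distance can be approximated by evaluating the two quantile functions on a common grid of t values. The sketch below assumes each distribution is given as a raw sample; the function names and the midpoint grid are illustrative choices.

```python
import math

def quantiles_from_sample(sample, k=200):
    """Empirical quantile function of a sample, evaluated at
    t = (j + 0.5)/k for j = 0..k-1 (midpoint rule)."""
    s = sorted(sample)
    n = len(s)
    return [s[min(int(((j + 0.5) / k) * n), n - 1)] for j in range(k)]

def wasserstein_l2(qx, qy):
    """d_W(x, y) = sqrt( integral_0^1 (F_x^{-1}(t) - F_y^{-1}(t))^2 dt ),
    approximated as a mean over the shared t-grid."""
    k = len(qx)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(qx, qy)) / k)

qa = quantiles_from_sample([1, 2, 3, 4, 5])
qb = quantiles_from_sample([3, 4, 5, 6, 7])   # same shape, shifted by 2
print(round(wasserstein_l2(qa, qb), 6))        # -> 2.0 (pure location shift)
```

For a pure location shift the distance equals the shift itself, which is a convenient sanity check.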
Slide 12

An interpretative decomposition of the L2-Wasserstein metric

$d_W^2(x_i, x_j) = \underbrace{(\mu_i - \mu_j)^2}_{Location} + \underbrace{(\sigma_i - \sigma_j)^2}_{Size} + \underbrace{2\,\sigma_i \sigma_j \left(1 - \rho(x_i, x_j)\right)}_{Shape}$

[Figure: QQ plot comparing the two distributions]

If the two distributions have the same shape:

$d_W^2(x_i, x_j) = \underbrace{(\mu_i - \mu_j)^2}_{Location} + \underbrace{(\sigma_i - \sigma_j)^2}_{Size}$

If they have the same size and shape:

$d_W^2(x_i, x_j) = \underbrace{(\mu_i - \mu_j)^2}_{Location}$
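The three components can be checked numerically: for any pair of quantile functions sampled on the same t-grid, location + size + shape reproduces the squared distance exactly. The helper names below are illustrative, not from the slides.

```python
import math

def mean_q(q):
    return sum(q) / len(q)

def std_q(q):
    m = mean_q(q)
    return math.sqrt(sum((v - m) ** 2 for v in q) / len(q))

def corr_q(qx, qy):
    mx, my = mean_q(qx), mean_q(qy)
    cov = sum((a - mx) * (b - my) for a, b in zip(qx, qy)) / len(qx)
    return cov / (std_q(qx) * std_q(qy))

def dw2_components(qx, qy):
    """Return (location, size, shape) with
    d_W^2 = (mu_i - mu_j)^2 + (s_i - s_j)^2 + 2 s_i s_j (1 - rho)."""
    mx, my = mean_q(qx), mean_q(qy)
    sx, sy = std_q(qx), std_q(qy)
    rho = corr_q(qx, qy)
    return ((mx - my) ** 2, (sx - sy) ** 2, 2 * sx * sy * (1 - rho))

qx = [1.0, 2.0, 4.0, 8.0]   # quantile values at t = 0.125, 0.375, 0.625, 0.875
qy = [2.0, 3.0, 5.0, 7.0]
dw2 = sum((a - b) ** 2 for a, b in zip(qx, qy)) / len(qx)
loc, size, shape = dw2_components(qx, qy)
print(abs(dw2 - (loc + size + shape)) < 1e-12)   # the three parts add up -> True
```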

Slide 13

Some simplifications and notations

$x_i(t) = F_i^{-1}(t),\ t \in [0,1]$ : quantile function of the i-th macro-unit $x_i$ (histogram/distribution data)

$\bar{x}_i = \mu_i = \int_0^1 x_i(t)\,dt$ and $\sigma_i^2 = \int_0^1 \left(x_i(t) - \bar{x}_i\right)^2 dt$ : mean and variance of the distribution/histogram data

$\bar{x}(t) = \frac{1}{n}\sum_{i=1}^n x_i(t)$ : average distribution/histogram data

$\rho(x_i, x_j) = \frac{\int_0^1 \left(x_i(t) - \bar{x}_i\right)\left(x_j(t) - \bar{x}_j\right) dt}{\sigma_i \sigma_j}$ : correlation between a pair of distribution/histogram data $(x_i, x_j)$

Slide 14

Fitting with a linear model

Given two variables Y and X, the regression model proposed here performs the linear transformation of X that best fits Y, keeping the error as close to zero as possible:

$\hat{y}_i(t) = \alpha + \beta\,x_i(t), \quad t \in [0,1]$

$\epsilon_i(t) = y_i(t) - \hat{y}_i(t)$

Slide 15

The error term in the classic case

Classic case (Euclidean norm):

$\epsilon_i^2 = d_E^2\left(y_i, \hat{y}_i\right) = \left(y_i - \hat{y}_i\right)^2$

[Figure: scatter plot of $(x_i, y_i)$ with the fitted line; the error is the vertical gap between the observed $y_i$ and the predicted $\hat{y}_i$]

Slide 16

The error term of the model (our approach)

Histogram case (Wasserstein distance):

$d_W^2\left(y_i, \hat{y}_i\right) = \int_0^1 \left(y_i(t) - \hat{y}_i(t)\right)^2 dt, \quad t \in [0,1]$

[Figure: observed quantile function $y_i(t)$, predicted $\hat{y}_i(t)$ and regressor $x_i(t)$; the (squared) error is $d_W^2\left(y_i, \hat{y}_i\right)$]

Slide 17

Fitting a linear model: histograms

We propose to find a linear transformation of the quantile function of $x_i$ (histogram data) in order to predict the quantile function of $y_i$, i.e.:

$\hat{y}_i(t) = f\left(x_i(t)\right) = \alpha + \beta\,x_i(t), \quad t \in [0,1]$

It is worth noting that the linear transformation is unique: the parameters $\alpha$ and $\beta$ are estimated over all the $i$ macro-units $x_i$ and $y_i$.

A first problem: only if $\beta > 0$ is the result a valid (non-decreasing) quantile function. In order to overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance.

Slide 18

Solution to $\beta < 0$

The quantile function can be decomposed as:

$x_i(t) = \bar{x}_i + x_i^c(t)$, where $x_i^c(t) = x_i(t) - \bar{x}_i$ is the centered quantile function.

Then, we propose the following model:

$y_i(t) = \beta_0 + \beta_1\,\bar{x}_i + \beta_2\,x_i^c(t) + \epsilon_i(t)$

Using the Wasserstein distance it is possible to set up an OLS method that returns three coefficients. We demonstrate that $\beta_2$ is always greater than or equal to zero.
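A quick sketch of why the reformulation helps: with the centered model, a negative slope on the means never breaks the monotonicity of the predicted quantile function, as long as the coefficient on the centered part is non-negative. Names and values below are illustrative.

```python
def predict_quantile(qx, b0, b1, b2):
    """yhat(t) = b0 + b1*mean(x) + b2*(x(t) - mean(x)); this is a valid
    (non-decreasing) quantile function whenever b2 >= 0, because the
    centered part x(t) - mean(x) is itself non-decreasing in t."""
    m = sum(qx) / len(qx)
    return [b0 + b1 * m + b2 * (v - m) for v in qx]

q = [1.0, 2.0, 4.0, 8.0]                     # a non-decreasing quantile function
pred = predict_quantile(q, 5.0, -2.0, 0.3)   # b1 may be negative; b2 >= 0
print(all(a <= b for a, b in zip(pred, pred[1:])))   # -> True
```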

Slide 19

The error term: a property of the Wasserstein distance decomposition

The (squared) error can be written according to its two components:

$d_W^2\left(y_i, \hat{y}_i\right) = \int_0^1 \left(y_i(t) - \hat{y}_i(t)\right)^2 dt = \left(\bar{y}_i - \bar{\hat{y}}_i\right)^2 + d_W^2\left(y_i^c, \hat{y}_i^c\right)$

Slide 20

Ordinary Least Squares

$(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2) = \underset{(\beta_0, \beta_1, \beta_2)}{\arg\min}\; f(\beta_0, \beta_1, \beta_2) = \underset{(\beta_0, \beta_1, \beta_2)}{\arg\min}\; \sum_{i=1}^n d_W^2\left(y_i(t),\; \beta_0 + \beta_1\,\bar{x}_i + \beta_2\,x_i^c(t)\right)$

$f(\beta_0, \beta_1, \beta_2) = \sum_{i=1}^n \int_0^1 \left(y_i(t) - \beta_0 - \beta_1\,\bar{x}_i - \beta_2\,x_i^c(t)\right)^2 dt$

[Figure: observed $y_i(t)$, predicted $\hat{y}_i(t)$, regressor $x_i(t)$, and the observed (squared) error $d_W^2\left(\hat{y}_i(t), y_i(t)\right)$]

Slide 21

Solving OLS

First order conditions:

$\frac{\partial f}{\partial \beta_0} = -2 \sum_{i=1}^n \int_0^1 \left(y_i(t) - \beta_0 - \beta_1\,\bar{x}_i - \beta_2\,x_i^c(t)\right) dt = 0 \quad (I)$

$\frac{\partial f}{\partial \beta_1} = -2 \sum_{i=1}^n \bar{x}_i \int_0^1 \left(y_i(t) - \beta_0 - \beta_1\,\bar{x}_i - \beta_2\,x_i^c(t)\right) dt = 0 \quad (II)$

$\frac{\partial f}{\partial \beta_2} = -2 \sum_{i=1}^n \int_0^1 x_i^c(t) \left(y_i(t) - \beta_0 - \beta_1\,\bar{x}_i - \beta_2\,x_i^c(t)\right) dt = 0 \quad (III)$

Slide 22

The estimated parameters

It is easy to see that:

$\hat{\beta}_1 = \frac{\sum_{i=1}^n \bar{x}_i\,\bar{y}_i - n\,\bar{\bar{x}}\,\bar{\bar{y}}}{\sum_{i=1}^n \bar{x}_i^2 - n\,\bar{\bar{x}}^2}; \quad \hat{\beta}_2 = \frac{\sum_{i=1}^n \rho(x_i, y_i)\,\sigma_{x_i}\sigma_{y_i}}{\sum_{i=1}^n \sigma_{x_i}^2}; \quad \hat{\beta}_0 = \bar{\bar{y}} - \hat{\beta}_1\,\bar{\bar{x}}$

where $\rho(x_i, y_i)$ is the correlation between the quantile functions $x_i$ and $y_i$, and $\bar{\bar{x}}$, $\bar{\bar{y}}$ are the means of the $\bar{x}_i$ and $\bar{y}_i$.
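These closed forms can be verified on synthetic data: when each $y_i(t)$ is built exactly from the model, the fit recovers the coefficients. This is a minimal sketch under the assumption that each quantile function is sampled on an equal-weight t-grid; `fit_histogram_ols` is an illustrative name, and the mean-slope is computed in the equivalent covariance form.

```python
def fit_histogram_ols(X, Y):
    """X, Y: lists of quantile-function samples (equal-length lists).
    Returns (b0, b1, b2) for the centered-quantile model
    yhat_i(t) = b0 + b1*mean(x_i) + b2*(x_i(t) - mean(x_i))."""
    n = len(X)
    mx = [sum(q) / len(q) for q in X]
    my = [sum(q) / len(q) for q in Y]
    gx, gy = sum(mx) / n, sum(my) / n
    # slope/intercept for the means: a classic simple regression on n points
    b1 = sum((a - gx) * (b - gy) for a, b in zip(mx, my)) / \
         sum((a - gx) ** 2 for a in mx)
    b0 = gy - b1 * gx
    # slope for the centered quantile functions; non-negative because the
    # covariance of two non-decreasing functions is non-negative
    num = den = 0.0
    for qx, qy, a, b in zip(X, Y, mx, my):
        k = len(qx)
        num += sum((u - a) * (v - b) for u, v in zip(qx, qy)) / k
        den += sum((u - a) ** 2 for u in qx) / k
    b2 = num / den
    return b0, b1, b2

X = [[0.0, 1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 3.0, 3.0]]

def make_y(qx, b0=2.0, b1=0.5, b2=1.5):
    m = sum(qx) / len(qx)
    return [b0 + b1 * m + b2 * (v - m) for v in qx]

Y = [make_y(qx) for qx in X]
b0, b1, b2 = fit_histogram_ols(X, Y)
print(round(b0, 6), round(b1, 6), round(b2, 6))   # -> 2.0 0.5 1.5
```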

Slide 23

Interpretation of the parameters

$\hat{\beta}_0, \hat{\beta}_1$: regression parameters for the locations (means) of the distributions.

$\hat{\beta}_2 = \frac{\sum_{i=1}^n \rho(x_i, y_i)\,\sigma_{x_i}\sigma_{y_i}}{\sum_{i=1}^n \sigma_{x_i}^2}$: shrinking factor for the variability:

- $\hat{\beta}_2 > 1$ ($< 1$): the $y_i$ histograms have a greater (smaller) variability than the $x_i$ histograms;
- $\hat{\beta}_2 = 0$: the distributions collapse into points.

Slide 24

Tools for the interpretation

The sum of squares of Y is:

$SS(Y) = \sum_{i=1}^n d_W^2\left(y_i(t), \bar{y}(t)\right) = \sum_{i=1}^n \int_0^1 \left(y_i(t) - \bar{y}(t)\right)^2 dt$

We recall the decomposition of the sum of squares of Y:

$SS(Y) = SS_{Error} + SS_{Regression}$

Slide 25

Decomposition of SS(Y)

Being

$\hat{y}_i(t) = \hat{\beta}_0 + \hat{\beta}_1\,\bar{x}_i + \hat{\beta}_2\,x_i^c(t),$

we obtain

$SS(Y) = \sum_{i=1}^n d_W^2\left(y_i(t), \bar{y}(t)\right) = \underbrace{\sum_{i=1}^n \int_0^1 \left(\hat{y}_i(t) - y_i(t)\right)^2 dt}_{SS_{Error}} + SS_{Regression} + \underbrace{2n \int_0^1 \bar{y}(t)\,\bar{e}(t)\,dt}_{Bias}$

where $\bar{e}(t) = \frac{1}{n}\sum_{i=1}^n \left(y_i(t) - \hat{y}_i(t)\right)$, $t \in [0,1]$, is the average error function.

Slide 26

The bias

The bias is due to the different shapes of the distributions; bias = 0 when all the histograms have the same shape. It measures the inability of the linear transformation to fit distributions that are very different in shape:

$bias = \int_0^1 \bar{y}(t)\,\bar{e}(t)\,dt = \sigma_{\bar{y}}^2 - \hat{\beta}_2\,\sigma_{\bar{x}}\,\sigma_{\bar{y}}\,\rho\left(\bar{x}(t), \bar{y}(t)\right)$

where $\rho\left(\bar{x}(t), \bar{y}(t)\right)$ is the correlation between the average quantile functions.

Slide 27

A measure of fitting

Pseudo R². Considering that

$SS_{Regression} = \sum_{i=1}^n \int_0^1 \left(\hat{y}_i(t) - \bar{y}(t)\right)^2 dt,$

we propose the following pseudo R²:

$PseudoR^2 = \min\left\{ \max\left[0;\; 1 - \frac{SS_{Error}}{SS(Y)}\right];\; 1 \right\}$
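The pseudo R² is straightforward to compute once predictions are available. This sketch assumes the quantile functions are sampled on a common t-grid; the function name is illustrative.

```python
def pseudo_r2(Y, Yhat):
    """min{max[0, 1 - SS_Error/SS(Y)], 1}, with sums of squares taken as
    squared L2 distances between quantile functions on a common t-grid."""
    n, k = len(Y), len(Y[0])
    ybar = [sum(q[j] for q in Y) / n for j in range(k)]   # average quantile function
    ss_y = sum(sum((q[j] - ybar[j]) ** 2 for j in range(k)) / k for q in Y)
    ss_e = sum(sum((q[j] - p[j]) ** 2 for j in range(k)) / k
               for q, p in zip(Y, Yhat))
    return min(max(0.0, 1.0 - ss_e / ss_y), 1.0)

Y = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(pseudo_r2(Y, Y))                       # perfect fit -> 1.0
print(pseudo_r2(Y, [[2.0, 3.0, 4.0]] * 2))   # predicting the average -> 0.0
```

The clamping to [0, 1] mirrors the min/max in the definition, since the Wasserstein error term need not be bounded by SS(Y).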

Slide 28

An application on a Climatic Dataset: 60 Chinese stations

Slide 29

Histogram data

We consider predicting the following phenomena:

  • Humidity
  • Pressure
  • Temperature
  • Wind Speed
  • Precipitation

in July from the distributions observed in January

Slide 30

Main Results

| Variable              | Y    | X       | β̂₀      | β̂₁    | β̂₂    | PseudoR² | Bias/SS(Y) |
|-----------------------|------|---------|---------|-------|-------|----------|------------|
| Relative Humidity (%) | July | January | 472.52  | 0.393 | 0.593 | 0.1564   | 0.0296     |
| Station Pressure (mb) | July | January | 515.31  | 0.929 | 0.993 | 0.9981   | 0.0007     |
| Temperature (°C)      | July | January | 25.46   | 0.196 | 0.521 | 0.3813   | 0.0185     |
| Wind Speed (m/s)      | July | January | 7.98    | 0.638 | 0.848 | 0.6563   | 0.0564     |
| Precipitation (mm)    | July | January | 1337.22 | 0.617 | 3.578 | 0.0000   | 0.9275     |

Best fitting model: Station Pressure (July from January). $\hat{\beta}_2$ close to 1 shows that the distributions have almost the same variability, while the near-zero bias shows that the histograms have almost the same shape.

Slide 31

The estimated parameter $\hat{\beta}_2$ in the Wind Speed model (July from January) shows that the variability of the predicted July distribution is smaller than that of January.

Slide 32

The worst fitting model: Precipitation (July from January). The variability of Y is explained only by the shape component: the PseudoR² and bias values show that the histogram data have very different shapes, which makes a linear model unable to explain the relationship between these histogram variables.

Slide 33

Main references

BILLARD, L. and DIDAY, E. (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley Series in Computational Statistics. John Wiley & Sons.

BOCK, H.H. and DIDAY, E. (2000): Analysis of Symbolic Data, Exploratory methods for extracting statistical information from complex data. Studies in Classification, Data Analysis and Knowledge Organisation, Springer-Verlag.

CUESTA-ALBERTOS, J.A., MATRAN, C., TUERO-DIAZ, A. (1997): Optimal transportation plans and convergence in distribution. Journ. of Multiv. An., 60, 72–83.

GIBBS, A.L. and SU, F.E. (2002): On choosing and bounding probability metrics. Intl. Stat. Rev. 70 (3), 419–435.

IRPINO, A., LECHEVALLIER, Y. and VERDE, R. (2006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT 2006. Physica-Verlag, Berlin, 869–876.

VERDE, R. and IRPINO, A.(2008): Comparing Histogram data using a Mahalanobis– Wasserstein distance. In: Brito, P. (eds.) COMPSTAT 2008. Physica–Verlag, Springer, Berlin, 77–89.

LIMA NETO, E.d.A. and DE CARVALHO, F.d.A.T. (2010): Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis, 54, 2, Elsevier, 333–347.