Clustering compositional data trajectories F. Greco , F. Bruno - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Clustering compositional data trajectories

  • F. Greco, F. Bruno

Dipartimento di Scienze Statistiche Università di Bologna

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4

Outline

We deal with trajectories of compositional data, that is, sequences of composition measurements taken along a domain. Observed trajectories are treated as “functional data”, and the problem of clustering compositional data trajectories is addressed. The procedure for clustering functional data can be summarised as follows:

  • smooth the curves in order to remove measurement errors;
  • choose a metric to evaluate dissimilarity among the considered objects;
  • apply a clustering algorithm and evaluate the quality of the obtained partition.

slide-5
SLIDE 5

Some notation

An observed compositional data trajectory can be seen as a set of measurements taken along a domain $x \in [x_{\min}; x_{\max}]$ (for example altitude, depth, time). The complete data matrix for a compositional trajectory is denoted as $\mathbf{D} = [\mathbf{x}, \mathbf{p}]$. The generic t-th row is denoted as $[x_t, \mathbf{p}_t]$ (t=1,…,T) and contains the C-dimensional composition vector $\mathbf{p}_t = (p_{1t}, p_{2t}, \ldots, p_{Ct})$ observed in correspondence of $x_t$.

slide-6
SLIDE 6

Smoothing compositional data trajectories

Steps proposed as a strategy for obtaining smoothed compositional trajectories:

  • apply the additive log-ratio (alr) transformation to the observed compositions:

    $$\mathbf{z} = \mathrm{alr}(\mathbf{p}) = \left( \log\frac{p_1}{p_C}, \log\frac{p_2}{p_C}, \ldots, \log\frac{p_{C-1}}{p_C} \right)$$

  • smooth the transformed data trajectories by means of usual smoothing techniques (B-spline, P-spline, cubic spline, etc.);
  • in order to obtain smoothed compositions, apply the inverse alr transformation to the smoothed transformed data.

slide-7
SLIDE 7

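Smoothing compositional data trajectories

$$
\begin{pmatrix}
p_{11} & p_{21} & p_{31} \\
p_{12} & p_{22} & p_{32} \\
\vdots & \vdots & \vdots \\
p_{1t} & p_{2t} & p_{3t} \\
\vdots & \vdots & \vdots \\
p_{1T} & p_{2T} & p_{3T}
\end{pmatrix}
\xrightarrow{\;\mathrm{alr}\;}
\begin{pmatrix}
z_{11} & z_{21} \\
z_{12} & z_{22} \\
\vdots & \vdots \\
z_{1t} & z_{2t} \\
\vdots & \vdots \\
z_{1T} & z_{2T}
\end{pmatrix}
\xrightarrow{\;z_{it} = \hat{z}_{it} + \varepsilon_{it}\;}
\begin{pmatrix}
\hat{z}_{11} & \hat{z}_{21} \\
\hat{z}_{12} & \hat{z}_{22} \\
\vdots & \vdots \\
\hat{z}_{1t} & \hat{z}_{2t} \\
\vdots & \vdots \\
\hat{z}_{1T} & \hat{z}_{2T}
\end{pmatrix}
\xrightarrow{\;\mathrm{alr}^{-1}\;}
\begin{pmatrix}
\hat{p}_{11} & \hat{p}_{21} & \hat{p}_{31} \\
\hat{p}_{12} & \hat{p}_{22} & \hat{p}_{32} \\
\vdots & \vdots & \vdots \\
\hat{p}_{1t} & \hat{p}_{2t} & \hat{p}_{3t} \\
\vdots & \vdots & \vdots \\
\hat{p}_{1T} & \hat{p}_{2T} & \hat{p}_{3T}
\end{pmatrix}
$$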
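The chain alr, smoothing, inverse alr can be sketched end to end as follows. This is only an illustration under stated assumptions: a low-degree polynomial least-squares fit (`np.polyfit`) stands in for the spline smoothers named in the slides, and the function name `smooth_composition`, the polynomial degree and the simulated three-part data are all hypothetical.

```python
import numpy as np

def alr(p):
    """Row-wise alr of a T x C matrix, last column as reference."""
    return np.log(p[:, :-1] / p[:, -1:])

def alr_inv(z):
    """Row-wise inverse alr of a T x (C-1) matrix."""
    expz = np.exp(np.column_stack([z, np.zeros(len(z))]))
    return expz / expz.sum(axis=1, keepdims=True)

def smooth_composition(x, p, degree=3):
    """Smooth a T x C compositional trajectory observed at points x."""
    z = alr(p)                                    # real-valued coordinates
    z_hat = np.column_stack([
        # Stand-in smoother: polynomial fit, NOT the splines used in the slides.
        np.polyval(np.polyfit(x, z[:, j], degree), x)
        for j in range(z.shape[1])
    ])
    return alr_inv(z_hat)                         # back to the simplex

# Hypothetical noisy trajectory with C = 3 parts on T = 30 points.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
raw = np.column_stack([0.7 + 0.05 * x, 0.2 - 0.03 * x, 0.1 - 0.02 * x])
noisy = raw * np.exp(rng.normal(scale=0.02, size=raw.shape))
noisy /= noisy.sum(axis=1, keepdims=True)
p_hat = smooth_composition(x, noisy)
```

Whatever smoother is used on the transformed coordinates, the inverse transformation guarantees that every smoothed row is positive and sums to one.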

slide-8
SLIDE 8

Functional Data Analysis (1)

When considering K trajectories, the data matrix referred to the k-th trajectory is defined as $\mathbf{D}_k = [\mathbf{x}_k, \mathbf{p}_k]$, where:

  • $\mathbf{p}_{tk}$ is the C-dimensional compositional vector at time t for trajectory k;
  • $\mathbf{p}_k$ is the $T_k \times C$ matrix containing data measured for the k-th trajectory.

Smoothed compositional trajectories $\hat{\mathbf{p}}_k$, for k=1,…,K, are obtained following the steps described before. Several approaches in functional cluster analysis are based on measuring the differences in observed curves by evaluating differences on the spline coefficients $\hat{\boldsymbol{\beta}}_k = (\hat{\beta}_{1k}, \hat{\beta}_{2k}, \ldots, \hat{\beta}_{Tk})$.

slide-9
SLIDE 9

Functional Data Analysis (2)

This approach is effective only if the same degree and vector of knots, as well as the same basis functions, are used for all trajectories. Measurements might be taken at different values of the predictor variable (misalignment), and the quantities $x^{(k)}_{\min}$ and $x^{(k)}_{\max}$ can vary considerably among trajectories. For this reason we prefer a more flexible approach based on different knot placement and a different amount of smoothing for each trajectory.

slide-10
SLIDE 10

Construction of the metric

Given two generic functions f and g, a measure of the distance between them in the interval $[x_{\min}; x_{\max}]$ is the integral:

$$d(f, g) = \int_{x_{\min}}^{x_{\max}} \| f(x) - g(x) \| \, dx$$

where $\|\cdot\|$ indicates a norm. This integral can be evaluated via Monte Carlo integration by averaging point-to-point distances on a regular grid $x_1, \ldots, x_n$ in the interval $[x_{\min}; x_{\max}]$ as follows:

$$d(f, g) \cong n^{-1} \sum_{i=1}^{n} \| f(x_i) - g(x_i) \|$$

The average equals the integral up to the constant factor $x_{\max} - x_{\min}$, which does not affect comparisons when all pairs are evaluated on the same interval. Suitable differences and norms have to be adopted in the simplex.
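The grid approximation can be illustrated for ordinary real-valued functions, using the absolute value as the norm. The function name `grid_distance` and the two test functions are assumptions for this sketch. The average approximates the integral divided by the interval length, so on the unit interval used here the two coincide.

```python
import numpy as np

def grid_distance(f, g, x_min, x_max, n=1001):
    """Average point-to-point distance between f and g on a regular grid."""
    x = np.linspace(x_min, x_max, n)
    return float(np.mean(np.abs(f(x) - g(x))))

# Hypothetical examples on [0, 1].
d_lin = grid_distance(lambda x: x, lambda x: 0.0 * x, 0.0, 1.0)   # integral of x is 1/2
d_const = grid_distance(lambda x: x + 2.0, lambda x: x, 0.0, 1.0) # constant gap of 2
```

For compositional curves the same averaging applies, with the absolute difference replaced by a distance suited to the simplex.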

slide-11
SLIDE 11

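Construction of the metric

Given two compositions $\mathbf{q} = (q_1, q_2, \ldots, q_C)$ and $\mathbf{w} = (w_1, w_2, \ldots, w_C)$, such difference is evaluated as:

$$\mathbf{m} = \mathbf{q} \ominus \mathbf{w} = \mathcal{C}\left[ \frac{q_1}{w_1}, \frac{q_2}{w_2}, \ldots, \frac{q_C}{w_C} \right] = \left[ \frac{q_1 / w_1}{\sum_{i=1}^{C} q_i / w_i}, \frac{q_2 / w_2}{\sum_{i=1}^{C} q_i / w_i}, \ldots, \frac{q_C / w_C}{\sum_{i=1}^{C} q_i / w_i} \right]$$

The distance $d(\mathbf{q}, \mathbf{w})$ is defined as the norm of the difference:

$$d(\mathbf{q}, \mathbf{w}) = \| \mathbf{m} \| = \left[ \mathrm{alr}(\mathbf{m})' \, \boldsymbol{\Psi}^{-1} \, \mathrm{alr}(\mathbf{m}) \right]^{1/2}$$

where

$$\boldsymbol{\Psi}^{-1} = \left[ \mathbf{I}_{C-1} - C^{-1} \, \mathbf{j}_{C-1} \mathbf{j}'_{C-1} \right]$$

with $\mathbf{I}_{C-1}$ the identity matrix and $\mathbf{j}_{C-1}$ a vector of ones.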
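A minimal sketch of this difference and distance follows, assuming the weighting matrix is the identity of order C-1 minus the all-ones matrix divided by C, and that the norm is the square root of the resulting quadratic form (the usual distance between compositions based on log-ratios). The function names and the two example compositions are hypothetical.

```python
import numpy as np

def closure(v):
    """Rescale a positive vector so that its parts sum to one."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()

def perturb_diff(q, w):
    """Difference in the simplex: closed vector of part-wise ratios q_i / w_i."""
    return closure(np.asarray(q, dtype=float) / np.asarray(w, dtype=float))

def alr(p):
    """Additive log-ratio transform with the last part as reference."""
    p = np.asarray(p, dtype=float)
    return np.log(p[:-1] / p[-1])

def simplex_distance(q, w):
    """Norm of the simplex difference, computed through its alr coordinates."""
    a = alr(perturb_diff(q, w))
    C = a.size + 1
    psi_inv = np.eye(C - 1) - np.ones((C - 1, C - 1)) / C   # I - j j' / C
    return float(np.sqrt(max(a @ psi_inv @ a, 0.0)))

q = np.array([0.70, 0.20, 0.10])   # hypothetical compositions
w = np.array([0.60, 0.25, 0.15])
d_qw = simplex_distance(q, w)
```

A useful sanity check is that this quantity coincides with the Euclidean distance between the centred log-ratio vectors of the two compositions, which does not depend on the choice of reference part.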

slide-12
SLIDE 12

A distance between trajectories in the simplex (1)

Predicted values $\tilde{\mathbf{p}}_l$ and $\tilde{\mathbf{p}}_k$ on a grid are obtained. Unlike $\mathbf{p}_l$ and $\mathbf{p}_k$, these predicted values are aligned on the grid $x_i;\ i = 1, \ldots, n$. The distance between trajectories k and l is measured as

$$d(l, k) \cong n^{-1} \sum_{i=1}^{n} \| \tilde{\mathbf{p}}_{il} - \tilde{\mathbf{p}}_{ik} \|$$

where the difference and the norm are those defined in the simplex. The distance matrix $\mathbf{D}$, with generic entry $\mathbf{D}_{lk} = d(l, k)$, is finally obtained. Starting from this matrix, alternative clustering algorithms can be adopted.
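The construction of the distance matrix can be sketched as below, assuming the predicted compositions of all K trajectories are stored in a single array of shape (K, n, C) aligned on the common grid. The array layout, the function names and the toy trajectories are assumptions for illustration only.

```python
import numpy as np

def simplex_norm_rows(m):
    """Norm of each row composition via the quadratic form in alr coordinates."""
    a = np.log(m[:, :-1] / m[:, -1:])
    C = m.shape[1]
    psi_inv = np.eye(C - 1) - np.ones((C - 1, C - 1)) / C
    quad = np.einsum("ij,jk,ik->i", a, psi_inv, a)
    return np.sqrt(np.maximum(quad, 0.0))

def trajectory_distance_matrix(P):
    """K x K matrix of average point-to-point simplex distances on the grid."""
    K = P.shape[0]
    D = np.zeros((K, K))
    for l in range(K):
        for k in range(l + 1, K):
            m = P[l] / P[k]                      # part-wise ratios at each grid point
            m /= m.sum(axis=1, keepdims=True)    # closure: rows sum to one
            D[l, k] = D[k, l] = simplex_norm_rows(m).mean()
    return D

# Three hypothetical trajectories on a grid of n = 5 points, C = 3 parts.
x = np.linspace(-1.0, 1.0, 5)
def traj(a, b):
    return np.column_stack([0.7 + a * x, 0.2 + b * x, 0.1 - (a + b) * x])

P = np.stack([traj(0.05, -0.03), traj(0.05, -0.03), traj(-0.04, 0.02)])
D = trajectory_distance_matrix(P)
```

The resulting matrix is symmetric with a zero diagonal, so it can be passed to any clustering method that accepts precomputed dissimilarities.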

slide-13
SLIDE 13

Shape and level (center)

[Figure: illustration of the shape and level (center) of curves.]

slide-14
SLIDE 14

A distance between trajectories in the simplex (2)

The center of a curve in the simplex is captured by its geometric mean. Thus, for the predicted values $\tilde{\mathbf{p}}_k$, k=1,…,K, the centered trajectories are obtained as:

$$\tilde{\mathbf{c}}_k = \tilde{\mathbf{p}}_k \ominus \tilde{\mathbf{g}}_k$$

where

$$\tilde{\mathbf{g}}_k = \left( \prod_{i=1}^{n} \tilde{\mathbf{p}}_{ik} \right)^{1/n}$$

is the geometric mean of the predicted values for trajectory k. Distances between centered trajectories:

$$d^{*}(l, k) \cong n^{-1} \sum_{i=1}^{n} \| \tilde{\mathbf{c}}_{il} - \tilde{\mathbf{c}}_{ik} \|$$

We obtain the distance matrix $\mathbf{D}^{*}$, where the generic entry is $\mathbf{D}^{*}_{lk} = d^{*}(l, k)$.
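Centering can be sketched as follows, assuming a trajectory is stored as an n x C array of predicted compositions on the grid. The function name `center_trajectory` and the example data are hypothetical. The check at the end illustrates why centering isolates shape: shifting a whole trajectory by a constant composition leaves the centered trajectory unchanged.

```python
import numpy as np

def center_trajectory(p):
    """Remove the level of a trajectory by dividing out its geometric mean."""
    g = np.exp(np.log(p).mean(axis=0))         # part-wise geometric mean over the grid
    c = p / g                                  # simplex difference, before closure
    return c / c.sum(axis=1, keepdims=True)    # closure: rows sum to one

# Hypothetical trajectory and a copy shifted by a constant composition s.
x = np.linspace(-1.0, 1.0, 5)
p = np.column_stack([0.7 + 0.05 * x, 0.2 - 0.03 * x, 0.1 - 0.02 * x])
s = np.array([0.2, 0.5, 0.3])
p_shifted = p * s
p_shifted /= p_shifted.sum(axis=1, keepdims=True)

c = center_trajectory(p)
c_shifted = center_trajectory(p_shifted)
```

Distances computed on centered trajectories therefore compare shapes only, while distances on the uncentered predictions also reflect differences in level.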

slide-15
SLIDE 15

Clustering algorithms

Two clustering algorithms are applied and compared:

  • Hierarchical clustering: Ward algorithm
  • Partitioning clustering: k-medoids (appealing because a representative object in the cluster can be identified).
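To show how a partitioning method can work directly on a precomputed distance matrix, here is a minimal k-medoids sketch. It uses a simple alternating scheme (assign each object to its nearest medoid, then re-pick each medoid as the member with the smallest total within-cluster distance), which is a simplification and not necessarily the variant used by the authors. The function name, the defaults and the toy data are assumptions.

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a symmetric n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)   # random initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)    # nearest-medoid assignment
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # Member with the smallest total distance to its cluster mates.
                within = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy example: six objects on a line forming two well-separated groups.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D_toy = np.abs(pts[:, None] - pts[None, :])
medoids, labels = k_medoids(D_toy, k=2)
```

Each returned medoid is an actual object of the data set, which is what makes the representative trajectory of a cluster directly identifiable.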

slide-16
SLIDE 16

Data – Particulate matter vertical profiles

Particulate vertical profiles measured along height (71 launches in the winter period). We consider three compositional classes: 0.3-0.4; 0.4-0.5; 0.5-1.6.

slide-17
SLIDE 17

Trajectory – matrix D – Class 1

[Figure: 1st composition plotted against standardised height.]

slide-18
SLIDE 18

Trajectory – matrix D – Class 2

[Figure: 2nd composition plotted against standardised height.]

slide-19
SLIDE 19

Trajectory – matrix D – Class 3

[Figure: 3rd composition plotted against standardised height.]

slide-20
SLIDE 20

Trajectory – matrix D* – Class 1

[Figure: 1st composition plotted against standardised height.]

slide-21
SLIDE 21

Trajectory – matrix D* – Class 2

[Figure: 2nd composition plotted against standardised height.]

slide-22
SLIDE 22

Trajectory – matrix D* – Class 3

[Figure: 3rd composition plotted against standardised height.]

slide-23
SLIDE 23

Trajectories within the Cluster – matrix D – Class 1

[Figure: three panels, each showing 1st composition against height.]

slide-24
SLIDE 24

Trajectories within the Cluster – matrix D – Class 2

[Figure: three panels, each showing 2nd composition against height.]

slide-25
SLIDE 25

Trajectories within the Cluster – matrix D – Class 3

[Figure: three panels, each showing 3rd composition against height.]

slide-26
SLIDE 26

Trajectories within the Cluster – matrix D* – Class 1

[Figure: three panels, each showing 1st composition against height.]

slide-27
SLIDE 27

Trajectories within the Cluster – matrix D* – Class 2

[Figure: three panels, each showing 2nd composition against height.]

slide-28
SLIDE 28

Trajectories within the Cluster – matrix D* – Class 3

[Figure: three panels, each showing 3rd composition against height.]

slide-29
SLIDE 29

Ternary Diagram – matrix D

slide-30
SLIDE 30

Ternary Diagram – matrix D*

slide-31
SLIDE 31

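Description of the clusters in terms of weather, and comparisons

Cross-classification of the 71 trajectories by the partition obtained from D* (rows) and from D (columns):

            D k=1   D k=2   D k=3   Total
D* k=1         6      33      16      55
D* k=2         4       4       1       9
D* k=3         3       4       0       7
Total         13      41      17      71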

slide-32
SLIDE 32

Description of the clusters in terms of weather, and comparisons

Mean       HMIX    TEMP     PRESS   WIND   HUMID      RAD   NETRAD     PM25
D* k=1   155.15    3.61   1011.94   0.87   62.25    99.35     3.15   106.07
D* k=2   151.78    2.99   1015.50   1.02   57.67   103.56    -3.44   110.56
D* k=3   228.43    7.66   1006.61   0.90   55.57   201.57    28.57   131.29
D k=1    160.00    2.41   1013.80   0.86   59.92    82.69   -14.38   107.23
D k=2    167.73    5.20   1011.84   0.96   59.44   141.12    15.41   117.78
D k=3    149.47    2.02   1010.45   0.74   65.65    55.65    -6.06    89.71
Overall  161.94    3.93   1011.87   0.89   61.01   109.96     4.82   109.13

SD         HMIX    TEMP     PRESS   WIND   HUMID      RAD   NETRAD     PM25
D* k=1     8.49    0.56      0.71   0.05    1.39    16.59    11.31     3.12
D* k=2    16.52    1.10      1.85   0.07    2.84    25.06    16.50     4.96
D* k=3    19.65    1.31      0.81   0.13    5.46    65.94    31.09     0.81
D k=1     18.92    0.84      1.89   0.06    2.59    20.86    13.58     5.39
D k=2     10.84    0.67      0.76   0.06    1.83    22.37    14.25     2.63
D k=3     10.75    0.84      1.37   0.09    1.92    21.25    16.22     5.85

slide-33
SLIDE 33