Vine copula mixture models and clustering for non-Gaussian data


SLIDE 1

Vine copula mixture models and clustering for non-Gaussian data

Statistical Methods in Machine Learning

  • Prof. Claudia Czado

  • Özge Sahin <ozge.sahin@tum.de>

Bernoulli-IMS One World Symposium August 2020

SLIDE 2

Finite mixture models

k components generate data

The density of a finite mixture model for X = (X_1, . . . , X_d)^⊤ at x = (x_1, . . . , x_d)^⊤ can be written as:

$$g(\mathbf{x}; \boldsymbol{\eta}) = \sum_{j=1}^{k} \pi_j \cdot g_j(\mathbf{x}; \boldsymbol{\psi}_j). \quad (1)$$

How to select the density of each component g_j(x; ψ_j)? Symmetric distributions, skewed distributions, and others...

Özge Sahin · Vine copula mixture models and clustering for non-Gaussian data · August 2020
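As a toy illustration of (1), a minimal sketch with k = 2 univariate Gaussian components; the Gaussian choice and all parameter values are illustrative, not from the slides:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def mixture_density(x, weights, components):
    """g(x; eta) = sum_j pi_j * g_j(x; psi_j), Eq. (1), for univariate x."""
    return sum(w * normal_pdf(x, mu, sigma)
               for w, (mu, sigma) in zip(weights, components))

# Illustrative weights (summing to one) and component parameters.
weights = [0.3, 0.7]
components = [(0.0, 1.0), (4.0, 0.5)]
print(mixture_density(1.0, weights, components))
```

Since the weights sum to one and each component density integrates to one, the mixture density g integrates to one as well.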

SLIDE 3

Vine copula mixture models, vcmm

Representation of diverse dependence structures in the data

The density of a finite mixture model for X = (X_1, . . . , X_d)^⊤ at x = (x_1, . . . , x_d)^⊤ can be written as:

$$g(\mathbf{x}; \boldsymbol{\eta}) = \sum_{j=1}^{k} \pi_j \cdot g_j(\mathbf{x}; \boldsymbol{\psi}_j). \quad (2)$$

How to select flexible densities for each component g_j(x; ψ_j) so that the model can represent different asymmetric and/or tail dependencies for different pairs of variables? Vine copulas.


SLIDE 4

Vine copulas

Efficient tools for high-dimensional dependence modeling

A bivariate copula C is a distribution on [0, 1]² with uniform univariate margins.

Vine copulas for higher-dimensional data:
  • Bivariate copulas are the building blocks [Aas et al., 2009],
  • Bivariate copulas and a nested set of trees determine the dependence structure [Bedford and Cooke, 2002].

Sklar's Theorem [Sklar, 1959]: assuming absolute continuity of the random variables, a d-dimensional density can be decomposed into the product of its marginal densities and a copula density:

$$g(\mathbf{x}) = c\big(F_1(x_1), \ldots, F_d(x_d)\big) \cdot f_1(x_1) \cdots f_d(x_d), \quad \mathbf{x} \in \mathbb{R}^d. \quad (3)$$

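Equation (3) can be checked numerically in the bivariate Gaussian case; a sketch assuming standard normal margins and a Gaussian copula with illustrative ρ = 0.5 (the slides do not prescribe these choices):

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for the margins and the copula

def gaussian_copula_density(u1, u2, rho):
    """Density of the bivariate Gaussian copula on [0, 1]^2."""
    z1, z2 = STD.inv_cdf(u1), STD.inv_cdf(u2)
    quad = rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2
    return exp(-quad / (2.0 * (1.0 - rho * rho))) / sqrt(1.0 - rho * rho)

def bivariate_normal_density(x1, x2, rho):
    """Standard bivariate Gaussian density with correlation rho."""
    quad = (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / (1.0 - rho * rho)
    return exp(-0.5 * quad) / (2.0 * pi * sqrt(1.0 - rho * rho))

# Sklar: g(x1, x2) = c(F1(x1), F2(x2)) * f1(x1) * f2(x2)
x1, x2, rho = 0.3, -1.1, 0.5
lhs = bivariate_normal_density(x1, x2, rho)
rhs = (gaussian_copula_density(STD.cdf(x1), STD.cdf(x2), rho)
       * STD.pdf(x1) * STD.pdf(x2))
print(lhs, rhs)  # the two sides agree
```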

SLIDE 5

Vine copula mixture models, vcmm

Decompose a component’s density into marginal and bivariate copula densities

Figure 1: Vine copula models of two components. (a) First component: tree T_1^{(1)} is the path 1 – 2 – 3 with pair copulas C_{1,2}^{(1)} and C_{2,3}^{(1)}; tree T_2^{(1)} connects the edges 1,2 and 2,3 with C_{1,3;2}^{(1)}. (b) Second component: tree T_1^{(2)} is the path 2 – 1 – 3 with pair copulas C_{1,2}^{(2)} and C_{1,3}^{(2)}; tree T_2^{(2)} connects the edges 1,2 and 1,3 with C_{2,3;1}^{(2)}.

The density of the first component at x = (x_1, x_2, x_3)^⊤:

$$\begin{aligned}
g_1(\mathbf{x}; \boldsymbol{\psi}_1) ={}& c_{1,2}^{(1)}\big(F_1^{(1)}(x_1; \gamma_1^{(1)}), F_2^{(1)}(x_2; \gamma_2^{(1)}); \theta_{1,2}^{(1)}\big) \\
&\cdot c_{2,3}^{(1)}\big(F_2^{(1)}(x_2; \gamma_2^{(1)}), F_3^{(1)}(x_3; \gamma_3^{(1)}); \theta_{2,3}^{(1)}\big) \\
&\cdot c_{1,3;2}^{(1)}\big(F_{1|2}^{(1)}(x_1|x_2; \gamma_1^{(1)}, \gamma_2^{(1)}, \theta_{1,2}^{(1)}), F_{3|2}^{(1)}(x_3|x_2; \gamma_3^{(1)}, \gamma_2^{(1)}, \theta_{2,3}^{(1)}); \theta_{1,3;2}^{(1)}\big) \\
&\cdot f_1^{(1)}(x_1; \gamma_1^{(1)}) \cdot f_2^{(1)}(x_2; \gamma_2^{(1)}) \cdot f_3^{(1)}(x_3; \gamma_3^{(1)}). \quad (4)
\end{aligned}$$
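A sketch of decomposition (4) for the first component's D-vine, assuming Gaussian pair copulas, standard normal margins, and illustrative parameter values; in this special case the vine density coincides with a trivariate Gaussian density (with ρ_{1,3;2} the partial correlation), which gives a convenient numerical check:

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD = NormalDist()

def c_gauss(u1, u2, rho):
    """Bivariate Gaussian copula density."""
    z1, z2 = STD.inv_cdf(u1), STD.inv_cdf(u2)
    quad = rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2
    return exp(-quad / (2.0 * (1.0 - rho * rho))) / sqrt(1.0 - rho * rho)

def h_gauss(u1, u2, rho):
    """Conditional CDF (h-function) F(u1 | u2) of the Gaussian copula."""
    z1, z2 = STD.inv_cdf(u1), STD.inv_cdf(u2)
    return STD.cdf((z1 - rho * z2) / sqrt(1.0 - rho * rho))

def dvine3_density(x, r12, r23, r13_2):
    """Eq. (4) specialized to Gaussian pair copulas and N(0,1) margins."""
    x1, x2, x3 = x
    u1, u2, u3 = STD.cdf(x1), STD.cdf(x2), STD.cdf(x3)
    return (c_gauss(u1, u2, r12)
            * c_gauss(u2, u3, r23)
            * c_gauss(h_gauss(u1, u2, r12), h_gauss(u3, u2, r23), r13_2)
            * STD.pdf(x1) * STD.pdf(x2) * STD.pdf(x3))

def mvn3_density(x, r12, r13, r23):
    """Trivariate standard-normal density with the given correlations."""
    x1, x2, x3 = x
    det = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    q = ((1.0 - r23**2) * x1**2 + (1.0 - r13**2) * x2**2
         + (1.0 - r12**2) * x3**2
         + 2.0 * ((r13 * r23 - r12) * x1 * x2
                  + (r12 * r23 - r13) * x1 * x3
                  + (r12 * r13 - r23) * x2 * x3)) / det
    return exp(-0.5 * q) / ((2.0 * pi) ** 1.5 * sqrt(det))

# For Gaussian pair copulas the vine is exactly trivariate Gaussian with
# rho_13 = rho_12 * rho_23 + rho_13;2 * sqrt((1 - rho_12^2)(1 - rho_23^2)).
r12, r23, r13_2 = 0.5, 0.3, 0.2
r13 = r12 * r23 + r13_2 * sqrt((1.0 - r12**2) * (1.0 - r23**2))
x = (0.4, -0.2, 1.0)
print(dvine3_density(x, r12, r23, r13_2), mvn3_density(x, r12, r13, r23))
```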

SLIDE 6

Vine copula mixture models, vcmm

Work with an assignment of the observations to the components

Input:
  • n d-dimensional observations to cluster, x_i = (x_{i,1}, . . . , x_{i,d})^⊤ ∈ ℝ^d for i = 1, . . . , n,
  • Total number of clusters k.

A partition of the observations:
  • The total number of observations assigned to the jth component is n_j,
  • The observations belonging to the jth component are x_{i_j}^{(j)} = (x_{i_j,1}^{(j)}, . . . , x_{i_j,d}^{(j)})^⊤ for i_j = 1, . . . , n_j and j = 1, . . . , k,
  • $\sum_{j=1}^{k} n_j = n$ and $\bigcup_{(j, i_j)} \mathbf{x}_{i_j}^{(j)} = \bigcup_{i} \mathbf{x}_i$.

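The partition bookkeeping above can be illustrated in a few lines; the observations and labels below are made up:

```python
from collections import defaultdict

def partition(observations, labels, k):
    """Split observations into k clusters according to hard labels 1..k."""
    clusters = defaultdict(list)
    for x, j in zip(observations, labels):
        clusters[j].append(x)
    return [clusters[j] for j in range(1, k + 1)]

# Illustrative 1-d observations and a hard assignment to k = 2 clusters.
xs = [0.1, 3.9, 4.2, -0.5, 4.0]
labels = [1, 2, 2, 1, 2]
parts = partition(xs, labels, k=2)
n_j = [len(p) for p in parts]
print(n_j)  # cluster sizes
assert sum(n_j) == len(xs)                                # sum_j n_j = n
assert sorted(x for p in parts for x in p) == sorted(xs)  # union = all obs
```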

SLIDE 7

Vine copula mixture models, vcmm

Parametric model selection

For each variable x_p^{(j)} = (x_{1,p}^{(j)}, . . . , x_{n_j,p}^{(j)})^⊤, p = 1, . . . , d and j = 1, . . . , k:

  • 1. Marginal distribution selection F_j: for each candidate marginal distribution of the variable x_p^{(j)}, find the parameters that maximize the log-likelihood ℓ(γ̂_p^{(j)}), then select the marginal distribution F̂_p^{(j)} with the lowest AIC.
  • 2. Vine tree structure selection V_j: obtain u-data by applying the probability integral transform û_p^{(j)} = F̂_p^{(j)}(x_p^{(j)}; γ̂_p^{(j)}), then follow the greedy algorithm of [Dißmann et al., 2013].
  • 3. Pair copula family selection B_j(V_j): given the vine tree structure, estimate the copula parameters that maximize the log-likelihood ℓ(θ̂_{e_a,e_b;D_e}^{(j)}), then choose the copula family with the lowest AIC.

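Step 1 can be sketched with two closed-form maximum-likelihood candidates and AIC = 2p − 2ℓ; the Gaussian/exponential candidate set is an illustrative assumption, not the candidate list used in the talk:

```python
from math import log
from statistics import NormalDist, fmean

def gaussian_aic(xs):
    """AIC of a Gaussian fitted by maximum likelihood (2 parameters)."""
    mu = fmean(xs)
    var = fmean([(x - mu) ** 2 for x in xs])  # MLE variance
    dist = NormalDist(mu, var ** 0.5)
    ll = sum(log(dist.pdf(x)) for x in xs)
    return 2 * 2 - 2 * ll

def exponential_aic(xs):
    """AIC of an exponential fitted by maximum likelihood (1 parameter)."""
    rate = 1.0 / fmean(xs)  # MLE rate
    ll = sum(log(rate) - rate * x for x in xs)
    return 2 * 1 - 2 * ll

def select_marginal(xs):
    """Return the candidate family with the lowest AIC."""
    aic = {"gaussian": gaussian_aic(xs), "exponential": exponential_aic(xs)}
    return min(aic, key=aic.get)

# Strongly right-skewed positive data should prefer the exponential here.
data = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4]
print(select_marginal(data))
```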

SLIDE 8

Vine copula mixture models

Estimate parameters with the modified ECM algorithm

The log-likelihood of the given data:

$$\ell(\boldsymbol{\eta}) = \log \prod_{i=1}^{n} g(\mathbf{x}_i; \boldsymbol{\eta}) = \log \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j \cdot g_j(\mathbf{x}_i; \boldsymbol{\psi}_j). \quad (5)$$

Introduce latent variables z_i = (z_{i,1}, . . . , z_{i,k})^⊤ with

$$z_{i,j} = \begin{cases} 1, & \text{if } \mathbf{x}_i \text{ belongs to the } j\text{th component}, \\ 0, & \text{otherwise}, \end{cases} \quad (6)$$

and $\sum_{j=1}^{k} z_{i,j} = 1$. The complete data log-likelihood ℓ_c(η; z, x) of the complete data y_i = (x_i, z_i)^⊤:

$$\ell_c(\boldsymbol{\eta}; \mathbf{z}, \mathbf{x}) = \log \prod_{i=1}^{n} \prod_{j=1}^{k} \big[\pi_j \cdot g_j(\mathbf{x}_i; \boldsymbol{\psi}_j)\big]^{z_{i,j}} = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{i,j} \cdot \log \pi_j + \sum_{i=1}^{n} \sum_{j=1}^{k} z_{i,j} \cdot \log g_j(\mathbf{x}_i; \boldsymbol{\psi}_j). \quad (7)$$

SLIDE 9

Vine copula mixture models, vcmm

Estimate parameters with the modified ECM algorithm

Our steps at the (t + 1)th iteration:

  • 1. E-step (posterior probabilities):

$$r_{i,j}^{(t+1)} = \frac{\pi_j^{(t)} \, g_j(\mathbf{x}_i; \boldsymbol{\psi}_j^{(t)})}{\sum_{j'=1}^{k} \pi_{j'}^{(t)} \, g_{j'}(\mathbf{x}_i; \boldsymbol{\psi}_{j'}^{(t)})} \quad \text{for } i = 1, \ldots, n \text{ and } j = 1, \ldots, k. \quad (8)$$

  • 2. CM-step 1 (mixture weights):

$$\pi_j^{(t+1)} = \frac{\sum_{i=1}^{n} r_{i,j}^{(t+1)}}{n} \quad \text{for } j = 1, \ldots, k. \quad (9)$$

  • 3. CM-step 2 (marginal parameters):

$$\max_{\boldsymbol{\gamma}_j} \sum_{i=1}^{n} r_{i,j}^{(t+1)} \cdot \log g_j(\mathbf{x}_i; \boldsymbol{\gamma}_j, \boldsymbol{\theta}_j^{(t)}) \quad \text{for } j = 1, \ldots, k. \quad (10)$$

  • 4. CMR-step (pair copula parameters updated sequentially).
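A minimal sketch of the E-step (8) and CM-step 1 (9), with univariate Gaussian components standing in for the vine copula components; the mean update below is only the Gaussian analogue of the marginal-parameter maximization (10), not the vcmm CM-step itself:

```python
from math import exp, pi, sqrt

def npdf(x, mu, sigma):
    """Univariate Gaussian density (stand-in component density g_j)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def e_step(xs, weights, comps):
    """Eq. (8): posterior responsibilities r_{i,j}."""
    resp = []
    for x in xs:
        num = [w * npdf(x, mu, s) for w, (mu, s) in zip(weights, comps)]
        total = sum(num)
        resp.append([v / total for v in num])
    return resp

def cm_step_weights(resp):
    """Eq. (9): pi_j = sum_i r_{i,j} / n."""
    n, k = len(resp), len(resp[0])
    return [sum(r[j] for r in resp) / n for j in range(k)]

def cm_step_means(xs, resp):
    """Responsibility-weighted mean update, the closed-form Gaussian
    analogue of the marginal-parameter maximization in Eq. (10)."""
    k = len(resp[0])
    return [sum(r[j] * x for x, r in zip(xs, resp)) / sum(r[j] for r in resp)
            for j in range(k)]

# Illustrative 1-d data with two well-separated groups.
xs = [0.0, 0.2, -0.1, 5.0, 5.3, 4.8]
weights, comps = [0.5, 0.5], [(0.0, 1.0), (5.0, 1.0)]
resp = e_step(xs, weights, comps)
print(cm_step_weights(resp), cm_step_means(xs, resp))
```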

SLIDE 10

Vine copula based clustering, vcmmc

Consists of 7 primary building blocks:

  • 1. Initial clustering assignment,
  • 2. Initial model selection with Markov trees and parametric marginal distributions,
  • 3. Iterative parameter estimation with the modified ECM,
  • 4. Temporary clustering assignment,
  • 5. Temporary model selection with full vine specification,
  • 6. Final model selection with different initial clustering methods, i.e. run steps 1–5 with different initial partitions,
  • 7. Final clustering assignment.

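The temporary and final clustering assignments (steps 4 and 7) pick, for each observation, the component with the largest posterior probability; a minimal sketch given a responsibility matrix (the values below are made up):

```python
def assign_clusters(resp):
    """Map each observation to the component with the largest responsibility."""
    return [max(range(len(r)), key=r.__getitem__) + 1 for r in resp]  # 1-based

resp = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]
print(assign_clusters(resp))  # → [1, 2, 1]
```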

SLIDE 11

Vine copula based clustering, vcmmc

Captures the non-Gaussian components hidden in the data

Figure 2: Pairwise scatter plot of the subset of the AIS data (left; red: females, green: males) and pairs plots of the females (middle) and males (right).

Table 1: Comparison of clustering algorithm performances on the subset of the AIS data (BIC and the number of free parameters are not reported for k-means).

                           vcmmc   GMM    skew normal   t      skew-t   k-means
Misclassification rate     0.02    0.09   0.04          0.29   0.04     0.34
BIC                        6942    7062   7055          7092   7048     -
Number of free parameters  41      30     51            41     51       -


SLIDE 12

Vine copula based clustering, vcmmc

Nicely interprets the structure of the data

Figure 3: The first tree level of the estimated vine copula model for males, the path Ferr – Ht – LBM – Wt – WBC with edges N(-0.27/-0.17), C(1.84/0.48), SG(7.64/0.87), F(1.62/0.18), and for females, the path LBM – Wt – WBC – Ht – Ferr with edges SG(3.90/0.74), F(-0.15/-0.02), C(1.95/0.49), N(0.11/0.07). A capital letter at an edge refers to its bivariate copula family, where N: Gaussian, C: Clayton, SG: Survival Gumbel, and F: Frank copula. The estimated parameter value and the corresponding Kendall's τ of the pair copula are given inside the parentheses (estimated parameter/Kendall's τ̂).

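The parameter/τ pairs in Figure 3 can be reproduced from the standard one-parameter relations: Gaussian τ = (2/π) arcsin(ρ), Clayton τ = θ/(θ + 2), (survival) Gumbel τ = 1 − 1/θ, and Frank via the Debye function D₁. A sketch, using a simple trapezoidal rule for the Frank integral:

```python
from math import asin, exp, pi

def tau_gaussian(rho):
    return 2.0 / pi * asin(rho)

def tau_clayton(theta):
    return theta / (theta + 2.0)

def tau_gumbel(theta):  # the survival Gumbel shares its Kendall's tau
    return 1.0 - 1.0 / theta

def tau_frank(theta, steps=10_000):
    """tau = 1 - 4/theta * (1 - D1(theta)), with the Debye function
    D1(theta) = (1/theta) * integral_0^theta t / (e^t - 1) dt."""
    h = theta / steps
    integrand = lambda t: t / (exp(t) - 1.0) if t != 0.0 else 1.0
    integral = h * (0.5 * integrand(0.0) + 0.5 * integrand(theta)
                    + sum(integrand(i * h) for i in range(1, steps)))
    d1 = integral / theta
    return 1.0 - 4.0 / theta * (1.0 - d1)

# Cross-check against the male tree in Figure 3 (parameter / tau pairs).
print(round(tau_gaussian(-0.27), 2),  # ≈ -0.17
      round(tau_clayton(1.84), 2),    # ≈ 0.48
      round(tau_gumbel(7.64), 2),     # ≈ 0.87
      round(tau_frank(1.62), 2))      # ≈ 0.18
```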

SLIDE 13

Vine copula mixture models and clustering

An appealing and promising framework

What we have done:
  • A vine copula mixture model, called vcmm, that works with continuous data and fits all classes of vine tree structures,
  • Use of parametric marginal distributions and pair copula families with a single parameter,
  • A data-driven approach to the model selection problems,
  • A modified ECM algorithm [Meng and Rubin, 1993] for parameter estimation,
  • A new and promising model-based clustering algorithm, called vcmmc.

Future research directions:
  • Extension to discrete ordinal variables,
  • Dimensionality reduction for vine copula based clustering,
  • Parsimonious vine copula mixture models.


SLIDE 14

References

Aas, K., Czado, C., Frigessi, A., and Bakken, H. (2009). Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44(2):182–198.

Bedford, T. and Cooke, R. M. (2002). Vines - a new graphical model for dependent random variables. Annals of Statistics, 30(4):1031–1068.

Dißmann, J., Brechmann, E. C., Czado, C., and Kurowicka, D. (2013). Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics and Data Analysis, 59:52–69.

Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80(2):267–278.

Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris, 8:229–231.
