SLIDE 1

Online k-MLE for mixture modelling with exponential families

Christophe Saint-Jean, Frank Nielsen

Geometric Science of Information (GSI) 2015, Oct 28-30, 2015, École Polytechnique, Paris-Saclay

SLIDE 2

Application Context

We are interested in building a system (a model) which evolves as new data becomes available: x1, x2, ..., xN, ...
The time needed to process a new observation must be constant w.r.t. the number of observations.
The memory required by the system is bounded.
Denote by π the unknown distribution of X.

SLIDE 3

Outline of this talk

1. Online learning of exponential families

2. Online learning of mixtures of exponential families: introduction (EM, k-MLE); Recursive EM, Online EM; stochastic approximations of k-MLE; experiments

3. Conclusions

SLIDE 4

Reminder: (Regular) Exponential Family

Firstly, π will be approximated by a member of a (regular) exponential family (EF):

EF = { f(x; θ) = exp{⟨s(x), θ⟩ + k(x) − F(θ)} | θ ∈ Θ }

Terminology:
- λ: source parameters
- θ: natural parameters
- η: expectation parameters
- s(x): sufficient statistic
- k(x): auxiliary carrier measure
- F(θ): the log-normalizer, differentiable and strictly convex, with Θ = {θ ∈ R^D | F(θ) < ∞} an open convex set

Almost all common distributions are EF members, with exceptions such as the uniform and Cauchy distributions.

SLIDE 5

Reminder: Maximum Likelihood Estimate (MLE)

Maximum Likelihood Estimate for a general p.d.f., assuming a sample χ = {x1, x2, ..., xN} of i.i.d. observations:

θ̂(N) = argmax_θ ∏_{i=1}^{N} f(xi; θ) = argmin_θ −(1/N) ∑_{i=1}^{N} log f(xi; θ)

Maximum Likelihood Estimate for an EF:

θ̂(N) = argmin_θ −⟨(1/N) ∑_i s(xi), θ⟩ − cst(χ) + F(θ)

which is exactly solved in H, the space of expectation parameters:

η̂(N) = ∇F(θ̂(N)) = (1/N) ∑_i s(xi)  ⟺  θ̂(N) = (∇F)^{−1}( (1/N) ∑_i s(xi) )
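As a concrete illustration (ours, not from the slides), take the exponential distribution Exp(λ) written as an EF with s(x) = x, natural parameter θ = −λ and F(θ) = −log(−θ), so ∇F(θ) = −1/θ = E[X]. The MLE is then just (∇F)⁻¹ applied to the average sufficient statistic:

```python
import numpy as np

# Toy sketch (ours): MLE for Exp(lambda) seen as an EF with
# s(x) = x, theta = -lambda, F(theta) = -log(-theta).
def mle_exponential(xs):
    """MLE via the average sufficient statistic eta_hat = (1/N) sum_i s(x_i)."""
    eta_hat = np.mean(xs)            # eta_hat lives in H
    theta_hat = -1.0 / eta_hat       # theta_hat = (grad F)^{-1}(eta_hat)
    return theta_hat, -theta_hat     # natural parameter, rate lambda_hat

rng = np.random.default_rng(0)
xs = rng.exponential(scale=0.5, size=100_000)   # true lambda = 2
theta_hat, lam_hat = mle_exponential(xs)
```

The same two lines (average the sufficient statistic, apply (∇F)⁻¹) work verbatim for any EF with a known inverse-gradient map.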

SLIDE 6

Exact Online MLE for an exponential family

A recursive formulation is easily obtained.

Algorithm 1: Exact Online MLE for an EF
Input: a sequence S of observations
Input: functions s and (∇F)^{−1} for some EF
Output: a sequence of MLEs for all observations seen so far
η̂(0) = 0; N = 1;
for xN ∈ S do
    η̂(N) = η̂(N−1) + N^{−1}(s(xN) − η̂(N−1));
    yield η̂(N) or yield (∇F)^{−1}(η̂(N));
    N = N + 1;

Analytical expressions of (∇F)^{−1} exist for most EFs (but not all).
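Algorithm 1 can be sketched as follows for a univariate Gaussian (our own instantiation, with s(x) = (x, x²) and η = (μ, μ² + σ²)):

```python
import numpy as np

# Minimal sketch (ours) of Algorithm 1 for a univariate Gaussian:
# s(x) = (x, x^2), so eta = (E[x], E[x^2]) = (mu, mu^2 + sigma^2).
def online_mle_gaussian(stream):
    eta = np.zeros(2)
    for n, x in enumerate(stream, start=1):
        s = np.array([x, x * x])
        eta += (s - eta) / n           # eta(N) = eta(N-1) + (s(x_N) - eta(N-1))/N
        mu = eta[0]
        var = eta[1] - eta[0] ** 2     # invert eta -> (mu, sigma^2) in moment form
        yield mu, var

rng = np.random.default_rng(1)
stream = rng.normal(loc=3.0, scale=2.0, size=50_000)
for mu, var in online_mle_gaussian(stream):
    pass  # keep only the final estimate
```

Each update costs O(1) time and O(1) memory, matching the constraints stated on slide 2.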

SLIDE 7

Case of the multivariate normal distribution (MVN)

Probability density function of the MVN:

N(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(x − μ)^T Σ^{−1} (x − μ) }

One possible decomposition:

N(x; θ1, θ2) = exp{ ⟨θ1, x⟩ + ⟨θ2, −xx^T⟩_F − (1/4) θ1^T θ2^{−1} θ1 − (d/2) log π + (1/2) log |θ2| }

⟹ s(x) = (x, −xx^T)

(∇F)^{−1}(η1, η2) = ( (−η1η1^T − η2)^{−1} η1, (1/2)(−η1η1^T − η2)^{−1} )
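A sketch of this inverse map (ours): since s(x) = (x, −xxᵀ), we have η1 = E[x] = μ and η2 = E[−xxᵀ] = −(Σ + μμᵀ), so −η1η1ᵀ − η2 recovers Σ and (∇F)⁻¹ yields θ1 = Σ⁻¹μ, θ2 = ½Σ⁻¹:

```python
import numpy as np

# Sketch (ours) of (grad F)^{-1} for the MVN decomposition above.
def grad_F_inv(eta1, eta2):
    """Map expectation parameters (eta1, eta2) to natural parameters (theta1, theta2)."""
    Sigma = -np.outer(eta1, eta1) - eta2      # Sigma = -eta1 eta1^T - eta2
    P = np.linalg.inv(Sigma)                  # precision matrix Sigma^{-1}
    return P @ eta1, 0.5 * P                  # theta1 = Sigma^{-1} mu, theta2 = Sigma^{-1}/2

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
xs = rng.multivariate_normal(mu, Sigma, size=200_000)
eta1 = xs.mean(axis=0)                                   # average of s_1(x) = x
eta2 = -(xs[:, :, None] * xs[:, None, :]).mean(axis=0)   # average of s_2(x) = -x x^T
theta1, theta2 = grad_F_inv(eta1, eta2)
theta1_true = np.linalg.solve(Sigma, mu)
theta2_true = 0.5 * np.linalg.inv(Sigma)
```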

SLIDE 8

Case of the Wishart distribution

See details in the paper.

SLIDE 9

Finite (parametric) mixture models

Now, π will be approximated by a finite (parametric) mixture f(·; θ) indexed by θ:

π(x) ≈ f(x; θ) = ∑_{j=1}^{K} wj fj(x; θj),  0 ≤ wj ≤ 1,  ∑_{j=1}^{K} wj = 1

where the wj are the mixing proportions and the fj are the component distributions. When all fj's are EFs, it is called a Mixture of EFs (MEF).

[Figure: plot of the mixture density 0.1 * dnorm(x) + 0.6 * dnorm(x, 4, 2) + 0.3 * dnorm(x, −2, 0.5), showing the unknown true distribution f*, the mixture distribution f, and the component density functions f_j]
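The plotted density can be reproduced in a few lines (a sketch of ours; in R's notation, dnorm(x, m, s) is the normal density with mean m and standard deviation s):

```python
import numpy as np

# Sketch (ours) of the mixture density in the figure:
# f(x) = 0.1 N(x; 0, 1) + 0.6 N(x; 4, 2^2) + 0.3 N(x; -2, 0.5^2).
w = np.array([0.1, 0.6, 0.3])    # mixing proportions, sum to 1
m = np.array([0.0, 4.0, -2.0])   # component means
s = np.array([1.0, 2.0, 0.5])    # component standard deviations

def mixture_pdf(x):
    """f(x; theta) = sum_j w_j f_j(x; theta_j)."""
    comp = np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return float(np.sum(w * comp))

# Sanity check: a Riemann sum over a wide grid recovers total mass ~ 1.
grid = np.linspace(-12.0, 18.0, 30001)
total_mass = sum(mixture_pdf(x) for x in grid) * (grid[1] - grid[0])
```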

SLIDE 10

Incompleteness in mixture models

incomplete (observable): χ = {x1, ..., xN}  ←(deterministic)←  complete (unobservable): χc = {y1 = (x1, z1), ..., yN = (xN, zN)}

Zi ∼ Cat_K(w),  Xi | Zi = j ∼ fj(·; θj)

For a MEF, the joint density p(x, z; θ) is an EF:

log p(x, z; θ) = ∑_{j=1}^{K} [z = j] { log wj + ⟨θj, sj(x)⟩ + kj(x) − Fj(θj) }
              = ∑_{j=1}^{K} ⟨ ([z = j], [z = j] sj(x)), (log wj − Fj(θj), θj) ⟩ + k(x, z)
SLIDE 11

Expectation-Maximization (EM) [1]

The EM algorithm iteratively maximizes Q(θ; θ̂(t), χ).

Algorithm 2: EM algorithm
Input: θ̂(0), initial parameters of the model
Input: χ(N) = {x1, ..., xN}
Output: a (local) maximizer θ̂(t*) of log f(χ; θ)
t ← 0;
repeat
    Compute Q(θ; θ̂(t), χ) := E_{θ̂(t)}[log p(χc; θ) | χ];    // E-Step
    Choose θ̂(t+1) = argmax_θ Q(θ; θ̂(t), χ);                 // M-Step
    t ← t + 1;
until convergence of the complete log-likelihood;

SLIDE 12

EM for MEF

For a mixture, the E-Step is always explicit:

ẑ(t)_{i,j} = ŵ(t)_j f(xi; θ̂(t)_j) / ∑_{j′} ŵ(t)_{j′} f(xi; θ̂(t)_{j′})

For a MEF, the M-Step then reduces to:

θ̂(t+1) = argmax_{wj, θj} ∑_{j=1}^{K} ⟨ (∑_i ẑ(t)_{i,j}, ∑_i ẑ(t)_{i,j} sj(xi)), (log wj − Fj(θj), θj) ⟩

ŵ(t+1)_j = ∑_{i=1}^{N} ẑ(t)_{i,j} / N

η̂(t+1)_j = ∇F(θ̂(t+1)_j) = ∑_i ẑ(t)_{i,j} sj(xi) / ∑_i ẑ(t)_{i,j}   (weighted average of SS)
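The two steps above can be sketched for a 1-D Gaussian mixture (our own toy instantiation, with sj(x) = (x, x²); all names are ours):

```python
import numpy as np

# Sketch (ours) of one EM iteration for a 1-D Gaussian mixture.
def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def em_step(xs, w, mu, var):
    # E-Step: posterior responsibilities z_hat[i, j], explicit for a mixture.
    dens = w * normal_pdf(xs[:, None], mu, var)        # shape (N, K)
    z = dens / dens.sum(axis=1, keepdims=True)
    # M-Step: weighted averages of the sufficient statistics.
    nj = z.sum(axis=0)
    w_new = nj / len(xs)
    mu_new = (z * xs[:, None]).sum(axis=0) / nj
    var_new = (z * (xs[:, None] - mu_new) ** 2).sum(axis=0) / nj
    return w_new, mu_new, var_new

rng = np.random.default_rng(3)
xs = np.concatenate([rng.normal(0, 1, 5000), rng.normal(5, 1, 5000)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 6.0]), np.array([1.0, 1.0])
for _ in range(50):
    w, mu, var = em_step(xs, w, mu, var)
```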

SLIDE 13

k-Maximum Likelihood Estimator (k-MLE) [2]

The k-MLE introduces a geometric split χ = ⋃_{j=1}^{K} χ̂(t)_j to accelerate EM:

z̃(t)_{i,j} = [argmax_{j′} wj′ f(xi; θ̂(t)_{j′}) = j]

Equivalently, it amounts to maximizing Q over the partition Z [3]. For a MEF, the M-Step of the k-MLE then reduces to:

θ̂(t+1) = argmax_{wj, θj} ∑_{j=1}^{K} ⟨ (|χ̂(t)_j|, ∑_{xi ∈ χ̂(t)_j} sj(xi)), (log wj − Fj(θj), θj) ⟩

ŵ(t+1)_j = |χ̂(t)_j| / N

η̂(t+1)_j = ∇F(θ̂(t+1)_j) = ∑_{xi ∈ χ̂(t)_j} sj(xi) / |χ̂(t)_j|   (cluster-wise unweighted average of SS)
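The hard-assignment variant can be sketched like so (ours, again for 1-D Gaussian components; this replaces the soft E-Step by an argmax and the weighted averages by cluster-wise unweighted ones):

```python
import numpy as np

# Sketch (ours) of one k-MLE iteration for a 1-D Gaussian mixture.
def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def kmle_step(xs, w, mu, var):
    # Geometric split of chi: hard argmax assignment.
    dens = w * normal_pdf(xs[:, None], mu, var)        # shape (N, K)
    labels = dens.argmax(axis=1)
    # Cluster-wise unweighted averages of the sufficient statistics.
    for j in range(len(w)):
        chi_j = xs[labels == j]
        if len(chi_j):
            w[j] = len(chi_j) / len(xs)
            mu[j] = chi_j.mean()
            var[j] = chi_j.var()
    return w, mu, var

rng = np.random.default_rng(4)
xs = np.concatenate([rng.normal(0, 1, 5000), rng.normal(10, 1, 5000)])
w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 11.0]), np.array([1.0, 1.0])
for _ in range(10):
    w, mu, var = kmle_step(xs, w, mu, var)
```

With well-separated components the hard split is essentially exact; the inconsistency discussed later appears when components overlap.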

SLIDE 14

Online learning of mixtures

Consider now the online setting: x1, x2, ..., xN, ...
Denote by θ̂(N) or η̂(N) the parameter estimate after processing N observations, and by θ̂(0) or η̂(0) their initial values.
Remark: for a fixed-size dataset χ, one may apply multiple passes (with shuffling) over χ. However, an increase in the likelihood function is no longer guaranteed after an iteration.

SLIDE 15

Stochastic approximations of EM (1)

Two main approaches to online EM-like estimation.

Stochastic M-Step: Recursive EM (1984) [5]

θ̂(N) = θ̂(N−1) + {N Ic(θ̂(N−1))}^{−1} ∇θ log f(xN; θ̂(N−1))

where Ic is the Fisher information matrix for the complete data:

Ic(θ̂(N−1)) = −E_{θ̂(N−1)} [ ∂² log p(x, z; θ) / ∂θ∂θ^T ]

A justification for this formula comes from Fisher's identity:

∇ log f(x; θ) = Eθ[∇ log p(x, z; θ) | x]

One can recognize a second-order Stochastic Gradient Ascent, which requires updating and inverting Ic after each iteration.

SLIDE 16

Stochastic approximations of EM (2)

Stochastic E-Step: Online EM (2009) [7]

Q̂(N)(θ) = Q̂(N−1)(θ) + α(N) ( E_{θ̂(N−1)}[log p(xN, zN; θ) | xN] − Q̂(N−1)(θ) )

In the case of a MEF, the algorithm works only with the conditional expectation of the sufficient statistics for the complete data:

ẑN,j = E_{θ̂(N−1)}[zN,j | xN]

(Ŝ(N)_{wj}, Ŝ(N)_{θj}) = (Ŝ(N−1)_{wj}, Ŝ(N−1)_{θj}) + α(N) ( (ẑN,j, ẑN,j sj(xN)) − (Ŝ(N−1)_{wj}, Ŝ(N−1)_{θj}) )

The M-Step is unchanged:

ŵ(N)_j = η̂(N)_{wj} = Ŝ(N)_{wj}
θ̂(N)_j = (∇Fj)^{−1}( η̂(N)_{θj} = Ŝ(N)_{θj} / Ŝ(N)_{wj} )
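A runnable sketch of these updates (our own toy instantiation for a 1-D Gaussian mixture, with sj(x) = (x, x²), step size α(N) = N^{−0.6}, and a short burn-in before the first M-Step; all of these choices are ours):

```python
import numpy as np

# Online EM sketch (ours) for a 1-D Gaussian mixture.
def normal_pdf(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def online_em(stream, w, mu, var, burn_in=100):
    S_w = w.copy()                                          # S^(0)_{w_j} = w_j
    S_s = np.stack([w * mu, w * (var + mu ** 2)], axis=1)   # S^(0)_{theta_j} = w_j eta^(0)_j
    for n, x in enumerate(stream, start=1):
        alpha = n ** -0.6
        # Stochastic E-Step: responsibilities of the new observation.
        dens = w * normal_pdf(x, mu, var)
        z = dens / dens.sum()
        S_w = S_w + alpha * (z - S_w)
        S_s = S_s + alpha * (z[:, None] * np.array([x, x * x]) - S_s)
        if n > burn_in:                                     # M-Step (unchanged)
            w = S_w
            eta = S_s / S_w[:, None]
            mu = eta[:, 0]
            var = eta[:, 1] - mu ** 2
    return w, mu, var

rng = np.random.default_rng(5)
xs = rng.permutation(np.concatenate([rng.normal(0, 1, 10000),
                                     rng.normal(8, 1, 10000)]))
w0, mu0, var0 = np.array([0.5, 0.5]), np.array([-1.0, 9.0]), np.array([1.0, 1.0])
w, mu, var = online_em(xs, w0, mu0, var0)
```

Note that no matrix inversion is needed and the parameter constraints (weights summing to one, positive variances) are respected automatically, as stated on the next slide.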

SLIDE 17

Stochastic approximations of EM (3)

Some properties:
- Initial values Ŝ(0) may be used to introduce a "prior": Ŝ(0)_{wj} = wj, Ŝ(0)_{θj} = wj η(0)_j
- Parameter constraints are automatically respected
- No matrix to invert!
- A policy for α(N) has to be chosen (see [7])
- Consistent, asymptotically equivalent to the recursive EM!

SLIDE 18

Stochastic approximations of k-MLE (1)

In order to keep the previous advantages of online EM in an online k-MLE, our only remaining choice concerns the way xN is assigned to a cluster.

Strategy 1: maximize the likelihood of the complete data (xN, zN):

z̃N,j = [argmax_{j′} ŵ(N−1)_{j′} f(xN; θ̂(N−1)_{j′}) = j]

Equivalent to Online CEM and similar to MacQueen's iterative k-means.

SLIDE 19

Stochastic approximations of k-MLE (2)

Strategy 2: maximize the likelihood of the complete data (xN, zN) after the M-Step:

z̃N,j = [argmax_{j′} ŵ(N)_{j′} f(xN; θ̂(N)_{j′}) = j]

Similar to Hartigan's method for k-means. Additional cost: pre-compute all possible M-Steps for the stochastic E-Step.

SLIDE 20

Stochastic approximations of k-MLE (3)

Strategy 3: draw z̃N from a categorical distribution:

z̃N ∼ Cat_K({pj ∝ ŵ(N−1)_j fj(xN; θ̂(N−1)_j)}_j)

Similar to the sampling step in Stochastic EM [3]. The motivation is to try to break the inconsistency of k-MLE.

For strategies 1 and 3, the M-Step reduces to updating the parameters of a single component.
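The three assignment rules can be sketched in one function (ours, for 1-D Gaussian components; strategies 1 and 2 differ only in whether the parameters passed in are the pre- or post-M-Step ones):

```python
import numpy as np

# Sketch (ours) of the cluster-assignment strategies for x_N.
def assign(x, w, mu, var, strategy, rng=None):
    # Unnormalized posterior weights w_j f_j(x; theta_j).
    dens = w * np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    if strategy in (1, 2):
        return int(dens.argmax())            # hard argmax assignment
    p = dens / dens.sum()                    # strategy 3: stochastic assignment
    return int(rng.choice(len(w), p=p))

w, mu, var = np.array([0.5, 0.5]), np.array([0.0, 10.0]), np.array([1.0, 1.0])
rng = np.random.default_rng(6)
hard = assign(0.2, w, mu, var, strategy=1)
draws = [assign(0.2, w, mu, var, strategy=3, rng=rng) for _ in range(200)]
```

For an observation deep inside one component, the sampled assignment agrees with the argmax almost surely; the strategies only differ meaningfully where components overlap.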

SLIDE 21

Experiments

True distribution: π = 0.5 N(0, 1) + 0.5 N(μ2, σ2²)

Different values of μ2 and σ2 give more or less overlap between the components. A small subset of observations has been taken for initialization (k-MLE++ / k-MLE). A video illustrates the inconsistency of online k-MLE.

SLIDE 22

Experiments on Wishart

SLIDE 23

Conclusions - Future works

On consistency:
- EM and Online EM are consistent.
- k-MLE and online k-MLE (strategies 1, 2) are inconsistent (due to the Bayes error in maximizing the classification likelihood).
- Online stochastic k-MLE (strategy 3): consistency?

So, when components overlap, online EM > k-MLE > online k-MLE for parameter learning. We need to study how the dimension influences the inconsistency/convergence rate for online k-MLE. The convergence rate is lower for online methods (sub-linear convergence of SGD).

Time for an update vs. sample size: online k-MLE (1, 3) < online EM < online k-MLE (2) << k-MLE
SLIDE 24

Online EM appears to be the best compromise!
SLIDE 25

References I

[1] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[2] Nielsen, F.: On learning statistical mixtures maximizing the complete likelihood. Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), AIP Conference Proceedings, 1641, pp. 238–245, 2014.
[3] Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14(3), pp. 315–332, 1992.

SLIDE 26

References II

[4] Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Statistics and Computing, 17(3), pp. 209–218, 2007.
[5] Titterington, D.M.: Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B (Methodological), 46(2), pp. 257–267, 1984.
[6] Amari, S.-I.: Natural gradient works efficiently in learning. Neural Computation, 10(2), pp. 251–276, 1998.
[7] Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society, Series B (Methodological), 71(3), pp. 593–613, 2009.

SLIDE 27

References III

[8] Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M.I., editor, Learning in Graphical Models, pp. 355–368. MIT Press, Cambridge, 1999.
[9] Bottou, Léon: Online algorithms and stochastic approximations. In Online Learning and Neural Networks, Saad, David, Ed., Cambridge University Press, 1998.