Extended Variational Inference for Non-Gaussian Statistical Models
Zhanyu Ma mazhanyu@bupt.edu.cn
Pattern Recognition and Intelligent System Lab., Beijing University of Posts and Telecommunications, Beijing, China.
VALSE Webinar May 20, 2015
Collaborators
References
[1] Z. Ma, A. E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, "Variational Bayesian Matrix Factorization for Bounded Support Data", IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), vol. 37, no. 4, pp. 876–889, Apr. 2015.
[2] Z. Ma and A. Leijon, "Bayesian Estimation of Beta Mixture Models with Variational Inference", IEEE TPAMI, vol. 33, pp. 2160–2173, Nov. 2011.
[3] Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, "Bayesian Estimation of Dirichlet Mixture Model with Variational Inference", Pattern Recognition (PR), vol. 47, no. 9, pp. 3143–3157, Sep. 2014.
[4] J. Taghia, Z. Ma, and A. Leijon, "Bayesian Estimation of the von-Mises Fisher Mixture Model with Variational Inference", IEEE TPAMI, vol. 36, no. 9, pp. 1701–1715, Sep. 2014.
[5] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, "Probabilistic Multiview Depth Image Enhancement Using Variational Inference", IEEE Journal of Selected Topics in Signal Processing (J-STSP), vol. 9, no. 3, pp. 435–448, Apr. 2015.
Outline
Non-Gaussian Statistical Models
Variational Inference (VI) and Extended VI
Related Applications
Non-Gaussian Statistical Models
– Statistical models for non-Gaussian data
– Belong to the exponential family
[Diagram: non-Gaussian distributions and their data support: von Mises-Fisher for directional data, Dirichlet/Beta for bounded support, Gamma for semi-bounded support]
Non-Gaussian Statistical Models
Why non-Gaussian? OR Why not Gaussian?
Real-life data are not Gaussian
Non-Gaussian Statistical Models
Gaussian distribution
– Advantages: analytically tractable, the most widely used distribution
– Disadvantages: not well suited to bounded/semi-bounded/well-structured data
Non-Gaussian Statistical Models
Non-Gaussian distribution
– Advantages: well-defined for bounded/semi-bounded/structured data; modeling convenience and a conjugate match within the exponential family; describes such data more efficiently
– Disadvantages: no analytically tractable ML and Bayesian estimations!
Beta distribution
– Bounded support and flexible shape
– Image processing, speech coding, DNA methylation analysis
$$\mathrm{beta}(x; u, v) = \frac{\Gamma(u+v)}{\Gamma(u)\Gamma(v)}\, x^{u-1} (1-x)^{v-1}, \qquad \Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$$
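As a quick numerical check of the density above, the following sketch evaluates it directly and compares it against SciPy's built-in implementation (the parameter values are arbitrary):

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import beta as beta_dist

def beta_pdf(x, u, v):
    # beta(x; u, v) = Gamma(u+v) / (Gamma(u) Gamma(v)) * x^(u-1) * (1-x)^(v-1)
    return gamma(u + v) / (gamma(u) * gamma(v)) * x**(u - 1) * (1 - x)**(v - 1)

x = np.linspace(0.05, 0.95, 10)
u, v = 2.0, 5.0
assert np.allclose(beta_pdf(x, u, v), beta_dist.pdf(x, u, v))
```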
Non-Gaussian Statistical Models
– Conventionally used as the conjugate prior of the categorical or multinomial distribution, describing the mixture weights in mixture modeling
– Recently applied to model proportional data (i.e., data with unit L1 norm)
– Speech coding, skin color detection, multiview 3D enhancement, etc.
$$\mathrm{Dir}(\mathbf{x}; \mathbf{a}) = \frac{\Gamma\!\left(\sum_{k=1}^{K} a_k\right)}{\prod_{k=1}^{K} \Gamma(a_k)} \prod_{k=1}^{K} x_k^{a_k - 1}, \qquad x_k > 0, \quad a_k > 0, \quad \sum_{k=1}^{K} x_k = 1$$
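A minimal sketch of evaluating this density on proportional data, checked against `scipy.stats.dirichlet` (the parameter vector u = [3, 5, 8] matches the example used later in the talk):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

def dirichlet_logpdf(x, a):
    # ln Dir(x; a) = ln Gamma(sum_k a_k) - sum_k ln Gamma(a_k) + sum_k (a_k - 1) ln x_k
    return gammaln(np.sum(a)) - np.sum(gammaln(a)) + np.sum((a - 1) * np.log(x))

a = np.array([3.0, 5.0, 8.0])
x = np.array([0.2, 0.3, 0.5])  # proportional data: positive entries with unit L1 norm
assert np.isclose(dirichlet_logpdf(x, a), dirichlet.logpdf(x, a))
```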
Non-Gaussian Statistical Models
– Distributed on the K-dimensional unit sphere; the two-dimensional vMF is defined on the circle
– Directional statistics, gene expressions, speech coding
$$f(\mathbf{x}; \boldsymbol{\mu}, \lambda) = C_K(\lambda)\, e^{\lambda \boldsymbol{\mu}^T \mathbf{x}}, \qquad C_K(\lambda) = \frac{\lambda^{K/2 - 1}}{(2\pi)^{K/2}\, I_{K/2-1}(\lambda)},$$
where $I_v$ denotes the modified Bessel function of the first kind.
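For K = 2 the density lives on the unit circle, so the normalization can be sanity-checked by numerical integration (a sketch; the concentration λ = 3 is arbitrary):

```python
import numpy as np
from scipy.special import iv            # modified Bessel function of the first kind
from scipy.integrate import quad

def vmf_pdf(x, mu, lam, K):
    # f(x; mu, lambda) = C_K(lambda) exp(lambda mu^T x)
    c = lam**(K / 2 - 1) / ((2 * np.pi)**(K / 2) * iv(K / 2 - 1, lam))
    return c * np.exp(lam * np.dot(mu, x))

mu, lam = np.array([1.0, 0.0]), 3.0
# Integrate the K = 2 density around the unit circle: it should equal 1.
total, _ = quad(lambda t: vmf_pdf(np.array([np.cos(t), np.sin(t)]), mu, lam, K=2),
                0.0, 2.0 * np.pi)
assert np.isclose(total, 1.0)
```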
Non-Gaussian Statistical Models
– The term non-Gaussian refers to a family of distributions that are not Gaussian
– Not in conflict with the central limit theorem
– Well-defined for bounded/semi-bounded/structured data
– More efficient than the Gaussian distribution for such data
– But hard to estimate, computationally costly, and difficult to use in practice
Outline
Non-Gaussian Statistical Models
Variational Inference (VI) and Extended VI
Related Applications
– Widely used for point estimation of the parameters
– Expectation-maximization (EM) algorithm
– Converges to local maxima and may yield overfitting
– No analytically tractable solution for most non-Gaussian distributions
Formulation and Conditions
– Estimates the distributions of the parameters, rather than point estimates
– Requires a conjugate match within the exponential family
– No overfitting; feasible for online learning
– Without approximation, there is no analytically tractable solution for non-Gaussian distributions
Formulation and Conditions
– M step: numerical solutions such as Gibbs sampling, the Newton-Raphson method, MCMC, etc.
Setting the derivatives of the log-likelihood to zero gives the ML conditions
$$\psi(u) - \psi(u+v) - \frac{1}{N}\sum_{n=1}^{N} \ln x_n = 0, \qquad \psi(v) - \psi(u+v) - \frac{1}{N}\sum_{n=1}^{N} \ln(1 - x_n) = 0,$$
where the digamma function is
$$\psi(z) = \frac{d \ln \Gamma(z)}{dz} = \int_0^{\infty} \left( \frac{e^{-t}}{t} - \frac{e^{-zt}}{1 - e^{-t}} \right) dt.$$
[1] Z. Ma and A. Leijon, ‘Beta Mixture Model and the Application to Image Classification’, IEEE International Conference
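The two digamma conditions have no closed-form solution, but they are easy to solve numerically. A sketch using `scipy.optimize.fsolve` on synthetic beta data (true parameters u = 2, v = 5; the moment-based starting point is a common heuristic, not part of the original slides):

```python
import numpy as np
from scipy.special import psi          # digamma function
from scipy.optimize import fsolve

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=5000)
s1, s2 = np.mean(np.log(x)), np.mean(np.log1p(-x))

def ml_conditions(p):
    u, v = p
    # psi(u) - psi(u+v) = mean ln x_n,  psi(v) - psi(u+v) = mean ln(1 - x_n)
    return [psi(u) - psi(u + v) - s1, psi(v) - psi(u + v) - s2]

# Method-of-moments starting point
m, s = x.mean(), x.var()
t = m * (1 - m) / s - 1.0
u_ml, v_ml = fsolve(ml_conditions, x0=[m * t, (1 - m) * t])
assert abs(u_ml - 2.0) < 0.2 and abs(v_ml - 5.0) < 0.5
```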
Formulation and Conditions
– Prior
– Likelihood
– Posterior
– No closed-form expression for mean, variance, etc.
– No analytically tractable solution for the mixture model
– Not applicable in practice
The conjugate-style prior and the resulting posterior are
$$p(u, v; \alpha, \beta, \nu) \propto \left[\frac{\Gamma(u+v)}{\Gamma(u)\Gamma(v)}\right]^{\nu} e^{-\alpha(u-1)}\, e^{-\beta(v-1)},$$
$$p(u, v \mid \mathbf{X}; \alpha, \beta, \nu) \propto \left[\frac{\Gamma(u+v)}{\Gamma(u)\Gamma(v)}\right]^{\nu + N} e^{-(u-1)\left(\alpha - \sum_{n=1}^{N} \ln x_n\right)}\, e^{-(v-1)\left(\beta - \sum_{n=1}^{N} \ln(1 - x_n)\right)}.$$
[1] Z. Ma and A. Leijon, ‘Bayesian Estimation of Beta Mixture Models with Variational Inference’, IEEE Transaction on Pattern Analysis and Machine Intelligence, Vol. 33, pp. 2160 – 2173, Nov. 2011.
Formulation and Conditions
$$\mathrm{beta}(x; u, v) = \frac{\Gamma(u+v)}{\Gamma(u)\Gamma(v)}\, x^{u-1} (1-x)^{v-1}$$
– Rooted in mean-field theory in physics and the calculus of variations (18th century: Euler, Lagrange, etc.)
– A functional: a function over functions
– Closed-form solution under certain constraints
– Goal: approximate the posterior f(θ|x) by g(θ), via either maximizing L(g) or minimizing KL(g‖f)
$$f(x) = \int f(x \mid \boldsymbol{\theta})\, f(\boldsymbol{\theta})\, d\boldsymbol{\theta},$$
$$\ln f(x) = \int g(\boldsymbol{\theta}) \ln \frac{f(x, \boldsymbol{\theta})}{g(\boldsymbol{\theta})}\, d\boldsymbol{\theta} - \int g(\boldsymbol{\theta}) \ln \frac{f(\boldsymbol{\theta} \mid x)}{g(\boldsymbol{\theta})}\, d\boldsymbol{\theta} = \mathcal{L}(g) + \mathrm{KL}(g \,\|\, f),$$
where $g(\boldsymbol{\theta})$ approximates the true posterior $f(\boldsymbol{\theta} \mid x)$: maximizing $\mathcal{L}(g)$ is equivalent to minimizing $\mathrm{KL}(g \,\|\, f)$.
[1] C. M. Bishop, ‘Pattern Recognition and Machine Learning’, Springer, 2006
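The decomposition ln f(x) = L(g) + KL(g‖f) can be verified numerically on a model whose evidence is tractable. The sketch below uses a beta-Bernoulli model, where the ELBO, the KL divergence between two beta distributions, and the evidence all have closed forms; the choice g = Beta(4, 1.5) is an arbitrary approximating distribution:

```python
import numpy as np
from scipy.special import betaln, psi

def elbo_and_kl(a0, b0, h, t, a, b):
    # Model: theta ~ Beta(a0, b0); data: h ones and t zeros (Bernoulli likelihood).
    # g(theta) = Beta(a, b) is the approximation; the exact posterior is Beta(a0+h, b0+t).
    Elog_th = psi(a) - psi(a + b)        # E_g[ln theta]
    Elog_1mth = psi(b) - psi(a + b)      # E_g[ln(1 - theta)]
    # L(g) = E_g[ln f(x, theta)] + entropy of g
    E_ln_joint = -betaln(a0, b0) + (a0 - 1 + h) * Elog_th + (b0 - 1 + t) * Elog_1mth
    entropy = (betaln(a, b) - (a - 1) * psi(a) - (b - 1) * psi(b)
               + (a + b - 2) * psi(a + b))
    L = E_ln_joint + entropy
    # Closed-form KL divergence between two beta distributions: KL(g || posterior)
    ap, bp = a0 + h, b0 + t
    kl = (betaln(ap, bp) - betaln(a, b) + (a - ap) * psi(a) + (b - bp) * psi(b)
          + (ap - a + bp - b) * psi(a + b))
    return L, kl

a0, b0, h, t = 2.0, 2.0, 7, 3
log_evidence = betaln(a0 + h, b0 + t) - betaln(a0, b0)   # ln f(x), tractable here
L, kl = elbo_and_kl(a0, b0, h, t, a=4.0, b=1.5)          # arbitrary g
assert kl >= 0 and np.isclose(L + kl, log_evidence)
```

The identity holds for any choice of g: the gap between the evidence and the ELBO is exactly the KL divergence.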
Formulation and Conditions
– No constraints on the functional form of each factor g_i(θ_i)
– Directly maximizes L(g)
– Always converges, but may fall into local maxima
– Analytically tractable solution for the Gaussian case
Assume a factorized approximation
$$g(\boldsymbol{\theta}) \approx \prod_i g_i(\boldsymbol{\theta}_i).$$
The optimal factor is
$$\ln g_i^{*}(\boldsymbol{\theta}_i) = \mathrm{E}_{j \neq i}\!\left[ \ln f(x, \boldsymbol{\theta}) \right] + C.$$
[1] C. M. Bishop, ‘Pattern Recognition and Machine Learning’, Springer, 2006
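The update rule above yields coordinate-ascent iterations. A classic tractable case (Bishop [1], Ch. 10) is the Gaussian with unknown mean and precision, where the optimal q(μ) is Gaussian and the optimal q(τ) is gamma; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=2000)      # true mean 5, true precision 1/4
N, xbar, xsq = len(x), x.mean(), np.sum(x**2)

# Priors: mu | tau ~ N(mu0, (lam0 tau)^-1), tau ~ Gam(a0, b0)
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0                               # initialization
for _ in range(50):
    # Optimal q(mu) = N(mu_N, 1/lam_N), using the current E[tau]
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Optimal q(tau) = Gam(a_N, b_N), using the current moments of q(mu)
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    a_N = a0 + 0.5 * (N + 1)
    b_N = b0 + 0.5 * (xsq - 2 * E_mu * N * xbar + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

assert abs(mu_N - xbar) < 0.01
assert abs(E_tau - 0.25) < 0.03
```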
Formulation and Conditions
– Optimal solution (below)
– An efficient way to derive analytically tractable solutions for non-Gaussian distributions
– Single lower bound (SLB) vs. multiple lower bounds (MLB) [2]
[1] Z. Ma and A. Leijon, 'Bayesian Estimation of Beta Mixture Models with Variational Inference', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 2160–2173, Nov. 2011. [2] Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, 'Bayesian Estimation of Dirichlet Mixture Model with Variational Inference', Pattern Recognition, vol. 47, no. 9, pp. 3143–3157, Sep. 2014.
Introduce an auxiliary function $\tilde{f}$ satisfying
$$f(x, \boldsymbol{\theta}) \geq \tilde{f}(x, \boldsymbol{\theta}), \qquad \mathrm{E}\!\left[\ln f(x, \boldsymbol{\theta})\right] \geq \mathrm{E}\!\left[\ln \tilde{f}(x, \boldsymbol{\theta})\right],$$
so that the lower bound satisfies
$$\mathcal{L}(g) = \mathrm{E}\!\left[\ln f(x, \boldsymbol{\theta})\right] - \mathrm{E}\!\left[\ln g(\boldsymbol{\theta})\right] \geq \mathrm{E}\!\left[\ln \tilde{f}(x, \boldsymbol{\theta})\right] - \mathrm{E}\!\left[\ln g(\boldsymbol{\theta})\right] = \tilde{\mathcal{L}}(g),$$
and the extended VI update is
$$\ln g_i^{*}(\boldsymbol{\theta}_i) = \mathrm{E}_{j \neq i}\!\left[ \ln \tilde{f}(x, \boldsymbol{\theta}) \right] + C.$$
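For the beta distribution, the intractable term is E[ln Γ(u+v) − ln Γ(u) − ln Γ(v)]. The lower bound proposed in [1], a first-order expansion at the means ū = E[u], v̄ = E[v] using convexity relative to ln u and ln v, can be checked by Monte Carlo; the gamma shapes/scales below are arbitrary:

```python
import numpy as np
from scipy.special import gammaln, psi

# g(u) and g(v) are gamma distributions, as in the conjugate VI setting.
rng = np.random.default_rng(2)
ku, tu, kv, tv = 4.0, 0.5, 6.0, 0.4
u = rng.gamma(ku, tu, size=200_000)
v = rng.gamma(kv, tv, size=200_000)

# Left side: Monte Carlo estimate of E[ln Gamma(u+v) - ln Gamma(u) - ln Gamma(v)]
lhs = np.mean(gammaln(u + v) - gammaln(u) - gammaln(v))

# Right side: the lower bound evaluated at the means ub = E[u], vb = E[v]
ub, vb = ku * tu, kv * tv
Elnu, Elnv = psi(ku) + np.log(tu), psi(kv) + np.log(tv)   # E[ln u], E[ln v]
rhs = (gammaln(ub + vb) - gammaln(ub) - gammaln(vb)
       + ub * (psi(ub + vb) - psi(ub)) * (Elnu - np.log(ub))
       + vb * (psi(ub + vb) - psi(vb)) * (Elnv - np.log(vb)))

assert lhs >= rhs
```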
Formulation and Conditions
Auxiliary function
Convergence and Bias
– Different auxiliary functions for different variables (groups)
– Optimal solution for each variable (group)
[1] Z. Ma and A. Leijon, ‘Bayesian Estimation of Beta Mixture Models with Variational Inference’, IEEE Transaction on Pattern Analysis and Machine Intelligence, Vol. 33, pp. 2160 – 2173, Nov. 2011. [2] W. Fan, N. Bouguila, and D. Ziou, “Variational learning for finite Dirichlet mixture models and applications,” IEEE Transactions on Neural Network and Learning Systems, vol. 23, no. 5, pp. 762–774, May 2012
Convergence and Bias
[1] Z. Ma, J. Taghia, and J. Guo, “On the Convergence of Extended Variational Inference for Non-Gaussian Statistical Models”, IEEE Transaction on Pattern Analysis and Machine Intelligence, under review.
Update Z1 and Z2 iteratively: convergence not guaranteed!
Convergence and Bias
– One auxiliary function for all the different variables (groups)
– Optimal solution
[1] Z. Ma, A.E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian Matrix Factorization for Bounded Support Data”, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 37, No. 4, pp. 876 – 889, Apr. 2015 [2] Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian Estimation of Dirichlet Mixture Model with Variational Inference”, Pattern Recognition, Vol. 47, No. 9, pp. 3143-3157, Sep. 2014.
Convergence guaranteed!
Convergence and Bias
[1] Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian Estimation of Dirichlet Mixture Model with Variational Inference”, Pattern Recognition, Vol. 47, No. 9, pp. 3143-3157, Sep. 2014.
True posterior distribution vs. approximating distribution obtained with the lower-bound approximation [1]: Dirichlet distribution with u = [3, 5, 8].
Convergence and Bias
– EVI provides a flexible way to carry out Bayesian estimation of non-Gaussian statistical models
– Certain requirements must be fulfilled when implementing EVI
– MLB vs. SLB
– Systematic gap introduced by the lower-bound approximation
Outline
Non-Gaussian Statistical Models
Variational Inference (VI) and Extended VI
Related Applications
Dirichlet Mixture Model
[1] Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian Estimation of Dirichlet Mixture Model with Variational Inference”, Pattern Recognition, Vol. 47, No. 9, pp. 3143-3157, Sep. 2014.
Graphical Model of DMM[1]
– Auxiliary function
Dirichlet Mixture Model
[1] Z. Ma, P. K. Rana, J. Taghia, M. Flierl, and A. Leijon, “Bayesian Estimation of Dirichlet Mixture Model with Variational Inference”, Pattern Recognition, Vol. 47, No. 9, pp. 3143-3157, Sep. 2014.
– Quantization of line spectral frequency (LSF) parameters
– The LSF vector is well-structured
[1] Z. Ma, A. Leijon, and W. B. Kleijn, 'Vector Quantization of LSF Parameters with a Mixture of Dirichlet Distributions', IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1777–1790, Sep. 2013.
Dirichlet Mixture Model
– Solution: Dirichlet mixture model[1,2]
(comparable to KLT/PCA for Gaussian source!)
[1] Z. Ma and A. Leijon, 'Modeling Speech Line Spectral Frequencies with Dirichlet Mixture Models', INTERSPEECH, 2010. [2] Z. Ma, A. Leijon, and W. B. Kleijn, 'Vector Quantization of LSF Parameters with a Mixture of Dirichlet Distributions', IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1777–1790, Sep. 2013.
Dirichlet Mixture Model
Probabilistic multiview depth enhancement (PROMDE) [1]
Multiview video imagery
Free-viewpoint TV
Dirichlet Mixture Model
[1] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic Multiview Depth Image Enhancement Using Variational Inference”, IEEE Journal of Selected Topics in Signal Processing (J-STSP), Volume 9, Issue 3, pp. 435-448, Apr. 2015
Dirichlet Mixture Model
PROMDE Flow Chart
[1] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic Multiview Depth Image Enhancement Using Variational Inference”, IEEE Journal of Selected Topics in Signal Processing (J-STSP), Volume 9, Issue 3, pp. 435-448, Apr. 2015
Dirichlet Mixture Model
[1] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic Multiview Depth Image Enhancement Using Variational Inference”, IEEE Journal of Selected Topics in Signal Processing (J-STSP), Volume 9, Issue 3, pp. 435-448, Apr. 2015
Two concatenated Newspaper views with superpixel segmentation [1].
Dirichlet Mixture Model
[1] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic Multiview Depth Image Enhancement Using Variational Inference”, IEEE Journal of Selected Topics in Signal Processing (J-STSP), Volume 9, Issue 3, pp. 435-448, Apr. 2015
Dirichlet Mixture Model
[1] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic Multiview Depth Image Enhancement Using Variational Inference”, IEEE Journal of Selected Topics in Signal Processing (J-STSP), Volume 9, Issue 3, pp. 435-448, Apr. 2015
Selected regions of synthesized virtual views of test sequences as generated by VSRS 3.5 using MPEG depth maps and enhanced depth maps from our depth enhancement algorithm.
Dirichlet Mixture Model
[1] P. K. Rana, J. Taghia, Z. Ma, and M. Flierl, “Probabilistic Multiview Depth Image Enhancement Using Variational Inference”, IEEE Journal of Selected Topics in Signal Processing (J-STSP), Volume 9, Issue 3, pp. 435-448, Apr. 2015
The objective quality of three intermediate virtual views as generated by VSRS 3.5 using the large baseline setting.
Beta Gamma-NMF (BG-NMF)
[1] Z. Ma, A.E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian Matrix Factorization for Bounded Support Data”, IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), Volume 37, Issue 4, pp. 876 – 889, Apr. 2015.
Graphical Model of BG-NMF[1]
Beta Gamma-NMF (BG-NMF)
[1] Z. Ma, A.E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian Matrix Factorization for Bounded Support Data”, IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), Volume 37, Issue 4, pp. 876 – 889, Apr. 2015.
The generative model assumes each bounded observation is beta distributed,
$$X_{p,t} \sim \mathrm{Beta}(X_{p,t}; a_{p,t}, b_{p,t}) = \frac{\Gamma(a_{p,t} + b_{p,t})}{\Gamma(a_{p,t})\,\Gamma(b_{p,t})}\, X_{p,t}^{a_{p,t}-1} (1 - X_{p,t})^{b_{p,t}-1},$$
with the $P \times T$ parameter matrices factorized as
$$[a_{p,t}] = \mathbf{A}\mathbf{Z}, \qquad [b_{p,t}] = \mathbf{B}\mathbf{Z}, \qquad \mathbf{A}, \mathbf{B} \in \mathbb{R}_+^{P \times K}, \quad \mathbf{Z} \in \mathbb{R}_+^{K \times T},$$
gamma priors on the factor elements,
$$A_{p,k} \sim \mathrm{Gam}(A_{p,k}; \mu_{p,k}, \alpha_{p,k}), \quad B_{p,k} \sim \mathrm{Gam}(B_{p,k}; \nu_{p,k}, \beta_{p,k}), \quad z_{k,t} \sim \mathrm{Gam}(z_{k,t}; \rho_{k,t}, \zeta_{k,t}),$$
and element-wise reconstruction
$$\hat{\mathbf{X}} = \frac{\mathbf{A}\mathbf{Z}}{\mathbf{A}\mathbf{Z} + \mathbf{B}\mathbf{Z}}.$$
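A generative sketch of this model: draw gamma-distributed factors, form the element-wise beta parameters a = AZ and b = BZ, sample bounded observations, and reconstruct with the beta mean (the dimensions and gamma hyperparameters here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
P, T, K = 20, 30, 4

# Gamma-distributed factor matrices (hyperparameters arbitrary)
A = rng.gamma(2.0, 1.0, size=(P, K))
B = rng.gamma(2.0, 1.0, size=(P, K))
Z = rng.gamma(2.0, 1.0, size=(K, T))

# Element-wise beta parameters from the factorization
a, b = A @ Z, B @ Z
X = rng.beta(a, b)            # bounded-support observations in (0, 1)
X_hat = a / (a + b)           # reconstruction: the element-wise beta mean

assert X.shape == (P, T) and np.all((X > 0) & (X < 1))
assert np.all((X_hat > 0) & (X_hat < 1))
```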
Beta Gamma-NMF (BG-NMF)
[1] Z. Ma, A.E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian Matrix Factorization for Bounded Support Data”, IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), Volume 37, Issue 4, pp. 876 – 889, Apr. 2015.
Objective function $\mathrm{F}(\mathbf{A}_{p,:}, \mathbf{B}_{p,:}, \mathbf{H}_{:,t})$: need to find an auxiliary function for the log inverse beta (LIB) function.
Beta Gamma-NMF (BG-NMF)
[1] Z. Ma, A.E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian Matrix Factorization for Bounded Support Data”, IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), Volume 37, Issue 4, pp. 876 – 889, Apr. 2015.
Auxiliary function obtained via relative convexity and Jensen's inequality.
Beta Gamma-NMF (BG-NMF)
[1] Z. Ma, A.E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian Matrix Factorization for Bounded Support Data”, IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), Volume 37, Issue 4, pp. 876 – 889, Apr. 2015.
– Motivation: use the statistical model as a robust analysis tool in bioinformatics
– Improve analysis performance compared with benchmark methods
– DNA methylation matrix of 27k probes × 136 samples
– Methylation levels lie in [0, 1]
– Preprocessing: feature selection via variance, reducing the 27k probes to 5000
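The variance-based preprocessing step amounts to keeping the most variable probes. A sketch on a simulated matrix of the stated size (the values here are random placeholders, not real methylation data):

```python
import numpy as np

rng = np.random.default_rng(4)
n_probes, n_samples, n_keep = 27_000, 136, 5_000

# Placeholder methylation matrix: one value in [0, 1] per probe and sample
X = rng.beta(2.0, 2.0, size=(n_probes, n_samples))

# Keep the n_keep probes with the largest across-sample variance
idx = np.argsort(X.var(axis=1))[-n_keep:]
X_sel = X[idx]

assert X_sel.shape == (n_keep, n_samples)
assert X_sel.var(axis=1).min() >= np.median(X.var(axis=1))
```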
Beta Gamma-NMF (BG-NMF)
[1] Z. Ma, A.E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, and J. Guo, “Variational Bayesian Matrix Factorization for Bounded Support Data”, IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), Volume 37, Issue 4, pp. 876 – 889, Apr. 2015.
BG-NMF (5000 features, K = 14).
– PCA + VB-GMM (5000 features, 14 components)
9 cancer samples misclassified as normal, 0 normal samples misclassified as cancer: 9 errors out of 136
Beta Gamma-NMF (BG-NMF)
– BG-NMF + VB-BMM (5000 features, 14 components)
4 cancer samples misclassified as normal, 1 normal sample misclassified as cancer: 5 errors out of 136; runtime 124 sec. vs. 139 sec. (RPBMM)
Beta Gamma-NMF (BG-NMF)
Related Applications
– EVI-based non-Gaussian statistical models show advantages in several applications
– Fitting the data better leads to improved performance
– But they need a lot of effort to design and derive