SLIDE 1


On the Properties of Variational Approximations in Statistical Learning.

Pierre Alquier UCD Dublin - Statistics Seminar - 29/10/15

Pierre Alquier Properties of Variational Approximations

slide-2
SLIDE 2


Learning vs. estimation

In many applications one would like to learn from a sample without being able to write the likelihood.


SLIDE 12


Typical machine learning problem

Main ingredients :

  • observations (object, label): (X1, Y1), (X2, Y2), ...
    → given either once and for all (batch learning), one at a time (online learning), upon request... In this talk, (X1, Y1), ..., (Xn, Yn) are i.i.d.
  • a restricted set of predictors (fθ, θ ∈ Θ).
    → fθ(X) is meant to predict Y.
  • a criterion of success, R(θ):
    → for example R(θ) = P(fθ(X) ≠ Y) (classification error). In this talk R(θ) = E[ℓ(Y, fθ(X))]. We want to minimize R(θ), but note that it is unknown in practice.
  • an empirical proxy r(θ) for this criterion of success:
    → here r(θ) = (1/n) Σ_{i=1}^n ℓ(Yi, fθ(Xi)).
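The empirical proxy r(θ) is straightforward to compute. A minimal Python sketch (mine, not from the talk) for the 0/1 loss with a linear classifier; all names (`empirical_risk`, `zero_one`, `theta_true`) are illustrative:

```python
import numpy as np

def empirical_risk(theta, X, Y, loss):
    """r(theta) = (1/n) * sum_i loss(Y_i, f_theta(X_i)),
    here with the linear classifier f_theta(x) = 1(<theta, x> >= 0)."""
    preds = (X @ theta >= 0).astype(int)
    return np.mean([loss(y, p) for y, p in zip(Y, preds)])

# 0/1 loss: classification error
zero_one = lambda y, p: float(y != p)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = (X @ theta_true >= 0).astype(int)  # labels generated by theta_true itself

print(empirical_risk(theta_true, X, Y, zero_one))  # 0.0: perfect separation by construction
```

R(θ) is the same quantity under the unknown distribution P; r(θ) only uses the sample.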

SLIDE 14


Empirical risk minimization (ERM)

θ̂n = arg min_{θ∈Θ} r(θ).

Theorem (Vapnik and Chervonenkis, in the 70's)

Vapnik, V. (1998). Statistical Learning Theory. Springer.

Classification setting. Let dΘ denote the VC-dimension of Θ. Then

P( R(θ̂n) ≤ inf_{θ∈Θ} R(θ) + 4 √[(dΘ log(n + 1) + log 2) / n] + √[log(2/ε) / (2n)] ) ≥ 1 − ε.

SLIDE 18


ERM with linear classifiers

Table: Linear classifiers in R^p: dΘ = p + 1. Source: http://mlpy.sourceforge.net/

Here dΘ = 3, n = 500. With probability at least 90%,

R(θ̂n) ≤ inf_{θ∈Θ} R(θ) + 0.842.

With n = 5000 we would have

R(θ̂n) ≤ inf_{θ∈Θ} R(θ) + 0.301.
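These numbers follow by plugging dΘ, n and ε into the VC bound. A quick check (my sketch; the exact constants 4 √[(dΘ log(n+1) + log 2)/n] + √[log(2/ε)/(2n)] are an assumption about the slide's formula, so the last digits may differ from the slide's rounding):

```python
import math

def vc_bound(d_theta, n, eps):
    """Excess-risk term of the reconstructed VC theorem:
    4*sqrt((d*log(n+1) + log 2)/n) + sqrt(log(2/eps)/(2n))."""
    return (4 * math.sqrt((d_theta * math.log(n + 1) + math.log(2)) / n)
            + math.sqrt(math.log(2 / eps) / (2 * n)))

print(round(vc_bound(3, 500, 0.10), 3))   # ~0.84 for the slide's example
print(round(vc_bound(3, 5000, 0.10), 3))  # ~0.31: shrinks roughly like 1/sqrt(n)
```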

SLIDE 19


The PAC-Bayesian approach: origins

Idea: combine these tools with a prior π on Θ.

Shawe-Taylor, J. & Williamson, R. C. (1997). A PAC Analysis of a Bayesian Estimator. COLT'97.

McAllester, D. A. (1998). Some PAC-Bayesian Theorems. COLT'98.

"A PAC performance guarantee theorem applies to a broad class of experimental settings. A Bayesian correctness theorem applies to only experimental settings consistent with the prior used in the algorithm. However, in this restricted class of settings the Bayesian learning algorithm can be optimal and will generally outperform PAC learning algorithms. (...) The PAC-Bayesian theorems and algorithms (...) attempt to get the best of both PAC and Bayesian approaches by combining the ability to be tuned with an informal prior with PAC guarantees that hold in all i.i.d experimental settings."

SLIDE 21


The PAC-Bayesian approach

EWA / pseudo-posterior / Gibbs estimator / ...

ρ̂λ(dθ) ∝ exp[−λ r(θ)] π(dθ).

Theorem (for a bounded loss ℓ ≤ B)

Catoni, O. (2007). PAC-Bayesian Supervised Classification (The Thermodynamics of Statistical Learning), volume 56 of Lecture Notes-Monograph Series, IMS.

∀λ > 0, P( ∫ R dρ̂λ ≤ inf_ρ [ ∫ R dρ + λB²/n + (2 K(ρ, π) + 2 log(2/ε)) / λ ] ) ≥ 1 − ε.

SLIDE 22


Another point of view

Bissiri, P., Holmes, C. and Walker, S. (2013). Fast learning Rates in Statistical Inference through Aggregation. Preprint.

Provides a decision-theoretic reason to use

ρ̂λ(dθ) ∝ exp[−λ r(θ)] π(dθ)

instead of

π(dθ | (X1, Y1), ..., (Xn, Yn)) ∝ L(θ) π(dθ).

  • The likelihood L(θ) might be too complicated, or not even available;
  • we might think it safer to replace it by a robust loss function (Huber...).

SLIDE 25


Bibliographical remarks

  • PAC-Bayesian bounds: many authors, including Langford, Seeger, Meir, Cesa-Bianchi, Li, Jiang, Tanner, Laviolette (sorry for not being exhaustive; see the papers for more references!).
  • Related to other works on aggregation: Barron, Vovk, Rissanen, Abramovitch, Nemirovski, Yang, Zhang, Rigollet, Lecué, Bellec, Suzuki...
  • Related work on misspecification in Bayesian statistics: the "safe Bayes rule" of

Grünwald, P. D. & van Ommen, T. (2013). Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It. Preprint.

SLIDE 28

Reminder: pseudo-posterior

ρ̂λ(dθ) ∝ exp[−λ r(θ)] π(dθ).

Depending on the setting, we have to:
  • sample from ρ̂λ;
  • compute ∫ θ ρ̂λ(dθ).

How to do it?
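One generic way to approximate both operations, at least in small dimension, is self-normalized importance sampling from the prior: draw θ_j ~ π and weight by exp(−λ r(θ_j)). A toy sketch (mine, not from the talk; the risk `r` and all names are illustrative):

```python
import numpy as np

def gibbs_posterior_mean(r, lam, prior_sample):
    """Self-normalized importance sampling estimate of int theta rho_hat_lam(dtheta):
    weight each prior draw theta_j by exp(-lam * r(theta_j)), then normalize."""
    risks = np.array([r(th) for th in prior_sample])
    logw = -lam * risks
    w = np.exp(logw - logw.max())   # subtract the max for numerical stability
    w /= w.sum()
    return w @ prior_sample

rng = np.random.default_rng(1)
prior_sample = rng.normal(size=(5000, 2))                   # pi = N(0, I) on R^2
r = lambda th: np.sum((th - np.array([1.0, -1.0])) ** 2)    # toy empirical risk

print(gibbs_posterior_mean(r, 0.0, prior_sample))  # lam = 0: just the prior mean, ~ (0, 0)
print(gibbs_posterior_mean(r, 5.0, prior_sample))  # pulled toward the risk minimizer (1, -1)
```

This degrades quickly with dimension (weight degeneracy), which is one reason MCMC and variational methods are discussed next.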

SLIDE 31

A natural idea: MCMC methods for PAC-Bayes

Langevin Monte Carlo:

Dalalyan, A. and Tsybakov, A. (2011). Sparse regression learning by aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences.

Markov chain Monte Carlo:

Alquier, P. & Biau, G. (2013). Sparse Single-Index Model. Journal of Machine Learning Research.

Guedj, B. & Alquier, P. (2013). PAC-Bayesian Estimation and Prevision in Sparse Additive Models. Electronic Journal of Statistics.

However: it is usually not possible to provide guarantees after a finite number of steps. See however

Dalalyan, A. (2014). Theoretical Guarantees for Approximate Sampling from a Smooth and Log-Concave Density. Preprint.

SLIDE 33


Variational Bayes methods

Idea from Bayesian statistics: approximate the posterior distribution π(θ|x). We fix a convenient family of probability distributions F and approximate the posterior by

π̃ = arg min_{ρ∈F} K(ρ, π(·|x)).

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.

F is either parametric or non-parametric. In the parametric case, F = {ρa, a ∈ R^d} and the problem boils down to an optimization problem:

min_{a∈R^d} K(ρa, π(·|x)).

SLIDE 34


Example: Gaussian approximation

Table: The true posterior and the best Gaussian approximation.

SLIDE 36


VB in PAC-Bayesian framework

ρ̂λ(dθ) ∝ exp[−λ r(θ)] π(dθ). Then:

K(ρa, ρ̂λ) = ∫ log[ (dρa/dπ) (dπ/dρ̂λ) ] dρa
           = λ ∫ r(θ) ρa(dθ) + K(ρa, π) + log ∫ exp[−λ r] dπ.

We put

ãλ = arg min_{a∈A} { λ ∫ r(θ) ρa(dθ) + K(ρa, π) }    and    ρ̃λ = ρ_{ãλ}.
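For a Gaussian family and a Gaussian prior, K(ρa, π) is available in closed form, so the minimization above is a finite-dimensional optimization problem. A sketch (my code; `kl_gauss`, `vb_objective` and the toy risk are hypothetical names, not from the paper):

```python
import numpy as np

def kl_gauss(mu, Sigma, vartheta, d):
    """K( N(mu, Sigma) , N(0, vartheta*I) ) in closed form."""
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (np.trace(Sigma) / vartheta + mu @ mu / vartheta
                  - d + d * np.log(vartheta) - logdet)

def vb_objective(mu, Sigma, lam, expected_risk, vartheta):
    """lam * int r d(rho_a) + K(rho_a, pi): minimizing this over (mu, Sigma)
    is equivalent to minimizing K(rho_a, rho_hat_lam), since the log-partition
    term does not depend on a."""
    return lam * expected_risk(mu, Sigma) + kl_gauss(mu, Sigma, vartheta, len(mu))

d, vartheta = 2, 0.5
mu0, Sigma0 = np.zeros(d), vartheta * np.eye(d)
print(kl_gauss(mu0, Sigma0, vartheta, d))  # 0.0: the prior itself

toy_risk = lambda mu, Sigma: float(mu @ mu)  # stand-in for int r d(rho_a)
print(vb_objective(np.ones(d), Sigma0, 1.0, toy_risk, vartheta))  # 4.0
```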

SLIDE 38


A PAC-Bound for VB Approximation

Theorem

Alquier, P., Ridgway, J. & Chopin, N. (2015). On the Properties of Variational Approximations of Gibbs Posteriors. Preprint.

∀λ > 0, P( ∫ R dρ̃λ ≤ inf_{a∈A} [ ∫ R dρa + λB²/n + (2 K(ρa, π) + 2 log(2/ε)) / λ ] ) ≥ 1 − ε.

⇒ if the infimum on the right-hand side is small enough, the VB approximation comes "at no cost".

SLIDE 45


Application to a linear classification problem

  • (X1, Y1), ..., (Xn, Yn) i.i.d. from P.
  • fθ(x) = 1(⟨θ, x⟩ ≥ 0), x, θ ∈ R^d.
  • R(θ) = P[Y ≠ fθ(X)].
  • rn(θ) = (1/n) Σ_{i=1}^n 1[Yi ≠ fθ(Xi)].
  • Gaussian prior π = N(0, ϑ I).
  • Gaussian approximation of the posterior: F = { N(µ, Σ) : µ ∈ R^d, Σ symmetric positive definite }.

Optimization criterion:

Fλ(µ, Σ) = (λ/n) Σ_{i=1}^n Φ( −Yi ⟨Xi, µ⟩ / √(⟨Xi, Σ Xi⟩) ) + ‖µ‖² / (2ϑ) + (1/2) [ (1/ϑ) tr(Σ) − log |Σ| ].
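The criterion Fλ can be evaluated exactly, since under θ ~ N(µ, Σ) the probability of misclassifying (x, y) is Φ(−y⟨x, µ⟩/√(xᵀΣx)). A minimal sketch (mine, assuming labels in {−1, +1}; `F_lambda` and the data are illustrative):

```python
import numpy as np
from math import erf, sqrt

# standard normal cdf, vectorized by hand (avoids a scipy dependency)
Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def F_lambda(mu, Sigma, X, Y, lam, vartheta):
    """(lam/n) sum_i Phi(-Y_i <X_i, mu> / sqrt(X_i' Sigma X_i))
       + ||mu||^2/(2*vartheta) + (1/2)(tr(Sigma)/vartheta - log|Sigma|)."""
    scale = np.sqrt(np.einsum('ij,jk,ik->i', X, Sigma, X))  # X_i' Sigma X_i per row
    risk = Phi(-Y * (X @ mu) / scale).mean()
    _, logdet = np.linalg.slogdet(Sigma)
    penalty = mu @ mu / (2 * vartheta) + 0.5 * (np.trace(Sigma) / vartheta - logdet)
    return lam * risk + penalty

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 3))
Y = rng.choice([-1.0, 1.0], size=10)
# at mu = 0, Sigma = I, vartheta = 1: each Phi term is 1/2, penalty is d/2
print(F_lambda(np.zeros(3), np.eye(3), X, Y, lam=2.0, vartheta=1.0))  # 2.5
```

In practice one would feed this (and its gradients) to an optimizer over (µ, Σ), as in the annealing algorithm below on the slides.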

SLIDE 47


Application of the main theorem

Corollary

Assume that, for ‖θ‖ = ‖θ′‖ = 1, P(⟨θ, X⟩ ⟨θ′, X⟩ < 0) ≤ c ‖θ − θ′‖, and take λ = √(nd) and ϑ = 1/√d. Then

P( ∫ R dρ̃λ ≤ inf_θ R(θ) + √(d/n) [log(4ne²) + c] + 2 log(2/ε) / √(nd) ) ≥ 1 − ε.

N.B.: under a margin assumption, it is possible to obtain d/n rates...

SLIDE 48


Implementation: deterministic annealing

Algorithm 1: Deterministic annealing

Input: (λt)t∈[0,T], a sequence of temperatures.

Init: set µ = 0 and Σ = ϑ Id, the values minimizing the KL divergence for λ = 0.

Loop t = 1, ..., T:
  a. (µλt, Σλt) = minimize Fλt(µ, Σ) using some local optimization routine (gradient descent), with initial points (µλt−1, Σλt−1).
  b. Break if the empirical bound increases.
End Loop
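The annealing loop can be sketched on a one-dimensional toy problem; everything below (the data, the quadratic risk, the step sizes, the bound proxy) is illustrative and not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(loc=1.5, scale=0.2, size=100)  # toy data
vartheta = 1.0

def risk(mu):                       # toy empirical risk r(mu)
    return np.mean((mu - z) ** 2)

def grad_F(mu, lam):                # gradient of F_lam(mu) = lam*r(mu) + mu^2/(2*vartheta)
    return lam * 2.0 * (mu - z.mean()) + mu / vartheta

def annealing(lambdas, lr=0.01, steps=500):
    mu, bounds = 0.0, []            # init at the lam = 0 minimizer (the prior mean)
    for lam in lambdas:
        for _ in range(steps):      # step (a): local routine, warm-started at previous mu
            mu -= lr * grad_F(mu, lam)
        b = risk(mu) + (mu ** 2 / (2 * vartheta)) / lam  # proxy for the empirical bound
        if bounds and b > bounds[-1]:
            break                   # step (b): stop when the bound increases
        bounds.append(b)
    return mu, bounds

mu_hat, bounds = annealing([1, 2, 4, 8, 16, 32])
print(mu_hat)  # close to the data mean (~1.5)
```

Warm-starting each minimization at the previous temperature's optimum is the whole point: for small λ the objective is nearly convex (dominated by the KL term), and the solution is tracked as λ grows.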

SLIDE 49


Test on real data

Dataset   Covariates   VB     SMC    SVM
Pima      7            21.3   22.3   30.4
Credit    60           33.6   32.0   32.0
DNA       180          23.6   23.6   20.4
SPECTF    22           6.9    8.5    10.1
Glass     10           19.6   23.3   4.7
Indian    11           25.5   26.2   26.8
Breast    10           1.1    1.1    1.7

Table: Comparison of misclassification rates (%). Last column: kernel SVM with radial kernel. The hyper-parameters λ and ϑ are chosen by cross-validation.

SLIDE 52


Convexification of the loss

We can replace the 0/1 loss by a convex surrogate at "no" cost:

Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics.

  • R(θ) = E[(1 − Y fθ(X))₊] (hinge loss).
  • rn(θ) = (1/n) Σ_{i=1}^n (1 − Yi fθ(Xi))₊.
  • Gaussian approximation: F = { N(µ, σ² I) : µ ∈ R^d, σ > 0 }.

This leads to the following criterion (which turns out to be convex!):

(1/n) Σ_{i=1}^n (1 − Yi ⟨µ, Xi⟩) Φ( (1 − Yi ⟨µ, Xi⟩) / (σ ‖Xi‖₂) )
  + (1/n) Σ_{i=1}^n σ ‖Xi‖₂ ϕ( (1 − Yi ⟨µ, Xi⟩) / (σ ‖Xi‖₂) )
  + ‖µ‖₂² / (2ϑ) + (d/2) [ σ²/ϑ − log σ² ].
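The closed form behind this criterion is the Gaussian expectation of a positive part: for Z ~ N(m, s²), E[Z₊] = m Φ(m/s) + s ϕ(m/s), applied with m = 1 − y⟨µ, x⟩ and s = σ‖x‖. A sketch checking it against Monte Carlo (my code, toy values):

```python
import numpy as np
from math import erf, sqrt, pi, exp

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal cdf
phi = lambda z: exp(-z * z / 2.0) / sqrt(2.0 * pi) # standard normal pdf

def expected_hinge(mu, sigma, x, y):
    """E_{theta ~ N(mu, sigma^2 I)} [(1 - y <theta, x>)_+] in closed form:
    for Z ~ N(m, s^2), E[Z_+] = m*Phi(m/s) + s*phi(m/s)."""
    m = 1.0 - y * (x @ mu)
    s = sigma * np.linalg.norm(x)
    return m * Phi(m / s) + s * phi(m / s)

rng = np.random.default_rng(3)
mu, sigma = np.array([0.5, -0.3]), 0.7
x, y = np.array([1.0, 2.0]), 1.0

# Monte Carlo over draws of theta ~ N(mu, sigma^2 I)
thetas = mu + sigma * rng.normal(size=(200000, 2))
mc = np.maximum(1.0 - y * (thetas @ x), 0.0).mean()
print(expected_hinge(mu, sigma, x, y), mc)  # the two estimates agree
```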

SLIDE 53


Application of the main theorem

Optimization with stochastic gradient descent on a ball of radius M. On this ball, the objective function is L-Lipschitz. After k steps, we obtain an approximation ρ̃λ^(k) of the posterior.

Corollary

Assume ‖X‖ ≤ cx a.s., and take λ = √(nd) and ϑ = 1/√d. Then

P( ∫ R dρ̃λ^(k) ≤ inf_θ R(θ) + LM/√(1 + k) + (cx/2) √(d/n) log(n/d) + [ (cx² + 1)/(2 cx) + 2 cx log(2/ε) ] / √(nd) ) ≥ 1 − ε.
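Projected stochastic gradient descent on the ball of radius M can be sketched as follows. This is a generic illustration on a toy quadratic objective; the step-size choice M/(L√t) and the averaging of iterates are one standard option, not necessarily the paper's exact scheme:

```python
import numpy as np

def projected_sgd(grad_i, n, dim, M, L, k, seed=0):
    """SGD on {||mu|| <= M}: pick a random data index, take a gradient step of
    size M/(L*sqrt(t)), project back on the ball; return the iterate average
    (the classical O(1/sqrt(k)) guarantee holds for the averaged iterate)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)
    avg = np.zeros(dim)
    for t in range(1, k + 1):
        i = rng.integers(n)
        mu = mu - (M / (L * np.sqrt(t))) * grad_i(mu, i)
        norm = np.linalg.norm(mu)
        if norm > M:                 # projection onto the ball of radius M
            mu *= M / norm
        avg += (mu - avg) / t        # running average of iterates
    return avg

# toy convex objective: f(mu) = (1/n) sum_i ||mu - a_i||^2
rng = np.random.default_rng(4)
a = rng.normal(size=(50, 3))
grad_i = lambda mu, i: 2.0 * (mu - a[i])

mu_bar = projected_sgd(grad_i, n=50, dim=3, M=5.0, L=10.0, k=20000)
print(mu_bar)  # close to the mean of the a_i (the unconstrained minimizer)
```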

SLIDE 54


(One more) test on real data

Dataset   Convex VB   VB     SMC    SVM
Pima      21.8        21.3   22.3   30.4
Credit    27.2        33.6   32.0   32.0
DNA       4.2         23.6   23.6   20.4
SPECTF    19.2        6.9    8.5    10.1
Glass     26.1        19.6   23.3   4.7
Indian    26.2        25.5   26.2   26.8
Breast    0.5         1.1    1.1    1.7

Table: Comparison of misclassification rates (%), including the convexified version of VB.

SLIDE 55


Convergence graphs

[Two panels: the empirical bound (95%) as a function of the number of iterations.]

Figure: Stochastic gradient descent, Pima and Adult datasets.
