SLIDE 1

Fast Quantification of Uncertainty and Robustness with Variational Bayes

Tamara Broderick

ITT Career Development Assistant Professor, MIT

With: Ryan Giordano, Rachael Meager, Jonathan H. Huggins, Michael I. Jordan

SLIDES 2–20

Bayesian inference

  • Complex, modular models; posterior distribution
  • Uncertainty & robustness quantification
  • Bayes Theorem: p(θ|x) ∝θ p(x|θ) p(θ), for some reasonable priors
  • Challenge: Expressing prior beliefs in a distribution is time-consuming and subjective, especially for complex models
  • Challenge: Approximating the posterior can be computationally expensive
  • We propose: linear response variational Bayes [see also Opper, Winther 2003]

SLIDES 21–26

Roadmap

  • Variational Bayes as an alternative to MCMC
  • Challenges of VB
  • Accurate uncertainties from VB
  • Accurate robustness quantification from VB
  • Big idea: derivatives/perturbations are relatively easy in VB
SLIDES 27–40

Variational Bayes

  • VB approximation q∗(θ) for the posterior p(θ|x)
  • Minimize the Kullback–Leibler (KL) divergence: q∗ = argmin_q KL(q‖p(·|x))
  • VB practical success
  • point estimates and prediction
  • fast, streaming, distributed [Broderick, Boyd, Wibisono, Wilson, Jordan 2013]

SLIDES 41–50

What about uncertainty?

  • Variational Bayes: KL(q‖p(·|x)) = ∫ q(θ) log [q(θ) / p(θ|x)] dθ
  • Mean-field variational Bayes (MFVB): q(θ) = ∏_{j=1}^{J} q(θ_j)
  • Underestimates variance (sometimes severely) [MacKay 2003; Bishop 2006; Wang, Titterington 2004; Turner, Sahani 2011]
  • No covariance estimates [Fosdick 2013; Dunson 2014; Bardenet, Doucet, Holmes 2015]
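To make the variance-underestimation point concrete, here is a minimal Python sketch (not from the slides) on a correlated 2-D Gaussian target, where the mean-field Gaussian optimum is available in closed form (Bishop 2006, Ch. 10); the correlation value 0.9 is an illustrative assumption.

    import numpy as np

    # Correlated 2-D Gaussian "posterior": precision Lambda, covariance Sigma.
    rho = 0.9
    Sigma = np.array([[1.0, rho],
                      [rho, 1.0]])
    Lambda = np.linalg.inv(Sigma)

    # For a Gaussian target, the mean-field (fully factorized Gaussian) optimum
    # has the exact means but marginal variances 1 / Lambda_ii.
    mfvb_var = 1.0 / np.diag(Lambda)
    true_var = np.diag(Sigma)

    print("true marginal variances:", true_var)   # [1.0, 1.0]
    print("MFVB marginal variances:", mfvb_var)   # [0.19, 0.19] -- badly understated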

SLIDES 51–73

Linear response

  • Cumulant-generating function of the posterior: C(t) := log E_{p(·|x)}[exp(tᵀθ)], so that mean = dC(t)/dt |_{t=0}
  • Exact posterior covariance vs MFVB covariance:
    Σ := d²C_{p(·|x)}(t) / dt dtᵀ |_{t=0},   V := d²C_{q∗}(t) / dt dtᵀ |_{t=0}
  • "Linear response": tilt the log posterior, log p_t(θ) := log p(θ|x) + tᵀθ − C(t), and let q∗_t be the MFVB approximation to p_t; then
    Σ = d E_{p_t}[θ] / dtᵀ |_{t=0}
  • The LRVB approximation: Σ ≈ d E_{q∗_t}[θ] / dtᵀ |_{t=0} =: Σ̂

[Bishop 2006] [see also Opper, Winther 2003]
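Added for readability (not on the slides): a heuristic LaTeX sketch of the step that turns the linear-response definition into the inverse-KL-Hessian form quoted on the next slides. It assumes q lies in an exponential family, writes m := E_q[θ], and treats KL(m) as the MFVB objective with the remaining variational parameters optimized out.

    Since
    \[
      \mathrm{KL}\big(q \,\|\, p_t\big)
      = \mathrm{KL}\big(q \,\|\, p(\cdot \mid x)\big) - t^\top m + C(t),
    \]
    the perturbed optimum $m^*_t$ satisfies the stationarity condition
    \[
      \frac{\partial \mathrm{KL}(m)}{\partial m}\Big|_{m = m^*_t} = t .
    \]
    Differentiating with respect to $t^\top$ and evaluating at $t = 0$,
    \[
      \frac{\partial^2 \mathrm{KL}}{\partial m\,\partial m^\top}\Big|_{m = m^*}
      \frac{d m^*_t}{d t^\top}\Big|_{t=0} = I
      \quad\Longrightarrow\quad
      \hat\Sigma = \frac{d m^*_t}{d t^\top}\Big|_{t=0}
      = \left(\frac{\partial^2 \mathrm{KL}}{\partial m\,\partial m^\top}\Big|_{m = m^*}\right)^{-1},
    \]
    which is the Hessian form on the LRVB-estimator slides; the $(I - VH)^{-1}V$ form there is the same quantity rewritten in terms of the MFVB covariance $V$.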

SLIDES 74–83

LRVB estimator

  • LRVB covariance estimate: Σ̂ := d E_{q∗_t}[θ] / dtᵀ |_{t=0}
  • Suppose q_t is in an exponential family with mean parametrization m_t; then
    Σ̂ = (I − V H)⁻¹ V = (∂²KL / ∂m∂mᵀ |_{m=m∗})⁻¹
  • Symmetric and positive definite at a local minimum of the KL
  • The LRVB assumption: E_{p_t}[θ] ≈ E_{q∗_t}[θ]
  • LRVB estimate is exact when MFVB gives the exact mean (e.g. multivariate normal) [Bishop 2006]
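A minimal Python sketch (not from the slides) continuing the 2-D Gaussian example above, where the exact answer is known: for a Gaussian target the KL, viewed as a function of the factor means, is a quadratic whose Hessian is the posterior precision, so the inverse-KL-Hessian (LRVB) covariance recovers the exact posterior covariance while the raw MFVB covariance does not. The matrix values are illustrative.

    import numpy as np

    # Same 2-D Gaussian target as before: the exact covariance is known.
    rho = 0.9
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    Lambda = np.linalg.inv(Sigma)

    # MFVB covariance V: exact means, but marginal variances 1 / Lambda_ii and
    # zero off-diagonal entries.
    V = np.diag(1.0 / np.diag(Lambda))

    # For this target, the KL as a function of the factor means m is the quadratic
    # 0.5 * (m - mu)' Lambda (m - mu) + const, so its Hessian is Lambda and the
    # LRVB estimate (inverse KL Hessian) equals the exact posterior covariance.
    Sigma_hat = np.linalg.inv(Lambda)

    print("MFVB covariance:\n", V)          # diag(0.19, 0.19), no correlation
    print("LRVB covariance:\n", Sigma_hat)  # equals Sigma: unit variances, corr 0.9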

SLIDES 84–100

Microcredit Experiment

  • Simplified from Meager (2016)
  • K microcredit trials (Mexico, Mongolia, Bosnia, India, Morocco, Philippines, Ethiopia)
  • N_k businesses in the kth site (~900 to ~17K)
  • Profit of the nth business at the kth site (y_kn = profit, T_kn = 1 if microcredit):
    y_kn ~ N(μ_k + T_kn τ_k, σ_k²), independently
  • Priors and hyperpriors:
    (μ_k, τ_k)ᵀ ~ N((μ, τ)ᵀ, C), iid
    (μ, τ)ᵀ ~ N((μ_0, τ_0)ᵀ, Λ⁻¹)
    σ_k⁻² ~ Γ(a, b), iid
    C ~ Sep&LKJ(η, c, d)
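For readers who want to play with this model, a minimal Python simulation sketch of the simplified hierarchy above (not from the talk); the hyperparameter values, the fixed C matrix, and the 50/50 treatment assignment are illustrative assumptions, not the talk's settings.

    import numpy as np

    rng = np.random.default_rng(0)

    K = 7                                      # number of sites
    N_k = rng.integers(900, 17_000, size=K)    # businesses per site
    mu, tau = 1.0, 3.0                         # top-level mean profit, treatment effect
    C = np.array([[4.0, 1.0],
                  [1.0, 4.0]])                 # between-site covariance (fixed here; Sep&LKJ in the model)

    site_params = rng.multivariate_normal([mu, tau], C, size=K)        # (mu_k, tau_k) per site
    sigma_k = 1.0 / np.sqrt(rng.gamma(shape=2.0, scale=1.0, size=K))   # from a sigma_k^{-2} ~ Gamma draw

    data = []
    for k in range(K):
        mu_k, tau_k = site_params[k]
        T = rng.integers(0, 2, size=N_k[k])              # T_kn = 1 if business got microcredit
        y = rng.normal(mu_k + T * tau_k, sigma_k[k])     # observed profit y_kn
        data.append((T, y))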

SLIDES 101–108

Microcredit Experiment

  • One set of 2,500 MCMC draws: 45 minutes
  • All of MFVB optimization, LRVB uncertainties, and all sensitivity measures: 58 seconds
  • τ mean (MFVB): 3.08 USD PPP
  • τ std dev (LRVB): 1.83 USD PPP
  • Mean is 1.68 std dev from 0

[Figure: posterior summaries, MFVB vs LRVB]

SLIDES 109–116

Experiments

  • Gaussian mixture model, with conjugate priors on π, μ, Λ:
    P(z_nk = 1) = π_k,   p(x | π, μ, Λ, z) = ∏_{n=1:N} ∏_{k=1:K} N(x_n | μ_k, Λ_k⁻¹)^{z_nk},
    p(π) ∝ 1, p(μ) ∝ 1, p(Λ) ∝ 1
  • 68 simulated data sets (2 components, 2 dimensions), 10,000 data points each; MCMC via the R bayesm package (function rnmixGibbs; at least 500 effective samples)
  • MNIST digits: 12,665 0s and 1s; PCA for 25 dimensions

[Figure: LRVB and MFVB results]

SLIDES 117–124

Experiments

  • Non-conjugate normal-Poisson generalized linear mixed model:
    z_n | β, τ ~ N(z_n | β x_n, τ⁻¹), independently
    y_n | z_n ~ Poisson(y_n | exp(z_n)), independently
    β ~ N(β | 0, σ_β²),   τ ~ Gamma(τ | α_τ, β_τ)
  • 100 simulated data sets, 500 data points each; MCMC via the R MCMCglmm package (20,000 samples)

[Figure: LRVB and MFVB results]
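A minimal Python data-generating sketch of this GLMM (not from the talk); the parameter values and the standard-normal covariates are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)

    N = 500
    beta, tau = 1.5, 4.0
    x = rng.normal(size=N)                          # covariates (assumed standard normal here)
    z = rng.normal(beta * x, 1.0 / np.sqrt(tau))    # latent z_n | beta, tau
    y = rng.poisson(np.exp(z))                      # observed counts y_n | z_n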

SLIDES 125–137

Scaling the matrix inverse

  • LRVB estimate: Σ̂ = (I − V H)⁻¹ V
  • Decomposition of the parameter vector: θ = (αᵀ, zᵀ)ᵀ, so
    H = [ H_α  H_αz ; H_zα  H_z ]
  • Schur complement:
    Σ̂_α = (I_α − V_α H_α − V_α H_αz (I_z − V_z H_z)⁻¹ V_z H_zα)⁻¹ V_α
  • Sparsity patterns of V, H, and I − V H
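A Python sketch (not from the slides) checking, on random stand-in matrices, that the Schur-complement expression for the α block agrees with taking the α block of the full (I − V H)⁻¹ V; block sizes and matrix values are arbitrary placeholders, and V is taken block diagonal across the α and z blocks, as the formula assumes.

    import numpy as np

    rng = np.random.default_rng(2)

    d_a, d_z = 3, 50                       # sizes of the alpha and z blocks
    d = d_a + d_z

    A = rng.normal(size=(d, d))
    H = 0.1 * (A + A.T) / 2.0              # symmetric stand-in for the Hessian term
    Va = np.diag(rng.uniform(0.5, 1.0, size=d_a))
    Vz = np.diag(rng.uniform(0.5, 1.0, size=d_z))
    V = np.block([[Va, np.zeros((d_a, d_z))],
                  [np.zeros((d_z, d_a)), Vz]])      # block-diagonal MFVB covariance

    # Full LRVB covariance, then its alpha block.
    Sigma_hat = np.linalg.solve(np.eye(d) - V @ H, V)
    Sigma_a_full = Sigma_hat[:d_a, :d_a]

    # Schur-complement route: only a z-sized linear system is solved.
    Ha, Haz = H[:d_a, :d_a], H[:d_a, d_a:]
    Hza, Hz = H[d_a:, :d_a], H[d_a:, d_a:]
    inner = np.linalg.solve(np.eye(d_z) - Vz @ Hz, Vz @ Hza)
    Sigma_a_schur = np.linalg.solve(np.eye(d_a) - Va @ Ha - Va @ Haz @ inner, Va)

    assert np.allclose(Sigma_a_full, Sigma_a_schur)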

SLIDES 138–139

Roadmap

  • Variational Bayes as an alternative to MCMC
  • Challenges of VB
  • Accurate uncertainties from VB
  • Accurate robustness quantification from VB

SLIDES 140–153

Robustness quantification

  • Bayes Theorem, with prior hyperparameter α and some reasonable priors:
    p_α(θ) := p(θ | x, α) ∝θ p(x|θ) p(θ|α)
  • Sensitivity of a posterior expectation to a prior perturbation Δα:
    S := d E_{p_α}[g(θ)] / dα |_α  ≈  d E_{q∗_α}[g(θ)] / dα |_α  =: Ŝ   (LRVB estimator)
  • When q∗_α is in an exponential family:
    Ŝ = A (∂²KL / ∂m∂mᵀ |_{m=m∗})⁻¹ B
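Added sketch (not on the slides): the implicit-function-theorem step behind Ŝ, in LaTeX, with one natural reading of the matrices A and B; the exact definitions and sign conventions are an assumption here and should be checked against Giordano, Broderick, Meager, Huggins, and Jordan (2016).

    With $m^*(\alpha)$ the MFVB optimum as a function of the prior hyperparameter,
    \[
      \frac{\partial \mathrm{KL}}{\partial m}\big(m^*(\alpha), \alpha\big) = 0
      \;\Longrightarrow\;
      \frac{d m^*}{d \alpha}
      = -\left(\frac{\partial^2 \mathrm{KL}}{\partial m\,\partial m^\top}\right)^{-1}
         \frac{\partial^2 \mathrm{KL}}{\partial m\,\partial \alpha},
    \]
    \[
      \hat S
      = \frac{d\, \mathbb{E}_{q^*_\alpha}[g(\theta)]}{d\alpha}
      = \underbrace{\frac{\partial\, \mathbb{E}_{q}[g(\theta)]}{\partial m^\top}}_{A}
        \left(\frac{\partial^2 \mathrm{KL}}{\partial m\,\partial m^\top}\Big|_{m=m^*}\right)^{-1}
        \underbrace{\left(-\frac{\partial^2 \mathrm{KL}}{\partial m\,\partial \alpha}\right)}_{B}.
    \]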

SLIDE 154

Microcredit Experiment

  • Model as on Slides 84–100 (simplified from Meager 2016): K microcredit trials (Mexico, Mongolia, Bosnia, India, Morocco, Philippines, Ethiopia); N_k businesses in the kth site (~900 to ~17K)
  • Profit of the nth business at the kth site (y_kn = profit, T_kn = 1 if microcredit):
    y_kn ~ N(μ_k + T_kn τ_k, σ_k²), independently
  • Priors and hyperpriors:
    (μ_k, τ_k)ᵀ ~ N((μ, τ)ᵀ, C), iid
    (μ, τ)ᵀ ~ N((μ_0, τ_0)ᵀ, Λ⁻¹)
    σ_k⁻² ~ Γ(a, b), iid
    C ~ Sep&LKJ(η, c, d)

SLIDES 155–159

Microcredit Experiment

  • Perturb Λ11: 0.03 ➔ 0.04

[Figure: MFVB posterior with LRVB sensitivity]

SLIDES 160–167

Microcredit Experiment

  • Sensitivity of the expected microcredit effect (τ)
  • Normalized to be on the scale of τ std devs
  • τ mean (MFVB): 3.08 USD PPP
  • τ std dev (LRVB): 1.83 USD PPP
  • Mean is 1.68 std dev from 0
  • Λ11 += 0.04 ⇒ Mean > 2 std dev

SLIDE 168

Conclusions

  • We provide linear response variational Bayes: supplements MFVB for fast & accurate covariance estimates
  • More from LRVB: fast & accurate robustness quantification
  • Interested in your data and models:
  • Sensitivity to prior perturbations
  • Sensitivity to likelihood, data perturbations
  • Computational-statistical trade-offs
  • New data summaries: coresets, approx. sufficient stats
  • Criteo data set: 40 million data points, 3 million features; our runtime: ~20 seconds on 24 cores
  • Theoretical guarantees on finite-sample quality

[Huggins, Campbell, Broderick 2016; Huggins, Adams, Broderick, submitted]

SLIDE 169

References

  • T Broderick, N Boyd, A Wibisono, AC Wilson, and MI Jordan. Streaming variational Bayes. NIPS 2013.
  • T Campbell*, JH Huggins*, J How, and T Broderick. Truncated random measures. Submitted, arXiv:1603.00861. Poster at ISBA 2016.
  • R Giordano, T Broderick, and MI Jordan. Linear response methods for accurate covariance estimates from mean field variational Bayes. NIPS 2015.
  • R Giordano, T Broderick, R Meager, JH Huggins, and MI Jordan. Fast robustness quantification with variational Bayes. ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, 2016. arXiv:1606.07153.
  • JH Huggins, T Campbell, and T Broderick. Coresets for scalable Bayesian logistic regression. NIPS 2016.
  • R Meager. Understanding the impact of microcredit expansions: A Bayesian hierarchical analysis of 7 randomised experiments. arXiv:1506.06669, 2016.

SLIDE 170

References

  • R Bardenet, A Doucet, and C Holmes. On Markov chain Monte Carlo methods for tall data. arXiv, 2015.
  • CM Bishop. Pattern Recognition and Machine Learning, 2006.
  • D Dunson. Robust and scalable approach to Bayesian inference. Talk at ISBA 2014.
  • B Fosdick. Modeling Heterogeneity within and between Matrices and Arrays, Chapter 4.7. PhD Thesis, University of Washington, 2013.
  • DJC MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
  • M Opper and O Winther. Variational linear response. NIPS 2003.
  • RE Turner and M Sahani. Two problems with variational expectation maximisation for time-series models. In D Barber, AT Cemgil, and S Chiappa, editors, Bayesian Time Series Models, 2011.
  • B Wang and M Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In AISTATS, 2004.