
slide-1
SLIDE 1

Being Bayesian when you have too much data

Preceded by an introduction to the CRIStAL mille-feuille

Rémi Bardenet
CNRS & CRIStAL, Univ. Lille, France

Rémi Bardenet · CRIStAL, DatIng, SigMA · Bayes and tall data

slide-2
SLIDE 2

Rémi ∈ SigMA ⊂ DatIng ⊂ CRIStAL ⊂ Univ. Lille



slide-4
SLIDE 4

Rémi ∈ SigMA ⊂ DatIng ⊂ CRIStAL ⊂ Univ. Lille

◮ 222 permanent staff, including 22 CNRS and 27 Inria.


slide-5
SLIDE 5

Rémi ∈ SigMA ⊂ DatIng ⊂ CRIStAL ⊂ Univ. Lille

◮ ∼40 permanent staff.


slide-6
SLIDE 6

Rémi ∈ SigMA ⊂ DatIng ⊂ CRIStAL ⊂ Univ. Lille

◮ 13 permanent staff.



slide-9
SLIDE 9

Bayesian inference

◮ A biologist decides on
  ◮ a likelihood p(x|θ),
  ◮ a prior p(θ).
◮ He has then implicitly decided on
  ◮ a posterior p(θ|x) = p(x|θ) p(θ) / Z.
◮ Bayesian inference is all about computing integrals ∫ h(θ) p(θ|x) dθ.
◮ MCMC samples an ergodic Markov chain (θt)t=1,...,T with stationary distribution p(·|x), so that as T → ∞,
  √T [ (1/T) ∑_{t=1}^T h(θt) − ∫ h(θ) p(θ|x) dθ ] → N(0, σ²) in distribution.
◮ Sampling (θt)t=1,...,T requires T likelihood evaluations.
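The CLT above is what makes MCMC output usable. As a minimal numerical illustration (a toy of my own, not from the talk), replace the Markov chain by iid draws from a known "posterior" N(0, 1) and watch the Monte Carlo estimate of ∫ h(θ) p(θ|x) dθ converge at the 1/√T rate:

```python
import numpy as np

# Toy illustration of the Monte Carlo CLT behind MCMC estimates.
# We cheat and use iid draws from a known "posterior" N(0, 1) instead of a
# Markov chain; the 1/sqrt(T) decay of the error is the point illustrated.
rng = np.random.default_rng(0)

def mc_estimate(h, T):
    """Estimate the integral of h(theta) p(theta|x) dtheta from T samples."""
    theta = rng.standard_normal(T)  # stand-in for the MCMC samples
    return h(theta).mean()

# True value of E[theta^2] under N(0, 1) is 1.
errs = [abs(mc_estimate(lambda t: t**2, T) - 1.0) for T in (10**2, 10**4, 10**6)]
print(errs)
```

For a real chain the picture is the same, except σ² in the CLT also accounts for the autocorrelation of (θt).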



slide-14
SLIDE 14

Tall data

◮ Assume data are independent conditionally on θ:
  p(x|θ) = ∏_{i=1}^n p(xi|θ).
◮ Can you get the same central limit theorem while never evaluating all terms in the product?
◮ Yes [1], sometimes using o(n) datapoints per iteration! [2]
◮ Still open: what is the equivalent of stochastic gradient for integration?
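Because the likelihood factorizes, its log is n times an average of per-datapoint terms, and a uniform subsample estimates that average without touching all n points. A quick sketch, with a Gaussian model of my own choosing:

```python
import numpy as np

# The tall-data log-likelihood is n times an average of per-datapoint terms.
# A uniform subsample of size t << n estimates that average cheaply.
# The model N(x_i | theta, 1) is my choice for the sketch.
rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(0.5, 1.0, size=n)  # the tall dataset

def avg_loglik(theta, data):
    """(1/len(data)) * sum_i log N(x_i | theta, 1)."""
    return np.mean(-0.5 * (data - theta) ** 2 - 0.5 * np.log(2 * np.pi))

full = avg_loglik(0.5, x)                                        # cost O(n)
sub = avg_loglik(0.5, rng.choice(x, size=1_000, replace=False))  # cost O(t)
print(full, sub)
```

The subsampled average is unbiased for the full one; the whole game, as the next slides show, is controlling its fluctuations.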



slide-16
SLIDE 16

Metropolis-Hastings

MH(p(x|θ), p(θ), q(θ′|θ), θ0, Niter, X)
  for k ← 1 to Niter
      θ ← θ_{k−1}
      θ′ ∼ q(·|θ), u ∼ U(0, 1)
      α ← [ ∏_{i=1}^n p(xi|θ′) p(θ′) / ∏_{i=1}^n p(xi|θ) p(θ) ] × [ q(θ|θ′) / q(θ′|θ) ]
      if u < α then θk ← θ′   ▷ Accept
      else θk ← θ             ▷ Reject
  return (θk)k=1,...,Niter
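As a sketch, here is the pseudocode above in Python, with concrete choices the slide leaves abstract: a Gaussian model N(xi|θ, 1), a wide Gaussian prior, and a symmetric random-walk proposal (whose q-ratio then cancels). Note that every iteration evaluates all n likelihood terms, which is exactly the cost the rest of the talk attacks.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=500)  # data; model N(x_i|theta, 1), prior N(0, 10^2)

def log_post(theta):
    # log of prod_i p(x_i|theta) * p(theta), up to a constant; touches all n points
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * theta**2 / 100.0

def mh(theta0, n_iter, step=0.1):
    chain = np.empty(n_iter)
    theta = theta0
    for k in range(n_iter):
        prop = theta + step * rng.standard_normal()  # symmetric q: ratio cancels
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop                             # accept
        chain[k] = theta                             # on reject, theta_k = theta
    return chain

chain = mh(0.0, 5000)
print(chain[1000:].mean())  # close to the posterior mean, itself close to x.mean()
```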



slide-21
SLIDE 21

Subsampling approaches

MH(p(x|θ), p(θ), q(θ′|θ), θ0, Niter, X)
  for k ← 1 to Niter
      θ ← θ_{k−1}
      θ′ ∼ q(·|θ), u ∼ U(0, 1)
      α ← [ ∏_{i=1}^n p(xi|θ′) p(θ′) / ∏_{i=1}^n p(xi|θ) p(θ) ] × [ q(θ|θ′) / q(θ′|θ) ]
      if u < α then θk ← θ′   ▷ Accept
      else θk ← θ             ▷ Reject
  return (θk)k=1,...,Niter


slide-22
SLIDE 22

Subsampling approaches

MH(p(x|θ), p(θ), q(θ′|θ), θ0, Niter, X)
  for k ← 1 to Niter
      θ ← θ_{k−1}
      θ′ ∼ q(·|θ), u ∼ U(0, 1)
      ψ(u, θ, θ′) ← (1/n) log[ u p(θ) q(θ′|θ) / (p(θ′) q(θ|θ′)) ]
      Λn(θ, θ′) ← (1/n) ∑_{i=1}^n log[ p(xi|θ′) / p(xi|θ) ]
      if Λn(θ, θ′) > ψ(u, θ, θ′) then θk ← θ′   ▷ Accept
      else θk ← θ                               ▷ Reject
  return (θk)k=1,...,Niter

◮ Can we instead use Λ*_t(θ, θ′) = (1/t) ∑_{i=1}^t log[ p(x*_i|θ′) / p(x*_i|θ) ], computed on a subsample x*_1, …, x*_t?
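The test Λn > ψ is exactly the usual u < α test moved to the average-log domain, which is what makes it amenable to subsampling. A quick numerical check of that equivalence, with placeholder per-datapoint log-likelihoods and log-ratios of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
agree = []
for _ in range(50):
    # placeholder per-datapoint log-likelihoods at theta and theta'
    ll_theta = rng.normal(-1.0, 0.3, size=n)
    ll_prop = ll_theta + rng.normal(0.0, 0.01, size=n)
    lpr = 0.2    # log p(theta')/p(theta), arbitrary
    lqr = -0.1   # log q(theta|theta')/q(theta'|theta), arbitrary
    u = rng.uniform()

    # standard MH test, u < alpha, in log form
    mh_accept = np.log(u) < ll_prop.sum() - ll_theta.sum() + lpr + lqr

    # subsampling-friendly form: Lambda_n > psi
    psi = (np.log(u) - lpr - lqr) / n
    lam = np.mean(ll_prop - ll_theta)
    agree.append(mh_accept == (lam > psi))

print(all(agree))  # True: the two tests make the same decision
```

Dividing both sides of the log-test by n changes nothing to the decision, but it turns the statistic into an average, which concentration inequalities can handle.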


slide-23
SLIDE 23

Concentration inequalities

◮ Let δ > 0 and θ, θ′ ∈ Θ. We can find (t, ct(δ)) such that
  P( |Λ*_t(θ, θ′) − Λn(θ, θ′)| ≤ ct(δ) ) ≥ 1 − δ.
◮ For example, sampling without replacement, we prove in [3] that
  ct(δ) = · · · × √(1 − t/n) σ̂t/√t + · · · × Cθ,θ′/t
  is valid, where Cθ,θ′ = max_{1≤i≤n} |log p(xi|θ′) − log p(xi|θ)|.
◮ Assume you can compute Cθ,θ′ in o(n) time.
◮ Can we make the right accept/reject decision with probability 1 − δ?


slide-24
SLIDE 24

An adaptive choice of t

◮ Given θ, θ′ ∈ Θ and u ∈ [0, 1], an adaptive choice of t can guarantee that we know whether Λn(θ, θ′) > ψ(u, θ, θ′), with probability 1 − δ.

[Figure: ψ, Λn, and the estimate Λ*_t with a confidence band of width 2c_t(δi), shown for t = 1000, 2000, 4000.]

◮ Taking (δt) such that ∑_{t≥1} δt ≤ δ gives the result by a union bound.
◮ This Markov kernel inherits the ergodicity of the original MH kernel [1, 4].
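A sketch of this adaptive loop (my own toy version): double t until the confidence band separates Λ*_t from ψ, spending δ across levels so the union bound holds. I use a plain Hoeffding band as a stand-in for the sharper without-replacement bound ct(δ) of [3], whose constants are elided above.

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_decision(log_ratios, psi, delta=0.05, t0=64):
    """Decide whether Lambda_n > psi, looking at as few terms as possible.

    log_ratios[i] stands for log p(x_i|theta')/p(x_i|theta).
    """
    n = len(log_ratios)
    # range bound; the slide assumes this is computable in o(n) time,
    # here we just take the max for the sketch
    C = 2 * np.max(np.abs(log_ratios))
    perm = rng.permutation(log_ratios)   # sample without replacement
    t, level = t0, 1
    while True:
        t = min(t, n)
        lam_t = perm[:t].mean()
        delta_t = delta / (2 * level**2)           # sum over levels <= delta
        c_t = C * np.sqrt(np.log(2 / delta_t) / (2 * t))   # Hoeffding band
        if abs(lam_t - psi) > c_t or t == n:
            return lam_t > psi, t                  # decision, datapoints used
        t, level = 2 * t, level + 1

# synthetic log-ratios whose true mean (0.02) sits above psi = 0
log_ratios = rng.normal(0.02, 0.1, size=100_000)
decision, t_used = adaptive_decision(log_ratios, psi=0.0)
print(decision, t_used)
```

On this synthetic example the loop typically stops well before seeing all n points, which is the o(n)-per-iteration behavior claimed on the tall-data slide.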



slide-27
SLIDE 27

Confidence MH with a 2nd-order Taylor proxy on a toy example

◮ X ∼ N(0, 1),
◮ p(·|θ) = N(·|µ, σ²).

[Figure: posterior samples against the reference ("Ref") and the Bernstein-von Mises ("BvM") approximation, and the number of likelihood evaluations per iteration as n grows to 200,000; mean = 1.2%, median = 0.3% of the data.]

slide-28
SLIDE 28

Confidence MH with a 2nd-order Taylor proxy on a toy example

◮ X ∼ LogNormal(0, 1),
◮ p(·|θ) = N(·|µ, σ²).

[Figure: the same display for lognormal data; mean = 27.1%, median = 16.4% of the data used per iteration.]

slide-29
SLIDE 29

A saturation phenomenon

◮ Toy 2D logistic regression.
◮ We can use 2nd-order Taylor proxies in this case.

[Figure: Histograms of the number of likelihood evaluations, for lg10 n = 3, 4, 5, 6, 7.]

◮ We seem to have hit the sample complexity of the problem!


slide-30
SLIDE 30

Open issues

◮ An explicit relation between δ and the computational budget.
◮ Clear recommendations as to when to use the Laplace approximation and when to be a proper Bayesian.
◮ How do we take into account various data-access constraints?
◮ Shouldn't we go up one step, start from cost-aware loss functions, and then rederive algorithms?

To go further: our exhaustive (as of 2016) review of MCMC for tall data [2].


slide-31
SLIDE 31

To keep in touch

◮ Seminar organizers can contact Benjamin Guedj or myself.


slide-32
SLIDE 32

References

[1] R. Bardenet, A. Doucet, and C. Holmes. Towards scaling up MCMC: an adaptive subsampling approach. In Proceedings of the International Conference on Machine Learning (ICML), 2014. http://jmlr.org/proceedings/papers/v32/bardenet14-supp.pdf
[2] R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. Accepted in the Journal of Machine Learning Research (JMLR), 2016.
[3] R. Bardenet and O.-A. Maillard. Concentration inequalities for sampling without replacement. Bernoulli, 2015.
[4] D. Rudolf and N. Schweizer. Perturbation theory for Markov chains via Wasserstein distance. arXiv preprint arXiv:1503.04123, 2015.
