Efficient Bayesian computation by proximal Markov chain Monte Carlo: - - PowerPoint PPT Presentation

efficient bayesian computation by proximal markov chain
SMART_READER_LITE
LIVE PREVIEW

Efficient Bayesian computation by proximal Markov chain Monte Carlo: - - PowerPoint PPT Presentation

Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau Dr. Marcelo Pereyra http://www.macs.hw.ac.uk/ mp71/ Maxwell Institute for Mathematical Sciences, Heriot-Watt University June 2017, Heriot-Watt,


slide-1
SLIDE 1

Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau

  • Dr. Marcelo Pereyra

http://www.macs.hw.ac.uk/∼mp71/

Maxwell Institute for Mathematical Sciences, Heriot-Watt University

June 2017, Heriot-Watt, Edinburgh. Joint work with Alain Durmus and Eric Moulines

  • M. Pereyra (MI — HWU)

LMS 17 0 / 25

slide-2
SLIDE 2

Outline

1

Bayesian inference in imaging inverse problems

2

Proximal Markov chain Monte Carlo

3

Experiments

4

Conclusion

  • M. Pereyra (MI — HWU)

LMS 17 1 / 25

slide-3
SLIDE 3

Imaging inverse problems

We are interested in an unknown x ∈ Rd. We measure y, related to x by a statistical model p(y∣x). The recovery of x from y is ill-posed or ill-conditioned, resulting in significant uncertainty about x. For example, in many imaging problems y = Ax + w, for some operator A that is rank-deficient, and additive noise w.

  • M. Pereyra (MI — HWU)

LMS 17 2 / 25

slide-4
SLIDE 4

The Bayesian framework

We use priors to reduce uncertainty and deliver accurate results. Given the prior p(x), the posterior distribution of x given y p(x∣y) = p(y∣x)p(x)/p(y) models our knowledge about x after observing y. In this talk we consider that p(x∣y) is log-concave; i.e., p(x∣y) = exp{−φ(x)}/Z , where φ(x) is a convex function and Z = ∫ exp{−φ(x)}dx.

  • M. Pereyra (MI — HWU)

LMS 17 3 / 25

slide-5
SLIDE 5

Inverse problems in mathematical imaging

More precisely, we consider models of the form p(x∣y) ∝ exp{−f (x) − g(x)} (1) where f (x) and g(x) are lower semicontinuous convex functions from Rd → (−∞,+∞] and f is Lf -Lipschitz differentiable. For example, f (x) =

1 2σ2 ∥y − Ax∥2 2

for some observation y ∈ Rp and linear operator A ∈ Rp×n, and g(x) = α∥Bx∥† + 1S(x) for some norm ∥ ⋅ ∥†, dictionary B ∈ Rn×n, and convex set S. Often, g ∉ C1.

  • M. Pereyra (MI — HWU)

LMS 17 4 / 25

slide-6
SLIDE 6

Maximum-a-posteriori (MAP) estimation

The predominant Bayesian approach in imaging is MAP estimation ˆ xMAP = argmax

x∈Rd

p(x∣y), = argmin

x∈Rd

f (x) + g(x), (2) that can be computed efficiently by “proximal” convex optimisation. For example, the proximal gradient algorithm xm+1 = proxL−1

g {xm + L−1∇f (xm)},

with proxλ

g(x) = argmaxu∈RN g(u) − 1 2λ∣∣u − x∣∣2 converges at rate O(1/m).

However, ˆ xMAP provides very little about p(x∣y).

  • M. Pereyra (MI — HWU)

LMS 17 5 / 25

slide-7
SLIDE 7

Illustrative example: image resolution enhancement

Recover x ∈ Rd from low resolution and noisy measurements y = Hx + w, where H is a circulant blurring matrix. We use the Bayesian model p(x∣y) ∝ exp(−∥y − Hx∥2/2σ2 − β∥x∥1). (3)

y ˆ xMAP Uncertainty estimates? Figure : Resolution enhancement of the Molecules image of size 256×256 pixels.

  • M. Pereyra (MI — HWU)

LMS 17 6 / 25

slide-8
SLIDE 8

Illustrative example: tomographic image reconstruction

Recover x ∈ Rd from partially observed and noisy Fourier measurements y = ΦFx + w, where Φ is a mask and F is the 2D Fourier operator. We use the model p(x∣y) ∝ exp(−∥y − ΦFx∥2/2σ2 − β∥∇dx∥1−2), (4) where ∇d is the 2d discrete gradient operator and ∥ ⋅ ∥1−2 the ℓ1 − ℓ2 norm.

y ˆ xMAP Possible solution? Figure : Tomographic reconstruction of the Shepp-Logan phantom image.

  • M. Pereyra (MI — HWU)

LMS 17 7 / 25

slide-9
SLIDE 9

Modern Bayesian computation

Recent surveys on Bayesian computation...

25th anniversary special issue on Bayesian computation

  • P. Green, K. Latuszynski, M. Pereyra, C. P. Robert, ”Bayesian computation: a perspective on

the current state, and sampling backwards and forwards”, Statistics and Computing, vol. 25,

  • no. 4, pp 835-862, Jul. 2015.

Special issue on “Stochastic simulation and optimisation in signal processing”

  • M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. Hero, and S.

McLaughlin, “A Survey of Stochastic Simulation and Optimization Methods in Signal Pro- cessing” IEEE Sel. Topics in Signal Processing, in press.

  • M. Pereyra (MI — HWU)

LMS 17 8 / 25

slide-10
SLIDE 10

Outline

1

Bayesian inference in imaging inverse problems

2

Proximal Markov chain Monte Carlo

3

Experiments

4

Conclusion

  • M. Pereyra (MI — HWU)

LMS 17 9 / 25

slide-11
SLIDE 11

Inference by Markov chain Monte Carlo integration

Monte Carlo integration Given a set of samples X1,...,XM distributed according to p(x∣y), we approximate posterior expectations and probabilities 1 M

M

m=1

h(Xm) → E{h(x)∣y}, as M → ∞ Guarantees from CLTs, e.g.,

1 √ M ∑M m=1 h(Xm) ∼ N[E{h(x)∣y},Σ].

Markov chain Monte Carlo: Construct a Markov kernel Xm+1∣Xm ∼ K(⋅∣Xm) such that the Markov chain X1,...,XM has p(x∣y) as stationary distribution. MCMC simulation in high-dimensional spaces is very challenging.

  • M. Pereyra (MI — HWU)

LMS 17 10 / 25

slide-12
SLIDE 12

Unadjusted Langevin algorithm

Suppose for now that p(x∣y) ∈ C1. Then, we can generate samples by mimicking a Langevin diffusion process that converges to p(x∣y) as t → ∞, X ∶ dXt = 1 2∇log p (Xt∣y)dt + dWt, 0 ≤ t ≤ T, X(0) = x0. where W is the n-dimensional Brownian motion. Because solving Xt exactly is generally not possible, we use an Euler Maruyama approximation and obtain the “unadjusted Langevin algorithm” ULA ∶ Xm+1 = Xm + δ∇log p(Xm∣y) + √ 2δZm+1, Zm+1 ∼ N(0,In) ULA is remarkably efficient when p(x∣y) is sufficiently regular.

  • M. Pereyra (MI — HWU)

LMS 17 11 / 25

slide-13
SLIDE 13

Unadjusted Langevin algorithm

However, our interest is in high-dimensional models of the form p(x∣y) ∝ exp{−f (x) − g(x)} with f ,g l.s.c. convex, ∇f Lf -Lipschitz continuous, and g ∉ C1. Unfortunately, such models are beyond the scope of ULA, which may perform poorly if p(x∣y) is not Lipchitz differentiable. Idea: Regularise p(x∣y) to enable efficiently Langevin sampling.

  • M. Pereyra (MI — HWU)

LMS 17 12 / 25

slide-14
SLIDE 14

Approximation of p(x∣y)

Moreau-Yoshida approximation of p(x∣y) (Pereyra, 2015): Let λ > 0. We propose to approximate p(x∣y) with the density pλ(x∣y) = exp[−f (x) − gλ(x)] ∫Rd exp[−f (x) − gλ(x)]dx , where gλ is the Moreau-Yoshida envelope of g given by gλ(x) = inf

u∈Rd{g(u) − (2λ)−1∥u − x∥2 2},

and where λ controls the approximation error involved.

  • M. Pereyra (MI — HWU)

LMS 17 13 / 25

slide-15
SLIDE 15

Moreau-Yoshida approximations

Key properties (Pereyra, 2015; Durmus et al., 2017):

1 ∀λ > 0, pλ defines a proper density of a probability measure on Rd. 2 Convexity and differentiability:

pλ is log-concave on Rd. pλ ∈ C1 even if p not differentiable, with ∇log pλ(x∣y) = −∇f (x) + {proxλ

g (x) − x}/λ,

and proxλ

g (x) = argmaxu∈RN g(u) − 1 2λ∣∣u − x∣∣2.

∇log pλ is Lipchitz continuous with constant L ≤ Lf + λ−1.

3 Approximation error between pλ(x∣y) and p(x∣y):

limλ→0 ∥pλ − p∥TV = 0. If g is Lg-Lipchitz, then ∥pλ − p∥TV ≤ λL2

g.

  • M. Pereyra (MI — HWU)

LMS 17 14 / 25

slide-16
SLIDE 16

Illustration

Examples of Moreau-Yoshida approximations:

p(x) ∝ exp(−∣x∣) p(x) ∝ exp(−x4) p(x) ∝ 1[−0.5,0.5](x) Figure : True densities (solid blue) and approximations (dashed red).

  • M. Pereyra (MI — HWU)

LMS 17 15 / 25

slide-17
SLIDE 17

Proximal ULA

We approximate X with the “regularised” auxiliary Langevin diffusion Xλ ∶ dXλ

t = 1

2∇log pλ (Xλ

t ∣y)dt + dWt,

0 ≤ t ≤ T, Xλ(0) = x0, which targets pλ(x∣y). Remark: we can make Xλ arbitrarily close to X. Finally, an Euler Maruyama discretisation of Xλ leads to the (Moreau-Yoshida regularised) proximal ULA MYULA ∶ Xm+1 = (1 − δ

λ)Xm − δ∇f {Xm} + δ λ proxλ g{Xm} +

√ 2δZm+1, where we used that ∇gλ(x) = {x − proxλ

g(x)}/λ.

  • M. Pereyra (MI — HWU)

LMS 17 16 / 25

slide-18
SLIDE 18

Convergence results

Non-asymptotic estimation error bound

Theorem 2.1 (Durmus et al. (2017))

Let δmax

λ

= (L1 + 1/λ)−1. Assume that g is Lipchitz continuous. Then, there exist δǫ ∈ (0,δmax

λ

] and Mǫ ∈ N such that ∀δ < δǫ and ∀M ≥ Mǫ ∥δx0QM

δ − p∥TV < ǫ + λL2 g,

where QM

δ

is the kernel assoc. with M iterations of MYULA with step δ. Note: δǫ and Mǫ are explicit and tractable. If f + g is strongly convex

  • utside some ball, then Mǫ scales with order O(d log(d)) (otherwise at

worse O(d5)). See Durmus et al. (2017) for other convergence results.

  • M. Pereyra (MI — HWU)

LMS 17 17 / 25

slide-19
SLIDE 19

Outline

1

Bayesian inference in imaging inverse problems

2

Proximal Markov chain Monte Carlo

3

Experiments

4

Conclusion

  • M. Pereyra (MI — HWU)

LMS 17 18 / 25

slide-20
SLIDE 20

Sparse image deblurring

Bayesian credible region C ∗

α = {x ∶ p(x∣y) ≥ γα} with

P[x ∈ Cα∣y] = 1 − α, and p(x∣y) ∝ exp(−∥y − Hx∥2/2σ2 − β∥x∥1) y ˆ xMAP uncertainty estimates Figure : Live-cell microscopy data (Zhu et al., 2012). Uncertainty analysis (±78nm × ±125nm) in close agreement with the experimental precision ±80nm. Computing time 4 minutes. M = 105 iterations. Estimation error 0.2%..

  • M. Pereyra (MI — HWU)

LMS 17 19 / 25

slide-21
SLIDE 21

Sparse image deblurring

Estimation of reg. param. β by marginal maximum likelihood ˆ β = argmax

β∈R+

p(y∣β), with p(y∣β) ∝ ∫ exp(−∥y − Hx∥2/2σ2 − β∥x∥1)dx y ˆ xMAP

  • Reg. param β

Figure : Maximum marginal likelihood estimation of regularisation parameter β. Computing time 0.75 secs..

  • M. Pereyra (MI — HWU)

LMS 17 20 / 25

slide-22
SLIDE 22

Bayesian model selection

p(Mk∣y) = p(Mk)∫ p(x,y∣Mk)dx/p(y) with p(x,y∣M1) ∝ exp[−(∥y − H1x∥2/2σ2) − βTV (x)], p(x,y∣M2) ∝ exp[−(∥y − H2x∥2/2σ2) − βTV (x)]. Boat image deblurring experiment (comp. time 30 minutes p/model):

  • bservation y

(5 × 5 uniform blur, BSNR 40dB)

ˆ xM1 (PSNR 34dB) p(M1∣y) = 0.96 ˆ xM2 (PSNR 33dB) p(M2∣y) = 0.04

  • M. Pereyra (MI — HWU)

LMS 17 21 / 25

slide-23
SLIDE 23

Uncertainty quantification of MRI tomographic image

Bayesian credible region C ∗

α = {x ∶ p(x∣y) ≥ γα} with

P[x ∈ Cα∣y] = 1 − α, and p(x∣y) ∝ exp(−∥y − ΦFx∥2/2σ2 − β∥∇dx∥1−2),

ˆ xMAP (tumour intensity 0.30)

  • min. tumour intensity 0.27
  • max. tumour intensity 0.33

Figure : Shepp-Logan experiment: uncertainty in tumour intensity 10%. Computing time 1 minute. M = 105 iterations. Estimation error 3%. .

  • M. Pereyra (MI — HWU)

LMS 17 22 / 25

slide-24
SLIDE 24

Outline

1

Bayesian inference in imaging inverse problems

2

Proximal Markov chain Monte Carlo

3

Experiments

4

Conclusion

  • M. Pereyra (MI — HWU)

LMS 17 23 / 25

slide-25
SLIDE 25

Conclusion

The challenges facing modern image processing require a paradigm shift, and a new wave of analysis and computation methodologies. Great potential for synergy between Bayesian and variational approaches at algorithmic, methodological, and theoretical levels. MYULA delivers reliable and computationally efficient approximate inferences, with good control of accuracy vs. computing-time.

  • M. Pereyra (MI — HWU)

LMS 17 24 / 25

slide-26
SLIDE 26

Thank you!

Bibliography:

Durmus, A., Moulines, E., and Pereyra, M. (2017). Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM J. Imaging

  • Sci. to appear.

Pereyra, M. (2015). Proximal Markov chain Monte Carlo algorithms. Statistics and

  • Computing. open access paper, http://dx.doi.org/10.1007/s11222-015-9567-4.

Zhu, L., Zhang, W., Elnatan, D., and Huang, B. (2012). Faster STORM using compressed sensing. Nat. Meth., 9(7):721–723.

  • M. Pereyra (MI — HWU)

LMS 17 25 / 25