Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau


  1. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. Dr. Marcelo Pereyra, http://www.macs.hw.ac.uk/~mp71/, Maxwell Institute for Mathematical Sciences, Heriot-Watt University. June 2017, Heriot-Watt, Edinburgh. Joint work with Alain Durmus and Eric Moulines.

  2. Outline: 1. Bayesian inference in imaging inverse problems; 2. Proximal Markov chain Monte Carlo; 3. Experiments; 4. Conclusion.

  3. Imaging inverse problems. We are interested in an unknown x ∈ R^d. We measure y, related to x by a statistical model p(y∣x). The recovery of x from y is ill-posed or ill-conditioned, resulting in significant uncertainty about x. For example, in many imaging problems y = Ax + w, for some operator A that is rank-deficient, and additive noise w.

  4. The Bayesian framework. We use priors to reduce uncertainty and deliver accurate results. Given the prior p(x), the posterior distribution of x given y, p(x∣y) = p(y∣x)p(x)/p(y), models our knowledge about x after observing y. In this talk we consider that p(x∣y) is log-concave; i.e., p(x∣y) = exp{−φ(x)}/Z, where φ(x) is a convex function and Z = ∫ exp{−φ(x)} dx.

  5. Inverse problems in mathematical imaging. More precisely, we consider models of the form p(x∣y) ∝ exp{−f(x) − g(x)} (1), where f(x) and g(x) are lower semicontinuous convex functions from R^d → (−∞, +∞] and f is L_f-Lipschitz differentiable. For example, f(x) = ∥y − Ax∥²/(2σ²) for some observation y ∈ R^p and linear operator A ∈ R^{p×n}, and g(x) = α∥Bx∥_† + 1_S(x) for some norm ∥·∥_†, dictionary B ∈ R^{n×n}, and convex set S. Often, g ∉ C¹.

  6. Maximum-a-posteriori (MAP) estimation. The predominant Bayesian approach in imaging is MAP estimation, x̂_MAP = argmax_{x∈R^d} p(x∣y) = argmin_{x∈R^d} f(x) + g(x) (2), which can be computed efficiently by “proximal” convex optimisation. For example, the proximal gradient algorithm x_{m+1} = prox_g^{1/L}{x_m − (1/L)∇f(x_m)}, with prox_g^λ(x) = argmin_{u∈R^d} g(u) + (1/2λ)∥u − x∥², converges at rate O(1/m). However, x̂_MAP provides very little information about p(x∣y).
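A minimal sketch of this forward-backward iteration for the Gaussian-likelihood, ℓ1-prior model used in the examples below (Python/NumPy). The dense operator A, noise level sigma, and regularisation weight beta are illustrative assumptions, not values from the talk.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def map_proximal_gradient(y, A, sigma, beta, n_iter=500):
    """MAP estimate for p(x|y) ∝ exp(-||y - Ax||^2/(2 sigma^2) - beta*||x||_1)
    via forward-backward splitting:
    x_{m+1} = prox_g^{1/L}( x_m - (1/L) * grad f(x_m) )."""
    L = np.linalg.norm(A, 2) ** 2 / sigma ** 2        # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad_f = A.T @ (A @ x - y) / sigma ** 2       # gradient (forward) step on the data fit
        x = soft_threshold(x - grad_f / L, beta / L)  # proximal (backward) step on beta*||.||_1
    return x
```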

  7. Illustrative example: image resolution enhancement. Recover x ∈ R^d from low resolution and noisy measurements y = Hx + w, where H is a circulant blurring matrix. We use the Bayesian model p(x∣y) ∝ exp(−∥y − Hx∥²/2σ² − β∥x∥₁) (3). [Figure panels: y, x̂_MAP, uncertainty estimates?] Figure: Resolution enhancement of the Molecules image of size 256 × 256 pixels.
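Since H is circulant, the likelihood gradient needed by the optimisation and sampling algorithms can be applied with FFTs rather than explicit matrices. A rough sketch under that assumption; kernel_fft (the 2D FFT of the blur kernel, padded to the image size) and sigma are hypothetical inputs.

```python
import numpy as np

def circulant_blur_ops(kernel_fft, sigma):
    """Gradient of f(x) = ||y - Hx||^2 / (2 sigma^2) for a circulant blur H,
    where H x is a circular convolution implemented in the Fourier domain."""
    H  = lambda x: np.real(np.fft.ifft2(kernel_fft * np.fft.fft2(x)))
    Ht = lambda x: np.real(np.fft.ifft2(np.conj(kernel_fft) * np.fft.fft2(x)))
    grad_f = lambda x, y: Ht(H(x) - y) / sigma ** 2
    L_f = np.max(np.abs(kernel_fft)) ** 2 / sigma ** 2  # Lipschitz constant of grad_f
    return grad_f, L_f
```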

  8. Illustrative example: tomographic image reconstruction. Recover x ∈ R^d from partially observed and noisy Fourier measurements y = ΦFx + w, where Φ is a mask and F is the 2D Fourier operator. We use the model p(x∣y) ∝ exp(−∥y − ΦFx∥²/2σ² − β∥∇_d x∥_{1−2}) (4), where ∇_d is the 2D discrete gradient operator and ∥·∥_{1−2} the ℓ1−ℓ2 norm. [Figure panels: y, x̂_MAP, possible solution?] Figure: Tomographic reconstruction of the Shepp-Logan phantom image.

  9. Modern Bayesian computation. Recent surveys on Bayesian computation. 25th anniversary special issue on Bayesian computation: P. Green, K. Latuszynski, M. Pereyra, C. P. Robert, “Bayesian computation: a perspective on the current state, and sampling backwards and forwards”, Statistics and Computing, vol. 25, no. 4, pp. 835-862, Jul. 2015. Special issue on “Stochastic simulation and optimisation in signal processing”: M. Pereyra, P. Schniter, E. Chouzenoux, J.-C. Pesquet, J.-Y. Tourneret, A. Hero, and S. McLaughlin, “A Survey of Stochastic Simulation and Optimization Methods in Signal Processing”, IEEE J. Sel. Topics in Signal Processing, in press.

  10. Outline: 1. Bayesian inference in imaging inverse problems; 2. Proximal Markov chain Monte Carlo; 3. Experiments; 4. Conclusion.

  11. Inference by Markov chain Monte Carlo integration. Monte Carlo integration: given a set of samples X_1, ..., X_M distributed according to p(x∣y), we approximate posterior expectations and probabilities by (1/M) ∑_{m=1}^M h(X_m) → E{h(x)∣y} as M → ∞, with guarantees from central limit theorems, e.g., (1/M) ∑_{m=1}^M h(X_m) ∼ N[E{h(x)∣y}, Σ] approximately for large M. Markov chain Monte Carlo: construct a Markov kernel X_{m+1}∣X_m ∼ K(·∣X_m) such that the Markov chain X_1, ..., X_M has p(x∣y) as its stationary distribution. MCMC simulation in high-dimensional spaces is very challenging.
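A small sketch of such Monte Carlo estimators applied to a chain of samples. The error bar below uses the naive CLT formula and ignores the autocorrelation of an MCMC chain, so treat it as illustrative only.

```python
import numpy as np

def mc_estimate(samples, h):
    """Approximate E{h(x)|y} from samples X_1,...,X_M (rows of `samples`),
    together with a crude CLT-style standard error."""
    vals = np.array([h(x) for x in samples])
    est = vals.mean()
    std_err = vals.std(ddof=1) / np.sqrt(len(vals))  # ignores chain autocorrelation
    return est, std_err

# e.g. posterior mean of the first pixel: mc_estimate(chain, lambda x: x[0])
```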

  12. Unadjusted Langevin algorithm. Suppose for now that p(x∣y) ∈ C¹. Then, we can generate samples by mimicking a Langevin diffusion process that converges to p(x∣y) as t → ∞, X: dX_t = ½ ∇log p(X_t∣y) dt + dW_t, 0 ≤ t ≤ T, X(0) = x_0, where W is the n-dimensional Brownian motion. Because solving X_t exactly is generally not possible, we use an Euler-Maruyama approximation and obtain the “unadjusted Langevin algorithm” ULA: X_{m+1} = X_m + δ ∇log p(X_m∣y) + √(2δ) Z_{m+1}, Z_{m+1} ∼ N(0, I_n). ULA is remarkably efficient when p(x∣y) is sufficiently regular.
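A minimal sketch of the ULA recursion as stated above; grad_log_post, the step size delta, and the starting point x0 are user-supplied assumptions, and no Metropolis correction is applied.

```python
import numpy as np

def ula(grad_log_post, x0, delta, n_iter, rng=None):
    """Unadjusted Langevin algorithm:
    X_{m+1} = X_m + delta * grad log p(X_m|y) + sqrt(2*delta) * Z_{m+1}."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter,) + x.shape)
    for m in range(n_iter):
        x = x + delta * grad_log_post(x) + np.sqrt(2.0 * delta) * rng.standard_normal(x.shape)
        chain[m] = x
    return chain
```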

  13. Unadjusted Langevin algorithm. However, our interest is in high-dimensional models of the form p(x∣y) ∝ exp{−f(x) − g(x)} with f, g l.s.c. convex, ∇f L_f-Lipschitz continuous, and g ∉ C¹. Unfortunately, such models are beyond the scope of ULA, which may perform poorly if p(x∣y) is not Lipschitz differentiable. Idea: regularise p(x∣y) to enable efficient Langevin sampling.

  14. Approximation of p(x∣y). Moreau-Yoshida approximation of p(x∣y) (Pereyra, 2015): let λ > 0. We propose to approximate p(x∣y) with the density p^λ(x∣y) = exp[−f(x) − g^λ(x)] / ∫_{R^d} exp[−f(x) − g^λ(x)] dx, where g^λ is the Moreau-Yoshida envelope of g given by g^λ(x) = inf_{u∈R^d} {g(u) + (2λ)^{-1} ∥u − x∥²}, and where λ controls the approximation error involved.
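To make the envelope concrete, here is a toy numerical evaluation of g^λ for a scalar potential by brute-force minimisation over a grid; the potential g(x) = |x| and the grid are illustrative choices only, echoing one of the examples on the illustration slide.

```python
import numpy as np

def moreau_envelope(g, x, lam, grid):
    """Numerically evaluate g^lambda(x) = inf_u { g(u) + (u - x)^2 / (2*lam) }
    for a scalar x by minimising over a discrete grid of candidate u values."""
    return np.min(g(grid) + (grid - x) ** 2 / (2.0 * lam))

# Example: smoothing of the Laplace potential g(x) = |x| (cf. p(x) ∝ exp(-|x|)).
u_grid = np.linspace(-5.0, 5.0, 2001)
smoothed = [moreau_envelope(np.abs, x, lam=0.5, grid=u_grid) for x in np.linspace(-3, 3, 7)]
```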

  15. Moreau-Yoshida approximations. Key properties (Pereyra, 2015; Durmus et al., 2017): 1. For all λ > 0, p^λ defines a proper density of a probability measure on R^d. 2. Convexity and differentiability: p^λ is log-concave on R^d; p^λ ∈ C¹ even if p is not differentiable, with ∇log p^λ(x∣y) = −∇f(x) + {prox_g^λ(x) − x}/λ, where prox_g^λ(x) = argmin_{u∈R^d} g(u) + (1/2λ)∥u − x∥²; and ∇log p^λ is Lipschitz continuous with constant L ≤ L_f + λ^{-1}. 3. Approximation error between p^λ(x∣y) and p(x∣y): lim_{λ→0} ∥p^λ − p∥_TV = 0. If g is L_g-Lipschitz, then ∥p^λ − p∥_TV ≤ λ L_g².
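The gradient identity in property 2 is what the sampler will use: only the proximal operator of g is needed, never its gradient. A hedged sketch, taking g = β∥·∥₁ (so the prox is soft-thresholding) purely as an example; other choices of g simply swap in their own prox.

```python
import numpy as np

def prox_l1(x, lam, beta=1.0):
    """prox_g^lam for g = beta*||.||_1, i.e. soft-thresholding with threshold lam*beta."""
    return np.sign(x) * np.maximum(np.abs(x) - lam * beta, 0.0)

def grad_log_p_lambda(x, grad_f, prox_g, lam):
    """grad log p^lambda(x|y) = -grad f(x) + (prox_g^lam(x) - x) / lam."""
    return -grad_f(x) + (prox_g(x, lam) - x) / lam
```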

  16. Illustration. Examples of Moreau-Yoshida approximations: p(x) ∝ exp(−x⁴); p(x) ∝ 1_{[−0.5, 0.5]}(x); p(x) ∝ exp(−∣x∣). Figure: true densities (solid blue) and approximations (dashed red).

  17. Proximal ULA. We approximate X with the “regularised” auxiliary Langevin diffusion X^λ: dX^λ_t = ½ ∇log p^λ(X^λ_t∣y) dt + dW_t, 0 ≤ t ≤ T, X^λ(0) = x_0, which targets p^λ(x∣y). Remark: we can make X^λ arbitrarily close to X. Finally, an Euler-Maruyama discretisation of X^λ leads to the (Moreau-Yoshida regularised) proximal ULA, MYULA: X_{m+1} = (1 − δ/λ) X_m − δ ∇f(X_m) + (δ/λ) prox_g^λ(X_m) + √(2δ) Z_{m+1}, where we used that ∇g^λ(x) = {x − prox_g^λ(x)}/λ.
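A minimal sketch of the MYULA recursion exactly as written above, reusing a prox_g(x, lam) callable such as the soft-thresholding operator from the previous sketch; the choices of lam, delta, and x0 are left to the user. The convergence result on the next slide suggests keeping delta below (L_f + 1/λ)^{-1}.

```python
import numpy as np

def myula(grad_f, prox_g, lam, x0, delta, n_iter, rng=None):
    """Moreau-Yoshida regularised ULA (MYULA):
    X_{m+1} = (1 - delta/lam)*X_m - delta*grad_f(X_m)
              + (delta/lam)*prox_g^lam(X_m) + sqrt(2*delta)*Z_{m+1}."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_iter,) + x.shape)
    for m in range(n_iter):
        noise = np.sqrt(2.0 * delta) * rng.standard_normal(x.shape)
        x = (1.0 - delta / lam) * x - delta * grad_f(x) + (delta / lam) * prox_g(x, lam) + noise
        chain[m] = x
    return chain
```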

  18. Convergence results: non-asymptotic estimation error bound. Theorem 2.1 (Durmus et al. (2017)). Let δ_max^λ = (L_f + 1/λ)^{-1} and assume that g is Lipschitz continuous. Then, there exist δ_ε ∈ (0, δ_max^λ] and M_ε ∈ N such that for all δ < δ_ε and all M ≥ M_ε, ∥δ_{x_0} Q_δ^M − p∥_TV < ε + λ L_g², where Q_δ^M is the kernel associated with M iterations of MYULA with step δ. Note: δ_ε and M_ε are explicit and tractable. If f + g is strongly convex outside some ball, then M_ε scales with order O(d log(d)) (otherwise at worst O(d⁵)). See Durmus et al. (2017) for other convergence results.

  19. Outline: 1. Bayesian inference in imaging inverse problems; 2. Proximal Markov chain Monte Carlo; 3. Experiments; 4. Conclusion.

  20. Sparse image deblurring. Model: p(x∣y) ∝ exp(−∥y − Hx∥²/2σ² − β∥x∥₁), with Bayesian credible region C*_α = {x : p(x∣y) ≥ γ_α} such that P[x ∈ C_α∣y] = 1 − α. [Figure panels: y, x̂_MAP, uncertainty estimates] Figure: Live-cell microscopy data (Zhu et al., 2012). Uncertainty analysis (±78 nm × ±125 nm) in close agreement with the experimental precision ±80 nm. Computing time 4 minutes, M = 10⁵ iterations, estimation error 0.2%.
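One way to turn MCMC output into such credible regions is to estimate the threshold γ_α empirically from the sampled values of the potential φ = f + g (smaller φ means higher posterior density). A hedged sketch of that idea, not necessarily the exact procedure used in the talk.

```python
import numpy as np

def hpd_threshold(phi_samples, alpha):
    """Empirical threshold for the highest-posterior-density region
    C_alpha = {x : phi(x) <= gamma_alpha}, with phi(x) = f(x) + g(x),
    chosen so that roughly a fraction 1 - alpha of the samples fall inside."""
    return np.quantile(phi_samples, 1.0 - alpha)

def in_credible_region(phi_x, phi_samples, alpha=0.05):
    """Check whether a candidate point with potential value phi_x lies in C_alpha."""
    return phi_x <= hpd_threshold(phi_samples, alpha)
```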
