SLIDE 1

The Bayesian approach to inverse problems

Youssef Marzouk

Department of Aeronautics and Astronautics
Center for Computational Engineering
Massachusetts Institute of Technology
ymarz@mit.edu, http://uqgroup.mit.edu

7 July 2015

SLIDE 2

Statistical inference

Why is a statistical perspective useful in inverse problems?

To characterize uncertainty in the inverse solution

To understand how this uncertainty depends on the number and quality of observations, features of the forward model, prior information, etc.

To make probabilistic predictions
To choose “good” observations or experiments
To address questions of model error, model validity, and model selection

SLIDE 3

Bayesian inference

Bayes’ rule

p(θ|y) = p(y|θ) p(θ) / p(y)

Key idea: model parameters θ are treated as random variables. (For simplicity, we let our random variables have densities.)

Notation

θ are the model parameters; y are the data; assume both to be finite-dimensional unless otherwise indicated
p(θ) is the prior probability density
L(θ) ≡ p(y|θ) is the likelihood function
p(θ|y) is the posterior probability density
p(y) is the evidence, or equivalently, the marginal likelihood
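As a concrete illustration of this notation, a minimal sketch (with a hypothetical scalar parameter, Gaussian prior, and Gaussian likelihood, all illustrative choices) evaluates Bayes' rule on a grid:

```python
import numpy as np

# Hypothetical scalar example: y = theta + noise, Gaussian prior on theta.
theta = np.linspace(-5, 5, 1001)          # grid over the parameter
y_obs, sigma = 1.2, 0.5                    # observed datum and noise std (assumed)

prior = np.exp(-0.5 * theta**2)                              # p(theta): N(0, 1), unnormalized
likelihood = np.exp(-0.5 * ((y_obs - theta) / sigma)**2)     # p(y | theta)
posterior = prior * likelihood                               # numerator of Bayes' rule
posterior /= np.sum(posterior) * (theta[1] - theta[0])       # divide by (a grid estimate of) the evidence p(y)
```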

SLIDE 4

Bayesian inference

Summaries of the posterior distribution

What information to extract?

Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
Posterior covariance or higher moments of θ
Quantiles
Credible intervals: C(y) such that P[θ ∈ C(y) | y] = 1 − α

Credible intervals are not uniquely defined by the condition above; thus consider, for example, the HPD (highest posterior density) region.

Posterior realizations: for direct assessment, or to estimate posterior predictions or other posterior expectations
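Given posterior realizations (e.g., from a sampler), these summaries are straightforward to estimate. A minimal sketch, using synthetic draws as a stand-in for actual posterior samples:

```python
import numpy as np

samples = np.random.default_rng(0).normal(1.0, 0.3, size=5000)  # stand-in for posterior draws

post_mean = samples.mean()                       # posterior mean
post_var = samples.var(ddof=1)                   # posterior variance
q05, q95 = np.quantile(samples, [0.05, 0.95])    # quantiles
# One (non-unique) 90% credible interval: the central interval [q05, q95].
```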

SLIDE 5

Bayesian and frequentist statistics

Understanding both perspectives is useful and important...

Key differences between these two statistical paradigms

Frequentists do not assign probabilities to unknown parameters θ. One can write likelihoods p_θ(y) ≡ p(y|θ) but not priors p(θ) or posteriors; θ is not a random variable.

In the frequentist viewpoint, there is no single preferred methodology for inverting the relationship between parameters and data. Instead, consider various estimators θ̂(y) of θ.

The estimator θ̂ is a random variable. Why? The frequentist paradigm considers y to result from a random and repeatable experiment.

SLIDE 6

Bayesian and frequentist statistics

Key differences (continued)

Evaluate the quality of θ̂ through various criteria: bias, variance, mean-square error, consistency, efficiency, ...
One common estimator is maximum likelihood: θ̂_ML = argmax_θ p(y|θ). Here p(y|θ) defines a family of distributions indexed by θ.
Link to the Bayesian approach: the MAP estimate maximizes a “penalized likelihood,” i.e., log p(y|θ) + log p(θ); a sketch contrasting the two appears below.
What about Bayesian versus frequentist prediction of y_new ⊥ y | θ?

Frequentist: “plug-in” or other estimators of y_new
Bayesian: posterior prediction via integration, p(y_new|y) = ∫ p(y_new|θ) p(θ|y) dθ
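A minimal sketch of the ML/MAP contrast, assuming a hypothetical nonlinear forward model, Gaussian noise, and a Gaussian prior (all illustrative choices), optimized with a generic routine:

```python
import numpy as np
from scipy.optimize import minimize

g = lambda theta: np.array([theta[0]**2 + theta[1], np.sin(theta[1])])  # hypothetical forward model
y = np.array([1.3, 0.2])          # observed data (assumed)
sigma, tau = 0.1, 1.0             # noise std and prior std (assumed)

neg_log_like = lambda t: 0.5 * np.sum(((y - g(t)) / sigma)**2)
neg_log_prior = lambda t: 0.5 * np.sum((t / tau)**2)

theta_ml = minimize(neg_log_like, x0=np.zeros(2)).x                     # maximum likelihood
theta_map = minimize(lambda t: neg_log_like(t) + neg_log_prior(t),      # MAP = penalized likelihood
                     x0=np.zeros(2)).x
```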

SLIDE 7

Bayesian inference

Likelihood functions

In general, p(y|θ) is a probabilistic model for the data.
In the inverse problem or parameter estimation context, the likelihood function is where the forward model appears, along with a noise model and (if applicable) an expression for model discrepancy.
Contrasting example (but not really!): parametric density estimation, where the likelihood function results from the probability density itself.

Selected examples of likelihood functions (a sketch of the second case follows):

1. Bayesian linear regression
2. Nonlinear forward model g(θ) with additive Gaussian noise
3. Nonlinear forward model with noise + model discrepancy
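For case 2, y = g(θ) + ε with ε ∼ N(0, σ²I), the log-likelihood is −‖y − g(θ)‖²/(2σ²) up to an additive constant. A minimal sketch, with a hypothetical forward model standing in for g:

```python
import numpy as np

def log_likelihood(theta, y, g, sigma):
    """Gaussian log-likelihood for y = g(theta) + N(0, sigma^2 I), up to an additive constant."""
    resid = y - g(theta)
    return -0.5 * np.dot(resid, resid) / sigma**2

# Hypothetical forward model and data, for illustration only.
g = lambda theta: np.tanh(theta)           # stands in for an expensive simulation
y = np.array([0.3, -0.1, 0.7])
print(log_likelihood(np.zeros(3), y, g, sigma=0.05))
```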

SLIDE 8

Bayesian inference

Prior distributions

In ill-posed parameter estimation problems, e.g., inverse problems, prior information plays a key role.
Intuitive idea: assign lower probability to values of θ that you don’t expect to see, and higher probability to values of θ that you do expect to see.

Examples:

1. Gaussian processes with specified covariance kernel
2. Gaussian Markov random fields
3. Gaussian priors derived from differential operators
4. Hierarchical priors
5. Besov space priors
6. Higher-level representations (objects, marked point processes)

SLIDE 9

Gaussian process priors

Key idea: any finite-dimensional distribution of the stochastic process θ(x, ω) : D × Ω → R is multivariate normal. In other words, θ(x, ω) is a collection of jointly Gaussian random variables, indexed by x.

Specify via the mean function and covariance function:

E[θ(x)] = µ(x)
E[(θ(x) − µ(x)) (θ(x′) − µ(x′))] = C(x, x′)

Smoothness of the process is controlled by the behavior of the covariance function as x′ → x. Possible restrictions: stationarity, isotropy, ...
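A minimal sketch of drawing realizations from a GP prior on a 1-D grid, using a squared-exponential covariance kernel as an illustrative choice:

```python
import numpy as np

x = np.linspace(0, 1, 200)
mu = np.zeros_like(x)                                   # mean function mu(x) = 0
ell, var = 0.1, 1.0                                     # correlation length and variance (assumed)
C = var * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)   # squared-exponential kernel

# Sample theta ~ N(mu, C) via a Cholesky factor (small jitter added for numerical stability).
L = np.linalg.cholesky(C + 1e-10 * np.eye(x.size))
theta_draws = mu[:, None] + L @ np.random.default_rng(1).standard_normal((x.size, 3))
```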

SLIDE 10

Example: stationary Gaussian random fields

Figure: random field realizations with an exponential covariance kernel (left) and a Gaussian covariance kernel (right). Both are θ(x, ω) : D × Ω → R, with D = [0, 1]².

SLIDE 11

Gaussian Markov random fields

Key idea: discretize space and specify a sparse inverse covariance (“precision”) matrix W:

p(θ) ∝ exp( −(γ/2) θ^T W θ )

where γ controls the scale.

Full conditionals p(θ_i | θ_∼i) are available analytically and may simplify dramatically. Represent as an undirected graphical model.

Example: E[θ_i | θ_∼i] is just an average of site i’s nearest neighbors.
Quite flexible; even used to simulate textures.
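A minimal sketch of a GMRF prior on a 1-D lattice, assuming (purely for illustration) a precision matrix W built from second differences; a draw is obtained from a Cholesky factor of the precision:

```python
import numpy as np
import scipy.sparse as sp

n, gamma = 200, 50.0
# Sparse precision from a second-difference (nearest-neighbor) structure, plus a small
# diagonal shift for numerical robustness (illustrative choices).
D2 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
W = (D2 + 1e-4 * sp.eye(n)).toarray()

# If gamma*W = L L^T, then theta = L^{-T} z with z ~ N(0, I) has covariance (gamma*W)^{-1}.
L = np.linalg.cholesky(gamma * W)
z = np.random.default_rng(2).standard_normal(n)
theta = np.linalg.solve(L.T, z)   # one GMRF prior draw; p(theta) ∝ exp(-(gamma/2) theta^T W theta)
```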

SLIDE 12

Priors through differential operators

Key idea: return to the infinite-dimensional setting; again penalize roughness in θ(x).
Stuart 2010: define the prior using fractional negative powers of the Laplacian A = −∆:

θ ∼ N(θ_0, β A^{−α})

Sufficiently large α (α > d/2), along with conditions on the likelihood, ensures that the posterior measure is well defined.
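A minimal sketch of drawing from such a prior on a discretized 1-D domain, assuming θ_0 = 0, a standard finite-difference Laplacian with homogeneous Dirichlet boundary conditions, and illustrative values of β and α:

```python
import numpy as np

n, beta, alpha = 200, 1.0, 2.0
h = 1.0 / (n + 1)
# Finite-difference Laplacian A = -Δ on (0, 1) with homogeneous Dirichlet boundary conditions.
A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / h**2

# Covariance beta * A^{-alpha} via the eigendecomposition of A (symmetric positive definite).
evals, evecs = np.linalg.eigh(A)
cov_sqrt = evecs @ np.diag(np.sqrt(beta) * evals**(-alpha / 2.0)) @ evecs.T
theta = cov_sqrt @ np.random.default_rng(3).standard_normal(n)   # one draw from N(0, beta * A^{-alpha})
```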

SLIDE 13

GPs, GMRFs, and SPDEs

In fact, all three “types” of Gaussian priors just described are closely connected.

Linear fractional SPDE:

(κ² − ∆)^{β/2} θ(x) = W(x),   x ∈ R^d,   β = ν + d/2,   κ > 0,   ν > 0

Then θ(x) is a Gaussian field with Matérn covariance:

C(x, x′) = σ² / (2^{ν−1} Γ(ν)) (κ‖x − x′‖)^ν K_ν(κ‖x − x′‖)

The covariance kernel is the Green’s function of the differential operator:

(κ² − ∆)^β C(x, x′) = δ(x − x′)

ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to the squared exponential covariance.
One can construct a discrete GMRF that approximates the solution of the SPDE. (See Lindgren, Rue, Lindström, JRSSB 2011.)
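A minimal sketch of the Matérn covariance implied by this SPDE, using SciPy's modified Bessel function K_ν; the values of σ², κ, and ν below are illustrative:

```python
import numpy as np
from scipy.special import kv, gamma as gamma_fn

def matern_cov(r, sigma2=1.0, kappa=10.0, nu=1.5):
    """Matern covariance C(r) = sigma2 / (2^(nu-1) Gamma(nu)) * (kappa r)^nu * K_nu(kappa r)."""
    r = np.asarray(r, dtype=float)
    c = np.empty_like(r)
    small = r < 1e-12
    c[small] = sigma2                               # limit as r -> 0
    kr = kappa * r[~small]
    c[~small] = sigma2 / (2**(nu - 1) * gamma_fn(nu)) * kr**nu * kv(nu, kr)
    return c

print(matern_cov(np.array([0.0, 0.05, 0.1, 0.5])))
```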

SLIDE 14

Hierarchical Gaussian priors

Figure 1. Three realizations drawn from the prior (6) with constant variance θ_j = θ_0 (left) and from the corresponding prior where the variance is increased 100-fold at two points indicated by arrows (right).

Calvetti & Somersalo, Inverse Problems 24 (2008) 034013.

SLIDE 15

Hierarchical Gaussian priors

Figure 5. Approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3, and 7 iterations of the cyclic algorithm, using the CGLS method to compute the update of the image at each iteration step.

Calvetti & Somersalo, Inverse Problems 24 (2008) 034013.

SLIDE 16

Non-Gaussian priors

Besov space B^s_{pq}(T):

θ(x) = c_0 + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} w_{j,h} ψ_{j,h}(x)

and

‖θ‖_{B^s_{pq}(T)} := ( |c_0|^q + Σ_{j=0}^{∞} 2^{jq(s + 1/2 − 1/p)} ( Σ_{h=0}^{2^j − 1} |w_{j,h}|^p )^{q/p} )^{1/q} < ∞.

Consider p = q = s = 1:

‖θ‖_{B^1_{11}(T)} = |c_0| + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} 2^{j/2} |w_{j,h}|.

Then the distribution of θ is a Besov prior if α c_0 and α 2^{j/2} w_{j,h} are independent and Laplace(1). Loosely,

π(θ) ∝ exp( −α ‖θ‖_{B^1_{11}(T)} ).
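A minimal sketch of evaluating the B^1_11 norm and the corresponding unnormalized log-prior from a finite set of wavelet coefficients; the coefficient arrays here are random placeholders, whereas in practice they would come from a wavelet transform of θ:

```python
import numpy as np

def besov_B111_norm(c0, detail_coeffs):
    """|c0| + sum_j sum_h 2^(j/2) |w_{j,h}|, with detail_coeffs = [w_0, w_1, ...] and len(w_j) = 2^j."""
    return abs(c0) + sum(2**(j / 2.0) * np.sum(np.abs(w)) for j, w in enumerate(detail_coeffs))

def besov_log_prior(c0, detail_coeffs, alpha=1.0):
    return -alpha * besov_B111_norm(c0, detail_coeffs)   # log pi(theta), up to a constant

rng = np.random.default_rng(4)
coeffs = [rng.standard_normal(2**j) for j in range(5)]   # placeholder wavelet detail coefficients
print(besov_log_prior(c0=0.2, detail_coeffs=coeffs, alpha=1.5))
```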

SLIDE 17

Higher-level representations

Marked point processes, and more:

Rue & Hurn, Biometrika 86 (1999) 649–660.

SLIDE 18

Bayesian inference

Hierarchical modeling

One of the key flexibilities of the Bayesian construction!
Hierarchical modeling has important implications for the design of efficient MCMC samplers (later in the lecture).

Examples (a sketch of the first case follows):

1. Unknown noise variance
2. Unknown variance of a Gaussian process prior (cf. choosing the regularization parameter)
3. Many more, as dictated by the physical models at hand
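A minimal sketch of example 1: treat the noise variance σ² as an additional unknown with a (hypothetical) inverse-gamma hyperprior, and evaluate the joint unnormalized log-posterior over (θ, σ²):

```python
import numpy as np

def log_posterior(theta, sigma2, y, g, tau=1.0, a=2.0, b=0.5):
    """Unnormalized log p(theta, sigma2 | y) with y = g(theta) + N(0, sigma2 I),
    a N(0, tau^2 I) prior on theta, and an inverse-gamma(a, b) hyperprior on sigma2."""
    if sigma2 <= 0:
        return -np.inf
    resid = y - g(theta)
    log_like = -0.5 * y.size * np.log(sigma2) - 0.5 * np.dot(resid, resid) / sigma2
    log_prior_theta = -0.5 * np.dot(theta, theta) / tau**2
    log_hyperprior = -(a + 1) * np.log(sigma2) - b / sigma2
    return log_like + log_prior_theta + log_hyperprior

g = lambda th: np.cumsum(th)                    # hypothetical forward model
y = np.array([0.1, 0.4, 0.3, 0.8])
print(log_posterior(np.zeros(4), 0.05, y, g))
```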

SLIDE 19

Example: prior variance hyperparameter in an inverse diffusion problem

Figure: posterior marginal density of the variance hyperparameter θ, versus quality of data (ς = 10^{−1} with 13 sensors; ς = 10^{−2} with 25 sensors), contrasted with its hyperprior density. “Regularization” ∝ ς²/θ.

SLIDE 20

The linear Gaussian model

A key building-block problem:

Parameters θ ∈ R^n, observations y ∈ R^m
Forward model f(θ) = Gθ, where G ∈ R^{m×n}
Additive noise yields the observations: y = Gθ + ε, where ε ∼ N(0, Γ_obs) is independent of θ
Endow θ with a Gaussian prior, θ ∼ N(0, Γ_pr)

Posterior probability density

p(θ|y) ∝ p(y|θ) p(θ) = L(θ) p(θ)
  = exp( −(1/2) (y − Gθ)^T Γ_obs^{−1} (y − Gθ) ) exp( −(1/2) θ^T Γ_pr^{−1} θ )
  ∝ exp( −(1/2) (θ − µ_pos)^T Γ_pos^{−1} (θ − µ_pos) )
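This unnormalized density is exactly what a sampler or optimizer evaluates; a minimal sketch for a small, entirely hypothetical G, Γ_obs, Γ_pr, and y:

```python
import numpy as np

G = np.array([[1.0, 0.5], [0.0, 2.0], [1.0, 1.0]])     # hypothetical forward operator (m=3, n=2)
Gobs = 0.1**2 * np.eye(3)                               # Gamma_obs
Gpr = np.eye(2)                                         # Gamma_pr
y = np.array([0.8, 1.1, 1.5])

def log_post_unnorm(theta):
    r = y - G @ theta
    return -0.5 * r @ np.linalg.solve(Gobs, r) - 0.5 * theta @ np.linalg.solve(Gpr, theta)

print(log_post_unnorm(np.array([0.5, 0.6])))
```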

SLIDE 22

The linear Gaussian model

The posterior is again Gaussian, with

Γ_pos = ( G^T Γ_obs^{−1} G + Γ_pr^{−1} )^{−1}
      = Γ_pr − Γ_pr G^T ( G Γ_pr G^T + Γ_obs )^{−1} G Γ_pr
      = (I − KG) Γ_pr
µ_pos = Γ_pos G^T Γ_obs^{−1} y

In the context of filtering, K is known as the (optimal) Kalman gain.
H := G^T Γ_obs^{−1} G is the Hessian of the negative log-likelihood.
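A minimal sketch that computes these closed-form quantities and checks that the two expressions for Γ_pos agree, reusing the same hypothetical setup as the previous sketch:

```python
import numpy as np

G = np.array([[1.0, 0.5], [0.0, 2.0], [1.0, 1.0]])
Gobs, Gpr = 0.1**2 * np.eye(3), np.eye(2)
y = np.array([0.8, 1.1, 1.5])

H = G.T @ np.linalg.solve(Gobs, G)                       # Hessian of the negative log-likelihood
Gpos = np.linalg.inv(H + np.linalg.inv(Gpr))             # (G^T Gobs^-1 G + Gpr^-1)^-1
K = Gpr @ G.T @ np.linalg.inv(G @ Gpr @ G.T + Gobs)      # Kalman gain
assert np.allclose(Gpos, (np.eye(2) - K @ G) @ Gpr)      # the two Gamma_pos formulas agree
mu_pos = Gpos @ G.T @ np.linalg.solve(Gobs, y)           # posterior mean
```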

How does low rank of H affect the structure of the posterior? How does H interact with the prior?

SLIDE 23

Likelihood-informed directions

Consider the Rayleigh ratio

R(w) = (w^T H w) / (w^T Γ_pr^{−1} w).

When R(w) is large, the likelihood dominates the prior in direction w. The ratio is maximized by solutions of the generalized eigenvalue problem

H w = λ Γ_pr^{−1} w.

The posterior covariance can be written as a negative update of the prior along these “likelihood-informed” directions, and an approximation can be obtained by using only the r largest eigenvalues:

Γ_pos = Γ_pr − Σ_{i=1}^{n} λ_i/(1 + λ_i) w_i w_i^T ≈ Γ_pr − Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T    (1)
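A minimal numerical sketch of the low-rank update (1), using a synthetic Hessian with rapidly decaying spectrum; scipy.linalg.eigh solves the generalized eigenvalue problem and returns eigenvectors with the Γ_pr^{−1}-normalization the formula requires:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
n, r = 50, 20
Gpr = np.eye(n)
A = rng.standard_normal((n, n)) * np.sqrt(np.logspace(0, -8, n))   # columns with decaying scale
H = A @ A.T                                                         # synthetic data-misfit Hessian

lam, W = eigh(H, np.linalg.inv(Gpr))        # H w = lam * Gpr^{-1} w, with w^T Gpr^{-1} w = 1
lam, W = lam[::-1], W[:, ::-1]              # sort eigenvalues in decreasing order

Gpos_exact = np.linalg.inv(H + np.linalg.inv(Gpr))
Gpos_rank_r = Gpr - (W[:, :r] * (lam[:r] / (1 + lam[:r]))) @ W[:, :r].T
print(np.linalg.norm(Gpos_exact - Gpos_rank_r))   # small because the neglected eigenvalues are small
```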

SLIDE 26

Optimality results for Γpos

It turns out that the approximation

Γ̂_pos = Γ_pr − Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T    (2)

is optimal in a class of loss functions L(Γ̂_pos, Γ_pos) for approximations of the form Γ̂_pos = Γ_pr − KK^T, where rank(K) ≤ r.¹

Γ̂_pos minimizes the Hellinger distance and the KL divergence between N(µ_pos(y), Γ̂_pos) and N(µ_pos(y), Γ_pos).
The results can also be used to devise efficient approximations of the posterior mean.
λ = 1 means that prior and likelihood are roughly balanced. Truncate at λ = 0.1, for instance.

¹ For details see Spantini et al., Optimal low-rank approximations of Bayesian linear inverse problems, http://arxiv.org/abs/1407.3463

SLIDE 30

Remarks on the optimal approximation

Γ*_pos = Γ_pr − KK^T,   KK^T = Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T

The form of the optimal update is widely used (Flath et al. 2011).
Compute with Lanczos, randomized SVD, etc.
The directions w̃_i = Γ_pr^{−1} w_i maximize the relative difference between prior and posterior variance:

( Var(w̃_i^T x) − Var(w̃_i^T x | y) ) / Var(w̃_i^T x) = λ_i/(1 + λ_i)

Using the Frobenius norm as a loss would instead yield directions of greatest absolute difference between prior and posterior variance.

SLIDE 31

A metric between covariance matrices

Förstner metric

Let A, B ≻ 0, and let (σ_i) be the generalized eigenvalues of the pencil (A, B). Then:

d²_F(A, B) = tr( ln²( B^{−1/2} A B^{−1/2} ) ) = Σ_i ln²(σ_i)

Compare curvatures: sup_u (u^T A u) / (u^T B u) = σ_1

Invariance properties:
d_F(A, B) = d_F(A^{−1}, B^{−1})
d_F(A, B) = d_F(M A M^T, M B M^T)

The Frobenius distance d(A, B) = ‖A − B‖_F does not share the same properties.
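A minimal sketch of the Förstner distance computed from the generalized eigenvalues of (A, B):

```python
import numpy as np
from scipy.linalg import eigh

def forstner_distance(A, B):
    """d_F(A, B) = sqrt(sum_i ln^2 sigma_i), with sigma_i the generalized eigenvalues of (A, B)."""
    sigma = eigh(A, B, eigvals_only=True)
    return np.sqrt(np.sum(np.log(sigma)**2))

rng = np.random.default_rng(7)
M = rng.standard_normal((4, 4))
A, B = M @ M.T + np.eye(4), np.eye(4)
print(forstner_distance(A, B), forstner_distance(np.linalg.inv(A), np.linalg.inv(B)))  # equal, by invariance
```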

SLIDE 32

Example: computerized tomography

X-rays travel from sources to detectors through an object of interest. The intensities from the sources are measured at the detectors, and the goal is to reconstruct the density of the object.

Figure: measured intensity versus detector pixel (synthetic data).

This synthetic example is motivated by a real application: real-time X-ray imaging of logs that enter a saw mill, for the purpose of automatic quality control.²

² Check www.bintec.fi

SLIDE 33

Example: computerized tomography

Weaker data → faster decay of generalized eigenvalues → lower order approximations possible.

Figure: eigenvalue and generalized-eigenvalue spectra (versus index i) for the limited-angle and full-angle cases, and the Förstner distance d_F versus rank of the update.

In the limited-angle case, roughly r = 200 is enough to get a good approximation (with the full angle, r ≈ 800 is needed).

Figure: panels labeled prior; rank = 50, rank = 100, rank = 200; and posterior.

SLIDE 34

Example: computerized tomography

Approximation of the mean: µ_pos(y) = Γ_pos G^T Γ_obs^{−1} y ≈ A_r y

Note: pre-computing the LIS basis offline allows fast online reconstructions for new data.

SLIDE 35

Questions yet to answer

How to simulate from or explore the posterior distribution?
How to make Bayesian inference computationally tractable when the forward model is expensive (e.g., a PDE) and the parameters are high- or infinite-dimensional?
Downstream questions: model selection, optimal experimental design, decision-making, etc.
