SLIDE 1

Institut de Physique du Globe de Paris & Université Pierre et Marie Curie (Paris VI)

Course on Inverse Problems

Albert Tarantola

Lesson XV: Square Root Variable Metric Algorithm

SLIDE 2

The Square Root Variable Metric Algorithm for Least-Squares Problems

Albert Tarantola

1 Introduction

In the context of least-squares, there is an a priori probability distribution on the model parameter space: a Gaussian distribution, with mean mprior and covariance Cprior. The result of measuring the observable parameters is also expressed with a Gaussian probability distribution, with mean oobs and covariance Cobs. If the relation m → o = o(m), predicting the observations that should correspond to a model m, is not too strongly nonlinear, then the a posteriori probability distribution in the model space is, approximately, Gaussian. The first numerical problem to solve is that of computing the mean mpost and the covariance Cpost of this posterior Gaussian distribution. The second numerical problem is to obtain samples of that distribution.

SLIDE 3

2 Misfit Function

The model mpost, center of the posterior Gaussian distribution, is the model minimizing the misfit function (sum of squares)

$$ 2\,S(m) \;=\; \| o(m) - o_{\rm obs} \|^2 \;+\; \| m - m_{\rm prior} \|^2 \ , \qquad (1) $$

where

$$ \| o(m) - o_{\rm obs} \|^2 \;=\; \big( o(m) - o_{\rm obs} \big)^{\rm t}\, C_{\rm obs}^{-1}\, \big( o(m) - o_{\rm obs} \big) \qquad (2) $$

and

$$ \| m - m_{\rm prior} \|^2 \;=\; \big( m - m_{\rm prior} \big)^{\rm t}\, C_{\rm prior}^{-1}\, \big( m - m_{\rm prior} \big) \ . \qquad (3) $$
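A minimal numpy sketch of this misfit evaluation may help; the function name and the idea of passing the forward map o(m) as a callable are assumptions of this example, not part of the original slides.

```python
import numpy as np

def misfit(m, forward, o_obs, C_obs_inv, m_prior, C_prior_inv):
    """Misfit S(m) of equation (1), built from the squared norms (2) and (3)."""
    r_obs = forward(m) - o_obs        # o(m) - o_obs
    r_mod = m - m_prior               # m - m_prior
    # equation (1) defines 2 S(m); we return S(m) itself
    return 0.5 * (r_obs @ C_obs_inv @ r_obs + r_mod @ C_prior_inv @ r_mod)
```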

SLIDE 4

3 Partial Derivatives

Gradient methods of optimization require the computation of the matrix O of partial derivatives

$$ O^i{}_{\alpha} = \frac{\partial o^i}{\partial m^{\alpha}} \ . \qquad (4) $$

This computation is sometimes performed analytically, as in

$$ \sin'(t_0) = \cos(t_0) \ , \qquad (5) $$

and sometimes numerically, as in

$$ \sin'(t_0) \;\approx\; \frac{\sin(t_0 + \Delta t) - \sin(t_0 - \Delta t)}{2\,\Delta t} \ ; \qquad \Delta t = 0.001 \ . \qquad (6) $$
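The numerical option of equation (6) generalizes to a full matrix of partial derivatives by central differences; here is a hedged numpy sketch (the step size and the function names are illustrative choices, not from the slides).

```python
import numpy as np

def jacobian_fd(forward, m, dt=1e-3):
    """Central finite differences for O_{i,alpha} = do^i/dm^alpha (equations 4 and 6)."""
    m = np.asarray(m, dtype=float)
    o0 = np.atleast_1d(forward(m))
    O = np.empty((o0.size, m.size))
    for alpha in range(m.size):
        dm = np.zeros_like(m)
        dm[alpha] = dt
        O[:, alpha] = (np.atleast_1d(forward(m + dm))
                       - np.atleast_1d(forward(m - dm))) / (2.0 * dt)
    return O

# example: with forward = np.sin (componentwise), the result is approximately
# diag(cos(m)), in agreement with equation (5)
```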

SLIDE 5

4 Newton Algorithm

When the computer is big enough, the (quasi) Newton algorithm can be implemented. It is then the better choice. The model mpost, center of the posterior Gaussian, is obtained as the convergence point of the algorithm

$$ m_{k+1} \;=\; m_k \;-\; \big( O_k^{\rm t}\, C_{\rm obs}^{-1}\, O_k + C_{\rm prior}^{-1} \big)^{-1} \big( O_k^{\rm t}\, C_{\rm obs}^{-1}\, ( o(m_k) - o_{\rm obs} ) + C_{\rm prior}^{-1}\, ( m_k - m_{\rm prior} ) \big) \ , \qquad (7) $$

while the covariance of the posterior Gaussian is the value, at the convergence point, of the matrix

$$ C_{\rm post} \;=\; \big( O^{\rm t}\, C_{\rm obs}^{-1}\, O + C_{\rm prior}^{-1} \big)^{-1} \ . \qquad (8) $$
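For problems small enough that all these matrices fit in memory, iteration (7) can be transcribed directly. The following numpy sketch makes a few assumptions not in the slides: the forward map and its matrix of partial derivatives are passed as callables, and a fixed iteration count stands in for a proper convergence test.

```python
import numpy as np

def quasi_newton(forward, jacobian, o_obs, C_obs, m_prior, C_prior, n_iter=20):
    """Quasi-Newton iteration (7); returns the converged model and C_post of equation (8)."""
    C_obs_inv = np.linalg.inv(C_obs)
    C_prior_inv = np.linalg.inv(C_prior)
    m = np.array(m_prior, dtype=float)
    for _ in range(n_iter):
        O = jacobian(m)                                # partial derivatives at m_k
        H = O.T @ C_obs_inv @ O + C_prior_inv          # curvature term in (7)
        g = (O.T @ C_obs_inv @ (forward(m) - o_obs)
             + C_prior_inv @ (m - m_prior))            # gradient term in (7)
        m = m - np.linalg.solve(H, g)                  # one step of (7)
    O = jacobian(m)
    C_post = np.linalg.inv(O.T @ C_obs_inv @ O + C_prior_inv)   # equation (8)
    return m, C_post
```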

SLIDE 6

5 Square Root Variable Metric Algorithm

5.1 Introduction

In differential geometry, when one considers a manifold with some coordinates {x^1, x^2, ..., x^n}, and the square of the length element is written

$$ ds^2 = g_{ij}\, dx^i\, dx^j \qquad (9) $$

(implicit sum over the indices assumed), one calls the matrix with elements g_{ij} the metric of the manifold. If it is a linear manifold, then the square of the finite distance between two points can be written

$$ s^2 = g_{ij}\, \big( x_1^i - x_2^i \big) \big( x_1^j - x_2^j \big) \ , \qquad (10) $$

or, using obvious matrix notations,

$$ s^2 = ( x_1 - x_2 )^{\rm t}\, g\, ( x_1 - x_2 ) \ . \qquad (11) $$

In least-squares theory, we have expressions like

$$ D^2 = ( m_1 - m_2 )^{\rm t}\, C^{-1}\, ( m_1 - m_2 ) \ , \qquad (12) $$

where C is a covariance matrix. These expressions are also to be interpreted as corresponding to a squared distance. We thus see that the inverse of a covariance matrix (a weight matrix) can be considered to define a metric (over the model parameter space)¹.
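In code, such a squared distance is just a weighted quadratic form. A small sketch (the function name is an illustrative choice):

```python
import numpy as np

def squared_distance(m1, m2, C):
    """D^2 = (m1 - m2)^t C^{-1} (m1 - m2), the squared distance of equation (12)."""
    d = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    return float(d @ np.linalg.solve(C, d))   # solve(C, d) avoids forming C^{-1} explicitly
```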

SLIDE 7

The variable metric methods start by using as metric the inverse of the prior covariance matrix, and make it evolve so that at the end the metric being used is the inverse of the posterior covariance matrix. These methods have a rapid convergence, similar to that of the conjugate directions methods, but have an advantage over them: that of providing a direct estimation of the posterior covariance matrix, say Cpost.

There are two kinds of variable metric algorithms. While the first kind (variable metric algorithms) aims at directly producing Cpost, the second kind (square root variable metric algorithms) aims at producing the square root of Cpost, say Tpost. One then has

$$ C_{\rm post} = T_{\rm post}\, T_{\rm post}^{\rm t} \qquad (13) $$

(so one can formally write Tpost = Cpost^{1/2}). The interest in having the square root of the posterior covariance matrix arises from the property that if a vector x is a sample vector of a Gaussian distribution with zero mean and unit covariance matrix, then the vector

$$ m = m_{\rm post} + T_{\rm post}\, x \qquad (14) $$

¹ When considering as metric the inverse of the prior covariance we arrive at the steepest descent method, while when considering as metric the inverse of the posterior covariance we arrive at the Newton method.

SLIDE 8

is a sample vector of a Gaussian distribution with mean mpost and covariance Cpost, i.e., a sample vector of the (approximate) posterior distribution in the model parameter space². In what follows, let us denote by Tprior the square root of the prior covariance matrix,

$$ C_{\rm prior} = T_{\rm prior}\, T_{\rm prior}^{\rm t} \ . \qquad (15) $$

The algorithm is initialized at some values m0 and T0 that are the best approximations available of the posterior model and the posterior covariance,

$$ m_0 \approx m_{\rm post} \ ; \qquad T_0 \approx T_{\rm post} \ . \qquad (16) $$

Typically there is no better approximation than the prior values, and one then takes

$$ m_0 = m_{\rm prior} \ ; \qquad T_0 = T_{\rm prior} \ . \qquad (17) $$

The square root variable metric algorithm is³ (other possible algorithms differ only in details): for k = 0, 1, 2, ...

SLIDE 9

$$
\begin{aligned}
\gamma_k &= C_{\rm prior}^{-1}\,( m_k - m_{\rm prior} ) + G_k^{\rm t}\, C_{\rm obs}^{-1}\,\big( o(m_k) - o_{\rm obs} \big) \\
\varphi_k &= T_k\, T_k^{\rm t}\, \gamma_k \\
\hat{\varphi}_k &= C_{\rm prior}^{-1}\, \varphi_k \\
b_k &= G_k\, \varphi_k \\
\hat{b}_k &= C_{\rm obs}^{-1}\, b_k \\
\mu_k &= \frac{ \gamma_k^{\rm t}\, \varphi_k }{ \varphi_k^{\rm t}\, \hat{\varphi}_k + b_k^{\rm t}\, \hat{b}_k } \\
m_{k+1} &= m_k - \mu_k\, \varphi_k \\
g_k &= \mu_k\, \big( \hat{\varphi}_k + G_k^{\rm t}\, \hat{b}_k \big) \\
y_k &= \mu_k\, \gamma_k - g_k \\
w_k &= T_k^{\rm t}\, y_k \\
a_k &= w_k^{\rm t}\, T_k^{\rm t}\, g_k \\
b_k &= w_k^{\rm t}\, w_k \\
c_k &= \frac{ 1 - \sqrt{ 1 + b_k / a_k } }{ b_k } \\
T_{k+1} &= T_k\, \big( I - c_k\, w_k\, w_k^{\rm t} \big)
\end{aligned} \qquad (18)
$$

(In the last three lines a_k, b_k, and c_k are scalars; the b_k = w_k^t w_k appearing there is distinct from the data-space vector b_k = G_k φ_k defined above.)
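The following numpy sketch is one possible transcription of iteration (18), for problems small enough that the matrices T_k can be held in memory. Everything not stated in the slides (function and variable names, passing the forward map and its derivative matrix as callables, the fixed iteration count) is an assumption of this example.

```python
import numpy as np

def srvm(forward, jacobian, o_obs, C_obs_inv, m_prior, C_prior_inv, T_prior, n_iter=50):
    """Square root variable metric iteration (18).

    forward(m) is the forward map o(m), jacobian(m) returns the matrix G_k,
    and T_prior is a square root of the prior covariance (C_prior = T_prior T_prior^t).
    """
    m = np.array(m_prior, dtype=float)
    T = np.array(T_prior, dtype=float)
    for _ in range(n_iter):
        G = jacobian(m)
        gamma = C_prior_inv @ (m - m_prior) + G.T @ (C_obs_inv @ (forward(m) - o_obs))
        phi = T @ (T.T @ gamma)                      # search direction
        phi_hat = C_prior_inv @ phi
        b_vec = G @ phi                              # the vector b_k of (18)
        b_hat = C_obs_inv @ b_vec
        mu = (gamma @ phi) / (phi @ phi_hat + b_vec @ b_hat)   # step length
        g = mu * (phi_hat + G.T @ b_hat)             # difference of consecutive gradients
        y = mu * gamma - g
        w = T.T @ y
        a = w @ (T.T @ g)
        b = w @ w                                    # the scalar b_k of (18)
        c = (1.0 - np.sqrt(1.0 + b / a)) / b
        m = m - mu * phi                             # m_{k+1}
        T = T - c * np.outer(T @ w, w)               # T_{k+1} = T_k (I - c_k w_k w_k^t)
    return m, T
```

Per equation (19) below, m tends to mpost and T Tᵗ tends to Cpost, so on a weakly nonlinear test problem comparing T @ T.T with the Cpost of equation (8) is a convenient sanity check.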

SLIDE 10

As already indicated,

$$ m_k \;\rightarrow\; m_{\rm post} \ ; \qquad T_k\, T_k^{\rm t} \;\rightarrow\; C_{\rm post} \ . \qquad (19) $$

² Quick and dirty demonstration (⟨·⟩ denoting expectation): the mean of m is ⟨m⟩ = ⟨mpost + Tpost x⟩ = mpost + ⟨Tpost x⟩ = mpost + Tpost ⟨x⟩ = mpost + Tpost 0 = mpost, and the covariance matrix of m is C = ⟨(m − mpost)(m − mpost)^t⟩ = ⟨(Tpost x)(Tpost x)^t⟩ = ⟨Tpost x x^t Tpost^t⟩ = Tpost ⟨x x^t⟩ Tpost^t = Tpost I Tpost^t = Tpost Tpost^t.

³ Williamson, W.E., 1975, Square-root variable metric method for function minimization, AIAA Journal, 13 (1), 107–109. Morf, M., and T. Kailath, 1975, Square-root algorithms for least-squares estimation, IEEE Transactions on Automatic Control, Vol. AC-20, No. 4, 487–497. Hull, D.G., and B.D. Tapley, 1977, Square-root variable-metric methods for minimization, J. Optim. Appl., Vol. 21, No. 3, 251–259.
SLIDE 11

The linear form γk is the gradient of the misfit function at the iteration point mk. The vector φk is the direction of search. The scalar µk defines a reasonable step length (along that search direction). The linear form gk is, in fact, the difference between two consecutive gradients⁴. For small-sized problems, the matrices Tk can be built explicitly. For large-sized problems, those matrices would be too large to store in the computer. The matrix Cpost would be too large too. (This is typically not true for Cprior or for T0 ≈ Cprior^{1/2}, as we often have analytical expressions for them.) It is then very important to realize that the only “objects” to be computed (and stored in the computer’s memory) are the constants c0, c1, c2, ... and the vectors w0, w1, w2, .... With these constants and these vectors, the result of the action of any of the Tk on any vector can be computed. For instance, given a vector v, how do we compute the vector

$$ \psi = T_1\, v \ ? \qquad (20) $$

SLIDE 12

The last of equations (18) tells us that

$$ T_1 = T_0\, \big( I - c_0\, w_0\, w_0^{\rm t} \big) \ . \qquad (21) $$

We have the constant c0 and the vector w0, and the operator T0 is assumed to be simple enough (typically, T0 = I, or T0 is chosen so that it has an analytical expression). Then,

$$ \psi \;=\; T_1\, v \;=\; T_0\, \big( I - c_0\, w_0\, w_0^{\rm t} \big)\, v \;=\; T_0\, \big( v - c_0\, w_0\, w_0^{\rm t}\, v \big) \;=\; T_0\, \big( v - c_0\, \underbrace{( w_0^{\rm t}\, v )}_{\text{scalar}}\, w_0 \big) \ . \qquad (22) $$

This computation only requires the handling of vectors (and of the operator T0, which is assumed to be a simple one). Applying this notion recursively, we can compute the action of any of the Tk on any vector vk:

$$ \psi_k = T_k\, v_k \ . \qquad (23) $$

Notably, this applies when, after convergence of the square root variable metric algorithm, we face the task of generating samples of the posterior Gaussian via expression (14), as sketched below.
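Here is a hedged numpy sketch of that recursion, together with its use for drawing posterior samples via expression (14). Only the scalars c_k and the vectors w_k are stored, as the text prescribes; the function names and the default choice T_0 = I are assumptions of this example.

```python
import numpy as np

def apply_T(v, cs, ws, apply_T0=lambda u: u):
    """Compute T_k v from the stored scalars c_0..c_{k-1} and vectors w_0..w_{k-1},
    applying equation (22) recursively, without ever forming the matrices T_k.
    apply_T0 applies the operator T_0 (identity by default)."""
    u = np.array(v, dtype=float)
    for c, w in zip(reversed(cs), reversed(ws)):   # rightmost factor of T_k acts first
        u = u - c * (w @ u) * w                    # (I - c_j w_j w_j^t) u, vectors only
    return apply_T0(u)

def sample_posterior(m_post, cs, ws, apply_T0=lambda u: u, rng=None):
    """One sample m = m_post + T_post x with x ~ N(0, I)  (expression 14)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(np.size(m_post))
    return np.asarray(m_post) + apply_T(x, cs, ws, apply_T0)
```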

SLIDE 13

⁴ Indeed,

$$
\begin{aligned}
g_k \;=\; \gamma_k - \gamma_{k+1} \;&=\; C_{\rm prior}^{-1}( m_k - m_{\rm prior} ) + G_k^{\rm t} C_{\rm obs}^{-1}\big( o(m_k) - o_{\rm obs} \big) - C_{\rm prior}^{-1}( m_{k+1} - m_{\rm prior} ) - G_{k+1}^{\rm t} C_{\rm obs}^{-1}\big( o(m_{k+1}) - o_{\rm obs} \big) \\
&\approx\; C_{\rm prior}^{-1}( m_k - m_{\rm prior} ) + G_k^{\rm t} C_{\rm obs}^{-1}\big( o(m_k) - o_{\rm obs} \big) - C_{\rm prior}^{-1}( m_{k+1} - m_{\rm prior} ) - G_k^{\rm t} C_{\rm obs}^{-1}\big( o(m_{k+1}) - o_{\rm obs} \big) \\
&=\; -\,C_{\rm prior}^{-1}( m_{k+1} - m_k ) - G_k^{\rm t} C_{\rm obs}^{-1}\big( o(m_{k+1}) - o(m_k) \big) \\
&=\; -\,C_{\rm prior}^{-1}( m_{k+1} - m_k ) - G_k^{\rm t} C_{\rm obs}^{-1} G_k\,( m_{k+1} - m_k ) \\
&=\; -\,\big( C_{\rm prior}^{-1} + G_k^{\rm t} C_{\rm obs}^{-1} G_k \big)\,( m_{k+1} - m_k ) \\
&=\; \mu_k\,\big( C_{\rm prior}^{-1} + G_k^{\rm t} C_{\rm obs}^{-1} G_k \big)\,\varphi_k \\
&=\; \mu_k\,\big( \hat{\varphi}_k + G_k^{\rm t}\, \hat{b}_k \big) \ .
\end{aligned}
$$