

SLIDE 1

Improving EnKF with machine learning algorithms

John Harlim

Department of Mathematics and Department of Meteorology, The Pennsylvania State University

June 12, 2017

SLIDE 2

Overview

◮ A supervised learning algorithm
◮ An unsupervised learning algorithm (diffusion maps)
◮ Learning the localization function of EnKF
◮ Learning a likelihood function. Application: correcting biased observation model error in DA
SLIDES 3-6

A supervised learning algorithm

The basic idea of a supervised learning algorithm is to train a map H : X → Y from a paired data set {(x_i, y_i)}_{i=1,...,N}.

Remarks:

◮ The objective is to use the estimated map Ĥ to predict y_s = Ĥ(x_s) given new data x_s.

◮ Various methods to estimate H include regression, SVM, KNN, neural nets, etc.

◮ For this talk, we will focus on how to use regression in appropriate spaces to improve EnKF (a toy sketch follows this list).
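To make the setup concrete, here is a minimal sketch of the training-and-prediction loop; the synthetic data set and the polynomial regression basis are illustrative assumptions, not from the talk.

```python
import numpy as np

# Training pairs {x_i, y_i}: synthetic stand-ins for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)

# Regression "in an appropriate space": least-squares fit of the
# coefficients of a small polynomial basis.
degree = 5
Phi = np.vander(x, degree + 1)                 # basis evaluated at x_i
coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)

def H_hat(x_new):
    """Estimated map: predict y_s = H_hat(x_s) for new data x_s."""
    return np.vander(np.atleast_1d(x_new), degree + 1) @ coeffs

print(H_hat(0.3))  # prediction at a new point
```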

SLIDES 7-9

An unsupervised learning algorithm

Given a data set {x_i}, the main task is to learn a function ϕ(x_i) that can describe the data.

In this talk, I will focus on a nonlinear manifold learning algorithm, the diffusion maps¹: Given {x_i} ∈ M ⊂ R^n with a sampling measure q, the diffusion maps algorithm is a kernel-based method that produces orthonormal basis functions on the manifold, ϕ_k ∈ L²(M, q). These basis functions are solutions of the eigenvalue problem

q⁻¹ div(q ∇ϕ_k(x)) = λ_k ϕ_k(x),

where the weighted Laplacian operator is approximated by an integral operator with appropriate normalization (a numerical sketch is given below).

¹Coifman & Lafon 2006; Berry & H, 2016.
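A minimal numerical sketch of this construction, assuming a fixed-bandwidth Gaussian kernel with illustrative parameter choices (the variable-bandwidth kernels of Berry & H, 2016 refine this):

```python
import numpy as np

def diffusion_maps(X, eps, n_basis=10, alpha=1.0):
    """Sketch of the diffusion maps algorithm (Coifman & Lafon 2006).

    X: (N, n) samples on the manifold; eps: kernel bandwidth (would
    need tuning in practice). Returns approximate Laplacian eigenvalues
    and the basis functions phi_k evaluated at the samples.
    """
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (4 * eps))
    # Left/right normalization by the kernel density estimate removes
    # the sampling-measure bias (alpha = 1 targets Laplace-Beltrami).
    q = K.sum(axis=1)
    K = K / (q[:, None] ** alpha * q[None, :] ** alpha)
    # Markov normalization via the symmetric conjugate (stable eigensolve).
    d = K.sum(axis=1)
    S = K / np.sqrt(d[:, None] * d[None, :])
    lam, u = np.linalg.eigh(S)
    idx = np.argsort(-lam)[:n_basis]
    phi = u[:, idx] / np.sqrt(d)[:, None]   # right eigenvectors of P
    return (1 - lam[idx]) / eps, phi
```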

SLIDE 10

Examples:

Example: For uniformly distributed data on a circle, we obtain the Fourier basis.

[Figure: the first three estimated basis functions on the circle, matching e^{ix}, e^{i2x}, e^{i3x}.]

Example: For Gaussian distributed data on the real line, we obtain the Hermite polynomials.

[Figure: the first three estimated basis functions ϕ1(x), ϕ2(x), ϕ3(x), estimate vs. truth.]
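As a quick check of the circle example, one can run the diffusion_maps sketch from the earlier slide on uniformly sampled circle data; the leading nontrivial eigenfunctions should pair up into the real Fourier modes cos(kθ), sin(kθ).

```python
import numpy as np

# Uniform samples on the unit circle (synthetic stand-in).
rng = np.random.default_rng(1)
theta = np.sort(rng.uniform(0, 2 * np.pi, 400))
X = np.column_stack([np.cos(theta), np.sin(theta)])

lam, phi = diffusion_maps(X, eps=0.01, n_basis=7)
# The eigenvalues appear in pairs (~k^2), and phi[:, 1], phi[:, 2] trace
# cos(theta), sin(theta) up to rotation, scaling, and sign.
```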

SLIDE 11

Example: Nonparametric basis functions estimated on a nontrivial manifold

[Figure: four nonparametric basis functions visualized on the manifold.]

Remark: Essentially, one can view the DM as a method to learn a generalized Fourier basis on the manifold.

SLIDES 12-14

Learning the localization function of EnKF

◮ When EnKF is performed with a small ensemble size, one way to alleviate spurious correlations is to employ a localization function.

◮ For example, in the serial EnKF, for each scalar observation y_i, one "localizes" the Kalman gain, K = L_{xy_i} ∘ X Y_i^⊤ (Y_i Y_i^⊤ + R)^{−1}, with an empirically chosen localization function L_{xy_i} (Gaspari-Cohn, etc.), which requires some tuning (see the sketch after this list).

◮ Let's use ideas from machine learning to train this localization function. The key idea is to find a map that takes poorly estimated correlations to accurately estimated correlations.
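For reference, a sketch of the empirical recipe being replaced: the Gaspari-Cohn compactly supported correlation function and its Schur product with the raw serial-EnKF gain. The array shapes and the half-width parameter c are illustrative assumptions.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Gaspari & Cohn (1999) fifth-order compactly supported correlation
    function; c is the localization half-width (a tuning parameter)."""
    r = np.atleast_1d(np.abs(dist) / c).astype(float)
    loc = np.zeros_like(r)
    m = r <= 1
    loc[m] = (1 - (5/3)*r[m]**2 + (5/8)*r[m]**3
              + (1/2)*r[m]**4 - (1/4)*r[m]**5)
    m = (r > 1) & (r < 2)
    loc[m] = (4 - 5*r[m] + (5/3)*r[m]**2 + (5/8)*r[m]**3
              - (1/2)*r[m]**4 + (1/12)*r[m]**5 - 2/(3*r[m]))
    return loc

def localized_gain(Xp, Yp, R, dist, c):
    """Localized Kalman gain for one scalar observation in a serial EnKF.

    Xp: (n, K) state ensemble perturbations and Yp: (K,) perturbations of
    the predicted observation, both pre-scaled by 1/sqrt(K-1) so products
    form sample covariances; R: observation-error variance; dist: (n,)
    distances from each state variable to the observation location."""
    K_raw = Xp @ Yp / (Yp @ Yp + R)       # X Y_i^T (Y_i Y_i^T + R)^(-1)
    return gaspari_cohn(dist, c) * K_raw  # Schur product with L_{xy_i}
```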

SLIDES 15-17

Learning localization map²

Given a set of large-ensemble EnKF solutions, {x_m^{a,k}}, k = 1,...,L, m = 1,...,M, as a training data set, where L is large enough so that the correlation ρ^L_{ij} ≈ ρ(x_i, y_j) is accurate.

◮ Operationally, we wish to run EnKF with K ≪ L ensemble members. Our goal is then to train a map that transforms the subsampled correlation ρ^K_{ij} into the accurate correlation ρ^L_{ij}.

◮ Basically, we consider the following optimization problem (its closed-form minimizer is sketched after this slide):

min_{L_{x_i y_j}} ∫_{[−1,1]×[−1,1]} (L_{x_i y_j} ρ^K_{ij} − ρ^L_{ij})² p(ρ^K_{ij} | ρ^L_{ij}) p(ρ^L_{ij}) dρ^K_{ij} dρ^L_{ij}
  ≈ min_{L_{x_i y_j}} (1/(MS)) Σ_{m,s=1}^{M,S} (L_{x_i y_j} ρ^K_{ij,m,s} − ρ^L_{ij,m})²   (Monte Carlo),

where ρ^L_{ij,m} ∼ p(ρ^L_{ij}) and ρ^K_{ij,m,s} ∼ p(ρ^K_{ij} | ρ^L_{ij}) is an estimated correlation using only K out of the L training members.

²De La Chevrotière & H, 2017.
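Because the cost is quadratic in the scalar weight L_{x_i y_j}, the Monte Carlo problem above has a closed-form minimizer; a sketch, with the array layout as an assumption:

```python
import numpy as np

def train_localization_weight(rho_L, rho_K):
    """Closed-form minimizer of the Monte Carlo cost for one pair (i, j).

    rho_L: (M,) accurate correlations from the large-L training run.
    rho_K: (M, S) correlations re-estimated from S subsamples of
           K << L ensemble members for each training instance m.
    """
    num = np.sum(rho_K * rho_L[:, None])  # sum_{m,s} rho^K rho^L
    den = np.sum(rho_K ** 2)              # sum_{m,s} (rho^K)^2
    return num / den                      # scalar weight L_{x_i y_j}
```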

SLIDE 18

Example: The Monsoon-Hadley multicloud model³

It is a Galerkin projection of the zonally symmetric β-plane primitive equations onto the barotropic and first two baroclinic modes, stochastically driven by a three-cloud model paradigm. Consider an observation model h(x) that is similar to an RTM.

³M. De La Chevrotière and B. Khouider 2016.

SLIDE 19

Example of a trained localization map

[Figure: trained localization maps for Channel 3 and θ1, and for Channel 6 and θeb.]

SLIDE 20

DA results

SLIDE 21

Correcting biased observation model error⁴

All Kalman-based DA methods assume an unbiased observation model error, e.g., y_i = h(x_i) + η_i, η_i ∼ N(0, R). Suppose the operator h is unknown and we are only given h̃; then y_i = h̃(x_i) + b_i, where we introduce a biased model error, b_i = h(x_i) − h̃(x_i) + η_i.

⁴Berry & H, 2017.

SLIDE 22

Example: Basic radiative transfer model

Consider solutions of the stochastic cloud model⁵, {T(z), θ_eb, q, f_d, f_s, f_c}. Based on these solutions, define a basic radiative transfer model as follows,

h_ν(x) = θ_eb T_ν(0) + ∫₀^∞ T(z) (∂T_ν/∂z)(z) dz,

where T_ν is the transmission between height z and ∞, defined to depend on q. A quadrature sketch of this operator is given below.

[Figure: the weighting functions ∂T_ν/∂z plotted against height z.]

⁵Khouider, Biello, Majda 2010.
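A minimal sketch of the clear-sky operator on a discrete height grid; the grid and the dependence of T_ν on the humidity q are left unspecified here (assumptions):

```python
import numpy as np

def h_clear(theta_eb, T, Tnu, z):
    """Clear-sky RTM sketch: h_nu = theta_eb*Tnu(0) + int T(z) dTnu/dz dz.

    T, Tnu: temperature and transmission sampled on the height grid z;
    how Tnu depends on humidity q is model-specific and omitted here.
    """
    w = T * np.gradient(Tnu, z)        # T(z) times the weighting function
    return theta_eb * Tnu[0] + np.trapz(w, z)
```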

SLIDES 23-25

Example: Basic radiative transfer model

Suppose the deep and stratiform cloud top height is z_d = 12 km, while the cumulus cloud top height is z_c = 3 km. Define f = {f_d, f_c, f_s} and x = {T(z), θ_eb, q}. Then the cloudy RTM is given by

h_ν(x, f) = (1 − f_d − f_s)[θ_eb T_ν(0) + ∫₀^{z_d} T(z) (∂T_ν/∂z)(z) dz]
            + (f_d + f_s) T(z_d) T_ν(z_d) + ∫_{z_d}^∞ T(z) (∂T_ν/∂z)(z) dz
          = (1 − f_d − f_s)[(1 − f_c)(θ_eb T_ν(0) + ∫₀^{z_c} T(z) (∂T_ν/∂z)(z) dz)
            + f_c T(z_c) T_ν(z_c) + ∫_{z_c}^{z_d} T(z) (∂T_ν/∂z)(z) dz]
            + (f_d + f_s) T(z_d) T_ν(z_d) + ∫_{z_d}^∞ T(z) (∂T_ν/∂z)(z) dz.

One can check that h_ν(x, 0) corresponds to the cloud-free RTM. A code sketch follows.
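A corresponding sketch of the cloudy operator; the grid handling (z containing the levels z_c and z_d) and the default heights are illustrative assumptions.

```python
import numpy as np

def h_cloudy(theta_eb, T, Tnu, z, f_d, f_s, f_c, z_c=3.0, z_d=12.0):
    """Cloudy RTM sketch following the decomposition above: cumulus tops
    at z_c, deep/stratiform tops at z_d, cloud fractions f_c, f_d, f_s.
    Assumes the height grid z contains the levels z_c and z_d."""
    w = T * np.gradient(Tnu, z)                  # T(z) dTnu/dz
    i_c, i_d = np.searchsorted(z, [z_c, z_d])
    lo = z <= z_c
    mid = (z >= z_c) & (z <= z_d)
    hi = z >= z_d
    clear_low = theta_eb * Tnu[0] + np.trapz(w[lo], z[lo])
    inner = ((1 - f_c) * clear_low
             + f_c * T[i_c] * Tnu[i_c]           # cumulus cloud-top emission
             + np.trapz(w[mid], z[mid]))
    return ((1 - f_d - f_s) * inner
            + (f_d + f_s) * T[i_d] * Tnu[i_d]    # deep/stratiform cloud top
            + np.trapz(w[hi], z[hi]))            # clear column above z_d

# Sanity check: h_cloudy(..., f_d=0, f_s=0, f_c=0) equals the clear-sky h.
```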

SLIDES 26-27

Systematic model error in data assimilation

Suppose the observation is generated with y_ν = h_ν(x, f) + η, η ∼ N(0, R). The difficulty in estimating the cloud fractions and cloud top heights (in reality we don't know precisely how many clouds are under a column) induces model error.

In an extreme case, we consider filtering with a cloud-free RTM: y_ν = h_ν(x, 0) + b_ν, where b_ν = h_ν(x, f) − h_ν(x, 0) + η is a model error with bias.

SLIDE 28

Observations (y_ν) vs. model error (b_ν)

SLIDE 29

State estimation of the model error

We propose a secondary filter to estimate the statistics of b_i as follows:

Prior p(x_i)        --(Primary Filter)-->        Posterior p(x_i | y_i)
Error Prior p(b)    --(Secondary Filter)-->      Error Posterior p(b | y_i)
Observation y_i     --(RKHS + Training Data)-->  Likelihood p(y_i | b)

A machine learning technique, kernel embedding of conditional distributions⁶, is employed to train a nonparametric likelihood function.

⁶Song, Fukumizu, Gretton, 2013.

SLIDE 30

Secondary Bayesian filter

p(b | y_i) ∝ p(b) p(y_i | b)

[Figure: two examples of the secondary Bayesian update, plotting against the observed model error b_ℓ: the prior p(b_ℓ), the likelihood p(y_i | b_ℓ), the posterior p(b_ℓ | y_i), and the true model error b_i, for observations y_i = 0.184 and y_i = 0.195.]
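The update itself is just Bayes' rule on a grid of model-error values; a minimal sketch, with the grid-based representation as an assumption:

```python
import numpy as np

def secondary_update(b_grid, prior, likelihood):
    """Discrete Bayes update p(b | y_i) proportional to p(b) p(y_i | b).

    `likelihood` holds p(y_i | b) evaluated at the observed y_i for
    every grid value of b (e.g., from the RKHS-trained nonparametric
    likelihood of the later slides)."""
    post = prior * likelihood
    return post / np.trapz(post, b_grid)  # normalize to a density
```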

SLIDE 31

Filter estimates (with adaptive tuning of R and Q).

[Figure: time series over t = 70-78 of θ1, θ_eb, q, and f_s, comparing Truth, EnKF with the true observation model, EnKF with the wrong observation model, and the RKHS-corrected filter.]

SLIDE 32

Example: Lorenz-96

Bias occurs randomly in space and time.

[Figure: space-time plots of the true state, the observations, and the state estimates from the RKHS-corrected filter and the standard EnKF.]

SLIDE 33

Nonparametric likelihood function

We will use the kernel embedding of conditional distributions.⁷ Recall: Let X be a random variable on M with distribution P(X). Given a kernel K : M × M → R, the Moore-Aronszajn theorem states that there exists a Reproducing Kernel Hilbert Space (RKHS) ⊂ L²(M, q) with the reproducing property f(x) = ⟨f, K(x, ·)⟩_q.

⁷Song, Fukumizu, Gretton, 2013.

SLIDE 34

Nonparametric likelihood function

The kernel embedding of the conditional distribution P(Y | B) is defined as

μ_{Y|b} = E_{Y|b}[K̃(Y, ·)] = ∫_N K̃(y, ·) dP(y | b).

Given g ∈ L²(N, q̃),

E_{Y|b}[g(Y)] = ∫_N g(y) dP(y | b) = ∫_N ⟨g, K̃(y, ·)⟩_{q̃} dP(y | b)
            = ⟨g, ∫_N K̃(y, ·) dP(y | b)⟩_{q̃} = ⟨g, μ_{Y|b}⟩_{q̃}.

One can verify that

μ_{Y|b} = C_{YB} C_{BB}^{−1} K(b, ·),

where C_{BY} = ∫_{M×N} K(b, ·) ⊗ K̃(y, ·) dP(b, y) is the kernel embedding of P(B, Y) on the appropriate Hilbert spaces.
SLIDE 35

Nonparametric likelihood function p(y|b)

Given {b_i}_{i=1}^N and {y_i}_{i=1}^N, apply diffusion maps to learn the data-driven orthonormal basis functions ϕ_j(b) ∈ L²(M, q) and ϕ̃_k(y) ∈ L²(N, q̃). Let

p(y | b) = Σ_k μ_{Y|b,k} ϕ̃_k(y) q̃(y),

where

μ_{Y|b,k} = ⟨p(· | b), ϕ̃_k⟩ = E_{Y|b}[ϕ̃_k] = ⟨μ_{Y|b}, ϕ̃_k⟩_{q̃}
          = ⟨C_{YB} C_{BB}^{−1} K(b, ·), ϕ̃_k⟩_{q̃} = ... = Σ_j ϕ_j(b) [C_{YB} C_{BB}^{−1}]_{kj},

with the matrix entries estimated from the training data,

[C_{YB}]_{jk} = ⟨C_{YB}, ϕ̃_j ⊗ ϕ_k⟩_{q̃⊗q} ≈ (1/N) Σ_{i=1}^N ϕ̃_j(y_i) ϕ_k(b_i),
[C_{BB}]_{jk} = ⟨C_{BB}, ϕ_j ⊗ ϕ_k⟩_{q⊗q} ≈ (1/N) Σ_{i=1}^N ϕ_j(b_i) ϕ_k(b_i).
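A matrix-form sketch of these estimators, assuming the diffusion-maps bases have already been evaluated at the training samples (Nyström extension to out-of-sample points is omitted):

```python
import numpy as np

def train_conditional_embedding(phi_b, phi_y):
    """Estimate C_YB C_BB^{-1} from basis functions on training data.

    phi_b: (N, J) with entries phi_j(b_i);
    phi_y: (N, K) with entries phi~_k(y_i).
    """
    N = phi_b.shape[0]
    C_YB = phi_y.T @ phi_b / N           # Monte Carlo estimate of C_YB
    C_BB = phi_b.T @ phi_b / N           # Monte Carlo estimate of C_BB
    return C_YB @ np.linalg.inv(C_BB)

def likelihood_on_training(C, phi_b_new, phi_y, qt_y):
    """Evaluate p(y_i | b) = sum_k mu_{Y|b,k} phi~_k(y_i) q~(y_i) at the
    training points y_i, given the basis phi_b_new (J,) at a new b and
    the sampling density values qt_y (N,)."""
    mu = C @ phi_b_new                   # coefficients mu_{Y|b,k}
    return (phi_y @ mu) * qt_y           # (N,) likelihood values
```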

SLIDE 36

References:

1. M. De La Chevrotière & H, "A data-driven method for improving the correlation estimation in serial ensemble Kalman filters," Mon. Wea. Rev. 145(3), 985-1001, 2017.
2. M. De La Chevrotière & H, "Localization mappings for filtering a monsoon-Hadley multicloud model" (in progress).
3. T. Berry & H, "Correcting biased observation model error in data assimilation," Mon. Wea. Rev. (in press).
4. T. Berry & H, "Variable bandwidth diffusion kernels," Appl. Comput. Harmon. Anal. 40, 68-96, 2016.
5. H, "An introduction to data-driven methods for stochastic modeling of dynamical systems," Springer (to appear).