Improving EnKF with machine learning algorithms

John Harlim
Department of Mathematics and Department of Meteorology
The Pennsylvania State University
June 12, 2017

Overview

A supervised learning algorithm
An unsupervised learning algorithm (diffusion maps)
Learning the localization function of EnKF
Learning a likelihood function. Application: to correct biased observation model error in DA
A supervised learning algorithm

The basic idea of a supervised learning algorithm is to train a map H : X → Y from paired data {xi, yi}, i = 1, ..., N. Remarks:

◮ The objective is to use the estimated map Ĥ to predict ys = Ĥ(xs) given new data xs.
◮ Various methods to estimate H include regression, SVM, KNN, neural nets, etc.
◮ For this talk, we will focus on how to use regression in appropriate spaces to improve EnKF (a minimal regression sketch follows below).
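To make the regression remark concrete, here is a minimal sketch (not from the talk) of estimating a map Ĥ by least-squares regression on polynomial features and using it to predict at new inputs; the synthetic data and the feature choice are illustrative assumptions.

```python
import numpy as np

# Illustrative paired data {x_i, y_i}: a noisy nonlinear relation (assumed example).
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=200)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(200)

# Train the map H by least-squares regression on polynomial features.
degree = 5
A = np.vander(x_train, degree + 1)                 # design matrix [x^5, ..., x, 1]
coeffs, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def H_hat(x_new):
    """Estimated map: predict y_s = H_hat(x_s) for new data x_s."""
    return np.vander(np.atleast_1d(x_new), degree + 1) @ coeffs

print(H_hat(0.5))                                  # prediction at a new input
```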
An unsupervised learning algorithm

Given a data set {xi}, the main task is to learn a function ϕ(xi) that can describe the data. In this talk, I will focus on a nonlinear manifold learning algorithm, the diffusion maps1: Given {xi} ∈ M ⊂ Rn with a sampling measure q, the diffusion maps algorithm is a kernel-based method that produces orthonormal basis functions on the manifold, ϕk ∈ L2(M, q). These basis functions are solutions of an eigenvalue problem,

q^{−1} div(q ∇ϕk(x)) = λk ϕk(x),

where the weighted Laplacian operator is approximated with an integral operator with appropriate normalization.

1Coifman & Lafon 2006, Berry & H, 2016.
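As a rough illustration of the construction above, here is a minimal diffusion-maps sketch (Gaussian kernel, density normalization with α = 1, eigendecomposition of the resulting Markov matrix); the bandwidth and the synthetic circle data are my own choices, not details from the talk.

```python
import numpy as np

def diffusion_maps(X, epsilon, n_eig=10, alpha=1.0):
    """Basic diffusion maps: approximate eigenpairs (lambda_k, phi_k) on the data manifold."""
    # Pairwise squared distances and Gaussian kernel
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (4.0 * epsilon))

    # Density normalization (alpha = 1 removes the sampling-density bias)
    q = K.sum(axis=1)
    K_alpha = K / np.outer(q ** alpha, q ** alpha)

    # Row-normalize to a Markov matrix and take its leading eigenvectors
    P = K_alpha / K_alpha.sum(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)[:n_eig]
    return eigvals.real[order], eigvecs.real[:, order]

# Example: uniformly sampled circle data should recover Fourier-like basis functions
theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
lam, phi = diffusion_maps(X, epsilon=0.05)
```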
Examples:

Example: For uniformly distributed data on a circle, we obtain the Fourier basis. [Figure: the first three estimated eigenfunctions, corresponding to e^{ix}, e^{i2x}, e^{i3x}.]
Example: For Gaussian distributed data on the real line, we obtain the Hermite polynomials. [Figure: estimated vs. true ϕ1(x), ϕ2(x), ϕ3(x).]
Example: Nonparametric basis functions estimated on a nontrivial manifold. [Figure: leading basis functions plotted on the manifold.]
Remark: Essentially, one can view the diffusion maps as a method to learn a generalized Fourier basis on the manifold.
Learning the localization function of EnKF

◮ When EnKF is performed with a small ensemble size, one way to alleviate the spurious correlations is to employ a localization function.
◮ For example, in the serial EnKF, for each scalar observation yi, one "localizes" the Kalman gain, K = L_{xy_i} ∘ XY_i^⊤ (Y_i Y_i^⊤ + R)^{−1}, with an empirically chosen localization function L_{xy_i} (Gaspari-Cohn, etc.), which requires some tuning (a minimal sketch of this standard approach follows below).
◮ Let's use the idea from machine learning to train this localization function. The key idea is to find a map that takes poorly estimated correlations to accurately estimated correlations.
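For reference, a minimal sketch of the standard empirically chosen localization mentioned above: the Gaspari-Cohn function tapering the serial-EnKF gain for one scalar observation. The variable names, the distance argument, and the 1/(K−1) sample-covariance normalization are my own conventions, not details from the talk.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Gaspari-Cohn (1999) fifth-order compactly supported correlation function."""
    r = np.abs(np.asarray(dist, dtype=float)) / c
    gc = np.zeros_like(r)
    m1 = r <= 1.0
    m2 = (r > 1.0) & (r <= 2.0)
    gc[m1] = -0.25 * r[m1]**5 + 0.5 * r[m1]**4 + 0.625 * r[m1]**3 - (5/3) * r[m1]**2 + 1.0
    gc[m2] = ((1/12) * r[m2]**5 - 0.5 * r[m2]**4 + 0.625 * r[m2]**3 + (5/3) * r[m2]**2
              - 5.0 * r[m2] + 4.0 - (2/3) / r[m2])
    return gc

def localized_gain(X, Y_i, R_i, dist_to_obs, c):
    """Localized gain K = L ∘ X Y_i^T (Y_i Y_i^T + R)^{-1} for one scalar observation.

    X           : (n_state, K) ensemble perturbation matrix of the state
    Y_i         : (K,) ensemble perturbations of the observed quantity y_i
    R_i         : scalar observation error variance
    dist_to_obs : (n_state,) distances from each state variable to the observation
    """
    K_ens = X.shape[1]
    cov_xy = X @ Y_i / (K_ens - 1)        # sample cross-covariance X Y_i^T
    var_y = Y_i @ Y_i / (K_ens - 1)       # sample variance Y_i Y_i^T
    return gaspari_cohn(dist_to_obs, c) * cov_xy / (var_y + R_i)
```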
Learning localization map2

Given a set of large-ensemble EnKF solutions, {x_m^{a,k}}, k = 1, ..., L, m = 1, ..., M, as a training data set, where L is large enough so that the correlation ρ^L_{ij} ≈ ρ(xi, yj) is accurate.

◮ Operationally, we wish to run EnKF with K ≪ L ensemble members. Then our goal is to train a map that transforms the subsampled correlation ρ^K_{ij} into the accurate correlation ρ^L_{ij}.
◮ Basically, we consider the following optimization problem (a sketch of the resulting estimator follows below):

min_{L_{x_i y_j}} ∫_{[−1,1]} ∫_{[−1,1]} (L_{x_i y_j} ρ^K_{ij} − ρ^L_{ij})^2 p(ρ^K_{ij} | ρ^L_{ij}) p(ρ^L_{ij}) dρ^K_{ij} dρ^L_{ij}
  ≈ (Monte Carlo) min_{L_{x_i y_j}} (1/MS) Σ_{m,s=1}^{M,S} (L_{x_i y_j} ρ^K_{ij,m,s} − ρ^L_{ij,m})^2,

where ρ^L_{ij,m} ∼ p(ρ^L_{ij}) and ρ^K_{ij,m,s} ∼ p(ρ^K_{ij} | ρ^L_{ij}) is an estimated correlation using only K out of L training data.

2De La Chevrotière & H, 2017.
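As a rough illustration (my own sketch, not the authors' code), the Monte-Carlo objective above reduces, for a fixed pair (i, j) and a scalar weight, to a closed-form least-squares slope between the subsampled K-member correlations and the accurate L-member correlations; the data layout and subsampling scheme are assumptions.

```python
import numpy as np

def learn_localization_weight(ensemble, i, j, K, S, rng):
    """Least-squares estimate of the scalar localization weight for the pair (i, j).

    ensemble : array (L, M, n_state) of large-ensemble analysis members over M cycles
    K        : small (operational) ensemble size, K << L
    S        : number of K-member subsamples drawn per cycle
    """
    L = ensemble.shape[0]
    rho_K, rho_L = [], []
    for m in range(ensemble.shape[1]):
        xi, xj = ensemble[:, m, i], ensemble[:, m, j]
        rho_L_m = np.corrcoef(xi, xj)[0, 1]                    # "accurate" L-member correlation
        for _ in range(S):
            idx = rng.choice(L, size=K, replace=False)
            rho_K.append(np.corrcoef(xi[idx], xj[idx])[0, 1])  # noisy K-member correlation
            rho_L.append(rho_L_m)
    rho_K, rho_L = np.array(rho_K), np.array(rho_L)
    # argmin_w sum (w * rho_K - rho_L)^2 has the closed-form least-squares solution:
    return np.sum(rho_K * rho_L) / np.sum(rho_K ** 2)

# Usage with synthetic data: 100-member "large" ensemble, 10-member subsamples
rng = np.random.default_rng(1)
ens = rng.standard_normal((100, 50, 8))
w = learn_localization_weight(ens, i=0, j=3, K=10, S=20, rng=rng)
```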
Example: On the Monsoon-Hadley multicloud model3

It is a Galerkin projection of the zonally symmetric β-plane primitive equations onto the barotropic and first two baroclinic modes, stochastically driven by a three-cloud model paradigm. Consider an observation model h(x) that is similar to an RTM.

3M. De La Chevrotière and B. Khouider, 2016.
Example of trained localization map

[Figure: trained localization maps for Channel 3 and θ1, and for Channel 6 and θeb.]
DA results

[Figure: data assimilation results for the monsoon-Hadley multicloud model.]
Correcting biased observation model error4

All Kalman-based DA methods assume unbiased observation model error, e.g., yi = h(xi) + ηi, ηi ∼ N(0, R). Suppose the operator h is unknown. Instead, we are only given h̃; then yi = h̃(xi) + bi, where we introduce a biased model error, bi = h(xi) − h̃(xi) + ηi.

4Berry & H, 2017.
Example: Basic radiative transfer model

Consider solutions of the stochastic cloud model5, {T(z), θeb, q, fd, fs, fc}. Based on these solutions, define a basic radiative transfer model as follows,

hν(x) = θeb Tν(0) + ∫_0^∞ T(z) ∂Tν/∂z (z) dz,

where Tν is the transmission from height z to ∞, defined to depend on q.

[Figure: the weighting function ∂Tν/∂z as a function of height z.]

5Khouider, Biello, Majda 2010
Example: Basic radiative transfer model

Suppose the deep and stratiform cloud-top height is zd = 12 km, while the cumulus cloud-top height is zc = 3 km. Define f = {fd, fc, fs} and x = {T(z), θeb, q}. Then the cloudy RTM is given by,

hν(x, f) = (1 − fd − fs) [ θeb Tν(0) + ∫_0^{zd} T(z) ∂Tν/∂z (z) dz ] + (fd + fs) T(zd) Tν(zd) + ∫_{zd}^∞ T(z) ∂Tν/∂z (z) dz

         = (1 − fd − fs) [ (1 − fc) ( θeb Tν(0) + ∫_0^{zc} T(z) ∂Tν/∂z (z) dz ) + fc T(zc) Tν(zc) + ∫_{zc}^{zd} T(z) ∂Tν/∂z (z) dz ] + (fd + fs) T(zd) Tν(zd) + ∫_{zd}^∞ T(z) ∂Tν/∂z (z) dz.

One can check that hν(x, 0) corresponds to the cloud-free RTM.
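A minimal numerical sketch of the cloudy RTM above (my own discretization, not from the talk): the grid, the interpolation of T and Tν, and the trapezoidal quadrature are illustrative assumptions.

```python
import numpy as np

def cloudy_rtm(z, T, Tnu, theta_eb, fd, fs, fc, zc=3.0, zd=12.0):
    """Evaluate h_nu(x, f) for the cloudy RTM by trapezoidal quadrature.

    z   : heights (km), increasing grid whose top approximates "infinity"
    T   : temperature profile T(z) on the grid
    Tnu : transmission profile T_nu(z) on the grid (depends on the moisture q)
    """
    dTnu_dz = np.gradient(Tnu, z)                    # weighting function dT_nu/dz

    def integral(a, b):                              # trapezoidal estimate of int_a^b T dT_nu/dz dz
        mask = (z >= a) & (z <= b)
        zz, ff = z[mask], T[mask] * dTnu_dz[mask]
        return np.sum(0.5 * (ff[1:] + ff[:-1]) * np.diff(zz))

    def interp(profile, h):
        return np.interp(h, z, profile)

    clear_below_zc = theta_eb * interp(Tnu, 0.0) + integral(0.0, zc)
    below_zd = ((1 - fc) * clear_below_zc
                + fc * interp(T, zc) * interp(Tnu, zc)
                + integral(zc, zd))
    return ((1 - fd - fs) * below_zd
            + (fd + fs) * interp(T, zd) * interp(Tnu, zd)
            + integral(zd, z[-1]))
```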
Systematic model error in data assimilation

Suppose the observation is generated with yν = hν(x, f) + η, η ∼ N(0, R). The difficulty in estimating the cloud fractions and cloud-top heights (in reality we don't know precisely how many clouds are under a column) induces model error. In an extreme case, we consider filtering with a cloud-free RTM: yν = hν(x, 0) + bν, where bν = hν(x, f) − hν(x, 0) + η is model error with bias.
Observations (yν) vs. model error (bν)
State estimation of the model error

We propose a secondary filter to estimate the statistics for bi as follows:

Prior p(xi)  --Primary Filter-->  Posterior p(xi | yi)
Error prior p(b)  --Secondary Filter-->  Error posterior p(b | yi)
Observation yi  --RKHS + training data-->  Likelihood p(yi | b)

A machine learning technique, kernel embedding of conditional distributions6, is employed to train a nonparametric likelihood function.

6Song, Fukumizu, Gretton, 2013.
Secondary Bayesian filter

p(b | yi) ∝ p(b) p(yi | b)

[Figure: two examples of the secondary Bayesian update, showing the prior p(bℓ), the trained likelihood p(yi | bℓ), the posterior p(bℓ | yi), and the true model error bi.]
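A minimal sketch of this secondary Bayesian update on a grid of candidate model-error values; the grid, prior, and likelihood arrays are placeholders standing in for the RKHS-trained quantities.

```python
import numpy as np

def secondary_filter_update(b_grid, prior, likelihood):
    """Pointwise Bayes update p(b | y_i) ∝ p(b) p(y_i | b) on a grid of b values.

    b_grid     : candidate model-error values b_ell (assumed nearly uniform grid)
    prior      : p(b_ell) evaluated on the grid
    likelihood : p(y_i | b_ell) at the observed y_i (e.g. from the trained RKHS model)
    """
    db = np.diff(b_grid).mean()
    post = prior * likelihood
    post /= post.sum() * db                       # normalize to a density
    mean_b = np.sum(b_grid * post) * db           # posterior-mean bias estimate
    return post, mean_b
```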
Filter estimates (with adaptive tuning of R and Q).

[Figure: time series of θ1, θeb, q, and fs, comparing the truth, EnKF with the true observation model, EnKF with the wrong observation model, and the RKHS-corrected filter.]
Example: Lorenz-96

Bias occurs randomly in space and time.

[Figure: space-time plots of the true state, the observations, and the RKHS and EnKF estimates.]
Nonparametric likelihood function

We will use the kernel embedding of conditional distributions.7 Recall: Let X be a random variable on M with distribution P(X). Given a kernel K : M × M → R, the Moore-Aronszajn theorem states that there exists a Reproducing Kernel Hilbert Space (RKHS), here taken to be L2(M, q), in which K has the reproducing property: f(x) = ⟨f, K(x, ·)⟩_q.

7Song, Fukumizu, Gretton, 2013.
Nonparametric likelihood function

The kernel embedding of the conditional distribution P(Y | B) is defined as,

µ_{Y|b} = E_{Y|b}[K̃(Y, ·)] = ∫_N K̃(y, ·) dP(y | b).

Given g ∈ L2(N, q̃),

E_{Y|b}[g(Y)] = ∫_N g(y) dP(y | b) = ∫_N ⟨g, K̃(y, ·)⟩_q̃ dP(y | b) = ⟨g, ∫_N K̃(y, ·) dP(y | b)⟩_q̃ = ⟨g, µ_{Y|b}⟩_q̃.

One can verify that

µ_{Y|b} = C_YB C_BB^{−1} K(b, ·),

where C_BY = ∫_{M×N} K(b, ·) ⊗ K̃(y, ·) dP(b, y) is the kernel embedding of P(B, Y) on the appropriate Hilbert spaces.
Nonparametric likelihood function p(y|b)

Given {bi} and {yi}, i = 1, ..., N, apply diffusion maps to learn the data-driven orthonormal basis functions ϕ_j(b) ∈ L2(M, q) and ϕ̃_k(y) ∈ L2(N, q̃). Let

p(y | b) = Σ_k µ_{Y|b,k} ϕ̃_k(y) q̃(y),

where

µ_{Y|b,k} = ⟨p(· | b), ϕ̃_k⟩ = E_{Y|b}[ϕ̃_k] = ⟨µ_{Y|b}, ϕ̃_k⟩_q̃ = ⟨C_YB C_BB^{−1} K(b, ·), ϕ̃_k⟩_q̃ = ... = Σ_j ϕ_j(b) [C_YB C_BB^{−1}]_{kj},

with the matrix entries estimated from the training data,

[C_YB]_{jk} = ⟨C_YB, ϕ̃_j ⊗ ϕ_k⟩_{q̃⊗q} ≈ (1/N) Σ_{i=1}^N ϕ̃_j(yi) ϕ_k(bi),
[C_BB]_{jk} = ⟨C_BB, ϕ_j ⊗ ϕ_k⟩_{q⊗q} ≈ (1/N) Σ_{i=1}^N ϕ_j(bi) ϕ_k(bi).
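A rough sketch (my own, not the paper's code) of the estimators above: forming the empirical matrices C_YB and C_BB from training pairs and evaluating p(y | b). The basis matrices are assumed to be precomputed, e.g. from diffusion maps, and the regularization term is an added assumption for a stable inverse.

```python
import numpy as np

def fit_conditional_coefficients(Phi_b, Phi_y, reg=1e-6):
    """Estimate the coefficient matrix C_YB C_BB^{-1} from training data.

    Phi_b : (N, J) matrix of basis values phi_j(b_i)     (e.g. diffusion-maps basis in b)
    Phi_y : (N, K) matrix of basis values phi~_k(y_i)    (e.g. diffusion-maps basis in y)
    """
    N = Phi_b.shape[0]
    C_YB = Phi_y.T @ Phi_b / N            # [C_YB]_{jk} ~ (1/N) sum_i phi~_j(y_i) phi_k(b_i)
    C_BB = Phi_b.T @ Phi_b / N            # [C_BB]_{jk} ~ (1/N) sum_i phi_j(b_i) phi_k(b_i)
    C_BB += reg * np.eye(C_BB.shape[0])   # small regularization for a stable inverse
    return C_YB @ np.linalg.inv(C_BB)

def likelihood_p_y_given_b(coeff, phi_b, phi_y, q_tilde_y):
    """Evaluate p(y | b) ~ sum_k [coeff @ phi(b)]_k phi~_k(y) q~(y)."""
    mu = coeff @ phi_b                    # coefficients mu_{Y|b,k}
    return float(mu @ phi_y) * q_tilde_y
```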
References:

1. M. De La Chevrotière & H, "A data-driven method for improving the correlation estimation in serial ensemble Kalman filters", Mon. Wea. Rev. 145(3), 985-1001, 2017.
2. M. De La Chevrotière & H, "Localization mappings for filtering a monsoon-Hadley multicloud model" (in progress).
3. T. Berry & H, "Correcting biased observation model error in data assimilation", Mon. Wea. Rev. (in press).
4. T. Berry & H, "Variable bandwidth diffusion kernels", Appl. Comput. Harmon. Anal. 40, 68-96, 2016.
5. H, "An introduction to data-driven methods for stochastic modeling of dynamical systems", Springer (to appear).