Data assimilation and Markov Chain Monte Carlo techniques
Moritz Schauer, Gothenburg University, Sweden
National Institute of Marine Science and Technology, 2019
Introduction
Outline
- Information about content and organisation of the course
- What is MCMC?
- Probability and Bayesian statistics refresher
- Getting you started with the lab
1
My interests
- Statistical models for evolutionary change
- Tracking the growth of plants
- Data on past temperature in climate science
2
Course content
1. Introduction
2. Probability and statistics refresher
3. Markov chains
4. Filtering
5. Hierarchical Bayesian modelling
6. Markov chain Monte Carlo
7. Stochastic gradient descent
8. Automatic differentiation
9. Synthesis and modern stochastic simulation
3
Markov Chain Monte Carlo
- Markov chain Monte Carlo methods: computational algorithms for generating samples from a probability distribution, say with density x ↦ p(x), for example a posterior density in Bayesian statistics.
- The Metropolis-Hastings algorithm, from which many MCMC methods can be derived.
4
Monte Carlo
Draw independent samples X1, X2, . . . with probability density p. By the law of large numbers,

(1/n) Σ_{i=1}^n h(Xi) → ∫ h(x) p(x) dx almost surely.
5
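A minimal sketch of plain Monte Carlo, in Python for illustration (the course materials use Julia). The choice h(x) = x² with X ∼ N(0, 1) is mine: the true value of ∫ h(x) p(x) dx is then Var(X) = 1.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# Monte Carlo: estimate E[h(X)] = ∫ h(x) p(x) dx by the sample average,
# here with h(x) = x^2 and X ~ N(0, 1), so the true value is Var(X) = 1.
n = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
estimate = sum(x * x for x in samples) / n
```

By the law of large numbers the estimate converges to 1, with error of order 1/√n.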
Markov chain Monte Carlo
Generate a Markov chain X1, X2, . . . with equilibrium density p. Under the conditions of the ergodic theorem, still

(1/n) Σ_{i=1}^n h(Xi) → ∫ h(x) p(x) dx almost surely,

but the convergence becomes slower.
6
Origin
Markov chain Monte Carlo (MCMC) was invented soon after ordinary Monte Carlo at Los Alamos, one of the few places where adequate computers were available at the time. Metropolis et al. (1953) invented their algorithm to simulate a liquid in equilibrium with its gas phase.1
¹ Their article "Equation of State Calculations by Fast Computing Machines" was deemed important enough to have its own Wikipedia article.
7
Metropolis-Hastings
Hastings2 generalized the work to the form now called the Metropolis-Hastings algorithm, the fundamental Markov chain Monte Carlo algorithm. This algorithm has completely changed the way Bayesian inference is done and where it is used. Now Markov chain Monte Carlo is the workhorse of Bayesian statistics.
² https://www.jstor.org/stable/2334940
8
Probability and statistics refresher
Outline
- Probability, random variables
- Bayesian inference
- (Conditional) independence
- Transformations of random variables
9
Probability
- Probability is a numerical measure of how likely an event is to
happen.
- Probability is a proportion, a number between 0 and 1.
- Notation: P(something that can happen) = a probability. E.g. P(coin heads-up) = 1/2.
Figure from https://mathwithbaddrawings.com/2015/09/23/what-does-probability-mean-in-your-profession/.
10
Example: Discrete probability distributions
To specify a discrete probability distribution, one lists all possible outcomes and the probabilities with which they occur.

The probability distribution for a fair six-sided die:

Event       ⚀   ⚁   ⚂   ⚃   ⚄   ⚅
Probability 1/6 1/6 1/6 1/6 1/6 1/6

E.g. P({⚀, ⚁}) = 1/3.
11
Random variables
When dealing with several related random experiments, it is convenient to work with random variables. Random variables are numeric quantities whose value depends on the outcome of a random event.
12
Random variables
Definition: Random variables are maps, defined on a probability space Ω. Think of Ω as a large set containing all information needed to determine the outcome of experiments, but not specified in detail. Imagine nature picks a random state ω in the set Ω, equipped with a probability measure P. A random variable X maps ω to a quantity of interest, e.g. the number of eyes a die shows, in {1, 2, . . . , 6}.

Then PX(A) := P({ω : X(ω) ∈ A}) defines a measure, the distribution of X (the push-forward).
13
Example: Just one die
A random variable X taking values ⚀ to ⚅. If we are talking about X, we do not know the result of the experiment, only that P(X = ⚂) = 1/6. Can you make this formal?
14
Coupled random variables
If several random variables are defined on the same underlying probability space, they are coupled. In this case we can talk about dependence and independence etc. Example?
15
Conditional distributions
Conditional distributions describe how the probabilities of a random variable X change if Y = y is known/observed (both coupled). In the discrete case,

P(X = x | Y = y) = P({Y = y} ∩ {X = x}) / P(Y = y).

The probabilistic framework incorporates information (and missing information) naturally.
- (Bayesian) probabilistic updating
16
Models from random variables
Use random variables to model
- measurement errors
- statistical uncertainty
- Bayesian inference
- naturally random processes (sic!)
Distinct notions, but reconcilable.3
³ A very Dutch topic: frequentist analysis of Bayesian methods.
17
Examples
- Unknown mass of the Higgs boson
- Spam filter
- Decay of atoms
- The movement of molecules in a cubic meter of ocean
18
Random variables with densities
Random variable X with density (probability density function, pdf) fX : ℝ → [0, ∞). Probabilities are integrals:

P(X ∈ [a, b]) = ∫_[a,b] fX(x) dx.
19
Discrete random variables
Random variable X with probability mass function pX(x) = P(X = x), pX : ℝ → [0, ∞), where pX(x) = 0 except on a finite (or countable) set of atoms D with pX(x) > 0 for x ∈ D. Probabilities are sums:

P(X ∈ A) = Σ_{x ∈ A ∩ D} P(X = x).
20
Mixed random variables
Random variable X with (sub-)probability mass function pd : ℝ → [0, ∞) with support D, and (sub-)probability density function f : ℝ → [0, ∞). Probabilities are computed as

P(X ∈ A) = ∫_A f(x) dx + Σ_{x ∈ A ∩ D} P(X = x)

(with Σ_{x ∈ D} P(X = x) + ∫ f(x) dx = 1).
21
Example: mixed random variable
Let Z ∼ N(0, 1). Set X = Z with probability p, and X = 0 otherwise (independent of Z). What is...

P(X = 0) = 1 − p
P(X ≥ 0) = (1 − p) + p ∫_0^∞ φ(x) dx = (1 − p) + p/2
22
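The answers above can be checked by simulation. A Python sketch (the course uses Julia); the value p = 0.3 is an illustrative choice:

```python
import random

random.seed(0)
p = 0.3          # illustrative value
n = 200_000

# X = Z ~ N(0, 1) with probability p, and X = 0 otherwise, so X has an
# atom at 0: P(X = 0) = 1 - p and P(X >= 0) = (1 - p) + p/2.
xs = [random.gauss(0.0, 1.0) if random.random() < p else 0.0
      for _ in range(n)]

prob_zero = sum(x == 0.0 for x in xs) / n      # close to 1 - p = 0.7
prob_nonneg = sum(x >= 0.0 for x in xs) / n    # close to (1 - p) + p/2 = 0.85
```

Note the atom at 0 is detectable by exact comparison, since a continuous draw is (almost surely) never exactly 0.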
Definition of a probability measure
Start with a set of events E on a space Ω (a σ-algebra). A probability measure is a map P : E → [0, 1] with P(Ω) = 1 and

P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) for disjoint events Ai ∈ E.

- Kolmogorov's axioms
23
Joint and marginal densities
Random variables X, Y with joint bivariate density fX,Y : ℝ × ℝ → [0, ∞). Marginal densities:

fX(x) = ∫ fX,Y(x, y) dy,   fY(y) = ∫ fX,Y(x, y) dx
24
Joint and marginal densities
Example: bivariate normal with correlation ρ, means µX, µY and variances σX², σY²:

fX,Y(x, y) = (1 / (2π σX σY √(1 − ρ²))) exp(−z / (2(1 − ρ²))),

z = (x − µX)²/σX² − 2ρ(x − µX)(y − µY)/(σX σY) + (y − µY)²/σY².
25
Conditional densities
The density of X given Y = y (that is, the density of the conditional distribution of X after observing that Y took the value y):

fX|Y=y(x) = fX,Y(x, y) / ∫ fX,Y(x′, y) dx′ = Cy⁻¹ fX,Y(x, y)

(Cy = ∫ fX,Y(x′, y) dx′ is the normalising constant, depending on y.)
26
Bayes formula
The Bayesian formalism equips unknown parameters with a prior distribution. The posterior distribution over the parameters given data is then just the conditional distribution:

p(θ): prior density of the parameter θ
f(y | θ): likelihood (density of the observation y given θ)
p(θ | y): posterior density of the parameter θ

p(θ | y) = f(y | θ) p(θ) / ∫ f(y | θ′) p(θ′) dθ′
27
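Bayes' formula can be applied numerically by evaluating prior and likelihood on a grid of parameter values and renormalising. A Python sketch (the course uses Julia); the coin-flip setup, the grid, and all numbers are my illustrative choices, not from the slides:

```python
# Posterior ∝ likelihood × prior, on a grid of parameter values.
# Toy setup (assumed): θ is the success probability of a coin,
# the observation is y = 7 heads in n = 10 throws, uniform prior.
thetas = [i / 100 for i in range(1, 100)]
prior = [1.0 / len(thetas)] * len(thetas)

def likelihood(y, n, theta):
    # binomial likelihood, up to the constant factor binom(n, y)
    return theta ** y * (1.0 - theta) ** (n - y)

unnorm = [likelihood(7, 10, t) * pr for t, pr in zip(thetas, prior)]
Z = sum(unnorm)                      # normalising constant ∫ f(y|θ') p(θ') dθ'
posterior = [u / Z for u in unnorm]

post_mean = sum(t * pr for t, pr in zip(thetas, posterior))
```

With a uniform prior the exact posterior is Beta(8, 4), so the grid posterior mean is close to 8/12 ≈ 0.667.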
Bayesian convention
Often the subscripts of the densities are dropped: random variable X with density fX, random variable Y with density fY, joint bivariate density fX,Y. Denote all densities by p and distinguish them by their arguments. For example, the formula for the marginal density becomes

p(x) = ∫ p(x, y) dy.

The conditional density of X given Y = y becomes

p(x | y) = p(x, y) / ∫ p(x′, y) dx′.
28
Radon-Nikodym theorem
For probability measures, P(A) = 0 ⇒ Q(A) = 0 for all A is equivalent to the existence of a density dQ/dP with

Q(A) = ∫_A (dQ/dP)(ω) dP(ω) for all A.
29
Transformations of random variables
More tools in your toolbox for working with random variables: Transformation by function h: Y = h(X) Distribution: P(Y ∈ A) = P(X ∈ h−1(A)). Simulation: Generate X, apply h.
30
Example: Sum of the eyes of two dice
For example, consider two independent throws of six-sided dice.
31
Sums and products of random variables
In some cases the distribution of sums and products of random variables is known in closed form. Examples: the sum of jointly normal random variables is normal; the ratio of a centred normal X and the square root of an independent chi-squared variable V (divided by its degrees of freedom) is t-distributed.
32
Example: Differential equation with uncertain parameter
Example: deterministic modelling + uncertainty. Differential equation

du(t)/dt = −Θ u(t)

with uncertainty about the parameter Θ and the starting value u(0) = U, quantified by U ∼ N(5, 3) and Θ ∼ N(0.5, 0.1).
33
Example: Differential equation
The model is of functional form u(t) = ft(Θ, U), with ft(θ, u) = exp(−θt) u the solution of the ODE. Explore the model through simulations: https://nextjournal.com/mschauer/probabilistic-programming
34
Exact computations
Example: sum of uniform random variables, Z = X1 + X2, where X1, X2 ∼ U[0, 1] independent. Z has the triangular distribution with density

fZ(x) = x for 0 ≤ x ≤ 1,   fZ(x) = 2 − x for 1 ≤ x ≤ 2.
35
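The triangular density can be checked against a simulation. A Python sketch (the course uses Julia); the CDF value tested is my choice:

```python
import random

random.seed(42)
n = 200_000

# Z = X1 + X2 with X1, X2 ~ U[0, 1] independent: triangular density
# f(z) = z on [0, 1] and f(z) = 2 - z on [1, 2].
zs = [random.random() + random.random() for _ in range(n)]

# Check one value of the CDF: P(Z <= 0.5) = 0.5^2 / 2 = 0.125.
prob = sum(z <= 0.5 for z in zs) / n
```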
Working with mean and variance
Sometimes it is sufficient to determine the mean and variance of the result of operations on random variables. For sums of random variables X and Y:

E[X + Y] = E X + E Y
Var(aX + Y) = a² Var(X) + Var(Y) if X, Y are independent

In the situation of the central limit theorem, a sum of random variables can be approximated by a normal distribution with the right mean and variance.
36
Error propagation
Independent random variables X1, . . . , Xn with mean 10 and standard deviation 0.3 = √0.09. Let Z = Σ_{i=1}^n Xi. Then Z is approximately N(10n, 0.09n) distributed:

Z = 10n (mean) ± 0.3√n (std. dev.)

Very different from how the engineer argues: X1, . . . , Xn are quantities approximately in the interval [9.7, 10.3], so Z ∈ [9.7n, 10.3n], or Z = 10n ± 0.3n.
37
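The √n scaling can be seen in a simulation. A Python sketch (the course uses Julia); the Gaussian choice for the Xi and the repetition count are my assumptions:

```python
import math
import random

random.seed(7)
n, reps = 100, 5_000

# Sum of n independent quantities, each with mean 10 and std 0.3:
# the sum has mean 10 n but standard deviation 0.3 * sqrt(n),
# much smaller than the worst-case interval width 0.3 * n.
sums = [sum(random.gauss(10.0, 0.3) for _ in range(n)) for _ in range(reps)]

mean = sum(sums) / reps
std = math.sqrt(sum((s - mean) ** 2 for s in sums) / reps)
```

For n = 100 the statistical spread is 0.3·√100 = 3, while the engineer's interval width would be 0.3·100 = 30.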
Julia programming language
I have prepared some things using the programming language Julia. You can follow the course and lab using a programming language you are familiar with, but you might enjoy giving Julia a try. There are many online resources at julialang.org/learning/.
38
Setup
- Install Julia: download Julia 1.1 for your operating system from julialang.org, or
- make a free account on https://nextjournal.com/. Nextjournal is an online notebook environment running Julia (you need a working internet connection to use it).
- In any case, download https://www.dropbox.com/s/yvwgz71zgvkj7sg/icecoredata.csv and read https://nextjournal.com/mschauer/bayesian-filter for the lab.
39
Markov chain
Simple Markov chain
Recursively, let Xi = h(Xi−1, Zi) for i > 0 and X0 ∼ g, using independent, identically distributed innovation or noise random variables Z1, . . . , Zn, . . . .
40
Example: Autoregressive chain
Autoregressive chain: take η ∈ (0, 1), h(x, z) = ηx + z and Z1, . . . , Zn i.i.d. ∼ N(0, 1). So Xi = ηXi−1 + Zi, i > 0.
41
Example: Autoregressive chain
Then Var(Xi) = Var(ηXi−1 + Zi) = η² Var(Xi−1) + 1. Assume that X0 is normal with mean zero. Then all Xi are normal with mean zero. Setting Var(Xi) ≡ σ² and solving σ² = η²σ² + 1 gives

σ² = 1 / (1 − η²).

{X1, X2, . . . } is a Markov chain with stationary distribution N(0, 1/(1 − η²)).
42
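The stationary variance 1/(1 − η²) can be checked by simulating the chain. A Python sketch (the course uses Julia); η = 0.8 is an illustrative value:

```python
import random

random.seed(3)
eta = 0.8        # illustrative value of η
n = 200_000

# Autoregressive chain X_i = η X_{i-1} + Z_i with Z_i ~ N(0, 1); the
# stationary distribution is N(0, 1 / (1 - η^2)).
x, xs = 0.0, []
for _ in range(n):
    x = eta * x + random.gauss(0.0, 1.0)
    xs.append(x)

# Empirical second moment (the mean is 0) vs the stationary variance.
var = sum(v * v for v in xs) / n
```

Because the iterates are correlated, this estimate converges more slowly than an i.i.d. sample average of the same size, exactly the point made about MCMC above.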
Filtering
Smoothing/Filtering
Start with the joint density factorised as f(x, y) = f(y | x) f(x). This is convenient if, e.g., x is a parameter with a prior and the density of Y given X = x is determined by the model; Y is observed. Then, as before,

f(x | y) = f(y | x) f(x) / ∫ f(y | x′) f(x′) dx′ = f(x, y) / ∫ f(x′, y) dx′.
43
Random walk filter example for the lab
See https://nextjournal.com/mschauer/bayesian-filter for example and details.
44
Random walk filter example for the lab
The North Greenland Ice Core Project obtained a core sample of ice by drilling through the Arctic ice sheet. Scientists are interested in a quantity Xt (the past δ18O values) which changes over time, written as t = 1, . . . , n.
45
Measurements
They have produced a sequence of n measurements Y1, . . . , Yn and, as the measurement errors stem from a variety of unrelated causes, model the measurement errors εt at time t,

Yt = Xt + εt,

as independent random variables εt with Gaussian distribution N(0, σ²) for each t = 1, . . . , n.
46
Model
Model Xt+1 = Xt + ηt for t = 1, . . . , n − 1, where the ηt are independent, normally distributed N(0, s²) with mean zero and variance s².
47
Filter
The filtered distribution of Xt is the conditional distribution of Xt at time t given observations Y1 = y1, Y2 = y2 up to and including Yt = yt. It captures what we know about Xt if we have seen the measurement points y1, y2, . . . , yt.
48
Kalman filtering equations
For a random walk model with independent observation errors as above, the filtered distribution of Xt is a normal distribution, and its mean µt and variance pt² can be computed recursively using the filtering equations

pt² = σ² Kt
µt = µt−1 + Kt (yt − µt−1)

with Kt the so-called Kalman gain,

Kt = (s² + p²t−1) / (s² + p²t−1 + σ²) for t > 1, and K1 = p0² / (p0² + σ²).
49
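The filtering equations fit in a few lines of code. A Python sketch (the lab itself uses Julia); the prior mean mu0 and prior variance p20 for the state before the first observation are assumed inputs, and the constant-observation check at the end is my illustrative choice:

```python
def random_walk_filter(ys, sigma2, s2, mu0=0.0, p20=1e6):
    """Kalman filter for Y_t = X_t + eps_t, X_{t+1} = X_t + eta_t, with
    Var(eps_t) = sigma2 and Var(eta_t) = s2. mu0, p20 are an assumed
    prior mean and variance for the state before the first observation."""
    mus, p2s = [], []
    mu, p2 = mu0, p20
    for t, y in enumerate(ys):
        # prediction variance: the state diffuses by s2 between steps
        # (for the first observation only the prior variance enters,
        # matching K_1 = p0^2 / (p0^2 + sigma^2) above)
        pred = p2 + s2 if t > 0 else p2
        K = pred / (pred + sigma2)     # Kalman gain K_t
        mu = mu + K * (y - mu)         # mu_t = mu_{t-1} + K_t (y_t - mu_{t-1})
        p2 = sigma2 * K                # p_t^2 = sigma^2 K_t
        mus.append(mu)
        p2s.append(p2)
    return mus, p2s

# With constant observations the mean locks onto the data and the variance
# settles at the steady state of p^2 = sigma^2 (s^2 + p^2) / (s^2 + p^2 + sigma^2).
mus, p2s = random_walk_filter([1.0] * 50, sigma2=1.0, s2=0.1)
```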
Kalman filtering equations
The mean of the filtered distribution, µt, serves as an estimate of the unknown location of Xt, and the standard deviation pt can be seen as a measure of uncertainty about that location.
50
Full Kalman filter
The more general Kalman filter for multivariate linear models,

xk = Φ xk−1 + b + wk, wk ∼ N(0, Q)
yk = H xk + vk, vk ∼ N(0, R),

proceeds similarly.
51
Hierarchical Bayesian modelling
Outline
- Hierarchical probabilistic model
- Bayesian inference
- Monte Carlo methods
- Metropolis–Hastings
52
Hierarchical models
Use hierarchical models to deal with
- dependency between variables
- indirect observations and missing data
- smoothing/filtering
- Generally: If the data generating process is somewhat understood
53
A dilemma when modelling groups
Pool all groups together:
- Ignores latent differences.
- Observations are not independent.

Model each group separately:
- Small sample sizes problematic.
- Many more parameters to estimate.
- Ignores latent similarity.

Hierarchical modelling shows a way out...
54
Example: Group means model
Latent group means bi ∼ N(µ, σb²) with density φi, for m groups. Specimen j from group i:

yij ∼ N(bi, σy²) with density φij(· − bi).

Factorisation of the joint density:

∏_{i∈I} φi(bi) ∏_{j∈Ji} φij(yij − bi)

Only the specimens are observed; the conditional density of the group means given the observations is

p(b1, . . . , bm | (yij)) = ∏_{i∈I} φi(bi) ∏_{j∈Ji} φij(yij − bi) / ∫ ∏_{i∈I} φi(bi) ∏_{j∈Ji} φij(yij − bi) db1 · · · dbm
55
Bayesian net
This is an example of a Bayesian net. The variables as nodes, together with the parent operator pa, form a DAG, a directed acyclic graph. In general, the joint density in a Bayesian net factorises in the form

p(x1, . . . , xn) = ∏_{i=1}^n p(xi | pa(xi))
56
Bayesian net, factor graph representation
p(x1, . . . , xd) = ∏_{α∈F} fα(xα), where α ⊂ {1, . . . , d} and xα is the tuple of variables (xi)_{i∈α}.
57
Bayes’ formula for hierarchical models
58
Markov Chain Monte Carlo
What is on the menu
- Markov chain Monte Carlo methods: computational algorithms for generating samples from a probability distribution, say with density x ↦ p(x), for example a posterior density in Bayesian statistics.
- The Metropolis-Hastings algorithm, from which many MCMC methods can be derived.
59
Recap: Markov chain
A Markov chain
- is a sequence of random variables X1, X2, . . . (with values in some state space E)
- with the Markov property:

P(Xi ∈ B | Xi−1 = xi−1, . . . , X1 = x1) = P(Xi ∈ B | Xi−1 = xi−1).

The transition kernel Q(x; B) = P(Xi ∈ B | Xi−1 = x) of a time-homogeneous Markov chain fully describes the evolution of the chain. Denote its density in x′ by q(x; x′).
60
Markov chain Monte Carlo
Generate a Markov chain X1, X2, . . . with equilibrium density p. Under the conditions of the ergodic theorem, still

(1/n) Σ_{i=1}^n f(Xi) → ∫ f(x) p(x) dx almost surely,

but the convergence becomes slower.
61
Bayesian inference
A typical application of MCMC is Bayesian inference. The posterior density π(· | y) is proportional to pθ(y) π(θ) (y is an observation with density pθ(y) and π is the prior distribution of the parameter θ). Statistical questions should have answers of the form

∫ f(θ) dπ(θ | y).
62
Expectations
Statistical questions should have answers of the form

∫ f(θ) dπ(θ | y).

For example, the posterior mean and variance or posterior probabilities arise this way. "Statistical" questions which cannot be posed as expectations under the posterior are very problematic. What would be interesting choices of f? Probabilities are expectations of indicators, and the median is the minimiser of an expected deviation.
63
Probabilistic programming
Hierarchical Bayesian model:

s² ∼ InverseGamma(2, 3)
m ∼ N(0, s²)
x ∼ N(m, s²)
y ∼ N(m, s²)

What is the posterior distribution of the parameters s and m given observations x = 1.5, y = 2?
64
Probabilistic programming
Short illustration using the Turing.jl package in Julia.
https://nextjournal.com/mschauer/probabilistic-programming 65
Probabilistic programming
Generated Markov chain.
66
Probabilistic programming
Posterior density estimates.
67
Markov chains and detailed balance
The chain is in detailed balance if there is a density p such that

p(x) q(x; x′) = p(x′) q(x′; x),

i.e. symmetry of the joint distribution of two iterates X, X′ when starting the chain in X ∼ p. This implies that p is a stationary density:

p(x) = ∫ p(x) q(x; x′) dx′ = ∫ p(x′) q(x′; x) dx′ (the density after one step).

(A sufficient condition.)
68
Metropolis-Hastings algorithm
The starting point is a Markov chain with some transition density q(x; x′). For example, a symmetric random walk,

q(x; x′) = C exp(−(x − x′)² / (2σmh²)) = q(x′; x).
69
How to obtain a chain with stationary density p
Define a new chain, for which detailed balance holds for p by modifying the transition kernel.
70
Metropolis-Hastings algorithm
Define the Metropolis-Hastings acceptance probability

α(x; x′) = min( p(x′) q(x′; x) / (p(x) q(x; x′)), 1 ).

Starting from X1, we recursively...
- Given Xi, draw a random proposal X°i+1 from the density q(Xi; ·).
- With probability α(Xi; X°i+1), accept and move: Xi+1 = X°i+1; else reject and stay: Xi+1 = Xi.
71
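The two steps above can be sketched in a few lines. A Python sketch (the course uses Julia); the random walk proposal makes q symmetric, so q cancels in the acceptance ratio. The standard normal target and all tuning values are illustrative choices:

```python
import math
import random

random.seed(5)

def metropolis_hastings(logp, x0, n, sigma_mh=1.0):
    """Random walk Metropolis: the normal proposal is symmetric, so the
    acceptance probability reduces to min(p(x') / p(x), 1)."""
    x, chain = x0, []
    for _ in range(n):
        prop = x + random.gauss(0.0, sigma_mh)
        # accept with probability min(exp(logp(prop) - logp(x)), 1)
        if random.random() < math.exp(min(0.0, logp(prop) - logp(x))):
            x = prop
        chain.append(x)    # on rejection the chain stays at x
    return chain

# Target: standard normal, log density up to an additive constant.
chain = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 100_000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

Note that p only enters through ratios, so the normalising constant of the target never needs to be known, which is exactly why MH works for Bayesian posteriors.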
Detailed balance
Proposition: This creates a Markov chain with a new Markov kernel QMH(x; ·) such that detailed balance holds for p. That means that the joint law p(x) QMH(x; dx′) dx is symmetric:

∫_A ∫_B p(x) QMH(x; dx′) dx = ∫_B ∫_A p(x′) QMH(x′; dx) dx′ for all A, B.
72
Proof: Recall α(x; x′) = min( p(x′) q(x′; x) / (p(x) q(x; x′)), 1 ). For any x, x′, at least one of α(x; x′), α(x′; x) equals 1.
73
Proof (continued): To show symmetry, take x ≠ x′ with α(x; x′) < 1 (otherwise switch x and x′ below). Then

p(x) QMH(x; dx′) dx = p(x) q(x; x′) α(x; x′) dx dx′
= p(x) q(x; x′) · p(x′) q(x′; x) / (p(x) q(x; x′)) dx dx′
= p(x′) q(x′; x) · 1 dx dx′ (where 1 = α(x′; x))
= p(x′) QMH(x′; dx) dx′.

Therefore detailed balance holds and p is a stationary density of the chain.
73
Conditional sampling / Gibbs sampling
Coordinatewise MH: to sample from p(x1, x2, . . . , xd), repeat for each coordinate i:
- Propose a new value x′i for xi from a transition density qi(x1, . . . , xd; ·) and accept with

αi = p(x1, . . . , x′i, . . . , xd) qi(x1, . . . , x′i, . . . , xd; xi) / ( p(x1, . . . , xi, . . . , xd) qi(x1, . . . , xi, . . . , xd; x′i) ).
74
Conditional sampling / Gibbs sampling
Special case: the Gibbs sampler, where one samples from the conditional densities

qi(x1, . . . , xd; xi) = p(xi | x1, . . . , xi−1, xi+1, . . . , xd),

respectively. Then all proposals are accepted (αi = 1).
75
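A Gibbs sampler for a bivariate Gaussian target, as a Python sketch (the course uses Julia). For unit variances and correlation ρ, each full conditional is N(ρ · other, 1 − ρ²), a standard fact about the bivariate normal; ρ = 0.8 is an illustrative choice:

```python
import math
import random

random.seed(11)
rho = 0.8
n = 50_000

# Gibbs sampler for a bivariate normal with unit variances and correlation
# rho: each full conditional is N(rho * other, 1 - rho^2), so every
# "proposal" is accepted.
x, y = 0.0, 0.0
xs, ys = [], []
for _ in range(n):
    x = random.gauss(rho * y, math.sqrt(1.0 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1.0 - rho ** 2))
    xs.append(x)
    ys.append(y)

# Empirical correlation (means are 0, variances 1 in the target).
corr = sum(a * b for a, b in zip(xs, ys)) / n
```

The stronger the correlation, the smaller the conditional variances and the more slowly the sampler explores, which is what the following bivariate Gaussian figures illustrate.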
Exploring a bivariate Gaussian
76
Exploring a bivariate Gaussian
77
Data augmentation
A typical situation: there is a natural Bayesian model for the internal state of a system, X ∼ π(x, θ) = fθ(x) π(θ), but the state is only observed indirectly and with error:

Y = h(X) + ε, with h : ℝᵐ → ℝⁿ, n < m, and observation error ε.

Denote the density of Y given X = x by fx. Use Gibbs or coordinatewise MH with coordinates θ and x to sample from

π(x, θ | Y = y) ∝ fx(y) fθ(x) π(θ).
78
Data augmentation
Iteratively:

Imputation: propose a new value for x, x′ ∼ qx(x, θ; ·), and accept with

αx = π(x′, θ | Y = y) qx(x′, θ; x) / ( π(x, θ | Y = y) qx(x, θ; x′) ).

Inference: propose a new value for θ, θ′ ∼ qθ(x, θ; ·), and accept with

αθ = π(x, θ′ | Y = y) qθ(x, θ′; θ) / ( π(x, θ | Y = y) qθ(x, θ; θ′) ).
79
Earlier example
Hierarchical Bayesian model:

s² ∼ InverseGamma(2, 3)
m ∼ N(0, s²)
x ∼ N(m, s²)
y ∼ N(m, s²)
80
Stochastic gradient descent
Outline
- Gradient descent
- Automatic differentiation / back propagation
- Stochastic gradient descent
- Applications in estimation, regression, minimising a loss function
81
Derivatives (notation and recap)
Multiple chain rule:

∂/∂x z(w(v(x))) = (∂z/∂w)(∂w/∂v)(∂v/∂x)

Total derivative:

∂/∂x z(w1(x), . . . , wk(x)) = (∂z/∂w1)(∂w1/∂x) + · · · + (∂z/∂wk)(∂wk/∂x)
82
Gradients and derivatives
Differentiable function F : ℝᵈ → ℝ. The gradient is the vector of partial derivatives

∇F(x) = (∂F(x)/∂x1, . . . , ∂F(x)/∂xd)ᵀ.

The vector −∇F(x) points in the direction of steepest descent.
83
Gradient descent
Objective: get close to a (local) minimum x⋆ of F(x):

xi+1 = xi − γ∇F(xi).

Move in the direction of steepest descent. γ is called the learning rate.
84
Learning rate
γ too small: moving slowly with small steps. γ too large: overshooting.
85
Julia example
f(x) = x⁴ − 3x³ + 2, with derivative f′(x) = 4x³ − 9x². Result: x = 2.24996. True optimum: 2.25 = 9/4.
86
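The code behind this slide is not included in the transcript; here is a sketch of the same computation in Python (the slide used Julia). The starting point and learning rate are my choices, not values from the slide:

```python
# Gradient descent on f(x) = x^4 - 3 x^3 + 2 with f'(x) = 4 x^3 - 9 x^2,
# whose local minimum is x* = 9/4 = 2.25.
def fprime(x):
    return 4.0 * x ** 3 - 9.0 * x ** 2

x = 3.0          # starting point (assumed)
gamma = 0.01     # learning rate (assumed)
for _ in range(1000):
    x = x - gamma * fprime(x)
```

Near the optimum each step shrinks the error by the factor 1 − γ f″(9/4) ≈ 0.8, so a thousand iterations are far more than enough here.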
Automatic differentiation
Finite differences
∂F(x)/∂xi ≈ (F(x + h ei) − F(x)) / h, h small.

This is inefficient: it requires d + 1 evaluations of F : ℝᵈ → ℝ to compute the gradient ∇F(x). It is also only approximate, and h must be balanced between floating-point precision and mathematical precision.
87
Automatic differentiation
Task: compute the gradient of z = x1 x2 + sin(x1) at x1 = 2, x2 = 3. Static single assignment form (SSA):

x1 = 2
x2 = 3
w1 = x1 x2
w2 = sin(x1)
z = w1 + w2
88
SSA form in Julia code
89
Back propagation
Gradient of z = x1 x2 + sin(x1) at x1 = 2, x2 = 3. SSA form with local derivatives:

w1 = x1 x2 = 3 · 2,   ∂w1/∂x1 = 3, ∂w1/∂x2 = 2
w2 = sin(x1) = sin(2),   ∂w2/∂x1 = cos(2)
z = w1 + w2,   ∂z/∂w1 = 1, ∂z/∂w2 = 1

Gradient:

∂z/∂x1 = (∂z/∂w1)(∂w1/∂x1) + (∂z/∂w2)(∂w2/∂x1) = 3 + cos(2)

(combine all paths from x1 to z). Similarly for ∂z/∂x2.
90
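The forward and backward passes above, written out by hand in Python (the course uses Julia):

```python
import math

# Forward pass in SSA form for z = x1 * x2 + sin(x1) at x1 = 2, x2 = 3.
x1, x2 = 2.0, 3.0
w1 = x1 * x2
w2 = math.sin(x1)
z = w1 + w2

# Backward pass: propagate dz/d(..) from z back to the inputs, summing
# over all paths from each input to z (reverse-mode AD by hand).
dz_dw1, dz_dw2 = 1.0, 1.0
dz_dx1 = dz_dw1 * x2 + dz_dw2 * math.cos(x1)   # = 3 + cos(2)
dz_dx2 = dz_dw1 * x1                           # = 2
```

One backward sweep yields all partial derivatives at once, which is why reverse mode beats finite differences when d is large.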
Automatic derivatives in Julia
Automatic derivatives are exact:
91
Maximum likelihood method
If X1, . . . , Xn are independent observations from a density pθ(x) with parameter θ ∈ ℝᵈ, then the negative log-likelihood is

F(θ) = −ℓ(θ; x1, . . . , xn) = −log ∏_{i=1}^n pθ(xi) = −Σ_{i=1}^n log pθ(xi)

and θ̂ = argmin_θ F(θ) is the maximum likelihood estimator (assuming a unique global minimum). Estimators that arise as minimisers of sums are called M-estimators.
92
Regression with quadratic loss
If (y1, x1), . . . , (yn, xn) are data points,

θ̂ = argmin_θ Fn(θ) = argmin_θ (1/n) Σ_{i=1}^n L(θ; yi, xi)

with quadratic loss L(θ; y, x) = (y − f(x; θ))², e.g. with θ = (m, b) and f(x; θ) = mx + b.
93
Regularised maximised likelihood
If x1, . . . , xn are i.i.d. from a density pθ, θ ∈ ℝᵈ a parameter, target

θ̂ = argmin_θ Fn(θ) = argmin_θ (1/n) Σ_{i=1}^n L(θ; xi)

with L(θ; xi) = −log p(yi, xi; θ) + λ‖θ‖².
94
Estimate the gradient
Empirical loss over the data:

Fn(θ) = (1/n) Σ_{i=1}^n L(θ, xi)

A better objective is to minimise the expected loss:

F(θ) = E_X[L(θ, X)]

Use gradient descent and the law of large numbers (∇ is linear...):

∇F(θ) = E_X[∇L(θ, X)] = lim_{n→∞} (1/n) Σ_{i=1}^n ∇L(θ, xi)
95
Estimate the gradient
What if we estimate the gradient with just one sample? (In practice: just pick a random one.) Unbiased estimate of the gradient: E[∇L(θ, Xi)] = ∇F(θ). Noisy, but useful!
96
Stochastic gradient descent
In the maximum likelihood example the objective is of the form

F(x) = (1/n) Σ_{i=1}^n fi(x)

Often n is very large. Stochastic gradient descent moves along the gradient of a single random fi:

xj+1 = xj − γ∇fm(xj), m ∼ U({1, . . . , n})

Note that E[∇fm(x)] = ∇F(x): the iteration "moves on average in the right direction."
97
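The update rule above, as a Python sketch (the course uses Julia). The toy objective is my choice: the average of fi(θ) = (xi − θ)²/2 over the data, whose minimiser is the sample mean, so each step follows the gradient of one random term:

```python
import random

random.seed(2)

# SGD for F(theta) = (1/n) Σ_i (x_i - theta)^2 / 2, whose minimiser is
# the sample mean; each step uses the gradient of one random term f_i.
data = [random.gauss(4.0, 1.0) for _ in range(1000)]

theta, gamma = 0.0, 0.01
for _ in range(50_000):
    xi = random.choice(data)
    theta -= gamma * (theta - xi)   # gradient of (xi - theta)^2 / 2 in theta
```

With a fixed learning rate the iterate does not converge exactly but fluctuates around the minimiser, with fluctuation size controlled by γ.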
Batch gradient descent
xj+1 = xj − γ (1/K) Σ_{k=1}^K ∇fmk(xj), m1, . . . , mK ∼ U({1, . . . , n})

Again E[(1/K) Σ_{k=1}^K ∇fmk(xj)] = ∇F(xj). Less noisy, but also more expensive to compute...
98
Online stochastic gradient descent
Example: maximum likelihood estimation again. With an infinite stream of i.i.d. samples X1, . . . , Xj, . . . from pθ(x) with parameter θ ∈ ℝᵈ:

θj+1 = θj + γ∇ log pθj(Xj).

This doesn't require holding all data in memory.
99
AD software
Julia: http://www.juliadiff.org, Zygote.jl and others
Matlab: http://www.autodiff.org
Python: https://github.com/HIPS/autograd
R: possibly http://tobiasmadsen.com/2017/03/31/automatic differentiation in r/
100