

slide-1
SLIDE 1

Data assimilation and Markov Chain Monte Carlo techniques

Moritz Schauer, Gothenburg University, Sweden

National Institute of Marine Science and Technology, 2019

slide-2
SLIDE 2

Introduction

slide-3
SLIDE 3

Outline

  • Information about content and organisation of the course
  • What is MCMC?
  • Probability and Bayesian statistics refresher
  • Getting you started with the lab

1

slide-4
SLIDE 4

My interests

  • Statistical models for evolutionary change, or tracking the growth of plants
  • Data on past temperature in climate science

2

slide-5
SLIDE 5

Course content

  • 1. Introduction
  • 2. Probability and statistics refresher
  • 3. Markov chain
  • 4. Filtering
  • 5. Hierarchical Bayesian modelling
  • 6. Markov Chain Monte Carlo
  • 7. Stochastic gradient descent
  • 8. Automatic differentiation
  • 9. Synthesis and modern stochastic simulation

3

slide-6
SLIDE 6

Markov Chain Monte Carlo

  • Markov chain Monte Carlo methods: computational algorithms for generating samples from a probability distribution, say with density x ↦ p(x), for example a posterior density in Bayesian statistics.
  • The Metropolis-Hastings algorithm, from which many MCMC methods can be derived.

4

slide-7
SLIDE 7

Monte Carlo

Draw independent samples X1, X2, . . . with probability density p. By the law of large numbers,

(1/n) Σ_{i=1}^n h(Xi)  →  ∫ h(x) p(x) dx   almost surely.

5
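As a quick illustration, here is a minimal Monte Carlo sketch in Julia (a hypothetical example, not part of the course material): it estimates ∫ h(x) p(x) dx for h(x) = x² and p the standard normal density, whose true value is 1.

```julia
using Random

Random.seed!(1)
h(x) = x^2                     # test function with known expectation E[h(X)] = 1 for X ~ N(0, 1)
n = 10^6
xs = randn(n)                  # independent samples from the standard normal density p
estimate = sum(h, xs) / n      # (1/n) Σ h(Xᵢ), which converges to ∫ h(x) p(x) dx
println(estimate)              # ≈ 1.0
```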

slide-8
SLIDE 8

Markov chain Monte Carlo

Generate a Markov chain X1, X2, . . . with equilibrium density p. Under the conditions of the ergodic theorem, still

(1/n) Σ_{i=1}^n h(Xi)  →  ∫ h(x) p(x) dx   almost surely,

but the convergence becomes slower.

6

slide-9
SLIDE 9

Origin

Markov chain Monte Carlo (MCMC) was invented soon after ordinary Monte Carlo at Los Alamos, one of the few places where adequate computers were available at the time. Metropolis et al. (1953) invented their algorithm to simulate a liquid in equilibrium with its gas phase.1

1Their article Equation of State Calculations by Fast Computing Machines was

deemed important enough to have its own Wikipedia article.

7

slide-10
SLIDE 10

Metropolis-Hastings

Hastings2 generalized the work to the form now called the Metropolis-Hastings algorithm, the fundamental Markov chain Monte Carlo algorithm. This algorithm has completely changed the way Bayesian inference is done and where it is used. Now Markov chain Monte Carlo is the workhorse of Bayesian statistics.

2https://www.jstor.org/stable/2334940

8

slide-11
SLIDE 11

Probability and statistics refresher

slide-12
SLIDE 12

Outline

  • Probability, random variables
  • Bayesian inference
  • (Conditional) independence
  • Transformations of random variables

9

slide-13
SLIDE 13

Probability

  • Probability is a numerical measure of how likely an event is to happen.
  • Probability is a proportion, a number between 0 and 1.
  • Notation: P(something that can happen) = a probability. E.g. P(coin heads-up) = 1/2.

Figure from https://mathwithbaddrawings.com/2015/09/23/what-does-probability-mean-in-your-profession/.

10

slide-14
SLIDE 14

Example: Discrete probability distributions

To specify a discrete probability distribution one lists all possible outcomes and the probabilities with which they occur.

The probability distribution for a fair six-sided die:

Event:       ⚀    ⚁    ⚂    ⚃    ⚄    ⚅
Probability: 1/6  1/6  1/6  1/6  1/6  1/6

E.g. P({⚀, ⚁}) = 1/3.

11

slide-15
SLIDE 15

Random variables

When dealing with several related random experiments, it is convenient to work with random variables. Random variables are numeric quantities whose value depends on the outcome of a random event.

12

slide-16
SLIDE 16

Random variables

Definition: Random variables are maps, defined on a probability space Ω. Think of Ω as a large set containing all information needed to determine the outcome of experiments, but not specified in detail. Imagine nature picks a random state ω in the set Ω equipped with a probability measure P.

13

slide-17
SLIDE 17

Random variables

Definition: Random variables are maps, defined on a probability space Ω. Think of Ω as a large set containing all information needed to determine the outcome of experiments, but not specified in detail. Imagine nature picks a random state ω in the set Ω equipped with a probability measure P. A random variable X maps ω to a quantity of interest, e.g. the number of eyes a die shows, {1, 2, . . . , 6}.

Then PX(A) := P({ω: X(ω) ∈ A}) defines a measure, the distribution of X (the push forward).

13

slide-18
SLIDE 18

Example: Just one die

A random variable X taking values ⚀ to ⚅. If we are talking about X, we do not know the result of the experiment, only that P(X = ⚂) = 1/6. Can you make this formal?

14

slide-19
SLIDE 19

Coupled random variables

If several random variables are defined on the same underlying probability space, they are coupled. In this case we can talk about dependence and independence etc. Example?

15

slide-20
SLIDE 20

Conditional distributions

Conditional distributions describe how the probabilities of a random variable X change if Y = y is known/observed (both coupled). In the discrete case,

P(X = x | Y = y) = P({Y = y} ∩ {X = x}) / P(Y = y).

The probabilistic framework incorporates information (and missing information) naturally.

  • (Bayesian) probabilistic updating

16

slide-21
SLIDE 21

Models from random variables

Use random variables to model

  • measurement errors
  • statistical uncertainty
  • Bayesian inference
  • naturally random processes (sic!)

Distinct notions, but reconcilable.3

3A very Dutch topic: frequentist analysis of Bayesian methods.

17

slide-22
SLIDE 22

Examples

  • Unknown mass of the Higgs boson
  • Spam filter
  • Decay of atoms
  • The movement of molecules in a cubic meter of ocean

18

slide-23
SLIDE 23

Random variables with densities

Random variable X with density (probability density function, pdf) fX : R → [0, ∞). Probabilities are integrals:

P(X ∈ [a, b]) = ∫_[a,b] fX(x) dx.

19

slide-24
SLIDE 24

Discrete random variables

Random variable X with probability mass function pX(x) = P(X = x), pX : R → [0, ∞), where pX(x) = 0 except on a finite (or countable) set of atoms D for which pX(x) > 0, x ∈ D. Probabilities are sums:

P(X ∈ A) = Σ_{x ∈ A ∩ D} P(X = x).

20

slide-25
SLIDE 25

Mixed random variables

Random variable X with a (sub-)probability mass function with support D, pd : R → [0, ∞), and a (sub-)probability density function f : R → [0, ∞). Probabilities are computed as

P(X ∈ A) = ∫_A f(x) dx + Σ_{x ∈ A ∩ D} P(X = x)

(with Σ_{x ∈ D} P(X = x) + ∫ f(x) dx = 1).

21

slide-26
SLIDE 26

Example: mixed random variable

Let Z ∼ N(0, 1). Set

X = { Z  with probability p,
    { 0  otherwise

(independent of Z). What is...
P(X = 0) = ?
P(X ≥ 0) = ?

22

slide-27
SLIDE 27

Example: mixed random variable

Let Z ∼ N(0, 1). Set

X = { Z  with probability p,
    { 0  otherwise

(independent of Z). What is...
P(X = 0) = 1 − p
P(X ≥ 0) = (1 − p) + p ∫_0^∞ φ(x) dx = (1 − p) + p/2

22

slide-28
SLIDE 28

Definition of a probability measure

Start with a set of events E on a space Ω (a σ-algebra). A probability measure is a map P : E → [0, 1] with P(Ω) = 1 and

P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)   for disjoint events Ai ∈ E.

  → Kolmogorov's axioms

23

slide-29
SLIDE 29

Joint and marginal densities

Random variables X, Y with joint bivariate density fX,Y : R × R → [0, ∞). Marginal densities:

fX(x) = ∫ fX,Y(x, y) dy,    fY(y) = ∫ fX,Y(x, y) dx

24

slide-30
SLIDE 30

Joint and marginal densities

Example: bivariate normal with correlation ρ, means µX, µY and variances σ²X, σ²Y.

fX,Y(x, y) = 1 / (2π σX σY √(1 − ρ²)) · exp( −z / (2(1 − ρ²)) ),

z = (x − µX)²/σ²X − 2ρ(x − µX)(y − µY)/(σX σY) + (y − µY)²/σ²Y.

25

slide-31
SLIDE 31

Conditional densities

The density of X given Y = y (i.e. the density of the conditional distribution of X after observing that Y took the value y):

fX|Y=y(x) = fX,Y(x, y) / ∫ fX,Y(x′, y) dx′ = C_y⁻¹ fX,Y(x, y)

(C_y = ∫ fX,Y(x′, y) dx′ is the normalising constant, depending on y.)

26

slide-32
SLIDE 32

Bayes formula

The Bayesian formalism equips unknown parameters with a prior distribution. Then the posterior distribution over the parameters given data is just the conditional distribution:

p(θ)       prior density of the parameter θ
f(y | θ)   likelihood (density of observation y given θ)
p(θ | y)   posterior density of the parameter θ

p(θ | y) = f(y | θ) p(θ) / ∫ f(y | θ′) p(θ′) dθ′

27
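To make the formula concrete, here is a minimal sketch in Julia (a hypothetical example, not from the course material) that evaluates the posterior on a grid for a standard normal prior and a single normal observation y = 1.2 with unit variance; the analytic posterior mean is y/2.

```julia
# Standard normal prior p(θ) and one observation y = 1.2 with unit-variance normal likelihood.
prior(θ)      = exp(-θ^2 / 2) / sqrt(2π)
likelihood(θ) = exp(-(1.2 - θ)^2 / 2) / sqrt(2π)

θs = range(-5, 5, length = 1001)
Δ  = step(θs)
unnorm    = likelihood.(θs) .* prior.(θs)     # f(y | θ) p(θ) on the grid
posterior = unnorm ./ (sum(unnorm) * Δ)       # divide by ∫ f(y | θ′) p(θ′) dθ′
println(sum(θs .* posterior) * Δ)             # posterior mean, ≈ analytic value y/2 = 0.6
```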

slide-33
SLIDE 33

Bayesian convention

Often the subscripts of the densities are dropped. Random variable X with density fX, random variable Y with density fY; X, Y have joint bivariate density fX,Y. Denote all densities by p and distinguish them by their arguments. For example, the formula for the marginal density becomes

p(x) = ∫ p(x, y) dy.

28

slide-34
SLIDE 34

Bayesian convention

Often the subscripts of the densities are dropped. Random variable X with density fX, random variable Y with density fY; X, Y have joint bivariate density fX,Y. Denote all densities by p and distinguish them by their arguments. For example, the formula for the marginal density becomes

p(x) = ∫ p(x, y) dy.

The conditional density of X given Y = y becomes

p(x | y) = p(x, y) / ∫ p(x′, y) dx′.

28

slide-35
SLIDE 35

Radon-Nikodym theorem

For probability measures P and Q, "P(A) = 0 ⇒ Q(A) = 0 for all A" is equivalent to the existence of a density dQ/dP with

Q(A) = ∫_A (dQ/dP)(ω) dP(ω)   for all A.

29

slide-36
SLIDE 36

Transformations of random variables

More tools in your toolbox for working with random variables: Transformation by function h: Y = h(X)

30

slide-37
SLIDE 37

Transformations of random variables

More tools in your toolbox for working with random variables: Transformation by function h: Y = h(X) Distribution: P(Y ∈ A) = P(X ∈ h−1(A)).

30

slide-38
SLIDE 38

Transformations of random variables

More tools in your toolbox for working with random variables: Transformation by function h: Y = h(X) Distribution: P(Y ∈ A) = P(X ∈ h−1(A)). Simulation: Generate X, apply h.

30
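A minimal simulation sketch in Julia (a hypothetical example, not from the slides): generate X and apply h, here with X ∼ U[0, 1] and h(x) = −log(x), so that Y = h(X) is Exponential(1); the empirical probability is compared against P(Y ∈ [0, 1]) = P(X ∈ h⁻¹([0, 1])).

```julia
using Random

Random.seed!(2)
h(x) = -log(x)                          # transformation; Y = h(X) is Exponential(1) for X ~ U[0, 1]
xs = rand(10^6)                         # generate X
ys = h.(xs)                             # apply h to simulate Y
# P(Y ∈ [0, 1]) = P(X ∈ h⁻¹([0, 1])) = P(X ≥ exp(-1)) = 1 - exp(-1) ≈ 0.632
println(count(y -> y <= 1, ys) / length(ys))
```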

slide-39
SLIDE 39

Example: Sum of the eyes of two dice

For example, consider two independent throws of six-sided dice.

31

slide-40
SLIDE 40

Sums and products of random variables

In some cases the distribution of sums and products of random variables is known in closed form. Examples: the sum of jointly normal random variables is normal; the ratio of a centred normal X and the root of an independent (suitably scaled) chi-squared variable V is t-distributed.

32

slide-41
SLIDE 41

Example: Differential equation with uncertain parameter

Example: deterministic modelling + uncertainty. Differential equation

du(t)/dt = −Θ u(t)

33

slide-42
SLIDE 42

Example: Differential equation with uncertain parameter

Example: deterministic modelling + uncertainty. Differential equation

du(t)/dt = −Θ u(t)

with uncertainty about the parameter Θ and the starting value u(0) = U, quantified by U ∼ N(5, 3), Θ ∼ N(0.5, 0.1).

33

slide-43
SLIDE 43

Example: Differential equation

Model is of functional form u(t) = ft(Θ, U) with ft(θ, u) = exp(−θt)u the solution of the ODE.

34

slide-44
SLIDE 44

Example: Differential equation

Model is of functional form u(t) = ft(Θ, U) with ft(θ, u) = exp(−θt)u the solution of the ODE. Explore model through simulations https: //nextjournal.com/mschauer/probabilistic-programming

34
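A minimal simulation sketch in Julia (hypothetical; it assumes, for illustration only, that the second parameters of N(5, 3) and N(0.5, 0.1) are standard deviations): draw (Θ, U), evaluate u(t) = exp(−Θt)U, and summarise the induced uncertainty.

```julia
using Random, Statistics

Random.seed!(3)
f(t, θ, u) = exp(-θ * t) * u                     # solution u(t) of du/dt = -θ u with u(0) = u
t = 2.0
# Draw (Θ, U); the spread parameters are treated as standard deviations here (an assumption).
samples = [f(t, 0.5 + 0.1 * randn(), 5 + 3 * randn()) for _ in 1:10^5]
println((mean(samples), std(samples)))           # Monte Carlo summary of the uncertainty in u(t)
```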

slide-45
SLIDE 45

Exact computations

Example: Sum of uniform random variables Z = X1 + X2, where X1, X2 ∼ U[0, 1] independent.

35

slide-46
SLIDE 46

Exact computations

Example: sum of uniform random variables Z = X1 + X2, where X1, X2 ∼ U[0, 1] independent. Z has a triangular distribution with density

fZ(x) = { x       for 0 ≤ x ≤ 1,
        { 2 − x   for 1 ≤ x ≤ 2.

35

slide-47
SLIDE 47

Working with mean and variance

Sometimes it is sufficient to determine the mean and variance of the result of operations on random variables.

For sums of r.v. X and Y: E[X + Y] = EX + EY.
Var(aX + Y) = a² Var(X) + Var(Y) if X, Y are independent.

36

slide-48
SLIDE 48

Working with mean and variance

Sometimes it is sufficient to determine the mean and variance of the result of operations on random variables.

For sums of r.v. X and Y: E[X + Y] = EX + EY.
Var(aX + Y) = a² Var(X) + Var(Y) if X, Y are independent.

In the situation of the central limit theorem, a sum of random variables can be approximated by a normal with the right mean and variance.

36

slide-49
SLIDE 49

Error propagation

Independent random variables X1, . . . , Xn with mean 10 and std. deviation 0.3 = √0.09. Z = Σ_{i=1}^n Xi.

Z is approximately N(10n, 0.09n) distributed:

Z = 10n (mean) ± 0.3√n (std. dev.)

37

slide-50
SLIDE 50

Error propagation

Independent random variables X1, . . . , Xn with mean 10 and std. deviation 0.3 = √0.09. Z = Σ_{i=1}^n Xi.

Z is approximately N(10n, 0.09n) distributed:

Z = 10n (mean) ± 0.3√n (std. dev.)

Very different from how the engineer argues: X1, . . . , Xn are quantities approximately in the interval [9.7, 10.3], so Z ∈ [9.7n, 10.3n], or

Z = 10n ± 0.3n.

37
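A minimal simulation sketch in Julia (a hypothetical example; it assumes normal Xᵢ for convenience) contrasting the probabilistic ±0.3√n behaviour with the worst-case ±0.3n bound.

```julia
using Random, Statistics

Random.seed!(4)
n = 100
Z() = sum(10 .+ 0.3 .* randn(n))       # Z = Σ Xᵢ with mean 10 and std. deviation 0.3 each
zs = [Z() for _ in 1:10^5]
println(std(zs))                       # ≈ 0.3 * sqrt(n) = 3, far below the worst case 0.3n = 30
```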

slide-51
SLIDE 51

Julia programming language

I have prepared some things using the programming language Julia. You can follow the course and lab using a programming language you are familiar with, but you might enjoy giving Julia a try. There are many online resources at julialang.org/learning/.

38

slide-52
SLIDE 52

Setup

  • Install Julia: download Julia 1.1 for your operating system from julialang.org, or
  • make a free account on https://nextjournal.com/. Nextjournal is an online notebook environment running Julia (you need a working internet connection to use it).
  • In any case, download https://www.dropbox.com/s/yvwgz71zgvkj7sg/icecoredata.csv and read https://nextjournal.com/mschauer/bayesian-filter for the lab.

39

slide-53
SLIDE 53

Markov chain

slide-54
SLIDE 54

Simple Markov chain

Recursively, let Xi = h(Xi−1, Zi), i > 0 and X0 ∼ g using independent identically distributed innovation or noise random variables Z1, . . . , Zn, . . . .

40

slide-55
SLIDE 55

Example: Autoregressive chain

Autoregressive chain: take η ∈ (0, 1), h(x, z) = ηx + z and Z1, . . . , Zn i.i.d. ∼ N(0, 1). So

Xi = ηXi−1 + Zi,   i > 0.

41

slide-56
SLIDE 56

Example: Autoregressive chain

Then Var(Xi) = Var(ηXi−1 + Zi) = η² Var(Xi−1) + 1.

42

slide-57
SLIDE 57

Example: Autoregressive chain

Then Var(Xi) = Var(ηXi−1 + Zi) = η² Var(Xi−1) + 1. Assume that X0 is normal with mean zero. Then all Xi are equally normal with mean zero. Setting Var(Xi) ≡ σ²

42

slide-58
SLIDE 58

Example: Autoregressive chain

Then Var(Xi) = Var(ηXi−1 + Zi) = η² Var(Xi−1) + 1. Assume that X0 is normal with mean zero. Then all Xi are equally normal with mean zero. Setting Var(Xi) ≡ σ² and solving σ² = η²σ² + 1 gives σ² = 1/(1 − η²). {X1, X2, . . . } is a Markov chain with stationary distribution Xi ∼ N(0, 1/(1 − η²)).

42
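A minimal sketch in Julia (a hypothetical example): simulate the autoregressive chain and compare the empirical variance with the stationary value 1/(1 − η²).

```julia
using Random, Statistics

Random.seed!(5)
η = 0.8
n = 10^6
x = zeros(n)
for i in 2:n
    x[i] = η * x[i-1] + randn()        # Xᵢ = ηXᵢ₋₁ + Zᵢ with Zᵢ ~ N(0, 1)
end
println((var(x), 1 / (1 - η^2)))       # empirical variance vs stationary value, both ≈ 2.78
```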

slide-59
SLIDE 59

Filtering

slide-60
SLIDE 60

Smoothing/Filtering

Start with the joint density factorised as f(x, y) = f(y | x) f(x). This is convenient if, e.g., x is a parameter with a prior and the density of Y given X = x is determined by the model. Y is observed.

43

slide-61
SLIDE 61

Smoothing/Filtering

Start with the joint density factorised as f(x, y) = f(y | x) f(x). This is convenient if, e.g., x is a parameter with a prior and the density of Y given X = x is determined by the model. Y is observed. Then, as before,

f(x | y) = f(y | x) f(x) / ∫ f(y | x′) f(x′) dx′ = f(x, y) / ∫ f(x′, y) dx′.

43

slide-62
SLIDE 62

Random walk filter example for the lab

See https://nextjournal.com/mschauer/bayesian-filter for example and details.

44

slide-63
SLIDE 63

Random walk filter example for the lab

The North Greenland Ice Core Project obtained a core sample of ice by drilling through the Arctic ice sheet. Scientists are interested in a quantity Xt (the past δ18O values) which changes over time, written as t = 1, . . . , n.

45

slide-64
SLIDE 64

Measurements

They have produced a sequence of n measurements Y1, . . . , Yn and, as the measurement errors stem from a variety of unrelated causes, model the measurement errors εt at time t,

Yt = Xt + εt,

as independent random variables εt with Gaussian distribution N(0, σ²), with variance σ², for each t = 1, . . . , n.

46

slide-65
SLIDE 65

Model

Model

Xt+1 = Xt + ηt,   for t = 1, . . . , n − 1,

where the ηt are independent normally distributed N(0, s²) with mean zero and variance s², for t = 1, . . . , n − 1.

47

slide-66
SLIDE 66

Filter

The filtered distribution of Xt is the conditional distribution of Xt at time t given observations Y1 = y1, Y2 = y2 up to and including Yt = yt. It captures what we know about Xt if we have seen the measurement points y1, y2, . . . , yt.

48

slide-67
SLIDE 67

Kalman filtering equations

For a random walk model with independent observation errors as above, the filtered distribution of Xt is a normal distribution, and its mean µt and variance p²t can be computed recursively using the filtering equations

p²t = σ² Kt
µt = µt−1 + Kt (yt − µt−1)

with Kt the so-called Kalman gain

Kt = (s² + p²_{t−1}) / (s² + p²_{t−1} + σ²),   t > 1,   and   K1 = p²0 / (p²0 + σ²).

49

slide-68
SLIDE 68

Kalman filtering equations

The mean of the filtered distribution, µt, serves as an estimate of the unknown location of Xt, and the standard deviation pt can be seen as a measure of uncertainty about that location.

50
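A minimal sketch of these recursions in Julia (a hypothetical implementation; the noise variances, the prior mean and variance, and the synthetic data below are illustrative assumptions, not the values used in the lab).

```julia
using Random

# Random-walk Kalman filter following the slide's recursions. The noise variances σ2, s2,
# the prior mean μ0 and prior variance p20, and the synthetic data are illustrative assumptions.
function filter_random_walk(ys; σ2 = 0.5, s2 = 0.1, μ0 = 0.0, p20 = 10.0)
    μ, p2 = μ0, p20
    μs, p2s = Float64[], Float64[]
    for (t, y) in enumerate(ys)
        pred = t == 1 ? p2 : s2 + p2    # predicted variance: p²₀ at t = 1, s² + p²ₜ₋₁ afterwards
        K  = pred / (pred + σ2)         # Kalman gain Kₜ
        μ  = μ + K * (y - μ)            # μₜ = μₜ₋₁ + Kₜ (yₜ − μₜ₋₁)
        p2 = σ2 * K                     # p²ₜ = σ² Kₜ
        push!(μs, μ); push!(p2s, p2)
    end
    return μs, p2s
end

Random.seed!(6)
x  = cumsum(sqrt(0.1) .* randn(200))    # latent random walk, innovation variance s² = 0.1
ys = x .+ sqrt(0.5) .* randn(200)       # noisy observations, error variance σ² = 0.5
μs, p2s = filter_random_walk(ys)        # μs tracks x; sqrt.(p2s) quantifies the uncertainty
```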

slide-69
SLIDE 69

Full Kalman filter

The more general Kalman filter for multivariate linear models

xk = Φ xk−1 + b + wk,   wk ∼ N(0, Q)
yk = H xk + vk,   vk ∼ N(0, R)

proceeds similarly.

51

slide-70
SLIDE 70

Hierarchical Bayesian modelling

slide-71
SLIDE 71

Outline

  • Hierarchical probabilistic model
  • Bayesian inference
  • Monte Carlo methods
  • Metropolis–Hastings

52

slide-72
SLIDE 72

Hierarchical models

Use hierarchical models to deal with

  • dependency between variables
  • indirect observations and missing data
  • smoothing/filtering
  • Generally: If the data generating process is somewhat understood

53

slide-73
SLIDE 73

A dilemma when modelling groups

Pool all groups together:
  • Ignores latent differences.
  • Observations are not independent.

Model each group separately:
  • Small sample sizes problematic.
  • Many more parameters to estimate.
  • Ignores latent similarity.

Hierarchical modelling shows a way out...

54

slide-74
SLIDE 74

Example: Group means model

Latent group means

bi ∼ N(µ, σ²b)

with density φi, for m groups. Specimen j from group i:

yij ∼ N(bi, σ²y)

with density φij(· − bi). Factorisation of the joint density:

∏_{i∈I} φi(bi) ∏_{j∈Ji} φij(yij − bi)

55

slide-75
SLIDE 75

Example: Group means model

Latent group means

bi ∼ N(µ, σ²b)

with density φi, for m groups. Specimen j from group i:

yij ∼ N(bi, σ²y)

with density φij(· − bi). Factorisation of the joint density:

∏_{i∈I} φi(bi) ∏_{j∈Ji} φij(yij − bi)

Only the specimens are observed; the density of the observations (yij) is obtained by integrating out b1, . . . , bm:

p((yij)) = ∫ ∏_{i∈I} φi(bi) ∏_{j∈Ji} φij(yij − bi) db1 · · · dbm

55

slide-76
SLIDE 76

Bayesian net

This is an example of a Bayesian net. The variables (nodes) together with the parent operator pa form a DAG, a directed acyclic graph. In general, the joint density in a Bayesian net factorises in the form

p(x1, . . . , xn) = ∏_{i=1}^n p(xi | pa(xi))

56

slide-77
SLIDE 77

Bayesian net, factor graph representation

p(x1, . . . , xd) = ∏_{α∈F} fα(xα),

where α ⊂ {1, . . . , d} and xα is the tuple of variables (xi)_{i∈α}.

57

slide-78
SLIDE 78

Bayes’ formula for hierarchical models

58

slide-79
SLIDE 79

Markov Chain Monte Carlo

slide-80
SLIDE 80

What is on the menu

  • Markov chain Monte Carlo methods: computational algorithms for generating samples from a probability distribution, say with density x ↦ p(x), for example a posterior density in Bayesian statistics.
  • The Metropolis-Hastings algorithm, from which many MCMC methods can be derived.

59

slide-81
SLIDE 81

Recap: Markov chain

A Markov chain

  • is a sequence of random variables X1, X2, . . . (with values in some

state space E)

  • with the Markov property:

P(Xi ∈ B | Xi−1 = xi−1, . . . , X1 = x1) = P(Xi ∈ B | Xi−1 = xi−1).

60

slide-82
SLIDE 82

Recap: Markov chain

A Markov chain

  • is a sequence of random variables X1, X2, . . . (with values in some

state space E)

  • with the Markov property:

P(Xi ∈ B | Xi−1 = xi−1, . . . , X1 = x1) = P(Xi ∈ B | Xi−1 = xi−1).

The transition kernel Q(x; B) = P(Xi ∈ B | Xi−1 = x) of a time-homogeneous Markov chain fully describes the evolution of the chain. Denote its density in x′ by q(x; x′).

60

slide-83
SLIDE 83

Markov chain Monte Carlo

Generate a Markov chain X1, X2, . . . with equilibrium density p. Under the conditions of the ergodic theorem, still

(1/n) Σ_{i=1}^n f(Xi)  →  ∫ f(x) p(x) dx   almost surely,

but the convergence becomes slower.

61

slide-84
SLIDE 84

Bayesian inference

A typical application of MCMC is in Bayesian inference. The posterior density π(· | y) is proportional to pθ(y)π(θ) (y is an observation with density pθ(y) and π is a prior distribution of the parameter θ). Statistical questions should have answers of the form

∫ f(θ) dπ(θ | y).

62

slide-85
SLIDE 85

Expectations

Statistical questions should have answers of the form

∫ f(θ) dπ(θ | y).

For example, posterior mean and variance or probabilities arise this way. “Statistical” questions which cannot be posed as expectations of the posterior are very problematic. What would be interesting choices of f?

63

slide-86
SLIDE 86

Expectations

Statistical questions should have answers of the form

∫ f(θ) dπ(θ | y).

For example, posterior mean and variance or probabilities arise this way. “Statistical” questions which cannot be posed as expectations of the posterior are very problematic. What would be interesting choices of f?

Probabilities are expectations of indicators, and the median is the minimiser of expected deviation.

63

slide-87
SLIDE 87

Probabilistic programming

Hierarchical Bayesian model

s² ∼ InverseGamma(2, 3)
m ∼ N(0, s²)
x ∼ N(m, s²)
y ∼ N(m, s²)

What is the posterior distribution of the parameters s and m given observations x = 1.5, y = 2?

64

slide-88
SLIDE 88

Probabilistic programming

Short illustration using the Turing.jl package in Julia.

https://nextjournal.com/mschauer/probabilistic-programming

65
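For reference, a minimal sketch of this model in Turing.jl (an assumption based on Turing's standard modelling syntax, not copied from the linked notebook; the sampler choice and chain length are illustrative):

```julia
using Turing

@model function gdemo(x, y)
    s2 ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s2))
    x ~ Normal(m, sqrt(s2))
    y ~ Normal(m, sqrt(s2))
end

# Metropolis-Hastings, as discussed in this course; NUTS or HMC would also work here.
chain = sample(gdemo(1.5, 2.0), MH(), 10_000)
display(chain)
```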

slide-89
SLIDE 89

Probabilistic programming

Generated Markov chain.

66

slide-90
SLIDE 90

Probabilistic programming

Posterior density estimates.

67

slide-91
SLIDE 91

Markov chains and detailed balance

The chain is in detailed balance if there is a density p such that p(x)q(x; x′) = p(x′)q(x′; x), i.e. symmetry of the joint distribution of two iterates X, X′ when starting the chain in X ∼ p. This implies that p is a stationary density:

p(x) = ∫ p(x)q(x; x′) dx′ = ∫ p(x′)q(x′; x) dx′   (the density after one step).

(Detailed balance is a sufficient condition.)

68

slide-92
SLIDE 92

Metropolis-Hastings algorithm

The starting point is a Markov chain with some transition density q(x; x′). For example, a symmetric random walk,

q(x; x′) = C exp(−‖x − x′‖² / (2σ²mh)) = q(x′; x).

69

slide-93
SLIDE 93

How to obtain a chain with stationary density p

Define a new chain, for which detailed balance holds for p by modifying the transition kernel.

70

slide-94
SLIDE 94

Metropolis-Hastings algorithm

Define the Metropolis-Hastings acceptance probability

α(x; x′) = min( p(x′)q(x′; x) / (p(x)q(x; x′)), 1 ).

71

slide-95
SLIDE 95

Metropolis-Hastings algorithm

Define the Metropolis-Hastings acceptance probability

α(x; x′) = min( p(x′)q(x′; x) / (p(x)q(x; x′)), 1 ).

Starting from X1, we recursively...

  • Given Xi, draw a random proposal X◦_{i+1} from the density q(Xi; ·).
  • With probability α(Xi; X◦_{i+1}) accept and move: Xi+1 = X◦_{i+1}; else reject and stay: Xi+1 = Xi.

71
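A minimal sketch of this algorithm in Julia (a hypothetical example; the target density, step size and chain length are illustrative, and the q terms cancel because the random-walk proposal is symmetric).

```julia
using Random

Random.seed!(7)
p(x) = exp(-abs(x)^3 / 3)                 # unnormalised target density (illustrative choice)

function metropolis_hastings(p, x1; n = 10^5, σ = 1.0)
    xs = Vector{Float64}(undef, n)
    xs[1] = x1
    for i in 1:n-1
        x = xs[i]
        xprop = x + σ * randn()           # symmetric random-walk proposal, q(x; x′) = q(x′; x)
        α = min(p(xprop) / p(x), 1.0)     # acceptance probability; the q terms cancel by symmetry
        xs[i+1] = rand() < α ? xprop : x  # accept and move, or reject and stay
    end
    return xs
end

chain = metropolis_hastings(p, 0.0)       # samples approximately distributed according to p
```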

slide-96
SLIDE 96

Detailed balance

Proposition: This creates a Markov chain with a new Markov kernel QMH(x, ·) such that detailed balance holds for p.

72

slide-97
SLIDE 97

Detailed balance

Proposition: This creates a Markov chain with a new Markov kernel QMH(x, ·) such that detailed balance holds for p. That means that the joint law p(x)QMH(x, dx′)dx is symmetric: p(x)QMH(x; dx′)dx = p(x′)QMH(x′; dx)dx′.

72

slide-98
SLIDE 98

Detailed balance

Proposition: This creates a Markov chain with a new Markov kernel QMH(x, ·) such that detailed balance holds for p. That means that the joint law p(x)QMH(x, dx′)dx is symmetric:

∫_A ∫_B p(x)QMH(x; dx′)dx = ∫_B ∫_A p(x′)QMH(x′; dx)dx′   ∀ A, B

72

slide-99
SLIDE 99

Proof: α(x; x′) = min( p(x′)q(x′; x) / (p(x)q(x; x′)), 1 ).

For any x, x′, at least one of α(x; x′), α(x′; x) equals 1.

73

slide-100
SLIDE 100

Proof: To show symmetry, take x ≠ x′ with α(x; x′) < 1 (otherwise switch x, x′ below). Then

p(x)QMH(x, dx′)dx = p(x)q(x; x′)α(x; x′) dx dx′ = p(x)q(x; x′) · [p(x′)q(x′; x) / (p(x)q(x; x′))] dx dx′

73

slide-101
SLIDE 101

Proof: To show symmetry, take x ≠ x′ with α(x; x′) < 1 (otherwise switch x, x′ below). Then

p(x)QMH(x, dx′)dx = p(x)q(x; x′)α(x; x′) dx dx′ = p(x)q(x; x′) · [p(x′)q(x′; x) / (p(x)q(x; x′))] dx dx′
= p(x′)q(x′; x) · 1 dx dx′   (with 1 = α(x′; x))
= p(x′)QMH(x′, dx)dx′.

Therefore detailed balance holds and p is a stationary density of the chain.

73

slide-102
SLIDE 102

Conditional sampling / Gibbs sampling

Coordinatewise MH: to sample from p(x1, x2, . . . , xd), repeat for each coordinate:

  • Propose a new value x′i for xi from the transition density qi(x1, . . . , xd; ·) and accept with

αi = p(x1, . . . , x′i, . . . , xd) qi(x1, . . . , x′i, . . . , xd; xi) / [ p(x1, . . . , xi, . . . , xd) qi(x1, . . . , xi, . . . , xd; x′i) ].

74

slide-103
SLIDE 103

Conditional sampling / Gibbs sampling

Special case: the Gibbs sampler, where one samples from the respective conditional densities

qi(x1, . . . , xd; xi) = p(xi | x1, . . . , xi−1, xi+1, . . . , xd).

Then all proposals are accepted (αi = 1).

75
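A minimal sketch in Julia (a hypothetical example, anticipating the bivariate-Gaussian exploration on the next slides): a Gibbs sampler that alternates draws from the known conditionals of a standard bivariate Gaussian with correlation ρ.

```julia
using Random

Random.seed!(8)
ρ = 0.9
n = 10^4
x = zeros(n, 2)
for i in 2:n
    # Conditionals of a standard bivariate Gaussian with correlation ρ: xⱼ | x₋ⱼ ~ N(ρ x₋ⱼ, 1 − ρ²).
    x1 = ρ * x[i-1, 2] + sqrt(1 - ρ^2) * randn()   # draw x₁ given the current x₂ (always accepted)
    x2 = ρ * x1 + sqrt(1 - ρ^2) * randn()          # draw x₂ given the new x₁
    x[i, :] = [x1, x2]
end
# The rows of x explore the bivariate Gaussian; their empirical correlation is ≈ ρ.
```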

slide-104
SLIDE 104

Exploring a bivariate Gaussian

76

slide-105
SLIDE 105

Exploring a bivariate Gaussian

77

slide-106
SLIDE 106

Data augmentation

A typical situation: there is a natural Bayesian model for the internal state of a system, (X, θ) ∼ π(x, θ) = fθ(x)π(θ). But the state is only observed indirectly and with error, Y = h(X) + ε with h: Rm → Rn, n < m, and observation error ε. Denote the density of Y given X = x by fx. Use Gibbs or coordinatewise MH with coordinates θ and x to sample from

π(x, θ | Y = y) ∝ fx(y) fθ(x) π(θ).

78

slide-107
SLIDE 107

Data augmentation

Iteratively:

Imputation: propose a new value for x, x′ ∼ qx(x, θ; ·), and accept with

αx = π(x′, θ | Y = y) qx(x′, θ; x) / [ π(x, θ | Y = y) qx(x, θ; x′) ].

Inference: propose a new value for θ, θ′ ∼ qθ(x, θ; ·), and accept with

αθ = π(x, θ′ | Y = y) qθ(x, θ′; θ) / [ π(x, θ | Y = y) qθ(x, θ; θ′) ].

79

slide-108
SLIDE 108

Earlier example

Hierarchical Bayesian model

s² ∼ InverseGamma(2, 3)
m ∼ N(0, s²)
x ∼ N(m, s²)
y ∼ N(m, s²)

80

slide-109
SLIDE 109

Stochastic gradient descent

slide-110
SLIDE 110

Outline

  • Gradient descent
  • Automatic differentiation / back propagation
  • Stochastic gradient descent
  • Applications in estimation, regression, minimising a loss function

81

slide-111
SLIDE 111

Derivatives (notation and recap)

Multiple chain rule:

∂/∂x z(w(v(x))) = (∂z/∂w) (∂w/∂v) (∂v/∂x)

82

slide-112
SLIDE 112

Derivatives (notation and recap)

Multiple chain rule:

∂/∂x z(w(v(x))) = (∂z/∂w) (∂w/∂v) (∂v/∂x)

Total derivative:

∂/∂x z(w1(x), . . . , wk(x)) = (∂z/∂w1)(∂w1/∂x) + · · · + (∂z/∂wk)(∂wk/∂x)

82

slide-113
SLIDE 113

Gradients and derivatives

Differentiable function F : Rd → R. The gradient is the vector of partial derivatives

∇F(x) = ( ∂F(x)/∂x1, . . . , ∂F(x)/∂xd )ᵀ.

83

slide-114
SLIDE 114

Gradients and derivatives

Differentiable function F : Rd → R. The gradient is the vector of partial derivatives

∇F(x) = ( ∂F(x)/∂x1, . . . , ∂F(x)/∂xd )ᵀ.

The vector −∇F(x) points in the direction of steepest descent.

83

slide-115
SLIDE 115

Gradient descent

Objective: Get close to a (local) minimum x⋆ of F(x). xi+1 = xi − γ∇F(xi). Move in direction of steepest descent. γ is called the learning rate.

84

slide-116
SLIDE 116

Learning rate

γ too small: moving slowly with small steps. γ too large: overshooting.

85

slide-117
SLIDE 117

Julia example

f(x) = x⁴ − 3x³ + 2, with derivative f′(x) = 4x³ − 9x². Result: x = 2.24996. True optimum: 2.25 = 9/4.

86
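The original code slide is not reproduced here; the following is a minimal reconstruction in Julia under stated assumptions (the starting point, learning rate and iteration count are illustrative, not the slide's values).

```julia
df(x) = 4x^3 - 9x^2                     # derivative of f(x) = x^4 - 3x^3 + 2

function gradient_descent(df, x; γ = 0.01, iters = 10_000)
    for _ in 1:iters
        x -= γ * df(x)                  # move in the direction of steepest descent
    end
    return x
end

println(gradient_descent(df, 3.0))      # ≈ 2.25 = 9/4
```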

slide-118
SLIDE 118

Automatic differentiation

slide-119
SLIDE 119

Finite differences

∂F(x)/∂xi ≈ ( F(x + h ei) − F(x) ) / h,   h small.

This is inefficient: it requires d additional evaluations of F : Rd → R to compute the gradient ∇F(x).

87

slide-120
SLIDE 120

Finite differences

∂F(x)/∂xi ≈ ( F(x + h ei) − F(x) ) / h,   h small.

This is inefficient: it requires d additional evaluations of F : Rd → R to compute the gradient ∇F(x). It is also only approximate, and h must be balanced between floating-point precision and mathematical precision.

87

slide-121
SLIDE 121

Automatic differentiation

Task: compute the gradient of z = x1x2 + sin(x1) at x1 = 2, x2 = 3.

88

slide-122
SLIDE 122

Automatic differentiation

Task: compute the gradient of z = x1x2 + sin(x1) at x1 = 2, x2 = 3. Static single assignment form (SSA):

x1 = 2
x2 = 3
w1 = x1x2
w2 = sin(x1)
z = w1 + w2

88

slide-123
SLIDE 123

SSA form in Julia code

89

slide-124
SLIDE 124

Back propagation

Gradient of z = x1x2 + sin(x1) at x1 = 2, x2 = 3. Static single assignment form (SSA):

w1 = x1x2 = 3 · 2
w2 = sin(x1) = sin(2)
z = w1 + w2

90

slide-125
SLIDE 125

Back propagation

Gradient of z = x1x2 + sin(x1) at x1 = 2, x2 = 3. Static single assignment form (SSA):

w1 = x1x2 = 3 · 2        ∂w1/∂x1 = 3,  ∂w1/∂x2 = 2
w2 = sin(x1) = sin(2)    ∂w2/∂x1 = cos(2)
z = w1 + w2              ∂z/∂w1 = 1,  ∂z/∂w2 = 1

90

slide-126
SLIDE 126

Back propagation

Gradient of z = x1x2 + sin(x1) at x1 = 2, x2 = 3. Static single assignment form (SSA):

w1 = x1x2 = 3 · 2        ∂w1/∂x1 = 3,  ∂w1/∂x2 = 2
w2 = sin(x1) = sin(2)    ∂w2/∂x1 = cos(2)
z = w1 + w2              ∂z/∂w1 = 1,  ∂z/∂w2 = 1

Gradient: ∂z/∂x1 = (∂z/∂w1)(∂w1/∂x1) + (∂z/∂w2)(∂w2/∂x1) = 3 + cos(2)

(Combine all paths from x1 to z.)

90

slide-127
SLIDE 127

Back propagation

Gradient of z = x1x2 + sin(x1) at x1 = 2, x2 = 3. Static single assignment form (SSA):

w1 = x1x2 = 3 · 2        ∂w1/∂x1 = 3,  ∂w1/∂x2 = 2
w2 = sin(x1) = sin(2)    ∂w2/∂x1 = cos(2)
z = w1 + w2              ∂z/∂w1 = 1,  ∂z/∂w2 = 1

Gradient: ∂z/∂x1 = (∂z/∂w1)(∂w1/∂x1) + (∂z/∂w2)(∂w2/∂x1) = 3 + cos(2)

(Combine all paths from x1 to z.) Similarly for ∂z/∂x2.

90

slide-128
SLIDE 128

Automatic derivatives in Julia

91

slide-129
SLIDE 129

Automatic derivatives in Julia

Automatic derivatives are exact:

91
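The code slides themselves are not reproduced here; as a minimal stand-in (assuming the ForwardDiff.jl package from juliadiff.org, which is not necessarily the package shown on the slides), automatic derivatives of z = x1x2 + sin(x1) at (2, 3) come out exact:

```julia
using ForwardDiff

z(x) = x[1] * x[2] + sin(x[1])
g = ForwardDiff.gradient(z, [2.0, 3.0])   # exact up to floating point, no step size h needed
println(g)                                # [3 + cos(2), 2] ≈ [2.5839, 2.0]
println(3 + cos(2))                       # hand-computed value from the back propagation slides
```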

slide-130
SLIDE 130

Maximum likelihood method

If X1, . . . , Xn are independent observations from a density pθ(x) with parameter θ ∈ Rd, then the negative log-likelihood is

F(θ) = −ℓ(θ; x1, . . . , xn) = −log ∏_{i=1}^n pθ(xi) = −Σ_{i=1}^n log pθ(xi)

and θ̂ = argmin_θ F(θ) is the maximum likelihood estimator (assuming a unique global minimum).

92

slide-131
SLIDE 131

Maximum likelihood method

If X1, . . . , Xn are independent observations from a density pθ(x) with parameter θ ∈ Rd, then the negative log-likelihood is

F(θ) = −ℓ(θ; x1, . . . , xn) = −log ∏_{i=1}^n pθ(xi) = −Σ_{i=1}^n log pθ(xi)

and θ̂ = argmin_θ F(θ) is the maximum likelihood estimator (assuming a unique global minimum).

Estimators that arise as minimisers of sums are called M-estimators.
92

slide-132
SLIDE 132

Regression with quadratic loss

If (y1, x1), . . . , (yn, xn) are data points,

θ̂ = argmin_θ Fn(θ) = argmin_θ (1/n) Σ_{i=1}^n L(θ; yi, xi)

with quadratic loss

L(θ; y, x) = ‖y − f(x; θ)‖²₂,

e.g. with θ = (m, b) and f(x; θ) = mx + b.

93

slide-133
SLIDE 133

Regularised maximum likelihood

If x1, . . . , xn are i.i.d. from a density pθ, θ ∈ Rd a parameter, target

θ̂ = argmin_θ Fn(θ) = argmin_θ (1/n) Σ_{i=1}^n L(θ; xi)

with

L(θ; xi) = −log p(xi; θ) + λ‖θ‖²₂.

94

slide-134
SLIDE 134

Estimate the gradient

Empirical loss over the data:

Fn(θ) = (1/n) Σ_{i=1}^n L(θ, xi)

95

slide-135
SLIDE 135

Estimate the gradient

Empirical loss over the data:

Fn(θ) = (1/n) Σ_{i=1}^n L(θ, xi)

Better objective: minimise the expected loss F(θ) = EX[L(θ, X)]. Use gradient descent and the law of large numbers (∇ is linear...):

∇F(θ) = EX[∇L(θ, X)] = lim_{n→∞} (1/n) Σ_{i=1}^n ∇L(θ, xi)

95

slide-136
SLIDE 136

Estimate the gradient

What if we estimate the gradient with just one sample? (In practice: just pick a random one.)

96

slide-137
SLIDE 137

Estimate the gradient

What if we estimate the gradient with just one sample? (In practice: just pick a random one.)

Unbiased estimate of the gradient: E[∇L(θ, Xi)] = ∇F(θ). Noisy, but useful!

96

slide-138
SLIDE 138

Stochastic gradient descent

In the maximum likelihood example the objective is of the form

F(x) = (1/n) Σ_{i=1}^n fi(x)

Often n is very large. Stochastic gradient descent moves along the gradient of a random fi:

xj+1 = xj − γ∇fm(xj),   m ∼ U({1, . . . , n})

Note that E[∇fm(x)] = ∇F(x). “Moves on average in the right direction.”

97
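A minimal sketch in Julia (a hypothetical example, reusing the quadratic-loss regression f(x; θ) = mx + b from an earlier slide; the synthetic data, learning rate and iteration count are illustrative assumptions).

```julia
using Random

Random.seed!(9)
n  = 1000
xs = randn(n)
ys = 2.0 .* xs .+ 1.0 .+ 0.1 .* randn(n)      # synthetic data generated with m = 2, b = 1

θ = [0.0, 0.0]                                # θ = (m, b)
γ = 0.01                                      # learning rate
for j in 1:10^5
    i = rand(1:n)                             # pick one random data point
    r = ys[i] - (θ[1] * xs[i] + θ[2])         # residual y − f(x; θ)
    ∇ = [-2r * xs[i], -2r]                    # gradient of (y − f(x; θ))² with respect to (m, b)
    θ .-= γ .* ∇                              # move along the stochastic gradient
end
println(θ)                                    # ≈ [2.0, 1.0]
```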

slide-139
SLIDE 139

Batch gradient descent

xj+1 = xj − γ (1/K) Σ_{k=1}^K ∇fmk(xj),   m1, . . . , mK ∼ U({1, . . . , n})

Again E[(1/K) Σ_{k=1}^K ∇fmk(xj)] = ∇F(xj). Less noisy, but also more expensive to compute...

98

slide-140
SLIDE 140

Online stochastic gradient descent

Example: maximum likelihood estimation again. Infinite stream of i.i.d. samples X1, . . . , Xj, . . . from pθ(x) with parameter θ ∈ Rd:

θj+1 = θj + γ∇ log pθj(Xj).

This doesn’t require holding all data in memory.

99

slide-141
SLIDE 141

AD software

Julia: http://www.juliadiff.org, Zygote.jl and others
Matlab: http://www.autodiff.org
Python: https://github.com/HIPS/autograd

100

slide-142
SLIDE 142

AD software

Julia: http://www.juliadiff.org, Zygote.jl and others
Matlab: http://www.autodiff.org
Python: https://github.com/HIPS/autograd
R? Possibly http://tobiasmadsen.com/2017/03/31/ automatic differentiation in r/

100

slide-143
SLIDE 143

Synthesis and modern stochastic simulation