Advanced Sampling Algorithms - Mobashir Mohammad, Hirak Sarkar - PowerPoint PPT Presentation



slide-1
SLIDE 1

+

Advanced Sampling Algorithms

slide-2
SLIDE 2

+

Advanced Sampling Algorithms

 Mobashir Mohammad  Hirak Sarkar  Parvathy Sudhir  Yamilet Serrano Llerena  Aditya Kulkarni  Tobias Bertelsen  Nirandika Wanigasekara  Malay Singh

slide-3
SLIDE 3

+

Introduction

Mobashir Mohammad 3

slide-4
SLIDE 4

+Sampling Algorithms

 Monte Carlo Algorithms

 Give an approximate answer with a certain probability  May terminate with a failure signal  Correct answer not guaranteed; running time bounded

 Las Vegas Algorithms

 Always give an exact answer  May not terminate in a definite time  Answer guaranteed; running time not bounded

slide-5
SLIDE 5

+Monte Carlo Method

 What are the chances of winning with a properly shuffled deck?

 Hard to compute because winning or losing depends on a complex procedure of reorganizing the cards

 Insight

 Why not play just a few hands, and see empirically how many we in fact win?

 More generally

 We can approximate a probability density function using only a few samples from that density

slide-6
SLIDE 6

+Simplest Monte Carlo Method

 Finding the value of π – “Shooting Darts”

 π/4 is equal to the area of a circle of diameter 1

 Integral solved with Monte Carlo:

π/4 = ∬_{x² + y² < 1} dx dy

 Randomly select a large number of points inside the square

π ≈ 4 · N_inside_circle / N_total

(Exact when N_total → ∞)
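A minimal sketch of this dart-shooting estimator in Python (the function name and sample count are ours, not from the slides):

```python
import random

def estimate_pi(n_total, seed=0):
    """Estimate pi by sampling points uniformly in the unit square
    and counting how many fall inside the inscribed circle of diameter 1."""
    rng = random.Random(seed)
    n_inside = 0
    for _ in range(n_total):
        x, y = rng.random(), rng.random()           # point in [0,1) x [0,1)
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.25:  # inside circle of radius 1/2
            n_inside += 1
    return 4 * n_inside / n_total
```

With 100 000 samples the estimate is typically within a few hundredths of π; the error shrinks like 1/√N.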

slide-7
SLIDE 7

+Monte Carlo Principle

 Given a very large set X and a distribution p(x) over it  We draw a set of N independent and identically distributed samples

 We can then approximate the distribution using these samples:

p_N(x) = (1/N) Σ_{i=1}^{N} 1(x⁽ⁱ⁾ = x) → p(x) as N → ∞

 We can also use these samples to compute expectations, …

slide-8
SLIDE 8

+Sampling

Problem:

 Exact inference from a probability distribution is hard  Example: determine the number of people likely to go shopping at the Nike Store at Vivo City on a Sunday

Solution:

 Use random samples to get an approximate inference.

How:

 Use a Markov chain to get random samples!

slide-9
SLIDE 9

+

Markov Chains

Hirak Sarkar 9

slide-10
SLIDE 10

+Introduction to Markov Chain

 Real world systems contain uncertainty and evolve over time  Stochastic processes and Markov chains model such systems  A discrete time stochastic process is a sequence of random variables X₀, X₁, … and is typically denoted by {Xₙ}

slide-11
SLIDE 11

+Components of a Stochastic Process

 The state space of a stochastic process is the set of all values that {Xₙ} can take.

 Time: n = 1, 2, …  The set of states is denoted by S.

S = {s₁, s₂, …, sₖ}; here there are k states. Clearly {Xₙ} takes one of the k values.

slide-12
SLIDE 12

+Markov Property and Markov Chain

 Consider a special class of stochastic processes that satisfy a certain property called the Markov Property.

 Markov Property: the state Xₙ depends only on Xₙ₋₁, not on any Xᵢ for i < n − 1

 Formally,

P(Xₙ = xₙ | X₀ = x₀, …, Xₙ₋₁ = xₙ₋₁) = P(Xₙ = xₙ | Xₙ₋₁ = xₙ₋₁) = p(xₙ, xₙ₋₁)

 We call such a process a Markov Chain  The matrix containing p(xₙ, xₙ₋₁) in cell (xₙ, xₙ₋₁) is called the Transition Matrix.

slide-13
SLIDE 13

+Transition Matrix

 The one-step transition matrix for a Markov chain with states S = {0, 1, 2} is

P = | p₀₀ p₀₁ p₀₂ |
    | p₁₀ p₁₁ p₁₂ |
    | p₂₀ p₂₁ p₂₂ |

 If P is the same at every point of time then the chain is called time homogeneous.

 Σₖ pᵢₖ = 1 and pᵢₖ ≥ 0 ∀ i, k, where pᵢₖ = Pr(X₁ = k | X₀ = i)

slide-14
SLIDE 14

+Example: Random Walk

Adjacency Matrix → Transition Matrix

[Figure: 3-node chain A–B–C; transition probabilities A→B = 1, B→A = B→C = 1/2, C→B = 1]

slide-15
SLIDE 15

+

What is a random walk?

[Figure: walker on nodes A, B, C with transition probabilities 1, 1/2, 1/2, 1; position at t = 0]

slide-16
SLIDE 16

+

What is a random walk?

[Figure: distribution over nodes A, B, C at t = 0 and t = 1]

slide-17
SLIDE 17

+

What is a random walk?

[Figure: distribution over nodes A, B, C at t = 0, t = 1 and t = 2]

slide-18
SLIDE 18

+

What is a random walk?

[Figure: distribution over nodes A, B, C at t = 0, t = 1, t = 2 and t = 3]

slide-19
SLIDE 19

+

Probability Distributions

 xₜ(i) = P(Xₜ = i) = probability that the surfer is at node i at time t

 xₜ₊₁(i) = Σⱼ P(Xₜ₊₁ = i | Xₜ = j) P(Xₜ = j) = Σⱼ pⱼᵢ xₜ(j)

 xₜ₊₁ = xₜP = (xₜ₋₁P)P = ((xₜ₋₂P)P)P = … = x₁Pᵗ

 What happens when the surfer keeps walking for a long time?

slide-20
SLIDE 20

+

Stationary Distribution

 When the surfer keeps walking for a long time  the distribution does not change anymore

 i.e. xₜ₊₁ = xₜ

We denote it as π

slide-21
SLIDE 21

+Basic Limit Theorem

 P =
| 0    1    0   |
| 1/3  0    2/3 |
| 1/3  1/3  1/3 |

 P⁵ =
| 0.246914  0.407407  0.345679 |
| 0.251029  0.362140  0.386831 |
| 0.251029  0.366255  0.382716 |

 P¹⁰ =
| 0.250013  0.374740  0.375248 |
| 0.249996  0.375095  0.374909 |
| 0.249996  0.375078  0.374926 |

 P²⁰ =
| 0.2500000002  0.3749999913  0.3750000085 |
| 0.2499999999  0.3750000030  0.3749999970 |
| 0.2499999999  0.3750000028  0.3749999973 |

 Every row converges to the stationary distribution π = (0.25, 0.375, 0.375)
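The convergence of these matrix powers can be reproduced with a few lines of pure Python (the matrix is the one on this slide; the helper names are ours):

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """n-th power of a square matrix by repeated multiplication."""
    R = P
    for _ in range(n - 1):
        R = mat_mul(R, P)
    return R

P = [[0, 1, 0],
     [1/3, 0, 2/3],
     [1/3, 1/3, 1/3]]

P20 = mat_pow(P, 20)  # every row is now very close to (0.25, 0.375, 0.375)
```

Every row of `P20` agrees with the stationary distribution to many decimal places, which is exactly the basic limit theorem in action.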

slide-22
SLIDE 22

+Example

22

slide-23
SLIDE 23

+Classification of States

 State j is reachable from state i if the probability of going from i to j in n > 0 steps is greater than zero (state j is reachable from state i if there is a path from i to j in the state transition diagram)

 A subset S′ of the state space S is closed if pᵢⱼ = 0 for every i ∈ S′ and j ∉ S′

 A state i is said to be absorbing if it is a single-element closed set.

 A closed set S′ of states is irreducible if every state in S′ is reachable from every other state in S′

 A Markov chain is said to be irreducible if the state space S is irreducible.

slide-24
SLIDE 24

+Example

 Reducible Markov Chain

[Figure: state diagram with an absorbing state and a closed irreducible set]

 Irreducible Markov Chain

[Figure: state diagram in which every state is reachable from every other state]

slide-25
SLIDE 25

+Periodic and Aperiodic States

 Suppose that the structure of the Markov chain is such that state i is visited after a number of steps that is an integer multiple of an integer d > 1. Then the state is called periodic with period d.

 If no such integer exists (i.e., d = 1) then the state is called aperiodic.

 Example

[Figure: state diagram and transition matrix P of a two-state chain with periodic states, d = 2]

slide-26
SLIDE 26

+Stationary Distribution

P = | 0.3  0.7 |
    | 0.4  0.6 |

π⁰ = (π₀⁰, π₁⁰) = (0.5, 0.5)

π¹ = π⁰P = (0.5  0.5) P = (0.35, 0.65)

slide-27
SLIDE 27

+Steady State Analysis

 Recall that the probability of finding the MC at state i after the k-th step is given by

πᵢ(k) = Pr(Xₖ = i),  π(k) = (π₀(k), π₁(k), …)

 An interesting question is what happens in the “long run”, i.e.,

πᵢ = lim_{k→∞} πᵢ(k)

 This is referred to as the steady state, equilibrium, or stationary state probability

 Questions:

Do these limits exist?

If they exist, do they converge to a legitimate probability distribution, i.e., Σᵢ πᵢ = 1?

How do we evaluate πⱼ for all j?

slide-28
SLIDE 28

+Steady State Analysis

 THEOREM: In a finite irreducible aperiodic Markov chain a unique stationary state probability vector π exists such that πⱼ > 0 and

lim_{k→∞} πⱼ(k) = πⱼ

 The steady state distribution π is determined by solving

π = πP  and  Σᵢ πᵢ = 1
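These two equations can be solved numerically by iterating π ← πP until the vector stops changing; a sketch in Python on a small two-state example chain (matrix and names are ours):

```python
def stationary(P, tol=1e-12):
    """Find pi with pi = pi P by iterating the distribution update,
    starting from the uniform distribution (so the entries sum to 1)."""
    n = len(P)
    pi = [1.0 / n] * n
    while True:
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new

# two-state chain; its stationary distribution is (4/11, 7/11)
pi = stationary([[0.3, 0.7], [0.4, 0.6]])
```

Iteration works here because the chain is finite, irreducible and aperiodic, so the theorem above guarantees convergence from any starting distribution.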

slide-29
SLIDE 29

+Sampling from a

Markov Chain

Parvathy 29

slide-30
SLIDE 30

+Sampling from a Markov Chain

 Sampling – create random samples from the desired probability distribution P

 Construct a Markov chain with steady state distribution π

 π = P

 Run the Markov chain long enough  On convergence, its states are approximately distributed according to π.

 Random sampling of the Markov chain yields random samples of P  But!

 It is difficult to determine how long to run the Markov chain.

slide-31
SLIDE 31

+Ising Model

 Named after physicist Ernst Ising, invented by Wilhelm Lenz  Mathematical model of ferromagnetism  Consists of discrete variables representing magnetic dipole moments of atomic spins (+1 or −1)

31

slide-32
SLIDE 32

+1D Ising Model

 Solved by Ernst Ising  N spins arranged in a ring  P(s) = e^(−βH(s)) / Z

 β – inverse temperature  H(s) – energy  s = (x₁, …, x_N) is one possible configuration

 Higher probability on states with lower energy

[Figure: ring of spins x₁ … x_N]

slide-33
SLIDE 33

+2D Ising Model

 Spins are arranged in a lattice.

 Each spin interacts with 4 neighbors

 One of the simplest statistical models to show a phase transition

33

slide-34
SLIDE 34

+2D Ising Model (2)

 G = (V, E)  xᵢ ∈ {+1, −1} – value of the spin at location i  S = {s₁, s₂, …, sₙ} is the state space.  H(s) = −Σ_{(xᵢ,xⱼ)∈E} s(xᵢ) s(xⱼ)

 Similar spins subtract energy  Opposite spins add energy

slide-35
SLIDE 35

+2D Ising Model (3)

P(s) = e^(−βH(s)) / Z = e^(β Σ_{(xᵢ,xⱼ)∈E} s(xᵢ)s(xⱼ)) / Z

 If β = 0 all spin configurations have the same probability  If β > 0 lower energy is preferred

slide-36
SLIDE 36

+Exact Inference is Hard

 Posterior distribution over s

 If y is any observed variable  P(s | y) = p(s, y) / Σₛ p(s, y)

 Intractable computation

 Joint probability distribution – 2^N possible combinations of s  Marginal probability distribution at a site  MAP estimation

slide-37
SLIDE 37

+The Big Question

 Given π on S = {s₁, s₂, …, sₙ}, simulate a random object with distribution π

slide-38
SLIDE 38

+Methods

 Generate random samples to estimate a quantity  Samples are generated “Markov-chain style”

 Markov Chain Monte Carlo (MCMC)  Propp Wilson Simulation  Las Vegas variant  Sandwiching  Improvement on Propp Wilson 38

slide-39
SLIDE 39

+MARKOV CHAIN

MONTE CARLO METHOD (MCMC)

YAMILET R. SERRANO LL. 39

slide-40
SLIDE 40

+Recall

Given a probability distribution π on S = {s1, …, sk}, how do we simulate a random object with distribution π ?

40

slide-41
SLIDE 41

+Intuition

Given a probability distribution π on S = {s1, …, sk}, how do we simulate a random object with distribution π ?

High Dimension Space

41

slide-42
SLIDE 42

+MCMC

  • 1. Construct an irreducible and aperiodic Markov chain [X₀, X₁, …] whose stationary distribution is π.

  • 2. If we run the chain with an arbitrary initial distribution, the Markov Chain Convergence Theorem guarantees that the distribution of the chain at time n converges to π.

  • 3. Hence, if we run the chain for a sufficiently long time n, the distribution of Xₙ will be very close to π, so it can be used as a sample.

slide-43
SLIDE 43

+MCMC(2)

 Generally, there are two types of MCMC algorithms:

 Metropolis-Hastings  Gibbs Sampling

slide-44
SLIDE 44

+Metropolis Hastings Algorithm

 Original method:

 Metropolis, Rosenbluth, Rosenbluth, Teller and Teller (1953).

 Generalized by Hastings in 1970.  Rediscovered by Tanner and Wong (1987) and Gelfand and Smith (1990)

 Is one way to implement MCMC.

Nicholas Metropolis

44

slide-45
SLIDE 45

+Metropolis Hastings Algorithm

Basic Idea

GIVEN: A probability distribution π on S = {s1, …, sk} GOAL : Approx. sample from π Start with a proposal distribution Q(x,y)

  • Q(x,y) specifies transition of Markov Chain
  • Q(x,y) plays the role of the transition matrix

By accepting/rejecting the proposal, MH simulates a Markov Chain, whose stationary distribution is π x: current state y: new proposal

45

slide-46
SLIDE 46

+Metropolis Hastings Algorithm

Algorithm

GIVEN:

A probability distribution π on S = {s1, …, sk}

GOAL :

  • Approx. sample from π

Given the current sample x,  Draw y from the proposal distribution Q(x, ·)  Draw U ~ Uniform(0,1) and update

xₙ₊₁ = y if U ≤ α(x, y), otherwise xₙ₊₁ = x

where the acceptance probability is

α(x, y) = min( 1, [π(y) Q(y, x)] / [π(x) Q(x, y)] )

slide-47
SLIDE 47

+Metropolis Hastings Algorithm

The Ising Model

Consider m sites around a circle. Each site i can have one of two spins xᵢ ∈ {−1, 1}

The target distribution is

π(x) ∝ exp( β Σᵢ xᵢ xᵢ₊₁ )

slide-48
SLIDE 48

+Metropolis Hastings Algorithm

The Ising Model

Target distribution: π(x) ∝ exp( β Σᵢ xᵢ xᵢ₊₁ )

Proposal distribution:

  • 1. Randomly pick one out of the m spins
  • 2. Flip its sign

Acceptance probability (say the i-th spin is flipped): since the proposal is symmetric,

α = min( 1, π(y) / π(x) ) = min( 1, exp( −2β xᵢ (xᵢ₋₁ + xᵢ₊₁) ) )
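A compact sketch of this sampler in Python (the function and parameter names are ours; the acceptance rule is the Metropolis ratio above for the symmetric single-spin-flip proposal):

```python
import math
import random

def mh_ising_ring(m, beta, n_steps, seed=0):
    """Metropolis-Hastings for the 1D Ising model on a ring of m spins."""
    rng = random.Random(seed)
    x = [rng.choice([-1, 1]) for _ in range(m)]  # arbitrary start state
    for _ in range(n_steps):
        i = rng.randrange(m)                     # proposal: flip one random spin
        # energy change of the flip; neighbours of i are i-1 and i+1 on the ring
        delta_h = 2 * x[i] * (x[i - 1] + x[(i + 1) % m])
        # accept with probability min(1, exp(-beta * delta_h))
        if delta_h <= 0 or rng.random() < math.exp(-beta * delta_h):
            x[i] = -x[i]
    return x
```

For large β the chain spends most of its time in low-energy (aligned) configurations, matching the target distribution's preference.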

slide-49
SLIDE 49

+Metropolis Hastings Algorithm

The Ising Model

Image taken from “Monte Carlo Investigation of the Ising Model” (2006) – Tobin Fricke

49

slide-50
SLIDE 50

+Disadvantages of MCMC

 No matter how large n is taken to be in the MCMC algorithm, there will still be some discrepancy between the distribution of the output and the target distribution π.

 In order to make this error small, we need to figure out how large n needs to be.

slide-51
SLIDE 51

+Bounds on Convergence of

MCMC

Aditya Kulkarni 51

slide-52
SLIDE 52

+Seminar on Probabilities (Strasbourg, 1983)

 If it is difficult to obtain asymptotic bounds on the convergence time of MCMC algorithms for Ising models, use quantitative bounds

 Use characteristics of the Ising model MCMC algorithm to obtain quantitative bounds

David Aldous

slide-53
SLIDE 53

+What we need to ensure

 As time goes on

We approach the stationary distribution in a monotonically decreasing fashion:

d(t) → 0 as t → ∞

 When we stop

The sample should follow a distribution which is no further from the stationary distribution than some factor ε:

d(t) ≤ ε

slide-54
SLIDE 54

+Total Variation Distance || ⋅ ||_TV

 Given two distributions p and q over a finite set of states S,  the total variation distance || ⋅ ||_TV is

||p − q||_TV = (1/2) ||p − q||₁ = (1/2) Σ_{x∈S} |p(x) − q(x)|
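The definition translates directly into Python (the function name is ours; distributions are given as equal-length lists over the same state ordering):

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on the same finite set:
    half the L1 distance between their probability vectors."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

It is 0 for identical distributions and at most 1 for distributions with disjoint support.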

slide-55
SLIDE 55

+Convergence Time

 Let X = (X₀, X₁, …) be a Markov chain on a state space with stationary distribution π

 We define d(t) as the worst total variation distance at time t:

d(t) = max_{x∈S} ||P(Xₜ | X₀ = x) − π||_TV

 The mixing time τ(ε) is the minimum time t such that d(t) is at most ε:

τ(ε) = min {t : d(t) ≤ ε}

 Define

τ = τ(1/2e) = min {t : d(t) ≤ 1/2e}

slide-56
SLIDE 56

+Quantitative results

[Figure: plot of d(t) against t, showing the threshold ε, the mixing time τ(ε), and τ at level 1/2e]

slide-57
SLIDE 57

+Lemma

 Consider two random walks started from state i and state j. Define d̄(t) as the worst total variation distance between their respective probability distributions at time t.

a. d̄(t) ≤ 2 d(t)   b. d(t) is decreasing

slide-58
SLIDE 58

+Upper bound on d(t) proof

From part (a) and the definition of τ = τ(1/2e) = min {t : d(t) ≤ 1/2e} we get

d̄(τ) ≤ e⁻¹

Also from part (b), we deduce that an upper bound on d̄ at a particular time t₀ gives an upper bound for later times:

d̄(t) ≤ (d̄(t₀))ⁿ for n t₀ ≤ t ≤ (n + 1) t₀, i.e. d̄(t) ≤ (d̄(t₀))^(t/t₀ − 1)

Substituting τ for t₀:

d̄(t) ≤ (d̄(τ))^(t/τ − 1) ≤ e^(1 − t/τ)

d(t) ≤ exp(1 − t/τ), t ≥ 0

slide-59
SLIDE 59

+Upper bound on τ(ε) proof

 Algebraic calculations:

d(t) ≤ exp(1 − t/τ)

log d(t) ≤ 1 − t/τ

t/τ ≤ 1 − log d(t) ≤ 1 + log(1/d(t))

t ≤ τ (1 + log(1/ε))

τ(ε) ≤ τ (1 + log(1/ε)), and, for any t with d(t) ≤ 1/2e,

τ(ε) ≤ 2e t (1 + log(1/ε))

slide-60
SLIDE 60

+Upper bounds

d(t) ≤ min(1, exp(1 − t/τ)), t ≥ 0

τ(ε) ≤ τ (1 + log(1/ε)), 0 < ε < 1

τ(ε) ≤ 2e t (1 + log(1/ε)), 0 < ε < 1, for any t with d(t) ≤ 1/2e

slide-61
SLIDE 61

+Entropy of initial distribution

Measure of randomness:

ent(μ) = − Σ_{x∈S} μ(x) log μ(x)

x is the initial state, μ is the initial probability distribution

slide-62
SLIDE 62

+A few more lemmas

1. Let (Xₜ) be the random walk associated with initial distribution μ, and let μₜ be the distribution of Xₜ; then ent(μₜ) ≤ t ⋅ ent(μ)

2. If ρ is a distribution on S such that ||ρ − π|| ≤ ε, then ent(ρ) ≥ (1 − ε) log |S|

slide-63
SLIDE 63

+Lower bound on 𝑒(𝑢) proof

 From lemma 2,

𝑓𝑜𝑢 𝑤 ≥ 1 − 𝜁 log 𝑇

𝑓𝑜𝑢 𝑤 log |𝑇| ≥ 1 − 𝜁 ≥ (1 − 𝑒(𝑢))

𝑒 𝑢 ≥ 1 − 𝑓𝑜𝑢 𝑤 log |𝑇| 𝑒 𝑢 ≥ 1 −

𝑢⋅𝑓𝑜𝑢 𝜈 log |𝑇|

63

slide-64
SLIDE 64

+Lower bound on τ(ε) proof

 From lemma 1,

ent(μₜ) ≤ t ⋅ ent(μ)  ⟹  t ≥ ent(μₜ) / ent(μ)

 From lemma 2,

ent(μₜ) ≥ (1 − ε) log |S|  ⟹  t ≥ (1 − ε) log |S| / ent(μ)

τ(ε) ≥ (1 − ε) log |S| / ent(μ)

slide-65
SLIDE 65

+

Lower bounds

d(t) ≥ 1 − t ⋅ ent(μ) / log |S|

τ(ε) ≥ (1 − ε) log |S| / ent(μ)

slide-66
SLIDE 66

+

Propp Wilson

Tobias Bertelsen 66

slide-67
SLIDE 67

+An exact version of MCMC

 Problems with MCMC

 A. Has an accuracy error, which depends on the starting state  B. We must know the number of iterations

 James Propp and David Wilson proposed Coupling From The Past (1996)

 A.k.a. the Propp-Wilson algorithm

 Idea:

 Solve both problems by (conceptually) running the chain infinitely

slide-68
SLIDE 68

+An exact version of MCMC

Theoretical

  • Runs all configurations infinitely
  • Literally takes infinite time
  • Impossible

Coupling from the past

  • Runs all configurations for finite time
  • Might take thousands of years
  • Infeasible

Sandwiching

  • Runs few configurations for finite time
  • Takes seconds
  • Practicable

slide-69
SLIDE 69

+Theoretical exact sampling

 Recall the convergence theorem:

 We approach the stationary distribution as the number of steps goes to infinity

 Intuitive approach:

 To sample perfectly we start a chain and run it forever  Start at t = 0, sample at t = ∞  Problem: We never get a sample

 Alternative approach:

 To sample perfectly we take a chain that has already been running for an infinite amount of time

 Start at t = −∞, sample at t = 0

slide-70
SLIDE 70

+Theoretical independence of starting state

 A sample from a Markov chain in MCMC depends solely on

 The starting state X₋∞  The sequence of random numbers U

 We want to be independent of the starting state.

 For a given sequence of random numbers U₋∞, …, U₋₁ we want to ensure that the starting state X₋∞ has no effect on X₀

slide-71
SLIDE 71

+Theoretical independence of starting state

Collisions:

 For a given U₋∞, …, U₋₁, if two Markov chains are in the same state at some t′ they will continue on together:

slide-72
SLIDE 72

+Coupling from the past

 At some finite past time t = −N

 All past chains have already run infinitely and have coupled into one

∞ − N = ∞

 We want to continue that coupled chain to t = 0  But we don’t know at which state it will be at t = −N

 Run all states from −N instead of −∞

slide-73
SLIDE 73

+Coupling from the past

1. Let U = U₋₁, U₋₂, U₋₃, … be a sequence of independent uniformly random numbers

2. For Nₖ ∈ {1, 2, 4, 8, …}:

   1. Extend U to length Nₖ, keeping U₋₁, …, U₋Nₖ₋₁ the same

   2. Start one chain from each state at t = −Nₖ

   3. For t from −Nₖ to zero: simulate the chains using Uₜ

   4. If all chains have converged to sᵢ at t = 0, return sᵢ

   5. Else repeat the loop
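The loop above can be sketched in Python. As the test chain we use a simple "ladder walk" on a handful of states (its stationary distribution is uniform); all function and variable names are ours:

```python
import random

def cftp(states, update, seed=0):
    """Coupling From The Past: run one chain from every state, starting
    further and further in the past, reusing the same random numbers."""
    rng = random.Random(seed)
    U = []                                   # U[j] drives the step at time -(j+1)
    N = 1
    while True:
        while len(U) < N:                    # extend U, keeping old entries fixed
            U.append(rng.random())
        current = {s: s for s in states}     # one chain per starting state
        for t in range(N, 0, -1):            # simulate from t = -N up to t = 0
            current = {s0: update(s, U[t - 1]) for s0, s in current.items()}
        if len(set(current.values())) == 1:  # all chains coalesced: exact sample
            return next(iter(current.values()))
        N *= 2                               # otherwise restart further in the past

def ladder_update(s, u, k=5):
    """Ladder walk on {0, ..., k-1}: step down if u < 1/2, else step up
    (sticking at the two ends)."""
    return max(s - 1, 0) if u < 0.5 else min(s + 1, k - 1)
```

Because the same random numbers are reused across restarts, the returned value is an exact draw from the stationary distribution, here uniform over the states.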

slide-74
SLIDE 74

+Coupling from the past

74

slide-75
SLIDE 75

+Questions

 Why do we double the lengths?

 Worst case < 4 N_opt steps, where N_opt is the minimal N at which we can achieve convergence

 Compare to N ∈ {1, 2, 3, 4, 5, …}: O(N_opt²) steps

 Why do we have to use the same random numbers?

 Different samples might take longer or shorter to converge.  We must evaluate the same sample in each iteration.  The sample should only be dependent on U, not on the different N

slide-76
SLIDE 76

+Using the same U

 We have k states with the following update function:

φ(sᵢ, U) = s₁ if U < 1/2, sᵢ₊₁ otherwise

φ(sₖ, U) = s₁ if U < 1/2, sₖ otherwise

 π(sᵢ) = 1/2ⁱ, π(sₖ) = 2/2ᵏ

slide-77
SLIDE 77

+

Using the same U

 Let's assume we generate a new U for each run: U¹, U², U³, …  The probability of S₁ depends only on the last random number: P(S₁) = P(U₋₁ < 1/2) = 1/2

| n | U | P(S₁) | P(S₁) acc. |
| 1 | [U¹₋₁] | P(U¹₋₁ < 1/2) | 50 % |
| 2 | [U²₋₂, U²₋₁] | P(U²₋₁ < 1/2 ∨ U¹₋₁ < 1/2) | 75 % |
| 4 | [U³₋₄, …, U³₋₁] | P(U³₋₁ < 1/2 ∨ U²₋₁ < 1/2 ∨ U¹₋₁ < 1/2) | 81.25 % |
| 8 | [U⁴₋₈, …, U⁴₋₁] | P(U⁴₋₁ < 1/2 ∨ … ∨ U¹₋₁ < 1/2) | 81.64 % |

slide-78
SLIDE 78

+

Using the same U

 The probability of S₁ depends only on the last random number: P(S₁) = P(U₋₁ < 1/2) = 1/2

 Let's instead use the same U for each run: U¹

| n | U | P(S₁) | P(S₁) acc. |
| 1 | [U¹₋₁] | P(U¹₋₁ < 1/2) | 50 % |
| 2 | [U¹₋₂, U¹₋₁] | P(U¹₋₁ < 1/2) | 50 % |
| 4 | [U¹₋₄, …, U¹₋₁] | P(U¹₋₁ < 1/2) | 50 % |
| 8 | [U¹₋₈, …, U¹₋₁] | P(U¹₋₁ < 1/2) | 50 % |

slide-79
SLIDE 79

+Problem

 In each step we update up to k chains  Total execution time is O(N_opt k)  BUT k = 2^(L²) in the Ising model  Worse than the naïve approach O(k)

slide-80
SLIDE 80

+

Sandwiching

Nirandika Wanigasekara 80

slide-81
SLIDE 81

+Sandwiching

Propp Wilson algorithm

  • With many vertices, running k chains will take a long time
  • Impractical for large k

  • Can we still get the same results?
  • Try sandwiching: track only a small subset of the chains

slide-82
SLIDE 82

+Sandwiching

 Idea

 Find two chains bounding all other chains  If we have two such boundary chains  Check if those two chains converge  Then all other chains have also converged

slide-83
SLIDE 83

+Sandwiching

 To come up with the boundary chains

 Need a way to order the states  s₁ ≤ s₂ ≤ s₃ …  A chain in a higher state does not cross a chain in a lower state

 Results in a Markov chain obeying certain monotonicity properties

[Figure: two ordered chains that never cross]

slide-84
SLIDE 84

+Sandwiching

 Let's consider

 A fixed set of k states  State space S = {1, …, k}  A transition matrix P with  P₁₁ = P₁₂ = 1/2  Pₖₖ = Pₖ,ₖ₋₁ = 1/2  for i = 2, …, k−1: Pᵢ,ᵢ₋₁ = Pᵢ,ᵢ₊₁ = 1/2  All other entries are 0

slide-85
SLIDE 85

+Sandwiching

 What is this Markov chain doing?

 It takes one step up or one step down the ladder at each integer time, with probability 1/2 each  If at the top or the bottom (state 1 or state k)  it stays where it is with probability 1/2  This is the Ladder Walk on k vertices

 The stationary distribution π of this Markov chain is uniform:

 πᵢ = 1/k for i = 1, …, k

slide-86
SLIDE 86

+Sandwiching

 Propp-Wilson for this Markov chain with

 Valid update function φ:  if U < 1/2 then step down  if U ≥ 1/2 then step up  Negative starting times

N₁, N₂, … = 1, 2, 4, 8, …

 Number of states k = 5
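A sketch of the sandwiched Propp-Wilson run in Python: thanks to the monotone update rule, only the bottom and top chains are tracked, and every other chain is squeezed between them (the function and variable names are ours):

```python
import random

def propp_wilson_sandwich(k, seed=0):
    """Exact sample from the ladder walk on {0, ..., k-1} by running only
    the bottom and top chains; monotonicity sandwiches all the others."""
    def update(s, u):  # valid monotone update function: down if u < 1/2, else up
        return max(s - 1, 0) if u < 0.5 else min(s + 1, k - 1)

    rng = random.Random(seed)
    U = []             # reused random numbers; U[j] drives the step at time -(j+1)
    N = 1
    while True:
        while len(U) < N:
            U.append(rng.random())
        lo, hi = 0, k - 1                   # the two extreme starting states
        for t in range(N, 0, -1):
            lo, hi = update(lo, U[t - 1]), update(hi, U[t - 1])
        if lo == hi:                        # the sandwich closed: all chains coalesced
            return lo
        N *= 2
```

Per step this costs 2 updates instead of k, which is what makes the approach feasible for huge state spaces like the Ising model's.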

slide-87
SLIDE 87

+Sandwiching

87

slide-88
SLIDE 88

+Sandwiching

 The update function preserves the ordering between states:

 for all U ∈ [0, 1] and all i, j ∈ {1, …, k} such that i ≤ j we have

φ(i, U) ≤ φ(j, U)

 It is sufficient to run only 2 chains rather than k

slide-89
SLIDE 89

+Sandwiching

 Are these conditions always met?

 No, not always  But there are frequent instances where these conditions are met  Especially useful when k is large  The Ising model is a good example of this

slide-90
SLIDE 90

+

Ising Model

Malay Singh 90

slide-91
SLIDE 91

+2D Ising Model

 A grid with sites numbered from 1 to L², where L is the grid size.  Each site i can have spin xᵢ ∈ {−1, +1}  V is the set of sites

S = {−1, 1}^V defines all possible configurations (states)

 The magnetisation m of a state s is

 m(s) = Σᵢ s(xᵢ) / L²

 The energy of a state s is

 H(s) = −Σ_{(i,j)} s(xᵢ) s(xⱼ)

slide-92
SLIDE 92

+Ordering of states

 For two states s, s′ we say s ≤ s′ if s(x) ≤ s′(x) ∀ x ∈ V

Maximum, m = 1: s_max(x) = +1 ∀ x ∈ V. Minimum, m = −1: s_min(x) = −1 ∀ x ∈ V. Hence s_min ≤ s ≤ s_max for all s

slide-93
SLIDE 93

+The update function

We use the sequence of random numbers Uₙ, Uₙ₋₁, …, U₀. We are updating the state Xₙ. We choose a site x ∈ V uniformly. With k₊(x, s) and k₋(x, s) denoting the number of neighbours of x with spin +1 and −1:

Xₙ₊₁(x) = +1 if Uₙ₊₁ < exp(2β(k₊(x, s) − k₋(x, s))) / (exp(2β(k₊(x, s) − k₋(x, s))) + 1), and −1 otherwise

slide-94
SLIDE 94

+Maintaining ordering

 Ordering after update (from Xₙ to Xₙ₊₁)

 We choose the same site x to update in both chains  The spin of x at Xₙ₊₁ depends on the update function

 We want to check

exp(2β(k₊(x, s) − k₋(x, s))) / (exp(2β(k₊(x, s) − k₋(x, s))) + 1) ≤ exp(2β(k₊(x, s′) − k₋(x, s′))) / (exp(2β(k₊(x, s′) − k₋(x, s′))) + 1)

 Since u ↦ eᵘ/(eᵘ + 1) is increasing, this is equivalent to checking

k₊(x, s) − k₋(x, s) ≤ k₊(x, s′) − k₋(x, s′)

slide-95
SLIDE 95

+Maintaining ordering

 As s ≤ s′ we have  k₊(x, s) ≤ k₊(x, s′)  k₋(x, s) ≥ k₋(x, s′)  Subtracting the second inequality from the first:  k₊(x, s) − k₋(x, s) ≤ k₊(x, s′) − k₋(x, s′)

95

slide-96
SLIDE 96

+Ising Model L = 4

[Figures: exact samples at T = 3.5, N = 512 and at T = 4.8, N = 128]

slide-97
SLIDE 97

+Ising Model L = 8

[Figure: exact sample at T = 5.9, N = 512]

slide-98
SLIDE 98

+Ising Model L = 16

[Figure: exact sample at T = 5.3, N = 16 384]

slide-99
SLIDE 99

+Summary

 Exact sampling  Markov chain Monte Carlo – but when do we converge?  Propp-Wilson with Sandwiching to the rescue

99

slide-100
SLIDE 100

+

Questions?

100

slide-101
SLIDE 101

+Additional

 Example 6.2

101

slide-102
SLIDE 102

+Additional

 Update function

102

slide-103
SLIDE 103

+Additional

 Proof for the ordering between states?? Need to find

103