SLIDE 1

Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling

Christopher De Sa, Kunle Olukotun, Christopher Ré

{cdesa,kunle,chrismre}@stanford.edu Stanford


SLIDE 2

Overview

SLIDE 3

Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.

Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010; …etc.

SLIDES 4–9

Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems. Question: when and why does it work? “Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does… but there’s no theoretical guarantee.

Our contributions

  • 1. The “folklore” is not necessarily true.
  • 2. …but it works under reasonable conditions.
SLIDES 10–11

Problem: given a probability distribution, produce samples from it.

  • e.g. to do inference in a graphical model

Algorithm: Gibbs sampling

  • de facto Markov chain Monte Carlo (MCMC) method for inference
  • produces a series of approximate samples that approach the target distribution
SLIDE 12

What is Gibbs Sampling?

SLIDE 13

Algorithm 1 Gibbs sampling
Require: Variables x_i for 1 ≤ i ≤ n, and distribution π.
loop
  Choose s by sampling uniformly from {1, …, n}.
  Re-sample x_s from P_π(x_s | x_{1,…,n}\{s}).
  Output x.
end loop

SLIDES 14–18

Walkthrough on a model with variables x1, …, x7:

  • Choose a variable to update at random (here, x5).
  • Compute its conditional distribution given the other variables (here x4, x6, x7): e.g. P(x5 = ·) = 0.7 and P(x5 = ·) = 0.3 for its two possible values.
  • Update the variable by sampling from its conditional distribution.
  • Output the current state as a sample.
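
To make Algorithm 1 concrete, here is a minimal Python sketch (an editorial addition, not part of the original talk) of sequential Gibbs sampling over binary variables, with the target π supplied as an unnormalized mass function p:

    import random

    def gibbs_sample(p, n, steps, x=None, rng=random):
        # Minimal sequential Gibbs sampler in the style of Algorithm 1.
        # p maps a tuple in {0,1}^n to an unnormalized probability.
        if x is None:
            x = tuple(rng.randrange(2) for _ in range(n))
        samples = []
        for _ in range(steps):
            s = rng.randrange(n)                  # choose s uniformly from {1, ..., n}
            w0 = p(x[:s] + (0,) + x[s + 1:])      # conditional weights for x_s = 0 and 1
            w1 = p(x[:s] + (1,) + x[s + 1:])
            v = 0 if rng.random() < w0 / (w0 + w1) else 1   # re-sample x_s
            x = x[:s] + (v,) + x[s + 1:]
            samples.append(x)                     # output the current state
        return samples

For instance, gibbs_sample(lambda x: {(0,1): 1, (1,0): 1, (1,1): 1, (0,0): 0}[x], n=2, steps=10000) draws from the two-variable distribution used in the bias example later in the deck.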

SLIDES 19–21

Gibbs Sampling: A Practical Perspective

  • Pros of Gibbs sampling
    – Easy to implement
    – Updates are sparse → fast on modern CPUs
  • Cons of Gibbs sampling
    – Sequential algorithm → can’t naively parallelize
    – e.g. on a 64-core machine, no parallelism leaves up to 98% of performance on the table!

SLIDES 22–24

Asynchronous Gibbs Sampling

  • Run multiple threads in parallel without locks
    – also known as HOGWILD!
    – adapted from a popular technique for stochastic gradient descent (SGD)
  • When we read a variable, it could be stale
    – while we re-sample a variable, its adjacent variables can be overwritten by other threads
    – semantics not equivalent to standard (sequential) Gibbs sampling
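
To illustrate these semantics, here is a lock-free variant of the earlier sequential sketch (again an editorial addition; in CPython the global interpreter lock serializes bytecode, so this shows the stale-read behavior rather than a real speedup):

    import random
    import threading

    def hogwild_gibbs(p, n, steps_per_thread, num_threads=4):
        # Threads share the state vector x and update it with no locks,
        # so a thread's snapshot of the other variables may be stale.
        x = [random.randrange(2) for _ in range(n)]   # shared, unprotected state

        def worker():
            rng = random.Random()
            for _ in range(steps_per_thread):
                s = rng.randrange(n)
                snap = tuple(x)                   # may mix old and new values
                w0 = p(snap[:s] + (0,) + snap[s + 1:])
                w1 = p(snap[:s] + (1,) + snap[s + 1:])
                x[s] = 0 if rng.random() < w0 / (w0 + w1) else 1

        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x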

SLIDES 26–29

Question

Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata

  • want to get accurate estimates ⇒ bound the bias
  • want to be independent of initial conditions quickly ⇒ bound the mixing time

SLIDES 30–32

Previous Work

  • “Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent” — Niu et al., NIPS 2011.
    – follow-up work: Liu and Wright, SIOPT 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015
  • “Analyzing Hogwild Parallel Gaussian Gibbs Sampling” — Johnson et al., NIPS 2013.

SLIDE 33

Question

Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata

  • want to be independent of initial conditions quickly ⇒ bound the mixing time
  • want to get accurate estimates ⇒ bound the bias

SLIDE 34

Bias

SLIDES 35–37

  • How close are the samples to the target distribution?
    – standard measurement: total variation distance

        ‖µ − ν‖_TV = max_{A⊂Ω} |µ(A) − ν(A)|

  • For sequential Gibbs, no asymptotic bias:

        ∀µ₀,  lim_{t→∞} ‖P^(t) µ₀ − π‖_TV = 0

“Folklore”: asynchronous Gibbs is also unbiased. …but this is not necessarily true!
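
On a finite state space the maximum over events A is attained at A = {s : µ(s) > ν(s)}, so total variation distance is half the L1 distance; a small sketch (editorial addition):

    def tv_distance(mu, nu):
        # Total variation distance between two distributions given as
        # dicts from states to probabilities: max_A |mu(A) - nu(A)|.
        return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0))
                         for s in set(mu) | set(nu))

    # e.g. tv_distance({'a': 0.7, 'b': 0.3}, {'a': 0.5, 'b': 0.5}) == 0.2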

SLIDES 38–43

Simple Bias Example

A two-variable model with

    p(0,1) = p(1,0) = p(1,1) = 1/3,  p(0,0) = 0.

[State-transition diagram over the states (0,0), (0,1), (1,0), (1,1), with transition probabilities 1/4, 1/2, and 3/4 on its edges.]

Two threads update starting from the same state. Each thread reads the old value of the other variable, so both variables can be re-sampled to 0 at once, landing the chain in (0,0), a state that should have zero probability!
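
A simulation sketch of this effect (an editorial addition that assumes the adversarial schedule where the two threads race on every round, each reading the old value of the other variable):

    import random

    pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3, (0, 0): 0.0}

    def p_one(i, x):
        # Exact conditional P(x_i = 1 | the other variable) under pi.
        w0 = pi[(0, x[1])] if i == 0 else pi[(x[0], 0)]
        w1 = pi[(1, x[1])] if i == 0 else pi[(x[0], 1)]
        return w1 / (w0 + w1)

    rng = random.Random(0)
    x, hits, rounds = (1, 1), 0, 100_000
    for _ in range(rounds):
        # Both threads update at once, each seeing the OLD state.
        x = (int(rng.random() < p_one(0, x)), int(rng.random() < p_one(1, x)))
        hits += (x == (0, 0))
    print(hits / rounds)   # noticeably positive, even though pi(0,0) = 0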

SLIDES 44–45

Nonzero Asymptotic Bias

[Bar charts: the joint distribution (by state) and the marginal distribution of sequential vs. Hogwild! Gibbs; bias introduced by Hogwild!-Gibbs.]

Measured bias (total variation distance):

  • sequential: < 0.1% (unbiased)
  • asynchronous: 9.8% (biased)

SLIDES 46–49

Are we using the right metric?

  • No, total variation distance is too conservative
    – it depends on events that don’t matter for inference
    – we usually only care about a small number of variables
  • New metric: sparse variation distance

        ‖µ − ν‖_SV(ω) = max_{|A|≤ω} |µ(A) − ν(A)|

    where |A| is the number of variables on which event A depends

Simple example (bias of asynchronous Gibbs): total variation 9.8%, but sparse variation (ω = 1) only 0.4%.
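
Since an event that depends on at most ω variables is determined by the marginal on those variables, sparse variation distance is the worst TV distance between marginals on any ω-variable subset. A brute-force sketch for small models (editorial addition, reusing the dict representation from earlier):

    from itertools import combinations

    def sv_distance(mu, nu, n, omega):
        # ||mu - nu||_{SV(omega)} over {0,1}^n: largest TV distance
        # between the two marginals on any subset of omega variables.
        def tv(a, b):
            return 0.5 * sum(abs(a.get(k, 0.0) - b.get(k, 0.0))
                             for k in set(a) | set(b))

        def marginal(dist, subset):
            m = {}
            for state, prob in dist.items():
                key = tuple(state[i] for i in subset)
                m[key] = m.get(key, 0.0) + prob
            return m

        return max(tv(marginal(mu, sub), marginal(nu, sub))
                   for sub in combinations(range(n), omega))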

SLIDES 50–52

Total Influence Parameter

  • An old condition used to study mixing times of spin systems:

        α = max_{i∈I} Σ_{j∈I} max_{(X,Y)∈B_j} ‖π_i(·|X_{I\{i}}) − π_i(·|Y_{I\{i}})‖_TV

    – (X, Y) ∈ B_j means X and Y are equal except at variable j
    – π_i(·|X_{I\{i}}) is the conditional distribution of variable i given the values of all the other variables in state X
    – Dobrushin’s condition holds when α < 1.
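
For intuition, total influence can be computed by brute force on tiny models; a sketch (editorial addition, exponential in n and assuming every conditional is well defined):

    from itertools import product

    def total_influence(p, n):
        # Brute-force total influence alpha of a joint p over {0,1}^n;
        # Dobrushin's condition holds when the result is < 1.
        def cond(i, x):
            # Conditional distribution of variable i given the rest of x.
            w = [p(x[:i] + (v,) + x[i + 1:]) for v in (0, 1)]
            z = w[0] + w[1]
            return (w[0] / z, w[1] / z)

        alpha = 0.0
        for i in range(n):
            row = 0.0
            for j in range(n):
                if j == i:
                    continue   # the j = i term is zero: the pair agrees off coordinate i
                worst = 0.0
                for x in product((0, 1), repeat=n):
                    y = x[:j] + (1 - x[j],) + x[j + 1:]   # (x, y) in B_j: equal except at j
                    a, b = cond(i, x), cond(i, y)
                    worst = max(worst, 0.5 * (abs(a[0] - b[0]) + abs(a[1] - b[1])))
                row += worst
            alpha = max(alpha, row)
        return alpha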

SLIDES 53–54

Asymptotic Result

  • For any class of distributions with bounded total influence α = O(1)
    – big-O notation is over the number of variables n
  • If O(n) timesteps of sequential Gibbs suffice to achieve arbitrarily small bias
    – measured by ω-sparse variation distance, for fixed ω
  • …then asynchronous Gibbs requires only O(1) additional timesteps to achieve the same bias!

more details, explicit bounds, et cetera in the paper

SLIDE 55

Question

Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata

  • want to get accurate estimates ⇒ bound the bias
  • want to be independent of initial conditions quickly ⇒ bound the mixing time

SLIDES 56–58

Mixing Time

  • How long do we need to run until the samples are independent of initial conditions?
  • The mixing time of a Markov chain is the first time at which the distribution of the sample is close to the stationary distribution.
    – in terms of total variation distance
    – feasible to run MCMC if the mixing time is small

“Folklore”: asynchronous Gibbs has the same mixing time as sequential Gibbs… also not necessarily true!

SLIDES 59–60

Mixing Time Example

[Plot: estimates of P(T_Y > ·) vs. sample number (thousands), comparing sequential Gibbs with HOGWILD! Gibbs at several staleness values τ, against the true distribution.]

τ is a hardware-dependent read staleness parameter.

Sequential Gibbs achieves the correct marginal quickly: t_mix = O(n log n). Asynchronous Gibbs takes much longer: t_mix = exp(Ω(n)).

SLIDES 61–63

Bounding the Mixing Time

Suppose that our target distribution satisfies Dobrushin’s condition (total influence α < 1).

  • Mixing time of sequential Gibbs (known result):

        t_mix-seq(ε) ≤ (n / (1 − α)) · log(n / ε)

  • Mixing time of asynchronous Gibbs:

        t_mix-hog(ε) ≤ ((n + ατ) / (1 − α)) · log(n / ε)

    where τ is a hardware-dependent read staleness parameter.

Takeaway message: we can compare the two mixing time bounds with

    t_mix-hog(ε) ≈ (1 + ατ·n⁻¹) · t_mix-seq(ε)

…they differ by a negligible factor!
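
Plugging illustrative numbers into the two bounds (editorial addition; the values of n, α, τ, ε below are made up) shows how small the gap is:

    import math

    def tmix_seq(n, alpha, eps):
        return n / (1 - alpha) * math.log(n / eps)

    def tmix_hog(n, alpha, tau, eps):
        return (n + alpha * tau) / (1 - alpha) * math.log(n / eps)

    # e.g. n = 10**6 variables, alpha = 0.5, tau = 64, eps = 0.01:
    # the ratio is 1 + alpha*tau/n = 1.000032, a negligible factor.
    print(tmix_hog(10**6, 0.5, 64, 0.01) / tmix_seq(10**6, 0.5, 0.01))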

SLIDE 64

Theory Matches Experiment

[Plot: estimated t_mix of HOGWILD! Gibbs on a large Ising model vs. the expected staleness parameter τ*; the estimates closely track the theoretical prediction.]

SLIDES 65–66

Conclusion

  • Analyzed and modeled asynchronous Gibbs sampling, and identified two success metrics
    – sample bias → how close to the target distribution?
    – mixing time → how long do we need to run?
  • Showed that asynchronicity can cause problems
  • Proved bounds on the effect of asynchronicity
    – using the new sparse variation distance, together with
    – the classical condition of total influence

Thank you!

cdesa@stanford.edu
stanford.edu/~cdesa