SLIDE 1

Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling

Christopher De Sa, Kunle Olukotun, Christopher Ré

{cdesa,kunle,chrismre}@stanford.edu Stanford


SLIDE 2

Overview

SLIDE 3

Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.

Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010; …etc.

SLIDES 4–9

Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems. Question: when and why does it work? “Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does… but there’s no theoretical guarantee.

Our contributions

  • 1. The “folklore” is not necessarily true.
  • 2. …but it works under reasonable conditions.
SLIDES 10–11

Problem: given a probability distribution, produce samples from it.

  • e.g. to do inference in a graphical model

Algorithm: Gibbs sampling

  • de facto Markov chain Monte Carlo (MCMC) method for inference
  • produces a series of approximate samples that approach the target distribution
SLIDE 12

What is Gibbs Sampling?

SLIDE 13

Algorithm 1 Gibbs sampling
Require: Variables x_i for 1 ≤ i ≤ n, and distribution π.
loop
  Choose s by sampling uniformly from {1, …, n}.
  Re-sample x_s from P_π(x_s | x_{1,…,n}\{s}).
  Output x.
end loop

SLIDES 14–18

Walkthrough on a model with variables x1, …, x7:

  • Choose a variable to update at random (here, x5).
  • Compute its conditional distribution given the other variables (here x4, x6, x7): e.g. P(x5 = ·) = 0.7 and P(x5 = ·) = 0.3 for its two possible values.
  • Update the variable by sampling from its conditional distribution.
  • Output the current state as a sample.
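
To make Algorithm 1 concrete, here is a minimal Python sketch (an editorial addition, not part of the original talk) of sequential Gibbs sampling over binary variables, with the target π supplied as an unnormalized mass function p:

    import random

    def gibbs_sample(p, n, steps, x=None, rng=random):
        # Minimal sequential Gibbs sampler in the style of Algorithm 1.
        # p maps a tuple in {0,1}^n to an unnormalized probability.
        if x is None:
            x = tuple(rng.randrange(2) for _ in range(n))
        samples = []
        for _ in range(steps):
            s = rng.randrange(n)                  # choose s uniformly from {1, ..., n}
            w0 = p(x[:s] + (0,) + x[s + 1:])      # conditional weights for x_s = 0 and 1
            w1 = p(x[:s] + (1,) + x[s + 1:])
            v = 0 if rng.random() < w0 / (w0 + w1) else 1   # re-sample x_s
            x = x[:s] + (v,) + x[s + 1:]
            samples.append(x)                     # output the current state
        return samples

For instance, gibbs_sample(lambda x: {(0,1): 1, (1,0): 1, (1,1): 1, (0,0): 0}[x], n=2, steps=10000) draws from the two-variable distribution used in the bias example later in the deck.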

SLIDES 19–21

Gibbs Sampling: A Practical Perspective

  • Pros of Gibbs sampling
    – Easy to implement
    – Updates are sparse → fast on modern CPUs
  • Cons of Gibbs sampling
    – Sequential algorithm → can’t naively parallelize
    – e.g. on a 64-core machine, no parallelism leaves up to 98% of performance on the table!

SLIDES 22–24

Asynchronous Gibbs Sampling

  • Run multiple threads in parallel without locks
    – also known as HOGWILD!
    – adapted from a popular technique for stochastic gradient descent (SGD)
  • When we read a variable, it could be stale
    – while we re-sample a variable, its adjacent variables can be overwritten by other threads
    – semantics not equivalent to standard (sequential) Gibbs sampling
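
To illustrate these semantics, here is a lock-free variant of the earlier sequential sketch (again an editorial addition; in CPython the global interpreter lock serializes bytecode, so this shows the stale-read behavior rather than a real speedup):

    import random
    import threading

    def hogwild_gibbs(p, n, steps_per_thread, num_threads=4):
        # Threads share the state vector x and update it with no locks,
        # so a thread's snapshot of the other variables may be stale.
        x = [random.randrange(2) for _ in range(n)]   # shared, unprotected state

        def worker():
            rng = random.Random()
            for _ in range(steps_per_thread):
                s = rng.randrange(n)
                snap = tuple(x)                   # may mix old and new values
                w0 = p(snap[:s] + (0,) + snap[s + 1:])
                w1 = p(snap[:s] + (1,) + snap[s + 1:])
                x[s] = 0 if rng.random() < w0 / (w0 + w1) else 1

        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x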

SLIDES 26–29

Question

Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata

  • want to get accurate estimates ⇒ bound the bias
  • want to be independent of initial conditions quickly ⇒ bound the mixing time

SLIDES 30–32

Previous Work

  • “Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent” — Niu et al., NIPS 2011.
    – follow-up work: Liu and Wright, SIOPT 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015
  • “Analyzing Hogwild Parallel Gaussian Gibbs Sampling” — Johnson et al., NIPS 2013.

SLIDE 33

Question

Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata

  • want to be independent of initial conditions quickly ⇒ bound the mixing time
  • want to get accurate estimates ⇒ bound the bias

SLIDE 34

Bias

SLIDES 35–37

  • How close are the samples to the target distribution?
    – standard measurement: total variation distance

        ‖µ − ν‖_TV = max_{A⊂Ω} |µ(A) − ν(A)|

  • For sequential Gibbs, no asymptotic bias:

        ∀µ₀,  lim_{t→∞} ‖P^(t) µ₀ − π‖_TV = 0

“Folklore”: asynchronous Gibbs is also unbiased. …but this is not necessarily true!
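
On a finite state space the maximum over events A is attained at A = {s : µ(s) > ν(s)}, so total variation distance is half the L1 distance; a small sketch (editorial addition):

    def tv_distance(mu, nu):
        # Total variation distance between two distributions given as
        # dicts from states to probabilities: max_A |mu(A) - nu(A)|.
        return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0))
                         for s in set(mu) | set(nu))

    # e.g. tv_distance({'a': 0.7, 'b': 0.3}, {'a': 0.5, 'b': 0.5}) == 0.2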

SLIDES 38–43

Simple Bias Example

A two-variable model with

    p(0,1) = p(1,0) = p(1,1) = 1/3,  p(0,0) = 0.

[State-transition diagram over the states (0,0), (0,1), (1,0), (1,1), with transition probabilities 1/4, 1/2, and 3/4 on its edges.]

Two threads update starting from the same state. Each thread reads the old value of the other variable, so both variables can be re-sampled to 0 at once, landing the chain in (0,0), a state that should have zero probability!
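
A simulation sketch of this effect (an editorial addition that assumes the adversarial schedule where the two threads race on every round, each reading the old value of the other variable):

    import random

    pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3, (0, 0): 0.0}

    def p_one(i, x):
        # Exact conditional P(x_i = 1 | the other variable) under pi.
        w0 = pi[(0, x[1])] if i == 0 else pi[(x[0], 0)]
        w1 = pi[(1, x[1])] if i == 0 else pi[(x[0], 1)]
        return w1 / (w0 + w1)

    rng = random.Random(0)
    x, hits, rounds = (1, 1), 0, 100_000
    for _ in range(rounds):
        # Both threads update at once, each seeing the OLD state.
        x = (int(rng.random() < p_one(0, x)), int(rng.random() < p_one(1, x)))
        hits += (x == (0, 0))
    print(hits / rounds)   # noticeably positive, even though pi(0,0) = 0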

SLIDES 44–45

Nonzero Asymptotic Bias

[Bar charts: the joint distribution (by state) and the marginal distribution of sequential vs. Hogwild! Gibbs; bias introduced by Hogwild!-Gibbs.]

Measured bias (total variation distance):

  • sequential: < 0.1% (unbiased)
  • asynchronous: 9.8% (biased)

SLIDES 46–49

Are we using the right metric?

  • No, total variation distance is too conservative
    – it depends on events that don’t matter for inference
    – we usually only care about a small number of variables
  • New metric: sparse variation distance

        ‖µ − ν‖_SV(ω) = max_{|A|≤ω} |µ(A) − ν(A)|

    where |A| is the number of variables on which event A depends

Simple example (bias of asynchronous Gibbs): total variation 9.8%, but sparse variation (ω = 1) only 0.4%.
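
Since an event that depends on at most ω variables is determined by the marginal on those variables, sparse variation distance is the worst TV distance between marginals on any ω-variable subset. A brute-force sketch for small models (editorial addition, reusing the dict representation from earlier):

    from itertools import combinations

    def sv_distance(mu, nu, n, omega):
        # ||mu - nu||_{SV(omega)} over {0,1}^n: largest TV distance
        # between the two marginals on any subset of omega variables.
        def tv(a, b):
            return 0.5 * sum(abs(a.get(k, 0.0) - b.get(k, 0.0))
                             for k in set(a) | set(b))

        def marginal(dist, subset):
            m = {}
            for state, prob in dist.items():
                key = tuple(state[i] for i in subset)
                m[key] = m.get(key, 0.0) + prob
            return m

        return max(tv(marginal(mu, sub), marginal(nu, sub))
                   for sub in combinations(range(n), omega))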

SLIDES 50–52

Total Influence Parameter

  • An old condition used to study mixing times of spin systems:

        α = max_{i∈I} Σ_{j∈I} max_{(X,Y)∈B_j} ‖π_i(·|X_{I\{i}}) − π_i(·|Y_{I\{i}})‖_TV

    – (X, Y) ∈ B_j means X and Y are equal except at variable j
    – π_i(·|X_{I\{i}}) is the conditional distribution of variable i given the values of all the other variables in state X
    – Dobrushin’s condition holds when α < 1.
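
For intuition, total influence can be computed by brute force on tiny models; a sketch (editorial addition, exponential in n and assuming every conditional is well defined):

    from itertools import product

    def total_influence(p, n):
        # Brute-force total influence alpha of a joint p over {0,1}^n;
        # Dobrushin's condition holds when the result is < 1.
        def cond(i, x):
            # Conditional distribution of variable i given the rest of x.
            w = [p(x[:i] + (v,) + x[i + 1:]) for v in (0, 1)]
            z = w[0] + w[1]
            return (w[0] / z, w[1] / z)

        alpha = 0.0
        for i in range(n):
            row = 0.0
            for j in range(n):
                if j == i:
                    continue   # the j = i term is zero: the pair agrees off coordinate i
                worst = 0.0
                for x in product((0, 1), repeat=n):
                    y = x[:j] + (1 - x[j],) + x[j + 1:]   # (x, y) in B_j: equal except at j
                    a, b = cond(i, x), cond(i, y)
                    worst = max(worst, 0.5 * (abs(a[0] - b[0]) + abs(a[1] - b[1])))
                row += worst
            alpha = max(alpha, row)
        return alpha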

SLIDES 53–54

Asymptotic Result

  • For any class of distributions with bounded total influence α = O(1)
    – big-O notation is over the number of variables n
  • If O(n) timesteps of sequential Gibbs suffice to achieve arbitrarily small bias
    – measured by ω-sparse variation distance, for fixed ω
  • …then asynchronous Gibbs requires only O(1) additional timesteps to achieve the same bias!

more details, explicit bounds, et cetera in the paper

SLIDE 55

Question

Does asynchronous Gibbs sampling work? …and what does it mean for it to work?

Two desiderata

  • want to get accurate estimates ⇒ bound the bias
  • want to be independent of initial conditions quickly ⇒ bound the mixing time

SLIDES 56–58

Mixing Time

  • How long do we need to run until the samples are independent of initial conditions?
  • The mixing time of a Markov chain is the first time at which the distribution of the sample is close to the stationary distribution.
    – in terms of total variation distance
    – feasible to run MCMC if the mixing time is small

“Folklore”: asynchronous Gibbs has the same mixing time as sequential Gibbs… also not necessarily true!

SLIDES 59–60

Mixing Time Example

[Plot: estimates of P(T_Y > ·) vs. sample number (thousands), comparing sequential Gibbs with HOGWILD! Gibbs at several staleness values τ, against the true distribution.]

τ is a hardware-dependent read staleness parameter.

Sequential Gibbs achieves the correct marginal quickly: t_mix = O(n log n). Asynchronous Gibbs takes much longer: t_mix = exp(Ω(n)).

SLIDES 61–63

Bounding the Mixing Time

Suppose that our target distribution satisfies Dobrushin’s condition (total influence α < 1).

  • Mixing time of sequential Gibbs (known result):

        t_mix-seq(ε) ≤ (n / (1 − α)) · log(n / ε)

  • Mixing time of asynchronous Gibbs:

        t_mix-hog(ε) ≤ ((n + ατ) / (1 − α)) · log(n / ε)

    where τ is a hardware-dependent read staleness parameter.

Takeaway message: we can compare the two mixing time bounds with

    t_mix-hog(ε) ≈ (1 + ατ·n⁻¹) · t_mix-seq(ε)

…they differ by a negligible factor!
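
Plugging illustrative numbers into the two bounds (editorial addition; the values of n, α, τ, ε below are made up) shows how small the gap is:

    import math

    def tmix_seq(n, alpha, eps):
        return n / (1 - alpha) * math.log(n / eps)

    def tmix_hog(n, alpha, tau, eps):
        return (n + alpha * tau) / (1 - alpha) * math.log(n / eps)

    # e.g. n = 10**6 variables, alpha = 0.5, tau = 64, eps = 0.01:
    # the ratio is 1 + alpha*tau/n = 1.000032, a negligible factor.
    print(tmix_hog(10**6, 0.5, 64, 0.01) / tmix_seq(10**6, 0.5, 0.01))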

SLIDE 64

Theory Matches Experiment

[Plot: estimated t_mix of HOGWILD! Gibbs on a large Ising model vs. the expected staleness parameter τ*; the estimates closely track the theoretical prediction.]

SLIDES 65–66

Conclusion

  • Analyzed and modeled asynchronous Gibbs sampling, and identified two success metrics
    – sample bias → how close to the target distribution?
    – mixing time → how long do we need to run?
  • Showed that asynchronicity can cause problems
  • Proved bounds on the effect of asynchronicity
    – using the new sparse variation distance, together with
    – the classical condition of total influence

Thank you!

cdesa@stanford.edu
stanford.edu/~cdesa