SLIDE 1

The Gibbs Sampling Algorithm: with Applications to Change-point Detection and Restricted Boltzmann Machine

Yubo β€œPaul” Yang

  • Jan. 24 2017

  • Change-point model
  • Restricted Boltzmann machine

SLIDE 2

Introduction: History

Donald Geman @ Johns Hopkins

  • 1965 B.A. in English Literature from UIUC
  • 1970 Ph.D. in Mathematics from Northwestern

Stuart Geman @ Brown

  • 1971 B.S. in Physics from UMich
  • 1973 M.S. in Neurophysiology from Dartmouth
  • 1977 Ph.D. in Applied Mathematics from MIT

  • 1984 Gibbs Sampling (IEEE Trans. Pattern Anal. Mach. Intell., 6, 721-741, 1984)
  • 1986 Markov Random Field Image Models (PICM, Ed. A.M. Gleason, AMS, Providence)
  • 1997 Decision Trees and Random Forests (Neural Computation, 9, 1545-1588, 1997) with Y. Amit
SLIDE 3

Gibbs Sampling: One variable at a time

  • One variable at a time
  • Special case of Metropolis-Hastings (MH), i.e. acceptance = 1

Variants:

  • Basic version: sample one variable at a time from its full conditional
  • Block version: sample all independent variables simultaneously
  • Collapsed version: trace over some variables (i.e. ignore them)
  • Samplers within Gibbs: e.g. sample some variables with MH

[Figure] Basic Gibbs sampling from a bivariate normal.
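A minimal sketch of the basic scheme (our function names; the two conditional samplers would be supplied by whatever model is being sampled):

```python
import numpy as np

def gibbs(sample_x1_given_x2, sample_x2_given_x1, x1, x2, nstep):
    """Basic Gibbs: update one variable at a time from its full
    conditional; every proposed move is accepted (acceptance = 1)."""
    chain = np.empty((nstep, 2))
    for i in range(nstep):
        x1 = sample_x1_given_x2(x2)  # x1 ~ P(x1 | x2)
        x2 = sample_x2_given_x1(x1)  # x2 ~ P(x2 | x1)
        chain[i] = x1, x2
    return chain
```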

SLIDE 4

Basic Example: Sample from Bivariate Normal Distribution

Example inspired by: MCMC: The Gibbs Sampler, The Clever Machine, https://theclevermachine.wordpress.com/2012/11/05/mcmc-the-gibbs-sampler/

Q0/ How to sample $x$ from the standard normal distribution $\mathcal{N}(\mu = 0, \sigma = 1)$?

SLIDE 5

Basic Example: Sample from Bivariate Normal Distribution

Q0/ How to sample $x$ from the standard normal distribution $\mathcal{N}(\mu = 0, \sigma = 1)$?

A0/ np.random.randn() samples from $P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$ with $\mu = 0$, $\sigma = 1$.

The bivariate normal distribution is the generalization of the normal distribution to two variables:

$$P(x_1, x_2) = \mathcal{N}(\mu_1, \mu_2, \Sigma) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{z}{2(1-\rho^2)}\right]$$

SLIDE 6

Basic Example: Sample from Bivariate Normal Distribution

For simplicity, let $\mu_1 = \mu_2 = 0$ and $\sigma_1 = \sigma_2 = 1$; then:

$$P(x_1, x_2) = \mathcal{N}(\mu_1, \mu_2, \Sigma) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{z}{2(1-\rho^2)}\right], \qquad \Sigma = \begin{pmatrix} \sigma_1 & \rho \\ \rho & \sigma_2 \end{pmatrix}$$

where

$$z = \frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}$$

and

$$\ln P(x_1, x_2) = -\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)} + \text{const.}$$

Q/ How to sample $x_1, x_2$ from $P(x_1, x_2)$?

SLIDE 7

Basic Example: Sample from Bivariate Normal Distribution

The joint probability distribution of $x_1, x_2$ has log:

$$\ln P(x_1, x_2) = -\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)} + \text{const.}$$

Q/ How to sample $x_1, x_2$ from $P(x_1, x_2)$?

A/ Gibbs sampling:

  • Fix x2, sample x1 from $P(x_1 | x_2)$
  • Fix x1, sample x2 from $P(x_2 | x_1)$
  • Rinse and repeat

SLIDE 8

Basic Example: Sample from Bivariate Normal Distribution

The joint probability distribution of $x_1, x_2$ has log:

$$\ln P(x_1, x_2) = -\frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)} + \text{const.}$$

The full conditional probability distribution of $x_1$ has log:

$$\ln P(x_1 | x_2) = -\frac{x_1^2 - 2\rho x_1 x_2}{2(1-\rho^2)} + \text{const.} = -\frac{(x_1 - \rho x_2)^2}{2(1-\rho^2)} + \text{const.}$$

$$\Rightarrow P(x_1 | x_2) = \mathcal{N}\left(\mu = \rho x_2,\ \sigma = \sqrt{1-\rho^2}\right)$$

In code: new_x1 = np.sqrt(1-rho*rho) * np.random.randn() + rho*x2
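Putting the two conditional updates together gives the full sampler; a minimal sketch, assuming numpy (function and variable names are ours):

```python
import numpy as np

def gibbs_bivariate_normal(rho, nsample=10000, x1=0.0, x2=0.0):
    """Gibbs-sample the zero-mean, unit-variance bivariate normal with
    correlation rho, using P(x1|x2) = N(rho*x2, sqrt(1-rho^2)) and the
    symmetric conditional for x2."""
    sigma = np.sqrt(1.0 - rho * rho)  # std. dev. of each full conditional
    samples = np.empty((nsample, 2))
    for i in range(nsample):
        x1 = sigma * np.random.randn() + rho * x2  # x1 ~ P(x1 | x2)
        x2 = sigma * np.random.randn() + rho * x1  # x2 ~ P(x2 | x1)
        samples[i] = x1, x2
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(samples.T)[0, 1])  # should come out close to 0.8
```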

SLIDE 9

Basic Example: Sample from Bivariate Normal Distribution

[Figure] Fixing x2 shifts the mean of x1 and changes its variance ($\rho = 0.8$).

SLIDE 10

Basic Example: Sample from Bivariate Normal Distribution

The Gibbs sampler produces more correlated (less independent) samples than numpy's built-in multivariate_normal sampler, but is much better than naïve Metropolis (reversible moves, acceptance $A = \min\left(1, \frac{P(\mathbf{x}')}{P(\mathbf{x})}\right)$).

Both Gibbs and Metropolis still fail when correlation is too high.
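For reference, a minimal sketch of that naïve Metropolis sampler for the same target (step size and names are our choices):

```python
import numpy as np

def metropolis_bivariate_normal(rho, nsample=10000, step=0.5):
    """Naive Metropolis: symmetric (reversible) Gaussian proposals on
    (x1, x2), accepted with probability A = min(1, P(x')/P(x))."""
    def logp(x):  # log of the bivariate normal density, up to a constant
        return -(x[0]**2 - 2*rho*x[0]*x[1] + x[1]**2) / (2*(1 - rho**2))
    x = np.zeros(2)
    samples = np.empty((nsample, 2))
    for i in range(nsample):
        xp = x + step * np.random.randn(2)               # propose a move
        if np.log(np.random.rand()) < logp(xp) - logp(x):
            x = xp                                       # accept; otherwise keep x
        samples[i] = x
    return samples
```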

SLIDE 11

Model Example: Train a Change-point Model with Bayesian Inference

Bayesian Inference: Improve β€˜guess’ model with data.

Example inspired by: Ilker Yildirim's notes on Gibbs sampling, http://www.mit.edu/~ilkery/papers/GibbsSampling.pdf

The question that the change-point model answers: when did a change occur in the distribution of a random variable? How to estimate the change point $n$ from observations?

SLIDE 12

Model Example: Train a Change-point Model with Bayesian Inference

  • Change-point model: a particular probability distribution of observables and model parameters (Gamma prior, Poisson likelihood)

$$P(x_0, x_1, \ldots, x_{N-1}, \lambda_1, \lambda_2, n) = \prod_{i=0}^{n-1} \mathrm{Poisson}(x_i; \lambda_1) \prod_{i=n}^{N-1} \mathrm{Poisson}(x_i; \lambda_2)\ \mathrm{Gamma}(\lambda_1; a{=}2, b{=}1)\ \mathrm{Gamma}(\lambda_2; a{=}2, b{=}1)\ \mathrm{Uniform}(n; N)$$

where

$$\mathrm{Poisson}(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \qquad \mathrm{Gamma}(\lambda; a, b) = \frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda}, \qquad \mathrm{Uniform}(n; N) = 1/N$$

[Figure] Graphical model over $\lambda_1$, $\lambda_2$, $n$.

Q/ What is the full conditional probability of $\lambda_2$?

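To make the generative model concrete, a minimal sketch of drawing one synthetic dataset from it (N and the rng are our choices; numpy's gamma takes a shape and a scale = 1/b):

```python
import numpy as np

rng = np.random.default_rng()
N = 50                                  # number of observations (our choice)
lam1 = rng.gamma(shape=2.0, scale=1.0)  # lambda1 ~ Gamma(a=2, b=1)
lam2 = rng.gamma(shape=2.0, scale=1.0)  # lambda2 ~ Gamma(a=2, b=1)
n = rng.integers(N)                     # change point n ~ Uniform(0, ..., N-1)
x = np.concatenate([rng.poisson(lam1, n),       # x_0 ... x_{n-1}
                    rng.poisson(lam2, N - n)])  # x_n ... x_{N-1}
```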
SLIDE 13

Model Example: Train a Change-point Model with Bayesian Inference

  • Change-point model (as defined on the previous slide): Poisson observations with Gamma priors on the rates and a uniform prior on the change point.

  • Without observations, model parameters come from the prior distribution (the guess):

$$P(\lambda_1, \lambda_2, n) = \mathrm{Gamma}(\lambda_1; a{=}2, b{=}1)\ \mathrm{Gamma}(\lambda_2; a{=}2, b{=}1)\ \mathrm{Uniform}(n; N)$$

  • After observations, model parameters should be sampled from the posterior distribution:

$$P(\lambda_1, \lambda_2, n \mid x_0, x_1, \ldots, x_{N-1})$$

Q/ How to sample from the joint posterior distribution of $\lambda_1, \lambda_2, n$?

SLIDE 14

Model Example: Train a Change-point Model with Bayesian Inference

Gibbs sampling requires the full conditionals:

$$\ln P(\lambda_1 \mid \lambda_2, n, \mathbf{x}) = \ln \mathrm{Gamma}\Big(\lambda_1;\ a + \sum_{i=0}^{n-1} x_i,\ b + n\Big)$$

$$\ln P(\lambda_2 \mid \lambda_1, n, \mathbf{x}) = \ln \mathrm{Gamma}\Big(\lambda_2;\ a + \sum_{i=n}^{N-1} x_i,\ b + N - n\Big)$$

$$\ln P(n \mid \lambda_1, \lambda_2, \mathbf{x}) = \mathrm{mess}(n \mid \lambda_1, \lambda_2, \mathbf{x})$$

Q/ How to sample this mess?!

SLIDE 15

Model Example: Train a Change-point Model with Bayesian Inference

Gibbs sampling requires the full conditionals (previous slide): $\lambda_1$ and $\lambda_2$ have conjugate Gamma conditionals, while

$$\ln P(n \mid \lambda_1, \lambda_2, \mathbf{x}) = \mathrm{mess}(n \mid \lambda_1, \lambda_2, \mathbf{x})$$

Q/ How to sample this mess?!

A/ In general: Metropolis within Gibbs. In this case: brute-force $P(n \mid \lambda_1, \lambda_2, \mathbf{x})$ for all $n = 0, \ldots, N-1$, because $N$ is rather small.

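A hedged sketch of the full change-point Gibbs sampler described above (a = 2, b = 1 priors as on the slides; function and variable names are ours):

```python
import numpy as np

def gibbs_changepoint(x, nsweep=5000, a=2.0, b=1.0):
    """Gibbs sampler for the change-point model: conjugate Gamma draws
    for lambda1, lambda2 and a brute-force categorical draw for n."""
    rng = np.random.default_rng()
    N = len(x)
    cum = np.concatenate([[0], np.cumsum(x)])  # cum[n] = sum of x[:n]
    n = N // 2                                 # arbitrary starting change point
    chain = np.empty((nsweep, 3))
    for i in range(nsweep):
        # lambda1 ~ Gamma(a + sum_{i<n} x_i, rate b + n); numpy scale = 1/rate
        lam1 = rng.gamma(a + cum[n], 1.0 / (b + n))
        # lambda2 ~ Gamma(a + sum_{i>=n} x_i, rate b + N - n)
        lam2 = rng.gamma(a + cum[N] - cum[n], 1.0 / (b + N - n))
        # Brute-force P(n | lambda1, lambda2, x) over all n = 0, ..., N-1
        ns = np.arange(N)
        logp = (cum[ns] * np.log(lam1) - ns * lam1
                + (cum[N] - cum[ns]) * np.log(lam2) - (N - ns) * lam2)
        p = np.exp(logp - logp.max())
        n = rng.choice(N, p=p / p.sum())
        chain[i] = lam1, lam2, n
    return chain
```

Run on a synthetic x such as the one drawn earlier, the histogram of the n column should concentrate near the true change point; the prefix sums make each brute-force sweep over n a vectorized O(N) computation.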
SLIDE 16

Model Example: Train a Change-point Model with Bayesian Inference

[Figures] $\lambda_1$ samples from Gibbs and naïve Metropolis; model sampled from the Metropolis sampler.

SLIDE 17

Advanced Example: Train a Binary Restricted Boltzmann Machine on MNIST

Binary Restricted Boltzmann Machine (BRBM):

  • A particular probability distribution of observables and model parameters
  • The "machine" is specified by 2 real (shift) vectors and 1 real (weight) matrix
  • The state of the "machine" is specified by 2 binary vectors (hidden & visible)

$$P(\mathbf{v}, \mathbf{h}; W, \mathbf{b}, \mathbf{c}) = \frac{\exp\left(\mathbf{b}^T\mathbf{v} + \mathbf{c}^T\mathbf{h} + \mathbf{h}^T W \mathbf{v}\right)}{Z}$$

$$Z = \sum_{\mathbf{v},\mathbf{h}} \exp\Big[\sum_{k=0}^{n_{\mathrm{vis}}-1} b_k v_k + \sum_{j=0}^{n_{\mathrm{hid}}-1} c_j h_j + \sum_{j,k} h_j W_{jk} v_k\Big]$$

  • In a binary RBM, $\mathbf{v}, \mathbf{h}$ are vectors of 1s and 0s.

See Dima's presentation for a more detailed description of the RBM: http://algorithm-interest-group.me/algorithm/Boltzmann-Machines-Dima-Kochkov/

[Figure] Visualization of $W$ with $n_{\mathrm{hid}} = 3$, $n_{\mathrm{vis}} = 4$.

SLIDE 18

Advanced Example: Train a Binary Restricted Boltzmann Machine on MNIST

Binary Restricted Boltzmann Machine (BRBM): since $\mathbf{v}$ and $\mathbf{h}$ are vectors of 1s and 0s, the full conditionals are simple:

$$\frac{P(v_k = 1 \mid *)}{P(v_k = 0 \mid *)} = \frac{\left.\exp\left(\mathbf{b}^T\mathbf{v} + \mathbf{c}^T\mathbf{h} + \mathbf{h}^T W \mathbf{v}\right)\right|_{v_k=1}}{\left.\exp\left(\mathbf{b}^T\mathbf{v} + \mathbf{c}^T\mathbf{h} + \mathbf{h}^T W \mathbf{v}\right)\right|_{v_k=0}} = \exp\Big(b_k + \sum_j h_j W_{jk}\Big)$$

$$P(v_k = 1 \mid *) = \frac{P(v_k = 1 \mid *)}{P(v_k = 1 \mid *) + P(v_k = 0 \mid *)} = \frac{1}{1 + \exp\big(-b_k - \sum_j h_j W_{jk}\big)} = \mathrm{sigmoid}\Big(b_k + \sum_j h_j W_{jk}\Big)$$

Notice there is no matrix element coupling the $v_k$ to each other (that is the "restricted" part), so all visible units can be updated at once:

$$P(\mathbf{v} = 1 \mid *) = \mathrm{sigmoid}(\mathbf{b} + W^T \mathbf{h})$$

That is: we can sample a binary RBM efficiently with block Gibbs sampling!
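A minimal sketch of that block Gibbs update (shapes and names are our choices: W is nhid × nvis, v and h are 0/1 vectors):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_gibbs(W, b, c, v, nsweep=1, rng=None):
    """Alternate h and v updates: all units in a block are conditionally
    independent, so each block is sampled in one vectorized draw."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(nsweep):
        h = (rng.random(len(c)) < sigmoid(c + W @ v)).astype(float)    # h ~ P(h|v)
        v = (rng.random(len(b)) < sigmoid(b + W.T @ h)).astype(float)  # v ~ P(v|h)
    return v, h

# Example: 3 hidden, 4 visible units as in the slide's cartoon
rng = np.random.default_rng()
W, b, c = rng.normal(size=(3, 4)), np.zeros(4), np.zeros(3)
v, h = block_gibbs(W, b, c, v=rng.integers(0, 2, 4).astype(float))
```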

SLIDE 19

Advanced Example: Train a Binary Restricted Boltzmann Machine on MNIST

Q/ How to "train" a BRBM?

  • Q1/ What is the outcome/goal of "training"?
  • Q2/ What are the inputs in a "training"?
  • Q3/ What does it mean to "train"?
  • Q4/ What changes in the "training"?

MNIST database: 70,000 handwritten digits from 0 to 9. Each picture has 28×28 gray-scale pixels {0,1,…,255}. For input into the BRBM, scale to [0,1.0) and cut off at 0.5. nvis = 28×28 = 784.
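That preprocessing might look like the following sketch (dividing by 256 to land in [0,1.0) is our reading of the slide; the stand-in pixel array is hypothetical):

```python
import numpy as np

pixels = np.random.randint(0, 256, size=784)  # stand-in for one MNIST image
v = (pixels / 256.0 >= 0.5).astype(float)     # scale to [0,1), cut off at 0.5
```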

SLIDE 20

Advanced Example: Train a Binary Restricted Boltzmann Machine on MNIST

Q/ How to "train" a BRBM?

  • Q1/ What is the outcome/goal of "training"? A1/ A joint probability distribution of 784 Bernoulli random variables which favors configurations that look like digits, i.e. we want $P(\mathbf{v} \mid *)$ to represent the data.
  • Q2/ What are the inputs in a "training"? A2/ $\mathbf{v}_s$, s = 1, 2, …, ndata. Each $\mathbf{v}_s$ is a vector of 784 0s and 1s.
  • Q3/ What does it mean to "train"? A3/ Increase the probability $P(\mathbf{v}_s \mid *)$.
  • Q4/ What changes in the "training"? A4/ The "machine", specifically $\{\mathbf{b}, \mathbf{c}, W\}$.

A/ Increase $P(\mathbf{v}_s \mid *)$ for all $s$ by changing $\{\mathbf{b}, \mathbf{c}, W\}$.

SLIDE 21

Advanced Example: Train a Binary Restricted Boltzmann Machine on MNIST

$$P(\mathbf{v} = 1 \mid *) = \mathrm{sigmoid}(\mathbf{b} + W^T \mathbf{h}), \qquad P(\mathbf{h} = 1 \mid *) = \mathrm{sigmoid}(\mathbf{c} + W \mathbf{v})$$

  • Gradient of the cost function (ref: http://deeplearning.net/tutorial/rbm.html):

$$\frac{\partial \ln P}{\partial W_{jk}} = \langle h_j v_k \rangle_{\mathrm{data}} - \langle h_j v_k \rangle_{\mathrm{model}}$$

  • Training procedure: Contrastive Divergence (a.k.a. shitty steepest descent)

G.E. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, Neural Networks: Tricks of the Trade, vol. 7700, pp 599-619, 2010.
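A hedged sketch of one CD-1 update on a single data vector, following the gradient above (the one-step reconstruction stands in for the model average; learning rate and names are our choices):

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, b, c, v_data, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) step: estimate <h v>_model from
    a single block-Gibbs step started at the data. Arrays are updated in place."""
    if rng is None:
        rng = np.random.default_rng()
    # Positive phase: hidden probabilities driven by the data.
    ph_data = _sigmoid(c + W @ v_data)
    h = (rng.random(len(c)) < ph_data).astype(float)
    # Negative phase: reconstruct v, then recompute hidden probabilities.
    pv = _sigmoid(b + W.T @ h)
    v_model = (rng.random(len(b)) < pv).astype(float)
    ph_model = _sigmoid(c + W @ v_model)
    # Gradient ascent on ln P: <h v>_data - <h v>_model (outer products).
    W += lr * (np.outer(ph_data, v_data) - np.outer(ph_model, v_model))
    b += lr * (v_data - v_model)
    c += lr * (ph_data - ph_model)
    return W, b, c
```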

SLIDE 22

Advanced Example: Train a Binary Restricted Boltzmann Machine on MNIST

[Figures] Shift vector $\mathbf{b}$ for visible units; rows of weight matrix $W$ (ordered by shift vector $\mathbf{c}$ for hidden units); BRBM samples after training.

SLIDE 23

Advanced Example: Train a Binary Restricted Boltzmann Machine on MNIST

Q/ What number is this?

SLIDE 24

Conclusions:

Pros:

  • The Gibbs sampling technique draws samples from a multivariate probability distribution by sampling the full conditional of each variable in turn.
  • Independent variables can be sampled simultaneously, making block Gibbs sampling highly efficient for certain distributions.

Cons:

  • Calculating full conditionals may be intractable and error-prone.
  • Fails when random variables are nearly perfectly correlated.

$$P(\mathbf{v} = 1 \mid *) = \mathrm{sigmoid}(\mathbf{b} + W^T \mathbf{h}), \qquad P(\mathbf{h} = 1 \mid *) = \mathrm{sigmoid}(\mathbf{c} + W \mathbf{v})$$

SLIDE 25

References

Bivariate Normal Distribution:

  • MCMC: The Gibbs Sampler, The Clever Machine
  • Bayesian Inference: Metropolis-Hasting Sampling, Ilker Yildirim

Change-point Model:

  • Bayesian Inference: Gibbs Sampling, Ilker Yildirim

Restricted Boltzmann Machine:

  • A Practical Guide to Training Restricted Boltzmann Machines, Geoffrey E. Hinton
  • deeplearning.net
  • Introduction to Restricted Boltzmann Machines, Edwin Chen