SLIDE 1

Introduction to Machine Learning

Undirected Graphical Models

Barnabás Póczos

SLIDE 2

Credits

Many of these slides are taken from Ruslan Salakhutdinov, Hugo Larochelle, & Eric Xing

  • http://www.dmi.usherb.ca/~larocheh/neural_networks
  • http://www.cs.cmu.edu/~rsalakhu/10707/
  • http://www.cs.cmu.edu/~epxing/Class/10708/

Reading material:

  • http://www.cs.cmu.edu/~rsalakhu/papers/Russ_thesis.pdf
  • Section 30.1 of Information Theory, Inference, and Learning Algorithms by David MacKay

  • http://www.stat.cmu.edu/~larry/=sml/GraphicalModels.pdf

SLIDE 3

Undirected Graphical Models = Markov Random Fields

Probabilistic graphical models are a powerful framework for representing the dependency structure between random variables. A Markov network (or undirected graphical model) is a set of random variables with a dependency structure described by an undirected graph.

[Figure: semantic labeling example]

SLIDE 4

Cliques

Clique: a subset of nodes such that there exists a link between all pairs of nodes in the subset.

Maximal clique: a clique such that it is not possible to include any other node in the set without it ceasing to be a clique.

The example graph has two maximal cliques, plus several other, smaller cliques [listed in the figure].

SLIDE 5

Undirected Graphical Models = Markov Random Fields

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are useful for expressing dependencies between random variables. The joint distribution defined by the graph is given by the product of non-negative potential functions over the maximal cliques (fully connected subsets of nodes). In this example, the joint distribution factorizes as shown below.
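The formula itself was lost in extraction; in general, with potentials ψ_C over the maximal cliques C, it has the standard form:

```latex
p(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C),
\qquad
Z = \sum_{x} \prod_{C} \psi_C(x_C)
```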

SLIDE 6

Markov Random Fields (MRFs)

• Each potential function is a mapping from the joint configurations of the random variables in a maximal clique to the non-negative real numbers.
• The choice of potential functions is not restricted to those having a specific probabilistic interpretation.

A convenient choice is the exponential (Boltzmann) form, where E(x) is called an energy function.
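The equations were lost in extraction; the standard Boltzmann form they describe is:

```latex
\psi_C(x_C) = \exp\{-E(x_C)\}
\quad\Rightarrow\quad
p(x) = \frac{1}{Z}\exp\Big\{-\sum_{C} E(x_C)\Big\}
```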

SLIDE 7

Conditional Independence


Theorem:

It follows that the undirected graphical structure represents conditional independence:
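Both the theorem and the independence statement were equations in the original deck; the standard form (the global Markov property) reads:

```latex
\text{If every path between node sets } A \text{ and } B \text{ passes through } C,
\text{ then } x_A \perp\!\!\!\perp x_B \mid x_C,
\quad\text{i.e.}\quad
p(x_A, x_B \mid x_C) = p(x_A \mid x_C)\, p(x_B \mid x_C)
```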

SLIDE 8

MRFs with Hidden Variables


For many interesting problems we need to introduce hidden (latent) variables.

• Our random variables will contain both visible and hidden variables: x = (v, h).
• Computing the partition function Z is intractable.
• Computing the summation over the hidden variables is intractable.
• Parameter learning is very challenging.
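In symbols (a reconstruction, using the energy notation from the previous slide), the visible marginal requires both intractable sums:

```latex
p(v) = \sum_{h} p(v, h) = \frac{1}{Z} \sum_{h} \exp\{-E(v, h)\},
\qquad
Z = \sum_{v, h} \exp\{-E(v, h)\}
```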

SLIDE 9

Boltzmann Machines

Definition [Boltzmann machines]: MRFs with maximum clique size two [pairwise (edge) potentials] on binary-valued nodes are called Boltzmann machines. The joint probabilities are given below; the parameter θij measures the dependence of xi on xj, conditioned on the other nodes.
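The joint-probability formula was lost in extraction; the standard Boltzmann machine form (here written with bias terms θi) is:

```latex
p(x \mid \theta) = \frac{1}{Z(\theta)}
\exp\Big\{ \sum_{i < j} \theta_{ij}\, x_i x_j + \sum_i \theta_i\, x_i \Big\}
```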

SLIDE 10

Boltzmann Machines

Theorem: One can prove that in Boltzmann machines the conditional distribution of one node conditioned on the others is given by the logistic function:
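The logistic conditional did not survive extraction; for binary nodes xi ∈ {0, 1}, the standard statement is:

```latex
p(x_i = 1 \mid x_{-i})
= \sigma\Big( \sum_{j \neq i} \theta_{ij}\, x_j + \theta_i \Big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```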


Proof:

SLIDE 11

Boltzmann Machines


Proof [Continued]: Q.E.D.

SLIDE 12

Example: Image Denoising


Let us look at the example of noise removal from a binary image. The image is an array of {−1, +1} pixel values.

• We take the original noise-free image (x) and randomly flip the sign of each pixel with a small probability. This process creates the noisy image (y).
• Our goal is to estimate the original image x from the noisy observations y.
• We model the joint distribution with an MRF over x and y (see the sketch below).
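The joint model was lost in extraction; a standard choice for this denoising example is an Ising-type energy with hypothetical coupling parameters β (neighboring pixels) and η (pixel-to-observation):

```latex
p(x, y) = \frac{1}{Z} \exp\{-E(x, y)\},
\qquad
E(x, y) = -\,\beta \sum_{\{i,j\}\ \text{neighbors}} x_i x_j \;-\; \eta \sum_i x_i y_i
```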

SLIDE 13

Inference: Iterated Conditional Modes (ICM)

Goal: find the most probable image x given the noisy observations y.

Solution: coordinate-wise descent on the energy, updating one pixel at a time (a sketch follows below).
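A minimal sketch of ICM for the denoising model above, in Python (NumPy), assuming a 4-neighbor grid and the hypothetical β, η parameters from the energy sketch; function and variable names are mine, not from the deck:

```python
import numpy as np

def icm_denoise(y, beta=2.0, eta=1.0, n_sweeps=10):
    """Iterated Conditional Modes for the Ising denoising model.

    y : 2-D array of observed pixels in {-1, +1}.
    Greedily decreases E(x, y) = -beta * sum_edges x_i x_j - eta * sum_i x_i y_i
    by setting one pixel at a time to its locally optimal value.
    """
    x = y.copy()                       # initialize with the noisy image
    H, W = x.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # Sum of the 4-neighborhood (missing neighbors count as 0).
                nb = 0.0
                if i > 0:     nb += x[i - 1, j]
                if i < H - 1: nb += x[i + 1, j]
                if j > 0:     nb += x[i, j - 1]
                if j < W - 1: nb += x[i, j + 1]
                # Local field; x[i, j] = sign(field) minimizes the local energy.
                field = beta * nb + eta * y[i, j]
                x[i, j] = 1 if field >= 0 else -1
    return x
```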

SLIDE 14

Gaussian MRFs

  • The information matrix is sparse, but the covariance matrix is not.
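The density was an equation on the slide; in the standard information form (J the information/precision matrix, h the potential vector):

```latex
p(x) \propto \exp\Big\{ -\tfrac{1}{2}\, x^\top J x + h^\top x \Big\},
\qquad
J_{ij} = 0 \ \text{whenever } (i, j) \text{ is not an edge},
\qquad
\Sigma = J^{-1} \ \text{(generally dense)}
```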

SLIDE 15

Restricted Boltzmann Machines

SLIDE 16

Restricted Boltzmann Machines

Restricted = no connections within the hidden layer and no connections within the visible layer (the graph is bipartite: only visible-hidden connections).

Partition function (intractable):
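The definitions were lost in extraction; the standard binary RBM energy, joint distribution, and (intractable) partition function, with weights W and biases b, c, are:

```latex
E(v, h) = -\,b^\top v - c^\top h - v^\top W h,
\qquad
p(v, h) = \frac{\exp\{-E(v, h)\}}{Z},
\qquad
Z = \sum_{v, h} \exp\{-E(v, h)\}
```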

SLIDE 17

Gaussian-Bernoulli RBM

[The energy is quadratic in v and linear in h.]
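The energy function itself did not survive extraction; the standard Gaussian-Bernoulli RBM energy (real-valued visibles vi with variances σi², binary hiddens hj), indeed quadratic in v and linear in h, is:

```latex
E(v, h) = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2}
\;-\; \sum_j c_j h_j
\;-\; \sum_{i, j} \frac{v_i}{\sigma_i}\, W_{ij}\, h_j
```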

SLIDE 18

Possible Tasks with RBM

Tasks:
• Inference
• Evaluating the likelihood function
• Sampling from the RBM
• Training the RBM

SLIDE 19

Inference

SLIDE 20

Inference

Theorem: Inference in an RBM is simple: the conditional distributions are logistic functions, for p(h | v) and, similarly, for p(v | h).
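The two conditionals were lost in extraction; for the binary RBM defined above they factorize over units and read:

```latex
p(h_j = 1 \mid v) = \sigma\Big( c_j + \sum_i W_{ij} v_i \Big),
\qquad
p(v_i = 1 \mid h) = \sigma\Big( b_i + \sum_j W_{ij} h_j \Big)
```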

SLIDE 21


Proof:

SLIDE 22


Proof [Continued]: Q.E.D.

SLIDE 23

Evaluating the Likelihood

SLIDE 24

Calculating the Likelihood of an RBM

Theorem: Calculating the likelihood is simple in an RBM (apart from the partition function): the hidden units can be summed out analytically, yielding the free energy.
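The formulas were lost in extraction; summing out the binary hidden units gives the standard free-energy form:

```latex
p(v) = \frac{e^{-F(v)}}{Z},
\qquad
F(v) = -\,b^\top v \;-\; \sum_j \log\Big( 1 + \exp\Big\{ c_j + \sum_i W_{ij} v_i \Big\} \Big)
```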

SLIDE 25

Proof:


Q.E.D.

SLIDE 26

Sampling

SLIDE 27

Sampling from p(x,h) in RBM

Sampling is tricky… it is much easier in directed graphical models. Here we will use Gibbs sampling.

Goal: generate samples from the joint distribution p(x, h), alternating between the two conditionals p(h | x) and, similarly, p(x | h).

SLIDE 28

Gibbs Sampling: The Problem

Our goal is to generate samples from a joint distribution over several variables. Suppose that we can generate samples from each variable's conditional distribution given all the others.

SLIDE 29


Gibbs Sampling: Pseudo Code
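The pseudocode itself did not survive extraction; below is a minimal sketch of the systematic-scan Gibbs sampler the slide describes. The sample_conditional callback (a hypothetical name) stands for the assumed ability to sample each full conditional:

```python
import numpy as np

def gibbs_sampler(x0, sample_conditional, n_samples, burn_in=100):
    """Systematic-scan Gibbs sampling.

    x0                 : initial state (length-d array)
    sample_conditional : function (i, x) -> one draw from p(x_i | x_{-i})
    Returns n_samples (approximate) draws from the joint distribution.
    """
    x = np.array(x0, dtype=float)
    d = len(x)
    samples = []
    for t in range(burn_in + n_samples):
        # One sweep: resample every coordinate given the current others.
        for i in range(d):
            x[i] = sample_conditional(i, x)
        if t >= burn_in:              # discard the burn-in sweeps
            samples.append(x.copy())
    return np.array(samples)
```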

SLIDE 30


Gibbs Sampling

SLIDE 31


Training

SLIDE 32


RBM Training

Training is complicated…

To train an RBM, we would like to minimize the negative log-likelihood function. To solve this, we use stochastic gradient descent.

Theorem: the gradient of the log-likelihood splits into a positive phase and a negative phase (the latter is hard to compute):
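The theorem's equation was lost in extraction; for an energy-based model it reads (a standard reconstruction, with θ any parameter and (ṽ, h̃) drawn from the model):

```latex
\frac{\partial \log p(v)}{\partial \theta}
= \underbrace{-\,\mathbb{E}_{p(h \mid v)}\!\left[ \frac{\partial E(v, h)}{\partial \theta} \right]}_{\text{positive phase}}
\;+\;
\underbrace{\mathbb{E}_{p(\tilde v, \tilde h)}\!\left[ \frac{\partial E(\tilde v, \tilde h)}{\partial \theta} \right]}_{\text{negative phase (hard to compute)}}
```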

SLIDE 33


RBM Training

Proof:

SLIDE 34


RBM Training

Proof [Continued]:

The derivative splits into a first and a second term. First term: difficult to calculate the expectation (this is the negative phase).

SLIDE 35


RBM Training

Proof [Continued]:

SLIDE 36


RBM Training

Proof [Continued]:

Second term: the conditionals are independent logistic distributions, so this expectation can be computed exactly (this is the positive phase). Q.E.D.

SLIDE 37

RBM Training

[Derivation step: the equations ("Since …, we need to calculate …, where …") were lost in extraction.]

SLIDE 38

RBM Training

The second term is more tricky: approximate the expectation with a single sample.
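In symbols (a reconstruction): the model expectation is replaced by a single sample (v⁽ᵏ⁾, h⁽ᵏ⁾) obtained by k steps of Gibbs sampling started at the data:

```latex
\mathbb{E}_{p(\tilde v, \tilde h)}\!\left[ \frac{\partial E(\tilde v, \tilde h)}{\partial \theta} \right]
\;\approx\;
\frac{\partial E\big(v^{(k)}, h^{(k)}\big)}{\partial \theta},
\qquad
\big(v^{(k)}, h^{(k)}\big) \ \text{from } k \text{ Gibbs steps started at the data } v
```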

SLIDE 39

RBM Training

[Figure: one block-Gibbs step between the visible and hidden layers; both conditionals are logistic.]

SLIDE 40


CD-k (Contrastive Divergence) Pseudocode
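The pseudocode was lost in extraction; a minimal sketch of CD-k for a binary RBM, under the energy convention used above (weights W, biases b, c; function and variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(v_data, W, b, c, k=1, lr=0.01, rng=None):
    """One CD-k parameter update for a binary RBM with energy
    E(v, h) = -b.v - c.h - v.W.h.

    v_data : one training vector in {0, 1}^n_v
    W      : (n_v, n_h) weight matrix;  b, c : visible / hidden biases
    """
    if rng is None:
        rng = np.random.default_rng()

    # Positive phase: p(h = 1 | v_data) is exact (independent logistics).
    ph_pos = sigmoid(c + v_data @ W)

    # Negative phase: k steps of block Gibbs sampling, started at the data.
    v = v_data.copy()
    for _ in range(k):
        ph = sigmoid(c + v @ W)
        h = (rng.random(ph.shape) < ph).astype(float)    # h ~ p(h | v)
        pv = sigmoid(b + h @ W.T)
        v = (rng.random(pv.shape) < pv).astype(float)    # v ~ p(v | h)
    ph_neg = sigmoid(c + v @ W)

    # Stochastic gradient step: data statistics minus approximate model statistics.
    W += lr * (np.outer(v_data, ph_pos) - np.outer(v, ph_neg))
    b += lr * (v_data - v)
    c += lr * (ph_pos - ph_neg)
    return W, b, c
```

With k = 1 this is the classic CD-1 update; larger k gives a less biased but more expensive estimate of the negative phase.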

SLIDE 41

Results

SLIDE 42

RBM Training Results

http://deeplearning.net/tutorial/rbm.html

Samples generated by the RBM after training. Each row represents a mini-batch of negative particles (samples from independent Gibbs chains); 1000 steps of Gibbs sampling were taken between each of those rows.

[Figures: original images; learned filters.]

SLIDE 43

Summary

Tasks:
• Inference
• Evaluating the likelihood function
• Sampling from the RBM
• Training the RBM

SLIDE 44

Thanks for your Attention!

SLIDE 45


RBM Training Results

SLIDE 46


Gaussian-Bernoulli RBM Training Results

Each document (story) is represented with a bag of words coming from a multinomial distribution whose parameters are determined by h (h = topics). After training we can generate words from these topics.