Deep Learning Jiseob Kim (jkim@bi.snu.ac.kr) Artificial Intelligence Class of 2016 Spring - PowerPoint PPT Presentation


slide-1
SLIDE 1

1

Deep Learning

Jiseob Kim (jkim@bi.snu.ac.kr) Artificial Intelligence Class of 2016 Spring

  • Dept. of Computer Science and Engineering

Seoul National University

slide-2
SLIDE 2

Neural network Back propagation

1986

 Solve general learning problems
 Tied with biological systems

But it was given up…

  • Hard to train
  • Insufficient computational resources
  • Small training sets
  • Does not work well

2006

  • SVM
  • Boosting
  • Decision tree
  • KNN
  • Loose tie with biological systems
  • Shallow model
  • Specific methods for specific tasks

– Hand crafted features (GMM-HMM, SIFT, LBP, HOG)

Kruger et al. TPAMI’13

Deep belief net (Science, 2006)

  • Unsupervised & Layer-wise pre-training
  • Better designs for modeling and training

(normalization, nonlinearity, dropout)

  • Feature learning
  • New development of computer architectures

– GPU – Multi-core computer systems

  • Large scale databases

Deep learning results

Speech: 2011, 2012

How Many Computers to Identify a Cat? 16,000 CPU cores

ImageNet ranking (object recognition over 1,000,000 images and 1,000 categories, 2 GPUs):

  • Rank 1: U. Toronto, error rate 0.15315, Deep Conv Net
  • Rank 2: U. Tokyo, error rate 0.26172, hand-crafted features and learning models (bottleneck)
  • Rank 3: U. Oxford, error rate 0.26979
  • Rank 4: Xerox/INRIA, error rate 0.27058

History of Neural Network Research

Slides from Wanli Ouyang wlouyang@ee.cuhk.edu.hk

slide-3
SLIDE 3

Contents

 Neural Networks

  • Multilayer Perceptron Structure
  • Learning Algorithm based on Back Propagation
  • Deep Belief Network
  • Restricted Boltzmann Machines
  • Deep Learning (Deep Belief Network)

 Convolutional Neural Networks (CNN)

  • CNN Structure and Learning
  • Applications
slide-4
SLIDE 4

NEURAL NETWORKS

Slides by Jiseob Kim jkim@bi.snu.ac.kr

slide-5
SLIDE 5

Classification Problem

 Problem of finding label Y given data X

  • ex1) x: face image,

y: person’s name

  • ex2) x: blood sugar measurement, blood pressure, …

y: diagnosis of diabetes

  • ex3) x: voice,

y: sentence corresponding to the voice

 x: D-dimensional vector, y: integer (Discrete)
 Famous pattern recognition algorithms

  • Support Vector Machine
  • Decision Tree
  • K-Nearest Neighbor
  • Multi-Layer Perceptron (Artificial Neural Network)
slide-6
SLIDE 6

Perceptron (1/2)

slide-7
SLIDE 7

Perceptron (2/2)

Inputs x1, x2 and a constant input 1, with weights w1, w2 and bias b.

Decision boundary: w1*x1 + w2*x2 + b = 0

  • w1*x1 + w2*x2 + b > 0: one class
  • w1*x1 + w2*x2 + b < 0: the other class

slide-8
SLIDE 8

Parameter Learning in Perceptron

start: The weight vector w is generated randomly.
test: A vector x ∈ P ∪ N is selected randomly.
  • If x ∈ P and w·x > 0, go to test.
  • If x ∈ P and w·x ≤ 0, go to add.
  • If x ∈ N and w·x < 0, go to test.
  • If x ∈ N and w·x ≥ 0, go to subtract.
add: Set w = w + x, go to test.
subtract: Set w = w - x, go to test.
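The start/test/add/subtract loop can be sketched in NumPy. The toy dataset, iteration cap, and random seed below are illustrative assumptions, not part of the slide; convergence needs linearly separable data.

```python
import numpy as np

def train_perceptron(P, N, max_iters=1000, seed=0):
    """Perceptron learning following the start/test/add/subtract loop:
    P holds positive vectors, N holds negative vectors."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=P.shape[1])            # start: random weight vector
    all_x = np.vstack([P, N])
    labels = np.array([1] * len(P) + [-1] * len(N))
    for _ in range(max_iters):
        i = rng.integers(len(all_x))           # test: random x from P ∪ N
        x, y = all_x[i], labels[i]
        if y == 1 and w @ x <= 0:
            w = w + x                          # add
        elif y == -1 and w @ x >= 0:
            w = w - x                          # subtract
    return w

# Toy, linearly separable data (an assumption made for convergence)
P = np.array([[1.0, 1.0], [2.0, 1.5]])
N = np.array([[-1.0, -1.0], [-2.0, -0.5]])
w = train_perceptron(P, N)
```

Once w separates the two sets, no sample triggers an update, so the final w is a separating weight vector.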

slide-9
SLIDE 9

Sigmoid Unit

Classic Perceptron vs. Sigmoid Unit

The sigmoid function is differentiable: σ'(x) = σ(x)(1 - σ(x))

slide-10
SLIDE 10

Learning Algorithm of Sigmoid Unit

 Loss Function: E = (d - f)^2, where d is the target and f is the unit output
 Gradient Descent Update

f(s) = 1 / (1 + e^(-s)), f'(s) = f(s)(1 - f(s))

∂E/∂W = -2(d - f) f(1 - f) X

W ← W + c (d - f) f(1 - f) X
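A minimal sketch of this gradient-descent update for a single sigmoid unit. The toy input vector, target, learning rate, and iteration count are assumptions; the constant 2 from the gradient is kept explicit here rather than absorbed into the learning rate.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sgd_step(W, x, d, lr=0.5):
    """One gradient-descent step for a single sigmoid unit with squared
    loss E = (d - f)^2, using dE/dW = -2 (d - f) f (1 - f) x."""
    f = sigmoid(W @ x)
    grad = -2.0 * (d - f) * f * (1.0 - f) * x
    return W - lr * grad

W = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])
losses = []
for _ in range(200):
    f = sigmoid(W @ x)
    losses.append((1.0 - f) ** 2)   # squared loss against target d = 1
    W = sgd_step(W, x, 1.0)
```

The recorded losses shrink toward zero as f approaches the target.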

slide-11
SLIDE 11

Need for Multiple Units and Multiple Layers

 Multiple boundaries are needed (e.g. XOR problem)  Multiple Units
 More complex regions are needed (e.g. Polygons)  Multiple Layers

slide-12
SLIDE 12

Structure of Multilayer Perceptron (MLP; Artificial Neural Network)

Input Output

slide-13
SLIDE 13

Learning Parameters of MLP

 Loss Function

  • We have the same Loss Function: E = (d - f)^2 (d: target, f: unit output)
  • But there are now many more parameters (a weight for each layer and each unit)
  • To use Gradient Descent, we need to calculate the gradient for all the parameters

 Recursive Computation of Gradients

  • Computation of the loss-gradient of the top-layer weights is the same as before
  • Using the chain rule, we can compute the loss-gradient of lower-layer weights recursively (Back Propagation)

slide-14
SLIDE 14

Back Propagation Learning Algorithm (1/3)

 Gradients of top-layer weights and update rule
 Store the intermediate value delta for later use in the chain rule

δ = ∂E/∂s = -2(d - f) ∂f/∂s = -2(d - f) f(1 - f), where E = (d - f)^2

∂E/∂W = -2(d - f) (∂f/∂s) X = -2(d - f) f(1 - f) X

Gradient Descent update rule: W ← W + c (d - f) f(1 - f) X (the constant 2 is absorbed into the learning rate c)

slide-15
SLIDE 15

Back Propagation Learning Algorithm (2/3)

 Gradients of lower-layer weights

  • Weighted sum (pre-activation): s_i^(j) = W_i^(j) · X^(j-1)
  • Local gradient: δ_i^(j) = ∂E/∂s_i^(j)
  • Weight gradient: ∂E/∂W_i^(j) = δ_i^(j) X^(j-1)

Gradient Descent update rule for lower-layer weights: W_i^(j) ← W_i^(j) - c δ_i^(j) X^(j-1)

slide-16
SLIDE 16

Back Propagation Learning Algorithm (3/3)

 Applying the chain rule, a recursive relation between the deltas:

δ_i^(j) = f_i^(j) (1 - f_i^(j)) Σ_{l=1}^{m} w_li^(j+1) δ_l^(j+1)

Algorithm: Back Propagation

  • 1. Randomly Initialize weight parameters
  • 2. Calculate the activations of all units (with input data)
  • 3. Calculate top-layer delta
  • 4. Back-propagate delta from top to the bottom
  • 5. Calculate actual gradient of all units using delta’s
  • 6. Update weights using Gradient Descent rule
  • 7. Repeat 2~6 until convergence
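The seven steps above can be sketched as a tiny two-layer MLP trained on XOR, the classic example of a problem needing multiple layers. The layer sizes, learning rate, seed, and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

# 1. randomly initialize weight parameters
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

def loss():
    H = sigmoid(X @ W1 + b1)
    F = sigmoid(H @ W2 + b2)
    return float(((D - F) ** 2).sum())

loss_init = loss()
lr = 1.0
for _ in range(5000):
    # 2. calculate the activations of all units
    H = sigmoid(X @ W1 + b1)
    F = sigmoid(H @ W2 + b2)
    # 3. top-layer delta for the squared loss (d - f)^2
    delta2 = -2.0 * (D - F) * F * (1.0 - F)
    # 4. back-propagate delta from top to bottom
    delta1 = (delta2 @ W2.T) * H * (1.0 - H)
    # 5-6. gradients from the deltas, then gradient-descent updates
    W2 -= lr * (H.T @ delta2); b2 -= lr * delta2.sum(0)
    W1 -= lr * (X.T @ delta1); b1 -= lr * delta1.sum(0)
loss_final = loss()
```

With these settings the network typically learns XOR; at minimum, the loss decreases from its initial value.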
slide-17
SLIDE 17

Applications

 Almost All Classification Problems

  • Face Recognition
  • Object Recognition
  • Voice Recognition
  • Spam mail Detection
  • Disease Detection
  • etc.
slide-18
SLIDE 18

Limitations and Breakthrough

 Limitations

  • Back Propagation barely changes lower-layer parameters (Vanishing Gradient)
  • Therefore, Deep Networks cannot be fully (effectively) trained with Back Propagation

 Breakthrough

  • Deep Belief Networks (Unsupervised Pre-training)
  • Convolutional Neural Networks (Reducing Redundant Parameters)
  • Rectified Linear Unit (Constant Gradient Propagation)

(Diagram: Input x → Output y'; the error against Target y is back-propagated through the layers)

slide-19
SLIDE 19

DEEP BELIEF NETWORKS

Slides by Jiseob Kim jkim@bi.snu.ac.kr

slide-20
SLIDE 20

Motivation

 Idea:

  • Greedy Layer-wise training
  • Pre-training + Fine tuning
  • Contrastive Divergence
slide-21
SLIDE 21

Restricted Boltzmann Machine (RBM)

 Energy-Based Model
 Energy function

  • E(x,h) = b'x + c'h + h'Wx

Joint (x, h) probability: P(x,h) = e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h))

Marginal (x) probability, or likelihood: P(x) = Σ_h e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h))

Conditional probability: P(h|x) = Π_i P(h_i|x), P(x|h) = Π_j P(x_j|h)

P(x_j = 1|h) = σ(bj + W’·j · h)
P(h_i = 1|x) = σ(ci + Wi· · x)

(Diagram: bipartite graph between visible units x1..x3 and hidden units h1..h5, connected by weights W)

Remark:

  • Conditional Independence
  • The Conditional Probability is the same as in a Neural Network
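The two conditionals can be sketched directly. The toy layer sizes, weight scale, and seed are assumptions; the sketch follows the conditionals as written on the slide (a sigmoid of a positively weighted sum).

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def sample_h_given_x(x, W, c, rng):
    """P(h_i = 1 | x) = sigmoid(c_i + W_i · x): given the visibles,
    the hidden units are conditionally independent."""
    p = sigmoid(c + W @ x)
    return (rng.random(p.shape) < p).astype(float), p

def sample_x_given_h(h, W, b, rng):
    """P(x_j = 1 | h) = sigmoid(b_j + W'_j · h): given the hiddens,
    the visible units are conditionally independent."""
    p = sigmoid(b + W.T @ h)
    return (rng.random(p.shape) < p).astype(float), p

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))   # 5 hidden, 3 visible (toy sizes)
b = np.zeros(3)
c = np.zeros(5)
x = np.array([1.0, 0.0, 1.0])
h, ph = sample_h_given_x(x, W, c, rng)
```

Each hidden unit is sampled independently from its own Bernoulli probability, exactly as in a neural-network forward pass plus a coin flip.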

slide-22
SLIDE 22

Unsupervised Learning of RBM

 Maximum Likelihood

  • Use Gradient Descent

L(X;θ) = Σ_h e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h))

∂L(X;θ)/∂w_ij = ∫ p(x;θ) (∂log f(x;θ)/∂θ) dx - (1/K) Σ_{k=1}^K ∂log f(x^(k);θ)/∂θ

= <x_i h_j>_{p(x,θ)} - <x_i h_j>_X = <x_i h_j>^∞ - <x_i h_j>^0 ≈ <x_i h_j>^1 - <x_i h_j>^0

(<·>^∞: distribution of the model; <·>^0: distribution of the dataset)

Gibbs chain: t = 0 (data) → t = 1 → t = 2 → … → t = ∞ (a fantasy)

slide-23
SLIDE 23

Contrastive Divergence (CD) Learning

  • (of RBM parameters)

 k-Contrastive Divergence Trick

  • From the previous slide, to get the distribution of the model, we need to run many Gibbs sampling steps
  • And this is per single parameter update
  • Therefore, we take the sample after only k steps, where in practice k = 1 is sufficient

Gibbs chain: t = 0 → t = 1 → t = 2 → … → t = ∞ (a fantasy)

Take the t = k sample as a sample of the model distribution

slide-24
SLIDE 24

Effect of Unsupervised Training

Unsupervised training makes the RBM successfully catch the essential patterns.

RBM trained on MNIST hand-written digit data: each cell shows the pattern each hidden node encodes.

slide-25
SLIDE 25

Deep Belief Network (DBN)

 Deep Belief Network (Deep Bayesian Network)

  • Bayesian Network that has a similar structure to a Neural Network
  • Generative model
  • Also can be used as a classifier (with an additional classifier at the top layer)
  • Resolves gradient vanishing by Pre-training
  • There are two modes (Classifier & Auto-Encoder), but we only consider the Classifier here

slide-26
SLIDE 26

Learning Algorithm of DBN

 DBN as a stack of RBMs

1. Regard each layer as an RBM
2. Layer-wise Pre-train each RBM in an Unsupervised way
3. Attach the classifier and Fine-tune the whole Network in a Supervised way

(Diagram: a stack of RBMs (h0, x0 with weights W) forming the DBN (x, h1, h2, h3), with a classifier on top)
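The three steps above can be sketched with a deliberately tiny CD-1 RBM. The layer widths, epochs, learning rate, and random data are assumptions; the supervised fine-tuning step is only indicated in a comment.

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

class TinyRBM:
    """Minimal CD-1 RBM used only to illustrate layer-wise stacking."""
    def __init__(self, n_vis, n_hid, rng):
        self.W = rng.normal(scale=0.01, size=(n_hid, n_vis))
        self.b = np.zeros(n_vis)
        self.c = np.zeros(n_hid)
        self.rng = rng

    def hidden_probs(self, X):
        return sigmoid(self.c + X @ self.W.T)

    def fit(self, X, epochs=5, lr=0.1):
        for _ in range(epochs):
            ph0 = self.hidden_probs(X)
            h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
            x1 = sigmoid(self.b + h0 @ self.W)        # reconstruction
            ph1 = self.hidden_probs(x1)
            self.W += lr * (ph0.T @ X - ph1.T @ x1) / len(X)
            self.b += lr * (X - x1).mean(0)
            self.c += lr * (ph0 - ph1).mean(0)

# 1-2. regard each layer as an RBM and pre-train layer by layer
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(50, 8)).astype(float)
sizes = [8, 6, 4]                       # toy layer widths (an assumption)
layers, inp = [], data
for n_vis, n_hid in zip(sizes[:-1], sizes[1:]):
    rbm = TinyRBM(n_vis, n_hid, rng)
    rbm.fit(inp)
    layers.append(rbm)
    inp = rbm.hidden_probs(inp)         # hidden probs feed the next RBM
# 3. a classifier would now be attached on `inp` and the whole stack
#    fine-tuned with back propagation (omitted here).
```

Each RBM sees the hidden representation of the one below it, which is exactly the greedy layer-wise scheme of the slide.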

slide-27
SLIDE 27

Viewing Learning as Wake-Sleep Algorithm

slide-28
SLIDE 28

Effect of Unsupervised Pre-Training in DBN (1/2)

28

Erhan et. al. AISTATS’2009

slide-29
SLIDE 29

Effect of Unsupervised Pre-Training in DBN (2/2)

29

with pre-training without pre-training

slide-30
SLIDE 30

Internal Representation of DBN

30

slide-31
SLIDE 31

Representation of Higher Layers

 Higher layers have more abstract representations

  • Interpolating between different images is not desirable in lower layers, but natural in higher layers

Bengio et al., ICML 2013

slide-32
SLIDE 32

Inference Algorithm of DBN

 As DBN is a generative model, we can also regenerate the data

  • From the top layer to the bottom, conduct Gibbs sampling to generate the data samples

Generate data: Occluded → Regenerated (Lee, Ng et al., ICML 2009)

slide-33
SLIDE 33

Applications

 Nowadays, CNN outperforms DBN for image or speech data
 However, if there is no topological information, DBN is still a good choice
 Also, if a generative model is needed, DBN is used

Generate Face patches Tang, Srivastava, Salakhutdinov, NIPS 2014

slide-34
SLIDE 34

CONVOLUTIONAL NEURAL NETWORKS

Slides by Jiseob Kim jkim@bi.snu.ac.kr

slide-35
SLIDE 35

Motivation

 Idea:

  • Fully connected structure has too many parameters to learn
  • Efficient to learn local patterns when there is geometrical, topological structure between features, such as image data or voice data (spectrograms)

 DBN: different data  CNN: same data

slide-36
SLIDE 36

Structure of Convolutional Neural Network (CNN)

 Higher features are formed by repeated Convolution and Pooling (Subsampling)
 Convolution obtains a certain Feature from a local area
 Pooling reduces Dimension, while obtaining a Translation-invariant Feature

http://parse.ele.tue.nl/education/cluster2

slide-37
SLIDE 37

Convolution Layer

 The Kernel detects a pattern
 The Resulting value indicates:

  • How much the pattern matches at each region

(Example kernel values: 1 1 1 1 1)

slide-38
SLIDE 38

Max-Pooling Layer

 The Pooling Layer summarizes the results of the Convolution Layer

  • e.g.) a 10x10 result is summarized into 1 cell

 The Result of the Pooling Layer is Translation-invariant

slide-39
SLIDE 39

Remarks

  • Higher layers catch more specific, abstract patterns
  • Lower layers catch more general patterns

slide-40
SLIDE 40

Parameter Learning of CNN

 CNN is just another Neural Network with sparse connections
 Learning Algorithm:

  • Back Propagation on Convolution Layers and Fully-Connected Layers

Back Propagation

slide-41
SLIDE 41

Applications (Image Classification) (1/2): ImageNet Competition Ranking

(1000-class, 1 million images)

From Kyunghyun Cho’s dnn tutorial

ALL CNN!!

slide-42
SLIDE 42

Applications (Image Classification) (2/2)

 Krizhevsky et al.: the winner of ImageNet 2012 Competition

1000-class problem, top-5 test error rate of 15.3%

Fully Connected

slide-43
SLIDE 43

Application (Speech Recognition)

Input: Spectrogram of Speech

Convolutional Neural Network: CNN outperforms all previous methods that use GMMs on MFCC features

slide-44
SLIDE 44

APPENDIX

Slides from Wanli Ouyang wlouyang@ee.cuhk.edu.hk

slide-45
SLIDE 45

45

Good learning resources

 Webpages:

  • Geoffrey E. Hinton’s readings (with source code available for DBN) http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
  • Notes on Deep Belief Networks http://www.quantumg.net/dbns.php
  • MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean http://videolectures.net/mlss2010au_frean_deepbeliefnets/
  • Deep Learning Tutorials http://deeplearning.net/tutorial/
  • Hinton’s Tutorial, http://videolectures.net/mlss09uk_hinton_dbn/
  • Fergus’s Tutorial, http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
  • CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html

 People:

  • Geoffrey E. Hinton’s http://www.cs.toronto.edu/~hinton
  • Andrew Ng http://www.cs.stanford.edu/people/ang/index.html
  • Ruslan Salakhutdinov http://www.utstat.toronto.edu/~rsalakhu/
  • Yee-Whye Teh http://www.gatsby.ucl.ac.uk/~ywteh/
  • Yoshua Bengio www.iro.umontreal.ca/~bengioy
  • Yann LeCun http://yann.lecun.com/
  • Marcus Frean http://ecs.victoria.ac.nz/Main/MarcusFrean
  • Rob Fergus http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php

 Acknowledgement

  • Many materials in this ppt are from these papers, tutorials, etc. (especially Hinton and Frean’s). Sorry for not listing them in full detail.

Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.

slide-46
SLIDE 46

46

Graphical model for Statistics

 Conditional independence between random variables
 Given C, A and B are independent:

  • P(A, B|C) = P(A|C)P(B|C)

 P(A,B,C) = P(A, B|C) P(C) = P(A|C)P(B|C)P(C)

 Any two nodes are conditionally independent given the values of their parents.

http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm

Smoker? Has Lung cancer Has bronchitis

B C A

slide-47
SLIDE 47

47

Directed and undirected graphical model

 Directed graphical model

  • P(A,B,C) = P(A|C)P(B|C)P(C)
  • Any two nodes are conditionally independent given the values of their parents.

 Undirected graphical model

  • P(A,B,C) = P(B,C)P(A,C)
  • Also called Markov Random Field (MRF)

B A C P(A,B,C,D) = P(D|A,B)P(B|C)P(A|C)P(C) B A C D B C A B C A

slide-48
SLIDE 48

48

Modeling undirected model

 Probability:

P(x;θ) = f(x;θ) / Z(θ), with partition function Z(θ) = Σ_x f(x;θ), so that Σ_x P(x;θ) = 1

Example: P(A,B,C) = P(B,C)P(A,C)

P(A,B,C) = exp(w1·BC) exp(w2·AC) / Z(w1,w2), where Z(w1,w2) = Σ_{A,B,C} exp(w1·BC + w2·AC)

(Figure: B "Is smoker?", C "Is healthy", A "Has Lung cancer", with weights w1, w2 on the edges)

slide-49
SLIDE 49

49

More directed and undirected models

(Figures: a Hidden Markov model with hidden nodes h1..h3 and observations y1..y3; an MRF in 2D over nodes A..I)

slide-50
SLIDE 50

50

More directed and undirected models

A B C D: P(A,B,C,D) = P(A)P(B)P(C|B)P(D|A,B,C)

h1 h2 h3 with y1 y2 y3: P(y1,y2,y3,h1,h2,h3) = P(h1)P(h2|h1)P(h3|h2)P(y1|h1)P(y2|h2)P(y3|h3)

slide-51
SLIDE 51

51

More directed and undirected models

(Figures: (a) an RBM with visible v and hidden h connected by weights W; (b) a DBN with layers x, h1, h2, h3 and weights W0, W1, W2)

slide-52
SLIDE 52

52

Extended reading on graphical model

 Zoubin Ghahramani ‘s video lecture on graphical models:  http://videolectures.net/mlss07_ghahramani_grafm/

slide-53
SLIDE 53

Product of Experts

53

P(x;θ) = Π_m f_m(x;θ_m) / Z(θ) = e^(-Σ_m E_m(x;θ_m)) / Z(θ)

E_m(x;θ_m) = -log f_m(x;θ_m)

(E: energy function; Z: partition function)

Example (MRF in 2D over nodes A..I):

E(x;θ) = w1·AB + w2·BC + w3·AD + w4·BE + w5·CF + ...

slide-54
SLIDE 54

Product of Experts

54

Example: a product of 15 experts

P(x) ∝ Π_{i=1}^{15} (1 + e^(u_i^T x + c_i))

slide-55
SLIDE 55

55

Products of experts versus Mixture model

 Products of experts :

  • "and" operation
  • Sharper than mixture
  • Each expert can constrain a different subset of dimensions.

 Mixture model, e.g. Gaussian Mixture model

  • “or” operation
  • a weighted sum of many density functions

P(x) = Σ_m α_m f_m(x;θ_m)
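The "and" vs. "or" contrast can be made concrete with two broad 1-D Gaussian experts; the means and widths below are assumptions. The renormalized product concentrates its mass where both experts agree, so it is sharper than the mixture.

```python
import numpy as np

xs = np.linspace(-6, 6, 1001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

e1 = gauss(xs, -1.0, 2.0)        # two broad, overlapping "experts"
e2 = gauss(xs, 1.0, 2.0)

mixture = 0.5 * e1 + 0.5 * e2    # "or": a weighted sum of densities
product = e1 * e2                # "and": multiply, then renormalize
product /= product.sum() * dx

def width(p):
    """Standard deviation of a density sampled on the grid xs."""
    mean = (xs * p).sum() * dx
    return np.sqrt((((xs - mean) ** 2) * p).sum() * dx)
```

For two Gaussians of equal variance the product is again Gaussian with half the variance, while the mixture is wider than either component.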

slide-56
SLIDE 56

56

Outline

 Basic background on statistical learning and Graphical model

Contrastive divergence and Restricted Boltzmann machine

  • Product of experts
  • Contrastive divergence
  • Restricted Boltzmann Machine

 Deep belief net

slide-57
SLIDE 57

57

Contrastive Divergence (CD)

 Probability: P(x;θ) = f(x;θ) / Z(θ)
 Maximum Likelihood and gradient descent

θ* = argmax_θ L(X;θ) = argmax_θ log Π_{k=1}^K P(x^(k);θ)

∂L(X;θ)/∂θ = ∫ p(x;θ) (∂log f(x;θ)/∂θ) dx - (1/K) Σ_{k=1}^K ∂log f(x^(k);θ)/∂θ

(the first term is an expectation under the model distribution, the second under the data distribution)

Gradient update: θ^(t+1) = θ^t + η ∂L(X;θ)/∂θ

slide-58
SLIDE 58

58

Contrastive Divergence (CD)

 Gradient of Likelihood:

∂L(X;θ)/∂θ = ∫ p(x;θ) (∂log f(x;θ)/∂θ) dx - (1/K) Σ_{k=1}^K ∂log f(x^(k);θ)/∂θ

  • The first (model) term is intractable; Gibbs sampling from p(z1, z2, …, zM) approximates it
  • The second (data) term is tractable and easy to compute
  • Contrastive divergence stops the Gibbs chain early (T = 1): an approximate but fast gradient instead of an accurate but slow one

Update: θ^(t+1) = θ^t + η ∂L(X;θ)/∂θ

slide-59
SLIDE 59

59

Gibbs Sampling for graphical model

x1 x2 h2 h3 h4 x3 h5 h1

More information on Gibbs sampling: Pattern recognition and machine learning(PRML)

slide-60
SLIDE 60

60

Convergence of Contrastive divergence (CD)

 The fixed points of ML are not fixed points of CD and vice versa.

  • CD is a biased learning algorithm.
  • But the bias is typically very small.
  • CD can be used for getting close to ML solution and then ML le

arning can be used for fine-tuning.

 It is not clear if CD learning converges (to a stable fixed poi nt). At 2005, proof is not available.  Further theoretical results? Please inform us

  • M. A. Carreira-Perpignan and G. E. Hinton. On Contrastive Divergence Learning. Artificial Intelligence and Statistics, 2005
slide-61
SLIDE 61

61

Outline

 Basic background on statistical learning and Graphical model

Contrastive divergence and Restricted Boltzmann machine

  • Product of experts
  • Contrastive divergence
  • Restricted Boltzmann Machine

 Deep belief net

slide-62
SLIDE 62

62

Boltzmann Machine

 Undirected graphical model, with hidden nodes.

E(x;θ) = Σ_{ij} w_ij x_i x_j + Σ_i b_i x_i, with θ: {w_ij, b_i}

P(x;θ) = f(x;θ)/Z(θ) = e^(-E(x;θ))/Z(θ)

With hidden units h, the Boltzmann machine has energy E(x,h) = b'x + c'h + h'Wx + x'Ux + h'Vh

slide-63
SLIDE 63

Restricted Boltzmann Machine (RBM)

 Undirected, loopy, layered
 E(x,h) = b'x + c'h + h'Wx

P(x,h) = e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h))

P(x) = Σ_h e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h))

P(h|x) = Π_i P(h_i|x), P(x|h) = Π_j P(x_j|h)

P(x_j = 1|h) = σ(bj + W’·j · h)
P(h_i = 1|x) = σ(ci + Wi· · x)

(Compare the full Boltzmann machine: E(x,h) = b'x + c'h + h'Wx + x'Ux + h'Vh. The denominator Σ_{x,h} e^(-E) is the partition function.)

Read the manuscript for details

slide-64
SLIDE 64

64

Restricted Boltzmann Machine (RBM)

 E(x,h) = b'x + c'h + h'Wx
 x = [x1 x2 …]^T, h = [h1 h2 …]^T
 Parameter learning

  • Maximum Log-Likelihood

θ* = argmax_θ L(X;θ) = argmax_θ log Π_{k=1}^K P(x^(k);θ)

P(x;θ) = Σ_h e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h)) = f(x;θ)/Z(θ)

Geoffrey E. Hinton, “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14, 1771–1800 (2002)

slide-65
SLIDE 65

65

CD for RBM

 CD for RBM, very fast!

Update: θ^(t+1) = θ^t + η ∂L(X;θ)/∂θ

∂L(X;θ)/∂w_ij = <x_i h_j>_{p(x;θ)} - <x_i h_j>_X ≈ <x_i h_j>^1 - <x_i h_j>^0 (CD)

using P(x_j = 1|h) = σ(bj + W’·j · h) and P(h_i = 1|x) = σ(ci + Wi · x)

slide-66
SLIDE 66

66

CD for RBM

(Diagram: alternating Gibbs steps between visible units x1, x2 and hidden units h1, h2, using P(h_i = 1|x) = σ(ci + Wi · x) and P(x_j = 1|h) = σ(bj + W’·j · h))

∂L(X;θ)/∂w_ij ≈ <x_i h_j>^1 - <x_i h_j>^0

slide-67
SLIDE 67

67

RBM for classification

 y: classification label

Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.

slide-68
SLIDE 68

68

RBM itself has many applications

 Multiclass classification  Collaborative filtering  Motion capture modeling  Information retrieval  Modeling natural images  Segmentation

Y Li, D Tarlow, R Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013

  • V. Mnih, H. Larochelle, G. E. Hinton, Conditional Restricted Boltzmann Machines for Structured Output Prediction, Uncertainty in Artificial Intelligence, 2011.
  • Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. ICML 2008.
  • Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
  • Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model. NIPS 2009.
  • Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.

slide-69
SLIDE 69

69

Outline

 Basic background on statistical learning and Graphical model
 Contrastive divergence and Restricted Boltzmann machine

Deep belief net (DBN)

  • Why deep leaning?
  • Learning and inference
  • Applications
slide-70
SLIDE 70

70

Belief Nets

 A belief net is a directed acyclic graph composed of random variables.

random hidden cause visible effect

slide-71
SLIDE 71

71

Deep Belief Net

 Belief net that is deep
 A generative model

  • P(x, h1, …, hl) = p(x|h1) p(h1|h2) … p(hl-2|hl-1) p(hl-1, hl)

 Used for unsupervised training of multi-layer deep models.

P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)

Pixels => edges => local shapes => object parts

slide-72
SLIDE 72

72

Why Deep learning?

 The mammal brain is organized in a deep architecture with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex.
 An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task.
 Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.

  • T. Serre, etc., “A quantitative theory of immediate visual recognition,” Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007.
  • Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.

Pixels=>edges=> local shapes=> object parts

slide-73
SLIDE 73

73

 Linear regression, logistic regression: depth 1
 Kernel SVM: depth 2
 Decision tree: depth 2
 Boosting: depth 2
 The basic conclusion that these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one. (Example: logic gates, multi-layer NN with linear threshold units and positive weights.)

Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.

Why Deep learning?

slide-74
SLIDE 74

74

Example: sum product network (SPN)

(Diagram: a sum-product network over X1..X5. A shallow representation needs on the order of N·2^(N-1) parameters; a deep SPN needs only O(N) parameters.)

slide-75
SLIDE 75

75

Depth of existing approaches

 Boosting (2 layers)

  • Layer 1: base learner
  • Layer 2: vote or linear combination of layer 1

 Decision tree, LLE, KNN, Kernel SVM (2 layers)

  • Layer 1: matching degree to a set of local templates: Σ_i b_i K(x, x_i)
  • Layer 2: Combine these degrees

 Brain: 5-10 layers

slide-76
SLIDE 76

76

Why decision tree has depth 2?

 Rely on partition of input space.
 Local estimator: relies on a partition of input space and uses separate parameters for each region. Each region is associated with a leaf.
 Needs as many training samples as there are variations of interest in the target function. Not good for highly varying functions.
 The number of training samples is exponential in the number of dimensions in order to achieve a fixed error rate.

slide-77
SLIDE 77

77

Deep Belief Net

 Inference problem: Infer the states of the unobserved variables.
 Learning problem: Adjust the interactions between variables to make the network more likely to generate the observed data.

P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)

slide-78
SLIDE 78

78

Deep Belief Net

  • Inference problem (the problem of explaining away):

P(A,B|C) = P(A|C)P(B|C), but P(h11, h12 | x1) ≠ P(h11|x1) P(h12|x1)

An example from the manuscript. Sol: Complementary prior

slide-79
SLIDE 79

79

Deep Belief Net

(Diagram: a deep belief net with layers x, h1, h2, h3, h4 of sizes 2000, 1000, 500, 30)

 Inference problem (the problem of explaining away)
 Sol: Complementary prior

slide-80
SLIDE 80

80

Deep Belief Net

 Explaining away problem of Inference (see the manuscript)

  • Sol: Complementary prior, see the manuscript

 Learning problem

  • Greedy layer-by-layer RBM training (optimize lower bound) and fine tuning
  • Contrastive divergence for RBM training

P(h_i = 1|x) = σ(ci + Wi · x)

slide-81
SLIDE 81

81

Deep Belief Net

 Why does greedy layerwise learning work?
 Optimizing a lower bound:

log P(x) ≥ Σ_{h1} { Q(h1|x) [log P(h1) + log P(x|h1)] - Q(h1|x) log Q(h1|x) }   (1)

 When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing P(h1) in (1)

slide-82
SLIDE 82

82

Deep Belief Net and RBM

 RBM can be considered as a DBN that has infinitely many layers

(Diagram: unrolling an RBM with weights W into an infinite directed net whose layers x0, h0, x1, h1, x2, … alternate weights W and W^T)

slide-83
SLIDE 83

83

Pretrain, fine-tune and inference – (autoencoder)

(BP)

slide-84
SLIDE 84

84

Pretrain, fine-tune and inference - 2

y: identity or rotation degree

Pretraining Fine-tuning

slide-85
SLIDE 85

85

How many layers should we use?

 There might be no universally right depth

  • Bengio suggests that several layers are better than one
  • Results are robust against changes in the size of a layer, but the top layer should be big
  • A parameter. Depends on your task.
  • With enough narrow layers, we can model any distribution over binary vectors [1]

Copied from http://videolectures.net/mlss09uk_hinton_dbn/

[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007

slide-86
SLIDE 86

86

Effect of Unsupervised Pre-training

Erhan et. al. AISTATS’2009

slide-87
SLIDE 87

87

Effect of Depth

(Plots: with pre-training vs. without pre-training)

slide-88
SLIDE 88

88

Why unsupervised pre-training makes sense

stuff image label stuff image label

If image-label pairs were generated this way, it would make sense to try to go straight from images to labels. For example, do the pixels have even parity? If image-label pairs are generated this way, it makes sense to first learn to recover the stuff that caused the image by inverting the high bandwidth pathway.

high bandwidth low bandwidth

slide-89
SLIDE 89

89

Beyond layer-wise pretraining

 Layer-wise pretraining is efficient but not optimal.
 It is possible to train parameters for all layers using a wake-sleep algorithm.

  • Bottom-up in a layer-wise manner
  • Top-down and refitting the earlier models
slide-90
SLIDE 90

90

Fine-tuning with a contrastive versio n of the “wake-sleep” algorithm

After learning many layers of features, we can fine-tune the features to improve generation.

  • 1. Do a stochastic bottom-up pass
    - Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
  • 2. Do a few iterations of sampling in the top-level RBM
    - Adjust the weights in the top-level RBM.
  • 3. Do a stochastic top-down pass
    - Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.

slide-91
SLIDE 91

91

Include lateral connections

 RBM has no connections within a layer
 This can be generalized.
 Lateral connections for the first layer [1].

  • Sampling from P(h|x) is still easy. But sampling from P(x|h) is more difficult.

 Lateral connections at multiple layers [2].

  • Generate more realistic images.
  • CD is still applicable, with small modification.

[1]B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?,” Vision Research, vol. 37, pp. 3311–3325, December 1997. [2]S. Osindero and G. E. Hinton, “Modeling image patches with a directed hierarchy of Markov random field,” in NIPS, 2007.

slide-92
SLIDE 92

92

Without lateral connection

slide-93
SLIDE 93

93

With lateral connection

slide-94
SLIDE 94

94

My data is real valued …

 Make it [0 1] linearly: x = ax + b  Use another distribution
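The linear rescaling x → ax + b, with a = 1/(max - min) and b = -min/(max - min), can be sketched as:

```python
import numpy as np

def rescale01(x):
    """Linear min-max rescaling x -> a*x + b so values land in [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

x = np.array([-3.0, 0.0, 1.5, 6.0])
y = rescale01(x)
```

The smallest value maps exactly to 0 and the largest to 1; a constant-valued input would need special handling (division by zero), which this sketch omits.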

slide-95
SLIDE 95

95

My data has temporal dependency …

 Static:  Temporal

slide-96
SLIDE 96

96

Consider DBN as…

 A statistical model that is used for unsupervised training of fully connected deep models
 A directed graphical model that is approximated by fast learning and inference algorithms
 A directed graphical model that is fine-tuned using a mature neural network learning approach -- BP.

slide-97
SLIDE 97

97

Outline

 Basic background on statistical learning and Graphical model
 Contrastive divergence and Restricted Boltzmann machine

Deep belief net (DBN)

  • Why DBN?
  • Learning and inference
  • Applications
slide-98
SLIDE 98

98

Applications of deep learning

 Hand written digits recognition  Dimensionality reduction  Information retrieval  Segmentation  Denoising  Phone recognition  Object recognition  Object detection  …

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 2006.
Welling, M., etc., Exponential Family Harmoniums with an Application to Information Retrieval, NIPS 2004.
A. R. Mohamed, etc., Deep Belief Networks for phone recognition, NIPS 09 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. 3-D Object recognition with deep belief nets. NIPS09.
…

slide-99
SLIDE 99

99

Object recognition

 NORB

  • logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
  • With the extra unlabeled data (and the same amount of labeled data as before), DBN achieves 5.2%.

slide-100
SLIDE 100

100

Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)

slide-101
SLIDE 101

101

The training and test sets

  • 11,000 unlabeled cases
  • 100, 500, or 1000 labeled cases
  • test set: face patches from new people

slide-102
SLIDE 102

102

The root mean squared error in the orientation when combining GPs with deep belief nets:

                                            100 labels  500 labels  1000 labels
GP on the pixels                                22.2        17.9        15.2
GP on top-level features                        17.2        12.7         7.2
GP on top-level features with fine-tuning       16.3        11.2         6.4

Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.

slide-103
SLIDE 103

103

Deep Autoencoders

(Hinton & Salakhutdinov, 200 6)

 They always looked like a really nice way to do non-linear dimensionality reduction:

  • But it is very difficult to optimize deep autoencoders using backpropagation.

 We now have a much better way to optimize them:

  • First train a stack of 4 RBMs
  • Then “unroll” them.
  • Then fine-tune with backprop.

(Diagram: encoder 28x28 → 1000 → 500 → 250 → 30 linear units, mirrored by the decoder back to 28x28, with weights W1..W4 and their transposes)

slide-104
SLIDE 104

104

Deep Autoencoders

(Hinton & Salakhutdinov, 2006)

real data 30-D deep auto 30-D PCA

slide-105
SLIDE 105

105

A comparison of methods for compressing digit images to 30 real numbers.

real data 30-D deep auto 30-D logistic PCA 30-D PCA

slide-106
SLIDE 106

106

Representation of DBN

slide-107
SLIDE 107

107

Summary

 Deep belief net (DBN)

  • is a network with deep layers, which provides strong representation power;
  • is a generative model;
  • can be learned layer-wise by RBMs using Contrastive Divergence;
  • has many applications, and more applications are yet to be found.

Generative models explicitly or implicitly model the distribution of inputs and outputs. Discriminative models model the posterior probabilities directly.

slide-108
SLIDE 108

108

DBN VS SVM

 A very controversial topic  Model

  • DBN is generative, SVM is discriminative. But fine-tuning of DBN is discriminative.

 Application

  • SVM is widely applied.
  • Researchers are expanding the application area of DBN.

 Learning

  • DBN is non-convex and slow
  • SVM is convex and fast (in linear case).

 Which one is better?

  • Time will say.
  • You can contribute

Hinton: The superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model. This set of domains is being eroded by Moore’s law.