Deep Learning
Jiseob Kim (jkim@bi.snu.ac.kr) Artificial Intelligence Class of 2016 Spring
- Dept. of Computer Science and Engineering
Seoul National University
1986: Back propagation. Neural networks promised to solve general learning problems and were tied to the biological system, but the approach was eventually given up.
Until 2006, hand-crafted features dominated (GMM-HMM for speech; SIFT, LBP, HOG for vision) [Kruger et al., TPAMI'13].
2006: Deep belief nets (Science). Layer-wise training revived deep neural networks, helped by better training techniques (normalization, nonlinearity, dropout) and more computing power (GPUs, multi-core computer systems).
2011-2012: breakthrough deep learning results in speech recognition (2011) and object recognition (2012).
How Many Computers to Identify a Cat? 16000 CPU cores
Object recognition over 1,000,000 images and 1,000 categories (ImageNet 2012):
Rank 1: error rate 0.15315, Deep Conv Net (trained on 2 GPUs)
Rank 2: error rate 0.26172, hand-crafted features and learning models (the bottleneck)
Rank 3: error rate 0.26979
Rank 4: Xerox/INRIA, error rate 0.27058
History of Neural Network Research
Slides from Wanli Ouyang wlouyang@ee.cuhk.edu.hk
Neural Networks
Convolutional Neural Networks (CNN)
Slides by Jiseob Kim jkim@bi.snu.ac.kr
Pattern recognition: the problem of finding a label y given data x.
Examples: y is a person's name, a diagnosis of diabetes, or the sentence corresponding to a voice recording.
Typically x is a D-dimensional vector and y is a discrete integer label. Famous pattern recognition algorithms include the perceptron, described next.
The Perceptron: a linear unit with inputs x1, x2 (plus a constant 1), weights w1, w2, and bias b. Its decision boundary is the line w1*x1 + w2*x2 + b = 0: inputs on the > 0 side are classified as positive, those on the < 0 side as negative.

Perceptron learning algorithm:
start: the weight vector w is generated randomly.
test: a vector x ∈ P ∪ N is selected randomly;
  if x ∈ P and w·x > 0, go to test;
  if x ∈ P and w·x ≤ 0, go to add;
  if x ∈ N and w·x < 0, go to test;
  if x ∈ N and w·x ≥ 0, go to subtract.
add: set w = w + x, go to test.
subtract: set w = w - x, go to test.
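A minimal sketch of this perceptron rule in Python/NumPy; the toy data arrays, the iteration cap, and the bias-as-extra-feature trick are illustrative assumptions, not part of the slides.

```python
import numpy as np

def train_perceptron(P, N, max_iters=1000, seed=0):
    """Perceptron rule from the slide: add misclassified positives, subtract misclassified negatives."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=P.shape[1])     # start: w generated randomly
    for _ in range(max_iters):
        updated = False
        for x in P:                     # x in P should satisfy w.x > 0
            if np.dot(w, x) <= 0:
                w = w + x               # add
                updated = True
        for x in N:                     # x in N should satisfy w.x < 0
            if np.dot(w, x) >= 0:
                w = w - x               # subtract
                updated = True
        if not updated:                 # every sample already classified correctly
            break
    return w

# Toy 2-D example; the constant 1 appended to each vector plays the role of the bias b.
P = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0]])
N = np.array([[-1.0, -0.5, 1.0], [-2.0, -1.0, 1.0]])
w = train_perceptron(P, N)
```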
From the classic perceptron to the sigmoid unit. The sigmoid function is differentiable:
∂σ(x)/∂x = σ(x)(1 - σ(x))
Loss function (d: target, f: unit output):
L = (1/2)(d - f)^2
Gradient descent update (learning rate η, input vector X):
W ← W + η (d - f) f (1 - f) X
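A small sketch of one gradient-descent step for a single sigmoid unit with the squared loss above; the learning rate, inputs, and target below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_unit_step(w, b, x, d, lr=0.1):
    """One update of a sigmoid unit for target d and input x, using L = 0.5*(d - f)^2."""
    f = sigmoid(np.dot(w, x) + b)
    delta = (d - f) * f * (1.0 - f)     # (d - f) times the sigmoid derivative
    w = w + lr * delta * x              # gradient descent on the weights
    b = b + lr * delta                  # and on the bias
    return w, b

w, b = np.zeros(2), 0.0
w, b = sigmoid_unit_step(w, b, x=np.array([1.0, 0.5]), d=1.0)
```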
Why multiple units and multiple layers? A single unit gives one linear boundary; multiple boundaries are needed (e.g., for the XOR problem), which requires multiple units, and more complex regions (e.g., polygons) require multiple layers.

Structure of the Multilayer Perceptron (MLP; Artificial Neural Network): an input layer, one or more hidden layers, and an output layer, with a weight for each layer and each unit.

Loss function (d: target, f: unit output):
L = (1/2)(d - f)^2
We need to compute the gradient of this loss for all the parameters.
Recursive Computation of Gradients: compute the loss-gradient of the lower-layer weights recursively (Back Propagation).

Gradients of top-layer weights and update rule. Store the intermediate value delta for later use in the chain rule:
δ = (d - f) f (1 - f)
W_i ← W_i + η δ X_i   (X_i: the input feeding that weight)

Gradients of lower-layer weights. Applying the chain rule gives a recursive relation between the delta's: the delta of a lower-layer unit is its local gradient (the sigmoid derivative) times the weighted sum of the delta's in the layer above,
δ_i(lower) = f_i (1 - f_i) Σ_l W_il δ_l(upper)
and the gradient descent update rule for lower-layer weights has the same form,
W_ij ← W_ij + η δ_i X_j
Algorithm: Back Propagation
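A compact sketch of the back-propagation recursion above for a one-hidden-layer MLP with sigmoid units and squared loss; the array shapes, learning rate, and toy training loop are assumptions (biases are omitted for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, W2, x, d, lr=0.5):
    """One forward + backward pass; the deltas follow the recursive rule in the slides."""
    h = sigmoid(W1 @ x)                     # hidden activations
    f = sigmoid(W2 @ h)                     # output activations
    delta2 = (d - f) * f * (1 - f)          # top-layer local gradient
    delta1 = (W2.T @ delta2) * h * (1 - h)  # recursion: weighted sum of upper deltas
    W2 = W2 + lr * np.outer(delta2, h)      # gradient-descent updates
    W1 = W1 + lr * np.outer(delta1, x)
    return W1, W2

# Toy training loop (illustrative data only)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
for x, d in [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])] * 500:
    W1, W2 = backprop_step(W1, W2, np.array(x, float), np.array(d, float))
```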
Back Propagation applies to almost all classification problems.

Limitations: training deep networks with Back Propagation suffers from the Vanishing Gradient problem: the error signal decays as it is propagated back through many layers.

Breakthrough: unsupervised, layer-wise pre-training of the parameters (2006).

[Figure: input x is fed forward to produce output y'; the errors between y' and the target y are back-propagated through the layers.]
Slides by Jiseob Kim jkim@bi.snu.ac.kr
Restricted Boltzmann Machine (RBM)
Idea: a two-layer, energy-based model in which the conditional distributions factorize:
P(x|h) = Π_j P(x_j|h),   P(h|x) = Π_i P(h_i|x)

Energy-Based Model with energy function E(x, h):
P(x, h) = e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h))
[Figure: bipartite graph with visible units x1, x2, x3 and hidden units h1 ... h5.]
Marginal probability of the visible units:
P(x) = Σ_h e^(-E(x,h)) / Σ_{x,h} e^(-E(x,h))

Conditional probabilities:
P(x_j = 1|h) = σ(b_j + W'_{·j} · h),   P(h_i = 1|x) = σ(c_i + W_{i·} · x)
The joint (x, h) probability, the marginal (x) probability, and the likelihood are all defined by this energy function. Remark: the conditional probability has the same form as a Neural Network unit (a sigmoid of a weighted sum).
Maximum Likelihood. The likelihood of a dataset X = {x^(1), ..., x^(K)} is
L(X; θ) = Π_k [ Σ_h e^(-E(x^(k),h)) / Σ_{x,h} e^(-E(x,h)) ]
and the gradient of the log-likelihood with respect to a weight w_ij is
∂ log L(X; θ) / ∂w_ij = <x_i h_j>_data - <x_i h_j>_model
i.e., the average of x_i h_j over the dataset minus its expectation under the model distribution. The data term is easy to compute; the model term requires running a Gibbs chain to equilibrium (t = ∞), so in practice it is approximated after a single step, <x_i h_j>_model ≈ <x_i h_j>_1.
[Figure: alternating Gibbs sampling between the visible and hidden units; <x_i h_j> is measured at t = 0 (the data), t = 1, t = 2, ..., and t = ∞ (a "fantasy" sample from the model).]
Contrastive Divergence (CD) Learning: the k-step Contrastive Divergence trick. It is too costly to run many Gibbs sampling steps, so the model expectation is approximated after only k steps; in practice, k = 1 is sufficient.
Take this as a sample of Model distribution
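A minimal NumPy sketch of CD-1 for a binary RBM, using the conditionals above and the difference <x_i h_j>_data - <x_i h_j>_reconstruction as the weight update; the hyper-parameters, sizes, and toy data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, b, c, X, lr=0.05, rng=np.random.default_rng(0)):
    """One CD-1 update on a batch X (rows are binary visible vectors)."""
    ph = sigmoid(X @ W.T + c)                       # P(h=1 | x)
    h = (rng.random(ph.shape) < ph).astype(float)   # sample h ~ P(h|x)
    px = sigmoid(h @ W + b)                         # P(x=1 | h)
    x1 = (rng.random(px.shape) < px).astype(float)  # reconstruction after 1 Gibbs step
    ph1 = sigmoid(x1 @ W.T + c)                     # P(h=1 | x1)
    # CD-1 gradient: <x h> over the data minus <x h> over the reconstructions
    W += lr * (ph.T @ X - ph1.T @ x1) / len(X)
    b += lr * (X - x1).mean(axis=0)
    c += lr * (ph - ph1).mean(axis=0)
    return W, b, c

n_vis, n_hid = 6, 3
rng = np.random.default_rng(1)
W = 0.01 * rng.normal(size=(n_hid, n_vis))
b, c = np.zeros(n_vis), np.zeros(n_hid)
X = (rng.random((20, n_vis)) < 0.5).astype(float)   # toy binary data
for _ in range(100):
    W, b, c = cd1_step(W, b, c, X)
```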
Unsupervised training lets the RBM successfully capture the essential patterns. RBM trained on MNIST hand-written digit data: each cell shows the pattern that one hidden node encodes.
Deep Belief Network (DBN; a deep Bayesian network): similar in structure to a Neural Network. It can also be used as a generative model (auto-encoder), but we only consider the classifier here.

DBN as a stack of RBMs:
1. Regard each pair of adjacent layers as an RBM.
2. Layer-wise pre-train each RBM in an unsupervised way.
3. Attach the classifier and fine-tune the whole network in a supervised way.
[Figure: a stack of RBMs (x-h1, h1-h2, h2-h3) with weights W, pre-trained layer by layer and then combined into a DBN with a classifier on top.]
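A sketch of the greedy layer-wise pre-training in steps 1-2 above; the layer sizes, epochs, learning rate, and toy data are assumptions, and the supervised fine-tuning of step 3 is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hid, epochs=50, lr=0.05, rng=None):
    """Train one RBM with CD-1; return its parameters and the hidden code of X."""
    rng = rng or np.random.default_rng(0)
    W = 0.01 * rng.normal(size=(n_hid, X.shape[1]))
    b, c = np.zeros(X.shape[1]), np.zeros(n_hid)
    for _ in range(epochs):
        ph = sigmoid(X @ W.T + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        x1 = (rng.random(X.shape) < sigmoid(h @ W + b)).astype(float)
        ph1 = sigmoid(x1 @ W.T + c)
        W += lr * (ph.T @ X - ph1.T @ x1) / len(X)
        b += lr * (X - x1).mean(axis=0)
        c += lr * (ph - ph1).mean(axis=0)
    return W, c, sigmoid(X @ W.T + c)    # mean hidden activities feed the next layer

def pretrain_dbn(X, layer_sizes):
    """Greedy layer-wise pre-training: treat each pair of adjacent layers as an RBM."""
    params, H = [], X
    for n_hid in layer_sizes:
        W, c, H = train_rbm(H, n_hid)
        params.append((W, c))
    return params                         # step 3 (attach classifier, fine-tune) omitted

X = (np.random.default_rng(2).random((100, 20)) < 0.5).astype(float)  # toy data
dbn = pretrain_dbn(X, layer_sizes=[16, 8, 4])
```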
Erhan et al., AISTATS 2009.
[Figure: features learned with pre-training vs. without pre-training.]
Higher layers have more abstract representations: the learned features look less natural in lower layers, but natural in higher layers.
Bengio et al., ICML 2013
As the DBN is a generative model, we can also regenerate the data: given data samples, the network can generate data. [Figure: occluded face images and their regenerated completions (Lee, Ng et al., ICML 2009).]
Nowadays, the CNN outperforms the DBN for image or speech data. However, if there is no topological information, the DBN is still a good choice. Also, if a generative model is needed, the DBN is used.
Generate Face patches Tang, Srivastava, Salakhutdinov, NIPS 2014
Slides by Jiseob Kim jkim@bi.snu.ac.kr
Convolutional Neural Network (CNN)
Idea: exploit the topological structure between features, as in image data or voice data (spectrograms). To a DBN, a shifted copy of a pattern is different data; to a CNN, it is the same data. [Figure: Image 1 and Image 2, the same pattern at different positions.]
Structure of Convolutional Neural Network (CNN)
Higher-level features are formed by repeated Convolution and Pooling (subsampling). Convolution obtains a certain feature from a local area; Pooling reduces the dimension while obtaining a translation-invariant feature.
http://parse.ele.tue.nl/education/cluster2
The kernel detects a pattern; the resulting value indicates how strongly that pattern is present at each region. [Figure: an example kernel of 1s and its response map.]

The pooling layer summarizes the results of the convolution layer: each local region of responses is reduced to 1 cell (e.g., by taking the maximum), and the result of the pooling layer is translation-invariant.

Moving to higher layers, features become more specific and abstract; lower layers capture general patterns.
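A small sketch of one convolution + max-pooling stage on a 2-D input, illustrating how pooling makes the detected feature robust to a small shift; the kernel and image below are made up for illustration.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' 2-D convolution (implemented as cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)   # kernel response at (i, j)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling: each size x size block collapses to 1 cell."""
    H, W = fmap.shape[0] - fmap.shape[0] % size, fmap.shape[1] - fmap.shape[1] % size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

kernel = np.ones((2, 2))                       # detects a bright 2x2 blob
img = np.zeros((6, 6)); img[1:3, 1:3] = 1.0
img_shifted = np.roll(img, 1, axis=1)          # same pattern, shifted by one pixel
p1 = max_pool(conv2d_valid(img, kernel))
p2 = max_pool(conv2d_valid(img_shifted, kernel))
print(p1.max(), p2.max())                      # the blob is detected equally well in both
```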
A CNN is just another Neural Network with sparse connections. Learning algorithm: Back Propagation.
ImageNet (1000-class, 1 million images): all of the top-performing entries are CNNs (from Kyunghyun Cho's dnn tutorial). Krizhevsky et al., the winner of the ImageNet 2012 competition, reached a top-5 test error rate of 15.3% on this 1000-class problem. [Figure: the network architecture, with convolutional layers followed by fully connected layers.]

Speech: with a spectrogram of speech as input, the Convolutional Neural Network outperforms all previous methods that use GMMs of MFCC features.
Slides from Wanli Ouyang wlouyang@ee.cuhk.edu.hk
References and acknowledgements.
Webpages:
www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
videolectures.net/mlss2010au_frean_deepbeliefnets/
Many slides are borrowed from other people's tutorials and lectures (e.g., Hinton's and Frean's). Sorry for not listing them in full detail.
Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical Report.
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
[Figure: a Bayesian network over three nodes (A, B, C): Smoker?, Has lung cancer, Has bronchitis.]
Directed graphical model (Bayesian network): nodes are conditionally independent given the values of their parents. Example: P(A,B,C,D) = P(D|A,B) P(B|C) P(A|C) P(C). [Figure: the corresponding directed graphs over A, B, C, D.]
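A tiny worked example of assembling the joint P(A,B,C,D) = P(D|A,B) P(B|C) P(A|C) P(C) for binary variables; the conditional-probability tables below are made-up numbers, used only to show how the factorization is evaluated.

```python
# Hypothetical CPTs for binary variables (values are illustrative, not from the slides)
P_C = {1: 0.3, 0: 0.7}
P_A_given_C = {(1, 1): 0.8, (1, 0): 0.1}          # P(A=1 | C)
P_B_given_C = {(1, 1): 0.6, (1, 0): 0.2}          # P(B=1 | C)
P_D_given_AB = {(1, 1, 1): 0.9, (1, 1, 0): 0.7,   # P(D=1 | A, B)
                (1, 0, 1): 0.5, (1, 0, 0): 0.05}

def bernoulli(p1, value):
    return p1 if value == 1 else 1.0 - p1

def joint(a, b, c, d):
    """P(A,B,C,D) = P(D|A,B) P(B|C) P(A|C) P(C)"""
    return (bernoulli(P_D_given_AB[(1, a, b)], d)
            * bernoulli(P_B_given_C[(1, c)], b)
            * bernoulli(P_A_given_C[(1, c)], a)
            * P_C[c])

# Sanity check: the joint sums to 1 over all 16 configurations
total = sum(joint(a, b, c, d) for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(round(total, 10))   # 1.0
```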
Undirected graphical model (Markov Random Field). Probability:
P(x; θ) = (1/Z) Π_m f_m(x_m; θ),   Z = Σ_x Π_m f_m(x_m; θ)   (Z: the partition function)
Example with three binary variables A, B, C (is smoker?, is healthy?, has lung cancer?):
P(A, B, C) ∝ f(B, C) f(A, C) = exp(w1·B·C) exp(w2·A·C)
P(A, B, C) = exp(w1·B·C + w2·A·C) / Z(w1, w2),   Z(w1, w2) = Σ_{A,B,C} exp(w1·B·C + w2·A·C)
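A brute-force sketch of the example above: unnormalized potentials exp(w1·B·C) and exp(w2·A·C) over binary A, B, C, with the partition function computed by enumeration; the weight values are illustrative assumptions.

```python
import itertools, math

w1, w2 = 1.5, -0.5          # illustrative potential weights

def unnormalized(a, b, c):
    """Product of clique potentials f(B,C) * f(A,C) = exp(w1*B*C) * exp(w2*A*C)."""
    return math.exp(w1 * b * c) * math.exp(w2 * a * c)

# Partition function Z: sum of the unnormalized score over all configurations
Z = sum(unnormalized(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3))

def P(a, b, c):
    return unnormalized(a, b, c) / Z

print(sum(P(a, b, c) for a, b, c in itertools.product((0, 1), repeat=3)))  # 1.0
```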
[Figures: a Markov random field in 2D (nodes A through I) and a Hidden Markov Model (hidden states h1, h2, h3 with observations y1, y2, y3).]
Examples of directed factorizations:
P(A,B,C,D) = P(A) P(B) P(C|B) P(D|A,B,C)
Hidden Markov Model: P(y1, y2, y3, h1, h2, h3) = P(h1) P(h2|h1) P(h3|h2) P(y1|h1) P(y2|h2) P(y3|h3)
[Figure: graphical models of (a) an RBM with visible units x and hidden units h (weights W) and (b) a DBN with layers x, h1, h2, h3 and weights W0, W1, W2, shown alongside an HMM.]
Zoubin Ghahramani ‘s video lecture on graphical models: http://videolectures.net/mlss07_ghahramani_grafm/
Writing each potential as an exponential of an energy term:
P(x; θ) = (1/Z) Π_m f_m(x_m; θ) = (1/Z) Π_m e^(-E(x_m; θ)) = (1/Z) e^(-Σ_m E(x_m; θ))
E: energy function, Z: partition function.
[Figure: a Markov random field in 2D over nodes A through I.]
Example (the 2-D MRF above, one potential per factor):
P(x) = (1/Z) Π_{i=1}^{15} f_i(x_i; u_i),   e.g. f_i(x_i; u_i) = exp(u_i' x_i)
P(x; θ) = Π_m f_m(x_m; θ) / Σ_x Π_m f_m(x_m; θ)
Maximum Likelihood and gradient descent. For a dataset X = {x^(1), ..., x^(K)}, the average log-likelihood is
L(X; θ) = (1/K) Σ_k log P(x^(k); θ) = (1/K) Σ_k log f(x^(k); θ) - log Z(θ)
where f(x; θ) = Π_m f_m(x_m; θ) and Z(θ) = Σ_x f(x; θ). Learning follows the gradient:
θ^(t+1) = θ^(t) + η ∂L(X; θ)/∂θ
Gradient of the Likelihood:
∂L(X; θ)/∂θ = (1/K) Σ_k ∂ log f(x^(k); θ)/∂θ - ∫ p(x; θ) ∂ log f(x; θ)/∂θ dx
The first (data) term is tractable and easy to compute. The second (model) term is intractable: it requires sampling from p(z1, z2, ..., zM), e.g. by Gibbs sampling, and fast Contrastive Divergence uses only T = 1 sampling step.
Update rule: θ^(t+1) = θ^(t) + η ∂L(X; θ)/∂θ.
Maximum Likelihood gives an accurate but slow gradient; Contrastive Divergence gives an approximate but fast gradient, and the CD minimum need not coincide with the ML minimum.
[Figure: an RBM with visible units x1, x2, x3 and hidden units h1 ... h5.]
More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML).
The fixed points of ML learning are not fixed points of CD learning, and vice versa; ML learning can be used for fine-tuning. It is not clear whether CD learning converges (to a stable fixed point); as of 2005, a proof was not available. Further theoretical results? Please inform us.
Boltzmann Machine: an undirected graphical model with hidden nodes.
Energy function over the units x, with pairwise weights and biases θ = {w_ij, b_i}:
E(x; θ) = Σ_ij w_ij x_i x_j + Σ_i b_i x_i
Probability: P(x; θ) = (1/Z) Π_m f_m(x_m; θ), Z: the partition function.
Splitting the units into visible x and hidden h, the (full) Boltzmann machine has energy
E(x,h) = b'x + c'h + h'Wx + x'Ux + h'Vh
with within-layer connections U (visible-visible) and V (hidden-hidden).
Restricted Boltzmann Machine (RBM): undirected, loopy, and layered, with no within-layer connections, so the energy reduces to
E(x,h) = b'x + c'h + h'Wx
Because the graph is bipartite, the conditionals factorize:
P(x|h) = Π_j P(x_j|h),   P(h|x) = Π_i P(h_i|x)
Joint and marginal probabilities (the denominator is the partition function):
P(x, h) = e^(b'x + c'h + h'Wx) / Σ_{x,h} e^(b'x + c'h + h'Wx)
P(x) = Σ_h e^(b'x + c'h + h'Wx) / Σ_{x,h} e^(b'x + c'h + h'Wx)
[Figure: bipartite graph with visible units x1, x2, x3 and hidden units h1 ... h5.]
Read the manuscript for details
Parameter learning. With E(x,h) = b'x + c'h + h'Wx, x = [x1 x2 ...]', h = [h1 h2 ...]', maximum likelihood training is
max_θ L(X; θ) = max_θ Π_k P(x^(k))   ⟺   min_θ -Σ_k log P(x^(k))
where
P(x) = Σ_h e^(b'x + c'h + h'Wx) / Σ_{x,h} e^(b'x + c'h + h'Wx)
Geoffrey E. Hinton, “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14, 1771–1800 (2002)
Contrastive Divergence for the RBM is very fast. Writing
P(x; θ) = f(x; θ) / Z = Σ_h e^(b'x + c'h + h'Wx) / Σ_{x,h} e^(b'x + c'h + h'Wx)
the CD weight update is
w_ij^(t+1) = w_ij^(t) + η ( (1/K) Σ_k x_i^(k) h_j^(k) - <x_i h_j>_p )
where the model expectation <x_i h_j>_p is approximated with a single Gibbs step using the conditionals
P(x_j = 1|h) = σ(b_j + W'_{·j} · h),   P(h_i = 1|x) = σ(c_i + W_{i·} · x)
In practice (CD-1):
∂L(X; θ)/∂w_ij ≈ <x_i h_j>_0 - <x_i h_j>_1
computed with one step of alternating Gibbs sampling: starting from a training vector x, sample h from P(h|x) using P(h_i = 1|x) = σ(c_i + W_{i·} · x); sample a reconstruction x' from P(x|h) using P(x_j = 1|h) = σ(b_j + W'_{·j} · h); then compute P(h_i = 1|x') for the reconstruction term.
y: classification label
Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.
Applications of RBMs: multiclass classification, collaborative filtering, motion capture modeling, information retrieval, modeling natural images, segmentation.
Y Li, D Tarlow, R Zemel, Exploring compositional high order pattern potentials for structured output learning, CVPR 2013
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. ICML 2008.
Salakhutdinov, R., Mnih, A., & Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. ICML 2007.
Salakhutdinov, R., & Hinton, G. E. (2009). Replicated softmax: an undirected topic model. NIPS 2009.
Osindero, S., & Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. NIPS 2008.
A belief net is a directed acyclic graph composed of random variables: hidden causes at the top generate visible effects at the bottom.
[Figure: a deep belief net with visible layer x and hidden layers h1, h2, h3.]
Pixels => edges => local shapes => object parts.
The mammal brain is organized in a deep architecture, with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex. An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than an architecture whose depth is matched to the task. Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.
Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007. Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.
Linear regression, logistic regression: depth 1. Kernel SVM: depth 2. Decision tree: depth 2. Boosting: depth 2. The basic conclusion that these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one (example: logic gates, a multi-layer NN with linear threshold units and positive weights).
Yoshua Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, 2009.
[Figure: a function of variables X1 ... X5 that needs on the order of 2^(N-1) terms (N·2^(N-1) parameters) when represented with a shallow architecture, but only O(N) parameters with a deep one.]
Many popular learners are shallow: Boosting (2 layers); decision trees, LLE, KNN, and kernel SVMs (2 layers). The brain uses roughly 5-10 layers. A kernel SVM, for example, has the depth-2 form
f(x) = b + Σ_i α_i K(x, x_i)
(one layer of kernel evaluations, one layer that linearly combines them), as sketched below.
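A minimal sketch of that depth-2 decision function; the support vectors, dual coefficients, and bias below are made up, since the point is only the two-stage structure.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

# Made-up "support vectors", coefficients alpha_i, and bias b (illustrative only)
support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
alphas = np.array([0.7, -0.4, 0.9])
b = -0.1

def decision_function(x):
    """Depth 2: layer 1 computes K(x, x_i) for each support vector; layer 2 is a weighted sum plus bias."""
    return b + sum(a * rbf_kernel(x, xi) for a, xi in zip(alphas, support_vectors))

print(decision_function(np.array([0.5, 0.5])))
```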
Why is inference hard in a belief net? Hidden causes become dependent once their common effect is observed ("explaining away"): with a common parent C we have P(A,B|C) = P(A|C)P(B|C), but with a common child x1 we have P(h11, h12 | x1) ≠ P(h11| x1) P(h12 | x1) (an example from the manuscript). Solution: complementary priors.
The same inference problem (the problem of explaining away) appears in deep networks. [Figure: a deep network with hidden layers of size 2000, 1000, 500, and 30.] Solution: complementary priors.
Greedy layer-wise learning (pre-training) and fine-tuning. [Figure: the deep belief net over x, h1, h2, h3, decomposed into a stack of RBMs (x-h1, h1-h2, h2-h3) that are trained one layer at a time.]
Why does greedy layer-wise learning work? It optimizes a lower bound: when we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing the prior P(h1) in (1):
log P(x) ≥ Σ_{h1} Q(h1|x) [ log P(h1) + log P(x|h1) ] - Σ_{h1} Q(h1|x) log Q(h1|x)    (1)
An RBM can be considered as a DBN with infinitely many layers: unrolling the RBM gives an infinite directed network over x0, h0, x1, h1, x2, ... whose weights alternate between W and W' (tied weights), and this infinite net collapses back to the single RBM over x0 and h0. [Figure: the unrolled infinite net and the equivalent RBM.]
Fine-tuning the whole network with Back Propagation (BP).
Pretraining Fine-tuning
There might be no universally right depth
Copied from http://videolectures.net/mlss09uk_hinton_dbn/
[1] Sutskever, I. and Hinton, G. E., Deep Narrow Sigmoid Belief Networks are Universal Approximators. Neural Computation, 2007
Erhan et al., AISTATS 2009.
[Figure: results with pre-training vs. without (w/o) pre-training.]
Why does unsupervised pre-training make sense? [Figure: two ways image-label pairs could be generated: directly, image to label, or via underlying "stuff" that generates both the image (high-bandwidth path) and the label (low-bandwidth path).] If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels, for example: do the pixels have even parity? If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
Layer-wise pre-training is efficient but not optimal. It is possible to train the parameters of all layers jointly using a wake-sleep style (up-down) algorithm.
After learning many layers of features, we can fine-tune the features to improve generation: do a stochastic bottom-up pass and adjust the top-down (generative) weights to be good at reconstructing the feature activities in the layer below; do a stochastic top-down pass and adjust the bottom-up (recognition) weights to be good at reconstructing the feature activities in the layer above.
The RBM has no connections within a layer. This can be generalized: lateral connections for the first (visible) layer [1], or lateral connections at multiple layers [2].
[1]B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: a strategy employed by V1?,” Vision Research, vol. 37, pp. 3311–3325, December 1997. [2]S. Osindero and G. E. Hinton, “Modeling image patches with a directed hierarchy of Markov random field,” in NIPS, 2007.
Handling non-binary data: map it into [0, 1] linearly, x ← ax + b, or use another distribution for the visible units.
Static data vs. temporal (sequential) data.
Applications of the DBN: hand-written digit recognition, dimensionality reduction, information retrieval, segmentation, denoising, phone recognition, object recognition, object detection, ...
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science.
Welling, M., et al. (2004). Exponential family harmoniums with an application to information retrieval. NIPS 2004.
Nair, V., & Hinton, G. E. (2009). 3-D object recognition with deep belief nets. NIPS 2009.
...
3-D object recognition results: kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%. Using additional unlabeled data (with the same labeled data as before), the DBN achieves 5.2%.
Experiment: 11,000 unlabeled cases; 100, 500, or 1000 labeled cases; test on face patches from new people.
Models compared: a GP (Gaussian process) on the pixels, a GP on the top-level features, and a GP on the top-level features with fine-tuning.
Deep autoencoders. They always looked like a really nice way to do non-linear dimensionality reduction, and we now have a much better way to train them: layer-wise RBM pre-training followed by fine-tuning. [Figure: encoder 28x28 to 1000 to 500 to 250 to 30 neurons (the 30-dimensional code layer uses linear units), decoder 30 to 250 to 500 to 1000 to 28x28, with decoder weights tied to the transposes of the encoder weights W1 ... W4.]
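A structural sketch of that autoencoder: the encoder layers above, a linear 30-D code, and a decoder that reuses the transposed encoder weights. Only the forward pass is shown; the weights are random placeholders, and the RBM pre-training and fine-tuning are not implemented here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [784, 1000, 500, 250, 30]                    # 28x28 input, 30-D code
rng = np.random.default_rng(0)
weights = [0.01 * rng.normal(size=(layer_sizes[i + 1], layer_sizes[i]))
           for i in range(len(layer_sizes) - 1)]           # W1 ... W4 (encoder)

def encode(x):
    h = x
    for i, W in enumerate(weights):
        z = W @ h
        h = z if i == len(weights) - 1 else sigmoid(z)     # code layer uses linear units
    return h

def decode(code):
    h = code
    for W in reversed(weights):                            # decoder reuses W4', ..., W1'
        h = sigmoid(W.T @ h)
    return h

x = rng.random(784)                                        # stand-in for a 28x28 image
reconstruction = decode(encode(x))
print(reconstruction.shape)                                # (784,)
```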
[Figure: reconstructions of real data by the 30-D deep autoencoder and by 30-D PCA.]
[Figure: reconstructions of real data by the 30-D deep autoencoder, 30-D logistic PCA, and 30-D PCA.]
Generative vs. discriminative models. Generative models have representational power: they explicitly or implicitly model the distribution of inputs as well as outputs. Discriminative models model the posterior probabilities directly.
Which one is better? A very controversial topic; the comparison can be made at the level of the model (the DBN is generative, while a standard NN/CNN is discriminative), the application, and the learning. Hinton: the superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model; this set of domains is being eroded by Moore's law.