RESTRICTED BOLTZMANN MACHINES AND DEEP BELIEF NETWORKS ON MULTI-CORE PROCESSORS
Noel Lopes, Bernardete Ribeiro, João Gonçalves
University of Coimbra / Polytechnic Institute of Guarda
June 11, 2012, WCCI/IJCNN
DEEP BELIEF NETWORKS (DBNs)
“Deep belief nets are probabilistic generative models that are composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. [...] The lower layers receive top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.”
Geoffrey E. Hinton [Hinton et al., 2006]
OUTLINE
Motivation
Deep Belief Networks
Restricted Boltzmann Machines
GPU Implementation
Results on the MNIST Handwritten Digits Database
Conclusions and Future Work
MOTIVATION
The robustness and efficiency with which humans recognize objects has long been an intriguing challenge in computational intelligence.
Theoretical results suggest that deep architectures are fundamental for learning the complex functions that can represent high-level abstractions (e.g. vision, language) [Bengio, 2009].
Empirical results show their successful application to classification, regression, dimensionality reduction, object recognition, information retrieval, robotics, and collaborative filtering [Larochelle et al., 2007, Swersky et al., 2010].
DEEP VERSUS SHALLOW ARCHITECTURES
[Diagram: a deep architecture maps the model inputs (x) through level 1 (low-order features), level 2, ..., up to level d (high-order features) to the model outputs (y); a shallow architecture maps the inputs to the outputs through a single layer of non-linear operations.]
DEEP BELIEF NETWORKS
DBNs are composed of several Restricted Boltzmann Machines (RBMs) stacked on top of each other.
[Diagram: a DBN with the input layer x and stacked hidden layers h1, h2, h3.]
RESTRICTED BOLTZMANN MACHINES
An RBM is an energy-based generative model that consists of a layer of binary visible units, v, and a layer of binary hidden units, h.
[Diagram: an RBM with visible units v1, ..., vI plus a bias unit, fully connected to hidden units h1, ..., hJ plus a bias unit; the visible-to-hidden pass acts as an encoder and the hidden-to-visible pass as a decoder.]
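To make the later code sketches concrete, here is a minimal host-side container for the RBM parameters. The field names follow the slides' notation; the struct itself is our assumption, not the authors' data layout.

```cuda
// Minimal RBM parameter container (illustrative; names follow the slides).
typedef struct {
    int I, J;   // number of visible (I) and hidden (J) units
    float *w;   // connection weights, row-major J x I: w[j * I + i] = W_ji
    float *a;   // visible-unit biases, length I
    float *b;   // hidden-unit biases, length J
} RBM;
```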
RESTRICTED BOLTZMANN MACHINES
Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by (1):
E(v, h) = -\sum_{i=1}^{I} a_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J} \sum_{i=1}^{I} W_{ji} v_i h_j ,   (1)

where a_i and b_j are the bias terms of the visible and hidden units and W_{ji} is the weight of the connection between v_i and h_j.
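As a sanity check on eq. (1), the energy is a direct triple of sums. A minimal sequential sketch, using the illustrative RBM struct above (not the authors' code):

```cuda
// Energy of a joint configuration (v, h), eq. (1).
float rbm_energy(const RBM *m, const float *v, const float *h) {
    float e = 0.0f;
    for (int i = 0; i < m->I; i++) e -= m->a[i] * v[i];   // -sum_i a_i v_i
    for (int j = 0; j < m->J; j++) e -= m->b[j] * h[j];   // -sum_j b_j h_j
    for (int j = 0; j < m->J; j++)                        // -sum_j sum_i W_ji v_i h_j
        for (int i = 0; i < m->I; i++)
            e -= m->w[j * m->I + i] * v[i] * h[j];
    return e;
}
```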
RESTRICTED BOLTZMANN MACHINES
The RBM defines a joint probability over (v, h):
p(v, h) = \frac{e^{-E(v,h)}}{Z} ,   (2)

where Z is the partition function, obtained by summing e^{-E(v,h)} over all possible (v, h) configurations:

Z = \sum_{v,h} e^{-E(v,h)} .   (3)
RESTRICTED BOLTZMANN MACHINES
Given a random input configuration v, the state of the hidden unit j is set to 1 with probability:
p(h_j = 1 | v) = \sigma\left(b_j + \sum_{i=1}^{I} v_i W_{ji}\right) ,   (4)

where \sigma(x) = 1/(1 + e^{-x}) is the logistic sigmoid. Similarly, given a random hidden vector h, the state of the visible unit i is set to 1 with probability:

p(v_i = 1 | h) = \sigma\left(a_i + \sum_{j=1}^{J} h_j W_{ji}\right) .   (5)
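Eqs. (4) and (5) map directly onto two small loops. A sequential host-side reference sketch, built on the illustrative RBM struct above (the helper names and the use of rand() are our assumptions; the GPU kernels later in the talk parallelize exactly these loops):

```cuda
#include <math.h>
#include <stdlib.h>

static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }
static float uniform01(void)  { return (float)rand() / (float)RAND_MAX; }

// Sample h ~ p(h | v): each hidden unit fires with the probability of eq. (4).
void sample_hidden(const RBM *m, const float *v, float *h) {
    for (int j = 0; j < m->J; j++) {
        float s = m->b[j];
        for (int i = 0; i < m->I; i++) s += v[i] * m->w[j * m->I + i];
        h[j] = (uniform01() < sigmoid(s)) ? 1.0f : 0.0f;
    }
}

// Sample v ~ p(v | h): symmetric, following eq. (5).
void sample_visible(const RBM *m, const float *h, float *v) {
    for (int i = 0; i < m->I; i++) {
        float s = m->a[i];
        for (int j = 0; j < m->J; j++) s += h[j] * m->w[j * m->I + i];
        v[i] = (uniform01() < sigmoid(s)) ? 1.0f : 0.0f;
    }
}
```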
TRAINING AN RBM
The following learning rule performs stochastic steepest ascent in the log probability of the training data:
\frac{\partial \log p(v)}{\partial W_{ji}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty ,   (6)

where \langle \cdot \rangle_0 denotes an expectation under the data distribution (p_0) and \langle \cdot \rangle_\infty an expectation under the model distribution (p_\infty).
ALTERNATING GIBBS SAMPLING
[Diagram: the chain starts at the data, v(0) = x. Sampling h(0) from p(h_j = 1 | v) = \sigma(b_j + \sum_i v_i W_{ji}) yields \langle v_i h_j \rangle_0. Alternating updates then produce v(1) from p(v_i = 1 | h) = \sigma(a_i + \sum_j h_j W_{ji}), followed by h(1), v(2), h(2), and so on; in the limit (v(\infty), h(\infty)) the chain samples from the model distribution and yields \langle v_i h_j \rangle_\infty.]
CONTRASTIVE DIVERGENCE (CD–k)
Hinton proposed the Contrastive Divergence (CD) algorithm: CD–k replaces \langle \cdot \rangle_\infty by \langle \cdot \rangle_k, the expectation after only a small number k of Gibbs steps.
CONTRASTIVE DIVERGENCE (CD–k)
v(0) ← x
Compute the binary states (features) of the hidden units, h(0), using v(0)
for n ← 1 to k
    Compute the "reconstruction" states of the visible units, v(n), using h(n-1)
    Compute the "reconstruction" states of the hidden units, h(n), using v(n)
end for
Update the weights and biases, according to:

\Delta W_{ji} = \gamma (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)   (7)
\Delta b_j = \gamma (\langle h_j \rangle_0 - \langle h_j \rangle_k)   (8)
\Delta a_i = \gamma (\langle v_i \rangle_0 - \langle v_i \rangle_k)   (9)

where \gamma is the learning rate.
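Putting the pieces together, a sequential CD–k update for a single training vector might look as follows. This is a sketch built on the illustrative helpers above; using the sampled binary states directly in the updates is a simplification of the expectations in eqs. (7)-(9), and is our assumption rather than the authors' exact procedure.

```cuda
#include <stdlib.h>
#include <string.h>

// One CD-k update for a single training vector x (length m->I); gamma is the
// learning rate. Sketch: batch averaging and mean-field details are omitted.
void cd_k_update(RBM *m, const float *x, int k, float gamma) {
    float *v  = (float *)malloc(m->I * sizeof(float));
    float *h0 = (float *)malloc(m->J * sizeof(float));
    float *h  = (float *)malloc(m->J * sizeof(float));

    memcpy(v, x, m->I * sizeof(float));     // v(0) <- x
    sample_hidden(m, v, h0);                // h(0) from v(0)
    memcpy(h, h0, m->J * sizeof(float));
    for (int n = 1; n <= k; n++) {          // k alternating Gibbs steps
        sample_visible(m, h, v);            // v(n) from h(n-1)
        sample_hidden(m, v, h);             // h(n) from v(n)
    }
    for (int j = 0; j < m->J; j++) {
        for (int i = 0; i < m->I; i++)      // eq. (7)
            m->w[j * m->I + i] += gamma * (x[i] * h0[j] - v[i] * h[j]);
        m->b[j] += gamma * (h0[j] - h[j]);  // eq. (8)
    }
    for (int i = 0; i < m->I; i++)
        m->a[i] += gamma * (x[i] - v[i]);   // eq. (9)

    free(v); free(h0); free(h);
}
```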
DEEP BELIEF NETWORKS (DBN)
[Diagram: the stack x ↔ h1 ↔ h2 ↔ h3, with adjacent layers linked by p(h1 | x) and p(x | h1), p(h2 | h1) and p(h1 | h2), p(h3 | h2) and p(h2 | h3); the lower layers capture low-level features and the upper layers high-level features (concepts).]
GPU IMPLEMENTATION
Training a DBN is a computationally expensive task: it involves training several RBMs and may require a considerable amount of time.
Solution? A parallel GPU implementation.
CUDA – DEVICE ARCHITECTURE
[Diagram: a CUDA device contains N streaming multiprocessors (SM1, SM2, ..., SMN) that share the device memory; each SM holds M scalar processors, an instruction unit, and its own fast shared memory.]
CUDA – LAUNCHING A KERNEL GRID
[Diagram: a kernel launch creates a grid of blocks, Block(0,0) ... Block(3,1); each block, e.g. Block(3,0), contains a 2-D arrangement of threads, Thread(0,0) ... Thread(3,2).]
Threads within a block can share information. However, blocks are required to run independently. To address scalability, the tasks should be partitioned accordingly, as in the sketch below.
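A minimal sketch of these mechanics (the kernel and sizes are illustrative, not from the paper): each block works independently on its own slice of the data, while the threads inside a block cooperate through __shared__ memory.

```cuda
#include <cstdio>

// Each block sums its own 256-element slice of x in shared memory;
// blocks never communicate with one another.
__global__ void blockSum(const float *x, float *blockTotals) {
    __shared__ float partial[256];                   // visible to this block only
    partial[threadIdx.x] = x[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction within the block
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockTotals[blockIdx.x] = partial[0];
}

int main() {
    const int blocks = 8, threads = 256, n = blocks * threads;
    float *x, *totals;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&totals, blocks * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = 1.0f;
    blockSum<<<blocks, threads>>>(x, totals);        // a grid of 8 independent blocks
    cudaDeviceSynchronize();
    printf("block 0 total = %.0f\n", totals[0]);     // prints 256
    cudaFree(x); cudaFree(totals);
    return 0;
}
```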
CUDA – SCALABILITY
[Diagram: the same 4×2 grid of blocks executed on two different devices. On a device with 2 SMs, each SM runs four of the blocks; on a device with 4 SMs, each SM runs two. The same grid thus scales automatically with the number of SMs.]
KERNELS
The data and parameters reside on the device: v_data ∈ R^{N×I} (RBM inputs, x), h_data ∈ R^{N×J} (RBM outputs for the data), v_recon ∈ R^{N×I} (reconstructed inputs), h_recon ∈ R^{N×J} (reconstructed outputs), w ∈ R^{J×I} (weights), a ∈ R^{I} (visible units bias), and b ∈ R^{J} (hidden units bias). Each epoch runs four steps:
Step 1. ComputeStatusHiddenUnits: compute h_data from v_data.
Step 2. ComputeStatusVisibleUnits: compute v_recon from h_data.
Step 3. ComputeStatusHiddenUnits: compute h_recon from v_recon.
Step 4. CorrectWeights: correct the weights and biases.
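A host-side sketch of how these four launches might be sequenced for CD–1. The kernel names follow the slides, but the launch geometry is our assumption, and the kernels themselves are only sketched on the next slides (ComputeStatusVisibleUnits is symmetric to the hidden-units kernel).

```cuda
// One CD-1 epoch as four kernel launches (sketch; geometry is an assumption).
void trainEpoch(const float *v_data, float *h_data, float *v_recon, float *h_recon,
                float *w, float *a, float *b, int N, int I, int J, float gamma) {
    const int threads = 128;                   // power of two, for the reductions
    const size_t shmem = threads * sizeof(float);
    dim3 hiddenGrid(J, N), visibleGrid(I, N);

    // Step 1: h_data from v_data (eq. 4)
    ComputeStatusHiddenUnits<<<hiddenGrid, threads, shmem>>>(v_data, w, b, h_data, I, J);
    // Step 2: v_recon from h_data (eq. 5)
    ComputeStatusVisibleUnits<<<visibleGrid, threads, shmem>>>(h_data, w, a, v_recon, I, J);
    // Step 3: h_recon from v_recon (eq. 4)
    ComputeStatusHiddenUnits<<<hiddenGrid, threads, shmem>>>(v_recon, w, b, h_recon, I, J);
    // Step 4: update w (and, in the full version, a and b) per eqs. (7)-(9)
    dim3 block(16, 16), grid(I / 16, J / 16);  // assumes I, J multiples of 16
    CorrectWeights<<<grid, block>>>(v_data, h_data, v_recon, h_recon, w, I, J, N, gamma);
}
```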
COMPUTESTATUSHIDDENUNITS AND COMPUTESTATUSVISIBLEUNITS KERNELS
Each thread represents a connection: it multiplies the clamped input by the corresponding weight and stores the product in shared memory.
Each block represents a neuron: it uses the fast shared memory to sum up the values computed by its threads.
[Diagram: a block (neuron) made up of one thread per connection, Connection 1, Connection 2, ..., Connection J.]
A sketch of the hidden-units kernel follows.
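A minimal sketch of the hidden-units kernel under this mapping, with one block per (hidden unit, sample) pair and threads striding over the connections. It computes the probabilities of eq. (4); the authors' kernel also samples binary states from them, which is omitted here.

```cuda
// Sketch of ComputeStatusHiddenUnits: block (j, n) computes hidden unit j for
// sample n. blockDim.x must be a power of two for the tree reduction.
// Launch: <<<dim3(J, N), threads, threads * sizeof(float)>>>.
__global__ void ComputeStatusHiddenUnits(const float *v, const float *w,
                                         const float *b, float *h,
                                         int I, int J) {
    extern __shared__ float partial[];      // one slot per thread
    const int j = blockIdx.x;               // hidden unit: the block's neuron
    const int n = blockIdx.y;               // training sample
    float sum = 0.0f;
    for (int i = threadIdx.x; i < I; i += blockDim.x)
        sum += v[n * I + i] * w[j * I + i]; // each thread handles connection(s) i
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // shared-memory reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)                   // eq. (4)
        h[n * J + j] = 1.0f / (1.0f + expf(-(b[j] + partial[0])));
}
```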
STORING THE CONNECTION WEIGHTS
ComputeStatusHiddenUnits - Coalesced access ComputeStatusVisibleUnits - Uncoalesced access
w11 w12 w13 w14 w15 · · · w1I w21 w22 w23 w24 w25 · · · w2I w31 w32 w33 w34 w35 · · · w3I w41 w42 w43 w44 w45 · · · w4I w51 w52 w53 w54 w55 · · · w5I · · · · · · · · · · · · · · · · · · · · · wJ1 wJ2 wJ3 wJ4 wJ5 · · · wJI w13 w23 w33 w43 w53 · · · wJ3 w31 w32 w33 w34 w35 · · · w3I
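A toy demonstration of the difference (our own microbenchmark, not the paper's kernels): rowRead mimics the hidden-units access pattern, colRead the visible-units one; on most GPUs the strided version is several times slower.

```cuda
#include <cstdio>

// rowRead: threads sweep i inside row j -> consecutive addresses (coalesced).
__global__ void rowRead(const float *w, float *out, int I) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = w[blockIdx.x * I + threadIdx.x];
}
// colRead: threads sweep j inside column i -> stride-I addresses (uncoalesced).
__global__ void colRead(const float *w, float *out, int I) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = w[threadIdx.x * I + blockIdx.x];
}

int main() {
    const int I = 1024, J = 1024;
    float *w, *out, msRow, msCol;
    cudaMalloc(&w, I * J * sizeof(float));
    cudaMalloc(&out, I * J * sizeof(float));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaEventRecord(t0);
    rowRead<<<J, I>>>(w, out, I);            // one block per row j
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&msRow, t0, t1);

    cudaEventRecord(t0);
    colRead<<<I, J>>>(w, out, I);            // one block per column i
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&msCol, t0, t1);

    printf("coalesced: %.3f ms, uncoalesced: %.3f ms\n", msRow, msCol);
    return 0;
}
```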
CORRECTWEIGHTS KERNEL – FIRST APPROACH
Each thread gathers and sums up the values for one or more samples; each block corrects the weight of one connection.
[Diagram: a block (connection) made up of threads for Sample 1, Sample 2, Sample 3, ..., Sample N.]
PROBLEMS?
[Diagram: the RBM, highlighting the bias units alongside the connections.]
\Delta W_{ji} = \gamma (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)
\Delta b_j = \gamma (\langle h_j \rangle_0 - \langle h_j \rangle_k)
\Delta a_i = \gamma (\langle v_i \rangle_0 - \langle v_i \rangle_k)
CORRECTWEIGHTS KERNEL – IMPROVED APPROACH
Each block has 16 × 16 threads, and each thread within a block must now process all the samples. However, we can access the v_i and h_j variables in a coalesced way and store them in shared memory for faster access. Although this new approach has fewer blocks, it performs much better than our first approach (≈ 15× faster).
[Diagram: block 0 covers a 16 × 16 tile of connections, Connection (0,0) through Connection (15,15).]
A sketch of this kernel follows.
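A minimal sketch of the tiled kernel under these choices. It updates the weights only; the bias updates of eqs. (8)-(9), boundary handling, and the exact staging scheme are omitted or assumed.

```cuda
#define TILE 16

// Sketch of the improved CorrectWeights: each block owns a 16x16 tile of
// connections; each thread accumulates <vi hj>0 - <vi hj>k over all N samples,
// with the tile's v and h values staged in shared memory via coalesced reads.
// Assumes I and J are multiples of TILE.
__global__ void CorrectWeights(const float *v0, const float *h0,
                               const float *vk, const float *hk,
                               float *w, int I, int J, int N, float gamma) {
    __shared__ float sv0[TILE], svk[TILE], sh0[TILE], shk[TILE];
    const int i = blockIdx.x * TILE + threadIdx.x;  // visible index of this connection
    const int j = blockIdx.y * TILE + threadIdx.y;  // hidden index of this connection
    float delta = 0.0f;
    for (int n = 0; n < N; n++) {                   // every thread scans all samples
        if (threadIdx.y == 0) {                     // one row of threads stages the tile
            sv0[threadIdx.x] = v0[n * I + i];
            svk[threadIdx.x] = vk[n * I + i];
            sh0[threadIdx.x] = h0[n * J + blockIdx.y * TILE + threadIdx.x];
            shk[threadIdx.x] = hk[n * J + blockIdx.y * TILE + threadIdx.x];
        }
        __syncthreads();
        delta += sv0[threadIdx.x] * sh0[threadIdx.y]   // <vi hj>0 - <vi hj>k
               - svk[threadIdx.x] * shk[threadIdx.y];
        __syncthreads();
    }
    w[j * I + i] += gamma * delta / N;              // eq. (7), averaged over the batch
}
```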
TIME SPENT IN EACH TASK – FIRST AND IMPROVED APPROACHES

Task                                First approach   Improved approach
Generate random numbers (cuRAND)          5.53%           14.86%
ComputeStatusHiddenUnits kernel          10.24%           27.50%
ComputeStatusVisibleUnits kernel         17.09%           45.91%
CorrectWeights kernel                    67.14%           11.73%
EXPERIMENTAL SETUP
We tested our approach on the MNIST database. Each sample is a 28 × 28 pixel image of a handwritten digit (784 inputs).
Hardware:
CPU: Intel dual-core i5-2410M (8 GB memory)
GPU: NVIDIA GeForce GTX 460
NVIDIA GEFORCE GTX 460

Number of streaming multiprocessors       7
Number of cores                         336
Peak performance (GFLOPS)             940.8
Device memory (GB)                        1
Memory bandwidth (GB/s)               112.5
Shader clock speed (GHz)                1.4
RESULTS (1,000 SAMPLES)
[Plot: training time (s, log scale, 0.01 to 100) versus number of hidden units (100 to 900), for the GTX 460 (GPU) and the dual-core i5 (CPU). The GPU speedups over the CPU are 23.26×, 23.13×, 21.86×, 24.46×, and 29.79×.]
RESULTS (10,000 SAMPLES)
[Plot: training time (s, log scale, 0.1 to 1000) versus number of hidden units (100 to 900), for the GTX 460 (GPU) and the dual-core i5 (CPU). The GPU speedups over the CPU are 32.83×, 30.29×, 28.59×, 29.47×, and 38.16×.]
RESULTS (60,000 SAMPLES)
[Plot: training time (s, log scale, 1 to 10000) versus number of hidden units (100 to 900), for the GTX 460 (GPU) and the dual-core i5 (CPU). The GPU speedups over the CPU are 42.73×, 43.46×, 38.64×, 41.83×, and 46.07×.]
ADAPTIVE STEP SIZES
[Figure: a set of training images and their reconstructions after 10, 100, 250, 500, 750, and 1000 epochs.]