

SLIDE 1

RESTRICTED BOLTZMANN MACHINES AND DEEP BELIEF NETWORKS ON MULTI-CORE PROCESSORS

Noel Lopes, Bernardete Ribeiro, João Gonçalves

University of Coimbra / Polytechnic Institute of Guarda

June 11, 2012 WCCI–IJCNN

SLIDE 2

DEEP BELIEF NETWORKS (DBNS)

“Deep belief nets are probabilistic generative models that are composed of multiple layers of stochastic latent variables. The latent variables typically have binary values and are often called hidden units or feature detectors. [...] The lower layers receive top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.”

Geoffrey E. Hinton [Hinton et al., 2006]

SLIDE 3

OUTLINE

Motivation
Deep Belief Networks
Restricted Boltzmann Machines
GPU implementation
Results on the MNIST Handwritten Database
Conclusions and Future Work

SLIDE 4

MOTIVATION

The robustness and efficiency with which humans recognize objects has long been an intriguing challenge in computational intelligence.

SLIDE 5

MOTIVATION

The robustness and efficiency with which humans recognize objects has long been an intriguing challenge in computational intelligence. Theoretical results suggest that deep architectures are fundamental for learning the complex functions that can represent high-level abstractions (e.g. vision, language) [Bengio, 2009].

SLIDE 6

MOTIVATION

The robustness and efficiency with which humans recognize objects has long been an intriguing challenge in computational intelligence. Theoretical results suggest that deep architectures are fundamental for learning the complex functions that can represent high-level abstractions (e.g. vision, language) [Bengio, 2009]. Empirical results show their successful application to classification, regression, dimensionality reduction, object recognition, information retrieval, robotics, and collaborative filtering [Larochelle et al., 2007, Swersky et al., 2010].

SLIDE 7

DEEP VERSUS SHALLOW ARCHITECTURES

[Figure: a deep architecture maps the model inputs (x) through levels 1 to d, from low-order to high-order features, to the model outputs (y); a shallow architecture maps the inputs (x) to the outputs (y) through a single layer of non-linear operations.]

SLIDE 8

DEEP BELIEF NETWORKS

DBNs are composed of several Restricted Boltzmann Machines (RBMs) stacked on top of each other.

[Figure: input layer x with hidden layers h1, h2, and h3 stacked on top.]

SLIDE 9

RESTRICTED BOLTZMANN MACHINES

An RBM is an energy-based generative model that consists of a layer of binary visible units, v, and a layer of binary hidden units, h.

[Figure: an RBM with visible units v1 … vI and hidden units h1 … hJ, each layer including a unit clamped at 1 for the bias; the visible-to-hidden direction acts as an encoder and the hidden-to-visible direction as a decoder.]

SLIDE 10

RESTRICTED BOLTZMANN MACHINES

Given an observed state, the energy of the joint configuration of the visible and hidden units (v, h) is given by (1):

E(v, h) = − ∑_{i=1}^{I} a_i v_i − ∑_{j=1}^{J} b_j h_j − ∑_{j=1}^{J} ∑_{i=1}^{I} W_{ji} v_i h_j ,  (1)
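To make equation (1) concrete, here is a direct host-side evaluation of the energy; a minimal illustrative sketch of ours (the identifiers are not from the paper), assuming the weight matrix is stored row-major so that W_ji lives at W[j * I + i]:

// Direct evaluation of equation (1); illustrative only.
float RbmEnergy(const float *v, const float *h, const float *W,
                const float *a, const float *b, int I, int J) {
    float E = 0.0f;
    for (int i = 0; i < I; i++) E -= a[i] * v[i];      // - sum_i a_i v_i
    for (int j = 0; j < J; j++) {
        E -= b[j] * h[j];                              // - sum_j b_j h_j
        for (int i = 0; i < I; i++)
            E -= W[j * I + i] * v[i] * h[j];           // - sum_ji W_ji v_i h_j
    }
    return E;
}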

SLIDE 11

RESTRICTED BOLTZMANN MACHINES

The RBM defines a joint probability over (v, h):

p(v, h) = e^{−E(v,h)} / Z ,  (2)

where Z is the partition function, obtained by summing e^{−E(v,h)} over all possible (v, h) configurations:

Z = ∑_{v,h} e^{−E(v,h)} .  (3)
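Since Z sums over every binary configuration, it can be computed exactly only for toy models. The brute-force sketch below (ours, reusing the illustrative RbmEnergy from the previous slide) makes the definition in equation (3) concrete, assuming I and J are at most 16:

#include <cmath>

float RbmEnergy(const float *v, const float *h, const float *W,
                const float *a, const float *b, int I, int J);  // sketch above

// Brute-force evaluation of equation (3): enumerate all 2^I * 2^J binary
// (v, h) configurations and sum e^{-E(v,h)}. Exponential cost.
float PartitionFunction(const float *W, const float *a, const float *b,
                        int I, int J) {
    float v[16], h[16];
    float Z = 0.0f;
    for (long vbits = 0; vbits < (1L << I); vbits++) {
        for (int i = 0; i < I; i++) v[i] = (vbits >> i) & 1;
        for (long hbits = 0; hbits < (1L << J); hbits++) {
            for (int j = 0; j < J; j++) h[j] = (hbits >> j) & 1;
            Z += expf(-RbmEnergy(v, h, W, a, b, I, J));
        }
    }
    return Z;
}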

SLIDE 12

RESTRICTED BOLTZMANN MACHINES

Given a random input configuration v, the state of hidden unit j is set to 1 with probability

p(h_j = 1 | v) = σ(b_j + ∑_{i=1}^{I} v_i W_{ji}) .  (4)

Similarly, given a random hidden vector h, the state of visible unit i is set to 1 with probability

p(v_i = 1 | h) = σ(a_i + ∑_{j=1}^{J} h_j W_{ji}) .  (5)
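As a first operational view of equation (4), here is a naive CUDA kernel of ours (not the paper's kernel, which maps one block per neuron and one thread per connection, as described later): one thread per hidden unit, with the caller supplying uniform random numbers rand[j] in [0, 1), e.g. generated with cuRAND.

// Naive sampler for equation (4): thread j computes p(h_j = 1 | v) and
// draws a binary state. W is row-major: W_ji at W[j * I + i].
__global__ void SampleHiddenUnits(const float *v, const float *W,
                                  const float *b, const float *rand,
                                  float *h, int I, int J) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= J) return;

    float sum = b[j];
    for (int i = 0; i < I; i++)
        sum += v[i] * W[j * I + i];

    float p = 1.0f / (1.0f + expf(-sum));   // sigmoid
    h[j] = (rand[j] < p) ? 1.0f : 0.0f;     // Bernoulli sample
}

Sampling the visible layer with equation (5) is symmetric: swap the roles of v and h, use the bias a_i, and read the weight column W_ji for fixed i.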

SLIDE 13

TRAINING AN RBM

The following learning rule performs stochastic steepest ascent in the log probability of the training data:

∂ log p(v) / ∂W_{ji} = ⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_∞ ,  (6)

where ⟨·⟩_0 denotes the expectation under the data distribution (p_0) and ⟨·⟩_∞ denotes the expectation under the model distribution.

SLIDE 14

GIBBS SAMPLING

[Figure: the Gibbs chain starts at the data, v(0) = x; the hidden state h(0) is sampled with p(h_j = 1 | v) = σ(b_j + ∑_{i=1}^{I} v_i W_{ji}), and ⟨v_i h_j⟩_0 is measured at this step.]

SLIDE 15

ALTERNATING GIBBS SAMPLING

[Figure: from h(0), the visible reconstruction v(1) is sampled with p(v_i = 1 | h) = σ(a_i + ∑_{j=1}^{J} h_j W_{ji}).]

SLIDE 16

ALTERNATING GIBBS SAMPLING

[Figure: alternating the two sampling steps yields the chain v(0) → h(0) → v(1) → h(1) → v(2) → h(2) → … → v(∞) → h(∞); ⟨v_i h_j⟩_∞ is measured at equilibrium.]

SLIDE 17

CONTRASTIVE DIVERGENCE (CD–k)

Hinton proposed the Contrastive Divergence (CD) algorithm. CD–k replaces ⟨·⟩_∞ with ⟨·⟩_k, for small values of k.

SLIDE 18

CONTRASTIVE DIVERGENCE (CD–k)

v(0) ← x
Compute the binary (feature) states of the hidden units, h(0), using v(0)
for n ← 1 to k
    Compute the “reconstruction” states of the visible units, v(n), using h(n−1)
    Compute the “reconstruction” states of the hidden units, h(n), using v(n)
end for
Update the weights and biases according to:

ΔW_{ji} = γ(⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_k)  (7)

Δb_j = γ(⟨h_j⟩_0 − ⟨h_j⟩_k)  (8)

Δa_i = γ(⟨v_i⟩_0 − ⟨v_i⟩_k)  (9)
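The listing above translates directly into code. The following CPU reference sketch of one CD-1 update (k = 1) for a single training vector is ours, not the paper's implementation; sigmoidf, uniformf, sample_h, and sample_v are illustrative helpers implementing equations (4) and (5), and gamma is the learning rate γ.

#include <cmath>
#include <cstdlib>

static float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }
static float uniformf() { return (float)rand() / (float)RAND_MAX; }

// Equation (4): sample binary hidden states given v.
static void sample_h(const float *v, float *h, const float *W,
                     const float *b, int I, int J) {
    for (int j = 0; j < J; j++) {
        float sum = b[j];
        for (int i = 0; i < I; i++) sum += v[i] * W[j * I + i];
        h[j] = (uniformf() < sigmoidf(sum)) ? 1.0f : 0.0f;
    }
}

// Equation (5): sample binary visible states given h.
static void sample_v(const float *h, float *v, const float *W,
                     const float *a, int I, int J) {
    for (int i = 0; i < I; i++) {
        float sum = a[i];
        for (int j = 0; j < J; j++) sum += h[j] * W[j * I + i];
        v[i] = (uniformf() < sigmoidf(sum)) ? 1.0f : 0.0f;
    }
}

// One CD-1 update for a single training vector x. Buffers v0, h0, v1, h1
// are caller-allocated scratch.
void Cd1Update(const float *x, float *W, float *a, float *b,
               float *v0, float *h0, float *v1, float *h1,
               int I, int J, float gamma) {
    for (int i = 0; i < I; i++) v0[i] = x[i];   // v(0) <- x
    sample_h(v0, h0, W, b, I, J);               // h(0) from v(0)
    sample_v(h0, v1, W, a, I, J);               // v(1) from h(0)
    sample_h(v1, h1, W, b, I, J);               // h(1) from v(1)

    for (int j = 0; j < J; j++) {
        for (int i = 0; i < I; i++)             // equation (7)
            W[j * I + i] += gamma * (v0[i] * h0[j] - v1[i] * h1[j]);
        b[j] += gamma * (h0[j] - h1[j]);        // equation (8)
    }
    for (int i = 0; i < I; i++)
        a[i] += gamma * (v0[i] - v1[i]);        // equation (9)
}

The GPU implementation described later performs the same four phases for N samples at a time.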

SLIDE 19

DEEP BELIEF NETWORKS (DBN)

[Figure: a single RBM linking x and h1, with recognition weights p(h1|x) and generative weights p(x|h1).]

SLIDE 20

DEEP BELIEF NETWORKS (DBN)

[Figure: a second RBM stacked on the first, adding h2 with p(h2|h1) and p(h1|h2).]

SLIDE 21

DEEP BELIEF NETWORKS (DBN)

[Figure: a third RBM completes the stack x, h1, h2, h3 with the conditionals p(h1|x)/p(x|h1), p(h2|h1)/p(h1|h2), and p(h3|h2)/p(h2|h3).]

SLIDE 22

DEEP BELIEF NETWORKS (DBN)

[Figure: in the stack x → h1 → h2 → h3, the lower layers capture low-level features and the upper layers capture high-level features (concepts).]

SLIDE 23

GPU IMPLEMENTATION

Training a DBN is a computationally expensive task that involves training several RBMs and may require a considerable amount of time.

SLIDE 24

GPU IMPLEMENTATION

Training a DBN is a computationally expensive task that involves training several RBMs and may require a considerable amount of time. Solution?

A parallel GPU implementation.

SLIDE 25

CUDA – DEVICE ARCHITECTURE

[Figure: a CUDA device contains streaming multiprocessors SM1 … SMN, each with processors 1 … M, an instruction unit, and on-chip shared memory; all SMs access the global device memory.]

SLIDE 26

CUDA – LAUNCHING A KERNEL GRID

[Figure: a kernel grid of blocks Block(0,0) … Block(3,1); each block, e.g. Block(3,0), contains threads Thread(0,0) … Thread(3,2).]

Threads within a block can share information.

SLIDE 27

CUDA – LAUNCHING A KERNEL GRID


Threads within a block can share information. However, blocks are required to run independently.

SLIDE 28

CUDA – LAUNCHING A KERNEL GRID


Threads within a block can share information. However, blocks are required to run independently. To address scalability, the tasks should be partitioned accordingly; a minimal launch example follows.
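A minimal, self-contained example of ours (not from the paper) showing how the grid/block layout above is expressed when launching a kernel:

#include <cstdio>

// Each thread reports its coordinates, mirroring the Block(x,y) /
// Thread(x,y) layout sketched above (requires compute capability 2.0+
// for device-side printf).
__global__ void WhoAmI() {
    printf("Block(%d,%d) Thread(%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main() {
    dim3 blocks(4, 2);     // a 4 x 2 grid of blocks
    dim3 threads(4, 3);    // 4 x 3 threads per block
    WhoAmI<<<blocks, threads>>>();
    cudaDeviceSynchronize();
    return 0;
}

Because the blocks are independent, the runtime is free to schedule them on however many SMs the device has, which is exactly the scalability argument of the next slide.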

SLIDE 29

CUDA – SCALABILITY

[Figure: the same 4 × 2 grid of blocks runs on a device with 2 SMs (four blocks per SM) and on a device with 4 SMs (two blocks per SM); the device with more SMs executes the grid in fewer steps.]

SLIDE 30

KERNELS

Given the RBM inputs vdata ∈ ℝ^{N×I} (the data vectors x), each training iteration launches four kernels:

Step 1 (ComputeStatusHiddenUnits): compute hdata ∈ ℝ^{N×J}, the RBM outputs for the data.
Step 2 (ComputeStatusVisibleUnits): compute vrecon ∈ ℝ^{N×I}, the reconstructed inputs.
Step 3 (ComputeStatusHiddenUnits): compute hrecon ∈ ℝ^{N×J}, the reconstructed outputs.
Step 4 (CorrectWeights): correct the weights w ∈ ℝ^{J×I} and the biases a ∈ ℝ^{I} (visible units) and b ∈ ℝ^{J} (hidden units).

SLIDE 31

COMPUTESTATUSHIDDENUNITS AND COMPUTESTATUSVISIBLEUNITS KERNELS

Each thread represents a connection: it multiplies the clamped input by the weight and stores the product in shared memory.

Each block represents a neuron: it uses the fast shared memory to sum up the values computed by its threads.

[Figure: one block (neuron) spanning connections 1 … J.]

A sketch of this pattern follows.
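The sketch below is our reading of the slide, not the paper's actual kernel, and it computes only the activation probability (the Bernoulli sampling step is omitted). Block j computes hidden unit j, thread i handles connection i, and a shared-memory tree reduction sums the per-connection products; launch it as Kernel<<<J, T, T * sizeof(float)>>> with T a power of two that is at least I.

__global__ void ComputeStatusHiddenUnitsSketch(const float *v, const float *W,
                                               const float *b, float *h,
                                               int I) {
    extern __shared__ float partial[];      // one slot per thread
    int j = blockIdx.x;                     // this block's neuron
    int i = threadIdx.x;                    // this thread's connection

    partial[i] = (i < I) ? v[i] * W[j * I + i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (i < s) partial[i] += partial[i + s];
        __syncthreads();
    }

    if (i == 0)
        h[j] = 1.0f / (1.0f + expf(-(b[j] + partial[0])));   // equation (4)
}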

SLIDE 32

STORING THE CONNECTION WEIGHTS

ComputeStatusHiddenUnits: coalesced access.
ComputeStatusVisibleUnits: uncoalesced access.

[Figure: the J × I weight matrix stored row by row (w11 … w1I, w21 … w2I, …, wJ1 … wJI); ComputeStatusHiddenUnits reads whole rows (e.g. w31 … w3I), while ComputeStatusVisibleUnits reads columns (e.g. w13, w23, …, wJ3).]
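Two toy kernels of ours isolate the difference for a row-major matrix w[j * I + i]:

// Hidden-units direction: consecutive threads read consecutive addresses
// of row j, so each warp issues one coalesced transaction.
__global__ void ReadRow(const float *w, float *out, int j, int I) {
    int i = threadIdx.x;
    out[i] = w[j * I + i];
}

// Visible-units direction: consecutive threads read column i with a
// stride of I floats, so the accesses cannot be coalesced.
__global__ void ReadColumn(const float *w, float *out, int i, int I) {
    int j = threadIdx.x;
    out[j] = w[j * I + i];
}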

SLIDE 33

CORRECTWEIGHTS KERNEL: FIRST APPROACH

Each thread gathers and sums up the values for one or more samples.
Each block corrects the weight of one connection.

[Figure: one block (connection) spanning samples 1 … N.]

SLIDE 34

PROBLEMS?

[Figure: the RBM with visible units v1 … vI and hidden units h1 … hJ, each layer including a unit clamped at 1 for the bias.]

ΔW_{ji} = γ(⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_k)
Δb_j = γ(⟨h_j⟩_0 − ⟨h_j⟩_k)

The bias updates can be treated as weight updates for connections from units clamped at 1 (shown in the figure).

SLIDE 35

PROBLEMS?

[Figure: the RBM with visible units v1 … vI and hidden units h1 … hJ, each layer including a unit clamped at 1 for the bias.]

ΔW_{ji} = γ(⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_k)
Δa_i = γ(⟨v_i⟩_0 − ⟨v_i⟩_k)

SLIDE 36

CORRECTWEIGHTS KERNEL: IMPROVED APPROACH

Each block has 16 × 16 threads. Each thread within a block must now process all the samples.

[Figure: Block 0 covers a 16 × 16 tile of connections, Connection (0,0) through Connection (15,15).]

SLIDE 37

CORRECTWEIGHTS KERNEL: IMPROVED APPROACH

However, we can access the v_i and h_j variables in a coalesced way and store them in shared memory for faster access. Although this new approach uses a smaller number of blocks, it performs much better than our first approach (≈ 15× faster). A sketch of the tiling follows.

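A simplified sketch of the tiling (ours; it omits the shared-memory staging of v_i and h_j that the slide describes): each 16 × 16 block owns one tile of the weight matrix, and each thread accumulates the CD statistics of its single connection (j, i) over all N samples.

#define TILE 16

// v0/h0 hold the data phase and vk/hk the reconstruction phase, stored
// sample-major (sample n's visible vector starts at v0 + n * I). Launch
// with dim3 blocks((I + 15) / 16, (J + 15) / 16) and dim3 threads(16, 16).
__global__ void CorrectWeightsSketch(const float *v0, const float *h0,
                                     const float *vk, const float *hk,
                                     float *W, int I, int J, int N,
                                     float gamma) {
    int i = blockIdx.x * TILE + threadIdx.x;   // visible index
    int j = blockIdx.y * TILE + threadIdx.y;   // hidden index
    if (i >= I || j >= J) return;

    float delta = 0.0f;
    for (int n = 0; n < N; n++)                // every sample, one thread
        delta += v0[n * I + i] * h0[n * J + j]
               - vk[n * I + i] * hk[n * J + j];

    W[j * I + i] += gamma * delta / N;         // equation (7), batch average
}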

SLIDE 38

TIME SPENT IN EACH TASK: FIRST AND IMPROVED APPROACHES

Task                                  First approach   Improved approach
Generate random numbers (cuRAND)      5.53%            14.86%
ComputeStatusHiddenUnits kernel       10.24%           27.50%
ComputeStatusVisibleUnits kernel      17.09%           45.91%
CorrectWeights kernel                 67.14%           11.73%

SLIDE 39

EXPERIMENTAL SETUP

We tested our approach with the MNIST database. Each sample is a 28 × 28 pixel image of a handwritten digit (784 inputs).

Hardware:
CPU: Intel dual-core i5-2410M (8 GB memory)
GPU: NVIDIA GeForce GTX 460

SLIDE 40

NVIDIA GEFORCE 460 GTX

Number of streaming multiprocessors: 7
Number of cores: 336
Peak performance: 940.8 GFLOPS
Device memory: 1 GB
Memory bandwidth: 112.5 GB/s
Shader clock speed: 1.4 GHz

SLIDE 41

RESULTS (1,000 SAMPLES)

[Plot: training time (log scale, 0.01–100 s) versus the number of hidden units (100–900) for the GTX 460 (GPU) and the dual-core i5 (CPU); annotated GPU speedups: 23.26×, 23.13×, 21.86×, 24.46×, 29.79×.]

SLIDE 42

RESULTS (10,000 SAMPLES)

[Plot: training time (log scale, 0.1–1000 s) versus the number of hidden units (100–900) for the GTX 460 (GPU) and the dual-core i5 (CPU); annotated GPU speedups: 32.83×, 30.29×, 28.59×, 29.47×, 38.16×.]

SLIDE 43

RESULTS (60,000 SAMPLES)

[Plot: training time (log scale, 1–10000 s) versus the number of hidden units (100–900) for the GTX 460 (GPU) and the dual-core i5 (CPU); annotated GPU speedups: 42.73×, 43.46×, 38.64×, 41.83×, 46.07×.]

SLIDE 44

ADAPTIVE STEP SIZES

[Figure: MNIST training images and their reconstructions after 10, 100, 250, 500, 750, and 1000 epochs, comparing adaptive step sizes with a fixed (optimized) learning rate of 0.1.]
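The slides do not spell out the adaptive step-size procedure, so the sketch below shows one classic per-parameter rule purely for illustration: grow a parameter's step when successive gradient estimates agree in sign, shrink it when they disagree. This is an assumption on our part; the paper's exact rule may differ.

// Generic sign-based adaptive step sizes, one step per parameter.
// up > 1 grows the step on sign agreement; 0 < down < 1 shrinks it.
void AdaptStepSizes(const float *grad, float *prevGrad, float *step,
                    float *param, int n, float up, float down) {
    for (int k = 0; k < n; k++) {
        step[k] *= (grad[k] * prevGrad[k] > 0.0f) ? up : down;
        param[k] += step[k] * grad[k];       // gradient ascent step
        prevGrad[k] = grad[k];
    }
}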

SLIDE 45

CONCLUSIONS AND FUTURE WORK

Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs) and demands considerable effort.

SLIDE 46

CONCLUSIONS AND FUTURE WORK

Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs) and demands considerable effort. Our work demonstrates that a non-trivial GPU implementation attains significant speedups.

SLIDE 47

CONCLUSIONS AND FUTURE WORK

Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs) and demands considerable effort. Our work demonstrates that a non-trivial GPU implementation attains significant speedups. An adaptive step-size procedure for tuning the learning rate has been incorporated into the learning model with excellent results.

SLIDE 48

CONCLUSIONS AND FUTURE WORK

Creating a Deep Belief Network (DBN) model is a time-consuming and computationally expensive task that involves training several Restricted Boltzmann Machines (RBMs) and demands considerable effort. Our work demonstrates that a non-trivial GPU implementation attains significant speedups. An adaptive step-size procedure for tuning the learning rate has been incorporated into the learning model with excellent results. Future work will test this approach on other databases, in particular on real-world problems.

SLIDE 49

RESTRICTED BOLTZMANN MACHINES AND DEEP BELIEF NETWORKS ON MULTI-CORE PROCESSORS

Noel Lopes, Bernardete Ribeiro, and João Gonçalves

University of Coimbra / Polytechnic Institute of Guarda

June 11, 2012 WCCI–IJCNN

SLIDE 50

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), pages 473–480. ACM.

Swersky, K., Chen, B., Marlin, B., and de Freitas, N. (2010). A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop.