SLIDE 1

Parallel Gradient Descent for Multilayer Feedforward Neural Networks

Palash Goyal, Nitin Kamra, Sungyong Seo, Vasileios Zois

Department of Computer Science

University of Southern California

May 9, 2016

SLIDE 2

Outline

1. Introduction
2. Gradient Descent
3. Forward Propagation and Backpropagation
4. Parallel Gradient Descent
5. Experiments
6. Results and Analysis

SLIDE 4

Introduction

How to learn to classify objects from images?

What algorithms to use?

How to scale up these algorithms?

SLIDE 5

Classification

Dataset $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^D$ and labels $y^{(i)} \in \mathbb{R}^P$

Make an accurate prediction $\hat{y}$ on an unseen data point $x$

A classifier with parameters $\theta$ approximates the label as $y \approx \hat{y} = F(x; \theta)$

The classifier learns its parameters $\theta$ from the data $D$ to minimize a pre-specified loss function

SLIDE 6

Neuron

$a = f(w^T x + b)$

$w \in \mathbb{R}^n$: weight vector

$b \in \mathbb{R}$: scalar bias
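A minimal NumPy sketch of this computation (the sigmoid activation and all names are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # An example activation f; the slides leave f unspecified
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # a = f(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

x = np.random.randn(5)   # input vector (n = 5, chosen arbitrarily)
w = np.random.randn(5)   # weight vector w
b = 0.1                  # scalar bias b
a = neuron(x, w, b)      # scalar activation a
```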

SLIDE 7

Classifier: Neural Network

For each layer $l$: $z^l = (W^l)^T x^l + b^l$, $a^l = f(z^l)$

$W^l \in \mathbb{R}^{n_{l-1} \times n_l}$: weight matrix

$b^l \in \mathbb{R}^{n_l}$: bias vector
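A minimal sketch of the layer-by-layer forward pass implied by these equations, assuming sigmoid activations and the weight shapes above (illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights[l] has shape (n_{l-1}, n_l), so z^l = (W^l)^T x^l + b^l
    a = x
    for W, b in zip(weights, biases):
        z = W.T @ a + b   # (n_l,) vector of pre-activations
        a = sigmoid(z)    # a^l = f(z^l)
    return a

# Example with Network1's sizes (784, 1024, 10) from the experiments section
sizes = [784, 1024, 10]
weights = [0.01 * np.random.randn(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
y_hat = forward(np.random.randn(784), weights, biases)  # 10 outputs
```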

SLIDE 9

Gradient Descent

Minimize the mean-squared error loss:

$$L_{\mathrm{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(x^{(i)}; \theta) \right)^2$$

Algorithm: Gradient Descent

1. Initialize all weights $\theta$ randomly with small values close to 0.
2. Repeat until convergence:
   $\theta_k := \theta_k - \alpha \, \frac{\partial L_{\mathrm{MSE}}}{\partial \theta_k} \quad \forall k \in \{1, 2, \ldots, K\}$

Minibatch gradient descent considers a subset of examples per update.
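A hedged sketch of this loop in NumPy; `grad_fn` stands in for whatever computes $\partial L_{\mathrm{MSE}}/\partial \theta$ (for a network, backpropagation), and the linear-model usage at the bottom exists only to make the sketch runnable:

```python
import numpy as np

def gradient_descent(theta, grad_fn, X, Y, alpha=0.1, batch_size=None, epochs=100):
    # theta_k := theta_k - alpha * dL/dtheta_k for all k; a fixed epoch
    # count stands in for a convergence test. batch_size=None means
    # full-batch gradient descent, otherwise minibatch gradient descent.
    N = X.shape[0]
    for _ in range(epochs):
        if batch_size is None:
            theta = theta - alpha * grad_fn(theta, X, Y)
        else:
            idx = np.random.permutation(N)   # shuffle, then walk through minibatches
            for s in range(0, N, batch_size):
                batch = idx[s:s + batch_size]
                theta = theta - alpha * grad_fn(theta, X[batch], Y[batch])
    return theta

# Illustrative usage: fit a linear model y = X @ theta under the MSE loss
X = np.random.randn(200, 3)
Y = X @ np.array([1.0, -2.0, 0.5])
mse_grad = lambda th, Xb, Yb: -2.0 / len(Xb) * Xb.T @ (Yb - Xb @ th)
theta = gradient_descent(np.zeros(3), mse_grad, X, Y, batch_size=32)
```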

SLIDE 11

Forward Propagation

SLIDE 12

Backpropagation
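A hedged NumPy sketch of backpropagation for the squared-error loss with sigmoid units, using the same shape conventions as the forward pass above (illustrative, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    # Forward pass, caching every layer's activation for the backward pass
    a, activations = x, [x]
    for W, b in zip(weights, biases):
        a = sigmoid(W.T @ a + b)
        activations.append(a)
    # Output-layer error for L = (y - a^L)^2 with sigmoid units:
    # delta^L = dL/dz^L = -2 (y - a^L) * a^L (1 - a^L)
    delta = -2.0 * (y - activations[-1]) * activations[-1] * (1.0 - activations[-1])
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(activations[l], delta))  # dL/dW^l, same shape as W^l
        grads_b.insert(0, delta)                            # dL/db^l
        if l > 0:  # propagate the error one layer back
            a_prev = activations[l]
            delta = (weights[l] @ delta) * a_prev * (1.0 - a_prev)
    return grads_W, grads_b

# Example with Network1's shapes; np.eye(10)[3] is a one-hot label
sizes = [784, 1024, 10]
weights = [0.01 * np.random.randn(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
gW, gb = backprop(np.random.randn(784), np.eye(10)[3], weights, biases)
```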

SLIDE 14

Parallelizing Gradient Descent

Two ways to parallelize:

Parallelize gradient descent: the derivative of the loss function has the form

$$\frac{\partial L_{\mathrm{MSE}}}{\partial \theta_k} = -\frac{2}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(x^{(i)}; \theta) \right) \frac{\partial f(x^{(i)}; \theta)}{\partial \theta_k}$$

so we can distribute the training examples, compute partial gradients, and sum up the partial gradients (sketched below).

Parallelize backpropagation: parallelize the matrix-vector multiplications in the forward propagation and backpropagation algorithms.
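A minimal sketch of the first strategy with Python's multiprocessing: shard the examples, compute partial gradients in parallel, and sum them. A linear model stands in for the network to keep the sketch short; all names are illustrative:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # MSE gradient of a linear model f(x; theta) = x . theta on one shard.
    # With a network, backpropagation would compute each example's term;
    # the structure of the distributed computation is the same.
    theta, X, Y = args
    return -2.0 / len(X) * X.T @ (Y - X @ theta)

def parallel_gradient(theta, X, Y, n_workers=4):
    # Distribute training examples, compute partial gradients in parallel,
    # then combine them, weighting each shard by its size so the result
    # equals the serial full-batch gradient.
    shards = list(zip(np.array_split(X, n_workers), np.array_split(Y, n_workers)))
    with Pool(n_workers) as pool:
        parts = pool.map(partial_gradient, [(theta, xs, ys) for xs, ys in shards])
    return sum(len(xs) * g for (xs, _), g in zip(shards, parts)) / len(X)

if __name__ == "__main__":
    X = np.random.randn(1000, 3)
    Y = X @ np.array([1.0, -2.0, 0.5])
    print(parallel_gradient(np.zeros(3), X, Y))
```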

SLIDE 16

MNIST dataset

28×28 images of handwritten digits

50,000 training examples, 10,000 test examples, 10,000 validation examples

Labels: 0 to 9 (one-hot encoding)
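A minimal sketch of the one-hot encoding of the labels (pure NumPy; names are illustrative):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Map digit labels 0-9 to one-hot rows, as described on the slide
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

print(one_hot(np.array([3, 0, 9])))
# [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
```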

SLIDE 17

Experiments

Network structures:

              #Layers (In,Hidden,Out)   #Nodes (In,Hidden,Out)   #Params
  Network1    1, 1, 1                   784, 1024, 10            ~800,000
  Network2    1, 2, 1                   784, 1024, 1024, 10      ~1,860,000

Implementations:

Serial, parallelized over examples (Pthreads, CUDA)
Serial (BLAS), parallelized matrix computations (BLAS)
Serial (Keras:Theano), parallel (Keras:Theano), GPU (Keras:Theano); a Keras sketch follows below

Analysis:

Analyze time per epoch and gigaflops for each implementation
Analyze speedup from parallelization over the serial counterparts
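A sketch of Network1 in Keras, under stated assumptions: the slides name Keras with a Theano backend but show no code, and the sigmoid/softmax activations and SGD optimizer here are guesses. The parameter count it prints, 814,090, matches the table's ~800,000:

```python
from keras.models import Sequential
from keras.layers import Dense

# Network1 from the table: 784 inputs -> 1024 hidden -> 10 outputs.
# Hidden layer: 784*1024 + 1024 = 803,840 parameters
# Output layer: 1024*10  + 10  =  10,250 parameters (total 814,090)
model = Sequential()
model.add(Dense(1024, input_dim=784, activation='sigmoid'))  # activation is an assumption
model.add(Dense(10, activation='softmax'))                   # softmax is an assumption
model.compile(loss='mse', optimizer='sgd')  # MSE loss per the Gradient Descent slide
model.summary()
```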

SLIDE 19

Results - Time per Epoch

SLIDE 20

Results - Gigaflops

SLIDE 21

Results - Speedup

SLIDE 22

Analysis

Our implementation

Average speedup from parallel computing ≈ 10×
Training time decreases as the minibatch size decreases

BLAS

Parallelizing each matrix-vector product gives even faster results
Speedup is independent of batch size, but smaller than that of our implementation

CUDA

Our CUDA implementation gives a speedup of roughly 20×
If the number of neurons per layer is not an exact multiple of 32 (the CUDA warp size), some threads do not participate in the computation

Theano

Apparently combines both types of parallelization
The Theano CUDA implementation scales very quickly with batch size

SLIDE 23

Future Work

Combine the two parallelization techniques: split the training examples among threads, then hierarchically parallelize the matrix computations for each individual example.

SLIDE 24

Thank you!

Questions?
