 
Parallel Gradient Descent for Multilayer Feedforward Neural Networks
Palash Goyal, Nitin Kamra, Sungyong Seo, Vasileios Zois
Department of Computer Science, University of Southern California
May 9, 2016

Outline
1. Introduction
2. Gradient Descent
3. Forward Propagation and Backpropagation
4. Parallel Gradient Descent
5. Experiments
6. Results and analysis

Introduction
- How to learn to classify objects from images?
- What algorithms to use?
- How to scale up these algorithms?

Classification
- Dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with inputs $x^{(i)} \in \mathbb{R}^{D}$ and labels $y^{(i)} \in \mathbb{R}^{P}$
- Goal: make an accurate prediction $\hat{y}$ on an unseen data point $x$
- A classifier with parameters $\theta$ approximates the label as $y \approx \hat{y} = F(x; \theta)$
- The classifier learns the parameters $\theta$ from the data $\mathcal{D}$ by minimizing a pre-specified loss function

Neuron
$a = f(w^{T} x + b)$
- $w \in \mathbb{R}^{n}$: weight vector
- $b \in \mathbb{R}$: scalar bias

Classifier: Neural Network
For each layer $l$, with $x^{l}$ the input to that layer:
$z^{l} = (W^{l})^{T} x^{l} + b^{l}$; $a^{l} = f(z^{l})$
- $W^{l} \in \mathbb{R}^{n_{l-1} \times n_{l}}$: weight matrix
- $b^{l} \in \mathbb{R}^{n_{l}}$: bias vector

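The per-layer rule above can be sketched in plain Python. This is only an illustration: the slides do not name the activation $f$, so the sigmoid used here is an assumption, and the tiny all-zero network and the helper names (`layer_forward`, `forward`) are hypothetical.

```python
import math

def sigmoid(z):
    # Assumed activation; the slides leave f unspecified.
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x, f=sigmoid):
    """Compute a = f(W^T x + b) for one layer.

    W is stored with n_in rows and n_out columns, matching
    W^l in R^{n_{l-1} x n_l}; b has n_out entries.
    """
    n_in, n_out = len(W), len(b)
    z = [sum(W[i][j] * x[i] for i in range(n_in)) + b[j] for j in range(n_out)]
    return [f(zj) for zj in z]

def forward(layers, x):
    """Feed x through every (W, b) pair; each output is the next layer's input."""
    a = x
    for W, b in layers:
        a = layer_forward(W, b, a)
    return a

# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output, all weights zero.
layers = [
    ([[0.0] * 3 for _ in range(2)], [0.0] * 3),
    ([[0.0] for _ in range(3)], [0.0]),
]
output = forward(layers, [1.0, -2.0])
```

With all weights and biases at zero, every pre-activation is 0, so each sigmoid unit outputs 0.5 regardless of the input.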
Gradient Descent
Minimize the mean-squared error (MSE) loss:
$L_{MSE}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(x^{(i)}; \theta) \right)^{2}$
Algorithm: Gradient Descent
1. Initialize all weights ($\theta$) randomly with small values close to 0.
2. Repeat until convergence {
   $\theta_{k} := \theta_{k} - \alpha \frac{\partial L_{MSE}}{\partial \theta_{k}} \quad \forall k \in \{1, 2, \ldots, K\}$
   }
Minibatch gradient descent uses only a subset of the examples for each update.

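As a minimal sketch of the two algorithm steps, the following fits a hypothetical two-parameter linear model $f(x; \theta) = \theta_0 + \theta_1 x$ by full-batch gradient descent on the MSE loss. The model, the toy data, the step size `alpha`, and the fixed step count standing in for "until convergence" are all assumptions made for illustration, not the setup used in the experiments.

```python
import random

def predict(theta, x):
    # Hypothetical model f(x; theta) = theta_0 + theta_1 * x.
    return theta[0] + theta[1] * x

def mse_gradient(theta, data):
    """Gradient of (1/N) * sum (y - f(x; theta))^2 with respect to theta."""
    n = len(data)
    g0 = g1 = 0.0
    for x, y in data:
        residual = y - predict(theta, x)
        g0 += -2.0 * residual / n        # d f / d theta_0 = 1
        g1 += -2.0 * residual * x / n    # d f / d theta_1 = x
    return [g0, g1]

def gradient_descent(data, alpha=0.05, num_steps=2000, seed=0):
    rng = random.Random(seed)
    # Step 1: small random initial values close to 0.
    theta = [rng.uniform(-0.01, 0.01) for _ in range(2)]
    # Step 2: a fixed number of updates stands in for a convergence test.
    for _ in range(num_steps):
        grad = mse_gradient(theta, data)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
theta = gradient_descent(data)
```

A minibatch variant would call `mse_gradient` on a sampled subset of `data` at each step instead of the full list.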
Forward Propagation [figure]

Backpropagation [figure]

Parallelizing Gradient Descent
Two ways to parallelize:
- Parallelize gradient descent: the derivative of the loss function is a sum over training examples,
  $\frac{\partial L_{MSE}}{\partial \theta_{k}} = -\frac{2}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(x^{(i)}; \theta) \right) \frac{\partial f(x^{(i)}; \theta)}{\partial \theta_{k}}$
  so we can distribute the training examples across workers, compute partial gradients, and sum up the partial gradients.
- Parallelize backpropagation: parallelize the matrix-vector multiplications in the forward propagation and backpropagation algorithms.

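The first approach (distribute examples, compute partial gradients, sum them) can be sketched as follows. This is an illustration of the structure only: it uses a hypothetical linear model and Python threads from `concurrent.futures`, whereas the experiments used Pthreads and CUDA, and CPython threads will not give a real speedup on this workload. The factor of $-2$ comes from differentiating the squared error directly.

```python
from concurrent.futures import ThreadPoolExecutor

def predict(theta, x):
    # Hypothetical model f(x; theta) = theta_0 + theta_1 * x.
    return theta[0] + theta[1] * x

def partial_gradient(theta, chunk):
    """Un-normalized gradient sum over one chunk of (x, y) examples."""
    g0 = g1 = 0.0
    for x, y in chunk:
        residual = y - predict(theta, x)
        g0 += -2.0 * residual
        g1 += -2.0 * residual * x
    return (g0, g1)

def parallel_gradient(theta, data, num_workers=4):
    # Distribute training examples: worker w gets examples w, w + P, w + 2P, ...
    chunks = [data[w::num_workers] for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda c: partial_gradient(theta, c), chunks))
    # Sum the partial gradients, then normalize by the total example count.
    n = len(data)
    return tuple(sum(p[k] for p in partials) / n for k in range(2))

data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
theta = (0.0, 0.0)
serial = tuple(g / len(data) for g in partial_gradient(theta, data))
parallel = parallel_gradient(theta, data)
```

Because the gradient is a plain sum over examples, the chunked result matches the serial one up to floating-point summation order.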
MNIST dataset
- 28x28 images of handwritten digits
- 50,000 training examples, 10,000 test examples, 10,000 validation examples
- Labels: 0 to 9 (one-hot encoding)

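For readers unfamiliar with one-hot encoding, a small sketch: each digit label becomes a length-10 vector with a single 1 at the label's index. The function name `one_hot` is hypothetical and not taken from the authors' code.

```python
def one_hot(label, num_classes=10):
    """Return a list of num_classes floats with a 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

encoded = one_hot(3)
```

This gives the label dimension $P = 10$ for MNIST, matching the 10 output nodes of the networks.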
Experiments
Network structures:

            # Layers             # Nodes                # Params
            (In, Hidden, Out)    (In, Hidden, Out)
  Network1  1, 1, 1              784, 1024, 10          800,000
  Network2  1, 2, 1              784, 1024, 1024, 10    1,860,000

- Serial, Parallelize over examples (Pthreads, CUDA)
- Serial (BLAS), Parallelize matrix computations (BLAS)
- Serial (Keras:Theano), Parallel (Keras:Theano), GPU (Keras:Theano)
- Analyze time per epoch and gigaflops for each implementation
- Analyze speedup from parallelization over serial counterparts

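The parameter counts in the table appear to be rounded. As a quick check, assuming fully connected layers with one bias per non-input node (an assumption, since the slides do not state whether biases are counted), the exact counts can be computed like this:

```python
def count_parameters(layer_sizes, include_biases=True):
    """Count weights (and optionally biases) of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out          # entries of the n_in x n_out weight matrix
        if include_biases:
            total += n_out             # one bias per node in the receiving layer
    return total

network1 = count_parameters([784, 1024, 10])        # 814,090
network2 = count_parameters([784, 1024, 1024, 10])  # 1,863,690
```

Both values are in line with the roughly 800,000 and 1,860,000 parameters listed for Network1 and Network2.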
Results - Time per Epoch [figure]

Results - Gigaflops [figure]

Results - Speedup [figure]

Analysis
Our implementation
- Parallelization gives an average speedup of ≈ 10x
- Training time decreases as minibatch size decreases
BLAS
- Parallelizing each matrix-vector product gives even faster results
- Speedup is independent of batch size, but smaller than that of our implementation
CUDA
- Our CUDA implementation gives a speedup of about 20x
- If the number of neurons per layer is not an exact multiple of 32, some threads do not participate in the computation
Theano
- Apparently combines both types of parallelization
- Theano with CUDA scales very well with batch size

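The multiple-of-32 remark follows from CUDA scheduling threads in warps of 32. A small arithmetic sketch, assuming one thread per neuron (an assumption about the kernel layout, which the slides do not spell out), shows how many threads sit idle for a given layer width:

```python
WARP_SIZE = 32  # threads per warp on CUDA GPUs

def warp_usage(num_neurons, warp_size=WARP_SIZE):
    """Return (warps needed, idle threads) when mapping one thread per neuron."""
    warps = -(-num_neurons // warp_size)   # ceiling division
    idle = warps * warp_size - num_neurons
    return warps, idle

hidden = warp_usage(1024)   # (32, 0): 1024 is an exact multiple of 32
output = warp_usage(10)     # (1, 22): 22 of the 32 threads do no work
```

Under this mapping the 1024-node hidden layers fill their warps completely, while the 10-node output layer leaves most of a warp unused.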
Future Work
Combine the two parallelization techniques: split the training examples amongst threads, then hierarchically parallelize the matrix computations for each individual example.

Thank you
Questions?