SLIDE 1

Parallel Gradient Descent for Multilayer Feedforward Neural Networks

Palash Goyal, Nitin Kamra, Sungyong Seo, Vasileios Zois

Department of Computer Science

University of Southern California

May 9, 2016

SLIDE 2

Outline

1. Introduction
2. Gradient Descent
3. Forward Propagation and Backpropagation
4. Parallel Gradient Descent
5. Experiments
6. Results and Analysis

SLIDE 4

Introduction

How to learn to classify objects from images?

What algorithms to use?

How to scale up these algorithms?

SLIDE 5

Classification

Dataset $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^D$ and labels $y^{(i)} \in \mathbb{R}^P$

Make an accurate prediction $\hat{y}$ on an unseen data point $x$

A classifier with parameters $\theta$ approximates the label as $y \approx \hat{y} = F(x; \theta)$

The classifier learns its parameters $\theta$ from the data $D$ to minimize a pre-specified loss function

SLIDE 6

Neuron

$a = f(w^T x + b)$

$w \in \mathbb{R}^n$: weight vector

$b \in \mathbb{R}$: scalar bias
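A minimal NumPy sketch of this computation (the sigmoid activation and all names are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # An example activation f; the slides leave f unspecified
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # a = f(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

x = np.random.randn(5)   # input vector (n = 5, chosen arbitrarily)
w = np.random.randn(5)   # weight vector w
b = 0.1                  # scalar bias b
a = neuron(x, w, b)      # scalar activation a
```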

SLIDE 7

Classifier: Neural Network

For each layer $l$: $z^l = (W^l)^T x^l + b^l$, $a^l = f(z^l)$

$W^l \in \mathbb{R}^{n_{l-1} \times n_l}$: weight matrix

$b^l \in \mathbb{R}^{n_l}$: bias vector
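A minimal sketch of the layer-by-layer forward pass implied by these equations, assuming sigmoid activations and the weight shapes above (illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # weights[l] has shape (n_{l-1}, n_l), so z^l = (W^l)^T x^l + b^l
    a = x
    for W, b in zip(weights, biases):
        z = W.T @ a + b   # (n_l,) vector of pre-activations
        a = sigmoid(z)    # a^l = f(z^l)
    return a

# Example with Network1's sizes (784, 1024, 10) from the experiments section
sizes = [784, 1024, 10]
weights = [0.01 * np.random.randn(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
y_hat = forward(np.random.randn(784), weights, biases)  # 10 outputs
```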

SLIDE 9

Gradient Descent

Minimize the mean-squared error loss:

$$L_{\mathrm{MSE}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(x^{(i)}; \theta) \right)^2$$

Algorithm: Gradient Descent

1. Initialize all weights $\theta$ randomly with small values close to 0.
2. Repeat until convergence:
   $\theta_k := \theta_k - \alpha \, \frac{\partial L_{\mathrm{MSE}}}{\partial \theta_k} \quad \forall k \in \{1, 2, \ldots, K\}$

Minibatch gradient descent considers a subset of examples per update.
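A hedged sketch of this loop in NumPy; `grad_fn` stands in for whatever computes $\partial L_{\mathrm{MSE}}/\partial \theta$ (for a network, backpropagation), and the linear-model usage at the bottom exists only to make the sketch runnable:

```python
import numpy as np

def gradient_descent(theta, grad_fn, X, Y, alpha=0.1, batch_size=None, epochs=100):
    # theta_k := theta_k - alpha * dL/dtheta_k for all k; a fixed epoch
    # count stands in for a convergence test. batch_size=None means
    # full-batch gradient descent, otherwise minibatch gradient descent.
    N = X.shape[0]
    for _ in range(epochs):
        if batch_size is None:
            theta = theta - alpha * grad_fn(theta, X, Y)
        else:
            idx = np.random.permutation(N)   # shuffle, then walk through minibatches
            for s in range(0, N, batch_size):
                batch = idx[s:s + batch_size]
                theta = theta - alpha * grad_fn(theta, X[batch], Y[batch])
    return theta

# Illustrative usage: fit a linear model y = X @ theta under the MSE loss
X = np.random.randn(200, 3)
Y = X @ np.array([1.0, -2.0, 0.5])
mse_grad = lambda th, Xb, Yb: -2.0 / len(Xb) * Xb.T @ (Yb - Xb @ th)
theta = gradient_descent(np.zeros(3), mse_grad, X, Y, batch_size=32)
```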

SLIDE 11

Forward Propagation

SLIDE 12

Backpropagation
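A hedged NumPy sketch of backpropagation for the squared-error loss with sigmoid units, using the same shape conventions as the forward pass above (illustrative, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    # Forward pass, caching every layer's activation for the backward pass
    a, activations = x, [x]
    for W, b in zip(weights, biases):
        a = sigmoid(W.T @ a + b)
        activations.append(a)
    # Output-layer error for L = (y - a^L)^2 with sigmoid units:
    # delta^L = dL/dz^L = -2 (y - a^L) * a^L (1 - a^L)
    delta = -2.0 * (y - activations[-1]) * activations[-1] * (1.0 - activations[-1])
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(activations[l], delta))  # dL/dW^l, same shape as W^l
        grads_b.insert(0, delta)                            # dL/db^l
        if l > 0:  # propagate the error one layer back
            a_prev = activations[l]
            delta = (weights[l] @ delta) * a_prev * (1.0 - a_prev)
    return grads_W, grads_b

# Example with Network1's shapes; np.eye(10)[3] is a one-hot label
sizes = [784, 1024, 10]
weights = [0.01 * np.random.randn(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
gW, gb = backprop(np.random.randn(784), np.eye(10)[3], weights, biases)
```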

SLIDE 14

Parallelizing Gradient Descent

Two ways to parallelize:

Parallelize gradient descent: the derivative of the loss function has the form

$$\frac{\partial L_{\mathrm{MSE}}}{\partial \theta_k} = -\frac{2}{N} \sum_{i=1}^{N} \left( y^{(i)} - f(x^{(i)}; \theta) \right) \frac{\partial f(x^{(i)}; \theta)}{\partial \theta_k}$$

so we can distribute the training examples, compute partial gradients, and sum up the partial gradients (sketched below).

Parallelize backpropagation: parallelize the matrix-vector multiplications in the forward propagation and backpropagation algorithms.
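A minimal sketch of the first strategy with Python's multiprocessing: shard the examples, compute partial gradients in parallel, and sum them. A linear model stands in for the network to keep the sketch short; all names are illustrative:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    # MSE gradient of a linear model f(x; theta) = x . theta on one shard.
    # With a network, backpropagation would compute each example's term;
    # the structure of the distributed computation is the same.
    theta, X, Y = args
    return -2.0 / len(X) * X.T @ (Y - X @ theta)

def parallel_gradient(theta, X, Y, n_workers=4):
    # Distribute training examples, compute partial gradients in parallel,
    # then combine them, weighting each shard by its size so the result
    # equals the serial full-batch gradient.
    shards = list(zip(np.array_split(X, n_workers), np.array_split(Y, n_workers)))
    with Pool(n_workers) as pool:
        parts = pool.map(partial_gradient, [(theta, xs, ys) for xs, ys in shards])
    return sum(len(xs) * g for (xs, _), g in zip(shards, parts)) / len(X)

if __name__ == "__main__":
    X = np.random.randn(1000, 3)
    Y = X @ np.array([1.0, -2.0, 0.5])
    print(parallel_gradient(np.zeros(3), X, Y))
```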

SLIDE 16

MNIST dataset

28×28 images of handwritten digits

50,000 training examples, 10,000 test examples, 10,000 validation examples

Labels: 0 to 9 (one-hot encoding)
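A minimal sketch of the one-hot encoding of the labels (pure NumPy; names are illustrative):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    # Map digit labels 0-9 to one-hot rows, as described on the slide
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

print(one_hot(np.array([3, 0, 9])))
# [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
```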

SLIDE 17

Experiments

Network structures:

              #Layers (In,Hidden,Out)   #Nodes (In,Hidden,Out)   #Params
  Network1    1, 1, 1                   784, 1024, 10            ~800,000
  Network2    1, 2, 1                   784, 1024, 1024, 10      ~1,860,000

Implementations:

Serial, parallelized over examples (Pthreads, CUDA)
Serial (BLAS), parallelized matrix computations (BLAS)
Serial (Keras:Theano), parallel (Keras:Theano), GPU (Keras:Theano); a Keras sketch follows below

Analysis:

Analyze time per epoch and gigaflops for each implementation
Analyze speedup from parallelization over the serial counterparts
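A sketch of Network1 in Keras, under stated assumptions: the slides name Keras with a Theano backend but show no code, and the sigmoid/softmax activations and SGD optimizer here are guesses. The parameter count it prints, 814,090, matches the table's ~800,000:

```python
from keras.models import Sequential
from keras.layers import Dense

# Network1 from the table: 784 inputs -> 1024 hidden -> 10 outputs.
# Hidden layer: 784*1024 + 1024 = 803,840 parameters
# Output layer: 1024*10  + 10  =  10,250 parameters (total 814,090)
model = Sequential()
model.add(Dense(1024, input_dim=784, activation='sigmoid'))  # activation is an assumption
model.add(Dense(10, activation='softmax'))                   # softmax is an assumption
model.compile(loss='mse', optimizer='sgd')  # MSE loss per the Gradient Descent slide
model.summary()
```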

SLIDE 19

Results - Time per Epoch

SLIDE 20

Results - Gigaflops

SLIDE 21

Results - Speedup

SLIDE 22

Analysis

Our implementation

Average speedup from parallel computing ≈ 10×
Training time decreases as the minibatch size decreases

BLAS

Parallelizing each matrix-vector product gives even faster results
Speedup is independent of batch size, but smaller than that of our implementation

CUDA

Our CUDA implementation gives a speedup of roughly 20×
If the number of neurons per layer is not an exact multiple of 32 (the CUDA warp size), some threads do not participate in the computation

Theano

Apparently combines both types of parallelization
The Theano CUDA implementation scales very quickly with batch size

SLIDE 23

Future Work

Combine the two parallelization techniques: split the training examples among threads, then hierarchically parallelize the matrix computations for each individual example.

SLIDE 24

Thank you!

Questions?
