Deep Learning on GPUs, March 2016 (PowerPoint presentation)




SLIDE 1

March 2016

Deep Learning on GPUs

SLIDE 2

AGENDA

What is Deep Learning?
GPUs and DL
DL in practice
Scaling up DL

SLIDE 3

What is Deep Learning?

SLIDE 4

DEEP LEARNING EVERYWHERE

INTERNET & CLOUD

Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation

MEDIA & ENTERTAINMENT

Video Captioning, Video Search, Real-Time Translation

AUTONOMOUS MACHINES

Pedestrian Detection, Lane Tracking, Traffic Sign Recognition

SECURITY & DEFENSE

Face Detection, Video Surveillance, Satellite Imagery

MEDICINE & BIOLOGY

Cancer Cell Detection, Diabetic Grading, Drug Discovery

SLIDE 5

Traditional machine perception

Hand-crafted feature extractors

Pipeline: Raw data → Feature extraction → Classifier/detector → Result

Example classifiers/detectors: SVM, shallow neural net, HMM, clustering, LDA, LSA, …

Example tasks: speaker ID, speech transcription, topic classification, machine translation, sentiment analysis, …

SLIDE 6

Deep learning approach

Train: labeled images (dog, cat, raccoon) are fed through the MODEL; classification errors are fed back to update the model.

Deploy: the trained MODEL labels new images (dog, cat, honey badger).

SLIDE 7

Artificial neural network

A collection of simple, trainable mathematical units that collectively learn complex functions

Diagram: Input layer → Hidden layers → Output layer

Given sufficient training data, an artificial neural network can approximate very complex functions mapping raw data to output decisions.

SLIDE 8

Artificial neurons

From Stanford cs231n lecture notes

Diagram: a biological neuron alongside an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, and output y = F(w1·x1 + w2·x2 + w3·x3), where F(x) = max(0, x) (ReLU).
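The neuron formula on this slide can be sketched directly in Python (NumPy used for illustration; the weight and input values are our own example, not from the slide):

```python
import numpy as np

def relu(x):
    # F(x) = max(0, x), the activation shown on the slide
    return np.maximum(0.0, x)

def neuron(w, x):
    # y = F(w1*x1 + w2*x2 + w3*x3): weighted sum, then nonlinearity
    return relu(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
y = neuron(w, x)  # weighted sum is -0.5, so ReLU clips it to 0.0
```

A whole network is just many of these units composed layer by layer.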

SLIDE 9

Deep neural network (dnn)

Input → Result

Application components:
Task objective, e.g. identify a face
Training data: 10–100M images
Network architecture: ~10 layers, 1B parameters
Learning algorithm: ~30 exaflops, ~30 GPU-days

Feature hierarchy: Raw data → Low-level features → Mid-level features → High-level features

SLIDE 10

Deep learning benefits

Robust
No need to design the features ahead of time: features are automatically learned to be optimal for the task at hand, and robustness to natural variations in the data is automatically learned.

Generalizable
The same neural net approach can be used for many different applications and data types.

Scalable
Performance improves with more data, and the method is massively parallelizable.

SLIDE 11

Baidu Deep Speech 2

English and Mandarin speech recognition. The transition from English to Mandarin was made simpler by end-to-end DL:

No feature engineering or Mandarin-specific components required

More accurate than humans: 3.7% error rate vs. 4% for humans on the same tests

http://svail.github.io/mandarin/ http://arxiv.org/abs/1512.02595

End-to-end Deep Learning for English and Mandarin Speech Recognition

SLIDE 12

AlphaGo

Training DNNs: 3 weeks, 340 million training steps on 50 GPUs

Play: asynchronous multi-threaded search; simulations on CPUs, policy and value DNNs in parallel on GPUs
Single machine: 40 search threads, 48 CPUs, and 8 GPUs
Distributed version: 40 search threads, 1,202 CPUs, and 176 GPUs

Outcome: Beat both European and World Go champions in best of 5 matches

First Computer Program to Beat a Human Go Professional

http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html http://deepmind.com/alpha-go.html

SLIDE 13

Deep Learning for Autonomous vehicles

SLIDE 14

Deep Learning Synthesis

Texture synthesis and transfer using CNNs. Timo Aila et al., NVIDIA Research

SLIDE 15


THE AI RACE IS ON

IBM Watson Achieves Breakthrough in Natural Language Processing
Facebook Launches Big Sur
Baidu Deep Speech 2 Beats Humans
Google Launches TensorFlow
Microsoft & U. Science & Tech, China Beat Humans on IQ
Toyota Invests $1B in AI Labs

[Chart: ImageNet accuracy rate, 2009–2016: traditional CV vs. deep learning]

SLIDE 16

The Big Bang in Machine Learning

“Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”

DNN + GPU + BIG DATA

SLIDE 17

GPUs and DL

USE MORE PROCESSORS TO GO FASTER

SLIDE 18

Deep learning development cycle

SLIDE 19

Three Kinds of Networks

DNN – all fully connected layers
CNN – some convolutional layers
RNN – recurrent neural network, LSTM

SLIDE 20

DNN

Key operation is a dense M × V (matrix–vector) multiply

Backpropagation uses dense matrix-matrix multiply starting from softmax scores
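As a concrete sketch of the forward pass through one fully connected layer (the layer sizes are our own illustrative choices, not from the slide), the dense M × V operation is just:

```python
import numpy as np

rng = np.random.default_rng(0)

# One fully connected layer: weight matrix M times activation vector v.
M = rng.standard_normal((256, 784))   # 784 inputs -> 256 outputs (illustrative sizes)
v = rng.standard_normal(784)          # activations for a single example

out = M @ v                           # the dense matrix-vector multiply the slide refers to
```

Backpropagation through the same layer multiplies error signals by the transpose of M, which is why both passes are dominated by dense linear algebra.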

SLIDE 21

DNN

Batching is used for training and for latency-insensitive inference; the key operation becomes M × M.

The batched operation is a matrix–matrix multiply, which gives re-use of the weights.

Without batching, each element of the weight matrix would be used only once. Modern compute architectures want 10–50 arithmetic operations per memory fetch.
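A minimal illustration of the weight re-use (batch and layer sizes are our assumptions): stacking B inputs as columns turns B matrix–vector products into one matrix–matrix product, so each weight fetched from memory is used B times.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 784))      # weights (illustrative sizes)
X = rng.standard_normal((784, 64))       # batch of 64 inputs, one per column

# Batched forward pass: one M x M multiply instead of 64 M x V multiplies.
Y = W @ X                                # each weight element is used 64 times

# Equivalent unbatched computation, touching every element of W once per input:
Y_unbatched = np.stack([W @ X[:, i] for i in range(64)], axis=1)
```

Both give the same result; the batched form simply raises the ratio of arithmetic to memory traffic.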

SLIDE 22

CNN

Requires convolution and M × V

Filters are shared across the image plane, so the computation is multiply-limited even without batching.
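The filter sharing can be seen in a naive convolution sketch (ours, not the slide's; real frameworks use optimized GPU kernels): the same small kernel is applied at every position of the input plane.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Naive "valid" 2D convolution (really cross-correlation, as in most DL frameworks).
    # The same kernel weights are re-used at every spatial position: this weight
    # sharing is what makes CNNs multiply-limited even without batching.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 2))
res = conv2d_valid(img, k)   # 3x3 output; res[0, 0] sums the top-left 2x2 patch
```

In production the loops are replaced by highly tuned GEMM, FFT, or Winograd implementations.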

SLIDE 23

Other Operations

To finish building a DNN

These are not limiting factors with appropriate GPU use. Complex networks have hundreds of millions of weights.

SLIDE 24

Lots of Parallelism Available in a DNN

SLIDE 25

TESLA M40

World’s Fastest Accelerator for Deep Learning Training


[Chart: Caffe training time in number of days, GPU server with 4x Tesla M40 vs. dual-CPU server: 13x faster training]

CUDA Cores: 3,072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W

Reduce Training Time from 13 Days to just 1 Day

Note: Caffe benchmark with AlexNet, CPU server uses 2x E5-2680v3 12 Core 2.5GHz CPU, 128GB System Memory, Ubuntu 14.04

28 GFLOPS/W

SLIDE 26

Comparing CPU and GPU – server class

Xeon E5-2698 and Tesla M40

NVIDIA Whitepaper “GPU based deep learning inference: A performance and power analysis.”

SLIDE 27

DL in practice

SLIDE 28

NVIDIA GPU PLATFORM: The Engine of Modern AI

Frameworks: TORCH, THEANO, CAFFE, MATCONVNET, PURINE, MOCHA.JL, MINERVA, MXNET*, CHAINER, BIG SUR, WATSON, OPENDEEP, KERAS, CNTK, TENSORFLOW, DL4J

Education and start-ups: VITRUVIAN, SCHULTS LABORATORIES

*U. Washington, CMU, Stanford, TuSimple, NYU, Microsoft, U. Alberta, MIT, NYU Shanghai

SLIDE 29

CUDA for Deep Learning Development

TITAN X, GPU CLOUD, DEVBOX

DEEP LEARNING SDK: cuDNN, cuBLAS, cuSPARSE, NCCL, DIGITS

SLIDE 30

GPU-accelerated deep learning subroutines
High-performance neural network training
Accelerates major deep learning frameworks: Caffe, Theano, Torch, TensorFlow
Up to 3.5x faster AlexNet training in Caffe than baseline GPU
Tiled FFT up to 2x faster than FFT
[Chart: millions of images trained per day]

developer.nvidia.com/cudnn

[Chart: relative speedup across cuDNN versions 1–4, 0.0x–2.5x scale]

Deep Learning Primitives

Accelerating Artificial Intelligence

SLIDE 31

CUDA BOOSTS DEEP LEARNING 5X IN 2 YEARS

Performance

AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12-core 2.5GHz, 128GB system memory, Ubuntu 14.04

Caffe Performance

[Chart: Caffe performance from 11/2013 to 12/2015: K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4]

SLIDE 32

NVIDIA DIGITS

Interactive Deep Learning GPU Training System

Test Image

Process Data → Configure DNN → Monitor Progress → Visualize Layers

developer.nvidia.com/digits

SLIDE 33

PC GAMING

ONE ARCHITECTURE — END-TO-END AI

DRIVE PX for Auto, Titan X for PC, Tesla for Cloud, Jetson for Embedded

SLIDE 34

Scaling DL

SLIDE 35

Scaling Neural Networks

Data Parallelism

[Diagram: Machine 1 holds weights W and processes Image 1; Machine 2 holds a copy of W and processes Image 2; the machines sync.]

Notes: the model must be synced across machines. The largest models do not fit on one GPU. Requires a P-fold larger batch size for P workers. Works across many nodes with the parameter-server approach, giving near-linear speedup.

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
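The sync step above can be sketched as gradient averaging. This is a toy least-squares model of ours standing in for a real network; in practice each worker would be a separate GPU or machine and the averaging would go through a parameter server or all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data parallelism: P workers hold identical copies of the weights W and each
# computes a gradient on its own shard of the (P-fold larger) batch.
P = 4
W = rng.standard_normal((10, 5))
data = rng.standard_normal((P, 32, 5))      # P shards of 32 examples each
targets = rng.standard_normal((P, 32, 10))

def local_gradient(W, x, t):
    # Gradient of 0.5 * ||x W^T - t||^2 w.r.t. W on one worker's shard
    err = x @ W.T - t
    return err.T @ x / len(x)

grads = [local_gradient(W, data[p], targets[p]) for p in range(P)]

# "Sync." step from the diagram: average the gradients, then apply the same
# update everywhere so all model replicas stay identical.
avg_grad = sum(grads) / P
W_new = W - 0.01 * avg_grad
```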


SLIDE 36

Multiple GPUs

Near-linear scaling with data parallelism.

Ren Wu et al, Baidu, “Deep Image: Scaling up Image Recognition.” arXiv 2015

SLIDE 37

Scaling Neural Networks

Model Parallelism

[Diagram: the weights W of a single model are split across Machine 1 and Machine 2, which together process Image 1]

Notes: Allows for larger models than fit on one GPU. Requires much more frequent communication between GPUs. Most commonly used within a node, over GPU P2P. Effective for the fully connected layers.

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
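Splitting a fully connected layer by output neurons, as described above, can be sketched as follows (layer sizes and the two-way split are our assumptions; real systems move the shards to separate GPUs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Model parallelism: a fully connected layer too big for one device is split
# by output neurons; each "machine" holds half of the weight matrix W.
x = rng.standard_normal(784)               # the same input is sent to both machines
W = rng.standard_normal((512, 784))
W1, W2 = W[:256], W[256:]                  # machine 1 and machine 2 shards

y1 = W1 @ x                                # computed on machine 1
y2 = W2 @ x                                # computed on machine 2

# Communication step: the partial outputs must be gathered at every layer,
# which is why model parallelism needs frequent GPU-to-GPU traffic.
y = np.concatenate([y1, y2])
```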

SLIDE 38

Scaling Neural Networks

Hyper Parameter Parallelism

Try many alternative neural networks in parallel, on different CPUs, GPUs, or machines. Probably the most obvious and effective way!
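Because the trials are independent, hyperparameter search is embarrassingly parallel. A minimal sketch with a toy objective (our stand-in; a real `train_and_score` would launch a full training run on its own GPU or machine):

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(lr):
    # Stand-in for a real training run; returns a validation "score".
    # Toy objective that peaks at lr = 0.1.
    return -(lr - 0.1) ** 2

candidates = [0.001, 0.01, 0.1, 1.0]

# Each candidate trains independently: no communication between trials.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(train_and_score, candidates))

best = candidates[scores.index(max(scores))]
```

With a process pool or a cluster scheduler the same pattern spreads trials across machines with no change to the search logic.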

SLIDE 39

Deep Learning Everywhere

NVIDIA Titan X, NVIDIA Jetson, NVIDIA Tesla, NVIDIA DRIVE PX

Contact: jbarker@nvidia.com