Deep Learning on GPU
Mattias Fält
Dept. of Automatic Control, Lund Institute of Technology

Overview
- What is the difference between CPU and GPU?
- What is CUDA, and how does it relate to cuBLAS and cuDNN?
- How is this connected to Deep Learning and TensorFlow?
- How do I run TensorFlow on the GPU?
- What is a TPU?

GPU vs CPU
http://allegroviva.com/gpu-computing/difference-between-gpu-and-cpu

CPU (Central Processing Unit)
CPU (Intel i7 6700K, 3500 kr):
- Few cores (4)
- Few threads per core (2)
- Possible to have one thread per process
- Fast clock (4 GHz)
- Advanced instructions and memory management
  - Virtual memory, security (memory protected from other processes)
  - Encryption (AES), trigonometric functions, interrupts
  - Branch prediction, reordering of operations
- SIMD (Single Instruction Multiple Data) (8 SP float operations per core)
- Lower memory bandwidth (34.2 GB/s)
- Low power (95 W)
- (64 GFLOPS SP or 128 GFLOPS DP + SIMD?)

GPU (Graphics Processing Unit)
GPU (Nvidia Titan X, 14000 kr), with the corresponding i7 figure marked "i7:":
- Lots of cores (3072) (i7: 32)
  - 24 multiprocessors (MP)
  - 128 CUDA cores per MP
- Lots of possible threads (2048 per MP) (i7: 2 per core)
- Not one process per thread (one per MP?)
- Slower clock (≈ 1 GHz) (i7: 4 GHz)
- Simpler instructions per core
  - "Special Function Unit" to handle advanced ops (trigonometric)
- High memory bandwidth (480 GB/s graphics memory)
- Slower bandwidth from RAM (6.0 GB/s) (i7: 34 GB/s)
- High power (250 W) (i7: 95 W)
- Color LEDs! (16.8M colors) (i7: 0!)
- Up to 6600 GFLOPS SP and 206 GFLOPS DP (i7: 64 GFLOPS SP or 128 GFLOPS DP or SIMD)
Conclusion:
- CPU for fast and advanced dependent operations
- GPU for massively parallel SIMD with shared data
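The GPU figures above can be cross-checked with a little arithmetic. The following is a rough sketch, not a benchmark: it assumes each CUDA core completes one fused multiply-add (counted as 2 floating-point operations) per cycle, and it assumes a boost clock of 1.075 GHz, which is not stated on the slide.

```python
# Back-of-the-envelope check of the GPU numbers quoted on the slide.
# Assumptions (not from the slide): 2 FLOPs per core per cycle (one FMA)
# and a 1.075 GHz boost clock.

def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores * clock (GHz) * FLOPs per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

gpu_cores = 24 * 128                       # multiprocessors * CUDA cores per MP
gpu_peak = peak_gflops(gpu_cores, 1.075, 2)

# Ratio between graphics-memory bandwidth and the host-RAM-to-GPU link.
bandwidth_ratio = 480.0 / 6.0

print(gpu_cores)          # 3072
print(round(gpu_peak))    # 6605, close to the quoted 6600 GFLOPS SP
print(bandwidth_ratio)    # 80.0
```

The factor of 80 between the two bandwidths suggests why data should stay in graphics memory between operations whenever possible.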

CUDA
- CUDA is an API for GPGPU (General-Purpose computing on GPU)
- Released in 2007 by Nvidia
- Previously, computations on the GPU were only possible through a graphics API
- Alternatives exist
  - Open Computing Language (OpenCL)
  - Joint project by Altera, AMD, Apple, ARM Holdings, Creative Technology, IBM, Imagination Technologies, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS

CUDA Example
Create function (kernel):
    // Each thread handles one element: y[i] = a*x[i] + y[i]
    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) y[i] = a * x[i] + y[i];              // skip threads past the end
    }
Allocate space on GPU:
    float *d_x, *d_y;  // device pointers
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
Copy to GPU, run and get data:
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);
    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);  // 256 threads per block
    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
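The index arithmetic in the saxpy launch can be emulated on the CPU to see why the grid size is rounded up and why the bounds check is needed. This is only an illustrative sketch in plain Python: the helper names `saxpy_thread` and `launch_saxpy` are hypothetical, and the loops run sequentially, whereas a GPU executes the threads in parallel.

```python
# Pure-Python emulation of the saxpy kernel launch, for illustration only.

def saxpy_thread(block_idx, block_dim, thread_idx, n, a, x, y):
    """Body of one emulated CUDA thread."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < n:                                # threads past the end do nothing
        y[i] = a * x[i] + y[i]

def launch_saxpy(n, a, x, y, block_dim=256):
    """Emulate saxpy<<<(n + 255) / 256, 256>>>(n, a, x, y); returns the grid size."""
    grid_dim = (n + block_dim - 1) // block_dim   # round up so all n elements are covered
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            saxpy_thread(block_idx, block_dim, thread_idx, n, a, x, y)
    return grid_dim

n = 1000
x = [1.0] * n
y = [2.0] * n
blocks = launch_saxpy(n, 2.0, x, y)
print(blocks)         # 4 blocks, i.e. 1024 threads for 1000 elements
print(y[0], y[-1])    # 4.0 4.0
```

With n = 1000, the last 24 emulated threads fall outside the arrays, which is exactly the case the `if (i < n)` guard handles.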

cuBLAS
- BLAS (Basic Linear Algebra Subprograms) library built on CUDA.
- Implements several level 1-3 BLAS routines.
- Names have the form cublas<T><routine>, $T \in \{S, D, C, Z\}$
Examples (here $A^{(T)}$ denotes $A$ or its transpose):
- Level 1 (scalar and vector operations):
  - cublasDaxpy (Alpha X Plus Y): $y = \alpha x + y$
  - cublasDdot (dot product): $z = x \cdot y$
- Level 2 (matrix-vector operations):
  - cublasDgbmv (General Banded Matrix Vector): $y = \alpha A^{(T)} x + \beta y$
  - cublasDspr2 (Symmetric Packed Rank 2): $A = \alpha (x y^T + y x^T) + A$
- Level 3 (matrix-matrix operations):
  - cublasDgemm (GEneral Matrix Matrix): $C = \alpha A^{(T)} B^{(T)} + \beta C$
  - cublasDtrsm (TRiangular Solve, Multiple right-hand sides): $A^{(T)} X = \alpha B$
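To make the formulas concrete, here is a minimal reference sketch of what axpy, dot and gemm compute, written in plain Python. It is not the cuBLAS API: the function names and signatures are hypothetical, matrices are row-major nested lists (cuBLAS itself uses column-major storage), and no attempt is made at performance.

```python
# Reference semantics of three BLAS routines (illustrative, not the cuBLAS API).

def axpy(alpha, x, y):
    """Level 1: return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """Level 1: return the dot product x . y."""
    return sum(xi * yi for xi, yi in zip(x, y))

def transpose(m):
    """Return the transpose of a row-major nested-list matrix."""
    return [list(col) for col in zip(*m)]

def gemm(alpha, a, b, beta, c, trans_a=False, trans_b=False):
    """Level 3: return alpha * op(A) * op(B) + beta * C, op being optional transpose."""
    if trans_a:
        a = transpose(a)
    if trans_b:
        b = transpose(b)
    b_cols = transpose(b)   # columns of op(B), so each entry is a dot product
    return [[alpha * dot(row, col) + beta * c[i][j]
             for j, col in enumerate(b_cols)]
            for i, row in enumerate(a)]

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
c = [[0.0, 0.0], [0.0, 0.0]]
print(axpy(2.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))   # [3.0, 5.0, 7.0]
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))         # 32.0
print(gemm(1.0, a, b, 0.0, c))                       # [[22.0, 28.0], [49.0, 64.0]]
```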

NVIDIA CUDA Deep Neural Network library (cuDNN)
- A low-level API providing flexible, efficient, well-optimized parallel implementations of common deep learning routines for the GPU.
- Efficient Primitives for Deep Learning
- Similar philosophy to BLAS.
- Can be used with e.g. Caffe, TensorFlow, Theano, Torch, CNTK
- Implements several commonly used routines:
  - Convolutions
  - Sigmoid
  - ReLU
  - Hyperbolic Tangent
  - Softmax
  - Max-pooling
  - Tensor Transformations
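As a reminder of what some of the listed routines compute, here is a small reference sketch in plain Python operating on flat lists. It is not the cuDNN API; the function names are hypothetical, the pooling is one-dimensional for brevity, and cuDNN's implementations work on batched multi-dimensional tensors.

```python
import math

# Reference definitions of a few routines named on the slide (illustrative only).

def relu(xs):
    """ReLU: max(0, x) elementwise."""
    return [max(0.0, x) for x in xs]

def sigmoid(xs):
    """Logistic sigmoid: 1 / (1 + exp(-x)) elementwise."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

def tanh(xs):
    """Hyperbolic tangent elementwise."""
    return [math.tanh(x) for x in xs]

def softmax(xs):
    """Softmax, shifted by the maximum for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def max_pool_1d(xs, window, stride):
    """1-D max-pooling: maximum of each full window of the given size."""
    return [max(xs[i:i + window])
            for i in range(0, len(xs) - window + 1, stride)]

print(relu([-1.0, 0.5]))                          # [0.0, 0.5]
print(sigmoid([0.0]))                             # [0.5]
print(max_pool_1d([1, 3, 2, 5, 4, 0], 2, 2))      # [3, 5, 4]
```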

TensorFlow
- TensorFlow binaries require CUDA Toolkit 8.0 and cuDNN v5 for GPU support
- More options are available if you compile TensorFlow yourself
- Instructions: https://www.tensorflow.org/versions/r0.11/get_started/os_setup.html
- TensorFlow will automatically decide which operations should run on the CPU and GPU(s).
- Normal speedup is in the range 10-20x. http://www.nvidia.com/object/gpu-accelerated-applications-tensorflow-benchmarks.html

TensorFlow
    import tensorflow as tf

    # Creates a graph.
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    # Runs the op.
    print(sess.run(c))

    # Output:
    # Device mapping:
    # /job:localhost/replica:0/task:0/gpu:0 -> device: 0,
    # name: Tesla K40c, pci bus, id: 0000:05:00.0
    # b: /job:localhost/replica:0/task:0/gpu:0
    # a: /job:localhost/replica:0/task:0/gpu:0
    # MatMul: /job:localhost/replica:0/task:0/gpu:0
    # [[ 22.  28.]
    #  [ 49.  64.]]

TPU (Tensor Processing Unit)
- "order of magnitude better performance per watt than standard solutions"
- All Google TPUs "can find all the text in the Street View database in less than five days"
- One can process more than 100 million Google photos per day.
- Very secret: Used for training or evaluation?
- "reduced computational precision": 32-bit, 16-bit, 8-bit?
- https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
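To illustrate what "reduced computational precision" can mean in general, the sketch below quantizes floating-point values to 8-bit signed integers with a single scale factor and measures the rounding error. This is a generic textbook scheme chosen for illustration; the source does not say which precision or scheme the TPU actually uses, and the function names are hypothetical.

```python
# Symmetric linear quantization to signed integers (illustrative assumption,
# not a description of the TPU).

def quantize(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8 bits
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    return [int(round(v / scale)) for v in values], scale

def dequantize(q, scale):
    """Map quantized integers back to approximate floats."""
    return [qi * scale for qi in q]

weights = [0.8, -0.31, 0.05, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q[0])                      # 127: the largest magnitude maps to qmax
print(max_error <= scale / 2)    # rounding error is at most half a step
```

Each value now fits in one byte instead of four, at the cost of an error bounded by half the quantization step.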

Homework
Either
- Run one of your own examples on a CPU and a GPU and compare the times. Do all computations run on the GPU? https://www.tensorflow.org/versions/r0.11/how_tos/using_gpu/index.html
Or
- Read: TensorFlow: A system for large-scale machine learning, https://arxiv.org/abs/1605.08695
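For the timing comparison, a small helper along the following lines may be useful. It is a sketch using only the Python standard library: `time_it`, `speedup` and the stand-in `workload` are hypothetical names, and the workload here is plain Python rather than a TensorFlow graph.

```python
import time

def time_it(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls to fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def speedup(t_cpu, t_gpu):
    """How many times faster the GPU run is than the CPU run."""
    return t_cpu / t_gpu

def workload():
    """Stand-in computation; replace with a function that runs your own example."""
    return sum(i * i for i in range(100000))

elapsed = time_it(workload)
print("best of 5: %.6f s" % elapsed)
print(speedup(10.0, 0.5))   # 20.0
```

In practice `fn` would wrap a `sess.run(...)` call on a graph built under `tf.device('/cpu:0')` or `tf.device('/gpu:0')`. Taking the best of several runs reduces the influence of one-time initialization in the first call.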
