SLIDE 1

High-Performance GPU Programming for Deep Learning

7 April 2016

Scott Gray

Nervana Systems

MAKING MACHINES SMARTER.™

SLIDE 2

High-Performance GPU kernels for deep learning

  • Fast matrix multiply for small minibatches
  • Direct convolution leveraging GEMM advances
  • Even faster convolution with Winograd
SLIDE 3

GEMM: Basics

C = AB
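
Below, a minimal NumPy sketch (illustrative, not the actual kernel) of the loop structure behind C = AB on a GPU: the shared K dimension is walked one step at a time, and each step adds a rank-1 outer product into the C tile that the threads hold in registers.

```python
import numpy as np

def gemm_outer_product(A, B):
    """C = A @ B accumulated as a sum of rank-1 outer products over the
    shared K dimension -- the loop structure a tiled GPU kernel uses,
    with each thread holding a small sub-tile of C in registers."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for k in range(K):
        # one column of A times one row of B updates the whole C tile
        C += np.outer(A[:, k], B[k, :])
    return C

A = np.random.rand(64, 256).astype(np.float32)
B = np.random.rand(256, 32).astype(np.float32)   # small minibatch: N = 32
assert np.allclose(gemm_outer_product(A, B), A @ B, atol=1e-3)
```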

SLIDE 4

GEMM: Memory Load

[Diagram: thread-to-memory load patterns for a single GEMM tile and for batched GEMM; outer product with contiguous vs. strided loads]

SLIDE 5

GEMM: Tile sizes

[Diagram: thread-to-shared-memory load patterns for 32x32 batched GEMM tiles vs. 32x64 and 32x32 GEMM tiles]

SLIDE 6

hGEMM Results - NN

[Chart: GFLOPS vs. batch size N (32-128) for the Nx3072x3072 NN op; Nervana 32x32 tile vs. cuBLAS 128x64 tile]

SLIDE 7

hGEMM Results - TN

[Chart: GFLOPS vs. batch size N (32-128) for the Nx3072x3072 TN op; Nervana 32x32 tile vs. cuBLAS 128x64 tile]

SLIDE 8

Direct convolution is still relevant

  • Striding
  • Odd-size filters
  • Placeholder until a faster algorithm can be implemented
  • Often faster for a single image, or for the first layer, where C is small
SLIDE 9

Direct convolution: implementation details

  • Batched GEMM for efficient transpose and higher occupancy
  • Compound outer-product block remapping
  • Square-wave pattern for P,Q block mapping
  • Slicing: shared-memory lookup + integer division (see the sketch after this list)
  • N vs C contiguous
  • Single P,Q vs tiled P,Q
  • Bprop as an upside-down fprop
  • Update-specific optimizations
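
A minimal NumPy sketch of fprop direct convolution lowered onto one GEMM, using the N-contiguous layout mentioned above. The function name and the materialized patch matrix are illustrative: the real kernel gathers slices on the fly rather than building them in memory.

```python
import numpy as np

def conv_fprop_gemm(I, F, pad=1):
    """Direct convolution fprop lowered onto one big GEMM.
    Layouts as in the slides: I is (C, H, W, N) with N contiguous,
    F is (C, R, S, K); returns O as (K, P, Q, N)."""
    C, H, W, N = I.shape
    _, R, S, K = F.shape
    P, Q = H + 2 * pad - R + 1, W + 2 * pad - S + 1
    Ip = np.pad(I, ((0, 0), (pad, pad), (pad, pad), (0, 0)))
    # Gather overlapping input slices. The GPU kernel performs this
    # "slicing" on the fly with a shared-memory lookup plus integer
    # division instead of materializing the patch matrix.
    cols = np.empty((C, R, S, P, Q, N), dtype=I.dtype)
    for r in range(R):
        for s in range(S):
            cols[:, r, s] = Ip[:, r:r + P, s:s + Q, :]
    O = F.reshape(C * R * S, K).T @ cols.reshape(C * R * S, -1)
    return O.reshape(K, P, Q, N)

# Spot-check against a naive sliding-window reference on tiny sizes.
rng = np.random.default_rng(0)
I = rng.standard_normal((2, 6, 6, 3)).astype(np.float32)
F = rng.standard_normal((2, 3, 3, 4)).astype(np.float32)
O = conv_fprop_gemm(I, F)
Ip = np.pad(I, ((0, 0), (1, 1), (1, 1), (0, 0)))
ref = np.zeros_like(O)
for p in range(6):
    for q in range(6):
        ref[:, p, q, :] = np.einsum('crsn,crsk->kn',
                                    Ip[:, p:p + 3, q:q + 3, :], F)
assert np.allclose(O, ref, atol=1e-4)
```
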
SLIDE 10

Winograd: input transform

[Diagram: input feature map tiled into overlapping 4x4 tiles with stride 2]

  • Input transform
  • 2D Winograd is a nested product of 1D transforms
  • Transforms can be simplified to remove zeros (see the sketch below)
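
A NumPy sketch of the F(2x2, 3x3) input transform with the standard matrices from Lavin and Gray's fast-algorithms paper: V = BᵀdB on one 4x4 tile, computed first as nested 1D matrix multiplies and then in the simplified zero-free form a kernel would actually use. The tile values are illustrative.

```python
import numpy as np

# F(2x2, 3x3) input transform: V = B^T d B on one 4x4 tile.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

d = np.random.rand(4, 4).astype(np.float32)   # one 4x4 input tile
V = Bt @ d @ Bt.T                             # nested 1D transforms

# The same transform simplified to remove the zeros -- pure adds and
# subtracts, which is how a kernel actually computes it:
t = np.empty_like(d)
t[0] = d[0] - d[2]
t[1] = d[1] + d[2]
t[2] = d[2] - d[1]
t[3] = d[1] - d[3]
V2 = np.empty_like(d)
V2[:, 0] = t[:, 0] - t[:, 2]
V2[:, 1] = t[:, 1] + t[:, 2]
V2[:, 2] = t[:, 2] - t[:, 1]
V2[:, 3] = t[:, 1] - t[:, 3]
assert np.allclose(V, V2)
```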

SLIDE 11

Winograd: filter transform

  • Filter transform
  • Same as the input transform but with different coefficients
  • Transform each feature map independently (see the sketch below)
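
The filter transform in the same style: U = GgGᵀ with the F(2x2, 3x3) coefficients, applied independently to each (c, k) filter pair. The einsum line transforms a whole filter bank at once; the shapes are illustrative.

```python
import numpy as np

# F(2x2, 3x3) filter transform: same nested-1D structure as the input
# transform, different coefficients.
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

g = np.random.rand(3, 3).astype(np.float32)   # one 3x3 filter
U = G @ g @ G.T                               # its 4x4 transformed tile

# A whole (C, R, S, K) filter bank in one shot; each (c, k) pair is
# transformed independently, landing in a (4, 4, C, K) layout that
# feeds the batched GEMM on the next slide.
F = np.random.rand(8, 3, 3, 16).astype(np.float32)
U_all = np.einsum('ir,crsk,js->ijck', G, F, G)
assert np.allclose(U_all[:, :, 0, 0], G @ F[0, :, :, 0] @ G.T, atol=1e-5)
```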

SLIDE 12

Winograd: batched GEMM
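
After the two transforms, the multiply stage reduces over channels at each of the 16 points of the 4x4 transform space, and each point is an independent GEMM: one batched GEMM of 16 small matrix products. A sketch with illustrative dimensions:

```python
import numpy as np

C, K, T = 64, 32, 256     # channels, output feature maps, tiles x batch
U = np.random.rand(4, 4, K, C).astype(np.float32)  # transformed filters
V = np.random.rand(4, 4, C, T).astype(np.float32)  # transformed input tiles

# Elementwise multiply + sum over channels == 16 independent GEMMs,
# one per (i, j) point of the transform space: a batched GEMM.
M = np.empty((4, 4, K, T), dtype=np.float32)
for i in range(4):
    for j in range(4):
        M[i, j] = U[i, j] @ V[i, j]

# np.matmul batches over the leading (4, 4) dims and agrees:
assert np.allclose(M, np.matmul(U, V), atol=1e-3)
```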

SLIDE 13

Winograd: output transform

[Diagram: output feature map assembled from 2x2 output tiles]

  • Output transform
  • Same form as the input and filter transforms
  • Transform back to pixel space to obtain a 2x2 output tile (see the sketch below)
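
The output transform sketch: Y = AᵀmA with the F(2x2, 3x3) coefficients collapses each accumulated 4x4 tile back to a 2x2 block of output pixels.

```python
import numpy as np

# F(2x2, 3x3) output transform: back to pixel space.
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

m = np.random.rand(4, 4).astype(np.float32)  # one accumulated GEMM tile
Y = At @ m @ At.T                            # the 2x2 output tile
```

An end-to-end check of all four stages against direct convolution is sketched after Slide 24.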

SLIDE 14

Performance: VGG

[Chart: VGG fp32, totals by operation; algorithmic speedup vs. batch size (1-64) for Winograd vs. cuDNN fprop, bprop, and update]

SLIDE 15

Performance: Alexnet convolutional layers

[Chart: Alexnet totals; algorithmic speedup vs. batch size (4-128) for Nervana fp16/fp32 vs. cuBLAS fp16/fp32]

SLIDE 16

Compounding

Compounding inside of GEMM and conv for free:

  • alpha / beta
  • bias
  • relu, prelu, tanh, …
  • bprop relu, …
  • bprop bias
  • batchnorm mean
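
A sketch of the idea in NumPy. In the real kernels these operations run in the GEMM epilogue while the output tile is still in registers, so they add no memory traffic; the function and argument names here are hypothetical.

```python
import numpy as np

def gemm_compound(A, B, C, bias, alpha=1.0, beta=0.0):
    """GEMM with a compounded epilogue: alpha/beta scaling, bias add,
    and relu folded into the same pass over the output."""
    out = alpha * (A @ B) + beta * C   # the GEMM proper
    out += bias                        # compounded bias (one per output row)
    np.maximum(out, 0.0, out=out)      # compounded relu
    return out

A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(128, 32).astype(np.float32)
C = np.zeros((64, 32), dtype=np.float32)
bias = np.random.rand(64, 1).astype(np.float32)
out = gemm_compound(A, B, C, bias)
assert (out >= 0).all()
```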

SLIDE 17

Summary

  • Nervana has the fastest tools for deep learning
  • neon with state-of-the-art Maxwell kernels
  • Nervana Cloud with multi-GPU training
  • Watch for Nervana Engine, our deep learning processor
SLIDE 18

<extra plots>

SLIDE 19

VGG

[Chart: VGG; algorithmic speedup vs. batch size (1-64) for Winograd fp16/fp32 vs. cuDNN fp16/fp32]

SLIDE 20

GoogLeNetv2 - Totals:

[Chart: GoogLeNetv2 totals; algorithmic speedup vs. batch size (1-64) for Winograd fp16/fp32 vs. cuDNN fp16/fp32]

SLIDE 21

MSRA - Totals:

[Chart: MSRA totals; algorithmic speedup vs. batch size (1-64) for Winograd fp16/fp32 vs. cuDNN fp16/fp32]

SLIDE 22

About nervana

  • A platform for machine intelligence
  • enable deep learning at scale
  • optimized from algorithms to silicon


SLIDE 23

GEMM

  • Matrix multiply is the fundamental operation of deep learning
  • Used in fully connected layers and as the basis for convolution
  • Full utilization is hard to achieve for small mini-batches (tall and skinny matrices); see the arithmetic after this list
  • Carefully optimized memory access patterns
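
A back-of-the-envelope illustration of the tall-and-skinny problem, using the 3072x3072 fp16 layer from the hGEMM charts: the arithmetic intensity of the GEMM grows linearly with batch size N, so small minibatches offer little compute per byte of weight traffic and become bandwidth-bound. Only the dominant weight load is counted here.

```python
D = 3072                       # layer width, as in the hGEMM results
for N in (32, 64, 96, 128):    # minibatch sizes from the charts
    flops = 2 * D * D * N      # one multiply-add per weight per column
    wbytes = D * D * 2         # the fp16 weight matrix, loaded once
    print(N, flops / wbytes)   # arithmetic intensity == N flops/byte
```
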
SLIDE 24

Winograd Convolution

  • Similar to FFT:
  • Transform image tile
  • Transform filter
  • Elementwise (EW) multiply the two
  • Transform output back
  • The transforms are defined in terms of matrix multiplies
  • Optimizations:
  • Kernels for 3x3 pixel filters, 2x2 and 4x4 output tile sizes
  • External vs internal fused transforms (end-to-end sketch below)
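
To tie the four stages together, a self-contained sanity check of F(2x2, 3x3) Winograd convolution against direct correlation on a single feature map. This is a sketch of the algorithm, not the fused kernel: no channels, padding, or batching.

```python
import numpy as np

# F(2x2, 3x3) Winograd end to end on one feature map, checked against
# direct correlation.
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G  = np.array([[1, 0, 0], [.5, .5, .5],
               [.5, -.5, .5], [0, 0, 1]], dtype=np.float32)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)

d = np.random.rand(8, 8).astype(np.float32)   # input feature map
g = np.random.rand(3, 3).astype(np.float32)   # one 3x3 filter
U = G @ g @ G.T                               # filter transform (once)

Y = np.empty((6, 6), dtype=np.float32)
for p in range(0, 6, 2):                      # 4x4 input tiles, stride 2
    for q in range(0, 6, 2):
        V = Bt @ d[p:p + 4, q:q + 4] @ Bt.T   # input transform
        Y[p:p + 2, q:q + 2] = At @ (U * V) @ At.T  # EW multiply + output

ref = np.array([[(d[p:p + 3, q:q + 3] * g).sum() for q in range(6)]
                for p in range(6)], dtype=np.float32)
assert np.allclose(Y, ref, atol=1e-4)
```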