SLIDE 1

High-Performance GPU Programming for Deep Learning

7 April 2016

Scott Gray

Nervana Systems

MAKING MACHINES SMARTER.™

SLIDE 2

High-Performance GPU kernels for deep learning

  • Fast matrix multiply for small minibatches
  • Direct convolution leveraging GEMM advances
  • Even faster convolution with Winograd
SLIDE 3

GEMM: Basics

C = AB
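
Below, a minimal NumPy sketch (illustrative, not the actual kernel) of the loop structure behind C = AB on a GPU: the shared K dimension is walked one step at a time, and each step adds a rank-1 outer product into the C tile that the threads hold in registers.

```python
import numpy as np

def gemm_outer_product(A, B):
    """C = A @ B accumulated as a sum of rank-1 outer products over the
    shared K dimension -- the loop structure a tiled GPU kernel uses,
    with each thread holding a small sub-tile of C in registers."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for k in range(K):
        # one column of A times one row of B updates the whole C tile
        C += np.outer(A[:, k], B[k, :])
    return C

A = np.random.rand(64, 256).astype(np.float32)
B = np.random.rand(256, 32).astype(np.float32)   # small minibatch: N = 32
assert np.allclose(gemm_outer_product(A, B), A @ B, atol=1e-3)
```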

SLIDE 4

GEMM: Memory Load

[Diagram: thread-to-memory load patterns for a single GEMM tile and for batched GEMM; outer product with contiguous vs. strided loads]

SLIDE 5

GEMM: Tile sizes

[Diagram: thread-to-shared-memory load patterns for 32x32 batched GEMM tiles vs. 32x64 and 32x32 GEMM tiles]

SLIDE 6

hGEMM Results - NN

[Chart: GFLOPS vs. batch size N (32-128) for the Nx3072x3072 NN op; Nervana 32x32 tile vs. cuBLAS 128x64 tile]

SLIDE 7

hGEMM Results - TN

[Chart: GFLOPS vs. batch size N (32-128) for the Nx3072x3072 TN op; Nervana 32x32 tile vs. cuBLAS 128x64 tile]

SLIDE 8

Direct convolution is still relevant

  • Striding
  • Odd-size filters
  • Placeholder until a faster algorithm can be implemented
  • Often faster for a single image, or for the first layer, where C is small
SLIDE 9

Direct convolution: implementation details

  • Batched GEMM for efficient transpose and higher occupancy
  • Compound outer-product block remapping
  • Square-wave pattern for P,Q block mapping
  • Slicing: shared-memory lookup + integer division (see the sketch after this list)
  • N vs C contiguous
  • Single P,Q vs tiled P,Q
  • Bprop as an upside-down fprop
  • Update-specific optimizations
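
A minimal NumPy sketch of fprop direct convolution lowered onto one GEMM, using the N-contiguous layout mentioned above. The function name and the materialized patch matrix are illustrative: the real kernel gathers slices on the fly rather than building them in memory.

```python
import numpy as np

def conv_fprop_gemm(I, F, pad=1):
    """Direct convolution fprop lowered onto one big GEMM.
    Layouts as in the slides: I is (C, H, W, N) with N contiguous,
    F is (C, R, S, K); returns O as (K, P, Q, N)."""
    C, H, W, N = I.shape
    _, R, S, K = F.shape
    P, Q = H + 2 * pad - R + 1, W + 2 * pad - S + 1
    Ip = np.pad(I, ((0, 0), (pad, pad), (pad, pad), (0, 0)))
    # Gather overlapping input slices. The GPU kernel performs this
    # "slicing" on the fly with a shared-memory lookup plus integer
    # division instead of materializing the patch matrix.
    cols = np.empty((C, R, S, P, Q, N), dtype=I.dtype)
    for r in range(R):
        for s in range(S):
            cols[:, r, s] = Ip[:, r:r + P, s:s + Q, :]
    O = F.reshape(C * R * S, K).T @ cols.reshape(C * R * S, -1)
    return O.reshape(K, P, Q, N)

# Spot-check against a naive sliding-window reference on tiny sizes.
rng = np.random.default_rng(0)
I = rng.standard_normal((2, 6, 6, 3)).astype(np.float32)
F = rng.standard_normal((2, 3, 3, 4)).astype(np.float32)
O = conv_fprop_gemm(I, F)
Ip = np.pad(I, ((0, 0), (1, 1), (1, 1), (0, 0)))
ref = np.zeros_like(O)
for p in range(6):
    for q in range(6):
        ref[:, p, q, :] = np.einsum('crsn,crsk->kn',
                                    Ip[:, p:p + 3, q:q + 3, :], F)
assert np.allclose(O, ref, atol=1e-4)
```
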
SLIDE 10

Winograd: input transform

[Diagram: input feature map tiled into overlapping 4x4 tiles with stride 2]

  • Input transform
  • 2D Winograd is a nested product of 1D transforms
  • Transforms can be simplified to remove zeros (see the sketch below)
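
A NumPy sketch of the F(2x2, 3x3) input transform with the standard matrices from Lavin and Gray's fast-algorithms paper: V = BᵀdB on one 4x4 tile, computed first as nested 1D matrix multiplies and then in the simplified zero-free form a kernel would actually use. The tile values are illustrative.

```python
import numpy as np

# F(2x2, 3x3) input transform: V = B^T d B on one 4x4 tile.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

d = np.random.rand(4, 4).astype(np.float32)   # one 4x4 input tile
V = Bt @ d @ Bt.T                             # nested 1D transforms

# The same transform simplified to remove the zeros -- pure adds and
# subtracts, which is how a kernel actually computes it:
t = np.empty_like(d)
t[0] = d[0] - d[2]
t[1] = d[1] + d[2]
t[2] = d[2] - d[1]
t[3] = d[1] - d[3]
V2 = np.empty_like(d)
V2[:, 0] = t[:, 0] - t[:, 2]
V2[:, 1] = t[:, 1] + t[:, 2]
V2[:, 2] = t[:, 2] - t[:, 1]
V2[:, 3] = t[:, 1] - t[:, 3]
assert np.allclose(V, V2)
```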

SLIDE 11

Winograd: filter transform

  • Filter transform
  • Same as the input transform but with different coefficients
  • Transform each feature map independently (see the sketch below)
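
The filter transform in the same style: U = GgGᵀ with the F(2x2, 3x3) coefficients, applied independently to each (c, k) filter pair. The einsum line transforms a whole filter bank at once; the shapes are illustrative.

```python
import numpy as np

# F(2x2, 3x3) filter transform: same nested-1D structure as the input
# transform, different coefficients.
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

g = np.random.rand(3, 3).astype(np.float32)   # one 3x3 filter
U = G @ g @ G.T                               # its 4x4 transformed tile

# A whole (C, R, S, K) filter bank in one shot; each (c, k) pair is
# transformed independently, landing in a (4, 4, C, K) layout that
# feeds the batched GEMM on the next slide.
F = np.random.rand(8, 3, 3, 16).astype(np.float32)
U_all = np.einsum('ir,crsk,js->ijck', G, F, G)
assert np.allclose(U_all[:, :, 0, 0], G @ F[0, :, :, 0] @ G.T, atol=1e-5)
```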

SLIDE 12

Winograd: batched GEMM
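
After the two transforms, the multiply stage reduces over channels at each of the 16 points of the 4x4 transform space, and each point is an independent GEMM: one batched GEMM of 16 small matrix products. A sketch with illustrative dimensions:

```python
import numpy as np

C, K, T = 64, 32, 256     # channels, output feature maps, tiles x batch
U = np.random.rand(4, 4, K, C).astype(np.float32)  # transformed filters
V = np.random.rand(4, 4, C, T).astype(np.float32)  # transformed input tiles

# Elementwise multiply + sum over channels == 16 independent GEMMs,
# one per (i, j) point of the transform space: a batched GEMM.
M = np.empty((4, 4, K, T), dtype=np.float32)
for i in range(4):
    for j in range(4):
        M[i, j] = U[i, j] @ V[i, j]

# np.matmul batches over the leading (4, 4) dims and agrees:
assert np.allclose(M, np.matmul(U, V), atol=1e-3)
```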

SLIDE 13

Winograd: output transform

[Diagram: output feature map assembled from 2x2 output tiles]

  • Output transform
  • Same form as the input and filter transforms
  • Transform back to pixel space to obtain a 2x2 output tile (see the sketch below)
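
The output transform sketch: Y = AᵀmA with the F(2x2, 3x3) coefficients collapses each accumulated 4x4 tile back to a 2x2 block of output pixels.

```python
import numpy as np

# F(2x2, 3x3) output transform: back to pixel space.
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

m = np.random.rand(4, 4).astype(np.float32)  # one accumulated GEMM tile
Y = At @ m @ At.T                            # the 2x2 output tile
```

An end-to-end check of all four stages against direct convolution is sketched after Slide 24.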

SLIDE 14

Performance: VGG

[Chart: VGG fp32, totals by operation; algorithmic speedup vs. batch size (1-64) for Winograd vs. cuDNN fprop, bprop, and update]

SLIDE 15

Performance: Alexnet convolutional layers

[Chart: Alexnet totals; algorithmic speedup vs. batch size (4-128) for Nervana fp16/fp32 vs. cuBLAS fp16/fp32]

SLIDE 16

Compounding

Compounding inside of GEMM and conv for free:

  • alpha / beta
  • bias
  • relu, prelu, tanh, …
  • bprop relu, …
  • bprop bias
  • batchnorm mean
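
A sketch of the idea in NumPy. In the real kernels these operations run in the GEMM epilogue while the output tile is still in registers, so they add no memory traffic; the function and argument names here are hypothetical.

```python
import numpy as np

def gemm_compound(A, B, C, bias, alpha=1.0, beta=0.0):
    """GEMM with a compounded epilogue: alpha/beta scaling, bias add,
    and relu folded into the same pass over the output."""
    out = alpha * (A @ B) + beta * C   # the GEMM proper
    out += bias                        # compounded bias (one per output row)
    np.maximum(out, 0.0, out=out)      # compounded relu
    return out

A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(128, 32).astype(np.float32)
C = np.zeros((64, 32), dtype=np.float32)
bias = np.random.rand(64, 1).astype(np.float32)
out = gemm_compound(A, B, C, bias)
assert (out >= 0).all()
```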

SLIDE 17

Summary

  • Nervana has the fastest tools for deep learning
  • neon with state-of-the-art Maxwell kernels
  • Nervana Cloud with multi-GPU training
  • Watch for Nervana Engine, our deep learning processor
SLIDE 18

<extra plots>

SLIDE 19

VGG

[Chart: VGG; algorithmic speedup vs. batch size (1-64) for Winograd fp16/fp32 vs. cuDNN fp16/fp32]

SLIDE 20

GoogLeNetv2 - Totals:

[Chart: GoogLeNetv2 totals; algorithmic speedup vs. batch size (1-64) for Winograd fp16/fp32 vs. cuDNN fp16/fp32]

SLIDE 21

MSRA - Totals:

[Chart: MSRA totals; algorithmic speedup vs. batch size (1-64) for Winograd fp16/fp32 vs. cuDNN fp16/fp32]

SLIDE 22

About nervana

  • A platform for machine intelligence
  • enable deep learning at scale
  • optimized from algorithms to silicon


SLIDE 23

GEMM

  • Matrix multiply is the fundamental operation of deep learning
  • Used in fully connected layers and as the basis for convolution
  • Full utilization is hard to achieve for small mini-batches (tall and skinny matrices); see the arithmetic after this list
  • Carefully optimized memory access patterns
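
A back-of-the-envelope illustration of the tall-and-skinny problem, using the 3072x3072 fp16 layer from the hGEMM charts: the arithmetic intensity of the GEMM grows linearly with batch size N, so small minibatches offer little compute per byte of weight traffic and become bandwidth-bound. Only the dominant weight load is counted here.

```python
D = 3072                       # layer width, as in the hGEMM results
for N in (32, 64, 96, 128):    # minibatch sizes from the charts
    flops = 2 * D * D * N      # one multiply-add per weight per column
    wbytes = D * D * 2         # the fp16 weight matrix, loaded once
    print(N, flops / wbytes)   # arithmetic intensity == N flops/byte
```
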
SLIDE 24

Winograd Convolution

  • Similar to FFT:
  • Transform image tile
  • Transform filter
  • Elementwise (EW) multiply the two
  • Transform output back
  • The transforms are defined in terms of matrix multiplies
  • Optimizations:
  • Kernels for 3x3 pixel filters, 2x2 and 4x4 output tile sizes
  • External vs internal fused transforms (end-to-end sketch below)
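
To tie the four stages together, a self-contained sanity check of F(2x2, 3x3) Winograd convolution against direct correlation on a single feature map. This is a sketch of the algorithm, not the fused kernel: no channels, padding, or batching.

```python
import numpy as np

# F(2x2, 3x3) Winograd end to end on one feature map, checked against
# direct correlation.
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float32)
G  = np.array([[1, 0, 0], [.5, .5, .5],
               [.5, -.5, .5], [0, 0, 1]], dtype=np.float32)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float32)

d = np.random.rand(8, 8).astype(np.float32)   # input feature map
g = np.random.rand(3, 3).astype(np.float32)   # one 3x3 filter
U = G @ g @ G.T                               # filter transform (once)

Y = np.empty((6, 6), dtype=np.float32)
for p in range(0, 6, 2):                      # 4x4 input tiles, stride 2
    for q in range(0, 6, 2):
        V = Bt @ d[p:p + 4, q:q + 4] @ Bt.T   # input transform
        Y[p:p + 2, q:q + 2] = At @ (U * V) @ At.T  # EW multiply + output

ref = np.array([[(d[p:p + 3, q:q + 3] * g).sum() for q in range(6)]
                for p in range(6)], dtype=np.float32)
assert np.allclose(Y, ref, atol=1e-4)
```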