
High-Performance GPU Programming for Deep Learning (7 April 2016)



  1. MAKING MACHINES SMARTER.™ High-Performance GPU Programming for Deep Learning 7 April 2016 Scott Gray Nervana Systems

  2. High-Performance GPU kernels for deep learning
     • Fast matrix multiply for small minibatches
     • Direct convolution leveraging GEMM advances
     • Even faster convolution with Winograd
     Proprietary and confidential. Do not distribute.

  3. GEMM: Basics: C = AB
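As a minimal illustration of the C = AB primitive above, a naive triple-loop GEMM in plain Python (a sketch of the math only; the talk's actual kernels tile this loop nest across threads and the memory hierarchy):

```python
def gemm(A, B):
    """Naive GEMM: C = A @ B, the core primitive behind fully connected layers."""
    M, K = len(A), len(A[0])
    K2, N = len(B), len(B[0])
    assert K == K2, "inner dimensions must match"
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C
```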

  4. GEMM: Memory Load (diagram: outer-product memory loads, contiguous vs. strided; single tile, threads, memory load; batched GEMM)
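The outer-product load pattern on this slide corresponds to computing C as a sum of rank-1 updates, one per k step; a NumPy sketch of that view (illustrative only, not kernel code):

```python
import numpy as np

def gemm_outer(A, B):
    # Accumulate C as a sum of rank-1 outer products, one per k step.
    # This is the register-level view used by tiled GPU GEMM kernels:
    # each k step broadcasts a column of A against a row of B.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])
    return C
```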

  5. GEMM: Tile sizes (diagram: batched GEMM 32x32 tiles; 32x64 and 32x32 GEMM tiles; threads, shared memory load)
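The tile sizes above describe how the output matrix is partitioned; a blocked GEMM can be sketched with NumPy slices, where each (i, j) output tile would map to one thread block on the GPU (tile size 32 mirrors the 32x32 tiles on the slide; a sketch of the decomposition, not of the kernel itself):

```python
import numpy as np

def gemm_tiled(A, B, tile=32):
    # Blocked GEMM: each (i, j) output tile is accumulated from strips
    # of A and B, one k-tile at a time. NumPy slicing handles ragged
    # edge tiles when the dimensions are not multiples of the tile size.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C
```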

  6. hGEMM Results - NN (chart: Nx3072x3072, NN op; Nervana 32x32 vs. cuBLAS 128x64; GFLOPS axis 0 to 6000; batch size N = 32 to 128)

  7. hGEMM Results - TN (chart: Nx3072x3072, TN op; Nervana 32x32 vs. cuBLAS 128x64; GFLOPS axis 0 to 6000; batch size N = 32 to 128)

  8. Direct convolution is still relevant
     • Striding
     • Odd-size filters
     • Placeholder until a faster algorithm can be implemented
     • Often faster for a single image, or for the first layer with few channels (small C)

  9. Direct convolution: implementation details
     • Batched GEMM for efficient transpose and higher occupancy
     • Compound outer-product block remapping
     • Square-wave pattern for P,Q block mapping
     • Slicing: shared memory lookup + integer division
     • N vs. C contiguous
     • Single P,Q vs. tiled P,Q
     • Bprop as upside-down fprop
     • Update-specific optimizations
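The lowering of direct convolution onto GEMM that these kernels build on can be sketched with an explicit im2col patch matrix. The production kernels gather these slices on the fly (the "slicing" bullet above) rather than materializing the matrix; this sketch shows only the math, and the function name is illustrative:

```python
import numpy as np

def conv2d_as_gemm(x, w):
    # Direct convolution lowered to a single GEMM (im2col view).
    # x: input feature maps (C, H, W); w: filters (K, C, R, S).
    # Valid padding, stride 1, correlation convention (no filter flip).
    C, H, W = x.shape
    K, _, R, S = w.shape
    P, Q = H - R + 1, W - S + 1          # output height, width
    patches = np.empty((C * R * S, P * Q))
    for p in range(P):
        for q in range(Q):
            # One column per output pixel: the receptive field, flattened.
            patches[:, p * Q + q] = x[:, p:p+R, q:q+S].ravel()
    # (K, CRS) @ (CRS, PQ) -> (K, PQ), reshaped to output feature maps.
    return (w.reshape(K, -1) @ patches).reshape(K, P, Q)
```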

  10. Winograd: input transform (4x4 tiles, stride 2)
      • 2D Winograd is a nested product of 1D transforms
      • Transforms can be simplified to remove zeros
      (figure: Input Feature Map)
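The 4x4-tile input transform for F(2x2, 3x3) can be written as two matrix multiplies nesting the 1D transform, V = Bᵀ d B, with the standard coefficients from the Lavin/Gray Winograd construction (a sketch of the math; the zeros in Bᵀ are what get "simplified away" into a handful of adds and subtracts in the kernel):

```python
import numpy as np

# 1D input-transform matrix B^T for Winograd F(2, 3); nesting it on
# both sides gives the 2D transform of one 4x4 input tile.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)

def input_transform(d):
    """Transform one 4x4 input tile: V = B^T d B."""
    return BT @ d @ BT.T
```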

  11. Winograd: filter transform
      • Same as the input transform but with different coefficients
      • Transform each feature map independently
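The filter transform has the same nested-matrix-multiply shape with different coefficients: U = G g Gᵀ maps each 3x3 filter into the 4x4 transform domain (again the standard F(2, 3) coefficients; sketch only):

```python
import numpy as np

# 1D filter-transform matrix G for Winograd F(2, 3).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

def filter_transform(g):
    """Transform one 3x3 filter: U = G g G^T (done once per filter)."""
    return G @ g @ G.T
```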

  12. Winograd: batched GEMM

  13. Winograd: output transform
      • Same as input and filter transforms
      • Transform back to pixel space to obtain a 2x2 output tile
      (figure: Output Feature Map)
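Putting the three transforms together for a single tile and filter: multiply elementwise in the transform domain, then Aᵀ m A brings the 4x4 product back to a 2x2 output tile in pixel space. The sketch below checks itself against direct convolution (the real kernels replace the elementwise multiply with batched GEMMs across feature maps):

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices.
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G  = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5],
               [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """One 4x4 input tile d, one 3x3 filter g -> 2x2 output tile."""
    m = (G @ g @ G.T) * (BT @ d @ BT.T)   # EW multiply in transform domain
    return AT @ m @ AT.T                  # output transform back to pixels

def direct_2x2_3x3(d, g):
    """Reference: direct valid correlation, for checking the transforms."""
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                     for i in range(2)])
```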

  14. Performance: VGG (chart: VGG fp32, totals by operation; algorithmic speedup, Winograd fp32 vs. cuDNN fp32 for fprop, bprop, and update; speedup axis 0 to 2; batch sizes 64 down to 1)

  15. Performance: Alexnet convolutional layers (chart: Alexnet totals; algorithmic speedup, Nervana fp16/fp32 vs. cuBLAS fp16/fp32; speedup axis 0 to 2; batch sizes 128 down to 4)

  16. Compounding inside of GEMM and conv for free:
      • alpha / beta
      • bias
      • relu, prelu, tanh, …
      • bprop relu, …
      • bprop bias
      • batchnorm mean
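The idea behind compounding is that the output tile is still in registers at the end of the GEMM, so elementwise work costs nothing extra. A NumPy sketch of the fused math for the alpha/beta, bias, and relu cases listed above (function name illustrative; in the kernel this runs in the GEMM epilogue, not as separate passes):

```python
import numpy as np

def gemm_bias_relu(A, B, bias, alpha=1.0, beta=0.0, C=None):
    # Fused GEMM epilogue: alpha/beta scaling, bias add, then ReLU,
    # applied while the output would still be resident in registers
    # instead of launching separate elementwise kernels.
    out = alpha * (A @ B)
    if beta != 0.0 and C is not None:
        out = out + beta * C          # blend with existing output
    out = out + bias                  # bias broadcast over rows
    return np.maximum(out, 0.0)       # ReLU
```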

  17. Summary
      • Nervana has the fastest tools for deep learning
      • neon with state-of-the-art Maxwell kernels
      • Nervana Cloud with multi-GPU training
      • Watch for Nervana Engine, our deep learning processor

  18. <extra plots>

  19. VGG (chart: algorithmic speedup, Winograd fp16/fp32 vs. cuDNN fp16/fp32; speedup axis 0 to 2; batch sizes 64 down to 1)

  20. GoogLeNetv2 - Totals (chart: algorithmic speedup, Winograd fp16/fp32 vs. cuDNN fp16/fp32; speedup axis 0 to 2; batch sizes 64 down to 1)

  21. MSRA - Totals (chart: algorithmic speedup, Winograd fp16/fp32 vs. cuDNN fp16/fp32; speedup axis 0 to 2; batch sizes 64 down to 1)

  22. About nervana
      • A platform for machine intelligence
      • Enable deep learning at scale
      • Optimized from algorithms to silicon

  23. GEMM
      • Matrix multiply is the fundamental operation of deep learning
      • Used in fully connected layers and as the basis for convolution
      • Full utilization is hard to achieve for small minibatches (tall and skinny matrices)
      • Carefully optimized memory access patterns

  24. Winograd Convolution
      • Similar to FFT:
        • Transform image tile
        • Transform filter
        • EW multiply the two
        • Transform output back
      • Optimizations:
        • Kernels for 3x3 pixel filters, 2x2 and 4x4 output tile sizes
        • External vs. internal fused transforms
      • The transforms are defined in terms of matrix multiplies
