SLIDE 1

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

Cong Guo^1, Yangjie Zhou^1, Jingwen Leng^1, Yuhao Zhu^2, Zidong Du^3, Quan Chen^1, Chao Li^1, Bin Yao^1, Minyi Guo^1

^1 Shanghai Jiao Tong University, ^2 University of Rochester, ^3 Institute of Computing Technology, Chinese Academy of Sciences

SLIDE 2

Biography

  • Cong Guo
  • First-year Ph.D. student at Shanghai Jiao Tong University
  • Interested in computer architecture and high-performance computing
  • Jingwen Leng
  • Associate professor
  • John Hopcroft Center for Computer Science

Department of Computer Science and Engineering, Shanghai Jiao Tong University

SLIDE 3

Outline

  • Introduction
  • Simultaneous Multi-mode Architecture
  • Evaluation

SLIDE 4

Introduction

[Figure: the efficiency vs. flexibility spectrum. The Google TPU [ISCA'16], specialized for GEMM (General Matrix Multiply), sits at the efficiency end; the Nvidia GPU [Volta, '17], built from general-purpose cores, sits at the flexibility end.]

SLIDE 5

Inefficiency for TPU

  • TPU v2, 1 core, 22 TFLOPS
  • GPU V100, 15 TFLOPS
  • TPU profiling
  • Mask R-CNN
  • CNN/FC: 20% faster than GPU
  • Total: 75% slower than GPU
  • DeepLab
  • CNN/FC: 40% faster than GPU
  • ArgMax: 2x slower than GPU
  • CRF: 10x slower than GPU

Hybrid models: the non-GEMM operators dominate end-to-end time, so the TPU loses overall despite winning on the CNN/FC layers.

SLIDE 6

Inefficiency for GPU

  • Spatial integration
  • Explicit synchronization
  • Fixed-shape GEMM
  • Performance inefficiency
  • Tensor Core: efficiency < 60%
  • TPU: efficiency > 99%

Efficiency: Achieved FLOPS divided by the peak FLOPS
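To make the metric precise: for an M x N x K GEMM the achieved-FLOP count is the standard 2MNK (an assumption made explicit here; the slide only states the ratio):

```latex
\text{Efficiency} = \frac{\text{achieved FLOPS}}{\text{peak FLOPS}},
\qquad
\text{achieved FLOPS} = \frac{2\,M N K}{t_{\text{exec}}}
```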

SLIDE 7

Outline

  • Introduction
  • Simultaneous Multi-mode Architecture
  • Evaluation

SLIDE 8

GPU

  • GPU
  • Parallelism
  • Single Instruction Multiple Data
  • Massive threads
  • Warp active mask
  • Memory
  • Register file, vector access
  • Shared memory, scalar access
  • Communication
  • PE array with shared memory
  • Warp Shuffle Instruction (see the sketch below)

[Figure: GPU (SIMD) SM microarchitecture. Instruction cache, warp scheduler, dispatch unit, register file, operand collector, LD/ST units, SFU, CUDA cores (FP/INT), result queue, interconnect network, shared memory/cache, DRAM.]
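Since the slide lists the warp shuffle instruction as the GPU's register-to-register communication path, here is a minimal CUDA sketch of its conventional use for a warp-wide sum. The kernel and its names are illustrative, not from the talk; only `__shfl_down_sync` is a real intrinsic.

```cuda
#include <cstdio>

// Warp-wide sum reduction using shuffle: lanes exchange register values
// directly, with no trip through shared memory.
__global__ void warpReduceSum(const float* in, float* out) {
    float v = in[threadIdx.x];                       // one value per lane
    for (int offset = 16; offset > 0; offset >>= 1)  // tree reduction over 32 lanes
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if (threadIdx.x == 0) *out = v;                  // lane 0 ends up with the sum
}

int main() {
    float h_in[32], h_out, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;     // expected sum: 32
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warpReduceSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %g\n", h_out);                // prints 32
    return 0;
}
```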

SLIDE 9

TPU

  • TPU
  • Parallelism
  • 2D Systolic Array (MISD)
  • High concurrency
  • Active/Non-Active (one instruction, two states)
  • Memory
  • Weight, vector access (continuous)
  • Input/output, scalar access (discontinuous)
  • Communication
  • Interconnected PE array

SLIDE 10

Similarity

  • GPU
  • Parallelism
  • Single Instruction Multiple Data
  • Massive threads
  • Warp active mask
  • Memory
  • Register file, vector access
  • Shared memory, scalar access
  • Communication
  • PE array with shared memory
  • TPU
  • Parallelism
  • 2D Systolic Array (MISD)
  • High concurrency
  • Active/Non-Active (one instruction, two states)
  • Memory
  • Weight, vector access (continuous)
  • Input/output, scalar access (discontinuous)
  • Communication
  • Interconnected PE array

SLIDE 11

SMA Hardware Design

Simultaneous Multi-mode Architecture (SMA)

Challenges:
  1. Massive output scalar accesses
  2. Inter-PE communication
  3. How to control the systolic array
Each is addressed while keeping the design similar to the GPU.

SLIDE 12

Massive output scalar accesses

[Figure: semi-broadcasted weight-stationary dataflow vs. plain weight-stationary, mapped onto the register file and shared memory.]

  • Semi-broadcast
  • Weight: preloaded into the register file
  • Output: vector access, register file
  • Input: scalar access, shared memory

Scalar input accesses to shared memory can cause bank conflicts.

SLIDE 13

Partial Sum Communication

  • One-way wires
  • Horizontal neighbor PEs
  • Fast
  • latency-sensitive
  • Negligible overhead
  • Vertical PEs
  • Slow
  • Broadcast, Prefetch
  • latency-insensitive

Partial sums need low latency; the design provides it without reconfiguring the PE layout.

SLIDE 14

Instruction Control

  • A new instruction: LSMA (Load, Store, and Multiply-Accumulate)

  • A systolic controller

Per SM:
  • Input buffer: 8 x 2 x 4 bytes = 64 bytes
  • Output buffer: 24 x 2 x 4 bytes = 192 bytes
  • Total: 256 bytes
  • Area overhead < 0.1% (each SM has a 256KB register file and 128KB L1 cache/shared memory)
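A quick check of the overhead arithmetic (assuming the 0.1% figure is taken against the register file plus L1/shared memory alone):

```latex
8 \times 2 \times 4\,\mathrm{B} + 24 \times 2 \times 4\,\mathrm{B} = 64 + 192 = 256\,\mathrm{B},
\qquad
\frac{256\,\mathrm{B}}{(256 + 128)\,\mathrm{KB}} \approx 0.065\% < 0.1\%
```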

SLIDE 15

Software Design: Tiling GEMM

  • Based on CUTLASS 1.3
  • LSMA issued via PTX code
  • half-precision operands
  • Explicit synchronization
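The talk shows no source, so the following is only a minimal sketch of the CUTLASS-style shared-memory tiling the slide refers to. The tile size and kernel name are made up for illustration; `half` is the real CUDA type, and the inner FMA loop stands in where an SMA kernel would issue the proposed LSMA instruction.

```cuda
#include <cuda_fp16.h>

constexpr int TILE = 16;  // hypothetical tile size

// C[M,N] += A[M,K] * B[K,N], row-major; one TILE x TILE output tile per block.
// Launch with dim3 grid((N+TILE-1)/TILE, (M+TILE-1)/TILE), block(TILE, TILE).
__global__ void tiledGemmHalf(const half* A, const half* B, float* C,
                              int M, int N, int K) {
    __shared__ half As[TILE][TILE];  // staged tile of A (global -> shared)
    __shared__ half Bs[TILE][TILE];  // staged tile of B
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                // per-thread accumulator lives in registers
    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Cooperative load of the next K-slice; zero-pad out-of-range elements.
        As[threadIdx.y][threadIdx.x] = (row < M && k0 + (int)threadIdx.x < K)
            ? A[row * K + k0 + threadIdx.x] : __float2half(0.0f);
        Bs[threadIdx.y][threadIdx.x] = (k0 + (int)threadIdx.y < K && col < N)
            ? B[(k0 + threadIdx.y) * N + col] : __float2half(0.0f);
        __syncthreads();             // the explicit synchronization the slide lists
        // In SMA this inner product would be issued as LSMA; plain FMAs stand in.
        for (int k = 0; k < TILE; ++k)
            acc += __half2float(As[threadIdx.y][k]) * __half2float(Bs[k][threadIdx.x]);
        __syncthreads();             // protect the tiles before the next load
    }
    if (row < M && col < N) C[row * N + col] += acc;
}
```

Real CUTLASS kernels add more tile levels and software pipelining; the sketch keeps only the staging and synchronization structure the slide calls out.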

SLIDE 16

Outline

  • Introduction
  • Simultaneous Multi-mode Architecture
  • Evaluation

SLIDE 17

Evaluation

  • Methodology
  • GPGPU-Sim 4.0
  • GEMM based on CUTLASS

SLIDE 18

Iso-FLOP

  • Square GEMM
  • 2-SMA efficiency > 90%, 30% higher than 4-TC.
  • SMA with broadcast: 20%-40% higher efficiency than the non-broadcast variant.

SLIDE 19

Iso-Area

  • 5 networks
  • 3-SMA is 63% faster and consumes 23% less energy than 4-TC, on average.

SLIDE 20

End-to-end application (autopilot)

[Shih-Chieh Lin et al., ASPLOS'18]

[Figure: autopilot pipeline and iso-area GEMM speedup. DeepLab (CNN) and GO-TURN (CNN) map to the GEMM engine; ORB-SLAM (non-CNN) stays on the CUDA cores. GEMM speedup is compared across same-area GPU, Tensor Core, and SMA platforms.]

SLIDE 21

End-to-end application (autopilot)

[Euphrates, ISCA’18]

With N = 4, SMA reduces latency by 50% while retaining more flexibility; the non-GEMM work on the CUDA cores becomes the bottleneck.

SLIDE 22

Summary

  • Hardware
  • Parallelism similarity
  • Memory and communication
  • Systolic controller
  • Software
  • Tiling GEMM
  • Evaluation
  • Efficiency
  • Flexibility

SLIDE 23

Questions

Thank you!

SLIDE 24

Backup Slides

SLIDE 25

SMA Hardware Design

Simultaneous Multi-mode Architecture (SMA)

SLIDE 26

LSMA execution

  • 5-bit mask for the 4x4 array
  • Counter (number of input rows remaining)
  • Preload weight
  • Prefetch input (shared memory)
  • Execute
  • Store output (register file)

(A functional sketch of this walkthrough appears after the first frame below.)

[Frame, cycle 0: mask = 1; counter = 8; weight B (4x4) preloaded, input A (8x4), output C (8x4).]
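To make the frames concrete, here is a small host-side functional model of the walkthrough. It is a reconstruction under stated assumptions, not the paper's hardware: the slides' 5-bit mask also covers a prefetch stage, which this model folds into the feed step, so its occupancy peaks at 4 rather than 5.

```cuda
// Functional model of one LSMA execution on a 4x4 weight-stationary array:
// C[8][4] = A[8][4] * B[4][4]. One row of A enters per cycle, each in-flight
// row completes one of the K MAC stages per cycle, and finished rows retire
// to C. "mask bits" ~ occupied stages, "counter" ~ rows left to feed.
#include <cstdio>

constexpr int M = 8, K = 4, N = 4;

struct InFlight {
    int row;        // which row of A this slot carries
    int stage;      // how many MAC stages are done
    float acc[N];   // running partial sums for that row of C
};

int main() {
    float A[M][K], B[K][N], C[M][N] = {};
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k) A[i][k] = float(i + k);
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < N; ++j) B[k][j] = (k == j) ? 1.0f : 0.0f;  // identity weights

    InFlight pipe[K + 1] = {};
    int occupied = 0, counter = M;

    for (int cycle = 1; counter > 0 || occupied > 0; ++cycle) {
        // Execute: every in-flight row does one MAC stage; finished rows retire.
        for (int p = occupied - 1; p >= 0; --p) {
            InFlight& f = pipe[p];
            for (int j = 0; j < N; ++j)
                f.acc[j] += A[f.row][f.stage] * B[f.stage][j];
            if (++f.stage == K) {                         // store output (continuous)
                for (int j = 0; j < N; ++j) C[f.row][j] = f.acc[j];
                pipe[p] = pipe[--occupied];               // free the slot
            }
        }
        // Prefetch/feed the next input row (the discontinuous shared-memory read).
        if (counter > 0) { pipe[occupied++] = InFlight{M - counter, 0, {}}; --counter; }
        printf("cycle %2d: mask bits = %d, counter = %d\n", cycle, occupied, counter);
    }
    printf("C[7][3] = %g (expect A[7][3] = %g with identity B)\n", C[7][3], A[7][3]);
    return 0;
}
```

Run to completion it takes 12 cycles for the 8 input rows, matching the last frame of the walkthrough, and with identity weights C reproduces A.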

SLIDE 27

Cycle 1

[Frame: mask = 1; counter = 8; prefetch input from shared memory (discontinuous accesses).]

SLIDE 28

Cycle 2

[Frame: mask = 2; counter = 7; execute.]

SLIDE 29

Cycle 3

[Frame: mask = 3; counter = 6.]

SLIDE 30

Cycle 4

[Frame: mask = 4; counter = 5.]

SLIDE 31

Cycle 5

[Frame: mask = 5; counter = 4; store output to the register file (continuous accesses).]

SLIDE 32

Cycle 6

[Frame: mask = 5; counter = 3.]

SLIDE 33

Cycle 7

[Frame: mask = 5; counter = 2.]

SLIDE 34

Cycle 8

[Frame: mask = 5; counter = 1.]

SLIDE 35

Cycle 9

[Frame: mask = 4; counter = 0 (array draining).]

SLIDE 36

Cycle 10

[Frame: mask = 3; counter = 0.]

SLIDE 37

Cycle 11

[Frame: mask = 2; counter = 0.]

SLIDE 38

Cycle 12

[Frame: mask = 1; counter = 0.]

SLIDE 39

Software Design: Tiling GEMM