

SLIDE 1

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis

Stanford University ASPLOS – April 2017

SLIDE 2

Neural Networks (NNs)

 Unprecedented accuracy for challenging applications
 System perspective: compute and memory intensive

  • Many efforts to accelerate with specialized hardware


[Figure: application domains (classification, recognition, control, prediction, optimization) and hardware platforms (multi-cores, GPUs, FPGAs, ASICs, clusters)]

SLIDE 3


Neural Networks (NNs)


[Figure: ifmaps (Ni) * filters (Ni × No) = ofmaps (No)]

CONV:

foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)

FC (matrix multiply, O = W × I):

foreach b in batch Nb
  foreach neuron x in Nx
    foreach neuron y in Ny
      // Matrix multiply
      O(y,b) += I(x,b) * W(x,y) + B(y)
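The two loop nests can be written out concretely; the following is a minimal NumPy sketch (array shapes, function names, and the bias placement — applied once per ofmap rather than inside the inner loop as in the slide's shorthand — are illustrative, not from the slides):

```python
import numpy as np

def conv_layer(I, W, B):
    """CONV layer: O(v,b) accumulates 2D 'valid' convolutions over all ifmaps u.
    I: (Nb, Ni, H, H), W: (Ni, No, K, K), B: (No,)."""
    Nb, Ni, H, _ = I.shape
    _, No, K, _ = W.shape
    Ho = H - K + 1
    O = np.zeros((Nb, No, Ho, Ho))
    for b in range(Nb):                    # foreach b in batch Nb
        for v in range(No):                # foreach ofmap v in No
            for u in range(Ni):            # foreach ifmap u in Ni
                for y in range(Ho):        # 2D convolution, fully unrolled
                    for x in range(Ho):
                        O[b, v, y, x] += np.sum(I[b, u, y:y+K, x:x+K] * W[u, v])
            O[b, v] += B[v]                # bias applied once per ofmap
    return O

def fc_layer(I, W, B):
    """FC layer: matrix multiply O(y,b) = sum_x I(x,b) * W(x,y) + B(y).
    I: (Nx, Nb), W: (Nx, Ny), B: (Ny,)."""
    return W.T @ I + B[:, None]
```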

SLIDE 4

Domain-Specific NN Accelerators

 Spatial architectures of PEs

  • ~100x gains in performance and energy efficiency
  • Low-precision arithmetic, dynamic pruning, static compression, …



[Figure: spatial accelerator with a 4×4 PE array (each processing element: ALU + register file), a shared global buffer, and main memory]

SLIDE 5

Memory Challenges for Large NNs

 Large footprints and bandwidth requirements

  • Many and large layers, complex neuron structures
  • Efficient computing requires higher bandwidth

 Limit scalability for future NNs



  • Large on-chip buffers: area inefficiency
  • Multiple DRAM channels: energy inefficiency

SLIDE 6

Memory Challenges for Large NNs

 State-of-the-art NN accelerator with 400 PEs

  • 1.5 MB SRAM buffer → 70% of area
  • 4 LPDDR3 x32 chips → 45% of power in DRAM & SRAM


[Chart: power (W) and peak DRAM bandwidth (GBps) vs. number of PEs (100–400); power split into PE/reg dynamic, buffer dynamic, DRAM dynamic, and total static]

SLIDE 7

3D Memory + NN Acceleration

 Opportunities

  • High bandwidth at low access energy
  • Abundant parallelism (vaults, banks)

 Key questions

  • Hardware resource balance
  • Software scheduling and workload partitioning


[Figure: Micron’s Hybrid Memory Cube: DRAM dies stacked on a logic die, connected by TSVs and organized into vaults (channels) with multiple banks]

SLIDE 8

TETRIS

 NN acceleration with 3D memory

  • Improves performance scalability by 4.1x over 2D
  • Improves energy efficiency by 1.5x over 2D

 Hardware architecture

  • Rebalance resources between PEs and buffers
  • In-memory accumulation

 Software optimizations

  • Analytical dataflow scheduling for memory hierarchy
  • Hybrid partitioning for parallelism across vaults

(Slide callouts: high performance & low energy; alleviate bandwidth pressure; optimize buffer use; efficient parallel processing)

SLIDE 9

TETRIS Hardware Architecture

SLIDE 10

TETRIS Architecture

 Associate one NN engine with each vault

  • PE array, local register files, and a shared global buffer

 NoC + routers for accesses to remote vaults
 All vaults can process NN computations in parallel


[Figure: TETRIS vault engine on the logic die: PE array, global buffer, memory controller, and router; the router serves accesses to the local vault and to remote vaults]

SLIDE 11

Resource Balancing

 Larger PE arrays with smaller SRAM buffers

  • High memory bandwidth → more PEs
  • Low access energy + sequential pattern → smaller buffers


[Chart: normalized runtime and energy for PE/buffer splits ranging from 36 PEs / 467 kB to 224 PEs / 72 kB; energy split into PE dynamic, reg/buf dynamic, DRAM dynamic, and total static]

196 PEs with a 133 kB buffer (a 1:1 area split between PEs and buffer) gives better performance and energy

SLIDE 12

In-Memory Accumulation

 Move simple accumulation logic close to DRAM banks

  • 2x bandwidth reduction for output data
  • See paper for discussion of logic placement in DRAM


[Figure: baseline: the PE array reads Y from memory, computes Y += W * X, and writes Y back; with in-memory accumulation the PE array computes ΔY = W * X, and accumulation logic next to the DRAM banks performs Y += ΔY]
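The 2x bandwidth claim can be illustrated with a toy functional model (the DRAM-side adder is modelled as a plain Python addition; the functions and traffic counters are hypothetical, not the TETRIS hardware):

```python
import numpy as np

def accumulate_baseline(dram_Y, W, X):
    """Baseline: PE array reads Y, computes Y += W*X on-chip, writes Y back.
    Returns the new Y and the number of output words moved over the link."""
    Y = dram_Y.copy()          # DRAM -> PE array: read Y
    Y += W @ X                 # accumulate inside the PE array
    return Y, 2 * Y.size       # PE array -> DRAM: write Y (read + write = 2x)

def accumulate_in_memory(dram_Y, W, X):
    """In-memory accumulation: PE array sends only dY = W*X; simple adders
    beside the DRAM banks perform Y += dY, halving the output traffic."""
    dY = W @ X                 # PE array -> DRAM: write dY only
    return dram_Y + dY, dY.size
```

Both variants produce the same Y; the baseline moves twice as many output words.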

SLIDE 13

Scheduling and Partitioning for TETRIS

SLIDE 14

Dataflow Scheduling

 Critical for maximizing on-chip data reuse to save energy


Mapping: execute 2D conv on the PE array

foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)

  • Regfiles and array interconnect
  • Row stationary [Chen et al., ISCA’16]

Ordering: loop blocking and reordering

  • Locality in global buffer
  • Non-convex, exhaustive search
SLIDE 15

TETRIS Bypass Ordering

 Limited reuse opportunities with small buffers
 IW bypass, OW bypass, IO bypass

  • Use the buffer for only one stream, for maximum benefit
  • Bypass the buffer for the other two, sacrificing their reuse

OW bypass ordering:

[Figure: ifmaps, ofmaps, and filters move between off-chip DRAM, the on-chip global buffer, and the register files; ifmaps are staged in the global buffer in chunks, while ofmaps and filters bypass it]

  • 1. Read 1 ifmap chunk into gbuf
  • 2. Stream ofmaps and filters to regf
  • 3. Move ifmaps from gbuf to regf
  • 4. Convolve
  • 5. Jump to 2

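The five steps above can be sketched as a loop nest (a schematic Python sketch; `ifmap_chunks`, `tiles`, and `convolve` are illustrative stand-ins, not TETRIS code):

```python
def ow_bypass_schedule(ifmap_chunks, tiles, convolve):
    """Sketch of the OW bypass ordering: only ifmaps are staged in the global
    buffer (one chunk at a time); ofmap and filter tiles stream directly to
    the register files, bypassing the buffer."""
    outputs = []
    for chunk in ifmap_chunks:                 # 1. read one ifmap chunk into gbuf
        gbuf = list(chunk)
        for ofmap_tile, filter_tile in tiles:  # 2. stream ofmaps and filters to regf
            for ifmap in gbuf:                 # 3. move ifmaps from gbuf to regf
                outputs.append(convolve(ifmap, filter_tile, ofmap_tile))  # 4. convolve
            # 5. jump back to step 2 for the next ofmap/filter tile
    return outputs
```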

SLIDE 16

TETRIS Bypass Ordering

 Analytically derived

  • Closed-form solution
  • No need for exhaustive search

 Near-optimal schedules

  • Within 2% of schedules derived with exhaustive search


min B_DRAM = 2 · Ob·Oo·To · ui + Ob·Oi·Ti + Oo·Oi·Tw · ub
s.t. (Ob/ub) · (Oi/ui) · Ti ≤ Tbuf,  1 ≤ ub ≤ Ob,  1 ≤ ui ≤ Oi
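To make the cost model concrete, here is an illustrative brute-force search over the blocking factors (ub, ui), mirroring the objective and constraint above term by term. TETRIS replaces this search with the closed-form solution; this sketch exists only to show what is being minimized, and the test sizes are made up:

```python
def bdram(ub, ui, Ob, Oo, Oi, To, Ti, Tw):
    """DRAM traffic under bypass ordering:
    B_DRAM = 2*Ob*Oo*To*ui + Ob*Oi*Ti + Oo*Oi*Tw*ub."""
    return 2 * Ob * Oo * To * ui + Ob * Oi * Ti + Oo * Oi * Tw * ub

def search_blocking(Ob, Oi, Oo, To, Ti, Tw, Tbuf):
    """Exhaustive search over integer blocking factors (ub, ui), honoring the
    buffer-capacity constraint (Ob/ub) * (Oi/ui) * Ti <= Tbuf.
    Returns (cost, ub, ui) for the cheapest feasible point, or None."""
    feasible = [(bdram(ub, ui, Ob, Oo, Oi, To, Ti, Tw), ub, ui)
                for ub in range(1, Ob + 1)
                for ui in range(1, Oi + 1)
                if (Ob / ub) * (Oi / ui) * Ti <= Tbuf]
    return min(feasible) if feasible else None
```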

NN       Runtime gap (w.r.t. optimal)   Energy gap (w.r.t. optimal)
AlexNet  1.48 %                         1.86 %
ZFNet    1.55 %                         1.83 %
VGG16    0.16 %                         0.20 %
VGG19    0.13 %                         0.16 %
ResNet   2.91 %                         0.78 %

SLIDE 17

NN Partitioning

 Option 1: fmap partitioning

  • Divide a fmap into tiles
  • Each vault processes one tile
  • Minimum remote accesses


[Figure: fmap partitioning: layers i and i+1 divided into tiles, one tile per vault (vaults 0–3)]

 Process NN computations in parallel in all vaults

SLIDE 18

NN Partitioning

 Option 2: output partitioning

  • Partition all ofmaps into groups
  • Each vault processes one group
  • Better filter weight reuse
  • Fewer total memory accesses


[Figure: output partitioning: the ofmaps of layers i and i+1 divided into groups, one group per vault (vaults 0–3)]

 Process NN computations in parallel in all vaults
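The trade-off between the two options can be made concrete with a toy access-count model. The formulas below are illustrative assumptions, not the paper's cost model: fmap partitioning keeps fmap accesses local but makes every vault fetch the full filter set, while output partitioning reads filters once but pulls most ifmaps from remote vaults:

```python
def fmap_partitioning(num_vaults, fmap_words, filter_words):
    """Toy model: each vault works on a local fmap tile, so there are no
    remote accesses, but the filter reads are replicated per vault.
    Returns (total DRAM accesses, remote accesses)."""
    total_dram = fmap_words + num_vaults * filter_words
    remote = 0
    return total_dram, remote

def output_partitioning(num_vaults, fmap_words, filter_words):
    """Toy model: filters are split across vaults (read once in total), but
    each vault needs all ifmaps, most of which live in other vaults."""
    total_dram = fmap_words + filter_words
    remote = fmap_words * (num_vaults - 1) // num_vaults
    return total_dram, remote
```

Under this model a filter-heavy layer favors output partitioning on total DRAM accesses, at the cost of remote (NoC) traffic, which is exactly the balance hybrid partitioning targets.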

SLIDE 19

TETRIS Hybrid Partitioning

 Combine fmap partitioning and output partitioning

  • Balance between minimizing remote accesses and total DRAM accesses
  • Total energy = NoC energy + DRAM energy

 Difficulties

  • Design space is exponential in the number of layers
     A greedy algorithm reduces it to linear in the number of layers
  • Complex dataflow scheduling is needed to determine total DRAM accesses
     Bypass ordering quickly estimates total DRAM accesses
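The greedy step can be sketched as follows (illustrative only: `options` and `cost` stand in for the per-layer partitioning choices and the bypass-ordering energy estimate; the paper's algorithm has more detail):

```python
def greedy_partition(layers, options, cost):
    """Pick, for each layer independently, the partitioning option with the
    lowest estimated energy (NoC + DRAM). A single pass makes the search
    linear in the number of layers instead of exploring the exponential
    joint design space."""
    plan = []
    total = 0.0
    for layer in layers:
        best = min(options, key=lambda opt: cost(layer, opt))
        plan.append((layer, best))
        total += cost(layer, best)
    return plan, total
```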


SLIDE 20

TETRIS Evaluation

SLIDE 21

Methodology

 State-of-the-art NNs

  • AlexNet, ZFNet, VGG16, VGG19, ResNet
  • 100–300 MB total memory footprint for each NN
  • Up to 152 layers in ResNet

 2D and 3D accelerators with one or more NN engines

  • 2D engine: 16 × 16 PEs, 576 kB buffer, 1 LPDDR3 channel
    (8.5 mm², 51.2 Gops/sec; bandwidth-constrained)
  • 3D engine: 14 × 14 PEs, 133 kB buffer, 1 HMC vault
    (3.5 mm², 39.2 Gops/sec; area-constrained)


SLIDE 22

Single-engine Comparison

 Up to 37% performance improvement with TETRIS

  • Due to higher bandwidth despite smaller PE array


[Chart: normalized runtime, 2D vs. 3D single engine, for AlexNet, ZFNet, VGG16, VGG19, ResNet; larger NNs benefit more]

SLIDE 23

 35–40% energy reduction with TETRIS

  • Smaller on-chip buffer, better scheduling

[Chart: normalized energy, 2D vs. 3D single engine, for AlexNet, ZFNet, VGG16, VGG19, ResNet; split into total static, NoC dynamic, DRAM dynamic, reg/buf dynamic, and PE dynamic]


Savings come from SRAM & DRAM dynamic energy, and from static energy


SLIDE 24

Multi-engine Comparison

 4 2D engines: 34 mm², pin-constrained (4 LPDDR3 channels)
 16 3D engines: 56 mm², area-constrained (16 HMC vaults)
 4.1x performance gain
 2x compute density


[Chart: normalized runtime, 2D-4 vs. 3D-16, for AlexNet, ZFNet, VGG16, VGG19, ResNet]

SLIDE 25

Multi-Engine Comparison

 1.5x lower energy

  • 1.2x from better scheduling and partitioning

 4x the computation at only 2.7x the power


[Chart: normalized energy, 2D-4 vs. 3D-16, for AlexNet, ZFNet, VGG16, VGG19, ResNet; split into total static, NoC dynamic, DRAM dynamic, reg/buf dynamic, and PE dynamic]

SLIDE 26

TETRIS Summary

 A scalable and efficient NN accelerator using 3D memory

  • 4.1x performance and 1.5x energy benefits over 2D baseline

 Hardware features

  • PE/buffer area rebalancing
  • In-memory accumulation

 Software features

  • Analytical dataflow scheduling
  • Hybrid partitioning

 Scheduling exploration tool

  • https://github.com/stanford-mast/nn_dataflow


SLIDE 27

Thanks!

Questions?