

SLIDE 1

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis

Stanford University ASPLOS – April 2017

SLIDE 2

Neural Networks (NNs)

 Unprecedented accuracy for challenging applications
 System perspective: compute and memory intensive

  • Many efforts to accelerate with specialized hardware


[Figure: application domains (classification, recognition, control, prediction, optimization) and hardware platforms (multi-cores, GPUs, FPGAs, ASICs, clusters)]

SLIDE 3


Neural Networks (NNs)


[Figure: ifmaps (Ni) * filters (Ni × No) = ofmaps (No)]

CONV:

foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)

FC (matrix multiply, O = W × I):

foreach b in batch Nb
  foreach neuron x in Nx
    foreach neuron y in Ny
      // Matrix multiply
      O(y,b) += I(x,b) * W(x,y) + B(y)
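The two loop nests can be written out concretely; the following is a minimal NumPy sketch (array shapes, function names, and the bias placement — applied once per ofmap rather than inside the inner loop as in the slide's shorthand — are illustrative, not from the slides):

```python
import numpy as np

def conv_layer(I, W, B):
    """CONV layer: O(v,b) accumulates 2D 'valid' convolutions over all ifmaps u.
    I: (Nb, Ni, H, H), W: (Ni, No, K, K), B: (No,)."""
    Nb, Ni, H, _ = I.shape
    _, No, K, _ = W.shape
    Ho = H - K + 1
    O = np.zeros((Nb, No, Ho, Ho))
    for b in range(Nb):                    # foreach b in batch Nb
        for v in range(No):                # foreach ofmap v in No
            for u in range(Ni):            # foreach ifmap u in Ni
                for y in range(Ho):        # 2D convolution, fully unrolled
                    for x in range(Ho):
                        O[b, v, y, x] += np.sum(I[b, u, y:y+K, x:x+K] * W[u, v])
            O[b, v] += B[v]                # bias applied once per ofmap
    return O

def fc_layer(I, W, B):
    """FC layer: matrix multiply O(y,b) = sum_x I(x,b) * W(x,y) + B(y).
    I: (Nx, Nb), W: (Nx, Ny), B: (Ny,)."""
    return W.T @ I + B[:, None]
```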

SLIDE 4

Domain-Specific NN Accelerators

 Spatial architectures of PEs

  • ~100x gains in performance and energy efficiency
  • Low-precision arithmetic, dynamic pruning, static compression, …



[Figure: spatial accelerator with a 4×4 PE array (each processing element: ALU + register file), a shared global buffer, and main memory]

SLIDE 5

Memory Challenges for Large NNs

 Large footprints and bandwidth requirements

  • Many and large layers, complex neuron structures
  • Efficient computing requires higher bandwidth

 Limit scalability for future NNs



  • Large on-chip buffers: area inefficiency
  • Multiple DRAM channels: energy inefficiency

SLIDE 6

Memory Challenges for Large NNs

 State-of-the-art NN accelerator with 400 PEs

  • 1.5 MB SRAM buffer → 70% of area
  • 4 LPDDR3 x32 chips → 45% of power in DRAM & SRAM


[Chart: power (W) and peak DRAM bandwidth (GBps) vs. number of PEs (100–400); power split into PE/reg dynamic, buffer dynamic, DRAM dynamic, and total static]

SLIDE 7

3D Memory + NN Acceleration

 Opportunities

  • High bandwidth at low access energy
  • Abundant parallelism (vaults, banks)

 Key questions

  • Hardware resource balance
  • Software scheduling and workload partitioning


[Figure: Micron’s Hybrid Memory Cube: DRAM dies stacked on a logic die, connected by TSVs and organized into vaults (channels) with multiple banks]

SLIDE 8

TETRIS

 NN acceleration with 3D memory

  • Improves performance scalability by 4.1x over 2D
  • Improves energy efficiency by 1.5x over 2D

 Hardware architecture

  • Rebalance resources between PEs and buffers
  • In-memory accumulation

 Software optimizations

  • Analytical dataflow scheduling for memory hierarchy
  • Hybrid partitioning for parallelism across vaults

(Slide callouts: high performance & low energy; alleviate bandwidth pressure; optimize buffer use; efficient parallel processing)

SLIDE 9

TETRIS Hardware Architecture

SLIDE 10

TETRIS Architecture

 Associate one NN engine with each vault

  • PE array, local register files, and a shared global buffer

 NoC + routers for accesses to remote vaults
 All vaults can process NN computations in parallel


[Figure: TETRIS vault engine on the logic die: PE array, global buffer, memory controller, and router; the router serves accesses to the local vault and to remote vaults]

SLIDE 11

Resource Balancing

 Larger PE arrays with smaller SRAM buffers

  • High memory bandwidth → more PEs
  • Low access energy + sequential pattern → smaller buffers


[Chart: normalized runtime and energy for PE/buffer splits ranging from 36 PEs / 467 kB to 224 PEs / 72 kB; energy split into PE dynamic, reg/buf dynamic, DRAM dynamic, and total static]

196 PEs with a 133 kB buffer (a 1:1 area split between PEs and buffer) gives better performance and energy

SLIDE 12

In-Memory Accumulation

 Move simple accumulation logic close to DRAM banks

  • 2x bandwidth reduction for output data
  • See paper for discussion of logic placement in DRAM


[Figure: baseline: the PE array reads Y from memory, computes Y += W * X, and writes Y back; with in-memory accumulation the PE array computes ΔY = W * X, and accumulation logic next to the DRAM banks performs Y += ΔY]
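The 2x bandwidth claim can be illustrated with a toy functional model (the DRAM-side adder is modelled as a plain Python addition; the functions and traffic counters are hypothetical, not the TETRIS hardware):

```python
import numpy as np

def accumulate_baseline(dram_Y, W, X):
    """Baseline: PE array reads Y, computes Y += W*X on-chip, writes Y back.
    Returns the new Y and the number of output words moved over the link."""
    Y = dram_Y.copy()          # DRAM -> PE array: read Y
    Y += W @ X                 # accumulate inside the PE array
    return Y, 2 * Y.size       # PE array -> DRAM: write Y (read + write = 2x)

def accumulate_in_memory(dram_Y, W, X):
    """In-memory accumulation: PE array sends only dY = W*X; simple adders
    beside the DRAM banks perform Y += dY, halving the output traffic."""
    dY = W @ X                 # PE array -> DRAM: write dY only
    return dram_Y + dY, dY.size
```

Both variants produce the same Y; the baseline moves twice as many output words.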

SLIDE 13

Scheduling and Partitioning for TETRIS

SLIDE 14

Dataflow Scheduling

 Critical for maximizing on-chip data reuse to save energy


Mapping: execute 2D conv on the PE array

foreach b in batch Nb
  foreach ifmap u in Ni
    foreach ofmap v in No
      // 2D conv
      O(v,b) += I(u,b) * W(u,v) + B(v)

  • Regfiles and array interconnect
  • Row stationary [Chen et al., ISCA’16]

Ordering: loop blocking and reordering

  • Locality in global buffer
  • Non-convex, exhaustive search
SLIDE 15

TETRIS Bypass Ordering

 Limited reuse opportunities with small buffers
 IW bypass, OW bypass, IO bypass

  • Use the buffer for only one stream, for maximum benefit
  • Bypass the buffer for the other two, sacrificing their reuse

OW bypass ordering:

[Figure: ifmaps, ofmaps, and filters move between off-chip DRAM, the on-chip global buffer, and the register files; ifmaps are staged in the global buffer in chunks, while ofmaps and filters bypass it]

  • 1. Read 1 ifmap chunk into gbuf
  • 2. Stream ofmaps and filters to regf
  • 3. Move ifmaps from gbuf to regf
  • 4. Convolve
  • 5. Jump to 2

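The five steps above can be sketched as a loop nest (a schematic Python sketch; `ifmap_chunks`, `tiles`, and `convolve` are illustrative stand-ins, not TETRIS code):

```python
def ow_bypass_schedule(ifmap_chunks, tiles, convolve):
    """Sketch of the OW bypass ordering: only ifmaps are staged in the global
    buffer (one chunk at a time); ofmap and filter tiles stream directly to
    the register files, bypassing the buffer."""
    outputs = []
    for chunk in ifmap_chunks:                 # 1. read one ifmap chunk into gbuf
        gbuf = list(chunk)
        for ofmap_tile, filter_tile in tiles:  # 2. stream ofmaps and filters to regf
            for ifmap in gbuf:                 # 3. move ifmaps from gbuf to regf
                outputs.append(convolve(ifmap, filter_tile, ofmap_tile))  # 4. convolve
            # 5. jump back to step 2 for the next ofmap/filter tile
    return outputs
```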

SLIDE 16

TETRIS Bypass Ordering

 Analytically derived

  • Closed-form solution
  • No need for exhaustive search

 Near-optimal schedules

  • Within 2% of schedules derived with exhaustive search


min B_DRAM = 2 · Ob·Oo·To · ui + Ob·Oi·Ti + Oo·Oi·Tw · ub
s.t. (Ob/ub) · (Oi/ui) · Ti ≤ Tbuf,  1 ≤ ub ≤ Ob,  1 ≤ ui ≤ Oi
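To make the cost model concrete, here is an illustrative brute-force search over the blocking factors (ub, ui), mirroring the objective and constraint above term by term. TETRIS replaces this search with the closed-form solution; this sketch exists only to show what is being minimized, and the test sizes are made up:

```python
def bdram(ub, ui, Ob, Oo, Oi, To, Ti, Tw):
    """DRAM traffic under bypass ordering:
    B_DRAM = 2*Ob*Oo*To*ui + Ob*Oi*Ti + Oo*Oi*Tw*ub."""
    return 2 * Ob * Oo * To * ui + Ob * Oi * Ti + Oo * Oi * Tw * ub

def search_blocking(Ob, Oi, Oo, To, Ti, Tw, Tbuf):
    """Exhaustive search over integer blocking factors (ub, ui), honoring the
    buffer-capacity constraint (Ob/ub) * (Oi/ui) * Ti <= Tbuf.
    Returns (cost, ub, ui) for the cheapest feasible point, or None."""
    feasible = [(bdram(ub, ui, Ob, Oo, Oi, To, Ti, Tw), ub, ui)
                for ub in range(1, Ob + 1)
                for ui in range(1, Oi + 1)
                if (Ob / ub) * (Oi / ui) * Ti <= Tbuf]
    return min(feasible) if feasible else None
```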

NN       Runtime gap (w.r.t. optimal)   Energy gap (w.r.t. optimal)
AlexNet  1.48 %                         1.86 %
ZFNet    1.55 %                         1.83 %
VGG16    0.16 %                         0.20 %
VGG19    0.13 %                         0.16 %
ResNet   2.91 %                         0.78 %

SLIDE 17

NN Partitioning

 Option 1: fmap partitioning

  • Divide a fmap into tiles
  • Each vault processes one tile
  • Minimum remote accesses


[Figure: fmap partitioning: layers i and i+1 divided into tiles, one tile per vault (vaults 0–3)]

 Process NN computations in parallel in all vaults

SLIDE 18

NN Partitioning

 Option 2: output partitioning

  • Partition all ofmaps into groups
  • Each vault processes one group
  • Better filter weight reuse
  • Fewer total memory accesses


[Figure: output partitioning: the ofmaps of layers i and i+1 divided into groups, one group per vault (vaults 0–3)]

 Process NN computations in parallel in all vaults
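The trade-off between the two options can be made concrete with a toy access-count model. The formulas below are illustrative assumptions, not the paper's cost model: fmap partitioning keeps fmap accesses local but makes every vault fetch the full filter set, while output partitioning reads filters once but pulls most ifmaps from remote vaults:

```python
def fmap_partitioning(num_vaults, fmap_words, filter_words):
    """Toy model: each vault works on a local fmap tile, so there are no
    remote accesses, but the filter reads are replicated per vault.
    Returns (total DRAM accesses, remote accesses)."""
    total_dram = fmap_words + num_vaults * filter_words
    remote = 0
    return total_dram, remote

def output_partitioning(num_vaults, fmap_words, filter_words):
    """Toy model: filters are split across vaults (read once in total), but
    each vault needs all ifmaps, most of which live in other vaults."""
    total_dram = fmap_words + filter_words
    remote = fmap_words * (num_vaults - 1) // num_vaults
    return total_dram, remote
```

Under this model a filter-heavy layer favors output partitioning on total DRAM accesses, at the cost of remote (NoC) traffic, which is exactly the balance hybrid partitioning targets.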

SLIDE 19

TETRIS Hybrid Partitioning

 Combine fmap partitioning and output partitioning

  • Balance between minimizing remote accesses and total DRAM accesses
  • Total energy = NoC energy + DRAM energy

 Difficulties

  • Design space is exponential in the number of layers
     A greedy algorithm reduces it to linear in the number of layers
  • Complex dataflow scheduling is needed to determine total DRAM accesses
     Bypass ordering quickly estimates total DRAM accesses
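The greedy step can be sketched as follows (illustrative only: `options` and `cost` stand in for the per-layer partitioning choices and the bypass-ordering energy estimate; the paper's algorithm has more detail):

```python
def greedy_partition(layers, options, cost):
    """Pick, for each layer independently, the partitioning option with the
    lowest estimated energy (NoC + DRAM). A single pass makes the search
    linear in the number of layers instead of exploring the exponential
    joint design space."""
    plan = []
    total = 0.0
    for layer in layers:
        best = min(options, key=lambda opt: cost(layer, opt))
        plan.append((layer, best))
        total += cost(layer, best)
    return plan, total
```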


SLIDE 20

TETRIS Evaluation

SLIDE 21

Methodology

 State-of-the-art NNs

  • AlexNet, ZFNet, VGG16, VGG19, ResNet
  • 100–300 MB total memory footprint for each NN
  • Up to 152 layers in ResNet

 2D and 3D accelerators with one or more NN engines

  • 2D engine: 16 × 16 PEs, 576 kB buffer, 1 LPDDR3 channel
    (8.5 mm², 51.2 Gops/sec; bandwidth-constrained)
  • 3D engine: 14 × 14 PEs, 133 kB buffer, 1 HMC vault
    (3.5 mm², 39.2 Gops/sec; area-constrained)


SLIDE 22

Single-engine Comparison

 Up to 37% performance improvement with TETRIS

  • Due to higher bandwidth despite smaller PE array


[Chart: normalized runtime, 2D vs. 3D single engine, for AlexNet, ZFNet, VGG16, VGG19, ResNet; larger NNs benefit more]

SLIDE 23

 35–40% energy reduction with TETRIS

  • Smaller on-chip buffer, better scheduling

[Chart: normalized energy, 2D vs. 3D single engine, for AlexNet, ZFNet, VGG16, VGG19, ResNet; split into total static, NoC dynamic, DRAM dynamic, reg/buf dynamic, and PE dynamic]


Savings come from SRAM & DRAM dynamic energy, and from static energy


SLIDE 24

Multi-engine Comparison

 4 2D engines: 34 mm², pin-constrained (4 LPDDR3 channels)
 16 3D engines: 56 mm², area-constrained (16 HMC vaults)
 4.1x performance gain
 2x compute density


[Chart: normalized runtime, 2D-4 vs. 3D-16, for AlexNet, ZFNet, VGG16, VGG19, ResNet]

SLIDE 25

Multi-Engine Comparison

 1.5x lower energy

  • 1.2x from better scheduling and partitioning

 4x the computation at only 2.7x the power


[Chart: normalized energy, 2D-4 vs. 3D-16, for AlexNet, ZFNet, VGG16, VGG19, ResNet; split into total static, NoC dynamic, DRAM dynamic, reg/buf dynamic, and PE dynamic]

SLIDE 26

TETRIS Summary

 A scalable and efficient NN accelerator using 3D memory

  • 4.1x performance and 1.5x energy benefits over 2D baseline

 Hardware features

  • PE/buffer area rebalancing
  • In-memory accumulation

 Software features

  • Analytical dataflow scheduling
  • Hybrid partitioning

 Scheduling exploration tool

  • https://github.com/stanford-mast/nn_dataflow


SLIDE 27

Thanks!

Questions?