

  1. AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming. Mark Hildebrand¹, Jawad Khan², Sanjeev Trika², Jason Lowe-Power¹, Venkatesh Akella¹. ¹University of California, Davis; ²Intel Corporation. https://github.com/darchr/AutoTM March 12, 2020 1/29

  2. Executive Summary. Problem: automatic two-level memory management for deep neural networks. Idea: • Profile-guided optimization. • Model as an Integer Linear Program (ILP). Results: • Replace 50-80% of DRAM with NVDIMMs at a geometric-mean performance loss of 27.1%. • 3x better performance than a real hardware-managed cache. 2/29

  3. Outline Background AutoTM Profiling ILP Modeling Results Wrap Up 3/29

  4. Why Deep Neural Networks 4/29

  5. Why Deep Neural Networks Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/ 4/29

  6. Why Deep Neural Networks Can we use multiple levels of memory to train large models on a single machine? Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/ 4/29

  7. Heterogeneous Memory Systems NVDIMM Style • Two types of memory. • Same memory controller. • Both are byte addressable. • NVDIMMs for high capacity and low cost 5/29

  8. Heterogeneous Memory Systems, NVDIMM Style. • Two types of memory. • Same memory controller. • Both are byte addressable. • NVDIMMs for high capacity and low cost. Challenges: • Keeping all tensors in NVDIMM memory is too slow. • Using DRAM as a cache for the NVDIMMs is also too slow. • Intelligent memory management is required. 5/29

  9. Outline Background AutoTM Profiling ILP Modeling Results Wrap Up 6/29

  10. AutoTM Goal Minimize execution time • Arbitrary computation graph • Size constraint on fast memory 7/29

  11. AutoTM Goal Minimize execution time • Arbitrary computation graph • Size constraint on fast memory How • Place tensors in fast or slow memory. • Optimal tensor movement 7/29

  12. AutoTM. Goal: minimize execution time. • Arbitrary computation graph. • Size constraint on fast memory. How: • Place tensors in fast or slow memory. • Move tensors optimally. Strategy: • Profile kernel performance. • Model tensor assignment as an ILP. 7/29

  13. Kernel Profiling. Profile the performance of each kernel for all IO tensor locations.

  Table: Profile space for kernel K2 (IO tensors T1, T2, T3).

      T1    T2    T3
      DRAM  DRAM  DRAM
      DRAM  DRAM  PMM
      DRAM  PMM   DRAM
      DRAM  PMM   PMM
      PMM   DRAM  DRAM
      PMM   DRAM  PMM
      PMM   PMM   DRAM
      PMM   PMM   PMM

  8/29

  14. Kernel Profiling. Profile the performance of each kernel for all IO tensor locations. Table: profile space for kernel K2 (the same eight DRAM/PMM combinations as slide 13). [Bar chart: performance of K2 relative to all IO in DRAM, for each combination of Data In, Weight, and Data Out locations (DRAM or PMM).] 8/29
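The profile space above grows as 2^n in the number of IO tensors. A minimal sketch of enumerating it (the function name is mine, not from AutoTM):

```python
from itertools import product

def profile_space(num_io_tensors):
    """Enumerate every DRAM/PMM placement of a kernel's IO tensors."""
    return list(product(("DRAM", "PMM"), repeat=num_io_tensors))

# Kernel K2 has three IO tensors (T1, T2, T3): 2^3 = 8 configurations,
# matching the eight rows of the table above.
configs = profile_space(3)
print(len(configs))   # 8
print(configs[0])     # ('DRAM', 'DRAM', 'DRAM')
```

In practice each configuration is timed once, and those measured run times become the kernel-runtime coefficients fed to the ILP.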

  15. Tensor Lifetime Flow Network. The path of flow through the graph describes a tensor's memory location throughout its lifetime. 9/29

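The flow-network idea can be sketched as a shortest-path dynamic program over per-step (kernel, location) states. This toy version is my simplification with made-up costs, not the paper's full network:

```python
def best_placement(read_penalty, move_cost):
    """Cheapest per-step location for one tensor over its lifetime.

    read_penalty: one dict {"DRAM": cost, "PMM": cost} per kernel step.
    move_cost: cost of moving the tensor between DRAM and PMM.
    Returns (total cost, location chosen at each step).
    """
    locs = ("DRAM", "PMM")
    cost = {l: read_penalty[0][l] for l in locs}   # cheapest path ending at l
    back = []                                      # predecessor choice per step
    for step in range(1, len(read_penalty)):
        new_cost, choices = {}, {}
        for l in locs:
            prev = min(locs, key=lambda p: cost[p] + (0 if p == l else move_cost))
            new_cost[l] = cost[prev] + (0 if prev == l else move_cost) + read_penalty[step][l]
            choices[l] = prev
        cost = new_cost
        back.append(choices)
    end = min(locs, key=cost.get)                  # cheapest final location
    path = [end]
    for choices in reversed(back):                 # walk predecessors backwards
        path.append(choices[path[-1]])
    return cost[end], path[::-1]

# Three kernel steps where DRAM reads are fast and a move costs 3:
penalties = [{"DRAM": 1, "PMM": 4}] * 3
print(best_placement(penalties, move_cost=3))  # the tensor stays in DRAM
```

Each choice of path through the states corresponds to one residency-and-movement schedule for the tensor, which is exactly what the ILP's flow variables encode.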

  22. ILP Modeling 16/29

  23. ILP Modeling. Objective Function (computation time): min Σ_{k∈K} ρ_k + Σ_{t∈T} M_t. 16/29

  24. ILP Modeling. Objective Function: min Σ_{k∈K} ρ_k + Σ_{t∈T} M_t. The first sum is kernel execution time. K: set of kernels; ρ_k: run time of kernel k. 17/29

  25. ILP Modeling. Objective Function: min Σ_{k∈K} ρ_k + Σ_{t∈T} M_t. The first sum is kernel execution time. K: set of kernels; ρ_k: run time of kernel k. Example: run time of kernel k2. 17/29

  26. ILP Modeling. Objective Function: min Σ_{k∈K} ρ_k + Σ_{t∈T} M_t. The first sum is kernel execution time; the second is tensor movement time. T: set of tensors; M_t: time spent moving tensor t. 18/29

  27. ILP Modeling. Objective Function: min Σ_{k∈K} ρ_k + Σ_{t∈T} M_t (kernel execution time plus tensor movement time). Example: time moving tensor t1. 18/29

  28. ILP Modeling. Objective Function: min Σ_{k∈K} ρ_k + Σ_{t∈T} M_t (kernel execution time plus tensor movement time). Constraint: limit DRAM use at each kernel, Σ_{t∈L(k)} |t| · I_{t,k}^DRAM ≤ Limit, ∀k, where L(k) is the set of tensors live at kernel k and I_{t,k}^DRAM indicates that tensor t resides in DRAM at kernel k. 19/29
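For intuition, here is a brute-force stand-in for the ILP on this slide: pick a DRAM/PMM home per tensor to minimize kernel time plus movement time subject to the DRAM limit. The real system solves this with JuMP and Gurobi; all sizes and times below are invented:

```python
from itertools import product

# Illustrative inputs, not profiled numbers.
tensors = {"t1": 4, "t2": 2, "t3": 3}            # tensor sizes |t| in GB
kernel_time = {"DRAM": 1.0, "PMM": 2.5}          # per-tensor access cost (rho contribution)
move_time = {"t1": 0.5, "t2": 0.2, "t3": 0.3}    # M_t if the tensor is staged into DRAM

def solve(limit):
    """Exhaustively search assignments; the ILP explores the same space smartly."""
    best = None
    for homes in product(("DRAM", "PMM"), repeat=len(tensors)):
        assign = dict(zip(tensors, homes))
        dram_used = sum(sz for t, sz in tensors.items() if assign[t] == "DRAM")
        if dram_used > limit:                    # DRAM capacity constraint
            continue
        cost = sum(kernel_time[assign[t]] for t in tensors)          # kernel time
        cost += sum(move_time[t] for t in tensors if assign[t] == "DRAM")  # movement
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best

# With a 5 GB limit, only t2 and t3 fit in DRAM together; t1 stays in PMM.
print(solve(limit=5))
```

Brute force is exponential in the number of tensors, which is exactly why the paper formulates this as an ILP and hands it to an industrial solver.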

  29. Variations of AutoTM.

      Name          Description                                  PMM System   GPU System
      Static        Tensors can't move                           ✓            ✗
      Synchronous   Tensors move but block computation           ✓            ✓
      Asynchronous  Tensor movement concurrent with computation  ●            ✓

  20/29
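A rough sketch of why the asynchronous variant can win: synchronous movement blocks kernels, so its runtime is compute plus movement, while asynchronous movement can hide behind computation. This is an idealized model with invented numbers, assuming perfect overlap:

```python
def synchronous_runtime(kernel_times, move_times):
    """Movement blocks computation: times simply add up."""
    return sum(kernel_times) + sum(move_times)

def asynchronous_runtime(kernel_times, move_times):
    """Ideal overlap: movement hides behind compute, only the excess shows."""
    return max(sum(kernel_times), sum(move_times))

kernels = [3.0, 2.0, 4.0]   # seconds per kernel
moves = [1.0, 2.0]          # seconds per tensor transfer
print(synchronous_runtime(kernels, moves))   # 12.0
print(asynchronous_runtime(kernels, moves))  # 9.0
```

Real overlap is constrained by dependencies and bandwidth contention, which is why the asynchronous variant only partially applies on the PMM system (the ● in the table).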

  30. Outline Background AutoTM Profiling ILP Modeling Results Wrap Up 21/29

  31. Experiments! Software: • Modified the ngraph¹ compiler. • Julia's JuMP² package for ILP modeling. • Gurobi³ as the ILP solver. Hardware: • 1.5 TB Optane™ DC PMM. • 384 GiB DRAM. ¹https://github.com/NervanaSystems/ngraph ²https://github.com/JuliaOpt/JuMP.jl ³gurobi.com 22/29

  32. Experiments! (Software and hardware as on slide 31.) Workloads, conventional batch sizes:

      Network        Batchsize   Memory (GB)
      Inception v4   1024        111
      Vgg 19         2048        143
      Resnet 200     512         132
      DenseNet 264   512         115

  22/29

  33. Experiments! (Software and hardware as on slide 31.) Workloads, large batch sizes:

      Network        Batchsize   Memory (GB)
      Inception v4   6144        659
      Vgg 416        128         658
      Resnet 200     2560        651
      DenseNet 264   3072        688

  22/29

  34. Scaling Performance, Inception v4 (batchsize 1024). [Plot: slowdown relative to all-DRAM (lower is better) vs. DRAM limit (GB), 0 to 120, synchronous policy.] • Just using PMM is too slow. • Best performance when the working set fits in fast memory. 23/29

  39. Comparison Against 2LM (2LM: hardware-managed DRAM cache). [Bar chart: speedup over 2LM (higher is better) for static-AutoTM and sync-AutoTM on Vgg416 (320), Inception v4 (6144), Resnet200 (2560), DenseNet 264 (3072).] • Avoids dirty writebacks. • Lower memory contention. • Software management outperforms hardware management by up to 3x. 26/29

  43. Outline Background AutoTM Profiling ILP Modeling Results Wrap Up 27/29

  44. Limitations • Static computation graphs. • Kernel profiling overhead. • ILP solution times. • ILP solution may be hard to interpret. 28/29
