 
AutoTM: Automatic Tensor Movement in Heterogeneous Memory Systems using Integer Linear Programming
Mark Hildebrand (1), Jawad Khan (2), Sanjeev Trika (2), Jason Lowe-Power (1), Venkatesh Akella (1)
(1) University of California, Davis
(2) Intel Corporation
https://github.com/darchr/AutoTM
March 12, 2020
Executive Summary
Problem
• Automatic two-level memory management for Deep Neural Networks
Idea
• Profile Guided Optimization
• Model as an Integer Linear Program (ILP)
Results
• Replace 50-80% of DRAM with NVDIMMs at a geometric-mean performance loss of 27.1%.
• 3x better performance than the real hardware cache.
Outline
• Background
• AutoTM
• Profiling
• ILP Modeling
• Results
• Wrap Up
Why Deep Neural Networks
Can we use multiple levels of memory to train large models on a single machine?
Image: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
Heterogeneous Memory Systems
NVDIMM Style
• Two types of memory.
• Same memory controller.
• Both are byte addressable.
• NVDIMMs for high capacity and low cost.
Challenges
• Placing all tensors in NVDIMM memory is too slow.
• Using DRAM as a cache for NVDIMMs is also too slow.
• Intelligent memory management is required.
AutoTM
Goal: Minimize execution time
• Arbitrary computation graph
• Size constraint on fast memory
How
• Place tensors in fast or slow memory.
• Optimal tensor movement
Strategy
• Profile kernel performance.
• Model tensor assignment as ILP.
Kernel Profiling
Profile performance of kernels for all tensor IO locations.

Kernel   T1     T2     T3
K2       DRAM   DRAM   DRAM
K2       DRAM   DRAM   PMM
K2       DRAM   PMM    DRAM
K2       DRAM   PMM    PMM
K2       PMM    DRAM   DRAM
K2       PMM    DRAM   PMM
K2       PMM    PMM    DRAM
K2       PMM    PMM    PMM
Table: Profile space for kernel K2 (IO tensor locations).
Figure: Performance of kernel K2 relative to all IO in DRAM, for each placement of Data In, Weight, and Data Out in DRAM or PMM.
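The profile space above can be enumerated mechanically. The following is a minimal Python sketch, assuming every kernel IO tensor may sit in either DRAM or PMM. The helper name `profile_space` and the tensor names are illustrative only and are not taken from the AutoTM code base.

```python
from itertools import product

# The two memory pools named on the slide.
LOCATIONS = ("DRAM", "PMM")


def profile_space(io_tensors):
    """Return every assignment of IO tensors to memory locations.

    Hypothetical helper for illustration; it only lists the
    configurations a profiler would have to time.
    """
    return [
        dict(zip(io_tensors, combo))
        for combo in product(LOCATIONS, repeat=len(io_tensors))
    ]


# Kernel K2 from the slide has three IO tensors.
configs = profile_space(["T1", "T2", "T3"])
for config in configs:
    print(config)
```

A kernel with n IO tensors has 2^n configurations under this assumption, which gives the 8 rows listed for K2.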
Tensor Lifetime Flow Network
The path of flow through the graph describes a tensor's memory location throughout its lifetime.
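The idea that a path encodes a tensor's location over time can be sketched in a few lines of Python. This is a simplification: it assumes one DRAM node and one PMM node per kernel that uses the tensor, and treats a change of location between consecutive kernels as a move edge. The exact node and edge structure of the network on the slide is not recoverable from the text, and the function name `lifetime_paths` is hypothetical.

```python
from itertools import product


def lifetime_paths(kernels, allow_moves=True):
    """Enumerate location paths of one tensor across the kernels that use it.

    Each path assigns "DRAM" or "PMM" to every kernel, in order. A change
    of location between consecutive kernels counts as one move.
    Returns a list of (path, number_of_moves) pairs.
    """
    paths = []
    for combo in product(("DRAM", "PMM"), repeat=len(kernels)):
        moves = sum(a != b for a, b in zip(combo, combo[1:]))
        if allow_moves or moves == 0:
            paths.append((dict(zip(kernels, combo)), moves))
    return paths


all_paths = lifetime_paths(["k1", "k2", "k3"])
static_paths = lifetime_paths(["k1", "k2", "k3"], allow_moves=False)
print(len(all_paths), len(static_paths))
```

When moves are forbidden, only the two constant paths remain (always DRAM or always PMM). Allowing moves exposes every location schedule to the optimizer.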
ILP Modeling
Objective Function: computation time

$$\min \; \underbrace{\sum_{k \in K} \rho_k}_{\text{Kernel Execution Time}} \; + \; \underbrace{\sum_{t \in T} M_t}_{\text{Tensor Movement Time}}$$

• $K$: Set of Kernels
• $\rho_k$: Run time of kernel $k$ (example: run time of kernel $k_2$)
• $T$: Set of Tensors
• $M_t$: Time moving tensor $t$ (example: time moving tensor $t_1$)

Constraints: limit DRAM at each kernel

$$\sum_{t \in L(k)} |t| \, I^{\mathrm{DRAM}}_{t,k} \le \text{Limit} \qquad \forall k$$
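To make the objective and the DRAM constraint concrete, here is a small Python sketch that solves a toy instance by brute force instead of calling an ILP solver. It covers only the case with no tensor movement, so every M_t is zero. It reads L(k) as the set of tensors live at kernel k and |t| as the size of tensor t, which are assumptions about the slide's notation. All sizes, live sets, and timings are invented for illustration, and the additive PMM penalty is a stand-in for a real profile table.

```python
from itertools import product

# Hypothetical tensor sizes in GB.
sizes = {"t1": 4, "t2": 2, "t3": 3}
# Hypothetical live sets L(k): tensors occupying memory while kernel k runs.
live = {"k1": ["t1", "t2"], "k2": ["t1", "t2", "t3"]}
# Hypothetical IO tensors of each kernel.
io = {"k1": ["t1", "t2"], "k2": ["t2", "t3"]}
# Hypothetical cost model: base run time plus a penalty per IO tensor in PMM.
base = {"k1": 10.0, "k2": 8.0}
pmm_penalty = {"t1": 6.0, "t2": 1.0, "t3": 4.0}


def kernel_time(k, placement):
    """Run time rho_k of kernel k under the given tensor placement."""
    return base[k] + sum(pmm_penalty[t] for t in io[k] if placement[t] == "PMM")


def solve_static(limit):
    """Return (total_time, placement) minimizing the sum of rho_k.

    A placement is feasible when, at every kernel, the live tensors
    placed in DRAM fit within `limit` GB.
    """
    best = None
    tensors = sorted(sizes)
    for combo in product(("DRAM", "PMM"), repeat=len(tensors)):
        placement = dict(zip(tensors, combo))
        feasible = all(
            sum(sizes[t] for t in live[k] if placement[t] == "DRAM") <= limit
            for k in live
        )
        if not feasible:
            continue
        total = sum(kernel_time(k, placement) for k in live)
        if best is None or total < best[0]:
            best = (total, placement)
    return best


print(solve_static(7))
```

Brute force examines all 2^|T| placements, so it is only usable on toy inputs. An ILP solver searches the same space with binary variables and linear constraints.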
Variations of AutoTM

Name           Description                                    PMM System   GPU System
Static         Tensors can't move                             ✓            ✗
Synchronous    Tensors move but block computation             ✓            ✓
Asynchronous   Tensor movement concurrent with computation    ●            ✓
Experiments!
Software
• Modified the ngraph [1] compiler.
• Julia's JuMP [2] package for ILP modeling.
• Gurobi [3] as the ILP solver.
Hardware
• 1.5 TB Optane™ DC PMM
• 384 GiB DRAM
Workloads

Conventional    Batchsize   Memory (GB)
Inception v4    1024        111
Vgg 19          2048        143
Resnet 200      512         132
DenseNet 264    512         115

Large           Batchsize   Memory (GB)
Inception v4    6144        659
Vgg 416         128         658
Resnet 200      2560        651
DenseNet 264    3072        688

[1] https://github.com/NervanaSystems/ngraph
[2] https://github.com/JuliaOpt/JuMP.jl
[3] gurobi.com
Scaling Performance - Inception V4
Figure: Performance of Inception v4, batchsize 1024, synchronous AutoTM. x-axis: DRAM limit (GB); y-axis: slowdown (lower is better).
◦ Just using PMM is too slow.
◦ Best performance when the working set fits in memory.
Comparison Against 2LM
Figure: Speedup over 2LM (DRAM cache; higher is better) for static-AutoTM and sync-AutoTM on Vgg416 (320), Inception v4 (6144), Resnet200 (2560), and DenseNet 264 (3072).
• Avoid dirty writebacks
• Lower memory contention
Software management outperforms hardware management by up to 3x.
Limitations
• Static computation graphs.
• Kernel profiling overhead.
• ILP solution times.
• ILP solution may be hard to interpret.