
Understanding and Tackling the Hidden Memory Latency for Edge-based Heterogeneous Platform - PowerPoint PPT Presentation

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING, Pervasive and Emerging Architecture Research Lab (PEARL). Understanding and Tackling the Hidden Memory Latency for Edge-based Heterogeneous Platform. Zhendong Wang, Zhen Wang, Cong Liu, and Yang Hu.


  1. Understanding and Tackling the Hidden Memory Latency for Edge-based Heterogeneous Platform. Zhendong Wang, Zhen Wang, Cong Liu, and Yang Hu. Presented by Zhendong Wang. HotEdge 2020, Jun. 25, 2020. Pervasive and Emerging Architecture Research Lab (PEARL), UT Dallas.

  2. Outline: 1. Background; 2. Motivation and Challenges (edge intelligence, integrated GPU, latency); 3. Proposed Design; 4. Evaluation and Conclusion. [Figure: CPU/GPU timeline showing data allocation, data initialization, and the computation kernel.]

  3. Background. ML/DNN enables a series of edge applications and is widely deployed on integrated CPU/GPU (iGPU) platforms, which face tight size, weight, and power constraints. Deployment on iGPUs is stymied by rigorous requirements on memory footprint and processing latency: (1) limited memory space (e.g., TX2: 8 GB, AGX: 16 GB); (2) stringent application latency requirements (e.g., driving automation is safety-critical and latency-sensitive).

  4. Background. The Unified Memory (UM) management model has relieved the situation: it (1) eases memory management and (2) saves memory footprint (CUDA: cudaMallocManaged()). Is the current Unified Memory (UM) model good enough?
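The contrast between the two models the talk compares can be sketched as follows. This is a minimal illustration, not code from the paper; the kernel, buffer names, and sizes are assumptions, and the "Def" label follows the slides' shorthand for the copy-then-execute model.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void addOne(float *a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 20;

    // Def (copy-then-execute): separate host/device buffers plus explicit copies.
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h[i] = 0.0f;              // CPU-side initialization
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    addOne<<<(unsigned)((n + 255) / 256), 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // UM: one managed buffer, no explicit copies; pages migrate on demand.
    float *m;
    cudaMallocManaged(&m, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) m[i] = 0.0f;              // still initialized on the CPU
    addOne<<<(unsigned)((n + 255) / 256), 256>>>(m, n);
    cudaDeviceSynchronize();

    cudaFree(d); cudaFree(m); free(h);
    return 0;
}
```

Note that in both versions the initialization loop still runs on the CPU; the next slides show why that becomes the hidden cost.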

  5. Motivation. Limits of the current Unified Memory (UM) model: hidden latency. [Figure: data processing flow under the two memory models. Def (copy-then-execute): CPU-side allocation and initialization, explicit copies, then GPU kernel execution. UM (unified memory): allocation and initialization of a shared managed buffer, then GPU kernel execution with no explicit copies.] Autonomous driving workloads involve large matrix operation scales (M.O.S.) in matrix addition and matrix multiplication:

  DNN:    YOLOv2 | YOLOv3 | SSD | DAVE-2
  M.O.S.: 49K    | 81K    | 10K | 250K

  6. Motivation. Limits of the current Unified Memory (UM) model: hidden latency. (1) Def (copy-then-execute): the non-initialization time ("Others") = H2D copy + D2H copy + kernel time. (2) UM: "Others" = kernel time only (no copies). In both models, Init. denotes data initialization. UM still spends excessive time on initialization: for matrix addition and matrix multiplication, initialization accounts for ~50% of total latency under Def and ~90% under UM.
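The per-phase breakdown above can be reproduced with simple wall-clock timing around each phase of a UM program. The harness below is a sketch under my own assumptions (the talk does not show its measurement code); the kernel and sizes are illustrative.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void scale(float *a, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 24;
    float *m;

    auto t0 = std::chrono::steady_clock::now();
    cudaMallocManaged(&m, n * sizeof(float));            // allocation
    auto t1 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i) m[i] = 1.0f;          // CPU-side initialization
    auto t2 = std::chrono::steady_clock::now();
    scale<<<(unsigned)((n + 255) / 256), 256>>>(m, n);
    cudaDeviceSynchronize();                             // kernel, incl. page-mapping cost
    auto t3 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    printf("alloc %.2f ms, init %.2f ms, kernel %.2f ms\n",
           ms(t0, t1), ms(t1, t2), ms(t2, t3));
    cudaFree(m);
    return 0;
}
```

On a UM program like this, the init phase is where the slides report ~90% of the latency going.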

  7. Motivation. UM also slows down the computation kernel: (1) under Def, kernel time = kernel execution; (2) under UM, kernel time = kernel execution + page-mapping latency (observed on matrix addition and matrix multiplication). Observation: initializing data on the CPU side is unnecessary; initializing it on the GPU side would (1) save initialization latency and (2) benefit kernel and overall application response performance.

  8. Proposed Design. Enhanced Unified Memory Management (eUMM): (1) initializing data on the GPU side. [Figures: existing mechanism of the legacy Unified Memory model vs. GPU-side data initialization in eUMM.]
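GPU-side initialization replaces the CPU loop with an initialization kernel over the managed buffer, so the first touch happens on the GPU and pages are mapped directly to device memory. A minimal sketch; the kernel name and grid-stride pattern are my own illustration, and the paper's actual eUMM implementation may differ:

```cuda
#include <cuda_runtime.h>

// Grid-stride initialization kernel: the GPU, not the CPU, performs the
// first touch on the managed buffer.
__global__ void gpuInit(float *a, size_t n, float v) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) a[i] = v;
}

int main() {
    const size_t n = 1 << 24;
    float *m;
    cudaMallocManaged(&m, n * sizeof(float));
    gpuInit<<<256, 256>>>(m, n, 0.0f);   // initialization on the GPU side
    cudaDeviceSynchronize();
    // ...launch the actual computation kernel on m here...
    cudaFree(m);
    return 0;
}
```

Because the pages already reside on the GPU after initialization, the subsequent computation kernel no longer pays the fault-driven mapping cost described on slide 7.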

  9. Proposed Design. Enhanced Unified Memory Management (eUMM): (2) prefetch-enhanced GPU-Init performance.
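The prefetch enhancement can be sketched with cudaMemPrefetchAsync issued on a stream ahead of the GPU-side initialization kernel, so page mapping is driven in bulk rather than by per-page faults. This is a hedged approximation: the slide does not spell out eUMM's exact overlap strategy (the conclusion mentions overlapping page mapping with data initialization, which may be pipelined per chunk rather than as a single prefetch).

```cuda
#include <cuda_runtime.h>

__global__ void gpuInit(float *a, size_t n, float v) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride) a[i] = v;
}

int main() {
    const size_t n = 1 << 24;
    float *m;
    int dev;
    cudaMallocManaged(&m, n * sizeof(float));
    cudaGetDevice(&dev);

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Bulk-map the managed pages onto the GPU ahead of first touch,
    // then run GPU-side initialization on the already-mapped pages.
    cudaMemPrefetchAsync(m, n * sizeof(float), dev, s);
    gpuInit<<<256, 256, 0, s>>>(m, n, 0.0f);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(m);
    return 0;
}
```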

  10. Evaluation. Platforms: Jetson TX2, Xavier AGX. Benchmarks: matrix addition, matrix multiplication, Needleman-Wunsch (NW), random access (RA). Results: faster data initialization, and the computation kernel is no longer slowed down.

  11. Conclusion. Characterization of legacy unified memory management: ◆ initialization latency ◆ kernel launch latency. An enhanced data-initialization model based on Unified Memory management (eUMM): ◆ initializing data on the GPU side ◆ overlapping page mapping with data initialization to further reduce latency.

  12. Prospect & Future Work. Extend eUMM to a broad spectrum of workloads: ◆ autonomous driving workloads (object detection, object tracking). Reduce the inherent overhead of GPU-side data initialization: ◆ GPU-side initialization does not outperform when the data size is small. GPUDirect: ◆ bypass the CPU to accelerate communication between the GPU and peripheral storage.

  13. Thank You. If you have any questions, please contact zhendong.wang@utdallas.edu.
