SLIDE 1

Load Value Approximation: Approaching the Ideal Memory Access Latency

Joshua San Miguel and Natalie Enright Jerger

SLIDE 2

Chip Multiprocessor

[Diagram: chip multiprocessor. Three cores, each with a private cache, connect through shared caches and a network-on-chip to main memory; a load that misses in the private cache must traverse this hierarchy.]

SLIDE 3

Approximate Data

Many applications can tolerate inexact data values.

  • In approximate computing applications, 40% to nearly 100% of memory data footprint can be approximated [Sampson, MICRO 2013].

Approximate data storage:

  • Reducing SRAM power by lowering supply voltage [Flautner, ISCA 2002].
  • Reducing DRAM power by lowering refresh rate [Liu, ASPLOS 2011].
  • Improving PCM performance and lifetime by lowering write precision and reusing failed cells [Sampson, MICRO 2013].

SLIDE 4

Outline

  • Load Value Approximation
  • Approximator Design
  • Evaluation
  • Conclusion


SLIDE 5

Load Value Approximation

[Diagram: baseline CMP. Three cores with private caches connect through shared caches and a network-on-chip to main memory.]

SLIDE 6

Load Value Approximation

[Diagram: same CMP with an approximator added alongside each core.]

SLIDE 7

Load Value Approximation

[Diagram: a load of A misses in the private cache.]

SLIDE 8

Load Value Approximation

[Diagram: on the miss, the core's approximator generates A_approx.]

SLIDE 9

Load Value Approximation

[Diagram: A_approx is returned to the core immediately.]

SLIDE 10

Load Value Approximation

[Diagram: meanwhile, the actual value A_actual is requested from the memory hierarchy.]

SLIDE 11

Load Value Approximation

[Diagram: when it arrives, A_actual is used to train the approximator.]

SLIDE 12

Load Value Approximation

[Diagram: the CMP with approximators, miss handling complete.]

Takes memory access off critical path.
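The miss-handling flow on slides 5 through 12 (generate A_approx, return it to the core, fetch A_actual, then train) can be sketched in software. This is a minimal, hypothetical model using a simple last-value approximator; all class and function names are illustrative rather than from the talk, the real mechanism is hardware, and the background fetch of A_actual is modeled synchronously for brevity.

```python
# Software model of the load-value-approximation flow (illustrative names).

class Memory:
    def __init__(self, data):
        self.data = data
    def fetch(self, addr):
        return self.data[addr]

class Cache:
    def __init__(self):
        self.lines = {}
    def lookup(self, addr):
        return (addr in self.lines), self.lines.get(addr)
    def fill(self, addr, value):
        self.lines[addr] = value

class Approximator:
    """Generates an approximate load value; trained with actual values."""
    def __init__(self):
        self.history = {}                    # addr -> last observed value
    def generate(self, addr):
        return self.history.get(addr, 0.0)   # simple last-value scheme
    def train(self, addr, actual):
        self.history[addr] = actual

def load(addr, cache, memory, approximator):
    hit, value = cache.lookup(addr)
    if hit:
        return value                         # ordinary hit: precise value
    # Miss: hand A_approx to the core immediately (off the critical path),
    # then fetch A_actual to fill the cache and train the approximator.
    a_approx = approximator.generate(addr)
    a_actual = memory.fetch(addr)
    cache.fill(addr, a_actual)
    approximator.train(addr, a_actual)
    return a_approx
```

Note how the core never waits on `memory.fetch` for its own result: the returned value on a miss is always the approximation, while the actual value only updates state for future loads.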

SLIDE 13

Approximator Design

[Diagram: approximator organization. The load instruction's address is hashed with the contents of a global history buffer (PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1) to index a tagged approximator table. The matching entry's local history buffer holds recently observed values (4.1, 3.9, 4.0), which are averaged to form the prediction: A_approx = (4.1 + 3.9 + 4.0) / 3 = 4.0.]
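A rough software sketch of the organization on this slide, assuming an XOR-fold hash over the global history, a 256-entry table, and dictionary-based storage; all three are illustrative choices, not details from the talk.

```python
# Sketch of the slide-13 approximator: PC hashed with the global history
# buffer (GHB) indexes a tagged table whose entry holds a local history
# buffer (LHB); the LHB average is the predicted load value.

import struct

TABLE_ENTRIES = 256                    # assumed table size

def float_bits(x):
    # Reinterpret a double's bit pattern as an integer for hashing.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def index_and_tag(pc, ghb):
    h = pc
    for v in ghb:                      # e.g. PC xor 1.0 xor 2.2 xor 3.1
        h ^= float_bits(v)
    return h % TABLE_ENTRIES, h        # low bits index, full hash tags

def approximate(pc, ghb, table):
    idx, tag = index_and_tag(pc, ghb)
    entry = table.get(idx)
    if entry is None or entry["tag"] != tag:
        return None                    # no matching entry: cannot predict
    lhb = entry["lhb"]                 # e.g. [4.1, 3.9, 4.0]
    return sum(lhb) / len(lhb)         # (4.1 + 3.9 + 4.0) / 3 = 4.0
```

Training would push each arriving A_actual into the matching entry's local history buffer, keeping the average fresh for subsequent misses.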

SLIDE 14

Approximator Design


Load value approximators overcome the challenges of traditional value predictors:

  • No complexity of tracking speculative values.
  • No rollbacks.
  • High accuracy/coverage with floating-point values.
  • More tolerant of value delay.

SLIDE 15

Evaluation


EnerJ framework [Sampson, PLDI 2011]:

  • Program annotations to distinguish approximate data from precise data.
  • Evaluate final output error and approximator coverage.

benchmark    GHB size    LHB size    approximator size
fft          2           —           49 kB
lu           3           1           32 kB
raytracer    1           1           32 kB
smm          5           1           32 kB
sor          2           —           49 kB

SLIDE 16

Evaluation

[Chart: output error and approximator coverage (0%–100%) for fft, lu, raytracer, smm, and sor.]
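For concreteness, the two metrics might be computed as below. These definitions (coverage as the fraction of approximable loads served with an approximate value, output error as the mean relative deviation of the final output from the precise run) are my reading of the slide, not definitions given in the talk.

```python
# Hypothetical definitions of the two evaluation metrics.

def coverage(approximated_loads, approximable_loads):
    # Fraction of approximable loads actually served approximately.
    return approximated_loads / approximable_loads

def output_error(approx_output, precise_output):
    # Mean relative error over the output elements (nonzero precise values).
    return sum(abs(a - p) / abs(p)
               for a, p in zip(approx_output, precise_output)) / len(precise_output)
```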

SLIDE 17

Conclusion


Future work:

  • Further explore approximator design space (dynamic/hybrid schemes, machine learning).
  • Measure speedup of load value approximation using full-system simulations.
  • Measure power savings (low-power caches/NoCs/memory for approximate data).

Low-error, high-coverage approximators allow us to approach the ideal memory access latency.

SLIDE 18

Thank you

[Images: raytracer output with the precise baseline vs. load value approximation.]