SLIDE 1

Load Value Approximation: Approaching the Ideal Memory Access Latency

Joshua San Miguel and Natalie Enright Jerger

SLIDE 2

Chip Multiprocessor

[Diagram: chip multiprocessor. Three cores, each with a private cache, connect through shared caches and a network-on-chip to main memory; a load that misses in the private cache must traverse this hierarchy.]

SLIDE 3

Approximate Data

Many applications can tolerate inexact data values.

  • In approximate computing applications, 40% to nearly 100% of memory data footprint can be approximated [Sampson, MICRO 2013].

Approximate data storage:

  • Reducing SRAM power by lowering supply voltage [Flautner, ISCA 2002].
  • Reducing DRAM power by lowering refresh rate [Liu, ASPLOS 2011].
  • Improving PCM performance and lifetime by lowering write precision and reusing failed cells [Sampson, MICRO 2013].

SLIDE 4

Outline

  • Load Value Approximation
  • Approximator Design
  • Evaluation
  • Conclusion


SLIDE 5

Load Value Approximation

[Diagram: baseline CMP. Three cores with private caches connect through shared caches and a network-on-chip to main memory.]

SLIDE 6

Load Value Approximation

[Diagram: same CMP with an approximator added alongside each core.]

SLIDE 7

Load Value Approximation

[Diagram: a load of A misses in the private cache.]

SLIDE 8

Load Value Approximation

[Diagram: on the miss, the core's approximator generates A_approx.]

SLIDE 9

Load Value Approximation

[Diagram: A_approx is returned to the core immediately.]

SLIDE 10

Load Value Approximation

[Diagram: meanwhile, the actual value A_actual is requested from the memory hierarchy.]

SLIDE 11

Load Value Approximation

[Diagram: when it arrives, A_actual is used to train the approximator.]

SLIDE 12

Load Value Approximation

[Diagram: the CMP with approximators, miss handling complete.]

Takes memory access off critical path.
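The miss-handling flow on slides 5 through 12 (generate A_approx, return it to the core, fetch A_actual, then train) can be sketched in software. This is a minimal, hypothetical model using a simple last-value approximator; all class and function names are illustrative rather than from the talk, the real mechanism is hardware, and the background fetch of A_actual is modeled synchronously for brevity.

```python
# Software model of the load-value-approximation flow (illustrative names).

class Memory:
    def __init__(self, data):
        self.data = data
    def fetch(self, addr):
        return self.data[addr]

class Cache:
    def __init__(self):
        self.lines = {}
    def lookup(self, addr):
        return (addr in self.lines), self.lines.get(addr)
    def fill(self, addr, value):
        self.lines[addr] = value

class Approximator:
    """Generates an approximate load value; trained with actual values."""
    def __init__(self):
        self.history = {}                    # addr -> last observed value
    def generate(self, addr):
        return self.history.get(addr, 0.0)   # simple last-value scheme
    def train(self, addr, actual):
        self.history[addr] = actual

def load(addr, cache, memory, approximator):
    hit, value = cache.lookup(addr)
    if hit:
        return value                         # ordinary hit: precise value
    # Miss: hand A_approx to the core immediately (off the critical path),
    # then fetch A_actual to fill the cache and train the approximator.
    a_approx = approximator.generate(addr)
    a_actual = memory.fetch(addr)
    cache.fill(addr, a_actual)
    approximator.train(addr, a_actual)
    return a_approx
```

Note how the core never waits on `memory.fetch` for its own result: the returned value on a miss is always the approximation, while the actual value only updates state for future loads.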

SLIDE 13

Approximator Design

[Diagram: approximator organization. The load instruction's address is hashed with the contents of a global history buffer (PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1) to index a tagged approximator table. The matching entry's local history buffer holds recently observed values (4.1, 3.9, 4.0), which are averaged to form the prediction: A_approx = (4.1 + 3.9 + 4.0) / 3 = 4.0.]
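A rough software sketch of the organization on this slide, assuming an XOR-fold hash over the global history, a 256-entry table, and dictionary-based storage; all three are illustrative choices, not details from the talk.

```python
# Sketch of the slide-13 approximator: PC hashed with the global history
# buffer (GHB) indexes a tagged table whose entry holds a local history
# buffer (LHB); the LHB average is the predicted load value.

import struct

TABLE_ENTRIES = 256                    # assumed table size

def float_bits(x):
    # Reinterpret a double's bit pattern as an integer for hashing.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def index_and_tag(pc, ghb):
    h = pc
    for v in ghb:                      # e.g. PC xor 1.0 xor 2.2 xor 3.1
        h ^= float_bits(v)
    return h % TABLE_ENTRIES, h        # low bits index, full hash tags

def approximate(pc, ghb, table):
    idx, tag = index_and_tag(pc, ghb)
    entry = table.get(idx)
    if entry is None or entry["tag"] != tag:
        return None                    # no matching entry: cannot predict
    lhb = entry["lhb"]                 # e.g. [4.1, 3.9, 4.0]
    return sum(lhb) / len(lhb)         # (4.1 + 3.9 + 4.0) / 3 = 4.0
```

Training would push each arriving A_actual into the matching entry's local history buffer, keeping the average fresh for subsequent misses.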

SLIDE 14

Approximator Design


Load value approximators overcome the challenges of traditional value predictors:

  • No complexity of tracking speculative values.
  • No rollbacks.
  • High accuracy/coverage with floating-point values.
  • More tolerant of value delay.

SLIDE 15

Evaluation


EnerJ framework [Sampson, PLDI 2011]:

  • Program annotations to distinguish approximate data from precise data.
  • Evaluate final output error and approximator coverage.

benchmark    GHB size    LHB size    approximator size
fft          2           —           49 kB
lu           3           1           32 kB
raytracer    1           1           32 kB
smm          5           1           32 kB
sor          2           —           49 kB

SLIDE 16

Evaluation

[Chart: output error and approximator coverage (0%–100%) for fft, lu, raytracer, smm, and sor.]
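For concreteness, the two metrics might be computed as below. These definitions (coverage as the fraction of approximable loads served with an approximate value, output error as the mean relative deviation of the final output from the precise run) are my reading of the slide, not definitions given in the talk.

```python
# Hypothetical definitions of the two evaluation metrics.

def coverage(approximated_loads, approximable_loads):
    # Fraction of approximable loads actually served approximately.
    return approximated_loads / approximable_loads

def output_error(approx_output, precise_output):
    # Mean relative error over the output elements (nonzero precise values).
    return sum(abs(a - p) / abs(p)
               for a, p in zip(approx_output, precise_output)) / len(precise_output)
```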

SLIDE 17

Conclusion


Future work:

  • Further explore approximator design space (dynamic/hybrid schemes, machine learning).
  • Measure speedup of load value approximation using full-system simulations.
  • Measure power savings (low-power caches/NoCs/memory for approximate data).

Low-error, high-coverage approximators allow us to approach the ideal memory access latency.

SLIDE 18

Thank you

[Images: raytracer output with the precise baseline vs. load value approximation.]