Load Value Approximation, by Joshua San Miguel, Mario Badr and Natalie Enright Jerger



slide-1
SLIDE 1

Load Value Approximation

Joshua San Miguel Mario Badr Natalie Enright Jerger

slide-2
SLIDE 2

Accessing Memory

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory]

slide-3
SLIDE 3

Accessing Memory

[Diagram: processor core → L1 cache (miss) → shared caches, directory, network-on-chip → main memory]

slide-4
SLIDE 4

Accessing Memory

[Diagram: processor core → L1 cache (miss) → shared caches, directory, network-on-chip → main memory]

Accessing main memory costs 10x to 100x more latency and energy than accessing the L1 cache!

slide-5
SLIDE 5

Accessing Memory

[Diagram: processor core → L1 cache (miss) → shared caches, directory, network-on-chip → main memory]

Accessing main memory costs 10x to 100x more latency and energy than accessing the L1 cache!

Higher efficiency via Approximate Computing…

slide-6
SLIDE 6

Approximate Computing

Not all computations need to be precise.

Example applications: data mining, computer vision, audio and video processing, gaming, machine learning, dynamical simulation.

Image sources: http://www.zentut.com/ http://www.businessweek.com/ http://www.cc.gatech.edu/~cnieto6/ http://www.analyticbridge.com/ http://themusicparlour.blogspot.ca/ http://www.scientific-computing.com/

slide-7
SLIDE 7

Approximate Computing

[Chart: execution time and energy]

slide-8
SLIDE 8

Approximate Computing

[Chart: execution time, energy and error]

slide-9
SLIDE 9

Approximate Computing

[Chart: execution time, energy and error]

slide-10
SLIDE 10

Approximate Computing

Many applications can tolerate approximate data.

  • 40% to nearly 100% of the data footprint is approximate [Sampson, MICRO 2013].

slide-11
SLIDE 11

Approximate Computing

Many applications can tolerate approximate data.

  • 40% to nearly 100% of the data footprint is approximate [Sampson, MICRO 2013].

Approximate value locality:

  • Many data values are similar to, or can be approximated from, previously seen values.
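
One way to make approximate value locality concrete is to measure how often a value lands close to an estimate built from the values seen just before it. The sketch below is only an illustration: the function name `approx_locality`, the 10% relative window and the three-value history are assumptions chosen for the example, not definitions from the talk.

```python
def approx_locality(values, window=0.10, history=3):
    """Fraction of values that lie within `window` (relative to the value
    itself) of the mean of the `history` values seen just before it.

    Hypothetical metric for illustration only.
    """
    hits = 0
    checked = 0
    for i in range(history, len(values)):
        estimate = sum(values[i - history:i]) / history
        checked += 1
        if abs(values[i] - estimate) <= window * abs(values[i]):
            hits += 1
    return hits / checked if checked else 0.0


# 4.2 is close to the mean of the three previous values; 9.0 is not.
similar_stream = [4.1, 3.9, 4.0, 4.2]
dissimilar_stream = [4.1, 3.9, 4.0, 9.0]
```

A stream such as `similar_stream` scores high under this measure, which is the situation the approximator is meant to exploit.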

slide-12
SLIDE 12

Outline

  • Load Value Approximation
  • Non-Speculative Operation
  • Approximator Design
  • Relaxed Confidence Windows
  • Approximation Degree
  • Methodology
  • Evaluation


slide-13
SLIDE 13

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory]

slide-14
SLIDE 14

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator]

slide-15
SLIDE 15

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator. Step shown: load miss on A (label "A?").]

slide-16
SLIDE 16

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator. Step shown: the approximator generates A_approx (label "A?").]

slide-17
SLIDE 17

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator. Step shown: A_approx.]

slide-18
SLIDE 18

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator. Step shown: A_approx.]

No speculation, no rollbacks.

slide-19
SLIDE 19

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator. Step shown: A_approx.]

slide-20
SLIDE 20

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator. Step shown: A_approx is in use while A_actual is fetched.]

slide-21
SLIDE 21

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator. Step shown: A_approx is in use and the approximator is trained with A_actual.]

slide-22
SLIDE 22

Load Value Approximation

[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory, plus an approximator supplying A_approx]

Learns past values. Estimates future values. Improves performance and saves energy.
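
The walkthrough on the preceding slides (miss, generate A_approx, continue, fetch A_actual, train) can be sketched in software as follows. This is a toy model under stated assumptions, not the hardware design: `AveragingApproximator`, `load`, and the dictionary-based cache and memory are hypothetical names, and the fetch that would complete in the background in hardware is performed inline here.

```python
from collections import deque


class AveragingApproximator:
    """Toy approximator (assumption): estimates the mean of recent values."""

    def __init__(self, history_size=3):
        self.history = deque(maxlen=history_size)

    def generate(self):
        # No estimate is possible until at least one value has been seen.
        if not self.history:
            return None
        return sum(self.history) / len(self.history)

    def train(self, actual):
        self.history.append(actual)


def load(address, l1_cache, memory, approximator):
    """Return (value, was_approximated) for a load of `address`."""
    if address in l1_cache:
        return l1_cache[address], False      # L1 hit: precise value
    approx = approximator.generate()         # L1 miss: ask the approximator
    actual = memory[address]                 # background fetch in hardware
    l1_cache[address] = actual
    approximator.train(actual)               # train with A_actual
    if approx is None:
        return actual, False                 # nothing learned yet
    return approx, True                      # continue with A_approx


memory = {0: 4.1, 1: 3.9, 2: 4.0, 3: 4.2}
l1_cache = {}
approximator = AveragingApproximator()
results = [load(a, l1_cache, memory, approximator) for a in (0, 1, 2, 3, 3)]
```

The value handed to the core is never checked or rolled back; the actual value is used only to fill the cache and train the approximator, which matches the "no speculation, no rollbacks" point above.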

slide-23
SLIDE 23

Approximator Design

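[Diagram: the instruction address and the global history buffer are combined by a hash function h to index the approximator table. Each table entry holds tag, conf, degree and a local history buffer (LHB). A function g computes the approximation from the local history buffer.]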

slide-24
SLIDE 24

Approximator Design

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0), drawn above a time axis.]

slide-25
SLIDE 25

Approximator Design

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A

slide-26
SLIDE 26

Approximator Design

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A

Table index: PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1

slide-27
SLIDE 27

Approximator Design

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A

Table index: PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1

Approximation: A_approx = (4.1 + 3.9 + 4.0) / 3 = 4.0
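
The two steps on this slide, indexing the table and averaging the local history, can be sketched as below. Several details are assumptions made for the example: XOR is applied to the raw 64-bit IEEE 754 bit patterns of the values, the table has a hypothetical 512 entries, and the names `float_bits`, `table_index` and `approximate` are illustrative. The actual design may hash fewer bits of each value.

```python
import struct
from functools import reduce

TABLE_ENTRIES = 512  # hypothetical table size, not taken from the slides


def float_bits(value):
    """Raw 64-bit pattern of a double, so that XOR is well defined."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def table_index(pc, global_history):
    """h: XOR the PC with each global history value, then fold to table size."""
    hashed = reduce(lambda acc, v: acc ^ float_bits(v), global_history, pc)
    return hashed % TABLE_ENTRIES


def approximate(local_history):
    """g: average of the values in the local history buffer."""
    return sum(local_history) / len(local_history)


index = table_index(0x400123, [1.0, 2.2, 3.1])   # example PC is made up
a_approx = approximate([4.1, 3.9, 4.0])          # about 4.0, as on the slide
```

Hashing the global history together with the PC lets the same load instruction map to different entries depending on the values that preceded it.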

slide-28
SLIDE 28

Approximator Design

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx)

slide-29
SLIDE 29

Approximator Design

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx), request(A_actual)

slide-30
SLIDE 30

Approximator Design

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx), request(A_actual), A_actual = 4.2

slide-31
SLIDE 31

Approximator Design

[Diagram: same approximator structure as Slide 23, after training with A_actual: global history buffer = (2.2, 3.1, 4.2) and local history buffer = (3.9, 4.0, 4.2).]

Timeline events: load miss A, do_work(A_approx), request(A_actual), A_actual = 4.2
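
The training step shown here behaves like pushing A_actual into two fixed-length queues, with the oldest value falling out of each. A minimal sketch, assuming three-entry buffers as drawn on the slide; the function name `train` is hypothetical.

```python
from collections import deque


def train(global_history, local_history, actual):
    """Append A_actual to both buffers; deque(maxlen=...) evicts the oldest."""
    global_history.append(actual)
    local_history.append(actual)


ghb = deque([1.0, 2.2, 3.1], maxlen=3)   # global history buffer
lhb = deque([4.1, 3.9, 4.0], maxlen=3)   # local history buffer of the entry
train(ghb, lhb, 4.2)                     # A_actual = 4.2 arrives from memory
```

After the call, the buffers hold the values shown on the slide: (2.2, 3.1, 4.2) and (3.9, 4.0, 4.2).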

slide-32
SLIDE 32

Approximator Design – Other Considerations


  • Floating-point precision
  • History buffer sizes
  • Stale values

More details in paper.

slide-33
SLIDE 33

Approximator Design

Relaxed Confidence Windows

  • How do we avoid making bad approximations?
  • Trade off performance against error.

Approximation Degree

  • Do we need to fetch the actual value from memory every time?
  • Trade off energy against error.
slide-34
SLIDE 34

Relaxed Confidence Windows

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx), request(A_actual)

A_approx = 4.0

slide-35
SLIDE 35

Relaxed Confidence Windows

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx), request(A_actual)

A_approx = 4.0, but A_actual = 9.0!

slide-36
SLIDE 36

Relaxed Confidence Windows

Approximator table entry: tag, conf, degree, LHB

When approximating:

    if conf >= 0: use A_approx
    else: don’t use A_approx

When updating:

    if A_approx and A_actual differ by <= CONF_WINDOW%: conf++
    else: conf--
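
The confidence rule above can be written out as a small runnable sketch. Three things are assumptions not stated on the slide: the difference is measured relative to A_actual, the counter saturates at hypothetical bounds of -8 and 7, and the default window is 10%. The names `should_use` and `update_conf` are illustrative.

```python
CONF_WINDOW = 0.10            # hypothetical default: a 10% window
CONF_MIN, CONF_MAX = -8, 7    # hypothetical saturation bounds


def should_use(conf):
    """When approximating: use A_approx only if confidence is non-negative."""
    return conf >= 0


def update_conf(conf, approx, actual, window=CONF_WINDOW):
    """When updating: raise conf if A_approx fell inside the window, else lower it."""
    # Assumption: the window is relative to the magnitude of A_actual.
    within = abs(approx - actual) <= window * abs(actual)
    if within:
        return min(conf + 1, CONF_MAX)
    return max(conf - 1, CONF_MIN)


conf_close = update_conf(0, 4.0, 4.2)   # about 4.8% off: inside a 10% window
conf_far = update_conf(0, 4.0, 9.0)     # about 56% off: outside the window
```

With a window of 0% only exact matches raise confidence, as in a traditional value predictor; widening the window lets more approximations through at the cost of output error.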

slide-37
SLIDE 37

Relaxed Confidence Windows – Output Error

Varying CONF_WINDOW%:

[Chart: output error (y-axis, 0% to 100%) for CONF_WINDOW% = 0%, 5%, 10%, 20% and infinite]

slide-38
SLIDE 38

Relaxed Confidence Windows – L1-D MPKI

Varying CONF_WINDOW%:

[Chart: normalized L1-D MPKI (y-axis, 0.0 to 1.0) for CONF_WINDOW% = 0%, 5%, 10%, 20% and infinite]

slide-39
SLIDE 39

Approximator Design

Relaxed Confidence Windows

  • How do we avoid making bad approximations?
  • Trade off performance against error.

Approximation Degree

  • Do we need to fetch the actual value from memory every time?
  • Trade off energy against error.
slide-40
SLIDE 40

Approximation Degree

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx), request(A_actual)

A_approx = 4.0

slide-41
SLIDE 41

Approximation Degree

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx), request(A_actual)

A_approx = 4.0, A_actual = 4.0

slide-42
SLIDE 42

Approximation Degree

[Diagram: same approximator structure as Slide 23, with global history buffer = (1.0, 2.2, 3.1) and local history buffer = (4.1, 3.9, 4.0).]

Timeline events: load miss A, do_work(A_approx), request(A_actual)

A_approx = 4.0, A_actual = 4.0

slide-43
SLIDE 43

Approximation Degree

Approximator table entry: tag, conf, degree, LHB

When approximating:

    if degree == APPROX_DEGREE: fetch A_actual
    else: don’t fetch A_actual

When updating:

    if degree == APPROX_DEGREE: degree = 0
    else: degree++
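
To see how the degree counter thins out fetches, the two rules above can be combined into one step per load miss. This sketch follows the slide's pseudocode literally and assumes the counter starts at 0; `on_miss` and `count_fetches` are hypothetical helper names.

```python
def on_miss(degree, approx_degree):
    """Return (fetch_actual, next_degree) for one approximated load miss."""
    if degree == approx_degree:
        return True, 0            # fetch A_actual and reset the counter
    return False, degree + 1      # skip the fetch and count this miss


def count_fetches(misses, approx_degree):
    """Number of A_actual fetches issued over `misses` consecutive misses."""
    degree = 0                    # assumption: counter starts at 0
    fetches = 0
    for _ in range(misses):
        fetch, degree = on_miss(degree, approx_degree)
        if fetch:
            fetches += 1
    return fetches


fetches_degree_4 = count_fetches(10, 4)   # one fetch every fifth miss
fetches_degree_0 = count_fetches(10, 0)   # every miss fetches
```

Under these assumptions a larger APPROX_DEGREE means fewer fetches and less frequent training, which is the energy-versus-error trade-off named on the earlier slide.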

slide-44
SLIDE 44

Approximation Degree – Output Error

Varying APPROX_DEGREE:

[Chart: output error (y-axis, 0% to 100%) for APPROX_DEGREE = 1, 2, 4, 8 and 16]

slide-45
SLIDE 45

Approximation Degree – L1-D Fetches

Varying APPROX_DEGREE:

[Chart: normalized L1-D fetches (y-axis, 0.2 to 1) for APPROX_DEGREE = 1, 2, 4, 8 and 16]

slide-46
SLIDE 46

Methodology


Multi-threaded approximate applications

  • PARSEC benchmark suite [Bienia, Princeton 2011]
  • Programmer annotations and ISA extensions [Esmaeilzadeh, ASPLOS 2012]

Approximator design space exploration

  • Pin dynamic binary instrumentation tool [Luk, PLDI 2005]

Full-system simulation

  • FeS2 cycle-level x86 simulator [Neelakantam, ASPLOS 2008]

Approximator, cache and memory energy consumption

  • CACTI modeling tool [Thoziyoor, HP 2008]
slide-47
SLIDE 47

Evaluation

[Chart: application speedup and energy savings (y-axis, 0% to 16%) for APPROX_DEGREE = 4 and 16]

slide-48
SLIDE 48

Evaluation

[Chart: application speedup and energy savings (y-axis, 0% to 16%) for APPROX_DEGREE = 4 and 16]

Up to 28% speedup

slide-49
SLIDE 49

Evaluation

[Chart: application speedup and energy savings (y-axis, 0% to 16%) for APPROX_DEGREE = 4 and 16]

Up to 28% speedup

Up to 44% energy savings

slide-50
SLIDE 50

Evaluation

[Chart: application speedup and energy savings (y-axis, 0% to 16%) for APPROX_DEGREE = 4 and 16]

Reduces L1-D MPKI by 30% relative to a traditional value predictor and prefetcher.

Up to 28% speedup

Up to 44% energy savings

slide-51
SLIDE 51

Conclusion


Load Value Approximation

  • Approximate Value Locality
  • Non-Speculative
  • Relaxed Confidence Windows
  • Approximation Degree

↑ performance, ↓ energy consumption, low output error

slide-52
SLIDE 52

Conclusion

[Figure labels: baseline (precise), load value approximation]