Load Value Approximation
Joshua San Miguel, Mario Badr, Natalie Enright Jerger
Accessing Memory
[Diagram: processor core → L1 cache → shared caches, directory, network-on-chip → main memory]
- A load that misses in the L1 cache must travel through the shared caches, directory, and network-on-chip, and possibly to main memory.
- Accessing memory costs 10x to 100x more latency and energy than accessing the L1 cache!
- Higher efficiency via Approximate Computing…
Approximate Computing
Not all computations need to be precise.
- Data mining
- Computer vision
- Audio and video processing
- Gaming
- Machine learning
- Dynamical simulation
Image sources: http://www.zentut.com/, http://www.businessweek.com/, http://www.cc.gatech.edu/~cnieto6/, http://www.analyticbridge.com/, http://themusicparlour.blogspot.ca/, http://www.scientific-computing.com/
[Figure: execution time, energy, and error.]
Many applications can tolerate approximate data.
- 40% to nearly 100% of the data footprint is approximate [Sampson, MICRO 2013].
Approximate value locality:
- Many data values are similar to, or can be approximated from, previously seen values.
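One way to make approximate value locality concrete is to count how often a value falls close to a recently seen one. The sketch below is illustrative only: the function name, the `window` of four previous values, and the 10% relative `tolerance` are assumptions, not definitions from the talk.

```python
def approx_value_locality(values, window=4, tolerance=0.10):
    """Fraction of values within `tolerance` (relative) of any of the
    previous `window` values. All parameters are illustrative assumptions."""
    hits = 0
    considered = 0
    for i, v in enumerate(values):
        recent = values[max(0, i - window):i]
        if not recent:
            continue  # the first value has no history to compare against
        considered += 1
        if any(abs(v - r) <= tolerance * abs(v) for r in recent):
            hits += 1
    return hits / considered if considered else 0.0


# Hypothetical stream of loaded values (e.g. neighbouring pixel intensities).
pixels = [100, 101, 99, 102, 180, 181, 179, 100]
approx_locality = approx_value_locality(pixels)                # 10% tolerance
exact_locality = approx_value_locality(pixels, tolerance=0.0)  # exact matches only
```

On this made-up stream no value exactly repeats a recent one, yet most values land within 10% of one, which is the gap that approximation exploits.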
Outline
- Load Value Approximation
- Non-Speculative Operation
- Approximator Design
- Relaxed Confidence Windows
- Approximation Degree
- Methodology
- Evaluation
Load Value Approximation
[Diagram: processor core, L1 cache, approximator, shared caches, directory, network-on-chip, main memory]
- A load for A misses in the L1 cache; the approximator is asked for A.
- The approximator generates A_approx.
- A_approx is returned to the processor core, which continues executing.
- No speculation, no rollbacks.
- A_actual is fetched from main memory.
- The approximator trains with A_actual.
Learns past values. Estimates future values. Improves performance and saves energy.
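The flow above can be sketched in a few lines. This is a toy model under stated assumptions: `LastValueApproximator` is a hypothetical stand-in that simply returns the last value it was trained on, and the dictionaries and the `pending` list are illustrative substitutes for the cache, memory, and outstanding requests. The point it tries to show is that the core consumes A_approx immediately and that the late-arriving A_actual only trains the approximator; nothing is re-executed.

```python
class LastValueApproximator:
    """Hypothetical stand-in: approximates a load with the last trained value."""

    def __init__(self, initial=0.0):
        self.last = initial

    def generate(self):
        return self.last

    def train(self, actual):
        self.last = actual


def load(address, l1_cache, approximator, pending_fetches):
    """Return a value for `address` without waiting on main memory."""
    if address in l1_cache:
        return l1_cache[address]       # L1 hit: precise value
    pending_fetches.append(address)    # fetch A_actual in the background
    return approximator.generate()     # core continues with A_approx


def complete_fetches(l1_cache, memory, approximator, pending_fetches):
    """Model fetched values arriving: fill the cache and train.
    Work already done with A_approx is never rolled back."""
    while pending_fetches:
        address = pending_fetches.pop(0)
        actual = memory[address]
        l1_cache[address] = actual
        approximator.train(actual)


memory = {0x10: 4.2, 0x18: 4.3}   # made-up contents
l1_cache, pending = {}, []
approximator = LastValueApproximator(initial=4.0)

first = load(0x10, l1_cache, approximator, pending)    # miss: approximated
complete_fetches(l1_cache, memory, approximator, pending)
second = load(0x18, l1_cache, approximator, pending)   # miss: approximated
third = load(0x10, l1_cache, approximator, pending)    # hit: precise
```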
Approximator Design
[Diagram: the instruction address and the global history buffer (GHB) feed a hash function h that indexes the approximator table. Each table entry holds a tag, conf, degree, and a local history buffer (LHB). A function g computes the approximation from the LHB.]
Example over time:
- GHB holds 1.0, 2.2, 3.1. The selected entry's LHB holds 4.1, 3.9, 4.0.
- Load miss on A.
- h: index the table with PC ⊕ 1.0 ⊕ 2.2 ⊕ 3.1.
- g: A_approx = (4.1 + 3.9 + 4.0) / 3 = 4.0.
- do_work(A_approx): the core continues with the approximate value.
- request(A_actual): the actual value is requested from memory.
- A_actual = 4.2 arrives.
- Train: GHB becomes 2.2, 3.1, 4.2 and LHB becomes 3.9, 4.0, 4.2.
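The example can be reproduced with a small model of the table. This is a sketch, not the hardware design: the class and method names, the 256-entry table, XORing the 64-bit IEEE 754 patterns of the history values, and the hypothetical PC value are all assumptions. The tag, conf, and degree fields are left out to keep the focus on the hash and the averaging step.

```python
import struct
from collections import deque


def float_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 pattern so it can be XORed."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


class Approximator:
    def __init__(self, table_size=256, ghb_len=3, lhb_len=3):
        self.table_size = table_size
        self.lhb_len = lhb_len
        self.ghb = deque(maxlen=ghb_len)   # global history buffer
        self.table = {}                    # index -> local history buffer

    def _index(self, pc):
        # h: XOR the instruction address with the GHB values.
        key = pc
        for value in self.ghb:
            key ^= float_bits(value)
        return key % self.table_size

    def approximate(self, pc):
        """Return (index, A_approx); A_approx is None without history."""
        index = self._index(pc)
        lhb = self.table.get(index)
        if not lhb:
            return index, None
        return index, sum(lhb) / len(lhb)  # g: average of the LHB

    def train(self, index, actual):
        """Push A_actual into the entry's LHB and into the GHB."""
        lhb = self.table.setdefault(index, deque(maxlen=self.lhb_len))
        lhb.append(actual)
        self.ghb.append(actual)


approximator = Approximator()
approximator.ghb.extend([1.0, 2.2, 3.1])
pc = 0x400A10                                   # hypothetical load PC
index, _ = approximator.approximate(pc)
approximator.table[index] = deque([4.1, 3.9, 4.0], maxlen=3)

index, a_approx = approximator.approximate(pc)  # about 4.0
approximator.train(index, 4.2)                  # A_actual arrives later
```

The index is computed at approximation time and reused for training, because the GHB may have changed by the time A_actual returns.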
Approximator Design – Other Considerations
- Floating-point precision
- History buffer sizes
- Stale values
More details in paper.
Approximator Design
Relaxed Confidence Windows
- How do we avoid making bad approximations?
- Trade off performance against error.
Approximation Degree
- Do we need to fetch the actual value from memory every time?
- Trade off energy against error.
Relaxed Confidence Windows
- Example: from LHB values 4.1, 3.9, 4.0, the approximator generates A_approx = 4.0, but memory returns A_actual = 9.0!
- Each approximator table entry keeps a confidence counter (conf) alongside its tag, degree, and LHB.
When approximating:
    if conf >= 0: use A_approx
    else: don't use A_approx
When updating:
    if A_approx and A_actual differ by <= CONF_WINDOW%: conf++
    else: conf--
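The two rules above might look like this in code. Several details are assumptions made for the sketch: the difference is measured relative to A_actual, the counter saturates in a hypothetical range of -8 to 7, and the 10% window is just one setting.

```python
CONF_WINDOW = 0.10            # hypothetical setting: 10%
CONF_MIN, CONF_MAX = -8, 7    # hypothetical saturating counter range


def should_use(conf):
    """When approximating: use A_approx only if confidence is non-negative."""
    return conf >= 0


def update_conf(conf, a_approx, a_actual, window=CONF_WINDOW):
    """When updating: raise conf if A_approx fell inside the window around
    A_actual, otherwise lower it. Relative difference is an assumption."""
    if a_actual == 0:
        within = a_approx == 0
    else:
        within = abs(a_approx - a_actual) <= window * abs(a_actual)
    conf = conf + 1 if within else conf - 1
    return max(CONF_MIN, min(CONF_MAX, conf))


conf = 0
conf = update_conf(conf, 4.0, 4.2)   # close enough: conf rises
after_good = conf
conf = update_conf(conf, 4.0, 9.0)   # far off: conf falls
conf = update_conf(conf, 4.0, 9.0)   # far off again: conf goes negative
usable = should_use(conf)

# A window of 0% behaves like exact-match confidence.
strict = update_conf(0, 4.0, 4.2, window=0.0)
```

A wider window lets more near-misses count as correct, so confidence stays high and more loads are approximated, at the cost of more error.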
Relaxed Confidence Windows – Output Error
Varying CONF_WINDOW%:
[Figure: output error (axis from 0% to 100%) for CONF_WINDOW% of 0%, 5%, 10%, 20%, and infinite.]
Relaxed Confidence Windows – L1-D MPKI
Varying CONF_WINDOW%:
[Figure: normalized L1-D MPKI (axis from 0.0 to 1.0) for CONF_WINDOW% of 0%, 5%, 10%, 20%, and infinite.]
Approximation Degree
- Example: the approximator generates A_approx = 4.0, and the fetch from memory returns A_actual = 4.0. When the approximation is already accurate, fetching A_actual spends energy without changing the result.
- Each approximator table entry keeps a degree counter alongside its tag, conf, and LHB.
When approximating:
    if degree == APPROX_DEGREE: fetch A_actual
    else: don't fetch A_actual
When updating:
    if degree == APPROX_DEGREE: degree = 0
    else: degree++
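A minimal sketch of the degree counter follows. It folds the approximating and updating rules into one hypothetical function, `on_approximation`, and picks APPROX_DEGREE = 4 purely for illustration.

```python
APPROX_DEGREE = 4   # hypothetical setting


def on_approximation(degree, approx_degree=APPROX_DEGREE):
    """Return (fetch_actual, new_degree) for one approximated load.
    Fetch A_actual only when the counter has reached APPROX_DEGREE."""
    if degree == approx_degree:
        return True, 0          # fetch, then reset the counter
    return False, degree + 1    # skip the fetch, count the approximation


degree = 0
fetches = []
for _ in range(10):             # ten approximated loads to one entry
    fetch, degree = on_approximation(degree)
    fetches.append(fetch)
```

Under these rules, a setting of 4 fetches once for every five approximated loads, and a setting of 0 fetches every time.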
Approximation Degree – Output Error
Varying APPROX_DEGREE:
[Figure: output error (axis from 0% to 100%) for APPROX_DEGREE of 1, 2, 4, 8, and 16.]
Approximation Degree – L1-D Fetches
Varying APPROX_DEGREE:
[Figure: normalized L1-D fetches (axis from 0.2 to 1) for APPROX_DEGREE of 1, 2, 4, 8, and 16.]
Methodology
Multi-threaded approximate applications
- PARSEC benchmark suite [Bienia, Princeton 2011]
- Programmer annotations and ISA extensions [Esmaeilzadeh, ASPLOS 2012]
Approximator design space exploration
- Pin dynamic binary instrumentation tool [Luk, PLDI 2005]
Full-system simulation
- FeS2 cycle-level x86 simulator [Neelakantam, ASPLOS 2008]
Approximator, cache and memory energy consumption
- CACTI modeling tool [Thoziyoor, HP 2008]
Evaluation
[Figure: application speedup and energy savings (axis from 0% to 16%) for APPROX_DEGREE of 4 and 16.]
- Up to 28% speedup
- Up to 44% energy savings
- Reduces L1-D MPKI by 30% over a traditional value predictor and prefetcher.
Conclusion
Load Value Approximation
- Approximate Value Locality
- Non-Speculative
- Relaxed Confidence Windows
- Approximation Degree
↑ performance, ↓ energy consumption, low output error