Stream-based Memory Specialization for General Purpose Processors
Zhengrong Wang
Prof. Tony Nowatzki
Computation & Memory Specialization

SIMD and dataflow accelerators introduce new ISA abstractions for certain computation patterns. Can we introduce a new ISA abstraction for memory access patterns?

[Figure: a core with a dataflow accelerator; a memory access accelerator produces the stream b[0], b[1], b[2], ..., feeding the indirect accesses a[b[0]], a[b[1]], a[b[2]], ...]
[Figure: a stream engine between the core and memory fetches b[0], b[1], b[2], ... and the dependent accesses a[b[0]], a[b[1]], a[b[2]], ...]

The stream abstraction can:
– Decouple memory access.
– Enable efficient prefetching.
– Leverage stream information in cache policies.
[Figure: an O3 core repeatedly issues addr → load → if → add → br; the loads miss in the L1 cache, go to the L2 cache, and the core waits for the value.]

while (i < N) {
  if (cond) v += a[i];
  i++;
}

Overhead 1: Hard to prefetch with control flow.
Overhead 2: Repeated address computation/loads.
Overhead 3: Assumption of reuse in the cache.
[Figure: a stream engine (SE), configured before the loop, prefetches a[i] ahead of the core; the core's loads now hit in the L1 cache.]

Opportunity 1: Prefetch with control flow.

cfg(a[i]);
while (i < N) {
  if (cond) v += a[i];
  i++;
}
[Figure: the stream engine now delivers prefetched elements into a FIFO, so the addr/load instructions disappear from the core; only if/add/br remain.]

Opportunity 1: Prefetch with control flow.
Overhead 2: Repeated address computation/loads.
Opportunity 2: Semi-binding prefetch.

s_a = cfg();
while (i < N) {
  if (cond) v += s_a;
  i++;
}
[Figure: a stream with no locality is fetched into the FIFO directly from the L2 cache, bypassing the L1 cache.]

Opportunity 1: Prefetch with control flow.
Overhead 2: Repeated address computation/loads.
Opportunity 2: Semi-binding prefetch.
Overhead 3: Assumption of reuse in the cache.
Opportunity 3: Better policies, e.g. bypass a cache level if no locality.

s_a = cfg();
while (i < N) {
  if (cond) v += s_a;
  i++;
}
– Decoupled access/execute: Outrider [ISCA’11], DeSC [MICRO’15], etc. Ours: a new ISA abstraction for the access engine.
– Prefetching: stride prefetchers, IMP [MICRO’15], etc. Ours: the access pattern is explicit in the ISA.
– Cache management: counter-based bypassing [ICCD’05], LLC bypassing [ISCA’11], etc. Ours: incorporate static stream information.
[Chart: breakdown of dynamic memory accesses per benchmark (0–100%) into Affine, Indirect, Pointer-Chase (PC), Unqualified, and Outside-loop.]
Takeaway: support indirect streams.
[Chart: distribution of stream lengths (>1k, >100, >50, >0 iterations) per benchmark: pca, rbm, disparity, lbm_s, sphinx, srr, svm, xz_s, avg.]
Takeaway: support long streams to capture long-term behavior, with low overhead for short streams.
[Chart: number of execution paths within the loop (1, 2, 3, >3) per benchmark.]
Decouple from control flow.
Original C Code:

int i = 0;
while (i < N) {
  sum += a[i];
  i++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a);

Stream Dependence Graph:

[Figure: the induction stream i (stepped by i++) feeds the affine stream a[i]; each iteration, the pseudo-register s_a holds the element at the next address (0x400, 0x404, 0x408, ...), advanced by the user step.]
Original C Code:

int i = 0, j = 0;
while (cond) {
  if (a[i] < b[j]) i++;
  else j++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a, s_j, s_b);
while (cond) {
  if (s_a < s_b) stream_step(s_i);
  else stream_step(s_j);
}
stream_end(s_i, s_a, s_j, s_b);

Stream Dependence Graph:

[Figure: two induction streams i and j feed the affine streams a[i] and b[j]; each user step advances only the stream chosen by the branch.]
Original C Code:

int i = 0;
while (i < N) {
  sum += a[b[i]];
  i++;
}

Stream Decoupled Pseudo Code:

stream_cfg(s_i, s_a, s_b);
while (s_i < N) {
  sum += s_a;
  stream_step(s_i);
}
stream_end(s_i, s_a, s_b);

Stream Dependence Graph:

[Figure: the induction stream i feeds the affine stream b[i] (addresses 0x400, 0x404, 0x408, ...), which in turn feeds the indirect stream a[b[i]] (addresses 0x86c, 0x888, 0x668, ...); one user step advances the whole chain.]
– stream_cfg communicates the stream configuration; the pseudo-register holds the current iteration’s data.
– Implicit assumptions: stream elements will be used, and streams are long.
– Load → first use of the pseudo-register after it is configured/stepped. Store → every write to the pseudo-register.
Rich Information → Better Policies

– The compiler conveys stream information through the ISA to hardware: memory footprint, reuse distance, modified?, conditionally used?, indirect, …
– Hardware leverages it in better policies: prefetch throttling, cache replacement, cache bypassing, sub-line transfer, …
[Figure: streams s_a and s_b flow through the core, L1$, and L2$. The stream over a[N][N] has reuse distance O(N); the stream over b[N][N] has reuse distance O(N × N).]
[Figure: the stream FIFO holds prefetched elements (addresses 0x400, 0x404, 0x408, 0x40c, 0x410) ahead of the core.]

Misspeculation recovery:
– On a misspeculated stream step: decrement the iteration map. No need to flush the FIFO and re-fetch data (decoupled)!
– On more severe misspeculation: revert the stream states, including the stream FIFO.
– Compiler: identify stream candidates, generate stream configuration, transform the program.
– Benchmarks: SPEC2017 C/C++ benchmarks and CortexSuite.
– Simulation: 10-million-instruction SimPoints, ~10 SimPoints per benchmark.
– Pf-Stride: table-based stride prefetcher.
– Pf-Helper: SMT-based ideal helper thread; requires no HW resources (ROB, etc.); runs exactly 1k instructions ahead of the main thread.
– SSP-Non-Bind: prefetch only.
– SSP-Semi-Bind: + semi-binding prefetch.
– SSP-Cache-Aware: + stream-aware cache bypassing.
[Chart: speedup of Pf-Stride, SSP-Non-Bind, SSP-Semi-Bind, SSP-Cache-Aware, and Pf-Helper, per benchmark.]
[Chart: dynamic instructions after the stream transformation, split into remaining and added instructions.]
[Chart: speedup of semi-binding prefetch vs. non-binding prefetch.]
[Charts: energy vs. speedup on SPEC CPU 2017 and CortexSuite for OOO[2,6,8], Pf-Stride[2,6,8], Pf-Helper[2,6,8], and SSP-Cache-Aware[2,6,8].]
– ISA/Microarchitecture extension. – Stream-aware cache bypassing.
– New direction for improving cache architectures. – Combine memory and computation specialization.