Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetching, DPC3@ISCA ‘19
Samuel Pakalapati (Intel Technology Pvt. Ltd. and BITS Pilani) and Biswabandan Panda (Indian Institute of Technology Kanpur)
Why a Bouquet?
No single IP-based prefetcher performs well across all applications
Our Goal: Idealistic Though ☺
Diagram: Core, L1, L1 Prefetcher
L1 hit rate of 100% (a dream ☺)
Reality with SPEC CPU 2017 benchmarks provided by DPC3: L1 hit rate of 88.12%
What about L2? L2 hit rate of 23.55%
RIP Memory wall ☺
Zooming into the Prefetcher
Prefetcher inputs: Instruction Pointer (a.k.a. PC) and demand memory accesses (cache-line aligned addresses)
Prefetcher output: future memory accesses
We use the IP information: can eliminate compulsory misses ☺
Started with the simplest IP prefetcher: IP-Stride
IP-Stride Prefetcher [Fu et al. MICRO ‘92]
Good for constant strides
Table fields: IP, Last-address, Stride
Prefetch Address = Current Address + Stride
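The table and formula above can be sketched as follows. This is only an illustration, not the original design: a Python dict stands in for the hardware table, addresses are cache-line numbers, there is no confidence counter, and the class and method names are hypothetical.

```python
class IPStridePrefetcher:
    """Toy IP-stride prefetcher; addresses are cache-line numbers."""

    def __init__(self):
        # ip -> (last_address, stride); a dict stands in for the hardware table
        self.table = {}

    def access(self, ip, address):
        """Record a demand access and return a predicted prefetch address, if any."""
        prediction = None
        if ip in self.table:
            last_address, _ = self.table[ip]
            stride = address - last_address
            if stride != 0:
                # Prefetch Address = Current Address + Stride
                prediction = address + stride
            self.table[ip] = (address, stride)
        else:
            # First time this IP is seen: nothing to predict yet
            self.table[ip] = (address, 0)
        return prediction


pf = IPStridePrefetcher()
predictions = [pf.access(ip=0x400, address=a) for a in (10, 12, 14)]
```

With a constant stride of 2, the second and third accesses predict lines 14 and 16.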
Our Bouquet
First IP prefetcher: Constant stride
Constant-stride prefetcher (CS class)
Table fields: IP_index, IP_tag, Valid?, Last_page, Page_offset, Stride, Confidence
Page_offset: in [0,63], the cache-line offset within a 4KB OS page
If (current_page = last_page), the stride is learned within a page
Page boundary learning: if (current_page = last_page±1), Stride = 64±(page_offset_new-page_offset_old)
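A small sketch of the stride computation may help here. It rests on one interpretation of the slide's formula: the ± is read as adding or subtracting one page's worth of cache lines (64) depending on whether the access moved to the next or the previous page, which gives a signed stride. The function name and the `None` return for distant pages are assumptions.

```python
LINES_PER_PAGE = 64  # 4 KB page / 64 B cache lines, so page offsets lie in [0, 63]


def compute_stride(last_page, last_offset, cur_page, cur_offset):
    """Return the signed stride in cache lines, or None if it cannot be learned."""
    page_delta = cur_page - last_page
    if page_delta not in (-1, 0, 1):
        return None  # only same-page and adjacent-page accesses are handled
    # Same page: plain offset difference. Adjacent page: shift by one page of lines.
    return page_delta * LINES_PER_PAGE + (cur_offset - last_offset)


same_page = compute_stride(7, 10, 7, 13)   # stays inside page 7
next_page = compute_stride(7, 62, 8, 1)    # crosses into page 8
prev_page = compute_stride(7, 1, 6, 62)    # crosses back into page 6
```

Under this reading, a stride of +3 or -3 is recovered unchanged even when the access stream crosses a page boundary.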
Valid Bit?
Two different IP_tags can map to the same IP_index
Example: IPA allocates the entry: V=1. IPB maps to the same entry: V=0. IPA accesses the entry again while V=0: V=1
If V=0 and the IP_tag is different, clear the entry and reset its confidence to zero
Behaves roughly like a 2-way associative cache; minimizes collisions
Table fields: IP_index, IP_tag, Valid?, Last_page, Page_offset, Stride, Confidence (IPA and IPB contend for one entry)
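The valid-bit rule can be sketched as a small state update. This is a reading of the slide, not the authors' code: the `Entry` class, the `touch` function, and the choice to set V=1 on reallocation are assumptions made for illustration.

```python
class Entry:
    """One table entry, reduced to the fields the valid-bit rule needs."""

    def __init__(self):
        self.tag = None
        self.valid = False
        self.confidence = 0


def touch(entry, ip_tag):
    """Apply the valid-bit rule; return True if ip_tag owns the entry afterwards."""
    if entry.tag == ip_tag:
        entry.valid = True      # owner seen again: re-arm the valid bit
        return True
    if entry.valid:
        entry.valid = False     # first conflict: keep the owner, drop V to 0
        return False
    entry.tag = ip_tag          # V=0 and a different tag: clear and reallocate
    entry.valid = True          # assumption: a fresh owner starts with V=1
    entry.confidence = 0
    return True


entry = Entry()
owners = [touch(entry, tag) for tag in ("A", "B", "A", "B", "B")]
```

A single conflicting IP does not evict the owner; it takes two conflicts in a row, with no intervening access by the owner, to replace the entry.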
Constant Stride Class
IP accesses X, X+2, X+4, ...: constant stride of 2
IP accesses X, X+3, X+4, X+2, ...: variable stride of ?
Signature Path Prefetching, DPC-2, MICRO ‘16
Our Bouquet
First IP prefetcher: Constant stride
Second IP prefetcher: Complex stride
Complex Stride (CPLX Class) [Kim et al., DPC-2/MICRO ‘16]
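Table fields: IP, Signature, Stride, Confidence
Example entry: IPA, SigA (+1, +2, +3), confidence 2/3
We call it Delta Prediction Table (DPT)
Example delta stream: +1 +2 +3 -3 +1 +2 +3 -4 +1 +2 +3 -3 (after the signature (+1, +2, +3), the next stride is -3 in 2 of 3 occurrences, hence the confidence of 2/3)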
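The example stream can be replayed with a toy delta prediction table. This sketch makes several simplifying assumptions: the signature is the literal tuple of the last three strides (hardware would presumably hash it into a few bits), counts are unbounded, and `train_dpt` and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict


def train_dpt(deltas, history=3):
    """Map each signature (last `history` strides) to counts of the stride that followed."""
    dpt = defaultdict(Counter)
    for i in range(history, len(deltas)):
        signature = tuple(deltas[i - history:i])
        dpt[signature][deltas[i]] += 1
    return dpt


def predict(dpt, signature):
    """Return (most frequent next stride, confidence) for a signature, or None."""
    counts = dpt.get(tuple(signature))
    if not counts:
        return None
    stride, hits = counts.most_common(1)[0]
    return stride, hits / sum(counts.values())


stream = [+1, +2, +3, -3, +1, +2, +3, -4, +1, +2, +3, -3]
dpt = train_dpt(stream)
prediction = predict(dpt, (1, 2, 3))
```

For the signature (+1, +2, +3), the sketch predicts a stride of -3 with confidence 2/3, which matches the example entry.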
From Stride to Stream: Global Stream
Access sequence: X, X+1, Y, Y+4, Z, ... (issued by IPX, IPY, IPZ)
IPX drives the global stream: Y=X+2 and Z=X+7
IP independence can provide better coverage and timeliness
Our Bouquet
First IP prefetcher: Constant stride
Second IP prefetcher: Complex stride
Third IP prefetcher: Global stream
Global Stream (GS Class)
Table fields: IP, Stream Valid? (0/1), Stream Direction (+/-), Stream Strength?
Example entry: IPX, Yes, +/-, Strong
GHB (Global History Buffer): n entries of recent accesses, e.g. X, X+1, Y, Y+1, Z, ...
If n/2 GHB hits: the stream is valid. If 3n/4 hits: the stream is strong
Prefetches issued: X+1, X+2, ..., X+PrefetchDegree
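One possible reading of the GHB check is sketched below. The slide does not define what counts as a GHB hit, so this sketch assumes that a hit means one of the n neighbouring cache lines behind the current access (in the candidate direction) is present in the GHB. It also assumes the thresholds are inclusive. The function name is hypothetical.

```python
def classify_stream(ghb, line_addr, n):
    """Return (valid, direction, strong) for an access at cache line `line_addr`."""
    recent = set(ghb)
    best_direction, best_hits = +1, 0
    for direction in (+1, -1):
        # A forward (+) stream leaves lower-numbered lines behind it in the GHB,
        # a backward (-) stream leaves higher-numbered lines.
        hits = sum((line_addr - direction * k) in recent for k in range(1, n + 1))
        if hits > best_hits:
            best_direction, best_hits = direction, hits
    valid = best_hits >= n / 2        # n/2 GHB hits: stream is valid
    strong = best_hits >= 3 * n / 4   # 3n/4 GHB hits: stream is strong
    return valid, best_direction, strong


dense = classify_stream([100, 101, 102, 103, 500, 104, 105], line_addr=106, n=8)
sparse = classify_stream([100, 101, 102, 103], line_addr=104, n=8)
backward = classify_stream([207, 206, 205, 204, 203, 202], line_addr=201, n=8)
```

In this sketch the accesses in the GHB may come from any IP, which is what lets one IP benefit from a stream that several IPs build together.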
Our Bouquet
First IP prefetcher: Constant stride
Second IP prefetcher: Complex stride
Third IP prefetcher: Global stream
Fourth prefetcher: Next-line
No-IP: Next-line (NL Class)
Prefetch Address = Current Address + 1
Detrimental to performance in case of irregular accesses
Speculative NL: NL is ON if L1 Misses Per Kilo Cycles (MPKC) is low (< 15 for single-core); NL is OFF otherwise
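The speculative gating can be sketched in a few lines. The threshold of 15 is the single-core value from the slide. The function names, and the idea of computing MPKC from raw miss and cycle counters, are assumptions for illustration.

```python
MPKC_THRESHOLD = 15  # single-core threshold quoted on the slide


def next_line_enabled(l1_misses, cycles):
    """NL is ON only while L1 misses per kilo cycles stay below the threshold."""
    mpkc = 1000 * l1_misses / cycles  # assumes cycles > 0
    return mpkc < MPKC_THRESHOLD


def next_line_prefetch(line_addr, l1_misses, cycles):
    """Return Current Address + 1 when NL is ON, otherwise None."""
    if next_line_enabled(l1_misses, cycles):
        return line_addr + 1
    return None


quiet = next_line_prefetch(100, l1_misses=10, cycles=1000)   # MPKC = 10
busy = next_line_prefetch(100, l1_misses=20, cycles=1000)    # MPKC = 20
```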
The Bouquet
Constant Stride (CS class)
Complex Stride (CPLX class)
Global Stream (GS class)
Next Line (NL class)
Design Choice: A hardware table for each class?
Our Proposal: IPCP, a single hardware table for all the classes
Our Proposal (IPCP at L1)
IP table fields: IP, Valid?; CS: Page no., Page offset, Stride, Confidence; CPLX: Signature; GS: Stream valid?, Direction, Strength
DPT fields: Stride, Confidence
GHB: recent accesses (X ... Z)
Input: L1 access [IP, Access address]
Priority of classes: GS > CS > CPLX > NL
Prefetch Degree: GS: 6, CS and CPLX: 3
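The priority rule and the per-class degrees can be expressed as a simple lookup. This is a sketch under assumptions: the set of eligible classes is taken as given, the NL degree of 1 is inferred from "Current Address + 1" and is not stated on this slide, and `select_class` is a hypothetical name.

```python
PRIORITY = ("GS", "CS", "CPLX", "NL")              # GS > CS > CPLX > NL
DEGREE = {"GS": 6, "CS": 3, "CPLX": 3, "NL": 1}    # NL degree of 1 is an assumption


def select_class(eligible):
    """Pick the highest-priority class among those the IP currently qualifies for."""
    for name in PRIORITY:
        if name in eligible:
            return name, DEGREE[name]
    return None  # no class applies, so no prefetch is issued


choice_a = select_class({"CPLX", "CS"})
choice_b = select_class({"NL", "GS"})
```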
Our Proposal (IPCP at L2)
Table fields: IP, Valid?, Class_type, Stride
Class_type values: GS, CS, CPLX, NL, NO
Inputs: L1 misses, plus metadata from the L1 prefetcher (trained stride, stream direction)
No IP classification at the L2; the table is built from the metadata
No prefetching for the CPLX class
Prefetch Degree: 4 for GS and 4 for CS if the MSHR is less than half full, else 3
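The degree rule at the L2 can be sketched as below. The slide's wording is ambiguous about whether the MSHR condition applies to both classes. This sketch assumes it does, and leaves NL and NO unspecified. The function and parameter names are hypothetical.

```python
def l2_prefetch_degree(class_type, mshr_occupancy, mshr_size):
    """Return the L2 prefetch degree for a class, or None where the slide is silent."""
    if class_type in ("GS", "CS"):
        # Assumption: the "less than half full" test applies to both GS and CS
        return 4 if mshr_occupancy < mshr_size / 2 else 3
    if class_type == "CPLX":
        return 0  # no prefetching for the CPLX class at the L2
    return None   # NL and NO handling is not specified on the slide


light_load = l2_prefetch_degree("GS", mshr_occupancy=3, mshr_size=16)
heavy_load = l2_prefetch_degree("CS", mshr_occupancy=8, mshr_size=16)
```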
Metadata
Each L1 prefetch packet carries metadata: Stride (7 bits), Class-type (3 bits), SPEC_NL (1 bit)
Stream direction in case of GS class type
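The 11 bits of metadata can be packed into one integer, as sketched here. Only the field widths come from the slide. The bit positions, the two's-complement stride encoding, the numeric class codes, and the function names are all assumptions. For the GS class, the stride field would carry the stream direction under this reading.

```python
CLASS_CODES = {"NO": 0, "NL": 1, "CS": 2, "CPLX": 3, "GS": 4}  # hypothetical encoding


def pack_metadata(stride, class_type, spec_nl):
    """Pack stride (7 bits), class type (3 bits) and SPEC_NL (1 bit) into 11 bits."""
    if not -64 <= stride <= 63:
        raise ValueError("stride does not fit in 7 signed bits")
    stride_bits = stride & 0x7F  # 7-bit two's complement
    return stride_bits | (CLASS_CODES[class_type] << 7) | (int(spec_nl) << 10)


def unpack_metadata(word):
    """Inverse of pack_metadata."""
    stride = word & 0x7F
    if stride >= 64:
        stride -= 128  # restore the sign
    code = (word >> 7) & 0x7
    class_type = next(name for name, c in CLASS_CODES.items() if c == code)
    spec_nl = bool((word >> 10) & 1)
    return stride, class_type, spec_nl


word = pack_metadata(-3, "CS", True)
decoded = unpack_metadata(word)
```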
Hardware Overhead
Table: Entry size * #Entries = Total
IP Table: 77 * 1024 (L1) + 17 * 1024 (L2) bits = 12.03 KB
DPT Table: 9 * 4096 bits = 4.6 KB
GHB Table: 16 * 58 bits = 928 bits
Others: 100 bits 86 bits
Total: 16.7 KB
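The storage figures can be checked with a little arithmetic. The sketch assumes 1 KB = 1000 bytes, since that is what reproduces 12.03 KB. It leaves out the "Others" row because its values are not clear from the slide.

```python
BITS_PER_BYTE = 8


def bits_to_kb(bits):
    """Convert bits to kilobytes, taking 1 KB = 1000 bytes (an assumption)."""
    return bits / BITS_PER_BYTE / 1000


ip_table_bits = 77 * 1024 + 17 * 1024   # L1 and L2 IP tables
dpt_bits = 9 * 4096
ghb_bits = 16 * 58

ip_table_kb = bits_to_kb(ip_table_bits)
dpt_kb = bits_to_kb(dpt_bits)
subtotal_kb = bits_to_kb(ip_table_bits + dpt_bits + ghb_bits)
```

The three tables sum to about 16.76 KB before the small "Others" term, in line with the reported total of 16.7 KB.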
Single-core Performance [SPEC CPU 2017]
[Figure: bar chart of single-core performance per SPEC CPU 2017 trace, 600.perlbench_s-570B through 657.xz_s-2302B, plus Geomean; axis range 0.9 to 2.5, with two off-scale values labeled 3.59 and 3.02]
On average: 43.75% improvement
Multi-core: 25 mixes, 22% improvement
Distribution of IP Classes
[Figure: share of the GS, CS, CPLX, and NL classes (0% to 100%) per SPEC CPU 2017 trace, 600.perlbench_s-570B through 657.xz_s-2302B, plus Mean]
On average, all classes trigger equally
Comparison with the State-of-the-art: Performance [Higher the better]
Average improvement in % (chart axis from 30 to 46):
BO [HPCA '16, DPC-2 Winner]: 34.53
SPP + Perceptron Filter [ISCA '19]: 40.40
IPCP: 43.75
Key Takeaways
Access patterns can be classified based on IPs (IPCP)
Classification at the L1, reuse at the L2 through metadata
Simple and modular collection of prefetchers
High performance and low hardware overhead
Prefetchers like ISB [MICRO ‘13] and IMP [MICRO ‘15] can be added to the bouquet seamlessly
Dream ☺ ?
With IPCP, L1 hit rate jumps from 88.11% to 92.43% ☺
With IPCP, L2 hit rate jumps from 23.55% to 51.82% ☺
“Great things are done by a series of small things brought together”
Vincent van Gogh, Dutch painter