
Classifier-based Hardware Prefetching DPC3@ISCA 19 Samuel - PowerPoint PPT Presentation

Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetching. DPC3@ISCA '19. Samuel Pakalapati (Intel Technology Pvt. Ltd. and BITS Pilani) and Biswabandan Panda (Indian Institute of Technology Kanpur)


  1. Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetching. DPC3@ISCA '19. Samuel Pakalapati (Intel Technology Pvt. Ltd. and BITS Pilani) and Biswabandan Panda (Indian Institute of Technology Kanpur)

  2. Why a Bouquet? No single IP-based prefetcher performs well across all applications.

  3. Our Goal: Idealistic Though ☺. Core with an L1 and an L1 prefetcher: an L1 hit rate of 100% (a dream ☺), RIP memory wall ☺. Reality with the SPEC CPU 2017 benchmarks provided by DPC3: L1 hit rate of 88.12%. What about L2? 23.55%.

  4. Zooming into the Prefetcher. Inputs: the instruction pointer (a.k.a. PC) and demand memory accesses (cache-line-aligned addresses). Output: future memory accesses. We use the IP information: can eliminate compulsory misses ☺. Started with the simplest IP prefetcher: IP-Stride.

  5. IP-Stride Prefetcher [Fu et al., MICRO '92]. Table fields: IP, Last-address, Stride. Prefetch Address = Current Address + Stride. Good for constant strides.

  6. Our Bouquet. First IP prefetcher: Constant stride.

  7. Constant-stride prefetcher (CS class). Table fields: IP_index, IP_tag, Valid?, Last_page, Page_offset, Stride, Confidence. Page_offset is the cache-line offset within a 4 KB OS page, in [0, 63]. If current_page = last_page, the stride is within a page. Page-boundary learning: if current_page = last_page ± 1, Stride = 64 ± (page_offset_new - page_offset_old).
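
The page-boundary rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `compute_stride` is hypothetical, the sign convention (add 64 when moving to the next page, subtract 64 when moving to the previous page) is one reading of the "64 ±" formula, and returning `None` for non-adjacent pages is an assumption.

```python
LINES_PER_PAGE = 64  # 4 KB page / 64 B cache lines, so page_offset is in [0, 63]


def compute_stride(last_page, last_offset, cur_page, cur_offset):
    """Return the stride in cache lines, or None when the pages are not adjacent."""
    delta = cur_offset - last_offset
    if cur_page == last_page:
        return delta                      # stride within a page
    if cur_page == last_page + 1:
        return delta + LINES_PER_PAGE     # crossed into the next page
    if cur_page == last_page - 1:
        return delta - LINES_PER_PAGE     # crossed into the previous page
    return None                           # assumption: nothing is learned across distant pages


# Offset 62 in page 5 followed by offset 1 in page 6 is a stride of +3 lines.
print(compute_stride(5, 62, 6, 1))
```

With this reading, a constant stride keeps the same value whether or not an access happens to cross a page boundary.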

  8. Valid Bit? Table fields: IP_index, IP_tag, Valid?, Last_page, Page_offset, Stride, Confidence. Two different IP_tags (IP A and IP B) can map to the same IP_index. IP A owns the entry: V=1. IP B maps to the same entry: V=0. IP A maps to the same entry again while V=0: V=1. If V=0 but the IP_tag is different, clear the entry and set the confidence to zero. This behaves roughly like a 2-way associative cache and minimizes collisions.
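
The valid-bit behaviour described above can be sketched as a small state update. This is an interpretation of the slide, not the authors' code: the `Entry` class and `touch` function are hypothetical names, and setting V=1 for the new owner after the entry is cleared is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    ip_tag: int = 0
    valid: bool = False
    confidence: int = 0


def touch(entry, ip_tag):
    """Update one IP-table entry for an access by ip_tag.

    Returns True if the entry tracks ip_tag after the update.
    """
    if entry.ip_tag == ip_tag:
        entry.valid = True        # the owner is back: restore V=1
        return True
    if entry.valid:
        entry.valid = False       # first conflict: drop V to 0 but keep the owner's state
        return False
    # V=0 and a different tag: clear the entry and hand it to the new IP
    entry.ip_tag = ip_tag
    entry.valid = True            # assumption: the new owner starts with V=1
    entry.confidence = 0
    return True


e = Entry()
touch(e, 0xA)   # IP A takes the empty entry
touch(e, 0xB)   # IP B conflicts once: IP A keeps the entry, V=0
touch(e, 0xA)   # IP A returns: V=1 again
```

The effect is that a single conflicting access does not evict a trained IP; only two conflicts without an intervening access by the owner do.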

  9. Constant Stride Class. One IP accessing X, X+2, X+4, …: constant stride of 2. One IP accessing X, X+3, X+4, X+2, …: variable stride of ? (Signature Path Prefetching, DPC-2, MICRO '16).

  10. Our Bouquet. First IP prefetcher: Constant stride. Second IP prefetcher: Complex stride.

  11. Complex Stride (CPLX Class) [Kim et al., DPC-2/MICRO '16]. Columns: IP, Signature, Stride, Confidence. Example row: IP A, Sig A (+1, +2, +3), -3, 2/3, learned from the stride stream +1 +2 +3 -3 +1 +2 +3 -4 +1 +2 +3 -3. We call it the Delta Prediction Table (DPT).
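
The DPT example above can be reproduced with a short sketch. It is a simplification under stated assumptions: the signature here is the literal tuple of the last three strides and confidence is a plain frequency, whereas hardware would use a hashed signature and a small saturating counter. The names `train_dpt` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict, deque


def train_dpt(strides, history=3):
    """Map each signature (the last `history` strides) to counts of the stride that followed."""
    dpt = defaultdict(Counter)
    window = deque(maxlen=history)
    for stride in strides:
        if len(window) == history:
            dpt[tuple(window)][stride] += 1
        window.append(stride)
    return dpt


def predict(dpt, signature):
    """Return (most frequent next stride, confidence), or None for an unseen signature."""
    counts = dpt.get(signature)
    if not counts:
        return None
    stride, hits = counts.most_common(1)[0]
    return stride, hits / sum(counts.values())


stream = [+1, +2, +3, -3, +1, +2, +3, -4, +1, +2, +3, -3]
dpt = train_dpt(stream)
stride, confidence = predict(dpt, (1, 2, 3))
```

For the stream on the slide, the signature (+1, +2, +3) was followed by -3 twice and by -4 once, so the sketch predicts -3 with confidence 2/3, matching the example row.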

  12. From Stride to Stream: Global Stream. The accesses X, X+1, Y, Y+4, Z, … come from IP X, IP Y, and IP Z. IP X drives the global stream: Y = X+2 and Z = X+7. IP independence can provide better coverage and timeliness.

  13. Our Bouquet. First IP prefetcher: Constant stride. Second IP prefetcher: Complex stride. Third IP prefetcher: Global stream.

  14. Global Stream (GS Class). Access stream: X, X+1, Y, Y+1, Z, …. IP table fields: IP, Stream Valid? (0/1), Stream Direction (+/-), Stream Strength?; example row: IP X, Yes, +/-, Strong. GHB (Global History Buffer): n entries. If n/2 GHB hits, the stream is valid; if 3n/4 hits, it is strong. Prefetches issued: X+1, X+2, …, X+PrefetchDegree.
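
The two GHB thresholds and the prefetch candidates above can be sketched as below. The function names are hypothetical, treating the thresholds as "at least n/2" and "at least 3n/4" is an assumption, and the value n = 8 in the usage line is arbitrary; how a GHB hit is detected is not modelled here.

```python
def classify_stream(ghb_hits, n):
    """Return (valid, strong) for an IP given its hit count in an n-entry GHB."""
    valid = ghb_hits >= n / 2          # assumption: "n/2 hits" means at least n/2
    strong = ghb_hits >= 3 * n / 4     # assumption: "3n/4 hits" means at least 3n/4
    return valid, strong


def stream_prefetches(line_addr, direction, degree):
    """Cache-line addresses X+1 .. X+degree, or X-1 .. X-degree for a negative stream."""
    step = 1 if direction >= 0 else -1
    return [line_addr + step * k for k in range(1, degree + 1)]


valid, strong = classify_stream(ghb_hits=6, n=8)
candidates = stream_prefetches(100, +1, 6) if valid else []
```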

  15. Our Bouquet. First IP prefetcher: Constant stride. Second IP prefetcher: Complex stride. Third IP prefetcher: Global stream. Fourth prefetcher: Next-line.

  16. No-IP: Next-line (NL Class). Prefetch Address = Current Address + 1. Detrimental to performance in case of irregular accesses. SPECULATIVE NL: NL is ON when L1 Misses Per Kilo Cycles (MPKC) is low (< 15 for single-core); NL is OFF otherwise.
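
The speculative next-line gate above amounts to a threshold on MPKC. A minimal sketch, assuming MPKC is computed as L1 misses per 1000 cycles over some measurement window; the function names and the window itself are hypothetical, and only the single-core threshold of 15 comes from the slide.

```python
MPKC_THRESHOLD = 15  # single-core threshold from the slide


def mpkc(l1_misses, cycles):
    """L1 misses per kilo cycles; cycles must be positive."""
    return 1000 * l1_misses / cycles


def next_line_enabled(l1_misses, cycles, threshold=MPKC_THRESHOLD):
    """NL is ON only while MPKC stays below the threshold."""
    return mpkc(l1_misses, cycles) < threshold


def next_line_prefetch(line_addr, l1_misses, cycles):
    """Return Current Address + 1 when NL is ON, otherwise None."""
    return line_addr + 1 if next_line_enabled(l1_misses, cycles) else None
```

For example, 10 misses in 1000 cycles keeps NL on, while 20 misses in 1000 cycles turns it off.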

  17. The Bouquet: Constant Stride (CS class), Complex Stride (CPLX class), Global Stream (GS class), Next Line (NL class). Design choice: a hardware table for each class? Our proposal: IPCP, a single hardware table for all the classes.

  18. Our Proposal (IPCP at L1). Input: L1 access [IP, access address]. A single IP table shared by the CS, CPLX, and GS classes, with fields: IP, Valid?, Page no., Page offset, Stride, Confidence, Signature, Stream valid?, Direction, Strength. Supporting structures: the GHB and a Stride/Confidence table. Priority of classes: GS > CS > CPLX > NL. Prefetch degree: GS: 6, CS and CPLX: 3.
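
The class priority and L1 prefetch degrees above can be expressed as a small lookup. This is a sketch only: `select_class` is a hypothetical name, and how an IP becomes eligible for each class is left out. The slide gives no degree for NL, so it is omitted from the table.

```python
PRIORITY = ["GS", "CS", "CPLX", "NL"]       # highest priority first
DEGREE = {"GS": 6, "CS": 3, "CPLX": 3}      # L1 prefetch degrees from the slide


def select_class(eligible):
    """Pick the highest-priority class among those the IP currently qualifies for."""
    for cls in PRIORITY:
        if cls in eligible:
            return cls
    return None


# An IP that qualifies for both CS and CPLX is handled as CS, with degree 3.
chosen = select_class({"CS", "CPLX"})
```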

  19. Our Proposal (IPCP at L2). On an L1 miss, the L1 prefetcher passes the class (GS, CS, CPLX, NL, or NO) together with the trained stride and stream direction. L2 table fields: IP, Valid?, Class_type, Stride. No IP classification at the L2; the table is constructed from metadata. No prefetching for the CPLX class. Prefetch degree: 4 for GS and 4 for CS if the MSHR is less than half full, else 3.

  20. Metadata in the L1 prefetch packet: Stride (7 bits; stream direction in case of GS class type), Class-type (3 bits), SPEC_NL (1 bit).
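
The 11 bits of metadata above can be packed into one integer. The field widths come from the slide; the bit positions, the two's-complement encoding of the stride, and the names `pack` and `unpack` are assumptions made for illustration.

```python
STRIDE_BITS, CLASS_BITS = 7, 3
STRIDE_MASK = (1 << STRIDE_BITS) - 1
CLASS_MASK = (1 << CLASS_BITS) - 1
SPEC_NL_SHIFT = STRIDE_BITS + CLASS_BITS


def pack(stride, class_type, spec_nl):
    """Assumed layout: bits 0-6 stride (two's complement), bits 7-9 class type, bit 10 SPEC_NL."""
    if not -64 <= stride <= 63:
        raise ValueError("stride does not fit in 7 bits")
    if not 0 <= class_type <= CLASS_MASK:
        raise ValueError("class_type does not fit in 3 bits")
    return (stride & STRIDE_MASK) | (class_type << STRIDE_BITS) | (int(bool(spec_nl)) << SPEC_NL_SHIFT)


def unpack(meta):
    """Inverse of pack: return (stride, class_type, spec_nl)."""
    stride = meta & STRIDE_MASK
    if stride >= 1 << (STRIDE_BITS - 1):
        stride -= 1 << STRIDE_BITS        # sign-extend the 7-bit field
    class_type = (meta >> STRIDE_BITS) & CLASS_MASK
    spec_nl = bool((meta >> SPEC_NL_SHIFT) & 1)
    return stride, class_type, spec_nl


meta = pack(-3, 2, True)
```

Unpacking `meta` gives back the stride -3, class type 2, and SPEC_NL set, which is the information the L2 needs to build its table without classifying IPs itself.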

  21. Hardware Overhead (table: entry size * #entries, total). IP table: 77 * 1024 (L1) + 17 * 1024 (L2) bits, 12.03 KB. DPT: 9 * 4096 bits, 4.6 KB. GHB: 16 * 58 bits, 928 bits. Others: 100 bits, 86 bits. Total: 16.7 KB.

  22. Single-core Performance [SPEC CPU 2017]. On average: 43.75% improvement. Multi-core: 25 mixes, 22% improvement. [Figure: per-trace bar chart over the SPEC CPU 2017 traces from 600.perlbench_s-570B to 657.xz_s-2302B, plus the geomean; y-axis ticks from 0.9 to 2.5; two off-scale bars are labelled 3.59 and 3.02.]

  23. Distribution of IP Classes. On average, all classes trigger equally. [Figure: per-trace stacked bars over the SPEC CPU 2017 traces from 600.perlbench_s-570B to 657.xz_s-2302B, plus the mean; y-axis from 0% to 100%; legend: GS, CS, CPLX, NL.]

  24. Comparison with the State-of-the-art: Performance [higher is better]. Average improvement in %: BO [HPCA '16, DPC-2 Winner] 34.53; SPP + Perceptron Filter [ISCA '19] 40.40; IPCP 43.75.

  25. Key Takeaways. Access patterns can be classified based on IPs (IPCP). Classification at the L1, reuse at the L2 through metadata. Simple and modular collection of prefetchers. Prefetchers like ISB [MICRO '13] and IMP [MICRO '15] can be added to the bouquet seamlessly. High performance and low hardware overhead.

  26. Dream ☺? With IPCP, the L1 hit rate jumps from 88.11% to 92.43% ☺. With IPCP, the L2 hit rate jumps from 23.55% to 51.82% ☺.

  27. “Great things are done by a series of small things brought together.” Vincent Van Gogh, Dutch painter. Thank You.
