Classifier-based Hardware Prefetching DPC3@ISCA 19 Samuel - - PowerPoint PPT Presentation



SLIDE 1

Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Hardware Prefetching
DPC3@ISCA ’19

Samuel Pakalapati (Intel Technology Pvt. Ltd. and BITS Pilani) and Biswabandan Panda (Indian Institute of Technology Kanpur)
SLIDE 2

Why a Bouquet?

No single IP-based prefetcher performs well across all applications.

SLIDE 3

Our Goal: Idealistic Though ☺

Diagram: Core, L1, L1 Prefetcher

L1 hit rate of 100% (a dream ☺)
Reality with the SPEC CPU 2017 benchmarks provided by DPC3: L1 hit rate of 88.12%
What about L2? 23.55%
RIP Memory wall ☺

SLIDE 4

Zooming into the Prefetcher

Diagram: the Instruction Pointer (a.k.a. PC) and the Demand Memory Accesses (cache-line aligned addresses) feed the Prefetcher, which predicts Future Memory Accesses.

We use the IP information: it can eliminate compulsory misses ☺
We started with the simplest IP prefetcher: IP-Stride

SLIDE 5

IP-Stride Prefetcher [Fu et al., MICRO ’92]

Good for constant strides

Table fields: IP | Last-address | Stride

Prefetch Address = Current Address + Stride
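A minimal Python sketch of the IP-stride idea on this slide, not the authors' implementation. It keeps the last address and last stride per IP and applies Prefetch Address = Current Address + Stride. The class name, the `degree` parameter, and the rule of waiting until the same non-zero stride repeats are assumptions made for illustration.

```python
class IPStridePrefetcher:
    """Toy IP-stride prefetcher; addresses are cache-line numbers."""

    def __init__(self, degree=1):
        self.degree = degree   # hypothetical knob: lines prefetched per trigger
        self.table = {}        # IP -> (last_address, last_stride)

    def access(self, ip, address):
        """Record a demand access and return the list of prefetch addresses."""
        prefetches = []
        if ip in self.table:
            last_address, last_stride = self.table[ip]
            stride = address - last_address
            # Assumption: prefetch only once the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetches = [address + k * stride
                              for k in range(1, self.degree + 1)]
            self.table[ip] = (address, stride)
        else:
            # First access from this IP: nothing to predict yet.
            self.table[ip] = (address, 0)
        return prefetches
```

With `degree=2`, an IP touching lines 100, 102, 104 triggers prefetches for 106 and 108 on the third access.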

SLIDE 6

Our Bouquet

First IP prefetcher: Constant stride

SLIDE 7

Constant-stride prefetcher (CS class)

Table fields: IP_index | IP_tag | Valid? | Last_page | Page_offset | Stride | Confidence

Page_offset: [0, 63], the cache-line offset within a 4 KB OS page

If (current_page = last_page), then the stride is within a page.
Page boundary learning: if (current_page = last_page ± 1), then Stride = 64 ± (page_offset_new - page_offset_old)
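One way to read the page-boundary rule is as the signed distance in cache lines between two accesses that land in the same or adjacent 4 KB pages. The sketch below follows that reading; it is an interpretation of the slide's formula, and the function name and the choice to return `None` for non-adjacent pages are assumptions.

```python
LINES_PER_PAGE = 64  # 4 KB page / 64 B cache line, so offsets fall in [0, 63]


def cs_stride(last_page, last_offset, cur_page, cur_offset):
    """Signed stride in cache lines, or None if the pages are not adjacent."""
    if cur_page == last_page:
        # Stride within a page.
        return cur_offset - last_offset
    if abs(cur_page - last_page) == 1:
        # Page boundary learning: add or subtract one page worth of lines.
        return (cur_page - last_page) * LINES_PER_PAGE + (cur_offset - last_offset)
    return None  # assumption: no stride is learned across distant pages
```

For example, going from offset 62 of page 5 to offset 0 of page 6 gives a stride of +2, the same stride the IP would show inside a page.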

SLIDE 8

Valid Bit?

Table fields: IP_index | IP_tag | Valid? | Last_page | Page_offset | Stride | Confidence
Example IPs sharing one entry: IPA, IPB

Two different IP_tags can map to the same IP_index.
IPA: V=1. IPB mapped to same entry: V=0. IPA: V=0; IPA mapped to same entry: V=1.
If V=0 but the IP_tag is different, then clear the entry and set the confidence to zero.
~ 2-way associative cache, minimizes collisions
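The valid-bit behaviour can be sketched as a small state update per table entry. This is one reading of the slide, not the authors' code: a conflicting IP first clears the valid bit, and only a second conflict while V=0 replaces the entry. The `Entry` class, the `touch` function, the `-1` empty marker, and setting V=1 on installation are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    ip_tag: int = -1      # -1 marks an empty entry (illustrative choice)
    valid: int = 0
    stride: int = 0
    confidence: int = 0


def touch(entry, ip_tag):
    """Apply one access to the entry; return True if it now tracks ip_tag."""
    if entry.ip_tag == ip_tag:
        entry.valid = 1            # same IP seen again: (re)assert the valid bit
        return True
    if entry.valid == 1:
        entry.valid = 0            # first conflict: keep the old IP, drop V to 0
        return False
    # V=0 and a different tag: clear the entry and hand it to the new IP.
    entry.ip_tag = ip_tag
    entry.valid = 1                # assumption: a fresh entry starts with V=1
    entry.stride = 0
    entry.confidence = 0
    return True
```

The effect is that an occasional colliding IP does not evict a frequently seen one, which is the sense in which the slide compares the scheme to a 2-way associative cache.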

SLIDE 9

Constant Stride Class

IP: X, X+2, X+4, ... Constant stride of 2
IP: X, X+3, X+4, X+2, ... Variable stride of ?
Signature Path Prefetching, DPC-2, MICRO ’16

SLIDE 10

Our Bouquet

First IP prefetcher: Constant stride
Second IP prefetcher: Complex stride

SLIDE 11

Complex Stride (CPLX Class) [Kim et al., DPC-2/MICRO ’16]

Table fields: IP | Signature | Stride | Confidence
Example entry: IPA | SigA (+1, +2, +3) | -3 | 2/3

We call it the Delta Prediction Table (DPT).

Stride sequence: +1 +2 +3 -3 +1 +2 +3 -4 +1 +2 +3 -3
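The example entry can be reproduced with a small counting sketch: treat the last three strides as the signature and count which stride follows each signature. This is only an illustration; using a raw tuple of strides as the signature and a counter per signature are assumptions, and a hardware table would more likely hash the signature and use saturating counters.

```python
from collections import Counter, defaultdict, deque
from fractions import Fraction


def train_dpt(strides, history=3):
    """Map each signature (the last `history` strides) to counts of the next stride."""
    dpt = defaultdict(Counter)
    window = deque(maxlen=history)
    for stride in strides:
        if len(window) == history:
            dpt[tuple(window)][stride] += 1
        window.append(stride)
    return dpt


def predict(dpt, signature):
    """Return (most frequent next stride, confidence), or None if the signature is unseen."""
    counts = dpt.get(tuple(signature))
    if not counts:
        return None
    stride, hits = counts.most_common(1)[0]
    return stride, Fraction(hits, sum(counts.values()))
```

Training on the stride sequence from the slide and querying the signature (+1, +2, +3) returns a stride of -3 with confidence 2/3, because -3 follows that signature twice and -4 once.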

SLIDE 12

From Stride to Stream: Global Stream

Access sequence: X, X+1, Y, Y+4, Z, ...
Issuing IPs: IPX, IPY, IPZ
IPX drives the global stream: Y=X+2 and Z=X+7
IP independence can provide better coverage and timeliness

SLIDE 13

Our Bouquet

First IP prefetcher: Constant stride
Second IP prefetcher: Complex stride
Third IP prefetcher: Global stream

SLIDE 14

Global Stream (GS Class)

Table fields: IP | Stream Valid? | Stream Direction | Stream Strength?
Example entry: IPX | Yes (0/1) | +/- | Strong

GHB (Global History Buffer): n entries
Access sequence: X, X+1, Y, Y+1, Z, ...
Prefetches: X+1, X+2, ..., X+PrefetchDegree

If n/2 GHB hits, the stream is valid; if 3n/4 hits, it is strong.
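A rough sketch of the n/2 and 3n/4 thresholds follows. The slide does not define what counts as a GHB hit, so the sketch assumes a hit means that one of the n cache lines immediately behind the current line (in the candidate stream direction) is present in the GHB. The function name and return format are hypothetical.

```python
def classify_stream(ghb, line, n):
    """Classify the access to `line` against recent line numbers held in `ghb`."""
    recent = set(ghb)
    best_hits, best_dir = 0, 0
    for direction in (+1, -1):
        # For an ascending stream (+1), the lines just below `line` were seen recently.
        hits = sum(1 for i in range(1, n + 1) if line - direction * i in recent)
        if hits > best_hits:
            best_hits, best_dir = hits, direction
    valid = 2 * best_hits >= n        # at least n/2 GHB hits
    strong = 4 * best_hits >= 3 * n   # at least 3n/4 GHB hits
    return {"valid": valid, "strong": strong,
            "direction": best_dir if valid else 0}
```

With n = 8, finding lines 92 through 99 in the GHB when line 100 is accessed marks the stream as valid, strong, and ascending; finding only 96 through 99 marks it valid but not strong.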

SLIDE 15

Our Bouquet

First IP prefetcher: Constant stride
Second IP prefetcher: Complex stride
Third IP prefetcher: Global stream
Fourth prefetcher: Next-line

SLIDE 16

No-IP: Next-line (NL Class)

Prefetch Address = Current Address + 1
Detrimental to performance in case of irregular accesses
SPECULATIVE NL: NL is ON when L1 Misses Per Kilo Cycles (MPKC) is low (< 15 for single-core); NL is OFF otherwise.
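The speculative next-line gate amounts to a threshold check on a miss-rate counter. The sketch below assumes the prefetcher can read an L1 miss count and a cycle count for some recent window; the function names and the way the window is measured are assumptions, while the threshold of 15 MPKC is the single-core value from the slide.

```python
def misses_per_kilo_cycle(l1_misses, cycles):
    """L1 misses per thousand cycles (MPKC)."""
    return 1000.0 * l1_misses / cycles


def speculative_nl_on(l1_misses, cycles, threshold=15.0):
    """NL is ON only while MPKC stays below the threshold."""
    return misses_per_kilo_cycle(l1_misses, cycles) < threshold


def next_line_prefetch(line, l1_misses, cycles):
    """Return [line + 1] when speculative NL is ON, otherwise no prefetch."""
    return [line + 1] if speculative_nl_on(l1_misses, cycles) else []
```

Ten misses in 1000 cycles (MPKC of 10) leaves NL on, whereas twenty misses in the same window (MPKC of 20) turns it off.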

SLIDE 17

The Bouquet

Constant Stride (CS class)
Complex Stride (CPLX class)
Global Stream (GS class)
Next Line (NL class)

Design Choice: A hardware table for each class?
Our Proposal: IPCP, a single hardware table for all the classes

SLIDE 18

Our Proposal (IPCP at L1)

Table fields: IP | Valid? | Page no. | Page offset | Stride | Confidence | Signature | Stream valid? | Direction | Strength | Stride | Confidence
Field groups: CS, CPLX, GS
GHB

Input: L1 access [IP, Access address]
Priority of classes: GS > CS > CPLX > NL
Prefetch Degree: GS: 6; CS and CPLX: 3
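The class priority and degrees can be expressed as a small lookup. The sketch assumes that an earlier step has already decided which classes the current IP qualifies for; the function names are hypothetical, and the NL degree of 1 is an assumption because the slide lists degrees only for GS, CS, and CPLX.

```python
PRIORITY = ("GS", "CS", "CPLX", "NL")      # highest priority first, per the slide
L1_DEGREE = {"GS": 6, "CS": 3, "CPLX": 3}  # degrees stated on the slide
NL_DEGREE = 1                              # assumption: not given on the slide


def select_class(eligible):
    """Pick the highest-priority class among those the IP qualifies for."""
    for name in PRIORITY:
        if name in eligible:
            return name
    return None


def l1_prefetch_degree(class_type):
    """Prefetch degree at the L1 for the selected class (0 if no class)."""
    if class_type is None:
        return 0
    return L1_DEGREE.get(class_type, NL_DEGREE)
```

An IP that qualifies for both CS and GS is treated as GS and prefetches six lines ahead.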

SLIDE 19

Our Proposal (IPCP at L2)

Table fields: IP | Valid? | Class_type | Stride
Class types sent by the L1 Prefetcher: GS, CS, CPLX, NL, NO
Metadata used: Trained Stride, Stream Direction

No IP classification at the L2; table construction is based on metadata
No prefetching for the CPLX class
Prefetch Degree: 4 for GS and 4 for CS if the MSHR is less than half full, else 3
Trigger: L1 Miss
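The L2 degree rule can be sketched as follows. It assumes the MSHR condition applies to both GS and CS, and that the NO class type means no prefetch; both are readings of the slide rather than confirmed behaviour. The function and parameter names are hypothetical, and the NL degree is left unspecified because the slide does not give one.

```python
def l2_prefetch_degree(class_type, mshr_occupancy, mshr_size):
    """Prefetch degree at the L2 for a class type received from the L1."""
    if class_type in ("CPLX", "NO"):
        return 0   # no L2 prefetching for CPLX; NO treated as no prefetch (assumption)
    if class_type in ("GS", "CS"):
        # Degree 4 while the MSHR is less than half full, else 3.
        return 4 if 2 * mshr_occupancy < mshr_size else 3
    raise ValueError("degree for this class type is not given on the slide")
```

With a 16-entry MSHR, a GS request sees degree 4 at occupancy 3 and degree 3 at occupancy 8.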

SLIDE 20

Metadata

L1 Prefetch Packet | L1 Prefetch Packet Metadata
Metadata fields: Stride (7 bits) | Class-type (3 bits) | SPEC_NL (1 bit)

Stream direction in case of GS class type
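The three metadata fields fit in 11 bits. The sketch below packs them into one integer to show the idea; the bit positions, the two's-complement encoding of the stride, and the numeric class codes are all assumptions, since the slide gives only the field widths.

```python
STRIDE_BITS = 7
CLASS_BITS = 3
CLASS_CODES = {"NO": 0, "NL": 1, "CS": 2, "CPLX": 3, "GS": 4}  # hypothetical codes
CLASS_NAMES = {code: name for name, code in CLASS_CODES.items()}


def encode_metadata(stride, class_type, spec_nl):
    """Pack stride (7 bits), class type (3 bits), and SPEC_NL (1 bit)."""
    if not -64 <= stride <= 63:
        raise ValueError("stride does not fit in 7 signed bits")
    word = stride & 0x7F                                    # assumed two's complement
    word |= CLASS_CODES[class_type] << STRIDE_BITS
    word |= int(bool(spec_nl)) << (STRIDE_BITS + CLASS_BITS)
    return word


def decode_metadata(word):
    """Unpack (stride, class_type, spec_nl) from an 11-bit metadata word."""
    stride = word & 0x7F
    if stride >= 64:
        stride -= 128                                       # restore the sign
    class_type = CLASS_NAMES[(word >> STRIDE_BITS) & 0x7]
    spec_nl = bool((word >> (STRIDE_BITS + CLASS_BITS)) & 1)
    return stride, class_type, spec_nl
```

A round trip such as `decode_metadata(encode_metadata(-3, "CS", True))` returns the original triple, which is what the L2 needs in order to rebuild its table from the prefetch packet.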

SLIDE 21

Hardware Overhead

Table | Entry size * #Entries | Total
IP Table | 77 * 1024 (L1) + 17 * 1024 (L2) bits | 12.03 KB
DPT Table | 9 * 4096 bits | 4.6 KB
GHB Table | 16 * 58 bits | 928 bits
Others | 100 bits | 86 bits
Total | | 16.7 KB
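The per-table totals can be checked with a few lines of arithmetic. The slide's figures come out when 1 KB is taken as 1000 bytes, which is the convention assumed here. The "Others" row is left out because its two values are unclear in the transcript.

```python
def bits_to_kb(bits):
    """Convert bits to kilobytes, taking 1 KB = 1000 bytes (assumed convention)."""
    return bits / 8 / 1000


ip_table_bits = 77 * 1024 + 17 * 1024   # L1 and L2 IP tables: 96256 bits
dpt_bits = 9 * 4096                     # Delta Prediction Table: 36864 bits
ghb_bits = 16 * 58                      # Global History Buffer: 928 bits

ip_table_kb = round(bits_to_kb(ip_table_bits), 2)   # 12.03
dpt_kb = round(bits_to_kb(dpt_bits), 1)             # 4.6
```

The three tables together come to 134048 bits, or about 16.76 KB, in line with the stated total of 16.7 KB.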

SLIDE 22

Single-core Performance [SPEC CPU 2017]

[Figure: bar chart over 46 SPEC CPU 2017 traces (600.perlbench_s through 657.xz_s) plus the geomean; y-axis from 0.9 to 2.5, with two bars labeled 3.59 and 3.02.]

On average: 43.75% improvement
Multi-core: 25 mixes, 22% improvement

SLIDE 23

Distribution of IP Classes

[Figure: share of the GS, CS, CPLX, and NL classes (0% to 100%) for 46 SPEC CPU 2017 traces (600.perlbench_s through 657.xz_s) plus the mean.]

On average, all classes trigger equally

SLIDE 24

Comparison with the State-of-the-art: Performance [Higher the better]

Average Improvement in %:
BO [HPCA ’16, DPC-2 Winner]: 34.53
SPP + Perceptron Filter [ISCA ’19]: 40.40
IPCP: 43.75

SLIDE 25

Key Takeaways

Access patterns can be classified based on IPs (IPCP)
Classification at the L1, reuse at the L2 through metadata
Simple and modular collection of prefetchers
High performance and low hardware overhead
Prefetchers like ISB [MICRO ’13] and IMP [MICRO ’15] can be added to the bouquet seamlessly

SLIDE 26

Dream ☺ ?

With IPCP, L1 hit rate jumps from 88.11% to 92.43% ☺
With IPCP, L2 hit rate jumps from 23.55% to 51.82% ☺

SLIDE 27

“Great things are done by a series of small things brought together”
Vincent Van Gogh, Dutch painter

Thank You