SLIDE 1

Criticality Aware Tiered Cache Hierarchy (CATCH)

Anant Nori*, Jayesh Gaur*, Siddharth Rai#, Sreenivas Subramoney*, Hong Wang*

* Microarchitecture Research Lab, Intel

# Indian Institute of Technology Kanpur, India

SLIDE 2

Popular Three Level Cache Hierarchy

  • Cache capacity ↔ Access latency
  • Target low average latency
  • Large distributed LLC, high latency
  • Lower L2 latency important
  • Trend to larger L2 sizes

Is a large L2 the most efficient design choice?

                         L1 (Pvt.), 5 cyc    L2 (Pvt.), 15 cyc   LLC (Shared), 40 cyc
  Broadwell-like Server  32 KB Data + Code   256 KB              8MB (4 core), Inclusive (2MB/core)
  Skylake-like Server    32 KB Data + Code   1MB                 5.5MB (4 core), Exclusive (1.375MB/core)

[Chart: % perf. impact of removing the L2 (NoL2 + 6.5MB LLC and NoL2 + 9.5MB LLC) for client, FSPEC, HPC, ISPEC and server workloads and their GeoMean; per-category values range from 2.7% to 14.0%]
SLIDE 3

Large L2 caches

[Diagram: a small L2 backed by an inclusive LLC vs. a large L2 backed by an exclusive LLC]

  • Inclusive LLC → Exclusive LLC
  • Lower effective on-die cache per core
  • Large LLC better for multiple threads with disparate cache footprints
  • Area for Snoop-filter/Coherence-directory

Despite area and power overheads, average latency reduction (performance) drives the trend to large L2 caches.

SLIDE 4

Loads and Program Criticality

Program execution expressed as a Data Dependency Graph (DDG) (Fields et al.)

  • Execution time governed by the "Critical Path"

[Diagram: DDG over seven instructions, each with Dispatch (D), Execute (E) and Commit (C) nodes; four loads sit on the graph with edge latencies of 200 cycles (LLC miss), 10 cycles (each L2 hit) and 30 cycles (LLC hit)]

  • Critical L2-hit load: ~9% performance loss if the L2 hit becomes an LLC hit
  • Non-critical L2-hit load: no performance impact if the L2 hit becomes an LLC hit

Only critical load L2 hits matter to performance.
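
To make the critical-path idea concrete, here is a minimal sketch (not the paper's hardware mechanism) that finds the longest dependency chain through a tiny graph; the node names and latencies are hypothetical, chosen only to mirror the slide's point that slowing a load off the critical path costs nothing:

```python
# Minimal sketch: the critical path is the longest latency-weighted chain in a DDG.
# The graph and latencies below are illustrative, not the slide's exact example.
from functools import lru_cache

# adjacency list: node -> list of (successor, edge latency in cycles)
DDG = {
    "start":        [("load_crit", 0), ("load_noncrit", 0)],
    "load_crit":    [("consumer", 10)],   # L2 hit (10 cyc) feeding a long dependent chain
    "load_noncrit": [("end", 10)],        # L2 hit with plenty of slack
    "consumer":     [("end", 200)],       # e.g. a dependent LLC miss
    "end":          [],
}

@lru_cache(maxsize=None)
def longest_path_from(node):
    """Length (cycles) of the longest dependency chain starting at `node`."""
    return max((lat + longest_path_from(nxt) for nxt, lat in DDG[node]), default=0)

print(longest_path_from("start"))   # 210: start -> load_crit -> consumer -> end
# Slowing load_crit from 10 to 30 cycles (an LLC hit) lengthens this path;
# slowing load_noncrit does not, because it has ~200 cycles of slack.
```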

SLIDE 5

Cache Hierarchy and Program Criticality

Oracle study
  • Track critical load PCs
  • Increase latencies of targeted load PCs

[Chart: % perf. impact and % of loads converted to higher latency when L1 hits are given L2 latency and when L2 hits are given LLC latency, for ALL loads vs. NonCritical loads only; perf. impacts of 16.1%, 7.8%, 4.9% and 0.8%, with 49.1% and 39.6% of loads converted]

L2 cache most amenable to criticality optimizations

SLIDE 6

Criticality Aware Tiered Cache Hierarchy (CATCH)

  • A. Track critical load PCs served from non-L1 on-die caches
  • B. Prefetch critical loads into L1 to accelerate the critical path

[Chart: Oracle performance potential (*prefetchers disabled) when tracking 32, 128, 2048 or all critical PCs, plus NoL2 + 2048 PCs; perf. impacts of 5.5%, 5.8%, 6.1%, 6.6% and 6.2%, with 14.1%, 15.5% and 17.0% of L1 misses converted]

L2 can become redundant in a criticality aware cache hierarchy

SLIDE 7

CATCH Configuration Options

CATCH Hardware:
  • A. Track critical load PCs
  • B. Prefetch critical loads into L1

Configurations (L1 private, L2 private, LLC shared):
  • BASELINE: 32 KB Data + Code L1, 1MB L2, 5.5MB Exclusive LLC
  • Three-Level CATCH: 32 KB Data + Code L1, 1MB L2, 5.5MB Exclusive LLC; accelerates the critical path
  • Two-Level CATCH (NoL2): 32 KB Data + Code L1, no L2, 5.5MB Inclusive LLC; accelerates the critical path + area saving
  • Two-Level CATCH (NoL2, IsoArea): 32 KB Data + Code L1, no L2, 9.5MB Inclusive LLC; accelerates the critical path + power saving

SLIDE 8

A) Criticality Detection in Hardware

  • Buffer the execution DDG (Fields et al.) on instruction retire
  • Enumerate the critical path every 2x-ROB instruction retires
  • Optimizations: area of the DDG, fast enumeration of the critical path
  • Uses 3KB of storage; details in the paper

[Diagram: instructions being allocated and executed in the ROB feed a history of retired instructions (a 2.5x-ROB window of D/E/C nodes), from which critical load PCs are recorded in a 32-entry Critical Load PC Table]
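
A simplified software sketch of this flow is below. It is not the paper's hardware; the buffer handling, the path walk and the table replacement policy are assumptions that only mirror the bullets above (buffer retired instructions for roughly 2.5x the ROB, periodically walk the longest path, and record critical load PCs in a 32-entry table):

```python
# Sketch: retire-time criticality detection feeding a small critical-load-PC table.
# Structure sizes and the RetiredInstr fields are illustrative assumptions.
from collections import OrderedDict, namedtuple

ROB_SIZE = 224
HISTORY = int(2.5 * ROB_SIZE)       # buffered window of retired DDG nodes
ENUM_PERIOD = 2 * ROB_SIZE          # enumerate the critical path every 2x-ROB retires
TABLE_ENTRIES = 32                  # critical load PC table size

# deps: indices of older instructions (within the window) this instruction waits on
RetiredInstr = namedtuple("RetiredInstr", "pc is_load deps latency")

class CriticalityDetector:
    def __init__(self):
        self.history = []                 # retired instructions, oldest first
        self.retired = 0
        self.crit_table = OrderedDict()   # pc -> times seen on the critical path

    def on_retire(self, instr):
        self.history.append(instr)
        if len(self.history) > HISTORY:
            self.history.pop(0)
        self.retired += 1
        if self.retired % ENUM_PERIOD == 0:
            self._enumerate_critical_path()

    def _enumerate_critical_path(self):
        # Longest-path computation in one pass over the window, in retirement
        # order, since dependencies only point to older instructions.
        n = len(self.history)
        dist = [0] * n
        pred = [None] * n
        for i, instr in enumerate(self.history):
            for d in instr.deps:
                if 0 <= d < i and dist[d] + instr.latency > dist[i]:
                    dist[i] = dist[d] + instr.latency
                    pred[i] = d
        # Walk back along the longest chain and record the load PCs on it.
        i = max(range(n), key=dist.__getitem__) if n else None
        while i is not None:
            instr = self.history[i]
            if instr.is_load:
                self.crit_table[instr.pc] = self.crit_table.get(instr.pc, 0) + 1
                self.crit_table.move_to_end(instr.pc)
                while len(self.crit_table) > TABLE_ENTRIES:
                    self.crit_table.popitem(last=False)   # evict the oldest entry
            i = pred[i]

    def is_critical(self, pc):
        return pc in self.crit_table
```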

SLIDE 9

B) Timeliness Aware, Criticality Triggered (TACT) Prefetchers

Critical load PC prefetchers optimized for inter-cache prefetching into the Data L1

Data Prefetchers
  • Identify "Trigger" load PCs for "Target" critical load PCs

Code "Run-Ahead" Prefetcher
  • Cover LLC latency instead of L2 latency (for when the L2 is removed)

SLIDE 10

TACT: Data Prefetchers

"Cross" Prefetcher:
  • Trigger load PC whose address is at a constant delta from the Target/Critical load PC's address
  • Addr(Trigger T) to Addr(Target C) = Δ, so each Trigger T_o, T_o+1, … prefetches the matching Target C_o, C_o+1, …

"Self" Deep Prefetcher:
  • Strided stream of the critical PC itself: a, a+δ, a+2δ, … , a+16δ, a+17δ, a+18δ
  • Up to a deep prefetch distance of 16

"Feeder" Prefetcher:
  • Address of Target/Critical load = M * Data of Feeder load + C
  • The Feeder F is itself deep-prefetched ("Self"), and its returned data is used to prefetch the address of Target D

Implementation details in the paper (a simplified sketch of the three prefetchers follows below)
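
The sketch below is a minimal software restatement of the three relationships named on this slide; it is not the paper's hardware, and the training rules, table structures and helper names are simplified assumptions:

```python
# Simplified sketches of the three TACT data-prefetch relationships.
# Training rules and structures are illustrative assumptions, not the paper's design.

class CrossPrefetcher:
    """Target (critical) PC address = Trigger PC address + constant delta."""
    def __init__(self):
        self.delta = {}                         # (trigger_pc, target_pc) -> learned delta

    def train(self, trigger_pc, trigger_addr, target_pc, target_addr):
        self.delta[(trigger_pc, target_pc)] = target_addr - trigger_addr

    def on_trigger(self, trigger_pc, trigger_addr, target_pc):
        d = self.delta.get((trigger_pc, target_pc))
        return None if d is None else trigger_addr + d     # address to prefetch into L1

class SelfDeepPrefetcher:
    """Strided stream of the critical PC itself, prefetched 16 accesses ahead."""
    DISTANCE = 16

    def __init__(self):
        self.last = {}                          # pc -> (last_addr, stride)

    def on_access(self, pc, addr):
        prev = self.last.get(pc)
        stride = addr - prev[0] if prev else 0
        self.last[pc] = (addr, stride)
        return addr + self.DISTANCE * stride if stride else None

class FeederPrefetcher:
    """Address of the target load = M * (data returned by the feeder load) + C."""
    def __init__(self):
        self.coeffs = {}                        # (feeder_pc, target_pc) -> (M, C)
        self.samples = {}

    def train(self, feeder_pc, feeder_data, target_pc, target_addr):
        key = (feeder_pc, target_pc)
        prev = self.samples.get(key)
        self.samples[key] = (feeder_data, target_addr)
        if prev and prev[0] != feeder_data:     # fit M, C from two observations
            m = (target_addr - prev[1]) // (feeder_data - prev[0])
            self.coeffs[key] = (m, target_addr - m * feeder_data)

    def on_feeder_data(self, feeder_pc, feeder_data, target_pc):
        mc = self.coeffs.get((feeder_pc, target_pc))
        return None if mc is None else mc[0] * feeder_data + mc[1]
```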

SLIDE 11

TACT: Code "Run-Ahead" Prefetcher

On a front-end stall:
  • Use the Next Instruction Pointer (NIP) logic (branch prediction, BTB)
  • Speculatively run ahead and prefetch code lines

[Diagram: the Next Instruction Pointer (NIP) logic (branch prediction etc., redirected on branch mispredicts) and the next-line code prefetcher; on a front-end stall the NIP logic is reused to generate run-ahead code prefetches]
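
A small illustrative sketch of the run-ahead idea follows; predict_next_fetch_addr and issue_code_prefetch are hypothetical stand-ins for the NIP/BTB logic and the prefetch port, not interfaces from the paper:

```python
# Sketch: while the front end is stalled, walk the predicted fetch stream and
# issue code prefetches for the instruction-cache lines it would touch.
CODE_LINE = 64           # bytes per instruction-cache line (assumed)
MAX_RUNAHEAD_LINES = 8   # assumed run-ahead depth

def runahead_code_prefetch(fetch_pc, predict_next_fetch_addr, issue_code_prefetch):
    """Issue speculative code prefetches starting from the stalled fetch PC."""
    seen_lines = set()
    pc = fetch_pc
    while len(seen_lines) < MAX_RUNAHEAD_LINES:
        line = pc // CODE_LINE
        if line not in seen_lines:
            seen_lines.add(line)
            issue_code_prefetch(line * CODE_LINE)    # prefetch into the code L1
        pc = predict_next_fetch_addr(pc)             # follow predicted control flow
        if pc is None:                               # no prediction (e.g. BTB miss)
            break
```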

SLIDE 12

Evaluation: Configuration

  • 4 x86 cores @ 3.2GHz, 4-wide, 224 ROB entries
  • 32 KB, 8-way Data and Code L1
  • PC-based stride prefetcher, multi-stream prefetchers
  • Dual channel DDR4-2400 main memory
  • Two baseline L2/LLC configurations:
    • Large L2 (1MB), Exclusive LLC (5.5MB, 1.375 MB/core)
    • Small L2 (256KB), Inclusive LLC (8MB)
  • 70 diverse ST workloads: SPEC-06, HPC, Server, Client
  • 60 four-way multi-programmed workloads
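
For quick reference, the same configuration can be written as a small parameter table; the field names below are our own shorthand, not the simulator's:

```python
# Simulated system parameters restated from this slide (field names are illustrative).
SYSTEM = {
    "cores": 4, "isa": "x86", "freq_GHz": 3.2, "width": 4, "rob_entries": 224,
    "L1": "32KB, 8-way, separate Data and Code",
    "prefetchers": ["PC-based stride", "multi-stream"],
    "memory": "dual-channel DDR4-2400",
}
BASELINES = {
    "large_L2_exclusive_LLC": {"L2": "1MB", "LLC": "5.5MB shared, exclusive (1.375MB/core)"},
    "small_L2_inclusive_LLC": {"L2": "256KB", "LLC": "8MB shared, inclusive"},
}
WORKLOADS = {"single_thread": 70, "four_way_multiprogrammed": 60}
```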

SLIDE 13

Large 1MB L2, Exclusive 1.375 MB LLC per Core: ST GeoMean Performance Impact

[Chart: % perf. impact w.r.t. the 1MB L2, 5.5MB LLC baseline for NoL2, NoL2 + CATCH, NoL2 + 9.5MB LLC + CATCH and CATCH, across client, FSPEC, HPC, ISPEC and server; GeoMean data labels of 7.8%, 4.5%, 7.2% and 8.4%]

CATCH accelerates the critical path and enables designs with better performance and area/power tradeoffs.

SLIDE 14

Large 1MB L2, Exclusive 1.375 MB LLC per Core: ST Per-Workload Performance Impact

[Chart: per-workload perf. ratio (0.4 to 1.8) w.r.t. the 1MB L2, 5.5MB LLC baseline for NoL2, NoL2 + 9.5MB LLC + CATCH and CATCH; povray and namd are the labeled outliers]

  • TACT prefetchers recover the loss from removing the L2 in the majority of workloads
  • Research on optimizations to improve the remaining outliers

SLIDE 15

Power Analysis: Iso-Area NoL2 + 9.5MB LLC + CATCH

  • Removal of the L2:
    • Reduced cache activity
    • Increased interconnect traffic
  • Increased LLC (absorbing the L2 capacity from all cores):
    • Reduced DRAM traffic

[Chart: % energy savings w.r.t. the 1MB L2, 5.5MB LLC baseline; data labels of 19.01%, 14.36%, 5.88%, 10.15%, 10.62% and 10.87%]

Overall impact:
  • ~11% energy savings
  • With 7.2% performance gains

For large, power-hungry (mesh) interconnects, the power cost of the increased interconnect traffic is too high ⇒ a small L2 to absorb L1 evictions is preferable.

SLIDE 16

Summary

  • Fundamental re-look at each level of a three-level cache hierarchy
  • L2 is highly amenable to criticality optimizations
  • The trend towards large L2 caches is not the most efficient design choice
  • CATCH introduces:
    • Dynamic tracking of critical loads
    • Optimized inter-cache prefetching into the L1
  • CATCH enables radical new processor designs with efficient area/power/performance tradeoffs:
    • Three-level CATCH: 8.4% perf. gain
    • Two-level CATCH: 4.2% perf. gain + 30% area saving
    • Two-level CATCH (iso-area): 7.2% perf. gain + 11% energy saving
