Criticality Aware Tiered Cache Hierarchy (CATCH)
Anant Nori*, Jayesh Gaur*, Siddharth Rai#, Sreenivas Subramoney*, Hong Wang*
* Microarchitecture Research Lab, Intel
# Indian Institute of Technology Kanpur, India
Popular Three Level Cache
Access latency: L1 (Pvt.) 5 cyc, L2 (Pvt.) 15 cyc, LLC (Shared) 40 cyc

Broadwell-like Server: 32 KB L1 (Data, Code), 256 KB L2, Inclusive LLC of 2MB/core (8MB for 4 cores)
Skylake-like Server: 32 KB L1 (Data, Code), 1MB L2, Exclusive LLC of 1.375MB/core (5.5MB for 4 cores)

[Chart: results for client, FSPEC, HPC, ISPEC, server and GeoMean; series "NoL2 + 6.5MB LLC" and "NoL2 + 9.5MB LLC"; bar values not recoverable]
[Figure: L2 caches; surviving label: "with disparate cache footprints"]
[Figure: dependency graph of seven instructions (1 to 7), each with D, E and C nodes. Four of them are loads: LLC miss, L2 Hit, L2 Hit, LLC Hit. Edge weights shown: 2, 10, 30 and 200.]

CRITICAL L2 HIT LOAD: ~9% performance loss if L2 HIT → LLC HIT
NON-CRITICAL L2 HIT LOAD: No performance impact if L2 HIT → LLC HIT
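The contrast between a critical and a non-critical L2 hit can be illustrated with a toy dependency chain. The sketch below is only an illustration under stated assumptions: the instruction names, the dependencies and the latencies (2, 10, 30, 200, borrowed from the figure's edge weights) are hypothetical, and the model tracks data dependencies only, ignoring the D and C nodes and any ROB size limit.

```python
def critical_path(latency, deps):
    """Longest data-dependency chain.

    latency: dict instr -> cycles, listed in program order.
    deps:    dict instr -> list of earlier instrs it waits for.
    Returns (total_cycles, instrs on the critical path).
    """
    finish, pred = {}, {}
    for instr, lat in latency.items():
        producers = deps.get(instr, [])
        # An instruction starts once its slowest producer has finished.
        slowest = max(producers, key=lambda p: finish[p], default=None)
        start = finish[slowest] if slowest is not None else 0
        finish[instr] = start + lat
        pred[instr] = slowest
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    return total, path[::-1]


# Hypothetical example: ld1 misses the LLC (200), ld2 and ld4 hit the L2 (10).
latency = {"ld1": 200, "ld2": 10, "op3": 2, "ld4": 10, "op5": 2}
deps = {"ld2": ["ld1"], "op3": ["ld2"], "op5": ["ld4"]}

base_cycles, base_path = critical_path(latency, deps)
# Serve each L2 hit at the assumed LLC-hit latency (30) instead.
crit_cycles, _ = critical_path({**latency, "ld2": 30}, deps)
noncrit_cycles, _ = critical_path({**latency, "ld4": 30}, deps)

print(base_cycles, base_path)
print(crit_cycles, noncrit_cycles)
```

In this toy case only the demotion of `ld2`, which sits on the longest chain, lengthens execution; demoting `ld4` leaves the total unchanged.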
[Chart: % loads converted to higher latency (axis 0% to 100%), for "L1 hits to L2 lat." and "L2 hits to LLC lat.", each with ALL and NonCritical bars; labelled values: 49.1%, 39.6%]
Oracle Performance Potential (*prefetchers disabled)
[Chart: PerfImpact and % L1 misses converted for 32 PC, 128 PC, 2048 PC, All PC and NoL2 + 2048 PC; axes 0% to 20% and 0% to 10%; labelled values: 5.5%, 5.8%, 6.1%, 6.6%, 6.2%, 14.1%, 15.5%, 17.0%]
Cache levels: Level1 (L1) Private, Level2 (L2) Private, Level3 (L3) Shared

BASELINE: 32 KB L1 (Data, Code), 1MB L2, 5.5MB Exclusive L3
CATCH Hardware: load PCs; critical loads in L1
Three-Level CATCH: 32 KB L1 (Data, Code), 1MB L2, 5.5MB Exclusive L3. Accelerates critical path
Two-Level CATCH (NoL2): 32 KB L1 (Data, Code), 5.5MB Inclusive L3. Accelerates critical path + Area saving
Two-Level CATCH (NoL2, IsoArea): 32 KB L1 (Data, Code), 9.5MB Inclusive L3. Accelerates critical path + Power saving
[Figure: graph of D, E, C nodes spanning three regions: Instructions being allocated, Instructions being executed (ROB), and History of Retired Instructions (2.5x ROB window); shown with a 32 entry Critical Load PC Table]
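A small table of critical load PCs could be modelled roughly as follows. This is a minimal sketch, not the hardware described in the talk: only the 32-entry capacity comes from the slide, while the saturating confidence counter, the threshold and the evict-lowest-confidence policy are assumptions made for illustration, as are all class and method names.

```python
class CriticalLoadPCTable:
    """Fixed-capacity table of load PCs seen on the critical path."""

    def __init__(self, capacity=32, max_conf=3):
        self.capacity = capacity   # 32 entries, as on the slide
        self.max_conf = max_conf   # assumed 2-bit saturating counter
        self.conf = {}             # pc -> confidence

    def record_critical(self, pc):
        """Called when a load at `pc` is found on the critical path."""
        if pc in self.conf:
            self.conf[pc] = min(self.conf[pc] + 1, self.max_conf)
            return
        if len(self.conf) >= self.capacity:
            # Assumed policy: evict the entry with the lowest confidence.
            victim = min(self.conf, key=self.conf.get)
            del self.conf[victim]
        self.conf[pc] = 1

    def is_critical(self, pc, threshold=2):
        """Treat a PC as critical once it has been recorded often enough."""
        return self.conf.get(pc, 0) >= threshold


table = CriticalLoadPCTable()
table.record_critical(0x400)
table.record_critical(0x400)
for pc in range(0x1000, 0x1000 + 40):   # 40 other PCs, each seen once
    table.record_critical(pc)
```

With these assumed parameters the table never grows beyond 32 entries, and a PC recorded twice survives a burst of one-off PCs.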
[Figure: prefetch patterns for a Target/Critical load]

Trigger and target: LD Trigger T_o, LD Target C_o, LD Trigger T_o+1, …, LD Target C_o+1 (constant delta from Target/Critical PC)
Address sequence: a, a+δ, a+2δ, …, a+16δ, a+17δ, a+18δ
Feeder and target: LD Feeder F_i, LD Feeder F_i+1, LD Feeder F_i+2, LD Feeder F_i+3, …, with LD Target D_i, LD Target D_i+1, LD Target D_i+2, …, LD Target D_i+3
- SELF "Deep" Address Prefetch of Feeder F
- Feeder Prefetch Data to Prefetch Address of Target D
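The address patterns above can be sketched in a few lines. This is a hedged illustration rather than the prefetcher from the talk: the function names are hypothetical, the prefetch distance of 16 strides is an assumption read off the sequence a, a+δ, a+2δ, …, a+16δ, a+17δ, a+18δ, and the trigger case simply assumes the target address sits at a learned constant delta from the trigger address.

```python
def deep_stride_prefetch(history, distance=16):
    """Return the address `distance` strides ahead of the latest access.

    history: recent addresses of one load PC, oldest first.
    Returns None until the last two strides agree and are non-zero.
    """
    if len(history) < 3:
        return None
    d1 = history[-1] - history[-2]
    d2 = history[-2] - history[-3]
    if d1 != d2 or d1 == 0:
        return None
    return history[-1] + distance * d1


def learn_delta(trigger_addr, target_addr):
    """Delta between a trigger load's address and the target's address."""
    return target_addr - trigger_addr


def trigger_prefetch(trigger_addr, delta):
    """Prefetch address for the target, issued when the trigger executes."""
    return trigger_addr + delta


a, stride = 0x1000, 64
# After seeing a, a+δ, a+2δ the sketch prefetches a+18δ.
deep = deep_stride_prefetch([a, a + stride, a + 2 * stride])

delta = learn_delta(trigger_addr=0x2000, target_addr=0x2100)
target = trigger_prefetch(0x3000, delta)
```

Under these assumptions each new access in the stride stream prefetches one address 16 strides further on, which is one way to read the slide's address sequence.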
[Figure: code prefetch flow. Labels: Next Instruction Pointer (NIP) Logic … (Branch Prediction etc), Branch Mispredict, NextInstructionPointer, Code Load, Next Line Code Prefetch, CodeNextPrefetch InstructionPointer, Code Prefetch, Front End Stall]
[Chart: results (axis 0.0% to 15.0%) for client, FSPEC, HPC, ISPEC, server and GeoMean; series: 1MB L2, 5.5 MB LLC; NoL2; NoL2 + CATCH; NoL2 + 9.5MB LLC + CATCH; CATCH; labelled values: 4.5%, 7.2%, 8.4%]
[Chart: per-workload results (axis 0.4 to 1.8); series: 1MB L2, 5.5MB LLC; NoL2; NoL2 + 9.5MB LLC + CATCH; CATCH; labelled workloads: povray, namd]
[Chart: energy savings w.r.t. 1MB L2, 5.5 MB LLC (axis 0% to 20%); labelled values: 19.01%, 14.36%, 5.88%, 10.15%, 10.62%, 10.87%]
Increased interconnect traffic
Three-level CATCH: 8.4% perf. gain
Two-level CATCH: 4.2% perf. gain + 30% area saving
Two-level CATCH Iso-area: 7.2% perf. gain + 11% energy saving