
SLIDE 1

Jenga: Software-Defined Cache Hierarchies

Po-An Tsai, Nathan Beckmann, and Daniel Sanchez

SLIDE 2

Executive summary

- Heterogeneous caches are traditionally organized as a rigid hierarchy
  - Easy to program, but introduces expensive overheads when the hierarchy is not helpful
- Jenga builds application-specific cache hierarchies on the fly
  - Key contribution: new algorithms to find near-optimal hierarchies
  - Handles arbitrary application behaviors & changing resource constraints
  - Optimizes the full system at 36 cores in <1 ms
- Jenga improves EDP by up to 85% vs. the state of the art

SLIDE 3-9

Deep, rigid hierarchies are running out of steam

[Figure: cache hierarchies, past vs. now]
- Past: systems had few cache levels with widely different sizes and latencies
  L1 (~1ns) -> L2 (~10ns) -> Main Memory (~100ns)
- Now: per-tile private L1 & L2 (~1ns, ~5ns) -> distributed SRAM L3 banks (~25ns) -> distributed DRAM L4 banks (~50ns) -> Main Memory (~100ns)
- Higher overheads due to closer sizes and latencies across hierarchy levels

SLIDE 10-20

Rigid hierarchies must cater to the conflicting needs of many applications

App 1: Scan through a 256MB array repeatedly

[Figure: App 1 on the rigid hierarchy: Private L1 & L2 -> SRAM L3 -> DRAM L4 -> Main Memory. The array data sits in the DRAM L4: the private caches and the SRAM L3 see 0% hit rates, the DRAM L4 a 100% hit rate.]
- Rigid hierarchy: hit latency = ~5ns + ~25ns + ~50ns = ~80ns
- Skipping the useless SRAM L3: hit latency = ~5ns + 0ns + ~50ns = ~55ns (30% lower)
- Also serving the data from closer DRAM banks: hit latency = ~5ns + 0ns + ~40ns = ~45ns (45% lower)

Even the best rigid hierarchy is a bad compromise! (See paper for details)
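
To make the arithmetic concrete: a minimal Python sketch of the hit-latency sums above, using the slide's illustrative latencies and hit rates (the AMAT-style model is a standard one, not code from the paper):

    # Expected latency of a hit: sum the latency of every level an access
    # visits, weighted by the probability it gets that far.
    def hit_latency(levels):
        """levels: (access_latency_ns, hit_rate) pairs, closest level first."""
        total, reach = 0.0, 1.0
        for latency, hit_rate in levels:
            total += reach * latency
            reach *= 1.0 - hit_rate
        return total

    rigid   = [(5, 0.0), (25, 0.0), (50, 1.0)]  # L2 -> SRAM L3 -> DRAM L4
    skip_l3 = [(5, 0.0), (50, 1.0)]             # bypass the useless L3
    near_l4 = [(5, 0.0), (40, 1.0)]             # data in closer DRAM banks

    for name, cfg in [("rigid", rigid), ("skip L3", skip_l3), ("near L4", near_l4)]:
        print(f"{name}: ~{hit_latency(cfg):.0f}ns")  # ~80ns, ~55ns, ~45ns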

SLIDE 21-26

Jenga: Software-defined cache hierarchies

Jenga manages distributed and heterogeneous banks as a single resource pool and builds virtual hierarchies tailored to each application in the system.

[Figure: pool of SRAM and DRAM banks across the chip, carved into per-app caches]
- App 1: Scan through a 256MB array. Ideal hierarchy: Private L1 & L2 -> 256MB cache -> Main Memory
- App 2: Look up a 5MB hashmap. Ideal hierarchy: Private L1 & L2 -> 5MB cache

SLIDE 27-29

Jenga: Software-defined cache hierarchies (continued)

- App 3: Scan through two arrays (1MB and 256MB). Ideal hierarchy: Private L1 & L2 -> 1MB cache -> 256MB cache

SLIDE 30-34

Prior work to mitigate the cost of rigid hierarchies

- Bypass levels to avoid cache pollution
  - Do not install lines at specific levels
  - Give lines low priority in the replacement policy
- Speculatively access up the hierarchy
  - Hit/miss predictors, prefetchers
  - Hide latency with speculative accesses
- These techniques must still check all levels for correctness!
  - They waste energy and bandwidth

It's better to build the right hierarchy and avoid the root cause: unnecessary accesses to unwanted cache levels.

SLIDE 35-41

Jenga = flexible hardware + smart software

[Figure: hardware/software timeline]
Every ~100ms, Jenga's software reads the hardware monitors, optimizes the hierarchies, and updates the hierarchies in hardware, then repeats.
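
A minimal sketch of that control loop, assuming hypothetical monitors/allocator/placer/vhts objects (the slide fixes only the structure and the ~100ms period; every name below is illustrative):

    import time

    RECONFIG_INTERVAL_S = 0.1  # software reconfigures every ~100ms

    def jenga_runtime(monitors, allocator, placer, vhts):
        while True:
            miss_curves = monitors.read()                  # read hardware monitors
            hierarchies = allocator.allocate(miss_curves)  # VH sizes & levels
            placement = placer.place(hierarchies)          # banks for each VH level
            vhts.update(placement)                         # update hierarchies in HW
            time.sleep(RECONFIG_INTERVAL_S)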

SLIDE 42-45

Jenga hardware: supporting virtual hierarchies (VHs)

- Cores consult the virtual hierarchy table (VHT) to find the access path
- Similar to Jigsaw [PACT'13, HPCA'15], but supports two-level hierarchies using both SRAM and DRAM

[Figure: tile layout: core, private $, TLB feeding the VHT (Addr -> VH id), NoC router, SRAM bank; DRAM banks alongside]
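
A sketch of what a VHT lookup computes, under the assumption of a software model of the table (the real VHT is a small hardware structure indexed by the VH id from the TLB; the descriptor fields and hashing below are illustrative):

    from dataclasses import dataclass

    @dataclass
    class VHDescriptor:
        vl1_banks: list   # banks forming this VH's first level
        vl2_banks: list   # banks forming its second level ([] if one-level)

    def access_path(vht, vh_id, addr):
        """Return the banks an address may visit, in order (then main memory)."""
        vh = vht[vh_id]                     # VH id comes with the TLB entry
        line = addr >> 6                    # assume 64B lines
        path = [vh.vl1_banks[line % len(vh.vl1_banks)]]
        if vh.vl2_banks:
            path.append(vh.vl2_banks[line % len(vh.vl2_banks)])
        return path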

SLIDE 46-49

Accessing a two-level virtual hierarchy

[Figure: a core on tile 1 accesses its VH: VL1 in the SRAM bank of tile 10, VL2 in DRAM bank 38]
1. Core miss -> VL1 bank (SRAM, bank 10)
2. VL1 miss -> VL2 bank (DRAM, bank 38)
3. VL2 hit -> serve the line

Access path: SRAM bank -> DRAM bank -> Mem
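
The same three-step walk as a small Python sketch; filling the VL1 on a VL2 hit is an assumption for illustration, not a statement of Jenga's actual insertion policy:

    def vh_access(addr, vl1, vl2, memory):
        """vl1/vl2: dicts modeling the VL1 (SRAM) and VL2 (DRAM) banks."""
        if addr in vl1:                       # 1. core miss -> VL1 bank: hit
            return "VL1 hit", vl1[addr]
        if addr in vl2:                       # 2. VL1 miss -> VL2 bank
            vl1[addr] = vl2[addr]             # 3. VL2 hit: serve (and fill VL1)
            return "VL2 hit", vl1[addr]
        vl1[addr] = vl2[addr] = memory[addr]  # VL2 miss -> main memory
        return "memory", vl1[addr]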

SLIDE 50-54

Accessing a single-level VH using SRAM + DRAM

- With the VHT, software can group any combination of banks to form a VH

[Figure: addresses X and Y map to different banks (one DRAM, one SRAM) of the same single-level VH. Logically equivalent to: Core -> Private Caches -> one cache built from DRAM and SRAM banks -> Main Memory]
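
A sketch of how one logical level spreads lines over heterogeneous banks, so each line needs exactly one lookup (the bank list and hash below are assumptions for illustration):

    banks = [
        {"id": 10, "kind": "SRAM", "latency_ns": 25},
        {"id": 11, "kind": "SRAM", "latency_ns": 27},
        {"id": 38, "kind": "DRAM", "latency_ns": 50},
    ]

    def home_bank(addr):
        line = addr >> 6                  # 64B cache lines
        return banks[line % len(banks)]   # exactly one home bank per line

    for addr in (0x1000, 0x1040, 0x1080):
        b = home_bank(addr)
        print(hex(addr), "->", b["kind"], "bank", b["id"], f"~{b['latency_ns']}ns")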

SLIDE 55-60

Jenga software: finding near-optimal hierarchies

- Periodically, Jenga reconfigures VHs to minimize data movement

[Figure: reconfiguration pipeline]
Hardware monitors -> application miss curves -> Virtual Hierarchy Allocation (VH sizes & levels: VL1/VL2) -> Bandwidth-Aware Placement (final allocation) -> set VHTs

SLIDE 61-67

Modeling performance of heterogeneous caches

- Treat SRAM and DRAM as different "flavors" of banks with different latencies

[Figure: banks around the requesting core, colored by latency; growing a virtual cache out from the core adds progressively slower banks]
- Cache access latency therefore rises with total capacity
- Combining the access-latency curve with the miss latency (weighted by the miss curve from the hardware monitors) yields the total-latency curve for a single-level, heterogeneous cache
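
A toy sketch of that composition, with made-up inputs (the real miss curve comes from the hardware monitors, and the access-latency curve from the bank layout):

    # total(s) = accesses * access_latency(s) + misses(s) * miss_penalty
    def total_latency_curve(miss_curve, access_latency, miss_penalty_ns, accesses):
        return [accesses * access_latency[s] + miss_curve[s] * miss_penalty_ns
                for s in range(len(miss_curve))]

    miss_curve     = [1000, 400, 150, 60, 50]  # misses at capacity s (toy data)
    access_latency = [0, 25, 28, 40, 50]       # grows as slower banks are added
    curve = total_latency_curve(miss_curve, access_latency, 100, accesses=1000)
    print(curve)  # the minimum marks the best single-level size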

SLIDE 68-72

Optimizing hierarchies by minimizing system latency

- Our prior work proposed algorithms that take latency curves, allocate capacity, and place it on chip to minimize total system latency
- But that work only builds single-level VHs

[Figure: per-app latency-vs-capacity curves for App1-App3, and the resulting division of cache capacity among them]

SLIDE 73-75

Multi-level hierarchies are much more complex

- Many intertwined factors:
  - The best VL1 size depends on the VL2 size
  - The best VL2 size depends on the VL1 size
  - Should there be a VL2 at all? (Depends on total size)
- Jenga encodes these tradeoffs in a single curve
  - So it can reuse prior allocation algorithms

SLIDE 76-82

How to get a latency curve for a multi-level VH

- Two-level hierarchies form a latency surface: latency as a function of VL1 and VL2 sizes
- Projecting the surface gives the best 1- or 2-level hierarchy at every total size
- The projected curve captures the best overall hierarchy at every size, and lets us optimize multi-level hierarchies!
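
A sketch of the projection, assuming surface[s1][s2] holds the latency of a two-level hierarchy with VL1 size s1 and VL2 size s2, and one_level[s] is the single-level curve (this data layout is an assumption; the paper derives the curves analytically):

    def project(surface, one_level):
        """best[s]: lowest latency at total capacity s; split[s]: the
        (VL1, VL2) sizes that achieve it, with VL2 = 0 meaning one level."""
        best, split = [], []
        for s in range(len(one_level)):
            cand = [(one_level[s], (s, 0))]               # 1-level option
            cand += [(surface[s1][s - s1], (s1, s - s1))  # every 2-level split
                     for s1 in range(1, s)]
            lat, sizes = min(cand)
            best.append(lat)
            split.append(sizes)
        return best, split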

SLIDE 83-87

Allocating virtual hierarchies

[Figure: allocation step]
- Inputs: the latency curve of each VH (VH1, VH2, VH3) and the total cache capacity
- The cache allocation algorithm divides the total capacity among the VHs
- Each VH's allocation then decides its best hierarchy: its size and levels (VL1 only, or VL1 + VL2)
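
A greedy marginal-benefit sketch of the allocation step; this stand-in assumes convex latency curves (the actual allocator in Jigsaw/Jenga is the faster Peekahead algorithm):

    import heapq

    def allocate(curves, total_chunks):
        """curves[i][s]: VH i's projected latency with s capacity chunks.
        Give each chunk to the VH whose latency drops the most."""
        alloc = [0] * len(curves)
        heap = [(-(c[0] - c[1]), i) for i, c in enumerate(curves) if len(c) > 1]
        heapq.heapify(heap)
        for _ in range(total_chunks):
            if not heap:
                break
            gain, i = heapq.heappop(heap)
            if -gain <= 0:
                break                      # no VH benefits from more capacity
            alloc[i] += 1
            s = alloc[i]
            if s + 1 < len(curves[i]):
                heapq.heappush(heap, (-(curves[i][s] - curves[i][s + 1]), i))
        return alloc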

SLIDE 88-99

Bandwidth-aware virtual hierarchy placement

- Place data close to where it is used without saturating DRAM bandwidth
- Every iteration, Jenga:
  - Chooses a VH (via an opportunity-cost metric, see paper)
  - Greedily places a chunk of its data in its closest bank
  - Updates the DRAM bank's latency

[Figure: chunks of each VH's VL1/VL2 allocation placed over SRAM and DRAM banks; as load accumulates, a DRAM bank's effective latency climbs from 1.0x to 1.1x to 1.3x]
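
A sketch of the placement loop; picking VHs round-robin and growing a bank's latency linearly with load are simplifying assumptions (the paper's opportunity-cost metric and latency model are more precise):

    from itertools import cycle

    def place(vh_chunks, candidate_banks, base_latency, load_penalty=5):
        """vh_chunks: {vh: #chunks to place}; candidate_banks[vh]: bank ids,
        closest first; base_latency[bank]: unloaded latency in ns."""
        load, placement = {}, []
        order = cycle(list(vh_chunks))   # stand-in for the opportunity-cost pick
        remaining = sum(vh_chunks.values())
        while remaining:
            vh = next(order)
            if vh_chunks[vh] == 0:
                continue
            # cheapest bank = base latency + contention from chunks placed so far
            bank = min(candidate_banks[vh],
                       key=lambda b: base_latency[b] + load_penalty * load.get(b, 0))
            placement.append((vh, bank))
            load[bank] = load.get(bank, 0) + 1
            vh_chunks[vh] -= 1
            remaining -= 1
        return placement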

SLIDE 100-102

Jenga adds small overheads

- Hardware overheads
  - The VHT requires ~2.4 KB per tile
  - Monitors are 8 KB x 2 per tile
  - In total, Jenga adds ~20 KB per tile, 4% of the SRAM banks
  - Similar to Jigsaw
- Software overheads
  - 0.4% of system cycles at 36 tiles
  - Runs concurrently with applications; only needs to pause cores to update VHTs
  - Trivial to parallelize
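
A quick consistency check on those figures, assuming the evaluated system's 18MB of SRAM is spread evenly over the 36 tiles:

    sram_per_tile_kb = 18 * 1024 / 36   # = 512 KB SRAM bank per tile
    overhead_kb = 2.4 + 2 * 8           # VHT + two monitors; ~20 KB with other state
    print(f"{overhead_kb / sram_per_tile_kb:.1%}")  # ~3.6%, roughly the quoted 4%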

SLIDE 103

See paper for...

- Hardware support for
  - Fast reconfiguration
  - Page reclassification
- Efficient implementation of hierarchy allocation
- OS integration

SLIDE 104-107

Evaluation

- Modeled system
  - 36 cores on a 6x6 mesh
  - 18MB SRAM
  - 1GB stacked DRAM
- Workloads
  - 36 copies of the same app (SPECrate)
  - Random mixes of 36 SPEC CPU apps
  - 36-threaded SPEC OMP apps
- Compared 5 schemes:

  Scheme   | SRAM            | DRAM
  ---------|-----------------|----------
  S-NUCA   | Rigid L3        | -
  Alloy    | Rigid L3        | Rigid L4
  Jigsaw   | App-specific L3 | -
  JigAlloy | App-specific L3 | Rigid L4
  Jenga    | App-specific virtual hierarchies (both)

SLIDE 108-114

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> rigid SRAM L3 (holds partial array data, ~100% miss rate) -> Memory]
- Wasteful accesses to the L3: they should have gone to memory directly

SLIDE 115-120

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> rigid SRAM L3 (~100% miss rate) -> rigid DRAM L4 (~0% miss rate) -> Memory]
- The rigid DRAM L4 caches the working sets, but every access still traverses the useless SRAM L3

SLIDE 121-125

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> app-specific SRAM L3 (~90% miss rate) -> Memory]
- An app-specific SRAM L3 removes 10% of the misses, but the rest still go to memory

SLIDE 126-132

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> app-specific SRAM L3 (~90% miss rate) -> rigid DRAM L4 (~0% miss rate) -> Memory]
- Combines Jigsaw's and Alloy's benefits, but is still a rigid hierarchy

SLIDE 133-138

Case study: 36 copies of xalanc

Working set: 6MB x 36 = 216MB

[Figure: Private L2 -> per-app 6MB SRAM+DRAM, VL1-only hierarchy (~0% miss rate) -> Memory. Result annotations: 60% better, 20% better.]
- A single lookup reaches the working set: no wasteful lookups!

Jenga improves performance and energy efficiency by creating the right hierarchy using the best available resources!

SLIDE 139-141

Jenga works across a wide range of behaviors

  Working set              | Jenga VHs
  -------------------------|--------------------------
  Two-level: 0.5MB + 16MB  | SRAM VL1 + DRAM VL2
  Two-level: 1MB + 8MB     | SRAM+DRAM VL1 + DRAM VL2
  Flat: 2.5MB              | SRAM+DRAM VL1
  Flat: 8MB                | SRAM+DRAM VL1
  Flat: >50MB              | DRAM VL1, or no caching

SLIDE 142-145

Jenga works for random multi-program mixes

[Figure: results over random mixes: 2.6X over S-NUCA and 20% over JigAlloy; 1.7X over S-NUCA and 10% over JigAlloy]

Jenga consistently outperforms the other schemes for multi-program mixes.

SLIDE 146

See paper for more results

- Full results for SPEC CPU rate
- Multithreaded apps
- Sensitivity studies of Jenga's software techniques
- 2.5D DRAM architectures
- Jigsaw SRAM L3 + Jigsaw DRAM L4
- And more

SLIDE 147-150

Conclusion

- Rigid, multi-level cache hierarchies are ill-suited to many applications
  - They cause significant overhead when they are not helpful
- We propose Jenga, a software-defined, reconfigurable cache hierarchy
  - Adopts application-specific organizations on the fly
  - Uses new software algorithms to find near-optimal hierarchies efficiently
- Jenga improves both performance and energy efficiency, by up to 85% in EDP, over a combination of state-of-the-art techniques

SLIDE 151-152

Thank you for your attention! Questions?

Jenga: Software-Defined Cache Hierarchies