Nexus: A New Approach to Replication in Distributed Shared Caches

slide-1
SLIDE 1

Nexus: A New Approach to Replication in Distributed Shared Caches

Po-An Tsai, Nathan Beckmann, and Daniel Sanchez

slide-5
SLIDE 5

Executive summary

• Data replication reduces the access latency of non-uniform caches (NUCA)
• But replicating too aggressively leads to more cache misses
• Prior adaptive techniques focus on which data to replicate at each core
    • Data that is not replicated locally still incurs high latency
• Nexus instead focuses on how much to replicate across the system
    • Chooses the best number of replicas for the whole read-only working set
    • Lets cores access replicas beyond their local bank
    • Outperforms a state-of-the-art replication technique by up to 90%

slide-10
SLIDE 10

The last-level cache (LLC) has become distributed and non-uniform (NUCA)

[Figure: tiled chip; each tile holds a core, L1I and L1D caches, and an LLC bank. A thread's LLC data may sit in a near or a far LLC bank.]

The key problem is which data to place and where on the chip to place it.

slide-14
SLIDE 14

Static NUCA (S-NUCA) spreads data using a fixed line-to-bank mapping

[Figure: threads and their LLC data spread over all banks; some data is near, most is far]

Simple, but the average distance to data is large
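The fixed mapping and its distance cost can be illustrated with a minimal sketch. The modulo interleaving, the 64-byte line size, and the square mesh with Manhattan-distance routing are all assumptions made for illustration; the slides do not specify how S-NUCA hashes lines to banks.

```python
from itertools import product

LINE_BYTES = 64  # assumed cache-line size


def snuca_bank(addr: int, num_banks: int) -> int:
    """Fixed line-to-bank mapping: consecutive lines go to consecutive banks."""
    return (addr // LINE_BYTES) % num_banks


def hops(tile_a: int, tile_b: int, side: int) -> int:
    """Manhattan distance between two tiles of a side x side mesh."""
    ax, ay = tile_a % side, tile_a // side
    bx, by = tile_b % side, tile_b // side
    return abs(ax - bx) + abs(ay - by)


def average_hops(side: int) -> float:
    """Mean core-to-bank distance when every bank is equally likely."""
    tiles = range(side * side)
    total = sum(hops(core, bank, side) for core, bank in product(tiles, tiles))
    return total / (side * side) ** 2


if __name__ == "__main__":
    print(snuca_bank(0x1040, 16))  # bank chosen purely by address bits
    print(average_hops(4))         # small mesh
    print(average_hops(12))        # larger mesh: distance grows with the side
```

Because the bank depends only on the address, a line is equally likely to land in any bank, so the expected distance grows with the mesh side (2.5 hops on a 4x4 mesh under these assumptions).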

slide-24
SLIDE 24

Replication reduces the distance to read-only data

Cache replicated read-only lines locally and check the local bank first. Upon a miss in the local bank, check the directory (at the line’s original location).

[Figure: threads and LLC data; read-only lines A, B, C, D are replicated in the threads' local banks]

A, B, C, D are local, but replicated lines compete for cache capacity with other data.

Replicating too aggressively causes more cache misses than no replication.

slide-30
SLIDE 30

Adaptive replication in directory-based dynamic NUCAs

(ASR [Beckmann, MICRO 2006], SP-NUCA [Dybdahl, HPCA 2007], ECC [Herrero, ISCA 2010], Locality-aware replication [Kurian, HPCA 2014])

Be selective about which lines to replicate, but always replicate them in the core’s local bank.

[Figure: threads and LLC data; of the read-only lines A, B, C, D, only A is replicated in the local banks]

A is nearby, but B, C, D are still far away.

Read-only data that is not replicated still causes high latency.

slide-36
SLIDE 36

Nexus spreads replicas across nearby banks to replicate more

Threads within a cluster (a group of cores) share one replica of the read-only data.

[Figure: threads and LLC data; each cluster spreads one copy of read-only lines A, B, C, D over its banks]

A is local; B, C, D are just one hop away.

All threads enjoy fast access to all read-only data by replicating beyond their local bank.

slide-39
SLIDE 39

An experiment to show why and when Nexus is better

A multithreaded workload that regularly scans over shared read-only data

[Figure: access latency (lower is better); annotations: "High average latency", "Data no longer fits in LLC"]

slide-42
SLIDE 42

Previous replication techniques are ineffective when the read-only data does not fit in the local bank

[Figure: access latency (lower is better); annotations: "Data fits in the local bank", "Some data is replicated in the local bank, but most data stays remote"]

slide-48
SLIDE 48

Nexus allows replication even when read-only data cannot fit in the local bank

[Figure: access latency (lower is better), annotated with four operating points:]

• Data fits in the local bank, each thread owns 1 replica
• 1 replica shared by every 4 neighbors
• 1 replica shared by every 16 neighbors
• 1 replica shared by all threads → same as S-NUCA

A significant latency reduction over prior work!

slide-52
SLIDE 52

Recent directory-less dynamic NUCAs enable replication beyond the local bank

Data placement is controlled using the virtual memory system and does not require a global directory.

Data can be dynamically mapped to nearby banks and shared by arbitrary cores.

[Figure: core with TLB; threads and LLC data X, Y, Z]
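To make the directory-less idea concrete, here is a minimal sketch in which a translation entry carries the placement, so a core can compute the target bank locally. The `Tlb` class, its fields, and the page and line sizes are hypothetical names and values chosen for illustration; they are not the interface of any of the designs cited in this talk.

```python
LINE_BYTES = 64    # assumed cache-line size
PAGE_BYTES = 4096  # assumed page size


class Tlb:
    """Hypothetical TLB whose entries also record where a page's lines live."""

    def __init__(self):
        # virtual page number -> tuple of bank ids the page is spread over
        self.entries = {}

    def map_page(self, vpn, banks):
        """Software (for example the OS) installs or changes a page's placement."""
        self.entries[vpn] = tuple(banks)

    def bank_for(self, vaddr):
        """Pick the bank for an address without consulting any directory."""
        banks = self.entries[vaddr // PAGE_BYTES]
        line = vaddr // LINE_BYTES
        return banks[line % len(banks)]


if __name__ == "__main__":
    tlb = Tlb()
    tlb.map_page(0, [5, 6])  # page 0 interleaved over two nearby banks
    print(tlb.bank_for(0), tlb.bank_for(64))
    tlb.map_page(0, [9])     # remapping moves the data to another bank
    print(tlb.bank_for(64))
```

The point of the sketch is only that placement becomes a property of the address translation, which is what lets data be mapped to nearby banks and shared by arbitrary cores.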

slide-56
SLIDE 56

The number of replicas (replication degree) is important

[Figure: threads sharing 4 MB of read-only data in a 16 MB LLC]

Replicating 4 times works best (4 × 4 MB read-only = 16 MB).

Choosing how much to replicate is more important than choosing which lines to replicate.

slide-60
SLIDE 60

The number of replicas (replication degree) is important

[Figure: threads with 8 MB of other data and 1 MB of read-only data in a 16 MB LLC]

Replicating 8 times works best (8 × 1 MB read-only + 8 MB other = 16 MB).

Too few replicas cause extra network traversals, while too many cause unnecessary cache misses.
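The arithmetic behind both capacity examples (4 MB of read-only data alone, and 1 MB of read-only data next to 8 MB of other data, each in a 16 MB LLC) can be written as a small capacity-only rule. This is a sketch of the arithmetic on the slides, not how Nexus picks the degree; the function name and parameters are made up for illustration.

```python
def capacity_bound_degree(llc_mb, read_only_mb, other_mb=0.0):
    """Largest number of read-only replicas that fits beside the other data.

    Capacity-only rule of thumb: it ignores access frequency and network
    latency, which a real mechanism would have to measure.
    """
    spare_mb = llc_mb - other_mb
    return max(1, int(spare_mb // read_only_mb))


if __name__ == "__main__":
    # 4 MB read-only working set, nothing else competing for the 16 MB LLC.
    print(capacity_bound_degree(16, 4))
    # 1 MB read-only working set plus 8 MB of other data.
    print(capacity_bound_degree(16, 1, other_mb=8))
```

Going above this bound pushes data out of the cache (extra misses); staying far below it leaves replicas farther away than necessary (extra network hops).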

slide-63
SLIDE 63

No adaptive replication in directory-less D-NUCAs

Reactive-NUCA (R-NUCA) [Hardavellas, ISCA 2009] always replicates instructions statically, with one replica for every 4 cores.

Other directory-less D-NUCAs do not replicate data.

[Figure: threads, other data, and instructions (read-only)]

slide-67
SLIDE 67

Workloads prefer different replication degrees

• Study read-only-data-intensive workloads running on a 144-core system
• Apply different replication degrees to all read-only data

Observation 1: Applications prefer different degrees, requiring an adaptive approach.

Observation 2: A few replication degrees suffice.

slide-72
SLIDE 72

Nexus: enabling adaptive replication degrees in NUCA

• Builds on top of directory-less D-NUCAs
    • Read-only data’s on-chip location and coherence are tracked via the virtual memory system
    • Cores access and share the closest replicas without directory overheads
• Nexus-R builds on R-NUCA [Hardavellas, ISCA’09]
    • Supports flexible replication degrees for all read-only data
    • Leverages set-sampling to choose the best replication degree
• Nexus-J builds on Jigsaw [PACT’13, HPCA’15]
    • Extends Jigsaw’s configuration algorithm to select the best replication degree
    • Outperforms Nexus-R in multi-program workloads

Focus of this talk

slide-82
SLIDE 82

Nexus-R: Applying Nexus to R-NUCA

• Nexus uses the virtual memory system to classify pages into three types: thread-private, shared read-only, and shared read-write
    • Similar to R-NUCA, but differentiates all read-only data (not just instructions)

[Figure: page-classification state diagram. States: Unknown, Thread Private, Shared Read-only (Nexus-R replicates this), Shared Read-write. Transitions: first TLB miss; read TLB miss from other thread; write TLB miss from other thread. Timeline for Thread 0 and Thread 1 over pages X, Y, Z: Read X, Read Y, Write Z, Read Y, Read Z.]
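The classification can be read as a small state machine driven by TLB misses. The sketch below is one plausible reading of the diagram, not the authors' implementation: the class and method names are hypothetical, it assumes a write to a shared read-only page also reclassifies it as read-write, and it does not track whether a private page was written by its owner. The access order in the usage example (Thread 0 reads X and Y, Thread 1 reads Y and Z, then Thread 0 writes Z) is likewise an assumed reading of the timeline.

```python
from enum import Enum


class PageClass(Enum):
    UNKNOWN = "unknown"
    THREAD_PRIVATE = "thread-private"
    SHARED_READ_ONLY = "shared read-only"
    SHARED_READ_WRITE = "shared read-write"


class Page:
    """Hypothetical per-page classification state kept by the OS."""

    def __init__(self):
        self.state = PageClass.UNKNOWN
        self.owner = None

    def on_tlb_miss(self, thread, is_write):
        if self.state is PageClass.UNKNOWN:
            # First TLB miss: the page becomes private to the accessing thread.
            self.state = PageClass.THREAD_PRIVATE
            self.owner = thread
        elif self.state is PageClass.THREAD_PRIVATE and thread != self.owner:
            # A miss from another thread makes the page shared.
            self.state = (PageClass.SHARED_READ_WRITE if is_write
                          else PageClass.SHARED_READ_ONLY)
        elif self.state is PageClass.SHARED_READ_ONLY and is_write:
            # Assumption: any later write demotes a read-only page.
            self.state = PageClass.SHARED_READ_WRITE
        return self.state


if __name__ == "__main__":
    pages = {name: Page() for name in "XYZ"}
    trace = [(0, "X", False), (0, "Y", False), (1, "Y", False),
             (1, "Z", False), (0, "Z", True)]
    for thread, name, is_write in trace:
        pages[name].on_tlb_miss(thread, is_write)
    for name, page in pages.items():
        print(name, page.state.value)
```

Under this reading, X ends up thread-private, Y shared read-only (the class Nexus-R replicates), and Z shared read-write.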

slide-89
SLIDE 89

Nexus-R: Applying Nexus to R-NUCA

• Supports flexible replication degrees via flexible cluster sizes
    • R-NUCA always uses a cluster size of 4; Nexus-R supports reconfigurable sizes
• Private data: always local
• Shared read-write data: always like S-NUCA
• Shared read-only data: replicated clusters
    • Replication degree of 9 on 36 cores → clusters of size 4 (36 divided by 9)
    • Replication degree of 4 → clusters of size 9
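The relation between replication degree and cluster size is a simple division, and a core can find its cluster from its mesh coordinates. The sketch below assumes square clusters that tile a square mesh without overlap; that layout and the function names are illustrative assumptions, and real designs may index clusters differently.

```python
import math


def cluster_size(num_cores, degree):
    """Cores per cluster when the read-only data is replicated `degree` times."""
    if num_cores % degree:
        raise ValueError("degree must divide the core count")
    return num_cores // degree


def cluster_of(core, mesh_side, size):
    """Cluster id of a core, assuming square clusters tiling a square mesh."""
    edge = math.isqrt(size)
    if edge * edge != size or mesh_side % edge:
        raise ValueError("cluster size must be a square that tiles the mesh")
    x, y = core % mesh_side, core // mesh_side
    clusters_per_row = mesh_side // edge
    return (y // edge) * clusters_per_row + (x // edge)


if __name__ == "__main__":
    for degree in (9, 4):
        size = cluster_size(36, degree)
        ids = {cluster_of(core, 6, size) for core in range(36)}
        print(f"degree {degree}: cluster size {size}, {len(ids)} clusters")
```

On a 6x6 mesh this yields 9 clusters of 4 cores for degree 9 and 4 clusters of 9 cores for degree 4, matching the two examples on the slide.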

slide-101
SLIDE 101

Nexus-R leverages set-sampling to select the best degree

• Enhances set-sampling to monitor the performance of different degrees
• Compares the cumulative access latency of each degree from sampled sets

[Figure: core and L1s, address-to-bank/set lookup logic, MSHR, and six pairwise counters labeled 1/4, 4/9, 4/36, 1/9, 1/36, 9/36. Counters record the latency difference between degrees.]

1. L1 miss
2. Sampled access for degree of 4
3. Sampled access returns (latency X)
4. Update counters (+X, -X, -X)
5. Vote for the best degree
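The five steps above can be sketched with pairwise difference counters. This is an illustrative model, not the hardware design: the class name, the sign convention (counter a/b accumulates latency of b minus latency of a), and the voting rule are assumptions, and it presumes every degree is sampled equally often.

```python
from itertools import combinations

DEGREES = (1, 4, 9, 36)  # the degrees that appear in the counter labels


class DegreeMonitor:
    """Hypothetical per-core monitor with one counter per pair of degrees."""

    def __init__(self, degrees=DEGREES):
        self.degrees = tuple(degrees)
        # counters[(a, b)] accumulates latency(b) - latency(a) (assumed sign).
        self.counters = {pair: 0 for pair in combinations(self.degrees, 2)}

    def record(self, degree, latency):
        """Steps 3-4: a sampled access for `degree` returned after `latency`."""
        for a, b in self.counters:
            if degree == b:
                self.counters[(a, b)] += latency
            elif degree == a:
                self.counters[(a, b)] -= latency

    def vote(self):
        """Step 5: pick a degree that is not slower than any other degree."""
        def not_slower(d, other):
            pair = (d, other) if (d, other) in self.counters else (other, d)
            diff = self.counters[pair]
            return diff >= 0 if pair[0] == d else diff <= 0

        for d in self.degrees:
            if all(not_slower(d, o) for o in self.degrees if o != d):
                return d
        return self.degrees[0]


if __name__ == "__main__":
    monitor = DegreeMonitor()
    monitor.record(4, 10)  # touches only the three counters that involve 4
    print(monitor.counters)
    for degree, latency in ((1, 30), (9, 20), (36, 50)):
        monitor.record(degree, latency)
    print(monitor.vote())
```

Storing differences rather than per-degree totals means a sampled access for degree 4 updates exactly the three counters that mention 4, one with +X and two with -X under this sign convention.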
slide-108
SLIDE 108

Nexus-R leverages set-sampling to select the best degree

• Sampled sets are spread across several banks
• Threads share sampled sets if they share a read-only replica cluster

[Figure: LLC banks and threads, showing the sampling sets for cluster sizes 1, 4, 9, and 16 for one thread and for another thread in the system]

slide-111
SLIDE 111

Nexus-R makes coordinated decisions across threads

• Uncoordinated decisions work poorly
    • “Tragedy of the commons”: each thread wants to replicate more itself, but wants the others to replicate less
• Nexus-R makes the whole process agree on the best replication degree by using the per-process total latency for each degree
    • The OS reads the latency counters periodically and sets the best degree for a process

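A minimal sketch of the coordinated decision: sum each degree's sampled latency over all threads of a process and choose the degree with the lowest total. The function name, the data layout, and the latency numbers are made up for illustration; the slides only state that the OS periodically reads the counters and sets one degree per process.

```python
from collections import Counter


def best_degree_for_process(per_thread_latency):
    """Pick the degree with the lowest latency summed over all threads.

    per_thread_latency maps a thread id to {degree: cumulative latency}.
    Ties go to the smaller degree (an arbitrary choice for this sketch).
    """
    totals = Counter()
    for latencies in per_thread_latency.values():
        totals.update(latencies)  # adds the per-degree latencies
    return min(sorted(totals), key=totals.__getitem__)


if __name__ == "__main__":
    # Invented counter values: thread t0 alone would prefer degree 9,
    # but the process as a whole is fastest with degree 4.
    counters = {
        "t0": {1: 900, 4: 500, 9: 450},
        "t1": {1: 800, 4: 520, 9: 700},
        "t2": {1: 850, 4: 510, 9: 690},
    }
    print(best_degree_for_process(counters))
    print(best_degree_for_process({"t0": counters["t0"]}))
```

Deciding on the summed latency avoids the case where a thread picks the degree that is best for itself and worse for everyone else.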
slide-113
SLIDE 113

Nexus-R adds small overheads over R-NUCA

• Overheads of applying Nexus to R-NUCA:
    • 1.5% of the LLC used for set-sampling
    • ~100 bits per core for hardware counters
    • Tens of instructions per context switch for the OS support

slide-118
SLIDE 118

Nexus-J: Applying Nexus to Jigsaw

• Jigsaw groups partitions from different banks to create virtual caches (VCs)

[Figure: 4x4 mesh NUCA LLC; tile labels: Core, L1I, L1D, TLB, LLC Bank, NoC; bank partitions grouped into VC1, VC2, VC3, …]

Jigsaw manages capacity among applications and data types, outperforming many D-NUCA techniques.

slide-122
SLIDE 122

Nexus-J: Applying Nexus to Jigsaw

• Jigsaw outperforms R-NUCA’s simple heuristics with better data placement
    • Especially in multi-programmed workloads
    • But Jigsaw never replicates data!
• Nexus-J implements adaptive replication degrees on Jigsaw
    • Combines the ability to allocate capacity between apps with adaptive replication
    • Enhances Jigsaw’s software runtime to select the best replication degree
• See the paper for implementation details

slide-123
SLIDE 123

Evaluation

 Modeled system
 144 Silvermont-like OOO cores
 12x12 mesh
 32KB L1I/D caches
 72MB LLC (0.5MB per core)

 Multithreaded workloads
 Scientific workloads: SPECOMP2012, PARSEC, SPLASH2, BioParallel
 Server workloads: TailBench [Kasture, IISWC’16]
 With various input sizes

[Figure: Tiled CMP Architecture (with four Mem / IO blocks) and Tile Organization (Core, L1I, L1D, LLC Bank, NoC Router).]
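
The modeled-system figures are mutually consistent, which a few lines of arithmetic confirm. The sketch below only restates the numbers on the slide; the class and field names are hypothetical, and it assumes one core per mesh tile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeledSystem:
    """Parameters from the slide; class and field names are hypothetical."""
    mesh_dim: int = 12             # 12x12 mesh, assuming one core per tile
    llc_mb_per_core: float = 0.5   # LLC capacity contributed per core
    l1_kb: int = 32                # size of each L1I and L1D cache

    @property
    def cores(self) -> int:
        return self.mesh_dim * self.mesh_dim

    @property
    def llc_mb(self) -> float:
        return self.cores * self.llc_mb_per_core


system = ModeledSystem()
print(system.cores, system.llc_mb)  # 144 72.0
```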

slide-124
SLIDE 124

Evaluation

 Compared 6 schemes

S-NUCA: No replication (baseline).
R-NUCA: Replicate instructions at a fixed degree.
Jigsaw: Allocate capacity across processes. No replication.
Locality-aware replication [Kurian, HPCA’14]: State-of-the-art directory-based D-NUCA. Selectively replicate cache lines in local bank.
Nexus-R: Nexus on R-NUCA.
Nexus-J: Nexus on Jigsaw.

slide-125
SLIDE 125

Nexus outperforms prior selective replication techniques

 Single-program workloads running with 144 threads

Workloads with small read-only footprint → Nexus matches prior work
Workloads with medium read-only footprint → Nexus outperforms prior work
Workloads with large read-only footprint → Nexus does not hurt performance

slide-130
SLIDE 130

Nexus-J performs best with multi-programmed workloads

 Workload mixes with 4 different apps running with 36 threads each

Replication-sensitive → Nexus-R and Nexus-J are better
Capacity-sensitive → Jigsaw and Nexus-J are better
Sensitive to both → Nexus-J performs the best

slide-135
SLIDE 135

See paper for more results

 Comparison of Nexus-R and Locality-aware replication on 60 apps
 Dynamic replication degree vs. static degrees
 Results for 20 multi-programmed workloads
 Sensitivity studies on system size and cache hierarchy
 Dynamic data reclassification

slide-136
SLIDE 136

Conclusion

 Data replication can improve the performance of NUCA systems
 Replication requires balancing the latency and capacity tradeoff in NUCA

 We propose Nexus, a new approach to adaptive replication
 Unlike prior work, Nexus focuses on how much to replicate the read-only data
 We present two implementations of Nexus: Nexus-R and Nexus-J

 Nexus outperforms the state-of-the-art adaptive scheme
 By up to 90%, and by 23% on average, for replication-sensitive workloads

slide-139
SLIDE 139

Thanks! Questions?
