SLIDE 1

Efficient Techniques for Sharing On-chip Resources in CMPs

Ruisheng Wang

PhD Oral Defense 2017-05-09

SLIDE 2

Functions as a Service, Machine Learning as a Service, Email as a Service, Database as a Service, Storage as a Service, CRM as a Service, Payments as a Service

“Overall cloud workloads will more than triple from 2015 to 2020.”

Cisco Global Cloud Index



SLIDE 5

Low Server Utilization

"Apple Inc. plans to invest $2 billion to build data centers ..."

Wall Street Journal, 2015

"Google plans to build 12 new cloud-focused data centers in next 18 months ..."

bloomberg.com, 2016

"There are over 7,500 data centers worldwide, with over 2,600 in the top 20 global cities alone, and data center construction will grow 21% per year through 2018."

ciena.com, 2016


SLIDE 7

Low Server Utilization

"Various analyses estimate industry-wide utilization is between 6% and 12%."

"Reconciling High Server Utilization and Sub-millisecond Quality-of-Service" by Jacob Leverich and Christos Kozyrakis, 2014

"Such WSCs tend to have relatively low average utilization, spending most of (their) time in the 10%-50% CPU utilization range."

"The Datacenter as a Computer" by Luiz Andre Barroso, Jimmy Clidaras, and Urs Holzle, 2013

Overprovisioning!!! Workload Interference on Shared On-Chip Resources


SLIDE 10

Resource Interference (Uncontrolled Sharing)

[Diagram: an offline batch analytics job (MapReduce) and a user-facing latency-critical job (Web Search) contend for the shared cache and memory bandwidth to DRAM, causing an SLO violation]

To enable aggressive workload collocation, shared on-chip resources need to be controlled in an efficient and effective way.


SLIDE 13

Shared On-chip Resources

Last-Level Cache

  • Partitioning-induced associativity loss
  • Unpredictable miss rate curve

Off-Chip Memory Bandwidth

  • Unfair/unreasonable memory bandwidth allocation

On-Chip Network

  • Expensive deadlock avoidance

[Die diagram of an Intel Core i7-5960X: eight cores, queue/uncore/I/O, the shared L3 cache, the on-chip network, and a memory controller driving DRAM bandwidth]

SLIDE 14

My Contributions

Efficient techniques for sharing the last-level cache, off-chip memory bandwidth, and the on-chip network

  • Last-Level Cache
    – Futility Scaling: High-Associativity Cache Partitioning (MICRO 2014)
    – Predictable Cache Protection Policy (under preparation for submission)
  • Off-Chip Memory Bandwidth
    – Analytical Model for Memory Bandwidth Partitioning (IPDPS 2013)
  • On-Chip Network
    – Bubble Coloring: Low-cost Deadlock Avoidance Scheme (ICS 2013)


SLIDE 16

An Analytical Performance Model for Memory Bandwidth Partitioning


SLIDE 18

Shared Memory Bandwidth Management

Focus on fairness

  • Fair Queuing Memory System – divides the memory bandwidth equally among applications [Nesbit et al., 2006]

Focus on throughput

  • ATLAS – prioritizes the applications that have attained the least service over others [Kim et al., 2010a]

Focus on both throughput and fairness

  • Thread Cluster Memory Scheduler – improves both system throughput and fairness by clustering different types of threads together [Kim et al., 2010b]

What are the best memory bandwidth partitioning schemes for different system performance objectives?


SLIDE 20

Model for Memory Bandwidth Partitioning

$$\max_{\mathbf{x}} \;\; \mathrm{SystemObjectiveFunction}(\mathbf{x}) \quad \text{subject to} \quad \sum_{i=1}^{N} x_i \le B$$

where $x_i \ge 0$ is the memory bandwidth share of application $i$ ($i = 1, \ldots, N$) and $B$ is the total available bandwidth.

Common System Performance Objectives

  • Throughput-oriented: Weighted Speedup / Sum of IPCs
  • Fairness: Minimum Fairness (Lowest Speedup)
  • Balancing throughput and fairness: Harmonic Weighted Speedup
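As a concrete illustration of this formulation, the sketch below (mine, not the dissertation's artifact) solves the constrained problem numerically with scipy for an arbitrary objective; the APC_alone values and B are made-up inputs.

```python
# Numeric sketch of the partitioning model: maximize any objective(x)
# subject to sum(x) <= B. Inputs here are hypothetical.
import numpy as np
from scipy.optimize import minimize

def optimal_shares(objective, apc_alone, B):
    n = len(apc_alone)
    res = minimize(lambda x: -objective(x, apc_alone),      # maximize
                   x0=np.full(n, B / n),
                   bounds=[(1e-9, B)] * n,
                   constraints=[{"type": "ineq",
                                 "fun": lambda x: B - x.sum()}])
    return res.x

# Example objective: harmonic weighted speedup (defined on a later slide).
hsp = lambda x, apc: len(x) / np.sum(apc / x)
shares = optimal_shares(hsp, np.array([0.01, 0.04]), B=0.05)
print(shares / shares.sum())   # -> roughly [1/3, 2/3], the sqrt(APC) ratio
```

The solver converges to the square-root split that the model derives in closed form on a later slide.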

SLIDE 21

Single Application Performance Model

$$\mathrm{IPC}_{shared,i} = \frac{\mathrm{APC}_{shared,i}}{\mathrm{API}_i} = \frac{x_i}{\mathrm{API}_i}$$

  • IPC: Instructions Per Cycle
  • APC: memory Accesses Per Cycle
  • API: memory Accesses Per Instruction

Example

Assume an application takes 10,000 cycles to execute 1,000 instructions, during which it generates 100 memory accesses

  • IPC = 1,000/10,000 = 0.1
  • API = 100/1,000 = 0.1
  • APC = 100/10,000 = 0.01


SLIDE 23

Harmonic Weighted Speedup

$$\max_{\mathbf{x}} \; H_{sp} = \frac{N}{\sum_{i=1}^{N} \frac{\mathrm{IPC}_{alone,i}}{\mathrm{IPC}_{shared,i}}} = \frac{N}{\sum_{i=1}^{N} \frac{\mathrm{APC}_{alone,i}}{x_i}} \quad \text{subject to} \quad \sum_{i=1}^{N} x_i \le B$$

  • Optimal Partitioning — Square_root

$$\frac{x_i}{x_j} = \frac{\sqrt{\mathrm{APC}_{alone,i}}}{\sqrt{\mathrm{APC}_{alone,j}}}$$
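Why the square root is optimal is not spelled out on the slide; a one-line Cauchy-Schwarz argument fills the gap. Maximizing $H_{sp}$ means minimizing the denominator $\sum_i \mathrm{APC}_{alone,i}/x_i$, and

$$\left(\sum_{i=1}^{N} \frac{\mathrm{APC}_{alone,i}}{x_i}\right)\left(\sum_{i=1}^{N} x_i\right) \;\ge\; \left(\sum_{i=1}^{N} \sqrt{\mathrm{APC}_{alone,i}}\right)^{2},$$

with equality exactly when $x_i \propto \sqrt{\mathrm{APC}_{alone,i}}$; spending the whole budget ($\sum_i x_i = B$) then yields the Square_root scheme.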

slide-24
SLIDE 24

Fairness

IPCshared,i IPCalone,i = IPCshared,j IPCalone,j = ⇒ xi APCalone,i = xj APCalone,j

  • Optimal Partitioning — Proportional

xi xj = APCalone,i APCalone,j

12 / 36

SLIDE 25

Weighted Speedup

$$\max_{\mathbf{x}} \; W_{sp} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{IPC}_{shared,i}}{\mathrm{IPC}_{alone,i}} = \frac{1}{N} \sum_{i=1}^{N} \frac{x_i}{\mathrm{APC}_{alone,i}} \quad \text{subject to} \quad \sum_{i=1}^{N} x_i \le B$$

  • Optimal Partitioning — Priority_APC
    – A fractional knapsack problem
    – The optimal memory request scheduling is to always prioritize the requests from an application with a lower APC_alone over the ones from an application with a higher APC_alone
    – Similarly, the optimal partitioning for Sum of IPCs is Priority_API
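A minimal sketch of the fractional-knapsack view (my illustration; it assumes each application's bandwidth demand is capped at its APC_alone):

```python
# Priority_APC as a fractional knapsack: serve applications in increasing
# APC_alone order until the bandwidth budget B runs out.
def priority_apc_shares(apc_alone, B):
    shares, remaining = {}, B
    for app, apc in sorted(apc_alone.items(), key=lambda kv: kv[1]):
        shares[app] = min(apc, remaining)   # lower-APC apps are served first
        remaining -= shares[app]
    return shares

print(priority_apc_shares({"app1": 0.01, "app2": 0.04}, B=0.03))
# {'app1': 0.01, 'app2': 0.02}: app1 is fully served, app2 gets the rest
```

Each unit of bandwidth given to application $i$ buys $1/\mathrm{APC}_{alone,i}$ of weighted speedup, so the greedy order is optimal, exactly as in the fractional knapsack problem.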


SLIDE 27

Relationship between Performance Objectives and Memory Bandwidth Partitioning

For two applications with $\mathrm{APC}_{alone,1} / \mathrm{APC}_{alone,2} = 1/4$:

  Objective                        Optimal scheme (App 1 : App 2)
  Best Weighted Speedup            Priority_APC
  Best Fairness                    Proportional (1:4)
  Best Harmonic Weighted Speedup   Square_root (1:2)
  Baseline                         Uncontrolled Sharing

No One-Size-Fits-All

Different partitioning schemes are needed for optimizing different system performance objectives

SLIDE 28

Evaluation Methodology

Full-system simulator (gem5) + memory subsystem simulator (DRAMSim2)

System Configuration

  • Cores: four out-of-order cores
  • L1 I-cache/D-cache: 32KB, 2-way, 1 ns, 64B line
  • Private unified L2: 256KB, 8-way, 5 ns, 64B line
  • Memory: DDR2-400, tRP-tRCD-CL: 12.5-12.5-12.5 ns

Workloads

  • Benchmark suite: SPEC CPU 2006
  • 14 workloads, each a mix of 4 benchmarks
  • RSD: Relative Standard Deviation of the APC_alone values of the co-scheduled applications
    – 7 heterogeneous workloads (RSD > 30)
    – 7 homogeneous workloads (RSD < 30)

SLIDE 29

Results: Fairness

[Bar chart: Normalized Minimum Fairness (0.5-2.5) for hetero-1..7, homo-1..7, and their averages under the Equal, Proportional, Priority_APC, Priority_API, Square_root, and 2/3_power schemes]

The Proportional scheme achieves the highest minimum fairness (> 50% improvement over No_partitioning); the homogeneous workloads show the same trend as the heterogeneous ones.

SLIDE 33

Results: Weighted Speedup

[Bar chart: Normalized Weighted Speedup (0.2-2.0) for the same workloads and schemes]

Priority_APC achieves the highest Weighted Speedup (64.2% improvement over No_partitioning)

SLIDE 35

Results: Harmonic Weighted Speedup

[Bar chart: Normalized Harmonic Weighted Speedup (0.2-1.6) for the same workloads and schemes]

The Square_root scheme achieves the highest Harmonic Weighted Speedup (20.3% improvement over No_partitioning)

SLIDE 37

Summary of Bandwidth Partitioning Model

  • An analytical model that establishes the relationship between memory bandwidth partitioning schemes and system performance objectives
  • No one-size-fits-all
    – Based on the model, different optimal partitioning schemes are derived for different performance objectives
  • Extension for cache partitioning:

$$\mathrm{IPC}_{shared,i} = \frac{\mathrm{APC}_{shared,i}}{\mathrm{API}_{shared,i}} = \frac{\mathit{memory\_bandwidth\_share}_i}{F_i(\mathit{cache\_capacity\_share}_i)}$$

which requires a predictable cache miss rate curve $F_i$
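A small sketch of the extension (my illustration; the curve F below is hypothetical): once $F_i$ is known, the shared IPC follows directly from the two shares.

```python
# IPC_shared = bandwidth_share / F(cache_share), where F maps a cache
# capacity share to API (misses per instruction). F here is made up.
def ipc_shared(bandwidth_share, cache_share_mb, F):
    return bandwidth_share / F(cache_share_mb)

F = lambda mb: max(0.001, 0.1 - 0.01 * mb)   # hypothetical API curve
print(ipc_shared(0.01, 4.0, F))              # 0.01 / 0.06 ~= 0.167
```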

SLIDE 40

Predictable Cache Protection Policy

slide-41
SLIDE 41

Overview of Cache Protection Policies

Insertion based Policy

What fraction of incoming lines will be protected? ⇒ insertion ratio ρ Bimodal Insertion Policy (BIP1)

  • 1/32 (ρ) of incoming lines are

inserted to MRU position

  • The rest of incoming lines are

inserted to LRU position

Protecting Distance based Policy

How long will existing lines be protected? ⇒ protecting distance dp Protecting Distance based Policy (PDP2)

  • An inserted/reused line is protected

for dp accesses before its eviction

  • An incoming line will bypass the

cache if no unprotected candidates available

  • 1M. Qureshi, et al. “Adaptive insertion policies for high performance caching” ISCA 2007
  • 2N. Duong, et al. “Improving cache management policies using dynamic reuse distances” MICRO 2012

21 / 36

slide-42
SLIDE 42

Overview of Cache Protection Policies

Insertion based Policy

What fraction of incoming lines will be protected? ⇒ insertion ratio ρ Bimodal Insertion Policy (BIP1)

  • 1/32 (ρ) of incoming lines are

inserted to MRU position

  • The rest of incoming lines are

inserted to LRU position

Protecting Distance based Policy

How long will existing lines be protected? ⇒ protecting distance dp Protecting Distance based Policy (PDP2)

  • An inserted/reused line is protected

for dp accesses before its eviction

  • An incoming line will bypass the

cache if no unprotected candidates available

Why do we need predictability?

  • 1. Help the cache controller to enforce better dp or ρ.
  • 2. Help the resource allocation algorithm to make intelligent decisions

to share the cache.

  • 1M. Qureshi, et al. “Adaptive insertion policies for high performance caching” ISCA 2007
  • 2N. Duong, et al. “Improving cache management policies using dynamic reuse distances” MICRO 2012

21 / 36

SLIDE 43

Predictable Cache Protection Policy (PCPP)

[Diagram: each cache partition is split into a protected region and an unprotected region; insertions go to the unprotected region, lines are promoted into and demoted out of the protected region, and evictions and bypasses happen on the unprotected side]

Operations

On a hit

  • 1. reset the hit line's age to zero
  • 2. promote the line if it is unprotected

On a miss

  • 1. demote any candidate whose age > dp
  • 2. if (1) the number of protected lines < s and (2) an unprotected candidate exists
    – insert the incoming line
    – evict an unprotected candidate
  • otherwise → bypass (ρ = 1 − bypass_rate)
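A minimal one-set sketch of these operations (a reconstruction from the slide, not the author's implementation; it assumes inserted lines start protected and that a free way counts as an unprotected candidate):

```python
class PCPPSet:
    """One cache set under PCPP: dp = protecting distance, s = target
    protected size. Each line tracks an age (in accesses) and a
    protected bit."""
    def __init__(self, ways, dp, s):
        self.ways, self.dp, self.s = ways, dp, s
        self.lines = {}  # tag -> {"age": int, "protected": bool}

    def access(self, tag):
        for line in self.lines.values():      # every access ages all lines
            line["age"] += 1
        if tag in self.lines:                 # --- hit ---
            line = self.lines[tag]
            line["age"] = 0                   # 1. reset the hit line's age
            line["protected"] = True          # 2. promote if unprotected
            return True
        for line in self.lines.values():      # --- miss: demote expired ---
            if line["protected"] and line["age"] > self.dp:
                line["protected"] = False
        n_protected = sum(l["protected"] for l in self.lines.values())
        victims = [t for t, l in self.lines.items() if not l["protected"]]
        if n_protected < self.s and (victims or len(self.lines) < self.ways):
            if len(self.lines) >= self.ways:
                del self.lines[victims[0]]    # evict an unprotected candidate
            self.lines[tag] = {"age": 0, "protected": True}
        return False                          # otherwise: bypass

cache = PCPPSet(ways=4, dp=8, s=3)
print([cache.access(a) for a in "AABCA"])  # [False, True, False, False, True]
```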

SLIDE 45

Model Overview

[Diagram: the inputs ρ and dp feed both the PCPP enforcer and the analytical model, which predicts the hit rate h and the protected size s]

Model

  • Inputs (ρ, dp)
    – 1. On a miss, insert the incoming line into the cache with probability ρ
    – 2. Protect an inserted/reused line for at least dp accesses
  • Outputs (h, s)
    – 1. What is the average number of protected lines over time (s)?
    – 2. What is the hit rate (h)?

How to characterize the cache access pattern of an application?

SLIDE 48

Reuse Streak

  • A dp-protected reuse: an access whose reuse distance ≤ dp
  • A dp-protected reuse streak: a sequence of consecutive dp-protected reuses
  • Nstreak(l, dp): number of dp-protected reuse streaks whose length is l

Example access trace (time 1-6): A A B A A B

  dp      Reuse streaks                          Average streak length
  dp = 1  Nstreak(1, 1) = 2                      L(1) = 1
  dp = 2  Nstreak(3, 2) = 1                      L(2) = 3
  dp = 3  Nstreak(1, 3) = 1, Nstreak(3, 3) = 1   L(3) = 2

The full streak histogram Nstreak(l, dp) carries complete information; the average streak length L(dp) is the approximation the model uses.
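The definition is mechanical enough to state in code; this sketch (my illustration) counts per-line streaks from a trace and reproduces the table's numbers for the A A B A A B example.

```python
from collections import Counter

def reuse_streaks(trace, dp):
    """Count per-line dp-protected reuse streaks in an access trace."""
    last_seen, streak, n_streak = {}, Counter(), Counter()
    for t, addr in enumerate(trace):
        if addr in last_seen:
            if t - last_seen[addr] <= dp:       # a dp-protected reuse:
                streak[addr] += 1               # extend this line's streak
            elif streak[addr]:                  # an unprotected reuse ends it
                n_streak[streak[addr]] += 1
                streak[addr] = 0
        last_seen[addr] = t
    for length in streak.values():              # close still-open streaks
        if length:
            n_streak[length] += 1
    total_reuses = sum(l * n for l, n in n_streak.items())
    return dict(n_streak), total_reuses / max(sum(n_streak.values()), 1)

for dp in (1, 2, 3):
    print(dp, *reuse_streaks(list("AABAAB"), dp))
# 1 {1: 2} 1.0
# 2 {3: 1} 3.0
# 3 {3: 1, 1: 1} 2.0
```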

SLIDE 54

Average Reuse Streak Length (cactusADM)

[Plot: average streak length L (10-60) versus protecting distance (1-5, ×2^16); the curve is strongly non-monotonic, e.g. L(2^16) = 1.6, L(2^17) = 60.5, L(2^18) = 5.7]

SLIDE 56

Hit Rate of a Single Streak

Assumption: the insertions of incoming lines are independent.

$$h_{streak}(l, \rho) = l - E(N_{failures}) = l - \frac{(1-\rho)\left(1-(1-\rho)^{l}\right)}{\rho} \;\gtrapprox\; l + 1 - \frac{1}{\rho} \quad (\text{when } l \to \infty)$$

[Plot: hit rate h_streak(l)/l versus streak length l (20-100) at ρ = 1/32, for the precise and approximate models]

Streak Effect

When ρ ≪ 1, a cache protection policy serves as a "filter" that allows long reuse streaks to occupy the cache while blocking short ones
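The closed form can be checked directly; this sketch (an illustration of the independence assumption, not the author's code) compares the precise form, the $l + 1 - 1/\rho$ approximation, and a Monte Carlo run.

```python
import random

def h_streak(l, rho):      # precise: l - E[N_failures]
    return l - (1 - rho) * (1 - (1 - rho) ** l) / rho

def h_streak_mc(l, rho, trials=20_000):
    hits = 0
    for _ in range(trials):
        resident = False
        for _ in range(l):                        # l protected reuses
            if resident:
                hits += 1                         # line is in the cache: hit
            else:
                resident = random.random() < rho  # miss; insert w.p. rho
    return hits / trials

l, rho = 100, 1 / 32
print(h_streak(l, rho), l + 1 - 1 / rho, h_streak_mc(l, rho))
# ~70.3 (precise), 69.0 (approximate), ~70.3 (simulated)
```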

SLIDE 60

Model

  • Hit model h(ρ)

$$h(\rho) = \frac{\text{total hits}}{\text{total accesses}} = \frac{\sum_{l=1}^{\infty} N_{streak}(l) \times h_{streak}(l)}{\text{total accesses}} \;\gtrapprox\; H_{max}\left(1 + \frac{1}{L} - \frac{1}{\rho L}\right) = H_{max} - \frac{H_{max}}{L}\left(\frac{1-\rho}{\rho}\right)$$

  • Size model s(ρ)

$$s(\rho) = \frac{\text{lifetime of all lines}}{\text{total accesses}} = \frac{\text{total hits} \times D + \text{total evictions} \times d_p}{\text{total accesses}} = h(\rho)\,D + \rho\,(1 - h(\rho))\,d_p$$

  Model        Required information
  Precise      full reuse streak pattern
  Approximate  average reuse streak length (L)
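Both outputs reduce to a few profiled scalars; a sketch (my illustration, with hypothetical inputs):

```python
def hit_rate(rho, H_max, L):
    # approximate hit model, clamped at zero for very small rho
    return max(0.0, H_max - (H_max / L) * (1 - rho) / rho)

def protected_size(rho, H_max, L, D, dp):
    h = hit_rate(rho, H_max, L)
    return h * D + rho * (1 - h) * dp   # lines' total lifetime per access

# Sweeping rho traces out a predicted (size, hit rate) curve for this dp.
H_max, L, D, dp = 0.6, 5.7, 40_000, 2 ** 18   # hypothetical profile values
for rho in (1 / 64, 1 / 16, 1 / 4, 1.0):
    print(rho, hit_rate(rho, H_max, L), protected_size(rho, H_max, L, D, dp))
```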

SLIDE 62

Model Validation (cactusADM)

[Three panels plotting hit rate (normalized to Hmax) versus cache size, comparing the Precise model, the Approximate model, a Linear reference, the L → ∞ limit, and Simulation]

  • Short Streaks: dp = 2^16, L(2^16) = 1.6; the curve stays close to the linear line
  • Long Streaks: dp = 2^17, L(2^17) = 60.5; the curve stays close to the L → ∞ limit line
  • Mixed Length: dp = 2^18, L(2^18) = 5.7

SLIDE 67

Hit Rate Curve Construction

"Knee": the point on the approximate curve that has the maximum distance from the linear reference line

$$\rho_{knee}(d_p) = \frac{1}{\sqrt{L}} - \frac{H_{max}}{L\,(1 - H_{max})} \approx \frac{1}{\sqrt{L}}$$

Talus [3]: yields a hit rate curve that traces out the convex hull of a set of points

Apply the Talus technique to (0,0), the knee points, and the max points

[Plot: hit rate versus cache size (2-10 MB), showing the approximate curve, its knee at maximum distance from the linear line, the max point, and (0,0)]

[3] N. Beckmann and D. Sanchez, "Talus: A simple way to remove cliffs in cache performance", HPCA 2015
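A sketch of the construction step (my illustration of the slide's recipe, not the Talus mechanism itself): compute each dp's knee and take the upper convex hull over the candidate points.

```python
import math

def rho_knee(L, H_max):
    return 1 / math.sqrt(L) - H_max / (L * (1 - H_max))   # ~ 1/sqrt(L)

def upper_hull(points):
    """Upper convex hull of (cache_size, hit_rate) points."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop hull[-1] if it lies on or below the chord hull[-2] -> p
            if (x2 - x1) * (p[1] - y1) - (p[0] - x1) * (y2 - y1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Hypothetical candidates: (0,0), knee points, max points for several dp.
print(upper_hull([(0, 0.0), (2.1, 0.22), (3.8, 0.30), (6.7, 0.48)]))
# -> [(0, 0.0), (2.1, 0.22), (6.7, 0.48)]: the dominated point is dropped
```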


SLIDE 72

Profiling Average Reuse Streak Length

$$L = \frac{\text{total reuses}}{\#\text{ of reuse streaks}} = \frac{\text{total reuses}}{\#\text{ of streak starts} - \#\text{ of streak ends}}$$

Detecting the start of a reuse streak: for an access to line A with current reuse distance Dcur and previous reuse distance Dlast,

  • dp < Dcur: no protected reuse
  • Dcur ≤ dp < Dlast: a new reuse streak
  • dp ≥ Dlast: no new reuse streak (an existing streak continues)
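A sketch of the detector (my reading of the slide, not the author's profiler): for every reuse, compare the current and previous reuse distances and count streak starts per dp; it reproduces the earlier A A B A A B example.

```python
from collections import defaultdict

def profile_streaks(trace, dp_values):
    last_t, last_rd = {}, {}
    reuses, starts = defaultdict(int), defaultdict(int)
    for t, addr in enumerate(trace):
        if addr in last_t:
            d_cur = t - last_t[addr]
            d_last = last_rd.get(addr, float("inf"))  # first reuse: no prior
            for dp in dp_values:
                if dp >= d_cur:               # a dp-protected reuse
                    reuses[dp] += 1
                    if dp < d_last:           # previous reuse unprotected:
                        starts[dp] += 1       # a new streak starts here
            last_rd[addr] = d_cur
        last_t[addr] = t
    return {dp: reuses[dp] / max(starts[dp], 1) for dp in dp_values}

print(profile_streaks(list("AABAAB"), [1, 2, 3]))  # {1: 1.0, 2: 3.0, 3: 2.0}
```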

SLIDE 74

Implementation

[Diagram: the access address stream is sampled at a 1/128 rate into a 64×64 shadow tag array (entry fields: lastTS, 12 bits; lastRD, 8 bits; hashedTag, 16 bits), amounting to less than 1% of the cache size. Pre-processing turns the profiled Hmax[...], D[...], and L[...] into miss rate curves; the allocation algorithm consumes the miss rate curves and target sizes; post-processing produces the protecting distances (dp) and target sizes (s) that drive the PCPP enforcer at the last-level cache]
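A sketch of the sampled shadow-tag structure (the field widths are from the slide; the hash, set indexing, and replacement choice are assumptions of mine):

```python
import zlib

SAMPLE_RATE, SETS, WAYS = 128, 64, 64

class ShadowTagArray:
    """Sampled shadow tags feeding the reuse-distance/streak profiler."""
    def __init__(self):
        self.sets = [dict() for _ in range(SETS)]  # hashedTag -> (TS, RD)
        self.clock = 0

    def access(self, addr):
        self.clock += 1
        h = zlib.crc32(addr.to_bytes(8, "little"))
        if h % SAMPLE_RATE:                # keep ~1/128 of the access stream
            return None
        s = self.sets[(h >> 8) % SETS]
        tag = (h >> 16) & 0xFFFF           # 16-bit hashedTag
        last_ts, _ = s.get(tag, (None, None))
        rd = self.clock - last_ts if last_ts is not None else None
        if last_ts is None and len(s) >= WAYS:
            s.pop(next(iter(s)))           # set full: drop an old entry
        s[tag] = (self.clock, rd)          # hardware: 12-bit TS, 8-bit RD
        return rd                          # Dcur; the stored rd is Dlast
```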

SLIDE 77

Results

[Six panels plotting miss rate versus cache size for cactusADM, lbm, mcf, gcc, sphinx3, and xalancbmk, comparing LRU, DRRIP, PDP, and PCPP against the model's Prediction; the curves highlight where the policies' performance differs]

SLIDE 79

PCPP Summary

  • The reuse streak concept and the streak effect, which explain the behavior of a cache protection policy
  • A precise and an approximate model that predict the performance of a cache protection policy based on reuse streak information
  • A runtime profiler for average reuse streak length, and a practical cache protection policy that produces predictable miss rate curves

SLIDE 80

Conclusions

To enable aggressive workload collocation on a chip, shared on-chip resources need to be managed in an efficient and effective way.

  • Last-level cache
    – High-associativity cache partitioning
    – Predictable high-performance cache policy
  • Off-chip memory bandwidth
    – Goal-oriented memory bandwidth allocation
  • On-chip network
    – Low-cost deadlock avoidance

SLIDE 81

My Publications

  • Ruisheng Wang and Lizhong Chen, "Futility Scaling: High-Associativity Cache Partitioning", in Proceedings of the 47th IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2014
  • Lizhong Chen, Lihang Zhao, Ruisheng Wang, and Timothy Mark Pinkston, "MP3: Minimizing Performance Penalty for Power-gating of Clos Network-on-Chip", in Proceedings of the 20th IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 2014
  • Ruisheng Wang, Lizhong Chen, and Timothy Mark Pinkston, "Bubble Coloring: Avoiding Routing- and Protocol-induced Deadlocks with Minimal Virtual Channel Requirement", in Proceedings of the 27th International Conference on Supercomputing (ICS), June 2013
  • Ruisheng Wang, Lizhong Chen, and Timothy Mark Pinkston, "An Analytical Performance Model for Partitioning Off-Chip Memory Bandwidth", in Proceedings of the 27th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2013
  • Lizhong Chen, Ruisheng Wang, and Timothy Mark Pinkston, "Critical Bubble Scheme: An Efficient Implementation of Globally-aware Network Flow Control", in Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2011
  • Yuho Jin, Ruisheng Wang, Woojin Choi, and Timothy Mark Pinkston, "Thread Criticality Support in On-Chip Networks", in Proceedings of the Third International Workshop on Network on Chip Architectures (NoCArc), held in conjunction with the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-43), 2010

SLIDE 82

Thank You For Listening! Questions?