Efficient Techniques for Sharing On-chip Resources in CMPs Ruisheng - PowerPoint PPT Presentation

Efficient Techniques for Sharing On-chip Resources in CMPs Ruisheng Wang PhD Oral Defense 2017-05-09

“Overall cloud workloads will more than triple from 2015 to 2020.” Cisco Global Cloud Index CRM as a Service Storage as Maching Learning a Service as a Service Database as a Service Functions as a Service Payments as a Service Email as a Service

CRM as a Service Storage as “Overall cloud workloads will more Maching Learning a Service as a Service than triple from 2015 to 2020.” Database as a Service Cisco Global Cloud Index Functions as a Service Payments as a Service Email as a Service

Low Server Utilization 3 / 36

Low Server Utilization “Apple Inc. plans to invest $2 billion to build data centers ...” Wall Street Journal, 2015 “Google plans to build 12 new cloud-focused data centers in next 18 months ...” bloomberg.com, 2016 “There are over 7,500 data centers worldwide, with over 2,600 in the top 20 global cities alone, and data center con- struction will grow 21% per year through 2018.” ciena.com, 2016 3 / 36

Low Server Utilization “Various analyses estimate industry-wide utilization is between 6% and12%.” “Reconciling High Server Utilization and Sub-millisecond Quality-of-Service” by Jacob Leverich and Christos Kozyrakis, 2014 “Such WSCs tend to have relatively low average utilization, spending most of (their) time in the 10%–50% CPU utilization range.” “Data Center as a Computer” by Luiz Andre Barroso, Jimmy Clidaras, and Urs Holzle, 2013 3 / 36

Low Server Utilization “Various analyses estimate industry-wide utilization is between 6% and12%.” ! ! ! g n i n o s e i c s r i u “Reconciling High Server Utilization and Sub-millisecond Quality-of-Service” by Jacob Leverich and Christos Kozyrakis, 2014 v o o s r e R p r p e h i v C O - n O d e r a h S n o e c n e r e “Such WSCs tend to have relatively low average utilization, spending r f e t n I d a o k l most of (their) time in the 10%–50% CPU utilization range.” r o W “Data Center as a Computer” by Luiz Andre Barroso, Jimmy Clidaras, and Urs Holzle, 2013 3 / 36

Resource Interference (Uncontrolled Sharing) User Facing Offline Batch Latency Critical Analytics (Web Search) (MapReduce) Shared Cache Memory Bandwidth DRAM 4 / 36

Resource Interference (Uncontrolled Sharing) User Facing Offline Batch Latency Critical Analytics (Web Search) (MapReduce) SLO Violation!!! Shared Cache Memory Bandwidth DRAM 4 / 36

Resource Interference (Uncontrolled Sharing) User Facing Offline Batch Latency Critical Analytics (Web Search) (MapReduce) To enable aggressive workload collocation, SLO Violation!!! shared on-chip resources need to be controlled in an efficient and effective way. Shared Cache Memory Bandwidth DRAM 4 / 36

Off-Chip Memory Bandwidth • Unfair/Unreasonable memory bandwidth allocation On-Chip Network • Expensive deadlock avoidance Shared On-chip Resources Last-Level Cache Queue,UnCore,I/O • Partitioning-induced associativity loss • Unpredictable miss rate curve Core Core Core Core Shared L3 Cache Core Core Core Core Memory Controller Bandwidth DRAM Intel Core i7-5960X 5 / 36

On-Chip Network • Expensive deadlock avoidance Shared On-chip Resources Last-Level Cache Queue,UnCore,I/O • Partitioning-induced associativity loss • Unpredictable miss rate curve Core Core Core Core Shared Off-Chip Memory Bandwidth L3 Cache Core Core • Unfair/Unreasonable memory bandwidth allocation Core Core Memory Controller Bandwidth DRAM Intel Core i7-5960X 5 / 36

Shared On-chip Resources Last-Level Cache Queue,UnCore,I/O • Partitioning-induced associativity loss On-Chip Network • Unpredictable miss rate curve Core Core Core Core Shared Off-Chip Memory Bandwidth L3 Cache Core Core • Unfair/Unreasonable memory bandwidth allocation Core Core On-Chip Network Memory Controller • Expensive deadlock avoidance Bandwidth DRAM Intel Core i7-5960X 5 / 36

My contributions Efficient techniques for sharing last-level cache, off-chip memory bandwidth and on-chip network My contributions • Last-level Cache – Futility Scaling: High-Associativity Cache Partitioning (MICRO 2014) – Predictable Cache Protection Policy (under preparation for submission) • Off-chip Memory Bandwidth – Analytical Model for Memory Bandwidth Partitioning (IPDPS 2013) • On-Chip Network – Bubble Coloring: Low-cost Deadlock Avoidance Scheme (ICS 2013) 6 / 36

Title Page An Analytical Performance Model for Memory Bandwidth Partitioning

Shared Memory Bandwidth Management Focus on fairness • Fair Queue Memory System – divide the memory bandwidth equally for each application [Nesbit et al., 2006] Focus on throughput • ATLAS – prioritize the applications that have attained the least service over others [Kim et al., 2010a] Focus on both throughput and fairness • Thread Cluster Memory Scheduler – improves both system throughput and fairness by clustering different types of threads together [Kim et al., 2010b] 8 / 36

Shared Memory Bandwidth Management Focus on fairness • Fair Queue Memory System – divide the memory bandwidth equally for each application [Nesbit et al., 2006] What are the best memory bandwidth Focus on throughput partitioning schemes for different system • ATLAS – prioritize the applications that have attained the least service over others [Kim et al., 2010a] performance objectives? Focus on both throughput and fairness • Thread Cluster Memory Scheduler – improves both system throughput and fairness by clustering different types of threads together [Kim et al., 2010b] 8 / 36

Model for Memory Bandwidth Partitioning maximize SystemObjectiveFunction ( x ) x N ∑ subject to x i ≤ B , i = 1, ... , N i =1 9 / 36

Model for Memory Bandwidth Partitioning maximize SystemObjectiveFunction ( x ) x N ∑ subject to x i ≤ B , i = 1, ... , N i =1 Common System Performance Objectives Throughput-oriented: Weighted Speedup / Sum of IPCs Fairness: Minimum Fairness (Lowest Speedup) Balancing throughput and fairness: Harmonic Weighted Speedup 9 / 36

Example Assume an application takes 10,000 cycles to execute 1,000 instructions, during which it generates 100 memory accesses • IPC = 1,000/10,000 = 0.1 • API = 100/1,000 = 0.1 • APC = 100/10,000 = 0.01 Single Application Performance Model IPC shared , i = APC shared , i x i = API i API i • IPC: Instructions Per Cycle • APC: memory Accesses Per Cycle • API: memory Accesses Per Instruction 10 / 36

Single Application Performance Model Example IPC shared , i = APC shared , i x i Assume an application takes 10,000 = API i API i cycles to execute 1,000 instructions, during which it generates 100 • IPC: Instructions Per Cycle memory accesses • APC: memory Accesses Per Cycle • IPC = 1,000/10,000 = 0.1 • API: memory Accesses Per • API = 100/1,000 = 0.1 Instruction • APC = 100/10,000 = 0.01 10 / 36

Harmonic Weighted Speedup N N maximize H sp = = ∑ N IPC alone , i ∑ N APC alone , i x i =1 i =1 IPC shared , i x i N ∑ subject to x i ≤ B , i = 1, ... , N i =1 • Optimal Partitioning — Square_root √ APC alone , i x i = x j √ APC alone , j 11 / 36

Fairness IPC shared , i = IPC shared , j x i x j = = ⇒ IPC alone , i IPC alone , j APC alone , i APC alone , j • Optimal Partitioning — Proportional x i = APC alone , i x j APC alone , j 12 / 36

Weighted Speedup N N W sp = 1 = 1 IPC shared , i x i ∑ ∑ maximize x N IPC alone , i N APC alone , i i =1 i =1 N ∑ subject to x i ≤ B , i = 1, ... , N i =1 • Optimal Partitioning — Priority_APC – A fractional Knapsack problem – The optimal memory request scheduling is to always prioritize the requests from an application with a lower APC alone over the ones from an application with a higher APC alone – Similarly, the optimal partitioning for sum of IPCs is Priority_API 13 / 36

Relationship between Performance Objectives and Memory bandwidth Partitioning Application 1 APC alone ,1 = 1 APC alone ,2 4 Application 2 Uncontrolled Sharing Best Fairness Proportional (1:4) Best Harmonic Weighted Speedup Square_root (1:2) Best Weighted Speedup Priority_APC 14 / 36

Relationship between Performance Objectives and Memory bandwidth Partitioning Application 1 APC alone ,1 = 1 APC alone ,2 4 Application 2 No One-Size-Fits-All Different partitioning schemes are needed for optimizing different Uncontrolled Sharing system performance objectives Best Fairness Proportional (1:4) Best Harmonic Weighted Speedup Square_root (1:2) Best Weighted Speedup Priority_APC 14 / 36

Efficient Techniques for Sharing On-chip Resources in CMPs Ruisheng - PowerPoint PPT Presentation

Efficient Techniques for Sharing On-chip Resources in CMPs Ruisheng Wang PhD Oral Defense 2017-05-09 Overall cloud workloads will more than triple from 2015 to 2020. Cisco Global Cloud Index CRM as a Service Storage as Maching

Australian Junior Resources Blue Chip Australian Junior Resources Blue Chip Australian Junior

Computer Science Coordinate Major Required courses: CMPS 1500 Introduction to Computer

Calibration des Microroc (II) Alex, Cyril, Giom, Jean, Max 09 Mai 2011, Annecy 1 Reminder 2

CMPS 2200 Fall 2015 Probability and Expected Values Carola Wenk 11/18/15 CMPS 2200

CMPS 3120/6120 CMPS 3120/6120 Computational Geometry p y Carola Wenk Department of Computer

Secret Sharing and Visual Cryptography Outline Secret Sharing Visual Secret Sharing

Australian Junior Resources Blue Chip Australian Junior Resources Blue Chip EMERGING STARS IN

Study Of Chip Breaker El-Sherbeeny, PhD 2014 Project-Group 6 TYPES ES OF F CHI HIP a)

Final Assembly Chip Core Your final project chip consists of a core The Chip Core is

Exploring Chip to Chip Photonic Networks Philip Watts Computer Laboratory University of Cambridge

Advanced Tools from Modern Cryptography Lecture 3 Secret-Sharing (ctd.) Secret-Sharing Last

Starting Your Own CmPS Team A presentation by Laura Gary and Michelle Vecchio-Weinmeister Who

ESCRI-SA Knowledge Sharing Sharing Objectives and Components A presentation for the ESCRI-SA

MESA EVENTS and RESERVATIONS SPRING 2016 APRIL 2016 78 CMPS 1604049 4/2/2016 Arnie

P and NP Carola Wenk Slides courtesy of Piotr Indyk with additions by Carola Wenk CMPS 6610

Frchet Distance Carola Wenk CMPS 2200 Introduction to Algorithms 10/5/15 1 Polygonal Curves

Pseudokarst - introduction A talk to ASF conference, Chillagoe, April 2011 These are the

CERTIFICATE 2 IN BUSINESS various businesses to see how their offices are run. We will also be

!"#$%&'(&)+(, ! ! #2(,3"$4 !"#$%&'(&)+#,-./ !

The San Francisco State College of Business Welcomes you to the Third International Workshop on

Integrating functions over curves Recall that for a (smooth) curve C parametrized by a

SUPPORTING SCHOOLS AND STUDENTS TO ACHIEVE SHERRI YBARRA, SUPERINTENDENT OF PUBLIC INSTRUCTION

CS184a: Computer Architecture (Structures and Organization) Day1: September 25, 2000

G EOLOGICAL REPOSITORY OF F RENCH HLW C IGO P ROJECT : T HE GEOLOGICAL STORAGE OF HLW DEVELOPED

Sambuz

Useful Links

Newsletter

Mail Us

Efficient Techniques for Sharing On-chip Resources in CMPs Ruisheng - PowerPoint PPT Presentation

Efficient Techniques for Sharing On-chip Resources in CMPs Ruisheng Wang PhD Oral Defense 2017-05-09 Overall cloud workloads will more than triple from 2015 to 2020. Cisco Global Cloud Index CRM as a Service Storage as Maching

Australian Junior Resources Blue Chip Australian Junior Resources Blue Chip Australian Junior

Computer Science Coordinate Major Required courses: CMPS 1500 Introduction to Computer

Calibration des Microroc (II) Alex, Cyril, Giom, Jean, Max 09 Mai 2011, Annecy 1 Reminder 2

CMPS 2200 Fall 2015 Probability and Expected Values Carola Wenk 11/18/15 CMPS 2200

CMPS 3120/6120 CMPS 3120/6120 Computational Geometry p y Carola Wenk Department of Computer

Secret Sharing and Visual Cryptography Outline Secret Sharing Visual Secret Sharing

Australian Junior Resources Blue Chip Australian Junior Resources Blue Chip EMERGING STARS IN

Study Of Chip Breaker El-Sherbeeny, PhD 2014 Project-Group 6 TYPES ES OF F CHI HIP a)

Final Assembly Chip Core Your final project chip consists of a core The Chip Core is

Exploring Chip to Chip Photonic Networks Philip Watts Computer Laboratory University of Cambridge

Advanced Tools from Modern Cryptography Lecture 3 Secret-Sharing (ctd.) Secret-Sharing Last

Starting Your Own CmPS Team A presentation by Laura Gary and Michelle Vecchio-Weinmeister Who

ESCRI-SA Knowledge Sharing Sharing Objectives and Components A presentation for the ESCRI-SA

MESA EVENTS and RESERVATIONS SPRING 2016 APRIL 2016 78 CMPS 1604049 4/2/2016 Arnie

P and NP Carola Wenk Slides courtesy of Piotr Indyk with additions by Carola Wenk CMPS 6610

Frchet Distance Carola Wenk CMPS 2200 Introduction to Algorithms 10/5/15 1 Polygonal Curves

Pseudokarst - introduction A talk to ASF conference, Chillagoe, April 2011 These are the

CERTIFICATE 2 IN BUSINESS various businesses to see how their offices are run. We will also be

!&quot;#$%&amp;'(&amp;)*+(, ! ! #2(,3&quot;$4 !&quot;#$%&amp;'(&amp;)*+#,-./ !

The San Francisco State College of Business Welcomes you to the Third International Workshop on

Integrating functions over curves Recall that for a (smooth) curve C parametrized by a

SUPPORTING SCHOOLS AND STUDENTS TO ACHIEVE SHERRI YBARRA, SUPERINTENDENT OF PUBLIC INSTRUCTION

CS184a: Computer Architecture (Structures and Organization) Day1: September 25, 2000

G EOLOGICAL REPOSITORY OF F RENCH HLW C IGO P ROJECT : T HE GEOLOGICAL STORAGE OF HLW DEVELOPED

Sambuz

Useful Links

Newsletter

Mail Us

!"#$%&'(&)+(, ! ! #2(,3"$4 !"#$%&'(&)+#,-./ !