

  1. Efficient Techniques for Sharing On-chip Resources in CMPs Ruisheng Wang PhD Oral Defense 2017-05-09

  2. “Overall cloud workloads will more than triple from 2015 to 2020.” — Cisco Global Cloud Index. Examples: CRM as a Service, Storage as a Service, Machine Learning as a Service, Database as a Service, Functions as a Service, Payments as a Service, Email as a Service

  4. Low Server Utilization

  5. Low Server Utilization — “Apple Inc. plans to invest $2 billion to build data centers ...” Wall Street Journal, 2015. “Google plans to build 12 new cloud-focused data centers in next 18 months ...” bloomberg.com, 2016. “There are over 7,500 data centers worldwide, with over 2,600 in the top 20 global cities alone, and data center construction will grow 21% per year through 2018.” ciena.com, 2016

  6. Low Server Utilization — “Various analyses estimate industry-wide utilization is between 6% and 12%.” “Reconciling High Server Utilization and Sub-millisecond Quality-of-Service” by Jacob Leverich and Christos Kozyrakis, 2014. “Such WSCs tend to have relatively low average utilization, spending most of (their) time in the 10%–50% CPU utilization range.” “Data Center as a Computer” by Luiz Andre Barroso, Jimmy Clidaras, and Urs Holzle, 2013

  7. Low Server Utilization — the same quotes, annotated with two causes: Over-provisioning of Shared On-Chip Resources, and Workload Interference

  8. Resource Interference (Uncontrolled Sharing) — a user-facing, latency-critical application (Web Search) and an offline batch analytics application (MapReduce) contend for the shared cache, memory bandwidth, and DRAM

  10. Resource Interference (Uncontrolled Sharing) — when the user-facing, latency-critical application (Web Search) shares the cache, memory bandwidth, and DRAM with offline batch analytics (MapReduce), the result is an SLO violation. To enable aggressive workload collocation, shared on-chip resources need to be controlled in an efficient and effective way.

  11. Shared On-chip Resources (diagram: Intel Core i7-5960X — eight cores sharing an L3 cache and a memory controller to DRAM, plus queues, uncore, and I/O) • Last-Level Cache – Partitioning-induced associativity loss – Unpredictable miss rate curve • Off-Chip Memory Bandwidth – Unfair/unreasonable memory bandwidth allocation • On-Chip Network – Expensive deadlock avoidance

  14. My Contributions — efficient techniques for sharing the last-level cache, off-chip memory bandwidth, and on-chip network • Last-level Cache – Futility Scaling: High-Associativity Cache Partitioning (MICRO 2014) – Predictable Cache Protection Policy (under preparation for submission) • Off-chip Memory Bandwidth – Analytical Model for Memory Bandwidth Partitioning (IPDPS 2013) • On-Chip Network – Bubble Coloring: Low-cost Deadlock Avoidance Scheme (ICS 2013)

  16. An Analytical Performance Model for Memory Bandwidth Partitioning

  17. Shared Memory Bandwidth Management • Focus on fairness – Fair Queue Memory System: divides the memory bandwidth equally among applications [Nesbit et al., 2006] • Focus on throughput – ATLAS: prioritizes the applications that have attained the least service over others [Kim et al., 2010a] • Focus on both throughput and fairness – Thread Cluster Memory Scheduler: improves both system throughput and fairness by grouping threads into clusters by type [Kim et al., 2010b]

  18. Shared Memory Bandwidth Management — key question: What are the best memory bandwidth partitioning schemes for different system performance objectives?

  19. Model for Memory Bandwidth Partitioning: maximize_x SystemObjectiveFunction(x), subject to sum_{i=1}^{N} x_i ≤ B, where x_i is the memory bandwidth allocated to application i, i = 1, ..., N

  20. Model for Memory Bandwidth Partitioning: maximize_x SystemObjectiveFunction(x), subject to sum_{i=1}^{N} x_i ≤ B. Common system performance objectives: • Throughput-oriented: Weighted Speedup / Sum of IPCs • Fairness: Minimum Fairness (lowest speedup) • Balancing throughput and fairness: Harmonic Weighted Speedup
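The constrained model above can be sketched with a tiny brute-force search over two-application splits. The objective used (harmonic weighted speedup), the APC values, and the grid resolution are illustrative assumptions, not from the talk's evaluation.

```python
# Minimal sketch: search for the split x of budget B that maximizes an
# objective subject to sum(x) <= B. Two applications, coarse grid.
def best_split(objective, B=1.0, steps=100):
    best, best_x = float("-inf"), None
    for k in range(1, steps):
        x = [B * k / steps, B * (steps - k) / steps]  # a feasible 2-way split
        if objective(x) > best:
            best, best_x = objective(x), x
    return best_x

apc_alone = [1.0, 4.0]  # made-up alone-run APCs in a 1:4 ratio

# harmonic weighted speedup as the example objective function
hsp = lambda x: len(x) / sum(a / xi for a, xi in zip(apc_alone, x))

x = best_split(hsp)
# the grid optimum lands near the analytic optimum x = (1/3, 2/3)
assert abs(x[0] - 1 / 3) < 0.02
```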

  21. Single Application Performance Model: IPC_shared,i = APC_shared,i / API_i = x_i / API_i • IPC: Instructions Per Cycle • APC: memory Accesses Per Cycle • API: memory Accesses Per Instruction. Example: an application takes 10,000 cycles to execute 1,000 instructions, during which it generates 100 memory accesses • IPC = 1,000/10,000 = 0.1 • API = 100/1,000 = 0.1 • APC = 100/10,000 = 0.01
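The slide's arithmetic and the model can be reproduced in a few lines. The helper name `perf_metrics` and the allocation value `x_i` are hypothetical; only the 10,000/1,000/100 numbers come from the slide.

```python
# A minimal sketch of the single-application performance model.
def perf_metrics(cycles, instructions, mem_accesses):
    ipc = instructions / cycles        # IPC: Instructions Per Cycle
    api = mem_accesses / instructions  # API: memory Accesses Per Instruction
    apc = mem_accesses / cycles        # APC: memory Accesses Per Cycle
    return ipc, api, apc

ipc, api, apc = perf_metrics(10_000, 1_000, 100)
assert (ipc, api, apc) == (0.1, 0.1, 0.01)

# Under sharing, the allocated bandwidth x_i (accesses per cycle) bounds APC,
# so the model predicts IPC_shared,i = x_i / API_i.
x_i = 0.005                            # made up: half of the alone-run APC
ipc_shared = x_i / api
assert abs(ipc_shared - 0.05) < 1e-12  # half the alone-run IPC, as expected
```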

  23. Harmonic Weighted Speedup: maximize_x H_sp = N / sum_{i=1}^{N} (IPC_alone,i / IPC_shared,i) = N / sum_{i=1}^{N} (APC_alone,i / x_i), subject to sum_{i=1}^{N} x_i ≤ B • Optimal Partitioning — Square_root: x_i / x_j = sqrt(APC_alone,i) / sqrt(APC_alone,j)
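The Square_root rule can be sanity-checked numerically: for any alternative split of the same budget, H_sp should be lower. The APC values and budget `B` below are illustrative assumptions.

```python
import math

# H_sp = N / sum_i(APC_alone_i / x_i), the objective from the slide.
def h_sp(apc_alone, x):
    return len(x) / sum(a / xi for a, xi in zip(apc_alone, x))

apc_alone = [1.0, 4.0]  # made-up alone-run APCs in the talk's 1:4 ratio
B = 1.0                 # made-up total bandwidth budget

def split(ratio):       # allocate B in the given ratio
    s = sum(ratio)
    return [B * r / s for r in ratio]

sqrt_split = split([math.sqrt(a) for a in apc_alone])  # Square_root: 1:2
prop_split = split(apc_alone)                          # Proportional: 1:4
equal_split = split([1.0, 1.0])                        # equal shares: 1:1

# the Square_root split strictly beats both alternatives on H_sp
assert h_sp(apc_alone, sqrt_split) > h_sp(apc_alone, prop_split)
assert h_sp(apc_alone, sqrt_split) > h_sp(apc_alone, equal_split)
```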

  24. Fairness: IPC_shared,i / IPC_alone,i = IPC_shared,j / IPC_alone,j ⇒ x_i / APC_alone,i = x_j / APC_alone,j • Optimal Partitioning — Proportional: x_i / x_j = APC_alone,i / APC_alone,j
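A quick check that Proportional shares equalize per-application speedups (speedup_i = x_i / APC_alone,i, from the model above). The APC values and budget are made-up numbers.

```python
apc_alone = [1.0, 4.0]  # made-up alone-run APCs in the talk's 1:4 ratio
B = 2.5                 # made-up total bandwidth budget

# Proportional partitioning: x_i proportional to APC_alone_i.
total = sum(apc_alone)
x = [B * a / total for a in apc_alone]

# Each application's speedup is x_i / APC_alone_i; proportional shares
# give every application the same speedup.
speedups = [xi / a for xi, a in zip(x, apc_alone)]
assert speedups[0] == speedups[1] == 0.5
```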

  25. Weighted Speedup: maximize_x W_sp = (1/N) sum_{i=1}^{N} (IPC_shared,i / IPC_alone,i) = (1/N) sum_{i=1}^{N} (x_i / APC_alone,i), subject to sum_{i=1}^{N} x_i ≤ B • Optimal Partitioning — Priority_APC – A fractional knapsack problem – The optimal memory request scheduling always prioritizes the requests of an application with a lower APC_alone over those of an application with a higher APC_alone – Similarly, the optimal partitioning for Sum of IPCs is Priority_API
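The Priority_APC scheme can be sketched as a fractional knapsack: each unit of bandwidth given to application i adds 1/APC_alone,i to the sum of speedups, so low-APC applications are served first. The per-application demand cap (an application uses at most its alone-run APC) is my assumption to make the knapsack well-posed; the numeric values are illustrative.

```python
# Greedy fractional-knapsack allocation: serve applications in increasing
# APC_alone order, each up to its alone-run demand, until the budget runs out.
def priority_apc(apc_alone, budget):
    x = [0.0] * len(apc_alone)
    for i in sorted(range(len(apc_alone)), key=lambda i: apc_alone[i]):
        x[i] = min(apc_alone[i], budget)  # bounded by demand and remaining budget
        budget -= x[i]
    return x

apc_alone = [1.0, 4.0]  # made-up alone-run demands, in the talk's 1:4 ratio
x = priority_apc(apc_alone, 3.0)
assert x == [1.0, 2.0]  # the low-APC application is fully served first
```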

  27. Relationship between Performance Objectives and Memory Bandwidth Partitioning — example with APC_alone,1 / APC_alone,2 = 1/4: • Uncontrolled Sharing • Best Fairness: Proportional (1:4) • Best Harmonic Weighted Speedup: Square_root (1:2) • Best Weighted Speedup: Priority_APC. No one-size-fits-all: different partitioning schemes are needed for optimizing different system performance objectives
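The no-one-size-fits-all takeaway can be verified numerically for the slide's 1:4 example: the split that is best for fairness is not best for harmonic weighted speedup. All concrete values are illustrative.

```python
apc = [1.0, 4.0]  # made-up alone-run APCs in the slide's 1:4 ratio
B = 1.0           # made-up total bandwidth budget

def speedups(x):
    return [xi / a for xi, a in zip(x, apc)]

def min_fairness(x):          # the minimum-fairness objective (lowest speedup)
    return min(speedups(x))

def h_sp(x):                  # harmonic weighted speedup
    return len(x) / sum(a / xi for a, xi in zip(apc, x))

prop = [B * 1 / 5, B * 4 / 5]   # Proportional split, 1:4
sqrt_ = [B * 1 / 3, B * 2 / 3]  # Square_root split, 1:2

assert min_fairness(prop) > min_fairness(sqrt_)  # fairness favors Proportional
assert h_sp(sqrt_) > h_sp(prop)                  # H_sp favors Square_root
```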
