Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Po-An Tsai, Changping Chen, and Daniel Sanchez

Die-stacking has enabled near-data processing

• Conventional multicore processors use a multi-level deep cache hierarchy to reduce data movement. [Figure labels: cores, private caches, shared LLC]
• Near-data processors place cores close to main memory to reduce data movement. [Figure labels: DRAM dies, logic layer, vault controller, NDP core; private cache only (shallow hierarchy)]
• Neither shallow nor deep hierarchies work well for all applications…

Asymmetric hierarchies get the best of both worlds

• Prior work proposes hybrid systems with asymmetric memory hierarchies to get the best of both [Ahn et al., ISCA’15] [Gao et al., PACT’15] [Hsieh et al., ISCA’16] [Boroumand et al., ASPLOS’18]

Applications have strong hierarchy preferences

[Figure: access latency (ns) for a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss]
[Figure: Performance/J of milc on different hierarchies (normalized Perf/J, deep vs. shallow hierarchy)]
[Figure: Performance/J of xalanc on different hierarchies (normalized Perf/J, deep vs. shallow hierarchy)]

How well each application can use the shared LLC is critical to its preference.

Scheduling programs to the right hierarchy is hard

• Many applications prefer different hierarchies over time because they have different phases. [Figure: Performance/J of gems]
• Applications may prefer different hierarchies due to resource contention with other applications. [Figure: Performance/J of xalanc, normalized Perf/J on the shallow hierarchy and on deep hierarchies with 2MB, 4MB, 8MB, and 16MB LLCs]

Prior schedulers focus on different systems and constraints

• Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12])
  • Focuses on symmetric memory systems (multi-socket LLCs/NUMA) [Figure: LLC 1 and LLC 2, 8MB each]
• Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12], [Cong, ISLPED’11])
  • Focuses on asymmetric core microarchitectures (big.LITTLE systems) [Figure: OoO cores and in-order cores]
• NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16])
  • Focuses on single workloads and requires software modifications or compiler support

By contrast, our goal is to schedule threads considering both memory and core asymmetries, with no program modifications and transparently to users.

AMS: An asymmetry-aware scheduler

• Hardware: hardware utility monitors sample accesses and produce miss curves (misses vs. cache size)
• Software: schedules threads
• First contribution: analytical model that estimates performance under different hierarchies
• Second contribution: two thread placement algorithms (AMS-Greedy/AMS-DP) that extend techniques originally designed for cache partitioning

AMS analytical model

• AMS estimates application preferences using total memory access latency
• Deep hierarchy has a shared LLC
  • Lat = (# accesses × Latency of LLC) + (# misses × Latency of deep mem)
  • # misses is a function of LLC capacity
• Shallow hierarchy has no shared LLC
  • Lat = # accesses × Latency of shallow mem

[Figure: miss curve from hardware monitors, # misses vs. LLC capacity (2 to 8 MB)]
[Figure: latency curve model, latency vs. LLC capacity (2 to 8 MB) for the NDP core and the processor-die core, with regions marked "Use NDP core" and "Use processor-die core"]
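The two latency formulas on this slide can be sketched in a few lines of Python. This is only an illustration of the comparison the slide describes, not the authors' implementation: the function names, the miss-curve values, and the latencies (20, 60, and 40 ns) are all hypothetical numbers chosen to make the crossover visible.

```python
from typing import Dict


def deep_latency(accesses: int, misses: int,
                 llc_latency: int, deep_mem_latency: int) -> int:
    """Deep hierarchy: every access pays the LLC latency, and every
    LLC miss additionally pays the deep-memory latency."""
    return accesses * llc_latency + misses * deep_mem_latency


def shallow_latency(accesses: int, shallow_mem_latency: int) -> int:
    """Shallow hierarchy: no shared LLC, so every access goes to memory."""
    return accesses * shallow_mem_latency


def preferred_hierarchy(accesses: int, miss_curve: Dict[int, int],
                        llc_latency: int, deep_mem_latency: int,
                        shallow_mem_latency: int) -> Dict[int, str]:
    """For each LLC capacity (MB) in the miss curve, report which
    hierarchy has the lower total memory access latency."""
    shallow = shallow_latency(accesses, shallow_mem_latency)
    result = {}
    for capacity_mb, misses in miss_curve.items():
        deep = deep_latency(accesses, misses, llc_latency, deep_mem_latency)
        result[capacity_mb] = "deep" if deep < shallow else "shallow"
    return result


# Hypothetical miss curve: LLC capacity (MB) -> misses out of 1000 accesses.
EXAMPLE_MISS_CURVE = {2: 800, 4: 500, 6: 200, 8: 100}

choice = preferred_hierarchy(
    accesses=1000,
    miss_curve=EXAMPLE_MISS_CURVE,
    llc_latency=20,          # ns, assumed
    deep_mem_latency=60,     # ns, assumed
    shallow_mem_latency=40,  # ns, assumed
)
print(choice)
```

With these made-up numbers the application prefers the shallow hierarchy while it has 4 MB of LLC or less and the deep hierarchy once it has 6 MB or more, which mirrors the "Use NDP core" and "Use processor-die core" regions in the latency curve figure.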

Handling heterogeneous cores

• Combine model from prior work (PIE) with our memory latency model

[Figure: latency curves (memory latency vs. LLC capacity, 2 to 8 MB, for the NDP core and the processor-die core), weighted by MLP to produce memory stall curves (memory stalls vs. LLC capacity for the same two cores)]
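The slide only says that latency curves are weighted by MLP to obtain memory stall curves. One plausible reading, assumed here and not confirmed by the slides, is that stall time is total memory latency divided by the memory-level parallelism the core can sustain. The sketch below uses that assumption; the function names and the MLP values are hypothetical.

```python
def memory_stalls(total_latency: float, mlp: float) -> float:
    """Assumed weighting: overlapping misses hide latency, so stall
    time is total latency divided by memory-level parallelism."""
    if mlp <= 0:
        raise ValueError("mlp must be positive")
    return total_latency / mlp


def pick_core(deep_lat: float, shallow_lat: float,
              processor_die_mlp: float, ndp_mlp: float) -> str:
    """Compare estimated stalls on the processor-die core (deep
    hierarchy) against the NDP core (shallow hierarchy)."""
    deep_stalls = memory_stalls(deep_lat, processor_die_mlp)
    shallow_stalls = memory_stalls(shallow_lat, ndp_mlp)
    return "processor-die" if deep_stalls < shallow_stalls else "ndp"


# Hypothetical numbers: raw latency favors the shallow hierarchy
# (40000 < 50000), but a processor-die core with twice the MLP
# would stall less overall.
print(pick_core(50000, 40000, processor_die_mlp=2.0, ndp_mlp=1.0))
print(pick_core(50000, 40000, processor_die_mlp=1.0, ndp_mlp=1.0))
```

The point of the example is that core asymmetry can flip a preference that the memory latency model alone would suggest, which is why the two models are combined.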
