Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies
Po-An Tsai, Changping Chen, and Daniel Sanchez
Die-stacking has enabled near-data processing
Conventional multicore processors use a deep, multi-level cache hierarchy to reduce data movement
[Figure: cores with private caches and a shared LLC]
Near-data processors place cores close to main memory to reduce data movement
[Figure: DRAM dies stacked on a logic layer; each NDP core sits next to a vault controller and has a private cache only (shallow hierarchy)]
Neither shallow nor deep hierarchies work well for all applications…
Asymmetric hierarchies get the best of both worlds
Prior work proposes hybrid systems with asymmetric memory hierarchies to get the best of both
[Ahn et al., ISCA’15][Gao et al., PACT’15] [Hsieh et al., ISCA’16][Boroumand et al., ASPLOS’18]
Applications have strong hierarchy preferences
[Figure: access latency (ns) for a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss]
[Figure: Performance/J of milc on different hierarchies (normalized Perf/J, deep vs. shallow hierarchy)]
[Figure: Performance/J of xalanc on different hierarchies (normalized Perf/J, deep vs. shallow hierarchy)]
How well each application can use the shared LLC is critical to its preference
Scheduling programs to the right hierarchy is hard
Many applications prefer different hierarchies over time because they have different phases
[Figure: Performance/J of gems]
Applications may prefer different hierarchies due to resource contention with other applications
[Figure: Performance/J of xalanc (normalized Perf/J) on the shallow hierarchy and on the deep hierarchy with a 2MB, 4MB, 8MB, or 16MB LLC]
Prior schedulers focus on different systems and constraints
Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12])
Focuses on symmetric memory systems (multi-socket LLCs/NUMA)
Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12][Cong, ISLPED’11])
Focuses on asymmetric core microarchitectures (big.LITTLE systems)
NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16])
Focuses on single workloads and requires software modifications or compiler support
By contrast, our goal is to schedule threads considering both memory and core asymmetries, with no program modifications and transparently to users
[Figure: two 8MB LLCs (LLC 1 and LLC 2); in-order cores and OoO cores]
AMS: An asymmetry-aware scheduler
[Figure: in hardware, utility monitors sample accesses and produce miss curves (misses vs. cache size); in software, AMS uses them to schedule threads]
First contribution: Analytical model that estimates performance under different hierarchies
Second contribution: Two thread placement algorithms (AMS-Greedy/AMS-DP) that extend techniques originally designed for cache partitioning
AMS analytical model
AMS estimates application preferences using total memory access latency
Deep hierarchy has a shared LLC:
Lat = (# accesses × Latency of LLC) + (# misses × Latency of deep mem)
# misses is a function of LLC capacity, taken from the miss curve
Shallow hierarchy has no shared LLC:
Lat = # accesses × Latency of shallow mem
[Figure: miss curve from hardware monitors (# misses vs. LLC capacity, 2 to 8 MB)]
[Figure: latency curve model (latency vs. LLC capacity, 2 to 8 MB) for the processor-die core and the NDP core, with regions labeled "Use NDP core" and "Use processor-die core"]
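The two latency formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the miss curve and latency constants are made-up numbers chosen only to show a crossover between hierarchies.

```python
def deep_latency(accesses, misses, llc_latency_ns, deep_mem_latency_ns):
    """Deep hierarchy: every access pays the LLC latency, and misses also pay memory."""
    return accesses * llc_latency_ns + misses * deep_mem_latency_ns


def shallow_latency(accesses, shallow_mem_latency_ns):
    """Shallow hierarchy: no shared LLC, so every access goes to nearby memory."""
    return accesses * shallow_mem_latency_ns


def preferred_hierarchy(accesses, miss_curve, llc_mb, llc_ns, deep_ns, shallow_ns):
    """Pick the hierarchy with the lower total latency for a given LLC allocation."""
    deep = deep_latency(accesses, miss_curve[llc_mb], llc_ns, deep_ns)
    shallow = shallow_latency(accesses, shallow_ns)
    return "deep" if deep < shallow else "shallow"


# Hypothetical inputs (not from the talk): misses as a function of LLC capacity in MB.
ACCESSES = 1000
MISS_CURVE = {2: 900, 4: 600, 8: 100}
LLC_NS, DEEP_NS, SHALLOW_NS = 20, 60, 40

for mb in sorted(MISS_CURVE):
    print(mb, "MB ->", preferred_hierarchy(ACCESSES, MISS_CURVE, mb, LLC_NS, DEEP_NS, SHALLOW_NS))
```

With these numbers the thread prefers the shallow hierarchy at small LLC allocations and the deep hierarchy once its allocation is large enough to remove most misses.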
Handling heterogeneous cores
Combine model from prior work (PIE) with our memory latency model
[Figure: latency curves (memory latency vs. LLC capacity, 2 to 8 MB) for the processor-die core and the NDP core]
Weight by MLP
[Figure: memory stall curves (memory stalls vs. LLC capacity, 2 to 8 MB) for the processor-die core and the NDP core]
Add non-memory component weighted by ILP
[Figure: core cycle curves (core cycles vs. LLC capacity, 2 to 8 MB) for the processor-die core and the NDP core, including non-memory cycles]
Can be extended to other asymmetries, like frequencies (see paper)
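One plausible reading of these two steps is sketched below. It is an assumption, not the PIE formulation itself: memory latency is divided by the core's memory-level parallelism to get stall cycles, and the non-memory component is modeled as instructions divided by an ILP-derived issue rate. All names and numbers are hypothetical.

```python
def memory_stall_cycles(total_latency_cycles, mlp):
    """Overlapping misses hide latency, so divide by memory-level parallelism (assumption)."""
    return total_latency_cycles / mlp


def core_cycles(total_latency_cycles, mlp, instructions, ilp):
    """Memory stalls plus a non-memory component that shrinks as ILP grows (assumption)."""
    non_mem_cycles = instructions / ilp
    return memory_stall_cycles(total_latency_cycles, mlp) + non_mem_cycles


# Hypothetical thread: an out-of-order processor-die core overlaps more misses and
# issues more instructions per cycle than a simple NDP core.
deep_cost = core_cycles(total_latency_cycles=26000, mlp=4, instructions=10000, ilp=2)
shallow_cost = core_cycles(total_latency_cycles=40000, mlp=1, instructions=10000, ilp=1)
print(deep_cost, shallow_cost)
```

Evaluating this at each LLC capacity would turn a latency curve into a core cycle curve per core type.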
AMS-Greedy overview
Solve an optimization problem that seeks to minimize total cost
Start by mapping all threads to the deep hierarchy (processor die), then move some threads to the NDP cores over multiple rounds
[Flowchart]
1. Input: cost curves of all threads for the deep hierarchy
2. Run a cache partitioning algorithm from prior work to get a partition plan (T1: 3MB, T2: 1MB, T3: 4MB, …)
3. Compare the cost of the deep and shallow hierarchies according to the plan
4. Map some threads to the shallow hierarchy
5. Do the remaining threads fit in the deep hierarchy? If yes, done. If no, repeat from step 2 with the cost curves of the threads still mapped to the deep hierarchy
AMS-Greedy: Leveraging cache partitioning to schedule threads
Uses opportunity cost to decide which thread should give up the processor die
[Figure: cost vs. LLC capacity (2 to 8 MB) curves for threads 1-3; partitioning the 8MB LLC among threads 1-3 yields 3MB, 4MB, and 1MB partitions, and each curve marks the thread's opportunity cost]
Opportunity cost < 0: move to NDP
Perform multiple rounds of partitioning until the processor die is not oversubscribed
Overhead: 0.1% of system cycles when scheduling every 50ms
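The rounds described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the AMS-Greedy implementation: `partition_llc` is a stand-in marginal-gain partitioner rather than the prior-work algorithm, opportunity cost is assumed to be the shallow-hierarchy cost minus the deep-hierarchy cost at the planned partition, and all names and cost values are hypothetical.

```python
def partition_llc(cost_curves, llc_mb):
    """Stand-in partitioner: give 1 MB at a time to the thread whose cost drops most.
    cost_curves[t][k] is thread t's deep-hierarchy cost with k MB, for k = 0..llc_mb."""
    alloc = {t: 0 for t in cost_curves}
    for _ in range(llc_mb):
        if not alloc:
            break
        best = max(alloc, key=lambda t: cost_curves[t][alloc[t]] - cost_curves[t][alloc[t] + 1])
        alloc[best] += 1
    return alloc


def ams_greedy_sketch(deep_curves, shallow_cost, llc_mb, deep_cores, ndp_cores):
    on_deep, on_ndp = set(deep_curves), set()  # start with every thread on the processor die
    while True:
        plan = partition_llc({t: deep_curves[t] for t in on_deep}, llc_mb)
        # Assumed sign convention: negative means the thread is cheaper on an NDP core.
        opp = {t: shallow_cost[t] - deep_curves[t][plan[t]] for t in on_deep}
        free_ndp = ndp_cores - len(on_ndp)
        movers = sorted((t for t in on_deep if opp[t] < 0), key=opp.get)[:free_ndp]
        if not movers and len(on_deep) > deep_cores and free_ndp > 0:
            # Still oversubscribed: evict the thread that loses the least by leaving.
            movers = [min(on_deep, key=opp.get)]
        if not movers:
            return on_deep, on_ndp, plan
        on_deep -= set(movers)
        on_ndp |= set(movers)


# Hypothetical cost curves for an 8 MB LLC (index = MB allocated).
DEEP = {
    "T1": [100, 90, 80, 70, 60, 50, 40, 30, 20],   # benefits steadily from capacity
    "T2": [100] * 9,                               # streaming: the LLC does not help
    "T3": [100, 50, 30, 25, 24, 24, 24, 24, 24],   # small working set
}
SHALLOW = {"T1": 90, "T2": 60, "T3": 80}

deep, ndp, plan = ams_greedy_sketch(DEEP, SHALLOW, llc_mb=8, deep_cores=2, ndp_cores=2)
print(sorted(deep), sorted(ndp), plan)
```

In this example only the streaming thread has a negative opportunity cost, so it moves to an NDP core in the first round and the second round confirms that the remaining two threads fit on the processor die.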
AMS-DP: Scheduling threads with dynamic programming
Prior work has shown that dynamic programming (DP) solves cache partitioning optimally in polynomial time
We propose an algorithm using DP to solve our optimization problem optimally
AMS-DP serves as the upper bound for AMS-Greedy
But it is more expensive
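To make the idea concrete, here is a small dynamic program in the same spirit. It is an illustrative sketch only; the actual AMS-DP formulation is in the paper. The assumptions are that each thread either runs on an NDP core at a fixed shallow-hierarchy cost or runs on the processor die with some whole number of MB of LLC, and that the state tracks LLC capacity used and NDP cores used. All names and cost values are hypothetical.

```python
def ams_dp_sketch(deep_curves, shallow_cost, llc_mb, deep_cores, ndp_cores):
    # best[(c, n)] = (lowest total cost, choices) using exactly c MB of LLC and n NDP cores
    best = {(0, 0): (0, [])}
    for t in deep_curves:
        nxt = {}

        def relax(key, cost, picks):
            # Keep only the cheapest way to reach each state.
            if key not in nxt or cost < nxt[key][0]:
                nxt[key] = (cost, picks)

        for (c, n), (cost, picks) in best.items():
            if n < ndp_cores:  # option 1: place thread t on an NDP core
                relax((c, n + 1), cost + shallow_cost[t], picks + [(t, "ndp", 0)])
            for k in range(llc_mb - c + 1):  # option 2: processor die with k MB of LLC
                relax((c + k, n), cost + deep_curves[t][k], picks + [(t, "deep", k)])
        best = nxt
    n_threads = len(deep_curves)
    # Threads not on NDP cores must fit on the processor-die cores.
    feasible = [v for (c, n), v in best.items() if n_threads - n <= deep_cores]
    return min(feasible, key=lambda v: v[0])


# Hypothetical cost curves for an 8 MB LLC (index = MB allocated).
DEEP = {
    "T1": [100, 90, 80, 70, 60, 50, 40, 30, 20],
    "T2": [100] * 9,
    "T3": [100, 50, 30, 25, 24, 24, 24, 24, 24],
}
SHALLOW = {"T1": 90, "T2": 60, "T3": 80}

total_cost, choices = ams_dp_sketch(DEEP, SHALLOW, llc_mb=8, deep_cores=2, ndp_cores=2)
print(total_cost, choices)
```

The table has (LLC MB + 1) × (NDP cores + 1) states per thread and up to LLC MB + 1 transitions per state, so the run time is polynomial but noticeably higher than a greedy pass.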
Data placement for asymmetric hierarchies
NDP systems have different constraints from NUMA systems
NDP cores have plentiful intra-stack bandwidth but limited inter-stack bandwidth
We use simple heuristics to keep data from a thread in a single stack
Threads try to allocate to the same stack so long as the stack has enough capacity
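A minimal sketch of such a heuristic is shown below. The function name, the page-count bookkeeping, and the fallback policy (pick the stack with the most free capacity) are assumptions for illustration; the talk only states that a thread keeps allocating from the same stack while it has capacity.

```python
def pick_stack(home_stack, free_pages, pages_needed):
    """Allocate from the thread's home stack if it has room; otherwise fall back
    to the stack with the most free pages (fallback policy is an assumption)."""
    if free_pages[home_stack] >= pages_needed:
        return home_stack
    return max(free_pages, key=free_pages.get)


# Hypothetical free capacity per memory stack, in pages.
free = {0: 10, 1: 100, 2: 50}
print(pick_stack(0, free, 5))    # home stack has room
print(pick_stack(0, free, 20))   # home stack is too full, fall back
```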
See paper for more details
Handling multithreaded workloads
AMS-DP formulation
Different system scenarios: oversubscribed systems, short-lived workloads or latency-critical workloads
Evaluation
Modeled system:
Deep hierarchy: 8-core processor with 32KB L1, 256KB L2, 16MB shared LLC
Shallow hierarchy: 4 memory stacks, each with 2 NDP cores. Each core has private 32KB L1 + 256KB L2
Workloads:
Multi-programmed SPECCPU
Multithreaded SPECOMP/PARSEC (see paper)
Compared schedulers:
Random (baseline that we normalize to)
Always NDP/Always processor-die
Extended CRUISE [ASPLOS’12]/PIE [ISCA’12]
AMS-Greedy/AMS-DP
AMS finds the right hierarchy for each application
Always processor never leverages the NDP capability of the asymmetric system and is 8% worse than Random
Always NDP sometimes hurts applications that prefer deep hierarchies because it never leverages the LLC. It is only 9% better than Random
AMS-Greedy never hurts performance and improves weighted speedup by up to 37% and by 18% on average
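For readers unfamiliar with the metric, the sketch below uses the common definition of weighted speedup (the sum over applications of IPC in the mix divided by IPC when running alone) and normalizes it to a baseline scheduler. The paper's exact definition may differ, and the IPC values are made up.

```python
def weighted_speedup(ipc_in_mix, ipc_alone):
    """Sum of each application's IPC in the mix relative to its IPC when run alone."""
    return sum(ipc_in_mix[app] / ipc_alone[app] for app in ipc_in_mix)


def normalized_to_baseline(ws_scheduler, ws_baseline):
    """Express a scheduler's weighted speedup relative to a baseline such as Random."""
    return ws_scheduler / ws_baseline


# Hypothetical IPC measurements for a two-application mix.
alone = {"a": 2.0, "b": 1.0}
random_mix = {"a": 1.0, "b": 0.5}
ams_mix = {"a": 1.5, "b": 0.5}

ws_random = weighted_speedup(random_mix, alone)
ws_ams = weighted_speedup(ams_mix, alone)
print(normalized_to_baseline(ws_ams, ws_random))
```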
AMS handles resource contention better than prior work
Run workloads with 100% utilization to stress contention
AMS-Greedy performs very close to AMS-DP, only 1% worse
Both AMS-Greedy and AMS-DP outperform CRUISE
AMS handles asymmetric core + memory well
Deep hierarchy uses Haswell-like cores
Shallow hierarchy uses Silvermont-like cores
AMS-Greedy with the PIE model improves performance more than handling core/memory asymmetries separately
See paper for more evaluation results
A case study to show AMS adapts to application phases
Multithreaded workloads
Detailed runtime overheads
Sensitivity study for system parameters: number of cores, LLC capacity, main memory capacity
Performance without and with hardware support for cache partitioning
Conclusion
Scheduling computation in asymmetric systems is very challenging
We present AMS, an adaptive scheduler for asymmetric systems
AMS uses analytical models to adapt quickly and thread mapping algorithms inspired by cache partitioning algorithms to find high-quality mappings
[Figure: in hardware, utility monitors sample accesses and produce miss curves (misses vs. cache size); in software, AMS uses them to schedule threads]
First contribution: Analytical model that estimates performance under different hierarchies
Second contribution: Two thread placement algorithms that extend techniques originally designed for cache partitioning
Thanks! Any questions?