IMPROVING PERFORMANCE ISOLATION ON CHIP MULTIPROCESSORS VIA AN OPERATING SYSTEM SCHEDULER (PowerPoint PPT Presentation)


SLIDE 1

Alexandra Fedorova Margo Seltzer Michael D. Smith Presented by Brad Eugene Torrence

IMPROVING PERFORMANCE ISOLATION ON CHIP MULTIPROCESSORS VIA AN OPERATING SYSTEM SCHEDULER

SLIDE 2

INTRODUCTION

  • Motivation: poor performance isolation in chip multiprocessors (CMPs)
  • The performance of an application also depends on the performance of other applications running at the same time
  • This interference arises inherently from the shared-cache design of CMPs
  • Fairness among threads is not considered at the hardware level
SLIDE 3

INTRODUCTION

  • High-miss-rate co-running applications increase execution time
  • Problems caused by poor performance isolation:
  • The OS scheduler becomes non-deterministic and unpredictable
  • Thread-priority enforcement is weakened
  • QoS resource reservations become less effective
  • Per-CPU-hour billing is complicated
  • Applications are billed unfairly for longer running times
SLIDE 4

INTRODUCTION

  • Overall performance measures the time to complete execution
  • Performance variability measures performance isolation by calculating the performance difference between executions
  • High variability indicates poor performance isolation
  • Instructions per cycle (IPC) measures how fast a thread executes
  • The CPU time-slice determines how long a thread runs on the CPU
SLIDE 5

INTRODUCTION

  • Co-runner-dependent cache allocation creates IPC variability
  • IPC variability drastically affects performance variability
  • The OS CANNOT control IPC variability, because it cannot control cache allocation
  • The OS CAN control the CPU time-slice of each thread to compensate for IPC variability
  • Cache-fair algorithm: offsets performance variability by adjusting each thread's CPU time-slice

SLIDE 6

CACHE-FAIR ALGORITHM

  • NOT a new scheduling policy
  • Complements existing policies to mitigate the effects of performance variability
  • Makes threads run as quickly as they would if the cache were shared equally
  • Works by dynamically adjusting the CPU time-slices of the threads
  • If one thread requires more CPU time, another thread must sacrifice CPU time so that overall execution time is not increased

SLIDE 7

CACHE-FAIR ALGORITHM

  • Conventional scheduler on a CMP
  • High-miss-rate thread B is co-run with thread A
  • Thread A gets a below-fair cache allocation
  • Results in worse-than-fair performance
  • Hypothetical CMP (enforces fairness)
  • Shows the ideal fair performance
  • CMP with a cache-fair scheduler
  • Thread A is still affected by thread B
  • The scheduler increases thread A's CPU time-slice
  • This allows A to achieve fair performance
SLIDE 8

CACHE-FAIR ALGORITHM

  • Two new thread classes help maintain the proper balance of CPU time-sharing among threads
  • Cache-fair class threads – managed by the cache-fair scheduler to improve performance isolation
  • Best-effort class threads – not managed to improve performance isolation, but may receive time-slice adjustments

SLIDE 9

CACHE-FAIR ALGORITHM

  • How it works:
  • When a cache-fair thread's time-slice is adjusted, another cache-fair thread is adjusted in such a way as to offset the previous time-slice adjustment
  • If no other cache-fair thread exists to receive the offsetting adjustment, a best-effort thread is adjusted instead
  • Adjustments to any single best-effort thread are kept under a specific threshold, mitigating the performance effects
  • CPU time-slice adjustments:
  • Compute the fair IPC value and compare it with the actual IPC
  • Time-slices are computed so that the thread reaches an IPC equal to the fair IPC value when the time-slice expires
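The offsetting bookkeeping described above can be sketched roughly as follows. This is my own illustration, not the paper's Solaris implementation: the class names, the threshold value, and the data layout are all assumptions.

```python
# Hedged sketch of the offsetting time-slice adjustments: every positive
# adjustment to one thread is balanced by a negative adjustment elsewhere,
# preferring cache-fair threads and capping the burden on best-effort ones.
# Names and the cap value are illustrative, not from the paper.

BEST_EFFORT_CAP = 0.10  # assumed cap on the net adjustment to one best-effort thread

class SchedThread:
    def __init__(self, name, cache_fair):
        self.name = name
        self.cache_fair = cache_fair   # True: cache-fair class, False: best-effort
        self.timeslice_delta = 0.0     # cumulative CPU time-slice adjustment

def apply_adjustment(target, delta, runnable):
    """Give `target` an extra `delta` of CPU time and take the same amount
    from another thread, so total CPU time is unchanged."""
    target.timeslice_delta += delta
    # Prefer another cache-fair thread to absorb the offset.
    donors = [t for t in runnable if t is not target and t.cache_fair]
    if not donors:
        # Otherwise fall back to a best-effort thread whose cumulative
        # adjustment would stay under the threshold.
        donors = [t for t in runnable
                  if not t.cache_fair
                  and abs(t.timeslice_delta - delta) <= BEST_EFFORT_CAP]
    if donors:
        donors[0].timeslice_delta -= delta
```

Because every adjustment is paired with an equal and opposite one, the deltas across all runnable threads always sum to zero, which is the invariant that keeps overall execution time unchanged.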

SLIDE 10

FAIR IPC MODEL

  • The fair IPC of a thread cannot be measured directly
  • Fair IPC values are estimated using the Fair IPC Model
  • Estimates the miss-rate under fair-cache allocation
  • Then calculates the fair IPC given the fair-cache miss-rate
SLIDE 11

FAIR IPC MODEL OVERVIEW

  • The miss-rate estimation model: MissRate(A) = a * Σ_{i=1..n} MissRate(C_i) + b
  • Where a and b are linear coefficients, n is the number of co-runners, and C_i is the i-th co-runner
  • After running the target thread with several co-runners, a and b are derived using linear regression analysis
  • And since, under fair cache allocation, FairMissRate(A) = MissRate(A) = MissRate(C_i)
  • Then FairMissRate(A) = b / (1 − a * n)
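The regression step can be sketched as below. The function name and the synthetic data in the example are my own illustration; I assume each sample pairs the sum of the co-runners' miss rates with the target thread's observed miss rate, as in the model's Σ term.

```python
# Sketch of the fair-miss-rate estimation: fit the linear model
#   MissRate(A) = a * sum_i MissRate(C_i) + b
# from observed (co-runner miss-rate sum, target miss-rate) pairs, then
# solve for the fair miss rate. Under fair sharing every thread has the
# same miss rate, so MissRate = a*n*MissRate + b  =>  MissRate = b/(1 - a*n).

def fit_fair_miss_rate(samples, n):
    """samples: list of (sum of co-runner miss rates, target miss rate).
    n: number of co-runners sharing the cache."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    k = len(samples)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    # Ordinary least squares for slope a and intercept b.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return b / (1 - a * n)
```

With exact linear data the fit recovers a and b exactly; with real miss-rate measurements the quality depends on how varied the co-runners are, which is exactly the limitation discussed on a later slide.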

SLIDE 12

FAIR IPC MODEL EVALUATION

  • Actual vs. Estimated fair-cache miss-rates
SLIDE 13

FAIR IPC MODEL LIMITATIONS

  • Estimating a thread’s fair miss rate requires running the thread with several co-runners
  • This can yield poor results if co-runners are few or all have similar cache-access patterns
  • Requires running a thread multiple times with other threads
  • Highly impractical in reality
  • The model assumes a uniform distribution of cache requests
  • An unrealistic assumption
SLIDE 14

IMPLEMENTATION

  • Implemented as a loadable module for the Solaris 10 OS
  • Provides flexibility and independence from the kernel’s scheduler
  • Cache-fair management is invoked via a system call
  • Threads are also assigned to a thread class
  • Tracks positive and negative adjustments to maintain balance in the overall performance

SLIDE 15

IMPLEMENTATION

  • Cache-fair threads go through an initial preparation phase
  • The OS gathers performance data to calculate the fair miss rate
  • Re-preparation is also necessary if a thread changes its cache-access patterns
  • Forced whenever a thread executes 1 billion instructions
  • The scheduling phase monitors threads using hardware performance counters
  • A thread’s CPU time-slice is adjusted if it deviates from its fair IPC
  • Scheduling-phase adjustments occur every 50 million instructions
SLIDE 16

EVALUATION

  • Uses a multi-program workload
  • SPEC CPU2000 benchmarks (for CPU workloads)
  • SPEC JBB and TPC-C (for server workloads)
  • Default scheduler = Solaris fixed-priority scheduler
  • Evaluation compares performance isolation between the cache-fair and default schedulers
  • Hardware simulator: Simics modules implementing a dual-core CMP
SLIDE 17

EVALUATION

  • The slow schedule contains high-miss-rate co-runners
  • The fast schedule contains low-miss-rate co-runners
  • The preparation-phase estimations are performed prior to the experiment
  • Run with all benchmark programs to get an accurate estimation
  • The principal thread is monitored for 500 million instructions during the scheduling phase
  • After the first 10 million instructions, to avoid cold-start effects
  • Concurrently executed with three threads running identical benchmarks
  • Only one of these is designated as a best-effort thread
  • Performance variability is measured as percent slowdown
  • The difference between performance in the fast and slow groups
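As I read the metric above, percent slowdown is the relative difference between the slow- and fast-schedule results; a minimal sketch, with purely illustrative values:

```python
# Percent slowdown as described above: how much worse the principal
# benchmark performs in the slow schedule relative to the fast one.

def percent_slowdown(time_fast, time_slow):
    """Percent increase of slow-schedule completion time over fast-schedule."""
    return (time_slow - time_fast) / time_fast * 100.0
```

Under this reading, a scheduler with perfect performance isolation would drive the slowdown toward 0%, regardless of which co-runners share the cache.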
SLIDE 18

EFFECT ON PERFORMANCE ISOLATION

  • Cache-fair scheduling results in <4% variability across all benchmarks
SLIDE 19

EFFECT ON ABSOLUTE PERFORMANCE

  • Upper bound is the completion time of the slow group
  • Lower bound is the completion time of the fast group
  • Normalized to default scheduler’s completion time in the fast group
SLIDE 20

EFFECT ON THE OVERALL THROUGHPUT

  • Aggregate IPC was found to be more dependent on the relative IPCs of threads that had CPU time-slice adjustments than on the principal benchmark’s IPC
  • Slow group: 1-12% increase
  • Fast group: 1-3% increase

SLIDE 21

EFFECT ON BEST-EFFORT THREADS

  • The worst-case effect on best-effort threads is small
  • Multiple best-effort threads are important in avoiding large performance effects
  • The average effect on best-effort threads is <1%, but the range is quite large

SLIDE 22

EXPERIMENTS WITH DATABASE WORKLOADS

  • The performance metric used for variability changes to transactions per second
  • Two sets of experiments: SPEC JBB and TPC-C, each as the principal program
  • The benchmark twolf was used as a best-effort co-running thread
  • Each benchmark emulates the database activity of an order-processing warehouse
  • The number of warehouses and execution threads is configurable
  • The number of execution threads was fixed at one by the authors
  • Memory constraints required reducing the number of warehouses to 5 or fewer
  • The authors also reduced the L2 cache size to 512 KB in response
SLIDE 23

SPEC JBB

  • Experimental layout
  • SPEC JBB Experimental Results
SLIDE 24

TPC-C

  • Experimental Layout
  • TPC-C Experimental Results
SLIDE 25

COMPARISON TO CACHE PARTITIONING

  • The authors created a hardware simulator of a CMP with cache partitioning
  • Cache partitioning reduced variability in only 3 of 9 benchmarks
  • The poor results are due to NO reduction in contention for the memory bus
  • Cache-fair scheduling accounts for memory-bus contention
  • It is therefore more effective than the hardware solution
SLIDE 26

RELATED WORK

  • Hardware solutions
  • Address the problem directly, avoiding OS modifications
  • Ensure fair resource allocation
  • Limited flexibility and effectiveness
  • Increase hardware cost, complexity, and time-to-market
  • Software solutions
  • Co-scheduling attempts to find the optimal co-runner thread
  • Requires a good co-runner to exist, or it fails
  • Limited scalability (requires cores to coordinate scheduling)
SLIDE 27

SUMMARY

  • The cache-fair scheduling algorithm improves performance isolation on chip multiprocessors by nearly eliminating the effects of co-runner-dependent performance variability
  • Better than contemporary hardware and software solutions (as of 2007), according to the authors
  • I think the experiment shows promising results, but the process of calculating the fair IPC is much too costly to be practical
SLIDE 28

DISCUSSION

  • ANY QUESTIONS??
  • Do you agree that the cache-fair algorithm is better than other solutions of the time?
  • Do you think the two-stage approach to cache-fair scheduling is efficient?
  • Do you think a linear regression model was a good choice for the authors’ method of calculating the fair IPC of a thread?
  • Does the cost of this algorithm outweigh the effects of performance variability?