SLIDE 1
Tyler Sondag (Iowa State U.) and Hridesh Rajan (Iowa State U.)
Phase-guided Thread-to-core Assignment for Improved Utilization of Performance-Asymmetric Multi-Core Processors
International Workshop on Multicore Software Engineering
Supported in part by the US National Science Foundation under grants 06-27354 and 08-08913.
SLIDE 2
Overview
Performance asymmetric multicores are seen as a more efficient alternative to homogeneous multicores.
Broad Problem: Efficient utilization of asymmetric cores
Technical Challenge: Match resource requirements
Different shading represents varying resource requirements.
◮ Resource needs of threads vary at runtime.
◮ Target architecture may not be known statically.
Key Insight: Use phase behavior to reduce runtime overhead.
SLIDE 3 Introduction Background Solution Results Conclusion Performance Asymmetry Phase Behavior
Performance Asymmetric Multicores
◮ What: Cores have different characteristics (clock speed, cache size, etc.)
◮ Why [1]:
  ◮ space
  ◮ heat
  ◮ power
  ◮ performance-power ratio
  ◮ parallelism
[1] R. Kumar et al. ISCA ’04
http://www.cs.iastate.edu/~sapha/ Phase-guided Assignment
SLIDE 4
Phase Behavior
◮ Behavior: resource requirements (IPC, cache, etc.)
◮ Similar Behavior: segments with similar resource usage
◮ Phase: segments of execution that exhibit similar behavior [2]
Figure: Phase behavior for gcc (taken from [2])
[2] T. Sherwood et al. ASPLOS ’02
SLIDE 5 Introduction Background Solution Results Conclusion Intuition System overview Example: Static Example: Dynamic
Intuition Behind Our Solution
◮ Problem: Assign code to cores such that behavior of code
matches resources of cores
◮ Idea:
  1. Determine sections of code that will behave in a similar way
  2. Knowledge of one section gives us information about all similar sections
SLIDE 6
Approach Overview
◮ Idea: Apply the same thread-to-core mapping to all
approximately similar sections of code
  1. Statically break the program into sections of code
  2. Statically determine approximate similarity between these sections
  3. Dynamically monitor a section, then make mapping decisions for similar sections
SLIDE 7
Program
SLIDE 8
Ignore “small” sections
SLIDE 9
Determine approximate similarity
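The slides do not say how similarity is computed, so the following is only a minimal sketch under stated assumptions: each section is summarized by a hypothetical static feature vector (here, the fraction of arithmetic and of memory instructions), and sections are grouped greedily by Euclidean distance. The function name, the features, and the distance threshold are illustrative and not taken from the talk.

```python
import math

def group_by_similarity(features, max_distance):
    """Greedy grouping of sections into approximate 'types'.

    features: {section_name: feature_vector}, assumed to come from static analysis.
    A section joins the first type whose representative vector lies within
    max_distance; otherwise it starts a new type. Returns {section_name: type_id}.
    """
    representatives = []   # one representative vector per type
    types = {}
    for name, vec in features.items():
        for type_id, rep in enumerate(representatives):
            if math.dist(vec, rep) <= max_distance:
                types[name] = type_id
                break
        else:
            representatives.append(vec)
            types[name] = len(representatives) - 1
    return types

# Hypothetical feature vectors: (arithmetic fraction, memory fraction).
sections = {
    "A": (0.80, 0.10),
    "B": (0.20, 0.70),
    "C": (0.75, 0.15),
    "D": (0.25, 0.65),
}
section_types = group_by_similarity(sections, max_distance=0.2)
```

Only the grouping matters at this stage: the sketch never predicts how a section will actually perform, just which sections are likely to behave alike.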
SLIDE 10
Reduce number of transition points
SLIDE 11
Insert phase marks
SLIDE 12
Monitor
SLIDE 13
Run
SLIDE 14
Run
SLIDE 15
Monitor
SLIDE 16
Run
SLIDE 17
Run
SLIDE 18
Switch to matched core
SLIDE 19
Run on matched core
SLIDE 20 Introduction Background Solution Results Conclusion Experimentation Setup Experimentation Results
Experimental Setup
◮ Hardware setup: Quad Core - 2x2.4GHz, 2x1.6GHz
◮ Workloads
  ◮ 36-84 SPEC CPU2000 benchmarks
  ◮ constant workload size
◮ Compare to standard Linux assignment
SLIDE 21
Overall
Best Result: Interval technique, min. size 45 instructions 4
SLIDE 22 Introduction Background Solution Results Conclusion Related Work Conclusion
Previous Work
Falls into two categories
◮ Asymmetry-aware scheduler [3]
  ◮ high monitoring overhead
  ◮ requires OS modification
◮ Improved load balancing [4, 5]
  ◮ ignores behavior, which may cause inefficient utilization
  ◮ requires OS modification
[3] R. Kumar et al. ISCA ’04
[4] T. Li et al. SC ’07
[5] M. Becchi et al. CF ’06
SLIDE 23
Conclusion
◮ Performance asymmetric multicores are a beneficial class
◮ Problem: Techniques to effectively assign threads to cores are still needed.
◮ Solution: Use phase behavior to reduce dynamic overhead
  ◮ Programmer oblivious
  ◮ Automatic
  ◮ Negligible overhead
  ◮ Transparent deployment
SLIDE 24
Questions
Questions?
SLIDE 25 Experimental Setup
◮ Hardware setup: Quad Core - 2x2.4GHz, 2x1.6GHz
◮ Software setup
  ◮ Static analysis/instrumentation: our framework based on GNU Binutils
  ◮ Runtime performance monitoring: PAPI, perfmon2
  ◮ Core switching: affinity calls built into the kernel
◮ Workloads
  ◮ 36-84 SPEC CPU2000 benchmarks
  ◮ constant workload size
◮ Compare to standard Linux assignment
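As an illustration of core switching through the kernel's affinity interface, here is a minimal sketch using Python's `os.sched_setaffinity`, which wraps the Linux affinity call. The core numbering (`FAST_CORES`, `SLOW_CORES`) and the helper name `switch_cores` are assumptions for a machine with two fast and two slow cores; the actual framework inserts its switching code into binaries, so this is not the authors' implementation.

```python
import os

# Hypothetical core IDs for a 2 fast + 2 slow machine; real numbering varies.
FAST_CORES = {0, 1}
SLOW_CORES = {2, 3}

def switch_cores(target):
    """Ask the kernel to run the calling thread only on `target` cores.

    Returns the affinity set in effect afterwards, or None on platforms
    without the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, target)   # 0 means the calling thread
    except OSError:
        pass  # none of the target cores is available to this process
    return os.sched_getaffinity(0)
```

A phase mark for a section matched to the fast cores could then call `switch_cores(FAST_CORES)`. Because this uses only existing affinity calls, no OS modification is needed.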
SLIDE 26
Overheads (Time)
BB[x, y]: basic block technique, min. block size: x, look-ahead: y
Int[x]: interval technique, min. interval size: x
SLIDE 27
Throughput Improvement (Instructions Executed)
Left: Interval technique, Right: Basic block technique
SLIDE 28
Speedup vs Fairness
SLIDE 29
Speedup vs Overhead
SLIDE 30
Speedup vs Throughput
SLIDE 31
Determining program behavior
Falls into two categories
◮ Techniques using execution traces
◮ Purely dynamic techniques
SLIDE 32 Execution Traces
◮ Benefits:
  ◮ Very accurate since actual performance is known
  ◮ Low dynamic overhead since no monitoring is required
◮ Limitations:
  ◮ Requires sample input set to be developed
  ◮ Run entire program to create execution trace
  ◮ What about sections of code not covered by sample input?
  ◮ Do different inputs result in different behavior?
SLIDE 33 Purely Dynamic
◮ Benefits:
  ◮ Does not require sample input sets
  ◮ No need for execution trace
  ◮ Does not monitor the whole program
◮ Limitations:
  ◮ Decisions for future code are made based on past code
  ◮ Higher dynamic overhead since we must monitor periodically throughout the entire execution
SLIDE 34 Static Phase Marking
◮ Predict similarity between sections of code
◮ Insert phase marks on type transitions if determined beneficial
◮ Basic blocks with look-ahead
◮ Intervals
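The marking step can be pictured with a small sketch, assuming sections are visited in a simple linear order and each already carries a predicted type and a size in instructions. The real framework works on basic blocks or intervals of a control-flow graph, so the linear walk, the function name `place_phase_marks`, and the example data are simplifying assumptions.

```python
def place_phase_marks(sections, min_size):
    """Return the names of sections that receive a phase mark at entry.

    sections: list of (name, predicted_type, size_in_instructions).
    Sections smaller than min_size are ignored, and a mark is placed only
    where the predicted type changes, which keeps transition points few.
    """
    marks = []
    current_type = None
    for name, predicted_type, size in sections:
        if size < min_size:
            continue                  # too small to be worth a core switch
        if predicted_type != current_type:
            marks.append(name)        # type transition: insert a phase mark
            current_type = predicted_type
    return marks

# Hypothetical program layout.
layout = [
    ("A", "cpu", 100),
    ("B", "mem", 10),    # small section, ignored
    ("C", "cpu", 80),    # same type as A, so no new mark
    ("D", "mem", 200),
    ("E", "mem", 60),
]
marked = place_phase_marks(layout, min_size=45)
```

Skipping the small section B also removes two transitions (cpu to mem and back), which is the effect the "reduce number of transition points" step aims for.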
SLIDE 35 Monitoring and Assignment
Phase marks
◮ Dynamic analysis code
  ◮ Monitor code if mapping is unknown
  ◮ Switch cores if mapping is known
◮ Type information
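The behavior of a phase mark at runtime can be sketched as a small cache keyed by phase type: monitor once when the mapping is unknown, then reuse the decision for every later section of the same type. The class name `PhaseMapper`, the IPC threshold, and the rule "high IPC goes to a fast core" are assumptions made for illustration; the slides do not state the actual decision policy.

```python
class PhaseMapper:
    """Caches one core choice per phase type (hypothetical helper)."""

    def __init__(self, monitor, ipc_threshold=1.0):
        self.monitor = monitor              # callable: section name -> measured IPC
        self.ipc_threshold = ipc_threshold  # assumed cut-off, not from the talk
        self.mapping = {}                   # phase type -> "fast" or "slow"
        self.monitored = 0                  # how many sections were monitored

    def on_phase_mark(self, section, phase_type):
        if phase_type not in self.mapping:
            # Mapping unknown: monitor this section once and record a decision.
            ipc = self.monitor(section)
            self.monitored += 1
            self.mapping[phase_type] = (
                "fast" if ipc >= self.ipc_threshold else "slow"
            )
        # Mapping known: switch to the matched core type without monitoring.
        return self.mapping[phase_type]

# Made-up measurements; only the first section of each type is ever looked up.
fake_ipc = {"s1": 1.8, "s2": 0.4}
mapper = PhaseMapper(monitor=lambda section: fake_ipc[section])
trace = [("s1", "A"), ("s2", "B"), ("s3", "A"), ("s4", "B"), ("s5", "A")]
decisions = [mapper.on_phase_mark(s, t) for s, t in trace]
```

In this run only two of the five sections are monitored, which is where the reduction in dynamic overhead comes from.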
SLIDE 36 Asymmetry Aware Scheduler
◮ What: Scheduler assigns threads to well-matched cores
◮ Benefits:
  ◮ Very accurate since based on actual performance
  ◮ Makes system-wide decisions
  ◮ Programs switch cores as behavior changes
◮ Limitations:
  ◮ Monitoring is required throughout entire execution
  ◮ Decisions for future execution are based on past behavior
  ◮ Requires OS modification
SLIDE 37 Improved Load Balancing
◮ What: “Fast” cores get more processes or round-robin
◮ Benefits:
  ◮ Low overhead: does not monitor execution
  ◮ System-wide decision making
◮ Limitations:
  ◮ Aimed at fairness
  ◮ Ignores behavior: “Fast” programs may be on “slow” cores
  ◮ Requires OS modification
SLIDE 38 Intuition Behind Our Solution
◮ Problem: Assign code sections to cores such that behavior of code matches resources of cores
◮ Idea: If we can determine that sections of code will behave in a similar way, knowledge of one section gives us information about all similar sections.
◮ Advantages of this approach
  ◮ Only need to monitor a small amount of code dynamically
  ◮ No need to predict the actual behavior, just similarity
  ◮ Considers changes in program behavior
  ◮ No knowledge of target machine required
  ◮ No need for execution traces or sample inputs
  ◮ Automatic: transparent to programmer and end user