slide-1
SLIDE 1

Tyler Sondag (Iowa State U.), Hridesh Rajan (Iowa State U.)

Phase-guided Thread-to-core Assignment for Improved Utilization of Performance-Asymmetric Multi-Core Processors

International Workshop on Multicore Software Engineering

Supported in part by the US National Science Foundation under grants 06-27354 and 08-08913.

slide-2
SLIDE 2

Overview

Performance asymmetric multicores are seen as a more efficient alternative to homogeneous multicores.

Broad Problem: Efficient utilization of asymmetric cores

Technical Challenge: Match resource requirements

Figure: Different shading represents varying resource requirements.

◮ Resource needs of threads vary at runtime.
◮ Target architecture may not be known statically.

Key Insight: Use phase behavior to reduce runtime overhead.

slide-3
SLIDE 3

Introduction Background Solution Results Conclusion Performance Asymmetry Phase Behavior

Performance Asymmetric Multicores

◮ What: Cores have different characteristics (clock speed, cache size, etc.)
◮ Why [1]:
  ◮ space
  ◮ heat
  ◮ power
  ◮ performance-power ratio
  ◮ parallelism

[1] R. Kumar et al., ISCA ’04

http://www.cs.iastate.edu/~sapha/ 3/24 Phase-guided Assignment

slide-4
SLIDE 4

Phase Behavior

◮ Behavior: resource requirements (IPC, cache, etc.)
◮ Similar Behavior: segments with similar resource usage
◮ Phase: segments of execution that exhibit similar behavior [2]

Figure: Phase behavior for gcc (taken from [2])

[2] T. Sherwood et al., ASPLOS ’02
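
To make the definition of a phase concrete, here is a minimal sketch that groups execution intervals whose measurements are close to each other. It is not the method of [2]; the function name `label_phases`, the (IPC, cache miss rate) samples, and the 15% tolerance are illustrative assumptions.

```python
from typing import List, Tuple


def label_phases(samples: List[Tuple[float, float]], tol: float = 0.15) -> List[int]:
    """Give each interval a phase id.

    An interval joins an existing phase when both its IPC and its miss rate
    are within `tol` (relative) of that phase's first member; otherwise it
    starts a new phase.
    """
    representatives: List[Tuple[float, float]] = []
    labels: List[int] = []
    for ipc, miss in samples:
        for phase_id, (rep_ipc, rep_miss) in enumerate(representatives):
            close_ipc = abs(ipc - rep_ipc) <= tol * rep_ipc
            close_miss = abs(miss - rep_miss) <= tol * max(rep_miss, 1e-9)
            if close_ipc and close_miss:
                labels.append(phase_id)
                break
        else:
            # No existing phase is similar enough: open a new one.
            representatives.append((ipc, miss))
            labels.append(len(representatives) - 1)
    return labels


# Made-up (IPC, miss rate) measurements for five consecutive intervals.
samples = [(1.8, 0.02), (1.75, 0.021), (0.6, 0.20), (0.62, 0.19), (1.82, 0.02)]
print(label_phases(samples))
```

In this toy input the first, second, and fifth intervals end up in one phase and the third and fourth in another, which is the sense in which execution "returns" to an earlier phase.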

slide-5
SLIDE 5

Introduction Background Solution Results Conclusion Intuition System overview Example: Static Example: Dynamic

Intuition Behind Our Solution

◮ Problem: Assign code to cores such that behavior of code matches resources of cores
◮ Idea:
  1. Determine sections of code that will behave in a similar way
  2. Knowledge of one section gives us information about all similar sections

slide-6
SLIDE 6

Approach Overview

◮ Idea: Apply the same thread-to-core mapping to all approximately similar sections of code
  1. Statically break the program into sections of code
  2. Statically determine approximate similarity between these sections
  3. Dynamically monitor a section, then make mapping decisions for all similar sections
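
The three steps can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the section names, the type labels, the `monitor` callback, and the rule "IPC of at least 1.0 goes to a fast core" are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def group_similar(section_types: Dict[str, str]) -> Dict[str, List[str]]:
    """Step 2: group section names by their statically assigned type label."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, type_label in section_types.items():
        groups[type_label].append(name)
    return dict(groups)


def map_sections(
    section_types: Dict[str, str],
    monitor: Callable[[str], float],
    choose_core: Callable[[float], str],
) -> Tuple[Dict[str, str], List[str]]:
    """Step 3: monitor one section per type and reuse the decision for the rest."""
    mapping: Dict[str, str] = {}
    monitored: List[str] = []
    for names in group_similar(section_types).values():
        representative = names[0]
        monitored.append(representative)
        core = choose_core(monitor(representative))
        for name in names:
            mapping[name] = core  # same decision for every similar section
    return mapping, monitored


# Step 1 is assumed done: three sections, two of which share a type.
section_types = {"loop_a": "T1", "loop_b": "T2", "loop_c": "T1"}
fake_ipc = {"loop_a": 1.6, "loop_b": 0.5, "loop_c": 1.5}  # stand-in for counters

mapping, monitored = map_sections(
    section_types,
    monitor=lambda name: fake_ipc[name],
    choose_core=lambda ipc: "fast" if ipc >= 1.0 else "slow",
)
print(mapping, monitored)
```

Only `loop_a` and `loop_b` are monitored here; `loop_c` inherits the decision made for `loop_a`, which is where the reduction in runtime overhead comes from.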

slide-7
SLIDE 7

Program

slide-8
SLIDE 8

Ignore “small” sections

slide-9
SLIDE 9

Determine approximate similarity

slide-10
SLIDE 10

Reduce number of transition points

slide-11
SLIDE 11

Insert phase marks

slide-12
SLIDE 12

Monitor

slide-13
SLIDE 13

Run

slide-14
SLIDE 14

Run

slide-15
SLIDE 15

Monitor

slide-16
SLIDE 16

Run

slide-17
SLIDE 17

Run

slide-18
SLIDE 18

Switch to matched core

slide-19
SLIDE 19

Run on matched core

slide-20
SLIDE 20

Introduction Background Solution Results Conclusion Experimentation Setup Experimentation Results

Experimental Setup

◮ Hardware setup: Quad Core - 2x2.4GHz, 2x1.6GHz
◮ Workloads
  ◮ 36-84 SPEC CPU2000 benchmarks
  ◮ constant workload size
◮ Compare to standard Linux assignment

slide-21
SLIDE 21

Overall

Best Result: Interval technique, min. size 45 instructions

slide-22
SLIDE 22

Introduction Background Solution Results Conclusion Related Work Conclusion

Previous Work

Falls into two categories

◮ Asymmetry-aware scheduler [3]
  ◮ high monitoring overhead
  ◮ requires OS modification
◮ Improved load balancing [4, 5]
  ◮ ignores behavior: may cause inefficient utilization
  ◮ requires OS modification

[3] R. Kumar et al., ISCA ’04
[4] T. Li et al., SC ’07
[5] M. Becchi et al., CF ’06

slide-23
SLIDE 23

Conclusion

◮ Performance asymmetric multicores are a beneficial class of processors.
◮ Problem: Techniques to effectively assign threads to cores are still needed.
◮ Solution: Use phase behavior to reduce dynamic overhead.
  ◮ Programmer oblivious
  ◮ Automatic
  ◮ Negligible overhead
  ◮ Transparent deployment

slide-24
SLIDE 24

Questions

Questions?

slide-25
SLIDE 25

Experimental Setup

◮ Hardware setup: Quad Core - 2x2.4GHz, 2x1.6GHz
◮ Software setup
  ◮ Static analysis/instrumentation: our framework based on GNU Binutils
  ◮ Runtime performance monitoring: PAPI, perfmon2
  ◮ Core switching: affinity calls built into the kernel
◮ Workloads
  ◮ 36-84 SPEC CPU2000 benchmarks
  ◮ constant workload size
◮ Compare to standard Linux assignment
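
As an illustration of core switching through the kernel's affinity interface, the sketch below uses Python's `os.sched_setaffinity`, which wraps the Linux call. The slides do not show the actual switching code; the core numbering in `CORE_CLASSES` and the helper names are assumptions for this example.

```python
import os
from typing import Dict, Set

# Hypothetical numbering for a 2 fast + 2 slow machine; real core IDs vary by system.
CORE_CLASSES: Dict[str, Set[int]] = {"fast": {0, 1}, "slow": {2, 3}}


def usable_cores(core_class: str, allowed: Set[int]) -> Set[int]:
    """Cores of the requested class that the process is permitted to run on."""
    return CORE_CLASSES[core_class] & allowed


def switch_to(core_class: str, allowed: Set[int]) -> bool:
    """Pin the caller to the requested core class; return True if it was applied."""
    target = usable_cores(core_class, allowed)
    if not target or not hasattr(os, "sched_setaffinity"):
        return False  # nothing to pin to, or the platform lacks the call
    os.sched_setaffinity(0, target)  # pid 0 refers to the caller
    return True


print(usable_cores("fast", {0, 1, 2, 3}))
```

In practice `allowed` would be captured once at startup with `os.sched_getaffinity(0)`, since every later switch narrows the current affinity mask and would otherwise hide the other core class.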

slide-26
SLIDE 26

Overheads (Time)

BB[x, y]: Basic block technique, min. block size: x, Look-ahead: y. Int[x]: interval technique, min. interval size: x

slide-27
SLIDE 27

Throughput Improvement (Instructions Executed)

Left: Interval technique, Right: Basic block technique

slide-28
SLIDE 28

Speedup vs Fairness

slide-29
SLIDE 29

Speedup vs Overhead

slide-30
SLIDE 30

Speedup vs Throughput

slide-31
SLIDE 31

Determining program behavior

Falls into two categories

◮ Techniques using execution traces
◮ Purely dynamic techniques

slide-32
SLIDE 32

Execution Traces

◮ Benefits:
  ◮ Very accurate since actual performance is known
  ◮ Low dynamic overhead since no monitoring is required
◮ Limitations:
  ◮ Requires sample input set to be developed
  ◮ Run entire program to create execution trace
  ◮ What about sections of code not covered by sample input?
  ◮ Do different inputs result in different behavior?

slide-33
SLIDE 33

Purely Dynamic

◮ Benefits:
  ◮ Does not require sample input sets
  ◮ No need for execution trace
  ◮ Does not monitor the whole program
◮ Limitations:
  ◮ Decisions for future code are made based on past code
  ◮ Higher dynamic overhead since we must monitor periodically throughout the entire execution

slide-34
SLIDE 34

Static Phase Marking

◮ Predict similarity between sections of code
◮ Insert phase marks on type transitions if determined beneficial
◮ Basic blocks with look-ahead
◮ Intervals
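
A rough sketch of placing phase marks on type transitions is shown below. It treats the program as a linear list of sections, whereas the real analysis works on control flow, and it uses a plain minimum-size filter as a stand-in for the "if determined beneficial" test. The `Section` tuple and `place_phase_marks` are hypothetical names.

```python
from typing import List, Optional, Tuple

# (section name, predicted type label, size in instructions)
Section = Tuple[str, str, int]


def place_phase_marks(sections: List[Section], min_size: int) -> List[str]:
    """Return the sections that receive a phase mark at their entry.

    Sections smaller than `min_size` are ignored, and a mark is placed only
    where the type differs from the last marked type, which keeps the number
    of transition points low.
    """
    marks: List[str] = []
    current_type: Optional[str] = None
    for name, type_label, size in sections:
        if size < min_size:
            continue  # too small to be worth a mark
        if type_label != current_type:
            marks.append(name)
            current_type = type_label
    return marks


sections = [
    ("s0", "A", 120),
    ("s1", "B", 10),   # small: ignored, so s0 and s2 form one run of type A
    ("s2", "A", 80),
    ("s3", "B", 200),
    ("s4", "B", 60),   # same type as s3: no new mark
    ("s5", "A", 90),
]
marks = place_phase_marks(sections, min_size=45)
print(marks)
```

Raising `min_size` trades accuracy for fewer marks; with a large enough value no section qualifies and nothing is instrumented.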

slide-35
SLIDE 35

Monitoring and Assignment

Phase marks

◮ Dynamic analysis code
  ◮ Monitor code if mapping is unknown
  ◮ Switch cores if mapping is known
◮ Type information
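
The behavior of a phase mark at runtime can be sketched as a small handler keyed by the type information the mark carries. This is an illustration only: the class name, the `measure_ipc` callback, and the IPC threshold used to pick a core are assumptions, not details given in the slides.

```python
from typing import Callable, Dict, List


class PhaseMarkRuntime:
    """Minimal stand-in for the dynamic analysis code attached to phase marks."""

    def __init__(self, measure_ipc: Callable[[str], float], ipc_threshold: float = 1.0):
        self.measure_ipc = measure_ipc      # placeholder for hardware counters
        self.ipc_threshold = ipc_threshold  # assumed decision rule
        self.mapping: Dict[str, str] = {}   # phase type -> core class
        self.log: List[str] = []

    def on_phase_mark(self, phase_type: str) -> str:
        """Called when execution reaches a mark; returns the chosen core class."""
        if phase_type in self.mapping:
            # Mapping is known: switch cores without monitoring again.
            core = self.mapping[phase_type]
            self.log.append(f"switch:{phase_type}->{core}")
            return core
        # Mapping is unknown: monitor this section once and remember the result.
        ipc = self.measure_ipc(phase_type)
        core = "fast" if ipc >= self.ipc_threshold else "slow"
        self.mapping[phase_type] = core
        self.log.append(f"monitor:{phase_type}")
        return core


fake_ipc = {"A": 1.7, "B": 0.4}  # made-up measurements per phase type
runtime = PhaseMarkRuntime(measure_ipc=lambda t: fake_ipc[t])
for phase_type in ["A", "B", "A", "B"]:
    runtime.on_phase_mark(phase_type)
print(runtime.log)
```

Each type is monitored only the first time it is seen; every later mark of that type is a dictionary lookup followed by a core switch.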

slide-36
SLIDE 36

Asymmetry Aware Scheduler

◮ What: Scheduler assigns threads to well matched cores
◮ Benefits:
  ◮ Very accurate since based on actual performance
  ◮ Makes system wide decisions
  ◮ Programs switch cores as behavior changes
◮ Limitations:
  ◮ Monitoring is required throughout entire execution
  ◮ Decisions for future execution are based on past behavior
  ◮ Requires OS modification

slide-37
SLIDE 37

Improved Load Balancing

◮ What: “Fast” cores get more processes or round-robin
◮ Benefits:
  ◮ Low overhead: does not monitor execution
  ◮ System wide decision making
◮ Limitations:
  ◮ Aimed at fairness
  ◮ Ignores behavior: “Fast” programs may be on “slow” cores
  ◮ Requires OS modification

slide-38
SLIDE 38

Intuition Behind Our Solution

◮ Problem: Assign code sections to cores such that behavior of code matches resources of cores
◮ Idea: If we can determine that sections of code will behave in a similar way, knowledge of one section gives us information about all similar sections.
◮ Advantages of this approach
  ◮ Only need to monitor a small amount of code dynamically
  ◮ No need to predict the actual behavior, just similarity
  ◮ Considers changes in program behavior
  ◮ No knowledge of target machine required
  ◮ No need for execution traces or sample inputs
  ◮ Automatic: transparent to programmer and end user