StressRight: Finding the Right Stress for Accurate In-development System Evaluation
Jaewon Lee¹, Hanhwi Jang¹, Jae-eon Jo¹, Gyu-Hyeon Lee², Jangwoo Kim²
High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)
Configuring Workloads
- Modern workloads are configurable
− No definite answer: depends on the usage scenario
Evaluating a System
[Diagram: Workloads → System → Performance report (e.g., latency, throughput) → Reconfigure workloads & system]
Evaluating an In-development System
[Diagram: Workloads → System simulator / emulator → Investigate microarchitecture details]
- System modeling tools: too slow or too inaccurate
- No performance report
Workload Configuration Matters
- Configuration ⇒ System behavior
− The system executes different code patterns
⇒ Different analysis results & system design insights
- Must configure workloads to represent actual usage scenarios
Index
- Introduction / Motivation
- Limitations
- Proposed idea: StressRight
- Evaluation
- Conclusion
Limitations (of the Existing Methods)
- Inaccurate insights about the configurations
− Short simulation: No high-level metrics
− DBT-based simulation: No kernel considerations
− Emulator: No timing considerations
[Figure: memcached latency (normalized) vs. query throughput (normalized)]
Index
- Introduction / Motivation
- Limitations
- Proposed idea: StressRight
− Goals & Key ideas
− Method details
- Conclusion
StressRight: Goals
- Goals
− Quickly derive workload-reported performance metrics (e.g., latency, throughput)
  - To explore workload configurations for in-development systems
  - To evaluate the systems with the right stress behaviors
- Requirements
− Long workload execution
  - Must observe high-level workload-reported performance metrics
− Efficient performance model
  - To quickly derive the performance metrics
StressRight: Key Ideas
- Long workload execution
− Use timing-agnostic platforms (e.g., Emulators)
⇒ Extract user & kernel behavior, analyze performance later
- Efficient performance model
− Leverage redundancy in workloads
⇒ Analyze only the unique behaviors (i.e., code blocks)
⇒ Overall behavior = ∑ Analyzed unique behaviors
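The "overall behavior = sum of analyzed unique behaviors" idea can be illustrated with a minimal Python sketch. It assumes that each unique code block has already been assigned a cycle estimate and that the emulator has recorded the sequence of executed block IDs. The function name `total_cycles`, the block IDs, and the cycle values are hypothetical and are not taken from the slides.

```python
from collections import Counter


def total_cycles(block_trace, cycles_per_block):
    """Sum per-block cycle estimates, weighted by how often each block ran.

    block_trace: sequence of block IDs in execution order (hypothetical format).
    cycles_per_block: cycle estimate for each unique block, analyzed only once.
    """
    counts = Counter(block_trace)  # number of executions of each unique block
    return sum(cycles_per_block[b] * n for b, n in counts.items())


# Hypothetical trace: 7 block executions, but only 3 unique blocks to analyze.
trace = ["A", "B", "A", "A", "C", "B", "A"]
cycles = {"A": 10, "B": 25, "C": 40}

print(total_cycles(trace, cycles))  # → 130
```

Only three blocks need a performance analysis here even though seven block executions occurred; the redundancy in real workloads is what makes this kind of aggregation cheap.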
StressRight: Overview
[Figure, built up over several animation steps:
1. Emulation: code blocks (A, B, C) recorded on Core 0 and Core 1; no timing (1-IPC), so the reported 100 Ops/sec is inaccurate
2. Functional simulation: memory / branch trace ⇒ cache and branch hit rates over time
3. Timing reconstruction: blocks A, A, A, B, B, C analyzed under high / low / medium $ hit conditions
4. Reschedule & Reinterpret: blocks re-placed on Core 0 and Core 1 ⇒ 120 Ops/sec (accurate)]
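The functional simulation step in the overview can be sketched as a timing-free cache model that replays the memory trace and reports a hit rate per code block. This is only an illustration: the direct-mapped organization, the `(block_id, address)` trace format, and the name `hit_rates` are assumptions, not details given in the slides.

```python
from collections import defaultdict


def hit_rates(mem_trace, num_sets=4, line_bytes=64):
    """Return the cache hit rate observed by each code block.

    mem_trace: iterable of (block_id, address) pairs in execution order.
    The cache is direct-mapped (one tag per set) and has no notion of time.
    """
    tags = [None] * num_sets
    hits = defaultdict(int)
    accesses = defaultdict(int)
    for block_id, addr in mem_trace:
        line = addr // line_bytes
        index, tag = line % num_sets, line // num_sets
        accesses[block_id] += 1
        if tags[index] == tag:
            hits[block_id] += 1
        else:
            tags[index] = tag  # miss: fill the line, evicting the old tag
    return {b: hits[b] / accesses[b] for b in accesses}


# Hypothetical trace: blocks A and B conflict on the same cache set.
trace = [("A", 0), ("A", 8), ("B", 256), ("A", 0), ("B", 256)]
print(hit_rates(trace))
```

Because blocks A and B evict each other in this trace, the hit rate of a block depends on what ran around it, which is why the trace is replayed in execution order rather than per block in isolation.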
StressRight: Timing Reconstruction
- Challenge: Code blocks are too short
− Pipeline drain effect is nontrivial
- Solution: Consider hypothetical next block
− Assume: next block issue rate ≈ current block issue rate
− Use power law* to further adjust the rate
[Figure: when a block is analyzed alone, the ROB and IQ empty out at its end and the issue rate drops (not true for longer traces). With a hypothetical next block filling them: current block avg. issue = 2.0 IPC, next block issues proportional to 2.0 IPC; a larger window issues more.]
*S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009
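The power-law adjustment can be illustrated with a small sketch. The slides do not give the exact formula StressRight uses, so the code below assumes the common form rate = alpha * W**beta relating issue rate to window size W. The function name and the exponent beta = 0.5 are hypothetical choices for illustration only.

```python
def adjusted_issue_rate(current_ipc, current_window, larger_window, beta=0.5):
    """Scale a block's issue rate to a different instruction-window size.

    Assumes (hypothetically) that issue rate follows rate = alpha * W**beta.
    alpha is calibrated so the law reproduces current_ipc at current_window.
    """
    alpha = current_ipc / current_window ** beta
    return alpha * larger_window ** beta


# Hypothetical numbers: 2.0 IPC measured with 16 instructions in the window;
# estimate the rate if a next block kept 64 instructions in flight.
print(adjusted_issue_rate(2.0, 16, 64))
```

A real model would also cap the result at the machine's issue width; that cap is left out to keep the sketch short.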
StressRight: Multiple Performances
- Challenge: Difficult to model every scenario
− A code block with several memory accesses can run at 90%, 50%, 30%, … $ hit rates, and each would need its own analysis (IPC A, IPC B, IPC C, …)
- Solution: Mix template scenarios
− Randomly generate scenarios & mix them
− A few templates are enough
[Figure: a code block with five memory accesses. 60% hit template (Hit, Miss, Hit, Hit, Miss) ⇒ IPC 2.0; 40% hit template (Hit, Miss, Miss, Hit, Miss) ⇒ IPC 1.6; mixed for a 50% hit rate ⇒ IPC 1.8]
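The template-mixing example (60% hit ⇒ IPC 2.0, 40% hit ⇒ IPC 1.6, 50% hit ⇒ IPC 1.8) is consistent with linear interpolation between the two nearest templates. The slides do not state how the mixing is done, so the sketch below is an assumption that happens to reproduce those numbers. The function name `mix_templates` is hypothetical, and the templates are assumed to have distinct hit rates.

```python
def mix_templates(target_hit_rate, templates):
    """Linearly interpolate IPC between the two templates bracketing the target.

    templates: list of (hit_rate, ipc) pairs from analyzed template scenarios,
    with distinct hit rates.
    """
    pts = sorted(templates)
    for (h0, ipc0), (h1, ipc1) in zip(pts, pts[1:]):
        if h0 <= target_hit_rate <= h1:
            w = (target_hit_rate - h0) / (h1 - h0)  # weight of the upper template
            return ipc0 + w * (ipc1 - ipc0)
    raise ValueError("target hit rate lies outside the analyzed templates")


# Template values taken from the slide; the 50% target lies midway between them.
print(mix_templates(0.5, [(0.6, 2.0), (0.4, 1.6)]))  # approximately 1.8
```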
StressRight: Rescheduling
- Basic scheduling method
− Schedule to the earliest possible slot
- Three rules
− Rule 1: Blocks from a thread execute serially
− Rule 2: Critical sections shouldn’t overlap
− Rule 3: Threads should wait for barriers
StressRight: Rescheduling
- Rule 1: Blocks from a thread execute serially
− Tag code blocks with the executor thread ID
− Prohibit blocks from a thread from running concurrently
[Figure: two Thread 1 blocks placed on Core 0 and Core 1 at the same time (before) vs. serialized, leaving an idle slot (after)]
StressRight: Rescheduling
- Rule 2: Critical sections shouldn’t overlap
− Tag code blocks with synchronization variable ID (if applicable)
− Prohibit the critical sections from overlapping
[Figure: Thread 1 and Thread 2, on Core 0 and Core 1, both run critical section A; after rescheduling the two A sections no longer overlap, leaving an idle slot]
StressRight: Rescheduling
- Rule 3: Threads should wait for barriers
− Tag code blocks related to barrier operations (if applicable)
− Prohibit the scheduling before the last barrier_wait()
[Figure: Thread 1 and Thread 2, on Core 0 and Core 1, call barrier_wait(); the following Thread 1 block is scheduled only after the last barrier_wait(), leaving an idle slot]
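The earliest-slot scheduling method and its three rules can be sketched as a greedy scheduler over the recorded block sequence. This is an illustrative simplification, not the StressRight implementation. The `Block` fields and the `reschedule` function are hypothetical, blocks are assumed to arrive in the order the emulator recorded them, and idle gaps on a core are never backfilled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    thread: int
    cycles: int
    lock: Optional[str] = None     # synchronization variable ID (Rule 2)
    barrier: Optional[str] = None  # set on barrier_wait() blocks (Rule 3)


def reschedule(blocks, num_cores):
    """Place each block in the earliest allowed slot; returns (start, end, core)."""
    core_free = [0] * num_cores
    thread_free, lock_free = {}, {}
    barrier_done, pending = {}, {}
    schedule = []
    for b in blocks:
        ready = thread_free.get(b.thread, 0)              # Rule 1: serial per thread
        if b.lock is not None:
            ready = max(ready, lock_free.get(b.lock, 0))  # Rule 2: no overlap
        if b.thread in pending:                           # Rule 3: wait for barrier
            ready = max(ready, barrier_done[pending.pop(b.thread)])
        # Pick the core that allows the earliest start (lowest index on ties).
        core = min(range(num_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[core], ready)
        end = start + b.cycles
        core_free[core] = thread_free[b.thread] = end
        if b.lock is not None:
            lock_free[b.lock] = end
        if b.barrier is not None:
            # The barrier releases when its last barrier_wait() block finishes.
            barrier_done[b.barrier] = max(barrier_done.get(b.barrier, 0), end)
            pending[b.thread] = b.barrier
        schedule.append((start, end, core))
    return schedule


# Hypothetical example: two threads contend for critical section "A".
demo = [Block(1, 10, lock="A"), Block(2, 10, lock="A"), Block(1, 5)]
print(reschedule(demo, num_cores=2))
```

In the example, the second critical section cannot start until the first one ends, so one core sits idle for part of the schedule, as in the Rule 2 figure.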
Evaluation
- Quantitative analysis
− Why StressRight would work well
- Accuracy and speed
− Comparison with cycle-level simulation (MARSSx86)
− Model 1 / 12 / 16 OoO x86 cores
− SPEC, PARSEC, memcached
- Implementation
− Emulation: QEMU
− Reconstruction models: Python
− Functional simulators: C++
Quantitative Analysis
- Efficiency of the method
− # instructions: full-execution vs. unique code blocks
− Orders of magnitude reduction in the analysis load
*mcd: memcached, BS: blackscholes, BT: bodytrack, SW: swaptions, DD: dedup
Quantitative Analysis
- Accuracy of the dynamic resource models
− Functional simulations are accurate enough
Functional vs. Cycle-level memory simulation
Accuracy: SPEC
- Validating the pipeline model
− Correctly estimates the first-order performance
- Improvement in progress: Better memory model
Accuracy: PARSEC
- Validating the scheduler
− Correctly estimates the scaling behavior
- Improvement in progress: Barrier synchronizations
*We model a 12-core system
Accuracy: memcached
- Reconstructing throughput-latency curve
− StressRight greatly improves over the existing methods
*We model a 16-core system; 8 cores host the server and 8 cores run the load generator
Speed evaluation
- An order of magnitude faster than the simulator
− Main bottleneck is cache simulation
*Reconstruction uses 40 vCPUs
Conclusion
- Motivation
− Stress in-development systems with actual usage scenarios to obtain correct insights
- Key ideas
− Focus only on unique behavior
− Consider execution dynamics: $, branch, and scheduling
- Results
− Accurately reconstruct workload-reported performance metrics an order of magnitude faster
Thank you!