StressRight: Finding the Right Stress for Accurate In-development System Evaluation (PowerPoint Presentation)


SLIDE 1

StressRight:

Finding the Right Stress for Accurate In-development System Evaluation

Jaewon Lee¹, Hanhwi Jang¹, Jae-eon Jo¹, Gyu-Hyeon Lee², Jangwoo Kim²

High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)¹; Seoul National University²

SLIDE 2

Configuring Workloads

  • Modern workloads are configurable

− No definite answer: depends on the usage scenario

SLIDE 3

Evaluating a System

[Diagram: Workloads run on a System and produce a performance report (e.g., latency, throughput), which feeds back to reconfigure the workloads & system]

SLIDE 4

Evaluating an In-development System

[Diagram: Workloads run on a system simulator / emulator to investigate microarchitecture details]

System modeling tools are too slow or too inaccurate: no performance report

SLIDE 5

Workload Configuration Matters

  • Configuration → System behavior

− The system executes different code patterns → different analysis results & system design insights

Must configure workloads to represent actual usage scenarios

SLIDE 6

Index

  • Introduction / Motivation
  • Limitations
  • Proposed idea: StressRight
  • Evaluation
  • Conclusion

SLIDE 7

Limitations of the Existing Methods

  • Inaccurate insights about the configurations

− Short simulation: no high-level metrics
− DBT-based simulation: no kernel considerations
− Emulator: no timing considerations

[Chart: memcached latency (normalized) vs. query throughput (normalized)]

SLIDE 8

Index

  • Introduction / Motivation
  • Limitations
  • Proposed idea: StressRight

− Goals & Key ideas
− Method details

  • Conclusion

SLIDE 9

StressRight: Goals

  • Goals

− Quickly derive workload-reported performance metrics (e.g., latency, throughput)

  • To explore workload configurations for in-development systems
  • To evaluate the systems under the right stress behaviors

  • Requirements

− Long workload execution

  • Must observe high-level workload-reported performance metrics

− Efficient performance model

  • To quickly derive the performance metrics

SLIDE 10

StressRight: Key Ideas

  • Long workload execution

− Use timing-agnostic platforms (e.g., emulators)

⇒ Extract user & kernel behavior, analyze performance later

  • Efficient performance model

− Leverage redundancy in workloads

⇒ Analyze only the unique behaviors (i.e., code blocks)
⇒ Overall behavior = Σ analyzed unique behaviors
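The aggregation idea above can be sketched in a few lines of Python: each unique code block is analyzed once, and the full run is reconstructed by weighting that per-block result by how often the block appeared. The block names, instruction counts, IPCs, and execution counts below are purely hypothetical illustrations, not values from the talk.

```python
# Sketch: overall behavior = sum of analyzed unique behaviors.
# Each unique block is costed once, then reused for every occurrence.

def reconstruct_cycles(block_profiles, execution_counts):
    """Estimate total cycles from per-unique-block analyses."""
    total = 0.0
    for block_id, count in execution_counts.items():
        profile = block_profiles[block_id]            # analyzed once
        cycles = profile["instructions"] / profile["ipc"]
        total += cycles * count                       # reused many times
    return total

# Hypothetical unique blocks, analyzed once by the performance model
profiles = {
    "A": {"instructions": 400, "ipc": 2.0},
    "B": {"instructions": 200, "ipc": 1.0},
    "C": {"instructions": 100, "ipc": 0.5},
}
# Hypothetical occurrence counts from the full emulated run
counts = {"A": 3, "B": 2, "C": 1}

print(reconstruct_cycles(profiles, counts))  # 1200.0
```

Only three blocks are analyzed here, yet six block executions are costed; the savings grow with the redundancy in the workload.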

SLIDES 11-15

StressRight: Overview

(Slides 11-15 build up one diagram in stages)

[Diagram: Emulation yields the unique code blocks and an inaccurate, timing-free 1-IPC schedule (e.g., 100 ops/sec on Core 0 / Core 1). Functional simulation replays the memory / branch trace to recover cache hit rates and branch behavior over time. Timing reconstruction assigns each code block a duration from its behavior (high / medium / low $ hit). Reschedule & reinterpret rebuilds an accurate multi-core schedule, yielding the corrected throughput (e.g., 120 ops/sec, accurate).]
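The functional-simulation stage can be illustrated with a toy timing-free cache model: the memory trace is replayed purely to classify each access as hit or miss, with no cycle accounting. The direct-mapped geometry and the trace below are hypothetical simplifications, not the deck's actual simulator.

```python
# Sketch: replay a memory trace through a functional (timing-free)
# cache model to obtain hit rates for later timing reconstruction.

class DirectMappedCache:
    def __init__(self, num_lines=64, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines

    def access(self, addr):
        """Return True on a hit; update the cache state, but track no timing."""
        line = addr // self.line_bytes
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag   # fill on miss
        return False

def hit_rate(cache, trace):
    hits = sum(cache.access(a) for a in trace)
    return hits / len(trace)

cache = DirectMappedCache()
trace = [0x1000, 0x1004, 0x1040, 0x1000, 0x9000, 0x1004]
print(hit_rate(cache, trace))  # 2 hits out of 6 accesses
```

Because no timing is modeled, this replay can run far faster than cycle-level simulation while still producing the hit rates the timing-reconstruction stage consumes.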

SLIDES 16-20

StressRight: Timing Reconstruction

  • Challenge: Code blocks are too short

− The pipeline drain effect is nontrivial: as the ROB and issue queue empty at the end of a short block, the issue rate drops (not true for longer traces)

  • Solution: Consider a hypothetical next block

− Assume: next block issue rate ≈ current block issue rate (e.g., if the current block averages 2.0 IPC, the next block issues proportionally to 2.0 IPC)
− Use the power law* to further adjust the rate: a larger window issues more

*S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009
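A minimal sketch of this correction, assuming a power-law relation between effective window size and issue rate in the spirit of the mechanistic model cited above. The exponent, the window-fill parameter, and all numbers are illustrative assumptions, not the talk's calibrated values.

```python
# Sketch: avoid the artificial end-of-block drain by pretending a
# hypothetical next block keeps issuing at the current block's rate,
# scaled by a power law in the effective window size
# (issue rate ~ window^alpha, alpha an assumed exponent).

def block_cycles(instructions, avg_ipc, window_fill=1.0, alpha=0.5):
    """Estimate a block's cycles under the next-block assumption."""
    effective_ipc = avg_ipc * (window_fill ** alpha)
    return instructions / effective_ipc

# Full window: the block issues at its measured average rate
print(block_cycles(200, avg_ipc=2.0))                   # 100.0 cycles
# Quarter-full window: the power law scales the issue rate down
print(block_cycles(200, avg_ipc=2.0, window_fill=0.25))  # 200.0 cycles
```

The key point is that the tail of a short block is costed as if execution continued, rather than as a full pipeline drain, which would otherwise overstate every block's latency.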

SLIDES 21-24

StressRight: Multiple Performances

  • Challenge: Difficult to model every scenario

− The same code block can run under many cache-hit scenarios (90% $ hit, 50% $ hit, 30% $ hit, ...), each analysis yielding a different IPC

  • Solution: Mix template scenarios

− Randomly generate scenarios & mix them
− A few templates are enough

[Example: a 60%-hit template yields IPC 2.0 and a 40%-hit template yields IPC 1.6; mixing them approximates a 50%-hit scenario at IPC 1.8]
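The template-mixing step can be sketched as interpolation between the two analyzed templates that bracket the requested hit rate. The template hit rates and IPCs match the slide's example; the helper function itself is a hypothetical illustration.

```python
# Sketch: analyze a code block under a few template hit-rate
# scenarios, then mix (interpolate) templates for any other hit rate.

def mix_templates(templates, hit_rate):
    """Linearly interpolate IPC between the two templates
    bracketing the requested hit rate (in percent)."""
    points = sorted(templates.items())  # [(hit_rate, ipc), ...]
    for (h0, ipc0), (h1, ipc1) in zip(points, points[1:]):
        if h0 <= hit_rate <= h1:
            weight = (hit_rate - h0) / (h1 - h0)
            return ipc0 + weight * (ipc1 - ipc0)
    raise ValueError("hit rate outside the template range")

# Two templates per block are analyzed once (hit rate % -> IPC)
templates = {40: 1.6, 60: 2.0}
print(mix_templates(templates, 50))  # 1.8, as in the slide's example
```

This is why "a few templates are enough": each new scenario costs only an interpolation, not another full analysis of the block.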

SLIDE 25

StressRight: Rescheduling

  • Basic scheduling method

− Schedule each block to the earliest possible slot

  • Three rules

− Rule 1: Blocks from a thread execute serially
− Rule 2: Critical sections shouldn’t overlap
− Rule 3: Threads should wait for barriers
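An earliest-slot scheduler obeying the three rules can be sketched as a greedy pass over the block stream. The `Block` fields, the fixed per-block durations, and the two-core example are hypothetical simplifications of the scheduler described in the slides.

```python
# Sketch: greedily place each block in the earliest slot that
# satisfies Rule 1 (per-thread serialization), Rule 2 (critical
# sections never overlap), and Rule 3 (wait for the last barrier).

from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    thread: int
    cycles: int
    lock: Optional[int] = None     # Rule 2: critical-section variable ID
    after_barrier: bool = False    # Rule 3: must follow the barrier

def schedule(blocks, num_cores):
    core_free = [0] * num_cores    # when each core next becomes idle
    thread_free = {}               # Rule 1: per-thread completion times
    lock_free = {}                 # Rule 2: per-lock release times
    barrier_time = 0               # Rule 3: time of the last barrier_wait()
                                   # (taken from the trace; fixed at 0 here)
    placements = []
    for b in blocks:
        start = thread_free.get(b.thread, 0)
        if b.lock is not None:
            start = max(start, lock_free.get(b.lock, 0))
        if b.after_barrier:
            start = max(start, barrier_time)
        # earliest possible slot across cores
        core = min(range(num_cores), key=lambda c: max(core_free[c], start))
        start = max(start, core_free[core])
        end = start + b.cycles
        core_free[core] = end
        thread_free[b.thread] = end
        if b.lock is not None:
            lock_free[b.lock] = end
        placements.append((core, start, end))
    return placements

blocks = [Block(thread=1, cycles=4),
          Block(thread=1, cycles=3),          # Rule 1: serialized after block 0
          Block(thread=2, cycles=5, lock=0),
          Block(thread=1, cycles=2, lock=0)]  # Rule 2: waits for the lock
print(schedule(blocks, num_cores=2))
```

Each placement is a `(core, start, end)` triple; blocks that violate a rule are simply pushed to a later start time, which is where the idle gaps in the slides' diagrams come from.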

SLIDE 26

StressRight: Rescheduling

  • Rule 1: Blocks from a thread execute serially

− Tag code blocks with the executing thread ID
− Prohibit blocks from the same thread from running concurrently

[Diagram: two Thread-1 blocks that overlapped on Core 0 and Core 1 are serialized, leaving one core idle during the overlap]

SLIDE 27

StressRight: Rescheduling

  • Rule 2: Critical sections shouldn’t overlap

− Tag code blocks with the synchronization variable ID (if applicable)
− Prohibit the critical sections from overlapping

[Diagram: critical sections A of Thread 1 (Core 0) and Thread 2 (Core 1) overlapped; one section A is delayed, leaving its core idle]

SLIDE 28

StressRight: Rescheduling

  • Rule 3: Threads should wait for barriers

− Tag code blocks related to barrier operations (if applicable)
− Prohibit scheduling before the last barrier_wait()

[Diagram: Thread 1's post-barrier block is delayed until Thread 2's last barrier_wait() completes, leaving Core 1 idle]

SLIDE 29

Index

  • Introduction / Motivation
  • Limitations
  • Proposed idea: StressRight
  • Evaluation
  • Conclusion

SLIDE 30

Evaluation

  • Quantitative analysis

− Why StressRight would work well

  • Accuracy and speed

− Comparison with cycle-level simulation (MARSSx86)
− Model 1 / 12 / 16 OoO x86 cores
− SPEC, PARSEC, memcached

  • Implementation

− Emulation: QEMU; reconstruction models: Python; functional simulators: C++

SLIDE 31

Quantitative Analysis

  • Efficiency of the method

− # instructions: full execution vs. unique code blocks
− Orders-of-magnitude reduction in the analysis load

*mcd: memcached, BS: blackscholes, BT: bodytrack, SW: swaptions, DD: dedup

SLIDE 32

Quantitative Analysis

  • Accuracy of the dynamic resource models

− Functional simulations are accurate enough

[Chart: functional vs. cycle-level memory simulation]

SLIDE 34

Accuracy: SPEC

  • Validating the pipeline model

− Correctly estimates the first-order performance

  • Improvement in progress: a better memory model
SLIDE 35

Accuracy: PARSEC

  • Validating the scheduler

− Correctly estimates the scaling behavior

  • Improvement in progress: barrier synchronizations

*We model a 12-core system

SLIDE 36

Accuracy: memcached

  • Reconstructing the throughput-latency curve

− StressRight greatly improves over the existing methods

*We model a 16-core system; 8 cores host the server and 8 cores run the load generator

SLIDE 37

Speed Evaluation

  • An order of magnitude faster than cycle-level simulation

− The main bottleneck is cache simulation

*Reconstruction uses 40 vCPUs

SLIDE 38

Conclusion

  • Motivation

− Stress in-development systems with actual usage scenarios to obtain correct insights

  • Key ideas

− Focus only on unique behavior
− Consider execution dynamics: cache ($), branch, and scheduling

  • Results

− Accurately reconstruct workload-reported performance metrics at an order-of-magnitude faster speed

SLIDE 39

Thank you!

Jaewon Lee (spiegel0@postech.ac.kr)
Pohang University of Science and Technology (POSTECH)