
StressRight: Finding the Right Stress for Accurate In-development System Evaluation



  1. StressRight: Finding the Right Stress for Accurate In-development System Evaluation
     Jaewon Lee¹, Hanhwi Jang¹, Jae-eon Jo¹, Gyu-Hyeon Lee², Jangwoo Kim²
     High Performance Computing Lab
     ¹Pohang University of Science and Technology (POSTECH), ²Seoul National University

  2. Configuring Workloads
     • Modern workloads are configurable
       − No definite answer: depends on the usage scenario

  3. Evaluating a System
     [Diagram: Workloads → System → Performance report (e.g., latency, throughput) → Reconfigure workloads & system]

  4. Evaluating an In-development System
     [Diagram: Workloads → System simulator / emulator → Investigate uArchitecture details; no performance report]
     • System modeling tools: Too slow or too inaccurate

  5. Workload Configuration Matters
     • Configuration → System behavior
       − The system executes different code patterns
         ⇒ Different analysis results & system design insights
     • Must configure to represent actual usage scenarios

  6. Index
     • Introduction / Motivation
     • Limitations
     • Proposed idea: StressRight
     • Evaluation
     • Conclusion

  7. Limitations (of the Existing Methods)
     • Inaccurate insights about the configurations
       − Short simulation: No high-level metrics
       − DBT-based simulation: No kernel considerations
       − Emulator: No timing considerations
     [Figure: normalized latency vs. normalized memcached query throughput]

  8. Index
     • Introduction / Motivation
     • Limitations
     • Proposed idea: StressRight
       − Goals & Key ideas
       − Method details
     • Conclusion

  9. StressRight: Goals
     • Goals
       − Quickly derive workload-reported performance metrics (e.g., latency, throughput)
         ⇒ To explore workload configurations for in-development systems
         ⇒ To evaluate the systems with the right stress behaviors
     • Requirements
       − Long workload execution
         ⇒ Must observe high-level workload-reported performance metrics
       − Efficient performance model
         ⇒ To quickly derive the performance metrics

  10. StressRight: Key Ideas
      • Long workload execution
        − Use timing-agnostic platforms (e.g., emulators)
          ⇒ Extract user & kernel behavior, analyze performance later
      • Efficient performance model
        − Leverage redundancy in workloads
          ⇒ Analyze only the unique behaviors (i.e., code blocks)
          ⇒ Overall behavior = ∑ Analyzed unique behaviors
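
The "analyze only the unique behaviors" idea can be illustrated with a minimal Python sketch. It assumes a trace is simply a list of code-block IDs and that a per-block analysis returns a cycle count; the names `total_cycles`, `analyze`, and the `COSTS` table are hypothetical stand-ins, not part of StressRight itself.

```python
from collections import Counter

def total_cycles(trace, analyze_block):
    """Estimate total cycles while analyzing each unique block only once."""
    counts = Counter(trace)  # number of executions per unique block
    cycles_per_block = {blk: analyze_block(blk) for blk in counts}
    # Overall behavior = sum over unique blocks of (executions * analyzed cost)
    return sum(counts[blk] * cycles_per_block[blk] for blk in counts)

# Hypothetical per-block cost table standing in for a real performance model.
COSTS = {"A": 10, "B": 25, "C": 40}
calls = []  # records how many times the (expensive) analysis actually runs

def analyze(block):
    calls.append(block)
    return COSTS[block]

trace = ["A", "B", "A", "A", "C", "A", "B"]
estimate = total_cycles(trace, analyze)
print(estimate, len(calls))  # 7 executed blocks, but only 3 analyses
```

The analysis cost grows with the number of unique blocks rather than with the length of the execution, which is the redundancy the slide refers to.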

  11. StressRight: Overview
      • Emulation (no timing, 1-IPC): records the executed code blocks
        − Core 0: A B A A
        − Core 1: C A B
        − Reported throughput: 100 Ops/sec (inaccurate)

  12. StressRight: Overview
      • Emulation (no timing, 1-IPC): records the executed code blocks
        − Core 0: A B A A
        − Core 1: C A B
        − Reported throughput: 100 Ops/sec (inaccurate)
      • Functional cache / branch simulation
        − Input: memory / branch trace
        − Output: hit rate over time

  13. StressRight: Overview
      • Emulation (no timing, 1-IPC): records the executed code blocks
        − Core 0: A B A A
        − Core 1: C A B
        − Reported throughput: 100 Ops/sec (inaccurate)
      • Functional cache / branch simulation
        − Input: memory / branch trace
        − Output: hit rate over time
      • Timing reconstruction for the code blocks (A, B, C)

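  14. StressRight: Overview
      • Emulation (no timing, 1-IPC): records the executed code blocks
        − Core 0: A B A A
        − Core 1: C A B
        − Reported throughput: 100 Ops/sec (inaccurate)
      • Functional cache / branch simulation
        − Output: hit rate over time
      • Timing reconstruction for the code blocks (A, B, C)
        − Block instances are timed under high, medium, and low $ hit rates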

  15. StressRight: Overview
      • Emulation (no timing, 1-IPC): records the executed code blocks
        − Core 0: A B A A
        − Core 1: C A B
        − Reported throughput: 100 Ops/sec (inaccurate)
      • Functional cache / branch simulation
        − Output: hit rate over time
      • Timing reconstruction for the code blocks (A, B, C)
      • Reschedule & reinterpret
        − Core 0: A B A
        − Core 1: C A B A
        − Reported throughput: 120 Ops/sec (accurate)
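
The functional cache simulation step tracks only hits and misses, with no timing. The sketch below is one simple way to do that, assuming a fully associative LRU cache; the function name `lru_hit_rate` and the cache parameters are illustrative choices, not taken from the slides.

```python
from collections import OrderedDict

def lru_hit_rate(addresses, num_lines=4, line_size=64):
    """Return the hit rate of a fully associative LRU cache (no timing modeled)."""
    cache = OrderedDict()  # keys are cache-line numbers, ordered oldest to newest
    hits = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses) if addresses else 0.0

# Illustrative memory trace (byte addresses).
trace = [0, 8, 64, 0, 128, 192, 256, 0]
print(lru_hit_rate(trace))
```

Running the same function over consecutive windows of a trace would give a hit rate over time, which is the quantity the overview feeds into timing reconstruction.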

  16. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      [Diagram: IQ and ROB]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  17. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      [Diagram: IQ and ROB with empty entries; the issue rate drops (not true for longer traces)]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  18. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      • Solution: Consider hypothetical next block
        − Assume: next block issue rate ≈ current block issue rate
        − Use power law* to further adjust the rate
      [Diagram: IQ and ROB with empty entries; the issue rate drops (not true for longer traces)]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  19. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      • Solution: Consider hypothetical next block
        − Assume: next block issue rate ≈ current block issue rate
        − Use power law* to further adjust the rate
      [Diagram: IQ and ROB filled with hypothetical "Next" entries; current block avg. issue = 2.0 IPC; next block issues proportional to 2.0 IPC]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009

  20. StressRight: Timing Reconstruction
      • Challenge: Code blocks are too short
        − Pipeline drain effect is nontrivial
      • Solution: Consider hypothetical next block
        − Assume: next block issue rate ≈ current block issue rate
        − Use power law* to further adjust the rate
          ⇒ Larger window ⇒ Issue more
      [Diagram: IQ and ROB filled with hypothetical "Next" entries; current block avg. issue = 2.0 IPC; next block issues proportional to 2.0 IPC]
      *S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009
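
The slides do not give the exact power-law formula, so the following is only a sketch under an assumed form: issue rate scales as `window ** beta`, capped by the machine's issue width. The function name `scale_issue_rate`, the exponent `beta=0.5`, and the window sizes are all hypothetical values chosen for illustration.

```python
def scale_issue_rate(ref_ipc, ref_window, new_window, beta=0.5, issue_width=None):
    """Scale an issue rate measured at ref_window to new_window.

    Assumes IPC is proportional to window_size ** beta (an assumed power law;
    beta = 0.5 is an illustrative exponent, not a value from the slides).
    """
    ipc = ref_ipc * (new_window / ref_window) ** beta
    if issue_width is not None:
        ipc = min(ipc, issue_width)  # a core cannot issue beyond its width
    return ipc

# A short block averaged 2.0 IPC while effectively using 32 window entries;
# with a hypothetical next block keeping a 128-entry window full:
print(scale_issue_rate(2.0, ref_window=32, new_window=128))
# The same adjustment on a 3-wide core is capped at 3.0:
print(scale_issue_rate(2.0, ref_window=32, new_window=128, issue_width=3))
```

The point of the sketch is the direction of the adjustment: a fuller window exposes more independent instructions, so the reconstructed issue rate for a short block rises rather than draining toward zero at the block boundary.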

  21. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
      [Diagram: a code block containing five memory operations (Mem)]

  22. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
        − 90% $ hit → Analysis → IPC A
        − 50% $ hit → Analysis → IPC B
        − 30% $ hit → Analysis → IPC C
        − …
      [Diagram: a code block containing five memory operations (Mem)]

  23. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
      • Solution: Mix template scenarios
        − Randomly generate scenarios & mix them
        − Few templates are enough
      [Diagram: a code block containing five memory operations (Mem)]
        − 60% hit template (Hit Miss Hit Hit Miss): IPC 2.0
        − 40% hit template (Hit Miss Miss Hit Miss): IPC 1.6

  24. StressRight: Multiple Performances
      • Challenge: Difficult to model every scenario
      • Solution: Mix template scenarios
        − Randomly generate scenarios & mix them
        − Few templates are enough
      [Diagram: a code block containing five memory operations (Mem)]
        − 60% hit template (Hit Miss Hit Hit Miss): IPC 2.0
        − 40% hit template (Hit Miss Miss Hit Miss): IPC 1.6
        − Mixed: 50% hit ⇒ IPC 1.8
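
The slide's numbers (60% hit gives IPC 2.0, 40% hit gives IPC 1.6, and the mix for 50% hit is IPC 1.8) are consistent with a linear blend of the two templates. The sketch below assumes that linear interpolation is the mixing rule; the function name `mix_templates` is hypothetical, and the actual mixing method may differ.

```python
def mix_templates(hit_rate, templates):
    """Linearly interpolate IPC between the two templates that bracket hit_rate.

    templates: iterable of (hit_rate, ipc) pairs with distinct hit rates.
    Linear interpolation is an assumption that matches the slide's example.
    """
    points = sorted(templates)
    if hit_rate <= points[0][0]:
        return points[0][1]   # clamp below the lowest template
    if hit_rate >= points[-1][0]:
        return points[-1][1]  # clamp above the highest template
    for (h0, ipc0), (h1, ipc1) in zip(points, points[1:]):
        if h0 <= hit_rate <= h1:
            weight = (hit_rate - h0) / (h1 - h0)
            return ipc0 + weight * (ipc1 - ipc0)

# Templates from the slide: 60% hit -> IPC 2.0, 40% hit -> IPC 1.6.
templates = [(0.6, 2.0), (0.4, 1.6)]
print(round(mix_templates(0.5, templates), 3))
```

With only a few analyzed templates per code block, any observed hit rate can be mapped to an IPC without rerunning the pipeline analysis.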

  25. StressRight: Rescheduling
      • Basic scheduling method
        − Schedule to the earliest possible slot
      • Three rules
        − Rule 1: Blocks from a thread execute serially
        − Rule 2: Critical sections shouldn’t overlap
        − Rule 3: Threads should wait for barriers

  26. StressRight: Rescheduling
      • Rule 1: Blocks from a thread execute serially
        − Tag code blocks with the executor thread ID
        − Prohibit blocks from a thread from running concurrently
      [Diagram: before, Core 0 and Core 1 run Thread 1 blocks concurrently; after, Core 1 idles until the Thread 1 block on Core 0 finishes]

  27. StressRight: Rescheduling
      • Rule 2: Critical sections shouldn’t overlap
        − Tag code blocks with synchronization variable ID (if applicable)
        − Prohibit the critical sections from overlapping
      [Diagram: before, critical section A of Thread 1 (Core 0) overlaps critical section A of Thread 2 (Core 1); after, Core 1 idles so that the two critical sections run one after the other]

  28. StressRight: Rescheduling
      • Rule 3: Threads should wait for barriers
        − Tag code blocks related to barrier operations (if applicable)
        − Prohibit the scheduling before the last barrier_wait()
      [Diagram: before, Thread 1 continues on Core 0 right after its own barrier_wait(); after, Core 0 idles until the last barrier_wait() (Thread 2 on Core 1) completes]
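
The three rescheduling rules can be sketched as a small list scheduler. This is a simplified illustration, not the authors' implementation: it assumes blocks arrive in emulation order (so every barrier_wait() block of a barrier appears before any block that follows that barrier), appends blocks only at the end of each core's timeline rather than filling earlier gaps, and uses hypothetical names (`Block`, `reschedule`) and tag fields.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    thread: int                    # executor thread ID (Rule 1)
    cycles: int                    # reconstructed duration of the block
    lock: Optional[str] = None     # synchronization variable ID (Rule 2)
    barrier: Optional[str] = None  # set if the block ends in barrier_wait() (Rule 3)

def reschedule(blocks, num_cores):
    """Place each block on the core where it can start earliest."""
    core_free = [0] * num_cores  # time at which each core becomes idle
    thread_ready = {}            # Rule 1: end time of each thread's last block
    lock_free = {}               # Rule 2: end time of the last critical section
    barrier_release = {}         # Rule 3: end time of the last barrier_wait()
    pending_barrier = {}         # barrier each thread is currently waiting on
    schedule = []
    for blk in blocks:
        earliest = thread_ready.get(blk.thread, 0)
        if blk.lock is not None:
            earliest = max(earliest, lock_free.get(blk.lock, 0))
        waited = pending_barrier.pop(blk.thread, None)
        if waited is not None:
            earliest = max(earliest, barrier_release[waited])
        core = min(range(num_cores), key=lambda c: max(core_free[c], earliest))
        start = max(core_free[core], earliest)
        end = start + blk.cycles
        core_free[core] = end
        thread_ready[blk.thread] = end
        if blk.lock is not None:
            lock_free[blk.lock] = end
        if blk.barrier is not None:
            barrier_release[blk.barrier] = max(barrier_release.get(blk.barrier, 0), end)
            pending_barrier[blk.thread] = blk.barrier
        schedule.append((core, start, end))
    return schedule

# Two threads contend for critical section "A"; thread 1 then runs a plain block.
blocks = [Block(1, 10, lock="A"), Block(2, 10, lock="A"), Block(1, 5)]
for core, start, end in reschedule(blocks, num_cores=2):
    print(f"core {core}: {start}..{end}")
```

In this example the second critical section cannot start before cycle 10 even though a core is idle, and thread 1's follow-up block waits for its own first block, which mirrors Rules 1 and 2.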

  29. Index
      • Introduction / Motivation
      • Limitations
      • Proposed idea: StressRight
      • Evaluation
      • Conclusion

  30. Evaluation
      • Quantitative analysis
        − Why StressRight would work well
      • Accuracy and speed
        − Comparison with cycle-level simulation (MARSSx86)
        − Model 1 / 12 / 16 OoO x86 cores
        − SPEC, PARSEC, memcached
      • Implementation
        − Emulation: QEMU
        − Reconstruction models: Python
        − Functional simulators: C++

  31. Quantitative Analysis
      • Efficiency of the method
        − # instructions: full-execution vs. unique code blocks
        − Orders of magnitude reduction in the analysis load
      *mcd: memcached, BS: blackscholes, BT: bodytrack, SW: swaptions, DD: dedup

  32. Quantitative Analysis
      • Accuracy of the dynamic resource models
        − Functional simulations are accurate enough
      [Figure: Functional vs. cycle-level memory simulation]

  34. Accuracy: SPEC
      • Validating the pipeline model
        − Correctly estimates the first-order performance
          ⇒ Improvement in progress: Better memory model

  35. Accuracy: PARSEC
      • Validating the scheduler
        − Correctly estimates the scaling behavior
          ⇒ Improvement in progress: Barrier synchronizations
      *We model a 12-core system
