StressRight: Finding the Right Stress for Accurate In-development System Evaluation (PowerPoint Presentation)


SLIDE 1

StressRight:

Finding the Right Stress for Accurate In-development System Evaluation

Jaewon Lee¹, Hanhwi Jang¹, Jae-eon Jo¹, Gyu-Hyeon Lee², Jangwoo Kim²

High Performance Computing Lab, Pohang University of Science and Technology (POSTECH)¹; Seoul National University²

SLIDE 2

Configuring Workloads

  • Modern workloads are configurable

− No definite answer: depends on the usage scenario

SLIDE 3

Evaluating a System

[Diagram: Workloads run on a System and produce a performance report (e.g., latency, throughput), which feeds back to reconfigure the workloads & system]

SLIDE 4

Evaluating an In-development System

[Diagram: Workloads run on a system simulator / emulator to investigate microarchitecture details]

System modeling tools are too slow or too inaccurate: no performance report

SLIDE 5

Workload Configuration Matters

  • Configuration → System behavior

− The system executes different code patterns → different analysis results & system design insights

Must configure workloads to represent actual usage scenarios

SLIDE 6

Index

  • Introduction / Motivation
  • Limitations
  • Proposed idea: StressRight
  • Evaluation
  • Conclusion

SLIDE 7

Limitations of the Existing Methods

  • Inaccurate insights about the configurations

− Short simulation: no high-level metrics
− DBT-based simulation: no kernel considerations
− Emulator: no timing considerations

[Chart: memcached latency (normalized) vs. query throughput (normalized)]

SLIDE 8

Index

  • Introduction / Motivation
  • Limitations
  • Proposed idea: StressRight

− Goals & Key ideas
− Method details

  • Conclusion

SLIDE 9

StressRight: Goals

  • Goals

− Quickly derive workload-reported performance metrics (e.g., latency, throughput)

  • To explore workload configurations for in-development systems
  • To evaluate the systems under the right stress behaviors

  • Requirements

− Long workload execution

  • Must observe high-level workload-reported performance metrics

− Efficient performance model

  • To quickly derive the performance metrics

SLIDE 10

StressRight: Key Ideas

  • Long workload execution

− Use timing-agnostic platforms (e.g., emulators)

⇒ Extract user & kernel behavior, analyze performance later

  • Efficient performance model

− Leverage redundancy in workloads

⇒ Analyze only the unique behaviors (i.e., code blocks)
⇒ Overall behavior = Σ analyzed unique behaviors
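The aggregation idea above can be sketched in a few lines of Python: each unique code block is analyzed once, and the full run is reconstructed by weighting that per-block result by how often the block appeared. The block names, instruction counts, IPCs, and execution counts below are purely hypothetical illustrations, not values from the talk.

```python
# Sketch: overall behavior = sum of analyzed unique behaviors.
# Each unique block is costed once, then reused for every occurrence.

def reconstruct_cycles(block_profiles, execution_counts):
    """Estimate total cycles from per-unique-block analyses."""
    total = 0.0
    for block_id, count in execution_counts.items():
        profile = block_profiles[block_id]            # analyzed once
        cycles = profile["instructions"] / profile["ipc"]
        total += cycles * count                       # reused many times
    return total

# Hypothetical unique blocks, analyzed once by the performance model
profiles = {
    "A": {"instructions": 400, "ipc": 2.0},
    "B": {"instructions": 200, "ipc": 1.0},
    "C": {"instructions": 100, "ipc": 0.5},
}
# Hypothetical occurrence counts from the full emulated run
counts = {"A": 3, "B": 2, "C": 1}

print(reconstruct_cycles(profiles, counts))  # 1200.0
```

Only three blocks are analyzed here, yet six block executions are costed; the savings grow with the redundancy in the workload.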

SLIDES 11-15

StressRight: Overview

(Slides 11-15 build up one diagram in stages)

[Diagram: Emulation yields the unique code blocks and an inaccurate, timing-free 1-IPC schedule (e.g., 100 ops/sec on Core 0 / Core 1). Functional simulation replays the memory / branch trace to recover cache hit rates and branch behavior over time. Timing reconstruction assigns each code block a duration from its behavior (high / medium / low $ hit). Reschedule & reinterpret rebuilds an accurate multi-core schedule, yielding the corrected throughput (e.g., 120 ops/sec, accurate).]
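The functional-simulation stage can be illustrated with a toy timing-free cache model: the memory trace is replayed purely to classify each access as hit or miss, with no cycle accounting. The direct-mapped geometry and the trace below are hypothetical simplifications, not the deck's actual simulator.

```python
# Sketch: replay a memory trace through a functional (timing-free)
# cache model to obtain hit rates for later timing reconstruction.

class DirectMappedCache:
    def __init__(self, num_lines=64, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines

    def access(self, addr):
        """Return True on a hit; update the cache state, but track no timing."""
        line = addr // self.line_bytes
        index = line % self.num_lines
        tag = line // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag   # fill on miss
        return False

def hit_rate(cache, trace):
    hits = sum(cache.access(a) for a in trace)
    return hits / len(trace)

cache = DirectMappedCache()
trace = [0x1000, 0x1004, 0x1040, 0x1000, 0x9000, 0x1004]
print(hit_rate(cache, trace))  # 2 hits out of 6 accesses
```

Because no timing is modeled, this replay can run far faster than cycle-level simulation while still producing the hit rates the timing-reconstruction stage consumes.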

SLIDES 16-20

StressRight: Timing Reconstruction

  • Challenge: Code blocks are too short

− The pipeline drain effect is nontrivial: as the ROB and issue queue empty at the end of a short block, the issue rate drops (not true for longer traces)

  • Solution: Consider a hypothetical next block

− Assume: next block issue rate ≈ current block issue rate (e.g., if the current block averages 2.0 IPC, the next block issues proportionally to 2.0 IPC)
− Use the power law* to further adjust the rate: a larger window issues more

*S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith, “A mechanistic performance model for superscalar out-of-order processors,” ACM TOCS, 2009
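A minimal sketch of this correction, assuming a power-law relation between effective window size and issue rate in the spirit of the mechanistic model cited above. The exponent, the window-fill parameter, and all numbers are illustrative assumptions, not the talk's calibrated values.

```python
# Sketch: avoid the artificial end-of-block drain by pretending a
# hypothetical next block keeps issuing at the current block's rate,
# scaled by a power law in the effective window size
# (issue rate ~ window^alpha, alpha an assumed exponent).

def block_cycles(instructions, avg_ipc, window_fill=1.0, alpha=0.5):
    """Estimate a block's cycles under the next-block assumption."""
    effective_ipc = avg_ipc * (window_fill ** alpha)
    return instructions / effective_ipc

# Full window: the block issues at its measured average rate
print(block_cycles(200, avg_ipc=2.0))                   # 100.0 cycles
# Quarter-full window: the power law scales the issue rate down
print(block_cycles(200, avg_ipc=2.0, window_fill=0.25))  # 200.0 cycles
```

The key point is that the tail of a short block is costed as if execution continued, rather than as a full pipeline drain, which would otherwise overstate every block's latency.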

SLIDES 21-24

StressRight: Multiple Performances

  • Challenge: Difficult to model every scenario

− The same code block can run under many cache-hit scenarios (90% $ hit, 50% $ hit, 30% $ hit, ...), each analysis yielding a different IPC

  • Solution: Mix template scenarios

− Randomly generate scenarios & mix them
− A few templates are enough

[Example: a 60%-hit template yields IPC 2.0 and a 40%-hit template yields IPC 1.6; mixing them approximates a 50%-hit scenario at IPC 1.8]
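The template-mixing step can be sketched as interpolation between the two analyzed templates that bracket the requested hit rate. The template hit rates and IPCs match the slide's example; the helper function itself is a hypothetical illustration.

```python
# Sketch: analyze a code block under a few template hit-rate
# scenarios, then mix (interpolate) templates for any other hit rate.

def mix_templates(templates, hit_rate):
    """Linearly interpolate IPC between the two templates
    bracketing the requested hit rate (in percent)."""
    points = sorted(templates.items())  # [(hit_rate, ipc), ...]
    for (h0, ipc0), (h1, ipc1) in zip(points, points[1:]):
        if h0 <= hit_rate <= h1:
            weight = (hit_rate - h0) / (h1 - h0)
            return ipc0 + weight * (ipc1 - ipc0)
    raise ValueError("hit rate outside the template range")

# Two templates per block are analyzed once (hit rate % -> IPC)
templates = {40: 1.6, 60: 2.0}
print(mix_templates(templates, 50))  # 1.8, as in the slide's example
```

This is why "a few templates are enough": each new scenario costs only an interpolation, not another full analysis of the block.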

SLIDE 25

StressRight: Rescheduling

  • Basic scheduling method

− Schedule each block to the earliest possible slot

  • Three rules

− Rule 1: Blocks from a thread execute serially
− Rule 2: Critical sections shouldn’t overlap
− Rule 3: Threads should wait for barriers
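An earliest-slot scheduler obeying the three rules can be sketched as a greedy pass over the block stream. The `Block` fields, the fixed per-block durations, and the two-core example are hypothetical simplifications of the scheduler described in the slides.

```python
# Sketch: greedily place each block in the earliest slot that
# satisfies Rule 1 (per-thread serialization), Rule 2 (critical
# sections never overlap), and Rule 3 (wait for the last barrier).

from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    thread: int
    cycles: int
    lock: Optional[int] = None     # Rule 2: critical-section variable ID
    after_barrier: bool = False    # Rule 3: must follow the barrier

def schedule(blocks, num_cores):
    core_free = [0] * num_cores    # when each core next becomes idle
    thread_free = {}               # Rule 1: per-thread completion times
    lock_free = {}                 # Rule 2: per-lock release times
    barrier_time = 0               # Rule 3: time of the last barrier_wait()
                                   # (taken from the trace; fixed at 0 here)
    placements = []
    for b in blocks:
        start = thread_free.get(b.thread, 0)
        if b.lock is not None:
            start = max(start, lock_free.get(b.lock, 0))
        if b.after_barrier:
            start = max(start, barrier_time)
        # earliest possible slot across cores
        core = min(range(num_cores), key=lambda c: max(core_free[c], start))
        start = max(start, core_free[core])
        end = start + b.cycles
        core_free[core] = end
        thread_free[b.thread] = end
        if b.lock is not None:
            lock_free[b.lock] = end
        placements.append((core, start, end))
    return placements

blocks = [Block(thread=1, cycles=4),
          Block(thread=1, cycles=3),          # Rule 1: serialized after block 0
          Block(thread=2, cycles=5, lock=0),
          Block(thread=1, cycles=2, lock=0)]  # Rule 2: waits for the lock
print(schedule(blocks, num_cores=2))
```

Each placement is a `(core, start, end)` triple; blocks that violate a rule are simply pushed to a later start time, which is where the idle gaps in the slides' diagrams come from.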

SLIDE 26

StressRight: Rescheduling

  • Rule 1: Blocks from a thread execute serially

− Tag code blocks with the executing thread ID
− Prohibit blocks from the same thread from running concurrently

[Diagram: two Thread-1 blocks that overlapped on Core 0 and Core 1 are serialized, leaving one core idle during the overlap]

SLIDE 27

StressRight: Rescheduling

  • Rule 2: Critical sections shouldn’t overlap

− Tag code blocks with the synchronization variable ID (if applicable)
− Prohibit the critical sections from overlapping

[Diagram: critical sections A of Thread 1 (Core 0) and Thread 2 (Core 1) overlapped; one section A is delayed, leaving its core idle]

SLIDE 28

StressRight: Rescheduling

  • Rule 3: Threads should wait for barriers

− Tag code blocks related to barrier operations (if applicable)
− Prohibit scheduling before the last barrier_wait()

[Diagram: Thread 1's post-barrier block is delayed until Thread 2's last barrier_wait() completes, leaving Core 1 idle]

SLIDE 29

Index

  • Introduction / Motivation
  • Limitations
  • Proposed idea: StressRight
  • Evaluation
  • Conclusion

SLIDE 30

Evaluation

  • Quantitative analysis

− Why StressRight would work well

  • Accuracy and speed

− Comparison with cycle-level simulation (MARSSx86)
− Model 1 / 12 / 16 OoO x86 cores
− SPEC, PARSEC, memcached

  • Implementation

− Emulation: QEMU; reconstruction models: Python; functional simulators: C++

SLIDE 31

Quantitative Analysis

  • Efficiency of the method

− # instructions: full execution vs. unique code blocks
− Orders-of-magnitude reduction in the analysis load

*mcd: memcached, BS: blackscholes, BT: bodytrack, SW: swaptions, DD: dedup

SLIDE 32

Quantitative Analysis

  • Accuracy of the dynamic resource models

− Functional simulations are accurate enough

[Chart: functional vs. cycle-level memory simulation]

SLIDE 34

Accuracy: SPEC

  • Validating the pipeline model

− Correctly estimates the first-order performance

  • Improvement in progress: a better memory model
SLIDE 35

Accuracy: PARSEC

  • Validating the scheduler

− Correctly estimates the scaling behavior

  • Improvement in progress: barrier synchronizations

*We model a 12-core system

SLIDE 36

Accuracy: memcached

  • Reconstructing the throughput-latency curve

− StressRight greatly improves over the existing methods

*We model a 16-core system; 8 cores host the server and 8 cores run the load generator

SLIDE 37

Speed Evaluation

  • An order of magnitude faster than cycle-level simulation

− The main bottleneck is cache simulation

*Reconstruction uses 40 vCPUs

SLIDE 38

Conclusion

  • Motivation

− Stress in-development systems with actual usage scenarios to obtain correct insights

  • Key ideas

− Focus only on unique behavior
− Consider execution dynamics: cache ($), branch, and scheduling

  • Results

− Accurately reconstruct workload-reported performance metrics at an order-of-magnitude faster speed

SLIDE 39

Thank you!

Jaewon Lee (spiegel0@postech.ac.kr)
Pohang University of Science and Technology (POSTECH)