
RPPM: Rapid Performance Prediction of Multithreaded Workloads on Multicore Processors
Sander De Pestel*, Sam Van den Steen*, Shoaib Akram, Lieven Eeckhout
*Intel, Belgium; Ghent University
ISPASS, March 25-26, 2019, Madison, Wisconsin

Analytical Modeling
Key features
– Super fast
– Useful complement to simulation
– Quickly explore large design spaces in early design stages
Three types:
– Empirical modeling
  • Black-box model, easy to build, needs training examples
– Mechanistic modeling → this paper
  • White-box model, insight
– Hybrid modeling
  • Parameter fitting of semi-mechanistic model, needs training

Prior Work in Mechanistic Modeling
[Figure: effective dispatch rate over time, divided into intervals 1, 2 and 3 by a branch misprediction, an I-cache miss and a long-latency load miss]
Interval analysis for OOO cores
– Michaud [PACT’99], Karkhanis [ISCA’04], Eyerman [TOCS’09]
Microarchitecture-independent model
– Van den Steen [ISPASS’15]
Limited to single-core processors

Prior Work in Multicore Models
Amdahl’s Law: high abstraction
– Hill/Marty [Computer’08]
Hybrid models:
– Popov [IPDPS’15]: Amdahl’s Law + simulation
Multi-programmed workloads: no inter-thread communication nor synchronization
– Jongerius [TC’18]
Machine learning: empirical, black-box model
– Ipek [ASPLOS’06], Lee [MICRO’08]
This work: multicore, multithreaded, mechanistic (white-box), microarchitecture-independent profile

Paper Contribution
Microarchitecture-independent mechanistic performance model for multithreaded workloads on multicore processors
[Diagram: the uarch-indep profile of the multi-threaded app (per-thread characteristics, inter-thread interactions; one-time cost) and a multicore config feed into RPPM, which predicts performance (super fast: ~sec/min)]
Current limitation: same number of threads in profiling vs. prediction

Single-Threaded Interval Model
[Equation figure: total cycle count, with the following annotations]
– uarch-indep branch predictor model [De Pestel, ISPASS’15]
– Miss rates predicted using StatStack [Eklov and Hagersten, ISPASS’10]
– uarch-indep MLP model [Van den Steen, CAL’18]
– N = dynamic instruction count
– D_eff = effective dispatch rate; a function of ILP, I-mix, ALU contention [Van den Steen et al., ISPASS’15]
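As a rough illustration of how the listed components could combine into a total cycle count, the sketch below adds a base term N / D_eff to penalty terms for branch mispredictions, I-cache misses and long-latency loads (the latter divided by MLP). The slide's actual equation is not reproduced here: the additive structure, the function name and every parameter name and value are assumptions in the spirit of interval analysis, not the authors' formula.

```python
from dataclasses import dataclass


@dataclass
class IntervalInputs:
    """Hypothetical inputs; names and units are illustrative assumptions."""
    n_instructions: int         # N: dynamic instruction count
    d_eff: float                # D_eff: effective dispatch rate (instr/cycle)
    branch_mispredictions: int  # predicted number of mispredicted branches
    branch_penalty: float       # cycles lost per misprediction
    icache_misses: int          # predicted number of I-cache misses
    icache_miss_latency: float  # cycles lost per I-cache miss
    llc_load_misses: int        # predicted number of long-latency load misses
    memory_latency: float       # cycles per memory access
    mlp: float                  # memory-level parallelism (overlapping misses)


def predict_cycles(p: IntervalInputs) -> float:
    """Base dispatch time plus one penalty term per miss-event type."""
    base = p.n_instructions / p.d_eff
    branch = p.branch_mispredictions * p.branch_penalty
    icache = p.icache_misses * p.icache_miss_latency
    memory = p.llc_load_misses * p.memory_latency / p.mlp
    return base + branch + icache + memory


# Made-up numbers, purely to exercise the function.
example = IntervalInputs(
    n_instructions=1_000_000, d_eff=2.0,
    branch_mispredictions=1000, branch_penalty=15.0,
    icache_misses=500, icache_miss_latency=10.0,
    llc_load_misses=2000, memory_latency=200.0, mlp=2.0,
)
print(predict_cycles(example))
```

In this reading, every input except the latencies and penalties would come from the microarchitecture-independent profile combined with the target configuration.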

Naïve Extensions
Apply single-threaded model [Van den Steen, ISPASS’15] to
– Main thread: avg 45% error
– Critical thread: avg 28% error
[Figure: main thread and worker threads meeting at a barrier]
Fails to model
– Synchronization
– Coherence effects
– Resource contention

Modeling Multithreaded Performance is Fundamentally Difficult
Need super accurate per-thread performance prediction
– Accumulating errors because of synchronization
Need to accurately model
– Inter-thread synchronization
  • Barriers, critical sections, producer/consumer, etc.
– Inter-thread communication
  • Cache coherence
– Inter-thread interference
  • Shared resources (e.g., LLC)

Accumulating Random Errors
Predicting single-thread performance
– Random errors across short intervals cancel out
– Systematic errors (obviously) don’t

Accumulating Errors (cont’d)
Predicting multithreaded performance between barriers
– Random errors do not cancel out: the slowest thread determines the time of each synchronization epoch

Problem Worsens with Thread Count
Synthetic barrier-synchronized loop w/ 1M iterations and fixed work per iteration
[Figure: prediction error versus number of threads, for different random errors per synchronization epoch]
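The effect can be reproduced with a small Monte Carlo sketch. It assumes a barrier-synchronized loop in which every thread truly needs one time unit per epoch, and a predictor whose per-thread, per-epoch error is zero-mean and uniformly distributed; the error distribution, the function name and the reduced iteration count (20,000 rather than 1M, to keep it fast) are assumptions, not the setup used for the slide.

```python
import random


def predicted_error(num_threads, num_epochs, max_rel_error, seed=0):
    """Relative error of the predicted total time of a barrier loop.

    True time per epoch is 1.0 for every thread. The predicted epoch time
    is the maximum over the (noisy) per-thread predictions, because the
    slowest thread determines when the barrier is released.
    """
    rng = random.Random(seed)
    true_total = float(num_epochs)
    predicted_total = 0.0
    for _ in range(num_epochs):
        per_thread = [
            1.0 + rng.uniform(-max_rel_error, max_rel_error)
            for _ in range(num_threads)
        ]
        predicted_total += max(per_thread)
    return (predicted_total - true_total) / true_total


err_1 = predicted_error(num_threads=1, num_epochs=20_000, max_rel_error=0.1)
err_4 = predicted_error(num_threads=4, num_epochs=20_000, max_rel_error=0.1)
err_16 = predicted_error(num_threads=16, num_epochs=20_000, max_rel_error=0.1)
print(err_1, err_4, err_16)
```

With one thread the zero-mean errors average out; with more threads the maximum of several noisy predictions is biased upward, and the bias grows with the thread count.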

RPPM Model
Profiling (Pin-based; measured per synchronization epoch)
– Per-thread characteristics: Van den Steen [ISPASS’15]
– Synchronization
– Shared memory accesses
Prediction
– Predict per-thread performance per synchronization epoch
– Predict impact of synchronization

Profiling Synchronization
Intercept library function calls in Pin
– pthread and OpenMP
– Automatic
For example
– Critical sections (pthread): pthread_mutex_lock, pthread_mutex_unlock
– Barriers (OpenMP): gomp_team_barrier_wait (gomp_barrier_t)
User-level synchronization: annotate manually

Condition Variables
Barrier using condition variables: insert marker (function is not always called)
Similar solution for producer-consumer, semaphores, etc.
Too cumbersome? No!
– 4 Parsec benchmarks: pthread_cond_wait
– facesim: pthread_cond_wait and pthread_cond_broadcast

Shared Memory Behavior
Cold misses: first reference
Conflict/capacity misses: StatStack [Eklov ISPASS’10]
Example reference stream: A B C D A A B D D C A A B E F C A B
– Reuse distance = no. references between two accesses to the same address = 5
– Stack distance = no. unique references between them = 3
Cache miss rate prediction for LRU cache
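The two distances can be computed exactly for the example stream with the short sketch below. Note that StatStack estimates stack distances statistically from sampled reuse distances; the exact, quadratic-time computation here is a stand-in for illustration only, and the function names are hypothetical.

```python
def reuse_and_stack_distances(trace):
    """Return one (reuse_distance, stack_distance) pair per access.

    reuse distance: number of references since the previous access to the
    same address; stack distance: number of unique addresses among them.
    First references (cold misses) yield (None, None).
    """
    last_seen = {}
    result = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:i]
            result.append((len(between), len(set(between))))
        else:
            result.append((None, None))
        last_seen[addr] = i
    return result


def lru_misses(trace, cache_lines):
    """Misses of a fully associative LRU cache holding `cache_lines` lines.

    An access hits only if fewer than `cache_lines` unique addresses were
    touched since the previous access to the same address.
    """
    misses = 0
    for _, stack in reuse_and_stack_distances(trace):
        if stack is None or stack >= cache_lines:
            misses += 1
    return misses


trace = "ABCDAABDDCAABEFCAB"
distances = reuse_and_stack_distances(trace)
print(distances[12])         # third access to B (0-based position 12)
print(lru_misses(trace, 4))  # misses for a 4-line LRU cache
```

The third access to B has D D C A A between it and the previous B, giving the pair of 5 references and 3 unique references quoted on the slide.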

Shared Memory Behavior cont’d
[Figure annotations]
– Per-thread reuse distance: used for modeling private L1/L2 caches
– Global reuse distance: used for modeling shared LLC
– Larger reuse distance → possibly negative interference
– Shorter reuse distance → positive interference
– If a write → write invalidation for D (infinite reuse distance)
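One possible reading of these annotations is sketched below for an interleaved multithreaded trace. It computes, per access, a per-thread reuse distance (counting only the same thread's accesses, and set to infinity after a write by another thread) and a global reuse distance (counting all threads' accesses). The trace format, the function name and the invalidation rule are simplifying assumptions, not RPPM's actual profiler.

```python
import math


def shared_reuse_distances(trace):
    """trace: list of (thread_id, address, is_write) in global order.

    Returns one (per_thread_distance, global_distance) pair per access.
    math.inf marks a cold access or, for the per-thread distance, a copy
    invalidated by another thread's write.
    """
    last_global = {}    # address -> global index of last access
    last_private = {}   # (thread, address) -> per-thread access count then
    per_thread_count = {}
    invalidated = set()  # (thread, address) pairs hit by a remote write
    out = []
    for i, (tid, addr, is_write) in enumerate(trace):
        n = per_thread_count.get(tid, 0)
        key = (tid, addr)
        if key not in last_private or key in invalidated:
            private = math.inf
        else:
            private = n - last_private[key] - 1
        if addr in last_global:
            shared = i - last_global[addr] - 1
        else:
            shared = math.inf
        out.append((private, shared))
        last_private[key] = n
        invalidated.discard(key)
        per_thread_count[tid] = n + 1
        last_global[addr] = i
        if is_write:
            # A write invalidates the copies held by all other threads.
            for other in list(last_private):
                if other[1] == addr and other[0] != tid:
                    invalidated.add(other)
    return out


example_trace = [
    (0, "A", False),
    (1, "B", False),
    (1, "C", False),
    (0, "A", False),  # global distance larger than per-thread distance
    (1, "A", False),  # cold for thread 1, but recently touched globally
    (1, "A", True),   # write: invalidates thread 0's copy of A
    (0, "A", False),  # per-thread distance becomes infinite
]
for pair in shared_reuse_distances(example_trace):
    print(pair)
```

The fourth access shows the interleaving effect that lengthens the global distance, the fifth shows another thread's recent access shortening it, and the last shows the write-invalidation case.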

Prediction: Per-epoch active execution time
Single-threaded model
• To predict active execution time per synchronization epoch
• Miss rates account for interference and coherence

Prediction: Synchronization overhead
Symbolic execution: fastest to slowest thread
– Fastest thread(s) experience(s) idle time
– Slowest thread determines execution time
Critical sections, barriers, condition variables, thread create/join, etc.
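For the barrier case only, the idea can be sketched as follows: given predicted active times per thread and per epoch, each epoch lasts as long as its slowest thread and the other threads accumulate idle time. Critical sections, condition variables and thread create/join are not covered, and the function name and input layout are assumptions.

```python
def simulate_barriers(active_times):
    """active_times[e][t]: predicted active cycles of thread t in epoch e.

    Every epoch ends at a barrier, so its duration is the maximum active
    time over all threads. Returns (total_cycles, idle_cycles_per_thread).
    """
    total = 0.0
    idle = [0.0] * len(active_times[0])
    for epoch in active_times:
        slowest = max(epoch)
        total += slowest
        for t, cycles in enumerate(epoch):
            idle[t] += slowest - cycles
    return total, idle


# Made-up per-epoch predictions for three threads and two epochs.
epochs = [
    [100, 80, 90],
    [50, 70, 60],
]
total, idle = simulate_barriers(epochs)
print(total, idle)
```

In this made-up example each thread is active for 150 cycles overall, so summing any single thread (the main thread or the critical thread) gives 150, whereas the per-epoch view gives 170 because a different thread is slowest in each epoch.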

Experimental Evaluation
Rodinia (OpenMP)
– Barrier synchronization
Parsec (pthread)
Simulator: HW-validated x86 Sniper, quad-core 4-wide OOO [Carlson TACO’14]

MAIN (models main thread): reasonable accuracy for Rodinia but highly inaccurate for Parsec
[Figure: prediction error per benchmark for Rodinia and Parsec; summary: 45% avg error]

CRIT (models critical thread): more accurate for Parsec
[Figure: prediction error per benchmark for Rodinia and Parsec; summary: 28% avg error]

RPPM (models critical thread per sync-epoch): 11% avg error versus MAIN (45%) and CRIT (28%)
[Figure: prediction error per benchmark for Rodinia and Parsec; summary: 11% avg error]

Design Space Exploration
Which is the best performing 10-GOPS processor?

                 smallest  small  base  big  biggest
frequency (GHz)  5.0       3.33   2.5   2.0  1.66
width            2         3      4     5    6

Hybrid exploration strategy:
– Use RPPM to predict optimum design
– Simulate designs within 5% of predicted optimum
Identify true optimum for all but one benchmark
– RPPM predicts optimum for vast majority of benchmarks
– A handful of benchmarks need two simulation runs
– pathfinder: within 2% of true optimum
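The hybrid strategy can be sketched as a two-step filter: rank the designs by predicted execution time, then simulate only those within 5% of the predicted best and keep the best simulated one. The design points come from the table; the predicted and simulated execution times below are invented placeholders, and `hybrid_explore` and `simulate` are hypothetical names.

```python
# Design points from the slide: name -> (frequency in GHz, dispatch width).
DESIGNS = {
    "smallest": (5.0, 2),
    "small": (3.33, 3),
    "base": (2.5, 4),
    "big": (2.0, 5),
    "biggest": (1.66, 6),
}


def peak_gops(name):
    """Peak throughput: frequency times width, roughly 10 GOPS for all."""
    freq, width = DESIGNS[name]
    return freq * width


def hybrid_explore(predicted, simulate, tolerance=0.05):
    """Simulate only designs predicted to be within `tolerance` of the best.

    predicted: design name -> predicted execution time (lower is better).
    simulate: callable returning the simulated execution time of a design.
    Returns (best simulated design, list of designs that were simulated).
    """
    best_predicted = min(predicted.values())
    candidates = [
        name for name, time in predicted.items()
        if time <= best_predicted * (1 + tolerance)
    ]
    simulated = {name: simulate(name) for name in candidates}
    best = min(simulated, key=simulated.get)
    return best, candidates


# Placeholder numbers, normalized execution times (not from the paper).
predicted_times = {"smallest": 1.30, "small": 1.10, "base": 1.00,
                   "big": 1.03, "biggest": 1.20}
simulated_times = {"smallest": 1.28, "small": 1.12, "base": 1.02,
                   "big": 0.99, "biggest": 1.18}

best, simulated = hybrid_explore(predicted_times, simulated_times.get)
print(best, simulated)
```

In this invented case the model's first choice is not the true optimum, but the true optimum falls inside the 5% band, so two simulation runs are enough to find it.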

Bottlegraphs: Visualizing a thread’s criticality and parallelism [Du Bois, OOPSLA’13]
– A thread’s criticality: share in total execution time
– A thread’s parallelism: no. parallel threads when active
[Figure: bottlegraphs from simulation and from RPPM]
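A minimal sketch of how the two quantities could be derived from a timeline of active threads is given below. It assumes that each interval's duration is split equally among the threads active in it (the criticality share), and that parallelism is a thread's total active time divided by its share; this follows the usual description of bottlegraphs but should be checked against the cited paper, and the input format is hypothetical.

```python
def bottlegraph(intervals):
    """intervals: list of (duration, set of active thread ids).

    Returns thread id -> (criticality_share, parallelism), where the
    shares of all threads add up to the total execution time.
    """
    share = {}
    active = {}
    for duration, threads in intervals:
        for t in threads:
            share[t] = share.get(t, 0.0) + duration / len(threads)
            active[t] = active.get(t, 0.0) + duration
    return {t: (share[t], active[t] / share[t]) for t in share}


# Thread 0 runs alone for 10 time units, then three threads run together.
timeline = [(10, {0}), (30, {0, 1, 2})]
boxes = bottlegraph(timeline)
for thread, (height, width) in sorted(boxes.items()):
    print(thread, height, width)
```

Here thread 0 gets a taller and narrower box than the workers, which is the shape expected for a main thread that also performs sequential work.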

Bottlegraphs: Balanced workloads
Main thread distributes and co-works with worker threads

Bottlegraphs: Imbalanced workloads
– facesim: main thread performs slightly more work
– freqmine: main thread is bottleneck (but does parallel work)

Bottlegraphs: Highly imbalanced workloads
Main thread does not perform any parallel work

Conclusions
Microarchitecture-independent mechanistic performance model for multithreaded workloads on multicore processors
– Accumulating random errors
– Inter-thread synchronization, communication, interference
Evaluation against simulation: 11% avg error versus MAIN (45%) and CRIT (28%)
Use cases
– Design space exploration
– Workload characterization
Future work: predict across thread counts
– Predict Y-thread performance from X-thread profile (Y>X)
– Predict Y-thread performance on X-core system (Y>X)

