
Power-Aware Predictive Models of Hybrid (MPI/OpenMP) Scientific Applications on Multicore Systems
Charles Lively III*, Xingfu Wu*, Valerie Taylor*, Shirley Moore+, Hung-Ching Chang^, Chun-Yi Su^, and Kirk Cameron^
*Department of Computer Science & Engineering, Texas A&M University
+Electrical Engineering and Computer Science, University of Tennessee-Knoxville
^Department of Computer Science, Virginia Tech
7-9 Sept 2011 EnAHPC 2011

Introduction
• Current trends in HPC put great focus on constraining power consumption without decreasing performance.
• Multicore systems are hierarchical and can consist of heterogeneous components.
• Understanding the mapping of scientific applications onto multicore and heterogeneous systems is necessary to optimize performance and power consumption.
• Goal: Accurate models for performance and power consumption of scientific applications on multicore and heterogeneous systems.

Approach and Research Questions
• Application-specific models are used to explore common and different characteristics of hybrid (MPI+OpenMP) scientific applications.
1. Which combination of performance counters should be used to model performance and power consumption of each component?
   – System, CPU, memory
2. Which application and system characteristics most affect runtime and power consumption?
3. Which aspects of hybrid applications and systems need to be optimized to improve power-performance on multicore systems?

General Methodology
• Explore which application characteristics (via performance counters) affect the power consumption of the system, CPU, and memory.
• Develop accurate models based on hardware counters for predicting the power consumption of system components.
• Develop different models for each application class (previous work used the same set of performance counters across all applications).
• Validate predictions using actual power measurements.

MuMMI Framework
Multiple Metrics Modeling Infrastructure (MuMMI)
http://www.mummi.org/

SystemG
• Largest power-aware compute system in the world
• Over 30 power and thermal sensors per node
• http://scape.cs.vt.edu/

Modeling Methodology
• Training set: 5 training execution configurations
  – 1x1, 1x2, 1x3, 1x8, and 2x8
• 16 larger execution configurations are predicted.
  – 1x4, 1x5, … 3x8, 4x8, 5x8, … 16x8
• 40 performance counter events are captured.
• Performance counter events are normalized per cycle.
• A performance-tuned supervised principal component analysis (PCA) method is used to select the combination of performance counters for each application.
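
The per-cycle normalization mentioned on this slide can be sketched as follows. This is only an illustration, not the authors' code: it assumes raw counts arrive as a dictionary keyed by event name, that the cycle count is recorded under PAPI_TOT_CYC, and the sample values are invented.

```python
def normalize_per_cycle(raw_counts, cycle_event="PAPI_TOT_CYC"):
    """Return events-per-cycle rates for every counter except the cycle counter.

    The dictionary layout and the choice of cycle_event are assumptions
    made for this sketch.
    """
    cycles = raw_counts[cycle_event]
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return {
        name: count / cycles
        for name, count in raw_counts.items()
        if name != cycle_event
    }


# Hypothetical raw counts from a single run (illustrative values only).
sample = {
    "PAPI_TOT_CYC": 2_000_000,
    "PAPI_TOT_INS": 3_000_000,
    "PAPI_L1_DCM": 50_000,
}
rates = normalize_per_cycle(sample)
print(rates)
```

Working with rates rather than raw counts keeps runs of different lengths comparable, which matters when small training configurations are used to predict larger ones.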

Performance-Tuned Supervised PCA
1. Compute Spearman’s rank correlation for each application and system component.
2. Eliminate counters with low correlation.
3. Compute a regression model based upon performance counter event rates.
4. Eliminate performance counters with negligible regression coefficients.
5. Compute principal components of the reduced performance counter space.
6. Use the performance counters with the highest PCA coefficients to build a multivariate linear regression model.
Repeat the process for each application/system component pair.

Performance-Tuned Supervised PCA
1. Compute Spearman’s rank correlation.
2. Eliminate counters with low correlation, based on the $\beta_{ai}$ threshold.
Example: BT-MZ correlation values for runtime
Hardware Counter    Correlation Value
PAPI_TOT_INS        0.9187018
PAPI_FP_OPS         0.9105984
PAPI_L1_TCA         0.9017512
PAPI_L1_DCM         0.8718455
PAPI_L2_TCH         0.8123510
PAPI_L2_TCA         0.8021892
Cache_FLD           0.7511682
PAPI_TLB_DM         0.6218268
PAPI_L1_ICA         0.6487321
Bytes_out           0.6187535
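
Steps 1 and 2 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the counter names, the measurements, and the 0.6 cutoff are all hypothetical, and comparing the absolute correlation against the threshold is a choice made here, not something the slide specifies.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no rank information
    return cov / (sx * sy)


def select_by_correlation(rates_by_counter, target, threshold):
    """Keep counters whose absolute correlation with the target meets the threshold."""
    kept = {}
    for name, rates in rates_by_counter.items():
        rho = spearman(rates, target)
        if abs(rho) >= threshold:
            kept[name] = rho
    return kept


# Hypothetical measurements over five runs (illustrative values only).
runtime = [10, 20, 30, 40, 50]
counters = {
    "counter_a": [1, 2, 3, 4, 5],
    "counter_b": [3, 1, 4, 2, 5],
}
kept = select_by_correlation(counters, runtime, threshold=0.6)
print(kept)
```

With these made-up inputs only counter_a survives the cut, since counter_b is only moderately rank-correlated with the runtime.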

Performance-Tuned Supervised PCA
3. Compute a regression model based upon counter event rates.
4. Eliminate counters with negligible regression coefficients.
Hardware Counter    Regression Coefficient    Retained after step 4
PAPI_TOT_INS         0.04183                  yes
PAPI_FP_OPS         -0.04219                  yes
PAPI_L1_TCA          0.00165                  yes
PAPI_L1_DCM          0.000179                 no
PAPI_L2_TCH          0.01875                  yes
PAPI_L2_TCA          0.100187                 yes
Cache_FLD           -0.71548                  yes
PAPI_TLB_DM          0.008418                 yes
PAPI_L1_ICA         -0.000048                 no
Bytes_out            0.00085                  no
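
Step 4 amounts to a magnitude filter on the fitted coefficients. The sketch below applies one to the coefficient values listed on this slide. The cutoff of 0.001 is an assumption chosen for illustration; the slide does not state the threshold that was actually used.

```python
# Regression coefficients as listed on the slide (BT-MZ example).
coefficients = {
    "PAPI_TOT_INS": 0.04183,
    "PAPI_FP_OPS": -0.04219,
    "PAPI_L1_TCA": 0.00165,
    "PAPI_L1_DCM": 0.000179,
    "PAPI_L2_TCH": 0.01875,
    "PAPI_L2_TCA": 0.100187,
    "Cache_FLD": -0.71548,
    "PAPI_TLB_DM": 0.008418,
    "PAPI_L1_ICA": -0.000048,
    "Bytes_out": 0.00085,
}


def drop_negligible(coeffs, min_magnitude):
    """Keep only counters whose coefficient magnitude reaches min_magnitude."""
    return {name: c for name, c in coeffs.items() if abs(c) >= min_magnitude}


# 0.001 is a hypothetical cutoff, not a value taken from the slides.
kept = drop_negligible(coefficients, min_magnitude=0.001)
dropped = sorted(set(coefficients) - set(kept))
print(sorted(kept))
print(dropped)
```

Under this assumed cutoff, the dropped counters are PAPI_L1_DCM, PAPI_L1_ICA, and Bytes_out, which appears to match the reduced list shown on the slide.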

Performance-Tuned Supervised PCA
5. Compute principal components of the reduced performance counter space.
   – Determine the variance of each principal component.
   – Use the principal components containing at least 90% of the data variance.
     • Typically the first 2 principal components
   – Select counters with significant PCA coefficients.
6. Use the performance counters with the highest PCA coefficients to build a multivariate linear regression model:
   $y = \beta_0 + \beta_1 r_1 + \beta_2 r_2 + \beta_3 r_3 + \cdots + \beta_n r_n$
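
Two pieces of these steps are simple enough to sketch directly: choosing how many principal components are needed to reach 90% of the variance, and evaluating the final linear model for a set of counter rates. The sketch does not perform the PCA itself; it assumes the per-component variances are already available, and every number and counter name in it is invented for illustration.

```python
def components_for_variance(variances, target=0.90):
    """Return the smallest k such that the first k components reach the target share.

    variances must be sorted from largest to smallest, as PCA output usually is.
    """
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    running = 0.0
    for k, variance in enumerate(variances, start=1):
        running += variance
        if running / total >= target:
            return k
    return len(variances)


def predict(intercept, coeffs, rates):
    """Evaluate y = beta_0 + sum(beta_i * r_i) over the selected counters."""
    return intercept + sum(beta * rates[name] for name, beta in coeffs.items())


# Hypothetical component variances (illustrative values only).
k = components_for_variance([6.1, 2.3, 0.4, 0.2])
print(k)

# Hypothetical coefficients and per-cycle rates for two counters.
y = predict(1.0, {"counter_a": 2.0, "counter_b": -0.5},
            {"counter_a": 0.5, "counter_b": 2.0})
print(y)
```

In this made-up case two components already cover more than 90% of the variance, consistent with the slide's remark that the first two components are typically enough.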

Performance Counter Events
• 15 performance counters are used in this work.

Applications
• NAS Multizone Benchmark Suite
  – Written in Fortran
  – Uses MPI and OpenMP for communication
  – Block Tri-diagonal algorithm (BT-MZ)
    • Represents a realistic performance case for exploring discretization meshes in parallel computing
  – Scalar Penta-diagonal algorithm (SP-MZ)
    • Representative of a balanced workload
  – Lower-Upper symmetric Gauss-Seidel algorithm (LU-MZ)
    • Coarse-grain parallelism of LU-MZ is limited to 16 MPI processes
• Large-Scale Scientific Application
  – Gyrokinetic Toroidal code (GTC)
    • 3D particle-in-cell application
    • Flagship SciDAC fusion microturbulence code
    • Written in Fortran90
    • Uses MPI and OpenMP for communication

BT-MZ Results

SP-MZ Results

LU-MZ Results

GTC Results

Application-specific Modeling
• Multivariate regression coefficients

Overall Prediction Accuracy

Related Work
• SoftPower: Power Estimations (Lim, Porterfield, & Fowler)
  – Goal: Develop a surrogate power estimation model using performance counters on the Intel Core i7
  – Use Spearman’s rank correlation and robust regression analysis on training runs to derive a small set of counters and correlation coefficients
  – Evaluation shows less than 14% error (median 5.3% error)
• Power Estimation & Thread Scheduling (Singh, Bhadauria, & McKee)
  – Goal: Use a hardware counter model to predict power consumption on a system
  – Use Spearman’s rank correlation to choose the top counter from each of four categories: FP, memory, stalls, instructions retired
  – Derive a piecewise linear function for estimating core power
• Reducing Energy Usage with Memory & Computation-Aware Dynamic Frequency Scaling (Laurenzano, Meswani, Carrington, Snavely, Tikir, & Poole)
  – Application signatures characterize execution regions
  – Signatures are matched with a set of benchmarks intended to form a covering set (machine characterization of expected power consumption over the space of execution patterns and clock frequencies)
  – Derive a dynamic application frequency management strategy

Conclusions
• Predictive performance models for hybrid MPI+OpenMP scientific applications:
  – Execution time
  – System power consumption
  – CPU power consumption
  – Memory power consumption
• 95+% accuracy across four hybrid (MPI+OpenMP) scientific applications

Future Work
• Explore the use of microbenchmarks and application classes to derive application-centric models
• Finer-granularity analysis of large-scale hybrid scientific applications
  – Does the set of hardware counters and coefficients vary with application region?
• Modeling and prediction across different application input sizes and frequency settings
  – Can hardware counter measurements drive a dynamic frequency scaling strategy?

Acknowledgments
• This work is supported by NSF grants CNS-0911023, CNS-0910899, CNS-0910784, and CNS-0905187.
• The authors would like to acknowledge Stephane Ethier from Princeton Plasma Physics Laboratory for providing the GTC code.

Questions?
