Using MuMMI to Model and Optimize Energy and Performance
Xingfu Wu and Valerie Taylor
Texas A&M University
Scalable Tools Workshop 2015, Lake Tahoe, CA
August 3, 2015
http://www.mummi.org

Outline n Recent Development of MuMMI n Power Measurement Tools n Hardware Performance Counter Tools n Performance Counter-based Modeling n Performance Counter-guided Optimization http://www.mummi.org

MuMMI (Multiple Metrics Modeling Infrastructure) Project
[Figure: MuMMI architecture. Labeled components: Application, E-AMOM, PAPI, PowerPack, EMON API, Database, and Multicore/Heterogeneous System for Execution.]

MuMMI Database Schema
[Figure: database schema diagram. Entity names include Application, Run, Systems, Compilers, Library, Modules, Module_Info, Functions, Function_Info, Model_Info, Sys_Comp, and Sys_Comm, along with performance, power, and counter tables.]

Data Collection: MAIDE System
[Figure: data-collection workflow. Source code -> MAIDE -> instrumented source code -> compiler -> instrumented executable -> performance, hardware counter, power, and energy data -> SOAP server with Perl script -> MuMMI database. Other labels: Call Graph; Power and HW Counters.]

Power Measurement Tools
- IBM EMON API on BlueGene P/Q (MonEQ)
- Intel RAPL
- NVIDIA MLPM
- PowerMon2 (RENCI)
- PowerPack (VT)

PowerPack Schema (Virginia Tech)
Power sampling frequency: 1 sample per second
http://scape.cs.vt.edu/software/powerpack-2-0/

PowerPack
[Figure]

IBM EMON API
Power sampling frequency: ~2 samples per second
Source: IBM

IBM EMON API
[Figure]

IBM EMON API
Power per node card for GTC on 16384 nodes of ANL BGQ Mira
[Figure: power (W, 0 to 2000) vs. time (seconds, 0 to 100). Series: Node_Card, Chip_Core, DRAM, Network, SRAM, Optics, PCIexpress, Link_Chip_Core.]

IBM EMON API
Average Power per node for GTC on 16384 nodes of ANL BGQ Mira
[Figure: power (W, 0 to 60) vs. time (seconds, 0 to 100). Series: Node Power, CPU Power, Memory Power, Network.]
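Power traces like the ones above are what the later energy comparisons are built from. As a minimal sketch (not the MuMMI or EMON implementation), energy can be approximated by integrating the sampled power over time with the trapezoidal rule. The function name `energy_joules` and the sample values are hypothetical; only the ~2 samples per second rate comes from the slides.

```python
def energy_joules(times_s, power_w):
    """Approximate energy (J) as the trapezoidal integral of power (W) over time (s)."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of one trapezoid: mean of the two power samples times the interval.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical node-power trace sampled every 0.5 s (about 2 samples per second).
times = [0.0, 0.5, 1.0, 1.5, 2.0]
power = [50.0, 52.0, 54.0, 52.0, 50.0]
energy = energy_joules(times, power)
print(f"{energy:.1f} J over {times[-1] - times[0]:.1f} s")
```

With a coarser sampler such as PowerPack at 1 sample per second, the same function applies; only the spacing of `times` changes.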

Performance Counter Tools
- Perf_events (Linux)
- HPM (IBM)
- perfmon (Linux)
- PAPI (UTK)

Overview: Performance Counter-based Modeling
[Figure: modeling workflow. PAPI collects application- or function-level performance counters ($C_i$) from an HPC application, together with application- or function-level runtime and power (CPU, memory). Spearman correlation and PCA are applied to the counters, and a regression model $f(C_1, C_2, \ldots, C_n)$ is built for each metric. The models give predicted runtime and predicted power (node, CPU, memory), which lead to recommendations for improvements.]
Four metrics: runtime, node power, CPU power, memory power

Four Models of Parallel eq3dyna on SystemG
[Figure]

Four Models of Parallel eq3dyna on ANL Mira
[Figure]

Prediction Error Rates on ANL Mira
[Figure]

Overview: Counter-Guided Optimizations
Models:
- Runtime: $f_1(C_{11}, C_{12}, \ldots, C_{1n})$
- System Power: $f_2(C_{21}, C_{22}, \ldots, C_{2m})$
- CPU Power: $f_3(C_{31}, C_{32}, \ldots, C_{3s})$
- Memory Power: $f_4(C_{41}, C_{42}, \ldots, C_{4r})$
Steps:
1. Rank the counters of each model based on coefficient percentage: $\mathrm{Rank}(C_{11}, \ldots, C_{1n})$, $\mathrm{Rank}(C_{21}, \ldots, C_{2m})$, $\mathrm{Rank}(C_{31}, \ldots, C_{3s})$, $\mathrm{Rank}(C_{41}, \ldots, C_{4r})$
2. Keep the counters with percentage > 1%, ranked from the highest to the lowest: $C_1, C_2, \ldots, C_k$
3. Apply pair-wise Spearman correlation analysis
4. Final counters, ranked from the highest to the lowest: $C_1, C_2, \ldots, C_j$ ($j < k$)
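The slides do not define "coefficient percentage". One plausible reading, offered only as an assumption, is each counter's absolute regression coefficient as a share of the sum of absolute coefficients in that model. The sketch below ranks counters under that reading; the function name and the coefficient values are hypothetical, and it presumes the counters were normalized so that coefficients are comparable.

```python
def coefficient_percentages(coeffs):
    """Rank counters by |coefficient| as a percentage of the model's total.

    Assumption: percentage_i = 100 * |b_i| / sum_j |b_j|. The slides do not
    state the exact definition.
    """
    total = sum(abs(c) for c in coeffs.values())
    if total == 0:
        raise ValueError("all coefficients are zero")
    pct = {name: 100.0 * abs(c) / total for name, c in coeffs.items()}
    # Highest percentage first, matching the slide's "highest to lowest" order.
    return sorted(pct.items(), key=lambda item: item[1], reverse=True)


# Hypothetical coefficients of one fitted model (not taken from the slides).
model_coeffs = {"C1": 3.0, "C2": -1.0, "C3": 0.5, "C4": 0.5}
ranking = coefficient_percentages(model_coeffs)
for name, pct in ranking:
    print(f"{name}: {pct:.2f}%")
```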

Counter Ranking
Models: Runtime $f_1(C_{11}, C_{12}, \ldots, C_{1n})$; System Power $f_2(C_{21}, C_{22}, \ldots, C_{2m})$; CPU Power $f_3(C_{31}, C_{32}, \ldots, C_{3s})$; Memory Power $f_4(C_{41}, C_{42}, \ldots, C_{4r})$
Ranking counters based on coefficient percentage: $\mathrm{Rank}(C_{11}, \ldots, C_{1n})$, $\mathrm{Rank}(C_{21}, \ldots, C_{2m})$, $\mathrm{Rank}(C_{31}, \ldots, C_{3s})$, $\mathrm{Rank}(C_{41}, \ldots, C_{4r})$
For example, given a parallel aerospace simulation PMLB:
- Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%, BR_INS 0.03%, SR_INS 0.01%
- Node Power: VEC_INS 76.64%, CA_SHR 22.45%, L1_TCM 0.89%, RES_STL 0.02%
- CPU Power: VEC_INS 99.15%, BR_NTK 0.81%, RES_STL 0.04%
- Memory Power: VEC_INS 83.91%, CA_CLN 13.74%, BR_NTK 0.98%, L1_TCM 0.92%, RES_STL 0.18%, BR_TKN 0.16%, L1_ICA 0.11%

Counter Ranking for Original PMLB on SystemG
[Figure: coefficient percentage (%) of each counter in the Runtime, System Power, CPU Power, and Memory Power models. Counters: PAPI_L1_ICA, PAPI_BR_TKN, PAPI_CA_CLN, PAPI_BR_NTK, PAPI_RES_STL, PAPI_L1_TCM, PAPI_CA_SHR, PAPI_VEC_INS, PAPI_SR_INS, PAPI_BR_INS, PAPI_L2_ICA, PAPI_L1_ICM, PAPI_L2_ICM, PAPI_TLB_DM, PAPI_TLB_IM.]

Counter Ranking
Models: Runtime $f_1(C_{11}, C_{12}, \ldots, C_{1n})$; System Power $f_2(C_{21}, C_{22}, \ldots, C_{2m})$; CPU Power $f_3(C_{31}, C_{32}, \ldots, C_{3s})$; Memory Power $f_4(C_{41}, C_{42}, \ldots, C_{4r})$
Ranking counters based on coefficient percentage: $\mathrm{Rank}(C_{11}, \ldots, C_{1n})$, $\mathrm{Rank}(C_{21}, \ldots, C_{2m})$, $\mathrm{Rank}(C_{31}, \ldots, C_{3s})$, $\mathrm{Rank}(C_{41}, \ldots, C_{4r})$
Ranking counters with percentage > 1% (from the highest to the lowest): $C_1, C_2, \ldots, C_k$
- Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%
- Node Power: VEC_INS 76.64%, CA_SHR 22.45%
- CPU Power: VEC_INS 99.15%
- Memory Power: VEC_INS 83.91%, CA_CLN 13.74%
Resulting counters: TLB_IM, VEC_INS, TLB_DM, L2_ICM, L1_ICM, L2_ICA, CA_SHR, CA_CLN
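The filtering step above can be sketched as follows, using the PMLB coefficient percentages reported on the slides. The function name `significant_counters` is hypothetical, and the merge order is an assumption: the slides do not state how counters from the four models are interleaved, so this sketch orders them by their highest percentage in any model. It reproduces the same eight counters as the slide, though not necessarily in the slide's order.

```python
# Coefficient percentages for the original PMLB on SystemG, copied from the slides.
MODELS = {
    "Runtime": {"TLB_IM": 64.29, "TLB_DM": 14.03, "L2_ICM": 10.49, "L1_ICM": 9.75,
                "L2_ICA": 1.40, "BR_INS": 0.03, "SR_INS": 0.01},
    "Node Power": {"VEC_INS": 76.64, "CA_SHR": 22.45, "L1_TCM": 0.89, "RES_STL": 0.02},
    "CPU Power": {"VEC_INS": 99.15, "BR_NTK": 0.81, "RES_STL": 0.04},
    "Memory Power": {"VEC_INS": 83.91, "CA_CLN": 13.74, "BR_NTK": 0.98, "L1_TCM": 0.92,
                     "RES_STL": 0.18, "BR_TKN": 0.16, "L1_ICA": 0.11},
}


def significant_counters(models, cutoff=1.0):
    """Return counters whose percentage exceeds `cutoff` in at least one model."""
    best = {}
    for pct_by_counter in models.values():
        for name, pct in pct_by_counter.items():
            if pct > cutoff:
                # Remember the highest percentage seen for this counter.
                best[name] = max(pct, best.get(name, 0.0))
    # Assumed ordering: highest percentage across models first.
    return sorted(best, key=best.get, reverse=True)


candidates = significant_counters(MODELS)
print(candidates)
```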

Correlation Analysis Using Pair-wise Spearman
- TLB_IM: occurred in Runtime. Correlated counters:
  - TLB_DM: Corr Value = 0.89217296; occurred in Runtime
  - BR_NTK: Corr Value = 0.83305966; occurred in CPU, Memory
  - L2_ICM: Corr Value = 0.88451013; occurred in Runtime
  - L1_ICM: Corr Value = 0.96934866; occurred in Runtime
  - L2_ICA: Corr Value = 0.97044335; occurred in Runtime
  - BR_TKN: Corr Value = 0.88122605; occurred in Memory
  - BR_INS: Corr Value = 0.88122605; occurred in Runtime
- VEC_INS: occurred in System, CPU, Memory
Final counters for optimization focus: TLB_IM and VEC_INS
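The pruning idea above can be sketched as a greedy pass over the ranked counters that drops any counter strongly rank-correlated with one already kept. This is an illustration, not the authors' procedure: the 0.8 threshold, the function names, and the sample values are all hypothetical (the slides only show correlation values between about 0.83 and 0.97 for the dropped counters).

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series is treated here as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def prune_correlated(ranked_names, samples, threshold=0.8):
    """Keep a counter only if |rho| with every kept counter is below threshold."""
    kept = []
    for name in ranked_names:
        if all(abs(spearman(samples[name], samples[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical counter values over six runs (not measured data).
samples = {
    "TLB_IM": [10, 20, 30, 40, 50, 60],
    "VEC_INS": [5, 1, 6, 2, 4, 3],
    "TLB_DM": [12, 19, 33, 41, 48, 65],
}
final = prune_correlated(["TLB_IM", "VEC_INS", "TLB_DM"], samples)
print(final)
```

In this synthetic data TLB_DM rises monotonically with TLB_IM, so it is dropped, while VEC_INS is only weakly correlated and is kept.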

Performance for PMLB with 128x128x128 on SystemG
[Figure: time (s, log2 scale) vs. number of cores (1 to 128). Series: Original, Optimized.]

Node Power Comparison on SystemG
[Figure: power per node (W, 250 to 350) vs. number of cores (1 to 128). Series: Original, Optimized.]

Energy Comparison for PMLB on SystemG
[Figure: energy per node (J, log2 scale) vs. number of cores (1 to 128, log2 scale). Series: Original, Optimized.]

Counter Ranking for Original PMLB on Mira
[Figure: coefficient percentage (%) of each counter in the Runtime, System Power, CPU Power, and Memory Power models. Counters: PAPI_FDV_INS, PAPI_FML_INS, PAPI_RES_STL, PAPI_VEC_INS, PAPI_FP_INS, PAPI_SR_INS, PAPI_BR_NTK, PAPI_BR_MSP, PAPI_L1_ICM, PAPI_HW_INT.]

Performance Comparison on Mira
[Figure: time (s, log2 scale) vs. number of nodes x number of threads per node. Series: Original, Optimized.]

System Power Comparison on Mira
[Figure: power per node (W, 45 to 61) vs. number of nodes x number of threads per node. Series: Original, Optimized.]

Energy Comparison for PMLB with 512x512x512 on Mira
[Figure: energy per node (J, log2 scale) vs. number of nodes x number of threads per node. Series: Original, Optimized.]

Counter Ranking on SystemG
[Figure: Counter Ranking for Original eq3dyna on SystemG. Coefficient percentage (%) of each counter in the Runtime, System Power, CPU Power, and Memory Power models. Counters: PAPI_SR_INS, PAPI_FDV_INS, PAPI_L2_STM, PAPI_FML_INS, PAPI_L1_TCA, PAPI_RES_STL, PAPI_TLB_DM, PAPI_L1_STM, PAPI_L2_TCW, PAPI_BR_NTK, PAPI_FP_INS, PAPI_L2_DCW, PAPI_L2_ICA, PAPI_L1_ICM.]

Energy Comparison of eq3dyna on SystemG
[Figure: energy per node (J, log2 scale) vs. number of cores (1 to 256, log2 scale). Series: Original, Optimized.]

Counter Ranking on ANL BGQ Mira
[Figure: Counter Ranking for Original eq3dyna on ANL BGQ Mira. Coefficient percentage (%) of each counter in the Runtime, System Power, CPU Power, and Memory Power models. Counters: PAPI_L1_DCM, PAPI_SR_INS, PAPI_BR_NTK, PAPI_L1_STM, PAPI_RES_STL, PAPI_LD_INS, PAPI_BR_MSP, PAPI_VEC_INS.]

Energy Comparison for eq3dyna with 100m on ANL BG/Q Mira
[Figure: energy per node (J, 0 to 60000) vs. number of nodes x number of threads per node (32x16, 32x32, 32x64, 64x16, 64x32, 64x64, 128x16, 128x32, 128x64, 192x16, 192x32, 192x64, 256x16, 256x32, 256x64; the maximum number of threads per core is 4). Series: Original, Optimized.]
