http://www.mummi.org
Using MuMMI to Model and Optimize Energy and Performance
Xingfu Wu and Valerie Taylor
Texas A&M University
Scalable Tools Workshop 2015, Lake Tahoe, CA
August 3, 2015
Outline
- Recent Development of MuMMI
- Power Measurement Tools
- Hardware Performance Counter Tools
- Performance Counter-based Modeling
- Performance Counter-guided Optimization
MuMMI (Multiple Metrics Modeling Infrastructure) Project
[Diagram: Multicore/Heterogeneous System for Execution, with component labels Application, PAPI, PowerPack, EMON API, E-AMOM, Database]
MuMMI Database Schema
[Diagram: database schema. Labels: Application Executable Run Application Performance Modules Function Performance Basic Unit Performance Data Structure Performance Inputs Systems Resource Connection Functions Module_Info Control Flow Compilers Model Template Function_Info Library Model_Info User Counters Sys_Comp Sys_Comm Power Coupling Power Counters]
Data Collection: MAIDE System
Source code -> MAIDE -> Instrumented source code -> Compiler -> Instrumented executable -> Performance, HW counters, Power and Energy Data
Other diagram labels: MuMMI Database, Call Graph, Power and HW Counters, SOAP Server with Perl Script
Power Measurement Tools
- IBM EMON API on BlueGene P/Q (MonEQ)
- Intel RAPL
- NVIDIA MLPM
- PowerMon2 (RENCI)
- PowerPack (VT)
PowerPack Schema (Virginia Tech)
Power sampling frequency: 1 sample per second
http://scape.cs.vt.edu/software/powerpack-2-0/
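With a fixed sampling rate such as the one stated on this slide, energy can be estimated by integrating the power samples over time. The following is a minimal sketch, assuming evenly spaced samples and trapezoidal integration. The function name and the sample values are hypothetical; this is not the PowerPack API and the numbers are not measured data.

```python
def energy_joules(power_watts, sample_hz=1.0):
    """Estimate energy (J) from evenly spaced power samples (W).

    Uses the trapezoidal rule; sample_hz is the sampling frequency.
    """
    if len(power_watts) < 2:
        return 0.0
    dt = 1.0 / sample_hz  # seconds between consecutive samples
    total = 0.0
    for p0, p1 in zip(power_watts, power_watts[1:]):
        total += 0.5 * (p0 + p1) * dt
    return total


# Hypothetical node power samples taken at 1 sample per second.
samples = [300.0, 310.0, 305.0, 295.0]
energy = energy_joules(samples, sample_hz=1.0)
print(f"estimated energy: {energy:.1f} J")
```

At a coarse rate such as one or two samples per second, short power spikes between samples are not captured, so the estimate is best suited to runs that last many sampling intervals.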
PowerPack
IBM EMON API
Power sampling frequency: ~2 samples per second Source: IBM
IBM EMON API
[Figure: Power per node card for GTC on 16384 nodes of ANL BGQ Mira. x-axis: Time (seconds); y-axis: Power (W); series: Node_Card, Chip_Core, DRAM, Network, SRAM, Optics, PCIexpress, Link_Chip_Core]
IBM EMON API
[Figure: Average power per node for GTC on 16384 nodes of ANL BGQ Mira. x-axis: Time (seconds); y-axis: Power (W); series: Node Power, CPU Power, Memory Power, Network]
Performance Counter Tools
- Perf_events (Linux)
- HPM (IBM)
- perfmon (Linux)
- PAPI (UTK)
Overview: Performance Counter-based Modeling
Runtime and Power Modeling (flow from the diagram):
- HPC Application: application- or function-level performance counters (Ci), collected with PAPI
- HPC Application: application- or function-level runtime and power (CPU, memory)
- Counter selection: Spearman correlation and PCA
- Regression on the counters gives a model f(C1, C2, …, Cn) for each metric
- Four metrics: runtime, node power, CPU power, memory power
- Model outputs: predicted runtime and predicted power (node, CPU, memory)
- Recommendations for improvements
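The regression step can be illustrated with a small example. This is a minimal sketch, assuming a linear form f(C1, …, Cn) = b0 + b1*C1 + … + bn*Cn fitted by ordinary least squares; the slides do not state the exact regression form, and all function names and data below are hypothetical and synthetic.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no zero pivot handling).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(counter_rows, metric):
    """Least-squares fit of metric ~ b0 + b1*C1 + ... + bn*Cn (normal equations)."""
    x = [[1.0] + list(row) for row in counter_rows]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, metric)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, counters):
    """Evaluate the fitted model for one set of counter values."""
    return coeffs[0] + sum(b * c for b, c in zip(coeffs[1:], counters))


# Synthetic data: two counters per run, runtime = 2.0 + 3.0*C1 + 0.5*C2.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
runtime = [6.0, 8.5, 13.5, 15.5, 21.0]

coeffs = fit_linear(rows, runtime)
print("coefficients:", [round(c, 6) for c in coeffs])
print("predicted runtime:", round(predict(coeffs, [6.0, 4.0]), 6))
```

The same fit would be repeated once per metric (runtime, node power, CPU power, memory power), each with its own set of counters.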
Four Models of Parallel eq3dyna on SystemG
Four Models of Parallel eq3dyna on ANL Mira
Prediction Error Rates on ANL Mira
Overview: Counter-Guided Optimizations
Four models, one per metric:
  Runtime: f1(C11, C12, …, C1n)
  System Power: f2(C21, C22, …, C2m)
  CPU Power: f3(C31, C32, …, C3s)
  Memory Power: f4(C41, C42, …, C4r)
Step 1. Rank the counters of each model by coefficient percentage: Rank(C11, C12, …, C1n), Rank(C21, C22, …, C2m), Rank(C31, C32, …, C3s), Rank(C41, C42, …, C4r)
Step 2. Keep the counters with percentage > 1%, ranked from the highest to the lowest: C1, C2, …, Ck
Step 3. Apply pair-wise Spearman correlation analysis to obtain the final counters, ranked from the highest to the lowest: C1, C2, …, Cj (j < k)
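The ranking by coefficient percentage can be sketched as follows. The slides do not define how the percentage is computed, so this example assumes it is each counter's share of the summed absolute regression coefficients in one model. The function names, counter names, and coefficient values are hypothetical.

```python
def coefficient_percentages(coeffs):
    """Map each counter to its share (%) of the summed absolute coefficients.

    Assumption: percentage = |b_i| / sum(|b_j|) * 100 within one model.
    """
    total = sum(abs(v) for v in coeffs.values())
    if total == 0:
        raise ValueError("all coefficients are zero")
    return {name: 100.0 * abs(v) / total for name, v in coeffs.items()}


def rank_counters(coeffs):
    """Return (counter, percentage) pairs sorted from highest to lowest."""
    pct = coefficient_percentages(coeffs)
    return sorted(pct.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical coefficients of one fitted model.
example = {"C_a": 6.0, "C_b": -3.0, "C_c": 0.95, "C_d": 0.05}
for name, share in rank_counters(example):
    print(f"{name}: {share:.2f}%")
```

Under this assumption the percentages of one model sum to 100, which is consistent with the per-model figures reported for PMLB in this deck.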
Counter Ranking
For example, given a parallel aerospace simulation PMLB:
Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%, BR_INS 0.03%, SR_INS 0.01%
Node Power: VEC_INS 76.64%, CA_SHR 22.45%, L1_TCM 0.89%, RES_STL 0.02%
Memory Power: VEC_INS 83.91%, CA_CLN 13.74%, BR_NTK 0.98%, L1_TCM 0.92%, RES_STL 0.18%, BR_TKN 0.16%, L1_ICA 0.11%
CPU Power: VEC_INS 99.15%, BR_NTK 0.81%, RES_STL 0.04%
[Figure: Counter Ranking for Original PMLB on SystemG. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%); counters: PAPI_L1_ICA, PAPI_BR_TKN, PAPI_CA_CLN, PAPI_BR_NTK, PAPI_RES_STL, PAPI_L1_TCM, PAPI_CA_SHR, PAPI_VEC_INS, PAPI_SR_INS, PAPI_BR_INS, PAPI_L2_ICA, PAPI_L1_ICM, PAPI_L2_ICM, PAPI_TLB_DM, PAPI_TLB_IM]
Counter Ranking
Ranking counters with percentage (>1%) (from the highest to the lowest)
C1, C2, …, Ck
Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%
Node Power: VEC_INS 76.64%, CA_SHR 22.45%
Memory Power: VEC_INS 83.91%, CA_CLN 13.74%
CPU Power: VEC_INS 99.15%
Combined list: TLB_IM, VEC_INS, TLB_DM, L2_ICM, L1_ICM, L2_ICA, CA_SHR, CA_CLN
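The >1% filter can be checked against the PMLB percentages reported in this deck. The sketch below applies the threshold to each model and merges the survivors into one candidate list. The function name is hypothetical, and the merge order (first appearance, model by model) is an assumption: the slide does not state how its combined list is ordered, so only the set of counters is expected to match.

```python
# PMLB coefficient percentages on SystemG, as listed in the slides.
rankings = {
    "Runtime": {"TLB_IM": 64.29, "TLB_DM": 14.03, "L2_ICM": 10.49,
                "L1_ICM": 9.75, "L2_ICA": 1.40, "BR_INS": 0.03, "SR_INS": 0.01},
    "Node Power": {"VEC_INS": 76.64, "CA_SHR": 22.45, "L1_TCM": 0.89,
                   "RES_STL": 0.02},
    "Memory Power": {"VEC_INS": 83.91, "CA_CLN": 13.74, "BR_NTK": 0.98,
                     "L1_TCM": 0.92, "RES_STL": 0.18, "BR_TKN": 0.16,
                     "L1_ICA": 0.11},
    "CPU Power": {"VEC_INS": 99.15, "BR_NTK": 0.81, "RES_STL": 0.04},
}


def significant_counters(per_model, threshold=1.0):
    """Collect counters whose percentage exceeds the threshold in any model.

    Counters are appended in order of first appearance (assumed merge rule).
    """
    selected = []
    for pct in per_model.values():
        ordered = sorted(pct.items(), key=lambda kv: kv[1], reverse=True)
        for name, value in ordered:
            if value > threshold and name not in selected:
                selected.append(name)
    return selected


candidates = significant_counters(rankings)
print(candidates)
```

The resulting set contains the same eight counters as the combined list on the slide.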
Correlation Analysis Using Pair-wise Spearman
- TLB_IM: occurred in Runtime. Counters correlated with it:
    TLB_DM: Corr Value = 0.89217296; occurred in Runtime
    BR_NTK: Corr Value = 0.83305966; occurred in CPU, Memory
    L2_ICM: Corr Value = 0.88451013; occurred in Runtime
    L1_ICM: Corr Value = 0.96934866; occurred in Runtime
    L2_ICA: Corr Value = 0.97044335; occurred in Runtime
    BR_TKN: Corr Value = 0.88122605; occurred in Memory
    BR_INS: Corr Value = 0.88122605; occurred in Runtime
- VEC_INS: occurred in System, CPU, Memory
Final counters: TLB_IM and VEC_INS for optimization focus
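The pruning step can be sketched with a small pure-Python Spearman implementation. This is an illustration under stated assumptions: the cutoff of 0.8 is hypothetical (the slides do not give one), the greedy keep-the-higher-ranked rule is an assumed reading of the procedure, and the counter names and samples are synthetic.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def prune_correlated(ordered_names, samples, cutoff=0.8):
    """Walk counters from highest to lowest rank, dropping any counter whose
    |rho| with an already kept counter reaches the (assumed) cutoff."""
    kept = []
    for name in ordered_names:
        if all(abs(spearman(samples[name], samples[k])) < cutoff for k in kept):
            kept.append(name)
    return kept


# Synthetic per-run counter values for three hypothetical counters.
samples = {
    "CTR_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "CTR_B": [2.0, 4.0, 5.0, 9.0, 10.0, 13.0],  # moves with CTR_A
    "CTR_C": [3.0, 1.0, 4.0, 6.0, 2.0, 5.0],    # weakly related to CTR_A
}
final = prune_correlated(["CTR_A", "CTR_B", "CTR_C"], samples)
print(final)
```

In this synthetic case CTR_B is dropped because it is monotonically related to the higher-ranked CTR_A, which mirrors how the counters correlated with TLB_IM are folded into it on the slide.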
[Figure: Performance for PMLB with 128x128x128 on SystemG. x-axis: Number of Cores; y-axis: Time (s), log2 scale; series: Original, Optimized]
[Figure: Node Power Comparison on SystemG. x-axis: Number of Cores; y-axis: Power per node (W); series: Original, Optimized]
[Figure: Energy Comparison for PMLB on SystemG. x-axis: Number of Cores, log2 scale; y-axis: Energy per node (J), log2 scale; series: Original, Optimized]
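The runtime, power, and energy comparisons are related by energy = average power x runtime. The sketch below shows how a percentage energy saving would follow from the two measured quantities. It assumes average node power is representative of the whole run; the function names and all numbers are hypothetical and are not taken from the PMLB measurements.

```python
def energy_per_node(avg_power_watts, runtime_seconds):
    """Energy per node (J) as average node power (W) times runtime (s)."""
    return avg_power_watts * runtime_seconds


def percent_saving(original, optimized):
    """Relative reduction of the optimized value versus the original, in %."""
    return 100.0 * (original - optimized) / original


# Hypothetical values for one core count.
e_orig = energy_per_node(320.0, 100.0)  # original code
e_opt = energy_per_node(300.0, 80.0)    # optimized code
saving = percent_saving(e_orig, e_opt)
print(f"energy saving: {saving:.2f}%")
```

Because energy is a product, a modest power reduction combined with a runtime reduction compounds into a larger energy saving than either change alone.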
[Figure: Counter Ranking for Original PMLB on Mira. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%); counters: PAPI_FDV_INS, PAPI_FML_INS, PAPI_RES_STL, PAPI_VEC_INS, PAPI_FP_INS, PAPI_SR_INS, PAPI_BR_NTK, PAPI_BR_MSP, PAPI_L1_ICM, PAPI_HW_INT]
[Figure: Performance Comparison on Mira. x-axis: Number of Nodes x Number of Threads per Node; y-axis: Time (s), log2 scale; series: Original, Optimized]
[Figure: System Power Comparison on Mira. x-axis: Number of Nodes x Number of Threads per Node; y-axis: Power per node (W); series: Original, Optimized]
[Figure: Energy Comparison for PMLB with 512x512x512 on Mira. x-axis: Number of Nodes x Number of Threads per Node; y-axis: Energy per Node (J), log2 scale; series: Original, Optimized]
Counter Ranking on SystemG
[Figure: Counter Ranking for Original eq3dyna on SystemG. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%); counters: PAPI_SR_INS, PAPI_FDV_INS, PAPI_L2_STM, PAPI_FML_INS, PAPI_L1_TCA, PAPI_RES_STL, PAPI_TLB_DM, PAPI_L1_STM, PAPI_L2_TCW, PAPI_BR_NTK, PAPI_FP_INS, PAPI_L2_DCW, PAPI_L2_ICA, PAPI_L1_ICM]
[Figure: Energy Comparison of eq3dyna on SystemG. x-axis: Number of Cores, log2 scale; y-axis: Energy per Node (J), log2 scale; series: Original, Optimized]
Counter Ranking on ANL BGQ Mira
[Figure: Counter Ranking for Original eq3dyna on ANL BGQ Mira. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%); counters: PAPI_L1_DCM, PAPI_SR_INS, PAPI_BR_NTK, PAPI_L1_STM, PAPI_RES_STL, PAPI_LD_INS, PAPI_BR_MSP, PAPI_VEC_INS]
[Figure: Energy Comparison for eq3dyna with 100m on ANL BG/Q Mira. x-axis: Number of Nodes x Number of Threads per Node, 32x16 through 256x64 (max number of threads per core is 4); y-axis: Energy per node (J); series: Original, Optimized]
Summary and Future Work
- We used runtime and power models to identify the most important counters for optimization focus
- Our counter-guided optimizations save energy by
    - eq3dyna: average 48.65% on Mira; 30.67% on SystemG
    - PMLB: