Using MuMMI to Model and Optimize Energy and Performance

SLIDE 1

http://www.mummi.org

Using MuMMI to Model and Optimize Energy and Performance

Xingfu Wu and Valerie Taylor Texas A&M University

Scalable Tools Workshop 2015, Lake Tahoe, CA August 3, 2015

SLIDE 2

Outline

- Recent Development of MuMMI
- Power Measurement Tools
- Hardware Performance Counter Tools
- Performance Counter-based Modeling
- Performance Counter-guided Optimization

SLIDE 3

MuMMI (Multiple Metrics Modeling Infrastructure) Project

[Figure: MuMMI framework diagram. Components: PAPI, PowerPack, Database, Application, E-AMOM, EMON API. Execution target: Multicore/Heterogeneous System for Execution.]

SLIDE 4

MuMMI Database Schema

[Figure: database schema diagram. Entity labels, in extraction order: Application Executable Run Application Performance Modules Function Performance Basic Unit Performance Data Structure Performance Inputs Systems Resource Connection Functions Module_Info Control Flow Compilers Model Template Function_Info Library Model_Info User Counters Sys_Comp Sys_Comm Power Coupling Power Counters]

SLIDE 5

Data Collection: MAIDE System

[Figure: MAIDE data-collection flow. Source code → MAIDE → Instrumented source code → Compiler → Instrumented executable → Performance, HW counters, Power and Energy Data → MuMMI Database. Other diagram labels: Call Graph; Power and HW Counters; SOAP Server with Perl Script.]

SLIDE 6

Power Measurement Tools

- IBM EMON API on BlueGene P/Q (MonEQ)
- Intel RAPL
- NVIDIA MLPM
- PowerMon2 (RENCI)
- PowerPack (VT)

SLIDE 7

PowerPack Schema (Virginia Tech)

Power sampling frequency: 1 sample per second

http://scape.cs.vt.edu/software/powerpack-2-0/
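With a fixed sampling rate such as the 1 sample per second quoted above, energy can be estimated by integrating the sampled power over time. This is a minimal sketch, assuming evenly spaced samples and trapezoidal integration; the slides do not say which integration rule PowerPack or MuMMI uses, and the function names are hypothetical.

```python
def energy_joules(power_samples_w, sample_period_s=1.0):
    """Trapezoidal estimate of energy (J) from evenly spaced power samples (W)."""
    if len(power_samples_w) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    # Each pair of neighbouring samples contributes one trapezoid.
    for p0, p1 in zip(power_samples_w, power_samples_w[1:]):
        total += 0.5 * (p0 + p1) * sample_period_s
    return total


def average_power_w(power_samples_w, sample_period_s=1.0):
    """Average power (W) over the sampled interval."""
    duration = (len(power_samples_w) - 1) * sample_period_s
    return energy_joules(power_samples_w, sample_period_s) / duration


# Made-up node power samples taken once per second.
samples = [100.0, 200.0, 300.0]
print(energy_joules(samples), average_power_w(samples))
```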

SLIDE 8

PowerPack

SLIDE 9

IBM EMON API

Power sampling frequency: ~2 samples per second
Source: IBM

SLIDE 10

IBM EMON API

SLIDE 11

IBM EMON API

[Figure: Power per node card for GTC on 16384 nodes of ANL BGQ Mira. x-axis: Time (seconds); y-axis: Power (W). Series: Node_Card, Chip_Core, DRAM, Network, SRAM, Optics, PCIexpress, Link_Chip_Core.]

SLIDE 12

IBM EMON API

[Figure: Average Power per node for GTC on 16384 nodes of ANL BGQ Mira. x-axis: Time (seconds); y-axis: Power (W). Series: Node Power, CPU Power, Memory Power, Network.]

SLIDE 13

Performance Counter Tools

- Perf_events (Linux)
- HPM (IBM)
- perfmon (Linux)
- PAPI (UTK)

SLIDE 14

Outline

- Recent Development of MuMMI
- Power Measurement Tools
- Hardware Performance Counter Tools
- Performance Counter-based Modeling
- Performance Counter-guided Optimization

SLIDE 15

Overview: Performance Counter-based Modeling

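Runtime and Power Modeling

Workflow, reconstructed from the slide diagram:
- Inputs from the HPC application: application- or function-level performance counters (Ci), collected with PAPI, and application- or function-level runtime and power (CPU, mem)
- For each metric, Spearman correlation and PCA select the counters used in the model
- Regression on the selected counters gives a model f(C1, C2, …, Cn) per metric
- Outputs: predicted runtime, predicted power (node, CPU, mem), and recommendations for improvements

Four metrics: runtime, node power, CPU power, memory power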
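The regression step can be sketched as follows. The slides only say that a regression produces f(C1, C2, …, Cn); the linear form with an intercept, the ordinary least-squares fit, and every function name here are assumptions made for illustration, not the MuMMI implementation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; counters may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_counter_model(counter_rows, metric_values):
    """Least-squares fit of metric ~ b0 + b1*C1 + ... + bn*Cn.

    Returns [b0, b1, ..., bn], obtained from the normal equations.
    """
    rows = [[1.0] + list(r) for r in counter_rows]  # leading 1.0 = intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, metric_values)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, counters):
    """Evaluate the fitted model for one set of counter values."""
    return coeffs[0] + sum(b * c for b, c in zip(coeffs[1:], counters))


# Made-up training data: two counters per run and a measured runtime.
runs = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (0.0, 1.0)]
runtimes = [2.0 + 3.0 * c1 + 0.5 * c2 for c1, c2 in runs]
model = fit_counter_model(runs, runtimes)
print(model, predict(model, (5.0, 2.0)))
```

One such model would be fitted per metric (runtime, node power, CPU power, memory power), each with its own selected counters.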

SLIDE 16

Four Models of Parallel eq3dyna on SystemG

SLIDE 17

Four Models of Parallel eq3dyna on ANL Mira

SLIDE 18

Prediction Error Rates on ANL Mira

SLIDE 19

Outline

- Recent Development of MuMMI
- Power Measurement Tools
- Hardware Performance Counter Tools
- Performance Counter-based Modeling
- Performance Counter-guided Optimization

SLIDE 20

Overview: Counter-Guided Optimizations

Ranking counters based on coefficient percentage, one model per metric:
- Runtime: f1(C11, C12, …, C1n), ranked as Rank(C11, C12, …, C1n)
- System Power: f2(C21, C22, …, C2m), ranked as Rank(C21, C22, …, C2m)
- CPU Power: f3(C31, C32, …, C3s), ranked as Rank(C31, C32, …, C3s)
- Memory Power: f4(C41, C42, …, C4r), ranked as Rank(C41, C42, …, C4r)

Ranking counters with percentage (>1%) (rank from the highest to the lowest): C1, C2, …, Ck

Pair-wise Spearman correlation analysis

Final counters (rank from the highest to the lowest): C1, C2, …, Cj (j < k)

SLIDE 21

Counter Ranking

Ranking counters based on coefficient percentage, one model per metric:
- Runtime: f1(C11, C12, …, C1n), ranked as Rank(C11, C12, …, C1n)
- System Power: f2(C21, C22, …, C2m), ranked as Rank(C21, C22, …, C2m)
- CPU Power: f3(C31, C32, …, C3s), ranked as Rank(C31, C32, …, C3s)
- Memory Power: f4(C41, C42, …, C4r), ranked as Rank(C41, C42, …, C4r)

For example, given a parallel aerospace simulation PMLB:

Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%, BR_INS 0.03%, SR_INS 0.01%
Node Power: VEC_INS 76.64%, CA_SHR 22.45%, L1_TCM 0.89%, RES_STL 0.02%
Memory Power: VEC_INS 83.91%, CA_CLN 13.74%, BR_NTK 0.98%, L1_TCM 0.92%, RES_STL 0.18%, BR_TKN 0.16%, L1_ICA 0.11%
CPU Power: VEC_INS 99.15%, BR_NTK 0.81%, RES_STL 0.04%
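The slides do not define "coefficient percentage". One reading that fits the example above, where each model's percentages sum to 100%, is each counter's share of the summed absolute regression coefficients. The sketch below assumes that definition; the coefficient values are made up and the function names are hypothetical.

```python
def coefficient_percentages(coefficients):
    """Map each counter to its share (%) of the summed absolute coefficients.

    Assumed definition; the slides do not state how the percentage is computed.
    """
    total = sum(abs(v) for v in coefficients.values())
    if total == 0:
        raise ValueError("all coefficients are zero")
    return {name: 100.0 * abs(v) / total for name, v in coefficients.items()}


def rank_counters(coefficients):
    """Return (counter, percentage) pairs ordered from highest to lowest."""
    pct = coefficient_percentages(coefficients)
    return sorted(pct.items(), key=lambda item: item[1], reverse=True)


# Hypothetical coefficients of a runtime model.
runtime_coefficients = {"PAPI_TLB_DM": -3.0, "PAPI_TLB_IM": 6.0, "PAPI_L1_ICM": 1.0}
for name, share in rank_counters(runtime_coefficients):
    print(f"{name}: {share:.2f}%")
```

Absolute values are used so that a counter with a large negative coefficient still ranks as influential; that choice is also an assumption.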

SLIDE 22

[Figure: Counter Ranking for Original PMLB on SystemG. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%). Legend: PAPI_L1_ICA, PAPI_BR_TKN, PAPI_CA_CLN, PAPI_BR_NTK, PAPI_RES_STL, PAPI_L1_TCM, PAPI_CA_SHR, PAPI_VEC_INS, PAPI_SR_INS, PAPI_BR_INS, PAPI_L2_ICA, PAPI_L1_ICM, PAPI_L2_ICM, PAPI_TLB_DM, PAPI_TLB_IM.]

SLIDE 23

Counter Ranking

Ranking counters based on coefficient percentage, one model per metric:
- Runtime: f1(C11, C12, …, C1n), ranked as Rank(C11, C12, …, C1n)
- System Power: f2(C21, C22, …, C2m), ranked as Rank(C21, C22, …, C2m)
- CPU Power: f3(C31, C32, …, C3s), ranked as Rank(C31, C32, …, C3s)
- Memory Power: f4(C41, C42, …, C4r), ranked as Rank(C41, C42, …, C4r)

Ranking counters with percentage (>1%) (from the highest to the lowest): C1, C2, …, Ck

Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%
Node Power: VEC_INS 76.64%, CA_SHR 22.45%
Memory Power: VEC_INS 83.91%, CA_CLN 13.74%
CPU Power: VEC_INS 99.15%

Combined counters: TLB_IM, VEC_INS, TLB_DM, L2_ICM, L1_ICM, L2_ICA, CA_SHR, CA_CLN
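The >1% filter can be checked against the PMLB percentages given in the Counter Ranking example. This is an illustrative sketch with hypothetical function names. The slides do not say how the per-model lists are merged into one ordering, so the merge below returns an unordered set.

```python
# Coefficient percentages for PMLB on SystemG, as listed in the slides.
PMLB_SYSTEMG = {
    "Runtime": {"TLB_IM": 64.29, "TLB_DM": 14.03, "L2_ICM": 10.49,
                "L1_ICM": 9.75, "L2_ICA": 1.40, "BR_INS": 0.03, "SR_INS": 0.01},
    "Node Power": {"VEC_INS": 76.64, "CA_SHR": 22.45, "L1_TCM": 0.89,
                   "RES_STL": 0.02},
    "Memory Power": {"VEC_INS": 83.91, "CA_CLN": 13.74, "BR_NTK": 0.98,
                     "L1_TCM": 0.92, "RES_STL": 0.18, "BR_TKN": 0.16,
                     "L1_ICA": 0.11},
    "CPU Power": {"VEC_INS": 99.15, "BR_NTK": 0.81, "RES_STL": 0.04},
}


def significant_counters(model_percentages, threshold=1.0):
    """Per model, keep counters above the threshold, highest percentage first."""
    result = {}
    for model, percentages in model_percentages.items():
        kept = [(name, pct) for name, pct in percentages.items() if pct > threshold]
        kept.sort(key=lambda item: item[1], reverse=True)
        result[model] = [name for name, _ in kept]
    return result


def merged_counters(model_percentages, threshold=1.0):
    """Union of the counters that pass the threshold in at least one model."""
    per_model = significant_counters(model_percentages, threshold)
    return set().union(*per_model.values())


print(significant_counters(PMLB_SYSTEMG))
print(sorted(merged_counters(PMLB_SYSTEMG)))
```

The union contains the same eight counters as the combined list on the slide.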

SLIDE 24

Correlation Analysis Using Pair-wise Spearman

n TLB_IM: Occurred in Runtime

TLB_DM: Corr Value=0.89217296 : Occurred in Runtime BR_NTK: Corr Value=0.83305966 : Occurred in CPU, Memory L2_ICM: Corr Value=0.88451013 : Occurred in Runtime L1_ICM: Corr Value=0.96934866 : Occurred in Runtime L2_ICA: Corr Value=0.97044335 : Occurred in Runtime BR_TKN: Corr Value=0.88122605 : Occurred in Memory BR_INS: Corr Value=0.88122605 : Occurred in Runtime

n VEC_INS: Occurred in System, CPU, Memory

Final counters: TLB_IM and VEC_INS for optimization focus
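The pair-wise Spearman step can be sketched as follows: compute the rank correlation between counters and drop any counter that is strongly correlated with a higher-ranked one already kept. The 0.8 cutoff, the function names, and the sample values are assumptions; the slides list correlation values but do not state a threshold.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def drop_correlated(ranked_counters, samples, threshold=0.8):
    """Walk counters from highest to lowest rank; keep one only if it is not
    strongly correlated with any counter already kept."""
    kept = []
    for name in ranked_counters:
        if all(abs(spearman(samples[name], samples[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up per-run counter values for three counters.
samples = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 5, 9, 20],  # increases with A, so it is dropped
    "C": [3, 1, 4, 1, 5],
}
print(drop_correlated(["A", "B", "C"], samples))
```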

SLIDE 25

[Figure: Performance for PMLB with 128x128x128 on SystemG. x-axis: Number of Cores; y-axis: Time (s), log2 scale. Series: Original, Optimized.]

SLIDE 26

[Figure: Node Power Comparison on SystemG. x-axis: Number of Cores; y-axis: Power per node (W). Series: Original, Optimized.]

SLIDE 27

[Figure: Energy Comparison for PMLB on SystemG. x-axis: Number of Cores, log2 scale; y-axis: Energy per node (J), log2 scale. Series: Original, Optimized.]

SLIDE 28

[Figure: Counter Ranking for Original PMLB on Mira. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%). Legend: PAPI_FDV_INS, PAPI_FML_INS, PAPI_RES_STL, PAPI_VEC_INS, PAPI_FP_INS, PAPI_SR_INS, PAPI_BR_NTK, PAPI_BR_MSP, PAPI_L1_ICM, PAPI_HW_INT.]

SLIDE 29

[Figure: Performance Comparison on Mira. x-axis: Number of Nodes x Number of Threads per Node; y-axis: Time (s), log2 scale. Series: Original, Optimized.]

SLIDE 30

[Figure: System Power Comparison on Mira. x-axis: Number of Nodes x Number of Threads per Node; y-axis: Power per node (W). Series: Original, Optimized.]

SLIDE 31

[Figure: Energy Comparison for PMLB with 512x512x512 on Mira. x-axis: Number of Nodes x Number of Threads per Node; y-axis: Energy per Node (J), log2 scale. Series: Original, Optimized.]

SLIDE 32

Counter Ranking on SystemG

[Figure: Counter Ranking for Original eq3dyna on SystemG. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%). Legend: PAPI_SR_INS, PAPI_FDV_INS, PAPI_L2_STM, PAPI_FML_INS, PAPI_L1_TCA, PAPI_RES_STL, PAPI_TLB_DM, PAPI_L1_STM, PAPI_L2_TCW, PAPI_BR_NTK, PAPI_FP_INS, PAPI_L2_DCW, PAPI_L2_ICA, PAPI_L1_ICM.]

SLIDE 33

[Figure: Energy Comparison of eq3dyna on SystemG. x-axis: Number of Cores, log2 scale; y-axis: Energy per Node (J), log2 scale. Series: Original, Optimized.]

SLIDE 34

Counter Ranking on ANL BGQ Mira

[Figure: Counter Ranking for Original eq3dyna on ANL BGQ Mira. x-axis: Models (Runtime, System Power, CPU Power, Memory Power); y-axis: Coefficient Percentage (%). Legend: PAPI_L1_DCM, PAPI_SR_INS, PAPI_BR_NTK, PAPI_L1_STM, PAPI_RES_STL, PAPI_LD_INS, PAPI_BR_MSP, PAPI_VEC_INS.]

SLIDE 35

[Figure: Energy Comparison for eq3dyna with 100m on ANL BG/Q Mira. x-axis: Number of Nodes x Number of Threads per Node, from 32x16 to 256x64 (max number of threads per core is 4); y-axis: Energy per node (J). Series: Original, Optimized.]

SLIDE 36

Summary and Future Work

- We used runtime and power models to identify the most important counters for optimization focus
- Our counter-guided optimizations save energy by
    - eq3dyna: Average 48.65% on Mira; 30.67% on SystemG
    - PMLB: Average 18.28% on Mira; 11.28% on SystemG
- We believe that the modeling and optimization methodology can be applied to large-scale scientific applications on other architectures such as GPU and Xeon Phi by utilizing counters for GPU and Xeon Phi.