 
Finding Performance-Optimal Configurations for High-Performance Computing
Alexander Grebhahn, Norbert Siegmund, Sven Apel
University of Passau
FOSD Meeting 2014, Dagstuhl

High-Performance Computing and ExaStencils
How can we identify performance-optimal components and parameters for specific hardware?

SPL Conqueror [Siegmund et al., 2012]
Workflow (figure): partial feature selection → prediction → optimal configuration
Example (figure): partial selection CUDA → optimal configuration {Local Memory, CUDA, Local Memory Padding = 0, Pixels per Thread = 3}
Objective function: max(performance)
Advantages:
• Detection of feature interactions
• Transparent (i.e., influences of individual features and feature interactions are explicitly modeled and quantified)
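
The workflow on this slide can be sketched in a few lines: enumerate the completions of a partial feature selection, predict each one with an additive influence model, and return the best. This is only an illustration, not SPL Conqueror's API. All influence values, the function names, and the tiny configuration space (exactly one API, optional Local Memory, numerical parameters left out) are hypothetical. It also assumes performance is measured as run time, so max(performance) becomes the minimal predicted time.

```python
from itertools import product

# Hypothetical influence model in seconds; none of these values are from the slides.
BASE_TIME = 800.0
FEATURE_INFLUENCE = {"CUDA": -200.0, "OpenCL": -150.0, "Local Memory": -300.0}
INTERACTIONS = {frozenset({"CUDA", "Local Memory"}): 50.0}


def predict(config):
    """Additive model: base time + feature influences + interaction terms."""
    time = BASE_TIME + sum(FEATURE_INFLUENCE[f] for f in config)
    time += sum(v for pair, v in INTERACTIONS.items() if pair <= config)
    return time


def completions(partial):
    """Yield every valid configuration that contains the partial selection.

    Assumption: exactly one API (CUDA or OpenCL), Local Memory is optional.
    """
    for api, local in product(["CUDA", "OpenCL"], [False, True]):
        config = frozenset({api} | ({"Local Memory"} if local else set()))
        if partial <= config:
            yield config


def optimal_configuration(partial):
    """Return the completion with the lowest predicted run time."""
    return min(completions(frozenset(partial)), key=predict)


best = optimal_configuration({"CUDA"})
print(sorted(best), predict(best))
```

Because both individual influences and interaction terms are explicit dictionary entries, the prediction for any configuration can be traced back to the features that caused it, which is the transparency the slide refers to.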

Influence of Individual Features
Feature model (figure): HIPAcc; API (CUDA, OpenCL); Local Memory
Identification: the performance difference is interpreted as the contribution of the feature in question
Values shown (figure): = 800 s, = 500 s, = -300 s
Heuristics:
• Feature-wise (FW) heuristic: quantifies the influence of individual features on performance
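
The identification step amounts to a subtraction, sketched below. The figure that labeled the measurements is lost, so this assumes 800 s is the run time without the feature in question and 500 s the run time with it. The function name is hypothetical.

```python
def feature_influence(time_without, time_with):
    """Contribution of a feature: run time with it minus run time without it."""
    return time_with - time_without


time_without = 800.0  # seconds, configuration without the feature (assumed reading)
time_with = 500.0     # seconds, same configuration plus the feature (assumed reading)

delta = feature_influence(time_without, time_with)
print(delta)  # -300.0, i.e. the feature saves 300 s
```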

Interactions Between Features
Values shown (figure): = 800 s, = 500 s, = -300 s; = 800 s, = 400 s, = -400 s; = 100 s, 350 s
Heuristics:
• Pair-wise (PW) heuristic: interactions between two features
• Higher-order (HO) heuristic: interactions between three or more features
• Hot-spot (HS) heuristic: interactions of "hot-spot" features
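
One plausible reading of the numbers on this slide: two features contribute -300 s and -400 s individually, so without an interaction the combined configuration should take 100 s, yet 350 s is measured. The sketch below computes the interaction term under that reading. The assignment of the values to configurations is an assumption, since the figure labels did not survive, and the function name is hypothetical.

```python
def interaction(base, with_a, with_b, with_both):
    """Interaction of features A and B: measured time of the combined
    configuration minus the time predicted from the individual influences."""
    delta_a = with_a - base
    delta_b = with_b - base
    predicted_both = base + delta_a + delta_b
    return with_both - predicted_both


# Assumed reading of the slide's values (seconds).
value = interaction(base=800.0, with_a=500.0, with_b=400.0, with_both=350.0)
print(value)  # 250.0
```

A non-zero result means the two features cannot be predicted from their individual influences alone, which is what the pair-wise heuristic is designed to detect.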

Numerical Parameters (Non-Boolean Features)
Existing heuristics work for boolean features only!
Discretization (figure): a numerical parameter X with values [0, 1, …, n] is replaced by boolean features X0, X1, X2, ..., Xn
Disadvantages:
• Increasing number of features
• Loss of connection between parameter values
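
A minimal sketch of the discretization described above, assuming a one-hot encoding in which exactly one of the generated boolean features is selected. The function names and the choice n = 6 are illustrative, not from the slides.

```python
def discretize(name, values):
    """Replace a numerical parameter by one boolean feature per value."""
    return [f"{name}{v}" for v in values]


def encode(name, values, selected):
    """One-hot encoding: only the feature for the selected value is True."""
    return {f"{name}{v}": v == selected for v in values}


values = range(0, 7)  # X in [0, 1, ..., 6]
features = discretize("X", values)
config = encode("X", values, selected=3)

print(features)  # one parameter has become seven boolean features
print(config)    # X3 is True; nothing records that X2 and X4 are its neighbours
```

Both disadvantages are visible here: the feature count grows with the value range, and the resulting names carry no ordering, so a heuristic cannot tell that X3 lies between X2 and X4.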

Influence of Parameters
Feature model (figure): HIPAcc; API (CUDA, OpenCL); Local Memory; parameters Padding [0..6] and Pixels per Thread [1,2,3,4,5,6,7], shown with the values 3 and 4
• Determine influence of parameter values on performance
• Learn function for each pair of parameter and feature
• Independent sampling of parameters
Heuristics:
• Function learning (FL) heuristic
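
The idea of learning a function per parameter from independent samples can be illustrated as follows. The slides do not say which function classes are learned, so a straight line fitted by ordinary least squares is used here purely as an assumption. The sample values and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical measurements: Padding is varied alone while every other
# option stays fixed (independent sampling of one parameter).
padding_values = [0, 2, 4, 6]
times = [500.0, 510.0, 520.0, 530.0]  # seconds

a, b = fit_line(padding_values, times)


def predict_time(padding):
    """Predicted run time for a Padding value that was not measured."""
    return a + b * padding


print(a, b, predict_time(3))
```

Unlike the discretized encoding, the learned function keeps the connection between neighbouring parameter values, so unmeasured values such as Padding = 3 can be interpolated.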

First Results [Grebhahn et al., 2014]
Research questions:
• What is the prediction accuracy of the different heuristics?
• Can we predict the performance-optimal configuration?
Customizable programs:
• Highly Scalable Multi-Grid Solver (HSMGS)
• Multi-Grid Solver using DUNE (DUNE MGS)

HSMGS
Feature model (figure), root HSMGP:
• pre-smoothing [0, …, 6] (shown: 3)
• post-smoothing [0, …, 6] (shown: 3)
• Number of Cores [64, 256, 1024, 4096] (shown: 64)
• coarse grid solver: IP_CG, RED_AMG, IP_AMG
• smoother: Jac, GS, GSAC, RBGS, RBGSAC, BS
Constraint: sum(pre-smoothing, post-smoothing) > 0
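
The size of this configuration space can be checked by enumeration. The sketch assumes that exactly one coarse grid solver and exactly one smoother is selected per configuration. The group notation of the diagram is lost, so this is an assumption, but the resulting count agrees with the 3 456 brute-force measurements reported in the HSMGS results.

```python
from itertools import product

PRE_SMOOTHING = range(0, 7)
POST_SMOOTHING = range(0, 7)
CORES = [64, 256, 1024, 4096]
COARSE_GRID_SOLVERS = ["IP_CG", "RED_AMG", "IP_AMG"]
SMOOTHERS = ["Jac", "GS", "GSAC", "RBGS", "RBGSAC", "BS"]


def valid_configurations():
    """Yield all configurations that satisfy pre-smoothing + post-smoothing > 0."""
    for pre, post, cores, solver, smoother in product(
        PRE_SMOOTHING, POST_SMOOTHING, CORES, COARSE_GRID_SOLVERS, SMOOTHERS
    ):
        if pre + post > 0:
            yield {
                "pre": pre,
                "post": post,
                "cores": cores,
                "coarse_grid_solver": solver,
                "smoother": smoother,
            }


configs = list(valid_configurations())
print(len(configs))  # 3456 = (7 * 7 - 1) * 4 * 3 * 6
```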

HSMGS – Results

Heu.   # M (in %)      ē ± s          x̃      δ [%]   rank
BF     3 456 (100)     0              0      0       1
FW     26 (0.8)        23.4 ± 18.7    19.0   3.8     40
PW     274 (7.9)       4.8 ± 8.6      1.8    31.4    77
HO     1 331 (38.5)    60.7 ± 67.2    41.5   270.0   312
HS     2 902 (84.0)    8.0 ± 33.9     0      270.0   55
FL     112 (3.2)       2.5 ± 3.1      1.8    0       1

Table: BF: brute force, FW: feature-wise, PW: pair-wise, HO: higher-order, HS: hot-spot, FL: function learning
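
The percentages in the # M column can be re-derived, assuming they give each heuristic's number of measurements relative to the 3 456 brute-force measurements. The variable names are illustrative.

```python
BRUTE_FORCE_MEASUREMENTS = 3456

# Number of measurements per heuristic, copied from the table.
measurements = {"FW": 26, "PW": 274, "HO": 1331, "HS": 2902, "FL": 112}

# Share of the full configuration space that each heuristic measures, in percent.
share = {
    heuristic: round(100 * count / BRUTE_FORCE_MEASUREMENTS, 1)
    for heuristic, count in measurements.items()
}

for heuristic, percent in share.items():
    print(f"{heuristic}: {percent} %")
```

Under this reading, function learning needs about 3 % of the measurements that brute force requires, while hot-spot needs 84 %.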

HSMGS – Feature Interactions (Pair-Wise)
(Figure: pair-wise feature interactions. Labels: GSRB, GSACBE, GSRBAC, GSAC, GS, JAC, RED AMG, IP AMG, IP CG; numCores 4096, numCores 1024, numCores 256, numCores 64; pre=0 to pre=6; post=0 to post=6)

HIPAcc, DUNE MGS – Results

System     Heu.   # M (in %)      ē ± s   x̃    δ [%]   rank
HIPAcc     BF     13 485 (100)    0       0    0       1
HIPAcc     HO
HIPAcc     HS
HIPAcc     FL
DUNE MGS   BF
DUNE MGS   HO
DUNE MGS   HS
DUNE MGS   FL