PETSc Tutorial
Profiling, Nonlinear Solvers, Unstructured Grids, Threads and GPUs
Karl Rupp me@karlrupp.net
Freelance Computational Scientist Seminarzentrum-Hotel Am Spiegeln, Vienna, Austria June 28-30, 2016
- The CHKMEMQ macro checks all allocated memory; track down memory overwrites by bracketing suspect code with CHKMEMQ.
- Use PetscMalloc() and PetscFree() for all allocation; print unfreed memory at PetscFinalize() with -malloc_dump.
- Valgrind checks memory access, cache performance, memory usage, etc. (http://www.valgrind.org). Pass -malloc 0 to PETSc when running under Valgrind; --trace-children=yes may be needed when running under MPI.
- Event timing
- Event flops
- Memory usage
- MPI messages

Users can add new stages and new events.
                         Max        Max/Min  Avg        Total
Time (sec):              1.548e+02  1.00122  1.547e+02
Objects:                 1.028e+03  1.00000  1.028e+03
Flops:                   1.519e+10  1.01953  1.505e+10  1.204e+11
Flops/sec:               9.814e+07  1.01829  9.727e+07  7.782e+08
MPI Messages:            8.854e+03  1.00556  8.819e+03  7.055e+04
MPI Message Lengths:     1.936e+08  1.00950  2.185e+04  1.541e+09
MPI Reductions:          2.799e+03  1.00000
Event               Count       Time (sec)       Flops
                    Max  Ratio  Max       Ratio  Max      Ratio  Mess    Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R
VecDot                43 1.0 4.8879e-02 8.3 1.77e+06 1.0 0.0e+00 0.0e+00 4.3e+01 1
VecMDot             1747 1.0 1.3021e+00 4.6 8.16e+07 1.0 0.0e+00 0.0e+00 1.7e+03 1 0 14 1 1 0 27
VecNorm             3972 1.0 1.5460e+00 2.5 8.48e+07 1.0 0.0e+00 0.0e+00 4.0e+03 1 0 31 1 1 0 61
VecScale            3261 1.0 1.6703e-01 1.0 3.38e+07 1.0 0.0e+00 0.0e+00 0.0e+00
VecScatterBegin     4503 1.0 4.0440e-01 1.0 0.00e+00 0.0 6.1e+07 2.0e+03 0.0e+00 0 50 26 0 96 53
VecScatterEnd       4503 1.0 2.8207e+00 6.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00
MatMult             3001 1.0 3.2634e+01 1.1 3.68e+09 1.1 4.9e+07 2.3e+03 0.0e+00 11 22 40 24 22 44 78 49
MatMultAdd           604 1.0 6.0195e-01 1.0 5.66e+07 1.0 3.7e+06 1.3e+02 0.0e+00 3 1 6
MatMultTranspose     676 1.0 1.3220e+00 1.6 6.50e+07 1.0 4.2e+06 1.4e+02 0.0e+00 3 1 1 7
MatSolve            3020 1.0 2.5957e+01 1.0 3.25e+09 1.0 0.0e+00 0.0e+00 0.0e+00 9 21 18 41
MatCholFctrSym         3 1.0 2.8324e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00
MatCholFctrNum        69 1.0 5.7241e+00 1.0 6.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 4 4 9
MatAssemblyBegin     119 1.0 2.8250e+00 1.5 0.00e+00 0.0 2.1e+06 5.4e+04 3.1e+02 1 2 24 2 2 3 47 5
MatAssemblyEnd       119 1.0 1.9689e+00 1.4 0.00e+00 0.0 2.8e+05 1.3e+03 6.8e+01 1 1 1 1
SNESSolve              4 1.0 1.4302e+02 1.0 8.11e+09 1.0 6.3e+07 3.8e+03 6.3e+03 51 50 52 50 50 99 100 99 100 97
SNESLineSearch        43 1.0 1.5116e+01 1.0 1.05e+08 1.1 2.4e+06 3.6e+03 1.8e+02 5 1 2 2 1 10 1 4 4 3
SNESFunctionEval      55 1.0 1.4930e+01 1.0 0.00e+00 0.0 1.8e+06 3.3e+03 8.0e+00 5 1 1 10 3 3
SNESJacobianEval      43 1.0 3.7077e+01 1.0 7.77e+06 1.0 4.3e+06 2.6e+04 3.0e+02 13 4 24 2 26 7 48 5
KSPGMRESOrthog      1747 1.0 1.5737e+00 2.9 1.63e+08 1.0 0.0e+00 0.0e+00 1.7e+03 1 1 0 14 1 2 0 27
KSPSetup             224 1.0 2.1040e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+01
KSPSolve              43 1.0 8.9988e+01 1.0 7.99e+09 1.0 5.6e+07 2.0e+03 5.8e+03 32 49 46 24 46 62 99 88 48 88
PCSetUp              112 1.0 1.7354e+01 1.0 6.75e+08 1.0 0.0e+00 0.0e+00 8.7e+01 6 4 1 12 9 1
PCSetUpOnBlocks     1208 1.0 5.8182e+00 1.0 6.75e+08 1.0 0.0e+00 0.0e+00 8.7e+01 2 4 1 4 9 1
PCApply              276 1.0 7.1497e+01 1.0 7.14e+09 1.0 5.2e+07 1.8e+03 5.1e+03 25 44 42 20 41 49 88 81 39 79
- Reduction-heavy operations (VecDot, VecMDot, VecNorm, MatAssemblyBegin): change the algorithm (e.g. IBCGS).
- Communication-heavy operations (VecScatter, MatMult, PCApply, MatAssembly, SNESFunctionEval, SNESJacobianEval): compute subdomain boundary fluxes redundantly, do the ghost exchange for all fields at once, use a better partition.
- Residual: FormFunction(), set by SNESSetFunction()
- Jacobian: FormJacobian(), set by SNESSetJacobian()
PetscErrorCode (*func)(SNES snes, Vec x, Vec r, void *ctx)

- x: the current solution
- r: the residual
- ctx: the user context passed to SNESSetFunction(); use this to pass application information, e.g. physical constants
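As a minimal sketch of such a callback (the AppCtx struct and the toy residual F_i(x) = x_i^2 - b are illustrative, not from the tutorial; error checking via CHKERRQ is omitted for brevity):

```c
#include <petscsnes.h>

typedef struct {
  PetscReal b;   /* example physical constant carried in the context */
} AppCtx;

PetscErrorCode FormFunction(SNES snes, Vec x, Vec r, void *ctx)
{
  AppCtx            *user = (AppCtx*)ctx;
  const PetscScalar *xx;
  PetscScalar       *rr;
  PetscInt          i, n;

  PetscFunctionBeginUser;
  VecGetLocalSize(x, &n);
  VecGetArrayRead(x, &xx);
  VecGetArray(r, &rr);
  for (i = 0; i < n; i++) rr[i] = xx[i]*xx[i] - user->b;  /* r = F(x) */
  VecRestoreArrayRead(x, &xx);
  VecRestoreArray(r, &rr);
  PetscFunctionReturn(0);
}

/* Registered with: SNESSetFunction(snes, r, FormFunction, &user); */
```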
PetscErrorCode (*func)(SNES snes, Vec x, Mat *J, Mat *M, MatStructure *flag, void *ctx)

- x: the current solution
- J: the Jacobian
- M: the Jacobian preconditioning matrix (possibly J itself)
- flag: SAME_NONZERO_PATTERN or DIFFERENT_NONZERO_PATTERN
- ctx: the user context passed to SNESSetJacobian(); use this to pass application information, e.g. physical constants
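A sketch of a Jacobian callback for the same toy residual F_i(x) = x_i^2 - b used above (the names and the diagonal-only assembly are illustrative; error checking is omitted):

```c
#include <petscsnes.h>

typedef struct { PetscReal b; } AppCtx;

PetscErrorCode FormJacobian(SNES snes, Vec x, Mat *J, Mat *M,
                            MatStructure *flag, void *ctx)
{
  const PetscScalar *xx;
  PetscScalar       v;
  PetscInt          i, rstart, rend;

  PetscFunctionBeginUser;
  VecGetArrayRead(x, &xx);
  MatGetOwnershipRange(*M, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    v = 2.0*xx[i - rstart];           /* dF_i/dx_i for F_i = x_i^2 - b */
    MatSetValues(*M, 1, &i, 1, &i, &v, INSERT_VALUES);
  }
  VecRestoreArrayRead(x, &xx);
  MatAssemblyBegin(*M, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(*M, MAT_FINAL_ASSEMBLY);
  *flag = SAME_NONZERO_PATTERN;       /* sparsity unchanged between calls */
  PetscFunctionReturn(0);
}
```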
- Activated by -snes_fd; computed by SNESDefaultComputeJacobian()
- Coloring is created by MatFDColoringCreate(); computed by SNESDefaultComputeJacobianColor()
- Uses the preconditioning matrix from SNESSetJacobian()
SNESSetDM(snes, dm);
DMDASNESSetFunctionLocal(dm, INSERT_VALUES,
                         (DMDASNESFunction)FormFunctionLocal, &user);
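A sketch of what a local residual function registered this way might look like, for a 1-D Bratu-type problem (the AppCtx, the lambda constant, and the stencil are illustrative assumptions, not from the tutorial; in 2-D the arrays would be doubly indexed):

```c
#include <petscsnes.h>
#include <petscdmda.h>

typedef struct { PetscReal lambda; } AppCtx;

PetscErrorCode FormFunctionLocal(DMDALocalInfo *info, PetscScalar *x,
                                 PetscScalar *f, AppCtx *user)
{
  PetscInt  i;
  PetscReal h = 1.0/(info->mx - 1);

  PetscFunctionBeginUser;
  /* Loop over locally owned points; ghost values of x are already
     up to date, so x[i-1] and x[i+1] are safe to read. */
  for (i = info->xs; i < info->xs + info->xm; i++) {
    if (i == 0 || i == info->mx - 1)
      f[i] = x[i];                                   /* Dirichlet BC */
    else
      f[i] = (2.0*x[i] - x[i-1] - x[i+1])/h
             - h*user->lambda*PetscExpScalar(x[i]);  /* interior stencil */
  }
  PetscFunctionReturn(0);
}
```

The DM handles the global-to-local scatter and assembly of the global residual, so the user code only sees local indices and ghosted data.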
[Figure: Peak double-precision floating-point operations per watt (GFLOP/sec per Watt), end-of-year values 2007-2015, for Intel Xeon CPUs (X5482 through E5-2699 v4), NVIDIA Tesla GPUs (C1060 through K40), AMD Radeon/FirePro GPUs (HD 3870 through FirePro W9100), and Intel MIC (Xeon Phi 7120).]
[Figure: STREAM benchmark bandwidth (GB/sec) versus number of threads for E5-2670 v3 (Haswell), E5-2650 v2 (Ivy Bridge), E5-2620 (Sandy Bridge), Xeon Phi 7120, and 68-core KNL with DDR4 and with MCDRAM.]
https://www.karlrupp.net/2016/07/knights-landing-vs-knights-corner-haswell-ivy-bridge-and-sandy-bridge-stream-benchmark-results/
./configure [..] --with-cuda=1
./configure [..] --download-viennacl
struct _p_Vec {
  ...
  void          *data;            // host buffer
  PetscCUSPFlag  valid_GPU_array; // flag
  void          *spptr;           // device buffer
};

typedef enum {PETSC_CUSP_UNALLOCATED,
              PETSC_CUSP_GPU,
              PETSC_CUSP_CPU,
              PETSC_CUSP_BOTH} PetscCUSPFlag;
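The flag makes transfers lazy: data only moves when the sole valid copy is on the wrong side. A plain-C sketch of that bookkeeping (ToyVec and the helper names are hypothetical stand-ins; in PETSc the actual transfers are CUDA copies):

```c
/* Mirrors the PETSC_CUSP_* states above. */
typedef enum { UNALLOCATED, ON_GPU, ON_CPU, ON_BOTH } ValidFlag;

typedef struct {
  double   *host;    /* mirrors vec->data  */
  double   *device;  /* mirrors vec->spptr */
  ValidFlag flag;    /* mirrors valid_GPU_array */
} ToyVec;

/* Before a host read: copy down only if the sole valid copy lives on
   the GPU; afterwards both copies are valid. */
static void to_host_for_read(ToyVec *v) {
  if (v->flag == ON_GPU) {
    /* device-to-host transfer would happen here */
    v->flag = ON_BOTH;
  }
}

/* After a host write: the device copy is stale. */
static void host_write_done(ToyVec *v) {
  v->flag = ON_CPU;
}
```

Repeated host-side operations therefore pay for at most one download; the next device-side operation triggers the upload symmetrically.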
PetscErrorCode VecSetRandom_SeqCUSP_Private(..)
{
  VecGetArray(...);
  // some operation on host memory
  VecRestoreArray(...);
}

PetscErrorCode VecAYPX_SeqCUSP(..)
{
  VecCUSPGetArrayReadWrite(...);
  // some operation on raw handles on device
  VecCUSPRestoreArrayReadWrite(...);
}
$> ./ex12
KSPGMRESOrthog       228 1.0 6.2901e-01
KSPSolve               1 1.0 2.7332e+00
$> ./ex12 -vec_type cusp -mat_type aijcusp
[0]PETSC ERROR: MatSolverPackage petsc does not support matrix type seqaijcusp
$> ./ex12
KSPGMRESOrthog      1630 1.0 4.5866e+00
KSPSolve               1 1.0 1.6361e+01
$> ./ex12 -vec_type cusp -mat_type aijcusp
MatCUSPCopyTo          1 1.0 5.6108e-02
KSPGMRESOrthog      1630 1.0 5.5989e-01
KSPSolve               1 1.0 1.0202e+00
$> ./ex12 -vec_type viennacl -mat_type aijviennacl ...
$> ./ex12 -vec_type viennacl -mat_type aijviennacl
[Figure: Sparse solver speedups on NVIDIA Tesla K20m and AMD FirePro W9100 for the matrices windtunnel, ship, spheres, cantilever, and protein, comparing ViennaCL 1.6.2, PARALUTION 0.7.0, MAGMA 1.5.0, and CUSP 0.4.0 (some libraries N/A on the FirePro).]
[Figure: Sparse matrix-vector product performance (GFLOP/sec) on an NVIDIA Tesla K20m across 20 test matrices (cantilever, economics, epidemiology, harbor, protein, qcd, ship, spheres, windtunnel, accelerator, amazon0312, ca-CondMat, cit-Patents, circuit, email-Enron, p2p-Gnutella31, roadNet-CA, webbase1m, web-Google, wiki-Vote), comparing ViennaCL 1.7.0, CUSPARSE 7, and CUSP 0.5.1.]
[Figure: Sparse matrix-vector product performance (GFLOP/sec) for cantilever, economics, epidemiology, harbor, protein, qcd, ship, spheres, and windtunnel, comparing ViennaCL 1.7.0 on FirePro W9100, Tesla K20m, Xeon E5-2670 v3, and Xeon Phi 7120; CUSPARSE 7 and CUSP 0.5.1 on the K20m; and MKL 11.2.1 on the Xeon and Xeon Phi.]
[Figure: Total solver execution time (sec) versus number of unknowns for the Poisson equation in 2D, with no preconditioner and with smoothed aggregation, on dual Intel Xeon E5-2670 v3, AMD FirePro W9100, NVIDIA Tesla K20m, and Intel Xeon Phi 7120.]
[Figure: Execution time (sec) versus number of unknowns for the Poisson equation in 3D with linear finite elements, comparing sequential ILU0 and Chow-Patel ILU0 on dual Intel Xeon E5-2670 v3, AMD FirePro W9100, NVIDIA Tesla K20m, and Intel Xeon Phi 7120.]