SLIDE 1

PETSc Tutorial

Profiling, Nonlinear Solvers, Unstructured Grids, Threads and GPUs

Karl Rupp me@karlrupp.net

Freelance Computational Scientist
Seminarzentrum-Hotel Am Spiegeln, Vienna, Austria
June 28-30, 2016

SLIDE 2

Table of Contents

  • Debugging and Profiling
  • Nonlinear Solvers
  • Unstructured Grids
  • PETSc and GPUs

SLIDE 3

PETSc Debugging and Profiling

SLIDE 4

PETSc Debugging

By default, a debug build is provided. Launch the debugger with one of the options below (a combined command-line example follows).

  • -start_in_debugger [gdb,dbx,noxterm]
  • -on_error_attach_debugger [gdb,dbx,noxterm]

Attach the debugger only to some parallel processes

  • -debugger_nodes 0,1

Set the display (often necessary on a cluster)

  • -display :0
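
A combined invocation might look like this (the executable name ./app and the process count are placeholders; the options are the ones listed above):

$> mpiexec -n 4 ./app -start_in_debugger gdb -debugger_nodes 0,1 -display :0

This attaches gdb only to ranks 0 and 1 and sends the debugger windows to display :0.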
SLIDE 5

Debugging Tips

Put a breakpoint in PetscError() to catch errors as they occur.

PETSc tracks memory overwrites at both ends of arrays:

  • The CHKMEMQ macro causes a check of all allocated memory
  • Track memory overwrites by bracketing them with CHKMEMQ

PETSc checks for leaked memory

  • Use PetscMalloc() and PetscFree() for all allocation
  • Print unfreed memory on PetscFinalize() with -malloc_dump

Simply the best tool today is Valgrind

  • It checks memory access, cache performance, memory usage, etc.
  • http://www.valgrind.org
  • Pass -malloc 0 to PETSc when running under Valgrind
  • Might need --trace-children=yes when running under MPI
  • --track-origins=yes handy for uninitialized memory
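
As a sketch of the combined workflow (hypothetical executable ./app; the Valgrind and PETSc options are the ones listed above):

$> mpiexec -n 2 valgrind --trace-children=yes --track-origins=yes ./app -malloc 0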
SLIDE 6

PETSc Profiling

Profiling

Use -log_summary for a performance profile

  • Event timing
  • Event flops
  • Memory usage
  • MPI messages

Call PetscLogStagePush() and PetscLogStagePop()

User can add new stages

Call PetscLogEventBegin() and PetscLogEventEnd()

User can add new events

Call PetscLogFlops() to include your flops
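
A minimal sketch of user-defined stages and events (the names MyStage/MyEvent, the registered class, and the flop count are illustrative, not part of the slides):

#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscLogStage stage;
  PetscLogEvent event;
  PetscClassId  classid;

  PetscInitialize(&argc, &argv, NULL, NULL);

  PetscClassIdRegister("MyClass", &classid);          /* class for the user event */
  PetscLogStageRegister("MyStage", &stage);           /* user-defined stage */
  PetscLogEventRegister("MyEvent", classid, &event);  /* user-defined event */

  PetscLogStagePush(stage);
  PetscLogEventBegin(event, 0, 0, 0, 0);
  /* ... user computation ... */
  PetscLogFlops(1000.0);                              /* report your own flops */
  PetscLogEventEnd(event, 0, 0, 0, 0);
  PetscLogStagePop();

  PetscFinalize();
  return 0;
}

The stage and event then appear as separate entries in the -log_summary output.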

SLIDE 7

PETSc Profiling

Reading -log_summary

                        Max        Max/Min   Avg        Total
Time (sec):             1.548e+02  1.00122   1.547e+02
Objects:                1.028e+03  1.00000   1.028e+03
Flops:                  1.519e+10  1.01953   1.505e+10  1.204e+11
Flops/sec:              9.814e+07  1.01829   9.727e+07  7.782e+08
MPI Messages:           8.854e+03  1.00556   8.819e+03  7.055e+04
MPI Message Lengths:    1.936e+08  1.00950   2.185e+04  1.541e+09
MPI Reductions:         2.799e+03  1.00000

  • Also a summary per stage
  • Memory usage per stage (based on when it was allocated)
  • Time, messages, reductions, balance, flops per event per stage
  • Always send -log_summary when asking performance questions on the mailing list

SLIDE 8

PETSc Profiling

Event              Count      Time (sec)      Flops               --- Global ---     --- Stage ---
                   Max Ratio  Max      Ratio  Max      Ratio  Mess    Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R

--- Event Stage 1: Full solve ---

VecDot                43 1.0  4.8879e-02 8.3  1.77e+06 1.0  0.0e+00 0.0e+00 4.3e+01  1
VecMDot             1747 1.0  1.3021e+00 4.6  8.16e+07 1.0  0.0e+00 0.0e+00 1.7e+03  1  0 14  1  1  0 27
VecNorm             3972 1.0  1.5460e+00 2.5  8.48e+07 1.0  0.0e+00 0.0e+00 4.0e+03  1  0 31  1  1  0 61
VecScale            3261 1.0  1.6703e-01 1.0  3.38e+07 1.0  0.0e+00 0.0e+00 0.0e+00
VecScatterBegin     4503 1.0  4.0440e-01 1.0  0.00e+00 0.0  6.1e+07 2.0e+03 0.0e+00  0 50 26  0 96 53
VecScatterEnd       4503 1.0  2.8207e+00 6.4  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00
MatMult             3001 1.0  3.2634e+01 1.1  3.68e+09 1.1  4.9e+07 2.3e+03 0.0e+00 11 22 40 24 22 44 78 49
MatMultAdd           604 1.0  6.0195e-01 1.0  5.66e+07 1.0  3.7e+06 1.3e+02 0.0e+00  3  1  6
MatMultTranspose     676 1.0  1.3220e+00 1.6  6.50e+07 1.0  4.2e+06 1.4e+02 0.0e+00  3  1  1  7
MatSolve            3020 1.0  2.5957e+01 1.0  3.25e+09 1.0  0.0e+00 0.0e+00 0.0e+00  9 21 18 41
MatCholFctrSym         3 1.0  2.8324e-04 1.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00
MatCholFctrNum        69 1.0  5.7241e+00 1.0  6.75e+08 1.0  0.0e+00 0.0e+00 0.0e+00  2  4  4  9
MatAssemblyBegin     119 1.0  2.8250e+00 1.5  0.00e+00 0.0  2.1e+06 5.4e+04 3.1e+02  1  2 24  2  2  3 47  5
MatAssemblyEnd       119 1.0  1.9689e+00 1.4  0.00e+00 0.0  2.8e+05 1.3e+03 6.8e+01  1  1  1  1
SNESSolve              4 1.0  1.4302e+02 1.0  8.11e+09 1.0  6.3e+07 3.8e+03 6.3e+03 51 50 52 50 50 99 100 99 100 97
SNESLineSearch        43 1.0  1.5116e+01 1.0  1.05e+08 1.1  2.4e+06 3.6e+03 1.8e+02  5  1  2  2  1 10  1  4  4  3
SNESFunctionEval      55 1.0  1.4930e+01 1.0  0.00e+00 0.0  1.8e+06 3.3e+03 8.0e+00  5  1  1 10  3  3
SNESJacobianEval      43 1.0  3.7077e+01 1.0  7.77e+06 1.0  4.3e+06 2.6e+04 3.0e+02 13  4 24  2 26  7 48  5
KSPGMRESOrthog      1747 1.0  1.5737e+00 2.9  1.63e+08 1.0  0.0e+00 0.0e+00 1.7e+03  1  1  0 14  1  2  0 27
KSPSetup             224 1.0  2.1040e-02 1.0  0.00e+00 0.0  0.0e+00 0.0e+00 3.0e+01
KSPSolve              43 1.0  8.9988e+01 1.0  7.99e+09 1.0  5.6e+07 2.0e+03 5.8e+03 32 49 46 24 46 62 99 88 48 88
PCSetUp              112 1.0  1.7354e+01 1.0  6.75e+08 1.0  0.0e+00 0.0e+00 8.7e+01  6  4  1 12  9  1
PCSetUpOnBlocks     1208 1.0  5.8182e+00 1.0  6.75e+08 1.0  0.0e+00 0.0e+00 8.7e+01  2  4  1  4  9  1
PCApply              276 1.0  7.1497e+01 1.0  7.14e+09 1.0  5.2e+07 1.8e+03 5.1e+03 25 44 42 20 41 49 88 81 39 79

SLIDE 9

PETSc Profiling

Communication Costs

Reductions: usually part of Krylov method, latency limited

  • VecDot, VecMDot, VecNorm, MatAssemblyBegin
  • Change algorithm (e.g. IBCGS)

Point-to-point (nearest neighbor), latency or bandwidth

  • VecScatter, MatMult, PCApply, MatAssembly, SNESFunctionEval, SNESJacobianEval
  • Compute subdomain boundary fluxes redundantly
  • Ghost exchange for all fields at once
  • Better partition

SLIDE 10

PETSc Nonlinear Solvers

SLIDE 11

Newton Iteration: Workhorse of SNES

Standard form of a nonlinear system

−∇ · ( |∇u|^(p−2) ∇u ) − λ e^u = F(u) = 0

Iteration

  • Solve: J(u) w = −F(u)
  • Update: u+ ← u + w
  • Quadratically convergent near a root: |u^(n+1) − u*| ∈ O( |u^n − u*|² )

Picard is the same operation with a different J(u)

Jacobian Matrix for p-Bratu Equation

J(u) w ∼ −∇ · ( (η 1 + η′ ∇u ⊗ ∇u) ∇w ) − λ e^u w

η′ = (p − 2)/2 · η / (ε² + γ)

SLIDE 12

SNES

Scalable Nonlinear Equation Solvers

  • Newton solvers: Line Search, Trust Region
  • Inexact Newton methods: Newton-Krylov
  • Matrix-free methods: with iterative linear solvers

How to get the Jacobian Matrix?

  • Implement it by hand
  • Let PETSc finite-difference it
  • Use automatic differentiation software

SLIDE 13

Nonlinear Solvers in PETSc SNES

  • LS, TR: Newton-type with line search and trust region
  • NRichardson: Nonlinear Richardson, usually preconditioned
  • VIRS, VISS: reduced space and semi-smooth methods for variational inequalities
  • QN: Quasi-Newton methods like BFGS
  • NGMRES: Nonlinear GMRES
  • NCG: Nonlinear Conjugate Gradients
  • GS: Nonlinear Gauss-Seidel/multiplicative Schwarz sweeps
  • FAS: Full approximation scheme (nonlinear multigrid)
  • MS: Multi-stage smoothers, often used with FAS for hyperbolic problems
  • Shell: Your method, often used as a (nonlinear) preconditioner
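
All of these can be selected at run time with -snes_type; for example (SNES tutorial ex5 is used here only as a stand-in for any SNES-based code):

$> ./ex5 -snes_type ngmres -snes_monitor
$> ./ex5 -snes_type qn -snes_monitor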

SLIDE 14

SNES Paradigm

SNES Interface based upon Callback Functions

  • FormFunction(), set by SNESSetFunction()
  • FormJacobian(), set by SNESSetJacobian()

Evaluating the nonlinear residual F(x)

  • Solver calls the user’s function
  • User function gets application state through the ctx variable
  • PETSc never sees application data

SLIDE 15

SNES Function

The user-provided function which calculates the nonlinear residual of F(u) = 0 has the signature:

PetscErrorCode (*func)(SNES snes, Vec x, Vec r, void *ctx)

  • x - The current solution
  • r - The residual
  • ctx - The user context passed to SNESSetFunction()
  • Use this to pass application information, e.g. physical constants
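
A minimal sketch of such a residual callback and its registration (the AppCtx contents and the algebraic form of the residual are made up for illustration):

#include <petscsnes.h>

typedef struct {
  PetscReal lambda;   /* example physical constant */
} AppCtx;

PetscErrorCode FormFunction(SNES snes, Vec x, Vec r, void *ctx)
{
  AppCtx            *user = (AppCtx*)ctx;
  const PetscScalar *xx;
  PetscScalar       *rr;
  PetscInt           i, n;

  VecGetLocalSize(x, &n);
  VecGetArrayRead(x, &xx);
  VecGetArray(r, &rr);
  for (i = 0; i < n; i++) rr[i] = xx[i]*xx[i] - user->lambda;  /* toy residual */
  VecRestoreArrayRead(x, &xx);
  VecRestoreArray(r, &rr);
  return 0;
}

/* registration with the solver: SNESSetFunction(snes, r, FormFunction, &user); */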

SLIDE 16

SNES Jacobian

User-provided function calculating the Jacobian Matrix

PetscErrorCode (*func)(SNES snes, Vec x, Mat *J, Mat *M, MatStructure *flag, void *ctx)

  • x - The current solution
  • J - The Jacobian
  • M - The Jacobian preconditioning matrix (possibly J itself)
  • ctx - The user context passed to SNESSetJacobian()
  • Use this to pass application information, e.g. physical constants

Possible MatStructure values are:

  • SAME_NONZERO_PATTERN
  • DIFFERENT_NONZERO_PATTERN

Alternatives

  • a built-in sparse finite difference approximation (“coloring”)
  • automatic differentiation (ADIC/ADIFOR)

SLIDE 17

Finite Difference Jacobians

PETSc can compute and explicitly store a Jacobian

Dense

  • Activated by -snes_fd
  • Computed by SNESDefaultComputeJacobian()

Sparse via colorings

  • Coloring is created by MatFDColoringCreate()
  • Computed by SNESDefaultComputeJacobianColor()

Matrix-free Newton-Krylov via 1st-order FD is also possible

  • Activated by -snes_mf without preconditioning
  • Activated by -snes_mf_operator with user-defined preconditioning

Uses preconditioning matrix from SNESSetJacobian()
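
For example, with any SNES-based code (SNES tutorial ex5 is used purely as an illustration):

$> ./ex5 -snes_fd -snes_monitor
$> ./ex5 -snes_mf_operator -pc_type ilu -snes_monitor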

SLIDE 18

DMDA and SNES

Fusing Distributed Arrays and Nonlinear Solvers

Make DM known to SNES solver

SNESSetDM(snes,dm);

Attach residual evaluation routine

DMDASNESSetFunctionLocal(dm,INSERT_VALUES, (DMDASNESFunction)FormFunctionLocal, &user);

Ready to Roll

  • First solver implementation completed
  • Uses finite-differencing to obtain the Jacobian matrix
  • Rather slow, but scalable!
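
Putting the two calls above into a complete driver, a minimal sketch might read as follows (grid sizes, the AppCtx contents, and FormFunctionLocal are placeholders, not prescribed by the slides):

#include <petscsnes.h>
#include <petscdmda.h>

typedef struct { PetscReal lambda; } AppCtx;   /* hypothetical application context */

/* residual evaluation on the local patch, assumed to be defined elsewhere */
extern PetscErrorCode FormFunctionLocal(DMDALocalInfo*, PetscScalar**, PetscScalar**, AppCtx*);

int main(int argc, char **argv)
{
  SNES   snes;
  DM     dm;
  Vec    u;
  AppCtx user;

  PetscInitialize(&argc, &argv, NULL, NULL);

  DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
               DMDA_STENCIL_STAR, 65, 65, PETSC_DECIDE, PETSC_DECIDE,
               1, 1, NULL, NULL, &dm);

  SNESCreate(PETSC_COMM_WORLD, &snes);
  SNESSetDM(snes, dm);                                 /* make DM known to SNES */
  DMDASNESSetFunctionLocal(dm, INSERT_VALUES,
                           (DMDASNESFunction)FormFunctionLocal, &user);
  SNESSetFromOptions(snes);                            /* e.g. -snes_fd for the Jacobian */

  DMCreateGlobalVector(dm, &u);
  SNESSolve(snes, NULL, u);

  VecDestroy(&u);
  SNESDestroy(&snes);
  DMDestroy(&dm);
  PetscFinalize();
  return 0;
}

Running with -snes_fd then reproduces the finite-difference Jacobian setup described above.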

SLIDE 19

PETSc and GPUs

SLIDE 20

Why bother?

Don’t believe anything unless you can run it

Matt Knepley

SLIDE 21

Why bother?

GFLOPs/Watt

[Chart: Peak double-precision floating point operations per Watt (GFLOP/sec per Watt) by end of year, 2007-2015, comparing Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon/FirePro GPUs, and Intel Xeon Phi (MIC).]

SLIDE 22

Why bother?

Procurements

  • Theta (ANL, 2016): 2nd generation INTEL Xeon Phi
  • Summit (ORNL, 2017), Sierra (LLNL, 2017): NVIDIA Volta GPU
  • Aurora (ANL, 2018): 3rd generation INTEL Xeon Phi

[Chart: STREAM benchmark results, bandwidth (GB/sec) vs. number of threads, for E5-2670 v3 (Haswell), E5-2650 v2 (Ivy Bridge), E5-2620 (Sandy Bridge), Xeon Phi 7120, and KNL with 68 cores using DDR4 and MCDRAM.]

https://www.karlrupp.net/2016/07/knights-landing-vs-knights-corner-haswell-ivy-bridge-and-sandy-bridge-stream-benchmark-results/

SLIDE 23

PETSc on GPUs and MIC: Current Status

SLIDE 24

Available Options

Native on Xeon Phi

Cross-compile for Xeon Phi

CUDA

CUDA-support through CUSP as well as native

  • -vec_type cusp -mat_type aijcusp
  • -vec_type cuda -mat_type aijcusparse

Only for NVIDIA GPUs

CUDA/OpenCL/OpenMP

CUDA/OpenCL/OpenMP-support through ViennaCL

  • -vec_type viennacl -mat_type aijviennacl

OpenCL performance on CPUs and MIC is fairly poor

SLIDE 25

Configuration

CUDA

CUDA-enabled configuration (minimum)

./configure [..] --with-cuda=1

With CUSP:

  • --with-cusp=1 --with-cusp-dir=/path/to/cusp

Customization:

  • --with-cudac=/path/to/cuda/bin/nvcc
  • --with-cuda-arch=sm_20

OpenCL (ViennaCL)

OpenCL-enabled configuration

./configure [..] --download-viennacl

  • --with-opencl-include=/path/to/OpenCL/include
  • --with-opencl-lib=/path/to/libOpenCL.so
SLIDE 26

How Does It Work?

Host and Device Data

struct _p_Vec {
  ...
  void          *data;            // host buffer
  PetscCUSPFlag  valid_GPU_array; // flag
  void          *spptr;           // device buffer
};

Possible Flag States

typedef enum {PETSC_CUSP_UNALLOCATED, PETSC_CUSP_GPU, PETSC_CUSP_CPU, PETSC_CUSP_BOTH} PetscCUSPFlag;

SLIDE 27

How Does It Work?

Fallback-Operations on Host

Data becomes valid on host (PETSC_CUSP_CPU)

PetscErrorCode VecSetRandom_SeqCUSP_Private(..)
{
  VecGetArray(...);
  // some operation on host memory
  VecRestoreArray(...);
}

Accelerated Operations on Device

Data becomes valid on device (PETSC_CUSP_GPU)

PetscErrorCode VecAYPX_SeqCUSP(..)
{
  VecCUSPGetArrayReadWrite(...);
  // some operation on raw handles on device
  VecCUSPRestoreArrayReadWrite(...);
}

SLIDE 28

Example

KSP ex12 on Host

$> ./ex12 -pc_type ilu -m 200 -n 200 -log_summary

KSPGMRESOrthog      228 1.0  6.2901e-01
KSPSolve              1 1.0  2.7332e+00

KSP ex12 on Device

$> ./ex12 -vec_type cusp -mat_type aijcusp -pc_type ilu -m 200 -n 200 -log_summary

[0]PETSC ERROR: MatSolverPackage petsc does not support matrix type seqaijcusp

SLIDE 29

Example

KSP ex12 on Host

$> ./ex12 -pc_type none -m 200 -n 200 -log_summary

KSPGMRESOrthog     1630 1.0  4.5866e+00
KSPSolve              1 1.0  1.6361e+01

KSP ex12 on Device

$> ./ex12 -vec_type cusp -mat_type aijcusp -pc_type none -m 200 -n 200 -log_summary

MatCUSPCopyTo         1 1.0  5.6108e-02
KSPGMRESOrthog     1630 1.0  5.5989e-01
KSPSolve              1 1.0  1.0202e+00

SLIDE 30

Pitfalls

Pitfall: Repeated Host-Device Copies

  • PCI-Express transfers kill performance
  • Complete algorithm needs to run on the device
  • Problematic for explicit time-stepping, etc.

Pitfall: Wrong Data Sizes

  • Data too small: kernel launch latencies dominate
  • Data too big: out of memory

Pitfall: Function Pointers

  • Pass CUDA function “pointers” through library boundaries?
  • OpenCL: pass kernel sources; user-data hard to pass
  • Composability?

SLIDE 31

Current GPU-Functionality in PETSc

  • Programming Model: CUDA (CUSP/CUDA); CUDA/OpenCL/OpenMP (ViennaCL)
  • Operations: Vector, MatMult (both backends)
  • Matrix Formats: CSR, ELL, HYB (CUSP/CUDA); CSR (ViennaCL)
  • Preconditioners: SA-AMG, BiCGStab
  • MPI-related: Scatter
  • Additional Functionality: MatMult via cuSPARSE; OpenCL residual evaluation for PetscFE

SLIDE 32

PETSc on GPUs and MIC: Current Directions

SLIDE 33

Current: CUDA

Split CUDA-buffers from CUSP

  • Vector operations by cuBLAS
  • MatMult by different packages
  • CUSP (and others) provides add-on functionality

[Diagram: plain CUDA buffers shared by CUSP and ViennaCL]

More CUSP Functionality in PETSc

  • Relaxations (Gauss-Seidel, SOR)
  • Polynomial preconditioners
  • Approximate inverses

SLIDE 34

Current: PETSc + ViennaCL

ViennaCL

  • CUDA, OpenCL, OpenMP backends
  • Backend switch at runtime
  • Only OpenCL exposed in PETSc
  • Focus on shared memory machines

Recent Advances

  • Pipelined Krylov solvers
  • Fast sparse matrix-vector products
  • Fast sparse matrix-matrix products
  • Fine-grained algebraic multigrid
  • Fine-grained parallel ILU

SLIDE 35

Current: PETSc + ViennaCL

Current Use of ViennaCL in PETSc

$> ./ex12 -vec_type viennacl -mat_type aijviennacl ...

Executes on OpenCL device

New Use of ViennaCL in PETSc

$> ./ex12 -vec_type viennacl -mat_type aijviennacl -viennacl_backend openmp ...

Pros and Cons

  • Use CPU + GPU simultaneously
  • Non-intrusive, uses the plugin mechanism
  • Non-optimal in the strong-scaling limit
  • Gather experiences for the best long-term solution

SLIDE 36

Upcoming PETSc+ViennaCL Features

Pipelined CG Method, Execution Time per Iteration

[Chart: Relative execution time (%) per iteration for the matrices windtunnel, ship, spheres, cantilever, and protein, comparing ViennaCL 1.6.2, PARALUTION 0.7.0, MAGMA 1.5.0, and CUSP 0.4.0 on an NVIDIA Tesla K20m and an AMD FirePro W9100.]

SLIDE 37

Upcoming PETSc+ViennaCL Features

Sparse Matrix-Vector Multiplication

[Chart: Sparse matrix-vector multiplication throughput (GFLOP/sec) on a Tesla K20m for matrices including cantilever, economics, epidemiology, harbor, protein, qcd, ship, spheres, windtunnel, accelerator, amazon0312, ca-CondMat, cit-Patents, circuit, email-Enron, p2p-Gnutella31, roadNet-CA, webbase1m, web-Google, and wiki-Vote, comparing ViennaCL 1.7.0, CUSPARSE 7, and CUSP 0.5.1.]

SLIDE 38

Upcoming PETSc+ViennaCL Features

Sparse Matrix-Matrix Products

[Chart: Sparse matrix-matrix product performance (GFLOPs) for the matrices cantilever, economics, epidemiology, harbor, protein, qcd, ship, spheres, and windtunnel, comparing ViennaCL 1.7.0 (FirePro W9100, Tesla K20m, Xeon E5-2670 v3, Xeon Phi 7120), CUSPARSE 7 and CUSP 0.5.1 (Tesla K20m), and MKL 11.2.1 (Xeon E5-2670 v3, Xeon Phi 7120).]

SLIDE 39

Upcoming PETSc+ViennaCL Features

Algebraic Multigrid Preconditioners

[Chart: Total solver execution time (sec) vs. number of unknowns for the Poisson equation in 2D, with no preconditioner vs. smoothed aggregation AMG, on dual INTEL Xeon E5-2670 v3, AMD FirePro W9100, NVIDIA Tesla K20m, and INTEL Xeon Phi 7120.]

SLIDE 40

Pipelined Solvers

Fine-Grained Parallel ILU (Chow and Patel, SISC, 2015)

[Chart: Execution time (sec) vs. number of unknowns for the Poisson equation in 3D with linear finite elements, comparing sequential ILU0 and Chow-Patel ILU0 on dual INTEL Xeon E5-2670 v3, AMD FirePro W9100, NVIDIA Tesla K20m, and INTEL Xeon Phi 7120.]

SLIDE 41

GPU Summary and Conclusion

Currently Available

  • CUSP for CUDA, ViennaCL for OpenCL
  • Automatic use for vector operations and SpMV
  • Smoothed Agg. AMG via CUSP
  • ViennaCL as CUDA/OpenCL/OpenMP-hydra

Current Activities

  • Use of cuBLAS and cuSPARSE
  • Better support for n > 1 processes

SLIDE 42

Conclusions

PETSc can help You

  • solve algebraic and DAE problems in your application area
  • rapidly develop efficient parallel code, can start from examples
  • develop new solution methods and data structures
  • debug and analyze performance
  • advice on software design, solution algorithms, and performance
  • petsc-{users,dev,maint}@mcs.anl.gov

You can help PETSc

  • report bugs and inconsistencies, or if you think there is a better way
  • tell us if the documentation is inconsistent or unclear
  • consider developing new algebraic methods as plugins, contribute if your idea works