

SLIDE 1

SC13

GPU Technology Theater

AmgX: Performance Acceleration for Large-Scale Iterative Methods

  • Dr. Joe Eaton, NVIDIA
SLIDE 2

Linear Solvers are Necessary

  • CFD
  • Energy
  • Physics
  • Nuclear Safety

SLIDE 3

AmgX Overview

Two forms of AMG

  • Classical AMG, as in HYPRE: strong convergence, scalar systems
  • Un-smoothed aggregation AMG: lower setup times, handles block systems

Krylov methods

GMRES, CG, BiCGStab, preconditioned and ‘flexible’ variants

Classic iterative methods

Block-Jacobi, Gauss-Seidel, Chebyshev, ILU0, ILU1; multi-colored versions for fine-grained parallelism

Flexible configuration

All methods as solvers, preconditioners, or smoothers; nesting

Designed for non-linear problems

Allows for frequently changing matrix, parallel and efficient setup

SLIDE 4

Easy to Use

  • No CUDA experience necessary to use the library
  • C API: links with C, C++, or Fortran
  • Small, focused API
  • Reads common matrix formats (CSR, COO, MM)
  • Single GPU and multi-GPU
  • Interoperates easily with MPI, OpenMP, and hybrid parallel applications
  • Tuned for K20 & K40; supports Fermi and newer
  • Single and double precision
  • Supported on Linux, Win64
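As a point of reference, this is what a small matrix looks like in CSR, the most common of the formats listed above. The array names are illustrative only, not part of the AmgX API:

#include <stdio.h>

/* The 3x3 matrix
     [ 4 -1  0 ]
     [-1  4 -1 ]
     [ 0 -1  4 ]
   stored in CSR (compressed sparse row) form.                      */
int    row_ptrs[] = { 0, 2, 5, 7 };            /* start of each row in col_idx/vals */
int    col_idx[]  = { 0, 1, 0, 1, 2, 1, 2 };   /* column index of each nonzero      */
double vals[]     = { 4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0 };

int main(void)
{
    /* Print row 1 to show how the arrays are read together. */
    for (int k = row_ptrs[1]; k < row_ptrs[2]; ++k)
        printf("A(1,%d) = %g\n", col_idx[k], vals[k]);
    return 0;
}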

SLIDE 5

Minimal Example With Config

// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object, A, x, b; set precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(&A, &x, &b, matrixfile);

// Setup and solve
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);

solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_INI
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(amg_smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
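Read together, this config builds an FGMRES outer solver (at most 100 iterations, relative tolerance 0.1) preconditioned by a single V-cycle of aggregation AMG with size-8 aggregates and up to 10 levels, using a block-Jacobi smoother (1 pre-sweep, 2 post-sweeps, relaxation factor 0.75) and 4 sweeps on the coarsest level, with deterministic results requested.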

SLIDE 6

Drop-In Acceleration on Real Apps

“Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation” – Garf Bowen, Ridgeway Kite Software

  • 400k-cell adaptive implicit black-oil model
  • AMG pressure solver
  • 10 time-step benchmark

Chart: total time (s), lower is better. In-house solver on one core: 1150; in-house solver on GPU: 197; AmgX: 98.

SLIDE 7

Industrial Strength, Robust

Designed to be used in commercial, academic, and research applications. AmgX has run on clusters with hundreds of nodes; industrial problems with more than 440 million unknowns have been solved successfully using 48 GPUs.

SLIDE 8

ANSYS Fluent 15.0: 111 M-cell aerodynamic problem

  • 111 M mixed cells, external aerodynamics
  • Steady, k-ε turbulence
  • Double-precision solver
  • CPU: Sandy Bridge (E5-2667), 12 cores per node
  • GPU: Tesla K40m, 4 per node
  • 444 million unknowns

Truck Body Model

144 CPU cores running Fluent's AMG vs. 144 CPU cores + 48 GPUs running AmgX (lower is better):

  • Solver time per iteration (s): 29 vs. 11 (about 2.7x)
  • Fluent solution time per iteration (s): 36 vs. 18 (2x)

SLIDE 9

Integrating AmgX into your Application

SLIDE 10

Integrates easily with MPI and OpenMP

Adding GPU support to existing applications raises new issues

  • Proper ratio CPU cores / GPU?
  • How can multiple CPU cores (MPI ranks) share a single GPU?
  • How does MPI switch between two sets of 'ranks': one set for CPUs, one set for GPUs?

AmgX handles this via Consolidation

  • Consolidate multiple smaller sub-matrices into a single matrix
  • Handled automatically during the PCIe data copy (see the sketch below)
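Consolidation itself happens inside the library, but the application still decides which GPU each MPI rank targets. A minimal sketch of one common mapping, using plain MPI and the CUDA runtime rather than the AmgX API; the helper name and the modulo policy are assumptions for illustration:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Map each MPI rank to a GPU on its node; several ranks may share one GPU. */
static int select_device(MPI_Comm comm)
{
    MPI_Comm local;
    int local_rank = 0, ndev = 0;

    /* Communicator of the ranks that share this node (MPI-3). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
    MPI_Comm_rank(local, &local_rank);
    MPI_Comm_free(&local);

    cudaGetDeviceCount(&ndev);
    int dev = local_rank % ndev;   /* ranks beyond ndev wrap around and share a GPU */
    cudaSetDevice(dev);
    return dev;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int dev = select_device(MPI_COMM_WORLD);
    printf("rank %d -> GPU %d\n", rank, dev);
    MPI_Finalize();
    return 0;
}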
SLIDE 11

Diagram: the original problem (unknowns u1-u7) is partitioned across two MPI ranks (with halo copies u'2 and u'4 for the boundary exchange), then both sub-problems are consolidated onto one GPU over PCIe.

SLIDE 12

Consolidation Examples

  • 1 CPU socket <=> 1 GPU
  • Dual-socket CPU <=> 2 GPUs
  • Dual-socket CPU <=> 4 GPUs
  • Arbitrary cluster: 4 nodes x [2 CPUs + 3 GPUs], connected over InfiniBand (IB)

SLIDE 13

Benefits of Consolidation

  • Add more GPUs or CPU cores without changing existing code
  • Run 1 MPI rank per CPU core? Fine, keep it that way
  • Consolidation allows a flexible ratio of CPU cores per GPU
  • Consolidation has benefits similar to "replication" strategies in multigrid: consolidate many small coarse-grid problems to reduce network communication

SLIDE 14

Whole-App Performance

Flowchart (typical application profile): advance in time → compute physics → solve non-linear PDE → linearize → solve linear system → next time step, with the linearize/solve inner loop repeated until converged. The linear solve is typically 50-90% of run time: accelerate this first.

SLIDE 15

Amdahl’s Law

Typical performance example:
  • The simulation spends 70% of its time in the linear solver
  • AmgX provides a 3x solver speedup
  • Best possible result from accelerating only the solver (the Amdahl limit with an infinitely fast solver): 1/(1 - 0.7) = 10/3 = 3.33x
  • Achieved: a 3x solver speedup gives a 1.87x application speedup

Chart: total time 100 before AmgX, 53.33 after AmgX (lower is better).
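A small sketch of the arithmetic behind these numbers; the helper function is illustrative and not part of AmgX:

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of the run time is
   accelerated by a factor s and the rest is unchanged. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    printf("limit  (f=0.7, s->inf): %.2fx\n", amdahl(0.7, 1e9)); /* ~3.33x */
    printf("actual (f=0.7, s=3):    %.2fx\n", amdahl(0.7, 3.0)); /* ~1.87x */
    return 0;
}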

SLIDE 16

Drop-in Acceleration

Flowchart: the same time-stepping loop, with the "solve linear system" step handed to AmgX and data moved to/from the GPU around it. Solving the linear system is expensive; moving data to the GPU is relatively cheap (1-15% of the time).

SLIDE 17

What Comes Next?

Flowchart: the same time-stepping loop, with the region of acceleration expanded to cover linearization and matrix assembly as well as the linear solve.

  • CUDA matrix assembly feeding AmgX
  • The AmgX device-pointer API allows data to start on the GPU
  • Works with consolidation
  • No change to the AmgX solver calls
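A minimal sketch of what on-device assembly can look like: a 1-D Poisson matrix assembled directly into CSR arrays in GPU memory. This is purely illustrative CUDA, not AmgX code, but the resulting device pointers are the kind of data the device-pointer API is meant to accept without a round trip through the host:

#include <cuda_runtime.h>

/* Assemble the tridiagonal 1-D Poisson matrix (-1, 2, -1) straight into
   CSR arrays that already live in GPU memory. */
__global__ void assemble_poisson_csr(int n, int *row_ptr, int *col, double *val)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    /* Interior rows hold 3 nonzeros, the two boundary rows hold 2. */
    int start = (i == 0) ? 0 : 3 * i - 1;
    int k = start;
    if (i > 0)     { col[k] = i - 1; val[k] = -1.0; ++k; }
                     col[k] = i;     val[k] =  2.0; ++k;
    if (i < n - 1) { col[k] = i + 1; val[k] = -1.0; ++k; }

    row_ptr[i] = start;
    if (i == n - 1) row_ptr[n] = k;   /* total nnz = 3n - 2 */
}

int main(void)
{
    const int n = 1 << 10, nnz = 3 * n - 2;
    int *row_ptr, *col; double *val;
    cudaMalloc((void **)&row_ptr, (n + 1) * sizeof(int));
    cudaMalloc((void **)&col, nnz * sizeof(int));
    cudaMalloc((void **)&val, nnz * sizeof(double));

    assemble_poisson_csr<<<(n + 255) / 256, 256>>>(n, row_ptr, col, val);
    cudaDeviceSynchronize();

    /* row_ptr/col/val are device pointers that never touched the host. */
    cudaFree(row_ptr); cudaFree(col); cudaFree(val);
    return 0;
}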

SLIDE 18

Amdahl’s law revisited

  • 3x linear solver speedup on a 70% fraction
  • 4x matrix assembly speedup on a 25% fraction
  • The potential speedup (with both parts infinitely fast) is dramatically higher: 1/(1 - 0.95) = 100/5 = 20x
  • Achieved: 3x on the solver and 4x on assembly gives a 2.89x application speedup

Chart: total time 100 before AmgX, 53.33 after AmgX, 34.58 after GPU assembly (lower is better).
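The same arithmetic extended to two accelerated fractions; again an illustrative helper, not AmgX code:

#include <stdio.h>

/* Generalized Amdahl: fractions f1 and f2 of the run time are accelerated
   by factors s1 and s2; the remaining 1 - f1 - f2 is unchanged. */
static double amdahl2(double f1, double s1, double f2, double s2)
{
    return 1.0 / ((1.0 - f1 - f2) + f1 / s1 + f2 / s2);
}

int main(void)
{
    /* 70% solver at 3x, 25% assembly at 4x -> ~2.89x overall */
    printf("achieved: %.2fx\n", amdahl2(0.70, 3.0, 0.25, 4.0));
    /* both parts infinitely fast -> 1/(1 - 0.95) = 20x */
    printf("limit:    %.2fx\n", amdahl2(0.70, 1e9, 0.25, 1e9));
    return 0;
}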

SLIDE 19

AmgX – High Performance on Modern Algorithms

  • Interested in time-to-solution
  • Comparisons are always against state-of-the-art algorithms and implementations
  • CPU codes are mature and well tuned

SLIDE 20

GPU Acceleration of Coupled Solvers

Chart: ANSYS Fluent time (s), lower is better. Segregated solver (CPU): 7070; coupled solver (CPU): 5883; coupled solver (CPU+GPU): 3180, i.e. about 1.9x faster than the coupled CPU solver and 2.2x faster than the segregated solver.

  • Sedan geometry, 3.6 M mixed cells
  • Steady, turbulent external aerodynamics
  • Coupled PBNS, double precision
  • AMG F-cycle on CPU, AMG V-cycle on GPU

Sedan Model

Preview of ANSYS Fluent 15.0 Performance

NOTE: Times for total solution

SLIDE 21

Florida Matrix Collection

Chart: AmgX Classical on K40, speedup vs. HYPRE for twelve Florida collection matrices: 7.14x, 11.85x, 6.40x, 1.41x, 8.06x, 2.88x, 2.36x, 2.63x, 6.82x, 2.54x, 2.86x, 1.83x.

Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; total time to solution.

Higher is Better

SLIDE 22

miniFE Benchmark vs HYPRE

Chart: time (s) vs. number of unknowns (millions, 1-7).

Single node, 1-socket CPU & 1 GPU, total time

Series: 1 GPU (Aggregation), 1 GPU (Classical), HYPRE on 8 cores. All runs solved to machine tolerance. Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; 1x K40.
miniFE is a "mini app" from Sandia that performs assembly and solution of a finite element mesh, typical of DOE codes.

Lower is Better

SLIDE 23

miniFE Benchmark vs HYPRE

Chart: time (s) vs. number of unknowns (millions, 1-7).

Dual Socket CPU & 2 GPUs, Total Time

Series: 2 GPUs (Aggregation), HYPRE on 16 cores. All runs solved to machine tolerance. Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; 2x K40.

Lower is Better

SLIDE 24

miniFE Benchmark vs HYPRE

Chart: time (s) vs. number of unknowns (millions, 1-7).

Single socket CPU & 1GPU, solve only, reuse setup

Series: 1 GPU (Aggregation), 1 GPU (Classical), HYPRE on 8 cores. All runs solved to machine tolerance. Xeon E5-2670 @ 2.60 GHz, 2x 8 cores, 128 GB memory; 1x K40.

Lower is Better

SLIDE 25

AmgX

  • Fast, scalable linear solvers with an emphasis on iterative methods
  • Flexible toolkit: a GPU-accelerated Ax = b solver
  • Solve your problems faster with minimal disruption
  • More than just fast solvers: AmgX helps you accelerate all of your code
  • First step toward moving complex applications to the GPU
  • Public beta launching now: http://developer.nvidia.com/amgx