SLIDE 1 SC13
GPU Technology Theater
AmgX: Performance Acceleration for Large-Scale Iterative Methods
SLIDE 2 Linear Solvers are Necessary
CFD, Energy, Physics, Nuclear Safety
SLIDE 3
AmgX Overview
Two forms of AMG
Classical AMG (as in HYPRE): strong convergence, scalar
Un-smoothed Aggregation AMG: lower setup times, handles block systems
Krylov methods
GMRES, CG, BiCGStab, preconditioned and ‘flexible’ variants
Classic iterative methods
Block-Jacobi, Gauss-Seidel, Chebyshev, ILU0, ILU1; multi-colored versions for fine-grained parallelism
Flexible configuration
All methods as solvers, preconditioners, or smoothers; nesting
Designed for non-linear problems
Allows for frequently changing matrices; parallel and efficient setup
SLIDE 4
Easy to Use
No CUDA experience necessary to use the library
C API: links with C, C++, or Fortran
Small, focused API
Reads common matrix formats (CSR, COO, MM)
Single GPU and multi-GPU
Interoperates easily with MPI, OpenMP, and hybrid parallel applications
Tuned for K20 & K40; supports Fermi and newer
Single and double precision
Supported on Linux and Win64
SLIDE 5 Minimal Example With Config
// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object and A, x, b; the mode sets the precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(A, b, x, matrixfile);

// Setup and solve
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);
solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_INI
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(amg_smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
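For completeness, the creation and teardown calls around this snippet look roughly as follows. This is a minimal sketch assuming the entry-point names of the public amgx_c.h header (AMGX_initialize, AMGX_config_create_from_file, the *_destroy calls, AMGX_finalize); check the header shipped with your release for the exact names.

/* Sketch: full lifecycle around the minimal example above. */
#include <stdio.h>
#include "amgx_c.h"

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: %s config matrix\n", argv[0]); return 1; }
    const char *cfgfile    = argv[1];
    const char *matrixfile = argv[2];

    AMGX_config_handle    cfg;
    AMGX_resources_handle res;
    AMGX_matrix_handle    A;
    AMGX_vector_handle    x, b;
    AMGX_solver_handle    solver;
    AMGX_Mode mode = AMGX_mode_dDDI;   /* device; double-precision matrix and vectors */

    AMGX_initialize();                 /* once per process */
    AMGX_config_create_from_file(&cfg, cfgfile);
    AMGX_resources_create_simple(&res, cfg);
    AMGX_matrix_create(&A, res, mode);
    AMGX_vector_create(&x, res, mode);
    AMGX_vector_create(&b, res, mode);
    AMGX_solver_create(&solver, res, mode, cfg);

    AMGX_read_system(A, b, x, matrixfile);
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);

    /* Destroy in reverse order of creation, then shut the library down. */
    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(b);
    AMGX_vector_destroy(x);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(res);
    AMGX_config_destroy(cfg);
    AMGX_finalize();
    return 0;
}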
SLIDE 6 Drop-In Acceleration on Real Apps
“Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation” Garf Bowen, Ridgeway Kite Software
400k-cell adaptive implicit black-oil model
AMG pressure solver
10 time-step benchmark
[Chart: Total time (s), lower is better. In-house solver on one CPU core: 1150 s; in-house GPU solver: 197 s; AmgX: 98 s]
SLIDE 7
Industrial Strength, Robust
Designed to be used in commercial, academic, and research applications. AmgX has run on clusters with hundreds of nodes; industrial problems with more than 440 million unknowns have been solved successfully using 48 GPUs.
SLIDE 8 ANSYS Fluent 15.0: 111 M cell aerodynamic problem
111 M mixed cells, external aerodynamics
Steady, k-epsilon turbulence
Double-precision solver
CPU: Sandy Bridge (E5-2667), 12 cores per node
GPU: Tesla K40m, 4 per node
444 million unknowns
Truck Body Model
                                         144 CPU cores (Fluent AMG)   144 CPU cores + 48 GPUs (AmgX)   Speedup
Solver time per iteration (s)            29                            11                               2.7x
Fluent solution time per iteration (s)   36                            18                               2.0x
Lower is better
SLIDE 9
Integrating AmgX into your Application
SLIDE 10 Integrates easily with MPI and OpenMP
Adding GPU support to existing applications raises new issues
- What is the proper ratio of CPU cores to GPUs?
- How can multiple CPU cores (MPI ranks) share a single GPU?
- How does MPI switch between two sets of ‘ranks’: one set for CPUs, one set for GPUs?
AmgX handles this via consolidation (see the sketch below)
- Consolidates multiple smaller sub-matrices into a single matrix
- Handled automatically during the PCIe data copy
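The rank-to-GPU mapping itself is decided in the application's MPI setup code, outside AmgX. A minimal sketch, assuming MPI is already initialized and using the AMGX_resources_create call from the public amgx_c.h (which takes the communicator and the device list for this rank):

/* Sketch: let several MPI ranks name the same GPU so their sub-matrices
   can be consolidated onto it. Assumes MPI_Init has already been called. */
#include <mpi.h>
#include <cuda_runtime.h>
#include "amgx_c.h"

void create_shared_gpu_resources(AMGX_config_handle cfg, AMGX_resources_handle *res)
{
    int rank, ngpus;
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_rank(comm, &rank);
    cudaGetDeviceCount(&ngpus);

    /* E.g. 8 ranks on a node with 2 GPUs -> 4 ranks per GPU; ranks that
       name the same device become candidates for consolidation.
       (On a multi-node cluster, use the node-local rank here instead.) */
    int device = rank % ngpus;

    AMGX_resources_create(res, cfg, &comm, 1, &device);
}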
SLIDE 11
[Diagram: the original problem (unknowns u1-u7) is partitioned across 2 MPI ranks, with halo values u'2 and u'4 exchanged at the rank boundary; both sub-matrices are then consolidated onto 1 GPU across PCIe]
SLIDE 12 Consolidation Examples
1 CPU socket <=> 1 GPU
Dual-socket CPU <=> 2 GPUs
Dual-socket CPU <=> 4 GPUs
Arbitrary cluster: 4 nodes x [2 CPUs + 3 GPUs], connected via InfiniBand (IB)
SLIDE 13
Benefits of Consolidation
Add more GPUs or CPU cores without changing existing code
Run 1 MPI rank per CPU core? Fine, keep it that way
Consolidation allows a flexible ratio of CPU cores per GPU
Consolidation has benefits similar to “replication” strategies in multigrid: consolidate many small coarse-grid problems to reduce network communication
SLIDE 14
Whole-App Performance
[Flowchart, typical application profile: advance in time -> compute physics -> solve non-linear PDE (inner loop until converged: linearize -> solve linear system) -> next time step. The inner linear solve is typically 50-90% of runtime: accelerate this first]
SLIDE 15 Amdahl’s Law
Typical performance example: the simulation spends a 70% fraction in the linear solver, and AmgX provides a 3x solver speedup
Best possible result from accelerating only the solver: 1/(1-0.7) = 10/3 = 3.33x
Achieved 3x for the solver = 1.87x application speedup
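This is Amdahl's law with solver fraction f = 0.7 and solver speedup s = 3; as a worked check (standard formula, not specific to AmgX):

\[
S_{\text{app}} = \frac{1}{(1-f) + f/s} = \frac{1}{0.3 + 0.7/3} = \frac{1}{0.5\overline{3}} \approx 1.87
\]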
[Chart: Total time, lower is better. Before AmgX: 100; after AmgX: 53.33]
SLIDE 16
Drop-in Acceleration
[Flowchart: the same time-stepping loop, with AmgX taking over the “Solve Linear System” step and data moved to/from the GPU around it]
The linear solve is expensive; moving data to the GPU is relatively cheap: 1-15%
SLIDE 17 What Comes Next?
[Flowchart: the same loop, with the GPU region expanded beyond the linear solve to cover linearization and matrix assembly]
AmgX CUDA matrix assembly (see the sketch below)
The AmgX device pointer API allows data to start on the GPU
Works with consolidation
No change to the AmgX solver calls
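A minimal sketch of what “data can start on the GPU” means in practice, assuming the AMGX_matrix_upload_all / AMGX_vector_upload / AMGX_vector_download entry points of the public amgx_c.h (which accept device pointers when a device mode such as AMGX_mode_dDDI is used); assemble_system_on_gpu is a hypothetical application kernel:

/* Sketch: hand AmgX CSR data that already lives in GPU memory, avoiding a
   host round-trip. Assumes A, x, b, solver were created with a device mode. */
#include <cuda_runtime.h>
#include "amgx_c.h"

void solve_from_device(AMGX_solver_handle solver, AMGX_matrix_handle A,
                       AMGX_vector_handle x, AMGX_vector_handle b,
                       int n, int nnz)
{
    int *d_rows, *d_cols;            /* CSR structure, device-resident */
    double *d_vals, *d_rhs, *d_sol;
    cudaMalloc(&d_rows, (n + 1) * sizeof(int));
    cudaMalloc(&d_cols, nnz * sizeof(int));
    cudaMalloc(&d_vals, nnz * sizeof(double));
    cudaMalloc(&d_rhs,  n * sizeof(double));
    cudaMalloc(&d_sol,  n * sizeof(double));
    cudaMemset(d_sol, 0, n * sizeof(double));   /* zero initial guess */

    /* Hypothetical application kernel: assemble A and b directly on the GPU. */
    /* assemble_system_on_gpu(d_rows, d_cols, d_vals, d_rhs); */

    /* Device pointers are passed straight in; no copy through the host. */
    AMGX_matrix_upload_all(A, n, nnz, 1, 1, d_rows, d_cols, d_vals, NULL);
    AMGX_vector_upload(b, n, 1, d_rhs);
    AMGX_vector_upload(x, n, 1, d_sol);

    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);
    AMGX_vector_download(x, d_sol);  /* result lands back in our GPU buffer */
}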
SLIDE 18 Amdahl’s law revisited
3x linear solver speedup on a 70% fraction
4x matrix assembly speedup on a 25% fraction
Potential speedup is dramatically higher: 1/(1-0.95) = 100/5 = 20x
Achieved 3x solver and 4x assembly = 2.89x application speedup
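The same worked check with Amdahl's law generalized to two accelerated fractions (f1 = 0.7 at s1 = 3, f2 = 0.25 at s2 = 4):

\[
S_{\text{app}} = \frac{1}{(1 - f_1 - f_2) + f_1/s_1 + f_2/s_2}
             = \frac{1}{0.05 + 0.2\overline{3} + 0.0625} \approx 2.89
\]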
[Chart: Total time, lower is better. Before AmgX: 100; after AmgX: 53.33; after GPU assembly: 34.58]
SLIDE 19
AmgX – High Performance on Modern Algorithms
Interested in time-to-solution
Comparisons are always against state-of-the-art algorithms and implementations
CPU codes are mature and well tuned
SLIDE 20 GPU Acceleration of Coupled Solvers
[Chart: ANSYS Fluent time (s), lower is better. Segregated solver on CPU: 5883 s; coupled solver on CPU: 7070 s; coupled solver on CPU+GPU: 3180 s. The GPU-accelerated coupled solver is 1.9x faster than the segregated CPU solver and 2.2x faster than the coupled CPU solver]
Sedan geometry, 3.6 M mixed cells
Steady, turbulent external aerodynamics
Coupled PBNS, double precision
AMG F-cycle on CPU, AMG V-cycle on GPU
Sedan Model
Preview of ANSYS Fluent 15.0 Performance
NOTE: Times are for the total solution
SLIDE 21 Florida Matrix Collection
[Chart: AmgX Classical on K40, speedup vs. HYPRE across Florida Matrix Collection problems, higher is better. Per-matrix speedups: 7.14, 11.85, 6.40, 1.41, 8.06, 2.88, 2.36, 2.63, 6.82, 2.54, 2.86, 1.83]
CPU: Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; total time to solution
SLIDE 22 miniFE Benchmark vs HYPRE
[Plot: total time (s) vs. number of unknowns (1-7 million), single node, 1 CPU socket & 1 GPU, lower is better. Series: 1GPU-Agg, 1GPU-Classical, HYPRE 8 cores]
All runs solved to machine tolerance
Hardware: Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; 1x K40
miniFE is a “mini app” from Sandia that performs assembly and solution of a finite element mesh, typical of DOE codes
SLIDE 23 miniFE Benchmark vs HYPRE
[Plot: total time (s) vs. number of unknowns (1-7 million), dual-socket CPU & 2 GPUs, lower is better. Series: 2GPU-Agg, HYPRE 16 cores]
All runs solved to machine tolerance
Hardware: Xeon E5-2670 @ 2.60 GHz, 2x 8 cores, 128 GB memory; 2x K40
SLIDE 24 miniFE Benchmark vs HYPRE
[Plot: solve-only time (s) vs. number of unknowns (1-7 million), single-socket CPU & 1 GPU, setup reused, lower is better. Series: 1GPU-Agg, 1GPU-Classical, HYPRE 8 cores]
All runs solved to machine tolerance
Hardware: Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; 1x K40
SLIDE 25
AmgX
Fast, scalable linear solvers, with an emphasis on iterative methods
Flexible toolkit: a GPU-accelerated Ax = b solver
Solve your problems faster with minimal disruption
More than just fast solvers: AmgX helps you accelerate all of your code
A first step toward moving complex applications to the GPU
Public beta launching now: http://developer.nvidia.com/amgx