SLIDE 1 SC13
GPU Technology Theater
AmgX: Performance Acceleration for Large-Scale Iterative Methods
SLIDE 2 Linear Solvers are Necessary
CFD, Energy, Physics, Nuclear Safety
SLIDE 3
AmgX Overview
Two forms of AMG
Classical AMG (as in HYPRE): strong convergence, scalar
Un-smoothed Aggregation AMG: lower setup times, handles block systems
Krylov methods
GMRES, CG, BiCGStab, preconditioned and ‘flexible’ variants
Classic iterative methods
Block-Jacobi, Gauss-Seidel, Chebyshev, ILU0, ILU1; multi-colored versions for fine-grained parallelism
Flexible configuration
All methods as solvers, preconditioners, or smoothers; nesting
Designed for non-linear problems
Allows for frequently changing matrices; parallel and efficient setup
SLIDE 4
Easy to Use
No CUDA experience necessary to use the library
C API: links with C, C++, or Fortran
Small, focused API
Reads common matrix formats (CSR, COO, MM)
Single GPU and multi-GPU
Interoperates easily with MPI, OpenMP, and hybrid parallel applications
Tuned for K20 & K40; supports Fermi and newer
Single and double precision
Supported on Linux and Win64
SLIDE 5 Minimal Example With Config
// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object and A, x, b; the mode sets the precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(A, b, x, matrixfile);

// Setup and solve
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);
solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_INI
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(amg_smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1
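For completeness, the creation and teardown calls around this snippet look roughly as follows. This is a minimal sketch assuming the entry-point names of the public amgx_c.h header (AMGX_initialize, AMGX_config_create_from_file, the *_destroy calls, AMGX_finalize); check the header shipped with your release for the exact names.

/* Sketch: full lifecycle around the minimal example above. */
#include <stdio.h>
#include "amgx_c.h"

int main(int argc, char **argv)
{
    if (argc < 3) { fprintf(stderr, "usage: %s config matrix\n", argv[0]); return 1; }
    const char *cfgfile    = argv[1];
    const char *matrixfile = argv[2];

    AMGX_config_handle    cfg;
    AMGX_resources_handle res;
    AMGX_matrix_handle    A;
    AMGX_vector_handle    x, b;
    AMGX_solver_handle    solver;
    AMGX_Mode mode = AMGX_mode_dDDI;   /* device; double-precision matrix and vectors */

    AMGX_initialize();                 /* once per process */
    AMGX_config_create_from_file(&cfg, cfgfile);
    AMGX_resources_create_simple(&res, cfg);
    AMGX_matrix_create(&A, res, mode);
    AMGX_vector_create(&x, res, mode);
    AMGX_vector_create(&b, res, mode);
    AMGX_solver_create(&solver, res, mode, cfg);

    AMGX_read_system(A, b, x, matrixfile);
    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);

    /* Destroy in reverse order of creation, then shut the library down. */
    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(b);
    AMGX_vector_destroy(x);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(res);
    AMGX_config_destroy(cfg);
    AMGX_finalize();
    return 0;
}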
SLIDE 6 Drop-In Acceleration on Real Apps
“Using AmgX has allowed us to exploit the power of the GPU while freeing up development time to concentrate on reservoir simulation” Garf Bowen, Ridgeway Kite Software
400k-cell adaptive implicit black-oil model
AMG pressure solver
10 time-step benchmark
[Chart: Total time (s), lower is better. In-house solver on one CPU core: 1150 s; in-house GPU solver: 197 s; AmgX: 98 s]
SLIDE 7
Industrial Strength, Robust
Designed to be used in commercial, academic, and research applications. AmgX has run on clusters with hundreds of nodes; industrial problems with more than 440 million unknowns have been solved successfully using 48 GPUs.
SLIDE 8 ANSYS Fluent 15.0: 111 M cell aerodynamic problem
111 M mixed cells, external aerodynamics
Steady, k-epsilon turbulence
Double-precision solver
CPU: Sandy Bridge (E5-2667), 12 cores per node
GPU: Tesla K40m, 4 per node
444 million unknowns
Truck Body Model
                                         144 CPU cores (Fluent AMG)   144 CPU cores + 48 GPUs (AmgX)   Speedup
Solver time per iteration (s)            29                            11                               2.7x
Fluent solution time per iteration (s)   36                            18                               2.0x
Lower is better
SLIDE 9
Integrating AmgX into your Application
SLIDE 10 Integrates easily with MPI and OpenMP
Adding GPU support to existing applications raises new issues
- What is the proper ratio of CPU cores to GPUs?
- How can multiple CPU cores (MPI ranks) share a single GPU?
- How does MPI switch between two sets of ‘ranks’: one set for CPUs, one set for GPUs?
AmgX handles this via consolidation (see the sketch below)
- Consolidates multiple smaller sub-matrices into a single matrix
- Handled automatically during the PCIe data copy
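The rank-to-GPU mapping itself is decided in the application's MPI setup code, outside AmgX. A minimal sketch, assuming MPI is already initialized and using the AMGX_resources_create call from the public amgx_c.h (which takes the communicator and the device list for this rank):

/* Sketch: let several MPI ranks name the same GPU so their sub-matrices
   can be consolidated onto it. Assumes MPI_Init has already been called. */
#include <mpi.h>
#include <cuda_runtime.h>
#include "amgx_c.h"

void create_shared_gpu_resources(AMGX_config_handle cfg, AMGX_resources_handle *res)
{
    int rank, ngpus;
    MPI_Comm comm = MPI_COMM_WORLD;
    MPI_Comm_rank(comm, &rank);
    cudaGetDeviceCount(&ngpus);

    /* E.g. 8 ranks on a node with 2 GPUs -> 4 ranks per GPU; ranks that
       name the same device become candidates for consolidation.
       (On a multi-node cluster, use the node-local rank here instead.) */
    int device = rank % ngpus;

    AMGX_resources_create(res, cfg, &comm, 1, &device);
}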
SLIDE 11
[Diagram: the original problem (unknowns u1-u7) is partitioned across 2 MPI ranks, with halo values u'2 and u'4 exchanged at the rank boundary; both sub-matrices are then consolidated onto 1 GPU across PCIe]
SLIDE 12 Consolidation Examples
1 CPU socket <=> 1 GPU
Dual-socket CPU <=> 2 GPUs
Dual-socket CPU <=> 4 GPUs
Arbitrary cluster: 4 nodes x [2 CPUs + 3 GPUs], connected via InfiniBand (IB)
SLIDE 13
Benefits of Consolidation
Add more GPUs or CPU cores without changing existing code
Run 1 MPI rank per CPU core? Fine, keep it that way
Consolidation allows a flexible ratio of CPU cores per GPU
Consolidation has benefits similar to “replication” strategies in multigrid: consolidate many small coarse-grid problems to reduce network communication
SLIDE 14
Whole-App Performance
[Flowchart, typical application profile: advance in time -> compute physics -> solve non-linear PDE (inner loop until converged: linearize -> solve linear system) -> next time step. The inner linear solve is typically 50-90% of runtime: accelerate this first]
SLIDE 15 Amdahl’s Law
Typical performance example: the simulation spends a 70% fraction in the linear solver, and AmgX provides a 3x solver speedup
Best possible result from accelerating only the solver: 1/(1-0.7) = 10/3 = 3.33x
Achieved 3x for the solver = 1.87x application speedup
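This is Amdahl's law with solver fraction f = 0.7 and solver speedup s = 3; as a worked check (standard formula, not specific to AmgX):

\[
S_{\text{app}} = \frac{1}{(1-f) + f/s} = \frac{1}{0.3 + 0.7/3} = \frac{1}{0.5\overline{3}} \approx 1.87
\]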
[Chart: Total time, lower is better. Before AmgX: 100; after AmgX: 53.33]
SLIDE 16
Drop-in Acceleration
[Flowchart: the same time-stepping loop, with AmgX taking over the “Solve Linear System” step and data moved to/from the GPU around it]
The linear solve is expensive; moving data to the GPU is relatively cheap: 1-15%
SLIDE 17 What Comes Next?
[Flowchart: the same loop, with the GPU region expanded beyond the linear solve to cover linearization and matrix assembly]
AmgX CUDA matrix assembly (see the sketch below)
The AmgX device pointer API allows data to start on the GPU
Works with consolidation
No change to the AmgX solver calls
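A minimal sketch of what “data can start on the GPU” means in practice, assuming the AMGX_matrix_upload_all / AMGX_vector_upload / AMGX_vector_download entry points of the public amgx_c.h (which accept device pointers when a device mode such as AMGX_mode_dDDI is used); assemble_system_on_gpu is a hypothetical application kernel:

/* Sketch: hand AmgX CSR data that already lives in GPU memory, avoiding a
   host round-trip. Assumes A, x, b, solver were created with a device mode. */
#include <cuda_runtime.h>
#include "amgx_c.h"

void solve_from_device(AMGX_solver_handle solver, AMGX_matrix_handle A,
                       AMGX_vector_handle x, AMGX_vector_handle b,
                       int n, int nnz)
{
    int *d_rows, *d_cols;            /* CSR structure, device-resident */
    double *d_vals, *d_rhs, *d_sol;
    cudaMalloc(&d_rows, (n + 1) * sizeof(int));
    cudaMalloc(&d_cols, nnz * sizeof(int));
    cudaMalloc(&d_vals, nnz * sizeof(double));
    cudaMalloc(&d_rhs,  n * sizeof(double));
    cudaMalloc(&d_sol,  n * sizeof(double));
    cudaMemset(d_sol, 0, n * sizeof(double));   /* zero initial guess */

    /* Hypothetical application kernel: assemble A and b directly on the GPU. */
    /* assemble_system_on_gpu(d_rows, d_cols, d_vals, d_rhs); */

    /* Device pointers are passed straight in; no copy through the host. */
    AMGX_matrix_upload_all(A, n, nnz, 1, 1, d_rows, d_cols, d_vals, NULL);
    AMGX_vector_upload(b, n, 1, d_rhs);
    AMGX_vector_upload(x, n, 1, d_sol);

    AMGX_solver_setup(solver, A);
    AMGX_solver_solve(solver, b, x);
    AMGX_vector_download(x, d_sol);  /* result lands back in our GPU buffer */
}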
SLIDE 18 Amdahl’s law revisited
3x linear solver speedup on a 70% fraction
4x matrix assembly speedup on a 25% fraction
Potential speedup is dramatically higher: 1/(1-0.95) = 100/5 = 20x
Achieved 3x solver and 4x assembly = 2.89x application speedup
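The same worked check with Amdahl's law generalized to two accelerated fractions (f1 = 0.7 at s1 = 3, f2 = 0.25 at s2 = 4):

\[
S_{\text{app}} = \frac{1}{(1 - f_1 - f_2) + f_1/s_1 + f_2/s_2}
             = \frac{1}{0.05 + 0.2\overline{3} + 0.0625} \approx 2.89
\]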
[Chart: Total time, lower is better. Before AmgX: 100; after AmgX: 53.33; after GPU assembly: 34.58]
SLIDE 19
AmgX – High Performance on Modern Algorithms
Interested in time-to-solution
Comparisons are always against state-of-the-art algorithms and implementations
CPU codes are mature and well tuned
SLIDE 20 GPU Acceleration of Coupled Solvers
[Chart: ANSYS Fluent time (s), lower is better. Segregated solver on CPU: 5883 s; coupled solver on CPU: 7070 s; coupled solver on CPU+GPU: 3180 s. The GPU-accelerated coupled solver is 1.9x faster than the segregated CPU solver and 2.2x faster than the coupled CPU solver]
Sedan geometry, 3.6 M mixed cells
Steady, turbulent external aerodynamics
Coupled PBNS, double precision
AMG F-cycle on CPU, AMG V-cycle on GPU
Sedan Model
Preview of ANSYS Fluent 15.0 Performance
NOTE: Times are for the total solution
SLIDE 21 Florida Matrix Collection
[Chart: AmgX Classical on K40, speedup vs. HYPRE across Florida Matrix Collection problems, higher is better. Per-matrix speedups: 7.14, 11.85, 6.40, 1.41, 8.06, 2.88, 2.36, 2.63, 6.82, 2.54, 2.86, 1.83]
CPU: Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; total time to solution
SLIDE 22 miniFE Benchmark vs HYPRE
[Plot: total time (s) vs. number of unknowns (1-7 million), single node, 1 CPU socket & 1 GPU, lower is better. Series: 1GPU-Agg, 1GPU-Classical, HYPRE 8 cores]
All runs solved to machine tolerance
Hardware: Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; 1x K40
miniFE is a “mini app” from Sandia that performs assembly and solution of a finite element mesh, typical of DOE codes
SLIDE 23 miniFE Benchmark vs HYPRE
[Plot: total time (s) vs. number of unknowns (1-7 million), dual-socket CPU & 2 GPUs, lower is better. Series: 2GPU-Agg, HYPRE 16 cores]
All runs solved to machine tolerance
Hardware: Xeon E5-2670 @ 2.60 GHz, 2x 8 cores, 128 GB memory; 2x K40
SLIDE 24 miniFE Benchmark vs HYPRE
[Plot: solve-only time (s) vs. number of unknowns (1-7 million), single-socket CPU & 1 GPU, setup reused, lower is better. Series: 1GPU-Agg, 1GPU-Classical, HYPRE 8 cores]
All runs solved to machine tolerance
Hardware: Xeon E5-2670 @ 2.60 GHz, 8 cores, 128 GB memory; 1x K40
SLIDE 25
AmgX
Fast, scalable linear solvers, with an emphasis on iterative methods
Flexible toolkit: a GPU-accelerated Ax = b solver
Solve your problems faster with minimal disruption
More than just fast solvers: AmgX helps you accelerate all of your code
A first step toward moving complex applications to the GPU
Public beta launching now: http://developer.nvidia.com/amgx