Performance Evaluation of GPU Accelerated Linear Solvers with TCAD Examples
Ana Iontcheva
Senior Development Engineer - Numerics Silvaco Inc, March 26, 2018
Outline:
- Introduction: Who is Silvaco? What is TCAD?
- Linear Solvers for …
[Diagram: Silvaco product areas - TCAD (Process and Device simulation), Modeling (Spice modeling, parasitic extraction and reduction), Design & Verification (Layout, Schematic, Spice, DRC, LVS, measured data), Reliability Analysis and Variation Analysis]
Applications:
- Field effect transistors: 3D devices used in modern processors; quantum models
- Electric railways, medical equipment, high-power equipment in power plants
TCAD tools:
- Structure Builder and Process Simulation: Victory Process
- Device Simulation: Victory Device
- 3D RC Extraction: Clever
- Layout Automation and Optimization
- 2D/3D Visualization
…insulators or masks for ion implantation.
…to form the source, drain and channel regions in a transistor. The output from the Process Simulation is a 2D or 3D structure which can be used in the Device Simulation.
…accelerated with GPUs, and compare the results with our parallel linear solvers running on CPUs only. Our purpose was to identify the types of tools for which GPU acceleration is possible and to get an approximate idea of the speedup we would obtain.
…with matrices extracted from different modules of our Process Simulator, Victory Process.
…triangular solvers and parameters; these results show the best-performing combination that we could come up with for a specific problem.
…CPU.
Acknowledgments: The numerical simulations needed for this work were performed on Microway's Tesla GPU-accelerated compute cluster. The author is grateful for the HPC time provided by Microway, Inc.

References:
1. PARALUTION Labs. PARALUTION v1.1, 2016. http://www.paralution.com
2. MAGMA 2.2.0. http://icl.cs.utk.edu/magma/software/
laplace.mtx (non-symmetric), source: Victory Process, n = 1 496 362, nnz = 11 640 245

Library      Solver     Preconditioner               #iterations   time      residual
Paralution   BICGSTAB   Gauss-Seidel                 135           0.59 s    5.86e-10
Magma        BICGSTAB   ILU(0), Jacobi tri. solver   128           0.72 s    7.75e-10
SolverLib    BICGSTAB   ILU(0)                       57            6.87 s    6.92e-10
The laplace matrix comes from the finite difference approximation of a stationary diffusion (Laplace) equation. The equation models 3D stationary diffusion of oxygen in silicon oxide during the thermal oxidation of silicon. In this example Paralution shows the best performance, probably due to the Gauss-Seidel preconditioner. The speedup of Paralution compared to SolverLib is 11.6 times; note that the SolverLib BICGSTAB solver runs on only 1 core in this example. The Paralution and Magma results are close, with Paralution about 1.2 times faster.
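To make the solver combination concrete, here is a minimal SciPy sketch (not the Paralution, Magma, or SolverLib code) that applies BiCGSTAB with a symmetric Gauss-Seidel preconditioner to a small 3D finite-difference Laplacian; the grid size and tolerances are illustrative only.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplace_3d(n):
    """7-point finite-difference Laplacian on an n x n x n grid."""
    d1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(sp.kron(d1, I), I)
            + sp.kron(sp.kron(I, d1), I)
            + sp.kron(sp.kron(I, I), d1)).tocsr()

n = 16                       # tiny grid, for illustration only
A = laplace_3d(n)
b = np.ones(A.shape[0])

# Symmetric Gauss-Seidel preconditioner: M = (D+L) D^-1 (D+U),
# applied as M^-1 r = (D+U)^-1 D (D+L)^-1 r.
L = sp.tril(A, format="csr")          # D + L
U = sp.triu(A, format="csr")          # D + U
D = A.diagonal()

def sgs(r):
    y = spla.spsolve_triangular(L, r, lower=True)
    return spla.spsolve_triangular(U, D * y, lower=False)

M = spla.LinearOperator(A.shape, matvec=sgs)
x, info = spla.bicgstab(A, b, M=M)    # info == 0 signals convergence
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The same structure (Krylov solver plus a preconditioner supplied as a linear operator) is how all three libraries in the table are driven; only the backends differ.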
The next three matrices come from the finite difference approximation of a Stokes equation (a Navier-Stokes model of a liquid with high viscosity and hence a very low Reynolds number). The equation models 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon. This is the most time-consuming part of the simulation, so the speedup here is of particular importance. We used matrices of three different sizes - roughly 1 million, 2 million and 4 million unknowns - because we wanted to see whether we would get consistent results.
vas_stokes_1M.mtx (non-symmetric), n = 1 090 664, nnz = 32 767 207

Library      Solver                  Preconditioner               #iterations   time       residual
Paralution   IDR(8)                  ILU(0,1)                     1759          16.93 s    7.8e-10
Magma        IDR(8)                  ILU(0), Jacobi tri. solver   286           13.55 s    9.58e-10
SolverLib    PAM BICGSTAB (16 CPU)   ILU(1)                       434           37.5 s     6.2e-10

vas_stokes_2M.mtx (non-symmetric), n = 2 146 677, nnz = 65 129 037

Library      Solver                  Preconditioner               #iterations   time       residual
Paralution   IDR(8)                  ILU(0,1)                     1745          26.97 s    9.08e-10
Magma        IDR(8)                  ILU(0), Jacobi tri. solver   271           23.5 s     1.45e-09
SolverLib    PAM BICGSTAB (16 CPU)   ILU(1)                       385           62.8 s     8.45e-10

vas_stokes_4M.mtx (non-symmetric), n = 4 382 246, nnz = 131 577 61

Library      Solver                  Preconditioner               #iterations   time       residual
Paralution   IDR(8)                  ILU(0,1)                     216           61.76 s    1.46e-09
Magma        IDR(8)                  ILU(0), Jacobi tri. solver   33            57.9 s     2.04e-09
SolverLib    PAM BICGSTAB (16 CPU)   ILU(1)                       47            174.1 s    3.83e-10
We can see that the results are consistent: on all three examples Magma outperforms the other solvers. ILU(0) preconditioning and Induced Dimension Reduction with dimension 8, IDR(8), were used in Paralution and Magma. For comparison we used Silvaco's parallel domain-decomposition based solver PAM on 16 cores. Magma's speedup compared to SolverLib on 16 cores is 2.77 on the 1 million problem, 2.67 on the 2 million problem and 3 on the 4 million problem - so overall close to 3 times. Paralution showed results that were close to Magma's.
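The IDR(8)/ILU setup above is library-specific, but the overall pattern can be sketched in SciPy on a small nonsymmetric model matrix. Two caveats: SciPy ships no IDR solver, so BiCGSTAB stands in for it, and SciPy's `spilu` is a threshold-based incomplete LU rather than the ILU(0)/ILU(1) used in the tables; the convection-diffusion matrix and all parameters are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def convection_diffusion_2d(n, beta=20.0):
    """Nonsymmetric 2D convection-diffusion matrix (upwind convection)."""
    h = 1.0 / (n + 1)
    diff = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)) / h**2
    conv = sp.diags([-1, 1], [-1, 0], shape=(n, n)) * (beta / h)
    I = sp.identity(n)
    return (sp.kron(diff, I) + sp.kron(I, diff) + sp.kron(conv, I)).tocsc()

A = convection_diffusion_2d(40)      # 1600 unknowns, for illustration
b = np.ones(A.shape[0])

# Incomplete-LU preconditioner; drop_tol/fill_factor are arbitrary choices.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, matvec=ilu.solve)

x, info = spla.bicgstab(A, b, M=M)   # info == 0 signals convergence
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

On the GPU libraries the bottleneck of this scheme is the sparse triangular solves inside the ILU application, which is why Magma replaces them with Jacobi sweeps in the tables above.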
Source: Victory Device, size: 75 468, nnz: 2 449 194. A very ill-conditioned matrix with a huge condition number. PAS running on 36 threads; sparse QR in quad-double precision (NVIDIA).

Solver      total time   residual error
PAS         0.938 s      1.9e-15
Sparse QR   722 s        5.13e-26
Source: Clever, size: 1 802 979, nnz: 25 180 017. PAM running on 16 MPI processes; PCG with Jacobi preconditioner on CPU and GPU.

Iterative solver   #iterations   total time   residual error
PAM                36            5.6 s        1.20e-09
PCG+Jacobi         493           1.9 s        1.27e-09
Comment: PCG is about 2.9 times faster.
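The PCG-plus-Jacobi combination from the table can be sketched with SciPy on a small symmetric positive definite model problem standing in for the Clever matrix; sizes and tolerances here are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small SPD 2D Laplacian as a stand-in for the Clever capacitance matrix.
n = 100
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(T, sp.identity(n)) + sp.kron(sp.identity(n), T)).tocsr()
b = np.ones(A.shape[0])

# Jacobi (diagonal) preconditioner: M^-1 r = r / diag(A).
Minv = spla.LinearOperator(A.shape, matvec=lambda r: r / A.diagonal())

x, info = spla.cg(A, b, M=Minv)     # info == 0 signals convergence
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

Jacobi is a weak preconditioner (hence the 493 iterations in the table), but each application is a single elementwise divide, which maps perfectly onto a GPU and still yields the overall 2.9x win.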