Performance Evaluation of GPU Accelerated Linear Solvers with TCAD Examples
Ana Iontcheva
Senior Development Engineer - Numerics Silvaco Inc, March 26, 2018
Outline:
- Introduction: Who is Silvaco? What is TCAD?
- Linear Solvers for …
[Diagram: Silvaco product areas - TCAD (Process and Device simulation), Modeling (Spice modeling, parasitic extraction and reduction), Design & Verification (Layout, Schematic, Spice, DRC, LVS, measured data), Reliability Analysis and Variation Analysis]
Applications:
- Field effect transistors: 3D devices used in modern processors; quantum models
- Electric railways, medical equipment, high-power equipment in power plants
TCAD tools:
- Structure Builder and Process Simulation: Victory Process
- Device Simulation: Victory Device
- 3D RC Extraction: Clever
- Layout Automation and Optimization
- 2D/3D Visualization
…insulators or masks for ion implantation.
…to form the source, drain and channel regions in a transistor. The output from the Process Simulation is a 2D or 3D structure which can be used in the Device Simulation.
…accelerated with GPUs, and compare the results with our parallel linear solvers running on CPUs only. Our purpose was to identify the types of tools for which GPU acceleration is possible and to get an approximate idea of the speedup we would obtain.
…with matrices extracted from different modules of our Process Simulator, Victory Process.
…triangular solvers and parameters; these results show the best-performing combination that we could come up with for a specific problem.
…CPU.
Acknowledgments: The numerical simulations needed for this work were performed on Microway's Tesla GPU-accelerated compute cluster. The author is grateful for the HPC time provided by Microway, Inc.

References:
1. PARALUTION Labs. PARALUTION v1.1, 2016. http://www.paralution.com
2. MAGMA 2.2.0. http://icl.cs.utk.edu/magma/software/
laplace.mtx (non-symmetric), source: Victory Process, n = 1 496 362, nnz = 11 640 245

Library      Solver     Preconditioner               #iterations   time      residual
Paralution   BICGSTAB   Gauss-Seidel                 135           0.59 s    5.86e-10
Magma        BICGSTAB   ILU(0), Jacobi tri. solver   128           0.72 s    7.75e-10
SolverLib    BICGSTAB   ILU(0)                       57            6.87 s    6.92e-10
The laplace matrix comes from the finite difference approximation of a stationary diffusion (Laplace) equation. The equation models 3D stationary diffusion of oxygen in silicon oxide during the thermal oxidation of silicon. In this example Paralution shows the best performance, probably due to the Gauss-Seidel preconditioner. The speedup of Paralution compared to SolverLib is 11.6 times; note that the SolverLib BICGSTAB solver runs on only 1 core in this example. The Paralution and Magma results are close, with Paralution about 1.2 times faster.
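To make the solver combination concrete, here is a minimal SciPy sketch (not the Paralution, Magma, or SolverLib code) that applies BiCGSTAB with a symmetric Gauss-Seidel preconditioner to a small 3D finite-difference Laplacian; the grid size and tolerances are illustrative only.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def laplace_3d(n):
    """7-point finite-difference Laplacian on an n x n x n grid."""
    d1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(sp.kron(d1, I), I)
            + sp.kron(sp.kron(I, d1), I)
            + sp.kron(sp.kron(I, I), d1)).tocsr()

n = 16                       # tiny grid, for illustration only
A = laplace_3d(n)
b = np.ones(A.shape[0])

# Symmetric Gauss-Seidel preconditioner: M = (D+L) D^-1 (D+U),
# applied as M^-1 r = (D+U)^-1 D (D+L)^-1 r.
L = sp.tril(A, format="csr")          # D + L
U = sp.triu(A, format="csr")          # D + U
D = A.diagonal()

def sgs(r):
    y = spla.spsolve_triangular(L, r, lower=True)
    return spla.spsolve_triangular(U, D * y, lower=False)

M = spla.LinearOperator(A.shape, matvec=sgs)
x, info = spla.bicgstab(A, b, M=M)    # info == 0 signals convergence
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

The same structure (Krylov solver plus a preconditioner supplied as a linear operator) is how all three libraries in the table are driven; only the backends differ.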
The next three matrices come from the finite difference approximation of a Stokes equation (a Navier-Stokes model of a liquid with high viscosity and hence a very low Reynolds number). The equation models 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon. This is the most time-consuming part of the simulation, so the speedup here is of particular importance. We used matrices of three different sizes - roughly 1 million, 2 million and 4 million unknowns - because we wanted to see whether we would get consistent results.
vas_stokes_1M.mtx (non-symmetric), n = 1 090 664, nnz = 32 767 207

Library      Solver                  Preconditioner               #iterations   time       residual
Paralution   IDR(8)                  ILU(0,1)                     1759          16.93 s    7.8e-10
Magma        IDR(8)                  ILU(0), Jacobi tri. solver   286           13.55 s    9.58e-10
SolverLib    PAM BICGSTAB (16 CPU)   ILU(1)                       434           37.5 s     6.2e-10

vas_stokes_2M.mtx (non-symmetric), n = 2 146 677, nnz = 65 129 037

Library      Solver                  Preconditioner               #iterations   time       residual
Paralution   IDR(8)                  ILU(0,1)                     1745          26.97 s    9.08e-10
Magma        IDR(8)                  ILU(0), Jacobi tri. solver   271           23.5 s     1.45e-09
SolverLib    PAM BICGSTAB (16 CPU)   ILU(1)                       385           62.8 s     8.45e-10

vas_stokes_4M.mtx (non-symmetric), n = 4 382 246, nnz = 131 577 61

Library      Solver                  Preconditioner               #iterations   time       residual
Paralution   IDR(8)                  ILU(0,1)                     216           61.76 s    1.46e-09
Magma        IDR(8)                  ILU(0), Jacobi tri. solver   33            57.9 s     2.04e-09
SolverLib    PAM BICGSTAB (16 CPU)   ILU(1)                       47            174.1 s    3.83e-10
We can see that the results are consistent: on all three examples Magma outperforms the other solvers. ILU(0) preconditioning and Induced Dimension Reduction with dimension 8, IDR(8), were used in Paralution and Magma. For comparison we used Silvaco's parallel domain-decomposition based solver PAM on 16 cores. Magma's speedup compared to SolverLib on 16 cores is 2.77 on the 1 million problem, 2.67 on the 2 million problem and 3 on the 4 million problem - so overall close to 3 times. Paralution showed results that were close to Magma's.
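The IDR(8)/ILU setup above is library-specific, but the overall pattern can be sketched in SciPy on a small nonsymmetric model matrix. Two caveats: SciPy ships no IDR solver, so BiCGSTAB stands in for it, and SciPy's `spilu` is a threshold-based incomplete LU rather than the ILU(0)/ILU(1) used in the tables; the convection-diffusion matrix and all parameters are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def convection_diffusion_2d(n, beta=20.0):
    """Nonsymmetric 2D convection-diffusion matrix (upwind convection)."""
    h = 1.0 / (n + 1)
    diff = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)) / h**2
    conv = sp.diags([-1, 1], [-1, 0], shape=(n, n)) * (beta / h)
    I = sp.identity(n)
    return (sp.kron(diff, I) + sp.kron(I, diff) + sp.kron(conv, I)).tocsc()

A = convection_diffusion_2d(40)      # 1600 unknowns, for illustration
b = np.ones(A.shape[0])

# Incomplete-LU preconditioner; drop_tol/fill_factor are arbitrary choices.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, matvec=ilu.solve)

x, info = spla.bicgstab(A, b, M=M)   # info == 0 signals convergence
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

On the GPU libraries the bottleneck of this scheme is the sparse triangular solves inside the ILU application, which is why Magma replaces them with Jacobi sweeps in the tables above.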
Source: Victory Device, size: 75 468, nnz: 2 449 194. A very ill-conditioned matrix with a huge condition number. PAS running on 36 threads; sparse QR in quad-double precision (NVIDIA).

Solver      total time   residual error
PAS         0.938 s      1.9e-15
Sparse QR   722 s        5.13e-26
Source: Clever, size: 1 802 979, nnz: 25 180 017. PAM running on 16 MPI processes; PCG with Jacobi preconditioner on CPU and GPU.

Iterative solver   #iterations   total time   residual error
PAM                36            5.6 s        1.20e-09
PCG+Jacobi         493           1.9 s        1.27e-09
Comment: PCG is about 2.9 times faster.
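The PCG-plus-Jacobi combination from the table can be sketched with SciPy on a small symmetric positive definite model problem standing in for the Clever matrix; sizes and tolerances here are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Small SPD 2D Laplacian as a stand-in for the Clever capacitance matrix.
n = 100
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(T, sp.identity(n)) + sp.kron(sp.identity(n), T)).tocsr()
b = np.ones(A.shape[0])

# Jacobi (diagonal) preconditioner: M^-1 r = r / diag(A).
Minv = spla.LinearOperator(A.shape, matvec=lambda r: r / A.diagonal())

x, info = spla.cg(A, b, M=Minv)     # info == 0 signals convergence
print(info, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

Jacobi is a weak preconditioner (hence the 493 iterations in the table), but each application is a single elementwise divide, which maps perfectly onto a GPU and still yields the overall 2.9x win.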