Performance Evaluation of GPU Accelerated Linear Solvers with TCAD Examples

SLIDE 1

Performance Evaluation of GPU Accelerated Linear Solvers with TCAD Examples

Ana Iontcheva

Senior Development Engineer - Numerics, Silvaco Inc.

March 26, 2018

SLIDE 2
Outline

  • Introduction - Who is Silvaco? What is TCAD?
  • Linear Solvers for TCAD Simulations
  • Performance Evaluation Results
  • Conclusion

SLIDE 3
Silvaco

  • Silvaco - Silicon Valley Company
  • 34 years old, headquartered in Santa Clara, CA
  • 16 offices worldwide, 7 development centers in the US, Europe and Asia
  • Largest privately held Electronic Design Automation company
  • Develops advanced software tools for design and verification of semiconductor chips

SLIDE 4

EDA Design Flow

[Figure: EDA design flow diagram. Labels: Layout, Spice, Parasitic extraction, Spice modeling, Process, Device, Measured Data, Schematic, Spice, LVS, DRC, TCAD, Parasitic reduction, Modeling, Design & Verification, Reliability Analysis, Variation Analysis, Simulation.]

SLIDE 5

Technology Computer-Aided Design

TCAD - category of software tools for designing semiconductor devices

Applications:

  • Memory - STTRAM, 3D NAND Flash
  • Display - OLED, QLED, Flexible LCD
  • Optical - Laser, Solar Cell, Photodiodes, TFT
  • Advanced Process Development - FinFET (Fin Field Effect Transistor, a 3D device used in modern processors), Quantum models
  • Radiation/Reliability - space exploration
  • Power - Electric Cars, Consumer Appliances, Electric Railways, Medical Equipment, High-Power Equipment in Power Plants

SLIDE 6

TCAD Design Flow

[Figure: TCAD design flow diagram. Labels: Structure Builder, Process Simulation (Victory Process); Device Simulation (Victory Device); 3D RC Extraction (Clever); Layout Automation and Optimization; 2D/3D Visualization.]

SLIDE 7

Technology Computer-Aided Design

  • Process Simulation - models the fabrication steps of semiconductor devices. A device is built up as a series of very thin stacked layers, usually made from different materials.
  • Etching and deposition.
  • Oxidation - the process that converts silicon on the wafer to silicon oxide. Silicon dioxide layers are used as insulators or as masks for ion implantation.
  • Ion implantation - introduces dopant impurities into crystalline silicon.
  • Diffusion - the movement of impurity atoms in a semiconductor material at high temperatures. Diffusion is used to form the source, drain and channel regions of a transistor.

The output from the Process Simulation is a 2D or 3D structure which can be used in the Device Simulation.

  • Device Simulation - models the electrical, thermal and optical behavior of semiconductor devices.

SLIDE 8

Linear Solvers for TCAD Simulations

  • Semiconductor process and device simulations require the solution of a system of PDEs discretized on a mesh.
  • The nonlinear problem is solved with a nonlinear solver, and at each iteration of the nonlinear solver a sparse linear system needs to be solved.
  • A significant part of the overall computation time is spent on solving the linear systems, so the performance of the linear solvers is essential.
  • Accuracy is a very important requirement for the linear solvers.
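
The structure described above, a nonlinear solver in which every iteration solves a sparse linear system, can be sketched with Newton's method on a small 1D model problem. The model equation (-u'' + u^3 = 1 with zero boundary values), the function names and the direct tridiagonal solver are illustrative assumptions, not part of Victory Process or SolverLib. A production code would call a preconditioned iterative solver where this sketch calls `solve_tridiagonal`.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (lower[0], upper[-1] unused)."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def residual(u, h):
    """Finite difference residual of -u'' + u^3 - 1 = 0 with u = 0 at both ends."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / h**2 + u[i] ** 3 - 1.0)
    return out


def newton_solve(n=50, tol=1e-9, max_iter=20):
    """Newton iteration; each step assembles and solves one sparse linear system."""
    h = 1.0 / (n + 1)
    u = [0.0] * n
    for k in range(max_iter):
        f = residual(u, h)
        norm = max(abs(v) for v in f)
        if norm < tol:
            return u, k, norm
        # Jacobian of the residual: tridiagonal, with the nonlinearity on the diagonal.
        off = [-1.0 / h**2] * n
        diag = [2.0 / h**2 + 3.0 * ui**2 for ui in u]
        du = solve_tridiagonal(off, diag, off, [-v for v in f])
        u = [ui + di for ui, di in zip(u, du)]
    return u, max_iter, max(abs(v) for v in residual(u, h))


if __name__ == "__main__":
    u, linear_solves, res = newton_solve()
    print("linear solves:", linear_solves, "final residual:", res)
```

The count returned by `newton_solve` is the number of linear systems solved, which is why the cost of the linear solver dominates the cost of the nonlinear iteration.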

SLIDE 9

Evaluation Setup

  • Our first step towards incorporating accelerators into our linear solvers was to evaluate third-party linear solver libraries accelerated with GPUs and compare the results with our parallel linear solvers running on CPUs only. Our purpose was to identify the types of tools for which GPU acceleration is possible and to get an approximate idea of the speedup that we would get.
  • We tested the GPU accelerated linear solvers for sparse linear systems in Paralution [1] and Magma [2] on linear systems with matrices extracted from different modules of our process simulator, Victory Process.
  • SolverLib is a library of linear solvers developed at Silvaco. We tried various combinations of solvers, preconditioners, triangular solvers and parameters, and these results show the best performing combination that we could come up with for a specific problem.
  • For our testing we used an NVIDIA Tesla P100 16GB accelerator. SolverLib was run on an Intel Xeon E5-2699 CPU.

The numerical simulations needed for this work were performed on Microway's Tesla GPU accelerated compute cluster. The author is grateful for the HPC time provided by Microway, Inc.

1. PARALUTION Labs. Paralution v1.1, 2016. http://www.paralution.com
2. Magma 2.2.0. http://icl.cs.utk.edu/magma/software/

SLIDE 10

Victory Process - 3D stationary diffusion of oxygen in silicon oxide

laplace.mtx - non-symmetric
Source: Victory Process
n = 1 496 362, nnz = 11 640 245

Library     Solver    Preconditioner  Triangular solver  Iterations  Time    Residual
Paralution  BICGSTAB  Gauss-Seidel    -                  135         0.59 s  5.86e-10
Magma       BICGSTAB  ILU(0)          Jacobi             128         0.72 s  7.75e-10
SolverLib   BICGSTAB  ILU(0)          -                  57          6.87 s  6.92e-10

The laplace matrix comes from the finite difference approximation of a stationary diffusion equation (Laplace equation). The equation models 3D stationary diffusion of oxygen in silicon oxide during the thermal oxidation of silicon. In this example Paralution shows the best performance, probably due to the Gauss-Seidel preconditioner. The speedup of Paralution compared to SolverLib is 11.6 times. The SolverLib BICGSTAB solver is running on 1 core in this example. The Paralution and Magma results are close, with a speedup of 1.2 for Paralution. Magma did not offer a Gauss-Seidel preconditioner as an option, which would have allowed the comparison to be calibrated further.
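
For readers unfamiliar with the method named in the results above, the following is a minimal pure-Python sketch of right-preconditioned BiCGSTAB with one forward Gauss-Seidel sweep as the preconditioner. It is not the Paralution, Magma or SolverLib implementation. The small non-symmetric tridiagonal test matrix, the row-dictionary storage and all function names are assumptions chosen for illustration.

```python
import math


def matvec(rows, x):
    """Sparse matrix-vector product; each row is a {column: value} dict."""
    return [sum(val * x[col] for col, val in row.items()) for row in rows]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def gauss_seidel_apply(rows, r):
    """Apply M^{-1} r with M = D + L, i.e. one forward Gauss-Seidel sweep."""
    y = [0.0] * len(r)
    for i, row in enumerate(rows):
        acc = r[i] - sum(val * y[col] for col, val in row.items() if col < i)
        y[i] = acc / row[i]
    return y


def bicgstab(rows, b, tol=1e-10, max_iter=200):
    """Right-preconditioned BiCGSTAB; returns (x, iterations, relative residual)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual for the zero initial guess
    r_hat = r[:]                  # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = math.sqrt(dot(b, b))
    rel = 1.0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        y = gauss_seidel_apply(rows, p)
        v = matvec(rows, y)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        rel = math.sqrt(dot(s, s)) / b_norm
        if rel <= tol:            # converged after the first half step
            x = [xi + alpha * yi for xi, yi in zip(x, y)]
            return x, it, rel
        z = gauss_seidel_apply(rows, s)
        t = matvec(rows, z)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * yi + omega * zi for xi, yi, zi in zip(x, y, z)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        rel = math.sqrt(dot(r, r)) / b_norm
        if rel <= tol:
            return x, it, rel
    return x, max_iter, rel


def build_test_matrix(n):
    """Hypothetical non-symmetric tridiagonal matrix: (-1.5, 3.0, -0.5)."""
    rows = []
    for i in range(n):
        row = {i: 3.0}
        if i > 0:
            row[i - 1] = -1.5
        if i < n - 1:
            row[i + 1] = -0.5
        rows.append(row)
    return rows


if __name__ == "__main__":
    a = build_test_matrix(100)
    b = matvec(a, [1.0] * 100)    # right-hand side with known solution of all ones
    x, iters, rel = bicgstab(a, b)
    print("iterations:", iters, "relative residual:", rel)
```

The forward sweep in `gauss_seidel_apply` is inherently sequential in this form; GPU libraries typically rely on reordered or iterative variants to expose parallelism.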
SLIDE 11

Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon

The next three matrices come from the finite difference approximation of a Stokes equation (the Navier-Stokes model of a liquid with high viscosity and hence a very low Reynolds number). The equation models the 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon. This is the most time-consuming part of the simulation, so the speedup here is of particular importance. We have three matrices of different sizes, with about 1 million, 2 million and 4 million rows, because we wanted to see whether the results would be consistent across problem sizes.

SLIDE 12

Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon

vas_stokes_1M.mtx - non-symmetric
n = 1 090 664, nnz = 32 767 207

Library     Solver                Preconditioner  Triangular solver  Iterations  Time     Residual
Paralution  IDR(8)                ILU(0,1)        -                  1759        16.93 s  7.8e-10
Magma       IDR(8)                ILU(0)          Jacobi             286         13.55 s  9.58e-10
SolverLib   PAM BICGSTAB, 16 CPU  ILU(1)          -                  434         37.5 s   6.2e-10

SLIDE 13

Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon

vas_stokes_2M.mtx - non-symmetric
n = 2 146 677, nnz = 65 129 037

Library     Solver                Preconditioner  Triangular solver  Iterations  Time     Residual
Paralution  IDR(8)                ILU(0,1)        -                  1745        26.97 s  9.08e-10
Magma       IDR(8)                ILU(0)          Jacobi             271         23.5 s   1.45e-09
SolverLib   PAM BICGSTAB, 16 CPU  ILU(1)          -                  385         62.8 s   8.45e-10

SLIDE 14

Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon

vas_stokes_4M.mtx - non-symmetric
n = 4 382 246, nnz = 131 577 61

Library     Solver                Preconditioner  Triangular solver  Iterations  Time     Residual
Paralution  IDR(8)                ILU(0,1)        -                  216         61.76 s  1.46e-09
Magma       IDR(8)                ILU(0)          Jacobi             33          57.9 s   2.04e-09
SolverLib   PAM BICGSTAB, 16 CPU  ILU(1)          -                  47          174.1 s  3.83e-10

SLIDE 15

Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon

We can see that the results are consistent. On all three examples Magma outperforms the other solvers. It should be noted here that the matrices are ill-conditioned, and Induced Dimension Reduction (IDR) with dimension 8 was used in Paralution and Magma. For comparison we used Silvaco’s parallel domain decomposition based solver PAM on 16 cores. Magma’s speedup compared to SolverLib on 16 cores is 2.77 on the 1 million sized problem, 2.67 on the 2 million sized problem and 3 on the 4 million sized problem, so overall close to 3 times. Paralution showed results that were close to Magma.
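
The speedup figures quoted above follow directly from the timings on the three preceding result slides. A short check, using only the reported solve times (the dictionary layout and the function name are illustrative, and the Magma entries that appear without a unit are assumed to be in seconds):

```python
# Solve times in seconds, copied from the three Stokes result slides.
# Two Magma entries are listed without a unit; seconds are assumed.
TIMES = {
    "1M": {"Paralution": 16.93, "Magma": 13.55, "SolverLib": 37.5},
    "2M": {"Paralution": 26.97, "Magma": 23.5, "SolverLib": 62.8},
    "4M": {"Paralution": 61.76, "Magma": 57.9, "SolverLib": 174.1},
}


def speedups(times, baseline="SolverLib"):
    """Return baseline_time / solver_time for every non-baseline solver."""
    result = {}
    for case, row in times.items():
        base = row[baseline]
        result[case] = {
            name: base / t for name, t in row.items() if name != baseline
        }
    return result


if __name__ == "__main__":
    for case, row in speedups(TIMES).items():
        for name, ratio in row.items():
            print(f"{case:>2}  {name:<10}  {ratio:.2f}x")
```

With these inputs the Magma ratios come out at roughly 2.77, 2.67 and 3.01, matching the figures in the text, and the Paralution ratios at roughly 2.2, 2.3 and 2.8.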

SLIDE 16

Victory Device - example nv1

Source: Victory Device
Size: 75 468, nnz: 2 449 194
Very ill-conditioned matrix - huge condition number
PAS running on 36 threads. Sparse QR quad-double - NVIDIA

Solver     Total time  Residual error
PAS        0.938 s     1.9e-15
Sparse QR  722 s       5.13e-26
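
Why extended precision can matter for a matrix like this one can be illustrated on a much smaller, well-known ill-conditioned example. The sketch below solves a 10 x 10 Hilbert system twice with the same Gaussian elimination routine, once in double precision and once in exact rational arithmetic. The Hilbert matrix is a stand-in (it is not the nv1 matrix), and Python's `fractions.Fraction` takes the place of the quad-double arithmetic used in the sparse QR run; both are assumptions made for illustration only.

```python
from fractions import Fraction


def hilbert(n, one):
    """Hilbert matrix H[i][j] = 1 / (i + j + 1); `one` selects the number type."""
    return [[one / (i + j + 1) for j in range(n)] for i in range(n)]


def gauss_solve(a, b):
    """Gaussian elimination with partial pivoting (floats or Fractions)."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[piv] = a[piv], a[k]
        b[k], b[piv] = b[piv], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    x = [0] * n
    for i in range(n - 1, -1, -1):
        acc = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / a[i][i]
    return x


def compare(n=10):
    """Return the max error against the known solution (all ones) for each arithmetic."""
    errors = []
    for one in (1.0, Fraction(1)):
        h = hilbert(n, one)
        rhs = [sum(row) for row in h]      # right-hand side for x = (1, ..., 1)
        x = gauss_solve(h, rhs)
        errors.append(max(abs(xi - 1) for xi in x))
    return errors[0], errors[1]


if __name__ == "__main__":
    float_err, exact_err = compare()
    print("double precision error:", float_err)
    print("exact arithmetic error:", exact_err)
```

The exact solve recovers the solution with zero error, while the double precision solve loses many digits. The trade-off is cost, which is consistent with the large difference in total time between the two solvers on this slide.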

SLIDE 17

CLEVER - example nv7

Source: Clever
Size: 1 802 979, nnz: 25 180 017
PAM running on 16 MPI processes. PCG with Jacobi preconditioner - on CPU and GPU.

Iterative solver  Iterations  Total time  Residual error
PAM               36          5.6 s       1.20e-09
PCG+Jacobi        493         1.9 s       1.27e-09

Comment: PCG is about 2.9 times faster.
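
As a reference for the PCG+Jacobi entry, here is a minimal pure-Python sketch of the preconditioned conjugate gradient method with a Jacobi (diagonal) preconditioner. It is a CPU-only illustration, not the solver used in Clever. The small symmetric positive definite test matrix, the row-dictionary storage and the function names are assumptions.

```python
import math


def matvec(rows, x):
    """Sparse matrix-vector product; each row is a {column: value} dict."""
    return [sum(val * x[col] for col, val in row.items()) for row in rows]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def pcg_jacobi(rows, b, tol=1e-10, max_iter=200):
    """Conjugate gradients with M = diag(A); returns (x, iterations, rel. residual)."""
    n = len(b)
    inv_diag = [1.0 / rows[i][i] for i in range(n)]
    x = [0.0] * n
    r = b[:]                                  # residual for the zero initial guess
    z = [di * ri for di, ri in zip(inv_diag, r)]
    p = z[:]
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    rel = 1.0
    for it in range(1, max_iter + 1):
        ap = matvec(rows, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rel = math.sqrt(dot(r, r)) / b_norm
        if rel <= tol:
            return x, it, rel
        z = [di * ri for di, ri in zip(inv_diag, r)]   # Jacobi preconditioning step
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter, rel


def build_spd_matrix(n):
    """Hypothetical SPD tridiagonal matrix with a varying diagonal."""
    rows = []
    for i in range(n):
        row = {i: 2.0 + 0.5 * i}
        if i > 0:
            row[i - 1] = -1.0
        if i < n - 1:
            row[i + 1] = -1.0
        rows.append(row)
    return rows


if __name__ == "__main__":
    a = build_spd_matrix(50)
    b = matvec(a, [1.0] * 50)                 # known solution of all ones
    x, iters, rel = pcg_jacobi(a, b)
    print("iterations:", iters, "relative residual:", rel)
```

Each iteration consists only of a sparse matrix-vector product, dot products and vector updates, all of which parallelize well. This is one reason a simple method that needs many more iterations than PAM can still finish sooner.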

SLIDE 18

Conclusion

  • GPU accelerated linear solvers are a promising tool for accelerating Process Simulation and RC extraction tools, and we have added such solvers to our library of linear solvers.
  • We are continuing to work on developing a good GPU accelerated linear solver for our Device Simulation tools.