

  1. Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation. Cheng-Han Du*, I-Hsin Chung**, Weichung Wang*. *Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan. **IBM T. J. Watson Research Center, NY, US. 5/8/2017, GTC 2017 @ San Jose

  2. Outline • Introduction • Implementation • Numerical Results I • P2P Matrix Sharing • Numerical Results II • Summary

  3. Introduction (Ref: Sun et al., Nature 528, 2015) • Photonics: waveguides, resonant cavities, frequency filters, plasmonic devices • Design concerns: structural characteristics (Ref: Ivinskaya & Lavrinenko, 2011), parameter refinement, experimental data

  4. Introduction: Why Multi-GPU Scaling • Global supercomputing trend • High energy efficiency • Growing popularity in deep learning applications • Integration of high-performance numerical simulation and deep learning (Image sources: ORNL, NVIDIA)

  5. Introduction • Software-stack diagram: Machine-Learning-Derived Behavior Model and Intelligent Design; Nonlinear Equations with Multiphysics; Photonic Integrated Circuit Design; Broadband Spectral Analysis Features; Photonic Crystal Analyzer; Preconditioner and Algorithm for Iterative Side-Equation Solver; Shift-Inverse Eigensolver; Parallel Direct FDFD Solver Kernel

  6. Introduction • Same software-stack diagram, with the callout "When iterative solver fails…" pointing to the Parallel Direct FDFD Solver Kernel

  7. Introduction • Objectives: fast generation of numerical data for different parameters; data-driven intelligent design of optical components; explicit and fast acquisition of quantitative characteristics; reduction of postprocessing and data storage/transfer requirements • Kernel: Parallel Direct Finite-Difference Frequency-Domain (FDFD) Solver

  8. Outline • Introduction • Implementation • Numerical Results I • P2P Matrix Sharing • Numerical Results II • Summary

  9. Implementation • FDFD problem: the curl-curl equation ∇ × ∇ × E − k₀² ε_r E = −iωμ₀ J is discretized on Yee's mesh, with perfectly matched layers, into a sparse linear system A x = b • Direct solver for a robust solution of this high-frequency problem • Challenge: heavy factorization loads • Parallel Direct FDFD Solver Kernel
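To make the discretize-then-direct-solve flow concrete, here is a minimal 1D frequency-domain toy, not the authors' 3D Yee-mesh solver: a scalar Helmholtz equation is discretized with central differences into a complex tridiagonal system and solved directly. The grid spacing and wavelength reuse the SOI-example values from the later slides; the PML and the vector curl-curl operator are omitted, and all names are illustrative.

```cpp
#include <complex>
#include <cstdio>
#include <vector>

using cplx = std::complex<double>;

// Direct solve of a tridiagonal system a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i
// (Thomas algorithm): the 1D stand-in for the sparse direct factorization.
static std::vector<cplx> solveTridiag(std::vector<cplx> a, std::vector<cplx> b,
                                      std::vector<cplx> c, std::vector<cplx> d) {
    int n = (int)b.size();
    for (int i = 1; i < n; ++i) {            // forward elimination (the "factorization")
        cplx w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    std::vector<cplx> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; --i)          // back substitution (the "solve")
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}

int main() {
    // Hypothetical 1D grid: d^2E/dx^2 + k0^2 eps(x) E = -i w mu0 J, central differences,
    // zero boundary values. The real solver is 3D on Yee's mesh with PML; this only
    // shows the discretize -> sparse system -> direct solve flow.
    const double PI = 3.141592653589793;
    const int    n  = 400;
    const double dx = 0.02e-6;                // 0.02 um grid, as in the SOI example
    const double lambda0 = 1.5e-6;            // 1.5 um wavelength
    const double k0 = 2.0 * PI / lambda0;
    std::vector<cplx> lo(n, 1.0 / (dx * dx)), up(n, 1.0 / (dx * dx)), di(n), rhs(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double eps = (i > n / 3 && i < 2 * n / 3) ? 12.0 : 1.0;  // toy silicon slab
        di[i] = -2.0 / (dx * dx) + k0 * k0 * eps;
    }
    rhs[n / 2] = cplx(0.0, 1.0);              // point current source
    auto E = solveTridiag(lo, di, up, rhs);
    std::printf("|E| at source: %g\n", std::abs(E[n / 2]));
    return 0;
}
```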

  10. Implementation • Compressed hierarchical Schur method (CHiS): domain decomposition, multi-level algorithm • 3D nested dissection of Yee's mesh (N_x × N_y × N_z) • Ideal periodic structure: E_1 = E_2 = E_3 = … = E_16; T_{1,1} = T_{1,2} = T_{1,3} = … = T_{1,8}; T_{2,1} = T_{2,2} = T_{2,3} = T_{2,4}; T_{3,1} = T_{3,2}; T_{4,1}
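A sketch of the deduplication idea behind CHiS for periodic structures: blocks of the nested-dissection tree whose matrices are identical (E_1 = E_2 = … = E_16 above) share a single factorization. The `Stamp`, `Factor`, and `FactorCache` names are hypothetical; the slides only state which blocks coincide.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <tuple>

// Hypothetical handle for a factorized block (e.g., sparse LU factors from
// PARDISO/MUMPS or dense LU factors); the real CHiS data structures are not shown.
struct Factor { /* ... LU factors, pivots, symbolic data ... */ };

// A subdomain "stamp": grid extent plus a material-pattern hash. Two blocks with
// the same stamp have identical matrices, so the factorization is computed once
// and shared among all of them.
struct Stamp {
    int nx, ny, nz;
    uint64_t materialHash;
    bool operator<(const Stamp& o) const {
        return std::tie(nx, ny, nz, materialHash) <
               std::tie(o.nx, o.ny, o.nz, o.materialHash);
    }
};

class FactorCache {
public:
    std::shared_ptr<Factor> factorize(const Stamp& s) {
        auto it = cache_.find(s);
        if (it != cache_.end()) return it->second;   // duplicate block: reuse factors
        auto f = std::make_shared<Factor>();          // ...run the actual LU here...
        cache_[s] = f;
        return f;
    }
private:
    std::map<Stamp, std::shared_ptr<Factor>> cache_;
};
```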

  11. Implementation • Compressed hierarchical Schur method • Elimination tree deduplication: diagonal blocks and interfaces to children (interface blocks J_V and J_M)

  12. Implementation • Compressed hierarchical Schur method • Elimination tree deduplication: diagonal blocks and interfaces to children

  13. Implementation • Compressed hierarchical Schur method • Leaf-level Interface Compression (LIC): use one updating submatrix for multiple Schur-complement submatrices via row/column permutations • Less sparse-matrix computation means a lighter CPU-centric load
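A sketch of what LIC amounts to at the level of the Schur update: one dense updating submatrix `U` is scattered into several Schur-complement blocks through per-block row/column index maps instead of being recomputed for each block. The `rowMap`/`colMap` arrays and the dense storage layout are illustrative assumptions.

```cpp
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Column-major dense matrix view (leading dimension = number of rows).
struct Dense {
    int rows, cols;
    std::vector<cplx> a;
    cplx& operator()(int i, int j) { return a[(size_t)j * rows + i]; }
};

// Scatter one updating submatrix U into several Schur-complement blocks S_k via
// per-block row/column permutations: S_k(rowMap[k][i], colMap[k][j]) -= U(i, j).
// rowMap[k] has U.rows entries and colMap[k] has U.cols entries.
void scatterUpdate(const std::vector<std::vector<int>>& rowMap,
                   const std::vector<std::vector<int>>& colMap,
                   Dense& U, std::vector<Dense>& S) {
    for (size_t k = 0; k < S.size(); ++k)
        for (int j = 0; j < U.cols; ++j)
            for (int i = 0; i < U.rows; ++i)
                S[k](rowMap[k][i], colMap[k][j]) -= U(i, j);  // permuted Schur update
}
```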

  14. Implementation • Compressed hierarchical Schur method exposes larger chunks of matrix computation • Major function calls and libraries:
  Subdomains: sparse diagonal → sparse factorization; sparse interface → sparse LS solve and matrix multiply (Option 1: PARDISO + Sparse BLAS; Option 2: MUMPS)
  Separators: dense diagonal → dense LU; packed dense interface → dense LS solve and matrix multiply (BLAS ZGEMM and LAPACK ZGETRF/ZGETRS; hardware acceleration on GPU: cuBLAS, cuSolver, etc.)
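A hedged sketch of one dense separator step using the CPU routines the slide names (ZGETRF, ZGETRS, ZGEMM); the GPU path replaces these with their cuSOLVER/cuBLAS counterparts. The J_V/J_M notation follows the earlier slides, but the wrapper function, sizes, and storage conventions are assumptions.

```cpp
#include <complex>
#include <vector>

// Fortran LAPACK/BLAS entry points named on the slide; declared here directly
// rather than pulled from a vendor header.
extern "C" {
void zgetrf_(const int* m, const int* n, std::complex<double>* a, const int* lda,
             int* ipiv, int* info);
void zgetrs_(const char* trans, const int* n, const int* nrhs,
             const std::complex<double>* a, const int* lda, const int* ipiv,
             std::complex<double>* b, const int* ldb, int* info);
void zgemm_(const char* ta, const char* tb, const int* m, const int* n, const int* k,
            const std::complex<double>* alpha, const std::complex<double>* a, const int* lda,
            const std::complex<double>* b, const int* ldb,
            const std::complex<double>* beta, std::complex<double>* c, const int* ldc);
}

// One dense separator step: LU of the diagonal block T (ZGETRF), interface solve
// T^{-1} J_V (ZGETRS), and Schur update S -= J_M (T^{-1} J_V) (ZGEMM).
// Column-major storage; n = separator size, m = interface width.
void denseSeparatorStep(int n, int m,
                        std::vector<std::complex<double>>& T,        // n x n
                        std::vector<std::complex<double>>& JV,       // n x m, overwritten
                        const std::vector<std::complex<double>>& JM, // m x n
                        std::vector<std::complex<double>>& S)        // m x m
{
    std::vector<int> ipiv(n);
    int info = 0;
    zgetrf_(&n, &n, T.data(), &n, ipiv.data(), &info);                       // dense LU
    zgetrs_("N", &n, &m, T.data(), &n, ipiv.data(), JV.data(), &n, &info);   // T^{-1} J_V
    std::complex<double> minusOne(-1.0, 0.0), one(1.0, 0.0);
    zgemm_("N", "N", &m, &m, &n, &minusOne, JM.data(), &m,
           JV.data(), &n, &one, S.data(), &m);                               // Schur update
}
```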

  15. Implementation • GPU acceleration considerations • Multi-GPU scaling in a single node (scale-up) • No longer based solely on nested dissection • Asynchronous streams for small submatrices, overlapping some computation kernels • Hardware scheduling: threaded GPU controls, thread affinity
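A minimal sketch of the "threaded GPU controls" pattern: one host thread per GPU, each binding its device and owning its own stream and cuBLAS handle. The work queue and the actual affinity pinning are only indicated in comments, since the slides do not show them.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// One host thread per GPU. Each thread binds its device, creates its own stream
// and cuBLAS handle, and processes the blocks assigned to that GPU
// (the work queue is hypothetical and omitted here).
void runOnAllGpus(int nGpus) {
    std::vector<std::thread> workers;
    for (int dev = 0; dev < nGpus; ++dev) {
        workers.emplace_back([dev]() {
            cudaSetDevice(dev);                 // bind this host thread to one GPU
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            cublasHandle_t blas;
            cublasCreate(&blas);
            cublasSetStream(blas, stream);      // asynchronous kernels on this stream
            // ... pop blocks from this GPU's work queue and run the
            //     ZGETRS/ZGEMM pipeline; thread affinity (pinning the thread
            //     near the GPU's PCIe root complex) is set outside this sketch ...
            cudaStreamSynchronize(stream);
            cublasDestroy(blas);
            cudaStreamDestroy(stream);
        });
    }
    for (auto& t : workers) t.join();
}
```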

  16. Implementation • GPU acceleration: factorize all diagonal blocks T_{j,k} related to level j (CPU or GPU work)

  17. Implementation • GPU acceleration: asynchronously send some blocks to the GPU and perform T_{j,k}^{-1} J_V

  18. Implementation • GPU acceleration: continue to ZGEMM with no D2H data transfer; T_{j,k}^{-1} J_V is kept on the GPU for the later J_M (T_{j,k}^{-1} J_V) operation, and the workspace is simply discarded once it is no longer needed

  19. Implementation • GPU acceleration: asynchronously perform the ZGEMM J_M (T_{j,k}^{-1} J_V)

  20. Implementation • GPU acceleration: collect J_M (T_{j,k}^{-1} J_V) from all GPUs and perform the higher-level Schur update on the CPU

  21. Implementation • GPU acceleration: continue with more ZGEMMs involving T_{j,k}^{-1} J_V and J_M (T_{j,k}^{-1} J_V) and further Schur updates…
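Putting slides 16 through 21 together, a hedged sketch of the per-block pipeline on one stream: asynchronous H2D staging, ZGETRS on the device with the result kept resident, ZGEMM, and a D2H copy of only the product for the CPU-side Schur update. Buffer names, the pinned-memory assumption, and the single-function layout are illustrative, not the authors' code.

```cpp
#include <cuComplex.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Stage J_V to the GPU asynchronously, form T_{j,k}^{-1} J_V there, keep it
// resident for the ZGEMM J_M (T^{-1} J_V), and copy back only the product.
// hJV/hJM/hProd are assumed pinned host buffers; dT/dPiv hold the already
// factorized LU data of T_{j,k} on the device. Error checking omitted.
void pipelineOneBlock(cusolverDnHandle_t sol, cublasHandle_t blas, cudaStream_t s,
                      int n, int m,   // n = separator size, m = interface width
                      const cuDoubleComplex* dT, const int* dPiv, int* dInfo,
                      const cuDoubleComplex* hJV, const cuDoubleComplex* hJM,
                      cuDoubleComplex* hProd,
                      cuDoubleComplex* dJV, cuDoubleComplex* dJM, cuDoubleComplex* dProd)
{
    cusolverDnSetStream(sol, s);
    cublasSetStream(blas, s);

    // Asynchronous H2D of the interface blocks for this T_{j,k}.
    cudaMemcpyAsync(dJV, hJV, sizeof(cuDoubleComplex) * n * m, cudaMemcpyHostToDevice, s);
    cudaMemcpyAsync(dJM, hJM, sizeof(cuDoubleComplex) * m * n, cudaMemcpyHostToDevice, s);

    // ZGETRS: dJV <- T_{j,k}^{-1} J_V, left on the GPU (no D2H of the intermediate).
    cusolverDnZgetrs(sol, CUBLAS_OP_N, n, m, dT, n, dPiv, dJV, n, dInfo);

    // ZGEMM: dProd <- J_M * (T^{-1} J_V).
    cuDoubleComplex one = make_cuDoubleComplex(1.0, 0.0), zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, m, n, &one, dJM, m, dJV, n, &zero, dProd, m);

    // Only the product goes back to the host for the higher-level Schur update.
    cudaMemcpyAsync(hProd, dProd, sizeof(cuDoubleComplex) * m * m, cudaMemcpyDeviceToHost, s);
}
```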

  22. Implementation • GPU acceleration • Workload balance for multi-GPU: distribute J_V blocks by parent levels • Tackles extreme cases with many duplicates • Minor increase in H2D transfer

  23. Implementation • GPU acceleration • Workload balance for multi-GPU: split J_V into panels • Each J_V column panel should be large enough • Multiple J_M copies are sent to the GPUs • Moderate increase in H2D transfer
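A sketch of the panel-based balancing described on these two slides: the columns of a large J_V block are split into panels of at least a minimum width and dealt out to the GPUs round-robin, so each GPU gets enough work; every participating GPU then needs its own J_M copy, which is the moderate H2D increase mentioned above. `minPanelCols` and the return layout are assumptions, not values from the talk.

```cpp
#include <algorithm>
#include <vector>

// Which GPU handles which column range of one J_V block.
struct PanelAssignment { int gpu, colBegin, colEnd; };

// Split totalCols columns into at most nGpus panels, each at least minPanelCols
// wide, and assign them round-robin across the GPUs.
std::vector<PanelAssignment> splitPanels(int totalCols, int nGpus, int minPanelCols) {
    int nPanels = std::max(1, std::min(nGpus, totalCols / std::max(1, minPanelCols)));
    std::vector<PanelAssignment> out;
    int base = totalCols / nPanels, rem = totalCols % nPanels, col = 0;
    for (int p = 0; p < nPanels; ++p) {
        int width = base + (p < rem ? 1 : 0);
        out.push_back({p % nGpus, col, col + width});   // round-robin over GPUs
        col += width;
    }
    return out;
}
```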

  24. Implementation • GPU acceleration without workload balance: finishing time > 325 seconds

  25. Implementation • GPU acceleration with workload balance: finishing time < 250 seconds

  26. Outline • Introduction • Implementation • Numerical Results I • P2P Matrix Sharing • Numerical Results II • Summary

  27. Numerical Results I • Hardware and software specifications:
  Brillante: 2 × Intel E5-2670 v3 (12 + 12 cores used), 256 GB memory, 2 × K40 GPUs; software: Intel Parallel Studio 2016 update 1 (Intel PARDISO), MUMPS 5.0.1, CUDA 7.5
  P8Exp: 2 × IBM Power8 (8 + 8 cores used), 1 TB memory, 4 × K80 GPUs; software: IBM ESSL and Parallel ESSL, IBM XL Fortran and XL C compilers, CUDA 7.5

  28. Numerical Results I • SOI dielectric waveguide • Total grid: 79 × 319 × 39 (matrix dimension 2,948,517) • Wavelength: 1.5 μm • Grid size: 0.02 μm • 100 GB RAM
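A quick arithmetic check of the quoted matrix dimension: three field components per Yee-grid point give 3 × 79 × 319 × 39 = 2,948,517 unknowns.

```cpp
#include <cstdio>

// Three E-field components per Yee-grid point give the matrix dimension
// quoted on the slide.
long long fdfdUnknowns(long long nx, long long ny, long long nz) {
    return 3 * nx * ny * nz;
}

int main() {
    std::printf("SOI waveguide unknowns: %lld\n", fdfdUnknowns(79, 319, 39)); // 2,948,517
    return 0;
}
```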

  29. Numerical Results I • Brillante: 2 × K40 • ZGETRS + ZGEMM: 𝟓𝟒𝟘.𝟒 seconds (𝟘𝟏% of overall time)

  30. Numerical Results I • Brillante: 2 × K40 • Naïve GPU acceleration yields good speedup due to high arithmetic intensity (AI) • "Scatter" time includes D2H transfer

  31. Numerical Results I • Brillante: 2 × K40 • Async streams apply to low-level separators, which are finished in seconds even in CPU-only mode

  32. Numerical Results I • Brillante: 2 × K40 • Workload balance yields better speedup and multi-GPU scaling

  33. Numerical Results I • P8Exp: 4 × K80 with autoboost • Good performance scaling on the quad-K80 server • Higher performance with half-K80 computing • Two threads compete for a single PCIe link's bandwidth when using the full K80

  34. Numerical Results I • P8Exp: 4 × K80 with autoboost • AccTRSMM: multi-GPU scaling • Increased H2D transfer due to multiple J_M copies for work-sharing GPUs • Scaling performance is still acceptable
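A back-of-envelope model, not taken from the slides, of the trade-off just described: the extra H2D time for a replicated J_M block versus the ZGEMM time it feeds. The PCIe bandwidth, sustained ZGEMM rate, and block sizes are assumed placeholders.

```cpp
#include <cstdio>

// Rough check of whether the extra H2D traffic from replicating J_M across
// GPUs is hidden by the ZGEMM it feeds. All rates and sizes are assumptions.
int main() {
    const double pcieGBps   = 12.0;   // assumed effective PCIe 3.0 x16 bandwidth
    const double gemmTflops = 1.0;    // assumed sustained double-complex ZGEMM rate
    const long long n = 4000, m = 6000;   // hypothetical separator/interface sizes
    double h2dSec  = 16.0 * m * n / (pcieGBps * 1e9);                 // J_M is m x n, 16 B/entry
    double gemmSec = 8.0 * double(m) * double(m) * double(n)
                   / (gemmTflops * 1e12);                             // ~8*m*m*n real flops
    std::printf("H2D %.3f s vs ZGEMM %.3f s -> %s\n", h2dSec, gemmSec,
                gemmSec > h2dSec ? "transfer hidden" : "PCIe-bound");
    return 0;
}
```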

  35. Numerical Results I • Periodic air-hole wavelength filter • No propagation at λ₀ = 1.5 μm • Total grid: 79 × 575 × 47 (matrix dimension 6,404,925) • 188 GB RAM

  36. Numerical Results I • Brillante: 2 × K40

  37. Numerical Results I • P8Exp: 4 × K80 with autoboost

  38. Numerical Results I • P8Exp: GPU scaling of AccTRSMM • Many more dense matrix operations • Good scaling on multi-GPU systems
