GPU-Accelerated Static Timing Analysis
Zizheng Guo1, Tsung-Wei Huang2, Yibo Lin1
1CS Department, Peking University 2ECE Department, University of Utah
Outline
Introduction
– Static timing analysis (STA)
– Previous work on STA acceleration
Problem formulation and our proposed algorithms
– RC delay computation
– Levelization
– Timing propagation
Experimental results
Conclusion
2
Correct functionality
Performance
Image source: https://www.synopsys.com/glossary/what-is-static-timing-analysis.html https://vlsiuniverse.blogspot.com/2016/12/setup-time-vs-hold-time.html https://sites.google.com/site/taucontest2015/
3
Correct functionality and performance
Simplified delay models
– Cell delay: non-linear delay model (NLDM)
– Net delay: Elmore delay model (parasitic RC tree)
4
Time-consuming for VLSI designs with millions to billions of nodes
Needs to be called many times to guide optimization
– Timing-driven placement, timing-driven routing, etc.
Image source: ePlace [Lu, TODAES’15], Dr. CU [Chen, TCAD’20]
5
Parallelization on CPU by multithreading
– [Huang, ICCAD’15] [Lee, ASP-DAC’18]...
– Cannot scale beyond 8–16 threads
Statistical STA acceleration using GPU
– [Gulati, ASPDAC’09] [Cong, FPGA’10]...
– Less challenging than conventional STA
Image source: [Huang, TCAD’20] https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction
6
Accelerate STA using modern GPU
– Lookup table query and timing propagation [Wang, ICPP’14] [Murray, FPT’18]
– 6.2× kernel-time speed-up, but only 0.9× overall because of data copying
Leveraging GPU is challenging
– Graph-oriented: diverse computational patterns and irregular memory access
– Data-copy overhead
7
Efficient GPU algorithms
– Covers the runtime bottlenecks
Implementation based on the open-source STA engine OpenTimer
https://github.com/OpenTimer/OpenTimer
8
The Elmore delay model explained.
load_v = Σ_{w ∈ subtree(v)} cap_w
– equivalently, load_v = cap_v + Σ_{w child of v} load_w
delay_v = Σ_{w: any node} cap_w × R_{root→LCA(v,w)}
– equivalently, delay_v = delay_{parent(v)} + R_{parent(v)→v} × load_v
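As a sanity check on these two formulas, here is a small Python sketch. It is not from the slides: the five-node tree A → {B, E}, B → {C, D} and all resistance/capacitance values are made-up assumptions. It evaluates delay_v by definition and confirms the parent recurrence delay_v = delay_{parent(v)} + R_{parent(v)→v} × load_v:

```python
# Hypothetical RC tree (not from the slides): A is the root, A -> {B, E},
# B -> {C, D}; all capacitance/resistance values are made up.
parent = {"B": "A", "E": "A", "C": "B", "D": "B"}
children = {"A": ["B", "E"], "B": ["C", "D"], "C": [], "D": [], "E": []}
cap = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 1.5, "E": 1.0}
res = {"B": 3.0, "E": 2.0, "C": 1.0, "D": 4.0}  # res[v]: edge parent[v] -> v

def subtree(v):
    yield v
    for c in children[v]:
        yield from subtree(c)

# load_v: total capacitance in v's subtree
load = {v: sum(cap[w] for w in subtree(v)) for v in cap}

def path(v):
    """Non-root nodes on the root-to-v path (each names its incoming edge)."""
    return [] if v == "A" else path(parent[v]) + [v]

def shared_res(v, w):
    """R_{root->LCA(v,w)}: resistance common to the root-to-v and root-to-w paths."""
    common = set(path(v)) & set(path(w))
    return sum(res[x] for x in common)

# delay_v by definition...
delay = {v: sum(cap[w] * shared_res(v, w) for w in cap) for v in cap}

# ...and via the parent recurrence -- both agree
for v in ["B", "E", "C", "D"]:
    assert delay[v] == delay[parent[v]] + res[v] * load[v]
```

The agreement between the definition and the recurrence is exactly what lets the GPU compute delay with one bottom-up pass (load) and one top-down pass (delay).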
9
The Elmore delay model explained.
ldelay_v = Σ_{w ∈ subtree(v)} cap_w × delay_w
β_v = Σ_{w: any node} cap_w × delay_w × R_{root→LCA(v,w)} = β_{parent(v)} + R_{parent(v)→v} × ldelay_v
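The second-moment pair (ldelay, β) follows the same bottom-up/top-down structure as (load, delay), which is what makes it fit the same GPU passes. A sketch on the same assumed five-node tree (A → {B, E}, B → {C, D}; all values made up):

```python
# Same hypothetical tree as the load/delay example (an assumption, not the
# slides' figure): A -> {B, E}, B -> {C, D}; all R/C values are made up.
parent = {"B": "A", "E": "A", "C": "B", "D": "B"}
children = {"A": ["B", "E"], "B": ["C", "D"], "C": [], "D": [], "E": []}
cap = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 1.5, "E": 1.0}
res = {"B": 3.0, "E": 2.0, "C": 1.0, "D": 4.0}  # res[v]: edge parent[v] -> v

def subtree(v):
    yield v
    for c in children[v]:
        yield from subtree(c)

# First pass pair: load (bottom-up), delay (top-down)
load = {v: sum(cap[w] for w in subtree(v)) for v in cap}
delay = {"A": 0.0}
for v in ["B", "E", "C", "D"]:
    delay[v] = delay[parent[v]] + res[v] * load[v]

# Second pass pair: ldelay is a subtree sum like load, with cap_w * delay_w
ldelay = {v: sum(cap[w] * delay[w] for w in subtree(v)) for v in cap}

# ...and beta follows the same top-down recurrence as delay, driven by ldelay
beta = {"A": 0.0}
for v in ["B", "E", "C", "D"]:
    beta[v] = beta[parent[v]] + res[v] * ldelay[v]
```

In two-moment Elmore formulations, β_v is subsequently used to estimate the slew at v.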
10
Flatten the RC trees by parallel BFS and counting sort on GPU.
– Store only the parent index of each node on GPU
– Redesign the dynamic programming on trees
11
Store only the parent index of each node on GPU, and redesign the dynamic programming on trees.
DFS_load(u):
    load[u] = cap[u]
    for child v of u:
        DFS_load(v)
        load[u] += load[v]

GPU_load:  # load[] initialized to 0; nodes visited in bottom-up order
    for u in [C, D, B, E, A]:
        load[u] += cap[u]
        load[u.parent] += load[u]
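The two variants can be simulated sequentially in Python to confirm they agree. The tree A → {B, E}, B → {C, D} implied by the bottom-up order [C, D, B, E, A], and the capacitance values, are assumptions for illustration (on the GPU the per-level updates run in parallel with atomic adds):

```python
# Hypothetical tree matching the bottom-up order [C, D, B, E, A] (assumed):
# A is the root, A -> {B, E}, B -> {C, D}. Capacitances are made up.
parent = {"C": "B", "D": "B", "B": "A", "E": "A", "A": None}
cap = {"A": 1.0, "B": 2.0, "C": 0.5, "D": 1.5, "E": 1.0}
children = {u: [v for v in parent if parent[v] == u] for u in parent}

# Reference: the recursive DFS formulation
def dfs_load(u, load):
    load[u] = cap[u]
    for v in children[u]:
        dfs_load(v, load)
        load[u] += load[v]

# Flattened formulation: one sweep over nodes in bottom-up order, touching
# only cap[u] and the parent index -- no recursion, no child lists.
def gpu_load():
    load = {u: 0.0 for u in parent}
    for u in ["C", "D", "B", "E", "A"]:
        load[u] += cap[u]
        if parent[u] is not None:
            load[parent[u]] += load[u]
    return load

ref = {}
dfs_load("A", ref)
assert gpu_load() == ref  # both formulations produce identical loads
```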
12
Store only the parent index of each node on GPU, and re-implement the dynamic programming on trees based on the direction of value update.
DFS_delay(u):
    for child v of u:
        temp := R[u,v] * load[v]
        delay[v] = delay[u] + temp
        DFS_delay(v)

GPU_delay:  # delay[root] = 0; nodes visited in top-down order (root skipped)
    for u in [A, E, B, D, C]:
        temp := R[u.parent, u] * load[u]
        delay[u] = delay[u.parent] + temp
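The top-down pass can be simulated the same way. The tree, edge resistances, and load values below are assumptions carried over from the load example, not the slides' figure:

```python
# Sequential simulation of the top-down delay pass on the assumed tree
# A -> {B, E}, B -> {C, D}; load[] holds the downstream capacitances a load
# pass would have produced, and the edge resistances R[] are made up.
parent = {"C": "B", "D": "B", "B": "A", "E": "A", "A": None}
load = {"A": 6.0, "B": 4.0, "C": 0.5, "D": 1.5, "E": 1.0}
R = {("A", "B"): 3.0, ("A", "E"): 2.0, ("B", "C"): 1.0, ("B", "D"): 4.0}

def gpu_delay():
    delay = {u: 0.0 for u in parent}      # delay at the root is 0
    for u in ["A", "E", "B", "D", "C"]:   # top-down (reverse of bottom-up) order
        if parent[u] is None:
            continue                      # the root has no incoming edge
        delay[u] = delay[parent[u]] + R[(parent[u], u)] * load[u]
    return delay

d = gpu_delay()
assert d["D"] == d["B"] + 4.0 * load["D"]  # each node extends its parent's delay
```

Because each node reads only its parent's value, every node in a level can be updated independently, matching the flattened level order.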
13
Global memory reads/writes introduce latency. The GPU automatically coalesces adjacent memory requests issued by threads in a warp.
Image source: https://docs.nvidia.com/cuda/cuda-c- programming-guide/index.html#memory-hierarchy
14
Build level-by-level dependencies for timing propagation tasks.
– Essentially a parallel topological sorting.
Maintain a set of nodes called the frontier, and update the set using the “advance” operation.
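The frontier/advance idea can be sketched in a few lines of Python. The gate-level DAG below is a made-up example; on the GPU, each "advance" processes the whole frontier in parallel with atomic in-degree decrements:

```python
# Level-by-level topological sorting ("levelization") on a small hypothetical
# DAG; node names and edges are made up for illustration.
from collections import defaultdict

edges = [("in1", "g1"), ("in2", "g1"), ("g1", "g2"), ("in2", "g2"), ("g2", "out")]
succ, indeg = defaultdict(list), defaultdict(int)
nodes = {u for e in edges for u in e}
for u, v in edges:
    succ[u].append(v)
    indeg[v] += 1

# Frontier = nodes whose dependencies are all resolved; "advance" produces the
# next frontier by decrementing the in-degrees of the current frontier's successors.
frontier = [u for u in nodes if indeg[u] == 0]
levels = []
while frontier:
    levels.append(sorted(frontier))
    nxt = []
    for u in frontier:            # parallel across u on the GPU
        for v in succ[u]:
            indeg[v] -= 1         # atomic decrement on the GPU
            if indeg[v] == 0:
                nxt.append(v)
    frontier = nxt

assert levels == [["in1", "in2"], ["g1"], ["g2"], ["out"]]
```

Each entry of `levels` is one dependency level; timing propagation then launches one kernel per level.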
15
Benchmark   #Nodes      Max in-degree   Max out-degree
netcard     3,999,174   8               260
vga_lcd     397,809     12              329
wb_dma      13,125      12              95
16
Do linear interpolation/extrapolation and eliminate unnecessary branches
– Unified inter-/extrapolation
– Degenerate LUTs
17
Nvidia CUDA, RTX 2080, 40 Intel Xeon Gold 6138 CPU cores
RC Tree Flattening
– 64 threads per block with one block for each net
Elmore delay computation
– 4 threads for each net (one per early/late × rise/fall combination) with a block of 64 nets
Levelization
– 128 threads per block
Timing propagation
– 4 threads for each arc, with a block of 32 arcs
18
Up to 3.69× speed-up (including data copy)
Bigger performance margin with bigger problem size
19
Break-even point
– 45K nets and gates – 67K propagation candidates
Useful for timing-driven optimization
Mixed strategy
21
Conclusions:
– GPU-accelerated STA that goes beyond the scalability of existing methods
– GPU-efficient data structures and algorithms for delay computation, levelization, and timing propagation
– Up to 3.69× speed-up
Future Work
– Explore different cell/net delay models
– Develop efficient GPU algorithms for CPPR (common path pessimism removal)
22
Website: https://guozz.cn Email: gzz@pku.edu.cn