Improving Reliability Through Analyzing and Debugging Floating-Point Software

Ignacio Laguna, Computer Scientist, Center for Applied Scientific Computing
2020 ECP Annual Meeting, Feb 4, 2020
LLNL-PRES-802189

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

A Hard-To-Debug Case

Hydrodynamics mini application, during early development and porting to a new system (IBM Power8, NVIDIA GPUs):

    clang -O1: |e| = 129941.1064990107
    clang -O2: |e| = 129941.1064990107
    clang -O3: |e| = 129941.1064990107
    gcc -O1:   |e| = 129941.1064990107
    gcc -O2:   |e| = 129941.1064990107
    gcc -O3:   |e| = 129941.1064990107
    xlc -O1:   |e| = 129941.1064990107
    xlc -O2:   |e| = 129941.1064990107
    xlc -O3:   |e| = 144174.9336610391

It took several weeks of effort to debug it.

Many Factors are Involved in Unexpected Numerical Results

Every clang, gcc, and xlc build of the hydrodynamics mini application at -O1 through -O3 reports |e| = 129941.1064990107, except xlc -O3, which reports |e| = 144174.9336610391. Factors that can cause such a difference:

- Round-off error
- Floating-point precision
- Optimizations (be careful with -O3)
- Compiler (proprietary vs. open-source)
- Architecture (CPU ≠ GPU)
- Language semantics (FP is underspecified in C)
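One way round-off error and optimizations interact is that floating-point addition is not associative, so a compiler that is permitted to reorder a sum can change the rounded result. The following is a minimal C sketch of that effect only; the function names are illustrative assumptions and do not come from the mini application in the talk.

```c
#include <stdio.h>

/* Evaluates (a + b) + c, the order written in the source. */
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

/* Evaluates a + (b + c), an order a reassociating optimizer might pick. */
double sum_right(double a, double b, double c) {
    return a + (b + c);
}

/* Prints both orderings with 17 significant digits for comparison. */
void print_both(double a, double b, double c) {
    printf("(a + b) + c = %.17g\n", sum_left(a, b, c));
    printf("a + (b + c) = %.17g\n", sum_right(a, b, c));
}
```

Calling print_both(1e20, -1e20, 1.0) with IEEE 754 double arithmetic prints 1 for the first ordering and 0 for the second, because 1.0 is lost when it is added to -1e20 first.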

What Floating-Point Code Can Produce Variability?

Varity tool: https://github.com/LLNL/Varity

    Random test -> Compiler 1 -> Run -> Result: 3.1415
    Random test -> Compiler 2 -> Run -> Result: 3.1498
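The comparison step at the end of this workflow can be sketched in C as follows. This is an assumption about how two printed results might be compared, not Varity's actual implementation; results_differ is a hypothetical helper. It compares bit patterns so that 0 and -0 count as different, and it treats two NaN results as the same outcome.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 when two printed results (for example,
   lines produced with "%.17g") denote different double values. */
int results_differ(const char *out_a, const char *out_b) {
    double a = strtod(out_a, NULL);
    double b = strtod(out_b, NULL);

    /* Any two NaNs are treated as the same outcome. */
    if (isnan(a) && isnan(b)) {
        return 0;
    }
    /* Bitwise comparison distinguishes 0 from -0, unlike a == b. */
    return memcmp(&a, &b, sizeof a) != 0;
}
```

With this helper, the two results in the diagram (3.1415 and 3.1498) are reported as different, and so are 0 and -0.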

Variability Examples Found by Varity

Example 1: variability between host and device

    void compute(double comp, int var_1, double var_2,
        double var_3, double var_4, double var_5, double var_6,
        double var_7, double var_8, double var_9, double var_10,
        double var_11, double var_12, double var_13,
        double var_14) {
      double tmp_1 = +1.7948E-306;
      comp = tmp_1 + +1.2280E305 - var_2 +
        ceil((+1.0525E-307 - var_3 / var_4 / var_5));
      for (int i=0; i < var_1; ++i) {
        comp += (var_6 * (var_7 - var_8 - var_9));
      }
      if (comp > var_10 * var_11) {
        comp = (-1.7924E-320 - (+0.0 / (var_12/var_13)));
        comp += (var_14 * (+0.0 - -1.4541E-306));
      }
      printf("%.17g\n", comp);
    }

Input:

    0.0 5 -0.0 -1.3121E-306 +1.9332E-313 +1.0351E-306 +1.1275E172 -1.7335E113 +1.2916E306 +1.9142E-319 +1.1877E-306 +1.2973E-101 +1.0607E-181 -1.9621E-306 -1.5913E118

clang -O3:

    $ ./test-clang
    NaN

nvcc -O3 (V100 GPU):

    $ ./test-nvcc
    -2.3139093300000002e-188

Example 2: variability even with -O0

    void compute(double tmp_1, double tmp_2, double tmp_3,
        double tmp_4, double tmp_5, double tmp_6) {
      if (tmp_1 > (-1.9275E54 * tmp_2 + (tmp_3 - tmp_4 * tmp_5)))
      {
        tmp_1 = (0 * tmp_6);
      }
      printf("%.17g\n", tmp_1);
    }

Input:

    +1.3438E306 -1.8226E305 +1.4310E306 -1.8556E305 -1.2631E305 -1.0353E3

clang -O0:

    $ ./test-clang
    1.3437999999999999e+306

gcc -O0:

    $ ./test-gcc
    1.3437999999999999e+306

xlc -O0:

    $ ./test-xlc
    -0

FLiT: Floating-Point Litmus Tester

Bisection Method

Multiple Levels:
- Determine variability-inducing compilations
- Analyze the tradeoff of reproducibility and performance
- Locate variability by identifying files and functions causing variability

Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, and Holger E. Jones. Multi-Level Analysis of Compiler-Induced Variability and Performance Tradeoffs. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '19).
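The idea of locating a variability-inducing file by bisection can be sketched in C as below. This is a simplified illustration under strong assumptions, not FLiT's algorithm: it assumes exactly one file is responsible, and it assumes a hypothetical predicate that reports whether building the first `count` files with the suspect compilation (and the rest with a trusted baseline) changes the program's result. The names differs_fn, bisect_culprit, and demo_differs are invented for this sketch.

```c
#include <stddef.h>

/* Hypothetical predicate: nonzero when compiling files [0, count) with the
   suspect compilation changes the result. In practice this would rebuild
   and rerun the program; here it is only an interface. */
typedef int (*differs_fn)(size_t count, void *ctx);

/* Returns the index of the single culprit file among n files, or n when
   the suspect compilation does not change the result at all. */
size_t bisect_culprit(size_t n, differs_fn differs, void *ctx) {
    if (n == 0 || !differs(n, ctx)) {
        return n;
    }
    /* Invariant: prefix of length lo does not differ, prefix hi does. */
    size_t lo = 0;
    size_t hi = n;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (differs(mid, ctx)) {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    return hi - 1;
}

/* Stand-in predicate for demonstration: ctx points to the culprit index,
   and a prefix differs exactly when it contains that file. */
int demo_differs(size_t count, void *ctx) {
    size_t culprit = *(size_t *)ctx;
    return count > culprit;
}
```

Each step halves the candidate range, so a project with n files needs about log2(n) rebuild-and-run cycles instead of n. The same search could then be repeated over the functions inside the file it identifies.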

Detecting the Result of Exceptions in a CUDA Program

- Place printf statements in the code (as many as possible):

    double x = 0;
    x = x/x;
    printf("res = %e\n", x);

- Programming checks are available in CUDA:

    __device__ int isnan ( float a );
    __device__ int isnan ( double a );

These solutions are not ideal; they require significant programming effort.

FPChecker: Automatic Detection of Floating-Point Exceptions in GPUs

https://github.com/LLNL/FPChecker

- Compilation phase: the CUDA program (host code and device code) is built with an LLVM compiler, which adds instrumentation to the device code and produces a binary.
- Execution phase: the binary runs on the input with the runtime, which produces an exceptions report.
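What "instrumentation plus a runtime report" means can be illustrated on the host in plain C. The sketch below is not FPChecker's API or mechanism; fp_report, fp_check, FP_CHECK, and scaled_ratio are hypothetical names. It wraps each floating-point result in a check that classifies the value and records where an exceptional one appeared, which is the kind of work a tool can insert automatically instead of the programmer.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical report: counts of exceptional results seen at run time. */
typedef struct {
    unsigned nan_count;
    unsigned inf_count;
    unsigned subnormal_count;
} fp_report;

/* Classifies one result, records it, and returns the value unchanged so
   the check can wrap any expression. */
static double fp_check(double value, const char *file, int line,
                       fp_report *report) {
    switch (fpclassify(value)) {
    case FP_NAN:
        report->nan_count++;
        fprintf(stderr, "%s:%d: NaN\n", file, line);
        break;
    case FP_INFINITE:
        report->inf_count++;
        fprintf(stderr, "%s:%d: infinity\n", file, line);
        break;
    case FP_SUBNORMAL:
        report->subnormal_count++;
        fprintf(stderr, "%s:%d: subnormal\n", file, line);
        break;
    default:
        break;
    }
    return value;
}

#define FP_CHECK(expr, report) fp_check((expr), __FILE__, __LINE__, (report))

/* Example of code after "instrumentation": every operation is checked. */
double scaled_ratio(double a, double b, double scale, fp_report *report) {
    double q = FP_CHECK(a / b, report);
    return FP_CHECK(q * scale, report);
}
```

A caller zero-initializes an fp_report, runs the instrumented code, and inspects the counters afterwards. Doing this by hand for every operation is exactly the programming effort that an automatic tool removes.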

Floating-Point Precision Levels in GPUs Are Increasing

[Figure: timeline from 2006 to 2019 of NVIDIA GPU architectures (Tesla, Fermi, Kepler, Maxwell, Pascal, Volta) and the precision levels each supports. Early GPUs offer FP32 only; FP64 appears with compute capability 1.3; later architectures add FP16. Ratios shown on the plot: 1:2, 1:8, 1:24, 1:32.]

GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications

[Figure: workflow from a GPU program (kernel 1, kernel 2, kernel 3) to a fast GPU program. Stages shown: compiler static analysis, profiling run (optional), fast mixed-precision configurations, accuracy-driven analysis, dynamic analysis.]

Resulting fast GPU program:
- Performance speedup
- Accuracy constraints satisfied

Ignacio Laguna, Paul C. Wood, Ranvijay Singh, Saurabh Bagchi. GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications. ISC High Performance, Frankfurt, Germany, Jun 16-20, 2019 (Best paper)
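The "accuracy constraints satisfied" condition can be illustrated with a small host-side C sketch. This is not GPUMixer's analysis; it only shows the underlying check, assuming a relative-error tolerance chosen by the user. The functions dot_fp64, dot_fp32, relative_error, and within_tolerance are hypothetical names for this example.

```c
#include <stddef.h>

/* Reference version: all arithmetic in FP64. */
double dot_fp64(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}

/* Lower-precision candidate: inputs and accumulator are narrowed to FP32. */
double dot_fp32(const double *x, const double *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += (float)x[i] * (float)y[i];
    }
    return (double)acc;
}

/* Relative error of a candidate against the reference (absolute error
   when the reference is zero). */
double relative_error(double reference, double candidate) {
    double diff = reference - candidate;
    if (diff < 0.0) diff = -diff;
    double mag = reference < 0.0 ? -reference : reference;
    return mag == 0.0 ? diff : diff / mag;
}

/* Accuracy constraint: accept the candidate only within the tolerance. */
int within_tolerance(double reference, double candidate, double tol) {
    return relative_error(reference, candidate) <= tol;
}
```

For x = {0.1, 0.2, 0.3} and y = {1, 1, 1}, the FP32 result passes a tolerance of 1e-6 but fails 1e-10, so whether the faster version is acceptable depends entirely on the constraint the application sets.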

Tutorial on Floating-Point Analysis Tools

http://fpanalysistools.org/

- Demonstrate several analysis tools
- Hands-on exercises
- Cover various important aspects of floating-point and reproducibility
- Tutorials:
    - LANL, Jan 9th, 2020
    - SC19, Denver, Nov 17th, 2019
    - PEARC19, Chicago, Jul 30th, 2019

Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
