  1. Software Tools for Mixed-Precision Program Analysis
     Mike Lam
     James Madison University / Lawrence Livermore National Lab

  2. About Me
     • Ph.D. in CS from the University of Maryland ('07–'14)
       – Topic: automated floating-point program analysis (w/ Jeff Hollingsworth)
       – Intern @ Lawrence Livermore National Lab (LLNL) in Summer '11
     • Assistant professor at James Madison University since '14
       – Teaching: computer organization, parallel & distributed systems, compilers, and programming languages
       – Research: high-performance analysis research group (w/ Dee Weikle)
     • Faculty scholar @ LLNL since Summer '16
       – Energy-efficient computing project (w/ Barry Rountree)
       – Variable precision computing project (w/ Jeff Hittinger et al.)

  3. Context
     • IEEE floating-point arithmetic
       – Ubiquitous in scientific computing
       – More bits => higher accuracy (usually)
       – Fewer bits => higher performance (usually)
     [Diagram: single precision (FP32) = 1 sign bit + 8-bit exponent + 23-bit significand;
      double precision (FP64) = 1 sign bit + 11-bit exponent + 52-bit significand]
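     To make the two layouts concrete, here is a minimal standalone C sketch (an illustration of mine, not from the talk) that pulls apart the sign, exponent, and significand fields of an FP32 and an FP64 value:

         /* Print the IEEE-754 fields of a float and a double. */
         #include <stdio.h>
         #include <stdint.h>
         #include <string.h>

         int main(void) {
             float  f = 3.14159f;
             double d = 3.14159;
             uint32_t fb; uint64_t db;
             memcpy(&fb, &f, sizeof fb);   /* type-pun safely via memcpy */
             memcpy(&db, &d, sizeof db);

             /* FP32: 1 sign bit, 8 exponent bits, 23 significand bits */
             printf("FP32 sign=%u exp=%u frac=0x%06x\n",
                    fb >> 31, (fb >> 23) & 0xFFu, fb & 0x7FFFFFu);

             /* FP64: 1 sign bit, 11 exponent bits, 52 significand bits */
             printf("FP64 sign=%llu exp=%llu frac=0x%013llx\n",
                    (unsigned long long)(db >> 63),
                    (unsigned long long)((db >> 52) & 0x7FF),
                    (unsigned long long)(db & 0xFFFFFFFFFFFFFULL));
             return 0;
         }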

  4. Motivation
     • Vector single precision 2x+ faster
       – Possibly better if memory pressure is alleviated
       – Newest GPUs use mixed FP16/FP32 precision for tensor ops
     Instruction latencies (cycles) for Intel Knights Landing:

         Operation     FP32   Packed FP32   FP64
         Add             6         6          6
         Subtract        6         6          6
         Multiply        6         6          6
         Divide         27        32         42
         Square root    28        38         43

     Credit: https://agner.org/optimize/ and NVIDIA Tesla V100 Datasheet

  5. Questions
     • How many bits do you need?
     • Where does reduced precision help?

  6. Prior Approaches
     • Rigorous: forward/backward error analysis
       – Requires numerical analysis expertise
     • Pragmatic: "guess-and-check"
       – Requires manual code conversion effort, e.g.:

           //double x[N], y[N];
           float x[N], y[N];
           double alpha;

     (Image credit: Wikimedia Commons)
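     The whole guess-and-check loop fits in a few lines. Below is a minimal harness of my own (the kernel, names, and 1e-6 tolerance are illustrative, not from the talk): demote one set of declarations, re-run, and compare against the all-double answer:

         /* Guess-and-check: demote the arrays, keep the accumulator, verify. */
         #include <stdio.h>
         #include <math.h>

         #define N 1000

         static double dot_double(void) {
             double x[N], y[N], sum = 0.0;
             for (int i = 0; i < N; i++) { x[i] = 1.0 / (i + 1); y[i] = i + 1; }
             for (int i = 0; i < N; i++) sum += x[i] * y[i];
             return sum;
         }

         static double dot_mixed(void) {
             float x[N], y[N];     /* the "guess": demote the arrays */
             double sum = 0.0;     /* keep the accumulator in double */
             for (int i = 0; i < N; i++) { x[i] = 1.0f / (i + 1); y[i] = i + 1; }
             for (int i = 0; i < N; i++) sum += (double)x[i] * y[i];
             return sum;
         }

         int main(void) {
             double ref = dot_double(), test = dot_mixed();
             double relerr = fabs(test - ref) / fabs(ref);   /* the "check" */
             printf("rel. error = %g -> %s\n", relerr,
                    relerr < 1e-6 ? "PASS" : "FAIL");
             return 0;
         }

     Doing this by hand for every candidate variable is exactly the manual effort the automated tools below aim to remove.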

  7. Research Question
     • What can we learn about floating-point behavior with automated analysis?
       – Specifically: can we build mixed-precision versions of a program automatically?
     • Caveat: few (or no) formal guarantees
       – Rely on user-provided representative run (and sometimes a verification routine)

           double sum = 0.0;           double sum = 0.0;
           void sum2pi_x()             void sum2pi_x()
           {                       →   {
             double tmp;                 float tmp;
             double acc;                 float acc;
             int i, j;                   int i, j;
             [...]                       [...]

  8. FPAnalysis / CRAFT (2011)
     • Dynamic binary analysis via Dyninst
     • Cancellation detection
     • Range (exponent) tracking

         Example:    3.682236
                   - 3.682234
                     0.000002    (6 digits cancelled)
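     The pattern CRAFT flags can be reproduced in a few lines (a demo of mine, not CRAFT output):

         /* Subtracting nearly equal values cancels the leading digits,
          * so earlier rounding error dominates what remains. */
         #include <stdio.h>

         int main(void) {
             double a = 3.682236, b = 3.682234;
             double diff = a - b;                         /* 6 digits cancel */
             printf("%.6f - %.6f = %.6f\n", a, b, diff);  /* 0.000002 */
             return 0;
         }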

  9. CRAFT (2013)
     • Dynamic binary analysis via Dyninst
     • Instruction-level replacement of doubles w/ floats
     • Hierarchical search for valid replacements (see the sketch below)

     [Diagram: search hierarchy Program → Func1/Func2/Func3 → Insn1, Insn2, Insn3, …, InsnN]
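     The search can be sketched as a simple recursion (a toy of mine; the node layout and the "ok" flag standing in for "run passes verification" are assumptions, not CRAFT's implementation):

         /* Try demoting a whole subtree to single precision; only descend
          * into children when the subtree as a whole fails verification. */
         #include <stdio.h>

         typedef struct node {
             const char  *name;      /* program, function, or instruction */
             int          ok;        /* stand-in: does this subtree verify? */
             struct node *children;
             int          nchildren;
         } node;

         static void search(const node *n) {
             if (n->ok) {                      /* whole subtree works in single */
                 printf("replace %s\n", n->name);
                 return;
             }
             for (int i = 0; i < n->nchildren; i++)
                 search(&n->children[i]);      /* refine one level down */
         }

         int main(void) {
             node insns[] = { {"Insn1", 1, 0, 0}, {"Insn2", 0, 0, 0}, {"Insn3", 1, 0, 0} };
             node funcs[] = { {"Func1", 1, 0, 0}, {"Func2", 0, insns, 3} };
             node prog   = { "Program", 0, funcs, 2 };
             search(&prog);                    /* prints: Func1, Insn1, Insn3 */
             return 0;
         }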

  10. CRAFT (2013)

  11. CRAFT (2013)

          NAS Benchmark   Candidate      Configurations   % Dynamic
          (name.CLASS)    Instructions   Tested           Replaced
          bt.A               6,262          4,000           78.6
          cg.A                 956            255            5.6
          ep.A                 423            114           45.5
          ft.A                 426             74            0.2
          lu.A               6,014          3,057           57.4
          mg.A               1,393            437           36.6
          sp.A               4,507          4,920           30.5

  12. Issues
      • High overhead
        – Must check and (possibly) convert operands before each instruction
      • Lengthy search process
        – Search space is exponential in instruction count
      • Coarse-grained analysis
        – Binary decision: single or double

  13. CRAFT (2016)
      • Reduced-precision analysis
        – Simulate conservatively via bit-mask truncation
        – Report min output precision for each instruction
        – Finer-grained analysis and lower overhead
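      The truncation itself is cheap, which is where the lower overhead comes from. A minimal sketch, assuming the standard IEEE-754 double layout (my illustration, not CRAFT's code):

          /* Zero the low (52 - bits) significand bits of a double to
           * conservatively simulate a 'bits'-bit significand. */
          #include <stdio.h>
          #include <stdint.h>
          #include <string.h>

          static double truncate_sig(double x, int bits) {
              uint64_t u;
              memcpy(&u, &x, sizeof u);
              u &= ~((UINT64_C(1) << (52 - bits)) - 1);   /* clear low bits */
              memcpy(&x, &u, sizeof x);
              return x;
          }

          int main(void) {
              double pi = 3.14159265358979;
              for (int b = 52; b >= 4; b -= 12)
                  printf("%2d significand bits: %.15f\n", b, truncate_sig(pi, b));
              return 0;
          }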

  14. CRAFT (2016)
      • Scalability via heuristic search
        – Focus on most-executed instructions
        – Analysis time vs. benefit tradeoff

      [Chart: analysis time by execution-count threshold — >5.0%: 4:66, >1.0%: 5:93, >0.5%: 9:45, >0.1%: 15:45, >0.05%: 23:60, Full: 28:71]

  15. Issue
      • Only considers precision reduction
        – No higher precision or arbitrary precision
        – No alternative representations
        – No dynamic tracking of error

  16. SHVAL (2016)
      • Generic floating-point shadow value analysis
        – Maintain "shadow" value for every memory location
        – Execute shadow operations for all computation
        – Shadow type is parameterized (native, MPFR, Unum, Posit, etc.)
        – Pintool: less overhead than similar frameworks like Valgrind
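      The core idea can be modeled in plain C (a toy of mine: the real SHVAL is a Pintool that shadows every memory location transparently, and the shadow type is pluggable rather than hard-coded):

          /* Pair each program value with a higher-precision shadow and
           * mirror every operation on both halves. */
          #include <stdio.h>
          #include <math.h>

          typedef struct {
              float       val;     /* what the instrumented program computes */
              long double shadow;  /* the same computation, higher precision */
          } shadowed;

          static shadowed sh_add(shadowed a, shadowed b) {
              return (shadowed){ a.val + b.val, a.shadow + b.shadow };
          }
          static shadowed sh_mul(shadowed a, shadowed b) {
              return (shadowed){ a.val * b.val, a.shadow * b.shadow };
          }

          int main(void) {
              shadowed x = { 0.1f, 0.1L }, y = { 0.2f, 0.2L }, acc = { 0.0f, 0.0L };
              for (int i = 0; i < 1000; i++)
                  acc = sh_add(acc, sh_mul(x, y));        /* acc += x * y */
              printf("value = %.9f, |error vs. shadow| = %Lg\n",
                     acc.val, fabsl((long double)acc.val - acc.shadow));
              return 0;
          }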

  17. SHVAL (ongoing)
      • Single-precision shadow values
        – Trace execution and build a data-flow graph
        – Color nodes by error w.r.t. the original double-precision values
        – Highlights high-error regions
        – Inherent scaling issues

      [Figure: data-flow graph for a Gaussian elimination example — low-error inputs feed a medium-error multiply and intermediate; the final addition/output shows high error]

  18. Issue
      • No source-level mixed precision
        – Difficult to translate instruction-level analysis results to source-level transformations
        – Some users might be satisfied with opaque compiler-based optimization, but most HPC users want to know what changed!

  19. CRAFT (2013)
      • Memory-based replacement analysis (sketched below)
        – Leave computation intact but round outputs
        – Aggregate instructions that modify the same variable
        – Found several valid variable-level replacements

          NAS Benchmark   Candidate   Configurations   % Executions
          (name.CLASS)    Operands    Tested           Replaced
          bt.A               2,342        300             97.0
          cg.A                 287         68             71.3
          ep.A                 236         59             37.9
          ft.A                 466        108             46.2
          lu.A               1,742        104             99.9
          mg.A                 597        153             83.4
          sp.A               1,525      1,094             88.9
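      Rounding an output while leaving the computation intact is essentially a one-liner; a minimal sketch (the helper name is hypothetical):

          /* Compute in double, but round the stored result through float,
           * as if the destination variable had been demoted. */
          #include <stdio.h>

          static inline double store_as_float(double x) {
              return (double)(float)x;    /* round to nearest float, widen back */
          }

          int main(void) {
              double a = 1.0 / 3.0;            /* arithmetic stays in double */
              double v = store_as_float(a);    /* ...but the store is rounded */
              printf("full:    %.17f\n", a);
              printf("rounded: %.17f\n", v);   /* only ~7 digits survive */
              return 0;
          }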

  20. SHVAL (2017)
      • Single-vs-double shadow value analysis
        – Aggregate error by instruction or memory location over time
      • Computer vision case study (AprilTags)
        – 1.7x speedup on average with only 4% error
        – 40% energy savings in embedded experiments
      Credit: Ramy Medhat (ramy.medhat@uwaterloo.ca)

  21. Issues
      • Each instruction or variable is tested in isolation
        – Union of valid replacements is often invalid
      • Cannot ensure speedup
        – Instrumentation overhead
        – Added casts to convert data between regions
        – Lack of vectorization and data packing

  22. CRAFT (ongoing)
      • Variable-centric mixed-precision analysis
        – Use TypeForge (an AST-level type conversion tool) for source-to-source mixed precision
      • Search for best speedup
        – Run full compiler backend w/ optimizations
        – Report fastest configuration that passes verification

            double sum = 0.0;           double sum = 0.0;
            void sum2pi_x()             void sum2pi_x()
            {                       →   {
              double tmp;                 float tmp;
              double acc;                 float acc;
              int i, j;                   int i, j;
              [...]                       [...]

  23. Related Work
      • CRAFT/SHVAL, Precimonious [Rubio '13], GPUMixer [Laguna '19], etc.
        – Very practical
        – Widely-used tool frameworks (Dyninst, Pin, LLVM)
        – Few (or no) formal guarantees
        – Tested on HPC benchmarks on Linux/x86
      • Daisy [Darulova '18], FPTuner [Chiang '17], etc.
        – Very rigorous
        – Custom input formats
        – Provable error bounds for a given input range
        – Impractical for HPC benchmarks

  24. ADAPT (2018)
      • Automatic backwards error analysis (sketched below)
        – Obtain gradients via reverse-mode algorithmic differentiation (CoDiPack or TAPENADE)
        – Calculate the error contribution of intermediate results
        – Aggregate by program variable
        – Greedy algorithm builds a mixed-precision allocation
      Credit: Harshitha Menon (gopalakrishn1@llnl.gov)
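      The error-contribution idea can be hand-worked on a toy expression (ADAPT itself gets the gradients from CoDiPack or TAPENADE rather than by hand): propagate adjoints backward, then weight each value's single-precision rounding error by its gradient:

          /* For f(x, y) = x*y + y: reverse-mode adjoints by hand, then
           * estimate each value's contribution to output error. */
          #include <stdio.h>
          #include <math.h>
          #include <float.h>

          int main(void) {
              double x = 1.5, y = 3.0;

              /* forward pass, recording intermediates */
              double t = x * y;                 /* t = 4.5 */
              double f = t + y;                 /* f = 7.5 */

              /* reverse pass: seed df/df = 1, chain backwards */
              double f_bar = 1.0;
              double t_bar = f_bar;             /* df/dt = 1     */
              double y_bar = f_bar + t_bar * x; /* df/dy = 1 + x */
              double x_bar = t_bar * y;         /* df/dx = y     */

              /* demoting a value to float perturbs it by roughly
               * |v| * FLT_EPSILON; its effect on f scales with the adjoint */
              printf("x contributes ~%g\n", fabs(x_bar) * fabs(x) * FLT_EPSILON);
              printf("y contributes ~%g\n", fabs(y_bar) * fabs(y) * FLT_EPSILON);
              printf("t contributes ~%g\n", fabs(t_bar) * fabs(t) * FLT_EPSILON);
              printf("f = %f\n", f);
              return 0;
          }

      Values with small contributions are the safest candidates for demotion, which is what the greedy allocation exploits.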

  25. ADAPT (2018)

  26. ADAPT (2018)
      • Used ADAPT on the LULESH benchmark to help develop a mixed-precision CUDA version
      • Achieved a speedup of 20% within the original error threshold on an NVIDIA GK110 GPU
      Credit: Harshitha Menon (gopalakrishn1@llnl.gov)

  27. FloatSmith (ongoing)
      • Mixed-precision search via CRAFT
      • Source-to-source translation via TypeForge
      • Optionally, use TypeForge-automated ADAPT analysis to narrow the search and provide more rigorous guarantees

  28. FloatSmith (ongoing)
      • Guided mode (Q&A)
      • Batch mode (command-line parameters)
      • Dockerfile provided
      • Can offload configuration testing to a cluster

          floatsmith -B --run "./demo"

          double p = 1.00000003;             double p = 1.00000003;
          double l = 0.00000003;             float  l = 0.00000003;
          double o;                          double o;
          int main() {                   →   int main() {
            o = p + l;                         o = p + l;
            // should print 1.00000006         // should print 1.00000006
            printf("%.8f\n", (double)o);       printf("%.8f\n", (double)o);
            return 0;                          return 0;
          }                                  }

  29. FPHPC (ongoing)
      • Benchmark suite aimed at facilitating scale-up for mixed-precision analysis tools
        – A "middle ground" between real-valued expressions and full applications
        – Currently looking for good case studies

  30. Future Work
      • (Better) OpenMP/MPI support
      • (Better) GPU and FPGA support
      • Model-based performance prediction
      • Dynamic runtime precision tuning
      • Ensemble floating-point analysis

  31. Summary
      • Automated mixed precision is possible
        – Practicality vs. rigor tradeoff
      • Multiple active projects
        – Various goals and approaches
        – All target HPC applications
      • Many avenues for future research
