  1. Software Tools for Mixed-Precision Program Analysis
     Mike Lam
     James Madison University / Lawrence Livermore National Lab

  2. About Me
     • Ph.D. in CS from the University of Maryland ('07–'14)
       – Topic: automated floating-point program analysis (w/ Jeff Hollingsworth)
       – Intern @ Lawrence Livermore National Lab (LLNL) in Summer '11
     • Assistant professor at James Madison University since '14
       – Teaching: computer organization, parallel & distributed systems, compilers, and programming languages
       – Research: high-performance analysis research group (w/ Dee Weikle)
     • Faculty scholar @ LLNL since Summer '16
       – Energy-efficient computing project (w/ Barry Rountree)
       – Variable precision computing project (w/ Jeff Hittinger et al.)

  3. Context
     • IEEE floating-point arithmetic
       – Ubiquitous in scientific computing
       – More bits => higher accuracy (usually)
       – Fewer bits => higher performance (usually)
     [Diagram: single precision (FP32) = 1 sign bit + 8-bit exponent + 23-bit significand;
      double precision (FP64) = 1 sign bit + 11-bit exponent + 52-bit significand]
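     To make the two layouts concrete, here is a minimal standalone C sketch (an illustration of mine, not from the talk) that pulls apart the sign, exponent, and significand fields of an FP32 and an FP64 value:

         /* Print the IEEE-754 fields of a float and a double. */
         #include <stdio.h>
         #include <stdint.h>
         #include <string.h>

         int main(void) {
             float  f = 3.14159f;
             double d = 3.14159;
             uint32_t fb; uint64_t db;
             memcpy(&fb, &f, sizeof fb);   /* type-pun safely via memcpy */
             memcpy(&db, &d, sizeof db);

             /* FP32: 1 sign bit, 8 exponent bits, 23 significand bits */
             printf("FP32 sign=%u exp=%u frac=0x%06x\n",
                    fb >> 31, (fb >> 23) & 0xFFu, fb & 0x7FFFFFu);

             /* FP64: 1 sign bit, 11 exponent bits, 52 significand bits */
             printf("FP64 sign=%llu exp=%llu frac=0x%013llx\n",
                    (unsigned long long)(db >> 63),
                    (unsigned long long)((db >> 52) & 0x7FF),
                    (unsigned long long)(db & 0xFFFFFFFFFFFFFULL));
             return 0;
         }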

  4. Motivation
     • Vector single precision 2x+ faster
       – Possibly better if memory pressure is alleviated
       – Newest GPUs use mixed FP16/FP32 precision for tensor ops
     Instruction latencies (cycles) for Intel Knights Landing:

         Operation     FP32   Packed FP32   FP64
         Add             6         6          6
         Subtract        6         6          6
         Multiply        6         6          6
         Divide         27        32         42
         Square root    28        38         43

     Credit: https://agner.org/optimize/ and NVIDIA Tesla V100 Datasheet

  5. Questions
     • How many bits do you need?
     • Where does reduced precision help?

  6. Prior Approaches
     • Rigorous: forward/backward error analysis
       – Requires numerical analysis expertise
     • Pragmatic: "guess-and-check"
       – Requires manual code conversion effort, e.g.:

           //double x[N], y[N];
           float x[N], y[N];
           double alpha;

     (Image credit: Wikimedia Commons)
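     The whole guess-and-check loop fits in a few lines. Below is a minimal harness of my own (the kernel, names, and 1e-6 tolerance are illustrative, not from the talk): demote one set of declarations, re-run, and compare against the all-double answer:

         /* Guess-and-check: demote the arrays, keep the accumulator, verify. */
         #include <stdio.h>
         #include <math.h>

         #define N 1000

         static double dot_double(void) {
             double x[N], y[N], sum = 0.0;
             for (int i = 0; i < N; i++) { x[i] = 1.0 / (i + 1); y[i] = i + 1; }
             for (int i = 0; i < N; i++) sum += x[i] * y[i];
             return sum;
         }

         static double dot_mixed(void) {
             float x[N], y[N];     /* the "guess": demote the arrays */
             double sum = 0.0;     /* keep the accumulator in double */
             for (int i = 0; i < N; i++) { x[i] = 1.0f / (i + 1); y[i] = i + 1; }
             for (int i = 0; i < N; i++) sum += (double)x[i] * y[i];
             return sum;
         }

         int main(void) {
             double ref = dot_double(), test = dot_mixed();
             double relerr = fabs(test - ref) / fabs(ref);   /* the "check" */
             printf("rel. error = %g -> %s\n", relerr,
                    relerr < 1e-6 ? "PASS" : "FAIL");
             return 0;
         }

     Doing this by hand for every candidate variable is exactly the manual effort the automated tools below aim to remove.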

  7. Research Question
     • What can we learn about floating-point behavior with automated analysis?
       – Specifically: can we build mixed-precision versions of a program automatically?
     • Caveat: few (or no) formal guarantees
       – Rely on user-provided representative run (and sometimes a verification routine)

           double sum = 0.0;           double sum = 0.0;
           void sum2pi_x()             void sum2pi_x()
           {                       →   {
             double tmp;                 float tmp;
             double acc;                 float acc;
             int i, j;                   int i, j;
             [...]                       [...]

  8. FPAnalysis / CRAFT (2011)
     • Dynamic binary analysis via Dyninst
     • Cancellation detection
     • Range (exponent) tracking

         Example:    3.682236
                   - 3.682234
                     0.000002    (6 digits cancelled)
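     The pattern CRAFT flags can be reproduced in a few lines (a demo of mine, not CRAFT output):

         /* Subtracting nearly equal values cancels the leading digits,
          * so earlier rounding error dominates what remains. */
         #include <stdio.h>

         int main(void) {
             double a = 3.682236, b = 3.682234;
             double diff = a - b;                         /* 6 digits cancel */
             printf("%.6f - %.6f = %.6f\n", a, b, diff);  /* 0.000002 */
             return 0;
         }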

  9. CRAFT (2013)
     • Dynamic binary analysis via Dyninst
     • Instruction-level replacement of doubles w/ floats
     • Hierarchical search for valid replacements (see the sketch below)

     [Diagram: search hierarchy Program → Func1/Func2/Func3 → Insn1, Insn2, Insn3, …, InsnN]
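     The search can be sketched as a simple recursion (a toy of mine; the node layout and the "ok" flag standing in for "run passes verification" are assumptions, not CRAFT's implementation):

         /* Try demoting a whole subtree to single precision; only descend
          * into children when the subtree as a whole fails verification. */
         #include <stdio.h>

         typedef struct node {
             const char  *name;      /* program, function, or instruction */
             int          ok;        /* stand-in: does this subtree verify? */
             struct node *children;
             int          nchildren;
         } node;

         static void search(const node *n) {
             if (n->ok) {                      /* whole subtree works in single */
                 printf("replace %s\n", n->name);
                 return;
             }
             for (int i = 0; i < n->nchildren; i++)
                 search(&n->children[i]);      /* refine one level down */
         }

         int main(void) {
             node insns[] = { {"Insn1", 1, 0, 0}, {"Insn2", 0, 0, 0}, {"Insn3", 1, 0, 0} };
             node funcs[] = { {"Func1", 1, 0, 0}, {"Func2", 0, insns, 3} };
             node prog   = { "Program", 0, funcs, 2 };
             search(&prog);                    /* prints: Func1, Insn1, Insn3 */
             return 0;
         }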

  10. CRAFT (2013)

  11. CRAFT (2013)

          NAS Benchmark   Candidate      Configurations   % Dynamic
          (name.CLASS)    Instructions   Tested           Replaced
          bt.A               6,262          4,000           78.6
          cg.A                 956            255            5.6
          ep.A                 423            114           45.5
          ft.A                 426             74            0.2
          lu.A               6,014          3,057           57.4
          mg.A               1,393            437           36.6
          sp.A               4,507          4,920           30.5

  12. Issues
      • High overhead
        – Must check and (possibly) convert operands before each instruction
      • Lengthy search process
        – Search space is exponential in instruction count
      • Coarse-grained analysis
        – Binary decision: single or double

  13. CRAFT (2016)
      • Reduced-precision analysis
        – Simulate conservatively via bit-mask truncation
        – Report min output precision for each instruction
        – Finer-grained analysis and lower overhead
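      The truncation itself is cheap, which is where the lower overhead comes from. A minimal sketch, assuming the standard IEEE-754 double layout (my illustration, not CRAFT's code):

          /* Zero the low (52 - bits) significand bits of a double to
           * conservatively simulate a 'bits'-bit significand. */
          #include <stdio.h>
          #include <stdint.h>
          #include <string.h>

          static double truncate_sig(double x, int bits) {
              uint64_t u;
              memcpy(&u, &x, sizeof u);
              u &= ~((UINT64_C(1) << (52 - bits)) - 1);   /* clear low bits */
              memcpy(&x, &u, sizeof x);
              return x;
          }

          int main(void) {
              double pi = 3.14159265358979;
              for (int b = 52; b >= 4; b -= 12)
                  printf("%2d significand bits: %.15f\n", b, truncate_sig(pi, b));
              return 0;
          }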

  14. CRAFT (2016)
      • Scalability via heuristic search
        – Focus on most-executed instructions
        – Analysis time vs. benefit tradeoff

      [Chart: analysis time by execution-count threshold — >5.0%: 4:66, >1.0%: 5:93, >0.5%: 9:45, >0.1%: 15:45, >0.05%: 23:60, Full: 28:71]

  15. Issue
      • Only considers precision reduction
        – No higher precision or arbitrary precision
        – No alternative representations
        – No dynamic tracking of error

  16. SHVAL (2016)
      • Generic floating-point shadow value analysis
        – Maintain "shadow" value for every memory location
        – Execute shadow operations for all computation
        – Shadow type is parameterized (native, MPFR, Unum, Posit, etc.)
        – Pintool: less overhead than similar frameworks like Valgrind
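      The core idea can be modeled in plain C (a toy of mine: the real SHVAL is a Pintool that shadows every memory location transparently, and the shadow type is pluggable rather than hard-coded):

          /* Pair each program value with a higher-precision shadow and
           * mirror every operation on both halves. */
          #include <stdio.h>
          #include <math.h>

          typedef struct {
              float       val;     /* what the instrumented program computes */
              long double shadow;  /* the same computation, higher precision */
          } shadowed;

          static shadowed sh_add(shadowed a, shadowed b) {
              return (shadowed){ a.val + b.val, a.shadow + b.shadow };
          }
          static shadowed sh_mul(shadowed a, shadowed b) {
              return (shadowed){ a.val * b.val, a.shadow * b.shadow };
          }

          int main(void) {
              shadowed x = { 0.1f, 0.1L }, y = { 0.2f, 0.2L }, acc = { 0.0f, 0.0L };
              for (int i = 0; i < 1000; i++)
                  acc = sh_add(acc, sh_mul(x, y));        /* acc += x * y */
              printf("value = %.9f, |error vs. shadow| = %Lg\n",
                     acc.val, fabsl((long double)acc.val - acc.shadow));
              return 0;
          }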

  17. SHVAL (ongoing)
      • Single-precision shadow values
        – Trace execution and build a data-flow graph
        – Color nodes by error w.r.t. the original double-precision values
        – Highlights high-error regions
        – Inherent scaling issues

      [Figure: data-flow graph for a Gaussian elimination example — low-error inputs feed a medium-error multiply and intermediate; the final addition/output shows high error]

  18. Issue
      • No source-level mixed precision
        – Difficult to translate instruction-level analysis results to source-level transformations
        – Some users might be satisfied with opaque compiler-based optimization, but most HPC users want to know what changed!

  19. CRAFT (2013)
      • Memory-based replacement analysis (sketched below)
        – Leave computation intact but round outputs
        – Aggregate instructions that modify the same variable
        – Found several valid variable-level replacements

          NAS Benchmark   Candidate   Configurations   % Executions
          (name.CLASS)    Operands    Tested           Replaced
          bt.A               2,342        300             97.0
          cg.A                 287         68             71.3
          ep.A                 236         59             37.9
          ft.A                 466        108             46.2
          lu.A               1,742        104             99.9
          mg.A                 597        153             83.4
          sp.A               1,525      1,094             88.9
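      Rounding an output while leaving the computation intact is essentially a one-liner; a minimal sketch (the helper name is hypothetical):

          /* Compute in double, but round the stored result through float,
           * as if the destination variable had been demoted. */
          #include <stdio.h>

          static inline double store_as_float(double x) {
              return (double)(float)x;    /* round to nearest float, widen back */
          }

          int main(void) {
              double a = 1.0 / 3.0;            /* arithmetic stays in double */
              double v = store_as_float(a);    /* ...but the store is rounded */
              printf("full:    %.17f\n", a);
              printf("rounded: %.17f\n", v);   /* only ~7 digits survive */
              return 0;
          }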

  20. SHVAL (2017)
      • Single-vs-double shadow value analysis
        – Aggregate error by instruction or memory location over time
      • Computer vision case study (AprilTags)
        – 1.7x speedup on average with only 4% error
        – 40% energy savings in embedded experiments
      Credit: Ramy Medhat (ramy.medhat@uwaterloo.ca)

  21. Issues
      • Each instruction or variable is tested in isolation
        – Union of valid replacements is often invalid
      • Cannot ensure speedup
        – Instrumentation overhead
        – Added casts to convert data between regions
        – Lack of vectorization and data packing

  22. CRAFT (ongoing)
      • Variable-centric mixed-precision analysis
        – Use TypeForge (an AST-level type conversion tool) for source-to-source mixed precision
      • Search for best speedup
        – Run full compiler backend w/ optimizations
        – Report fastest configuration that passes verification

            double sum = 0.0;           double sum = 0.0;
            void sum2pi_x()             void sum2pi_x()
            {                       →   {
              double tmp;                 float tmp;
              double acc;                 float acc;
              int i, j;                   int i, j;
              [...]                       [...]

  23. Related Work
      • CRAFT/SHVAL, Precimonious [Rubio '13], GPUMixer [Laguna '19], etc.
        – Very practical
        – Widely-used tool frameworks (Dyninst, Pin, LLVM)
        – Few (or no) formal guarantees
        – Tested on HPC benchmarks on Linux/x86
      • Daisy [Darulova '18], FPTuner [Chiang '17], etc.
        – Very rigorous
        – Custom input formats
        – Provable error bounds for a given input range
        – Impractical for HPC benchmarks

  24. ADAPT (2018)
      • Automatic backwards error analysis (sketched below)
        – Obtain gradients via reverse-mode algorithmic differentiation (CoDiPack or TAPENADE)
        – Calculate the error contribution of intermediate results
        – Aggregate by program variable
        – Greedy algorithm builds a mixed-precision allocation
      Credit: Harshitha Menon (gopalakrishn1@llnl.gov)
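      The error-contribution idea can be hand-worked on a toy expression (ADAPT itself gets the gradients from CoDiPack or TAPENADE rather than by hand): propagate adjoints backward, then weight each value's single-precision rounding error by its gradient:

          /* For f(x, y) = x*y + y: reverse-mode adjoints by hand, then
           * estimate each value's contribution to output error. */
          #include <stdio.h>
          #include <math.h>
          #include <float.h>

          int main(void) {
              double x = 1.5, y = 3.0;

              /* forward pass, recording intermediates */
              double t = x * y;                 /* t = 4.5 */
              double f = t + y;                 /* f = 7.5 */

              /* reverse pass: seed df/df = 1, chain backwards */
              double f_bar = 1.0;
              double t_bar = f_bar;             /* df/dt = 1     */
              double y_bar = f_bar + t_bar * x; /* df/dy = 1 + x */
              double x_bar = t_bar * y;         /* df/dx = y     */

              /* demoting a value to float perturbs it by roughly
               * |v| * FLT_EPSILON; its effect on f scales with the adjoint */
              printf("x contributes ~%g\n", fabs(x_bar) * fabs(x) * FLT_EPSILON);
              printf("y contributes ~%g\n", fabs(y_bar) * fabs(y) * FLT_EPSILON);
              printf("t contributes ~%g\n", fabs(t_bar) * fabs(t) * FLT_EPSILON);
              printf("f = %f\n", f);
              return 0;
          }

      Values with small contributions are the safest candidates for demotion, which is what the greedy allocation exploits.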

  25. ADAPT (2018)

  26. ADAPT (2018)
      • Used ADAPT on the LULESH benchmark to help develop a mixed-precision CUDA version
      • Achieved a speedup of 20% within the original error threshold on an NVIDIA GK110 GPU
      Credit: Harshitha Menon (gopalakrishn1@llnl.gov)

  27. FloatSmith (ongoing)
      • Mixed-precision search via CRAFT
      • Source-to-source translation via TypeForge
      • Optionally, use TypeForge-automated ADAPT analysis to narrow the search and provide more rigorous guarantees

  28. FloatSmith (ongoing)
      • Guided mode (Q&A)
      • Batch mode (command-line parameters)
      • Dockerfile provided
      • Can offload configuration testing to a cluster

          floatsmith -B --run "./demo"

          double p = 1.00000003;             double p = 1.00000003;
          double l = 0.00000003;             float  l = 0.00000003;
          double o;                          double o;
          int main() {                   →   int main() {
            o = p + l;                         o = p + l;
            // should print 1.00000006         // should print 1.00000006
            printf("%.8f\n", (double)o);       printf("%.8f\n", (double)o);
            return 0;                          return 0;
          }                                  }

  29. FPHPC (ongoing)
      • Benchmark suite aimed at facilitating scale-up for mixed-precision analysis tools
        – A "middle ground" between real-valued expressions and full applications
        – Currently looking for good case studies

  30. Future Work
      • (Better) OpenMP/MPI support
      • (Better) GPU and FPGA support
      • Model-based performance prediction
      • Dynamic runtime precision tuning
      • Ensemble floating-point analysis

  31. Summary
      • Automated mixed precision is possible
        – Practicality vs. rigor tradeoff
      • Multiple active projects
        – Various goals and approaches
        – All target HPC applications
      • Many avenues for future research
