Scalable Precision Tuning of Numerical Software, Cindy Rubio-González



SLIDE 1

Scalable Precision Tuning of Numerical Software

Cindy Rubio-González

Department of Computer Science University of California, Davis

Best Practices for HPC Software Developers Webinar, October 14th, 2020

SLIDE 2

Floating-Point Precision Tuning

  • Reasoning about floating-point programs is difficult
  • Large variety of numerical problems
  • Most programmers are not experts in floating point
  • Common practice: use the highest available precision
  • Disadvantage: more expensive!
  • Automated techniques for tuning precision

Given: accuracy requirement
Action: reduce precision
Goal: accuracy and/or performance

SLIDE 3

Precision Tuning Example

Original Program

    long double fun(long double p) {
      long double pi = acos(-1.0);
      long double q = sin(pi * p);
      return q;
    }

    void simpsons() {
      long double a, b;
      long double h, s, x;
      const long double fuzz = 1e-26;
      const int n = 2000000;
      …
    L100:
      x = x + h;
      s = s + 4.0 * fun(x);
      x = x + h;
      if (x + fuzz >= b) goto L110;
      s = s + 2.0 * fun(x);
      goto L100;
    L110:
      s = s + fun(x);
      …
    }

Error threshold: 10^-8

SLIDE 4

Precision Tuning Example

Tuned Program (the Original Program is the listing on Slide 3)

    long double fun(double p) {
      double pi = acos(-1.0);
      long double q = sinf(pi * p);
      return q;
    }

    void simpsons() {
      float a, b;
      double s, x; float h;
      const float fuzz = 1e-26;
      const int n = 2000000;
      …
    L100:
      x = x + h;
      s = s + 4.0 * fun(x);
      x = x + h;
      if (x + fuzz >= b) goto L110;
      s = s + 2.0 * fun(x);
      goto L100;
    L110:
      s = s + fun(x);
      …
    }

Tuned program runs 78.7% faster!
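
To make the accuracy side of this comparison concrete, here is a minimal sketch. It is not the program from the slides: it assumes a different integrand, f(x) = 4 / (1 + x^2) on [0, 1] (exact integral: pi), so that no math library call is needed, and it uses a counted loop instead of the goto loop. The names simpson_double, simpson_float, and within_threshold are hypothetical. The same rule is evaluated under two type configurations and each result is checked against an error threshold.

```c
/* Reference value of the integral of 4 / (1 + x^2) over [0, 1]. */
#define PI_REF 3.14159265358979323846

/* Composite Simpson's rule with every variable in double precision.
 * n is the number of subintervals and must be even and positive. */
double simpson_double(int n) {
    const double h = 1.0 / n;
    double s = 4.0 + 2.0; /* f(0) + f(1) */
    for (int i = 1; i < n; i++) {
        double x = i * h;
        double fx = 4.0 / (1.0 + x * x);
        s += (i % 2 ? 4.0 : 2.0) * fx; /* weight 4 at odd points, 2 at even */
    }
    return s * h / 3.0;
}

/* The same rule with every variable lowered to single precision. */
float simpson_float(int n) {
    const float h = 1.0f / (float)n;
    float s = 4.0f + 2.0f; /* f(0) + f(1) */
    for (int i = 1; i < n; i++) {
        float x = (float)i * h;
        float fx = 4.0f / (1.0f + x * x);
        s += (i % 2 ? 4.0f : 2.0f) * fx;
    }
    float result = s * h / 3.0f;
    return result;
}

/* Returns 1 if value is within threshold (absolute error) of PI_REF. */
int within_threshold(double value, double threshold) {
    double err = value > PI_REF ? value - PI_REF : PI_REF - value;
    return err <= threshold;
}
```

With n = 2000 and a threshold of 10^-8, the all-double configuration is accepted. The all-float one cannot be: the closest single-precision value to pi is already about 8.7e-8 away. A tuner looks for configurations between these two extremes.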

SLIDE 5

Challenges in Precision Tuning

  • Searching efficiently over variable types and function implementations
    – Naïve approach → exponential time
      • 2^n or 3^n, where n is the number of variables
    – Global minimum vs. a local minimum
  • Evaluating type configurations
    – Less precision → not necessarily faster
    – Based on run time, energy consumption, etc.
  • Determining accuracy constraints
    – How accurate must the final result be?
    – What error threshold to use?
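
The exponential growth of the naïve search can be made concrete with a short sketch. The function name config_count is hypothetical. It counts the configurations when each variable can independently take any of a fixed number of types.

```c
/* Number of type configurations when each of n_vars variables can
 * independently take any of n_types floating-point types.
 * Assumes the result fits in 64 bits (true for 3 types up to 40 variables). */
unsigned long long config_count(unsigned n_types, unsigned n_vars) {
    unsigned long long total = 1;
    for (unsigned i = 0; i < n_vars; i++) {
        total *= n_types;
    }
    return total;
}
```

For 20 variables this gives 1,048,576 configurations with two types and 3,486,784,401 with three. Each configuration needs at least one program run to evaluate, which is why exhaustive search is impractical.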

SLIDE 6

Precision Tuning Approaches

  • Reducing precision vs. improving performance
    – Different objectives
  • Dynamic vs. static approaches
    – Dynamic: performed at runtime, requires program inputs, handles larger and more complex code, no guarantees for untested inputs
    – Static: analyzes the program without running it, limitations with certain program structures (e.g., loops), formal guarantees for analyzed code
  • Instructions vs. variables vs. function calls
    – Various granularities of program transformation
    – Different scopes
  • Binary vs. IR vs. source code
    – Tradeoff between granularity of transformation and tool usability

SLIDE 7

Dynamic Tools for Precision Tuning

Precimonious
  • Dynamic Analysis for Precision Tuning
    – Black-box approach to systematically search over variable types and functions

HiFPTuner
  • Hierarchical Precision Tuner
    – Leverages relationship among variables to reduce search space and number of runs

SLIDE 8

Dynamic Analysis for Floating-Point Precision Tuning

[Figure: Precimonious workflow. Inputs: TEST INPUTS and SOURCE CODE (annotated with error threshold). PRECIMONIOUS searches over types of variables and function implementations. Output: TYPE CONFIGURATION with less precision, a speedup, and a result within the error threshold for all test inputs.]

  • C. Rubio-González, C. Nguyen, H. D. Nguyen, J. Demmel, W. Kahan, K. Sen, D.H. Bailey, C. Iancu, and D. Hough. “Precimonious: Tuning Assistant for Floating-Point Precision”, SC 2013.

https://github.com/ucd-plse/precimonious
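
The condition "result within error threshold for all test inputs" can be sketched as follows. This is an illustration only: the slides do not say which error metric the tool uses, so relative error (with an absolute-error fallback when the reference is zero) is an assumption, and the name accepts_all is hypothetical.

```c
/* Returns 1 if every candidate result is within the relative error
 * threshold of its reference result, 0 otherwise.
 * reference[i]: result of the original program on test input i.
 * candidate[i]: result of the tuned variant on the same input. */
int accepts_all(const double *reference, const double *candidate,
                int n_inputs, double threshold) {
    for (int i = 0; i < n_inputs; i++) {
        double diff = candidate[i] - reference[i];
        if (diff < 0.0) diff = -diff;
        double scale = reference[i] < 0.0 ? -reference[i] : reference[i];
        /* Fall back to absolute error when the reference is zero. */
        double err = scale > 0.0 ? diff / scale : diff;
        if (!(err <= threshold)) return 0; /* also rejects NaN */
    }
    return 1;
}
```

A single failing input rejects the whole configuration, which is why the choice of test inputs matters so much for a dynamic tuner.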

SLIDE 9

Search Algorithm

  • Based on the Delta-Debugging Search Algorithm [1]
  • Change the types of variables and function calls
    – Examples: double x → float x, sin → sinf
  • Our success criteria
    – Resulting program produces an “accurate enough” answer
    – Resulting program is faster than the original program
  • Main idea
    – Start by associating each variable with a set of types
      • Example: x → {long double, double, float}
    – Refine the set until it contains only one type
  • Find a local minimum
    – Lowering the precision of one more variable violates the success criteria

[1] A. Zeller and R. Hildebrandt. “Simplifying and Isolating Failure-Inducing Input”, TSE 2002.
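
The search idea can be sketched in a simplified form. This is a delta-debugging-style illustration, not the Precimonious implementation: it assumes only two types per variable (double or float), and the names tune_precision, oracle_fn, and example_oracle are hypothetical. In a real tuner the oracle would build and run a program variant and check the success criteria. Here a toy oracle stands in for that step.

```c
#define MAX_VARS 64

/* Hypothetical oracle: lowered[i] == 1 means "variable i is single
 * precision". Returns 1 if that variant meets the success criteria
 * (accurate enough and faster), 0 otherwise. */
typedef int (*oracle_fn)(const int *lowered, int n_vars);

/* Toy oracle for illustration: variables 2 and 5 must stay double.
 * Assumes n_vars >= 6. */
int example_oracle(const int *lowered, int n_vars) {
    (void)n_vars;
    return !lowered[2] && !lowered[5];
}

/* Tries to lower variables in chunks, halving the chunk size after each
 * pass over all chunks. Stops at a local minimum: no single remaining
 * variable can be lowered without failing the oracle.
 * Returns the number of lowered variables. Requires n_vars <= MAX_VARS. */
int tune_precision(int *lowered, int n_vars, oracle_fn passes) {
    int trial[MAX_VARS];
    for (int i = 0; i < n_vars; i++) lowered[i] = 0;

    int chunk = n_vars; /* first attempt: lower everything at once */
    while (chunk >= 1) {
        int progress = 0;
        for (int start = 0; start < n_vars; start += chunk) {
            int changed = 0;
            for (int i = 0; i < n_vars; i++) trial[i] = lowered[i];
            for (int i = start; i < start + chunk && i < n_vars; i++) {
                if (!trial[i]) { trial[i] = 1; changed = 1; }
            }
            if (changed && passes(trial, n_vars)) {
                for (int i = 0; i < n_vars; i++) lowered[i] = trial[i];
                progress = 1;
            }
        }
        if (chunk > 1) {
            chunk = (chunk + 1) / 2; /* refine granularity */
        } else if (!progress) {
            break; /* single-variable pass made no progress: local minimum */
        }
    }

    int count = 0;
    for (int i = 0; i < n_vars; i++) count += lowered[i];
    return count;
}
```

With eight variables and the toy oracle, the search lowers six variables and leaves variables 2 and 5 in double precision. The result is a local minimum in the sense above; nothing in this procedure guarantees the global one.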

SLIDES 10-16

Searching for Type Configuration

[Figure, animated over seven slides: candidate type configurations, with each variable shown in double precision or single precision. Configurations marked ✘ failed. The final frame shows the failed configurations next to the proposed configuration.]

SLIDE 17

Applying Type Configuration

  • Automatically generate program variants
    – Reflect type configurations produced by the algorithm
  • Intermediate representation
    – LLVM IR
  • Transformation rules for each LLVM instruction
    – alloca, load, store, fadd, fsub, fpext, fptrunc, etc.
    – Changes equivalent to modifying the program at the source level
    – Clang plugin to provide modified source code
  • Able to run resulting modified program
    – Evaluate type configuration: accuracy & performance

SLIDE 18

Where to Find Precimonious

  • Precimonious is open source
    – Most recent version can be found at https://github.com/ucd-plse/precimonious
  • Dockerfile and examples
    – Tutorial on Floating-Point Analysis Tools at SC’19 and PEARC’19: http://fpanalysistools.org
    – Dockerfile and examples can be found at https://github.com/ucd-plse/tutorial-precision-tuning

SLIDE 19

How to Use Precimonious

  • Initial requirements
    – Does your program compile with clang?
    – Where does your program store the result?
    – How much error are you willing to tolerate?
      • Examples: 10^-4, 10^-6, 10^-8, and 10^-10
    – Do you have representative inputs to use during tuning?
  • Optional information
    – Are there specific functions/variables to focus on, or to ignore during tuning?
  • What you get
    – Listing of variables (and functions) and their proposed types
    – Useful starting point to identify areas of interest

SLIDE 20

Limitations and Recommendations

  • Type configurations rely on program inputs tested
    – No guarantees for worse-conditioned inputs
    – Use representative inputs whenever possible
    – Consider input generation tools, e.g., S3FP [1], FPGen [2], etc.
  • Analysis scalability
    – Scalability limitations when tuning long-running applications
    – Need to reduce search space, and reduce number of runs
    – Consider starting with a specific area of the program
    – Consider synthesizing smaller workloads
  • Analysis effectiveness
    – Black-box approach does not exploit relationship among variables

[1] W. Chiang, G. Gopalakrishnan, Z. Rakamaric and A. Solovyev. “Efficient Search for Inputs Causing High Floating-point Errors”, PPoPP 2014.
[2] H. Guo and C. Rubio-González. “Efficient Generation of Error-Inducing Floating-Point Inputs via Symbolic Execution”, ICSE 2020.

SLIDE 21

Dynamic Tools for Precision Tuning

Precimonious
  • Dynamic Analysis for Precision Tuning
    – Black-box approach to systematically search over variable types and functions

HiFPTuner
  • Hierarchical Precision Tuner
    – Leverages relationship among variables to reduce search space and number of runs

SLIDE 22

Impact of Precision Shifting

  • Precimonious follows a black-box approach
  • Related variables assigned types independently
  • Large number of variables → Slow search
  • More type casts → Less speedup

[Figure: three versions of the program]
  • Original
  • Local minimum: uses lower precision. Speedup: 78.7%
  • Global minimum: shifts precision less often. Speedup: 90%
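
Why independent type assignment can cost speedup is easier to see with a small sketch. The cost model here is an assumption made for illustration: one conversion is charged every time a value flows along a dependence edge whose two variables have different types, and the edge weight is how often that flow executes. The names fp_type, dep_edge, and dynamic_casts are hypothetical.

```c
enum fp_type { FP_FLOAT, FP_DOUBLE, FP_LONG_DOUBLE };

/* A dependence between two variables, with the number of times it
 * was executed in a profiled run. */
struct dep_edge {
    int from;
    int to;
    long weight;
};

/* Total number of executed conversions under a type configuration:
 * the sum of weights over edges whose endpoints differ in type. */
long dynamic_casts(const struct dep_edge *edges, int n_edges,
                   const enum fp_type *types) {
    long total = 0;
    for (int i = 0; i < n_edges; i++) {
        if (types[edges[i].from] != types[edges[i].to]) {
            total += edges[i].weight;
        }
    }
    return total;
}
```

Under this model, a configuration that lowers more variables can still execute far more conversions than one that keeps tightly coupled variables at the same type, which matches the observation that shifting precision less often can yield the larger speedup.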

SLIDE 23

Exploiting Community Structure

  • White-box nature
  • Related variables pre-grouped into hierarchy → Same type
  • Fewer groups in search space → Faster search
  • Fewer type casts → Larger speedups
  • Can we leverage the program to perform a more informed precision tuning?

[Figure: hierarchy over variables 1 to 8 at Level 0, Level 1, and Level 2, searched top to bottom.]
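
The effect of pre-grouping related variables can be sketched with a deliberately simple stand-in. This is not the community detection used by HiFPTuner: it merely merges variables connected by a dependence edge whose weight reaches a cutoff, using union-find. The names dep_edge, group_variables, and min_weight are hypothetical.

```c
#define MAX_VARS 64

struct dep_edge {
    int from;
    int to;
    long weight;
};

/* Union-find root lookup with path halving. */
static int find_root(int *parent, int v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];
        v = parent[v];
    }
    return v;
}

/* Merges variables joined by an edge with weight >= min_weight.
 * Writes a group identifier for each variable into group_of and
 * returns the number of groups. Requires n_vars <= MAX_VARS. */
int group_variables(const struct dep_edge *edges, int n_edges, int n_vars,
                    long min_weight, int *group_of) {
    int parent[MAX_VARS];
    for (int i = 0; i < n_vars; i++) parent[i] = i;

    for (int i = 0; i < n_edges; i++) {
        if (edges[i].weight >= min_weight) {
            int a = find_root(parent, edges[i].from);
            int b = find_root(parent, edges[i].to);
            if (a != b) parent[b] = a;
        }
    }

    int groups = 0;
    for (int i = 0; i < n_vars; i++) {
        group_of[i] = find_root(parent, i);
        if (group_of[i] == i) groups++;
    }
    return groups;
}
```

If a search then assigns one type per group instead of one per variable, the number of items to search over drops from the number of variables to the number of groups, and variables in the same group never need a conversion between them.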

SLIDE 24

Hierarchical Floating-Point Precision Tuning

HiFPTuner Approach

[Figure: HiFPTuner pipeline, taking SOURCE CODE, TEST INPUTS, and an Accuracy Constraint]

  1. Type Dependence Analysis + Edge Profiling → Weighted Dependence Graph
  2. Iterative Community Detection + Ordering → Ordered Community Structure of Variables
  3. Hierarchical Precision Tuning → TYPE CONFIGURATION

Speeds up program by reducing precision with respect to accuracy constraint

  • H. Guo and C. Rubio-González. “Exploiting Community Structure for Floating-Point Precision Tuning”, ISSTA 2018.
  • M. Girvan and M.E.J. Newman. “Community Structure in Social and Biological Networks”, PNAS 2002.
  • F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. “Defining and Identifying Communities in Networks”, PNAS 2004.

https://github.com/ucd-plse/HiFPTuner

SLIDE 25

Simpsons Example

  • Weighted dependence graph
    [Figure: graph over the variables a, b, h, pi, p, q, x, s]
  • Ordered community structure
    [Figure: groups "pi, p, q", "a, b, h, x", and "s"]
  • Found global minimum configuration that leads to 90% speedup!

HiFPTuner explores 24 configurations, almost 5x fewer configurations

SLIDE 26

Better Scalability & Speedup

  • Items at top level of hierarchy reduced by 53% on average in comparison to Precimonious
  • Higher search efficiency over Precimonious for 75% of the programs in our study
    – Explored 45% fewer configurations
  • HiFPTuner finds better configurations for half of the programs, with up to 90% speedup

SLIDE 27

Where to Find HiFPTuner

  • HiFPTuner is open source
    – https://github.com/ucd-plse/HiFPTuner
  • Dockerfile and examples
    – Tutorial on Floating-Point Analysis Tools at SC’19 and PEARC’19: http://fpanalysistools.org
    – Dockerfile and examples can be found at https://github.com/ucd-plse/tutorial-precision-tuning
  • Same requirements as Precimonious

SLIDE 28

Comparison of Precision Tuners

Precimonious
  PROS
    + Considers both accuracy and performance
    + Works for medium-size non-trivial programs
    + Easily configurable
  CONS
    – Requires a run for each type configuration
    – Ordering of variables may give different results

HiFPTuner
  PROS
    + White-box hierarchical approach, groups variables based on their usage
    + Over twice as fast as Precimonious
    + Finds configurations that lead to higher speedups
  CONS
    – Requires program profiling
    – Still requires a run for each type configuration

Blame Analysis [1]
  PROS
    + Performs shadow execution, requires a single run of the program
    + Identifies variables that can be single precision
    + Combined with Precimonious leads to 9x faster analysis
  CONS
    – Focuses on accuracy, not performance
    – 50x overhead by shadow execution engine
    – Still black-box approach

[1] C. Rubio-González, C. Nguyen, B. Mehne, K. Sen, J. Demmel, W. Kahan, C. Iancu, W. Lavrijsen, D.H. Bailey and D. Hough. “Floating-Point Precision Tuning Using Blame Analysis”, ICSE 2016.
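
The idea of shadow execution, observing a lower-precision computation alongside the original within a single run, can be sketched as follows. This is not the Blame Analysis algorithm, only an illustration of carrying a single-precision shadow next to each double value. All names (shadow, shadow_from, shadow_add, shadow_mul, shadow_divergence) are hypothetical.

```c
/* A value tracked in two precisions at once. */
struct shadow {
    double hi; /* result of the original double-precision computation */
    float lo;  /* the same computation carried out in single precision */
};

struct shadow shadow_from(double v) {
    struct shadow s = { v, (float)v };
    return s;
}

struct shadow shadow_add(struct shadow a, struct shadow b) {
    struct shadow s = { a.hi + b.hi, a.lo + b.lo };
    return s;
}

struct shadow shadow_mul(struct shadow a, struct shadow b) {
    struct shadow s = { a.hi * b.hi, a.lo * b.lo };
    return s;
}

/* Relative difference between the shadow and the original value
 * (absolute difference when the original value is zero). */
double shadow_divergence(struct shadow s) {
    double diff = (double)s.lo - s.hi;
    if (diff < 0.0) diff = -diff;
    double scale = s.hi < 0.0 ? -s.hi : s.hi;
    return scale > 0.0 ? diff / scale : diff;
}
```

Values that are exactly representable in single precision, such as 0.5 + 0.25, show no divergence, while a value like 0.1 already differs slightly. Comparing the divergence at the program's result with the error threshold gives a hint of which values could be lowered without re-running the program per configuration. Every operation is performed twice, which is one source of the overhead such an engine incurs.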

SLIDE 29

Current Challenges for HPC Applications

  1. Type configurations rely on program inputs tested
    – How problematic is this for HPC applications?
    – Can we leverage application-dependent correctness metrics?
  2. Analysis scalability
    – How can we further reduce the search space?
    – How can we reduce the number of program runs?
  3. Analysis effectiveness
    – How far are we from the best configuration(s)?
    – Are there other program transformations to explore?
    – Can we incorporate domain knowledge to guide search?
  4. Benchmarks
    – Difficult to find programs to test precision tuners at scale
    – Need for collaboration between application and tool developers

SLIDE 30

Some Useful Resources

  • Other recent precision tuners
    – An exhaustive list of tools: https://fpbench.org/community.html
    – I. Laguna, P.C. Wood, R. Singh and S. Bagchi. “GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications”, ISC 2019.
    – M. Lam, T. Vanderbruggen, H. Menon and M. Schordan. “Tool Integration for Source-Level Mixed Precision”. CORRECTNESS@SC 2019.
    – S. Cherubin, D. Cattaneo, M. Chiari and G. Agosta. “Dynamic Precision Autotuning with TAFFO”. ACM Trans. Archit. Code Optim. 2019.
    – P.V. Kotipalli, R. Singh, P. Wood, I. Laguna and S. Bagchi. “AMPT-GA: Automatic Mixed Precision Floating Point Tuning for GPU Applications”. ICS 2019.
    – H. Menon, M. Lam, D. Osei-Kuffuor, M. Schordan, S. Lloyd, K. Mohror and J. Hittinger. “ADAPT: Algorithmic Differentiation Applied to Floating-Point Precision Tuning”, SC 2018.
    – E. Darulova, E. Horn and S. Sharma. “Sound Mixed-Precision Optimization with Rewriting”. ICCPS 2018.
    – W. Chiang, M. Baranowski, I. Briggs, A. Solovyev, G. Gopalakrishnan and Z. Rakamaric. “Rigorous Floating-Point Mixed-Precision Tuning”. POPL 2017.
  • Check out recent survey on reduced precision
    – S. Cherubin and G. Agosta. “Tools for Reduced Precision Computation: A Survey”. ACM Computing Surveys 2020.

SLIDE 31

SC Workshop on Software Correctness

Co-organized with Ignacio Laguna from Lawrence Livermore National Lab
November 11th, 2020 (half day, 2:30pm to 6:30pm EDT)

SLIDE 32

Summary

  • Precision tuning can have an important impact on the performance of HPC applications
  • Many techniques for precision tuning
    – Different approaches: dynamic vs. static
  • We discussed two of our tools for precision tuning
    – Precimonious and HiFPTuner
  • A lot of progress, but there are still challenges and opportunities to apply precision tuning at scale
  • Application and tool developers must work together to improve scalability and effectiveness of precision tuning

SLIDE 33

Collaborators

Cuong Nguyen, Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David Bailey, Costin Iancu, David Hough, Ben Mehne, Wim Lavrijsen, Hui Guo

Affiliations: UC Berkeley, Oracle, LBNL, UC Davis

SLIDE 34

Acknowledgements/Sponsors