Exploiting Community Structure for Floating-Point Precision Tuning

Exploiting Community Structure for Floating-Point Precision Tuning
Hui Guo, Cindy Rubio-González
ISSTA’18, Amsterdam, Netherlands, July 2018

Background
• Floating-point (FP) arithmetic used in many domains
• Reasoning about FP programs is difficult
  - Large variety of numerical problems
  - Most programmers are not experts in FP
• Common practice: use highest available precision
  - Disadvantage: more expensive!
• Tools have been developed for precision tuning
  Given: Accuracy constraints
  Action: Reduce precision
  Goal: Performance

Precision Tuning Example
Original Program:

    long double fun(long double p) {
      long double pi = acos(-1.0);
      long double q = sin(pi * p);
      return q;
    }

    void simpsons() {
      long double a, b;
      long double h, s, x;
      const long double fuzz = 1e-26;
      const int n = 2000000;
      …
    L100:
      x = x + h;
      s = s + 4.0 * fun(x);
      x = x + h;
      if (x + fuzz >= b) goto L110;
      s = s + 2.0 * fun(x);
      goto L100;
    L110:
      s = s + fun(x);
      // final answer: (long double)h * s / 3.0
    }

Error threshold for the tuned program: 10^-8

Precision Tuning Example
Tuned Program (same loop body as the original; only the declarations and the sin call change):

    long double fun(double p) {
      double pi = acos(-1.0);
      long double q = sinf(pi * p);
      return q;
    }

    void simpsons() {
      float a, b;
      double s, x; float h;
      const float fuzz = 1e-26;
      const int n = 2000000;
      …
    L100:
      x = x + h;
      s = s + 4.0 * fun(x);
      x = x + h;
      if (x + fuzz >= b) goto L110;
      s = s + 2.0 * fun(x);
      goto L100;
    L110:
      s = s + fun(x);
      // final answer: (long double)h * s / 3.0
    }

Tuned program runs 78.7% faster!

State-of-the-art: Black-box Precision Tuning
[Figure, built up over several slides: search over type configurations between double precision (top) and single precision (bottom). Each configuration tried is marked ✔ or ✘; after a series of failed configurations (✘), the search ends at a proposed configuration (✔).]

State-of-the-art: Black-box Precision Tuning
• State of the art groups variables arbitrarily
• Black box nature
  - Related variables assigned types independently
  - Large number of variables → Slow search
  - More type casts → Less speedup
[Figure: three configurations of the example program. Original; Local minimum: uses lower precision, speedup 78.7%; Global minimum: shifts precision less often, speedup 90%.]

Exploiting Community Structure
• Can we leverage the program to perform a more informed precision tuning?
• White box nature
  - Related variables pre-grouped into hierarchy → Same type
  - Fewer groups in search space → Faster search
  - Fewer type casts → Larger speedups
[Figure: eight items, numbered 1 to 8, grouped into a three-level hierarchy (Level 2 at the top, Level 1, Level 0 at the bottom); the search proceeds top to bottom.]

Approach
Inputs: source code, test inputs, accuracy constraint
1. Type Dependence Analysis + Edge Profiling → Weighted Dependence Graph
2. Iterative Community Detection + Ordering → Ordered Community Structure of Variables
3. Hierarchical Precision Tuning → Type Configuration
The resulting type configuration speeds up the program by reducing precision with respect to the accuracy constraint.

Type Dependence Analysis + Edge Profiling
Identify assignments to floating-point variables

    long double fun(long double p) {
      long double pi = acos(-1.0);
      long double q = sin(pi * p);
      return q;
    }

    void simpsons() {
      long double a, b;
      // subinterval length, integral approximation, x
      long double h, s, x;
      const long double fuzz = 1e-26;
      const int n = 2000000;
      a = 0.0;
      b = 1.0;
      h = (b - a) / n;
      x = a;
      s = fun(x);
    L100:
      x = x + h;
      s = s + 4.0 * fun(x);
      x = x + h;
      if (x + fuzz >= b) goto L110;
      s = s + 2.0 * fun(x);
      goto L100;
    L110:
      s = s + fun(x);
      printf("%1.16Le\n", (long double)h * s / 3.0);
    }

Type Dependence Analysis + Edge Profiling
Assignments to floating-point variables:

    long double pi = acos(-1.0);
    long double q = sin(pi * p);
    const long double fuzz = 1e-26;
    a = 0.0;
    b = 1.0;
    h = (b - a) / n;
    x = a;
    s = fun(x);
    x = x + h;
    s = s + 4.0 * fun(x);
    x = x + h;
    s = s + 2.0 * fun(x);
    s = s + fun(x);

Weighted dependence graph
[Figure: graph over the variables a, b, h, x, s, fuzz, p, pi, q. The legend distinguishes "variables in main" from "variables in fun"; edges are variable dependences labeled with weights 1, 2000000, or 2000001.]
A vertex in the graph represents an FP variable, and an edge u → v denotes that value u is used to compute value v at least once.

Iterative Community Detection + Ordering
Use modularity maximization [1, 2] to iteratively detect communities on the generated dependence graph until no new communities are found.
[Figure: community structure of floating-point variables. Bottom level: the variables s, x, a, h, b, q, pi, p with their weighted dependence edges; top level: the communities c1, c2, c3, connected by weighted edges.]
[1] M. E. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 2004.
[2] M. E. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 2006.
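The quantity being maximized can be made concrete with a small function. This is a minimal sketch of the standard modularity score for a weighted graph treated as undirected (symmetric weight matrix); whether the tool symmetrizes its dependence graph this way is an assumption, and the maximization step itself is not shown. The fixed size `N` and the function name are illustrative.

```c
#include <stddef.h>

#define N 6  /* number of vertices in the toy graph (illustrative) */

/* Modularity of a partition of a weighted, undirected graph:
 *   Q = 1/(2m) * sum over i,j of (A[i][j] - k[i]*k[j]/(2m)) * [c[i] == c[j]]
 * A is a symmetric weight matrix, k[i] is the weighted degree of vertex i,
 * 2m is the sum of all entries of A, and c[i] is the community of vertex i. */
static double modularity(double A[N][N], const int c[N]) {
    double k[N] = {0};
    double two_m = 0.0;
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            k[i] += A[i][j];
            two_m += A[i][j];
        }
    }
    if (two_m == 0.0) return 0.0;  /* graph without edges */

    double q = 0.0;
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            if (c[i] == c[j]) q += A[i][j] - k[i] * k[j] / two_m;
        }
    }
    return q / two_m;
}
```

A community detection pass would search for the labeling `c` that makes this score as large as possible; placing every vertex in a single community always scores 0, so a positive score indicates structure beyond what the degrees alone would suggest.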

Iterative Community Detection + Ordering
Sort the items at each level of the hierarchy using topological order to follow the dependence flow.
[Figure: ordered community structure. The communities c1, c2, c3 at the top level and the variables at the bottom level are rearranged into topological order.]
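Sorting the items of one level in topological order can be sketched with Kahn's algorithm. This is an illustrative version under assumptions: `edge[u][v]` means item u feeds item v, ties are broken by lowest index, and the function simply reports when a cycle prevents a full ordering. How the tool itself handles cyclic dependences is not stated on the slides.

```c
#include <stdbool.h>

#define MAX_ITEMS 16

/* Kahn's algorithm. edge[u][v] is true when item u feeds item v.
 * Writes a topological order of the n items into order[] and returns the
 * number of items placed; a result smaller than n means a cycle exists. */
static int topo_order(int n, bool edge[MAX_ITEMS][MAX_ITEMS],
                      int order[MAX_ITEMS]) {
    int indeg[MAX_ITEMS] = {0};
    bool done[MAX_ITEMS] = {false};

    for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
            if (edge[u][v]) indeg[v]++;

    int count = 0;
    for (;;) {
        /* Pick the lowest-numbered unplaced item with no pending inputs. */
        int next = -1;
        for (int v = 0; v < n; v++) {
            if (!done[v] && indeg[v] == 0) { next = v; break; }
        }
        if (next < 0) break;  /* nothing left, or only cyclic items remain */

        done[next] = true;
        order[count++] = next;
        for (int v = 0; v < n; v++)
            if (edge[next][v]) indeg[v]--;
    }
    return count;
}
```

Applied once per level, with communities as items at the upper levels and variables as items at the bottom level, this yields an order in which producers are visited before their consumers.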

Hierarchical Precision Tuning
Search through the hierarchy from the top down to the bottom.
[Figure: the search starts from the original precision configuration. Reducing precision at the top level (communities {pi, p, q}, {a, b, h, x}, {s}) to speed up the program gives the top-level precision configuration; reducing precision again at the bottom level (variables a, b, h, pi, p, q, x, s) gives the bottom-level precision configuration, which is the resulting type configuration.]
Global minimum configuration with 90% speedup!
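The top-down search can be sketched as a loop over hierarchy levels that first tries to lower whole groups and then refines the groups that could not be lowered together. The slides do not spell out the search strategy used within a level, so the greedy loop here is a stand-in, and the two-precision model, `oracle_fn`, `toy_oracle`, `try_lower`, and `hierarchical_tune` are all hypothetical names for illustration.

```c
#include <stdbool.h>

#define NVARS 8
enum { LOW = 0, HIGH = 1 };  /* e.g. float vs. double */

/* Hypothetical oracle: true if a configuration satisfies the accuracy
 * constraint. A real tuner would rebuild and run the program here. */
typedef bool (*oracle_fn)(const int prec[NVARS]);

static int oracle_calls = 0;

/* Toy stand-in: variables 3 and 4 must stay in high precision. */
static bool toy_oracle(const int prec[NVARS]) {
    oracle_calls++;
    return prec[3] == HIGH && prec[4] == HIGH;
}

/* Lower every still-HIGH variable of group g; keep the change only if the
 * oracle accepts the resulting configuration, otherwise roll back. */
static bool try_lower(int prec[NVARS], const int group[NVARS], int g,
                      oracle_fn ok) {
    int saved[NVARS];
    bool changed = false;
    for (int v = 0; v < NVARS; v++) {
        saved[v] = prec[v];
        if (group[v] == g && prec[v] == HIGH) {
            prec[v] = LOW;
            changed = true;
        }
    }
    if (changed && ok(prec)) return true;
    for (int v = 0; v < NVARS; v++) prec[v] = saved[v];
    return false;
}

/* levels[0] is the top of the hierarchy (coarsest grouping); each later
 * level refines it. levels[l][v] is the group id of variable v at level l,
 * assumed to lie in 0..NVARS-1. */
static void hierarchical_tune(int prec[NVARS], int nlevels,
                              int levels[][NVARS], oracle_fn ok) {
    for (int v = 0; v < NVARS; v++) prec[v] = HIGH;
    for (int l = 0; l < nlevels; l++)
        for (int g = 0; g < NVARS; g++)
            try_lower(prec, levels[l], g, ok);
}
```

Groups that were lowered successfully at an upper level are skipped without an oracle call at lower levels, which is where a hierarchy with few top-level groups can save evaluations.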

Experimental Setup
• Hierarchical search algorithm implemented in the tool HiFPTuner
• Benchmarks: 4 GSL programs (inputs that maximize coverage), 2 NAS Parallel Benchmarks (inputs Class A), 3 other numerical programs including simpsons (input free)
• Error thresholds
  o Multiple error thresholds: 10^-4, 10^-6, 10^-8, and 10^-10
  o User can evaluate trade-off between accuracy and speedup
  o 35 experiments in total
• Evaluated search efficiency and effectiveness in comparison with state-of-the-art tool Precimonious

Number of Communities
(L, D, F: number of long double, double, and float variables)

             Initial Type Configuration   Items to Tune   Communities
Program      L    D    F    C             # Items         L2   L1   L0
simpsons     9    0    0    2             11              -    6    11
arclength    8    0    0    3             11              -    7    11
piqpr        17   0    0    0             17              -    6    17
fft          0    22   1    2             25              11   14   25
gaussian     0    56   0    2             58              18   22   58
sum          0    34   0    2             36              23   24   36
bessel       0    24   0    5             29              11   14   29
ep           0    13   0    4             17              9    9    17
cp           0    32   0    3             35              21   24   35

The number of tunable items at the top level of the hierarchy is reduced by 53%, from 239 to 112.
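The 53% figure can be checked from the table itself: the # Items column sums to 239, and the top-level community counts (L2 where present, otherwise L1) sum to 112. As a worked check:

```latex
\[
\frac{239 - 112}{239} = \frac{127}{239} \approx 0.531
\]
```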

RQ1: Search Efficiency
How efficient is hierarchical search for precision tuning in comparison with Precimonious?
Answer: HiFPTuner exhibits higher search efficiency than Precimonious for 75.9% (22 out of 29) of the experiments that require tuning.
Overall, HiFPTuner explores 45% (3,326) fewer configurations than Precimonious.

Configurations for Error Threshold 10^-8
[Bar chart: number of configurations explored by HiFPTuner and Precimonious for simpsons, arclength, piqpr, fft, gaussian, sum, ep, and cp; y-axis "Number of Configurations", from 0 to 800.]

Configurations for Error Threshold 10^-8

             Initial Type Configuration   HiFPTuner (error threshold: 10^-8)   Precimonious (error threshold: 10^-8)
Program      L    D    F    C             L    D    F    S    # Configs        L    D    F    S    # Configs
simpsons     9    0    0    2             0    8    1    1    24               1    3    5    1    116
arclength    8    0    0    3             0    7    1    1    30               0    7    1    1    142
piqpr        17   0    0    0             3    14   0    0    52               3    13   1    0    164
fft          0    22   0    2             0    22   0    2    43               0    21   2    0    297
gaussian     0    56   0    2             0    10   46   2    211              0    56   0    2    275
sum          0    34   0    2             0    10   24   2    533              0    34   0    2    433
ep           0    13   0    4             0    13   0    4    45               0    13   0    4    77
cp           0    32   0    3             0    24   8    3    497              0    32   0    3    735
