
Exploiting Community Structure for Floating-Point Precision Tuning



  1. Exploiting Community Structure for Floating-Point Precision Tuning Hui Guo Cindy Rubio-González ISSTA’18 – Amsterdam, Netherlands, July 2018

  2. Background
     • Floating-point (FP) arithmetic is used in many domains
     • Reasoning about FP programs is difficult
       - Large variety of numerical problems
       - Most programmers are not experts in FP
     • Common practice: use the highest available precision
       - Disadvantage: more expensive!
     • Tools have been developed for precision tuning
       - Given: accuracy constraints
       - Action: reduce precision
       - Goal: performance

  3. Precision Tuning Example

     Original Program (error threshold: 10^-8)

       long double fun(long double p) {
         long double pi = acos(-1.0);
         long double q = sin(pi * p);
         return q;
       }

       void simpsons() {
         long double a, b;
         long double h, s, x;
         const long double fuzz = 1e-26;
         const int n = 2000000;
         …
       L100:
         x = x + h;
         s = s + 4.0 * fun(x);
         x = x + h;
         if (x + fuzz >= b) goto L110;
         s = s + 2.0 * fun(x);
         goto L100;
       L110:
         s = s + fun(x);
         // final answer: (long double)h * s / 3.0
       }

  4. Precision Tuning Example

     Original Program

       long double fun(long double p) {
         long double pi = acos(-1.0);
         long double q = sin(pi * p);
         return q;
       }

       void simpsons() {
         long double a, b;
         long double h, s, x;
         const long double fuzz = 1e-26;
         const int n = 2000000;
         …
       L100:
         x = x + h;
         s = s + 4.0 * fun(x);
         x = x + h;
         if (x + fuzz >= b) goto L110;
         s = s + 2.0 * fun(x);
         goto L100;
       L110:
         s = s + fun(x);
         // final answer: (long double)h * s / 3.0
       }

     Tuned Program

       long double fun(double p) {
         double pi = acos(-1.0);
         long double q = sinf(pi * p);
         return q;
       }

       void simpsons() {
         float a, b;
         double s, x; float h;
         const long float fuzz = 1e-26;
         const int n = 2000000;
         …
       L100:
         x = x + h;
         s = s + 4.0 * fun(x);
         x = x + h;
         if (x + fuzz >= b) goto L110;
         s = s + 2.0 * fun(x);
         goto L100;
       L110:
         s = s + fun(x);
         // final answer: (long double)h * s / 3.0
       }

     Tuned program runs 78.7% faster!
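The accuracy check behind such a tuning step can be sketched in C as follows. This is a minimal illustration, not the presenters' tooling: the macro `DEFINE_SIMPSON`, the functions `simpson_ld`, `simpson_d`, `simpson_f`, and `within_threshold` are hypothetical names, and the integrand 4/(1+x^2) over [0, 1] is a stand-in chosen so the sketch needs no math library (the slide integrates sin(pi*x)). Each candidate precision is compared against the highest-precision result using a relative error threshold.

```c
/* Sketch only: composite Simpson's rule instantiated at three precisions.
   Assumption: the integrand 4/(1+x^2) on [0,1] (exact value: pi) replaces
   the slide's sin(pi*x). The subinterval count n must be even. */
#define DEFINE_SIMPSON(NAME, T)                          \
    static T NAME(int n) {                               \
        const T h = (T)1 / (T)n;                         \
        T s = (T)4;                                      \
        for (int i = 1; i < n; i++) {                    \
            const T x = (T)i * h;                        \
            const T fx = (T)4 / ((T)1 + x * x);          \
            s += (T)((i % 2) ? 4 : 2) * fx;              \
        }                                                \
        s += (T)2;                                       \
        return h * s / (T)3;                             \
    }

DEFINE_SIMPSON(simpson_ld, long double) /* reference: highest precision */
DEFINE_SIMPSON(simpson_d, double)       /* candidate: double */
DEFINE_SIMPSON(simpson_f, float)        /* candidate: float */

/* Returns 1 if the candidate's relative error against the reference
   is within the threshold (for example 1e-8), 0 otherwise. */
static int within_threshold(long double reference, long double candidate,
                            long double threshold) {
    long double err = (candidate - reference) / reference;
    if (err < 0) {
        err = -err;
    }
    return err <= threshold;
}
```

A tuner would call something like `within_threshold(simpson_ld(n), simpson_d(n), 1e-8L)` for each candidate configuration and keep only configurations that pass. Real tools vary the type of each variable independently rather than the type of the whole routine, which is what makes the search space large.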

  5-10. State-of-the-art: Black-box Precision Tuning
     [Animated figure across slides 5 to 10: a search between the double-precision configuration (✔) and the single-precision configuration (✘). Intermediate configurations are tried and marked ✔ or ✘ until the search separates the failed configurations (✘) from the proposed configuration (✔).]

  11. State-of-the-art: Black-box Precision Tuning
     • State of the art groups variables arbitrarily
     • Black-box nature
       - Related variables are assigned types independently
       - Large number of variables → slow search
       - More type casts → less speedup
     [Figure: Original program; Local minimum (uses lower precision, speedup: 78.7%); Global minimum (shifts precision less often, speedup: 90%).]

  12. Exploiting Community Structure
     • Can we leverage the program to perform a more informed precision tuning?
     • White-box nature
       - Related variables pre-grouped into a hierarchy → same type
       - Fewer groups in search space → faster search
       - Fewer type casts → larger speedups
     [Figure: eight items (1 to 8) grouped into a hierarchy with Level 2 at the top, Level 1 in the middle, and Level 0 at the bottom; the search proceeds top to bottom.]

  13. Approach
     Inputs: test inputs, source code, accuracy constraint
     1. Type Dependence Analysis + Edge Profiling → weighted dependence graph
     2. Iterative Community Detection + Ordering → ordered community structure of variables
     3. Hierarchical Precision Tuning → type configuration
     The resulting type configuration speeds up the program by reducing precision with respect to the accuracy constraint.

  14. Type Dependence Analysis + Edge Profiling
     Identify assignments to floating-point variables

        1  long double fun(long double p) {
        2    long double pi = acos(-1.0);
        3    long double q = sin(pi * p);
        4    return q;
        5  }
        6
        7  void simpsons() {
        8    long double a, b;
        9    // subinterval length, integral approximation, x
       10    long double h,s,x;
       11    const long double fuzz = 1e-26;
       12    const int n = 2000000;
       13    a = 0.0;
       14    b = 1.0;
       15    h = (b - a) / n;
       16    x = a;
       17    s = fun(x);
       18  L100:
       19    x = x + h;
       20    s = s + 4.0 * fun(x);
       21    x = x + h;
       22    if (x + fuzz >= b) goto L110;
       23    s = s + 2.0 * fun(x);
       24    goto L100;
       25  L110:
       26    s = s + fun(x);
       27    printf("%1.16Le\n", (long double)h * s / 3.0);
       28  }

  15. Type Dependence Analysis + Edge Profiling
     Assignments to floating-point variables (numbered as in the program listing):

        2    long double pi = acos(-1.0);
        3    long double q = sin(pi * p);
       11    const long double fuzz = 1e-26;
       13    a = 0.0;
       14    b = 1.0;
       15    h = (b - a) / n;
       16    x = a;
       17    s = fun(x);
       19    x = x + h;
       20    s = s + 4.0 * fun(x);
       21    x = x + h;
       23    s = s + 2.0 * fun(x);
       26    s = s + fun(x);

     [Figure: weighted dependence graph over the variables a, b, h, x, s, fuzz, p, pi, and q, with edge weights of 1, 2000000, and 2000001. Legend: variables, variable dependences, variables in main, variables in fun.]

     A vertex in the graph represents an FP variable, and an edge u → v denotes that value u is used to compute value v at least once.
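One way to picture edge profiling on this example is to count, per dependence edge, how often the corresponding assignment executes. The C sketch below is an illustration under stated assumptions, not the tool's instrumentation: the names `record`, `record_fun_call`, and `profile_simpsons` are hypothetical, self-dependences such as x → x and the comparison involving fuzz are ignored, and the loop exit is driven by an integer counter instead of the floating-point test `x + fuzz >= b`.

```c
/* Hypothetical vertex ids for the FP variables of the simpsons example. */
enum { VAR_A, VAR_B, VAR_H, VAR_X, VAR_S, VAR_P, VAR_PI, VAR_Q, NUM_VARS };

/* weight[u][v] counts how often value u was used to compute value v. */
static unsigned long weight[NUM_VARS][NUM_VARS];

static void record(int u, int v) {
    weight[u][v]++;
}

/* One evaluation of fun(x) whose result is accumulated into s:
   x flows into p, p and pi flow into q, and q flows into s. */
static void record_fun_call(void) {
    record(VAR_X, VAR_P);
    record(VAR_P, VAR_Q);
    record(VAR_PI, VAR_Q);
    record(VAR_Q, VAR_S);
}

/* Replays the control flow of simpsons() for an even n and records
   the dependence edges instead of doing the arithmetic. */
static void profile_simpsons(long n) {
    for (int u = 0; u < NUM_VARS; u++) {
        for (int v = 0; v < NUM_VARS; v++) {
            weight[u][v] = 0;
        }
    }
    record(VAR_A, VAR_H);           /* h = (b - a) / n; */
    record(VAR_B, VAR_H);
    record(VAR_A, VAR_X);           /* x = a; */
    record_fun_call();              /* s = fun(x); */
    for (long i = 1; i <= n / 2; i++) {
        record(VAR_H, VAR_X);       /* x = x + h; */
        record_fun_call();          /* s = s + 4.0 * fun(x); */
        record(VAR_H, VAR_X);       /* x = x + h; */
        if (i == n / 2) {
            break;                  /* stands in for: goto L110 */
        }
        record_fun_call();          /* s = s + 2.0 * fun(x); */
    }
    record_fun_call();              /* s = s + fun(x); */
}
```

Under these assumptions, calling `profile_simpsons(2000000)` counts 2000000 for h → x and 2000001 for each edge on the path x → p → q → s, the same two large values that appear as edge weights on the slide.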

  16. Iterative Community Detection + Ordering
     Use modularity maximization [1, 2] to iteratively detect communities on the generated dependence graph until no new communities are found.

     [Figure: community structure of floating-point variables, drawn from Bottom (the variables s, x, a, h, b, q, pi, p) to Top (the communities c1, c2, c3), with weighted edges between the items at each level.]

     [1] M. E. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 2004.
     [2] M. E. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 2006.
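The quantity that modularity maximization optimizes can be sketched in C. For an undirected weighted graph with edge weights A_ij, weighted degrees k_i, and total weight 2m = sum of all A_ij, the modularity of a partition is Q = (1/2m) * sum over pairs i, j in the same community of (A_ij - k_i*k_j/2m). The function below only evaluates Q for a given partition; it is not the detection algorithm from [1] or [2], and treating the dependence graph as undirected is an assumption of this sketch. The names `NUM_NODES` and `modularity` are hypothetical.

```c
#define NUM_NODES 4

/* Modularity Q of a partition of an undirected weighted graph.
   w must be symmetric; community[i] is the community id of node i. */
static double modularity(double w[NUM_NODES][NUM_NODES],
                         const int community[NUM_NODES]) {
    double degree[NUM_NODES] = {0.0};
    double two_m = 0.0;

    /* Weighted degree of each node and total weight 2m. */
    for (int i = 0; i < NUM_NODES; i++) {
        for (int j = 0; j < NUM_NODES; j++) {
            degree[i] += w[i][j];
            two_m += w[i][j];
        }
    }
    if (two_m == 0.0) {
        return 0.0; /* no edges: modularity is defined as 0 here */
    }

    /* Sum observed weight minus expected weight over same-community pairs. */
    double q = 0.0;
    for (int i = 0; i < NUM_NODES; i++) {
        for (int j = 0; j < NUM_NODES; j++) {
            if (community[i] == community[j]) {
                q += w[i][j] - degree[i] * degree[j] / two_m;
            }
        }
    }
    return q / two_m;
}
```

For a toy graph with two disconnected unit-weight edges (0-1 and 2-3), the partition {0, 1}, {2, 3} gives Q = 0.5 and putting all four nodes in one community gives Q = 0, so a maximizer would prefer the split.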

  17. Iterative Community Detection + Ordering
     Sort the items at each level of the hierarchy using topological order to follow the dependence flow.

     [Figure: ordered community structure, drawn from Bottom (the variables s, x, a, h, b, pi, q, p) to Top (the communities c1, c2, c3), with the items at each level rearranged into topological order.]
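A topological order of the items at one level can be computed with Kahn's algorithm, sketched below in C. This is a generic illustration rather than the presenters' implementation: the names `MAX_VERTICES` and `topological_order` are hypothetical, and the sketch assumes the dependence graph at that level is acyclic (it reports a cycle by returning fewer than n vertices instead of resolving it).

```c
#define MAX_VERTICES 16

/* Kahn's algorithm on an adjacency matrix: adj[u][v] != 0 means u -> v.
   Writes a topological order of vertices 0..n-1 into order[] and returns
   how many vertices were placed; a result below n indicates a cycle. */
static int topological_order(int n, int adj[MAX_VERTICES][MAX_VERTICES],
                             int order[MAX_VERTICES]) {
    int indegree[MAX_VERTICES] = {0};
    for (int u = 0; u < n; u++) {
        for (int v = 0; v < n; v++) {
            if (adj[u][v]) {
                indegree[v]++;
            }
        }
    }

    /* order[] doubles as the work queue: [head, tail) is still pending. */
    int head = 0;
    int tail = 0;
    for (int v = 0; v < n; v++) {
        if (indegree[v] == 0) {
            order[tail++] = v;
        }
    }
    while (head < tail) {
        int u = order[head++];
        for (int v = 0; v < n; v++) {
            if (adj[u][v] && --indegree[v] == 0) {
                order[tail++] = v;
            }
        }
    }
    return tail;
}
```

Sorting each level this way means an item is visited only after the items it depends on, which is what following the dependence flow amounts to.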

  18. Hierarchical Precision Tuning
     Search through the hierarchy from the top down to the bottom.

     [Figure: starting from the original precision configuration, precision is reduced to speed up the program, first at the top level of the hierarchy (communities {pi, p, q}, {a, b, h, x}, and {s}), giving the top-level precision configuration, and then at the bottom level (individual variables a, b, h, pi, p, q, x, s), giving the bottom-level precision configuration. The output is the type configuration.]

     Global minimum configuration with 90% speedup!
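The top-down idea can be sketched in C as a greedy two-level search: try to lower a whole community at once, and only if that fails retry its members one by one. This is a deliberately simplified stand-in, not HiFPTuner's search algorithm; it ignores performance measurements and uses a hypothetical evaluator. The names `tune_top_down`, `set_group`, `accepts_fn`, and `demo_accepts`, the variable numbering, and the rule that x and s must stay in high precision are all assumptions made for illustration.

```c
#include <stdbool.h>

#define NUM_VARS 8

/* Assumed variable numbering: pi=0, p=1, q=2, a=3, b=4, h=5, x=6, s=7. */
typedef bool (*accepts_fn)(const bool lowered[NUM_VARS]);

/* Hypothetical accuracy check: pretend the error threshold is met
   only while x (6) and s (7) keep their original precision. */
static bool demo_accepts(const bool lowered[NUM_VARS]) {
    return !lowered[6] && !lowered[7];
}

/* Marks every variable of community g as lowered or not lowered. */
static void set_group(const int group_of[NUM_VARS], int g, bool value,
                      bool lowered[NUM_VARS]) {
    for (int v = 0; v < NUM_VARS; v++) {
        if (group_of[v] == g) {
            lowered[v] = value;
        }
    }
}

/* Greedy top-down search over a two-level hierarchy. */
static void tune_top_down(const int group_of[NUM_VARS], int num_groups,
                          accepts_fn accepts, bool lowered[NUM_VARS]) {
    for (int v = 0; v < NUM_VARS; v++) {
        lowered[v] = false; /* start from the original configuration */
    }
    for (int g = 0; g < num_groups; g++) {
        /* Top level: lower the whole community in one step. */
        set_group(group_of, g, true, lowered);
        if (accepts(lowered)) {
            continue;
        }
        /* Bottom level: revert, then try the members individually. */
        set_group(group_of, g, false, lowered);
        for (int v = 0; v < NUM_VARS; v++) {
            if (group_of[v] != g) {
                continue;
            }
            lowered[v] = true;
            if (!accepts(lowered)) {
                lowered[v] = false;
            }
        }
    }
}
```

With the communities {pi, p, q}, {a, b, h, x}, and {s} and the hypothetical evaluator above, the first community is lowered in a single evaluation, while the second falls back to per-variable trials because x must stay in high precision. Grouping related variables is what keeps the number of evaluated configurations small.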

  19. Experimental Setup
     • Hierarchical search algorithm implemented in the tool HiFPTuner
     • Benchmarks: 4 GSL programs (inputs that maximize coverage), 2 NAS Parallel Benchmarks (Class A inputs), and 3 other numerical programs including simpsons (input free)
     • Error thresholds
       o Multiple error thresholds: 10^-4, 10^-6, 10^-8, and 10^-10
       o User can evaluate the trade-off between accuracy and speedup
       o 35 experiments in total
     • Evaluated search efficiency and effectiveness in comparison with the state-of-the-art tool Precimonious

  20. Number of Communities

                  Initial Type Configuration   Items to Tune
                                                          Communities
      Program       L    D    F    C           # Items    L2   L1   L0
      simpsons      9    0    0    2              11       -    6   11
      arclength     8    0    0    3              11       -    7   11
      piqpr        17    0    0    0              17       -    6   17
      fft           0   22    1    2              25      11   14   25
      gaussian      0   56    0    2              58      18   22   58
      sum           0   34    0    2              36      23   24   36
      bessel        0   24    0    5              29      11   14   29
      ep            0   13    0    4              17       9    9   17
      cp            0   32    0    3              35      21   24   35

     The number of tunable items at the top level of the hierarchy is reduced by 53%, from 239 to 112.

  21. RQ1: Search Efficiency
     How efficient is hierarchical search for precision tuning in comparison with Precimonious?
     Answer: HiFPTuner exhibits higher search efficiency than Precimonious for 75.9% (22 out of 29) of the experiments that require tuning.
     Overall, HiFPTuner explores 45% (3,326) fewer configurations than Precimonious.

  22. Configurations for Error Threshold 10^-8
     [Bar chart: number of configurations explored by HiFPTuner and by Precimonious for simpsons, arclength, piqpr, fft, gaussian, sum, ep, and cp. Y-axis: Number of Configurations, 0 to 800.]

  23. Configurations for Error Threshold 10^-8

                  Initial Type        HiFPTuner                      Precimonious
                  Configuration       (error threshold: 10^-8)       (error threshold: 10^-8)
      Program       L    D    F    C    L    D    F    S  # Config.    L    D    F    S  # Config.
      simpsons      9    0    0    2    0    8    1    1      24       1    3    5    1     116
      arclength     8    0    0    3    0    7    1    1      30       0    7    1    1     142
      piqpr        17    0    0    0    3   14    0    0      52       3   13    1    0     164
      fft           0   22    0    2    0   22    0    2      43       0   21    2    0     297
      gaussian      0   56    0    2    0   10   46    2     211       0   56    0    2     275
      sum           0   34    0    2    0   10   24    2     533       0   34    0    2     433
      ep            0   13    0    4    0   13    0    4      45       0   13    0    4      77
      cp            0   32    0    3    0   24    8    3     497       0   32    0    3     735
