Exploiting Community Structure for Floating-Point Precision Tuning


SLIDE 1

Exploiting Community Structure for Floating-Point Precision Tuning

ISSTA’18 – Amsterdam, Netherlands, July 2018

Hui Guo, Cindy Rubio-González

SLIDE 2

Background

  • Floating-point (FP) arithmetic used in many domains
  • Reasoning about FP programs is difficult
  • Large variety of numerical problems
  • Most programmers are not experts in FP
  • Common practice: use highest available precision
  • Disadvantage: more expensive!
  • Tools have been developed for precision tuning

Given: accuracy constraints
Action: reduce precision
Goal: performance

SLIDE 3

Precision Tuning Example

Error threshold: 10^-8

Original Program

    long double fun(long double p) {
        long double pi = acos(-1.0);
        long double q = sin(pi * p);
        return q;
    }

    void simpsons() {
        long double a, b;
        long double h, s, x;
        const long double fuzz = 1e-26;
        const int n = 2000000;
        …
    L100:
        x = x + h;
        s = s + 4.0 * fun(x);
        x = x + h;
        if (x + fuzz >= b) goto L110;
        s = s + 2.0 * fun(x);
        goto L100;
    L110:
        s = s + fun(x);
        // final answer: (long double)h * s / 3.0
    }

SLIDE 4

Precision Tuning Example

Tuned Program

Only the declarations and the call to sin change; the loop is identical to the original program.

    long double fun(double p) {
        double pi = acos(-1.0);
        long double q = sinf(pi * p);
        return q;
    }

    void simpsons() {
        float a, b;
        double s, x;
        float h;
        const long float fuzz = 1e-26;  // type as printed on the slide; "long float" is not valid C
        const int n = 2000000;
        …
    L100:
        x = x + h;
        s = s + 4.0 * fun(x);
        x = x + h;
        if (x + fuzz >= b) goto L110;
        s = s + 2.0 * fun(x);
        goto L100;
    L110:
        s = s + fun(x);
        // final answer: (long double)h * s / 3.0
    }

Tuned program runs 78.7% faster!
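To make the accuracy constraint concrete, here is a minimal sketch of why some variables cannot simply be demoted. It isolates the `x = x + h` accumulation from the example and runs it with a double and with a float accumulator. The function names and the use of relative error on the final value are assumptions for illustration; the slides do not say how the error against the threshold is measured.

```c
#include <assert.h>
#include <stdbool.h>

/* Walks from 0 towards 1 in n steps of h = 1/n, as `x = x + h` does in
 * the example, using double for both h and x. */
double walk_double(int n) {
    assert(n > 0);
    double h = 1.0 / n;
    double x = 0.0;
    for (int i = 0; i < n; i++) {
        x = x + h;
    }
    return x;
}

/* The same walk with h and x demoted to float. */
double walk_float(int n) {
    assert(n > 0);
    float h = 1.0f / (float)n;
    float x = 0.0f;
    for (int i = 0; i < n; i++) {
        x = x + h;
    }
    return (double)x;
}

/* True if value is within the given relative error of reference
 * (assumed error metric, not taken from the slides). */
bool within_threshold(double reference, double value, double threshold) {
    double diff = value - reference;
    double scale = reference < 0.0 ? -reference : reference;
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff <= threshold * scale;
}
```

With n = 2000000 and a threshold of 1e-8, `within_threshold(1.0, walk_double(n), 1e-8)` holds while the float walk misses the threshold, which is the kind of outcome a tuner has to detect before accepting a lower-precision configuration.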

SLIDE 5

State-of-the-art: Black-box Precision Tuning

[Figure: search over type configurations, each marked ✔ or ✘. Legend: double precision, single precision.]

SLIDE 10

State-of-the-art: Black-box Precision Tuning

[Figure: final step of the search over type configurations, each marked ✔ or ✘. Legend: double precision, single precision. Labels: Failed configurations, Proposed configuration.]

SLIDE 11

State-of-the-art: Black-box Precision Tuning

  • Black box nature
  • Related variables assigned types independently
  • Large number of variables → Slow search
  • More type casts → Less speedup
  • State of the art groups variables arbitrarily

[Figure: search landscape marking the original program, a local minimum, and the global minimum.]

  • Local minimum: uses lower precision, speedup: 78.7%
  • Global minimum: shifts precision less often, speedup: 90%

SLIDE 12

Exploiting Community Structure

  • White box nature
  • Related variables pre-grouped into hierarchy → Same type
  • Fewer groups in search space → Faster search
  • Fewer type casts → Larger speedups
  • Can we leverage the program to perform a more informed precision tuning?

[Figure: eight items (1 to 8) grouped into a three-level hierarchy (Level 0, Level 1, Level 2); search proceeds top to bottom.]

SLIDE 13

Approach

Inputs: SOURCE CODE, TEST INPUTS, Accuracy Constraint

  • 1. Type Dependence Analysis + Edge Profiling → Weighted Dependence Graph
  • 2. Iterative Community Detection + Ordering → Ordered Community Structure of Variables
  • 3. Hierarchical Precision Tuning → TYPE CONFIGURATION

Speeds up the program by reducing precision subject to the accuracy constraint

SLIDE 14

Type Dependence Analysis + Edge Profiling

Identify assignments to floating-point variables

    #include <math.h>   // acos, sin
    #include <stdio.h>  // printf

    long double fun(long double p) {
        long double pi = acos(-1.0);
        long double q = sin(pi * p);
        return q;
    }

    void simpsons() {
        long double a, b;
        // subinterval length, integral approximation, x
        long double h, s, x;
        const long double fuzz = 1e-26;
        const int n = 2000000;
        a = 0.0;
        b = 1.0;
        h = (b - a) / n;
        x = a;
        s = fun(x);
    L100:
        x = x + h;
        s = s + 4.0 * fun(x);
        x = x + h;
        if (x + fuzz >= b) goto L110;
        s = s + 2.0 * fun(x);
        goto L100;
    L110:
        s = s + fun(x);
        printf("%1.16Le\n", (long double)h * s / 3.0);
    }

SLIDE 15

Type Dependence Analysis + Edge Profiling

Assignments to floating-point variables:

    long double pi = acos(-1.0);
    long double q = sin(pi * p);
    const long double fuzz = 1e-26;
    a = 0.0;
    b = 1.0;
    h = (b - a) / n;
    x = a;
    s = fun(x);
    x = x + h;
    s = s + 4.0 * fun(x);
    x = x + h;
    s = s + 2.0 * fun(x);
    s = s + fun(x);

[Figure: weighted dependence graph showing variable dependence. Vertices: a, b, h, x, fuzz, s (variables in main) and p, q, pi (variables in fun). Edge weights shown: 1, 1, 1, 2000000, 2000000, 2000000, 2000001, 2000001, 2000001.]

A vertex in the graph represents an FP variable, and an edge u → v denotes that the value of u is used to compute the value of v at least once
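As a rough sketch of how edge profiling could produce such weights, the code below keeps one counter per edge u → v and increments it each time an assignment that reads u and writes v executes. The vertex ids, the hook `record_dep`, and the exact edge set (including self-dependences such as x → x and the call and return edges of `fun`) are my own reading of the example, not the tool's instrumentation; the figure on the slide may model calls differently.

```c
#include <assert.h>

/* Hypothetical vertex ids for the floating-point variables of the example. */
enum var { A, B, H, X, S, FUZZ, P, PI, Q, NVARS };

/* weight[u][v]: how many times the value of u was used to compute v. */
static unsigned long weight[NVARS][NVARS];

/* Profiling hook: called once per executed assignment and operand. */
void record_dep(enum var u, enum var v) {
    assert(u < NVARS && v < NVARS);
    weight[u][v]++;
}

/* Dependences exercised by one statement of the form `target = ... fun(x)`. */
void profile_fun_call(enum var target) {
    record_dep(X, P);       /* argument x is bound to parameter p */
    record_dep(PI, Q);      /* q = sin(pi * p) */
    record_dep(P, Q);
    record_dep(Q, target);  /* returned q flows into the caller's variable */
}

/* Replays the assignments of simpsons() for n subintervals (n even).
 * The exit test `x + fuzz >= b` is assumed to fire after exactly n steps. */
void profile_simpsons(unsigned long n) {
    assert(n >= 2 && n % 2 == 0);
    record_dep(A, H);                            /* h = (b - a) / n */
    record_dep(B, H);
    record_dep(A, X);                            /* x = a */
    profile_fun_call(S);                         /* s = fun(x) */
    for (unsigned long i = 0; i < n / 2; i++) {
        record_dep(X, X); record_dep(H, X);      /* x = x + h */
        record_dep(S, S); profile_fun_call(S);   /* s = s + 4.0 * fun(x) */
        record_dep(X, X); record_dep(H, X);      /* x = x + h */
        if (i + 1 == n / 2) break;               /* goto L110 */
        record_dep(S, S); profile_fun_call(S);   /* s = s + 2.0 * fun(x) */
    }
    record_dep(S, S); profile_fun_call(S);       /* s = s + fun(x) */
}
```

With n = 2000000 this replay gives weight 1 for a → h, 2000000 for h → x, and 2000001 for p → q, the same three magnitudes that appear as edge weights on the slide.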

SLIDE 16

Iterative Community Detection + Ordering

Use modularity maximization [1, 2] to iteratively detect communities on the generated dependence graph until no new communities are found

[Figure: community structure of floating-point variables, drawn as a hierarchy from Top to Bottom. Nodes: variables x, a, h, b, q, p, pi, s and communities c1, c2, c3. Edge weights shown: 4000003, 2000001, 2000000, 4000002, 2000000, 1, 1, 1, 2000000, 2000001, 2000000, 2000001, 2000001.]

[1] M. E. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 2004.
[2] M. E. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 2006.
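For reference, the quantity being maximized can be computed as below. This is a minimal sketch of Newman's modularity for an undirected weighted graph, Q = (1/2m) Σ_ij (A_ij − k_i k_j / 2m) [c_i = c_j], where k_i is the weighted degree of vertex i and 2m the total weight. It only scores a given partition; it does not search for the best one, and whether the tool uses this undirected form or a directed variant on the dependence graph is not stated on the slides. The function name and the fixed vertex cap are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VERTICES 64  /* fixed cap keeps the sketch allocation-free */

/* Modularity of a partition. adj is a symmetric n x n weight matrix stored
 * row-major; community[i] is the community label of vertex i. */
double modularity(const double *adj, const int *community, size_t n) {
    double degree[MAX_VERTICES] = {0.0};
    double two_m = 0.0;
    double q = 0.0;

    assert(n > 0 && n <= MAX_VERTICES);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            degree[i] += adj[i * n + j];  /* weighted degree k_i */
            two_m += adj[i * n + j];      /* total weight, counted both ways */
        }
    }
    assert(two_m > 0.0);

    /* Sum observed minus expected weight over same-community pairs. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (community[i] == community[j]) {
                q += adj[i * n + j] - degree[i] * degree[j] / two_m;
            }
        }
    }
    return q / two_m;
}
```

For two triangles joined by a single edge, labelling each triangle as its own community gives Q = 5/14, while putting all six vertices in one community gives Q = 0, so the first partition is preferred.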

SLIDE 17

Iterative Community Detection + Ordering

Sort the items at each level of the hierarchy using topological order to follow the dependence flow

[Figure: the community structure from slide 16 next to the ordered community structure, whose labels read: c1, c3, x, a, h, b, q, p, c2, pi, s.]
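One way to obtain such an order is Kahn's algorithm over the items of a single level. The sketch below is an illustration under assumptions: the boolean matrix `dep`, the cap `MAX_ITEMS`, and the choice to ignore self-dependences (such as x = x + h) are mine, and the slides do not say how cycles between items are handled, so the function simply reports how many items it managed to place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITEMS 32

/* Kahn's algorithm. dep[u][v] is true if item u feeds item v.
 * Writes a topological order into order[] and returns the number of items
 * placed, which is less than n if the dependences contain a cycle. */
size_t topo_order(size_t n, bool dep[MAX_ITEMS][MAX_ITEMS],
                  size_t order[MAX_ITEMS]) {
    size_t indegree[MAX_ITEMS] = {0};
    bool placed[MAX_ITEMS] = {false};
    size_t count = 0;

    assert(n <= MAX_ITEMS);
    for (size_t u = 0; u < n; u++) {
        for (size_t v = 0; v < n; v++) {
            if (u != v && dep[u][v]) indegree[v]++;  /* skip self-loops */
        }
    }
    for (;;) {
        /* Pick the lowest-numbered unplaced item with no pending inputs. */
        size_t next = n;
        for (size_t v = 0; v < n; v++) {
            if (!placed[v] && indegree[v] == 0) { next = v; break; }
        }
        if (next == n) break;  /* done, or only cyclic items remain */
        placed[next] = true;
        order[count++] = next;
        for (size_t v = 0; v < n; v++) {
            if (next != v && dep[next][v]) indegree[v]--;
        }
    }
    return count;
}
```

As a usage example, take three top-level items for simpsons, numbered 0 = {s}, 1 = {pi, p, q}, 2 = {a, b, h, x}, with dependences 2 → 1 (x is passed as p) and 1 → 0 (q is added into s). These inter-item edges are my reading of the code. The function then returns the order 2, 1, 0, which follows the data flow from the interval variables through `fun` to the accumulator.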

SLIDE 18

Hierarchical Precision Tuning

Search through the hierarchy from the top down to the bottom, reducing precision at each level to speed up the program

  • Original precision configuration
  • Top-level precision configuration: groups {pi, p, q}, {a, b, h, x}, {s}
  • Bottom-level precision configuration: individual variables a, b, h, pi, p, q, x, s

[Figure: community hierarchy of the simpsons variables from slide 16, with Top and Bottom levels and the same edge weights.]

TYPE CONFIGURATION: global minimum configuration with 90% speedup!
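The top-down idea can be sketched as a greedy two-level search: first decide the precision of each community as a unit, then refine single variables. This is a simplified stand-in, not HiFPTuner's algorithm, whose per-level search strategy is not described on the slides. The oracle type, `lower_together`, `tune`, and `toy_oracle` are all hypothetical names; a real oracle would run the program on the test inputs and check the accuracy constraint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NVARS 8

/* Precision levels, lowest first: float, double, long double. */
enum prec { P_FLOAT, P_DOUBLE, P_LONG_DOUBLE };

/* Hypothetical oracle: accepts a configuration if the accuracy
 * constraint still holds under it. */
typedef bool (*oracle_fn)(const enum prec cfg[NVARS]);

/* Lowers all listed variables by one level together. Keeps the change if
 * the oracle accepts it, otherwise restores cfg. Returns true on success. */
static bool lower_together(enum prec cfg[NVARS], const size_t *vars,
                           size_t count, oracle_fn accepts) {
    enum prec saved[NVARS];
    if (count == 0) return false;
    for (size_t k = 0; k < count; k++) {
        if (cfg[vars[k]] == P_FLOAT) return false;  /* already at the floor */
    }
    for (size_t i = 0; i < NVARS; i++) saved[i] = cfg[i];
    for (size_t k = 0; k < count; k++) {
        cfg[vars[k]] = (enum prec)(cfg[vars[k]] - 1);
    }
    if (accepts(cfg)) return true;
    for (size_t i = 0; i < NVARS; i++) cfg[i] = saved[i];
    return false;
}

/* Greedy two-level search: whole communities first, then single variables. */
void tune(enum prec cfg[NVARS], const size_t community[NVARS],
          size_t ncommunities, oracle_fn accepts) {
    for (size_t c = 0; c < ncommunities; c++) {   /* top level */
        size_t members[NVARS];
        size_t count = 0;
        for (size_t v = 0; v < NVARS; v++) {
            if (community[v] == c) members[count++] = v;
        }
        while (lower_together(cfg, members, count, accepts)) { }
    }
    for (size_t v = 0; v < NVARS; v++) {          /* bottom level */
        while (lower_together(cfg, &v, 1, accepts)) { }
    }
}

/* Toy oracle, for illustration only: pretends accuracy holds as long as
 * x (index 3) and s (index 7) stay at double or above. */
bool toy_oracle(const enum prec cfg[NVARS]) {
    return cfg[3] >= P_DOUBLE && cfg[7] >= P_DOUBLE;
}
```

With the variables ordered a, b, h, x, pi, p, q, s, the communities {a, b, h, x}, {pi, p, q}, {s}, and every variable starting at long double, `tune` under the toy oracle ends with a, b, h, pi, p, q at float and x, s at double. The top level settles most of the configuration with one decision per community, which is the source of the smaller search space; the toy oracle does not model speedup or cast overhead.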

SLIDE 19

Experimental Setup

  • Benchmarks: 4 GSL programs (inputs that maximize coverage), 2 NAS Parallel Benchmarks (inputs Class A), 3 other numerical programs including simpsons (input free)
  • Error thresholds
  • Multiple error thresholds: 10^-4, 10^-6, 10^-8, and 10^-10
  • User can evaluate trade-off between accuracy and speedup
  • 35 experiments in total
  • Evaluated search efficiency and effectiveness in comparison with state-of-the-art tool Precimonious
  • Hierarchical search algorithm implemented in tool HiFPTuner

SLIDE 20

Number of Communities

The Initial column lists the non-empty L, D, F, C entries of the initial type configuration in order. L2, L1, L0 give the number of items at each level of the hierarchy; L0 is the number of items to tune.

    Program     Initial     L2    L1    L0
    simpsons    9 2         -     6     11
    arclength   8 3         -     7     11
    piqpr       17          -     6     17
    fft         22 1 2      11    14    25
    gaussian    56 2        18    22    58
    sum         34 2        23    24    36
    bessel      24 5        11    14    29
    ep          13 4        9     9     17
    cp          32 3        21    24    35

The number of tunable items at the top level of the hierarchy is reduced by 53% from 239 to 112
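The 53% figure can be checked against the per-program item counts on this slide. Reading the flattened table so that the last count of each program is its bottom level (L0) and the first is its top level (L1 for the three two-level hierarchies, L2 otherwise) is an assumption, but it reproduces both totals:

```latex
\begin{align*}
\text{bottom level: } & 11 + 11 + 17 + 25 + 58 + 36 + 29 + 17 + 35 = 239 \\
\text{top level: }    & 6 + 7 + 6 + 11 + 18 + 23 + 11 + 9 + 21 = 112 \\
\text{reduction: }    & \frac{239 - 112}{239} = \frac{127}{239} \approx 0.531
\end{align*}
```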

SLIDE 21

RQ1: Search Efficiency

How efficient is hierarchical search for precision tuning in comparison with Precimonious?

Answer: HiFPTuner exhibits higher search efficiency than Precimonious for 75.9% (22 out of 29) of the experiments that require tuning

Overall, HiFPTuner explores 45% (3,326) fewer configurations than Precimonious

SLIDE 22

Configurations for Error Threshold 10^-8

Number of Configurations (bar chart, axis from 0 to 800):

    Program     HiFPTuner   Precimonious
    simpsons    24          116
    arclength   30          142
    piqpr       52          164
    fft         43          297
    gaussian    211         275
    sum         533         433
    ep          45          77
    cp          497         735

SLIDE 23

Configurations for Error Threshold 10^-8

Each configuration column lists the non-empty L, D, F, C entries in order. S is the number of explored configurations.

    Program     Initial     HiFPTuner (10^-8)     Precimonious (10^-8)
                config      config      S         config      S
    simpsons    9 2         8 1 1       24        1 3 5 1     116
    arclength   8 3         7 1 1       30        7 1 1       142
    piqpr       17          3 14        52        3 13 1      164
    fft         22 2        22 2        43        21 2        297
    gaussian    56 2        10 46 2     211       56 2        275
    sum         34 2        10 24 2     533       34 2        433
    ep          13 4        13 4        45        13 4        77
    cp          32 3        24 8 3      497       32 3        735

SLIDE 24

Configurations for Simpsons

[Figure: simpsons, error threshold 10^-8. x-axis: Number of Explored Configurations (20 to 120); y-axis: Average Precision (40, 64, 80). Series: Precimonious, HiFPTuner, HiFPTuner level line.]

  • HiFPTuner top level’s configuration has the best performance
  • 24 vs. 116 configurations

SLIDE 25

RQ2: Search Effectiveness

How effective is hierarchical search in finding higher quality configurations than Precimonious?

Answer: HiFPTuner finds better configurations for 51.7% (15 out of 29) of the experiments that require tuning compared to Precimonious

SLIDE 26

Summary

  • White-box approach for dynamic precision tuning
  • Analyzes code and runtime behavior to construct a weighted dependence graph used to detect communities of variables and construct a hierarchy
  • Experimental evaluation on 9 programs shows
    – HiFPTuner reduces the search space by 53% on average
    – HiFPTuner finds better configurations for 51.7% of the program × error-threshold combinations
  • HiFPTuner makes a step towards more scalable and effective floating-point precision tuning