The Numerical Reproducibility Fair Trade: Facing the Concurrency Challenges at the Extreme Scale


SLIDE 1

The Numerical Reproducibility Fair Trade: Facing the Concurrency Challenges at the Extreme Scale

Michela Taufer, with Dylan Chapp and Travis Johnston

Based on our IEEE Cluster 2015 paper

University of Delaware

SLIDE 2

Reproducible Accuracy

  • From Van Nostrand's Scientific Encyclopedia:
    ▪ Reproducibility: "closeness of agreement among repeated simulation results under the same initial conditions over time"
    ▪ Accuracy: "conformity of a resulted value to an accepted standard (or scientific laws)"

  • Context: ensemble simulations of scientific phenomena at extreme scale, with multithreading hardware consisting of multi-core processors coupled with many-core accelerators

SLIDE 3

  • Repeatability (same team, same experimental setup)
    ▪ The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.

  • Replicability (different team, same experimental setup)
    ▪ The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts.

  • Reproducibility (different team, different experimental setup)
    ▪ The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.

From: https://www.acm.org/publications/policies/artifact-review-badging

SLIDE 4

Molecular Dynamics on Accelerators

Force → Acceleration → Velocity → Position

MD simulation step:

  • Each GPU thread computes forces on single atoms
    ▪ E.g., bond, angle, dihedral, and nonbond forces
  • Forces are added to compute acceleration
  • Acceleration is used to update velocities
  • Velocities are used to update the positions
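The force → acceleration → velocity → position pipeline above can be sketched in a few lines. This is a minimal, illustrative Euler step in plain Python, not the symplectic integrator or GPU kernel a production MD code would use; the `spring` force model and all names are hypothetical.

```python
def md_step(positions, velocities, masses, compute_forces, dt):
    # Force -> acceleration -> velocity -> position, as on the slide.
    forces = compute_forces(positions)                 # on a GPU: one thread per atom
    accels = [f / m for f, m in zip(forces, masses)]   # a = F / m
    velocities = [v + a * dt for v, a in zip(velocities, accels)]
    positions = [x + v * dt for x, v in zip(positions, velocities)]
    return positions, velocities

# Toy 1-D usage: two atoms coupled by a unit-stiffness spring (hypothetical force model).
def spring(positions):
    d = positions[1] - positions[0]
    return [d, -d]    # equal and opposite forces pulling the atoms together

pos, vel = md_step([0.0, 2.0], [0.0, 0.0], [1.0, 1.0], spring, dt=0.01)
```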

SLIDE 5

The Strange Case of Constant Energy MDs

  • Enhancing performance of MD simulations allows simulations of larger time scales and length scales
  • GPU computing enables large-scale MD simulation
    ▪ Simulations exhibit unprecedented speed-up factors
  • MD simulation of NaI solution system containing 988 waters, 18 Na+, and 18 I−: GPU is ×15 faster than CPU

[Plot: constant energy MD simulation, single precision]

SLIDE 6

The Strange Case of Constant Energy MDs

  • Enhancing performance of MD simulations allows simulations of larger time scales and length scales
  • GPU computing enables large-scale MD simulation
    ▪ Simulations exhibit speed-up factors of ×10 to ×30
  • MD simulation of NaI solution system containing 988 waters, 18 Na+, and 18 I−: GPU is ×15 faster than CPU

[Plot: constant energy MD simulation, single precision]

SLIDE 7

The Strange Case of Constant Energy MDs

  • Enhancing performance of MD simulations allows simulations of larger time scales and length scales
  • GPU computing enables large-scale MD simulation
    ▪ Simulations exhibit unprecedented speed-up factors
  • MD simulation of NaI solution system containing 988 waters, 18 Na+, and 18 I−: GPU is ×15 faster than CPU

[Plot legend: GPU single precision, GPU single precision, GPU double precision]

SLIDE 8

The Strange Case of Constant Energy MDs

  • Enhancing performance of MD simulations allows simulations of larger time scales and length scales
  • GPU computing enables large-scale MD simulation
    ▪ Simulations exhibit unprecedented speed-up factors
  • MD simulation of NaI solution system containing 988 waters, 18 Na+, and 18 I−: GPU is ×15 faster than CPU

[Plot: constant energy MD simulation, GPU double precision]

SLIDE 9

Just a Case of Code Accuracy?

  • A plot of the energy fluctuations versus time step size should follow an approximately logarithmic trend [1]
  • Energy fluctuations are proportional to time step size for large time step sizes (larger than 0.5 fs)
  • A different behavior for step sizes less than 0.5 fs is consistent with results previously presented and discussed in other work [2]

[1] Allen and Tildesley, Oxford: Clarendon Press, 1987
[2] Bauer et al., J. Comput. Chem. 32(3): 375-385, 2011

SLIDE 10

The Exascale Environment

From a recent talk by Lucy Nowell, DoE Program Director (Distinguished Speaker Lecture, University of Delaware, Oct 10, 2014)


SLIDE 12

Discussion Outline

  • Focus on reproducible accuracy of global summation
  • Scientists demand increased reproducible accuracy
    ▪ Must be reproducible enough
  • Many approaches have been proposed
    ▪ Must be cost effective
  • Empirical results illustrate the need for runtime selection of reduction operators that ensure a given degree of reproducible accuracy

SLIDE 13

Discussion Outline

  • Causes of loss of reproducibility
    ▪ Well-known floating-point issues
    ▪ Non-determinism at exascale
  • Techniques for recovering reproducibility
    ▪ Enhanced summation algorithms
  • Empirical evaluation of summation algorithms' cost
  • Quantifying reproducible accuracy
    ▪ Identify key factors in variability of error accumulation
    ▪ Study response of summation algorithms to those factors
  • Lessons learned

SLIDE 14

Well-Known Problem

  • The modeling of finite-precision arithmetic maps an infinite set of real numbers onto a finite set of machine numbers

From: http://cs.smith.edu/dftwiki/index.php/CSC231 An Introduction to Fixed- and Floating-Point Numbers

SLIDE 15

Simple Example

a = 10^9, b = −10^9, c = 10^−9

Summation order 1: (a + b) + c = (10^9 − 10^9) + 10^−9 = 10^−9
Summation order 2: a + (b + c) = 10^9 + (−10^9 + 10^−9) = 0
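The example above behaves exactly this way in IEEE double precision, and is easy to check directly:

```python
a, b, c = 1e9, -1e9, 1e-9

order1 = (a + b) + c   # cancellation happens first, so the tiny term survives
order2 = a + (b + c)   # c is absorbed into b: |c| is far below half an ulp of 1e9

print(order1)  # 1e-09
print(order2)  # 0.0
```

The two orders give different machine results even though real-number addition is associative; this is the root cause of the reproducibility problem the talk addresses.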


SLIDE 17

Non-Determinism at Extreme Scale

Reduction tree shape. Causes include: dynamic task scheduling and fault recovery.

[Figure: two reduction trees over x1 ... x8 with different shapes, producing sums s1 and s2; both fall within the error bounds around the exact sum]

SLIDE 18

Non-Determinism at Extreme Scale

Arrangement of operands. Causes include: dynamic task scheduling and fault recovery.

[Figure: two reduction trees over x1 ... x8 with different operand orders, producing sums s1 and s2; both fall within the error bounds around the exact sum]

SLIDE 19

Non-Associativity + Non-Determinism

  • No control on the way N floating-point numbers are assigned to N threads
  • Different thread orders cause round-off errors to accumulate in different ways, leading to different summation results

[Figure: error magnitude vs. number of operands, for operands x0 ... x15]


SLIDE 23

Non-Associativity + Non-Determinism

[Figure: error magnitude vs. number of operands]

Increasing concurrency == widening interval of possible sums

SLIDE 24

Inadequacy of Conventional Wisdom

Worst-case error bound:

  • In practice, error bounds are overly pessimistic (i.e., usually N · ε << 1) and thus unreliable predictors
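The gap between the worst-case bound and observed error is easy to demonstrate. A minimal sketch, assuming recursive (left-to-right) summation of positive random summands, unit roundoff u = 2^−53 for doubles, and the classical first-order bound (n − 1) · u on the relative error when there is no cancellation:

```python
import math
import random

random.seed(0)

n = 100_000
u = 2.0 ** -53                       # unit roundoff for IEEE double precision
xs = [random.random() for _ in range(n)]

exact = math.fsum(xs)                # correctly rounded reference sum
naive = 0.0
for x in xs:                         # plain recursive summation
    naive += x

rel_err = abs(naive - exact) / abs(exact)
bound = (n - 1) * u                  # classical worst-case bound (condition number 1 here)

print(f"observed {rel_err:.2e} vs bound {bound:.2e}")
```

For random positive data the observed error typically sits orders of magnitude below the bound, which is why the bound alone is a poor predictor of actual reproducibility.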

SLIDE 25

Techniques for Recovering Reproducibility

  • Fixed reduction order
    ▪ Ensure that all floating-point operations are evaluated in the same order from run to run
  • Increased-precision numerical types
    ▪ Mixed precision, e.g., use higher-precision types for sensitive computations and standard types for less sensitive computations
  • Interval arithmetic
    ▪ Replace floating-point types with custom types representing finite-length intervals of real numbers
  • Enhanced summation algorithms
    ▪ Compensated summation, e.g., Kahan and composite precision
    ▪ Pre-rounded reproducible summation


SLIDE 28

Standard Summation: Definition

SLIDE 29

Kahan Summation: Definition

[Figure: algorithm; one variable holds the running error; the captured error is added to the operand on the next iteration]

Kahan, "Further Remarks on Reducing Truncation Errors" (1964)
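The scheme on this slide is the classic Kahan algorithm: capture the rounding error of each addition and feed it back into the next operand. A minimal textbook-form Python version (not necessarily the exact variant benchmarked later in the talk), with `math.fsum` as a correctly rounded reference:

```python
import math

def kahan_sum(values):
    """Compensated (Kahan) summation."""
    s = 0.0
    c = 0.0                 # running compensation: holds the lost low-order bits
    for x in values:
        y = x - c           # capture error & add to operand on the next iteration
        t = s + y           # low-order digits of y may be lost here...
        c = (t - s) - y     # ...algebraically zero; in floats, exactly the error just made
        s = t
    return s

vals = [0.1] * 1000
ref = math.fsum(vals)       # correctly rounded reference sum
print(abs(kahan_sum(vals) - ref), abs(sum(vals) - ref))
```

For benign data like this, the compensated result is at least as close to the correctly rounded sum as plain left-to-right summation.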

SLIDE 30

Composite Precision: Definition

[Figure: each quantity carries a value (or result) and an error approximation; the error is carried through each operation]

Taufer et al., "Improving Numerical Reproducibility and Stability in Large-Scale Numerical Simulations on GPUs" (2010)
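The idea of carrying an error term alongside each value can be illustrated with Knuth's TwoSum, a standard error-free transformation that underlies compensated and composite-precision schemes; the exact formulation in Taufer et al. (2010) may differ, so treat this as a related sketch rather than the paper's algorithm:

```python
def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and e the exact
    rounding error, so that a + b == s + e in exact arithmetic."""
    s = a + b
    bb = s - a                        # the part of b that made it into s
    e = (a - (s - bb)) + (b - bb)     # what was lost from a, plus what was lost from b
    return s, e

def composite_sum(values):
    """Carry the error through each operation: accumulate the exact per-step
    rounding errors and fold them back in at the end."""
    s = 0.0
    err = 0.0
    for x in values:
        s, e = two_sum(s, x)
        err += e                      # error approximation carried along
    return s + err

s, e = two_sum(1e16, 1.0)
print(s, e)  # 1e+16 1.0
```

Here the 1.0 lost by the rounded sum `fl(1e16 + 1.0) == 1e16` is recovered exactly in the error term.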

SLIDE 31

Pre-rounded Summation: Definition

Select an extractor M. To compute v1 + v2, first sum the extracted high parts q1 + q2, where

q1 = (v1 + M) − M
q2 = (v2 + M) − M
r1 = v1 − q1
r2 = v2 − q2

and recurse on the remainders until the error is below a threshold.

Demmel and Nguyen, "Parallel Reproducible Summation" (2014)
Arteaga, Hoefler et al., "Designing Bit-Reproducible Portable High-Performance Applications" (2014)
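The extraction step can be demonstrated directly. A sketch assuming a single extraction with a power-of-two M chosen much larger than every input (the real algorithms iterate on the remainders and derive M from the data and the number of summands):

```python
import itertools

def extract(v, M):
    """One pre-rounding step from the slide: q = (v + M) - M snaps v to a
    multiple of ulp(M); r = v - q is the exact remainder."""
    q = (v + M) - M
    r = v - q
    return q, r

M = 2.0 ** 40          # assumed power-of-two extractor, far larger than any |v|
vals = [3.14159, -2.71828, 1e-7, 12345.678, -0.001]

# The high parts q are all multiples of ulp(M) = 2**-12, so their partial sums
# incur no rounding at all -- and an exact sum is the same in every order.
qs = [extract(v, M)[0] for v in vals]

sums = set()
for perm in itertools.permutations(qs):
    total = 0.0
    for q in perm:
        total += q
    sums.add(total)
print(len(sums))  # 1
```

This order-independence of the extracted parts is what makes the pre-rounded result bit-reproducible regardless of the reduction tree.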

SLIDE 32

Techniques for Reproducible Summation

  • Fixed reduction order
    ▪ Ensure that all floating-point operations are evaluated in the same order from run to run
  • Increased-precision numerical types
    ▪ Mixed precision, e.g., use of doubles for sensitive computations and floats everywhere else
  • Interval arithmetic
    ▪ Replace floating-point types with custom types representing finite-length intervals of real numbers
  • Enhanced summation algorithms
    ▪ Compensated summation, e.g., Kahan and composite precision
    ▪ Pre-rounded reproducible summation

HOW COSTLY?

SLIDE 33

Empirical Study: Cost

  • Emulate simulation execution
    ▪ Run parallel sum of 1M doubles using MPI
    ▪ Perform partial sums independently
    ▪ Reduce by global sum with MPI_REDUCE
  • Summation algorithms tested
    ▪ Standard (ST)
    ▪ Kahan (K)
    ▪ Composite Precision (CP)
    ▪ Pre-rounded (PR)

SLIDE 34

Empirical Study: Cost

[Figure: cost comparison of the summation algorithms]

SLIDE 35

Error-free Transformations: Times

  • 2-fold pre-rounding versions and varying vector sizes

[Figure: timings showing slowdown factors of roughly ×7 and ×4 relative to the Intel MKL library]

Demmel and Nguyen, "Parallel Reproducible Summation" (2013); Intel MKL library

SLIDE 36

Techniques for Reproducible Summation

  • Fixed reduction order
    ▪ Ensure that all floating-point operations are evaluated in the same order from run to run
  • Increased-precision numerical types
    ▪ Mixed precision, e.g., use of doubles for sensitive computations and floats everywhere else
  • Interval arithmetic
    ▪ Replace floating-point types with custom types representing finite-length intervals of real numbers
  • Enhanced summation algorithms
    ▪ Compensated summation, e.g., Kahan and composite precision
    ▪ Pre-rounded reproducible summation

HOW REPRODUCIBLE? HOW ACCURATE?


SLIDE 39

Empirical Study: Reproducible Accuracy

  • Emulate sums expected in exascale simulations
    ▪ Shuffling summation order emulates a nondeterministic reduction tree
  • Measure sensitivity of summation algorithms to:
    ▪ Changes in summation order
    ▪ Mathematical properties of summands
  • Interpret width of result interval as sensitivity
  • Test summation algorithms: Standard (ST), Kahan (K), Composite Precision (CP), Pre-rounded (PR)
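The shuffling methodology can be reproduced in miniature. A sketch using a hypothetical, deliberately ill-conditioned set of summands (large cancellation), with the width of the result interval over all orders as the sensitivity measure:

```python
import itertools
import math

# Ill-conditioned summands: massive cancellation, so order matters.
vals = [1e16, 1e-5, 1e-5, -1e16]

sums = set()
for perm in itertools.permutations(vals):   # every possible summation order
    s = 0.0
    for v in perm:
        s += v
    sums.add(s)

interval = max(sums) - min(sums)            # width of result interval == sensitivity
print(sorted(sums), interval)

# A reproducible summation (math.fsum is correctly rounded, hence
# order-independent) collapses the interval to a single value:
ref = {math.fsum(perm) for perm in itertools.permutations(vals)}
print(len(ref))  # 1
```

Depending on whether the small terms are added before or after the large terms cancel, the standard sum lands anywhere in the interval; a reproducible algorithm returns one value for every order.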


SLIDE 41

Emulating Exascale Scenarios

[Figure: real numbers x0 ... x5 are mapped to floating-point numbers f(x0) ... f(x5); roundoff errors accumulate; non-determinism at exascale == shuffled summation order; the resulting sums {sj}, where sj is the sum with respect to the jth summation order, form an interval]

Width of interval ∝ irreproducibility


SLIDE 43

Characterizing Sets of Summands

Critical parameters:

  • Size: n
  • Condition number: k
  • Dynamic range: dr

|S_exact − S_j| / |S_exact| ≤ (n − 1) · u · (Σ_{i=1}^{n} |x_i|) / |Σ_{i=1}^{n} x_i|

SLIDE 44

Taxonomy of Values


SLIDE 46

Characterizing Sets of Summands

Critical parameters, and what each is a proxy for:

  • Size n: proxy for concurrency
  • Condition number k: proxy for subtractive cancellation
  • Dynamic range dr: proxy for alignment error

|S_exact − S_j| / |S_exact| ≤ (n − 1) · u · (Σ_{i=1}^{n} |x_i|) / |Σ_{i=1}^{n} x_i|
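The two data-dependent parameters are cheap to compute for a given set of summands. A sketch in which the condition number follows the bound above (k = Σ|x_i| / |Σ x_i|) and the dynamic range is taken, as one plausible definition, to be the log2 ratio of the largest to smallest nonzero magnitude; the paper's exact formulation of dr may differ:

```python
import math

def condition_number(xs):
    """k = sum|x_i| / |sum x_i| -- large when subtractive cancellation is severe."""
    return math.fsum(abs(x) for x in xs) / abs(math.fsum(xs))

def dynamic_range(xs):
    """dr = log2(max|x| / min|x|) over nonzero summands (assumed definition)."""
    mags = [abs(x) for x in xs if x != 0.0]
    return math.log2(max(mags) / min(mags))

xs = [1.0, 2.0, -2.5]
print(condition_number(xs))   # 5.5 / 0.5 = 11.0
print(dynamic_range(xs))
```

A runtime selector could evaluate these on (a sample of) the operands and pick the cheapest summation algorithm that meets the variability threshold.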

SLIDE 47

Empirical Study: Results

  • Varying the shape of the reduction tree
    ▪ Ill-conditioned, high-dynamic-range values
    ▪ Balanced vs. unbalanced reduction trees
  • Error variability within the parameter space
    ▪ n vs. k
    ▪ n vs. dr
    ▪ k vs. dr
  • Summation algorithm selection
    ▪ Given a variability threshold, which algorithm is needed?


SLIDE 49

Visualizing Degree of Reproducible Accuracy

[Figure: values {x1, x2, ..., xn} → sums of shuffled values Sσ = Σ xσ(i), where σ is a permutation of [n] → multiple sums {Sσ_1, ..., Sσ_100} from multiple permutations → errors {εσ_1, ..., εσ_100} with respect to a GNU MPFR reference result → error variability, plotted over condition number k and dynamic range dr; darker == more variability]

SLIDE 50

Condition Number vs. Dynamic Range

Parameter ranges: N = 10^6, k ∈ [1, 10^6], dr ∈ [0, 32]

[Figure: heat maps of cell variability (standard deviation, ×1e-13, scale 0 to 9) over condition number (k) and dynamic range (dr) for ST, K, and CP]

Compensated summation incrementally improves reproducible accuracy.

SLIDE 51

Empirical Studies of Reproducible Accuracy

  • Varying the shape of the reduction tree
    ▪ Ill-conditioned, high-dynamic-range values
    ▪ Balanced vs. unbalanced reduction trees
  • Error variability within the parameter space
    ▪ n vs. k
    ▪ n vs. dr
    ▪ k vs. dr
  • Summation algorithm selection
    ▪ Given a variability threshold, which algorithm is needed?

SLIDE 52

[Figure: as on slide 49, values {x1, x2, ..., xn} are shuffled and summed (Sσ = Σ xσ(i), σ a permutation of [n]), and errors are measured with respect to a GNU MPFR reference result, giving an error-variability map over k and dr; cell shade == which algorithm (ST, K, or CP) keeps variability below the threshold]

SLIDE 53

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 5e-13]

SLIDE 54

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 4.5e-13]

SLIDE 55

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 4e-13]

SLIDE 56

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 3.5e-13]

SLIDE 57

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 3e-13]

SLIDE 58

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 2.5e-13]

SLIDE 59

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 1.5e-13]

SLIDE 60

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 2.5e-14]

SLIDE 61

Selecting a Sufficient Algorithm

[Heat map over k and dr: high, medium, and low variability regions covered by ST, K, or CP; variability threshold = 5e-14]

SLIDE 62

Lessons Learned

  • We study an emulated scenario of global summation on exascale platforms
  • Increasingly costly summation algorithms are needed for reproducible accuracy in certain regions of the parameter space
    ▪ High concurrency, ill-conditioned, high dynamic range
  • Exascale applications need to maintain awareness of the mathematical properties of their summands
    ▪ Adjust the summation algorithm used to keep variability below a threshold

SLIDE 63

Future Directions

  • Can we achieve reproducible numerical accuracy by intelligent runtime selection of reduction algorithms?

SLIDE 64

Acknowledgments

Sponsors:

Contact: taufer@udel.edu, gcl.cis.udel.edu

Global Computing Lab @ UDel