Recovering Numerical Reproducibility in Hydrodynamics Simulations

SLIDE 1

23rd IEEE Symposium on Computer Arithmetic, July 10-13, 2016, Santa Clara, USA

Recovering Numerical Reproducibility in Hydrodynamics Simulations

Philippe Langlois, Rafife Nheili, Christophe Denis

DALI, Université de Perpignan Via Domitia; LIRMM, UMR 5506 CNRS, Université de Montpellier; CMLA, ENS Cachan, France

SLIDE 8

Recovering numerical reproducibility

Reproducibility: bitwise identical results for every p-parallel run, p ≥ 1
Reproducibility ≠ Accuracy
Failures reported in numerical simulation for energy [10], dynamic weather science [2], molecular dynamics [9], fluid dynamics [8]
How to debug? How to test? How to validate? How to obtain legal approval?

[Diagram: executions (sequential, 2 procs, 4 procs, p procs). Original code: non-reproducible. Reproducible code: reproducibility, accuracy.]

SLIDE 9

Hydrodynamics simulation

One industrial scale simulation code
Simulation of free-surface flows in 1D-2D-3D hydrodynamics
300 000 lines of open source Fortran 90 code
20 years, 4000 registered users, EDF R&D + international consortium

Telemac 2D [3]
2D hydrodynamics: Saint-Venant equations
Finite element method, triangular element mesh, sub-domain decomposition for parallel resolution
Mesh node unknowns: water depth (H) and velocity (U, V)

SLIDE 10

Telemac2D: the simplest gouttedo simulation

The gouttedo simulation test case
2D simulation of a water drop falling into a square basin
Unknown: water depth for a 0.2 s time step
Triangular mesh: 8978 elements and 4624 nodes

Expected numerical reproducibility (time step = 1, 2, ...)

[Figure: panels: Sequential, Parallel p = 2]

SLIDE 11

A white plot displays a non-reproducible value

Numerical reproducibility?

[Figure sequence, slides 11 to 25: one frame per time step, time step = 1 to 15; panels: Sequential, Parallel p = 2]

NO numerical reproducibility!

SLIDE 26

Telemac2D: gouttedo

NO numerical reproducibility!

[Figure: panels: Sequential, Parallel p = 2]

SLIDE 29

Today’s issues

Case study
Industrial scale software: openTelemac-Mascaret
Finite element simulation, domain decomposition, linear system solving
  • 2 modules: Tomawac, Telemac2D

Feasibility
How to recover reproducibility? Sources of non-reproducibility? Do existing techniques apply? How easily?
  • Compensation yields reproducibility here!

Efficiency
How much to pay for reproducibility?
  • ×1.2 to ×2.3 extra cost, which decreases as the problem size increases
  • OK to debug, to validate and even to simulate!

SLIDE 30

Outline

1 Motivation
2 Reproducibility failure in a finite element simulation
    Sequential and parallel FE assembly
    Sources of non reproducibility in Telemac2D
3 Recovering reproducibility
    Reproducible parallel FE assembly
    Reproducible algebraic operations
    Reproducible conjugate gradient
    Reproducible Telemac2D
4 Efficiency
5 Conclusion

SLIDE 33

Parallel reduction and compensation techniques

Non-associative floating-point addition
The computed value depends on the operation order
Parallel reduction of undefined order generates reproducibility failure

  Pairwise order:       a, b, c, d → a⊕b, c⊕d → (a⊕b)⊕(c⊕d)
  Left-to-right order:  a, b, c, d → a⊕b → (a⊕b)⊕c → ((a⊕b)⊕c)⊕d
  (a⊕b)⊕(c⊕d) ≠ ((a⊕b)⊕c)⊕d

Compensate rounding errors with error-free transformations

  Pairwise order:       rounding errors e1 from a⊕b, e2 from c⊕d, e3 from (a⊕b)⊕(c⊕d)
  Left-to-right order:  rounding errors f1 from a⊕b, f2 from (a⊕b)⊕c, f3 from ((a⊕b)⊕c)⊕d
  ((a⊕b)⊕(c⊕d)) ⊕ ((e1⊕e2)⊕e3) = (((a⊕b)⊕c)⊕d) ⊕ ((f1⊕f2)⊕f3)

Should be repeated for too ill-conditioned sums
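The two reduction orders of this slide can be replayed on four numbers. The following is a minimal Python sketch, assuming IEEE 754 double precision with round-to-nearest; `two_sum` is Knuth's TwoSum error-free transformation, and the function names and the test values are illustrative choices, not taken from the Telemac code.

```python
def two_sum(a, b):
    """Error-free transformation: s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e


def pairwise(a, b, c, d):
    """(a+b)+(c+d): returns the plain and the compensated result."""
    s1, e1 = two_sum(a, b)
    s2, e2 = two_sum(c, d)
    s3, e3 = two_sum(s1, s2)
    return s3, s3 + ((e1 + e2) + e3)


def left_to_right(a, b, c, d):
    """((a+b)+c)+d: returns the plain and the compensated result."""
    s1, f1 = two_sum(a, b)
    s2, f2 = two_sum(s1, c)
    s3, f3 = two_sum(s2, d)
    return s3, s3 + ((f1 + f2) + f3)


values = (1.0, 1e16, -1e16, 1.0)  # the exact sum is 2
plain_pw, comp_pw = pairwise(*values)
plain_lr, comp_lr = left_to_right(*values)
print(plain_pw, plain_lr)  # plain results depend on the order
print(comp_pw, comp_lr)    # compensated results agree on this input
```

On this input one compensation pass is enough; as the slide notes, a more ill-conditioned sum may need the compensation to be repeated.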

SLIDE 35

Finite element assembly: the sequential case

The assembly step: $V(i) = \sum_{\text{elements}} W_{el}(i)$

Compute the inner node values V(i) by accumulating the local $W_{el}$ for every element el that contains i

The assembly loop

  for p = 1,np                  // p: triangular local number (np = 3)
    for el = 1,nel
      i = IKLE(el,p)            // <-- loop index indirection
      V(i) = V(i) + W(el,p)     // i: domain global number
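The assembly loop can be tried on a toy mesh. This is a small Python sketch of the same two nested loops, with 0-based indices; the two-triangle mesh, the contribution values and the name `assemble` are made up for illustration.

```python
def assemble(ikle, w, n_nodes):
    """Accumulate the elementary contributions w[el][p] into the node vector v."""
    v = [0.0] * n_nodes
    n_local = len(ikle[0])           # np = 3 local nodes per triangle
    for p in range(n_local):
        for el in range(len(ikle)):
            i = ikle[el][p]          # loop index indirection: local to global
            v[i] = v[i] + w[el][p]
    return v


# Two triangles sharing the edge (1, 2); connectivity table IKLE(el, p).
ikle = [(0, 1, 2), (1, 3, 2)]
w = [(0.5, 0.25, 0.125), (1.0, 2.0, 4.0)]
v = assemble(ikle, w, n_nodes=4)
print(v)
```

The order in which contributions reach a shared node is fixed by the element numbering, which is why a different numbering or a domain split can change the rounded result.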

SLIDE 36

Finite element assembly: the parallel case

Interface point assembly: communications and reductions

$V(i) = \sum_{D_k} V_{D_k}(i)$ for sub-domains $D_k$, $k = 1 \dots p$

Sequential → parallel sub-domains: inner nodes → interface points

Interface point assembly
  Sequential: V(i) = a
  Parallel: V_D1(i) = b, V_D2(i) = c, V(i) = b + c = a

Exact arithmetic

SLIDE 37

Finite element assembly: the parallel case

Interface point assembly: communications and reductions

$V(i) = \sum_{D_k} V_{D_k}(i)$ for sub-domains $D_k$, $k = 1 \dots p$

Sequential → parallel sub-domains: inner nodes → interface points

Interface point assembly
  Sequential: V(i) = a
  Parallel: V_D1(i) = b, V_D2(i) = c, V(i) = b ⊕ c ≠ a

Floating point arithmetic
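The contrast between the exact and the floating-point case can be checked on three contributions to one interface point. This is an illustrative Python sketch: the values 0.1, 0.2, 0.3 and the split between D1 and D2 are arbitrary, and `fractions.Fraction` stands in for exact arithmetic.

```python
from fractions import Fraction

contributions = [0.1, 0.2, 0.3]  # illustrative element contributions to node i

# Sequential run: a single domain accumulates in element order.
a = (contributions[0] + contributions[1]) + contributions[2]

# Parallel run with p = 2: D1 holds the first contribution, D2 the other two.
b = contributions[0]
c = contributions[1] + contributions[2]
parallel = b + c

# The same two computations in exact rational arithmetic.
exact = [Fraction(x) for x in contributions]
a_exact = (exact[0] + exact[1]) + exact[2]
parallel_exact = exact[0] + (exact[1] + exact[2])

print(a == parallel)              # floating point: the two runs differ
print(a_exact == parallel_exact)  # exact arithmetic: the two runs agree
```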

SLIDE 40

Sources of non reproducibility in Telemac2D

Culprits: theory
  1 Building step: interface point assembly
  2 Solving step (conjugate gradient): parallel matrix-vector and dot products

Culprits: practice = optimization
Interface point assembly and linear system solving are merged
Element-by-element (EBE) storage of the FE matrix
  • EBE parallel matrix-vector product: no BLAS. Everything is vector, no matrix!
Wave equation, “mass-lumping” and associated algebraic transformations
  • System decoupling and many diagonal matrices. Everything is vector, no matrix!

SLIDE 42

Sources of non reproducibility in Telemac2D

The Telemac2D FE steps

Mesh (elements, nodes) → Discretisation → FE assembly + algebraic computation

System equation AX = C

$$\begin{pmatrix} A_{hh} & A_{hu} & A_{hv} \\ A_{uh} & A_{uu} & \\ A_{vh} & & A_{vv} \end{pmatrix} \begin{pmatrix} H \\ U \\ V \end{pmatrix} = \begin{pmatrix} B_h \\ B_u \\ B_v \end{pmatrix}$$

Wave equation

$$\begin{pmatrix} A_1 & & \\ & A_2 & \\ & & A_3 \end{pmatrix} \begin{pmatrix} H \\ U \\ V \end{pmatrix} = \begin{pmatrix} C_1 \\ C_2 \\ C_3 \end{pmatrix}$$

$A_2$, $A_3$: diagonal matrices
$A_1 = A_{hh} - A_{hu} A_2^{-1} A_{uh} - A_{hv} A_3^{-1} A_{vh}$
$C_1 = B_h - A_{hu} A_2^{-1} B_u - A_{hv} A_3^{-1} B_v$
Interface point assembly: $A_2$, $A_3$, $C_1$

Conjugate gradient → Solution H
Interface point assembly: $A_1 d$ in each iteration

Diagonal resolution → Solution U, V
$C_2 = B_u - A_{uh} H$, $C_3 = B_v - A_{vh} H$
Interface point assembly: $C_2$, $C_3$

Objective
Correct sources of non-reproducibility to compute reproducible system and solutions

SLIDE 44

Recovering reproducibility in Telemac2D

Sources
FE assembly: diagonal of the matrices and second members
Resolution: EBE matrix-vector and dot products
Wave equation: algebraic transformations and diagonal resolutions

Reproducible resolution: principles
Vector V → [V, EV] → V + EV
Compute EV in the FE assembly of V
Propagate EV over each operation on V
Compensate all nodes while assembling the interface points
Compensate the MPI parallel dot products that include an MPI reduction

SLIDE 45

Reproducible FE assembly

i: any node; $W_{el}(i)$: contribution for every el that contains i

Original FE assembly: $V(i) = \sum_{\text{elements}} W_{el}(i)$

$V(i) = W_{el_1}(i) + W_{el_2}(i) + \cdots + W_{el_{n_i}}(i)$

Modified FE assembly: $[V(i), E_V(i)] = \mathrm{ReprodAss}_{\text{elements}}\, W_{el}(i)$

$V(i) = W_{el_1}(i) + W_{el_2}(i) + \cdots + W_{el_{n_i}}(i)$, where the successive additions generate the rounding errors $e_1, e_2, \dots, e_{n_i - 1}$
$E_V(i) = e_1 + e_2 + \cdots + e_{n_i - 1}$
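A possible reading of the modified assembly, sketched in Python: each accumulation goes through TwoSum and the generated errors are summed into EV(i). The name `reprod_assemble`, the fan-shaped mesh and the values are assumptions made for this example; the actual ReprodAss routine of the modified Telemac code may differ.

```python
def two_sum(a, b):
    """Error-free transformation: s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e


def reprod_assemble(ikle, w, n_nodes):
    """FE assembly returning the pair [V, EV] instead of V alone."""
    v = [0.0] * n_nodes
    ev = [0.0] * n_nodes
    for p in range(len(ikle[0])):
        for el in range(len(ikle)):
            i = ikle[el][p]
            v[i], e = two_sum(v[i], w[el][p])
            ev[i] += e               # accumulate the rounding errors of node i
    return v, ev


# Node 0 is shared by four triangles; only its contributions are non-zero.
ikle = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
w = [(1.0, 0.0, 0.0), (1e16, 0.0, 0.0), (-1e16, 0.0, 0.0), (1.0, 0.0, 0.0)]
v1, ev1 = reprod_assemble(ikle, w, 5)

order = [1, 0, 3, 2]                 # same mesh, elements numbered differently
v2, ev2 = reprod_assemble([ikle[k] for k in order], [w[k] for k in order], 5)

print(v1[0], v2[0])                        # plain values depend on the numbering
print(v1[0] + ev1[0], v2[0] + ev2[0])      # compensated values agree here
```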

SLIDE 46

Reproducible interface point assembly

i: one interface point of $D_1, D_2, \dots, D_{k-1}, D_k$

Original IP assembly: $V(i) = \sum_{D_k} V_{D_k}(i)$

Communicate $V_{D_i}$ to $D_1, D_2, \dots$ and compute:
$V(i) = V_{D_1}(i) + V_{D_2}(i) + \cdots + V_{D_{k-1}}(i) + V_{D_k}(i)$

Modified IP assembly: $[V(i), E_V(i)] = \mathrm{ReprodAss}_{D_k} [V_{D_k}(i), E_{V_{D_k}}(i)]$

Communicate $[V_{D_i}, E_{V_{D_i}}]$ to $D_1, D_2, \dots$ and compute:
$V(i) = V_{D_1}(i) + V_{D_2}(i) + \cdots + V_{D_{k-1}}(i) + V_{D_k}(i)$, where the successive additions generate the rounding errors $\delta_1, \delta_2, \dots, \delta_{k-1}$
$E_V(i) = (E_{V_{D_1}}(i) + E_{V_{D_2}}(i) + \delta_1) + \cdots + (E_{V_{D_{k-1}}}(i) + E_{V_{D_k}}(i) + \delta_{k-1})$

Reproducibility: at the IP assembly step
[V, EV] → V + EV compensates every node value
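A hedged Python sketch of the modified IP assembly for one interface point: each sub-domain contributes a pair (V, EV), the V parts are added with TwoSum, and the new error δ joins the accumulated errors. The helpers `reprod_sum` and `ip_assemble` and the values are hypothetical; communication is replaced by plain function calls.

```python
def two_sum(a, b):
    """Error-free transformation: s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e


def reprod_sum(values):
    """Local assembly in one (sub-)domain: returns the pair (V, EV)."""
    v, ev = 0.0, 0.0
    for x in values:
        v, e = two_sum(v, x)
        ev += e
    return v, ev


def ip_assemble(pairs):
    """Combine the (V_Dk, EV_Dk) pairs received for one interface point."""
    v, ev = pairs[0]
    for vk, evk in pairs[1:]:
        v, delta = two_sum(v, vk)
        ev = (ev + evk) + delta
    return v, ev


big = 2.0 ** 53
contributions = [big, 1.0, 1.0, -big]        # the exact sum is 2

v_seq, ev_seq = reprod_sum(contributions)    # sequential run, one domain
d1 = reprod_sum(contributions[:2])           # parallel run: sub-domain D1
d2 = reprod_sum(contributions[2:])           # parallel run: sub-domain D2
v_par, ev_par = ip_assemble([d1, d2])

print(v_seq, v_par)                          # plain values differ
print(v_seq + ev_seq, v_par + ev_par)        # compensated values agree here
```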

SLIDE 48

Algebraic operation: the vector case

Reproducible algebraic vector operations
  • openTelemac’s included library: BIEF

Entry-wise vector operations: copy, opposite/inverse, add/sub, Hadamard product, ...
Also applies to diagonal matrices
Propagate rounding errors to compensate while assembling the interface points

Example: Hadamard product

Original version
X, Y → V = X ∘ Y, V(i) = X(i) · Y(i)

Modified version
[X, EX], [Y, EY] → [V, EV] with (V, eV) = 2Prod(X, Y) and EV = X ∘ EY + Y ∘ EX + eV
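The modified Hadamard product can be sketched in Python as follows. `two_prod` here is Dekker's TwoProduct built on Veltkamp's splitting, one common way to obtain 2Prod without a fused multiply-add; whether the modified BIEF routines use this variant is an assumption, and the function names are illustrative.

```python
from fractions import Fraction


def split(a):
    """Veltkamp splitting of a double into two non-overlapping halves."""
    c = 134217729.0 * a              # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo


def two_prod(a, b):
    """Error-free transformation: p = fl(a * b) and a * b = p + e exactly
    (barring overflow and underflow)."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, e


def hadamard(x, ex, y, ey):
    """[X, EX], [Y, EY] -> [V, EV] with EV = X∘EY + Y∘EX + eV."""
    v, ev = [], []
    for xi, exi, yi, eyi in zip(x, ex, y, ey):
        p, e = two_prod(xi, yi)
        v.append(p)
        ev.append(xi * eyi + yi * exi + e)
    return v, ev


v, ev = hadamard([0.1], [0.0], [0.3], [0.0])
# With zero input errors, V + EV is the exact product of the two doubles.
print(Fraction(v[0]) + Fraction(ev[0]) == Fraction(0.1) * Fraction(0.3))
```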

SLIDE 49

What is reproducible now?

Most of the linear system: FE assembly, algebraic vector operations, interface point assembly
Except: the matrix of the H system, and its dependencies: the second members of the U and V systems
Next step: conjugate gradient

Partially reproducible Telemac2D

Wave equation: $A_2$, $A_3$: diagonal matrices,
$A_1 = A_{hh} - A_{hu} A_2^{-1} A_{uh} - A_{hv} A_3^{-1} A_{vh}$,
$C_1 = B_h - A_{hu} A_2^{-1} B_u - A_{hv} A_3^{-1} B_v$
Interface point assembly: $A_2$, $A_3$, $C_1$

Conjugate gradient: interface point assembly of $A_1 d$ in each iteration → Solution H

Diagonal resolutions: $C_2 = B_u - A_{uh} H$, $C_3 = B_v - A_{vh} H$
Interface point assembly: $C_2$, $C_3$ → Solution U, V

SLIDE 52

Towards a reproducible conjugate gradient

Initialization: a given $d_0$
$r_0 = A X_0 - B$; $\rho_0 = \dfrac{(r_0, d_0)}{(A d_0, d_0)}$; $X_1 = X_0 - \rho_0 d_0$

Iterations until stopping criteria:
$r_m = r_{m-1} - \rho_{m-1} A d_{m-1}$
$d_m = r_m + \dfrac{(r_m, r_m)}{(r_{m-1}, r_{m-1})} d_{m-1}$
$\rho_m = \dfrac{(r_m, d_m)}{(d_m, A d_m)}$
$X_{m+1} = X_m - \rho_m d_m$

with $A = [A_1, E_{A_1}]$, $B = C_1$, $X = H$

Non-reproducibility sources
EBE matrix-vector product
Dot product
  • MPI reduction
  • weighted dot products for IP shared by p sub-domains
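For reference, the recurrences of this slide written as a plain (non-compensated) Python sketch on dense lists. The choice d0 = r0, the stopping test on (r, r), the function names and the 2×2 test system are assumptions made for the example; Telemac works with EBE storage instead of a dense matrix.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(a, x):
    return [dot(row, x) for row in a]


def conjugate_gradient(a, b, x0, tol=1e-12, max_iter=100):
    """Solve A X = B for a symmetric positive definite A."""
    x = list(x0)
    r = [ax - bi for ax, bi in zip(matvec(a, x), b)]   # r0 = A X0 - B
    d = list(r)                                        # assumed choice d0 = r0
    for _ in range(max_iter):
        rr = dot(r, r)
        if rr <= tol * tol:                            # assumed stopping test
            break
        ad = matvec(a, d)
        rho = dot(r, d) / dot(d, ad)
        x = [xi - rho * di for xi, di in zip(x, d)]        # X <- X - rho d
        r = [ri - rho * adi for ri, adi in zip(r, ad)]     # r <- r - rho A d
        beta = dot(r, r) / rr
        d = [ri + beta * di for ri, di in zip(r, d)]       # d <- r + beta d
    return x


a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(a, b, x0=[2.0, 1.0])
print(x)  # close to [1/11, 7/11]
```

Every `dot` and `matvec` call above is a sum whose order changes with the domain decomposition, which is where the reproducible versions of the following slides come in.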

SLIDE 53

The EBE storage

$M = D + \sum_{el=1}^{nel} X_{el}$

nodes $i \in [1, np]$, elements $el \in [1, nel]$, element vertices $j, k, l \in el$

M is decomposed as:
  1 assembled diagonal $D[np]$: $D = [D(1), \dots, D(np)]$
  2 elementary extra-diagonal $X_{el}[6]$:

$$X_{el} = \begin{pmatrix} \cdot & X_{jk}(el) & X_{jl}(el) \\ X_{kj}(el) & \cdot & X_{kl}(el) \\ X_{lj}(el) & X_{lk}(el) & \cdot \end{pmatrix} = [X_{el}(1), \dots, X_{el}(6)]$$

SLIDE 54

The EBE Matrix-Vector product

$R = M \cdot V = D \cdot V + \sum_{el=1}^{nel} X_{el} \cdot V_{el}$

Steps of the EBE matrix-vector product
  1 $R_1(i) = D(i) \cdot V(i)$, $i \in [1, np]$
  2 $X_{el} \cdot V_{el} = [X_{el}(1) \cdot V(k), X_{el}(2) \cdot V(l), \dots, X_{el}(6) \cdot V(k)]$, $el \in [1, nel]$
  3 FE assembly → $R_2[np]$: $R_2 = \sum_{el=1}^{nel} X_{el} \cdot V_{el}$
  4 $R = R_1 + R_2$
  5 IP assembly: $R(i) = \sum_{D_k} R_{D_k}(i)$ for all IP i
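Steps 1 to 4 can be sketched in Python for a single domain (step 5, the IP assembly, needs communication and is left out). The row-wise ordering of the six extra-diagonal terms as (Xjk, Xjl, Xkj, Xkl, Xlj, Xlk), the name `ebe_matvec` and the toy data are assumptions for illustration.

```python
def ebe_matvec(d, x, ikle, v):
    """R = D·V + sum over elements of Xel·Vel, with element-by-element storage."""
    n = len(d)
    r1 = [d[i] * v[i] for i in range(n)]             # step 1: diagonal part
    r2 = [0.0] * n
    for (j, k, l), xe in zip(ikle, x):
        xjk, xjl, xkj, xkl, xlj, xlk = xe            # assumed ordering of Xel(1..6)
        # steps 2 and 3: elementary products assembled into R2
        r2[j] += xjk * v[k]
        r2[j] += xjl * v[l]
        r2[k] += xkj * v[j]
        r2[k] += xkl * v[l]
        r2[l] += xlj * v[j]
        r2[l] += xlk * v[k]
    return [a + b for a, b in zip(r1, r2)]           # step 4: R = R1 + R2


ikle = [(0, 1, 2), (1, 3, 2)]                        # two triangles, four nodes
d = [2.0, 3.0, 4.0, 5.0]
x = [(1.0, 2.0, 3.0, 4.0, 5.0, 6.0), (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)]
v = [1.0, 2.0, 3.0, 4.0]
r = ebe_matvec(d, x, ikle, v)
print(r)
```

No matrix is ever formed: the product is a diagonal scaling followed by the same kind of assembly loop as for vectors, so the compensated assembly applies to it unchanged.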

SLIDE 55

Reproducible EBE matrix-vector product

Original EBE matrix-vector product
$R = D \cdot V + \sum_{el=1}^{nel} X_{el} \cdot V_{el}$
$R(i) = \sum_{D_k} R_{D_k}(i)$

Reproducible EBE matrix-vector product
$[R, E_R] = [D, E_D] \circ V + \mathrm{ReprodAss}_{el=1}^{nel}\, X_{el} \cdot V_{el}$
$R(i) = \mathrm{ReprodAss}_{D_k} [R_{D_k}(i), E_{R_{D_k}}(i)]$
Compensation: $R + E_R$

SLIDE 57

A reproducible conjugate gradient

Initialization: a given $d_0$
$r_0 = A X_0 - B$; $\rho_0 = \dfrac{(r_0, d_0)}{(A d_0, d_0)}$; $X_1 = X_0 - \rho_0 \cdot d_0$

Iterations until stopping criteria:
$r_m = r_{m-1} - \rho_{m-1} \cdot A d_{m-1}$
$d_m = r_m + \dfrac{(r_m, r_m)}{(r_{m-1}, r_{m-1})} \cdot d_{m-1}$
$\rho_m = \dfrac{(r_m, d_m)}{(d_m, A d_m)}$
$X_{m+1} = X_m - \rho_m \cdot d_m$

with $A = [A_1, E_{A_1}]$, $B = C_1$, $X = H$

Non-reproducibility: sources and solutions
Reproducible EBE matrix-vector product
Dot product
  • MPI reduction: a parallel compensated dot2
  • weights: (1/k, 1/k, ..., 1/k) → (1, 0, ..., 0)

Reproducible operations → reproducible results
Same errors for both sequential and parallel executions
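A sequential Python sketch of a compensated dot product in the style of Dot2 (Ogita, Rump and Oishi), built from TwoSum and Dekker's TwoProduct. The slide refers to a parallel version combined with an MPI reduction; that part is not shown, and the test vectors are illustrative.

```python
def two_sum(a, b):
    """s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e


def split(a):
    """Veltkamp splitting of a double into two halves."""
    c = 134217729.0 * a              # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a, b):
    """p = fl(a * b) and a * b = p + e exactly (no overflow or underflow)."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, e


def dot2(x, y):
    """Compensated dot product: behaves as if computed in twice the precision."""
    p, s = two_prod(x[0], y[0])
    for xi, yi in zip(x[1:], y[1:]):
        h, r = two_prod(xi, yi)
        p, q = two_sum(p, h)
        s += q + r                   # accumulate product and summation errors
    return p + s


big = 2.0 ** 53
ones = [1.0, 1.0, 1.0]
x1 = [big, 1.0, -big]                # exact dot product with ones is 1
x2 = [big, -big, 1.0]                # same terms, different order

plain1 = sum(a * b for a, b in zip(x1, ones))
plain2 = sum(a * b for a, b in zip(x2, ones))
print(plain1, plain2)                    # plain results depend on the order
print(dot2(x1, ones), dot2(x2, ones))    # compensated results agree here
```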

SLIDE 59

Reproducible Telemac2D

Wave equation: $A_2$, $A_3$: diagonal matrices,
$A_1 = A_{hh} - A_{hu} A_2^{-1} A_{uh} - A_{hv} A_3^{-1} A_{vh}$,
$C_1 = B_h - A_{hu} A_2^{-1} B_u - A_{hv} A_3^{-1} B_v$
Interface point assembly: $A_2$, $A_3$, $C_1$

Conjugate gradient: interface point assembly of $A_1 d$ in each iteration → Solution H

Diagonal resolutions: $C_2 = B_u - A_{uh} H$, $C_3 = B_v - A_{vh} H$
Interface point assembly: $C_2$, $C_3$ → Solution U, V

[Diagram: executions (sequential, 2 procs, 4 procs, p procs). Original code: non-reproducible. Reproducible code: reproducibility (rep), accuracy (acc).]

[Figure: Maximum relative error, gouttedo test case. x-axis: time step (5 to 20); y-axis: maximum relative error vs. sequential computation (10^-18 to 10^-6); legend entries: COMP P=2, COMP P=4, COMP P=8]

SLIDE 60

Reproducible gouttedo!

[Figure sequence, slides 60 to 75: one frame per time step, time step = 1 to 15; panels: p=1, p=2, p=4, p=8]

SLIDE 77

Runtime extra-cost for reproducible simulations

Measures, test cases and mesh sizes
Hardware cycle counter: rdtsc
gouttedo mesh sizes: 4624, 18225, 72361 nodes (×1, ≈ ×4, ≈ ×16)

Number of IP

  #proc. \ #nodes    4624    18225    72361
  2                    72      143      280
  4                   304      674     1368
  8                   501     1152     2020

Hardware and software environment
openTelemac v7.2
Socket: Intel Xeon E5-2660 2.20 GHz (L3 cache = 20 MB), 2 sockets of 8 cores each
GNU Fortran 4.6.3, -O3
OpenMPI 1.5.4
Linux 3.5.0-54-generic

SLIDE 78

The core runtime extra-cost for reproducible gouttedo

gouttedo core: no input/output steps, just building + solving

[Figure: Telemac v7, gouttedo. x-axis: # processors (2, 4, 8); y-axis: #cycles (10^8 to 10^12). Series: Original and Reproducible, each for #nodes = 4624, 18225, 72361. Extra-cost ratios annotated on the plot, in extraction order: ×1.64, ×1.83, ×2.21, ×2.34, ×1.31, ×1.44, ×1.59, ×1.88, ×1.16, ×1.23, ×1.43, ×1.71]

SLIDE 81

Conclusion

Recovering numerical reproducibility
Industrial scale software: openTelemac-Mascaret
Finite element simulation, domain decomposition, linear system solving
  • 2 reproducible modules: Tomawac, Telemac2D
  • Integration in the next openTelemac version: current work

Feasibility
How to recover reproducibility? Sources of non-reproducibility? Do existing techniques apply? How easily?
  • Hand-made analysis of the computing workflow
  • Compensation yields reproducibility here!
  • Fits well with openTelemac’s vector library
  • Other existing techniques also apply, more or less easily [4]

SLIDE 83

Conclusion

Efficiency
How much to pay for reproducibility?
  • ×1.2 to ×2.3 extra cost, which decreases as the problem size increases
  • OK to debug, to validate and even to simulate!

Reproducibility at a larger scale: the whole openTelemac software suite
Does it still work for complex, large and real-life simulations?
  • The two FE test cases are significant enough to validate the methodology
  • Localization of the failure sources is difficult to automate
  • but applying the methodology is easy for the software developers

SLIDE 84

Acknowledgment

Jean-Michel Hervouet, LNE, EDF R&D, Chatou
Chemseddine Chohra, DALI/LIRMM, UPVD, Perpignan

SLIDE 85

References I

[1] J. W. Demmel and H. D. Nguyen. Fast reproducible floating-point summation. In Proc. 21st IEEE Symposium on Computer Arithmetic, Austin, Texas, USA, 2013.

[2] Y. He and C. Ding. Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications. J. Supercomput., 18:259–277, 2001.

[3] J.-M. Hervouet. Hydrodynamics of free surface flows: Modelling with the finite element method. John Wiley & Sons, 2007.

[4] P. Langlois, R. Nheili, and C. Denis. Numerical Reproducibility: Feasibility Issues. In NTMS’2015: 7th IFIP International Conference on New Technologies, Mobility and Security, pages 1–5, Paris, France, July 2015. IEEE, IEEE COMSOC & IFIP TC6.5 WG.

[5] P. Langlois, R. Nheili, and C. Denis. Recovering numerical reproducibility in hydrodynamic simulations. In J. Hormigo, S. Oberman, and N. Revol, editors, 23rd IEEE International Symposium on Computer Arithmetic, pages 1–10. IEEE Computer Society, July 2016. (Silicon Valley, USA. July 10-13 2016).

SLIDE 86

References II

[6] T. Ogita, S. M. Rump, and S. Oishi. Accurate sum and dot product. SIAM J. Sci. Comput., 26(6):1955–1988, 2005.

[7] Open TELEMAC-MASCARET. v.7.0, Release notes. www.opentelemac.org, 2014.

[8] R. W. Robey, J. M. Robey, and R. Aulwes. In search of numerical consistency in parallel programming. Parallel Comput., 37(4-5):217–229, 2011.

[9] M. Taufer, O. Padron, P. Saponaro, and S. Patel. Improving numerical reproducibility and stability in large-scale numerical simulations on GPUs. In IPDPS, pages 1–9. IEEE, 2010.

[10] O. Villa, D. G. Chavarría-Miranda, V. Gurumoorthi, A. Márquez, and S. Krishnamoorthy. Effects of floating-point non-associativity on numerical computations on massively multithreaded systems. In CUG 2009 Proceedings, pages 1–11, 2009.