Fault Tolerant Linear Algebra: Goals and Methods. Julien Langou, University of Colorado Denver. (PowerPoint presentation transcript.)


SLIDE 1

Fault Tolerant Linear Algebra: Goals and Methods.

Julien Langou, University of Colorado Denver


SLIDE 2

0 - GOALS
  0.1 - ERASURE OR ERROR?
1 - METHODS
  1.1 - ERASURE: DISKLESS CHECKPOINTING AND ROLLBACK
  1.2 - ERASURE & ERROR: ABFT: ALGORITHM-BASED FAULT TOLERANCE
  1.3 - OTHERS
2 - ERROR: DETECTING/LOCATING/CORRECTING IN FLOATING-POINT ARITHMETIC
3 - NOVEL ABFT ALGORITHMS (GEMM, LU, QR, ETC.) (ERASURE OR ERROR)
4 - ABFT-BLAS LIBRARY
5 - ABFT-BLAS EXPERIMENTS (ERASURE)

Fault-tolerant Linear Algebra: Goals and Methods.


SLIDE 3

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 4

Goals

Perform reliable and efficient computation with unreliable units.

• Unreliable units: process crash, hardware failure, erroneous communication, erroneous computation, …
• Our method: at the algorithm level.
• Motivation: cost effective, large unit count.

SLIDE 5

Erasure Problem / Error Problem: 4 processors available; P1, P2, P3, P4 compute 1+1, 2+2, 3+3, 4+4, yielding 2, 4, 6, 8.


SLIDE 6

4 processors available (P1..P4 compute 1+1, 2+2, 3+3, 4+4).
Erasure Problem: lost processor 2.
Error Problem: processor 2 returns an incorrect result (5 instead of 4).


SLIDE 7

Erasure Problem (lost processor 2):
- we know whether there is an erasure or not.
Error Problem (processor 2 returns an incorrect result, 5 instead of 4):
- we do not know if there is an error.


SLIDE 8

Erasure Problem (lost processor 2):
- we know whether there is an erasure or not,
- we know where the erasure is.
Error Problem (processor 2 returns an incorrect result):
- we do not know if there is an error,
- even assuming we know that an error occurred, we do not know where it is.


SLIDE 9

Erasure Problem (lost processor 2):
- we know whether there is an erasure or not,
- we know where the erasure is,
- so we only need to recover.
Error Problem (processor 2 returns an incorrect result):
- we do not know if there is an error,
- even assuming we know that an error occurred, we do not know where it is,
- and we also need to recover.


SLIDE 10

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 11

Diskless checkpointing: 4 processors available (P1, P2, P3, P4).


SLIDE 12

Diskless checkpointing:
1. 4 processors available (P1..P4).
2. Add a 5th one (Pc) and perform a checksum (MPI_Reduce).
3. Ready for the computations. …


SLIDE 13

Diskless checkpointing:
1. 4 processors available (P1..P4).
2. Add a 5th one (Pc) and perform a checksum (MPI_Reduce).
3. Ready for the computations. …
4. Lost a processor.


SLIDE 14

Diskless checkpointing:
1. 4 processors available (P1..P4).
2. Add a 5th one (Pc) and perform a checksum (MPI_Reduce).
3. Ready for the computations. …
4. Lost a processor (P2).
5. Recover the processor (FT-MPI).
6. Recover the data (MPI_Reduce: the lost data is the checksum minus the surviving data, hence the '−' in the diagram).
7. Ready for the computations.


SLIDE 15

Diskless checkpointing (remarks)

• You can use either floating-point arithmetic or binary arithmetic for the checksum.
• Multiple failures/errors are supported through the Reed-Solomon algorithm, an optimal algorithm in the sense that, to support p simultaneous failures/errors, one only needs to add p processes.
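The checkpoint/recovery cycle of Slides 12-14 can be sketched in a few lines. This is a minimal single-process sketch of the arithmetic only (no MPI; the dictionary and all names are illustrative), with a plain sum playing the role of the checksum processor Pc:

```python
# Minimal sketch of diskless checkpointing with a floating-point checksum.
# Each "processor" holds a list of numbers; Pc holds their elementwise sum.

data = {
    "P1": [1.0, 5.0],
    "P2": [2.0, 6.0],
    "P3": [3.0, 7.0],
    "P4": [4.0, 8.0],
}

# Checkpoint: the checksum processor stores the elementwise sum
# (this is what the MPI_Reduce in the slides computes).
Pc = [sum(vals) for vals in zip(*data.values())]

# Erasure: processor P2 is lost.
lost = data.pop("P2")

# Recovery: lost data = checksum minus the surviving data
# (a second reduction in the real, distributed setting).
recovered = [c - sum(vals) for c, vals in zip(Pc, zip(*data.values()))]
assert recovered == [2.0, 6.0]
data["P2"] = recovered
```

Because the erasure is located (we know it was P2), a single subtraction restores the data; no rollback of the computation itself is needed.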


SLIDE 16

Time for an MPI_Reduce (using MVAPICH) on InfiniBand on jacquard.nersc.gov.
(Plot: time (sec), 0 to 2.5, versus number of processors, 64 to 256, for message sizes 7.6 MB, 30.5 MB, 68.7 MB, and 122.0 MB.)


SLIDE 17

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 18

ABFT = Algorithm-Based Fault Tolerance.

• K. Huang and J. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Trans. on Computers (Special Issue on Reliable & Fault-Tolerant Computing), C-33, 1984, pp. 518-528.
• If checkpoints are performed in floating-point arithmetic, then we can exploit the linearity of the mathematical relations on the object to maintain the checksums.


SLIDE 19

ABFT concept in an example: we want to perform z = λx + μy.
(Diagram: Procs 1-4 hold x1..x4 and y1..y4 and will compute zi = λxi + μyi.)


SLIDE 20

We want to perform z = λx + μy.
(Diagram: the entries of x and of y held on Procs 1-4 are summed into checksums.)


SLIDE 21

We want to perform z = λx + μy.
(Diagram: a checksum processor, Proc c, now holds Xc = checkX and Yc = checkY.)


SLIDE 22

We want to perform z = λx + μy.
(Diagram: data x1..x4, y1..y4 on Procs 1-4; checksums Xc, Yc on Proc c.)


SLIDE 23

We want to perform z = λx + μy.
(Diagram: Procs 1-4 each compute zi = λxi + μyi.)


SLIDE 24

We want to perform z = λx + μy.
(Diagram: Proc c applies the same update to the checksums, Zc = λXc + μYc, yielding checkZ.)


SLIDE 25

We want to perform z = λx + μy.
(Diagram: data and checksums all updated by the same AXPY.)

No overhead to compute the checksum of Z.
Property used: (λx1+μy1) + (λx2+μy2) = λ(x1+x2) + μ(y1+y2); distributivity of external multiplication over internal addition, and associativity of internal addition.
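The "no overhead" property is exactly this linearity. A small sketch (plain Python, illustrative values) checks that the checksum processor's zc = λ·xc + μ·yc coincides with the checksum of the freshly computed z:

```python
# ABFT on z = lambda*x + mu*y: the checksum entry obeys the same AXPY
# update as the data entries, so Proc c stays consistent without ever
# touching the data held on Procs 1-4.
lam, mu = 2.0, -3.0
x = [1.0, 2.0, 3.0, 4.0]      # held by Procs 1..4
y = [5.0, 6.0, 7.0, 8.0]

xc, yc = sum(x), sum(y)        # checkX, checkY held by Proc c

z = [lam * xi + mu * yi for xi, yi in zip(x, y)]   # Procs 1..4
zc = lam * xc + mu * yc                            # Proc c: same AXPY

# Property used: sum_i (lam*xi + mu*yi) = lam*sum(x) + mu*sum(y).
assert abs(zc - sum(z)) < 1e-12
```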


SLIDE 26

ABFT summary.

• Relies on floating-point arithmetic checksums.
• Exploits the checksum processors.
• Algorithms exist for any linear operations:
  - AXPY, SCAL (BLAS1)
  - GEMV (BLAS2)
  - GEMM (BLAS3)
  - LU, QR, Cholesky (LAPACK)
  - FFT


SLIDE 27

Our contribution (1)

• Lack of generalization in the previous approach, which has restricted the number of algorithms used: Cholesky, QR through Gram-Schmidt, LU without pivoting.
• With an ABFT BLAS and the LAPACK algorithms, we have developed:
  - QR with Householder reflections,
  - LU with pivoting,
  - Hessenberg reduction.



SLIDE 28

Our contribution (2)

• If there is no error, then ABFT guarantees that the checksums of L and U are consistent at the end of the LU factorization.
• If there is an error, you can then detect it.
• However, you cannot correct it in the case where the error propagates.
• (Then why even use ABFT?)
• We have a light-weight mechanism ( O(n^2) ) to detect errors before they propagate.


SLIDE 29

Our contribution (3)

• Error correcting codes are known to be unstable in floating-point arithmetic.
• We have developed a stable error correcting code (although naïve and not optimal, it works for us and is efficient enough).


SLIDE 30

Our contributions

1. Generalize ABFT to ``all'' LAPACK algorithms.
2. Avoid error propagation.
3. A stable error correcting code in floating-point arithmetic.

=> Maybe ABFT might work after all!


SLIDE 31

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 32

Error DETECTION: residual checking

• To detect an error in

    C ← αA*B + βCin      (1)

• 1. Save Cin.
• 2. Perform C ← αA*B + βCin.
• 3. Take a random (vector) x and check that

    || Cx − ( α(A*(Bx)) + βCin x ) || < ε.

• 4. If the check is not good, start again from step 2.
• Works with almost anything (e.g. A = VDV^T).



 
 
 



SLIDE 33

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 34

Encoding: the vector x.
• Error detection.

SLIDE 35

Encoding: y = Gx.
• Error detection.

SLIDE 36

Encoding: the (possibly corrupted) vector x_, and the stored encoding y.


SLIDE 37

Encoding: y_ = G x_, compared against the stored y.
• Error detection.
• Note that G could have been 1-by-n (no need for 2-by-n).
• Note that correction is possible if the location is known.


SLIDE 38

Ge = s,  where s = y − y_.

• Solve Ge = s under the constraint that e has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.


SLIDE 39

Ge = s.

• Solve Ge = s under the constraint that e = x − x_ has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.


SLIDE 40

G = [ 3 1 3 1 0 1 7 9 ]
    [ 2 5 8 2 1 1 1 3 ],    Ge = s = [ 3 ; 3 ].

• Solve Ge = s under the constraint that e has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.

SLIDE 41

Same G and s: the solution is e = [ 0 0 0 0 0 3 0 0 ]^T. Column 6 of G is (1, 1)^T, collinear to s = (3, 3)^T with factor 3: location problem solved.
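The collinearity search can be written directly (numbers are the ones from this slide; for the location to be unambiguous, no two columns of G may be collinear, which is what the Reed-Solomon construction on the next slides guarantees):

```python
# Locating a single error: find the column of G collinear to the syndrome s.
# Here e has a single nonzero entry, 3, in position 6 (1-based), and
# column 6 of G is (1, 1), so s = (3, 3).

G = [[3, 1, 3, 1, 0, 1, 7, 9],
     [2, 5, 8, 2, 1, 1, 1, 3]]
s = [3, 3]

def locate(G, s):
    """Return (0-based column index, error magnitude), or None."""
    for j in range(len(G[0])):
        g = (G[0][j], G[1][j])
        # collinear iff the 2x2 cross product vanishes
        if g[0] * s[1] - g[1] * s[0] == 0:
            mag = s[0] / g[0] if g[0] else s[1] / g[1]
            return j, mag
    return None

j, mag = locate(G, s)
assert (j, mag) == (5, 3.0)   # position 6 (1-based), error value 3
```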


SLIDE 42

Reed-Solomon

G = [ 1 1 1 1 1 1 1 1 ]
    [ 1 2 3 4 5 6 7 8 ],    Ge = s = [ 2 ; 8 ].

• Solve Ge = s under the constraint that e has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.


SLIDE 43

Reed-Solomon

Same G and s: the solution is e = [ 0 0 0 2 0 0 0 0 ]^T. Column 4 of G is (1, 4)^T, collinear to s = (2, 8)^T with factor 2: location problem solved.


SLIDE 44

G = [ 3 1 3 1 0 1 7 9 ]
    [ 2 5 8 2 1 1 1 3 ],    Ge = s, with two simultaneous errors.

• Solve Ge = s under the constraint that e has two nonzero entries.
• In other words, find the unique pair of columns of G that generates s. Complexity: choose(n, 2) = choose(8, 2) = 28 pairs to try.
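The pairwise search can be sketched as follows. Note one assumption: with only 2 checksum rows, many column pairs can reproduce a 2-entry syndrome, so this sketch uses a 4-row Vandermonde-style G (my illustrative choice, not the slide's exact matrix); with it, any four columns are independent and the explaining pair is unique.

```python
from itertools import combinations

# Two simultaneous errors: search the choose(8, 2) = 28 column pairs of a
# 4-row Vandermonde-style G for the unique pair that generates s = G e.
n = 8
G = [[j ** p for j in range(1, n + 1)] for p in range(4)]

e = [0.0] * n
e[3], e[6] = 2.0, -1.0                    # injected errors (illustrative)
s = [sum(G[i][j] * e[j] for j in range(n)) for i in range(4)]

def locate2(G, s):
    """Solve a 2x2 system for each column pair; keep the pair whose
    solution also matches the remaining checksum rows."""
    for j, k in combinations(range(len(G[0])), 2):
        a, b, c, d = G[0][j], G[0][k], G[1][j], G[1][k]
        det = a * d - b * c
        if det == 0:
            continue
        ej = (s[0] * d - s[1] * b) / det   # solve rows 0-1 for this pair
        ek = (s[1] * a - s[0] * c) / det
        # accept only if rows 2-3 agree as well
        if all(abs(G[i][j] * ej + G[i][k] * ek - s[i]) < 1e-9 for i in (2, 3)):
            return j, k, ej, ek
    return None

assert locate2(G, s) == (3, 6, 2.0, -1.0)
```

This brute-force pairwise search is exactly the choose(n, 2) cost the slide quotes; Reed-Solomon decoding avoids it, at the price of the stability issues discussed next.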


SLIDE 45

                 Stable   Recovery cost   Extra memory
Reed-Solomon     ✗        ✓               nberr    ✓
Random           ✓        ✗               nberr    ✓


SLIDE 46

Encoding: the vector x = [ 1 2 3 4 5 6 7 8 ].


SLIDE 47

Encoding. (Diagram: the vector x = [ 1 2 3 4 5 6 7 8 ], shown twice.)


SLIDE 48

Encoding. (Diagram: the vector x = [ 1 2 3 4 5 6 7 8 ] and its encoding y.)


SLIDE 49

                 Stable   Recovery cost   Extra memory
Reed-Solomon     ✗        ✓               nberr    ✓
Random           ✓        ✗               nberr    ✓
Coordinate       ✓        ✓               sqrt(n)  ✗


SLIDE 50

Timing for recovery. (Plot: time in sec versus n, the size of x.)


SLIDE 51

• Accuracy comparison after recovery.
• maxerr = 3
• nberr = 3
(Plot: max( |xi − yi| / |xi| ) versus n, the size of x.)


SLIDE 52

• Accuracy comparison after recovery.
• maxerr = 4
• nberr = 4
(Plot: max( |xi − yi| / |xi| ) versus n, the size of x.)


SLIDE 53

• maxerr = 10
• nberr = 10
(Plots: max( |xi − yi| / |xi| ) and time in sec, versus n, the size of x.)


SLIDE 54

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 55

LU factorization: starting from the encoding.

(Diagram: from A, build the row-checksum matrix Ar = A [ I  Br ], the column-checksum matrix Ac = [ I ; Bc^T ] A, and the fully augmented matrix Af = [ I ; Bc^T ] A [ I  Br ].)


SLIDE 56

(Diagram: factoring the encoded matrix. With A = LU: (1) Ar = A [ I  Br ] = L (U [ I  Br ]) = L Ur, so Ur carries the row checksums of U; (2) Ac = [ I ; Bc^T ] A = ([ I ; Bc^T ] L) U = Lc U, so Lc carries the column checksums of L; (3) with pivoting, PA = LU and the same identities hold with P folded in.)


SLIDE 57

(Diagram: the blocked version. For j = 1:nb:n, each panel factorization and trailing-matrix update, steps (1)-(5), is applied to the augmented matrix Af as well, so Lc and Ur remain consistent with L and U throughout the factorization.)

SLIDE 58

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 59

Experiments on MM

• Goal:
  - Write an FT-PDGEMM (Fault-Tolerant parallel matrix-matrix multiply).
• Testing:
  - Perform FT-PDGEMM in a loop and check the results with residual checking;
  - on top of this, add an automatic process killer.

(Diagram: Cout ← αA*B + βCin, checked via a random x: Cout·x − [ α A (B x) + β Cin x ].)


SLIDE 60

ABFT-BLAS: a parallel fault-tolerant BLAS library based on ABFT techniques

• Constructed on top of FT-MPI.
• Provides users a fault-tolerant environment:
  - detects failures,
  - recovers data automatically,
  - enables the user to stack computational routines one on top of the other (two general problems),
  - goal: a research library for conducting experiments on fault tolerance.
• Provides developers with an automatic process killer.


SLIDE 61

EXAMPLE CODE

int rc;
struct Vector v;
struct Matrix a;
struct Dataworld worldmpi;
struct Global_ddata normv;
struct Global_idata nbr_iter;
…
rc = MPI_Init(&argc, &argv);
rc = init_world(&worldmpi, p, q, rc);
rc = get_info_on_grid(&worldmpi, &me, &myrow, &mycol, &nprow, &npcol);
…
rc = allocate_vector(&v, POS_ROW, 0, nb_n, &worldmpi, "v");
rc = allocate_matrix(&a, m, n, nb_m, nb_n, &worldmpi, "a");
rc = allocate_dglobal(&normv, 1, &worldmpi);
rc = allocate_iglobal(&nbr_iter, 1, &worldmpi);
…
if (!worldmpi.recovering)
{
    /* ... here goes the user code to initialize objects ... */
    rc = make_checksum_matrix(&a, &worldmpi);
    rc = make_checksum_vector(&v, &worldmpi);
}
if (worldmpi.user_state == 0)
{
    rc = gdnrm2(&worldmpi, &v, normv.data);
    worldmpi.user_state = 1;
}
if (worldmpi.user_state == 1)
{
    /* ... here goes any call to the ABFT-BLAS numerical routines ... */
    worldmpi.user_state = 2;
}
free_vector(&v);
free_matrix(&a);
free_dglobal(&normv);
free_iglobal(&nbr_iter);
exit(0);


SLIDE 62

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 63

PDGEMM-SUMMA

PDGEMM: for k = 1:nb:n, C ← A*B + C (rank-nb updates). (Diagram.)


SLIDE 64

PDGEMM-SUMMA → ABFT-PDGEMM-SUMMA

ABFT-PDGEMM.

• The algorithm maintains the consistency of the checkpoints of the matrix C naturally.

(Diagram: for k = 1:nb:n, C ← A*B + C, now on the checksum-augmented matrices.)
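Why the checkpoints of C stay consistent "naturally": each SUMMA step is a rank-nb update, and applying it to the checksum-augmented factors updates data and checksums together. A serial, toy-sized sketch (no MPI, nb = 1; matrices are illustrative) checks the invariant after every step:

```python
# SUMMA-style loop of rank-1 updates Cf += Ac[:, k] * Br[k, :] on
# checksum-augmented factors: Cf stays fully checksummed after EVERY
# step, so C never needs separate re-checkpointing.
n = 3
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]

# Ac: column checksums appended as a row; Br: row checksums as a column.
Ac = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]
Br = [row + [sum(row)] for row in B]

Cf = [[0.0] * (n + 1) for _ in range(n + 1)]

def consistent(Cf):
    rows_ok = all(abs(sum(row[:n]) - row[n]) < 1e-9 for row in Cf[:n])
    cols = list(zip(*Cf))
    cols_ok = all(abs(sum(col[:n]) - col[n]) < 1e-9 for col in cols[:n])
    return rows_ok and cols_ok

for k in range(n):                      # the "for k = 1:nb:n" loop (nb = 1)
    for i in range(n + 1):
        for j in range(n + 1):
            Cf[i][j] += Ac[i][k] * Br[k][j]
    assert consistent(Cf)               # invariant holds after each update

# The data part of Cf is exactly C = A*B.
C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]
assert all(abs(Cf[i][j] - C[i][j]) < 1e-9 for i in range(n) for j in range(n))
```

Because the invariant holds between steps, a process lost mid-multiply can be rebuilt from the surviving rows/columns and the checksums, then the loop resumes.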


SLIDE 65

jacquard.nersc.gov

• Processor type: Opteron, 2.2 GHz
• Processor theoretical peak: 4.4 GFlops/sec
• Number of application processors: 712
• System theoretical peak (computational nodes): 3.13 TFlops/sec
• Number of shared-memory application nodes: 356
• Processors per node: 2
• Physical memory per node: 6 GBytes
• Usable memory per node: 3-5 GBytes
• Switch interconnect: InfiniBand
• Switch MPI unidirectional latency: 4.5 μsec
• Switch MPI unidirectional bandwidth (peak): 620 MB/s
• Global shared disk: GPFS, usable disk space 30 TBytes
• Batch system: PBS Pro

SLIDE 66

MVAPICH vs FT-MPI.
(Plots: GFLOPs/sec/proc, 0 to 4, versus number of processors, 64 to 454, for matrix sizes 1000, 2000, 3000, 4000.)


SLIDE 67

FT-PDGEMM, nloc = 4,000.
(Plot: GFLOPs/sec/proc versus number of processors, 64 to 484, for FT off, FT on, and FT on with 1 fault.)


SLIDE 68

FT-PDGEMM, nloc = 4,000.
(Plots: GFLOPs/sec/proc and aggregate GFLOPs/sec, versus number of processors, 64 to 484, for FT off, FT on, and FT on with 1 fault.)


SLIDE 69

Performance modeling.
(Plot: GFLOPs/sec/proc versus number of processors, 100 to 1024: Model SUMMA, Model ABFT, Measured SUMMA, Measured ABFT.)


SLIDE 70

Strong scalability.
(Plot: GFLOPs/sec/proc versus number of processors, 4 to 1024, for SUMMA at nloc = 10,000 to 60,000.)


SLIDE 71

Strong scalability.
(Plot: GFLOPs/sec/proc versus number of processors, 4 to 1024, comparing SUMMA and ABFT at nloc = 10,000 to 60,000.)

ABFT represents the only known alternative to address fault tolerance in strong scalability.


SLIDE 72

Strong scalability. (Same plot as Slide 71.)


SLIDE 73

ABFT advantages over diskless checkpointing

• Independent of the surface (n^2) / volume (n^3) ratio.
  - Important for n/n operations (e.g. FFT).
  - Important for MM with small n.
• Independent of the failure rate.
  - No need to guess parameters.
• Fits nicely in the algorithm.
  - No need for explicit synchronization, for example.

ABFT disadvantage over diskless checkpointing

• Relies on floating-point arithmetic checksums.
  - Cancellation, ill-conditioned matrices, etc.