Fault Tolerant Linear Algebra: Goals and Methods


  1. Fault Tolerant Linear Algebra: goals and methods.
 Julien Langou, University of Colorado Denver


  2. Fault-tolerant Linear Algebra: Goals and Methods.
 0- GOALS
  0.1- ERASURE OR ERROR?
 1- METHODS
  1.1- ERASURE: DISKLESS CHECKPOINTING AND ROLLBACK
  1.2- ERASURE & ERROR: ABFT: ALGORITHM BASED FAULT TOLERANCE
  1.3- OTHERS
 2- ERROR: DETECTING/LOCATING/CORRECTING IN FLOATING-POINT ARITHMETIC
 3- NOVEL ABFT-ALGORITHM (GEMM, LU, QR, ETC.) (ERASURE OR ERROR)
 4- ABFT-BLAS LIBRARY
 5- ABFT-BLAS EXPERIMENTS (ERASURE)



  4. Goals
 Perform reliable and efficient computation with unreliable units.
 • Unreliable units: process crash, hardware failure, erroneous communication, erroneous computation, …
 • Our method: at the algorithm level.
 • Motivation: cost effective, large unit count


  5. Erasure Problem
 4 processors available: P1, P2, P3, P4
 P1: 1+1 = 2    P2: 2+2 = 4    P3: 3+3 = 6    P4: 4+4 = 8
 Error Problem
 4 processors available: P1, P2, P3, P4
 P1: 1+1 = 2    P2: 2+2 = 4    P3: 3+3 = 6    P4: 4+4 = 8


  9. Erasure Problem
 4 processors available: P1, P2, P3, P4
 Lost processor 2
 P1: 1+1 = 2    P2: 2+2 = 4 (processor lost)    P3: 3+3 = 6    P4: 4+4 = 8
 - we know whether there is an erasure or not,
 - we know where the erasure is,
 - so we only need to recover
 Error Problem
 4 processors available: P1, P2, P3, P4
 Processor 2 returns an incorrect result
 P1: 1+1 = 2    P2: 2+2 = 5 (incorrect)    P3: 3+3 = 6    P4: 4+4 = 8
 - we do not know if there is an error,
 - assuming we know that an error occurs, we do not know where it is
 - we also need to recover
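
 Note: the difference between the two problems can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it supposes that a plain sum of the four results was stored somewhere safe beforehand, and the helper names (recover_erasure, detect_error) are hypothetical. With an erasure the position is known, so one checksum is enough to recover. With an error the same checksum only reveals that something is wrong, not where.

```python
def recover_erasure(results, lost_index, checksum):
    """Rebuild the single missing entry from the checksum and the survivors."""
    survivors = sum(r for i, r in enumerate(results) if i != lost_index)
    return checksum - survivors


def detect_error(results, checksum):
    """Return True when the results are inconsistent with the checksum."""
    return sum(results) != checksum


expected = [1 + 1, 2 + 2, 3 + 3, 4 + 4]
checksum = sum(expected)  # assumed to be stored before any fault occurs

# Erasure: processor 2 is lost, and we know it is processor 2.
erased = [2, None, 6, 8]
recovered = recover_erasure(erased, lost_index=1, checksum=checksum)

# Error: processor 2 silently returns 5 instead of 4.
corrupted = [2, 5, 6, 8]
error_seen = detect_error(corrupted, checksum)

# The discrepancy is a single number: it does not say which entry is wrong.
discrepancy = sum(corrupted) - checksum

print(recovered, error_seen, discrepancy)
```

 Here the erased value is rebuilt exactly, while the corrupted run only yields a nonzero discrepancy: locating the faulty processor needs more information than one sum provides.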


  14. Diskless checkpointing
 4 processors available: P1, P2, P3, P4
 Add a 5th one and perform a checksum (MPI_Reduce): P1 + P2 + P3 + P4 = Pc
 Ready for the computations: P1, P2, P3, P4, Pc
 …
 Lost a processor: P1, P3, P4, Pc remain
 Recover the processor (FT-MPI)
 Recover the data (MPI_Reduce): Pc - P1 - P3 - P4 = P2
 Ready for the computations: P1, P2, P3, P4, Pc
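
 Note: a minimal single-process simulation of the checkpoint and recovery steps, in Python. It is a sketch, not the FT-MPI implementation: the element-wise sum stands in for MPI_Reduce with a sum operation, the data values are made up, and the function names (checkpoint, recover) are hypothetical.

```python
def checkpoint(local_blocks):
    """Element-wise sum over all processes (stand-in for MPI_Reduce with a sum)."""
    return [sum(column) for column in zip(*local_blocks)]


def recover(surviving_blocks, checksum):
    """Rebuild the lost block: checksum minus the element-wise sum of the survivors."""
    partial = checkpoint(surviving_blocks)
    return [c - p for c, p in zip(checksum, partial)]


# Hypothetical local data held by the four compute processes.
data = {
    "P1": [1.0, 2.0],
    "P2": [3.0, 4.0],
    "P3": [5.0, 6.0],
    "P4": [7.0, 8.0],
}

# The fifth process Pc stores the checksum.
pc = checkpoint(list(data.values()))

# P2 is lost: its data disappears with it.
original_p2 = data.pop("P2")

# After a replacement process is started, its data is rebuilt from Pc and the survivors.
data["P2"] = recover(list(data.values()), pc)

print(pc, data["P2"])
```

 The values are small integers stored as floats, so the subtraction is exact here. With general floating-point data the recovered block matches the lost one only up to rounding.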


  15. Diskless checkpointing (remarks)
 • You can use either floating-point arithmetic or binary arithmetic for the checksum
 • Multiple failures/errors are supported through the Reed-Solomon algorithm, which is optimal in the sense that, to support p simultaneous failures/errors, one only needs to add p processes.
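
 Note: the first remark can be illustrated by comparing the two kinds of checksum on the same data. The sketch below is an assumption-laden illustration: the values are made up, "binary arithmetic" is taken to mean a bitwise XOR of the 64-bit patterns of the floats, and the helper names are hypothetical. A floating-point sum can lose a small value to rounding, while the XOR checksum returns the lost value bit for bit.

```python
import struct
from functools import reduce
from operator import add, xor


def to_bits(x):
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b):
    """Reinterpret a 64-bit unsigned integer as a float64."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


values = [0.1, 0.2, 0.3, 1e16]  # hypothetical data, one value per process
lost = 0                        # index of the failed process
survivors = values[:lost] + values[lost + 1:]

# Floating-point checksum: sum, then subtract the survivors.
fp_checksum = reduce(add, values)
fp_recovered = fp_checksum - reduce(add, survivors)

# Binary checksum: XOR of the bit patterns, then XOR with the survivors.
xor_checksum = reduce(xor, map(to_bits, values))
xor_recovered = from_bits(xor_checksum ^ reduce(xor, map(to_bits, survivors)))

print(fp_recovered, xor_recovered)
```

 The trade-off is that the XOR checksum is exact but is not a linear function of the data, so it cannot be carried through a linear computation the way a floating-point checksum can.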


  16. Time for an MPI_Reduce (using MVAPICH) on Infiniband on jacquard.nersc.gov
 [Figure: time (sec), axis from 0 to 2.5, versus # of processors (64, 81, 100, 121, 256); curves labeled 7.6 MB, 30.5 MB, 68.7 MB and 122.0 MB]


  18. ABFT = Algorithm Based Fault Tolerance.
 • K. Huang, J. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Trans. on Comp. (Spec. Issue Reliable & Fault-Tolerant Comp.), C-33, 1984, pp. 518-528.
 • If checkpoints are performed in floating-point arithmetic then we can exploit the linearity of the mathematical relations on the object to maintain the checksums


  19. ABFT concept in an example
 We want to perform z = λx + μy.
 Proc 1: Z1 = λ X1 + μ Y1
 Proc 2: Z2 = λ X2 + μ Y2
 Proc 3: Z3 = λ X3 + μ Y3
 Proc 4: Z4 = λ X4 + μ Y4


  20. ABFT concept in an example
 We want to perform z = λx + μy.
 X1 + X2 + X3 + X4
 Y1 + Y2 + Y3 + Y4
 (Xi and Yi are held by Proc i, for i = 1, 2, 3, 4)


  21. ABFT concept in an example
 We want to perform z = λx + μy.
 checkX: X1 + X2 + X3 + X4 = Xc
 checkY: Y1 + Y2 + Y3 + Y4 = Yc
 (Xi and Yi are held by Proc i, for i = 1, 2, 3, 4; Xc and Yc are held by Proc c)


  22. ABFT concept in an example
 We want to perform z = λx + μy.
 checkX: X1, X2, X3, X4, Xc
 checkY: Y1, Y2, Y3, Y4, Yc
 held by: Proc 1, Proc 2, Proc 3, Proc 4, Proc c
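
 Note: once Xc and Yc are in place, linearity means that the checksum process can apply the same operation to its checksums and obtain a valid checksum of z, with no extra communication. The Python sketch below illustrates this under stated assumptions: the data, the scalars and the helper names (axpby, column_sum) are all made up, and the values are chosen so that floating-point arithmetic is exact.

```python
def axpby(lam, x, mu, y):
    """Local kernel run by every process: lam*x + mu*y, element-wise."""
    return [lam * xi + mu * yi for xi, yi in zip(x, y)]


def column_sum(blocks):
    """Element-wise sum of the blocks held by several processes."""
    return [sum(column) for column in zip(*blocks)]


lam, mu = 2.0, 3.0

# Hypothetical blocks of x and y held by Proc 1 to Proc 4.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]

# Checksums held by Proc c, computed once before the operation.
Xc = column_sum(X)
Yc = column_sum(Y)

# Every process, including Proc c, runs the same kernel on its own data.
Z = [axpby(lam, x, mu, y) for x, y in zip(X, Y)]
Zc = axpby(lam, Xc, mu, Yc)

# If Proc 2 is lost after the operation, its block of z is rebuilt from Zc.
survivors = [Z[0], Z[2], Z[3]]
z2_recovered = [c - s for c, s in zip(Zc, column_sum(survivors))]

print(Zc, column_sum(Z), z2_recovered)
```

 Zc equals the sum of the Zi without ever being recomputed from them, which is the property the checksum relies on. With general floating-point data the two agree only up to rounding.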

