

SLIDE 1

Kento Sato†1, Kathryn Mohror†2, Adam Moody†2, Todd Gamblin†2, Bronis R. de Supinski†2, Naoya Maruyama†3 and Satoshi Matsuoka†1

†1 Tokyo Institute of Technology †2 Lawrence Livermore National Laboratory †3 RIKEN Advanced Institute for Computational Science

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-654744-DRAFT. May 27th, 2014, CCGrid2014 @ Chicago

SLIDE 2

Failures on HPC systems

  • Exponential growth in computational power
    – Enables finer-grained simulations over shorter time periods
  • The overall failure rate increases accordingly because of the growing system size
  • 191 failures out of 5 million node-hours
    – A production application of the laser-plasma interaction code (pF3D)
    – Hera, Atlas and Coastal clusters @ LLNL

Estimated MTBF (without hardware reliability improvement per component in the future):

                1,000 nodes    10,000 nodes    100,000 nodes
    MTBF        1.2 days       2.9 hours       17 minutes
                (measured)     (estimated)     (estimated)
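The estimates follow from the usual assumption that node failures arrive independently, so system MTBF scales inversely with node count. A quick worked check of the table (illustrative arithmetic, not from the slide):

    \mathrm{MTBF}(N) \approx \mathrm{MTBF}(1000) \times \frac{1000}{N}
    \mathrm{MTBF}(10{,}000) \approx 28.8\,\mathrm{h} / 10 = 2.88\,\mathrm{h} \approx 2.9\ \text{hours}
    \mathrm{MTBF}(100{,}000) \approx 28.8\,\mathrm{h} / 100 \approx 17\ \text{minutes}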

  • It will be difficult for applications to run continuously for a long time without fault tolerance at extreme scale

SLIDE 3

Checkpoint/Restart (software-level)

  • Idea of Checkpoint/Restart
    – Checkpoint: periodically save snapshots of the application state to the PFS
    – Restart: on a failure, restart the execution from the latest checkpoint

  • Improved Checkpoint/Restart
    – Multi-level checkpointing [1]
    – Asynchronous checkpointing [2]
    – In-memory diskless checkpointing [3]
  • We found that software-level approaches may be limited in increasing resiliency at extreme scale

[Figure: periodic checkpoints written to the parallel file system (PFS); after a failure, execution rolls back to the latest checkpoint, and each checkpoint adds checkpointing overhead]

[1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10. [2] K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski, N. Maruyama and S. Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12. [3] K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski, N. Maruyama and S. Matsuoka, "FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery", IPDPS 2014.

SLIDE 4

Storage architectures

  • We consider architecture-level approaches
  • Burst buffer
    – A new tier in the storage hierarchy
    – Absorbs bursty I/O requests from applications
    – Fills the performance gap between node-local storage and PFSs in both latency and bandwidth
  • If you write checkpoints to burst buffers,
    – Faster checkpoint/restart time than the PFS
    – More reliable than storing on compute nodes

[4] S. Doraimani and A. Iamnitchi, "File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces", HPDC '08.

  • However, ...
    – Adding burst buffer nodes may increase total system size, and failure rates accordingly
  • It is not clear whether burst buffers improve overall system efficiency
    – Because burst buffers also connect to networks, the burst buffers may still be a bottleneck

[Figure: compute nodes writing through burst buffers to the parallel file system]

SLIDE 5

Goal and Contributions

  • Goal:
    – Develop an interface that exploits the bandwidth of burst buffers
    – Explore the effectiveness of burst buffers
    – Find the best C/R strategy on burst buffers
  • Contributions:
    – Development of IBIO, which exploits the bandwidth to burst buffers
    – A model to evaluate system resiliency given a C/R strategy and a storage configuration
    – Experimental results that show a direction for building resilient systems for extreme-scale computing

SLIDE 6

Outline

  • Introduction
  • Checkpoint strategies
  • Storage designs
  • IBIO: InfiniBand-based I/O interface
  • Modeling
  • Experiments
  • Conclusion

SLIDE 7

Diskless checkpoint/restart (C/R)

[Figure: XOR encoding example — checkpoints ckpt A1–A3, B1–B3, C1–C3, D1–D3 and parity blocks Parity 1–4 striped across Node 1–Node 4; after a single-node failure, the surviving blocks suffice to rebuild the lost checkpoint]

  • Most failures come from one node, or can be recovered from an XOR checkpoint
    – e.g. 1) TSUBAME2.0: 92% of failures
    – e.g. 2) LLNL clusters: 85% of failures
  • Diskless C/R
    – Creates redundant data across the local storage of n compute nodes using an encoding technique such as XOR (see the sketch below)
    – Can restore lost checkpoints after a failure caused by a small number of nodes, like RAID-5
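To make the XOR scheme concrete, here is a minimal sketch of RAID-5-style parity encoding and single-block recovery over n checkpoint blocks. This is illustrative only, not the SCR implementation; the block layout and function names are assumptions:

    /* XOR parity across n equally sized checkpoint blocks. */
    #include <stddef.h>
    #include <string.h>

    /* parity = ckpt[0] XOR ckpt[1] XOR ... XOR ckpt[n-1] */
    void xor_encode(char *const ckpt[], size_t n, char *parity, size_t len) {
        memset(parity, 0, len);
        for (size_t i = 0; i < n; i++)
            for (size_t b = 0; b < len; b++)
                parity[b] ^= ckpt[i][b];
    }

    /* Rebuild the checkpoint block of one failed node from the n-1
       surviving blocks and the parity block: XOR of everything else. */
    void xor_recover(char *const ckpt[], size_t n, size_t failed,
                     const char *parity, char *out, size_t len) {
        memcpy(out, parity, len);
        for (size_t i = 0; i < n; i++)
            if (i != failed)
                for (size_t b = 0; b < len; b++)
                    out[b] ^= ckpt[i][b];
    }

As in RAID-5, one parity block per XOR set tolerates exactly one lost block per set, which matches the slide's observation that most failures involve a single node.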

[Figure: failure analysis on TSUBAME2.0 and on LLNL clusters — 92% and 85% of failures, respectively, are recoverable from a LOCAL/XOR/PARTNER checkpoint; the remaining 8% and 15% require a PFS checkpoint]

Diskless checkpointing is a promising approach. The rest of the failures still require a checkpoint on a reliable PFS.

SLIDE 8

Multi-level Checkpoint/Restart (MLC/R) [1,2]

  • MLC uses the storage levels hierarchically
    – Diskless checkpoint: frequent, for one-node or few-node failures
    – PFS checkpoint: less frequent and asynchronous, for multi-node failures
  • Our evaluation showed that system efficiency drops to less than 10% when MTBF is a few hours

[1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10. [2] K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski, N. Maruyama and S. Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12.

[Figure: efficiency vs. scale factor (xF, xL2) for the diskless checkpoint (level 1), the PFS checkpoint (level 2), and MLC; efficiency holds up while MTBF is days or a day, but collapses once MTBF falls to a few hours]

[MLC model inset [2]: the Markov-model states and the p_0/t_0/p_i/t_i equations, shown in full on Slide 19]

SLIDE 9

Coordinated C/R vs. uncoordinated C/R + MLC

[Figure: efficiency vs. scale factor (xF, xL2) for coordinated and uncoordinated C/R, spanning MTBFs from days or a day down to a few hours]

  • Coordinated C/R
    – All processes globally synchronize before taking checkpoints, and all restart on a failure
    – Restart overhead
  • Uncoordinated C/R
    – Create clusters, and log messages exchanged between clusters (see the sketch below)
    – Message logging overhead is incurred, but rolling back only one cluster can restart the execution after a failure

⇒ Even MLC + uncoordinated C/R (software-level approaches) may be limited at extreme scale
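A minimal sketch of the cluster-boundary message logging idea (sender-based logging). The cluster_of() rule, net_send() transport and log layout are hypothetical placeholders, not the actual implementation:

    #include <stdlib.h>
    #include <string.h>

    void net_send(int dest, const void *buf, size_t len);  /* placeholder transport */

    typedef struct { int dest; size_t len; char *payload; } log_entry_t;
    static log_entry_t *log_buf = NULL;
    static size_t log_len = 0, log_cap = 0;

    /* 16-node clusters, matching the logging cluster size used later (Slide 25). */
    int cluster_of(int rank) { return rank / 16; }

    /* Send wrapper: keep a copy only when the receiver is in another cluster. */
    void logged_send(int my_rank, int dest, const void *buf, size_t len) {
        if (cluster_of(my_rank) != cluster_of(dest)) {
            if (log_len == log_cap) {
                log_cap = log_cap ? 2 * log_cap : 64;
                log_buf = realloc(log_buf, log_cap * sizeof(log_entry_t));
            }
            log_buf[log_len].dest = dest;
            log_buf[log_len].len = len;
            log_buf[log_len].payload = malloc(len);
            memcpy(log_buf[log_len].payload, buf, len);
            log_len++;
        }
        net_send(dest, buf, len);
    }

    /* After a failure in cluster c, only c rolls back; other clusters replay
       their logged messages to c instead of re-executing. */
    void replay_to_cluster(int c) {
        for (size_t i = 0; i < log_len; i++)
            if (cluster_of(log_buf[i].dest) == c)
                net_send(log_buf[i].dest, log_buf[i].payload, log_buf[i].len);
    }

Intra-cluster messages need no log because the whole cluster rolls back together; only traffic crossing the cluster boundary pays the logging overhead.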

[Figure: uncoordinated C/R — processes P0–P3 split into Cluster A and Cluster B, each taking its own checkpoints; messages crossing the cluster boundary are logged, so each failure rolls back only the affected cluster]

SLIDE 10

Storage designs

  • In addition to the software-level approaches, we also explore two architecture-level approaches
    – Flat buffer system: the current storage system
    – Burst buffer system: a separate buffer space

[Figure: flat buffer system — compute nodes 1–4 each with a directly attached SSD above the PFS (parallel file system); burst buffer system — SSDs 1–4 on dedicated nodes between the compute nodes and the PFS]

SLIDE 11

Flat Buffer Systems

  • Design concept
    – Each compute node has its own dedicated node-local storage
    – Scalable with an increasing number of compute nodes
  • This design has drawbacks:
    1. Unreliable checkpoint storage
       e.g.) If compute node 2 fails, the checkpoint on SSD 2 will be lost because SSD 2 is physically attached to the failed compute node 2
    2. Inefficient utilization of storage resources under uncoordinated checkpointing
       e.g.) If compute nodes 1 & 3 are in the same cluster and restart from a failure, the bandwidth of SSDs 2 & 4 will not be utilized

[Figure: flat buffer system — compute nodes 1–4 with directly attached SSDs 1–4 above the PFS (parallel file system); during the cluster restart, SSDs 2 & 4 sit idle]

SLIDE 12

Burst Buffer Systems

  • Design concept
    – A burst buffer is a storage space that bridges the gap in latency and bandwidth between node-local storage and the PFS
    – Shared by a subset of compute nodes
  • Although additional nodes are required, there are several advantages:
    1. More reliable, because burst buffers are located on a smaller number of nodes
       e.g.) Even if compute node 2 fails, the checkpoint of compute node 2 is still accessible from the other compute nodes
    2. Efficient utilization of storage resources under uncoordinated checkpointing
       e.g.) If compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize all SSD bandwidth, unlike in a flat buffer system

[Figure: burst buffer system — compute nodes 1–4 sharing burst buffer SSDs 1–4 above the PFS (parallel file system); the checkpoints survive the failure of compute node 2]

SLIDE 13

Challenges for using burst buffers

  • Exploiting the storage bandwidth of burst buffers
    – Burst buffers are reached over the network, and the network can become a bottleneck
    → Addressed by IBIO, an InfiniBand-based I/O interface
  • Analyzing the reliability of systems with burst buffers
    – Adding burst buffer nodes increases total system size
    – System efficiency may decrease due to the overall failure rate added by the burst buffers
    → Addressed by a storage model

[Figure: compute nodes 1–4, burst buffer SSDs 1–4, and the PFS (parallel file system), annotated with the two challenges: network bottleneck → IBIO; reliability → storage model]

SLIDE 14

Burst buffer prototype: multi-mSATA, high I/O bandwidth & cost

Node specification:
    CPU             Intel Core i7-3770K (3.50 GHz x 4 cores)
    Memory          Cetus DDR3-1600 (16 GB)
    M/B             GIGABYTE GA-Z77X-UD5H
    SSD             Crucial m4 mSATA 256 GB CT256M4SSD3 x 8 (peak read: 500 MB/s, peak write: 260 MB/s each)
    SATA converter  KOUTECH IO-ASS110 mSATA to 2.5" SATA device converter with metal frame
    RAID card       Adaptec RAID 7805Q ASR-7805Q Single x 1
    Interconnect    Mellanox FDR HCA (Model No.: MCX354A-FCBT)

[Figure: eight mSATA SSDs behind one Adaptec RAID card; read/write throughput (GB/s) vs. # of processes for Read/Write – Peak, Local, and NFS]

SLIDE 15

IBIO: InfiniBand-based I/O interface

  • Provides POSIX I/O interfaces
    – open, read, write and close
    – A client can open any file on any server:
      open("hostname:/path/to/file", mode)
  • IBIO uses ibverbs for communication between clients and servers
    – Exploits the network bandwidth of InfiniBand (see the usage sketch below)
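A hypothetical client-side usage sketch. Only the open("hostname:/path/to/file", mode) convention comes from the slide; the ibio_* names, signatures, and the "bb01" server name are assumptions for illustration:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Assumed IBIO client API, mirroring POSIX open/write/close. */
    int     ibio_open(const char *path, int mode);
    ssize_t ibio_write(int fd, const void *buf, size_t n);
    int     ibio_close(int fd);

    int main(void) {
        size_t n = 5UL << 30;          /* 5 GB/node checkpoint, per the setup */
        void *ckpt = malloc(n);        /* checkpoint image (sketch only) */
        /* Open a file on a burst buffer server by "hostname:/path" name. */
        int fd = ibio_open("bb01:/ckpt/rank0.ckpt", O_WRONLY | O_CREAT);
        ibio_write(fd, ckpt, n);       /* client chunks + RDMA under the hood */
        ibio_close(fd);
        free(ckpt);
        return 0;
    }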

[Figure: IBIO write — four IBIO clients on compute nodes 1–4 and one IBIO server; the server thread pulls chunks into chunk buffers and writer threads drain them through fd1–fd4 into files 1–4 on storage. IBIO read — the same structure in reverse, with reader threads moving chunks from storage back to the clients]

SLIDE 16

IBIO write/read

[Figure: IBIO write with four IBIO clients and one IBIO server, with the data path numbered 1–5 to match the steps below]

  • IBIO write
    1. The application calls an IBIO client function with the data to write
    2. The IBIO client divides the data into chunks, then sends each chunk's address to the IBIO server for RDMA
    3. The IBIO server issues an RDMA read on that address and replies with an ack
    4. This continues until all chunks have been sent, then control returns to the application
    5. Writer threads asynchronously write the received data to storage (see the sketch below)
  • IBIO read
    – Reader threads read chunks and send them to clients in the same way as IBIO write, using RDMA
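A minimal sketch of step 5: server-side writer threads draining received chunks to storage asynchronously, so disk I/O overlaps with ongoing RDMA transfers. The queue, types and names are illustrative, not IBIO's actual code:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* 64 MB chunks, the size used in the evaluation (Slide 24). */
    #define CHUNK_SIZE (64 * 1024 * 1024)

    typedef struct chunk { void *data; size_t len; FILE *fp; struct chunk *next; } chunk_t;

    static chunk_t *head = NULL, *tail = NULL;
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;

    /* Called from the RDMA completion path: hand a received chunk to the writers. */
    void enqueue_chunk(chunk_t *c) {
        pthread_mutex_lock(&mtx);
        c->next = NULL;
        if (tail) tail->next = c; else head = c;
        tail = c;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
    }

    /* Writer thread: block until a chunk arrives, then write it to its file. */
    void *writer_thread(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&mtx);
            while (!head) pthread_cond_wait(&cv, &mtx);
            chunk_t *c = head;
            head = c->next;
            if (!head) tail = NULL;
            pthread_mutex_unlock(&mtx);
            fwrite(c->data, 1, c->len, c->fp);  /* storage I/O off the RDMA path */
            free(c->data);
            free(c);
        }
        return NULL;
    }

Because the ack in step 3 is sent as soon as the RDMA read completes, the client never waits for the disk; the queue decouples network and storage speeds.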

[Figure: write sequence between a compute node and a burst buffer node — Application → IBIO client → IBIO server → writer threads; the client passes a chunk address (addr), the server performs an RDMA read and returns an ack]

SLIDE 17

Challenges for using burst buffers (recap)

  • Exploiting the storage bandwidth of burst buffers → IBIO, an InfiniBand-based I/O interface
  • Analyzing the reliability of systems with burst buffers → a storage model

[Figure: same annotated architecture diagram as Slide 13]

SLIDE 18

Modeling overview

  • To find the best checkpoint/restart strategy for systems with burst buffers, we model checkpointing strategies

Efficiency: the fraction of time an application spends only in useful computation

[2] K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski, N. Maruyama and S. Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12.

[Figure: modeling overview — a recursive structured storage model H_N{m_1, m_2, ..., m_N} (a tier-i entity H_i has a storage S_i shared by m_i entities H_{i-1}; H_0 is a compute node), a C/R strategy model, and the MLC model [2]]

C/R strategy model:

    L_i = C_i + E_i
    O_i = C_i + E_i (sync.)  or  I_i (async.)
    C_i or R_i = (<C/R data size per node> x <# of C/R nodes per S_i>) / (<write perf. (w_i)> or <read perf. (r_i)>)

[MLC model inset [2]: the Markov-model equations, shown in full on Slide 19]

SLIDE 19

Multi-level Asynchronous C/R Model [2]

  • Optimizes checkpoint intervals and computes checkpoint/restart "efficiency" using a Markov model
    – Vertex: a compute state, a checkpointing state, or a recovery state
    – Edge: completion of each state

[2] K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski, N. Maruyama and S. Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12.

[Figure: Markov model — chains of compute intervals (t + c_k) and recovery states (r_k), with no-failure transitions weighted by p_0 and level-i failure transitions weighted by p_i]

Model equations, where \lambda_i is the level-i failure rate, \lambda = \sum_i \lambda_i, c_k is the level-k checkpoint time, r_k is the level-k recovery time, and t is the checkpoint interval:

    p_0(T) = e^{-\lambda T}        t_0(T) = T
    p_i(T) = (\lambda_i / \lambda) (1 - e^{-\lambda T})
    t_i(T) = (1 - (\lambda T + 1) e^{-\lambda T}) / (\lambda (1 - e^{-\lambda T}))

    p_0(T): probability of no failure for T seconds; t_0(T): expected elapsed time in that case
    p_i(T): probability of a level-i failure within T seconds; t_i(T): expected elapsed time until that failure

  • Input: for each level i = 1...N
    – L_i: checkpoint latency
    – O_i: checkpoint overhead
    – R_i: restart time
    – F_i: failure rate
  • Output: "Efficiency"
    – The fraction of time an application spends only in computation under the optimal checkpoint interval
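To make the formulas concrete, a numeric instance with illustrative values that are not from the slides, \lambda = 10^{-5}\,/\mathrm{s} and T = t + c_k = 3600\,\mathrm{s}:

    p_0(T) = e^{-\lambda T} = e^{-0.036} \approx 0.965
    p_i(T) = (\lambda_i/\lambda)(1 - e^{-\lambda T}) \approx 0.035 \cdot \lambda_i/\lambda

So about 96.5% of such intervals complete failure-free, and the remaining 3.5% is split across the levels in proportion to \lambda_i.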

SLIDE 20

Modeling of C/R Strategies

    C_i or R_i = (<C/R data size per node> x <# of C/R nodes per S_i>) / (<write perf. (w_i)> or <read perf. (r_i)>)

Synchronous checkpointing (diskless C/R): a checkpoint (C_i) followed by encoding (E_i)
    L_i = C_i + E_i        O_i = C_i + E_i

Asynchronous checkpointing (PFS): only the initialization (I_i) blocks the application; the checkpoint (C_i) and encoding (E_i) proceed in the background
    L_i = C_i + E_i        O_i = I_i

  • L_i: checkpoint latency
    – Time to complete a checkpoint (C_i) and encoding (E_i)
  • O_i: checkpoint overhead
    – The increase in the application's execution time
  • C_i & R_i: checkpoint/restart time (see the worked example below)
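As a worked instance of the C_i formula, take the burst buffer configuration from the experimental setup (Slide 25): 5 GB/node checkpoints and 32 compute nodes sharing one 10 GB/s burst buffer, assuming all 32 nodes checkpoint concurrently:

    C_1 = (5\,\mathrm{GB/node} \times 32\,\mathrm{nodes}) / (10\,\mathrm{GB/s}) = 16\,\mathrm{s}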
SLIDE 21

Recursive Structured Storage Model

  • A generalization of storage architectures with a "context-free grammar"
    – A tier-i hierarchical entity (H_i) has a storage (S_i) shared by m_i upper hierarchical entities (H_{i-1})
    – H_0 is a compute node
    – Written as H_N{m_1, m_2, ..., m_N}
  • e.g.) H_2{4, 2}
    – H_2 has an S_2 shared by 2 H_1
    – Each H_1 has an S_1 shared by 4 H_0
    – H_0 is a compute node

[Figure: H_2{4,2} — one S_2 shared by two H_1 entities, each with an S_1 shared by four H_0 compute nodes (compute nodes 1–8)]

SLIDE 22

Recursive Structured Storage Model (cont'd)

  • The number of nodes accessing each S_i (see the code sketch below):

    <# of C/R nodes per S_i> = K / <# of S_i>,   where K is the C/R cluster size
    <# of S_i> = \prod_{k=i+1}^{N} m_k   (i < N);   1   (i = N)

  • e.g.) K = 4 in H_2{4, 2}
    – # of C/R nodes per S_1: 4/2 = 2 nodes
    – # of C/R nodes per S_2: 4/1 = 4 nodes

[Figure: the H_2{4,2} hierarchy with a four-node C/R cluster spread across two S_1 storages and one S_2]
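A small code sketch of these two formulas (illustrative only; the 1-indexed m[] array mirrors the H_N{m_1, ..., m_N} notation):

    #include <stdio.h>

    /* #S_i = product of m_k for k = i+1 .. N; the empty product (i == N) is 1. */
    long num_Si(const int m[], int N, int i) {
        long p = 1;
        for (int k = i + 1; k <= N; k++) p *= m[k];
        return p;
    }

    /* Nodes of a C/R cluster of size K that share each S_i. */
    long cr_nodes_per_Si(const int m[], int N, int i, long K) {
        return K / num_Si(m, N, i);
    }

    int main(void) {
        int m[3] = {0, 4, 2};   /* H_2{4,2}: m[1]=4, m[2]=2 (index 0 unused) */
        long K = 4;             /* C/R cluster size from the slide's example */
        printf("per S1: %ld\n", cr_nodes_per_Si(m, 2, 1, K));  /* 4/2 = 2 */
        printf("per S2: %ld\n", cr_nodes_per_Si(m, 2, 2, K));  /* 4/1 = 4 */
        return 0;
    }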
SLIDE 23

Evaluations

  • IBIO performance
    – Sequential read/write for C/R
  • Several system efficiency evaluations

SLIDE 24

Sequential IBIO read/write performance

  • The chunk size was set to 64 MB for both IBIO and NFS to maximize throughput

[Figure: read/write throughput (GB/s) vs. # of processes for Read/Write – Peak, Local, IBIO, and NFS; node specification and the 8-mSATA RAID configuration as on Slide 14, interconnect Mellanox FDR HCA (MCX354A-FCBT)]

IBIO achieves the same remote read/write performance as local read/write by using RDMA.

SLIDE 25

Experimental setup

    Checkpoint size: 5 GB/node
    Logging cluster size: 16 nodes *

    Burst buffer system: H_2{32, 34} — 1,088 compute nodes, one burst buffer (S_1) per 32 compute nodes, 34 burst buffers under the PFS (S_2)
    Flat buffer system: H_2{1, 1088} — each of the 1,088 compute nodes has its own SSD (S_1) above the PFS (S_2)

[Figure: the two configurations with their storage performance labels — 1 compute node: read 500 MB/s, write 260 MB/s; 32 compute nodes: read 16 GB/s, write 8.32 GB/s; per burst buffer: read 10 GB/s, write 10 GB/s; aggregate read 544 GB/s, aggregate write 283 GB/s]

The system sizes are based on the Coastal cluster at LLNL (88.5 TFLOPS).

* A. Guermouche, T. Ropars, M. Snir and F. Cappello, "HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications".

SLIDE 26

Experimental setup (failure rates)

    Estimated failure rates:
                                          Burst buffer system    Flat buffer system
                                          H_2{32, 34}            H_2{1, 1088}
    Level 1 (XOR checkpoint required)     2.63 x 10^-6           2.14 x 10^-6
    Level 2 (PFS checkpoint required)     1.33 x 10^-8           4.28 x 10^-7

The estimated failure rates are based on the failure analysis of the Coastal cluster at LLNL (88.5 TFLOPS) [1].

[1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10.
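One way to read the table (an illustration only; the slide does not state the units of the rates, here assumed to be system-wide failures per second):

    \lambda_{\mathrm{flat}} = 2.14 \times 10^{-6} + 4.28 \times 10^{-7} \approx 2.57 \times 10^{-6}\,/\mathrm{s}
    \mathrm{MTBF} = 1/\lambda_{\mathrm{flat}} \approx 3.9 \times 10^{5}\,\mathrm{s} \approx 4.5\ \text{days}

The burst buffer system's total rate is similar (about 2.64 x 10^-6), but nearly all of it shifts to level 1, where recovery is far cheaper than a PFS restart.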

SLIDE 27

Efficiency with Increasing Failure Rates and Checkpoint Costs

  • Assuming there is no message logging overhead

[Figure: efficiency vs. scale factor (xF, xL2) for Flat Buffer–Coordinated, Flat Buffer–Uncoordinated, Burst Buffer–Coordinated, and Burst Buffer–Uncoordinated; the x-axis spans MTBFs from days or a day down to a few hours and one hour]

  • At an MTBF of days or a day, there are no big efficiency differences
  • At an MTBF of a few hours, systems with burst buffers can still achieve high efficiency
  • Even at an MTBF of one hour, uncoordinated C/R can still achieve 70% efficiency
  • Partial restart accelerates recovery from burst buffer and PFS checkpoints

SLIDE 28

Allowable Message Logging Overhead

  • The logging overhead must be relatively small, less than a few percent, when MTBF is days or a day
    – At an MTBF of a few hours or one hour, very high message logging overheads are tolerated
  ⇒ Uncoordinated checkpointing can be more effective on future systems

Message logging overhead allowed in uncoordinated checkpointing while still achieving higher efficiency than coordinated checkpointing:

    Scale factor    Flat buffer    Burst buffer
    1               0.0232%        0.00435%
    2               0.0929%        0.0175%
    10              2.45%          0.468%
    50              84.5%          42.0%
    100             ~100%          99.9%

SLIDE 29

Effect of Improving Storage Performance

To see which storage level impacts efficiency, we increase the performance of the level-1 and level-2 storage while keeping MTBF at one hour.

[Figure: two panels — efficiency vs. scale factor for L1 performance improvement and for L2 performance improvement, each with Flat Buffer–Coordinated, Flat Buffer–Uncoordinated, Burst Buffer–Coordinated, and Burst Buffer–Uncoordinated]

  • Improving level-1 storage performance does not impact efficiency, for either flat buffer or burst buffer systems
  • Increasing the performance of the PFS does impact system efficiency
  ⇒ L2 C/R overhead is a major cause of degraded efficiency, so reducing the level-2 failure rate and improving level-2 C/R performance are critical on future systems

SLIDE 30

Ratio of Compute Nodes to Burst Buffer Nodes

Another thing to consider when building a burst buffer system is the ratio of compute nodes to burst buffer nodes.

  • The ratio does not matter much when MTBF is from a day to days
  • When MTBF is a few hours, a larger number of burst buffer nodes decreases efficiency
  ⇒ Adding burst buffer nodes increases the failure rate, which degrades system efficiency more than the efficiency gained from the increased bandwidth

[Figure: two panels (coordinated, uncoordinated) — efficiency vs. scale factor (xF, xL2) for 1, 2, 4, 8, 16, and 32 compute nodes per burst buffer node]

SLIDE 31

Towards resilient extreme-scale computing

  1. Burst buffers
     – Burst buffers are beneficial for C/R at extreme scale
  2. Uncoordinated C/R
     – When MTBF is days or a day, uncoordinated C/R may not be effective
     – If MTBF is a few hours or less, it will be effective
  3. Level-2 failure rate and level-2 performance
     – Reducing level-2 failures and increasing level-2 performance are critical to improving overall system efficiency
  4. Fewer burst buffers
     – Adding burst buffer nodes increases the failure rate
     – This may degrade system efficiency more than the efficiency gained from the increased bandwidth
     – We need to be careful about the trade-off between the I/O performance and the reliability of burst buffers

SLIDE 32

Conclusion

  • Fault tolerance is critical at extreme scale
    – Both the C/R strategy and the storage design are important
  • We developed IBIO to maximize remote access performance to burst buffers, and we modeled C/R strategies and storage designs
  • We identified key factors for building resilient systems based on our evaluations
  • We expect our findings can help system designers create efficient and cost-effective systems

SLIDE 33

Q & A

Speaker: Kento Sato
    kent@matsulab.is.titech.ac.jp
    Tokyo Institute of Technology (Tokyo Tech)
    Research Fellow of the Japan Society for the Promotion of Science
    http://matsu-www.is.titech.ac.jp/~kent/index_en.html

Collaborators: Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, Satoshi Matsuoka

Acknowledgement: This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-645209). This work was also supported by a Grant-in-Aid for Research Fellows of the Japan Society for the Promotion of Science (JSPS Fellows) 24008253, and a Grant-in-Aid for Scientific Research S 23220003.