1. Kento Sato †1, Kathryn Mohror †2, Adam Moody †2, Todd Gamblin †2, Bronis R. de Supinski †2, Naoya Maruyama †3 and Satoshi Matsuoka †1
   †1 Tokyo Institute of Technology   †2 Lawrence Livermore National Laboratory   †3 RIKEN Advanced Institute for Computational Science
   This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-654744-DRAFT
   May 27th, 2014, CCGrid2014 @ Chicago

2. Failures on HPC systems
   • Exponential growth in computational power
     – Enables finer-grained simulations with shorter time steps
   • The overall failure rate increases accordingly because of the increasing system size
   • 191 failures out of 5 million node-hours
     – A production run of the laser-plasma interaction code pF3D
     – Hera, Atlas and Coastal clusters @ LLNL
   • Estimated MTBF (without per-component hardware reliability improvements in the future):
       1,000 nodes:   MTBF 1.2 days   (measured)
       10,000 nodes:  MTBF 2.9 hours  (estimated)
       100,000 nodes: MTBF 17 minutes (estimated)
   • It will be difficult for applications to run continuously for a long time without fault tolerance at extreme scale
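   The estimates above are consistent with assuming the system failure rate grows linearly with node count, so the MTBF shrinks inversely with system size. A minimal sketch of that extrapolation (the 1.2-day measured baseline is from the slide; the inverse-scaling assumption is ours):

```python
# Sketch: extrapolate system MTBF assuming the failure rate grows linearly
# with node count (i.e., no per-component reliability improvement).
baseline_nodes = 1_000
baseline_mtbf_hours = 1.2 * 24   # 1.2 days measured at 1,000 nodes

for nodes in (1_000, 10_000, 100_000):
    mtbf_hours = baseline_mtbf_hours * baseline_nodes / nodes
    print(f"{nodes:>7} nodes: MTBF ~ {mtbf_hours:5.2f} hours ({mtbf_hours * 60:.0f} minutes)")
# -> ~28.8 hours (1.2 days), ~2.9 hours, ~17 minutes, matching the slide's estimates
```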

3. Checkpoint/Restart (Software-Level)
   • Idea of Checkpoint/Restart
     – Checkpoint: periodically save snapshots of an application's state to the PFS
     – Restart: on a failure, restart the execution from the latest checkpoint
   [Figure: checkpoint/restart timeline — periodic checkpoints written to the parallel file system (PFS), checkpointing overhead, and a restart after a failure]
   • Improved Checkpoint/Restart
     – Multi-level checkpointing [1]
     – Asynchronous checkpointing [2]
     – In-memory diskless checkpointing [3]
   • We found that software-level approaches may be limited in increasing resiliency at extreme scale

   [1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10
   [2] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12
   [3] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery", IPDPS2014
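   To make the checkpoint/restart idea concrete, here is a minimal, hypothetical sketch of a periodic checkpoint loop; the directory, interval, and the `compute_step`/`state` objects are illustrative assumptions, not part of the presented system:

```python
import glob
import os
import pickle

CKPT_DIR = "/pfs/app/ckpt"   # hypothetical checkpoint directory on the PFS
CKPT_INTERVAL = 100          # steps between checkpoints (illustrative)

def latest_checkpoint():
    # Zero-padded step numbers make lexicographic sort == chronological sort.
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "ckpt_*.pkl")))
    return ckpts[-1] if ckpts else None

def run(total_steps, compute_step, initial_state):
    # Restart: if a checkpoint exists, resume from the latest one.
    ckpt = latest_checkpoint()
    if ckpt:
        with open(ckpt, "rb") as f:
            step, state = pickle.load(f)
    else:
        step, state = 0, initial_state

    while step < total_steps:
        state = compute_step(state)
        step += 1
        # Checkpoint: periodically save a snapshot of the application state.
        if step % CKPT_INTERVAL == 0:
            path = os.path.join(CKPT_DIR, f"ckpt_{step:08d}.pkl")
            with open(path, "wb") as f:
                pickle.dump((step, state), f)
    return state
```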

4. Storage architectures
   We consider architecture-level approaches
   • Burst buffer
     – A new tier in the storage hierarchy
     – Absorbs bursty I/O requests from applications
     – Fills the performance gap between node-local storage and PFSs in both latency and bandwidth
   [Figure: compute nodes → burst buffers → parallel file system]
   • If you write checkpoints to burst buffers,
     – Faster checkpoint/restart time than the PFS
     – More reliable than storing on compute nodes
   • However ...
     – Adding burst buffer nodes may increase the total system size, and failure rates accordingly
     – Because burst buffers also connect to the network, they may still be a bottleneck
   • It is not clear whether burst buffers improve overall system efficiency

   [4] Doraimani, Shyamala and Iamnitchi, Adriana, "File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces", HPDC '08

5. Goal and Contributions
   • Goal:
     – Develop an interface to exploit the bandwidth of burst buffers
     – Explore the effectiveness of burst buffers
     – Find the best C/R strategy on burst buffers
   • Contributions:
     – Development of IBIO, exploiting the bandwidth to burst buffers
     – A model to evaluate system resiliency given a C/R strategy and storage configuration
     – Our experimental results show a direction for building resilient systems for extreme-scale computing

6. Outline
   • Introduction
   • Checkpoint strategies
   • Storage designs
   • IBIO: InfiniBand-based I/O interface
   • Modeling
   • Experiments
   • Conclusion

7. Diskless checkpoint/restart (C/R)
   • Diskless C/R
     – Creates redundant data across local storage on compute nodes using an encoding technique such as XOR
     – Can restore checkpoints lost in a failure of a small number of nodes, like RAID-5
   [Figure: XOR encoding example — checkpoint blocks A1-A3, B1-B3, C1-C3, D1-D3 and parity blocks 1-4 distributed across Nodes 1-4; a failed node's checkpoint can be rebuilt from the surviving blocks]
   • Most failures come from a single node, or can be recovered from an XOR checkpoint
     – e.g. 1) TSUBAME2.0: 92% of failures
     – e.g. 2) LLNL clusters: 85% of failures
   • The rest of the failures still require a checkpoint on a reliable PFS
   [Figure: failure analysis — TSUBAME2.0: 92% LOCAL/XOR/PARTNER checkpoint vs. 8% PFS checkpoint; LLNL clusters: 85% vs. 15%]
   • Diskless checkpointing is a promising approach
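   As an illustration of the XOR encoding idea, here is a minimal sketch; the fixed-size byte-array checkpoints and the four-node layout are assumptions for the example, not the presented implementation:

```python
# Minimal sketch of XOR-based diskless checkpointing across 4 compute nodes.
# Each node holds its own checkpoint; one parity block lets any single lost
# checkpoint be rebuilt from the surviving ones (RAID-5 style).
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Hypothetical fixed-size checkpoints, one per compute node.
ckpts = [bytes([i] * 8) for i in range(4)]   # nodes 0..3
parity = xor_blocks(ckpts)                   # stored on a different node

# Node 2 fails and its checkpoint is lost; rebuild it from the rest + parity.
survivors = [c for i, c in enumerate(ckpts) if i != 2]
recovered = xor_blocks(survivors + [parity])
assert recovered == ckpts[2]
```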

8. Multi-level Checkpoint/Restart (MLC/R) [1,2]
   [Figure: MLC Markov model — checkpoint intervals t + c_k and level-k recoveries r_k, with transition probabilities p_0, p_i and expected durations t_0, t_i]
   • Model parameters
     – t : checkpoint interval, c_k : level-k checkpoint time, r_k : level-k recovery time
     – λ_i : level-i failure rate, λ = Σ λ_i : overall failure rate
     – p_0(T) = e^(−λT) : probability of no failure during T seconds
     – t_0(T) = T : expected elapsed time given no failure
     – p_i(T) = (λ_i / λ)·(1 − e^(−λT)) : probability of a level-i failure during T seconds
     – t_i(T) = (1 − (λT + 1)·e^(−λT)) / (λ·(1 − e^(−λT))) : expected elapsed time given a failure within T seconds
   • MLC hierarchically uses storage levels
     – Diskless checkpoint (level 1): frequent, for single-node or few-node failures
     – PFS checkpoint (level 2): less frequent and asynchronous, for multi-node failures
   • Our evaluation showed system efficiency drops to less than 10% when the MTBF is a few hours
   [Figure: efficiency vs. scale factor (xF, xL2), with markers for an MTBF of days or a day and an MTBF of a few hours]

   [1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10
   [2] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12
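   A minimal sketch of the transition probabilities and expected durations above; the formulas are those reconstructed from the slide, while the two failure rates and the example interval are illustrative assumptions:

```python
import math

# Illustrative failure rates (per second) for a two-level model:
# level 1 = failures recoverable from a diskless/XOR checkpoint,
# level 2 = failures requiring a PFS checkpoint.
lambdas = {1: 1.0 / (3 * 3600), 2: 1.0 / (24 * 3600)}
lam = sum(lambdas.values())   # overall failure rate λ

def p0(T):
    """Probability of no failure during T seconds."""
    return math.exp(-lam * T)

def t0(T):
    """Expected elapsed time given no failure: just T."""
    return T

def pi(i, T):
    """Probability that a level-i failure occurs during T seconds."""
    return (lambdas[i] / lam) * (1.0 - math.exp(-lam * T))

def ti(T):
    """Expected elapsed time before a failure, given one occurs within T."""
    return (1.0 - (lam * T + 1.0) * math.exp(-lam * T)) / (lam * (1.0 - math.exp(-lam * T)))

# Example: one segment of t = 600 s plus a level-1 checkpoint of c_1 = 30 s.
T = 600 + 30
print(p0(T), pi(1, T), pi(2, T), ti(T))   # p0 + p1 + p2 == 1
```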

9. Uncoordinated C/R + MLC
   [Figure: coordinated C/R — processes P0-P3 all checkpoint together and all roll back on a failure; uncoordinated C/R — Cluster A (P0, P1) and Cluster B (P2, P3) checkpoint independently, with message logging between clusters, so only the failed cluster rolls back]
   • Coordinated C/R
     – All processes globally synchronize before taking checkpoints, and all restart on a failure
     – Restart overhead
   • Uncoordinated C/R
     – Creates clusters of processes, and logs messages exchanged between clusters
     – Message logging overhead is incurred, but rolling back only one cluster can restart the execution on a failure (see the sketch after this slide)
   [Figure: efficiency vs. scale factor (xF, xL2) for coordinated vs. uncoordinated C/R, with markers for an MTBF of days or a day and an MTBF of a few hours]
   ⇒ MLC + uncoordinated C/R (software-level) approaches may be limited at extreme scale
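   A minimal sketch of the message-logging idea behind uncoordinated C/R; the cluster layout, the `deliver` callback, and the in-memory log are illustrative assumptions, not the presented protocol:

```python
# Messages that cross a cluster boundary are logged by the sender, so a failed
# cluster can be replayed without rolling back the other cluster.
cluster_of = {"P0": "A", "P1": "A", "P2": "B", "P3": "B"}
message_log = []   # (sender, receiver, payload) for inter-cluster messages only

def send(sender, receiver, payload, deliver):
    if cluster_of[sender] != cluster_of[receiver]:
        message_log.append((sender, receiver, payload))   # log at the boundary
    deliver(receiver, payload)

def replay_into(failed_cluster, deliver):
    """After restarting only the failed cluster from its latest checkpoint,
    re-deliver the logged messages it had received from other clusters."""
    for sender, receiver, payload in message_log:
        if cluster_of[receiver] == failed_cluster and cluster_of[sender] != failed_cluster:
            deliver(receiver, payload)
```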

10. Storage designs
   • In addition to the software-level approaches, we also explore two architecture-level approaches
     – Flat buffer system: the current storage system
     – Burst buffer system: a separate buffer space
   [Figure: flat buffer system — compute nodes 1-4, each with a dedicated SSD 1-4, above the PFS (parallel file system); burst buffer system — compute nodes 1-4 sharing SSDs 1-4 on separate burst buffer nodes above the PFS]

11. Flat Buffer Systems
   • Design concept
     – Each compute node has its own dedicated node-local storage
     – Scalable with an increasing number of compute nodes
   [Figure: flat buffer system — compute nodes 1-4, each with a dedicated SSD 1-4, above the PFS; during a restart some SSDs sit idle]
   • This design has drawbacks:
     1. Unreliable checkpoint storage
        e.g.) If compute node 2 fails, a checkpoint on SSD 2 will be lost because SSD 2 is physically attached to the failed compute node 2
     2. Inefficient utilization of storage resources with uncoordinated checkpointing
        e.g.) If compute nodes 1 & 3 are in the same cluster and restart from a failure, the bandwidth of SSDs 2 & 4 will not be utilized

12. Burst Buffer Systems
   • Design concept
     – A burst buffer is a storage space that bridges the gap in latency and bandwidth between node-local storage and the PFS
     – Shared by a subset of compute nodes
   [Figure: burst buffer system — compute nodes 1-4 sharing SSDs 1-4 on burst buffer nodes above the PFS; a checkpoint survives the failure of compute node 2]
   • Although additional nodes are required, there are several advantages:
     1. More reliable, because burst buffers are located on a smaller number of nodes
        e.g.) Even if compute node 2 fails, the checkpoint of compute node 2 is accessible from another compute node, such as node 1
     2. Efficient utilization of storage resources with uncoordinated checkpointing
        e.g.) If compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize all SSD bandwidth, unlike in a flat buffer system
