
Blocking and Non-blocking Checkpointing and Rollback Recovery for Networks-on-Chip



  1. Blocking and Non-blocking Checkpointing and Rollback Recovery for Networks-on-Chip
     Claudia Rusu (1), Cristian Grecu (2), Lorena Anghel (1)
     (1) TIMA Laboratory, CNRS-UJF-INP, Grenoble, France
     (2) SoC Laboratory, University of British Columbia, Vancouver, Canada
     WDSN 2008 – Anchorage, AK

  2. Outline
     • Introduction
       – Networks-on-Chip
       – Checkpoint and rollback recovery
     • Coordinated checkpointing
     • Blocking and non-blocking coordinated checkpointing
     • Case study
     • Conclusions and future work

  3. Network-on-Chip based Systems
     • NoC vs. traditional connection systems
     • NoC advantages
       – Efficient sharing of wires
       – Shorter design time, lower effort
       – Scalability
     [Figure: P2P, bus, and NoC interconnects; legend: Router, PE, Link]

  4. NoC QoS vs. Faults
     • Quality of service (QoS)
       – reliability, throughput, latency, bandwidth
     • Unreliable signal transmission medium
       – timing and data errors
       – process variation, crosstalk, electromagnetic interference, radiation
     • Technology downscaling
     • Increased system complexity
     => Increased vulnerability to faults

  5. Fault Tolerance in Networks-on-Chip
     • Faults and fault tolerance
       – At different NoC components
         • Links
         • Routers
           – switching blocks
           – memories
       – At different levels of the communication protocol stack
         (Application, Transport, Network, Data link, Physical)
     • Fault-tolerant solutions
       – adaptive routing
       – stochastic communication
       – EDC, ECC, NMR
     [Figure legend: Router, PE, Link, Fault]

  6. Outline (section: Checkpoint and rollback recovery)

  7. Checkpoint and Rollback Recovery. Principle
     • No failure tolerance
       – Failure => Restart
     • Checkpoint and rollback recovery
       – Failure => Resume from a more recent state
       – Principle
         • Failure-free: periodically store states on stable storage
         • Failure: roll back to the last consistent stored state
     [Figure: two timelines from start to failure; one restarts from the start, the other rolls back to a consistent state]
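The principle on this slide can be illustrated with a minimal single-task sketch. Everything named here (`CheckpointedTask`, `run`, the dict-based state, the in-memory copy standing in for stable storage) is hypothetical and only meant to show the idea, not the mechanism used in the talk.

```python
import copy


class CheckpointedTask:
    """Toy task whose state is a dict; stable storage is modeled as a saved copy."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # Assumption: the initial state counts as the first stored checkpoint.
        self._stable = copy.deepcopy(self.state)

    def work(self):
        self.state["step"] += 1
        self.state["total"] += self.state["step"]

    def checkpoint(self):
        # Failure-free operation: periodically store the state.
        self._stable = copy.deepcopy(self.state)

    def rollback(self):
        # On failure: resume from the last stored state.
        self.state = copy.deepcopy(self._stable)


def run(total_steps, checkpoint_every, fail_at):
    """Run to total_steps with one failure after step fail_at.

    Returns (final state, number of steps actually executed).
    """
    task = CheckpointedTask()
    executed = 0
    failed = False
    while task.state["step"] < total_steps:
        task.work()
        executed += 1
        if not failed and task.state["step"] == fail_at:
            failed = True
            task.rollback()
            continue
        if task.state["step"] % checkpoint_every == 0:
            task.checkpoint()
    return task.state, executed
```

With `run(10, 4, 7)` the failure after step 7 costs only the three steps since the checkpoint at step 4; with `run(10, 100, 7)` no checkpoint is ever taken, so the failure amounts to a restart from the beginning.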

  8. Checkpoint and Rollback Recovery. Consistent State
     • Message types vs. recovery line
       [Figure: tasks T_A and T_B with saved states S_A and S_B; past, future, late, and early/orphan messages relative to the recovery line]
     • Consistent state with late messages
       [Figure: tasks T_A and T_B with saved states S_A and S_B; past, future, and late messages only]
     • Early messages are avoided
     • Late messages are to be replayed after rollback
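The four message types can be told apart by comparing send and receive times with the checkpoint times of the two tasks. The sketch below uses the usual definitions (late = sent before the sender's checkpoint, received after the receiver's; early/orphan = the reverse); the `Message` type, the function names, and the use of plain numeric timestamps are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sent_at: float      # send time on the sender's timeline
    received_at: float  # receive time on the receiver's timeline


def classify(msg, sender_ckpt, receiver_ckpt):
    """Classify a message relative to the recovery line formed by two checkpoints."""
    sent_before = msg.sent_at < sender_ckpt
    received_before = msg.received_at < receiver_ckpt
    if sent_before and received_before:
        return "past"
    if not sent_before and not received_before:
        return "future"
    if sent_before:
        return "late"    # in transit across the recovery line: must be replayed
    return "early"       # orphan: received, but the send is not in the saved state


def is_consistent(messages, sender_ckpt, receiver_ckpt):
    """A recovery line is consistent if no message crosses it as an early message."""
    return all(classify(m, sender_ckpt, receiver_ckpt) != "early" for m in messages)
```

For example, with the sender checkpointing at time 3 and the receiver at time 5, a message sent at 4 and received at 4.5 is an early message, so that pair of checkpoints does not form a consistent state.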

  9. Checkpoint and Rollback Recovery. Classification
     • Checkpointing
       – coordinated
         • blocking
         • non-blocking
       – uncoordinated
       – communication-induced
     • Message logging
       – optimistic
       – pessimistic
       – causal

  10. Outline (section: Coordinated checkpointing)

  11. Coordinated Checkpointing
     • Task checkpoint
       – task state
       – list of late messages
     • Late messages log
       – optimistic approach -> small latency on failure-free
       – logged at receiver -> small recovery overhead
     • Unique coordinator
       – reduced overhead
     • Unique blocking and non-blocking synchronization protocol
       – allows, for the same checkpoint, the blocking of a task set and the non-blocking of another
     • Principle
       – Failure-free: synchronization –> consistent state
       – Failure: rollback to the last consistent state
     [Figure: tasks T_A to T_D over time, divided into epochs by synchronizations that form global consistent states; a rollback returns to the last one]

  12. Synchronization. Markers
     • Markers
       – are used to
         • avoid early messages
         • identify late messages and to end the log of late messages
       – dedicated messages (avoid long checkpointing durations when communication among certain tasks is scarce)
     • A task has taken the checkpoint only after state and late messages from other tasks are on stable storage
     [Figure: Inconsistent state. Tasks T_A, T_B, T_C; message 1 (early), message 2 (late)]
     [Figure: Consistent state using markers. Tasks T_A, T_B, T_C; message 1, marker 1, message 2 (for replay), marker 2]
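How a marker can both end the late-message log and prevent early messages is sketched below from the receiver's side. This is a Chandy-Lamport-style illustration under stated assumptions, not necessarily the exact rule used in the talk: channels are assumed FIFO, each peer is assumed to send its marker right after its own checkpoint, and the `Receiver` class with its method names is hypothetical.

```python
class Receiver:
    """Receiver-side bookkeeping for one checkpoint round (assumes FIFO channels)."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.checkpointed = False
        self.awaiting_marker = set()  # channels whose late-message log is still open
        self.late_log = []            # (peer, payload) pairs to replay after rollback
        self.delivered = []           # everything handed to the application

    def take_checkpoint(self):
        if not self.checkpointed:
            self.checkpointed = True
            self.awaiting_marker = set(self.peers)

    def on_marker(self, peer):
        # Assumption: a marker arriving before the local checkpoint forces it, so
        # nothing the peer sent after its checkpoint is received before ours
        # (this is what rules out early messages on a FIFO channel).
        self.take_checkpoint()
        # The marker ends the log of late messages for this channel.
        self.awaiting_marker.discard(peer)

    def on_message(self, peer, payload):
        # After the local checkpoint and before the peer's marker: a late message.
        if self.checkpointed and peer in self.awaiting_marker:
            self.late_log.append((peer, payload))
        self.delivered.append((peer, payload))

    def checkpoint_complete(self):
        # Complete once markers from all peers have closed their logs.
        return self.checkpointed and not self.awaiting_marker
```

A short usage example: after `r = Receiver(["A", "C"])`, `r.take_checkpoint()`, and `r.on_message("C", "m2")`, the message `m2` is logged as late; a later `r.on_marker("C")` closes that channel's log, and messages from `C` after the marker are delivered without logging.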

  13. Outline (section: Blocking and non-blocking coordinated checkpointing)

  14. Blocking and Non-blocking Coordinated Checkpointing Protocol
     • Checkpointing protocol
       Initiator
         - broadcast CK_REQ
         - when CK_TAKEN received from all tasks
           - validate global checkpoint
       Non-initiator (blocking or not)
         - on CK_REQ receipt
           - broadcast CK_START
         - when CK_START received from all tasks
           - take local checkpoint
           - send CK_TAKEN to initiator
     • Synchronization messages
       [Figure: initiator I exchanging synchronization messages with tasks T_A, T_B, T_C, T_D]
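The message exchange listed on this slide can be traced with a small event-queue simulation. It is a sketch under assumptions: the initiator is modeled as a separate node `"I"` outside the task list, delivery is a single FIFO queue, and the blocking versus non-blocking handling of application messages is not modeled at all. The function name `run_checkpoint_round` is hypothetical.

```python
from collections import Counter, deque


def run_checkpoint_round(tasks, initiator="I"):
    """Simulate one checkpoint round; returns (message counts per type, validated)."""
    queue = deque()
    counts = Counter()
    start_seen = {t: set() for t in tasks}  # CK_START senders seen by each task
    taken_seen = set()                      # tasks whose CK_TAKEN reached the initiator
    checkpointed = set()
    validated = False

    def send(kind, src, dst):
        counts[kind] += 1
        queue.append((kind, src, dst))

    # Initiator: broadcast CK_REQ.
    for t in tasks:
        send("CK_REQ", initiator, t)

    while queue:
        kind, src, dst = queue.popleft()
        if kind == "CK_REQ":
            # Non-initiator: on CK_REQ receipt, broadcast CK_START to the other tasks.
            for other in tasks:
                if other != dst:
                    send("CK_START", dst, other)
        elif kind == "CK_START":
            start_seen[dst].add(src)
        elif kind == "CK_TAKEN":
            taken_seen.add(src)
            # Initiator: validate once CK_TAKEN has arrived from all tasks.
            if taken_seen == set(tasks):
                validated = True
        # Non-initiator: with CK_START from every other task, take the local
        # checkpoint and report it to the initiator.
        if (dst in start_seen and dst not in checkpointed
                and len(start_seen[dst]) == len(tasks) - 1):
            checkpointed.add(dst)
            send("CK_TAKEN", dst, initiator)
    return counts, validated
```

For the four tasks drawn on the slide, `run_checkpoint_round(["T_A", "T_B", "T_C", "T_D"])` validates the global checkpoint after 4 CK_REQ, 12 CK_START, and 4 CK_TAKEN messages.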

  15. Blocking and Non-blocking Overhead
     • Synchronization messages ($n$ nodes)
       – CK_REQ: $n$
       – CK_START: $n(n-1)$
       – CK_TAKEN: $n$
       => $O(n^2)$ overall
     • Messages in NoC during checkpointing
       – Blocking: synchronization messages
       – Non-blocking: synchronization messages, application messages
     [Figure: initiator I and tasks T_A, T_B, T_C, T_D]
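Summing the three per-type counts from the slide gives the quadratic total directly. The symbol $N_{\text{sync}}$ is introduced here only for illustration, and $n = 16$ is chosen as an example value (the size of a 4x4 mesh), assuming the initiator is not counted among the $n$ nodes.

```latex
\begin{align*}
N_{\text{sync}}(n) &= \underbrace{n}_{\text{CK\_REQ}}
                    + \underbrace{n(n-1)}_{\text{CK\_START}}
                    + \underbrace{n}_{\text{CK\_TAKEN}}
                    = n^2 + n = O(n^2), \\
N_{\text{sync}}(16) &= 16 + 240 + 16 = 272 .
\end{align*}
```

The CK_START broadcast dominates: it accounts for 240 of the 272 messages in this example.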

  16. Checkpointing Duration
     • High overhead during checkpointing –> checkpointing phase reduced
     • Long checkpointing durations –> reduced number of checkpoints
     • When the time between failures is comparable with the checkpointing duration -> rollbacks to the same old checkpoint
     [Figure: two timelines of tasks T_A to T_D, each showing a rollback]
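The last bullet can be made concrete with a toy timing model. It rests on assumptions that are not stated on the slide: checkpoints start at fixed times, a failure that occurs while a checkpoint is in progress aborts it, and time 0 stands for the initial state. The function `rollback_targets` is hypothetical.

```python
def rollback_targets(ckpt_starts, ckpt_duration, failure_times):
    """For each failure, return the completion time of the checkpoint rolled back to.

    A checkpoint started at s completes at s + ckpt_duration unless a failure
    occurs in [s, s + ckpt_duration). Time 0 denotes the initial state.
    """
    targets = []
    for f in sorted(failure_times):
        completed = [
            s + ckpt_duration
            for s in ckpt_starts
            if s + ckpt_duration <= f
            and not any(s <= g < s + ckpt_duration for g in failure_times)
        ]
        targets.append(max(completed, default=0))
    return targets
```

With checkpoints starting at 10, 20, and 30 and failures at 15, 25, and 35, a duration of 2 lets each failure roll back to a fresh checkpoint (`[12, 22, 32]`), whereas a duration of 8 means every checkpoint is interrupted and all three rollbacks return to the same initial state (`[0, 0, 0]`).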

  17. Outline (section: Case study)

  18. Case Study
     • 4x4 mesh direct NoC
       – XY routing
       – Wormhole switching
     • Consider
       – Different traffic loads
         • uniform traffic loads
         • constant message length
       – Different failure rates
     • Analyze
       – Checkpointing duration and overhead
       – Application latency
     [Figure: mesh NoC; legend: Router, PE, Link]
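For readers unfamiliar with XY routing, a minimal sketch follows: a packet first travels along the X dimension until its column matches the destination, then along Y. The `(x, y)` coordinate convention and the function name `xy_route` are assumptions for illustration; wormhole switching and the rest of the case-study setup are not modeled.

```python
def xy_route(src, dst):
    """Return the routers (x, y) visited from src to dst, both included."""
    x, y = src
    path = [(x, y)]
    # First resolve the X offset...
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # ...then the Y offset.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

On a 4x4 mesh, `xy_route((0, 0), (2, 1))` visits `(0, 0)`, `(1, 0)`, `(2, 0)`, `(2, 1)`. Because every packet finishes its X moves before any Y move, the route between two given routers is always the same.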

  19. Checkpointing Duration and Overhead
     • Checkpointing Duration
     • Memory Overhead

  20. Application Latency

  21. Outline (section: Conclusions and future work)

  22. Conclusions and Future Work
     • Blocking and non-blocking coordinated checkpointing
       – unique protocol
     • Analyze and compare overhead and latency
       – Checkpointing duration increases with the traffic load
         • Non-blocking: significantly
         • Blocking: to a lesser extent
       – Application latency increases with the traffic load and the failure rate
         • Non-blocking: significantly
         • Blocking: to a lesser extent
       –> For higher traffic loads and higher failure rates, the blocking approach becomes mandatory
     • Future work
       – Evaluate the proposed protocol
         • on other traffic patterns
         • on applications with high traffic loads and critical tasks –> subsets of blocking and non-blocking tasks

  23. Thank you!
