Blocking and Non-blocking Checkpointing and Rollback Recovery for - - PowerPoint PPT Presentation

Blocking and Non-blocking Checkpointing and Rollback Recovery for Networks-on-Chip Claudia Rusu 1 , Cristian Grecu 2 , Lorena Anghel 1 1 TIMA Laboratory, CNRS-UJF-INP, Grenoble, France 2 SoC Laboratory, University of British Columbia, Vancouver,


SLIDE 1

WDSN 2008 – Anchorage, AK 1

Blocking and Non-blocking Checkpointing and Rollback Recovery for Networks-on-Chip

Claudia Rusu¹, Cristian Grecu², Lorena Anghel¹

¹ TIMA Laboratory, CNRS-UJF-INP, Grenoble, France

² SoC Laboratory, University of British Columbia, Vancouver, Canada

SLIDE 2

OUTLINE

  • Introduction

– Networks-on-Chip
– Checkpoint and rollback recovery

  • Coordinated checkpointing
  • Blocking and non-blocking coordinated checkpointing
  • Case study
  • Conclusions and future work
SLIDE 3

Network-on-Chip based Systems

  • NoC advantages

– Efficient sharing of wires
– Shorter design time, lower effort
– Scalability

  • NoC vs. traditional connection systems

[Figure: P2P, bus, and NoC interconnects; legend: router, PE, link]
SLIDE 4

NoC QoS vs. Faults

  • Quality of service (QoS)

– reliability, throughput, latency, bandwidth

  • Unreliable signal transmission medium

– timing and data errors
– process variation, crosstalk, electromagnetic interference, radiation

  • Technology downscaling
  • Increased system complexity

=> Increased vulnerability to faults
SLIDE 5

Fault Tolerance in Networks-on-Chip

  • Faults and Fault Tolerance

– At different NoC components

  • Links
  • Routers (switching blocks, memories)

– At different levels of the communication protocol stack (physical, data link, network, transport, application)

  • Fault tolerant solutions

– adaptive routing
– stochastic communication
– EDC, ECC, NMR

[Figure: NoC with a faulty link; legend: router, PE, link, fault]
SLIDE 6

OUTLINE

  • Introduction

– Networks-on-Chip
– Checkpoint and rollback recovery

  • Coordinated checkpointing
  • Blocking and non-blocking coordinated checkpointing
  • Case study
  • Conclusions and future work
SLIDE 7

Checkpoint and Rollback Recovery. Principle

  • No failure tolerance

– Failure => Restart

  • Checkpoint and rollback recovery

– Failure => Resume from a more recent state

[Figure: two timelines (t). Without failure tolerance: start, failure, restart. With rollback recovery: start, consistent state, failure, rollback.]

  • Principle

– Failure-free: periodically store states on stable storage
– Failure: roll back to the last consistent stored state
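
The principle on this slide can be illustrated with a minimal Python sketch. The class name `CheckpointedTask`, the in-memory `stable_storage` attribute, and the fixed checkpoint interval are assumptions made for illustration and are not taken from the presentation; real stable storage would have to survive the failure itself.

```python
import copy


class CheckpointedTask:
    """Toy task that periodically saves its state and can roll back."""

    def __init__(self, checkpoint_interval):
        self.state = {"step": 0, "total": 0}
        self.checkpoint_interval = checkpoint_interval
        # Stand-in for stable storage (assumed to survive failures).
        self.stable_storage = copy.deepcopy(self.state)

    def step(self, value):
        """Failure-free operation: do work, checkpoint every N steps."""
        self.state["step"] += 1
        self.state["total"] += value
        if self.state["step"] % self.checkpoint_interval == 0:
            self.stable_storage = copy.deepcopy(self.state)

    def rollback(self):
        """On failure: resume from the last stored state, not from the start."""
        self.state = copy.deepcopy(self.stable_storage)


task = CheckpointedTask(checkpoint_interval=3)
for value in [1, 2, 3, 4, 5]:
    task.step(value)

task.rollback()  # simulated failure after step 5
print(task.state)  # {'step': 3, 'total': 6}
```

Only the work done after step 3 is lost; without checkpointing the task would restart from step 0.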

SLIDE 8

Checkpoint and Rollback Recovery. Consistent State

  • Message types vs. recovery line

[Figure: timelines (t) of tasks TA and TB with SA and SB marked; messages labeled late, early/orphan, past, and future relative to the recovery line]

  • Consistent state with late messages

– early messages are avoided
– late messages are to be replayed after rollback

[Figure: same timelines with SA and SB; messages labeled late, past, and future, with no early/orphan message]
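
The four message types can be told apart by comparing send and receive times with the checkpoints of the sender and the receiver. The sketch below uses the usual definitions from the rollback-recovery literature and assumes a single global time axis, as in the figure; the function names and the tuple layout are hypothetical.

```python
def classify_message(send_time, sender_ckpt, recv_time, receiver_ckpt):
    """Classify one message relative to the recovery line.

    sender_ckpt / receiver_ckpt are the times at which the sender and
    the receiver took their checkpoints (assumed global time axis).
    """
    sent_before = send_time < sender_ckpt
    received_before = recv_time < receiver_ckpt
    if sent_before and received_before:
        return "past"
    if sent_before:
        return "late"          # in transit across the recovery line
    if received_before:
        return "early/orphan"  # receipt saved, send not saved
    return "future"


def is_consistent(messages):
    """A recovery line is consistent if no message crosses it as early/orphan."""
    return all(classify_message(*m) != "early/orphan" for m in messages)


# (send_time, sender_ckpt, recv_time, receiver_ckpt)
examples = [(1, 5, 3, 6), (4, 5, 8, 6), (6, 5, 7, 9), (7, 5, 10, 9)]
labels = [classify_message(*m) for m in examples]
print(labels)  # ['past', 'late', 'early/orphan', 'future']
```

Dropping the third example leaves a consistent state that still contains a late message, which is the case shown in the second figure.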

SLIDE 9

Checkpoint and Rollback Recovery. Classification

  • Checkpointing

– uncoordinated
– coordinated (blocking, non-blocking)
– communication-induced

  • Message logging

– pessimistic
– optimistic
– causal

SLIDE 10

OUTLINE

  • Introduction

– Networks-on-Chip
– Checkpoint and rollback recovery

  • Coordinated checkpointing
  • Blocking and non-blocking coordinated checkpointing
  • Case study
  • Conclusions and future work
SLIDE 11

Coordinated Checkpointing

  • Principle

[Figure: timelines of tasks TA, TB, TC, TD; global synchronizations delimit epochs and produce consistent states; a rollback returns to the last consistent state]

– Failure-free: synchronization –> consistent state
– Failure: rollback to the last consistent state

  • Task checkpoint

– task state
– list of late messages

  • Late messages log

– optimistic approach –> small latency in failure-free operation
– logged at receiver –> small recovery overhead

  • Unique coordinator

– reduced overhead

  • Unique blocking and non-blocking protocol

– for the same checkpoint, one set of tasks can be blocked while another set is not
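
A task checkpoint as described on this slide (task state plus a list of late messages logged at the receiver) could be represented as in the sketch below. The names `TaskCheckpoint`, `Task`, and the `is_late` flag are hypothetical; how a message is recognized as late is left to the caller here and is not modeled.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TaskCheckpoint:
    task_state: dict
    late_messages: list = field(default_factory=list)


class Task:
    def __init__(self):
        self.state = {"received": []}
        self.checkpoint = None

    def take_checkpoint(self):
        self.checkpoint = TaskCheckpoint(copy.deepcopy(self.state))

    def deliver(self, message, is_late=False):
        # Late messages are logged at the receiver, next to the saved state.
        if is_late and self.checkpoint is not None:
            self.checkpoint.late_messages.append(message)
        self.state["received"].append(message)

    def rollback(self):
        # Restore the saved state, then replay the logged late messages.
        self.state = copy.deepcopy(self.checkpoint.task_state)
        for message in self.checkpoint.late_messages:
            self.state["received"].append(message)


task = Task()
task.deliver("m0")                # received before the checkpoint
task.take_checkpoint()
task.deliver("m1", is_late=True)  # sent before the sender's checkpoint
task.deliver("m2")                # sent after the sender's checkpoint
task.rollback()
print(task.state["received"])  # ['m0', 'm1']
```

After the rollback `m2` is gone from the receiver, since the rolled-back sender will send it again, while `m1` is replayed from the local log without involving the sender.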

SLIDE 12

Synchronization. Markers

  • Markers

– are used to

  • avoid early messages
  • identify late messages and end the log of late messages

– dedicated messages (avoid long checkpointing durations when communication among certain tasks is scarce)

  • A task has taken the checkpoint only after its state and the late messages from other tasks are on stable storage

[Figure: two message diagrams for tasks TA, TB, TC. Inconsistent state: message 1 is early and message 2 is late. Consistent state using markers (marker 1, marker 2): message 1 is no longer early and message 2 is kept for replay.]

SLIDE 13

OUTLINE

  • Introduction

– Networks-on-Chip
– Checkpoint and rollback recovery

  • Coordinated checkpointing
  • Blocking and non-blocking coordinated checkpointing
  • Case study
  • Conclusions and future work
SLIDE 14

Blocking and Non-blocking Coordinated Checkpointing Protocol

  • Checkpointing protocol

[Figure: synchronization messages exchanged between initiator I and tasks TA, TB, TC, TD]

Initiator

  • broadcast CK_REQ
  • when CK_TAKEN received from all tasks

– validate global checkpoint

Non-initiator (blocking or not)

  • on CK_REQ receipt

– broadcast CK_START

  • when CK_START received from all tasks

– take local checkpoint
– send CK_TAKEN to initiator
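
The message exchange listed on this slide can be traced with a small event-driven simulation. This is only a sketch under stated assumptions: one global FIFO queue stands in for the network, so every CK_REQ is delivered before any CK_START, blocking of application traffic is not modeled, and the function name `run_checkpoint_round` is hypothetical.

```python
from collections import Counter, deque


def run_checkpoint_round(task_names):
    """Simulate one checkpoint round; return (taken, sent, validated)."""
    if len(task_names) < 2:
        raise ValueError("sketch assumes at least two non-initiator tasks")
    initiator = "I"
    queue = deque()                              # global FIFO network (assumption)
    sent = Counter()                             # messages sent, per type
    starts_seen = {t: set() for t in task_names}
    taken = []                                   # order of local checkpoints
    acks = set()
    validated = False

    def send(kind, src, dst):
        sent[kind] += 1
        queue.append((kind, src, dst))

    # Initiator: broadcast CK_REQ.
    for t in task_names:
        send("CK_REQ", initiator, t)

    while queue:
        kind, src, dst = queue.popleft()
        if kind == "CK_REQ":
            # Non-initiator: on CK_REQ receipt, broadcast CK_START.
            for other in task_names:
                if other != dst:
                    send("CK_START", dst, other)
        elif kind == "CK_START":
            starts_seen[dst].add(src)
            if len(starts_seen[dst]) == len(task_names) - 1:
                # CK_START received from all other tasks: checkpoint, then report.
                taken.append(dst)
                send("CK_TAKEN", dst, initiator)
        elif kind == "CK_TAKEN":
            acks.add(src)
            if len(acks) == len(task_names):
                validated = True                 # initiator validates
    return taken, dict(sent), validated


taken, sent, validated = run_checkpoint_round(["TA", "TB", "TC", "TD"])
print(sent)       # {'CK_REQ': 4, 'CK_START': 12, 'CK_TAKEN': 4}
print(validated)  # True
```

The global checkpoint is validated only after every task has reported CK_TAKEN, so a failure in the middle of the round leaves the previous global checkpoint as the rollback target.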

SLIDE 15

Blocking and Non-blocking Overhead

  • Synchronization messages

– n nodes

  • CK_REQ: n
  • CK_START: n*(n-1), i.e. O(n²)
  • CK_TAKEN: n

  • Messages in NoC during checkpointing

– Blocking: synchronization messages
– Non-blocking: synchronization messages and application messages

[Figure: message exchange between initiator I and tasks TA, TB, TD, TC]
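
Adding up the per-type counts gives n + n(n-1) + n = n(n+1) synchronization messages per checkpoint. The short sketch below evaluates this; the function name is hypothetical, and treating all 16 nodes of a 4x4 mesh as tasks served by a separate initiator is an assumption, not a figure from the presentation.

```python
def sync_message_counts(n):
    """Synchronization messages for one checkpoint with n tasks."""
    counts = {"CK_REQ": n, "CK_START": n * (n - 1), "CK_TAKEN": n}
    counts["total"] = sum(counts.values())  # equals n * (n + 1)
    return counts


print(sync_message_counts(4))
# {'CK_REQ': 4, 'CK_START': 12, 'CK_TAKEN': 4, 'total': 20}
print(sync_message_counts(16)["total"])  # 272
```

The CK_START term dominates, which is why the overall cost grows as O(n²).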
SLIDE 16

Checkpointing Duration

  • Long checkpointing durations

–> reduced number of checkpoints

  • When the time between failures is comparable with the checkpointing duration

–> rollbacks to the same old checkpoint

[Figure: two timelines of tasks TA, TB, TC, TD, each showing a rollback]

  • High overhead during checkpointing

–> the checkpointing phase should be reduced
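
The effect of the checkpointing duration on rollbacks can be seen in a deliberately simplified timing model. It is not the evaluation used in the presentation: it assumes the system alternates between a compute phase and a checkpointing phase, that a failure aborts any checkpoint in progress, and that execution restarts immediately from the last validated checkpoint. All names are hypothetical.

```python
def rollback_targets(failure_times, compute_time, ckpt_duration):
    """Return, for each failure, the id of the checkpoint rolled back to.

    Checkpoint 0 is the initial state. A checkpoint counts only once its
    checkpointing phase has fully completed before the failure.
    """
    targets = []
    validated = 0       # id of the last validated checkpoint
    cycle_start = 0.0   # when the current compute + checkpoint cycle began
    cycle = compute_time + ckpt_duration
    for failure in sorted(failure_times):
        completed = int((failure - cycle_start) // cycle)
        validated += completed
        targets.append(validated)
        cycle_start = failure  # restart from the last validated checkpoint
    return targets


failures = [9.5, 19, 28.5]
print(rollback_targets(failures, compute_time=2, ckpt_duration=8))  # [0, 0, 0]
print(rollback_targets(failures, compute_time=2, ckpt_duration=1))  # [3, 6, 9]
```

With a checkpointing phase of 8 time units and a failure every 9.5 units, no checkpoint is ever validated and every rollback returns to the same old state; shortening the phase to 1 unit lets the system make progress between failures.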

SLIDE 17

OUTLINE

  • Introduction

– Networks-on-Chip
– Checkpoint and rollback recovery

  • Coordinated checkpointing
  • Blocking and non-blocking coordinated checkpointing
  • Case study
  • Conclusions and future work
SLIDE 18

Case Study

  • 4x4 mesh direct NoC

– XY routing
– Wormhole switching

  • Consider

– Different traffic loads

  • uniform traffic loads
  • constant message length

– Different failure rates

  • Analyze

– Checkpointing duration and overhead
– Application latency

[Figure: 4x4 mesh NoC; legend: router, PE, link]
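
For readers unfamiliar with XY routing, the sketch below shows the standard dimension-ordered scheme on a mesh: a packet first travels along X until the column matches, then along Y. The coordinate convention and the function name `xy_route` are assumptions; the presentation does not describe its router implementation.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:            # first resolve the X dimension
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:            # then resolve the Y dimension
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(len(xy_route((3, 3), (0, 0))) - 1)  # 6
```

The hop count is always the Manhattan distance between source and destination, so the longest route in a 4x4 mesh takes 6 hops.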

SLIDE 19

Checkpointing Duration and Overhead

  • Memory Overhead
  • Checkpointing Duration
SLIDE 20

Application Latency

SLIDE 21

OUTLINE

  • Introduction

– Networks-on-Chip
– Checkpoint and rollback recovery

  • Coordinated checkpointing
  • Blocking and non-blocking coordinated checkpointing
  • Case study
  • Conclusions and future work

SLIDE 22

Conclusions and Future Work

  • Blocking and Non-blocking coordinated checkpointing

– unique protocol

  • Analyze and compare overhead and latency

– Checkpointing duration increases with the traffic load

  • Non-blocking: significantly
  • Blocking: to a lesser extent

– Application latency increases with the traffic load and the failure rate

  • Non-blocking: significantly
  • Blocking: to a lesser extent

–> For higher traffic loads and higher failure rates, the blocking approach becomes mandatory

  • Future work

– Evaluate the proposed protocol

  • on other traffic patterns
  • on applications with high traffic loads and critical tasks

–> subsets of blocking and non-blocking tasks

SLIDE 23

Thank you!