Rebound: Scalable Checkpointing for Coherent Shared Memory (PowerPoint PPT Presentation)



SLIDE 1

Rebound: Scalable Checkpointing for Coherent Shared Memory

Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

SLIDE 2

Checkpointing in Shared-Memory MPs

[Figure: execution timeline: save checkpoint, save checkpoint, fault, roll back to the last checkpoint]

  • HW-based schemes for small CMPs use global checkpointing

– All procs participate in system-wide checkpoints

[Figure: all processors P1-P4 participate in each system-wide checkpoint]

  • Global checkpointing is not scalable


– Synchronization, bursty movement of data, loss of work on rollback

R. Agarwal, P. Garg, J. Torrellas. Rebound: Scalable Checkpointing

SLIDE 3

Alternative: Coordinated Local Checkpointing

  • Idea: threads coordinate their checkpointing in groups
  • Rationale:

– Faults propagate only through communication
– Interleaving between non-communicating threads is irrelevant

[Figure: one global checkpoint across P1-P5 vs. separate local checkpoints in two processor groups]

+ Scalable: checkpoint and rollback happen in processor groups

– Complexity: record inter-thread dependences dynamically

SLIDE 4

Contributions

Rebound: First HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory

  • Leverages the directory protocol to track inter-thread dependences


  • Optimizations to boost checkpointing efficiency:
  • Delaying write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
  • Avg. performance overhead for 64 procs: 2%
  • Compared to 15% for global checkpointing

SLIDE 5

Background: In-Memory Checkpointing with ReVive

[Prvulovic-02]

[Figure: ReVive execution and checkpoint: during execution, dirty-cache displacements write back to memory and the overwritten values are logged; at a checkpoint, the application stalls while processors dump registers and write back all dirty cache lines]


SLIDE 6

Background: In-Memory Checkpointing with ReVive

[Prvulovic-02]

[Figure: ReVive recovery after a fault: cache lines are invalidated, old registers are restored, and the memory log is unrolled to revert memory to the checkpoint]

ReVive: global checkpointing on a broadcast protocol. Goal: local coordinated checkpointing on a scalable protocol.


SLIDE 7

Coordinated Local Checkpointing Rules

[Figure: producer P1 (wr x) and consumer P2 (rd x) under four cases: producer checkpoint, consumer checkpoint, producer rollback, consumer rollback]

– When P checkpoints, P's producers must checkpoint
– When P rolls back, P's consumers must roll back

  • Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]

SLIDE 8

Rebound Fault Model

[Figure: chip multiprocessor with off-chip main memory holding software-managed logs]

  • Any part of the chip can suffer transient or permanent faults.
  • A fault can occur even during checkpointing
  • Off-chip memory and logs suffer no faults on their own (e.g., NVM)

  • Fault detection is outside our scope:

– Fault-detection latency has an upper bound of L cycles

SLIDE 9

Rebound Architecture

[Figure: chip multiprocessor with off-chip main memory; each node has a processor with L1, an L2 cache, and a directory slice; Dep registers (MyProducers, MyConsumers) sit in the L2 controller and each directory entry holds an LW-ID field]


SLIDE 10

Rebound Architecture

[Figure: same architecture, highlighting the Dep registers in the L2 cache controller]

  • Dependence (Dep) registers in the L2 cache controller:

– MyProducers: bitmap of processors that produced data consumed by the local processor
– MyConsumers: bitmap of processors that consumed data produced by the local processor


SLIDE 11

Rebound Architecture

[Figure: same architecture, highlighting the LW-ID field in each directory entry]

  • Dependence (Dep) registers in the L2 cache controller:

– MyProducers: bitmap of processors that produced data consumed by the local processor
– MyConsumers: bitmap of processors that consumed data produced by the local processor

  • Processor ID in each directory entry:

– LW-ID: last writer to the line in the current checkpoint interval
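As a rough software model of this hardware state (class and field names are assumptions for illustration, not taken from the paper):

```python
class DepRegisters:
    """Dependence registers held in each node's L2 cache controller."""
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.my_producers = 0  # bitmap: procs that produced data we consumed
        self.my_consumers = 0  # bitmap: procs that consumed data we produced

    def add_producer(self, p):
        self.my_producers |= 1 << p

    def add_consumer(self, p):
        self.my_consumers |= 1 << p

    def producers(self):
        """Decode the MyProducers bitmap into a set of processor IDs."""
        return {p for p in range(self.nprocs) if self.my_producers >> p & 1}

class DirectoryEntry:
    """One directory entry per memory line."""
    def __init__(self):
        self.sharers = 0    # usual coherence sharer bitmap
        self.lw_id = None   # last writer in the current checkpoint interval

d = DepRegisters(nprocs=4)
d.add_producer(2)
print(d.producers())  # {2}
```

Bitmaps keep the hardware cost proportional to the processor count, which is why the slide stores them as per-node registers rather than lists.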

SLIDE 12

Recording Inter-Thread Dependences

[Figure: P1 writes a line; the line becomes Dirty in P1's cache and the directory records LW-ID = P1]

Assume MESI protocol.

SLIDE 13

Recording Inter-Thread Dependences

[Figure: P2 reads the line; the directory sees LW-ID = P1, so P2 adds P1 to its MyProducers and P1 adds P2 to its MyConsumers; the line is written back and the old memory value is logged]


SLIDE 14

Recording Inter-Thread Dependences

[Figure: P1 writes the line again; it goes from Shared back to Dirty in P1's cache with LW-ID = P1, and the recorded dependences remain]


SLIDE 15

Recording Inter-Thread Dependences

[Figure: P1 checkpoints: its Dep registers are cleared, its dirty lines are written back (with logging), and their LW-ID fields are cleared]

Note: a line's LW-ID must remain set until the line itself is checkpointed.

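The sequence on slides 12-15 can be sketched as a toy protocol walk-through (function names and the single-line memory model are invented; Rebound piggybacks these actions on the MESI coherence transactions):

```python
class Node:
    """Per-processor dependence state (software stand-in for Dep registers)."""
    def __init__(self):
        self.my_producers = set()
        self.my_consumers = set()

def do_write(directory, writer):
    # Slide 12: the directory records the last writer of the line.
    directory["lw_id"] = writer

def do_read(directory, nodes, reader):
    # Slide 13: a read of a line written this interval records a dependence.
    lw = directory["lw_id"]
    if lw is not None and lw != reader:
        nodes[reader].my_producers.add(lw)
        nodes[lw].my_consumers.add(reader)

def do_checkpoint(directory, nodes, p):
    # Slide 15: the checkpointing processor clears its Dep registers;
    # a line's LW-ID is cleared only once that line is checkpointed.
    nodes[p].my_producers.clear()
    nodes[p].my_consumers.clear()
    if directory["lw_id"] == p:
        directory["lw_id"] = None

nodes = {"P1": Node(), "P2": Node()}
line = {"lw_id": None}
do_write(line, "P1")
do_read(line, nodes, "P2")
print(nodes["P2"].my_producers)  # {'P1'}
do_checkpoint(line, nodes, "P1")
print(line["lw_id"])             # None
```

The point of the LW-ID field is visible here: without it, the read by P2 could not tell which processor wrote the line during the current interval.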

SLIDE 16

Distributed Checkpointing Protocol in SW

  • Interaction Set [Pi]: set of producer processors (transitively) for Pi

– Built using MyProducers

[Figure: P1 initiates a checkpoint; InteractionSet = {P1}]


SLIDE 17

Distributed Checkpointing Protocol in SW


[Figure: P1 sends Ck? to its producers P2 and P3; InteractionSet = {P1, P2, P3}]


SLIDE 18

Distributed Checkpointing Protocol in SW


[Figure: the Ck? requests propagate transitively; P4 is also queried]



SLIDE 20

Distributed Checkpointing Protocol in SW


[Figure: the processors in the interaction set checkpoint together]

  • Rollback handled similarly using MyConsumers

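One way to sketch the interaction-set construction (the Ck? messages are simulated here with direct calls and an invented function name; the real protocol is a distributed software handshake):

```python
def build_interaction_set(initiator, my_producers):
    """The initiator sends Ck? to its producers; each newly reached
    processor forwards Ck? to its own producers (transitive closure)."""
    interaction_set = {initiator}
    to_ask = list(my_producers.get(initiator, set()))
    while to_ask:
        p = to_ask.pop()
        if p in interaction_set:
            continue
        interaction_set.add(p)                     # p answers Ck? and joins
        to_ask.extend(my_producers.get(p, set()))  # p forwards Ck? upstream
    return interaction_set

# Slide example: P1's producers are P2 and P3; P4 produced nothing for them.
producers = {"P1": {"P2", "P3"}, "P2": set(), "P3": set(), "P4": set()}
print(sorted(build_interaction_set("P1", producers)))  # ['P1', 'P2', 'P3']
```

Rollback coordination would follow the same shape with MyConsumers substituted for MyProducers, per the bullet above.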

SLIDE 21

Optimization 1: Delayed Writebacks

[Figure: timeline of intervals I1 and I2: without the optimization, processors sync and stall at the checkpoint until dirty lines are written back; with delayed writebacks, they sync and resume while the writebacks drain in the background]

  • Checkpointing overhead dominated by data writebacks
  • Delayed Writeback optimization
  • Processors synchronize and resume execution
  • Hardware automatically writes back dirty lines in background
  • Checkpoint only completed when all delayed data written back

  • Still need to record inter-thread dependences on delayed data

SLIDE 22

Delayed Writeback Pros/Cons

+ Significant reduction in checkpoint overhead

  • Additional support:

– Each processor has two sets of Dep registers
– Each cache line has a delayed bit

  • Increased vulnerability

– A rollback event forces both intervals to roll back

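A small sketch of the delayed-writeback bookkeeping (class and method names are assumptions; the hardware support is exactly the two Dep register sets and the per-line delayed bit listed above):

```python
class DelayedCheckpoint:
    """Software model of one node's delayed-writeback state."""
    def __init__(self):
        self.dep_regs = [set(), set()]  # one Dep register set per interval
        self.current = 0                # which set the running interval uses
        self.delayed_lines = set()      # lines whose delayed bit is set

    def start_checkpoint(self, dirty_lines):
        # Processors sync and resume immediately; the dirty lines are
        # marked delayed and drain in the background.
        self.delayed_lines |= set(dirty_lines)
        self.current ^= 1               # switch to the other Dep register set

    def background_writeback(self, line):
        # Hardware drains one delayed line to safe memory.
        self.delayed_lines.discard(line)

    def checkpoint_complete(self):
        # The checkpoint commits only once every delayed line has drained.
        return not self.delayed_lines

c = DelayedCheckpoint()
c.start_checkpoint({"x", "y"})
print(c.checkpoint_complete())   # False: data still draining
c.background_writeback("x")
c.background_writeback("y")
print(c.checkpoint_complete())   # True
```

The two register sets exist because dependences on still-delayed data belong to the previous interval, while new dependences belong to the current one, which is also why a rollback must discard both intervals.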

SLIDE 23

Optimization 2: Multiple Checkpoints

  • Problem: Fault detection is not instantaneous

– Checkpoint is safe only after max fault-detection latency (L)

[Figure: each checkpoint (Ckpt 1, Ckpt 2) keeps its own Dep registers; a fault detected within the detection latency after Ckpt 2 forces a rollback past Ckpt 2]

  • Solution: Keep multiple checkpoints

– On fault, roll back interacting processors to safe checkpoints

– No domino effect

SLIDE 24

Multiple Checkpoints: Pros/Cons

+ Realistic system: supports non-instantaneous fault detection

  • Additional support:

– Each checkpoint has its own Dep registers
– Dep registers can be recycled only after the fault-detection latency

  • Need to track communication across checkpoints

  • Combination with Delayed Writebacks: one more Dep register set
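The recycling constraint can be illustrated numerically (the latency value and function names are invented; the rule is only that a checkpoint is safe once L cycles have passed without a detected fault):

```python
L = 100  # assumed upper bound on fault-detection latency (cycles)

def safe_checkpoints(checkpoint_times, now):
    """Checkpoints old enough that no still-undetected fault can predate them."""
    return [t for t in checkpoint_times if now - t >= L]

def rollback_target(checkpoint_times, now):
    """On a fault, roll back to the newest safe checkpoint. Because a safe
    checkpoint always exists once execution is older than L, there is no
    domino effect back to the start of the program."""
    safe = safe_checkpoints(checkpoint_times, now)
    return max(safe) if safe else None

ckpts = [0, 120, 190]
print(rollback_target(ckpts, now=250))  # 120: the ckpt at 190 is too recent
```

This is the reason a single checkpoint is not enough: the most recent checkpoint may itself contain corrupted state from a fault that has not been detected yet.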

SLIDE 25

Optimization 3: Hiding Checkpoints behind Global Barriers

  • Global barriers require that all processors communicate

– Leads to global checkpoints

  • Optimization:

– Proactively trigger a global checkpoint at the barrier
– Hide checkpoint overhead behind barrier imbalance spins

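A toy model of why the barrier hides the cost (arrival times, the cost value, and the function name are made up, and the last arrival's own checkpoint delay is ignored for simplicity):

```python
def barrier_overhead(arrivals, ckpt_cost):
    """Extra time each processor pays for checkpointing at the barrier.
    A processor that would otherwise spin longer than ckpt_cost pays
    nothing: its checkpoint is fully hidden behind the imbalance spin."""
    release = max(arrivals)  # the last arrival releases the barrier
    return [max(0, ckpt_cost - (release - a)) for a in arrivals]

# Early arrivals (t=0, t=30) hide the checkpoint behind their spin time;
# only the last arrival (t=100) pays the full cost.
print(barrier_overhead([0, 30, 100], ckpt_cost=50))  # [0, 0, 50]
```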

SLIDE 26

Evaluation Setup

  • Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
  • Applications: SPLASH-2, some PARSEC, Apache

  • Simulated CMP architecture with up to 64 threads
  • Checkpoint interval: 5-8 ms
  • Modeled several environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks

SLIDE 27
  • Avg. Interaction Set: Set of Producer Processors

[Figure: average interaction-set size per application, out of 64 processors]

  • Most apps: the interaction set is small

– Justifies coordinated local checkpointing
– Averages brought up by global barriers


SLIDE 28

Checkpoint Execution Overhead

[Figure: % checkpoint execution overhead of Global, Rebound_NoDWB, and Rebound for Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and the SPLASH-2 average]

  • Rebound’s avg checkpoint execution overhead is 2%

– Compared to 15% for Global


SLIDE 29

Checkpoint Execution Overhead


  • Delayed Writebacks complement local checkpointing

SLIDE 30

Rebound Scalability

[Figure: checkpoint overhead vs. processor count, constant problem size]

  • Rebound is scalable in checkpoint overhead
  • Delayed Writebacks help scalability

SLIDE 31

Also in the Paper

  • Delayed writebacks are also useful in Global checkpointing


  • Barrier optimization is effective but not universally applicable
  • Power increase due to hardware additions < 2%
  • Rebound leads to only 4% increase in coherence traffic


SLIDE 32

Conclusions

Rebound: First HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory

  • Leverages directory protocol
  • Boosts checkpointing efficiency:

  • Delayed write-backs
  • Multiple checkpoints
  • Barrier optimization
  • Avg. execution overhead for 64 procs: 2%
  • Future work:
  • Apply Rebound to non-hardware coherent machines

  • Scalability to hierarchical directories

SLIDE 33

Rebound: Scalable Checkpointing for Coherent Shared Memory

Rishi Agarwal, Pranav Garg, and Josep Torrellas D f C S i Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu p