Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
Checkpointing in Shared-Memory MPs
[Figure: timeline of periodic checkpoint saves; after a fault, execution rolls back to a saved checkpoint]
- HW-based schemes for small CMPs use Global checkpointing
– All procs participate in system-wide checkpoints
[Figure: processors P1-P4 all take part in each checkpoint]
- Global checkpointing is not scalable
– Synchronization, bursty movement of data, loss in rollback…
- R. Agarwal, P. Garg, J. Torrellas
Rebound: Scalable Checkpointing
Alternative: Coordinated Local Checkpointing
- Idea: threads coordinate their checkpointing in groups
- Rationale:
– Faults propagate only through communication
– Interleaving between non-communicating threads is irrelevant
[Figure: processors P1-P5 under one global checkpoint versus two separate local checkpoints]
+ Scalable: Checkpoint and rollback in processor groups
– Complexity: Record inter-thread dependences dynamically.
Contributions
Rebound: First HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory
- Leverages directory protocol to track inter-thread dependences
- Optimizations to boost checkpointing efficiency:
– Delaying write-back of data to safe memory at checkpoints
– Supporting multiple checkpoints
– Optimizing checkpointing at barrier synchronization
- Avg. performance overhead for 64 procs: 2%
– Compared to 15% for global checkpointing
Background: In-Memory Checkpt with ReVive [Prvulovic-02]
[Figure: P1-P3 during execution and at a checkpoint (CHK). Execution: displaced dirty cache lines are written back (WB) to memory, with logging. Checkpoint: registers are dumped and dirty cache lines are written back with logging while the application stalls.]
Background: In-Memory Checkpt with ReVive [Prvulovic-02]
[Figure: a fault occurs after CHK on P1-P3. Old registers are restored, caches are invalidated, and memory lines are reverted from the log.]
- Global: broadcast protocol
- Local coordinated: scalable protocol
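The log-based flow on the two ReVive background slides (log old values as dirty lines reach memory, revert memory from the log on a fault) can be sketched as a toy model. This is only an illustration, not ReVive's implementation: the class `LoggedMemory` and its method names are hypothetical, memory is a plain dict, and the sketch assumes an address is logged only on its first overwrite in an interval.

```python
class LoggedMemory:
    """Toy in-memory checkpointing with an undo log (hypothetical names)."""

    def __init__(self):
        self.mem = {}   # address -> value held in "safe" main memory
        self.log = {}   # address -> value at the last checkpoint (None = absent)

    def writeback(self, addr, value):
        # Assumption: log the old value only the first time an address is
        # overwritten in the current checkpoint interval, then update memory.
        if addr not in self.log:
            self.log[addr] = self.mem.get(addr)
        self.mem[addr] = value

    def checkpoint(self, dirty_lines):
        # Write back all dirty cache lines, then discard the log:
        # memory now holds the new checkpoint.
        for addr, value in dirty_lines.items():
            self.writeback(addr, value)
        self.log.clear()

    def rollback(self):
        # Revert every logged address to its checkpoint-time value.
        for addr, old in self.log.items():
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self.log.clear()


m = LoggedMemory()
m.checkpoint({"x": 1, "y": 2})   # checkpoint state: x=1, y=2
m.writeback("x", 10)             # dirty displacement after the checkpoint
m.writeback("z", 7)
m.rollback()                     # fault: undo the whole interval
```

After the rollback, `m.mem` again holds only the values saved at the checkpoint.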
Coordinated Local Checkpointing Rules
[Figure: P1 writes x and P2 reads x, making P1 the producer and P2 the consumer. Panels: producer rollback, consumer rollback, producer checkpoint, consumer checkpoint.]
- P checkpoints → P's producers checkpoint
- P rolls back → P's consumers roll back
- Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
Rebound Fault Model
[Figure: chip multiprocessor attached to main memory, which holds the log (in SW)]
- Any part of the chip can suffer transient or permanent faults.
- A fault can occur even during checkpointing
- Off-chip memory and logs suffer no fault on their own (e.g. NVM)
- Fault detection outside our scope:
– Fault detection latency has an upper bound of L cycles
Rebound Architecture
[Figure: chip multiprocessor connected to main memory. Each node has P+L1, an L2 cache with Dep registers (MyProducers, MyConsumers), and a directory with an LW-ID field.]
- Dependence (Dep) registers in the L2 cache controller:
– MyProducers: bitmap of processors that produced data consumed by the local processor
– MyConsumers: bitmap of processors that consumed data produced by the local processor
- Processor ID in each directory entry:
– LW-ID: last writer to the line in the current checkpoint interval
Recording Inter-Thread Dependences
Assume MESI protocol
[Figure: P1 and P2, each with MyProducers and MyConsumers registers, a directory entry with LW-ID, and memory with its log, shown over four steps:]
- P1 writes: the line is Dirty (D) in P1's cache; the directory records LW-ID = P1
- P2 reads: the line is written back to memory with logging and becomes Shared (S); P2 is added to P1's MyConsumers and P1 to P2's MyProducers
- P1 writes: the line is Dirty (D) again with LW-ID = P1; the recorded dependences remain
- P1 checkpoints: dirty lines are written back with logging; Dep registers are cleared; LW-ID is cleared
– LW-ID should remain set till the line is checkpointed
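The recording steps above can be sketched in software. This is a minimal model under stated assumptions, not the hardware design: `DependenceTracker` and its methods are hypothetical names, only read-after-write dependences are recorded, and a checkpoint is assumed to write back all of a processor's lines at once so its LW-ID entries can be cleared immediately.

```python
class DependenceTracker:
    """Toy model of Dep registers and LW-ID (hypothetical class, not the hardware)."""

    def __init__(self, num_procs):
        self.my_producers = [0] * num_procs  # one bitmap per processor
        self.my_consumers = [0] * num_procs
        self.lw_id = {}                      # line address -> last writer

    def write(self, proc, line):
        # The directory remembers the last writer in this interval.
        self.lw_id[line] = proc

    def read(self, proc, line):
        writer = self.lw_id.get(line)
        if writer is not None and writer != proc:
            self.my_producers[proc] |= 1 << writer   # reader consumed writer's data
            self.my_consumers[writer] |= 1 << proc   # writer produced data for reader

    def checkpoint(self, proc):
        # Simplifying assumption: all of proc's lines are checkpointed at once,
        # so its LW-ID entries are cleared together with its Dep registers.
        self.my_producers[proc] = 0
        self.my_consumers[proc] = 0
        self.lw_id = {l: w for l, w in self.lw_id.items() if w != proc}


t = DependenceTracker(num_procs=4)
t.write(1, "x")                       # P1 writes x
t.read(2, "x")                        # P2 reads x: P1 -> P2 dependence
producers_of_p2 = t.my_producers[2]   # bit 1 set
consumers_of_p1 = t.my_consumers[1]   # bit 2 set
t.checkpoint(1)                       # P1 checkpoints
```

Integers serve as the bitmaps here, so bit `i` of a register stands for processor `i`.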
Distributed Checkpointing Protocol in SW
- Interaction Set [Pi]: set of producer processors (transitively) for Pi
– Built using MyProducers
[Figure: P1 initiates a checkpoint among P1-P4. The InteractionSet starts as {P1}; Ck? requests to P2 and P3 grow it to {P1, P2, P3}; a further Ck? request reaches P4.]
- Rollback handled similarly using MyConsumers
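Building an interaction set amounts to a transitive closure over MyProducers. The sketch below shows that closure as a breadth-first traversal. It is a centralized simplification of the distributed protocol: the function name and the dict-of-sets input are assumptions made for illustration, and it ignores message ordering, concurrent initiators, and any accept/decline handshake on a Ck? request.

```python
from collections import deque


def interaction_set(initiator, my_producers):
    """Transitive closure over MyProducers, starting at the initiator.

    my_producers maps a processor ID to the set of processors whose data
    it consumed in the current interval (a set stands in for the bitmap).
    """
    members = {initiator}
    pending = deque([initiator])
    while pending:
        proc = pending.popleft()
        for producer in my_producers.get(proc, ()):
            if producer not in members:   # each "Ck?" request is sent once
                members.add(producer)
                pending.append(producer)
    return members


# Hypothetical dependences: P1 consumed from P2 and P3, P3 from P4, P5 from P1.
deps = {1: {2, 3}, 3: {4}, 5: {1}}
group = interaction_set(1, deps)
```

P5 is left out of `group` because it only consumes from P1. Passing a MyConsumers map to the same function would give the corresponding set for rollback.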
Optimization1 : Delayed Writebacks
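[Figure: two timelines from Interval I1 to Interval I2. Without delayed writebacks, the checkpoint stall covers sync, writeback (WB) of dirty lines, and sync. With delayed writebacks, the stall covers only a sync and the dirty lines are written back during Interval I2.]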
- Checkpointing overhead dominated by data writebacks
- Delayed Writeback optimization
- Processors synchronize and resume execution
- Hardware automatically writes back dirty lines in background
- Checkpoint only completed when all delayed data written back
- Still need to record inter-thread dependences on delayed data
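The completion rule for delayed writebacks (processors resume at once, but the checkpoint only counts when every delayed line has reached memory) can be illustrated with a toy model. `DelayedCheckpoint` and `drain` are hypothetical names, and the fixed drain order is an assumption chosen only to make the example deterministic.

```python
class DelayedCheckpoint:
    """Toy model of a checkpoint whose writebacks drain in the background."""

    def __init__(self, dirty_lines):
        # Lines dirty at checkpoint time are marked as delayed.
        self.delayed = set(dirty_lines)
        self.safe_memory = []          # order in which lines reach memory

    def drain(self, max_lines):
        # Background writeback of up to max_lines delayed lines
        # (sorted order is an assumption, used for determinism).
        for line in sorted(self.delayed)[:max_lines]:
            self.delayed.discard(line)
            self.safe_memory.append(line)

    @property
    def complete(self):
        # The checkpoint completes only once no delayed line remains.
        return not self.delayed


ck = DelayedCheckpoint({"a", "b", "c"})
early = ck.complete      # execution has resumed, checkpoint still open
ck.drain(2)
ck.drain(2)
late = ck.complete       # all delayed lines written back
```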
Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
- Additional support:
– Each processor has two sets of Dep registers
– Each cache line has a Delayed bit
- Increased vulnerability:
– A rollback event forces both intervals to roll back
Optimization2 : Multiple Checkpoints
- Problem: Fault detection is not instantaneous
– Checkpoint is safe only after max fault-detection latency (L)
[Figure: timeline with Ckpt 1 (Dep registers 1) and Ckpt 2 (Dep registers 2). A fault occurs at time tf and is detected after the detection latency, which triggers the rollback.]
- Solution: Keep multiple checkpoints
– On fault, roll back interacting processors to safe checkpoints
- No Domino Effect
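The safety rule (a checkpoint is safe only after the maximum fault-detection latency L has passed) suggests how a rollback target could be picked. The helper below is a sketch under assumptions: the function name is hypothetical, times are in cycles, `checkpoint_times` is sorted in ascending order, and a checkpoint taken exactly L cycles before detection is treated as safe.

```python
def safe_checkpoint(checkpoint_times, detect_time, max_latency):
    """Return the index of the latest checkpoint known to be fault-free.

    A fault detected at detect_time may have occurred as early as
    detect_time - max_latency, so only checkpoints taken at or before
    that point are safe. Returns None if no such checkpoint exists.
    """
    earliest_fault = detect_time - max_latency
    safe = [i for i, t in enumerate(checkpoint_times) if t <= earliest_fault]
    return safe[-1] if safe else None


times = [0, 100, 200]                       # checkpoints at cycles 0, 100, 200
target = safe_checkpoint(times, 250, 80)    # fault may date back to cycle 170
```

Here the checkpoint at cycle 200 cannot be trusted, so `target` points at the one taken at cycle 100.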
Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection
- Additional support:
– Each checkpoint has Dep registers
– Dep registers can be recycled only after the fault detection latency
- Need to track communication across checkpoints
- Combination with Delayed Writebacks: one more Dep register set
Optimization3 : Hiding Chkpt behind Global Barrier
- Global barriers require that all processors communicate
– Leads to global checkpoints
- Optimization:
– Proactively trigger a global checkpoint at a global barrier
– Hide checkpoint overhead behind barrier imbalance spins
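A back-of-the-envelope model shows why barrier imbalance can hide checkpoint cost. It is purely illustrative and not the mechanism from the talk: the function name is hypothetical, and it assumes each processor starts its share of the checkpoint as soon as it arrives at the barrier.

```python
def exposed_checkpoint_delay(arrival_times, checkpoint_costs):
    """Extra time the barrier release is delayed by checkpointing.

    Assumes (for illustration) that each processor starts its part of the
    checkpoint as soon as it reaches the barrier.
    """
    release_without = max(arrival_times)
    release_with = max(a + c for a, c in zip(arrival_times, checkpoint_costs))
    return release_with - release_without


arrivals = [10, 40, 100]   # the last processor arrives at t=100
costs = [30, 30, 30]       # per-processor checkpoint cost
delay = exposed_checkpoint_delay(arrivals, costs)
```

Under these assumptions, only the work of the last arriver is exposed; the early arrivers checkpoint while they would otherwise spin.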
Evaluation Setup
- Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
- Applications: SPLASH-2, some PARSEC, Apache
- Simulated CMP architecture with up to 64 threads
- Checkpoint interval: 5-8 ms
- Modeled several environments:
– Global: baseline global checkpointing
– Rebound: local checkpointing scheme with delayed writebacks
– Rebound_NoDWB: Rebound without the delayed writebacks
- Avg. Interaction Set: Set of Producer Processors
[Figure: average interaction set size per application (surviving values: 64, 38)]
- Most apps: interaction set is a small set
– Justifies coordinated local checkpointing
– Averages brought up by global barriers
Checkpoint Execution Overhead
[Figure: bar chart of % checkpoint overhead for Global, Rebound_NoDWB, and Rebound across Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and SP2-AVG]
- Rebound's avg checkpoint execution overhead is 2%
– Compared to 15% for Global
- Delayed Writebacks complement local checkpointing
Rebound Scalability
Constant problem size
- Rebound is scalable in checkpoint overhead
- Delayed Writebacks help scalability
Also in the Paper
- Delayed write backs also useful in Global
- Barrier optimization is effective but not universally applicable
- Power increase due to hardware additions < 2%
- Rebound leads to only 4% increase in coherence traffic
Conclusions
Rebound: First HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory
- Leverages directory protocol
- Boosts checkpointing efficiency:
– Delayed write-backs
– Multiple checkpoints
– Barrier optimization
- Avg. execution overhead for 64 procs: 2%
- Future work:
– Apply Rebound to non-hardware coherent machines
– Scalability to hierarchical directories