Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

Checkpointing in Shared-Memory MPs
[Figure: execution timeline with saved checkpoints, a fault, and a rollback; processors P1-P4 all taking the same checkpoint]
• HW-based schemes for small CMPs use global checkpointing
– All processors participate in system-wide checkpoints
• Global checkpointing is not scalable
– Synchronization, bursty movement of data, loss in rollback…

Alternative: Coordinated Local Checkpointing
• Idea: threads coordinate their checkpointing in groups
• Rationale:
– Faults propagate only through communication
– Interleaving between non-communicating threads is irrelevant
[Figure: timelines of P1-P5 contrasting a global checkpoint with local checkpoints]
+ Scalable: checkpoint and rollback in processor groups
– Complexity: record inter-thread dependences dynamically

Contributions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
– Delaying write-back of data to safe memory at checkpoints
– Supporting multiple checkpoints
– Optimizing checkpointing at barrier synchronization
• Average performance overhead for 64 processors: 2%
– Compared to 15% for global checkpointing

Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
[Figure: processors P1-P3 with their caches, above memory and the log. During execution: displacement writebacks of dirty cache lines, with logging. At a checkpoint (CHK): register dump, writeback of dirty cache lines, application stalls.]

Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
[Figure: recovery after a fault for processors P1-P3. Old registers are restored, caches are invalidated, and memory lines are reverted using the log. Classification labels shown: Global, Local Coordinated; Scalable protocol, Broadcast protocol.]

Coordinated Local Checkpointing Rules
[Figure: P1 writes x (producer) and P2 reads x (consumer), shown once for the checkpoint case and once for the rollback case]
P checkpoints → P's producers checkpoint
P rolls back → P's consumers roll back
• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]

Rebound Fault Model
[Figure: chip multiprocessor connected to main memory, which holds the log (in SW)]
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no fault on their own (e.g. NVM)
• Fault detection is outside our scope:
– Fault detection latency has an upper bound of L cycles

Rebound Architecture
[Figure: chip multiprocessor and main memory. Each tile has P+L1, an L2 cache with Dep registers (MyProducers, MyConsumers), and a directory whose entries carry an LW-ID field.]
• Dependence (Dep) registers in the L2 cache controller:
– MyProducers: bitmap of processors that produced data consumed by the local processor
– MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
– LW-ID: last writer to the line in the current checkpoint interval

Recording Inter-Thread Dependences (P1 writes)
[Figure: P1 writes a line. The directory entry records LW-ID = P1 and the line is dirty (D) in P1's cache. Also shown: the MyProducers and MyConsumers registers of P1 and P2, the log, and memory.]
Assume MESI protocol

Recording Inter-Thread Dependences (P2 reads)
[Figure: P2 reads the line. Because LW-ID = P1, P1's MyConsumers ← P2 and P2's MyProducers ← P1. The line goes from dirty (D) to shared (S) and is written back to memory, with logging.]
Assume MESI protocol

Recording Inter-Thread Dependences (P1 writes again)
[Figure: P1 writes the line again. P2's MyProducers still holds P1 and P1's MyConsumers still holds P2. LW-ID stays P1, and the line goes from shared (S) to dirty (D) in P1's cache.]
Assume MESI protocol

Recording Inter-Thread Dependences (P1 checkpoints)
[Figure: P1 checkpoints. Its dirty lines are written back to memory, with logging, and its Dep registers are cleared. On clearing LW-ID: LW-ID should remain set until the line is checkpointed.]
Assume MESI protocol
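
The four steps above (P1 writes, P2 reads, P1 writes again, P1 checkpoints) can be traced with a small software model. This is a minimal sketch, not the hardware design: the class and method names are hypothetical, Python sets stand in for the bitmap registers, only the read-after-write case shown on the slides is modeled, and LW-ID is assumed to be cleared at the moment the writer's checkpoint writes the line back.

```python
from dataclasses import dataclass, field


@dataclass
class DepRegisters:
    """Per-processor Dep registers (sets stand in for the hardware bitmaps)."""
    my_producers: set = field(default_factory=set)
    my_consumers: set = field(default_factory=set)


class DirectoryModel:
    """Toy directory that keeps one LW-ID per line address (hypothetical model)."""

    def __init__(self, pids):
        self.regs = {pid: DepRegisters() for pid in pids}
        self.lw_id = {}  # line address -> last writer in the current interval

    def write(self, pid, addr):
        # A write makes pid the last writer of the line.
        self.lw_id[addr] = pid

    def read(self, pid, addr):
        # A read of a line last written by another processor
        # records a producer -> consumer dependence on both sides.
        writer = self.lw_id.get(addr)
        if writer is not None and writer != pid:
            self.regs[writer].my_consumers.add(pid)
            self.regs[pid].my_producers.add(writer)

    def checkpoint(self, pid):
        # Clear the Dep registers of the checkpointing processor.
        self.regs[pid].my_producers.clear()
        self.regs[pid].my_consumers.clear()
        # Assumption: every line last written by pid is written back as part
        # of this checkpoint, so its LW-ID can be cleared here.
        for addr in [a for a, w in self.lw_id.items() if w == pid]:
            del self.lw_id[addr]


d = DirectoryModel([1, 2])
d.write(1, "x")   # P1 writes: LW-ID(x) = P1
d.read(2, "x")    # P2 reads: P1.MyConsumers gets P2, P2.MyProducers gets P1
d.write(1, "x")   # P1 writes again: registers unchanged
d.checkpoint(1)   # P1 checkpoints: its Dep registers and LW-ID(x) are cleared
```

Note that P2's MyProducers still names P1 after P1 checkpoints; the model makes no attempt to clean up such stale entries.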

Distributed Checkpointing Protocol in SW
• Interaction Set [Pi]: set of producer processors (transitively) for Pi
– Built using MyProducers
[Figure: timelines of P1-P4. P1 initiates a checkpoint; its InteractionSet starts as {P1} and grows to {P1, P2, P3} as Ck? requests go to P2 and P3. A further Ck? request is addressed to P4.]
• Rollback handled similarly using MyConsumers
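
The Interaction Set is a transitive closure over the MyProducers registers, and the rollback set is the same closure over MyConsumers. The sketch below computes both centrally with a breadth-first search. It is only an illustration under that simplification: the slides describe a distributed protocol driven by Ck? messages, and the function name and the example register contents are assumptions.

```python
from collections import deque


def transitive_set(initiator, dep_registers):
    """Return the initiator plus every processor reachable through dep_registers.

    dep_registers maps a processor ID to the set of IDs in one of its Dep
    registers: MyProducers for a checkpoint, MyConsumers for a rollback.
    """
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for other in dep_registers.get(p, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


# Hypothetical register contents: P1 consumed data from P2 and P3,
# and P4 consumed data from P3.
my_producers = {1: {2, 3}, 2: set(), 3: set(), 4: {3}}
my_consumers = {1: set(), 2: {1}, 3: {1, 4}, 4: set()}

checkpoint_group = transitive_set(1, my_producers)  # P1 initiates a checkpoint
rollback_group = transitive_set(3, my_consumers)    # P3 must roll back
```

With these example registers, a checkpoint started by P1 involves P1, P2, and P3 but not P4, while a rollback of P3 drags in its consumers P1 and P4.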

Optimization 1: Delayed Writebacks
[Figure: execution timelines without and with delayed writebacks, showing Interval I1, the checkpoint (sync, WB dirty lines, sync), the resulting stall, and Interval I2]
• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
– Processors synchronize and resume execution
– Hardware automatically writes back dirty lines in the background
– Checkpoint is only completed when all delayed data is written back
– Still need to record inter-thread dependences on delayed data

Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
- Additional support:
– Each processor has two sets of Dep registers
– Each cache line has a delayed bit
- Increased vulnerability:
– A rollback event forces both intervals to roll back

Optimization 2: Multiple Checkpoints
• Problem: fault detection is not instantaneous
– A checkpoint is safe only after the max fault-detection latency (L)
[Figure: timeline with Ckpt 1 (Dep registers 1) and Ckpt 2 (Dep registers 2), a fault, its detection latency, and the rollback]
• Solution: keep multiple checkpoints
– On a fault, roll back interacting processors to safe checkpoints
• No domino effect
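
The safety rule above can be made concrete with a little arithmetic. If a fault is detected at cycle t and detection takes at most L cycles, the fault occurred no earlier than t - L, so only checkpoints taken before t - L are known to be clean. The sketch below picks the most recent such checkpoint. The function name and the cycle values are illustrative assumptions, and the strict comparison is a conservative choice rather than something the slides specify.

```python
def rollback_target(checkpoint_times, detect_time, max_latency):
    """Return the most recent checkpoint taken strictly before detect_time - max_latency.

    checkpoint_times: cycles at which the retained checkpoints were taken.
    detect_time: cycle at which the fault was detected.
    max_latency: upper bound L on the fault-detection latency, in cycles.
    """
    cutoff = detect_time - max_latency
    safe = [t for t in checkpoint_times if t < cutoff]
    if not safe:
        raise ValueError("no retained checkpoint is old enough to be safe")
    return max(safe)


# Hypothetical numbers: checkpoints at cycles 100, 200, 300 and L = 50.
late_detection = rollback_target([100, 200, 300], detect_time=320, max_latency=50)
early_enough = rollback_target([100, 200, 300], detect_time=360, max_latency=50)
```

In the first call the checkpoint at cycle 300 is younger than L when the fault is detected, so the processor falls back to the one at cycle 200. In the second call the checkpoint at 300 is already old enough to be trusted.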
