 
DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in Multi-core Chips
Andrew DeOrio†, Konstantinos Aisopos‡§, Valeria Bertacco†, Li-Shiuan Peh§
† University of Michigan, ‡ Princeton University, § Massachusetts Institute of Technology
DAC 2011
Reliable Networks on Chip
• Detection: detect if a fault has occurred
• Diagnosis: diagnose where the fault has occurred
• Reconfiguration: reconfigure the network to account for the fault
• Recovery: recover and resume normal operation
Fault-tolerant routing covers the earlier stages; Drain covers recovery.
When nodes become disconnected, data is lost!
[Figure legend: µp = processor, $ = cache, R = router.]
Previous Recovery Solutions
• Checkpoint approaches: ReVive [Prvulovic et al. '02], SafetyNet [Sorin et al. '02]
  – Data stuck in the checkpoint buffer!
  – High performance overhead!
• Drain takes a reactive approach, incurring performance overhead only when errors occur
[Figure: two nodes (µP, $, R) with checkpoint buffers, and memory (MEM).]
Data Recovery with Drain
• Recover data lost during reconfiguration
  – Emergency links provide alternate path
  – Transfers cache contents and architectural state
[Figure: three nodes, each with a processor core (µP, labeled "power gated"), a local cache ($), and a router. DRAIN emergency links (2 wires) run between the local caches; primary links (32-128 wires) run between the routers; a memory controller (Mem) is attached.]
Drain Example
[Figure: four nodes, each with a processor (µp), a cache ($), and a router, plus memory (M). Labels: emergency link, primary link, router.]
Drain Example
Fault model: faults accumulate one at a time.
[Figure: a link failure (X) occurs.]
Drain Example
Reconfigure interconnect.
Drain Example
Fault model: initiate Drain recovery when a single additional fault causes a node to become isolated.
[Figure: another link failure (X) occurs.]
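One way to picture the trigger condition in the fault model is as a reachability check over the surviving primary links. The sketch below is illustrative only: the node numbering, the `memory_node` parameter, and both function names are assumptions, not part of the talk, and real hardware would not necessarily run a breadth-first search.

```python
from collections import deque


def reachable_nodes(num_nodes, links, start):
    """Return the set of nodes reachable from `start` over working links."""
    adjacency = {n: set() for n in range(num_nodes)}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def isolated_nodes(num_nodes, links, memory_node):
    """Nodes that can no longer reach the node attached to memory (hypothetical model)."""
    return set(range(num_nodes)) - reachable_nodes(num_nodes, links, memory_node)


# Four nodes in a 2x2 arrangement; memory is assumed to hang off node 3.
after_first_fault = [(0, 2), (1, 3), (2, 3)]   # link (0, 1) has failed
after_second_fault = [(1, 3), (2, 3)]          # link (0, 2) has failed as well
print(isolated_nodes(4, after_first_fault, 3))
print(isolated_nodes(4, after_second_fault, 3))
```

In this toy topology the first fault leaves every node connected, so only reconfiguration is needed; the second fault cuts off node 0, which is the situation in which recovery would start.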
Drain Example
Node isolated!
Drain Example
Drain connected nodes via primary links.
Drain Example
Drain disconnected node via emergency link.
Emergency Link Algorithm
[Flowchart with two search loops ending in "done":]
• Find next target cache toward subnet border (found / not found) → empty? (yes / no) → copy dirty cache lines to target cache
• Find next target cache connected to main memory (found) → empty? (yes / no) → copy registers and state to target cache
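The flowchart can be read as two searches that each look for an empty target cache and then copy into it. The sketch below is one possible interpretation, not the authors' implementation: the arrow wiring is inferred, and the data layout (caches as dictionaries of `address -> (value, dirty)`), the path lists, and all function names are assumptions made for illustration.

```python
def find_empty_target(path, caches):
    """Return the first node on `path` whose cache is empty, or None if not found."""
    for node in path:
        if not caches[node]:  # empty dict: this cache has already been flushed
            return node
    return None


def drain_via_emergency_links(source, border_path, memory_path, caches, registers):
    """Move dirty lines and register state out of the isolated `source` node."""
    result = {"line_target": None, "state_target": None, "saved_state": {}}

    # First flow: dirty cache lines go to an empty cache toward the subnet border.
    line_target = find_empty_target(border_path, caches)
    if line_target is not None:
        caches[line_target] = {
            addr: (value, True)
            for addr, (value, dirty) in caches[source].items()
            if dirty
        }
        caches[source] = {}
        result["line_target"] = line_target

    # Second flow: registers and state go to an empty cache connected to main memory.
    state_target = find_empty_target(memory_path, caches)
    if state_target is not None:
        result["saved_state"][state_target] = dict(registers)
        result["state_target"] = state_target
    return result


caches = {
    0: {0x10: ("a", True), 0x20: ("b", False)},  # isolated node
    1: {0x30: ("c", False)},                     # not yet empty, so it is skipped
    2: {},
    3: {},
}
outcome = drain_via_emergency_links(0, [1, 2], [2, 3], caches, {"pc": 100})
print(outcome["line_target"], outcome["state_target"])
```

Only dirty lines are copied, since clean lines already match main memory. Target caches are expected to be empty because connected nodes are drained over primary links before the disconnected node is drained.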
Drain Example
Drain connected node again.
Drain Example
Resume normal operation. OS can re-assign workload.
Drain Hardware
[Figure: the local cache sits between the µP and the router.
 Existing cache logic: decoder, sets 0 to M, ways 0 to N (tag and data arrays), tag comparators (=?), primary link input data, primary link output data.
 Additional cache logic: DRAIN-enabled control logic, emergency link input (serial to parallel), emergency link output (parallel to serial).]
<5,000 gates per node
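The serial-to-parallel and parallel-to-serial blocks suggest that a cache line is broken into narrow symbols for the emergency link and reassembled on the other side. The sketch below models that idea in software. It assumes, purely for illustration, that both emergency-link wires carry data bits; the slides do not say how the 2 wires are used, and the function names are hypothetical.

```python
def serialize(line: bytes, wires: int = 2):
    """Split `line` into symbols of `wires` bits each, most significant bit first."""
    bits = [(byte >> shift) & 1 for byte in line for shift in range(7, -1, -1)]
    symbols = []
    for start in range(0, len(bits), wires):
        chunk = bits[start:start + wires]
        chunk += [0] * (wires - len(chunk))  # zero-pad the final symbol if needed
        symbols.append(tuple(chunk))
    return symbols


def deserialize(symbols, num_bytes: int) -> bytes:
    """Reassemble `num_bytes` bytes from the symbols produced by `serialize`."""
    bits = [bit for symbol in symbols for bit in symbol][: num_bytes * 8]
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


line = bytes([0xDE, 0xAD, 0xBE, 0xEF])
narrow = serialize(line, wires=2)    # emergency-link width from the slides
wide = serialize(line, wires=32)     # low end of the primary-link width
print(len(narrow), len(wide))
assert deserialize(narrow, len(line)) == line
```

Under this assumption a 2-wire link needs 16 times as many transfers as a 32-wire link for the same data, which gives some intuition for why draining over emergency links is slower.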
Drain Performance as Links Fail
[Plot: drain time (cycles per incident, 0M to 5M) vs. injected faults (0 to 100). Series: average time to flush data via emergency links; average time to flush data via primary links. Annotation: increasing emergency link time.]
Memory Latency Before and After
[Bar chart: avg. memory latency (cycles, 0 to 250), before recovery vs. after recovery.]
Conclusions
• DRAIN is a lightweight recovery mechanism for CMPs
  – 5,000 gates per node
• Recoup cache data and architectural state from disconnected nodes
• Performance overhead only during a recovery incident
  – ~3 ms at 1 GHz
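The ~3 ms figure can be related to a cycle count with a one-line conversion. The snippet below assumes a drain time of roughly 3 million cycles per incident, which is what ~3 ms at 1 GHz implies; the function name is hypothetical.

```python
def cycles_to_ms(cycles: int, clock_hz: int) -> float:
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return cycles * 1000 / clock_hz


# Assumed drain time of about 3 million cycles at a 1 GHz clock.
print(cycles_to_ms(3_000_000, 1_000_000_000))  # 3.0
```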