

SLIDE 1

DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in Multi-core Chips

Andrew DeOrio†, Konstantinos Aisopos‡§, Valeria Bertacco†, Li-Shiuan Peh§

DAC 2011

†University of Michigan ‡Princeton University §Massachusetts Institute of Technology

SLIDE 2

Reliable Networks on Chip

[Figure: a network-on-chip tile — router (R), processor (µP), cache ($)]

  • Detect if a fault has occurred
  • Diagnose where the fault has occurred
  • Reconfigure the network to account for the fault
  • Recover and resume normal operation

[Figure: reliability cycle — detection → diagnosis → reconfiguration → recovery; fault-tolerant routing covers reconfiguration, Drain covers recovery]

When nodes become disconnected, data is lost!

SLIDE 3

Previous Recovery Solutions

  • Checkpoint approaches: ReVive [Prvulovic et al. '02], SafetyNet [Sorin et al. '02]

– Data can be stuck in a checkpoint buffer when its node is disconnected
– Checkpointing to main memory (MEM): high performance overhead

  • Drain takes a reactive approach, incurring performance overhead only when errors occur

SLIDE 4

Data Recovery with Drain

  • Recover data lost during reconfiguration

– Emergency links provide an alternate path
– Transfer cache contents and architectural state

[Figure: mesh of nodes (µP, $, router) plus a memory controller (Mem); primary links are 32-128 wires, DRAIN emergency links are 2 wires and power gated]
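Because an emergency link is only 2 wires wide, against 32-128 for a primary link, the saved data has to be streamed out serially. A minimal software model of that serialization (the wire roles here — one data bit plus one valid signal — are an assumption for illustration, not the paper's exact protocol):

```python
def serialize(word, width):
    """Parallel-to-serial: emit (valid, data_bit) pairs, LSB first."""
    for i in range(width):
        yield (1, (word >> i) & 1)    # valid=1 while a bit is on the wire
    yield (0, 0)                      # valid=0 marks the end of the transfer

def deserialize(stream):
    """Serial-to-parallel: reassemble bits while valid stays asserted."""
    word = 0
    for i, (valid, bit) in enumerate(stream):
        if not valid:
            break
        word |= bit << i
    return word

# A 32-bit word takes 32 cycles on the narrow link instead of 1.
line = 0xDEADBEEF
assert deserialize(serialize(line, 32)) == line
```

The trade-off this models is the one on the slide: the emergency link is far slower per word, but cheap enough (2 wires, power gated when idle) to attach to every node.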

SLIDE 5

Drain Example

[Figure: 2×2 mesh of µP/$ nodes, one attached to memory (M); legend: router, emergency link, primary link]

SLIDE 6

Drain Example

[Figure: 2×2 mesh — one primary link fails (X): link failure]

Fault model: faults accumulate one at a time.
SLIDE 7

Drain Example

[Figure: 2×2 mesh after the first link failure]

Reconfigure the interconnect.

SLIDE 8

Drain Example

[Figure: a second primary link fails (X): link failure]

Fault model: initiate Drain recovery when a single additional fault causes a node to become isolated.
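The trigger condition above can be modeled as a simple graph check. A hedged sketch — the 2×2 topology and link names are made up, and "one additional fault away from isolation" is simplified here to "the router has at most one working link left", which matches a mesh but is not claimed to be the paper's exact circuit:

```python
# Illustrative model of the DRAIN trigger: flag nodes that a single
# additional link failure could disconnect.

def working_degree(node, links, failed):
    """Number of this node's links that have not failed."""
    return sum(1 for link in links[node] if link not in failed)

def at_risk(links, failed):
    """Nodes that the next single link fault could isolate."""
    return [n for n in links if working_degree(n, links, failed) <= 1]

# 2x2 mesh: nodes 0..3, each link named by its two endpoints.
links = {0: ["01", "02"], 1: ["01", "13"],
         2: ["02", "23"], 3: ["13", "23"]}

assert at_risk(links, failed=set()) == []        # healthy mesh: no node at risk
assert at_risk(links, failed={"01"}) == [0, 1]   # corner nodes down to one link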

SLIDE 9

Drain Example

[Figure: 2×2 mesh with two failed links]

Node isolated!

SLIDE 10

Drain Example

[Figure: data flowing out of the still-connected nodes]

Drain connected nodes via primary links.

SLIDE 11

Drain Example

[Figure: data flowing out of the isolated node over its emergency link]

Drain the disconnected node via the emergency link.

SLIDE 12

Emergency Link Algorithm

1. Find the next target cache that is still connected to main memory.
2. If none is found, find the next target cache toward the subnet border instead.
3. Copy dirty cache lines to the target cache.
4. Copy registers and architectural state to the target cache.
5. If the draining node is not yet empty, repeat from step 1; otherwise, done.
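The emergency-link algorithm above can be sketched in software. A hedged Python model — class and function names are illustrative, not from the paper — that prefers a target cache still connected to main memory, falls back toward the subnet border, and loops until the isolated node's dirty lines and architectural state have all been copied out:

```python
from dataclasses import dataclass, field

@dataclass
class Cache:
    name: str
    capacity: int                     # free slots for drained items
    stored: list = field(default_factory=list)

    def has_space(self):
        return len(self.stored) < self.capacity

@dataclass
class IsolatedNode:
    dirty_lines: list                 # dirty cache lines to rescue
    arch_state: str = "registers"     # registers + architectural state

def drain(node, neighbor_caches, memory_connected):
    """Drain an isolated node over the emergency link; True once every
    dirty line and the architectural state has found a new home."""
    pending = node.dirty_lines + [node.arch_state]   # lines first, state last
    while pending:
        # Prefer a target cache still connected to main memory ...
        target = next((c for c in neighbor_caches
                       if c.has_space() and c.name in memory_connected), None)
        # ... otherwise fall back to a cache toward the subnet border.
        if target is None:
            target = next((c for c in neighbor_caches if c.has_space()), None)
        if target is None:
            return False              # no space anywhere: cannot finish
        # Copy into the target until it fills, then pick the next one.
        while pending and target.has_space():
            target.stored.append(pending.pop(0))
    return True                       # source is empty: done
```

With two neighbor caches of capacity 2 and 3 and three dirty lines, the memory-connected cache fills first and the remainder, register state last, spills toward the border cache, mirroring the flowchart's two "find next target" branches.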

SLIDE 13

Drain Example

[Figure: data flowing via primary links once more]

Drain the connected node again.

SLIDE 14

Drain Example

[Figure: the three remaining µP/$ nodes]

Resume normal operation; the OS can re-assign the workload.

SLIDE 15

Drain Hardware

[Figure: Drain hardware added to a node (µP, router, local cache). Existing cache logic: set decoder, sets 0…M, ways 0…N, tag/data arrays with tag comparators. Additional cache logic: DRAIN-enabled control logic, a serial-to-parallel converter on the emergency link input, a parallel-to-serial converter on the emergency link output, and DRAIN data/tag paths alongside the primary link input/output.]

<5,000 gates per node
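The added control logic reuses the existing arrays, walking them set by set and way by way to pull out each dirty line's tag and data for transfer. A hedged model of that scan (array sizes, the dirty-bit encoding, and the tuple layout are illustrative assumptions, not the actual RTL):

```python
NUM_SETS, NUM_WAYS = 4, 2

# cache[set][way] = (tag, data, dirty); contents here are made up.
cache = [[(s * NUM_WAYS + w, f"data{s}{w}", (s + w) % 2 == 0)
          for w in range(NUM_WAYS)] for s in range(NUM_SETS)]

def dirty_lines(cache):
    """Yield (set, way, tag, data) for every dirty line, in array order --
    the order a hardware scan of the sets and ways would visit them."""
    for s, ways in enumerate(cache):
        for w, (tag, data, dirty) in enumerate(ways):
            if dirty:
                yield s, w, tag, data

lines = list(dirty_lines(cache))
```

Because the scan only reads arrays that are already there and adds a small state machine plus the two link converters, the incremental cost stays under 5,000 gates per node.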

SLIDE 16

Drain Performance as Links Fail

[Chart: drain time (cycles per incident, 0-5M) vs. number of injected faults (10-100), plotting the average time to flush data via emergency links and via primary links; emergency-link time increases as faults accumulate.]

SLIDE 17

Memory Latency Before and After

[Chart: average memory latency (cycles, 50-250) before recovery vs. after recovery.]

SLIDE 18

Conclusions

  • DRAIN is a lightweight recovery mechanism for CMPs

– 5,000 gates per node

  • Recoups cache data and architectural state from disconnected nodes
  • Performance overhead only during a recovery incident

– ~3 ms at 1 GHz