Vicis: A Reliable Network for Unreliable Silicon

Andrew DeOrio, David Fick, Jin Hu, Valeria Bertacco, David Blaauw, Dennis Sylvester
Electrical Engineering & Computer Science Department, The University of Michigan, Ann Arbor


SLIDE 1

Vicis: A Reliable Network for Unreliable Silicon

Andrew DeOrio, David Fick, Jin Hu, Valeria Bertacco, David Blaauw, Dennis Sylvester

Electrical Engineering & Computer Science Department The University of Michigan, Ann Arbor

SLIDE 2

Network-on-Chip

  • Growing design complexity → modular architectures
  • Network-on-Chip (NoC)
  • Components (processors, memory, etc.) communicate via routers
  • Scalable bandwidth, inherent redundancy

[Figure: processors connected through routers]

SLIDE 3

Transistor Wear-Out

  • Technology-scaling increases likelihood of wear-out
  • Reasons include: oxide breakdown, electromigration, etc.

[Figure: The Bathtub Curve. Axes: fault rate vs. time. Labels: wear-out, useful chip lifetime, eliminated by burn-in, eliminated by margining.]

SLIDE 4

Transistor Wear-Out

  • Technology-scaling increases likelihood of wear-out
  • Reasons include: oxide breakdown, electromigration, etc.

[Figure: The Bathtub Curve. Axes: fault rate vs. time. Labels: wear-out, useful chip lifetime, eliminated by burn-in, eliminated by margining. Added annotation: technology-scaling → shorter useful life.]

SLIDE 5

Transistor Wear-Out

  • Technology-scaling increases likelihood of wear-out
  • Reasons include: oxide breakdown, electromigration, etc.

[Figure: The Bathtub Curve. Axes: fault rate vs. time. Labels: wear-out, useful chip lifetime, eliminated by burn-in, eliminated by margining. Added annotation: reclaimed by Vicis.]

SLIDE 6

System Response to Wear-Out

  • Detection: detect if fault has occurred
  • Diagnosis: diagnose what fault has occurred
  • Reconfiguration: reconfigure network to account for fault
  • Recovery: recover and resume normal operation

Other labels on the slide: Vicis, Checkpointing, Invariants

[Figure: three plots with axes labeled Errors and Performance]

  • No Protection: failure at first fault
  • Triple Modular Redundancy: failure after several faults
  • Vicis: graceful performance degradation

SLIDE 7

Fault Tolerance Strategy

If faults are isolated, fault-free lanes may continue to operate

[Figure: router block diagram. Labels: input, FIFO, decoder, routing table, crossbar, output.]

SLIDE 8

Outline

  • Architecture Overview
  • Diagnostic Approach
  • Experimental Results
  • Conclusion

SLIDE 9

Network Assumptions

  • Wormhole routing
  • 2D mesh or torus
  • Static routing
  • No virtual channels
  • Hard fault injection

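[Figure: router i with N, S, E, W and local ports; a packet made of head, data and tail flits addressed to dest i; a routing table with columns dest and dir, covering destinations 1 to n-1, in which destination i maps to local.]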
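The assumptions above (wormhole routing driven by a static per-router table) can be illustrated with a minimal sketch. The class name, port names and table contents are hypothetical and not taken from the slides; the sketch only shows that the head flit consults the table while data and tail flits follow the reserved path.

```python
from dataclasses import dataclass, field

# Hypothetical flit kinds, chosen for illustration only.
HEAD, DATA, TAIL = "head", "data", "tail"


@dataclass
class Router:
    router_id: int
    # Static routing table: destination router id -> output direction.
    table: dict
    # Wormhole state: output direction reserved by each input port.
    reserved: dict = field(default_factory=dict)

    def route_flit(self, in_port, kind, dest=None):
        """Return the output direction for one flit arriving on in_port."""
        if kind == HEAD:
            # Only the head flit carries a destination and consults the table.
            self.reserved[in_port] = self.table[dest]
        out = self.reserved[in_port]
        if kind == TAIL:
            # The tail flit releases the path for the next packet.
            del self.reserved[in_port]
        return out


# Assumed example table for router 4 (destination 4 is the local port).
router = Router(router_id=4, table={1: "N", 7: "S", 3: "W", 4: "local"})
packet = [(HEAD, 7), (DATA, None), (DATA, None), (TAIL, None)]
directions = [router.route_flit("N", kind, dest) for kind, dest in packet]
```

Every flit of the packet leaves through the direction chosen by its head flit, and the reservation is cleared once the tail has passed.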

SLIDE 10

Vicis Reliability Features

[Figure: router block diagram. Labels: input, FIFO, decoder, routing table, crossbar, output.]

SLIDE 11

Vicis Reliability Features

[Figure: router block diagram. Labels: input, FIFO, decoder, ECC, routing table, crossbar, output.]

SLIDE 12

Vicis Reliability Features

[Figure: router block diagram. Labels: input, FIFO, decoder, ECC, routing table, crossbar, bypass bus, controller, output.]

SLIDE 13

Vicis Reliability Features

[Figure: router block diagram. Labels: input, port swapper, FIFO, decoder, ECC, routing table, crossbar, bypass bus, controller, output.]

SLIDE 14

Vicis Reliability Features

[Figure: router block diagram. Labels: input, port swapper, FIFO, decoder, ECC, routing table, crossbar, bypass bus, controller, BIST controller, output.]

SLIDE 15

Vicis Reliability Features

[Figure: router block diagram. Labels: input, port swapper, FIFO, decoder, ECC, routing table, crossbar, bypass bus, controller, BIST controller, config. table, output.]

SLIDE 16

Vicis Reliability Features

[Figure: router block diagram. Labels: input, port swapper, FIFO, decoder, ECC, routing table, crossbar, bypass bus, controller, BIST controller, config. table, distributed algorithm engine, output.]

SLIDE 17

Input Port Swapping

  • Partial crossbar gives multiple connection options
  • Priority given to local port in order to increase # of available IPs

[Figure: input ports (Local, North, West, South, East), port swapper, FIFO, decoder]

SLIDE 18

Input Port Swapping

[Figure: two groups of routers (R) with marks (X). Captions: two functional links; after an input port swap, three functional links. Legend: input port, output port.]
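Input port swapping can be viewed as a matching problem: fault-free links should be paired with fault-free input ports, with the local link served first. The sketch below is one way to express that idea and is not the hardware mechanism from the talk. The function name, the port names and the connectivity in `allowed` are all assumptions, since the slides do not give the partial crossbar's actual wiring.

```python
def swap_input_ports(link_ok, port_ok, allowed, priority=("local",)):
    """Match functional links to functional input ports.

    link_ok: dict link name -> True if the link is fault-free
    port_ok: dict port name -> True if the input port is fault-free
    allowed: dict link name -> ports reachable through the partial crossbar
    Returns a dict link -> port for the links that stay usable.
    """
    port_owner = {}  # port -> link currently assigned to it

    def try_assign(link, seen):
        for port in allowed[link]:
            if not port_ok[port] or port in seen:
                continue
            seen.add(port)
            # Take a free port, or move its current owner to another port.
            if port not in port_owner or try_assign(port_owner[port], seen):
                port_owner[port] = link
                return True
        return False

    good_links = [l for l in link_ok if link_ok[l]]
    # Prioritised links go first; a link matched early is never unmatched later.
    ordered = [l for l in priority if l in good_links]
    ordered += [l for l in good_links if l not in priority]
    for link in ordered:
        try_assign(link, set())
    return {link: port for port, link in port_owner.items()}


# Assumed wiring: each link reaches its own port plus one neighbouring port.
allowed = {
    "local": ["p_local", "p_north"],
    "north": ["p_north", "p_west"],
    "west": ["p_west", "p_south"],
    "south": ["p_south", "p_east"],
    "east": ["p_east", "p_local"],
}
link_ok = {"local": True, "north": False, "west": True, "south": True, "east": True}
port_ok = {"p_local": False, "p_north": True, "p_west": True, "p_south": True, "p_east": True}
assignment = swap_input_ports(link_ok, port_ok, allowed)
```

In this example the local link's own input port is faulty and the north link is faulty, so the local link is moved onto the otherwise unused north input port and four links remain usable instead of three.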

SLIDE 19

Bypass Bus

  • Provides alternative path around crossbar
  • Round-robin arbiter inside controller
  • No penalty for single user, additional users must stall until free

[Figure: crossbar, bypass bus, controller]
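A round-robin arbiter of the kind mentioned above can be sketched as follows. This is a generic software model, not the controller from the talk. The class name and the per-cycle grant granularity are assumptions; in a wormhole network the bus may well be held for a whole packet.

```python
class RoundRobinArbiter:
    """Grant the single bypass bus to one requester per cycle."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.last = num_ports - 1  # so that port 0 has first priority

    def grant(self, requests):
        """requests: set of port indices wanting the bus this cycle.

        Returns the granted port, or None if nobody asked.
        """
        # Search starts just after the most recently granted port.
        for offset in range(1, self.num_ports + 1):
            port = (self.last + offset) % self.num_ports
            if port in requests:
                self.last = port
                return port
        return None


arbiter = RoundRobinArbiter(5)
grants = [
    arbiter.grant({1}),      # single user: granted immediately
    arbiter.grant({1, 3}),   # two users: one of them must stall
    arbiter.grant({1, 3}),
    arbiter.grant({1, 3}),
]
```

With a single requester the bus is granted every cycle; with two requesters the grants alternate, so each user stalls while the other holds the bus.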

SLIDE 20

Error Correction Codes

[Figure: ECC-to-ECC path between two routers. Labels: ECC/DEC, XB (crossbar), bbus (bypass bus), link, swap (input port swapper), FIFO, BIST.]

  • Six available paths between two routers
  • Only one fault may be corrected by ECC
  • Fault information for five unit types must be considered:
  • Crossbar, bypass bus, network link, input port swapper, FIFOs
  • All configurations must be performed simultaneously
  • Configurations affect each other, network-wide

SLIDE 21

Error Correction Codes

[Figure: ECC-to-ECC path between two routers with a single bit fault marked. Labels: ECC/DEC, XB (crossbar), bbus (bypass bus), link, swap (input port swapper), FIFO, BIST.]
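To illustrate why only one fault per ECC-to-ECC path can be tolerated, here is a minimal single-error-correcting example. The slides do not say which code Vicis uses or how it is applied to 32-bit flits; Hamming(7,4) is swapped in purely because it is small enough to follow by hand.

```python
def hamming74_encode(data):
    """data: list of 4 bits -> list of 7 bits (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit and return the 4 data bits."""
    syndrome = 0
    for position, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= position  # non-zero syndrome names the bad position
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]]


def stuck_at(code, index, value):
    """Model a hard fault: the wire at `index` always carries `value`."""
    damaged = list(code)
    damaged[index] = value
    return damaged


data = [1, 0, 1, 1]
sent = hamming74_encode(data)

one_fault = stuck_at(sent, 2, 0)
recovered_one = hamming74_decode(one_fault)

two_faults = stuck_at(one_fault, 6, 0)
recovered_two = hamming74_decode(two_faults)
```

A single stuck-at wire on the path is corrected, while a second stuck-at wire on the same path causes a miscorrection, which is why the reconfiguration has to keep each path down to at most one fault.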

SLIDE 22

Network Re-routing [DATE 2009]

  • Distributed routing algorithm
  • Can route around an arbitrary number of faulty links
  • Requires no virtual channels
  • Implemented in fewer than 300 gates

[Figure: mesh of routers. Legend: destination, routed.]
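The general flavour of distributed table construction can be sketched with neighbour-to-neighbour flag passing: a destination marks itself, and in each round any router next to an already-routed neighbour records the direction of that neighbour. This is a simplified illustration, not the published DATE 2009 algorithm. In particular it omits whatever rules that algorithm uses to stay deadlock-free without virtual channels, and it uses a mesh rather than a torus. All function names are hypothetical.

```python
def build_routing_tables(width, height, broken_links):
    """Fill per-router routing tables by neighbour-to-neighbour flag passing.

    broken_links: set of frozenset({a, b}) pairs of adjacent router ids.
    Returns tables[router][dest] = direction ("N", "S", "E", "W", "local").
    """
    def neighbours(r):
        x, y = r % width, r // width
        steps = {"N": (x, y - 1), "S": (x, y + 1), "W": (x - 1, y), "E": (x + 1, y)}
        for direction, (nx, ny) in steps.items():
            if 0 <= nx < width and 0 <= ny < height:
                n = ny * width + nx
                if frozenset({r, n}) not in broken_links:
                    yield direction, n

    routers = range(width * height)
    tables = {r: {r: "local"} for r in routers}
    for dest in routers:
        changed = True
        while changed:  # one pass models one round of flag exchange
            changed = False
            # Snapshot so every router only sees flags from the previous round.
            routed = {r for r in routers if dest in tables[r]}
            for r in routers:
                if r in routed:
                    continue
                for direction, n in neighbours(r):
                    if n in routed:
                        tables[r][dest] = direction
                        changed = True
                        break
    return tables


def trace(tables, src, dest, width):
    """Follow table entries hop by hop and return the visited routers."""
    delta = {"N": -width, "S": width, "W": -1, "E": 1}
    path = [src]
    while path[-1] != dest:
        path.append(path[-1] + delta[tables[path[-1]][dest]])
    return path


# 3x3 mesh, routers numbered row by row; the link between 1 and 2 is broken.
tables = build_routing_tables(3, 3, {frozenset({1, 2})})
path = trace(tables, 0, 2, 3)
```

With the link between routers 1 and 2 broken, traffic from router 0 to router 2 is steered around the fault through routers 3, 4 and 5.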

SLIDE 23

Hard Fault Diagnosis

[Figure: Unit 1 and Unit 2 under built-in self-test, shown twice. Labels: 1, stuck-0, missed faults!]

SLIDE 24

Hard Fault Diagnosis

Pattern Based Testing

  • Error unit, crossbar controller, routing table, decode/ECC, output ports, FIFO control

Datapath Testing

  • FIFO datapath, input port swapper, links, crossbar, bypass bus, configuration table

[Figure: test flow. Labels: LFSR, unit under test, = expected signature, pre-recorded test, bit flip counter, to configuration table, save for reconfiguration routine.]
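The two test styles named above can be sketched in software. Which diagram label belongs to which style is my reading of the slide, and the LFSR polynomial, the rotate-and-XOR signature and the test vectors are all assumptions; the slides give none of these details.

```python
def lfsr16(seed, count):
    """Yield `count` states of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed
    for _ in range(count):
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)


def signature(unit, patterns):
    """Compact a unit's 16-bit responses into one word (rotate, then XOR)."""
    sig = 0
    for pattern in patterns:
        sig = ((sig << 1) | (sig >> 15)) & 0xFFFF
        sig ^= unit(pattern) & 0xFFFF
    return sig


def count_faulty_bits(datapath, width=32):
    """Send pre-recorded vectors and count the bit positions that ever differ."""
    mask = (1 << width) - 1
    vectors = [0, mask, 0x55555555 & mask, 0xAAAAAAAA & mask]
    error_mask = 0
    for vector in vectors:
        error_mask |= (datapath(vector) ^ vector) & mask
    return bin(error_mask).count("1")


# Pattern-based test: compare a signature against the fault-free one.
def healthy_unit(x):
    return x ^ 0x00FF  # stand-in for some control logic


def faulty_unit(x):
    return healthy_unit(x) & ~(1 << 3)  # output bit 3 stuck at 0


patterns = list(lfsr16(0xACE1, 16))
expected = signature(healthy_unit, patterns)
pattern_test_passes = signature(faulty_unit, patterns) == expected

# Datapath test: count broken wires so the result can be saved for later use.
broken_wires = count_faulty_bits(lambda v: v | (1 << 5))  # bit 5 stuck at 1
```

The pattern-based check yields a pass or fail verdict per unit, while the datapath check yields a fault count per unit, which is the kind of information a reconfiguration step would need in order to keep each ECC-protected path within what the code can correct.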

SLIDE 25

Experimental Setup

  • 3x3 Torus
  • 32-bit data flits, 32 flit buffers
  • Implemented in Verilog
  • Synthesized, automatic place and route in 45nm
  • Reliability results
  • Implemented in C++
  • Performance results
  • Injected stuck-at faults on gate outputs
  • Weighting based on gate area
  • 10,000 packets per test, random uniform traffic
  • Parallel packet injection
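
Area-weighted stuck-at fault injection, as listed in the setup above, might look like the following sketch. The gate names and areas are made up, and sampling each gate at most once is an assumption; the slides only state that faults are stuck-at faults on gate outputs weighted by gate area.

```python
import random


def inject_faults(gate_areas, num_faults, rng):
    """Pick gate outputs to break, with probability proportional to gate area.

    gate_areas: dict gate name -> area (any consistent unit)
    Returns a dict gate name -> stuck-at value (0 or 1).
    """
    gates = list(gate_areas)
    weights = [gate_areas[g] for g in gates]
    if num_faults > sum(1 for w in weights if w > 0):
        raise ValueError("more faults requested than gates available")
    faults = {}
    while len(faults) < num_faults:
        gate = rng.choices(gates, weights=weights, k=1)[0]
        if gate not in faults:  # each gate output is broken at most once
            faults[gate] = rng.randint(0, 1)
    return faults


# Hypothetical netlist fragment: larger gates are more likely to be hit.
gate_areas = {"nand_0": 1.0, "filler_0": 0.0, "dff_0": 4.5, "mux_0": 2.2}
faults = inject_faults(gate_areas, 2, random.Random(1))
```

Passing in a seeded `random.Random` keeps an experiment repeatable, and a gate with zero area is never selected.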

SLIDE 26

Results – Router Reliability

Key advantage: keep processors connected to network

SLIDE 27

Results – Network Performance

[Figure: available routers (1 to 9) versus network faults (20 to 100)]

~50% of routers work despite 100 faults (~1/2000 gates broken)

SLIDE 28

Results – Network Performance

[Figure: normalized network throughput (0.0 to 1.0) and available routers (1 to 9) versus network faults (20 to 100), with the 5th-95th percentile range marked]

  • Throughput decreases as routers reconfigure
  • Throughput increases as network shrinks

SLIDE 29

Results – Network Reliability

[Figure: likelihood of correct operation (%) versus network faults, both axes 10 to 100, for Vicis and for TMR at different granularities*]

  • Vicis: 42% area overhead
  • TMR: 200+% area overhead
  • Key advantage: extended MTTF
  • Key advantage: low area overhead

*BulletProof: A Defect-tolerant CMP Switch Architecture, Constantinides, et al. 2006

SLIDE 30

Conclusion

  • Vicis can tolerate a large number of faults
  • Vicis provides much greater reliability than NMR-based solutions
  • Vicis: constant-reliability, probabilistic performance
  • NMR: constant-performance, probabilistic reliability
  • Vicis has an overhead of 42% versus 100+% for NMR