Vicis: A Reliable Network for Unreliable Silicon
Andrew DeOrio, David Fick, Jin Hu, Valeria Bertacco, David Blaauw, Dennis Sylvester
Electrical Engineering & Computer Science Department The University of Michigan, Ann Arbor
[Figure: Fault Rate versus Chip Time. Labels: eliminated by burn-in, Useful Chip Lifetime, Wearout, eliminated by margining. The final build adds a region labeled Reclaimed by Vicis.]
Detection: Detect if fault has occurred
Diagnosis: Diagnose what fault has occurred
Recovery: Recover and resume normal operation
[Figure: three plots with axes labeled Errors and Performance, titled Failure at first fault, Failure after several faults, and Graceful performance degradation.]
[Figure: router i with ports N, E, S, W and local. A packet addressed to dest i is a sequence of flits (head, data, ..., tail). The routing table has columns dest and dir; its entries include N, S, W and local.]
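The routing-table lookup on this slide can be sketched in a few lines. This is a minimal illustration, not the Vicis implementation: the port names come from the slide, while the function names, the 9-node numbering and the table contents are assumptions made for the example.

```python
# Hypothetical sketch of table-based routing in a single router.
# Port names (N, E, S, W, local) follow the slide; all other names are assumed.


def make_packet(dest, payload):
    """Build a packet as a list of flits: one head, the data flits, one tail."""
    flits = [("head", dest)]
    flits += [("data", word) for word in payload]
    flits.append(("tail", None))
    return flits


def route(routing_table, flits):
    """Return the output port for a packet, looked up from its head flit."""
    kind, dest = flits[0]
    if kind != "head":
        raise ValueError("packet must start with a head flit")
    return routing_table[dest]


# Assumed example: router i = 4 in a network of 9 nodes numbered 0..8.
# Each destination maps to one output direction; destination i maps to local.
routing_table = {
    0: "N", 1: "N", 2: "N",
    3: "W", 4: "local", 5: "E",
    6: "S", 7: "S", 8: "S",
}

packet = make_packet(dest=7, payload=[0xA, 0xB])
print(route(routing_table, packet))  # prints "S"
```

Because the direction for each destination lives in a table rather than in fixed logic, routing around a failed link or router amounts to rewriting table entries.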
[Figure: router ports Local, North, West, South, East.]
[Figure: ECC-to-ECC Path across a Link. Each side is labeled ECC/DEC, BIST, XB and bbus, with swap between the two BIST blocks.]
[Figure: network diagram with labels Destination and Routed.]
[Figure: stuck-at fault example. Labels: 1, Stuck-0. Annotation: missed faults!]
[Figure: built-in self-test flow. Labels: LFSR, expected signature, pre-recorded test, bit flip counter. Results go to the configuration table and are saved for the reconfiguration routine.]
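The test flow named on this slide (LFSR patterns, a comparison against expected results, a bit flip counter, and a saved result for reconfiguration) can be illustrated with a small sketch. It is an assumption-laden simplification: it compares each response to a fault-free model instead of compacting responses into a signature, and the 8-bit width, tap positions, function names and the injected fault are all chosen for the example, not taken from Vicis.

```python
# Hypothetical sketch of LFSR-driven self-test with a bit flip counter.
# Width, taps, and all function names are assumptions for illustration.


def lfsr_patterns(seed, count, width=8, taps=(8, 6, 5, 4)):
    """Yield `count` pseudo-random test patterns from a Fibonacci LFSR."""
    mask = (1 << width) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be nonzero, or the LFSR stays at zero")
    for _ in range(count):
        yield state
        feedback = 0
        for tap in taps:  # XOR the tapped bits (1-indexed positions)
            feedback ^= (state >> (tap - 1)) & 1
        state = ((state << 1) | feedback) & mask


def count_bit_flips(unit, golden, patterns):
    """Count output bits where `unit` differs from the fault-free `golden`."""
    return sum(bin(unit(p) ^ golden(p)).count("1") for p in patterns)


def faulty_bits(unit, golden, patterns):
    """Return the set of bit positions that ever differ from `golden`."""
    diff = 0
    for p in patterns:
        diff |= unit(p) ^ golden(p)
    return {bit for bit in range(diff.bit_length()) if (diff >> bit) & 1}


def golden(x):
    """Fault-free model: an 8-bit pass-through."""
    return x & 0xFF


def stuck_at_0(x):
    """Same unit with bit 2 stuck at 0 (an injected example fault)."""
    return golden(x) & ~0x04 & 0xFF


patterns = list(lfsr_patterns(seed=1, count=32))

# Saved result, standing in for the configuration table on the slide.
configuration_table = {"faulty_bits": faulty_bits(stuck_at_0, golden, patterns)}
print(configuration_table)  # prints {'faulty_bits': {2}}
```

A stuck-at-0 bit only shows up when some pattern drives that bit to 1, which is why the choice of test patterns determines whether a fault is caught or missed.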
Key advantage: keep processors connected to network
[Figure: Available Routers (y-axis, 1 to 9) against an x-axis running from 20 to 100.]
~50% of routers work despite 100 faults (~1/2000 gates broken)
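The two numbers on this slide can be cross-checked with one line of arithmetic: if 100 faults correspond to roughly 1 broken gate in 2000, the network under test has on the order of 200,000 gates. The gate count is an inference from the slide's figures, not a number the slide states, and the variable names are illustrative.

```python
faults = 100             # from the slide
gates_per_fault = 2000   # from the slide: ~1/2000 gates broken

# Inferred, not stated on the slide: total gates implied by the two figures.
implied_gates = faults * gates_per_fault
print(implied_gates)  # prints 200000
```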
[Figure: Network Performance (0.0 to 1.0) and Available Routers (1 to 9) against an x-axis running from 20 to 100, with a 5th-95th percentile band.]
Throughput decreases as routers reconfigure
Throughput increases as network shrinks
[Figure: TMR versus Vicis, at different TMR granularities*; both axes run from 10 to 100.]
Vicis: 42% area overhead; TMR: 200+% area overhead
Key advantage: extended MTTF
Key advantage: low area
*BulletProof: A Defect-tolerant CMP Switch Architecture, Constantinides, et al. 2006