Mechanism for Network-on-Chip Architectures University of Cyprus - - PowerPoint PPT Presentation


NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures. University of Cyprus, The Multicore Computer Architecture Laboratory (multiCAL), Ξ - Computer Architecture Research Group (Ξ - CARCH), EuroCloud FP7 Project


SLIDE 1

1 NoCAlert (MICRO-2012) University of Cyprus

University of Cyprus

International Symposium on Microarchitecture, December 3 2012, Vancouver, Canada

The Multicore Computer Architecture Laboratory (multiCAL)
Ξ - Computer Architecture Research Group (Ξ - CARCH)
EuroCloud FP7 Project

NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures

SLIDE 2

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 3

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 4

The Network-on-Chip (NoC) paradigm

Image courtesy of C. Daniloff

  • On-chip interconnection fabric (backbone) to connect all nodes
  • Modular design
  • Structured Interconnect Layout
  • Scalable and efficient
  • Packet-based communication
SLIDE 5

Core Number Increases

Graph courtesy of www.crn.com

Timeline: 1971

Intel 4004: 4-bit

Following Moore’s law, the number of transistors per chip doubles approximately every 18-24 months → designers turn to integrating more cores to take advantage of parallelism

SLIDE 6

Core Number Increases

Timeline: 1971, 2000

SLIDE 7

Core Number Increases

Timeline: 1971, 2000, 2007

Intel Core 2 Duo: 2 cores

SLIDE 8

Core Number Increases

Timeline: 1971, 2000, 2007, 2008

Intel Core i7 (Nehalem): 4 cores

SLIDE 9

Core Number Increases

Timeline: 1971, 2000, 2007, 2008, 2009

AMD Opteron 2400: 6 cores

SLIDE 10

Core Number Increases

Timeline: 1971, 2000, 2007, 2008, 2009, 2010

IBM POWER7: 8 cores

SLIDE 11

Core Number Increases

Timeline: 1971, 2000, 2007, 2008, 2009, 2010, 2011

Intel Xeon Westmere-EX: 10 cores

SLIDE 12

Core Number Increases

Timeline: 1971, 2000, 2007, 2008, 2009, 2010, 2011, 2012

SLIDE 13

Core Number Increases

Timeline: 1971, 2000, 2007, 2008, 2009, 2010, 2011, 2012, … Near Future

Intel Single-Chip Cloud Computer: 48 cores
Intel Polaris Chip: 80 cores

SLIDE 14

It’s already happening!

Intel Polaris Chip

  • Router is becoming part of the core design
  • NoCs are becoming necessary

Tilera TILE64 – 64 Cores

  • 2D mesh NoC comprising 5 independent networks, one for each of 5 message classes

Timeline: 1971, 2000, 2007, 2008, 2009, 2010, 2011, 2012, … Near Future

SLIDE 15

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 16

Reliability in the nano era

  • Aggressive transistor downsizing
    – Increasing hardware variability
    – Susceptibility to wear-out (accelerated aging effects)
  • Permanent faults
    – Static (occurring at manufacture-time)
      • Process Variability (PV), manufacturing imperfections
    – Dynamic (occurring at run-time; prolonged stressing → component wear-out)
      • Electro-Migration (EM), Negative Bias Temperature Instability (NBTI), oxide breakdown, Stress-Induced Voiding (SIV), Hot Carrier Injection (HCI), etc.
  • Transient faults (or Soft Errors – Single-Event Upsets, SEU)
    – Alpha particles (impurities in packaging/interconnect), cosmic-ray-induced neutrons, neutron-induced 10B fission (interconnect layer insulator)
    – Traditionally associated with memories
      • Error Correcting Codes (ECC) widely used in DRAM modules
SLIDE 17

Ominous predictions regarding reliability

[Figure: probability of failure; impact of NBTI on failure probability trends *]

  • This study from DATE 2010 reports that failure probabilities increase by tens of orders of magnitude at 12 nm, as opposed to 45 nm.
  • Each new technology generation decreases IC lifetime by half [ITRS 2011]

Challenge: “Designing reliable systems from unreliable components” **

* S.R. Nassif, N. Mehta, and Yu Cao, "A resilience roadmap," in Proc. of the Design, Automation and Test in Europe Conference (DATE), 2010.

** S. Borkar, "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," in IEEE Micro, Nov/Dec 2005.

SLIDE 18

NoC (un)reliability implications

  • A single fault in the NoC can cause:
    – Network disconnections
    – Deadlocks (network- and protocol-level)
    – Lost packets
    – Degraded performance
  → A single fault can paralyze the entire system (CMP)
  • Protecting the NoC is of paramount importance
SLIDE 19

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 20

NoCAlert: The Big Picture

  • NoCAlert:
    – Lightweight distributed invariance checkers
    – Checkers behave like hardware assertions
    – Checks for legality, not correctness
    – Network’s operation is never interrupted
    – Provides almost instantaneous fault detection
  • Assumption:
    – Packet/flit contents are protected with ECC
  • NoCAlert protects against faults in the control logic
  • Interesting observation
    – Erroneous but legal module outputs are always benign

SLIDE 21

NoCAlert’s Terminology

  • Invariance violation:
    – The breaking of a fundamental functional rule within the context of a component’s operation
    – e.g., the routing computation unit outputs an illegal direction
  • Legality:
    – An illegal output is one that cannot occur under the set of functional correctness rules of a given component
  • Instantaneous fault detection:
    – Detect a fault as soon as it manifests (same clock cycle)
    – Easier to recover
    – Localized information could identify faulty location

SLIDE 22

Invariance Checking

  • System is continuously (on-line) examined for illegal outputs
    – An illegal output can be the result of some kind of fault
  • Emulates assertions used in software
  • Example: Assume a variable X cannot get the value 5
    – assert(X!=5)
    – In hardware this would be achieved with a comparison unit that raises a flag
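
The comparison unit in this example can be modeled in software as a pure function that raises a flag rather than halting execution. This is only a minimal sketch: the names `ILLEGAL_VALUE` and `invariance_checker` are hypothetical and do not come from the slides.

```python
ILLEGAL_VALUE = 5  # the value X must never take (from the slide's example)


def invariance_checker(x: int) -> bool:
    """Return True (flag raised) when x holds the illegal value.

    Unlike a software assert, the checker only observes the signal:
    it never modifies x and never stops the monitored logic.
    """
    return x == ILLEGAL_VALUE


# Monitor a stream of values cycle by cycle and record when the flag fires.
trace = [3, 4, 5, 6]
flags = [invariance_checker(x) for x in trace]
print(flags)  # [False, False, True, False]
```

Returning a flag instead of raising an exception mirrors the idea that the network's operation is never interrupted by a checker.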

SLIDE 23

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 24

A generic (typical) NoC router micro-architecture

[Figure: router block diagram. Recoverable labels:]

  • Input ports: West In, East In, North In, South In, Processing Element In
    – Each input port shows an RC unit and virtual channels VC0-VC3, selected by VC ID
    – Buffer annotation: one flit capacity (one flit slot)
  • Routing Computation: next-hop direction
  • VA1 Arbitration (local): choose one specific output VC in adjacent router
  • VA2 Arbitration (global): resolve global conflicts
  • SA1 Arbitration (local): one winning VC in each port
  • SA2 Arbitration (global): resolve global conflicts
    – SA2 arbiters control the XBAR connections
  • 5x5 Crossbar
  • Output ports: West Out, East Out, North Out, South Out, Processing Element Out
  • Router Pipeline

* Network packets are broken into multiple flits. A flit is a flow control unit, the smallest unit of flow control in the NoC.

SLIDE 25

Identifying invariances within the NoC Router

  • Identification of invariances relies on the modularity and hierarchy of the NoC Router
  • The functional algorithm of each module is exhaustively inspected using a bottom-up approach
    – Identification of all the functional rules
    – Identification of all the functionally illegal outputs
  • End-to-end invariances at the network-level are identified

[Figure: module hierarchy. Labels: Network Level, Router Level, Input Port, FIFO Buffers, RC Unit, VA and SA Arbiters, Crossbar Switch]

SLIDE 26

Invariance categorization

  • 32 invariances have been identified through detailed exploration of the router’s microarchitecture
  • Identified invariances are categorized based on the router module they are associated with
    – Routing Computation unit (3)
    – Arbiters (10)
    – Crossbar (3)
    – Buffer State (12)
    – Port-Level (3)
    – End-to-End (network-level) (1)

SLIDE 27

Ensuring network correctness

  • Which conditions must a reliable network satisfy?
  • Four main conditions that ensure functional correctness within the network*
    – No packets are dropped
    – Delivery time is bounded
    – No data corruption occurs
    – No new packet is generated within the network
  • Additional requirement: Intra-packet flit ordering

* D. Borrione et al. A generic model for formally verifying NoC communication architectures: A case study. In NOCS 2007

SLIDE 28

Routing Algorithm

  • Routing algorithms forbid some turns to avoid deadlocks and livelocks in the network
  • E.g., Dimension-order XY routing

[Figure: 4x4 mesh of routers (R), source S (0,0), destination D (2,3)]
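
Dimension-order XY routing can be sketched in a few lines: a flit first travels along X until its column matches the destination, and only then along Y. The sketch below is illustrative only; it assumes coordinates are written (x, y), that East increases x and North increases y, and the names `xy_next_hop` and `xy_route` are hypothetical.

```python
# Assumed orientation: East = +x, North = +y (not stated on the slide).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_next_hop(cur, dst):
    """Return the next-hop direction under dimension-order XY routing."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # finish the X dimension first
        return "E" if dx > cx else "W"
    if cy != dy:                      # then resolve the Y dimension
        return "N" if dy > cy else "S"
    return "LOCAL"                    # arrived: eject to the local node


def xy_route(src, dst):
    """List the hop directions from src to dst."""
    path, cur = [], src
    while (direction := xy_next_hop(cur, dst)) != "LOCAL":
        path.append(direction)
        cur = (cur[0] + STEP[direction][0], cur[1] + STEP[direction][1])
    return path


print(xy_route((0, 0), (2, 3)))  # ['E', 'E', 'N', 'N', 'N']
```

Because every X hop precedes every Y hop, a route produced this way never turns from the Y dimension back into the X dimension.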

SLIDE 29

Invariance Example – Routing Algorithm

  • Invariance violation due to forbidden turn according to the specification of the XY routing algorithm

[Figure: 4x4 mesh of routers (R), source S (0,0), destination D (3,3), with the forbidden turn marked]
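
A checker for this invariance can be sketched as a predicate over the direction a flit was travelling and the direction chosen by the routing computation unit. This is an assumption-laden illustration, not the checker from the talk: the function name, the direction encoding, and the reading of "forbidden turn" as any Y-to-X turn are all hypothetical choices made for the example.

```python
X_DIRS = {"E", "W"}
Y_DIRS = {"N", "S"}


def xy_turn_violation(heading: str, rc_output: str) -> bool:
    """Flag an RC output that makes a Y-to-X turn, which XY routing forbids.

    heading   -- direction the flit was travelling when it arrived
                 ("E", "W", "N", "S", or "LOCAL" for a freshly injected flit)
    rc_output -- direction chosen by the routing computation unit
    """
    return heading in Y_DIRS and rc_output in X_DIRS


print(xy_turn_violation("N", "E"))      # True: Y-to-X turn is illegal
print(xy_turn_violation("E", "N"))      # False: X-to-Y turn is legal
print(xy_turn_violation("LOCAL", "E"))  # False: injected flits may start in X
```

The predicate only tests legality. An output that is legal under XY routing but points away from the destination would pass this check.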

SLIDE 30

Invariance Example - Arbiters

[Figure: 5:1 arbiter examples showing request and grant vectors]

  • Grant is not allowed without a corresponding request
  • Arbiter’s output must always be 1-hot
  • Can’t assign one resource (output port or VC) to multiple contestants
  • If at least one request exists, the arbiter must grant one of the contestants
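
The arbiter rules listed on this slide map naturally onto bitwise checks over the request and grant vectors. The following is a rough sketch under stated assumptions: vectors are modeled as Python integers, "1-hot" is read as "at most one grant bit set" with the empty-grant case handled by the last rule, and the function name is hypothetical.

```python
def arbiter_violations(requests: int, grants: int) -> list[str]:
    """Check one arbiter's request/grant bit-vectors against the slide's rules."""
    violations = []
    if grants & ~requests:            # a grant bit with no matching request bit
        violations.append("grant without request")
    if grants & (grants - 1):         # more than one grant bit is set
        violations.append("grant vector not one-hot")
    if requests and not grants:       # someone asked, nobody was served
        violations.append("requests pending but no grant")
    return violations


print(arbiter_violations(0b00101, 0b00100))  # []
print(arbiter_violations(0b00001, 0b00010))  # ['grant without request']
print(arbiter_violations(0b00111, 0b00011))  # ['grant vector not one-hot']
print(arbiter_violations(0b00100, 0b00000))  # ['requests pending but no grant']
```

In hardware each of these checks reduces to a small amount of combinational logic next to the arbiter, which fits the lightweight-checker idea.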

SLIDE 31

Faults that do not cause invariance violations

[Figure: 4x4 mesh of routers (R), source S (0,0), destination D (3,3), contrasting a forbidden turn with a legal turn]

  • Invariance checking only detects illegal outputs
  • Does not necessarily detect incorrect outputs
SLIDE 32

Faults that do not cause invariance violations (Cont.)

  • Two fundamental questions arise from this kind of fault:
    1. If such non-invariant upsets cause some other functional violation later on in the network, will the fault be caught by subsequent NoCAlert checkers?
    2. If these non-invariant upsets are never caught by any subsequent NoCAlert checker, do they end up affecting the overall network correctness?

SLIDE 33

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 34

Evaluation framework

  • Tools used:
    – GARNET cycle-accurate NoC simulator
      • Used for extensive experimentation under fault presence
    – Synopsys Design Compiler (for hardware synthesis)
      • Used to assess the hardware overhead of NoCAlert
      • Based on full Verilog HDL implementation of NoCAlert and synthesis using 65 nm commercial standard-cell libraries
  • Compared against ForEVeR*, the current state-of-the-art

* R. Parikh and V. Bertacco, "Formally enhanced runtime verification to ensure noc functional correctness," In Proc. of the International Symposium on Microarchitecture (MICRO), 2011.

SLIDE 35

The ForEVeR mechanism *

  • Epoch-based on-line fault detection mechanism
    – Achieved with the help of an additional lightweight checker network that is assumed to be 100% reliable
    – Contains run-time checks for arbitration stages and End-to-End coverage
  • Counter-based scheme that uses notification packets
  • Fault assessment occurs at the end of each epoch
    – If counter values are not reconciled, a recovery mechanism is triggered
    – In-flight data delivered to the intended destination via the checker network
  • BUT, how to choose epoch duration (sensitivity to traffic injection rate, application-specific, etc.)?
    → False positives even in a fault-free environment!

* R. Parikh and V. Bertacco, "Formally enhanced runtime verification to ensure noc functional correctness," In Proc. of the International Symposium on Microarchitecture (MICRO), 2011.

SLIDE 36

Fault-injection framework

  • Fault model: Single-bit, single-event transient faults
  • Faults were injected at the inputs and outputs of every control module of a router
    – RC units, VA and SA arbiters, Crossbar, and Status Tables
    – One fault injected in each experiment
  • Total number of fault locations:
    – 11,808 for 8x8 2D mesh network

[Figure: panels (a) and (b) each show a logic module with its checker module and invariance assertion signal; a single-bit fault is injected at the module input in one panel and at the module output in the other]
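
A single-bit, single-event transient fault can be modeled as flipping one bit of a signal during exactly one cycle. The sketch below is illustrative only and is not the injection code used in the experiments; the function names and the 5-bit grant-vector trace are hypothetical.

```python
def inject_single_bit_fault(value: int, bit: int, width: int) -> int:
    """Flip exactly one bit of a `width`-bit signal value."""
    if bit not in range(width):
        raise ValueError("bit index outside signal width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def run_with_transient(signal_trace, fault_cycle, bit, width):
    """Return a copy of the trace with one bit flipped in one cycle only."""
    return [
        inject_single_bit_fault(v, bit, width) if cycle == fault_cycle else v
        for cycle, v in enumerate(signal_trace)
    ]


# Hypothetical trace of a 5-bit grant vector over three cycles.
grants = [0b00100, 0b00100, 0b00001]
faulty = run_with_transient(grants, fault_cycle=1, bit=0, width=5)
print([f"{g:05b}" for g in faulty])  # ['00100', '00101', '00001']
```

In this example the upset turns a one-hot grant vector into a two-hot one for a single cycle, which is the kind of illegal output an arbiter checker is meant to flag.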

SLIDE 37

The Golden Reference report

  • For each experiment, generate a Golden Reference (GR) report:
    – A log of the entire network’s output (flit ejections), under a fault-free run.
    – “Oracle” knowledge of what should normally happen
  • “Contaminated” logs are compared against the GR
    – If all flits were delivered correctly (remember the four rules), and intra-packet order was maintained, the fault was benign (no system-level effect)
    – Note that the global order of packets is allowed to change
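
The comparison against the Golden Reference can be sketched as two checks: the same multiset of flits must be ejected, and the flits of each packet must keep their relative order, while packets may interleave differently. This is a simplified illustration: the log format `(packet_id, flit_seq, payload)` and the function names are assumptions, and the bounded-delivery-time rule is not modeled.

```python
from collections import Counter, defaultdict


def per_packet_order(log):
    """Group flit sequence numbers by packet, preserving ejection order."""
    order = defaultdict(list)
    for packet_id, seq, _payload in log:
        order[packet_id].append(seq)
    return dict(order)


def is_benign(golden_log, faulty_log):
    """Compare a fault-injected run against the Golden Reference log."""
    # Same multiset of flits: nothing dropped, nothing new, nothing corrupted.
    if Counter(golden_log) != Counter(faulty_log):
        return False
    # Flits of each packet must be ejected in the same relative order.
    return per_packet_order(golden_log) == per_packet_order(faulty_log)


golden = [("p0", 0, "a"), ("p0", 1, "b"), ("p1", 0, "c")]
reordered_packets = [("p1", 0, "c"), ("p0", 0, "a"), ("p0", 1, "b")]
swapped_flits = [("p0", 1, "b"), ("p0", 0, "a"), ("p1", 0, "c")]
print(is_benign(golden, reordered_packets))  # True
print(is_benign(golden, swapped_flits))      # False
```

The first contaminated log only changes the global packet order, so it counts as benign. The second breaks intra-packet flit ordering, so it does not.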

SLIDE 38

Network’s State Affects Fault Detection

  • The state of the network can influence fault detection behavior:
    – Faults in an empty network are less likely to be masked
    – Warmed-up networks might “hide” faults
  • Need for testing at different states
    – 7 different traffic injection ratios (10-40% in 5% increments)
    – 3 different fault injection instances
      • Faults were injected at cycle 0 (empty network), cycle 32 K, and cycle 64 K (warmed-up network)
    – Resulting in a total of approx. 248 K simulations
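
The approximate simulation count can be reproduced from the numbers on the slides, assuming each of the 11,808 fault locations is exercised once per combination of injection rate and injection cycle. That pairing is an assumption made here to show the arithmetic; the variable names are hypothetical.

```python
fault_locations = 11_808                  # 8x8 2D mesh, from the fault-injection slide
injection_rates = list(range(10, 45, 5))  # 10% to 40% in 5% steps
injection_cycles = [0, 32_000, 64_000]    # empty network and two warmed-up points

total = fault_locations * len(injection_rates) * len(injection_cycles)
print(len(injection_rates), total)  # 7 247968
```

The product, 247,968, is consistent with the "approx. 248 K" figure quoted above.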

SLIDE 39

Classification of NoCAlert’s detection outcomes

  • Four main fault detection categories:
    – True positive: Detected non-benign fault
    – True negative: Non-detected benign fault
    – False positive: Detected benign fault
      • Can cause unnecessary fault recovery triggering
    – False negative: Non-detected non-benign fault
      • Worst case
      • Ideally, this should be ZERO
  • Non-benign: Comparison against GR failed
  • Benign: Successful comparison against GR
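
The four categories follow from two booleans per experiment: whether any checker fired, and whether the run matched the GR. A minimal sketch, with a hypothetical function name:

```python
def classify(detected: bool, matches_golden: bool) -> str:
    """Map one fault-injection experiment to its detection category."""
    benign = matches_golden  # benign means the comparison against the GR succeeded
    if detected:
        return "false positive" if benign else "true positive"
    return "true negative" if benign else "false negative"


for detected in (True, False):
    for matches in (True, False):
        print(detected, matches, classify(detected, matches))
```

Only the combination "not detected, GR mismatch" yields a false negative, the case the slide says should ideally never occur.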
SLIDE 40

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 41

Fault coverage breakdown

  • Observations:
    – 0% false negatives for both schemes (no malicious faults escape)
    – False-positive percentages higher in a warmed-up network
      • More faults are masked
    – NoCAlert behaves slightly worse than ForEVeR in terms of false positives (ForEVeR is an epoch-based mechanism → some faults vanish by end of epoch)

Fault coverage breakdown (% of injected faults):

Injection cycle   Scheme     True Positive   False Positive   True Negative
Cycle 0           NoCAlert   51.64           30.62            17.73
Cycle 0           ForEVeR    51.64           27.76            20.59
Cycle 32000       NoCAlert   38.45           45.33            16.22
Cycle 32000       ForEVeR    38.45           42.56            18.99
Cycle 64000       NoCAlert   38.70           45.15            16.15
Cycle 64000       ForEVeR    38.70           39.32            21.98
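
As a consistency check on the chart values, the three reported categories for each scheme and injection cycle should sum to roughly 100%, since both schemes report 0% false negatives. The sketch below reads the chart's numbers in the order True Positive, False Positive, True Negative; that series-to-value mapping is an assumption, and the variable names are hypothetical.

```python
import math

# (scheme, injection cycle) -> (true positive %, false positive %, true negative %)
coverage = {
    ("NoCAlert", 0): (51.64, 30.62, 17.73),
    ("ForEVeR", 0): (51.64, 27.76, 20.59),
    ("NoCAlert", 32000): (38.45, 45.33, 16.22),
    ("ForEVeR", 32000): (38.45, 42.56, 18.99),
    ("NoCAlert", 64000): (38.70, 45.15, 16.15),
    ("ForEVeR", 64000): (38.70, 39.32, 21.98),
}

# Each bar should account for all injected faults, up to rounding.
sums_ok = all(math.isclose(sum(v), 100.0, abs_tol=0.05) for v in coverage.values())

# Extra false positives of NoCAlert relative to ForEVeR, per injection cycle.
fp_gap = {
    cycle: round(coverage[("NoCAlert", cycle)][1] - coverage[("ForEVeR", cycle)][1], 2)
    for cycle in (0, 32000, 64000)
}
print(sums_ok, fp_gap)
```

Under this reading every bar sums to 100% within rounding, and the NoCAlert false-positive rate exceeds ForEVeR's at all three injection cycles, matching the observation above.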

SLIDE 42

Fault detection latency

  • Observations:
    – 97% of NoCAlert’s fault detections were instantaneous
    – Significant fault detection latency improvement
      • Up to 100x
    – NoCAlert: on-line mechanism → instantaneous detection

[Figure: Cumulative fault detection delay distribution. x-axis: Detection Delay (Cycles), ticks at 1, 8, 64, 512, 4096; y-axis: Faults Detected (%); series: NoCAlert, ForEVeR]

SLIDE 43

Hardware overhead (synthesis using 65 nm libraries)

  • Area overhead: ranges from 1.38% to 4.42% (3% average)
  • Power* overhead: 0.3% - 1.3% (0.7% average)
  • Critical path overhead: At most 3% (1% average)
  • DMR-CL: Use Double Modular Redundancy (DMR) protection for control logic
  • NoCAlert: Use NoCAlert protection for control logic

[Figure: normalised area and area overhead (%) versus number of VCs per port (2 to 8 VCs); series: Baseline, NoCAlert, DMR-CL, NoCAlert Area Overhead (%), DMR-CL Area Overhead (%)]

* The power numbers were extracted from the Synopsys Design Compiler power report, with switching activity set to 50% for all nets. The operating voltage is 1 V.

SLIDE 44

Outline

  • Necessity of Networks-on-Chip (NoCs)
  • Reliability and NoCs
  • The NoCAlert Approach: Invariance Checking
  • Identifying invariances and examples
  • Evaluation
  • Results
  • Conclusion – Future Work
SLIDE 45

Conclusion

  • Comprehensive on-line and real-time fault detection mechanism
  • Ensures 0% false negatives within the NoC
  • Based on the concept of invariance checking
    – A collection of micro-checker modules dispersed throughout the router’s control logic modules
    – Real-time hardware assertions
  • Tremendous improvement in detection delay
  • Extremely lightweight nature of NoCAlert in terms of area/power/timing overhead

SLIDE 46

Future Work

  • Formally prove NoCAlert’s full coverage
  • Fault localization
    – Take advantage of the localized information provided by NoCAlert
    – Try to pinpoint the faulty location
  • Fault Recovery
    – After fault localization
    – Techniques to bypass faulty modules

SLIDE 47

Thank you!

QUESTIONS?