
SLIDE 1

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/311113761

AST2016 Dang slides

Data · November 2016


All content following this page was uploaded by Abderazek Ben Abdallah on 30 November 2016.


SLIDE 2

Reliability Assessment and Quantitative Evaluation of Soft-Error Resilient 3D Network-on-Chip Systems

Khanh N. Dang, Michael Meyer, Yuichi Okuyama, and Abderazek Ben Abdallah
{d8162103, d8161104, okuyama, benab}@u-aizu.ac.jp
Adaptive Systems Laboratory
Graduate School of Computer Science and Engineering
The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan

25th IEEE Asian Test Symposium (ATS'16), Nov. 21-24, 2016, Hiroshima, Japan

SLIDE 3

Content

Background
Soft Error Resilient 3D NoC System
Reliability Assessment Methodology
Evaluation Result
Conclusion & future work

25th IEEE Asian Test Symposium (ATS'16)



SLIDE 5

VLSI Design Challenges

For decades, CMOS technology has progressed to provide efficient solutions; however, VLSI design now faces several challenges:

Power Wall: Energy consumption increases by ~60% (high computing area) and ~40% (middle computing area) per year [Chang 2016].

Yield Wall: With a similar number of process control steps (~420), the yield at 5 nm is predicted to be under 55%, compared with ~78% at 28 nm [Yield].

Packaging: The pin count of Intel chips is expected to increase by 25% every 2 years (tick-tock period) [Intel Proc].


SLIDE 6

VLSI Design Challenges (cont.)

Time-to-Market: Being one quarter or one year late to market (with a 2-year product life) leads to a revenue loss of over 33% or 90%, respectively [TTM].

Reliability: Exposure to a variety of manufacturing, design, and operation factors makes future architectures more vulnerable to different types of faults [Henkel 2013].

A 10-15°C difference in operating temperature can lead to a 2x difference in MTTF [Shafique 2014].

The soft error rate at 0.45 V is 30x that at 0.7 V [Shafique 2014].

⇒ Reliability assessment has become an important part of the design process.


SLIDE 7

Network-on-Chip

Network-on-Chip (NoC) is a new paradigm that replaces the traditional bus, with the following benefits:

Low power
Scalability
Reusability
Parallelism

[Figure: 2D mesh Network-on-Chip and 3D mesh Network-on-Chip. Legend: router (R), processing element (PE), network interface, wires.]

SLIDE 8

Reliability Challenges

Fault Type | Source
Soft Errors | Cross-talk; radiation particles; cosmic rays; thermal neutrons
Hard Faults | Manufacturing defects; time-dependent dielectric breakdown; thermal stress; electromigration; negative-bias temperature instability

[Figures: single-event transient caused by a radiation particle; open-wire defect.]

SLIDE 9

Reliability Challenges (cont.)

Fault Type | Potential Effects | Possible Solution
Soft Errors: bit flip (gate/wire) | Data corruption; misrouting; lost/duplicated packets; packet latency; locking state | Error correction code; temporal redundancy; self-verification and roll-back
Hard Faults: open (gate/wire); bridge (gate/wire); stuck-at-0/1; delay | Data corruption; packet loss/duplication/misrouting; locking state | Spare modules/gates for replacement; faulty-part isolation; fault-tolerant routing

With the increasing vulnerability of systems to faults and the critical effects of faults on NoC systems, NoC system reliability needs to be addressed.

SLIDE 10

Reliability Assessment

Reliability Assessment involves five phases:
System Definition
Preliminary Design
Detailed Design
Fabrication, Assembly, Integration and Test (FAIT)
Production/Support

Reliability assessment is important in the early design stages in order to prevent costly redesigns of the system.

Physical Analysis:
The design is analyzed in terms of physical failures.
A full-chip assessment can be obtained by combining separate parts.
Highest accuracy.
Requires massive time and computation resources.

Analytical Model:
The design is analyzed under an analytical model.
The design reliability is estimated from the sub-modules or events.
Low complexity and quick.

System-Level Simulation:
Faults are injected into the system under specific distributions and rates.
Gives accurate behavior under faults.
The result is trustworthy given fair fault distributions and a large number of statistical samples.

The analytical model is efficient for the three early stages.
By analyzing the design analytically, the critical part can be detected and improved.

SLIDE 11

Paper Contributions

1. An efficient soft error resilient mechanism and architecture (SER-3DR-NoC) for reliable 3D-NoC systems.
Uses redundant execution of pipeline stages to detect soft errors.
Uses three execution results and majority voting to recover from a soft error.

2. A formulation of reliability assessment for fault-tolerant systems.
Based on Mean Time Between Failures.
Modeled with a Markov-state model.

SLIDE 12

Content

Background
Soft Error Resilient 3D NoC System
Reliability Assessment Methodology
Evaluation Result
Conclusion & future work


SLIDE 13

Proposed System: SER-3DR-NoC.

[Figure: Proposed system architecture.]

The proposed system (SER-3DR-NoC) is a 3D-mesh-based Network-on-Chip. It consists of SER-3DR routers with 7 ports each (6 directions and 1 local).

The SER-3DR router operates in 3 pipeline stages: BW (Buffer Writing), NPC/SA (Next Port Computing/Switch Allocation), and CT (Crossbar Traversal). An incoming flit is stored in the input buffer. Next, its routing information is used to compute the routing path and to perform intra-router arbitration. Finally, the flit is forwarded through the crossbar.

SLIDE 14

Soft Error Resilience Method

Approach:
Replicate the execution of the pipeline stage.
Compare two consecutive results: a difference means a fault occurred.
Correct the fault by executing a third time and applying majority voting.
Target:
The routing (NPC) and arbitration (SA) units play an important role inside the network.

A soft error in NPC or SA can lead to misrouting, lost or duplicated packets, or even locking states.

NPC and SA are therefore selected for protection.

SLIDE 15

Soft Error Resilience Algorithm

[Figure: flowchart of the algorithm over cycles 1-4, distinguishing original and redundant pipeline stages. BW is followed by computing NPC and SA, then the redundant RNPC and RSA. If RNPC = NPC and SA = RSA, CT is computed; otherwise the router rolls back, recomputes NPC or SA, applies majority voting, and then computes CT.]

1. The incoming flit is stored in the buffer during the Buffer Writing stage.
2. Routing information is used for the first NPC/SA execution.
3. A redundant execution is performed for each of NPC and SA.
4. The two consecutive results of NPC and of SA are compared.
5. If they match, no soft error occurred and the flit is forwarded to the crossbar.
6. If one of them (or both) differs, the error is corrected by a third execution and majority voting.
7. Finally, the corrected routing flit is forwarded to the crossbar.
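The detect-and-recover flow can be sketched in software as follows. This is an illustrative model only, not the router hardware: `protected_stage`, `majority_vote`, `make_faulty`, and `next_port` are hypothetical names, and a pipeline stage is modeled as a function call whose result may be corrupted by a transient fault.

```python
from collections import Counter


def majority_vote(results):
    """Return the value that appears most often among the results."""
    value, _count = Counter(results).most_common(1)[0]
    return value


def protected_stage(compute, stage_input):
    """Run one pipeline stage (e.g. NPC or SA) with temporal redundancy.

    `compute` is a hypothetical callable standing in for the stage logic.
    Returns (result, executions): executions is 2 when the two consecutive
    runs agree, and 3 when a third run and a majority vote were needed.
    """
    first = compute(stage_input)    # original execution
    second = compute(stage_input)   # redundant execution
    if first == second:             # no soft error detected
        return first, 2
    third = compute(stage_input)    # roll back and execute a third time
    return majority_vote([first, second, third]), 3


def make_faulty(compute, faulty_call):
    """Wrap `compute` so that one chosen call (1-based) returns a corrupted value."""
    calls = {"n": 0}

    def wrapper(x):
        calls["n"] += 1
        result = compute(x)
        # Flip the lowest bit on the chosen call to imitate a soft error.
        return result ^ 1 if calls["n"] == faulty_call else result

    return wrapper


def next_port(dest):
    """Toy stand-in for next-port computation: map a destination to one of 7 ports."""
    return dest % 7
```

For instance, `protected_stage(make_faulty(next_port, 2), 10)` models a bit flip during the redundant execution: the two results disagree, the third run breaks the tie, and the fault-free port is returned with an execution count of 3.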

SLIDE 16

Content

Background
Soft Error Resilient 3D NoC System
Reliability Assessment Methodology
Evaluation Result
Conclusion


SLIDE 17

Reliability Assessment Methodology

We propose a reliability assessment method based on a Markov-state model.

A fault rate distribution is also proposed.

To evaluate the efficiency of a fault-tolerance technique, we present a new parameter: the Reliability Acceleration Factor.

To assess the soft error resilient mechanism, we apply the method to it.

SLIDE 18

Mean Time Between Failure

Mean Time Between Failures (MTBF) is the average time between two consecutive failures.

[Figure: timeline alternating between working and failed states. A soft error occurs, the system fails, the soft error is repaired, and the system works again; the time between failures spans two consecutive failures.]

Given a reliability function R, the MTBF is as follows:

$$\mathrm{MTBF} = \lim_{s \to 0} R(s)$$

where $R(s)$ is the reliability function in the Laplace domain.
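As a worked example that is not part of the slides, assume a constant fault rate $\lambda$, so that the reliability function takes the standard exponential form $R(t) = e^{-\lambda t}$. Writing $R(s)$ for the Laplace transform of $R(t)$, the limit reduces to the familiar result:

```latex
\begin{aligned}
R(s) &= \int_0^\infty e^{-st}\, e^{-\lambda t}\, dt = \frac{1}{s + \lambda}, \\
\mathrm{MTBF} &= \lim_{s \to 0} \frac{1}{s + \lambda} = \frac{1}{\lambda}.
\end{aligned}
```

This agrees with the direct definition $\mathrm{MTBF} = \int_0^\infty R(t)\, dt$.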


SLIDE 19

Fault Rate Model

Fault rate of a system consisting of k modules:

$$\lambda_{system} = \sum_{i=1}^{k} f_i \times OR_i \times AR_i \times \lambda_{unit}$$

The design is assumed to be in a "steady state", which has a constant fault rate.

Parameter | Description
unit | A selected module used as the reference for calculating the system's fault rate.
$OR_i$ | Operating time ratio of component $i$ to the unit.
$AR_i$ | Area cost ratio of component $i$ to the unit.
$f_i$ | Changing rate caused by attaching module $i$ to the system.
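A minimal sketch of the fault rate model as code. The `Module` record, the function name `system_fault_rate`, and all numeric values are illustrative assumptions, not data from the slides; only the summation itself follows the model.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """One module of the system, described relative to the reference unit."""
    name: str
    f: float           # changing rate caused by attaching the module (f_i)
    op_ratio: float    # operating time ratio to the unit (OR_i)
    area_ratio: float  # area cost ratio to the unit (AR_i)


def system_fault_rate(modules, unit_fault_rate):
    """Sum f_i * OR_i * AR_i * lambda_unit over all modules."""
    return sum(m.f * m.op_ratio * m.area_ratio * unit_fault_rate for m in modules)


# Hypothetical example: the reference unit plus a half-size module that is
# active 50% of the time.
modules = [
    Module("unit", f=1.0, op_ratio=1.0, area_ratio=1.0),
    Module("helper", f=1.0, op_ratio=0.5, area_ratio=0.5),
]
rate = system_fault_rate(modules, unit_fault_rate=1e-6)
```

In this made-up configuration the helper module contributes a quarter of the unit's fault rate, because both its area and its operating time are halved.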

SLIDE 20

Markov State Model (1)

Each state $S_i$ of the Markov state model represents a possible status of the system.

For example, $S_0$ is the initial, healthy state and $S_1$ is the faulty state.
The transition from $S_0$ to $S_1$ is given by a fault rate $\lambda$ ($\lambda = 1/\mathrm{MTBF}$).

When a repairable system fails, it can be recovered with a repair rate $\mu$.

[Figure: (a) non-repairable system: the healthy state moves to the faulty state at fault rate $\lambda$; (b) repairable system: the healthy state moves to the faulty state at fault rate $\lambda$, and the faulty state returns to the healthy state at repair rate $\mu$.]

SLIDE 21

Markov State Model (2)

For a more complex system, its states can be separated into two sets:

$$H \triangleq \{ S_i \in S \mid \text{the system works correctly} \}$$

$$F \triangleq \{ S_i \in S \mid \text{the system fails} \}$$

The reliability function is defined as the probability of the healthy states (H, with n states):

$$R(s) = P_H(s)$$

$$\mathrm{MTBF} = \lim_{s \to 0} P_H(s) = \lim_{s \to 0} \sum_{i=0}^{n-1} S_i(s)$$

[Figure: An example Markov state model.]

Note: The solution of the Markov state model is given in the back-up slides.
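One way to evaluate the MTBF of such a model numerically is sketched below, under stated assumptions: the model is a continuous-time Markov chain, every failure state is absorbing, and the MTBF is taken as the expected time spent in the healthy states starting from the first one (which equals the limit, as the Laplace variable goes to 0, of the summed healthy-state probabilities). The function name `mean_healthy_time` and the state names are hypothetical.

```python
from fractions import Fraction


def mean_healthy_time(rates, healthy):
    """Expected time spent in healthy states before the first failure.

    rates:   {(source, destination): transition rate}
    healthy: list of healthy state names; healthy[0] is the initial state.
    Any state not listed in `healthy` is treated as an absorbing failure state.
    """
    n = len(healthy)
    index = {state: k for k, state in enumerate(healthy)}
    # Build A m = 1, where m[k] is the expected healthy time starting in state k:
    #   (total outgoing rate of k) * m[k] - sum(rate(k, j) * m[j]) = 1
    a = [[Fraction(0)] * n for _ in range(n)]
    b = [Fraction(1)] * n
    for (src, dst), rate in rates.items():
        if src in index:
            a[index[src]][index[src]] += Fraction(rate)
            if dst in index:
                a[index[src]][index[dst]] -= Fraction(rate)
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
                b[r] -= factor * b[col]
    return b[0] / a[0][0]


# Two-state, non-repairable system: S0 --lambda--> S1.
lam = Fraction(1, 1000)
mtbf_simple = mean_healthy_time({("S0", "S1"): lam}, ["S0"])  # equals 1 / lambda
```

Exact fractions are used so that the two-state case reproduces 1/λ without rounding; for larger models a numerical linear-algebra library would be the more practical choice.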

SLIDE 22

Reliability Acceleration Factor

To obtain a numeric value representing the reliability of a system, a new parameter, the Reliability Acceleration Factor (RAF), is used:

$$\mathrm{RAF} = \frac{\mathrm{MTBF}_{FT}}{\mathrm{MTBF}_{original}}$$

$\mathrm{MTBF}_{original}$: Mean Time Between Failures of the original system.

$\mathrm{MTBF}_{FT}$: Mean Time Between Failures of the fault-tolerant system.

Because $\mathrm{MTBF} = 1/\lambda$, $\mathrm{RAF} = 1/f$ ($f$ is the fault reduction rate in the Fault Rate Model).

SLIDE 23

Modeling Non-Fault-Tolerant System

A non-fault-tolerant system can be modeled with two states:

$S_0$ is the initial state.
$S_1$ is the failure state.

The MTBF of this system is given as follows:

$$\mathrm{MTBF} = \frac{1}{\lambda_D}$$

[Figure: the original module (Module D, with input and output) and its Markov state model, where $\lambda$ is the fault rate.]

SLIDE 24

Modeling Fault-Tolerant System

A fault-tolerant system can be modeled with the following states:

$S_0$ is the initial state.
$S_1$ represents the part of the fault rate that can be corrected by module C.
$S_2$ represents the part of the fault rate that cannot be corrected by module C.
$S_{C-F}$ is the state in which module C fails.
$S_1$ has fault rate $\lambda_{D1}$.
$S_2$ has fault rate $\lambda_{D2}$.
$\lambda_D = \lambda_{D1} + \lambda_{D2}$

[Figure: the fault-tolerant module (Module D with Module C, input and output) and the fault-tolerant Markov-state model.]

SLIDE 25

Modeling Fault-Tolerant System (2)

From the Markov model, the MTBF of the fault-tolerant system is given as follows:

$$\mathrm{MTBF}_{FT} = \frac{1}{\lambda_{D2} + \lambda_C}$$

In the fault rate model, the fault rate is given as:

$$\lambda_{system} = \sum_{i=1}^{k} f_i \times OR_i \times AR_i \times \lambda_{unit}$$

Therefore, the fault rate of module D is as follows:

$$\lambda_{D2} = OR_{D2} \times AR_{D2} \times \lambda_{original} = f_D \times OR_D \times AR_D \times \lambda_{original}$$

The fault rate of module C is:

$$\lambda_C = OR_C \times AR_C \times \lambda_{original}$$

SLIDE 26

Modeling Fault-Tolerant System (3)

Finally, the Reliability Acceleration Factor of the fault-tolerant system is as follows:

$$\mathrm{RAF} = \frac{1}{f_D \times OR_D \times AR_D + (OR_C \times AR_C)}$$

$f_D = \lambda_{D2}/\lambda_D$ is the reduction ratio of the fault rate.

$OR_X$ is the operation ratio of module X.
$AR_X$ is the area cost ratio of module X.
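The step from the MTBF of the original system and the MTBF of the fault-tolerant system to this factor can be spelled out as follows. This is a reconstruction rather than material from the slides, and it assumes that the original module D serves as the reference module, so that $\lambda_D = \lambda_{original}$:

```latex
\begin{aligned}
\mathrm{RAF} &= \frac{\mathrm{MTBF}_{FT}}{\mathrm{MTBF}_{original}}
  = \frac{1/(\lambda_{D2} + \lambda_C)}{1/\lambda_{original}}
  = \frac{\lambda_{original}}{\lambda_{D2} + \lambda_C} \\
 &= \frac{\lambda_{original}}
         {f_D \, OR_D \, AR_D \, \lambda_{original} + OR_C \, AR_C \, \lambda_{original}}
  = \frac{1}{f_D \times OR_D \times AR_D + OR_C \times AR_C}
\end{aligned}
```

The reference fault rate cancels, so the factor depends only on the protection ratio and on the relative area and operating time of the two modules.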

[Figure: the fault-tolerant module (Module D with Module C) and the original module (Module D), each with input and output.]

SLIDE 27

Content

Background
Soft Error Resilient 3D NoC System
Reliability Assessment Methodology
Evaluation Result
Conclusion


SLIDE 28

Evaluation Methodology

1. Analytical and hardware evaluation of the proposed soft error resilient system:
Calculate the Reliability Acceleration Factor value.
Verify the error correction capability.
2. Quantitative evaluation of the proposed system:
Area cost.
Power consumption.
Performance:
Three benchmarks: Matrix-multiplication, Uniform, and Transpose.
3. Comparison with notable soft error resilience methods:
Reliability.
Area overhead.
Latency.

SLIDE 29

Evaluation Configuration

Architectures | LAFT-OASIS [Akram 2014]; TMR-OASIS [Dang 2015]; SER-3DR-NoC
Test-benches | Uniform, Transpose, and Matrix-Multiplication
Flit size | 33
Injection rates | 0%, 8.33%, 16.67%, 11.11%&6.67% and 33.33%
Packet size | 10 flits
Routing | Switching: wormhole-like; flow control: Stop-Go; routing: look-ahead routing algorithm

SLIDE 30

Analytical Assessment for SER-3DR


[Figure: original module (D) and correction module (C).]

$$\mathrm{RAF} = \frac{1}{f_D \times OR_D \times AR_D + OR_C \times AR_C}$$

C: Soft Error Monitor modules.
D: NPC and SA arbiter modules.

With $f_D = 0$ (100% protection), $AR_C \approx 0.5446$, and $OR_C = 1$, the RAF of the proposed system is:

$$\mathrm{RAF} \approx 1.84$$
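The arithmetic behind this value can be checked with a short sketch. The function name `raf` and its parameter names are hypothetical; the formula and the SER-3DR inputs are the ones stated on this slide.

```python
def raf(f_d, or_d, ar_d, or_c, ar_c):
    """Reliability Acceleration Factor: 1 / (f_D * OR_D * AR_D + OR_C * AR_C)."""
    return 1.0 / (f_d * or_d * ar_d + or_c * ar_c)


# SER-3DR: f_D = 0 (100% protection), OR_C = 1, AR_C ~= 0.5446.
# OR_D and AR_D are set to 1 as placeholders; they have no effect when f_D = 0.
ser_3dr_raf = raf(f_d=0.0, or_d=1.0, ar_d=1.0, or_c=1.0, ar_c=0.5446)
print(f"{ser_3dr_raf:.2f}")  # prints 1.84
```

With full protection the factor depends only on the correction module, so a smaller or less active monitor yields a larger RAF.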

SLIDE 31

Reliability Comparison

Model | TMR for OASIS | Yu et al. [Yu 2013] | Prodromou et al. [Prodromou 2012] | SER-3DR
Mechanism | Majority voting | Monitor | Monitor | Monitor
Area overhead | 204.33% | 9% | 3% | 54.46%
$AR_C$ | 0.0433 | 0.09 | 0.03 | 0.5446
RAF | ≈ 1.33 | ≈ 11.11 | ≈ 1 (only detection) | ≈ 1.84
Delay | +0 cycle | +0 cycle (no fault); +1 cycle (recovery) | +0 | +1 cycle (redundancy); +2 cycles (recovery)
Fault coverage | 100% hard faults and soft errors | 7 faults | 13 faults | 100% soft errors
Recovery method | Immediate (majority voting) | Re-execution | Unsupported | Re-execution

Our proposed SER-3DR has a medium area overhead, while TMR triples the area cost. In terms of fault coverage, our proposed technique can deal with 100% of soft errors, while TMR can handle both hard faults and soft errors. The other monitor-based techniques handle a specific set of faults (7 and 13).

Under the assumption that the monitor-based technique can handle 100% of faults, it provides the best reliability (in RAF). SER-3DR provides a medium value and is better than TMR by 38.35%.

SLIDE 32

Performance Evaluation (1)

[Figure: Average delay of the Transpose benchmark; network size: 4x4x4.]

TMR-OASIS provides exactly the same performance as the original system, even under fault injection.
At a 0% fault rate, SER-3DR slightly increases the average latency, by 1.95%.
When faults are injected, only TMR-OASIS and SER-3DR keep working; the baseline model fails.
At a 33.33% fault rate, SER-3DR increases the average delay by 4.87%.

SLIDE 33

Performance Evaluation (2)

[Figure: Throughput evaluation of the Uniform benchmark; network size: 4x4x4.]

As with the other benchmarks, TMR provides performance similar to the baseline model.
At a 0% fault rate, SER-3DR decreases the throughput by 11.91%.
At a 33.33% fault rate, SER-3DR decreases the throughput by 20.07%.

SLIDE 34

Hardware Evaluation

TMR-OASIS requires 45.20% more area, while the proposal (SER-3DR) requires 14.98% additional logic area (30.22 percentage points less).

The power consumption is slightly increased in our proposed system: 5.90% (10.49% less than TMR-OASIS).

SER-3DR has the lowest maximum frequency, 655.74 MHz, due to the additional logic units in the critical paths.

Comparison between the proposed model, the baseline model, and the TMR model:

Design | Max Freq. (MHz) | Power consumption (mW) | Logic area (μm²) | #TSVs
LAFT-OASIS | 801.28 | 25.62 | 14,920 | 164
TMR-OASIS | 763.36 | 30.31 | 21,664 | 164
SER-3DR | 655.74 | 27.12 | 17,154 | 164

SLIDE 35

Content

Background
Soft Error Resilient 3D NoC System
Reliability Assessment Methodology
Evaluation Result
Conclusion & Future Works


SLIDE 36

Conclusion

We proposed a method to improve the reliability of 3D-NoC routers against soft errors.

The evaluation shows reasonable performance degradation while the method is able to deal with extremely high error rates (33%).

A reliability assessment method is proposed to help designers evaluate the efficiency of a design.

In terms of reliability, the proposed method improves MTBF by 1.84 times with a small latency increase of 18.16% (on average).
