
Coping with Soft Errors in Asynchronous Burst-Mode Machines

Sobeeh Almukhaizim (Computer Engineering Dept., Kuwait University, Kuwait); Feng Shi & Yiorgos Makris (Electrical Engineering Dept., Yale University, USA). Presented at ASYNC'08, 4/11/2008.


  1. Coping with Soft Errors in Asynchronous Burst-Mode Machines
     Sobeeh Almukhaizim, Computer Engineering Dept., Kuwait University, Kuwait
     Feng Shi & Yiorgos Makris, Electrical Engineering Dept., Yale University, USA
     4/11/2008, ASYNC'08

  2. Sources of Soft Errors
     • "Solar particles": affect satellites; may also penetrate to Earth
     • "Galactic particles": high-energy particles that penetrate to Earth's surface, through buildings and walls
     • High-energy particles collide with silicon atoms
     • The collision generates a voltage pulse at the impact site
     • Under certain conditions, the pulse may produce a soft error

  3. Frequency of Soft Errors
     [Figure: Soft Error Rate Trends (S. Borkar et al., Intel, DAC'04). Relative soft error rate increase (0 to 150) versus chip feature size (180, 130, 90, 65, 45, 32, 22, 16). Annotations: "we are approximately here" and "6 years from now".]
     • Integrated circuits (synchronous & asynchronous) will require methods to tolerate / mitigate soft errors and ensure reliability

  4. Soft Error Tolerance & Mitigation in ASYNC
     • Previous studies targeted Quasi Delay-Insensitive (QDI) circuits
     • SEU-tolerant QDI circuits (W. Jang & A. Martin, ASYNC, 156-165, 2005):
        • Gate-level fine-grain duplication and double-checking
        • Fine granularity results in high overhead
     • Soft error susceptibility estimation & mitigation in QDI circuits (Y. Monnet, M. Renaudin, and R. Leveugle, Trans. on Computers, 55(9): 1104-1115 (2006)):
        • Susceptibility (or sensitivity) is defined with respect to the number of errors at the inputs of the C-element that are necessary to flip its state
        • Several soft error mitigation (or hardening) methods are presented
     [Figure: duplicated gates feeding C-elements; a transient error on one copy is blocked by the C-elements.]

  5. Asynchronous Burst-Mode Machines
     [Figure: asynchronous controller with its inputs and outputs.]
     • Interaction between the circuit and its environment happens in bursts:
        • Input burst: a set of bit changes in any order and at any time
        • Outputs and state do not change during an input burst
        • Once the input burst is complete, the circuit responds with a hazard-free output burst
     • Particle strikes may cause logic errors or hazards
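The burst behaviour described on this slide can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions: each state has exactly one outgoing transition, and the names `Transition`, `BurstModeMachine`, `toggle_input` and the toy specification are hypothetical, not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    input_burst: frozenset   # inputs that must all toggle before the machine reacts
    output_burst: frozenset  # outputs that toggle in response
    next_state: str


class BurstModeMachine:
    """Toy model: one outgoing transition per state (a simplification)."""

    def __init__(self, spec, state, inputs, outputs):
        self.spec = spec
        self.state = state
        self.inputs = dict(inputs)
        self.outputs = dict(outputs)
        self.pending = set()  # inputs already toggled in the current burst

    def toggle_input(self, name):
        """Toggle one input; return True if this completed the input burst."""
        transition = self.spec[self.state]
        if name not in transition.input_burst or name in self.pending:
            raise ValueError(f"unexpected change on input {name!r}")
        self.inputs[name] ^= 1
        self.pending.add(name)
        if self.pending != transition.input_burst:
            return False  # burst incomplete: outputs and state stay unchanged
        for out in transition.output_burst:
            self.outputs[out] ^= 1
        self.state = transition.next_state
        self.pending.clear()
        return True


# Hypothetical two-state specification used only for illustration.
spec = {
    "s0": Transition(frozenset({"a", "b"}), frozenset({"x"}), "s1"),
    "s1": Transition(frozenset({"a"}), frozenset({"x"}), "s0"),
}
machine = BurstModeMachine(spec, "s0", {"a": 0, "b": 0}, {"x": 0})
fired_after_b = machine.toggle_input("b")   # inputs may arrive in any order
outputs_mid_burst = dict(machine.outputs)   # snapshot taken mid-burst
fired_after_a = machine.toggle_input("a")   # burst complete: output burst fires
```

The model ignores timing and hazards entirely; it only shows that outputs and state are frozen until the whole input burst has arrived.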

  6. Coping with Soft Errors in ABMMs
     Methods to cope with soft errors in ABMMs:
     • Tolerance techniques
        • TMR-based
        • Duplication-based
     • Mitigation techniques

  7. TMR-based Soft Error Tolerance in ABMMs
     [Figure: the original circuit, Replica 1 and Replica 2 share the inputs; C-elements combine their output and state lines.]
     • C-element used as majority voter
     • Strikes at state-line C-elements not tolerated

  8. Duplication-based Soft Error Tolerance
     • Observation: 2-input C-elements are sufficient to tolerate one failing module (i.e., only one replica is needed)
     [Figure: the original circuit and one replica share the inputs; C-elements combine their output and state lines.]
     • Strikes at state-line C-elements still not tolerated
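A behavioural sketch of why a 2-input C-element can mask a single failing module: its output follows the inputs only when they agree and otherwise holds its previous value. The class name `CElement` and the input sequence are illustrative assumptions; timing and the strikes on the C-element itself (the case the slide says is still not tolerated) are not modelled.

```python
class CElement:
    """Muller C-element: output changes only when all inputs agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if len(set(inputs)) == 1:   # all inputs carry the same value
            self.out = inputs[0]
        return self.out             # otherwise the previous value is held


# Inputs are (original, replica) values of the same output line.
voter = CElement(initial=0)
trace = [
    voter.update(0, 0),  # both copies agree on 0
    voter.update(1, 0),  # transient flips the original only: output holds
    voter.update(0, 0),  # transient has died out
    voter.update(1, 1),  # genuine transition in both copies: output follows
]
```

Here `trace` stays at 0 through the transient and rises only when both copies switch, which is the blocking behaviour the slides rely on.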

  9. Tolerating Errors on State-Line C-Elements
     • Proposed solution: cross-coupled structure of C-elements
     [Figure: the original circuit and its replica drive cross-coupled C-elements that produce State 1 and State 2; a transient error is blocked.]
     • All strikes at state-line C-elements are now tolerated

  10. Example
      1. Insert original circuit
      2. Insert duplicate circuit
      3. Insert state-line C-elements
      4. Insert output C-elements

  11. Experimental Results: Duplication-based Soft Error Tolerance

      Circuit Name    I/S/O     Original  Duplicate  C-elements  Total  Overhead
      hp-ir           3/1/2            8          8          18     34   325.00%
      concur-mixer    3/2/3           16         16          33     65   306.25%
      tangram-mixer   3/1/2           10         10          18     38   280.00%
      rf-control      6/3/5           37         37          51    125   237.84%
      while_concur    4/2/3           24         24          33     81   237.50%
      barcode         13/4/17        172        172          99    443   157.56%
      p2              8/4/16         192        192          96    480   150.00%
      p1              13/4/14        238        238          90    566   137.82%

      Area overhead seems excessive for small circuits: the cost is inflated by the proportionately large number of C-elements relative to logic gates, and by the rather expensive C-element implementation used.
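The overhead column is consistent with the extra area relative to the original circuit, i.e. (Total - Original) / Original, where Total = Original + Duplicate + C-elements. The short check below recomputes it from the reported figures; the variable names are illustrative and the formula is inferred from the numbers rather than stated on the slide.

```python
# (circuit, original, duplicate, c_elements, total, reported overhead %)
rows = [
    ("hp-ir", 8, 8, 18, 34, 325.00),
    ("concur-mixer", 16, 16, 33, 65, 306.25),
    ("tangram-mixer", 10, 10, 18, 38, 280.00),
    ("rf-control", 37, 37, 51, 125, 237.84),
    ("while_concur", 24, 24, 33, 81, 237.50),
    ("barcode", 172, 172, 99, 443, 157.56),
    ("p2", 192, 192, 96, 480, 150.00),
    ("p1", 238, 238, 90, 566, 137.82),
]


def overhead_percent(original, total):
    """Area added on top of the original circuit, as a percentage."""
    return 100.0 * (total - original) / original


computed = {
    name: round(overhead_percent(original, total), 2)
    for name, original, _dup, _c, total, _reported in rows
}
totals_consistent = all(o + d + c == t for _n, o, d, c, t, _r in rows)
```

Because the C-element cost grows with the number of state and output lines rather than with the logic size, its share shrinks for the larger circuits, which matches the trend in the table.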

  12. Coping with Soft Errors in ABMMs
      Methods to cope with soft errors in ABMMs:
      • Tolerance techniques
      • Mitigation techniques, driven by soft error susceptibility estimation
         • Sensitive gates
         • Sensitive complete logic cones
         • Sensitive partial logic cones

  13. Soft Error Susceptibility Estimation
      • A hazard-aware asynchronous fault simulator is needed (SPIN-SIM: F. Shi and Y. Makris, ITC, 597-606 (2004))
      • Fault simulate & construct a soft error susceptibility table (sest):

        State & Input Burst Pair | Potential SETs
                                 | f_1     f_2     ..   f_p
        SIB_1                    | 11000   00000   ..   00001
        SIB_2                    | 01001   11001   ..   11001
        ..                       | ..      ..      ..   ..
        SIB_m                    | 11001   00010   ..   00000

      • Asymmetric soft error susceptibility of gates in different levels
      • Enables judicious selection and replication in a partial duplicate (K. Mohanram and N. A. Touba, ITC, 893-901 (2003))

      $$susc(G_q) = \frac{\sum_{i=1}^{m} \sum_{j=s+1}^{s+k_q} E(sest[i,j])}{m \cdot k_q}, \quad \text{where } s = \sum_{l=1}^{q-1} k_l$$

      $$SER(ABMM) = \sum_{q=1}^{n} sest(G_q)$$
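A minimal sketch of the per-gate susceptibility computation, reading the formula as: sum E over all m rows and over the k_q columns that belong to gate G_q (starting after the s columns of the preceding gates), then divide by m · k_q. The slide does not define E, so the code assumes E is 1 whenever the table entry flags an error on at least one line; the function names and the toy table are hypothetical.

```python
def error_indicator(entry):
    """Assumed E(.): 1 if the entry marks an error on any line, else 0."""
    return int("1" in entry)


def susceptibility(sest, gate_sizes, q):
    """susc(G_q) for 1-based gate index q.

    sest       -- m rows (one per state & input burst pair) of bit-string entries
    gate_sizes -- gate_sizes[l - 1] is k_l, the number of SET columns of gate G_l
    """
    m = len(sest)
    s = sum(gate_sizes[: q - 1])      # columns owned by gates G_1 .. G_{q-1}
    k_q = gate_sizes[q - 1]
    errors = sum(
        error_indicator(sest[i][j])
        for i in range(m)
        for j in range(s, s + k_q)    # 0-based equivalent of j = s+1 .. s+k_q
    )
    return errors / (m * k_q)


# Toy table: 3 SIBs, 3 SET columns, 2 state/output lines (made-up values).
toy_sest = [
    ["10", "00", "01"],
    ["00", "00", "11"],
    ["01", "00", "00"],
]
toy_gate_sizes = [2, 1]               # G_1 owns columns 1-2, G_2 owns column 3
susc_g1 = susceptibility(toy_sest, toy_gate_sizes, 1)
susc_g2 = susceptibility(toy_sest, toy_gate_sizes, 2)
```

In this toy case G_2 comes out twice as susceptible as G_1, the kind of asymmetry that partial duplication is meant to exploit.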

  14. Duplication of Sensitive Gates
      • Using a duplication-based soft error tolerant ABMM:
         1. Gates are removed from the first level of the duplicate in increasing order of their soft error susceptibility
         2. Fan-outs are driven by the corresponding gate in the original ABMM
         3. Area & soft error tolerance are updated accordingly
      [Figure: gates removed from the duplicate; the original drives the fan-outs of the removed gates. Cost: 87%, Tolerance: 68%.]
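The three steps can be sketched as a greedy loop that drops the least susceptible first-level gates from the duplicate until an area budget is met. Several things here are assumptions made for illustration: the gate tuples, the `area_budget` stopping rule, and especially the tolerance estimate, which simply takes the susceptibility share of the gates that remain duplicated; the presentation does not spell out how tolerance is updated.

```python
def prune_duplicate(gates, area_budget):
    """Greedy removal of low-susceptibility gates from the duplicate.

    gates -- list of (name, area, susceptibility) for first-level duplicate gates
    Returns (removed gate names, remaining duplicate area, estimated protection).
    """
    kept = sorted(gates, key=lambda g: g[2], reverse=True)  # most susceptible first
    removed = []
    area = sum(g[1] for g in kept)
    while kept and area > area_budget:
        name, gate_area, _susc = kept.pop()  # least susceptible remaining gate
        removed.append(name)                 # its fan-outs move to the original
        area -= gate_area
    total_susc = sum(g[2] for g in gates)
    protected = sum(g[2] for g in kept) / total_susc if total_susc else 0.0
    return removed, area, protected


# Hypothetical gates: (name, area, susceptibility)
toy_gates = [("g1", 2, 0.5), ("g2", 2, 0.1), ("g3", 2, 0.4)]
removed, area_left, protection = prune_duplicate(toy_gates, area_budget=4)
```

With these made-up numbers only `g2` is removed, trading one third of the duplicate's area for a tenth of the estimated protection.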

  15. Duplication of Complete Sensitive Logic Cones
      • Output/State logic cones also have an asymmetric susceptibility:
         • Select a subset that meets an area target & whose replication maximizes the number of tolerated pairs of SIBs & SETs
      • Modeled as an ILP, with m SIBs, p SETs, and r state/output lines, for $1 \le k \le 2^r - 1$:

        $$Tol(Y_k, i, j) = \begin{cases} 1, & \text{if } Y_k \cdot V^T(sest[i,j]) = 0 \\ 0, & \text{if } Y_k \cdot V^T(sest[i,j]) > 0 \end{cases}$$

        $$\text{Maximize } \sum_{i=1}^{m} \sum_{j=1}^{p} Tol(Y_k, i, j), \text{ subject to:}$$

        (i) $C_k < Cost$
        (ii) $X_s \in \{0, 1\}$, for $1 \le s \le r$

      [Figure: Y_1 & w are not protected; Y_0 & x have complete protection. Cost: 60%, Tolerance: 47%.]
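For a handful of state/output lines, the selection problem can be explored by exhaustive enumeration instead of an ILP solver. The sketch below is that swapped-in brute-force approach, not the authors' formulation. It assumes that each table entry is a bit string with one bit per line, that a (SIB, SET) pair counts as tolerated when every line it corrupts belongs to a replicated cone, and that the cost constraint is strict as on the slide; all names and data are hypothetical.

```python
from itertools import combinations


def best_cone_subset(sest, cone_costs, budget):
    """Return (protected lines, tolerated pairs, cost) of the best subset.

    sest       -- rows of bit strings; bit l set means line l is corrupted
    cone_costs -- cone_costs[l] is the area cost of replicating cone l
    budget     -- total cost must stay strictly below this value
    """
    r = len(cone_costs)
    best = None
    for size in range(r + 1):
        for subset in combinations(range(r), size):
            cost = sum(cone_costs[l] for l in subset)
            if cost >= budget:
                continue
            protected = set(subset)
            tolerated = sum(
                1
                for row in sest
                for entry in row
                if all(l in protected for l, bit in enumerate(entry) if bit == "1")
            )
            # Prefer more tolerated pairs, then the cheaper subset.
            if best is None or (tolerated, -cost) > (best[1], -best[2]):
                best = (subset, tolerated, cost)
    return best


# Toy instance: 2 SIBs, 3 SETs, 3 lines (made-up values).
toy_table = [
    ["100", "010", "011"],
    ["000", "110", "001"],
]
choice = best_cone_subset(toy_table, cone_costs=[3, 2, 4], budget=6)
```

Enumeration visits all 2^r subsets, so it is only practical for small r; that growth is the usual reason to hand such a problem to an ILP solver.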

  16. Duplication of Partial Sensitive Logic Cones
      • Combines the previous two approaches:
         • Explores the asymmetric susceptibility of gates and output/state lines
      [Figure: the original drives the fan-out of the removed gate; Y_1 & w are not protected; Y_0 & x have partial protection. Cost: 50%, Tolerance: 24%.]

  17. Experimental Results: 2-level ABMMs
      [Figure: circuit p1, soft error protection (%) versus area overhead (%).]
      • Achieved tolerance is commensurate with the area overhead
      • The partial logic cones mitigation method is consistently better

  18. Experimental Results: Multi-level ABMMs (new release by Columbia Univ.)
      [Figure: circuit p1, soft error protection (%) versus area overhead (%). Highlighted point: Cost: 70%, Tolerance: 84%.]
      • Multi-level implementation significantly improves the tradeoff between area overhead & achieved soft error tolerance

  19. Summary
      • Soft error tolerance in ABMMs
         • Duplication-based solution that improves upon TMR
         • Cross-coupled C-element structure for state-line protection
      • Soft error mitigation in ABMMs
         • Enables exploration of the trade-off between the achieved soft error tolerance and the incurred area overhead
         • Driven by soft error susceptibility estimation via hazard-aware asynchronous fault simulator (SPIN-SIM)
         • Yields 3 progressively more powerful partial duplication options
