parallel streaming computation on error prone processors

Parallel Streaming Computation on Error-Prone Processors Yavuz - PowerPoint PPT Presentation

Parallel Streaming Computation on Error-Prone Processors. Yavuz Yetim, Margaret Martonosi, Sharad Malik.


  1. Parallel Streaming Computation on Error-Prone Processors Yavuz Yetim, Margaret Martonosi, Sharad Malik

  2. Hardware Errors on the Rise
      • Soft errors due to cosmic rays [Sierawski et al., 2011]
      • Random process variation [Khun et al., 2011]
      [Figures: "Upsets/B muons/Mb" vs. Technology Node (nm); Average Number of Dopant Atoms vs. Technology Node (nm)]

  3. Traditional Solutions: Higher Latencies or Redundancy
      • Voltage margins [Figure: PDF of delay, normalized number of dies vs. Delay (ps)]
      • Replication: Processor 1 and Processor 2 take the same input and their outputs are checked (up to 100%)
      • Reliable memory subsystem with ECC
        – SECDED: 1-cycle latency, ~10k gates
        – 4EC5ED: 14-cycle latency, ~100k gates
      • High power, performance and area overhead

  4. Architectures for Error-Prone Computing
      • ERSA [Leem et al., 2010]
        – Reliable core & memory runs the main thread: algorithmic control, worker thread error handling
        – Unreliable core & memory runs the worker thread: do-all unit, restarted on error
      • Flikker [Liu et al., 2011]
        – Reliable memory and unreliable memory; example declarations: critical int x; int y;
      • EnerJ [Sampson et al., 2011]
        – Processor / memory and instruction / data split into a reliable execution unit / register / memory and an unreliable execution unit / tolerant register / memory
      [Figure legend: reliable vs. unreliable components]

  5. To Minimal Reliable Hardware
      Error-tolerant application on an error-prone processor. Output:
      • Crashes due to memory errors
      • Hangs due to control-flow errors

  6. To Minimal Reliable Hardware
      Error-tolerant application on an error-prone processor. Output:
      • Crashes due to memory errors
      • Hangs due to control-flow errors
      StreamIt programming model + memory segmentation
      Control-flow with scopes:
      • Known run-times of modular control-flow regions determine timeout limits
      • Coarse-grain sequencing of computation
      Memory:
      • Regions with R/W/X permissions
      • Only allowed accesses are allowed, others are dropped
      [Figure: stream graph of Filter 1, Filter 2, Filter 3 and Filter 4]

  7. To Minimal Reliable Hardware
      Error-tolerant application on an error-prone processor. Output:
      • Crashes due to memory errors
      • Hangs due to control-flow errors
      Error-tolerant application on an error-prone processor + coarse-grain control-flow, memory, I/O management. Output:
      • Graceful quality degradation with errors
      *Extracting Useful Computation From Error-Prone Processors [Yetim et al, 2013]

  8. Communication Errors for Parallel Streaming Applications
      Error-tolerant application on multiple processing nodes with single-threaded protection. Output: unacceptable quality.
      This work:
      • Communication errors
        – Unrecoverable corruption of the communication mechanism
        – Data misalignment among producer/consumer threads
      • CommGuard
        – Application-level communication information
        – Low overhead recovery from communication errors

  9. Outline • Motivation • Communication Errors in Parallel Streaming Applications • CommGuard System Overview • Experimental Methodology and Results • Conclusions

  10. Communication Errors: Transmission Failure
      Concurrent software queue between a producer (push) and a consumer (pop):
      • List of free pointers
      • List of data pointers
      • Locks
      • State shared by both ends
      • State retained throughout computation
      Corruption in lists, pointers and locks is permanent

  11. Communication Errors: Transmission Failure
      Producer (push) and consumer (pop) connected by an error-free hardware queue:
      • Data items are flowing
      • Image is not coherent

  12. Communication Errors: Misalignment I
      Producer(): push R; push G; push B;
      Consumer(): pop R; pop G; pop B;
      [Figure: error-free hardware queue holding R B R B G R]
      Misalignment due to a control-flow error is permanent
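
The effect on slide 12 can be reproduced with a small simulation. This is only a sketch under stated assumptions: `produce`, `consume` and the `skip_at` parameter are hypothetical names, and a control-flow error is modeled as a single skipped push. It shows that after one missing item, every later item reaches the consumer in the wrong slot.

```python
from collections import deque

CHANNELS = ("R", "G", "B")

def produce(num_pixels, skip_at=None):
    """Push R, G, B for each pixel; skip one push to mimic a control-flow error."""
    queue = deque()
    push_index = 0
    for pixel in range(num_pixels):
        for channel in CHANNELS:
            if push_index != skip_at:
                queue.append((channel, pixel))
            push_index += 1
    return queue

def consume(queue):
    """Pop in fixed R, G, B order and count items that land in the wrong slot."""
    mismatches = 0
    slot = 0
    while queue:
        channel, _pixel = queue.popleft()
        if channel != CHANNELS[slot % len(CHANNELS)]:
            mismatches += 1
        slot += 1
    return mismatches

print(consume(produce(4)))             # no error: every item is in its slot
print(consume(produce(4, skip_at=1)))  # one skipped push misaligns all later items
```

The queue itself never corrupts an item here; the damage comes only from the two ends disagreeing about position, which is why the misalignment persists.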

  13. Communication Errors: Misalignment II
      Producer R, Producer G and Producer B feed a Join node.
      [Figure: Producer R sends R[0:63], R[64:127], R[128:191]; Producer G sends G[0:63], G[64:127], G[192:255]; Producer B sends B[0:63], B[64:127], B[128:191]; the Join emits P[0:63], P[64:127]]
      Misalignment at join nodes is also permanent

  14. Outline • Motivation • Communication Errors in Parallel Streaming Applications • CommGuard System Overview • Experimental Methodology and Results • Conclusions

  15. CommGuard Overview
      Producer and consumer streams are divided into iterations, separated by iteration markers.
      • Expecting item, received marker: PAD
      • Expecting marker, received item: DISCARD

  16. CommGuard Overview
      Split and join nodes each keep a local iteration counter. For all incoming edges:
      • If items missing: PAD
      • If items extra: DISCARD
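
The PAD and DISCARD rules from slides 15 and 16 can be sketched in software as follows. This is an illustrative model, not the hardware design from the talk: `HEADER`, `check_frames`, `items_per_frame` and `pad_value` are hypothetical names, markers are assumed to arrive uncorrupted, and the iteration counters are left out for brevity.

```python
HEADER = "HDR"  # hypothetical iteration marker; real headers carry more information

def check_frames(stream, items_per_frame, pad_value=0):
    """Realign a stream so each frame delivers exactly items_per_frame items."""
    frames = []
    current = None  # None until the first marker is seen
    for token in stream:
        if token == HEADER:
            if current is not None:
                # Expecting an item, received a marker: PAD the short frame.
                current.extend([pad_value] * (items_per_frame - len(current)))
                frames.append(current)
            current = []
        elif current is None or len(current) == items_per_frame:
            # Expecting a marker, received an item: DISCARD it.
            continue
        else:
            current.append(token)
    if current is not None:
        # End of stream: pad whatever the last frame is missing.
        current.extend([pad_value] * (items_per_frame - len(current)))
        frames.append(current)
    return frames

stream = [HEADER, 1, 2, 3, HEADER, 4, 5, HEADER, 6, 7, 8, 9, HEADER, 10, 11, 12]
print(check_frames(stream, 3))
```

In this run the second frame is one item short and gets padded, the third has one item too many and the extra is dropped, and the fourth frame is aligned again, so the error does not propagate.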

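  17. CommGuard System Overview
      [Figure: an unreliable producer (push, new iteration) feeds a Frame Inserter, which places headers and items into a hardware queue; a Frame Checker reads headers and items from the queue and serves an unreliable consumer (pop, new iteration, stall). Checker actions: pad, discard, pad & discard]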

  18. Outline • Motivation • Communication Errors in Parallel Streaming Applications • CommGuard System Overview • Experimental Methodology and Results • Conclusions

  19. Experimental Methodology
      • Built on prior simulation infrastructure by [Yetim et al, DATE 2013]
        – Virtutech Simics modeling 32-bit Intel x86
        – Error injection capabilities
        – Protection modules for sequential streaming applications
        – Architecturally visible errors following a distribution with a given mean time between errors (MTBE)
          • Pick an error injection cycle
          • Pick a random register, pick a random bit
          • Flip the bit, repeat
      • Extensions for multi-core simulation
        – Monitor scheduling of selected threads
        – Pin threads to processor cores
        – Per-core error injection
        – Protection modules implemented for every core
      • Modeled frame checker and frame inserter
      • JPEG decoder as a streaming application
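
The injection loop on slide 19 can be sketched as below. The slide does not name the distribution, so the exponential spacing between errors is an assumption, as are the names `inject_errors`, `registers` and `width`; the actual study injects errors inside Simics, not in Python.

```python
import random

def inject_errors(registers, total_cycles, mtbe, rng, width=32):
    """Flip one random bit of one random register at each injection cycle.

    Assumption: gaps between errors are exponential with mean `mtbe` cycles.
    Returns a log of (cycle, register index, bit index) tuples.
    """
    log = []
    cycle = rng.expovariate(1.0 / mtbe)       # pick the first injection cycle
    while cycle < total_cycles:
        reg = rng.randrange(len(registers))   # pick a random register
        bit = rng.randrange(width)            # pick a random bit
        registers[reg] ^= 1 << bit            # flip the bit
        log.append((int(cycle), reg, bit))
        cycle += rng.expovariate(1.0 / mtbe)  # repeat
    return log

regs = [0] * 8
events = inject_errors(regs, total_cycles=100_000, mtbe=1_000, rng=random.Random(0))
print(len(events), "bit flips injected")
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when the same experimental setting is repeated several times.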

  20. Output at Different Error Rates • Output quality restored after misalignment through CommGuard • Graceful output degradation with increasing errors

  21. Run-time Overhead Due to Stalls • Run-time increases due to stalls caused by misalignments • Only 2% even at high error rates

  22. Amount of Padding • Padding to resolve misalignments is observed even at low error rates

  23. Outline • Motivation • Communication Errors in Parallel Streaming Applications • CommGuard System Overview • Experimental Methodology and Results • Conclusions

  24. Conclusions
      • Communication in parallel applications adds fragility
        – Error-prone communication subsystem
        – Data misalignments due to asynchronous threads
      • Explicit communication & control-flow can be used
        – Encapsulate coarse-grain data units
        – Use small checker circuitry to recover from communication errors
      • Low overhead solutions to sustain quality
        – Only ~150B of reliable state per core and less than 2% run-time overhead even at high error rate
        – 16dB can be sustained for errors as frequent as every 1ms

  25. Parallel Streaming Computation on Error-Prone Processors Yavuz Yetim, Margaret Martonosi, Sharad Malik

  26. Backup Slides

  27. Suitably Error Tolerant

  28. Frame Checker FSM

  29. Avoid Running Indefinitely
      • Regular execution: Program runs Loop 1, then Loop 2
      • Indefinite run due to errors: Loop 1 runs too long
      • Divide program to regions with time limits: Scope 1 runs too long, break; continue with Scope 2
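
A time-limited scope can be sketched as a loop with a step budget. This is a software illustration only, with hypothetical names (`run_scope`, `make_loop`, `max_steps`); in the talk the limit is enforced by reliable hardware and derived from the known run-time of each region.

```python
def run_scope(step, max_steps):
    """Call step() until it returns False, or force an exit after max_steps calls.

    Returns (steps executed, whether the scope timed out).
    """
    for executed in range(max_steps):
        if not step():
            return executed + 1, False  # the loop finished on its own
    return max_steps, True              # too long: break out of the scope

def make_loop(iterations, corrupt=False):
    """Build a loop body; `corrupt` mimics an error that keeps resetting the counter."""
    state = {"i": 0}
    def step():
        state["i"] += 1
        if corrupt:
            state["i"] = 0
        return state["i"] < iterations
    return step

print(run_scope(make_loop(5), max_steps=8))                # finishes normally
print(run_scope(make_loop(5, corrupt=True), max_steps=8))  # cut off at the limit
```

After a timeout the program simply moves on to the next scope, so an error costs at most one region's time budget instead of hanging the run.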

  30. Disallowed Memory Accesses
      • Regular execution: memory regions with R/X, W and R/W permissions
      • Crash due to errors: a disallowed access crashes the program
      • Suppress crashes: don't crash, bump PC
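
The permission check can be sketched as follows. The class `GuardedMemory`, its region tuples, and the choice to return 0 for a dropped load are all assumptions made for illustration; the slides only say that disallowed accesses are dropped and the PC is bumped.

```python
class GuardedMemory:
    """Toy memory whose regions carry R/W/X permissions; bad accesses are dropped."""

    def __init__(self, size, regions):
        self.data = [0] * size
        self.regions = regions  # list of (start, end_exclusive, permission string)
        self.dropped = 0

    def _allowed(self, addr, kind):
        return any(start <= addr < end and kind in perms
                   for start, end, perms in self.regions)

    def load(self, addr):
        if not self._allowed(addr, "R"):
            self.dropped += 1   # don't crash: skip the access and continue
            return 0            # assumed value for a dropped load
        return self.data[addr]

    def store(self, addr, value):
        if not self._allowed(addr, "W"):
            self.dropped += 1   # don't crash: the store has no effect
            return
        self.data[addr] = value

mem = GuardedMemory(16, [(0, 8, "RX"), (8, 16, "RW")])
mem.store(2, 7)     # read-only region: dropped
mem.store(10, 5)    # allowed
print(mem.load(2), mem.load(10), mem.load(100), mem.dropped)
```

Checking the region before touching `data` is what turns an out-of-range address into a dropped access rather than an exception.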

  31. Overall Design
      • MIS: Coarse-grained control flow constraints and recovery
      • MFU: Coarse-grained constraints on memory accesses
      • Streamed I/O: Manages bounded data streams

  32. Communication Errors: Single-threaded
      Toy producer-consumer streaming application: the producer pushes 16 items per iteration, the consumer pops 64. Core 0 runs the schedule P P P P C ... over a statically allocated 64-item buffer.
      • Static location is preserved in reliable I-Cache throughout the computation
      • Every new [P] or [C] iteration recovers the pointer values
      • Communication never halts indefinitely
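
The pointer recovery on slide 32 can be illustrated with a toy schedule. Everything here is a hypothetical model: `run_schedule`, `corrupt_in` and the fixed offset of 5 stand in for a bit flip in the write pointer, and the buffer pointers are recomputed from constants at the start of every iteration, which is the property the slide describes.

```python
BUFFER_SIZE = 64
PUSH_PER_ITERATION = 16
P_PER_C = BUFFER_SIZE // PUSH_PER_ITERATION  # four P iterations per C iteration

def run_schedule(num_rounds, corrupt_in=None):
    """Run [P P P P C] rounds; return how many consumed items were wrong."""
    buffer = [None] * BUFFER_SIZE
    wrong = 0
    p_iter = 0
    for rnd in range(num_rounds):
        for slot in range(P_PER_C):
            write = slot * PUSH_PER_ITERATION       # pointer recovered from a constant
            if p_iter == corrupt_in:
                write = (write + 5) % BUFFER_SIZE   # mimic an error in the pointer
            for k in range(PUSH_PER_ITERATION):
                buffer[(write + k) % BUFFER_SIZE] = p_iter * PUSH_PER_ITERATION + k
            p_iter += 1
        for read in range(BUFFER_SIZE):             # C iteration also restarts at 0
            wrong += buffer[read] != rnd * BUFFER_SIZE + read
    return wrong

print(run_schedule(3))                # no errors
print(run_schedule(3, corrupt_in=1))  # damage is confined to the first round
```

Because no pointer value survives from one iteration to the next, a corrupted pointer spoils some items in one round but cannot stall or derail later rounds.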

  33. Shared State
      • The inserter and the checker need to keep state to operate
      • State below is shared by every inserter and checker belonging to a node; each value is (S)tatic or (D)ynamic
        – Firing per frame (S): How many times a node needs to fire before the computation starts for the next frame
        – Frame limit (S): Number of total frames the application needs to process
        – Active frame (D): How many frames have been processed so far
        – Active firing (D): How many times the node has fired for the active frame
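
The four values on slide 33 could be held in a small record like the one below. This is a sketch with hypothetical names (`NodeFrameState`, `fire`); it only shows how the two dynamic counters advance against the two static limits, not how the hardware stores them.

```python
from dataclasses import dataclass

@dataclass
class NodeFrameState:
    firings_per_frame: int   # static: firings needed before the next frame starts
    frame_limit: int         # static: total frames the application processes
    active_frame: int = 0    # dynamic: frames processed so far
    active_firing: int = 0   # dynamic: firings done for the active frame

    def fire(self):
        """Record one firing; return True when it completes the active frame."""
        if self.active_frame >= self.frame_limit:
            raise RuntimeError("frame limit reached")
        self.active_firing += 1
        if self.active_firing == self.firings_per_frame:
            self.active_firing = 0
            self.active_frame += 1
            return True
        return False

state = NodeFrameState(firings_per_frame=4, frame_limit=2)
print([state.fire() for _ in range(8)], state.active_frame)
```

A `True` result marks a frame boundary, which is the point where an inserter would emit a new header.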

  34. Additional Frame Checker State
      Each state is (N)ormal or (E)rroneous:
      • Receiving items (N): Node is receiving items for the active frame
      • Expecting a header (N): Node has started a new frame computationally, hence the next item in the queue should be a header
      • Discarding (E): The computation in the node is ahead of the communication of the edge
      • Padding (E): The communication of the edge is ahead of the computation in the node

  35. CommGuard Placement
      [Figure labels: Previous, Filter, Next, FC, FI]

  36. Output Quality for Varying MTBEs
      • Compare lossy compression to error-prone decompression
      • For raw image file I, encoded file E and decoded files F or P:
        – Baseline: error-free SNR (raw image, compression, compressed image, error-free decompression, decompressed image)
        – Ours: error-prone SNR (same compressed image, error-prone decompression, decompressed image)
      • This study was performed for MP3 and JPEG decoder benchmarks
        – Widely used
        – Full runs
        – Each experimental setting: 10 times
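
The slides do not give the SNR formula, so the sketch below assumes the common definition: signal power of the raw data divided by the power of the difference to the decoded data, in dB. The function name `snr_db` and the sample values are hypothetical.

```python
import math

def snr_db(reference, decoded):
    """SNR in dB of `decoded` against `reference` (assumed definition, see above)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - d) ** 2 for r, d in zip(reference, decoded))
    if noise == 0:
        return math.inf    # identical signals
    if signal == 0:
        return -math.inf   # all-zero reference with a nonzero difference
    return 10.0 * math.log10(signal / noise)

raw = [10, 10, 10, 10]
error_free_decode = [10, 10, 10, 10]   # stand-in for the baseline decode
error_prone_decode = [9, 11, 9, 11]    # stand-in for a decode on faulty hardware
print(snr_db(raw, error_free_decode), snr_db(raw, error_prone_decode))
```

Comparing both decodes against the same raw input separates the loss caused by compression itself from the loss caused by hardware errors.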
