Efficient Transient-Fault Tolerance for Multithreaded Processors Using Dual-Thread Execution

Yi Ma, Huiyang Zhou
Computer Science Department
University of Central Florida
University of Central Florida 2
Introduction
- Modern microprocessors are increasingly susceptible to transient faults.
– Smaller transistors, higher density, lower supply voltage, etc.
[Chart: soft-error rate (SER) per chip for SRAM, latches, and logic; Shivakumar et al.]
Introduction
- A promising approach is redundant execution utilizing multithreaded
processors.
– AR-SMT, SRT, SRTR, etc.
– Shortcomings:
- Performance degradation
– Delayed instruction commitment
– Resource contention
- Increased energy consumption
– Dynamic energy due to redundant execution
– Static energy due to increased execution time
- The contribution of this paper:
– Dual-Thread Execution: achieves both performance enhancement and transient-fault tolerance for multithreaded processors.
Outline
- Introduction
- Dual-Thread Execution (DTE)
– Overview
– Architecture
– Exploiting Fetch Policies
- Experimental results
- Related Work
- Conclusion
Dual-Thread Execution
- DTE is built on a Simultaneous Multithreaded (SMT) processor
– Two threads: the front thread and the back thread
– Instructions are executed speculatively by the front thread and re-executed by the back thread.
[Figure: superscalar core with a result queue between the two threads — the front thread fetches in order; the back thread commits in order.]
- Resource sharing is critical to DTE’s overall performance.
– Explore effective fetch policies for DTE.
Architecture
- Front thread
– Fetches instructions from the I-cache.
– Executes instructions normally except for long-latency (L2-miss) loads.
- Invalidates long-latency loads and their dependent instructions by setting the INV flag. (The INV flag is propagated.)
– Writes store values into the run-ahead cache instead of the D-cache when retiring.
– Forwards the retired instructions with their results to the result queue (FIFO).
[Figure: DTE pipeline (Fetch, Dispatch, Issue, Reg Read, Execute, Write Back, Retire) with INV bits in the physical register file and LSQ; the front thread feeds the result queue (head, tail, next-to-fetch pointers), stores go to the run-ahead cache alongside the L1 data cache, and the back thread consumes the queue.]
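The front thread's retire-stage rules above can be sketched in Python. This is a hypothetical model, not the paper's hardware: the `Insn` record and `FrontThread` class are my own illustrative names; only the invalidation, run-ahead-cache, and result-queue behavior come from the slides, and the 300-cycle L2-miss latency comes from the experimental configuration later in the talk.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple

L2_MISS_LATENCY = 300  # cycles; from the simulated configuration

@dataclass
class Insn:
    """Hypothetical retired-instruction record."""
    op: str                        # 'load', 'store', or 'alu'
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()
    addr: Optional[int] = None
    value: Optional[int] = None
    latency: int = 1
    result: Optional[int] = None

class FrontThread:
    def __init__(self):
        self.result_queue = deque()  # FIFO consumed by the back thread
        self.runahead_cache = {}     # keeps speculative stores out of the D-cache
        self.inv_regs = set()        # registers currently carrying INV values

    def retire(self, insn):
        # An L2-missing load, or any consumer of an INV register, is invalidated;
        # this is how the INV flag propagates through dependents.
        inv = (insn.op == 'load' and insn.latency >= L2_MISS_LATENCY) or \
              any(s in self.inv_regs for s in insn.srcs)
        if insn.dest is not None:
            if inv:
                self.inv_regs.add(insn.dest)
            else:
                self.inv_regs.discard(insn.dest)
        # Store values go to the run-ahead cache, never the D-cache.
        if insn.op == 'store' and not inv:
            self.runahead_cache[insn.addr] = insn.value
        # Forward the retired instruction and its result to the back thread.
        self.result_queue.append((insn, insn.result, inv))
        return inv
```

Note how a single L2-missing load poisons its whole dependence chain, letting the front thread keep retiring instead of stalling.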
Architecture (cont.)
- Back thread
– Fetches instructions from the result queue.
- Instructions invalidated by the front thread are fetched twice to achieve full
redundancy coverage.
– Performs the redundancy check.
- For valid instructions, compares the re-executed result with the front thread's result.
- For invalidated instructions, compares the two redundant copies with each other.
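A minimal sketch of this per-instruction check, assuming (my assumptions, not the paper's interfaces) that a result-queue entry is an `(insn, front_result, inv)` tuple and that `execute` re-runs one instruction in the back thread:

```python
def back_thread_check(entry, execute):
    """Return True iff the redundancy check for one result-queue entry passes."""
    insn, front_result, inv = entry
    if inv:
        # The front thread never produced a valid result for this instruction,
        # so the back thread fetches it twice and compares the two redundant
        # copies against each other for full coverage.
        return execute(insn) == execute(insn)
    # Valid instruction: compare the re-execution with the front-thread result.
    return execute(insn) == front_result
```

Either comparison failing signals a discrepancy and triggers the recovery described on the next slide.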
[Figure: the same pipeline, highlighting the back thread fetching from the result queue.]
Architecture (cont.)
- When a discrepancy is detected, the cause is either:
– A soft error
– A misspeculation from the front thread
- Rewind both threads to the currently committed states.
– Squash everything in the back thread, the result queue, and the front thread.
– Invalidate the run-ahead cache.
– Copy the back thread's architectural state to the front thread.
– Resume execution.
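The recovery sequence can be sketched like this; `Thread` is a toy stand-in for one SMT context, and the 8-cycle register-copy latency comes from the experimental configuration later in the talk.

```python
COPY_LATENCY = 8  # cycles to copy architectural registers back -> front

class Thread:
    """Minimal stand-in for one SMT thread context."""
    def __init__(self):
        self.pc = 0
        self.arch_regs = {}
        self.inflight = []   # speculative, uncommitted work
    def squash(self):
        self.inflight.clear()

def recover(front, back, result_queue, runahead_cache):
    """Rewind both threads to the committed state after a discrepancy
    (soft error or front-thread misspeculation)."""
    front.squash()                           # squash the front thread,
    back.squash()                            # the back thread,
    result_queue.clear()                     # and the result queue
    runahead_cache.clear()                   # speculative store values are stale
    front.arch_regs = dict(back.arch_regs)   # back thread holds committed state
    front.pc = back.pc
    return COPY_LATENCY                      # cycles before execution resumes
```

Because the back thread only ever commits checked results, its architectural state is always a safe point to restart from.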
How does DTE improve performance?
- The front thread runs on a virtually ideal L2 cache, because it invalidates long-latency cache-missing loads instead of waiting for them.
- The front thread's cache misses become highly effective prefetches for the back thread.
– This reduces the back thread's cache misses and enables more computation overlap.
- The front thread resolves all branches that are independent of the invalidated instructions.
– This provides the back thread with a highly accurate control-flow stream.
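A toy model of the prefetching effect (my own illustration, not from the paper — the latencies and the two-pass structure are assumptions): the front thread charges only hit latency because its misses are invalidated, yet each miss installs the line, so the back thread re-executes the same loads mostly as hits.

```python
MISS, HIT = 300, 2   # assumed L2-miss vs. cache-hit load latency in cycles

def dte_load_cycles(addresses):
    """Load cycles for the front and back threads over one address stream."""
    cache = set()
    front = back = 0
    for a in addresses:
        front += HIT       # invalidated miss retires without waiting
        cache.add(a)       # the outstanding miss acts as a prefetch
    for a in addresses:
        back += HIT if a in cache else MISS
    return front, back

def baseline_load_cycles(addresses):
    """Single thread with a cold cache, for comparison."""
    cache, total = set(), 0
    for a in addresses:
        total += HIT if a in cache else MISS
        cache.add(a)
    return total
```

On a stream like `[1, 2, 3, 1]` the baseline pays three full misses, while the back thread — the one that actually commits — sees only hits.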
How does DTE achieve transient-fault tolerance?
- Every instruction is redundantly executed.
- The redundant results are checked before committing to the ECC-protected architectural state.
- Any discrepancy due to soft errors can be transparently repaired.
Outline
- Introduction
- Dual-Thread Execution (DTE)
– Overview
– Architecture
– Fetch policies for DTE
- Experimental results
- Related Work
- Conclusion
Fetch Policies for DTE
- ROUND-ROBIN (RR) policy
+ Fairness
- Fails to consider the resource requirement for each thread.
- ICOUNT policy
+ Good for high ILP threads
- Favors the front thread in DTE.
- SLACK policy
+ Speeds up the trailing thread in SRT and SRTR.
- Favors the front thread in DTE.
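For illustration, ICOUNT can be sketched as picking the thread with the fewest in-flight front-end instructions each cycle (the dict-based thread records are my assumption). In DTE the front thread drains quickly — its L2 misses are invalidated rather than queued — so its count stays low and the policy ends up favoring it, which is why it starves the back thread here.

```python
def icount_pick(threads):
    """ICOUNT fetch policy: fetch for the thread with the fewest
    instructions in the decode/rename/issue stages."""
    return min(threads, key=lambda t: t['icount'])
```

Usage: with a fast-draining front thread (`icount` 6) and a backed-up back thread (`icount` 40), ICOUNT keeps choosing the front thread.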
Fetch Policies for DTE (cont.)
- Back-First (BF) policy
+ Favors the back thread.
- Limits the fast progress of the front thread.
- Queue-Occupancy (QO) policy
When the result-queue occupancy is below 50%, it favors the front thread; otherwise it favors the back thread.
+ Allocates resources effectively to both threads.
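A sketch of the QO selection rule, assuming the 512-entry result queue from the experimental setup (the function name is mine):

```python
RESULT_QUEUE_SIZE = 512   # entries, from the simulated configuration

def qo_pick(front, back, queue_occupancy):
    """Queue-Occupancy fetch policy: while the result queue is under half
    full, let the front thread run ahead; once it fills past 50%, shift
    fetch bandwidth to the back thread so checked work drains and commits."""
    if queue_occupancy < RESULT_QUEUE_SIZE // 2:
        return front
    return back
```

The queue occupancy acts as a self-balancing signal: a full queue means the back thread is the bottleneck, an empty one means the front thread is.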
Outline
- Introduction
- Dual-Thread Execution (DTE)
– Overview
– Architecture
– Fetch policies for DTE
- Experimental results
- Related Work
- Conclusion
Methodology
- Processor settings
– MIPS R10000-style superscalar processor supporting SMT
– 8-way issue, 128-entry ROB, 128-entry issue queue, 128-entry LSQ
– 32 kB 2-way L1 caches, 1024 kB 8-way L2 cache, L2 miss latency: 300 cycles
– Branch predictor: 64k-entry G-share; 32k-entry BTB
– Stride-based stream-buffer hardware prefetcher
– 512-entry result queue, 4 kB 4-way run-ahead cache
– Latency for copying architectural register values from the back to the front thread: 8 cycles
- Benchmarks
– Memory-intensive SPEC2000 benchmarks (>40% speedup with a perfect L2) and two computation-intensive benchmarks, bzip2 and gap.
Different Fetch Policies
Queue-Occupancy fetch policy works best for DTE.
[Chart: normalized execution time (40%–240%) across bzip2, gap, gcc, mcf, parser, twolf, vpr, ammp, art, equake, swim, and the average, for the Round-Robin, ICOUNT, Slack, Back-First, and Queue-Occupancy policies.]
Performance Impact of DTE
[Chart: normalized execution time of SRTR and DTE across the benchmarks.]
On average, DTE achieves 15.5% speedup.
Energy Efficiency of DTE
On average, DTE is much more energy-efficient than SRTR: a normalized EDP of 1.63 vs. 2.29 (lower is better).
[Chart: normalized energy-delay product (EDP) of SRTR and DTE across the benchmarks.]
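Normalized EDP is simply energy times execution time, each relative to the single-thread baseline; a one-line sketch (function name and the sample numbers in the comment are mine, not the paper's data):

```python
def normalized_edp(energy, exec_time, base_energy, base_time):
    """Energy-delay product relative to a baseline; lower is better.
    DTE's redundant execution costs dynamic energy, but its speedup
    shortens exec_time, so the product can still come out ahead."""
    return (energy * exec_time) / (base_energy * base_time)
```

For example, doubling energy while also taking 1.5x as long yields a normalized EDP of 3.0 against a baseline of 1.0.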
Related Work
- SRT [Reinhardt and Mukherjee], SRTR [Vijaykumar et al.]
- AR-SMT [Rotenberg]
– Similar high-level architecture (delay buffer vs. result queue)
– The A-stream executes the program non-speculatively.
– The R-stream validates the results from the A-stream.
- DIVA [Austin]
– Uses a separate, simple in-order checker to verify the out-of-order execution of the main thread.
- Dual-Core Execution [Zhou]
– DCE builds on two processor cores on a single chip.
– The two cores work cooperatively to improve single-thread performance.
– DTE is derived from DCE.
Summary
- Dual-Thread Execution builds upon SMT processors.
- The front thread and the back thread execute the instruction stream collaboratively to provide efficient transient-fault tolerance.
- Works best with the Queue-Occupancy fetch policy.
- An SMT-based design that achieves both high reliability and high performance.