 
FluidCheck: A Redundant Threading based Approach for Reliable Execution in Manycore Processors
Rajshekar Kalayappan, Smruti R. Sarangi
Dept of Computer Science and Engineering, Indian Institute of Technology Delhi, New Delhi, India
Soft Errors
• Temporary in nature
• Occur due to particle strikes on the silicon
• Sources of particles:
  ▫ Solar ion flux
  ▫ Explosion of distant stars
  ▫ Impurities in the chip
[img src: aviral.lab.asu.edu]
Soft Errors
• Rare event
  ▫ A particle needs to strike at the right place, at the right angle, with the right amount of energy
• Not rare enough to be ignored
  ▫ The critical charge required to flip a bit decreases as feature size and operating voltage decrease
Soft Errors
• Solutions
  ▫ Device-level radiation hardening
    ▪ Two to four generations behind commercial counterparts [Courtland2015]
  ▫ System-level hardening techniques required
    ▪ Redundancy: DMR (compare), TMR (vote)
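The compare and vote steps named above can be sketched as follows. This is a minimal illustration, not something taken from the slides; the function names `dmr_check` and `tmr_vote` are hypothetical. DMR runs two copies and compares them, so it can detect a mismatch but cannot tell which copy is wrong. TMR runs three copies and takes a majority vote, so it can mask a single faulty copy.

```python
from collections import Counter


def dmr_check(a, b):
    """Dual modular redundancy: compare two redundant results.

    Returns (value, True) when the copies agree, (None, False) otherwise.
    A mismatch only signals an error; recovery needs some other mechanism.
    """
    return (a, True) if a == b else (None, False)


def tmr_vote(a, b, c):
    """Triple modular redundancy: return the majority of three results."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no majority: all three copies disagree")
```

For example, `tmr_vote(7, 5, 7)` masks the single corrupted copy and returns 7, whereas `dmr_check(7, 5)` can only report that the copies differ.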
Problem Statement
• To efficiently execute a set of applications on a chip multiprocessor (homogeneous, SMT-capable cores) while ensuring reliability in the face of soft errors
Related Work: DIVA [Austin1999]
• Meant to provide reliability
[Figure: a leader paired with a checker]
• Execution assistance:
  ▫ IP
  ▫ Branch prediction hints
  ▫ Operand value hints
  ▫ Result
  ▫ Example: <0x1234><op1=5><op2=2><res=7>
• Cache line forwarding
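The example hint above, <0x1234><op1=5><op2=2><res=7>, can be used to sketch what a checker might do with it. The slide does not say which instruction sits at 0x1234; the sketch assumes an addition, which is consistent with 5 + 2 = 7. The names `Hint`, `PROGRAM` and `check_hint` are hypothetical, and the sketch only re-executes the operation; it does not model how the operand values themselves are verified.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    ip: int    # instruction pointer of the leader's instruction
    op1: int   # first operand value forwarded by the leader
    op2: int   # second operand value forwarded by the leader
    res: int   # result computed by the leader


# Assumption: a toy instruction memory mapping an IP to its operation.
PROGRAM = {0x1234: operator.add}


def check_hint(hint, program=PROGRAM):
    """Recompute the result from the forwarded operands.

    Returns True when the recomputed value matches the leader's result.
    """
    return program[hint.ip](hint.op1, hint.op2) == hint.res
```

With these assumptions, `check_hint(Hint(0x1234, 5, 2, 7))` succeeds, while a corrupted result such as `Hint(0x1234, 5, 2, 6)` is flagged.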
Related Work
[Figure: leader threads L1-L4 and checker threads C1-C4 placed on leader/checker cores]
SRT [Reinhardt2000], AR-SMT [Rotenberg1999]
• Saves area
• Better throughput per core
CRT [Mukherjee2002]
• Improvement over SRT
• Circumvents hazards borne out of resource requirement similarity between a leader-checker pair
• Better throughput per core
Motivational Example
Without any checking, throughput = 4.84 instructions per cycle
[Figure: placement of the leader (L) and checker (C) threads of perlbench, mcf, gromacs and cactusADM under SRT, CRT and FluidCheck]
SRT
• Throughput = 3.24
• Similarity in resource requirement
• High throughput threads together
CRT
• Throughput = 3.55
• Similarity is broken
• Can we do better?
FluidCheck
• Throughput = 3.76
• Schedules based on the applications’ behavior
• FluidCheck is a superset of schedules; SRT and CRT are instances within FluidCheck
Simplified Illustration of FluidCheck’s Working
[Figure sequence: an arbiter and four cores (Core A to Core D) running leader threads L1-L4 and checker threads C1-C4]
• HELP: C1 is unable to keep up
• A checker assignment request goes to the arbiter (the reply is labelled "Core C" in the figure)
• C1 is shown at its new position
• Periodic reassignment by the arbiter, after which the thread placement changes
Challenges to achieving FluidCheck
• Reactive phase-based scheduler
• Efficient transfer of hints
• Efficient forwarding of cache lines from the leader to the checker
• Circumventing subtle livelock scenarios
Hardware Architecture
Overview of Redundant Execution
Memory Checkpointing
[Figure sequence: leader and checker pipelines, each with a Ct block and a private L1, above a shared L2; a victim cache is shown between the two L1 caches. Labelled events, in order: Hint (leader to checker), Store, Ld/St, Miss!, Evict!, SYNC, Rollback. Stored lines carry a one-bit flag (for example 11010101 with flag 1); after Evict! a flagged line appears in the victim cache; at SYNC the flags read 0; after Rollback the flagged lines no longer appear.]
Forwarding Filters
[Figure sequence: leader pipeline with Ct, L1, RFB, LFB and L2. Outcomes shown for a leader load/store:]
• L1 hit: do not forward
• L1 miss, RFB hit: do not forward
• L1 miss, RFB miss, LFB entry 11010011 with flag 0: do not forward
• L1 miss, RFB miss, LFB entry 11010011 with flag 1: forward
Arbiter Logic: I
• Activity
  ▫ IPC
  ▫ WIPC(x)
• Mapping a Single Thread
  ▫ Select the core with minimum activity that has free SMT slots
  ▫ If activity is IPC, the scheme is termed minIPC
  ▫ If activity is WIPC(x), the scheme is termed minWIPC_x
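The single-thread mapping rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' arbiter: the `Core` record and `pick_core` are hypothetical names, the activity value is supplied by the caller (IPC for minIPC, WIPC(x) for minWIPC_x; how WIPC(x) is computed is not modelled here), and the default of 4 SMT slots follows the 4-way SMT configuration used in the evaluation.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    activity: float   # IPC or WIPC(x), whichever metric the scheme uses
    used_slots: int   # SMT contexts currently occupied
    smt_ways: int = 4  # total SMT contexts on the core


def pick_core(cores):
    """Return the least active core that still has a free SMT slot.

    Returns None when every core is full.
    """
    candidates = [c for c in cores if c.used_slots < c.smt_ways]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.activity)
```

Note that a fully occupied core is skipped even if it has the lowest activity, since the rule requires a free SMT slot.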
Arbiter Logic: II
• Mapping a Set of Threads
  ▫ Scheduling Policies:
    ▪ Pinned Leaders (SP-PL)
    ▪ Unpinned Leaders (SP-UL)
    ▪ Unpinned Leaders All Leaders First (SP-UALF)
• SMT Fetch Policy
  ▫ Full Simultaneous Issue [Tullsen1995]
  ▫ If n threads on a core have activities A_1, A_2, ..., A_n, then within a cycle block of size B the i-th thread gets the following number of fetch cycles:
    $\frac{A_i}{\sum_{k=1}^{n} A_k} \times B$
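The proportional fetch-cycle rule above can be sketched as follows. The function name `fetch_cycles` is hypothetical. The slide does not say how fractional cycles are rounded or what happens when every thread has zero activity, so the sketch returns fractional shares and, as a labelled assumption, splits the block equally in the all-zero case.

```python
def fetch_cycles(activities, block_size):
    """Split a block of fetch cycles among threads in proportion to activity.

    activities: per-thread activity values A_1..A_n
    block_size: size B of the cycle block
    Returns a list with the (fractional) share of each thread.
    """
    if not activities:
        return []
    total = sum(activities)
    if total == 0:
        # Assumption (not from the slides): equal split when nothing is active.
        return [block_size / len(activities)] * len(activities)
    return [a * block_size / total for a in activities]
```

For instance, two threads with activities 1.0 and 3.0 sharing a block of 100 cycles get 25 and 75 fetch cycles respectively.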
Evaluation: Simulation Parameters
• 16-core processor, 4-way SMT
• Core configuration based on Intel Sandy Bridge and IBM POWER7

Parameter              Value
Pipeline width         4
i-cache and d-cache    32 kB
Shared L2 cache        12 MB
NOC topology           2D torus
Hint buffer            512 entries
Victim cache           32 entries
RFB and LFB            64 entries each