FluidCheck: A Redundant Threading based Approach for Reliable Execution in Manycore Processors - PowerPoint PPT Presentation



SLIDE 1

FluidCheck: A Redundant Threading based Approach for Reliable Execution in Manycore Processors

Rajshekar Kalayappan, Smruti R. Sarangi

Dept of Computer Science and Engineering, Indian Institute of Technology Delhi, New Delhi, India.

SLIDE 2

Soft Errors

  • Temporary in nature
  • Occur due to particle strikes on the silicon
  • Sources of particles:

▫ Solar ion flux
▫ Explosions of distant stars
▫ Impurities in the chip

[img src: aviral.lab.asu.edu]

SLIDE 3

Soft Errors

  • Rare events

▫ Particles need to strike at the right place, at the right angle, with the right amount of energy

  • Not rare enough to be ignored

▫ The critical charge required to flip a bit reduces with reducing feature size and operating voltage

SLIDE 4

Soft Errors

  • Solutions

▫ Device-level radiation hardening

Two to four generations behind commercial counterparts [Courtland2015]

▫ System-level hardening techniques required

Redundancy: compare copies (DMR) or vote among them (TMR)

SLIDE 5

Problem Statement

  • To efficiently execute a set of applications on a chip multi-processor (homogeneous SMT-capable cores), while ensuring reliability in the face of soft errors

SLIDE 6

Related Work: DIVA [Austin1999]

Leader → Checker

  • Meant to provide reliability
  • IP
  • Execution assistance:
  • Branch prediction hints
  • Operand value hints
  • Result
  • Example:

<0x1234><op1=5><op2=2><res=7>

  • Cache line forwarding
SLIDE 7

Related Work

Leader / Checker

SRT [Reinhardt2000], AR-SMT [Rotenberg1999]

  • Saves area
  • Better throughput per core

[Diagram: leader–checker pairings L2·C1, L1·C2, L3·C4, L4·C3]

CRT [Mukherjee2002]

  • Improvement over SRT
  • Circumvents hazards born of resource-requirement similarity between a leader–checker pair
  • Better throughput per core
SLIDE 8

Motivational Example

SRT mapping: Lperlbench Cperlbench | Lmcf Cmcf | Lgromacs Cgromacs | LcactusADM CcactusADM

Without any checking, throughput = 4.84 instructions per cycle

SLIDE 9

Motivational Example

SRT:
  • Throughput = 3.24
  • Similarity in resource requirement
  • High-throughput threads run together

SLIDE 10

Motivational Example

CRT mapping: Lperlbench Cmcf | Lmcf Cperlbench | Lgromacs CcactusADM | LcactusADM Cgromacs

SLIDE 11

Motivational Example

CRT:
  • Throughput = 3.55
  • Similarity is broken
  • Can we do better?
SLIDE 12

Motivational Example

New mapping: Lperlbench Cmcf Cgromacs Lmcf CcactusADM Lgromacs Cperlbench LcactusADM

  • Throughput = 3.76
SLIDE 13

Motivational Example

FluidCheck:
  • Throughput = 3.76
  • Schedules based on the applications’ behavior
  • FluidCheck is a superset of schedules; SRT and CRT are instances within FluidCheck

SLIDE 14

Simplified Illustration of FluidCheck’s Working

[Diagram: Arbiter; Cores A–D running leader threads L1–L4 and checker threads C1–C4]

SLIDE 15

Simplified Illustration of FluidCheck’s Working

[Diagram: checker C1 unable to keep up — signals HELP]

SLIDE 16

Simplified Illustration of FluidCheck’s Working

[Diagram: the arbiter receives a checker assignment request; Core C is selected]

SLIDE 17

Simplified Illustration of FluidCheck’s Working

[Diagram: the system after the new checker assignment]

SLIDE 18

Simplified Illustration of FluidCheck’s Working

[Diagram: periodic reassignment of checker threads]

SLIDE 19

Simplified Illustration of FluidCheck’s Working

[Diagram: checkers after reassignment — C1 and C2 have swapped places]

SLIDE 20

Challenges to achieving FluidCheck

  • Reactive phase-based scheduler
  • Efficient transfer of hints
  • Efficient forwarding of cache lines from the leader to the checker
  • Circumventing subtle livelock scenarios
SLIDE 21

Hardware Architecture

SLIDE 22

Overview of Redundant Execution

SLIDE 23

Memory Checkpointing

[Diagram: leader core (Ct, pipeline, L1, L2) and checker core (Ct, pipeline, L1)]

SLIDE 24

Memory Checkpointing

[Diagram: an entry (11010101, bit set) is placed in the Hint Store]

SLIDE 25

Memory Checkpointing

[Diagram: the line (11010101, bit set) now resides in the leader’s L1]

SLIDE 26

Memory Checkpointing

[Diagram: the leader issues a Ld/St]

SLIDE 27

Memory Checkpointing

[Diagram: the Ld/St misses in the leader’s L1]

SLIDE 28

Memory Checkpointing

[Diagram: the L1 miss being serviced]

SLIDE 29

Memory Checkpointing

[Diagram: the miss forces an eviction (Evict!) from the leader’s L1]

SLIDE 30

Memory Checkpointing

[Diagram: the evicted line (1101.., bit set) moves to the Victim Cache; the new line 00001111 fills the L1]

SLIDE 31

Memory Checkpointing

[Diagram: the leader now has a Victim Cache alongside its L1 and L2]

SLIDE 32

Memory Checkpointing

[Diagram: a leader Store writes 11010101]

SLIDE 33

Memory Checkpointing

[Diagram: SYNC — lines with their bits set, in the L1 and the Victim Cache]

SLIDE 34

Memory Checkpointing

[Diagram: SYNC in progress]

SLIDE 35

Memory Checkpointing

[Diagram: after SYNC the lines’ bits are cleared]

SLIDE 36

Memory Checkpointing

[Diagram: Rollback — lines with their bits set are identified]

SLIDE 37

Memory Checkpointing

[Diagram: after Rollback the marked lines are discarded]
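The frames above can be condensed into a short sketch. This is an illustrative model, not FluidCheck's actual hardware: the class names and dictionaries are assumptions, and the Hint Store path is omitted; it only shows speculative leader state (L1 plus victim cache) being committed on SYNC or discarded on Rollback.

```python
# Hedged sketch of the checkpointing behaviour in the frames above: the
# leader's unverified stores live in its L1 (dirty lines evicted to a small
# victim cache); on SYNC they are committed, on Rollback they are discarded.
class Checkpoint:
    def __init__(self):
        self.l1 = {}      # addr -> speculative (unverified) data
        self.victim = {}  # dirty lines evicted from L1 before SYNC
        self.l2 = {}      # verified, committed state

    def store(self, addr, data):
        self.l1[addr] = data  # speculative store, not yet visible in L2

    def evict(self, addr):
        self.victim[addr] = self.l1.pop(addr)  # keep the line for SYNC

    def sync(self):
        # checker agreed with the leader: commit all unverified lines
        self.l2.update(self.l1)
        self.l2.update(self.victim)
        self.l1.clear()
        self.victim.clear()

    def rollback(self):
        # error detected: discard all unverified state
        self.l1.clear()
        self.victim.clear()

cp = Checkpoint()
cp.store(0x100, 7)
cp.evict(0x100)     # capacity pressure pushes the line to the victim cache
cp.store(0x104, 9)
cp.sync()
print(cp.l2)        # both stores committed
```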

SLIDE 38

Forwarding Filters

[Diagram: leader core (Ct, pipeline, L1, L2)]

SLIDE 39

Forwarding Filters

[Diagram: the leader issues a Ld/St]

SLIDE 40

Forwarding Filters

[Diagram: the access hits in the L1]

SLIDE 41

Forwarding Filters

[Diagram: L1 hit → Do Not Forward]

SLIDE 42

Forwarding Filters

[Diagram: the access misses in the L1]

SLIDE 43

Forwarding Filters

[Diagram: on the L1 miss, the RFB is looked up]

SLIDE 44

Forwarding Filters

[Diagram: the RFB lookup hits]

SLIDE 45

Forwarding Filters

[Diagram: RFB hit → Do Not Forward]

SLIDE 46

Forwarding Filters

[Diagram: the RFB lookup misses]

SLIDE 47

Forwarding Filters

[Diagram: on the RFB miss, the LFB is looked up (entry 11010011)]

SLIDE 48

Forwarding Filters

[Diagram: the LFB lookup]

SLIDE 49

Forwarding Filters

[Diagram: the LFB entry’s bit is set]

SLIDE 50

Forwarding Filters

[Diagram: the line is Forwarded to the checker]
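A minimal sketch of the filter decision in the frames above. The precise RFB/LFB semantics are assumptions made for illustration (including treating the RFB as a "recently forwarded" set that is updated on a forward); the frames only establish that a line is forwarded to the checker when the leader misses in both its L1 and the RFB, and that the LFB records the forwarding.

```python
# Hedged sketch of the forwarding-filter decision: forward a cache line to
# the checker only on a miss in both the leader's L1 and the RFB; the LFB
# tracks lines that have been forwarded.
def should_forward(addr, l1, rfb, lfb):
    """Return True if the leader should forward this cache line."""
    if addr in l1:   # L1 hit
        return False  # Do Not Forward
    if addr in rfb:  # RFB hit: already forwarded recently
        return False  # Do Not Forward
    # L1 miss and RFB miss: record the forwarding and forward the line
    rfb.add(addr)
    lfb.add(addr)
    return True      # Forward

l1, rfb, lfb = {0x10, 0x20}, set(), set()
print(should_forward(0x30, l1, rfb, lfb))  # miss in L1 and RFB -> True
print(should_forward(0x30, l1, rfb, lfb))  # now filtered by the RFB -> False
print(should_forward(0x10, l1, rfb, lfb))  # L1 hit -> False
```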

SLIDE 51

Arbiter Logic: I

  • Activity

▫ IPC
▫ WIPC(x)

  • Mapping a Single Thread

▫ Select the core with minimum activity that has free SMT slots
▫ If activity is IPC, the scheme is termed minIPC
▫ If activity is WIPC(x), the scheme is termed minWIPC_x
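The single-thread mapping rule above can be sketched as follows. The data structures, thread names, and activity values are illustrative assumptions; only the selection rule (minimum-activity core with a free SMT slot) comes from the slide.

```python
# Sketch of the arbiter's single-thread mapping policy (minIPC / minWIPC_x):
# pick the core with minimum total activity that has a free SMT slot.
def map_thread(cores, activity, smt_ways=4):
    """cores: core_id -> list of resident thread ids.
    activity: thread_id -> activity value (IPC for minIPC, WIPC(x) for minWIPC_x)."""
    candidates = [c for c, threads in cores.items() if len(threads) < smt_ways]
    if not candidates:
        return None  # no free SMT slot anywhere
    # A core's activity is the sum of the activities of its resident threads
    return min(candidates, key=lambda c: sum(activity[t] for t in cores[c]))

cores = {"A": ["L1", "L2"], "B": ["L3"], "C": []}
activity = {"L1": 1.2, "L2": 0.8, "L3": 1.5}
print(map_thread(cores, activity))  # core C is idle (activity 0) -> "C"
```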

SLIDE 52

Arbiter Logic: II

  • Mapping a Set of Threads

▫ Scheduling Policies:
  Pinned Leaders (SP-PL)
  Unpinned Leaders (SP-UL)
  Unpinned Leaders, All Leaders First (SP-UALF)

  • SMT Fetch Policy

▫ Full Simultaneous Issue [Tullsen1995]
▫ If n threads on a core have activities A1, A2, …, An, then the ith thread gets (Ai / Σₖ₌₁..ₙ Aₖ) × B fetch cycles out of a cycle block of size B
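The activity-proportional fetch allocation can be computed directly. Thread names and activity values below are illustrative; the formula (Ai over the sum of all activities, times the block size B) is the one on the slide.

```python
# Sketch of the SMT fetch policy: within a cycle block of size B, thread i
# receives (A_i / sum_k A_k) * B fetch cycles.
def fetch_shares(activities, B):
    """activities: thread_id -> activity value A_i. Returns fetch cycles per thread."""
    total = sum(activities.values())
    return {t: (a / total) * B for t, a in activities.items()}

shares = fetch_shares({"L1": 2.0, "C1": 1.0, "L2": 1.0}, B=100)
print(shares)  # L1 gets 50 cycles, C1 and L2 get 25 each
```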

SLIDE 53

Evaluation: Simulation Parameters

  • 16-core processor, 4-way SMT
  • Core configuration based on Intel Sandy Bridge and IBM POWER7

Parameter           | Value
Pipeline width      | 4
i-cache and d-cache | 32 kB
Shared L2 cache     | 12 MB
NOC topology        | 2D torus
Hint buffer         | 512 entries
Victim cache        | 32 entries
RFB and LFB         | 64 entries each

SLIDE 54

Evaluation Methodology

  • Tools

▫ Tejas Architectural Simulator
▫ McPAT and Orion2 models

  • Workloads

▫ “low”: 16 applications (16 + 16 threads)
▫ “medium”: 24 applications (24 + 24 threads)
▫ “high”: 32 applications (32 + 32 threads)
▫ In each case, 100 random combinations of SPEC CPU2006 benchmarks were considered

  • Comparison Metric

▫ Geometric mean, over the benchmarks b in the workload W, of the ratio of cycles taken to execute b unreliably to cycles taken to execute b reliably:

  ( ∏_{b ∈ W} (cycles to execute b unreliably) / (cycles to execute b reliably) )^(1 / |W|)
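The comparison metric above is a geometric mean of per-benchmark cycle ratios, which a few lines of code make concrete. The cycle counts below are made-up numbers for illustration only.

```python
# Sketch of the comparison metric: the geometric mean, over all benchmarks b
# in workload W, of (unreliable cycles / reliable cycles). Values near 1 mean
# reliable execution is nearly as fast as unprotected execution.
import math

def metric(cycles):
    """cycles: list of (unreliable_cycles, reliable_cycles), one pair per benchmark."""
    ratios = [u / r for u, r in cycles]
    return math.prod(ratios) ** (1.0 / len(ratios))

# Two benchmarks: one slows down 2x under reliable execution, one not at all
print(metric([(100, 200), (100, 100)]))  # sqrt(0.5 * 1.0) ≈ 0.707
```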

SLIDE 55

Evaluation: Results

[Chart: average performance degradation — SRT 47%, CRT 37%, FluidCheck 27%]

SLIDE 56

FluidCheck’s Mapping Ability

SLIDE 57

Performance of Forwarding Filters

SLIDE 58

Comparison with Generic Scheduling Schemes

  • DCCS [Settle2004]
  • IPCS [Parekh2000]
  • RIRS [ElMoursy2006]
  • TCA [Acosta2009]
  • L1 BW-aware [Feliu2013]
SLIDE 59

Conclusions

  • Efficient system-level solutions to handle soft errors are critically sought
  • The protection of modern multi-core, multithreading-capable processors presents interesting challenges
  • Our solution, FluidCheck, achieves reliability with a mere 27% reduction in performance on average, while seminal works such as SRT (47%) and CRT (37%) incur much higher slowdowns

SLIDE 60

Extra slides

SLIDE 61

DIVA: Checker Operation

Fetch

  • Check IP
  • Fetch from IP
  • <0x1234>

Decode

  • <R1=R2+R3>

Execute

  • Using the operand value hints
  • <5+2>

Writeback

  • Check communication
  • R2 == 5 ?
  • R3 == 2 ?
  • Check computation
  • 7 == res ?
  • Write 7 to R1

Commit

  • Complete store
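The writeback-stage checks above can be sketched for the running example, the packet <0x1234><op1=5><op2=2><res=7> for R1 = R2 + R3. The register-file and packet representations are assumptions made for illustration; only the two checks (communication, then computation) come from the slide.

```python
# Hedged sketch of the DIVA checker's writeback-stage checks: verify the
# operand hints against the checker's register file (communication), then
# re-execute using the hints and compare against the leader's result
# (computation). The add is specific to the R1 = R2 + R3 example.
def check_packet(pkt, regfile):
    """Verify one leader result packet; return True if it is error-free."""
    # Check communication: operand hints must match the register file
    if regfile["R2"] != pkt["op1"] or regfile["R3"] != pkt["op2"]:
        return False
    # Check computation: re-execute using the operand hints
    if pkt["op1"] + pkt["op2"] != pkt["res"]:
        return False
    regfile["R1"] = pkt["res"]  # writeback the verified result
    return True

regfile = {"R1": 0, "R2": 5, "R3": 2}
pkt = {"ip": 0x1234, "op1": 5, "op2": 2, "res": 7}
print(check_packet(pkt, regfile))  # True; R1 is now 7
```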
SLIDE 62

DIVA: Execution Assistance

  • The DIVA checker

▫ Faces no data hazards

Operand value hints are passed from the leader

▫ Faces no control hazards

The stream of packets from the leader is in correct dynamic order (if no soft error struck the prediction or branching logic)

If a soft error occurred (a rare event), it is detected when the branch condition is evaluated at the checker

SLIDE 63

DIVA: Consequence of Execution Assistance

  • What gains can be achieved through execution assistance?

▫ Checker can be made simpler
▫ Checker can be made slower
▫ Checker can be made to do more work

SLIDE 64

Resolving Livelock Issues

  • Suppose a checker thread faces a decode stall because the ROB is full
  • Suppose some other leader thread on the same core occupies the head of the ROB and is facing a long-latency miss
  • The checker thread is forced to migrate
  • Possibility of multiple forced migrations in quick succession – detrimental to performance
  • Solution – Reservation: if a resource is more than 95% full, it will not accept any more leader entries
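The reservation rule above amounts to a one-line admission check. The function shape and the capacity numbers are illustrative assumptions; the 95% threshold is the one stated on the slide.

```python
# Hedged sketch of the reservation rule: once a shared resource (e.g. the
# ROB) is more than 95% full, new leader entries are refused, keeping the
# remaining slots available for checker entries and avoiding repeated
# forced checker migrations.
def admit(is_leader_entry, occupancy, capacity, threshold=0.95):
    """Return True if the entry may be allocated a slot in the resource."""
    if is_leader_entry and occupancy / capacity > threshold:
        return False  # reserve the remaining slots for checker entries
    return True

print(admit(True, 97, 100))   # leader refused above 95% occupancy -> False
print(admit(False, 97, 100))  # checker entry still admitted -> True
```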