SLIDE 1

CAN WE PUT CONCURRENCY BACK INTO REDUNDANT MULTITHREADING?

Björn Döbel and Hermann Härtig (TU Dresden)

New Delhi, 14.10.2014

SLIDE 2

Motivation: Transient Hardware Faults

  • Radiation-induced soft errors

– Mainly an issue in avionics+space

  • DRAM errors in large data centers

– Google study: > 2% failing DRAM DIMMs per year [1]
– ECC insufficient [2]

  • Decreasing transistor sizes → higher rate of errors in CPU functional units [3]

[1] Schroeder, Pinheiro, Weber: DRAM Errors in the Wild: A Large-Scale Field Study, SIGMETRICS 2009
[2] Hwang, Stefanovici, Schroeder: Cosmic Rays Don’t Strike Twice, ASPLOS 2012
[3] Dixit, Wood: The Impact of New Technology on Soft Error Rates, IRPS 2011

RomainMT slide 1 of 17

SLIDE 3

ASTEROID Operating System

[Figure: system stack with Replicated Driver, Unreplicated Application, and Replicated Application on top; L4 Runtime Environment and Romain below them; L4/Fiasco.OC microkernel at the bottom]

SLIDE 7

Romain: Structure

[Figure: three replicas and the Romain master, with a system call proxy, a resource manager, and a state comparison (=)]

Details: Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012

SLIDE 10

How About Multithreading?

[Figure: two replicas, each running thread A (A1 to A4), thread B (B1 to B3), and thread C (C1 to C4)]

SLIDE 12

Problem: Nondeterminism

[Figure: both replicas run A1 to A4, but one replica then runs thread B before thread C while the other runs thread C before thread B]

SLIDE 14

Nondeterminism Example

    #include <pthread.h>

    int x = 1;
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Thread A increments x under the lock. */
    void *thread_A(void *data)
    {
        pthread_mutex_lock(&m);
        x = x + 1;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    /* Thread B doubles x under the same lock. */
    void *thread_B(void *data)
    {
        pthread_mutex_lock(&m);
        x = x * 2;
        pthread_mutex_unlock(&m);
        return NULL;
    }

  • race-free (locks)
  • (A;B) → x = 4
  • (B;A) → x = 3
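
The two outcomes above can be reproduced by forcing each acquisition order in turn. The following is a minimal sketch, not taken from the slides: `run_in_order` is a hypothetical helper that starts the second thread only after the first has been joined, so the lock is acquired in a known order.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static int x;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Same critical sections as on the slide. */
static void *thread_A(void *data)
{
    (void)data;
    pthread_mutex_lock(&m);
    x = x + 1;
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *thread_B(void *data)
{
    (void)data;
    pthread_mutex_lock(&m);
    x = x * 2;
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Hypothetical helper: runs A then B (a_first != 0) or B then A,
 * and returns the final value of x. */
int run_in_order(int a_first)
{
    void *(*first)(void *)  = a_first ? thread_A : thread_B;
    void *(*second)(void *) = a_first ? thread_B : thread_A;
    pthread_t t;
    int rc;

    x = 1;
    rc = pthread_create(&t, NULL, first, NULL);
    assert(rc == 0);
    pthread_join(t, NULL);   /* first thread is done before the second starts */
    rc = pthread_create(&t, NULL, second, NULL);
    assert(rc == 0);
    pthread_join(t, NULL);
    return x;
}
```

Both runs are race-free, yet `run_in_order(1)` and `run_in_order(0)` return different values. Without the forced join, a scheduler may pick either order, so two replicas of the same program can legitimately diverge.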

SLIDE 17

Solution: Deterministic Multithreading

  • Related work: debugging multithreaded programs
  • Compiler solutions [4]: no support for binary-only software
  • Workspace-consistent memory [5]: requires per-replica and per-thread memory copies
  • Lock-based determinism

– Reliance on ECC-protected memory [6]
– Our work reuses ideas from Kendo [7]

[4] Bergan et al.: CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution, ASPLOS 2010
[5] Aviram et al.: Efficient System-enforced Deterministic Parallelism, OSDI 2010
[6] Mushtaq et al.: Efficient Software Based Fault Tolerance Approach on Multicore Platforms, DATE 2013
[7] Olszewski et al.: Kendo: Efficient Deterministic Multithreading in Software, ASPLOS 2009

SLIDE 18

Enforced Determinism

  • Adapt libpthread: place INT3 into four functions

– pthread_mutex_lock
– pthread_mutex_unlock
– pthread_lock
– pthread_unlock

  • Lock operations reflected to RomainMT master
  • Master enforces lock ordering
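
One way to picture what enforcing a lock order means is a per-mutex acquisition log: the first replica to reach an acquisition decides which thread gets the lock, and every other replica must follow that choice. The sketch below is an illustration under that assumption only. It is not Romain's actual mechanism, and `lock_order`, `replica_view`, and `master_may_acquire` are hypothetical names. It also assumes the caller checks separately that the mutex is currently free within the replica.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOG 64

/* Hypothetical master-side state for one mutex. */
struct lock_order {
    int log[MAX_LOG];   /* thread IDs in the agreed acquisition order */
    size_t len;         /* number of acquisitions agreed so far */
};

/* Hypothetical per-replica cursor into the agreed order. */
struct replica_view {
    size_t next;        /* index of this replica's next acquisition */
};

/* Returns 1 if thread `tid` of this replica may acquire now,
 * 0 if it has to wait for another thread to go first. */
int master_may_acquire(struct lock_order *o, struct replica_view *r, int tid)
{
    assert(o != NULL && r != NULL);

    if (r->next == o->len) {
        /* First replica to get this far: its choice becomes the order. */
        if (o->len == MAX_LOG)
            return 0;
        o->log[o->len++] = tid;
        r->next++;
        return 1;
    }
    if (o->log[r->next] == tid) {
        /* Follower replica: the agreed order names this thread next. */
        r->next++;
        return 1;
    }
    return 0;   /* a different thread is next in the agreed order */
}
```

If replica 1 lets thread A acquire first, a request from thread B in replica 2 is refused until replica 2's thread A has acquired as well, so both replicas see the order (A;B).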

SLIDE 19

Enforced Determinism: Microbenchmark

[Figure: bar chart of execution time in seconds (axis 10 to 90) for Native, Single, DMR, and TMR. Native: 0.286 s; labeled slowdowns: 121x, 197x, 309x]

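2 threads, each running:

    #include <pthread.h>

    int x = 0;
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread increments the shared counter 5,000,000 times under the lock. */
    void *thread(void *data)
    {
        for (int i = 0; i < 5000000; ++i) {
            pthread_mutex_lock(&m);
            x = x + 1;
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }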
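
To relate the slowdown factors on this slide to the chart's time axis, each factor can be multiplied by the native runtime. This assumes that 121x, 197x, and 309x belong to Single, DMR, and TMR in that order, which the chart labels suggest but do not state explicitly.

```latex
\begin{align*}
t_{\text{Single}} &\approx 121 \times 0.286\,\text{s} \approx 34.6\,\text{s} \\
t_{\text{DMR}}    &\approx 197 \times 0.286\,\text{s} \approx 56.3\,\text{s} \\
t_{\text{TMR}}    &\approx 309 \times 0.286\,\text{s} \approx 88.4\,\text{s}
\end{align*}
```

For scale, the two threads together execute 2 × 5,000,000 lock/unlock pairs, that is 20,000,000 lock operations per replica, which is why this benchmark is dominated by the cost of each intercepted lock operation.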

SLIDE 21

Optimization Opportunities?

[Figure: CPU cores split across Socket 0 and Socket 1 (core numbers up to 11). Placed on the cores: workers W1,1 W2,1 W3,1; managers Mgr1 Mgr2 Mgr3; workers W1,2 W2,2 W3,2]

SLIDE 22

Optimization Opportunities?

[Figure: the same cores on Socket 0 and Socket 1 with an alternative placement, labeled: W1,1 W1,2 W1,3; Mgr1 Mgr2 Mgr3; W1,2 W2,2 W3,2]

SLIDE 23

Optimized Enforced Determinism

[Figure: bar chart of execution time in seconds (axis 10 to 90) for Single, DMR, and TMR, with three variants: Unoptimized, CPU Placement, Fast Synchronization. Labeled slowdowns: 194x, 138x, 212x, 192x]

SLIDE 24

Cooperative Determinism

  • Replication-aware libpthread
  • Replicas agree on acquisition order without master invocation
  • Trade-off: libpthread becomes single point of failure

[Figure: three replicas on CPU 0, CPU 1, and CPU 2, each with a replication-aware libpthread (pthr. rep) and a Lock Info Page (LIP) mapping; the shared Lock Info Page and the ROMAIN master are shown alongside]

SLIDE 25

Cooperation: Lock Acquisition

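[Flowchart: lock_rep(mtx)]

  • spinlock(mtx.spinlock)
  • Owner free?

– Yes: store owner ID, store owner epoch, spinunlock(mtx.spinlock), return
– No: check whether the owner is the calling thread

  • Owner self?

– Yes: check the epoch
– No: spinunlock(mtx.spinlock), yield CPU, retry

  • Epoch matches?

– Yes: spinunlock(mtx.spinlock), return
– No: spinunlock(mtx.spinlock), yield CPU, retry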
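
The lock acquisition flowchart on this slide can be sketched in C roughly as follows. This is a reading of the diagram labels, not the authors' code. The struct layout, the function names, and the meaning of the epoch are assumptions: the epoch is taken here to be a per-thread acquisition counter that has the same value in every replica of that thread. The slide does not show the release path, so it is left out.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define OWNER_FREE (-1)

/* Hypothetical per-mutex entry on the shared lock info page. */
struct rep_mutex {
    atomic_flag spinlock;   /* protects owner and epoch */
    int owner;              /* thread ID of the agreed owner, or OWNER_FREE */
    unsigned epoch;         /* epoch the owner stored when taking the lock */
};

void rep_mutex_init(struct rep_mutex *m)
{
    assert(m != NULL);
    atomic_flag_clear(&m->spinlock);
    m->owner = OWNER_FREE;
    m->epoch = 0;
}

static void spin_lock(atomic_flag *f)
{
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
        ;   /* busy-wait; the critical section is only a few instructions */
}

static void spin_unlock(atomic_flag *f)
{
    atomic_flag_clear_explicit(f, memory_order_release);
}

/* One pass through the flowchart: returns 1 if the caller may proceed,
 * 0 if it has to yield and retry. */
int lock_rep_try(struct rep_mutex *m, int tid, unsigned epoch)
{
    int acquired;

    spin_lock(&m->spinlock);
    if (m->owner == OWNER_FREE) {
        /* First replica to arrive: record owner ID and epoch. */
        m->owner = tid;
        m->epoch = epoch;
        acquired = 1;
    } else if (m->owner == tid && m->epoch == epoch) {
        /* Another replica of the owning thread, same acquisition. */
        acquired = 1;
    } else {
        acquired = 0;
    }
    spin_unlock(&m->spinlock);
    return acquired;
}

/* Full loop: retry after yielding the CPU, as in the flowchart. */
void lock_rep(struct rep_mutex *m, int tid, unsigned epoch)
{
    while (!lock_rep_try(m, tid, epoch))
        sched_yield();
}
```

The point of the design is that the decision is made entirely on shared memory by the replicas themselves, so no trap into the master is needed on the fast path.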

SLIDE 26

Cooperative Determinism: Microbenchmark

[Figure: bar chart of execution time in seconds (axis 2 to 20) for Single, DMR, and TMR, with three variants: Unoptimized, CPU Placement, Fast Synchronization. Labeled slowdowns: 2.17x, 13.5x, 11.7x, 67.9x, 31.1x, 30.3x]

SLIDE 27

Overhead: SPLASH2, 2 workers

[Figure: bar chart of runtime normalized vs. native (axis 0.9 to 1.8) for the SPLASH2 benchmarks Radiosity, Barnes, FMM, Raytrace, Water, Volrend, Ocean, FFT, LU, and Radix, plus GEOMEAN; series: Single Replica, Two Replicas, Three Replicas]

SLIDE 29

Overhead: SPLASH2, 4 workers

[Figure: bar chart of runtime normalized vs. native (axis 0.9 to 1.9) for the SPLASH2 benchmarks Radiosity, Barnes, FMM, Raytrace, Water, Volrend, Ocean, FFT, LU, and Radix, plus GEOMEAN; series: Single Replica, Two Replicas, Three Replicas. Bars exceeding the axis are labeled 3.93, 2.94, 2.02, 2.02]

Sources of overhead:

  • System call interception
  • Frequent memory allocation

  • Cache effects
  • Lock density

SLIDE 30

Summary

  • Redundant Multithreading as an OS Service
  • Binary-only applications
  • Multithreaded Replication using lock-based determinism

– Enforced Determinism
– Cooperative Determinism
– Lock density limits performance

  • Application overhead for TMR: 24% (2 workers), 65% (4 workers)

http://spp1500.itec.kit.edu
