Order Is A Lie


SLIDE 1

Order Is A Lie

Are you sure you know how your code runs ?

SLIDE 2

Order in code is not respected by

  • Compilers
  • Processors (out-of-order execution)
  • SMP Cache Management

Understanding execution order in a multithreaded context is out of reach of a human mind.

SLIDE 3

Compilers and Order ?

SLIDE 4

Order and Side Effects

int next()
{
    static int x = 0;
    return x++;
}

void g()
{
    int x = 0, y, tab[32];
    // can be equivalent to:
    //   tab[0] = 1
    //   tab[1] = 0;
    //   ...
    tab[x++] = x++;
    // x = 2 - 1 or 1 - 1 ?
    y = x + --x;
    // x = 0 - 1 or 1 - 0 ?
    x = next() - next();
}
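Beyond evaluation order inside a single statement, the compiler is also free to move memory accesses around as long as the single-threaded result stays the same. A minimal sketch (not from the slides) of what that freedom allows:

void wait_for(int *flag)
{
    // Nothing tells the compiler that *flag can change behind its back,
    // so the load may legally be hoisted out of the loop, turning the
    // wait into a single test (or an infinite loop).
    while (!*flag)
        ;
}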

SLIDE 5

Out Of Order ? OoO

SLIDE 6

OoO

Do you know what a pipeline is ? Out-of-order is the next step.

SLIDE 7

OoO

1990: first out-of-order microprocessor, the IBM POWER1.
Not a new idea:
1964/1966: first out-of-order machines, the CDC 6600 and the IBM 360/91.

SLIDE 8

Pipeline …

SLIDE 9

Pipeline … with OoO

SLIDE 10

OoO

SLIDE 11

OoO

int f(int *a)
{
    int x = 1, y;
    y = *a;
    x += 41;   // Doesn't need the previous statement
    *a = x;    // Requires the 2 previous statements
    return y;
}

SLIDE 12

And The Cache ?

SLIDE 13

Cache

multiple processors + slow memory = a lot of hardware caches !

SLIDE 14

Cache Coherency

  • M (Modified): the line is owned by 1 core
  • E (Exclusive): the line is held by 1 core only, unmodified
  • S (Shared): the line is shared
  • I (Invalid): the line is in state E or M elsewhere

SLIDE 15

Cache Coherency

    M   E   S   I
M   ✘   ✘   ✘   ✔
E   ✘   ✘   ✘   ✔
S   ✘   ✘   ✔   ✔
I   ✔   ✔   ✔   ✔

SLIDE 16

Cache Coherency

SLIDE 17

Cache Coherency

  • Line invalidation is expensive
  • To improve performance, processors use:
    ○ Store Buffer
    ○ Invalidate Queue
  • We need barriers ! (see the sketch below)
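As a hedged illustration (not from the slides) of why barriers are needed, here is the classic store-buffer litmus test: with each core's store parked in its store buffer, both threads can read 0, an outcome no interleaving of the source order can produce. The plain racy accesses are deliberate; the program only demonstrates what the hardware and compiler are allowed to do.

#include <stdio.h>
#include <threads.h>

int x, y, r0, r1;

int t0(void *arg) { (void)arg; x = 1; r0 = y; return 0; }   // store x, then load y
int t1(void *arg) { (void)arg; y = 1; r1 = x; return 0; }   // store y, then load x

int main(void)
{
    for (int i = 0; i < 100000; ++i) {
        x = y = 0;
        thrd_t a, b;
        thrd_create(&a, t0, NULL);
        thrd_create(&b, t1, NULL);
        thrd_join(a, NULL);
        thrd_join(b, NULL);
        // Both loads ran before the other core's store became visible.
        if (r0 == 0 && r1 == 0)
            printf("reordering observed at iteration %d\n", i);
    }
    return 0;
}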
SLIDE 18

So what can we do ?

SLIDE 19

Theoretical View

Determinism can be defined through the observation of memory state histories.
SLIDE 20

Theoretical View

A program is deterministic if we don't observe different state histories across (all possible) executions.

SLIDE 21

Linearizability

A history is atomic if:

  • its invocations and responses can be reordered to yield a sequential history;
  • that sequential history is correct according to the sequential definition of the object;
  • if a response preceded an invocation in the original history, it must still precede it in the sequential reordering.

SLIDE 22

Dealing With Memory

I/O automata can be used to describe properties and behavior independently of any concrete hardware implementation.

SLIDE 23

Dealing With Memory

[Diagram: I/O automata view — a Process INVOKEs the Front-End Object, which INVOKEs Object A; RESPONDs flow back along the same path.]

SLIDE 24

Main Results

  • Wait-free operations are possible
  • The only meaningful primitives are:

    ○ Compare-and-Swap (CAS)
    ○ Load-Link/Store-Conditional (ll/sc)

  • Order is not required for determinism !
SLIDE 25

Compare And Swap

bool CAS(int *loc, int cmp, int newval)
{
    if (*loc == cmp) {
        *loc = newval;
        return true;
    }
    return false;
}
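As a usage sketch (not from the slides), a lock-free increment built on CAS with C11 atomics; the loop retries until no other thread has changed the value between the load and the CAS:

#include <stdatomic.h>

void increment(atomic_int *counter)
{
    int old = atomic_load(counter);
    // On failure, atomic_compare_exchange_weak reloads 'old' with the
    // current value, so the loop simply retries with fresh data.
    while (!atomic_compare_exchange_weak(counter, &old, old + 1))
        ;
}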

SLIDE 26

ll/sc

  • Load from memory and link to the cell
  • Store to the cell succeeds only if no write was made since the link
  • More powerful than CAS
  • More RISC oriented
  • Many implementations are weak
SLIDE 27

ll/sc vs. CAS

  • Hardware ll/sc is often broken
  • Most broken ll/sc can simulate CAS
  • Most algorithms are described using CAS
SLIDE 28

Memory Barriers

  • Release: forces all write operations to be finished before the barrier
  • Acquire: prevents all read operations from beginning before the barrier
  • Full: acquire and release at the same time

Barriers will also flush Store Buffers and Invalidate Queues.

SLIDE 29

Memory Barriers

#include <stdio.h>

void worker0(char *msg, char *shr, int *ok)
{
    for (char *cur = msg; *cur; ++cur, ++shr)
        *shr = *cur;
    // need a release barrier
    *ok = 1;
}

void worker1(char *shr, int *ok)
{
    if (*ok)
        // need an acquire barrier
        printf("%s\n", shr);
}
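A hedged sketch (not from the slides) of how those two barriers can be expressed with C11 atomics, assuming ok is changed to an atomic_int:

#include <stdatomic.h>
#include <stdio.h>

void worker0(char *msg, char *shr, atomic_int *ok)
{
    char *cur = msg;
    for (; *cur; ++cur, ++shr)
        *shr = *cur;
    *shr = '\0';                                        // terminate the shared copy
    // Release: all bytes copied above become visible before ok reads as 1.
    atomic_store_explicit(ok, 1, memory_order_release);
}

void worker1(char *shr, atomic_int *ok)
{
    // Acquire: if ok reads as 1, the writes made before the matching
    // release store are guaranteed to be visible here.
    if (atomic_load_explicit(ok, memory_order_acquire))
        printf("%s\n", shr);
}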

SLIDE 30

Non Blocking

SLIDE 31

Non Blocking ?

  • It's all about progression
  • We don't want locks
  • We want minimal system interactions
  • We want to scale upon heavy contention
SLIDE 32

Linearization Point

  • Usual mistake: atomic means one instruction
  • For observers, an operation is atomic if there's a point marking the change: the Linearization Point

[Timeline: during the Operation, no change is visible until the Linearization Point; afterwards the state appears Updated.]

SLIDE 33

Lock-free

As long as one thread is active, the whole system makes progress. A lock-free algorithm should leave shared data in correct states between linearization points.

SLIDE 34
Lock-free

  • Rely only on CAS
  • Usual schema is (sketched below):
    a. Prepare
    b. Acquire entry data points
    c. Prepare the update
    d. Update (CAS) if the entry data are still valid, or go back to b
  • d is the linearization point
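A hedged sketch (not from the slides) of that schema, using a lock-free stack push in C11 atomics; node and push are illustrative names:

#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

void push(struct node *_Atomic *top, int value)
{
    struct node *n = malloc(sizeof *n);        // a. prepare (error handling omitted)
    n->value = value;
    struct node *old = atomic_load(top);       // b. acquire the entry data point
    do {
        n->next = old;                         // c. prepare the update
    } while (!atomic_compare_exchange_weak(top, &old, n));
    // d. the successful CAS is the linearization point; on failure 'old' is
    //    reloaded with the current top and the loop goes back to c.
}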

SLIDE 35

Lock-free

Existing Algorithms (mostly in Java) for:

  • Stack
  • Queue
  • Linked list
  • Skip-list
SLIDE 36

Lock-free Queue

The lock-free queue is a classic (PODC '96). It has been implemented for years in Java, but not in C++, due to the lack of a memory model.

  • 1. Acquire tail (push) or head (pop)
  • 2. Prepare for update
  • 3. When the queue is in a temporary state (incomplete pop), finish the job and retry
  • 4. In all cases, if the acquired pointers have changed, retry; otherwise do the update (sketched below).
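A hedged sketch (not from the slides) of the push/enqueue side of that algorithm in C11 atomics, following Michael & Scott; ABA protection and memory reclamation are deliberately omitted here, since they are the topic of the next slides:

#include <stdatomic.h>
#include <stdlib.h>

struct qnode {
    int value;
    struct qnode *_Atomic next;
};

struct queue {
    struct qnode *_Atomic head;
    struct qnode *_Atomic tail;
};

void enqueue(struct queue *q, int value)
{
    struct qnode *n = malloc(sizeof *n);               // 2. prepare for update
    n->value = value;
    n->next = NULL;

    for (;;) {
        struct qnode *tail = atomic_load(&q->tail);    // 1. acquire the tail
        struct qnode *next = atomic_load(&tail->next);
        if (tail != atomic_load(&q->tail))             // 4. tail changed: retry
            continue;
        if (next != NULL) {                            // 3. temporary state: a push is
            atomic_compare_exchange_weak(&q->tail, &tail, next); //    half done, help it
            continue;                                  //    and retry
        }
        struct qnode *expected = NULL;
        if (atomic_compare_exchange_weak(&tail->next, &expected, n)) {
            // Linearization point: the node is linked after the old tail.
            atomic_compare_exchange_weak(&q->tail, &tail, n);    // swing tail (best effort)
            return;
        }
        // CAS failed: another push won the race, retry from the top.
    }
}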

SLIDE 37

Lock-free and Memory

In most lock-free algorithms, threads can hold pointers that can be deleted by other threads.

SLIDE 38

Lock-free and Memory

  • First attempt: use a recycler
    ○ avoids early free
    ○ does not protect from ABA issues
  • Use a garbage collector ?
    ○ solves early free and ABA issues
    ○ are GCs wait/lock-free ? …

SLIDE 39

ABA problem

[Timeline: a thread reads pointer A; meanwhile the entry becomes B, then A again; a second read still sees A, so the intermediate change goes unnoticed.]

SLIDE 40

Lock-free and Memory

Two main solutions:

  • Double-word based solutions (sketched below)
    ○ use a pointer/counter pair
    ○ only x86-64 provides a 128-bit CAS
  • Hazard Pointers
    ○ simple
    ○ wait-free
    ○ not hardware dependent
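A hedged sketch (not from the slides) of the pointer/counter idea: the version counter changes on every update, so an ABA'd pointer no longer compares equal. Whether the 16-byte CAS is actually lock-free depends on the hardware (e.g. cmpxchg16b on x86-64); the names below are illustrative:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct node;                       // element type, defined elsewhere

struct vptr {                      // pointer + version counter
    struct node *ptr;
    uintptr_t    count;
};

typedef _Atomic struct vptr atomic_vptr;

// Replace the head only if neither the pointer nor its version has changed
// since 'expected' was read; every successful update bumps the version.
bool try_replace(atomic_vptr *head, struct vptr expected, struct node *newptr)
{
    struct vptr desired = { newptr, expected.count + 1 };
    return atomic_compare_exchange_strong(head, &expected, desired);
}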

SLIDE 41

Lock-free Performances

  • Academics: better performance than lock-based algorithms
  • Java: the implementation agrees
  • C++ ? No official ones; mine has strange results.
  • Pure benchmark speed-ups are not clear
  • Hybrid algorithms (TBB) can do better with a limited number of threads.

SLIDE 42

Wait-free

In a given set of processes, each process can perform its action in a finite (bounded) number of steps.

SLIDE 43

Wait-free

  • Far more difficult than lock-free
  • Implementations are far more expensive
  • Can't use a failure/retry loop
  • Most implementations use a helping system:
    1. Make a forward step for another thread
    2. Start its own action, step by step
  • All pending operations make progress !
SLIDE 44

Wait-free

Recently (2011), a new approach appeared:

  • Mix a lock-free algorithm with the helping mechanism:
    1. Try to help every N calls
    2. Bounded failure/retry loop (lock-free)
    3. Failed ? Move to the helping mechanism
  • Provides performance similar to lock-free algorithms.
SLIDE 45

RCU by Example

[Diagram: a linked-list insert — concurrent readers see the list either logically before the insert or logically after it, never an intermediate state.]
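A hedged sketch (not from the slides) of the insert in that diagram, in C11 atomics: the new node is fully built before a single release store publishes it, so a reader traversing concurrently sees the list either logically before or logically after the insert. A single writer (or external writer-side locking) is assumed, and the names are illustrative:

#include <stdatomic.h>
#include <stdlib.h>

struct lnode {
    int value;
    struct lnode *_Atomic next;
};

void insert_after(struct lnode *pos, int value)
{
    struct lnode *n = malloc(sizeof *n);
    n->value = value;
    // Build the node completely before it becomes reachable.
    n->next = atomic_load_explicit(&pos->next, memory_order_relaxed);
    // Single linearization point: publish the node. The release store pairs
    // with the acquire (or consume) loads done by readers walking the list.
    atomic_store_explicit(&pos->next, n, memory_order_release);
}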

SLIDE 46

RCU by Example

SLIDE 47

Conclusion

SLIDE 48

?