Linux kernel synchronization, Don Porter, CSE 506

SLIDE 1

CSE 506: Operating Systems

Linux kernel synchronization

Don Porter CSE 506

SLIDE 2

Logical Diagram

[Diagram: kernel components arranged in user, kernel, and hardware layers: binary formats, memory allocators, threads, system calls, memory management, CPU scheduler, file system, networking, device drivers, sync, RCU, consistency, interrupts, disk, and net]

Today’s Lecture: Synchronization in the kernel

SLIDE 3

Warm-up

  • What is synchronization?

– Code on multiple CPUs coordinates its operations

  • Examples:

– Locking provides mutual exclusion while changing a pointer-based data structure
– Threads might wait at a barrier for completion of a phase of computation
– Coordinating which CPU handles an interrupt

SLIDE 4

Why Linux synchronization?

  • A modern OS kernel is one of the most complicated parallel programs you can study

– Other than perhaps a database

  • Includes most common synchronization patterns

– And a few interesting, uncommon ones

SLIDE 5

Historical perspective

  • Why did OSes have to worry so much about synchronization back when most computers had only one CPU?

SLIDE 6

The old days: They didn’t worry!

  • Early/simple OSes (like JOS, pre-lab4): No need for synchronization

– All kernel requests wait until completion, even disk requests
– Heavily restrict when interrupts can be delivered (all traps use an interrupt gate)
– No possibility for two CPUs to touch the same data

SLIDE 7

Slightly more recently

  • Optimize kernel performance by blocking inside the kernel
  • Example: Rather than wait on expensive disk I/O, block and schedule another process until it completes

– Cost: A bit of implementation complexity

  • Need a lock to protect against concurrent updates to the pages/inodes/etc. involved in the I/O
  • Could be accomplished with relatively coarse locks
  • Like the Big Kernel Lock (BKL)

– Benefit: Better CPU utilization

SLIDE 8

A slippery slope

  • We can enable interrupts during system calls

– More complexity, lower latency

  • We can block in more places that make sense

– Better CPU usage, more complexity

  • Concurrency was an optimization for really fancy OSes, until…

SLIDE 9

The forcing function

  • Multi-processing

– CPUs aren’t getting faster, just smaller
– So you can put more cores on a chip

  • The only way software (including kernels) will get faster is to do more things at the same time

SLIDE 10

Performance Scalability

  • How much more work can this software complete in a unit of time if I give it another CPU?

– Same: No scalability; the extra CPU is wasted
– 1 -> 2 CPUs doubles the work: Perfect scalability

  • Most software isn’t scalable
  • Most scalable software isn’t perfectly scalable
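
The scalability question can be put into numbers. The following is a minimal sketch; the helper names `speedup` and `efficiency` are assumptions made for illustration and do not come from the lecture.

```c
/* Hypothetical helpers for quantifying scalability; not from the lecture. */

/* Speedup: how many times faster the run is with n CPUs than with one. */
double speedup(double time_1cpu, double time_ncpu)
{
    return time_1cpu / time_ncpu;
}

/* Efficiency: fraction of the ideal n-fold speedup actually achieved.
 * 1.0 means perfect scalability; 1.0 / n means the extra CPUs are wasted. */
double efficiency(double time_1cpu, double time_ncpu, int ncpus)
{
    return speedup(time_1cpu, time_ncpu) / ncpus;
}
```

With made-up timings, a job that takes 10 s on one CPU and 5 s on two has a speedup of 2 and an efficiency of 1.0; if it still takes 10 s on two CPUs, the efficiency drops to 0.5.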

SLIDE 11

Performance Scalability

[Plot: execution time (s) vs. number of CPUs (1 to 4), with curves for perfect scalability, somewhat scalable, and not scalable]

Ideal: Time halves with 2x CPUs

SLIDE 12

Performance Scalability (more visually intuitive)

[Plot: performance, as 1 / execution time (s), vs. number of CPUs (1 to 4), with curves for perfect scalability, somewhat scalable, and not scalable]

Slope = 1 == perfect scaling

SLIDE 13

Performance Scalability (A 3rd visual)

[Plot: execution time (s) * CPUs vs. number of CPUs (1 to 4), with curves for perfect scalability, somewhat scalable, and not scalable]

Slope = 0 == perfect scaling

SLIDE 14

Coarse vs. Fine-grained locking

  • Coarse: A single lock for everything

– Idea: Before I touch any shared data, grab the lock
– Problem: Completely unrelated operations wait on each other

  • Adding CPUs doesn’t improve performance

SLIDE 15

Fine-grained locking

  • Fine-grained locking: Many “little” locks for individual data structures

– Goal: Unrelated activities hold different locks

  • Hence, adding CPUs improves performance

– Cost: Complexity of coordinating locks
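
To make the contrast between one big lock and many little locks concrete, here is a minimal user-space sketch. The `struct table`, the `spin_t` type, and every function name are hypothetical, and the spinlock is a bare C11 `atomic_flag`, far simpler than the kernel's.

```c
#include <stdatomic.h>

#define NBUCKETS 4

/* Minimal test-and-set spinlock on C11 atomic_flag (an assumption of this
 * sketch; the kernel's spinlock is more elaborate). */
typedef struct { atomic_flag held; } spin_t;

void spin_acquire(spin_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the holder clears the flag */
}

void spin_release(spin_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Hypothetical table of counters, one per bucket. */
struct table {
    spin_t big_lock;                /* coarse: one lock for everything */
    spin_t bucket_lock[NBUCKETS];   /* fine: one "little" lock per bucket */
    long count[NBUCKETS];
};

void table_init(struct table *t)
{
    atomic_flag_clear(&t->big_lock.held);
    for (int i = 0; i < NBUCKETS; i++) {
        atomic_flag_clear(&t->bucket_lock[i].held);
        t->count[i] = 0;
    }
}

/* Coarse-grained: every update serializes on big_lock, even when two CPUs
 * touch unrelated buckets. */
void add_coarse(struct table *t, unsigned key, long n)
{
    spin_acquire(&t->big_lock);
    t->count[key % NBUCKETS] += n;
    spin_release(&t->big_lock);
}

/* Fine-grained: only updates to the same bucket wait on each other. */
void add_fine(struct table *t, unsigned key, long n)
{
    unsigned b = key % NBUCKETS;
    spin_acquire(&t->bucket_lock[b]);
    t->count[b] += n;
    spin_release(&t->bucket_lock[b]);
}
```

A real table would commit to one scheme; calling both `add_coarse` and `add_fine` concurrently on the same table would not be safe, since the two paths take different locks.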

SLIDE 16

Current Reality

[Plot: performance against complexity, with points for fine-grained locking and coarse-grained locking]

  • Unsavory trade-off between complexity and performance scalability

SLIDE 17

How do locks work?

  • Two key ingredients:

– A hardware-provided atomic instruction

  • Determines who wins under contention

– A waiting strategy for the loser(s)

SLIDE 18

Atomic instructions

  • A “normal” instruction can span many CPU cycles

– Example: ‘a = b + c’ requires 2 loads and a store
– These loads and stores can interleave with other CPUs’ memory accesses

  • An atomic instruction guarantees that the entire operation is not interleaved with any other CPU

– x86: Certain instructions can have a ‘lock’ prefix
– Intuition: This CPU ‘locks’ all of memory
– Expensive! Not ever used automatically by a compiler; must be explicitly used by the programmer
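
In portable C, the programmer asks for an atomic operation explicitly through C11 `<stdatomic.h>`. The sketch below contrasts a plain increment with an atomic one; the variable and function names are made up for illustration.

```c
#include <stdatomic.h>

/* Plain counter: `plain_counter++` is a load, an add, and a store, so
 * increments from two CPUs can interleave and one update can be lost. */
long plain_counter = 0;

/* Atomic counter: the read-modify-write is indivisible. On x86, compilers
 * typically emit a lock-prefixed instruction for this operation. */
atomic_long atomic_counter = 0;

void bump_plain(void)
{
    plain_counter++;            /* NOT safe if several CPUs call this */
}

long bump_atomic(void)
{
    /* Returns the value the counter held just before the increment. */
    return atomic_fetch_add(&atomic_counter, 1);
}
```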

SLIDE 19

Atomic instruction examples

  • Atomic increment/decrement (x++ or x--)

– Used for reference counting
– Some variants also return the value x was set to by this instruction (useful if another CPU immediately changes the value)

  • Compare and swap

– if (x == y) x = z;
– Used for many lock-free data structures
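
Both examples can be sketched with C11 atomics. The `struct object` type and the function names below are hypothetical; the sketch only illustrates why the value returned by the atomic operation matters for reference counting, and what compare and swap does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct object {
    atomic_int refs;
};

void object_get(struct object *o)
{
    atomic_fetch_add(&o->refs, 1);
}

/* Returns true if this call dropped the last reference. The decision uses
 * the value returned by the atomic operation rather than re-reading refs,
 * because another CPU may change the counter immediately afterwards. */
bool object_put(struct object *o)
{
    return atomic_fetch_sub(&o->refs, 1) == 1;
}

/* Compare and swap: atomically perform `if (*x == y) *x = z;` and report
 * whether the store happened. */
bool cas(atomic_int *x, int y, int z)
{
    return atomic_compare_exchange_strong(x, &y, z);
}
```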

SLIDE 20

Atomic instructions + locks

  • Most lock implementations have some sort of counter
  • Say initialized to 1
  • To acquire the lock, use an atomic decrement

– If you set the value to 0, you win! Go ahead
– If you get < 0, you lose. Wait
– Atomic decrement ensures that only one CPU will decrement the value to zero

  • To release, set the value back to 1
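
The counter protocol above can be sketched as a single acquisition attempt plus a release. The `struct counter_lock` type and its functions are hypothetical names chosen for this illustration; waiting after a failed attempt is left out here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock following the counter protocol: 1 = free, <= 0 = held. */
struct counter_lock {
    atomic_int counter;
};

void counter_lock_init(struct counter_lock *l)
{
    atomic_init(&l->counter, 1);
}

/* One acquisition attempt. atomic_fetch_sub returns the old value, so the
 * caller that sees 1 is the only one that moved the counter from 1 to 0. */
bool counter_lock_try(struct counter_lock *l)
{
    return atomic_fetch_sub(&l->counter, 1) == 1;
}

void counter_lock_release(struct counter_lock *l)
{
    atomic_store(&l->counter, 1);   /* back to "free" */
}
```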

SLIDE 21

Waiting strategies

  • Spinning: Just poll the atomic counter in a busy loop; when it becomes 1, try the atomic decrement again
  • Blocking: Create a kernel wait queue and go to sleep, yielding the CPU to more useful work

– Winner is responsible to wake up losers (in addition to setting lock variable to 1)
– Create a kernel wait queue, the same thing used to wait on I/O

  • Note: Moving to a wait queue takes you out of the scheduler’s run queue

SLIDE 22

Which strategy to use?

  • Main consideration: Expected time waiting for the lock vs. time to do 2 context switches

– If the lock will be held a long time (like while waiting for disk I/O), blocking makes sense
– If the lock is only held momentarily, spinning makes sense

  • Other, subtle considerations we will discuss later

SLIDE 23

Linux lock types

  • Blocking: mutex, semaphore, completions
  • Non-blocking: spinlocks, seqlocks

SLIDE 24

Linux spinlock (simplified)

    1:  lock; decb slp->slock    // Locked decrement of lock var
        jns 3f                   // Jump if sign not set (result is zero) to 3
    2:  pause                    // Low-power instruction, wakes on
                                 // coherence event
        cmpb $0,slp->slock       // Read the lock value, compare to zero
        jle 2b                   // If less than or equal (to zero), goto 2
        jmp 1b                   // Else jump to 1 and try again
    3:                           // We win the lock

SLIDE 25

Rough C equivalent

    while (0 != atomic_dec(&lock->counter)) {
        do {
            // Pause the CPU until some coherence
            // traffic (a prerequisite for the counter
            // changing), saving power
        } while (lock->counter <= 0);
    }

SLIDE 26

Why 2 loops?

  • Functionally, the outer loop is sufficient
  • Problem: Attempts to write this variable invalidate it in all other caches

– If many CPUs are waiting on this lock, the cache line will bounce between CPUs that are polling its value

  • This is VERY expensive and slows down EVERYTHING on the system

– The inner loop read-shares this cache line, allowing all polling in parallel

  • This pattern is called a Test&Test&Set lock (vs. Test&Set)
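
A user-space version of the two-loop pattern can be sketched with C11 atomics. This is an illustration under stated assumptions, not the kernel's implementation: `struct ttas_lock` and its functions are hypothetical names, and the inner loop spins without the `pause` instruction.

```c
#include <stdatomic.h>

/* Hypothetical Test&Test&Set lock: 1 = free, <= 0 = held. */
struct ttas_lock {
    atomic_int counter;
};

void ttas_init(struct ttas_lock *l)
{
    atomic_init(&l->counter, 1);
}

void ttas_acquire(struct ttas_lock *l)
{
    /* Outer loop ("test and set"): an atomic write, which needs the cache
     * line in exclusive mode and invalidates it in all other caches. */
    while (atomic_fetch_sub_explicit(&l->counter, 1,
                                     memory_order_acquire) != 1) {
        /* Inner loop ("test"): loads only, so every waiter can keep the
         * line in shared mode until the holder writes 1. */
        while (atomic_load_explicit(&l->counter,
                                    memory_order_relaxed) <= 0)
            ;   /* a real implementation would pause here */
    }
}

void ttas_release(struct ttas_lock *l)
{
    atomic_store_explicit(&l->counter, 1, memory_order_release);
}
```

Dropping the inner loop leaves a plain Test&Set lock that is still correct, but every waiting CPU would then keep writing to the lock's cache line.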

SLIDE 27

Test & Set Lock

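[Diagram: CPU 0 and CPU 1, each with its own cache, share a memory bus to RAM. Both run atomic_dec on the lock at address 0x1000 in the loop while (0 != atomic_dec(&lock->counter)) while CPU 2 holds the lock; each attempt forces a write-back and eviction of the cache line in the other cache]

Cache line “ping-pongs” back and forth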

SLIDE 28

Test & Test & Set Lock

[Diagram: CPU 0 and CPU 1, each with its own cache, share a memory bus to RAM. Both only read the lock at address 0x1000 in the loop while (lock->counter <= 0) while CPU 2 holds the lock; CPU 2 unlocks by writing 1]

Line shared in read mode until unlocked

SLIDE 30

Reader/writer locks

  • Simple optimization: If I am just reading, we can let other readers access the data at the same time

– Just no writers

  • Writers require mutual exclusion

SLIDE 31

Linux RW-Spinlocks

  • Low 24 bits count active readers

– Unlocked: 0x01000000
– To read lock: atomic_dec_unless(count, 0)

  • 1 reader: 0x00ffffff
  • 2 readers: 0x00fffffe
  • Etc.
  • Readers limited to 2^24. That is a lot of CPUs!
  • 25th bit for writer

– Write lock: CAS 0x01000000 -> 0

  • Readers will fail to acquire the lock until we add 0x01000000
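
The counter encoding above can be sketched with C11 atomics. The `struct rw_lock` type and the function names are hypothetical; only try-lock variants are shown, so a caller would retry or wait on failure, and "decrement unless zero" is written as a compare-and-swap loop.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RW_UNLOCKED 0x01000000u   /* 25th bit set, no readers */

/* Hypothetical reader/writer spinlock word. */
struct rw_lock {
    _Atomic uint32_t count;
};

void rw_init(struct rw_lock *l)
{
    atomic_init(&l->count, RW_UNLOCKED);
}

/* Reader: decrement unless the count is 0 (a writer holds the lock). */
bool read_trylock(struct rw_lock *l)
{
    uint32_t old = atomic_load(&l->count);
    while (old != 0) {
        /* On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&l->count, &old, old - 1))
            return true;
    }
    return false;
}

void read_unlock(struct rw_lock *l)
{
    atomic_fetch_add(&l->count, 1);
}

/* Writer: succeeds only if no reader and no writer holds the lock. */
bool write_trylock(struct rw_lock *l)
{
    uint32_t expected = RW_UNLOCKED;
    return atomic_compare_exchange_strong(&l->count, &expected, 0);
}

void write_unlock(struct rw_lock *l)
{
    atomic_fetch_add(&l->count, RW_UNLOCKED);
}
```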

SLIDE 32

Subtle issue

  • What if we have a constant stream of readers and a waiting writer?

– The writer will starve

  • We may want to prioritize writers over readers

– For instance, when readers are polling for the write
– How to do this?

SLIDE 33

Seqlocks

  • Explicitly favor writers, potentially starve readers
  • Idea:

– An explicit write lock (one writer at a time)
– Plus a version number: each writer increments it at the beginning and end of its critical section

  • Readers: Check version number, read data, check again

– If version changed, try again in a loop
– If version hasn’t changed and is even, neither has data
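
The reader and writer protocol above can be sketched in user space with C11 atomics. Everything named here (`struct seq_pair`, `seq_read`, `seq_write`) is hypothetical; the sketch uses sequentially consistent atomics throughout for simplicity, where a tuned implementation would use weaker orderings and explicit barriers.

```c
#include <stdatomic.h>

/* Hypothetical seqlock-protected pair of values. */
struct seq_pair {
    atomic_flag write_lock;     /* explicit writer lock */
    atomic_uint version;        /* odd while a write is in progress */
    atomic_int a, b;            /* atomics so racy reads are well defined */
};

void seq_init(struct seq_pair *s, int a, int b)
{
    atomic_flag_clear(&s->write_lock);
    atomic_init(&s->version, 0);
    atomic_init(&s->a, a);
    atomic_init(&s->b, b);
}

void seq_write(struct seq_pair *s, int a, int b)
{
    while (atomic_flag_test_and_set(&s->write_lock))
        ;                                   /* one writer at a time */
    atomic_fetch_add(&s->version, 1);       /* now odd: write in progress */
    atomic_store(&s->a, a);
    atomic_store(&s->b, b);
    atomic_fetch_add(&s->version, 1);       /* even again: write complete */
    atomic_flag_clear(&s->write_lock);
}

void seq_read(struct seq_pair *s, int *a, int *b)
{
    unsigned v;
    do {
        v = atomic_load(&s->version);
        *a = atomic_load(&s->a);
        *b = atomic_load(&s->b);
        /* Retry if a write was in progress or completed in the meantime. */
    } while ((v & 1) || v != atomic_load(&s->version));
}
```

Readers never write to shared memory, so they cannot delay a writer; the price is that a reader may loop for as long as writers keep arriving.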

SLIDE 34

Seqlock Example

[Diagram: a seqlock (version number and lock) protecting two fields: 70% time for CSE 506 and 30% time for all else]

Invariant: Must add up to 100%

SLIDE 35

Seqlock Example

[Diagram: the seqlock (version number and lock) protecting 70% time for CSE 506 and 30% time for all else; as the writer runs, the version steps through 1 and 2 and the fields become 80 and 20]

Reader:

    do {
        v = version;
        a = cse506;
        b = other;
    } while (v % 2 == 1 || v != version);

Writer:

    lock();
    version++;
    other = 20;
    cse506 = 80;
    version++;
    unlock();

What if reader executed now?

SLIDE 37

Composing locks

  • Suppose I need to touch two data structures (A and B) in the kernel, protected by two locks.
  • What could go wrong?

– Deadlock!
– Thread 0: lock(a); lock(b)
– Thread 1: lock(b); lock(a)

  • How to solve?

– Lock ordering

SLIDE 38

Lock Ordering

  • A program code convention
  • Developers get together, have lunch, plan the order of locks
  • In general, nothing at compile time or run-time prevents you from violating this convention

– Research topics on making this better:

  • Finding locking bugs
  • Automatically locking things properly
  • Transactional memory

SLIDE 39

How to order?

  • What if I lock each entry in a linked list? What is a sensible ordering?

– Lock each item in list order
– What if the list changes order?
– Uh-oh! This is a hard problem

  • Lock-ordering usually reflects static assumptions about the structure of the data

– When you can’t make these assumptions, ordering gets hard

SLIDE 40

Linux solution

  • In general, locks for dynamic data structures are ordered by kernel virtual address

– I.e., grab locks in increasing virtual address order

  • A few places where traversal path is used instead
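
Ordering by address can be sketched as a helper that takes two locks in increasing address order regardless of how the caller names them. The `spin_t` type and the `lock_pair`/`unlock_pair` helpers are hypothetical names for this user-space illustration, not kernel APIs.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Minimal sketch spinlock on C11 atomic_flag (hypothetical type). */
typedef struct { atomic_flag held; } spin_t;

void spin_acquire(spin_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin */
}

void spin_release(spin_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Acquire two locks in increasing address order. Every thread then agrees
 * on the order, so the lock(a); lock(b) vs. lock(b); lock(a) deadlock
 * cannot occur. */
void lock_pair(spin_t *x, spin_t *y)
{
    if (x == y) {               /* same lock named twice: take it once */
        spin_acquire(x);
        return;
    }
    if ((uintptr_t)x > (uintptr_t)y) {
        spin_t *tmp = x;
        x = y;
        y = tmp;
    }
    spin_acquire(x);            /* lower address first */
    spin_acquire(y);
}

void unlock_pair(spin_t *x, spin_t *y)
{
    spin_release(x);
    if (x != y)
        spin_release(y);
}
```

Release order does not matter for deadlock avoidance; only the acquisition order has to be consistent across threads.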

SLIDE 41

Lock ordering in practice

From Linux: fs/dcache.c

    void d_prune_aliases(struct inode *inode)
    {
        struct dentry *dentry;
        struct hlist_node *p;
    restart:
        spin_lock(&inode->i_lock);
        hlist_for_each_entry(dentry, p, &inode->i_dentry, d_alias) {
            spin_lock(&dentry->d_lock);
            if (!dentry->d_count) {
                __dget_dlock(dentry);
                __d_drop(dentry);
                spin_unlock(&dentry->d_lock);
                spin_unlock(&inode->i_lock);
                dput(dentry);
                goto restart;
            }
            spin_unlock(&dentry->d_lock);
        }
        spin_unlock(&inode->i_lock);
    }

Care taken to lock inode before each alias
Inode lock protects list; must restart loop after modification

SLIDE 42

mm/filemap.c lock ordering

    /*
     * Lock ordering:
     *
     *  ->i_mmap_lock                   (vmtruncate)
     *    ->private_lock                (__free_pte->__set_page_dirty_buffers)
     *      ->swap_lock                 (exclusive_swap_page, others)
     *        ->mapping->tree_lock
     *
     *  ->i_mutex
     *    ->i_mmap_lock                 (truncate->unmap_mapping_range)
     *
     *  ->mmap_sem
     *    ->i_mmap_lock
     *      ->page_table_lock or pte_lock   (various, mainly in memory.c)
     *        ->mapping->tree_lock      (arch-dependent flush_dcache_mmap_lock)
     *
     *  ->mmap_sem
     *    ->lock_page                   (access_process_vm)
     *
     *  ->mmap_sem
     *    ->i_mutex                     (msync)
     *
     *  ->i_mutex
     *    ->i_alloc_sem                 (various)
     *
     *  ->inode_lock
     *    ->sb_lock                     (fs/fs-writeback.c)
     *    ->mapping->tree_lock          (__sync_single_inode)
     *
     *  ->i_mmap_lock
     *    ->anon_vma.lock               (vma_adjust)
     *
     *  ->anon_vma.lock
     *    ->page_table_lock or pte_lock (anon_vma_prepare and various)
     *
     *  ->page_table_lock or pte_lock
     *    ->swap_lock                   (try_to_unmap_one)
     *    ->private_lock                (try_to_unmap_one)
     *    ->tree_lock                   (try_to_unmap_one)
     *    ->zone.lru_lock               (follow_page->mark_page_accessed)
     *    ->zone.lru_lock               (check_pte_range->isolate_lru_page)
     *    ->private_lock                (page_remove_rmap->set_page_dirty)
     *    ->tree_lock                   (page_remove_rmap->set_page_dirty)
     *    ->inode_lock                  (page_remove_rmap->set_page_dirty)
     *    ->inode_lock                  (zap_pte_range->set_page_dirty)
     *    ->private_lock                (zap_pte_range->__set_page_dirty_buffers)
     *
     *  ->task->proc_lock
     *    ->dcache_lock                 (proc_pid_lookup)
     */

SLIDE 43

Semaphore

  • A counter of allowed concurrent processes

– A mutex is the special case of 1 at a time

  • Plus a wait queue
  • Implemented similarly to a spinlock, except the spin loop is replaced with placing oneself on a wait queue
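
A user-space analogue of a counting semaphore can be sketched with POSIX threads primitives. This is an illustration only: the condition variable stands in for the kernel wait queue, the pthread mutex protects the counter, and `struct semaphore_sketch` and its functions are hypothetical names, not the kernel's semaphore API.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space semaphore: a counter plus a "wait queue". */
struct semaphore_sketch {
    pthread_mutex_t guard;      /* protects count */
    pthread_cond_t waiters;     /* stands in for the kernel wait queue */
    int count;                  /* how many more holders are allowed */
};

void sem_sketch_init(struct semaphore_sketch *s, int count)
{
    pthread_mutex_init(&s->guard, NULL);
    pthread_cond_init(&s->waiters, NULL);
    s->count = count;
}

/* down (acquire): sleep on the wait queue instead of spinning. */
void sem_sketch_down(struct semaphore_sketch *s)
{
    pthread_mutex_lock(&s->guard);
    while (s->count == 0)
        pthread_cond_wait(&s->waiters, &s->guard);
    s->count--;
    pthread_mutex_unlock(&s->guard);
}

/* up (release): the releaser is responsible for waking a waiter. */
void sem_sketch_up(struct semaphore_sketch *s)
{
    pthread_mutex_lock(&s->guard);
    s->count++;
    pthread_cond_signal(&s->waiters);
    pthread_mutex_unlock(&s->guard);
}
```

Initializing the count to 1 gives the mutex special case: at most one holder at a time.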

SLIDE 44

Ordering blocking and spin locks

  • If you are mixing blocking locks with spinlocks, be sure to acquire all blocking locks first and release blocking locks last

– Releasing a semaphore/mutex schedules the next waiter

  • On the same CPU!

– If we hold a spinlock, the waiter may also try to grab this lock
– The waiter may spin trying to get our spinlock and never yield the CPU
– We never get scheduled again, we never release the lock
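
The ordering rule can be sketched in user space as follows. The names `big_mutex`, `small_spin`, `shared_value`, and `update_shared` are hypothetical, a pthread mutex plays the role of the blocking lock, and a C11 `atomic_flag` plays the role of the spinlock.

```c
#include <pthread.h>
#include <stdatomic.h>

pthread_mutex_t big_mutex = PTHREAD_MUTEX_INITIALIZER;  /* blocking lock */
atomic_flag small_spin = ATOMIC_FLAG_INIT;              /* spinlock */
int shared_value;                                       /* hypothetical data */

void update_shared(int v)
{
    /* Blocking lock first: we may sleep here, and we hold no spinlock. */
    pthread_mutex_lock(&big_mutex);

    /* Spinlock second. */
    while (atomic_flag_test_and_set_explicit(&small_spin,
                                             memory_order_acquire))
        ; /* spin */

    shared_value = v;

    /* Spinlock released first ... */
    atomic_flag_clear_explicit(&small_spin, memory_order_release);

    /* ... blocking lock released last, so no spinlock is held if the
     * release hands the CPU to the next waiter. */
    pthread_mutex_unlock(&big_mutex);
}
```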

SLIDE 45

Summary

  • Understand how to implement a spinlock/semaphore/rw-spinlock
  • Understand trade-offs between:

– Spinlocks vs. blocking locks
– Fine vs. coarse locking
– Favoring readers vs. writers

  • Lock ordering issues
