SLIDE 1

Scaling Synchronization Primitives

Ph.D. Defense of Dissertation
Sanidhya Kashyap

SLIDE 2

Rise of the multicore machines

CPU trends

[Figure: CPU frequency (MHz) and number of hardware threads, 1970 to 2020. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten.]

Applications inherently scaled with increasing frequency

Hardware limitation: frequency has stagnated, limiting data-intensive applications' performance

Machines instead have multiple processors (multi-socket)

SLIDE 3

Multicore machines → “The free lunch is over”

Operating systems, cloud services, data-processing systems, databases

Today’s de facto standard: Concurrent applications that scale with increasing cores

Synchronization primitives

Basic building block for designing applications

SLIDE 4

Synchronization primitives

  • Provide some form of consistency required by applications
  • Determine the ordering/scheduling of concurrent events

SLIDE 5

Typical application performance on a manycore machine

[Figure: Messages/second vs. # threads (1 to 192); higher is better. Typical application performance on a manycore machine compared with embarrassingly parallel application performance.]

SLIDE 6

[Figure: Same benchmark at full scale (up to 1200K messages/second): the embarrassingly parallel version keeps scaling while the typical application does not.]

Synchronization required at several places

SLIDE 7

Future hardware will exacerbate the scalability problem

Challenge: Maintain application scalability

[Diagram: from a single core to many-core machines with 100-1000s of CPUs and SSDs; a 10x ~ 1000x increase in parallelism.]

SLIDE 8

How can we minimize the overhead of synchronization primitives for large multicore machines?

Efficiently schedule events by leveraging HW/SW

SLIDE 9

Thesis contributions

  • Scalable locking [SOSP'19]
      Discrepancy between lock design and use
      Approach: Decouple lock design from lock policy via a shuffling mechanism (HW/SW policy design)

  • VM double scheduling [ATC'18]
      Double scheduling in a virtualized environment (the hypervisor schedules VMs while each guest OS schedules its own tasks) introduces various types of preemption problems
      Approach: Expose semantic information across layers

  • Hardware timestamping [EuroSys'18]
      Timestamping is costly on large multicore machines: cache contention due to atomic instructions
      Approach: Use the per-core invariant hardware clock
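As a concrete illustration of the timestamping contribution, here is a minimal sketch, assuming an x86 machine with an invariant TSC; it is illustrative only, not the EuroSys'18 implementation. Reading the per-core hardware clock touches no shared cache line, unlike incrementing a global atomic counter.

    #include <stdint.h>
    #include <x86intrin.h>

    /* Read the invariant TSC on the current core. rdtscp also returns the
     * contents of IA32_TSC_AUX (typically the core/node ID), so the caller
     * knows which core produced the timestamp. */
    static inline uint64_t hw_timestamp(unsigned *core_id)
    {
        return __rdtscp(core_id);
    }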


SLIDE 11

Example: Email service

SLIDE 12

[Figure: Messages/second vs. # threads: performance degrades due to inefficient locks, far below the embarrassingly parallel curve.]

The email workload is process intensive and stresses the memory subsystem, file system, and scheduler.

File system:

    file_create(…) {
        spin_lock(superblock);
        …
        spin_unlock(superblock);
    }

    rename_file(…) {
        write_lock(global_rename_lock);
        …
        write_unlock(global_rename_lock);
    }

Scheduler:

    process_create(…) {
        mm_lock(process->lock);
        …
        mm_unlock(process->lock);
    }

    process_schedule(…) {
        spin_lock(run_queue->lock);
        …
        spin_unlock(run_queue->lock);
    }

SLIDE 13
Synchronization primitive: Locks

  • Provide mutual exclusion among tasks
  • Guard a shared resource
  • Examples: mutex, readers-writer lock, spinlock

A lock protects access to a shared data structure: threads that want to modify the data structure wait for their turn by either spinning or sleeping.
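To make the spinning variant concrete, here is a minimal user-space sketch using POSIX spinlocks; it is illustrative only, and the worker/counter names are made up, not from the talk.

    #include <pthread.h>

    static pthread_spinlock_t lock;
    static long shared_counter;                 /* the shared data structure */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_spin_lock(&lock);           /* wait (spin) for our turn */
            shared_counter++;                   /* critical section */
            pthread_spin_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(&t[i], NULL);
        pthread_spin_destroy(&lock);
        return shared_counter == 4 * 100000 ? 0 : 1;
    }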

SLIDE 14

Locks: MOST WIDELY used primitive

[Figure: Number of lock API() calls in the Linux kernel (x1000), 2002 vs. 2019: roughly a 4X increase.]

More locks are in use to improve OS scalability

SLIDE 15

Locks are used in a complicated manner

A system call can acquire up to 12 locks (average of 4)

SLIDE 16

Issue with current lock designs

[Timeline of lock designs, 1989-2017: Backoff lock, Ticket lock, MCS lock (1989-1991), HBO lock (2003), HCLH lock (2006), FC lock (2011), RCL and Cohort lock (2012), Malthusian lock (2014), HMCS lock (2015), CST lock (2017). The slide marks which locks are used in practice and which give the best performance.]

Specific locks are designed for specific hardware and software requirements

SLIDE 17

Issue with current lock designs

Locks in practice: generic; focus on simple locks for broad applicability; forgo hardware characteristics; throughput worsens with more cores.

Locks in research: design specific locks for hardware and software requirements; hardware-specific designs give high throughput at high thread count, but they use extra memory and suit only pre-allocated data.

HW/SW policies are statically tied together

SLIDE 18

Scalable and practical locking algorithms

Incorporating HW/SW policies dynamically

SLIDE 19

Two trends driving locks’ innovation

Evolving hardware

Application requirements

SLIDE 20

Two dimensions of lock design / goals

1) High throughput
  • In high thread count: minimize lock contention
  • In single thread: no penalty when not contended
  • In oversubscription: avoid bookkeeping overhead

2) Minimal lock size
  • Small memory footprint that scales to millions of locks

SLIDE 21

Goal: high throughput in high thread count
  • Application: multi-threaded to utilize cores and improve performance
  • Lock: minimize lock contention while maintaining high throughput

SLIDE 22

Goal: high throughput in single thread
  • Application: a single thread performs an operation; fine-grained locking
  • Lock: minimal or almost no lock/unlock overhead

SLIDE 23

Goal: high throughput in oversubscription
  • Application: more threads than cores; a common scenario, e.g., threads waiting on I/O
  • Lock: minimize scheduler overhead while waking or parking threads

SLIDE 24

Goal: minimal lock size
  • Application: locks are embedded in data structures, e.g., file inodes, so the footprint must scale to millions of locks
  • Lock: an oversized lock can stress the memory allocator or data-structure alignment

SLIDE 25

Locks performance: Throughput

Setup: 192-core/8-socket machine
Benchmark: Each thread creates a file, a serial operation, in a shared directory [1]

[Figure: Operations/second vs. # threads for the stock lock, split into 1-socket, >1-socket, and oversubscribed regions.]

  • Throughput collapses after one socket, due to non-uniform memory access (NUMA)

1. Understanding Manycore Scalability of File Systems [ATC’16]

SLIDE 26

  • Throughput collapses after one socket, due to NUMA
  • NUMA also affects the oversubscribed case

Goal: prevent the throughput collapse after one socket

SLIDE 27
Existing research efforts: Hierarchical locks

  • Goal: high throughput at high thread count
  • Making locks NUMA-aware: use extra memory to improve throughput
      Two-level locks: per-socket locks plus a global lock
  • Avoid NUMA overhead → pass the global lock within the same socket

[Diagram: Socket-1 and Socket-2, each with its own socket lock, contending for the global lock.]
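To illustrate the idea, here is a toy two-level spinlock sketch, not the Cohort or CST algorithms from the literature: the first thread of a socket takes the global lock, and the release path hands the lock to a same-socket waiter when one exists. The socket count, cores-per-socket, and all names are assumptions for illustration, and the fairness bound that real hierarchical locks place on in-socket passing is omitted.

    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <sched.h>

    #define NSOCKETS          8      /* assumption: matches an 8-socket machine */
    #define CORES_PER_SOCKET 24

    struct hier_lock {
        atomic_int global;                     /* 0 = free, 1 = held by some socket */
        struct {
            atomic_int local;                  /* per-socket lock: 0 = free, 1 = held */
            atomic_int waiters;                /* same-socket threads waiting on 'local' */
            int        global_owned;           /* this socket currently holds 'global' */
            char       pad[64 - 3 * sizeof(int)];   /* keep each socket on its own cache line */
        } s[NSOCKETS];
    };

    static int my_socket(void) { return sched_getcpu() / CORES_PER_SOCKET; }

    void hier_lock_acquire(struct hier_lock *l)
    {
        int sk = my_socket(), zero = 0;
        atomic_fetch_add(&l->s[sk].waiters, 1);
        while (!atomic_compare_exchange_weak(&l->s[sk].local, &zero, 1))
            zero = 0;                          /* spin on the per-socket lock */
        atomic_fetch_sub(&l->s[sk].waiters, 1);
        if (!l->s[sk].global_owned) {          /* first thread of this socket takes the global lock */
            zero = 0;
            while (!atomic_compare_exchange_weak(&l->global, &zero, 1))
                zero = 0;
            l->s[sk].global_owned = 1;
        }
    }

    void hier_lock_release(struct hier_lock *l)
    {
        int sk = my_socket();                  /* assumes no migration between acquire and release */
        if (atomic_load(&l->s[sk].waiters) > 0) {
            atomic_store(&l->s[sk].local, 0);  /* pass within the socket: keep the global lock */
        } else {
            l->s[sk].global_owned = 0;         /* no local waiter: release both levels */
            atomic_store(&l->global, 0);
            atomic_store(&l->s[sk].local, 0);
        }
    }

Note how this uses extra pre-sized per-socket memory, which is exactly the footprint problem the next slides quantify.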

SLIDE 28
Existing research efforts: Hierarchical locks (continued)

  • Problems:
      Require extra memory allocation
      Do not care about single-thread throughput
  • Example: CST [2]
      Allocates the socket structure on first access
      Handles oversubscription (# threads > # CPUs)

2. Scalable NUMA-aware Blocking Synchronization Primitives [ATC'17]

SLIDE 29

Locks performance: Throughput (Stock vs. CST)

Setup: 192-core/8-socket machine
Benchmark: Each thread creates a file, a serial operation, in a shared directory

[Figure: Operations/second vs. # threads (1 socket, >1 socket, oversubscribed) for Stock and CST.]

  • CST maintains throughput beyond one socket (high thread count) and in the oversubscribed case (384 threads)
  • But single-thread throughput is poor: multiple atomic instructions

SLIDE 30

Non-contended case → single-thread performance matters

SLIDE 31
Locks performance: Memory footprint

Benchmark: Each thread creates a file, a serial operation, in a shared directory

  • CST has a large memory footprint
      Allocates socket structures and the global lock
      Worst case: ~1 GB of lock footprint out of the application's 32 GB of memory

[Figure: Locks' memory footprint vs. # threads: Stock (18) vs. CST (140).]

SLIDE 32
  • A hierarchical lock with all per-socket locks pre-allocated is far larger still (820 on the same scale)

A lock's memory footprint affects its adoption

SLIDE 33

Two goals in our new lock design:
  1) NUMA-aware lock without extra memory
  2) High throughput at both low and high thread counts

SLIDE 34

Observations:
  → Hierarchical locks avoid NUMA overhead by passing the lock within a socket
  → Queue-based locks already maintain a list of waiters

Key idea: Sort waiters on the fly

SLIDE 35

Sort waiters on the fly using socket ID

[Diagram: a waiting queue with head t1; each waiter's qnode records its socket ID (e.g., socket 0); the shuffler and the queue tail are marked.]

SLIDE 36

Another waiter joins from a different socket

[Diagram: queue t1 → t2, with t1 and t2 on different sockets (socket 0 and socket 3).]

SLIDE 37

More waiters join

[Diagram: queue t1 → t2 → t3 → t4, alternating between the two sockets.]

SLIDE 38

The shuffler (t1) sorts the queue based on socket ID

[Diagram: t1 reorders t1 → t2 → t3 → t4 so that the waiter from its own socket is moved next to it.]

SLIDE 39

Shuffling: Design methodology

A waiter (the shuffler) reorders the queue of waiters

  • A waiter that would otherwise just spin (i.e., waste CPU) amortizes the cost of lock operations:
      1) By reordering the queue (e.g., the lock-acquisition order)
      2) By modifying waiters' states (e.g., waking up / sleeping)

→ The shuffler computes NUMA-ness on the fly without using any additional memory
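The reordering policy itself is simple. The following is a single-threaded sketch of it, an assumption made only for illustration: the real shuffler runs concurrently with enqueues and obeys the invariants listed on a later slide. Waiters that share the shuffler's socket are spliced in right behind the existing same-socket group.

    struct waiter {
        struct waiter *next;
        int socket_id;
    };

    /* 'shuffler' is the current head of the waiting queue; returns the head. */
    struct waiter *shuffle_by_socket(struct waiter *shuffler)
    {
        if (!shuffler)
            return NULL;
        struct waiter *group_end = shuffler;            /* last same-socket waiter so far */
        struct waiter *prev = shuffler, *cur = shuffler->next;
        while (cur) {
            struct waiter *next = cur->next;
            if (cur->socket_id != shuffler->socket_id) {
                prev = cur;                             /* leave other-socket waiters in place */
            } else if (prev == group_end) {
                group_end = cur;                        /* already adjacent to the group */
                prev = cur;
            } else {
                prev->next = next;                      /* unlink cur ... */
                cur->next = group_end->next;            /* ... and splice it behind the group */
                group_end->next = cur;
                group_end = cur;
            }
            cur = next;
        }
        return shuffler;
    }

With the queue from the walkthrough, where t1 and t3 share a socket while t2 and t4 sit on the other, this produces t1 → t3 → t2 → t4, the grouping shown on the following slides.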

SLIDE 40

Incorporate shuffling in lock design: shuffling is generic!

A shuffler can modify the queue or a waiter's state with a defined function/policy:
  • Blocking lock: wake up a nearby sleeping waiter
  • Readers-writer lock: group writers together

SLIDE 41

SHFLLOCKS

Minimal footprint locks that handle any thread contention

SLIDE 42

SHFLLOCKS

Lock word: TAS (4B, test-and-set lock) + queue tail (8B, waiters list)

  • Decouples the lock holder and the waiters
      The lock holder holds the TAS lock
      Waiters join the queue

lock(): Try acquiring the TAS lock first; join the queue on failure
unlock(): Release the TAS lock (reset the TAS word to 0)
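A minimal sketch of this fast-path/slow-path split, assuming C11 atomics; shuffling, parking, and NUMA grouping are omitted, and the structure and helper names are illustrative rather than the paper's code. The TAS word is the lock itself, and the queue only orders waiters; the queue head spins on the TAS word.

    #include <stdatomic.h>

    struct qnode {
        _Atomic(struct qnode *) next;
        atomic_int at_head;                      /* set when we become the queue head */
        int socket_id;                           /* used by the (omitted) shuffler */
    };

    struct shfllock {
        atomic_uint tas;                         /* 4-byte test-and-set word */
        _Atomic(struct qnode *) tail;            /* 8-byte tail of the waiters queue */
    };

    void shfl_lock(struct shfllock *l, struct qnode *me)
    {
        unsigned free_ = 0;
        if (atomic_compare_exchange_strong(&l->tas, &free_, 1))
            return;                              /* fast path: uncontended acquire */

        /* Slow path: join the waiters queue (MCS-style enqueue). */
        atomic_store(&me->next, NULL);
        atomic_store(&me->at_head, 0);
        struct qnode *prev = atomic_exchange(&l->tail, me);
        if (prev) {
            atomic_store(&prev->next, me);
            while (!atomic_load(&me->at_head))
                ;                                /* spin locally until we reach the head */
        }
        free_ = 0;                               /* the queue head spins on the TAS word */
        while (!atomic_compare_exchange_weak(&l->tas, &free_, 1))
            free_ = 0;

        /* Leave the queue: pass the head role to our successor, if any. */
        struct qnode *next = atomic_load(&me->next);
        if (!next) {
            struct qnode *expect = me;
            if (atomic_compare_exchange_strong(&l->tail, &expect, NULL))
                return;
            while (!(next = atomic_load(&me->next)))
                ;                                /* a successor is mid-enqueue; wait for it */
        }
        atomic_store(&next->at_head, 1);
    }

    void shfl_unlock(struct shfllock *l)
    {
        atomic_store(&l->tas, 0);                /* just reset the TAS word */
    }

Because new arrivals can still grab the TAS word on the fast path, the queue head may retry; the full design inserts the shuffling step while waiters spin in the queue and bounds such stealing for fairness.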

SLIDE 43

SHFLLOCKS (continued)

  • The TAS word maintains single-thread performance
  • Waiters use shuffling to improve application throughput
      NUMA-awareness and an efficient wake-up strategy
      Utilizes idle, CPU-wasting waiters
      Shuffling stays off the critical path most of the time
  • Maintains long-term fairness by bounding the number of shuffling rounds

SLIDE 44

SHFLLOCKS: Family of lock algorithms

  • NUMA-aware spinlock
  • NUMA-aware blocking lock
  • NUMA-aware writer-preferred readers-writer lock

SLIDE 45

NUMA-aware SHFLLOCK in action

t0 (socket 0) calls lock(): the TAS word goes from unlocked to locked; the waiting queue is empty.

SLIDE 46

t0 holds the TAS lock; no waiters have queued yet.

SLIDE 47

Multiple threads join the queue: t0 still holds the lock while t1, t2, t3, and t4 wait in the queue.

SLIDE 48

t1 starts the shuffling process.

SLIDE 49

t1 groups t3 (same socket) next to itself: the queue becomes t1 → t3 → t2 → t4.

SLIDE 50

t3 now becomes the shuffler.

SLIDE 51

t3 continues as the shuffler while t0 holds the lock.

SLIDE 52

t0 calls unlock(), resetting the TAS word to unlocked.

SLIDE 53

t0 has left; the queued waiters are t1 → t3 → t2 → t4.

SLIDE 54

t1 acquires the lock via CAS on the TAS word.

SLIDE 55

t1 notifies t3 that it is the new queue head.

SLIDE 56

t1 holds the lock; the remaining waiters are t3 → t2 → t4.

SLIDE 57

Shuffling invariants for correctness

  → Only one waiter is a shuffler at a time
  → Shuffling starts from the head of the queue
  → A shuffler can pass the shuffling role to any of its successors
  → After passing the shuffling role, the waiter only spins

SLIDE 58

SHFLLOCKS: Family of lock algorithms

  • NUMA-aware spinlock
  • NUMA-aware blocking lock
  • NUMA-aware writer-preferred readers-writer lock

SLIDE 59

NUMA-aware blocking SHFLLOCK

Lock word: TAS (4B, test-and-set lock) + queue tail (8B, waiters list)

  • Extension of the NUMA-aware spinlock
  • Handles core oversubscription
  • The shuffler wakes up sleeping waiters in addition to reordering
  • No extra data structure is required for parking

SLIDE 60
Single parking list

  • Spinning and parked waiters live in one list (prior locks maintain a separate parking list)
  • The shuffler ensures:
      NUMA-awareness by reordering the queue
      Shuffled waiters are always spinning

★ The lock size remains intact
★ The shuffler ensures NUMA-awareness in both the under- and over-subscribed cases
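A minimal sketch, assuming Linux futexes, of how a waiter can park on its own queue node and how the shuffler can wake it; this is illustrative only, the names are made up, and the real lock folds this into the queue-node state machine rather than exposing separate functions.

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdatomic.h>

    enum { SPINNING = 0, PARKED = 1 };

    struct qnode { atomic_int state; /* next pointer, socket ID, etc. omitted */ };

    static void futex_wait(atomic_int *addr, int expected)
    {
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
    }

    static void futex_wake_one(atomic_int *addr)
    {
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    }

    void waiter_park(struct qnode *me)           /* called when the CPU is oversubscribed */
    {
        atomic_store(&me->state, PARKED);
        while (atomic_load(&me->state) == PARKED)
            futex_wait(&me->state, PARKED);      /* sleep until the shuffler wakes us */
    }

    void shuffler_wake(struct qnode *w)          /* called while reordering the queue */
    {
        if (atomic_exchange(&w->state, SPINNING) == PARKED)
            futex_wake_one(&w->state);           /* shuffled waiters go back to spinning */
    }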

SLIDE 61
Minimal scheduler intervention

  • Always pass the lock to a spinning waiter
      The shuffler ensures this by waking up shuffled waiters
      Waiters can also steal the global TAS lock
  • Waiters only park if more than one task is running on their CPU (system load)

★ The scheduler is mostly off the critical path
★ Guarantees forward progress of the system

SLIDE 62

NUMA-aware blocking SHFLLOCK in action

t0 (socket 0) holds the lock; t1, t2, t3, and t4 join the queue and spin.

SLIDE 63

Threads go to sleep: some of the queued waiters park while t0 still holds the lock.

SLIDE 64

t1 starts the shuffling process.

SLIDE 65

t1 groups t3 (same socket) next to itself.

SLIDE 66

t1 wakes up t3, which was parked.

SLIDE 67

t3 is awake again and spins; the other parked waiters remain asleep.

SLIDE 68

t3 now becomes the shuffler.

SLIDE 69

t3, as the shuffler, continues reordering and waking nearby waiters.

SLIDE 70

Current blocking locks

    Lock         NUMA-aware   Lock size    Extra memory   Lock API    Handles contention (low/medium/high/oversubscribed)
    mutex        No           40 B         -              -           Bad
    Malthusian   No           24 B         O(N*T)         Modified    Good*
    CST          Yes          32 + 512 B   -              -           Good
    SHFLLOCK     Yes          12 B         -              -           Good

  • * Extension of non-blocking locks
  • T → number of threads; N → number of locks
  • Modified lock API → lock/unlock(L, ctx)

SLIDE 71

SHFLLOCKS: Family of lock algorithms

  • NUMA-aware spinlock
  • NUMA-aware blocking lock
  • NUMA-aware writer-preferred readers-writer lock

SLIDE 72

Readers-writer SHFLLOCK

Lock word: TAS (4B, test-and-set lock) + queue tail (8B, waiters list) + count (4/8B, readers indicator)

  • Extension of the blocking lock
  • A centralized counter encodes readers and the writer
  • Waiting readers and writers are enqueued in one waiting list
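A minimal sketch of how such a centralized count word can encode readers and a writer, assuming the high bit marks the writer; only illustrative trylock paths are shown, and the queue, shuffling, and blocking slow paths are omitted.

    #include <stdatomic.h>

    #define WRITER_BIT 0x80000000u

    struct shfl_rwlock {
        atomic_uint count;           /* writer bit + number of active readers */
        /* TAS word and queue tail of the underlying blocking lock omitted */
    };

    int read_trylock(struct shfl_rwlock *l)
    {
        unsigned old = atomic_fetch_add(&l->count, 1);   /* optimistically add a reader */
        if (old & WRITER_BIT) {
            atomic_fetch_sub(&l->count, 1);              /* writer present: back off (real lock enqueues) */
            return 0;
        }
        return 1;
    }

    void read_unlock(struct shfl_rwlock *l)
    {
        atomic_fetch_sub(&l->count, 1);
    }

    int write_trylock(struct shfl_rwlock *l)
    {
        unsigned expected = 0;                           /* succeeds only with no readers and no writer */
        return atomic_compare_exchange_strong(&l->count, &expected, WRITER_BIT);
    }

    void write_unlock(struct shfl_rwlock *l)
    {
        atomic_fetch_and(&l->count, ~WRITER_BIT);
    }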

SLIDE 73
Analysis of SHFLLOCK

Setup: 192-core/8-socket machine

  • SHFLLOCK performance:
      Does shuffling maintain the application's throughput?
      What is the overall memory footprint?

SLIDE 74

Locks performance: Throughput

Benchmark: Each thread creates a file, a serial operation, in a shared directory

[Figure: Operations/second vs. # threads (1 socket, >1 socket, oversubscribed) for Stock, CST, and SHFLLOCK.]

  • SHFLLOCKS maintain performance:

SLIDE 75

  • Beyond one socket: NUMA-aware shuffling

SLIDE 76

  • Under core oversubscription: NUMA-aware shuffling plus wake-up shuffling

SLIDE 77

  • In single thread: the uncontended path is just a TAS acquire and release

SLIDE 78

Locks performance: Memory footprint

Benchmark: Each thread creates a file, a serial operation, in a shared directory

[Figure: Locks' memory footprint vs. # threads: Stock (18), CST (140), SHFLLOCK (11).]

  • SHFLLOCK has the smallest memory footprint
      Reason: no extra auxiliary data structures
      Stock: parking-list structure + an extra lock
      CST: per-socket structures

SLIDE 79

RWLocks performance: Throughput

Benchmark: Each thread enumerates files in a shared directory (read-only workload)

[Figure: Operations/second vs. # threads (1 socket, >1 socket, oversubscribed) for Stock, CST, and SHFLLOCK.]

  • SHFLLOCK is better than Stock: few atomic instructions on the critical path

SLIDE 80

  • CST is better than SHFLLOCK: CST uses per-socket counters rather than a centralized counter → minimizes coherence traffic
slide-81
SLIDE 81

SHFLLOCK improves Exim's performance

Exim is process intensive and stresses the memory subsystem, file system, and scheduler.

[Figure, left: Throughput (messages/second) vs. # threads for Stock and SHFLLOCK.]
[Figure, right: Lock memory footprint at 192 threads (0-25 GB) for Stock, CST, and SHFLLOCK.]

SLIDE 82
Discussion

  • Related designs where the lock holder splits the queue:
      NUMA-awareness: Compact NUMA-aware lock (CNA)
      Blocking lock: Malthusian lock
  • Shuffling can support other policies:
      Non-inclusive caches (Skylake architecture)
      Multi-level NUMA hierarchies (SGI machines)
      Priority inheritance or boosting

SLIDE 83
SHFLLOCK: Conclusion

  • Current lock designs:
      Do not maintain the best throughput across varying thread counts
      Have a high memory footprint
  • Shuffling: dynamically reorder the waiter list or modify a waiter's state
      Enables NUMA-awareness and waking up waiters
  • SHFLLOCKS: a shuffling-based family of lock algorithms
      Best throughput with no extra memory overhead
      Utilize otherwise-wasted spinning waiters to amortize lock operations

SLIDE 84

Acknowledgement

Taesoo Kim, Changwoo Min, Irina Calciu, Jaeho Kim, Byoungyoun Lee, Xiaohe Cheng, Seulbae Kim, Meng Xu, Junyeon Yoon, Wen Xu, Virendra Marathe, Dave Dice, Rohan Kadekodi, Yihe Huang, Yizhou Shan, Ajit Mathew, Madhava Krishnan, Woonhak Kang, Steffen Maass, Margo Seltzer, Alex Kogan, Pavel Emelyanov, Tushar Krishna, Vijay Chidambaram, Kangnyeon Kim, Jayashree Mohan, Mohan Kumar, Se Kwon Lee, Steve Byan, Jean Pierre Lozi

SLIDE 85

Conclusion

  • Designing scalable synchronization mechanisms is critical
  • This thesis contributes:
      Lock algorithms that decouple lock design from hardware and software policy
      A constant ordering primitive that scales to 100s-1000s of CPUs
      Semantic information added to task schedulers to minimize double scheduling
  • Together, these minimize the scheduling overhead of concurrent events by leveraging both hardware and software efficiently

Thank you!