ShflLocks: Scalable and Practical Locking for Manycore Systems



SLIDE 1

ShflLocks: Scalable and Practical Locking for Manycore Systems

Changwoo Min COSMOSS Lab / ECE / Virginia Tech https://cosmoss-vt.github.io/

SLIDE 2

File system becomes a bottleneck on manycore systems

[Figure: Exim mail server on RAMDISK; messages/sec vs. #core for btrfs, ext4, F2FS, XFS]

Embarrassingly parallel application! Yet file system throughput either:

  • 1. Saturates
  • 2. Collapses
  • 3. Never scales
SLIDE 3

Even on slower storage media, the file system becomes a bottleneck

[Figure: Exim mail server at 80 cores; messages/sec on RAMDISK, SSD, and HDD for btrfs, ext4, F2FS, XFS]

SLIDE 4

FxMark: File systems are not scalable on manycore systems

[Figure: FxMark results; M ops/sec vs. #core for the microbenchmarks DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM plus four O_DIRECT variants, Exim (messages/sec), RocksDB (ops/sec), and DBENCH (GB/sec). Legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]

Create files in a shared directory: locks are critical to performance and scalability

SLIDE 5

Future hardware further exacerbates the problem

[Figure: Exim mail server on RAMDISK; messages/sec vs. #core for btrfs, ext4, F2FS, XFS]

Embarrassingly parallel application! Yet file system throughput either:

  • 1. Saturates
  • 2. Collapses
  • 3. Never scales
SLIDE 6

Why does this happen? Memory access is NOT scalable

  • 1. Read operations are scalable

[Figure: throughput vs. #core for private reads and shared reads]

SLIDE 7

Why does this happen? Memory access is NOT scalable

  • 1. Read operations are scalable
  • 2. Write operations are NOT scalable

[Figure: throughput vs. #core for private reads, shared reads, private writes, shared writes]

SLIDE 8

Why does this happen? Memory access is NOT scalable

  • 1. Read operations are scalable
  • 2. Write operations are NOT scalable
  • 3. Write operations interfere with read operations (see the counter sketch below)

[Figure: throughput vs. #core for private reads, shared reads, private writes, shared writes; the contended case writes a shared lock variable (flag) and the shared data protected by the lock]
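To make points 2 and 3 concrete, here is a minimal C sketch (our illustration, not from the talk; the thread count, iteration count, and names are arbitrary). Each thread increments either a single shared counter, whose cache line bounces between cores on every write, or its own cache-line-padded private counter, which stays core-local:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    (1 << 22)

_Atomic long shared_counter;   /* one cache line contended by all writers */

/* pad each private counter to its own cache line to avoid false sharing */
struct padded { _Atomic long v; char pad[64 - sizeof(_Atomic long)]; };
static struct padded private_counter[NTHREADS];

static void *write_shared(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&shared_counter, 1);   /* coherence miss per write */
    return NULL;
}

static void *write_private(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&private_counter[id].v, 1);  /* line stays local */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, write_shared, NULL);
    for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, write_private, (void *)i);
    for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
    printf("shared=%ld private[0]=%ld\n", shared_counter, private_counter[0].v);
    return 0;
}

Timing the two phases separately (e.g., with clock_gettime) shows the private phase scaling with the core count while the shared phase flattens or collapses.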

SLIDE 9

Why does this happen? Cache coherence is not scalable

  • Cache-coherence traffic dominates!
  • Writing a cache line in the popular MESI protocol:

The writer's cache line: Shared → Exclusive

All readers' cache lines: Shared → Invalid

We should minimize contended cache lines and core-to-core communication traffic (the TTAS sketch below applies this principle to a spinlock).

[Diagram: two sockets, each with its own LLC and memory, exchanging coherence traffic over the interconnect]
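As a concrete application of that principle, here is a hedged test-and-test-and-set (TTAS) spinlock sketch (ours, not the talk's): waiters spin on plain loads, so the lock's cache line stays in the Shared state in every waiter's cache while the lock is held; only the occasional test-and-set forces the Shared → Exclusive upgrade that invalidates the other copies.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock_t;

static void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        /* read-only spin: no coherence traffic while the lock stays held */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* one read-modify-write per attempt: invalidates readers' copies */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

static void ttas_unlock(ttas_lock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}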

SLIDE 10

Linux kernel lock adoption / modification

Dekker's algorithm (1962), Semaphore (1965), Lamport's bakery algorithm (1974), Backoff lock (1989), Ticket lock (1991), MCS lock (1991), HBO lock (2003), Hierarchical lock (HCLH, 2006), Flat-combining NUMA lock (2011), Remote Core Locking (2012), Cohort lock (2012), RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016)

(HBO onward: NUMA-aware locks)

Linux kernel lock evolution:

1990s: Spinlock: TTAS | Semaphore: TTAS + block | Rwsem: TTAS + block
2011: Spinlock: ticket (2.6) | Mutex: TTAS + block (2.6) | Rwsem: TTAS + block
2014: Spinlock: ticket | Mutex: TTAS + spin + block (3.16) | Rwsem: TTAS + spin + block (3.16)
2016: Spinlock: qspinlock (4.4) | Mutex: TTAS + spin + block | Rwsem: TTAS + spin + block

Decades of research effort on locks, yet slow uptake in practice: adopting new locks is necessary, but it is not easy.

SLIDE 11

Two dimensions of lock design/goals

1) High throughput

  • In high thread count: minimize lock contention
  • In single thread: no penalty when not contended
  • In oversubscription: avoid bookkeeping overheads

2) Minimal lock size

  • Memory footprint must scale to millions of lock instances (e.g., one per file inode)

SLIDE 12

Locks' performance: Throughput

[Figure: operations/sec vs. #threads for the stock lock; regions: 1 socket, >1 socket, oversubscribed]

  • Performance crashes after 1 socket.

Due to non-uniform memory access (NUMA): accessing local-socket memory is faster than accessing remote-socket memory.

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Diagram: two sockets, each with its own LLC and memory]

SLIDE 13

Locks' performance: Throughput

[Figure: operations/sec vs. #threads for the stock lock; regions: 1 socket, >1 socket, oversubscribed]

  • Performance crashes after 1 socket.

Due to non-uniform memory access (NUMA): accessing local-socket memory is faster than accessing remote-socket memory.

  • NUMA also affects oversubscription.

(e.g., each thread creates a file, a serial operation, in a shared directory)

Goal: prevent the throughput crash after one socket

SLIDE 14

Existing research efforts

  • Making locks NUMA-aware:

○ Two-level locks: per-socket and global
○ Generally hierarchical

  • Problems:

○ Require extra memory allocation
○ Do not care about single-thread throughput

  • Example: CST¹ (a minimal two-level sketch appears below)

[Diagram: Socket-1 and Socket-2, each with its own socket lock, beneath one global lock]

  1. Scalable NUMA-aware Blocking Synchronization Primitives. ATC 2017.
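For reference, the two-level pattern these hierarchical locks share can be sketched as follows (a hedged illustration of the general shape, not CST's actual code; spin_t stands in for any underlying lock). Note the single-thread cost: even an uncontended acquisition performs two atomic operations, one per level.

#include <stdatomic.h>

#define MAX_SOCKETS 8

typedef struct { atomic_flag held; } spin_t;   /* init with ATOMIC_FLAG_INIT */

static void spin_lock(spin_t *l)   { while (atomic_flag_test_and_set(&l->held)) ; }
static void spin_unlock(spin_t *l) { atomic_flag_clear(&l->held); }

typedef struct {
    spin_t global;                   /* level 2: contended across sockets    */
    spin_t per_socket[MAX_SOCKETS];  /* level 1: contended within one socket */
} hier_lock_t;

static void hier_lock(hier_lock_t *l, int socket_id) {
    spin_lock(&l->per_socket[socket_id]);  /* win locally first...           */
    spin_lock(&l->global);                 /* ...then compete across sockets */
}

static void hier_unlock(hier_lock_t *l, int socket_id) {
    spin_unlock(&l->global);
    spin_unlock(&l->per_socket[socket_id]);
}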
SLIDE 15

Locks' performance: Throughput

[Figure: operations/sec vs. #threads for Stock and CST; regions: 1 socket, >1 socket, oversubscribed]

  • CST maintains throughput:

Beyond one socket (high thread count), and in the oversubscribed case (384 threads).

  • Poor single-thread throughput:

Multiple atomic instructions per acquisition.

(e.g., each thread creates a file, a serial operation, in a shared directory)

Setup: 8-socket, 192-core machine

Single-thread performance matters in the non-contended case

SLIDE 16

Locks' performance: Memory footprint

  • CST has a large memory footprint.

[Figure: locks' memory footprint vs. #threads; Stock: 18, CST: 140]

(e.g., each thread creates a file, a serial operation, in a shared directory)

CST allocates a per-socket structure plus a global lock. Worst case: ~1 GB of lock footprint out of the application's 32 GB of memory.

SLIDE 17

Locks' performance: Memory footprint

  • CST has a large memory footprint.

[Figure: locks' memory footprint vs. #threads; Stock: 18, CST: 140, hierarchical lock: 820]

(e.g., each thread creates a file, a serial operation, in a shared directory)

CST allocates a per-socket structure plus a global lock. Worst case: ~1 GB of lock footprint out of the application's 32 GB of memory.

SLIDE 18

Locks' performance: Memory footprint

  • CST has a large memory footprint.

[Figure: locks' memory footprint vs. #threads; Stock: 18, CST: 140, hierarchical lock: 820]

(e.g., each thread creates a file, a serial operation, in a shared directory)

CST allocates a per-socket structure plus a global lock. Worst case: ~1 GB of lock footprint out of the application's 32 GB of memory.

A lock's memory footprint affects its adoption

SLIDE 19


Two goals in our new lock

1) A NUMA-aware lock with no memory overhead
2) High throughput at both low and high thread counts

SLIDE 20

Key idea: Sort waiters on the fly

Observations:

  • Hierarchical locks mitigate NUMA effects by passing the lock within a socket
  • Queue-based locks already maintain a set of waiters

SLIDE 21

Shuffling: Design methodology

Representing the waiting queue:

[Diagram: a single waiter t1 in the queue; each waiter's qnode records its socket ID (e.g., socket 0), and the lock keeps a shuffler field and a queue tail]

SLIDE 22

Shuffling: Design methodology

Another waiter arrives from a different socket.

[Diagram: queue t1 → t2; t2's qnode records a different socket ID]

SLIDE 23

Shuffling: Design methodology

More waiters join.

[Diagram: queue t1 → t2 → t3 → t4, with mixed socket IDs]

SLIDE 24

Shuffling: Design methodology

The shuffler (t1) sorts the queue based on socket ID.

[Diagram: same-socket waiters are grouped together in the queue]

SLIDE 25

Shuffling: Design methodology

A waiter (the shuffler) reorders the queue of waiters.

  • A waiter that would otherwise spin (i.e., waste cycles) amortizes the cost of lock operations:

1) By reordering the queue (e.g., the lock-acquisition order)
2) By modifying waiters' states (e.g., waking up / putting to sleep)

→ The shuffler computes NUMA-ness on the fly, without using extra memory, unlike other approaches (a minimal sketch of a shuffling pass follows)

[Diagram: the shuffler reordering the queue t1 → t2 → t3 → t4]
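Here is a hedged sketch of what one shuffling pass might look like (our illustration of the idea, not the paper's code; concurrency is elided, and a real shuffler must also synchronize with new arrivals and fix up the lock's tail pointer when it moves the last node). The shuffler walks the queue and regroups waiters that share its socket directly behind the current same-socket run:

#include <stddef.h>

struct qnode {
    int           socket_id;   /* NUMA socket of the waiting thread */
    struct qnode *next;        /* successor in the waiting queue */
};

static void shuffle(struct qnode *shuffler)
{
    struct qnode *group_tail = shuffler;   /* end of the same-socket run */
    struct qnode *prev = shuffler;
    struct qnode *cur  = shuffler->next;

    while (cur != NULL) {
        struct qnode *next = cur->next;

        if (cur->socket_id == shuffler->socket_id) {
            if (cur == group_tail->next) {
                group_tail = cur;          /* already in place: extend run */
                prev = cur;
            } else {                       /* unlink cur, relink after run */
                prev->next       = cur->next;
                cur->next        = group_tail->next;
                group_tail->next = cur;
                group_tail       = cur;    /* prev unchanged: cur was removed */
            }
        } else {
            prev = cur;
        }
        cur = next;
    }
}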

SLIDE 26

A shuffler can modify the queue or a waiter's state according to a defined function/policy: shuffling is generic!

  • Blocking lock: wake up a nearby sleeping waiter
  • RWlock: group writers together

Incorporate shuffling in lock design (a policy-callback sketch follows)

SLIDE 27

SHFLLOCKS

Minimal-footprint locks that handle any level of thread contention

SLIDE 28

SHFLLOCKS

Lock layout: TAS word (4B, test-and-set lock) + queue tail (8B, waiters list)

  • Decouples the lock holder and the waiters

○ The lock holder holds the TAS lock
○ Waiters join the queue

lock(): try acquiring the TAS lock first; join the queue on failure
unlock(): release the TAS lock (reset the TAS word to 0)

(A sketch of this layout and its fast paths follows.)
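A hedged sketch of this 12-byte layout and its fast paths (our reading of the slide, not the authors' code; enqueue_and_wait is the contended slow path, sketched after the walkthrough below):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct qnode;                          /* per-waiter node, as sketched earlier */

struct shfllock {
    _Atomic uint32_t        locked;    /* 4B TAS word: 0 = free, 1 = held */
    _Atomic(struct qnode *) tail;      /* 8B tail of the waiters list */
};

void enqueue_and_wait(struct shfllock *l, struct qnode *me);  /* slow path */

static bool shfl_trylock(struct shfllock *l) {
    uint32_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

static void shfl_lock(struct shfllock *l, struct qnode *me) {
    if (shfl_trylock(l))
        return;                  /* single-thread fast path: one atomic op */
    enqueue_and_wait(l, me);     /* contended path: join the waiters queue */
}

static void shfl_unlock(struct shfllock *l) {
    /* reset the TAS word; the spinning queue head acquires it next */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}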

SLIDE 29

SHFLLOCKS

Lock layout: TAS word (4B, test-and-set lock) + queue tail (8B, waiters list)

The TAS word maintains single-thread performance.

  • Waiters use shuffling to improve application throughput

○ NUMA-awareness, an efficient wakeup strategy
○ Utilizing idle, CPU-wasting waiters

  • Maintains long-term fairness:

○ Bounds the number of shuffling rounds

★ Shuffling is off the critical path most of the time

SLIDE 30

NUMA-aware SHFLLOCK in action

t0 (socket 1): lock()

[Diagram: the TAS word flips from unlocked to locked; t0 holds the lock; the queue is empty]

SLIDE 31

NUMA-aware SHFLLOCK in action

Multiple threads join the queue.

[Diagram: t0 holds the lock; waiters t1 → t2 → t3 → t4 are queued]

SLIDE 32

NUMA-aware SHFLLOCK in action

Shuffling in progress: t1 starts the shuffling process.

[Diagram: t0 still holds the lock; t1 shuffles the queue t1 → t2 → t3 → t4]

SLIDE 33

NUMA-aware SHFLLOCK in action

Shuffling in progress: t3 now becomes the shuffler.

[Diagram: the queue is reordered to t1 → t3 → t2 → t4]

SLIDE 34

NUMA-aware SHFLLOCK in action

t0: unlock(). t1 acquires the lock via CAS.

[Diagram: t0 releases the TAS word; queue head t1 takes it; queue t1 → t3 → t2 → t4]

SLIDE 35

NUMA-aware SHFLLOCK in action

t1 notifies t3 that it is the new queue head.

[Diagram: t1 holds the lock; the remaining queue is t3 → t2 → t4]

SLIDE 36

NUMA-aware SHFLLOCK in action

t0: unlock()

[Diagram: the handoff is complete; t1 holds the lock and the remaining queue t3 → t2 → t4 waits in socket order]

(The waiter slow path that this walkthrough animates is sketched below.)

SLIDE 37

Other SHFLLOCKS: Blocking SHFLLOCK

  • NUMA-aware blocking lock.
  • Wakes up shuffled waiters based on their socket ID.

○ Keeps the wakeup latency off the critical path.

  • The lock is always passed to a spinning waiter.

○ Lock stealing: avoids the lock-waiter preemption problem.
○ Shuffled waiters are already spinning.

  • Guarantees forward progress of the system. (A sketch of the wake-ahead policy follows.)
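A hedged sketch of the wake-ahead part (our illustration; unpark is a hypothetical futex-style helper): while grouping a same-socket waiter, the shuffler wakes it early, so by the time it reaches the queue head it is already spinning and the handoff never waits on a scheduler wakeup.

#include <stdatomic.h>

enum wstate { W_SPINNING, W_PARKED };

struct bqnode {
    int            socket_id;
    _Atomic int    state;         /* W_SPINNING or W_PARKED */
    struct bqnode *next;
};

void unpark(struct bqnode *w);    /* hypothetical futex-wake helper */

/* called by the shuffler as it groups a same-socket waiter */
static void shuffle_and_wake(struct bqnode *shuffler, struct bqnode *cand)
{
    if (cand->socket_id != shuffler->socket_id)
        return;                   /* NUMA policy: skip other sockets */

    int parked = W_PARKED;
    /* wake the waiter before it reaches the queue head, so the eventual
       lock handoff always finds it spinning, not asleep */
    if (atomic_compare_exchange_strong(&cand->state, &parked, W_SPINNING))
        unpark(cand);
}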
SLIDE 38

Blocking SHFLLOCK in action

[Diagram: t0 holds the lock; the queue is t1 → t3 → t2 → t4, with some waiters sleeping ("ZZ") and one scheduled out]

t1 wakes up t3 after moving it next in line.

SLIDE 39

Implementation

  • Kernel space:

○ Replaced all mutex and rwsem instances
○ Modified the slowpath of the qspinlock

  • User space:

○ Added to the Litl library

  • Please see our paper for more, e.g.:

○ Readers-writer lock: a centralized rw-indicator + SHFLLOCK (sketched below)
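Our guess at the shape of that readers-writer lock (a hedged sketch, not the paper's code): a centralized reader counter for the read side, with writers serialized through a SHFLLOCK; readers back off while a writer holds the word, and a writer drains active readers after acquiring it.

#include <stdatomic.h>
#include <stdint.h>

struct qnode;
struct shfllock { _Atomic uint32_t locked; _Atomic(struct qnode *) tail; };
void shfl_lock(struct shfllock *l, struct qnode *me);   /* as sketched earlier */
void shfl_unlock(struct shfllock *l);

struct shfl_rwlock {
    _Atomic int     readers;   /* centralized rw-indicator */
    struct shfllock wlock;     /* writers serialize on a SHFLLOCK */
};

static void read_lock(struct shfl_rwlock *rw) {
    for (;;) {
        atomic_fetch_add(&rw->readers, 1);          /* announce the reader */
        if (atomic_load(&rw->wlock.locked) == 0)
            return;                                 /* no writer: proceed  */
        atomic_fetch_sub(&rw->readers, 1);          /* writer active: back off */
        while (atomic_load(&rw->wlock.locked) != 0)
            ;
    }
}

static void read_unlock(struct shfl_rwlock *rw) {
    atomic_fetch_sub(&rw->readers, 1);
}

static void write_lock(struct shfl_rwlock *rw, struct qnode *me) {
    shfl_lock(&rw->wlock, me);                      /* NUMA-aware writer lock */
    while (atomic_load(&rw->readers) != 0)          /* drain active readers   */
        ;
}

static void write_unlock(struct shfl_rwlock *rw) {
    shfl_unlock(&rw->wlock);
}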


SLIDE 40

Evaluation

  • SHFLLOCK performance:

○ Does shuffling maintain the application's throughput?
○ What is the overall memory footprint?

Setup: 8-socket, 192-core machine

SLIDE 41

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
SLIDE 42

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
  • Beyond one socket

○ NUMA-aware shuffling

SLIDE 43

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
  • Beyond one socket

○ NUMA-aware shuffling

  • Under core oversubscription

○ NUMA-aware + wakeup shuffling

SLIDE 44

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
  • Beyond one socket

○ NUMA-aware shuffling

  • Under core oversubscription

○ NUMA-aware + wakeup shuffling

  • Single thread

○ TAS acquire and release

SLIDE 45

Locks' performance: Memory footprint

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: locks' memory footprint vs. #threads; SHFLLOCK: 11, Stock: 18, CST: 140]

  • SHFLLOCKS have the smallest memory footprint

Reason: no extra auxiliary data structures
  ➢ Stock: parking-list structure + extra lock
  ➢ CST: per-socket structure

SLIDE 46

Case study: Exim mail server

[Figure: throughput (messages/sec vs. #threads) and lock memory footprint (vs. #threads) for Stock, CST, and SHFLLOCK]

Exim is fork-intensive and stresses the memory subsystem, file system, and scheduler. SHFLLOCK improves throughput by up to 1.5x and decreases the lock memory footprint by up to 93%.

SLIDE 47

Discussion

  • Another way to enforce these policies dynamically: the lock holder splits the queue.

○ E.g., NUMA-awareness: the compact NUMA-aware lock (CNA)
○ E.g., blocking lock: the Malthusian lock

  • Shuffling can support other policies:

○ Non-inclusive caches (the Skylake architecture)
○ Multi-level NUMA hierarchies (SGI machines)

SLIDE 48

Conclusion

  • Locks are critical for file system and application performance
  • Current lock designs:

○ Do not maintain the best throughput across varying thread counts
○ Have a high memory footprint

  • Shuffling: dynamically enforces policies

○ NUMA-awareness, blocking, etc.

  • SHFLLOCKS: a shuffling-based family of lock algorithms

○ NUMA-aware locks with minimal memory footprint