ShflLocks: Scalable and Practical Locking for Manycore Systems
Changwoo Min COSMOSS Lab / ECE / Virginia Tech https://cosmoss-vt.github.io/
2
[Graph: Exim mail server on RAMDISK: messages/sec vs. # cores (10 to 80) for btrfs, ext4, F2FS, XFS]
Embarrassingly parallel application!
[Graph: Exim mail server at 80 cores: messages/sec for btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS on RAMDISK, SSD, and HDD]
5
Read private, Read shared: scalable
Write private, Write shared: NOT scalable
Writes interfere with read operations:
– Writer's cache line: Shared → Exclusive
– All readers' cache lines: Shared → Invalidated
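These coherence transitions are what make shared writes non-scalable. A rough illustration (not from the slides; thread counts and names are arbitrary): threads hammering one shared counter force an invalidation on every increment, while padded private counters stay in each core's cache.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   (1UL << 24)

/* One cache line shared by every writer. */
static atomic_ulong shared_counter;
/* One padded slot per thread: each stays in its own core's cache. */
static unsigned long private_counter[NTHREADS][8];

static void *write_shared(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < NITERS; i++)
        /* Each increment pulls the line into this core (Shared -> Exclusive/
         * Modified) and invalidates every other core's copy. */
        atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
    return NULL;
}

static void *write_private(void *arg)
{
    unsigned long id = (unsigned long)arg;
    for (unsigned long i = 0; i < NITERS; i++)
        private_counter[id][0]++;           /* no cross-core invalidations */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    /* Swap write_shared for write_private to compare the two behaviors. */
    for (unsigned long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, write_shared, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared = %lu, private[0] = %lu\n",
           (unsigned long)atomic_load(&shared_counter), private_counter[0][0]);
    return 0;
}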
[Diagram: two sockets (Socket-1, Socket-2), each with its own LLC and memory]
Linux kernel lock adoption / modification

Locking research efforts (1990s to 2016): Dekker's algorithm (1962), Semaphore (1965), Lamport's bakery algorithm (1974), Backoff lock (1989), Ticket lock (1991), MCS lock (1991), HBO lock (2003), Hierarchical lock – HCLH (2006), and NUMA-aware locks: Flat combining NUMA lock (2011), Remote Core locking (2012), Cohort lock (2012), RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016)

Linux kernel lock evolution:
– 1990s: Spinlock: TTAS; Semaphore: TTAS + block; Rwsem: TTAS + block
– 2.6: Spinlock: ticket; Mutex: TTAS + block; Rwsem: TTAS + block
– 3.16: Spinlock: ticket; Mutex: TTAS + spin + block; Rwsem: TTAS + spin + block
– 4.4: Spinlock: qspinlock; Mutex: TTAS + spin + block; Rwsem: TTAS + spin + block
11
In high thread count: minimize lock contention
In single thread: no penalty when not contended
In oversubscription: avoid bookkeeping overheads
Memory footprint: scale to millions of locks (e.g., file inodes)
12
[Graph: operations/second vs. # threads for the Stock kernel, across the 1-socket, > 1-socket, and oversubscribed regions]
(e.g., each thread creates a file, a serial operation, in a shared directory)
Due to non-uniform memory access (NUMA), accessing local-socket memory is faster than remote-socket memory.
[Diagram: two sockets (Socket-1, Socket-2), each with its own LLC and memory]
13
○ Two-level locks: per-socket and global (sketched below)
○ Generally hierarchical
○ Require extra memory allocation
○ Do not care about single-thread throughput
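A rough illustration of the two-level idea (not CST's actual algorithm; the CPU-to-socket mapping and spin loop below are simplified assumptions): a per-socket lock nested under a global lock, so most contention stays within a socket.

#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

#define MAX_SOCKETS 8

struct numa_lock {
    atomic_int global;               /* top-level lock, one holder at a time */
    atomic_int socket[MAX_SOCKETS];  /* per-socket locks, padded in practice */
};

static int my_socket(void)
{
    /* Placeholder CPU->socket mapping; a real lock would query the topology. */
    return sched_getcpu() % MAX_SOCKETS;
}

static void spin_acquire(atomic_int *l)
{
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;                        /* spin locally on the cached value */
}

void numa_lock_acquire(struct numa_lock *nl)
{
    int s = my_socket();
    spin_acquire(&nl->socket[s]);    /* contend only with same-socket threads */
    spin_acquire(&nl->global);       /* one thread per socket contends here */
}

void numa_lock_release(struct numa_lock *nl)
{
    /* Cohort-style locks instead keep the global lock and hand the socket
     * lock to a same-socket waiter for a while; this sketch omits that. */
    atomic_store_explicit(&nl->global, 0, memory_order_release);
    atomic_store_explicit(&nl->socket[my_socket()], 0, memory_order_release);
}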
14
[Diagram: hierarchical lock: per-socket locks on Socket-1 and Socket-2 under a global lock]
15
[Graph: operations/second vs. # threads: Stock vs. CST, across the 1-socket, > 1-socket, and oversubscribed regions]
CST helps beyond one socket (high thread count) and in the oversubscribed case (384 threads).
Single-thread cost: multiple atomic instructions.
(e.g., each thread creates a file, a serial operation, in a shared directory)
Setup: 8-socket, 192-core machine
16
[Graph: locks' memory footprint vs. # threads: Stock vs. CST]
(e.g., each thread creates a file, a serial operation, in a shared directory)
CST allocates a per-socket structure and a global lock. Worst case: ~1 GB lock footprint out of the application's 32 GB of memory.
Hierarchical lock: its footprint on the graph reaches about 820 MB (consistent with the ~1 GB worst case above).
19
20
Hierarchical locks avoid NUMA overhead by passing the lock within a socket.
Queue-based locks already maintain a set of waiters.
21
Representing a waiting queue
[Diagram: a waiter's qnode records its socket ID (e.g., socket 0) and whether it is the shuffler; the lock keeps a tail pointer to the queue; t1 is the only waiter]
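A minimal C sketch of this queue representation (field names are illustrative, not the kernel's or the paper's actual definitions):

#include <stdatomic.h>
#include <stdbool.h>

struct qnode {
    struct qnode *_Atomic next;   /* next waiter in the queue */
    int socket_id;                /* NUMA socket of this waiter (e.g., socket 0) */
    bool is_shuffler;             /* set while this waiter acts as the shuffler */
    atomic_int status;            /* waiting / ready-to-acquire / parked, etc. */
};

struct shfllock {
    atomic_int tas;               /* word holding the test-and-set lock */
    struct qnode *_Atomic tail;   /* tail of the waiting queue (NULL if empty) */
};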
22
Another waiter (t2) is in a different socket. [Queue: t1 → t2]
23
More waiters join. [Queue: t1 → t2 → t3 → t4]
24
The shuffler (t1) sorts the waiters based on socket ID. [Queue: t1 → t2 → t3 → t4]
25
A waiter (the shuffler) reorders the queue of waiters:
1) by reordering waiters (e.g., changing the lock order)
2) by modifying waiters' states (e.g., waking up / sleeping)
26
Blocking lock: wake up a nearby sleeping waiter.
RW lock: group writers together.
27
28
Lock word layout: TAS (4 B, a test-and-set lock) + queue tail (8 B, the waiters list)
○ The lock holder holds the TAS lock
○ Waiters join the queue
lock(): try acquiring the TAS lock first; join the queue on failure (sketched below)
unlock(): release the TAS lock (reset the TAS word to 0)
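A minimal sketch of that fast path, assuming the struct layout above and leaving the queueing/shuffling slow path as a stub:

#include <stdatomic.h>

struct qnode;                     /* per-waiter node, as sketched earlier */
struct shfllock {
    atomic_int tas;               /* test-and-set word (4 B) */
    struct qnode *_Atomic tail;   /* queue of waiters (8 B) */
};

void shfl_lock_slowpath(struct shfllock *l);   /* enqueue, spin/park, shuffle */

void shfl_lock(struct shfllock *l)
{
    /* Try acquiring the TAS lock first ... */
    if (atomic_exchange_explicit(&l->tas, 1, memory_order_acquire) == 0)
        return;                   /* uncontended: one atomic, TAS-like cost */
    /* ... and join the queue on failure. */
    shfl_lock_slowpath(l);
}

void shfl_unlock(struct shfllock *l)
{
    /* Reset the TAS word to 0; the waiter made ready by shuffling
     * acquires it with its own atomic operation. */
    atomic_store_explicit(&l->tas, 0, memory_order_release);
}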
29
TAS maintains single-thread performance
○ NUMA-awareness, efficient wake-up strategy
○ Utilizing idle / CPU-wasting waiters
○ Bounded number of shuffling rounds
★ Shuffling is off the critical path most of the time
30
t0 (socket 1): lock(). The TAS word goes from unlocked to locked (held by t0); the queue is empty.
31
Multiple threads join the queue: t0 holds the lock; t1, t2, t3, t4 wait in the queue.
32
Shuffling in progress: t1 starts the shuffling process. [Queue: t1 → t2 → t3 → t4]
33
Shuffling in progress: the queue is reordered to t1 → t3 → t2 → t4, and t3 now becomes the shuffler.
34
t0: unlock(). t1 acquires the lock via CAS.
35
t1 now holds the lock and notifies t3 as the new queue head.
36
t1 holds the lock; t3 is now the queue head, followed by t2 and t4.
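A rough, single-threaded sketch of the reordering step the shuffler performs in this walkthrough (it ignores concurrent arrivals, the bounded rounds, and shuffler hand-off; field names follow the earlier sketch and are assumptions, not the paper's code):

#include <stdatomic.h>

struct qnode {                    /* same illustrative layout as before */
    struct qnode *_Atomic next;
    int socket_id;
};

/* Called by the shuffler 'me': pull waiters from me's socket forward so
 * the lock stays within one socket for a while. In the blocking variant
 * the shuffler also wakes the waiter it just moved. */
static void shuffle(struct qnode *me)
{
    struct qnode *last = me;                       /* end of same-socket group */
    struct qnode *prev = me;
    struct qnode *curr = atomic_load(&me->next);

    while (curr) {
        struct qnode *next = atomic_load(&curr->next);
        if (curr->socket_id == me->socket_id) {
            if (prev == last) {
                last = prev = curr;                /* already adjacent: extend group */
            } else {
                atomic_store(&prev->next, next);   /* unlink curr ... */
                atomic_store(&curr->next, atomic_load(&last->next));
                atomic_store(&last->next, curr);   /* ... re-insert after the group */
                last = curr;
            }
            /* blocking variant: wake_up(curr) here, off the critical path */
        } else {
            prev = curr;
        }
        curr = next;
    }
}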
37
○ Avoids wakeup latency on the critical path.
○ Lock stealing: avoids the lock-waiter preemption problem.
○ Shuffled waiters are already spinning.
38
[Diagram: blocking variant: t0 holds the lock; t3, t2, and t4 are sleeping; the shuffler t1 wakes up t3 after moving it, so t3 is scheduled and spinning when the lock is handed over.]
○ Kernel: replaced all mutex and rwsem locks; modified the slowpath of the qspinlock
○ Userspace: added to the Litl library
○ Readers-writer lock: centralized rw-indicator + SHFLLOCK (see the sketch below)
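A minimal sketch of that readers-writer construction (reader-side details such as writer preference are simplified assumptions; shfl_lock/shfl_unlock are the writer lock from the earlier sketch):

#include <stdatomic.h>

struct shfllock { atomic_int tas; /* queue tail elided */ };
void shfl_lock(struct shfllock *l);
void shfl_unlock(struct shfllock *l);

struct shfl_rwlock {
    atomic_int readers;        /* centralized rw-indicator: active readers */
    struct shfllock wlock;     /* writers serialize (and group) through it */
};

void read_lock(struct shfl_rwlock *rw)
{
    for (;;) {
        atomic_fetch_add_explicit(&rw->readers, 1, memory_order_acquire);
        if (atomic_load_explicit(&rw->wlock.tas, memory_order_acquire) == 0)
            return;                                   /* no writer active */
        /* a writer holds the lock: back out and wait for it to finish */
        atomic_fetch_sub_explicit(&rw->readers, 1, memory_order_release);
        while (atomic_load_explicit(&rw->wlock.tas, memory_order_relaxed))
            ;
    }
}

void read_unlock(struct shfl_rwlock *rw)
{
    atomic_fetch_sub_explicit(&rw->readers, 1, memory_order_release);
}

void write_lock(struct shfl_rwlock *rw)
{
    shfl_lock(&rw->wlock);     /* shuffling groups writers behind this lock */
    while (atomic_load_explicit(&rw->readers, memory_order_acquire) != 0)
        ;                      /* wait for in-flight readers to drain */
}

void write_unlock(struct shfl_rwlock *rw)
{
    shfl_unlock(&rw->wlock);
}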
39
○ Does shuffling maintain the application's throughput?
○ What is the overall memory footprint?
Setup: 8-socket, 192-core machine
40
41
(e.g., each thread creates a file, a serial operation, in a shared directory)
[Graph: operations/second vs. # threads: Stock, CST, and SHFLLOCK across the 1-socket, > 1-socket, and oversubscribed regions]
SHFLLOCK:
○ NUMA-aware shuffling
○ NUMA-aware + wakeup shuffling
○ TAS acquire and release
45
(e.g., each thread creates a file, a serial operation, in a shared directory)
[Graph: locks' memory footprint vs. # threads: Stock, CST, and SHFLLOCK]
Reason: no extra auxiliary data structure
➢ Stock: parking-list structure + extra lock
➢ CST: per-socket structure
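The footprint difference follows from where the lock state lives. A hedged illustration (the per-socket layout below is a guess at the shape of such a design, not CST's real structure):

#include <stdatomic.h>
#include <stdio.h>

#define MAX_SOCKETS 8

/* SHFLLOCK-style: everything lives in the word embedded in the object
 * (e.g., in each inode); waiters' qnodes live on their own stacks. */
struct shfllock {
    atomic_int tas;               /* 4 B */
    void *_Atomic tail;           /* 8 B */
};

/* Hierarchical/per-socket style: each lock carries extra, often dynamically
 * allocated, per-socket state (illustrative layout only). */
struct persocket_lock {
    atomic_int global;
    struct { atomic_int lock; char pad[60]; } socket[MAX_SOCKETS];
};

int main(void)
{
    printf("embedded shfllock: %zu bytes per lock\n", sizeof(struct shfllock));
    printf("per-socket lock:   %zu bytes per lock\n", sizeof(struct persocket_lock));
    /* With millions of locks (e.g., file inodes), this per-lock difference
     * is what grows into the hundreds-of-MB footprint shown in the graph. */
    return 0;
}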
46
Exim mail server (Stock, CST, SHFLLOCK): it is fork-intensive and stresses the memory subsystem, file system, and scheduler.
SHFLLOCK improves throughput by up to 1.5x and decreases the locks' memory footprint by up to 93%.
[Graphs: throughput (messages/second vs. # threads) and locks' memory footprint (vs. # threads)]
○ The lock holder splits the queue to provide support for:
○ Non-inclusive caches (Skylake architecture)
○ Multi-level NUMA hierarchies (SGI machines)
47
○ Existing locks do not maintain the best throughput with varying thread counts and have a high memory footprint
○ Shuffling: a technique for implementing lock policies (NUMA-awareness, blocking, etc.) off the critical path
○ SHFLLOCKS: NUMA-aware locks with a minimal memory footprint
48