ShflLocks: Scalable and Practical Locking for Manycore Systems



SLIDE 1

ShflLocks: Scalable and Practical Locking for Manycore Systems

Changwoo Min COSMOSS Lab / ECE / Virginia Tech https://cosmoss-vt.github.io/

SLIDE 2

File system becomes a bottleneck on manycore systems

[Figure: Exim mail server on RAMDISK; messages/sec vs. #core for btrfs, ext4, F2FS, XFS]

Embarrassingly parallel application! Yet file system throughput either:

  • 1. Saturates
  • 2. Collapses
  • 3. Never scales
SLIDE 3

Even on slower storage media, the file system becomes a bottleneck

[Figure: Exim mail server at 80 cores; messages/sec on RAMDISK, SSD, and HDD for btrfs, ext4, F2FS, XFS]

SLIDE 4

FxMark: File systems are not scalable on manycore systems

[Figure: FxMark results; M ops/sec vs. #core for the microbenchmarks DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM plus four O_DIRECT variants, Exim (messages/sec), RocksDB (ops/sec), and DBENCH (GB/sec). Legend: btrfs, ext4, ext4NJ, F2FS, tmpfs, XFS]

Create files in a shared directory: locks are critical to performance and scalability

SLIDE 5

Future hardware further exacerbates the problem

[Figure: Exim mail server on RAMDISK; messages/sec vs. #core for btrfs, ext4, F2FS, XFS]

Embarrassingly parallel application! Yet file system throughput either:

  • 1. Saturates
  • 2. Collapses
  • 3. Never scales
SLIDE 6

Why does this happen? Memory access is NOT scalable

  • 1. Read operations are scalable

[Figure: throughput vs. #core for private reads and shared reads]

SLIDE 7

Why does this happen? Memory access is NOT scalable

  • 1. Read operations are scalable
  • 2. Write operations are NOT scalable

[Figure: throughput vs. #core for private reads, shared reads, private writes, shared writes]

SLIDE 8

Why does this happen? Memory access is NOT scalable

  • 1. Read operations are scalable
  • 2. Write operations are NOT scalable
  • 3. Write operations interfere with read operations (see the counter sketch below)

[Figure: throughput vs. #core for private reads, shared reads, private writes, shared writes; the contended case writes a shared lock variable (flag) and the shared data protected by the lock]
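To make points 2 and 3 concrete, here is a minimal C sketch (our illustration, not from the talk; the thread count, iteration count, and names are arbitrary). Each thread increments either a single shared counter, whose cache line bounces between cores on every write, or its own cache-line-padded private counter, which stays core-local:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    (1 << 22)

_Atomic long shared_counter;   /* one cache line contended by all writers */

/* pad each private counter to its own cache line to avoid false sharing */
struct padded { _Atomic long v; char pad[64 - sizeof(_Atomic long)]; };
static struct padded private_counter[NTHREADS];

static void *write_shared(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&shared_counter, 1);   /* coherence miss per write */
    return NULL;
}

static void *write_private(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        atomic_fetch_add(&private_counter[id].v, 1);  /* line stays local */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, write_shared, NULL);
    for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, write_private, (void *)i);
    for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
    printf("shared=%ld private[0]=%ld\n", shared_counter, private_counter[0].v);
    return 0;
}

Timing the two phases separately (e.g., with clock_gettime) shows the private phase scaling with the core count while the shared phase flattens or collapses.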

SLIDE 9

Why does this happen? Cache coherence is not scalable

  • Cache-coherence traffic dominates!
  • Writing a cache line in the popular MESI protocol:

The writer's cache line: Shared → Exclusive

All readers' cache lines: Shared → Invalid

We should minimize contended cache lines and core-to-core communication traffic (the TTAS sketch below applies this principle to a spinlock).

[Diagram: two sockets, each with its own LLC and memory, exchanging coherence traffic over the interconnect]
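As a concrete application of that principle, here is a hedged test-and-test-and-set (TTAS) spinlock sketch (ours, not the talk's): waiters spin on plain loads, so the lock's cache line stays in the Shared state in every waiter's cache while the lock is held; only the occasional test-and-set forces the Shared → Exclusive upgrade that invalidates the other copies.

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock_t;

static void ttas_lock(ttas_lock_t *l) {
    for (;;) {
        /* read-only spin: no coherence traffic while the lock stays held */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* one read-modify-write per attempt: invalidates readers' copies */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

static void ttas_unlock(ttas_lock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}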

SLIDE 10

Linux kernel lock adoption / modification

Dekker's algorithm (1962), Semaphore (1965), Lamport's bakery algorithm (1974), Backoff lock (1989), Ticket lock (1991), MCS lock (1991), HBO lock (2003), Hierarchical lock (HCLH, 2006), Flat-combining NUMA lock (2011), Remote Core Locking (2012), Cohort lock (2012), RW cohort lock (2013), Malthusian lock (2014), HMCS lock (2015), AHMCS lock (2016)

(HBO onward: NUMA-aware locks)

Linux kernel lock evolution:

1990s: Spinlock: TTAS | Semaphore: TTAS + block | Rwsem: TTAS + block
2011: Spinlock: ticket (2.6) | Mutex: TTAS + block (2.6) | Rwsem: TTAS + block
2014: Spinlock: ticket | Mutex: TTAS + spin + block (3.16) | Rwsem: TTAS + spin + block (3.16)
2016: Spinlock: qspinlock (4.4) | Mutex: TTAS + spin + block | Rwsem: TTAS + spin + block

Decades of research effort on locks, yet slow uptake in practice: adopting new locks is necessary, but it is not easy.

SLIDE 11

Two dimensions of lock design/goals

1) High throughput

  • In high thread count: minimize lock contention
  • In single thread: no penalty when not contended
  • In oversubscription: avoid bookkeeping overheads

2) Minimal lock size

  • Memory footprint must scale to millions of lock instances (e.g., one per file inode)

SLIDE 12

Locks' performance: Throughput

[Figure: operations/sec vs. #threads for the stock lock; regions: 1 socket, >1 socket, oversubscribed]

  • Performance crashes after 1 socket.

Due to non-uniform memory access (NUMA): accessing local-socket memory is faster than accessing remote-socket memory.

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Diagram: two sockets, each with its own LLC and memory]

SLIDE 13

Locks' performance: Throughput

[Figure: operations/sec vs. #threads for the stock lock; regions: 1 socket, >1 socket, oversubscribed]

  • Performance crashes after 1 socket.

Due to non-uniform memory access (NUMA): accessing local-socket memory is faster than accessing remote-socket memory.

  • NUMA also affects oversubscription.

(e.g., each thread creates a file, a serial operation, in a shared directory)

Goal: prevent the throughput crash after one socket

SLIDE 14

Existing research efforts

  • Making locks NUMA-aware:

○ Two-level locks: per-socket and global
○ Generally hierarchical

  • Problems:

○ Require extra memory allocation
○ Do not care about single-thread throughput

  • Example: CST¹ (a minimal two-level sketch appears below)

[Diagram: Socket-1 and Socket-2, each with its own socket lock, beneath one global lock]

  1. Scalable NUMA-aware Blocking Synchronization Primitives. ATC 2017.
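For reference, the two-level pattern these hierarchical locks share can be sketched as follows (a hedged illustration of the general shape, not CST's actual code; spin_t stands in for any underlying lock). Note the single-thread cost: even an uncontended acquisition performs two atomic operations, one per level.

#include <stdatomic.h>

#define MAX_SOCKETS 8

typedef struct { atomic_flag held; } spin_t;   /* init with ATOMIC_FLAG_INIT */

static void spin_lock(spin_t *l)   { while (atomic_flag_test_and_set(&l->held)) ; }
static void spin_unlock(spin_t *l) { atomic_flag_clear(&l->held); }

typedef struct {
    spin_t global;                   /* level 2: contended across sockets    */
    spin_t per_socket[MAX_SOCKETS];  /* level 1: contended within one socket */
} hier_lock_t;

static void hier_lock(hier_lock_t *l, int socket_id) {
    spin_lock(&l->per_socket[socket_id]);  /* win locally first...           */
    spin_lock(&l->global);                 /* ...then compete across sockets */
}

static void hier_unlock(hier_lock_t *l, int socket_id) {
    spin_unlock(&l->global);
    spin_unlock(&l->per_socket[socket_id]);
}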
SLIDE 15

Locks' performance: Throughput

[Figure: operations/sec vs. #threads for Stock and CST; regions: 1 socket, >1 socket, oversubscribed]

  • CST maintains throughput:

Beyond one socket (high thread count), and in the oversubscribed case (384 threads).

  • Poor single-thread throughput:

Multiple atomic instructions per acquisition.

(e.g., each thread creates a file, a serial operation, in a shared directory)

Setup: 8-socket, 192-core machine

Single-thread performance matters in the non-contended case

SLIDE 16

Locks' performance: Memory footprint

  • CST has a large memory footprint.

[Figure: locks' memory footprint vs. #threads; Stock: 18, CST: 140]

(e.g., each thread creates a file, a serial operation, in a shared directory)

CST allocates a per-socket structure plus a global lock. Worst case: ~1 GB of lock footprint out of the application's 32 GB of memory.

SLIDE 17

Locks' performance: Memory footprint

  • CST has a large memory footprint.

[Figure: locks' memory footprint vs. #threads; Stock: 18, CST: 140, hierarchical lock: 820]

(e.g., each thread creates a file, a serial operation, in a shared directory)

CST allocates a per-socket structure plus a global lock. Worst case: ~1 GB of lock footprint out of the application's 32 GB of memory.

SLIDE 18

Locks' performance: Memory footprint

  • CST has a large memory footprint.

[Figure: locks' memory footprint vs. #threads; Stock: 18, CST: 140, hierarchical lock: 820]

(e.g., each thread creates a file, a serial operation, in a shared directory)

CST allocates a per-socket structure plus a global lock. Worst case: ~1 GB of lock footprint out of the application's 32 GB of memory.

A lock's memory footprint affects its adoption

SLIDE 19


Two goals in our new lock

1) A NUMA-aware lock with no memory overhead
2) High throughput at both low and high thread counts

SLIDE 20

Key idea: Sort waiters on the fly

Observations:

  • Hierarchical locks mitigate NUMA effects by passing the lock within a socket
  • Queue-based locks already maintain a set of waiters

SLIDE 21

Shuffling: Design methodology

Representing the waiting queue:

[Diagram: a single waiter t1 in the queue; each waiter's qnode records its socket ID (e.g., socket 0), and the lock keeps a shuffler field and a queue tail]

SLIDE 22

Shuffling: Design methodology

Another waiter arrives from a different socket.

[Diagram: queue t1 → t2; t2's qnode records a different socket ID]

SLIDE 23

Shuffling: Design methodology

More waiters join.

[Diagram: queue t1 → t2 → t3 → t4, with mixed socket IDs]

SLIDE 24

Shuffling: Design methodology

The shuffler (t1) sorts the queue based on socket ID.

[Diagram: same-socket waiters are grouped together in the queue]

SLIDE 25

Shuffling: Design methodology

A waiter (the shuffler) reorders the queue of waiters.

  • A waiter that would otherwise spin (i.e., waste cycles) amortizes the cost of lock operations:

1) By reordering the queue (e.g., the lock-acquisition order)
2) By modifying waiters' states (e.g., waking up / putting to sleep)

→ The shuffler computes NUMA-ness on the fly, without using extra memory, unlike other approaches (a minimal sketch of a shuffling pass follows)

[Diagram: the shuffler reordering the queue t1 → t2 → t3 → t4]
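Here is a hedged sketch of what one shuffling pass might look like (our illustration of the idea, not the paper's code; concurrency is elided, and a real shuffler must also synchronize with new arrivals and fix up the lock's tail pointer when it moves the last node). The shuffler walks the queue and regroups waiters that share its socket directly behind the current same-socket run:

#include <stddef.h>

struct qnode {
    int           socket_id;   /* NUMA socket of the waiting thread */
    struct qnode *next;        /* successor in the waiting queue */
};

static void shuffle(struct qnode *shuffler)
{
    struct qnode *group_tail = shuffler;   /* end of the same-socket run */
    struct qnode *prev = shuffler;
    struct qnode *cur  = shuffler->next;

    while (cur != NULL) {
        struct qnode *next = cur->next;

        if (cur->socket_id == shuffler->socket_id) {
            if (cur == group_tail->next) {
                group_tail = cur;          /* already in place: extend run */
                prev = cur;
            } else {                       /* unlink cur, relink after run */
                prev->next       = cur->next;
                cur->next        = group_tail->next;
                group_tail->next = cur;
                group_tail       = cur;    /* prev unchanged: cur was removed */
            }
        } else {
            prev = cur;
        }
        cur = next;
    }
}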

SLIDE 26

A shuffler can modify the queue or a waiter's state according to a defined function/policy: shuffling is generic!

  • Blocking lock: wake up a nearby sleeping waiter
  • RWlock: group writers together

Incorporate shuffling in lock design (a policy-callback sketch follows)

SLIDE 27

SHFLLOCKS

Minimal-footprint locks that handle any level of thread contention

SLIDE 28

SHFLLOCKS

Lock layout: TAS word (4B, test-and-set lock) + queue tail (8B, waiters list)

  • Decouples the lock holder and the waiters

○ The lock holder holds the TAS lock
○ Waiters join the queue

lock(): try acquiring the TAS lock first; join the queue on failure
unlock(): release the TAS lock (reset the TAS word to 0)

(A sketch of this layout and its fast paths follows.)
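A hedged sketch of this 12-byte layout and its fast paths (our reading of the slide, not the authors' code; enqueue_and_wait is the contended slow path, sketched after the walkthrough below):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct qnode;                          /* per-waiter node, as sketched earlier */

struct shfllock {
    _Atomic uint32_t        locked;    /* 4B TAS word: 0 = free, 1 = held */
    _Atomic(struct qnode *) tail;      /* 8B tail of the waiters list */
};

void enqueue_and_wait(struct shfllock *l, struct qnode *me);  /* slow path */

static bool shfl_trylock(struct shfllock *l) {
    uint32_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

static void shfl_lock(struct shfllock *l, struct qnode *me) {
    if (shfl_trylock(l))
        return;                  /* single-thread fast path: one atomic op */
    enqueue_and_wait(l, me);     /* contended path: join the waiters queue */
}

static void shfl_unlock(struct shfllock *l) {
    /* reset the TAS word; the spinning queue head acquires it next */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}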

SLIDE 29

SHFLLOCKS

Lock layout: TAS word (4B, test-and-set lock) + queue tail (8B, waiters list)

The TAS word maintains single-thread performance.

  • Waiters use shuffling to improve application throughput

○ NUMA-awareness, an efficient wakeup strategy
○ Utilizing idle, CPU-wasting waiters

  • Maintains long-term fairness:

○ Bounds the number of shuffling rounds

★ Shuffling is off the critical path most of the time

SLIDE 30

NUMA-aware SHFLLOCK in action

t0 (socket 1): lock()

[Diagram: the TAS word flips from unlocked to locked; t0 holds the lock; the queue is empty]

SLIDE 31

NUMA-aware SHFLLOCK in action

Multiple threads join the queue.

[Diagram: t0 holds the lock; waiters t1 → t2 → t3 → t4 are queued]

SLIDE 32

NUMA-aware SHFLLOCK in action

Shuffling in progress: t1 starts the shuffling process.

[Diagram: t0 still holds the lock; t1 shuffles the queue t1 → t2 → t3 → t4]

SLIDE 33

NUMA-aware SHFLLOCK in action

Shuffling in progress: t3 now becomes the shuffler.

[Diagram: the queue is reordered to t1 → t3 → t2 → t4]

SLIDE 34

NUMA-aware SHFLLOCK in action

t0: unlock(). t1 acquires the lock via CAS.

[Diagram: t0 releases the TAS word; queue head t1 takes it; queue t1 → t3 → t2 → t4]

SLIDE 35

NUMA-aware SHFLLOCK in action

t1 notifies t3 that it is the new queue head.

[Diagram: t1 holds the lock; the remaining queue is t3 → t2 → t4]

SLIDE 36

NUMA-aware SHFLLOCK in action

t0: unlock()

[Diagram: the handoff is complete; t1 holds the lock and the remaining queue t3 → t2 → t4 waits in socket order]

(The waiter slow path that this walkthrough animates is sketched below.)

SLIDE 37

Other SHFLLOCKS: Blocking SHFLLOCK

  • NUMA-aware blocking lock.
  • Wakes up shuffled waiters based on their socket ID.

○ Keeps the wakeup latency off the critical path.

  • The lock is always passed to a spinning waiter.

○ Lock stealing: avoids the lock-waiter preemption problem.
○ Shuffled waiters are already spinning.

  • Guarantees forward progress of the system. (A sketch of the wake-ahead policy follows.)
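A hedged sketch of the wake-ahead part (our illustration; unpark is a hypothetical futex-style helper): while grouping a same-socket waiter, the shuffler wakes it early, so by the time it reaches the queue head it is already spinning and the handoff never waits on a scheduler wakeup.

#include <stdatomic.h>

enum wstate { W_SPINNING, W_PARKED };

struct bqnode {
    int            socket_id;
    _Atomic int    state;         /* W_SPINNING or W_PARKED */
    struct bqnode *next;
};

void unpark(struct bqnode *w);    /* hypothetical futex-wake helper */

/* called by the shuffler as it groups a same-socket waiter */
static void shuffle_and_wake(struct bqnode *shuffler, struct bqnode *cand)
{
    if (cand->socket_id != shuffler->socket_id)
        return;                   /* NUMA policy: skip other sockets */

    int parked = W_PARKED;
    /* wake the waiter before it reaches the queue head, so the eventual
       lock handoff always finds it spinning, not asleep */
    if (atomic_compare_exchange_strong(&cand->state, &parked, W_SPINNING))
        unpark(cand);
}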
SLIDE 38

Blocking SHFLLOCK in action

[Diagram: t0 holds the lock; the queue is t1 → t3 → t2 → t4, with some waiters sleeping ("ZZ") and one scheduled out]

t1 wakes up t3 after moving it next in line.

SLIDE 39

Implementation

  • Kernel space:

○ Replaced all mutex and rwsem instances
○ Modified the slowpath of the qspinlock

  • User space:

○ Added to the Litl library

  • Please see our paper for more, e.g.:

○ Readers-writer lock: a centralized rw-indicator + SHFLLOCK (sketched below)
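Our guess at the shape of that readers-writer lock (a hedged sketch, not the paper's code): a centralized reader counter for the read side, with writers serialized through a SHFLLOCK; readers back off while a writer holds the word, and a writer drains active readers after acquiring it.

#include <stdatomic.h>
#include <stdint.h>

struct qnode;
struct shfllock { _Atomic uint32_t locked; _Atomic(struct qnode *) tail; };
void shfl_lock(struct shfllock *l, struct qnode *me);   /* as sketched earlier */
void shfl_unlock(struct shfllock *l);

struct shfl_rwlock {
    _Atomic int     readers;   /* centralized rw-indicator */
    struct shfllock wlock;     /* writers serialize on a SHFLLOCK */
};

static void read_lock(struct shfl_rwlock *rw) {
    for (;;) {
        atomic_fetch_add(&rw->readers, 1);          /* announce the reader */
        if (atomic_load(&rw->wlock.locked) == 0)
            return;                                 /* no writer: proceed  */
        atomic_fetch_sub(&rw->readers, 1);          /* writer active: back off */
        while (atomic_load(&rw->wlock.locked) != 0)
            ;
    }
}

static void read_unlock(struct shfl_rwlock *rw) {
    atomic_fetch_sub(&rw->readers, 1);
}

static void write_lock(struct shfl_rwlock *rw, struct qnode *me) {
    shfl_lock(&rw->wlock, me);                      /* NUMA-aware writer lock */
    while (atomic_load(&rw->readers) != 0)          /* drain active readers   */
        ;
}

static void write_unlock(struct shfl_rwlock *rw) {
    shfl_unlock(&rw->wlock);
}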


SLIDE 40

Evaluation

  • SHFLLOCK performance:

○ Does shuffling maintain the application's throughput?
○ What is the overall memory footprint?

Setup: 8-socket, 192-core machine

SLIDE 41

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
SLIDE 42

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
  • Beyond one socket

○ NUMA-aware shuffling

SLIDE 43

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
  • Beyond one socket

○ NUMA-aware shuffling

  • Under core oversubscription

○ NUMA-aware + wakeup shuffling

SLIDE 44

Locks' performance: Throughput

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: operations/sec vs. #threads for Stock, CST, and SHFLLOCK; regions: 1 socket, >1 socket, oversubscribed]

  • SHFLLOCKS maintain performance:
  • Beyond one socket

○ NUMA-aware shuffling

  • Under core oversubscription

○ NUMA-aware + wakeup shuffling

  • Single thread

○ TAS acquire and release

SLIDE 45

Locks' performance: Memory footprint

(e.g., each thread creates a file, a serial operation, in a shared directory)

[Figure: locks' memory footprint vs. #threads; SHFLLOCK: 11, Stock: 18, CST: 140]

  • SHFLLOCKS have the smallest memory footprint

Reason: no extra auxiliary data structures
  ➢ Stock: parking-list structure + extra lock
  ➢ CST: per-socket structure

SLIDE 46

Case study: Exim mail server

[Figure: throughput (messages/sec vs. #threads) and lock memory footprint (vs. #threads) for Stock, CST, and SHFLLOCK]

Exim is fork-intensive and stresses the memory subsystem, file system, and scheduler. SHFLLOCK improves throughput by up to 1.5x and decreases the lock memory footprint by up to 93%.

SLIDE 47

Discussion

  • Another way to enforce these policies dynamically: the lock holder splits the queue.

○ E.g., NUMA-awareness: the compact NUMA-aware lock (CNA)
○ E.g., blocking lock: the Malthusian lock

  • Shuffling can support other policies:

○ Non-inclusive caches (the Skylake architecture)
○ Multi-level NUMA hierarchies (SGI machines)

SLIDE 48

Conclusion

  • Locks are critical for file system and application performance
  • Current lock designs:

○ Do not maintain the best throughput across varying thread counts
○ Have a high memory footprint

  • Shuffling: dynamically enforces policies

○ NUMA-awareness, blocking, etc.

  • SHFLLOCKS: a shuffling-based family of lock algorithms

○ NUMA-aware locks with minimal memory footprint