Scaling Synchronization Primitives Ph.D. Defense of Dissertation - PowerPoint PPT Presentation

Scaling Synchronization Primitives Ph.D. Defense of Dissertation Sanidhya Kashyap

Rise of the multicore machines Hardware limitation: frequency stagnated CPU trends 10000 Frequency (MHz) 1000 100 Data-intensive applications performance 10 Number of hardware threads 1 1970 1980 1990 2000 2010 2020 Applications inherently scaled with increasing frequency Machines have multiples of processors (multi-socket) 2 Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten

Multicore machines → “The free lunch is over” Today’s de facto standard: Concurrent applications that scale with increasing cores Operating systems Cloud services Data processing systems Databases Synchronization primitives Basic building block for designing applications 4

Synchronization primitives Provide some form of consistency required by applications Determine the ordering/scheduling of concurrent events 6

Embarrassingly parallel application performance Typical application performance on a manycore machine Higher is better 100 K 80 K Messages / second 60 K 40 K 20 K 0 K 1 2 4 24 48 72 96 120 144 168 192 # threads 7

Embarrassingly parallel application performance Typical application performance on a manycore machine 1200 K 1000 K Messages / second 800 K 600 K 400 K 200 K 0 K 1 2 4 24 48 72 96 120 144 168 192 Synchronization required at several places # threads 8

Future hardware will exacerbate scalability 100-1000s of CPUs Many-core Single core 10x ~ 1000x SSD Challenge: Maintain application scalability 9

How can we minimize the overhead of synchronization primitives for large multicore machines? Efficiently schedule events by leveraging HW/SW 10

Thesis contributions • Timestamping is costly on large multicore machines Hardware timestamping • Cache contention due to atomic instructions Eurosys’18 • Approach : Use per-core invariant hardware clock • Double scheduling in a virtualized environment VM VM VM double Hypervisor • Introduce various types of preemption problems scheduling OS • Approach : Expose semantic information across layers ATC’18 • Discrepancy between lock design and use Decouple lock design policy from • Approach : Decouple lock design from lock policy via HW / SW design policy shuffling mechanism SOSP’19 13

Thesis contributions • Timestamping is costly on large multicore machines Hardware timestamping • Cache contention due to atomic instructions Eurosys’18 • Approach : Use per-core invariant hardware clock • Double scheduling in a virtualized environment VM VM VM double Hypervisor • Introduce various types of preemption problems scheduling OS • Approach : Expose semantic information across layers ATC’18 • Discrepancy between lock design and use Decouple lock design policy from • Approach : Decouple lock design from lock policy via HW / SW design policy shuffling mechanism SOSP’19 14

Example: Email service 15

Embarrassingly parallel application performance Process intensive and stresses memory subsystem, file system and scheduler ... 100 K 80 K Messages / second File system file_create (…) { 60 K spin_lock(superblock); … Scheduler process_create (…) { 40 K spin_unlock(superblock); mm_lock(process->lock); } Degrading performance … mm_unlock(process->lock); 20 K due to inefficient locks rename_file (…) { } write_lock(global_rename_lock); … 0 K processs_schedule (…) { write_unlock(global_rename_lock); 1 2 4 24 48 72 96 120 144 168 192 spin_lock(run_queue->lock); } … spin_unlock(run_queue->lock); # threads } 16

Synchronization primitive: Locks • Provide mutual exclusion among tasks • Guard shared resource • Mutex, readers-writer lock, spinlock Threads want to modify the data structure Threads wait for their turn by either spinning or sleeping Lock protects the access to the data structure 17

Locks: MOST WIDELY used primitive 200 # lock API() calls (x1000) 180 160 140 4X 120 100 80 60 40 20 2002 2019 Linux kernel More locks are in use to improve OS scalability 18

Locks are used in a complicated manner A system call can acquire up to 12 locks (average of 4) 19

Issue with current lock designs Design specific locks for hardware and software requirements Ticket lock HBO lock FC lock Cohort lock HMCS lock 1989 2006 2012 2014 2017 1991 2015 2003 2011 Backoff lock MCS lock HCLH lock CST lock RCL Malthusian lock Used in practice Best performance 22

Issue with current lock designs Design specific locks for hardware and software requirements Locks in practice: Generic Focus on simple lock, more applicability Forgo hardware characteristic Worsening throughput with more cores Locks in research: Hardware specific design High throughput for high thread count Use extra memory Suitable for pre-allocated data HW/SW policies are statically tied together 23

Incorporating HW/SW policies dynamically Scalable and practical locking algorithms 24

Two trends driving locks’ innovation Evolving hardware Applications requirement 25

Two dimensions of lock design / goals 1) High throughput In high thread count Minimize lock contentions In single thread No penalty when not contended In oversubscription Avoid bookkeeping overhead 2) Minimal lock size Scales to millions of locks Memory footprint 26

Two dimensions of lock design / goals 1) High throughput In high thread count Minimize lock contentions In single thread No penalty when not contended Application : Multi-threaded to utilize cores to improve performance In oversubscription Avoid bookkeeping overhead Lock : Minimize lock contention while maintaining high throughput 2) Minimal lock size Scales to millions of locks Memory footprint (e.g., file inode) 27

Two dimensions of lock design / goals 1) High throughput In high thread count Minimize lock contentions In single thread No penalty when not contended In oversubscription Avoid bookkeeping overhead Application : Single thread to do an operation; fine-grained locking Application: Use threads utilize cores to improve performance Locks: Minimize lock contention while maintaining high throughput Lock : Minimal or almost no lock/unlock overhead 2) Minimal lock size Memory footprint Scales to millions of locks 28

Two dimensions of lock design / goals 1) High throughput In high thread count Minimize lock contentions In single thread No penalty when not contended In oversubscription Avoid bookkeeping overhead Application : More threads than cores; common scenario; eg. I/O wait 2) Minimal lock size Lock : Minimize scheduler overhead while waking or parking threads Scales to millions of locks Memory footprint 29

Two dimensions of lock design / goals 1) High throughput In high thread count Minimize lock contentions In single thread No penalty when not contended Application : Locks are embedded in data structures; eg. file inodes In oversubscription Avoid bookkeeping overhead Lock : Can stress memory allocator or data structure alignment 2) Minimal lock size Scales to millions of locks Memory footprint 30

Locks performance: Throughput Benchmark: Each thread creates a file, a serial operation, in a shared directory 1 1 socket > 1 socket Oversubscribed Throughput collapses after one socket ● Operations / second Due to non-uniform memory access (NUMA) # threads Stock Setup: 192-core/8-socket machine 1. Understanding Manycore Scalability of File Systems [ ATC’16] 31

Locks performance: Throughput Benchmark: Each thread creates a file, a serial operation, in a shared directory 1 1 socket > 1 socket Oversubscribed Throughput collapses after one socket ● Operations / second Due to non-uniform memory access (NUMA) NUMA also affects oversubscription ● # threads Stock Prevent throughput collapse after one socket Setup: 192-core/8-socket machine 1. Understanding Manycore Scalability of File Systems [ ATC’16] 34

Existing research efforts: Hierarchical locks Goal: high throughput at high thread count ● Making locks NUMA-aware: ● Global lock Use extra memory to improve throughput ○ Socket lock Two level locks: per-socket and the global ○ Socket-1 Socket-2 Avoid NUMA overhead ● → Pass global lock within the same socket 35

Existing research efforts: Hierarchical locks Problems: ● Require extra memory allocation ○ Do not care about single thread throughput ○ Global lock Example: CST 2 ● Socket lock Allocates socket structure on first access ○ Socket-1 Socket-2 Handles oversubscription (# threads > # CPUS) ○ 2. Scalable NUMA-aware Blocking Synchronization Primitives [ ATC’17] 36

Locks performance: Throughput Benchmark: Each thread creates a file, a serial operation, in a shared directory Maintains throughput: 1 socket > 1 socket Oversubscribed ● Beyond one socket (high thread count) Operations / second In oversubscribed case (384 threads) Poor single thread throughput ● Multiple atomic instructions # threads Stock CST Setup: 192-core/8-socket machine 37

Scaling Synchronization Primitives Ph.D. Defense of Dissertation - PowerPoint PPT Presentation

Scaling Synchronization Primitives Ph.D. Defense of Dissertation Sanidhya Kashyap Rise of the multicore machines Hardware limitation: frequency stagnated CPU trends 10000 Frequency (MHz) 1000 100 Data-intensive applications performance 10

Last time Need for synchronization primitives 7: Synchronization Locks and building locks

Content Synchronization Content Synchronization March 2nd 2005 Jukka Honkola T-110.456

Outline Scaling Scalinga Plenitude of Power Laws Scaling-at-large Scaling-at-large

UP UP AND OUT: SCALING SOFTWARE WITH AKKA Jonas Bonr CTO Typesafe @jboner Scaling software

RenderMan Primitives RenderMan Primitives CSCD 472? Slide 1 4/5/10 Primitive Attributes

Analysis of Scaling Algorithms for Matrix & Operator Scaling Contents Scaling Algorithms

Other Synchronization Primitives Prof. Patrick G. Bridges 1 University of New Mexico Infinitely

Last class: Synchronization Problems and Primitives Today: Synchonization Solutions

Multiprocessor Synchronization Multiprocessor Systems Memory Consistency

Effectively Scaling Effectively Scaling up/universalizing exclusive up/universalizing exclusive

Scaling From simple models to rich strategies PPPLab Day, November 30th Scaling: recent

Outline Scalinga Plenitude of Power Laws Scaling-at-large Scaling-at-large Principles of

Verilog HDL:Digital Design and Modeling Chapter 6 User-Defined Primitives Chapter 6

Implementing new Topology Mapping Primitives Guillermo Baltra Prior Work Primitives for

Beyond Block I/O: Rethinking / Traditional Storage Primitives Traditional Storage Primitives

Clock Synchronization Synchronization Clock Henrik Lnn Electronics & Software Volvo

Lock Inference for Java Khilan Gudka Imperial College London Supervised by Professor Susan

17 Locking Intro to Database Systems Andy Pavlo AP AP 15-445/15-645 Computer Science

Information-Centric Networking: Overview, Current State and Key Challenges Prof. George Pavlou

Seeing through Website Privacy Policies Rishiraj Saha Roy Max Planck Institute for Informatics

ECE 3574: Applied Software Design Thread Synchronization Today we are going to look at how to

Lecture 8: Threads Lisa (Ling) Liu What is a thread? A thread is an independent execution path,

Higher Level Synchronization 9A. Practical Problems locking and waiting Operating Systems

Synchronization 2: Locks (part 2), Mutexes 1 load/store reordering recall: out-of-order

Scaling Synchronization Primitives Ph.D. Defense of Dissertation - PowerPoint PPT Presentation

Scaling Synchronization Primitives Ph.D. Defense of Dissertation Sanidhya Kashyap Rise of the multicore machines Hardware limitation: frequency stagnated CPU trends 10000 Frequency (MHz) 1000 100 Data-intensive applications performance 10

Last time Need for synchronization primitives 7: Synchronization Locks and building locks

Content Synchronization Content Synchronization March 2nd 2005 Jukka Honkola T-110.456

Outline Scaling Scalinga Plenitude of Power Laws Scaling-at-large Scaling-at-large

UP UP AND OUT: SCALING SOFTWARE WITH AKKA Jonas Bonr CTO Typesafe @jboner Scaling software

RenderMan Primitives RenderMan Primitives CSCD 472? Slide 1 4/5/10 Primitive Attributes

Analysis of Scaling Algorithms for Matrix &amp; Operator Scaling Contents Scaling Algorithms

Other Synchronization Primitives Prof. Patrick G. Bridges 1 University of New Mexico Infinitely

Last class: Synchronization Problems and Primitives Today: Synchonization Solutions

Multiprocessor Synchronization Multiprocessor Systems Memory Consistency

Effectively Scaling Effectively Scaling up/universalizing exclusive up/universalizing exclusive

Scaling From simple models to rich strategies PPPLab Day, November 30th Scaling: recent

Outline Scalinga Plenitude of Power Laws Scaling-at-large Scaling-at-large Principles of

Verilog HDL:Digital Design and Modeling Chapter 6 User-Defined Primitives Chapter 6

Implementing new Topology Mapping Primitives Guillermo Baltra Prior Work Primitives for

Beyond Block I/O: Rethinking / Traditional Storage Primitives Traditional Storage Primitives

Clock Synchronization Synchronization Clock Henrik Lnn Electronics &amp; Software Volvo

Lock Inference for Java Khilan Gudka Imperial College London Supervised by Professor Susan

17 Locking Intro to Database Systems Andy Pavlo AP AP 15-445/15-645 Computer Science

Information-Centric Networking: Overview, Current State and Key Challenges Prof. George Pavlou

Seeing through Website Privacy Policies Rishiraj Saha Roy Max Planck Institute for Informatics

ECE 3574: Applied Software Design Thread Synchronization Today we are going to look at how to

Lecture 8: Threads Lisa (Ling) Liu What is a thread? A thread is an independent execution path,

Higher Level Synchronization 9A. Practical Problems locking and waiting Operating Systems

Synchronization 2: Locks (part 2), Mutexes 1 load/store reordering recall: out-of-order

Analysis of Scaling Algorithms for Matrix & Operator Scaling Contents Scaling Algorithms

Clock Synchronization Synchronization Clock Henrik Lnn Electronics & Software Volvo