Shuffling: A Lock Contention Aware Thread Scheduling Technique - PowerPoint PPT Presentation

Shuffling: A Lock Contention Aware Thread Scheduling Technique
Kishore Pusukuri

Multicores are Ubiquitous
• Deliver computing power via parallelism
• Potential for delivering high performance for multithreaded applications
Examples: mobile phones, Oracle SPARC M7-8

Complexity of Achieving High Performance
Operating System Policies
• Thread Scheduling
• Memory Management
Application Characteristics
• Degree of Parallelism
• Lock Contention
• Memory Requirements
Architecture
• Cache Hierarchy
• Cross-chip Interconnect Protocols

Modern Operating Systems
Improve System Utilization and Provide Fairness
• Thread Scheduling: Time Share → Fairness
• Memory Allocation: Next → Data Locality
Do not consider relationships between threads of a multithreaded application
Application characteristics should be considered

OS Load Balancing vs Lock Contention
• OS load balancing is oblivious to lock contention
• Performance of a multithreaded program with high lock contention is sensitive to the distribution of threads across sockets
• Inappropriate distribution of threads → increases frequency of lock transfers
• Increases lock acquisition latencies
• Increases LLC misses in the critical path

Outline
• Introduction
• Motivation
• Shuffling Framework
• Experimental Results

Lock Contention Study
Lock contention is an important performance limiting factor
23 programs (pthreads)
- SPEC JBB2005
- PARSEC
- SPEC OMP2001
- SPLASH 2x
Run with 64 threads
64-core machine: four 16-core sockets (AMD Opteron)

Lock Contention on Performance
Lock time: the percentage of elapsed time a process spends waiting for lock operations in user space
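The lock time definition above can be written as a one-line calculation. This is a minimal sketch, not the measurement tooling used in the study: the function name and both parameters are hypothetical, and the inputs are assumed to be already measured in seconds.

```python
def lock_time_percent(lock_wait_seconds: float, elapsed_seconds: float) -> float:
    """Percentage of elapsed time spent waiting for user-space lock operations.

    Both arguments are hypothetical inputs; how they are sampled is not
    specified in the slides.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * lock_wait_seconds / elapsed_seconds


# Illustrative numbers only: 43 s of lock waiting in a 50 s run.
print(lock_time_percent(43.0, 50.0))  # 86.0
```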

Lock Transfers
Acquire Lock → Execute Critical Section → Release Lock
Overhead of Lock Transfer:
• T_low → Lock transfers between threads located on the same socket
• T_high → Lock transfers between threads located on different sockets
e.g.: bodytrack (BT) with 64 threads

Lock Transfer   Solaris
T_low           31%
T_high          69%
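One way to arrive at a T_low / T_high breakdown like the one above is to classify each hand-off of the lock by whether the previous and next holder run on the same socket. The sketch below assumes a hypothetical trace, `holder_sockets`, listing the socket ID of each successive lock holder; the slides do not say how the transfers were actually counted.

```python
from collections import Counter


def transfer_breakdown(holder_sockets):
    """Return the percentage of same-socket (T_low) and cross-socket (T_high)
    lock transfers in a sequence of lock-holder socket IDs (hypothetical trace).
    """
    counts = Counter()
    # Each adjacent pair of holders is one lock transfer.
    for prev, cur in zip(holder_sockets, holder_sockets[1:]):
        counts["T_low" if prev == cur else "T_high"] += 1
    total = counts["T_low"] + counts["T_high"]
    if total == 0:
        return {"T_low": 0.0, "T_high": 0.0}
    return {kind: 100.0 * counts[kind] / total for kind in ("T_low", "T_high")}


# Illustrative trace on a 4-socket machine: 3 same-socket and 5 cross-socket transfers.
print(transfer_breakdown([0, 0, 1, 3, 3, 2, 0, 1, 1]))
# {'T_low': 37.5, 'T_high': 62.5}
```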

High Frequency of LLC misses & Its Cause
BT with 64 threads
• Lock arrival times spread across a wide interval
• The likelihood of the lock being acquired by a thread on a different socket is very high
[Figure: Lock arrival times of threads per socket at the entry of a lock within a 100 ms time interval]

Thread Shuffling [PACT 2014]
• Minimize variation in lock arrival times of threads
• Schedule threads whose lock arrival times are clustered in a small time interval
• Once a thread releases the lock, it is highly likely that another thread on the same socket will successfully acquire the lock

Thread Shuffling (algorithm)
Input: N → Number of Threads; S → Number of Sockets
repeat
    1. Monitor Threads: sample lock times of N threads
    if lock times exceed threshold then
        2. Form Thread Groups: sort threads according to lock times and divide them into S groups
        3. Perform Shuffling: shuffle threads to establish newly computed groups
until (application terminates)
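Steps 2 and 3 of the algorithm above can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: the names `form_thread_groups` and `shuffle_step` are hypothetical, the threshold test is assumed to compare the mean sampled lock time against the threshold, the sort is assumed to be descending, and the actual migration of threads to sockets is left out.

```python
def form_thread_groups(lock_times, num_sockets):
    """Sort thread IDs by sampled lock time and split them into num_sockets
    contiguous groups of (nearly) equal size.

    lock_times maps thread ID -> sampled lock time (hypothetical input).
    Descending order is an assumption; the slides only say "sort".
    """
    ordered = sorted(lock_times, key=lock_times.get, reverse=True)
    base, extra = divmod(len(ordered), num_sockets)
    groups, start = [], 0
    for socket in range(num_sockets):
        size = base + (1 if socket < extra else 0)
        groups.append(ordered[start:start + size])
        start += size
    return groups


def shuffle_step(lock_times, num_sockets, threshold):
    """One iteration of the loop: return a thread -> socket placement, or None
    when the mean lock time does not exceed the threshold (assumed test)."""
    mean_lock_time = sum(lock_times.values()) / len(lock_times)
    if mean_lock_time <= threshold:
        return None
    placement = {}
    for socket, group in enumerate(form_thread_groups(lock_times, num_sockets)):
        for thread_id in group:
            placement[thread_id] = socket  # binding the thread is out of scope here
    return placement


# Illustrative sample: 8 threads, 4 sockets, lock times in percent.
sample = {0: 80.0, 1: 10.0, 2: 75.0, 3: 12.0, 4: 40.0, 5: 42.0, 6: 78.0, 7: 9.0}
print(form_thread_groups(sample, 4))  # [[0, 6], [2, 5], [4, 3], [1, 7]]
```

Threads with similar lock times end up on the same socket, which is the grouping the slide describes; a real scheduler would then bind each group to the cores of its socket and repeat the loop at the next shuffling interval.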

Shuffling Interval
Impacts:
• Lock transfers between sockets
• LLC misses
500 ms used as the shuffling interval
[Figure: BT: LLC miss rate vs shuffling interval]

Shuffling Overhead: Negligible
• Frequency of monitoring and shuffling
• Overhead is negligible (< 1% of system time)

Lock Transfers: Solaris vs Shuffling

BT              Shuffling   Solaris
LLC miss rate   1.9         3.3
Lock time       86%         72%

Lock Transfer   Shuffling   Solaris
T_low           46%         31%
T_high          54%         69%

Thread Lock Arrival-time Ranges

Lock contention & LLC miss rate
• Reduces lock contention & LLC misses

Evaluating Thread Shuffling (cont.)
Performance improvement relative to Solaris:
• Up to 54%
• Avg. 13%
• Memcached: 17%
• TATP: 28%
Compared techniques:
• DINO: only considers LLC misses
• PSets: binding a pool of threads to a pool of cores

Conclusions
Problem: OS thread scheduling is oblivious to lock contention and fails to maximize performance of multithreaded applications on multicore multiprocessor systems
Idea: Minimize variation in lock arrival times of threads
Advantages:
• Improves performance by 13% on average (max of 54%)
• No need to modify application source code
