Shuffling: A Lock Contention Aware Thread Scheduling Technique



slide-1
SLIDE 1

Shuffling: A Lock Contention Aware Thread Scheduling Technique

Kishore Pusukuri

slide-2
SLIDE 2

Multicores are Ubiquitous

  • Deliver computing power via parallelism
  • Potential for delivering high performance for multithreaded applications

Examples: mobile phones, Oracle SPARC M7-8

slide-3
SLIDE 3

Complexity of Achieving High Performance

Application Characteristics

  • Degree of Parallelism
  • Lock Contention
  • Memory Requirements

Operating System Policies

  • Thread Scheduling
  • Memory Management

Architecture

  • Cache Hierarchy
  • Cross-chip Interconnect Protocols

slide-4
SLIDE 4

Modern Operating Systems

  • Thread Scheduling: Time Share → Fairness
  • Memory Allocation: Next → Data Locality

Improve system utilization and provide fairness

Do not consider relationships between threads of a multithreaded application

Application characteristics should be considered

slide-5
SLIDE 5

OS Load Balancing vs Lock Contention

  • OS load balancing is oblivious to lock contention
  • Performance of a multithreaded program with high lock contention is sensitive to the distribution of threads across sockets
  • Inappropriate distribution of threads → increases frequency of lock transfers
  • Increases lock acquisition latencies
  • Increases LLC misses in the critical path

slide-6
SLIDE 6

Outline

Introduction Motivation Shuffling Framework Experimental Results

slide-7
SLIDE 7

Lock Contention Study

Run with 64 threads

64-core machine: four 16-core sockets (AMD Opteron)

23 programs (pthreads)

  • SPEC JBB2005
  • PARSEC
  • SPEC OMP2001
  • SPLASH 2x

Lock contention is an important performance limiting factor

slide-8
SLIDE 8

Lock Contention on Performance

Lock time: the percentage of elapsed time a process spends waiting for lock operations in user space
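
As a small illustration of the definition above, the metric can be computed from two measurements. This is only a sketch: the function name and its parameters are hypothetical, and how the waiting time is actually collected is not stated on the slide.

```python
def lock_time_percent(lock_wait_seconds: float, elapsed_seconds: float) -> float:
    """Percentage of elapsed time spent waiting for user-space lock operations.

    Both arguments are assumed to come from a profiler; they are
    illustrative inputs, not values taken from the presentation.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * lock_wait_seconds / elapsed_seconds
```

For example, a process that waits 43 s for locks during a 50 s run has a lock time of 86%.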

slide-9
SLIDE 9

Lock Transfers

Overhead of Lock Transfer:

  • T_low → Lock transfers between threads located on the same socket
  • T_high → Lock transfers between threads located on different sockets

Lock transfers under Solaris, e.g., bodytrack (BT) with 64 threads:

Lock Transfer   Solaris
T_low           31%
T_high          69%

Acquire Lock → Execute Critical Section → Release Lock
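
The T_low / T_high split can be illustrated with a short sketch that classifies consecutive acquisitions of one lock. The function name, the trace format, and the rule that a thread re-acquiring its own lock counts as no transfer are all assumptions made for illustration; the slide does not say how the transfers were measured.

```python
def classify_lock_transfers(acquirers, socket_of):
    """Count same-socket (T_low) and cross-socket (T_high) lock transfers.

    acquirers: thread ids in the order they acquired a single lock
               (hypothetical trace format).
    socket_of: mapping from thread id to the socket it runs on.
    """
    counts = {"T_low": 0, "T_high": 0}
    for prev, cur in zip(acquirers, acquirers[1:]):
        if prev == cur:
            continue  # assumption: re-acquisition by the same thread is not a transfer
        if socket_of[prev] == socket_of[cur]:
            counts["T_low"] += 1
        else:
            counts["T_high"] += 1
    return counts


def as_percentages(counts):
    """Convert transfer counts to percentages of all transfers."""
    total = sum(counts.values())
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}
```

Feeding such a trace through `as_percentages` would yield figures in the same form as the table above.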

slide-10
SLIDE 10

High Frequency of LLC misses & Its Cause

Lock arrival times of threads per socket at the entry of a lock within a 100 ms time interval (BT with 64 threads)

  • Lock arrival times are spread across a wide interval
  • The likelihood of the lock being acquired by a thread on a different socket is very high
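
One way to quantify how widely the arrival times are spread on each socket is sketched below. The function name and input format are hypothetical, and the sample values in the usage note are made up for illustration rather than taken from the BT measurements.

```python
def arrival_time_ranges(arrivals_ms, socket_of):
    """Return {socket: (earliest, latest, spread)} of lock arrival times.

    arrivals_ms: mapping from thread id to its lock arrival time in ms
                 within one sampling window (hypothetical input).
    socket_of:   mapping from thread id to the socket it runs on.
    """
    ranges = {}
    for thread, t in arrivals_ms.items():
        s = socket_of[thread]
        lo, hi = ranges.get(s, (t, t))
        ranges[s] = (min(lo, t), max(hi, t))
    return {s: (lo, hi, hi - lo) for s, (lo, hi) in ranges.items()}
```

With two threads on socket 0 arriving at 5 ms and 95 ms, and two on socket 1 arriving at 40 ms and 42 ms, socket 0 has a spread of 90 ms while socket 1 has a spread of 2 ms. A narrow spread is the situation in which the next acquirer is likely to be on the same socket.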

slide-11
SLIDE 11

Outline

Introduction Motivation Shuffling Framework Experimental Results

slide-12
SLIDE 12

Thread Shuffling [ PACT 2014 ]

  • Schedule threads whose lock arrival times are clustered in a small time interval on the same socket
  • Once a thread releases the lock, it is highly likely that another thread on the same socket will successfully acquire the lock
  • Minimize variation in lock arrival times of threads

slide-13
SLIDE 13

Thread Shuffling (algorithm)

Input: N → Number of Threads; S → Number of Sockets

repeat
    1. Monitor Threads – sample lock times of N threads
    if lock times exceed threshold then
        2. Form Thread Groups – sort threads according to lock times and divide them into S groups
        3. Perform Shuffling – shuffle threads to establish newly computed groups
until (application terminates)
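
Steps 2 and 3 of the loop above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the threshold test is read here as "mean sampled lock time exceeds the threshold" (the slide does not say how lock times are aggregated), and the sketch only computes a thread-to-socket placement instead of actually migrating threads.

```python
from statistics import mean


def form_thread_groups(lock_times, num_sockets):
    """Sort threads by lock time and split them into num_sockets groups.

    lock_times: mapping from thread id to its sampled lock time.
    Groups are contiguous slices of the sorted order, so threads with
    similar lock times end up together; sizes differ by at most one.
    """
    ordered = sorted(lock_times, key=lock_times.get, reverse=True)
    base, extra = divmod(len(ordered), num_sockets)
    groups, start = [], 0
    for s in range(num_sockets):
        size = base + (1 if s < extra else 0)
        groups.append(ordered[start:start + size])
        start += size
    return groups


def plan_shuffle(lock_times, num_sockets, threshold):
    """Return {thread: socket} for one shuffling interval, or None.

    Assumption: shuffling is triggered when the mean lock time
    exceeds the threshold.
    """
    if mean(lock_times.values()) <= threshold:
        return None  # contention too low: leave the OS placement alone
    groups = form_thread_groups(lock_times, num_sockets)
    return {t: s for s, group in enumerate(groups) for t in group}
```

A driver would call `plan_shuffle` once per shuffling interval and, when a placement is returned, bind each thread to a core on its assigned socket using whatever affinity mechanism the platform provides.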

slide-14
SLIDE 14

Shuffling Interval

Impacts: lock transfers between sockets, LLC misses

BT: LLC miss rate vs Shuffling interval

500 ms is used as the shuffling interval

slide-15
SLIDE 15

Shuffling Overhead Negligible

Overhead is negligible (< 1% of system time)

Frequency of monitoring and shuffling

slide-16
SLIDE 16

Lock Transfers: Solaris vs Shuffling

BT              Shuffling   Solaris
T_low           46%         31%
T_high          54%         69%
LLC miss rate   1.9         3.3
Lock time       72%         86%

slide-17
SLIDE 17

Thread Lock Arrival-time Ranges

slide-18
SLIDE 18

Lock contention & LLC miss rate

Reduces Lock contention & LLC misses

slide-19
SLIDE 19

Evaluating Thread Shuffling (cont.)

  • DINO: only considers LLC misses
  • PSets: binding a pool of threads to a pool of cores

Relative to Solaris:

  • Up to 54%
  • Avg. 13%

Memcached: 17%; TATP: 28%

slide-20
SLIDE 20

Conclusions

Problem: OS thread scheduling is oblivious to lock contention and fails to maximize performance of multithreaded applications on multicore multiprocessor systems

Idea: Minimize variation in lock arrival times of threads

Advantages:

  • Improves performance by 13% on average (max of 54%)
  • No need to modify application source code