SLIDE 1

Avoiding Scheduler Subversion using Scheduler-Cooperative Locks

Yuvraj Patel, Leon Yang*, Leo Arulraj+, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift
University of Wisconsin-Madison

* Now at Facebook, + Now at Cohesity

SLIDE 2

Competitive environment

[Figure: example use cases of modern data centers. Containers: App 1 and App 2, each with its Bins/Lib, on a container engine, operating system, and physical infrastructure. Virtual machines: VM1 and VM2, each running an app with its Bins/Lib on a guest OS, on a hypervisor and physical infrastructure. Clients: C1 and C2.]

Every container/VM/user expects their desired share of resources
Schedulers play an important role in fulfilling these expectations
CPU schedulers are important for CPU allocation
The majority of systems are concurrent systems protected by locks

SLIDE 4

The problem – Scheduler Subversion

Accessing locks can lead to a new problem: “scheduler subversion”
Locks determine CPU allocation instead of the scheduler

2 processes: P0 and P1
Default priority
P0 holds the lock twice as long as P1
Ticket lock: acquisition fairness
Linux CFS scheduler

[Figure: CPU allocation of P0 and P1, expected vs. observed.]

Observed CPU allocation aligns with lock usage
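
The setup above can be reproduced with a small toy model. The sketch below assumes that both processes re-request the lock immediately after releasing it, so a FIFO ticket lock serves them in strict alternation. The function name `simulate_ticket_lock` and the time units are illustrative and do not come from the slides.

```python
from collections import deque


def simulate_ticket_lock(cs_lengths, total_time):
    """Return the total lock hold time of each process under a FIFO ticket lock.

    Assumption: every process takes a new ticket right after releasing the
    lock, so acquisitions alternate fairly while hold times do not.
    """
    queue = deque(range(len(cs_lengths)))  # ticket order
    held = [0.0] * len(cs_lengths)
    now = 0.0
    while now < total_time:
        pid = queue.popleft()          # next ticket holder gets the lock
        held[pid] += cs_lengths[pid]   # it runs its whole critical section
        now += cs_lengths[pid]
        queue.append(pid)              # and queues up again immediately
    return held


# P0 holds the lock twice as long as P1 (arbitrary time units).
held = simulate_ticket_lock(cs_lengths=[2.0, 1.0], total_time=3000.0)
shares = [h / sum(held) for h in held]
print(shares)
```

Both processes acquire the lock equally often in this model, yet P0 ends up with two thirds of the lock hold time, which is the imbalance the slide attributes to acquisition fairness.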

SLIDE 5

The solution – Scheduler-Cooperative Locks

Scheduler-Cooperative Locks (SCLs) guarantee lock usage fairness by aligning with scheduling goals
Three important design components to build SCLs
  Track lock usage
  Penalize dominant users
  Provide a dedicated window of opportunity to every user
Implementation - Two user-space locks and one kernel lock
Evaluation
  Correctness - Allocate lock usage according to the scheduling goals even in extreme cases
  Performance - Efficient and scalable
  Usefulness - Apply SCLs to real-world systems: UpScaleDB, KyotoCabinet, Linux kernel

SLIDE 6

Introduction
The Problem – Scheduler Subversion
The Solution – Scheduler-Cooperative Locks
Evaluation
Conclusion


SLIDE 11

Lock & CPU dominance

UpScaleDB – embedded key-value database

Global mutex lock
Workload
8 threads pinned on 4 CPUs
4 threads perform insert operations
4 threads perform find operations
Default thread priority
Equal CPU allocation
Run for 120 seconds

[Figure: CPU time (seconds) per thread for find threads F1-F4 and insert threads I1-I4, split into lock hold time and wait + other.]

Nearly six times more CPU is allocated to insert threads than to find threads
Insert threads dominate lock usage

SLIDE 12

Causes of scheduler subversion

Two reasons


SLIDE 13

Reason #1 - Different critical section lengths

Threads spend varying amounts of time in the critical section
A thread dwelling longer in the critical section becomes the dominant user of the CPU

[Figure: ratio of median critical section times for various systems: LevelDB (Put/Get) and UpScaleDB (Insert/Find).]

SLIDE 14

Reason #2 - Majority locked run time

Time spent in critical section is high -> contention
The lock algorithm determines which threads are scheduled
Common case in many applications and operating systems [1, 2, 3, 4]

1. Lock–Unlock: Is That All? A Pragmatic Analysis of Locking in Software Systems. ACM Trans. Comput. Syst., 36(1), March 2019.
2. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. USENIX ATC 2012.
3. Understanding Manycore Scalability of File Systems. USENIX ATC 2016.
4. Non-scalable locks are dangerous. Linux Symposium, 2012.

SLIDE 16

Scheduler-Cooperative Locks (SCLs)

Lock opportunity
  Amount of time a thread holds the lock or could acquire the lock when it is free
  Important metric to measure lock usage fairness
Philosophy
  Prevent dominant users from acquiring the lock
  Ensure equal “lock opportunity” for every user
  Design locks that align with scheduling goals
Three important design components
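
As a rough illustration of the lock opportunity metric, the sketch below reads the one-line definition above as "own hold time plus the time the lock was free". That reading, the helper name `lock_opportunity`, and the interval format are assumptions made for this example rather than details taken from the slides.

```python
def lock_opportunity(holds, users, end_time):
    """Return each user's lock opportunity over the window [0, end_time].

    holds: non-overlapping (start, end, user) intervals during which `user`
    held the lock. Lock opportunity = own hold time + time the lock was free.
    """
    own = {user: 0.0 for user in users}
    for start, end, user in holds:
        own[user] += end - start
    idle = end_time - sum(own.values())  # time during which nobody held the lock
    return {user: own[user] + idle for user in users}


# P0 holds the lock for 2 units at a time, P1 for 1 unit; the lock is idle otherwise.
holds = [(0.0, 2.0, "P0"), (2.0, 3.0, "P1"), (4.0, 6.0, "P0"), (6.0, 7.0, "P1")]
opportunity = lock_opportunity(holds, users=["P0", "P1"], end_time=8.0)
print(opportunity)
```

Under this reading, unequal values across users indicate that one user had more chances to use the lock than another.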

SLIDE 20

#1 - Track lock usage

Track time spent in critical section
Tracking helps to identify dominant users
Tracking is flexible
  Any schedulable entity, such as threads, processes, or containers
  Type of work: readers or writers

scl_lock() {
    …..
    lock.start_cs = now()
}

scl_unlock() {
    …..
    end_cs = now()
    cs_time = end_cs - lock.start_cs
    …..
}
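
The pseudocode above elides most of the lock logic. A minimal runnable sketch of the tracking step alone, built on Python's `threading.Lock`, might look as follows. The class name `UsageTrackingLock`, the `user` argument, and the injectable `clock` are assumptions made for illustration; the actual SCL implementations are not shown in the slides.

```python
import threading
import time
from collections import defaultdict


class UsageTrackingLock:
    """Mutex wrapper that accumulates critical-section time per user."""

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._clock = clock
        self._start_cs = 0.0
        # user id -> total time spent inside the critical section
        self.usage = defaultdict(float)

    def acquire(self, user):
        self._lock.acquire()
        self._start_cs = self._clock()  # critical section starts here

    def release(self, user):
        cs_time = self._clock() - self._start_cs
        self.usage[user] += cs_time     # updated while still holding the lock
        self._lock.release()


lock = UsageTrackingLock()
lock.acquire("writer")
# ... critical section ...
lock.release("writer")
print(dict(lock.usage))
```

The `user` key can be any schedulable entity or class of work, for example a thread id, a container name, or the strings "reader" and "writer".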

SLIDE 23

#2 – Penalize users

Penalize dominant users
  Penalty calculated while releasing the lock
  Penalty applied while acquiring the lock
  Prevent the penalized user from acquiring the lock
Penalty based on scheduling goals

scl_lock() {
    if (penalty) {
        sleep-until-penalty-time
    }
    …..
    lock.start_cs = now()
}

scl_unlock() {
    …..
    end_cs = now()
    cs_time = end_cs - lock.start_cs
    calculate penalty, penalty-time
    …..
}
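
The slides do not give the penalty formula. As a hedged sketch, assume proportional-share weights and ban a user after each critical section for long enough that its long-run share of lock time cannot exceed `weight / total_weight`. The class name `PenalizingLock`, the formula, and the injectable `clock` and `sleep` are all assumptions for this example.

```python
import threading
import time


class PenalizingLock:
    """Mutex that delays users who have just consumed lock time."""

    def __init__(self, weights, clock=time.monotonic, sleep=time.sleep):
        self._lock = threading.Lock()
        self._weights = dict(weights)            # user -> scheduling weight
        self._total = sum(self._weights.values())
        self._clock = clock
        self._sleep = sleep
        self._start_cs = 0.0
        self.penalty_until = {user: 0.0 for user in self._weights}

    def acquire(self, user):
        # Penalty is applied while acquiring: wait until the ban has expired.
        wait = self.penalty_until[user] - self._clock()
        if wait > 0:
            self._sleep(wait)
        self._lock.acquire()
        self._start_cs = self._clock()

    def release(self, user):
        # Penalty is calculated while releasing (assumed formula, see lead-in).
        end_cs = self._clock()
        cs_time = end_cs - self._start_cs
        weight = self._weights[user]
        ban = cs_time * (self._total - weight) / weight
        self.penalty_until[user] = end_cs + ban
        self._lock.release()


lock = PenalizingLock({"P0": 1, "P1": 1})
lock.acquire("P0")
lock.release("P0")
```

With two equally weighted users, a user that holds the lock for 2 ms is kept away for another 2 ms, which leaves the other user an equal window.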

SLIDE 31

#3 – Dedicated window of opportunity

Lock slice – dedicated window of opportunity to every user
  Fixed-size virtual critical section
  Transferred to the next owner based on scheduling policy
Owner can acquire the lock multiple times within a slice without penalty
Slice ownership alternates between users

[Figure: timeline for P0 and P1 showing consecutive lock slices of 2 ms each.]

Slice owner is lock owner
Lock acquisition is fast-pathed, improving throughput
Slice ownership transferred to P1
Size of individual critical sections can vary
Wait times depend on lock slice size
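
To see how lock slices change the outcome, the toy model below hands a fixed-length slice to each user in turn and lets the owner re-acquire the lock freely until the slice expires. Round-robin handoff, the function name `simulate_lock_slices`, and the rule that a critical section started before expiry is allowed to finish are assumptions of this sketch, not details from the slides.

```python
def simulate_lock_slices(cs_lengths, slice_len, total_time):
    """Return the total lock hold time of each user under round-robin lock slices."""
    held = [0.0] * len(cs_lengths)
    now = 0.0
    owner = 0
    while now < total_time:
        slice_end = now + slice_len
        # Fast path: the slice owner keeps acquiring until its slice expires.
        while now < slice_end:
            held[owner] += cs_lengths[owner]
            now += cs_lengths[owner]
        # Slice expired: transfer ownership to the next user.
        owner = (owner + 1) % len(cs_lengths)
    return held


# P0's critical section is twice as long as P1's; slices are 2 time units long.
held = simulate_lock_slices(cs_lengths=[0.5, 0.25], slice_len=2.0, total_time=400.0)
print(held)
```

In this model P1 fits twice as many acquisitions into each slice as P0, and both users end up with the same total lock hold time even though their critical sections differ in length.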

SLIDE 32

SCLs Implementation

Three different implementations
  u-SCL – User-space mutex replacement
  RW-SCL – Reader-Writer Scheduler-Cooperative Lock
  k-SCL – Kernel version of u-SCL
New and existing optimization techniques
  u-SCL
    Spin-and-park – To minimize CPU time spent while waiting
    Next-thread prefetch – Next owner ready before slice ownership handoff
  RW-SCL
    Per-NUMA-node counters
More details in paper
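
The general idea behind spin-and-park can be sketched with Python's `threading.Lock`: try the lock a bounded number of times without blocking, then fall back to a blocking acquire so the waiter stops consuming CPU. This shows the generic technique only. The function name, the `spin_attempts` parameter, and the return values are assumptions, and the u-SCL implementation itself is not shown in the slides.

```python
import threading


def spin_then_park_acquire(lock, spin_attempts=100):
    """Acquire `lock`, spinning briefly before blocking.

    Returns "spin" if a non-blocking attempt succeeded, "park" if the
    caller had to fall back to a blocking acquire.
    """
    for _ in range(spin_attempts):
        if lock.acquire(blocking=False):  # spin phase: cheap if the lock frees up soon
            return "spin"
    lock.acquire()                        # park phase: block until the lock is released
    return "park"


lock = threading.Lock()
how = spin_then_park_acquire(lock)
lock.release()
```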

SLIDE 37

Evaluation

Same UpScaleDB experiment

Workload – 8 threads (4 insert threads + 4 find threads) pinned on 4 CPUs, equal CPU allocation

[Figure: CPU time (seconds) per thread (F1-F4, I1-I4) under Mutex and under u-SCL, split into lock hold time and wait + other, with the maximum lock hold time marked. Throughput (TPUT) labels: 22.2K and 11.7K for Mutex; 35K and 695K for u-SCL.]

SLIDE 38

Results summary

Lock usage fairness – Allocate CPU proportionally even in extreme cases
Lock overhead – Efficient and scales well up to 32 CPUs
Lock slice sizes vs. performance
  Large slice size – Higher throughput
  Small slice size – Lower latency
Demonstrate real-world utility of SCLs
  Port RW-SCL to KyotoCabinet
  Replace global file-system rename lock with k-SCL in Linux kernel
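
The slice-size trade-off can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes round-robin slices that every user consumes in full and ignores critical sections that overrun a slice. The function name `slice_tradeoff` is illustrative and the numbers are not measurements from the evaluation.

```python
def slice_tradeoff(num_users, slice_len_ms):
    """Return (worst-case wait in ms, slice handoffs per second)."""
    # A user that just gave up its slice waits for every other user's slice.
    worst_case_wait_ms = (num_users - 1) * slice_len_ms
    # Each slice ends with one ownership handoff.
    handoffs_per_second = 1000.0 / slice_len_ms
    return worst_case_wait_ms, handoffs_per_second


small = slice_tradeoff(num_users=4, slice_len_ms=2)
large = slice_tradeoff(num_users=4, slice_len_ms=10)
print(small, large)
```

Under these assumptions a larger slice means fewer handoffs per second but a longer worst-case wait, which matches the direction of the throughput and latency claims above.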

SLIDE 40

Conclusion

Lock usage determines CPU allocation, subverting scheduling goals
Introduce Scheduler-Cooperative Locks (SCLs) to address the problem
Evaluation shows the performance characteristics and versatility of SCLs
Future work – Build SCLs that support other scheduling goals

Source - https://research.cs.wisc.edu/adsl/Software/

SLIDE 41

Thank you ☺
