Avoiding Scheduler Subversion using Scheduler-Cooperative Locks


slide-1
SLIDE 1

Avoiding Scheduler Subversion using Scheduler-Cooperative Locks

Yuvraj Patel, Leon Yang*, Leo Arulraj+, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Michael M. Swift
University of Wisconsin-Madison

* Now at Facebook, + Now at Cohesity

slide-2
SLIDE 2

Competitive environment

Example use-cases of modern data centers

[Figure: two deployment stacks shared by clients C1 and C2. Containers: App 1 and App 2 (each with Bins/Lib) on a Container Engine, Operating System, and Physical Infrastructure. Virtual machines: VM1 and VM2, each running an app with Bins/Lib on a Guest OS, on a Hypervisor and Physical Infrastructure.]

  • Every container/VM/user expects its desired share of resources
  • Schedulers play an important role in fulfilling these expectations
  • CPU schedulers are important for CPU allocation
  • The majority of systems are concurrent systems protected by locks

slide-4
SLIDE 4

The problem – Scheduler Subversion

  • Accessing locks can lead to a new problem - “Scheduler subversion”
  • Locks, instead of the scheduler, determine CPU allocation

  • 2 Processes – P0 & P1
  • Default priority
  • P0 holds the lock twice as long as P1
  • Ticket lock – acquisition fairness
  • Linux CFS Scheduler

[Figure: expected vs. observed CPU allocation for P0 and P1.]

Observed CPU allocation aligns with lock usage
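
The setup on this slide can be approximated with a toy simulation. This is only a sketch under simplifying assumptions that are not in the slides: both processes contend continuously, work outside the critical section is negligible, and the lock is granted strictly in arrival order (as a ticket lock would). The function name `simulate_ticket_lock` and the time units are hypothetical.

```python
from collections import deque


def simulate_ticket_lock(hold_times, total_time):
    """Return how long each process held a FIFO (ticket-style) lock.

    hold_times[i] is the fixed critical-section length of process i.
    Every process re-queues immediately after releasing, so the lock
    is never idle and is granted strictly in arrival order.
    """
    queue = deque(range(len(hold_times)))  # ticket order
    held = [0] * len(hold_times)
    now = 0
    while now < total_time:
        pid = queue.popleft()
        duration = min(hold_times[pid], total_time - now)
        held[pid] += duration
        now += duration
        queue.append(pid)  # take the next ticket right away
    return held


# P0 holds the lock twice as long as P1 (hypothetical time units).
held = simulate_ticket_lock([2, 1], 3000)
shares = [h / sum(held) for h in held]
print(shares)
```

In this model P0 ends up with about two thirds of the lock hold time even though both processes have the same priority, which is the kind of mismatch the slide calls scheduler subversion.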

slide-5
SLIDE 5

The solution – Scheduler-Cooperative Locks

  • Scheduler-Cooperative Locks (SCL) guarantee lock usage fairness by aligning with scheduling goals
  • Three important design components to build SCLs
      • Track lock usage
      • Penalize dominant users
      • Provide dedicated window of opportunity to every user
  • Implementation - Two user-space locks and one kernel lock
  • Evaluation
      • Correctness - Allocate lock usage according to the scheduling goals even in extreme cases
      • Performance - Efficient and scalable
      • Useful – Apply SCLs to real-world systems – UpScaleDB, KyotoCabinet, Linux kernel

slide-6
SLIDE 6
  • Introduction
  • The Problem – Scheduler Subversion
  • The Solution – Scheduler-Cooperative Locks
  • Evaluation
  • Conclusion


slide-11
SLIDE 11

Lock & CPU dominance

  • UpScaleDB – embedded key-value database
  • Global mutex lock
  • Workload
  • 8 threads pinned on 4 CPUs
  • 4 threads insert ops
  • 4 threads find ops
  • Default thread priority
  • Equal CPU allocation
  • Run for 120 seconds

[Figure: per-thread CPU time in seconds (axis 0 to 30) for find threads F1 to F4 and insert threads I1 to I4, split into lock hold time and wait + other.]

  • Nearly six times more CPU allocated to insert threads than find threads
  • Insert threads dominate lock usage

slide-12
SLIDE 12

Causes of scheduler subversion

  • Two reasons


slide-13
SLIDE 13

Reason #1 - Different critical section lengths

  • Threads spend varied amounts of time in the critical section
  • A thread dwelling longer in the critical section becomes the dominant user of the CPU

[Figure: ratio of median critical section times for various systems: Put/Get for LevelDB and Insert/Find for UpScaleDB.]

slide-14
SLIDE 14

Reason #2 - Majority locked run time

  • Time spent in critical section is high -> contention
  • Lock algorithm determines which threads are scheduled
  • Common case in many applications and OSes [1,2,3,4]

1. Lock–Unlock: Is That All? A Pragmatic Analysis of Locking in Software Systems. ACM Trans. Comput. Syst., 36(1), March 2019.
2. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications. USENIX ATC 2012.
3. Understanding Manycore Scalability of File Systems. USENIX ATC 2016.
4. Non-scalable locks are dangerous. Linux Symposium, 2012.

slide-15
SLIDE 15
  • Introduction
  • The Problem – Scheduler Subversion
  • The Solution – Scheduler-Cooperative Locks
  • Evaluation
  • Conclusion


slide-16
SLIDE 16

Scheduler-Cooperative Locks (SCLs)

  • Lock opportunity
      • Amount of time a thread holds the lock or could acquire the lock when it is free
      • Important metric to measure lock usage fairness
  • Philosophy
      • Prevent dominant users from acquiring the lock
      • Ensure equal “lock opportunity” to every user
      • Design locks that align with scheduling goals
  • Three important design components
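
The lock opportunity metric defined above can be computed from a trace of lock hold intervals. The sketch below reads the definition as "time the thread held the lock plus time the lock was free", which is an interpretation of the slide's wording; the trace format, the function name `lock_opportunity`, and the example numbers are hypothetical.

```python
def lock_opportunity(trace, total_time, threads):
    """Compute a per-thread lock opportunity from a hold trace.

    trace is a list of (thread, start, end) tuples describing
    non-overlapping intervals during which the lock was held.
    Opportunity = time the thread held the lock + time the lock was idle.
    """
    held = {t: 0.0 for t in threads}
    for thread, start, end in trace:
        held[thread] += end - start
    idle = total_time - sum(held.values())
    return {t: held[t] + idle for t in threads}


# Hypothetical 10-unit run: P0 holds the lock for 8 units, P1 for 1,
# and the lock is free for 1 unit.
trace = [("P0", 0, 4), ("P1", 4, 5), ("P0", 6, 10)]
lot = lock_opportunity(trace, 10, ["P0", "P1"])
print(lot)
```

Here P0 has an opportunity of 9 units and P1 only 2, so the trace is far from the equal lock opportunity that the philosophy bullets ask for.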


slide-20
SLIDE 20

#1 - Track lock usage

  • Track time spent in critical section
  • Tracking helps to identify dominant users
  • Tracking is flexible
      • Any schedulable entity such as threads, processes, containers
      • Type of work – readers or writers

scl_lock() {
    …..
    lock.start_cs = now()
}

scl_unlock() {
    …..
    end_cs = now()
    cs_time = end_cs - lock.start_cs
    …..
}
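
The tracking pseudocode above can be turned into a small runnable wrapper. This is a sketch, not the SCL implementation: the class name `TrackingLock`, the per-entity `usage` table, and the injectable `clock` parameter are assumptions made for illustration, and the elided parts of the original pseudocode are not reproduced.

```python
import threading
import time
from collections import defaultdict


class TrackingLock:
    """A mutex that records critical-section time per entity."""

    def __init__(self, clock=time.monotonic):
        self._lock = threading.Lock()
        self._clock = clock              # injectable so tests can fake time
        self._owner = None
        self._start_cs = 0.0
        self.usage = defaultdict(float)  # entity -> total time in critical section

    def acquire(self, entity):
        self._lock.acquire()
        self._owner = entity
        self._start_cs = self._clock()   # lock.start_cs = now()

    def release(self):
        cs_time = self._clock() - self._start_cs  # end_cs - lock.start_cs
        self.usage[self._owner] += cs_time
        self._owner = None
        self._lock.release()


# Deterministic demo with a fake clock (hypothetical timestamps).
ticks = iter([0.0, 3.0, 3.0, 4.0])
lock = TrackingLock(clock=lambda: next(ticks))
lock.acquire("writer")
lock.release()
lock.acquire("reader")
lock.release()
print(dict(lock.usage))
```

The entity passed to `acquire` can be a thread, a process, a container, or a class of work such as "reader" or "writer", which mirrors the flexibility described in the bullets.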

slide-23
SLIDE 23

#2 – Penalize users

  • Penalize dominant users
  • Penalty calculated while releasing lock
  • Penalty applied while acquiring lock
  • Prevent user from acquiring lock
  • Penalty based on scheduling goals

scl_lock() {
    if (penalty) {
        sleep-until-penalty-time
    }
    …..
    lock.start_cs = now()
}

scl_unlock() {
    …..
    end_cs = now()
    cs_time = end_cs - lock.start_cs
    calculate penalty, penalty-time
    …..
}
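
One way to read "penalty based on scheduling goals" is a proportional-share rule: after holding the lock for `cs_time`, a user with weight `w` out of a total weight `W` stays away for `cs_time * (W - w) / w`. That formula is an assumption chosen for illustration and is not taken from the slides. The sketch also uses explicit virtual timestamps instead of real sleeping and does not model mutual exclusion; all names are hypothetical.

```python
def penalty_time(cs_time, weight, total_weight):
    """Ban length that keeps a user's lock share near weight / total_weight."""
    return cs_time * (total_weight - weight) / weight


class PenalizingLock:
    """Virtual-time model of 'compute penalty on unlock, apply on lock'."""

    def __init__(self, weights):
        self.weights = weights
        self.total_weight = sum(weights.values())
        self.banned_until = {user: 0.0 for user in weights}
        self.start_cs = 0.0

    def earliest_grant(self, user, now):
        # Penalty applied while acquiring: wait until the ban expires.
        return max(now, self.banned_until[user])

    def lock(self, user, now):
        self.start_cs = self.earliest_grant(user, now)
        return self.start_cs

    def unlock(self, user, now):
        # Penalty calculated while releasing.
        cs_time = now - self.start_cs
        ban = penalty_time(cs_time, self.weights[user], self.total_weight)
        self.banned_until[user] = now + ban


lock = PenalizingLock({"P0": 1, "P1": 1})
lock.lock("P0", now=0.0)
lock.unlock("P0", now=2.0)                      # P0 held the lock for 2 units
p0_granted = lock.earliest_grant("P0", now=2.0)
p1_granted = lock.earliest_grant("P1", now=2.0)
print(p0_granted, p1_granted)
```

With equal weights, P0 is kept out for as long as it just held the lock, while P1 can acquire immediately.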

slide-27
SLIDE 27

#3 – Dedicated window of opportunity

  • Lock slice – dedicated window of opportunity to every user
  • Owner can acquire lock multiple times within a slice without penalty

[Figure: timeline for P0 and P1 showing one lock slice (2ms).]

  • Slice owner is lock owner
  • Lock acquisition is fast-pathed, improving throughput

slide-28
SLIDE 28

#3 – Dedicated window of opportunity

  • Lock slice – dedicated window of opportunity to every user
  • Owner can acquire lock multiple times within a slice without penalty

[Figure: timeline for P0 and P1 showing two consecutive lock slices (2ms each).]

Slice ownership transferred to P1

slide-29
SLIDE 29

#3 – Dedicated window of opportunity

  • Lock slice – dedicated window of opportunity to every user
  • Owner can acquire lock multiple times within a slice without penalty

[Figure: timeline for P0 and P1 showing two consecutive lock slices (2ms each).]

Size of individual critical section can vary

slide-30
SLIDE 30

#3 – Dedicated window of opportunity

  • Lock slice – dedicated window of opportunity to every user
  • Owner can acquire lock multiple times within a slice without penalty
  • Slice ownership alternates between users

[Figure: timeline for P0 and P1 showing three consecutive lock slices (2ms each).]

Wait times depend on lock slice size
slide-31
SLIDE 31

#3 – Dedicated window of opportunity

  • Lock slice – dedicated window of opportunity to every user
  • Owner can acquire lock multiple times within a slice without penalty
  • Slice ownership alternates between users

[Figure: timeline for P0 and P1 showing three consecutive lock slices (2ms each).]

Lock slice

  • Fixed-sized virtual critical section
  • Transferred to next owner based on scheduling policy
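
The lock slice idea can be illustrated with a toy virtual-time simulation. It is a sketch under stated assumptions: slice ownership is handed off round-robin (the slides only say the hand-off follows the scheduling policy), every user always has work to do, and a critical section started inside a slice runs to completion. The function name and the microsecond values are hypothetical, apart from the 2ms slice shown in the figure.

```python
def simulate_slices(cs_us, slice_us, total_us):
    """Return the lock hold time per user when access is granted in slices.

    cs_us[i] is the critical-section length of user i in microseconds.
    The slice owner re-acquires the lock freely until its slice expires,
    then ownership moves to the next user.
    """
    held = [0] * len(cs_us)
    now, owner = 0, 0
    while now < total_us:
        slice_end = now + slice_us
        while now < slice_end:             # fast path: owner keeps acquiring
            held[owner] += cs_us[owner]
            now += cs_us[owner]
        owner = (owner + 1) % len(cs_us)   # assumed round-robin hand-off
    return held


# P0's critical section is twice as long as P1's; slices are 2ms.
held = simulate_slices([200, 100], 2000, 1_200_000)
print(held)
```

Both users end up with the same lock hold time even though P0's critical sections are twice as long, because each gets the same slice. Under the same round-robin assumption, a user waits for roughly the slices of all other users before its turn, which is why wait times grow with the slice size.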
slide-32
SLIDE 32

SCLs Implementation

  • Three different implementations
      • u-SCL – User-space mutex replacement
      • RW-SCL – Reader-Writer Scheduler-Cooperative Lock
      • k-SCL – Kernel version of u-SCL
  • New and existing optimization techniques
      • u-SCL
          • Spin-and-park – To minimize CPU time spent while waiting
          • Next-thread prefetch – Next owner ready before slice ownership handoff
      • RW-SCL
          • Per NUMA node counters
  • More details in paper
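
As a rough illustration of the spin-and-park idea listed above, the sketch below spins for a bounded number of non-blocking attempts and then falls back to a blocking acquire, which lets the runtime put the thread to sleep. This is a generic spin-then-park pattern built on Python's `threading.Lock`, not u-SCL's implementation; the function name and the attempt counts are hypothetical.

```python
import threading


def spin_then_park_acquire(lock, spin_attempts=1000):
    """Acquire lock, spinning briefly before blocking.

    Returns "spun" if a non-blocking attempt succeeded and "parked"
    if the caller had to fall back to a blocking acquire.
    """
    for _ in range(spin_attempts):
        if lock.acquire(blocking=False):
            return "spun"
    lock.acquire()  # park: block until the holder releases
    return "parked"


# Uncontended lock: the first spin attempt succeeds.
free_lock = threading.Lock()
fast = spin_then_park_acquire(free_lock)

# Contended lock: the holder releases after 50 ms, long after spinning ends.
busy_lock = threading.Lock()
busy_lock.acquire()
threading.Timer(0.05, busy_lock.release).start()
slow = spin_then_park_acquire(busy_lock, spin_attempts=10)
print(fast, slow)
```

Bounding the spin phase keeps short waits cheap while limiting the CPU time burned during long waits, which is the trade-off the spin-and-park bullet refers to.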


slide-33
SLIDE 33
  • Introduction
  • The Problem – Scheduler Subversion
  • The Solution – Scheduler-Cooperative Locks
  • Evaluation
  • Conclusion


slide-37
SLIDE 37

Evaluation

  • Same UpScaleDB experiment

Workload – 8 threads (4 insert threads + 4 find threads) pinned on 4 CPUs, equal CPU allocation

[Figure: per-thread CPU time in seconds (axis 0 to 30) for threads F1 to F4 and I1 to I4 under Mutex and under u-SCL, split into lock hold time and wait + other, with the max lock hold time marked. Throughput labels as extracted: TPUT - 22.2K, TPUT - 35K, TPUT - 695K, TPUT - 11.7K.]

slide-38
SLIDE 38

Results summary

  • Lock usage fairness – Allocate CPU proportionally even in extreme cases
  • Lock overhead - Efficient and scales well up to 32 CPUs
  • Lock slice sizes vs. performance
      • Large slice size – Higher throughput
      • Small slice size – Low latency
  • Demonstrate real-world utility of SCLs
      • Port RW-SCL to KyotoCabinet
      • Replace global file-system rename lock with k-SCL in Linux kernel

slide-39
SLIDE 39
  • Introduction
  • The Problem – Scheduler Subversion
  • The Solution – Scheduler-Cooperative Locks
  • Evaluation
  • Conclusion


slide-40
SLIDE 40

Conclusion

  • Lock usage determines CPU allocation, subverting scheduling goals
  • Introduce Scheduler-Cooperative Locks (SCLs) to address the problem
  • Evaluation shows the performance characteristics and versatility of SCLs
  • Future work – Build SCLs that support other scheduling goals

Source - https://research.cs.wisc.edu/adsl/Software/

slide-41
SLIDE 41

Thank you ☺
