Scheduling - Don Porter - CSE 506: Operating Systems

SLIDE 1

CSE 506: Operating Systems

Scheduling

Don Porter

SLIDE 2

Logical Diagram

[Figure: course map spanning user, kernel, and hardware layers. Components shown: binary formats, memory allocators, threads, system calls, RCU, file system, networking, sync, memory management, device drivers, CPU scheduler, interrupts, disk, net, consistency. Today’s lecture: switching to CPU scheduling.]

SLIDE 3

Lecture goals

  • Understand low-level building blocks of a scheduler
  • Understand competing policy goals
  • Understand the O(1) scheduler

– CFS next lecture

  • Familiarity with standard Unix scheduling APIs

SLIDE 4

Undergrad review

  • What is cooperative multitasking?

– Processes voluntarily yield CPU when they are done

  • What is preemptive multitasking?

– OS only lets tasks run for a limited time, then forcibly context switches the CPU

  • Pros/cons?

– Cooperative gives tasks more control; so much that one task can hog the CPU forever
– Preemptive gives OS more control, more overheads/complexity

SLIDE 5

Where can we preempt a process?

  • In other words, what are the logical points at which the OS can regain control of the CPU?
  • System calls

– Before
– During (more next time on this)
– After

  • Interrupts

– Timer interrupt: ensures maximum time slice

SLIDE 6

(Linux) Terminology

  • mm_struct – represents an address space in kernel
  • task – represents a thread in the kernel

– A task points to 0 or 1 mm_structs

  • Kernel threads just “borrow” previous task’s mm, as they only execute in kernel address space

– Many tasks can point to the same mm_struct

  • Multi-threading
  • Quantum – CPU timeslice

SLIDE 7

Outline

  • Policy goals
  • Low-level mechanisms
  • O(1) Scheduler
  • CPU topologies
  • Scheduling interfaces

SLIDE 8

Policy goals

  • Fairness – everything gets a fair share of the CPU
  • Real-time deadlines

– CPU time before a deadline more valuable than time after

  • Latency vs. Throughput: Timeslice length matters!

– GUI programs should feel responsive
– CPU-bound jobs want long timeslices, better throughput

  • User priorities

– Virus scanning is nice, but I don’t want it slowing things down

SLIDE 9

No perfect solution

  • Optimizing multiple variables
  • Like memory allocation, this is best-effort

– Some workloads prefer some scheduling strategies

  • Nonetheless, some solutions are generally better than others

SLIDE 10

Context switching

  • What is it?

– Swap out the address space and running thread

  • Address space:

– Need to change page tables
– Update cr3 register on x86
– Simplified by convention that kernel is at same address range in all processes
– What would be hard about mapping kernel in different places?

SLIDE 11

Other context switching tasks

  • Swap out other register state

– Segments, debugging registers, MMX, etc.

  • If descheduling a process for the last time, reclaim its memory

  • Switch thread stacks

SLIDE 12

Switching threads

  • Programming abstraction:

/* Do some work */
schedule(); /* Something else runs */
/* Do more work */

SLIDE 13

How to switch stacks?

  • Store register state on the stack in a well-defined format

  • Carefully update stack registers to new stack

– Tricky: can’t use stack-based storage for this step!

SLIDE 14

Example

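Thread 1 (prev) switches to Thread 2 (next):

/* eax is next->thread_info.esp */
/* push general-purpose regs */
push ebp        /* save prev's frame pointer on prev's stack */
mov esp, eax    /* switch stacks: esp now points into next's stack */
pop ebp         /* restore next's frame pointer from next's stack */
/* pop other regs */

[Figure: the stacks of Thread 1 and Thread 2, each holding saved regs and ebp, with esp moving from Thread 1's stack to the address held in eax.]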

SLIDE 15

Weird code to write

  • Inside schedule(), you end up with code like:

switch_to(me, next, &last);
/* possibly clean up last */

  • Where does last come from?

– Output of switch_to
– Written on my stack by previous thread (not me)!

SLIDE 16

How to code this?

  • Pick a register (say ebx); before context switch, this is a pointer to last’s location on the stack
  • Pick a second register (say eax) to store the pointer to the currently running task (me)
  • Make sure to push ebx after eax
  • After switching stacks:

– pop ebx /* eax still points to old task */
– mov [ebx], eax /* store eax at the location ebx points to */
– pop eax /* Update eax to new task */

SLIDE 17

Outline

  • Policy goals
  • Low-level mechanisms
  • O(1) Scheduler
  • CPU topologies
  • Scheduling interfaces

SLIDE 18

Strawman scheduler

  • Organize all processes as a simple list
  • In schedule():

– Pick first one on list to run next
– Put suspended task at the end of the list

  • Problem?

– Only allows round-robin scheduling
– Can’t prioritize tasks

SLIDE 19

Even straw-ier man

  • Naïve approach to priorities:

– Scan the entire list on each run
– Or periodically reshuffle the list

  • Problems:

– Forking: where does child go?
– What about if you only use part of your quantum?

  • E.g., blocking I/O

SLIDE 20

O(1) scheduler

  • Goal: decide who to run next, independent of number of processes in system

– Still maintain ability to prioritize tasks, handle partially unused quanta, etc

SLIDE 21

O(1) Bookkeeping

  • runqueue: a list of runnable processes

– Blocked processes are not on any runqueue
– A runqueue belongs to a specific CPU
– Each task is on exactly one runqueue

  • Task only scheduled on runqueue’s CPU unless migrated
  • 2 * 40 * #CPUs runqueues

– 40 dynamic priority levels (more later)
– 2 sets of runqueues: one active and one expired

SLIDE 22

O(1) Data Structures

[Figure: two arrays of runqueues, Active and Expired, each with one queue per priority level from 100 to 139.]

SLIDE 23

O(1) Intuition

  • Take the first task off the lowest-numbered runqueue on active set

– Confusingly: a lower priority value means higher priority

  • When done, put it on appropriate runqueue on expired set
  • Once active is completely empty, swap which set of runqueues is active and expired
  • Constant time, since fixed number of queues to check; only take first item from non-empty queue
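The selection step above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the kernel's code: the names `struct cpu_rq`, `struct prio_array`, `enqueue`, and `pick_next` are hypothetical, and the sketch scans the 40 queues with a plain loop (a fixed bound, so still constant time), whereas a real implementation would likely keep a bitmap of non-empty queues and find the first set bit.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PRIOS 40    /* priority levels 100..139 */
#define PRIO_BASE 100

struct task {
    int prio;           /* 100 (highest) .. 139 (lowest) */
    struct task *next;  /* link within one runqueue */
};

struct queue { struct task *head, *tail; };

/* One set of runqueues: one FIFO queue per priority level. */
struct prio_array { struct queue q[NUM_PRIOS]; };

/* Per-CPU state: two sets, with pointers saying which is active. */
struct cpu_rq {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

static void rq_init(struct cpu_rq *rq)
{
    for (int a = 0; a < 2; a++)
        for (int i = 0; i < NUM_PRIOS; i++)
            rq->arrays[a].q[i].head = rq->arrays[a].q[i].tail = NULL;
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

/* Append a task to the tail of the queue matching its priority. */
static void enqueue(struct prio_array *pa, struct task *t)
{
    assert(t->prio >= PRIO_BASE && t->prio < PRIO_BASE + NUM_PRIOS);
    struct queue *q = &pa->q[t->prio - PRIO_BASE];
    t->next = NULL;
    if (q->tail)
        q->tail->next = t;
    else
        q->head = t;
    q->tail = t;
}

/* Remove and return the first task on the lowest-numbered non-empty
 * queue of the active set. If the active set is empty, swap active
 * and expired and look once more. Returns NULL if nothing is runnable. */
static struct task *pick_next(struct cpu_rq *rq)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < NUM_PRIOS; i++) {   /* at most 40 checks */
            struct queue *q = &rq->active->q[i];
            if (q->head) {
                struct task *t = q->head;
                q->head = t->next;
                if (!q->head)
                    q->tail = NULL;
                return t;
            }
        }
        struct prio_array *tmp = rq->active;    /* active is empty: swap */
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    return NULL;
}
```

A caller would `enqueue(rq.expired, t)` when a task's quantum runs out; the cost of `pick_next` does not depend on how many tasks are queued.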

SLIDE 24

O(1) Example

[Figure: Active and Expired runqueue arrays, one queue per priority level from 100 to 139.]

Pick first, highest priority task to run
Move to expired queue when quantum expires

SLIDE 25

What now?

[Figure: the same Active and Expired runqueue arrays, priority levels 100 to 139.]

SLIDE 26

Blocked Tasks

  • What if a program blocks on I/O, say for the disk?

– It still has part of its quantum left
– Not runnable, so don’t waste time putting it on the active or expired runqueues

  • We need a “wait queue” associated with each blockable event

– Disk, lock, pipe, network socket, etc.

SLIDE 27

Blocking Example

[Figure: Active and Expired runqueue arrays (priority levels 100 to 139) alongside a Disk wait queue.]

Block on disk! Process goes on disk wait queue

SLIDE 28

Blocked Tasks, cont.

  • A blocked task is moved to a wait queue until the expected event happens

– No longer on any active or expired queue!

  • Disk example:

– After I/O completes, interrupt handler moves task back to active runqueue

SLIDE 29

Time slice tracking

  • If a process blocks and then becomes runnable, how do we know how much time it had left?
  • Each task tracks ticks left in ‘time_slice’ field

– On each clock tick: current->time_slice--
– If time slice goes to zero, move to expired queue

  • Refill time slice
  • Schedule someone else

– An unblocked task can use balance of time slice
– Forking halves time slice with child
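The bookkeeping above might look roughly like this in C. It is a sketch under assumptions: `full_slice` is a hypothetical field holding the refill amount, `fork_split` is a hypothetical helper name, and giving the child the rounded-up half is one possible way to split an odd number of ticks.

```c
#include <assert.h>
#include <stdbool.h>

struct task {
    int time_slice;  /* ticks left in the current quantum */
    int full_slice;  /* hypothetical: ticks granted on each refill */
};

/* Called on every clock tick for the running task. Returns true when
 * the quantum is used up; the caller should then move the task to the
 * expired set and schedule someone else. */
static bool scheduler_tick(struct task *current)
{
    if (--current->time_slice <= 0) {
        current->time_slice = current->full_slice;  /* refill */
        return true;
    }
    return false;
}

/* On fork, parent and child share the parent's remaining ticks, so
 * forking cannot be used to gain extra CPU time. */
static void fork_split(struct task *parent, struct task *child)
{
    assert(parent->time_slice > 0);
    child->full_slice = parent->full_slice;
    child->time_slice = (parent->time_slice + 1) / 2;  /* rounded up */
    parent->time_slice /= 2;                           /* rounded down */
}
```

A task that blocks simply keeps its `time_slice` value while it sits on a wait queue, so it resumes with the balance when it is woken.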

SLIDE 30

More on priorities

  • 100 = highest priority
  • 139 = lowest priority
  • 120 = base priority

– “nice” value: user-specified adjustment to base priority
– Selfish (not nice) = -20 (I want to go first)
– Really nice = +19 (I will go last)
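Read as arithmetic, the mapping above is the base priority plus the nice value. A tiny C sketch, assuming the hypothetical helper name `nice_to_prio`:

```c
#include <assert.h>

#define BASE_PRIO 120   /* priority of a task with nice value 0 */

/* Map a nice value (-20..+19) onto the priority range 100..139. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return BASE_PRIO + nice;
}
```

So nice -20 lands on 100 (highest) and nice +19 on 139 (lowest).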

SLIDE 31

Base time slice

  • “Higher” priority tasks get longer time slices

– And run first

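\[ \text{time} = \begin{cases} (140 - \text{prio}) \times 20\,\text{ms} & \text{if prio} < 120 \\ (140 - \text{prio}) \times 5\,\text{ms} & \text{if prio} \ge 120 \end{cases} \]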
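To make the formula concrete, here is a direct C transcription with a few worked values. The function name `base_timeslice_ms` is hypothetical, and the result is in milliseconds.

```c
#include <assert.h>

/* Base time slice in milliseconds for a priority in 100..139.
 * Priorities above the base (numerically below 120) get the larger
 * 20 ms multiplier; the rest get 5 ms. */
static int base_timeslice_ms(int prio)
{
    assert(prio >= 100 && prio <= 139);
    if (prio < 120)
        return (140 - prio) * 20;
    return (140 - prio) * 5;
}
```

For example, priority 100 gets 800 ms, priority 119 gets 420 ms, the base priority 120 gets 100 ms, and priority 139 gets 5 ms.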

SLIDE 32

Goal: Responsive UIs

  • Most GUI programs are I/O bound on the user

– Unlikely to use entire time slice

  • Users get annoyed when they type a key and it takes a long time to appear

  • Idea: give UI programs a priority boost

– Go to front of line, run briefly, block on I/O again

  • Which ones are the UI programs?

SLIDE 33

Idea: Infer from sleep time

  • By definition, I/O bound applications spend most of their time waiting on I/O
  • We can monitor I/O wait time and infer which programs are GUI (and disk intensive)
  • Give these applications a priority boost
  • Note that this behavior can be dynamic

– Ex: GUI configures DVD ripping, then it is CPU-bound
– Scheduling should match program phases

SLIDE 34

Dynamic priority

dynamic priority = max(100, min(static priority − bonus + 5, 139))

  • Bonus is calculated based on sleep time
  • Dynamic priority determines a task’s runqueue
  • This is a heuristic to balance competing goals of CPU throughput and latency in dealing with infrequent I/O

– May not be optimal
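The clamping in the formula above can be written out in C as below. The function name `dynamic_prio` is hypothetical, and the bonus range of 0 to 10 is an assumption (it makes the `+ 5` term the midpoint, so the net adjustment runs from -5 to +5); the slides only say the bonus is derived from sleep time.

```c
#include <assert.h>

/* Dynamic priority from static priority and sleep bonus, clamped to
 * the range 100..139. Assumes bonus is in 0..10 (not stated in the
 * slides): 0 means "never sleeps", 10 means "sleeps a lot". */
static int dynamic_prio(int static_prio, int bonus)
{
    assert(bonus >= 0 && bonus <= 10);
    int prio = static_prio - bonus + 5;
    if (prio > 139)
        prio = 139;     /* min(..., 139) */
    if (prio < 100)
        prio = 100;     /* max(100, ...) */
    return prio;
}
```

Under that assumption, a nice-0 task (static priority 120) that never sleeps runs at 125, while one that sleeps a lot runs at 115.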

SLIDE 35

Dynamic Priority in O(1) Scheduler

  • Important: The runqueue a process goes in is determined by the dynamic priority, not the static priority

– Dynamic priority is mostly determined by time spent waiting, to boost UI responsiveness

  • Nice values influence static priority (directly)

– Static priority is a starting point for dynamic priority
– No matter how “nice” you are (or aren’t), you can’t boost your “bonus” without blocking on a wait queue!

SLIDE 36

Rebalancing tasks

  • As described, once a task ends up in one CPU’s runqueue, it stays on that CPU forever

SLIDE 37

Rebalancing

[Figure: the runqueues of CPU 0 and CPU 1.]

CPU 1 needs more work!

SLIDE 38

Rebalancing tasks

  • As described, once a task ends up in one CPU’s runqueue, it stays on that CPU forever
  • What if all the processes on CPU 0 exit, and all of the processes on CPU 1 fork more children?

  • We need to periodically rebalance
  • Balance overheads against benefits

– Figuring out where to move tasks isn’t free

SLIDE 39

Idea: Idle CPUs rebalance

  • If a CPU is out of runnable tasks, it should take load from busy CPUs

– Busy CPUs shouldn’t lose time finding idle CPUs to take their work if possible

  • There may not be any idle CPUs

– Overhead to figure out whether other idle CPUs exist
– Just have busy CPUs rebalance much less frequently

SLIDE 40

Average load

  • How do we measure how busy a CPU is?
  • Average number of runnable tasks over time
  • Available in /proc/loadavg

SLIDE 41

Rebalancing strategy

  • Read the loadavg of each CPU
  • Find the one with the highest loadavg
  • (Hand waving) Figure out how many tasks we could take

– If worth it, lock the CPU’s runqueues and take them
– If not, try again later
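One way to fill in the hand-waved part is sketched below in C. Everything here is an assumption made for illustration: load is approximated by an instantaneous count of runnable tasks per CPU rather than a time average, `min_imbalance` is a hypothetical "worth it" threshold, and pulling half the difference is just one simple policy. Locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the most loaded CPU other than `self` whose
 * load exceeds ours by at least min_imbalance, or -1 if no CPU is
 * loaded enough to be worth stealing from. */
static int find_busiest(const int *nr_running, size_t ncpus,
                        size_t self, int min_imbalance)
{
    assert(self < ncpus);
    int busiest = -1;
    int max_load = nr_running[self] + min_imbalance - 1;
    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        if (cpu != self && nr_running[cpu] > max_load) {
            max_load = nr_running[cpu];
            busiest = (int)cpu;
        }
    }
    return busiest;
}

/* Pull half the difference, so both CPUs end up roughly even. */
static int tasks_to_pull(int busiest_load, int my_load)
{
    return (busiest_load - my_load) / 2;
}
```

If `find_busiest` returns -1, the idle CPU tries again later; otherwise it would lock the chosen CPU's runqueues and move `tasks_to_pull(...)` tasks over.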

SLIDE 42

Why not rebalance?

  • Intuition: If things run slower on another CPU
  • Why might this happen?

– NUMA (Non-Uniform Memory Access)
– Hyper-threading
– Multi-core cache behavior

  • Vs: Symmetric Multi-Processor (SMP) – performance on all CPUs is basically the same

SLIDE 43

SMP

  • All CPUs similar, equally “close” to memory

[Figure: CPU0, CPU1, CPU2, and CPU3 all connected to a single Memory.]

SLIDE 44

NUMA

  • Want to keep execution near memory; higher migration costs

[Figure: CPU0 through CPU3 split across two nodes, each node with its own Memory.]

SLIDE 45

Scheduling Domains

  • General abstraction for CPU topology
  • “Tree” of CPUs

– Each leaf node contains a group of “close” CPUs

  • When an idle CPU rebalances, it starts at leaf node and works up to the root

– Most rebalancing within the leaf
– Higher threshold to rebalance across a parent

SLIDE 46

SMP Scheduling Domain

[Figure: a single scheduling domain containing CPU0, CPU1, CPU2, and CPU3.]

Flat, all CPUs equivalent!

SLIDE 47

NUMA Scheduling Domains

[Figure: tree of NUMA scheduling domains over CPU0, CPU1, CPU2, and CPU3.]

CPU0 starts rebalancing here first
Higher threshold to move to sibling/parent

SLIDE 48

Hyper-threading

  • Precursor to multi-core

– A few more transistors than Intel knew what to do with, but not enough to build a second core on a chip yet

  • Duplicate architectural state (registers, etc), but not execution resources (ALU, floating point, etc)
  • OS view: 2 logical CPUs
  • CPU: pipeline bubble in one “CPU” can be filled with operations from another; yielding higher utilization

SLIDE 49

Hyper-threaded scheduling

  • Imagine 2 hyper-threaded CPUs

– 4 Logical CPUs – But only 2 CPUs-worth of power

  • Suppose I have 2 tasks

– They will do much better on 2 different physical CPUs than sharing one physical CPU

  • They will also contend for space in the cache

– Less of a problem for threads in same program. Why?

SLIDE 50

NUMA + Hyperthreading Domains

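[Figure: tree of scheduling domains over logical CPUs CPU0 through CPU7. Each physical CPU is a sched domain grouping its logical CPUs, and the physical CPUs are grouped into two NUMA domains (both labeled “NUMA DOMAIN 1” on the slide).]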

SLIDE 51

Multi-core

  • More levels of caches
  • Migration among CPUs sharing a cache preferable

– Why?
– More likely to keep data in cache

  • Scheduling domains based on shared caches

– E.g., cores on same chip are in one domain

SLIDE 52

Outline

  • Policy goals
  • Low-level mechanisms
  • O(1) Scheduler
  • CPU topologies
  • Scheduling interfaces

SLIDE 53

Setting priorities

  • setpriority(which, who, niceval) and getpriority()

– Which: process, process group, or user id
– Who: PID, PGID, or UID
– Niceval: -20 to +19 (recall earlier)

  • nice(niceval)

– Historical interface (backwards compatible)
– Adds niceval to the caller’s current nice value; starting from a nice value of 0, equivalent to:

  • setpriority(PRIO_PROCESS, getpid(), niceval)
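A short usage sketch of these calls in C follows. The helper name `make_nicer` is hypothetical; the calls themselves are the standard `getpriority` and `setpriority` from `<sys/resource.h>`. Because -1 is a legitimate nice value, `errno` has to be cleared before `getpriority` and checked afterwards to tell an error apart. The sketch only raises the nice value, since lowering it normally requires privilege.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/* Raise the calling process's nice value by delta (capped at +19).
 * On success stores the resulting nice value in *new_nice and
 * returns 0; on failure returns -1 with errno set. */
static int make_nicer(int delta, int *new_nice)
{
    assert(delta >= 0);

    errno = 0;
    int cur = getpriority(PRIO_PROCESS, 0);  /* who == 0: the caller */
    if (cur == -1 && errno != 0)
        return -1;

    int want = cur + delta;
    if (want > 19)
        want = 19;
    if (setpriority(PRIO_PROCESS, 0, want) == -1)
        return -1;

    errno = 0;
    int now = getpriority(PRIO_PROCESS, 0);  /* read back the result */
    if (now == -1 && errno != 0)
        return -1;
    *new_nice = now;
    return 0;
}
```

Passing 0 as `who` targets the calling process, which saves the `getpid()` call shown in the slide.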

SLIDE 54

Scheduler Affinity

  • sched_setaffinity and sched_getaffinity
  • Can specify a bitmap of CPUs on which this can be scheduled

– Better not be 0!

  • Useful for benchmarking: ensure each thread on a dedicated CPU
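For the benchmarking use case, pinning might be sketched as below. This assumes Linux with glibc, where the `cpu_set_t` type and the `CPU_*` macros in `<sched.h>` need `_GNU_SOURCE` defined before any include; the helper names `pin_to_cpu` and `first_allowed_cpu` are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* exposes cpu_set_t and CPU_* in <sched.h> */
#endif
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno set by the kernel). */
static int pin_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);         /* start from an empty bitmap ... */
    CPU_SET(cpu, &set);     /* ... and set exactly one bit */
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0: caller */
}

/* Lowest-numbered CPU the caller may currently run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) == -1)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Choosing the CPU from the current mask, rather than hard-coding CPU 0, avoids asking for a CPU the process is not allowed to use (for example inside a container).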

SLIDE 55

yield

  • Moves a runnable task to the expired runqueue

– Unless real-time (more later), then just move to the end of the active runqueue

  • Several other real-time related APIs

SLIDE 56

Summary

  • Understand competing scheduling goals
  • Understand how context switching is implemented
  • Understand O(1) scheduler + rebalancing
  • Understand various CPU topologies and scheduling domains

  • Scheduling system calls
