CSE 513 Introduction to Operating Systems
Class 9 - Distributed and Multiprocessor Operating Systems

Jonathan Walpole
Dept. of Comp. Sci. and Eng.
Oregon Health and Science University

Why use parallel or distributed systems?

Speed - reduce time to answer
Scale - increase size of problem
Reliability - increase resilience to errors
Communication - span geographical distance

Overview

Multiprocessor systems
Multi-computer systems
Distributed systems

Multiprocessor, multi-computer and distributed architectures

shared memory multiprocessor
message passing multi-computer (cluster)
wide area distributed system

Multiprocessor Systems

Multiprocessor systems

Definition: a computer system in which two or more CPUs share full access to a common RAM

Hardware implements shared memory among CPUs

Architecture determines whether access times to different memory regions are the same

UMA - uniform memory access
NUMA - non-uniform memory access

Bus-based UMA and NUMA architectures

Bus becomes the bottleneck as the number of CPUs increases

Crossbar switch-based UMA architecture

Interconnect cost increases as the square of the number of CPUs

Multiprocessors with 2x2 switches

Omega switching network from 2x2 switches

Interconnect suffers contention, but costs less
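
Routing through an omega network is self-routing on the destination address (destination-tag routing is the standard scheme for these networks): at each stage the 2x2 switch looks at one bit of the destination, 0 for the upper output and 1 for the lower, with a perfect shuffle between stages. A minimal simulation sketch, assuming n = 2^k endpoints; all names are illustrative:

    #include <stdio.h>

    /* perfect shuffle: rotate the k-bit line number left by one */
    static unsigned shuffle(unsigned line, int k) {
        unsigned msb = (line >> (k - 1)) & 1;
        return ((line << 1) | msb) & ((1u << k) - 1);
    }

    static void route(unsigned src, unsigned dst, int k) {
        unsigned line = src;
        printf("route %u -> %u:", src, dst);
        for (int s = 0; s < k; s++) {
            line = shuffle(line, k);            /* traverse shuffle links */
            unsigned bit = (dst >> (k - 1 - s)) & 1;
            line = (line & ~1u) | bit;          /* switch picks upper/lower port */
            printf(" stage%d:line%u", s, line);
        }
        printf("\n");                           /* line now equals dst */
    }

    int main(void) {
        route(3, 6, 3);   /* 8-endpoint network, 3 stages of 2x2 switches */
        return 0;
    }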

NUMA multiprocessors

  • Single address space visible to all CPUs
  • Access to remote memory via LOAD and STORE commands
  • Access to remote memory slower than to local memory
  • Compilers and OS need to be careful about data placement

Directory-based NUMA multiprocessors

(a) 256-node directory-based multiprocessor
(b) Fields of 32-bit memory address
(c) Directory at node 36

Operating systems for multiprocessors

OS structuring approaches

Private OS per CPU
Master-slave architecture
Symmetric multiprocessing architecture

New problems

multiprocessor synchronization
multiprocessor scheduling

The private OS approach

Implications of private OS approach

shared I/O devices
static memory allocation
no data sharing
no parallel applications

The master-slave approach

  • OS only runs on master CPU

Single kernel lock protects OS data structures
Slaves trap system calls and place process on scheduling queue for master

  • Parallel applications supported

Memory shared among all CPUs

  • Single CPU for all OS calls becomes a bottleneck

Symmetric multiprocessing (SMP)

  • OS runs on all CPUs

Multiple CPUs can be executing the OS simultaneously
Access to OS data structures requires synchronization
Fine grain critical sections lead to more locks and more parallelism ... and more potential for deadlock

Multiprocessor synchronization

Why is it different compared to single processor synchronization?

Disabling interrupts does not prevent memory accesses since it only affects “this” CPU

Multiple copies of the same data exist in caches of different CPUs

  • atomic lock instructions do CPU-CPU communication

Spinning to wait for a lock is not always a bad idea

Synchronization problems in SMPs

TSL instruction is non-trivial on SMPs - it must lock the memory bus to keep its read-modify-write atomic across CPUs
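
To make this concrete, here is a minimal spinlock built on an atomic test-and-set, sketching what TSL provides; it uses the GCC/Clang __atomic builtins rather than raw assembly, and the names are illustrative:

    #include <stdbool.h>

    typedef struct { volatile bool locked; } spinlock_t;

    static void spin_lock(spinlock_t *l) {
        /* atomically set the flag and get its previous value;
         * loop until we observe that it was previously clear */
        while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
            ;  /* spin: each iteration is a bus-locking transaction */
    }

    static void spin_unlock(spinlock_t *l) {
        __atomic_clear(&l->locked, __ATOMIC_RELEASE);
    }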

slide-19
SLIDE 19

19

Avoiding cache thrashing during spinning

Multiple locks used to avoid cache thrashing
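
The slide's scheme gives each waiting CPU a private lock to spin on; a simpler software remedy in the same spirit is test-and-test-and-set, sketched below (reusing spinlock_t from the previous sketch): spin on an ordinary cached read, and only issue the bus-locking atomic when the lock looks free.

    static void spin_lock_ttas(spinlock_t *l) {
        for (;;) {
            /* plain read: hits in the local cache, no bus traffic */
            while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
                ;
            /* lock looks free: try the atomic test-and-set once */
            if (!__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
                return;
        }
    }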

Spinning versus switching

In some cases the CPU “must” wait

the scheduling critical section may be held

In other cases spinning may be more efficient than blocking

spinning wastes CPU cycles
switching uses up CPU cycles also
if critical sections are short, spinning may be better than blocking
static analysis of critical section duration can determine whether to spin or block
dynamic analysis can improve performance

Multiprocessor scheduling

Two dimensional scheduling decision

time (which process to run next)
space (which processor to run it on)

Time sharing approach

single scheduling queue shared across all CPUs

Space sharing approach

partition machine into sub-clusters

Time sharing

Single data structure used for scheduling
Problem - scheduling frequency influences inter-thread communication time

Interplay between scheduling and IPC

  • Problem with communication between two threads

both belong to process A
both running out of phase

Space sharing

Groups of cooperating threads can communicate at the same time

fast inter-thread communication time

Gang scheduling

Problem with pure space sharing

Some partitions are idle while others are overloaded

Can we combine time sharing and space sharing and avoid introducing scheduling delay into IPC?

Solution: Gang Scheduling

Groups of related threads scheduled as a unit (gang)
All members of a gang run simultaneously on different timeshared CPUs
All gang members start and end time slices together

Gang scheduling

Multi-computer Systems

Multi-computers

Also known as

cluster computers
clusters of workstations (COWs)

Definition: tightly-coupled CPUs that do not share memory

Multi-computer interconnection topologies

(a) single switch (b) ring (c) grid (d) double torus (e) cube (f) hypercube

Store & forward packet switching

Network interfaces in a multi-computer

Network co-processors may off-load communication processing from the main CPU

OS issues for multi-computers

Message passing performance
Programming model

synchronous vs asynchronous message passing
distributed virtual memory

Load balancing and coordinated scheduling

Optimizing message passing performance

Parallel application performance is dominated by communication costs

interrupt handling, context switching, message copying …

Solution - get the OS out of the loop

map interface board to all processes that need it
active messages - give interrupt handler address of user-buffer
sacrifice protection for performance?

CPU / network card coordination

How to maximize independence between CPU and network card while sending/receiving messages?

Use send & receive rings and bit-maps

  • one always sets bits, one always clears bits
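
A minimal sketch of the idea for a send ring, with the CPU setting a slot's ownership bit after filling it and the NIC (modeled in software here) clearing it after transmission; all names and sizes are illustrative, not a real driver interface:

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define RING_SLOTS 64
    #define SLOT_BYTES 1536

    struct ring_slot {
        volatile uint32_t owned_by_nic;  /* 1: NIC may send; 0: CPU may fill */
        uint32_t len;
        uint8_t  buf[SLOT_BYTES];
    };

    static struct ring_slot tx_ring[RING_SLOTS];
    static unsigned tx_head;             /* next slot the CPU will fill */

    /* CPU side: returns false when the ring is full (slot still owned by NIC) */
    bool ring_send(const void *frame, uint32_t len) {
        struct ring_slot *s = &tx_ring[tx_head % RING_SLOTS];
        if (s->owned_by_nic || len > SLOT_BYTES)
            return false;
        memcpy(s->buf, frame, len);
        s->len = len;
        __atomic_store_n(&s->owned_by_nic, 1, __ATOMIC_RELEASE);  /* set the bit */
        tx_head++;
        return true;
    }

    /* NIC side: transmit any owned slot, then clear its bit */
    void nic_poll(void) {
        for (unsigned i = 0; i < RING_SLOTS; i++) {
            struct ring_slot *s = &tx_ring[i];
            if (__atomic_load_n(&s->owned_by_nic, __ATOMIC_ACQUIRE)) {
                /* ... DMA s->buf onto the wire ... */
                __atomic_store_n(&s->owned_by_nic, 0, __ATOMIC_RELEASE);
            }
        }
    }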

Blocking vs non-blocking send calls

  • Minimum services provided

send and receive commands

  • These can be blocking (synchronous) or non-blocking (asynchronous) calls

(a) Blocking send call
(b) Non-blocking send call
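
As one concrete API that offers both behaviors, POSIX sockets let a sender choose per call; a short sketch, where sock is assumed to be a connected socket descriptor:

    #include <sys/socket.h>
    #include <errno.h>
    #include <string.h>

    void demo_sends(int sock) {
        const char msg[] = "hello";

        /* blocking send: may sleep until the kernel can buffer the data */
        (void)send(sock, msg, strlen(msg), 0);

        /* non-blocking send: fails with EAGAIN instead of waiting, so the
         * caller can overlap computation with communication and retry later */
        if (send(sock, msg, strlen(msg), MSG_DONTWAIT) < 0 &&
            (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* would have blocked: do other work, retry later */
        }
    }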

Blocking vs non-blocking calls

Advantages of non-blocking calls

ability to overlap computation and communication improves performance

Advantages of blocking calls

simpler programming model

Remote procedure call (RPC)

Goal

support execution of remote procedures
make remote procedure execution indistinguishable from local procedure execution
allow distributed programming without changing the programming model

Remote procedure call (RPC)

Steps in making a remote procedure call

client and server stubs are proxies
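
To illustrate what a stub does, here is a toy client stub for int add(int a, int b): it marshals the arguments into a request message, ships it to the server, and unmarshals the reply, so the caller sees an ordinary procedure. The message layout and the rpc_transact() transport call are hypothetical placeholders, not a real RPC library:

    #include <stdint.h>
    #include <string.h>

    /* assumed transport: send request, block for the reply (hypothetical) */
    extern int rpc_transact(const void *req, unsigned req_len,
                            void *reply, unsigned reply_len);

    int add(int a, int b) {               /* client stub for remote add() */
        uint8_t req[2 * sizeof(int32_t)];
        int32_t na = a, nb = b, result = 0;

        memcpy(req, &na, sizeof na);                  /* marshal arguments */
        memcpy(req + sizeof na, &nb, sizeof nb);
        rpc_transact(req, sizeof req, &result, sizeof result);
        return result;                                /* unmarshal reply */
    }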

RPC implementation issues

Cannot pass pointers

call by reference becomes copy-restore (at best)

Weakly typed languages

client stub cannot determine size of reference parameters
not always possible to determine parameter types

Cannot use global variables

may get moved (replicated) to remote machine

Basic problem - local procedure call relies on shared memory

Distributed shared memory (DSM)

Goal

use software to create the illusion of shared memory on top of message passing hardware
leverage virtual memory hardware to page fault on non-resident pages
service page faults from remote memories instead of from local disk
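
A minimal user-level sketch of that trick on UNIX: map the region with no access so the first touch faults, then service the fault by filling the page from a remote node. fetch_page_from_remote() is a hypothetical transport call, and a real DSM would also track page ownership and consistency:

    #include <signal.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #define PAGE 4096

    extern void fetch_page_from_remote(void *page_addr);   /* hypothetical */

    static void dsm_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        /* round the faulting address down to its page boundary */
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
        mprotect(page, PAGE, PROT_READ | PROT_WRITE);   /* make it resident */
        fetch_page_from_remote(page);   /* fill from a remote memory, not disk */
    }

    void *dsm_init(size_t bytes) {
        /* no-access mapping: every first touch raises SIGSEGV */
        void *region = mmap(NULL, bytes, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct sigaction sa = {0};
        sa.sa_sigaction = dsm_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        return region;
    }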

Distributed shared memory (DSM)

DSM at the hardware, OS or middleware layer

Page replication in DSM systems

Replication

(a) Pages distributed on 4 machines
(b) CPU 0 reads page 10
(c) CPU 1 reads page 10

Consistency and false sharing in DSM

Strong memory consistency

[Figure: processors P1-P4 issue writes W1-W4 and reads R1, R2 in one global order]

Total order enforces sequential consistency

  • intuitively simple for programmers, but very costly to implement
  • not even implemented in non-distributed machines!

Scheduling in multi-computer systems

Each computer has its own OS

local scheduling applies

Which computer should we allocate a task to initially?

Decision can be based on load (load balancing)
load balancing can be static or dynamic

Graph-theoretic load balancing approach

  • Two ways of allocating 9 processes to 3 nodes
  • Total network traffic is the sum of arcs cut by node boundaries
  • The second partitioning is better
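
The metric is easy to compute; a small helper, assuming a weighted adjacency matrix for the process-communication graph and an array assigning each process to a node (names are illustrative):

    #define NPROC 9

    /* sum the traffic on arcs whose endpoints land on different nodes */
    int cut_traffic(const int w[NPROC][NPROC], const int node[NPROC]) {
        int total = 0;
        for (int i = 0; i < NPROC; i++)
            for (int j = i + 1; j < NPROC; j++)
                if (w[i][j] && node[i] != node[j])
                    total += w[i][j];     /* arc crosses a node boundary */
        return total;
    }

Comparing two candidate assignments then amounts to comparing their cut_traffic() values; the smaller one generates less network load.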

Sender-initiated load balancing

  • Overloaded nodes (senders) off-load work to underloaded nodes (receivers)

Receiver-initiated load balancing

  • Underloaded nodes (receivers) request work from overloaded nodes (senders)

Distributed Systems

Distributed systems

Definition: loosely-coupled CPUs that do not share memory

where is the boundary between tightly-coupled and loosely-coupled systems?

Other differences

single vs multiple administrative domains
geographic distribution
homogeneity vs heterogeneity of hardware and software

Comparing multiprocessors, multi-computers and distributed systems

Ethernet as an interconnect

  • Bus-based vs switched Ethernet

The Internet as an interconnect

OS issues for distributed systems

Common interfaces above heterogeneous systems

communication protocols
distributed system middleware

Choosing suitable abstractions for distributed system interfaces

distributed document-based systems
distributed file systems
distributed object systems

Network service and protocol types

Protocol interaction and layering

Homogeneity via middleware

Distributed system middleware models

Document-based systems
File-based systems
Object-based systems

Document-based middleware - WWW

Document-based middleware

How the browser gets a page

Asks DNS for IP address
DNS replies with IP address
Browser makes connection
Sends request for specified page
Server sends file
TCP connection released
Browser displays text
Browser fetches, displays images

File-based middleware

Design issues

Naming and name resolution
Architecture and interfaces
Caching strategies and cache consistency
File sharing semantics
Disconnected operation and fault tolerance

Naming

(b) Clients with the same view of name space
(c) Clients with different views of name space

Naming and transparency issues

  • Can clients distinguish between local and remote files?
  • Location transparency

file name does not reveal the file's physical storage location

  • Location independence

the file name does not need to be changed when the file's physical storage location changes

Global vs local name spaces

  • Global name space

file names are globally unique
any file can be named from any node

  • Local name spaces

remote files must be inserted in the local name space
file names are only meaningful within the calling node
but how do you refer to remote files in order to insert them?

  • globally unique file handles can be used to map remote files to local names

Building a name space with super-root

  • Super-root / machine name approach

concatenate the host name to the names of files stored on that host
system-wide uniqueness guaranteed
simple to locate a file
not location transparent or location independent

Building a name space using mounting

Mounting remote file systems

exported remote directory is imported and mounted onto local directory
accesses require a globally unique file handle for the remote directory

  • once mounted, file names are location-transparent
  • location can be captured via naming conventions

are they location independent?

  • location of file vs location of client?
  • files have different names from different places

Local name spaces with mounting

  • Mounting (part of) a remote file system in NFS.

Nested mounting on multiple servers

NFS name space

  • Server exports a directory
  • mountd: provides a unique file handle for the exported directory
  • Client uses RPC to issue nfs_mount request to server
  • mountd receives the request and checks whether

the pathname is a directory
the directory is exported to this client

NFS file handles

  • V-node contains
  • reference to a file handle for mounted remote files
  • reference to an i-node for local files
  • File handle uniquely names a remote directory
  • file system identifier: unique number for each file system (in UNIX, the super block)
  • i-node and i-node generation number

[Figure: a v-node refers either to a local i-node or to a file handle; the file handle holds the file system identifier, i-node number, and i-node generation number]
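
The same structure as a C sketch; field and type names are illustrative, not the actual kernel definitions:

    #include <stdint.h>

    struct nfs_fhandle {
        uint32_t fsid;        /* file system identifier (from the superblock) */
        uint32_t inode;       /* i-node number within that file system */
        uint32_t inode_gen;   /* generation number, detects i-node reuse */
    };

    struct vnode {
        int is_remote;
        union {
            struct inode       *local;    /* local file: i-node reference */
            struct nfs_fhandle  remote;   /* remote file: opaque handle */
        } u;
    };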

Mounting on-demand

  • Need to decide where and when to mount remote directories
  • Where? - can be based on conventions to standardize local name spaces (e.g., /home/username for user home directories)
  • When? - boot time, login time, access time, ...?
  • What to mount when?

How long does it take to mount everything?
Do we know what everything is?
Can we do mounting on-demand?

  • An automounter is a client-side process that handles on-demand mounting

it intercepts requests and acts like a local NFS server

Distributed file system architectures

  • Server side

how do servers export files?
how do servers handle requests from clients?

  • Client side

how do applications access a remote file in the same way as a local file?

  • Communication layer

how do clients and servers communicate?

Local access architectures

  • Local access approach (data shipping)

move file to client
local access on client
return file to server

Remote access architectures

  • Remote access approach (function shipping)

leave file on server
send read/write operations to server
return results to client

File-level interface

Accesses can be supported at either file granularity or block granularity

File-level client-server interface

local access model with whole file movement and caching
remote access model: client-server interface at system call level
client performs remote open, read, write, close calls

Block-level interface

Block-level client-server interface

client-server interface at file system or disk block level
server offers virtual disk interface
client file accesses generate block access requests to server
block-level caching of parts of files on client

NFS architecture

  • The basic NFS architecture for UNIX systems.

NFS server side

  • Mountd

server exports directory via mountd
mountd provides the initial file handle for the exported directory
client issues nfs_mount request via RPC to mountd
mountd checks if the pathname is a directory and if the directory is exported to the client

  • nfsd: services NFS RPC calls, gets the data from its local file system, and replies to the RPC

usually listening at port 2049

  • Both mountd and nfsd use RPC

Communication layer: NFS RPC Calls

  • NFS / RPC uses XDR and TCP/IP
  • fhandle: 64-byte opaque data (in NFS v3)

what's in the file handle?

Proc.    Input args                     Results
lookup   dirfh, name                    status, fhandle, fattr
read     fhandle, offset, count         status, fattr, data
create   dirfh, name, fattr             status, fhandle, fattr
write    fhandle, offset, count, data   status, fattr

NFS file handles

  • V-node contains
  • reference to a file handle for mounted remote files
  • reference to an i-node for local files
  • File handle uniquely names a remote directory
  • file system identifier: unique number for each file system (in UNIX, the super block)
  • i-node and i-node generation number

[Figure: as before, the v-node refers either to a local i-node or to a file handle made of the file system identifier, i-node number, and i-node generation number]

NFS client side

Accessing remote files in the same way as accessing local files requires kernel support

Vnode interface

[Figure: read(fd, ...) indexes the process file table to a struct file (mode, v-node, offset); the v-node's v_data and fs_op fields dispatch into the file system's operations:]

struct vnode {
    int (*open)();
    int (*close)();
    int (*read)();
    int (*write)();
    int (*lookup)();
    ...
}

Caching vs pure remote service

  • Network traffic?

caching reduces remote accesses ⇒ reduces network traffic
caching generates fewer, larger data transfers

  • Server load?

caching reduces remote accesses ⇒ reduces server load

  • Server disk throughput?

optimized better for large requests than random disk blocks

  • Data integrity?

cache-consistency problem due to frequent writes

  • Operating system complexity?

simpler for remote service

Four places to cache files

Server's disk: slow performance

Server's memory

cache management, how much to cache, replacement strategy
still slow due to network delay

Client's disk

access speed vs server memory?
large files can be cached
supports disconnected operation

Client's memory

fastest access
can be used by diskless workstations
competes with the VM system for physical memory space

Cache consistency

Reflecting changes to the local cache in the master copy
Reflecting changes to the master copy in local caches

[Figure: a write to Copy 1 must reach the master copy, which then updates or invalidates Copy 2]

Common update algorithms for client caching

  • Write-through: all writes are carried out immediately
  • Reliable: little information is lost in the event of a client crash
  • Slow: cache not useful for writes
  • Delayed-write: writes do not immediately propagate to the server
  • batching writes amortizes overhead
  • wait for blocks to fill
  • if data is written and then deleted immediately, the data need not be written at all (20-30% of new data is deleted within 30 secs)
  • Write-on-close: delay writing until the file is closed at the client
  • semantically meaningful delayed-write policy
  • if the file is open for a short duration, works fine
  • if the file is open for long, susceptible to losing data in the event of a client crash

Cache coherence

  • How to keep locally cached data up to date / consistent?
  • Client-initiated approach

check validity on every access: too much overhead
check on first access to a file (e.g., file open)
check every fixed time interval

  • Server-initiated approach

server records, for each client, the (parts of) files it caches
server responds to updates by propagation or invalidation

  • Disallow caching during concurrent-write or read/write sharing

allow multiple clients to cache file for read-only access
flush all client caches when the file is opened for writing

NFS – server caching

Reads

use the local file system cache
prefetching in UNIX using read-ahead

Writes

write-through (synchronously, no cache)
commit on close (standard behaviour in v4)

NFS – client caching (reads)

  • Clients are responsible for validating cache entries (stateless server)
  • Validation by checking last modification time

time stamps issued by server
automatic validation on open (with server??)

  • A cache entry is considered valid if one of the following is true:

cache entry is less than t seconds old (3-30 s for files, 30-60 s for directories)
modified time at server is the same as modified time on client
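
The validity rule transcribes directly into code; a small sketch with illustrative names:

    #include <time.h>
    #include <stdbool.h>

    struct cache_entry {
        time_t fetched_at;     /* when the client cached this entry */
        time_t mtime_client;   /* modification time recorded at caching */
    };

    /* t_freshness: 3-30 s for files, 30-60 s for directories */
    bool entry_valid(const struct cache_entry *e, time_t mtime_server,
                     time_t t_freshness) {
        time_t now = time(NULL);
        return (now - e->fetched_at < t_freshness) ||
               (mtime_server == e->mtime_client);
    }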

NFS – client caching (writes)

  • Delayed writes

modified files are marked dirty and flushed to server on close (or sync)

  • Bio-daemons (block input-output)

read-ahead requests are done asynchronously
write requests are submitted when a block is filled

File sharing semantics

  • Semantics of file sharing

(a) single processor gives sequential consistency
(b) distributed system may return obsolete value

Consistency semantics for file sharing

  • What value do reads see after writes?
  • UNIX semantics
  • value read is the value stored by the last write
  • writes to an open file are visible immediately to others with the file open
  • easy to implement with one server and no cache
  • Session semantics
  • writes to an open file are not visible immediately to others with the file opened already
  • changes become visible on close to sessions started later
  • Immutable-Shared-Files semantics - simple to implement
  • a sharable file cannot be modified
  • file names cannot be reused and its contents may not be altered
  • Transactions
  • all changes have the all-or-nothing property
  • W1,R1,R2,W2 not allowed where P1 = W1;W2 and P2 = R1;R2

NFS – file sharing semantics

  • Not UNIX semantics!
  • Unspecified in NFS standard
  • Not clear because of timing dependencies
  • Consistency issues can arise

Example: Jack and Jill have a file cached. Jack opens the file and modifies it, then he closes the file. Jill then opens the file (before t seconds have elapsed) and modifies it as well. Then she closes the file. Are both Jack's and Jill's modifications present in the file? What if Jack closes the file after Jill opens it?

  • Locking is part of v4 (byte range, leasing)