A Distributed System 18: Distributed Systems


  1. A Distributed System

     18: Distributed Systems
     Last Modified: 7/3/2004 1:49:01 PM

     Loosely Coupled Distributed Systems
     • Users are aware of multiplicity of machines. Access to resources of various machines is done explicitly by:
       - Remote logging into the appropriate remote machine.
       - Transferring data from remote machines to local machines, via the File Transfer Protocol (FTP) mechanism.

     Tightly Coupled Distributed Systems
     • Users not aware of multiplicity of machines. Access to remote resources similar to access to local resources.
     • Examples
       - Data Migration: transfer data by transferring entire file, or transferring only those portions of the file necessary for the immediate task.
       - Computation Migration: transfer the computation, rather than the data, across the system.

     Why Distributed Systems?
     • Communication
       - Dealt with this when we talked about networks
     • Resource sharing
     • Computational speedup
     • Reliability

     Distributed-Operating Systems (Cont.)
     • Process Migration: execute an entire process, or parts of it, at different sites.
       - Load balancing: distribute processes across network to even the workload.
       - Computation speedup: subprocesses can run concurrently on different sites.
       - Hardware preference: process execution may require specialized processor.
       - Software preference: required software may be available at only a particular site.
       - Data access: run process remotely, rather than transfer all data locally.
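The trade-off between data migration and computation migration can be made concrete with a rough cost comparison. The sketch below is only an illustration: the function names, the three strategy labels, and the idea of measuring cost purely in bytes sent over the network are assumptions made here, not something the slides specify.

```python
def bytes_moved(strategy, file_size, needed_bytes, code_size, result_size):
    """Rough network cost, in bytes, of one migration strategy (illustrative only)."""
    if strategy == "whole_file":
        # Data migration: ship the entire file to the local machine.
        return file_size
    if strategy == "partial_file":
        # Data migration: ship only the portions needed for the immediate task.
        return needed_bytes
    if strategy == "computation":
        # Computation migration: ship the code to the data, ship the results back.
        return code_size + result_size
    raise ValueError(f"unknown strategy: {strategy}")


def cheapest_strategy(file_size, needed_bytes, code_size, result_size):
    """Pick the strategy that moves the fewest bytes under this simple model."""
    strategies = ("whole_file", "partial_file", "computation")
    return min(
        strategies,
        key=lambda s: bytes_moved(s, file_size, needed_bytes, code_size, result_size),
    )
```

Under this model a query against a large remote data set tends to favor computation migration, while a small file read in part favors partial data migration. A real system would also weigh latency, load, and hardware or software preference.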

  2. Resource Sharing
     • Distributed Systems offer access to specialized resources of many systems
     • Example:
       - Some nodes may have special databases
       - Some nodes may have access to special hardware devices (e.g. tape drives, printers, etc.)
     • DS offers benefits of locating processing near data or sharing special devices

     OS Support for resource sharing
     • Resource Management?
       - Distributed OS can manage diverse resources of nodes in system
       - Make resources visible on all nodes
       - Like VM, can provide functional illusion but rarely hide the performance cost
     • Scheduling?
       - Distributed OS could schedule processes to run near the needed resources
       - If need to access data in a large database, may be easier to ship code there and results back than to request data be shipped to code

     Why Distributed Systems?
     • Resource sharing
     • Computational speedup
     • Reliability

     Design Issues
     • Transparency: the distributed system should appear as a conventional, centralized system to the user.
     • Fault tolerance: the distributed system should continue to function in the face of failure.
     • Scalability: as demands increase, the system should easily accept the addition of new resources to accommodate the increased demand.
     • Clusters vs Client/Server
       - Clusters: a collection of semi-autonomous machines that acts as a single system.

     Computation Speedup
     • Some tasks too large for even the fastest single computer
       - Real time weather/climate modeling, human genome project, fluid turbulence modeling, ocean circulation modeling, etc.
       - http://www.nersc.gov/research/GC/gcnersc.html
     • What to do?
       - Leave the problem unsolved?
       - Engineer a bigger/faster computer?
       - Harness resources of many smaller (commodity?) machines in a distributed system?

     Breaking up the problems
     • To harness computational speedup, must first break up the big problem into many smaller problems
     • More art than science?
     • Sometimes break up by function
       - Pipeline?
       - Job queue?
     • Sometimes break up by data
       - Each node responsible for portion of data set?
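The job-queue and break-up-by-data ideas can be sketched in a few lines. This is a minimal single-machine illustration in which threads stand in for nodes; the names `split_into_chunks` and `run_job_queue`, and the sum-of-squares workload, are hypothetical choices made for the example rather than anything taken from the slides.

```python
import queue
import threading


def split_into_chunks(data, chunk_size):
    """Break up by data: each chunk is a portion of the data set."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def run_job_queue(data, num_workers=4, chunk_size=1000):
    """Workers (stand-ins for nodes) pull chunks from a shared queue until it is empty."""
    jobs = queue.Queue()
    for chunk in split_into_chunks(data, chunk_size):
        jobs.put(chunk)

    partial_sums = []
    lock = threading.Lock()  # protects partial_sums across workers

    def worker():
        while True:
            try:
                chunk = jobs.get_nowait()  # "go back for more" work
            except queue.Empty:
                return  # no work left
            result = sum(x * x for x in chunk)
            with lock:
                partial_sums.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine the partial results from all workers.
    return sum(partial_sums)
```

A job queue balances load automatically when chunks take unequal time, because a fast worker simply comes back for another chunk. On a real distributed system the queue and the results would travel over the network instead of through shared memory.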

  3. Decomposition Examples
     • Decrypting a message
       - Easily parallelizable: give each node a set of keys to try
       - Job queue: when you have tried all your keys, go back for more?
     • Modeling ocean circulation
       - Give each node a portion of the ocean to model (N square ft region?)
       - Model flows within region locally
       - Communicate with nodes managing neighboring regions to model flows into other regions

     Decomposition Examples (con't)
     • Barnes Hut: calculating effect of bodies in space on each other
       - Could divide space into NxN regions?
       - Some regions have many more bodies
     • Instead divide up so regions have roughly same number of bodies
       - Within a region, bodies have lots of effect on each other (close together)
       - Abstract other regions as a single body to minimize communication

     Linear Speedup
     • Linear speedup is often the goal.
       - Allocate N nodes to the job and it goes N times as fast
     • Once you've broken up the problem into N pieces, can you expect it to go N times as fast?
       - Are the pieces equal?
       - Is there a piece of the work that cannot be broken up (inherently sequential)?
       - Synchronization and communication overhead between pieces?

     Super-linear Speedup
     • Sometimes can actually do better than linear speedup!
       - Especially if you divide up a big data set so that the piece needed at each node fits into main memory on that machine
       - Savings from avoiding disk I/O can outweigh the communication/synchronization costs
     • When splitting up a problem, there is tension between duplicating processing at all nodes for reliability and simplicity, and allowing nodes to specialize

     OS Support for Parallel Jobs
     • Group Communication?
       - OS could provide facilities for pieces of a single job to communicate easily
       - Location independent addressing?
       - Shared memory?
       - Distributed file system?
     • Synchronization?
       - Support for mutually exclusive access to data across multiple machines
       - Can't rely on HW atomic operations any more
       - Deadlock management?
       - We'll talk about clock synchronization and two-phase commit later

     OS Support for Parallel Jobs (con't)
     • Process Management?
       - OS could manage all pieces of a parallel job as one unit
       - Allow all pieces to be created, managed, destroyed at a single command line
       - Fork (process, machine)?
     • Scheduling?
       - Programmer could specify where pieces should run and/or OS could decide
       - Process Migration? Load Balancing?
       - Try to schedule pieces together so they can communicate effectively
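The question of whether N nodes give an N-times speedup is commonly estimated with Amdahl's law, which the slides do not name but which captures the "inherently sequential" concern. The sketch below assumes work is measured as a fraction of single-node run time. The `overhead_fraction` parameter is an extra assumption added here to stand in for synchronization and communication cost; it is not part of Amdahl's law itself.

```python
def estimated_speedup(sequential_fraction, num_nodes, overhead_fraction=0.0):
    """Estimate speedup on num_nodes nodes relative to one node.

    sequential_fraction: share of the work that cannot be broken up (0.0 to 1.0).
    overhead_fraction:   assumed extra cost for synchronization/communication,
                         as a fraction of the single-node run time.
    """
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must be between 0 and 1")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    # Normalized run time: the sequential part is not sped up, the parallel
    # part is divided across nodes, and the overhead is added on top.
    run_time = sequential_fraction + parallel_fraction / num_nodes + overhead_fraction
    return 1.0 / run_time
```

With no sequential part and no overhead the estimate is exactly linear, for example `estimated_speedup(0.0, 8)` gives 8.0. Even a 10% sequential part caps the speedup below 10 no matter how many nodes are added. This model cannot produce super-linear speedup, since it ignores memory and disk effects.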

  4. Why Distributed Systems?
     • Resource sharing
     • Computational speedup
     • Reliability

     Reliability
     • Distributed system offers potential for increased reliability
       - If one part of system fails, rest could take over
       - Redundancy, fail-over
     • !BUT! Often reality is that distributed systems offer less reliability
       - "A distributed system is one in which some machine I've never heard of fails and I can't do work!"
       - Hard to get rid of all hidden dependencies
       - No clean failure model
         - Nodes don't just fail, they can continue in a broken state
         - Partitioned network = many, many nodes fail at once! (Determine who you can still talk to; are you cut off or are they?)
         - Network goes down and up and down again!

     Robustness
     • Detect and recover from site failure, function transfer, reintegrate failed site
       - Failure detection
       - Reconfiguration

     Failure Detection
     • Detecting hardware failure is difficult.
     • To detect a link failure, a handshaking protocol can be used.
     • Assume Site A and Site B have established a link. At fixed intervals, each site will exchange an I-am-up message indicating that they are up and running.
     • If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lost.
     • Site A can now send an Are-you-up? message to Site B.
     • If Site A does not receive a reply, it can repeat the message or try an alternate route to Site B.

     Failure Detection (cont)
     • If Site A does not ultimately receive a reply from Site B, it concludes some type of failure has occurred.
     • Types of failures:
       - Site B is down
       - The direct link between A and B is down
       - The alternate link from A to B is down
       - The message has been lost
     • However, Site A cannot determine exactly why the failure has occurred.
     • B may be assuming A is down at the same time
     • Can either assume it can make decisions alone?

     Reconfiguration
     • When Site A determines a failure has occurred, it must reconfigure the system:
       1. If the link from A to B has failed, this must be broadcast to every site in the system.
       2. If a site has failed, every other site must also be notified indicating that the services offered by the failed site are no longer available.
     • When the link or the site becomes available again, this information must again be broadcast to all other sites.
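The handshaking steps for failure detection can be sketched from Site A's point of view. This is a simplified illustration: `check_site`, its parameters, and the `send_are_you_up` callback (assumed to return True when a reply arrives on the given route) are hypothetical names introduced here, and a real implementation would use timers and actual network messages.

```python
def check_site(seconds_since_heartbeat, interval, send_are_you_up, routes, retries=2):
    """Return "up" or "failure-suspected" for Site B, as seen by Site A.

    seconds_since_heartbeat: time since the last I-am-up message from B.
    interval:                the fixed interval at which I-am-up messages are expected.
    send_are_you_up:         hypothetical callable(route) -> bool, True if B replied.
    routes:                  routes to try in order, e.g. the direct link then an alternate.
    """
    # Normal case: an I-am-up message arrived within the fixed interval.
    if seconds_since_heartbeat <= interval:
        return "up"
    # Otherwise B may be down or the message may have been lost, so A asks
    # directly, repeating the message and then trying alternate routes.
    for route in routes:
        for _ in range(retries):
            if send_are_you_up(route):
                return "up"
    # No reply on any route: A knows some failure occurred, but not which one.
    return "failure-suspected"
```

For example, `check_site(9.0, 5.0, lambda route: route == "alternate", ["direct", "alternate"])` returns `"up"` because the reply arrives over the alternate route. The function deliberately reports only a suspicion: as the slides note, Site A cannot tell a down site from a down link or a lost message.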
