[537] Distributed Systems Chapters 42 Tyler Harter 11/19/14



slide-1
SLIDE 1

[537] Distributed Systems

Chapters 42 Tyler Harter 11/19/14

slide-2
SLIDE 2

File-System Case Studies

Local

  • FFS: Fast File System
  • LFS: Log-Structured File System

Network

  • NFS: Network File System
  • AFS: Andrew File System
slide-3
SLIDE 3

File-System Case Studies

Local

  • FFS: Fast File System
  • LFS: Log-Structured File System

Network

  • Intro: communication basics [today]
  • NFS: Network File System
  • AFS: Andrew File System
slide-4
SLIDE 4

Review

slide-5
SLIDE 5

Atomicity

Say we want to do several things.

  • Atomicity means we don’t get interrupted when partially done (or at least that we can make it appear that way to the user).
  • Concurrency: we’re worried about other threads
  • Persistence: we’re worried about crashes

slide-6
SLIDE 6

Atomic Update

Say we want to update a file foo.txt. If we crash, we want one of the following:

  • all old data
  • all new data

Strategy: write new data to foo.tmp, and only after that’s complete, replace foo.txt by switching names.

slide-7
SLIDE 7

Bad Protocol

copy foo.txt to foo.tmp (with changes)
rename foo.tmp to foo.txt

slide-8
SLIDE 8

Bad Protocol

foo.txt Old Data

(on disk)

slide-9
SLIDE 9

Bad Protocol

copy foo.txt to foo.tmp (with changes)

foo.txt Old Data

(on disk)

foo.tmp New Data

(in RAM)

slide-10
SLIDE 10

Bad Protocol

copy foo.txt to foo.tmp (with changes)
rename foo.tmp to foo.txt

foo.txt New Data

(in RAM)

Old Data

(on disk)

slide-11
SLIDE 11

Bad Protocol

copy foo.txt to foo.tmp (with changes)
rename foo.tmp to foo.txt

foo.txt New Data

(in RAM) (on disk)

slide-12
SLIDE 12

Good Protocol

copy foo.txt to foo.tmp (with changes)
fsync foo.tmp
rename foo.tmp to foo.txt

slide-13
SLIDE 13

Good Protocol

foo.txt Old Data

(on disk)

slide-14
SLIDE 14

Good Protocol

copy foo.txt to foo.tmp (with changes)

foo.txt Old Data

(on disk)

foo.tmp New Data

(in RAM)

slide-15
SLIDE 15

Good Protocol

copy foo.txt to foo.tmp (with changes)
fsync foo.tmp

foo.txt Old Data

(on disk)

foo.tmp New Data

(on disk)

slide-16
SLIDE 16

Good Protocol

copy foo.txt to foo.tmp (with changes)
fsync foo.tmp
rename foo.tmp to foo.txt

foo.txt Old Data

(on disk)

New Data

(on disk)

slide-17
SLIDE 17

Good Protocol

copy foo.txt to foo.tmp (with changes)
fsync foo.tmp
rename foo.tmp to foo.txt

foo.txt New Data

(on disk) (on disk)
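
The good protocol can be sketched in C with POSIX calls. This is a minimal sketch, not the course's reference code: the function name atomic_update and its parameters are assumptions made for illustration. The point is the ordering: the new data is forced to disk with fsync() before rename() switches the names, so a crash leaves either all old data or all new data under foo.txt.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Replace the contents of `path` with `text` so that a crash leaves either
 * all old data or all new data. `tmp_path` is a scratch name in the same
 * directory. Both names are illustrative. Returns 0 on success, -1 on error. */
int atomic_update(const char *path, const char *tmp_path, const char *text)
{
    assert(path != NULL && tmp_path != NULL && text != NULL);
    size_t len = strlen(text);

    /* Step 1: write the new contents to the temporary file. */
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, text + done, len - done);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        done += (size_t)n;
    }

    /* Step 2: force the new data from RAM to disk before renaming. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Step 3: switch names; rename() replaces the old file atomically. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

A call such as atomic_update("foo.txt", "foo.tmp", "new data") follows the three steps on the slide. Dropping step 2 gives the bad protocol: the rename could reach disk while the new data is still only in RAM.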

slide-18
SLIDE 18

Local FS Comparison

FFS+Journal:

  • must write data twice (writes expensive)
  • can put data exactly where we like (reads cheaper)

LFS:

  • all writes sequential (writes cheaper)
  • reads may be very random (reads expensive)
slide-19
SLIDE 19

Local FS Comparison

  • In what ways is FFS more complex?
  • In what ways is LFS more complex?
  • Compare group descriptor to segment summary.
  • LFS: why don’t we need to update root inode upon updating any file?

slide-20
SLIDE 20

Distributed Systems

slide-21
SLIDE 21

OSTEP Definition

Def: more than 1 machine

Examples:

  • client/server: web server and web client
  • cluster: page rank computation

Other courses:

  • CS 640: Networking
  • CS 739: Distributed Systems

slide-22
SLIDE 22

Why Go Distributed?

  • More compute power
  • More storage capacity
  • Fault tolerance
  • Data sharing
slide-23
SLIDE 23

New Challenges

  • System failure: need to worry about partial failure.
  • Communication failure: links unreliable
slide-24
SLIDE 24

Communication

All communication is inherently unreliable.

Need to worry about:

  • bit errors
  • packet loss
  • node/link failure
slide-25
SLIDE 25

Why are network sockets less reliable than pipes?

slide-26
SLIDE 26

[Figure: a writer process and a reader process in user space, connected by a pipe in the kernel.]

slide-37
SLIDE 37

[Figure: a writer process and a reader process in user space, connected by a pipe in the kernel.]

write waits for space

slide-41
SLIDE 41

[Figure: a writer process and a reader process in user space, connected by a pipe in the kernel.]
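
The pipe behavior in the figures can be observed from user space. The sketch below assumes a POSIX system. The helper names set_nonblocking, fill_pipe, and drain_pipe are made up for illustration. With O_NONBLOCK set, a write to a full pipe fails with EAGAIN at exactly the point where a blocking write would wait for space. Nothing is dropped: the kernel holds every accepted byte until the reader takes it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put a file descriptor into non-blocking mode. Returns 0 on success. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Write to a non-blocking pipe until the kernel buffer is full.
 * Returns the number of bytes accepted, or -1 on an unexpected error. */
long fill_pipe(int write_fd)
{
    char chunk[512] = {0};
    long total = 0;

    for (;;) {
        ssize_t n = write(write_fd, chunk, sizeof chunk);
        if (n > 0)
            total += n;
        else if (n < 0 && errno == EAGAIN)
            return total;   /* full: a blocking write would wait here */
        else
            return -1;
    }
}

/* Read and discard exactly `count` bytes. Returns 0 on success. */
int drain_pipe(int read_fd, long count)
{
    char chunk[512];

    assert(count >= 0);
    while (count > 0) {
        size_t want = count < (long)sizeof chunk ? (size_t)count : sizeof chunk;
        ssize_t n = read(read_fd, chunk, want);
        if (n <= 0)
            return -1;
        count -= n;
    }
    return 0;
}
```

Calling fill_pipe() on a fresh non-blocking pipe reports the buffer size (typically 64 KiB on Linux, though that is not guaranteed). After drain_pipe() removes those bytes, writes succeed again.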

slide-42
SLIDE 42

[Figure: a writer process on Machine A and a reader process on Machine B, each machine split into user and kernel, connected by a network socket through a router.]

slide-43
SLIDE 43

[Figure: a writer process on Machine A and a reader process on Machine B, each machine split into user and kernel, connected by a network socket through a router.]

What if the router’s buffer is full?

slide-44
SLIDE 44

[Figure: a writer process on Machine A and a reader process on Machine B, each machine split into user and kernel, connected by a network socket through a router.]

What if B’s buffer is full?

slide-45
SLIDE 45

[Figure: the writer process and its network socket on Machine A (user and kernel); everything beyond Machine A is drawn as "?".]

From A’s view, network and B are largely a black box.

slide-46
SLIDE 46

Overview

Raw messages

Reliable messages

OS abstractions

  • virtual memory
  • global file system

Programming-languages abstractions

  • remote procedure call
slide-47
SLIDE 47

Raw Messages: UDP

API:

  • reads and writes over socket file descriptors
  • messages sent from/to ports to target a process on machine

Provide minimal reliability features:

  • messages may be lost
  • messages may be reordered
  • messages may be duplicated
  • only protection: checksums
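
To make the checksum protection concrete, here is a sketch of the 16-bit one's-complement Internet checksum described in RFC 1071, which is the arithmetic UDP uses. The function name internet_checksum is an assumption. A real UDP checksum also covers a pseudo-header with the IP addresses, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's-complement checksum over `len` bytes (RFC 1071 arithmetic).
 * Intended for datagram-sized inputs, so the 32-bit sum cannot overflow. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t sum = 0;
    size_t i = 0;

    /* Add the data as big-endian 16-bit words. */
    for (; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                        /* odd length: pad with a zero byte */
        sum += (uint32_t)data[i] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The sender appends the checksum to the message. The receiver runs the same function over the message plus checksum, and any result other than 0 means the bits were damaged in transit. A checksum only detects corruption: lost, reordered, or duplicated messages pass through unnoticed.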
slide-48
SLIDE 48

Raw Messages: UDP

Advantages

  • lightweight
  • some applications make better reliability decisions themselves (e.g., video conferencing programs)

Disadvantages

  • more difficult to write application correctly
slide-50
SLIDE 50

Strategy

Using software, build reliable, logical connections over unreliable connections.

Strategies:

  • acknowledgment
slide-51
SLIDE 51

ACK

Sender                          Receiver
[send message]
                                [recv message]
                                [send ack]
[recv ack]

Sender knows message was received.

slide-52
SLIDE 52

ACK

Sender                          Receiver
[send message]

Sender misses ACK… What to do?

slide-54
SLIDE 54

Strategy

Using software, build reliable, logical connections over unreliable connections.

Strategies:

  • acknowledgment
  • timeout
slide-55
SLIDE 55

Timeout

Sender                          Receiver
[send message]

slide-56
SLIDE 56

Timeout

Sender                          Receiver
[send message]
[start timer]

slide-57
SLIDE 57

Timeout

Sender                          Receiver
[send message]
[start timer]
… waiting for ack …

slide-58
SLIDE 58

Timeout

Sender                          Receiver
[send message]
[start timer]
… waiting for ack …
[timer goes off]

slide-59
SLIDE 59

Timeout

Sender                          Receiver
[send message]
[start timer]
… waiting for ack …
[timer goes off]
[send message]
                                [recv message]
                                [send ack]
[recv ack]

slide-60
SLIDE 60

Timeout: Issue 1

How long to wait?

slide-61
SLIDE 61

Timeout: Issue 1

How long to wait?

  • Too long: system feels unresponsive
  • Too short: messages needlessly re-sent
  • Messages may have been dropped due to an overloaded server. Aggressive clients worsen this.
slide-62
SLIDE 62

Timeout: Issue 1

How long to wait?

  • One strategy: be adaptive.
  • Adjust time based on how long acks usually take.
  • For each missing ack, wait longer between retries.
slide-63
SLIDE 63

Timeout: Issue 2

What does a lost ack really mean?

slide-64
SLIDE 64

Case 1

Sender                          Receiver
[send message]
[timeout]

Case 2

Sender                          Receiver
[send message]
                                [recv message]
                                [send ack]
[timeout]

How can the sender tell these two cases apart?

slide-65
SLIDE 65

Timeout: Issue 2

What does a lost ack really mean?

  • ACK: message received exactly once
  • No ACK: message received at most once
slide-66
SLIDE 66

Timeout: Issue 2

What does a lost ack really mean?

  • ACK: message received exactly once
  • No ACK: message received at most once
  • What if message is command to increment counter?
slide-67
SLIDE 67

Proposed Solution

Sender could send an AckAck so receiver knows whether to retry sending an Ack.

  • Sound good?
slide-68
SLIDE 68

Aside: Two Generals’ Problem

general 1 general 2 enemy

slide-69
SLIDE 69

Aside: Two Generals’ Problem

general 1 general 2 enemy

Suppose the generals agree after N messages. Did the arrival of the N’th message change anybody’s decision?

slide-70
SLIDE 70

Aside: Two Generals’ Problem

general 1 general 2 enemy

Suppose the generals agree after N messages. Did the arrival of the N’th message change anybody’s decision?

  • if yes: then what if the N’th message had been lost?
  • if no: then why bother sending N messages?
slide-73
SLIDE 73

Strategy

Using software, build reliable, logical connections over unreliable connections.

Strategies:

  • acknowledgment
  • timeout
  • remember sent messages
slide-74
SLIDE 74

Receiver Remembers Messages

Sender                          Receiver
[send message]
                                [recv message]
                                [send ack]
[timeout]
[send message]
                                [ignore message]
                                [send ack]
[recv ack]

slide-75
SLIDE 75

Receiver Remembers Messages

Sender                          Receiver
[send message]
                                [recv message]
                                [send ack]
[timeout]
[send message]
                                [ignore message]
                                [send ack]
[recv ack]

how do we know to ignore?

slide-76
SLIDE 76

Solutions

Solution 1: remember every message ever sent.

slide-77
SLIDE 77

Solutions

Solution 1: remember every message ever sent.

Solution 2: sequence numbers

  • give each message a seq number
  • receiver knows all messages before N have been seen
  • receiver remembers which messages after N it has seen
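
Solution 2 can be sketched as a receiver that tracks N plus a small window of flags for sequence numbers at or beyond N. The names and the window size of 64 are assumptions for illustration. Sequence-number wraparound is ignored, and the sketch does not distinguish a duplicate (which should still be acked) from a message too far ahead to track.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WINDOW 64   /* how many sequence numbers past N we remember (assumed) */

struct receiver {
    uint32_t n;          /* every sequence number below n has been seen */
    bool seen[WINDOW];   /* seen[i] records whether n + i has arrived */
};

void receiver_init(struct receiver *r)
{
    assert(r != NULL);
    r->n = 0;
    memset(r->seen, 0, sizeof r->seen);
}

/* Returns true if the message is new and should be processed; false if it is
 * a duplicate to ignore or lies beyond the window. */
bool receiver_accept(struct receiver *r, uint32_t seq)
{
    assert(r != NULL);
    if (seq < r->n)
        return false;                    /* before N: already seen */

    uint32_t offset = seq - r->n;
    if (offset >= WINDOW)
        return false;                    /* too far ahead to remember */
    if (r->seen[offset])
        return false;                    /* after N, but remembered */

    r->seen[offset] = true;

    /* Advance N past every contiguous sequence number now seen. */
    uint32_t shift = 0;
    while (shift < WINDOW && r->seen[shift])
        shift++;
    if (shift > 0) {
        memmove(r->seen, r->seen + shift,
                (WINDOW - shift) * sizeof r->seen[0]);
        memset(r->seen + (WINDOW - shift), 0, shift * sizeof r->seen[0]);
        r->n += shift;
    }
    return true;
}
```

For the increment-counter example, the receiver would increment only when receiver_accept() returns true, so a re-sent message does not increment twice.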
slide-78
SLIDE 78

TCP

Most popular protocol based on seq nums.

  • Also buffers messages so they arrive in order.
  • Timeouts are adaptive.
slide-80
SLIDE 80

Virtual Memory

Inspiration: threads share memory

  • Idea: processes on different machines share mem
slide-81
SLIDE 81

Virtual Memory

Inspiration: threads share memory

Idea: processes on different machines share mem

Strategy:

  • a bit like swapping we saw before
  • instead of swap to disk, swap to other machine
  • sometimes multiple copies may be in memory on different machines
slide-82
SLIDE 82

[Figure: two page tables with columns PFN, valid, and present, one for the process on Machine A (physical frames 5 to 8 shown) and one for the process on Machine B (physical frames 21 to 24 shown).]

slide-83
SLIDE 83

[Figure: the same two page tables.]

Map 3-page region into both memories.

slide-84
SLIDE 84

[Figure: the same two page tables; A's memory holds X, Y, Z.]

A writes X, Y, Z

slide-85
SLIDE 85

[Figure: the same two page tables; A's memory holds X, Y, Z and B's memory holds a copy of X.]

B reads 1st page

slide-86
SLIDE 86

[Figure: the same two page tables; A's memory holds X, Y, Z and B's memory holds copies of X and Y.]

B reads 2nd page

slide-87
SLIDE 87

[Figure: the same two page tables; A's memory holds only Y, Z and B's memory holds X’ and Y.]

B writes X’ to 1st page

slide-88
SLIDE 88

[Figure: the same two page tables; A's memory holds X’, Y, Z and B's memory holds X’ and Y.]

A reads 1st page
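
The walk-through above can be mimicked with a toy model in which each machine keeps a present bit per page. Reading a page that is not present copies it from the other machine, and writing a page invalidates the other machine's copy. The invalidate-on-write rule is an assumption read off the figures, the names are hypothetical, and direct memory copies stand in for network messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 3
#define PAGE_SIZE 16   /* tiny pages keep the example readable */

struct node {
    bool present[NPAGES];
    char mem[NPAGES][PAGE_SIZE];
};

/* Read a page; if it is not present (a "page fault"), copy it from the peer. */
const char *dsm_read(struct node *self, const struct node *peer, int page)
{
    assert(page >= 0 && page < NPAGES);
    if (!self->present[page]) {
        assert(peer->present[page]);   /* sketch: someone always holds the page */
        memcpy(self->mem[page], peer->mem[page], PAGE_SIZE);
        self->present[page] = true;
    }
    return self->mem[page];
}

/* Overwrite a whole page locally, then invalidate the peer's copy. */
void dsm_write(struct node *self, struct node *peer, int page, const char *text)
{
    assert(page >= 0 && page < NPAGES);
    assert(strlen(text) < PAGE_SIZE);
    memset(self->mem[page], 0, PAGE_SIZE);
    strcpy(self->mem[page], text);
    self->present[page] = true;
    peer->present[page] = false;
}
```

Every read miss and every write here would be a network round trip in a real system, which is the performance concern with loads and stores that are expected to be fast.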

slide-89
SLIDE 89

Virtual Memory Problems

What if a machine crashes?

  • mapping disappears in other machines
  • how to handle?

Performance?

  • when to prefetch?
  • loads/stores expected to be fast

DSM (distributed shared memory) not used today.
slide-90
SLIDE 90

Global File System

Advantages

  • file access is already expected to be slow
  • use common API
  • no need to modify applications (sorta true, flocks over NFS don’t work)

Disadvantages

  • doesn’t always make sense, e.g., for video app
slide-91
SLIDE 91

Overview

Raw messages

  • Reliable messages
  • OS abstractions
  • virtual memory
  • global file system
  • Programming-languages abstractions
  • remote procedure call
slide-92
SLIDE 92

RPC

Remote Procedure Call.

  • What could be easier than calling a function?
  • Strategy: create wrappers so calling a function on another machine feels just like calling a local function.

  • This abstraction is very common in industry.
slide-93
SLIDE 93

RPC

Machine A

int main(…) {
}

Machine B

int foo(char *msg) { … }

slide-94
SLIDE 94

RPC

Machine A

int main(…) {
    int x = foo();
}

Machine B

int foo(char *msg) { … }

Want main() on A to call foo() on B.

slide-96
SLIDE 96

RPC

Machine A

int main(…) {
    int x = foo();
}

int foo(char *msg) {
    send msg to B
    recv msg from B
}

Machine B

int foo(char *msg) { … }

Want main() on A to call foo() on B.

slide-97
SLIDE 97

RPC

Machine A

int main(…) {
    int x = foo();
}

int foo(char *msg) {
    send msg to B
    recv msg from B
}

Machine B

int foo(char *msg) { … }

void foo_listener() {
    while(1) {
        recv, call foo
    }
}

Want main() on A to call foo() on B.

slide-98
SLIDE 98

RPC

Machine A

int main(…) {
    int x = foo();
}

int foo(char *msg) {
    send msg to B
    recv msg from B
}

Machine B

int foo(char *msg) { … }

void foo_listener() {
    while(1) {
        recv, call foo
    }
}

Actual calls.

slide-99
SLIDE 99

RPC

Machine A

int main(…) {
    int x = foo();
}

int foo(char *msg) {
    send msg to B
    recv msg from B
}

Machine B

int foo(char *msg) { … }

void foo_listener() {
    while(1) {
        recv, call foo
    }
}

What it feels like for programmer.

slide-100
SLIDE 100

RPC

Machine A

int main(…) {
    int x = foo();
}

int foo(char *msg) {
    send msg to B
    recv msg from B
}

Machine B

int foo(char *msg) { … }

void foo_listener() {
    while(1) {
        recv, call foo
    }
}

Wrappers: foo() on Machine A is the client wrapper; foo_listener() on Machine B is the server wrapper.
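
The wrapper idea can be tried in a single process. In the sketch below, the client wrapper foo() packs its argument into a byte buffer, and a stand-in transport() hands the buffer to the server wrapper, which unpacks it, calls the real function, and packs the return value. The names foo_impl, foo_server_wrapper, and transport are hypothetical, and so is the body of foo_impl (the slides elide it). A real RPC runtime would use sockets where transport() makes a direct call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_MSG 256

/* "Machine B": the real function. Here it just returns the message length. */
static int foo_impl(const char *msg)
{
    return (int)strlen(msg);
}

/* Server wrapper: unpack the request, call foo, pack the return value. */
static size_t foo_server_wrapper(const uint8_t *req, size_t req_len,
                                 uint8_t *reply)
{
    char msg[MAX_MSG];

    assert(req_len < MAX_MSG);
    memcpy(msg, req, req_len);
    msg[req_len] = '\0';

    uint32_t result = (uint32_t)foo_impl(msg);
    reply[0] = (uint8_t)(result >> 24);   /* big-endian on the wire */
    reply[1] = (uint8_t)(result >> 16);
    reply[2] = (uint8_t)(result >> 8);
    reply[3] = (uint8_t)result;
    return 4;
}

/* Stand-in for "send msg to B, recv msg from B". */
static size_t transport(const uint8_t *req, size_t req_len, uint8_t *reply)
{
    return foo_server_wrapper(req, req_len, reply);
}

/* "Machine A": client wrapper that the caller uses like a local function. */
int foo(const char *msg)
{
    uint8_t reply[4];
    size_t len = strlen(msg);

    assert(len < MAX_MSG);
    size_t n = transport((const uint8_t *)msg, len, reply);
    assert(n == 4);
    (void)n;

    uint32_t result = ((uint32_t)reply[0] << 24) | ((uint32_t)reply[1] << 16) |
                      ((uint32_t)reply[2] << 8) | (uint32_t)reply[3];
    return (int)result;
}
```

From the caller's side, int x = foo("hello") looks like an ordinary local call. All of the packing and unpacking is hidden inside the two wrappers.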

slide-101
SLIDE 101

RPC Tools

RPC packages help with this with two components.

(1) Stub generation

  • create wrappers automatically

(2) Runtime library

  • thread pool
  • socket listeners call functions on server
slide-103
SLIDE 103

Stub Generation

Many tools will automatically generate wrappers:

  • rpcgen
  • thrift
  • protobufs

Programmer fills in generated stubs.
slide-104
SLIDE 104

Wrapper Generation

Wrappers must do conversions:

  • client arguments to message
  • message to server arguments
  • server return to message
  • message to client return

Need uniform endianness (wrappers do this).

Conversion is called marshaling/unmarshaling, or serializing/deserializing.
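
A small sketch of marshaling with uniform endianness, using the POSIX htonl() and ntohl() conversions between host and network (big-endian) byte order. The struct add_args and the two function names are hypothetical; generated stubs from a real RPC package would look different.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical arguments for a remote call. */
struct add_args {
    uint32_t a;
    uint32_t b;
};

#define ADD_ARGS_WIRE_SIZE 8

/* Marshal: client arguments to message, in network byte order. */
void marshal_add_args(const struct add_args *args, uint8_t *buf)
{
    assert(args != NULL && buf != NULL);
    uint32_t a = htonl(args->a);
    uint32_t b = htonl(args->b);
    memcpy(buf, &a, 4);
    memcpy(buf + 4, &b, 4);
}

/* Unmarshal: message to server arguments, back in host byte order. */
void unmarshal_add_args(const uint8_t *buf, struct add_args *args)
{
    assert(args != NULL && buf != NULL);
    uint32_t a, b;
    memcpy(&a, buf, 4);
    memcpy(&b, buf + 4, 4);
    args->a = ntohl(a);
    args->b = ntohl(b);
}
```

Because both sides convert through the same wire order, a little-endian client and a big-endian server agree on the values. The memcpy calls avoid assuming that the buffer is aligned for 32-bit access.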
slide-105
SLIDE 105

Wrapper Generation: Pointers

Why are pointers problematic?

slide-106
SLIDE 106

Wrapper Generation: Pointers

Why are pointers problematic?

  • The addr passed from the client will not be valid on the server.
  • Solutions?
slide-107
SLIDE 107

Wrapper Generation: Pointers

Why are pointers problematic?

  • The addr passed from the client will not be valid on the server.

Solutions?

  • smart RPC package: follow pointers
  • distribute generic data structs with RPC package
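
Here is one sketch of what "follow pointers" could mean for a linked list: the client side walks the list and sends the values, and the server side rebuilds an equivalent list with pointers that are valid locally. The list type and function names are assumptions for illustration. Byte-order conversion of the values is left out to keep the focus on the pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct list_node {
    int32_t value;
    struct list_node *next;   /* only meaningful in one address space */
};

void free_list(struct list_node *head)
{
    while (head != NULL) {
        struct list_node *next = head->next;
        free(head);
        head = next;
    }
}

/* Client side: follow the pointers and copy the values, not the addresses.
 * Returns the number of values written, or -1 if the list does not fit. */
int marshal_list(const struct list_node *head, int32_t *out, int capacity)
{
    assert(out != NULL && capacity >= 0);
    int count = 0;
    for (const struct list_node *n = head; n != NULL; n = n->next) {
        if (count == capacity)
            return -1;
        out[count++] = n->value;
    }
    return count;
}

/* Server side: rebuild the list with fresh, locally valid pointers.
 * Returns NULL for an empty list or if allocation fails. */
struct list_node *unmarshal_list(const int32_t *in, int count)
{
    assert(in != NULL && count >= 0);
    struct list_node *head = NULL;
    for (int i = count - 1; i >= 0; i--) {
        struct list_node *n = malloc(sizeof *n);
        if (n == NULL) {
            free_list(head);
            return NULL;
        }
        n->value = in[i];
        n->next = head;
        head = n;
    }
    return head;
}
```

The server works on a copy, so any changes it makes are invisible to the client unless the reply carries them back.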
slide-108
SLIDE 108

RPC Tools

RPC packages help with this with two components.

(1) Stub generation

  • create wrappers automatically

(2) Runtime library

  • thread pool
  • socket listeners call functions on server
slide-110
SLIDE 110

Runtime Library

Design decisions:

  • How to serve calls? Usually with a thread pool.
  • What underlying protocol to use? Usually UDP.
slide-111
SLIDE 111

RPC over TCP

Sender                          Receiver
[call]
[tcp send]
                                [recv]
                                [ack]
                                [exec call]
                                …
                                [return]
                                [tcp send]
[recv]
[ack]
slide-112
SLIDE 112

RPC over TCP

Sender                          Receiver
[call]
[tcp send]
                                [recv]
                                [ack]
                                [exec call]
                                …
                                [return]
                                [tcp send]
[recv]
[ack]

Why wasteful?

slide-113
SLIDE 113

RPC over UDP

Strategy: use function return as implicit ACK.

  • Piggybacking technique.

What if function takes a long time?

  • then send a separate ACK
slide-114
SLIDE 114

Conclusion

Many communication abstractions are possible:

  • Raw messages (UDP)
  • Reliable messages (TCP)
  • Virtual memory (OS)
  • Global file system (OS)
  • Function calls (RPC)

slide-115
SLIDE 115

Announcements

Thursday discussion

  • review midterm 2.

Office hours

  • today at 1pm, in office