SLIDE 1 [537] Distributed Systems
Chapter 42, Tyler Harter, 11/19/14
SLIDE 2 File-System Case Studies
Local
- FFS: Fast File System
- LFS: Log-Structured File System
Network
- NFS: Network File System
- AFS: Andrew File System
SLIDE 3 File-System Case Studies
Local
- FFS: Fast File System
- LFS: Log-Structured File System
Network
- Intro: communication basics [today]
- NFS: Network File System
- AFS: Andrew File System
SLIDE 4
Review
SLIDE 5 Atomicity
Say we want to do several things.
- Atomicity means we don’t get interrupted when partially done (or at least that we can make it appear that way to the user).
- Concurrency: we’re worried about other threads
- Persistence: we’re worried about crashes
SLIDE 6 Atomic Update
Say we want to update a file foo.txt. If we crash, we want one of the following:
- all old data
- all new data
- Strategy: write new data to foo.tmp, and only after that’s complete, replace foo.txt by switching names.
SLIDE 7
Bad Protocol
copy foo.txt to foo.tmp (with changes)
rename foo.tmp to foo.txt
SLIDE 8 Bad Protocol
[Figure: foo.txt holds Old Data (on disk).]
SLIDE 9 Bad Protocol
copy foo.txt to foo.tmp (with changes)
[Figure: foo.txt holds Old Data (on disk); foo.tmp holds New Data (in RAM).]
SLIDE 10 Bad Protocol
copy foo.txt to foo.tmp (with changes)
rename foo.tmp to foo.txt
[Figure: foo.txt now names the New Data, which is still only in RAM; the Old Data (on disk) is no longer reachable.]
SLIDE 11 Bad Protocol
copy foo.txt to foo.tmp (with changes)
rename foo.tmp to foo.txt
[Figure: foo.txt holds New Data (in RAM, eventually on disk). A crash before the new data reaches disk loses both versions.]
SLIDE 12
Good Protocol
copy foo.txt to foo.tmp (with changes)
fsync foo.tmp
rename foo.tmp to foo.txt
SLIDE 13 Good Protocol
[Figure: foo.txt holds Old Data (on disk).]
SLIDE 14 Good Protocol
copy foo.txt to foo.tmp (with changes)
[Figure: foo.txt holds Old Data (on disk); foo.tmp holds New Data (in RAM).]
SLIDE 15 Good Protocol
copy foo.txt to foo.tmp (with changes)
fsync foo.tmp
[Figure: foo.txt holds Old Data (on disk); foo.tmp holds New Data (on disk).]
SLIDE 16 Good Protocol
copy foo.txt to foo.tmp (with changes)
fsync foo.tmp
rename foo.tmp to foo.txt
[Figure: foo.txt holds Old Data (on disk); the New Data is also on disk.]
SLIDE 17 Good Protocol
copy foo.txt to foo.tmp (with changes)
fsync foo.tmp
rename foo.tmp to foo.txt
[Figure: foo.txt holds New Data (on disk). At every step, a crash leaves either all old or all new data.]
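The good protocol maps directly onto a few system calls. Below is a minimal Python sketch (the helper name atomic_update is mine, not from the slides): os.fsync forces foo.tmp’s bytes to disk, and os.replace performs the atomic rename.

```python
import os

def atomic_update(path, new_data):
    """Update path so a crash leaves either all-old or all-new contents."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_data)        # new data may still be only in RAM...
        f.flush()
        os.fsync(f.fileno())     # ...until fsync forces it to disk
    os.replace(tmp, path)        # atomic rename: switch names only after data is durable

# usage: create a file, then update it atomically
with open("foo.txt", "w") as f:
    f.write("old")
atomic_update("foo.txt", "new")
print(open("foo.txt").read())    # new
```

A fully crash-safe version would also fsync the containing directory so the rename itself is durable; the sketch omits that step.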
SLIDE 18 Local FS Comparison
FFS+Journal:
- must write data twice (writes expensive)
- can put data exactly where we like (reads cheaper)
LFS:
- all writes sequential (writes cheaper)
- reads may be very random (reads expensive)
SLIDE 19 Local FS Comparison
In what ways is FFS more complex?
- In what ways is LFS more complex?
- Compare group descriptor to segment summary.
- LFS: why don’t we need to update the root inode upon updating any file?
SLIDE 20
Distributed Systems
SLIDE 21 OSTEP Definition
Def: more than 1 machine
- Examples:
- client/server: web server and web client
- cluster: page rank computation
- Other courses:
- CS 640: Networking
- CS 739: Distributed Systems
SLIDE 22 Why Go Distributed?
More compute power
- More storage capacity
- Fault tolerance
- Data sharing
SLIDE 23 New Challenges
System failure: need to worry about partial failure.
- Communication failure: links unreliable
SLIDE 24 Communication
All communication is inherently unreliable.
- Need to worry about:
- bit errors
- packet loss
- node/link failure
SLIDE 25
Why are network sockets less reliable than pipes?
SLIDE 26-41 (animation)
[Figure: a Writer Process writes into an in-kernel Pipe buffer, and a Reader Process reads from it; data crosses the user/kernel boundary on each side. When the pipe buffer fills, the write waits for space until the reader drains some data. Because a single kernel owns the buffer, nothing is lost or reordered.]
SLIDE 42-44 (animation)
[Figure: a Writer Process on Machine A writes to a Network Socket (crossing the user/kernel boundary), the message travels through a Router, and a Reader Process on Machine B reads it.]
- what if the router’s buffer is full?
- what if B’s buffer is full?
SLIDE 45
[Figure: Writer Process on Machine A; everything past A’s socket is drawn as a question mark.]
From A’s view, the network and B are largely a black box.
SLIDE 46 Overview
Raw messages
- Reliable messages
- OS abstractions
- virtual memory
- global file system
- Programming-languages abstractions
- remote procedure call
SLIDE 47 Raw Messages: UDP
API:
- reads and writes over socket file descriptors
- messages are sent from/to ports to target a specific process on a machine
- Provides only minimal reliability features:
- messages may be lost
- messages may be reordered
- messages may be duplicated
- only protection: checksums
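The UDP API boils down to a handful of socket calls. A minimal loopback sketch (the port choice and message contents are arbitrary):

```python
import socket

# receiver: bind a datagram socket to a port
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))        # 0 = let the OS pick a free port
port = recv_sock.getsockname()[1]

# sender: no connection setup; just address each message to (host, port)
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(b"hello", ("127.0.0.1", port))

# each recvfrom returns one whole message -- if it was dropped,
# UDP will never retransmit it for us
data, addr = recv_sock.recvfrom(1024)
print(data)                             # b'hello' (reliable here only because it's loopback)

send_sock.close()
recv_sock.close()
```

Over a real network the sendto could silently vanish, arrive twice, or arrive after a later message; that is exactly the behavior the slides list.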
SLIDE 48 Raw Messages: UDP
Advantages
- lightweight
- some applications make better reliability decisions themselves (e.g., video conferencing programs)
- Disadvantages
- more difficult to write application correctly
SLIDE 49 Overview
Raw messages
- Reliable messages
- OS abstractions
- virtual memory
- global file system
- Programming-languages abstractions
- remote procedure call
SLIDE 50 Strategy
Using software, build reliable, logical connections over unreliable connections.
- Strategies:
- acknowledgment
SLIDE 51 ACK
Sender
[send message]
Receiver
[send ack]
Sender knows message was received.
SLIDE 52 ACK
Sender
[send message]
Sender misses ACK… What to do?
SLIDE 53 Strategy
Using software, build reliable, logical connections over unreliable connections.
- Strategies:
- acknowledgment
SLIDE 54 Strategy
Using software, build reliable, logical connections over unreliable connections.
- Strategies:
- acknowledgment
- timeout
SLIDE 55 Timeout
Sender
[send message]
SLIDE 56 Timeout
Sender
[send message] [start timer]
SLIDE 57 Timeout
Sender
[send message] [start timer]
Receiver
SLIDE 58 Timeout
Sender
[send message] [start timer]
- … waiting for ack …
- [timer goes off]
Receiver
SLIDE 59 Timeout
Sender
[send message] [start timer]
- … waiting for ack …
- [timer goes off]
[send message]
Receiver
[send ack]
SLIDE 60
Timeout: Issue 1
How long to wait?
SLIDE 61 Timeout: Issue 1
How long to wait?
- Too long: system feels unresponsive
- Too short: messages needlessly re-sent
- Messages may have been dropped due to an overloaded server. Aggressive clients make this worse.
SLIDE 62 Timeout: Issue 1
How long to wait?
- One strategy: be adaptive.
- Adjust time based on how long acks usually take.
- For each missing ack, wait longer between retries.
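"Be adaptive" can be done TCP-style: keep an exponentially weighted average of observed ack times and set the timeout a few deviations above it. A sketch; the 1/8 and 1/4 weights are TCP's classic choices (RFC 6298), not from the slides.

```python
class AdaptiveTimeout:
    """Track smoothed RTT and its variation; timeout = srtt + 4*rttvar."""
    def __init__(self, first_sample):
        self.srtt = first_sample
        self.rttvar = first_sample / 2

    def observe(self, sample):
        # EWMA updates: new samples nudge the estimates, old history decays
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample)
        self.srtt = 0.875 * self.srtt + 0.125 * sample

    def timeout(self):
        return self.srtt + 4 * self.rttvar

est = AdaptiveTimeout(0.100)              # first ack took 100 ms
for rtt in (0.110, 0.090, 0.105):         # later acks
    est.observe(rtt)
print(round(est.timeout(), 3))            # a few deviations above the smoothed RTT
```

The other rule from the slide, waiting longer between retries for each missing ack, is usually layered on top by doubling the timeout on every retransmission (exponential backoff).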
SLIDE 63
Timeout: Issue 2
What does a lost ack really mean?
SLIDE 64
Case 1: Sender [send message]; the message is lost, so the Receiver never sends an ack.
Case 2: Sender [send message]; Receiver [send ack], but the ack is lost.
How can the sender tell these two cases apart?
SLIDE 65 Timeout: Issue 2
What does a lost ack really mean?
- ACK: message received exactly once
- No ACK: message received at most once
SLIDE 66 Timeout: Issue 2
What does a lost ack really mean?
- ACK: message received exactly once
- No ACK: message received at most once
- What if message is command to increment counter?
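Why the increment example is dangerous: if the ack is lost and the sender retries, a receiver with no duplicate detection applies the command twice. A toy simulation (names are invented) to make that concrete:

```python
counter = 0

def handle(msg):
    """A receiver with no duplicate detection: every arrival is applied."""
    global counter
    if msg == "increment":
        counter += 1
    return "ack"

handle("increment")   # message arrives, but suppose the ack is lost...
handle("increment")   # ...so the sender times out and retries
print(counter)        # 2 -- one logical command, applied twice
```

Increment is not idempotent, so "at least once" delivery corrupts the state; this is what motivates remembering messages (next slides).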
SLIDE 67 Proposed Solution
Sender could send an AckAck so receiver knows whether to retry sending an Ack.
SLIDE 68
Aside: Two Generals’ Problem
general 1 general 2 enemy
SLIDE 69 Aside: Two Generals’ Problem
general 1 general 2 enemy
Suppose the generals agree after N messages. Did the arrival of the Nth message change anybody’s decision?
SLIDE 70 Aside: Two Generals’ Problem
general 1 general 2 enemy
Suppose the generals agree after N messages. Did the arrival of the Nth message change anybody’s decision?
- if yes: then what if the N’th message had been lost?
- if no: then why bother sending N messages?
SLIDE 71 Timeout: Issue 2
What does a lost ack really mean?
- ACK: message received exactly once
- No ACK: message received at most once
- What if message is command to increment counter?
SLIDE 72 Strategy
Using software, build reliable, logical connections over unreliable connections.
- Strategies:
- acknowledgment
- timeout
SLIDE 73 Strategy
Using software, build reliable, logical connections over unreliable connections.
- Strategies:
- acknowledgment
- timeout
- remember sent messages
SLIDE 74 Receiver Remembers Messages
Sender
[send message]
[send message]
Receiver
[send ack]
[send ack]
SLIDE 75 Receiver Remembers Messages
Sender
[send message]
[send message]
Receiver
[send ack]
[send ack]
how do we know to ignore?
SLIDE 76
Solutions
Solution 1: remember every message ever sent.
SLIDE 77 Solutions
Solution 1: remember every message ever sent.
- Solution 2: sequence numbers
- give each message a seq number
- receiver knows all messages numbered up to some N have been seen
- receiver remembers messages received after N
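Solution 2 in sketch form, assuming (as TCP arranges) that messages are processed in order; the class and method names are mine:

```python
class SeqReceiver:
    """Deliver each message exactly once using sequence numbers."""
    def __init__(self):
        self.expected = 0      # every message with seq < expected has been seen
        self.delivered = []

    def receive(self, seq, payload):
        if seq == self.expected:
            self.delivered.append(payload)   # new message: deliver it
            self.expected += 1
        # seq < expected: a retransmitted duplicate -- don't re-deliver,
        # but still ack so the sender stops re-sending
        return ("ack", seq)

r = SeqReceiver()
r.receive(0, "a")
r.receive(0, "a")      # duplicate: acked again, not delivered again
r.receive(1, "b")
print(r.delivered)     # ['a', 'b']
```

With this in place, a retried "increment" command is acked but applied only once, fixing the counter problem from earlier.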
SLIDE 78 TCP
Most popular protocol based on seq nums.
- Also buffers messages so they arrive in order.
- Timeouts are adaptive.
SLIDE 79 Overview
Raw messages
- Reliable messages
- OS abstractions
- virtual memory
- global file system
- Programming-languages abstractions
- remote procedure call
SLIDE 80 Virtual Memory
Inspiration: threads share memory
- Idea: processes on different machines share memory
SLIDE 81 Virtual Memory
Inspiration: threads share memory
- Idea: processes on different machines share memory
- Strategy:
- a bit like the swapping we saw before
- instead of swapping to disk, swap to another machine
- sometimes multiple copies may be in memory on different machines
SLIDE 82-88 (animation)
[Figure: page tables (PFN, valid, present bits) for a process on Machine A and a process on Machine B. The sequence: a 3-page region is mapped into both address spaces; A writes X, Y, Z into the three pages; B reads the 1st page (X is copied into B’s memory); B reads the 2nd page (Y is copied over); B writes X’ to the 1st page (A’s copy is invalidated); A reads the 1st page (X’ is copied back to A).]
SLIDE 89 Virtual Memory Problems
What if a machine crashes?
- mappings on other machines disappear
- how to handle?
- Performance?
- when to prefetch?
- loads/stores expected to be fast
- DSM (distributed shared memory) not used today.
SLIDE 90 Global File System
Advantages
- file access is already expected to be slow
- use common API
- no need to modify applications (sorta true: flock() over NFS doesn’t work)
- Disadvantages
- doesn’t always make sense, e.g., for a video app
SLIDE 91 Overview
Raw messages
- Reliable messages
- OS abstractions
- virtual memory
- global file system
- Programming-languages abstractions
- remote procedure call
SLIDE 92 RPC
Remote Procedure Call.
- What could be easier than calling a function?
- Strategy: create wrappers so calling a function on another machine feels just like calling a local function.
- This abstraction is very common in industry.
SLIDE 93 RPC
Machine A: int main(…) { … }
Machine B: int foo(char *msg) { … }
SLIDE 94-95 RPC
Machine A: int main(…) { int x = foo(); }
Machine B: int foo(char *msg) { … }
Want main() on A to call foo() on B.
SLIDE 96 RPC
Machine A: int main(…) { int x = foo(); }
           int foo(…) { send msg to B; recv msg from B; }
Machine B: int foo(char *msg) { … }
Want main() on A to call foo() on B.
SLIDE 97 RPC
Machine A: int main(…) { int x = foo(); }
           int foo(…) { send msg to B; recv msg from B; }
Machine B: int foo(char *msg) { … }
           while(1) { recv, call foo }
Want main() on A to call foo() on B.
SLIDE 98 RPC
(same code) Actual calls: the send/recv traffic between the machines.
SLIDE 99 RPC
(same code) What it feels like for the programmer: main() simply calls foo().
SLIDE 100 RPC
(same code) Wrappers: foo() on A is the client wrapper; the receive loop on B is the server wrapper.
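The client and server wrappers from slides 96-100 can be sketched end to end. This is a toy: socket.socketpair() stands in for the network between A and B, and JSON stands in for a real wire format.

```python
import json, socket, threading

def foo(msg):                       # the real function, living on "Machine B"
    return len(msg)

def server(sock):                   # server wrapper: recv, call foo, send result
    data = sock.recv(1024)          # (one iteration of the while(1) loop)
    req = json.loads(data)          # unmarshal the arguments
    result = foo(*req["args"])      # the actual call
    sock.sendall(json.dumps({"ret": result}).encode())   # marshal the return value

def foo_stub(sock, msg):            # client wrapper: send msg to B, recv msg from B
    sock.sendall(json.dumps({"func": "foo", "args": [msg]}).encode())
    return json.loads(sock.recv(1024))["ret"]

a, b = socket.socketpair()          # stand-in for the network between A and B
t = threading.Thread(target=server, args=(b,))
t.start()
x = foo_stub(a, "hello")            # feels like a local call to foo()
t.join()
print(x)                            # 5
```

Real RPC packages generate foo_stub and the dispatch loop automatically; that is the stub generation discussed next.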
SLIDE 101 RPC Tools
RPC packages help with this via two components.
- (1) Stub generation
- create wrappers automatically
- (2) Runtime library
- thread pool
- socket listeners call functions on server
SLIDE 102 RPC Tools
RPC packages help with this via two components.
- (1) Stub generation
- create wrappers automatically
- (2) Runtime library
- thread pool
- socket listeners call functions on server
SLIDE 103 Stub Generation
Many tools will automatically generate wrappers:
- rpcgen
- thrift
- protobufs
- Programmer fills in generated stubs.
SLIDE 104 Wrapper Generation
Wrappers must do conversions:
- client arguments to message
- message to server arguments
- server return to message
- message to client return
- Need uniform endianness (wrappers do this).
- Conversion is called marshaling/unmarshaling, or serializing/deserializing.
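The endianness point, concretely: Python's struct module with the "!" prefix marshals integers in network byte order (big-endian), so both endpoints agree regardless of native byte order. The message layout here is invented for illustration.

```python
import struct

def marshal(n, s):
    """Pack an int and a string into one message: int, string length, string bytes."""
    data = s.encode()
    # "!" = network byte order (big-endian); "i" = 4-byte int
    return struct.pack("!ii", n, len(data)) + data

def unmarshal(msg):
    n, slen = struct.unpack_from("!ii", msg)
    return n, msg[8:8 + slen].decode()

wire = marshal(7, "hi")
print(wire)              # b'\x00\x00\x00\x07\x00\x00\x00\x02hi'
print(unmarshal(wire))   # (7, 'hi')
```

A little-endian machine would store 7 as 07 00 00 00 in RAM, but the wire bytes above are the same for every sender, which is exactly what the wrappers guarantee.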
SLIDE 105
Wrapper Generation: Pointers
Why are pointers problematic?
SLIDE 106 Wrapper Generation: Pointers
Why are pointers problematic?
- The addr passed from the client will not be valid on the server.
- Solutions?
SLIDE 107 Wrapper Generation: Pointers
Why are pointers problematic?
- The addr passed from the client will not be valid on the server.
- Solutions?
- smart RPC package: follow pointers
- distribute generic data structs with RPC package
SLIDE 108 RPC Tools
RPC packages help with this via two components.
- (1) Stub generation
- create wrappers automatically
- (2) Runtime library
- thread pool
- socket listeners call functions on server
SLIDE 109 RPC Tools
RPC packages help with this via two components.
- (1) Stub generation
- create wrappers automatically
- (2) Runtime library
- thread pool
- socket listeners call functions on server
SLIDE 110 Runtime Library
Design decisions:
- How to serve calls?
- usually with a thread pool
- What underlying protocol to use?
- usually UDP
SLIDE 111-112 (figure)
[Figure: RPC over TCP. Sender: [call], [tcp send]; Receiver: [ack], [exec call] …, [tcp send] of the result; Sender: [ack].]
Why wasteful?
SLIDE 113 RPC over UDP
Strategy: use function return as implicit ACK.
- Piggybacking technique.
- What if function takes a long time?
- then send a separate ACK
SLIDE 114 Conclusion
Many communication abstractions are possible:
- Reliable messages (TCP)
- Virtual memory (OS)
- Global file system (OS)
- Function calls (RPC)
SLIDE 115 Announcements
Thursday discussion
- review midterm 2.
- Office hours
- today at 1pm, in office