Distributed Systems CS425/ECE428 Logistics Related Undergraduates - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Distributed Systems

CS425/ECE428

slide-2
SLIDE 2

Logistics Related

  • Undergraduates switching from T3 to T4: please email Heather Mihaly and Elsa Gunter (hmihal2@illinois.edu, egunter@illinois.edu) with the request and your UIN.

slide-3
SLIDE 3

Today’s agenda

  • System Model
      • Chapter 2.4 (except 2.4.3), parts of Chapter 2.3
  • Failure Detection
      • Chapter 15.1
slide-4
SLIDE 4

What is a distributed system?

Independent components that are connected by a network and communicate by passing messages to achieve a common goal, appearing as a single coherent system.

A component may be a process, thread, node, ....

slide-5
SLIDE 5

Relationship between processes

  • Two main categories:
      • Client-server
      • Peer-to-peer
slide-6
SLIDE 6

Relationship between processes

  • Client-server

[Figure: the client sends a request to the server, and the server returns a response.]

Clear difference in roles.

slide-7
SLIDE 7

Relationship between processes

  • Client-server

[Figure: Client, P, and Server in a chain.]

  • 1. Request (Client to P)
  • 2. Request (P to Server)
  • 3. Response (Server to P)
  • 4. Response (P to Client)
slide-8
SLIDE 8

Relationship between processes

  • Peer-to-peer

[Figure: three peers.]

Similar roles. Run the same program/algorithm.

slide-9
SLIDE 9

Relationship between processes

[Figure: several clients and servers; label: peer-to-peer.]

slide-10
SLIDE 10

Relationship between processes

  • Two broad categories:
      • Client-server
      • Peer-to-peer
slide-11
SLIDE 11

Distributed algorithm

  • Algorithm on a single process
      • Sequence of steps taken to perform a computation.
      • Steps are strictly sequential.
  • Distributed algorithm
      • Steps taken by each of the processes in the system (including transmission of messages).
      • Different processes may execute their steps concurrently.
slide-12
SLIDE 12

Key aspects of a distributed system

  • Processes must communicate with one another to coordinate actions. Communication time is variable.
  • Different processes (on different computers) have different clocks!
  • Processes and communication channels may fail.
slide-14
SLIDE 14

How processes communicate

  • Directly using network sockets.
  • Abstractions such as remote procedure calls, publish-subscribe systems, or distributed shared memory.
  • Differ with respect to how the message, the sender, or the receiver is specified.
slide-15
SLIDE 15

How processes communicate

[Figure: process p sends message m to process q over a communication channel.]

slide-16
SLIDE 16

Communication channel properties

[Figure: process p sends message m to process q over a communication channel with latency L.]

  • Latency (L): Delay between the start of m’s transmission at p and the beginning of its receipt at q.
      • Time taken for a bit to propagate through network links.
      • Queuing that happens at intermediate hops.
      • Delay in getting to the network.
      • Overheads in the operating systems in sending and receiving messages.
      • …..

slide-17
SLIDE 17

Communication channel properties

[Figure: process p sends message m to process q; transmitting m takes size(m)/B.]

  • Latency (L): Delay between the start of m’s transmission at p and the beginning of its receipt at q.
  • Bandwidth (B): Total amount of information that can be transmitted over the channel per unit time.
      • Per-channel bandwidth reduces as multiple channels share common network links.

slide-18
SLIDE 18

Communication channel properties

[Figure: process p sends message m to process q.]

  • Total time taken to pass a message is governed by the latency and bandwidth of the channel.
  • Both latency and available bandwidth may vary over time.
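
As a rough illustration of how latency and bandwidth combine, the sketch below assumes the simple model total time = L + size(m)/B, using the definitions of L and B given earlier. The function name and the numbers are hypothetical, chosen only for illustration; real channels add variability that this model ignores.

```python
def transfer_time(size_bits: float, latency_s: float, bandwidth_bps: float) -> float:
    """Estimate the time to pass a message m from p to q.

    Assumed model: latency L (until the first bit arrives at q)
    plus transmission time size(m)/B.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + size_bits / bandwidth_bps


# Hypothetical example: a 1 MB message (8,000,000 bits) over a
# 100 Mbit/s channel with 20 ms latency.
t = transfer_time(size_bits=8_000_000, latency_s=0.020, bandwidth_bps=100_000_000)
print(f"{t:.3f} s")  # 0.100 s
```

For small messages the latency term dominates; for large messages the size(m)/B term dominates.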
slide-20
SLIDE 20

Differing clocks

  • Each computer in a distributed system has its own internal clock.
  • Local clocks of different processes show different time values.
  • Clocks drift from perfect time at different rates.
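
To make the effect of drift concrete, here is a minimal sketch assuming a linear drift model, in which a clock advances (1 + drift rate) seconds per real second. The model, the function name, and the drift rates (+50 ppm and -20 ppm) are assumptions for illustration, not values from the slides.

```python
def local_clock(true_time_s: float, offset_s: float, drift_rate: float) -> float:
    """Reading of a drifting clock under an assumed linear drift model."""
    return offset_s + (1.0 + drift_rate) * true_time_s


# Two hypothetical clocks, both correct at time 0,
# drifting at +50 ppm and -20 ppm respectively.
one_day = 24 * 60 * 60
skew = local_clock(one_day, 0.0, 50e-6) - local_clock(one_day, 0.0, -20e-6)
print(f"skew after one day: {skew:.3f} s")  # skew after one day: 6.048 s
```

Even small drift rates accumulate: the two clocks above disagree by about six seconds after one day.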
slide-22
SLIDE 22

Two ways to model

  • Synchronous distributed systems:
      • Known upper and lower bounds on time taken by each step in a process.
      • Known bounds on message passing delays.
      • Known bounds on clock drift rates.
  • Asynchronous distributed systems:
      • No bounds on process execution speeds.
      • No bounds on message passing delays.
      • No bounds on clock drift rates.
slide-23
SLIDE 23

Synchronous and Asynchronous

  • Most real-world systems are asynchronous.
      • Bounds can be estimated, but are hard to guarantee.
  • Assuming the system is synchronous can still be useful.
      • Possible to build a synchronous system.
slide-25
SLIDE 25

Types of failure

  • Omission: when a process or a channel fails to perform actions that it is supposed to do.
      • Process may crash.
slide-26
SLIDE 26

How to detect a crashed process?

  • Periodic ping-ack: p periodically sends a ping to q, and q replies with an ack.
  • Periodic heartbeats: q periodically sends heartbeats to p.

slide-28
SLIDE 28

How to detect a crashed process?

[Figure: periodic ping-ack between p and q.]

  • Pings are sent every T seconds.
  • If ∆1 time has elapsed after sending a ping and no ack has arrived, report a crash.
  • If synchronous, ∆1 = 2(max network delay).
  • If asynchronous, ∆1 = k(max observed round trip time).
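
The ping-ack rule can be sketched as a small simulation. This is only an illustration under stated assumptions: there is no real network, the i-th ping is taken to be sent at time i·T, and each entry of the hypothetical `ack_delays` list is that ping's round trip time (or `None` if the ack never arrives).

```python
from typing import Optional, Sequence


def ping_ack_detect(ack_delays: Sequence[Optional[float]],
                    T: float, delta1: float) -> Optional[float]:
    """Return the time at which p reports q as crashed, or None.

    ack_delays[i] is the round trip time of the i-th ping (assumed to be
    sent at time i * T), or None if no ack ever arrives.
    """
    for i, rtt in enumerate(ack_delays):
        sent_at = i * T
        if rtt is None or rtt > delta1:
            # delta1 elapsed after sending the ping with no ack: report crash.
            return sent_at + delta1
    return None


# q answers two pings, then crashes: reported at 2*T + delta1.
crash_report = ping_ack_detect([0.1, 0.2, None], T=1.0, delta1=0.4)

# q is alive but its second ack is slow: p still reports a crash.
false_report = ping_ack_detect([0.1, 0.9], T=1.0, delta1=0.4)

print(crash_report, false_report)
```

The second call shows why a fixed timeout can make mistakes when delays are unbounded: a late ack is indistinguishable from a missing one at the moment the timeout expires.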

slide-29
SLIDE 29

How to detect a crashed process?

(T + ∆2) time elapsed since last heartbeat.

[Figure: periodic heartbeats between p and q, on a timeline marked t, t + min, t + T, t + T + max.]

slide-30
SLIDE 30

How to detect a crashed process?

[Figure: periodic heartbeats between p and q.]

  • If (T + ∆2) time has elapsed since the last heartbeat, report a crash.
  • If synchronous, ∆2 = max network delay – min network delay.
  • If asynchronous, ∆2 = k(observed delay).
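
The heartbeat rule can be sketched in the same simulated style. The function below is a hypothetical illustration: `arrivals` is an assumed, sorted list of times at which heartbeats reached p, and `now` is the time at which p evaluates the rule.

```python
from typing import Optional, Sequence


def heartbeat_detect(arrivals: Sequence[float], T: float,
                     delta2: float, now: float) -> Optional[float]:
    """Return the time at which p reports q as crashed, or None.

    p reports a crash once T + delta2 has elapsed since the last
    heartbeat it received without a newer one arriving.
    """
    if not arrivals:
        raise ValueError("need at least one received heartbeat")
    timeout = T + delta2
    last = arrivals[0]
    for t in arrivals[1:]:
        if t - last > timeout:
            return last + timeout  # gap too long: reported before t arrived
        last = t
    return last + timeout if now - last > timeout else None


# Heartbeats arrive roughly every T = 1.0, then stop after 2.04.
report = heartbeat_detect([0.05, 1.07, 2.04], T=1.0, delta2=0.2, now=5.0)
print(report)
```

Here p reports the crash T + ∆2 after the last heartbeat it received.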

slide-31
SLIDE 31

Correctness of failure detection

  • Completeness
      • Every failed process is eventually detected.
  • Accuracy
      • Every detected failure corresponds to a crashed process (no mistakes).

slide-32
SLIDE 32

Correctness of failure detection

  • Characterized by completeness and accuracy.
  • Synchronous system
      • Failure detection via ping-ack and heartbeat is both complete and accurate.
  • Asynchronous system
      • Our strategy for ping-ack and heartbeat is complete.
      • Impossible to achieve both completeness and accuracy.
      • Can we have an accurate but incomplete algorithm?
          • Never report failure.
slide-36
SLIDE 36

Metrics for failure detection

  • Worst case failure detection time
      • Ping-ack: T + ∆1 - ∆ (where ∆ is the time taken for the last ping from p to reach q)
      • Heartbeat: ∆ + T + ∆2 (where ∆ is the time taken for the last message from q to reach p)

Try deriving these before next class!

slide-40
SLIDE 40

Metrics for failure detection

  • Worst case failure detection time
      • Ping-ack: T + ∆1 - ∆ (where ∆ is the time taken for the last ping from p to reach q)
      • Heartbeat: ∆ + T + ∆2 (where ∆ is the time taken for the last message from q to reach p)
  • Bandwidth usage:
      • Ping-ack: 2 messages every T units.
      • Heartbeat: 1 message every T units.

Decreasing T decreases failure detection time, but increases bandwidth usage.

slide-41
SLIDE 41

Metrics for failure detection

  • Worst case failure detection time
      • Ping-ack: T + ∆1 - ∆ (where ∆ is the time taken for the last ping from p to reach q)
      • Heartbeat: ∆ + T + ∆2 (where ∆ is the time taken for the last message from q to reach p)
  • Bandwidth usage:
      • Ping-ack: 2 messages every T units.
      • Heartbeat: 1 message every T units.

Increasing ∆1 or ∆2 increases accuracy but also increases failure detection time.
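
The trade-offs can be tabulated directly from the formulas above. The sketch below simply evaluates them; the function names and the sample values of T, ∆1, ∆2, and ∆ are hypothetical, chosen only for illustration.

```python
def ping_ack_worst_case(T: float, delta1: float, delta: float) -> float:
    """Worst case detection time for ping-ack: T + delta1 - delta."""
    return T + delta1 - delta


def heartbeat_worst_case(T: float, delta2: float, delta: float) -> float:
    """Worst case detection time for heartbeats: delta + T + delta2."""
    return delta + T + delta2


def messages_per_unit_time(T: float, scheme: str) -> float:
    """Bandwidth usage: 2 messages per period for ping-ack, 1 for heartbeat."""
    per_period = {"ping-ack": 2, "heartbeat": 1}[scheme]
    return per_period / T


# Assumed sample values: delta1 = 0.4, delta2 = 0.2, delta = 0.05.
for T in (0.5, 1.0, 2.0):
    print(
        f"T={T}: "
        f"ping-ack {ping_ack_worst_case(T, 0.4, 0.05):.2f} "
        f"({messages_per_unit_time(T, 'ping-ack'):.1f} msg/unit), "
        f"heartbeat {heartbeat_worst_case(T, 0.2, 0.05):.2f} "
        f"({messages_per_unit_time(T, 'heartbeat'):.1f} msg/unit)"
    )
```

Halving T lowers both worst case detection times by the same amount while doubling the message rate of either scheme.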

slide-42
SLIDE 42

Types of failure

  • Omission: when a process or a channel fails to perform actions that it is supposed to do.
      • Process may crash.
          • Fail-stop: if other processes can certainly detect the crash.
      • Communication omission: a message sent by a process was not received by another.

slide-43
SLIDE 43

Communication Omission

  • Channel omission: message is omitted by the channel.
  • Send omission: process completes ‘send’ operation, but the message does not reach its outgoing message buffer.
  • Receive omission: message reaches the incoming message buffer, but is not received by the process.

[Figure: process p executes send, placing m in its outgoing message buffer; the communication channel carries m to the incoming message buffer of process q, which executes receive.]

slide-44
SLIDE 44

Two Generals Problem

When to attack?

How do the two generals coordinate their time of attack?

slide-45
SLIDE 45

Two Generals Problem

When to attack?

[Figure: a messenger between the two generals is lost (X).]

What if their messengers may get shot on the way?

slide-46
SLIDE 46

Types of failure

  • Omission: when a process or a channel fails to perform actions that it is supposed to do.
      • Process may crash.
          • Fail-stop: if other processes can detect that the process has crashed.
      • Communication omission: a message sent by a process was not received by another.

Message drops (or omissions) can be mitigated by network protocols.

slide-47
SLIDE 47

Types of failure

  • Omission: when a process or a channel fails to perform actions that it is supposed to do, e.g. process crash and message drops.
  • Arbitrary (Byzantine) failures: any type of error, e.g. a process executing incorrectly, sending a wrong message, etc.
  • Timing failures: timing guarantees are not met.
      • Applicable only in synchronous systems.
slide-48
SLIDE 48

Summary

  • Relationship between processes
      • Client-server and peer-to-peer
  • Sources of uncertainty
      • Communication time, clock drift rates
  • Synchronous vs asynchronous models.
  • Types of failures: omission, arbitrary, timing
  • Detecting a failed process.