SLIDE 1

Internet Server Clusters

Jeff Chase, Duke University, Department of Computer Science
CPS 212: Distributed Information Systems

SLIDE 2

Using Clusters for Scalable Services

Clusters are a common vehicle for improving scalability and availability at a single service site in the network.

Are network services the “Killer App” for clusters?

  • incremental scalability

just wheel in another box...

  • excellent price/performance

high-end PCs are commodities: high-volume, low margins

  • fault-tolerance

“simply a matter of software”

  • high-speed cluster interconnects are on the market

SANs + Gigabit Ethernet... cluster nodes can coordinate to serve requests w/ low latency

  • “shared nothing”

SLIDE 3

[Fox/Brewer]: SNS, TACC, and All That

[Fox/Brewer97] proposes a cluster-based reusable software infrastructure for scalable network services (“SNS”), such as:

  • TranSend: scalable, active proxy middleware for the Web

think of it as a dial-up ISP in a box, in use at Berkeley; distills/transforms pages based on user request profiles

  • Inktomi/HotBot search engine

core technology for Inktomi Inc., today with $15B market cap. “bringing parallel computing technology to the Internet”

Potential services are based on Transformation, Aggregation, Caching, and Customization (TACC), built above SNS.

SLIDE 4

TACC

Vision: deliver “the content you want” by viewing HTML content as a dynamic, mutable medium.

  • 1. Transform Internet content according to:
  • network and client needs/limitations

e.g., on-the-fly compression/distillation [ASPLOS96], packaging Web pages for PalmPilots, encryption, etc.

  • directed by user profile database
  • 2. Aggregate content from different back-end services or resources.
  • 3. Cache content to reduce cost/latency of delivery.
  • 4. Customize (see Transform)
SLIDE 5

TranSend Structure

[Figure: TranSend structure. Front ends, a user profile database, and a control panel face the Internet, serving html/gif/jpg content; a high-speed SAN, a 10baseT utility network, and a coordination bus connect the front ends to cache partitions ($) and datatype-specific distillers.]

[adapted from Armando Fox (through http://ninja.cs.berkeley.edu/pubs)]

SLIDE 6

SNS/TACC Philosophy

  • 1. Specify services by plugging generic programs into the TACC framework, and compose them as needed.

sort of like CGI with pipes; run by long-lived worker processes that serve request queues; allows multiple languages, etc.

  • 2. Worker processes in the TACC framework are loosely coordinated, independent, and stateless (see the sketch after this list).

ACID vs. BASE; serve independent requests from multiple users; a narrow view of a “service”: one-shot read-only requests, and stale data is OK.

  • 3. Handle bursts with designated overflow pool of machines.
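
For concreteness, here is a minimal Python sketch of that worker model. The queue-based framing and the function names are hypothetical illustrations of the idea, not the actual TACC API.

```python
# Hypothetical sketch of a TACC-style stateless worker: a long-lived process that
# pulls one-shot, read-only requests from a queue, applies a datatype-specific
# transformation directed by the user's profile, and emits a result.
import queue
import zlib

def distill(data: bytes, profile: dict) -> bytes:
    """Stand-in for a datatype-specific distiller (e.g., lossy image compression)."""
    level = profile.get("aggressiveness", 6)      # user-profile-directed transformation
    return zlib.compress(data, level)

def worker_loop(requests: queue.Queue, results: queue.Queue) -> None:
    """Stateless: nothing carries over from one request to the next."""
    while True:
        req = requests.get()
        if req is None:                           # shutdown sentinel
            break
        results.put((req["id"], distill(req["data"], req["profile"])))

if __name__ == "__main__":
    reqs, ress = queue.Queue(), queue.Queue()
    reqs.put({"id": 1, "data": b"x" * 10_000, "profile": {"aggressiveness": 9}})
    reqs.put(None)
    worker_loop(reqs, ress)
    print(ress.get()[0])
```

Because the worker keeps no state across requests, the framework can run many copies, restart failed ones, or spill bursts to the overflow pool without coordination.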

SLIDE 7

TACC Examples

HotBot search engine

  • Query crawler’s DB
  • Cache recent searches
  • Customize UI/presentation

TranSend transformation proxy

  • On-the-fly lossy compression of inline images

(GIF, JPG, etc.)

  • Cache original & transformed
  • User specifies aggressiveness, “refinement”, UI, etc.

[Figure: TACC worker compositions for HotBot and TranSend, built from customization (C), transformation (T), aggregation (A), and cache ($) modules in front of the crawler DB and HTML content.]

[Fox]

SLIDE 8

(Worker) Ignorance Is Bliss

What workers don’t need to know

  • Data sources/sinks
  • User customization (key/value pairs)
  • Access to cache
  • Communication with other workers by name

Common case: stateless workers. C, Perl, and Java are supported.

  • Recompilation often unnecessary
  • Useful tasks possible in <10 lines of (buggy) Perl

[Fox]

SLIDE 9

Questions

  • 1. What are the research contributions of the paper?

The system architecture decouples SNS concerns from content; the TACC programming model composes stateless worker modules; validation uses two real services, with measurements.

  • 2. How is this different from clusters for parallel computing?
  • 3. What are the barriers to scale in SNS/TACC?
  • 4. How are requests distributed to caches, FEs, workers?
  • 5. What can we learn from the quantitative results?
  • 6. What about services that allow client requests to update shared data?

e.g., message boards, calendars, mail, ...

SLIDE 10

SNS/TACC Functional Issues

  • 1. What about fault-tolerance?
  • Service restrictions allow simple, low-cost mechanisms.

Primary/backup process replication is not necessary with BASE model and stateless workers.

  • Uses a process-peer approach to restart failed processes.

Processes monitor each other’s health and restart failed peers if necessary. Workers and the manager find each other with “beacons” on well-known ports.

  • 2. Load balancing?
  • Manager gathers load info and distributes to front-ends.
  • How are incoming requests distributed to front-ends?

SLIDE 11

Porcupine: A Highly Available Cluster-based Mail Service

Yasushi Saito, Brian Bershad, Hank Levy

University of Washington Department of Computer Science and Engineering, Seattle, WA http://porcupine.cs.washington.edu/ [Saito]

SLIDE 12

Why Email?

Mail is important: real demand.

Mail is hard: write intensive, low locality.

Mail is easy: well-defined API, large parallelism, weak consistency.

[Saito]

How much of Porcupine is reusable to other services? Can we use the SNS/TACC framework for this?

SLIDE 13

Goals

Use commodity hardware to build a large, scalable mail service.

Three facets of scalability:

  • Performance: linear increase with cluster size
  • Manageability: react to changes automatically
  • Availability: survive failures gracefully

[Saito]

SLIDE 14

Conventional Mail Solution

Static partitioning

Performance problems:

No dynamic load balancing

Manageability problems:

Manual data partition decision

Availability problems:

Limited fault tolerance

[Figure: a conventional statically partitioned mail system. SMTP/IMAP/POP front ends sit over NFS servers, each holding a fixed set of mailboxes (Bob’s, Ann’s, Joe’s, Suzy’s mboxes).]

[Saito]

SLIDE 15

Key Techniques and Relationships

[Figure: framework, techniques, and goals. Functional homogeneity (“any node can perform any task”) is the framework; automatic reconfiguration, load balancing, and replication are the techniques; manageability, performance, and availability are the goals.]

[Saito]

SLIDE 16

Porcupine Architecture

[Figure: Porcupine architecture. Every node (A, B, ..., Z) runs the same components: SMTP, POP, and IMAP servers; mail map; mailbox storage; user profile; replication manager; membership manager; RPC; load balancer; and user map.]

[Saito]

SLIDE 17

Porcupine Operations

[Figure: a mail delivery operation flowing from the Internet across Porcupine nodes A, B, and C, with steps for protocol handling, user lookup, load balancing, and message store.]

[Saito]

SLIDE 18

Basic Data Structures

[Figure: Porcupine’s basic data structures. A hash of the user name (e.g., “bob”) indexes the user map (e.g., B C A C A B A C), which assigns each hash bucket to a node. That node’s mail map / user info holds the fragment list for the user (bob: {A,C}, ann: {B}, suzy: {A,C}, joe: {B}), naming the nodes whose mailbox storage holds the user’s mailbox fragments.]

[Saito]
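
A toy sketch of these lookups (illustrative only, not Porcupine’s code): hash the user name into the user map to find the node that manages the user, then consult the mail map for the fragment list.

```python
# Toy version of the user map and mail map described above. The tables and
# hash choice are illustrative; only the shape of the lookup matters.
import hashlib

USER_MAP = ["B", "C", "A", "C", "A", "B", "A", "C"]          # bucket -> managing node
MAIL_MAP = {"bob": ["A", "C"], "ann": ["B"],                 # user -> fragment list
            "suzy": ["A", "C"], "joe": ["B"]}

def user_map_bucket(user: str, buckets: int = len(USER_MAP)) -> int:
    digest = hashlib.md5(user.encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets

def lookup(user: str):
    manager = USER_MAP[user_map_bucket(user)]   # node that tracks this user's mail map entry
    fragments = MAIL_MAP[user]                  # nodes whose mailbox storage holds fragments
    return manager, fragments

print(lookup("bob"))    # prints the managing node and the fragment list for "bob"
```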

SLIDE 19

Porcupine Advantages

Advantages:

  • Optimal resource utilization
  • Automatic reconfiguration and task redistribution upon node failure/recovery
  • Fine-grain load balancing

Results: better availability, better manageability, better performance.

[Saito]

SLIDE 20

Availability

Goals:

  • Maintain function after failures
  • React quickly to changes regardless of cluster size
  • Graceful performance degradation / improvement

Strategy: two complementary mechanisms

  • Hard state (email messages, user profile) → optimistic fine-grain replication
  • Soft state (user map, mail map) → reconstruction after membership change

[Saito]

SLIDE 21

Soft-state Reconstruction

  • 1. Membership protocol: user map recomputation
  • 2. Distributed disk scan: rebuild the mail map / user info

[Figure: timeline of soft-state reconstruction as nodes fail and recover. After each membership change the user map is recomputed and redistributed, then a distributed disk scan rediscovers mail map entries such as bob: {A,C}, joe: {C}, suzy: {A,B}, ann: {B}.]

[Saito]
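
A rough sketch of that two-step sequence, under assumptions made here purely for illustration: buckets are reassigned round-robin over the live nodes, and each node reports the mailbox fragments found on its own disk to the new bucket managers.

```python
# Illustrative reconstruction sequence (not Porcupine's actual code).
# Step 1: recompute the user map over the surviving nodes.
# Step 2: each node scans its disk and reports its fragments to the new managers,
#         which rebuild the mail map entries for the buckets they now own.

def recompute_user_map(live_nodes, buckets=8):
    """Assign each user-map bucket to one live node, round-robin for simplicity."""
    nodes = sorted(live_nodes)
    return [nodes[b % len(nodes)] for b in range(buckets)]

def distributed_disk_scan(local_fragments, user_map, bucket_of):
    """Report (user -> this node) for every fragment on local disk, addressed to
    the node that now manages that user's bucket."""
    reports = {}
    for user in local_fragments:
        manager = user_map[bucket_of(user)]
        reports.setdefault(manager, {}).setdefault(user, set()).add("this-node")
    return reports          # shipped to managers, which merge them into new mail maps

# Example: node B has failed; A and C take over its buckets and rebuild.
user_map = recompute_user_map({"A", "C"})
print(user_map)
print(distributed_disk_scan({"bob", "suzy"}, user_map, lambda u: sum(u.encode()) % 8))
```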

SLIDE 22

How does Porcupine React to Configuration Changes?

[Figure: throughput (messages/second) versus time (seconds) as membership changes. Curves show no failure, one node failure, three node failures, and six node failures; annotations mark the points where nodes fail, new membership is determined, nodes recover, and new membership is determined again.]

[Saito]

SLIDE 23

Hard-state Replication

Goals:

  • Keep serving hard state after failures
  • Handle unusual failure modes

Strategy: exploit Internet semantics

  • Optimistic, eventually consistent replication
  • Per-message, per-user-profile replication
  • Efficient during normal operation
  • Small window of inconsistency

[Saito]

How will Porcupine behave in a partition failure?

SLIDE 24

More on Porcupine Replication

To add/delete/modify a message:

  • Find and update any replica of the mailbox fragment.

Do whatever it takes: make a new fragment if necessary...pick a new replica if chosen replica does not respond.

  • Replica asynchronously transmits updates to other fragment replicas.

continuous reconciling of replica states

  • Log/force pending update state, and target nodes to receive update.
  • On recovery, continue transmitting updates where you left off.
  • Order updates by loosely synchronized physical clocks.

Clock skew should be less than the inter-arrival gap for a sequence of order-dependent requests... use the node ID to break ties (sketched below).
  • How many node failures can Porcupine survive? What happens if nodes fail “forever”?
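
The update-ordering rule above amounts to last-writer-wins by (clock, node ID). A hedged sketch, with illustrative names rather than Porcupine’s actual interfaces:

```python
# Illustrative last-writer-wins ordering: updates carry a loosely synchronized
# physical clock value plus the originating node's ID as a tie-breaker, and a
# replica applies an update only if its stamp is newer than what it already holds.
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Stamp:
    clock: float        # loosely synchronized physical clock
    node_id: int        # tie-breaker when clocks collide

@dataclass
class Replica:
    state: dict = field(default_factory=dict)   # msg_id -> (Stamp, payload)

    def apply(self, msg_id: str, stamp: Stamp, payload):
        held = self.state.get(msg_id)
        if held is None or held[0] < stamp:      # newer stamp wins; older is discarded
            self.state[msg_id] = (stamp, payload)

r = Replica()
r.apply("m1", Stamp(time.time(), node_id=2), "hello")
r.apply("m1", Stamp(time.time() - 10, node_id=1), "stale")   # older: ignored
print(r.state["m1"][1])    # -> 'hello'
```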

SLIDE 25

How Efficient is Replication?

[Figure: throughput (messages/second) versus cluster size, up to 30 nodes. Porcupine with no replication reaches roughly 68 million messages/day; with replication=2, roughly 24 million messages/day.]

[Saito]

SLIDE 26

How Efficient is Replication?

[Figure: the same throughput-versus-cluster-size plot, adding Porcupine with replication=2 and NVRAM: roughly 33 million messages/day, versus 24 million with replication=2 alone and 68 million with no replication.]

[Saito]

SLIDE 27

Load balancing: Deciding where to store messages

Goals:

  • Handle skewed workloads well
  • Support hardware heterogeneity
  • No voodoo parameter tuning

Strategy: spread-based load balancing

  • Spread: a soft limit on the number of nodes per mailbox
  • Large spread → better load balance; small spread → better affinity
  • Load is balanced within the spread
  • Use the number of pending I/O requests as the load measure

[Saito]
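
A small sketch of spread-based placement as just described (illustrative only): the candidate set is the user’s existing fragment nodes, widened only if it is still below the spread limit, and the least-loaded candidate by pending I/O wins.

```python
# Illustrative spread-based placement decision, not Porcupine's actual code.
def pick_storage_node(user_fragments, all_nodes, pending_io, spread=2):
    candidates = set(user_fragments)
    if len(candidates) < spread:            # room to grow this mailbox's spread
        candidates |= set(all_nodes)
    return min(candidates, key=lambda n: pending_io[n])   # fewest pending I/Os wins

pending_io = {"A": 12, "B": 3, "C": 7}
print(pick_storage_node(["A"], ["A", "B", "C"], pending_io, spread=2))        # -> 'B'
print(pick_storage_node(["A", "C"], ["A", "B", "C"], pending_io, spread=2))   # -> 'C'
```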

SLIDE 28

Questions

  • How to select the front-end node to handle the request? Does it matter which one we choose?
  • Don’t we already know how to build big mail servers (e.g., Earthlink, Christenson USITS97)? Why do we need Porcupine?
  • What properties of the mail “data model” allow this approach, with weaker consistency guarantees than a database?
  • How does the system leverage/exploit the weaker semantics?
  • Can the architecture accommodate new features, e.g., Pachyderm-like storage/indexing of large mail collections?
  • Could I run Porcupine on the same cluster with other applications?
  • Could this have been built on Microsoft’s MSCS? How much application effort would have been saved?

SLIDE 29

Clusters: A Broader View

MSCS (“Wolfpack”) is designed as basic infrastructure for commercial applications on clusters.

  • “A cluster service is a package of fault-tolerance primitives.”
  • Service handles startup, resource migration, failover, restart.
  • But: apps may need to be “cluster-aware”.

Apps must participate in recovery of their internal state. Use facilities for logging, checkpointing, replication, etc.

  • The service and node OS support uniform naming and virtual environments.

Preserve continuity of access to migrated resources. Preserve continuity of the environment for migrated resources.

SLIDE 30

Wolfpack: Resources

  • The components of a cluster are nodes and resources.

Shared nothing: each resource is owned by exactly one node.

  • Resources may be physical or logical.

Disks, servers, databases, mailbox fragments, IP addresses,...

  • Resources have types, attributes, and expected behavior.
  • (Logical) resources are aggregated in resource groups.

Each resource is assigned to at most one group.

  • Some resources/groups depend on other resources/groups.

Admin-installed registry lists resources and dependency tree.

  • Resources can fail.

cluster service/resource managers detect failures.
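
To make the resource model concrete, here is a generic sketch (not the MSCS API; all names are hypothetical) of typed, singly-owned resources grouped with a dependency tree that a cluster service could consult when restarting a resource:

```python
# Generic sketch of the Wolfpack-style resource model described above.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Resource:
    name: str
    rtype: str                      # e.g. "disk", "IP address", "mailbox fragment"
    owner: str                      # shared nothing: exactly one owning node
    depends_on: list = field(default_factory=list)
    group: Optional[str] = None     # each resource belongs to at most one group

def restart_order(resources, root):
    """Bring up everything `root` depends on before `root` itself."""
    by_name = {r.name: r for r in resources}
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in by_name[name].depends_on:
            visit(dep)
        order.append(name)
    visit(root)
    return order

rs = [Resource("disk1", "disk", "nodeA"),
      Resource("ip1", "IP address", "nodeA"),
      Resource("db", "database", "nodeA", depends_on=["disk1", "ip1"], group="sql-group")]
print(restart_order(rs, "db"))     # -> ['disk1', 'ip1', 'db']
```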

SLIDE 31

Fault-Tolerant Systems: The Big Picture

[Figure: a layered stack of fault-tolerant systems. Redundant hardware (parity, ECC, replication) sits below file/storage and messaging systems (RAID, parity, checksum, ack/retransmission), which sit below databases and cluster services (replication, logging, checkpointing, voting), which in turn support application services such as mail.]

Note the dependencies: redundancy can appear at any/each/every level; what failure semantics does each level present to the level above?

SLIDE 32

Wolfpack: Resource Placement and Migration

The cluster service detects component failures and responds by restarting resources or migrating resource groups.

  • Restart resource in place if possible...
  • ...else find another appropriate node and migrate/restart.

Ideally, migration/restart/failover is transparent.

  • Logical resources (processes) execute in virtual environments.

uniform name space for files, registry, OS objects (NT mods)

  • Node physical clocks are loosely synchronized, with clock drift less than the minimal time for recovery/migration/restart. This guarantees that a migrated resource sees monotonically increasing clocks.

  • Route resource requests to the node hosting the resource.
  • Is the failure visible to other resources that depend on the resource?

SLIDE 33

Membership 101

Cluster nodes must agree on the set of cluster members (the view).

  • distribute resource ownership effectively

shift resources on node failures or additions

  • eliminate dangerous/expensive interactions with faulty nodes
  • “keep everyone in the loop” on updates and events

e.g., multicast groups and group communication

The literature on group membership is tangled up with the problem of ordered multicast (e.g., “CATOCS”).

  • What are the ordering guarantees for message delivery, especially with respect to membership changes?
  • Ordered group communication is controversial, but everyone needs a solution for the separate but related membership problem.

SLIDE 34

Failure Detectors

First problem: how to detect that a member has failed?

  • pings, timeouts, beacons, heartbeats
  • recovery notifications

“I was gone for awhile, but now I’m back.”

Is the failure detector accurate? Is the failure detector live? In an asynchronous system, it is possible for a failure detector to be accurate or live, but not both.

  • As it turns out, it is impossible for an asynchronous system to agree on anything with both accuracy and liveness!
  • But this is academic...

SLIDE 35

Failure Detectors in Real Systems

Common solution:

  • Use a failure detector that is live but not accurate.

Assume bounded processing delays and delivery times. Timeout with multiple retries detects failure accurately with high probability. If a “failed” site turns out to be alive, then kill it (fencing).

  • Use a recovery detector that is accurate but not live.

“I’m back....hey, did anyone hear me?”

What do we assume about communication failures?

How much pinging is enough? 1-to-N, N-to-N, ring?

What about network partitions?
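
A minimal sketch of that common solution, with assumed hooks (send_ping, fence) standing in for real heartbeat and fencing machinery:

```python
# Illustrative "live but not accurate" failure detector: declare a node failed only
# after several consecutive ping timeouts; if a node declared dead later speaks up,
# fence it rather than tolerate inconsistency.
def probe(node, send_ping, timeout=1.0, retries=3):
    """Return 'up' or 'suspected-down' based on repeated pings with timeouts."""
    for _ in range(retries):
        if send_ping(node, timeout):         # send_ping is an assumed transport hook
            return "up"
    return "suspected-down"                  # accurate with high probability, not certainty

def on_contact_from(node, view, fence):
    """Recovery detection is accurate but not live: act only when the node speaks up."""
    if view.get(node) == "suspected-down":
        fence(node)                          # e.g., power it off or block it at the switch

print(probe("B", send_ping=lambda node, timeout: False))          # -> 'suspected-down'
on_contact_from("B", {"B": "suspected-down"}, fence=lambda n: print(f"fencing {n}"))
```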

SLIDE 36

Membership Service

Second problem: How to propagate knowledge of failure/recovery events to other nodes?

  • Surviving nodes should agree on the new view (regrouping).
  • Convergence should be rapid.
  • The regrouping protocol should itself be tolerant of message drops, message reorderings, and failures (liveness and accuracy again).

  • The regrouping protocol should be scalable.
  • The protocol should handle network partitions.
  • Behavior of the messaging system (e.g., group multicast) across membership changes must be well-specified and understood.

SLIDE 37

Example: Wombat

  • Wombat is a new membership protocol, an outgrowth of Porcupine.

Gretta Bartels, University of Washington, Duke ‘98

  • Wombat is empirically more efficient/scalable than competing algorithms such as Three Round.

  • But: Wombat makes no guarantees about the relative ordering of membership events and messages. Adherents of group communication would not accept it as a “real” membership protocol.

  • Wombat’s assumptions have not been formally defined, and its properties have not been proven. If you can’t prove that it works, you can’t believe that it works.

  • Disclaimer: Wombat is a promising work in progress.

SLIDE 38

Wombat Basics

[Figure: a chain of nodes, each pinging the next; the highest-ranked node is the leader, the rest are minions.]

Nodes are ranked by unique, permanent IDs. Node i pings predecessor(i). The highest-ranked node is the leader; all other nodes are minions. The leader periodically broadcasts its view to all known minions (physical broadcast). Minions adopt the leader’s view and determine their predecessor from it.
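
A sketch of these structures (illustrative, not the protocol’s code). One assumption made here: “highest-ranked” is read as the smallest node ID, which is consistent with the timer rule on the later slide, where smaller IDs get shorter timers.

```python
# Illustrative Wombat view structures: the best-ranked (smallest-ID) member leads,
# and each other member pings the member ranked just above it.
def leader(view):
    return min(view)                       # best-ranked member leads (assumption: smallest ID)

def predecessor(me, view):
    """The member this node pings: the next-better-ranked member of the view."""
    better = [n for n in sorted(view) if n < me]
    return better[-1] if better else None  # the leader itself pings no one (assumption)

def on_leader_beacon(my_id, broadcast_view):
    """Minions adopt the leader's view and recompute their predecessor from it."""
    return set(broadcast_view), predecessor(my_id, broadcast_view)

view = {3, 7, 12, 20}
print(leader(view))            # -> 3
print(predecessor(7, view))    # -> 3 (the leader's successor pings the leader)
print(predecessor(20, view))   # -> 12
```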

SLIDE 39

Node Arrival/Recovery in Wombat

If node i joins the cluster:

  • 1. i waits for the leader’s next beacon.
  • 2. i detects that the leader’s view does not include i.

  • 3. i notifies the leader.
  • 4. The leader updates its view.
  • 5. The leader broadcasts its new view.
  • 6. Minions adopt the leader’s view.

[Figure: the joining node i announces “I’m here too.”]

SLIDE 40

Node Failure in Wombat

If a node fails:

  • 1. Its successor notifies the leader.
  • 2. The leader updates its view.
  • 3. The leader broadcasts its view.
  • 4. Minions adopt the leader’s new view.
  • 5. Life goes on.

[Figure: node i is marked failed (X); its successor reports “Node i has failed.”]

SLIDE 41

Leader Failure in Wombat

If the leader fails:

  • 1. Successor detects the failure.
  • 2. Successor knows that the failed node was the leader.

  • 3. Successor broadcasts as leader.
  • 4. Minions adopt the new leader’s view.
  • 5. Life goes on.

[Figure: the leader fails (X); its successor broadcasts “I am in control.”]

SLIDE 42

Multiple Failures in Wombat

If the leader and its successor(s) fail(s), the next ranking node must assume command on its own.

  • 1. Each node has a broadcast timer; if the timer goes off, broadcast as leader.

  • 2. Each node’s timer is set by its rank.

if i < j then timer(i) < timer(j)

  • 3. Reset timer on each beacon.
  • 4. Leader’s timer value is adaptive.

Go faster if things are changing.

[Figure: the leader and its successor have both failed (X); the next-ranking survivor times out and broadcasts “I must be in control.”]
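
A sketch of the rank-based timers (the constants are illustrative): every node re-arms its timer on each leader beacon, and because timer(i) < timer(j) when i < j, the best-ranked survivor fires first and its broadcast suppresses the rest.

```python
# Illustrative rank-based broadcast timers; values and formula are assumptions.
BEACON_PERIOD = 1.0      # assumed leader beacon interval, in seconds

def broadcast_timeout(node_id, base=BEACON_PERIOD, step=0.5):
    """Monotone in node ID: smaller IDs (better rank) fire sooner."""
    return 3 * base + step * node_id

def on_beacon(timers, view):
    """Reset every node's timer whenever a leader beacon arrives."""
    for n in view:
        timers[n] = broadcast_timeout(n)
    return timers

print(broadcast_timeout(3), broadcast_timeout(7))   # 4.5 6.5 -> node 3 would fire first
```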

SLIDE 43

Suppressing False Leaders

If a node falsely broadcasts as leader:

  • 1. All nodes that know of a better leader recognize the usurper as such.
  • 2. The real leader recognizes that it is a better leader than the usurper.
  • 3. The real leader broadcasts the union of its view and the usurper’s view.
  • 4. The usurper shuts up and adopts the real leader’s view.

What if the “real leader” is dead?

[Figure: the usurper broadcasts “I must be in control.”; the real leader replies “I don’t think so.”]

SLIDE 44

Partitions in Wombat

If a network failure partitions the cluster:

  • 1. The old partition continues.
  • 2. The leader of the new partition eventually broadcasts its view.
  • 3. Minions accept the new leader’s view.

[Figure: the cluster splits into two partitions, each with its own partition leader.]

SLIDE 45

Healing a Partition

When the partition heals, either:

  • 1. The dominating partition leader hears a false broadcast, and...
  • 2. ...corrects it by broadcasting the union of the views.

- or -

  • 1. The dominating partition leader broadcasts first, and...
  • 2. ...minions respond “I’m here”.

[Figure: the dominating leader absorbs the other partition leader’s view when the partition heals.]

SLIDE 46

Wombat: Wrinkles

  • 1. What are the assumptions about:
  • network?
  • clocks?
  • 2. Are these reasonable/realistic assumptions?
  • 3. How to ensure a single cluster view in the event of a partition?
  • 4. How long does it take for the view to converge after a partition?
  • 5. How do we start a cluster? What if a node starts or recovers but never receives a beacon?

  • 6. What about the ordering of messages and membership events?
  • 7. How do minions come to accept a new leader?
  • 8. What about “message storms”?