

SLIDE 1

Taming Aggressive Replication in the Pangaea Wide-area File System

  • Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam

Presented by Jason Waddle

Pangaea: Wide-area File System

  • Support the daily storage needs of distributed users.
  • Enable ad-hoc data sharing.

Pangaea Design Goals

I. Speed

  • Hide wide-area latency; file access time ~ local file system

II. Availability & autonomy

  • Avoid single point-of-failure
  • Adapt to churn

III. Network economy

  • Minimize use of wide-area network
  • Exploit physical locality

Pangaea Assumptions (Non-goals)

  • Servers are trusted
  • Weak data consistency is sufficient (consistency in seconds)

SLIDE 2

Symbiotic Design

Autonomous

Each server operates even when disconnected from the network.

Cooperative

When connected, servers cooperate to enhance overall performance and availability.

Pervasive Replication

  • Replicate at the file/directory level
  • Aggressively create replicas: whenever a file or directory is accessed
  • No single “master” replica
  • A replica may be read / written at any time
  • Replicas exchange updates in a peer-to-peer fashion

SLIDE 3

Graph-based Replica Management

  • Replicas connected in a sparse, strongly connected, random graph
  • Updates propagate along edges
  • Edges used for discovery and removal

Benefits of Graph-based Approach

Inexpensive

  • Graph is sparse; adding/removing replicas is O(1)

Available update distribution

  • As long as the graph is connected, updates reach every replica

Network economy

  • High connectivity for close replicas; build spanning tree along fast edges
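A minimal sketch of this bookkeeping, assuming replicas are identified by plain node IDs (the class and method names below are illustrative, not Pangaea's API): each replica keeps a small peer set, adding or removing a replica only touches its few neighbors, and an update reaches every replica by flooding as long as the graph stays connected.

    import random
    from collections import deque

    class ReplicaGraph:
        """Toy model of a sparse replica graph."""

        def __init__(self, m=4):
            self.m = m              # target number of peer-edges per replica
            self.peers = {}         # node id -> set of neighbor node ids

        def add_replica(self, node):
            """Connect a new replica to up to m existing replicas: O(m), not O(|graph|)."""
            existing = list(self.peers)
            self.peers[node] = set()
            for peer in random.sample(existing, min(self.m, len(existing))):
                self.peers[node].add(peer)
                self.peers[peer].add(node)

        def remove_replica(self, node):
            """Drop a replica; only its neighbors need to be told."""
            for peer in self.peers.pop(node, set()):
                self.peers[peer].discard(node)

        def flood(self, origin):
            """Return the set of replicas an update issued at `origin` reaches."""
            seen, queue = {origin}, deque([origin])
            while queue:
                n = queue.popleft()
                for peer in self.peers[n]:
                    if peer not in seen:
                        seen.add(peer)
                        queue.append(peer)
            return seen

    g = ReplicaGraph(m=3)
    for i in range(10):
        g.add_replica(i)
    assert g.flood(0) == set(range(10))   # updates reach every replica while the graph is connected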

Optimistic Replica Coordination

  • Aim for maximum availability over strong data consistency
  • Any node issues updates at any time
  • Update transmission and conflict resolution happen in the background
  • “Eventual consistency” (~5s in tests)
  • No strong consistency guarantees: no support for locks, lock-files, etc.

SLIDE 4

Pangaea Structure

[Diagram: servers (nodes) grouped into regions; nodes in the same region are within ~5ms RTT of each other]

Server Structure

[Diagram: an application's I/O request passes through the in-kernel NFS client to the user-space Pangaea server, whose modules are the NFS protocol handler, replication engine, log, and membership; the server handles inter-node communication with other Pangaea servers]

Server Modules

NFS protocol handler

  • Receives requests from applications, updates local replicas, and generates requests to the replication engine

Replication engine

  • Accepts local and remote requests
  • Modifies replicas
  • Forwards requests to other nodes

SLIDE 5

Server Modules

Log module

  • Transaction-like semantics for local updates

Server Modules

Membership module maintains:

  • List of regions, their members, and estimated RTT between regions
  • Location of root directory replicas

Information is coordinated by gossiping; “landmark” nodes bootstrap newly joining nodes. Maintaining RTT information is the main scalability bottleneck.
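A rough sketch of how gossiped membership entries might be merged, assuming each region's record carries a timestamp (the field names and the merge rule here are assumptions for illustration, not Pangaea's wire format):

    def gossip_merge(local, remote):
        """Anti-entropy step: keep the newer record for every region heard about."""
        for region, info in remote.items():
            if region not in local or info["ts"] > local[region]["ts"]:
                local[region] = info
        return local

    a = {"emea": {"members": ["fr1"], "rtt_ms": {"us": 150}, "ts": 10}}
    b = {"emea": {"members": ["fr1", "de1"], "rtt_ms": {"us": 140}, "ts": 12},
         "us":   {"members": ["us1"], "rtt_ms": {"emea": 140}, "ts": 7}}
    print(gossip_merge(a, b))   # takes the newer "emea" record and learns about "us"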

File System Structure

Gold replicas

  • Listed in directory entries
  • Form a clique in the replica graph
  • Fixed number (e.g., 3)

All replicas (gold and bronze)

  • Unidirectional edges to all gold replicas
  • Bidirectional peer-edges
  • Backpointer to parent directory

File System Structure

[Diagram: example replica graphs for the directory /joe and the file /joe/foo]

SLIDE 6

File System Structure

struct Replica
    fid: FileID
    ts: TimeStamp
    vv: VersionVector
    goldPeers: Set(NodeID)
    peers: Set(NodeID)
    backptrs: Set(FileID, String)

struct DirEntry
    fname: String
    fid: FileID
    downlinks: Set(NodeID)
    ts: TimeStamp

File Creation

Select locations for g gold replicas (e.g., g = 3)

  • One on the current server
  • Others on random servers from different regions

Create entry in parent directory

Flood updates

  • Update to parent directory
  • File contents (empty) to gold replicas
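A small sketch of the placement step under these assumptions (one gold replica on the creating server, the rest on random servers in other regions); the function and variable names are made up for illustration:

    import random

    def choose_gold_servers(local_server, servers_by_region, g=3):
        """Pick servers to hold the g gold replicas of a newly created file."""
        local_region = next(r for r, nodes in servers_by_region.items()
                            if local_server in nodes)
        golds = [local_server]                       # one gold on the creating server
        others = [r for r in servers_by_region if r != local_region]
        for region in random.sample(others, min(g - 1, len(others))):
            golds.append(random.choice(servers_by_region[region]))
        return golds

    regions = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}
    print(choose_gold_servers("a1", regions, g=3))   # e.g. ['a1', 'b2', 'c1']
    # The creating server would then write the new directory entry listing these
    # gold nodes and flood two updates: the parent-directory change and the
    # (empty) file contents to each gold replica.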

Replica Creation

Recursively get replicas for ancestor directories

Find a close replica (shortcutting)

  • Send request to the closest gold replica
  • Gold replica forwards the request to its neighbor closest to the requester, who then sends the file contents

Replica Creation

Select m peer-edges (e.g., m = 4)

  • Include a gold replica (for future shortcutting)
  • Include the closest neighbor from a random gold replica
  • Get remaining nodes from random walks starting at a random gold replica

Create m bidirectional peer-edges
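A sketch of that peer-edge selection, under the assumption that every node knows the gold replica set, the peer sets, and RTT estimates (the helper names and data layout below are illustrative only):

    import random

    def random_walk(peers, start, hops=3):
        """Short random walk over the replica graph; return the endpoint."""
        node = start
        for _ in range(hops):
            node = random.choice(sorted(peers[node]))
        return node

    def pick_peer_edges(new_node, golds, peers, rtt, m=4):
        chosen = {random.choice(golds)}                    # a gold replica, for shortcutting
        g = random.choice(golds)
        if peers[g]:                                       # gold's neighbor closest to us
            chosen.add(min(peers[g], key=lambda n: rtt[(new_node, n)]))
        chosen.discard(new_node)
        while len(chosen) < m:                             # fill the rest with random walks
            chosen.add(random_walk(peers, random.choice(golds)))
        return chosen                                      # then create bidirectional edges

    golds = ["g1", "g2", "g3"]
    peers = {"g1": {"g2", "g3", "b1"}, "g2": {"g1", "g3", "b2"},
             "g3": {"g1", "g2"}, "b1": {"g1"}, "b2": {"g2"}}
    rtt = {("new", n): ms for n, ms in
           [("g1", 80), ("g2", 20), ("g3", 120), ("b1", 90), ("b2", 15)]}
    print(pick_peer_edges("new", golds, peers, rtt, m=4))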

SLIDE 7

Bronze Replica Removal

To recover disk space

  • Using the GD-Size algorithm, throw out the largest, least-accessed replicas

Drop useless replicas

  • Too many updates before an access (e.g., 4)

Must notify peer-edges of removal; peers use a random walk to choose a new edge
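For reference, a simplified GreedyDual-Size eviction loop (this is the textbook algorithm the slide names; Pangaea's actual tuning may differ, and the class below is illustrative): each access sets a replica's priority to the global inflation value plus cost/size, and eviction removes the lowest-priority replica, so large, rarely re-accessed replicas go first.

    class GDSizeCache:
        def __init__(self):
            self.inflation = 0.0       # rises to the priority of each evicted item
            self.priority = {}         # replica id -> priority
            self.size = {}             # replica id -> size in bytes

        def access(self, rid, size, cost=1.0):
            """On every access, reset the replica's priority to inflation + cost/size."""
            self.size[rid] = size
            self.priority[rid] = self.inflation + cost / size

        def evict(self):
            """Throw out the lowest-priority replica: large and least recently boosted."""
            victim = min(self.priority, key=self.priority.get)
            self.inflation = self.priority.pop(victim)
            del self.size[victim]
            return victim

    cache = GDSizeCache()
    cache.access("big.iso", 700_000_000)
    cache.access("notes.txt", 2_000)
    print(cache.evict())               # -> "big.iso"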

Replica Updates

  • Flood the entire file to replica-graph neighbors
  • Updates reach all replicas as long as the graph is strongly connected
  • Optional: user can block on an update until all neighbors reply (red-button mode)
  • Network economy???

Optimized Replica Updates

Send only differences (deltas)

  • Include old timestamp, new timestamp
  • Only apply a delta to a replica if the old timestamp matches
  • Revert to full-content transfer if necessary
  • Merge deltas when possible
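A minimal sketch of the apply-side rule, assuming each update carries the timestamp it was computed against (the dict layout and the append-style "patch" are placeholders, not Pangaea's real delta format):

    def apply_update(replica, update):
        """Apply a delta or full-content update to a local replica."""
        if update["kind"] == "full":
            replica["contents"] = update["contents"]
            replica["ts"] = update["new_ts"]
            return "applied-full"
        # A delta is only safe if we hold exactly the version it was computed against.
        if replica["ts"] == update["old_ts"]:
            replica["contents"] += update["delta"]        # toy patch: append
            replica["ts"] = update["new_ts"]
            return "applied-delta"
        return "need-full-transfer"                       # revert to full-content transfer

    r = {"contents": "v1 ", "ts": 1}
    print(apply_update(r, {"kind": "delta", "old_ts": 1, "new_ts": 2, "delta": "v2"}))  # applied-delta
    print(apply_update(r, {"kind": "delta", "old_ts": 1, "new_ts": 3, "delta": "v3"}))  # need-full-transfer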

Optimized Replica Updates

  • Don’t send large (e.g., > 1KB) updates to each of the m neighbors
  • Instead, use harbingers to dynamically build a spanning-tree update graph
  • Harbinger: small message with the update’s timestamps
  • Send updates along spanning-tree edges
  • Happens in two phases

SLIDE 8

Optimized Replica Updates

Exploit Physical Topology

  • Before pushing a harbinger to a neighbor, add a random delay ~ RTT (e.g., 10*RTT)
  • Harbingers propagate down the fastest links first
  • Dynamically builds an update spanning tree with fast edges
[Animation: Update Example, Phase 1: harbingers spread among nodes A-F]

SLIDE 9

[Animation: Update Example, Phase 1 (continued) and start of Phase 2, nodes A-F]

SLIDE 10

[Animation: Update Example, Phase 2 (continued), nodes A-F]

Conflict Resolution

  • Use a combination of version vectors and last-writer-wins to resolve conflicts
  • If timestamps mismatch, the full content is transferred
  • Missing update: just overwrite the replica
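A sketch of how version vectors and last-writer-wins can combine (the dict-based replica layout is illustrative): if one version vector dominates the other, the dominated replica is simply overwritten; only a true concurrent conflict falls back to comparing timestamps.

    def compare_vv(a, b):
        """Compare two version vectors (dicts: node -> counter)."""
        keys = set(a) | set(b)
        a_ge = all(a.get(k, 0) >= b.get(k, 0) for k in keys)
        b_ge = all(b.get(k, 0) >= a.get(k, 0) for k in keys)
        if a_ge and b_ge:
            return "equal"
        return "a-dominates" if a_ge else "b-dominates" if b_ge else "conflict"

    def resolve(local, remote):
        """Keep the dominating version; on a real conflict, last-writer-wins by timestamp."""
        order = compare_vv(local["vv"], remote["vv"])
        if order in ("equal", "a-dominates"):
            return local
        if order == "b-dominates":
            return remote
        return max(local, remote, key=lambda r: r["ts"])

    local  = {"vv": {"n1": 2, "n2": 1}, "ts": 100, "contents": "edited on n1"}
    remote = {"vv": {"n1": 1, "n2": 2}, "ts": 120, "contents": "edited on n2"}
    print(resolve(local, remote)["contents"])   # "edited on n2": concurrent edits, newer ts wins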

Regular File Conflict (Three Solutions)

1) Last-writer-wins, using update timestamps

  • Requires server clock synchronization

2) Concatenate both updates

  • Make the user fix it

3) Possibly application-specific resolution

SLIDE 11

Directory Conflict

alice$ mv /foo /alice/foo
bob$   mv /foo /bob/foo

(Should foo end up in the /alice replica set or the /bob replica set?)

Let the child (foo) decide!

  • Implement mv as a change to the file’s backpointer
  • Single file resolves conflicting updates
  • File then updates affected directories
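A toy illustration of the "let the child decide" rule, with made-up field names: each rename is recorded as an update to the file's backpointer, the newest rename wins at the single file replica, and the winning parent is what both directories are told to reflect.

    def apply_rename(file_replica, new_parent, new_name, ts):
        """Record a rename as a backpointer change; the latest rename wins."""
        if ts > file_replica["backptr_ts"]:
            file_replica["backptr"] = (new_parent, new_name)
            file_replica["backptr_ts"] = ts
        return file_replica["backptr"]

    foo = {"backptr": ("/", "foo"), "backptr_ts": 0}

    apply_rename(foo, "/alice", "foo", ts=11)            # alice$ mv /foo /alice/foo
    winner = apply_rename(foo, "/bob", "foo", ts=12)     # bob$ mv /foo /bob/foo (later)
    print(winner)   # ('/bob', 'foo'); foo then updates both /alice and /bob accordingly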

Temporary Failure Recovery

Log outstanding remote operations

  • Update, random walk, edge addition, etc.

Retry logged updates

  • On reboot
  • On recovery of another node

Can create superfluous edges

Retains m-connectedness
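A minimal sketch of the retry log, assuming operations are idempotent so replaying them is safe (the log path, JSON format, and `send` callback are placeholders for illustration):

    import json, os, tempfile

    LOG = os.path.join(tempfile.gettempdir(), "pangaea_retry.log")   # hypothetical location

    def log_remote_op(op):
        """Append an outstanding remote operation (update, random walk, edge addition...)."""
        with open(LOG, "a") as f:
            f.write(json.dumps(op) + "\n")

    def retry_pending(send):
        """On reboot, or when a peer recovers, replay every logged operation.
        Operations that still fail stay in the log for the next retry."""
        if not os.path.exists(LOG):
            return
        with open(LOG) as f:
            pending = [json.loads(line) for line in f if line.strip()]
        still_failing = [op for op in pending if not send(op)]
        with open(LOG, "w") as f:
            for op in still_failing:
                f.write(json.dumps(op) + "\n")

    log_remote_op({"type": "update", "target": "node-b", "fid": "0x1234"})
    retry_pending(send=lambda op: True)    # pretend node-b is reachable again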

SLIDE 12

Permanent Failures

A garbage collector (GC) scans for failed nodes

Bronze replica on a failed node

  • GC causes the replica’s neighbors to replace the link with a new peer using a random walk

Permanent Failure

Gold replica on a failed node

  • Discovered by another gold (clique)
  • Chooses a new gold by random walk
  • Floods the choice to all replicas
  • Updates the parent directory to contain the new gold replica nodes
  • Resolves conflicts with last-writer-wins
  • Expensive!
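A sketch of the directory-entry repair, reusing the DirEntry fields from the structure slide (the update rule shown, last-writer-wins on the entry timestamp, is an illustrative simplification):

    def replace_failed_gold(dir_entry, failed, replacement, ts):
        """A surviving gold replaces a failed gold node in the parent directory entry."""
        if ts <= dir_entry["ts"]:
            return dir_entry                 # a newer repair already won (last-writer-wins)
        golds = [n for n in dir_entry["downlinks"] if n != failed] + [replacement]
        dir_entry.update(downlinks=golds, ts=ts)
        return dir_entry

    entry = {"fname": "foo", "fid": "0x1234", "downlinks": ["n1", "n2", "n3"], "ts": 5}
    print(replace_failed_gold(entry, failed="n2", replacement="n7", ts=9))
    # {'fname': 'foo', 'fid': '0x1234', 'downlinks': ['n1', 'n3', 'n7'], 'ts': 9}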

Performance – LAN

Andrew-Tcl benchmarks, time in seconds

Performance – Slow Link

The importance of local replicas

SLIDE 13

Performance – Roaming

Compile on C1, then time the compile on C2. Pangaea utilizes fast links to a peer’s replicas.

Performance: Non-uniform Net

A model of HP’s corporate network.

Performance: Non-uniform Net

Performance: Update Propagation

Harbinger time is the window of inconsistency.

SLIDE 14

Performance: Large Scale

HP: 3,000-node, 7-region model of the HP network. U: 500 regions, 6 nodes per region, 200ms RTT, 5Mb/s links. Latency improves with more replicas.

Performance: Large Scale

HP: 3,000-node, 7-region model of the HP network. U: 500 regions, 6 nodes per region, 200ms RTT, 5Mb/s links. Network economy improves with more replicas.

Performance: Availability

Numbers in parentheses are relative storage overhead.