Petal and Frangipani (Petal/Frangipani) - PowerPoint PPT Presentation



SLIDE 1

Petal and Frangipani

SLIDE 2

Petal/Frangipani

[Diagram: Petal, Frangipani, NFS, “SAN”, “NAS”]

SLIDE 3

Petal/Frangipani

[Diagram: Petal, Frangipani, NFS]

NFS clients: untrusted. Frangipani: OS-agnostic FS semantics; sharing/coordination. Petal: disk aggregation (“bricks”); filesystem-agnostic; recovery and reconfiguration; load balancing; chained declustering; snapshots; does not control sharing. Each “cloud” may resize or reconfigure independently. What indirection is required to make this happen, and where is it?

SLIDE 4

Remaining Slides

The following slides have been borrowed from the Petal and Frangipani presentations, which were available on the Web until Compaq SRC dissolved. This material is owned by Ed Lee, Chandu Thekkath, and the other authors of the work. The Frangipani material is still available through Chandu Thekkath’s site at www.thekkath.org. For CPS 212, several issues are important:

  • Understand the role of each layer in the previous slides, and the strengths and limitations of each layer as a basis for innovating behind its interface (NAS/SAN).
  • Understand the concepts of virtual disks and a cluster file system embodied in Petal and Frangipani.
  • Understand the similarities/differences between Petal and the other reconfigurable cluster service work we have studied: DDS and Porcupine.
  • Understand how the features of Petal simplify the design of a scalable cluster file system (Frangipani) above it.
  • Understand the nature, purpose, and role of the three key design elements added for Frangipani: leased locks, a write-ownership consistent caching protocol, and server logging for recovery.

SLIDE 5

Petal: Distributed Virtual Disks

Edward K. Lee and Chandramohan A. Thekkath
Systems Research Center, Digital Equipment Corporation

10/24/2002

SLIDE 6

Logical System View

/dev/vdisk1 /dev/vdisk2 /dev/vdisk3 /dev/vdisk4 /dev/vdisk5

AdvFS NTFS PCFS UFS

Scalable Network

Petal

SLIDE 7

Physical System View

Scalable Network

Petal Server Petal Server Petal Server Petal Server

Parallel Database or Cluster File System

/dev/shared1

SLIDE 8

Virtual Disks

Each disk provides 2^64 byte address space. Created and destroyed on demand. Allocates disk storage on demand. Snapshots via copy-on-write. Online incremental reconfiguration.
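The behaviors on this slide (storage allocated on first write, snapshots via copy-on-write) can be sketched in Python. The class and method names below are invented for illustration and are not Petal’s actual interface:

```python
class VirtualDisk:
    """Sparse virtual disk sketch: blocks exist only once written,
    and snapshots share block contents copy-on-write."""

    BLOCK_SIZE = 512

    def __init__(self):
        self.blocks = {}     # virtual block number -> bytes
        self.snapshots = []  # each snapshot is a frozen block map

    def write(self, vblock, data):
        # Storage is allocated on demand: unwritten blocks cost nothing.
        self.blocks[vblock] = data

    def read(self, vblock):
        # Unallocated blocks read as zeroes.
        return self.blocks.get(vblock, b"\x00" * self.BLOCK_SIZE)

    def snapshot(self):
        # A snapshot is a shallow copy of the block map; data is shared
        # with the live disk until the live disk overwrites a block.
        snap = dict(self.blocks)
        self.snapshots.append(snap)
        return snap

disk = VirtualDisk()
disk.write(7, b"hello")
snap = disk.snapshot()
disk.write(7, b"world")
# The snapshot still sees the old contents; the live disk sees the new.
```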

SLIDE 9

Virtual to Physical Translation

[Diagram: a virtual address (vdiskID, offset) is looked up in the Virtual Disk Directory, mapped by the GMap to a server, and then by that server’s PMap (PMap0–PMap3 on Server 0–Server 3) to a physical (disk, diskOffset), yielding (server, disk, diskOffset).]
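The translation path on this slide, from (vdiskID, offset) through a global map and a per-server physical map to (server, disk, diskOffset), can be sketched as follows. The striping policy and table layout are simplifying assumptions, not Petal’s actual data structures:

```python
NUM_SERVERS = 4

def gmap(vdisk_id, vblock):
    # Global map: picks the server responsible for this virtual block.
    # Here: simple striping across servers (the real policy is richer).
    return vblock % NUM_SERVERS

def translate(pmaps, vdisk_id, offset, block_size=65536):
    """(vdiskID, offset) -> (server, disk, diskOffset), sketch only."""
    vblock = offset // block_size
    server = gmap(vdisk_id, vblock)
    # Physical map: per-server table from virtual block to (disk, diskOffset).
    disk, disk_offset = pmaps[server][(vdisk_id, vblock)]
    return server, disk, disk_offset + offset % block_size

# Example: block 5 of vdisk 1 lives on disk 2 of server 1 at offset 0.
pmaps = {s: {} for s in range(NUM_SERVERS)}
pmaps[1][(1, 5)] = (2, 0)
print(translate(pmaps, 1, 5 * 65536 + 100))  # -> (1, 2, 100)
```

The two-level split is the point: the GMap decides *which server* owns a block, while each server’s PMap privately decides *where on its disks* the block lives, so servers can relocate data without global coordination.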

SLIDE 10

Global State Management

Based on Leslie Lamport’s Paxos algorithm. Global state is replicated across all servers. Consistent in the face of server & network failures. A majority is needed to update global state. Any server can be added/removed in the presence of failed servers.
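The majority rule alone can be illustrated with a small sketch; full Paxos also needs proposal numbers and a two-phase exchange, which this deliberately omits:

```python
def majority(n_servers):
    """Smallest number of servers that forms a majority of n_servers."""
    return n_servers // 2 + 1

def can_commit(acks, n_servers):
    """An update to the replicated global state commits only if a
    majority of servers acknowledge it (sketch only)."""
    return len(acks) >= majority(n_servers)

# Any two majorities of the same server set overlap in at least one
# server, which is what keeps the replicated global state consistent
# across server and network failures.
assert can_commit({"s1", "s2", "s3"}, 5)
assert not can_commit({"s1", "s2"}, 5)
```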

SLIDE 11

Fault-Tolerant Global Operations

Create/Delete virtual disks. Snapshot virtual disks. Add/Remove servers. Reconfigure virtual disks.

SLIDE 12

Data Placement & Redundancy

Supports non-redundant and chained-declustered virtual disks. Parity can be supported if desired. Chained-declustering tolerates any single component failure. Tolerates many common multiple failures. Throughput scales linearly with additional servers. Throughput degrades gracefully with failures.

SLIDE 13

Chained Declustering

Server0: D0 D3 D4 D7
Server1: D1 D0 D5 D4
Server2: D2 D1 D6 D5
Server3: D3 D2 D7 D6
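The placement rule behind this layout can be written down directly: with n servers, the primary copy of block i goes to server i mod n and the secondary to the next server around the chain (a sketch, with invented function names):

```python
def place(block, n_servers):
    """Chained declustering placement: primary on server (block mod n),
    secondary on the next server around the chain."""
    primary = block % n_servers
    secondary = (block + 1) % n_servers
    return primary, secondary

# Matches the slide's layout: D3's primary is Server3 and its secondary
# wraps around to Server0, so losing any single server still leaves one
# surviving copy of every block.
assert place(3, 4) == (3, 0)
assert place(0, 4) == (0, 1)
```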

SLIDE 14

Chained Declustering

[Diagram: the same layout with Server1 failed; its blocks (D1 D0 D5 D4) are served by the surviving copies on the neighboring servers in the chain.]

SLIDE 15

The Prototype

Digital ATM network.

  • 155 Mbit/s per link.

8 AlphaStation Model 600.

  • 333 MHz Alpha running Digital Unix.

72 RZ29 disks.

  • 4.3 GB, 3.5 inch, fast SCSI (10MB/s).
  • 9 ms avg. seek, 6 MB/s sustained transfer rate.

Unix kernel device driver. User-level Petal servers.

SLIDE 16

The Prototype

[Diagram: client workstations src-ss1 … src-ss8 and Petal servers petal1 … petal8 connected by the Digital ATM Network (AN2); every client sees the same shared /dev/vdisk1.]

SLIDE 17

Throughput Scaling

[Chart: throughput scale-up vs. number of servers (2–8), compared against LINEAR, for 512B, 8KB, and 64KB reads and writes.]

SLIDE 18

Virtual Disk Reconfiguration

[Chart: throughput in MB/s vs. elapsed time in minutes while reconfiguring a virtual disk with 1 GB of allocated storage from 6 servers to 8 servers, under 8KB reads & writes.]

SLIDE 19

Frangipani: A Scalable Distributed File System

C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center, Digital Equipment Corporation

SLIDE 20

Why Not An Old File System on Petal?

Traditional file systems (e.g., UFS, AdvFS) cannot share a block device. The machine that runs the file system can become a bottleneck.

SLIDE 21

Frangipani

Behaves like a local file system

  • multiple machines cooperatively manage a Petal disk
  • users on any machine see a consistent view of data

Exhibits good performance, scaling, and load balancing. Easy to administer.

SLIDE 22

Ease of Administration

Frangipani machines are modular

  • can be added and deleted transparently

Common free space pool

  • users don’t have to be moved

Automatically recovers from crashes. Consistent backup without halting the system.

SLIDE 23

Components of Frangipani

File system core

  • implements the Digital Unix vnode interface
  • uses the Digital Unix Unified Buffer Cache
  • exploits Petal’s large virtual space

Locks with leases

Write-ahead redo log

SLIDE 24

Locks

Multiple reader/single writer. Locks are moderately coarse-grained

  • protects entire file or directory

Dirty data is written to disk before the lock is given to another machine. Each machine aggressively caches locks

  • uses lease timeouts for lock recovery
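A lease-based lock of this kind can be sketched as follows; the 30-second lease value and all names here are illustrative assumptions, and Frangipani’s real lock service is considerably richer:

```python
class LeasedLock:
    """Sketch of a lease-based lock. The holder must renew within
    LEASE_SECS or anyone may reclaim the lock; this is how locks held
    by crashed machines are recovered without their cooperation."""

    LEASE_SECS = 30.0  # illustrative value

    def __init__(self):
        self.holder = None
        self.expires = 0.0

    def acquire(self, machine, now):
        if self.holder is not None and now < self.expires:
            return False          # validly held by someone else
        self.holder = machine     # free, or the previous lease expired
        self.expires = now + self.LEASE_SECS
        return True

    def renew(self, machine, now):
        if self.holder != machine or now >= self.expires:
            return False          # lease already lost
        self.expires = now + self.LEASE_SECS
        return True

lock = LeasedLock()
assert lock.acquire("A", now=0.0)
assert not lock.acquire("B", now=10.0)  # A's lease is still valid
assert lock.acquire("B", now=31.0)      # A's lease expired: B recovers the lock
```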
SLIDE 25

Logging

Frangipani uses a write-ahead redo log for metadata

  • log records are kept on Petal

Data is written to Petal

  • on sync, fsync, or every 30 seconds
  • on lock revocation or when the log wraps

Each machine has a separate log

  • reduces contention
  • independent recovery
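The idea of a per-machine redo log can be sketched like this; the record format is invented for illustration, while the real log holds file-system metadata updates and lives on a Petal virtual disk:

```python
class RedoLog:
    """Per-machine write-ahead redo log (sketch). A record is appended
    before the corresponding metadata update is applied, so after a
    crash the update can be redone from the log."""

    def __init__(self):
        self.records = []

    def append(self, key, value):
        self.records.append((key, value))  # log the intent first

    def replay(self, metadata):
        # Replaying is idempotent here: applying the same records twice
        # yields the same metadata, so any machine can run recovery.
        for key, value in self.records:
            metadata[key] = value
        return metadata

log = RedoLog()
log.append("inode/7/size", 4096)
log.append("dir/tmp/entry", "file.txt")
# After a crash, another machine replays the log to repair the metadata:
assert log.replay({}) == {"inode/7/size": 4096, "dir/tmp/entry": "file.txt"}
```

Keeping one log per machine, rather than one shared log, is what gives the contention and recovery benefits listed above: machines never serialize on a common log tail, and each log can be replayed independently.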
SLIDE 26

Recovery

Recovery is initiated by the lock service. Recovery can be carried out on any machine

  • log is distributed and available via Petal