Petal and Frangipani
Petal/Frangipani

[Diagram: layered view relating Petal ("SAN"), Frangipani, and NFS ("NAS").]
Petal/Frangipani

[Diagram: stack of NFS over Frangipani over Petal, annotated as follows.]

- NFS: untrusted; OS-agnostic FS semantics.
- Frangipani: sharing/coordination.
- Petal: disk aggregation ("bricks"); filesystem-agnostic; recovery and reconfiguration; load balancing; chained declustering; snapshots; does not control sharing.

Each "cloud" may resize or reconfigure independently. What indirection is required to make this happen, and where is it?
Remaining Slides
The following slides have been borrowed from the Petal and Frangipani presentations, which were available on the Web until Compaq SRC dissolved. This material is owned by Ed Lee, Chandu Thekkath, and the other authors of the work. The Frangipani material is still available through Chandu Thekkath's site at www.thekkath.org. For CPS 212, several issues are important:
- Understand the role of each layer in the previous slides, and the strengths and limitations of each layer as a basis for innovating behind its interface (NAS/SAN).
- Understand the concepts of virtual disks and a cluster file system embodied in Petal and Frangipani.
- Understand the similarities/differences between Petal and the other reconfigurable cluster service work we have studied: DDS and Porcupine.
- Understand how the features of Petal simplify the design of a scalable cluster file system (Frangipani) above it.
- Understand the nature, purpose, and role of the three key design elements added for Frangipani: leased locks, a write-ownership consistent caching protocol, and server logging for recovery.
Petal: Distributed Virtual Disks
Edward K. Lee and Chandramohan A. Thekkath
Systems Research Center, Digital Equipment Corporation
10/24/2002
Logical System View
[Diagram: client file systems (AdvFS, NTFS, PC FS, UFS) access virtual disks /dev/vdisk1 through /dev/vdisk5 over a scalable network backed by Petal.]
Physical System View
[Diagram: a parallel database or cluster file system accesses a shared virtual disk (/dev/shared1) over a scalable network connecting multiple Petal servers.]
Virtual Disks
- Each disk provides a 2^64-byte address space.
- Created and destroyed on demand.
- Allocates disk storage on demand.
- Snapshots via copy-on-write (see the sketch below).
- Online incremental reconfiguration.
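As a rough illustration of on-demand allocation and copy-on-write snapshots, here is a minimal single-node sketch in Python. The class and its layout are hypothetical, not Petal's actual distributed block-map design:

    # Hypothetical sketch: a sparse virtual disk with copy-on-write snapshots.
    # Petal implements this with block maps spread across servers; this is
    # one node only.
    class VirtualDisk:
        BLOCK = 512

        def __init__(self):
            self.blocks = {}     # block number -> bytes; allocated only on write
            self.snapshots = []  # each snapshot is a frozen block map

        def write(self, blkno, data):
            self.blocks[blkno] = data   # allocate storage on demand

        def read(self, blkno):
            # Unallocated blocks read as zeros.
            return self.blocks.get(blkno, b"\x00" * self.BLOCK)

        def snapshot(self):
            # Copying the dict copies references, not block contents, so a
            # snapshot is cheap. A later write replaces the entry in
            # self.blocks, leaving the snapshot's version intact (COW).
            snap = dict(self.blocks)
            self.snapshots.append(snap)
            return snap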
Virtual to Physical Translation
[Diagram: a (vdiskID, offset) pair is translated through the Virtual Disk Directory to a GMap, which selects a server (Server 0 through Server 3); that server's PMap (PMap0 through PMap3) then yields the physical (disk, diskOffset).]
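A minimal sketch of this two-level translation, assuming hypothetical table contents and a simplified region-striping policy (the real GMap/PMap formats are more involved):

    # Hypothetical sketch of Petal's translation path. Table names follow
    # the diagram; the contents and the striping policy here are made up.
    REGION = 2**20                                # assumed 1 MB regions

    VDIR = {"vdisk1": "gmap0"}                    # Virtual Disk Directory
    GMAP = {"gmap0": ["server0", "server1",
                      "server2", "server3"]}      # offset region -> server
    PMAP = {                                      # per-server physical maps
        "server0": {("gmap0", 0): ("disk2", 4096)},
        # ... entries for the other servers
    }

    def translate(vdisk_id, offset):
        gmap_id = VDIR[vdisk_id]                             # 1. find the GMap
        servers = GMAP[gmap_id]
        region = (offset // REGION) * REGION
        server = servers[(offset // REGION) % len(servers)]  # 2. pick the server
        disk, disk_off = PMAP[server][(gmap_id, region)]     # 3. server-local PMap
        return server, disk, disk_off + offset % REGION

    print(translate("vdisk1", 512))   # -> ('server0', 'disk2', 4608)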
Global State Management
- Based on Leslie Lamport's Paxos algorithm.
- Global state is replicated across all servers.
- Consistent in the face of server and network failures.
- A majority is needed to update global state (see the sketch below).
- Any server can be added/removed in the presence of failed servers.
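The majority rule is what keeps the replicated state consistent: any two majorities intersect, so a new quorum always contains at least one server that saw the last committed update. Here is a minimal sketch of just that quorum arithmetic, not the full Paxos protocol; propose and commit are hypothetical server methods:

    # Hypothetical sketch of majority-quorum updates to replicated state.
    # This is not Paxos itself, only the quorum rule Paxos relies on.
    def try_update(servers, new_state):
        acks = [s for s in servers if s.propose(new_state)]  # hypothetical RPC
        if len(acks) > len(servers) // 2:        # strict majority required
            for s in acks:
                s.commit(new_state)              # hypothetical RPC
            return True
        return False   # too few live servers; the state is left unchanged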
Fault-Tolerant Global Operations

- Create/delete virtual disks.
- Snapshot virtual disks.
- Add/remove servers.
- Reconfigure virtual disks.
Data Placement & Redundancy
- Supports non-redundant and chained-declustered virtual disks.
- Parity can be supported if desired.
- Chained-declustering tolerates any single component failure, and many common multiple failures.
- Throughput scales linearly with additional servers.
- Throughput degrades gracefully with failures.
Chained Declustering
Each block's primary copy lives on one server and its secondary copy on the next server in the chain:

    Server0   Server1   Server2   Server3
      D0        D1        D2        D3     (primary copies)
      D3        D0        D1        D2     (secondary copies)
      D4        D5        D6        D7     (primary copies)
      D7        D4        D5        D6     (secondary copies)
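A minimal sketch of the placement rule implied by the layout above (a hypothetical function; Petal's real placement also has to survive reconfiguration):

    # Hypothetical sketch: chained-declustered placement for n servers.
    # Block i's primary is server i mod n; its copy is on the next server.
    def place(block, n_servers):
        primary = block % n_servers
        secondary = (primary + 1) % n_servers
        return primary, secondary

    for d in range(8):
        p, s = place(d, 4)
        print(f"D{d}: primary Server{p}, copy Server{s}")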
Chained Declustering
[Diagram: the same layout with Server1 failed. Its primaries (D1, D5) are served from their copies on Server2, its copies (D0, D4) are covered by the primaries on Server0, and reads are offloaded around the chain so every surviving server absorbs part of the extra load.]
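A minimal sketch of read routing under failure, assuming the placement rule above; the alternation policy for the healthy case is made up for illustration:

    # Hypothetical sketch: route a read when some servers are down. Because
    # every block has two copies on adjacent servers, a failed server's
    # load splits between its neighbors in the chain.
    def route_read(block, n_servers, failed):
        primary = block % n_servers
        secondary = (primary + 1) % n_servers
        if primary in failed and secondary in failed:
            raise RuntimeError("both copies unavailable")
        if primary in failed:
            return secondary
        if secondary in failed:
            return primary
        # Both alive: alternate copies to spread load (a toy policy).
        return primary if block % 2 == 0 else secondary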
The Prototype
Digital ATM network.
- 155 Mbit/s per link.
8 AlphaStation Model 600s.
- 333 MHz Alpha running Digital Unix.
72 RZ29 disks.
- 4.3 GB, 3.5-inch, fast SCSI (10 MB/s).
- 9 ms average seek, 6 MB/s sustained transfer rate.
Unix kernel device driver.
User-level Petal servers.
The Prototype
[Diagram: client workstations src-ss1 through src-ss8 and servers petal1 through petal8 connected by the Digital ATM network (AN2); every client mounts the same /dev/vdisk1.]
Throughput Scaling
[Graph: throughput scale-up versus number of servers (2 to 8) for 512 B, 8 KB, and 64 KB reads and writes, compared against the LINEAR ideal.]
Virtual Disk Reconfiguration
[Graph: throughput in MB/s versus elapsed time in minutes while a virtual disk with 1 GB of allocated storage is reconfigured from 6 servers to 8, under 8 KB reads and writes.]
Frangipani: A Scalable Distributed File System
C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center, Digital Equipment Corporation
Why Not An Old File System on Petal?
- Traditional file systems (e.g., UFS, AdvFS) cannot share a block device.
- The machine that runs the file system can become a bottleneck.
Frangipani
Behaves like a local file system.
- Multiple machines cooperatively manage a Petal disk.
- Users on any machine see a consistent view of data.
Exhibits good performance, scaling, and load balancing.
Easy to administer.
Ease of Administration
Frangipani machines are modular.
- Can be added and deleted transparently.
Common free space pool.
- Users don't have to be moved.
Automatically recovers from crashes.
Consistent backup without halting the system.
Components of Frangipani
File system core.
- Implements the Digital Unix vnode interface.
- Uses the Digital Unix Unified Buffer Cache.
- Exploits Petal's large virtual space.
Locks with leases.
Write-ahead redo log.
Locks
Multiple reader/single writer.
Locks are moderately coarse-grained.
- Each protects an entire file or directory.
Dirty data is written to disk before a lock is given to another machine.
Each machine aggressively caches locks.
- Uses lease timeouts for lock recovery (sketched below).
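A minimal sketch of the lease idea, assuming a single centralized lock table and exclusive locks only (Frangipani's real lock service is distributed and supports shared readers); LEASE_SECS and the method names are hypothetical:

    # Hypothetical sketch: lease-based locks. A lock held by a crashed
    # machine is not stuck forever; its lease simply expires.
    import time

    LEASE_SECS = 30   # assumed lease duration

    class LockTable:
        def __init__(self):
            self.locks = {}   # lock name -> (owner, expiry time)

        def acquire(self, name, owner):
            now = time.monotonic()
            held = self.locks.get(name)
            if held and held[0] != owner and held[1] > now:
                return False      # another machine holds an unexpired lease
            self.locks[name] = (owner, now + LEASE_SECS)
            return True

        def renew(self, name, owner):
            held = self.locks.get(name)
            if held and held[0] == owner:
                self.locks[name] = (owner, time.monotonic() + LEASE_SECS)
                return True
            return False          # lease lost; the lock must be reacquired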
Logging
Frangipani uses a write-ahead redo log for metadata (sketched below).
- Log records are kept on Petal.
Data is written to Petal:
- on sync, fsync, or every 30 seconds;
- on lock revocation or when the log wraps.
Each machine has a separate log.
- Reduces contention.
- Allows independent recovery.
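A minimal sketch of the write-ahead discipline, with a hypothetical record format and an in-memory list standing in for a log region on Petal:

    # Hypothetical sketch: per-machine write-ahead redo logging for metadata.
    # The record reaches the log before the metadata block it describes.
    class RedoLog:
        def __init__(self):
            self.records = []    # in Frangipani, these live on a Petal vdisk
            self.next_seq = 0

        def append(self, block, new_contents):
            rec = {"seq": self.next_seq, "block": block, "data": new_contents}
            self.records.append(rec)
            self.next_seq += 1
            return rec

    def update_metadata(log, disk, block, new_contents):
        log.append(block, new_contents)   # 1. log the intended update first
        disk[block] = new_contents        # 2. then apply it to the block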
Recovery
Recovery is initiated by the lock service.
Recovery can be carried out on any machine.
- The log is distributed and available via Petal.
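A minimal sketch of log replay during recovery, continuing the structures above; tracking replay progress with a single sequence number is a simplification (the real system must decide per block whether a logged update has already reached disk):

    # Hypothetical sketch: any machine can replay a failed machine's redo
    # log from shared Petal storage to bring metadata up to date.
    def recover(log_records, disk, applied_seq):
        for rec in sorted(log_records, key=lambda r: r["seq"]):
            if rec["seq"] > applied_seq:          # skip already-applied records
                disk[rec["block"]] = rec["data"]  # redo the metadata update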