


Petal and Frangipani

Petal/Frangipani

[Diagram: layered view relating Petal, Frangipani, NFS, "SAN", and "NAS".]

Petal/Frangipani

[Diagram: Petal, Frangipani, and NFS layers, annotated: untrusted; OS-agnostic; FS semantics; sharing/coordination; disk aggregation ("bricks"); filesystem-agnostic; recovery and reconfiguration; load balancing; chained declustering; snapshots; does not control sharing.]

Each "cloud" may resize or reconfigure independently. What indirection is required to make this happen, and where is it?

Remaining Slides

The following slides have been borrowed from the Petal and Frangipani presentations, which were available on the Web until Compaq SRC dissolved. This material is owned by Ed Lee, Chandu Thekkath, and the other authors of the work. The Frangipani material is still available through Chandu Thekkath's site at www.thekkath.org. For CPS 212, several issues are important:

  • Understand the role of each layer in the previous slides, and the strengths and limitations of each layer as a basis for innovating behind its interface (NAS/SAN).
  • Understand the concepts of virtual disks and a cluster file system embodied in Petal and Frangipani.
  • Understand the similarities/differences between Petal and the other reconfigurable cluster service work we have studied: DDS and Porcupine.
  • Understand how the features of Petal simplify the design of a scalable cluster file system (Frangipani) above it.
  • Understand the nature, purpose, and role of the three key design elements added for Frangipani: leased locks, a write-ownership consistent caching protocol, and server logging for recovery.


Petal: Distributed Virtual Disks

Edward K. Lee and Chandramohan A. Thekkath
Systems Research Center, Digital Equipment Corporation

10/24/2002


Logical System View

[Diagram: client file systems (AdvFS, NTFS, PC FS, UFS) use virtual disks /dev/vdisk1 through /dev/vdisk5, provided by Petal over a scalable network.]



Physical System View

[Diagram: a parallel database or cluster file system accesses the shared virtual disk /dev/shared1, which is served by four Petal servers connected by a scalable network.]


Virtual Disks

  • Each virtual disk provides a 2^64-byte address space.
  • Created and destroyed on demand.
  • Allocates physical storage on demand.
  • Snapshots via copy-on-write.
  • Online incremental reconfiguration.
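As a concrete illustration of on-demand allocation and copy-on-write snapshots, here is a minimal Python sketch with an assumed block size; it is not Petal's actual implementation.

```python
BLOCK = 64 * 1024   # assumed block size; Petal's real allocation unit differs

class VirtualDisk:
    """Toy sparse virtual disk: a huge address space whose physical blocks are
    allocated only on first write, with copy-on-write snapshots."""

    def __init__(self):
        self.blocks = {}        # virtual block number -> bytearray, allocated on demand
        self.snapshots = []     # each snapshot is a frozen {vbn: block} mapping

    def snapshot(self):
        # The snapshot shares the current block references; nothing is copied yet.
        snap = dict(self.blocks)
        self.snapshots.append(snap)
        return snap

    def write(self, offset, data):
        # Sketch assumes a write does not cross a block boundary.
        vbn, start = divmod(offset, BLOCK)
        block = self.blocks.get(vbn)
        if block is None or any(s.get(vbn) is block for s in self.snapshots):
            # Allocate on demand, or copy-on-write if a snapshot still shares the block.
            block = bytearray(block) if block is not None else bytearray(BLOCK)
            self.blocks[vbn] = block
        block[start:start + len(data)] = data

    def read(self, offset, length):
        vbn, start = divmod(offset, BLOCK)
        block = self.blocks.get(vbn, bytes(BLOCK))   # unallocated space reads as zeros
        return bytes(block[start:start + length])
```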


Virtual to Physical Translation

[Diagram: a (vdiskID, offset) pair is translated through the virtual disk directory and the GMap to the responsible server; that server's PMap (PMap0 through PMap3 on Server 0 through Server 3) yields the final (disk, diskOffset).]
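A rough sketch of that translation path, using hypothetical map structures; the real Petal data structures and placement policy differ in detail.

```python
BLOCK = 64 * 1024   # assumed translation granularity for the sketch

class GMap:
    """Hypothetical global map: spreads a virtual disk's blocks over its servers."""
    def __init__(self, servers):
        self.servers = servers

    def server_for(self, vblock):
        return self.servers[vblock % len(self.servers)]

def translate(vdisk_directory, pmaps, vdisk_id, offset):
    """(vdiskID, offset) -> (server, disk, diskOffset), mirroring the diagram."""
    vblock, within = divmod(offset, BLOCK)
    gmap = vdisk_directory[vdisk_id]                         # 1. virtual disk directory -> GMap
    server = gmap.server_for(vblock)                         # 2. GMap picks the responsible server
    disk, disk_offset = pmaps[server][(vdisk_id, vblock)]    # 3. that server's local PMap
    return server, disk, disk_offset + within
```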


Global State Management

  • Based on Leslie Lamport's Paxos algorithm.
  • Global state is replicated across all servers.
  • Consistent in the face of server and network failures.
  • A majority is needed to update global state.
  • Any server can be added/removed in the presence of failed servers.
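To illustrate the majority rule, here is a toy one-shot vote; Petal actually runs Lamport's Paxos, which also handles competing proposers and recovery.

```python
def update_global_state(servers, new_state):
    """Toy majority update: the change commits only if more than half of ALL
    servers (including failed ones) acknowledge it.

    `server.propose(state)` is a hypothetical RPC returning True on an ack.
    """
    acks = 0
    for server in servers:
        try:
            if server.propose(new_state):
                acks += 1
        except OSError:
            continue                    # unreachable or failed server: no ack
    # Majority of the total membership, so a minority of failures can neither
    # block progress nor fork the replicated state.
    return acks > len(servers) // 2
```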


Fault-Tolerant Global Operations

  • Create/delete virtual disks.
  • Snapshot virtual disks.
  • Add/remove servers.
  • Reconfigure virtual disks.


Data Placement & Redundancy

  • Supports non-redundant and chained-declustered virtual disks.
  • Parity can be supported if desired.
  • Chained declustering tolerates any single component failure.
  • Tolerates many common multiple failures.
  • Throughput scales linearly with additional servers.
  • Throughput degrades gracefully with failures.



Chained Declustering

Server0: D0  D3  D4  D7
Server1: D1  D0  D5  D4
Server2: D2  D1  D6  D5
Server3: D3  D2  D7  D6

(Block Di has its primary copy on server i mod 4 and a secondary copy on the next server in the chain, (i+1) mod 4.)


Chained Declustering

Server0: D0  D3  D4  D7
Server1: D1  D0  D5  D4   (failed)
Server2: D2  D1  D6  D5
Server3: D3  D2  D7  D6

(With Server1 failed, each of its blocks is still available from a neighbor's copy, and the extra read load can be shifted around the chain.)
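The placement rule behind these two figures is compact enough to state as code; a small illustrative sketch, not Petal's code:

```python
def chained_placement(block, num_servers):
    """Chained declustering: the primary copy of block i lives on server
    i mod N and the secondary copy on the next server in the chain,
    (i + 1) mod N, so any single server failure leaves every block reachable."""
    return block % num_servers, (block + 1) % num_servers

# Reproduce the 4-server layout in the figures above (blocks D0..D7).
for b in range(8):
    primary, backup = chained_placement(b, 4)
    print(f"D{b}: primary on Server{primary}, copy on Server{backup}")
```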


The Prototype

Digital ATM network.

  • 155 Mbit/s per link.

8 AlphaStation Model 600 workstations.

  • 333 MHz Alpha running Digital Unix.

72 RZ29 disks.

  • 4.3 GB, 3.5-inch, fast SCSI (10 MB/s).
  • 9 ms avg. seek, 6 MB/s sustained transfer rate.

Unix kernel device driver. User-level Petal servers.


The Prototype

[Diagram: client workstations src-ss1 through src-ss8 and Petal servers petal1 through petal8 connected by the Digital ATM network (AN2); each client sees the same /dev/vdisk1.]


Throughput Scaling

[Plot: throughput scale-up vs. number of servers (2 to 8) for 512 B, 8 KB, and 64 KB reads and writes, compared against linear scaling.]


Virtual Disk Reconfiguration

[Plot: throughput (MB/s) vs. elapsed time (minutes) during online reconfiguration of a virtual disk with 1 GB of allocated storage, under 8 KB reads and writes; curves for 6 servers and 8 servers.]


Frangipani: A Scalable Distributed File System

C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center, Digital Equipment Corporation

Why Not An Old File System on Petal?

  • Traditional file systems (e.g., UFS, AdvFS) cannot share a block device.
  • The machine that runs the file system can become a bottleneck.

Frangipani

Behaves like a local file system

  • multiple machines cooperatively manage a Petal disk
  • users on any machine see a consistent view of data

Exhibits good performance, scaling, and load balancing
Easy to administer

Ease of Administration

Frangipani machines are modular

  • can be added and deleted transparently

Common free space pool

  • users don’t have to be moved

Automatically recovers from crashes
Consistent backup without halting the system

Components of Frangipani

File system core

  • implements the Digital Unix vnode interface
  • uses the Digital Unix Unified Buffer Cache
  • exploits Petal’s large virtual space

Locks with leases
Write-ahead redo log

Locks

Multiple reader / single writer
Locks are moderately coarse-grained

  • protects entire file or directory

Dirty data is written to disk before a lock is given to another machine

Each machine aggressively caches locks

  • uses lease timeouts for lock recovery
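A toy sketch of the lease idea, with invented names and parameters; the real Frangipani lock service also supports shared read locks and drives the cache-consistency protocol.

```python
import time

LEASE_SECONDS = 30     # assumed lease length for the sketch

class LeasedWriteLock:
    """Toy write lock with a lease, illustrating lease-timeout recovery."""

    def __init__(self):
        self.holder = None
        self.expires = 0.0

    def acquire(self, machine):
        now = time.time()
        if self.holder is not None and now < self.expires:
            return False        # lease still live: caller must wait or request revocation
        # Lease expired (the holder may have crashed): in Frangipani its log would
        # be replayed before the lock is granted again.
        self.holder, self.expires = machine, now + LEASE_SECONDS
        return True

    def renew(self, machine):
        if self.holder == machine and time.time() < self.expires:
            self.expires = time.time() + LEASE_SECONDS
            return True
        return False            # too late: the holder must stop using the lock
```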

Logging

Frangipani uses a write-ahead redo log for metadata

  • log records are kept on Petal

Data is written to Petal

  • on sync, fsync, or every 30 seconds
  • on lock revocation or when the log wraps

Each machine has a separate log

  • reduces contention
  • independent recovery
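A minimal sketch of the per-server redo-logging discipline, with an invented record format; the real Frangipani log lives in a region of a Petal virtual disk rather than a local file.

```python
import json
import os

class RedoLog:
    """Toy per-server write-ahead redo log."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")
        self.seq = 0

    def log_update(self, update):
        # 1. Append the redo record and force it to stable storage...
        self.seq += 1
        self.f.write(json.dumps({"seq": self.seq, "update": update}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())
        # 2. ...only afterwards may the metadata blocks themselves be written in place.
        return self.seq

    def replay(self, apply_fn):
        # Recovery: any machine can re-apply the records in order; apply_fn must be
        # idempotent because some logged updates may already have reached the disk.
        with open(self.path) as f:
            for line in f:
                apply_fn(json.loads(line)["update"])
```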

Recovery

Recovery is initiated by the lock service
Recovery can be carried out on any machine

  • log is distributed and available via Petal
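Putting logging, leases, and Petal together, here is a hypothetical end-to-end recovery flow; the interfaces are invented for illustration and are not the actual protocol.

```python
def recover_failed_server(lock_service, petal, failed_id, apply_fn):
    """Toy recovery flow with hypothetical lock_service/petal interfaces.

    1. The lock service notices that the failed server's lease has expired.
    2. Any live machine reads that server's redo log from Petal and replays it.
    3. Only then are the failed server's locks released for other machines."""
    for record in petal.read_log(failed_id):    # the log is on shared Petal storage
        apply_fn(record)                         # replay in order; records are idempotent
    lock_service.release_all(failed_id)          # safe: metadata is consistent again
```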