HydraFS: a High-Throughput File System for the HYDRAstor - PowerPoint PPT Presentation

SLIDE 1

HydraFS: a High-Throughput File System for the HYDRAstor Content-Addressable Storage System

Cristian Ungureanu, Benjamin Atkin, Akshat Aranya, Salil Gokhale, Steve Rago, Grzegorz Calkowski, Cezary Dubnicki, Aniruddha Bohra

Feb 26, 2010

SLIDE 2

FAST 2010 – HydraFS: a High Throughput Filesystem for CAS

Content-addressable API

HYDRAstor: De-duplicated Scalable Storage

  • Scale-out storage
  • With global de-duplication
  • Using Content-Defined Chunking
  • Resilient to multiple failures
  • Easy to manage (self-healing,…)
  • High throughput for streaming access
  • Std. interfaces (NFS/CIFS, VTL,…)


FAST’09 HYDRAstor: a Scalable Secondary Storage

  • Scalable
  • Easy to manage
  • Resilient
  • High throughput

Content-addressable Block Store

  • Standard protocols
  • Chunking
  • High throughput

Access Layer (FAST’10)

  • HydraFS: a High Throughput Filesystem
  • Bimodal CDC for Backup Streams
SLIDE 3

HYDRAstor Usage Example

B1

  • Variable-size blocks

Block Store File System

Block Store (CAS) API

SLIDE 4

HYDRAstor Usage Example

CA1 B1

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store

Block Store (CAS) API

Block Store File System

SLIDE 5

HYDRAstor Usage Example

CA1 B2 B1

Block Store File System

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store

Block Store (CAS) API

SLIDE 6

HYDRAstor Usage Example

CA2 CA1 B2 B1

Block Store File System

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store

Block Store (CAS) API

SLIDE 7

HYDRAstor Usage Example

CA2 CA1 B2 B1

Block Store File System

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store

Block Store (CAS) API

B3

SLIDE 8

HYDRAstor Usage Example

CA2 CA1 B2 B1

Block Store File System

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store
  • Duplicates eliminated by store

Block Store (CAS) API

B3 CA1

SLIDE 9

HYDRAstor Usage Example

CA2 CA1 B2 B1

Block Store File System

B3 CA1

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store
  • Duplicates eliminated by store

Block Store (CAS) API
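The store-chosen addresses and deduplication shown in these frames can be sketched in a few lines. A toy content-addressable store (Python; the class and method names are illustrative, not the actual HYDRAstor API):

```python
import hashlib

class BlockStore:
    """Toy content-addressable store: the address is derived from content."""
    def __init__(self):
        self.blocks = {}

    def write(self, data: bytes) -> str:
        ca = hashlib.sha256(data).hexdigest()  # address decided by the store
        self.blocks[ca] = data                 # duplicate writes are no-ops
        return ca

    def read(self, ca: str) -> bytes:
        return self.blocks[ca]

store = BlockStore()
ca1 = store.write(b"B1")
ca2 = store.write(b"B2")
assert store.write(b"B1") == ca1   # same content, same address: deduplicated
assert ca1 != ca2
```

Writing the same block twice returns the same content address, which is exactly why the duplicate B3 write in the slides costs no extra storage.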


SLIDE 11

HYDRAstor Usage Example

CA2 CA1 B2 B1

Block Store File System

CA1

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store
  • Duplicates eliminated by store

Block Store (CAS) API

SLIDE 12

HYDRAstor Usage Example

B2 B1

Block Store File System

B4 CA2 CA1 CA1

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store
  • Duplicates eliminated by store
  • Configurable block resilience

Block Store (CAS) API

SLIDE 13

HYDRAstor Usage Example

CA3

Block Store File System

B4 CA2 CA1 CA1 B2 B1

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store
  • Duplicates eliminated by store
  • Configurable block resilience

Block Store (CAS) API


SLIDE 15

HYDRAstor Usage Example

CA3 Root1

Block Store File System

B4 CA2 CA1 CA1 B2 B1

  • Variable-size blocks
  • Content-addressable
  • Address decided by the store
  • Duplicates eliminated by store
  • Configurable block resilience
  • Garbage collection

Block Store (CAS) API


SLIDE 17

Outline

  • HYDRAstor content-addressable API
  • Challenges posed to the filesystem
  • Filesystem architecture
  • Techniques used to overcome the challenges
  • Conclusions and future work


SLIDE 18

Challenges

  • Content-addressable blocks

– A change in a block’s contents also changes the block’s address

  • All metadata has to change, recursively up to the filesystem root
  • Parent can only be written after the children writes are successful


  • Variable-sized chunking (splitting file data into blocks)

– Block boundaries change when content is changed
– Overwrites cause read-rechunk-rewrite

  • High-latency block store operations

– Why? Hashing, compression, erasure coding, fragment distribution, …
– Exacerbates the above two challenges
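The variable-sized chunking challenge is easier to see with a minimal content-defined chunker. This uses a simplified rolling-sum cut condition (illustrative parameters; not HYDRAstor's actual fingerprinting):

```python
def cdc_chunks(data: bytes, window=16, mask=0x3F, min_len=32, max_len=1024):
    """Cut a chunk when a rolling sum over the last `window` bytes matches a
    bit pattern, so boundaries follow content rather than fixed offsets."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i - start < min_len:                  # enforce a minimum chunk size
            continue
        boundary = sum(data[i - window:i]) & mask == 0
        if boundary or i - start >= max_len:     # cut on pattern or max size
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])                  # tail chunk
    return chunks
```

Because boundaries depend only on nearby bytes, an overwrite disturbs only the chunks around the edit; distant chunks keep their content, and therefore their content addresses, and deduplicate. That locality is also why an overwrite forces the read-rechunk-rewrite cycle around the edited region.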

SLIDE 19

Persistent Layout

[Diagram: persistent layout. Filesystem superblock (root block) → inode map root → inode map B-tree → inode map (segmented array); file inode → inode B-tree → file contents; directory inode → directory B-tree → directory contents]

SLIDE 20

HydraFS Architecture

[Diagram: user operations enter the Filesystem, which is split into a File Server and a Commit Server, both backed by the Block Store; data flows to the store, and control messages connect the two servers]

Update log: TS=1; op1, op2, ...  TS=2; ...  TS=3; ...

File System Root Metadata: TS=1 ... TS=20

SLIDE 21

File Server

  • Write buffer

– Accumulates written data; flushed on sync
– Helps re-order NFS packets arriving out of order

SLIDE 22

File Server

  • Write buffer

– Accumulates written data; flushed on sync
– Helps re-order NFS packets arriving out of order

  • Chunker

– Decides block boundaries (based on data content)

SLIDE 23

File Server

Metadata Modification Records (dirty metadata)

  • File offset_range → CA
  • Directory additions/removals
  • Inode map de/allocations

  • Write buffer

– Accumulates written data; flushed on sync
– Helps re-order NFS packets arriving out of order

  • Chunker

– Decides block boundaries (based on data content)

  • Metadata modification records (file, directory, inode map)

– Dirty metadata annotated with a time-stamp (for cleaning)
– Written out to log
– Large amount of dirty metadata!

  • Requires efficient cleaning
  • Resource management issues
SLIDE 24

File Server

Block Cache (clean data & metadata), Metadata Modification Records (dirty metadata)

  • File offset_range → CA
  • Directory additions/removals
  • Inode map de/allocations
  • CA → block data

  • Write buffer

– Accumulates written data; flushed on sync
– Helps re-order NFS packets arriving out of order

  • Chunker

– Decides block boundaries (based on data content)

  • Metadata modification records (file, directory, inode map)

– Dirty metadata annotated with a time-stamp (for cleaning)
– Written out to log

  • Block cache

– Clean data and metadata (not de-serialized)

SLIDE 25

Write Processing

Block Cache Metadata Modification Records Write Buffer Chunker

Block Store

SLIDE 26

Write Processing

Write Buffer Chunker

Block Store

[0,8 KB) Metadata Modification Records Block Cache


SLIDE 28

Write Processing

Write Buffer Chunker

Block Store

[0, 8 KB) [8 KB,16 KB) Metadata Modification Records Block Cache


SLIDE 30

Write Processing

Write Buffer Chunker

Block Store

[12 KB, 16 KB) 12 KB of data Metadata Modification Records Block Cache

SLIDE 31

Write Processing

Write Buffer Chunker

Block Store

12 KB of data

Data blocks

[12 KB, 16 KB) CA1 Metadata Modification Records Block Cache

SLIDE 32

Write Processing

Write Buffer Chunker

Block Store

12 KB of data; TS, [0, 12KB) → CA1

Data blocks

[12 KB, 16 KB) CA1 Metadata Modification Records Block Cache

SLIDE 33

Write Processing

Write Buffer Chunker

Block Store

12 KB of data; TS, [0, 12KB) → CA1

Data blocks

[12 KB, 16 KB) [16 KB,24 KB) Metadata Modification Records Block Cache


SLIDE 35

Write Processing

Write Buffer Chunker

Block Store

[22 KB, 24 KB) 12 KB of data 10 KB of data

Data blocks

TS, [0, 12KB) → CA1
TS, [12KB, 22KB) → CA2
Metadata Modification Records Block Cache

SLIDE 36

Write Processing

Write Buffer Chunker

Block Store

[22 KB, 24 KB) 12 KB of data 10 KB of data
TS, [0, 12KB) → CA1
TS, [12KB, 22KB) → CA2
TS; [0, 12KB) → CA1; [12KB, 22KB) → CA2; …

Log block Data blocks

Metadata Modification Records Block Cache
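The write-processing frames above can be condensed into one sketch of the path from write buffer to log block. Fixed-size cuts stand in for the content-defined chunker, and all names are illustrative:

```python
import hashlib

class ToyStore:
    """Dict-backed stand-in for the content-addressable block store."""
    def __init__(self):
        self.blocks = {}
    def write(self, data: bytes) -> str:
        ca = hashlib.sha256(data).hexdigest()
        self.blocks[ca] = data
        return ca

class WritePath:
    """Write buffer -> chunker -> data blocks -> metadata records -> log."""
    def __init__(self, store, chunk=8192):
        self.store, self.chunk = store, chunk
        self.buf, self.base, self.ts = bytearray(), 0, 0
        self.records = []                      # (ts, off_lo, off_hi, ca)

    def write(self, data: bytes):
        self.buf += data                       # accumulate in the write buffer
        while len(self.buf) >= self.chunk:     # chunker emits full blocks
            blk, self.buf = bytes(self.buf[:self.chunk]), self.buf[self.chunk:]
            ca = self.store.write(blk)         # data block -> content address
            self.ts += 1
            self.records.append((self.ts, self.base, self.base + len(blk), ca))
            self.base += len(blk)

    def flush_log(self) -> str:
        """Pack pending metadata modification records into one log block."""
        recs, self.records = self.records, []
        return self.store.write(repr(recs).encode())
```

As in the slides, a partial chunk stays in the write buffer until more data arrives, while completed blocks produce timestamped offset-range → CA records that are later batched into a log block.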

SLIDE 37

Commit Server

File inode Inode map B-tree

TS

Filesystem superblock

  • Commit server does not read data
SLIDE 38

Commit Server

File inode Inode map B-tree

TS

Filesystem superblock

TS; inode=2, [24KB, 32KB) = CA1
TS; inode=9, [24KB, 32KB) = CA2

Log records

SLIDE 39

Commit Server


  • Amortize updates over many log records
  • Recovery time == the time to re-apply log
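The amortization in the first bullet can be sketched as batching log records per inode before touching that inode's metadata. A dict stands in for the per-file B-tree; names are illustrative:

```python
def apply_log(inode_trees, log_records):
    """Apply a batch of log records: group by inode so each touched inode's
    metadata is updated once per batch, however many records arrived."""
    by_inode = {}
    for ts, inode, extent, ca in log_records:
        by_inode.setdefault(inode, []).append((ts, extent, ca))
    for inode, recs in by_inode.items():
        tree = inode_trees.setdefault(inode, {})  # stand-in for the B-tree
        for ts, extent, ca in sorted(recs):       # replay in timestamp order
            tree[extent] = ca                     # later records win
    return max(r[0] for r in log_records)         # timestamp now committed
```

Recovery then consists of re-applying all log records newer than the last committed timestamp, which is why the slide equates recovery time with log-replay time.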
SLIDE 40

Metadata Cleaning

Block Cache: CA101, CA122

File Modification Records:
  inode #1; root=CA101; min=791; max=791
    791, [8KB, 18KB) → CA1
    791, [18KB, 31KB) → CA2
  inode #2; root=CA122; min=801; max=803
    801, [8KB, 18KB) → CA3
    803, [18KB, 29KB) → CA4
  inode #3; root=CA303; min=805; max=806
    805, [0KB, 10KB) → CA1
    806, [10KB, 21KB) → CA4

SLIDE 41

Metadata Cleaning

[Block cache and file modification records as on slide 40]

update(TimeStamp=802)

SLIDE 42

Metadata Cleaning

[Block cache and file modification records as on slide 40]

update(TimeStamp=802): read the new superblock (TS=802), evict the old one (TS=779)

SLIDE 43

Metadata Cleaning

[Block cache and file modification records as on slide 40]

update(TimeStamp=802): process dirty inodes one by one

SLIDE 44

Metadata Cleaning

[Block cache and file modification records as on slide 40]

update(TimeStamp=802)

Case 1: 802 ≥ max → evict entire inode (applies to inode #1: min=max=791)

SLIDE 45

Metadata Cleaning

Block Cache: CA101, CA122

File Modification Records:
  inode #2; root=CA122; min=801; max=803
    801, [8KB, 18KB) → CA3
    803, [18KB, 29KB) → CA4
  inode #3; root=CA303; min=805; max=806
    805, [0KB, 10KB) → CA1
    806, [10KB, 21KB) → CA4

update(TimeStamp=802)

Case 2: 802 ≥ min and 802 < max → drop root CA; drop records with time_stamp ≤ 802 (applies to inode #2)

SLIDE 46

Metadata Cleaning

Block Cache: CA101, CA122

File Modification Records:
  inode #2; root=?; min=801; max=803
    803, [18KB, 29KB) → CA4
  inode #3; root=CA303; min=805; max=806
    805, [0KB, 10KB) → CA1
    806, [10KB, 21KB) → CA4

update(TimeStamp=802)

Case 3: 802 < min → skip record processing (all are newer); inode root remains unchanged (applies to inode #3)

SLIDE 47

Metadata Cleaning

[Block cache and file modification records as on slide 46]

  • Locks only one inode at a time (no tree locking)
  • No I/O done with the lock held
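The three cleaning cases reduce to one small function over each inode's (min, max) record-timestamp range. A sketch with illustrative field names:

```python
def clean_inode(inode, update_ts):
    """inode: {'root': CA-or-None, 'min': ts, 'max': ts, 'records': [(ts, ...)]}
    Returns None when the whole inode can be evicted from the cache."""
    if update_ts >= inode["max"]:
        return None                                   # case 1: evict entirely
    if update_ts >= inode["min"]:
        inode["root"] = None                          # case 2: root CA is stale
        inode["records"] = [r for r in inode["records"] if r[0] > update_ts]
    return inode                                      # case 3: all records newer
```

Running it with update_ts=802 over the three inodes from the slides evicts inode #1 (min=max=791), drops inode #2's root and its 801 record, and leaves inode #3 untouched.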
SLIDE 48

Read Processing

[Block cache and file modification records as on slide 46]

read(inode=2, off=0, len=8KB)

SLIDE 49

Read Processing

[Block cache and file modification records as on slide 46]

read(inode=2, off=0, len=8KB)

SLIDE 50

Read Processing

[As on slide 46, but inode #2's root is now resolved: root=CA412, and CA412 is fetched into the block cache]

read(inode=2, off=0, len=8KB)


SLIDE 52

Read Performance

Just pre-fetch? Problems

  • High latency → deep read-ahead
  • Poor cache locality for metadata

Solutions

  • Separate data and meta-data pre-fetch
  • Weighted-LRU Policy for Block Cache


SLIDE 53

Data and Metadata Pre-fetch

  • Problem: the time to pre-fetch a data block varies with its position in the B-tree

Compare: B1 – B2 with B4 – B5

  • B1: B15 – B11 – B9 – B1
  • B2: B15 – B11 – B9 – B2
  • B4: B15 – B11 – B10 – B4
  • B5: B15 – B14 – B12 – B5
  • Solution

– Pre-fetch metadata more aggressively than data

[B-tree diagram: two leaf pairs at the same file-offset distance, B1–B2 vs B4–B5, whose paths from the root share different numbers of interior nodes; the pair that diverges higher in the tree is a likely cache miss]
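One way to read the "pre-fetch metadata more aggressively" solution: keep a deeper read-ahead window for B-tree nodes than for data, so interior nodes are resident before the data reads that need them. A sketch with illustrative parameters (not HydraFS's actual ones):

```python
def prefetch_windows(offset, data_ahead=4, meta_ahead=16, leaf_span=8192, fanout=4):
    """Return the leaf indexes to prefetch as data, and the interior-node
    indexes to prefetch as metadata; the metadata window is wider."""
    first = offset // leaf_span
    data_leaves = list(range(first, first + data_ahead))
    meta_leaves = range(first, first + meta_ahead)
    interior = sorted({leaf // fanout for leaf in meta_leaves})
    return data_leaves, interior
```

With the wider metadata window, a data read that crosses into a new subtree finds its parent node already fetched instead of stalling on a high-latency metadata read.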

SLIDE 54

Weighted LRU Policy for Block Cache

  • Problem: different access pattern for data and metadata blocks

– Data blocks being read

  • Clean pages, pinned until read completes
  • Looked-up once, then unlikely to be needed again (for streaming workloads)

– Data blocks pre-fetched

  • Clean pages, not pinned
  • Should avoid evicting before they are read

– Metadata blocks

  • Looked-up more than once, but with large duration between accesses
  • Solution: cache eviction policy that favors metadata blocks

– Insert → assign a weight based on block type
– Lookup → reset to the initial weight, and make MRU in that bucket
– Reclaim → evict blocks with zero weight; decrease every other block's weight by 1
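The insert/lookup/reclaim rules can be sketched directly. A sketch, not the HydraFS implementation; the initial weights 3 and 1 are the ones shown on the following slides:

```python
from collections import OrderedDict

class WeightedLRU:
    """Cache whose entries carry a weight: metadata starts at 3, data at 1.
    Lookup resets the weight; reclamation evicts zero-weight entries and
    decrements the rest."""
    INIT = {"metadata": 3, "data": 1}

    def __init__(self):
        self.entries = OrderedDict()           # key -> [weight, kind]

    def insert(self, key, kind):
        self.entries[key] = [self.INIT[kind], kind]

    def lookup(self, key):
        if key not in self.entries:
            return False
        ent = self.entries[key]
        ent[0] = self.INIT[ent[1]]             # reset to the initial weight
        self.entries.move_to_end(key)          # MRU within its bucket
        return True

    def reclaim(self):
        evicted = [k for k, (w, _) in self.entries.items() if w == 0]
        for k in evicted:
            del self.entries[k]
        for ent in self.entries.values():      # age everything that survived
            ent[0] -= 1
        return evicted
```

A data block not re-read survives only one reclamation pass, while an idle metadata block survives three, which matches the intended bias toward keeping metadata cached.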


SLIDE 55

Weighted LRU

Different block weights

  • initial metadata block weight: 3
  • initial data block weight: 1

Reclamation

  • evict 0-weight blocks
  • reduce all weights by 1

[Diagram: cached blocks B1–B15 arranged in weight buckets 1, 2, and 3]

SLIDE 56

Weighted LRU

[Animation: same weight rules; blocks move between the weight buckets as they are inserted and looked up]


SLIDE 58

Weighted LRU

[Animation: bucket state after further insert and lookup steps]

SLIDE 59

Weighted LRU

[Animation: reclamation triggered; 0-weight blocks are evicted and the remaining weights are decremented]

SLIDE 60

Weighted LRU

[Animation: bucket state after reclamation]

SLIDE 61

Effectiveness of Read Path Optimizations

  • Main techniques

– Pre-fetch metadata more aggressively than data
– Weighted-LRU to evict data more aggressively than metadata

  • Experiment

– Read a large file

            Accesses   Data misses   Metadata misses   Throughput (MB/s)
Base         486,966       1,577          1,011              134.3
Optimized    211,632         438            945              183.2

SLIDE 62

Resource Management

Pre-allocated, managed

  • Fixed-size pools of fixed-size objects: pages for data and blocks, inodes, metadata modification records, log blocks

– pages are 4 KB
– inodes are 8 KB
– log blocks are up to 128 KB
– etc.

Unmanaged heap

  • Auxiliary objects, whose number is bounded by that of some managed objects

SLIDE 63

Resource Management

Resource state: Reserved: 0, Allocated: 0, Total: 10

Admission condition: Requested + Reserved + Allocated ≤ Total

SLIDE 64

Resource Management

Resource state: Reserved: 0, Allocated: 0, Total: 10
Event 1 created: an event's requirements are determined by its event type; admission check: 4 + 0 + 0 ≤ 10

SLIDE 65

Resource Management

Resource state: Reserved: 4, Allocated: 0, Total: 10
Event 1 created; Event 1 admitted

SLIDE 66

Resource Management

Resource state: Reserved: 4, Allocated: 4, Total: 10
Event 1 created; admitted; allocates 4

SLIDE 67

Resource Management

Resource state: Reserved: 0, Allocated: 4, Total: 10
Event 1 completes: its reservations are released; it may leave behind allocated resources

SLIDE 68

Resource Management

Resource state: Reserved: 0, Allocated: 4, Total: 10
Event 2 created; admission check: 3 + 0 + 4 ≤ 10

SLIDE 69

Resource Management

Resource state: Reserved: 3, Allocated: 4, Total: 10
Event 2 admitted

SLIDE 70

Resource Management

Resource state: Reserved: 3, Allocated: 6, Total: 10
Event 2 allocates 2

SLIDE 71

Resource Management

Resource state: Reserved: 3, Allocated: 6, Total: 10
Event 3 created; 3 + 3 + 6 > 10 → the event is blocked

SLIDE 72

Resource Management

Resource state: Reserved: 3, Allocated: 6, Total: 10
Event 4 created; blocked behind event 3 (FIFO admission)

SLIDE 73

Resource Management

Resource state: Reserved: 3, Allocated: 4, Total: 10
Event 2 frees 2; on free, the admission condition is re-evaluated: 3 + 3 + 4 ≤ 10

SLIDE 74

Resource Management

Resource state: Reserved: 6, Allocated: 4, Total: 10
Event 3 admitted; 3 + 6 + 4 > 10 → event 4 remains blocked

SLIDE 75

Resource Management

Resource state: Reserved: 3, Allocated: 4, Total: 10
Event 2 completes; on un-reserve, the admission condition is re-evaluated: 1 + 3 + 4 ≤ 10 → event 4 admitted
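Slides 63–75 walk through one admission-control trace; the scheme itself fits in a short sketch (method names are illustrative). Note that, as on the slides, an event's reservation is held until the event completes, even after it has allocated:

```python
from collections import deque

class ResourceManager:
    """Admit an event only when Requested + Reserved + Allocated <= Total;
    blocked events wait in FIFO order and are re-examined on free/un-reserve."""
    def __init__(self, total):
        self.total = total
        self.reserved = 0
        self.allocated = 0
        self.reservation = {}           # event_id -> outstanding reservation
        self.waiting = deque()          # (event_id, requested), FIFO
        self.admitted = []

    def create(self, event_id, requested):
        self.waiting.append((event_id, requested))
        self._try_admit()

    def _try_admit(self):
        while self.waiting:             # FIFO: stop at the first misfit
            event_id, req = self.waiting[0]
            if req + self.reserved + self.allocated > self.total:
                break
            self.waiting.popleft()
            self.reserved += req
            self.reservation[event_id] = req
            self.admitted.append(event_id)

    def allocate(self, event_id, amount):
        self.allocated += amount        # reservation stays until completion

    def free(self, amount):
        self.allocated -= amount
        self._try_admit()               # on free, re-evaluate admission

    def complete(self, event_id):
        self.reserved -= self.reservation.pop(event_id)
        self._try_admit()               # on un-reserve, re-evaluate admission
```

Replaying the slides' trace (E1 requests 4, allocates 4, completes; E2 requests 3, allocates 2; E3 requests 3) blocks E3 until E2 frees 2, exactly as shown.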

SLIDE 76

Resource Management

  • Reclamation processing

– First, free pages from clean cached blocks
– If that is not sufficient, initiate a flush of dirty inodes
– Flush is an internal event with pre-reserved resources

  • Reclamation initiated when

– An event is blocked
– A threshold is reached

  • Threshold limit depends on resource type

– Metadata modification records can only be cleaned through a metadata update → start earlier
– Others (pages, log blocks) can be cleaned more quickly → start later


SLIDE 77

Resource Management

  • Limits the amount of memory used (avoid swapping)
  • Avoids handling allocation failures in the middle of event processing

  • Avoids event starvation through FIFO processing
  • Simple but effective (allows high utilization of resources)

[Figure: page memory (MB) over time during a sequential write, showing total, allocated, and reserved memory]

SLIDE 78

Experiments

Flow control API

  • Requests can be rejected (“system busy”)
  • Clients are notified when to resume submission

Tool

  • Submits requests until busy; resumes as soon as notified
  • Maximum concurrency; no parent-child structures
  • Gives an upper limit of performance

[Figure: read and write throughput (MB/s) vs. duplicate ratio (%), for the Hydra block store and for HydraFS]

SLIDE 79

Conclusions and Future Work

Conclusions

  • Building a filesystem for a content-addressable storage system with content-defined chunking poses interesting challenges

  • A small number of techniques was sufficient to overcome them while keeping the system relatively simple and achieving high throughput

Future work

  • Distribute the filesystem
  • Use SSDs to improve performance for metadata-intensive workloads


SLIDE 80

Thank you!

We are hiring! http://www.nec-labs.com/jobs/