MD/RAID-456 Write Journal and Cache, by Shaohua Li & Song Liu



slide-1
SLIDE 1
slide-2
SLIDE 2

MD/RAID-456 Write Journal and Cache

Shaohua Li & Song Liu

Software Engineer, Facebook

slide-3
SLIDE 3

MD/RAID-456 Write Journal and Cache

  • Write holes of RAID-456
  • Hardware RAID: benefits and challenges
  • Write operation in MD/RAID-456
  • RAID-456 write journal: plug the write hole
  • RAID-456 write cache: fast fsync(), more full stripe writes
  • Examples
slide-4
SLIDE 4

Write Hole of RAID-456

slide-5
SLIDE 5

Failure Recovery of RAID-456

  • Disk failure recovery: rebuild data from parity
  • Power failure recovery: resync stripes with mismatched data and parity

slide-6
SLIDE 6

RAID-5: Data and Parity in Sync

D xor D xor D = P

slide-7
SLIDE 7

RAID-5 Disk Failure

D P D D

slide-8
SLIDE 8

RAID-5 Degraded Mode

D P D

slide-9
SLIDE 9

RAID-5 Rebuild Data for Disk Failure

D1 D2 D P

D = P - D1 - D2 (with xor, subtraction is the same operation as addition: D = P xor D1 xor D2)
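The rebuild on this slide can be tried out with a small sketch. The block contents and the helper name `xor_blocks` are made up for illustration; only the xor relation comes from the slides.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (illustrative contents).
d1, d2, d3 = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
parity = xor_blocks(d1, d2, d3)

# The disk holding d3 fails: rebuild it from parity and the surviving data.
rebuilt = xor_blocks(parity, d1, d2)
assert rebuilt == d3
```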

slide-10
SLIDE 10

RAID-5 Power Failure

D D D P D D D P

slide-11
SLIDE 11

RAID-5 Power Failure

D D D P D D D P

slide-12
SLIDE 12

RAID-5 after Power Failure

D D D P

slide-13
SLIDE 13

RAID-5 Resync after Power Failure

D D D (drop old parity)

slide-14
SLIDE 14

RAID-5 Resync after Power Failure

D xor D xor D = P (calculate new parity)

slide-15
SLIDE 15

Write Hole: Disk Failure + Power Failure

D D P

slide-16
SLIDE 16

Write Hole: Rebuild Wrong Data

D D D P
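A minimal sketch of the write hole, assuming a 3+1 stripe with one-byte blocks (the values are arbitrary): new data reaches disk, power fails before parity is updated, and a disk then fails before resync. Rebuilding from the stale parity returns wrong data.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Consistent stripe: three data blocks and their parity.
data = [b"\xaa", b"\xbb", b"\xcc"]
parity = xor_blocks(*data)

# Partial-stripe write: new data[0] hits disk, power fails before parity is rewritten.
on_disk = [b"\x11", data[1], data[2]]
stale_parity = parity

# The disk holding data[1] fails before resync; rebuild uses the stale parity.
rebuilt_d1 = xor_blocks(stale_parity, on_disk[0], on_disk[2])
write_hole_hit = rebuilt_d1 != data[1]
assert write_hole_hit  # data[1] was never written, yet it is now corrupted
```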

slide-17
SLIDE 17

Hardware RAID

slide-18
SLIDE 18

Hardware RAID

D D D P Battery-backed RAM

slide-19
SLIDE 19

Hardware RAID: No Write Hole

D D P D D P Battery-backed RAM

slide-20
SLIDE 20

Hardware RAID: No Write Hole

D D P D D P Battery-backed RAM

slide-21
SLIDE 21

Hardware RAID: No Write Hole

D D P D D P Battery-backed RAM

slide-22
SLIDE 22

Hardware RAID: Fast fsync()

D D D P Battery-backed RAM return writes from RAM D

slide-23
SLIDE 23

Hardware RAID: Fast fsync()

D D D P Battery-backed RAM return writes from RAM D D

slide-24
SLIDE 24

Hardware RAID: Full Stripe Write

D D D P Battery-backed RAM D D D

slide-25
SLIDE 25

Hardware RAID: Full Stripe Write

D D D P Battery-backed RAM D D D P

slide-26
SLIDE 26

Hardware RAID: Full Stripe Write

D D D P Battery-backed RAM D D D P

slide-27
SLIDE 27

Hardware RAID @ Facebook

  • RAID-6
  • Haystack: photo storage
  • GlusterFS: scalable network filesystem
  • RAID-0 of single HDD, for fast fsync()
slide-28
SLIDE 28

Challenges with Hardware RAID

  • Black box solution
  • Low transparency; low flexibility
  • Vendor specific toolset
slide-29
SLIDE 29

Make Software RAID Better

  • Write journal: plug write hole
  • Write cache: accelerate fsync(); more full stripe writes
slide-30
SLIDE 30

MD/RAID-456 Write

slide-31
SLIDE 31

RAID-456: Stripe Cache

D D D P Stripe Cache (System Memory)

slide-32
SLIDE 32

RAID-456 Write

  • Step 1: update data and parity in stripe cache
      • Option 1: Reconstruct
      • Option 2: R-M-W (read-modify-write)
  • Step 2: write data and parity to RAID disks
  • Step 3: bio_endio
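The two options in Step 1 should produce the same parity. A small sketch with made-up one-byte blocks (the helper `xor` is illustrative, not a kernel function):

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

old = [b"\x01", b"\x02", b"\x04"]              # data currently on disk
old_parity = xor(xor(old[0], old[1]), old[2])
new_d0 = b"\x08"                               # incoming write to the first block

# Option 1, reconstruct: read the untouched data blocks, xor them with the new data.
parity_reconstruct = xor(xor(new_d0, old[1]), old[2])

# Option 2, R-M-W: read old data and old parity; P_new = P_old - D_old + D_new (all xor).
parity_rmw = xor(xor(old_parity, old[0]), new_d0)

assert parity_reconstruct == parity_rmw
```

Reconstruct reads the blocks that are not being written; R-M-W reads the blocks that are being written plus the parity, so which one needs fewer reads depends on how much of the stripe the write covers.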
slide-33
SLIDE 33

RAID-456 Reconstruct Write

D D D P Stripe Cache D

slide-34
SLIDE 34

RAID-456 Reconstruct Write

D D D P Stripe Cache D D D

slide-35
SLIDE 35

RAID-456 Reconstruct Write

D D D P Stripe Cache D D D P xor xor =

slide-36
SLIDE 36

RAID-456 Reconstruct Write

D D D P Stripe Cache D D D P

slide-37
SLIDE 37

RAID-456 R-M-W Write

D D D P Stripe Cache D

slide-38
SLIDE 38

RAID-456 R-M-W Write

D D D P Stripe Cache D D P

slide-39
SLIDE 39

RAID-456 R-M-W Write

D D D P Stripe Cache D D P P P_new = P_old - D_old + D_new

slide-40
SLIDE 40

RAID-456 R-M-W Write

D D D P Stripe Cache D P

slide-41
SLIDE 41

MD/RAID-456 Write Journal

slide-42
SLIDE 42

RAID-456 Write Journal

  • Use block device (SSD, NVM, etc.) as the journal
  • No change to read path
  • All writes (data and parity) hit journal before committing to RAID array

  • For each stripe, “commit all” or “commit nothing”
  • Journal replay after power failure (no need for resync)
slide-43
SLIDE 43

RAID-456 Write Journal: Write Path

  • Step 1: update data and parity in stripe cache
  • Step 2: write data and parity to journal device
  • Step 3: flush journal device cache
  • Step 4: write data and parity to RAID disks
  • Step 5: bio_endio
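The ordering of these five steps can be modelled with a toy sketch. `ToyJournal`, `journaled_stripe_write` and the dictionary standing in for the RAID disks are hypothetical names for illustration, not the kernel implementation.

```python
class ToyJournal:
    """Hypothetical journal model: writes are volatile until flush() is called."""
    def __init__(self):
        self.volatile, self.durable = [], []

    def write(self, entry):
        self.volatile.append(entry)

    def flush(self):
        self.durable.extend(self.volatile)
        self.volatile.clear()

def journaled_stripe_write(journal, raid_disks, stripe_no, data, parity, on_complete):
    # Step 1 (stripe cache update) is assumed done: data and parity are its result.
    journal.write((stripe_no, "data", data))      # Step 2: data to journal
    journal.write((stripe_no, "parity", parity))  # Step 2: parity to journal
    journal.flush()                               # Step 3: make the journal durable
    raid_disks[stripe_no] = (data, parity)        # Step 4: write to RAID disks
    on_complete(stripe_no)                        # Step 5: bio_endio

journal, disks, done = ToyJournal(), {}, []
journaled_stripe_write(journal, disks, 7, b"new", b"par", done.append)
```

Because the journal is flushed before the RAID disks are touched, a crash during Step 4 leaves a complete copy of the stripe in the journal.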
slide-44
SLIDE 44

RAID-456 Write Journal: Disk Format

Journal layout: meta block, data block, data block, …, data block, meta block, data block, data block, …, data block

  • Meta block header: magic, checksum, version, meta size, seq, position
  • Entry in meta block (repeated): type: data/parity/flush; flags: discard, reshape; size, sector, checksum
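As a reading aid, the fields named on this slide can be grouped as in the sketch below. The class names, Python types and numeric flag values are assumptions; this does not describe field widths or the actual on-disk byte layout.

```python
from dataclasses import dataclass, field
from enum import Enum, Flag
from typing import List

class EntryType(Enum):      # "type: data/parity/flush"
    DATA = 1
    PARITY = 2
    FLUSH = 3

class EntryFlag(Flag):      # "flags: discard, reshape" (values are placeholders)
    NONE = 0
    DISCARD = 1
    RESHAPE = 2

@dataclass
class MetaEntry:
    type: EntryType
    flags: EntryFlag
    size: int
    sector: int
    checksum: int

@dataclass
class MetaBlock:
    magic: int
    checksum: int
    version: int
    meta_size: int
    seq: int
    position: int
    entries: List[MetaEntry] = field(default_factory=list)

# One meta block describing a data block and a parity block (all numbers made up).
block = MetaBlock(magic=0, checksum=0, version=1, meta_size=0, seq=42, position=8)
block.entries.append(MetaEntry(EntryType.DATA, EntryFlag.NONE, 8, 1024, 0))
block.entries.append(MetaEntry(EntryType.PARITY, EntryFlag.NONE, 8, 1024, 0))
```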

slide-45
SLIDE 45

RAID-456 Write Journal

D D D P Stripe Cache Journal

slide-46
SLIDE 46

RAID-456 Write Journal

Stripe Cache Journal D P D D D P

slide-47
SLIDE 47

RAID-456 Write Journal

Stripe Cache Journal D P D P D D D P

slide-48
SLIDE 48

RAID-456 Write Journal

D D D P Stripe Cache Journal D P D P

slide-49
SLIDE 49

RAID-456 Write Journal: Reclaim Path

  • Step 1: update journal device super block
  • Step 2: issue discard to journal device
slide-50
SLIDE 50

RAID-456 Write Journal: Recovery Path

  • For complete stripe (with data and parity) in journal
      • Replay all data/parity to RAID disks
  • For partial stripe (data only) in journal
      • Drop the journal entry
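The recovery rule above can be sketched as follows. The tuple format for journal entries and the function name `recover` are assumptions made for this example.

```python
def recover(journal_entries, raid_disks):
    """Replay complete stripes, drop partial ones.

    journal_entries: list of (stripe_no, kind, payload), kind is "data" or "parity".
    raid_disks: dict mapping stripe_no to (data, parity), a stand-in for the array.
    """
    by_stripe = {}
    for stripe_no, kind, payload in journal_entries:
        by_stripe.setdefault(stripe_no, {})[kind] = payload

    replayed, dropped = [], []
    for stripe_no, parts in by_stripe.items():
        if "data" in parts and "parity" in parts:
            raid_disks[stripe_no] = (parts["data"], parts["parity"])  # commit all
            replayed.append(stripe_no)
        else:
            dropped.append(stripe_no)                                 # commit nothing
    return replayed, dropped

disks = {}
entries = [(1, "data", b"d1"), (1, "parity", b"p1"), (2, "data", b"d2")]
replayed, dropped = recover(entries, disks)
```

Dropping stripe 2 is safe in journal mode because bio_endio has not been called for it yet, so the writer was never told the write completed.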
slide-51
SLIDE 51

After Power Failure: Replay Writes

D D D P Stripe Cache Journal D P

slide-52
SLIDE 52

After Power Failure: Replay Writes

D D D P Stripe Cache Journal D P

slide-53
SLIDE 53

After Power Failure: Drop Partial Journal

D D D P Stripe Cache Journal D

slide-54
SLIDE 54

After Power Failure: Drop Partial Journal

D D D P Stripe Cache Journal

slide-55
SLIDE 55

MD/RAID-456 Write Cache

slide-56
SLIDE 56

RAID-456 Write Cache

  • Use same disk format as the write journal
  • Move bio_endio to a much earlier stage
  • Hold data in stripe cache
  • Read path must look up in stripe cache
  • Need reclaim and smarter recovery
  • Opportunity for more full stripe writes
slide-57
SLIDE 57

RAID-456 Write Cache: Read Path

  • Chunk aligned read (bypass stripe cache, optimal state)
      • Step 1: look up data in stripe cache
      • Step 2: on a stripe cache miss, read from disk
      • Step 3: amend data from disk with latest data in stripe cache
  • Non-chunk-aligned read
      • No changes
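A toy sketch of the lookup-then-amend idea in Steps 1 to 3, assuming a dictionary as the stripe cache and a list as the disk (both are stand-ins, and `cached_read` is a hypothetical name):

```python
def cached_read(stripe_cache, disk, start, length):
    """Read `length` blocks from `start`, preferring newer blocks held in the cache."""
    result = []
    for block_no in range(start, start + length):
        if block_no in stripe_cache:
            result.append(stripe_cache[block_no])  # latest data, not yet on RAID disks
        else:
            result.append(disk[block_no])          # cache miss: read from disk
    return result

disk = ["a", "b", "c", "d"]
stripe_cache = {1: "B"}                            # block 1 was written but not reclaimed
blocks = cached_read(stripe_cache, disk, 0, 3)
```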
slide-58
SLIDE 58

RAID-456 Write Cache: Write Path

  • Step 1: write data to journal device
  • Step 2: flush journal device cache
  • Step 3: bio_endio
  • Step 4: update data and parity in stripe cache
  • Step 5: write parity to journal device
  • Step 6: flush journal device cache
  • Step 7: write data and parity to RAID disks
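The key change from the journal-only path is that bio_endio moves ahead of the parity work. A toy sketch under simplifying assumptions: blocks are small integers, every journal append is treated as already flushed, and reclaim runs only once a full stripe is cached. `ToyWriteCache` is a hypothetical name.

```python
class ToyWriteCache:
    def __init__(self):
        self.journal = []        # durable journal entries (flush assumed after append)
        self.stripe_cache = {}   # stripe_no -> {block index: data}
        self.raid = {}           # stripe_no -> (data dict, parity)
        self.acked = []          # writes for which bio_endio was called

    def write(self, stripe_no, index, data):
        self.journal.append((stripe_no, "data", index, data))      # Steps 1-2
        self.acked.append((stripe_no, index))                      # Step 3: bio_endio
        self.stripe_cache.setdefault(stripe_no, {})[index] = data  # hold data in cache

    def reclaim(self, stripe_no):
        blocks = self.stripe_cache.pop(stripe_no)
        parity = 0
        for value in blocks.values():                              # Step 4: full stripe parity
            parity ^= value
        self.journal.append((stripe_no, "parity", None, parity))   # Steps 5-6
        self.raid[stripe_no] = (dict(blocks), parity)              # Step 7

cache = ToyWriteCache()
for index, value in enumerate([1, 2, 4]):
    cache.write(0, index, value)
acked_before_raid_write = len(cache.acked) == 3 and cache.raid == {}
cache.reclaim(0)
```

Three small writes are acknowledged individually, then reach the RAID disks as one full stripe write.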
slide-59
SLIDE 59

RAID-456 Write Cache: Write Path

D D D P Stripe Cache Journal D

slide-60
SLIDE 60

RAID-456 Write Cache: Write Path

D D D P Stripe Cache Journal D D

slide-61
SLIDE 61

RAID-456 Write Cache: Write Path

D D D P Stripe Cache Journal D bio_endio D

slide-62
SLIDE 62

RAID-456 Write Cache: Write Path

D D D P Stripe Cache Journal D D D

slide-63
SLIDE 63

RAID-456 Write Cache: Write Path

D D D P Stripe Cache Journal D D D D

slide-64
SLIDE 64

RAID-456 Write Cache: Write Path

D D D P Stripe Cache Journal D D D D bio_endio

slide-65
SLIDE 65

RAID-456 Write Cache: Reclaim Path

  • Step 1: update data and parity in stripe cache
  • Step 2: write parity to journal device
  • Step 3: flush journal device cache
  • Step 4: write data and parity to RAID disks
slide-66
SLIDE 66

RAID-456 Write Cache: Reclaim Path

D D D P Stripe Cache Journal D D D D

slide-67
SLIDE 67

RAID-456 Write Cache: Reclaim Path

D D D P Stripe Cache Journal D D D D D

slide-68
SLIDE 68

RAID-456 Write Cache: Reclaim Path

D D D P Stripe Cache Journal D D D D D P xor xor =

slide-69
SLIDE 69

RAID-456 Write Cache: Reclaim Path

D D D P Stripe Cache Journal D D D D D P P

slide-70
SLIDE 70

RAID-456 Write Cache: Reclaim Path

D D D P Stripe Cache Journal D D D D D P P

slide-71
SLIDE 71

RAID-456 Write Cache: Recovery Path

  • Stripes with data and parity in journal
      • Replay writes of data and parity
  • Stripes with data in journal
      • Repeat full reconstruct write or R-M-W write
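Unlike journal-only recovery, data-only stripes cannot be dropped here, because bio_endio was already called for them. A sketch of both cases, using the reconstruct option for data-only stripes; the entry format, integer blocks and the name `recover_write_cache` are assumptions.

```python
def recover_write_cache(journal, raid):
    """journal: list of (stripe_no, kind, index, value), kind is "data" or "parity".
    raid: stripe_no -> {"data": {index: int}, "parity": int}, updated in place."""
    stripes = {}
    for stripe_no, kind, index, value in journal:
        entry = stripes.setdefault(stripe_no, {"data": {}, "parity": None})
        if kind == "parity":
            entry["parity"] = value
        else:
            entry["data"][index] = value

    for stripe_no, entry in stripes.items():
        on_disk = raid[stripe_no]
        on_disk["data"].update(entry["data"])
        if entry["parity"] is not None:
            on_disk["parity"] = entry["parity"]   # data and parity in journal: replay
        else:
            parity = 0
            for block in on_disk["data"].values():
                parity ^= block                   # data only: redo reconstruct write
            on_disk["parity"] = parity

raid = {
    0: {"data": {0: 1, 1: 2, 2: 4}, "parity": 7},
    1: {"data": {0: 1, 1: 2, 2: 4}, "parity": 7},
}
journal = [(0, "data", 1, 8), (1, "data", 0, 3), (1, "parity", None, 5)]
recover_write_cache(journal, raid)
```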
slide-72
SLIDE 72

Recover Stripe Data: Reconstruct

D D D P Stripe Cache Journal D

slide-73
SLIDE 73

Recover Stripe Data: Reconstruct

D D D P Stripe Cache Journal D D

slide-74
SLIDE 74

Recover Stripe Data: Reconstruct

D D D P Stripe Cache Journal D D D D

slide-75
SLIDE 75

Recover Stripe Data: Reconstruct

D D D P Stripe Cache Journal D D D D P xor xor =

slide-76
SLIDE 76

Recover Stripe Data: Reconstruct

D D D P Stripe Cache Journal D D D D P P

slide-77
SLIDE 77

Recover Stripe Data: Reconstruct

D D D P Stripe Cache Journal D D D D P P

slide-78
SLIDE 78

Recover Stripe Data: R-M-W

D D P Stripe Cache Journal D D

slide-79
SLIDE 79

Recover Stripe Data: R-M-W

D D P Stripe Cache Journal D D D P P P_new = P_old - D_old + D_new

slide-80
SLIDE 80

Recover Stripe Data: R-M-W

D D P Stripe Cache Journal D D P P

slide-81
SLIDE 81

Recover Stripe Data: R-M-W

D D P Stripe Cache Journal D D P P

slide-82
SLIDE 82

Current Status

  • Write Journal
      • Kernel changes released with kernel 4.4
      • mdadm changes released with mdadm-3.4
  • Write Cache
      • Kernel changes in progress
      • No change required for mdadm
slide-83
SLIDE 83

Examples

# create array with write journal
mdadm --create -f /dev/md0 -c 64 --raid-devices=4 --level=5 /dev/sd[b-e] --write-journal /dev/sdf

# check array with journal
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdf[4](J) sde[3] sdd[2] sdc[1] sdb[0]

# add journal to existing array
mdadm --manage /dev/md0 --add-journal /dev/sdf

slide-84
SLIDE 84