

SLIDE 1

SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device

Hyukjoong Kim, Dongkun Shin (Sungkyunkwan University); Yun Ho Jeong, Kyung Ho Kim (Samsung Electronics)

Presented at FAST’17


SLIDE 2

Random write is still slow on SSDs

[Chart: write bandwidth (MB/s) for random writes (4 KB) vs. sequential writes (512 KB) on eMMC 5.0 (Odroid-XU3), UFS (Galaxy S6), SATA SSD (Intel 525), SATA SSD (Samsung 850 Pro), and NVMe SSD (Intel 750)]

SLIDE 3

Why is RW slower than SW?

  • 1. Request Handling Overhead
  • 2. Garbage Collection Overhead
  • 3. Mapping Table Handling Overhead

SLIDE 4

Why is RW slower than SW?

  • 1. Request Handling Overhead
    ▪ Sequential write → few, large requests
    ▪ Random write → many, small requests
  • Existing mitigations for per-request overhead
    ▪ Packed command (e.g., eMMC)
    ▪ Interrupt coalescing (e.g., NVMe, SATA NCQ)
    ▪ Vectored I/O (e.g., OpenChannel SSD [FAST’17])

SLIDE 5

Why is RW slower than SW?

  • 2. Garbage Collection Overhead
    ▪ Hot/cold separation → stores hot and cold data in different blocks
    ▪ Incremental GC / background GC → can hide GC latency
  • RW generates hot/cold-mixed blocks
    ▪ Dispersed invalid pages → high GC overhead

[Diagram: blocks written by RW have valid and invalid pages interleaved, whereas blocks written by SW have invalid pages clustered together]

SLIDE 6

Why is RW slower than SW?

  • 3. Mapping Table Handling Overhead
  • Page-level mapping FTL shows good performance on RW
    ▪ Requires a large DRAM to maintain the fine-grained mapping table
    ▪ 4 bytes per 4 KB page → 8 GB DRAM for 8 TB storage
  • Demand-loading FTL (DFTL [ASPLOS’08])
    ▪ Uses a small map cache with on-demand map loading
    ▪ Random writes invoke frequent map loading/unloading

SLIDE 7

Demand-loading FTL (DFTL)

  • The map caching scheme can show good performance by exploiting temporal & spatial locality
    ▪ Page-level map load/unload
      ✓ One map page contains multiple contiguous mapping entries
  • Vulnerable to random workloads
    ▪ Low temporal & spatial locality → high map miss rate → high map loading overhead

[Diagram: DFTL keeps a small map cache in DRAM, backed by map pages stored in NAND map blocks; map pages are loaded and unloaded through an LRU list, and each data page's OOB area holds its LPN. Example: a write to LPN 768 loads the map page covering LPNs 0–1023]
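To make the map-cache behavior concrete, below is a minimal user-space sketch in C of demand-loaded, LRU-evicted map pages. It is an illustration only, not the prototype firmware; the flash I/O helpers are stubs and all sizes are example values.

#include <stdint.h>
#include <string.h>

#define ENTRIES_PER_MAP_PAGE 1024   /* 4 KB map page / 4-byte mapping entry */
#define MAP_CACHE_SLOTS      8      /* deliberately tiny cache */

struct map_page {
    uint32_t vpn;                          /* map-page number = LPN / 1024 */
    uint32_t ppn[ENTRIES_PER_MAP_PAGE];    /* cached LPN -> PPN entries */
    int      valid;
    int      dirty;
    uint64_t last_use;                     /* timestamp for LRU eviction */
};

static struct map_page cache[MAP_CACHE_SLOTS];
static uint64_t clock_tick;
static uint64_t map_loads;                 /* number of map pages read from flash */

/* Stubs standing in for flash I/O on map blocks. */
static void flash_read_map_page(uint32_t vpn, uint32_t *buf)
{
    (void)vpn;
    memset(buf, 0, sizeof(uint32_t) * ENTRIES_PER_MAP_PAGE);
}

static void flash_write_map_page(uint32_t vpn, const uint32_t *buf)
{
    (void)vpn;
    (void)buf;                             /* write-back of a dirty map page */
}

/* Return the cached map page covering lpn, loading it from flash on a miss. */
static struct map_page *get_map_page(uint32_t lpn)
{
    uint32_t vpn = lpn / ENTRIES_PER_MAP_PAGE;
    struct map_page *victim = &cache[0];

    for (int i = 0; i < MAP_CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].vpn == vpn) {
            cache[i].last_use = ++clock_tick;
            return &cache[i];              /* hit: no flash access */
        }
        if (!cache[i].valid || cache[i].last_use < victim->last_use)
            victim = &cache[i];            /* remember the LRU (or free) slot */
    }

    if (victim->valid && victim->dirty)    /* unload: write back a dirty map page */
        flash_write_map_page(victim->vpn, victim->ppn);

    flash_read_map_page(vpn, victim->ppn); /* load: one flash read per miss */
    victim->vpn = vpn;
    victim->valid = 1;
    victim->dirty = 0;
    victim->last_use = ++clock_tick;
    map_loads++;
    return victim;
}

/* Translate an LPN to a PPN through the map cache. */
uint32_t dftl_lookup(uint32_t lpn)
{
    return get_map_page(lpn)->ppn[lpn % ENTRIES_PER_MAP_PAGE];
}

With sequential LPNs, successive lookups hit the same cached map page; with random LPNs, most lookups touch a different map page, so map_loads approaches one flash read per 4 KB I/O, which is the overhead the following slides target.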

SLIDE 8

Previous Solution: LFS

  • Generates only sequential writes
    ▪ Out-of-place, append-only write scheme
  • Problems
    ▪ Reclaiming log space (cleaning overhead)
      • The filesystem must copy valid pages → host-to-device data transfers
    ▪ Large metadata, wandering-tree problem
    ▪ Fragmented read operations

[Diagram: append logging; to clean the log, valid pages (e.g., LPN 32 and 128) must be read back and rewritten by the filesystem, i.e., cleaning requires host-to-device data transfers]

SLIDE 9

Can we remove copy overhead?

  • The SSD maintains a page-level mapping table
  • Address remapping
    ▪ The logical address of written data can be changed by modifying the mapping table
    ▪ AnViL [FAST’15], SHARE [SIGMOD’16]
  • Log space can be reclaimed with address remapping

[Diagram: copying at the filesystem requires reading LPN 32 and 128 and writing them back, whereas remapping at the storage device only sends a remap request that updates the logical-to-physical mapping table]
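As a minimal sketch of the remapping idea (assuming a flat in-DRAM page-level mapping table indexed by LPN; the actual interfaces of AnViL and SHARE differ), changing a logical address is just a table update:

#include <stdint.h>

#define NUM_LPNS    (1u << 20)     /* example: 4 GB of 4 KB pages */
#define INVALID_PPN UINT32_MAX

static uint32_t mapping[NUM_LPNS]; /* page-level mapping table: mapping[LPN] = PPN */

/* Give the data currently addressed by src_lpn the new logical address
 * dst_lpn.  Only the mapping table changes; no flash page is copied,
 * unlike filesystem-level log cleaning. */
void remap_one(uint32_t src_lpn, uint32_t dst_lpn)
{
    mapping[dst_lpn] = mapping[src_lpn];
    mapping[src_lpn] = INVALID_PPN;
}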

SLIDE 10

Which layer? File System or Block Layer

  • Our solution: append logging at the block layer
    ▪ Append-log writes to a log area temporarily
    ▪ Remap them to their original locations
    ▪ Can utilize legacy filesystems (e.g., EXT4)
  • Simpler metadata management
  • Faster sequential read performance

[Diagram: legacy applications and legacy filesystems run unmodified on top of an append-logging device driver and the storage device]

SLIDE 11

SHRD (Sequentializing in Host, Randomizing in Device)

  • Sequentializing in Host
    ▪ The host OS writes random requests sequentially to a log area
  • Randomizing in Device
    ▪ The SSD modifies its mapping table to change the logical addresses back

[Diagram: the logical address space is split into a reserved log area and a normal area; (1) sequentializing directs writes into the log area, (2) randomizing remaps them to the normal area]

SLIDE 12

SHRD Example: write

[Diagram: multiple small random writes to oLPNs 32, 128, 765, and 854 are logged as one large sequential write to tLPNs 1024–1027 in the FS-invisible log area; the host redirection table records each oLPN → tLPN pair, and the device mapping table maps tLPNs 1024–1027 to PPNs 368–371. LPN: original LPN, tLPN: temporal LPN]
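Below is a minimal host-side sketch (an illustration, not the actual kernel module) of this sequentializing step: each random write is assigned the next tLPN in the log area and its oLPN → tLPN pair is recorded in the redirection table before one large sequential twrite is issued. The names and sizes are assumptions.

#include <stdint.h>

#define LOG_AREA_START 1024u     /* first tLPN of the FS-invisible log area (example value) */
#define LOG_AREA_PAGES 16384u    /* 64 MB log area of 4 KB pages */

struct redir_entry {
    uint32_t olpn;               /* original LPN requested by the filesystem */
    uint32_t tlpn;               /* temporal LPN assigned in the log area */
};

static struct redir_entry redir_table[LOG_AREA_PAGES];
static uint32_t seq_ptr;         /* next free log-area slot */

/* Sequentialize one 4 KB random write: allocate the next tLPN and remember
 * oLPN -> tLPN.  The caller batches the data of consecutive tLPNs into a
 * single large sequential twrite (e.g., oLPN 32 ends up at tLPN 1024). */
static uint32_t sequentialize(uint32_t olpn)
{
    uint32_t slot = seq_ptr++ % LOG_AREA_PAGES;
    uint32_t tlpn = LOG_AREA_START + slot;

    redir_table[slot].olpn = olpn;
    redir_table[slot].tlpn = tlpn;
    return tlpn;
}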

SLIDE 13

SHRD Example: read redirection

[Diagram: a read of LPN 32 is redirected to tLPN 1024 by the host redirection table, since the data is still in the log area; the device mapping table then translates tLPN 1024 to PPN 368. LPN: original LPN, tLPN: temporal LPN]
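Continuing the previous sketch (same hypothetical redir_table and LOG_AREA_PAGES), read redirection in the driver amounts to a table lookup before the read is sent to the device; a linear search is shown only for brevity.

/* If olpn is still logged, return its temporal LPN; otherwise read the
 * original address (entries with tlpn == 0 are unused slots). */
static uint32_t redirect_read(uint32_t olpn)
{
    for (uint32_t i = 0; i < LOG_AREA_PAGES; i++) {
        if (redir_table[i].tlpn != 0 && redir_table[i].olpn == olpn)
            return redir_table[i].tlpn;   /* e.g., a read of LPN 32 goes to tLPN 1024 */
    }
    return olpn;
}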

SLIDE 14

SHRD Example: remap

[Diagram: a remap command for tLPNs 1024–1027 rewrites the device mapping table so that PPNs 368–371 are now indexed by the original LPNs 32, 765, 128, and 854; the host redirection entries and the log-area addresses can then be reclaimed. LPN: original LPN, tLPN: temporal LPN]

SLIDE 15

Can we really reduce map loading overhead?

  • Remap modifies the mapping entries of the sequentialized pages
    ▪ A time-ordered access scheme inherits the original random pattern → low spatial locality
  • oLPN-ordered map access
    ▪ The mapping table is oLPN-indexed
    ▪ Sorting the remap entries by oLPN increases spatial locality (see the sketch after the diagram below)

[Diagram: for the same remapping sequence, processing the entries in time order touches the map pages in random order and needs 8 map loads, whereas processing them in oLPN order needs only 5 map loads]
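A small sketch of that ordering step: before building a remap command, the driver can sort the batched redirection entries by oLPN (reusing the hypothetical redir_entry type from the earlier sketch), so consecutive entries tend to fall into the same map page. A kernel driver would use the kernel's own sort helper; qsort is shown for simplicity.

#include <stdlib.h>

/* Order redirection entries by original LPN. */
static int cmp_olpn(const void *a, const void *b)
{
    const struct redir_entry *x = a, *y = b;

    if (x->olpn < y->olpn)
        return -1;
    return x->olpn > y->olpn;
}

/* Sort a batch of m entries by oLPN before issuing remap(oLPN[m], tLPN[m], m). */
static void sort_remap_batch(struct redir_entry *batch, size_t m)
{
    qsort(batch, m, sizeof(batch[0]), cmp_olpn);
}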

SLIDE 16

The effect of request reordering

[Charts: map miss ratio and utilization of the parallel units vs. reordering window size (none to 64 MB) for the stg_0 and proj_0 traces; a larger reordering window reduces map misses and improves parallelism]

SLIDE 17

SHRD (Sequentializing in Host, Randomizing in Device)

[Architecture diagram: in the device driver, the Sequentializer, Redirection Table, and Randomizer sit between the file system/applications and the SSD (SHRD-FTL with RWLB blocks, data blocks, map blocks, and a map cache); write() and read() from above become twrite(), remap(), and read() to the device, with map insert and map reclaim operations on the redirection table]

SLIDE 18

SHRD (Sequentializing in Host, Randomizing in Device)


  • Sequentializer
    ▪ Gathers random write requests and logs them sequentially to a temporal location
  • Redirection table
    ▪ Maintains the mapping between temporal addresses and original addresses
  • Randomizer
    ▪ Sends remap commands to the storage device and reclaims the temporal locations

SLIDE 19

SHRD (Sequentializing in Host, Randomizing in Device)

  • SHRD-FTL
    ▪ Receives twrite and remap commands from the host OS
  • twrite
    ▪ A write command carrying two addresses: the temporal and the original address
    ▪ The data must be stored in separate physical blocks, called RWLB blocks
  • remap
    ▪ Restores data written at a temporal location to its original address
    ▪ Changes the mapping table entry from the temporal address to the original address
    ▪ The corresponding RWLB blocks are then converted into data blocks

[Diagram: the SSD (SHRD-FTL) contains RWLB blocks, data blocks, map blocks, and a map cache; the host issues twrite(oLPN, tLPN) and remap(tLPN, oLPN)]

SLIDE 20

Special commands: twrite & remap

  • twrite(oLPN[n], tLPN_start, n, data)
    ▪ Write command that carries both addresses, (tLPN, oLPN)
    ▪ oLPN is stored in the OOB area of the physical page
      • Used for power-off recovery and GC
    ▪ Packed command carrying multiple RW requests
  • remap(oLPN[m], tLPN[m], m)
    ▪ m = number of remapping entries per remap command
      • oLPN-sorted entries → improved spatial locality
    ▪ Changes the mapping table from tLPN to oLPN
      • tLPN : PPN → oLPN : PPN

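To make the device side of remap concrete, here is a minimal FTL-style sketch assuming a flat in-device page-level mapping table; it illustrates the mechanism only and is not the modified Samsung 843 firmware.

#include <stdint.h>
#include <stddef.h>

#define NUM_LPNS    (1u << 22)   /* example: 16 GB of 4 KB pages */
#define INVALID_PPN UINT32_MAX

static uint32_t l2p[NUM_LPNS];   /* page-level mapping table: l2p[LPN] = PPN */

/* remap(oLPN[m], tLPN[m], m): move each PPN from its temporal LPN to its
 * original LPN.  The host sends the entries sorted by oLPN, so consecutive
 * updates touch neighbouring map pages and need few map-cache loads. */
void ftl_remap(const uint32_t *olpn, const uint32_t *tlpn, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        l2p[olpn[i]] = l2p[tlpn[i]];  /* tLPN : PPN  ->  oLPN : PPN */
        l2p[tlpn[i]] = INVALID_PPN;   /* the log-area address becomes free */
    }
    /* Once every page of an RWLB block has been remapped, the block can be
     * treated as a normal data block. */
}

With the mapping from the earlier example, this would move PPN 368 from l2p[1024] to l2p[32], and likewise for tLPNs 1025–1027.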

SLIDE 21

Command Sequence: Sequentializing in Host

[Sequence diagram: the SHRD driver first sends a twrite header carrying the oLPN → tLPN mapping (32 → 1024, 128 → 1026, 765 → 1025, 854 → 1027) and then the twrite data for tLPNs 1024–1027, both as SATA write commands; the SSD stores each oLPN in the spare (OOB) area of the NAND page and returns completions. rand_ptr and seq_ptr mark the randomized and sequentialized ends of the log area]

SLIDE 22

Command Sequence: Randomizing in Device

[Sequence diagram: the SHRD driver issues a remap command (delivered as a SATA write command) carrying the redirection entries; the SSD changes its mapping table accordingly, the remapped part of the log area becomes reclaimable, and rand_ptr advances past it]

SLIDE 23

GC & Power-Off Recovery (POR)

  • Reverse map in the out-of-band (OOB) area
    ▪ The SSD stores the corresponding LPN in the OOB area of each physical page
    ▪ The reverse map is used for GC & recovery
      • GC: update the mapping table entries of the valid pages in the victim block
      • Recovery: rebuild the mapping table entries of the active blocks

[Diagram: physical page layout — the data area plus an OOB area holding the LPN and ECC]

SLIDE 24

GC & Power-Off Recovery (POR)

  • SHRD stores the oLPN in the OOB area of RWLB pages
    ▪ RWLB blocks must be excluded from GC victim selection
      • until all data stored in those blocks has been remapped
    ▪ Non-remapped data is auto-remapped at POR
      • by scanning the OOB areas of the RWLB blocks (see the sketch below)

[Diagram: an RWLB block whose pages carry oLPNs (e.g., 1, 85, 1023, 72) in their OOB areas; pages written by twrite before the remap are auto-remapped by the POR OOB scan, while pages already remapped have logically moved into data blocks and can become GC victims]
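A minimal sketch of that POR pass, assuming hypothetical helpers (declarations only, not real firmware APIs) for iterating non-remapped RWLB pages and reading the oLPN from a page's OOB area:

#include <stdint.h>
#include <stdbool.h>

struct rwlb_page { uint32_t ppn; };

/* Assumed helpers. */
bool next_unremapped_rwlb_page(struct rwlb_page *pg); /* pages written by twrite but not yet remapped */
uint32_t read_oob_lpn(uint32_t ppn);                  /* oLPN that twrite stored in the page's OOB area */
void map_update(uint32_t lpn, uint32_t ppn);          /* install LPN -> PPN in the mapping table */

/* Power-off recovery: auto-remap any logged data whose remap command was
 * lost, by scanning the OOB areas of the RWLB blocks. */
void por_auto_remap(void)
{
    struct rwlb_page pg;

    while (next_unremapped_rwlb_page(&pg))
        map_update(read_oob_lpn(pg.ppn), pg.ppn);
}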

SLIDE 25

Implementation

  • The SHRD device driver is implemented in Linux kernel 3.17.4
    ▪ An additional kernel module at the SCSI device-driver layer
    ▪ Host redirection table: about 1 MB for a 64 MB log area
  • Prototype SSD device
    ▪ Modified firmware of a commercial SATA3 SSD (Samsung 843)
    ▪ Both DFTL and SHRD-FTL are implemented
    ▪ Map cache size is configurable

SLIDE 26

RW Performance According to Map Cache Size

  • Better performance than DFTL
    ▪ By reducing the map loading/unloading overhead
  • SHRD shows steady performance regardless of the cache size

[Charts: random write bandwidth (MB/s) and map loads per page I/O vs. map cache size (128 KB up to a fully loaded mapping table) for DFTL and SHRD; SHRD is about 30x faster than DFTL at the smallest cache sizes. Workload: fio random write over a 32 GB space with 4 KB writes]

SLIDE 27

Performance on Real Benchmarks

  • Better performance on all workloads
  • Small gains on sequential-write or read-dominant workloads
    ▪ Still better than DFTL

[Chart: throughput normalized to DFTL for tpcc, YCSB, postmark, fileserver, and varmail, grouped into small-random-write-dominant, sequential-write, and read/flush-dominant workloads; the CMT caches about 5% of the entire workload space]

SLIDE 28

SHRD Gains on EXT4 vs. F2FS

  • EXT4 shows poor performance on random writes
  • F2FS performance decreases due to SSR at high utilization

[Chart: bandwidth (MB/s) over time (s) for F2FS (SSR) w/ DFTL and EXT4 w/ DFTL; the run moves from low utilization to high utilization]

SLIDE 29

SHRD Gains on EXT4 vs. F2FS

  • SHRD improves both EXT4 and F2FS
    ▪ SHRD improves the bandwidth of aged F2FS
    ▪ With SHRD, EXT4 reaches performance similar to F2FS

[Chart: bandwidth (MB/s) over time (s) for F2FS (SSR) and EXT4, each with DFTL and with SHRD, from low to high utilization; both the EXT4 and the F2FS improvements are visible]

SLIDE 30

SHRD Gains on EXT4 vs. F2FS

  • Sequential read performance of EXT4 is much better
    ▪ The out-of-place scheme of F2FS scatters the data blocks of a file

[Chart: sequential and random read bandwidth (MB/s) for EXT4 w/ SHRD and F2FS (SSR) w/ SHRD]

SLIDE 31

Latency Comparison

  • Remap commands can delay read operations
    ▪ Several remapping entries are batched into a single remap command
    ▪ The number of remapping entries per command can control the maximum latency of the following I/O operations

[Chart: read and write bandwidth (MB/s) for DFTL and SHRD under an fio mixed workload (32 GB area, 4 KB random reads and writes), with the remap period marked]

SLIDE 32

Visualizing the Address Access Pattern: postmark

[Plots: logical address accesses over time for postmark without SHRD and with SHRD; with SHRD, the sequentialized writes and the remap command accesses are clearly visible]

SLIDE 33

Conclusion

  • SHRD is an address-reshaping technique
    ▪ Transforms RW into SW at the block device driver
    ▪ Restores the original addresses without copy operations
    ▪ Solves the POR / GC issues of address remapping
  • SHRD achieves up to 30x better performance with a small map cache
    ▪ Drastically reduces the DRAM requirement

SLIDE 34

Thank you.
