FlashNet: Flash/Network Stack Co-Design - PowerPoint PPT Presentation

FlashNet: Flash/Network Stack Co-Design. Animesh Trivedi, Nikolas Ioannou, Bernard Metzler, Patrick Stuedi, Jonas Pfefferle, Ioannis Koltsidas, Kornilios Kourtis, and Thomas R. Gross. IBM Research and ETH Zurich.


SLIDE 1

FlashNet: Flash/Network Stack Co-Design

Animesh Trivedi, Nikolas Ioannou, Bernard Metzler, Patrick Stuedi, Jonas Pfefferle, Ioannis Koltsidas, Kornilios Kourtis, and Thomas R. Gross

IBM Research and ETH Zurich, Switzerland

SLIDE 2

Modern Distributed Systems

  • data intensive
  • run on 100-1000s of servers
  • performance depends upon both network and storage

SLIDE 3

Modern Distributed Systems

  • performance depends upon both network and storage

SLIDE 4

Modern Distributed Systems

  • performance depends upon both network and storage

  • StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs, USENIX'16
  • Network Stack Specialization for Performance, SIGCOMM'14
  • mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, NSDI'14
  • MegaPipe: A New Programming Interface for Scalable Network I/O, OSDI'12

SLIDE 5

Modern Distributed Systems

  • performance depends upon both network and storage

  • StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs, USENIX'16
  • Network Stack Specialization for Performance, SIGCOMM'14
  • mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, NSDI'14
  • MegaPipe: A New Programming Interface for Scalable Network I/O, OSDI'12
  • NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs, HotStorage'16
  • OS I/O Path Optimizations for Flash Solid-state Drives, USENIX'14
  • Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems, SYSTOR'13
  • When Poll is Better Than Interrupt, FAST'12
SLIDE 6

Modern Distributed Systems

  • performance depends upon both network and storage

  • StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs, USENIX'16
  • Network Stack Specialization for Performance, SIGCOMM'14
  • mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, NSDI'14
  • MegaPipe: A New Programming Interface for Scalable Network I/O, OSDI'12
  • NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs, HotStorage'16
  • OS I/O Path Optimizations for Flash Solid-state Drives, USENIX'14
  • Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems, SYSTOR'13
  • When Poll is Better Than Interrupt, FAST'12
SLIDE 7

Modern Distributed Systems

  • performance depends upon both network and storage

  • StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs, USENIX'16
  • Network Stack Specialization for Performance, SIGCOMM'14
  • mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems, NSDI'14
  • MegaPipe: A New Programming Interface for Scalable Network I/O, OSDI'12
  • NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs, HotStorage'16
  • OS I/O Path Optimizations for Flash Solid-state Drives, USENIX'14
  • Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems, SYSTOR'13
  • When Poll is Better Than Interrupt, FAST'12
SLIDE 8

The Cost of the Gap

[Bar chart: K IOPS for spec. block IO, netperf, iSCSI, KV, NFS, and HDFS (scale 200-1400 K IOPS)]

SLIDE 9

The Cost of the Gap

[Bar chart: K IOPS for spec. block IO, netperf, iSCSI, KV, NFS, and HDFS (scale 200-1400 K IOPS)]

SLIDE 10

The Cost of the Gap

[Bar chart: K IOPS for spec. block IO, netperf, iSCSI, KV, NFS, and HDFS (scale 200-1400 K IOPS)]

SLIDE 11

The Reason for the Gap

[Diagram: a client and a flash storage server connected over the network]

SLIDE 12

The Reason for the Gap

[Diagram: (1) the client sends a request, (2) the flash storage server processes it, (3) the server returns a response]

SLIDE 13

The Reason for the Gap

performance = network IO + server time + storage IO

[Diagram: (1) request from the client, (2) request processing at the flash storage server, (3) response]

SLIDE 14

The Reason for the Gap

performance = network IO + server time + storage IO

[Diagram: (1) request from the client, (2) request processing at the flash storage server, (3) response]

Server-time overheads: application involvement, scheduling, fs lookups and overheads, ...
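As a rough illustration of this decomposition (the microsecond figures below are assumptions for the example, not measurements from the talk), the per-request budget adds up as:

```latex
% Illustrative budget only; 10/30/80 microseconds are assumed values.
\[
  T_{\text{request}} = T_{\text{network}} + T_{\text{server}} + T_{\text{storage}}
                     \approx 10\,\mu\text{s} + 30\,\mu\text{s} + 80\,\mu\text{s}
                     = 120\,\mu\text{s}
  \quad\Rightarrow\quad
  \frac{1}{120\,\mu\text{s}} \approx 8.3\ \text{kIOPS per outstanding request}
\]
```

Under such a budget, every microsecond shaved off server time or extra data copies translates directly into IOPS.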

SLIDE 15

A Detailed Look: send

  • 1. TCP/IP processing

[Diagram: the request's path through userspace and the kernel]

SLIDE 16

A Detailed Look: send

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request

SLIDE 17

A Detailed Look: send

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O

SLIDE 18

A Detailed Look: send

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O
  • 6. flash I/O completion
  • 7. response transmission

SLIDE 19

A Detailed Look: send

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O
  • 6. flash I/O completion
  • 7. response transmission
  • 8. send
  • 9. TX done
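A minimal sketch of this conventional path, assuming a plain TCP server that answers a request by reading a file and sending it back; the file path, sizes, and abbreviated error handling are illustrative, and this is not FlashNet code:

```c
/* Conventional send path (steps 1-9 above), sketched as a per-connection
 * handler. File name and request size are placeholders. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define REQ_SIZE 4096  /* hypothetical 4 KiB request/response unit */

static void serve_one_request(int conn_fd)
{
    char req[64];
    if (recv(conn_fd, req, sizeof(req), 0) <= 0)        /* steps 1-3: TCP/IP + receive processing, app gets the request */
        return;

    int file_fd = open("/data/object.bin", O_RDONLY);   /* hypothetical object file */
    if (file_fd < 0)
        return;

    char *buf = malloc(REQ_SIZE);
    ssize_t n = read(file_fd, buf, REQ_SIZE);            /* steps 4-6: fs translation, block I/O, flash completion */
    if (n > 0)
        send(conn_fd, buf, (size_t)n, 0);                /* steps 7-9: response transmission, send, TX done */

    free(buf);
    close(file_fd);
}
```

Note that the data is copied from the page cache into the userspace buffer and back into socket buffers, and the application is scheduled on every request.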

SLIDE 20

A Detailed Look: sendfile

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O
  • 6. flash I/O completion
  • 7. response transmission
  • 8. send
  • 9. TX done

SLIDE 21

A Detailed Look: sendfile

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O
  • 6. flash I/O completion
  • 7. response transmission
  • 8. send
  • 9. TX done
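The same handler written with sendfile() avoids the userspace data copy, but every request still crosses the socket, file-system, and block layers and still involves the application; a sketch with the same placeholder file name:

```c
/* sendfile() variant of the handler above: the kernel moves data from the
 * page cache to the socket without a userspace buffer. Path and size are
 * illustrative. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

static void serve_one_request_sendfile(int conn_fd)
{
    char req[64];
    if (recv(conn_fd, req, sizeof(req), 0) <= 0)         /* application still receives each request */
        return;

    int file_fd = open("/data/object.bin", O_RDONLY);    /* hypothetical object file */
    if (file_fd < 0)
        return;

    off_t off = 0;
    sendfile(conn_fd, file_fd, &off, 4096);              /* page cache -> socket, no userspace copy */
    close(file_fd);
}
```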

SLIDE 22

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O

SLIDE 23

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O

eliminate application involvement

SLIDE 24

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O

eliminate application involvement
reduce file system overheads

SLIDE 25

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. receive processing
  • 3. receive request
  • 4. fs translation
  • 5. block I/O

eliminate application involvement
reduce file system overheads
enable direct network and storage interaction

SLIDE 26

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. RDMA processing
  • 4. fs translation
  • 5. block I/O

eliminate application involvement
reduce file system overheads
enable direct network and storage interaction

SLIDE 27

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. RDMA processing
  • simple fs layout
  • 5. block I/O

eliminate application involvement
reduce file system overheads
enable direct network and storage interaction

SLIDE 28

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. RDMA processing
  • simple fs layout
  • 5. block I/O

eliminate application involvement
reduce file system overheads
enable direct network and storage interaction
use VM

SLIDE 29

The FlashNet Approach

  • 1. TCP/IP processing
  • 2. RDMA processing, VA to LBA translation
  • 3. block I/O
  • flash I/O completion
  • response transmission and TX done
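On the client side this path is driven by a one-sided RDMA operation. The sketch below posts an RDMA READ with standard libibverbs calls; connection setup and the exchange of the remote STag/address are assumed to have happened already, and all names are placeholders rather than FlashNet's API:

```c
/* One-sided RDMA READ from the client: the server CPU is not on the data
 * path, and in FlashNet's design the target STag can name a flash-resident
 * file region. QP setup and STag exchange are assumed. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_read(struct ibv_qp *qp, void *local_buf, struct ibv_mr *local_mr,
                   uint64_t remote_addr, uint32_t remote_stag, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,   /* where the data lands on the client */
        .length = len,
        .lkey   = local_mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided read */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;        /* server-side virtual address */
    wr.wr.rdma.rkey        = remote_stag;        /* STag advertised by the server */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```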

SLIDE 30

FlashNet: A Co-Designed Network and Storage Stack

flash controller

  • flash virtualization
  • I/O management
  • ...

64-bit LBA space
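For reference, addressing that LBA space from software is plain offset arithmetic; a sketch of reading one 4 KiB block from a raw NVMe block device (the device path and block size are assumptions, and FlashNet's flash controller additionally virtualizes this LBA space underneath):

```c
/* Read a single 4 KiB block at a given LBA from a raw flash block device. */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 4096ULL

int read_lba(const char *dev, uint64_t lba, void **out)
{
    int fd = open(dev, O_RDONLY | O_DIRECT);           /* e.g. "/dev/nvme0n1" (hypothetical) */
    if (fd < 0)
        return -1;

    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) {   /* O_DIRECT needs aligned buffers */
        close(fd);
        return -1;
    }

    ssize_t n = pread(fd, buf, BLOCK_SIZE, (off_t)(lba * BLOCK_SIZE));
    close(fd);
    if (n != (ssize_t)BLOCK_SIZE) {
        free(buf);
        return -1;
    }
    *out = buf;
    return 0;
}
```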

SLIDE 31

FlashNet: A Co-Designed Network and Storage Stack

flash controller + ContigFS

  • contiguous file allocation
  • supporting mmap & local file I/O
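Because files are allocated contiguously, ordinary POSIX mmap is enough to expose a file as one flat buffer that can later be registered for RDMA. A sketch with a hypothetical mount point and file name; these are standard calls used against ContigFS, not a ContigFS-specific API:

```c
/* Map a (contiguous) file so it can be used as an RDMA buffer. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_contigfs_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDWR);                 /* e.g. "/mnt/contigfs/table.dat" (hypothetical) */
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return NULL;
    }

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);        /* one contiguous file extent behind this mapping */
    close(fd);                                   /* mapping stays valid after close */
    if (addr == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return addr;                                 /* may then be passed to ibv_reg_mr() to obtain an STag */
}
```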

SLIDE 32

FlashNet: A Co-Designed Network and Storage Stack

flash controller + ContigFS + RDMA controller

  • lazy RDMA pinning
  • resolving flash & file addresses
SLIDE 33

FlashNet: A Co-Designed Network and Storage Stack

flash controller + ContigFS + RDMA controller = the FlashNet I/O stack

SLIDE 34

FlashNet: A Co-Designed Network and Storage Stack

[Diagram: server application with the RDMA controller, ContigFS, and the flash controller; the RDMA network control setup is expanded to storage]

Address translation chain: file → virtual address → STag → LBA → PBA
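The sketch below spells out that translation chain with purely illustrative types; none of these structures or field names are taken from the FlashNet sources, they only make explicit the lookups the RDMA and flash controllers would have to perform:

```c
/* Illustrative translation chain: file -> virtual address -> STag -> LBA -> PBA. */
#include <stdint.h>

struct stag_entry {            /* set up by the RDMA controller at registration time */
    uint32_t stag;             /* key advertised to clients */
    uint64_t base_va;          /* start of the mmap()ed ContigFS file */
    uint64_t file_lba_base;    /* first LBA of the contiguous file extent */
    uint64_t length;           /* registration length in bytes */
};

/* Resolve a virtual address inside a registration to a logical block address. */
static inline uint64_t va_to_lba(const struct stag_entry *e, uint64_t va)
{
    return e->file_lba_base + (va - e->base_va) / 4096;
}

/* The flash controller's job: map the logical block to a physical flash block. */
static inline uint64_t lba_to_pba(const uint64_t *l2p_table, uint64_t lba)
{
    return l2p_table[lba];     /* e.g. a flat logical-to-physical mapping table */
}
```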

SLIDE 35

FlashNet: A Co-Designed Network and Storage Stack

[Diagram: same setup as the previous slide, now showing the data path from a flash device to a client buffer]

Address translation chain: file → virtual address → STag → LBA → PBA

SLIDE 36

FlashNet: A Co-Designed Network and Storage Stack

[Diagram: data path from a flash device to a client buffer; components shown include the server application, ContigFS, the flash and RDMA controllers, the RNIC, iSCSI, NFS, SKBs, and the client]

SLIDE 37

Performance Evaluation

How efficient is FlashNet's I/O path? Does it help with applications? ...more in the paper

9-machine cluster testbed:
  • CPU: dual-socket E5-2690, 2.9 GHz, 16 cores
  • DRAM: 256 GB, DDR3 1600 MHz
  • NIC: 40 Gbit/s Ethernet
  • 3x NVMe flash: 6.6 GB/sec (read), 2.7 GB/sec (write), peak 4 kB read IOPS: 1.3 M

SLIDE 38

Performance - IOPS Efficiency

[Plot: kIOPS vs. number of clients (1-256), scale 200-1400 kIOPS]

SLIDE 39

Performance - IOPS Efficiency

[Plot: kIOPS vs. number of clients (1-256) for socket/file and FlashNet]

SLIDE 40

Performance - IOPS Efficiency

[Plot: kIOPS vs. number of clients (1-256) for socket/file and FlashNet; 38.6% improvement, network saturated]

SLIDE 41

Application-level Performance: KV

[Plot: k IOPS (50-500) for Aerospike, socket/file, and FlashNet]

SLIDE 42

Application-level Performance: KV

[Plot: put and get k IOPS (50-500) for Aerospike, socket/file, and FlashNet]

SLIDE 43

Application-level Performance: KV

[Plot: put and get k IOPS for Aerospike, socket/file, and FlashNet; 1.8x and 3.2x improvements]

SLIDE 44

Conclusion

  • Identified performance issues with networked flash
  • Applied RDMA principles and concepts by extending the path-separation idea to a flash controller and a file system
  • FlashNet is a concrete implementation of this idea; demonstrated its capabilities in micro-benchmarks and applications
  • Excited to explore new use cases for FlashNet

SLIDE 45

Thank you

SLIDE 46

Unified I/O Life Cycle

SLIDE 47

CPU Cycles Breakdown

CPU cycles by component       Socket/file   FlashNet
network                          19.3%        20.6%
storage                           7.3%         0.8%
device drivers                    6.7%         6.4%
scheduling                       15.8%         8.4%
kernel                           40.1%        46.7%
request processing                4.7%        11.7%
misc                              6.1%         5.4%