Octopus: an RDMA-enabled Distributed Persistent Memory File System - PowerPoint PPT Presentation



SLIDE 1

Octopus: an RDMA-enabled Distributed Persistent Memory File System

Youyou Lu¹, Jiwu Shu¹, Youmin Chen¹, Tao Li²

¹Tsinghua University
²University of Florida

SLIDE 2

Outline

  • Background and Motivation

  • Octopus Design
  • Evaluation
  • Conclusion

SLIDE 3

NVMM & RDMA

  • NVMM (PCM, ReRAM, etc.)
      • Data persistency
      • Byte-addressable
      • Low latency
  • RDMA
      • Remote direct access
      • Bypass remote kernel
      • Low latency and high throughput

[Figure: client and server each register memory with their HCAs; data moves directly between the registered memory regions, bypassing the remote CPU.]

SLIDE 4

Modular-Designed Distributed File System

  • DiskGluster
      • Disk for data storage
      • GigE for communication
  • MemGluster
      • Memory for data storage
      • RDMA for communication

Latency (1 KB write + sync):
  • DiskGluster: 18 ms overall, dominated by the HDD (98%); software and network take the remaining 2%
  • MemGluster: 324 us overall, with software consuming 99.7%; the RDMA network is no longer the bottleneck
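Read this way, the software cost is roughly the same in both systems in absolute terms; it merely flips from negligible to dominant once the hardware gets fast. A quick sanity check with the slide's numbers (the 2%/99.7% shares are my reading of the chart):

```python
# Absolute software time implied by the latency shares on the slide
# (assumption: 2% is the non-HDD share for DiskGluster, 99.7% the
# software share for MemGluster).
hdd_total_us = 18_000   # 18 ms, DiskGluster, 1 KB write + sync
mem_total_us = 324      # 324 us, MemGluster, same workload

hdd_software_us = hdd_total_us * 0.02
mem_software_us = mem_total_us * 0.997

print(f"DiskGluster software+network: {hdd_software_us:.0f} us")  # 360 us
print(f"MemGluster software:          {mem_software_us:.0f} us")  # 323 us
```

The ~323 us software share is consistent with the 326 us figure quoted later on slide 17.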

SLIDE 5

Modular-Designed Distributed File System

  • DiskGluster
      • Disk for data storage
      • GigE for communication
  • MemGluster
      • Memory for data storage
      • RDMA for communication

Bandwidth (1 MB write):
  • DiskGluster: raw HDD 88 MB/s, GigE network 118 MB/s; the file system delivers 83 MB/s, i.e. 94% of the raw storage bandwidth
  • MemGluster: raw memory 6509 MB/s, RDMA 6350 MB/s; the file system delivers only 1779 MB/s, i.e. 27%
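The utilization percentages are just the ratio of delivered to raw bandwidth; a quick check with the slide's numbers:

```python
# Bandwidth utilization = file-system bandwidth / raw storage bandwidth
hdd_raw, hdd_fs = 88, 83        # MB/s, DiskGluster on HDD
mem_raw, mem_fs = 6509, 1779    # MB/s, MemGluster on memory

print(f"DiskGluster: {hdd_fs / hdd_raw:.0%}")  # 94%
print(f"MemGluster:  {mem_fs / mem_raw:.0%}")  # 27%
```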

SLIDE 6

RDMA-enabled Distributed File System

  • More than fast hardware
      • It is suboptimal to simply replace the network/storage module
  • Opportunities and Challenges
      • NVM
          • Byte-addressability
          • Significant overhead of data copies
      • RDMA
          • Flexible programming verbs (message/memory semantics)
          • Imbalanced CPU processing capacity vs. network I/Os

SLIDE 7

Outline

  • Background and Motivation
  • Octopus Design

  • Evaluation
  • Conclusion

SLIDE 8

RDMA-enabled Distributed File System

  • It is necessary to rethink the design of DFS over NVM & RDMA

Opportunity → Approach:
  • Byte-addressability of NVM, one-sided RDMA verbs → shared data management
  • CPU is the new bottleneck → new data flow strategies
  • Flexible RDMA verbs → efficient RPC primitive

SLIDE 9

Octopus Architecture

[Figure: nodes (N1, N2, N3, …) each contribute local NVMM through their HCAs to a shared persistent memory pool; clients A and B issue metadata operations (e.g. create("/home/cym")) via Self-Identified RPC and data operations (e.g. read("/home/lyy")) via RDMA-based data I/O.]

Octopus performs remote direct data access just as an octopus uses its eight legs.

SLIDE 10

1. Shared Persistent Memory Pool

  • Existing DFSs
      • Redundant data copy

[Figure: GlusterFS data path between client and server. A write traverses client user-space buffer → client mbuf → NIC → NIC → server mbuf → server user-space buffer → page cache → FS image: 7 copies.]

SLIDE 11

1. Shared Persistent Memory Pool

  • Existing DFSs
      • Redundant data copy

[Figure: GlusterFS + DAX data path. DAX eliminates the page-cache copy, but a write still traverses client user-space buffer → client mbuf → NIC → NIC → server mbuf → server user-space buffer → FS image: 6 copies.]

SLIDE 12
1. Shared Persistent Memory Pool

  • Existing DFSs
      • Redundant data copy
  • Octopus with SPMP
      • Introduces the shared persistent memory pool
      • Global view of data layout

[Figure: Octopus data path. With the SPMP, a write traverses client user-space buffer → client message pool → NIC → NIC → FS image in the shared pool: 4 copies.]
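The copy counts on slides 10–12 can be tallied by listing each buffer the payload lands in, with one copy per hop between adjacent buffers. A toy sketch (stage names paraphrase the figures; the staging inside the NICs is simplified):

```python
# Each path is the ordered list of buffers a written payload lands in;
# copies = hops between adjacent buffers.
paths = {
    "GlusterFS": [
        "client user buffer", "client mbuf", "client NIC",
        "server NIC", "server mbuf", "server user buffer",
        "page cache", "FS image",
    ],
    "GlusterFS + DAX": [
        "client user buffer", "client mbuf", "client NIC",
        "server NIC", "server mbuf", "server user buffer",
        "FS image",                        # DAX drops the page-cache copy
    ],
    "Octopus (SPMP)": [
        "client user buffer", "client message pool", "client NIC",
        "server NIC", "FS image in SPMP",  # NIC DMAs straight into the pool
    ],
}
for name, stages in paths.items():
    print(f"{name}: {len(stages) - 1} copies")  # 7, 6, 4 respectively
```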

slide-13
SLIDE 13

2. Client-Active Data I/O

  • Server-Active
      • Server threads process the data I/O
      • Works well for slow Ethernet
      • CPUs can easily become the bottleneck with fast hardware
  • Client-Active
      • Let clients read/write data directly from/to the SPMP

[Figure: server-side timeline (NIC, MEM, CPU) serving clients C1 and C2. Server-active: the server looks up the file data and sends the data itself. Client-active: the server looks up the file data but sends back only its address.]
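The difference between the two flows can be sketched as a toy model: in the server-active case the server CPU copies and ships the data, while in the client-active case the lookup RPC returns only an (offset, length) pair and the client pulls the bytes itself. Here a plain bytearray stands in for the SPMP and a function stands in for a one-sided RDMA read; all names are hypothetical:

```python
# Toy model of server-active vs. client-active reads (all names
# hypothetical; a real implementation would use RDMA verbs).
POOL = bytearray(1024)   # stands in for the shared persistent memory pool
FILES = {}               # filename -> (offset, length) in POOL

def server_put(name, data, offset):
    POOL[offset:offset + len(data)] = data
    FILES[name] = (offset, len(data))

# Server-active read: the server CPU locates the data AND ships it back.
def server_active_read(name):
    off, length = FILES[name]
    return bytes(POOL[off:off + length])   # server copies and sends the data

# Client-active read: the server only returns the address; the client
# then fetches the bytes itself with a one-sided RDMA read (simulated).
def lookup_rpc(name):
    return FILES[name]                     # metadata only: (offset, length)

def rdma_read(offset, length):
    return bytes(POOL[offset:offset + length])  # bypasses the server CPU

server_put("/home/lyy", b"hello octopus", offset=64)
off, length = lookup_rpc("/home/lyy")
assert rdma_read(off, length) == server_active_read("/home/lyy")
```

Either flow returns the same bytes; the point is that the client-active path costs the server CPU only a small metadata reply per request.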

SLIDE 14

3. Self-Identified Metadata RPC

  • Message-based RPC
      • Easy to implement, lower throughput
      • DaRPC [SoCC'14], FaSST [OSDI'16]
  • Memory-based RPC
      • CPU cores scan the message buffer
      • FaRM [NSDI'14]
  • Using rdma_write_with_imm
      • Scan by polling
      • Imm data for self-identification

[Figure: two designs of the server-side message pool polled by Thread1 … Threadn via the HCA. Without imm data, threads scan the whole pool for new messages; with write_with_imm, each message carries an ID that directly identifies its slot.]
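The write_with_imm idea can be sketched in a few lines: the sender places its request in a known message-pool slot, and the 32-bit immediate value that arrives with the completion names that slot, so the polling thread reads exactly one slot instead of scanning the pool. A toy simulation (plain Python standing in for ibverbs; all names hypothetical):

```python
from collections import deque

SLOTS = 8
message_pool = [None] * SLOTS   # one slot per client connection
completion_queue = deque()      # imm values delivered by the HCA (simulated)

# Sender side: RDMA_WRITE_WITH_IMM puts the payload in a known slot and
# delivers the slot id as 32-bit immediate data.
def rdma_write_with_imm(slot_id, payload):
    message_pool[slot_id] = payload
    completion_queue.append(slot_id)   # imm data rides with the completion

# Server side: poll completions; the imm value self-identifies the slot,
# so we read exactly one slot instead of scanning all SLOTS of them.
def poll_rpc():
    slot_id = completion_queue.popleft()
    msg = message_pool[slot_id]
    message_pool[slot_id] = None
    return slot_id, msg

rdma_write_with_imm(3, b'create("/home/cym")')
slot, msg = poll_rpc()
assert slot == 3 and msg == b'create("/home/cym")'
```

Compared with pure memory-based RPC, this keeps the one-sided write's zero-copy delivery while restoring an event the receiver can poll for cheaply.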

SLIDE 15

Outline

  • Background and Motivation
  • Octopus Design
  • Evaluation
  • Conclusion

SLIDE 16

Evaluation Setup

  • Evaluation Platform
  • Connected with Mellanox SX1012 switch
  • Evaluated Distributed File Systems
  • memGluster, runs on memory, with RDMA connection
  • NVFS [OSU], Crail [IBM], optimized to run on RDMA

  • memHDFS, Alluxio, for big data comparison

  Cluster A: 5 nodes, 2 × E5-2680 CPUs, 384 GB memory, ConnectX-3 FDR
  Cluster B: 7 nodes, E5-2620 CPU, 16 GB memory, ConnectX-3 FDR

SLIDE 17

Overall Efficiency

[Charts: latency breakdown of getattr and readdir (software / mem / network shares, 75–100%), and bandwidth utilization of write and read (software / mem / network, up to ~7000 MB/s).]

  • Software latency is reduced from 326 us to 6 us

  • Achieves read/write bandwidth that approaches the raw storage and network bandwidth

SLIDE 18

Metadata Operation Performance
  • Octopus provides metadata IOPS in the order of 10^5 ~ 10^6
  • Octopus scales linearly

[Charts: MKNOD, GETATTR, and RMNOD throughput (log scale) vs. number of clients (1–5) for glusterfs, nvfs, crail, dmfs, and crail-poll.]

SLIDE 19

Big Data Evaluation
  • Octopus also provides better performance for big data applications than existing file systems.

[Charts: TestDFSIO write/read bandwidth (MB/s) and normalized execution time of Teragen and Wordcount, comparing memHDFS, Alluxio, NVFS, Crail, and Octopus.]

SLIDE 20

Conclusion
  • It is necessary to rethink the DFS designs over emerging H/Ws
  • Octopus's internal mechanisms
      • Simplifies the data management layer by reducing data copies
      • Rebalances network and server loads with Client-Active I/O
      • Redesigns the metadata RPC and distributed transaction with RDMA primitives

  • Evaluations show that Octopus significantly outperforms existing file systems

SLIDE 21

Q&A
Thanks!