

SLIDE 1

Jae Woo Choi, Dong In Shin, Young Jin Yu, Hyunsang Eom, Heon Young Yeom
Seoul National Univ. / TAEJIN INFOTECH

SLIDE 2

[Diagram: a host computer (initiator) connected over a high-speed network (InfiniBand) to a storage server (target) exposing virtual storage backed by an HDD or a fast storage device; a bottleneck is marked on the I/O path.]

SAN with high-speed network + fast storage = fast SAN environment???

Performance degradation: about a 65% reduction.

SLIDE 3

 Found performance degradation in an existing SAN solution with fast storage

 Proposed three optimizations for a fast SAN solution

  • Mitigate software overheads in SAN I/O path
  • Increase parallelism on Target side
  • Temporal merge for RDMA data transfer

 Implemented the new SAN solution as a prototype

SLIDE 4

 DRAM-SSD (provided by TAEJIN Infotech)

  • 7 usec to read or write a 4KB page
  • Peak device throughput: 700 MB/s
  • DDR2 64 GB, PCI-Express type
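
As a rough consistency check (my arithmetic, not from the slides): at 7 usec per 4KB page, a single fully serialized request stream moves about 4096 B / 7 usec ≈ 585 MB/s, so the 700 MB/s peak presumably requires larger transfers or overlapping requests.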
SLIDE 5

 FIO micro benchmark, 16 threads

[Chart: throughput (MB/s, 100-800) for wr / r_wr / rd / r_rd at 4K and 1M request sizes, under buffered and direct I/O; the device delivers uniform throughput across all cases.]

SLIDE 6

 Generic SCSI Target Subsystem for Linux (SCST)

  • Open-source program for implementing a SAN environment
  • Supports Ethernet, FC, InfiniBand, and so on
  • Uses SRP (SCSI RDMA Protocol) for InfiniBand
SLIDE 7

SPEC              TARGET                          INITIATOR
CPU               Intel Xeon E5630 (8 core)       Intel Xeon E5630 (8 core)
Memory            16GB                            8GB
InfiniBand card   MHQH19B-XTC 1 port (40Gb/s)     MHQH19B-XTC 1 port (40Gb/s)

  • Device: DRAM-SSD (64GB)
  • Workload size: 16 threads x 3GB (48GB total)
  • Request size: 4K / 1M
  • I/O type: Buffered/Direct, Sequential/Random, Read/Write
  • Benchmark tool: FIO micro benchmark
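
The slides give the parameters but not the actual job file; a fio job matching one of these runs might look like the sketch below. Option names follow fio's standard job-file syntax, but the exact options used in the paper are an assumption, as is the device node.

    ; Hypothetical fio job approximating the slides' workload:
    ; 16 threads x 3GB each, 4KB random writes, direct I/O.
    [global]
    ioengine=libaio
    ; set direct=0 for the buffered-I/O runs
    direct=1
    ; bs=1m for the large-request runs
    bs=4k
    ; also used on the slides: write, read, randread
    rw=randwrite
    size=3g
    numjobs=16
    group_reporting

    [dram-ssd]
    ; device node is an assumption, not from the slides
    filename=/dev/sdb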
SLIDE 8

 I/O Scheduler policy

  • CFQ -> NOOP
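
On single-queue block-layer kernels of this era, the scheduler can be switched per device at runtime via sysfs, e.g. `echo noop > /sys/block/<dev>/queue/scheduler`; the device name is a placeholder.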

[Charts: throughput (MB/s, 100-800) for SRP (CFQ), SRP (NOOP), and local across wr / r_wr / rd / r_rd. Left: small (4K) requests, buffered and direct; right: large (1M) requests, buffered and direct. Large requests achieve reasonable throughput, helped by request merging and read-ahead, while small requests show a clear gap versus local.]

SLIDE 9

[Diagram: the SAN I/O path is too long; the elevator (request merging) and the plug-unplug mechanism cause some delays.]

SLIDE 10

 Remove software overheads in the I/O path

  • Bypass the SCSI layer
  • Discard the existing I/O scheduler
    ▪ Remove elevator merge and plug-unplug
    ▪ Maintain a wait-queue based on the bio structure
    ▪ Very simple & fast I/O scheduler (see the sketch below)

 BRP (Block RDMA Protocol)

  • Commands are also based on the bio structure, not the SCSI command
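
A minimal sketch of what such a bio-based wait-queue could look like in Linux kernel C; the names (brp_queue, brp_submit, brp_dispatch) are illustrative, not the authors' actual code:

    /* Hypothetical sketch of the "very simple & fast" scheduler above:
     * a FIFO wait-queue of struct bio, with no elevator merge and no
     * plug-unplug. Illustrative only; not the paper's implementation. */
    #include <linux/bio.h>
    #include <linux/spinlock.h>

    struct brp_queue {
        struct bio_list pending;   /* bios kept in arrival order */
        spinlock_t lock;
    };

    static void brp_queue_init(struct brp_queue *q)
    {
        bio_list_init(&q->pending);
        spin_lock_init(&q->lock);
    }

    /* Enqueue a bio as-is: no sorting, no merging, no plugging. */
    static void brp_submit(struct brp_queue *q, struct bio *bio)
    {
        unsigned long flags;

        spin_lock_irqsave(&q->lock, flags);
        bio_list_add(&q->pending, bio);   /* append at tail */
        spin_unlock_irqrestore(&q->lock, flags);
    }

    /* Pop the oldest bio for dispatch; returns NULL when empty. */
    static struct bio *brp_dispatch(struct brp_queue *q)
    {
        struct bio *bio;
        unsigned long flags;

        spin_lock_irqsave(&q->lock, flags);
        bio = bio_list_pop(&q->pending);
        spin_unlock_irqrestore(&q->lock, flags);
        return bio;
    }

Because BRP commands carry bio-level information end-to-end, the target can feed dispatched bios straight to the device without re-entering a SCSI stack.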

SLIDE 11
SLIDE 12

[Diagram: the event handler analyzes events and executes the proper operations for each I/O request: jobs for executing RDMA data transfers, jobs for sending responses to the initiator, jobs for terminating I/O requests, and jobs for device I/O.]

All these operations are independent of each other, so they can be processed in parallel.

SLIDE 13

[Diagram: a thread pool and the event handler, which analyzes events and executes the proper operations; the jobs (RDMA data transfer, sending responses to the initiator, terminating I/O requests, device I/O) are executed serially.]

SLIDE 14

 Increase parallelism on the target side

  • All the procedures for I/O requests are processed in the thread pool
    ▪ Induces multiple concurrent device I/Os (see the sketch below)

[Diagram: the thread pool issues multiple I/Os to storage, exploiting the high bandwidth of the fast device.]
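
A minimal user-space sketch of the thread-pool design, in C with pthreads; the four job kinds mirror the job types from the previous slides, and all names here are illustrative rather than taken from the paper:

    #include <pthread.h>
    #include <stdlib.h>

    /* The four independent job types the event handler produces. */
    enum job_kind { JOB_RDMA_XFER, JOB_SEND_RESP, JOB_TERMINATE, JOB_DEVICE_IO };

    struct job {
        enum job_kind kind;        /* set by the event handler */
        void (*run)(void *arg);    /* the operation to execute */
        void *arg;
        struct job *next;
    };

    static struct job *head, *tail;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

    /* Event-handler side: classify the event, enqueue the matching job. */
    void enqueue_job(struct job *j)
    {
        pthread_mutex_lock(&lock);
        j->next = NULL;
        if (tail)
            tail->next = j;
        else
            head = j;
        tail = j;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }

    /* Worker side: each pool thread pulls jobs and runs them, so
     * independent jobs (including device I/Os) proceed in parallel. */
    void *worker(void *unused)
    {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!head)
                pthread_cond_wait(&nonempty, &lock);
            struct job *j = head;
            head = j->next;
            if (!head)
                tail = NULL;
            pthread_mutex_unlock(&lock);
            j->run(j->arg);
            free(j);
        }
        return NULL;
    }

With several workers pulling from the queue, multiple device I/Os stay outstanding at once, which is what lets the target exploit the bandwidth of a fast device.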

SLIDE 15

[Diagram: baseline flow: for each command, the initiator and target go through pre-processing, an RDMA data transfer, post-processing, and a completion. With temporal merge, several commands are combined into a single jumbo command, so one pre-processing / RDMA / post-processing cycle covers multiple requests.]

SLIDE 16

 RDMA data transfer with temporal merge

  • Merges small-sized data regardless of its spatial contiguity
  • Enabled only in intensive-I/O situations
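
A minimal sketch of the staging-buffer half of temporal merge, in user-space C. Buffer sizes, names, and the flush policy are assumptions; a real target also needs bookkeeping to complete each merged request individually:

    #include <stddef.h>
    #include <string.h>

    #define MERGE_BUF_SIZE (256 * 1024)   /* assumed jumbo-command size */
    #define MAX_ENTRIES 64

    /* Records where each merged fragment really belongs; the target
     * addresses need not be contiguous: the merge is temporal only. */
    struct merge_entry {
        unsigned long long lba;   /* destination block address */
        size_t len;
        size_t off;               /* offset inside the staging buffer */
    };

    struct temporal_merge {
        char buf[MERGE_BUF_SIZE]; /* payloads packed back-to-back */
        struct merge_entry ent[MAX_ENTRIES];
        size_t used, count;
    };

    /* Append one small request's payload. Returns 0 if merged; -1 means
     * the jumbo command is full: ship buf with one RDMA transfer,
     * reset, and retry. */
    int tm_append(struct temporal_merge *tm, unsigned long long lba,
                  const void *data, size_t len)
    {
        if (tm->count == MAX_ENTRIES || tm->used + len > MERGE_BUF_SIZE)
            return -1;
        memcpy(tm->buf + tm->used, data, len);
        tm->ent[tm->count] = (struct merge_entry){ lba, len, tm->used };
        tm->count++;
        tm->used += len;
        return 0;
    }

Consistent with the bullet above, such merging would be turned on only when many small requests are pending; at low load, the extra copy and batching delay would only add latency.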
SLIDE 17

 BRP-1

  • Removes software overheads in the I/O path

 BRP-2

  • BRP-1 + increased parallelism

 BRP-3

  • BRP-2 + temporal merge in intensive-I/O situations
  • "BRP" by itself means BRP-3
SLIDE 18

 Latency comparison

  • Direct I/O, 4KB
  • dd test

I/O Type   SRP (usec)   BRP (usec)   Latency Reduction
Read       63 (51)      43 (31)      31.7% (39.2%)
Write      75 (62)      54 (41)      28.0% (33.8%)

( ): values excluding device I/O latency (read: 12 usec, write: 13 usec)
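
The parenthesized percentages follow from reduction = (SRP - BRP) / SRP on the device-latency-excluded values: reads give (63 - 43) / 63 ≈ 31.7% and (51 - 31) / 51 ≈ 39.2%; writes give (75 - 54) / 75 = 28% and (62 - 41) / 62 ≈ 33.8%.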

SLIDE 19

[Chart: throughput (MB/s, 100-800) for SRP (NOOP), BRP, and local across wr / r_wr / rd / r_rd, buffered and direct I/O.]

SLIDE 20

[Chart: throughput (MB/s, 100-800) for SRP (NOOP), BRP, and local across wr / r_wr / rd / r_rd, buffered and direct I/O.]

SLIDE 21

[Chart: throughput (MB/s, 100-700) vs. thread count (4 to 512) for SRP (NOOP), BRP-1, BRP-2, and BRP-3T.]

BRP-3T: always executes temporal merge. FIO benchmark, random write, 4KB, direct I/O.

SLIDE 22

[Chart: normalized throughput (0.00-1.20) for r_wr (buff), r_wr (direct), and r_rd (direct), comparing local, SRP (NOOP), BRP-1, BRP-2, and BRP-3.]

FIO benchmark, 4KB, 256 threads.

SLIDE 23

 SAN with high-performance storage

 Proposed a new SAN solution

  • Remove software overheads in the I/O path
  • Increase parallelism on the target side
  • Temporal merge for RDMA data transfer

 Implemented the optimized SAN as a prototype

SLIDE 24

Thank you!

Q&A