SLIDE 1

ReFlex: Remote Flash ≈ Local Flash

Ana Klimovic Heiner Litz Christos Kozyrakis

NVMW’18 Memorable Paper Award Finalist

SLIDE 2

Flash in Datacenters

  • Flash provides 1000× higher throughput and 100× lower latency than disk
  • Flash is often underutilized due to imbalanced resource requirements

PCIe Flash: 1,000,000 IOPS, 70 µs read latency

Solution: share the SSD between remote tenants

SLIDE 3

Existing Approaches

  • Remote access to disk (e.g. iSCSI)
  • Remote access to DRAM or NVMe over RDMA

Two main issues remain:
  1. Performance overhead
  2. Interference on the shared remote Flash device

SLIDE 4

Issue 1: Performance Overhead

  • Traditional network storage protocols and Linux I/O libraries (e.g. libaio, libevent) have high overhead: a 4× throughput drop and 2× latency increase vs. local Flash

[Figure: p95 read latency (µs) vs. IOPS (thousands), 4 kB random reads; curves for Local Flash, iSCSI (1 core), and libaio+libevent (1 core)]

SLIDE 5

Issue 2: Performance Interference

[Figure: p95 read latency (µs) vs. total IOPS (thousands) for read/write mixes from 100% read down to 50% read]

  • Writes impact read tail latency
  • Latency depends on IOPS load

To share Flash, we need to enforce performance isolation

SLIDE 6

How does ReFlex achieve high performance? Linux vs. ReFlex

[Diagram: the Linux I/O path (remote storage application, filesystem, block I/O, device driver, then NIC and Flash) vs. ReFlex (a user-space data plane plus control plane sitting directly on the NIC and Flash)]

SLIDE 7

How does ReFlex achieve high performance? Linux vs. ReFlex

[Diagram repeated from Slide 6]

Remove software bloat by separating the control and data planes

SLIDE 8

How does ReFlex achieve high performance? Linux vs. ReFlex

[Diagram repeated from Slide 6]

Direct access to hardware via DPDK (network) and SPDK (storage); one data plane per CPU core

SLIDE 9

How does ReFlex achieve high performance? Linux vs. ReFlex

[Diagram repeated from Slide 6]

Polling instead of interrupts (no IRQ overhead)

SLIDE 10

How does ReFlex achieve high performance? Linux vs. ReFlex

[Diagram repeated from Slide 6]

Polling instead of interrupts; run-to-completion processing

SLIDE 11

How does ReFlex achieve high performance? Linux vs. ReFlex

[Diagram repeated from Slide 6]

Polling instead of interrupts; adaptive batching

SLIDE 12

How does ReFlex achieve high performance? Linux vs. ReFlex

[Diagram repeated from Slide 6]

Zero-copy, device-to-device data transfers
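Slides 8–12 list the dataplane mechanisms: direct hardware access, polling, run-to-completion, and adaptive batching. As a rough illustration only (ReFlex itself is C on DPDK/SPDK; the queue objects and function names below are invented for this sketch), the per-core loop combines them like this:

```python
from collections import deque

def dataplane_loop(nic_rx, nvme_cq, submit, respond,
                   max_batch=64, iterations=1000):
    """Illustrative run-to-completion loop, one instance per CPU core.

    nic_rx / nvme_cq stand in for the NIC receive queue and the NVMe
    completion queue that the data plane polls directly (no interrupts).
    """
    for _ in range(iterations):
        # Adaptive batching: drain whatever is pending, up to a cap,
        # instead of waiting for a fixed-size batch to fill.
        n = min(len(nic_rx), max_batch)
        for _ in range(n):
            submit(nic_rx.popleft())       # parse request, issue NVMe command
        # Run to completion: finish each I/O on the same core that
        # received it, sending the response before polling again.
        while nvme_cq:
            respond(nvme_cq.popleft())
```

Under load the batch grows naturally (more requests accumulate per poll), which amortizes per-request overhead without adding queueing delay when the system is idle.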

SLIDE 13

How does ReFlex enable performance isolation?

  • Request-cost-based scheduling
  • Determine the impact of tenant A on the tail latency and IOPS of tenant B
  • The control plane assigns each tenant a quota
  • The data plane enforces quotas through throttling
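The throttling in the last bullet can be pictured as a per-tenant token bucket. This is a hedged sketch only (class and field names are invented here, and ReFlex's actual scheduler is more involved):

```python
import time

class TokenBucket:
    """Per-tenant quota-enforcement sketch: each request carries a token
    cost and is admitted only while the bucket holds enough tokens."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens replenished per second (the quota)
        self.burst = burst        # cap on accumulated tokens
        self.tokens = burst
        self.last = time.monotonic()

    def try_consume(self, cost):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True           # admit the request
        return False              # over quota: throttle (requeue) it
```

The data plane would call `try_consume` for each incoming request and delay those that return `False` until tokens accumulate again.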

SLIDE 14

Request Cost Modeling

[Figures: p95 read latency (µs) vs. total IOPS (thousands), and vs. weighted IOPS (×10³ tokens/s), for 100/99/95/90/75/50% read mixes]

For this device: Write ≈ 10× Read. Weighted tokens compensate for read-write asymmetry.
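Concretely, the weighting behind the second plot amounts to charging each request in tokens. The 10× figure is the slide's measurement for this particular device; a real deployment would calibrate the costs per device (the function name below is invented for illustration):

```python
READ_COST = 1    # tokens per 4 kB read
WRITE_COST = 10  # tokens per write on this device (slide: write ≈ 10× read)

def weighted_tokens(reads_per_sec, writes_per_sec):
    """Convert a raw IOPS mix into weighted tokens per second."""
    return reads_per_sec * READ_COST + writes_per_sec * WRITE_COST
```

For example, a 90%-read workload at 100K IOPS is charged 90,000 + 10 × 10,000 = 190,000 tokens/s, so its write share consumes capacity faster than its raw IOPS suggest.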

SLIDE 15

Request Cost Based Scheduling

SLIDE 16

Request Cost Based Scheduling

1 ms tail-latency SLO

SLIDE 17

Request Cost Based Scheduling

1 ms tail-latency SLO; device max: 510K IOPS

SLIDE 18

Request Cost Based Scheduling

1 ms tail-latency SLO; device max: 510K IOPS; tenant SLO: 200K IOPS

SLIDE 19

Request Cost Based Scheduling

1 ms tail-latency SLO; device max: 510K IOPS; tenant SLO: 200K IOPS; slack: 310K IOPS
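The build on this slide is simple capacity accounting: reservations for latency-critical tenants subtract from the device's SLO-compliant maximum, and whatever remains is slack for best-effort traffic. A minimal sketch (the function name is invented here):

```python
# Weighted IOPS the device sustains while meeting the 1 ms tail-latency SLO
DEVICE_MAX_TOKENS = 510_000

def slack(reservations):
    """Tokens/s left for best-effort tenants after SLO reservations."""
    return DEVICE_MAX_TOKENS - sum(reservations)

# One latency-critical tenant reserving 200K IOPS leaves 310K of slack.
```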

SLIDE 20

Results: Local ≈ Remote Latency

[Figure: p95 read latency (µs) vs. IOPS (thousands), single tenant; curves for Local-1T, ReFlex-1T, and Libaio/Linux-1T]

ReFlex: 850K IOPS/core; Linux: 75K IOPS/core

SLIDE 21

Results: Local ≈ Remote Latency

[Figure: same single-tenant latency curves as Slide 20]

Latency: Local Flash 78 µs, ReFlex 99 µs, Linux 200 µs

SLIDE 22

Results: Local ≈ Remote Latency

[Figure: p95 read latency (µs) vs. IOPS (thousands) with one and two tenants; curves for Local, ReFlex, and Libaio/Linux]

ReFlex saturates the Flash device

SLIDE 23

Results: Performance Isolation

  • Tenants A & B: latency-critical; Tenants C & D: best-effort

[Figures: per-tenant IOPS (thousands) and p95 read latency (µs) with the I/O scheduler disabled vs. enabled; tenant read mixes of 100/80/95/25% read; SLO lines marked for Tenant A latency and Tenant A/B IOPS]

SLIDE 24

Results: Performance Isolation

  • Tenants A & B: latency-critical; Tenants C & D: best-effort
  • Without the scheduler, the latency and bandwidth SLOs for A and B are violated

[Figures repeated from Slide 23]

SLIDE 25

Results: Performance Isolation

  • Tenants A & B: latency-critical; Tenants C & D: best-effort
  • Without the scheduler, the latency and bandwidth SLOs for A and B are violated
  • The scheduler rate-limits best-effort tenants to enforce SLOs

[Figures repeated from Slide 23]

SLIDE 26

ReFlex Summary

  • 1. Enables Flash disaggregation → improved utilization
    – Performance: remote ≈ local
    – Commodity networking, low CPU overhead
  • 2. Guarantees QoS in shared-resource deployments
    – Quality-of-Service-aware request scheduling

SLIDE 27

Impact of ReFlex

  • Open source: https://github.com/stanford-mast/reflex
  • Works on AWS i3 cloud instances with NVMe Flash
  • Integrated as a remote Flash dataplane in the Apache Crail distributed storage system (collaboration with IBM Research)
  • Broadcom is porting ReFlex to an ARM-based SoC

SLIDE 28

Thank You!

Download the source code at: https://github.com/stanford-mast/reflex

Original paper presented at ASPLOS’17.