ReFlex: Remote Flash ≈ Local Flash
Ana Klimovic Heiner Litz Christos Kozyrakis
NVMW’18 Memorable Paper Award Finalist
Flash in Datacenters

Flash provides ~1000× higher throughput and ~100× lower latency than disk.
PCIe Flash:
– 1,000,000 IOPS
– 70 µs read latency
[Figure: p95 read latency (µs) vs. IOPS (thousands), 4 kB random reads: Local Flash, iSCSI (1 core), libaio+libevent (1 core).]
Remote access costs a 4× throughput drop and a 2× latency increase vs. local Flash.
[Figure: p95 read latency (µs) vs. total IOPS (thousands) for read/write mixes from 50% to 100% reads.]
Writes impact read tail latency: the throughput achievable at a given tail-latency target depends on the read/write ratio.
[Figure: architecture comparison. Left: conventional remote storage over Linux (user-space application; kernel filesystem, block I/O, and device driver above the network interface and Flash storage). Right: ReFlex (user-space application with a data plane and control plane directly above the network interface and Flash storage).]
Remove software bloat by separating the control plane from the data plane.
Direct access to hardware via DPDK (networking) and SPDK (storage); one data plane per CPU core.
Polling instead of interrupts: no IRQ overhead on the data path.
Run to completion: each request is processed fully on one core, with no hand-offs.
Adaptive batching
Zero-copy device-to-device data path.
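The data-plane techniques above (polling, run to completion, adaptive batching) can be sketched as a simplified event loop. All names here are illustrative placeholders, not the real ReFlex/IX API; the two deques stand in for a NIC RX ring and an NVMe completion queue:

```python
from collections import deque

MAX_BATCH = 64  # cap per poll; bounds queueing delay under bursts

def poll(queue, limit):
    """Drain up to `limit` items from a queue (stand-in for polling a
    NIC RX ring or NVMe completion queue; no interrupts involved)."""
    batch = []
    while queue and len(batch) < limit:
        batch.append(queue.popleft())
    return batch

def data_plane_iteration(rx_ring, nvme_cq, submit, respond):
    """One iteration of a run-to-completion loop: every request and
    completion is handled fully on the polling core, with adaptive
    batch sizes (small batches at low load, up to MAX_BATCH at high load)."""
    for req in poll(rx_ring, MAX_BATCH):
        submit(req)        # run to completion: no thread hand-off
    for done in poll(nvme_cq, MAX_BATCH):
        respond(done)      # completion handled on the same core
```

At low load the rings are short, so each request is handled almost immediately; at high load batches grow toward MAX_BATCH, amortizing per-request overheads without unbounded delay.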
[Figure: p95 read latency (µs) vs. raw IOPS (left) and vs. weighted IOPS in 10³ tokens/s (right), for read/write mixes from 50% to 100% reads.]
For this device, one write costs about 10× one read. The scheduler charges requests in weighted tokens to compensate for read-write asymmetry.
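A minimal sketch of the weighted-token idea, assuming the 10× write/read cost ratio above (the function and constant names are illustrative, not ReFlex's API):

```python
WRITE_COST = 10  # from device calibration: one write ~ 10 reads on this device

def token_cost(num_reads, num_writes, write_cost=WRITE_COST):
    """Charge requests in read-equivalent tokens, so write-heavy
    workloads consume proportionally more of their allocation."""
    return num_reads + write_cost * num_writes
```

For example, a 90%-read mix of 1000 requests costs token_cost(900, 100) = 1900 tokens, nearly twice the cost of 1000 pure reads.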
Example: under a 1 ms tail-latency SLO, this device sustains at most 510K IOPS. If a tenant reserves a 200K IOPS SLO, 310K IOPS of slack remain for best-effort tenants.
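The slack arithmetic above can be written out as a small helper (names illustrative):

```python
def best_effort_slack(device_max_iops, reserved_iops_slos):
    """IOPS remaining for best-effort tenants after honoring every
    latency-critical tenant's IOPS reservation at the latency SLO."""
    return max(device_max_iops - sum(reserved_iops_slos), 0)

# Slide example: 510K IOPS sustainable at the 1 ms tail-latency SLO,
# one tenant reserving 200K IOPS -> 310K IOPS of slack.
print(best_effort_slack(510_000, [200_000]))  # -> 310000
```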
[Figure: p95 read latency (µs) vs. IOPS (thousands), one thread each: Local-1T, ReFlex-1T, Linux-1T (libaio).]
ReFlex: 850K IOPS/core vs. Linux: 75K IOPS/core.
p95 read latency: Local Flash 78 µs, ReFlex 99 µs, Linux 200 µs.
[Figure: p95 read latency (µs) vs. IOPS (thousands) for one and two threads: Local, ReFlex, and Linux (libaio).]
With two threads, ReFlex saturates the Flash device.
[Figure: per-tenant IOPS (thousands) and p95 read latency (µs) for Tenants A–D (100%, 80%, 95%, 25% reads), with the I/O scheduler disabled vs. enabled. Lines mark the latency SLO and the IOPS SLOs of Tenants A and B.]
– Performance: remote Flash ≈ local Flash
– Commodity networking, low CPU overhead
– Quality-of-Service-aware request scheduling
Distributed storage system (collaboration with IBM Research).