

SLIDE 1

Flash Storage Disaggregation

Ana Klimovic (1), Christos Kozyrakis (1,4), Eno Thereska (3,5), Binu John (2) and Sanjeev Kumar (2)


SLIDE 2

Flash is underutilized

  • Flash provides higher throughput and lower latency than disk
  • Flash is underutilized in datacenters due to imbalanced resource requirements

PCIe Flash:
  – 100,000s of IOPS
  – 10s of µs latency

SLIDE 3

Datacenter Flash Use-Case

[Diagram: clients issue get(k)/put(k,val) requests over TCP/IP to app servers in the application tier; the datastore tier runs a key-value datastore service on servers with CPU, RAM, NIC, and local Flash.]
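To make the interface concrete: clients only ever see get(k)/put(k,val) over a TCP connection, no matter where the bytes physically live. Below is a minimal sketch of such a client; the class name, wire format, and newline framing are made-up illustrations, not the deck's actual protocol.

```python
import socket

class KVClient:
    """Hypothetical get/put client for a TCP key-value datastore service."""

    def __init__(self, host, port):
        self.sock = socket.create_connection((host, port))

    def _call(self, request):
        self.sock.sendall(request.encode() + b"\n")
        buf = b""
        while not buf.endswith(b"\n"):   # toy newline framing
            buf += self.sock.recv(4096)
        return buf.strip()

    def get(self, k):
        return self._call(f"GET {k}")

    def put(self, k, val):
        return self._call(f"PUT {k} {val}")

# Usage (against a hypothetical server):
#   client = KVClient("datastore.example", 8888)
#   client.put("k", "val"); print(client.get("k"))
```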

SLIDE 4

Imbalanced Resource Utilization

  • Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months

[Chart: measured server utilization over 6 months]


SLIDE 7

Imbalanced Resource Utilization

  • Flash capacity and IOPS are underutilized for long periods of time

[Chart: Flash capacity and IOPS utilization over 6 months]

SLIDE 8

Imbalanced Resource Utilization

  • CPU and Flash utilization follow separate trends over time

[Chart: CPU vs. Flash utilization over 6 months]

SLIDE 9

Local Flash Architecture

[Diagram: same architecture as Slide 3 — the key-value datastore service accesses PCIe Flash attached locally to each datastore server.]

Local Flash forces us to provision Flash and CPU in a dependent manner.

SLIDE 10

Disaggregated Flash Architecture

[Diagram: the datastore tier keeps its CPU, RAM, and NIC but no local Flash; the key-value store issues read(blk) and write(blk,data) calls that travel over iSCSI/TCP to a remote block service in a separate Flash tier, which has its own CPU, RAM, NIC, and Flash.]
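Because iSCSI exposes the remote LUN as an ordinary block device after login, the datastore's block-level code is unchanged between local and remote Flash. A minimal sketch of that read(blk)/write(blk,data) interface, assuming a hypothetical device name:

```python
import os

BLOCK_SIZE = 4096
DEV = "/dev/sdb"  # hypothetical: whatever name the kernel assigns the iSCSI LUN

def read_block(fd, blk):
    # Identical call whether DEV is local PCIe Flash or a remote iSCSI LUN.
    return os.pread(fd, BLOCK_SIZE, blk * BLOCK_SIZE)

def write_block(fd, blk, data):
    assert len(data) == BLOCK_SIZE
    return os.pwrite(fd, data, blk * BLOCK_SIZE)

if __name__ == "__main__":
    # Caution: writes raw blocks; only run against a scratch device.
    fd = os.open(DEV, os.O_RDWR)
    write_block(fd, 100, b"\xab" * BLOCK_SIZE)
    print(read_block(fd, 100)[:8])
    os.close(fd)
```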

SLIDE 11

Contributions

For real applications at Facebook, we analyze:

  1. What is the performance overhead of remote Flash using existing protocols?
  2. What optimizations improve performance?
  3. When does disaggregating Flash lead to resource efficiency benefits?

SLIDE 12

Flash Workloads at Facebook

  • Analyze IO patterns of real Flash-based Facebook applications
  • Applications use RocksDB, a key-value store with a log-structured merge-tree architecture

            IOPS/TB      IO size
  Read      2K – 10K     10KB – 50KB
  Write     100 – 1K     500KB – 2MB

Lots of random reads; large, bursty writes.
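To make the IO mix concrete, here is a toy generator that replays the table's pattern against a scratch file: mostly small random reads (point lookups across LSM levels) with occasional large write bursts (compaction-like). The file name, sizes, and the ~5% write ratio are illustrative choices derived from the table, not Facebook's actual trace.

```python
import os
import random

PATH = "testfile.bin"     # hypothetical scratch file standing in for Flash
FILE_SIZE = 1 << 30       # 1 GiB
READ_SIZE = 32 * 1024     # inside the 10KB–50KB read range from the table
WRITE_SIZE = 1 << 20      # inside the 500KB–2MB write range
WRITE_RATIO = 0.05        # reads outnumber writes ~10-20x per the table's IOPS/TB

def run(n_ops=10_000):
    with open(PATH, "wb") as f:
        f.truncate(FILE_SIZE)           # sparse scratch file
    fd = os.open(PATH, os.O_RDWR)
    burst = os.urandom(WRITE_SIZE)
    for _ in range(n_ops):
        if random.random() < WRITE_RATIO:
            # Large write burst, akin to LSM compaction output.
            os.pwrite(fd, burst, random.randrange(0, FILE_SIZE - WRITE_SIZE))
        else:
            # Small random read, akin to a point lookup.
            os.pread(fd, READ_SIZE, random.randrange(0, FILE_SIZE - READ_SIZE))
    os.close(fd)

if __name__ == "__main__":
    run()
```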

SLIDE 13

Workload Analysis

[Diagram: experimental setup — the mutilate load generator drives application-tier clients over TCP/IP; the datastore tier runs RocksDB behind an SSDB server wrapper; Flash sits behind a remote block service in the Flash tier, reached over a yet-to-be-chosen protocol.]

SLIDE 14

Workload Analysis

[Diagram: same setup as Slide 13, with iSCSI as the protocol between the datastore tier and the remote block service in the Flash tier.]

iSCSI is a standard network storage protocol that transports block storage commands over TCP/IP.
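For concreteness, attaching a remote target with the standard Open-iSCSI initiator looks roughly like the sketch below (wrapped in Python only to match the other sketches; portal address and target IQN are placeholders, and the commands require root and the open-iscsi package).

```python
import subprocess

PORTAL = "192.0.2.10"                   # placeholder IP of the Flash-tier server
TARGET = "iqn.2016-01.example:flash0"   # placeholder target IQN

# Discover the targets the portal exports, then log in; after login the
# LUN shows up as a regular SCSI block device (e.g. /dev/sdX).
subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                "-p", PORTAL], check=True)
subprocess.run(["iscsiadm", "-m", "node", "-T", TARGET,
                "-p", PORTAL, "--login"], check=True)
```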

SLIDE 15

Workload Analysis

[Diagram: same setup as Slide 14.]

  √ Transparent to the application
  √ Runs on a commodity network
  √ Scales datacenter-wide

SLIDE 16

Workload Analysis

[Diagram: measured setup — mutilate load-generator clients drive the SSDB/RocksDB datastore server (6 cores, 4GB RAM) over 10Gb Ethernet; the Flash tier exports an Intel P3600 PCIe Flash device over iSCSI, also on 10Gb Ethernet. We measure round-trip latency.]

SLIDE 17

Unloaded Latency

  • Remote access with iSCSI adds 260µs to p95 latency, tolerable for our target application (latency SLO ~5ms)

[Chart: unloaded latency distribution, local vs. remote iSCSI; gap at p95 ≈ 260µs]

SLIDE 18

Application Throughput

  • 45% throughput drop with “out of the box” iSCSI Flash
  • Need to optimize the remote Flash server for higher throughput

[Plot: client latency (ms) vs. QPS (thousands) — Local Flash vs. iSCSI baseline (8 processes); 45% drop]

SLIDE 19

Multi-process iSCSI

  • Vary the number of iSCSI processes that issue IO
  • Want enough parallelism while avoiding scheduling interference

[Plot: client latency vs. QPS — Local Flash, 6 iSCSI processes (optimal), 8 iSCSI processes (default), 1 iSCSI process; 12% gain]
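The knob here is simply how many OS processes submit IO concurrently. As a toy illustration (scratch file instead of the real LUN, made-up sizes), this sketch sweeps the process counts from the plot and measures read throughput:

```python
import multiprocessing as mp
import os
import random
import time

PATH = "testfile.bin"      # hypothetical scratch file standing in for the LUN
IO_SIZE = 4096
FILE_SIZE = 1 << 30        # 1 GiB
OPS_PER_WORKER = 20_000

def worker(seed):
    rng = random.Random(seed)
    fd = os.open(PATH, os.O_RDONLY)
    for _ in range(OPS_PER_WORKER):
        # 4kB-aligned random read, like the IO the iSCSI processes issue.
        off = rng.randrange(0, FILE_SIZE // IO_SIZE) * IO_SIZE
        os.pread(fd, IO_SIZE, off)
    os.close(fd)

if __name__ == "__main__":
    with open(PATH, "wb") as f:
        f.truncate(FILE_SIZE)   # sparse file; use a real device for real numbers
    for n in (1, 6, 8):         # process counts from the slide
        start = time.time()
        procs = [mp.Process(target=worker, args=(i,)) for i in range(n)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        rate = n * OPS_PER_WORKER / (time.time() - start)
        print(f"{n} processes: {rate:,.0f} reads/sec")
```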

SLIDE 20

NIC offloads

  • Enable NIC offloads for TCP segmentation (TSO/LRO) to reduce CPU load on the Flash server and datastore server

[Plot: client latency vs. QPS — Local Flash, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 8% gain]
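TSO and LRO are standard NIC features toggled with ethtool; a minimal sketch of enabling them (the interface name is a placeholder, and this must run as root):

```python
import subprocess

IFACE = "eth0"  # placeholder interface name

# TSO lets the NIC segment large TCP sends in hardware; LRO coalesces
# received segments, so far fewer packets traverse the host CPU's stack.
for feature in ("tso", "lro"):
    subprocess.run(["ethtool", "-K", IFACE, feature, "on"], check=True)
```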

SLIDE 21

Jumbo Frames

  • Jumbo frames reduce overhead further by largely avoiding segmentation (MTU up to 9kB)

[Plot: client latency vs. QPS — Local Flash, Jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 10% gain]
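Raising the MTU is a one-line host setting, and a quick bit of arithmetic shows why it matters at this workload's IO sizes. The interface name below is a placeholder; the command requires root.

```python
import math
import subprocess

IFACE = "eth0"  # placeholder interface name
subprocess.run(["ip", "link", "set", "dev", IFACE, "mtu", "9000"], check=True)

# Rough frame-count arithmetic for one 32KB read response (headers ignored):
for mtu in (1500, 9000):
    print(f"MTU {mtu}: ~{math.ceil(32 * 1024 / mtu)} frames per 32KB read")
# ~22 standard frames shrink to ~4 jumbo frames, cutting per-packet CPU cost.
```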

SLIDE 22

Interrupt Affinity Tuning

  • Steer NIC interrupts to the core handling the TCP connection and Flash interrupts to the cores issuing IO commands

[Plot: client latency vs. QPS — Local Flash, Interrupt affinity, Jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 4% gain]
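On Linux, this steering is done by writing a CPU bitmask into /proc/irq/&lt;n&gt;/smp_affinity (root only). The sketch below illustrates the idea; the IRQ numbers and core assignments are placeholders you would look up in /proc/interrupts.

```python
def set_irq_affinity(irq, cpus):
    """Pin an IRQ to a set of cores by writing its hex CPU bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{mask:x}")

# Placeholder IRQ numbers — find the real ones in /proc/interrupts.
set_irq_affinity(64, [0])              # NIC queue -> core running the TCP stack
for i, irq in enumerate(range(70, 74)):
    set_irq_affinity(irq, [2 + i])     # Flash completion queues -> IO-issuing cores
```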

SLIDE 23

Optimized Application Throughput

  • Combined, these optimizations improve throughput by ~42% over the iSCSI baseline

[Plot: client latency vs. QPS — Local Flash, Interrupt affinity, Jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 42% cumulative gain]

SLIDE 24

Application Throughput

  • 20% drop in application throughput, on average

[Plot: client latency (ms) vs. QPS (thousands) — local_avg, remote_avg, local_p95, remote_p95; 20% drop on average]

SLIDE 25

Application Throughput

  • At the tail, the overhead of remote access is masked by other factors like write interference on Flash

[Plot: local_avg, remote_avg, local_p95, remote_p95 — 20% drop on average, only 10% drop at p95]

SLIDE 26

Sharing Remote Flash

  • Sharing Flash among 2 or more tenants leads to more write interference → degrades tail performance

[Plot: client latency (ms) vs. QPS (thousands) — local_avg, remote_avg, local_p95, remote_p95; 20% drop on average, 25% drop at the tail]

SLIDE 27

Disaggregation Benefits

  • Make up for the throughput loss by cost-effectively scaling resources with disaggregation
  • Improve overall resource utilization
  • Formulate a cost model to quantify the benefits (sketched below)
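The deck doesn't spell out the model, so the following is only a minimal sketch of the intuition under simple linear-cost assumptions — all unit prices and per-server capacities are invented, and this is not the paper's actual formulation. A local architecture must buy bundled CPU+Flash servers to cover whichever resource binds, stranding the other; disaggregation sizes compute and Flash tiers independently.

```python
import math

# Invented unit costs and per-server capacities (illustrative only).
BUNDLE_COST, BUNDLE_CPU, BUNDLE_FLASH = 10.0, 1.0, 1.0  # local CPU+Flash server
COMPUTE_COST, COMPUTE_CPU = 6.0, 1.0                    # compute-only server
FLASH_COST, FLASH_TB = 5.0, 1.0                         # Flash-heavy server

def local_cost(cpu_demand, flash_demand):
    # Bundled servers: buy enough for whichever resource binds first.
    n = max(math.ceil(cpu_demand / BUNDLE_CPU),
            math.ceil(flash_demand / BUNDLE_FLASH))
    return n * BUNDLE_COST

def disagg_cost(cpu_demand, flash_demand):
    # Disaggregation sizes the compute and Flash tiers independently.
    return (math.ceil(cpu_demand / COMPUTE_CPU) * COMPUTE_COST +
            math.ceil(flash_demand / FLASH_TB) * FLASH_COST)

for cpu, flash in [(10, 10), (10, 40), (40, 10)]:
    benefit = 1 - disagg_cost(cpu, flash) / local_cost(cpu, flash)
    print(f"cpu={cpu:>2} flash={flash:>2}: disaggregation benefit {benefit:+.0%}")
```

With these invented prices, balanced demand comes out slightly negative (the Flash tier's extra CPU and NIC are not free), while skewed demand yields savings of roughly 28–35% — the same shape as the surface plot on the next slides.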

SLIDE 28

Resource Savings

  • Resource savings of the disaggregated vs. local Flash architecture as app requirements scale

[Surface plot: % cost benefit of disaggregation (-10% to 40%) as a function of the storage capacity scaling factor and the compute intensity scaling factor]

SLIDE 29

Resource Savings

  • Resource savings of the disaggregated vs. local Flash architecture as app requirements scale

[Surface plot as on Slide 28, annotating the region of balanced CPU & Flash utilization]

SLIDE 30

Resource Savings

  • When storage scales at a higher rate than compute, save resources by deploying Flash without as much CPU

[Surface plot as on Slide 28, annotating the region where it pays to deploy more Flash servers than compute servers]

SLIDE 31

Resource Savings

  • When compute and storage demands remain balanced, disaggregation yields no benefit

[Surface plot as on Slide 28, highlighting the balanced CPU & Flash utilization region near 0% benefit]

SLIDE 32

Implications for System Design

  • Dataplane:
    – Reduce the compute overhead of the network (storage) stack:
      • Optimize TCP/IP processing
      • Use a light-weight protocol
    – Provide isolation mechanisms for shared remote Flash
  • Control plane:
    – Policies for allocating and sharing remote Flash
      • Important to consider applications' write IO patterns

SLIDE 33

[Recap: disaggregated Flash architecture diagram (Slide 10), cost-benefit surface (Slide 28), and optimized throughput plot (Slide 23).]

SLIDE 34

Conclusion

  • Disaggregating Flash is beneficial because it allows us to cost-effectively scale resources:
    – Improve overall resource efficiency
    – Compensate for the 20% throughput overhead by independently deploying application resources
  • System tuning improves performance by ~40%; there are more opportunities if we redesign the software stack

SLIDE 35

Backup

SLIDE 36

Remote Flash IOPS

IO-intensive benchmark: 4kB random reads

[Bar chart: IOPS (thousands, 50–250) for 1, 3, and 6 tenants under Baseline, Multi-process, Multi-thread, NIC offload, Jumbo frame, and IRQ affinity configurations, with the local Flash IOPS level marked]
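A bare-bones version of such a 4kB random-read benchmark is sketched below (single process, synchronous IO; the device path, span, and run length are placeholders). Running several copies in parallel, as in the multi-process experiments above, pushes the device toward its rated IOPS; a production measurement would use a tool like fio with O_DIRECT and a high queue depth.

```python
import os
import random
import time

DEV = "/dev/sdb"          # placeholder: the (remote) Flash block device
IO_SIZE = 4096
SPAN = 100 * (1 << 30)    # placeholder: issue reads within the first 100 GiB
DURATION = 10.0           # seconds

fd = os.open(DEV, os.O_RDONLY)
ops, deadline = 0, time.time() + DURATION
while time.time() < deadline:
    off = random.randrange(0, SPAN // IO_SIZE) * IO_SIZE  # 4kB-aligned offset
    os.pread(fd, IO_SIZE, off)
    ops += 1
os.close(fd)
print(f"{ops / DURATION:,.0f} IOPS (one process, synchronous reads)")
```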

SLIDE 37

Cost Model

SLIDE 38

Related Work

  • Disaggregated disk storage:
    – Petal [ASPLOS’96], Parallax [HotOS’05], Blizzard [NSDI’14]
  • Disaggregated Flash as a distributed shared log:
    – CORFU [NSDI’12], FAWN [SOSP’09]
  • Disaggregated memory:
    – Memory blade servers (Lim et al.) [ISCA’09]
  • Rack-scale disaggregation:
    – Pelican [OSDI’14], HP Moonshot, Intel Rack-Scale