

SLIDE 1

Flash Storage Disaggregation

Ana Klimovic (1), Christos Kozyrakis (1,4), Eno Thereska (3,5), Binu John (2) and Sanjeev Kumar (2)


SLIDE 2

Flash is underutilized

  • Flash provides higher throughput and lower latency than disk
  • Flash is underutilized in datacenters due to imbalanced resource requirements

PCIe Flash:
  – 100,000s of IOPS
  – 10s of µs latency

SLIDE 3

Datacenter Flash Use-Case

[Diagram: clients issue get(k)/put(k,val) requests over TCP/IP to app servers in the application tier; the datastore tier runs a key-value datastore service on servers with CPU, RAM, NIC, and local Flash.]
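To make the interface concrete: clients only ever see get(k)/put(k,val) over a TCP connection, no matter where the bytes physically live. Below is a minimal sketch of such a client; the class name, wire format, and newline framing are made-up illustrations, not the deck's actual protocol.

```python
import socket

class KVClient:
    """Hypothetical get/put client for a TCP key-value datastore service."""

    def __init__(self, host, port):
        self.sock = socket.create_connection((host, port))

    def _call(self, request):
        self.sock.sendall(request.encode() + b"\n")
        buf = b""
        while not buf.endswith(b"\n"):   # toy newline framing
            buf += self.sock.recv(4096)
        return buf.strip()

    def get(self, k):
        return self._call(f"GET {k}")

    def put(self, k, val):
        return self._call(f"PUT {k} {val}")

# Usage (against a hypothetical server):
#   client = KVClient("datastore.example", 8888)
#   client.put("k", "val"); print(client.get("k"))
```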

SLIDE 4

Imbalanced Resource Utilization

  • Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months

[Chart: measured server utilization over 6 months]


SLIDE 7

Imbalanced Resource Utilization

  • Flash capacity and IOPS are underutilized for long periods of time

[Chart: Flash capacity and IOPS utilization over 6 months]

SLIDE 8

Imbalanced Resource Utilization

  • CPU and Flash utilization follow separate trends over time

[Chart: CPU vs. Flash utilization over 6 months]

SLIDE 9

Local Flash Architecture

[Diagram: same architecture as Slide 3 — the key-value datastore service accesses PCIe Flash attached locally to each datastore server.]

Local Flash forces us to provision Flash and CPU in a dependent manner.

SLIDE 10

Disaggregated Flash Architecture

[Diagram: the datastore tier keeps its CPU, RAM, and NIC but no local Flash; the key-value store issues read(blk) and write(blk,data) calls that travel over iSCSI/TCP to a remote block service in a separate Flash tier, which has its own CPU, RAM, NIC, and Flash.]
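Because iSCSI exposes the remote LUN as an ordinary block device after login, the datastore's block-level code is unchanged between local and remote Flash. A minimal sketch of that read(blk)/write(blk,data) interface, assuming a hypothetical device name:

```python
import os

BLOCK_SIZE = 4096
DEV = "/dev/sdb"  # hypothetical: whatever name the kernel assigns the iSCSI LUN

def read_block(fd, blk):
    # Identical call whether DEV is local PCIe Flash or a remote iSCSI LUN.
    return os.pread(fd, BLOCK_SIZE, blk * BLOCK_SIZE)

def write_block(fd, blk, data):
    assert len(data) == BLOCK_SIZE
    return os.pwrite(fd, data, blk * BLOCK_SIZE)

if __name__ == "__main__":
    # Caution: writes raw blocks; only run against a scratch device.
    fd = os.open(DEV, os.O_RDWR)
    write_block(fd, 100, b"\xab" * BLOCK_SIZE)
    print(read_block(fd, 100)[:8])
    os.close(fd)
```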

SLIDE 11

Contributions

For real applications at Facebook, we analyze:

  1. What is the performance overhead of remote Flash using existing protocols?
  2. What optimizations improve performance?
  3. When does disaggregating Flash lead to resource efficiency benefits?

SLIDE 12

Flash Workloads at Facebook

  • Analyze IO patterns of real Flash-based Facebook applications
  • Applications use RocksDB, a key-value store with a log-structured merge-tree architecture

            IOPS/TB      IO size
  Read      2K – 10K     10KB – 50KB
  Write     100 – 1K     500KB – 2MB

Lots of random reads; large, bursty writes.
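To make the IO mix concrete, here is a toy generator that replays the table's pattern against a scratch file: mostly small random reads (point lookups across LSM levels) with occasional large write bursts (compaction-like). The file name, sizes, and the ~5% write ratio are illustrative choices derived from the table, not Facebook's actual trace.

```python
import os
import random

PATH = "testfile.bin"     # hypothetical scratch file standing in for Flash
FILE_SIZE = 1 << 30       # 1 GiB
READ_SIZE = 32 * 1024     # inside the 10KB–50KB read range from the table
WRITE_SIZE = 1 << 20      # inside the 500KB–2MB write range
WRITE_RATIO = 0.05        # reads outnumber writes ~10-20x per the table's IOPS/TB

def run(n_ops=10_000):
    with open(PATH, "wb") as f:
        f.truncate(FILE_SIZE)           # sparse scratch file
    fd = os.open(PATH, os.O_RDWR)
    burst = os.urandom(WRITE_SIZE)
    for _ in range(n_ops):
        if random.random() < WRITE_RATIO:
            # Large write burst, akin to LSM compaction output.
            os.pwrite(fd, burst, random.randrange(0, FILE_SIZE - WRITE_SIZE))
        else:
            # Small random read, akin to a point lookup.
            os.pread(fd, READ_SIZE, random.randrange(0, FILE_SIZE - READ_SIZE))
    os.close(fd)

if __name__ == "__main__":
    run()
```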

SLIDE 13

Workload Analysis

[Diagram: experimental setup — the mutilate load generator drives application-tier clients over TCP/IP; the datastore tier runs RocksDB behind an SSDB server wrapper; Flash sits behind a remote block service in the Flash tier, reached over a yet-to-be-chosen protocol.]

SLIDE 14

Workload Analysis

[Diagram: same setup as Slide 13, with iSCSI as the protocol between the datastore tier and the remote block service in the Flash tier.]

iSCSI is a standard network storage protocol that transports block storage commands over TCP/IP.
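For concreteness, attaching a remote target with the standard Open-iSCSI initiator looks roughly like the sketch below (wrapped in Python only to match the other sketches; portal address and target IQN are placeholders, and the commands require root and the open-iscsi package).

```python
import subprocess

PORTAL = "192.0.2.10"                   # placeholder IP of the Flash-tier server
TARGET = "iqn.2016-01.example:flash0"   # placeholder target IQN

# Discover the targets the portal exports, then log in; after login the
# LUN shows up as a regular SCSI block device (e.g. /dev/sdX).
subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                "-p", PORTAL], check=True)
subprocess.run(["iscsiadm", "-m", "node", "-T", TARGET,
                "-p", PORTAL, "--login"], check=True)
```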

SLIDE 15

Workload Analysis

[Diagram: same setup as Slide 14.]

  √ Transparent to the application
  √ Runs on a commodity network
  √ Scales datacenter-wide

SLIDE 16

Workload Analysis

[Diagram: measured setup — mutilate load-generator clients drive the SSDB/RocksDB datastore server (6 cores, 4GB RAM) over 10Gb Ethernet; the Flash tier exports an Intel P3600 PCIe Flash device over iSCSI, also on 10Gb Ethernet. We measure round-trip latency.]

SLIDE 17

Unloaded Latency

  • Remote access with iSCSI adds 260µs to p95 latency, tolerable for our target application (latency SLO ~5ms)

[Chart: unloaded latency distribution, local vs. remote iSCSI; gap at p95 ≈ 260µs]

SLIDE 18

Application Throughput

  • 45% throughput drop with “out of the box” iSCSI Flash
  • Need to optimize the remote Flash server for higher throughput

[Plot: client latency (ms) vs. QPS (thousands) — Local Flash vs. iSCSI baseline (8 processes); 45% drop]

SLIDE 19

Multi-process iSCSI

  • Vary the number of iSCSI processes that issue IO
  • Want enough parallelism while avoiding scheduling interference

[Plot: client latency vs. QPS — Local Flash, 6 iSCSI processes (optimal), 8 iSCSI processes (default), 1 iSCSI process; 12% gain]
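The knob here is simply how many OS processes submit IO concurrently. As a toy illustration (scratch file instead of the real LUN, made-up sizes), this sketch sweeps the process counts from the plot and measures read throughput:

```python
import multiprocessing as mp
import os
import random
import time

PATH = "testfile.bin"      # hypothetical scratch file standing in for the LUN
IO_SIZE = 4096
FILE_SIZE = 1 << 30        # 1 GiB
OPS_PER_WORKER = 20_000

def worker(seed):
    rng = random.Random(seed)
    fd = os.open(PATH, os.O_RDONLY)
    for _ in range(OPS_PER_WORKER):
        # 4kB-aligned random read, like the IO the iSCSI processes issue.
        off = rng.randrange(0, FILE_SIZE // IO_SIZE) * IO_SIZE
        os.pread(fd, IO_SIZE, off)
    os.close(fd)

if __name__ == "__main__":
    with open(PATH, "wb") as f:
        f.truncate(FILE_SIZE)   # sparse file; use a real device for real numbers
    for n in (1, 6, 8):         # process counts from the slide
        start = time.time()
        procs = [mp.Process(target=worker, args=(i,)) for i in range(n)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        rate = n * OPS_PER_WORKER / (time.time() - start)
        print(f"{n} processes: {rate:,.0f} reads/sec")
```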

SLIDE 20

NIC offloads

  • Enable NIC offloads for TCP segmentation (TSO/LRO) to reduce CPU load on the Flash server and datastore server

[Plot: client latency vs. QPS — Local Flash, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 8% gain]
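TSO and LRO are standard NIC features toggled with ethtool; a minimal sketch of enabling them (the interface name is a placeholder, and this must run as root):

```python
import subprocess

IFACE = "eth0"  # placeholder interface name

# TSO lets the NIC segment large TCP sends in hardware; LRO coalesces
# received segments, so far fewer packets traverse the host CPU's stack.
for feature in ("tso", "lro"):
    subprocess.run(["ethtool", "-K", IFACE, feature, "on"], check=True)
```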

SLIDE 21

Jumbo Frames

  • Jumbo frames reduce overhead further by largely avoiding segmentation (MTU up to 9kB)

[Plot: client latency vs. QPS — Local Flash, Jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 10% gain]
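Raising the MTU is a one-line host setting, and a quick bit of arithmetic shows why it matters at this workload's IO sizes. The interface name below is a placeholder; the command requires root.

```python
import math
import subprocess

IFACE = "eth0"  # placeholder interface name
subprocess.run(["ip", "link", "set", "dev", IFACE, "mtu", "9000"], check=True)

# Rough frame-count arithmetic for one 32KB read response (headers ignored):
for mtu in (1500, 9000):
    print(f"MTU {mtu}: ~{math.ceil(32 * 1024 / mtu)} frames per 32KB read")
# ~22 standard frames shrink to ~4 jumbo frames, cutting per-packet CPU cost.
```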

SLIDE 22

Interrupt Affinity Tuning

  • Steer NIC interrupts to the core handling the TCP connection and Flash interrupts to the cores issuing IO commands

[Plot: client latency vs. QPS — Local Flash, Interrupt affinity, Jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 4% gain]
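On Linux, this steering is done by writing a CPU bitmask into /proc/irq/&lt;n&gt;/smp_affinity (root only). The sketch below illustrates the idea; the IRQ numbers and core assignments are placeholders you would look up in /proc/interrupts.

```python
def set_irq_affinity(irq, cpus):
    """Pin an IRQ to a set of cores by writing its hex CPU bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{mask:x}")

# Placeholder IRQ numbers — find the real ones in /proc/interrupts.
set_irq_affinity(64, [0])              # NIC queue -> core running the TCP stack
for i, irq in enumerate(range(70, 74)):
    set_irq_affinity(irq, [2 + i])     # Flash completion queues -> IO-issuing cores
```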

SLIDE 23

Optimized Application Throughput

  • Combined, these optimizations improve throughput by ~42% over the iSCSI baseline

[Plot: client latency vs. QPS — Local Flash, Interrupt affinity, Jumbo frame, NIC offload, iSCSI with 6 processes, iSCSI baseline (8 processes); 42% cumulative gain]

SLIDE 24

Application Throughput

  • 20% drop in application throughput, on average

[Plot: client latency (ms) vs. QPS (thousands) — local_avg, remote_avg, local_p95, remote_p95; 20% drop on average]

SLIDE 25

Application Throughput

  • At the tail, the overhead of remote access is masked by other factors like write interference on Flash

[Plot: local_avg, remote_avg, local_p95, remote_p95 — 20% drop on average, only 10% drop at p95]

SLIDE 26

Sharing Remote Flash

  • Sharing Flash among 2 or more tenants leads to more write interference → degrades tail performance

[Plot: client latency (ms) vs. QPS (thousands) — local_avg, remote_avg, local_p95, remote_p95; 20% drop on average, 25% drop at the tail]

SLIDE 27

Disaggregation Benefits

  • Make up for the throughput loss by cost-effectively scaling resources with disaggregation
  • Improve overall resource utilization
  • Formulate a cost model to quantify the benefits (sketched below)
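The deck doesn't spell out the model, so the following is only a minimal sketch of the intuition under simple linear-cost assumptions — all unit prices and per-server capacities are invented, and this is not the paper's actual formulation. A local architecture must buy bundled CPU+Flash servers to cover whichever resource binds, stranding the other; disaggregation sizes compute and Flash tiers independently.

```python
import math

# Invented unit costs and per-server capacities (illustrative only).
BUNDLE_COST, BUNDLE_CPU, BUNDLE_FLASH = 10.0, 1.0, 1.0  # local CPU+Flash server
COMPUTE_COST, COMPUTE_CPU = 6.0, 1.0                    # compute-only server
FLASH_COST, FLASH_TB = 5.0, 1.0                         # Flash-heavy server

def local_cost(cpu_demand, flash_demand):
    # Bundled servers: buy enough for whichever resource binds first.
    n = max(math.ceil(cpu_demand / BUNDLE_CPU),
            math.ceil(flash_demand / BUNDLE_FLASH))
    return n * BUNDLE_COST

def disagg_cost(cpu_demand, flash_demand):
    # Disaggregation sizes the compute and Flash tiers independently.
    return (math.ceil(cpu_demand / COMPUTE_CPU) * COMPUTE_COST +
            math.ceil(flash_demand / FLASH_TB) * FLASH_COST)

for cpu, flash in [(10, 10), (10, 40), (40, 10)]:
    benefit = 1 - disagg_cost(cpu, flash) / local_cost(cpu, flash)
    print(f"cpu={cpu:>2} flash={flash:>2}: disaggregation benefit {benefit:+.0%}")
```

With these invented prices, balanced demand comes out slightly negative (the Flash tier's extra CPU and NIC are not free), while skewed demand yields savings of roughly 28–35% — the same shape as the surface plot on the next slides.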

SLIDE 28

Resource Savings

  • Resource savings of the disaggregated vs. local Flash architecture as app requirements scale

[Surface plot: % cost benefit of disaggregation (-10% to 40%) as a function of the storage capacity scaling factor and the compute intensity scaling factor]

SLIDE 29

Resource Savings

  • Resource savings of the disaggregated vs. local Flash architecture as app requirements scale

[Surface plot as on Slide 28, annotating the region of balanced CPU & Flash utilization]

SLIDE 30

Resource Savings

  • When storage scales at a higher rate than compute, save resources by deploying Flash without as much CPU

[Surface plot as on Slide 28, annotating the region where it pays to deploy more Flash servers than compute servers]

SLIDE 31

Resource Savings

  • When compute and storage demands remain balanced, disaggregation yields no benefit

[Surface plot as on Slide 28, highlighting the balanced CPU & Flash utilization region near 0% benefit]

SLIDE 32

Implications for System Design

  • Dataplane:
    – Reduce the compute overhead of the network (storage) stack:
      • Optimize TCP/IP processing
      • Use a light-weight protocol
    – Provide isolation mechanisms for shared remote Flash
  • Control plane:
    – Policies for allocating and sharing remote Flash
      • Important to consider applications' write IO patterns

SLIDE 33

[Recap: disaggregated Flash architecture diagram (Slide 10), cost-benefit surface (Slide 28), and optimized throughput plot (Slide 23).]

SLIDE 34

Conclusion

  • Disaggregating Flash is beneficial because it allows us to cost-effectively scale resources:
    – Improve overall resource efficiency
    – Compensate for the 20% throughput overhead by independently deploying application resources
  • System tuning improves performance by ~40%; there are more opportunities if we redesign the software stack

SLIDE 35

Backup

SLIDE 36

Remote Flash IOPS

IO-intensive benchmark: 4kB random reads

[Bar chart: IOPS (thousands, 50–250) for 1, 3, and 6 tenants under Baseline, Multi-process, Multi-thread, NIC offload, Jumbo frame, and IRQ affinity configurations, with the local Flash IOPS level marked]
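A bare-bones version of such a 4kB random-read benchmark is sketched below (single process, synchronous IO; the device path, span, and run length are placeholders). Running several copies in parallel, as in the multi-process experiments above, pushes the device toward its rated IOPS; a production measurement would use a tool like fio with O_DIRECT and a high queue depth.

```python
import os
import random
import time

DEV = "/dev/sdb"          # placeholder: the (remote) Flash block device
IO_SIZE = 4096
SPAN = 100 * (1 << 30)    # placeholder: issue reads within the first 100 GiB
DURATION = 10.0           # seconds

fd = os.open(DEV, os.O_RDONLY)
ops, deadline = 0, time.time() + DURATION
while time.time() < deadline:
    off = random.randrange(0, SPAN // IO_SIZE) * IO_SIZE  # 4kB-aligned offset
    os.pread(fd, IO_SIZE, off)
    ops += 1
os.close(fd)
print(f"{ops / DURATION:,.0f} IOPS (one process, synchronous reads)")
```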

SLIDE 37

Cost Model

SLIDE 38

Related Work

  • Disaggregated disk storage:
    – Petal [ASPLOS’96], Parallax [HotOS’05], Blizzard [NSDI’14]
  • Disaggregated Flash as a distributed shared log:
    – CORFU [NSDI’12], FAWN [SOSP’09]
  • Disaggregated memory:
    – Memory blade servers (Lim et al.) [ISCA’09]
  • Rack-scale disaggregation:
    – Pelican [OSDI’14], HP Moonshot, Intel Rack-Scale