

SLIDE 1

FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

Jie Zhang, Miryeong Kwon, Donghyun Gouk, Sungjoon Koh, Changlim Lee, Mohammad Alian, Myoungjun Chun, Mahmut Kandemir, Nam Sung Kim, Jihong Kim and Myoungsoo Jung

SLIDE 2

Executable Summary

Datacenters co-locate latency-critical and throughput applications and expect memory-like performance from storage, but the storage stack (file system, block layer, NVMe driver, flash firmware) is unaware of both the ULL-SSD and the latency-critical application. The resulting interference becomes a barrier that lengthens I/O latency and hides the SSD's ultra-low latency.

FlashShare punches through these performance barriers:

  • Reduces average turnaround response times by 22%
  • Reduces 99th-percentile turnaround response times by 31%

SLIDE 3

Motivation: applications in datacenter

Datacenter executes a wide range of latency-critical workloads.

  • Driven by the market of social media and web services;
  • Required to satisfy a certain level of service-level agreement (SLA);
  • Sensitive to latency (i.e., turnaround response time).

A typical example: Apache

[Diagram: Apache serving an HTTP request — TCP/IP, monitor, service queue, worker threads, data objects, response]

A key metric: user experience

SLIDE 4

Motivation: applications in datacenter

  • Latency-critical applications exhibit varying loads during a day.
  • Datacenter overprovisions its server resources to meet the SLA.
  • However, this results in low utilization and low energy efficiency.

Figure 1. Example diurnal pattern in queries per second for a Web Search cluster1 (x-axis: hour of the day).

1. Power Management of Online Data-Intensive Services.

Figure 2. CPU utilization analysis of a Google server cluster2 (CPU utilization vs. fraction of time).

2. The Datacenter as a Computer.

Takeaway: loads vary throughout the day, and average CPU utilization is only around 30%.

SLIDE 5

Motivation: applications in datacenter

Popular solution: co-locating latency-critical and throughput workloads.

(Prior work: MICRO'11, ISCA'15, EuroSys'14.)

SLIDE 6

Challenge: applications in datacenter

Applications:
  • Apache – online latency-critical application;
  • PageRank – offline throughput application;

Experiment: Apache+PageRank vs. Apache only

Performance metrics:
  • SSD device latency;
  • Response time of the latency-critical application.

SLIDE 7

Challenge: applications in datacenter

Experiment: Apache+PageRank vs. Apache only

  • The throughput-oriented application drastically increases the I/O access latency of the latency-critical application.
  • This latency increase deteriorates the turnaround response time of the latency-critical application.

Fig. 1: Apache SSD latency increases due to PageRank. Fig. 2: Apache response time increases due to PageRank.

SLIDE 8

Challenge: ULL-SSD

There are emerging Ultra Low-Latency SSD (ULL-SSD) technologies, which can be used for faster I/O services in the datacenter.

            Z-NAND           XL-Flash         Optane             nvNitro
Technique   New NAND flash   New NAND flash   Phase-change RAM   MRAM
Vendor      Samsung          Toshiba          Intel              Everspin
Read        3us              N/A              10us               6us
Write       100us            N/A              10us               6us

SLIDE 9

Challenge: ULL-SSD

In this work, we use an engineering sample of the Z-SSD.

Z-NAND [1]:
  • Technology: SLC-based 3D NAND, 48 stacked word-line layers
  • Capacity: 64 Gb/die
  • Page size: 2 KB/page

An SSD built from Z-NAND is referred to as the "Z-SSD".

[1] Cheong, Wooseong, et al. "A flash memory controller for 15μs ultra-low-latency SSD using high-speed 3D NAND flash with 3μs read time." 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018.

SLIDE 10

Challenge: datacenter server with ULL-SSD

Applications: Apache – online latency-critical application; PageRank – offline throughput application;

Device latency analysis

[Figure: SSD device latency, Apache alone vs. Apache+PageRank — annotations: 42x, 36us, 28us]

Unfortunately, the short-latency characteristic of the ULL-SSD is not exposed to users (in particular, to the latency-critical applications).

SLIDE 11

Challenge: datacenter server with ULL-SSD

The storage stack is unaware of the characteristics of both the latency-critical workload and the ULL-SSD.

[Storage stack diagram: App → Caching layer → Filesystem → blkmq → NVMe driver → ULL-SSD]

The current design of the blkmq layer, NVMe driver, and SSD firmware can hurt the performance of latency-critical applications; the ULL-SSD fails to deliver its short latency because of the storage stack.

SLIDE 12

Blkmq layer: challenge

[Diagram: I/O submission through blkmq — incoming requests are queued in per-core software queues, merged, then moved to hardware queues]

Software queue: holds latency-critical I/O requests for a long time;

SLIDE 13

Blkmq layer: challenge

[Diagram: hardware queue dispatch gated by per-queue tokens — requests wait when no token is available]

Software queue: holds latency-critical I/O requests for a long time;
Hardware queue: dispatches an I/O request without any knowledge of latency-criticality.

SLIDE 14

Blkmq layer: optimization

Our solution: bypass.

[Diagram: latency-critical requests (LatReq) bypass the software and hardware queues — no merge, no I/O scheduling — while throughput requests (ThrReq) are still queued and merged]

  • Latency-critical I/Os: bypass blkmq for a faster response (see the sketch below);
  • Throughput I/Os: merge in blkmq for higher storage bandwidth.

(Skipping merges for latency-critical I/Os carries little penalty; the remaining issue is addressed in the NVMe layer.)
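To make the policy concrete, here is a minimal user-space sketch of the bypass decision, assuming a per-request criticality attribute is already available; the types and helpers (io_request, sw_queue, dispatch_to_hw_queue) are hypothetical stand-ins, not the actual Linux blk-mq code.

```c
/* Minimal user-space sketch (not the FlashShare kernel patch) of the blkmq
 * bypass policy described above. All names are hypothetical; a real
 * implementation would live inside the Linux blk-mq submission path. */
#include <stdbool.h>
#include <stdio.h>

enum workload_attr { WL_THROUGHPUT, WL_LATENCY_CRITICAL };

struct io_request {
    long lba;                 /* starting logical block address     */
    int  nblocks;             /* request length in blocks           */
    enum workload_attr attr;  /* criticality hint from the process  */
};

/* A toy per-core software queue that only supports tail merging. */
struct sw_queue {
    struct io_request pending;
    bool has_pending;
};

static void dispatch_to_hw_queue(const struct io_request *rq)
{
    printf("dispatch: lba=%ld len=%d (%s)\n", rq->lba, rq->nblocks,
           rq->attr == WL_LATENCY_CRITICAL ? "latency-critical" : "throughput");
}

static void submit_io(struct sw_queue *q, struct io_request rq)
{
    if (rq.attr == WL_LATENCY_CRITICAL) {
        /* Bypass: no merging, no scheduling delay for urgent I/O. */
        dispatch_to_hw_queue(&rq);
        return;
    }
    /* Throughput I/O: merge with an adjacent pending request if possible,
     * building larger, more bandwidth-friendly requests. */
    if (q->has_pending && q->pending.lba + q->pending.nblocks == rq.lba) {
        q->pending.nblocks += rq.nblocks;
        return;
    }
    if (q->has_pending)
        dispatch_to_hw_queue(&q->pending);
    q->pending = rq;
    q->has_pending = true;
}

int main(void)
{
    struct sw_queue q = { .has_pending = false };
    submit_io(&q, (struct io_request){ .lba = 100, .nblocks = 8, .attr = WL_THROUGHPUT });
    submit_io(&q, (struct io_request){ .lba = 108, .nblocks = 8, .attr = WL_THROUGHPUT });
    submit_io(&q, (struct io_request){ .lba = 500, .nblocks = 4, .attr = WL_LATENCY_CRITICAL });
    if (q.has_pending)
        dispatch_to_hw_queue(&q.pending);
    return 0;
}
```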

SLIDE 15

NVMe SQ: challenge (bypass alone is not enough)

[Diagram: per-core NVMe submission/completion queues with SQ/CQ doorbell registers; the controller fetches commands in order from the SQ head]

NVMe protocol-level queue: a latency-critical I/O request can be blocked by prior I/O requests while waiting to be fetched.

Time cost = T_fetch(self) + 2 x T_fetch, i.e., more than 200% overhead.

SLIDE 16

NVMe SQ: optimization

Target: design a responsiveness-aware NVMe submission.

Key insight:

  • Conventional NVMe controller(s) allow customizing the standard arbitration strategy for different NVMe protocol-level queue accesses.
  • Thus, we can make the NVMe controller decide which NVMe command to fetch by sharing a hint of the I/O urgency.

SLIDE 17

NVMe SQ: optimization

Our solution:

[Diagram: each core gets a Lat-SQ and a Thr-SQ with their own doorbell registers; the NVMe controller fetches from the Lat-SQ immediately and postpones the Thr-SQ]

  • 1. Double the SQs (one for latency-critical I/Os, another for throughput I/Os);
  • 2. Double the SQ doorbell registers;
  • 3. New arbitration strategy: give the highest priority to the Lat-SQ, so its commands are fetched immediately (see the sketch after this list).
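Below is a small user-space model of the strict-priority arbitration between the two SQs; the queue layout and function names are illustrative only and do not mirror a real NVMe controller implementation.

```c
/* User-space sketch of the responsiveness-aware arbitration: the controller
 * model always fetches from the latency-critical SQ (Lat-SQ) before the
 * throughput SQ (Thr-SQ). Queue layout and names are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define SQ_DEPTH 8

struct sq {
    int cmds[SQ_DEPTH];   /* simplified "commands": just an ID          */
    int head, tail;       /* head advanced by the device fetch,         */
};                        /* tail advanced by the host ringing doorbell */

static bool sq_empty(const struct sq *q) { return q->head == q->tail; }

static void sq_ring_doorbell(struct sq *q, int cmd_id)
{
    q->cmds[q->tail % SQ_DEPTH] = cmd_id;
    q->tail++;
}

/* Arbitration: Lat-SQ has strict priority over Thr-SQ. */
static bool fetch_next(struct sq *lat_sq, struct sq *thr_sq, int *cmd_id)
{
    struct sq *pick = !sq_empty(lat_sq) ? lat_sq :
                      !sq_empty(thr_sq) ? thr_sq : NULL;
    if (!pick)
        return false;
    *cmd_id = pick->cmds[pick->head % SQ_DEPTH];
    pick->head++;
    return true;
}

int main(void)
{
    struct sq lat = {0}, thr = {0};
    int id;

    sq_ring_doorbell(&thr, 1);      /* two throughput commands arrive first */
    sq_ring_doorbell(&thr, 2);
    sq_ring_doorbell(&lat, 3);      /* then one latency-critical command    */

    while (fetch_next(&lat, &thr, &id))
        printf("fetched command %d\n", id);  /* prints 3, 1, 2 */
    return 0;
}
```

Even though the latency-critical command arrives last, strict priority lets it be fetched first, which is the "immediate fetch" behavior the slide describes.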

SLIDE 18

SSD firmware: challenge

[Diagram: inside the ULL-SSD — NVMe controller, caching layer (embedded DRAM cache), FTL, NAND flash; incoming requests hit or miss the embedded cache, and misses evict existing entries]

  • The embedded cache provides the fastest response (DRAM service): hit cost = T_CL + T_CACHE, miss cost = T_CL + T_FTL + T_NAND + T_CACHE;
  • The embedded cache can be polluted by throughput requests;
  • The embedded cache cannot protect latency-critical I/O from eviction.

SLIDE 19

SSD firmware: optimization

Our design: split the internal cache space to protect latency-critical I/O requests.

[Diagram: the embedded cache is partitioned so that latency-critical entries live in a protection region; throughput requests evict only within their own region]
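A minimal sketch of this split-cache idea follows, assuming a tiny fully-associative cache with a fixed number of ways reserved as the protection region; the sizes, names, and naive victim selection are illustrative, not the firmware's actual replacement algorithm.

```c
/* Sketch of the firmware cache-split idea: a tiny fully-associative cache
 * whose ways are partitioned so that throughput I/O can only evict entries
 * in its own region and never pollutes the latency-critical region. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_WAYS 4
#define LAT_WAYS 2   /* ways 0..1 reserved for latency-critical data */

struct way { long addr; bool valid; };
static struct way cache[NUM_WAYS];

static bool lookup(long addr)
{
    for (int w = 0; w < NUM_WAYS; w++)
        if (cache[w].valid && cache[w].addr == addr)
            return true;
    return false;
}

/* Fill after a miss: pick a victim only inside the caller's region. */
static void fill(long addr, bool latency_critical)
{
    int lo = latency_critical ? 0 : LAT_WAYS;
    int hi = latency_critical ? LAT_WAYS : NUM_WAYS;
    int victim = lo;                 /* naive policy: first invalid way, else way `lo` */
    for (int w = lo; w < hi; w++)
        if (!cache[w].valid) { victim = w; break; }
    cache[victim] = (struct way){ .addr = addr, .valid = true };
}

static void cache_access(long addr, bool latency_critical)
{
    bool hit = lookup(addr);
    printf("%-16s addr=0x%lx -> %s\n",
           latency_critical ? "latency-critical" : "throughput", addr,
           hit ? "hit" : "miss");
    if (!hit)
        fill(addr, latency_critical);
}

int main(void)
{
    cache_access(0x1, true);    /* latency-critical data lands in the protected region */
    cache_access(0x5, false);   /* throughput I/Os ...                                  */
    cache_access(0x8, false);
    cache_access(0xb, false);   /* ... evict only among ways 2..3                       */
    cache_access(0x1, true);    /* still a hit: protected from throughput eviction      */
    return 0;
}
```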

SLIDE 20

NVMe CQ: challenge

NVMe completion: MSI overhead for each I/O request.

[Diagram: on completion, the controller posts a CQ entry and sends an MSI; the interrupt controller and interrupt service routine add a context switch into the interrupt context and another back to the blkmq layer]

Cost per completion: 2 x T_CS + T_ISR (two context switches plus the interrupt service routine).

SLIDE 21

NVMe CQ: optimization

Key insight: state-of-the-art Linux supports a poll mechanism for I/O completion.

[Diagram: the blkmq layer polls the CQ directly instead of waiting for the MSI, saving 2 x T_CS + T_ISR per completion]

SLIDE 22

NVMe CQ: optimization

The poll mechanism brings latency benefits on a fast storage device.

[Figures: average read and write latency on the ULL-SSD for 4KB–32KB requests, interrupt vs. polling]

Polling decreases latency by 7.5% for reads and 13.2% for writes.

SLIDE 23

NVMe CQ: optimization

However, poll-based I/O services consume most of the host's CPU and memory resources.

[Figures: memory-bound fraction (%) for 4KB–32KB requests, and CPU utilization (%) over time, interrupt vs. polling]

SLIDE 24

NVMe CQ: optimization

Our solution: selective interrupt service routine (Select-ISR).

[Diagram: latency-critical completions are polled by the blkmq layer, while throughput completions let the core sleep and are delivered through the MSI interrupt and ISR]
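The sketch below models the Select-ISR decision in user space: poll the CQ for latency-critical commands, sleep until the MSI for throughput commands. The helper functions stand in for the real phase-bit check and interrupt wait and are purely illustrative.

```c
/* Sketch of the Select-ISR idea: after submitting, the host spins (polls) on
 * the completion queue only for latency-critical commands and sleeps until an
 * MSI interrupt for throughput commands. Types are illustrative, not the real
 * NVMe driver structures. */
#include <stdbool.h>
#include <stdio.h>

enum workload_attr { WL_THROUGHPUT, WL_LATENCY_CRITICAL };

/* Stand-ins for "check the CQ entry's phase bit" and "wait for MSI". */
static bool cq_entry_ready(void)        { return true; }
static void sleep_until_interrupt(void) { puts("  sleeping until MSI interrupt"); }

static void complete_io(enum workload_attr attr)
{
    if (attr == WL_LATENCY_CRITICAL) {
        /* Poll: burn a few CPU cycles to shave off the ISR and the two
         * context switches on the completion path. */
        while (!cq_entry_ready())
            ;  /* spin */
        puts("  completed by polling (no ISR, no context switch)");
    } else {
        /* Throughput I/O can tolerate the interrupt latency, so keep the
         * CPU free for other work. */
        sleep_until_interrupt();
        puts("  completed by interrupt service routine");
    }
}

int main(void)
{
    puts("latency-critical I/O:");
    complete_io(WL_LATENCY_CRITICAL);
    puts("throughput I/O:");
    complete_io(WL_THROUGHPUT);
    return 0;
}
```

The design choice mirrors the measurements on the previous slides: polling saves 2 x T_CS + T_ISR per completion but wastes CPU, so it is reserved for the I/Os that actually need it.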

SLIDE 25

Design: Responsiveness Awareness

Key insight: users have better knowledge of the I/O responsiveness of their workloads (i.e., latency-critical vs. throughput).

Our approach:

  • Open a set of APIs to users that pass the workload attribute to the Linux PCB: a new utility (chworkload_attr) invokes a new system call, which modifies the PCB data structure.
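As a rough illustration of this flow, the sketch below models a PCB that carries a workload attribute and a plain function standing in for the new system call; the real chworkload_attr utility and the kernel-side changes are not reproduced here.

```c
/* Conceptual sketch of the chworkload_attr flow described above: a tiny
 * model of a PCB (task) carrying a workload attribute that a new system
 * call would set. The "syscall" is simulated by a plain function; the real
 * FlashShare patch adds it to the kernel and to task_struct. */
#include <stdio.h>

enum workload_attr { WL_THROUGHPUT = 0, WL_LATENCY_CRITICAL = 1 };

/* Toy stand-in for the fields FlashShare adds to the Linux PCB. */
struct task_model {
    int pid;
    enum workload_attr attr;
};

static struct task_model task_table[] = {
    { .pid = 101, .attr = WL_THROUGHPUT },   /* e.g. PageRank                 */
    { .pid = 202, .attr = WL_THROUGHPUT },   /* e.g. Apache (not yet tagged)  */
};

/* What the new system call would do: record the attribute in the PCB so
 * every layer of the storage stack can read it later. */
static int sys_set_workload_attr(int pid, enum workload_attr attr)
{
    for (size_t i = 0; i < sizeof(task_table) / sizeof(task_table[0]); i++) {
        if (task_table[i].pid == pid) {
            task_table[i].attr = attr;
            return 0;
        }
    }
    return -1;  /* no such process */
}

/* What the chworkload_attr utility conceptually boils down to. */
int main(void)
{
    int pid = 202;
    if (sys_set_workload_attr(pid, WL_LATENCY_CRITICAL) == 0)
        printf("pid %d marked latency-critical\n", pid);
    return 0;
}
```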

SLIDE 26

Design: Responsiveness Awareness

Key insight: users have better knowledge of the I/O responsiveness of their workloads (i.e., latency-critical vs. throughput).

Our approach:

  • Open a set of APIs to users that pass the workload attribute to the Linux PCB.
  • Deliver the workload attribute to each layer of the storage stack.

[Diagram: the workload attribute flows from the user process's task_struct, to the page cache (address_space and tag array), to the file system BIO, to the block layer (blk-mq) request, to the NVMe driver's nvme_rw_command (rsvd2 field of nvme_cmd), and finally to the NVMe controller and the embedded cache]
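The hand-off at the bottom of this chain can be pictured with the sketch below, which copies the attribute into a reserved field of a simplified NVMe read/write command; the struct layout is a stand-in, not the exact Linux nvme_rw_command definition.

```c
/* Sketch of the attribute hand-off at the bottom of the stack: the driver
 * copies the per-request workload attribute into a reserved field of the
 * NVMe read/write command so the controller can see the urgency. */
#include <stdint.h>
#include <stdio.h>

enum workload_attr { WL_THROUGHPUT = 0, WL_LATENCY_CRITICAL = 1 };

struct rw_command_model {
    uint8_t  opcode;
    uint16_t command_id;
    uint64_t rsvd2;      /* reserved field reused to carry the attribute */
    uint64_t slba;       /* starting LBA                                  */
    uint16_t length;     /* number of blocks                              */
};

struct request_model {
    uint64_t sector;
    uint16_t nblocks;
    enum workload_attr attr;   /* propagated from task_struct via BIO/request */
};

static struct rw_command_model build_nvme_command(const struct request_model *rq)
{
    struct rw_command_model cmd = {
        .opcode     = 0x02,              /* read, for illustration */
        .command_id = 1,
        .rsvd2      = (uint64_t)rq->attr,
        .slba       = rq->sector,
        .length     = rq->nblocks,
    };
    return cmd;
}

int main(void)
{
    struct request_model rq = { .sector = 4096, .nblocks = 8,
                                .attr = WL_LATENCY_CRITICAL };
    struct rw_command_model cmd = build_nvme_command(&rq);
    printf("opcode=0x%02x slba=%llu rsvd2(attr)=%llu\n",
           cmd.opcode, (unsigned long long)cmd.slba,
           (unsigned long long)cmd.rsvd2);
    return 0;
}
```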

SLIDE 27

More optimizations

Advanced caching layer designs:

  • Dynamic cache split scheme: maximize cache hits under various request patterns;
  • Read prefetching: better utilize SSD internal parallelism;
  • Adjustable read prefetching with a ghost cache: adapt to different request patterns (a generic sketch of the ghost-cache idea follows this list).

Hardware accelerator designs:

  • Perform simple but time-consuming tasks such as I/O polling and I/O merging;
  • Simplify the design of blkmq and the NVMe driver.
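For the ghost-cache item above, the following sketch shows the general idea of tracking recently evicted addresses and adjusting the prefetch degree on ghost hits; it illustrates the generic technique under assumed names and thresholds, not FlashShare's actual adaptive policy.

```c
/* Generic sketch of a ghost cache: the firmware remembers addresses it
 * recently evicted, and when a miss hits that ghost list it treats the
 * eviction as premature and grows the prefetch degree. */
#include <stdbool.h>
#include <stdio.h>

#define GHOST_SLOTS 8

static long ghost[GHOST_SLOTS];
static int  ghost_next;
static int  prefetch_degree = 1;   /* how many extra pages to read ahead */

static void ghost_remember(long addr)          /* called on eviction */
{
    ghost[ghost_next] = addr;
    ghost_next = (ghost_next + 1) % GHOST_SLOTS;
}

static bool ghost_contains(long addr)
{
    for (int i = 0; i < GHOST_SLOTS; i++)
        if (ghost[i] == addr)
            return true;
    return false;
}

/* Called on a cache miss: adapt the prefetch degree to the request pattern. */
static void on_cache_miss(long addr)
{
    if (ghost_contains(addr) && prefetch_degree < 8)
        prefetch_degree++;      /* evicted too early: prefetch more        */
    else if (prefetch_degree > 1)
        prefetch_degree--;      /* cold miss: be more conservative         */
    printf("miss at 0x%lx, prefetch degree now %d\n", addr, prefetch_degree);
}

int main(void)
{
    for (int i = 0; i < GHOST_SLOTS; i++)
        ghost[i] = -1;          /* empty ghost list */
    ghost_remember(0x10);
    on_cache_miss(0x10);        /* ghost hit  -> degree grows   */
    on_cache_miss(0x99);        /* cold miss  -> degree shrinks */
    return 0;
}
```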
SLIDE 28

Experiment Setup

Test environment and system configurations:

  • Vanilla – a vanilla Linux-based computer system running on the Z-SSD;
  • CacheOpt – compared to Vanilla, optimizes the caching layer of the SSD firmware;
  • KernelOpt – optimizes the blkmq layer and NVMe I/O submission;
  • SelectISR – compared to KernelOpt, adds the selective-ISR optimization.

http://simplessd.org

SLIDE 29

Evaluation: latency breakdown

[Figure: latency breakdown across the storage stack — 46%, 38%, and 5us reductions]

  • KernelOpt reduces the time cost of the blkmq layer by 46%, thanks to eliminating queuing time;
  • As latency-critical I/Os are fetched by the NVMe controller immediately, KernelOpt drastically reduces the waiting time;
  • CacheOpt better utilizes the embedded cache layer and reduces SSD access delays by 38%;
  • By selectively using the polling mechanism, SelectISR reduces the I/O completion time by 5us.
SLIDE 30

Evaluation: online I/O access

  • CacheOpt reduces the average I/O service latency, but it cannot eliminate the long tails;
  • KernelOpt removes the long tails, because it avoids long queuing times and prevents throughput I/Os from blocking latency-critical I/Os;
  • SelectISR reduces the average latency further, thanks to selectively using the poll mechanism.
SLIDE 31

Conclusion

Observation

The ultra-low latency of new memory-based SSDs is not exposed to latency-critical applications, which therefore see no benefit from a user-experience angle;

Challenge

Piecemeal reform of the current storage stack won't work due to multiple barriers; the storage stack is unaware of the behaviors of both the ULL-SSD and latency-critical applications;

Our solution

FlashShare: We expose different levels of I/O responsiveness to the key components in the current storage stack and optimize the corresponding system layers to make ULL visible to users (latency-critical applications).

Major results

  • Reducing average turnaround response times by 22%;
  • Reducing 99th-percentile turnaround response times by 31%.
SLIDE 32

Thank you