Solving the Linux storage scalability bottlenecks Jens Axboe - PowerPoint PPT Presentation

Solving the Linux storage scalability bottlenecks Jens Axboe Software Engineer Vault 2016

What are the issues? • Devices went from “hundreds of IOPS” to “hundreds of thousands of IOPS” • Increases in core count, and NUMA • Existing IO stack has a lot of data sharing ● For applications ● And between submission and completion • Existing heuristics and optimizations centered around slower storage

Observed problems • The old stack had severe scaling issues ● Even negative scaling ● Wasting lots of CPU cycles • This also lead to much higher latencies • But where are the real scaling bottlenecks hidden?

IO stack F i l e s y s t e m B I O l a y e r ( s t r u c t b i o ) B l o c k l a y e r ( s t r u c t r e q u e s t ) S C S I s t a c k S C S I d r i v e r r e q u e s t _ f n d r i v e r B y p a s s d r i v e r

Seen from the application A p p A p p A p p A p p C P U A C P U B C P U C C P U D F i l e s y s t e m B I O l a y e r B l o c k l a y e r D r i v e r

Seen from the application A p p A p p A p p A p p C P U A C P U B C P U C C P U D F i l e s y s t e m B I O l a y e r H m m m m ! B l o c k l a y e r D r i v e r

Testing the theory • At this point we may have a suspicion of where the bottleneck might be. Let's run a test and see if it backs up the theory. • We use null_blk ● queue_mode=1 completion_nsec=0 irqmode=0 • Fio ● Each thread does pread(2), 4k, randomly, O_DIRECT • Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)

T h a t l o o k s l i k e a l o t o f l o c k c o n t e n t i o n … F i o r e p o r t s s p e n d i n g 9 5 % o f t h e t i m e i n t h e k e r n e l , l o o k s l i k e ~ 7 5 % o f t h a t t i m e i s s p i n n i n g o n l o c k s . L o o k i n g a t c a l l g r a p h s , i t ' s a g o o d m i x o f q u e u e v s c o m p l e t i o n , a n d q u e u e v s q u e u e ( a n d q u e u e - t o - b l o c k v s q u e u e - t o - d r i v e r ) .

A p p A p p A p p A p p C P U A C P U B C P U C C P U D B l o c k l a y e r - R e q u e s t s p l a c e d f o r p r o c e s s i n g - R e q u e s t s r e t r i e v e d b y d r i v e r - R e q u e s t s c o m p l e t i o n s i g n a l e d = = L o t s o f s h a r e d s t a t e ! D r i v e r

Problem areas • We have good scalability until we reach the block layer ● The shared state is a massive issue • A bypass mode driver could work around the problem • We need a real and future proof solution!

Enter block multiqueue • Shares basic name with similar networking functionality, but was built from scratch • Basic idea is to separate shared state ● Between applications ● Between completion and submission • Improve scaling on non-mq hardware was a criteria • Provide a full pool of helper functionality ● Implement and debug once • Become THE queuing model, not “the 3 rd one”

History • Started in 2011 • Original design reworked, fjnalized around 2012 • Merged in 3.13

A p p A p p A p p A p p A p p A p p C P U D C P U A C P U B C P U C C P U E C P U F P e r - c p u s o f t w a r e q u e u e s ( b l k _ m q _ c t x ) H a r d w a r e m a p p i n g q u e u e s ( b l k _ m q _ h w _ c t x ) H a r d w a r e a n d d r i v e r H a r d w a r e q u e u e H a r d w a r e q u e u e H a r d w a r e q u e u e

Application touches private per-cpu queue • A p p A p p C P U A C P U B ● Software queues ● Submission is now almost fully privatized P e r - c p u s o f t w a r e q u e u e s ( b l k _ m q _ c t x ) S u b m i s s i o n s C o m p l e t i o n s H a r d w a r e m a p p i n g q u e u e s ( b l k _ m q _ h w _ c t x ) H a r d w a r e q u e u e H a r d w a r e a n d d r i v e r

A p p A p p C P U A C P U B Software queues map M:N to hardware • P e r - c p u queues s o f t w a r e q u e u e s ( b l k _ m q _ c t x ) ● There are always as many software queues S u b m i s s i o n s as CPUs ● With enough hardware queues, it's a 1:1 C o m p l e t i o n s mapping ● Fewer, and we map based on topology of H a r d w a r e m a p p i n g q u e u e s ( b l k _ m q _ h w _ c t x ) the system H a r d w a r e q u e u e H a r d w a r e a n d d r i v e r

A p p A p p C P U A C P U B P e r - c p u s o f t w a r e q u e u e s ( b l k _ m q _ c t x ) S u b m i s s i o n s C o m p l e t i o n s H a r d w a r e m a p p i n g q u e u e s ( b l k _ m q _ h w _ c t x ) Hardware queues handle dispatch to • hardware and completions H a r d w a r e q u e u e H a r d w a r e a n d d r i v e r

Features • Effjcient and fast versions of: ● T agging ● Timeout handling ● Allocation eliminations ● Local completions • Provides intelligent queue ↔ CPU mappings ● Can be used for IRQ mappings as well • Clean API ● Driver conversions generally remove more code than they add

blk-mq IO fmow I n s e r t i n t o s o f t w a r e S i g n a l h a r d w a r e q u e u e A l l o c a t e b i o F i n d f r e e r e q u e s t M a p b i o t o r e q u e s t q u e u e r u n ? F r e e r e s o u r c e s S l e e p o n f r e e r e q u e s t H a r d w a r e q u e u e r u n s ( b i o , m a r k r q a s f r e e ) C o m p l e t e I O H a r d w a r e I R Q e v e n t S u b m i t t o h a r d w a r e

Block layer IO fmow A l l o c a t e b i o A l l o c a t e r e q u e s t M a p b i o t o r e q u e s t I n s e r t i n t o q u e u e S i g n a l d r i v e r ( ? ) F r e e r e s o u r c e s A l l o c a t e d r i v e r ( r e q u e s t , b i o , t a g , S l e e p o n r e s o u r c e s c o m m a n d a n d S G A l l o c a t e t a g D r i v e r r u n s H a r d w a r e , e t c ) l i s t P u l l r e q u e s t o ff b l o c k C o m p l e t e I O H a r d w a r e I R Q e v e n t S u b m i t t o h a r d w a r e l a y e r q u e u e

Completions • Want completions as local as possible ● Even without queue shared state, there's still the request • Particularly for fewer/single hardware queue design, care must be taken to minimize sharing • If completion queue can place event, we use that ● If not, IPI

A p p A p p C P U A C P U B C o m p l e t e I O P e r - c p u s o f t w a r e q u e u e s ( b l k _ m q _ c t x ) S u b m i s s i o n s C o m p l e t i o n s H a r d w a r e m a p p i n g q u e u e s ( b l k _ m q _ h w _ c t x ) Y e s I R Q I P I t o r i g h t C P U H a r d w a r e q u e u e N o I R Q i n r i g h t l o c a t i o n ? H a r d w a r e a n d d r i v e r

Tagging • Almost all hardware uses tags to identify IO requests ● Must get a free tag on request issue ● Must return tag to pool on completion “ T h i s i s a r e q u e s t i d e n t i fi e d b y t a g = 0 x 1 3 ” D r i v e r H a r d w a r e “ T h i s i s t h e c o m p l e t i o n e v e n t f o r t h e r e q u e s t i d e n t i fi e d b y t a g = 0 x 1 3 ”

Tag support • Must have features: ● Effjcient at or near tag exhaustion ● Effjcient for shared tag maps • Blk-mq implements a novel bitmap tag approach ● Software queue hinting (sticky) ● Sparse layout ● Rolling wakeups

Solving the Linux storage scalability bottlenecks Jens Axboe - PowerPoint PPT Presentation

Solving the Linux storage scalability bottlenecks Jens Axboe Software Engineer Vault 2016 What are the issues? Devices went from hundreds of IOPS to hundreds of thousands of IOPS Increases in core count, and NUMA

Scalability and Replication Marco Serafini COMPSCI 532 Lecture 13 Scalability 2 Scalability

Introduction to Linux Aline Abler Aline Abler Linux, whats that? The pieces of a Linux

Linux Overview Amir Hossein Payberah payberah@gmail.com 1 Agenda Linux Overview Linux

Linux from Sensors to Servers ! When is Linux Not Linux? ! 1 1 Linux runs across a huge range

Integrating Problem Solving 2020 Integrating Problem Solving 2020 Integrating Problem Solving

Performance and Scalability (Chapter 11) Performance and Scalability Performance: How long

Root zone scalability model Bart Gijsen October 28, 2009 Root zone scalability model

Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org

Agenda Freight-Caused Roadway Bottlenecks Roadway Freight Network Freight Strategy

Linux Kung Fu Introduction What is Linux? Why Linux? What is the difference between a client

GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications

Creating Web Farms with Linux (Linux High Availability and Scalability) Horms (Simon Horman)

Creating Web Farms with Linux (Linux High Availability and Scalability) Horms (Simon Horman)

Linux-iSCSI.org BoF Linux-iSCSI.org BoF Current Status and Future of iSCSI on the Current Status

The State of the Linux Desktop An OSDL Perspective John Cherry OSDL Desktop Linux (DTL)

Introduction to Linux Introduction to Linux Phil Mercurio The Scripps Research Institute

Computer Systems Performance Evaluation Carey Williamson Department of Computer Science

Reference Tables on Probability Distributions and Statistics (1) Source: Arnold O. Allen,

PAFFI: Performance Analysis Framework for Fog Infrastructures in realistic scenarios Claudia

Queuing Analysis Gregory (Grisha) Chockler, Zinovi Rabinovich, Ittai Abraham Operating Systems

Chapter 2 Simulation Examples Banks, Carson, Nelson & Nicol Discrete-Event System Simulation

Algorithms Theory 07 Binomial Queues Prof. Dr. S. Albers Priority queues: operations

CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 4: Packet switching

Networked Servers Subject to MMPP Arrival Process Bruno Ciciani, Andrea Santoro, Paolo Romano

Solving the Linux storage scalability bottlenecks Jens Axboe - PowerPoint PPT Presentation

Solving the Linux storage scalability bottlenecks Jens Axboe Software Engineer Vault 2016 What are the issues? Devices went from hundreds of IOPS to hundreds of thousands of IOPS Increases in core count, and NUMA

Scalability and Replication Marco Serafini COMPSCI 532 Lecture 13 Scalability 2 Scalability

Introduction to Linux Aline Abler Aline Abler Linux, whats that? The pieces of a Linux

Linux Overview Amir Hossein Payberah payberah@gmail.com 1 Agenda Linux Overview Linux

Linux from Sensors to Servers ! When is Linux Not Linux? ! 1 1 Linux runs across a huge range

Integrating Problem Solving 2020 Integrating Problem Solving 2020 Integrating Problem Solving

Performance and Scalability (Chapter 11) Performance and Scalability Performance: How long

Root zone scalability model Bart Gijsen October 28, 2009 Root zone scalability model

Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org

Agenda Freight-Caused Roadway Bottlenecks Roadway Freight Network Freight Strategy

Linux Kung Fu Introduction What is Linux? Why Linux? What is the difference between a client

GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications

Creating Web Farms with Linux (Linux High Availability and Scalability) Horms (Simon Horman)

Creating Web Farms with Linux (Linux High Availability and Scalability) Horms (Simon Horman)

Linux-iSCSI.org BoF Linux-iSCSI.org BoF Current Status and Future of iSCSI on the Current Status

The State of the Linux Desktop An OSDL Perspective John Cherry OSDL Desktop Linux (DTL)

Introduction to Linux Introduction to Linux Phil Mercurio The Scripps Research Institute

Computer Systems Performance Evaluation Carey Williamson Department of Computer Science

Reference Tables on Probability Distributions and Statistics (1) Source: Arnold O. Allen,

PAFFI: Performance Analysis Framework for Fog Infrastructures in realistic scenarios Claudia

Queuing Analysis Gregory (Grisha) Chockler, Zinovi Rabinovich, Ittai Abraham Operating Systems

Chapter 2 Simulation Examples Banks, Carson, Nelson &amp; Nicol Discrete-Event System Simulation

Algorithms Theory 07 Binomial Queues Prof. Dr. S. Albers Priority queues: operations

CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 4: Packet switching

Networked Servers Subject to MMPP Arrival Process Bruno Ciciani, Andrea Santoro, Paolo Romano

Chapter 2 Simulation Examples Banks, Carson, Nelson & Nicol Discrete-Event System Simulation