Solving the Linux storage scalability bottlenecks, Jens Axboe (PowerPoint presentation)



SLIDE 1

SLIDE 2

Solving the Linux storage scalability bottlenecks

Jens Axboe

Software Engineer, Vault 2016

SLIDE 3
  • Devices went from “hundreds of IOPS” to “hundreds of thousands of IOPS”
  • Increases in core count, and NUMA
  • Existing IO stack has a lot of data sharing
    • For applications
    • And between submission and completion
  • Existing heuristics and optimizations centered around slower storage

What are the issues?

SLIDE 4
  • The old stack had severe scaling issues
    • Even negative scaling
  • Wasting lots of CPU cycles
  • This also led to much higher latencies
  • But where are the real scaling bottlenecks hidden?

Observed problems

SLIDE 5

[Diagram] Filesystem → BIO layer (struct bio) → Block layer (struct request) → SCSI stack → SCSI driver, alongside request_fn drivers and bypass drivers.

IO stack

SLIDE 6

[Diagram] App CPU A, App CPU B, App CPU C, App CPU D → Filesystem → BIO layer → Block layer → Driver

Seen from the application

SLIDE 7

[Diagram] App CPU A, App CPU B, App CPU C, App CPU D → Filesystem → BIO layer → Block layer → Driver. Hmmmm!

Seen from the application

SLIDE 8
  • At this point we may have a suspicion of where the bottleneck might be. Let's run a test and see if it backs up the theory.
  • We use null_blk
    • queue_mode=1 completion_nsec=0 irqmode=0
  • Fio
    • Each thread does pread(2), 4k, randomly, O_DIRECT
    • Each added thread alternates between the two available NUMA nodes (2 socket system, 32 threads)

Testing the theory
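The alternating thread placement described above can be modeled in a few lines. This is only a sketch of the benchmark layout: the CPU numbering per NUMA node (node 0 = CPUs 0-15, node 1 = CPUs 16-31) is an assumption for illustration, since the real layout comes from the machine's topology.

```python
# Model of the benchmark's thread placement: each added fio thread
# alternates between the two NUMA nodes of a 2-socket system.
# The CPU numbering per node is an assumption for illustration;
# a real run would read the topology from /sys or use numactl.

NODE_CPUS = {0: list(range(0, 16)), 1: list(range(16, 32))}

def placement(num_threads):
    """Return (thread, node, cpu) for each thread, alternating nodes."""
    counters = {0: 0, 1: 0}
    out = []
    for t in range(num_threads):
        node = t % 2                                  # alternate nodes
        cpus = NODE_CPUS[node]
        cpu = cpus[counters[node] % len(cpus)]        # next CPU on that node
        counters[node] += 1
        out.append((t, node, cpu))
    return out

if __name__ == "__main__":
    for t, node, cpu in placement(8):
        print(f"thread {t} -> node {node}, cpu {cpu}")
```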

SLIDE 9

SLIDE 10

That looks like a lot of lock contention… Fio reports spending 95% of the time in the kernel; ~75% of that time looks like spinning on locks. Looking at call graphs, it's a good mix of queue vs completion, and queue vs queue (and queue to block vs queue to driver).

SLIDE 11

[Diagram] App CPU A, App CPU B, App CPU C, App CPU D → Block layer → Driver

  • Requests placed for processing
  • Requests retrieved by driver
  • Request completions signaled

== Lots of shared state!

SLIDE 12
  • We have good scalability until we reach the block layer
  • The shared state is a massive issue
  • A bypass mode driver could work around the problem
  • We need a real and future-proof solution!

Problem areas

SLIDE 13
  • Shares basic name with similar networking functionality, but was built from scratch
  • Basic idea is to separate shared state
    • Between applications
    • Between completion and submission
  • Improving scaling on non-mq hardware was a criterion
  • Provide a full pool of helper functionality
    • Implement and debug once
  • Become THE queuing model, not “the 3rd one”

Enter block multiqueue
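The core idea of separating shared state can be illustrated with a toy model: one queue whose single lock every CPU must take, versus private per-CPU queues that need no cross-CPU sharing on submission. `SharedQueue` and `PerCpuQueues` are illustrative names, not kernel APIs.

```python
# Toy model of the core blk-mq idea: replace one shared submission
# queue (one lock touched by every CPU) with per-CPU software queues.
import threading
from collections import deque

class SharedQueue:
    """Old model: every submitter serializes on a single lock."""
    def __init__(self):
        self.lock = threading.Lock()
        self.items = deque()
        self.lock_acquisitions = 0

    def submit(self, cpu, item):
        with self.lock:                    # all CPUs contend here
            self.lock_acquisitions += 1
            self.items.append((cpu, item))

class PerCpuQueues:
    """blk-mq model: each CPU submits to its own private queue."""
    def __init__(self, ncpus):
        self.queues = [deque() for _ in range(ncpus)]

    def submit(self, cpu, item):
        self.queues[cpu].append((cpu, item))   # no cross-CPU sharing

if __name__ == "__main__":
    shared, private = SharedQueue(), PerCpuQueues(4)
    for cpu in range(4):
        for i in range(1000):
            shared.submit(cpu, i)
            private.submit(cpu, i)
    print(shared.lock_acquisitions)          # 4000 hits on ONE lock
    print([len(q) for q in private.queues])  # work spread over 4 queues
```

The point of the model is only that submission-side state stops being global; completion-side separation is handled by the hardware queue mapping on the following slides.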

SLIDE 14
  • Started in 2011
  • Original design reworked, finalized around 2012
  • Merged in 3.13

History

SLIDE 15

[Diagram] App CPU A through App CPU F → Per-cpu software queues (blk_mq_ctx) → Hardware mapping queues (blk_mq_hw_ctx) → Hardware queues (three shown) → Hardware and driver

SLIDE 16

[Diagram] App CPU A, App CPU B → Per-cpu software queues (blk_mq_ctx) → Hardware mapping queues (blk_mq_hw_ctx) → Hardware queue → Hardware and driver. Submissions flow down; completions flow back up.

  • Application touches private per-cpu queue
    • Software queues
  • Submission is now almost fully privatized
SLIDE 17

[Diagram] App CPU A, App CPU B → Per-cpu software queues (blk_mq_ctx) → Hardware mapping queues (blk_mq_hw_ctx) → Hardware queue → Hardware and driver. Submissions flow down; completions flow back up.

  • Software queues map M:N to hardware queues
  • There are always as many software queues as CPUs
  • With enough hardware queues, it's a 1:1 mapping
  • Fewer, and we map based on topology of the system
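The mapping rules above can be sketched in a few lines. The even spreading used when hardware queues are scarce is a simplification: the kernel maps based on system topology (siblings, NUMA nodes), not a plain round-robin.

```python
# Sketch of the software-queue -> hardware-queue mapping rule:
# one software queue per CPU, mapped 1:1 when the device exposes
# enough hardware queues, otherwise grouped. The even grouping here
# stands in for the kernel's topology-aware mapping.

def map_queues(num_cpus, num_hw_queues):
    """Return {cpu: hw_queue_index} for every CPU."""
    if num_hw_queues >= num_cpus:
        return {cpu: cpu for cpu in range(num_cpus)}   # 1:1 mapping
    # Fewer hardware queues: spread CPUs evenly across them.
    return {cpu: cpu * num_hw_queues // num_cpus for cpu in range(num_cpus)}

if __name__ == "__main__":
    print(map_queues(4, 4))   # 1:1
    print(map_queues(8, 2))   # CPUs 0-3 -> hwq 0, CPUs 4-7 -> hwq 1
```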

SLIDE 18

[Diagram] App CPU A, App CPU B → Per-cpu software queues (blk_mq_ctx) → Hardware mapping queues (blk_mq_hw_ctx) → Hardware queue → Hardware and driver. Submissions flow down; completions flow back up.

  • Hardware queues handle dispatch to hardware and completions

SLIDE 19
  • Efficient and fast versions of:
    • Tagging
    • Timeout handling
    • Allocation eliminations
    • Local completions
  • Provides intelligent queue ↔ CPU mappings
    • Can be used for IRQ mappings as well
  • Clean API
    • Driver conversions generally remove more code than they add

Features

SLIDE 20

[Diagram] Allocate bio → Find free request (or sleep on free request) → Map bio to request → Insert into software queue → Signal hardware queue run? → Hardware queue runs → Submit to hardware → Hardware IRQ event → Complete IO → Free resources (bio, mark rq as free)

blk-mq IO flow
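The flow can be modeled end to end in a few lines. This is a conceptual sketch (the class and method names are invented for illustration), not the kernel implementation: a fixed pool of free requests, a software queue, and a hardware-queue run that completes IO and returns requests to the pool.

```python
# Conceptual model of the blk-mq submission flow: a fixed pool of
# free requests, a software queue, and a hardware queue that
# completes IO and marks requests free again.
from collections import deque

class BlkMqModel:
    def __init__(self, pool_size):
        self.free_requests = deque(range(pool_size))   # preallocated rq pool
        self.software_queue = deque()
        self.completed = []

    def submit_bio(self, bio):
        if not self.free_requests:
            return False                     # caller would sleep on a free request
        rq = self.free_requests.popleft()    # find free request
        self.software_queue.append((rq, bio))  # map bio to rq, insert into sw queue
        return True

    def run_hardware_queue(self):
        # Hardware queue runs: submit to hardware; the IRQ event completes
        # the IO, then resources are freed (rq marked free again).
        while self.software_queue:
            rq, bio = self.software_queue.popleft()
            self.completed.append(bio)
            self.free_requests.append(rq)

if __name__ == "__main__":
    m = BlkMqModel(pool_size=2)
    print(m.submit_bio("a"), m.submit_bio("b"))  # True True
    print(m.submit_bio("c"))                     # False: pool exhausted
    m.run_hardware_queue()
    print(m.submit_bio("c"))                     # True: requests freed
```

Note how few allocations sit on the fast path compared with the old flow on the next slide: the request pool is preallocated, so submission only moves a request between queues.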

SLIDE 21

[Diagram] Allocate bio → Allocate request (or sleep on resources) → Map bio to request → Insert into queue → Signal driver (?) → Driver runs → Pull request off block layer queue → Allocate tag → Allocate driver command and SG list → Submit to hardware → Hardware IRQ event → Complete IO → Free resources (request, bio, tag, hardware, etc)

Block layer IO flow
SLIDE 22
  • Want completions as local as possible
  • Even without queue shared state, there's still the request
  • Particularly for fewer/single hardware queue designs, care must be taken to minimize sharing
  • If the completion queue can place the event locally, we use that
    • If not, IPI

Completions
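The local-completion rule reduces to a simple decision: finish the IO where it was submitted. The function below is an illustrative sketch; the IPI is just a returned decision here, whereas the kernel issues a real cross-CPU call.

```python
# Sketch of the completion-routing rule: complete on the CPU that
# submitted the IO. If the IRQ already landed in the right place,
# finish there; otherwise redirect with an IPI.

def route_completion(irq_cpu, submitting_cpu):
    """Return the action taken to complete an IO locally."""
    if irq_cpu == submitting_cpu:
        return ("complete_local", irq_cpu)
    return ("ipi", submitting_cpu)   # bounce completion to the submitter

if __name__ == "__main__":
    print(route_completion(irq_cpu=3, submitting_cpu=3))  # completes locally
    print(route_completion(irq_cpu=0, submitting_cpu=3))  # IPI to CPU 3
```

Keeping the completion on the submitting CPU means the request's cachelines are still hot there, which is the sharing-minimization point made above.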

SLIDE 23

[Diagram] App CPU A, App CPU B → Per-cpu software queues (blk_mq_ctx) → Hardware mapping queues (blk_mq_hw_ctx) → Hardware queue → Hardware and driver. On completion: IRQ in right location? Yes → Complete IO. No → IPI to right CPU → Complete IO.

SLIDE 24
  • Almost all hardware uses tags to identify IO requests
  • Must get a free tag on request issue
  • Must return tag to pool on completion

Tagging

[Diagram] Driver → Hardware: “This is a request identified by tag=0x13”. Hardware → Driver: “This is the completion event for the request identified by tag=0x13”.

SLIDE 25
  • Must-have features:
    • Efficient at or near tag exhaustion
    • Efficient for shared tag maps
  • Blk-mq implements a novel bitmap tag approach
    • Software queue hinting (sticky)
    • Sparse layout
    • Rolling wakeups

Tag support

SLIDE 26

Sparse tag maps

$ cat /sys/block/sda/mq/0/tags
nr_tags=31, reserved_tags=0, bits_per_word=2
nr_free=31, nr_reserved=0

[Diagram] Tag words spread across cachelines (generally 64b each): | Tag values 0-1 | Tag values 2-3 | Tag values 4-5 | Tag values 6-7 |, with App A, App B, App C, App D each sticking to a different cacheline.

  • Applications tend to stick to software queues
  • Utilize that concept to make them stick to tag cachelines
    • Cache last tag in software queue
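A toy version of the sparse, sticky tag map might look like this. It is a deliberate simplification: the class name is invented, a set stands in for the bitmap words, and exhaustion returns None instead of the rolling-wakeup sleep the real code uses. The sticky behavior is the part being illustrated: searching from the cached last tag keeps each software queue on its own region of the map.

```python
# Toy model of the sparse, sticky tag map: each software queue caches
# its last tag and searches from there, so different applications tend
# to touch different regions (cachelines) of the tag space.

class SparseTagMap:
    def __init__(self, nr_tags):
        self.nr_tags = nr_tags
        self.in_use = set()          # stands in for the bitmap words

    def get_tag(self, last_tag_hint=0):
        # Start searching from the hint so a queue tends to stay on
        # the same region of the map it used last time.
        for off in range(self.nr_tags):
            tag = (last_tag_hint + off) % self.nr_tags
            if tag not in self.in_use:
                self.in_use.add(tag)
                return tag
        return None   # exhausted; real code sleeps with rolling wakeups

    def put_tag(self, tag):
        self.in_use.discard(tag)     # return tag to the pool on completion

if __name__ == "__main__":
    tags = SparseTagMap(nr_tags=8)
    print(tags.get_tag(last_tag_hint=0))   # app A stays near the low tags
    print(tags.get_tag(last_tag_hint=4))   # app B stays near the high tags
```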
SLIDE 27
  • We use null_blk
    • queue_mode=2 completion_nsec=0 irqmode=0 submit_queues=32
  • Fio
    • Each thread does pread(2), 4k, randomly, O_DIRECT
    • Each added thread alternates between the two available NUMA nodes (2 socket system)

Rerunning the test case

SLIDE 28

SLIDE 29

SLIDE 30

Single queue mode: basically all system time is spent banging on the device queue lock. Fio reports 95% of the time spent in the kernel. Max completion time is 10x higher than blk-mq mode; the 50th percentile is 24 usec. In blk-mq mode, locking time is drastically reduced and the profile is much cleaner. Fio reports 74% of the time spent in the kernel. The 50th percentile is 3 usec.

SLIDE 31

“But Jens, isn't most storage hardware still single queue? What about single queue performance on blk-mq?”

— Astute audience member

SLIDE 32

SLIDE 33
  • SCSI had severe scaling issues
    • Per LUN performance limited to ~150K IOPS
  • SCSI queuing layered on top of blk-mq
    • Initially by Nic Bellinger (Datera), later continued by Christoph Hellwig
  • Merged in 3.17
    • CONFIG_SCSI_MQ_DEFAULT=y
    • scsi_mod.use_blk_mq=1
  • Helped drive some blk-mq features

Scsi-mq

SLIDE 34

[Graph from Christoph Hellwig]

SLIDE 35
  • Backport
    • Ran a pilot last year; results were so good it was immediately put in production.
  • Running in production at Facebook
    • TAO, cache
  • Biggest win was in latency reductions
    • FB workloads not that IOPS intensive
    • But still saw sys % wins too

At Facebook

SLIDE 36

SLIDE 37

SLIDE 38
  • As of 4.6-rc4
    • mtip32xx (Micron SSD)
    • NVMe
    • virtio_blk, xen block driver
    • rbd (ceph block)
    • loop
    • ubi
    • SCSI
  • All over the map (which is good)

Conversion progress

SLIDE 39
  • An IO scheduler
  • Better helpers for IRQ affinity mappings
  • IO accounting
  • IO polling
  • More conversions
  • Long term goal remains killing off request_fn

Future work

SLIDE 40