

SLIDE 1

Keeping up with the hardware

Challenges in scaling I/O performance

Jonathan Davies

XenServer System Performance Lead

XenServer Engineering, Citrix Cambridge, UK

18 Aug 2015

SLIDE 2

Outline

1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

SLIDE 3

The virtualisation performance challenge

Outline

1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

SLIDE 4

The virtualisation performance challenge

Recent hardware trends

[Chart: device speed (log scale) against year, 2000–2015. NIC speeds climb from 1 Gb/s through 10 Gb/s and 40 Gb/s to 100 Gb/s; disks progress from HDD to SSD to NVMe; CPU speeds stay roughly flat.]

SLIDE 5

The virtualisation performance challenge

Virtualisation overhead is increasing

As I/O devices get faster while CPU speeds remain roughly constant, the relative virtualisation overhead increases:

[Chart: for old I/O devices, the time spent on the physical device dwarfs the virtualisation overhead; for modern I/O devices, the device time shrinks and the overhead becomes a much larger fraction of the total.]
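One way to make this concrete (my notation, not from the slide): write \(t_{\mathrm{virt}}\) for the time an I/O operation spends in the virtualisation layers and \(t_{\mathrm{dev}}\) for the time it spends on the physical device. The overhead fraction is then

\[
  f_{\mathrm{overhead}} \;=\; \frac{t_{\mathrm{virt}}}{t_{\mathrm{virt}} + t_{\mathrm{dev}}},
\]

so as devices get faster (\(t_{\mathrm{dev}} \to 0\)) while \(t_{\mathrm{virt}}\) stays roughly fixed, \(f_{\mathrm{overhead}}\) tends towards 1: the virtualisation path comes to dominate the cost of each I/O.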

SLIDE 6

Networking performance

Outline

1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

SLIDE 7

Networking performance

Areas of weak networking performance

Metric                                   Xen's performance
Intrahost VM-to-VM throughput            weak
Intrahost aggregate throughput           weak
Interhost from-VM transmit throughput    strong
Interhost into-VM receive throughput     weak
Interhost aggregate throughput           strong

SLIDE 8

Networking performance Improving intrahost single-stream throughput

Outline

1. The virtualisation performance challenge
2. Networking performance
   • Improving intrahost single-stream throughput
   • Improving intrahost aggregate throughput
   • Summary
3. Storage performance

SLIDE 9

Networking performance Improving intrahost single-stream throughput

Where do we stand?

Intrahost VM-to-VM single-stream throughput measurements (using CentOS 7):

XenServer 6.5: 15 Gb/s
Target: 30 Gb/s

(more is better)

Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 10

Networking performance Improving intrahost single-stream throughput

It’s even worse with an upstream guest kernel!

Intrahost VM-to-VM single-stream throughput measurements (using CentOS 7):

XenServer 6.5: 15 Gb/s
XenServer 6.5 (guests with 4.0 kernel): 9 Gb/s
Target: 30 Gb/s

(more is better)

Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 11

Networking performance Improving intrahost single-stream throughput

Datapath analysis with 4.0 kernel in guests

[Trace: timeline (x-axis: tsc/1000) of per-packet datapath events, from the transmitting guest kernel (tcp_transmit_skb, skb clone, IP layer, handoff to netfront), through netfront writing the TX ring slots, netback grant operations (build_gops, grant-copy, grant-map), the dom0 bridge, rx netback and its dealloc thread, and finally rx netfront filling frags and passing the skb to the receiving guest kernel.]

Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 12

Networking performance Improving intrahost single-stream throughput

Datapath analysis with 4.0 kernel in guests

[Trace: the per-packet datapath events above plotted against tsc/1000 (≈2000–12000), from tx kernel calling tcp_transmit_skb through to rx netfront passing the skb to the kernel.]

Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 13

Networking performance Improving intrahost single-stream throughput

Transmitter often stalls; only ever two packets in flight

[Trace: same per-packet datapath event timeline (tsc/1000).]

Red boxes: periods when netfront is not running

Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 14

Networking performance Improving intrahost single-stream throughput

Principal bottleneck: high TX completion latency

High TX completion latency is a serious problem for guests running 4.x kernels, which aggressively limit the amount of uncompleted transmit data.

Definition of TX completion latency

[Timeline: skb generated by guest → request put in TX ring → request consumed by dom0 → response received in TX ring. The TX completion latency is the time from the request being placed in the TX ring until the response is received.]
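In symbols (my notation, following the timeline above):

\[
  T_{\mathrm{TX\ completion}} \;=\; t_{\mathrm{response\ received\ in\ TX\ ring}} \;-\; t_{\mathrm{request\ put\ in\ TX\ ring}} .
\]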

SLIDE 15

Networking performance Improving intrahost single-stream throughput

The transmitter waits for TX completion

[Trace: same per-packet datapath event timeline (tsc/1000).]

Yellow slice: point of TX completion

Two CentOS 7.0 VMs (4.0.9 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 16

Networking performance Improving intrahost single-stream throughput

Principal bottleneck: high TX completion latency

Idea to reduce TX completion latency

1. Pretend that TX completion happens as soon as netback consumes the request (a sketch follows below).
   This can be done using skb_orphan, which runs the skb's destructor early and detaches the skb from its socket, so the socket's accounting no longer waits for the skb to actually be freed.
   Rationale: on physical NIC drivers, TX completion occurs when the packet has hit the wire, not when it has reached the receiver's queue.

[Timeline: skb generated by guest → request put in TX ring → request consumed by dom0 (orphan the skb here?) → response received in TX ring. Orphaning the skb when dom0 consumes the request shortens the effective TX completion latency.]
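A minimal sketch of the idea (hypothetical code, not the xen-netfront implementation; the pending_tx bookkeeping and early_tx_complete helper are invented for illustration): once the backend's consumer index has passed an skb's last ring slot, orphan the skb so the guest's TCP stack stops counting it as in-flight, even though the real TX response has not yet arrived.

/* Hypothetical sketch, not the actual xen-netfront patch. */
#include <linux/skbuff.h>

struct pending_tx {
	struct sk_buff *skb;      /* skb still waiting for its TX response */
	unsigned int last_slot;   /* last TX ring slot this skb occupies   */
	bool orphaned;            /* already detached from its socket?     */
};

/*
 * backend_consumed: the consumer index the backend has advanced past
 * (ring-index wrap-around ignored for brevity).
 */
static void early_tx_complete(struct pending_tx *pend, unsigned int n,
			      unsigned int backend_consumed)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		struct pending_tx *p = &pend[i];

		if (p->orphaned || p->last_slot >= backend_consumed)
			continue;

		/*
		 * skb_orphan() runs the skb's destructor and clears skb->sk,
		 * releasing the socket's write-memory accounting now; the skb
		 * itself is still freed later, when the real response arrives.
		 */
		skb_orphan(p->skb);
		p->orphaned = true;
	}
}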

SLIDE 17

Networking performance Improving intrahost single-stream throughput

Datapath analysis with 3.18 kernel in guests

[Trace: same per-packet datapath event timeline (tsc/1000).]

Two CentOS 7.0 VMs (3.18.20 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 18

Networking performance Improving intrahost single-stream throughput

The main problem is still TX completion latency

[Trace: same per-packet datapath event timeline (tsc/1000).]

Red boxes: periods when netfront is not running

Two CentOS 7.0 VMs (3.18.20 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 19

Networking performance Improving intrahost single-stream throughput

Next bottleneck: NAPI CPU utilisation

[Trace: same per-packet datapath event timeline (tsc/1000).]

Red boxes: periods when NAPI is not running

Two CentOS 7.0 VMs (3.18.20 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 20

Networking performance Improving intrahost single-stream throughput

Next bottleneck: NAPI CPU utilisation

After TX completion latency, the next bottleneck is that netback's NAPI thread (softirq context) fully utilises a CPU.

Ideas to reduce NAPI CPU utilisation

1. Avoid spilling over into a frag-list by copying more data into the main skb.
   Rationale: it is much more costly to handle an skb with a frag-list, so try to fit the data into a single skb. For intrahost VM-to-VM traffic, around 30% of skbs have a frag-list.

2. Unbatch grant-map operations (see the sketch after this list).
   Rationale: historically, batching was best because of the per-hypercall overhead, but recent improvements in grant-map locking mean it is no longer so expensive.
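To illustrate what "unbatching" means here (a sketch under my own assumptions, not the actual netback change; the two helper names are invented), compare mapping a packet's grant references with one batched call against one call per reference using the kernel's gnttab_map_refs():

#include <xen/grant_table.h>
#include <linux/mm.h>

/* Batched: one operation maps every slot of the packet at once. */
static int map_all_batched(struct gnttab_map_grant_ref *ops,
			   struct page **pages, unsigned int count)
{
	return gnttab_map_refs(ops, NULL, pages, count);
}

/*
 * Unbatched: one operation per slot.  Per the slide, recent grant-table
 * locking improvements in Xen mean the per-entry cost is no longer
 * prohibitive, so the simpler per-slot path becomes viable.
 */
static int map_each_unbatched(struct gnttab_map_grant_ref *ops,
			      struct page **pages, unsigned int count)
{
	unsigned int i;
	int err;

	for (i = 0; i < count; i++) {
		err = gnttab_map_refs(&ops[i], NULL, &pages[i], 1);
		if (err)
			return err;
	}
	return 0;
}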

SLIDE 21

Networking performance Improving intrahost single-stream throughput

Avoiding frag-lists and unbatching grant-map

[Trace: same per-packet datapath event timeline (tsc/1000).]

Red boxes: periods when NAPI is not running

Two CentOS 7.0 VMs (3.18.20 kernel) on Dell R720 (2 × Xeon E5-2643 v2)

SLIDE 22

Networking performance Improving intrahost single-stream throughput

NAPI CPU utilisation bottleneck

These ideas make the datapath look a lot 'cleaner', but they do not reduce the CPU utilisation noticeably.

Conclusion

Further work is required to increase the efficiency of the NAPI thread.

SLIDE 23

Networking performance Improving intrahost aggregate throughput

Outline

1. The virtualisation performance challenge
2. Networking performance
   • Improving intrahost single-stream throughput
   • Improving intrahost aggregate throughput
   • Summary
3. Storage performance

SLIDE 24

Networking performance Improving intrahost aggregate throughput

Intrahost aggregate throughput measurements

XenServer 6.5: 33 Gb/s
Target: (value not captured)

(more is better)

Dell R730 (2 × Xeon E5-2670 v3)

SLIDE 25

Networking performance Improving intrahost aggregate throughput

Intrahost aggregate throughput analysis

Intrahost aggregate throughput is typically limited by dom0 CPU utilisation.

Ideas to improve aggregate throughput

1. Improve grant-map scalability:
   • per-vCPU maptrack free lists – already in Xen 4.6
   • per-active-entry locking – already in Xen 4.6
   • avoid TLB flush on unmap – patches proposed by Malcolm Crossley

2. Provide dom0 with more CPU power.

SLIDE 26

Networking performance Improving intrahost aggregate throughput

Grant-map locking improvements have really helped

Aggregate intrahost throughput, 40 VMs

[Chart: aggregate throughput (Gb/s, y-axis up to ≈30) against number of dom0 vCPUs, before and after the grant-map improvements.]

Dell R730 (2 × Xeon E5-2670 v3)

SLIDE 27

Networking performance Summary

Outline

1. The virtualisation performance challenge
2. Networking performance
   • Improving intrahost single-stream throughput
   • Improving intrahost aggregate throughput
   • Summary
3. Storage performance

SLIDE 28

Networking performance Summary

Summary

Bottlenecks with intrahost VM-to-VM throughput (listed in order):

  • TX completion latency – potential mitigation using skb_orphan
  • NAPI CPU utilisation – prototype showed minimal improvement

Bottlenecks with aggregate intrahost throughput:

  • dom0 CPU utilisation – already improved in Xen 4.6

Future work

  • Work to minimise TX completion latency is required to avoid a regression with recent kernels.
  • Further optimisations need implementing.

SLIDE 29

Storage performance

Outline

1. The virtualisation performance challenge
2. Networking performance
3. Storage performance

SLIDE 30

Storage performance

Xen is weakest in single-VBD performance

Metric                              Xen's performance
Single-VBD throughput               weak
Multiple-VBD aggregate throughput   strong

For example, consider 4 KB serial IOPS:

[Chart: 4 KB serial IOPS for XenServer 6.5 versus the target; the target is higher (more is better).]

Debian 6.0 VM on Dell R815 (Opteron 6272), Intel S3700 SSD

Deficiencies with single-VBD performance

1. Latency is too high
2. Not enough data in-flight
3. Backend CPU utilisation too high

SLIDE 31

Storage performance Reduce latency

Outline

1. The virtualisation performance challenge
2. Networking performance
3. Storage performance
   • Reduce latency
   • Allow more data in-flight
   • Summary

SLIDE 32

Storage performance Reduce latency

Reduce latency

The problem

Latency is too high. This especially impacts serial I/O with small block sizes. XenServer uses tapdisk3, a user-space backend that uses grant-copy via the gntdev.

Ideas to reduce latency

1. Polling in the backend (see the sketch after this list).
   Rationale: event-channel and backend-scheduling latency is too high.

2. Use grant-map in the backend.
   Rationale: in principle, grant-copy should be slower than grant-map.
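A sketch of what backend polling could look like (illustrative only, not tapdisk3 code; ring_has_unconsumed_requests, process_requests and clear_evtchn_notification are invented helpers): keep busy-polling the shared ring for roughly 1 ms after going idle, and only then fall back to blocking on the event-channel file descriptor.

#include <poll.h>
#include <stdbool.h>
#include <time.h>

#define POLL_WINDOW_NS (1 * 1000 * 1000)        /* ~1 ms busy-poll window */

extern bool ring_has_unconsumed_requests(void); /* invented helpers */
extern void process_requests(void);
extern void clear_evtchn_notification(int evtchn_fd);

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void backend_loop(int evtchn_fd)
{
	for (;;) {
		long long deadline = now_ns() + POLL_WINDOW_NS;

		/* Busy-poll: pick up new requests without waiting for an
		 * event-channel notification or for the backend to be
		 * rescheduled. */
		while (now_ns() < deadline) {
			if (ring_has_unconsumed_requests()) {
				process_requests();
				deadline = now_ns() + POLL_WINDOW_NS;
			}
		}

		/* Idle for a whole window: block until the frontend kicks
		 * the event channel, then start polling again. */
		struct pollfd pfd = { .fd = evtchn_fd, .events = POLLIN };

		poll(&pfd, 1, -1);
		clear_evtchn_notification(evtchn_fd);
	}
}

The CPU cost the later slide warns about is visible in the inner loop: every idle millisecond is spent spinning, which is why polling helps single-VBD latency but can hurt multi-VBD aggregate throughput.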

SLIDE 33

Storage performance Reduce latency

Idea 1: Polling in the backend

Single-threaded sequential reads, queue-depth 1, varying block size

[Chart: IOPS (≈2,000–18,000) against block size (0.5–512 KB), with polling (1 ms) versus without polling.]

Debian 6.0 VM on Dell R720 (2 × Xeon E5-2643 v2), Micron P320h SSD

SLIDE 34

Storage performance Reduce latency

Idea 1: Polling in the backend

Polling for just 1 millisecond can yield a significant improvement [1]. The faster the disk, the bigger the improvement [2].

Conclusion

XenServer will likely adopt polling in tapdisk3. But we need to be careful about eating too much CPU, which can hurt multi-VBD aggregate throughput.

[1] On blkback the improvement may be even larger.
[2] Until the tapdisk3 process fully utilises a CPU even when not polling – the next bottleneck.

SLIDE 35

Storage performance Reduce latency

Idea 2: Grant-map in the backend

Single-threaded sequential reads, queue-depth 1

[Chart: IOPS (≈1,000–8,000) against block size (0.5–512 KB) for the grant-map experiment.]

Debian 6.0 VM on Dell R720, Intel S3700 SSD

SLIDE 36

Storage performance Reduce latency

Idea 2: Grant-map in the backend

So grant-copy is still faster in practice, despite recent improvements to grant-map locking. This suggests inefficiencies in the gntdev…?

Conclusion

XenServer will likely retain grant-copy for now.

SLIDE 37

Storage performance Allow more data in-flight

Outline

1. The virtualisation performance challenge
2. Networking performance
3. Storage performance
   • Reduce latency
   • Allow more data in-flight
   • Summary

SLIDE 38

Storage performance Allow more data in-flight

Allow more data in-flight

The problem

Each blkif ring has 32 slots, each of which can address up to 44 KB, i.e. a total of 1.375 MB in flight (see the worked arithmetic after the list below). Meanwhile, modern disks and arrays can give better throughput when issued with more than this.

Ideas to get more data in-flight

1. Multi-queue – patches proposed by Bob Liu.
   Rationale: more than one blkif ring per device.

2. Multi-page ring – patches proposed by Bob Liu.
   Rationale: a larger blkif ring.

3. Indirect descriptors – available since kernel 3.11.
   Rationale: the ability to address more data per ring slot.
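The worked arithmetic behind those figures, assuming the usual blkif limit of 11 segments of 4 KB per request (which is where the 44 KB per slot comes from):

\[
  11 \times 4\,\mathrm{KB} = 44\,\mathrm{KB\ per\ request},
  \qquad
  32 \times 44\,\mathrm{KB} = 1408\,\mathrm{KB} = 1.375\,\mathrm{MB\ in\ flight}.
\]

With indirect descriptors (up to 1 MB per slot, as covered later): \(32 \times 1\,\mathrm{MB} = 32\,\mathrm{MB}\) in flight.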

SLIDE 39

Storage performance Allow more data in-flight

Idea 1: Multi-queue measurements

Sequential reads, 8 threads, queue-depth 32, varying block size

[Chart: IOPS (up to ≈350,000) against block size (0.5–512 KB) with multi-queue.]

Ubuntu 15.04 VM using blkback on Dell R720 (2 × Xeon E5-2643 v2), Micron P320h SSD

SLIDE 40

Storage performance Allow more data in-flight

Idea 1: Multi-queue measurements in context

Sequential reads, 8 threads, queue-depth 32, varying block size

[Chart: the same IOPS-against-block-size measurements shown in context.]

Ubuntu 15.04 VM using blkback on Dell R720 (2 × Xeon E5-2643 v2), Micron P320h SSD

SLIDE 41

Storage performance Allow more data in-flight

Idea 1: Multi-queue

Adding multi-queue support hurts performance for small block sizes.

Explanation

Explanation pending! The guest does no request merging, and we rely on merging to get good sequential I/O performance on modern disks.

Conclusion

Unless the sequential I/O performance obtained when requests are merged can be retained, XenServer will likely not adopt multi-queue.

SLIDE 42

Storage performance Allow more data in-flight

Idea 2: Multi-page ring: good for random I/O

Random 4 KB reads, queue-depth 4, varying number of threads

[Chart: IOPS (up to ≈250,000) against number of threads (10–60) for random 4 KB reads at queue depth 4.]

Ubuntu 15.04 VM (16 vCPUs) using blkback on Dell R720 (2 × Xeon E5-2643 v2), Micron P320h SSD

SLIDE 43

Storage performance Allow more data in-flight

Idea 2: Multi-page ring: poor for sequential I/O

Sequential reads, 8 threads, queue-depth 32, varying block size

[Chart: IOPS (up to ≈300,000) against block size (0.5–2048 KB) for sequential reads with a multi-page ring.]

Ubuntu 15.04 VM (4 vCPUs) using blkback on Dell R720 (2 × Xeon E5-2643 v2), Micron P320h SSD

SLIDE 44

Storage performance Allow more data in-flight

Idea 2: Multi-page ring

A multi-page ring improves random I/O throughput by over 50% when the ring would otherwise be full, but reduces sequential I/O throughput for small block sizes at high queue depth.

Explanation

The guest kernel does not merge requests when there is a multi-page ring.

Conclusion

Further work is needed to mitigate the effect on request merging. XenServer will likely retain a single-page ring for now.

SLIDE 45

Storage performance Allow more data in-flight

Idea 3: Indirect descriptors

Background

Indirect descriptors have been available in blkfront/blkback since kernel 3.11. They allow up to 1 MB to be addressed per ring slot, so the total in-flight data can be 32 MB rather than 1.375 MB. But is this actually a good thing? Most modern disks respond better to smaller requests…

SLIDE 46

Storage performance Allow more data in-flight

Idea 3: Indirect descriptors – is it worthwhile?

Reading direct from physical disk, splitting requests into chunks issued in parallel

[Chart: throughput (y-axis ≈0.2–1.6) against chunk size (0.5–2048 KB).]

Dell R720 (2 × Xeon E5-2643 v2), Micron P320h SSD
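A rough reconstruction of this kind of experiment (my code, not the original benchmark): read from the device with O_DIRECT, but split one large request into fixed-size chunks submitted in parallel with Linux AIO (link with -laio). CHUNK_SIZE mirrors the 44 KB blkif per-request limit; the chunk count and command-line interface are arbitrary.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE (44 * 1024)  /* one blkif-sized chunk                  */
#define NR_CHUNKS  32           /* chunks kept in flight at the same time */

int main(int argc, char **argv)
{
	struct iocb cbs[NR_CHUNKS], *cbps[NR_CHUNKS];
	struct io_event events[NR_CHUNKS];
	void *bufs[NR_CHUNKS];
	io_context_t ctx = 0;
	int fd, i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0 || io_setup(NR_CHUNKS, &ctx) < 0)
		return 1;

	for (i = 0; i < NR_CHUNKS; i++) {
		/* O_DIRECT requires aligned buffers. */
		if (posix_memalign(&bufs[i], 4096, CHUNK_SIZE))
			return 1;
		/* Chunk i covers offset i * CHUNK_SIZE of one large read. */
		io_prep_pread(&cbs[i], fd, bufs[i], CHUNK_SIZE,
			      (long long)i * CHUNK_SIZE);
		cbps[i] = &cbs[i];
	}

	/* Issue every chunk at once, then wait for all of them. */
	if (io_submit(ctx, NR_CHUNKS, cbps) != NR_CHUNKS)
		return 1;
	if (io_getevents(ctx, NR_CHUNKS, NR_CHUNKS, events, NULL) != NR_CHUNKS)
		return 1;

	io_destroy(ctx);
	close(fd);
	return 0;
}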

SLIDE 47

Storage performance Allow more data in-flight

Idea 3: Indirect descriptors

Conclusion

On modern disks, throughput generally improves when large requests are split into 44 KB chunks! Allowing bigger requests through can hurt performance. Ideally we need the Linux block layer to know the disk's optimal block size, and to split or merge requests accordingly. Then indirect I/O would present an improvement by allowing more data in flight.
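For what it is worth, the block layer already exports a per-device hint of this kind, though many devices simply report 0; a minimal sketch of how a backend or tool could read it (the helper name is mine):

#include <stdio.h>

/* Read the kernel's optimal I/O size hint for a disk, e.g.
 * /sys/block/sda/queue/optimal_io_size; 0 means "no hint reported". */
long optimal_io_size(const char *disk)
{
	char path[256];
	long bytes = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/optimal_io_size", disk);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &bytes) != 1)
		bytes = -1;
	fclose(f);
	return bytes;
}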

SLIDE 48

Storage performance Summary

Outline

1. The virtualisation performance challenge
2. Networking performance
3. Storage performance
   • Reduce latency
   • Allow more data in-flight
   • Summary

SLIDE 49

Storage performance Summary

Summary

Reduce latency:
  • Polling – promising results
  • Grant-map – needs more work for a userspace backend

Allow more data in-flight:
  • Multi-queue – prevents request merging
  • Multi-page ring – prevents request merging
  • Indirect descriptors – prevents use of the optimal block size

Future work
  • Improve the performance of the gntdev
  • A better strategy for getting more data in-flight whilst ensuring that requests are of optimal size

SLIDE 50

Questions?

SLIDE 51

Extra slides

There’s little benefit from batching nowadays


Dell R220 (Xeon E3-1230 v3)
