DMA API Performance and Contention on IOMMU Enabled Environments



slide-1
SLIDE 1

Linux is a registered trademark of Linus Torvalds.

DMA API Performance and Contention on IOMMU Enabled Environments

Thadeu Cascardo <cascardo@linux.vnet.ibm.com>

slide-2
SLIDE 2

Disclaimer

There is some bias towards PPC64. Please help me collaborate with you.

slide-3
SLIDE 3

Agenda

  • Why
  • What
  • How much
  • How
slide-4
SLIDE 4

Why

I/O virtualization in the form of PCI passthrough provides:

– Isolation between guests
– Performance
– Stability
– Debug

slide-5
SLIDE 5

What

DMA API:

– virtual address -> IO/DMA address

IOMMU:

– translates addresses coming from the bus into memory addresses

slide-6
SLIDE 6

PVDMA

  • pSeries uses a paravirtualized IOMMU
  • KVM PVDMA didn't make it into mainline (5-ish years old)
  • Advantage: don't have to pin the whole guest memory

slide-7
SLIDE 7

DMA maps performance

         Direct         PVDMA
Map      Adds offset    Allocate IOVA, insert mapping
Unmap    Nothing        Remove mapping, free IOVA
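The difference between the two map/unmap paths can be sketched in a few lines of userspace Python. This is only a toy model, not kernel code: the class names `DirectDma` and `IommuDma`, the free-list allocator, and the 4 KiB page size are all assumptions made for illustration. It shows that the direct path is pure arithmetic, while the IOMMU-backed path has to allocate an IOVA and update a translation table on every map, and undo both on unmap.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class DirectDma:
    """Direct mapping: the DMA address is the physical address plus an offset."""

    def __init__(self, offset):
        self.offset = offset

    def map(self, phys_addr):
        return phys_addr + self.offset

    def unmap(self, dma_addr):
        pass  # nothing to undo


class IommuDma:
    """IOMMU-style mapping: allocate an IOVA page, then insert a mapping."""

    def __init__(self, num_pages):
        # Hypothetical allocator: a plain free list of IOVA page numbers.
        self.free_pages = list(range(num_pages))
        self.table = {}  # IOVA page number -> physical page number

    def map(self, phys_addr):
        if not self.free_pages:
            raise MemoryError("IOVA space exhausted")
        iova_page = self.free_pages.pop(0)                # allocate IOVA
        self.table[iova_page] = phys_addr // PAGE_SIZE    # insert mapping
        return iova_page * PAGE_SIZE + phys_addr % PAGE_SIZE

    def unmap(self, dma_addr):
        iova_page = dma_addr // PAGE_SIZE
        del self.table[iova_page]                         # remove mapping
        self.free_pages.append(iova_page)                 # free IOVA

    def translate(self, dma_addr):
        """What the IOMMU does for an address coming from the bus."""
        phys_page = self.table[dma_addr // PAGE_SIZE]
        return phys_page * PAGE_SIZE + dma_addr % PAGE_SIZE
```

In this model the allocator and the table are the shared state that every mapping thread has to touch, which is where contention would come from.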

slide-8
SLIDE 8

PVDMA Performance

  • Hypercall cost
  • IOVA allocation cost <- Contention

slide-9
SLIDE 9

Drivers performance

10 Gbps NIC: the device driver mapped a buffer for every packet

slide-10
SLIDE 10

Drivers performance

Result: 3Gbps

slide-11
SLIDE 11

Drivers Optimization

After: allocate chunks from already mapped pages
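A rough way to see why this helps is to count mapping operations. The sketch below is a toy model under stated assumptions: `CountingMapper`, `per_packet` and `ChunkPool` are hypothetical names, the mapper only hands out increasing addresses, and unmapping and page recycling are left out. It compares mapping once per packet with mapping a page once and carving fixed-size chunks out of it.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class CountingMapper:
    """Stand-in for the DMA mapping layer; it only counts map calls."""

    def __init__(self):
        self.map_calls = 0
        self.next_iova = 0x100000  # arbitrary starting address

    def map_page(self):
        self.map_calls += 1
        iova = self.next_iova
        self.next_iova += PAGE_SIZE
        return iova


def per_packet(mapper, num_packets):
    """Before: one mapping operation for every packet buffer."""
    return [mapper.map_page() for _ in range(num_packets)]


class ChunkPool:
    """After: map a page once, then hand out chunks from the mapped page."""

    def __init__(self, mapper, chunk_size):
        self.mapper = mapper
        self.chunk_size = chunk_size
        self.page_base = None
        self.used = PAGE_SIZE  # forces a mapping on the first allocation

    def alloc(self):
        if self.used + self.chunk_size > PAGE_SIZE:
            self.page_base = self.mapper.map_page()  # only when a page is used up
            self.used = 0
        addr = self.page_base + self.used
        self.used += self.chunk_size
        return addr
```

With 2048-byte chunks, eight buffers need four map calls instead of eight; smaller chunks reduce the count further.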

slide-12
SLIDE 12

Drivers optimization

Result: 9.5Gbps

slide-13
SLIDE 13

Performance Numbers

[Plot: all configurations. X axis: Threads (1 to 1024). Y axis: Time (s) (50 to 450). Series: IOMMU only, Direct DMA, Direct DMA IOMMU, Bitmap IOMMU, Bitmap Pool IOMMU, RBTree IOMMU.]
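The slides do not spell out how the "Bitmap" and "Bitmap Pool" allocators work, so the following is only one plausible reading, written as a userspace Python sketch with hypothetical names (`BitmapAllocator`, `PooledBitmapAllocator`, and the `hint` parameter). A single bitmap is protected by a single lock; the pooled variant splits the IOVA range into several bitmaps, each with its own lock, so that threads starting from different pools do not serialize on the same lock.

```python
import threading


class BitmapAllocator:
    """First-fit allocator over a bitmap of IOVA pages, guarded by one lock."""

    def __init__(self, base, size):
        self.base = base
        self.used = [False] * size
        self.lock = threading.Lock()

    def alloc(self):
        with self.lock:
            for i, taken in enumerate(self.used):
                if not taken:
                    self.used[i] = True
                    return self.base + i
        return None  # this bitmap is full

    def free(self, page):
        with self.lock:
            self.used[page - self.base] = False


class PooledBitmapAllocator:
    """Several smaller bitmaps, each with its own lock (assumed design)."""

    def __init__(self, total_pages, num_pools):
        self.per_pool = total_pages // num_pools
        self.pools = [
            BitmapAllocator(i * self.per_pool, self.per_pool)
            for i in range(num_pools)
        ]

    def alloc(self, hint):
        # 'hint' picks the starting pool, e.g. a CPU or thread index.
        n = len(self.pools)
        for k in range(n):
            page = self.pools[(hint + k) % n].alloc()
            if page is not None:
                return page
        return None  # every pool is full

    def free(self, page):
        self.pools[page // self.per_pool].free(page)
```

For example, with 8 pages split into 4 pools, two threads passing hints 0 and 3 take different locks until their own pool fills up and they fall over to a neighbour.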

slide-14
SLIDE 14

Performance Numbers

[Plot: 1M IOMMU Ops. X axis: Threads (1 to 256). Y axis: Time (s) (0.1 to 100).]

slide-15
SLIDE 15

Performance Numbers

[Plot: 1M Direct DMA map. X axis: Threads (1 to 1024). Y axis: Time (s) (0.01 to 10).]

slide-16
SLIDE 16

Performance Numbers

[Plot: 1M Direct DMA IOMMU Ops. X axis: Threads (1 to 256). Y axis: Time (s) (0.1 to 100).]

slide-17
SLIDE 17

Performance Numbers

[Plot: 1M Bitmap DMA IOMMU Ops. X axis: Threads (1 to 64). Y axis: Time (s) (0.1 to 1000).]

slide-18
SLIDE 18

Performance Numbers

[Plot: 1M Bitmap Pool DMA IOMMU Ops. X axis: Threads (1 to 256). Y axis: Time (s) (0.1 to 1000).]

slide-19
SLIDE 19

Performance Numbers

[Plot: 1M RBTree DMA IOMMU Ops. X axis: Threads (1 to 8). Y axis: Time (s) (1 to 1000).]

slide-20
SLIDE 20

Performance Numbers

[Plot: 1M RBTree DMA IOMMU Ops on X86. X axis: Threads (1 to 8). Y axis: Time (s) (1 to 1000).]

slide-21
SLIDE 21

Sharing code

  • IOMMU drivers infrastructure
  • Allocation algorithm(s)
  • PVDMA Guest and Host code
slide-22
SLIDE 22

Future works

  • Experiment with other tree-based algorithms
slide-23
SLIDE 23

Conclusions

  • A lower bound for allocation algorithms
  • Current RBTree IOVA code has bad performance

  • IOMMUs are currently underused