DMA API Performance and Contention on IOMMU Enabled Environments ● Thadeu Cascardo <cascardo@linux.vnet.ibm.com> Linux is a registered trademark of Linus Torvalds.
Disclaimer There is some bias towards PPC64. Please, help me collaborate with you.
Agenda ● Why ● What ● How much ● How
Why I/O Virtualization in the form of PCI passthrough Provides: – Isolation between guests – Performance – Stability – Debug
What DMA API: – virtual address -> IO/DMA address IOMMU: – translates addresses coming from the bus into memory addresses
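The translation step can be pictured with a toy model. This is a minimal sketch, not kernel code: the `ToyIommu` class, its method names and the 4 KiB page size are all assumptions made for illustration.

```python
PAGE_SHIFT = 12                  # assumed 4 KiB IOMMU pages
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = PAGE_SIZE - 1


class ToyIommu:
    """Hypothetical model of an IOMMU translation table."""

    def __init__(self):
        self.table = {}          # IOVA page number -> physical page number

    def insert(self, iova, phys):
        """Install a one-page mapping (what a DMA map would set up)."""
        self.table[iova >> PAGE_SHIFT] = phys >> PAGE_SHIFT

    def remove(self, iova):
        """Tear down a one-page mapping (what a DMA unmap would do)."""
        del self.table[iova >> PAGE_SHIFT]

    def translate(self, bus_addr):
        """Translate an address coming from the bus into a memory address."""
        page = self.table.get(bus_addr >> PAGE_SHIFT)
        if page is None:
            # Unmapped accesses are rejected: this is the isolation property.
            raise PermissionError(f"unmapped DMA access at {bus_addr:#x}")
        return (page << PAGE_SHIFT) | (bus_addr & PAGE_MASK)


iommu = ToyIommu()
iommu.insert(iova=0x1000, phys=0x7F000)
translated = iommu.translate(0x1234)   # same page offset, different page
```

The device only ever sees the IOVA; the page offset is preserved and the page number is looked up in the table.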
PVDMA ● pSeries uses a paravirtualized IOMMU ● KVM PVDMA didn't make it into mainline (5-ish years old) ● Advantage: don't have to pin the whole guest memory
DMA maps performance – Map: Direct adds an offset; PVDMA allocates an IOVA and inserts the mapping – Unmap: Direct does nothing; PVDMA removes the mapping and frees the IOVA
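The difference in work between the two paths can be sketched side by side. This is an illustrative model under stated assumptions: `DirectDma`, `IommuDma`, the first-fit IOVA search and the 4 KiB page size are hypothetical and do not mirror any particular kernel implementation.

```python
PAGE_SHIFT = 12                  # assumed 4 KiB IOMMU pages
PAGE_SIZE = 1 << PAGE_SHIFT


def pages_spanned(addr, size):
    """Number of pages touched by the buffer [addr, addr + size)."""
    return ((addr + size - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1


class DirectDma:
    """Direct DMA: map adds an offset, unmap does nothing."""

    def __init__(self, offset):
        self.offset = offset

    def map(self, phys, size):
        return phys + self.offset

    def unmap(self, dma_addr, size):
        pass


class IommuDma:
    """IOMMU-backed DMA: map allocates an IOVA and inserts a mapping."""

    def __init__(self, num_pages):
        self.used = [False] * num_pages   # IOVA page allocation state
        self.table = {}                   # IOVA page -> physical page

    def _alloc_iova(self, npages):
        run = 0                           # naive first-fit contiguous search
        for i, busy in enumerate(self.used):
            run = 0 if busy else run + 1
            if run == npages:
                start = i - npages + 1
                for j in range(start, i + 1):
                    self.used[j] = True
                return start
        raise MemoryError("IOVA space exhausted")

    def map(self, phys, size):
        npages = pages_spanned(phys, size)
        start = self._alloc_iova(npages)
        for k in range(npages):
            self.table[start + k] = (phys >> PAGE_SHIFT) + k
        return (start << PAGE_SHIFT) | (phys & (PAGE_SIZE - 1))

    def unmap(self, dma_addr, size):
        start = dma_addr >> PAGE_SHIFT
        for k in range(pages_spanned(dma_addr, size)):
            del self.table[start + k]     # remove mapping
            self.used[start + k] = False  # free IOVA


direct = DirectDma(offset=0x80000000)
direct_addr = direct.map(0x1234, 100)

iommu_dma = IommuDma(num_pages=16)
iommu_addr = iommu_dma.map(0x7F234, PAGE_SIZE)   # unaligned, spans two pages
```

In the paravirtualized case each table update would also involve a hypercall, which this model leaves out.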
PVDMA Performance ● Hypercall cost ● IOVA allocation cost <- Contention
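Where the contention comes from can be seen in a sketch of a bitmap IOVA allocator guarded by one lock. The class name, the hint-based search and the thread counts below are assumptions for illustration; Python threads will not reproduce the timings, only the structure in which every map and unmap from every CPU serializes on the same lock.

```python
import threading


class LockedBitmapAllocator:
    """Hypothetical single-lock bitmap allocator for one-page IOVAs."""

    def __init__(self, num_pages):
        self.bitmap = bytearray(num_pages)   # 0 = free, 1 = in use
        self.lock = threading.Lock()         # the single point of contention
        self.hint = 0                        # where the next search starts

    def alloc(self):
        with self.lock:
            n = len(self.bitmap)
            for step in range(n):
                i = (self.hint + step) % n
                if not self.bitmap[i]:
                    self.bitmap[i] = 1
                    self.hint = i + 1
                    return i
        raise MemoryError("IOVA space exhausted")

    def free(self, page):
        with self.lock:
            self.bitmap[page] = 0


def worker(allocator, count, out):
    # Each thread allocates `count` pages and keeps them in its own list.
    for _ in range(count):
        out.append(allocator.alloc())


allocator = LockedBitmapAllocator(num_pages=1024)
per_thread = [[] for _ in range(8)]
threads = [threading.Thread(target=worker, args=(allocator, 100, out))
           for out in per_thread]
for t in threads:
    t.start()
for t in threads:
    t.join()
held = [page for out in per_thread for page in out]
```

The lock keeps the allocator correct (no page is handed out twice), but the critical section includes the bitmap search, so its cost is paid serially by all threads.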
Drivers performance ● 10Gbps NIC ● Device driver mapped for every packet
Drivers performance Result: 3Gbps
Drivers Optimization ● After: driver allocates chunks from already mapped pages
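The effect of the optimization on the number of DMA map calls can be sketched as follows. Everything here is hypothetical: `CountingDma` only counts calls, and the 64 KiB page and 2 KiB chunk sizes are assumed values, not taken from the driver that was measured.

```python
PAGE_SIZE = 65536                # assumed 64 KiB pages
CHUNK_SIZE = 2048                # assumed receive buffer size


class CountingDma:
    """Hypothetical stand-in for the DMA API that only counts map calls."""

    def __init__(self):
        self.map_calls = 0
        self.next_iova = 0x10000

    def map_page(self):
        self.map_calls += 1
        iova = self.next_iova
        self.next_iova += PAGE_SIZE
        return iova


def rx_map_per_packet(dma, packets):
    """Before: one DMA map for every packet."""
    return [dma.map_page() for _ in range(packets)]


def rx_chunks_from_mapped_pages(dma, packets, chunk_size=CHUNK_SIZE):
    """After: map a page once, then hand out chunks of it per packet."""
    addrs = []
    page_base, offset = None, PAGE_SIZE      # forces a map on first packet
    for _ in range(packets):
        if offset + chunk_size > PAGE_SIZE:  # current page is used up
            page_base, offset = dma.map_page(), 0
        addrs.append(page_base + offset)
        offset += chunk_size
    return addrs


before = CountingDma()
before_addrs = rx_map_per_packet(before, 1000)

after = CountingDma()
after_addrs = rx_chunks_from_mapped_pages(after, 1000)
```

With these assumed sizes each mapped page serves 32 packets, so the map path (and any lock or hypercall behind it) is entered far less often.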
Drivers optimization Result: 9.5Gbps
Performance Numbers [Figure: time (s), 0 to 450, versus threads, 1 to 1024; series: IOMMU only, Direct DMA, Direct DMA IOMMU, Bitmap IOMMU, Bitmap Pool IOMMU, RBTree IOMMU]
Performance Numbers: 1M IOMMU Ops [Figure: time (s), log scale 0.1 to 100, versus threads, 1 to 256]
Performance Numbers: 1M Direct DMA map [Figure: time (s), log scale 0.01 to 10, versus threads, 1 to 1024]
Performance Numbers: 1M Direct DMA IOMMU Ops [Figure: time (s), log scale 0.1 to 100, versus threads, 1 to 256]
Performance Numbers: 1M Bitmap DMA IOMMU Ops [Figure: time (s), log scale 0.1 to 1000, versus threads, 1 to 64]
Performance Numbers: 1M Bitmap Pool DMA IOMMU Ops [Figure: time (s), log scale 0.1 to 1000, versus threads, 1 to 256]
Performance Numbers: 1M RBTree DMA IOMMU Ops [Figure: time (s), log scale 1 to 1000, versus threads, 1 to 8]
Performance Numbers: 1M RBTree DMA IOMMU Ops on X86 [Figure: time (s), log scale 1 to 1000, versus threads, 1 to 8]
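One possible reading of the "Bitmap Pool" variant in the numbers above is a bitmap split into several pools, each with its own lock, so that threads on different CPUs usually take different locks. The slides do not spell out the design, so the sketch below is an assumption: the class name, the pool selection by CPU number and the fallback to neighbouring pools are all hypothetical.

```python
import threading


class PooledBitmapAllocator:
    """Hypothetical bitmap allocator split into independently locked pools."""

    def __init__(self, num_pages, num_pools):
        self.pool_size = num_pages // num_pools
        self.pools = [bytearray(self.pool_size) for _ in range(num_pools)]
        self.locks = [threading.Lock() for _ in range(num_pools)]

    def alloc(self, cpu):
        """Allocate one page, preferring the pool assigned to this CPU."""
        n = len(self.pools)
        for step in range(n):
            p = (cpu + step) % n          # fall back to the next pool if full
            with self.locks[p]:
                idx = self.pools[p].find(0)   # first free entry, or -1
                if idx >= 0:
                    self.pools[p][idx] = 1
                    return p * self.pool_size + idx
        raise MemoryError("IOVA space exhausted")

    def free(self, page):
        p, idx = divmod(page, self.pool_size)
        with self.locks[p]:
            self.pools[p][idx] = 0


pooled = PooledBitmapAllocator(num_pages=16, num_pools=4)
first_cpu0 = pooled.alloc(cpu=0)      # comes from pool 0
first_cpu1 = pooled.alloc(cpu=1)      # comes from pool 1, different lock
second_cpu0 = pooled.alloc(cpu=0)     # next free page in pool 0
```

The search cost per allocation stays, but it is paid under a per-pool lock rather than a global one.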
Sharing code ● IOMMU drivers infrastructure ● Allocation algorithm(s) ● PVDMA Guest and Host code
Future works ● Experiment with other tree-based algorithms
Conclusions ● A lower bound for allocation algorithms ● Current RBTree IOVA code has bad performance ● IOMMUs are currently underused