SLIDE 1

A scalable, common IOMMU pooled allocation library for multiple architectures

Sowmini Varadhan (sowmini.varadhan@oracle.com)

Linuxcon North America, Seattle 2015

SLIDE 2

Agenda

  • What is IOMMU?
  • Benefits/drawbacks of IOMMU
  • Typical design of IOMMU allocators
    – Extracting common constructs: a generic IOMMU pooled allocator
    – Some optimizations, performance results
  • Some case-studies of sun4v/sun4u usage
  • Ongoing/future work
SLIDE 3

What is IOMMU?

  • IOMMU = I/O Memory Management Unit
  • Connects a DMA-capable I/O bus to the main memory
    – The IOMMU maps the bus-address space to the physical address space.
    – The IOMMU is the bus-address mapper, just as the MMU is the virtual-address mapper.
    – IOMMU and MMU have many of the same benefits and costs.

SLIDE 4

Pictorially..

  [Figure: the CPU's virtual address X is mapped by the MMU to physical address Y; the device's bus address Z is mapped by the IOMMU to the same physical address Y. Diagram adapted from Documentation/DMA-API-HOWTO.txt]

SLIDE 5

Why have IOMMU?

  • Bus address width != physical address width
    – e.g., the bus address width may be 24 bits while physical addresses are 32 bits. Remapping allows the device to efficiently reach every part of physical memory. Without an IOMMU, inefficient schemes like bounce buffers would be needed.
  • Can efficiently map a contiguous bus address space to non-contiguous main memory.

SLIDE 6

Additional IOMMU Benefits

  • Protection: a device cannot read or write memory that has not been explicitly mapped for it.
  • Virtualization: a guest OS can use h/w that is not virtualization-aware; the IOMMU handles the remapping needed for these devices to access the vaddrs remapped as needed by the VM subsystem.

SLIDE 7

IOMMU drawbacks..

  • Extra mapping overhead
  • Consumption of physmem for page tables
SLIDE 8

IOMMU and DMA

  • Virtual address “X” is generated by kmalloc()
  • The VM system maps “X” to a physaddr “Y”
    – But a driver cannot set up DMA to the bus address for “Y” without the help of the IOMMU
  • The IOMMU translates bus address “Z” to “Y”
  • To do DMA, the driver can call dma_map_single() with “X” to set up the needed IOMMU mapping for “Z” (see the sketch below)
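As a concrete illustration (not taken from the slides), here is a minimal driver-side sketch of this flow using the standard kernel DMA API; the function name example_dma_map and the dev/len parameters are hypothetical placeholders for whatever the driver actually uses:

    #include <linux/dma-mapping.h>
    #include <linux/slab.h>

    /* Sketch: map a kmalloc()ed buffer for DMA.
     * 'dev' is the struct device doing DMA; 'len' is the buffer size. */
    static int example_dma_map(struct device *dev, size_t len)
    {
            void *buf = kmalloc(len, GFP_KERNEL);    /* virtual address "X" */
            dma_addr_t bus_addr;                     /* bus address "Z"     */

            if (!buf)
                    return -ENOMEM;

            bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
            if (dma_mapping_error(dev, bus_addr)) {  /* IOMMU mapping could not be set up */
                    kfree(buf);
                    return -ENOMEM;
            }

            /* ... program the device with bus_addr and run the DMA ... */

            dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
            kfree(buf);
            return 0;
    }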

SLIDE 9

How does this work?

  • The driver calls a mapping indirection from its dma_map_ops, e.g., ->map_page, passing in the virtual address it wishes to DMA to and the number of pages ('n') desired (sketched below).
    – The IOMMU allocator now needs to set up the bus-addr ↔ physaddr mappings as needed for DMA.
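For orientation, a rough sketch of how this per-bus indirection is wired up; the names my_iommu_map_page and my_dma_ops are illustrative, and the exact ->map_page signature (e.g., the attrs argument) has varied across kernel versions:

    #include <linux/dma-mapping.h>

    /* Sketch: the per-architecture/bus dma_map_ops table provides the
     * ->map_page hook that ends up calling the IOMMU allocator.
     * Signature is approximate; it has changed across kernel versions. */
    static dma_addr_t my_iommu_map_page(struct device *dev, struct page *page,
                                        unsigned long offset, size_t size,
                                        enum dma_data_direction dir,
                                        struct dma_attrs *attrs)
    {
            /* 1. compute npages covering [offset, offset + size)        */
            /* 2. ask the IOMMU arena allocator for npages of bus space  */
            /* 3. write the physaddr of each page into the TSB entries   */
            /* 4. return the resulting bus address                       */
            return 0; /* placeholder */
    }

    static struct dma_map_ops my_dma_ops = {
            .map_page = my_iommu_map_page,
            /* .unmap_page, .map_sg, ... */
    };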

SLIDE 10

DMA mapping and IOMMU allocation

  • Given the virtual address (vaddr) for the page passed in, and an allocation request for n pages..
  • Call the IOMMU arena allocator to find an available block of n entries in the vdma/TSB for the device (TSB = Translation Storage Buffer)
  • For the entries returned (indices into the TSB), commit the physaddr of the vaddr mapping into the TSB

SLIDE 11

IOMMU arena allocation

  • The IOMMU allocator needs a data-structure representing the current state of the device TSB
  • Typically represented as a bitmap, where each bit represents a page of the bus address space. A bit value of “1” indicates that the page is in use.

SLIDE 12

IOMMU Arena Allocator

  • Finding a block of n entries <=> finding a contiguous set of n zero-bits.

    1111110011100000011.....
    (the run of 6 zero-bits marks 6 pages available in the TSB)

SLIDE 13

IOMMU arena allocation

  • Simplistic algorithm (a fuller C sketch follows below):

    mutex_lock(&iommu->lock)
        /* find a zero area of n bits */
        /* mark area as “taken” */
    mutex_unlock(&iommu->lock)
    /* update TSB with physaddrs */
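A minimal C sketch of this single-lock allocator, using the kernel's generic bitmap helpers; the struct and function names (simple_iommu_arena, simple_arena_alloc) are illustrative, not the actual sparc code:

    #include <linux/bitmap.h>
    #include <linux/mutex.h>

    /* Illustrative single-lock arena: one bitmap, one mutex. */
    struct simple_iommu_arena {
            struct mutex   lock;
            unsigned long *map;          /* one bit per bus-address page */
            unsigned long  nr_entries;   /* total number of TSB entries  */
    };

    /* Find and reserve n contiguous TSB entries; returns the first index,
     * or a value >= nr_entries on failure. */
    static unsigned long simple_arena_alloc(struct simple_iommu_arena *arena,
                                            unsigned long n)
    {
            unsigned long start;

            mutex_lock(&arena->lock);
            /* find a zero area of n bits ... */
            start = bitmap_find_next_zero_area(arena->map, arena->nr_entries,
                                               0, n, 0);
            /* ... and mark it as "taken" */
            if (start < arena->nr_entries)
                    bitmap_set(arena->map, start, n);
            mutex_unlock(&arena->lock);

            /* caller then updates TSB entries [start, start + n) with physaddrs */
            return start;
    }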

SLIDE 14

Optimized IOMMU Arena Allocator

  • We don't need the left-most zero-area; any contiguous set will suffice
  • Can parallelize the arena allocation by dividing the bitmap into pools

    11111100111   00000011.....   (bitmap divided into pools)

  • Only need to lock each pool, i.e., a per-pool lock.
  • 4 threads can concurrently call the arena allocator (see the data-structure sketch below)
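The pooled arena can be sketched roughly as below. This mirrors the shape of the generic library's data structures (one bitmap plus an array of pools, each with its own lock and search hint), but the type and field names here are illustrative rather than the exact ones in include/linux/iommu-common.h:

    #include <linux/spinlock.h>

    /* Illustrative pooled arena: the single bitmap is carved into npools
     * regions, each guarded by its own spinlock. */
    struct example_iommu_pool {
            unsigned long start;    /* first entry belonging to this pool */
            unsigned long end;      /* one past the last entry            */
            unsigned long hint;     /* where the next search starts       */
            spinlock_t    lock;     /* protects only this pool's bits     */
    };

    struct example_iommu_map_table {
            unsigned long             *map;        /* one bit per TSB entry */
            unsigned long              nr_pools;
            struct example_iommu_pool  pools[16];  /* e.g., 16 pools        */
    };

    /* A thread picks a pool (e.g., hashed from its CPU id), takes only that
     * pool's lock, and searches for zero-bits within [start, end). Threads
     * that hash to different pools never contend. */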
SLIDE 15

Results: ethernet perf before optimization

Before the IOMMU lock optimization:

  • iperf with 8-10 threads, with TSO disabled, can only manage 1-2 Gbps on a 10 Gbps point-to-point ethernet connection on a sparc T5-2 (64 CPUs, 8 cores/socket, 8 threads/core)
  • Lockstat shows about 35M contentions for 40M acquisitions of the iommu->lock
  • Lock contention points: dma_4v_[un]map_sg, dma_4v_[un]map_page

SLIDE 16

Ethernet perf results after lock optimization

After lock fragmentation (16 pools from 256K TSB entries):

  • Can achieve 8.5-9.5 Gbps without any additional optimizations, both with and without TSO support.
  • Lockstat now shows only slight contention on the iommu lock: approx 250 contentions versus 43M acquisitions

SLIDE 17

Results: RDS over Infiniband

  • “rds-stress” request/response test: IOPS (transactions per second) for 8K sized RDMA on a T7-2 (320 CPUs, 2 sockets, 12 cores/socket, 13 threads/core)
    – “single pool” represents the unoptimized allocator

                 Single pool    16 pools
    1 conn       80K            90K
    4 conn       80K            160K

SLIDE 18

Additional enhancements

  • Breaking up the bitmap into pools may result in fragmented pools that cannot handle requests for a very large number of pages.
  • Large pool support (from PPC): the upper ¼ of the bitmap is set up as a “large pool” for allocations of more than iommu_large_alloc pages (default = 15). (See the sketch below.)
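A minimal sketch of how a request could be steered to the large pool, reusing the illustrative structures from the earlier sketch; the iommu_large_alloc cut-off (default 15) comes from the slides, while example_pick_pool and the CPU-based pool selection are assumptions:

    #include <linux/smp.h>

    /* Illustrative pool selection: requests larger than the cut-off go to
     * the "large pool" carved out of the upper quarter of the bitmap. */
    static unsigned long iommu_large_alloc = 15;   /* cut-off, per the slides */

    static struct example_iommu_pool *
    example_pick_pool(struct example_iommu_map_table *tbl, unsigned long npages,
                      struct example_iommu_pool *large_pool)
    {
            if (npages > iommu_large_alloc)
                    return large_pool;               /* upper 1/4 of the bitmap */
            /* otherwise spread small requests across the regular pools,
             * e.g., keyed off the current CPU */
            return &tbl->pools[raw_smp_processor_id() % tbl->nr_pools];
    }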


SLIDE 20

Pool initialization parameters: large pool support

  • In practice, large pool allocations are rare, and dependent on the type of device
    – Make large-pool reservation an optional parameter during IOMMU initialization via iommu_tbl_pool_init() (see the prototype sketch below)
  • Should the iommu_large_alloc cut-off also be a pool parameter?
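For orientation, the generic library's pool-init entry point looks roughly like the following. This is recalled from lib/iommu-common.c around the v4.1 kernel; the exact parameter list and names may differ in detail:

    /* Approximate prototype of the generic library's init routine
     * (lib/iommu-common.c, circa v4.1); details may differ. */
    void iommu_tbl_pool_init(struct iommu_map_table *iommu,
                             unsigned long num_entries,   /* size of the TSB         */
                             u32 table_shift,             /* IOMMU page shift        */
                             void (*lazy_flush)(struct iommu_map_table *),
                             bool large_pool,             /* reserve the large pool? */
                             u32 npools,                  /* 0 => library default    */
                             bool skip_span_boundary_check);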

SLIDE 21

Optimizing TSB flushes

  • When the arena allocator returns, it tracks the current location where the search terminated.

    11111100111 00000011.....   ← “current” marks where the previous search ended

  • The next allocator invocation starts at current and searches toward the right end of the pool, wrapping around if necessary
    – i.e., a circular buffer search.

SLIDE 22

lazy_flush()

  • The arena allocator hands out entries in a strict lowest → highest order within the pool. So if an entry to the left of “current” has not been re-used, there is no need to flush its TSB entry either.
  • Leverage this where possible (e.g., older sun4u h/w) by only flushing when we wrap back to the beginning of the pool in the arena allocator.
  • The caller of pool init must supply the lazy_flush() callback where this is possible (see the sketch below)
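A minimal sketch of the idea, reusing the illustrative pool structures from the earlier sketch; the only point being shown is that the caller-supplied lazy_flush callback fires only when the circular search wraps around, and all names other than the bitmap helpers are assumptions:

    /* Illustrative allocation path showing the lazy-flush idea: the TSB/IOTLB
     * flush callback runs only when the search wraps back to the pool start. */
    static unsigned long
    example_pool_alloc(struct example_iommu_map_table *tbl,
                       struct example_iommu_pool *pool, unsigned long n,
                       void (*lazy_flush)(struct example_iommu_map_table *))
    {
            unsigned long start;

            spin_lock(&pool->lock);
            start = bitmap_find_next_zero_area(tbl->map, pool->end, pool->hint, n, 0);
            if (start >= pool->end) {
                    /* nothing to the right of the hint: wrap around ... */
                    start = bitmap_find_next_zero_area(tbl->map, pool->end,
                                                       pool->start, n, 0);
                    /* ... and only now flush stale TSB entries, if the h/w needs it */
                    if (lazy_flush)
                            lazy_flush(tbl);
            }
            if (start < pool->end) {
                    bitmap_set(tbl->map, start, n);
                    pool->hint = start + n;
            }
            spin_unlock(&pool->lock);
            return start;   /* >= pool->end means failure */
    }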

SLIDE 23

sun4v: initializing the IOMMU allocator

  • The core data-structure used by the library is struct iommu_map_table, which contains the bitmap “map”
  • iommu_map_table initialization from pci_sun4v_iommu_init:

    iommu->tbl.map = kzalloc(sz, ..);

  • Call iommu_tbl_pool_init() to initialize the rest of iommu_map_table, with npools == 16, no large pool, and a NULL (*lazy_flush)() (see the sketch below)
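Putting the slide's parameters into the (approximate) iommu_tbl_pool_init() prototype shown earlier, the sun4v initialization would look roughly like this; num_tsb_entries and sz are placeholders for whatever pci_sun4v_iommu_init computes, and the final argument is an assumption:

    /* Sketch of the sun4v initialization described above. */
    iommu->tbl.map = kzalloc(sz, GFP_KERNEL);
    if (!iommu->tbl.map)
            return -ENOMEM;

    iommu_tbl_pool_init(&iommu->tbl, num_tsb_entries, IO_PAGE_SHIFT,
                        NULL,    /* no lazy_flush: the HV handles flushing  */
                        false,   /* no large pool                           */
                        16,      /* npools                                  */
                        false);  /* skip_span_boundary_check (assumed)      */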

SLIDE 24

sun4v IOMMU alloc

  • First call iommu_tbl_range_alloc() to find the indices of the TSB that have been set aside for this allocation
  • Commit the physaddr ↔ bus-addr mappings to the TSB by invoking pci_sun4v_iommu_map() (an HV call; sketched below)

SLIDE 25

sun4v: releasing an IOMMU allocation

  • First reset the TSB state by calling pci_sun4v_iommu_demap() (an HV call)
  • Then reset the iommu bitmap data-structure by calling iommu_tbl_range_free() (sketched below)
    – MUST hold the lock on the specific pool from which the allocation was made
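The release path, sketched under the same caveats: the iommu_tbl_range_free() argument list is recalled from lib/iommu-common.c and may differ in detail, and tbl_base/devhandle are placeholders:

    /* Sketch of the sun4v unmap path: (1) invalidate the TSB entries via the
     * hypervisor, (2) give the bits back to the pool the range came from. */
    entry = (bus_addr - tbl_base) >> IO_PAGE_SHIFT;

    /* HV call: invalidate the TSB entries for this range (argument list illustrative) */
    pci_sun4v_iommu_demap(devhandle, HV_PCI_TSBID(0, entry), npages);

    /* release the bitmap bits; passing the error sentinel as 'entry' lets the
     * library compute the entry from bus_addr (as recalled, may differ) */
    iommu_tbl_range_free(&iommu->tbl, bus_addr, npages, ~0UL);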

SLIDE 26

sun4u customizations

  • No intermediate HV, so the lazy_flush is non-null.
  • Currently uses only 1 pool for 1M TSB entries; need to experiment with using multiple pools
    – Caveat: a typical sun4u is 2 socket, 2 CPU.

SLIDE 27

Current/ongoing Work

  • Currently: all the sparc code uses the generic iommu allocator, including LDC (for Ldoms), sun4u, sun4v
  • Convert the PPC and x86_64 code to use the common arena allocator
  • Try using multiple pools on older hardware (sun4u) and check for perf improvements

SLIDE 28

Acknowledgements

  • David Miller, for the original sparc code and the encouragement to create a generic library
  • Ben Herrenschmidt and the PPC team for all the reviews and feedback that helped make this better
  • All the folks on sparclinux that helped test this on legacy sun4u platforms
  • Linuxcon and Oracle for their support