IO Virtualization with InfiniBand [InfiniBand as a Hypervisor Accelerator]
Michael Kagan, Vice President, Architecture, Mellanox Technologies (michael@mellanox.co.il)
Leadership in InfiniBand silicon
April 05
Key messages
- InfiniBand enables efficient server virtualization
  – Cross-domain isolation
  – Efficient IO sharing
  – Protection enforcement
- Existing HW fully supports virtualization
  – The most cost-effective path for single-node virtual servers
  – SW-transparent scale-out
- VMM support in OpenIB SW stack by fall ’05
  – Alpha version of FW and driver in June
InfiniBand scope in server virtualization
- CPU virtualization (compute power) – NO
- Memory virtualization – Partial
  – Memory allocation – No
  – Address translation – Yes, for IO accesses
  – Protection – Yes, for IO accesses
- IO virtualization – YES
[Diagram: virtualized server: Hypervisor hosts Domain0 (IO driver, bridge) and guest domains DomainX/DomainY; virtual switches connect per-domain IO drivers to the shared IO, CPU and memory]
[Diagram: InfiniBand fabric of interconnected switches linking end nodes]
InfiniBand – Overview
- Performance
  – Bandwidth: up to 120Gbit/sec per link
  – Latency: under 3uSec (today)
- Kernel bypass for IO access
  – Cross-process protection and isolation
- Quality of Service
  – End node
  – Fabric
- Scalability/flexibility
  – Up to 48K local nodes, up to 2^128 total
  – Multiple link widths and media (Cu, fiber)
- Multiple transport services in HW
  – Reliable and unreliable
  – Connected and datagram
  – Automatic path migration in HW
- Memory exposure to remote node
  – RDMA-read and RDMA-write
- Multiple networks on a single wire
  – Network partitioning in HW (“VLAN”)
  – Multiple independent virtual networks on a wire
Link data rates today: 2.5, 10, 20, 30, 60 Gb/s; spec: up to 120 Gb/s (Cu & optical)
InfiniBand communication
[Diagram: consumer channel interface on the host side; network (fabric) interface toward the wire]
Consumer Queue Model
- Asynchronous execution
- In-order execution on queue
- Flexible completion report
Host Channel Adapter (HCA)
- Consumers connected via queues
  – Local or remote node
- 16M independent queues
  – 16M IO channels
  – 16M QoS levels (transport, priority)
- Memory access through virtual address
  – Remote and local
  – 2G address spaces, 64-bit each
  – Access rights and isolation enforced by HW
[Diagram: InfiniBand channel interface: HCA between PCI-Express and the InfiniBand ports]
InfiniBand Host Channel Adapter
- HCA configuration via Command Queue
  – Initialization
  – Run-time resource assignment and setup
- HCA resources (queues) allocated to applications
  – Resource protection through User Access Region (UAR)
- IO access through HCA QPs (“IO channels”)
  – QP properties match IO requirements
  – Cross-QP resource isolation
- Memory protection via Protection Domains
  – Many-to-one association
    - Address space to Protection Domain
    - QP to Protection Domain
  – Memory access using Key and virtual address
    - Boundary and access-right validation
    - Protection Domain validation
    - Virtual-to-physical (HW) address translation
- Interrupt delivery via Event Queues
[Diagram: kernel driver plus user applications, each owning its own work queues (WQ) and completion queues (CQ); the driver owns the command queue (CCQ); up to 16M work queues]
InfiniBand Host Channel Adapter – virtualized
- HCA initialization by VMM
  – Assign command queue per guest domain
  – HCA resources partitioned and exported to guest OSes
- HCA resources allocated to guests/their apps
  – Resource protection through UAR
- Each VM has direct IO access
  – “Hypervisor offload”
- Memory protection – via Protection Domains
- Address translation step generates HW address
  – Guest Physical Address to HW address translation
  – Validate access rights
[Diagram: Domain0 and guest domains DomainX/Y/Z; each kernel has its own driver, and each application its own WQ/CQ pairs, all on a shared HCA; up to 16M work queues]
- Guest driver manages HCA resources at run-time
  – Each guest OS sees “its own HCA”
  – HCA HW keeps guest OS honest
  – Connection manager – see later
[Diagram: each domain's driver holds its own command queue (CCQ); up to 128 CCQs, up to 16M work queues]
Address translation and protection
Non-virtual server
- HCA TPT set by driver
  – Boundaries, access rights
  – vir2phys table
- Run-time address translation
  – Access-right validation
  – Translation-table walk

Virtual server
- VMM sets guest HW address tables
  – Address space per guest domain
  – Managed and updated by VMM
- Guest driver sets HCA TPT
  – Guest PA in vir2phys table
- Run-time address translation
  1. Virtual to Guest Physical Address
  2. Guest Physical to HW address
[Diagram: the application's MKey plus virtual address index the MKey table; the MKey entry and translation tables yield the HW physical address. In the virtual-server case a second, VM-level guest-physical step sits between the MKey entry and the HW physical address]
IO virtualization with InfiniBand – single node, local IO
- Full offload for local cross-domain access
– Eliminate Hypervisor kernel transition on data path
- Reduce cross-domain access latency
- Reduce CPU utilization
- Kernel bypass on IO access to guest application
- Shared [local] IO
– Shared by guest domains
[Diagram: the Hypervisor's virtual switches and Domain0 bridge are replaced by an IO-integrated HCA with HW switches: IO Hypervisor off-load]
IO virtualization with InfiniBand – multiple nodes (cluster), network-resident IO
- SW-transparent remote IO access
- No Hypervisor kernel transition
- Kernel bypass for guest apps
- Shared [remote] IO
– Shared by domains
[Diagram: multiple virtualized servers, each with its own HCA and HW switches, connected through an InfiniBand switch to shared network-resident IO behind a Domain0 bridge: IO Hypervisor off-load plus SW-transparent scale-out IO sharing]
Network – IPoverIB
- IP over Ethernet
  – SW channel for each domain
    - Virtual NIC in domain
    - Switch in SW (copy, VLANs)
  – Hypervisor call
    - Kernel transition
  – NIC driver in domain0
    - External L2 bridge
- IP over IB
  – HW channel for each domain
    - Virtual NIC in domain
    - Switch in HW (VLANs, data move)
  – Direct HW access from guest domain
    - No Hypervisor transition
  – IPoIB in domain0
    - Bypass L2 bridge
[Diagram: per-domain N/W drivers over SW virtual switches and a Domain0 NIC driver vs. per-domain IPoIB channels switched in HW on the HCA]
Network – sockets
- Sockets over Ethernet
  – TCP/IP stack in guest domain
  – SW L2 channel for guest domain
    - Virtual NIC in domain
    - Switch in SW (copy, VLANs)
  – Hypervisor call
    - Kernel transition
  – NIC driver in domain0
- Sockets over InfiniBand (SDP)
  – HW L4 channel for guest domain
    - Socket QP(s) per domain
    - Transport and switch in HW (copy, VLANs)
  – Direct HW access from guest domain
    - No Hypervisor transition
  – Full bypass of domain0
[Diagram: per-domain TCP/IP stacks over SW virtual switches and a Domain0 NIC driver vs. per-domain SDP channels switched in HW on the HCA]
Storage
- Virtualized disk access – via Hypervisor
  – vSCSI driver in guest domain
  – SCSI “switch” in Hypervisor
    - Switch in SW (copy, isolation)
  – Hypervisor call
    - Kernel transition
  – Disk driver in domain0
    - HBA for SAN
- Virtualized disk access – via InfiniBand
  – SRP initiator per guest domain
  – SCSI “switch” in HCA
    - Transport and switch in HW (copy, isolation)
  – Direct HW access from guest domain
    - No Hypervisor transition
  – Disk driver in domain0
    - Bypass domain0 for SAN
[Diagram: vSCSI drivers through Hypervisor virtual switches to an SRP target adapter vs. per-domain SRP initiators through the HCA and HW switches to the SRP target adapter]
MPI applications
[MPI as an example for user-mode access to network]
- MPI over TCP/IP???
  – Datapath performance hit
    - Two kernel transitions on the performance path
    - Forget about low latency
- MPI driver in guest app
  – No datapath performance hit
    - Direct access to HCA HW
    - Full guest kernel bypass
    - Full Hypervisor bypass
  – Event delivery directly to guest OS
    - Retains control-path performance
  – Memory registration needs attention
    - Registration cache
[Diagram: MPI over per-domain TCP/IP stacks, SW virtual switches and a Domain0 NIC driver vs. MPI in the guest application talking directly to the HCA and HW switches]
Plans
- Stage 1,2 driver update – June ’05
  – Hypervisor bypass for data-path operation
- Stage 3 FW and driver update – Aug ’05
  – Full HCA export to guest domain
[Diagrams: in Stage 1,2 the Hypervisor and Domain0 driver still own HCA resource management while guests bypass them on the data path; in Stage 3 each guest driver gets its own command queue (up to 128 CCQs) and work queues (up to 16M) exported directly from the HCA]
Summary
- InfiniBand HCA is a Hypervisor offload engine
- InfiniBand enables efficient server virtualization
  – Cross-domain isolation
  – Efficient IO sharing
  – Protection enforcement
- Existing HW fully supports virtualization
  – The most cost-effective path for single-node virtual servers
  – SW-transparent scale-out
- VMM support in OpenIB SW stack by fall ’05