IO Virtualization with InfiniBand [InfiniBand as a Hypervisor Accelerator]



SLIDE 1

IO Virtualization with InfiniBand [InfiniBand as a Hypervisor Accelerator]

Michael Kagan, Vice President, Architecture, Mellanox Technologies

michael@mellanox.co.il

SLIDE 2

Key messages

  • InfiniBand enables efficient server virtualization

– Cross-domain isolation
– Efficient IO sharing
– Protection enforcement

  • Existing HW fully supports virtualization

– The most cost-effective path for single-node virtual servers
– SW-transparent scale-out

  • VMM support in OpenIB SW stack by fall ’05

– Alpha version of FW and driver in June

SLIDE 3

InfiniBand scope in server virtualization

  • CPU virtualization – NO

– Compute power

  • Memory virtualization – Partial

– Memory allocation – No
– Address translation – Yes, for IO accesses
– Protection – Yes, for IO accesses

  • IO virtualization – YES

[Figure: virtualized server – Hypervisor with Domain0 (IO drivers, bridge, virtual switch(es)) and guest domains DomainX/DomainY, above the CPU, memory and IO resources]

SLIDE 4

[Figure: InfiniBand fabric – switches interconnecting many end nodes]

InfiniBand – Overview

  • Performance

– Bandwidth – up to 120 Gbit/s per link
– Latency – under 3 µs (today)

  • Kernel bypass for IO access

– Cross-process protection and isolation

  • Quality Of Service

– End node
– Fabric

  • Scalability/flexibility

– Up to 48K local nodes, up to 2^128 total
– Multiple link widths/media (Cu, fiber)

  • Multiple transport services in HW

– Reliable and unreliable
– Connected and datagram
– Automatic path migration in HW

  • Memory exposure to remote node

– RDMA-read and RDMA-write

  • Multiple networks on a single wire

– Network partitioning in HW (“VLAN”)
– Multiple independent virtual networks on a wire

Link data rates: today 2.5, 10, 20, 30, 60 Gb/s; spec up to 120 Gb/s; Cu & optical

SLIDE 5

InfiniBand communication

[Figure: consumer channel interface on one side of the HCA, network (fabric) interface on the other]

SLIDE 6

Consumer Queue Model

  • Asynchronous execution
  • In-order execution on queue
  • Flexible completion report

Host Channel Adapter (HCA)

  • Consumers connected via queues

– Local or remote node

  • 16M independent queues

– 16M IO channels
– 16M QoS levels

  • transport, priority
  • Memory access through virtual address

– Remote and local
– 2G address spaces, 64-bit each
– Access rights and isolation enforced by HW

[Figure: HCA between PCI-Express and the InfiniBand ports, presenting the InfiniBand channel interface]
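
To make the queue model above concrete, here is a minimal sketch of how a consumer obtains these objects through the OpenFabrics libibverbs userspace API (a later packaging of the OpenIB stack this deck refers to). The queue depths, QP type and use of the first device found are illustrative assumptions, and error handling is omitted.

```c
/* Minimal consumer-queue-model sketch using libibverbs.
 * Illustrative only: picks the first HCA, small queue depths,
 * no error handling. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **dev_list = ibv_get_device_list(&num);
    if (!dev_list || num == 0) {
        fprintf(stderr, "no InfiniBand devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue  */

    /* One send/receive queue pair ("IO channel"). The HCA supports
     * millions of these, each independently protected and schedulable. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,                                 /* reliable connected */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created QP 0x%x on %s\n", qp->qp_num,
           ibv_get_device_name(dev_list[0]));

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```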

SLIDE 7

InfiniBand Host Channel Adapter

  • HCA configuration via Command Queue

– Initialization
– Run-time resource assignment and setup

  • HCA resources (queues) allocated for applications

– Resource protection through User Access Region

  • IO access through HCA QPs (“IO channels”)

– QP properties match IO requirements
– Cross-QP resource isolation

  • Memory protection – via Protection Domains

– Many-to-one association

  • Address space to Protection Domain
  • QP to Protection Domain

– Memory access using Key and virtual address

  • Boundary and access right validation
  • Protection Domain validation
  • Virtual to physical (HW) address translation
  • Interrupt delivery – Event Queues

[Figure: userland applications and the kernel driver each own work queues (WQ) and completion queues (CQ) on the HCA, up to 16M in total; the driver configures the HCA through the command queue (CCQ)]
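
The Protection Domain and memory-key bullets above map directly onto memory registration in the verbs API. The sketch below (illustrative flags and sizing, no error handling) registers a buffer in a PD so that every HW access is validated against the region's boundaries, access rights and PD membership, exactly the checks listed above.

```c
/* Sketch: memory protection via a Protection Domain (PD) and memory keys.
 * Illustrative only; error handling omitted. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);

    /* Allow local writes plus remote RDMA reads/writes to this region.
     * A QP associated with a different PD cannot touch it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* mr->lkey goes into local work requests, mr->rkey is handed to the
     * remote side for RDMA; the HCA resolves both through its MKey table. */
    return mr;
}
```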

SLIDE 8

InfiniBand Host Channel Adapter

  • Base HCA model as on Slide 7 (command-queue configuration, per-application queues protected via UAR, IO channels over QPs, Protection Domains, MKey-based address translation, Event Queue interrupts), plus:
  • HCA initialization by VMM

– Assign command queue per guest domain
– HCA resources partitioned and exported to guest OSes

  • HCA resources allocated to guests/their apps

– Resource protection through UAR

  • Each VM has direct IO access

– “Hypervisor offload”

  • Memory protection – via Protection Domains
  • Address translation step generates HW address

– Guest Physical Address to HW address translation
– Validate access rights

[Figure: HCA work queues partitioned across Domain0 and guest domains DomainX/DomainY/DomainZ; each domain's kernel and applications own their WQ/CQ pairs, with Domain0's driver on the command queue (CCQ)]

SLIDE 9

InfiniBand Host Channel Adapter

  • HCA model as on Slides 7 and 8 (VMM initialization, per-domain command queues, direct guest IO access, Protection Domains, guest-physical-to-HW address translation), plus:

  • Guest driver manages HCA resources at run-time

– Each OS sees “its own HCA”
– HCA HW keeps guest OS honest
– Connection manager – see later

[Figure: each domain (Domain0, DomainX, DomainY, DomainZ) runs its own driver with its own command queue – up to 128 CCQ – and its own share of the up to 16M work queues]
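
The deck does not spell out the interface the VMM uses to carve up the HCA, so the following is a purely conceptual C sketch, not a Mellanox driver or firmware structure; every field name is an assumption. It only illustrates the kind of per-guest resource slice the bullets above describe: a private command queue, directly mapped UAR pages, and quotas on queues and protection domains.

```c
/* Hypothetical descriptor of the HCA slice a VMM exports to one guest
 * domain (conceptual illustration only, not a real interface). */
struct hca_guest_slice {
    unsigned cmd_queue_index;   /* one of up to 128 command queues (CCQ)  */
    unsigned uar_page_first;    /* first User Access Region page mapped   */
    unsigned uar_page_count;    /*   directly into the guest              */
    unsigned max_qps;           /* quota of work queue pairs              */
    unsigned max_cqs;           /* quota of completion queues             */
    unsigned max_pds;           /* quota of protection domains            */
    unsigned gpa_table_index;   /* guest-physical-to-HW translation table */
};
```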

SLIDE 10

Address translation and protection

Non-virtual server

  • HCA TPT set by driver

– Boundaries, access rights
– vir2phys table

  • Run-time address translation

– Access right validation
– Translation table walk

Virtual server

  • VMM sets guest HW address tables

– Address space per guest domain
– Managed and updated by VMM

  • Guest driver sets HCA TPT

– Guest PA in vir2phys table

  • Run-time address translation

– Step 1: virtual address to guest physical address
– Step 2: guest physical address to HW address

[Figure: MKey-based translation – non-virtual server: application MKey and virtual address resolved through the MKey table and translation tables to a HW physical address; virtual server: the application MKey entry yields a guest physical address (step 1), and a VM MKey entry maps it to the HW physical address (step 2)]
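
A conceptual sketch of the two-step lookup described above. The structure layout, field names and page size are assumptions made for illustration; they are not the HCA's actual TPT format.

```c
/* Conceptual model of run-time address translation in a virtual server:
 * step 1 turns an application virtual address into a guest physical
 * address via the guest's MKey entry; step 2 turns the guest physical
 * address into a HW physical address via the VMM-owned entry.
 * Rights, boundaries and protection domains are checked at each step.
 * Assumes page-aligned regions; illustration only. */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)

struct mkey_entry {
    uint64_t        start, length;  /* registered region boundaries */
    unsigned        access;         /* access-rights bitmap         */
    unsigned        pd;             /* protection domain            */
    const uint64_t *pages;          /* per-page output addresses    */
};

static bool translate(const struct mkey_entry *e, uint64_t addr,
                      unsigned required_access, unsigned qp_pd,
                      uint64_t *out)
{
    if (e->pd != qp_pd)                                    return false;
    if ((e->access & required_access) != required_access) return false;
    if (addr < e->start || addr >= e->start + e->length)  return false;

    uint64_t off = addr - e->start;
    *out = e->pages[off >> PAGE_SHIFT] | (off & PAGE_MASK);
    return true;
}

/* Virtual server: the same lookup runs twice, first against the guest's
 * MKey entry (VA -> guest PA), then against the VMM's entry (GPA -> HW PA). */
static bool translate_virtual(const struct mkey_entry *guest_mkey,
                              const struct mkey_entry *vmm_mkey,
                              uint64_t va, unsigned access, unsigned qp_pd,
                              uint64_t *hw_pa)
{
    uint64_t gpa;
    return translate(guest_mkey, va, access, qp_pd, &gpa) &&
           translate(vmm_mkey, gpa, access, vmm_mkey->pd, hw_pa);
}
```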

SLIDE 11

IO virtualization with InfiniBand – single node, local IO

  • Full offload for local cross-domain access

– Eliminate Hypervisor kernel transition on data path

  • Reduce cross-domain access latency
  • Reduce CPU utilization
  • Kernel bypass on IO access to guest application
  • Shared [local] IO

– Shared by guest domains

[Figure: "IO Hypervisor Off-load" – the software model (Hypervisor, virtual switch(es), Domain0 bridge on the IO path) next to the offloaded model, where Domain0 and the guest domains reach shared IO directly through the HCA and HW switch(es)]
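
This is what the offloaded data path looks like from a guest application: once queues are connected, a single posted work request moves data to another domain's buffer with no Hypervisor or kernel involvement. A hedged libibverbs sketch; it assumes a connected RC QP and a remote address/rkey exchanged out of band, and omits error handling.

```c
/* Sketch: posting an RDMA write directly from user space.
 * Assumes qp is a connected RC QP and (remote_addr, rkey) were
 * obtained from the peer beforehand. Illustration only. */
#include <stdint.h>
#include <infiniband/verbs.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_WRITE,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,   /* completion reported on the CQ */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr); /* handed to HW, no kernel call */
}
```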

SLIDE 12

IO virtualization with InfiniBand – multiple nodes (cluster), network-resident IO

  • SW-transparent remote IO access
  • No Hypervisor kernel transition
  • Kernel bypass for guest apps
  • Shared [remote] IO

– Shared by domains

[Figure: "IO Hypervisor Off-load, SW-transparent Scale-out, IO sharing" – the software model and the single-node HCA model of Slide 11, extended to two virtualized servers whose guest domains reach shared, network-resident IO through their HCAs, HW switch(es) and an InfiniBand switch, with a Domain0 bridge in front of the IO]

SLIDE 13

Network – IP over InfiniBand (IPoIB)

  • IP over Ethernet

– SW channel for each domain

  • Virtual NIC in domain
  • Switch in SW

– Copy, VLANs

– Hypervisor call

  • Kernel transition

– NIC driver in domain0

  • External L2 bridge
  • IP over IB

– HW channel for each domain

  • Virtual NIC in domain
  • Switch in HW

– VLANs, data move

– Direct HW access from guest domain

  • No Hypervisor transition

– IPoIB in domain0

  • Bypass L2 bridge

[Figure: IP over Ethernet (per-domain N/W drivers, virtual switch(es) and a Domain0 bridge in front of the NIC) compared with IPoIB (per-domain IPoIB drivers going directly to the HCA and HW switch(es), with Domain0 bridging to an external NIC)]

SLIDE 14

Network – sockets

  • Sockets over Ethernet

– TCP/IP stack in guest domain
– SW L2 channel for guest domain

  • Virtual NIC in domain
  • Switch in SW

– Copy, VLANs

  • Hypervisor call

– Kernel transition

– NIC driver in domain0

  • Sockets over InfiniBand (SDP)

– HW L4 channel for guest domain

  • Socket QP(s) per domain
  • Transport and switch in HW

– Copy, VLANs

– Direct HW access from guest domain

  • No Hypervisor transition

– Full bypass of domain0

[Figure: sockets over Ethernet (per-domain TCP/IP stacks and N/W drivers, virtual switch(es), Domain0 bridge, NIC) compared with SDP (per-domain SDP providers going directly to the HCA and HW switch(es))]
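
Because SDP preserves the sockets API, the example below is just an ordinary TCP client; the point of the slide is that this same code can run over an InfiniBand HW channel instead of the virtual Ethernet path, historically by preloading the OpenFabrics libsdp library (LD_PRELOAD=libsdp.so) so that AF_INET stream sockets are mapped onto SDP. The peer address and port are placeholders.

```c
/* Plain sockets client; unchanged whether it runs over TCP/IP or,
 * via an SDP provider such as libsdp, over InfiniBand. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(5000) };
    inet_pton(AF_INET, "192.168.0.2", &addr.sin_addr);   /* placeholder peer */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    const char msg[] = "hello over SDP or TCP";
    write(fd, msg, sizeof(msg));
    close(fd);
    return 0;
}
```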

SLIDE 15

Storage

  • Virtualized disk access

– vSCSI driver in guest domain
– SCSI “switch” in Hypervisor

  • Switch in SW

– Copy, isolation

  • Hypervisor call

– Kernel transition

– Disk driver in domain0

  • HBA for SAN
  • Virtualized disk access

– SRP initiator per guest domain
– SCSI “switch” in HCA

  • Transport and switch in HW

– Copy, isolation

– Direct HW access from guest domain

  • No Hypervisor transition

– Disk driver in domain0

  • Bypass domain0 for SAN

[Figure: virtualized disk access through vSCSI drivers and the Hypervisor's virtual switch(es) to an SRP target adapter, compared with per-domain SRP initiators going through the HCA and HW switch(es) to the SRP target adapter]

SLIDE 16

MPI applications

[MPI as an example of user-mode access to the network]

  • MPI over TCP/IP?

– Datapath performance hit

  • Two kernel transitions on the performance path

– Forget about low latency

  • MPI driver in guest app

– No datapath performance hit

  • Direct access to HCA HW
  • Full guest kernel bypass
  • Full Hypervisor bypass

– Event delivery directly to guest OS

  • Retain control path performance

– Memory registration needs attention

  • Registration cache

[Figure: MPI over TCP/IP (MPI above the guest TCP/IP stack, virtual switch(es), Domain0 bridge, NIC) compared with MPI linked against the InfiniBand user-level library, going directly to the HCA and HW switch(es)]
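
For completeness, a minimal MPI sketch. With an InfiniBand-aware MPI (MVAPICH was the common choice in this timeframe), the send and receive below are posted from user space straight to the HCA, so neither the guest kernel nor the Hypervisor sits on the data path.

```c
/* Minimal two-rank MPI exchange; run with e.g. "mpirun -np 2 ./a.out". */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```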

SLIDE 17

Plans

  • Stage 1,2 driver update – June ’05

– Hypervisor bypass for data path operation

  • Stage 3 FW and driver update – Aug ’05

– Full HCA export to guest domain

[Figure: staged rollout – the HCA with up to 128 command queues (CCQ) and up to 16M work queues shared by Domain0 and guest-domain drivers, annotated Stage 1,2 vs. Stage 3, alongside the virtualized-server IO diagrams of Slide 11]

SLIDE 18

Summary

  • InfiniBand HCA is a Hypervisor offload engine
  • InfiniBand enables efficient server virtualization

– Cross-domain isolation
– Efficient IO sharing
– Protection enforcement

  • Existing HW fully supports virtualization

– The most cost-effective path for single-node virtual servers
– SW-transparent scale-out

  • VMM support in OpenIB SW stack by fall ’05

– Alpha version of FW and driver in June