To EL2, and Beyond! Optimizing the Design and Implementation of KVM/ARM



slide-1
SLIDE 1

connect.linaro.org

LEADING COLLABORATION IN THE ARM ECOSYSTEM

To EL2, and Beyond!

Optimizing the Design and Implementation of KVM/ARM

Christoffer Dall <cdall@kernel.org>
 Shih-Wei Li <shihwei@cs.columbia.edu>

slide-2
SLIDE 2

“Efficient, isolated duplicate of the real machine”

“…a statistically dominant subset of the virtual processor’s instructions be executed directly by the real processor, with no software intervention by the VMM.”

– Popek and Goldberg [Formal requirements for virtualizable third generation architectures ’74]

slide-3
SLIDE 3

IBM 360/91

Columbia University Computer Center machine room in February or March 1969

slide-4
SLIDE 4

PDP-10

KL10 CPU and MH10 memory cabinets
 Originally installed 1985 at Sikorsky Aircraft

slide-5
SLIDE 5

Gigabyte R270-T61
 96 Cores

Dual Cavium ThunderX

slide-6
SLIDE 6

Virtualization

[Diagram: two stacks side by side. Native: Hardware, OS Kernel, App App App. Virtual Machines: Hardware, Hypervisor, and two VMs, each with a Kernel and App App. Both stacks are labeled with Privileged and Non-privileged levels.]

slide-7
SLIDE 7

Non-virtualizable architectures

slide-8
SLIDE 8

ARM Hardware Virtualization Support

VT-x

!=

Virtualization Extensions

slide-9
SLIDE 9

x86 Virtualization Support

Root (Hypervisor) Non-Root (VM)

VM Exit

VMCS

VM Entry

slide-10
SLIDE 10

ARM Virtualization Extensions

Kernel User

EL0 EL1

Hypervisor

EL2

slide-11
SLIDE 11

EL2

  • Separate CPU mode designed to run hypervisors
  • Not designed to run full operating systems
  • Reduced virtual memory support compared to EL1
  • Limited support for interacting with userspace in EL0
slide-12
SLIDE 12

ARM VE and Hypervisors

Xen

Dom0 Linux

App App

DomU Linux

App App

EL0 EL1 EL2

?

slide-13
SLIDE 13

KVM/ARM

  • KVM is integrated with Linux
  • Linux is a full operating system designed to run in EL1

  • KVM cannot run VMs without EL2
slide-14
SLIDE 14

KVM/ARM Split-Mode

Host

Linux App App

VM

Kernel App App KVM

KVM lowvisor

EL0 EL1 EL2

  • 1. Hypercall
  • 2. Return
  • 3. Hypercall
  • 4. Return

switch state

slide-15
SLIDE 15

What if we could do this?

Host

Linux App App

VM

Kernel App App KVM

EL0 EL1 EL2

  • 1. Hypercall
  • 2. Return
slide-16
SLIDE 16

ARMv8.1 VHE

  • Virtualization Host Extensions
  • Supports running unmodified OSes in EL2 without using EL1

Linux

EL0 EL1 EL2

App App

slide-17
SLIDE 17

VHE #1: Backwards Compatible

  • HCR_EL2.E2H completely enables or disables VHE
  • When disabled, completely backwards compatible with ARMv8.0
  • Example: Xen disables VHE
slide-18
SLIDE 18

VHE #2: Expands Functionality of EL2

  • Expanded EL2 functionality
  • Inherits all EL1 MMU features
  • New virtual EL2 timer
  • A corresponding EL2 system register for each EL1 system register
slide-19
SLIDE 19

VHE #3: Support Userspace in EL0

  • TGE: Trap General Exceptions
  • Routes all exceptions to EL2
  • With VHE, setting TGE no longer disables the stage 1 MMU in EL0
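
The interaction between the E2H and TGE bits can be sketched as a small C model. This is only an illustration: the bit positions are those of HCR_EL2 in the ARMv8-A architecture, while the helper names `vhe_enabled` and `el0_stage1_mmu_usable` are hypothetical and not taken from KVM/ARM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of two HCR_EL2 fields in the ARMv8-A architecture. */
#define HCR_EL2_TGE (UINT64_C(1) << 27) /* Trap General Exceptions */
#define HCR_EL2_E2H (UINT64_C(1) << 34) /* EL2 Host: enables VHE */

/* True when VHE is enabled in this (modeled) HCR_EL2 value. */
static bool vhe_enabled(uint64_t hcr_el2)
{
    return (hcr_el2 & HCR_EL2_E2H) != 0;
}

/*
 * Hypothetical helper: can EL0 run with stage 1 translation in use?
 * As stated on the slides, TGE without VHE disables the MMU for
 * userspace, while TGE with VHE keeps it usable.
 */
static bool el0_stage1_mmu_usable(uint64_t hcr_el2)
{
    bool tge = (hcr_el2 & HCR_EL2_TGE) != 0;

    return !tge || vhe_enabled(hcr_el2);
}
```

In this model only the combination TGE set with E2H clear leaves userspace without a usable stage 1 MMU, which is the case that rules out plain TGE routing on non-VHE hardware.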

Linux

EL0 EL1 EL2

App App

Exceptions

slide-20
SLIDE 20

VHE #4: EL2&0 Translation Regime

  • Same page table format as EL1
  • Used in EL0 with TGE bit set
slide-21
SLIDE 21

VHE #5: System Register Redirection

  • Linux is written to run in EL1
  • EL<x> is controlled by EL<x> system registers
  • VHE runs Linux in EL2
  • Unmodified!

Linux

EL0 EL1 EL2

App App

EL1 Registers EL2 Registers

Linux

slide-22
SLIDE 22

VHE #5: System Register Redirection

  • Linux is written to run in EL1
  • VHE runs Linux in EL2
  • Unmodified!

Linux

EL0 EL1 EL2

App App

EL1 Registers EL2 Registers

Linux

slide-23
SLIDE 23

VHE: System Register Redirection

mrs x0, ESR_EL1

slide-24
SLIDE 24

VHE #5: System Register Redirection

VHE Disabled

mrs x0, ESR_EL1 accesses ESR_EL1 (not ESR_EL2)

slide-25
SLIDE 25

VHE #5: System Register Redirection

VHE Enabled

mrs x0, ESR_EL1 executed in EL2 is redirected to ESR_EL2 (not ESR_EL1)

slide-26
SLIDE 26

VHE #5: System Register Redirection

mrs x0, ESR_EL12 executed in EL2 accesses ESR_EL1

VM

Kernel App App

EL0 EL1 EL2 Host

App App Linux

KVM
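
The redirection rules on the last few slides can be summarized in a toy C model. It is a sketch under assumptions: the enums and `resolve_el2_access` are hypothetical names, only the ESR registers are modeled, and the `ESR_EL12` alias is treated as unavailable when VHE is disabled.

```c
#include <stdbool.h>

/* Registers the hardware can actually reach in this toy model. */
enum phys_reg { PHYS_ESR_EL1, PHYS_ESR_EL2 };

/* Register names as written in an instruction executed at EL2. */
enum reg_name { NAME_ESR_EL1, NAME_ESR_EL2, NAME_ESR_EL12 };

/*
 * Hypothetical model of which register an access from EL2 reaches.
 * Returns false when the name is not usable in the given mode.
 */
static bool resolve_el2_access(enum reg_name name, bool vhe, enum phys_reg *out)
{
    switch (name) {
    case NAME_ESR_EL1:
        /* With VHE, EL1 names used at EL2 are redirected to EL2. */
        *out = vhe ? PHYS_ESR_EL2 : PHYS_ESR_EL1;
        return true;
    case NAME_ESR_EL2:
        *out = PHYS_ESR_EL2;
        return true;
    case NAME_ESR_EL12:
        if (!vhe)
            return false; /* alias only modeled for VHE */
        *out = PHYS_ESR_EL1; /* lets the host reach the VM's EL1 register */
        return true;
    }
    return false;
}
```

The `_EL12` name is what lets the hypervisor save and restore the VM's EL1 state even though its own accesses to EL1 names now land on EL2 registers.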

slide-27
SLIDE 27

VHE #5: More System Register Redirection

  • Some registers change bit position to be similar between EL1 and EL2
  • Example:
  • VHE: CNTKCTL_EL1 redirects to CNTHCTL_EL2
  • But they have different layouts
  • VHE: EL2 register changes layout to EL1 register (with extra bits)
slide-28
SLIDE 28

Legacy KVM/ARM without VHE

Hypervisor Linux EL2 EL1 KVM Lowvisor

Trap Run VM

slide-29
SLIDE 29

KVM/ARM with VHE

Hypervisor Linux EL2 KVM world switch

Function Call Run VM

slide-30
SLIDE 30

No VHE hardware

  • How do we measure VHE performance?
  • None available at start of this work
  • Still no publicly available hardware
slide-31
SLIDE 31

Linux in EL2

Modify Linux to:

  • 1. Access EL2 registers
  • 2. Use EL2 virtual memory system
  • 3. Support user space applications in EL0

EL0 EL1 EL2

Linux Userspace KVM

slide-32
SLIDE 32

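System Register Accesses

  • Lots of:

#ifndef CONFIG_EL2_KERNEL
	msr	tcr_el1, x0	/* kernel in EL1: write the EL1 translation control register */
#else
	msr	tcr_el2, x0	/* kernel in EL2: write the EL2 counterpart instead */
#endif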

slide-33
SLIDE 33

EL1 VA Space (39 bits)

Userspace (TTBR0_EL1): 0x0 to 0x7f ffffffff

Kernel (TTBR1_EL1): 0xffffff80 00000000 to 0xffffffff ffffffff
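
As a rough illustration of this layout, the following C sketch classifies a 64-bit virtual address into the two 39-bit ranges shown on the slide. The enum and the function `classify_el1_va` are hypothetical names introduced only for this example.

```c
#include <stdint.h>

#define VA_BITS 39

enum va_region { VA_USER_TTBR0, VA_KERNEL_TTBR1, VA_INVALID };

/* Classify a 64-bit virtual address in the 39-bit EL1 layout. */
static enum va_region classify_el1_va(uint64_t va)
{
    uint64_t top = va >> VA_BITS; /* the 25 bits above the 39-bit range */

    if (top == 0)
        return VA_USER_TTBR0;   /* 0x0 .. 0x7f ffffffff */
    if (top == (UINT64_MAX >> VA_BITS))
        return VA_KERNEL_TTBR1; /* 0xffffff80 00000000 .. all ones */
    return VA_INVALID;          /* hole between the two ranges */
}
```

Each side gets a full 2^39 bytes because the upper address bits select between two separate translation table base registers.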

slide-34
SLIDE 34

EL2 VA Space (39 bits)

TTBR0_EL2: 0x0 to 0x7f ffffffff

Where do we put the kernel and userspace?

slide-35
SLIDE 35

EL2 Split VA Space

TTBR0_EL2: 0x0 to 0x7f ffffffff, shared by Kernel and Userspace

Split point: 0x3f ffffffff / 0x40 00000000

  • Problem A: address space compression
  • Problem B: Page table formats
  • Problem C: requires TLB invalidation

*Only problems on non-VHE hardware!
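
Problem A can be made concrete with a short C sketch of the split. The constants come from the addresses on the slide; the names `el2_va_half` and `el2_half_size` are hypothetical, and the sketch deliberately does not say which half holds the kernel, since the extracted slide text does not state it.

```c
#include <stdint.h>

#define EL2_VA_LAST UINT64_C(0x7fffffffff) /* last address TTBR0_EL2 covers */
#define EL2_SPLIT   UINT64_C(0x4000000000) /* first address of the upper half */

enum el2_half { EL2_LOWER_HALF, EL2_UPPER_HALF, EL2_OUT_OF_RANGE };

/* Decide which half of the single EL2 range an address falls into. */
static enum el2_half el2_va_half(uint64_t va)
{
    if (va > EL2_VA_LAST)
        return EL2_OUT_OF_RANGE;
    return va < EL2_SPLIT ? EL2_LOWER_HALF : EL2_UPPER_HALF;
}

/* Size in bytes of each half: 2^38, versus 2^39 per side in the EL1 layout. */
static uint64_t el2_half_size(void)
{
    return EL2_SPLIT;
}
```

Kernel and userspace therefore each lose one bit of address space compared with the EL1 layout, which is the compression the slide refers to.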

slide-36
SLIDE 36

Sharing Page Tables in EL0 and EL2

  • Same page table between user and kernel
  • Different page table format in EL0 and EL2


Descriptor bit   EL0           EL2
AP[2]            R/W           R/W
AP[1]            User access   RES1
UXN/XN           UXN           XN
PXN              PXN           RES0

slide-37
SLIDE 37

Descriptor bit   EL0           EL2
AP[2]            R/W           R/W
AP[1]            User access   RES1
UXN/XN           UXN           XN
PXN              PXN           RES0

The AP[1] bit and Linux in EL2

  • AP[1] controls if userspace can access the page
  • Must be set to 0 for kernel mappings
  • RES1 in EL2
slide-38
SLIDE 38

RES1 definition

ARMv8.0 hardware must treat non-register RES1 bits as:
 
 “reads-as-written with no effect on the behaviour of the CPU”

slide-39
SLIDE 39

Descriptor bit   EL0           EL2
AP[2]            R/W           R/W
AP[1]            User access   RES1
UXN/XN           UXN           XN
PXN              PXN           RES0

UXN/XN and PXN for Linux in EL2

  • PXN has no effect outside EL1
  • UXN/XN means ‘execute never’ in both modes
  • Cannot separate user and kernel executable
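
A simplified C model shows why a shared page table cannot keep user and kernel executable permissions apart in EL2. The bit positions are those of ARMv8-A stage 1 descriptors; the three helper functions are hypothetical and ignore every other permission check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stage 1 page/block descriptor attribute bits (ARMv8-A). */
#define PTE_AP1 (UINT64_C(1) << 6)  /* EL1&0: user access; EL2: RES1 */
#define PTE_AP2 (UINT64_C(1) << 7)  /* read-only when set */
#define PTE_PXN (UINT64_C(1) << 53) /* EL1&0: privileged execute never; EL2: RES0 */
#define PTE_XN  (UINT64_C(1) << 54) /* EL1&0: UXN; EL2: XN */

/* EL1&0 view: may the kernel execute from this page? */
static bool el1_kernel_can_exec(uint64_t pte)
{
    return !(pte & PTE_PXN);
}

/* EL1&0 view: may userspace execute from this page? */
static bool el0_user_can_exec(uint64_t pte)
{
    return (pte & PTE_AP1) && !(pte & PTE_XN);
}

/* EL2 view: the single XN bit alone decides executability. */
static bool el2_can_exec(uint64_t pte)
{
    return !(pte & PTE_XN);
}
```

For example, a user text page encoded as `PTE_AP1 | PTE_PXN` is executable by userspace and protected from the kernel in the EL1 view, yet `el2_can_exec` reports it as executable by a kernel running in EL2.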
slide-40
SLIDE 40

No ASID Support in EL2

  • Address Space Identifiers (ASID)
  • Avoids TLB aliasing by tagging accesses with per-context ID
  • No ASID support in EL2
  • Must invalidate EL2 TLB on host process context switch
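
The cost of missing ASIDs can be sketched with a toy single-entry TLB in C. Everything here is hypothetical and heavily simplified (real TLBs hold many entries and are managed by hardware); it only illustrates why an untagged TLB must be invalidated on a context switch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy single-entry TLB; real TLBs hold many entries. */
struct tlb_entry {
    bool     valid;
    uint16_t asid; /* ignored when the regime has no ASID support */
    uint64_t va;
    uint64_t pa;
};

/* Look up va; with ASIDs the entry must also belong to the current context. */
static bool tlb_hit(const struct tlb_entry *e, bool has_asid,
                    uint16_t cur_asid, uint64_t va)
{
    if (!e->valid || e->va != va)
        return false;
    return !has_asid || e->asid == cur_asid;
}

/* Context switch: only an untagged TLB has to be invalidated. */
static void context_switch(struct tlb_entry *e, bool has_asid)
{
    if (!has_asid)
        e->valid = false;
}
```

Without the tag, an entry left behind by the previous process would still hit for the next one, so the only safe option is to drop it on every host process switch.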
slide-41
SLIDE 41

Routing Exceptions to EL2

[Diagram: two stacks. "Linux in EL1": User in EL0, Kernel in EL1. "Linux in EL2": User in EL0, Kernel in EL2, with EL1 in between. Each stack has arrows labeled "Exceptions from kernel" and "Exceptions from userspace".]

slide-42
SLIDE 42

Routing Exceptions to EL2

  • HCR_EL2.TGE traps general exceptions to EL2
  • Does NOT work, because TGE without VHE disables MMU in userspace
slide-43
SLIDE 43

Routing Exceptions to EL2

  • Forward exceptions with software using a small shim

Kernel User

EL0 EL2 EL1

shim

slide-44
SLIDE 44

Linux in EL2 on non-VHE hardware

The bad (and the ugly)

  • Less secure than Linux in EL1
  • Relies on strictly correct implementation of RES1 page table bits
  • Potentially worse performance for host workloads

The Good

  • Good prototyping tool!
  • Closely emulates performance of VHE for running VMs

slide-45
SLIDE 45

Experimental Setup

  • AMD Seattle B0 ARM Server
  • 64-bit ARMv8-A
  • 2.0 GHz AMD A1100 CPU
  • 8-way SMP
  • 16 GB RAM
  • 10 Gb Ethernet (passthrough)

*Measurements obtained using Linux in EL2.

slide-46
SLIDE 46

VHE Performance at First Glance

CPU Clock Cycles   non-VHE   VHE*
Hypercall          3,181     3,045

*Measurements obtained using Linux in EL2.

slide-47
SLIDE 47

The KVM Run Loop

vcpu_load vcpu_put vcpu run loop

while (1) {
        prepare();
        run_vcpu();
        handle_exit();
}

slide-48
SLIDE 48

KVM/ARM Optimization

  • Move logic out of the run loop and into vcpu_load and vcpu_put
  • Only possible with VHE (or Linux in EL2)

vcpu_load vcpu_put vcpu run loop
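
The effect of this restructuring can be sketched with a small C model that counts state switches. It is an illustration only: `struct vcpu_model`, `switch_state`, `run_legacy`, and `run_optimized` are hypothetical names, and the actual VM entry and exit handling is reduced to comments.

```c
struct vcpu_model {
    int state_switches; /* counts save/restore operations performed */
};

static void switch_state(struct vcpu_model *v)
{
    v->state_switches++;
}

/* Legacy shape: state is restored and saved around every VM entry. */
static void run_legacy(struct vcpu_model *v, int exits)
{
    for (int i = 0; i < exits; i++) {
        switch_state(v); /* restore guest state before entering the VM */
        /* run_vcpu(); handle_exit(); */
        switch_state(v); /* save guest state after the exit */
    }
}

/* Optimized shape: the same work is done once in load and once in put. */
static void run_optimized(struct vcpu_model *v, int exits)
{
    switch_state(v); /* vcpu_load */
    for (int i = 0; i < exits; i++) {
        /* run_vcpu(); handle_exit(); */
    }
    switch_state(v); /* vcpu_put */
}
```

In this model the legacy loop performs two switches per exit, while the optimized version performs two in total regardless of how many exits occur between load and put.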

slide-49
SLIDE 49

ARM Generic Timers

  • Also known as “Architected Timers”
  • Timer hardware directly programmable by guest
  • Expired timers generate physical interrupts for the hypervisor

slide-50
SLIDE 50

KVM/ARM Timers

VCPU entry

  • Programs timer with guest state

VCPU is running

  • When the timer fires it causes an exit to the hypervisor

VCPU exit

  • Reads guest timer state to memory
  • Disables hardware timer
  • In software: If timer is expired, inject virtual interrupt
slide-51
SLIDE 51

Optimized KVM/ARM Timers

VCPU load

  • Programs timer with guest state

VCPU is running

  • When the timer fires it causes an exit to the hypervisor

KVM is running

  • When the timer fires, the timer ISR injects virtual interrupts to the guest.

VCPU put

  • Reads guest timer state to memory
  • Disables hardware timer
slide-52
SLIDE 52

EL1 System Registers

VM

Kernel App App

EL0 EL1 EL2 Host

App App Linux

KVM

  • Defer saving/restoring EL1 system register state to vcpu_load and vcpu_put

slide-53
SLIDE 53

Virtualization Features

VM

Kernel App App

EL0 EL1 EL2 Host

App App Linux

KVM

  • Legacy KVM/ARM design enabled/disabled virtualization features on every transition
  • Virtual/Physical interrupts
  • Stage 2 memory translation

KVM Lowvisor

Disable traps Enable traps

slide-54
SLIDE 54

Virtualization Features

VM

Kernel App App

EL0 EL1 EL2 Host

App App Linux

KVM

Optimized version:

  • Leave virtualization features enabled
  • Host EL2 never uses stage 2 translations and always has full hardware access.

slide-55
SLIDE 55

Rewrite the World-Switch

  • Rewrite the world switch code
  • Very simple VHE function
  • Complicated non-VHE function

kvm_arch_vcpu_ioctl_run {
        ...
        while (1) {
                ...
                if (has_vhe()) /* static key */
                        ret = kvm_vcpu_vhe_run(vcpu);
                else
                        ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
                ...
        }
        ...
}

slide-56
SLIDE 56

Experimental Setup

  • AMD Seattle B0 ARM Server
  • 64-bit ARMv8-A
  • 2.0 GHz AMD A1100 CPU
  • 8-way SMP
  • 16 GB RAM
  • 10 Gb Ethernet (passthrough)

*Measurements obtained using Linux in EL2.

  • Dell r320 x86 Server
  • 64-bit Intel
  • 2.1 GHz Xeon E5-2450
  • 8-way SMP
  • 16 GB RAM
  • 10 Gb Ethernet (passthrough)
slide-57
SLIDE 57

Microbenchmark Results

CPU Clock Cycles   non-VHE   VHE OPT*   x86
Hypercall          3,181     752        1,437
I/O Kernel         3,992     1,604      2,565
I/O User           6,665     7,630      6,732
Virtual IPI        14,155    2,526      3,102

*Measurements obtained using Linux in EL2.
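
To put the microbenchmark numbers in relative terms, the following C sketch computes the ratio of non-VHE to optimized cycle counts. It assumes the figures on this slide are whole cycle counts with a thousands separator (so 3.181 means 3181 cycles); the struct and function names are hypothetical.

```c
struct microbench {
    const char *name;
    unsigned non_vhe; /* cycles, legacy KVM/ARM */
    unsigned vhe_opt; /* cycles, optimized KVM/ARM using Linux in EL2 */
};

/* Values copied from the slide, read as whole cycle counts (assumption). */
static const struct microbench results[] = {
    { "Hypercall",    3181,  752 },
    { "I/O Kernel",   3992, 1604 },
    { "I/O User",     6665, 7630 },
    { "Virtual IPI", 14155, 2526 },
};

/* A ratio above 1.0 means the optimized design needs fewer cycles. */
static double speedup(const struct microbench *m)
{
    return (double)m->non_vhe / (double)m->vhe_opt;
}
```

Under that reading, the hypercall path comes out at a little over 4x fewer cycles, while I/O User is the one row where the optimized configuration is slightly slower than non-VHE.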

slide-58
SLIDE 58

Application Workloads

Application   Description
Kernbench     Kernel compile
Hackbench     Scheduler stress
Netperf       Network performance
Apache        Web server stress
Memcached     Key-Value store

slide-59
SLIDE 59

Application Workloads

[Bar chart: normalized overhead (lower is better), y-axis from 0.00 to 2.00, for Kernbench, Hackbench, TCP_STREAM, TCP_MAERTS, TCP_RR, Apache, and Memcached. Series: non-VHE, VHE OPT*, x86.]

*Measurements obtained using Linux in EL2. See BKK16 talk.

slide-60
SLIDE 60

Conclusions

  • Optimize and redesign KVM/ARM for VHE
  • Significant improvement in microbenchmark results
  • Significant improvement in application benchmark results
  • Similar (or better) performance characteristics compared to x86
  • Published in USENIX ATC’17: https://www.usenix.org/system/files/conference/atc17/atc17-dall.pdf

slide-61
SLIDE 61

Code

  • Timer optimization patches (v4): https://lists.cs.columbia.edu/pipermail/kvmarm/2017-October/027836.html
  • Core optimization patches: https://lists.cs.columbia.edu/pipermail/kvmarm/2017-October/027523.html
  • Linux in EL2 (not for upstream, not supported, don’t come crying…): https://github.com/chazy/el2linux

  • Target is v4.16
  • Reviews are welcome