SLIDE 1

How to Handle Globally Distributed QCOW2 Chains?

Eyal Moscovici & Amit Abir, Oracle-Ravello

SLIDE 2

About Us

  • Eyal Moscovici
    – With Oracle Ravello since 2015
    – Software Engineer in the Virtualization group, focusing on the Linux kernel and QEMU
  • Amit Abir
    – With Oracle Ravello since 2011
    – Virtual Storage & Networking Team Leader

SLIDE 3

Agenda

➔ Oracle Ravello Introduction
➔ Storage Layer Design
➔ Storage Layer Implementation
➔ Challenges and Solutions
➔ Summary

SLIDE 4

Oracle Ravello - Introduction

  • Founded in 2011 by Qumranet founders, acquired in 2016 by Oracle
  • Oracle Ravello is a Virtual Cloud Provider
  • Allows seamless “Lift and Shift”:
    – Migrate on-premise data-center workloads to the public cloud
  • No need to change:
    – The VM images
    – Network configuration
    – Storage configuration

SLIDE 5

Migration to the Cloud - Challenges

  • Virtual hardware
    – Different hypervisors have different virtual hardware
    – Chipsets, disk/net controllers, SMBIOS/ACPI, etc.
  • Network topology and capabilities
    – Clouds only support L3 IP-based communication
    – No switches, VLANs, mirror ports, etc.

SLIDE 6

Virtual hardware support

  • Solved by Nested Virtualization:
    – HVX: our own binary translation hypervisor
    – KVM: when HW assist is available
  • Enhanced QEMU, SeaBIOS & OVMF supporting:
    – i440bx chipset
    – VMXNET3, PVSCSI
    – Multiple para-virtual interfaces (including VMware backdoor ports)
    – SMBIOS & ACPI interface
    – Boot from LSILogic & PVSCSI

SLIDE 7

Network capabilities support

  • Solved by our Software Defined Network (SDN)
  • Leveraging Linux SDN components
    – Tun/Tap, TC actions, bridge, eBPF, etc.
  • Fully distributed network functions
    – Leverages Open vSwitch

SLIDE 8

Oracle Ravello Flow

  • 1. Import: VM images are imported from the on-premise data center (any HW hypervisor) into the Ravello Image Storage
  • 2. Publish: the VMs run inside cloud VMs (KVM/Xen) on top of KVM/HVX, managed from the Ravello Console

[Diagram: Data Center VMs → Ravello Image Storage → Public Cloud VMs running on KVM/HVX]

SLIDE 9

Storage Layer - Challenges

  • Where to place the VM disks' data?
  • Must support multiple clouds and regions
  • Fetch data in real time
  • Clone a VM quickly
  • Writes to the disk must be persistent
SLIDE 10

Storage Layer – Basic Solution

  • Place the VM disk images directly on cloud volumes (e.g. EBS)
  • Advantages:
    – Performance
    – Zero time to first byte
  • Disadvantages:
    – Bound to a single cloud and region
    – Long cloning time
    – Too expensive

[Diagram: QEMU in the cloud VM accesses the volume directly as /dev/sdb]

SLIDE 11

Storage Layer – Alternative Solution

  • Place a raw file in the cloud object storage
  • Advantages:
    – Globally available
    – Fast cloning
    – Inexpensive
  • Disadvantages:
    – Long boot time
    – Long snapshot time
    – The same sectors are stored many times

[Diagram: QEMU accesses the raw image remotely from the object storage]

SLIDE 12

Storage Layer – Our Solution

  • Place the base image in the object storage and upload deltas to create a chain
  • Advantages:
    – Boot starts immediately
    – Only new data is stored
    – Globally available
    – Fast cloning
    – Inexpensive
  • Disadvantages:
    – Performance penalty

[Diagram: writes go to a local tip on the cloud volume; older data is read remotely from the object storage]

SLIDE 13

Storage Layer Architecture

  • The VM disk is backed by a QCow2 image chain
  • Reads are performed by Cloud FS: our read-only storage-layer file system
    – Translates disk reads into HTTP requests
    – Supports multiple cloud object storages
    – Caches read data locally
    – FUSE based

[Diagram: QEMU writes to the QCow2 tip on a cloud volume; the rest of the QCow2 chain is read through Cloud FS and its cache from the object storage]

SLIDE 14

CloudFS - Read Flow

The read flow inside the cloud VM (a sketch of the translation appears below):

  1. QEMU reads from the mounted chain file:
     read("/mnt/cloudfs/diff4", offset=1024, size=512, ...)
  2. FUSE dispatches the request to Cloud FS:
     fuse_op_read("/mnt/cloudfs/diff4", offset=1024, size=512, ...)
  3. Cloud FS issues an HTTP range request to the cloud object storage:
     GET /diff4 HTTP/1.1
     Host: ravello-vm-disks.s3.amazonaws.com
     x-amz-date: Wed, 18 Oct 2017 21:32:02 GMT
     Range: bytes=1024-1535
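To make the translation concrete, here is a minimal sketch of a FUSE read handler that forwards a read as an HTTP range GET via libcurl. The bucket URL is taken from the flow above; the handler name, the missing request signing, and the absence of local caching are simplifications for illustration, not Ravello's actual Cloud FS code.

/* Sketch: translate a FUSE read into an HTTP Range GET (illustrative). */
#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <curl/curl.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

struct sink_buf { char *p; size_t off, cap; };

static size_t sink(void *data, size_t sz, size_t n, void *userp)
{
    struct sink_buf *b = userp;
    size_t len = sz * n;
    if (b->off + len > b->cap)
        len = b->cap - b->off;          /* never overflow the caller's buffer */
    memcpy(b->p + b->off, data, len);
    b->off += len;
    return sz * n;                      /* report full consumption to curl */
}

static int cloudfs_read(const char *path, char *out, size_t size,
                        off_t offset, struct fuse_file_info *fi)
{
    char url[512], range[64];
    struct sink_buf b = { out, 0, size };
    CURL *c = curl_easy_init();

    /* object name == file name; bucket URL as in the flow above */
    snprintf(url, sizeof(url),
             "https://ravello-vm-disks.s3.amazonaws.com%s", path);
    snprintf(range, sizeof(range), "%lld-%lld",
             (long long)offset, (long long)(offset + size - 1));

    curl_easy_setopt(c, CURLOPT_URL, url);
    curl_easy_setopt(c, CURLOPT_RANGE, range);   /* "Range: bytes=..." */
    curl_easy_setopt(c, CURLOPT_WRITEFUNCTION, sink);
    curl_easy_setopt(c, CURLOPT_WRITEDATA, &b);
    CURLcode rc = curl_easy_perform(c);
    curl_easy_cleanup(c);
    return rc == CURLE_OK ? (int)b.off : -EIO;
}

/* A mountable FS would also need getattr/open/readdir; omitted here. */
static const struct fuse_operations ops = { .read = cloudfs_read };
int main(int argc, char *argv[]) { return fuse_main(argc, argv, &ops, NULL); }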

SLIDE 15

CloudFS - Write Flow

  • A new tip for the QCow2 chain is created with qemu-img create:
    – Before a VM starts
    – Before a snapshot (using the QMP command blockdev-snapshot-sync; a sketch follows below)
  • The tip is uploaded to the cloud storage:
    – After the VM stops
    – During a snapshot

[Diagram: QEMU writes go to the tip inside the cloud VM; the tip is uploaded to the object storage]
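As a rough illustration of the snapshot path, the following sketch sends blockdev-snapshot-sync over QEMU's QMP UNIX socket from C. The socket path, device name and snapshot file are hypothetical, and a real client would parse the JSON replies rather than just printing them; this is not Ravello's management code.

/* Illustrative only: create a new tip via QMP blockdev-snapshot-sync.
 * Socket path, device name and snapshot file are hypothetical. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

static void qmp_send(int fd, const char *json)
{
    char reply[4096];
    ssize_t n;
    write(fd, json, strlen(json));
    n = read(fd, reply, sizeof(reply) - 1);  /* naive: one reply per command */
    if (n > 0) { reply[n] = '\0'; printf("%s", reply); }
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX,
                                .sun_path   = "/var/run/vm0.qmp" };
    char greeting[4096];
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    read(fd, greeting, sizeof(greeting));    /* consume the QMP banner */

    qmp_send(fd, "{\"execute\": \"qmp_capabilities\"}");
    /* create a new tip on top of the current image */
    qmp_send(fd, "{\"execute\": \"blockdev-snapshot-sync\", \"arguments\": "
                 "{\"device\": \"drive-virtio-disk0\", "
                 "\"snapshot-file\": \"/images/tip-new.qcow2\", "
                 "\"format\": \"qcow2\"}}");
    close(fd);
    return 0;
}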

SLIDE 16

Accelerate Remote Access

  • Small requests are extended to 2MB requests (a sketch follows below)
    – Assumes data read locality
    – Trades latency for throughput
    – Experiments showed that 2MB is optimal
  • QCow2 chain files have random names
    – So requests are spread across different cloud storage workers
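The extension might look like the following sketch; the 2MB policy comes from the slide, while the function name and the clamping at end-of-file are illustrative details.

/* Illustrative: widen a small read into a 2MB-aligned window so that one
 * HTTP request serves many nearby reads (read-locality assumption). */
#include <stdint.h>

#define FETCH_WINDOW (2u * 1024 * 1024)    /* 2MB, per the experiments above */

struct range { uint64_t off, len; };

static struct range extend_request(uint64_t off, uint64_t len,
                                   uint64_t file_size)
{
    struct range r;
    uint64_t end = off + len;

    r.off = off & ~(uint64_t)(FETCH_WINDOW - 1);            /* round down */
    end = (end + FETCH_WINDOW - 1) & ~(uint64_t)(FETCH_WINDOW - 1);
    if (end > file_size)
        end = file_size;                                    /* clamp at EOF */
    r.len = end - r.off;
    return r;
}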

SLIDE 17

Globally Distributed Chains

  • A VM can start on any cloud or region
  • New data is uploaded to the region the VM runs in
    – Data locality is assumed
  • Globally distributed chains are created
  • Problem: reading data from remote regions can be slow

[Diagram: one chain spread across regions – e.g. Base/diff1/diff2/diff3 between AWS Sydney and OCI Phoenix, diff4 in GCE Frankfurt]

SLIDE 18

Globally Distributed Chains - Solution

  • Every region keeps its own cache for the parts of the chain that live in other regions
  • The first time a VM starts in a new region, every remote sector read is copied into the regional cache (a sketch follows below)

[Diagram: the chain Base…diff3 lives in one region; another region's cache already holds copies of Base and diff1]
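One way to picture the regional cache is a read-through layer over the remote fetch, sketched here under simple assumptions: a sparse local cache file per chain link plus a bitmap of already-cached blocks. All names are illustrative, and the stubbed remote_fetch stands in for the HTTP range read shown earlier.

/* Illustrative read-through regional cache: serve a block from the local
 * cache file if present, otherwise fetch remotely and populate the cache. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE (2u * 1024 * 1024)   /* cache at fetch-window granularity */

struct link_cache {
    int      cache_fd;   /* sparse local file, same size as the chain link */
    uint8_t *valid;      /* one bit per block already present in the cache */
};

/* Stub standing in for the HTTP range read sketched earlier. */
static int remote_fetch(const char *name, uint64_t off, void *buf, size_t len)
{
    (void)name; (void)off;
    memset(buf, 0, len);                /* pretend we fetched the data */
    return (int)len;
}

static int cached_read(struct link_cache *c, const char *name,
                       uint64_t block, void *buf)
{
    uint64_t off = block * BLOCK_SIZE;

    if (c->valid[block / 8] & (1u << (block % 8)))      /* cache hit */
        return (int)pread(c->cache_fd, buf, BLOCK_SIZE, (off_t)off);

    int n = remote_fetch(name, off, buf, BLOCK_SIZE);   /* cache miss */
    if (n < 0)
        return n;
    pwrite(c->cache_fd, buf, (size_t)n, (off_t)off);    /* populate cache */
    c->valid[block / 8] |= 1u << (block % 8);
    return n;
}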

SLIDE 19

Performance Drawbacks of QCow Chains

  • QCow2 keeps minimal information about the entire chain: only its backing file
    – QEMU must “walk the chain” to load each image's metadata (L1 table) into RAM
  • Some metadata (the L2 tables) is spread across the image
    – A single disk read can trigger multiple random remote reads of metadata from multiple remote files
  • qemu-img commands work on the whole virtual disk
    – Hard to bound their execution time

SLIDE 20

Keep QCow2 Chains Short

  • A new tip for the QCow2 chain is created:
    – Each time a VM starts
    – On each snapshot
  • Problem: chains keep getting longer!
    – For example: a VM with 1 disk that was started 100 times has a chain 100 links deep
  • Long chains cause:
    – High latency: reading data/metadata requires “walking the chain”
    – High memory usage: each file has its own metadata (L1 table).
      1MB (L1 size) * 100 (links) = 100MB per disk; assuming 10 VMs with 4 disks each, that is 4GB of memory overhead

[Diagram: virtual disk backed by a chain running from Base to Tip]

SLIDE 21

Keep QCow2 Chains Short (Cont.)

  • Solution: merge the tip with its backing file before upload
    – Rebase the tip onto the grandparent
    – Only when the backing file is small (~300MB), to keep snapshot time minimal
  • This is done live or offline (a QMP sketch follows below):
    – Live: using the QMP block-stream job command
    – Offline: using qemu-img rebase

[Diagram: Tip A is rebased onto B, yielding a rebased tip whose backing file is B]
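For the live path, the QMP exchange might look like the following sketch; device and file names are hypothetical and replies are not parsed.

/* Illustrative: start a live block-stream job over QMP to merge backing
 * data into the tip. Names hypothetical, error handling omitted. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un a = { .sun_family = AF_UNIX,
                             .sun_path   = "/var/run/vm0.qmp" };
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    char buf[4096];

    connect(fd, (struct sockaddr *)&a, sizeof(a));
    read(fd, buf, sizeof(buf));                     /* QMP greeting */

    const char *caps = "{\"execute\": \"qmp_capabilities\"}";
    write(fd, caps, strlen(caps));
    read(fd, buf, sizeof(buf));

    /* stream everything above the grandparent B into the tip */
    const char *stream =
        "{\"execute\": \"block-stream\", \"arguments\": "
        "{\"device\": \"drive-virtio-disk0\", \"base\": \"/images/B.qcow2\"}}";
    write(fd, stream, strlen(stream));
    read(fd, buf, sizeof(buf));                     /* job-started reply */
    close(fd);
    return 0;
}

The offline path is the plain command-line form, e.g. qemu-img rebase -b B.qcow2 tip.qcow2 (illustrative file names).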

SLIDE 22

qemu-img rebase

  • Problem: qemu-img rebase does a per-byte comparison between ALL the allocated sectors not present in the tip
    – Its logic differs from the QMP block-stream rebase
    – It requires fetching all of these sectors

static int img_rebase(int argc, char **argv)
{
    ...
    for (sector = 0; sector < num_sectors; sector += n) {
        ...
        ret = blk_pread(blk_old_backing, sector << BDRV_SECTOR_BITS,
                        buf_old, n << BDRV_SECTOR_BITS);
        ...
        ret = blk_pread(blk_new_backing, sector << BDRV_SECTOR_BITS,
                        buf_new, n << BDRV_SECTOR_BITS);
        ...
        while (written < n) {
            if (compare_sectors(buf_old + written * 512,
                                buf_new + written * 512,
                                n - written, &pnum)) {
                ret = blk_pwrite(blk,
                                 (sector + written) << BDRV_SECTOR_BITS,
                                 buf_old + written * 512,
                                 pnum << BDRV_SECTOR_BITS, 0);
            }
            written += pnum;
        }
    }
}

[Diagram: rebase target B, intermediate A and Tip forming the virtual disk]

SLIDE 23

qemu-img rebase (2)

  • Solution: an optimized rebase when the rebase target is in the same image chain
    – Only compare sectors that were changed after the rebase target

static int img_rebase(int argc, char **argv)
{
    ...
    // check if blk_new_backing and blk are in the same chain
    same_chain = ...
    for (sector = 0; sector < num_sectors; sector += n) {
        ...
        m = n;
        if (same_chain) {
            ret = bdrv_is_allocated_above(blk, blk_new_backing,
                                          sector, m, &m);
            if (!ret) {
                continue;
            }
        }
        ...

[Diagram: sectors below the rebase target B need no comparison]

SLIDE 24

Reduce First Remote Read Latency

  • Problem: high latency on the first remote read of data
    – Prolongs boot time
    – Prolongs user application startup
    – Gets worse with long chains (more remote reads)

[Diagram: QEMU reads the chain remotely from the object storage]

SLIDE 25

Prefetch Disk Data

  • Solution: prefetch the disk data (a sketch follows below)
    – While the VM is running, start reading the disks' data from the cloud
    – Read all disks in parallel
    – Only at relatively idle times

SLIDE 26

Prefetch Disk Data

  • Naive solution: read ALL the files in the chain
  • Problem: we may fetch a lot of redundant data
    – An image may contain data that was later overwritten higher up the chain

[Diagram: parts of B and A are overshadowed by the Tip and are redundant]

SLIDE 27

Avoid pre-fetching redundant data

  • Solution: fetch data through the virtual disk exposed to the guest
    – Mount the tip image as a block device
    – Read data from the block device
    – QEMU will fetch only the relevant data

> qemu-nbd --connect=/dev/nbd0 tip.qcow
> dd if=/dev/nbd0 of=/dev/null

[Diagram: reading through the virtual disk skips the redundant data below the Tip]

SLIDE 28

Avoid pre-fetching redundant data (2)

  • Problem: reading the raw block device reads ALL sectors
    – Reading unallocated sectors wastes CPU cycles
  • Solution: use qemu-img map
    – Returns a map of the allocated sectors
    – Allows us to read only the allocated sectors

> qemu-img map tip.qcow

SLIDE 29

Avoid pre-fetching redundant data (3)

  • Problem: qemu-img map works on the whole disk
    – Takes a long time to finish
    – We can't prefetch data while the map is being computed

SLIDE 30

Avoid pre-fetching redundant data (4)

  • Solution: split the map of the disk
    – We added offset and length parameters to the map operation
    – Bounds execution time
    – Lets prefetching start quickly

> qemu-img map -offset 0 -length 1G tip.qcow

SLIDE 31

Summary

  • The Oracle Ravello storage layer is implemented using QCow2 chains
    – Stored in the public clouds' object storage
  • The QCow2 and QEMU implementations are not ideal for our use case
    – QCow2 keeps minimal metadata about the entire chain
    – QCow2 metadata is spread across the file
    – QEMU must often “walk the chain”
  • We would like to work with the community to improve performance in use cases such as ours

SLIDE 32

Questions?

Thank you!