SLIDE 1

qcow2 – why (not)?
Max Reitz <mreitz@redhat.com>
Kevin Wolf <kwolf@redhat.com>
KVM Forum 2015

SLIDE 2

Choosing between raw and qcow2
Traditional answer:

Performance? raw! Features? qcow2!

But what if you need both?

SLIDE 3

A car analogy
Throwing out the seats gives you better acceleration. Is it worth it?


SLIDE 5

Our goal
Keep the seats in! Never try to get away without qcow2’s features.

SLIDE 6

Part I
What are those features?

SLIDE 7

qcow2 features
• Backing files
• Internal snapshots
• Zero clusters and partial allocation (on all filesystems)
• Compression
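For illustration (image and snapshot names are placeholders), three of these map directly to standard qemu-img invocations:

  qemu-img create -f qcow2 -b base.qcow2 overlay.qcow2   # backing file
  qemu-img snapshot -c clean vm.qcow2                    # internal snapshot
  qemu-img convert -c -O qcow2 vm.raw vm.qcow2           # compressed copy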

SLIDE 8

qcow2 metadata
• Image is split into clusters (default: 64 kB)
• L2 tables map guest offsets to host offsets
• Refcount blocks store allocation information
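To make the sizes concrete (simple arithmetic, not on the slide): with 64 kB clusters, an L2 table is itself one cluster, i.e. 64 KiB ÷ 8 B = 8192 entries, and each entry maps one 64 kB cluster, so a single L2 table covers 8192 × 64 kB = 512 MB of guest data.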

SLIDE 9

qcow2 metadata
For non-allocating I/O, only the L2 tables are needed.

SLIDE 10

Part II
Preallocated images

SLIDE 11

What is tested?
• Linux guest with fio (120 s runtime per test/pattern; O_DIRECT, AIO)
• 6 GB images on SSD and HDD
• Random/sequential access with 4k/1M blocks
• qcow2: preallocation=metadata
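The slides don’t spell out the exact commands; a minimal sketch of one test point, assuming the image is attached to the guest as /dev/vdb:

  # host: 6 GB qcow2 image with preallocated metadata
  qemu-img create -f qcow2 -o preallocation=metadata test.qcow2 6G

  # guest: one of the eight test patterns (4k random writes shown)
  fio --name=4k-randwrite --filename=/dev/vdb --rw=randwrite --bs=4k \
      --direct=1 --ioengine=libaio --runtime=120 --time_based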

SLIDE 12

SSD write performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 13

SSD read performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 14

HDD write performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 15

HDD read performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 16

So?

Looks good, right?

SLIDE 17

So?

Let’s increase the image size!

SLIDE 18

SSD 16 GB image write performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 19

SSD 16 GB image read performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 20

HDD 32 GB image write performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 21

HDD 32 GB image read performance
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 22

What happened?
Cache thrashing happened! qcow2 caches L2 tables, and the default cache size of 1 MB covers only 8 GB of an image.
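To check that figure: 1 MiB of cache ÷ 8 B per L2 entry = 131,072 entries, each mapping one 64 kB cluster, so the cache covers 131,072 × 64 kB = 8 GiB of the virtual disk.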

SLIDE 23

How to fix it?

1. DON’T PANIC – don’t fix it.
Random accesses contained within an 8 GB area are fine, no matter the image size.

2. Increase the cache size.
l2-cache-size runtime option, e.g. -drive format=qcow2,l2-cache-size=4M,...

l2-cache-size = area size ÷ (cluster size ÷ 8) = area size ÷ 8192 (for 64 kB clusters)
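Applying that formula to the images above: the 16 GB SSD image needs 16 GiB ÷ 8192 = 2 MiB of L2 cache and the 32 GB HDD image needs 4 MiB, which is exactly what the following slides use.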

SLIDE 24

SSD 16 GB image, 2 MB L2 cache, writing
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 25

SSD 16 GB image, 2 MB L2 cache, reading
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 26

HDD 32 GB image, 4 MB L2 cache, writing
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 27

HDD 32 GB image, 4 MB L2 cache, reading
[bar chart: 4k random, 1M random, 4k seq, 1M seq; y-axis: fraction of raw IOPS; series: raw, qcow2]

SLIDE 28

Results
No significant difference between raw and qcow2 for preallocated images... as long as the L2 cache is large enough!
Without COW, everything is good. But it is named qcow2 for a reason...

SLIDE 29

Part III
Cluster allocations

SLIDE 30

Cluster allocation
When is a new cluster allocated? (see the example after this list)
• When writing to unallocated clusters
  – Previous content in the backing file
  – Without a backing file: all zero
• For COW, if the existing cluster was shared
  – Internal snapshots
  – Compressed images
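One way to watch these allocations from the host (image names are placeholders): qemu-img map reports which guest ranges are allocated in which file of the backing chain:

  qemu-img create -f qcow2 -b base.qcow2 overlay.qcow2
  qemu-img map overlay.qcow2   # every range still served from base.qcow2
  # ... after the guest writes, the touched clusters show up in overlay.qcow2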

SLIDE 31

Copy on Write

[diagram: a 64 kB cluster grid (64k/128k/192k); a write request covers part of one cluster, the rest of that cluster is the copy-on-write area]

• Cluster content must be completely valid (64k)
• Guest may write with sector granularity (512 b)
• Partial write to a newly allocated cluster → the rest must be filled with old data
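A concrete instance (offsets made up for illustration): the guest writes 4 kB at offset 68k, which lands in the unallocated cluster covering 64k–128k. qcow2 must allocate the whole cluster and fill 64k–68k (4 kB) and 72k–128k (56 kB) with the old content from the backing file (or zeroes) in addition to writing the guest’s 4 kB.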

SLIDE 32

Copy on Write

[diagram: same as the previous slide]

COW cost is the most expensive part of allocations:

1. More I/O requests
2. More bytes transferred
3. More disk flushes (in some cases)

SLIDE 33

Copy on Write is slow (Problem 1)

[diagram: same COW layout as above]

Naive implementation: 2 reads and 3 writes (read the head and tail of the cluster, then write head, guest data, and tail separately). About a 30% performance hit vs. rewriting an already-allocated cluster.

SLIDE 34

Copy on Write is slow (Problem 1)

[diagram: same COW layout as above]

Can combine the three writes into a single request:
• Fixes allocation performance without a backing file
• Doesn’t fix the other cases: the read is expensive

SLIDE 35

Copy on Write is slow (Problem 2)

[diagram: four consecutive write requests filling clusters sequentially; the COW areas between them are unnecessary COW overhead]

Most COW is unnecessary for sequential writes. If the COW area is going to be overwritten anyway, avoid the copy in the first place.

SLIDE 36

qcow2 data cache
Metadata already uses a cache for batching. We can do the same for data!
• Mark the COW area invalid at first
• Only read from the backing file when the area is accessed
• Overwriting makes it valid → read avoided

SLIDE 37

Data cache performance

  • Seq. allocating writes (qcow2 with backing file)

[bar chart: 8k rewrite, 256k rewrite; y-axis: MB/s; series: master, data cache, raw]

SLIDE 38

Copy on Write is slow (Problem 3)
Internal COW (internal snapshots, compression):

1. Allocate the new cluster: the refcount must be increased before the mapping update.

2. Drop the reference to the old cluster: the mapping must be updated before the refcount decrease.

→ Two (slow) disk flushes needed per allocation

SLIDE 39

Copy on Write is slow (Problem 3)
Possible solutions:
• lazy_refcounts=on allows temporarily inconsistent refcounts
• Implement journalling, which allows updating both at the same time
→ No flushes needed → performance fixed
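Turning on lazy refcounts (placeholder image name; it requires a compat=1.1 image, and after a crash the image must be repaired with qemu-img check -r all):

  qemu-img create -f qcow2 -o compat=1.1,lazy_refcounts=on vm.qcow2 16G
  qemu-img amend -f qcow2 -o lazy_refcounts=on vm.qcow2   # or enable it on an existing image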

SLIDE 40

Another solution: Avoid COW

[diagram: a write request into a cluster; the rest of the cluster stays unmodified (would be a COW area with large clusters)]

Don’t optimize COW, avoid it → Use a small cluster size (= sector size)

SLIDE 41

Another solution: Avoid COW

[diagram: same as the previous slide]

But a small cluster size isn’t practicable (see the numbers below):
• Large metadata (but no larger caches)
• Potentially more fragmentation
→ No COW any more, but everything is slow
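A back-of-the-envelope check (numbers not from the slides): L2 tables cost 8 B per cluster, so a 1 TB image needs about 1 TiB × 8 B ÷ 64 KiB = 128 MiB of L2 tables with 64 kB clusters, but 1 TiB × 8 B ÷ 512 B = 16 GiB with 512 B clusters.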

SLIDE 42

Subclusters

[diagram: clusters (64k/128k/192k) subdivided into subclusters; a write request allocates only the touched subclusters, the rest stay unallocated]

Split the cluster size into two different sizes (a possible sizing is sketched below):
• Granularity of the mapping (clusters, large)
• Granularity of COW (subclusters, small)
• Add a subcluster bitmap to the L2 table for COW status
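As a purely hypothetical sizing (not from the slides): splitting a 64 kB cluster into 32 subclusters of 2 kB would need a 32-bit bitmap per L2 entry; growing each entry from 8 B to 16 B doubles the L2 metadata while keeping the 64 kB mapping granularity.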

SLIDE 43

Subclusters

[diagram: same as the previous slide]

• Requires an incompatible image format change
• Can solve problems 1 and 2, but not 3

SLIDE 44

Status
• Data cache: prototype patches exist (ready for 2.5 or 2.6?)
• Subclusters: only theory, no code; still useful even once the data cache is merged
• Journalling: not anytime soon; use lazy refcounts for internal COW instead

SLIDE 45

Questions?