

SLIDE 1

PBLCACHE

A client side persistent block cache for the data center
Vault Boston 2015 - Luis Pabón - Red Hat

SLIDE 2

ABOUT ME

LUIS PABÓN

Principal Software Engineer, Red Hat Storage
IRC, GitHub: lpabon

SLIDE 3

QUESTIONS:

[Diagram: compute node with a local SSD, backed by remote storage]

What are the benefits of client side persistent caching?
How to effectively use the SSD?

SLIDE 4

MERCURY*

Use in-memory data structures to handle cache misses as quickly as possible
Write sequentially to the SSD
Cache must be persistent, since warming could be time consuming
Increase storage backend availability by reducing read requests

* S. Byan, et al., Mercury: Host-side flash caching for the data center

SLIDE 5

MERCURY QEMU INTEGRATION

SLIDE 6

PBLCACHE

SLIDE 7

PBLCACHE

Persistent, block based, look-aside cache for QEMU
User space library/application
Based on ideas described in the Mercury paper
Requires exclusive access to mutable objects

Persistent BLock Cache

SLIDE 8

GOAL: QEMU SHARED CACHE

SLIDE 9

PBLCACHE ARCHITECTURE

[Diagram: PBL Application → Cache Map → Log → SSD]

SLIDE 10

PBL APPLICATION

Sets up the cache map and log
Decides how to use the cache (writethrough, read-miss)
Inserts, retrieves, or invalidates blocks from the cache (sketched below)

[Diagram: the Pbl App drives the Cache Map and Log through a message queue]
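
As a rough illustration of those responsibilities, here is a minimal Go sketch of a look-aside read-miss and writethrough flow. The Cache and Storage types and every method name below are assumptions for illustration, not the actual pblcache API.

    // Hypothetical look-aside usage; all type and method names are
    // illustrative, not the real pblcache API.
    func readBlock(cache *Cache, storage Storage, vol uint16, blk uint64) ([]byte, error) {
        if buf, ok := cache.Get(vol, blk); ok {
            return buf, nil // cache hit: no backend I/O needed
        }
        buf, err := storage.ReadBlock(vol, blk) // miss: go to the backend
        if err != nil {
            return nil, err
        }
        cache.Put(vol, blk, buf) // read-miss policy: insert what was read
        return buf, nil
    }

    // Writethrough: the backend is written first, then the cache is
    // updated so a later read cannot return stale data.
    func writeBlock(cache *Cache, storage Storage, vol uint16, blk uint64, buf []byte) error {
        if err := storage.WriteBlock(vol, blk, buf); err != nil {
            return err
        }
        cache.Put(vol, blk, buf)
        return nil
    }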

SLIDE 11

CACHE MAP

Composed of two data structures
Maintains all block metadata

[Diagram: Address Map and Block Descriptor Array]

SLIDE 12

ADDRESS MAP


Implemented as a hash table
Translates object blocks to Block Descriptor Array (BDA) indices
Cache misses are detected extremely quickly
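
In Go terms, the idea can be sketched as a plain map keyed by block address; the type and field names below are illustrative, not pblcache's actual types.

    // A block address: which object/volume, and which block within it.
    type BlockKey struct {
        Volume uint16
        Block  uint64
    }

    // The address map is a hash table from block address to BDA index;
    // hit or miss is decided by a single hash lookup.
    func lookup(addressmap map[BlockKey]uint64, key BlockKey) (index uint64, hit bool) {
        index, hit = addressmap[key]
        return index, hit
    }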

SLIDE 13

BLOCK DESCRIPTOR ARRAY


Contains metadata for blocks stored in the log
Length is equal to the maximum number of blocks stored in the log
Handles CLOCK evictions
Invalidations are extremely fast

Insertions always append
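
A sketch of the CLOCK sweep, continuing the illustrative types above; the field names are assumptions, not pblcache's real layout.

    // One descriptor per block slot in the log.
    type BlockDescriptor struct {
        Key   BlockKey // reverse link into the address map
        Used  bool     // slot currently holds a valid cached block
        Clock bool     // set on read hit, cleared by the sweep
    }

    type cacheMap struct {
        addressmap map[BlockKey]uint64
        bda        []BlockDescriptor
        hand       int // CLOCK hand; insertions append here
    }

    // victimIndex advances the hand until it finds a free slot or a
    // block whose CLOCK bit is clear; recently hit blocks get a
    // second chance.
    func (c *cacheMap) victimIndex() int {
        for {
            bd := &c.bda[c.hand]
            i := c.hand
            c.hand = (c.hand + 1) % len(c.bda)
            if !bd.Used || !bd.Clock {
                if bd.Used {
                    delete(c.addressmap, bd.Key) // evict the old block
                }
                return i
            }
            bd.Clock = false // spare it this pass
        }
    }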

SLIDE 14

CACHE MAP I/O FLOW

[Diagram: Block Descriptor Array]

SLIDE 15

CACHE MAP I/O FLOW

[Flowchart: Get → in address map? No → miss; Yes → set CLOCK bit in BDA → read from log]
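
In code form, the Get path above reduces to one map lookup plus a bit set (continuing the illustrative cacheMap sketch):

    // get: a miss costs one hash probe; a hit marks the CLOCK bit so
    // the eviction sweep will spare this block once.
    func (c *cacheMap) get(key BlockKey) (index uint64, hit bool) {
        index, hit = c.addressmap[key]
        if !hit {
            return 0, false // miss: caller reads from backend storage
        }
        c.bda[index].Clock = true
        return index, true // caller reads the block from the log
    }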

SLIDE 16

CACHE MAP I/O FLOW

[Flowchart: Invalidate → free BDA index → delete from map]
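
The invalidate path is equally small, which is why invalidations are so cheap (same illustrative types as above):

    // invalidate: drop the address-map entry and free the BDA slot.
    func (c *cacheMap) invalidate(key BlockKey) bool {
        index, hit := c.addressmap[key]
        if !hit {
            return false
        }
        delete(c.addressmap, key)        // delete from map
        c.bda[index] = BlockDescriptor{} // free the BDA index
        return true
    }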

SLIDE 17

LOG

Block location determined by BDA
CLOCK optimized with segment read-ahead
Segment pool with buffered writes
Contiguous block support

[Diagram: log segments laid out on the SSD]
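
Because a block's position in the log follows directly from its BDA index, the log needs no index of its own. A sketch of the arithmetic, assuming a 4 KiB block size purely for illustration:

    const blockSize = 4 * 1024 // illustrative block size

    // The BDA index alone determines where a block lives on the SSD.
    func logOffset(bdaIndex uint64) int64 {
        return int64(bdaIndex) * blockSize
    }

    // The segment a block falls in, used for read-ahead and buffering.
    func segmentOf(bdaIndex, blocksPerSegment uint64) uint64 {
        return bdaIndex / blocksPerSegment
    }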

SLIDE 18

LOG SEGMENT STATE MACHINE

SLIDE 19

LOG READ I/O FLOW

[Flowchart: Read → in a segment? Yes → read from segment; No → read from SSD]
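
A sketch of that decision in Go, reusing logOffset and segmentOf from the log sketch above; bufferedSegment and blockData are hypothetical helpers standing in for the segment pool.

    import "os"

    type cacheLog struct {
        fp               *os.File // SSD log device or file
        blocksPerSegment uint64
    }

    // readBlock serves the read from a segment still buffered in RAM
    // when possible, and falls back to an SSD read otherwise.
    func (l *cacheLog) readBlock(bdaIndex uint64, buf []byte) error {
        // bufferedSegment is a hypothetical segment-pool lookup.
        if seg := l.bufferedSegment(segmentOf(bdaIndex, l.blocksPerSegment)); seg != nil {
            copy(buf, seg.blockData(bdaIndex)) // RAM hit: no SSD I/O
            return nil
        }
        _, err := l.fp.ReadAt(buf, logOffset(bdaIndex)) // read from SSD
        return err
    }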

SLIDE 20

PERSISTENT METADATA

Saves the address map to a file on application shutdown
Warms the cache on application restart
Not designed to be durable: after a system crash the metadata file is never created
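
A sketch of what that save/restore could look like, using Go's encoding/gob as an assumed serialization format; pblcache's actual on-disk metadata format may differ.

    import (
        "encoding/gob"
        "os"
    )

    // saveMetadata runs on clean shutdown; if the process crashes
    // first, the file is never created and the cache restarts cold.
    func saveMetadata(path string, addressmap map[BlockKey]uint64) error {
        fp, err := os.Create(path)
        if err != nil {
            return err
        }
        defer fp.Close()
        return gob.NewEncoder(fp).Encode(addressmap)
    }

    // loadMetadata warms the cache on restart when a save file exists.
    func loadMetadata(path string) (map[BlockKey]uint64, error) {
        fp, err := os.Open(path)
        if err != nil {
            return nil, err // no file: start with a cold cache
        }
        defer fp.Close()
        addressmap := make(map[BlockKey]uint64)
        err = gob.NewDecoder(fp).Decode(&addressmap)
        return addressmap, err
    }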

SLIDE 21

PBLIO BENCHMARK

PBL APPLICATION

SLIDE 22

PBLIO

Benchmark tool
Uses an enterprise workload generator from NetApp*
Cache set up as writethrough
Can be used with or without pblcache
Documentation:

https://github.com/pblcache/pblcache/wiki/Pblio

* S. Daniel et al., A portable, open-source implementation of the SPC-1 workload
https://github.com/lpabon/goioworkload

SLIDE 23

ENTERPRISE WORKLOAD

Synthetic OLTP enterprise workload generator
Tests for the maximum number of IOPS before exceeding 30 ms latency
Divides the storage system into three logical storage units:
  ASU1 - Data Store - 45% of total storage - RW
  ASU2 - User Store - 45% of total storage - RW
  ASU3 - Log - 10% of total storage - Write Only
BSU - Business Scaling Units: 1 BSU = 50 IOPS
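
As a worked example: -bsu=2 targets 2 × 50 = 100 IOPS, which matches the roughly 98.63 average IOPS reported in the runs below, and the 31 BSUs cited on slide 32 correspond to 31 × 50 = 1550 IOPS.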

SLIDE 24

SIMPLE EXAMPLE

$ fallocate -l 45MiB file1
$ fallocate -l 45MiB file2
$ fallocate -l 10MiB file3
$
$ ./pblio -asu1=file1 \
        -asu2=file2 \
        -asu3=file3 \
        -runlen=30 -bsu=2

pblio
Cache   : None
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s

Avg IOPS: 98.63   Avg Latency: 0.2895 ms

SLIDE 25

RAW DEVICES EXAMPLE

$ ./pblio -asu1=/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde \
        -asu2=/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi \
        -asu3=/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm \
        -runlen=30 -bsu=2

SLIDE 26

CACHE EXAMPLE

$ fallocate -l 10MiB mycache
$ ./pblio -asu1=file1 -asu2=file2 -asu3=file3 \
        -runlen=30 -bsu=2 -cache=mycache

pblio
Cache   : mycache (New)
C Size  : 0.01 GB
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s

Avg IOPS: 98.63   Avg Latency: 0.2573 ms

Read Hit Rate: 0.4457
Invalidate Hit Rate: 0.6764
Read hits: 1120
Invalidate hits: 347
Reads: 2513
Insertions: 1906
Evictions: 0
Invalidations: 513
== Log Information ==
Ram Hit Rate: 1.0000
Ram Hits: 1120
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 0
Wraps: 1
Segments Skipped: 0
Mean Read Latency: 0.00 usec
Mean Segment Read Latency: 4396.77 usec
Mean Write Latency: 1162.58 usec

SLIDE 27

LATENCY OVER 30 MS

pblio
Cache   : /dev/sdg (Loaded)
C Size  : 185.75 GB
ASU1    : 673.83 GB
ASU2    : 673.83 GB
ASU3    : 149.74 GB
BSUs    : 32
Contexts: 1
Run time: 600 s

Avg IOPS: 1514.92   Avg Latency: 112.1096 ms

Read Hit Rate: 0.7004
Invalidate Hit Rate: 0.7905
Read hits: 528539
Invalidate hits: 120189
Reads: 754593
Insertions: 378093
Evictions: 303616
Invalidations: 152039
== Log Information ==
Ram Hit Rate: 0.0002
Ram Hits: 75
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 445638
Wraps: 0
Segments Skipped: 0
Mean Read Latency: 850.89 usec
Mean Segment Read Latency: 2856.16 usec
Mean Write Latency: 6472.74 usec


SLIDE 28

EVALUATION

SLIDE 29

TEST SETUP

Client using 180GB SAS SSD (about 10% of workload size)
GlusterFS 6x2 cluster
100 files for each ASU
pblio v0.1 compiled with go1.4.1
Each system has:
  Fedora 20
  6 Intel Xeon E5-2620 @ 2GHz
  64 GB RAM
  5 300GB SAS drives
  10Gbit network

SLIDE 30

CACHE WARMUP IS TIME CONSUMING

16 hours

SLIDE 31

INCREASED RESPONSE TIME

73% Increase

SLIDE 32

STORAGE BACKEND IOPS REDUCTION

BSU = 31 or 1550 IOPS

~75% IOPS Reduction

SLIDE 33

CURRENT STATUS

SLIDE 34

MILESTONES

1. Create Cache Map - COMPLETED
2. Create Log - COMPLETED
3. Create Benchmark application - COMPLETED
4. Design pblcached architecture - IN PROGRESS

SLIDE 35

NEXT: QEMU SHARED CACHE

Work with the community to bring this technology to QEMU
Possible architecture:
Some conditions to think about:
  VM migration
  Volume deletion
  VM crash

SLIDE 36

FUTURE

Hyperconvergence
Peer-cache
Writeback
Shared cache
QoS using mClock*
Possible integrations with Ceph and GlusterFS backends

* A. Gulati et al., mClock: Handling Throughput Variability for Hypervisor IO Scheduling

SLIDE 37

JOIN!

GitHub: https://github.com/pblcache/pblcache
IRC Freenode: #pblcache
Google Group: https://groups.google.com/forum/#!forum/pblcache
Mail list: pblcache@googlegroups.com

SLIDE 38

FROM THIS...

SLIDE 39

TO THIS

SLIDE 40