PBLCACHE
A client side persistent block cache for the data center
Vault Boston 2015 - Luis Pabón - Red Hat
ABOUT ME
LUIS PABÓN
Principal Software Engineer, Red Hat Storage
IRC, GitHub: lpabon
[Diagram: Compute Node with a local SSD cache in front of backend Storage]
What are the benefits of client side persistent caching?
How can the SSD be used effectively?
Use in-memory data structures to handle cache misses as quickly as possible
Write sequentially to the SSD
The cache must be persistent, since warming could be time consuming
Increase storage backend availability by reducing read requests
* S. Byan, et al., Mercury: Host-side flash caching for the data center
MERCURY QEMU INTEGRATION
PBLCACHE
Persistent, block based, look-aside cache for QEMU
User space library/application
Based on ideas described in the Mercury paper
Requires exclusive access to mutable objects
Persistent BLock Cache
GOAL: QEMU SHARED CACHE
PBLCACHE ARCHITECTURE
[Diagram: PBL Application → Cache Map → Log → SSD]
PBL APPLICATION
Sets up the cache map and log
Decides how to use the cache (write-through, read-miss)
Inserts, retrieves, or invalidates blocks from the cache (sketch below)
[Diagram: Pbl App → Msg Queue → Cache Map and Log]
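As a look-aside cache, policy lives in the application. Below is a minimal Go sketch of the read-miss and write-through paths described above; the Cache interface, its method names, and the 4 KiB block size are illustrative assumptions, not pblcache's actual API.

package pbl

import "io"

// Cache stands in for the pblcache handle; these method names are
// assumptions for illustration only.
type Cache interface {
    Get(lba uint64, buf []byte) (hit bool, err error) // fetch a cached block
    Put(lba uint64, buf []byte) error                 // insert or update a block
    Invalidate(lba uint64) error                      // drop a block
}

const blockSize = 4096 // assumed cache block size in bytes

// readBlock: read-miss policy. Try the cache first; on a miss, read
// from backend storage and insert the block for next time.
func readBlock(c Cache, storage io.ReaderAt, lba uint64, buf []byte) error {
    if hit, err := c.Get(lba, buf); err != nil || hit {
        return err
    }
    if _, err := storage.ReadAt(buf, int64(lba)*blockSize); err != nil {
        return err
    }
    return c.Put(lba, buf)
}

// writeBlock: write-through policy. The backend is updated first, so
// the cache never holds data newer than storage.
func writeBlock(c Cache, storage io.WriterAt, lba uint64, buf []byte) error {
    if _, err := storage.WriteAt(buf, int64(lba)*blockSize); err != nil {
        return err
    }
    return c.Put(lba, buf)
}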
CACHE MAP
Composed of two data structures: the Address Map and the Block Descriptor Array
Maintains all block metadata
ADDRESS MAP
Implemented as a hash table
Translates object blocks to Block Descriptor Array (BDA) indices
Cache misses are determined extremely quickly
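A minimal sketch of the idea, assuming a plain Go map keyed by block address; pblcache's concrete layout may differ:

package pblmap

// addressMap: device block address -> index into the Block Descriptor
// Array (BDA). A failed lookup *is* the cache miss, so a miss costs
// one hash probe.
type addressMap map[uint64]int

func (m addressMap) lookup(block uint64) (bdaIndex int, hit bool) {
    bdaIndex, hit = m[block]
    return bdaIndex, hit
}

func (m addressMap) insert(block uint64, bdaIndex int) { m[block] = bdaIndex }
func (m addressMap) remove(block uint64)               { delete(m, block) }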
BLOCK DESCRIPTOR ARRAY
Contains metadata for blocks stored in the log
Length is equal to the maximum number of blocks stored in the log
Handles CLOCK evictions
Invalidations are extremely fast
Insertions always append (CLOCK sketch below)
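A compact sketch of one BDA slot and a CLOCK sweep, with assumed field names; pblcache's real descriptors carry more state:

package pblbda

// blockDescriptor is one BDA slot; there is one slot per block the
// log can hold. Field names are assumptions.
type blockDescriptor struct {
    address uint64 // device block stored in this log slot
    used    bool   // slot holds valid data
    clock   bool   // second-chance bit, set on every read hit
}

// bda holds the descriptors plus a sweeping hand; new blocks always
// land at the hand, so the log is filled in append order.
type bda struct {
    entries []blockDescriptor
    hand    int
}

// insert picks the slot for a new block, evicting with CLOCK when the
// array is full: a referenced slot gets a second chance, an
// unreferenced one is reused.
func (b *bda) insert(address uint64) (index int, evicted uint64, didEvict bool) {
    for {
        e := &b.entries[b.hand]
        index = b.hand
        b.hand = (b.hand + 1) % len(b.entries)
        switch {
        case !e.used: // free slot
            *e = blockDescriptor{address: address, used: true}
            return index, 0, false
        case e.clock: // recently referenced: spare it once
            e.clock = false
        default: // victim: reuse its slot
            evicted = e.address
            *e = blockDescriptor{address: address, used: true}
            return index, evicted, true
        }
    }
}

The caller would also update the address map: remove the evicted block's entry and point the new block at the returned index.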
CACHE MAP I/O FLOW
[Diagram: Block Descriptor Array]
CACHE MAP I/O FLOW
Get → in address map?
  Yes (hit): set CLOCK bit in BDA, read from log
  No: miss
CACHE MAP I/O FLOW
Invalidate → delete from the address map, free the BDA index (sketch below)
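Both flows touch only in-memory structures, which is what keeps misses and invalidations cheap. A sketch combining them, with simplified stand-ins for the map and BDA (the free-list is an assumption):

package pblflow

type cacheMap struct {
    addr  map[uint64]int // address map: block -> BDA index
    clock []bool         // CLOCK bits, one per BDA slot
    free  []int          // reusable BDA indices (illustrative)
}

// get: a hash miss is the cache miss; a hit marks the slot referenced
// and tells the caller where in the log to read.
func (c *cacheMap) get(block uint64) (bdaIndex int, hit bool) {
    idx, ok := c.addr[block]
    if !ok {
        return 0, false
    }
    c.clock[idx] = true
    return idx, true
}

// invalidate: drop the mapping and recycle the slot. No SSD I/O is
// involved, which is why invalidations are extremely fast.
func (c *cacheMap) invalidate(block uint64) {
    if idx, ok := c.addr[block]; ok {
        delete(c.addr, block)
        c.clock[idx] = false
        c.free = append(c.free, idx)
    }
}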
LOG
Block location determined by BDA
CLOCK optimized with segment read-ahead
Segment pool with buffered writes
Contiguous block support
[Diagram: log divided into Segments on the SSD]
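Because insertions fill the log in BDA order, the BDA index alone fixes where a block lives on the SSD. A sketch of that arithmetic, with assumed block and segment sizes:

package pbllog

const (
    blockSize        = 4096                         // bytes per block (assumed)
    blocksPerSegment = 256                          // blocks per segment (assumed)
    segmentSize      = blockSize * blocksPerSegment // 1 MiB segments
)

// location maps a BDA index to a segment number and byte offset on
// the SSD, so no per-block location metadata is stored anywhere.
func location(bdaIndex int) (segment int, offset int64) {
    segment = bdaIndex / blocksPerSegment
    offset = int64(bdaIndex) * blockSize
    return segment, offset
}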
LOG SEGMENT STATE MACHINE
LOG READ I/O FLOW
Read → is the block's segment in a segment buffer?
  Yes: read from the segment (RAM)
  No: read from the SSD
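A sketch of that check against the segment pool; the buffer bookkeeping here is an assumption:

package pblread

import "os"

const (
    blockSize        = 4096 // bytes per block (assumed)
    blocksPerSegment = 256  // blocks per segment (assumed)
)

// segmentBuf is one segment-pool entry: a whole log segment held in
// RAM while it is being filled or was read ahead.
type segmentBuf struct {
    segment int
    data    []byte // blocksPerSegment * blockSize bytes
}

// readBlock serves a log read from a pooled segment when the block's
// segment is resident, otherwise directly from the SSD.
func readBlock(ssd *os.File, pool []*segmentBuf, bdaIndex int, out []byte) error {
    segment := bdaIndex / blocksPerSegment
    for _, b := range pool {
        if b != nil && b.segment == segment { // RAM hit
            off := (bdaIndex % blocksPerSegment) * blockSize
            copy(out, b.data[off:off+blockSize])
            return nil
        }
    }
    _, err := ssd.ReadAt(out, int64(bdaIndex)*blockSize)
    return err
}

The "Ram Hit Rate" versus "Storage Hits" counters in the pblio output later in this deck appear to distinguish exactly these two paths.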
PERSISTENT METADATA
Save the address map to a file on application shutdown
Cache is warm on application restart
Not designed to be durable
A system crash means the metadata file is not created
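A minimal sketch of that save/load cycle using Go's encoding/gob; the file format here is an assumption, not pblcache's actual layout:

package pblmeta

import (
    "encoding/gob"
    "os"
)

// saveMetadata writes the address map on clean shutdown. If the
// process crashes first, the file simply never appears.
func saveMetadata(path string, addr map[uint64]int) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(addr)
}

// loadMetadata restores a warm cache on restart; a missing file means
// a cold start with an empty map.
func loadMetadata(path string) (map[uint64]int, error) {
    f, err := os.Open(path)
    if err != nil {
        return map[uint64]int{}, nil // no metadata: start cold
    }
    defer f.Close()
    addr := map[uint64]int{}
    err = gob.NewDecoder(f).Decode(&addr)
    return addr, err
}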
PBL APPLICATION
PBLIO
Benchmark tool
Uses an enterprise workload generator from NetApp*
Cache set up as write-through
Can be used with or without pblcache
Documentation: https://github.com/pblcache/pblcache/wiki/Pblio
* S. Daniel et al., A portable, open-source implementation of the SPC-1 workload
https://github.com/lpabon/goioworkload
ENTERPRISE WORKLOAD
Synthetic OLTP enterprise workload generator
Tests for the maximum number of IOPS before exceeding 30 ms latency
Divides the storage system into three logical storage units:
ASU1 - Data Store - 45% of total storage - RW
ASU2 - User Store - 45% of total storage - RW
ASU3 - Log - 10% of total storage - Write Only
BSU - Business Scaling Units: 1 BSU = 50 IOPS
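For scale: the simple example below runs 2 BSUs (2 × 50 = 100 IOPS offered), while the cluster test later in the deck sustains 31 BSUs, i.e. 31 × 50 = 1550 IOPS.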
SIMPLE EXAMPLE
$ fallocate -l 45MiB file1
$ fallocate -l 45MiB file2
$ fallocate -l 10MiB file3
$
$ ./pblio -asu1=file1 -asu2=file2 -asu3=file3 \
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s
RAW DEVICES EXAMPLE
$ ./pblio -asu1=/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde \
CACHE EXAMPLE
$ fallocate -l 10MiB mycache
$ ./pblio -asu1=file1 -asu2=file2 -asu3=file3 \
C Size  : 0.01 GB
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s
Read Hit Rate: 0.4457
Invalidate Hit Rate: 0.6764
Read hits: 1120
Invalidate hits: 347
Reads: 2513
Insertions: 1906
Evictions: 0
Invalidations: 513
== Log Information ==
Ram Hit Rate: 1.0000
Ram Hits: 1120
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 0
Wraps: 1
Segments Skipped: 0
Mean Read Latency: 0.00 usec
Mean Segment Read Latency: 4396.77 usec
Mean Write Latency: 1162.58 usec
C Size  : 185.75 GB
ASU1    : 673.83 GB
ASU2    : 673.83 GB
ASU3    : 149.74 GB
BSUs    : 32
Contexts: 1
Run time: 600 s
Read Hit Rate: 0.7004
Invalidate Hit Rate: 0.7905
Read hits: 528539
Invalidate hits: 120189
Reads: 754593
Insertions: 378093
Evictions: 303616
Invalidations: 152039
== Log Information ==
Ram Hit Rate: 0.0002
Ram Hits: 75
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 445638
Wraps: 0
Segments Skipped: 0
Mean Read Latency: 850.89 usec
Mean Segment Read Latency: 2856.16 usec
Mean Write Latency: 6472.74 usec
LATENCY OVER 30MS
TEST SETUP
Client using a 180 GB SAS SSD (about 10% of the workload size)
GlusterFS 6x2 cluster
100 files for each ASU
pblio v0.1 compiled with go1.4.1
Each system has:
Fedora 20
6 Intel Xeon E5-2620 @ 2 GHz
64 GB RAM
5 x 300 GB SAS drives
10 Gbit network
CACHE WARMUP IS TIME CONSUMING
Warmup time: 16 hours
INCREASED RESPONSE TIME
73% Increase
STORAGE BACKEND IOPS REDUCTION
BSU = 31 or 1550 IOPS
MILESTONES
NEXT: QEMU SHARED CACHE
Work with the community to bring this technology to QEMU
Possible architecture:
Some conditions to think about:
VM migration
Volume deletion
VM crash
FUTURE
Hyperconvergence
Peer-cache
Writeback
Shared cache
QoS using mClock*
Possible integrations with Ceph and GlusterFS backends
* A. Gulati et al., mClock: Handling Throughput Variability for Hypervisor IO Scheduling
JOIN!
GitHub: https://github.com/pblcache/pblcache
IRC Freenode: #pblcache
Google Group: https://groups.google.com/forum/#!forum/pblcache
Mailing list: pblcache@googlegroups.com