PBLCACHE
A client side persistent block cache for the data center
Vault Boston 2015 - Luis Pabón - Red Hat
ABOUT ME
LUIS PABÓN
Principal Software Engineer, Red Hat Storage
IRC, GitHub: lpabon
[Diagram: Compute Node with a local SSD cache in front of backend Storage]
What are the benefits of client side persistent caching?
How can the SSD be used effectively?
Use in-memory data structures to handle cache misses as quickly as possible
Write sequentially to the SSD
The cache must be persistent, since warming could be time consuming
Increase storage backend availability by reducing read requests
* S. Byan, et al., Mercury: Host-side flash caching for the data center
MERCURY QEMU INTEGRATION
PBLCACHE
Persistent, block based, look-aside cache for QEMU
User space library/application
Based on ideas described in the Mercury paper
Requires exclusive access to mutable objects
Persistent BLock Cache
GOAL: QEMU SHARED CACHE
PBLCACHE ARCHITECTURE
[Diagram: PBL Application → Cache Map → Log → SSD]
PBL APPLICATION
Sets up the cache map and log
Decides how to use the cache (write-through, read-miss)
Inserts, retrieves, or invalidates blocks from the cache (sketch below)
[Diagram: Pbl App → Msg Queue → Cache Map and Log]
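As a look-aside cache, policy lives in the application. Below is a minimal Go sketch of the read-miss and write-through paths described above; the Cache interface, its method names, and the 4 KiB block size are illustrative assumptions, not pblcache's actual API.

package pbl

import "io"

// Cache stands in for the pblcache handle; these method names are
// assumptions for illustration only.
type Cache interface {
    Get(lba uint64, buf []byte) (hit bool, err error) // fetch a cached block
    Put(lba uint64, buf []byte) error                 // insert or update a block
    Invalidate(lba uint64) error                      // drop a block
}

const blockSize = 4096 // assumed cache block size in bytes

// readBlock: read-miss policy. Try the cache first; on a miss, read
// from backend storage and insert the block for next time.
func readBlock(c Cache, storage io.ReaderAt, lba uint64, buf []byte) error {
    if hit, err := c.Get(lba, buf); err != nil || hit {
        return err
    }
    if _, err := storage.ReadAt(buf, int64(lba)*blockSize); err != nil {
        return err
    }
    return c.Put(lba, buf)
}

// writeBlock: write-through policy. The backend is updated first, so
// the cache never holds data newer than storage.
func writeBlock(c Cache, storage io.WriterAt, lba uint64, buf []byte) error {
    if _, err := storage.WriteAt(buf, int64(lba)*blockSize); err != nil {
        return err
    }
    return c.Put(lba, buf)
}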
CACHE MAP
Composed of two data structures: the Address Map and the Block Descriptor Array
Maintains all block metadata
ADDRESS MAP
Implemented as a hash table
Translates object blocks to Block Descriptor Array (BDA) indices
Cache misses are determined extremely quickly
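A minimal sketch of the idea, assuming a plain Go map keyed by block address; pblcache's concrete layout may differ:

package pblmap

// addressMap: device block address -> index into the Block Descriptor
// Array (BDA). A failed lookup *is* the cache miss, so a miss costs
// one hash probe.
type addressMap map[uint64]int

func (m addressMap) lookup(block uint64) (bdaIndex int, hit bool) {
    bdaIndex, hit = m[block]
    return bdaIndex, hit
}

func (m addressMap) insert(block uint64, bdaIndex int) { m[block] = bdaIndex }
func (m addressMap) remove(block uint64)               { delete(m, block) }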
BLOCK DESCRIPTOR ARRAY
Contains metadata for blocks stored in the log
Length is equal to the maximum number of blocks stored in the log
Handles CLOCK evictions
Invalidations are extremely fast
Insertions always append (CLOCK sketch below)
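A compact sketch of one BDA slot and a CLOCK sweep, with assumed field names; pblcache's real descriptors carry more state:

package pblbda

// blockDescriptor is one BDA slot; there is one slot per block the
// log can hold. Field names are assumptions.
type blockDescriptor struct {
    address uint64 // device block stored in this log slot
    used    bool   // slot holds valid data
    clock   bool   // second-chance bit, set on every read hit
}

// bda holds the descriptors plus a sweeping hand; new blocks always
// land at the hand, so the log is filled in append order.
type bda struct {
    entries []blockDescriptor
    hand    int
}

// insert picks the slot for a new block, evicting with CLOCK when the
// array is full: a referenced slot gets a second chance, an
// unreferenced one is reused.
func (b *bda) insert(address uint64) (index int, evicted uint64, didEvict bool) {
    for {
        e := &b.entries[b.hand]
        index = b.hand
        b.hand = (b.hand + 1) % len(b.entries)
        switch {
        case !e.used: // free slot
            *e = blockDescriptor{address: address, used: true}
            return index, 0, false
        case e.clock: // recently referenced: spare it once
            e.clock = false
        default: // victim: reuse its slot
            evicted = e.address
            *e = blockDescriptor{address: address, used: true}
            return index, evicted, true
        }
    }
}

The caller would also update the address map: remove the evicted block's entry and point the new block at the returned index.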
CACHE MAP I/O FLOW
[Diagram: Block Descriptor Array]
CACHE MAP I/O FLOW
Get → in address map?
  Yes (hit): set CLOCK bit in BDA, read from log
  No: miss
CACHE MAP I/O FLOW
Invalidate → delete from the address map, free the BDA index (sketch below)
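Both flows touch only in-memory structures, which is what keeps misses and invalidations cheap. A sketch combining them, with simplified stand-ins for the map and BDA (the free-list is an assumption):

package pblflow

type cacheMap struct {
    addr  map[uint64]int // address map: block -> BDA index
    clock []bool         // CLOCK bits, one per BDA slot
    free  []int          // reusable BDA indices (illustrative)
}

// get: a hash miss is the cache miss; a hit marks the slot referenced
// and tells the caller where in the log to read.
func (c *cacheMap) get(block uint64) (bdaIndex int, hit bool) {
    idx, ok := c.addr[block]
    if !ok {
        return 0, false
    }
    c.clock[idx] = true
    return idx, true
}

// invalidate: drop the mapping and recycle the slot. No SSD I/O is
// involved, which is why invalidations are extremely fast.
func (c *cacheMap) invalidate(block uint64) {
    if idx, ok := c.addr[block]; ok {
        delete(c.addr, block)
        c.clock[idx] = false
        c.free = append(c.free, idx)
    }
}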
LOG
Block location determined by BDA
CLOCK optimized with segment read-ahead
Segment pool with buffered writes
Contiguous block support
[Diagram: log divided into Segments on the SSD]
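Because insertions fill the log in BDA order, the BDA index alone fixes where a block lives on the SSD. A sketch of that arithmetic, with assumed block and segment sizes:

package pbllog

const (
    blockSize        = 4096                         // bytes per block (assumed)
    blocksPerSegment = 256                          // blocks per segment (assumed)
    segmentSize      = blockSize * blocksPerSegment // 1 MiB segments
)

// location maps a BDA index to a segment number and byte offset on
// the SSD, so no per-block location metadata is stored anywhere.
func location(bdaIndex int) (segment int, offset int64) {
    segment = bdaIndex / blocksPerSegment
    offset = int64(bdaIndex) * blockSize
    return segment, offset
}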
LOG SEGMENT STATE MACHINE
LOG READ I/O FLOW
Read → is the block's segment in a segment buffer?
  Yes: read from the segment (RAM)
  No: read from the SSD
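A sketch of that check against the segment pool; the buffer bookkeeping here is an assumption:

package pblread

import "os"

const (
    blockSize        = 4096 // bytes per block (assumed)
    blocksPerSegment = 256  // blocks per segment (assumed)
)

// segmentBuf is one segment-pool entry: a whole log segment held in
// RAM while it is being filled or was read ahead.
type segmentBuf struct {
    segment int
    data    []byte // blocksPerSegment * blockSize bytes
}

// readBlock serves a log read from a pooled segment when the block's
// segment is resident, otherwise directly from the SSD.
func readBlock(ssd *os.File, pool []*segmentBuf, bdaIndex int, out []byte) error {
    segment := bdaIndex / blocksPerSegment
    for _, b := range pool {
        if b != nil && b.segment == segment { // RAM hit
            off := (bdaIndex % blocksPerSegment) * blockSize
            copy(out, b.data[off:off+blockSize])
            return nil
        }
    }
    _, err := ssd.ReadAt(out, int64(bdaIndex)*blockSize)
    return err
}

The "Ram Hit Rate" versus "Storage Hits" counters in the pblio output later in this deck appear to distinguish exactly these two paths.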
PERSISTENT METADATA
Save the address map to a file on application shutdown
Cache is warm on application restart
Not designed to be durable
A system crash means the metadata file is not created
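A minimal sketch of that save/load cycle using Go's encoding/gob; the file format here is an assumption, not pblcache's actual layout:

package pblmeta

import (
    "encoding/gob"
    "os"
)

// saveMetadata writes the address map on clean shutdown. If the
// process crashes first, the file simply never appears.
func saveMetadata(path string, addr map[uint64]int) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return gob.NewEncoder(f).Encode(addr)
}

// loadMetadata restores a warm cache on restart; a missing file means
// a cold start with an empty map.
func loadMetadata(path string) (map[uint64]int, error) {
    f, err := os.Open(path)
    if err != nil {
        return map[uint64]int{}, nil // no metadata: start cold
    }
    defer f.Close()
    addr := map[uint64]int{}
    err = gob.NewDecoder(f).Decode(&addr)
    return addr, err
}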
PBL APPLICATION
PBLIO
Benchmark tool
Uses an enterprise workload generator from NetApp*
Cache set up as write-through
Can be used with or without pblcache
Documentation: https://github.com/pblcache/pblcache/wiki/Pblio
* S. Daniel et al., A portable, open-source implementation of the SPC-1 workload
https://github.com/lpabon/goioworkload
ENTERPRISE WORKLOAD
Synthetic OLTP enterprise workload generator
Tests for the maximum number of IOPS before exceeding 30 ms latency
Divides the storage system into three logical storage units:
ASU1 - Data Store - 45% of total storage - RW
ASU2 - User Store - 45% of total storage - RW
ASU3 - Log - 10% of total storage - Write Only
BSU - Business Scaling Units: 1 BSU = 50 IOPS
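For scale: the simple example below runs 2 BSUs (2 × 50 = 100 IOPS offered), while the cluster test later in the deck sustains 31 BSUs, i.e. 31 × 50 = 1550 IOPS.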
SIMPLE EXAMPLE
$ fallocate -l 45MiB file1
$ fallocate -l 45MiB file2
$ fallocate -l 10MiB file3
$
$ ./pblio -asu1=file1 -asu2=file2 -asu3=file3 \
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s
RAW DEVICES EXAMPLE
$ ./pblio -asu1=/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde \
CACHE EXAMPLE
$ fallocate -l 10MiB mycache
$ ./pblio -asu1=file1 -asu2=file2 -asu3=file3 \
C Size  : 0.01 GB
ASU1    : 0.04 GB
ASU2    : 0.04 GB
ASU3    : 0.01 GB
BSUs    : 2
Contexts: 1
Run time: 30 s
Read Hit Rate: 0.4457
Invalidate Hit Rate: 0.6764
Read hits: 1120
Invalidate hits: 347
Reads: 2513
Insertions: 1906
Evictions: 0
Invalidations: 513
== Log Information ==
Ram Hit Rate: 1.0000
Ram Hits: 1120
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 0
Wraps: 1
Segments Skipped: 0
Mean Read Latency: 0.00 usec
Mean Segment Read Latency: 4396.77 usec
Mean Write Latency: 1162.58 usec
C Size  : 185.75 GB
ASU1    : 673.83 GB
ASU2    : 673.83 GB
ASU3    : 149.74 GB
BSUs    : 32
Contexts: 1
Run time: 600 s
Read Hit Rate: 0.7004
Invalidate Hit Rate: 0.7905
Read hits: 528539
Invalidate hits: 120189
Reads: 754593
Insertions: 378093
Evictions: 303616
Invalidations: 152039
== Log Information ==
Ram Hit Rate: 0.0002
Ram Hits: 75
Buffer Hit Rate: 0.0000
Buffer Hits: 0
Storage Hits: 445638
Wraps: 0
Segments Skipped: 0
Mean Read Latency: 850.89 usec
Mean Segment Read Latency: 2856.16 usec
Mean Write Latency: 6472.74 usec
LATENCY OVER 30MS
TEST SETUP
Client using a 180 GB SAS SSD (about 10% of the workload size)
GlusterFS 6x2 cluster
100 files for each ASU
pblio v0.1 compiled with go1.4.1
Each system has:
Fedora 20
6 Intel Xeon E5-2620 @ 2 GHz
64 GB RAM
5 x 300 GB SAS drives
10 Gbit network
CACHE WARMUP IS TIME CONSUMING
Warmup time: 16 hours
INCREASED RESPONSE TIME
73% Increase
STORAGE BACKEND IOPS REDUCTION
BSU = 31 or 1550 IOPS
MILESTONES
NEXT: QEMU SHARED CACHE
Work with the community to bring this technology to QEMU
Possible architecture:
Some conditions to think about:
VM migration
Volume deletion
VM crash
FUTURE
Hyperconvergence
Peer-cache
Writeback
Shared cache
QoS using mClock*
Possible integrations with Ceph and GlusterFS backends
* A. Gulati et al., mClock: Handling Throughput Variability for Hypervisor IO Scheduling
JOIN!
GitHub: https://github.com/pblcache/pblcache
IRC Freenode: #pblcache
Google Group: https://groups.google.com/forum/#!forum/pblcache
Mailing list: pblcache@googlegroups.com