SLIDE 1

EVCache: Lowering Costs for a Low Latency Cache with RocksDB

Scott Mansfield Vu Nguyen EVCache

SLIDE 2 – SLIDE 12

SLIDE 13

90 seconds

SLIDE 14

What do caches touch?

  • Signing up*
  • Logging in
  • Choosing a profile
  • Picking liked videos
  • Personalization*
  • Loading home page*
  • Scrolling home page*
  • A/B tests
  • Video image selection
  • Searching*
  • Viewing title details
  • Playing a title*
  • Subtitle / language prefs
  • Rating a title
  • My List
  • Video history*
  • UI strings
  • Video production*

* multiple caches involved

SLIDE 15

Home Page Request

SLIDE 16

SLIDE 17

Ephemeral Volatile Cache: a key-value store optimized for AWS and tuned for Netflix use cases

SLIDE 18

What is EVCache?

  • Distributed, sharded, replicated key-value store
  • Tunable in-region and global replication
  • Based on Memcached
  • Resilient to failure
  • Topology aware
  • Linearly scalable
  • Seamless deployments

SLIDE 19

Why Optimize for AWS

  • Instances disappear
  • Zones fail
  • Regions become unstable
  • Network is lossy
  • Customer requests bounce between regions
  • Failures happen and we test all the time

SLIDE 20

EVCache Use @ Netflix

  • Hundreds of terabytes of data
  • Trillions of ops / day
  • Tens of billions of items stored
  • Tens of millions of ops / sec
  • Millions of replications / sec
  • Thousands of servers
  • Hundreds of instances per cluster
  • Hundreds of microservice clients
  • Tens of distinct clusters
  • 3 regions
  • 4 engineers

SLIDE 21

Architecture

(Diagram: the Server runs Memcached and the EVCar sidecar; the Application embeds the Client Library / Client; servers are discovered through Eureka (Service Discovery))

SLIDE 22

Architecture

(Diagram: clients in each of us-west-2a, us-west-2b, us-west-2c)

SLIDE 23

Reading (get)

(Diagram: zones us-west-2a, us-west-2b, us-west-2c; a client reads the primary copy in its own zone and falls back to a secondary copy in another zone)
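A minimal Go sketch of this read path (the real EVCache client is a Java library, so the replica type and zone names here are illustrative only): try the copy in the client's own zone first, and fall back to another zone on a miss.

package main

import (
	"errors"
	"fmt"
)

var errMiss = errors.New("cache miss")

// replica is a hypothetical handle to the memcached copy in one zone.
type replica struct {
	zone string
	data map[string][]byte
}

func (r *replica) get(key string) ([]byte, error) {
	if v, ok := r.data[key]; ok {
		return v, nil
	}
	return nil, errMiss
}

// zoneAwareGet tries the copy in the client's own zone first (primary),
// then falls back to the other zones (secondary) on a miss or error.
func zoneAwareGet(replicas []*replica, localZone, key string) ([]byte, error) {
	for _, r := range replicas {
		if r.zone == localZone {
			if v, err := r.get(key); err == nil {
				return v, nil
			}
		}
	}
	for _, r := range replicas {
		if r.zone != localZone {
			if v, err := r.get(key); err == nil {
				return v, nil
			}
		}
	}
	return nil, errMiss
}

func main() {
	replicas := []*replica{
		{zone: "us-west-2a", data: map[string][]byte{}},
		{zone: "us-west-2b", data: map[string][]byte{"title:123": []byte("metadata")}},
		{zone: "us-west-2c", data: map[string][]byte{}},
	}
	// The local copy is missing the key, so the read falls back to us-west-2b.
	v, err := zoneAwareGet(replicas, "us-west-2a", "title:123")
	fmt.Println(string(v), err)
}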
SLIDE 24

Writing (set, delete, add, etc.)

(Diagram: zones us-west-2a, us-west-2b, us-west-2c; a client writes to the replica in every zone)
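A hedged Go sketch of the write fan-out, assuming a hypothetical replica type: the client sends the mutation to every zone concurrently, and how many acknowledgements it waits for is tunable (the "latch" mentioned under client failure resilience later in the deck).

package main

import (
	"fmt"
	"sync"
)

// replica is a hypothetical handle to the memcached copy in one zone.
type replica struct {
	zone string
	mu   sync.Mutex
	data map[string][]byte
}

func (r *replica) set(key string, val []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.data[key] = val
	return nil
}

// setAll fans a write out to every zone concurrently and returns once
// `latch` replicas have acknowledged; the rest complete in the background.
func setAll(replicas []*replica, key string, val []byte, latch int) {
	acks := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r *replica) { acks <- r.set(key, val) }(r)
	}
	for i := 0; i < latch && i < len(replicas); i++ {
		<-acks
	}
}

func main() {
	replicas := []*replica{
		{zone: "us-west-2a", data: map[string][]byte{}},
		{zone: "us-west-2b", data: map[string][]byte{}},
		{zone: "us-west-2c", data: map[string][]byte{}},
	}
	setAll(replicas, "profile:42", []byte("prefs"), 2) // wait for 2 of 3 zones
	fmt.Println("write fanned out to all zones")
}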

SLIDE 25

Use Case: Lookaside Cache

(Diagram: Application, Client Library / Client, Ribbon Client; S S S S; C C C C)

Data Flow
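The lookaside pattern itself is straightforward. Below is a minimal Go sketch with stand-in cache and database types (EVCache's real client is Java): check the cache, and on a miss load from the backing store and populate the cache for the next reader.

package main

import "fmt"

// mapCache and fakeDB are hypothetical stand-ins for the EVCache client
// and the backing service / data store.
type mapCache map[string][]byte

func (c mapCache) get(key string) ([]byte, bool)        { v, ok := c[key]; return v, ok }
func (c mapCache) set(key string, v []byte, ttlSec int) { c[key] = v } // TTL ignored in this sketch

type fakeDB struct{}

func (fakeDB) load(key string) ([]byte, error) { return []byte("row for " + key), nil }

// lookasideGet returns the cached value when present; on a miss it loads
// from the backing store and writes the result back into the cache.
func lookasideGet(c mapCache, db fakeDB, key string, ttlSec int) ([]byte, error) {
	if v, ok := c.get(key); ok {
		return v, nil // cache hit
	}
	v, err := db.load(key) // cache miss: read the source of truth
	if err != nil {
		return nil, err
	}
	c.set(key, v, ttlSec) // populate for the next reader
	return v, nil
}

func main() {
	c := mapCache{}
	v, _ := lookasideGet(c, fakeDB{}, "title:123", 300) // miss, then fill
	fmt.Println(string(v))
	v, _ = lookasideGet(c, fakeDB{}, "title:123", 300) // now a hit
	fmt.Println(string(v))
}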
SLIDE 26

Use Case: Transient Data Store

(Diagram: several applications, each with the Client Library / Client, read and write the same cached data over time)
SLIDE 27

Use Case: Primary Store

Offline / Nearline Precomputes for Recommendations

(Diagram: offline services write precomputed data that the online application reads through the Client Library / Client)
SLIDE 28

Use Case: Versioned Primary Store

(Diagram: Offline Compute publishes data that the Online Application reads through the Client Library / Client, with Archaius (Dynamic Properties) and the Control System (Valhalla) managing which version is live)

SLIDE 29

Use Case: High Volume && High Availability

Compute & Publish on a schedule
Data Flow

(Diagram: Application, Client Library with In-memory and Remote layers, Ribbon Client; Optional; S S S S / C C C C)

SLIDE 30

Pipeline of Personalization

(Diagram: offline compute stages A–E feed data through the cache to online services Online 1 and Online 2)

SLIDE 31

Additional Features

Kafka

  • Global data replication
  • Consistency metrics

Key Iteration

  • Cache warming
  • Lost instance recovery
  • Backup (and restore)
SLIDE 32

Additional Features (Kafka)

  • Global data replication
  • Consistency metrics

SLIDE 33

Cross-Region Replication

(Diagram: Region A and Region B, each with an APP, Kafka, a Repl Relay, and a Repl Proxy)

1. mutate
2. send metadata
3. poll msg
4. get data for set
5. https send msg
6. mutate
7. read
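A hedged Go sketch of these steps, with a channel and a plain function call standing in for Kafka and the HTTPS hop (all type names are illustrative): only metadata goes onto the Kafka topic, the relay fetches the value from the local cache for sets, and the remote proxy applies the mutation in the destination region.

package main

import "fmt"

// replMsg is the metadata written to Kafka for each mutation: the key, the
// operation, and the TTL, but not the value (names here are illustrative).
type replMsg struct {
	op  string // "set" or "delete"
	key string
	ttl int
	val []byte // filled in by the relay, not by the application
}

type cache map[string][]byte

// relay polls replication metadata (step 3), fetches the value from the
// local cache for sets (step 4), and sends the message to the remote
// region's proxy (step 5); a plain function call stands in for HTTPS here.
func relay(local cache, msgs <-chan replMsg, sendToRemote func(replMsg)) {
	for m := range msgs {
		if m.op == "set" {
			m.val = local[m.key] // step 4: get data for set
		}
		sendToRemote(m) // step 5
	}
}

// proxy applies the mutation in the destination region (step 6).
func proxy(remote cache, m replMsg) {
	switch m.op {
	case "set":
		remote[m.key] = m.val
	case "delete":
		delete(remote, m.key)
	}
}

func main() {
	regionA, regionB := cache{}, cache{}
	msgs := make(chan replMsg, 1)

	// Step 1: the app mutates the local cache; step 2: metadata goes to Kafka
	// (a buffered channel stands in for the Kafka topic in this sketch).
	regionA["profile:42"] = []byte("prefs")
	msgs <- replMsg{op: "set", key: "profile:42", ttl: 3600}
	close(msgs)

	relay(regionA, msgs, func(m replMsg) { proxy(regionB, m) })
	fmt.Println(string(regionB["profile:42"])) // step 7: readable in region B
}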
SLIDE 34

Additional Features (Key Iteration)

  • Cache warming
  • Lost instance recovery
  • Backup (and restore)

SLIDE 35

Cache Warming

(Diagram: Cache Warmer (Spark), Application with Client Library / Client, Control, and S3, connected by data, metadata, and control flows)
SLIDE 36

Moneta

Next-generation EVCache server

SLIDE 37

Moneta

Moneta: The Goddess of Memory. Juno Moneta: The Protectress of Funds.

  • Evolution of the EVCache server
  • Cost optimization
  • EVCache on SSD
  • Ongoing effort to lower EVCache cost per stream
  • Takes advantage of global request patterns
SLIDE 38

Old Server

  • Stock Memcached and EVCar (sidecar)
  • All data stored in RAM in Memcached
  • Expensive with global expansion / N+1 architecture

(Diagram: Memcached and the EVCar sidecar, serving traffic on the external port)

SLIDE 39

Optimization

  • Global data means many copies
  • Access patterns are heavily region-oriented
  • In one region:

○ Hot data is used often ○ Cold data is almost never touched

  • Keep hot data in RAM, cold data on SSD
  • Size RAM for working set, SSD for overall dataset
SLIDE 40

New Server

  • Adds Rend and Mnemonic
  • Still looks like Memcached
  • Unlocks cost-efficient storage & server-side intelligence
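Because the new server still speaks the Memcached protocol, existing clients work unchanged. A small Go example using the memcached text protocol over TCP (the address, key, and TTL are placeholders; error handling is trimmed):

package main

import (
	"bufio"
	"fmt"
	"net"
)

// A minimal client for the memcached text protocol; Rend/Moneta listens on
// a memcached-compatible port, so this works against it as well as against
// stock memcached.
func main() {
	conn, err := net.Dial("tcp", "localhost:11211")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	r := bufio.NewReader(conn)

	// set <key> <flags> <exptime> <bytes>\r\n<data>\r\n  ->  STORED
	val := []byte("hello moneta")
	fmt.Fprintf(conn, "set demo:key 0 300 %d\r\n", len(val))
	conn.Write(val)
	conn.Write([]byte("\r\n"))
	status, _ := r.ReadString('\n')
	fmt.Print("set: ", status)

	// get <key>\r\n  ->  VALUE <key> <flags> <bytes>\r\n<data>\r\nEND
	fmt.Fprintf(conn, "get demo:key\r\n")
	header, _ := r.ReadString('\n')
	fmt.Print("get: ", header)
	data, _ := r.ReadString('\n')
	fmt.Print("data: ", data)
	end, _ := r.ReadString('\n')
	fmt.Print(end)
}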
SLIDE 41

go get github.com/netflix/rend

Rend

SLIDE 42

Rend

  • High-performance Memcached proxy & server
  • Written in Go

○ Powerful concurrency primitives ○ Productive and fast

  • Manages the L1/L2 relationship
  • Tens of thousands of connections
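One plausible shape of the L1/L2 read orchestration, as a hedged Go sketch with stand-in store types (Rend's real orchestrators are more involved): serve from L1 (Memcached, RAM) when possible, fall through to L2 (Mnemonic, SSD) on a miss, and backfill L1 so hot data stays in memory.

package main

import (
	"errors"
	"fmt"
)

var errMiss = errors.New("miss")

// store is a hypothetical interface over a backend handler; in Rend the L1
// backend is local memcached (RAM) and the L2 backend is Mnemonic (SSD).
type store interface {
	get(key string) ([]byte, error)
	set(key string, val []byte) error
}

type mapStore map[string][]byte

func (s mapStore) get(key string) ([]byte, error) {
	if v, ok := s[key]; ok {
		return v, nil
	}
	return nil, errMiss
}
func (s mapStore) set(key string, val []byte) error { s[key] = val; return nil }

// orchestratedGet serves from L1 when it can, falls through to L2 on a miss,
// and backfills L1 so the next read for a hot key stays in RAM.
func orchestratedGet(l1, l2 store, key string) ([]byte, error) {
	if v, err := l1.get(key); err == nil {
		return v, nil // hot data: served from RAM
	}
	v, err := l2.get(key)
	if err != nil {
		return nil, err // true miss
	}
	_ = l1.set(key, v) // promote cold data that just became hot
	return v, nil
}

func main() {
	l1, l2 := mapStore{}, mapStore{"title:123": []byte("metadata")}
	v, _ := orchestratedGet(l1, l2, "title:123")
	fmt.Printf("%s; backfilled into L1: %v\n", v, len(l1) == 1)
}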
SLIDE 43

Rend

  • Modular set of libraries and an example main()
  • Manages connections, request orchestration, and communication

  • Low-overhead metrics library
  • Multiple orchestrators
  • Parallel locking for data integrity
  • Efficient connection pool

(Diagram: Server Loop, Request Orchestration, Backend Handlers, Connection Management, Protocol, and Metrics modules)

SLIDE 44

Mnemonic

SLIDE 45

Mnemonic

  • Manages data storage on SSD
  • Uses Rend server libraries

○ Handles Memcached protocol

  • Maps Memcached ops to RocksDB ops

(Layer stack: Rend Server Core Lib (Go) → Mnemonic Op Handler (Go) → Mnemonic Core (C++) → RocksDB (C++))
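A hedged Go sketch of that mapping, using an in-memory stand-in for a RocksDB instance (the real Mnemonic core calls RocksDB from C++): a memcached set becomes a Put whose stored value carries an absolute expiry, a get becomes a Get that treats expired records as misses (this matters later under FIFO limitations), and a delete becomes a Delete.

package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

var errMiss = errors.New("miss")

// kv stands in for a RocksDB instance; only the op mapping is shown here.
type kv interface {
	put(key, val []byte) error
	get(key []byte) ([]byte, error)
	del(key []byte) error
}

type memKV map[string][]byte

func (m memKV) put(k, v []byte) error { m[string(k)] = v; return nil }
func (m memKV) get(k []byte) ([]byte, error) {
	if v, ok := m[string(k)]; ok {
		return v, nil
	}
	return nil, errMiss
}
func (m memKV) del(k []byte) error { delete(m, string(k)); return nil }

// mcSet maps a memcached set to a Put: the stored record carries an absolute
// expiry so reads can filter out records FIFO compaction has not dropped yet.
func mcSet(db kv, key string, val []byte, ttl time.Duration) error {
	rec := make([]byte, 8+len(val))
	binary.BigEndian.PutUint64(rec, uint64(time.Now().Add(ttl).Unix()))
	copy(rec[8:], val)
	return db.put([]byte(key), rec)
}

// mcGet maps a memcached get to a Get, treating expired records as misses.
func mcGet(db kv, key string) ([]byte, error) {
	rec, err := db.get([]byte(key))
	if err != nil {
		return nil, err
	}
	if time.Now().Unix() > int64(binary.BigEndian.Uint64(rec)) {
		return nil, errMiss // record still on SSD, but logically expired
	}
	return rec[8:], nil
}

// mcDelete maps a memcached delete to a Delete.
func mcDelete(db kv, key string) error { return db.del([]byte(key)) }

func main() {
	db := memKV{}
	mcSet(db, "title:123", []byte("metadata"), time.Hour)
	v, err := mcGet(db, "title:123")
	fmt.Println(string(v), err)
}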

SLIDE 46

Why RocksDB?

  • Fast at medium to high write load

○ Disk write load is higher than read load (because Memcached serves most reads)

  • Predictable RAM Usage
(Diagram: writes go to an in-memory memtable, which is flushed to SST files on disk. SST: Static Sorted Table)

SLIDE 47

How we use RocksDB

  • No Level Compaction

○ Generated too much traffic to SSD ○ High and unpredictable read latencies

  • No Block Cache

○ Rely on Local Memcached

  • No Compression
SLIDE 48

How we use RocksDB

  • FIFO Compaction

○ SSTs ordered by time ○ Oldest SST deleted when full ○ Reads access every SST until the record is found

SLIDE 49

How we use RocksDB

  • Full File Bloom Filters

○ Full Filter reduces unnecessary SSD reads

  • Bloom Filters and Indices pinned in memory

○ Minimize SSD access per request

SLIDE 50

How we use RocksDB

  • Records sharded across multiple RocksDB instances per node

○ Reduces number of files checked to decrease latency

(Diagram: Mnemonic Core hashes each key, e.g. ABC or XYZ, to one of several RocksDB instances (R) on the node)
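A minimal Go sketch of the sharding step, with plain maps standing in for the per-node RocksDB instances: hash the key, pick one instance, and only that instance's SST files are consulted on a read.

package main

import (
	"fmt"
	"hash/fnv"
)

// shardedDBs models several RocksDB instances on one node; each element
// stands in for one DB handle (the real handles live in Mnemonic's C++ core).
type shardedDBs []map[string][]byte

// shardFor hashes the key to pick the single instance that owns it, so a
// read only consults that instance's SST files instead of every file on the node.
func (s shardedDBs) shardFor(key string) map[string][]byte {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s[h.Sum32()%uint32(len(s))]
}

func main() {
	dbs := make(shardedDBs, 8) // e.g. 8 RocksDB instances per node
	for i := range dbs {
		dbs[i] = map[string][]byte{}
	}
	dbs.shardFor("ABC")["ABC"] = []byte("value for ABC")
	dbs.shardFor("XYZ")["XYZ"] = []byte("value for XYZ")
	fmt.Println(string(dbs.shardFor("ABC")["ABC"]))
}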
SLIDE 51

Region-Locality Optimizations

  • Replication and batch updates write only to RocksDB*

○ Keeps region-local, “hot” data in memory ○ Separate network port for “off-line” requests ○ Data already present in Memcached is replaced so the RAM copy stays current

SLIDE 52

FIFO Limitations

  • FIFO compaction not suitable for all use cases

○ Very frequently updated records may push out valid records

  • Expired Records still exist
  • Requires Larger Bloom Filters
(Diagram: with FIFO compaction, repeated updates to records A and B fill SSTs over time and can push out still-valid records C–H)
SLIDE 53

AWS Instance Type

  • i2.xlarge

○ 4 vCPU ○ 30 GB RAM ○ 800 GB SSD ■ 32K IOPS (4KB Pages) ■ ~130MB/sec

SLIDE 54

Moneta Perf Benchmark (High-Vol Online Requests)

SLIDE 55

Moneta Perf Benchmark (cont)

SLIDE 56

Moneta Perf Benchmark (cont)

SLIDE 57

Moneta Perf Benchmark (cont)

SLIDE 58

Moneta Performance in Production (Batch Systems)

  • Request ‘get’ 95%, 99% = 729 μs, 938 μs
  • L1 ‘get’ 95%, 99% = 153 μs, 191 μs
  • L2 ‘get’ 95%, 99% = 1005 μs, 1713 μs
  • ~20 KB Records
  • ~99% Overall Hit Rate
  • ~90% L1 Hit Rate
SLIDE 59

Moneta Performance in Prod (High Vol-Online Req)

  • Request ‘get’ 95%, 99% = 174 μs, 588 μs
  • L1 ‘get’ 95%, 99% = 145 μs, 190 μs
  • L2 ‘get’ 95%, 99% = 770 μs, 1330 μs
  • ~1 KB Records
  • ~98% Overall Hit Rate
  • ~97% L1 Hit Rate
SLIDE 60

Moneta Performance in Prod (High Vol-Online Req)

Latencies: peak (trough)

Get Percentiles:

  • 50th: 102 μs (101 μs)
  • 75th: 120 μs (115 μs)
  • 90th: 146 μs (137 μs)
  • 95th: 174 μs (166 μs)
  • 99th: 588 μs (427 μs)
  • 99.5th: 733 μs (568 μs)
  • 99.9th: 1.39 ms (979 μs)

Set Percentiles:

  • 50th: 97.2 μs (87.2 μs)
  • 75th: 107 μs (101 μs)
  • 90th: 125 μs (115 μs)
  • 95th: 138 μs (126 μs)
  • 99th: 177 μs (152 μs)
  • 99.5th: 208 μs (169 μs)
  • 99.9th: 1.19 ms (318 μs)

SLIDE 61

70% reduction in cost*

SLIDE 62

Challenges/Concerns

  • Less Visibility

○ Overall data size is unclear because of duplicate and expired records ○ Restrict the unique data set to ½ of max for precompute batch data

  • Lower Max Throughput than Memcached-based Server

○ Higher CPU usage ○ Capacity planning must be better so we can handle unusually high request spikes

SLIDE 63

Current/Future Work

  • Investigate Blob Storage feature

○ Less Data read/write from SSD during Level Compaction ○ Lower Latency, Higher Throughput ○ Better View of Total Data Size

  • Purging expired SSTs earlier

○ Useful in “short” TTL use cases ○ May purge 60%+ of SSTs earlier than FIFO compaction would ○ Reduce worst-case latency ○ Better visibility of overall data size

  • Inexpensive Deduping for Batch Data
SLIDE 64

Open Source

https://github.com/netflix/EVCache https://github.com/netflix/rend

SLIDE 65

Thank You

smansfield@netflix.com (@sgmansfield) nguyenvu@netflix.com techblog.netflix.com

SLIDE 66

SLIDE 67

Failure Resilience in Client

  • Operation Fast Failure
  • Tunable Retries
  • Operation Queues
  • Tunable Latch for Mutations
  • Async Replication through Kafka
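A hedged Go sketch of the first two items, with entirely hypothetical types: a bounded per-server operation queue that fails fast when saturated, and a retry helper whose attempt count is a tunable property rather than a constant.

package main

import (
	"errors"
	"fmt"
)

var errQueueFull = errors.New("operation queue full: failing fast")

// opQueue is a hypothetical bounded per-server operation queue: when the
// queue is full the client fails the operation immediately instead of
// letting latency pile up behind a slow or dead instance. A worker
// goroutine would normally drain q.ops; it is omitted in this sketch.
type opQueue struct {
	ops chan func() error
}

func newOpQueue(depth int) *opQueue { return &opQueue{ops: make(chan func() error, depth)} }

// submit enqueues an operation, or fails fast when the queue is saturated.
func (q *opQueue) submit(op func() error) error {
	select {
	case q.ops <- op:
		return nil
	default:
		return errQueueFull
	}
}

// withRetries runs an operation up to maxRetries+1 times; the retry count is
// a tunable property in the real client rather than a hard-coded value.
func withRetries(maxRetries int, op func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	q := newOpQueue(1)
	_ = q.submit(func() error { return nil }) // fills the queue
	err := q.submit(func() error { return nil })
	fmt.Println(err) // second submit fails fast

	err = withRetries(2, func() error { return errors.New("transient") })
	fmt.Println(err)
}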
SLIDE 68

Consistency Metrics

(Diagram: Region A, with an APP, a Consistency Checker, Kafka, Atlas (Metrics Backend), and Client Dashboards)

1. mutate
2. send metadata
3. poll msg
4. pull data
5. report

SLIDE 69

Lost Instance Recovery

(Diagram: Cache Warmer (Spark), Application with Client Library / Client, S3, and Control across Zone A and Zone B, connected by partial data, data, metadata, and control flows)
SLIDE 70

Backup (and Restore)

(Diagram: Cache Warmer (Spark), Application with Client Library / Client, Control, and S3, connected by data and control flows)
SLIDE 71

Moneta in Production

  • Serving all of our personalization data
  • Rend runs with two ports:

○ One for standard users (read heavy or active data management) ○ Another for async and batch users: Replication and Precompute

  • Maintains working set in RAM
  • Optimized for precomputes

○ Smartly replaces data in L1

(Diagram: external and internal ports (Std and Batch) in front of EVCar, Memcached (RAM), and Mnemonic (SSD))

SLIDE 72

Rend batching backend