SLIDE 1

EVCache: Lowering Costs for a Low Latency Cache with RocksDB

Scott Mansfield Vu Nguyen EVCache

SLIDE 2 – SLIDE 12

SLIDE 13

90 seconds

SLIDE 14

What do caches touch?

  • Signing up*
  • Logging in
  • Choosing a profile
  • Picking liked videos
  • Personalization*
  • Loading home page*
  • Scrolling home page*
  • A/B tests
  • Video image selection
  • Searching*
  • Viewing title details
  • Playing a title*
  • Subtitle / language prefs
  • Rating a title
  • My List
  • Video history*
  • UI strings
  • Video production*

* multiple caches involved

SLIDE 15

Home Page Request

SLIDE 16

SLIDE 17

Ephemeral Volatile Cache: a key-value store optimized for AWS and tuned for Netflix use cases

SLIDE 18

What is EVCache?

  • Distributed, sharded, replicated key-value store
  • Tunable in-region and global replication
  • Based on Memcached
  • Resilient to failure
  • Topology aware
  • Linearly scalable
  • Seamless deployments

SLIDE 19

Why Optimize for AWS

  • Instances disappear
  • Zones fail
  • Regions become unstable
  • Network is lossy
  • Customer requests bounce between regions
  • Failures happen and we test all the time

SLIDE 20

EVCache Use @ Netflix

  • Hundreds of terabytes of data
  • Trillions of ops / day
  • Tens of billions of items stored
  • Tens of millions of ops / sec
  • Millions of replications / sec
  • Thousands of servers
  • Hundreds of instances per cluster
  • Hundreds of microservice clients
  • Tens of distinct clusters
  • 3 regions
  • 4 engineers

SLIDE 21

Architecture

(Diagram: the Server runs Memcached and the EVCar sidecar; the Application embeds the Client Library / Client; servers are discovered through Eureka (Service Discovery))

SLIDE 22

Architecture

(Diagram: clients in each of us-west-2a, us-west-2b, us-west-2c)

SLIDE 23

Reading (get)

(Diagram: zones us-west-2a, us-west-2b, us-west-2c; a client reads the primary copy in its own zone and falls back to a secondary copy in another zone)
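A minimal Go sketch of this read path (the real EVCache client is a Java library, so the replica type and zone names here are illustrative only): try the copy in the client's own zone first, and fall back to another zone on a miss.

package main

import (
	"errors"
	"fmt"
)

var errMiss = errors.New("cache miss")

// replica is a hypothetical handle to the memcached copy in one zone.
type replica struct {
	zone string
	data map[string][]byte
}

func (r *replica) get(key string) ([]byte, error) {
	if v, ok := r.data[key]; ok {
		return v, nil
	}
	return nil, errMiss
}

// zoneAwareGet tries the copy in the client's own zone first (primary),
// then falls back to the other zones (secondary) on a miss or error.
func zoneAwareGet(replicas []*replica, localZone, key string) ([]byte, error) {
	for _, r := range replicas {
		if r.zone == localZone {
			if v, err := r.get(key); err == nil {
				return v, nil
			}
		}
	}
	for _, r := range replicas {
		if r.zone != localZone {
			if v, err := r.get(key); err == nil {
				return v, nil
			}
		}
	}
	return nil, errMiss
}

func main() {
	replicas := []*replica{
		{zone: "us-west-2a", data: map[string][]byte{}},
		{zone: "us-west-2b", data: map[string][]byte{"title:123": []byte("metadata")}},
		{zone: "us-west-2c", data: map[string][]byte{}},
	}
	// The local copy is missing the key, so the read falls back to us-west-2b.
	v, err := zoneAwareGet(replicas, "us-west-2a", "title:123")
	fmt.Println(string(v), err)
}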
SLIDE 24

Writing (set, delete, add, etc.)

(Diagram: zones us-west-2a, us-west-2b, us-west-2c; a client writes to the replica in every zone)
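A hedged Go sketch of the write fan-out, assuming a hypothetical replica type: the client sends the mutation to every zone concurrently, and how many acknowledgements it waits for is tunable (the "latch" mentioned under client failure resilience later in the deck).

package main

import (
	"fmt"
	"sync"
)

// replica is a hypothetical handle to the memcached copy in one zone.
type replica struct {
	zone string
	mu   sync.Mutex
	data map[string][]byte
}

func (r *replica) set(key string, val []byte) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.data[key] = val
	return nil
}

// setAll fans a write out to every zone concurrently and returns once
// `latch` replicas have acknowledged; the rest complete in the background.
func setAll(replicas []*replica, key string, val []byte, latch int) {
	acks := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r *replica) { acks <- r.set(key, val) }(r)
	}
	for i := 0; i < latch && i < len(replicas); i++ {
		<-acks
	}
}

func main() {
	replicas := []*replica{
		{zone: "us-west-2a", data: map[string][]byte{}},
		{zone: "us-west-2b", data: map[string][]byte{}},
		{zone: "us-west-2c", data: map[string][]byte{}},
	}
	setAll(replicas, "profile:42", []byte("prefs"), 2) // wait for 2 of 3 zones
	fmt.Println("write fanned out to all zones")
}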

SLIDE 25

Use Case: Lookaside Cache

(Diagram: Application, Client Library / Client, Ribbon Client; S S S S; C C C C)

Data Flow
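The lookaside pattern itself is straightforward. Below is a minimal Go sketch with stand-in cache and database types (EVCache's real client is Java): check the cache, and on a miss load from the backing store and populate the cache for the next reader.

package main

import "fmt"

// mapCache and fakeDB are hypothetical stand-ins for the EVCache client
// and the backing service / data store.
type mapCache map[string][]byte

func (c mapCache) get(key string) ([]byte, bool)        { v, ok := c[key]; return v, ok }
func (c mapCache) set(key string, v []byte, ttlSec int) { c[key] = v } // TTL ignored in this sketch

type fakeDB struct{}

func (fakeDB) load(key string) ([]byte, error) { return []byte("row for " + key), nil }

// lookasideGet returns the cached value when present; on a miss it loads
// from the backing store and writes the result back into the cache.
func lookasideGet(c mapCache, db fakeDB, key string, ttlSec int) ([]byte, error) {
	if v, ok := c.get(key); ok {
		return v, nil // cache hit
	}
	v, err := db.load(key) // cache miss: read the source of truth
	if err != nil {
		return nil, err
	}
	c.set(key, v, ttlSec) // populate for the next reader
	return v, nil
}

func main() {
	c := mapCache{}
	v, _ := lookasideGet(c, fakeDB{}, "title:123", 300) // miss, then fill
	fmt.Println(string(v))
	v, _ = lookasideGet(c, fakeDB{}, "title:123", 300) // now a hit
	fmt.Println(string(v))
}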
SLIDE 26

Use Case: Transient Data Store

(Diagram: several applications, each with the Client Library / Client, read and write the same cached data over time)
SLIDE 27

Use Case: Primary Store

Offline / Nearline Precomputes for Recommendations

(Diagram: offline services write precomputed data that the online application reads through the Client Library / Client)
SLIDE 28

Use Case: Versioned Primary Store

(Diagram: Offline Compute publishes data that the Online Application reads through the Client Library / Client, with Archaius (Dynamic Properties) and the Control System (Valhalla) managing which version is live)

SLIDE 29

Use Case: High Volume && High Availability

Compute & Publish on a schedule
Data Flow

(Diagram: Application, Client Library with In-memory and Remote layers, Ribbon Client; Optional; S S S S / C C C C)

SLIDE 30

Pipeline of Personalization

(Diagram: offline compute stages A–E feed data through the cache to online services Online 1 and Online 2)

SLIDE 31

Additional Features

Kafka

  • Global data replication
  • Consistency metrics

Key Iteration

  • Cache warming
  • Lost instance recovery
  • Backup (and restore)
SLIDE 32

Additional Features (Kafka)

  • Global data replication
  • Consistency metrics

SLIDE 33

Cross-Region Replication

(Diagram: Region A and Region B, each with an APP, Kafka, a Repl Relay, and a Repl Proxy)

1. mutate
2. send metadata
3. poll msg
4. get data for set
5. https send msg
6. mutate
7. read
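A hedged Go sketch of these steps, with a channel and a plain function call standing in for Kafka and the HTTPS hop (all type names are illustrative): only metadata goes onto the Kafka topic, the relay fetches the value from the local cache for sets, and the remote proxy applies the mutation in the destination region.

package main

import "fmt"

// replMsg is the metadata written to Kafka for each mutation: the key, the
// operation, and the TTL, but not the value (names here are illustrative).
type replMsg struct {
	op  string // "set" or "delete"
	key string
	ttl int
	val []byte // filled in by the relay, not by the application
}

type cache map[string][]byte

// relay polls replication metadata (step 3), fetches the value from the
// local cache for sets (step 4), and sends the message to the remote
// region's proxy (step 5); a plain function call stands in for HTTPS here.
func relay(local cache, msgs <-chan replMsg, sendToRemote func(replMsg)) {
	for m := range msgs {
		if m.op == "set" {
			m.val = local[m.key] // step 4: get data for set
		}
		sendToRemote(m) // step 5
	}
}

// proxy applies the mutation in the destination region (step 6).
func proxy(remote cache, m replMsg) {
	switch m.op {
	case "set":
		remote[m.key] = m.val
	case "delete":
		delete(remote, m.key)
	}
}

func main() {
	regionA, regionB := cache{}, cache{}
	msgs := make(chan replMsg, 1)

	// Step 1: the app mutates the local cache; step 2: metadata goes to Kafka
	// (a buffered channel stands in for the Kafka topic in this sketch).
	regionA["profile:42"] = []byte("prefs")
	msgs <- replMsg{op: "set", key: "profile:42", ttl: 3600}
	close(msgs)

	relay(regionA, msgs, func(m replMsg) { proxy(regionB, m) })
	fmt.Println(string(regionB["profile:42"])) // step 7: readable in region B
}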
SLIDE 34

Additional Features (Key Iteration)

  • Cache warming
  • Lost instance recovery
  • Backup (and restore)

SLIDE 35

Cache Warming

(Diagram: Cache Warmer (Spark), Application with Client Library / Client, Control, and S3, connected by data, metadata, and control flows)
SLIDE 36

Moneta

Next-generation EVCache server

SLIDE 37

Moneta

Moneta: The Goddess of Memory. Juno Moneta: The Protectress of Funds.

  • Evolution of the EVCache server
  • Cost optimization
  • EVCache on SSD
  • Ongoing effort to lower EVCache cost per stream
  • Takes advantage of global request patterns
SLIDE 38

Old Server

  • Stock Memcached and EVCar (sidecar)
  • All data stored in RAM in Memcached
  • Expensive with global expansion / N+1 architecture

(Diagram: Memcached and the EVCar sidecar, serving traffic on the external port)

SLIDE 39

Optimization

  • Global data means many copies
  • Access patterns are heavily region-oriented
  • In one region:

○ Hot data is used often ○ Cold data is almost never touched

  • Keep hot data in RAM, cold data on SSD
  • Size RAM for working set, SSD for overall dataset
SLIDE 40

New Server

  • Adds Rend and Mnemonic
  • Still looks like Memcached
  • Unlocks cost-efficient storage & server-side intelligence
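Because the new server still speaks the Memcached protocol, existing clients work unchanged. A small Go example using the memcached text protocol over TCP (the address, key, and TTL are placeholders; error handling is trimmed):

package main

import (
	"bufio"
	"fmt"
	"net"
)

// A minimal client for the memcached text protocol; Rend/Moneta listens on
// a memcached-compatible port, so this works against it as well as against
// stock memcached.
func main() {
	conn, err := net.Dial("tcp", "localhost:11211")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	r := bufio.NewReader(conn)

	// set <key> <flags> <exptime> <bytes>\r\n<data>\r\n  ->  STORED
	val := []byte("hello moneta")
	fmt.Fprintf(conn, "set demo:key 0 300 %d\r\n", len(val))
	conn.Write(val)
	conn.Write([]byte("\r\n"))
	status, _ := r.ReadString('\n')
	fmt.Print("set: ", status)

	// get <key>\r\n  ->  VALUE <key> <flags> <bytes>\r\n<data>\r\nEND
	fmt.Fprintf(conn, "get demo:key\r\n")
	header, _ := r.ReadString('\n')
	fmt.Print("get: ", header)
	data, _ := r.ReadString('\n')
	fmt.Print("data: ", data)
	end, _ := r.ReadString('\n')
	fmt.Print(end)
}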
SLIDE 41

go get github.com/netflix/rend

Rend

SLIDE 42

Rend

  • High-performance Memcached proxy & server
  • Written in Go

○ Powerful concurrency primitives ○ Productive and fast

  • Manages the L1/L2 relationship
  • Tens of thousands of connections
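One plausible shape of the L1/L2 read orchestration, as a hedged Go sketch with stand-in store types (Rend's real orchestrators are more involved): serve from L1 (Memcached, RAM) when possible, fall through to L2 (Mnemonic, SSD) on a miss, and backfill L1 so hot data stays in memory.

package main

import (
	"errors"
	"fmt"
)

var errMiss = errors.New("miss")

// store is a hypothetical interface over a backend handler; in Rend the L1
// backend is local memcached (RAM) and the L2 backend is Mnemonic (SSD).
type store interface {
	get(key string) ([]byte, error)
	set(key string, val []byte) error
}

type mapStore map[string][]byte

func (s mapStore) get(key string) ([]byte, error) {
	if v, ok := s[key]; ok {
		return v, nil
	}
	return nil, errMiss
}
func (s mapStore) set(key string, val []byte) error { s[key] = val; return nil }

// orchestratedGet serves from L1 when it can, falls through to L2 on a miss,
// and backfills L1 so the next read for a hot key stays in RAM.
func orchestratedGet(l1, l2 store, key string) ([]byte, error) {
	if v, err := l1.get(key); err == nil {
		return v, nil // hot data: served from RAM
	}
	v, err := l2.get(key)
	if err != nil {
		return nil, err // true miss
	}
	_ = l1.set(key, v) // promote cold data that just became hot
	return v, nil
}

func main() {
	l1, l2 := mapStore{}, mapStore{"title:123": []byte("metadata")}
	v, _ := orchestratedGet(l1, l2, "title:123")
	fmt.Printf("%s; backfilled into L1: %v\n", v, len(l1) == 1)
}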
SLIDE 43

Rend

  • Modular set of libraries and an example main()
  • Manages connections, request orchestration, and communication

  • Low-overhead metrics library
  • Multiple orchestrators
  • Parallel locking for data integrity
  • Efficient connection pool

(Diagram: Server Loop, Request Orchestration, Backend Handlers, Connection Management, Protocol, and Metrics modules)

SLIDE 44

Mnemonic

SLIDE 45

Mnemonic

  • Manages data storage on SSD
  • Uses Rend server libraries

○ Handles Memcached protocol

  • Maps Memcached ops to RocksDB ops

(Layer stack: Rend Server Core Lib (Go) → Mnemonic Op Handler (Go) → Mnemonic Core (C++) → RocksDB (C++))
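A hedged Go sketch of that mapping, using an in-memory stand-in for a RocksDB instance (the real Mnemonic core calls RocksDB from C++): a memcached set becomes a Put whose stored value carries an absolute expiry, a get becomes a Get that treats expired records as misses (this matters later under FIFO limitations), and a delete becomes a Delete.

package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

var errMiss = errors.New("miss")

// kv stands in for a RocksDB instance; only the op mapping is shown here.
type kv interface {
	put(key, val []byte) error
	get(key []byte) ([]byte, error)
	del(key []byte) error
}

type memKV map[string][]byte

func (m memKV) put(k, v []byte) error { m[string(k)] = v; return nil }
func (m memKV) get(k []byte) ([]byte, error) {
	if v, ok := m[string(k)]; ok {
		return v, nil
	}
	return nil, errMiss
}
func (m memKV) del(k []byte) error { delete(m, string(k)); return nil }

// mcSet maps a memcached set to a Put: the stored record carries an absolute
// expiry so reads can filter out records FIFO compaction has not dropped yet.
func mcSet(db kv, key string, val []byte, ttl time.Duration) error {
	rec := make([]byte, 8+len(val))
	binary.BigEndian.PutUint64(rec, uint64(time.Now().Add(ttl).Unix()))
	copy(rec[8:], val)
	return db.put([]byte(key), rec)
}

// mcGet maps a memcached get to a Get, treating expired records as misses.
func mcGet(db kv, key string) ([]byte, error) {
	rec, err := db.get([]byte(key))
	if err != nil {
		return nil, err
	}
	if time.Now().Unix() > int64(binary.BigEndian.Uint64(rec)) {
		return nil, errMiss // record still on SSD, but logically expired
	}
	return rec[8:], nil
}

// mcDelete maps a memcached delete to a Delete.
func mcDelete(db kv, key string) error { return db.del([]byte(key)) }

func main() {
	db := memKV{}
	mcSet(db, "title:123", []byte("metadata"), time.Hour)
	v, err := mcGet(db, "title:123")
	fmt.Println(string(v), err)
}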

SLIDE 46

Why RocksDB?

  • Fast at medium to high write load

○ Disk write load is higher than read load (because Memcached serves most reads)

  • Predictable RAM Usage
(Diagram: writes go to an in-memory memtable, which is flushed to SST files on disk. SST: Static Sorted Table)

SLIDE 47

How we use RocksDB

  • No Level Compaction

○ Generated too much traffic to SSD ○ High and unpredictable read latencies

  • No Block Cache

○ Rely on Local Memcached

  • No Compression
SLIDE 48

How we use RocksDB

  • FIFO Compaction

○ SSTs ordered by time ○ Oldest SST deleted when full ○ Reads access every SST until the record is found

SLIDE 49

How we use RocksDB

  • Full File Bloom Filters

○ Full Filter reduces unnecessary SSD reads

  • Bloom Filters and Indices pinned in memory

○ Minimize SSD access per request

SLIDE 50

How we use RocksDB

  • Records sharded across multiple RocksDB instances per node

○ Reduces number of files checked to decrease latency

(Diagram: Mnemonic Core hashes each key, e.g. ABC or XYZ, to one of several RocksDB instances (R) on the node)
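A minimal Go sketch of the sharding step, with plain maps standing in for the per-node RocksDB instances: hash the key, pick one instance, and only that instance's SST files are consulted on a read.

package main

import (
	"fmt"
	"hash/fnv"
)

// shardedDBs models several RocksDB instances on one node; each element
// stands in for one DB handle (the real handles live in Mnemonic's C++ core).
type shardedDBs []map[string][]byte

// shardFor hashes the key to pick the single instance that owns it, so a
// read only consults that instance's SST files instead of every file on the node.
func (s shardedDBs) shardFor(key string) map[string][]byte {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s[h.Sum32()%uint32(len(s))]
}

func main() {
	dbs := make(shardedDBs, 8) // e.g. 8 RocksDB instances per node
	for i := range dbs {
		dbs[i] = map[string][]byte{}
	}
	dbs.shardFor("ABC")["ABC"] = []byte("value for ABC")
	dbs.shardFor("XYZ")["XYZ"] = []byte("value for XYZ")
	fmt.Println(string(dbs.shardFor("ABC")["ABC"]))
}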
SLIDE 51

Region-Locality Optimizations

  • Replication and batch updates write only to RocksDB*

○ Keeps region-local, “hot” data in memory ○ Separate network port for “off-line” requests ○ Data already present in Memcached is replaced so the RAM copy stays current

SLIDE 52

FIFO Limitations

  • FIFO compaction not suitable for all use cases

○ Very frequently updated records may push out valid records

  • Expired Records still exist
  • Requires Larger Bloom Filters
(Diagram: with FIFO compaction, repeated updates to records A and B fill SSTs over time and can push out still-valid records C–H)
SLIDE 53

AWS Instance Type

  • i2.xlarge

○ 4 vCPU ○ 30 GB RAM ○ 800 GB SSD ■ 32K IOPS (4KB Pages) ■ ~130MB/sec

SLIDE 54

Moneta Perf Benchmark (High-Vol Online Requests)

SLIDE 55

Moneta Perf Benchmark (cont)

SLIDE 56

Moneta Perf Benchmark (cont)

SLIDE 57

Moneta Perf Benchmark (cont)

SLIDE 58

Moneta Performance in Production (Batch Systems)

  • Request ‘get’ 95%, 99% = 729 μs, 938 μs
  • L1 ‘get’ 95%, 99% = 153 μs, 191 μs
  • L2 ‘get’ 95%, 99% = 1005 μs, 1713 μs
  • ~20 KB Records
  • ~99% Overall Hit Rate
  • ~90% L1 Hit Rate
SLIDE 59

Moneta Performance in Prod (High Vol-Online Req)

  • Request ‘get’ 95%, 99% = 174 μs, 588 μs
  • L1 ‘get’ 95%, 99% = 145 μs, 190 μs
  • L2 ‘get’ 95%, 99% = 770 μs, 1330 μs
  • ~1 KB Records
  • ~98% Overall Hit Rate
  • ~97% L1 Hit Rate
SLIDE 60

Moneta Performance in Prod (High Vol-Online Req)

Latencies: peak (trough)

Get Percentiles:

  • 50th: 102 μs (101 μs)
  • 75th: 120 μs (115 μs)
  • 90th: 146 μs (137 μs)
  • 95th: 174 μs (166 μs)
  • 99th: 588 μs (427 μs)
  • 99.5th: 733 μs (568 μs)
  • 99.9th: 1.39 ms (979 μs)

Set Percentiles:

  • 50th: 97.2 μs (87.2 μs)
  • 75th: 107 μs (101 μs)
  • 90th: 125 μs (115 μs)
  • 95th: 138 μs (126 μs)
  • 99th: 177 μs (152 μs)
  • 99.5th: 208 μs (169 μs)
  • 99.9th: 1.19 ms (318 μs)

SLIDE 61

70% reduction in cost*

SLIDE 62

Challenges/Concerns

  • Less Visibility

○ Overall data size is unclear because of duplicate and expired records ○ Restrict the unique data set to ½ of max for precompute batch data

  • Lower Max Throughput than Memcached-based Server

○ Higher CPU usage ○ Capacity planning must be better so we can handle unusually high request spikes

SLIDE 63

Current/Future Work

  • Investigate Blob Storage feature

○ Less Data read/write from SSD during Level Compaction ○ Lower Latency, Higher Throughput ○ Better View of Total Data Size

  • Purging expired SSTs earlier

○ Useful in “short” TTL use cases ○ May purge 60%+ of SSTs earlier than FIFO compaction would ○ Reduce worst-case latency ○ Better visibility of overall data size

  • Inexpensive Deduping for Batch Data
SLIDE 64

Open Source

https://github.com/netflix/EVCache https://github.com/netflix/rend

SLIDE 65

Thank You

smansfield@netflix.com (@sgmansfield) nguyenvu@netflix.com techblog.netflix.com

SLIDE 66

SLIDE 67

Failure Resilience in Client

  • Operation Fast Failure
  • Tunable Retries
  • Operation Queues
  • Tunable Latch for Mutations
  • Async Replication through Kafka
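A hedged Go sketch of the first two items, with entirely hypothetical types: a bounded per-server operation queue that fails fast when saturated, and a retry helper whose attempt count is a tunable property rather than a constant.

package main

import (
	"errors"
	"fmt"
)

var errQueueFull = errors.New("operation queue full: failing fast")

// opQueue is a hypothetical bounded per-server operation queue: when the
// queue is full the client fails the operation immediately instead of
// letting latency pile up behind a slow or dead instance. A worker
// goroutine would normally drain q.ops; it is omitted in this sketch.
type opQueue struct {
	ops chan func() error
}

func newOpQueue(depth int) *opQueue { return &opQueue{ops: make(chan func() error, depth)} }

// submit enqueues an operation, or fails fast when the queue is saturated.
func (q *opQueue) submit(op func() error) error {
	select {
	case q.ops <- op:
		return nil
	default:
		return errQueueFull
	}
}

// withRetries runs an operation up to maxRetries+1 times; the retry count is
// a tunable property in the real client rather than a hard-coded value.
func withRetries(maxRetries int, op func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	q := newOpQueue(1)
	_ = q.submit(func() error { return nil }) // fills the queue
	err := q.submit(func() error { return nil })
	fmt.Println(err) // second submit fails fast

	err = withRetries(2, func() error { return errors.New("transient") })
	fmt.Println(err)
}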
SLIDE 68

Consistency Metrics

(Diagram: Region A, with an APP, a Consistency Checker, Kafka, Atlas (Metrics Backend), and Client Dashboards)

1. mutate
2. send metadata
3. poll msg
4. pull data
5. report

SLIDE 69

Lost Instance Recovery

(Diagram: Cache Warmer (Spark), Application with Client Library / Client, S3, and Control across Zone A and Zone B, connected by partial data, data, metadata, and control flows)
SLIDE 70

Backup (and Restore)

(Diagram: Cache Warmer (Spark), Application with Client Library / Client, Control, and S3, connected by data and control flows)
SLIDE 71

Moneta in Production

  • Serving all of our personalization data
  • Rend runs with two ports:

○ One for standard users (read heavy or active data management) ○ Another for async and batch users: Replication and Precompute

  • Maintains working set in RAM
  • Optimized for precomputes

○ Smartly replaces data in L1

(Diagram: external and internal ports (Std and Batch) in front of EVCar, Memcached (RAM), and Mnemonic (SSD))

SLIDE 72

Rend batching backend