SLIDE 1

IN-MEMORY CACHING: CURB TAIL LATENCY WITH PELIKAN

SLIDE 2

ABOUT ME

  • 6 years at Twitter, on cache
  • maintainer of Twemcache & Twitter’s Redis fork
  • operated thousands of machines
  • served hundreds of (internal) customers
  • now working on Pelikan, a next-gen cache framework to replace the above @twitter
  • Twitter: @thinkingfish
SLIDE 3

THE PROBLEM: CACHE PERFORMANCE
SLIDE 4

CACHE RULES EVERYTHING AROUND ME

[Diagram: SERVICE → CACHE → DB]
SLIDE 5

CACHE RUINS EVERYTHING AROUND ME

[Diagram: SERVICE → CACHE → DB, with services fuming 😤 at slow responses]
SLIDE 6

LATENCY & FANOUT

  • what determines the overall 99th percentile of a request?

[Diagram: one SERVICE fanning out to many CACHE shards]

req: all tweets for #qcon ⇒ tid 1, tid 2, …, tid n (assume n is large)

fanout   per-shard percentile that bounds overall p99
1        p99
10       p999
100      p9999
1000     p99999
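To make the table concrete: if a request fans out to n shards and waits for all of them, the chance that every shard beats a latency bound is p^n, so each shard must hit quantile target^(1/n). A back-of-the-envelope sketch (a standalone illustration, not from any cache codebase):

#include <math.h>
#include <stdio.h>

/* For a request that waits on all n shards, P(all shards beat t) = p^n,
 * so hitting an overall target quantile requires each shard to hit
 * target^(1/n). This reproduces the fanout table above. */
int main(void) {
    const double target = 0.99;
    const int fanouts[] = { 1, 10, 100, 1000 };
    for (int i = 0; i < 4; i++) {
        double per_shard = pow(target, 1.0 / fanouts[i]);
        printf("fanout %4d -> per-shard quantile %.6f\n", fanouts[i], per_shard);
    }
    return 0;
}

For fanout 10 this prints ≈0.998996, i.e. each shard's p999 bounds the overall p99, matching the table.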

SLIDE 7

LATENCY & DEPENDENCY

  • what determines the overall 99th percentile?
  • sequential calls add their latencies together
  • N steps ⇒ N× exposure to tail latency

[Diagram: SERVICE A → SERVICE B → SERVICE C; get timeline → get tweets → get users for each tweet]

SLIDE 8

CACHE IS UBIQUITOUS

  • exposure to cache tail latency increases with both scale and dependency!

[Diagram: SERVICE A, SERVICE B, and SERVICE C, each backed by multiple CACHE instances]
SLIDE 9

GOOD CACHE PERFORMANCE = PREDICTABLE LATENCY

SLIDE 10

GOOD CACHE PERFORMANCE = PREDICTABLE TAIL LATENCY

SLIDE 11

“MILLIONS OF QPS PER MACHINE”
“SUB-MILLISECOND LATENCIES”
“NEAR LINE-RATE THROUGHPUT”
…

KING OF PERFORMANCE
SLIDE 12

“USUALLY PRETTY FAST”
“HICCUPS EVERY ONCE IN A WHILE”
“TIMEOUT SPIKES AT THE TOP OF THE HOUR”
“SLOW ONLY WHEN MEMORY IS LOW”
…

GHOSTS OF PERFORMANCE
SLIDE 13

I SPENT MY FIRST 3 MONTHS AT TWITTER LEARNING CACHE BASICS… …AND THE NEXT 5 YEARS CHASING GHOSTS

SLIDE 14
SLIDE 15

CHASING DOWN GHOSTS = MINIMIZING NONDETERMINISTIC BEHAVIOR
SLIDE 16

HOW?

IDENTIFY · AVOID · MITIGATE

SLIDE 17

A PRIMER: CACHING IN THE DATACENTER
SLIDE 18

DATACENTER

  • geographically centralized
  • highly homogeneous network
  • relatively reliable infrastructure
SLIDE 19

CACHING

MAINLY: REQUEST → RESPONSE

INITIALLY: CONNECT

ALSO (BECAUSE WE ARE GROWN-UPS): STATS, LOGGING, HEALTH CHECK…
SLIDE 20

CACHE SERVER: A BIRD’S-EYE VIEW

[Diagram: a HOST running an event-driven server over protocol, data storage, OS, and network infrastructure]
SLIDE 21

HOW DID WE UNCOVER THE UNCERTAINTIES?
SLIDE 22

“BANDWIDTH UTILIZATION WENT WAY UP, EVEN THOUGH REQUEST RATE WAS WAY LOWER.”
SLIDE 23

SYSCALLS

SLIDE 24

CONNECTING IS SYSCALL-HEAVY

read event → accept → config → register

4+ syscalls
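A minimal sketch of those four syscalls on the accept path (illustrative names, not Twemcache's actual code; assumes an epoll-based event loop):

#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* One new connection costs at least four syscalls before it can serve
 * a single request: accept, two fcntls to make it non-blocking, and an
 * epoll registration. */
int conn_accept(int listen_fd, int epoll_fd) {
    int fd = accept(listen_fd, NULL, NULL);           /* 1: accept */
    if (fd < 0)
        return -1;
    int flags = fcntl(fd, F_GETFL, 0);                /* 2: read socket flags */
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);           /* 3: configure non-blocking */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);      /* 4: register with event loop */
    return fd;
}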
SLIDE 25

REQUEST IS SYSCALL-LIGHT

read event → IO (read) → post-read → parse → process → compose → write event → IO (write) → post-write

3 syscalls*

*: the event loop returns multiple read events at once; I/O syscalls can be further amortized by batching/pipelining
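And a sketch of the steady-state request path (again illustrative; an echo stands in for parse/process/compose): one epoll_wait can return many ready connections, so its cost amortizes and each request pays roughly one read plus one write.

#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 1024

void event_loop(int epoll_fd) {
    struct epoll_event events[MAX_EVENTS];
    char buf[16 * 1024];

    for (;;) {
        /* syscall 1, amortized across every ready connection */
        int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            ssize_t len = read(fd, buf, sizeof(buf));  /* syscall 2 */
            if (len <= 0) {
                close(fd);
                continue;
            }
            /* parse, process, compose run in user space: no syscalls */
            write(fd, buf, (size_t)len);               /* syscall 3 */
        }
    }
}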
SLIDE 26

TWEMCACHE IS MOSTLY SYSCALLS

  • 1–2 µs overhead per call
  • syscalls dominate CPU time in a simple cache
  • what if we have 100k conns / sec? at 4+ syscalls per connection and 1–2 µs each, that is 0.4–0.8 s of CPU per second spent on connection handling alone
SLIDE 27

culprit: CONNECTION STORM
SLIDE 28

“…TWEMCACHE RANDOM HICCUPS, ALWAYS AT THE TOP OF THE HOUR.”
SLIDE 29

DISK

[Diagram: the cache worker thread (tworker) logs to disk while a cron job “x” competes for the same disk I/O]
SLIDE 30

culprit: BLOCKING I/O
SLIDE 31

“WE ARE SEEING SEVERAL ‘BLIPS’ AFTER EACH CACHE REBOOT…”
SLIDE 32

A TIMELINE

MEMCACHE RESTART … MANY REQUESTS TIMED OUT → CONNECTION STORM → SOME MORE REQUESTS TIMED OUT → (REPEAT A FEW TIMES)

[Diagram annotation: “lock! lock!”]
SLIDE 33

culprit: LOCKING
SLIDE 34

LOCKING FACTS

  • ~25 ns per lock operation
  • more expensive on NUMA
  • much more costly when contended
SLIDE 35

“HOSTS WITH LONG-RUNNING TWEMCACHE/REDIS TRIGGER OOM DURING LOAD SPIKES.”
SLIDE 36

“REDIS INSTANCES THAT STARTED EVICTING SUDDENLY GOT SLOWER.”
SLIDE 37

culprit: MEMORY LAYOUT / OPS
SLIDE 38

SUMMARY

CONNECTION STORM · BLOCKING I/O · LOCKING · MEMORY
SLIDE 39

HOW TO MITIGATE?

SLIDE 40

HIDE EXPENSIVE OPS

PUT OPERATIONS OF DIFFERENT NATURE / PURPOSE ON SEPARATE THREADS
SLIDE 41

DATA PLANE, CONTROL PLANE

SLIDE 42

SLOW: CONTROL PLANE

STATS AGGREGATION · STATS EXPORTING · LOG DUMP · LOG ROTATION …
SLIDE 43

FAST: DATA PLANE / REQUEST

tworker: read event → IO (read) → post-read → parse → process → compose → write event → IO (write) → post-write
SLIDE 44

FAST: DATA PLANE / CONNECT

tserver: read event → accept → config → dispatch
tworker: read event → register
SLIDE 45

LATENCY-ORIENTED THREADING

tworker: REQUESTS
tserver: CONNECTS (hands new connections to tworker)
tadmin: OTHER (receives logging and stats updates from the other threads)
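As a minimal skeleton (thread names mirror the slide; the loop bodies are placeholders, not Pelikan's implementation), the split might look like:

#include <pthread.h>

/* Each class of work gets its own thread, so slow control-plane work
 * (logging, stats export) never blocks the request path. */
static void *worker_loop(void *arg) { (void)arg; /* requests: read, parse, respond */ return NULL; }
static void *server_loop(void *arg) { (void)arg; /* connects: accept, hand off to tworker */ return NULL; }
static void *admin_loop(void *arg)  { (void)arg; /* other: stats export, log flush */ return NULL; }

int main(void) {
    pthread_t tworker, tserver, tadmin;
    pthread_create(&tworker, NULL, worker_loop, NULL);
    pthread_create(&tserver, NULL, server_loop, NULL);
    pthread_create(&tadmin,  NULL, admin_loop,  NULL);
    pthread_join(tworker, NULL);
    pthread_join(tserver, NULL);
    pthread_join(tadmin, NULL);
    return 0;
}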
SLIDE 46

WHAT TO AVOID?

SLIDE 47

LOCKING

SLIDE 48

WHAT WE KNOW

  • inter-thread communication in cache:
    • stats
    • logging
    • connection hand-off
  • locking propagates blocking/delay between threads

[Diagram: tworker and tserver send logging and stats updates to tadmin; tserver hands new connections to tworker]
SLIDE 49

LOCKLESS OPERATIONS

MAKE STATS UPDATE LOCKLESS w/ atomic instructions
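A sketch of what lockless stats can look like in C11 (illustrative names; Pelikan's actual metrics code differs): worker threads bump counters with atomic adds, and the admin thread reads them without taking any lock.

#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_uint_fast64_t requests;
    atomic_uint_fast64_t hits;
} stats_t;

static stats_t stats;

/* Hot path: a single atomic add, no lock, no contention window. */
static inline void stats_incr(atomic_uint_fast64_t *counter) {
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}

/* Admin thread: reads each counter atomically for export. */
static inline uint64_t stats_read(atomic_uint_fast64_t *counter) {
    return (uint64_t)atomic_load_explicit(counter, memory_order_relaxed);
}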
SLIDE 50

LOCKLESS OPERATIONS

MAKE LOGGING LOCKLESS w/ a RING/CYCLIC BUFFER

[Diagram: a single writer advances the write position while a single reader advances the read position around the buffer]
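A sketch of a single-producer/single-consumer byte ring for log messages (names and sizes are illustrative, not Pelikan's API): the worker appends without blocking, a background thread drains to disk, and a full buffer drops the message rather than stalling the request path.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LOG_BUF_SIZE 4096  /* illustrative capacity */

typedef struct {
    char          data[LOG_BUF_SIZE];
    atomic_size_t wpos;  /* total bytes written; only the writer advances it */
    atomic_size_t rpos;  /* total bytes read; only the reader advances it */
} log_ring_t;

/* Writer side (worker thread): never blocks, drops on overflow. */
bool log_write(log_ring_t *r, const char *msg, size_t len) {
    size_t w  = atomic_load_explicit(&r->wpos, memory_order_relaxed);
    size_t rd = atomic_load_explicit(&r->rpos, memory_order_acquire);
    if (LOG_BUF_SIZE - (w - rd) < len)
        return false;                      /* full: drop instead of stalling */
    for (size_t i = 0; i < len; i++)
        r->data[(w + i) % LOG_BUF_SIZE] = msg[i];
    atomic_store_explicit(&r->wpos, w + len, memory_order_release);
    return true;
}

/* Reader side (flush thread): drains whatever is available. */
size_t log_read(log_ring_t *r, char *out, size_t max) {
    size_t rd = atomic_load_explicit(&r->rpos, memory_order_relaxed);
    size_t w  = atomic_load_explicit(&r->wpos, memory_order_acquire);
    size_t n  = w - rd;
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++)
        out[i] = r->data[(rd + i) % LOG_BUF_SIZE];
    atomic_store_explicit(&r->rpos, rd + n, memory_order_release);
    return n;
}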
SLIDE 51

LOCKLESS OPERATIONS

MAKE CONNECTION HAND-OFF LOCKLESS w/ a RING ARRAY (the same single-writer/single-reader structure as the log ring, holding fixed-size elements such as connection pointers)

[Diagram: writer and reader positions advancing over a fixed-size array]
SLIDE 52

MEMORY

SLIDE 53

WHAT WE KNOW

  • alloc-free cycles cause fragmentation
  • internal vs. external fragmentation
  • OOM/swapping is deadly
  • memory alloc/copy is relatively expensive
SLIDE 54

PREDICTABLE FOOTPRINT

AVOID EXTERNAL FRAGMENTATION · CAP ALL MEMORY RESOURCES
SLIDE 55

PREDICTABLE RUNTIME

REUSE BUFFERS · PREALLOCATE
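A sketch of preallocate-and-reuse (illustrative and single-threaded; a real server would use one pool per thread or a lockless ring as above): all buffers are carved out at startup, so the steady state has no malloc/free, a hard memory cap, and no external fragmentation.

#include <stdlib.h>

#define BUF_SIZE  (16 * 1024)
#define POOL_CAP  1024        /* hard cap on buffer count, illustrative */

typedef struct buf {
    struct buf *next;
    char        data[BUF_SIZE];
} buf_t;

static buf_t *free_list;

/* Allocate everything once, up front. */
int pool_setup(void) {
    for (int i = 0; i < POOL_CAP; i++) {
        buf_t *b = malloc(sizeof(*b));
        if (b == NULL)
            return -1;
        b->next = free_list;
        free_list = b;
    }
    return 0;
}

/* Steady state: O(1) pointer swaps, no syscalls, bounded memory. */
buf_t *buf_borrow(void) {
    buf_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;                 /* NULL means the cap was reached */
}

void buf_return(buf_t *b) {
    b->next = free_list;
    free_list = b;
}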
SLIDE 56

IMPLEMENTATION: PELIKAN CACHE
SLIDE 57

WHAT IS PELIKAN CACHE?

  • (Datacenter-) Caching framework
  • A summary of Twitter’s cache ops
  • Perf goal: deterministically fast
  • Clean, modular design
  • Open-source

[Module diagram: waitless logging, lockless metrics, composed config, channels, buffers, timer/alarm, pooling, streams, events, data store, parse/compose/trace, data model, request/response, server, orchestration, threading; layered as common → core → cache process]

pelikan.io
SLIDE 58

A COMPARISON: PERFORMANCE DESIGN DECISIONS

                               Memcached   Redis          Pelikan
latency-oriented threading     partial     no → partial   yes
memory: fragmentation          internal    external       internal
memory: buffer caching         partial     no             yes
memory: pre-allocation, cap    partial     partial        yes
locking                        yes         no → yes       no
SLIDE 59

TO BE FAIR…

MEMCACHED
  • multiple threads can boost throughput
  • binary protocol + SASL

REDIS
  • rich set of data structures
  • RDB
  • master-slave replication
  • redis-cluster
  • modules
  • tools
SLIDE 60

A SCALABLE CACHE IS… ALWAYS FAST
SLIDE 61

“CAREFUL ABOUT MOVING TO MULTIPLE WORKER THREADS”
SLIDE 62

QUESTIONS?