IN-MEMORY CACHING: CURB TAIL LATENCY WITH PELIKAN (PowerPoint presentation)


SLIDE 1

IN-MEMORY CACHING: CURB TAIL LATENCY WITH PELIKAN

SLIDE 2

ABOUT ME

  • 6 years at Twitter, on cache
  • maintainer of Twemcache (OSS), Twitter’s Redis fork
  • operations of thousands of machines
  • hundreds of (internal) customers
  • Now working on Pelikan, a next-gen cache framework to replace the above @twitter
  • Twitter: @thinkingfish
SLIDE 3

THE PROBLEM:

CACHE PERFORMANCE

SLIDE 4

CACHE RULES EVERYTHING AROUND ME

CACHE

DB

SERVICE

SLIDE 5

CACHE RUINS EVERYTHING AROUND ME

CACHE

DB

SERVICE

SENSITIVE! 😤

SLIDE 6

GOOD CACHE PERFORMANCE = PREDICTABLE LATENCY

SLIDE 7

GOOD CACHE PERFORMANCE = PREDICTABLE TAIL LATENCY

SLIDE 8

“MILLIONS OF QPS PER MACHINE” “SUB-MILLISECOND LATENCIES” “NEAR LINE-RATE THROUGHPUT” …

KING OF PERFORMANCE

SLIDE 9

“USUALLY PRETTY FAST” “HICCUPS EVERY ONCE IN A WHILE” “TIMEOUT SPIKES AT THE TOP OF THE HOUR” “SLOW ONLY WHEN MEMORY IS LOW” …

GHOSTS OF PERFORMANCE

SLIDE 10

I SPENT MY FIRST 3 MONTHS AT TWITTER LEARNING CACHE BASICS… …AND THE NEXT 5 YEARS CHASING GHOSTS

SLIDE 11

SLIDE 12

CONTAIN GHOSTS =
MINIMIZE NONDETERMINISTIC BEHAVIOR

SLIDE 13

HOW?

IDENTIFY → AVOID → MITIGATE

SLIDE 14

A PRIMER:

CACHING IN THE DATACENTER

SLIDE 15

CONTEXT

  • geographically centralized
  • highly homogeneous network
  • reliable, predictable infrastructure
  • long-lived connections
  • high data rate
  • simple data/operations
SLIDE 16

CACHE IN PRODUCTION

INITIALLY:

CONNECT

MAINLY:

REQUEST → RESPONSE

ALSO (BECAUSE WE ARE ADULTS):

STATS, LOGGING, HEALTH CHECK…

SLIDE 17

CACHE: A BIRD’S-EYE VIEW

HOST

event-driven server
protocol
data storage
OS
network infrastructure

SLIDE 18

HOW DID WE UNCOVER THE UNCERTAINTIES?

SLIDE 19

“BANDWIDTH UTILIZATION WENT WAY UP, BUT REQUEST RATE WAY DOWN.”

SLIDE 20

SYSCALLS

SLIDE 21

CONNECTING IS SYSCALL-HEAVY

read event → accept → config → register

4+ syscalls

SLIDE 22

REQUEST IS SYSCALL-LIGHT

read event → IO (read) → post-read → parse → process → compose → write event → IO (write) → post-write

3 syscalls*

*: the event loop returns multiple read events at once; I/O syscalls can be further amortized by batching/pipelining

SLIDE 23

TWEMCACHE IS MOSTLY SYSCALLS

  • 1-2 µs overhead per call
  • syscalls dominate CPU time in a simple cache
  • what if we have 100k conns/sec?

SLIDE 24

culprit: CONNECTION STORM

SLIDE 25

“…TWEMCACHE RANDOM HICCUPS, ALWAYS AT THE TOP OF THE HOUR.”

SLIDE 26

DISK

[diagram: the cache worker thread’s logging and a cron job “x” contend for disk I/O]

SLIDE 27

culprit: BLOCKING I/O

SLIDE 28

“WE ARE SEEING SEVERAL ‘BLIPS’ AFTER EACH CACHE REBOOT…”

SLIDE 29

LOCKING FACTS

  • ~25 ns per lock operation
  • more expensive on NUMA
  • much more costly when contended

SLIDE 30

A TIMELINE

MEMCACHE RESTART → … EVERYTHING IS FINE → REQUESTS SUDDENLY GET SLOW/TIMED-OUT → CONNECTION STORM → CLIENTS TOPPLE → SLOWLY RECOVER → (REPEAT A FEW TIMES) → … STABILIZE

lock! lock!

SLIDE 31

culprit: LOCKING

SLIDE 32

“HOSTS WITH LONG-RUNNING CACHES TRIGGER OOM WHEN LOAD SPIKES.”

SLIDE 33

“REDIS INSTANCES WERE KILLED BY THE SCHEDULER.”

SLIDE 34

culprit: MEMORY

SLIDE 35

SUMMARY

CONNECTION STORM
BLOCKING I/O
LOCKING
MEMORY

SLIDE 36

HOW TO MITIGATE?

SLIDE 37

DATA PLANE, CONTROL PLANE

SLIDE 38

HIDE EXPENSIVE OPS

PUT OPERATIONS OF DIFFERING NATURE/PURPOSE ON SEPARATE THREADS

SLIDE 39

SLOW: CONTROL PLANE

  • LISTENING (ADMIN CONNECTIONS)
  • STATS AGGREGATION
  • STATS EXPORTING
  • LOG DUMP

SLIDE 40

FAST: DATA PLANE / REQUEST

tworker: read event → IO (read) → post-read → parse → process → compose → write event → IO (write) → post-write

SLIDE 41

FAST: DATA PLANE / CONNECT

tserver: read event → accept → config → dispatch

tworker: read event → register

SLIDE 42

LATENCY-ORIENTED THREADING

tworker: REQUESTS
tserver: CONNECTS (hands new connections to tworker)
tadmin: OTHER (logging, stats updates from tworker and tserver)

SLIDE 43

WHAT TO AVOID?

SLIDE 44

LOCKING

SLIDE 45

WHAT WE KNOW

  • inter-thread communication in cache: stats, logging, connection hand-off
  • locking propagates blocking/delay between threads

SLIDE 46

LOCKLESS OPERATIONS

MAKE STATS UPDATES LOCKLESS w/ atomic instructions

SLIDE 47

LOCKLESS OPERATIONS

MAKE LOGGING WAITLESS

RING/CYCLIC BUFFER

[diagram: writer advances the write position, reader follows at the read position]

SLIDE 48

LOCKLESS OPERATIONS

MAKE CONNECTION HAND-OFF LOCKLESS

RING ARRAY

[diagram: writer advances the write position, reader follows at the read position over a fixed-size array]

SLIDE 49

MEMORY

SLIDE 50

WHAT WE KNOW

  • alloc/free causes fragmentation
  • internal vs external fragmentation
  • OOM/swapping is deadly
  • memory alloc/copy is relatively expensive

SLIDE 51

PREDICTABLE FOOTPRINT

AVOID EXTERNAL FRAGMENTATION
CAP ALL MEMORY RESOURCES

SLIDE 52

PREDICTABLE RUNTIME

PREALLOCATE
REUSE BUFFERS

SLIDE 53

IMPLEMENTATION:

PELIKAN CACHE

SLIDE 54

WHAT IS PELIKAN CACHE?

  • (Datacenter-) Caching framework
  • A summary of Twitter’s cache ops
  • Perf goal: deterministically fast
  • Clean, modular design
  • Open-source

modules: waitless logging, lockless metrics, composed config, channels, buffers, timer/alarm, pooling, streams, events, data store, parse/compose/trace, data model, request/response, server, orchestration, threading

layers: common, core, cache process

pelikan.io

SLIDE 55

A COMPARISON

PERFORMANCE DESIGN DECISIONS

                              Memcached   Redis          Pelikan
latency-oriented threading    partial     no → partial   yes
memory/fragmentation          internal    external       internal
memory/buffer caching         partial     no             yes
memory/pre-allocation, cap    partial     partial        yes
locking                       yes         no → yes       no

SLIDE 56

TO BE FAIR…

MEMCACHED:
  • multiple worker threads
  • binary protocol + SASL

REDIS:
  • rich set of data structures
  • master-slave replication
  • redis-cluster
  • modules
  • tools
SLIDE 57

THE BEST CACHE IS…

ALWAYS FAST

SLIDE 58

QUESTIONS?