SLIDE 1

Bullet Cache

Balancing speed and usability in a cache server

Ivan Voras <ivoras@freebsd.org>

SLIDE 2

What is it?

  • People know what memcached is... mostly
  • Example use case:

– So you have a web page which is just dynamic enough that you can't cache it completely as an HTML dump

– You have a SQL query on this page which is 99.99% always the same (same query, same answer)

– ...so you cache it

SLIDE 3

Why a cache server?

  • Sharing between processes

– … on different servers

  • In environments which do not implement application persistence

– CGI, FastCGI
– PHP

  • Or you're simply lazy and want something which works...

SLIDE 4

A little bit of history...

  • This started as my “pet project”...

– It's so ancient that when I first started working on it, Memcached was still single-threaded

– It's gone through at least one rewrite and a whole change of concept

  • I made it because of the frustration I felt while working with Memcached

– Key-value databases are so very basic – “I could do better than that” :)

SLIDE 5

Now...

  • Used in production in my university's project

  • Probably the fastest memory cache engine available (in specific circumstances)

  • Available in FreeBSD ports (databases/mdcached)

  • Has a 20-page User Manual :)
SLIDE 6

What's wrong with memcached?

  • Nothing much – it's solid work
  • The classic problem: cache expiry / invalidation

– memcached accepts a list of records to expire (inefficient, need to maintain this list)

  • It's fast – but is it fast enough?

– Does it really make use of multiple CPUs as efficiently as possible?

SLIDE 7

Introducing the Bullet Cache

  • 1. Offers a smarter data structure to the user side than a simple key-value pair

  • 2. Implements “interesting” internal data structures

  • 3. Some interesting bells & whistles
SLIDE 8

User-visible structure

  • Traditional (memcached) style:

– Key-value pairs
– Relatively short keys (255 bytes)
– ASCII-only keys (?)
– (ASCII-only protocol)
– Multi-record operations only with a list of records
– Simple atomic operations (relatively inefficient: atoi())

SLIDE 9

Introducing record tags

  • They are metadata
  • Constraints:

– Must be fast (they are NOT db indexes)
– Must allow certain types of bulk operations

  • The implementation:

– Both key and value are signed integers
– No limit on the number of tags per record
– Bulk queries: (tkey X) && (tval1, [tval2...])

SLIDE 10

Record tags

  • I heard you like key-value records, so I've put key-value records into your key-value records...

[Diagram: a record consists of a key, a value, and a list of (k, v) pairs as generic metadata]
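A sketch of that record layout in C; the struct and field names here are illustrative assumptions, not the actual mdcached definitions:

    /* A sketch of the record layout described above; the names are
     * illustrative assumptions, not the actual mdcached definitions. */
    #include <stddef.h>
    #include <stdint.h>

    struct tag {
        int32_t tkey;        /* tag key: a signed integer */
        int32_t tval;        /* tag value: a signed integer */
    };

    struct record {
        char       *key;     /* record key */
        size_t      key_len;
        void       *value;   /* record value (opaque bytes) */
        size_t      value_len;
        struct tag *tags;    /* any number of (tkey, tval) pairs */
        size_t      ntags;   /* no fixed limit per record */
    };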

SLIDE 11

Metadata queries (1)

  • Use case example: a web application has a page “/contacts” which contains data from several SQL queries as well as other sources (LDAP) – see the sketch below

– Tag all cached records with (tkey, tval) = (42, hash(“/contacts”))
– When constructing the page, issue the query: get_by_tag_values(42, hash(“/contacts”))
– When expiring all data, issue the query: del_by_tag_values(42, hash(“/contacts”))
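A hedged sketch of this workflow in C. get_by_tag_values and del_by_tag_values are the operations named above, but every signature here, plus cache_put() and page_hash(), is an illustrative assumption rather than the real client API:

    /* Illustrative sketch of the “/contacts” tagging workflow;
     * all signatures and helpers are assumptions, not the real API. */
    #include <stddef.h>
    #include <stdint.h>

    #define TAG_PAGE 42   /* application-chosen tag key for “page” tags */

    extern int32_t page_hash(const char *url);
    extern int cache_put(const char *key, const void *val, size_t len,
                         int32_t tkey, int32_t tval);
    extern int get_by_tag_values(int32_t tkey, int32_t tval);
    extern int del_by_tag_values(int32_t tkey, int32_t tval);

    void
    cache_contacts_data(const void *sql_result, size_t len)
    {
        /* Tag the cached record so it can be found/expired in bulk. */
        cache_put("contacts:sql:users", sql_result, len,
                  TAG_PAGE, page_hash("/contacts"));
    }

    void
    build_contacts_page(void)
    {
        /* Fetch every record tagged for this page in one bulk query. */
        get_by_tag_values(TAG_PAGE, page_hash("/contacts"));
    }

    void
    expire_contacts_page(void)
    {
        /* Invalidate all of the page's records at once. */
        del_by_tag_values(TAG_PAGE, page_hash("/contacts"));
    }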

SLIDE 12

Metadata queries (2)

  • Use case example: application objects are stored (serialized, marshalled) into the cache, and there's a need to invalidate (expire) all objects of a certain type

– Tag records with (tkey, tval) = (object_type, instance_id)
– Expire with del_by_tag_values(object_type, instance_id)
– Also possible: tagging object interdependence

SLIDE 13

Under the hood

  • It's “interesting”...
  • Started as a C project, now mostly converted to C++ for easier modularization

– Still uses C-style structures and algorithms for the core parts, i.e. not std::containers

  • Contains tests and benchmarks within the main code base

– C and PHP client libraries

SLIDE 14

The main data structure

  • A “forest of trees”, anchored in hash table buckets (see the sketch below)
  • Buckets are directly addressed by hashing record keys
  • Buckets are protected by rwlocks

[Diagram: a hash table whose buckets (hash values H1, H2, ...) each hold an RW lock and the root of a red-black tree of record nodes]
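A minimal sketch of this structure using FreeBSD's <sys/tree.h> red-black tree macros and POSIX rwlocks; the names, and the use of sys/tree.h itself, are illustrative assumptions rather than the actual mdcached code:

    /* Sketch of the bucket layout: each bucket pairs an rwlock with
     * the root of a red-black tree of records. Illustrative only. */
    #include <sys/tree.h>
    #include <pthread.h>
    #include <string.h>

    struct record {
        RB_ENTRY(record) link;    /* red-black tree linkage */
        char *key;                /* record key (tree index) */
        void *value;
    };

    static int
    record_cmp(struct record *a, struct record *b)
    {
        return (strcmp(a->key, b->key));
    }

    RB_HEAD(record_tree, record);
    RB_GENERATE_STATIC(record_tree, record, link, record_cmp)

    struct bucket {
        pthread_rwlock_t   lock;  /* rwlock protecting this bucket */
        struct record_tree root;  /* this bucket's tree of records */
    };

    #define N_BUCKETS 256         /* default per the slides */
    static struct bucket buckets[N_BUCKETS];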

SLIDE 15

Basic operation

1. Find h = Hash(key)
2. Acquire lock #h
3. Find the record in the RB tree indexed by key
4. Perform the operation
5. Release lock #h (sketched in C below)

[Diagram: the hash table structure from the previous slide, annotated with steps 1–5 of the operation]
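The five steps as code, continuing the bucket/record declarations from the previous slide's sketch; hash_key() is an assumed string hash function:

    /* Steps 1–5 of the basic operation, as a read-path lookup. */
    extern unsigned int hash_key(const char *key);

    int
    cache_get(const char *key, void (*copy_out)(const struct record *))
    {
        struct record needle = { .key = (char *)key };
        unsigned int h = hash_key(key) % N_BUCKETS;  /* 1. h = Hash(key)      */
        struct bucket *b = &buckets[h];
        struct record *r;

        pthread_rwlock_rdlock(&b->lock);             /* 2. acquire lock #h    */
        r = RB_FIND(record_tree, &b->root, &needle); /* 3. find in RB tree    */
        if (r != NULL)
            copy_out(r);                             /* 4. operate under lock */
        pthread_rwlock_unlock(&b->lock);             /* 5. release lock #h    */
        return (r != NULL);
    }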

SLIDE 16

Record tags follow a similar pattern

  • The tags index the main structure and are maintained (almost) independently

[Diagram: a separate hash table of tag buckets (hash values H1, H2, ...), each with an RW lock and tag keys TK1–TK4]

SLIDE 17

Concurrency and locking

  • Concurrency is great – the default configuration uses 256 record buckets and 64 tag buckets
  • Locking is done without lock-ordering assumptions (see the sketch below):

– *_trylock() for everything
– rollback-and-retry
– No deadlocks

  • Livelocks, on the other hand, need to be investigated
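A minimal sketch of the trylock / rollback-and-retry pattern: to take two bucket locks without a global lock order, block on the first, trylock the second, and on failure drop everything and retry. Illustrative, not the actual mdcached code:

    /* Rollback-and-retry acquisition of two rwlocks, deadlock-free
     * without any lock ordering. Illustrative sketch. */
    #include <pthread.h>
    #include <sched.h>

    void
    lock_two_buckets(pthread_rwlock_t *a, pthread_rwlock_t *b)
    {
        for (;;) {
            pthread_rwlock_wrlock(a);          /* wait on the first lock */
            if (pthread_rwlock_trywrlock(b) == 0)
                return;                        /* got both: proceed */
            /* Contended: roll back instead of waiting while holding
             * a lock (waiting here is what creates deadlocks). */
            pthread_rwlock_unlock(a);
            sched_yield();                     /* let the other thread finish */
        }
    }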

SLIDE 18

Two-way linking between records and tags

[Diagram: the record hash table (buckets with RW locks and red-black tree roots) and the tag hash table (buckets with RW locks and tag keys TK1–TK4), linked in both directions between records and their tags]

SLIDE 19

Concurrency

  • Scenario 1:

– A record is referenced → need to hold N tag bucket locks

  • Scenario 2:

– A tag is referenced → need to hold M record bucket locks

[Chart: percentage of uncontested lock acquisitions (U) vs number of hash table buckets (H, 4–256), for NCPU=64 and NCPU=8, with shared and exclusive locks]

SLIDE 20

Multithreading models

  • Aka “which thread does what”
  • Three basic tasks:

– T1: Connection acceptance
– T2: Network IO
– T3: Payload work

  • The big question: how to distribute these into threads?

SLIDE 21

Multithreading models

  • SPED: Single-process, event-driven
  • SEDA: Staged, event-driven architecture
  • AMPED: Asymmetric, multi-process, event-driven
  • SYMPED: Symmetric, multi-process, event-driven

Model    New connection handler   Network IO handler     Payload work
SPED     1 thread                 In connection thread   In connection thread
SEDA     1 thread                 N1 threads             N2 threads
SEDA-S   1 thread                 N threads              N threads
AMPED    1 thread                 1 thread               N threads
SYMPED   1 thread                 N threads              In network thread

SLIDE 22

All the models are event-driven

  • The “dumb” model: thread-per-connection
  • Not really efficient

– (FreeBSD has experimented with KSE and M:N threading, but that didn't work out)

  • IO events: via kqueue(2)
  • Inter-thread synchronization: queues signalled with CVs (condition variables) – see the sketch below
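A sketch of such a CV-signalled queue, used to hand decoded request messages from network threads to payload threads; the structure is illustrative, not the actual mdcached code:

    /* Condition-variable-signalled work queue for inter-thread hand-off. */
    #include <pthread.h>
    #include <stddef.h>

    struct msg {
        struct msg *next;
        /* ... decoded request ... */
    };

    struct msg_queue {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        struct msg     *head, *tail;
    };

    void
    mq_push(struct msg_queue *q, struct msg *m)
    {
        pthread_mutex_lock(&q->mtx);
        m->next = NULL;
        if (q->tail != NULL)
            q->tail->next = m;
        else
            q->head = m;
        q->tail = m;
        pthread_cond_signal(&q->cv);     /* wake one payload thread */
        pthread_mutex_unlock(&q->mtx);
    }

    struct msg *
    mq_pop(struct msg_queue *q)
    {
        struct msg *m;

        pthread_mutex_lock(&q->mtx);
        while (q->head == NULL)
            pthread_cond_wait(&q->cv, &q->mtx);   /* sleep until signalled */
        m = q->head;
        q->head = m->next;
        if (q->head == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->mtx);
        return (m);
    }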

SLIDE 23

SPED

  • Single-threaded, event-driven
  • Very efficient on single-CPU systems
  • Most efficient if the operation is very fast (compared to network IO and event handling)
  • Used in efficient Unix network servers
SLIDE 24

SEDA

  • Staged, event-driven
  • Different task threads instantiated in different numbers
  • Generally, N1 != N2 != N3
  • The most queueing
  • The most separation of tasks – most CPUs used

[Diagram: one T1 thread feeding several T2 threads, which feed several T3 threads]

SLIDE 25

AMPED

  • Asymmetric, multi-process, event-driven
  • Asymmetric: N(T2) != N(T3)
  • Assumes network IO processing is cheap compared to operation processing
  • Moderate amount of queueing
  • Can use an arbitrary number of CPUs

[Diagram: one T1 thread, one T2 thread and several T3 threads]

SLIDE 26

SYMPED

  • Symmetric, multi-process, event-driven
  • Symmetric: grouping of tasks
  • Assumes network IO and operation processing are similarly expensive but uniform
  • Sequential processing inside threads
  • Similar to multiple instances of SPED

[Diagram: one T1 thread and several combined T2+T3 threads]

SLIDE 27

Multithreading models in Bullet Cache

  • Command-line configuration:

– n: number of network threads
– t: number of payload threads

  • n=0, t=0 : SPED
  • n=1, t>0 : AMPED
  • n>0, t=0 : SYMPED
  • n>1, t>0 : SEDA
  • n>1, t>1, n=t : SEDA-S (symmetrical)
SLIDE 28

How does that work?

  • SEDA: the same network loop accepts connections and handles network IO
  • Others: the network IO threads accept messages, then either:

– process them in-thread, or
– queue them on worker thread queues

  • Response messages are either sent in-thread from whichever thread generates them, or finished with the IO event code

SLIDE 29

Performance of various models

[Chart: thousands of transactions/s vs number of clients, comparing SPED, SEDA, SEDA-S, AMPED and SYMPED]

Except in special circumstances, SYMPED is best

SLIDE 30

Why is SYMPED efficient?

  • The same thread receives the message and processes it
  • No queueing

– No context switching
– In the optimal case: no (b)locking delays of any kind

  • Downsides:

– Serializes network IO and processing within the thread (which is OK if threads are per-CPU)

SLIDE 31

Notable performance optimizations

  • “zero-copy” operation

– Queries which do not involve complex processing or record aggregation are satisfied directly from the data structures

  • “zero-malloc” operation

– The code re-uses memory buffers as much as possible; the fast path is completely malloc()- and memcpy()-free

  • Adaptive dynamic buffer sizes (see the sketch below)

– malloc() usage is tracked to avoid realloc()
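A hedged sketch of what “tracking malloc() usage to avoid realloc()” can look like: a per-connection buffer that remembers its high-water mark and grows with headroom, so the steady-state fast path allocates nothing. Names and growth policy are illustrative:

    /* Adaptive buffer sizing sketch: grow past the observed high-water
     * mark so the fast path calls neither malloc() nor realloc(). */
    #include <stdlib.h>

    struct conn_buf {
        char  *buf;
        size_t cap;         /* current allocation size */
        size_t high_water;  /* largest size requested so far */
    };

    char *
    conn_buf_reserve(struct conn_buf *cb, size_t need)
    {
        if (need > cb->high_water)
            cb->high_water = need;
        if (cb->cap < need) {
            /* Grow with headroom so repeated slightly larger requests
             * don't cause repeated realloc() calls. */
            size_t ncap = cb->high_water + cb->high_water / 2;
            char *nbuf = realloc(cb->buf, ncap);
            if (nbuf == NULL)
                return (NULL);
            cb->buf = nbuf;
            cb->cap = ncap;
        }
        return (cb->buf);    /* usually reused: the zero-malloc fast path */
    }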

SLIDE 32

State of the art

[Chart: performance (200,000–2,000,000 TPS) vs average record data size (92–3723 bytes) for System A (June 2010), System B (Jan 2011 and Jun 2011), System C (Dec 2011) and System D (Mar 2012)]

SLIDE 33

State of the art

[Chart: performance (400,000–2,000,000 TPS) on a Xeon 5675, 6-core, 3 GHz, with and without HTT; FreeBSD 9-STABLE, March 2012]

SLIDE 34

… under certain conditions

  • The optimal fast path (zero-memcpy, zero-malloc, optimal buffers)

– Which is actually less important; we know these algorithms are fast...

  • Using Unix domain sockets

– Much more important
– FreeBSD's network stack (the TCP path) is currently basically not scalable on SMP
– The UDP path is more scalable... WIP

SLIDE 35

TCP vs Unix sockets

[Chart: performance (0–1,000,000 TPS) vs average record size (104–4204 bytes) on System D, comparing mdcached, memcached and redis]

SLIDE 36

NUMA Effects

[Chart: performance up to ~2,000,000 TPS for System D under three configurations: cpuset(4)-bound to a single socket, client and server cpuset(4)-bound to separate sockets, and no cpuset(4); plus a run with the ULE scheduler only]

It's unlikely that better NUMA support would help at all...

SLIDE 37

Scalability wrt number of records

[Chart: mdcached performance (300,000–440,000 TPS) vs number of records (1,000 to 10,000,000)]

SLIDE 38

Bells & whistles

  • Binary protocol (endian-dependent)
  • Extensive atomic operation set (sketched after this list)

– cmpset, add, fetchadd, readandclear

  • “tstack” operations

– Every tag (tk, tv) identifies a stack
– Push and pop operations on records

  • Periodic data dumps / checkpoints

– Cache pre-warm (load from file)
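A sketch of how the integer atomic operations can be served: the payload thread already holds the record's bucket lock exclusively, so each op is a plain read-modify-write on a natively stored integer, with no memcached-style atoi() on ASCII values. Names and layout are illustrative, not mdcached's actual code:

    /* Integer atomic ops under the bucket lock; illustrative sketch. */
    #include <stdbool.h>
    #include <stdint.h>

    struct int_record {
        int64_t value;       /* value stored natively as an integer */
    };

    /* cmpset: set to 'newval' only if the current value == 'expect'. */
    bool
    rec_cmpset(struct int_record *r, int64_t expect, int64_t newval)
    {
        /* Caller holds the record's bucket lock exclusively, so this
         * read-compare-write is atomic w.r.t. other clients. */
        if (r->value != expect)
            return (false);
        r->value = newval;
        return (true);
    }

    /* fetchadd: add 'delta', return the previous value. */
    int64_t
    rec_fetchadd(struct int_record *r, int64_t delta)
    {
        int64_t old = r->value;
        r->value = old + delta;
        return (old);
    }

    /* readandclear: return the value and reset it to zero. */
    int64_t
    rec_readandclear(struct int_record *r)
    {
        int64_t old = r->value;
        r->value = 0;
        return (old);
    }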

SLIDE 39

Usage ideas

  • Application data cache, database cache

– Semantically tag cached records
– Efficient retrieval and expiry (deletion)

  • Primary data storage

– High-performance ephemeral storage
– Optional periodic checkpoints

  • Data sharing between app server nodes
  • Esoteric: distributed lock manager, stack
SLIDE 40

Bullet Cache

Balancing speed and usability in a cache server

http://www.sf.net/projects/mdcached

Ivan Voras <ivoras@freebsd.org>