ScyllaDB: Achieving No-Compromise Performance (Avi Kivity, CTO) - PowerPoint presentation


SLIDE 1-2

ScyllaDB: Achieving No-Compromise Performance

Avi Kivity, CTO

@AviKivity

(Hiring!)

SLIDE 3

Agenda

  • Background
  • Goals
  • Methods
  • Conclusion

SLIDE 4

Non-Agenda

  • Docker
  • Microservices
  • Node.js
  • Docker
  • Orchestration
  • JVM GC Tuning
  • JSON over HTTP
  • Docker
SLIDE 5

More Non-Agenda

  • Cache lines, coherency protocols
  • NUMA
  • Algorithms are the only thing that matters, everything else is implementation detail

  • Docker
SLIDE 6

Background - ScyllaDB

  • Clustered NoSQL database compatible with Apache Cassandra
  • ~10X performance on same hardware
  • Low latency, especially at the higher percentiles
  • Self-tuning
  • C++14, fully asynchronous; Seastar!
SLIDE 7

YCSB Benchmark: 3 node Scylla cluster vs 3, 9, 15, 30 Cassandra machines

[Benchmark chart: 3 Scylla nodes vs. 3 and 30 Cassandra nodes]

SLIDE 8

Log-Structured Merge Tree

[Diagram: over time, foreground jobs write SSTables 1-5 while a background job compacts SSTables 1+2+3 into one]

SLIDE 9

High-level Goals

  • Efficiency:

○ Make the most out of every cycle

  • Utilization:

○ Squeeze every cycle from the machine

  • Control:

○ Spend the cycles on what we want, when we want

SLIDE 10

Characterizing the problem

  • Large numbers of small operations

○ Make coordination cheap

  • Lots of communication

○ Within the machine
○ With the disk
○ With other machines

SLIDE 11

Asynchrony, Everywhere

SLIDE 12-14

General Architecture

  • Thread-per-core design

○ Never block

  • Asynchronous networking
  • Asynchronous file I/O
  • Asynchronous multicore

SLIDE 15

Scylla has its own task scheduler

[Diagram: traditional stack vs. Scylla's stack]

Traditional stack: one thread and one stack per concurrent operation, multiplexed onto each CPU.

  • Thread is a function pointer; stack is a byte array from 64k to megabytes
  • Context switch cost is high; large stacks pollute the caches

Scylla's stack: one scheduler per CPU, driving many promise/task pairs.

  • Promise is a pointer to an eventually computed value
  • Task is a pointer to a lambda function
  • No sharing; millions of parallel events

SLIDE 16

The Concurrency Dilemma

SLIDE 17

Fundamental performance equation

Concurrency = Throughput * Latency

SLIDE 18

Fundamental performance equation

Throughput = Concurrency / Latency

SLIDE 19

Fundamental performance equation

Latency = Concurrency / Throughput

SLIDE 20

Lower bounds for concurrency

  • Disks want a minimum iodepth for full throughput (heads/chips)
  • Remote nodes need concurrency to hide network latency and their own minimum concurrency
  • Compute wants work for each core
SLIDE 21

Results of Mathematical Analysis

  • Want high concurrency (for throughput)
  • Want low concurrency (for latency)
  • Resources require concurrency for full utilization

SLIDE 22

Sources of concurrency

  • Users

○ Reduce concurrency / add nodes

  • Internal processes

○ Generate as much concurrency as possible
○ Schedule

SLIDE 23

Resource Scheduling

[Diagram: a scheduler multiplexing storage among queued classes: user read, user write, compaction (internal), streaming (internal); the per-class values shown in the figure are 8, 30, 12, 50, 50]

SLIDE 24

Why not the Linux I/O scheduler?

  • Can only communicate priority by originating thread
  • Will reorder/merge like crazy
  • Disable
SLIDE 25

Figuring out optimal disk concurrency

[Chart: disk throughput vs. concurrency, flattening at the maximum useful disk concurrency]

SLIDE 26

Cache design

Cache files or objects?

SLIDE 27

Using the kernel page cache

Cons:

  • 4k granularity
  • Thread-safe
  • Synchronous APIs
  • General-purpose
  • Lack of control (1)
  • Lack of control (2)

Pros:

  • Exists
  • Hundreds of hacker-years
  • Handles lots of edge cases

SLIDE 28

Unified cache

Cassandra: key cache and row cache (on-heap / off-heap) on top of the Linux page cache over SSTables. Costs: tuning, parasitic rows, page faults.

Scylla: a single unified cache over SSTables.

[Diagram: the page fault path. The app thread touches a cold page; the kernel suspends the thread, initiates I/O, and context-switches away. When the I/O completes, an interrupt fires, the kernel context-switches back, maps the page, and resumes the thread. All that to fetch a 4k SSTable page when your data is ~300 bytes.]

SLIDE 29

Workload Conditioning

SLIDE 30

Workload Conditioning

  • Internal feedback loops to balance competing loads

[Diagram: feedback loops around the Seastar scheduler. A compaction backlog monitor and a memory monitor adjust the priorities of memtable flush, compaction, query, repair, and commitlog work as they compete for CPU, SSD, and WAN.]

SLIDE 31

Replacing the system memory allocator

SLIDE 32

System memory allocator problems

  • Thread-safe (synchronization cost a thread-per-core design doesn't need)
  • No allocation back pressure
SLIDE 33

Seastar memory allocator

  • Not thread-safe!

○ Each core gets a private memory pool

  • Allocation back pressure

○ Allocator calls a callback when low on memory
○ Scylla evicts cache in response

SLIDE 34

One allocator is not enough

SLIDE 35

Remaining problems with malloc/free

  • Memory gets fragmented over time

○ If workload changes sizes of allocated objects

  • Allocating a large contiguous block requires evicting most of the cache

SLIDE 36

OOM :(

[Diagram: memory fragmented over time; total free memory is sufficient, but no contiguous block is large enough, so a big allocation fails]

SLIDE 37

Log-structured memory allocation

  • The cache

○ Large majority of memory allocated
○ Small subset of allocation sites

  • Teach the allocator how to move allocated objects around

○ Updating references

SLIDE 38

Log-structured memory allocation

[Fancy animation of the compaction process]

SLIDE 39

Future Improvements

SLIDE 40

Userspace TCP/IP stack

  • Thread-per-core design
  • Use DPDK to drive hardware
  • Present as experimental mode

○ Needs more testing and productization

SLIDE 41

Query Compilation to Native Code

  • Use LLVM to JIT-compile CQL queries
  • Embed the database schema and internal object layouts into the query
SLIDE 42

Conclusions

  • Full control of the software stack can generate big payoffs
  • Careful system design can maximize throughput

○ Without sacrificing latency
○ Without requiring endless end-user tuning
○ While having a lot of fun

SLIDE 43

How to interact

  • Download: http://www.scylladb.com
  • Twitter: @ScyllaDB
  • Source: http://github.com/scylladb/scylla
  • Mailing lists: scylladb-user @ groups.google.com
  • Company site & blog: http://www.scylladb.com

SLIDE 44

THE SCYLLA IS THE LIMIT

Thank you.