SLIDE 1

Cloudius Systems presents:

Writing a Modern Highly Scalable Application

Where Linux Helps You, Where Linux Stands in Your Way

@glcst - LinuxCon 2016

SLIDE 2

Part 1: The application
Part 2: The framework

SLIDE 3

Part 1: The application

The basics:

  • Scylla is a datastore.
  • Scylla is a NoSQL datastore.
  • Scylla is a highly available, eventually consistent datastore.
  • Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra.

SLIDE 4

Some examples of datastores

  • SQL: structured, no scale
  • Document store: no structure, some scale
  • Column store: some structure, scale out, awesome HA/DR
  • Key-value: simple, scale, not a real DB

SLIDE 5

Part 1: The application

The basics:

  • Scylla is a datastore.
  • Scylla is a NoSQL datastore.
  • Scylla is a highly available, eventually consistent datastore.
  • Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra.
  • Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra, but with 10x its throughput.

SLIDE 6

Where you had consistency/durability:

  • User-defined replication factor (RF) and consistency level (CL).
  • Write behavior is determined by RF: durable for fewer than RF failures.
  • Read behavior is determined by CL: consistent for CL >= RF/2 + 1.
  • Availability increases as RF increases and CL decreases.
  • Tunable consistency: meet the needs of the application.
  • Tables where eventual consistency can be tolerated use high RF, low CL.
  • Tables with data that must remain in sync use high CL.
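(A worked example, using the standard quorum arithmetic rather than anything spelled out on the slide: with RF = 3, RF/2 + 1 = 2 in integer division, so reading and writing at CL = 2 guarantees each read quorum overlaps the latest write quorum; dropping reads to CL = 1 improves availability and latency at the risk of stale reads.)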
SLIDE 7

Where you had a “primary key”:

  • 2 components: a partition key, and an optional clustering key.
  • The partition key determines which replicas own a row; the clustering key determines the sort order of rows within a partition (e.g., in a table with PRIMARY KEY ((user_id), event_time), user_id is the partition key and event_time the clustering key).

https://jslvtr.gitbooks.io/big-data-analysis/

SLIDE 8

YCSB Benchmark: a 3-node Scylla cluster vs. 3-, 9-, 15-, and 30-node Cassandra clusters

(throughput chart)

SLIDE 9

YCSB Benchmark:

SLIDE 10

How do we get 10x throughput?

  • "Just rewrite it in C++" can't make it 10x faster.
  • True, but it allows us to (easily) do the things that can:
  • Control how we use memory
  • Per-core memory allocation
  • No garbage collection -> no (unpredictable) pauses
  • Proximity to the hardware
  • Examples: a userspace disk scheduler and a userspace network stack
SLIDE 11

Part 2: The framework

  • Seastar is a highly scalable thread-per-core framework for I/O-intensive applications.
  • It turns out a datastore is a good example of an I/O-intensive application.
  • Cost of a context switch: ~1 µs (Paul Turner, LPC 2013): "Majority of the context-switching cost [is] attributable to the complexity of the scheduling decision by a modern SMP CPU scheduler."
  • For a 100 ms CPU hog: 0.001%
  • For a 1 ms HDD latency (not counting seek): 0.1%
  • For a single NVMe request (Samsung SM951-NVMe M.2: avg. lat. = 22 µs): ~5%
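(The percentages are simply the ~1 µs switch divided by the operation time: 1 µs / 100 ms = 0.001%, 1 µs / 1 ms = 0.1%, and 1 µs / 22 µs ≈ 4.5%, i.e. roughly 5%.)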
SLIDE 12

SCYLLA AND SEASTAR ARE DIFFERENT

  ❏ DMA
  ❏ Log-structured merge tree
  ❏ DB-aware cache
  ❏ Userspace I/O scheduler
  ❏ NUMA friendly
  ❏ Log-structured allocator
  ❏ Zero copy
  ❏ Thread per core
  ❏ Lock-free
  ❏ Task scheduler
  ❏ Reactor programming
  ❏ Multi queue
  ❏ Poll mode
  ❏ Userspace TCP/IP

SLIDE 13

SCYLLA DB: NETWORK COMPARISON

  • KVM was invented by Avi in 2006; development was managed by Dor.
  • It was a new hypervisor, arriving after VMware and Xen had dominated the market.
  • Through smart design choices and by leveraging Linux and the hardware, it became the best-performing hypervisor:
  • KVM holds the SPECvirt performance record
  • KVM holds the max IOPS record
  • The Open Virtualization Alliance includes hundreds of companies, including HP, IBM, Intel, AMD, Red Hat, etc.
  • KVM is the engine behind many clouds, such as OpenStack, IBM, NTT, Fujitsu, HP, Google, DigitalOcean, etc.

Traditional stack (Cassandra):

  • The kernel owns TCP/IP, the scheduler, and the NIC queues; threads share the scheduler queues.
  • Memory: lock contention, cache contention, NUMA unfriendly.

Seastar's sharded stack (one shard per core):

  • Each core runs its own userspace TCP/IP stack and task scheduler; cores communicate through explicit smp queues; each core owns a NIC queue via DPDK; the kernel isn't involved.
  • No contention, linear scaling, NUMA friendly.

SLIDE 14

Seastar Programming model


return open_file_dma(name, flags).then([] (file f) {
    return f.dma_read(pos, buf, size);        // yields a future with the bytes read
}).then([] (size_t n) {
    /* do something else */
}).handle_exception([] (std::exception_ptr e) {
    /* handle an exception */
});
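The same model as a minimal, self-contained program (my sketch, not from the deck; it assumes current Seastar header paths and the app_template entry point):

#include <seastar/core/app-template.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    // run() starts the reactor; the program exits when the
    // future returned by the lambda resolves.
    return app.run(argc, argv, [] {
        return seastar::sleep(std::chrono::seconds(1)).then([] {
            std::cout << "timer fired; nothing ever blocked\n";
        });
    });
}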

SLIDE 15

Seastar has its own task scheduler

Traditional stack:

  • A thread is a function pointer; a stack is a byte array from 64k to megabytes.
  • Each CPU runs a scheduler that multiplexes thread/stack pairs.
  • Context switch cost is high; large stacks pollute the caches.

Scylla's stack:

  • A promise is a pointer to an eventually computed value; a task is a pointer to a lambda function.
  • Each CPU runs many promise/task pairs on Seastar's own scheduler.
  • No sharing; millions of parallel events.
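In code, the promise/task pairing looks roughly like this (a sketch against Seastar's public promise/future API; in real code this runs inside the reactor):

// #include <seastar/core/future.hh>
seastar::promise<int> p;
// then() queues a task (the lambda) to run once the value exists:
seastar::future<int> done = p.get_future().then([] (int v) { return v * 2; });
p.set_value(21);   // makes the future ready; the task runs on this core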

SLIDE 16

Seastar minimizes cross CPU access

  ❏ A task is always scheduled on the CPU where it originated
  ❏ Local memory allocation, local memory freeing

SLIDE 17
Seastar minimizes cross CPU access

  • A task is always scheduled on the CPU where it originated
  • Local memory allocation, local memory freeing
  • Cross-CPU communication can happen, but is explicit (see the sketch below):
  • submit_to()
  • map_reduce()
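A rough sketch of what "explicit" means here, assuming Seastar's smp::submit_to(); the shard number, function name, and return value are illustrative:

// #include <seastar/core/smp.hh>
// Run a lambda on another shard and get its result back as a future.
seastar::future<int> read_counter_on(unsigned shard) {
    return seastar::smp::submit_to(shard, [] {
        // Executes on the target shard; touches only that shard's memory.
        return 42;   // placeholder for shard-local state
    });
}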

SLIDE 18
Linux page cache

  • Modern NoSQL databases trust it too much.
  • Both MongoDB and Cassandra just rely on the Linux page cache.
  • Wrong granularity, false sharing, unpredictable latencies.
  • Example: 1k rows per page; 3 hot rows, but also the coldest row. Which to evict?

SLIDE 19
Linux filesystems: our greatest enemies

  • Asynchronous I/O is not really asynchronous.
  • "It's OK, if it blocks something else runs instead": except there is no something else.
  • "Thread per core" really becomes "two threads per core".
  • XFS blocks under heavy load; otherwise it's OK.

SLIDE 20

I/O Scheduling

Queries, the commitlog, and compaction each feed their own queue; a userspace I/O scheduler arbitrates among the queues in front of the disk.

SLIDE 21

I/O Scheduling

ext4, 4.3.3:

# ./fsqual
context switch per appending io: 1 (BAD)

XFS, 3.15:

# ./fsqual
context switch per appending io: 0 (GOOD)
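The idea behind an fsqual-style probe, sketched below (my reconstruction of the technique, not fsqual's actual source): issue appending O_DIRECT AIO writes and count voluntary context switches per io_submit(); anything above zero means the "asynchronous" submission blocked.

// Build: g++ -O2 fsqual-sketch.cc -laio   (error handling omitted)
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>
#include <cstdio>
#include <cstdlib>

int main() {
    io_context_t ctx = 0;
    io_setup(128, &ctx);
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    void* buf;
    posix_memalign(&buf, 4096, 4096);        // O_DIRECT needs aligned buffers

    rusage before{}, after{};
    getrusage(RUSAGE_SELF, &before);
    const long n = 1000;
    for (long i = 0; i < n; i++) {
        iocb cb;
        iocb* cbp = &cb;
        io_prep_pwrite(&cb, fd, buf, 4096, i * 4096);  // size-changing (appending) write
        io_submit(ctx, 1, &cbp);             // should return without blocking
        io_event ev;
        io_getevents(ctx, 1, 1, &ev, nullptr);
    }
    getrusage(RUSAGE_SELF, &after);
    printf("context switch per appending io: %ld\n",
           (after.ru_nvcsw - before.ru_nvcsw) / n);
    close(fd);
}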

SLIDE 22

I/O Scheduling

SLIDE 23

I/O Scheduling

Increased latency for no gain: better avoid it. XFS screams.

SLIDE 24

I/O Scheduling

Shares distribution     C1       C2       C3       C4
10, 10, 10, 10          137506   137501   137501   137501
100, 100, 100, 100      137504   137499   137499   137499
10, 20, 40, 80          37333    73732    146566   292375
100, 10, 10, 10         421211   42922    42922    42922

Throughput in KB/s for 4 classes disputing the same I/O queue, with various share distributions, on a single core; a 550 MB/s SSD fully saturated. From ScyllaDB's blog: http://www.scylladb.com/2016/04/29/io-scheduler-2/
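(A quick check of the numbers: with equal shares the four classes split the disk evenly, and with shares 10:20:40:80 the measured throughput ratio is 37333 : 73732 : 146566 : 292375 ≈ 1 : 1.97 : 3.93 : 7.83, tracking the 1 : 2 : 4 : 8 share ratio almost exactly.)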

SLIDE 25
SLIDE 26

How to interact

+ Download: http://www.scylladb.com
+ Twitter: @ScyllaDB
+ Source: http://github.com/scylladb/scylla
+ Mailing lists: scylladb-user @ groups.google.com
+ Company site & blog: http://www.scylladb.com/

SLIDE 27

SCYLLA, NoSQL GOES NATIVE. Thank you.