SLIDE 1

Cloudius Systems presents:

Writing a Modern Highly Scalable Application

Where Linux Helps You, Where Linux Stands in Your Way

@glcst - LinuxCon 2016

SLIDE 2

Part 1: The application
Part 2: The framework

SLIDE 3

Part 1: The application

The basics:

  • Scylla is a datastore.
  • Scylla is a NoSQL datastore.
  • Scylla is a highly available, eventually consistent datastore.
  • Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra.

SLIDE 4

Some examples of datastores

  • SQL: structured, no scale
  • Document store: no structure, some scale
  • Column store: some structure, scale out, awesome HA/DR
  • Key-value: simple, scale, not a real DB

SLIDE 5

Part 1: The application

The basics:

  • Scylla is a datastore.
  • Scylla is a NoSQL datastore.
  • Scylla is a highly available, eventually consistent datastore.
  • Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra.
  • Scylla is a highly available, eventually consistent datastore, compatible with Apache Cassandra, but with 10x its throughput.

SLIDE 6

Where you had consistency/durability:

  • User-defined replication factor (RF) and consistency level (CL).
  • Write behavior is determined by RF: durable for fewer than RF failures.
  • Read behavior is determined by CL: consistent for CL >= RF/2 + 1.
  • Availability increases as RF increases and CL decreases.
  • Tunable consistency: meet the needs of the application.
  • Tables where eventual consistency can be tolerated use high RF, low CL.
  • Tables with data that must remain in sync use high CL.
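(A worked example, using the standard quorum arithmetic rather than anything spelled out on the slide: with RF = 3, RF/2 + 1 = 2 in integer division, so reading and writing at CL = 2 guarantees each read quorum overlaps the latest write quorum; dropping reads to CL = 1 improves availability and latency at the risk of stale reads.)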
SLIDE 7

Where you had a “primary key”:

  • 2 components: a partition key, and an optional clustering key.
  • The partition key determines which replicas own a row; the clustering key determines the sort order of rows within a partition (e.g., in a table with PRIMARY KEY ((user_id), event_time), user_id is the partition key and event_time the clustering key).

https://jslvtr.gitbooks.io/big-data-analysis/

SLIDE 8

YCSB Benchmark: a 3-node Scylla cluster vs. 3-, 9-, 15-, and 30-node Cassandra clusters

(throughput chart)

SLIDE 9

YCSB Benchmark:

SLIDE 10

How do we get 10x throughput?

  • "Just rewrite it in C++" can't make it 10x faster.
  • True, but it allows us to (easily) do the things that can:
  • Control how we use memory
  • Per-core memory allocation
  • No garbage collection -> no (unpredictable) pauses
  • Proximity to the hardware
  • Examples: a userspace disk scheduler and a userspace network stack
SLIDE 11

Part 2: The framework

  • Seastar is a highly scalable thread-per-core framework for I/O-intensive applications.
  • It turns out a datastore is a good example of an I/O-intensive application.
  • Cost of a context switch: ~1 µs (Paul Turner, LPC 2013): "Majority of the context-switching cost [is] attributable to the complexity of the scheduling decision by a modern SMP CPU scheduler."
  • For a 100 ms CPU hog: 0.001%
  • For a 1 ms HDD latency (not counting seek): 0.1%
  • For a single NVMe request (Samsung SM951-NVMe M.2: avg. lat. = 22 µs): ~5%
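(The percentages are simply the ~1 µs switch divided by the operation time: 1 µs / 100 ms = 0.001%, 1 µs / 1 ms = 0.1%, and 1 µs / 22 µs ≈ 4.5%, i.e. roughly 5%.)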
SLIDE 12

SCYLLA AND SEASTAR ARE DIFFERENT

  ❏ DMA
  ❏ Log-structured merge tree
  ❏ DB-aware cache
  ❏ Userspace I/O scheduler
  ❏ NUMA friendly
  ❏ Log-structured allocator
  ❏ Zero copy
  ❏ Thread per core
  ❏ Lock-free
  ❏ Task scheduler
  ❏ Reactor programming
  ❏ Multi queue
  ❏ Poll mode
  ❏ Userspace TCP/IP

SLIDE 13

SCYLLA DB: NETWORK COMPARISON

  • KVM was invented by Avi in 2006; development was managed by Dor.
  • It was a new hypervisor, arriving after VMware and Xen had dominated the market.
  • Through smart design choices and by leveraging Linux and the hardware, it became the best-performing hypervisor:
  • KVM holds the SPECvirt performance record
  • KVM holds the max IOPS record
  • The Open Virtualization Alliance includes hundreds of companies, including HP, IBM, Intel, AMD, Red Hat, etc.
  • KVM is the engine behind many clouds, such as OpenStack, IBM, NTT, Fujitsu, HP, Google, DigitalOcean, etc.

Traditional stack (Cassandra):

  • The kernel owns TCP/IP, the scheduler, and the NIC queues; threads share the scheduler queues.
  • Memory: lock contention, cache contention, NUMA unfriendly.

Seastar's sharded stack (one shard per core):

  • Each core runs its own userspace TCP/IP stack and task scheduler; cores communicate through explicit smp queues; each core owns a NIC queue via DPDK; the kernel isn't involved.
  • No contention, linear scaling, NUMA friendly.

SLIDE 14

Seastar Programming model


return open_file_dma(name, flags).then([] (file f) {
    return f.dma_read(pos, buf, size);        // yields a future with the bytes read
}).then([] (size_t n) {
    /* do something else */
}).handle_exception([] (std::exception_ptr e) {
    /* handle an exception */
});
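The same model as a minimal, self-contained program (my sketch, not from the deck; it assumes current Seastar header paths and the app_template entry point):

#include <seastar/core/app-template.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    // run() starts the reactor; the program exits when the
    // future returned by the lambda resolves.
    return app.run(argc, argv, [] {
        return seastar::sleep(std::chrono::seconds(1)).then([] {
            std::cout << "timer fired; nothing ever blocked\n";
        });
    });
}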

SLIDE 15

Seastar has its own task scheduler

Traditional stack:

  • A thread is a function pointer; a stack is a byte array from 64k to megabytes.
  • Each CPU runs a scheduler that multiplexes thread/stack pairs.
  • Context switch cost is high; large stacks pollute the caches.

Scylla's stack:

  • A promise is a pointer to an eventually computed value; a task is a pointer to a lambda function.
  • Each CPU runs many promise/task pairs on Seastar's own scheduler.
  • No sharing; millions of parallel events.
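In code, the promise/task pairing looks roughly like this (a sketch against Seastar's public promise/future API; in real code this runs inside the reactor):

// #include <seastar/core/future.hh>
seastar::promise<int> p;
// then() queues a task (the lambda) to run once the value exists:
seastar::future<int> done = p.get_future().then([] (int v) { return v * 2; });
p.set_value(21);   // makes the future ready; the task runs on this core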

SLIDE 16

Seastar minimizes cross CPU access

  ❏ A task is always scheduled on the CPU where it originated
  ❏ Local memory allocation, local memory freeing

SLIDE 17
Seastar minimizes cross CPU access

  • A task is always scheduled on the CPU where it originated
  • Local memory allocation, local memory freeing
  • Cross-CPU communication can happen, but is explicit (see the sketch below):
  • submit_to()
  • map_reduce()
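A rough sketch of what "explicit" means here, assuming Seastar's smp::submit_to(); the shard number, function name, and return value are illustrative:

// #include <seastar/core/smp.hh>
// Run a lambda on another shard and get its result back as a future.
seastar::future<int> read_counter_on(unsigned shard) {
    return seastar::smp::submit_to(shard, [] {
        // Executes on the target shard; touches only that shard's memory.
        return 42;   // placeholder for shard-local state
    });
}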

SLIDE 18
Linux page cache

  • Modern NoSQL databases trust it too much.
  • Both MongoDB and Cassandra just rely on the Linux page cache.
  • Wrong granularity, false sharing, unpredictable latencies.
  • Example: 1k rows per page; 3 hot rows, but also the coldest row. Which to evict?

SLIDE 19
Linux filesystems: our greatest enemies

  • Asynchronous I/O is not really asynchronous.
  • "It's OK, if it blocks something else runs instead": except there is no something else.
  • "Thread per core" really becomes "two threads per core".
  • XFS blocks under heavy load; otherwise it's OK.

SLIDE 20

I/O Scheduling

Queries, the commitlog, and compaction each feed their own queue; a userspace I/O scheduler arbitrates among the queues in front of the disk.

SLIDE 21

I/O Scheduling

ext4, 4.3.3:

# ./fsqual
context switch per appending io: 1 (BAD)

XFS, 3.15:

# ./fsqual
context switch per appending io: 0 (GOOD)
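The idea behind an fsqual-style probe, sketched below (my reconstruction of the technique, not fsqual's actual source): issue appending O_DIRECT AIO writes and count voluntary context switches per io_submit(); anything above zero means the "asynchronous" submission blocked.

// Build: g++ -O2 fsqual-sketch.cc -laio   (error handling omitted)
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>
#include <cstdio>
#include <cstdlib>

int main() {
    io_context_t ctx = 0;
    io_setup(128, &ctx);
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    void* buf;
    posix_memalign(&buf, 4096, 4096);        // O_DIRECT needs aligned buffers

    rusage before{}, after{};
    getrusage(RUSAGE_SELF, &before);
    const long n = 1000;
    for (long i = 0; i < n; i++) {
        iocb cb;
        iocb* cbp = &cb;
        io_prep_pwrite(&cb, fd, buf, 4096, i * 4096);  // size-changing (appending) write
        io_submit(ctx, 1, &cbp);             // should return without blocking
        io_event ev;
        io_getevents(ctx, 1, 1, &ev, nullptr);
    }
    getrusage(RUSAGE_SELF, &after);
    printf("context switch per appending io: %ld\n",
           (after.ru_nvcsw - before.ru_nvcsw) / n);
    close(fd);
}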

SLIDE 22

I/O Scheduling

SLIDE 23

I/O Scheduling

Increased latency for no gain: better avoid it. XFS screams.

SLIDE 24

I/O Scheduling

Shares distribution     C1       C2       C3       C4
10, 10, 10, 10          137506   137501   137501   137501
100, 100, 100, 100      137504   137499   137499   137499
10, 20, 40, 80          37333    73732    146566   292375
100, 10, 10, 10         421211   42922    42922    42922

Throughput in KB/s for 4 classes disputing the same I/O queue, with various share distributions, on a single core; a 550 MB/s SSD fully saturated. From ScyllaDB's blog: http://www.scylladb.com/2016/04/29/io-scheduler-2/
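(A quick check of the numbers: with equal shares the four classes split the disk evenly, and with shares 10:20:40:80 the measured throughput ratio is 37333 : 73732 : 146566 : 292375 ≈ 1 : 1.97 : 3.93 : 7.83, tracking the 1 : 2 : 4 : 8 share ratio almost exactly.)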

SLIDE 25
SLIDE 26

How to interact

+ Download: http://www.scylladb.com
+ Twitter: @ScyllaDB
+ Source: http://github.com/scylladb/scylla
+ Mailing lists: scylladb-user @ groups.google.com
+ Company site & blog: http://www.scylladb.com/

SLIDE 27

SCYLLA, NoSQL GOES NATIVE. Thank you.