Building an Elastic Main-Memory Database: E-Store
AARON J. ELMORE AELMORE@CS.UCHICAGO.EDU
A collaboration between many: Rebecca Taft, Vaibhav Arora, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andy Pavlo, Amr El Abbadi, Divy Agrawal, Michael Stonebraker
E-Store @ VLDB 2015, Squall @ SIGMOD 2015, E-Store++ @ ????
Developer ease via ACID
Turing Award winning greats
[Chart: average monthly active users (millions), Q3 2009 through Q3 2012]
Rapid growth of some web services led to design of new “web-scale” databases…
Scaling is needed, so chisel away at functionality
Not always suitable…
[Diagrams: demand vs. provisioned resources over time. Static provisioning leaves unused resources when capacity exceeds demand, and falls short when demand exceeds capacity. Slide credits: Berkeley RAD Lab]
[Chart: Zynga average monthly active users (millions), Q3 2009 through Q4 2014. Source: http://www.statista.com/statistics/273569/monthly-active-users-of-zynga-games/]
ELASTICITY > SCALABILITY
[Diagram: with elastic provisioning, capacity tracks demand over time and unused resources shrink. Slide credits: Berkeley RAD Lab]
Database-as-a-Service with elastic placement of non-correlated tenants, often with low utilization per tenant
High-throughput transactional systems (OLTP)
Cost per GB of RAM is dropping. Network memory is faster than local disk. Main-memory DBMSs are much faster than disk-based DBMSs.
Highly concurrent, latch-free data structures
Partitioning into single-threaded executors
*Excuse the generalization
[Diagram: H-Store execution model. A client application invokes a stored procedure by name with input parameters. Slide credits: Andy Pavlo]
[Diagram: the TPC-C schema as a schema tree rooted at WAREHOUSE, with DISTRICT, CUSTOMER, ORDERS, ORDER_ITEM, and STOCK descending from it; the ITEM table is replicated. Slide credits: Andy Pavlo]
[Diagram: the schema tree partitioned by WAREHOUSE into partitions P1 through P5, with each warehouse's DISTRICT, CUSTOMER, ORDERS, ORDER_ITEM, and STOCK rows co-located on its partition; the replicated ITEM table is copied to every partition. Slide credits: Andy Pavlo]
Many OLTP applications suffer from variable load and high skew:
Extreme skew: 40-60% of NYSE trading volume is on 40 individual stocks
Time variation: load "follows the sun"
Seasonal variation: ski resorts have high load in the winter months
Load spikes: the first and last 10 minutes of the trading day have 10X the average volume
Hockey stick effect: a new application goes "viral"
[Diagram: levels of access skew. No skew: uniform data access. Low skew: 2/3 of queries access the top 1/3 of items. High skew: a few very hot items.]
High skew increases latency by 10X and decreases throughput by 4X. Partitioned shared-nothing systems are especially susceptible.
Possible solutions:
[Diagram: scale out by adding a partition. Three partitions holding keys 1-4, 5-8, and 9-12 become four partitions, with the 9-12 range split into 9-10 and 11-12.]
[Diagram: rebalance by moving individual tuples. Keys 10 and 11 move to the first partition and key 12 to the second, leaving hot key 9 on a partition of its own.]
What if only a few specific tuples are very hot? Deal with them separately, using two tiers:
1. Individual hot tuples, mapped explicitly to partitions
2. Large blocks of colder tuples, hash- or range-partitioned at coarse granularity
Possible implementations:
Existing systems are "one-tiered" and partition data only at coarse granularity.
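As a contrast with those one-tiered systems, here is a minimal sketch, in Java, of what a two-tiered lookup can look like: an explicit hot-tuple map is consulted first, and everything else falls back to coarse-grained range partitioning. The class and field names are illustrative assumptions, not E-Store's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a two-tiered partition map: tier 1 maps individual
// hot keys directly to partitions; tier 2 range-partitions cold keys in
// coarse blocks. Names are illustrative, not E-Store's actual classes.
public class TwoTieredPartitionMap {
    // Tier 1: explicit placement of individual hot tuples.
    private final Map<Long, Integer> hotTuples = new HashMap<>();
    // Tier 2: block start key -> partition id, for large cold ranges.
    private final TreeMap<Long, Integer> coldRanges = new TreeMap<>();

    public void mapHotTuple(long key, int partition) {
        hotTuples.put(key, partition);
    }

    public void mapColdRange(long startKeyInclusive, int partition) {
        coldRanges.put(startKeyInclusive, partition);
    }

    /** The hot-tuple map wins; otherwise use the enclosing cold range. */
    public int lookup(long key) {
        Integer hot = hotTuples.get(key);
        if (hot != null) {
            return hot;
        }
        Map.Entry<Long, Integer> range = coldRanges.floorEntry(key);
        if (range == null) {
            throw new IllegalArgumentException("key not covered by any range: " + key);
        }
        return range.getValue();
    }

    public static void main(String[] args) {
        TwoTieredPartitionMap plan = new TwoTieredPartitionMap();
        // Coarse blocks of 100,000 keys, as in the YCSB example later on.
        plan.mapColdRange(0, 0);
        plan.mapColdRange(100_000, 1);
        plan.mapColdRange(200_000, 2);
        // A few hot tuples pinned explicitly to other partitions.
        plan.mapHotTuple(0, 2);
        plan.mapHotTuple(1, 1);
        System.out.println(plan.lookup(0));       // 2 (hot-tuple override)
        System.out.println(plan.lookup(150_000)); // 1 (cold range)
    }
}
```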
E-Store: an end-to-end system that extends H-Store (a distributed, shared-nothing, main-memory DBMS) with automatic, adaptive, two-tiered elastic partitioning.
[Diagram: the E-Store control loop. Normal operation runs with high-level monitoring; when a load imbalance is detected, tuple-level monitoring (E-Monitor) collects hot tuples and partition-level access counts; tuple placement planning (E-Planner) produces a new partition plan; online reconfiguration (Squall) applies it; once reconfiguration is complete, the system returns to normal operation.]
High-level system statistics are collected every ~1 minute
Tuple-level statistics collected in case of load imbalance
Can be used to determine workload distribution, using tuple access count as a proxy for system load
Minor performance degradation during collection
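A rough sketch, in Java, of what this tuple-level access counting could look like: every access increments a per-tuple counter, and the hottest k tuples are reported at the end of the monitoring window. The class and its top-k interface are illustrative assumptions, not E-Monitor's actual implementation.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch: count tuple accesses during the monitoring window and
// report the k hottest tuples, using access counts as a proxy for load.
public class TupleAccessCounter {
    private final Map<Long, Long> counts = new HashMap<>();

    // Called on every tuple access while tuple-level monitoring is enabled.
    public void recordAccess(long tupleKey) {
        counts.merge(tupleKey, 1L, Long::sum);
    }

    // Hottest k tuples at the end of the monitoring window.
    public List<Map.Entry<Long, Long>> topK(int k) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Long, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(k)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TupleAccessCounter monitor = new TupleAccessCounter();
        // Simulate a skewed access pattern: tuples 0 and 1 are much hotter than the rest.
        for (int i = 0; i < 20_000; i++) monitor.recordAccess(0);
        for (int i = 0; i < 12_000; i++) monitor.recordAccess(1);
        for (long key = 3; key < 1_000; key++) monitor.recordAccess(key);
        System.out.println(monitor.topK(2)); // [0=20000, 1=12000]
    }
}
```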
Sample E-Monitor output: hot tuples and access counts (see the tables below)
Given the current partitioning of data, system statistics, and hot tuples/partitions from E-Monitor, E-Planner determines the new partition plan.
Optimization problem: minimize data movement (migration is not free) while balancing system load. We tested five different data placement algorithms:
Current YCSB partition plan:
"usertable": { 0: [0-100000), 1: [100000-200000), 2: [200000-300000) }
Hot tuples | Accesses
0 | 20,000
1 | 12,000
2 | 5,000

Cold ranges | Accesses
3-1000 | 5,000
1000-2000 | 3,000
2000-3000 | 2,000
… | …
New YCSB partition plan:
"usertable": { 0: [1000-100000), 1: [1-2), [100000-200000), 2: [200000-300000), [0-1), [2-1000) }
How do we get from the current plan to the new one?
Example: balancing the plan step by step. Initial state:
Partition | Keys | Total cost (tuple accesses)
0 | [0-100000) | 77,000
1 | [100000-200000) | 23,000
2 | [200000-300000) | 5,000
Target cost per partition: 35,000
After moving hot tuple 0 (20,000 accesses) to partition 2:
Partition | Keys | Total cost (tuple accesses)
0 | [1-100000) | 57,000
1 | [100000-200000) | 23,000
2 | [200000-300000), [0-1) | 25,000
Target cost per partition: 35,000
After moving hot tuple 1 (12,000 accesses) to partition 1:
Partition | Keys | Total cost (tuple accesses)
0 | [2-100000) | 45,000
1 | [100000-200000), [1-2) | 35,000
2 | [200000-300000), [0-1) | 25,000
Target cost per partition: 35,000
After moving hot tuple 2 (5,000 accesses) to partition 2:
Partition | Keys | Total cost (tuple accesses)
0 | [3-100000) | 40,000
1 | [100000-200000), [1-2) | 35,000
2 | [200000-300000), [0-1), [2-3) | 30,000
Target cost per partition: 35,000
After moving cold range [3-1000) (5,000 accesses) to partition 2, every partition hits the target of 35,000:
Partition | Keys | Total cost (tuple accesses)
0 | [1000-100000) | 35,000
1 | [100000-200000), [1-2) | 35,000
2 | [200000-300000), [0-1), [2-1000) | 35,000
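A minimal sketch, in Java, of a greedy hot-tuple placement in the spirit of the example above: move the hottest tuples off partitions that exceed the target load onto the currently coldest partition. The class and its inputs are illustrative assumptions; the actual E-Planner algorithms listed next also place cold ranges and weigh the cost of data movement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical greedy planner in the spirit of the walkthrough above:
// take hot tuples (hottest first) off partitions that are above the
// target load and move them to the currently coldest partition.
public class GreedyPlannerSketch {

    static class HotTuple {
        final long key;
        final double load;   // access count from E-Monitor
        int partition;       // current home partition
        HotTuple(long key, double load, int partition) {
            this.key = key; this.load = load; this.partition = partition;
        }
    }

    /** Mutates partitionLoad and the tuples' partition fields; returns the moves made. */
    static List<String> plan(double[] partitionLoad, List<HotTuple> hotTuples) {
        double target = 0;
        for (double l : partitionLoad) target += l;
        target /= partitionLoad.length;                 // e.g. 35,000 in the example

        hotTuples.sort((a, b) -> Double.compare(b.load, a.load)); // hottest first
        List<String> moves = new ArrayList<>();
        for (HotTuple t : hotTuples) {
            if (partitionLoad[t.partition] <= target) continue;   // source not overloaded
            int coldest = 0;
            for (int p = 1; p < partitionLoad.length; p++) {
                if (partitionLoad[p] < partitionLoad[coldest]) coldest = p;
            }
            if (coldest == t.partition) continue;
            partitionLoad[t.partition] -= t.load;
            partitionLoad[coldest] += t.load;
            moves.add("move tuple " + t.key + " from partition " + t.partition + " to " + coldest);
            t.partition = coldest;
        }
        return moves;
    }

    public static void main(String[] args) {
        // Numbers from the YCSB example: partition loads 77k / 23k / 5k,
        // hot tuples 0, 1, 2 currently on partition 0.
        double[] load = {77_000, 23_000, 5_000};
        List<HotTuple> hot = new ArrayList<>(List.of(
                new HotTuple(0, 20_000, 0),
                new HotTuple(1, 12_000, 0),
                new HotTuple(2, 5_000, 0)));
        plan(load, hot).forEach(System.out::println);
        // Resulting loads: [40,000, 35,000, 30,000] -- the final cold-range
        // step that reaches 35k/35k/35k is omitted from this sketch.
    }
}
```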
Greedy
First Fit
Two-Tiered Bin Packer
(each partition should have total load less than the average + 5%)
One-Tiered Bin Packer
Given the plan from E-Planner, Squall physically moves the data while the system is live. For immediate benefit, it moves data from the hottest partitions to the coldest partitions first. More on this in a bit…
[Charts: E-Store experimental results on YCSB under high skew and low skew]
Distributed transactions???
Current E-Store does not take them into account when planning data movement.
This is OK when most transactions access a single partitioning key, which tends to be the case for "tree schemas" such as YCSB, Voter, and TPC-C.
E-Store++ will address the general case.
FINE-GRAINED LIVE RECONFIGURATION FOR PARTITIONED MAIN MEMORY DATABASES
Need to migrate tuples between partitions to reflect the updated partitioning. Would like to do this without bringing the system offline:
Similar to live migration of an entire database between servers.
Prior work is predicated on disk-based solutions with traditional concurrency and recovery:
Zephyr: relies on concurrency control (2PL) and disk pages.
ProRea: relies on concurrency control (SI and OCC) and disk pages.
Albatross: relies on replication and shared disk storage; also introduces strain on the source.
Slacker: replication-middleware based.
More than a single source and destination
Single-threaded execution model
Presence of distributed transactions and replication
[Chart: migrating 2 warehouses in TPC-C in E-Store with a Zephyr-like migration]
Given the plan from E-Planner, Squall physically moves the data while the system is live. It conforms to H-Store's single-threaded execution model.
To avoid performance degradation, Squall moves small chunks of data at a time, interleaved with regular transaction execution
[Diagram: initializing a reconfiguration. A (New Plan, Leader ID) message is sent to every partition; each partition computes its incoming and outgoing ranges over the warehouse, district, customer, and stock tables (e.g., "Pull W_ID=2" at one partition and "Pull W_ID>5" at another; one partition records Incoming: 2, Outgoing: 5 while its peer records Outgoing: 2, Incoming: 5).]
Redirect or pull only if needed.
Properly size the reconfiguration granule.
Split large reconfigurations to limit demands on a single partition.
Tune what gets pulled; sometimes pull a little extra.
A query arrives and must be trapped to check whether its data is potentially moving: check the key map, then the ranges list.
If either the source or the destination partition is local, check their map and keep the transaction local if possible.
If neither partition is local, forward the request to the destination.
If the data is not moving, process the transaction normally.
If txn requires incoming data, block execution and schedule data pull.
If the txn requires data that has already migrated away, restart it as a distributed transaction or forward the request.
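A simplified sketch of that routing decision, in Java. PartitionState and its three checks are hypothetical stand-ins for Squall's key-map and range-list lookups, not its real interfaces.

```java
// Simplified sketch of routing a single-key transaction at a partition while a
// reconfiguration is in flight. The helper methods are hypothetical stand-ins.
public class ReconfigRouterSketch {

    enum Decision { EXECUTE_LOCALLY, BLOCK_AND_PULL, FORWARD_TO_DESTINATION }

    interface PartitionState {
        boolean isMoving(long key);       // does the key fall in an in-flight range?
        boolean ownsLocally(long key);    // is the key physically present here?
        boolean isIncoming(long key);     // is this partition the key's destination?
    }

    static Decision route(PartitionState local, long key) {
        if (!local.isMoving(key)) {
            return Decision.EXECUTE_LOCALLY;   // data untouched by the reconfiguration
        }
        if (local.ownsLocally(key)) {
            return Decision.EXECUTE_LOCALLY;   // keep the transaction local if possible
        }
        if (local.isIncoming(key)) {
            return Decision.BLOCK_AND_PULL;    // destination: block and schedule a live pull
        }
        // Data has already migrated away (or never lived here):
        // forward the request, or restart it as a distributed transaction.
        return Decision.FORWARD_TO_DESTINATION;
    }

    public static void main(String[] args) {
        PartitionState state = new PartitionState() {
            public boolean isMoving(long key)    { return key == 42; } // one in-flight key
            public boolean ownsLocally(long key) { return false; }     // not arrived yet
            public boolean isIncoming(long key)  { return true; }      // we are the destination
        };
        System.out.println(route(state, 7));   // EXECUTE_LOCALLY
        System.out.println(route(state, 42));  // BLOCK_AND_PULL
    }
}
```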
Live data pulls are scheduled at the destination as high-priority transactions. The current transaction finishes before extraction. Timeout detection is needed.
The amount of data is unknown when the table is not partitioned by a clustered index (e.g., CUSTOMER partitioned by W_ID in TPC-C). Time spent extracting is time not spent on transactions. We want a mechanism to support partial extraction while maintaining consistency.
Periodically pull chunks of cold data. These pulls are answered lazily. Execution is interleaved with extracting and sending data (the range is marked dirty, though).
[Animation: asynchronous chunked pulls between Partition 1 and Partition 2. When the destination partition's transaction queue is idle, it issues an asynchronous pull; the source answers with a chunk of data while both partitions continue executing transactions. Note that between chunks the data is only partially migrated. Chunking repeats until the migration is complete, and new transactions still take precedence.]
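A rough sketch, in Java, of the destination-side loop for these asynchronous, chunked cold-data pulls. The requestChunk extraction call, the applyChunk loader, and the fixed delay between pulls are assumptions for illustration; in Squall the pulls are scheduled between transactions on the single-threaded engine rather than on a dedicated loop.

```java
// Rough sketch of the destination side of asynchronous, chunked cold-data pulls.
// requestChunk() and applyChunk() are hypothetical stand-ins for source-side
// extraction and destination-side loading.
public class AsyncChunkedPullSketch {

    interface Source {
        /** Extract and return at most maxBytes of the range starting at cursor;
         *  returns null when the range is fully migrated. */
        byte[] requestChunk(long rangeId, long cursor, int maxBytes);
    }

    static void migrateRange(Source source, long rangeId,
                             int chunkBytes, long delayBetweenPullsMillis)
            throws InterruptedException {
        long cursor = 0;
        while (true) {
            byte[] chunk = source.requestChunk(rangeId, cursor, chunkBytes);
            if (chunk == null) {
                break;                              // range fully migrated
            }
            applyChunk(rangeId, cursor, chunk);     // load the tuples locally
            cursor += chunk.length;
            // Throttle: give regular transactions room between pulls.
            Thread.sleep(delayBetweenPullsMillis);
        }
    }

    static void applyChunk(long rangeId, long cursor, byte[] chunk) {
        // Destination-side load; omitted in this sketch.
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake source: 20 KB of cold data served in 8 KB chunks.
        byte[] data = new byte[20 * 1024];
        Source fake = (rangeId, cursor, maxBytes) -> {
            if (cursor >= data.length) return null;
            int n = (int) Math.min(maxBytes, data.length - cursor);
            return java.util.Arrays.copyOfRange(data, (int) cursor, (int) cursor + n);
        };
        migrateRange(fake, 1L, 8 * 1024, 10);
        System.out.println("range migrated");
    }
}
```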
Static analysis is used to set chunk sizes; dynamically setting sizing and scheduling is future work. [Chart: impact of chunk size on a 10% reconfiguration during a YCSB workload]
Introduce a delay at the destination between new async pull requests. [Chart: impact of the delay between pulls on a 10% reconfiguration during a YCSB workload with an 8 MB chunk size]
Split by pairs of source and destination. Example: if partition 1 is migrating W_ID 2 and 3 to partitions 3 and 7, execute it as two reconfigurations. If migrating large objects, split them and use distributed transactions.
Set a cap on the number of sub-plan splits; split on source/destination pairs and on the ability to decompose migrating objects.
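A small sketch, in Java, of splitting one reconfiguration into sub-plans keyed by (source, destination) pair, with a cap on the number of sub-plans. The Migration record and the merge-the-tail capping rule are illustrative assumptions, not Squall's actual splitter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: split one reconfiguration plan into sub-plans keyed by
// (source partition, destination partition), capped at maxSubPlans, so no
// single partition is hammered by the whole migration at once.
public class PlanSplitterSketch {

    record Migration(int source, int destination, String range) {}

    static List<List<Migration>> split(List<Migration> plan, int maxSubPlans) {
        Map<String, List<Migration>> byPair = new LinkedHashMap<>();
        for (Migration m : plan) {
            byPair.computeIfAbsent(m.source() + "->" + m.destination(),
                                   k -> new ArrayList<>()).add(m);
        }
        List<List<Migration>> subPlans = new ArrayList<>(byPair.values());
        // Cap the number of sub-plans by merging the tail back together.
        while (subPlans.size() > maxSubPlans) {
            List<Migration> last = subPlans.remove(subPlans.size() - 1);
            subPlans.get(subPlans.size() - 1).addAll(last);
        }
        return subPlans;
    }

    public static void main(String[] args) {
        // Example from the slide: partition 1 migrates W_ID 2 to partition 3
        // and W_ID 3 to partition 7, executed as two reconfigurations.
        List<Migration> plan = List.of(
                new Migration(1, 3, "W_ID=2"),
                new Migration(1, 7, "W_ID=3"));
        System.out.println(split(plan, 4)); // two sub-plans, one per pair
    }
}
```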
This trades off time to complete the migration against performance degradation. Future work will consider automating this trade-off based on service-level objectives.
[Charts: Squall experiments. TPC-C load balancing of hotspot warehouses; YCSB cluster consolidation from 4 to 3 nodes; YCSB 10% pairwise data shuffle]
Skew happens; two-tiered partitioning for greedy load balancing and elastic growth helps. If you have to migrate, be careful to break up the migrations and don't place too much demand on any one partition. We are thinking hard about skewed workloads that aren't trivial to partition.
Questions?