SLIDE 1

Building an Elastic Main-Memory Database: E-Store

AARON J. ELMORE AELMORE@CS.UCHICAGO.EDU

SLIDE 2

Collaboration Between Many

Rebecca Taft, Vaibhav Arora, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andy Pavlo, Amr El Abbadi, Divy Agrawal, Michael Stonebraker

E-Store @ VLDB 2015, Squall @ SIGMOD 2015, E-Store++ @ ????

SLIDE 3

Databases are Great

Developer ease via ACID. Turing-Award-winning great.

SLIDE 4

But they are Rigid and Complex

SLIDE 5

Growth…

[Chart: average monthly active users (millions) by quarter, Q3 2009 through Q3 2012, growing rapidly]

Rapid growth of some web services led to design of new “web-scale” databases…

SLIDE 6

Rise of NoSQL

Scaling is needed, so chisel away at functionality:

  • No transactions
  • No secondary indexes
  • Minimal recovery
  • Mixed Consistency

Not always suitable…

SLIDE 7

Workloads Fluctuate

[Chart: demand fluctuating over time (resources vs. time)]

Slide Credits: Berkeley RAD Lab

SLIDE 8

Peak Provisioning

[Chart: capacity provisioned for peak demand; unused resources whenever demand dips below capacity]

Slide Credits: Berkeley RAD Lab

SLIDE 9

Peak Provisioning isn’t Perfect

[Chart: demand spiking above the peak-provisioned capacity]

Slide Credits: Berkeley RAD Lab

SLIDE 10

Growth is not always sustained

[Chart: average monthly active users (millions) by quarter, Q3 2009 through Q4 2014, peaking and then declining]

http://www.statista.com/statistics/273569/monthly-active-users-of-zynga-games/

SLIDE 11

Need Elasticity

ELASTICITY > SCALABILITY

SLIDE 12

The Promise of Elasticity

[Chart: capacity elastically tracking demand over time, shrinking the gap of unused resources]

Slide Credits: Berkeley RAD Lab

SLIDE 13

Primary use-cases for elasticity

  • Database-as-a-Service with elastic placement of non-correlated tenants, often with low utilization per tenant
  • High-throughput transactional systems (OLTP)

SLIDE 14

No Need to Weaken the Database!

SLIDE 15

High Throughput = Main Memory

Cost per GB of RAM is dropping. Network memory is faster than local disk. Main-memory systems are much faster than disk-based DBs.

SLIDE 16

Approaches for “NewSQL” main-memory*

  • Highly concurrent, latch-free data structures
  • Partitioning into single-threaded executors

*Excuse the generalization

SLIDE 17

[Diagram: a client application invokes a stored procedure by sending the procedure name and input parameters to the cluster]

Slide Credits: Andy Pavlo

SLIDE 18

Database Partitioning

[Diagram: the TPC-C schema as a tree rooted at WAREHOUSE, with DISTRICT, CUSTOMER, ORDERS, ORDER_ITEM, and STOCK descending from it; the read-only ITEM table is replicated]

Slide Credits: Andy Pavlo

SLIDE 19

Database Partitioning

[Diagram: the schema tree split by WAREHOUSE id into partitions P1–P5; each partition holds the DISTRICT, CUSTOMER, ORDERS, ORDER_ITEM, and STOCK rows of its warehouses, while the replicated ITEM table is copied to every partition]

Slide Credits: Andy Pavlo
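A minimal sketch (not from the slides) of how this tree-schema partitioning can be expressed: every table is routed by the WAREHOUSE id it descends from, while read-only tables such as ITEM are replicated on every partition. Table names follow the slide's schema tree; the key columns, partition count, and hash-by-modulo rule are illustrative assumptions.

```python
# Sketch: route TPC-C-style tuples to partitions by their root WAREHOUSE id.
# Replicated tables (ITEM) live on every partition.

NUM_PARTITIONS = 5            # illustrative cluster size
REPLICATED_TABLES = {"ITEM"}

# Each partitioned table exposes the column carrying the root WAREHOUSE key
# (column names are illustrative).
ROOT_KEY_COLUMN = {
    "WAREHOUSE": "W_ID",
    "DISTRICT": "D_W_ID",
    "CUSTOMER": "C_W_ID",
    "ORDERS": "O_W_ID",
    "ORDER_ITEM": "OI_W_ID",
    "STOCK": "S_W_ID",
}

def partitions_for(table: str, row: dict) -> list[int]:
    """Return the partition(s) that should store this row."""
    if table in REPLICATED_TABLES:
        return list(range(NUM_PARTITIONS))       # copy everywhere
    warehouse_id = row[ROOT_KEY_COLUMN[table]]
    return [warehouse_id % NUM_PARTITIONS]       # co-partition with WAREHOUSE

# Example: a CUSTOMER row for warehouse 7 lands on partition 7 % 5 = 2.
print(partitions_for("CUSTOMER", {"C_W_ID": 7, "C_ID": 42}))
```

Because every table in the tree is co-partitioned on the same root key, a transaction that touches a single warehouse touches a single partition.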

SLIDE 20

The Problem: Workload Skew

Many OLTP applications suffer from variable load and high skew:
  • Extreme skew: 40-60% of NYSE trading volume is on 40 individual stocks
  • Time variation: load “follows the sun”
  • Seasonal variation: ski resorts have high load in the winter months
  • Load spikes: the first and last 10 minutes of the trading day have 10X the average volume
  • Hockey stick effect: a new application goes “viral”

SLIDE 21

The Problem: Workload Skew

  • No skew: uniform data access
  • Low skew: 2/3 of queries access the top 1/3 of data
  • High skew: a few very hot items

SLIDE 22

The Problem: Workload Skew

High skew increases latency by 10X and decreases throughput by 4X. Partitioned shared-nothing systems are especially susceptible.

SLIDE 23

The Problem: Workload Skew

Possible solutions:

  • Provision resources for peak load (Very expensive and brittle!)
  • Limit load on system (Poor performance!)
  • Enable system to elastically scale in or out to dynamically adapt to changes in load
SLIDE 24

Elastic Scaling

[Diagram: keys 1-12 with their values spread over three partitions (1-4, 5-8, 9-12); after elastic scale-out to a fourth partition, keys 9-10 stay put and keys 11-12 move to the new partition]

SLIDE 25

Load Balancing

[Diagram: the same keys rebalanced across the existing three partitions; hot keys 10-11 move to the first partition and key 12 moves to the second, leaving the overloaded third partition with only key 9]

SLIDE 26

Two-Tiered Partitioning

What if only a few specific tuples are very hot? Deal with them separately! Two tiers:

1. Individual hot tuples, mapped explicitly to partitions
2. Large blocks of colder tuples, hash- or range-partitioned at coarse granularity

Possible implementations:

  • Fine-grained range partitioning
  • Consistent hashing with virtual nodes
  • Lookup table combined with any standard partitioning scheme (sketched below)

Existing systems are “one-tiered” and partition data only at coarse granularity

  • Unable to handle cases of extreme skew
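A minimal sketch, not E-Store's actual implementation, of the lookup-table variant: hot tuples are mapped individually, and everything else falls back to coarse range partitioning. The range boundaries, hot keys, and partition ids are made up for illustration (they echo the YCSB example later in the deck).

```python
import bisect

# Tier 2: coarse ranges of cold tuples, as (range_start, partition) pairs,
# sorted by range_start; a key belongs to the last range starting at or below it.
COLD_RANGES = [(0, 0), (100_000, 1), (200_000, 2)]   # illustrative boundaries

# Tier 1: individually mapped hot tuples, which override the coarse ranges.
HOT_TUPLES = {0: 2, 1: 1, 2: 2}                      # key -> partition

def partition_for(key: int) -> int:
    """Two-tiered lookup: explicit hot-tuple map first, then coarse ranges."""
    if key in HOT_TUPLES:
        return HOT_TUPLES[key]
    starts = [start for start, _ in COLD_RANGES]
    idx = bisect.bisect_right(starts, key) - 1
    return COLD_RANGES[idx][1]

assert partition_for(1) == 1         # hot tuple: explicit placement
assert partition_for(150_000) == 1   # cold tuple: range placement
```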
SLIDE 27

E-Store

End-to-end system which extends H-Store (a distributed, shared-nothing, main memory DBMS) with automatic, adaptive, two-tiered elastic partitioning

SLIDE 28

E-Store

[Control loop: normal operation with high-level monitoring → load imbalance detected → tuple-level monitoring (E-Monitor) → hot tuples and partition-level access counts → tuple placement planning (E-Planner) → new partition plan → online reconfiguration (Squall) → reconfiguration complete → back to normal operation]

SLIDE 29

E-Monitor: High-Level Monitoring

High-level system statistics collected every ~1 minute

  • CPU indicates system load; used to determine whether to add or remove nodes, or re-shuffle the data
  • Accurate in H-Store since partition executors are pinned to specific cores
  • Cheap to collect
  • When a load imbalance (or overload/underload) is detected, detailed monitoring is triggered
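A rough sketch of the kind of threshold check this implies; the watermark values, imbalance ratio, and decision strings are illustrative assumptions, not the actual values E-Store uses.

```python
import statistics

HIGH_WATERMARK = 0.90    # illustrative: average CPU above this -> consider scale-out
LOW_WATERMARK = 0.30     # illustrative: average CPU below this -> consider scale-in
IMBALANCE_RATIO = 1.5    # illustrative: max/mean above this -> rebalance

def check_cluster(cpu_by_partition: dict[int, float]) -> str:
    """Decide, from per-partition CPU, whether tuple-level monitoring is needed."""
    loads = list(cpu_by_partition.values())
    mean = statistics.mean(loads)
    if mean > HIGH_WATERMARK:
        return "overload: trigger tuple-level monitoring (scale out)"
    if mean < LOW_WATERMARK:
        return "underload: trigger tuple-level monitoring (scale in)"
    if max(loads) > IMBALANCE_RATIO * mean:
        return "imbalance: trigger tuple-level monitoring (rebalance)"
    return "normal operation"

print(check_cluster({0: 0.95, 1: 0.40, 2: 0.10}))   # -> imbalance
```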

SLIDE 30

E-Monitor: Tuple-Level Monitoring

Tuple-level statistics collected in case of load imbalance

  • Finds the top 1% of tuples accessed per partition (read or written) during a 10-second window
  • Finds total access count per block of cold tuples

Can be used to determine workload distribution, using tuple access count as a proxy for system load

  • Reasonable assumption for main-memory DBMS w/ OLTP workload

Minor performance degradation during collection
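A simplified sketch of how such statistics could be computed from a raw access log; the 1% cutoff follows the slide, while the block size, function name, and data are illustrative assumptions.

```python
from collections import Counter

BLOCK_SIZE = 1_000       # illustrative size of a cold-tuple block (in keys)
TOP_FRACTION = 0.01      # top 1% of accessed tuples count as "hot"

def summarize_window(accesses: list[int], tuples_on_partition: int):
    """Turn one monitoring window's access log (tuple keys) into
    hot tuples plus per-block access counts."""
    counts = Counter(accesses)
    k = max(1, int(tuples_on_partition * TOP_FRACTION))
    hot = counts.most_common(k)                              # [(key, accesses), ...]
    blocks = Counter(key // BLOCK_SIZE for key in accesses)  # block id -> accesses
    return hot, blocks

hot, blocks = summarize_window([0] * 50 + [1] * 30 + list(range(3, 103)), 300)
print(hot[:2], dict(blocks))
```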

SLIDE 31

E-Monitor: Tuple-Level Monitoring

Sample output [figure: hot tuples with access counts, plus per-block access counts, reported for each partition]

SLIDE 32

E-Planner

Given the current partitioning of data, system statistics, and the hot tuples/partitions from E-Monitor, E-Planner determines:

  • Whether to add or remove nodes
  • How to balance load

Optimization problem: minimize data movement (migration is not free) while balancing system load. We tested five different data placement algorithms:

  • One-tiered bin packing (ILP – computationally intensive!)
  • Two-tiered bin packing (ILP – computationally intensive!)
  • First Fit (global repartitioning to balance load)
  • Greedy (only move hot tuples)
  • Greedy Extended (move hot tuples first, then cold blocks until load is balanced)
SLIDE 33

E-Planner: Greedy Extended Algorithm

Current YCSB partition plan:
"usertable": { 0: [0-100000), 1: [100000-200000), 2: [200000-300000) }

Hot tuples: key 0 | 20,000 accesses; key 1 | 12,000; key 2 | 5,000
Cold ranges: [3-1000) | 5,000 accesses; [1000-2000) | 3,000; [2000-3000) | 2,000; …

New YCSB partition plan:
"usertable": { 0: [1000-100000), 1: [1-2), [100000-200000), 2: [200000-300000), [0-1), [2-1000) }

How do we get from the current plan to the new one?

SLIDE 34

E-Planner: Greedy Extended Algorithm

Partition | Keys | Total cost (tuple accesses)
0 | [0-100000) | 77,000
1 | [100000-200000) | 23,000
2 | [200000-300000) | 5,000
Target cost per partition: 35,000

Hot tuples: key 0 | 20,000 accesses; key 1 | 12,000; key 2 | 5,000
Cold ranges: [3-1000) | 5,000 accesses; [1000-2000) | 3,000; [2000-3000) | 2,000; …

SLIDE 36

E-Planner: Greedy Extended Algorithm

Partition | Keys | Total cost (tuple accesses)
0 | [1-100000) | 57,000
1 | [100000-200000) | 23,000
2 | [200000-300000), [0-1) | 25,000
Target cost per partition: 35,000

Hot tuples: key 0 | 20,000 accesses; key 1 | 12,000; key 2 | 5,000
Cold ranges: [3-1000) | 5,000 accesses; [1000-2000) | 3,000; [2000-3000) | 2,000; …

SLIDE 39

E-Planner: Greedy Extended Algorithm

Partition | Keys | Total cost (tuple accesses)
0 | [2-100000) | 45,000
1 | [100000-200000), [1-2) | 35,000
2 | [200000-300000), [0-1) | 25,000
Target cost per partition: 35,000

Hot tuples: key 0 | 20,000 accesses; key 1 | 12,000; key 2 | 5,000
Cold ranges: [3-1000) | 5,000 accesses; [1000-2000) | 3,000; [2000-3000) | 2,000; …

SLIDE 42

E-Planner: Greedy Extended Algorithm

Partition | Keys | Total cost (tuple accesses)
0 | [3-100000) | 40,000
1 | [100000-200000), [1-2) | 35,000
2 | [200000-300000), [0-1), [2-3) | 30,000
Target cost per partition: 35,000

Hot tuples: key 0 | 20,000 accesses; key 1 | 12,000; key 2 | 5,000
Cold ranges: [3-1000) | 5,000 accesses; [1000-2000) | 3,000; [2000-3000) | 2,000; …

SLIDE 45

E-Planner: Greedy Extended Algorithm

Partition | Keys | Total cost (tuple accesses)
0 | [1000-100000) | 35,000
1 | [100000-200000), [1-2) | 35,000
2 | [200000-300000), [0-1), [2-1000) | 35,000

Hot tuples: key 0 | 20,000 accesses; key 1 | 12,000; key 2 | 5,000
Cold ranges: [3-1000) | 5,000 accesses; [1000-2000) | 3,000; [2000-3000) | 2,000; …

Target cost per partition: 35,000

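The walkthrough above can be condensed into a short sketch. This is a simplified reconstruction of the Greedy Extended idea (move the hottest tuples first, then cold blocks, always from the hottest onto the coldest partition, until every partition is within a tolerance of the target), not the actual E-Store code; for brevity it assumes each item currently lives on the hottest partition, which is true in the slide's example.

```python
def greedy_extended(partition_load, hot_tuples, cold_blocks, tolerance=0.05):
    """
    partition_load: dict partition -> current access cost
    hot_tuples:     list of (tuple_id, accesses), hottest first
    cold_blocks:    list of (block_id, accesses), hottest first
    Returns (item, src, dst) moves; items are assumed to live on the
    currently hottest partition (a simplification for this sketch).
    """
    target = sum(partition_load.values()) / len(partition_load)
    moves = []

    def balanced():
        return all(load <= target * (1 + tolerance)
                   for load in partition_load.values())

    # Phase 1: move individual hot tuples.  Phase 2: move cold blocks.
    for items in (hot_tuples, cold_blocks):
        for item, cost in items:
            if balanced():
                return moves
            src = max(partition_load, key=partition_load.get)   # hottest partition
            dst = min(partition_load, key=partition_load.get)   # coldest partition
            if src == dst:
                continue
            partition_load[src] -= cost
            partition_load[dst] += cost
            moves.append((item, src, dst))
    return moves

# Example mirroring the slides: keys 0, 1, 2 are hot; ranges of cold keys follow.
print(greedy_extended({0: 77_000, 1: 23_000, 2: 5_000},
                      [(0, 20_000), (1, 12_000), (2, 5_000)],
                      [("block[3-1000)", 5_000), ("block[1000-2000)", 3_000)]))
```

Run on the slide's numbers, this reproduces the walkthrough: keys 0 and 2 move to partition 2, key 1 moves to partition 1, and one cold block moves to partition 2, leaving all three partitions at the 35,000 target.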

SLIDE 47

E-Planner: Other Heuristic Algorithms

Greedy

  • Like Greedy Extended, but the algorithm stops after all hot tuples have been moved
  • If there are not many hot tuples (e.g. low skew), may not sufficiently balance the workload

First Fit

  • First packs hot tuples onto partitions, filling one partition at a time
  • Then packs blocks of cold tuples, filling the remaining partitions one at a time
  • Results in a balanced workload, but does not attempt to limit the amount of data movement
SLIDE 48

E-Planner: Optimal Algorithms

Two-Tiered Bin Packer

  • Uses Integer Linear Programming (ILP) to optimally pack hot tuples and cold blocks onto partitions
  • Constraints: each tuple/block must be assigned to exactly one partition, and each partition must have total load less than the average + 5% (see the formulation after this list)
  • Optimization goal: minimize the amount of data moved in order to satisfy the constraints

One-Tiered Bin Packer

  • Like Two-Tiered Bin Packer, but can only pack blocks of tuples, not individual tuples
  • Both are computationally intensive, but they show the one- and two-tiered approaches in the best light
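In rough form, the bin-packing ILP described above can be written as follows. This is a reconstruction from the constraints listed on the slide, not the exact formulation in the paper: let $x_{ij} = 1$ if tuple or block $i$ is placed on partition $j$, $c_i$ its access count, $s_i$ its size, $r_i$ its current partition, and $\bar{L}$ the average partition load.

```latex
\begin{aligned}
\min \quad & \sum_{i} \sum_{j \neq r_i} s_i \, x_{ij}
  && \text{(minimize the data that has to move)} \\
\text{s.t.} \quad & \sum_{j} x_{ij} = 1 && \forall i
  \quad \text{(each tuple/block on exactly one partition)} \\
& \sum_{i} c_i \, x_{ij} \le 1.05\,\bar{L} && \forall j
  \quad \text{(no partition above average load} + 5\%\text{)} \\
& x_{ij} \in \{0, 1\} && \forall i, j
\end{aligned}
```

The one-tiered variant is the same program with only blocks as items; the two-tiered variant also gets the individually tracked hot tuples as items.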
SLIDE 49

Squall

Given a plan from E-Planner, Squall physically moves the data while the system is live. For immediate benefit, it moves data from the hottest partitions to the coldest partitions first. More on this in a bit…

SLIDE 50

Results – Two-Tiered vs. One-Tiered

[Charts: YCSB high skew and YCSB low skew]

SLIDE 51

Results – Heuristic Planners

[Charts: YCSB high skew and YCSB low skew]

SLIDE 52

Results

[Charts: YCSB high skew and YCSB low skew]

SLIDE 53

But What About…

Distributed transactions??? Current E-Store does not take them into account when planning data movement. This is OK when most transactions access a single partitioning key, which tends to be the case for “tree schemas” such as YCSB, Voter, and TPC-C. E-Store++ will address the general case

  • More later…
SLIDE 54

Squall

FINE-GRAINED LIVE RECONFIGURATION FOR PARTITIONED MAIN MEMORY DATABASES

SLIDE 55

The Problem

Need to migrate tuples between partitions to reflect the updated partitioning. Would like to do this without bringing the system offline:

  • Live Reconfiguration

Similar to live migration of an entire database between servers.

SLIDE 56

Existing Solutions are Not Ideal

Existing solutions are predicated on disk-based designs with traditional concurrency control and recovery:
  • Zephyr: relies on concurrency control (2PL) and disk pages
  • ProRea: relies on concurrency control (SI and OCC) and disk pages
  • Albatross: relies on replication and shared disk storage; also puts strain on the source
  • Slacker: based on replication middleware

SLIDE 57

Not Your Parent’s Migration

More than a single source and destination

  • Want lightweight coordination

Single threaded execution model

  • Either doing work or migration

Presence of distributed transactions and replication

[Chart: migrating 2 warehouses in TPC-C in E-Store with a Zephyr-like migration]

SLIDE 58

Squall

Given a plan from E-Planner, Squall physically moves the data while the system is live. It conforms to H-Store's single-threaded execution model

  • While data is moving, transactions are blocked

To avoid performance degradation, Squall moves small chunks of data at a time, interleaved with regular transaction execution

SLIDE 59

Squall Steps

[Diagram: a reconfiguration starts with (New Plan, Leader ID); the client keeps issuing transactions while partitions exchange pulls such as “Pull W_ID=2” and “Pull W_ID>5” for warehouse, district, customer, and stock tuples, each partition tracking its incoming and outgoing warehouse ids (e.g., Incoming: 2, Outgoing: 5)]

  • 1. Identify migrating data
  • 2. Live reactive pulls for required data
  • 3. Periodic lazy/async pulls for large chunks (sketched below)
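A toy sketch of these three steps from the destination partition's point of view; the class and method names are invented for illustration, not Squall's API.

```python
class DestinationPartition:
    """Illustrative sketch of Squall's per-partition reconfiguration steps."""

    def __init__(self, incoming_keys, chunk_size=2):
        # Step 1: identify the keys the new plan assigns to this partition
        # that still live elsewhere.
        self.pending = set(incoming_keys)
        self.chunk_size = chunk_size
        self.local = set()

    def pull(self, keys, priority):
        # Stand-in for a network pull from the source partition.
        print(f"pull {sorted(keys)} ({priority} priority)")
        self.pending -= keys
        self.local |= keys

    def on_transaction(self, txn_keys):
        # Step 2: reactive pull -- block the txn and fetch exactly the
        # migrating keys it needs, at high priority.
        needed = set(txn_keys) & self.pending
        if needed:
            self.pull(needed, priority="high")
        return f"executed txn on {sorted(txn_keys)}"

    def on_timer(self):
        # Step 3: periodic async pulls drain the remaining cold data in
        # small chunks, interleaved with normal transaction execution.
        if self.pending:
            chunk = set(sorted(self.pending)[: self.chunk_size])
            self.pull(chunk, priority="low")

p = DestinationPartition(incoming_keys={1, 2, 3, 4, 5})
print(p.on_transaction({2, 99}))   # reactive pull of key 2 only
p.on_timer()                       # async pull of a chunk of the rest
```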
SLIDE 60

Keys to Performance

  • Redirect or pull only if needed
  • Properly size the reconfiguration granule
  • Split large reconfigurations to limit demands on any single partition
  • Tune what gets pulled; sometimes pull a little extra

[Chart: migrating 2 warehouses in TPC-C in E-Store with a Zephyr-like migration]

SLIDE 61

Redirect and Pull Only When Needed

SLIDE 62

Data Migration

A query arrives and must be trapped to check whether its data is potentially moving. Check the key map, then the ranges list. If either the source or destination partition is local, check its map and keep the query local if possible. If neither partition is local, forward the query to the destination. If the data is not moving, process the transaction normally.
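One way to read that routing logic as code; a simplified sketch with invented names and data structures, not Squall's actual control flow.

```python
def find_range(key, migrating_ranges):
    """migrating_ranges: list of ((lo, hi), src, dst); return (src, dst) or None."""
    for (lo, hi), src, dst in migrating_ranges:
        if lo <= key < hi:
            return src, dst
    return None

def route_query(key, local_partition, key_map, migrating_ranges,
                default_partition, tuple_is_local):
    """Decide where a query on `key` runs while a reconfiguration is active."""
    # 1. The fine-grained key map (individually tracked tuples) wins first.
    if key in key_map:
        return key_map[key]
    # 2. Then check the list of migrating ranges.
    move = find_range(key, migrating_ranges)
    if move is None:
        return default_partition(key)        # data not moving: process normally
    src, dst = move
    # 3. If the source or destination is local and the tuple is actually here,
    #    keep the query local.
    if local_partition in (src, dst) and tuple_is_local(key):
        return local_partition
    # 4. Otherwise forward to the destination partition.
    return dst

# Illustrative use: keys [0, 1000) are moving from partition 0 to partition 2.
print(route_query(42, local_partition=0,
                  key_map={}, migrating_ranges=[((0, 1000), 0, 2)],
                  default_partition=lambda k: k % 3,
                  tuple_is_local=lambda k: True))   # -> 0 (still local)
```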

SLIDE 63

Trap for Data Movement

If txn requires incoming data, block execution and schedule data pull.

  • Can only block dependent nodes in the query plan
  • Upon receipt, mark and dirty the tracking structures, and unblock

If txn requires lost data, restart as distributed transaction or forward request.

SLIDE 64

Data Pull Requests

Live data pulls are scheduled at the destination as high-priority transactions. The current transaction finishes before extraction. Timeout detection is needed.

SLIDE 65

Chunk Data for Asynchronous Pulls

SLIDE 66

Why Chunk?

The amount of data is unknown when the table is not partitioned by a clustered index (e.g., CUSTOMER by W_ID in TPC-C). Time spent extracting is time not spent on transactions. We want a mechanism that supports partial extraction while maintaining consistency.

SLIDE 67

Async Pulls

Periodically pull chunks of cold data. These pulls are answered lazily. Execution is interwoven with extracting and sending data (but dirty the range!)
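A toy sketch of chunked, lazy extraction on the source side: answer an async pull one chunk at a time and mark the range dirty so later queries know it is partially migrated. The function name, chunk limit, and table representation are illustrative assumptions (the real system sizes chunks in megabytes).

```python
CHUNK_LIMIT = 3   # illustrative chunk size in tuples

def answer_async_pull(table, lo, hi, dirty_ranges):
    """Lazily extract one chunk of the requested cold range [lo, hi)."""
    dirty_ranges.add((lo, hi))                    # mark range as partially migrated
    chunk = []
    for key in sorted(k for k in table if lo <= k < hi):
        chunk.append((key, table.pop(key)))       # extract and remove locally
        if len(chunk) == CHUNK_LIMIT:
            break                                 # yield back to transaction work
    done = not any(lo <= k < hi for k in table)
    return chunk, done                            # caller re-requests until done

table = {k: f"value-{k}" for k in range(10)}
dirty = set()
chunk, done = answer_async_pull(table, 0, 5, dirty)
print(chunk, done)    # first 3 keys of the range, not yet done
```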

SLIDE 68

Mitigating Async Pulls

[Diagram: two partitions with transaction queues; when a partition's queue is idle, a clock-driven async data pull request is issued to the other partition]

SLIDE 69

New Transactions Take Precedence

[Diagram: newly arriving transactions in each partition's queue run before the pending async pull is serviced]

SLIDE 70

Extract up to Chunk Limit

[Diagram: the source partition extracts and sends data only up to the chunk limit, then returns to its transaction queue]

Important to note: the data is now partially migrated!

SLIDE 71

Repeat Until Complete

[Diagram: chunked pulls repeat between the two partitions until the migration is complete]

Repeat chunking until complete. New transactions still take precedence.

SLIDE 72

Sizing Chunks

We use static analysis to set chunk sizes; future work will set sizing and scheduling dynamically.

[Chart: impact of chunk size on a 10% reconfiguration during a YCSB workload]

SLIDE 73

Space Async Pulls

Introduce a delay at the destination between new async pull requests.

[Chart: impact of the pull spacing on a 10% reconfiguration during a YCSB workload with an 8 MB chunk size]

SLIDE 74

Splitting Reconfigurations

Split by pairs of source and destination. Example: if partition 1 is migrating W_ID 2 and 3 to partitions 3 and 7, execute this as two reconfigurations. If migrating large objects, split them and use distributed transactions.
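A small sketch of the splitting rule: group a reconfiguration plan's moves by (source, destination) pair and run each group as its own sub-plan. The plan representation and function name are invented for illustration.

```python
from collections import defaultdict

def split_reconfiguration(moves):
    """moves: list of (range_or_key, src_partition, dst_partition).
    Returns one sub-plan per (src, dst) pair, so each can be executed as a
    separate reconfiguration and no single partition is hit all at once."""
    sub_plans = defaultdict(list)
    for item, src, dst in moves:
        sub_plans[(src, dst)].append(item)
    return dict(sub_plans)

# Slide example: partition 1 sends W_ID 2 to partition 3 and W_ID 3 to partition 7.
plan = [("W_ID=2", 1, 3), ("W_ID=3", 1, 7)]
print(split_reconfiguration(plan))
# {(1, 3): ['W_ID=2'], (1, 7): ['W_ID=3']}  -> two sub-reconfigurations
```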

SLIDE 75

Splitting into Sub-Plans

Set a cap on the number of sub-plan splits, and split based on (source, destination) pairs and the ability to decompose migrating objects

SLIDE 76

All about trade-offs

Trading off time to complete the migration against performance degradation. Future work will consider automating this trade-off based on service-level objectives.

SLIDE 77

Results Highlight

[Chart: TPC-C load balancing of hotspot warehouses]

SLIDE 78

YCSB Latency

[Charts: YCSB cluster consolidation from 4 to 3 nodes; YCSB 10% pairwise data shuffle]

SLIDE 79

I Fell Asleep… What Happened

Skew happens; two-tiered partitioning with greedy load balancing and elastic growth helps. If you have to migrate while doing work, be careful to break up the migrations and don't lean too hard on any one partition. We are thinking hard about skewed workloads that aren't trivial to partition.

Questions?