CloudDB: A Data Store for all Sizes in the Cloud (Hakan Hacigumus)



SLIDE 1

CloudDB: A Data Store for all Sizes in the Cloud

Hakan Hacigumus
Data Management Research, NEC Laboratories America
http://www.nec-labs.com/dm
www.nec-labs.com

SLIDE 2

NEC Labs Data Management Research

What I will try to cover

  • Historical perspective and motivation
  • (Preliminary) Technical Approach
  • Current Status
  • Food for Thought

SLIDE 3

Why Data Management Research?

  • Many data management technologies and products have been around
  • Data centers have evolved over time
  • Data center hosting became a business
  • The database community was successful in creating technologies and business

SLIDE 4

Why Data Management (Again)?

  • Amount of data: the amount of business data doubles every 12-18 months
  • New data types: relational databases manage only 10-15% of the available data
  • New data sources: individual users via Web 2.0 applications, social sites, collaboration, mobile devices, sensors, etc.
  • New usage patterns: around the clock, around the world, highly interconnected
  • Large number of users: unprecedented increase and fluctuations
  • New types of apps: highly integrated, extremely data intensive

[Figure label: (Good Old) Database]

SLIDE 5

Cloud Computing

  • A paradigm shift in how and where a workload is generated and executed
  • Cloud service provider – cloud service consumer
  • Market size
      • Data management market: ~$20B
      • IT cloud services: ~$42B (by 2012) (IDC)

[Figure labels: Cloud Provider, API]

SLIDE 7

Animoto on Amazon EC2

  • Rapid growth: in three days, the number of users increased from 25k to 250k
  • The number of servers grew from 50 to 3,500
  • Assuming $500 per machine, that is $1.75M of hardware!
  • Instead, they used Amazon EC2
      • A no-infrastructure startup
      • Biggest piece of hardware: a (fancy) espresso machine!

Problem: It is not trivial to distribute users’ accesses to the data by just scaling out cloud computing nodes
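The hardware estimate on this slide can be checked with one multiplication. The sketch below only restates the slide's own figures (3,500 servers at an assumed $500 each); the function name is hypothetical.

```python
# Back-of-the-envelope check of the slide's hardware estimate.
servers_needed = 3500          # peak server count quoted on the slide
cost_per_machine_usd = 500     # per-machine price assumed on the slide


def upfront_hardware_cost(servers: int, unit_cost: int) -> int:
    """Return the capital cost of buying every server outright."""
    return servers * unit_cost


total = upfront_hardware_cost(servers_needed, cost_per_machine_usd)
print(f"${total:,}")  # prints "$1,750,000"
```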

SLIDE 8

Database-as-a-Service?

ICDE 2002! Reaction: Cool, but…

  • Technology
  • Regulations
  • Psychological acceptance
  • Business model

SLIDE 9

Data Management in Cloud

  • The cloud computing model may provide a platform to address new challenges
  • But the problem is:
      • Data management systems were not designed and implemented with the cloud computing model in mind
  • So the question is:
      • What are the data management challenges we need to address before the full potential of cloud computing can be realized?

SLIDE 10

Need for New Solutions

  • Massive scalability to handle
      • Very large amounts of data
      • Very large numbers of diverse users/requests
  • Elasticity to
      • handle varying demand
      • optimize operating costs
  • Flexibility to handle different data and processing models
  • Massive multi-tenancy to achieve economies of scale
  • More intelligent system monitoring and management

SLIDE 11

Cloud Data Management Challenges

[Figure: application classes plotted by # of queries / sec against # of records / query]

  • Application classes: large analytic apps (OLAP), large transactional apps (OLTP), small apps
  • Key challenges: scalable multi-tenant hosting; scalable read/write; scalable scan and aggregation; seamless data management
  • Scalability dimensions: query scalability, data scalability, multi-tenancy
  • Other labels: Ultimate goal, CloudDB

SLIDE 12

Buy All Sizes?

[Figure labels: OLTP, OLAP]

? – NO!

SLIDE 13

Buy One Size?

[Figure labels: OLTP, OLAP]

SLIDE 14

Let Someone Else Do All That

[Figure labels: OLTP, OLAP, Access and Management]

SLIDE 15

Let Someone Else Do All That

[Figure labels: OLTP, OLAP, Access and Management]

  • Leveraging very specialized database technologies
  • Easier integration with applications
  • Easier adoption by developers (the dominant force for adoption of cloud!)
  • Easier and more flexible deployment options in the middleware

SLIDE 16

Wish Lists

Clients

  • Standard language API (e.g., SQL)
  • Identifiable and verifiable Service Level Agreements
  • Common DBMS maintenance tasks (e.g., backup, versioning, patching)
  • Availability of value-add services, such as business analytics, information sharing, and collaboration

Service Provider

  • Satisfying clients' SLAs to sustain revenue
  • Great cost efficiency via a high level of automation and resource sharing to ensure profitability
  • Maintaining an extensible platform for value-add services

SLIDE 17

(Some) Storage Models

Store type: Relational
  • Main purpose: transaction processing
  • Pro: standardization; higher performance on Online Transaction Processing (OLTP); ACID properties
  • Con: scalability

Store type: Key/Value
  • Main purpose: scalable data storage; read/write intensive workload
  • Pro: scalability
  • Con: standardization; performance issues; complex query capability; ACID properties (?)

Store type: Column-Oriented
  • Main purpose: analytics processing; read optimized, throughput oriented
  • Pro: higher performance on Online Analytical Processing (OLAP); more flexible schema evolution (?)
  • Con: standardization; complex query capability

SLIDE 18

Application Scenario

Application v1: Personal Profile Management

  • Address
  • Phone
  • Notes
  • Contacts
  • Calendars
  • Reminders

Data: Profile Data (User 1 Data, User 2 Data)

Application v2: Information Portal

  • Online Shopping Catalogs
  • Product Reviews
  • Subscriptions

Data: Portal Data (Products, Reviews, …), External Sources

[Figure labels: Relational Database, Key/Value Store]

Very difficult migration

  • Application developers (skills, time)
  • Architects (redesign)
  • Company (investment)

SLIDE 19

Data Model Decisions

  • Problem: Users are forced to make a decision on the data model based on the current needs of their applications
      • Is it possible to make the “right” decision all the time?
  • Problem: The developer (client) has to re-architect the application in order to take advantage of different data models
      • How easy is it to change the architecture and the implementation?

[Figure: as the workload (# of queries / sec) evolves, the application moves through Ver 1.0 to Ver 4.0 across a single RDBMS, clustering, sharding, and a key-value store]

SLIDE 20

Remember Data Independence?

[Figure labels: 1968, 1970]

SLIDE 21

Data Independence

  • Decouple application logic from data processing
  • Let them be optimized and managed independently
  • Enabled decades of innovation and improvement in databases

SLIDE 22

Data Independence

  • The application should not have to be aware of the physical organization of the data (and how it can be accessed)
  • All it needs is a logical (declarative) specification
  • CloudDB makes decisions based on application context, workload characteristics, etc.

CloudDB: A layer for data independence

[Figure: the application sends data loads and queries/updates through a SQL API; CloudDB sits between the application and a relational store, a key/value store, and an analytics store. Axis label: # of queries / sec]
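To make the idea of a data independence layer concrete, here is a minimal sketch of a routing decision driven by workload characteristics. Everything in it is an assumption for illustration: the `WorkloadProfile` type, the `choose_store` function, and the thresholds are hypothetical and are not taken from the talk. The two measures used are the same ones that label the axes in the deck (# of queries / sec and # of records / query).

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Hypothetical summary of how an application uses its data."""
    queries_per_sec: float      # request rate
    records_per_query: float    # average number of records touched per query


def choose_store(profile: WorkloadProfile) -> str:
    """Pick a backing store from workload characteristics.

    The thresholds are illustrative placeholders, not values from the talk.
    """
    if profile.records_per_query >= 100_000:
        return "analytics"      # large scans and aggregations
    if profile.queries_per_sec >= 10_000:
        return "key_value"      # very high request rates
    return "relational"         # default for everything else


# The application keeps issuing the same SQL; only the routing changes.
profile_v1 = WorkloadProfile(queries_per_sec=50, records_per_query=5)
profile_v2 = WorkloadProfile(queries_per_sec=50_000, records_per_query=1)
print(choose_store(profile_v1))  # relational
print(choose_store(profile_v2))  # key_value
```

The point of the sketch is that the decision lives in the middleware, so a workload that grows from `profile_v1` to `profile_v2` does not force the application to be re-architected.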

SLIDE 23

Language?

  • New breed of databases
      • CouchDB, Project Voldemort (Dynamo), Cassandra, BigTable, Tokyo Cabinet, MongoDB, SimpleDB, …
  • MapReduce/Hadoop
  • …

SLIDE 24

Some Reminders about SQL

  • By far the most widely used data access language
  • It has nothing to do with
      • How the data is stored
      • How the queries are executed
      • How the transactions are handled
  • Very large number of skilled programmers
  • Huge number of existing applications and tools

SLIDE 25

SQL is actually good?

  • HIVE: SQL API on top of MapReduce
  • Google BigQuery: SQL over data stored in non-relational databases
  • …

SLIDE 26

CloudDB - Guiding Principles

  • Embrace heterogeneity
      • One size does not fit all
      • Leverage specialized technologies
  • Maintain and restore the “declarative” nature of data processing
  • Understand and define dimensions of scalability

SLIDE 27

CloudDB Middleware – Opaque vs. Transparent

System Independence?

The middleware would be responsible for making all the decisions regarding the choice of data stores, processing the queries, and end-to-end system optimization.

While the middleware can abstract away the underlying storage systems, it should explicitly express certain essential aspects of the system, such as consistency levels and scalability of transactions.

[Figure: applications send SQL queries through the API/language support (SQL) layer to the CloudDB middleware, whose distributed query processor sits above the data stores and returns results. Other labels: Transaction Patterns, Consistency / Scalability, Opaque, Transparent]

SLIDE 28

CloudDB Platform

[Architecture diagram. Components shown:]

  • (External) Applications, SQL Queries, Results
  • API/Language Support (JDBC, SQL)
  • Distributed Query Processor
  • Intelligent Cloud Database Coordinator (ICDC)
  • Workload Analysis, Design Optimizer, System Monitor, Database Cluster Controller
  • Client SLAs, SLA Aware Dispatcher, Schedulers
  • Capacity Planner, Multi Tenancy Manager (MTM)
  • CloudDB Store: Relational Store, Analytics Store, Key-Value Store
  • Internal Query Processing, Auto Sharding, Auto Replication, Auto Partitioning
  • Data Migration

SLIDE 29

CloudDB Platform – Key Points

[Same architecture diagram as on the CloudDB Platform slide, annotated with:]

  • One unified, standard API
  • Intelligent analysis and decision making
  • Specialized stores for specific needs

SLIDE 30

Our Data Management Platform: Key Research Areas

[Same architecture diagram as on the CloudDB Platform slide, annotated with:]

  • Research areas: Intelligent Management, Workload Management, Data Stores
  • Specialized stores for specific needs
  • Intelligent analysis and decision making
  • One unified, standard API

SLIDE 31

CloudDB System Architecture: Microsharding is a part of CloudDB

[Same architecture diagram as on the CloudDB Platform slide, with the added label: Microsharding]

SLIDE 32

SQL over Key-Value Stores

  • Microsharding to enable SQL over key-value stores
  • Key challenge: limited access capabilities (only key-based put/get)

[Figure: applications issue SQL to query execution nodes (relational middleware, a pool of servers), which use key access to reach the storage nodes (storage cloud, a pool of servers) of the key-value store]

SLIDE 33

Microsharding

  • Key-value stores are good at scaling write-intensive workloads
  • But they don’t leverage a large body of technologies developed in databases over the decades, such as:
      • Relationships
      • Transactions
      • Advanced query functions, etc.
  • These are hand-coded by developers
  • Microsharding aims at bringing those capabilities into key-value stores in a principled way

SLIDE 34

Key Technical Questions Addressed

  • How can we map relational schemas to key-value store data models?
  • How can we map relational tuples to key-value objects?
  • Once we have those mappings, how can we define transaction classes that can be supported in a scalable way in key-value stores?
  • What are the system implementation issues with such a middleware?

SLIDE 35

Query and Data Transformation

  • Physical design: mapping between relational data and K/V data

Schema (+data):
    TABLE users ( id: primary key … )
    TABLE reviews ( id: primary key, user_id: foreign key to users … )

Query (template):
    SELECT * FROM users, reviews WHERE users.id = reviews.user_id AND users.id = ?

Physical design:
    NEST reviews BY user_id ….

Transformed data (KV data): “microshard” User[Review], a users tuple stored together with its reviews tuples

Query plan: GET, then UNNEST
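The NEST / GET / UNNEST pipeline on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the CloudDB implementation: a Python dict stands in for the key-value store, the sample rows are made up, and the function names `nest` and `users_join_reviews` are hypothetical.

```python
from collections import defaultdict

# Made-up sample data for the users and reviews tables.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
reviews = [
    {"id": 10, "user_id": 1, "text": "great"},
    {"id": 11, "user_id": 1, "text": "ok"},
    {"id": 12, "user_id": 2, "text": "bad"},
]


def nest(users, reviews):
    """Physical design step: NEST reviews BY user_id under each user."""
    by_user = defaultdict(list)
    for review in reviews:
        by_user[review["user_id"]].append(review)
    # One key-value object (a User[Review] microshard) per users tuple,
    # keyed by the primary key of the root relation.
    return {u["id"]: {"user": u, "reviews": by_user[u["id"]]} for u in users}


def users_join_reviews(kv_store, user_id):
    """Query plan for the join template: one GET, then UNNEST."""
    shard = kv_store.get(user_id)          # GET by key
    if shard is None:
        return []
    user_cols = {f"users.{col}": val for col, val in shard["user"].items()}
    # UNNEST: flatten the nested reviews back into joined rows.
    return [
        {**user_cols, **{f"reviews.{col}": val for col, val in review.items()}}
        for review in shard["reviews"]
    ]


kv_store = nest(users, reviews)
for row in users_join_reviews(kv_store, 1):
    print(row)
```

Because all rows needed by the query template live under one key, the join for a given `users.id` needs only a single key-based read, which is the access pattern a key-value store supports.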

SLIDE 36

Microsharding

  • A microshard is
      • a logical unit of data
      • a principled way to shard a database into small fragments
      • a unit of transactional data access
      • accessed by its key, the key of the root relation

[Figure: microshards with Key = 1, Key = 2, Key = 3, …, Key = N; two transactions on Users key = 1, one on Users key = 2, and one on Users key = 3]
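One way to treat a microshard as the unit of transactional access is a versioned read-modify-write on a single key. The slides do not say how transactions are implemented, so the sketch below swaps in plain optimistic concurrency control as an assumption; `MicroshardStore`, `put_if_version`, and `run_transaction` are hypothetical names, and an in-memory dict stands in for the key-value store.

```python
import copy


class MicroshardStore:
    """In-memory stand-in for a key-value store with versioned values."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, copy of value); version 0 means 'absent'."""
        version, value = self._data.get(key, (0, None))
        return version, copy.deepcopy(value)

    def put_if_version(self, key, expected_version, value):
        """Write only if nobody else has written since we read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            return False
        self._data[key] = (current_version + 1, copy.deepcopy(value))
        return True


def run_transaction(store, key, update, max_retries=5):
    """Apply `update` to one microshard, retrying on write conflicts."""
    for _ in range(max_retries):
        version, shard = store.get(key)
        new_shard = update(shard)
        if store.put_if_version(key, version, new_shard):
            return new_shard
    raise RuntimeError("too many conflicting writers on this microshard")


def add_review(shard):
    shard["reviews"].append({"id": 10, "text": "great"})
    return shard


store = MicroshardStore()
store.put_if_version(1, 0, {"user": {"id": 1}, "reviews": []})
run_transaction(store, 1, add_review)
```

A transaction touches exactly one key here, so two transactions on different microshards never coordinate with each other, while two transactions on the same key are serialized by the version check.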

SLIDE 37

Isolation Levels

  • No consistency guarantee on reads/writes outside of a microshard

[Figure: transactions (T) are organized into transaction groups, which are distributed on query execution nodes; microshards are distributed on the key-value store]

SLIDE 38

Scale Independence

Experiment Setup

  • RUBiS benchmark (eBay-type auction application)
  • Read/write workload (transition matrix)
  • Short think time to saturate the system
  • Voldemort (Dynamo) key-value store

[Figure: throughput (1000 sessions / sec) versus number of emulated concurrent clients (thousands), with one curve each for 3, 4, 5, and 6 Voldemort nodes]

Message: Ability to automatically scale to more concurrent sessions (throughput) simply by increasing the number of key-value nodes
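The reason adding key-value nodes can raise throughput is that microshard keys are spread across the nodes, so each node serves a smaller share of the keys. The sketch below uses simple hash-modulo placement as an assumption for illustration; it is not necessarily the partitioning scheme Voldemort uses, and `node_for_key` and `keys_per_node` are hypothetical helpers.

```python
import hashlib
from collections import Counter


def node_for_key(key: int, num_nodes: int) -> int:
    """Map a microshard key to a node index by hashing (illustrative only)."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_nodes


def keys_per_node(keys, num_nodes: int) -> Counter:
    """Count how many microshard keys land on each node."""
    return Counter(node_for_key(k, num_nodes) for k in keys)


# Compare how 10,000 microshard keys spread over 3 to 6 nodes,
# the node counts used in the experiment.
for n in (3, 4, 5, 6):
    load = keys_per_node(range(10_000), n)
    print(n, "nodes -> busiest node holds", max(load.values()), "keys")
```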

SLIDE 39

Directions/Questions

  • Support for specifying relaxed consistency
      • Tooling to relax consistency just to the degree that there exists a feasible solution (physical design and query plans) for the specification
  • Scalable data organization over heterogeneous data stores
      • Physical design over heterogeneous stores such that the service level specifications are met
  • Scalability vs. consistency

SLIDE 40

The Cast

  • NEC Labs Researchers
      • Hakan Hacigumus
      • Yun Chi
      • Wang-Pin Hsiung
      • Hojjat Jafarpour
      • Hyun J. Moon
      • Oliver Po
      • Junichi Tatemura
      • Jagan Sankaranarayanan
  • Advisors/Collaborators
      • Michael Carey (U. of California, Irvine)
      • Hector Garcia-Molina (Stanford)
      • Jeff Naughton (U. of Wisconsin, Madison)

SLIDE 41

CloudDB would be…

  • A unified data management platform that provides capabilities to transparently and efficiently support heterogeneous workloads by leveraging specialized storage models with SLA-conscious profit optimization in the cloud.

SLIDE 42

Thank You!