Data Systems for the Cloud
Instructor: Matei Zaharia
cs245.stanford.edu
Outline
What is the cloud and what’s different with it? S3 & Dynamo: object stores Aurora: transactional DBMS BigQuery: analytical DBMS Delta Lake: ACID over object stores
CS 245 2
What is Cloud Computing?
Computing as a service, managed by an external party
» Software as a Service (SaaS): application hosted by a provider, e.g. Salesforce, Gmail » Platform as a Service (PaaS): APIs to program against, e.g. DB or web hosting » Infrastructure as a Service (IaaS): raw computing resources, e.g. VMs on AWS
Large shift in industry over the past 20 years!
History of Cloud Computing
Old idea, but became successful in the 2000s
1960s: “utility computing” first used, talking about shared mainframes
1999: Salesforce
Early 2000s: Sun Cloud, HP Utility Datacenter, Loudcloud, VMware; virtual private web servers
2006: Amazon S3, EC2
2011: Google BigQuery; 2012: AWS Redshift
2014: AWS Aurora, Lambda
Public vs Private Cloud
Public cloud = the provider is another company (e.g. AWS, Microsoft Azure)
Private cloud = internal PaaS/IaaS system (e.g. VMware)
We’ll discuss the public cloud here since it is more interesting & common
Development Process

Traditional software: the vendor’s dev team ships a release every 6-12 months; each customer’s own users and ops staff run it.
Cloud software: a combined dev + ops team releases every 1-2 weeks and operates the service for all users.
Why Might Customers Use Cloud Services?
Management built-in: more value than the software bits alone (security, availability, etc.)
Elasticity: pay-as-you-go, scale on demand
Better features, released faster
Differences in Building Cloud Software
+ Release cycle: send releases to users faster, get feedback faster
+ Only need to maintain 2 software versions (current & next), with fewer configurations than on-premises software
– Upgrading without regressions: critical for users to trust your cloud, because updates are forced
Differences in Building Cloud Software
– Building a multitenant service: significant scaling, security, and performance-isolation work
– Operating the service: security, availability, monitoring, scalability, etc.
+ Monitoring: see usage live for operations and product analytics
How Do These Factors Affect Data Systems?
Data systems already had to support many users robustly, but a few challenges and opportunities arise:
» Much larger scale: millions of users, VMs, … » Multitenancy: users don’t trust each other, so need stronger security, perf isolation, etc » Elasticity: how can our system be elastic? » Updatability: avoid regressions & downtime
Outline
What is the cloud and what’s different with it? S3 & Dynamo: object stores Aurora: transactional DBMS BigQuery: analytical DBMS Delta Lake: ACID over object stores
S3, Dynamo & Object Stores
Goal: I just want to store some bytes reliably and cheaply for a long time
Interface: key-value store
» Objects have a key (e.g. bucket/imgs/1.jpg) and value (arbitrary bytes) » Values can be up to a few TB in size » Can only do operations on 1 key atomically
Consistency: eventual consistency
Store trillions of objects and exabytes of data
Example: S3 API
PUT(key, value): write object with a key
» Atomic update: replaces the whole object
GET(key, [range]): return object with a key
» Can optionally read a byte range in the object
LIST([startKey]): list keys in a bucket in lexicographic order, starting at a given key
» Limit of 1000 returned keys per call
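The semantics above can be sketched with a toy in-memory store (`MiniObjectStore` is a hypothetical name, not the real S3 client; with boto3 the analogous calls are `put_object`, `get_object` with a `Range` header, and `list_objects_v2`):

```python
import bisect

class MiniObjectStore:
    """Toy key-value store mimicking the S3 API surface (not real S3)."""
    LIST_LIMIT = 1000  # S3 returns at most 1000 keys per LIST call

    def __init__(self):
        self._objects = {}  # key -> bytes
        self._keys = []     # sorted keys, for lexicographic LIST

    def put(self, key, value):
        # Atomic whole-object replace: there is no partial update.
        if key not in self._objects:
            bisect.insort(self._keys, key)
        self._objects[key] = value

    def get(self, key, byte_range=None):
        # byte_range is an optional (start, end) pair, inclusive, like HTTP Range.
        value = self._objects[key]
        if byte_range is not None:
            start, end = byte_range
            return value[start:end + 1]
        return value

    def list(self, start_key=""):
        # Keys in lexicographic order, starting at start_key, capped at LIST_LIMIT.
        i = bisect.bisect_left(self._keys, start_key)
        return self._keys[i:i + self.LIST_LIMIT]
```

Note that the only atomic unit is a single PUT of a whole object; there is deliberately no multi-key operation besides LIST.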
S3 Consistency Model
Eventual consistency: different readers may see different versions of the same object
Read-your-own-writes for new PUTs: if you GET a new object that you PUT, you see it
» Unless you had previously called GET while it was missing, in which case you might not!
No guarantee for LIST after PUT: you may not see the new object in LIST!
Why These Choices?
The primary goal is scale: keep the interface very simple to support trillions of objects
» No cross-object operations except LIST!
Mostly target immutable or rarely changing data, so consistency is not as important
Can try to build stronger consistency on top
Implementing Object Stores
Goals

(figure comparing design goals not reproduced)
Obviously different for S3!
Dynamo Implementation
Commodity nodes with local storage on disks
Nodes form a “ring” to split up the key space among them
» Actually, each node covers many ranges (overpartitioning)
Use quorums and gossip to manage updates to each key
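The ring with overpartitioning can be sketched with consistent hashing and virtual nodes (a minimal sketch; `HashRing`, `ring_hash`, and the vnode count are hypothetical, not Dynamo’s actual parameters):

```python
import bisect
import hashlib

def ring_hash(s):
    # Map a string to a point on the ring [0, 2**32).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

class HashRing:
    """Consistent-hash ring with virtual nodes (overpartitioning)."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node owns `vnodes` points on the ring.
        self._ring = sorted((ring_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def owner(self, key):
        # The first virtual node clockwise of the key's hash owns the key.
        i = bisect.bisect(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]

    def preference_list(self, key, n=3):
        # Walk clockwise collecting n distinct physical nodes (the replica set).
        i = bisect.bisect(self._points, ring_hash(key)) % len(self._ring)
        nodes = []
        while len(nodes) < n:
            node = self._ring[i % len(self._ring)][1]
            if node not in nodes:
                nodes.append(node)
            i += 1
        return nodes
```

Adding or removing a node only moves the ranges adjacent to its virtual nodes, which is why the ring scales membership changes well.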
Reads and Writes to Dynamo
Quorums with configurable # of writers and readers required for success
» E.g. 3 nodes, write to 2, read from 2 » E.g. 3 nodes, write to 2, read from 1 (weaker consistency!)
Nodes gossip & merge updates in an application-specific way
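A minimal sketch of configurable quorums (versioning and gossip-based merge are simplified to a single integer version; `QuorumKV` is a hypothetical name). It shows why W=2, R=2 over 3 nodes overlaps, while W=2, R=1 can miss the latest write:

```python
class QuorumKV:
    """Quorum replication sketch: N replicas, write to W, read from R."""

    def __init__(self, n, w, r):
        self.replicas = [dict() for _ in range(n)]  # key -> (version, value)
        self.n, self.w, self.r = n, w, r

    def write(self, key, value, version, up=None):
        # Succeeds only if at least W replicas acknowledge the write.
        up = range(self.n) if up is None else up
        acks = 0
        for i in up:
            self.replicas[i][key] = (version, value)
            acks += 1
        return acks >= self.w

    def read(self, key, up=None):
        # Contact R replicas and return the highest-version value seen.
        up = list(range(self.n)) if up is None else up
        seen = [self.replicas[i].get(key) for i in up[: self.r]]
        seen = [s for s in seen if s is not None]
        return max(seen)[1] if seen else None
```

With W + R > N, every read quorum intersects every write quorum, so a read is guaranteed to see at least one up-to-date replica.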
(diagram: Client 1 writes to a quorum of nodes while Client 2 reads from a quorum)
Usage of Object Stores
Very widely used (probably the largest storage systems in the world)
But the semantics can be complex
» E.g. many users try to mount these as file systems but they’re not the same
Outline
What is the cloud and what’s different with it? S3 & Dynamo: object stores Aurora: transactional DBMS BigQuery: analytical DBMS Delta Lake: ACID over object stores
Amazon Aurora
Goal: I want a transactional DBMS managed by the cloud vendor
Interface: same as MySQL/Postgres
» ODBC, JDBC, etc
Consistency: strong consistency (similar to traditional DBMSes)
Some of the largest & most profitable cloud services
Initial Attempt at DBMS on AWS
Just run an existing DBMS (e.g. MySQL) on cloud VMs, and use replicated disk storage
Same thing users would do on-premise
(diagram: primary and backup MySQL instances over a replicated disk such as EBS; the backup applies the log to recreate the same pages)
Problems with This Model
Elasticity: doesn’t leverage the elastic nature of the cloud, or give users elasticity
Efficiency: mirroring and disk-level replication is expensive at global scale
Inefficiency of Mirrored DBMS
Write amplification: each write at the app level results in many writes to physical storage
For Aurora, Amazon wanted “4 out of 6” quorums (3 zones, with 2 nodes in each zone)
Aurora’s Design
Implement replication at a higher level: only replicate the redo log (not disk blocks)
Enable an elastic frontend and backend by decoupling API & storage servers
» Lower cost and higher performance per tenant
Aurora’s Design
(diagram: database instances ship only the redo log to a shared, replicated storage service, which rebuilds pages locally)
Design Details
Logging uses an async quorum: wait until 4 of 6 nodes reply (faster than waiting for all 6)
Each storage node takes the log and rebuilds the DB pages locally
Care taken to handle incomplete logs due to async quorums
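The 4-of-6 async quorum can be sketched with a thread pool: submit the log append to all six nodes, but unblock as soon as four acknowledge (a minimal sketch; node latency is simulated with sleeps, and `quorum_append`/`append_log` are hypothetical names):

```python
import concurrent.futures as cf
import time

def append_log(node_id, record, delay):
    # Stand-in for shipping one redo-log record to a storage node;
    # the sleep simulates network + disk latency.
    time.sleep(delay)
    return node_id

def quorum_append(record, delays, needed=4):
    """Return as soon as `needed` of the storage nodes acknowledge."""
    acks = []
    with cf.ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(append_log, i, record, d)
                   for i, d in enumerate(delays)]
        for fut in cf.as_completed(futures):
            acks.append(fut.result())
            if len(acks) >= needed:
                break  # commit point: no need to hear from the slowest nodes
    return acks
```

In this sketch the executor still joins the slow threads on exit; a real system leaves the slow appends in flight and later reconciles the incomplete logs, which is the “care taken” point above.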
Performance

(performance comparison chart not reproduced)
Other Features from this Design
Rapidly add or remove read replicas
Serverless Aurora (only pay when actively running queries)
Efficient DB recovery, cloning, and rollback (use a prefix of the log and older pages)
Plus many cloud-oriented features, e.g. zero-downtime updates
Outline
What is the cloud and what’s different with it? S3 & Dynamo: object stores Aurora: transactional DBMS BigQuery: analytical DBMS Delta Lake: ACID over object stores
Google BigQuery
Goal: I want a cheap & fast analytical DBMS managed by the cloud vendor
Interface: SQL, JDBC, ODBC, etc.
Consistency: depends on the storage chosen (object stores or richer table storage)
Traditional Data Warehouses
Provision a fixed set of nodes that have both storage and computing
» Big servers with lots of disks, etc » Makes sense when buying servers on-premise
Problem: no elasticity! Interestingly, this was the model chosen by AWS Redshift initially (using ParAccel)
BigQuery and Other Elastic Analytics Systems
Separate compute and storage
» One set of nodes (or the cloud object store) stores data, usually over 1000s of nodes » Separate set of nodes handle queries (again, possibly scaling out to 1000s)
Users pay separately for storage & queries
Get the performance of 1000s of servers to run a query, but only pay for a few seconds of use
Results
These elastic services generally provide better performance and cost for ad-hoc small queries than launching a cluster
For big organizations or long queries, paying per query can be challenging, so these services let you bound the total # of nodes
Interesting Challenges
User-defined functions (UDFs): need to isolate across tenants (e.g. in separate VMs)
Scheduling: how to quickly launch a query on many nodes and combine results? How to isolate users from each other?
Indexing: BigQuery tries to mostly do scans over column-oriented files
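The scheduling question is essentially scatter-gather: fan one query out over many partitions, then combine the partial results. A minimal sketch, with threads standing in for worker nodes (`scatter_gather_count` and `scan_partition` are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows, predicate):
    # Each "node" scans one partition and returns a partial count.
    return sum(1 for r in rows if predicate(r))

def scatter_gather_count(partitions, predicate, max_workers=8):
    """Fan a scan out across partitions, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda p: scan_partition(p, predicate), partitions)
        return sum(partials)
```

The combine step here is a simple sum; for other aggregates the partial results must be mergeable (e.g. partial sums and counts for AVG).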
Outline
What is the cloud and what’s different with it? S3 & Dynamo: object stores Aurora: transactional DBMS BigQuery: analytical DBMS Delta Lake: ACID over object stores
Delta Lake Motivation
Object stores are the largest, most cost-effective storage systems, but their semantics make it hard to manage mutable datasets
Goal: analytical table storage over object stores, built as a client library (no other services)
Interface: relational tables with SQL queries
Consistency: serializable ACID transactions
Open source at https://delta.io
Setup
(diagram: multiple jobs access the table’s objects directly through the Delta Lake client library, with no separate service in between)
Naïve Way to Use Object Stores for Tables
“Just a bunch of objects”: a table is a set of files (maybe partitioned on some fields)
mytable/date=2020-01-01/p1.parquet mytable/date=2020-01-01/p2.parquet mytable/date=2020-01-02/p1.parquet mytable/date=2020-01-02/p2.parquet mytable/date=2020-01-02/p3.parquet mytable/date=2020-01-03/p1.parquet ...
(each date=… partition directory holds columnar files of records for that date)
Problems with “Just Objects”
No multi-object transactions
» Hard to insert multiple objects at once (what if your load job crashes partway through?) » Hard to update multiple objects at once (e.g. delete a user or fix their records) » Hard to change data layout & partitioning
Poor performance
» LIST is expensive (only 1000 results/request!) » Can’t do streaming inserts (too many small files) » Expensive to load metadata (e.g. column stats)
Example Problems
Delta Lake’s Approach
Can we implement a transaction log on top of the object store to retain its scale & reliability but provide stronger semantics?
Inspiration: Bolt-On Consistency
Delta Lake Implementation
Table = directory of data objects, with a set of log objects stored in _delta_log subdir
» Log specifies which data objects are part of the table at a given version
One log object for each write transaction, in order: 000001.json, 000002.json, etc.
Periodic checkpoints of the log in Parquet format contain object list + column statistics
Delta Table Example
mytable/date=2020-01-01/1b8a32d2ad.parquet
mytable/date=2020-01-01/a2dc5244f7.parquet
mytable/date=2020-01-02/f52312dfae.parquet
mytable/date=2020-01-02/ba68f6bd4f.parquet   (data objects, partitioned by date field)
mytable/_delta_log/00001.json
mytable/_delta_log/00002.json
mytable/_delta_log/00003.json
mytable/_delta_log/00003.parquet   (checkpoint coalescing log records 1–3)
mytable/_delta_log/00004.json
mytable/_delta_log/00005.json
mytable/_delta_log/_last_checkpoint   (contains {version: “00003”})

Each .json log record holds one transaction’s operations, e.g.:
add date=2020-01-01/a2dc5244f7f7.parquet
add date=2020-01-02/ba68f6bd4f1e.parquet
Log Record Types
Add data object (+ its column statistics)
Remove data object
Change metadata, e.g. table schema or Delta Lake format version
A few others for streaming writes (allows treating a table like a message bus)
Writing to Delta Lake
1) Add new objects in the data directories; readers will ignore them because the log has no “add” entries for them
2) Try to add a new log record with the next valid log record number (e.g. 00006.json)
» Various ways to make this atomic per cloud
3) Optional: write a new Parquet checkpoint
What if one of these steps fails?
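A minimal sketch of this write protocol, using a local directory in place of an object store and JSON files in place of Parquet. The atomic put-if-absent in step 2 is simulated here with `O_CREAT | O_EXCL` (real implementations use per-cloud mechanisms, e.g. conditional puts or a coordination service); `write_transaction` is a hypothetical helper:

```python
import json
import os
import uuid

def write_transaction(table_dir, new_records, partition):
    """Sketch of the Delta write protocol over a local directory."""
    # Step 1: write the data object; readers ignore it until it is logged.
    data_name = f"{partition}/{uuid.uuid4().hex}.json"  # stand-in for Parquet
    os.makedirs(os.path.join(table_dir, partition), exist_ok=True)
    with open(os.path.join(table_dir, data_name), "w") as f:
        json.dump(new_records, f)

    # Step 2: atomically claim the next log record number.
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    while True:
        existing = [int(n[:-5]) for n in os.listdir(log_dir) if n.endswith(".json")]
        version = max(existing, default=0) + 1
        path = os.path.join(log_dir, f"{version:05d}.json")
        try:
            # O_CREAT | O_EXCL fails if the record exists -> atomic put-if-absent
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # lost the race: another writer committed; retry
        with os.fdopen(fd, "w") as f:
            json.dump({"add": [data_name]}, f)
        return version
```

If step 1 fails, the orphaned data object is invisible (no log entry points to it); if step 2 fails, the whole transaction simply never happened, which is why the ordering of the steps matters.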
What Kind of Concurrency Approach is This?
Optimistic! Even simpler than validation
Also MVCC: keep old data versions around
Why is this okay for Delta Lake’s workloads?
Reading from Delta Lake
1) Read the _last_checkpoint object to find a checkpoint number
2) Read that Parquet checkpoint, and use LIST to find any newer .json log records after it
3) Determine which objects are “add”ed but not “remove”d in those logs, and read those
» Use column min/max stats to prune data
What if one step sees old versions of that data?
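Step 3’s replay of “add” and “remove” records can be sketched as follows (`live_objects` is a hypothetical helper; the checkpoint is reduced to a plain list of object names):

```python
def live_objects(checkpoint_adds, log_records):
    """Replay Delta log records in order to find the table's current objects.

    checkpoint_adds: object names listed in the last Parquet checkpoint.
    log_records: newer records in log order, each shaped like
                 {"add": [...]} and/or {"remove": [...]}.
    """
    live = set(checkpoint_adds)
    for record in log_records:
        live |= set(record.get("add", []))
        live -= set(record.get("remove", []))
    return live
```

Because log records are replayed in record-number order, every reader that sees the same prefix of the log computes the same snapshot of the table.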
Isolation Levels
Transactions with writes are serializable: one serial order, given by log record numbers
Read transactions can get either snapshot isolation (read an older version) or serializability (by adding a dummy write)
Takeaway: by using atomic operations on just one object at a time (the last log record key), we got ACID transactions for a whole table!
Impact on Performance
Reading the list of object names from a Parquet file is much faster than making many LIST operations
Reading column stats from this file is also faster than issuing range GETs on each object
(chart: time in seconds, log scale, vs. number of partitions from 1000 to 1M, for Apache Spark on Delta with and without cache, Apache Spark on Parquet, Apache Hive on Parquet, and Presto on Parquet)
Other Features from this Design
Caching data & log objects on workers is safe because they are immutable
Time travel: can query or restore an old version of the table while those objects are retained
Background optimization: compact small writes or change the data ordering (e.g. Z-order) without affecting concurrent readers
Audit logging: who wrote to the table?
Applications & Impact
Delta Lake now processes exabytes of data per day across Databricks & open source users
Reduced support escalations relating to cloud storage from ~50% to nearly none
The largest tables hold petabytes of data across billions of data objects
Example Application
Other “Bolt-On” Systems
Apache Hudi (at Uber) and Iceberg (at Netflix) also offer table storage on S3
Google BigTable was built over GFS
Filesystems that use S3 as a block store (e.g. early Hadoop s3:/, Goofys, MooseFS)
Conclusion
Cloud computing requires changes in data management systems
» Elasticity with separate compute & storage » Very large scale » Multitenancy: security, performance isolation » Updating without regressions
Can design and analyze these systems using the ideas we saw!