Data in the Cloud: Happy 10th ACM SoCC! Raghu Ramakrishnan



slide-1
SLIDE 1

Data in the Cloud

Happy 10th ACM SoCC!

Raghu Ramakrishnan

CTO for Data, Technical Fellow

slide-2
SLIDE 2

2010 2011 2012 2013 2014 2015 2016 2017 2018 2019

ACM SoCC Topics Over the Past 10 Years

Word clouds courtesy Carlo Curino

slide-3
SLIDE 3

2010 2011 2012 2013 2014 2015 2016 2017 2018 2019

ACM SoCC Topics After Filtering “data” and “cloud”

Word clouds courtesy Carlo Curino

slide-4
SLIDE 4


slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8

Carbon Footprint of Cloud Computing

Cloud services compared with a traditional datacenter baseline:

  • Azure Compute vs. virtualized servers, high-end deployment: 52% more efficient (electricity/core-hour)
  • Azure Storage vs. dedicated storage, high-end deployment: 71% more efficient (electricity/TB-year)
  • Exchange Online vs. large, high-end deployment: 77% more efficient (electricity/mailbox-year)
  • SharePoint Online vs. large, high-end deployment: 22% more efficient (electricity/user-year)

http://download.microsoft.com/download/7/3/9/739BC4AD-A855-436E-961D-9C95EB51DAF9/Microsoft_Cloud_Carbon_Study_2018.pdf

slide-9
SLIDE 9

AI in Operation & Optimization

IoT and big data platforms make it increasingly easy to optimize datacenters:

  • IoT telemetry, analytics and ML
  • Optimization
  • Predictive maintenance
  • Capacity planning and workload placement

Microsoft Confidential

slide-10
SLIDE 10


slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14

Ubiquitous Data

slide-15
SLIDE 15
slide-16
SLIDE 16

  • Scale
  • Heterogeneity
  • Many engines
  • Many workloads
  • Data silos
  • Elastic storage
  • Elastic compute
  • Network latency

slide-17
SLIDE 17

OPERATIONAL

  • RELATIONAL: Azure SQL DB, update-optimized (Meta data, XACT_STATE, Governance)
  • NON-RELATIONAL: Azure Cosmos DB, document model (Meta data, XACT_STATE, Governance)

ANALYTICS

  • RELATIONAL: Azure SQL DW, analytics-optimized (Meta data, XACT_STATE, Governance)
  • NON-RELATIONAL: Spark, Hive, ML… over Data Lake (Meta data, Governance)

slide-18
SLIDE 18

Big Picture: Separation of Compute and State

  • Azure SQL DB, update-optimized (Meta data, XACT_STATE)
  • Azure SQL DW, analytics-optimized (Meta data, XACT_STATE)
  • Azure Cosmos DB, document model (Meta data, XACT_STATE)
  • Spark, Hive, ML… over Data Lake (Meta data)
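
The separation of compute and state on this slide can be illustrated with a minimal sketch. It is not taken from any Azure service: `SharedStore` and `ComputeNode` are hypothetical names, and an in-memory dictionary stands in for durable remote storage. The point is only that compute nodes hold no data of their own, so they can be added or removed without moving state.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Hypothetical durable store: owns data and metadata, runs no queries."""
    tables: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def write(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)
        self.metadata[table] = {"row_count": len(self.tables[table])}

    def scan(self, table):
        return list(self.tables.get(table, []))


class ComputeNode:
    """Stateless worker: keeps only a reference to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def total(self, table, column):
        return sum(row[column] for row in self.store.scan(table))


store = SharedStore()
store.write("orders", [{"amount": 10}, {"amount": 32}])

# Scale out: every node sees the same state without copying any data.
pool = [ComputeNode(f"node-{i}", store) for i in range(3)]
results = {node.name: node.total("orders", "amount") for node in pool}

# Scale in and replace a node: nothing is lost, because state lives in the store.
pool.pop()
replacement = ComputeNode("node-new", store)
```

In a real system the store would be remote, which is where the compute-storage latency concern raised later in the deck comes from.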

slide-19
SLIDE 19

Microsoft’s internal data lake

  • A data lake for all teams @Microsoft
  • Good developer tools
  • Batch, Interactive, Streaming, ML
  • Used across Office, Xbox, Azure, Windows, Ads, Bing, Skype, …

  • Production jobs and experimentation

Microsoft’s Internal Big Data Service

Azure Data Lake Store

HDFS as a PaaS cloud service

  • Microsoft’s serverless Big Data platform
  • Fully aligned with Hadoop ecosystem and standards, with full support for Hadoop tools and engines as well as unique Microsoft capabilities

  • Migrated to ADLS
  • 1P = 3P

Enabling business growth:

  • Office productivity revenue (45% YoY)*
  • Intelligent Cloud (100% YoY)*
  • Bing search share doubles

By the numbers

  • 9+ Exabytes of data, 8+ billion files
  • 100Ks of physical servers
  • Millions of interactive queries
  • Huge streaming pipelines
  • 100Ks of daily batch jobs
  • 15K+ developers
  • 300+ teams

MSR/GSL Collaboration

  • J. Zhou et al., SCOPE: parallel databases meet MapReduce, VLDBJ 21(5)
  • R. Ramakrishnan et al., Azure Data Lake Store, SIGMOD 2017
  • Apache YARN Federation

slide-20
SLIDE 20

Traditional MPP DW Architecture

[Architecture diagram. Labels: Data movement channel; DQE communication channel; Compute; Remote storage; Adaptive cache; DMS; SQL; SSD Cache; Meta data; Transactions; Premium; Standard; Snapshot backups; Data; Log]

slide-21
SLIDE 21

Cloud-Native Scale-Out, Data Heterogeneity

[Architecture diagram. Labels: Data movement channel; Compute; Remote storage; ES; SQL; SSD Cache; Distribution-less; Columnar files; Standard; Meta data; Transactions; Centralized services]

  • Data and state separated from compute
  • Fault-tolerant scale-out
  • Online scaling
  • Data heterogeneity

➢ Converge DW and Lake

slide-22
SLIDE 22

Polaris Concurrency – Workload Aware Scheduling

A next-generation distributed query engine (blend massive-scale batch QP with scale-up (small scale-out) interactive QP)

[Figure: workload task graph for Query 1 and Query 2, with tasks (including Agg) annotated by % resource demand. Panels: Workload Tasks; Workload Task Graph; Task-cost Driven Scheduling; Resource Aware Task Placement]

State Machine Execution

State Transition

States

Waiting Execution Executing Failed Completed

Edges are precedence constraints.

A Global Workload Graph enables workload optimizations across queries.

State Machines:

  • Guarantee that precedence constraints are satisfied
  • Define a formal model of how we recover from failures
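
A minimal sketch of the idea on this slide, not Polaris itself: each task is a small state machine, edges in the task graph are precedence constraints, and a scheduler admits ready tasks while their resource demand fits. Everything here is an assumption made for illustration: the four state names, the transition table (in particular treating Failed → Waiting as a retry), the `schedule_round` helper, the largest-demand-first order, and the demand percentages.

```python
from enum import Enum


class State(Enum):
    WAITING = "waiting"
    EXECUTING = "executing"
    FAILED = "failed"
    COMPLETED = "completed"


# Assumed legal transitions; FAILED -> WAITING models recovery by retry.
TRANSITIONS = {
    State.WAITING: {State.EXECUTING},
    State.EXECUTING: {State.COMPLETED, State.FAILED},
    State.FAILED: {State.WAITING},
    State.COMPLETED: set(),
}


class Task:
    def __init__(self, name, demand, deps=()):
        self.name = name
        self.demand = demand      # % of cluster resources the task needs
        self.deps = list(deps)    # tasks that must complete first
        self.state = State.WAITING

    def move(self, new_state):
        # Reject any transition the state machine does not allow.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.name}: {self.state.name} -> {new_state.name}")
        self.state = new_state


def schedule_round(tasks, capacity=100):
    """Start every ready task that fits into the remaining capacity."""
    used = sum(t.demand for t in tasks if t.state is State.EXECUTING)
    ready = [t for t in tasks
             if t.state is State.WAITING
             and all(d.state is State.COMPLETED for d in t.deps)]
    started = []
    # Largest demand first is an arbitrary illustrative choice.
    for task in sorted(ready, key=lambda t: t.demand, reverse=True):
        if used + task.demand <= capacity:
            task.move(State.EXECUTING)
            used += task.demand
            started.append(task.name)
    return started


# Two scans feed one aggregation; demands are made-up percentages.
scan_a = Task("scan_a", 40)
scan_b = Task("scan_b", 70)
agg = Task("agg", 25, deps=[scan_a, scan_b])
tasks = [scan_a, scan_b, agg]

round1 = schedule_round(tasks)   # scan_b (70) starts; scan_a (40) does not fit
scan_b.move(State.COMPLETED)
round2 = schedule_round(tasks)   # scan_a now fits; agg is still blocked
scan_a.move(State.FAILED)        # simulate a failure ...
scan_a.move(State.WAITING)       # ... and recover by re-queuing the task
round3 = schedule_round(tasks)
scan_a.move(State.COMPLETED)
round4 = schedule_round(tasks)   # both predecessors completed: agg runs
```

Because a task can only start from `WAITING` and only when all of its predecessors are `COMPLETED`, the precedence constraints hold by construction, and recovery is just another path through the transition table.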

slide-23
SLIDE 23

Scalability: All TPC-H Queries at 1PB Scale!

Elastic DQP – Unlimited Scale

slide-24
SLIDE 24

MSR Collaboration

[Diagram. Labels, repeated three times: sizeof(data), N]

P. Antonopoulos et al., Socrates: The New SQL Server in the Cloud, ACM SIGMOD 2019

slide-25
SLIDE 25

  • OLTP
  • Data Warehouses
  • NoSQL
  • Big Data / Lake

slide-26
SLIDE 26

Unified Data Suite and Governance

[Diagram. Labels: Spark, Hive, ML…; Data Lake (Meta data); SQL: Update-optimized (Meta data, XACT_STATE), Analytics-optimized (Meta data, XACT_STATE); Azure Cosmos DB: Document storage, Global apps (Meta data, XACT_STATE); Governance]

slide-27
SLIDE 27

Big Picture: Must Simplify Usability and Governance

  • Cloud
      • Elastic compute and storage are transformative
      • But compute-storage latency and bandwidth are a key challenge
      • Edge blurs cloud/on-prem separation
  • ML
      • An integral part of data processing, with a rapidly growing community of its own
  • Implications for Data Management
      • Rethink what belongs in a “DBMS”: ML, data governance
      • Rethink data architectures from the ground up: OLTP/Analytics/HTAP
slide-28
SLIDE 28

[Figure: enterprise ML lifecycle. Labels: App logic (offline, online); Model Scoring; Featurization; Live Data; Decisions; other data; featurization; Model Training; Model Development / Training; offline featurization; Model optimization; Model deployment; Model policies; Data Catalogs; Orchestration; Governance; Model Tracking & Provenance; Access Control; Logs & Telemetry; policies]

  • A. Agrawal et al., Cloudy with high chance of DBMS: a 10-year prediction for Enterprise-Grade ML, CIDR 2020.

GSL Collaboration

slide-29
SLIDE 29

Big Data and Data warehousing, BI, AI and ML, Data & Operational Systems

A single pane of glass to…

  • Manage data lifecycle (collect, clean, publish, discover, curate, …)
  • Ensure Data Quality & Correctness
  • Assess data compliance, privacy & protection
  • Author & manage data policy (access, use, retention, location, sharing)

Unified Governance

Across Cloud, Edge, On-Prem
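
The five policy dimensions listed above (access, use, retention, location, sharing) could be captured in a single record that any engine checks before touching data. The sketch below is purely illustrative: `DataPolicy`, `violations`, the field names, and the role and region values are hypothetical and do not describe a Microsoft product or API.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DataPolicy:
    """Hypothetical policy record covering the five dimensions on the slide."""
    allowed_roles: frozenset       # access
    allowed_purposes: frozenset    # use
    retention_days: int            # retention
    allowed_regions: frozenset     # location
    shareable_externally: bool     # sharing


def violations(policy, *, role, purpose, region, created, today, external=False):
    """Return the policy dimensions that a request would violate."""
    problems = []
    if role not in policy.allowed_roles:
        problems.append("access")
    if purpose not in policy.allowed_purposes:
        problems.append("use")
    if today - created > timedelta(days=policy.retention_days):
        problems.append("retention")
    if region not in policy.allowed_regions:
        problems.append("location")
    if external and not policy.shareable_externally:
        problems.append("sharing")
    return problems


policy = DataPolicy(
    allowed_roles=frozenset({"analyst"}),
    allowed_purposes=frozenset({"reporting"}),
    retention_days=30,
    allowed_regions=frozenset({"eu-west"}),
    shareable_externally=False,
)

# A compliant request: right role, purpose, and region, within retention.
ok = violations(policy, role="analyst", purpose="reporting", region="eu-west",
                created=date(2020, 1, 1), today=date(2020, 1, 15))

# A request that breaks every dimension except "use".
bad = violations(policy, role="intern", purpose="reporting", region="us-east",
                 created=date(2020, 1, 1), today=date(2020, 3, 1), external=True)
```

Keeping the policy as one declarative record, rather than scattering checks across engines, is what makes a single point of authoring and management possible.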