DATA WAREHOUSE BUILT FOR THE CLOUD (QCon San Francisco, November 2019)



SLIDE 1

DATA WAREHOUSE

BUILT FOR THE CLOUD

QCON San Francisco, November 2019

Thierry Cruanes, Co-Founder & CTO

SLIDE 2

THE DREAM DATA WAREHOUSE

(CIRCA 2012)

Extreme simplicity
  • No management tasks, offered as a service
  • Fast out of the box with no tuning knobs

Store all your data
  • Structured and semi-structured
  • Petabyte scale at very low cost

No compromises, full-fledged data warehouse
  • Full support for ACID transactions with read consistency
  • ANSI SQL, RBAC

Unlimited and instant scaling
  • No data silos
  • 10x faster for the same price, no over-provisioning

SLIDE 3

WHY THEN?

OUR VIEW OF THE CLOUD…


  • Storage became dirt cheap
  • Flat network offered uniform bandwidth
  • Single-core performance stalled
  • Data warehouse and analytic workloads are mostly CPU bound

Design for abundance and not scarcity of resources

SLIDE 4

THREE PILLARS


Multi-cluster, shared data architecture
  • Leverage cloud elasticity and pay only for what you use
  • Instant scale
  • Performance isolation
  • Real-time data sharing

Immutable, scalable storage
  • Extremely fast response time at scale
  • Fine-grained vertical and horizontal pruning on any column
  • Automatically applied to any data (structured and semi-structured)

Multi-tenant service
  • Self-tuning, self-healing
  • Transparent upgrades
  • Service architecture designed for availability, durability and security

SLIDE 5

ARCHITECTURE

SLIDE 6

AN ARCHITECTURE BUILT FOR THE CLOUD

Traditional architectures
  • Shared-disk: shared storage, single cluster
  • Shared-nothing: decentralized, local storage, single cluster

Multi-cluster, shared data
  • Centralized, scale-out storage
  • Multiple, independent compute clusters

SLIDE 7

MULTI-CLUSTER, SHARED DATA ARCHITECTURE

  • No data silos: storage decoupled from compute
  • Any data: native for structured and semi-structured
  • Unlimited scalability: along many dimensions
  • Low cost: compute on demand
  • Instant cloning: isolate prod from dev and QA
  • Highly available: 11 9’s durability, 4 9’s availability

[Diagram: shared databases (including a clone) accessed by separate virtual warehouses for Data Science, ETL & Data Loading, Finance, Dev/Test/QA, Dashboards and Marketing]

SLIDE 8

VIRTUAL WAREHOUSE

How can concurrent workloads run without impacting each other?

  • One or more MPP compute clusters
  • Unit of fault and performance isolation
  • Use multiple warehouses to segregate workloads
  • Resizable on the fly
  • Able to access data in any database
  • Transparently caches the data it accesses
  • Transaction manager synchronizes data access
  • Automatic suspend when idle and resume when needed

[Diagram: virtual warehouses A, B, C and D, each with its own SSD/RAM cache, serving ETL, transformation, SQL and BI workloads]
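The "automatic suspend when idle and resume when needed" behavior can be pictured with a toy state machine. This is only an illustrative sketch, not the service's implementation: the class name `VirtualWarehouse`, the `idle_timeout_s` setting and the explicit `tick` call are all hypothetical.

```python
class VirtualWarehouse:
    """Toy model of a warehouse that suspends when idle and resumes on demand."""

    def __init__(self, idle_timeout_s: float) -> None:
        self.idle_timeout_s = idle_timeout_s  # hypothetical setting
        self.running = False
        self.last_active = 0.0
        self.resume_count = 0

    def run_query(self, now: float) -> None:
        # Resume transparently if the warehouse was suspended.
        if not self.running:
            self.running = True
            self.resume_count += 1
        self.last_active = now

    def tick(self, now: float) -> None:
        # Suspend once the warehouse has been idle for the full timeout.
        if self.running and now - self.last_active >= self.idle_timeout_s:
            self.running = False


wh = VirtualWarehouse(idle_timeout_s=60)
wh.run_query(now=0.0)    # first query resumes the warehouse
wh.tick(now=30.0)        # still inside the idle window
wh.tick(now=60.0)        # idle for the full timeout: suspended
wh.run_query(now=100.0)  # next query resumes it again
print(wh.running, wh.resume_count)
```

Because the data lives in shared storage rather than in the warehouse, suspending compute in a model like this loses only cached data, not the data itself.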

SLIDE 9

MULTI-CLUSTER WAREHOUSE

LEVERAGE ABUNDANCE OF COMPUTE RESOURCES

  • Automatically scales compute resources based on concurrent usage
  • Single virtual warehouse of multiple compute clusters
  • Queries are load balanced across the clusters in a virtual warehouse
  • Split across availability zones for high availability

[Diagram: a query scheduler distributes queries across clusters 1, 2 and 3 of one virtual warehouse group]
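One way to picture scaling on concurrent usage is a scheduler that sends each query to the least-loaded cluster and starts another cluster only when every running one is full. The sketch below is an assumption-laden illustration: `MultiClusterWarehouse`, `max_clusters` and `slots_per_cluster` are hypothetical names, and the real scheduling policy is not described on the slide.

```python
from typing import List


class MultiClusterWarehouse:
    """Toy scheduler: least-loaded routing plus scale-out under concurrency."""

    def __init__(self, max_clusters: int, slots_per_cluster: int) -> None:
        self.max_clusters = max_clusters    # hypothetical upper bound
        self.slots = slots_per_cluster      # hypothetical per-cluster concurrency
        self.running: List[int] = [0]       # active query count per running cluster
        self.queued = 0

    def submit(self) -> int:
        """Route one query; return the cluster index, or -1 if it must queue."""
        idx = min(range(len(self.running)), key=self.running.__getitem__)
        if self.running[idx] >= self.slots:
            if len(self.running) < self.max_clusters:
                # Every running cluster is full: start one more.
                self.running.append(0)
                idx = len(self.running) - 1
            else:
                # Already at the maximum: the query waits.
                self.queued += 1
                return -1
        self.running[idx] += 1
        return idx


wh = MultiClusterWarehouse(max_clusters=3, slots_per_cluster=2)
routes = [wh.submit() for _ in range(7)]
print(routes)  # [0, 0, 1, 1, 2, 2, -1]
```

Scaling back in when load drops, and placing clusters in different availability zones, are left out to keep the sketch short.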

SLIDE 10

IN THE REAL-WORLD

  • Continuous loading (4 TB/day): from S3, <5 min SLA, Virtual Warehouse Medium
  • ETL & maintenance: Virtual Warehouse Large
  • Reporting (segmented): Virtual Warehouse 2X-Large
  • Interactive dashboard: Virtual Warehouse Auto Scale, X-Large x 5; 50% < 1s, 85% < 2s, 95% < 5s

Prod DB: 4 trillion rows, 3+ petabytes raw, 8x compression ratio, 25M+ micro-partitions
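As a rough sanity check, the Prod DB figures can be combined to estimate the average micro-partition size. The calculation below assumes decimal units and treats "3+" and "25M+" as exact values, so the results are order-of-magnitude estimates only.

```python
# Figures from the slide, taken at face value (lower bounds treated as exact).
raw_bytes = 3 * 10**15          # 3 petabytes raw
compression_ratio = 8
micro_partitions = 25 * 10**6
rows = 4 * 10**12

compressed_bytes = raw_bytes // compression_ratio
avg_partition_mb = compressed_bytes // micro_partitions // 10**6
rows_per_partition = rows // micro_partitions

print(compressed_bytes // 10**12, "TB compressed")    # 375 TB compressed
print(avg_partition_mb, "MB per micro-partition")     # 15 MB per micro-partition
print(rows_per_partition, "rows per micro-partition") # 160000 rows per micro-partition
```

About 15 MB of compressed data per micro-partition is in line with the "few tens of MB" size quoted on the scalable storage slide.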

SLIDE 11

SCALABLE IMMUTABLE STORAGE

SLIDE 12

STORAGE IMMUTABILITY

Accumulates immutable data over time
  • Well supported by all cloud vendor object stores

Allows separation of storage and compute resources
  • Enables workload scalability

Heavily optimized for read-mostly workloads
  • Natural fit for analytic systems

Transaction management becomes a metadata problem
  • Multi-version concurrency control and snapshot isolation semantics

Transaction coordination separated from storage and compute
  • Allows for consistent access across compute resources
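To see why immutability turns transaction management into a metadata problem, consider a table whose committed versions are just sets of immutable file ids. This is a minimal sketch under that assumption; `TableMetadata` and its methods are hypothetical and not the service's actual metadata model.

```python
from typing import FrozenSet, List, Set


class TableMetadata:
    """Toy metadata store: each committed version is a frozen set of file ids."""

    def __init__(self) -> None:
        self.versions: List[FrozenSet[str]] = [frozenset()]

    def snapshot(self) -> int:
        """Version number a new reader would pin for its whole query."""
        return len(self.versions) - 1

    def files_at(self, version: int) -> FrozenSet[str]:
        return self.versions[version]

    def commit(self, added: Set[str], removed: Set[str]) -> int:
        # Files are never modified in place; a commit only publishes a new file set.
        current = self.versions[-1]
        self.versions.append(frozenset((current - removed) | added))
        return self.snapshot()


table = TableMetadata()
table.commit(added={"f1", "f2"}, removed=set())
reader_version = table.snapshot()                        # a reader pins this version
writer_version = table.commit(added={"f3"}, removed={"f1"})  # an update rewrites f1 as f3

print(sorted(table.files_at(reader_version)))  # ['f1', 'f2']
print(sorted(table.files_at(writer_version)))  # ['f2', 'f3']
```

The reader keeps seeing a consistent snapshot while the writer commits, and neither touches the other's files, which is the behavior snapshot isolation asks for.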

SLIDE 13

SCALABLE STORAGE

AUTOMATIC MICRO-PARTITIONING

Columnar partitions

  • Data is automatically partitioned at load time: storage decoupled from compute
  • Columnar organization in each micro-partition: enables both horizontal and vertical pruning
  • Micro-partitions are only a few tens of MB: fine-grained pruning, no skew
  • Metadata structure tracks data distribution: very fast pruning at optimization time
  • Applied to both structured and semi-structured data: very fast response time for both
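Pruning from metadata can be sketched with per-column minimum and maximum values kept for each micro-partition. The slides do not say what the metadata structure contains, so the min/max ranges, the `MicroPartition` class and the `prune` function below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical metadata entry: per-column (min, max) for one immutable file."""
    file_id: str
    col_ranges: Dict[str, Tuple[int, int]]


def prune(partitions: List[MicroPartition], column: str, lo: int, hi: int) -> List[str]:
    """Return ids of partitions that may hold rows with lo <= column <= hi."""
    keep = []
    for p in partitions:
        pmin, pmax = p.col_ranges[column]
        # Keep the partition only if its value range overlaps the predicate range.
        if pmax >= lo and pmin <= hi:
            keep.append(p.file_id)
    return keep


partitions = [
    MicroPartition("mp1", {"day": (1, 10), "amount": (5, 900)}),
    MicroPartition("mp2", {"day": (11, 20), "amount": (1, 50)}),
    MicroPartition("mp3", {"day": (21, 30), "amount": (300, 700)}),
]

print(prune(partitions, "day", 12, 22))       # ['mp2', 'mp3']
print(prune(partitions, "amount", 100, 200))  # ['mp1']
```

Skipping whole partitions is the horizontal half; the vertical half comes from the columnar layout, where a query reads only the columns it references inside each surviving partition.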

SLIDE 14

AUTOMATICALLY APPLIED TO SEMI-STRUCTURED DATA

Semi-structured data (JSON, Avro, XML, Parquet, ORC) and structured data (e.g., CSV, TSV, …) are queried the same way:

> SELECT … FROM …

  • Native support: loaded in raw form (e.g. JSON, Avro, XML)
  • Optimized storage: optimized data type, no fixed schema or transformation required
  • Optimized SQL querying: full benefit of database optimizations (pruning, filtering, …)

SLIDE 15

EXAMPLE

[Architecture diagram]
  • Client applications: Web UI, ODBC driver, JDBC driver, connecting over HTTPS (JDBC/ODBC/Python)
  • Cloud services: security, optimization, warehouse management, query management, metadata
  • Compute: virtual warehouses "Custom Reports" (XL, 16 nodes), "Campaign Analysis" (L, 8 nodes) and "Loading WH" (L, 8 nodes)
  • Storage (S3): Sales and Marketing data, stored as micro-partitions

SLIDE 16

ENABLE DATA SHARING

Data providers
  • Secure and integrated with Snowflake’s access control model
  • Only pay normal storage costs for shared data
  • No limit to the number of consumer accounts with which a dataset may be shared

Data consumers
  • Get access to the data without any need to move or transform it
  • Query and combine shared data with existing data, or join together data from multiple publishers

SLIDE 17

ENABLE GLOBAL REPLICATION

[Map of regions: AWS (US West), AWS (US East), AWS (Ireland), AWS (Frankfurt), AWS (Sydney), Azure (US East), Azure (Frankfurt)]

SLIDE 18

MULTI-TENANT SERVICE

SLIDE 19

DATA WAREHOUSE AS A SERVICE

Multi-tenant service
  • No administration, self-tuning and self-healing, transparent upgrades
  • Service architecture designed for high availability and durability
  • Security is at the core

Availability
  • All tiers distributed over multiple data centers with active-active data replication
  • No maintenance downtime, fully transparent software and hardware upgrades
  • Automatic repair of any failed servers, with transparent re-execution of any failed queries
  • Persistent sessions for load balancing and transparent failover

Durability
  • Synchronous replication of data over multiple data centers
  • Automatic data retention and fail-safe technology to guard against any data removal

SLIDE 20

SNOWFLAKE SERVICE

Three independent layers

  • Cloud services (compilation and management): authentication & access control, infrastructure manager, optimizer, transaction manager, security, metadata
  • Virtual warehouses (data processing): each with its own cache
  • Storage: databases

SLIDE 21

MANAGED SERVICE

BUILT-IN DISASTER RECOVERY AND HIGH AVAILABILITY

  • Scale-out of all tiers: metadata, compute, storage
  • Resiliency across multiple availability zones: geographic separation, separate power grids, built for synchronous replication
  • Fully online updates & patches: zero downtime
  • Back pressure and throttling all the way back to the client

[Diagram: cloud services (services, metadata), virtual warehouses and database storage]

SLIDE 22

ADAPTIVE ALL THE WAY TO THE CORE

SELF TUNING & SELF HEALING INTERNALS

Principles: Adaptive, Self-tuning, Do no harm!, Automatic, Default

  • Automatic memory management
  • Automatic workload management
  • Automatic distribution method
  • Automatic degree of parallelism
  • Automatic fault handling

No statistics, no vacuuming

SLIDE 23

EXAMPLE: AUTOMATIC SKEW AVOIDANCE

Execution plan: scan, join, scan, filter

  1. Detect popular values on the build side of the join
  2. Use broadcast for those and directed join for the others

  • Adaptive: popular values detected at runtime
  • Self-tuning: number of values
  • Do no harm!: no performance degradation
  • Automatic: kicks in when needed
  • Default: enabled by default for all joins
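The two steps can be sketched as follows: count join keys on the build side, treat keys above a frequency threshold as popular, then mark rows with popular keys for broadcast and route all other rows to a worker by hash. The `threshold` value, the function names and the hash routing are assumptions for illustration; the slide does not say how popularity is measured or how the directed join assigns rows.

```python
from collections import Counter
from typing import List, Set


def detect_popular(build_keys: List[int], threshold: float) -> Set[int]:
    """Step 1: keys whose share of the build side exceeds the threshold."""
    counts = Counter(build_keys)
    total = len(build_keys)
    return {k for k, c in counts.items() if c / total > threshold}


def route(key: int, popular: Set[int], n_workers: int) -> str:
    """Step 2: broadcast popular keys, send the rest to one worker by hash."""
    if key in popular:
        return "broadcast"
    return f"worker-{hash(key) % n_workers}"


# Skewed build side: key 7 dominates, keys 0..19 appear once each.
build_keys = [7] * 80 + list(range(20))
popular = detect_popular(build_keys, threshold=0.1)

print(popular)                             # {7}
print(route(7, popular, n_workers=4))      # broadcast
print(route(5, popular, n_workers=4))      # worker-1
```

Without the broadcast path, every row with key 7 would land on a single worker and that worker would dominate the join's running time.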

SLIDE 24

WHAT’S NEXT?

SERVERLESS DATA SERVICES

  • Target predictable, well-identified database workloads
  • Horizontal scaling is automatic
  • Fine-grained units of work allow the degree of parallelism to be arbitrarily small or large
  • Secure since handled by the service
  • Transparent retry on failures
  • Service state entirely managed by the service
  • Monitoring and observability of the service

SLIDE 25

CLOUD NATIVE ARCHITECTURE

A GIFT THAT KEEPS ON GIVING