DATA WAREHOUSE BUILT FOR THE CLOUD QCON San Francisco, November - PowerPoint PPT Presentation

DATA WAREHOUSE BUILT FOR THE CLOUD
QCON San Francisco, November 2019
Thierry Cruanes, Co-Founder & CTO

THE DREAM DATA WAREHOUSE (CIRCA 2012)
Unlimited and instant scaling: no data silos; 10x faster for the same price, no over-provisioning
Store all your data: structured and semi-structured; petabyte scale at very low cost
Extreme simplicity: no management tasks, offered as a service; fast out of the box with no tuning knobs
No compromises, full-fledged data warehouse: full support for ACID transactions with read consistency; ANSI SQL, RBAC

WHY THEN? OUR VIEW OF THE CLOUD…
Storage became dirt cheap
Flat network offered uniform bandwidth
Single-core performance stalled
Data warehouse and analytic workloads are mostly CPU bound
Design for abundance and not scarcity of resources

THREE PILLARS
Multi-cluster shared data architecture: leverage cloud elasticity and pay only for what you use; instant scale; performance isolation; real-time data sharing
Immutable scalable storage: extremely fast response time at scale; fine-grained vertical and horizontal pruning on any column; automatically applied to any data (structured and semi-structured)
Multi-tenant service: self-tuning, self-healing; transparent upgrade; service architecture designed for availability, durability and security

ARCHITECTURE

AN ARCHITECTURE BUILT FOR THE CLOUD
Traditional architectures
Shared-disk: shared storage, single cluster
Shared-nothing: decentralized, local storage, single cluster
Multi-cluster, shared data: centralized, scale-out storage; multiple, independent compute clusters

MULTI-CLUSTER, SHARED DATA ARCHITECTURE
No data silos: storage decoupled from compute
Any data: native for structured and semi-structured
Unlimited scalability: along many dimensions
Low cost: compute on demand
Instant cloning: isolate prod from dev and QA
Highly available: 11 9's durability, 4 9's availability
[Diagram: independent virtual warehouses for ETL & data loading, data science, finance, marketing, dashboards, and dev/test/QA, all reading shared databases; dev/test/QA works on a clone.]

VIRTUAL WAREHOUSE
How can concurrent workloads run without impacting each other?
One or more MPP compute clusters
Unit of fault and performance isolation
Use multiple warehouses to segregate workloads
Resizable on the fly
Able to access data in any database
Transparently caches data accessed
Transaction manager synchronizes data access
Automatic suspend when idle and resume when needed
[Diagram: virtual warehouses A, B, C and D (ETL, transformation, SQL, BI), each with its own SSD/RAM cache.]

MULTI-CLUSTER WAREHOUSE
LEVERAGE ABUNDANCE OF COMPUTE RESOURCES
Automatically scales compute resources based on concurrent usage
Single virtual warehouse of multiple compute clusters
Queries are load balanced across the clusters in a virtual warehouse
Split across availability zones for high availability
[Diagram: queries enter a query scheduler, which feeds clusters 1, 2 and 3 of one virtual warehouse group.]
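
The scheduling behavior on this slide can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Snowflake's scheduler: the class names, the `slots_per_cluster` limit and the scale-in rule are all hypothetical. It routes each query to the least-loaded cluster and adds a cluster when every existing one is full.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    running: int = 0  # queries currently executing on this cluster


@dataclass
class MultiClusterWarehouse:
    """Toy model of one virtual warehouse backed by several compute clusters."""
    min_clusters: int = 1
    max_clusters: int = 3
    slots_per_cluster: int = 2  # hypothetical per-cluster concurrency limit
    clusters: list = field(default_factory=list)

    def __post_init__(self):
        for _ in range(self.min_clusters):
            self._add_cluster()

    def _add_cluster(self):
        self.clusters.append(Cluster(f"cluster-{len(self.clusters) + 1}"))

    def submit(self):
        """Route one query to the least-loaded cluster, scaling out if all are full."""
        target = min(self.clusters, key=lambda c: c.running)
        if target.running >= self.slots_per_cluster and len(self.clusters) < self.max_clusters:
            self._add_cluster()
            target = self.clusters[-1]
        # At max_clusters the query simply stacks on the least-loaded cluster.
        target.running += 1
        return target.name

    def finish(self, name):
        """Mark one query done and drop idle trailing clusters above the minimum."""
        cluster = next(c for c in self.clusters if c.name == name)
        cluster.running -= 1
        while len(self.clusters) > self.min_clusters and self.clusters[-1].running == 0:
            self.clusters.pop()


wh = MultiClusterWarehouse()
placements = [wh.submit() for _ in range(5)]
print(placements)
```

With two slots per cluster, the first two queries share one cluster and each further pair triggers a scale-out until the maximum is reached.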

IN THE REAL WORLD
Interactive dashboard: 50% < 1s, 85% < 2s, 95% < 5s; virtual warehouse: auto scale, X-Large x 5
Continuous loading from S3 (4TB/day): <5min SLA; virtual warehouse: Medium
Reporting (segmented): virtual warehouse: 2X-Large
ETL & maintenance: virtual warehouse: Large
Prod DB: 4 trillion rows, 3+ petabyte raw, 8x compression ratio, 25M+ micro-partitions
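
As a rough sanity check, the Prod DB figures on this slide imply an average micro-partition size. The sketch below assumes decimal units and treats the "3+" and "25M+" lower bounds as exact values, so the results are approximations only.

```python
# Figures quoted on the slide; the "+" suffixes are ignored (an assumption).
raw_bytes = 3 * 10**15            # "3+ petabyte raw", decimal petabytes
compression_ratio = 8             # "8x compression ratio"
micro_partitions = 25 * 10**6     # "25M+ micro-partitions"
rows = 4 * 10**12                 # "4 trillion rows"

stored_bytes = raw_bytes / compression_ratio
avg_partition_mb = stored_bytes / micro_partitions / 10**6
rows_per_partition = rows / micro_partitions

print(f"stored: {stored_bytes / 10**12:.0f} TB")
print(f"average micro-partition: {avg_partition_mb:.0f} MB")
print(f"average rows per micro-partition: {rows_per_partition:,.0f}")
```

Under these assumptions the data occupies about 375 TB compressed, or roughly 15 MB and 160,000 rows per micro-partition.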

SCALABLE IMMUTABLE STORAGE

STORAGE IMMUTABILITY
Accumulates immutable data over time: well supported by all cloud vendor object stores
Allows separation of storage and compute resources: enables workload scalability
Heavily optimized for read-mostly workloads: natural fit for analytic systems
Transaction management becomes a metadata problem: multi-version concurrency control and snapshot isolation semantics
Transaction coordination separated from storage and compute: allows for consistent access across compute resources
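
To illustrate why immutable storage turns transaction management into a metadata problem, here is a minimal Python sketch. The `TableMetadata` store and its methods are hypothetical, and it omits conflict detection between concurrent writers. Files are never modified; a commit only publishes a new set of file names, and a reader keeps the version it pinned.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableVersion:
    version: int
    files: frozenset  # names of the immutable files visible in this version


@dataclass
class TableMetadata:
    """Hypothetical metadata store: a table is a list of versions over immutable files."""
    versions: list = field(default_factory=lambda: [TableVersion(0, frozenset())])

    def snapshot(self):
        """A reader pins the latest committed version and keeps seeing it."""
        return self.versions[-1]

    def commit(self, added=(), removed=()):
        """A write never edits files in place; it publishes a new file set."""
        current = self.versions[-1]
        files = (current.files - frozenset(removed)) | frozenset(added)
        new = TableVersion(current.version + 1, files)
        self.versions.append(new)
        return new


table = TableMetadata()
table.commit(added={"f1", "f2"})                # initial load
reader = table.snapshot()                       # long-running query starts here
table.commit(added={"f3"}, removed={"f1"})      # an update rewrites f1 as f3

print(sorted(reader.files))            # the reader's pinned view
print(sorted(table.snapshot().files))  # what a new query sees
```

Because old versions stay valid as long as their files exist, any compute cluster that holds a version reference sees a consistent view without coordinating with writers.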

SCALABLE STORAGE
AUTOMATIC MICRO-PARTITIONING
Data is automatically partitioned at load time: storage decoupled from compute
Columnar organization in each micro-partition: enables both horizontal and vertical pruning
Micro-partitions are only a few tens of MB each: fine-grained pruning, no skew
Metadata structure tracks data distribution: very fast pruning at optimization time
Applied to both structured and semi-structured data: very fast response time for both
[Diagram: a table split into partitions, each stored in columnar form.]
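
The pruning idea can be sketched as follows. This assumes the tracked metadata is a per-column minimum and maximum for each micro-partition; the slide only says that a metadata structure tracks data distribution, so the exact statistics, like every name below, are an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MicroPartition:
    name: str
    columns: dict                      # column name -> list of values (columnar layout)
    stats: dict = field(init=False)    # column name -> (min, max), computed once at load

    def __post_init__(self):
        self.stats = {col: (min(v), max(v)) for col, v in self.columns.items()}


def partition_rows(rows, rows_per_partition):
    """Split rows into micro-partitions in arrival order, stored column-wise."""
    parts = []
    for start in range(0, len(rows), rows_per_partition):
        chunk = rows[start:start + rows_per_partition]
        columns = {col: [r[col] for r in chunk] for col in chunk[0]}
        parts.append(MicroPartition(f"mp-{len(parts)}", columns))
    return parts


def scan(parts, column, low, high, project):
    """Return (partitions touched, projected values) for low <= column <= high."""
    touched, out = [], []
    for part in parts:
        lo, hi = part.stats[column]
        if hi < low or lo > high:
            continue  # horizontal pruning: skipped using metadata only
        touched.append(part.name)
        # Vertical pruning: only the filter column and the projected column are read.
        for key, value in zip(part.columns[column], part.columns[project]):
            if low <= key <= high:
                out.append(value)
    return touched, out


rows = [{"day": d, "amount": d * 10, "note": "x"} for d in range(1, 13)]
parts = partition_rows(rows, rows_per_partition=4)
print(scan(parts, "day", 6, 7, project="amount"))
```

Only the middle partition overlaps the requested range, so the other two are never opened, and the `note` column is never read at all.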

AUTOMATICALLY APPLIED TO SEMI-STRUCTURED DATA
Inputs: semi-structured data (JSON, Avro, XML, Parquet, ORC) and structured data (e.g., CSV, TSV, …)
Native support: loaded in raw form (e.g. JSON, Avro, XML)
Optimized storage: optimized data type, no fixed schema or transformation required
Optimized SQL querying (> SELECT … FROM …): full benefit of database optimizations (pruning, filtering, …)

EXAMPLE
[Architecture diagram. Clients (client application with JDBC driver, web UI, ODBC driver) connect over HTTPS (JDBC/ODBC/Python) to the cloud services layer: query optimization, warehouse management, security management and metadata. The compute layer shows virtual warehouses of sizes XL and L (loading, campaign analysis, custom reports, analysts), each a group of nodes labeled with the partitions it holds. The storage layer keeps the partitions of the Sales and Marketing data in S3.]

ENABLE DATA SHARING
Providers
Secure and integrated with Snowflake’s access control model
Only pay normal storage costs for shared data
No limit to the number of consumer accounts with which a dataset may be shared
Consumers
Get access to the data without any need to move or transform it
Query and combine shared data with existing data, or join together data from multiple publishers

ENABLE GLOBAL REPLICATION
[World map of deployment regions on Azure and AWS. Labels recoverable from the slide include Azure (Frankfurt), AWS (Ireland), Azure (US East), AWS (Frankfurt), AWS (US West), AWS (US East) and Sydney.]

MULTI-TENANT SERVICE

DATA WAREHOUSE AS A SERVICE
Multi-tenant service
No administration, self-tuning and healing
Transparent upgrade
Service architecture designed for high availability and durability
Security is at the core
Availability
All tiers distributed over multiple datacenters with active-active data replication
No maintenance downtime, fully transparent software and hardware upgrades
Automatic repair of any failed servers with transparent re-execution of any failed queries
Persistent sessions for load balancing and transparent fail-over
Durability
Synchronous replication of data over multiple data centers
Automatic data retention and fail-safe technology to guard against any data removal

SNOWFLAKE SERVICE
Three independent layers
Cloud services (compilation and management): authentication & access control; infrastructure manager, optimizer, transaction manager, security; metadata
Virtual warehouses (data processing): each with its own cache
Databases (storage)

MANAGED SERVICE
BUILT-IN DISASTER RECOVERY AND HIGH AVAILABILITY
Scale-out of all tiers: metadata, compute, storage
Resiliency across multiple availability zones: geographic separation, separate power grids, built for synchronous replication
Fully online updates & patches: zero downtime
Back pressure and throttling: all the way back to the client
[Diagram: cloud services, metadata, virtual warehouses and database storage tiers.]

ADAPTIVE ALL THE WAY TO THE CORE
SELF TUNING & SELF HEALING INTERNALS
Principles: adaptive, self-tuning, do no harm!, automatic, default
Automatic memory management
Automatic distribution method
Automatic degree of parallelism
Automatic fault handling
Automatic workload management
No vacuuming
No statistics

EXAMPLE: AUTOMATIC SKEW AVOIDANCE
1. Detect popular values on the build side of the join
2. Use broadcast for those and directed join for the others
Adaptive: popular values detected at runtime
Self-tuning: number of values
Do no harm!: no performance degradation
Automatic: kicks in when needed
Default: enabled by default for all joins
[Execution plan diagram: two scans, one passing through a filter, feed the join; step 1 is marked on the build side and step 2 on the join.]
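
The two steps above can be simulated in Python. This is a sketch of one plausible reading, not the engine's actual algorithm: it assumes that "broadcast" means copying build rows with popular keys to every worker while the matching probe rows are dealt out round-robin, and it counts keys up front rather than detecting them during execution. The `workers` and `threshold` parameters are illustrative knobs.

```python
from collections import Counter


def hybrid_join(build, probe, workers=4, threshold=3):
    """Join two lists of (key, payload) pairs across simulated workers.

    Keys seen at least `threshold` times on the build side count as popular:
    their build rows are broadcast and their probe rows spread round-robin.
    All other keys use a directed (hash-partitioned) join.
    """
    counts = Counter(key for key, _ in build)
    popular = {key for key, n in counts.items() if n >= threshold}

    build_at = [[] for _ in range(workers)]
    probe_at = [[] for _ in range(workers)]

    for key, payload in build:
        if key in popular:
            for rows in build_at:                      # broadcast to every worker
                rows.append((key, payload))
        else:
            build_at[hash(key) % workers].append((key, payload))

    next_worker = 0
    for key, payload in probe:
        if key in popular:
            probe_at[next_worker].append((key, payload))   # spread evenly
            next_worker = (next_worker + 1) % workers
        else:
            probe_at[hash(key) % workers].append((key, payload))

    # Each worker runs an ordinary local hash join on the rows it received.
    results, load = [], []
    for b_rows, p_rows in zip(build_at, probe_at):
        table = {}
        for key, payload in b_rows:
            table.setdefault(key, []).append(payload)
        results.extend((key, b, p) for key, p in p_rows for b in table.get(key, []))
        load.append(len(p_rows))
    return results, load


build = [(1, f"b{i}") for i in range(3)] + [(2, "x"), (3, "y")]
probe = [(1, f"p{i}") for i in range(8)] + [(2, "q"), (4, "r")]
results, load = hybrid_join(build, probe)
print(len(results), load)
```

A purely directed join would send every row with key 1 to a single worker; here the probe rows for that key are spread over all four workers while the join result stays the same.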

WHAT’S NEXT? SERVERLESS DATA SERVICES
Target predictable, well-identified database workloads
Horizontal scaling is automatic
Fine-grained units of work allow the degree of parallelism to be arbitrarily small or large
Secure, since handled by the service
Transparent retry on failures
Service state entirely managed by the service
Monitoring and observability of the service

CLOUD NATIVE ARCHITECTURE
A GIFT THAT KEEPS ON GIVING
