Delta Lake: Making Cloud Data Lakes Transactional and Scalable
Stanford University, 2019-05-15
Reynold Xin
@rxin
About Me
- Databricks co-founder & Chief Architect
- Designed most major things in modern-day Apache Spark
- PhD in databases from Berkeley
[Diagram: Data streams → ???? → Insights]
[Diagram: OLTP databases → ETL → Data Warehouse → SQL → Insights]
Data Warehouse challenges:
- ETL pipelines are often complex and slow: ad-hoc pipelines to process data and ingest it into the warehouse; no insights until the daily data dumps have been processed
- Performance is expensive: scaling up/out usually comes at a high cost
- Workloads often limited to SQL and BI tools: data is in proprietary formats, and it is hard to integrate streaming, ML, and AI workloads
Data Lake
[Diagram: Data streams → Data Lake (scalable ETL with Structured Streaming) → SQL, ML, AI, streaming → Insights]
The Data Lake approach:
- ETL pipelines: simpler and fast; the unified Spark API between batch and streaming simplifies ETL, and raw unstructured data becomes available as structured data in minutes
- Performance: cheaper; easy and cost-effective to scale out compute and storage
- Workloads: not limited to anything; data lives in files with open formats and integrates with data processing and BI tools as well as ML and AI workloads and tools
Evolution of a typical data lake pipeline:
1. Events land in the Data Lake and feed Streaming Analytics and Reporting.
2. A λ-architecture is added: parallel streaming and batch paths over the same events.
3. Validation jobs are added to catch bad data.
4. Reprocessing is added: the data is partitioned so that bad partitions can be recomputed.
5. Compaction is added: small files are compacted, and other jobs are scheduled to avoid the compaction windows.
Data reliability challenges:
- Failed production jobs leave data in a corrupt state, requiring tedious recovery
- Lack of schema enforcement creates inconsistent and low-quality data
- Lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads
Performance challenges:
- Too many small or very big files: more time is spent opening and closing files than reading content (worse with streaming)
- Partitioning (aka "poor man's indexing") breaks down when data has many dimensions and/or high-cardinality columns
- Neither storage systems nor processing engines are great at handling a very large number of subdirectories/files
DELTA: THE GOOD OF DATA LAKES + THE GOOD OF DATA WAREHOUSES
- The SCALE of data lakes
- The RELIABILITY & PERFORMANCE of data warehouses
- The LOW-LATENCY of streaming
Table data is stored as Parquet files, with a sequence of metadata files to track changes, stored in scalable storage along with the table:

pathToTable/
 +---- 000.parquet
 +---- 001.parquet
 +---- 002.parquet
 + ...
 +---- _delta_log/
        +---- 000.json
        +---- 001.json
        ...
Changes to the table are stored as ordered, atomic commits: each commit is a file of actions in the _delta_log directory.

000.json (INSERT): Add 001.parquet, Add 002.parquet
001.json (UPDATE): Remove 001.parquet, Remove 002.parquet, Add 003.parquet
Readers read the log in atomic units and thus see consistent snapshots: in the example above, a reader sees either [001+002].parquet or 003.parquet, and nothing in between.
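To make the log-replay idea concrete, here is a minimal Python sketch (not the actual Delta implementation; the action format is simplified to bare paths) that reconstructs the current snapshot by replaying the ordered commits:

```python
import json
from pathlib import Path

def live_files(delta_log_dir: str) -> set:
    """Replay the ordered commit files (000.json, 001.json, ...) to compute
    the set of Parquet files that make up the current table snapshot."""
    files = set()
    for commit in sorted(Path(delta_log_dir).glob("*.json")):
        # Each commit is applied as an atomic unit: either all of its
        # actions are visible to a reader or none of them are.
        with commit.open() as f:
            for line in f:
                action = json.loads(line)   # simplified: {"add": "..."} or {"remove": "..."}
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
    return files

# With the commits above, live_files(".../_delta_log") returns {"003.parquet"}.
```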
Concurrent writers need to agree on the order of commits: new commit files must be created mutually exclusively. For example, if Writer 1 and Writer 2 both try to write 002.json after 000.json and 001.json, only one of the two attempts must succeed.

Different cloud storage systems have different semantics to provide atomic guarantees:
Cloud Storage                      | Atomic File Visibility | Atomic Put-if-absent | Solution
Azure Blob Store, Azure Data Lake  | ✘                      | ✔                    | Write to a temp file, rename to the final file if not present
AWS S3                             | ✔                      | ✘                    | Separate service performs all writes directly (single writer)
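As a rough illustration of the mutual-exclusion requirement, the sketch below uses a local filesystem's create-if-absent mode ("x") as a stand-in for the cloud-specific mechanisms in the table above (rename-if-absent on Azure, a single-writer service on S3):

```python
import json
from pathlib import Path

class CommitConflict(Exception):
    """Another writer already created this commit version."""

def write_commit(delta_log_dir: str, version: int, actions: list) -> None:
    """Create <version>.json atomically; fail if it already exists."""
    path = Path(delta_log_dir) / f"{version:03d}.json"
    try:
        with path.open("x") as f:                 # "x": create only if absent
            for action in actions:
                f.write(json.dumps(action) + "\n")
    except FileExistsError as exc:
        raise CommitConflict(f"commit {version:03d} already exists") from exc
```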
Pessimistic Concurrency
- Block others from writing anything: hold a lock, write the data files, commit to the log
- ✔ Avoids wasted work   ✘ Requires distributed locks

Optimistic Concurrency
- Assume it will be okay: write the data files, then try to commit to the log and fail on conflict
- Enough in practice, as write concurrency is usually low
- ✔ Mutual exclusion is enough   ✘ Breaks down if there are a lot of conflicts
On a commit conflict, check if anything you read has changed.

Example 1: User 1 reads file A and commits new file B as 000001.json; User 2 concurrently reads A and writes new file C. New file C does not conflict with new file B, so User 2 retries and commits successfully as 000002.json.

Example 2: User 1 reads file A and commits a write that replaces A (W: A,B); User 2 reads A and tries to commit a write that also replaces A (W: A,C). The deletion of file A by User 1 conflicts with the deletion by User 2, so User 2's operation fails.
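Putting both ideas together, here is a minimal sketch of the optimistic protocol (illustrative only; Delta's real conflict-detection rules are richer), using an in-memory dict as the log and dict.setdefault as a stand-in for put-if-absent:

```python
def try_commit(log: dict, read_files: set, new_actions: list, max_retries: int = 10) -> int:
    """Optimistically commit `new_actions` against `log` ({version: [actions]}).
    Returns the version the commit landed at, or raises on a real conflict."""
    version = max(log, default=-1) + 1
    for _ in range(max_retries):
        if log.setdefault(version, new_actions) is new_actions:
            return version                       # we created this version first
        # Lost the race: someone else committed this version. Check whether
        # their commit removed anything we read; if so, our write is stale.
        removed = {a["remove"] for a in log[version] if "remove" in a}
        if removed & read_files:
            raise RuntimeError(f"conflict at version {version}: a file we read was removed")
        version += 1                             # no conflict, retry at the next version
    raise RuntimeError("too many retries")
```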
Large tables can have millions of files in them, so even reading the whole log of add/remove actions to reconstruct the table state gets expensive. The log is therefore periodically checkpointed: a checkpoint collapses a prefix of the log (e.g. Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet) into the resulting table state, so readers only replay the commits after the latest checkpoint.
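A small sketch of the checkpoint idea (simplified; real checkpoints carry more metadata), collapsing all commits up to a version into the net set of live files:

```python
def write_checkpoint(log: dict, upto_version: int) -> dict:
    """Collapse commits 0..upto_version of `log` ({version: [actions]})
    into the resulting table state."""
    files = set()
    for v in range(upto_version + 1):
        for action in log.get(v, []):
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return {"version": upto_version, "files": sorted(files)}

# Readers start from the latest checkpoint and replay only the newer commits.
```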
Problem: failed production jobs leave data in a corrupt state, requiring tedious recovery.
Solution: failed write jobs do not update the commit log, hence partial/corrupt files are never visible to readers.
Challenge: lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads.
Solution:
- All reads have full snapshot consistency
- All successful writes are consistent
- In practice, most writes don't conflict
- Tunable isolation levels (serializability by default)
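For illustration, every read targets a consistent version of the table, and older versions can be read back with time travel. A sketch, assuming a Spark session with the Delta Lake package available and a table at the hypothetical path /data:

```python
from pyspark.sql import SparkSession

# Assumes Delta Lake is on the classpath, e.g. started via
#   pyspark --packages io.delta:delta-core_2.12:<version>
spark = SparkSession.builder.appName("delta-snapshots").getOrCreate()

df_now = spark.read.format("delta").load("/data")        # latest consistent snapshot

df_v0 = (spark.read.format("delta")
              .option("versionAsOf", 0)                   # time travel to version 0
              .load("/data"))
```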
Challenge: lack of schema enforcement creates inconsistent and low-quality data.
Solution:
- Schema is recorded in the log
- Attempts to commit data with an incorrect schema fail
- Explicit schema evolution is allowed
- Invariant and constraint checks are allowed (high data quality)
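A short PySpark sketch of schema enforcement and explicit evolution (hypothetical table path; `mergeSchema` is, to my knowledge, the standard Delta option for opting into schema evolution):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-schema").getOrCreate()

users = spark.createDataFrame([(1, "alice")], ["id", "name"])
users.write.format("delta").mode("append").save("/data/users")

# A frame with an unexpected extra column is rejected (schema enforcement):
extra = spark.createDataFrame([(2, "bob", "us")], ["id", "name", "country"])
# extra.write.format("delta").mode("append").save("/data/users")   # fails with a schema mismatch

# ...unless evolution is requested explicitly:
(extra.write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")        # opt-in, explicit schema evolution
      .save("/data/users"))
```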
Challenge: too many small files increase resource usage significantly.
Solution: transactionally performed compaction using OPTIMIZE.

OPTIMIZE table WHERE date = '2019-04-04'
Challenge: partitioning breaks down with many dimensions and/or high-cardinality columns.
Solution: optimize using multi-dimensional clustering on multiple columns.

OPTIMIZE conns WHERE date = '2019-04-04' ZORDER BY (srcIp, dstIp)
Example: ad-hoc queries of connection data (> PBs, > trillions of rows) filter on different columns, and partitioning is bad because the cardinality is high:

SELECT count(*) FROM conns WHERE date = '2019-04-04' AND srcIp = '1.1.1.1'
SELECT count(*) FROM conns WHERE date = '2019-04-04' AND dstIp = '1.1.1.1'

With the data linearly sorted on srcIp (ideal file size = 4 rows in the example), the srcIp query touches 2 files but the dstIp query touches all 8 files: great for the major sorting dimension, not for the others.

With a Z-order space-filling curve over (srcIp, dstIp), each query touches 4 files: reasonably good for all dimensions.
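To make "Z-order" concrete, here is a minimal sketch of the space-filling curve itself: interleaving the bits of two column values produces a single sort key that keeps rows close in either dimension close together on disk (illustrative only; OPTIMIZE ... ZORDER BY does the equivalent clustering on real columns):

```python
def zorder_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) value."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
    return key

# Sort rows by zorder_key(srcIp_as_int, dstIp_as_int) and then chunk them into
# files: a filter on either srcIp or dstIp now skips most files, instead of
# only the srcIp filter benefiting as with a plain linear sort.
```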
Case study (from a keynote talk): large-scale security analytics

Data sources:
- Security Infra: IDS/IPS, DLP, antivirus, load balancers, proxy servers
- Cloud Infra & Apps: AWS, Azure, Google Cloud
- Servers Infra: Linux, Unix, Windows
- Network Infra: routers, switches, WAPs, databases, LDAP

Requirements:
- Detect signal across user, application, and network logs
- Quickly analyze the blast radius with ad-hoc queries
- Respond quickly in an automated fashion
- Scale across petabytes of data and 100s of security analysts

Before: > 100 TB of new data per day and > 300B events per day were dumped into a data lake (DATALAKE1) of messy data not ready for analytics, then pushed through complex ETL into separate warehouses (DW1, DW2, DW3) for each type of analytics (incident response, alerting, reports), plus a second data lake (DATALAKE2).
- Took 20 engineers + 24 weeks
- Hours of delay in accessing data
- Very expensive to scale
- Only 2 weeks of data, in proprietary formats
- No advanced analytics (ML)

After: the same dump is processed with Structured Streaming into DELTA tables that serve SQL, ML, and streaming workloads for incident response, alerting, and reports.
- Took 2 engineers + 2 weeks
- Data usable in minutes/seconds
- Easy and cheaper to scale
- Store 2 years of data in open formats
- Enables advanced analytics
Revisiting the earlier pipeline with DELTA (Streaming Analytics and Reporting directly over Delta tables):
- λ-arch: not needed, Delta handles both short-term and long-term data
- Validation: easy and seamless with Delta's transactional guarantees
- Reprocessing: easy, as short-term and long-term data are in one location
- Compaction: handled transactionally with OPTIMIZE
Getting started: instead of parquet ...

CREATE TABLE ... USING parquet ...
dataframe.write.format("parquet").save("/data")

... simply say delta:

CREATE TABLE ... USING delta ...
dataframe.write.format("delta").save("/data")
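A small end-to-end PySpark sketch of the same idea for both batch and streaming writes (paths and the checkpoint location are placeholders; assumes the Delta Lake package is available):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-quickstart").getOrCreate()

# Batch: exactly the Parquet writer API, with "delta" as the format string.
events = spark.range(1000)
events.write.format("delta").mode("append").save("/data/events")

# Streaming: Structured Streaming writes to a Delta table the same way.
query = (spark.readStream.format("rate").load()            # toy source: one row per second
              .writeStream.format("delta")
              .option("checkpointLocation", "/chk/rate")    # placeholder checkpoint path
              .start("/data/rate_events"))
```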
DELTA summary:
- MASSIVE SCALE: scalable compute & storage
- RELIABILITY: ACID transactions & data validation
- PERFORMANCE: data indexing & caching (10-100x)
- OPEN: open source, data stored as Parquet
- LOW-LATENCY: integrated with Structured Streaming