  1. Delta Lake: Making Cloud Data Lakes Transactional and Scalable Reynold Xin @rxin Stanford University, 2019-05-15

  2. About Me Databricks co-founder & Chief Architect - Designed most major things in “modern day” Apache Spark - #1 contributor to Spark by commits and net lines deleted PhD in databases from Berkeley

  3. Building a data analytics platform is hard: how do you get from data streams to insights?

  4. Traditional Data Warehouses
  [Figure: OLTP databases → ETL → Data Warehouse → SQL → Insights]

  5. Challenges with Data Warehouses
  - ETL pipelines are often complex and slow: ad-hoc pipelines to process data and ingest it into the warehouse; no insights until the daily data dumps have been processed
  - Workloads often limited to SQL and BI tools: warehouse data is in proprietary formats; hard to integrate streaming, ML, and AI workloads
  - Performance is expensive: scaling up/out usually comes at a high cost

  6. Dream of Data Lakes
  [Figure: data streams → scalable ETL → Data Lake → SQL, ML/AI, streaming → Insights]

  7. Data Lakes + Spark = Awesome!
  [Figure: data streams → Data Lake → Spark (SQL, ML, Structured Streaming), the first unified analytics engine → Insights]

  8. Advantages of Data Lakes
  - ETL pipelines are simpler and fast: the unified Spark API between batch and streaming simplifies ETL; raw unstructured data is available as structured data in minutes
  - Workloads are not limited to anything: data lives in files with open formats; integrates with data processing and BI tools; integrates with ML and AI workloads and tools
  - Performance is cheaper: easy and cost-effective to scale out compute and storage

  9. Challenges of Data Lakes in practice

  10. Challenges of Data Lakes in practice: ETL

  11. Evolution of a Cutting-Edge Data Pipeline
  [Figure: a pipeline with events, a streaming system, a data lake, analytics, and reporting]

  12. Evolution of a Cutting-Edge Data Pipeline
  [Figure: the same pipeline, with events landing in the data lake via streaming and feeding analytics and reporting]

  13. Challenge #1: Historical queries?
  [Figure: the pipeline grows a λ-architecture (1): events are processed both by the streaming path and by batch jobs over the data lake so analytics can cover history]

  14. Challenge #2: Messy data?
  [Figure: validation steps (2) are added on both the streaming and the data-lake paths, on top of the λ-architecture (1)]

  15. Challenge #3: Mistakes and failures?
  [Figure: the data lake is partitioned and reprocessing jobs (3) are added, on top of validation (2) and the λ-architecture (1)]

  16. Challenge #4: Query performance?
  [Figure: compaction of small files (4) is added, with other jobs scheduled to avoid the files being compacted, on top of reprocessing (3), validation (2), and the λ-architecture (1)]

  17. Data Lake Reliability Challenges
  - Failed production jobs leave data in a corrupt state, requiring tedious recovery
  - Lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads
  - Lack of schema enforcement creates inconsistent and low-quality data

  18. Data Lake Performance Challenges
  - Too many small or very big files: more time is spent opening and closing files than reading their content (worse with streaming)
  - Partitioning, aka “poor man’s indexing”, breaks down when data has many dimensions and/or high-cardinality columns
  - Neither storage systems nor processing engines are great at handling very large numbers of subdirectories/files

  19. Figuring out what to read is too slow

  20. Data integrity is hard

  21. Band-aid solutions made it worse!

  22. Everyone has the same problems

  23. THE GOOD OF DATA LAKES: massive scale out, open formats, mixed workloads. THE GOOD OF DATA WAREHOUSES: pristine data, transactional reliability, fast SQL queries.

  24. DELTA: the reliability & scale of a data lake, the low latency of streaming, and the performance of a data warehouse.

  25. Scalable storage + transactional log = DELTA

  26. DELTA
  Scalable storage: pathToTable/ holds the table data stored as Parquet files (000.parquet, 001.parquet, 002.parquet, ...) on HDFS, AWS S3, or Azure Blob Stores.
  Transactional log: pathToTable/_delta_log/ holds a sequence of metadata files (000.json, 001.json, ...) that track the operations made on the table, stored in the same scalable storage along with the table.
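  To make this layout concrete, here is a minimal PySpark sketch (assuming the delta-spark pip package; the local path /tmp/events and the toy data are illustrative, not from the talk) that writes two commits and lists the resulting files:

```python
# Minimal layout demo with PySpark + the delta-spark pip package.
# The local path /tmp/events and the toy data are illustrative.
import os
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-layout")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/events"
spark.range(0, 100).write.format("delta").mode("append").save(path)    # commit 0
spark.range(100, 200).write.format("delta").mode("append").save(path)  # commit 1

print(sorted(os.listdir(path)))                              # Parquet data files + _delta_log/
print(sorted(os.listdir(os.path.join(path, "_delta_log"))))  # 00000000000000000000.json, 00000000000000000001.json
```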

  27. Log Structured Storage
  Changes to the table are stored as ordered, atomic commits. Each commit is a set of actions written as a file in the _delta_log/ directory:
  000.json (INSERT): Add 001.parquet, Add 002.parquet
  001.json (UPDATE): Remove 001.parquet, Remove 002.parquet, Add 003.parquet

  28. Log Structured Storage
  Readers read the log in atomic units and thus read consistent snapshots: given the commits above, readers will see either [001 + 002].parquet or 003.parquet, and nothing in between.
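  A simplified sketch of the reader side, assuming the newline-delimited JSON action format described above (real Delta readers also handle checkpoints, metadata, and protocol actions):

```python
# Simplified reader: replay the commit log in order to compute a consistent
# snapshot (the set of currently live data files).
import json
import os

def snapshot(table_path):
    log_dir = os.path.join(table_path, "_delta_log")
    commits = sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
    live_files = set()
    for commit in commits:
        with open(os.path.join(log_dir, commit)) as fh:
            for line in fh:                      # one JSON action per line
                action = json.loads(line)
                if "add" in action:
                    live_files.add(action["add"]["path"])
                elif "remove" in action:
                    live_files.discard(action["remove"]["path"])
    return live_files  # e.g. {001.parquet, 002.parquet} or {003.parquet}, never a mix
```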

  29. Mutual Exclusion
  Concurrent writers need to agree on the order of changes (000.json, 001.json, 002.json, ...). New commit files must be created mutually exclusively: if Writer 1 and Writer 2 both try to write 002.json concurrently, only one of them must succeed.

  30. Challenges with cloud storage
  Different cloud storage systems have different semantics for providing the needed atomic guarantees (atomic put-if-absent, atomic visibility of files):
  - Azure Blob Store / Azure Data Lake: write to a temp file, then rename it to the final file if not already present
  - AWS S3: a separate service performs all writes directly (single writer)
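  As a rough illustration of the put-if-absent requirement, here is a sketch that uses a local filesystem's exclusive-create flag as a stand-in for the cloud-specific mechanisms listed above:

```python
# Put-if-absent sketch: os.O_EXCL makes file creation atomic on a local
# filesystem, so only one of two writers racing to create commit N succeeds.
# Cloud stores need the workarounds listed above to get the same guarantee.
import json
import os

def try_commit(log_dir, version, actions):
    commit_path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(commit_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                  # someone else already wrote this version
    with os.fdopen(fd, "w") as fh:
        for action in actions:        # one JSON action per line, as in the log
            fh.write(json.dumps(action) + "\n")
    return True
```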

  31. Concurrency Control
  Pessimistic concurrency: block others from writing anything; hold a lock, write the data files, commit to the log. ✔ Avoids wasted work. ✘ Requires distributed locks.
  Optimistic concurrency: assume it'll be okay and write the data files; try to commit to the log, fail on conflict. ✔ Mutual exclusion is enough! ✘ Breaks down if there are a lot of conflicts, but it is enough in practice, as write concurrency is usually low.

  32. Solving Conflicts Optimistically
  1. Record the start version. 2. Record reads/writes. 3. If someone else wins, check if anything you read has changed. 4. Try again.
  Example: starting from 000000.json, User 1 (reads A, writes B) and User 2 (reads A, writes C) commit concurrently. User 1 wins 000001.json; User 2's new file C does not conflict with new file B, so User 2 retries and commits successfully as 000002.json.

  33. Solving Conflicts Optimistically
  Same protocol, but now User 1 reads A and writes A,B while User 2 reads A and writes A,C. The deletion of file A by User 1 conflicts with the deletion by User 2, so User 2's operation fails.
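  Here is an illustrative sketch of the optimistic protocol from slides 32-33 (not Delta's actual implementation). It reuses try_commit from the mutual-exclusion sketch above, and the conflict rule only looks at removed files, which is enough to reproduce both examples:

```python
# Illustrative optimistic-commit loop: write data files first, then try to
# claim the next log version; if someone else won it, check their commit for
# conflicts with what this transaction read, and retry or fail.
import json
import os

def read_commit(log_dir, version):
    with open(os.path.join(log_dir, f"{version:020d}.json")) as fh:
        return [json.loads(line) for line in fh]

def commit_optimistically(log_dir, files_read, new_actions, start_version):
    version = start_version + 1
    while True:
        if try_commit(log_dir, version, new_actions):
            return version                        # we won this log slot
        # Someone else won this version: check their actions for conflicts.
        removed = {a["remove"]["path"]
                   for a in read_commit(log_dir, version) if "remove" in a}
        if removed & set(files_read):
            raise RuntimeError("conflict: a file this transaction read was removed")
        version += 1                              # no conflict, try the next slot
```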

  34. Metadata/Checkpoints as Data
  Large tables can have millions of files in them! Even pulling them out of Hive [MySQL] would be a bottleneck. So the sequence of log actions (Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet, ...) is periodically collapsed into a checkpoint of the table's current state, which is itself stored and processed as data.
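  A sketch of how a reader could use checkpoints, following Delta's on-disk naming (a Parquet checkpoint file plus a _last_checkpoint pointer in _delta_log/); the replay step is simplified and assumes an active SparkSession:

```python
# Sketch: load the latest checkpoint, then only the commits written after it
# need to be replayed, instead of every JSON commit since version 0.
import json
import os

def latest_checkpoint_version(log_dir):
    marker = os.path.join(log_dir, "_last_checkpoint")
    if not os.path.exists(marker):
        return None                        # no checkpoint yet: replay from 0
    with open(marker) as fh:
        return json.load(fh)["version"]

def live_files_from_checkpoint(spark, table_path):
    log_dir = os.path.join(table_path, "_delta_log")
    version = latest_checkpoint_version(log_dir)
    live = set()
    if version is not None:
        checkpoint = spark.read.parquet(
            os.path.join(log_dir, f"{version:020d}.checkpoint.parquet"))
        # The checkpoint has one row per action; the 'add' column is a struct.
        for row in checkpoint.select("add").where("add IS NOT NULL").collect():
            live.add(row["add"]["path"])
    # ...then apply only the JSON commits newer than `version`,
    # exactly as in the earlier replay sketch.
    return live
```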

  35. Challenges solved: Reliability
  Problem: Failed production jobs leave data in a corrupt state, requiring tedious recovery.
  Solution (DELTA): Failed write jobs do not update the commit log, hence partial/corrupt files are not visible to readers.

  36. Challenges solved: Reliability
  Challenge: Lack of consistency makes it almost impossible to mix appends, deletes, and upserts and get consistent reads.
  Solution (DELTA): All reads have full snapshot consistency; all successful writes are consistent; in practice, most writes don't conflict; tunable isolation levels (serializability by default).

  37. Challenges solved: Reliability
  Challenge: Lack of schema enforcement creates inconsistent and low quality data.
  Solution (DELTA): The schema is recorded in the log; attempts to commit data with an incorrect schema fail; explicit schema evolution is allowed; invariant and constraint checks keep data quality high.
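  A short PySpark sketch of schema enforcement and explicit evolution (assuming a Delta-enabled SparkSession `spark` as in the earlier sketch; the path /tmp/events2 and the columns are illustrative):

```python
# Schema enforcement in action: an append whose schema does not match the
# schema recorded in the log is rejected; evolution must be opted into.
good = spark.createDataFrame([(1, "click")], ["id", "event"])
extra = spark.createDataFrame([(2, "click", 0.5)], ["id", "event", "score"])

good.write.format("delta").mode("append").save("/tmp/events2")

try:
    # The extra 'score' column does not match the table schema, so the
    # commit is refused and no partial data becomes visible.
    extra.write.format("delta").mode("append").save("/tmp/events2")
except Exception as err:
    print("append rejected:", type(err).__name__)

# Explicit, opt-in schema evolution: merge the new column into the schema.
extra.write.format("delta").mode("append") \
     .option("mergeSchema", "true").save("/tmp/events2")
```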

  38. Challenges solved: Performance
  Challenge: Too many small files increase resource usage significantly.
  Solution (DELTA): Transactionally performed compaction using OPTIMIZE:
  OPTIMIZE table WHERE date = '2019-04-04'

  39. Challenges solved: Performance
  Challenge: Partitioning breaks down with many dimensions and/or high-cardinality columns.
  Solution (DELTA): Optimize using multi-dimensional clustering on multiple columns:
  OPTIMIZE conns WHERE date = '2019-04-04' ZORDER BY (srcIP, destIP)
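  For completeness, the same two operations through the Python DeltaTable API. Note that OPTIMIZE/ZORDER were Databricks SQL commands at the time of this 2019 talk and only later landed in open-source Delta Lake (2.0+), so this sketch assumes a recent delta-spark and a catalog table named conns:

```python
# OPTIMIZE and ZORDER through the Python DeltaTable API (Delta Lake 2.0+).
from delta.tables import DeltaTable

conns = DeltaTable.forName(spark, "conns")

# Slide 38: transactionally compact small files in one partition.
conns.optimize().where("date = '2019-04-04'").executeCompaction()

# Slide 39: multi-dimensional clustering on srcIP and destIP.
conns.optimize().where("date = '2019-04-04'").executeZOrderBy("srcIP", "destIP")
```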

  40. Querying connection data at Apple
  Ad-hoc queries over connection data (columns include date, srcIp, dstIp; PBs of data, trillions of rows) filter on different columns:
  SELECT count(*) FROM conns WHERE date = '2019-04-04' AND srcIp = '1.1.1.1'
  SELECT count(*) FROM conns WHERE date = '2019-04-04' AND dstIp = '1.1.1.1'
  Partitioning is bad here, as the cardinality of these columns is high.

  41. Multidimensional Sorting
  [Figure: an 8x8 grid of (srcIp, dstIp) values, with the two queries from the previous slide, one filtering on srcIp = '1.1.1.1' and one on dstIp = '1.1.1.1']

  42. Multidimensional Sorting
  [Figure: the same grid, with an ideal file size of 4 rows]

  43. Multidimensional Sorting
  [Figure: with multidimensional sorting, the srcIp query needs to read only 2 files]
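  To make the intuition behind slides 41-43 concrete, here is a small self-contained Python sketch (a toy, not Delta's implementation) that Z-orders an 8x8 grid of (srcIp, dstIp) keys by interleaving their bits and counts how many 4-row files a single-column filter has to touch. The exact counts differ from the figure, but the effect is the same: both columns get good file skipping.

```python
# Toy Z-ordering sketch: interleave the bits of the two clustered columns so
# that rows close in either srcIp or dstIp tend to land in the same file.
def z_value(x, y, bits=3):
    """Morton / Z-order value: interleave the low `bits` bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions from y
    return z

# Lay out an 8x8 grid of (srcIp, dstIp) keys in Z-order and cut it into files
# of 4 rows (the ideal file size on slide 42). Each file then covers a small
# square of the grid, so a filter on either column skips most files; with a
# plain linear sort on srcIp, the dstIp filter would have to read every file.
rows = sorted(((s, d) for s in range(8) for d in range(8)),
              key=lambda r: z_value(r[0], r[1]))
files = [rows[i:i + 4] for i in range(0, len(rows), 4)]

src_hits = sum(1 for f in files if any(s == 1 for s, _ in f))
dst_hits = sum(1 for f in files if any(d == 1 for _, d in f))
print(f"srcIp == 1 touches {src_hits} of {len(files)} files")   # 4 of 16
print(f"dstIp == 1 touches {dst_hits} of {len(files)} files")   # 4 of 16
```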
