Enabling Ultra-fast Presto in the Cloud with Alluxio

Enabling Ultra-fast Presto in the Cloud with Alluxio Haoyuan (H.Y.) Li | Founder & CTO | Alluxio | haoyuan@alluxio.com | alluxio.io/slack 2019-12-11 @ Presto Summit NYC ALLUXIO 2019


  1. Enabling Ultra-fast Presto in the Cloud with Alluxio Haoyuan (H.Y.) Li | Founder & CTO | Alluxio | haoyuan@alluxio.com | alluxio.io/slack 2019-12-11 @ Presto Summit NYC ALLUXIO 2019

  2. Outline • Alluxio Overview: History and its Open Source Community • Presto Alluxio Stack (PAS) Today: Architecture, Benefit, Production Use Cases • Alluxio Structured Data Service: Deeper Integration with SQL Engines like Presto

  3. Alluxio Overview: History and Open Source Community

  4. The Alluxio Story • Originated as the Tachyon project at UC Berkeley's AMPLab, by then Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li • Open source project established (2013) & company to commercialize Alluxio founded (2015) • Goal: Orchestrate data for analytics & ML in the cloud, for data-driven apps such as Big Data Analytics, ML and AI • Later timeline markers on the slide: 2018, 2018, 2019

  5. Open Source • Started from UC Berkeley AMPLab • 1000+ contributors & growing • Apache 2.0 Licensed • GitHub's Top 100 Most Valuable Repositories Out of 96 Million • 4000+ Git Stars • Join the conversation on Slack: slackin.alluxio.io

  6. Companies Running Alluxio • Financial Services • Retail & Entertainment • Data & Analytics Services • Technology • Consumer • Telco & Media • Travel & Transportation

  7. Four trends driving the need for a new architecture • Separation of Compute & Storage • Hybrid – Multi cloud environments • Self-service data across the enterprise • Rise of the object store

  8. Data Ecosystem - Beta and Data Ecosystem 1.0 (two diagrams, each showing COMPUTE over STORAGE)

  9. Data Ecosystem 1.0 – The Challenges • Complex • Low performance • Expensive (diagram: COMPUTE over STORAGE)

  10. Data silos cross data centers, regions, clouds (diagram: compute spread across many different frameworks, including Presto, Spark, TensorFlow and Hive, reaching over WAN links to data in disparate storage systems, including Azure, S3, object store, HDFS and NFS)

  11. Alluxio: an Open Source Data Orchestration System

  12. Data Platform using a Data Orchestration Approach (diagram: compute spread across many different frameworks, including Hive, Presto, TensorFlow, Spark and any app; a data orchestration layer in the middle; data in disparate storage systems, including S3, NFS and HDFS)

  13. Presto Alluxio Stack (PAS) Today: Architecture, Benefit, Production Use Cases

  14. Why Presto on Alluxio § Distributed Data Orchestration (including caching) on Demand • Faster: Lower query latency • SLA: More consistent performance • Efficiency: More concurrency and less data transfer § Deeper Presto Alluxio Integration (now available as Developer Preview in v2.1) • New Alluxio catalog service • New Alluxio transformation service

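  15. How Presto Works with Alluxio • Without Alluxio: Presto reads/writes metadata through the Hive Metastore (location=s3://bucket/table) and reads/writes data directly against storage • With Alluxio: Presto reads/writes metadata through the Hive Metastore (location=alluxio:///table) and reads/writes data through Alluxio, with the underlying storage mounted to Alluxio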

  16. How to Use Alluxio in Presto CLI • Create a table on Alluxio: > CREATE TABLE alluxio_table (id varchar) WITH (external_location = 'alluxio:///table'); • Read a table from Alluxio: > SELECT * FROM alluxio_table;
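
The location rewrite on the two slides above (s3://bucket/table becoming alluxio:///table once the bucket is mounted) can be sketched in a few lines of Python. This is only an illustration, not part of the talk and not an Alluxio or Presto API: the `MOUNTS` table, `to_alluxio_uri` and `create_table_sql` are hypothetical names, and mounting the bucket at the Alluxio root is an assumption chosen so that the result matches the slide.

```python
# Hypothetical mount table: Alluxio mount point -> under-storage URI.
# Mounting s3://bucket at "/" is an assumption chosen so that
# s3://bucket/table shows up as alluxio:///table, as on the slide.
MOUNTS = {"/": "s3://bucket"}


def to_alluxio_uri(ufs_uri, mounts=MOUNTS):
    """Translate an under-storage URI into the alluxio:// URI it is mounted at."""
    # Check the longest under-storage root first so nested mounts win.
    for mount_point, ufs_root in sorted(mounts.items(), key=lambda kv: -len(kv[1])):
        root = ufs_root.rstrip("/")
        if ufs_uri == root or ufs_uri.startswith(root + "/"):
            relative = ufs_uri[len(root):].lstrip("/")
            parts = [p for p in (mount_point.strip("/"), relative) if p]
            return "alluxio://" + "/" + "/".join(parts)
    raise ValueError(f"{ufs_uri} is not under any mounted location")


def create_table_sql(table, ufs_location):
    """Render the CREATE TABLE statement from the slide for a mounted location."""
    location = to_alluxio_uri(ufs_location)
    return (
        f"CREATE TABLE {table} (id varchar) "
        f"WITH (external_location = '{location}');"
    )


if __name__ == "__main__":
    print(create_table_sql("alluxio_table", "s3://bucket/table"))
```

The point of the sketch is that only the table location in the metastore changes; the query text in the Presto CLI stays the same.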

  17. Challenges with running workloads on cloud storage | Compute caching for S3 / GCS: Accelerate analytical frameworks on the public cloud ▪ S3 performance is variable and consistent query SLAs are hard to achieve ▪ S3 metadata operations are expensive, making workloads run longer ▪ S3 egress costs add up, making the solution expensive ▪ S3 is eventually consistent, making it hard to predict query results or (diagram: Spark and Presto, each with Alluxio on the same instance / container)

  18. Challenges with Hybrid Cloud | HDFS for Hybrid Cloud: Burst big data workloads in hybrid cloud environments ▪ Accessing data over WAN too slow ▪ Copying data to compute cloud time consuming and complex ▪ Using another storage system like S3 means expensive application changes ▪ Using S3 via HDFS connector leads to extremely low performance | Solution Benefits ▪ Same performance as local ▪ Same end-user experience ▪ 100% of I/O is offloaded (diagram: Presto with Alluxio on the same instance / container)

  19. Challenges running Big Data on Object Stores & Alluxio Solution | Transition to Object store: Dramatically speed-up big data on object stores on premise ▪ Object stores performance for big data workloads can be very poor ▪ No native support for popular frameworks ▪ Expensive metadata operations reduce performance even more ▪ No support for hybrid environments directly | Solution Benefits ▪ Same performance as HDFS ▪ Uses HDFS APIs ▪ Same end-user experience ▪ Storage at fraction of the cost of HDFS (diagram: Presto with Alluxio on the same container / machine)

  20. Robolox Use Case | Compute Caching for Cloud ▪ Cache hot data in Alluxio, leaving all data in S3 ▪ Reduce Presto queries from 10 sec to sub second ▪ Faster time to provide data scientists insights (diagram: Presto reading AWS S3 directly, compared with Presto reading through Alluxio backed by AWS S3)
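
The "cache hot data, leave all data in S3" pattern on this slide can be illustrated with a toy read-through cache. This is a conceptual sketch only: `ReadThroughCache` and its `fetch` callback are hypothetical names, the backing store is simulated by a plain function, and the LRU eviction shown here is an assumption made for illustration rather than a description of how Alluxio manages its cache.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache with LRU eviction (illustrative, not Alluxio's policy)."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch        # slow path, standing in for a read from S3
        self._capacity = capacity  # maximum number of cached objects
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._fetch(key)         # cold read goes to the backing store
        self._data[key] = value
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value


if __name__ == "__main__":
    cache = ReadThroughCache(fetch=lambda key: key.encode(), capacity=2)
    cache.read("part-0")  # cold read, served by the backing store
    cache.read("part-0")  # warm read, served from the cache
    print(cache.hits, cache.misses)
```

Only the first read of an object pays the cost of the backing store; repeated reads of hot data are served locally, which is the effect the use-case slides attribute to colocating a cache with the Presto workers.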

  21. NetEase Games: Leading Online Game Company in China | Use Case: On-premise Caching for Presto ▪ Large query variance during peak hours before ▪ Alluxio brings data local to Presto to reduce the latency during peak hours (diagram: Presto reading HDFS directly, compared with Presto reading through Alluxio backed by HDFS) https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/

  22. Architecture: Colocate Alluxio with Presto • Black/Red line – Large Query variance without Alluxio • Green line - Stable query time with Alluxio

  23. JD.com: Leading Online Retailer in China | Use Case: On-premise Satellite Cluster for Presto ▪ Presto workers may read remotely from HDFS datanodes -> large query variance ▪ Data local to Presto accelerates workloads (diagram: Spark and Presto reading HDFS directly, compared with Spark and Presto reading through Alluxio backed by HDFS) https://www.slideshare.net/Alluxio/alluxio-in-jd

  24. Architecture: Colocate Alluxio with Presto

  25. Performance Evaluation • Yellow line - Stable query time with Alluxio: < 1sec after first query (cold read) • Green line – JD Presto without Alluxio: > 10sec

  26. More Examples Details: www.alluxio.io/powered-by-alluxio/ www.alluxio.io/data-orchestration-summit-2019/

  27. Common Use Cases • Zero-copy burst workloads in hybrid cloud environments • On-premise satellite compute clusters across data centers • Accelerate query performance as cloud storage caching (diagram: a satellite Presto cluster and Spark / Presto / Hive nodes, each with Alluxio, alongside a main Hadoop cluster running Hive and Spark)

  28. Advanced Use Cases • Enable big data on object stores across single or multiple clouds • Orchestrate data frameworks on the public cloud (diagram: Spark, Hive and Presto over standalone Alluxio; labels: any public / private cloud, same data center / region, any cloud / multi cloud)

  29. Alluxio Structured Data Service: Deeper Integration with SQL Engines like Presto (now available as Developer Preview in v2.1)

  30. Impedance Mismatch between Storage Systems and SQL Frameworks • Storage Systems: Files/Objects, Directories, Raw Bytes, Cost-efficiency, Durability • SQL Frameworks: Tables, Schemas, Rows/Columns, Compute-optimized, Computation • Further Expand Benefits!

  31. Benefits of Alluxio Data Orchestration (between Storage Systems and SQL Frameworks) • Caching • Unified Interface/Namespace • Schema-Aware Optimizations • Compute-Optimized Formats • Physical Data Independence

  32. Alluxio Structured Data Service (from v2.1) (diagram: Presto with an Alluxio Connector and a Hive Connector; Alluxio Catalog Service, Alluxio Caching Service and Alluxio Transformation Service; Hive Metastore; Storage)

  33. Alluxio Structured Data Service Summary • Significantly speed up queries! • Detailed presentation: www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/ • Try it out!

  34. Next Step § Check out more tutorials: https://www.alluxio.io/presto/ § More Video & Slides: https://www.alluxio.io/data-orchestration-summit-2019/ § Additional Reads: • Starburst Presto + Alluxio = better together: https://www.starburstdata.com/technical-blog/starburst-presto-alluxio-better-together/ • Top 5 performance tips running Presto with Alluxio: https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1 • Presto + Alluxio + Hive Metastore on your Laptop in 10 min: https://www.alluxio.io/blog/tutorial-presto-alluxio-hive-metastore-on-your-laptop-in-10-min/ • Alluxio Structured Data Service: https://www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/

  35. Thank you! Questions? www.alluxio.io | slackin.alluxio.io | @alluxio | haoyuan@alluxio.com
