
Alluxio: Open Source Data Orchestration for Analytics and AI in the Cloud

Alluxio: Open Source Data Orchestration for Analytics and AI in the Cloud Haoyuan (H.Y.) Li | Founder, Chairman & CTO | haoyuan@alluxio.com 2019-11-18 @ PDSW 2019 The Alluxio Story Originated as Tachyon project, at the UC Berkeley's AMP Lab


  1. Alluxio: Open Source Data Orchestration for Analytics and AI in the Cloud Haoyuan (H.Y.) Li | Founder, Chairman & CTO | haoyuan@alluxio.com 2019-11-18 @ PDSW 2019

  2. The Alluxio Story: Originated as the Tachyon project at UC Berkeley's AMPLab by then Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li (2013). Open source project established and company to commercialize Alluxio founded (2015). Goal: orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML and AI.

  3. Early Days: Contributors Growth (contributors per release): v0.1 (Dec '12): 1; v0.2 (Apr '13): 3; v0.3 (Oct '13): 15; v0.4 (Feb '14): 30; v0.5 (Jul '14): 46; v0.6 (Mar '15): 70; v0.7 (Jul '15): 100+

  4. Open Source: Started from UC Berkeley AMPLab; Apache 2.0 licensed; 1000+ contributors and growing; 4000+ Git stars; among GitHub's Top 100 Most Valuable Repositories out of 96 million. Join the conversation on Slack: slackin.alluxio.io

  5. Companies Running Alluxio: Financial Services; Retail & Entertainment; Data & Analytics Services; Technology; Consumer; Telco & Media; Travel & Transportation

  6. 4 big trends driving the need for a new architecture
     § Separation of compute & storage
     § Hybrid and multi-cloud environments
     § Rise of the object store
     § Self-service data across the enterprise

  7. Data Ecosystem - Beta vs. Data Ecosystem 1.0 (two diagrams, each showing a COMPUTE layer over a STORAGE layer)

  8. Data Ecosystem 1.0: The Challenges: complex, low performance, expensive (diagram: COMPUTE over STORAGE)

  9. Data stack journey and innovation paths
     Data stack journey:
     § Co-located (compute & HDFS on the same cluster): typically compute-bound clusters over 100% capacity; compute & I/O need to be scaled together even when not needed
     § Disaggregated (compute & HDFS on the same cluster): compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
     Innovation paths:
     § Support more frameworks: support Presto, Spark across DCs without app changes
     § HDFS for Hybrid Cloud: burst HDFS data in the cloud, public or private
     § Transition to object store: enable & accelerate big data on object stores
     (Diagram labels: Hive, MR / Hive, HDFS)

  10. Independent scaling of compute & storage: Data Orchestration for the Cloud
      Client-side interfaces: POSIX Interface, Java File API, HDFS Interface, S3 Interface, REST API
      Under-storage drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver

  11. APIs to Interact with Data in Alluxio: applications have great flexibility to read and write data, with many options.
      Spark:  > rdd = sc.textFile("alluxio://localhost:19998/myInput")
      Presto: CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/')
      POSIX:  $ cat /mnt/alluxio/myInput
      Java:   FileSystem fs = FileSystem.Factory.get();
              FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
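The Spark and POSIX examples on this slide name the same file through different addressing schemes. The following is a minimal sketch, not taken from the slides, that uses Python's standard `urllib.parse` to take apart the `alluxio://` URI from the Spark example and map it to the path used in the `cat` example. The helper names `split_alluxio_uri` and `to_posix_path` are hypothetical, and the mount point `/mnt/alluxio` is assumed only because it appears in the POSIX example.

```python
from urllib.parse import urlparse
import posixpath


def split_alluxio_uri(uri):
    """Split an alluxio:// URI into (master host, master port, path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "alluxio":
        raise ValueError(f"not an alluxio:// URI: {uri}")
    return parsed.hostname, parsed.port, parsed.path


def to_posix_path(uri, mount_point="/mnt/alluxio"):
    """Map an alluxio:// URI to a path under an assumed local mount point."""
    _, _, path = split_alluxio_uri(uri)
    return posixpath.join(mount_point, path.lstrip("/"))


host, port, path = split_alluxio_uri("alluxio://localhost:19998/myInput")
print(host, port, path)
print(to_posix_path("alluxio://localhost:19998/myInput"))
```

The sketch only manipulates strings; it does not contact an Alluxio master or require a mounted file system.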

  12. Challenges with supporting more frameworks across data centers
      Use case: support more frameworks across data centers (on-premise satellite compute clusters)
      § Running new frameworks on an existing HDFS cluster can dramatically affect performance of existing workloads
      § Orchestrating data to compute clusters in another data center is typically a manual effort and time consuming
      § Storing and managing multiple copies of the data becomes expensive
      (Diagram labels: Data center A, Presto, Alluxio, Hive, MapReduce, Data center B)

  13. Challenges with running workloads on cloud storage
      Use case: accelerate analytical frameworks on the public cloud (compute caching for S3 / GCS)
      § S3 performance is variable and consistent query SLAs are hard to achieve
      § S3 metadata operations are expensive, making workloads run longer
      § S3 egress costs add up, making the solution expensive
      § S3 is eventually consistent, making it hard to predict query results
      (Diagram: Spark on Alluxio, same instance / container)

  14. Challenges with Hybrid Cloud
      Use case: HDFS for Hybrid Cloud (burst big data workloads in hybrid cloud environments)
      Challenges:
      § Accessing data over WAN too slow
      § Copying data to compute cloud time consuming and complex
      § Using another storage system like S3 means expensive application changes
      § Using S3 via HDFS connector leads to extremely low performance
      Solution benefits:
      § Same performance as local
      § Same end-user experience
      § 100% of I/O is offloaded
      (Diagram: Presto on Alluxio, same instance / container)

  15. Challenges running Big Data on Object Stores & Alluxio Solution
      Use case: transition to object store (dramatically speed up big data on object stores on premise)
      Challenges:
      § Object store performance for big data workloads can be very poor
      § No native support for popular frameworks
      § Expensive metadata operations reduce performance even more
      § No support for hybrid environments directly
      Solution benefits:
      § Same performance as HDFS
      § Uses HDFS APIs
      § Same end-user experience
      § Storage at a fraction of the cost of HDFS
      (Diagram: Presto on Alluxio, same container / machine)

  16. Use Cases Alluxio Enables
      § Accelerate big data frameworks on the public cloud (same instance / container)
      § Burst big data workloads in hybrid cloud environments (same instance / container)
      § Dramatically speed up big data on object stores on premise (same container / machine)
      (Diagrams: Presto, Spark and Hive running on Alluxio)

  17. Advanced Use Cases
      § Enable big data on object stores across single or multiple clouds
      § Orchestrate data frameworks on the public cloud
      (Diagram labels: Spark, Hive, Presto, Alluxio, Alluxio Standalone, Any public / private cloud, Same data center / region, Any Cloud / Multi Cloud)

  18. Alluxio: Key innovations
      § Data Locality with Intelligent Multi-tiering: accelerate big data workloads with transparent tiered local data
      § Data Accessibility for popular APIs & API translation: run Spark, Hive, Presto, ML workloads on your data located anywhere
      § Data Elasticity with a unified namespace: abstract data silos & storage systems to independently scale data on-demand with compute

  19. Data Locality with Intelligent Multi-tiering: local performance from remote data using multi-tier storage
      § Tiers: RAM (hot), SSD (warm), HDD (cold)
      § Read & write buffering, transparent to the app
      § Policies for pinning, promotion/demotion, TTL
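To make the policy vocabulary on this slide concrete, here is a toy sketch of a three-tier block store with pinning, promotion on read, demotion on overflow, and TTL expiry. It is an illustration under stated assumptions, not Alluxio's implementation: the class name `TieredStore`, the per-tier capacity counted in blocks, the least-recently-used victim choice, and the explicit `now` clock parameter are all assumptions introduced here.

```python
from collections import OrderedDict


class TieredStore:
    """Toy block store with ordered tiers, e.g. RAM (hot) > SSD (warm) > HDD (cold)."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_blocks), hottest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.pinned = set()

    def pin(self, key):
        """A pinned block is not demoted while its tier holds an unpinned block."""
        self.pinned.add(key)

    def tier_of(self, key):
        """Return the name of the tier holding the block, or None."""
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return None

    def put(self, key, value, ttl=None, now=0.0):
        """Write a block into the hottest tier, optionally with a time-to-live."""
        for _, _, blocks in self.tiers:
            blocks.pop(key, None)  # drop any stale copy
        expires = None if ttl is None else now + ttl
        self._insert(0, key, (value, expires))

    def get(self, key, now=0.0):
        """Read a block; a hit promotes it to the hottest tier."""
        for _, _, blocks in self.tiers:
            if key in blocks:
                value, expires = blocks.pop(key)
                if expires is not None and now >= expires:
                    return None  # TTL expired: the block stays removed
                self._insert(0, key, (value, expires))
                return value
        return None

    def _insert(self, level, key, entry):
        if level >= len(self.tiers):
            return  # fell off the coldest tier: evicted
        _, cap, blocks = self.tiers[level]
        blocks[key] = entry
        if len(blocks) > cap:
            # Demote the oldest unpinned block; if all are pinned, the oldest one.
            victim = next((k for k in blocks if k not in self.pinned),
                          next(iter(blocks)))
            self._insert(level + 1, victim, blocks.pop(victim))
```

With `TieredStore([("RAM", 1), ("SSD", 1), ("HDD", 1)])`, writing block `a` and then block `b` demotes `a` to SSD, and a later `get("a")` promotes it back to RAM while `b` moves down. A real system would size tiers in bytes and move data asynchronously.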

  20. Data Accessibility via popular APIs and API Translation: convert from client-side interface to native storage interface
      Client-side interfaces: POSIX Interface, REST API, Java File API, HDFS Interface, S3 Interface
      Storage drivers: HDFS Driver, S3 Driver, Swift Driver, NFS Driver

  21. Data Elasticity via Unified Namespace: enables effective data management across different under stores; uses mounting with transparent naming

  22. Unified Namespace: Global Data Accessibility. Transparent access to under storage makes all enterprise data available locally.
      Supports:
      • HDFS
      • NFS
      • OpenStack
      • Ceph
      • Amazon S3
      • Azure
      • Google Cloud
      IT ops friendly:
      • Storage mounted into Alluxio by central IT
      • Security in Alluxio mirrors source data
      • Authentication through LDAP/AD
      • Wireline encryption
      (Diagram labels: HDFS #1, Object Store, NFS, HDFS #2)
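Mounting with transparent naming can be pictured as a table that maps a prefix of the unified namespace onto an under-store location. The sketch below is a simplified illustration of that idea, not Alluxio's code: the `MountTable` class, its longest-prefix lookup, and the example host and bucket names are hypothetical.

```python
class MountTable:
    """Toy unified namespace: maps namespace paths onto under-store URIs."""

    def __init__(self):
        self._mounts = {}  # mount point (namespace path) -> under-store URI

    def mount(self, alluxio_path, ufs_uri):
        """Attach an under-store location at a path in the unified namespace."""
        mount_point = alluxio_path.rstrip("/") or "/"
        self._mounts[mount_point] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the under-store URI that backs it."""
        # Try the longest mount point first so nested mounts win.
        for mount_point in sorted(self._mounts, key=len, reverse=True):
            if mount_point == "/":
                rest = path
            elif path == mount_point or path.startswith(mount_point + "/"):
                rest = path[len(mount_point):]
            else:
                continue
            rest = rest.strip("/")
            ufs = self._mounts[mount_point]
            return f"{ufs}/{rest}" if rest else ufs
        raise KeyError(f"no mount covers {path}")


table = MountTable()
table.mount("/", "hdfs://namenode:8020/warehouse")  # hypothetical HDFS cluster
table.mount("/sales", "s3://example-bucket/sales")  # hypothetical S3 bucket
print(table.resolve("/sales/2019/report.csv"))
print(table.resolve("/logs/app.log"))
```

Applications see one tree (`/`, `/sales`, ...) while each subtree is backed by a different storage system, which is the property the slide calls transparent naming.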

  23. Alluxio Reference Architecture
      § Application with Alluxio Client (repeated per application)
      § Alluxio Workers, each with RAM / SSD / HDD
      § Alluxio Master and Standby Master, coordinated via Zookeeper / RAFT
      § Under Store 1 and Under Store 2, reached over the WAN

  24. Policy-Driven Under File System Migration (diagram labels: hdfs://host:port/directory/, Sales, Reports)

  25. Research Directions
      § Machine-learning based data orchestration policies
      § Scalable and high-performance file system metadata service
      § Optimization for in-memory data partition / format
      § Cross-layer optimization for distributed compute and storage systems

  26. JD.com | Performance Use Case in DC ($70B e-commerce retailer)
      Project:
      • Offload HDFS with separate clusters of Presto and Spark
      Problem:
      • HDFS cluster is compute and network bound
      • Performance is inconsistent
      Alluxio solution:
      • Alluxio offloads the network I/O as well as the compute
      Result:
      • Teams can run additional workloads without taxing the existing HDFS cluster
      (Diagram: separate Presto and Spark compute over a 3000-node HDFS datacenter, without and with Alluxio in between)

  27. DBS Bank | Performance & Hybrid (largest bank in Southeast Asia)
      Initial project: Digital Bank Initiative
      • Solve scaling challenges by separating compute and using object storage
      Problem:
      • Coupled systems were not flexible to scale
      Alluxio solution:
      1. Alluxio provides intelligent caching layer for object storage
      2. Burst workloads to hybrid cloud
      Result:
      • Enables data on-demand; Alluxio now considered a mature layer in the stack
      (Diagram labels: AI & Analytics Frameworks, Alluxio, AWS, Object Store, HDFS, Datacenter)

  28. Walmart | Performance Use Case in Cloud
      Project:
      • Utilize Presto for interactive queries on cloud object store
      Problem:
      • Low performance of queries, too slow to be usable
      • Inconsistent performance of queries
      Alluxio solution:
      • Alluxio provides intelligent distributed compute caching layer for object storage
      Result:
      • High performance queries
      • Consistent performance
      • Interactive query performance for analysts
      (Diagram: Presto over an object store in the public cloud, without and with Alluxio in between)
