Alluxio: Open Source Data Orchestration for Analytics and AI in the Cloud
Haoyuan (H.Y.) Li | Founder, Chairman & CTO | haoyuan@alluxio.com
2019-11-18 @ PDSW 2019
The Alluxio Story
Originated as the Tachyon project at UC Berkeley’s AMPLab, started by then Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li.
2013, 2015: Open source project established, and company to commercialize Alluxio founded.
Goal: Orchestrate data at memory speed for the cloud, for data-driven apps such as big data analytics, ML, and AI.
[Timeline markers: 2018, 2019, 2018]
Releases: v0.1 Dec ’12, v0.2 Apr ’13, v0.3 Oct ’13, v0.4 Feb ’14, v0.5 Jul ’14, v0.6 Mar ’15, v0.7 Jul ’15
Consumer; Travel & Transportation; Telco & Media; Technology; Financial Services; Retail & Entertainment; Data & Analytics Services
[Diagram: co-located compute & HDFS (MR / Hive on HDFS) and disaggregated compute & HDFS (Hive, HDFS)]
Transition to Object store
HDFS for Hybrid Cloud
Support more frameworks
§ Burst HDFS data in the cloud, public or private
§ Support Presto, Spark across DCs without app changes
§ Enable & accelerate big data on
§ Typically compute-bound clusters over 100% capacity
§ Compute & I/O need to be scaled together even when not needed
§ Compute & I/O can be scaled independently, but I/O is still needed on HDFS, which is expensive
Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Spark:
> rdd = sc.textFile("alluxio://localhost:19998/myInput")
Presto:
CREATE SCHEMA hive.web WITH (location = 'alluxio://master:port/my-table/')
POSIX:
$ cat /mnt/alluxio/myInput
Java:
FileSystem fs = FileSystem.Factory.get();
FileInStream in = fs.openFile(new AlluxioURI("/myInput"));
§ Running new frameworks on an existing HDFS cluster can dramatically affect the performance of existing workloads
§ Orchestrating data to compute clusters in another data center is typically a manual, time-consuming effort
§ Storing and managing multiple copies of the data becomes expensive
On-premise satellite compute clusters across data centers
[Diagram labels: Data center A, MapReduce, Hive, Data center B]
§ S3 performance is variable, and consistent query SLAs are hard to achieve
§ S3 metadata operations are expensive, making workloads run longer
§ S3 egress costs add up, making the solution expensive
§ S3 is eventually consistent, making it hard to predict query results
Accelerate analytical frameworks
Same instance / container
§ Accessing data over the WAN is too slow
§ Copying data to the compute cloud is time-consuming and complex
§ Using another storage system like S3 means expensive application changes
§ Using S3 via the HDFS connector leads to extremely low performance
Burst big data workloads in hybrid cloud environments
Same instance / container
Solution Benefits
§ Same performance as local
§ Same end-user experience
§ 100% of I/O is offloaded
§ Object store performance for big data workloads can be very poor
§ No native support for popular frameworks
§ Expensive metadata operations reduce performance even more
§ No direct support for hybrid environments
Dramatically speed up big data
Same container / machine
Solution Benefits
§ Same performance as HDFS
§ Uses HDFS APIs
§ Same end-user experience
§ Storage at a fraction of the cost of HDFS
Burst big data workloads in hybrid cloud environments
Same instance / container
Accelerate big data frameworks
Same instance / container
Dramatically speed up big data
Same container / machine
Any Cloud / Multi Cloud Same data center / region
Enable big data on object stores across single or multiple clouds
Standalone
Orchestrate data frameworks on the public cloud
Any public / private cloud
Abstract data silos & storage systems to independently scale data on-demand with compute
Run Spark, Hive, Presto, ML workloads on your data located anywhere
Accelerate big data workloads with transparent tiered local data
Hot: RAM
Warm: SSD
Cold: HDD
Read & Write Buffering, Transparent to App
Policies for pinning, promotion/demotion, TTL
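The tiering ideas on this slide (hot/warm/cold tiers, promotion and demotion, pinning, TTL) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Alluxio's implementation: the class `TieredCache`, its methods, the block-count capacities, and the LRU demotion rule are all hypothetical choices made for illustration.

```python
import time
from collections import OrderedDict

TIERS = ["RAM", "SSD", "HDD"]  # hot, warm, cold (names taken from the slide)


class TieredCache:
    """Toy three-tier block store (hypothetical, for illustration only).

    Assumed policies: least-recently-used blocks are demoted one tier down
    when a tier overflows, any read promotes the block back to the top tier,
    pinned blocks are never demoted, and blocks with a TTL expire on access.
    """

    def __init__(self, capacity_per_tier, clock=time.monotonic):
        # capacity_per_tier: max number of blocks per tier, in TIERS order
        self.capacity = dict(zip(TIERS, capacity_per_tier))
        # each tier maps block_id -> (data, expires_at); oldest entry first
        self.tiers = {name: OrderedDict() for name in TIERS}
        self.pinned = set()
        self.clock = clock

    def put(self, block_id, data, ttl=None, pin=False):
        expires_at = None if ttl is None else self.clock() + ttl
        if pin:
            self.pinned.add(block_id)
        self._remove(block_id)
        self._insert(0, block_id, (data, expires_at))

    def get(self, block_id):
        for name in TIERS:
            entry = self.tiers[name].get(block_id)
            if entry is None:
                continue
            data, expires_at = entry
            if expires_at is not None and self.clock() >= expires_at:
                # TTL elapsed: drop the block entirely
                self._remove(block_id)
                self.pinned.discard(block_id)
                return None
            # promotion: move the block to the top tier on every read
            del self.tiers[name][block_id]
            self._insert(0, block_id, entry)
            return data
        return None

    def tier_of(self, block_id):
        for name in TIERS:
            if block_id in self.tiers[name]:
                return name
        return None

    def _remove(self, block_id):
        for name in TIERS:
            self.tiers[name].pop(block_id, None)

    def _insert(self, level, block_id, entry):
        tier = self.tiers[TIERS[level]]
        tier[block_id] = entry  # most recently used goes to the end
        while len(tier) > self.capacity[TIERS[level]]:
            # demotion: pick the oldest block that is not pinned
            victim = next((b for b in tier if b not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; this toy simply overflows
            victim_entry = tier.pop(victim)
            if level + 1 < len(TIERS):
                self._insert(level + 1, victim, victim_entry)
            # otherwise the block falls out of the coldest tier


if __name__ == "__main__":
    cache = TieredCache([2, 2, 2])
    for name in ["a", "b", "c"]:
        cache.put(name, name.upper())
    print(cache.tier_of("a"))  # the oldest block has been demoted below RAM
    cache.get("a")             # reading it promotes it again
    print(cache.tier_of("a"))
```

Counting capacity in blocks rather than bytes keeps the sketch short; a real system would track bytes per tier and would likely use more refined eviction policies.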
Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver
SUPPORTS
IT OPS FRIENDLY
by central IT
source data
LDAP/AD
HDFS #1 Object Store NFS HDFS #2
[Architecture diagram: Applications, each with an Alluxio Client; Alluxio Master and Standby Master (Zookeeper / RAFT); Alluxio Workers, each with RAM / SSD / HDD; WAN; Under Store 1 and Under Store 2]
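The components named in this diagram (application, client, master, workers, under stores) suggest a read-through caching path: the client asks the master where data lives, reads from a worker, and the worker fetches from the under store only on a miss. The following is a minimal sketch of that idea, not Alluxio's actual protocol. The classes `UnderStore`, `Worker`, `Master`, and `Client` are hypothetical, and the placement rule (byte sum modulo worker count) is an arbitrary assumption chosen to be deterministic.

```python
class UnderStore:
    """Stand-in for a remote store such as HDFS or an object store."""

    def __init__(self, files):
        self.files = dict(files)
        self.reads = 0  # counts how often the remote store is actually hit

    def read(self, path):
        self.reads += 1
        return self.files[path]


class Worker:
    """Caches data locally (RAM / SSD / HDD in the diagram; a dict here)."""

    def __init__(self, under_store):
        self.under_store = under_store
        self.cache = {}

    def read(self, path):
        if path not in self.cache:
            # cache miss: fetch once from the under store, then keep it
            self.cache[path] = self.under_store.read(path)
        return self.cache[path]


class Master:
    """Tracks which worker is responsible for which path (metadata only)."""

    def __init__(self, workers):
        self.workers = workers
        self.locations = {}

    def locate(self, path):
        if path not in self.locations:
            # hypothetical placement rule, chosen only for determinism
            self.locations[path] = sum(path.encode()) % len(self.workers)
        return self.workers[self.locations[path]]


class Client:
    """What the application links against: metadata from the master, data from a worker."""

    def __init__(self, master):
        self.master = master

    def read(self, path):
        return self.master.locate(path).read(path)


if __name__ == "__main__":
    store = UnderStore({"/myInput": b"hello"})
    client = Client(Master([Worker(store), Worker(store)]))
    client.read("/myInput")
    client.read("/myInput")
    print(store.reads)  # the second read is served from the worker's cache
```

Keeping the master on the metadata path only, and the workers on the data path, is the design point the sketch tries to show; failover to the standby master is left out.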
hdfs://host:port/directory/ Reports Sales
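Mapping an under-store location such as `hdfs://host:port/directory/` into a single namespace can be pictured as a mount table with longest-prefix resolution. The sketch below is an illustration under assumptions, not Alluxio's code: the `MountTable` class, the `/Data` mount point, and the `s3://example-bucket/root` URI are hypothetical, while `host:port` is kept as the placeholder used on the slide.

```python
class MountTable:
    """Toy mount table: maps paths in one namespace onto under-store URIs."""

    def __init__(self):
        self.mounts = {}  # namespace prefix -> under-store URI

    def mount(self, namespace_path, ufs_uri):
        prefix = namespace_path.rstrip("/") or "/"
        self.mounts[prefix] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path using the longest matching mount point."""
        best = None
        for prefix in self.mounts:
            covers = path == prefix or path.startswith(prefix.rstrip("/") + "/")
            if covers and (best is None or len(prefix) > len(best)):
                best = prefix
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        suffix = path if best == "/" else path[len(best):]
        suffix = suffix.strip("/")
        return self.mounts[best] + ("/" + suffix if suffix else "")


if __name__ == "__main__":
    table = MountTable()
    table.mount("/Data", "hdfs://host:port/directory/")  # placeholder from the slide
    table.mount("/", "s3://example-bucket/root")          # hypothetical root mount
    print(table.resolve("/Data/Reports"))
    print(table.resolve("/Data/Sales"))
```

Matching on whole path components (prefix plus `/`) keeps a path like `/Database/x` from being captured by the `/Data` mount.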
Project:
Problem:
network bound
$70B e-commerce retailer
Alluxio solution:
well as the compute
Result:
without taxing the existing HDFS cluster
[Diagram: 3000 Node HDFS; Presto and Spark in a separate compute datacenter; Alluxio]
ALLUXIO Analytics Frameworks AI & Analytics Object Store
AWS
Initial Project:
compute and using object storage
Problem:
scale
Largest bank in Southeast Asia
Datacenter Datacenter
Alluxio solution:
1. Alluxio provides intelligent caching layer for object storage
2. Burst workloads to hybrid cloud
Result:
considered mature layer in stack
HDFS Analytics Frameworks ALLUXIO Object Store
Datacenter
PRESTO OBJECT STORE
Public Cloud
Project:
Problem:
to be usable
Alluxio solution:
caching layer for object storage
Result:
analysts
PRESTO OBJECT STORE
Public Cloud
ALLUXIO
DATA ORCHESTRATION SPARK HDFS SPARK
Kubernetes
OBJECT HBASE ETL SPARK HDFS OBJECT HBASE
Leading Chinese Telco serving 320 million subscribers
HDFS
Leading Online Game Company in China
https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/
Presto HDFS Presto Alluxio