Enabling Ultra-fast Presto in the Cloud with Alluxio Haoyuan (H.Y.) - - PowerPoint PPT Presentation




slide-1
SLIDE 1

ALLUXIO

2019

Enabling Ultra-fast Presto in the Cloud with Alluxio

Haoyuan (H.Y.) Li | Founder & CTO | Alluxio | haoyuan@alluxio.com | alluxio.io/slack 2019-12-11 @ Presto Summit NYC

slide-2
SLIDE 2

Outline

  • Alluxio Overview: History and its Open Source Community
  • Presto Alluxio Stack (PAS) Today: Architecture, Benefit, Production Use Cases
  • Alluxio Structured Data Service: Deeper Integration with SQL Engines like Presto

slide-3
SLIDE 3

Alluxio Overview

History and Open Source Community

slide-4
SLIDE 4

The Alluxio Story

Originated as the Tachyon project at UC Berkeley's AMPLab, started by then Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li.

Open Source project established & company to commercialize Alluxio founded.

Goal: Orchestrate Data for Analytics & ML in the Cloud for data-driven apps such as Big Data Analytics, ML and AI.

Timeline years shown on the slide: 2013, 2015, 2018, 2019

slide-5
SLIDE 5

Open Source Started From UC Berkeley AMPLab

  • 1000+ contributors & growing
  • 4000+ Git Stars
  • Apache 2.0 Licensed
  • GitHub’s Top 100 Most Valuable Repositories Out of 96 Million
  • Join the conversation on Slack: slackin.alluxio.io

slide-6
SLIDE 6

Companies Running Alluxio

  • Consumer
  • Travel & Transportation
  • Telco & Media
  • Technology
  • Financial Services
  • Retail & Entertainment
  • Data & Analytics Services

slide-7
SLIDE 7

Four trends driving the need for a new architecture

  • Separation of Compute & Storage
  • Hybrid – Multi cloud environments
  • Self-service data across the enterprise
  • Rise of the object store

slide-8
SLIDE 8

Data Ecosystem - Beta | Data Ecosystem 1.0

[Diagram: COMPUTE and STORAGE layout in each ecosystem]

slide-9
SLIDE 9

Data Ecosystem 1.0 – The Challenges

[Diagram: STORAGE and COMPUTE]

  • Complex
  • Low performance
  • Expensive

slide-10
SLIDE 10

Data silos cross data centers, regions, clouds

[Diagram: DATA IN DISPARATE STORAGE SYSTEMS (HDFS, NFS, object store, S3, Azure) and COMPUTE SPREAD ACROSS MANY DIFFERENT FRAMEWORKS (Hive, Presto, Spark, TensorFlow), connected over WAN links]

slide-11
SLIDE 11

Alluxio: an Open Source Data Orchestration System

slide-12
SLIDE 12

Data Platform using a Data Orchestration Approach

[Diagram: a DATA ORCHESTRATION layer between DATA IN DISPARATE STORAGE SYSTEMS (HDFS, NFS, S3) and COMPUTE SPREAD ACROSS MANY DIFFERENT FRAMEWORKS (Hive, Presto, Spark, TensorFlow, any data app)]

slide-13
SLIDE 13

Presto Alluxio Stack (PAS) Today

Architecture, Benefit, Production Use Cases

slide-14
SLIDE 14

Why Presto on Alluxio

§ Distributed Data Orchestration (including caching) on Demand

  • Faster: Lower query latency
  • SLA: More consistent performance
  • Efficiency: More concurrency and less data transfer

§ Deeper Presto Alluxio Integration

  • New Alluxio catalog service
  • New Alluxio transformation service

Now available as Developer Preview in v2.1

slide-15
SLIDE 15

How Presto Works with Alluxio

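Without Alluxio:

  • Presto reads/writes metadata through the Hive Metastore (location=s3://bucket/table)
  • Presto reads/writes data directly in the storage

With Alluxio:

  • Presto reads/writes metadata through the Hive Metastore (location=alluxio:///table)
  • Presto reads/writes data through Alluxio
  • The storage is mounted to Alluxio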
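The slide shows the table location in the Hive Metastore changing from s3://bucket/table to alluxio:///table once the storage is mounted to Alluxio. The following is a minimal Python sketch of that path translation, assuming a simple longest-prefix mount table. The function name `resolve_under_storage` and the mount-table dictionary are hypothetical illustrations, not Alluxio's API or its actual implementation.

```python
from typing import Dict


def resolve_under_storage(alluxio_uri: str, mounts: Dict[str, str]) -> str:
    """Map an alluxio:// URI to an under-storage URI (illustrative only).

    `mounts` is a hypothetical mount table: Alluxio path -> storage URI.
    """
    scheme = "alluxio://"
    if not alluxio_uri.startswith(scheme):
        raise ValueError(f"not an Alluxio URI: {alluxio_uri}")

    # Drop the scheme and the (possibly empty) authority, keep the path.
    rest = alluxio_uri[len(scheme):]
    path = rest[rest.index("/"):] if "/" in rest else "/"

    # Try the longest mount point first so nested mounts win over the root.
    for mount_point in sorted(mounts, key=len, reverse=True):
        base = mount_point.rstrip("/")
        if path == base or path.startswith(base + "/"):
            relative = path[len(base):]
            return mounts[mount_point].rstrip("/") + relative

    raise LookupError(f"no mount point covers {path}")
```

With the root of the Alluxio namespace mounted to `s3://bucket`, `resolve_under_storage("alluxio:///table", {"/": "s3://bucket"})` gives back the original `s3://bucket/table` location, which is why only the location in the metastore needs to change.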

slide-16
SLIDE 16

How to Use Alluxio in Presto CLI

Create A Table on Alluxio

> CREATE TABLE alluxio_table (id varchar) WITH (external_location = 'alluxio:///table');

Read A Table from Alluxio

> SELECT * FROM alluxio_table;

slide-17
SLIDE 17

Compute caching for S3 / GCS

Accelerate analytical frameworks on the public cloud

Challenges with running workloads on cloud storage

▪ S3 performance is variable, and consistent query SLAs are hard to achieve
▪ S3 metadata operations are expensive, making workloads run longer
▪ S3 egress costs add up, making the solution expensive
▪ S3 is eventually consistent, making it hard to predict query results

[Diagram: Alluxio co-located with Spark or Presto in the same instance / container]
slide-18
SLIDE 18

HDFS for Hybrid Cloud

Burst big data workloads in hybrid cloud environments

Challenges with Hybrid Cloud

▪ Accessing data over WAN too slow
▪ Copying data to compute cloud time consuming and complex
▪ Using another storage system like S3 means expensive application changes
▪ Using S3 via HDFS connector leads to extremely low performance

Solution Benefits

▪ Same performance as local
▪ Same end-user experience
▪ 100% of I/O is offloaded

[Diagram: Alluxio co-located with Presto in the same instance / container]

slide-19
SLIDE 19

Transition to Object store

Dramatically speed-up big data on object stores on premise

Challenges running Big Data on Object Stores & Alluxio Solution

▪ Object store performance for big data workloads can be very poor
▪ No native support for popular frameworks
▪ Expensive metadata operations reduce performance even more
▪ No support for hybrid environments directly

Solution Benefits

▪ Same performance as HDFS
▪ Uses HDFS APIs
▪ Same end-user experience
▪ Storage at fraction of the cost of HDFS

[Diagram: Alluxio co-located with Presto in the same container / machine]

slide-20
SLIDE 20

Use Case | Compute Caching for Cloud

Robolox

▪ Cache hot data in Alluxio, leaving all data in S3
▪ Reduce Presto query time from 10 sec to sub-second
▪ Faster time to provide data scientists insights

[Diagram: Presto on AWS S3, and Presto with Alluxio on AWS S3]
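The idea of caching hot data near compute while all data stays in S3 can be illustrated with a small read-through cache. This is a sketch under stated assumptions, not Alluxio's implementation: the class name `ReadThroughCache`, the `fetch_remote` callback standing in for an S3 read, and the least-recently-used eviction policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Illustrative read-through cache with LRU eviction (assumed policy)."""

    def __init__(self, fetch_remote: Callable[[str], bytes], capacity_bytes: int):
        self._fetch = fetch_remote          # stand-in for a remote (e.g. S3) read
        self._capacity = capacity_bytes     # local cache budget in bytes
        self._used = 0
        self._blocks: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._blocks:
            # Warm read: served locally, mark as most recently used.
            self._blocks.move_to_end(key)
            self.hits += 1
            return self._blocks[key]
        # Cold read: go to the remote store, then keep a local copy.
        self.misses += 1
        data = self._fetch(key)
        self._admit(key, data)
        return data

    def _admit(self, key: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return  # larger than the whole cache: serve it but do not cache
        # Evict least recently used blocks until the new one fits.
        while self._used + len(data) > self._capacity:
            _, evicted = self._blocks.popitem(last=False)
            self._used -= len(evicted)
        self._blocks[key] = data
        self._used += len(data)
```

Only the first read of a key reaches the remote store; repeated reads of hot keys are served locally, and the remote store remains the source of truth because evicted data is simply dropped from the cache.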

slide-21
SLIDE 21

Use Case | On-premise Caching for Presto

NetEase Games

Leading Online Game Company in China

▪ Before: large query variance during peak hours
▪ Alluxio brings data local to Presto to reduce the latency during peak hours

https://www.alluxio.io/blog/presto-on-alluxio-how-netease-games-leveraged-alluxio-to-boost-ad-hoc-sql-on-hdfs/

[Diagram: Presto on HDFS, and Presto with Alluxio on HDFS]

slide-22
SLIDE 22

Architecture: Colocate Alluxio with Presto

  • Black/Red line – Large Query variance without Alluxio
  • Green line - Stable query time with Alluxio
slide-23
SLIDE 23

Use Case | On-premise Satellite Cluster for Presto

JD.com

Leading Online Retailer in China

▪ Presto workers may read remotely from HDFS datanodes -> large query variance
▪ Data local to Presto accelerates workloads

https://www.slideshare.net/Alluxio/alluxio-in-jd

[Diagram: Presto alongside HDFS and Spark, and Presto with Alluxio alongside HDFS and Spark]

slide-24
SLIDE 24

Architecture: Colocate Alluxio with Presto


slide-25
SLIDE 25

Performance Evaluation

  • Yellow line - Stable query time with Alluxio
  • < 1 sec after first query (cold read)
  • Green line - JD Presto without Alluxio: > 10 sec
slide-26
SLIDE 26

More Examples

Details: www.alluxio.io/powered-by-alluxio/ and www.alluxio.io/data-orchestration-summit-2019/

slide-27
SLIDE 27

Common Use Cases

Accelerate query performance as cloud storage caching

[Diagram: Alluxio co-located with Spark and Presto]

On-premise satellite compute clusters across data centers

[Diagram: Satellite Presto Cluster (Presto, Alluxio) and Main Hadoop Cluster (Spark, Hive)]

Zero-copy burst workloads in hybrid cloud environments

[Diagram: Alluxio co-located with Hive and Presto]

slide-28
SLIDE 28

Advanced Use Cases

Enable big data on object stores across single or multiple clouds

Orchestrate data frameworks on the public cloud

[Diagram labels: Spark, Presto or Hive with Alluxio; Standalone; Same data center / region; Any Cloud / Multi Cloud; Any public / private cloud]

slide-29
SLIDE 29

Alluxio Structured Data Service

Deeper Integration with SQL Engines like Presto

Now available as Developer Preview in v2.1

slide-30
SLIDE 30

Impedance Mismatch

Storage Systems: Files/Objects, Directories, Raw Bytes, Cost-efficiency, Durability

SQL Frameworks: Tables, Schemas, Rows/Columns, Compute-optimized, Computation

Further Expand Benefits!

slide-31
SLIDE 31

Benefits of Alluxio Data Orchestration

[Diagram: Alluxio between Storage Systems and SQL Frameworks]

  • Caching
  • Unified Interface/Namespace
  • Schema-Aware Optimizations
  • Compute-Optimized Formats
  • Physical Data Independence
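To make "Compute-Optimized Formats" concrete, here is a minimal Python sketch of why a column-oriented layout helps a SQL engine: a query that needs one column can read just that column instead of parsing every row. The slide does not say which formats or conversions are involved, so the CSV input and the function name `rows_to_columns` are assumptions for illustration only.

```python
import csv
import io
from typing import Dict, List


def rows_to_columns(csv_text: str) -> Dict[str, List[str]]:
    """Regroup row-oriented CSV text into one list of values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: Dict[str, List[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns


def sum_column(columns: Dict[str, List[str]], name: str) -> float:
    """Aggregate a single column without touching the others."""
    return sum(float(value) for value in columns[name])
```

After the one-time regrouping, something like `SELECT sum(price)` only has to scan the `price` list, which is the kind of benefit a storage-side transformation can offer to a SQL framework.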

slide-32
SLIDE 32

Alluxio Structured Data Service (from v2.1)

[Diagram: Presto with Hive Connector and Alluxio Connector; Alluxio Caching Service, Alluxio Catalog Service, Alluxio Transformation Service; Hive Metastore; Storage]

slide-33
SLIDE 33

Alluxio Structured Data Service Summary

  • Significantly speed up queries!
  • Detailed presentation: www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/
  • Try it out!
slide-34
SLIDE 34

Next Step

§ Check out more tutorials: https://www.alluxio.io/presto/
§ More Video & Slides: https://www.alluxio.io/data-orchestration-summit-2019/
§ Additional Reads:

  • Starburst Presto + Alluxio = better together
    https://www.starburstdata.com/technical-blog/starburst-presto-alluxio-better-together/
  • Top 5 performance tips running Presto with Alluxio
    https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1
  • Presto + Alluxio + Hive Metastore on your Laptop in 10 min
    https://www.alluxio.io/blog/tutorial-presto-alluxio-hive-metastore-on-your-laptop-in-10-min/
  • Alluxio Structured Data Service
    https://www.alluxio.io/resources/videos/alluxio-innovations-for-structured-data/

slide-35
SLIDE 35

Thank you! Questions?

www.alluxio.io | slackin.alluxio.io | @alluxio | haoyuan@alluxio.com