
Apache Kylin Introduction
Dec 8, 2014 | @ApacheKylin
Luke Han, Sr. Product Manager | lukhan@ebay.com | @lukehq
Yang Li, Architect & Tech Leader | yangli9@ebay.com
http://kylin.io

Agenda n What’s Apache Kylin? n Tech Highlights n Performance n Open Source n Q & A

What’s Kylin
kylin /ˈkiːˈlɪn/ 麒麟
--n. (in Chinese art) a mythical animal of composite form

Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014

Big Data Era n More and more data becoming available on Hadoop n Limitations in existing Business Intelligence (BI) Tools n Limited support for Hadoop n Data size growing exponentially n High latency of interactive queries n Scale-Up architecture n Challenges to adopt Hadoop as interactive analysis system n Majority of analyst groups are SQL savvy n No mature SQL interface on Hadoop n OLAP capability on Hadoop ecosystem not ready yet http://kylin.io

Business Needs for Big Data Analysis
• Sub-second query latency on billions of rows
• ANSI SQL for both analysts and engineers
• Full OLAP capability to offer advanced functionality
• Seamless integration with BI tools
• Support of high cardinality and high dimensions
• High concurrency – thousands of end users
• Distributed and scale-out architecture for large data volume
• Open source solution

Why not Build an engine from scratch?

Analytics Query Taxonomy
Kylin is designed to accelerate the performance of 80+% of analytics queries on Hadoop.
• High Level Aggregation: very high level, e.g. GMV by site by vertical by weeks
• Analysis Query: middle level, e.g. GMV by site by vertical by category (level x), past 12 weeks
• Drill Down to Detail: detail level (summary table)
• Low Level Aggregation: first-level aggregation
• Transaction Level: transaction data
Side labels in the diagram: Strategy, OLAP, Operation, OLTP, Transaction

Technical Challenges
• Huge volume data
  – Table scan
• Big table joins
  – Data shuffling
• Analysis on different granularity
  – Runtime aggregation expensive
• Map Reduce job
  – Batch processing

OLAP Cube – Balance between Space and Time
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)

0-D (apex) cuboid
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)

• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
  2. (9/15, milk, Urbana, *) - <time, item, location>
  3. (*, milk, Urbana, *) - <item, location>
  4. (*, milk, Chicago, *) - <item, location>
  5. (*, milk, *, *) - <item>
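The cuboid lattice on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not Kylin code: the function name `enumerate_cuboids` is made up for illustration, and it simply lists every subset of the four dimensions named on the slide, grouped by level.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Group every cuboid (subset of the dimensions) by its number of dimensions."""
    levels = {}
    for k in range(len(dimensions) + 1):
        # combinations() keeps the input order, so each cuboid is a sorted tuple.
        levels[k] = list(combinations(dimensions, k))
    return levels


dims = ["time", "item", "location", "supplier"]
lattice = enumerate_cuboids(dims)

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")

# A cube over n dimensions has 2**n cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

With four dimensions the lattice holds 1 + 4 + 6 + 4 + 1 cuboids, which is why the number of cuboids to build grows quickly as dimensions are added.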

From Relational to Key-Value

Kylin Architecture Overview
• Legend: Online Analysis Data Flow; Offline Data Flow
• Clients/users interact with Kylin via SQL
• The OLAP cube is transparent to users

Diagram components:
• Clients: 3rd Party App (Web App, Mobile…) via REST API; SQL-Based Tool (BI Tools: Tableau…) via JDBC/ODBC
• Kylin: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…)
• Data: Hadoop Hive (Star Schema Data), mid latency, minutes; OLAP Cube (HBase) (Key Value Data), low latency, seconds
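To make the "3rd party app via REST API" path concrete, here is a hedged sketch of how a client might prepare a SQL query request. It assumes the `/kylin/api/query` endpoint and JSON fields described in the Kylin REST documentation; the host, port, credentials, project name, and table/column names are all hypothetical, and the request is only built, not sent.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, project, sql, limit=50):
    """Build (but do not send) a POST request carrying a SQL query."""
    payload = {
        "sql": sql,
        "project": project,
        "offset": 0,
        "limit": limit,
        "acceptPartial": False,
    }
    # HTTP Basic authentication header; credentials here are placeholders.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    "http://localhost:7070",            # hypothetical server address
    "analyst", "secret",                # hypothetical credentials
    "my_project",                       # hypothetical project
    "SELECT site, SUM(gmv) FROM sales_fact GROUP BY site",  # hypothetical table
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

The client only ever writes SQL against the star schema; whether the answer comes from the cube is decided on the server side, which is what "transparent to users" means on the slide.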

Features Highlights
• Extremely Fast OLAP Engine at Scale
  – Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
• ANSI SQL Interface on Hadoop
  – Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
• Seamless Integration with BI Tools
  – Kylin currently offers integration capability with BI tools like Tableau
• Interactive Query Capability
  – Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
• MOLAP Cube
  – Users can define a data model and pre-build it in Kylin with more than 10 billion raw data records

Features Highlights (cont.)
• Compression and encoding support
• Incremental refresh of cubes
• Approximate query capability for distinct count (HyperLogLog)
• Leverage HBase Coprocessor for query latency
• Job management and monitoring
• Easy web interface to manage, build, monitor and query cubes
• Security capability to set ACL at Cube/Project level
• Support for LDAP integration

How Does Kylin Utilize Hadoop Components?
• Hive
  – Input source
  – Pre-join star schema during cube building
• MapReduce
  – Pre-aggregation of metrics during cube building
• HDFS
  – Store intermediate files during cube building
• HBase
  – Store data cube
  – Serve queries on data cube
  – Coprocessor is used for query processing

Why is Kylin Fast?
• Pre-built cube – query results are already calculated
• Leveraging distributed computing infrastructure
• No runtime Hive table scan and MapReduce job
• Compression and encoding
• Put “Computing” to “Data”
• Cached
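The first point, answering from a pre-built cube instead of scanning raw rows, can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions (a single SUM measure, plain Python dictionaries instead of HBase, hypothetical column names); it is not how Kylin stores or builds cubes.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-compute SUM(measure) for every cuboid (every subset of dimensions)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for cuboid in combinations(dimensions, k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in cuboid)
                totals[key] += row[measure]
            cube[cuboid] = dict(totals)
    return cube


def query(cube, dimensions, group_by):
    """Answer a GROUP BY by looking up the matching cuboid; no raw rows are read."""
    cuboid = tuple(d for d in dimensions if d in group_by)
    return cube[cuboid]


# Hypothetical fact rows.
rows = [
    {"site": "US", "category": "toys", "gmv": 10.0},
    {"site": "US", "category": "books", "gmv": 5.0},
    {"site": "DE", "category": "toys", "gmv": 7.0},
]
dimensions = ["site", "category"]

cube = build_cube(rows, dimensions, "gmv")   # expensive, done once offline
by_site = query(cube, dimensions, ["site"])  # cheap, done at query time
print(by_site)
```

The build step touches every row for every cuboid, while the query step is a single dictionary lookup. That trade of storage and build time for query time is the "balance between space and time" from the earlier slide.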

Agenda n What’s Kylin n Tech Highlights n Performance n Open Source n Q & A

How to Define Cube? Data Modeling
Roles: End User, Cube Modeler, Admin
• Source: Star Schema (Fact table with Dim tables)
• Mapping: Cube Metadata
  – Cube: …
  – Fact Table: …
  – Dimensions: …
  – Measures: …
  – Storage (HBase): …
• Target: HBase Storage
  – Row Key → Column Family / Column: row A → Val 1, row B → Val 2, row C → Val 3

How to Define Cube? Cube Metadata
• Dimension
  – Normal
  – Mandatory
  – Hierarchy
  – Derived
• Measure
  – Sum
  – Count
  – Max
  – Min
  – Average
  – Distinct Count (based on HyperLogLog)
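Distinct count differs from the other measures because exact distinct counts cannot be added together across cells. HyperLogLog trades exactness for a small fixed-size sketch that can be merged. The following is a simplified, generic HyperLogLog written for illustration; it is not Kylin's implementation, and the class name, hash choice, and register count are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash
        width = 64 - self.p
        index = h >> width                       # first p bits pick a register
        rest = h & ((1 << width) - 1)            # remaining bits
        rank = width - rest.bit_length() + 1     # position of the leftmost 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Union of two sketches: element-wise maximum of the registers."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)       # small-range correction
        return raw


first_half, second_half, everything = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(10000):
    everything.add(i)
    (first_half if i < 5000 else second_half).add(i)

first_half.merge(second_half)
print(round(everything.estimate()), round(first_half.estimate()))
```

Merging the two half sketches gives exactly the same registers as sketching all values at once, which is the property that lets an approximate distinct count be rolled up from finer cells to coarser ones.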

How to Define Cube? Mandatory Dimension
• A dimension that must be present in every cuboid
• E.g. Date

Normal      A is mandatory
A B C       A B C
A B -       A B -
- B C       A - C
A - C       A - -
A - -
- B -
- - C
- - -
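The effect of a mandatory dimension on the cuboid list can be checked with a short sketch. The helper names below are made up for illustration; the code only filters the full lattice over A, B, C down to the cuboids that contain A.

```python
from itertools import combinations


def all_cuboids(dimensions):
    """Every subset of the dimensions, largest cuboids first."""
    return [c for k in range(len(dimensions), -1, -1)
            for c in combinations(dimensions, k)]


def with_mandatory(dimensions, mandatory):
    """Keep only the cuboids that contain every mandatory dimension."""
    required = set(mandatory)
    return [c for c in all_cuboids(dimensions) if required <= set(c)]


normal = all_cuboids(["A", "B", "C"])
pruned = with_mandatory(["A", "B", "C"], ["A"])

print(len(normal), "cuboids without the rule")
print(len(pruned), "cuboids with A mandatory:", pruned)
```

Each mandatory dimension halves the number of cuboids to build, at the cost that every query must filter or group by that dimension.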

How to Define Cube? Hierarchy Dimension
• Dimensions that form a “contains” relationship where the parent level is required for the child level to make sense
• E.g. Year -> Month -> Day; Country -> City

Normal      A -> B -> C is hierarchy
A B C       A B C
A B -       A B -
- B C       A - -
A - C       - - -
A - -
- B -
- - C
- - -
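The hierarchy rule keeps only the cuboids in which the levels present form a prefix of the hierarchy (a child level never appears without its parent). A minimal sketch of that filter, with hypothetical helper names:

```python
from itertools import combinations


def all_cuboids(dimensions):
    """Every subset of the dimensions, largest cuboids first."""
    return [c for k in range(len(dimensions), -1, -1)
            for c in combinations(dimensions, k)]


def respects_hierarchy(cuboid, hierarchy):
    """True if no level is present while one of its ancestors is missing."""
    ancestor_missing = False
    for level in hierarchy:            # ordered from parent to child
        if level in cuboid:
            if ancestor_missing:
                return False
        else:
            ancestor_missing = True
    return True


hierarchy = ["A", "B", "C"]            # A -> B -> C
kept = [c for c in all_cuboids(hierarchy) if respects_hierarchy(c, hierarchy)]
print(kept)
```

For a hierarchy of n levels this leaves n + 1 cuboids instead of 2**n, matching the four rows in the right-hand column above.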

How to Define Cube? Derived Dimension
• Dimensions on a lookup table that can be derived by PK
• E.g. User ID -> [Name, Age, Gender]

Normal      A, B, C is derived by ID
A B C       ID
A B -       - - -
- B C
A - C
A - -
- B -
- - C
- - -
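One way to picture a derived dimension is that the cube is built on the key only, and attributes such as gender are resolved through the lookup table when a query asks for them. The sketch below assumes that query-time resolution model; the slides state only that the attributes can be derived by PK, and all names and values here are hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup table keyed by its primary key (user ID).
lookup = {
    1: {"name": "Ann", "age": 30, "gender": "F"},
    2: {"name": "Bob", "age": 41, "gender": "M"},
    3: {"name": "Cat", "age": 25, "gender": "F"},
}

# Hypothetical pre-aggregated cuboid on the ID dimension only: user ID -> SUM(gmv).
id_cuboid = {1: 10.0, 2: 5.0, 3: 7.0}


def group_by_derived(id_cuboid, lookup, attribute):
    """Re-aggregate the ID-level cuboid by an attribute derived from the ID."""
    totals = defaultdict(float)
    for key, value in id_cuboid.items():
        totals[lookup[key][attribute]] += value
    return dict(totals)


by_gender = group_by_derived(id_cuboid, lookup, "gender")
print(by_gender)
```

Only the ID cuboid and the apex cuboid are stored, so three lookup attributes cost one dimension in the cube, with a small extra aggregation step per query.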

How to Build Cube? Cube Build Job Flow

How to Build Cube? Cube Build Result

How to Query Cube? Query Engine – Calcite
• Dynamic data management framework
• Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others
• http://optiq.incubator.apache.org

How to Query Cube? Calcite Plugins
• Metadata SPI
  – Provide table schema from Kylin metadata
• Optimize Rule
  – Translate the logical operator into the Kylin operator
• Relational Operator
  – Find the right cube
  – Translate SQL into storage engine API calls
  – Generate the physical execution plan with the linq4j Java implementation
• Result Enumerator
  – Translate the storage engine result into the Java implementation result
• SQL Function
  – Add HyperLogLog for distinct count
  – Implement date/time related functions (e.g. Quarter)
