Da Data c cubes i in A n Apache he H Hive Amareshwari - PowerPoint PPT Presentation

Da Data c cubes i in A n Apache he H Hive Amareshwari Sriramadasu Jaideep Dhok

• Engineer at Inmobi Amareshwari • Apache Hive– Committer • Apache Hadoop – PMC Sriramadasu • Working in Hadoop and eco systems since 2007

• Engineer at Inmobi • Working in distributed Jaideep systems and Hadoop eco Dhok system since 2007 • Apache Hive and Hadoop Contributor

Agenda Background Why Apache Hive Data cubes in Hive Queries on data cubes with examples Road map

About InMobi Global Mobile technology company enabling - Developers & Publishers to monetize - Advertisers to engage and acquire users @ Scale

Digital advertising – Intro Consumer Owns & Sells Wants to Real estate on target Users digital inventory Brings money Has reach to users Market place Courtesy: http://www.liesdamnedlies.com/

Hadoop @ InMobi – Factual Reporting & Analytics Ø 130 TB Hadoop warehouse Ø 5 TB SQL warehouse Ø Pipelines

Use cases Use cases • Sharing reports/data with customer (account level) Categorize use cases • Understand trends through data & • Batch queries exploration (analysis) • Adhoc queries • Debugging / Postmortem of issues • Interactive queries (troubleshooting) • Canned reports • Sizing & Estimation (Ex: inventory, reach) • Scheduled reports • Summary of Product lines, Geographies, • Infer insights through ML algorithms Network (Ex: Rev by Geo) • Sales/Revenue Targets vs Actuals • Tracking campaign performance • Tracking any metrics on REAL- TIME basis

Current state of analytics - Reporting Adhoc Querying system Dashboard system • Adhoc and Batch queries • Canned reports • Scheduled queries • Scheduled reports • Based on Hadoop Mapreduce • Interactive and adhoc queries • Provides UI and custom api • Provides UI and Custom api • Data is stored in HDFS • Data is stored in columnar DWH, InfoBright Customer facing system • Face to the outside world (Advertisers and publishers) • Interactive and adhoc queries • Provides UI and custom api • Data is stored in relational DB, Postgres

Current State - Problems • Disparate user experience • Disparate data storage systems causing inability to scale • Not leveraging community around

Apache Hive to the rescue What does Hive provide Associates structure to data Data warehouse features like facts, dimensions Provides Metastore and catalog service – Hcatalog Logical table associated with What is missing in Hive multiple physical storages Provides pluggable storage Pluggable execution engine for Accepts SQL like queries HQL HQL is widely adopted Query history, caching language by systems like Shark, Impala Scheduling queries Provides pluggable interface for adding new storage Has strong apache community

Data Layout Aggrk : measures (mak <= ma(k-1)), dimensions (dak < da(k-1)) … .. Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1) Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr) Raw data : measures (mr), dimensions(dr) Other side of pyramid is aggregated at timed dimension

Data Layout Dim3 Dim2 Dim2-1 Aggrk : measures (mak <= ma(k-1)), dimensions Dim4 (dak < da(k-1)) … .. Aggr2 : measures (ma2 <= ma1), dimensions (da2 < da1) Dim1 Aggr1 : measures (ma1 <= mr), dimensions (da1 < dr) Dim4-1 Raw data : measures (mr), dimensions(dr)

Data Model Dimension Cube Storage Fact Table Table Physical Physical Dimension Fact tables tables

Data Model - Cube Measure • Column Measure : name, type, default aggregate, format string, start date, end date • Expression Measure : Associated Expression Cube Dimension • Simple Dimension: name, type, start date, end date • Referenced Dimension : Referencing table and column • Hierarchical Dimensions Measures Dimension :hierarchy • Expression Dimension : Associated expression

Data Model – Storage Storage Name End point Properties Ex : UA2, UJ1, Mpower-IB

Data Model – Fact Table Fact Table Fact Cube table Columns Fact Cube that it belongs table Storages on which it is present and the associated update periods Storage

Data Model – Dimension table Dimension Table Cube Dimension Columns table Dimension Dimension references table Storages on which it is present and Dimension associated snapshot dump period, if any. table Storage

Data Model – Storage tables and partitions Storage table Fact table • Fact storage table Associated storage descriptor Dimension table Partitioned by columns • Dimension storage table

Queries on Data cubes cube_table_reference: cube_table_factor | join_table CUBE SELECT [DISTINCT] select_expr, join_table: select_expr, ... cube_table_reference JOIN cube_table_factor [join_condition] FROM cube_table_reference | cube_table_reference {LEFT|RIGHT|FULL} [OUTER] WHERE [where_condition AND] JOIN cube_table_reference [join_condition] TIME_RANGE_IN(colName , from, to) cube_table_factor: cube_name [alias] [GROUP BY col_list] | ( cube_table_reference ) [HAVING having_expr] join_condition: [ORDER BY colList] ON equality_expression ( AND equality_expression )* [LIMIT number] equality_expression: expression = expression colOrder: ( ASC | DESC ) colList : colName colOrder? (',' colName colOrder?)*

Querying features • Resolve candidate dimension tables and the storage tables . • Resolve the candidate fact tables and the storage tables for the queried time range. • Automatically resolve joins using the relationships between cubes and dimension. • Automatically add aggregate functions to measures. • Add expression to group by clause, if projected; and project group by clause, if it is not.

Example query cube select name, stateid from citytable limit 100 • SELECT ( citytable . name ), ( citytable . stateid ) FROM c2_citytable citytable LIMIT 100 • SELECT ( citytable . name ), ( citytable . stateid ) FROM c1_citytable citytable WHERE (citytable.dt = 'latest') LIMIT 100

Example query cube select citytable.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’) • SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable citytable ON ((testcube.cityid)= (citytable.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (citytable.dt = 'latest') GROUP BY(citytable.name)

Data ware house statistics Stats • Number of queries - 700 to 900 per day • Number of dimension tables - 125 • Number of fact tables – 24 • Number cubes – 15 • Size of the data • Total size – 136 TB • Dimension data – 400 MB compressed per hour • Raw data - 1.2 TB per day • Aggregated facts- 53GB per day

What is available Available in Hive • Data warehouse features like Available in github facts, dimensions • Logical table associated • Pluggable execution engine for with multiple physical HQL storages • Query history, caching • Scheduling queries

Pluggable execution engine Implements an interface • execute • explain • executeAsynchronously • fetchResults • Specify all storages it can support

Cube query with multiple execution engines Cube QL query Rewrite query for available execution engines Get cost of the rewritten query from each execution engine Pick up execution engine with least cost and fire the query

Future roadmap: Unified analytics

Thank you! • amareshwari@apache.org • jaideep.dhok@inmobi.com

Da Data c cubes i in A n Apache he H Hive Amareshwari - PowerPoint PPT Presentation

Da Data c cubes i in A n Apache he H Hive Amareshwari Sriramadasu Jaideep Dhok Engineer at Inmobi Amareshwari Apache Hive Committer Apache Hadoop PMC Sriramadasu Working in Hadoop and eco systems since 2007

The The O Old Hive ld Hive The mission of bee farm THE HE OLD LD HIVE VE is to produce

Working the Hive 1 * What When How What to do Everyone who own or manages a hive must be

Apache HIVE Data Warehousing & Analytics on Hadoop Hefu Chai What is HIVE? A system for

Sandpile Groups of Cubes B. Anzis & R. Prasad August 1, 2016 Sandpile Groups of Cubes

Sergey Beryozkin, T alend Sergey Beryozkin, T alend Apache CXF Apache CXF Practical JOSE

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such as Apache Geode Christian

Apache Felix Web Console Carsten Ziegeler | cziegeler@apache.org ApacheCon NA 2014 About

The Apache Way The Apache Way Nick Burch Nick Burch CTO, Quanticate CTO, Quanticate The

Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research Apache

Maths Measurement Maths | Year 6 | Measurement | Volume of Cubes and Cuboids | Lesson 1 of 3:

Stream Processing with Apache Apex Thomas Weise Apache Apex PMC Chair thw@apache.org @thweise

Getting Hadoop, Hive and HBase up and running in less than 15 mins ApacheCon NA 2013 Mark

Apache Gearpump next-gen streaming engine Karol Brejna, Intel (karolbrejna@apache.org) Huafeng

Multi-tenant Machine Learning Apache Aurora & Apache Mesos Stephan Erb

What's new with Apache Tika? What's new with Apache Tika? What's New with Apache Tika? What's

Avoiding Vendor Lock-In Avoiding Vendor Lock-In Using Apache Libcloud Using Apache Libcloud

FINDING, ASSESSING, AND INTEGRATING STATISTICAL SOURCES FOR DATA MINING Karin Becker 1 , Xiaojie

Counting the spanning trees of the 3-cube using edge slides Christopher Tuffley Institute of

CS 225 Data Structures Fe February 3 - Li Lifecycle G G Carl Evans Cop Copy Con Constru

Creating a Data Science Centric Organization Challenges and Opportunities Canadian Data

Cube-Attack-Like Cryptanalysis of Round-Reduced Keccak Using MILP Ling Song , Jian Guo FSE 2019 @

Lecture 1.1: What is a group? Matthew Macauley Department of Mathematical Sciences Clemson

sqr

Non-extendible Latin cuboids Ian Wanless Monash University, Australia D. Bryant, N. J. Cavenagh,

Da Data c cubes i in A n Apache he H Hive Amareshwari - PowerPoint PPT Presentation

Da Data c cubes i in A n Apache he H Hive Amareshwari Sriramadasu Jaideep Dhok Engineer at Inmobi Amareshwari Apache Hive Committer Apache Hadoop PMC Sriramadasu Working in Hadoop and eco systems since 2007

The The O Old Hive ld Hive The mission of bee farm THE HE OLD LD HIVE VE is to produce

Working the Hive 1 * What *When *How What to do Everyone who own or manages a hive must be

Apache HIVE Data Warehousing &amp; Analytics on Hadoop Hefu Chai What is HIVE? A system for

Sandpile Groups of Cubes B. Anzis &amp; R. Prasad August 1, 2016 Sandpile Groups of Cubes

Sergey Beryozkin, T alend Sergey Beryozkin, T alend Apache CXF Apache CXF Practical JOSE

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such as Apache Geode Christian

Apache Felix Web Console Carsten Ziegeler | cziegeler@apache.org ApacheCon NA 2014 About

The Apache Way The Apache Way Nick Burch Nick Burch CTO, Quanticate CTO, Quanticate The

Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research Apache

Maths Measurement Maths | Year 6 | Measurement | Volume of Cubes and Cuboids | Lesson 1 of 3:

Stream Processing with Apache Apex Thomas Weise Apache Apex PMC Chair thw@apache.org @thweise

Getting Hadoop, Hive and HBase up and running in less than 15 mins ApacheCon NA 2013 Mark

Apache Gearpump next-gen streaming engine Karol Brejna, Intel (karolbrejna@apache.org) Huafeng

Multi-tenant Machine Learning Apache Aurora &amp; Apache Mesos Stephan Erb

What's new with Apache Tika? What's new with Apache Tika? What's New with Apache Tika? What's

Avoiding Vendor Lock-In Avoiding Vendor Lock-In Using Apache Libcloud Using Apache Libcloud

FINDING, ASSESSING, AND INTEGRATING STATISTICAL SOURCES FOR DATA MINING Karin Becker 1 , Xiaojie

Counting the spanning trees of the 3-cube using edge slides Christopher Tuffley Institute of

CS 225 Data Structures Fe February 3 - Li Lifecycle G G Carl Evans Cop Copy Con Constru

Creating a Data Science Centric Organization Challenges and Opportunities Canadian Data

Cube-Attack-Like Cryptanalysis of Round-Reduced Keccak Using MILP Ling Song , Jian Guo FSE 2019 @

Lecture 1.1: What is a group? Matthew Macauley Department of Mathematical Sciences Clemson

sqr

Non-extendible Latin cuboids Ian Wanless Monash University, Australia D. Bryant, N. J. Cavenagh,

Working the Hive 1 * What When How What to do Everyone who own or manages a hive must be

Apache HIVE Data Warehousing & Analytics on Hadoop Hefu Chai What is HIVE? A system for

Sandpile Groups of Cubes B. Anzis & R. Prasad August 1, 2016 Sandpile Groups of Cubes

Multi-tenant Machine Learning Apache Aurora & Apache Mesos Stephan Erb