Apache HIVE: Data Warehousing & Analytics on Hadoop
Hefu Chai

What is HIVE?
• A system for managing and querying structured data, built on top of Hadoop
  • Uses Map-Reduce for execution
  • Uses HDFS for storage
  • Extensible to other data repositories
• Key building principles:
  • SQL on structured data as a familiar data warehousing tool
  • Extensibility (pluggable map/reduce scripts in the language of your choice, rich and user-defined data types, user-defined functions)
  • Interoperability (extensible framework to support different file and data formats)

What HIVE Is Not
• Not designed for OLTP
• Does not offer real-time queries

HIVE Architecture

Hive/Hadoop Usage @ Facebook
• Types of applications:
  • Summarization
    • E.g., daily/weekly aggregations of impression/click counts
    • Complex measures of user engagement
  • Ad hoc analysis
    • E.g., number of group admins broken down by state/country
  • Data mining (assembling training data)
    • E.g., user engagement as a function of user attributes
  • Spam detection
    • Anomalous patterns for site integrity
    • Application API usage patterns
  • Ad optimization
  • Too many to count ..

Hive Query Language
• Basic SQL
  • CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING);
  • SHOW TABLES '.*s';
  • DESCRIBE sample;
  • ALTER TABLE sample ADD COLUMNS (new_col INT);
  • DROP TABLE sample;
• Extensibility
  • Pluggable Map-reduce scripts
  • Pluggable User Defined Functions
  • Pluggable User Defined Types
  • Pluggable SerDes to read different kinds of Data Formats
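
To make "pluggable Map-reduce scripts" concrete, here is a minimal sketch of a streaming script in Python. It assumes Hive's TRANSFORM clause, which feeds each row to the script as one tab-separated line on standard input and reads tab-separated lines back. The file name upper_bar.py and the upper-casing logic are hypothetical, chosen only to match the sample (foo, bar) table above.

```python
import io
import sys


def transform_line(line):
    """Upper-case the second column (bar) of one tab-separated input row."""
    foo, bar = line.rstrip("\n").split("\t")
    return f"{foo}\t{bar.upper()}"


def main(stdin, stdout):
    """Read rows from stdin, write one transformed row per line to stdout."""
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


# Local demo with in-memory streams. In the deployed script (hypothetical
# upper_bar.py) the last line would instead be: main(sys.stdin, sys.stdout)
demo_out = io.StringIO()
main(io.StringIO("1\thello\n2\tworld\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Under these assumptions the script would be registered with ADD FILE upper_bar.py; and invoked as SELECT TRANSFORM(foo, bar) USING 'python upper_bar.py' AS (foo, bar) FROM sample;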

Hive QL – Join
page_view:
  pageid  userid  time
  1       111     9:08:01
  2       111     9:08:13
  1       222     9:08:14
user:
  userid  age  gender
  111     25   female
  222     32   male
pv_users (= page_view X user):
  pageid  age
  1       25
  2       25
  1       32
• SQL: INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);

Hive QL – Join in Map Reduce
page_view:
  pageid  userid  time
  1       111     9:08:01
  2       111     9:08:13
  1       222     9:08:14
Map output for page_view (key = userid, value = <table tag 1, pageid>):
  key  value
  111  <1, 1>
  111  <1, 2>
  222  <1, 1>
user:
  userid  age  gender
  111     25   female
  222     32   male
Map output for user (key = userid, value = <table tag 2, age>):
  key  value
  111  <2, 25>
  222  <2, 32>
After Shuffle/Sort (values grouped by key):
  key  value
  111  <1, 1>
  111  <1, 2>
  111  <2, 25>
  222  <1, 1>
  222  <2, 32>

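Hive QL – Join in Map Reduce
Reduce input for key 111:
  key  value
  111  <1, 1>
  111  <1, 2>
  111  <2, 25>
Reduce output for key 111 (rows of pv_users):
  pageid  age
  1       25
  2       25
Reduce input for key 222:
  key  value
  222  <1, 1>
  222  <2, 32>
Reduce output for key 222 (rows of pv_users):
  pageid  age
  1       32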
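
The map, shuffle/sort, and reduce steps on the join slides can be replayed in a few lines of plain Python. This is only an in-memory sketch of the diagrams, not Hive's actual execution code: the function names are made up for illustration, and the tags 1 and 2 mark which table a value came from, as on the slides.

```python
from collections import defaultdict

# Input tables, copied from the slides.
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]


def map_page_view(row):
    """Emit (userid, <1, pageid>): tag 1 marks a page_view record."""
    pageid, userid, _time = row
    return userid, (1, pageid)


def map_user(row):
    """Emit (userid, <2, age>): tag 2 marks a user record."""
    userid, age, _gender = row
    return userid, (2, age)


def shuffle_sort(pairs):
    """Group values by join key and sort them, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


def reduce_join(values):
    """Pair every pageid (tag 1) with every age (tag 2) for one key."""
    pageids = [v for tag, v in values if tag == 1]
    ages = [v for tag, v in values if tag == 2]
    return [(pageid, age) for pageid in pageids for age in ages]


mapped = [map_page_view(r) for r in page_view] + [map_user(r) for r in user]
grouped = shuffle_sort(mapped)

pv_users = []
for _userid, values in grouped.items():
    pv_users.extend(reduce_join(values))

print(pv_users)  # [(1, 25), (2, 25), (1, 32)]
```

The printed rows match the pv_users table from the join slide.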

Integration with HBase
• Reasons to use Hive on HBase:
  • A lot of data sits in HBase because it is used in a real-time environment, but it is never used for analysis
  • Gives people who don't code (e.g., business analysts) access to HBase data that is usually queried only through MapReduce
• Reasons not to do it:
  • Running SQL queries on HBase to answer live user requests (each query is still a MapReduce job)

Integration with HBase
• Hive can use tables that already exist in HBase or manage its own, but they all still reside in the same HBase instance
• Diagram: one Hive table definition points to an existing HBase table; another HBase table is managed from Hive

Integration with HBase
• When using an already existing table, the Hive table is defined as EXTERNAL
• Columns are mapped however you want, changing names and assigning types
Hive table definition (persons)    HBase table (people)
  name STRING                        d:fullname
  age INT                            d:age
  siblings MAP<string, string>       f:
  (no Hive column shown)             d:address
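
As a rough illustration of the mapping above, the sketch below builds the HiveQL statement for an EXTERNAL table over the existing HBase table. It assumes Hive's HBase storage handler (org.apache.hadoop.hive.hbase.HBaseStorageHandler) with its hbase.columns.mapping and hbase.table.name properties. The id column mapped to :key is an assumption added here, because the slide does not show a row-key column, and the helper name is hypothetical.

```python
def hbase_external_table_ddl(hive_table, hbase_table, columns):
    """Build a CREATE EXTERNAL TABLE statement for an existing HBase table.

    columns is a list of (hive_name, hive_type, hbase_mapping) tuples, in the
    order the Hive columns should appear.
    """
    column_defs = ",\n  ".join(f"{name} {htype}" for name, htype, _ in columns)
    # The mapping lists one HBase column (or family) per Hive column, in order.
    mapping = ",".join(hbase_col for _, _, hbase_col in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} (\n  {column_defs}\n)\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Columns from the slide, plus an assumed row-key column (id -> :key).
persons_columns = [
    ("id", "STRING", ":key"),                  # assumption: not on the slide
    ("name", "STRING", "d:fullname"),
    ("age", "INT", "d:age"),
    ("siblings", "MAP<STRING,STRING>", "f:"),  # whole column family as a map
]

ddl = hbase_external_table_ddl("persons", "people", persons_columns)
print(ddl)
```

The HBase column d:address is simply left out of the mapping, so it stays in HBase but is not visible through the Hive table.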

Reference • https://cwiki.apache.org/confluence/display/Hive/Home • Hive Facebook • StumbleUpon

Thanks
