Hive* A Petabyte Scale Data Warehouse Using Hadoop




SLIDE 1

Hive* – A Petabyte Scale Data Warehouse Using Hadoop

Presenter: Malek NAOUACH, Nets&Dist Sys
Authors: Facebook Data Infrastructure Team
Conference: Data Engineering (ICDE), 2010 IEEE

CS 743, Fall 2014

UNIVERSITY OF WATERLOO

November 13th, 2014

SLIDE 2

Overview*

Massively Parallel | Fault Tolerant | Linearly Scalable

Big Data Processing

MapReduce

Familiarity | Decision Making

Hadoop | Hive

SLIDE 3

Hive Data Structure*

Primitive Data Types
INT | TINYINT | SMALLINT | BIGINT | BOOLEAN | FLOAT

Complex Data Types
Associative arrays | Lists | Structs

Complex Datatypes Composition
list<map<string, struct<p1:int, p2:int>>>

Complex Schema Creation
CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>)

Hive Data Incorporation
+ SerDe Interface
+ ObjectInspector Interface
+ getObjectInspector method

**Serialization: Process of translating data structures or object state into a format that can be stored and reconstructed later.
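
As a rough illustration of the serialization note above, the Python sketch below round-trips one row shaped like table t1 through a text format. It is not Hive's SerDe, which is a Java interface. The names serialize_row and deserialize_row, the sample row, and the choice of JSON as the stored format are all assumptions made for this example.

```python
import json

# Hypothetical row of t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>).
# Python lists, dicts, and nested dicts stand in for Hive's list, map, and struct.
ROW = {
    "st": "abc",
    "fl": 1.5,
    "li": [{"k1": {"p1": 1, "p2": 2}}, {"k2": {"p1": 3, "p2": 4}}],
}


def serialize_row(row):
    """Turn an in-memory row into a single line of text that can be stored."""
    return json.dumps(row, sort_keys=True)


def deserialize_row(line):
    """Rebuild the in-memory row from its stored text form."""
    return json.loads(line)


stored = serialize_row(ROW)
restored = deserialize_row(stored)
```

The point of the round trip is that the nested list, map, and struct values survive storage unchanged, which is the job the slide assigns to the SerDe and ObjectInspector interfaces.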

SLIDE 4

Hive Query Language*

HiveQL Semantics (SQL)
SUBQUERIES | INNER, LEFT & RIGHT OUTER JOINS | CARTESIAN PROD | GROUP BY | AGGREGATION | UNION | CREATE TABLE

NOT HiveQL Semantics
INSERT INTO (append) | UPDATE | DELETE

HiveQL Supports Map-Red Programs

FROM (
  MAP stocks USING 'python ce_mapper.py' AS (company, value)
  FROM stocksStat
  CLUSTER BY value
) a
REDUCE company, value USING 'python ce_reduce.py'

HiveQL Data Insertion
INSERT OVERWRITE

**HQL: Hibernate Query Language
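
The MAP ... USING and REDUCE ... USING clauses above hand rows to external scripts. By default, Hive streams each row to the script's standard input as a tab-separated text line and reads tab-separated lines back from its standard output. The slides do not show ce_mapper.py or ce_reduce.py, so the logic below is assumed purely for illustration: the mapper splits a "company:value" text field, and the reducer counts the companies that share each value.

```python
import itertools


def map_lines(lines):
    """Sketch of ce_mapper.py: 'company:value' text in, 'company<TAB>value' out."""
    for line in lines:
        field = line.rstrip("\n")
        if ":" not in field:
            continue  # skip malformed rows instead of failing the whole task
        company, value = field.split(":", 1)
        yield f"{company.strip()}\t{value.strip()}"


def reduce_lines(lines):
    """Sketch of ce_reduce.py: count the companies that share each value.

    Relies on CLUSTER BY value, which sends rows with equal values
    to the same reducer, next to each other.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for value, group in itertools.groupby(rows, key=lambda r: r[1]):
        yield f"{value}\t{sum(1 for _ in group)}"


# Local dry run; under Hive each script would read sys.stdin and print to stdout.
# The sort stands in for the ordering that CLUSTER BY value provides.
mapped = sorted(
    map_lines(["acme:10", "bolt:7", "core:10", "bad row"]),
    key=lambda l: l.split("\t")[1],
)
reduced = list(reduce_lines(mapped))
```

Keeping the logic in generator functions makes it easy to test locally before wiring the scripts into a query.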

SLIDE 5

Data Storage*

Logical Partitioning Buckets HDFS …... MetaData

Stocks

Hive MetaStore
/hive/stocks/
/hive/stocks/2014-11-13/
/hive/stocks/2014-11-13/10
/hive/stocks/2014-11-13/11
/hive/stocks/2014-11-13/12
Prune/Bucket Data Library Schema

CREATE TABLE Stocks (Company STRING, val DOUBLE) PARTITIONED BY (day STRING, hr INT);
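
The directory listing and the PARTITIONED BY clause above can be tied together with a small Python sketch: each (day, hr) pair maps to one directory, and a query with predicates on the partition columns only needs to read the matching directories. The helper names and the sample partition list are hypothetical. Hive itself typically names partition directories as key=value (for example day=2014-11-13/hr=10); the sketch keeps the plainer layout from the slide.

```python
import posixpath

TABLE_ROOT = "/hive/stocks"  # table directory shown on the slide


def partition_path(day, hr):
    """Directory holding the rows of one (day, hr) partition, in the slide's layout."""
    return posixpath.join(TABLE_ROOT, day, str(hr))


def prune(partitions, day=None, hr=None):
    """Keep only the partitions a query must scan, given equality predicates."""
    return [
        partition_path(d, h)
        for d, h in partitions
        if (day is None or d == day) and (hr is None or h == hr)
    ]


# Hypothetical partition list, as the MetaStore might record it.
PARTITIONS = [
    ("2014-11-13", 10),
    ("2014-11-13", 11),
    ("2014-11-13", 12),
    ("2014-11-14", 10),
]
to_scan = prune(PARTITIONS, day="2014-11-13", hr=11)
```

Here a query restricted to day 2014-11-13 and hour 11 touches one directory out of four, which is the saving that pruning is meant to deliver.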

SLIDE 6

System Architecture(1/3)*

HADOOP (MAP-REDUCE + HDFS)
Job Tracker | Name Node | Data Node + Task Tracker

Hive
MetaStore | Driver (Compiler, Optimizer, Executor) | Thrift Server | ODBC | JDBC | Web Interface | CLI

SLIDE 7

System Architecture (2/3)*

**Interoperability is the ability of a system to work with other systems without special effort on the customer side.

**Logical/Physical Plan: Abstract Syntax Tree (AST) for the query, Query Block Tree, Involved Interfaces, Directed Acyclic Graph

E.Client: Web UI | CLI | Thrift Interf. | ODBC Interf. | JDBC Interf.

Driver

  • 1. exeHiveQuery
  • 7. fetchResults

Query Compiler

  • 2. getExePhysPlan

MetaStore

  • 3. getMetaData
  • 4. sendMetaData
  • 5. sendExePhysPlan

Execution Engine

6.1. metaDataOps for DDLs

  • 5. exePhysPlan
  • 8. sendResults

6.1. exeJob 6.2. jobDone

Hive | HADOOP
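
The call sequence on this slide can be traced with a toy Python model. This is only a sketch: the classes are hypothetical stand-ins for the boxes in the diagram, no HiveQL is parsed (the "query" is just a table name), and the job that would run on Hadoop is replaced by an in-memory list comprehension. The step names appended to the trace are the ones shown on the slide.

```python
class MetaStore:
    def __init__(self, tables, trace):
        self.tables, self.trace = tables, trace

    def get_metadata(self, table):
        self.trace.append("getMetaData")
        self.trace.append("sendMetaData")
        return self.tables[table]


class Compiler:
    def __init__(self, metastore, trace):
        self.metastore, self.trace = metastore, trace

    def get_plan(self, table):
        self.trace.append("getExePhysPlan")
        columns = self.metastore.get_metadata(table)
        self.trace.append("sendExePhysPlan")
        return {"table": table, "columns": columns}


class ExecutionEngine:
    def __init__(self, trace):
        self.trace, self.results = trace, None

    def execute(self, plan, rows):
        self.trace.append("exePhysPlan")
        self.trace.append("exeJob")  # a real engine would submit this to Hadoop
        self.results = [dict(zip(plan["columns"], row)) for row in rows]
        self.trace.append("jobDone")

    def send_results(self):
        self.trace.append("sendResults")
        return self.results


class Driver:
    def __init__(self, compiler, engine, trace):
        self.compiler, self.engine, self.trace = compiler, engine, trace

    def execute_query(self, table, rows):
        self.trace.append("exeHiveQuery")
        self.engine.execute(self.compiler.get_plan(table), rows)

    def fetch_results(self):
        self.trace.append("fetchResults")
        return self.engine.send_results()


trace = []
metastore = MetaStore({"stocks": ["company", "val"]}, trace)
driver = Driver(Compiler(metastore, trace), ExecutionEngine(trace), trace)
driver.execute_query("stocks", [("acme", 10.0)])
results = driver.fetch_results()
```

Reading the trace list from start to end gives the order in which the client, Driver, Query Compiler, MetaStore, and Execution Engine interact for one query.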

SLIDE 8

System Architecture (3/3)*

Job Tracker

6.1. exeJob 6.2. jobDone

Map Op. Tree | SerDe | Map Op. Tree | SerDe
Task Trackers (MAP) | Task Trackers (Reduce)
MapReduce Tasks | Data Nodes

HADOOP: MapReduce | HDFS

SLIDE 9

HiveQL to Phys. Plan Exp. (1/3)*

FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
  SELECT subq1.gender, COUNT(1)
  GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009-03-20')
  SELECT subq1.school, COUNT(1)
  GROUP BY subq1.school
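
To make the intent of this multi-table insert concrete, the Python sketch below computes the same two summaries in memory: the join is evaluated once and each joined row feeds both aggregations. The sample rows are invented, and the sketch assumes that userid is unique in profiles. It illustrates what the query computes, not the physical plan Hive generates.

```python
from collections import Counter

# Hypothetical sample rows: status_updates(userid, status, ds)
# and profiles(userid, school, gender).
STATUS_UPDATES = [
    (1, "hello", "2009-03-20"),
    (2, "lunch", "2009-03-20"),
    (1, "bye", "2009-03-20"),
    (3, "old", "2009-03-19"),
]
PROFILES = [(1, "Waterloo", "F"), (2, "MIT", "M"), (3, "MIT", "F")]


def summarize(status_updates, profiles, ds):
    """Join once, then feed both aggregations from the same joined rows."""
    by_user = {userid: (school, gender) for userid, school, gender in profiles}
    gender_summary, school_summary = Counter(), Counter()
    for userid, _status, row_ds in status_updates:
        if row_ds != ds or userid not in by_user:
            continue  # inner join plus the ds filter from the ON clause
        school, gender = by_user[userid]
        gender_summary[gender] += 1  # SELECT gender, COUNT(1) GROUP BY gender
        school_summary[school] += 1  # SELECT school, COUNT(1) GROUP BY school
    return dict(gender_summary), dict(school_summary)


gender_summary, school_summary = summarize(STATUS_UPDATES, PROFILES, "2009-03-20")
```

Sharing the joined rows between the two counters mirrors the reason for writing both INSERT clauses under a single FROM: the expensive join does not have to be repeated per output table.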

SLIDE 10

HiveQL to Phys. Plan Exp. (2/3)*

status_updates (userid, status, ds)
profiles (userid, school, gender)

SLIDE 11

HiveQL to Phys. Plan Exp. (3/3)*

SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
SELECT subq1.school, COUNT(1) GROUP BY subq1.school

SLIDE 12

Brief Recap.*

http://hive.apache.org/
http://hadoop.apache.org/

✔ Hive was created to simplify big data analysis (1 hour for new users to master).
✔ Hive is improving the performance of Hadoop (+20% efficiency).
✔ Hive enables data processing at a fraction of the cost of more traditional data warehouses.
✔ Hive is working towards subsuming SQL syntax.
✔ Hive is enhancing the Query Compiler and its interoperability.

SLIDE 13

Thanks!* Questions?