Hive: A Petabyte Scale Data Warehouse Using Hadoop


  1. Hive: A Petabyte Scale Data Warehouse Using Hadoop
     Authors: Facebook Data Infrastructure Team
     Conference: Data Engineering (ICDE), 2010 IEEE
     CS 743, Fall 2014, University of Waterloo
     Presenter: Malek NAOUACH, Nets&Dist Sys
     November 13th, 2014

  2. Overview
     [Diagram] Big Data Processing | Decision Making
     MapReduce / Hadoop: Fault Tolerant | Massively Parallel | Linearly Scalable
     Hive: Familiarity

  3. Hive Data Structure
     Primitive Data Types: INT | TINYINT | SMALLINT | BIGINT | BOOLEAN | FLOAT
     Complex Data Types: Associative arrays | Lists | Structs
     Complex Datatypes Composition: list<map<string, struct<p1:int, p2:int>>>
     Complex Schema Creation:
       CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>)
     Hive Data Incorporation: SerDe Interface + ObjectInspector Interface + getObjectInspector method
     **Serialization: the process of translating data structures or object state into a format that can be stored and reconstructed later.
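
The serialization note on this slide can be illustrated with a small round trip. This is a minimal sketch, not Hive's SerDe: it assumes JSON as the stored form purely for readability, and the names `Point`, `serialize_row`, and `deserialize_row` are hypothetical. Only the column layout of `t1` is taken from the slide.

```python
import json
from typing import Dict, List, TypedDict


class Point(TypedDict):
    """Stand-in for struct<p1:int, p2:int>."""
    p1: int
    p2: int


# Stand-in for list<map<string, struct<p1:int, p2:int>>>
NestedColumn = List[Dict[str, Point]]


def serialize_row(st: str, fl: float, li: NestedColumn) -> str:
    """Turn one row of table t1 into a line of text that can be stored."""
    return json.dumps({"st": st, "fl": fl, "li": li}, sort_keys=True)


def deserialize_row(line: str) -> tuple:
    """Rebuild the (st, fl, li) row from its stored form."""
    record = json.loads(line)
    return record["st"], record["fl"], record["li"]


# Illustrative row; the values are made up.
row = ("acme", 1.5, [{"open": {"p1": 10, "p2": 12}}, {"close": {"p1": 11, "p2": 9}}])
stored = serialize_row(*row)
restored = deserialize_row(stored)
```

A real SerDe would also expose the row's structure to the query engine, which is the role the slide assigns to the ObjectInspector interface.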

  4. Hive Query Language
     HiveQL Semantics (SQL): SUBQUERIES | INNER, LEFT & RIGHT OUTER JOINS | CARTESIAN PROD | GROUP BY | AGGREGATION | UNION | CREATE TABLE
     NOT HiveQL Semantics: INSERT | UPDATE | DELETE
     HiveQL Data Insertion: INSERT OVERWRITE
     HiveQL Supports Map-Red Programs:
       FROM (
         MAP stocks USING 'python ce_mapper.py' AS (company,value)
         FROM stocksStat
         CLUSTER BY value
       ) a
       REDUCE company,value USING 'python ce_reduce.py'
     **HQL: Hibernate Query Language
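
The slide names two scripts, `ce_mapper.py` and `ce_reduce.py`, without showing them. The sketch below is a guess at what such scripts could look like, assuming the usual streaming contract in which Hive passes rows to the script as tab-separated lines and reads tab-separated lines back. The logic (keep two columns, then count companies per value) is hypothetical, as are the function names and the sample rows.

```python
from itertools import groupby
from typing import Iterable, Iterator


def map_rows(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical mapper body: keep the first two columns as (company, value)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield f"{fields[0]}\t{fields[1]}"


def reduce_rows(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical reducer body: count companies per value.

    Relies on CLUSTER BY value having placed equal values next to each other.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for value, group in groupby(parsed, key=lambda fields: fields[1]):
        yield f"{value}\t{sum(1 for _ in group)}"


# Made-up input rows; sorted() stands in for CLUSTER BY value.
mapped = list(map_rows(["acme\t10\textra\n", "bolt\t12\n", "core\t10\n"]))
clustered = sorted(mapped, key=lambda row: row.split("\t")[1])
reduced = list(reduce_rows(clustered))
```

As standalone scripts, each function would be fed `sys.stdin` and its output printed line by line.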

  5. Data Storage
     [Diagram] Hive MetaStore (Library): Schema | MetaData
     HDFS: Logical Partitioning | Prune/Bucket | Stocks | Buckets | Data …...
     Table definition:
       CREATE TABLE Stocks (Company STRING, val DOUBLE)
       PARTITIONED BY (day STRING, hr INT);
     Resulting directories:
       /hive/stocks/
       /hive/stocks/2014-11-13/
       /hive/stocks/2014-11-13/10
       /hive/stocks/2014-11-13/11
       /hive/stocks/2014-11-13/12
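
The link between partition columns and directories, and why it allows pruning, can be sketched as follows. This is an illustration under assumptions: it follows the directory layout exactly as drawn on the slide, the helper names `partition_dir` and `prune` are hypothetical, and the extra `2014-11-12` partition is made up to give the filter something to discard.

```python
from typing import List

TABLE_ROOT = "/hive/stocks"


def partition_dir(day: str, hr: int) -> str:
    """Directory for one (day, hr) partition, following the slide's layout."""
    return f"{TABLE_ROOT}/{day}/{hr}"


def prune(partitions: List[str], day: str, min_hr: int) -> List[str]:
    """Keep only the directories a query on (day, hr >= min_hr) has to read."""
    kept = []
    for path in partitions:
        part_day, part_hr = path[len(TABLE_ROOT) + 1:].split("/")
        if part_day == day and int(part_hr) >= min_hr:
            kept.append(path)
    return kept


all_partitions = [partition_dir("2014-11-13", h) for h in (10, 11, 12)]
all_partitions.append(partition_dir("2014-11-12", 23))  # made-up extra partition
to_scan = prune(all_partitions, day="2014-11-13", min_hr=11)
```

Because the filter is answered from directory names alone, the data files in the discarded partitions are never opened.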

  6. System Architecture (1/3)
     Hive: JDBC | ODBC | Web Interface | CLI | Thrift Server | MetaStore | Driver (Compiler, Optimizer, Executor)
     HADOOP (MAP-REDUCE + HDFS): Name Node | Job Tracker | Data Node + Task Tracker

  7. System Architecture (2/3)
     Components: CLI | Web UI | JDBC Interf. | ODBC Interf. | Thrift Interf. | E.Client | Driver | Query Compiler | MetaStore | Execution Engine | HADOOP
     Query flow:
       1. exeHiveQuery
       2. getExePhysPlan
       3. getMetaData
       4. sendMetaData
       5. sendExePhysPlan
       5. exePhysPlan
       6.1. exeJob
       6.1. metaDataOps forDDLs
       6.2. jobDone
       7. fetchResults
       8. sendResults
     **Interoperability is the ability of a system to work with other systems without special effort on the customer side.
     **Logical/Physical Plan: Abstract Syntax Tree (AST) for the query, Query Block Tree, Involved Interfaces, Directed Acyclic Graph
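
The numbered arrows on this slide can be read as a call sequence between the components. The toy classes below are one possible reading, not Hive's code: which component issues each call is an assumption inferred from the labels, the method names are hypothetical, and the plan is a plain dictionary rather than a real operator graph.

```python
from typing import Dict, List

trace: List[str] = []  # step labels, in the order they occur


class MetaStore:
    def __init__(self, schemas: Dict[str, List[str]]):
        self.schemas = schemas

    def get_metadata(self, table: str) -> List[str]:
        trace.append("3. getMetaData")
        trace.append("4. sendMetaData")
        return self.schemas[table]


class Compiler:
    def __init__(self, metastore: MetaStore):
        self.metastore = metastore

    def get_plan(self, table: str) -> Dict[str, object]:
        trace.append("2. getExePhysPlan")
        columns = self.metastore.get_metadata(table)
        trace.append("5. sendExePhysPlan")
        return {"scan": table, "columns": columns}


class ExecutionEngine:
    def run(self, plan: Dict[str, object]) -> List[str]:
        trace.append("5. exePhysPlan")
        trace.append("6.1. exeJob")   # hand the job to Hadoop
        trace.append("6.2. jobDone")  # Hadoop reports completion
        return [f"rows of {plan['scan']}"]


class Driver:
    def __init__(self, compiler: Compiler, engine: ExecutionEngine):
        self.compiler = compiler
        self.engine = engine

    def execute(self, table: str) -> List[str]:
        trace.append("1. exeHiveQuery")
        plan = self.compiler.get_plan(table)
        results = self.engine.run(plan)
        trace.append("7. fetchResults")
        trace.append("8. sendResults")
        return results


driver = Driver(Compiler(MetaStore({"stocks": ["company", "val"]})), ExecutionEngine())
results = driver.execute("stocks")
```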

  8. System Architecture (3/3)
     HADOOP
     MapReduce: Job Tracker (6.1. exeJob, 6.2. jobDone)
     MapReduce Tasks:
       Task Trackers (MAP): Map Op. Tree + SerDe
       Task Trackers (Reduce): Map Op. Tree + SerDe
     HDFS: Data Nodes

  9. HiveQL to Phys. Plan Exp. (1/3)
       FROM (SELECT a.status, b.school, b.gender
             FROM status_updates a JOIN profiles b
             ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1
       INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
         SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
       INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009-03-20')
         SELECT subq1.school, COUNT(1) GROUP BY subq1.school

  10. HiveQL to Phys. Plan Exp. (2/3)
      Input tables:
        status_updates (userid, status, ds)
        profiles (userid, school, gender)

  11. HiveQL to Phys. Plan Exp. (3/3)
      Two aggregations over the joined rows:
        SELECT subq1.school, COUNT(1) GROUP BY subq1.school
        SELECT subq1.gender, COUNT(1) GROUP BY subq1.gender
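
The shape of this example, one join feeding two GROUP BY counts, can be mimicked in a few lines. This is a rough sketch and not the physical plan Hive generates: the sample rows are made up (only the table and column names come from the slides), and the dictionary lookup assumes `userid` is unique in `profiles`.

```python
from collections import Counter

# Made-up sample rows; only the column names come from the slides.
status_updates = [
    {"userid": 1, "status": "hi", "ds": "2009-03-20"},
    {"userid": 2, "status": "yo", "ds": "2009-03-20"},
    {"userid": 1, "status": "old", "ds": "2009-03-19"},
    {"userid": 3, "status": "hey", "ds": "2009-03-20"},
]
profiles = [
    {"userid": 1, "school": "UW", "gender": "F"},
    {"userid": 2, "school": "MIT", "gender": "M"},
    {"userid": 3, "school": "UW", "gender": "M"},
]

profile_by_user = {p["userid"]: p for p in profiles}
gender_summary = Counter()
school_summary = Counter()

# One pass over the joined rows feeds both summaries.
for update in status_updates:
    if update["ds"] != "2009-03-20":
        continue  # the ON clause restricts status_updates to one day
    profile = profile_by_user.get(update["userid"])
    if profile is None:
        continue  # inner join: drop updates with no matching profile
    gender_summary[profile["gender"]] += 1
    school_summary[profile["school"]] += 1
```

Each counter corresponds to one of the two INSERT OVERWRITE targets, `gender_summary` and `school_summary`.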

  12. Brief Recap
      ✔ Hive was created to simplify big data analysis (1 hour for new users to master).
      ✔ Hive is improving the performance of Hadoop (+20% efficiency).
      ✔ Hive enables data processing at a fraction of the cost of more traditional data warehouses.
      ✔ Hive is working towards subsuming SQL syntax.
      ✔ Hive is enhancing the query compiler and interoperability.
      http://hadoop.apache.org/
      http://hive.apache.org/

  13. Thanks! Questions?
