Using Pig, Hive, and Impala with Hadoop
Jay Urbain, PhD

Velocity
• We are generating data faster than ever
  – Processes are increasingly automated
  – People are increasingly interacting online
  – Systems are increasingly interconnected

Variety
• We are producing a wide variety of data
  – Social network connections
  – Images, audio, and video
  – Server and application log files
  – Product ratings on shopping and review Web sites
  – And much more…
• Not all of this maps cleanly to the relational model

Volume
• Every day…
  – More than 1.5 billion shares are traded on the New York Stock Exchange
  – Facebook stores 2.7 billion comments and ‘Likes’
  – Google processes about 24 petabytes of data
• Every minute…
  – Foursquare handles more than 2,000 check-ins
  – TransUnion makes nearly 70,000 updates to credit files
• And every second…
  – Banks process more than 10,000 credit card transactions

Data Has Value
• This data has many valuable applications
  – Product recommendations
  – Predicting demand
  – Marketing analysis
  – Fraud detection
  – And many, many more…
• We must process it to extract that value
  – And processing all the data can yield more accurate results

We Need a System that Scales
• We’re generating too much data to process with traditional tools
• Two key problems to address
  – How can we reliably store large amounts of data at a reasonable cost?
  – How can we analyze all the data we have stored?

Apache Hadoop
• Scalable and economical data storage and processing
  – Distributed and fault-tolerant
  – Harnesses the power of industry standard hardware
• Heavily inspired by technical documents published by Google
• ‘Core’ Hadoop consists of two main components
  – Storage: the Hadoop Distributed File System (HDFS)
  – Processing: MapReduce
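To make the MapReduce processing model concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are assumptions chosen for illustration, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for cust_id, cost in records:
        yield cust_id, cost

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical order records: (cust_id, cost)
orders = [("c1", 10), ("c2", 5), ("c1", 7)]
totals = reduce_phase(shuffle(map_phase(orders)))
print(totals)
```

Tools such as Pig and Hive generate this kind of map/shuffle/reduce pipeline so that it does not have to be written by hand.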

Apache Pig
• Apache Pig builds on Hadoop to offer high-level data processing
  – This is an alternative to writing low-level MapReduce code
  – Pig is especially good at joining and transforming data

    people = LOAD '/user/training/customers' AS (cust_id, name);
    orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
    groups = GROUP orders BY cust_id;
    totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
    result = JOIN totals BY group, people BY cust_id;
    DUMP result;
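For readers new to Pig Latin, the same group, sum, and join steps can be sketched in plain Python. This is only an in-memory analogy, not how Pig executes; the sample rows and the helper names `total_by_customer` and `join_totals` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the two HDFS inputs.
people = [(1, "Alice"), (2, "Bob")]                        # (cust_id, name)
orders = [(100, 1, 20.0), (101, 2, 5.0), (102, 1, 12.5)]   # (ord_id, cust_id, cost)

def total_by_customer(order_rows):
    """GROUP orders BY cust_id, then SUM(cost) per group."""
    totals = defaultdict(float)
    for _ord_id, cust_id, cost in order_rows:
        totals[cust_id] += cost
    return dict(totals)

def join_totals(totals, people_rows):
    """Inner JOIN of the per-customer totals with people on cust_id."""
    names = dict(people_rows)
    return [
        (cust_id, total, names[cust_id])
        for cust_id, total in sorted(totals.items())
        if cust_id in names
    ]

result = join_totals(total_by_customer(orders), people)
print(result)
```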

Use Case: ETL Processing
• Pig is also widely used for Extract, Transform, and Load (ETL) processing

[Diagram: data from Operations, Accounting, and Call Center passes through Pig jobs running on a Hadoop cluster (validate data, fix errors, remove duplicates, encode values) into the Data Warehouse]

Apache Hive
• Hive is another abstraction on top of MapReduce
  – Like Pig, it also reduces development time
  – Hive uses a SQL-like language called HiveQL

    SELECT customers.cust_id, SUM(cost) AS total
      FROM customers JOIN orders
        ON customers.cust_id = orders.cust_id
     GROUP BY customers.cust_id
     ORDER BY total DESC;

Use Case: Log File Analytics
• Server log files are an important source of data
• Hive allows you to treat a directory of log files like a table
  – Allows SQL-like queries against raw data

Dualcore Inc. Public Web Site (June 1 - 8)

Product   Unique Visitors  Page Views  Average Time on Page  Bounce Rate  Conversion Rate
Tablet    5,278            5,894       17 seconds            23%          65%
Notebook  4,139            4,375       23 seconds            47%          31%
Stereo    2,873            2,981       42 seconds            61%          12%
Monitor   1,749            1,862       26 seconds            74%          19%
Router    987              1,139       37 seconds            56%          17%
Server    314              504         53 seconds            48%          28%
Printer   86               97          34 seconds            27%          64%
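The idea of treating raw log lines like table rows can be sketched in Python as follows. The log layout (`<ip> <timestamp> <path>`), the sample lines, and the function names are all assumptions made for illustration; they are not taken from the slides or from Hive itself.

```python
import re
from collections import Counter, defaultdict

# Assumed (hypothetical) log layout: "<ip> <timestamp> <path>"
LINE = re.compile(r"^(?P<ip>\S+) (?P<ts>\S+) (?P<path>\S+)$")

def parse(lines):
    """Turn each well-formed log line into a row (a dict of named columns)."""
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # skip lines that do not fit the assumed layout
            yield match.groupdict()

def page_views(rows):
    """Roughly: SELECT path, COUNT(*) ... GROUP BY path."""
    return Counter(row["path"] for row in rows)

def unique_visitors(rows):
    """Roughly: SELECT path, COUNT(DISTINCT ip) ... GROUP BY path."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["path"]].add(row["ip"])
    return {path: len(ips) for path, ips in seen.items()}

# Hypothetical sample log lines.
logs = [
    "10.0.0.1 01/Jun:10:00:00 /tablet",
    "10.0.0.2 01/Jun:10:00:05 /tablet",
    "10.0.0.1 01/Jun:10:01:00 /tablet",
    "this line does not match",
    "10.0.0.3 01/Jun:10:02:00 /router",
]
rows = list(parse(logs))
views = page_views(rows)
visitors = unique_visitors(rows)
print(views, visitors)
```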

Apache Sqoop
• Sqoop exchanges data between a database and Hadoop
• It can import all tables, a single table, or a portion of a table into HDFS
  – Result is a directory in HDFS containing comma-delimited text files
• Sqoop can also export data from HDFS back to the database

[Diagram: Sqoop moves data in both directions between a Database and a Hadoop Cluster]

Cloudera Impala
• Massively parallel SQL engine which runs on a Hadoop cluster
  – Inspired by Google’s Dremel project
  – Can query data stored in HDFS or HBase tables
• High performance
  – Typically at least 10 times faster than Pig, Hive, or MapReduce
  – High-level query language (subset of SQL)
• Impala is 100% Apache-licensed open source

Where Impala Fits Into the Data Center

[Diagram: Transaction Records from Application Database, Log Data from Web Servers, and Documents from File Server feed a Hadoop Cluster with Impala, which is queried by an analyst using the Impala shell for ad hoc queries and by an analyst using Impala via a BI tool]

Recap of Data Analysis/Processing Tools
• MapReduce
  – Low-level processing and analysis
• Pig
  – Procedural data flow language executed using MapReduce
• Hive
  – SQL-based queries executed using MapReduce
• Impala
  – High-performance SQL-based queries using a custom execution engine

Comparing Pig, Hive, and Impala

Description of Feature              Pig    Hive   Impala
SQL-based query language            No     Yes    Yes
User-defined functions (UDFs)       Yes    Yes    No
Process data with external scripts  Yes    Yes    No
Extensible file format support      Yes    Yes    No
Complex data types                  Yes    Yes    No
Query latency                       High   High   Low
Built-in data partitioning          No     Yes    Yes
Accessible via ODBC / JDBC          No     Yes    Yes

What kinds of NoSQL
• NoSQL solutions fall into two major areas:
  – Key/Value or ‘the big hash table’.
    • Amazon S3 (Dynamo)
    • Voldemort
    • Scalaris
    • Memcached (in-memory key/value store)
    • Redis
  – Schema-less which comes in multiple flavors, column-based, document-based or graph-based.
    • Cassandra (column-based)
    • CouchDB (document-based)
    • MongoDB (document-based)
    • Neo4J (graph-based)
    • HBase (column-based)

Key/Value
Pros:
  – very fast
  – very scalable
  – simple model
  – able to distribute horizontally
Cons:
  – many data structures (objects) can't be easily modeled as key value pairs
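The "big hash table" model and its horizontal distribution can be sketched as a key/value store that hashes each key to one of several shards. This is a toy illustration, not the design of any product named in the slides; the `ShardedKV` class and the sample keys are hypothetical.

```python
import hashlib

class ShardedKV:
    """Toy key/value store that spreads keys across several in-memory shards."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate server (an assumption for the sketch).
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self._shard_for(key)].get(key, default)

store = ShardedKV(num_shards=4)
store.put("user:1", "Alice")
store.put("user:2", "Bob")
print(store.get("user:1"), store.get("user:3"))
```

Lookups touch only the one shard that owns the key, which is why the model is fast and easy to scale out; anything beyond get and put by key (joins, range queries over values) has to be built on top.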

Schema-Less
Pros:
  – Schema-less data model is richer than key/value pairs
  – eventual consistency
  – many are distributed
  – still provide excellent performance and scalability
Cons:
  – typically no ACID transactions or joins

Common Advantages
• Cheap, easy to implement (open source)
• Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
  – Down nodes easily replaced
  – No single point of failure
• Easy to distribute
• Don't require a schema
• Can scale up and down
• Relax the data consistency requirement (CAP)
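Replication to multiple nodes can be sketched as follows: each key is written to two of three nodes, so a read still succeeds after one node goes down. The `ReplicatedStore` class, its replica-placement rule, and the sample key are assumptions for illustration only; real systems add repair and consistency machinery that this sketch omits.

```python
import hashlib

class ReplicatedStore:
    """Toy store that writes every key to several nodes."""

    def __init__(self, num_nodes=3, replicas=2):
        self.nodes = [dict() for _ in range(num_nodes)]  # each dict stands in for a node
        self.down = set()                                # ids of failed nodes
        self.replicas = replicas

    def _replica_ids(self, key):
        # Pick a starting node from a stable hash, then the next nodes in order.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:4], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for node_id in self._replica_ids(key):
            if node_id not in self.down:
                self.nodes[node_id][key] = value

    def get(self, key):
        # Read from the first replica that is up and holds the key.
        for node_id in self._replica_ids(key):
            if node_id not in self.down and key in self.nodes[node_id]:
                return self.nodes[node_id][key]
        return None

    def fail(self, node_id):
        self.down.add(node_id)

store = ReplicatedStore()
store.put("order:42", "shipped")
primary = store._replica_ids("order:42")[0]
store.fail(primary)              # simulate losing the first replica
print(store.get("order:42"))     # still served by the second replica
```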
