Apache Pig for Data Science
Casey Stella (Hortonworks)
April 9, 2014

Table of Contents
• Preliminaries
  ◦ Apache Hadoop
  ◦ Apache Pig
• Pig in the Data Science Toolbag
  ◦ Understanding Your Data
  ◦ Machine Learning with Pig
  ◦ Applying Models with Pig
  ◦ Unstructured Data Analysis with Pig
• Questions & Bibliography

Introduction
• I’m a Principal Architect at Hortonworks
• I work primarily doing Data Science in the Hadoop Ecosystem
• Prior to this, I’ve spent my time and had a lot of fun
  ◦ Doing data mining on medical data at Explorys using the Hadoop ecosystem
  ◦ Doing signal processing on seismic data at Ion Geophysical using MapReduce
  ◦ Being a graduate student in the Math department at Texas A&M in algorithmic complexity theory
• I’m going to talk about Apache Pig’s role for doing scalable data science.

Apache Hadoop: What is it?
Hadoop is a distributed storage and processing system
• Scalable – Efficiently store and process data
• Reliable – Failover and redundant storage
• Vast – Many ecosystem projects surrounding data ingestion, processing and export
• Economical – Use commodity hardware and open source software
• Not a one-trick-pony – Not just MapReduce anymore

Apache Hadoop: Who is using it?

Apache Pig: What is it?
Pig is a high-level scripting language for operating on large datasets inside Hadoop
• Compiles the scripting language into MapReduce operations
• Optimizes so that the minimal number of MapReduce jobs needs to be run
• Familiar relational primitives available
• Extensible via User Defined Functions and Loaders for customized data processing and formats

Apache Pig: A Familiar Example

    -- Word count: load sentences, split them into words, count each word.
    SENTENCES = load '...' as (sentence:chararray);
    -- TOKENIZE returns a bag of words; flatten emits one row per word.
    WORDS = foreach SENTENCES generate flatten(TOKENIZE(sentence)) as word;
    WORD_GROUPS = group WORDS by word;
    -- After the group, "group" holds the word and WORDS holds its occurrences.
    WORD_COUNTS = foreach WORD_GROUPS generate group as word, COUNT(WORDS);
    store WORD_COUNTS into '...';

Understanding Data
“80% of the work in any data project is in cleaning the data.”
DJ Patil, in Data Jujitsu

Understanding Data
A core prerequisite to analyzing data is understanding the data’s shape and distribution. This requires (among other things):
• Computing distribution statistics on data
• Sampling data
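As a small illustration of the two needs listed above, here is an in-memory Python sketch (not Pig) that computes a few distribution statistics and draws a Bernoulli sample. The function names `summarize` and `bernoulli_sample` and the example column are hypothetical, chosen only for this sketch; on a cluster the same ideas would be expressed through Pig operators and UDFs.

```python
import random
import statistics


def summarize(values):
    """Return min, quartiles, max and mean for a list of numbers."""
    # "inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.fmean(values),
    }


def bernoulli_sample(values, p, rng):
    """Keep each value independently with probability p."""
    return [v for v in values if rng.random() < p]


# Hypothetical column of data: the integers 1..100.
data = list(range(1, 101))
stats = summarize(data)
sample = bernoulli_sample(data, 0.1, random.Random(42))
```

Note that a Bernoulli sample has a random size: with probability 0.1 over 100 rows it holds about 10 rows on average, not exactly 10.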

Understanding Data: DataFu
An Apache Incubator project called DataFu [1] provides some of this tooling in the form of Pig UDFs:
• Computing quantiles of data
• Sampling
  ◦ Bernoulli sampling by probability (built into Pig)
  ◦ Simple random sample
  ◦ Reservoir sampling
  ◦ Weighted sampling without replacement
  ◦ Random sample with replacement
[1] http://datafu.incubator.apache.org/
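To make one of the sampling schemes above concrete, the following is a minimal Python sketch of reservoir sampling (the classic "Algorithm R"): it keeps a fixed-size uniform sample from a stream whose length is not known in advance. This is an illustration of the technique only, not DataFu's code; the name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(stream, k, rng):
    """Uniformly sample up to k items from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


picked = reservoir_sample(range(1000), 10, random.Random(7))
```

The appeal for large datasets is that only `k` items are ever held in memory, and the input is read exactly once.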

Case Study: Bootstrapping
Bootstrapping is a resampling technique intended to measure the accuracy of sample estimates. It does this by computing an estimator (such as the mean) across a set of random samples drawn with replacement from an original (possibly large) dataset.
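The procedure described above can be sketched in a few lines of in-memory Python. This is only an illustration of the idea, assuming the data fits in memory; the function name `bootstrap_means`, the number of resamples, and the example data are all hypothetical choices for this sketch.

```python
import random
import statistics


def bootstrap_means(data, n_resamples, rng):
    """Compute the mean of each of n_resamples samples drawn with replacement."""
    means = []
    for _ in range(n_resamples):
        # Each resample has the same size as the original data.
        resample = rng.choices(data, k=len(data))
        means.append(statistics.fmean(resample))
    return means


# Hypothetical dataset: the integers 1..100.
data = list(range(1, 101))
means = bootstrap_means(data, 2000, random.Random(0))

# The spread of the bootstrap means estimates the standard error of the mean.
standard_error = statistics.stdev(means)
```

The spread of `means` is the quantity of interest: a tight spread suggests the mean estimate is stable, a wide one suggests it is not.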

Case Study: Bootstrapping
DataFu provides two tools which can be used together to produce that random sample with replacement:
• SimpleRandomSampleWithReplacementVote – Ranks multiple candidates for each position in a sample
• SimpleRandomSampleWithReplacementElect – Chooses, for each position in the sample, the candidate with the lowest score
The DataFu docs provide an example [2] of generating a bootstrap of the mean estimator.
[2] http://datafu.incubator.apache.org/docs/datafu/guide/sampling.html
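One way to read the two-step description above is as a toy, in-memory Python sketch: every item proposes a random score for each sample position ("vote"), and the lowest score per position wins ("elect"). This is an assumption-laden illustration of the idea, not DataFu's implementation; the functions `vote` and `elect` are hypothetical, and the sketch makes no attempt at the efficiency a distributed UDF would need.

```python
import random


def vote(items, sample_size, rng):
    """Yield (position, score, item): each item bids on every sample position."""
    for item in items:
        for position in range(sample_size):
            yield position, rng.random(), item


def elect(candidates, sample_size):
    """For each position, keep the candidate with the lowest score."""
    best = {}
    for position, score, item in candidates:
        if position not in best or score < best[position][0]:
            best[position] = (score, item)
    # Assumes at least one item voted, so every position has a winner.
    return [best[position][1] for position in range(sample_size)]


data = ["a", "b", "c", "d"]
sample = elect(vote(data, 6, random.Random(1)), 6)
```

Because the scores are independent across positions, the same item can win several positions, which is what makes the result a sample with replacement. The split into two steps also mirrors how the work could be divided: votes can be produced independently per partition, and the election only needs the per-position minimum.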

What is Machine Learning?
Machine learning is the study of systems that can learn from data. The general tasks fall into one of two categories:
• Unsupervised Learning
  ◦ Clustering
  ◦ Outlier detection
  ◦ Market Basket Analysis
• Supervised Learning
  ◦ Classification
  ◦ Regression
  ◦ Recommendation
