PIGFARM - LAS Sponsored Computer Science Senior Design Class Project - - PowerPoint PPT Presentation
PIGFARM - LAS Sponsored Computer Science Senior Design Class Project
Spring 2017
Carson Cumbee - LAS
What is Big Data?
Big Data is data that is too large to fit on a single server. It requires an extra layer of software to coordinate among servers to analyze the data. Naturally, what counts as "too large" changes over time.
What is Hadoop/MapReduce?
Hadoop is the de facto open-source Big Data platform
A fault-tolerant distributed file system, based on a 2003 paper from Google about their internal file system [1]
Map/Reduce – a parallel computing paradigm that stresses low memory usage: a Map step is executed on local nodes and the results are sent over the network to Reducers, which complete the task. Another famous Google paper [2].
If you want to query data – use a database. If you want to make a database – use Map/Reduce.
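The map/shuffle/reduce flow can be sketched in a few lines of plain Python – a toy word count, not Hadoop code; the phase names and sample data here are ours for illustration:

```python
# Minimal sketch of the Map/Reduce paradigm in plain Python.
# Real Hadoop runs mappers on the nodes that hold the data and
# ships the map output over the network to the reducers.
from collections import defaultdict

def map_phase(line):
    # Map step: runs locally, emits (key, value) pairs with low memory use.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values for a key together (done by the framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce step: combine all values for one key into the final result.
    return (key, sum(values))

lines = ["big data is big", "data is data"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# result == {'big': 2, 'data': 3, 'is': 2}
```

Hadoop's value is running these same three phases across many machines: mappers execute where the data lives, and the framework performs the shuffle over the network.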
What is Pig? Instead of all this (java) …
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
....
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class L2 extends Configured implements Tool {

    /** MAPPER */
    public static class Join extends Mapper<LongWritable, Text, Text, Text> {
        private Set<String> hash;

        @Override
        public void setup(Context context) {
            try {
                Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                if (paths == null || paths.length < 1) {
                    throw new RuntimeException("DistributedCache no work.");
                }
                // Open the small table
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(paths[0].toString())));
                String line;
                hash = new HashSet<String>(500);
                while ((line = reader.readLine()) != null) {
                    if (line.length() < 1) continue;
                    String[] fields = line.split("\u0001");
                    if (fields[0].length() != 0) hash.add(fields[0]);
                }
            } catch (IOException ioe) {
                throw new RuntimeException(ioe);
            }
        }

        @Override
        public void map(LongWritable k, Text val, Context context)
                throws IOException, InterruptedException {
            List<Text> fields = Library.splitLine(val, '\u0001');
            if (hash.contains(fields.get(0).toString())) {
                context.write(fields.get(0), fields.get(6));
            }
        }
    }

    /** RUN */
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: wordcount <input_dir> <output_dir> <reducers>");
            return -1;
        }
        Job job = new Job(getConf(), "PigMix L2");
        job.setJarByClass(L2.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Join.class);
        Properties props = System.getProperties();
        Configuration conf = job.getConfiguration();
        for (Map.Entry<Object, Object> entry : props.entrySet()) {
            conf.set((String) entry.getKey(), (String) entry.getValue());
        }
        DistributedCache.addCacheFile(new URI(args[0] + "/pigmix_power_users"), conf);
        FileInputFormat.addInputPath(job, new Path(args[0] + "/pigmix_page_views"));
        FileOutputFormat.setOutputPath(job, new Path(args[1] + "/L2out"));
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new L2(), args);
        System.exit(res);
    }
}
This (Pig Latin)*!
rmf /PIGFARM/pigmixout/l2out
register /proj/PIGFARM/PIGMIX/pigperf.jar;
A = LOAD '/PIGFARM/pigmix/pigmix_page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    AS (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = FOREACH A GENERATE user, estimated_revenue;
alpha = LOAD '/PIGFARM/pigmix/pigmix_users' using PigStorage('\u0001')
    AS (name, phone, address, city, state, zip);
beta = FOREACH alpha GENERATE name;
C = JOIN B BY user, beta BY name;
STORE C INTO '/PIGFARM/pigmixout/l2out';
* This is PIGMIX Benchmark script L2.pig
PIGFARM
Multiple Query Optimization (MQO) – the idea that several queries against a single database can be made more efficient if they are combined and issued together. When large firms have data scientists throughout their business units writing Pig scripts against common data sets in an uncoordinated manner, there is an opportunity to use MQO to improve the analytical bandwidth of these systems.
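The MQO payoff can be sketched in plain Python – two hypothetical "queries" stand in for uncoordinated Pig scripts reading the same data set:

```python
# Sketch of the MQO intuition: N uncoordinated queries each scan the
# full dataset, while a combined query answers all N in a single scan.
# The records and predicates are hypothetical stand-ins for Pig filters.
records = [("s1", "math", "algebra"), ("s2", "geo", "North Carolina"),
           ("s3", "math", "calculus")]

queries = {
    "q1": lambda r: "math" in r[1],            # pred mentions math
    "q2": lambda r: "North Carolina" in r[2],  # obj mentions North Carolina
}

# Uncoordinated: one full scan of `records` per query (N scans total).
naive = {name: [r for r in records if pred(r)] for name, pred in queries.items()}

# MQO: one shared scan evaluates every predicate on each record.
fused = {name: [] for name in queries}
for r in records:                      # single pass over the data
    for name, pred in queries.items():
        if pred(r):
            fused[name].append(r)

assert naive == fused  # same answers, 1 scan instead of N
```

The answers are identical; the fused version simply amortizes the expensive full scan across every query that needs it.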
The Real Idea
[Diagram, shown for PIGSCRIPT 1, PIGSCRIPT 2, … PIGSCRIPT N: a Big Data "feed" flows into each script; each script keeps only the data it wants ("I only like yellow/blue/red data") and discards the rest (NOOPS to /dev/null); a Farmer CPU tends the scripts.]
The Real Idea – Instead of this
[Diagram: N separate jobs, each with its own map (1, 2, … N) scanning the full input]
this – fuse the initial map
[Diagram: one fused map scans the input once and feeds all N downstream pipelines (1, 2, … N)]
fuse the LOAD statement
At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.
--Script determines the number of distinct pred/obj pairs that have math in them
rmf /PIGFARM/Merged/test001.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
filt1 = filter table by (obj matches '.*math.*') or (pred matches '.*math.*');
unduped = DISTINCT filt1;
store unduped into '/PIGFARM/Merged/test001.gz' using PigStorage('\t');
--Script computes the average height of people for each subject
rmf /PIGFARM/Merged/test002.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
filt1 = filter table BY (pred matches '.*"people.person.height_meters".*');
removeQuotes = FOREACH filt1 GENERATE sub, REGEX_EXTRACT(obj, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE sub, (double)num;
grouped = GROUP casted BY sub;
avged = FOREACH grouped GENERATE casted.sub, AVG(casted.num);
store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');
--Script determines the number of unique objects with North Carolina
rmf /PIGFARM/Merged/test003.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj, period);
filt = filter table by (obj matches '.*"North Carolina".*');
objs = foreach filt generate obj;
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');
fuse the LOAD statement
-- An LAS PIGFARM Compiled Pig Script
-- Compiled on: 23/02/17-07:09
-- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
--   1: table from /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--   2: table from /proj/PIGFARM/Script_Farm/ToMerge/test001.pig
--   3: table from /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t')
    AS (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
-- Below is the remainder of: /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--Script computes the average height of people for each subject
rmf /PIGFARM/Merged/test002.gz
filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
removeQuotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE laughing_wing, (double)num;
grouped = GROUP casted BY laughing_wing;
avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');
-- Below is the remainder of: /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
--Script determines the number of unique objects with North Carolina
rmf /PIGFARM/Merged/test003.gz
filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
objs = foreach filt generate jovial_golick;
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');
…
fuse the LOAD statement
But this didn’t work: Pig just submitted the job as if it were the 3 sequential Pig jobs. (Although it might still work with Tez.)
We decided to move the STORE statements to the end… this actually caused very large temporary files to be created – a performance killer.
We then decided to identify the initial Map portions of the scripts, STORE them compressed, and read them back in – essentially explicit temporary files. This seems to work.
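The working approach – fuse every script's initial map into one pass over the input and spill each result to an explicit temporary file – can be sketched in Python. The file layout, records, and filters here are illustrative assumptions, not the actual combiner:

```python
# Sketch of PIGFARM's working approach: run every script's initial map
# (its filter) in one pass over the input, spill each result to an
# explicit temporary file, then let each script's remainder read its
# own temporary back in.
import csv
import os
import tempfile

rows = [["s1", "people.person.height_meters", '"1.8"'],
        ["s2", "math.topic", "algebra"],
        ["s3", "location", '"North Carolina"']]

# One filter per merged script (stand-ins for the Pig FILTER statements).
filters = {"filt1": lambda r: "height_meters" in r[1],
           "filt2": lambda r: "math" in r[1]}

tmpdir = tempfile.mkdtemp()

# Fused initial map: one scan of the input, one spill file per script.
files, writers = {}, {}
for name in filters:
    f = open(os.path.join(tmpdir, name + ".tsv"), "w", newline="")
    files[name] = f
    writers[name] = csv.writer(f, delimiter="\t")
for r in rows:                      # single pass over the input
    for name, pred in filters.items():
        if pred(r):
            writers[name].writerow(r)
for f in files.values():
    f.close()

# Each script's remainder loads only its own (much smaller) temporary.
with open(os.path.join(tmpdir, "filt1.tsv"), newline="") as f:
    filt1 = list(csv.reader(f, delimiter="\t"))
assert filt1 == [["s1", "people.person.height_meters", '"1.8"']]
```

The expensive full scan happens once; each downstream pipeline then pays only for reading its own filtered (and, in PIGFARM, compressed) spill.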
fuse the initial mapper
-- An LAS PIGFARM Compiled Pig Script
-- Compiled on: 23/02/17-07:09
rmf /PIGFARM/cumbeeMerged/test001.gz
rmf /PIGFARM/cumbeeMerged/test002.gz
rmf /PIGFARM/cumbeeMerged/test003.gz
rmf /PIGFARM/cumbeeMerged/casted.gz
rmf /PIGFARM/cumbeeMerged/objs.gz
rmf /PIGFARM/cumbeeMerged/filt2.gz
rmf /PIGFARM/cumbeeMerged/filt3.gz
-- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
--   1: table from /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--   2: table from /proj/PIGFARM/Script_Farm/ToMerge/test001.pig
--   3: table from /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t')
    AS (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
filt2 = filter boring_aryabhata by (stoic_allen matches '.*math.*');
filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
filt3 = filter boring_aryabhata by (jovial_golick matches '.*math.*');
objs = foreach filt generate jovial_golick;
removeQuotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE laughing_wing, (double)num;
store casted into '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t');
store objs into '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t');
store filt2 into '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t');
store filt3 into '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t');
casted = LOAD '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t') as (laughing_wing, num:double);
objs = LOAD '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t') as (jovial_golick);
filt2 = LOAD '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t') as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
filt3 = LOAD '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t') as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
grouped = GROUP casted BY laughing_wing;
avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
…
Datasets
PIGMIX – the standard synthetic Pig benchmark
250 million rows, mostly “dense”, 400 GB uncompressed
Used to test Apache Pig vs. Java Map/Reduce performance
Freebase – large knowledge graph available on the internet
3 billion+ (subject, predicate, object) tuples, 250 GB uncompressed
We made a special loader function UDF for Freebase called FBLoader()
Test Cluster
OSCAR LAB – Hortonworks cluster
12 blades – 1 login/name server, 11 compute nodes
Each blade has 65 GB of RAM
12 TB of HDFS, replication factor of 1
Preliminary Results
Freebase runtimes (minutes):

                        Test 1-3   PRL 1..8 (individual)   PRL_1-4   PRL_1-6   PRL_1-8
  Parallel submission      13            24-30                90        135       181
  PIGFARM                   8            N/A                  30         31        39

These scripts were compiled by hand, using Pig defaults for the number of reducers.
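The measured minutes imply sizable speedups for the fused runs; a quick sanity check, using only the numbers above:

```python
# Speedups implied by the preliminary results: parallel-submission
# minutes divided by PIGFARM minutes for each fused workload.
parallel = {"PRL_1-4": 90, "PRL_1-6": 135, "PRL_1-8": 181}
pigfarm  = {"PRL_1-4": 30, "PRL_1-6": 31,  "PRL_1-8": 39}

speedup = {k: round(parallel[k] / pigfarm[k], 2) for k in parallel}
# speedup == {'PRL_1-4': 3.0, 'PRL_1-6': 4.35, 'PRL_1-8': 4.64}
```

Notably, the speedup grows with the number of fused scripts, which is what shared scans predict: each extra script adds little to the single pass over the input.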
PIGFARMers
Work is ongoing
The team is still working on the script combiner – make sure it can handle FBLoader()
Create a similarity function for scripts based on the data they access
Run all of the experiments and write the results up in a paper
Also worthwhile to rerun all experiments with Tez vs. MapReduce
A furious finish – only 5 weeks left in the semester!
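One plausible shape for the planned similarity function – not the team's actual design – is Jaccard similarity over the data sources each script LOADs; the extraction rule and sample scripts below are illustrative assumptions:

```python
# Hypothetical script-similarity function: compare two Pig scripts by
# the Jaccard similarity of the data sources they LOAD. Scripts with
# high similarity are good candidates for PIGFARM fusion.
import re

def load_paths(script_text):
    # Pull the quoted path out of every LOAD statement.
    return set(re.findall(r"load\s+'([^']+)'", script_text, re.IGNORECASE))

def similarity(script_a, script_b):
    a, b = load_paths(script_a), load_paths(script_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

s1 = "table = load '/PIGFARM/data2.gz' using PigStorage('\\t');"
s2 = "t = load '/PIGFARM/data2.gz' using PigStorage('\\t');"
s3 = "other = load '/PIGFARM/pigmix/pigmix_users';"
assert similarity(s1, s2) == 1.0   # same source: ideal fusion candidates
assert similarity(s1, s3) == 0.0   # disjoint sources: nothing to share
```

A real version would also need to see through loader UDFs such as FBLoader() and glob patterns, which is part of why the combiner work is ongoing.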
Conclusion
If a large firm is writing many Apache Pig scripts to perform Map/Reduce jobs on common data sets, there could be large performance gains in fusing the maps together with PIGFARM – especially if most of the jobs are map-heavy.
Thanks!
- Dr. Aaron Wiechman - LAS
- Dr. Sean Lynch - LAS
- Ms. Margaret Heil – Director SDC
- Dr. David Sturgill – Tech Advisor Session 3
Questions?
References
[1] Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). "The Google File System". In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03), p. 29.
[2] Dean, J. and Ghemawat, S. (2004). "MapReduce: Simplified Data Processing on Large Clusters". In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI '04).