PIGFARM - LAS Sponsored Computer Science Senior Design Class Project - - PowerPoint PPT Presentation
PIGFARM - LAS Sponsored Computer Science Senior Design Class Project
Spring 2017
Carson Cumbee - LAS
What is Big Data?
Big Data is data that is too large to fit on a single server. It requires an extra layer of software to coordinate among servers to analyze the data. Naturally, what counts as "too large" changes over time.
What is Hadoop/MapReduce?
Hadoop is the de facto open-source Big Data platform
A fault-tolerant distributed file system, based on a 2003 paper from Google about their internal file system [1]
Map/Reduce – a parallel computing paradigm that stresses low memory usage: a Map step is executed on local nodes and the results are sent over the network to Reducers, which complete the task. Another famous Google paper [2].
If you want to query data – use a database. If you want to make a database – use Map/Reduce.
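The map/shuffle/reduce flow can be sketched in a few lines of plain Python – a toy word count, not Hadoop code; the phase names and sample data here are ours for illustration:

```python
# Minimal sketch of the Map/Reduce paradigm in plain Python.
# Real Hadoop runs mappers on the nodes that hold the data and
# ships the map output over the network to the reducers.
from collections import defaultdict

def map_phase(line):
    # Map step: runs locally, emits (key, value) pairs with low memory use.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values for a key together (done by the framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce step: combine all values for one key into the final result.
    return (key, sum(values))

lines = ["big data is big", "data is data"]
pairs = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# result == {'big': 2, 'data': 3, 'is': 2}
```

Hadoop's value is running these same three phases across many machines: mappers execute where the data lives, and the framework performs the shuffle over the network.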
What is Pig? Instead of all this (java) …
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
....
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class L2 extends Configured implements Tool {

    /** MAPPER */
    public static class Join extends Mapper<LongWritable, Text, Text, Text> {
        private Set<String> hash;

        @Override
        public void setup(Context context) {
            try {
                Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                if (paths == null || paths.length < 1) {
                    throw new RuntimeException("DistributedCache no work.");
                }
                // Open the small table
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(paths[0].toString())));
                String line;
                hash = new HashSet<String>(500);
                while ((line = reader.readLine()) != null) {
                    if (line.length() < 1) continue;
                    String[] fields = line.split("\u0001");
                    if (fields[0].length() != 0) hash.add(fields[0]);
                }
            } catch (IOException ioe) {
                throw new RuntimeException(ioe);
            }
        }

        @Override
        public void map(LongWritable k, Text val, Context context)
                throws IOException, InterruptedException {
            List<Text> fields = Library.splitLine(val, '\u0001');
            if (hash.contains(fields.get(0).toString())) {
                context.write(fields.get(0), fields.get(6));
            }
        }
    }

    /** RUN */
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: wordcount <input_dir> <output_dir> <reducers>");
            return -1;
        }
        Job job = new Job(getConf(), "PigMix L2");
        job.setJarByClass(L2.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Join.class);
        Properties props = System.getProperties();
        Configuration conf = job.getConfiguration();
        for (Map.Entry<Object, Object> entry : props.entrySet()) {
            conf.set((String) entry.getKey(), (String) entry.getValue());
        }
        DistributedCache.addCacheFile(new URI(args[0] + "/pigmix_power_users"), conf);
        FileInputFormat.addInputPath(job, new Path(args[0] + "/pigmix_page_views"));
        FileOutputFormat.setOutputPath(job, new Path(args[1] + "/L2out"));
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new L2(), args);
        System.exit(res);
    }
}
This (Pig Latin)*!
rmf /PIGFARM/pigmixout/l2out
register /proj/PIGFARM/PIGMIX/pigperf.jar;
A = LOAD '/PIGFARM/pigmix/pigmix_page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    AS (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = FOREACH A GENERATE user, estimated_revenue;
alpha = LOAD '/PIGFARM/pigmix/pigmix_users' using PigStorage('\u0001')
    AS (name, phone, address, city, state, zip);
beta = FOREACH alpha GENERATE name;
C = JOIN B BY user, beta BY name;
STORE C INTO '/PIGFARM/pigmixout/l2out';
* This is PIGMIX Benchmark script L2.pig
PIGFARM
Multiple Query Optimization (MQO) – the idea that several queries against a single database can be made more efficient if they are combined and issued together. When large firms have data scientists throughout their business units writing Pig scripts against common data sets in an uncoordinated manner, there is an opportunity to use MQO to improve the analytical bandwidth of these systems.
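The MQO payoff can be sketched in plain Python – two hypothetical "queries" stand in for uncoordinated Pig scripts reading the same data set:

```python
# Sketch of the MQO intuition: N uncoordinated queries each scan the
# full dataset, while a combined query answers all N in a single scan.
# The records and predicates are hypothetical stand-ins for Pig filters.
records = [("s1", "math", "algebra"), ("s2", "geo", "North Carolina"),
           ("s3", "math", "calculus")]

queries = {
    "q1": lambda r: "math" in r[1],            # pred mentions math
    "q2": lambda r: "North Carolina" in r[2],  # obj mentions North Carolina
}

# Uncoordinated: one full scan of `records` per query (N scans total).
naive = {name: [r for r in records if pred(r)] for name, pred in queries.items()}

# MQO: one shared scan evaluates every predicate on each record.
fused = {name: [] for name in queries}
for r in records:                      # single pass over the data
    for name, pred in queries.items():
        if pred(r):
            fused[name].append(r)

assert naive == fused  # same answers, 1 scan instead of N
```

The answers are identical; the fused version simply amortizes the expensive full scan across every query that needs it.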
The Real Idea
[Diagram, shown for PIGSCRIPT 1, PIGSCRIPT 2, … PIGSCRIPT N: a Big Data "feed" flows into each script; each script keeps only the data it wants ("I only like yellow/blue/red data") and discards the rest (NOOPS to /dev/null); a Farmer CPU tends the scripts.]
The Real Idea – Instead of this
[Diagram: N separate jobs, each with its own map (1, 2, … N) scanning the full input]
this – fuse the initial map
[Diagram: one fused map scans the input once and feeds all N downstream pipelines (1, 2, … N)]
fuse the LOAD statement
At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.
--Script determines the number of distinct pred/obj pairs that have math in them
rmf /PIGFARM/Merged/test001.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
filt1 = filter table by (obj matches '.*math.*') or (pred matches '.*math.*');
unduped = DISTINCT filt1;
store unduped into '/PIGFARM/Merged/test001.gz' using PigStorage('\t');
--Script computes the average height of people for each subject
rmf /PIGFARM/Merged/test002.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
filt1 = filter table BY (pred matches '.*"people.person.height_meters".*');
removeQuotes = FOREACH filt1 GENERATE sub, REGEX_EXTRACT(obj, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE sub, (double)num;
grouped = GROUP casted BY sub;
avged = FOREACH grouped GENERATE casted.sub, AVG(casted.num);
store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');
--Script determines the number of unique objects with North Carolina
rmf /PIGFARM/Merged/test003.gz
table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj, period);
filt = filter table by (obj matches '.*"North Carolina".*');
objs = foreach filt generate obj;
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');
fuse the LOAD statement
-- An LAS PIGFARM Compiled Pig Script
-- Compiled on: 23/02/17-07:09
-- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
--   1: table from /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--   2: table from /proj/PIGFARM/Script_Farm/ToMerge/test001.pig
--   3: table from /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t')
    AS (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
-- Below is the remainder of: /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--Script computes the average height of people for each subject
rmf /PIGFARM/Merged/test002.gz
filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
removeQuotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE laughing_wing, (double)num;
grouped = GROUP casted BY laughing_wing;
avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');
-- Below is the remainder of: /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
--Script determines the number of unique objects with North Carolina
rmf /PIGFARM/Merged/test003.gz
filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
objs = foreach filt generate jovial_golick;
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');
…
fuse the LOAD statement
But this didn’t work: Pig just submitted the job as if it were the 3 sequential Pig jobs. (Although it might still work with Tez.)
We decided to move the STORE statements to the end… this actually caused very large temporary files to be created – a performance killer.
We then decided to identify the initial Map portions of the scripts, STORE them compressed, and read them back in – essentially explicit temporary files. This seems to work.
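The working approach – fuse every script's initial map into one pass over the input and spill each result to an explicit temporary file – can be sketched in Python. The file layout, records, and filters here are illustrative assumptions, not the actual combiner:

```python
# Sketch of PIGFARM's working approach: run every script's initial map
# (its filter) in one pass over the input, spill each result to an
# explicit temporary file, then let each script's remainder read its
# own temporary back in.
import csv
import os
import tempfile

rows = [["s1", "people.person.height_meters", '"1.8"'],
        ["s2", "math.topic", "algebra"],
        ["s3", "location", '"North Carolina"']]

# One filter per merged script (stand-ins for the Pig FILTER statements).
filters = {"filt1": lambda r: "height_meters" in r[1],
           "filt2": lambda r: "math" in r[1]}

tmpdir = tempfile.mkdtemp()

# Fused initial map: one scan of the input, one spill file per script.
files, writers = {}, {}
for name in filters:
    f = open(os.path.join(tmpdir, name + ".tsv"), "w", newline="")
    files[name] = f
    writers[name] = csv.writer(f, delimiter="\t")
for r in rows:                      # single pass over the input
    for name, pred in filters.items():
        if pred(r):
            writers[name].writerow(r)
for f in files.values():
    f.close()

# Each script's remainder loads only its own (much smaller) temporary.
with open(os.path.join(tmpdir, "filt1.tsv"), newline="") as f:
    filt1 = list(csv.reader(f, delimiter="\t"))
assert filt1 == [["s1", "people.person.height_meters", '"1.8"']]
```

The expensive full scan happens once; each downstream pipeline then pays only for reading its own filtered (and, in PIGFARM, compressed) spill.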
fuse the initial mapper
-- An LAS PIGFARM Compiled Pig Script
-- Compiled on: 23/02/17-07:09
rmf /PIGFARM/cumbeeMerged/test001.gz
rmf /PIGFARM/cumbeeMerged/test002.gz
rmf /PIGFARM/cumbeeMerged/test003.gz
rmf /PIGFARM/cumbeeMerged/casted.gz
rmf /PIGFARM/cumbeeMerged/objs.gz
rmf /PIGFARM/cumbeeMerged/filt2.gz
rmf /PIGFARM/cumbeeMerged/filt3.gz
-- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
--   1: table from /proj/PIGFARM/Script_Farm/ToMerge/test002.pig
--   2: table from /proj/PIGFARM/Script_Farm/ToMerge/test001.pig
--   3: table from /proj/PIGFARM/Script_Farm/ToMerge/test003.pig
boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t')
    AS (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
filt2 = filter boring_aryabhata by (stoic_allen matches '.*math.*');
filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
filt3 = filter boring_aryabhata by (jovial_golick matches '.*math.*');
objs = foreach filt generate jovial_golick;
removeQuotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"', 1) as num;
casted = FOREACH removeQuotes GENERATE laughing_wing, (double)num;
store casted into '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t');
store objs into '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t');
store filt2 into '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t');
store filt3 into '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t');
casted = LOAD '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t') as (laughing_wing, num:double);
objs = LOAD '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t') as (jovial_golick);
filt2 = LOAD '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t') as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
filt3 = LOAD '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t') as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
grouped = GROUP casted BY laughing_wing;
avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
uniq_objs = distinct objs;
grouped_users = group uniq_objs all;
count = foreach grouped_users generate COUNT(uniq_objs);
joined = union count, grouped_users;
…
Datasets
PIGMIX – the standard synthetic Pig benchmark
250 million rows, mostly “dense”, 400 GB uncompressed
Used to test Apache Pig vs. Java Map/Reduce performance
Freebase – large knowledge graph available on the internet
3 billion+ (subject, predicate, object) tuples, 250 GB uncompressed
We made a special loader function UDF for Freebase called FBLoader()
Test Cluster
OSCAR LAB – Hortonworks cluster
12 blades – 1 login/name server, 11 compute nodes
Each blade has 65 GB of RAM
12 TB of HDFS, replication factor of 1
Preliminary Results
Freebase runtimes (minutes):

                        Test 1-3   PRL 1..8 (individual)   PRL_1-4   PRL_1-6   PRL_1-8
  Parallel submission      13            24-30                90        135       181
  PIGFARM                   8            N/A                  30         31        39

These scripts were compiled by hand, using Pig defaults for the number of reducers.
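The measured minutes imply sizable speedups for the fused runs; a quick sanity check, using only the numbers above:

```python
# Speedups implied by the preliminary results: parallel-submission
# minutes divided by PIGFARM minutes for each fused workload.
parallel = {"PRL_1-4": 90, "PRL_1-6": 135, "PRL_1-8": 181}
pigfarm  = {"PRL_1-4": 30, "PRL_1-6": 31,  "PRL_1-8": 39}

speedup = {k: round(parallel[k] / pigfarm[k], 2) for k in parallel}
# speedup == {'PRL_1-4': 3.0, 'PRL_1-6': 4.35, 'PRL_1-8': 4.64}
```

Notably, the speedup grows with the number of fused scripts, which is what shared scans predict: each extra script adds little to the single pass over the input.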
PIGFARMers
Work is ongoing
The team is still working on the script combiner – make sure it can handle FBLoader()
Create a similarity function for scripts based on the data they access
Run all of the experiments and write the results up in a paper
Also worthwhile to rerun all experiments with Tez vs. MapReduce
A furious finish – only 5 weeks left in the semester!
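One plausible shape for the planned similarity function – not the team's actual design – is Jaccard similarity over the data sources each script LOADs; the extraction rule and sample scripts below are illustrative assumptions:

```python
# Hypothetical script-similarity function: compare two Pig scripts by
# the Jaccard similarity of the data sources they LOAD. Scripts with
# high similarity are good candidates for PIGFARM fusion.
import re

def load_paths(script_text):
    # Pull the quoted path out of every LOAD statement.
    return set(re.findall(r"load\s+'([^']+)'", script_text, re.IGNORECASE))

def similarity(script_a, script_b):
    a, b = load_paths(script_a), load_paths(script_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

s1 = "table = load '/PIGFARM/data2.gz' using PigStorage('\\t');"
s2 = "t = load '/PIGFARM/data2.gz' using PigStorage('\\t');"
s3 = "other = load '/PIGFARM/pigmix/pigmix_users';"
assert similarity(s1, s2) == 1.0   # same source: ideal fusion candidates
assert similarity(s1, s3) == 0.0   # disjoint sources: nothing to share
```

A real version would also need to see through loader UDFs such as FBLoader() and glob patterns, which is part of why the combiner work is ongoing.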
Conclusion
If a large firm is writing many Apache Pig scripts to perform Map/Reduce jobs on common data sets, there could be large performance gains in fusing the maps together with PIGFARM – especially if most of the jobs are map-heavy.
Thanks!
- Dr. Aaron Wiechman - LAS
- Dr. Sean Lynch - LAS
- Ms. Margaret Heil – Director SDC
- Dr. David Sturgill – Tech Advisor Session 3
Questions?
References
[1] Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). "The Google File System". In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03), p. 29.
[2] Dean, J. and Ghemawat, S. (2004). "MapReduce: Simplified Data Processing on Large Clusters". In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI '04).