Mrs: MapReduce for Scientific Computing in Python
Andrew McNabb, Jeff Lund, and Kevin Seppi


SLIDE 1

Outline: MapReduce, MapReduce in Scientific Computing, Mrs Features, Performance and Case Studies

Mrs: MapReduce for Scientific Computing in Python

Andrew McNabb, Jeff Lund, and Kevin Seppi

Brigham Young University

November 16, 2012

SLIDE 2

MapReduce

• Large-scale problems require parallel processing
• Communication in parallel processing is hard
• MapReduce abstracts away interprocess communication
• The user only has to identify which parts of the problem are embarrassingly parallel
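The abstraction can be sketched in a few lines of plain Python (a toy, single-process illustration of the map/shuffle/reduce dataflow, not Mrs itself):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for key, value in inputs:
        # Map phase: each record may emit any number of (key, value) pairs.
        for out_key, out_value in map_fn(key, value):
            shuffled[out_key].append(out_value)   # shuffle: group by key
    # Reduce phase: each key's values are reduced independently.
    return {k: reduce_fn(k, vs) for k, vs in shuffled.items()}

# Example workload: word counting.
def wc_map(_, line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

result = run_mapreduce(enumerate(["a b a", "b c"]), wc_map, wc_reduce)
# result == {'a': 2, 'b': 2, 'c': 1}
```

The user writes only `wc_map` and `wc_reduce`; everything between them (the shuffle) is the part the framework hides.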

SLIDE 3

MapReduce

[Diagram: several inputs feed map tasks, whose grouped outputs feed reduce tasks]

SLIDE 4

WordCount

wordcount.py

    import mrs

    class WordCount(mrs.MapReduce):
        def map(self, line_num, line_text):
            for word in line_text.split():
                yield (word, 1)

        def reduce(self, word, counts):
            yield sum(counts)

    if __name__ == '__main__':
        mrs.main(WordCount)

SLIDE 5

Iterative MapReduce

[Diagram: iterative MapReduce: the reduce outputs of one iteration feed the map tasks of the next]
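One way to picture the iteration is a driver loop that feeds each round's reduce output back in as the next round's map input (an illustrative single-process sketch, not the Mrs API; the per-key Newton square-root update is just a stand-in workload):

```python
def newton_map(key, state):
    a, x = state
    yield key, (a, 0.5 * (x + a / x))   # one Newton step toward sqrt(a)

def identity_reduce(key, values):
    return values[0]

def run_round(data, map_fn, reduce_fn):
    """One MapReduce round over a dict of key -> value."""
    grouped = {}
    for k, v in data.items():
        for out_k, out_v in map_fn(k, v):
            grouped.setdefault(out_k, []).append(out_v)
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# The driver loops over complete MapReduce rounds.
data = {'two': (2.0, 1.0), 'nine': (9.0, 1.0)}
for _ in range(20):
    data = run_round(data, newton_map, identity_reduce)
# data['nine'][1] converges to 3.0 and data['two'][1] to sqrt(2)
```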

SLIDE 6

Hadoop

• Hadoop is the most widely used open-source MapReduce implementation
• Hadoop was designed for big data, not scientific computing
• Requires the use of HDFS and a dedicated cluster

SLIDE 7

MapReduce in Scientific Computing

What does an ideal MapReduce implementation look like in the context of scientific computing?

SLIDE 8

Ease of Development

• Rapid prototyping
• Testability
• Debuggability
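Because map and reduce are plain Python generators, they can be exercised directly in a unit test, with no cluster or framework involved (a sketch using a standalone copy of the WordCount logic):

```python
def wc_map(line_num, line_text):
    for word in line_text.split():
        yield (word, 1)

def wc_reduce(word, counts):
    yield sum(counts)

# Call the generators directly: no framework, cluster, or I/O needed.
map_out = list(wc_map(0, "to be or not to be"))
assert map_out == [('to', 1), ('be', 1), ('or', 1),
                   ('not', 1), ('to', 1), ('be', 1)]
assert list(wc_reduce('to', [1, 1])) == [2]
```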

SLIDE 9

Ease of Development

WordCount.java

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

SLIDE 10

Ease of Deployment

• Dedicated cluster vs. supercomputers and private clusters
• Work with any filesystem
• Work with any scheduler

SLIDE 11

Ease of Deployment

pbs-hadoop.sh

    # Step 1: Find the network address.
    ADDR=$(/sbin/ip -o -4 addr list "$INTERFACE" | sed -e 's;^.*inet \(.*\)/.*$;\1;')

    # Step 2: Set up the Hadoop configuration.
    export HADOOP_LOG_DIR=$JOBDIR/log
    mkdir $HADOOP_LOG_DIR
    export HADOOP_CONF_DIR=$JOBDIR/conf
    cp -R $HADOOP_HOME/conf $HADOOP_CONF_DIR
    sed -e "s/MASTER_IP_ADDRESS/$ADDR/g" \
        -e "s@HADOOP_TMP_DIR@$JOBDIR/tmp@g" \
        -e "s/MAP_TASKS/$MAP_TASKS/g" \
        -e "s/REDUCE_TASKS/$REDUCE_TASKS/g" \
        -e "s/TASKS_PER_NODE/$TASKS_PER_NODE/g" \
        <$HADOOP_HOME/conf/hadoop-site.xml \
        >$HADOOP_CONF_DIR/hadoop-site.xml

    # Step 3: Start daemons on the master.
    HADOOP="$HADOOP_HOME/bin/hadoop"
    $HADOOP namenode -format   # format the HDFS
    $HADOOP_HOME/bin/hadoop-daemon.sh start namenode
    $HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker

    # Step 4: Start daemons on the slaves.
    ENV=". $HOME/.bashrc; export HADOOP_CONF_DIR=$HADOOP_CONF_DIR; export HADOOP_LOG_DIR=$HADOOP_LOG_DIR"
    pbsdsh -u bash -c "$ENV; $HADOOP datanode" &
    pbsdsh -u bash -c "$ENV; $HADOOP tasktracker" &
    sleep 15

    # Step 5: Run the user program.
    $HADOOP dfs -put $INPUT $HDFS_INPUT
    $HADOOP jar $PROGRAM ${ARGS[@]}
    $HADOOP dfs -get $HDFS_OUTPUT $OUTPUT

    # Step 6: Stop daemons on the slaves and master.
    kill %2   # kill tasktracker
    kill %1   # kill datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh stop jobtracker
    $HADOOP_HOME/bin/hadoop-daemon.sh stop namenode

SLIDE 12

Other Issues

• Iterative performance
• Fault tolerance
• Interoperability

SLIDE 13

What is Mrs?

• Aims to be a simple-to-use MapReduce framework
• Implemented in pure Python
• Designed with scientific computing in mind

SLIDE 14

Why Python?

• Python is nearly ubiquitous
• Mrs needs no dependencies outside the standard library
• Familiarity and readability
• Easy interoperability
• Debugging and testing
• One downside: the GIL

SLIDE 15

Iterative MapReduce

[Diagram: iterative MapReduce: the reduce outputs of one iteration feed the map tasks of the next]

SLIDE 16

Iterative MapReduce: ReduceMap

[Diagram: ReduceMap: each reduce and the following map are fused into a single ReduceMap stage between iterations]
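The idea can be sketched as a fused stage that feeds each reduce output straight into the next iteration's map, avoiding a separate stage boundary between them (an illustrative sketch; `reducemap`, `sum_reduce`, and `parity_map` are hypothetical names, not Mrs API):

```python
def reducemap(key, values, reduce_fn, map_fn):
    """Fused stage: each reduce output flows directly into the next map."""
    for reduced in reduce_fn(key, values):
        yield from map_fn(key, reduced)

# Example: sum the counts for a key, then immediately re-key by parity.
def sum_reduce(key, values):
    yield sum(values)

def parity_map(key, total):
    yield ('even' if total % 2 == 0 else 'odd', key)

out = list(reducemap('apple', [1, 1, 1], sum_reduce, parity_map))
# out == [('odd', 'apple')]
```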

SLIDE 17

Automatic Serialization

• Serialization happens every time a task communicates with another machine
• Mrs handles this automatically with pickle
• Hadoop requires Writable classes everywhere
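What pickle buys is that ordinary Python values cross the wire with no per-type boilerplate (contrast one Hadoop Writable class per value type):

```python
import pickle

# Any picklable Python value, however nested, round-trips unchanged:
record = ('doc-42', {'tokens': ['mrs', 'mapreduce'], 'score': 0.97})
wire_bytes = pickle.dumps(record)        # what the sending task transmits
restored = pickle.loads(wire_bytes)      # what the receiving task rebuilds
assert restored == record
```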

SLIDE 18

Debugging: Run Modes

• Serial
• Mock Parallel
• Parallel

SLIDE 19

Debugging: Random Number Generators

• Seeding random number generators makes results reproducible
• Each task needs a different seed
• Mrs has a random function that creates a random number generator from an arbitrary number of offset parameters

  • ex. rand = self.random(id, iter)
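One plausible way such a helper could derive distinct, reproducible streams is to hash the offset parameters into a seed (a hypothetical sketch of the idea; this is not Mrs's actual implementation):

```python
import hashlib
import random

def task_random(base_seed, *offsets):
    """Mix a base seed with any number of offset parameters (e.g. task id,
    iteration) into one seed, giving each task its own reproducible stream."""
    material = repr((base_seed,) + offsets).encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], 'big')
    return random.Random(seed)

# Same offsets give an identical stream; different offsets give a distinct one.
a = task_random(42, 3, 7)
b = task_random(42, 3, 7)
c = task_random(42, 3, 8)
assert [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
assert c.random() != task_random(42, 3, 7).random()
```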
SLIDE 20

Performance and Case Studies

Interpreter overhead does not preclude good performance for Mrs. We demonstrate this on three different problems:

• Halton Sequence: CPU-bound benchmark
• Particle Swarm Optimization: CPU-bound application
• Walk Analysis: I/O-bound application

SLIDE 21

Performance and Case Studies

Optimization story:

• Make sure you have the right algorithm
• Profile carefully
• Run with PyPy
• Rewrite the critical path in C
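The profiling step might look like this with the standard-library profiler (a generic sketch, not Mrs-specific; `hot_loop` is a stand-in for whatever dominates your runtime):

```python
import cProfile
import io
import pstats

def hot_loop(n):
    """Stand-in CPU-bound function to profile."""
    total = 0.0
    for i in range(1, n):
        total += 1.0 / (i * i)
    return total

profiler = cProfile.Profile()
profiler.enable()
hot_loop(100_000)
profiler.disable()

# Render the top entries by cumulative time into a string report.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()   # the hot function shows up near the top
```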

SLIDE 22

Monte Carlo Pi Estimation

• Monte Carlo algorithm for estimating the value of π by generating random points in a square
• Very little data, but computationally intense
• We can control how much computation each map task performs

[Figure: Halton-sequence points in the square [−0.5, 0.5] × [−0.5, 0.5]]
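The idea of the benchmark can be sketched as follows (an illustrative reconstruction, not the actual benchmark code): Halton points are deterministic and low-discrepancy, each "map task" counts how many of its points fall inside the inscribed circle, and combining the counts plays the role of reduce.

```python
def halton(index, base):
    """The index-th element (1-based) of the Halton sequence in `base`."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        result += f * (index % base)
        index //= base
        f /= base
    return result

def count_inside(start, count):
    """One 'map task': count Halton points inside the inscribed circle."""
    inside = 0
    for i in range(start, start + count):
        x = halton(i, 2) - 0.5          # point in [-0.5, 0.5] x [-0.5, 0.5]
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:       # inside the circle of radius 0.5
            inside += 1
    return inside

# Driver: split the points across "tasks", then combine (the reduce step).
# Points per map task is the knob the benchmark varies.
tasks = [(1 + k * 5000, 5000) for k in range(4)]
total_inside = sum(count_inside(start, n) for start, n in tasks)
pi_estimate = 4.0 * total_inside / 20000
```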

SLIDE 23

Monte Carlo Pi Estimation

[Plot: time (seconds, 0 to 120) vs. points per map task (10^0 to 10^10, log scale), Mrs using pure Python; series: Hadoop (Java), Mrs (PyPy), Mrs (CPython)]

SLIDE 24

Monte Carlo Pi Estimation

[Plot: time (seconds, 0 to 120) vs. points per map task (10^0 to 10^11, log scale), Python with the inner loop in C (using ctypes); series: Hadoop (Java), Mrs (CPython)]

SLIDE 25

Particle Swarm Optimization

• Inspired by simulations of flocking birds
• Particles interact while exploring
• Map: motion and function evaluation
• Reduce: communication
• CPU-bound problem

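A minimal serial PSO sketch shows which parts correspond to "map" (motion and function evaluation) and "reduce" (sharing the global best); all constants here are illustrative choices, not the paper's settings:

```python
import random

def sphere(x):
    """Toy objective: minimum 0 at the origin."""
    return sum(v * v for v in x)

def pso(dims=2, particles=8, iters=60, seed=1):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dims)] for _ in range(particles)]
    vel = [[0.0] * dims for _ in range(particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position
    gbest = min(pbest, key=sphere)              # swarm-wide best position
    for _ in range(iters):
        for i in range(particles):              # "map": move and evaluate
            for d in range(dims):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.4 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.4 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if sphere(pos[i]) < sphere(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest, key=sphere)          # "reduce": share the best
    return gbest

best = pso()   # converges toward the origin on this easy objective
```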

SLIDE 26

Particle Swarm Optimization

[Plot: best value (10^-4 to 10^10, log scale) vs. minutes (0 to 300); convergence plots for the Rosenbrock-250 function; series: Serial, Parallel]

SLIDE 27

Walk Analyzer

• Involves analyzing random walks in a graph
• Heavily I/O-bound
• Average Hadoop time: 1:06:53
• Average Mrs time: 52:55

SLIDE 28

Where to find Mrs

Mrs homepage, with links to source, documentation, mailing list, etc.: http://code.google.com/p/mrs-mapreduce

In case you forget the URL, just google "mrs mapreduce" :)

SLIDE 29

Other cool features I neglected to mention...

• Merge sort reduce
• Asynchronous MapReduce
• Concurrent convergence checks
• Memory logging
• Dataset custom serializers