Mrs: High Performance MapReduce for Iterative and Asynchronous Algorithms in Python


Mrs. Iterative MapReduce Performance and Case Studies Mrs: High Performance MapReduce for Iterative and Asynchronous Algorithms in Python Jeff Lund, Chace Ashcraft, Andrew McNabb and Kevin Seppi Brigham Young University November 14, 2016


SLIDE 1

Mrs. Iterative MapReduce Performance and Case Studies

Mrs: High Performance MapReduce for Iterative and Asynchronous Algorithms in Python

Jeff Lund, Chace Ashcraft, Andrew McNabb and Kevin Seppi

Brigham Young University

November 14, 2016

SLIDE 2

What is Mrs?

  • Simple and easy to use MapReduce framework
  • Implemented in pure Python
  • Designed with scientific computing in mind

SLIDE 3

MapReduce

[Diagram: five Input splits feed five Map tasks, whose output feeds three Reduce tasks.]

SLIDE 4

Example: WordCount

wordcount.py

    import mrs

    class WordCount(mrs.MapReduce):
        def map(self, line_num, line_text):
            for word in line_text.split():
                yield (word, 1)

        def reduce(self, word, counts):
            yield sum(counts)

    if __name__ == '__main__':
        mrs.main(WordCount)
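To see what a framework does with these two methods, the following is a minimal single-process sketch of the same map, group-by-key, and reduce flow. It does not use Mrs: `run_wordcount` and the in-memory grouping are assumptions made purely for illustration.

```python
from collections import defaultdict

def map_func(line_num, line_text):
    # Emit (word, 1) for every whitespace-separated word, as in WordCount.map
    for word in line_text.split():
        yield (word, 1)

def reduce_func(word, counts):
    # Sum the partial counts for one word, as in WordCount.reduce
    yield sum(counts)

def run_wordcount(lines):
    # Hypothetical stand-in for the framework: map every input record,
    # group the intermediate values by key, then reduce each group.
    groups = defaultdict(list)
    for line_num, line_text in enumerate(lines):
        for key, value in map_func(line_num, line_text):
            groups[key].append(value)
    return {key: next(reduce_func(key, values)) for key, values in groups.items()}

if __name__ == '__main__':
    print(run_wordcount(["a rose is a rose", "is it"]))
```

A real framework distributes the map and reduce calls across workers; the grouping step here stands in for that shuffle.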

SLIDE 5

Why Python?

  • Python is nearly ubiquitous
  • Mrs needs no dependencies outside the standard library
  • Familiarity and readability
  • Easy interoperability
  • Debugging and testing

SLIDE 6

Iterative MapReduce

[Diagram: four Input splits feed four Map tasks and four Reduce tasks; the Reduce output feeds another round of Map and Reduce tasks, and so on.]

Performance Challenges:
  • CPU-bound problems
  • Communication time
  • Task management

SLIDE 7

Proposed Solutions

  • Infrequent Checkpointing
  • Reduce-Map task
  • Generator-Callback Model
  • Asynchronous Scheduling Model

SLIDE 8

How Often to Checkpoint

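Let X be a random variable indicating that a failure occurred during an iteration. Then

$$X \sim \mathrm{Bernoulli}\left(\frac{1}{f}\left(t + \frac{c}{n}\right)\right)$$

  • n: Number of iterations between checkpoints
  • t: Time to perform each iteration
  • c: Extra time required for a checkpointed iteration
  • f: Failures in a cluster (expressed as a time, such as the mean time between failures, so that the Bernoulli parameter is a probability)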

SLIDE 9

How Often to Checkpoint

If Y ∼ Uniform(n) indicates the number of iterations since the last checkpoint, then the expected number of seconds of extra work in an iteration is

$$E[X(r + Yt)] = \frac{1}{f}\left(t + \frac{c}{n}\right)\left(r + \frac{n}{2}t\right)$$

and the breakeven number of iterations is

$$n = \max\left(1, \frac{1}{t}\left(\sqrt{\left(\frac{c}{2} + r\right)^2 - 2c(r - f)} - \left(\frac{c}{2} + r\right)\right)\right).$$
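The checkpointing formulas can be explored numerically. The sketch below rests on two assumptions that the slides do not state: r is taken to be the fixed cost of recovering from a failure, and f is expressed in the same time units as t, c, and r. The function names and the sample values are illustrative only.

```python
import math

def expected_extra_work(n, t, c, r, f):
    # E[X(r + Yt)] = (1/f) (t + c/n) (r + n t / 2)
    return (t + c / n) * (r + n * t / 2.0) / f

def breakeven_iterations(t, c, r, f):
    # n = max(1, (1/t) (sqrt((c/2 + r)^2 - 2c(r - f)) - (c/2 + r)))
    half = c / 2.0 + r
    root = math.sqrt(half ** 2 - 2.0 * c * (r - f))
    return max(1.0, (root - half) / t)

if __name__ == '__main__':
    # Hypothetical values in seconds: 1 s iterations, 5 s checkpoints,
    # 30 s recovery, and one failure per day on average.
    n = breakeven_iterations(t=1.0, c=5.0, r=30.0, f=86400.0)
    print(n, n * expected_extra_work(n, 1.0, 5.0, 30.0, 86400.0))
```

With the formula written this way, n times the expected extra work per iteration comes out equal to the checkpoint cost c whenever the result is above the floor of 1, which is one way to read "breakeven" here.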

SLIDE 10

Iterative MapReduce: ReduceMap

[Diagram, top: plain iterative MapReduce, where Input feeds alternating rounds of Map and Reduce tasks.]

[Diagram, bottom: the same computation with ReduceMap tasks, where Input feeds one round of Map tasks followed by repeated rounds of ReduceMap tasks.]
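A ReduceMap task can be pictured as the reduce of one iteration composed with the map of the next, so that each reduce output is handed directly to the following map. The intent, as the diagram suggests, is to save one round of task startup and data handoff per iteration. The following is a minimal sketch of that composition; `make_reduce_map` and the toy `map_func` and `reduce_func` are hypothetical and are not part of the Mrs API.

```python
def make_reduce_map(reduce_func, map_func):
    """Fuse a reduce function with the next iteration's map function."""
    def reduce_map(key, values):
        for reduced in reduce_func(key, values):
            # Feed each reduce output straight into the following map
            for out_key, out_value in map_func(key, reduced):
                yield (out_key, out_value)
    return reduce_map

def reduce_func(key, values):
    # Toy reduce: total of all values for the key
    yield sum(values)

def map_func(key, value):
    # Toy map: send half the value to this key and half to the next key
    yield (key, value / 2)
    yield ((key + 1) % 4, value / 2)

if __name__ == '__main__':
    reduce_map = make_reduce_map(reduce_func, map_func)
    print(list(reduce_map(0, [1, 3])))
```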

SLIDE 11

Generator-Callback Model

    def run_batches():
        data_path = input_path
        for iteration in range(MAX_ITERATIONS):
            output_path = make_temp_path()
            job = new_job(data_path, map_func, reduce_func, output_path)
            # The driver blocks here on every iteration
            job.wait_for_completion()
            data_path = output_path
            if iteration % CHECK_FREQUENCY == 0:
                data = read_all(data_path)
                perform_output(data)
                if converged(data):
                    break

SLIDE 12

Generator-Callback Model

    def generator(queue):
        dataset = input_data
        for iteration in range(MAX_ITERATIONS):
            output_path = make_temp_path()
            dataset = mapreduce(dataset, map_func, reduce_func, output_path)
            # Submit without waiting; attach a callback only on check iterations
            if iteration % CHECK_FREQUENCY == 0:
                queue.submit(dataset, callback)
            else:
                queue.submit(dataset, None)

    def callback(data):
        data.read_all()
        perform_output(data)
        # True means "keep going"; False stops the iteration
        return not converged(data)
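The control flow of the generator-callback model can be imitated in a single process. The sketch below is a toy under stated assumptions, not the Mrs implementation: `run`, `halve`, the queue depth `max_pending`, and the convergence threshold are all hypothetical. It shows a generator that describes iterations without waiting on results, and a driver that stops as soon as a callback returns False.

```python
from collections import deque

def halve(x):
    # Stand-in for one full MapReduce iteration
    return x / 2.0

def callback(x):
    # Convergence check; returning False tells the driver to stop
    return not (x < 0.01)

def generator(max_iterations, check_frequency, callback):
    # Describes the iterations lazily and never waits on a result
    for iteration in range(max_iterations):
        check = callback if iteration % check_frequency == 0 else None
        yield (halve, check)

def run(x, work, max_pending=3):
    """Toy driver: keep up to max_pending iterations queued, run them in order."""
    queue = deque()
    executed = 0
    exhausted = False
    while True:
        # Top up the queue without waiting for earlier results
        while not exhausted and len(queue) < max_pending:
            try:
                queue.append(next(work))
            except StopIteration:
                exhausted = True
        if not queue:
            break
        func, check = queue.popleft()
        x = func(x)
        executed += 1
        if check is not None and not check(x):
            break  # converged: whatever is still queued is dropped
    return x, executed

if __name__ == '__main__':
    print(run(1.0, generator(100, 4, callback)))
```

Because the check runs only on every fourth iteration in this example, the driver performs a few iterations past the point where the value first drops below the threshold; that is the trade-off of checking rarely.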

SLIDE 13

Task Dependencies: Synchronous MapReduce

SLIDE 14

Task Dependencies: Asynchronous MapReduce

SLIDE 15

Task Execution Traces

[Figures: task execution traces for the synchronous and the asynchronous model.]

SLIDE 16

Performance and Case Studies

We demonstrate on two different problems:
  • Particle Swarm Optimization: minimize the 250-dimensional Rosenbrock function
  • Expectation Maximization: Mixture of Multinomials model in the context of clustering text documents

SLIDE 17

Particle Swarm Optimization

  • Inspired by simulations of flocking birds
  • Particles interact while exploring
  • Map: motion and function evaluation
  • Reduce: communication
  • CPU-bound problem
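The split between map (motion and function evaluation) and reduce (communication) can be illustrated with a small single-process sketch. Everything specific here is an assumption for illustration: the constants, the ring neighborhood, the 5-dimensional problem size, and the helper names are not taken from Mrs or from the experiments.

```python
import random

DIMS = 5                       # toy size; the case study is far larger
W, C1, C2 = 0.72, 1.49, 1.49   # assumed inertia and attraction constants

def rosenbrock(x):
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def new_particle(rng):
    pos = [rng.uniform(-2.0, 2.0) for _ in range(DIMS)]
    val = rosenbrock(pos)
    return {'pos': pos, 'vel': [0.0] * DIMS,
            'best_pos': pos[:], 'best_val': val,
            'nbest_pos': pos[:], 'nbest_val': val}

def pso_map(pid, p, rng, swarm_size):
    # Map: move the particle and evaluate the function
    for d in range(DIMS):
        p['vel'][d] = (W * p['vel'][d]
                       + C1 * rng.random() * (p['best_pos'][d] - p['pos'][d])
                       + C2 * rng.random() * (p['nbest_pos'][d] - p['pos'][d]))
        p['pos'][d] += p['vel'][d]
    val = rosenbrock(p['pos'])
    if val < p['best_val']:
        p['best_val'], p['best_pos'] = val, p['pos'][:]
    # Emit the particle to itself and its personal best to ring neighbors
    yield (pid, ('particle', p))
    for nid in ((pid - 1) % swarm_size, (pid + 1) % swarm_size):
        yield (nid, ('message', (p['best_val'], p['best_pos'][:])))

def pso_reduce(pid, values):
    # Reduce: communication, merging neighbors' bests into the particle
    values = list(values)
    p = next(v for tag, v in values if tag == 'particle')
    for tag, v in values:
        if tag == 'message' and v[0] < p['nbest_val']:
            p['nbest_val'], p['nbest_pos'] = v
    if p['best_val'] < p['nbest_val']:
        p['nbest_val'], p['nbest_pos'] = p['best_val'], p['best_pos'][:]
    return p

def run_pso(swarm_size=10, iterations=50, seed=0):
    rng = random.Random(seed)
    swarm = {pid: new_particle(rng) for pid in range(swarm_size)}
    for _ in range(iterations):
        groups = {pid: [] for pid in swarm}
        for pid, p in swarm.items():
            for key, value in pso_map(pid, p, rng, swarm_size):
                groups[key].append(value)
        swarm = {pid: pso_reduce(pid, vals) for pid, vals in groups.items()}
    return min(p['best_val'] for p in swarm.values())

if __name__ == '__main__':
    print(run_pso(iterations=0), run_pso(iterations=50))
```

Nearly all of the arithmetic sits in the map step, which fits the slide's description of the problem as CPU-bound; the reduce step only compares a few values.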


SLIDE 18

Particle Swarm Optimization

[Plot: parallel efficiency versus number of subiterations (log scale, 10^0 to 10^3). Series: Reduce-map tasks, Rare checks, Concurrent checks, No redundant storage, Redundant storage.]

SLIDE 19

Particle Swarm Optimization: Asynchronous

[Plot: average tasks per second versus standard deviation of subiterations. Series: Asynchronous, Synchronous.]

SLIDE 20

Particle Swarm Optimization: Asynchronous

[Plot: average tasks per second versus number of processors (16 to 768). Series: Synchronous, Asynchronous.]

SLIDE 21

Expectation Maximization

Feature Set Size     80      252     8000    25298
Reduce-map tasks     0.411   0.357   0.277   0.193
Rare checks          0.362   0.314   0.253   0.18
Redundant storage    0.013   0.013   0.013   0.012

Parallel efficiency per iteration of EM for various feature set sizes.

SLIDE 22

Conclusion

By taking the following approaches, we have considerably improved performance for iterative parallel algorithms in Mrs:
  • Infrequent Checkpointing
  • Reduce-Map Task
  • Generator-Callback Model
  • Asynchronous Model

SLIDE 23

Where to find Mrs

Mrs homepage with links to source, documentation, mailing list, etc.: https://github.com/byu-aml-lab/mrs-mapreduce

In case you forget the URL, just google “mrs mapreduce” :)