SEGUE: parallel R in the cloud. two lines of code. no kidding!

why... so i've got this problem...
- insurance simulations
- updated frequently
- for one month
- on my laptop...
- each sim takes ~ 1 min
- 10k sims * 1 min = ~ 7 days
- no need for full map/reduce
- embarrassingly parallel

you've seen "word count" demos...
- segue has nothing to do with that
- big cpu, not big data

my options...
- make the code faster
- build a cluster
  - type: snow, mpi, hadoop
  - location: self hosted; amazon web services (ec2, emr); rackspace
  - (callout on the slide: lowest startup costs)

syntax...

    require(segue)
    myCluster <- createCluster()

congratulations. we've built a hadoop cluster!

more syntax...
parallel apply() on lists:

    base R: lapply( X, FUN, … )
    segue:  emrlapply( clusterObject, X, FUN, … )

example...

    estimatePi <- function( seed ){
      set.seed(seed)
      numDraws <- 1000000
      r <- .5
      x <- runif(numDraws, min=-r, max=r)
      y <- runif(numDraws, min=-r, max=r)
      inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)
      return(sum(inCircle) / length(inCircle) * 4)
    }

    seedList <- as.list(1:1000)
    require(segue)
    myCluster <- createCluster(20)
    myEstimates <- emrlapply( myCluster, seedList, estimatePi )
    stopCluster(myCluster)
    myPi <- Reduce(sum, myEstimates) / length(myEstimates)
    format(myPi, digits=10)

https://gist.github.com/764370
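Because emrlapply() mirrors lapply(), the same function can be tried locally on a handful of seeds before any cluster is started. The following is only a sketch of that dry run: the numDraws argument, the smaller seed list, and the names smallSeedList, localEstimates and localPi are additions for illustration and are not part of the original example.

```r
# estimatePi as on the slide, with numDraws exposed as an argument
# (that argument is an addition for this sketch, not part of the original).
estimatePi <- function(seed, numDraws = 1000000) {
  set.seed(seed)
  r <- .5
  x <- runif(numDraws, min = -r, max = r)
  y <- runif(numDraws, min = -r, max = r)
  # 1 if the point falls inside the circle of radius r, else 0
  inCircle <- ifelse((x^2 + y^2)^.5 < r, 1, 0)
  sum(inCircle) / length(inCircle) * 4
}

# Local dry run: base R lapply() on 10 seeds with fewer draws per seed.
smallSeedList <- as.list(1:10)
localEstimates <- lapply(smallSeedList, estimatePi, numDraws = 100000)
localPi <- Reduce(sum, localEstimates) / length(localEstimates)
print(format(localPi, digits = 10))
```

If the local result looks sensible, the lapply() call is the only line that needs to change to the emrlapply() form shown in the example.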

how does it work? createCluster()
- cluster object: list of parameters
- temp dirs: local, S3
- for EMR bootstrap: update R, update packages
- ~ 10-15 minutes

how does it work? emrlapply()

on the way in:
- list is serialized to CSV and uploaded to S3 (the streaming input file)
- function, arguments, r objects, etc are saved & uploaded
- EMR copies files to nodes, where mapper.R picks them up
- CSV is input to mapper.R
- mapper.R applies function to each list element

on the way out:
- output is serialized into emr part-xxxxx files on s3
- part files are downloaded to R and deserialized
- deserialized results are reordered and put into a list object
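The round trip above can be imitated locally in base R. This is a toy sketch of the idea only, not segue's actual code: the "key,hex" line format and the helper names to_line, from_line and mapper are assumptions made up for illustration, and no S3 or EMR is involved.

```r
# Encode one list element as a single text line: "<key>,<hex of serialized object>".
# (Hypothetical format; segue's real on-disk format may differ.)
to_line <- function(key, obj) {
  hex <- paste(as.character(serialize(obj, NULL)), collapse = "")
  paste(key, hex, sep = ",")
}

# Decode a line back into its key and R object.
from_line <- function(line) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  hex <- parts[2]
  starts <- seq(1, nchar(hex), by = 2)
  bytes <- as.raw(strtoi(substring(hex, starts, starts + 1), 16L))
  list(key = as.integer(parts[1]), value = unserialize(bytes))
}

# Stand-in for the mapper: decode, apply the function, encode the result.
mapper <- function(line, fun) {
  rec <- from_line(line)
  to_line(rec$key, fun(rec$value))
}

x <- list(1:3, 4:6, 7:9)
input_lines <- mapply(to_line, seq_along(x), x)

# Shuffle the lines to mimic results coming back in arbitrary order.
set.seed(1)
output_lines <- vapply(sample(input_lines),
                       function(l) mapper(l, sum),
                       character(1), USE.NAMES = FALSE)

# Deserialize, then reorder by key so results line up with the input list.
records <- lapply(output_lines, from_line)
keys <- vapply(records, function(r) r$key, integer(1))
results <- lapply(records[order(keys)], function(r) r$value)
print(results)
```

Carrying a key with every element is what makes the final reordering step possible, since the part files carry no guarantee about order.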

createCluster( numInstances=2, cranPackages, filesOnNodes, rObjectsOnNodes,
               enableDebugging=FALSE, instancesPerNode,
               masterInstanceType="m1.small", slaveInstanceType="m1.small",
               location="us-east-1a", ec2KeyName, copy.image=FALSE,
               otherBootstrapActions, sourcePackagesToInstall)

numInstances             number of ec2 machines to fire up
cranPackages             cran packages to load on each cluster node
filesOnNodes             files to be loaded on each node
rObjectsOnNodes          R objects to put on the worker nodes
enableDebugging          start emr debugging
instancesPerNode         number of R instances per node
masterInstanceType       ec2 instance type for the master node
slaveInstanceType        ec2 instance type for the slave nodes
location                 ec2 location name for the cluster
ec2KeyName               ec2 key used for logging into the main node
copy.image               copy the entire local environment to the nodes?
otherBootstrapActions    other bootstrap actions to run
sourcePackagesToInstall  R source packages to be installed on each node

when to use segue...
- embarrassingly parallel
- cpu bound
- apply on lists with many items
- object size: to / from s3 roundtrip
- each job has a fixed & marginal cost
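The fixed-plus-marginal cost point can be made concrete with a back-of-envelope model. Everything in this sketch is an assumption for illustration: the function job_minutes, the 15 minute fixed cost (borrowed from the bootstrap time quoted earlier), and the 0.01 minute per-item i/o figure are not measurements of segue.

```r
# Rough wall-clock model: a fixed cost plus a per-item (marginal) cost.
# All numbers used below are illustrative assumptions, not measurements.
job_minutes <- function(n_items, minutes_per_item, workers,
                        fixed_minutes = 0, io_minutes_per_item = 0) {
  fixed_minutes + n_items * (minutes_per_item / workers + io_minutes_per_item)
}

# Big job: 10k one-minute sims, laptop vs. a hypothetical 20-worker cluster.
big_local   <- job_minutes(10000, 1, workers = 1)
big_cluster <- job_minutes(10000, 1, workers = 20,
                           fixed_minutes = 15, io_minutes_per_item = 0.01)

# Small job: 10 one-minute sims, same two setups.
small_local   <- job_minutes(10, 1, workers = 1)
small_cluster <- job_minutes(10, 1, workers = 20,
                             fixed_minutes = 15, io_minutes_per_item = 0.01)

print(c(big_local = big_local, big_cluster = big_cluster,
        small_local = small_local, small_cluster = small_cluster))
```

Under these assumed numbers the cluster wins easily on the big job and loses on the small one, which is the reason for the "many items" and "cpu bound" conditions.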

downside of segue...
- embarrassingly parallel failure

ways to fail...
if you use segue you will see:
- unreproducible errors
- clusters that never start
- temp buckets in your s3 acct
- clusters left running
- i/o that takes longer than calcs

but... i've never had a "wrong" answer
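One way to reduce the "clusters left running" failure is to tie shutdown to function exit with base R's on.exit(). This is a generic sketch, not something segue provides: the helper with_cluster and its arguments start_fn, stop_fn and work_fn are hypothetical names, and the demo uses stand-in functions so it runs without AWS.

```r
# Run work_fn against a cluster and always call stop_fn afterwards,
# even if work_fn throws an error. (Hypothetical helper, not part of segue.)
with_cluster <- function(start_fn, stop_fn, work_fn) {
  cl <- start_fn()
  on.exit(stop_fn(cl), add = TRUE)
  work_fn(cl)
}

# Demo with stand-ins that only record what happened.
state <- new.env()
state$events <- character(0)
fake_start <- function() { state$events <- c(state$events, "started"); "fake-cluster" }
fake_stop  <- function(cl) { state$events <- c(state$events, "stopped"); invisible(NULL) }

# The work fails on purpose; the cluster is still stopped.
outcome <- try(
  with_cluster(fake_start, fake_stop, function(cl) stop("simulated failure")),
  silent = TRUE
)
print(state$events)

# With segue, the same shape would presumably look like:
#   with_cluster(function() createCluster(20), stopCluster,
#                function(cl) emrlapply(cl, seedList, estimatePi))
```

This does not help if the R session itself dies, so checking the AWS console for stray clusters and temp buckets is still worthwhile.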

immediate segue future...
- maintenance issues: R releases change, emr changes
- vendor lock-in to amazon
- whirr as solution?
- foreach %dopar% backend?

imagine the future...
- R objects backed by clusters
- as.hdfs.data.frame(data)
- operations converted to map reduce jobs transparently
- abstractions...

segue project page: http://code.google.com/p/segue/
google groups: http://groups.google.com/group/segue-r
see also... rhipe (program m/r in R): http://www.stat.purdue.edu/~sguha/rhipe/
