 
              Zihang Yin
Introduction   R is a widely used open-source statistical software platform that lets analysts carry out complex statistical analyses with limited computing knowledge.  These analyses frequently involve data sets that are far too large to fit in local memory.  We assume that each analyst understands R but has only a limited understanding of Hadoop.
Perspectives   The R and Hadoop Integrated Programming Environment (RHIPE) is an R package for computing across massive data sets, creating subsets, applying routines to subsets, and producing displays on subsets across a cluster of computers, using the Hadoop DFS and the Hadoop MapReduce framework. All of this is done from within the R environment, using standard R programming idioms.  Integrating these methods can raise analytical productivity and extend what companies are able to analyze.
Approach   The native language of Hadoop is Java, which is not well suited to the rapid development that a data analysis environment needs. Hadoop Streaming bridges this gap: users can write MapReduce programs in other languages, e.g. Python, Ruby, or Perl, which are then deployed over the cluster, and Hadoop Streaming transfers the input data from Hadoop to the user program and the results back.  However, data analysis from R does not involve the user writing code to be deployed from the command line. The analyst has massive data sitting in the background and needs to create data, partition the data, and compute summaries or displays. These steps need to be evaluated from the R environment, with the results returned to R, ideally without resorting to the command line.
Solution --- RHIPE   RHIPE consists of several functions to interact with the HDFS, e.g. to save data sets, read data created by RHIPE MapReduce, and delete files.  Compose and launch MapReduce jobs from R using the commands rhmr and rhex. Monitor their status using rhstatus, which returns an R object. Stop jobs using rhkill.  Compute side-effect files: the output of parallel computations may include PDF files, R data sets, CSV files, etc. RHIPE copies these to a central location on the HDFS, so the user does not need to copy them from the compute nodes or set up a network file system.
Solution --- RHIPE   Data sets created by RHIPE can be read from other languages such as Java, Perl, Python, and C. The serialization format used by RHIPE (converting R objects to binary data) is Google's Protocol Buffers, which is very fast and creates compact representations of R objects, making it well suited to massive data sets.  Data sets created using RHIPE are key-value pairs: a key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If a RHIPE job produces unique keys, its output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys), disk-based dictionary, which is useful for loading R objects into R.
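The dictionary idea can be sketched in plain R, without Hadoop. The snippet below assumes key-value pairs held as a list of two-element lists, mirroring how results are indexed with [[1]] and [[2]] in Example 2; the keys and values themselves are made up for illustration.

```r
# Key-value pairs as a list of list(key, value) elements.
# The keys and values are invented for this sketch.
pairs <- list(
  list("a", c(1, 2, 3)),
  list("b", c(10, 20)),
  list("c", 42)
)

# Pull out the keys and build a named list for lookup by key.
keys <- vapply(pairs, function(p) p[[1]], character(1))
dict <- setNames(lapply(pairs, function(p) p[[2]]), keys)

# Lookup only behaves like a dictionary when the keys are unique.
stopifnot(anyDuplicated(keys) == 0)

# Retrieve the value stored under key "b".
value_b <- dict[["b"]]
```

With duplicate keys the named-list lookup would silently return only the first match, which is why the uniqueness check comes before any lookup.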
RHIPE FUNCTIONS
  rhget - Copying from the HDFS
  rhput - Copying to the HDFS
  rhwrite - Writing R data to the HDFS
  rhread - Reading data from the HDFS into R
  rhgetkeys - Reading Values from Map Files
PACKAGING A JOB FOR MAPREDUCE
  rhex - Submitting a MapReduce R Object to Hadoop
  rhmr - Creating the MapReduce Object
Functions to Communicate with Hadoop during MapReduce
  rhcollect - Writing Data to Hadoop MapReduce
  rhstatus - Updating the Status of the Job during Runtime
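To make the map, shuffle, and reduce stages concrete, here is a single-process sketch in base R. It does not use RHIPE or Hadoop: run_map, shuffle, run_reduce, and emit are hypothetical helpers written for this illustration, with emit playing a role loosely analogous to rhcollect.

```r
# Local, single-process sketch of map -> shuffle -> reduce.
# All helper names here are hypothetical; none are RHIPE functions.

run_map <- function(values, map_fn) {
  out <- list()
  # emit() appends one key-value pair to the collected output.
  emit <- function(key, value) {
    out[[length(out) + 1]] <<- list(key = key, value = value)
  }
  for (v in values) map_fn(v, emit)
  out
}

shuffle <- function(pairs) {
  # Group the emitted values by key.
  keys <- vapply(pairs, function(p) as.character(p$key), character(1))
  values <- lapply(pairs, function(p) p$value)
  lapply(split(seq_along(pairs), keys), function(idx) values[idx])
}

run_reduce <- function(groups, reduce_fn) {
  # Apply the reducer to each key's list of values.
  lapply(groups, reduce_fn)
}

# Example: count words across three input lines.
lines <- c("a b a", "b c", "a")
mapped <- run_map(lines, function(line, emit) {
  for (word in strsplit(line, " ", fixed = TRUE)[[1]]) emit(word, 1)
})
counts <- run_reduce(shuffle(mapped), function(vals) sum(unlist(vals)))
```

On a real cluster the map calls run in parallel on different nodes and the grouping by key is done by Hadoop; the sketch only shows the order of the stages and what each one receives.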
Setup
  Use Eucalyptus to create the Hadoop cluster. The cluster has one master node and one slave node.
  The Hadoop version compatible with RHIPE is 0.20.2.
  Install Google protobuf for serialization.
  Install R:
    ./configure --enable-R-shlib
    make
    make check
    make install
  Install Rhipe as an add-on package.
  Create an image on Eucalyptus to save effort in later setups.
Example 1
How to make your text file with random numbers

    ## start R, then load RHIPE
    library(Rhipe)

    make.numbers <- function(N, dest, cols=5, factor=1, local=FALSE){
      ## factor: if equal to 1, then exactly N rows, otherwise N*factor rows
      ## cols: how many columns per row
      ## bquote() substitutes the values of cols and factor into the map expression
      map <- as.expression(bquote({
        COLS <- .(COLS)
        F <- .(F)
        lapply(map.values, function(r){
          for(i in 1:F){
            f <- runif(COLS)
            rhcollect(NULL, f)
          }
        })
      }, list(COLS=cols, F=factor)))
      mapred <- list()
      if (local) mapred$mapred.job.tracker <- 'local'
      mapred[['mapred.field.separator']] <- " "
      mapred[['mapred.textoutputformat.usekey']] <- FALSE
      mapred$mapred.reduce.tasks <- 0
      z <- rhmr(map=map, N=N, ofolder=dest, inout=c("lapp","text"), mapred=mapred)
      rhex(z)
    }

    make.numbers(N=1000, "/tmp/somenumbers", cols=10)
    ## read them in (don't if N is too large!)
    f <- rhread("/tmp/somenumbers/", type="text")
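For a feel of the kind of text file this example aims to produce, the following base R sketch writes a small file locally: one row per line, with space-separated uniform random numbers. It is only a stand-in under assumed settings (5 rows, 10 columns, a temporary file) and does not reproduce RHIPE's exact output.

```r
# Local stand-in for the generated data: N rows of `cols` uniform numbers,
# separated by single spaces. Sizes and the temp path are illustrative.
set.seed(1)
N <- 5
cols <- 10
path <- tempfile(fileext = ".txt")

m <- matrix(runif(N * cols), nrow = N, ncol = cols)
write.table(m, path, sep = " ", row.names = FALSE, col.names = FALSE)

# Read the file back as raw lines, one per row.
lines_in <- readLines(path)
fields_first <- strsplit(lines_in[1], " ", fixed = TRUE)[[1]]
```

A small local file like this is handy for checking parsing code before pointing it at the full data set on the HDFS.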
Example 2
How to compute the mean
Mapper

    ## To compute the mean and sd of each column (setting numerical accuracy
    ## aside), we need the sums and sums of squares of the K columns.
    map <- expression({
      ## K is the number of columns
      ## the number of rows is the length of map.values
      ## map.values is a list of lines
      ## this approach is okay if you want /all/ the columns
      K <- 10
      l <- length(map.values)
      all.lines <- as.numeric(unlist(strsplit(unlist(map.values), "[[:space:]]+")))
      ## values arrive line by line, so fill the l x K matrix by row
      all.lines <- matrix(all.lines, nrow=l, ncol=K, byrow=TRUE)
      sums <- apply(all.lines, 2, sum)                   ## by columns
      sqs <- apply(all.lines, 2, function(r) sum(r^2))   ## by columns
      ## emit (n, sum, sum of squares) keyed by column number
      sapply(1:K, function(r) rhcollect(r, c(l, sums[r], sqs[r])))
    })
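The mapper's arithmetic can be checked locally on a few hand-written lines. In this sketch map.values is an ordinary list built by hand (not supplied by RHIPE), K is set to 3 for brevity, and the parsed values are laid out by row so that each matrix row corresponds to one input line.

```r
# Local check of the per-column sums and sums of squares.
# `map.values` is built by hand here; K = 3 is chosen for brevity.
K <- 3
map.values <- list("1 2 3", "4 5 6", "7 8 9")
l <- length(map.values)

# Split every line on whitespace and convert to numbers.
vals <- as.numeric(unlist(strsplit(unlist(map.values), "[[:space:]]+")))

# Values arrive line by line, so fill the l x K matrix by row.
m <- matrix(vals, nrow = l, ncol = K, byrow = TRUE)

sums <- colSums(m)    # per-column sums
sqs <- colSums(m^2)   # per-column sums of squares

# One (n, sum, sum of squares) triple per column, as the mapper would emit.
partials <- lapply(seq_len(K), function(j) c(l, sums[j], sqs[j]))
```

Filling by column instead would mix values from different columns into the same sum, which is easy to miss when the test data are random numbers.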
Example 2
How to compute the mean
Reducer

    reduce <- expression(
      pre = { totals <- c(0,0,0) },
      reduce = { totals <- totals + apply(do.call('rbind', reduce.values), 2, sum) },
      post = { rhcollect(reduce.key, totals) }
    )
    ## the mapred bit is optional, but if you have K columns, why run more reducers?
    K <- 10   ## number of columns, matching the mapper
    mr <- list(mapred.reduce.tasks=K)
    y <- rhmr(map=map, reduce=reduce, combiner=TRUE, inout=c("text","sequence"),
              ifolder="/tmp/somenumbers", ofolder="/tmp/means", mapred=mr)
    w <- rhex(y, async=TRUE)
    z <- rhstatus(w, mon.sec=5)
    results <- if(z$state=="SUCCEEDED") rhread("/tmp/means") else NULL
    if(!is.null(results)){
      ## one row per column: column number, n, sum, sum of squares
      results <- cbind(unlist(lapply(results,"[[",1)),
                       do.call('rbind', lapply(results,"[[",2)))
      colnames(results) <- c("col.num","n","sum","ssq")
    }
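The last step, turning n, sum, and ssq into a mean and standard deviation, can be sketched as follows. The results matrix here is built by hand from two small made-up vectors so the snippet runs without a cluster; only the column names are taken from the example.

```r
# Illustrative totals for two columns; the data are invented for the sketch.
x1 <- c(1, 2, 3, 4)
x2 <- c(2, 4, 6, 8)
results <- rbind(
  c(1, length(x1), sum(x1), sum(x1^2)),
  c(2, length(x2), sum(x2), sum(x2^2))
)
colnames(results) <- c("col.num", "n", "sum", "ssq")

n <- results[, "n"]
col.mean <- results[, "sum"] / n

# Sample variance from the sum of squares. This textbook formula can lose
# precision when the mean is large relative to the spread.
col.var <- (results[, "ssq"] - n * col.mean^2) / (n - 1)

# Guard against tiny negative values caused by rounding.
col.sd <- sqrt(pmax(col.var, 0))
```

This is the "forget about numerical accuracy" route mentioned in the mapper comments; a more stable alternative would combine per-chunk means and centered sums of squares, at the cost of a slightly more involved reducer.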
Conclusion   In summary, the objective of RHIPE is to let the user focus on thinking about the data. The difficulties in distributing computations and storing data across a cluster are automatically handled by RHIPE and Hadoop.