Zihang Yin Introduction R is commonly used as an open share - PowerPoint PPT Presentation

Zihang Yin

Introduction   R is widely used as an open-source statistical software platform that lets analysts carry out complex statistical analyses with limited computing knowledge.  Frequently these analyses involve data sets that are far too large to fit in local memory.  Our assumption is that each analyst understands R but has only a limited understanding of Hadoop.

Perspectives   The R and Hadoop Integrated Programming Environment (RHIPE) is an R package for computing across massive data sets: creating subsets, applying routines to subsets, and producing displays on subsets across a cluster of computers, using the Hadoop DFS and the Hadoop MapReduce framework. This is accomplished from within the R environment, using standard R programming idioms.  Integrating these methods will drive greater analytical productivity and extend the capabilities of companies.

Approach   The native language of Hadoop is Java, which is not well suited to the rapid development that a data analysis environment needs. Hadoop Streaming bridges this gap: users can write MapReduce programs in other languages (e.g. Python, Ruby, Perl), which are then deployed over the cluster, and Hadoop Streaming transfers the input data from Hadoop to the user program and the output back.  However, data analysis from R does not involve the user writing code to be deployed from the command line. The analyst has massive data sitting in the background; she needs to create data, partition the data, and compute summaries or displays. These steps need to be evaluated from the R environment, with the results returned to R, ideally without resorting to the command line.
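To make the Hadoop Streaming contract concrete, here is a minimal sketch of a word-count mapper written in R. Streaming hands the program one input record per line on standard input and expects "key<TAB>value" lines on standard output. The function name stream_map and the word-count task are illustrative assumptions, not part of the presentation.

```r
# Hypothetical word-count mapper following the Hadoop Streaming contract:
# one input record per line in, one "key<TAB>value" line out per word.
stream_map <- function(lines) {
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]   # drop empty tokens caused by leading spaces
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Local dry run. On a cluster the script would instead end with something like
#   cat(stream_map(readLines(file("stdin"))), sep = "\n")
out <- stream_map(c("map reduce map", "  hadoop"))
print(out)
```

This is the command-line style of work that the slide argues an R analyst should not have to adopt; RHIPE keeps the same map logic inside an R expression instead.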

Solution --- RHIPE   RHIPE consists of several functions to:
  Interact with the HDFS, e.g. save data sets, read data created by RHIPE MapReduce jobs, and delete files.
  Compose and launch MapReduce jobs from R using the commands rhmr and rhex. Monitor the status using rhstatus, which returns an R object. Stop jobs using rhkill.
  Compute side-effect files. The output of parallel computations may include PDF files, R data sets, CSV files etc. RHIPE copies these to a central location on the HDFS, removing the need for the user to copy them from the compute nodes or to set up a network file system.

Solution --- RHIPE   Data sets created by RHIPE can be read from other languages such as Java, Perl, Python and C. The serialization format used by RHIPE (converting R objects to binary data) is Google's Protocol Buffers, which is very fast and creates compact representations of R objects. This is ideal for massive data sets.  Data sets created using RHIPE are key-value pairs: a key is mapped to a value. A MapReduce computation iterates over the key-value pairs in parallel. If a RHIPE job produces unique keys, its output can be treated as an external-memory associative dictionary. RHIPE can thus be used as a medium-scale (millions of keys) disk-based dictionary, which is useful for loading R objects into R.
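The dictionary idea can be sketched locally in plain R. The list below is a hand-made stand-in for what reading a RHIPE output folder yields, assuming (as Example 2 later in this deck suggests) that each element is a two-item list with the key first and the value second. The keys and values themselves are made up for illustration.

```r
# Stand-in for the list obtained by reading a RHIPE output folder:
# each element is a two-item list, key first and value second.
kv <- list(
  list("a", c(1, 2, 3)),
  list("b", c(10, 20)),
  list("c", 42)
)

keys   <- vapply(kv, function(p) as.character(p[[1]]), character(1))
values <- lapply(kv, `[[`, 2)

# Unique keys are what make the output usable as a dictionary.
stopifnot(!anyDuplicated(keys))
dict <- setNames(values, keys)

dict[["b"]]
```

This in-memory version only illustrates the lookup semantics; the point of the slide is that RHIPE can serve the same lookups from disk when the data do not fit in memory.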

RHIPE FUNCTIONS
  rhget - Copying from the HDFS
  rhput - Copying to the HDFS
  rhwrite - Writing R data to the HDFS
  rhread - Reading data from HDFS into R
  rhgetkeys - Reading Values from Map Files

PACKAGING A JOB FOR MAPREDUCE
  rhex - Submitting a MapReduce R Object to Hadoop
  rhmr - Creating the MapReduce Object
Functions to Communicate with Hadoop during MapReduce
  rhcollect - Writing Data to Hadoop MapReduce
  rhstatus - Updating the Status of the Job during Runtime

Setup
  1. Use Eucalyptus to create the Hadoop cluster. The cluster has one master node and one slave node.
  2. Install Hadoop. The Hadoop version compatible with RHIPE is 0.20.2.
  3. Install Google protobuf for serialization.
  4. Install R, built as a shared library:
       ./configure --enable-R-shlib
       make
       make check
       make install
  5. Install Rhipe as an add-on package.
  6. Create an image on Eucalyptus to save effort on later setups.

Example 1   How to make your text file with random numbers

    make.numbers <- function(N, dest, cols = 5, factor = 1, local = FALSE) {
      ## factor: if equal to 1, then exactly N rows, otherwise N*factor rows
      ## cols: how many columns per row
      map <- as.expression(bquote({
        COLS <- .(COLS)
        F <- .(F)
        lapply(map.values, function(r) {
          for (i in 1:F) {
            f <- runif(COLS)
            rhcollect(NULL, f)
          }
        })
      }, list(COLS = cols, F = factor)))
      ## (the function body continues on the next slide)

Example 1   How to make your text file with random numbers

      ## (continuation of the make.numbers body from the previous slide)
      library(Rhipe)
      mapred <- list()
      if (local) mapred$mapred.job.tracker <- 'local'
      mapred[['mapred.field.separator']] <- " "
      mapred[['mapred.textoutputformat.usekey']] <- FALSE
      mapred$mapred.reduce.tasks <- 0   ## map-only job, no reducers
      z <- rhmr(map = map, N = N, ofolder = dest, inout = c("lapp", "text"),
                mapred = mapred)
      rhex(z)
    }

    make.numbers(N = 1000, "/tmp/somenumbers", cols = 10)
    ## read them in (don't if N is too large!)
    f <- rhread("/tmp/somenumbers/", type = "text")
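For trying the later steps without a cluster, a file of the same shape can be produced locally. The sketch below is a plain-R stand-in, not RHIPE code: write_numbers_local is a hypothetical helper, and it assumes the job's text output is simply one row per line with space-separated values, as the field separator setting above suggests.

```r
# Local stand-in for make.numbers(): N rows of `cols` uniform random numbers,
# written as space-separated text. write_numbers_local is a hypothetical name.
write_numbers_local <- function(N, dest, cols = 5) {
  m <- matrix(runif(N * cols), nrow = N, ncol = cols)
  write.table(m, file = dest, sep = " ", row.names = FALSE, col.names = FALSE)
  invisible(dest)
}

set.seed(1)
path <- tempfile(fileext = ".txt")
write_numbers_local(N = 1000, dest = path, cols = 10)

lines <- readLines(path)
length(lines)                                     # number of rows written
length(strsplit(lines[1], "[[:space:]]+")[[1]])   # values in the first row
```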

Example 2   How to compute the mean  Mapper

    ## You want to compute the mean and sd (is ro == correlation?). For this (and
    ## let's forget about numerical accuracy), we need the sums and sums of
    ## squares of the K columns. Using those you can compute the mean and sd.
    map <- expression({
      ## K is the number of columns
      ## the number of rows is the length of map.values
      ## map.values is a list of lines
      ## this approach is okay if you want /all/ the columns
      K <- 10
      l <- length(map.values)
      all.lines <- as.numeric(unlist(strsplit(unlist(map.values), "[[:space:]]+")))
      ## the values arrive line by line, so fill the l x K matrix by row
      all.lines <- matrix(all.lines, nrow = l, ncol = K, byrow = TRUE)
      sums <- apply(all.lines, 2, sum)                   ## by columns
      sqs <- apply(all.lines, 2, function(r) sum(r^2))   ## by columns
      sapply(1:K, function(r) rhcollect(r, c(l, sums[r], sqs[r])))
    })
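The mapper's logic can be checked on a few lines without Hadoop. In this sketch, collect is a hypothetical local substitute for rhcollect that simply appends to a list, and the three input lines are made up. Note that the parsed numbers come out row by row, so the matrix has to be filled by row for the column sums to refer to the right columns.

```r
# Local stand-in for rhcollect(): append each key-value pair to a list.
emitted <- list()
collect <- function(key, value) {
  emitted[[length(emitted) + 1]] <<- list(key, value)
  invisible(NULL)
}

# Three input lines with K = 2 columns, as the mapper would see them.
map.values <- list("1 10", "2 20", "3 30")
K <- 2
l <- length(map.values)

tokens <- as.numeric(unlist(strsplit(unlist(map.values), "[[:space:]]+")))
# Tokens arrive row by row, so fill the matrix by row.
rows <- matrix(tokens, nrow = l, ncol = K, byrow = TRUE)

sums <- colSums(rows)
sqs  <- colSums(rows^2)
for (j in seq_len(K)) collect(j, c(l, sums[j], sqs[j]))

emitted[[2]][[2]]   # n, sum and sum of squares for column 2
```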

Example 2   How to compute the mean  Reducer

    reduce <- expression(
      pre = { totals <- c(0, 0, 0) },
      reduce = { totals <- totals + apply(do.call('rbind', reduce.values), 2, sum) },
      post = { rhcollect(reduce.key, totals) }
    )

    ## the mapred bit is optional, but if you have K columns, why run more reducers?
    K <- 10   ## same K as in the mapper; it must also be defined here
    mr <- list(mapred.reduce.tasks = K)
    y <- rhmr(map = map, reduce = reduce, combiner = TRUE,
              inout = c("text", "sequence"),
              ifolder = "/tmp/somenumbers", ofolder = "/tmp/means", mapred = mr)
    w <- rhex(y, async = TRUE)
    z <- rhstatus(w, mon.sec = 5)
    results <- if (z$state == "SUCCEEDED") rhread("/tmp/means") else NULL
    if (!is.null(results)) {
      ## one row per column: key, then the (n, sum, ssq) totals
      results <- cbind(unlist(lapply(results, "[[", 1)),
                       do.call('rbind', lapply(results, "[[", 2)))
      colnames(results) <- c("col.num", "n", "sum", "ssq")
    }
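The slides stop at the per-column totals, so the final step from (n, sum, ssq) to the mean and standard deviation is sketched here in plain R. The two small data vectors are made up to stand in for the job output, and the sample (n - 1) form of the variance is an assumption about which sd is wanted.

```r
# Hypothetical totals in the layout produced above: col.num, n, sum, ssq.
x1 <- c(2, 4, 4, 4, 5, 5, 7, 9)
x2 <- c(1, 2, 3, 4, 5, 6, 7, 8)
results <- rbind(
  c(1, length(x1), sum(x1), sum(x1^2)),
  c(2, length(x2), sum(x2), sum(x2^2))
)
colnames(results) <- c("col.num", "n", "sum", "ssq")

n     <- results[, "n"]
means <- results[, "sum"] / n
# Sample variance from the sum of squares: (ssq - n * mean^2) / (n - 1).
sds   <- sqrt((results[, "ssq"] - n * means^2) / (n - 1))

cbind(col.num = results[, "col.num"], mean = means, sd = sds)
```

As the mapper's comment warns, this sum-of-squares formula ignores numerical accuracy: when the mean is large relative to the spread, the subtraction can lose precision.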

Conclusion   In summary, the objective of RHIPE is to let the user focus on thinking about the data. The difficulties in distributing computations and storing data across a cluster are automatically handled by RHIPE and Hadoop.
