Big Data with R and Hadoop
Jamie F Olson
June 11, 2015


R and Hadoop
Review various tools for leveraging Hadoop from R:
- MapReduce
- Spark
- Hive/Impala
- Revolution R

Scaling R to Big Data
R has scalability issues:
- Performance
- Memory

Scaling R to Big Data
R has scalability issues:
- Performance?
- Memory?

R Performance Limits
R performance bottlenecks are largely gone:
- Memory model tweaks
- Just-in-time compiler
- Highly performant data manipulation tools (e.g. dplyr, data.table)

R Memory Limits
Two choices for dealing with memory issues:
- Native R solutions: ff, bigmemory
- Leverage external tools: e.g. Hadoop, RDBMS

Value of Leaving R
- Purpose-built
- Highly engineered
- Better scalability

Cost of Context-Switching
External tools rarely share R’s core concepts and features:
- Vectorization
- Functional programming
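As a reminder of what the two R idioms named on this slide look like in practice, here is a minimal base-R sketch. The variable and function names (`weights`, `square`, `apply_twice`) are illustrative assumptions and do not come from the talk.

```r
# Vectorization: arithmetic applies to a whole vector at once, no explicit loop.
weights <- c(1.5, 2.0, 3.25)
squared <- weights * weights

# Functional programming: functions are ordinary values that can be passed around.
square <- function(v) v * v
apply_twice <- function(f, v) f(f(v))
fourth_power <- apply_twice(square, weights)

# Higher-order helpers such as Reduce ship with base R.
total <- Reduce(`+`, squared)
print(total)
```

An external tool that works record by record, or that cannot accept an R function as an argument, gives up both conveniences.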

Choosing External Tools
Does the value of the tool justify the increased development/conceptual cost?

Outline
1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

The Original Hadoop
- Map
- Reduce

Map
Figure: Apply the same computation to all data

Reduce
Figure: Group and Reduce data
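The two figures can be imitated locally in base R. This is only a sketch of the idea on a toy data frame (the `records` data and the choice of squaring and summing are illustrative assumptions); it involves no Hadoop and says nothing about how Hadoop distributes the work.

```r
# Toy input: six records, each with a key and a value (illustrative data).
records <- data.frame(key = c("a", "b", "a", "c", "b", "a"),
                      value = c(1, 2, 3, 4, 5, 6),
                      stringsAsFactors = FALSE)

# Map: apply the same computation to every record (here, square the value).
mapped <- transform(records, value = value * value)

# Group: collect the mapped values that share a key.
grouped <- split(mapped$value, mapped$key)

# Reduce: collapse each group to a single result (here, a sum).
reduced <- vapply(grouped, sum, numeric(1))
print(reduced)
```

In a real job the map step runs where the data lives and the grouping happens across the cluster, which is where the overhead discussed in the following slides comes from.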

Why MapReduce?
- Data localization
- Simple to understand
- But extremely flexible
- Extreme scalability

Why not MapReduce?
- Large overhead
- Limited support for complex workflows

rmr2
Really, Why MapReduce?

Seamlessly Integrated with R
First-class support for R types:
- Atomic vectors (including factor and NA)
- Does what you want with data.frame, matrix, array
- Works with any R values
Recreates your local session in Hadoop:
- Local and global variables
- Packages

Hello rmr

    x <- to.dfs(1:100)
    y <- values(from.dfs(mapreduce(x,
                                   map = function(., x) keyval(x, x * x))))
    head(y)
    ## [1]  1  4  9 16 25 36

Fancy rmr

    mpg_model <- lm(mpg ~ wt, data = mtcars)
    new_weights <- to.dfs(seq(1, 5, by = .01))
    # As on the slide, the job runs on x (1:100 from "Hello rmr"), not on
    # new_weights, so the output below is for weights 1 to 6.
    new_mpg <- values(from.dfs(mapreduce(x,
                                         map = function(., wt) keyval(wt, predict(mpg_model, newdata = data.frame(wt = wt))))))
    head(new_mpg)
    ##         1         2         3         4         5         6
    ## 31.940655 26.596183 21.251711 15.907240 10.562768  5.218297

R-Friendly Data Import
- Parse text with read.table
- Read JSON with RJSONIO
- Load Avro record data into data.frame

Directly Control Job Configuration

    mapreduce(..., backend.parameters = list(...))

- Reduce tasks
- Memory/CPU resources
- JVM parameters
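As a rough sketch of what such a list might contain, the example below builds one in plain R. It assumes the shape described in the rmr2 help page for `mapreduce`, namely a `hadoop` element holding one `D` entry per Hadoop property. The property names are standard Hadoop 2 settings, but the right names and values depend on your cluster, and `job_config`, `input`, and `my_map` are hypothetical names.

```r
# Assumed shape: a "hadoop" element with one "D" entry per -D property.
# Property names and values are examples only; check them against your cluster.
job_config <- list(
  hadoop = list(
    D = "mapreduce.job.reduces=10",          # number of reduce tasks
    D = "mapreduce.map.memory.mb=2048",      # memory per map task
    D = "mapreduce.map.java.opts=-Xmx1536m"  # JVM options for map tasks
  )
)

# With rmr2 loaded and a Hadoop backend configured, the list would be passed as:
# mapreduce(input, map = my_map, backend.parameters = job_config)
str(job_config)
```

The `mapreduce` call is left commented out so the sketch runs without a Hadoop installation.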

Write Results to HDFS
MapReduce/rmr read from HDFS and write to HDFS, making it easy to integrate scripts with the rest of your Hadoop workflows.

Misc Awesomeness
- Great documentation on the wiki, with a tutorial and topics on performance and data formats
- Highly optimized typedbytes serialization written in C
- Installation only requires defining environment variables

Caveats
- Everything is batch.
- Data issues can be difficult to track down.
- The API is great, but MapReduce can be limiting.

Try It Out

    library(devtools)  # provides install_github
    install_github("RevolutionAnalytics/rmr2", subdir = "pkg")
    library(rmr2)
    rmr.options(backend = "local")

Outline
1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

Hadoop 2.0
- Standalone compute engine ported to YARN
- Hybrid memory model keeps more data in RAM
- “Lazy” evaluation allows efficient and complex workflows
- API with more than just Map and Reduce

Why Spark?
- It runs faster (on the same workflow)
- You can develop faster
- Iterative algorithms are feasible

Why not SparkR?
Version: 0.1

Writing for Spark not for R
Spark uses key-value tuples, so does SparkR:

    tuple <- list(list("key1", "value1"), list("key2", "value2"))

This is an awkward value in R.
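To make the awkwardness concrete, the base-R sketch below pulls the keys and values back out of the slide's `tuple` list and contrasts that with holding the same data column-wise. The names `keys_from_tuples`, `values_from_tuples`, and `columns` are illustrative assumptions.

```r
# A list of key-value tuples, as on the slide.
tuple <- list(list("key1", "value1"), list("key2", "value2"))

# Getting all keys (or all values) means visiting every tuple individually.
keys_from_tuples <- vapply(tuple, function(kv) kv[[1]], character(1))
values_from_tuples <- vapply(tuple, function(kv) kv[[2]], character(1))

# The same data held column-wise: each field is a single atomic vector,
# so ordinary vectorized functions apply directly.
columns <- data.frame(key = keys_from_tuples,
                      value = values_from_tuples,
                      stringsAsFactors = FALSE)
print(toupper(columns$value))
```

R's built-in tools expect the second layout, which is why per-record tuples need an extra unpacking step before anything vectorized can run.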

Wordcount: rmr

    mapreduce(input_txt, map = function(., txt) {
      words <- unlist(strsplit(txt, " "))
      keyval(words, 1)
    }, reduce = function(word, ones) {
      keyval(word, sum(ones))
    })

Wordcount: SparkR

    words <- flatMap(lines, function(line) {
      strsplit(line, " ")[[1]]
    })
    wordCount <- map(words, function(word) list(word, 1L))
    counts <- reduceByKey(wordCount, "+", 2L)
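Both wordcount versions follow the same three steps: split lines into words, pair each word with a 1, and sum per word. The sketch below walks through those steps locally in base R on a made-up `input_lines` vector. It uses neither rmr2 nor SparkR and is only meant to show what each step produces.

```r
# Illustrative input; a real job would read these lines from HDFS.
input_lines <- c("the quick brown fox", "the lazy dog", "the fox")

# Split step: break every line into words and flatten into one vector.
words <- unlist(strsplit(input_lines, " ", fixed = TRUE))

# Pairing step: one count of 1 per word, kept as a parallel vector.
ones <- rep(1L, length(words))

# Reduce-by-key step: sum the ones within each distinct word.
counts <- tapply(ones, words, sum)
print(counts[c("the", "fox")])
```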

SparkR Tuples
This is awkward and slow to do in R:

    wordCount <- map(words, function(word) list(word, 1L))

Compared with a vectorized version implemented through an API like keyval:

    ## NOT VALID SPARKR
    wordCount <- map(words, function(wordsVec) keyval(wordsVec, 1L))

Limited API for Data Formats
- All data starts as a text file.
- You receive it as a character vector.
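When the input arrives as a character vector of text lines, the parsing has to happen in your own R code. One way to do that, sketched here in base R, is the `text` argument of `read.table`. The sample lines, the comma separator, and the column names are illustrative assumptions about what the file might contain.

```r
# Illustrative input: a batch of comma-separated text lines.
raw_lines <- c("Mazda RX4,21.0,2.620",
               "Datsun 710,22.8,2.320",
               "Valiant,18.1,3.460")

# read.table accepts a character vector through its `text` argument.
parsed <- read.table(text = raw_lines, sep = ",",
                     col.names = c("model", "mpg", "wt"),
                     stringsAsFactors = FALSE)
str(parsed)
```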

Installation Troubles
Requires compilation specific to your:
- Hadoop version
- Spark version
- YARN vs no-YARN
Additional build tools are required:
- Scala
- Maven

Return Results to R
- Enables exploratory data analysis and ad-hoc analytics
- Cannot return output to HDFS for integration with other tools
