SLIDE 1

Big Data with R and Hadoop

Jamie F Olson

June 11, 2015

SLIDE 2

R and Hadoop

Review various tools for leveraging Hadoop from R:
- MapReduce
- Spark
- Hive/Impala
- Revolution R

SLIDE 3

Scaling R to Big Data

R has scalability issues:
- Performance
- Memory

SLIDE 4

Scaling R to Big Data

R has scalability issues:
- Performance?
- Memory?

SLIDE 5

R Performance Limits

R performance bottlenecks are largely gone:
- Memory model tweaks
- Just-in-time compiler
- Highly performant data manipulation tools (e.g. dplyr, data.table)

SLIDE 6

R Memory Limits

Two choices for dealing with memory issues:
- Native R solutions: ff, bigmemory
- Leverage external tools: e.g. Hadoop, RDBMS

SLIDE 7

Value of Leaving R

- Purpose-built
- Highly engineered
- Better scalability

SLIDE 8

Cost of Context-Switching

External tools rarely share R's core concepts and features:
- Vectorization
- Functional programming
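As a reminder of what these two concepts look like inside R, here is a minimal base-R sketch. The variable and function names are illustrative only and do not come from the slides.

```r
# Vectorization: arithmetic applies to every element without an explicit loop.
x <- c(1, 2, 3, 4, 5)
squares <- x * x

# Functional programming: functions are ordinary values that can be passed
# to other functions such as vapply().
square <- function(v) v * v
squares_fp <- vapply(x, square, numeric(1))

# Both approaches give the same result.
identical(squares, squares_fp)
```

An external tool that lacks either idea forces the same logic to be rewritten in that tool's own terms, which is the context-switching cost this slide refers to.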

SLIDE 9

Choosing External Tools

Does the value of the tool justify the increased development/conceptual cost?

SLIDE 10

Outline

1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

SLIDE 11

The Original Hadoop

Map Reduce

SLIDE 12

Map

Figure: Apply the same computation to all data

SLIDE 13

Reduce

Figure: Group and Reduce data
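The two steps can be imitated on a single machine with base R. This is only a sketch of the idea, not how Hadoop executes a job; the input lines and variable names are made up for illustration.

```r
# Hypothetical input: one record per line of text.
lines <- c("a b a", "b c", "a")

# Map: apply the same function to every record, emitting (key, value) pairs.
mapped <- lapply(lines, function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
})
pairs <- do.call(rbind, mapped)

# Reduce: group the pairs by key and combine the values within each group.
counts <- tapply(pairs$value, pairs$key, sum)
```

In a real cluster the map calls run where the data lives, and the grouping step moves each key's values to one reducer.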

SLIDE 14

Why MapReduce?

- Data localization
- Simple to understand
- But extremely flexible
- Extreme scalability

SLIDE 15

Why not MapReduce?

- Large overhead
- Limited support for complex workflows

SLIDE 16

Really, Why MapReduce?

rmr2

SLIDE 17

Seamlessly Integrated with R

First-class support for R types:
- Atomic vectors (including factor and NA)
- Does what you want with data.frame, matrix, array
- Works with any R values

Recreates your local session in Hadoop:
- Local and global variables
- Packages

SLIDE 18

Hello rmr

    x <- to.dfs(1:100)
    y <- values(from.dfs(mapreduce(x, map = function(., x) keyval(x, x * x))))
    head(y)
    ## [1]  1  4  9 16 25 36

SLIDE 19

Fancy rmr

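    mpg_model <- lm(mpg ~ wt, data = mtcars)
    new_weights <- to.dfs(seq(1, 5, by = .01))
    # Pass new_weights (not the x from the previous slide) as the job input.
    new_mpg <- values(from.dfs(mapreduce(new_weights,
      map = function(., wt)
        keyval(wt, predict(mpg_model, newdata = data.frame(wt = wt))))))
    head(new_mpg)
    ## Output as printed on the slide. The slide's code passed x (1:100),
    ## so these values correspond to wt = 1, 2, ..., 6:
    ##         1         2         3         4         5         6
    ## 31.940655 26.596183 21.251711 15.907240 10.562768  5.218297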

SLIDE 20

R-Friendly Data Import

- Parse text with read.table
- Read JSON with RJSONIO
- Load Avro record data into data.frame

SLIDE 21

Directly Control Job Configuration

    mapreduce(..., backend.parameters = list(...))

- Reduce tasks
- Memory/CPU resources
- JVM parameters
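A sketch of what such a parameter list might look like. It assumes the form shown in the rmr2 help page for mapreduce(), where generic Hadoop options are passed as repeated `D = "property=value"` entries under a `hadoop` element. The property names are Hadoop 1.x-style examples chosen for illustration and vary by Hadoop version, so check them against your cluster.

```r
# Assumed structure: a `hadoop` element holding repeated D = "property=value"
# entries. The specific properties below are illustrative, not from the slides.
job_params <- list(
  hadoop = list(
    D = "mapred.reduce.tasks=10",          # number of reduce tasks
    D = "mapred.child.java.opts=-Xmx2g"    # JVM options for each task
  )
)

# With rmr2 loaded and a Hadoop backend configured, the list would be passed as:
# mapreduce(input, map = my_map, backend.parameters = job_params)
# (`input` and `my_map` are hypothetical placeholders.)
```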

SLIDE 22

Write Results to HDFS

MapReduce and rmr both read from HDFS and write to HDFS, which makes it easy to integrate R scripts with the rest of your Hadoop workflows.

SLIDE 23

Misc Awesomeness

- Great documentation on the wiki, with a tutorial and topics on performance and data formats
- Highly optimized typedbytes serialization written in C
- Installation only requires defining environment variables

SLIDE 24

Caveats

- Everything is batch.
- Data issues can be difficult to track down.
- The API is great, but MapReduce can be limiting.

SLIDE 25

Try It Out

    library(devtools)   # provides install_github()
    install_github("RevolutionAnalytics/rmr2", subdir = "pkg")
    library(rmr2)
    rmr.options(backend = "local")   # run jobs locally, without a cluster

SLIDE 26

Outline

1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

SLIDE 27

Hadoop 2.0

- Standalone compute engine ported to YARN
- Hybrid memory model keeps more data in RAM
- "Lazy" evaluation allows efficient and complex workflows
- API with more than just Map and Reduce

SLIDE 28

Why Spark?

- It runs faster (on the same workflow)
- You can develop faster
- Iterative algorithms are feasible

SLIDE 29

Why not SparkR?

Version: 0.1

SLIDE 30

Writing for Spark not for R

Spark uses key-value tuples, and so does SparkR:

    tuple <- list(list("key1", "value1"), list("key2", "value2"))

This is an awkward value to work with in R.

SLIDE 31

Wordcount: rmr

    mapreduce(input_txt,
      map = function(., txt) {
        words <- unlist(strsplit(txt, " "))
        keyval(words, 1)
      },
      reduce = function(word, ones) {
        keyval(word, sum(ones))
      })

SLIDE 32

Wordcount: SparkR

    words <- flatMap(lines, function(line) {
      strsplit(line, " ")[[1]]
    })
    wordCount <- map(words, function(word) list(word, 1L))
    counts <- reduceByKey(wordCount, "+", 2L)

SLIDE 33

SparkR Tuples

This is awkward and slow to do in R:

    wordCount <- map(words, function(word) list(word, 1L))

Compare it with a vectorized version implemented through an API like keyval:

    ## NOT VALID SPARKR
    wordCount <- map(words, function(wordsVec) keyval(wordsVec, 1L))
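The difference between the two styles can be seen in plain base R, without Spark. The sketch below is an illustration only; the sample words and variable names are invented.

```r
words <- c("spark", "r", "spark", "hadoop")

# Tuple style: one two-element list per word, as in the SparkR example.
tuples <- lapply(words, function(word) list(word, 1L))

# Getting the keys back out needs another pass over every tuple.
tuple_keys <- vapply(tuples, function(tup) tup[[1]], character(1))

# Vectorized style: keys and values live in two parallel vectors,
# so grouping and summing work on them directly.
keys <- words
values <- rep(1L, length(words))
counts <- tapply(values, keys, sum)
```

The tuple style creates one small R list per record, while the vectorized style keeps a few long vectors, which is the form R's built-in functions are designed to handle.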

SLIDE 34

Limited API for Data Formats

- All data starts as a text file.
- You receive it as a character vector.

SLIDE 35

Installation Troubles

Requires compilation specific to your:
- Hadoop version
- Spark version
- YARN vs no-YARN

Additional build tools are required:
- Scala
- Maven

SLIDE 36

Return Results to R

- Enables exploratory data analysis and ad-hoc analytics
- Cannot return output to HDFS for integration with other tools

SLIDE 37

Why SparkR, Again?

It’s the future. The API just needs some work.

SLIDE 38

Try It Out

    library(devtools)   # provides install_github()
    install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
    library(SparkR)
    sc <- sparkR.init(master = "local")   # local Spark context, no cluster

SLIDE 39

Outline

1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

SLIDE 40

Can you (S/H)QL?

- Integrate with existing data lake
- Leverage existing SQL skills

SLIDE 41

Connecting from R

Connection options:
- (R)ODBC*
- (R)JDBC

*Available from Hadoop distributors.

SLIDE 42

Caveats

- Apache Sentry
  - Kerberos can be tricky
- rJava
  - Java version must match Hadoop's
- Many driver changes in the past few years

SLIDE 43

Thoughts

Danger Zone:
- Dynamic SQL in R is not pretty
- Hive QL has a strong "flavor"

Recommend:
- Data reshaping for inputs
- ETL to recombine R outputs

SLIDE 44

RJDBC Examples

    # `conn` is assumed to be an open RJDBC connection to Hive/Impala.
    input_df <- dbGetQuery(conn,
      paste0("SELECT * FROM ", command_line_arg))

    # Do Something

    hdfs.put(output_df, output_hdfs_path)
    dbSendQuery(conn, paste0(
      "CREATE EXTERNAL TABLE r_output ",
      "(...) ",
      "LOCATION '", output_hdfs_path, "'"))
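Because the table name above comes straight from a command-line argument, one possible safeguard is to validate it before pasting it into the query. The helper below is a hypothetical sketch in base R, not part of RJDBC or of the original slides; the accepted pattern (a plain identifier, optionally qualified as db.table) is an assumption that may need loosening for your naming rules.

```r
# Hypothetical helper: accept only a plain identifier, optionally db.table,
# before building the query string.
build_select <- function(table_name) {
  pattern <- "^[A-Za-z_][A-Za-z0-9_]*(\\.[A-Za-z_][A-Za-z0-9_]*)?$"
  if (length(table_name) != 1 || !grepl(pattern, table_name)) {
    stop("invalid table name: ", table_name)
  }
  sprintf("SELECT * FROM %s", table_name)
}

query <- build_select("sales.orders")
```

The resulting string can then be handed to dbGetQuery() in place of the bare paste0() call.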

SLIDE 45

Outline

1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

SLIDE 46

Write Once Deploy Anywhere

- Efficient linear-scaling algorithms for big data
- Cross-platform support for distributed computing
  - Hadoop
  - In-Database
  - Platform LSF

SLIDE 47

Modeling not Distributed Computing

Focus on modeling:
- Generalized Linear Models
- Tree-based models
- Clustering

And data transformation

SLIDE 48

ScaleR on Hadoop

"Inside" architecture with R in the cluster:
- Maximum scalability
- Currently uses MapReduce

"Beside" architecture with R on the edge node:
- Efficient binary format optimizes IO
- Medium-large data (< 1 TB) can be faster than in-Hadoop
- Connect to Hive/Impala via ODBC (with unixODBC)

SLIDE 49

Outline

1. MapReduce
2. Spark
3. Hadoop Databases
4. Revolution R ScaleR
5. Concluding

SLIDE 50

It Depends

- rmr2 is mature and integrated
- Spark is better, but SparkR is immature
- There's always a role for SQL

SLIDE 51

Questions?

SLIDE 52

Thank you

Revolution Analytics is the leading commercial provider of software and support for the popular open source R statistics language.

www.revolutionanalytics.com
1.855.GET.REVO
Twitter: @RevolutionR