What is Scalable Data Processing? (Scalable Data Processing in R)


SLIDE 1

What is Scalable Data Processing?

SCALABLE DATA PROCESSING IN R

Michael J. Kane and Simon Urbanek

Instructors, DataCamp

SLIDE 2

SCALABLE DATA PROCESSING IN R

In this course ...

- Work with data that is too large for your computer
- Write scalable code
- Import and process data in chunks

SLIDE 3

RAM

All R objects are stored in RAM
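
Because every object lives in RAM, it helps to know roughly how much memory an object occupies. The following is a minimal sketch, not part of the original slides: `ram_gb` is an assumed value that you would replace with your own machine's RAM, and the matrix dimensions are arbitrary.

```r
# Assumed RAM size in gigabytes; replace with your machine's actual value.
ram_gb <- 8

# A numeric (double) matrix needs about 8 bytes per element.
n_rows <- 1e5
n_cols <- 10
estimated_bytes <- n_rows * n_cols * 8

# object.size() reports the real footprint, including a small header.
x <- matrix(0, nrow = n_rows, ncol = n_cols)
actual_bytes <- as.numeric(object.size(x))

# Fraction of the assumed RAM taken up by this one object.
fraction_of_ram <- actual_bytes / (ram_gb * 1024^3)

print(format(object.size(x), units = "MB"))
print(fraction_of_ram)
```

The estimate and the measured size differ only by the object's small header, so multiplying rows, columns, and 8 bytes is usually a good first check before loading numeric data.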

SLIDE 5

How Big Can Variables Be?

"R is not well-suited for working with data larger than 10-20% of a computer's RAM." - The R Installation and Administration Manual

SLIDE 6

Swapping is inefficient

- If the computer runs out of RAM, data is moved to disk
- Since the disk is much slower than RAM, execution time increases

SLIDE 7

Scalable solutions

- Move a subset into RAM
- Process the subset
- Keep the result and discard the subset
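
The three steps above can be sketched in base R. This is an illustrative example under stated assumptions, not code from the course: the file is a small temporary CSV created on the spot, and `chunk_size` is an arbitrary choice.

```r
# Write a small example file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), path, row.names = FALSE)

chunk_size <- 100                  # rows per chunk (arbitrary choice)
con <- file(path, open = "r")
invisible(readLines(con, n = 1))   # skip the header line

total <- 0
n_seen <- 0
repeat {
  lines <- readLines(con, n = chunk_size)  # move a subset into RAM
  if (length(lines) == 0) break            # no more data
  values <- as.numeric(lines)              # process the subset
  total <- total + sum(values)             # keep the result
  n_seen <- n_seen + length(values)
  # `lines` and `values` are overwritten on the next pass (subset discarded)
}
close(con)
unlink(path)

chunked_mean <- total / n_seen
print(chunked_mean)
```

Only one chunk is held in memory at a time, so the same loop works regardless of how long the file is.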

SLIDE 8

Why is my code slow?

- Complexity of calculations
- Carefully consider disk operations to write fast, scalable code
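
To see the cost of a disk operation, one option is to time the same calculation on data already in RAM and on data that must first be read from a file. This is a rough sketch using base R's `system.time()`; the vector length and the temporary file are assumptions chosen for illustration.

```r
# Keep one copy of the data in RAM and save another to a temporary file.
x <- rnorm(1e6)
path <- tempfile(fileext = ".rds")
saveRDS(x, path)

# Time a computation on data that is already in RAM.
in_ram_time <- system.time(in_ram_sum <- sum(x))

# Time the same computation when the data must first be read from disk.
from_disk_time <- system.time(from_disk_sum <- sum(readRDS(path)))

print(in_ram_time["elapsed"])
print(from_disk_time["elapsed"])
unlink(path)
```

Exact timings vary between machines, and the operating system may cache a recently written file, so treat the numbers as indicative only.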

SLIDE 9

Benchmarking Performance

    library(microbenchmark)
    microbenchmark(
      rnorm(100),
      rnorm(10000)
    )

    Unit: microseconds
             expr    min      lq     mean  median      uq     max neval
       rnorm(100)   7.84   8.440   9.5459   8.773   9.355   29.56   100
     rnorm(10000) 679.51 683.706 755.5693 690.876 712.416 2949.03   100

SLIDE 10

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 11

The Bigmemory Project

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 12

bigmemory

bigmemory is used to store, manipulate, and process big matrices that may be larger than a computer's RAM

SLIDE 13

big.matrix

- Create
- Retrieve
- Subset
- Summarize

SLIDE 14

What does "out-of-core" mean?

- R objects are kept in RAM
- When you run out of RAM
  - Things get moved to disk
  - Programs keep running (slowly) or crash
- You are better off moving data to RAM only when the data are needed for processing

SLIDE 15

When to use a big.matrix?

- Matrices that are roughly 20% of the size of RAM or larger
- Dense matrices

SLIDE 16

An Overview of bigmemory

- bigmemory implements the big.matrix data type, which is used to create, store, access, and manipulate matrices stored on the disk
- Data are kept on the disk and moved to RAM implicitly
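
A minimal sketch of what "moved to RAM implicitly" can look like in practice: the matrix lives in a backing file, and indexing a single column brings only that column into RAM. The file names, temporary directory, and matrix size below are assumptions for illustration, not from the course.

```r
library(bigmemory)

# Create a small file-backed big.matrix in a temporary directory.
# The file names are hypothetical.
backing_dir <- tempfile("bigmemory_demo_")
dir.create(backing_dir)
x <- big.matrix(nrow = 1000, ncol = 5, type = "double", init = 0,
                backingfile = "demo.bin",
                backingpath = backing_dir,
                descriptorfile = "demo.desc")

# Fill each column with its own column index.
for (j in seq_len(ncol(x))) {
  x[, j] <- rep(as.numeric(j), nrow(x))
}

# Pull one column at a time into RAM and keep only its sum.
col_sums <- numeric(ncol(x))
for (j in seq_len(ncol(x))) {
  col_sums[j] <- sum(x[, j])
}
print(col_sums)
```

Each `x[, j]` returns an ordinary numeric vector, so at most one column is held as a regular R object at any time.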

SLIDE 17

An Overview of bigmemory

A big.matrix object:

- Only needs to be imported once
- "backing" file
- "descriptor" file

SLIDE 18

An example using bigmemory

    library(bigmemory)

    # Create a new big.matrix object
    x <- big.matrix(nrow = 1, ncol = 3, type = "double",
                    init = 0,
                    backingfile = "hello_big_matrix.bin",
                    descriptorfile = "hello_big_matrix.desc")

SLIDE 19

Backing and descriptor files

- Backing file: binary representation of the matrix on the disk
- Descriptor file: holds metadata, such as number of rows, columns, names, etc.
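
To make the two files concrete, the sketch below creates a file-backed big.matrix in a temporary directory and checks that both files appear on disk. The directory and file names are hypothetical; `backingpath` is used here only to keep the files out of the working directory.

```r
library(bigmemory)

# Temporary directory to hold the two files (hypothetical names).
backing_dir <- tempfile("bigmemory_files_")
dir.create(backing_dir)

x <- big.matrix(nrow = 2, ncol = 3, type = "double", init = 0,
                backingfile = "example.bin",
                backingpath = backing_dir,
                descriptorfile = "example.desc")

backing_path <- file.path(backing_dir, "example.bin")
descriptor_path <- file.path(backing_dir, "example.desc")

# Both files should now exist next to each other.
files_exist <- file.exists(c(backing_path, descriptor_path))
print(files_exist)

# The backing file holds the raw matrix values.
print(file.size(backing_path))
```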

SLIDE 20

An example using bigmemory

    # See what's in it
    x[,]

    0 0 0

    x

    An object of class "big.matrix"
    Slot "address":
    <pointer: 0x108e2a9a0>

SLIDE 21

Similarities with matrices

    # Change the value in the first row and column
    x[1, 1] <- 3

    # Verify the change has been made
    x[,]

    3 0 0
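
A few more matrix-style operations, as a hedged sketch: the example uses a small in-memory big.matrix (no backing file) named `m` for brevity, and summarizes it by first extracting it as an ordinary matrix, which is only sensible when the data fit in RAM.

```r
library(bigmemory)

# Small in-memory big.matrix, used here only for illustration.
m <- big.matrix(nrow = 2, ncol = 3, type = "double", init = 0)

m[1, ] <- c(1, 2, 3)   # assign a whole row
m[2, 2] <- 10          # assign a single element

dims <- dim(m)         # dimensions, as for a regular matrix
second_col <- m[, 2]   # subset a column

# m[, ] extracts the contents as an ordinary R matrix held in RAM.
as_matrix <- m[, ]
col_means <- colMeans(as_matrix)

print(dims)
print(second_col)
print(col_means)
```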

SLIDE 22

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 23

References vs. Copies

SCALABLE DATA PROCESSING IN R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

SLIDE 24

Big matrices and matrices - Similarities

- Subset
- Assign

SLIDE 25

Big matrices and matrices - Differences

big.matrix is stored on the disk

- Persists across R sessions
- Can be shared across R sessions
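
One way to picture persistence and sharing is to attach a second handle to the same backing file through its descriptor file, which is what a later or separate R session would do. This sketch runs in a single session for simplicity; the directory and file names are assumptions.

```r
library(bigmemory)

# Create a file-backed big.matrix (hypothetical file names).
backing_dir <- tempfile("bigmemory_shared_")
dir.create(backing_dir)
x <- big.matrix(nrow = 1, ncol = 3, type = "double", init = 0,
                backingfile = "shared.bin",
                backingpath = backing_dir,
                descriptorfile = "shared.desc")

# A later or separate session would only need this step:
# attach to the existing data via the descriptor file.
y <- attach.big.matrix(file.path(backing_dir, "shared.desc"))

# A change made through one handle is visible through the other.
x[1, 1] <- 3
y_values <- y[, ]
print(y_values)
```

No re-import is needed when attaching, since the descriptor file already records where the data are and how they are laid out.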

SLIDE 26

R usually makes copies during assignment

This creates a copy of a and assigns it to b.

    a <- 42
    b <- a
    a

    42

    b

    42

    a <- 43
    a

    43

    b

    42

SLIDE 27

R usually makes copies during assignment

    a <- 42
    foo <- function(a) {
      a <- 43
      paste("Inside the function a is", a)
    }
    foo(a)

    "Inside the function a is 43"

    paste("Outside the function a is still", a)

    "Outside the function a is still 42"

SLIDE 28

Not all R objects are copied

This function does change the value of a$val in the global environment

    foo <- function(a) {
      a$val <- 43
      paste("Inside the function a is", a$val)
    }
    a <- environment()
    a$val <- 42
    foo(a)

    "Inside the function a is 43"

    paste("Outside the function a$val is", a$val)

    "Outside the function a$val is 43"
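
The contrast between reference and copy behaviour can also be seen side by side. In this illustrative sketch (the names `counter`, `increment`, and `increment_list` are hypothetical), the same update is applied to an environment and to a list.

```r
# Updating an environment inside a function changes the caller's object.
increment <- function(counter) {
  counter$n <- counter$n + 1
  invisible(NULL)
}

counter <- new.env()
counter$n <- 0
for (i in 1:3) increment(counter)

# Updating a list inside a function only changes the function's local copy.
increment_list <- function(counter_list) {
  counter_list$n <- counter_list$n + 1
  invisible(NULL)
}

counter_list <- list(n = 0)
for (i in 1:3) increment_list(counter_list)

print(counter$n)       # 3: the environment was modified in place
print(counter_list$n)  # 0: the original list is untouched
```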

SLIDE 29

deepcopy()

    # x is a big matrix
    x <- big.matrix(...)

    # x_no_copy and x refer to the same object
    x_no_copy <- x

    # x_copy and x refer to different objects
    x_copy <- deepcopy(x)

SLIDE 30

Reference behaviour

- R won't make copies implicitly
- Minimize memory usage
- Reduce execution time
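
A consequence worth keeping in mind is that a function receiving a big.matrix can change the caller's data. The sketch below assumes a hypothetical helper, `fill_with()`, and uses a small in-memory big.matrix; `deepcopy()` is used to keep an independent backup.

```r
library(bigmemory)

# Hypothetical helper: overwrites every element of a big.matrix with `value`.
fill_with <- function(m, value) {
  m[, ] <- value
  invisible(NULL)
}

x <- big.matrix(nrow = 1, ncol = 3, type = "double", init = 0)

# Take an explicit copy before modifying, since none is made implicitly.
x_backup <- deepcopy(x)

fill_with(x, 5)

x_values <- x[, ]
backup_values <- x_backup[, ]
print(x_values)       # changed by the function
print(backup_values)  # unchanged copy
```

Avoiding the implicit copy is what saves memory and time, so an explicit `deepcopy()` is best reserved for cases where the original values must be preserved.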

SLIDE 31

Not all R objects are copied

    library(bigmemory)

    x <- big.matrix(nrow = 1, ncol = 3, type = "double",
                    init = 0,
                    backingfile = "hello-bigmemory.bin",
                    descriptorfile = "hello-bigmemory.desc")

SLIDE 32

Not all R objects are copied

    x_no_copy <- x
    x[,]

    0 0 0

    x_no_copy[,]

    0 0 0

    x[,] <- 1
    x[,]

    1 1 1

    x_no_copy[,]

    1 1 1

SLIDE 33

Not all R objects are copied

    x_copy <- deepcopy(x)
    x[,]

    1 1 1

    x_copy[,]

    1 1 1

    x[,] <- 2
    x[,]

    2 2 2

    x_copy[,]

    1 1 1

SLIDE 34

Let's practice!

SCALABLE DATA PROCESSING IN R