The Bigmemory Suite of Packages: Scalable Data Processing in R (PowerPoint presentation)




SLIDE 1

The Bigmemory Suite of Packages

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 2

SCALABLE DATA PROCESSING IN R

So far...

Import, subset, and assign values to big.matrix objects
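As a reminder, those three operations might be sketched as follows (the file names here are placeholders, and the column set depends on the data):

```r
library(bigmemory)

# Import: create a file-backed big.matrix from a CSV
# ("mortgage-sample.csv" is a hypothetical file name)
mort <- read.big.matrix("mortgage-sample.csv", header = TRUE,
                        type = "integer",
                        backingfile = "mortgage-sample.bin",
                        descriptorfile = "mortgage-sample.desc")

# Subset: works like a regular matrix
mort[1:3, ]

# Assign: values are written in place, backed by the file
mort[1, "msa"] <- 1L
```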

SLIDE 3

SCALABLE DATA PROCESSING IN R

Associated Packages

Tables and summaries

biganalytics
bigtabulate

SLIDE 4

SCALABLE DATA PROCESSING IN R

Associated Packages

Linear algebra

bigalgebra

SLIDE 5

SCALABLE DATA PROCESSING IN R

Associated Packages

Model fitting

bigpca
bigFastLM
biglasso
bigrf

SLIDE 6

SCALABLE DATA PROCESSING IN R

The FHFA's Mortgage Data Set

Mortgages that were held or securitized by the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) from 2008 to 2015. The FHFA mortgage data set is available online. We will focus on a random subset of 70,000 loans.

SLIDE 7

SCALABLE DATA PROCESSING IN R

First example: using bigtabulate with bigmemory

library(bigtabulate)

# How many samples do we have per year?
bigtable(mort, "year")
 2008  2009  2010  2011  2012  2013  2014  2015
 8468 11101  8836  7996 10935 10216  5714  6734

# Create nested tables
bigtable(mort, c("msa", "year"))
  2008 2009 2010 2011 2012 2013 2014 2015
0 1064 1343  998  851 1066 1005  504  564
1 7404 9758 7838 7145 9869 9211 5210 6170

SLIDE 8

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 9

Split-Apply-Combine

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 10

SCALABLE DATA PROCESSING IN R

Split-Apply-Combine

Split: split()
Apply: Map()
Combine: Reduce()
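These three base-R functions can be sketched end to end on a toy data frame (the missing-value code 9 follows the convention used later in this chapter):

```r
# Toy data frame standing in for the mortgage data
df <- data.frame(year = c(2008, 2008, 2009, 2009),
                 value = c(1, 9, 9, 9))

# Split: row indices grouped by year
splits <- split(seq_len(nrow(df)), df$year)

# Apply: count values equal to 9 (the missing-value code) per group
counts <- Map(function(idx) sum(df$value[idx] == 9), splits)

# Combine: total missing values across groups
Reduce(`+`, counts)  # 3
```

The same three steps scale to a big.matrix because only the row indices, not the data, are split.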

SLIDE 11

SCALABLE DATA PROCESSING IN R

Partition using split()

The split() function partitions data.
The first argument is a vector or data.frame to split.
The second argument is a factor or integer vector whose values define the partitions.

SLIDE 12

SCALABLE DATA PROCESSING IN R

# Get the rows corresponding to each of the years in the mortgage data
year_splits <- split(1:nrow(mort), mort[, "year"])

# year_splits is a list
class(year_splits)
[1] "list"

# The years that we've split over
names(year_splits)
[1] "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"

# The first few rows corresponding to the year 2010
year_splits[["2010"]][1:10]
 [1]  1  6  7 10 21 23 24 27 29 38

SLIDE 13

SCALABLE DATA PROCESSING IN R

Compute using Map()

The Map() function processes the partitions.
The first argument is the function to apply to each partition.
The second argument is the partitions.

SLIDE 14

SCALABLE DATA PROCESSING IN R

Compute using Map()

col_missing_count <- function(mort) {
  apply(mort, 2, function(x) sum(x == 9))
}

# For each of the years, count the number of missing values
# in all columns
missing_by_year <- Map(
  function(x) col_missing_count(mort[x, ]),
  year_splits
)

missing_by_year[["2008"]]
   enterprise record_number           msa
            0            12             0
# ...

SLIDE 15

SCALABLE DATA PROCESSING IN R

Combine using Reduce()

The Reduce() function combines the results for all partitions.
The first argument is the function to combine with.
The second argument is the partitioned data.

SLIDE 16

SCALABLE DATA PROCESSING IN R

# Calculate the total missing values by column
Reduce(`+`, missing_by_year)
   enterprise record_number           msa
            0            64             0
# ...

# Label the rownames with the year
mby <- Reduce(rbind, missing_by_year)
row.names(mby) <- names(year_splits)

mby[1:3, 1:3]
     enterprise record_number msa
2008          0            12   0
2009          0             8   0
2010          0            10   0

SLIDE 17

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 18

Visualize your results using the Tidyverse

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 19

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame()

SLIDE 20

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame() %>%
  mutate(Category = c("Male", "Female", "Not Provided",
                      "Not Applicable", "Missing"))

SLIDE 21

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame() %>%
  mutate(Category = c("Male", "Female", "Not Provided",
                      "Not Applicable", "Missing")) %>%
  gather(Year, Count, -Category)

SLIDE 22

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame() %>%
  mutate(Category = c("Male", "Female", "Not Provided",
                      "Not Applicable", "Missing")) %>%
  gather(Year, Count, -Category) %>%
  ggplot(aes(x = Year, y = Count, group = Category,
             color = Category)) +
  geom_line()

SLIDE 23

SCALABLE DATA PROCESSING IN R

SLIDE 24

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 25

Limitations of bigmemory

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 26

SCALABLE DATA PROCESSING IN R

Where can you use bigmemory?

You can use bigmemory when your data are dense, numeric matrices.
The underlying data structures are compatible with low-level linear algebra libraries for fast model fitting.
If you have columns of different types, you could try the ff package.

SLIDE 27

SCALABLE DATA PROCESSING IN R

Understanding disk access

A big.matrix is a data structure designed for random access

SLIDE 28

SCALABLE DATA PROCESSING IN R

Disadvantages of random access

Can't add rows or columns to an existing big.matrix object
You need to have enough disk space to hold the entire matrix in one big block
SLIDE 29

Let's practice!

SCALABLE DATA PROCESSING IN R