The Bigmemory Suite of Packages: Scalable Data Processing in R (PowerPoint presentation)




SLIDE 1

The Bigmemory Suite of Packages

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 2

SCALABLE DATA PROCESSING IN R

So far...

Import, subset, and assign values to big.matrix objects
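As a reminder, those three operations might be sketched as follows (the file names here are placeholders, and the column set depends on the data):

```r
library(bigmemory)

# Import: create a file-backed big.matrix from a CSV
# ("mortgage-sample.csv" is a hypothetical file name)
mort <- read.big.matrix("mortgage-sample.csv", header = TRUE,
                        type = "integer",
                        backingfile = "mortgage-sample.bin",
                        descriptorfile = "mortgage-sample.desc")

# Subset: works like a regular matrix
mort[1:3, ]

# Assign: values are written in place, backed by the file
mort[1, "msa"] <- 1L
```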

SLIDE 3

SCALABLE DATA PROCESSING IN R

Associated Packages

Tables and summaries

biganalytics
bigtabulate

SLIDE 4

SCALABLE DATA PROCESSING IN R

Associated Packages

Linear algebra

bigalgebra

SLIDE 5

SCALABLE DATA PROCESSING IN R

Associated Packages

Model fitting

bigpca
bigFastLM
biglasso
bigrf

SLIDE 6

SCALABLE DATA PROCESSING IN R

The FHFA's Mortgage Data Set

Mortgages that were held or securitized by the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) from 2008 to 2015. The FHFA mortgage data set is available online. We will focus on a random subset of 70,000 loans.

SLIDE 7

SCALABLE DATA PROCESSING IN R

First example: using bigtabulate with bigmemory

library(bigtabulate)

# How many samples do we have per year?
bigtable(mort, "year")
 2008  2009  2010  2011  2012  2013  2014  2015
 8468 11101  8836  7996 10935 10216  5714  6734

# Create nested tables
bigtable(mort, c("msa", "year"))
  2008 2009 2010 2011 2012 2013 2014 2015
0 1064 1343  998  851 1066 1005  504  564
1 7404 9758 7838 7145 9869 9211 5210 6170

SLIDE 8

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 9

Split-Apply-Combine

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 10

SCALABLE DATA PROCESSING IN R

Split-Apply-Combine

Split: split()
Apply: Map()
Combine: Reduce()
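These three base-R functions can be sketched end to end on a toy data frame (the missing-value code 9 follows the convention used later in this chapter):

```r
# Toy data frame standing in for the mortgage data
df <- data.frame(year = c(2008, 2008, 2009, 2009),
                 value = c(1, 9, 9, 9))

# Split: row indices grouped by year
splits <- split(seq_len(nrow(df)), df$year)

# Apply: count values equal to 9 (the missing-value code) per group
counts <- Map(function(idx) sum(df$value[idx] == 9), splits)

# Combine: total missing values across groups
Reduce(`+`, counts)  # 3
```

The same three steps scale to a big.matrix because only the row indices, not the data, are split.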

SLIDE 11

SCALABLE DATA PROCESSING IN R

Partition using split()

The split() function partitions data.
The first argument is a vector or data.frame to split.
The second argument is a factor or integer vector whose values define the partitions.

SLIDE 12

SCALABLE DATA PROCESSING IN R

# Get the rows corresponding to each of the years in the mortgage data
year_splits <- split(1:nrow(mort), mort[, "year"])

# year_splits is a list
class(year_splits)
[1] "list"

# The years that we've split over
names(year_splits)
[1] "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015"

# The first few rows corresponding to the year 2010
year_splits[["2010"]][1:10]
 [1]  1  6  7 10 21 23 24 27 29 38

SLIDE 13

SCALABLE DATA PROCESSING IN R

Compute using Map()

The Map() function processes the partitions.
The first argument is the function to apply to each partition.
The second argument is the partitions.

SLIDE 14

SCALABLE DATA PROCESSING IN R

Compute using Map()

col_missing_count <- function(mort) {
  apply(mort, 2, function(x) sum(x == 9))
}

# For each of the years, count the number of missing values
# in all columns
missing_by_year <- Map(
  function(x) col_missing_count(mort[x, ]),
  year_splits
)

missing_by_year[["2008"]]
   enterprise record_number           msa
            0            12             0
# ...

SLIDE 15

SCALABLE DATA PROCESSING IN R

Combine using Reduce()

The Reduce() function combines the results for all partitions.
The first argument is the function to combine with.
The second argument is the partitioned data.

SLIDE 16

SCALABLE DATA PROCESSING IN R

# Calculate the total missing values by column
Reduce(`+`, missing_by_year)
   enterprise record_number           msa
            0            64             0
# ...

# Label the rownames with the year
mby <- Reduce(rbind, missing_by_year)
row.names(mby) <- names(year_splits)

mby[1:3, 1:3]
     enterprise record_number msa
2008          0            12   0
2009          0             8   0
2010          0            10   0

SLIDE 17

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 18

Visualize your results using the Tidyverse

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 19

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame()

SLIDE 20

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame() %>%
  mutate(Category = c("Male", "Female", "Not Provided",
                      "Not Applicable", "Missing"))

SLIDE 21

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame() %>%
  mutate(Category = c("Male", "Female", "Not Provided",
                      "Not Applicable", "Missing")) %>%
  gather(Year, Count, -Category)

SLIDE 22

SCALABLE DATA PROCESSING IN R

Missingness by Year

library(ggplot2)
library(tidyr)
library(dplyr)

mort %>%
  bigtable(c("borrower_gender", "year")) %>%
  as.data.frame() %>%
  mutate(Category = c("Male", "Female", "Not Provided",
                      "Not Applicable", "Missing")) %>%
  gather(Year, Count, -Category) %>%
  ggplot(aes(x = Year, y = Count, group = Category,
             color = Category)) +
  geom_line()

SLIDE 23

SCALABLE DATA PROCESSING IN R

SLIDE 24

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 25

Limitations of bigmemory

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 26

SCALABLE DATA PROCESSING IN R

Where can you use bigmemory?

You can use bigmemory when your data are dense, numeric matrices.
The underlying data structures are compatible with low-level linear algebra libraries for fast model fitting.
If you have columns of different types, you could try the ff package.

SLIDE 27

SCALABLE DATA PROCESSING IN R

Understanding disk access

A big.matrix is a data structure designed for random access

SLIDE 28

SCALABLE DATA PROCESSING IN R

Disadvantages of random access

Can't add rows or columns to an existing big.matrix object
You need to have enough disk space to hold the entire matrix in one big block
SLIDE 29

Let's practice!

SCALABLE DATA PROCESSING IN R