Review and Preliminary Mortgage Analysis: Scalable Data Processing in R


SLIDE 1

Review and Preliminary Mortgage Analysis

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 2

Overview of the chapter

- Compare proportions of people receiving mortgages
- Missingness in the data
- Changes in mortgage demographic proportions over time
- City vs. rural mortgages
- Proportion of people securing federally guaranteed loans

SLIDE 3

United States Census Bureau Race and Ethnic Proportions

| Category | Percentage |
|---|---|
| American Indian or Alaska Native | 0.9 |
| Asian | 4.8 |
| Black or African American | 12.6 |
| Native Hawaiian or Other Pacific Islander | 0.2 |
| Two or more races (Not included) | 2.9 |
| Other race (Not included) | 6.2 |

SLIDE 4

Proportional Borrowing

We know that most mortgages went to people who identify as white. Is this group borrowing more than its share of the population would suggest?
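One way to make this comparison concrete is to divide each group's share of mortgages by its share of the population. The sketch below uses the Census percentages from the earlier slide; the mortgage counts and the total are made-up placeholders, not values from the course data.

```r
# Census percentages from the earlier slide, converted to proportions
census_prop <- c(
  "American Indian or Alaska Native" = 0.009,
  "Asian" = 0.048,
  "Black or African American" = 0.126,
  "Native Hawaiian or Other Pacific Islander" = 0.002
)

# HYPOTHETICAL mortgage counts, for illustration only (not course data)
mortgage_count <- c(
  "American Indian or Alaska Native" = 90,
  "Asian" = 960,
  "Black or African American" = 630,
  "Native Hawaiian or Other Pacific Islander" = 40
)
total_mortgages <- 10000  # hypothetical total across all groups

# Share of all mortgages that went to each group
mortgage_prop <- mortgage_count / total_mortgages

# Ratio above 1: the group holds more mortgages than its population share suggests
borrow_ratio <- mortgage_prop / census_prop
round(borrow_ratio, 2)
```

A ratio near 1 means a group's share of mortgages roughly matches its share of the population.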

SLIDE 5

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 6

Are the data missing at random?

SCALABLE DATA PROCESSING IN R

Michael Kane

Assistant Professor, Yale University

SLIDE 8

Types of Missing Data

- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)

SLIDE 9

MCAR

- Missing Completely at Random
- There is no way to predict which values are missing
- Can drop missing data

SLIDE 10

MAR

- Missing at Random
- Missingness is dependent on variables in the data set
- Use multiple imputation to predict what missing values could be

SLIDE 11

MNAR

- Missing Not at Random
- Not MCAR or MAR
- Deterministic relationship between variables

SLIDE 12

Dealing with missing data in this course

- Full treatment of missingness is beyond the scope of this course
- We will check whether it is plausible that the data are MCAR and, if so, drop the missing values
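Dropping missing values can be sketched in base R as follows. The data frame and its column names are invented for illustration and do not come from the mortgage data.

```r
# HYPOTHETICAL data with missing values (illustration only)
toy <- data.frame(
  income = c(50, NA, 72, 64),
  rate   = c(3.5, 4.1, NA, 3.9)
)

# Count the missing values in each column
n_missing <- colSums(is.na(toy))

# Keep only the rows that have no missing values
complete <- toy[complete.cases(toy), ]
```

Dropping rows this way is only reasonable when the MCAR assumption is plausible; otherwise the remaining rows may no longer represent the full data set.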

SLIDE 13

A Quick Check for MAR

- Recode a column as 1 where the value is missing and 0 otherwise
- Regress this indicator on the other variables using logistic regression
- A significant p-value indicates MAR
- Repeat for other columns with missingness
- Some p-values can be significant by chance, so adjust your cutoff for significance based on the number of regressions

SLIDE 14

MAR Quick Check Example

    # Our dependent variable
    is_missing <- rbinom(1000, 1, 0.5)

    # Our independent variables
    data_matrix <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)

    # A vector of p-values we'll fill in
    p_vals <- rep(NA, ncol(data_matrix))

SLIDE 15

MAR Quick Check Example

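    # Perform logistic regression
    for (j in seq_len(ncol(data_matrix))) {
      # Fit a logistic regression of the missingness indicator on column j
      s <- summary(glm(is_missing ~ data_matrix[, j], family = binomial))
      # Row 2 is the slope; column 4 is its p-value
      p_vals[j] <- s$coefficients[2, 4]
    }

    # Show the p-values (values vary from run to run because no seed is set)
    p_vals

    0.5930082 0.7822695 0.7560343 0.3689330 0.8757048
    0.8812320 0.8281008 0.4888898 0.4781299 0.5655739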
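Because ten regressions are run, a raw cutoff of 0.05 would flag some column by chance fairly often. Here is a minimal sketch of one common way to adjust for this, the Bonferroni correction, using base R's p.adjust(). The seed, the 0.05 cutoff and the choice of Bonferroni are assumptions made for illustration, not part of the slides.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Simulated missingness indicator and predictors, as in the example above
is_missing <- rbinom(1000, 1, 0.5)
data_matrix <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10)

# One logistic regression per column; keep the slope's p-value
p_vals <- vapply(seq_len(ncol(data_matrix)), function(j) {
  fit <- glm(is_missing ~ data_matrix[, j], family = binomial)
  summary(fit)$coefficients[2, 4]
}, numeric(1))

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_adj <- p.adjust(p_vals, method = "bonferroni")

# Columns that still look related to missingness at the 0.05 level
flagged <- which(p_adj < 0.05)
```

Bonferroni is conservative; p.adjust() also offers less strict methods such as "holm" and "BH".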

SLIDE 16

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 17

Analyzing the Housing Data

SCALABLE DATA PROCESSING IN R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

SLIDE 18

So far ...

- Compare different demographic groups in data
- Quick check to see if data are missing at random

SLIDE 19

Adjusted Counts and Proportional Change by Year

- Adjusting group size lets you compare different groups as if they were the same size
- Proportional change shows growth (or decline) of a group
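Both ideas can be sketched in a few lines of base R. The groups A and B, the years, the counts and the population shares below are all invented for illustration.

```r
# HYPOTHETICAL mortgage counts by year for two groups (illustration only)
counts <- matrix(
  c(100, 120, 150,   # group A
     20,  30,  45),  # group B
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A", "B"), c("2013", "2014", "2015"))
)

# HYPOTHETICAL share of the population belonging to each group
pop_prop <- c(A = 0.8, B = 0.2)

# Adjusted counts: each row is divided by that group's population share,
# so groups can be compared as if they were the same size
adjusted <- counts / pop_prop

# Proportional change: (this year - last year) / last year, per group
n_years <- ncol(counts)
prop_change <- (counts[, -1] - counts[, -n_years]) / counts[, -n_years]
```

In this toy example group A has the larger raw counts, but group B grows faster (50% per year) and overtakes A in adjusted count by the final year.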

SLIDE 20

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 21

Other Lending Trends

SCALABLE DATA PROCESSING IN R

Simon Urbanek

Member of R-Core, Lead Inventive Scientist, AT&T Labs Research

SLIDE 22

In this lesson ...

- City vs. rural
- Federally guaranteed loans vs. income

SLIDE 23

City vs. Rural

- City means a home is in a metropolitan area; otherwise it is rural
- In the mortgage data set, msa is 1 for city and 0 otherwise
- For a more precise definition, see the FHFA website
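A small sketch of how the 0/1 coding can be turned into readable labels and summarized. The msa values below are made up; only the coding (1 for city, 0 for rural) comes from the slide.

```r
# HYPOTHETICAL msa column: 1 = metropolitan area (city), 0 = rural
msa <- c(1, 1, 0, 1, 0, 1, 1, 1)

# Label the codes so the output is readable
area <- factor(msa, levels = c(0, 1), labels = c("rural", "city"))

# Counts and proportions of mortgages by area type
area_counts <- table(area)
area_props <- prop.table(area_counts)
```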

SLIDE 24

Federally Guaranteed Loans and Borrower Income

- Federally guaranteed loans protect the company issuing a loan
- If a lender can issue a federally guaranteed loan, it is less worried about the loan defaulting, because the government will buy the loan
- We'll use the Borrower Income Ratio: borrower income divided by the median income of people in the area
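The ratio itself is a simple division. The sketch below computes it for a few invented borrowers and then compares the share of federally guaranteed loans at or below the area median with the share above it. The data frame, its column names and the cutoff at a ratio of 1 are assumptions made for illustration, not the layout of the course data.

```r
# HYPOTHETICAL borrowers (illustration only)
loans <- data.frame(
  borrower_income      = c(40000, 90000, 55000, 120000),
  area_median_income   = c(50000, 60000, 55000,  80000),
  federally_guaranteed = c(TRUE, FALSE, TRUE, FALSE)
)

# Borrower Income Ratio: borrower income relative to the area's median income
loans$income_ratio <- loans$borrower_income / loans$area_median_income

# Proportion of federally guaranteed loans, split by whether the borrower's
# income is at or below the area median (TRUE) or above it (FALSE)
at_or_below_median <- loans$income_ratio <= 1
guaranteed_by_income <- tapply(loans$federally_guaranteed, at_or_below_median, mean)
```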

SLIDE 25

Let's practice!

SCALABLE DATA PROCESSING IN R

SLIDE 26

Congratulations!

SCALABLE DATA PROCESSING IN R

Michael J. Kane and Simon Urbanek

Instructors, DataCamp

SLIDE 27

Split-Apply-Combine

- Break the data into parts
- Compute on the parts
- Combine the results

SLIDE 28

Split-Apply-Combine: Advantages

- Manageable parts don't overwhelm your computer
- Approach is easy to parallelize
- Process sequentially
- Process on several machines in a cluster

SLIDE 29

Split-Apply-Combine: R

- split() partitions a set of row numbers or a data.frame
- Map() computes on the parts
- Reduce() combines the results
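The three functions fit together as in this minimal sketch. It uses R's built-in mtcars data rather than the mortgage data, and the choice of columns is arbitrary.

```r
# Split: partition the row numbers by number of cylinders
parts <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# Apply: compute column sums and a row count on each part
partial <- Map(function(rows) {
  c(colSums(mtcars[rows, c("mpg", "hp")]), n = length(rows))
}, parts)

# Combine: add the partial results together
total <- Reduce(`+`, partial)

# Overall means recovered from the combined sums and count
overall_mean <- total[c("mpg", "hp")] / total[["n"]]
```

Each part is processed independently, so the Map() step could run sequentially on chunks read from disk or in parallel on several machines without changing the Reduce() step.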

SLIDE 30

bigmemory

- Good for larger data sets that can be represented as dense matrices and might be too big for RAM
- Looks like a regular R matrix

SLIDE 31

iotools

- Good for much larger data that can be processed in sequential chunks
- Supports data.frame and matrix

SLIDE 33

Good luck!

SCALABLE DATA PROCESSING IN R