Intro to R - 5. R for Data Science
OIT/SMU Libraries Data Science Workshop Series
Michael Hahsler (OIT, SMU)

1. What is Data Science?
2. Predictive Modeling
3. Package Caret
4. Exercises

Section 1: What is Data Science?

Data Science

Data Science is still evolving. One definition, by Hal Varian (chief economist at Google and professor at UC Berkeley), is:

"The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades." – Hal Varian

Data Science

Figure 1: Data Science Life Cycle
Source: https://datascience.berkeley.edu/about/what-is-data-science/

Section 2: Predictive Modeling

Predictive Modeling

- Data mining
- Machine learning
- Prediction
  - regression (predict a number, e.g., the age of a person)
  - classification (predict a label, e.g., yes/no)
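The two kinds of prediction can be sketched with the built-in mtcars data set. This is only an illustration: the choice of lm for regression and of logistic regression (glm with a binomial family) for classification is an assumption made here, as are the variable names and the 0.5 cutoff.

```r
data(mtcars)

# Regression: predict a number (miles per gallon) from the weight of a car.
reg_model <- lm(mpg ~ wt, data = mtcars)
reg_pred <- predict(reg_model, newdata = data.frame(wt = 3))
reg_pred  # a single numeric prediction

# Classification: predict a label (transmission type, am = 1 means manual).
# Logistic regression is used here only as one example of a classifier.
cls_model <- glm(am ~ wt, data = mtcars, family = binomial)
prob_manual <- predict(cls_model, newdata = data.frame(wt = 3),
                       type = "response")
cls_pred <- ifelse(prob_manual > 0.5, "manual", "automatic")
cls_pred  # a label
```

The regression model returns a number on the scale of the response, while the classifier returns a probability that still has to be turned into a label.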

Predictive Modeling Workflow

Figure 2: Workflow of Predictive Modeling. Training data (the cars data, with features wt, ..., hp and response mpg) goes through "train" to produce a model. The model and new data go through "predict" to produce predictions.

Predictive Modeling Workflow in R

Figure 3: Workflow of Predictive Modeling with R. The training data and the new data are each a data.frame. Training is done by a function (train, lm, etc.) and produces a model, which is an R object/list. The function predict takes the model and the new data and returns the predictions as a vector.
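The workflow can be traced in a few lines of base R. This is a sketch under assumptions: the 80/20 split, the seed, and the variable names are chosen here for illustration and are not part of the slides.

```r
data(mtcars)
set.seed(42)  # arbitrary seed so that the split is reproducible

# Split the data.frame into training data and "new" data (assumed 80/20 split).
train_idx <- sample(nrow(mtcars), size = round(0.8 * nrow(mtcars)))
training <- mtcars[train_idx, ]
new_data <- mtcars[-train_idx, ]

# A training function (here lm) turns a data.frame into a model object.
model <- lm(mpg ~ wt + hp, data = training)
class(model)    # "lm"
is.list(model)  # the model is internally a list

# predict() takes the model and a data.frame and returns a vector.
predictions <- predict(model, newdata = new_data)
is.numeric(predictions)
length(predictions) == nrow(new_data)
```

Each prediction corresponds to one row of the new data, so the result can be compared row by row with new_data$mpg.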

Example

    data(mtcars)  # Load the dataset
    knitr::kable(head(mtcars))

                       mpg  cyl  disp   hp  drat   wt  qsec  vs  am  gear  carb
    Mazda RX4           21    6   160  110   3.9  2.6    16   0   1     4     4
    Mazda RX4 Wag       21    6   160  110   3.9  2.9    17   0   1     4     4
    Datsun 710          23    4   108   93   3.9  2.3    19   1   1     4     1
    Hornet 4 Drive      21    6   258  110   3.1  3.2    19   1   0     3     1
    Hornet Sportabout   19    8   360  175   3.1  3.4    17   0   0     3     2
    Valiant             18    6   225  105   2.8  3.5    20   1   0     3     1

Note: kable in package knitr is used to pretty-print the table because the slides were created with Markdown.

Example: Predict Miles per Gallon

    plot(mtcars$wt, mtcars$mpg)

[Figure: scatter plot of mtcars$mpg (y-axis) against mtcars$wt (x-axis)]

Linear Regression

    model <- lm(mpg ~ wt, data = mtcars)
    model
    ##
    ## Call:
    ## lm(formula = mpg ~ wt, data = mtcars)
    ##
    ## Coefficients:
    ## (Intercept)           wt
    ##       37.29        -5.34

Formula Interface: R often uses a "model formula" to specify models of the form response ~ predictors. See ?formula for details.
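A few common formula patterns, as a short sketch. The object names f, m_all and m_sq are made up for this illustration.

```r
data(mtcars)

# A formula is an ordinary R object that can be stored and inspected.
f <- mpg ~ wt + hp
class(f)     # "formula"
all.vars(f)  # the variable names used in the formula

# A dot on the right-hand side stands for all other columns in data.
m_all <- lm(mpg ~ ., data = mtcars)
length(coef(m_all))  # intercept plus one coefficient per remaining column

# Arithmetic inside a formula has to be wrapped in I().
m_sq <- lm(mpg ~ wt + I(wt^2), data = mtcars)
names(coef(m_sq))
```

Without I(), the ^ operator would be read as formula syntax for interactions rather than as squaring.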

Linear Regression: Model summary

    summary(model)
    ##
    ## Call:
    ## lm(formula = mpg ~ wt, data = mtcars)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -4.543 -2.365 -0.125  1.410  6.873
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   37.285      1.878   19.86  < 2e-16 ***
    ## wt            -5.344      0.559   -9.56  1.3e-10 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 3 on 30 degrees of freedom
    ## Multiple R-squared:  0.753, Adjusted R-squared:  0.745
    ## F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10
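The numbers printed by summary can also be extracted programmatically. A minimal sketch, in which the variable name s is an arbitrary choice:

```r
data(mtcars)
model <- lm(mpg ~ wt, data = mtcars)
s <- summary(model)

s$r.squared      # multiple R-squared
s$adj.r.squared  # adjusted R-squared
s$sigma          # residual standard error

# The coefficient table is a matrix indexed by term and column name.
coef(s)
coef(s)["wt", "Estimate"]
coef(s)["wt", "Pr(>|t|)"]

# Confidence intervals for the coefficients.
confint(model, level = 0.95)
```

Working with these components avoids copying numbers from the printed output by hand.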

Linear Regression: The model as an R object

    str(model)
    ## List of 12
    ##  $ coefficients : Named num [1:2] 37.29 -5.34
    ##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
    ##  $ residuals    : Named num [1:32] -2.28 -0.92 -2.09 1.3 -0.2 ...
    ##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
    ##  $ effects      : Named num [1:32] -113.65 -29.116 -1.661 1.631 0.111 ...
    ##   ..- attr(*, "names")= chr [1:32] "(Intercept)" "wt" "" "" ...
    ##  $ rank         : int 2
    ##  $ fitted.values: Named num [1:32] 23.3 21.9 24.9 20.1 18.9 ...
    ##   ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
    ##  $ assign       : int [1:2] 0 1
    ##  $ qr           :List of 5
    ##   ..$ qr   : num [1:32, 1:2] -5.657 0.177 0.177 0.177 0.177 ...
    ##   .. ..- attr(*, "dimnames")=List of 2
    ##   .. .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet
    ##   .. .. ..$ : chr [1:2] "(Intercept)" "wt"
    ##   .. ..- attr(*, "assign")= int [1:2] 0 1
    ##   ..$ qraux: num [1:2] 1.18 1.05
    ##   ..$ pivot: int [1:2] 1 2
    ##   ..$ tol  : num 1e-07
    ##   ..$ rank : int 2
    ##   ..- attr(*, "class")= chr "qr"
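Because the model is a list, its parts can be reached with $, although the accessor functions are usually the safer choice. A brief sketch:

```r
data(mtcars)
model <- lm(mpg ~ wt, data = mtcars)

names(model)        # the components listed by str(model)
model$coefficients  # direct list access
coef(model)         # accessor function for the same values

head(residuals(model), 3)
head(fitted(model), 3)

# Fitted values plus residuals reproduce the observed response.
all.equal(unname(fitted(model) + residuals(model)), mtcars$mpg)
```

Accessor functions such as coef, residuals and fitted also work for many other model classes, whereas the list layout differs from one class to another.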

Linear Regression: Plotting the regression line

    plot(mtcars$wt, mtcars$mpg)
    abline(coef(model), col = "red", lty = 2, lwd = 3)

[Figure: scatter plot of mtcars$mpg (y-axis) against mtcars$wt (x-axis) with the fitted regression line drawn as a dashed red line]

Multiple Linear Regression

    model <- lm(mpg ~ wt + cyl + hp, data = mtcars)
    model
    ##
    ## Call:
    ## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
    ##
    ## Coefficients:
    ## (Intercept)           wt          cyl           hp
    ##      38.752       -3.167       -0.942       -0.018

    summary(model)
    ##
    ## Call:
    ## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -3.929 -1.560 -0.531  1.185  5.899
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)  38.7518     1.7869   21.69   <2e-16 ***
    ## wt           -3.1670     0.7406   -4.28   0.0002 ***
    ## cyl          -0.9416     0.5509   -1.71   0.0985 .
    ## hp           -0.0180     0.0119   -1.52   0.1400
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 2.5 on 28 degrees of freedom
    ## Multiple R-squared:  0.843, Adjusted R-squared:  0.826
    ## F-statistic: 50.2 on 3 and 28 DF,  p-value: 2.18e-11
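One way to judge whether the extra predictors are worth keeping is to compare the two nested models directly. This is a sketch; the names small and large are made up here, and the slides do not prescribe this comparison.

```r
data(mtcars)
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# Adjusted R-squared penalizes additional predictors.
adj_r2 <- c(small = summary(small)$adj.r.squared,
            large = summary(large)$adj.r.squared)
adj_r2

# F-test for nested models: does the larger model fit significantly better?
anova(small, large)

# Information criterion; lower values are better.
AIC(small, large)
```

All three comparisons use the training data only, so they say nothing about performance on new data.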

Prediction

Almost all R models provide a predict function.

    predict(model, head(mtcars))
    ##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive
    ##                23                22                26                21
    ## Hornet Sportabout           Valiant
    ##                17                20

Note: Prediction is typically done on new or test data. Packages such as caret, mlr3, and SuperLearner help with this.
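Predicting for genuinely new observations only requires a data.frame with the same predictor columns. In the sketch below, the two cars in new_cars are hypothetical and their values are invented for illustration.

```r
data(mtcars)
model <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# Hypothetical new cars; the column names must match the predictors.
new_cars <- data.frame(wt = c(2.5, 3.5), cyl = c(4, 8), hp = c(95, 180))

# Point predictions, returned as a numeric vector.
point <- predict(model, newdata = new_cars)
point

# For lm models, predict can also return prediction intervals.
with_interval <- predict(model, newdata = new_cars,
                         interval = "prediction", level = 0.95)
with_interval  # matrix with columns fit, lwr and upr
```

The response column (mpg) is not needed in the new data, since it is the quantity being predicted.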

Section 3: Package Caret

Train a model with Caret

    library("caret")
    ## Loading required package: lattice
    ## Loading required package: ggplot2

    # Simple linear regression model (lm means linear model)
    model <- train(mpg ~ wt + cyl + hp, data = mtcars, method = "lm")
    model
    ## Linear Regression
    ##
    ## 32 samples
    ##  3 predictor
    ##
    ## No pre-processing
    ## Resampling: Bootstrapped (25 reps)
    ## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ...
    ## Resampling results:
    ##
    ##   RMSE  Rsquared  MAE
    ##   2.8   0.84      2.3
    ##
    ## Tuning parameter 'intercept' was held constant at a value of TRUE
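The resampling scheme is bootstrapping by default but can be changed with trainControl. The sketch below assumes 10-fold cross-validation and an arbitrary seed; neither choice comes from the slides.

```r
library(caret)
data(mtcars)
set.seed(123)  # arbitrary seed; the resampling is random

# Use 10-fold cross-validation instead of the default bootstrap.
ctrl <- trainControl(method = "cv", number = 10)
model_cv <- train(mpg ~ wt + cyl + hp, data = mtcars,
                  method = "lm", trControl = ctrl)

model_cv$results            # resampled RMSE, Rsquared and MAE
class(model_cv$finalModel)  # the underlying lm fit on all the data

# predict works on the train object just like on a plain lm model.
preds <- predict(model_cv, newdata = head(mtcars))
preds
```

With only 32 rows each fold holds about three cars, so the fold-wise estimates are noisy and the exact numbers depend on the seed.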

Training a regression tree

    # rpart implements CART (here a regression tree)
    model <- train(mpg ~ wt + cyl + hp, data = mtcars, method = "rpart")
    ## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
    ## trainInfo, : There were missing values in resampled performance measures.
    model
    ## CART
    ##
    ## 32 samples
    ##  3 predictor
    ##
    ## No pre-processing
    ## Resampling: Bootstrapped (25 reps)
    ## Summary of sample sizes: 32, 32, 32, 32, 32, 32, ...
    ## Resampling results across tuning parameters:
    ##
    ##   cp     RMSE  Rsquared  MAE
    ##   0.000  4.0   0.55      3.3
    ##   0.097  4.1   0.54      3.4
    ##   0.643  5.1   0.48      4.2
    ##
    ## RMSE was used to select the optimal model using the smallest value.
    ## The final value used for the model was cp = 0.
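The candidate values for the complexity parameter cp are chosen by caret automatically, but they can also be supplied through tuneGrid. In this sketch the grid values, the seed and the 5-fold cross-validation are all assumptions made for illustration.

```r
library(caret)
data(mtcars)
set.seed(123)  # arbitrary seed; the resampling is random

# Candidate values for the complexity parameter (assumed for illustration).
grid <- data.frame(cp = c(0, 0.01, 0.05, 0.1))

tree <- train(mpg ~ wt + cyl + hp, data = mtcars, method = "rpart",
              tuneGrid = grid,
              trControl = trainControl(method = "cv", number = 5))

tree$results[, c("cp", "RMSE")]  # one row per candidate value
tree$bestTune                    # the cp value with the smallest RMSE
```

A warning about missing values in the resampled performance measures can still appear here, for example when a heavily pruned tree predicts the same value for every car in a fold and Rsquared cannot be computed.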

Plotting a regression tree

    library(rpart.plot)
    ## Loading required package: rpart
    rpart.plot(model$finalModel)

[Figure: regression tree. The root node (predicted mpg 20, 100% of the data) splits on cyl >= 5. The "yes" branch (17, 66%) splits again on hp >= 193 into leaves with 13 (22%) and 18 (44%). The "no" branch is a leaf with 27 (34%).]

    varImp(model)
    ## rpart variable importance
    ##
    ##     Overall
    ## hp    100.0
    ## cyl    94.6
    ## wt      0.0

Section 4: Exercises
