

  1. CME/STATS 195 Lecture 6: Data Modeling and Linear Regression. Evan Rosenman, April 18, 2019

  2. Contents: Data Modeling, Linear Regression, Lasso Regression

  3. Data Modeling

  4. Introduction to models
“All models are wrong, but some are useful. Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations (…).” – George E.P. Box, 1976
The goal of a model is to provide a simple, low-dimensional summary of a dataset. Models can be used to partition data into patterns of interest and residuals (other sources of variation and random noise).
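As a minimal sketch of this pattern/residual split, assuming the sim1 dataset from the modelr package (introduced on a later slide): fit a straight line as the pattern and inspect what it leaves behind.
library(modelr)
library(ggplot2)
sim1_fit <- lm(y ~ x, data = sim1)            # the "pattern": a straight line
sim1_resid <- add_residuals(sim1, sim1_fit)   # appends a resid column of leftovers
ggplot(sim1_resid, aes(x, resid)) +           # residuals should look like pure noise
  geom_ref_line(h = 0) +
  geom_point()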

  5. Hypothesis generation vs. hypothesis confirmation
Models are often used for inference about a pre-specified hypothesis, e.g. “BMI is associated with blood pressure, controlling for other factors.” Doing inference correctly is hard. Each observation should be used either for exploration or for confirmation, NOT both. An observation can be used many times for exploration, but only once for confirmation. There is nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis.

  6. Confirmatory analysis
One approach is to split your data into three pieces before you begin the analysis:
Training set – the bulk (e.g. 60%) of the dataset, which can be used to do anything: visualizing, fitting multiple models.
Validation set – a smaller set (e.g. 20%) used for manually comparing models and visualizations.
Test set – a set (e.g. 20%) held back and used only ONCE, to test and assess your final model.
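A minimal sketch of such a three-way split in R, assuming the built-in mtcars data as a stand-in dataset:
set.seed(195)
n <- nrow(mtcars)
shuffled <- sample(1:n)                                           # random permutation of the row indices
train <- mtcars[shuffled[1:round(0.6 * n)], ]                     # explore, fit candidate models
valid <- mtcars[shuffled[(round(0.6 * n) + 1):round(0.8 * n)], ]  # compare candidate models
test  <- mtcars[shuffled[(round(0.8 * n) + 1):n], ]               # use ONCE, for the final model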

  7. Confirmatory analysis
Partitioning the dataset allows you to explore the training data and generate a number of candidate hypotheses and models. You can select a final model based on its performance on the validation set. Finally, when you are confident in the chosen model, you can check how good it is using the test data.

  8. Model Basics
There are two parts to data modeling:
Defining a family of models: deciding on a set of models that can express the type of pattern you want to capture, e.g. a straight line or a quadratic curve.
Fitting a model: finding the model within the family that is the closest to your data.
A fitted model is just the best model from a chosen family of models, i.e. the “best” according to some set of criteria. This does not necessarily imply that the model is good, and it certainly does NOT imply that the model is true.

  9. A toy dataset
We will work with a simulated dataset, sim1, from the modelr package:
library(tidyverse)   # provides ggplot2 and tibble, used throughout
library(modelr)
sim1
## # A tibble: 30 x 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2
##  6     2 11.3
##  7     3  7.36
##  8     3 10.5
##  9     3 10.5
## 10     4 12.4
## # ... with 20 more rows
ggplot(sim1, aes(x, y)) + geom_point()

  10. Defining a family of models
The relationship between x and y for the points in sim1 looks linear, so we will look for models which belong to a family of models of the following form:
y = \beta_0 + \beta_1 \cdot x
The models that can be expressed by the above formula can adequately capture a linear trend. We generate a few examples of such models:
models <- tibble(b0 = runif(250, -20, 40),
                 b1 = runif(250, -5, 5))
ggplot(sim1, aes(x, y)) +
  geom_abline(data = models, aes(intercept = b0, slope = b1), alpha = 1/4) +
  geom_point()

  11. Fitting a model
From all the lines in the linear family of models, we need to find the best one, i.e. the one that is the closest to the data. This means that we need to find the parameters \hat{\beta}_0 and \hat{\beta}_1 that identify such a fitted line.
A typical measure of “closeness” is the sum of squared errors (SSE), i.e. we want the model with minimum squared residuals:
\|e\|^2 = \|y - \hat{y}\|^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2
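A minimal sketch of this fitting step, assuming the models tibble of 250 random candidate lines from the previous slide: score every candidate by its SSE on sim1 and keep the one with the smallest value.
# SSE of one candidate line (b0, b1) on data with columns x and y:
sse <- function(b0, b1, data) sum((data$y - (b0 + b1 * data$x))^2)
# Score all candidate lines and pick the best-fitting one:
models$sse <- mapply(sse, models$b0, models$b1, MoreArgs = list(data = sim1))
models[which.min(models$sse), ]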

  12. Linear Regression

  13. Linear Regression
Regression is a supervised learning method whose goal is inferring the relationship between input data, x, and a continuous response variable, y. Linear regression is a type of regression where y is modeled as a linear function of x.
Simple linear regression predicts the output y from a single predictor x:
y = \beta_0 + \beta_1 x + \epsilon
Multiple linear regression assumes that y relies on many covariates:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon = \beta^T x + \epsilon
Here \epsilon denotes a random noise term with zero mean.
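As an illustration of the multiple-regression form, a minimal sketch with two covariates (car weight and horsepower), using the mtcars data from the following slides:
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)   # mpg = b0 + b1*wt + b2*hp + noise
coef(fit_multi)                                 # one estimate per coefficient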

  14. Objective function
Linear regression seeks a solution \hat{y} = \hat{\beta}^T x that minimizes the difference between the true outcome y and the prediction \hat{y}, in terms of the residual sum of squares (RSS):
\hat{\beta} = \arg\min_{\beta} \sum_i \left( y_i - \beta^T x_i \right)^2
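For linear models this minimization has a closed-form solution, \hat{\beta} = (X^T X)^{-1} X^T y (the normal equations). A minimal sketch checking this against lm() on sim1:
X <- cbind(1, sim1$x)                                   # design matrix with an intercept column
beta_hat <- solve(crossprod(X), crossprod(X, sim1$y))   # solves (X'X) b = X'y
beta_hat
coef(lm(y ~ x, data = sim1))                            # should agree up to rounding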

  15. Simple Linear Regression
Predict the mileage per gallon using the weight of the car. In R, linear models can be fit with the lm() function.
# Separate the data into train and test:
set.seed(123)
n <- nrow(mtcars)
idx <- sample(1:n, size = round(0.7 * n))
mtcars_train <- mtcars[idx, ]
mtcars_test <- mtcars[-idx, ]
# Fit a simple linear model:
mtcars_fit <- lm(mpg ~ wt, mtcars_train)
# Extract the fitted model coefficients:
coef(mtcars_fit)
## (Intercept)          wt
##   37.252154   -5.541406

  16. Linear Regression Model Summary
# Check the details on the fitted model:
summary(mtcars_fit)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars_train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5302 -1.9952  0.0179  1.3017  3.5194
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   36.470      2.108  17.299 7.61e-11 ***
## wt            -5.407      0.621  -8.707 5.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.2 on 14 degrees of freedom
## Multiple R-squared:  0.8441, Adjusted R-squared:  0.833
## F-statistic: 75.81 on 1 and 14 DF,  p-value: 5.043e-07
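Individual quantities from this summary can also be extracted programmatically, e.g. in this small sketch:
s <- summary(mtcars_fit)
s$r.squared           # multiple R-squared
s$sigma               # residual standard error
confint(mtcars_fit)   # 95% confidence intervals for the coefficients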

  17. Fitted values
We can compute the fitted values \hat{y}, a.k.a. the predicted mpg values for existing observations, using the predict() function.
pred <- predict(mtcars_fit, newdata = mtcars_train)
pred
## Merc 280 Pontiac Firebird Merc 450SL
## 18.189718 15.945449 16.582710
## Fiat X1-9 Porsche 914-2 Mazda RX4 Wag
## 26.529534 25.393545 21.320612
## Merc 450SLC AMC Javelin Ford Pantera L
## 16.305640 18.217425 19.685897
## Merc 280C Dodge Challenger Volvo 142E
## 18.189718 17.746405 21.847046
## Camaro Z28 Maserati Bora Lotus Europa
## 15.973156 17.469335 28.868007
## Lincoln Continental Hornet 4 Drive Mazda RX4
## 7.195569 19.436534 22.733671
## Hornet Sportabout Ferrari Dino Honda Civic
## 18.189718 21.902460 28.302783
## Merc 240D
## 19.575069

  18. Fitted values
Alternatively, the add_predictions() function from the modelr package will automatically append the model predictions to our data frame:
mtcars_train <- mtcars_train %>% add_predictions(mtcars_fit)
tbl_df(mtcars_train)
## # A tibble: 22 x 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  pred
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4  18.2
##  2  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2  15.9
##  3  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3  16.6
##  4  27.3     4   79     66  4.08  1.94  18.9     1     1     4     1  26.5
##  5  26       4  120.    91  4.43  2.14  16.7     0     1     5     2  25.4
##  6  21       6  160    110  3.9   2.88  17.0     0     1     4     4  21.3
##  7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3  16.3
##  8  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2  18.2
##  9  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4  19.7
## 10  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4  18.2
## # ... with 12 more rows

  19. Predictions for new observations
To predict the mpg for new observations, e.g. cars not in the dataset, we first need to generate a data table with the predictors x, in this case the car weights:
newcars <- tibble(wt = c(2, 2.1, 3.14, 4.1, 4.3))
newcars <- newcars %>% add_predictions(mtcars_fit)
newcars
## # A tibble: 5 x 2
##      wt  pred
##   <dbl> <dbl>
## 1  2     26.2
## 2  2.1   25.6
## 3  3.14  19.9
## 4  4.1   14.5
## 5  4.3   13.4
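Since the data were split at the start, the held-out mtcars_test can now be used once to assess the final model, e.g. via the root-mean-square prediction error. A minimal sketch using modelr's rmse():
rmse(mtcars_fit, mtcars_test)   # RMSE of the predictions on the held-out test set
# Equivalently, by hand:
test_pred <- predict(mtcars_fit, newdata = mtcars_test)
sqrt(mean((mtcars_test$mpg - test_pred)^2))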
