SLIDE 1

Introduction to Data Science

Winter Semester 2018/19 Oliver Ernst

TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik

Lecture Slides

slide-2
SLIDE 2

Contents I

1 What is Data Science?
2 Learning Theory
  2.1 What is Statistical Learning?
  2.2 Assessing Model Accuracy
3 Linear Regression
  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors
4 Classification
  4.1 Overview of Classification
  4.2 Why Not Linear Regression?
  4.3 Logistic Regression
  4.4 Linear Discriminant Analysis
  4.5 A Comparison of Classification Methods
5 Resampling Methods

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 3 / 496

slide-3
SLIDE 3

Contents II

  5.1 Cross Validation
  5.2 The Bootstrap
6 Linear Model Selection and Regularization
  6.1 Subset Selection
  6.2 Shrinkage Methods
  6.3 Dimension Reduction Methods
  6.4 Considerations in High Dimensions
  6.5 Miscellanea
7 Nonlinear Regression Models
  7.1 Polynomial Regression
  7.2 Step Functions
  7.3 Regression Splines
  7.4 Smoothing Splines
  7.5 Generalized Additive Models
8 Tree-Based Methods
  8.1 Decision Tree Fundamentals
  8.2 Bagging, Random Forests and Boosting

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 4 / 496

slide-4
SLIDE 4

Contents III

9 Support Vector Machines

  9.1 Maximal Margin Classifier
  9.2 Support Vector Classifiers
  9.3 Support Vector Machines
  9.4 SVMs with More than Two Classes
  9.5 Relationship to Logistic Regression
10 Unsupervised Learning
  10.1 Principal Components Analysis
  10.2 Clustering Methods

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 5 / 496

slide-5
SLIDE 5

Contents

3 Linear Regression

  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 69 / 496

slide-6
SLIDE 6

Linear Regression

Advertising again

Recall advertising data set from Slide 28:

[Scatterplots of sales versus TV, radio and newspaper advertising budgets.]

We will use the simple and well-established statistical learning technique known as linear regression to answer the following questions:

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 70 / 496

slide-7
SLIDE 7

Linear Regression

Questions about advertising data set

1 Is there a relationship between advertising budget and sales?

Otherwise, why bother?

2 How strong is this relationship between advertising budget and sales?

Prediction possibly better than random guess?

3 Which media contribute to sales?

Separate individual contributions

4 How accurately can we estimate the effect of each medium on sales?

Euro by Euro?

5 How accurately can we predict future sales?

Precise prediction for each medium?

6 Is the relationship linear?

If yes, linear regression appropriate (possibly after transforming data)

7 Is there synergy among the advertising media?

Called interaction effect in statistics.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 71 / 496

slide-8
SLIDE 8

Contents

3 Linear Regression

  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 72 / 496

slide-9
SLIDE 9

Simple Linear Regression

Definition, terminology, notation

Linear model for a quantitative response Y of a single predictor X:

Y ≈ β0 + β1X. (3.1)

Statistician: “We are regressing Y onto X.”

E.g., with predictor TV advertising and response sales,

sales ≈ β0 + β1 × TV.

The values of the coefficients or parameters β0, β1 obtained by fitting to the training data are denoted by β̂0, β̂1, leading to the predicted value

ŷ = β̂0 + β̂1x (3.2)

when X = x, where the hat on ŷ denotes the predicted value of the response.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 73 / 496

slide-10
SLIDE 10

Simple Linear Regression

Estimating the coefficients

Determining the intercept β̂0 and slope β̂1 in (3.1) amounts to choosing these parameters such that the residuals or data misfits

ri := yi − ŷi = yi − (β̂0 + β̂1xi), i = 1, . . . , n,

are minimized. There are many options for defining smallness here; in least squares estimation it is measured by the residual sum of squares (RSS)

RSS := r1² + · · · + rn² = (y1 − β̂0 − β̂1x1)² + · · · + (yn − β̂0 − β̂1xn)². (3.3)

An easy calculation reveals

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,  β̂0 = ȳ − β̂1 x̄,

where x̄ := (1/n) Σ_{i=1}^n xi, ȳ := (1/n) Σ_{i=1}^n yi. (3.4)
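A minimal numerical sketch of (3.3)–(3.4), not part of the original slides; the toy budget/response numbers below are invented purely for illustration.

```python
import numpy as np

# Invented toy data: predictor x (budget) and response y (sales)
x = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
y = np.array([7.1, 10.9, 14.2, 16.0, 19.3, 21.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates (3.4)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values and residual sum of squares (3.3)
y_hat = beta0_hat + beta1_hat * x
rss = np.sum((y - y_hat) ** 2)
print(f"beta0_hat={beta0_hat:.3f}, beta1_hat={beta1_hat:.4f}, RSS={rss:.3f}")
```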

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 74 / 496

slide-11
SLIDE 11

Simple Linear Regression

Example: LS fit for advertising data

[Scatterplot of sales versus TV budget with the fitted least squares line.]

β̂0 = 7.03, β̂1 = 0.0475

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 75 / 496

slide-12
SLIDE 12

Simple Linear Regression

Example: LS fit for advertising data

LS fit of sales vs. TV budget: RSS as a function of (β0, β1)

[Left: level curves of the RSS over (β0, β1). Right: surface plot of the RSS.]

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 76 / 496

slide-13
SLIDE 13

Simple Linear Regression

Assessing the accuracy of the coefficient estimates

Linear regression yields a linear model

Y = β0 + β1X + ε (3.5)

where
β0 : intercept,
β1 : slope,
ε : model error, modeled as a centered random variable, independent of X.

Model (3.5) defines the population regression line, the best linear approximation to the true (generally unknown) relationship between X and Y.

The linear relation (3.2) containing the coefficients β̂0, β̂1 estimated from a given data set is called the least squares line.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 77 / 496

slide-14
SLIDE 14

Simple Linear Regression

Example: population regression line, least squares line

[Two scatterplots of simulated data (X, Y) with fitted lines.]

  • Left: Simulated data set (n = 100) from model f (X) = 2 + 3X.

Red line: population regression line (true model). Blue line: least squares line from data (black dots).

  • Right: Additionally ten (light blue) least squares lines obtained from ten separate

randomly generated data sets from same model; seen to average to the red line.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 78 / 496

slide-15
SLIDE 15

Simple Linear Regression

Analogy: estimation of mean

  • Standard statistical approach: use information contained in a sample to

estimate characteristics of a large (possibly infinite) population.

  • Example: approximate the population mean µ (expectation, expected value) of a random variable Y from observations y1, . . . , yn by the sample mean

µ̂ := ȳ := (1/n) Σ_{i=1}^n yi.

  • Just like µ̂ ≈ µ but, in general, µ̂ ≠ µ, the coefficients β̂0, β̂1 defining the least squares line are estimates of the true values β0, β1 of the model.

  • The sample mean µ̂ is an unbiased estimator of µ, i.e., it does not systematically over- or underestimate the true value µ. The same holds for the estimators β̂0, β̂1.

  • How accurate is µ̂ ≈ µ? The standard error⁴ of µ̂, denoted SE(µ̂), satisfies

Var µ̂ = SE(µ̂)² = σ²/n, where σ² = Var Y. (3.6)

⁴Standard deviation of the sampling distribution, i.e., the average amount µ̂ differs from µ.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 79 / 496

slide-16
SLIDE 16

Simple Linear Regression

Standard error of regression coefficients

For the regression coefficients (assuming uncorrelated observation errors)

SE(β̂0)² = σ² [ 1/n + x̄² / Σ_{i=1}^n (xi − x̄)² ],  SE(β̂1)² = σ² / Σ_{i=1}^n (xi − x̄)²,  σ² = Var ε. (3.7)

  • SE(β̂1) is smaller when the xi are more spread out (provides more leverage to estimate the slope).

  • SE(β̂0) = SE(µ̂) if x̄ = 0. (Then β̂0 = ȳ.)

  • σ is generally unknown, but can be estimated from the data by the residual standard error

RSE := √( RSS / (n − 2) ).

When the RSE is used in place of σ, one should, strictly speaking, write ŜE(β̂1).
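A small sketch, not from the slides, of how (3.7) and the RSE could be evaluated; the helper name and the toy data are invented, and the ±2·SE intervals anticipate (3.8) and (3.10) on the next slide.

```python
import numpy as np

def simple_ls_summary(x: np.ndarray, y: np.ndarray):
    """Least squares fit plus standard errors (3.7) and approximate 95% CIs."""
    n = x.size
    x_bar, y_bar = x.mean(), y.mean()
    sxx = np.sum((x - x_bar) ** 2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = np.sum((y - (beta0 + beta1 * x)) ** 2)
    rse = np.sqrt(rss / (n - 2))                 # estimate of sigma
    se_beta1 = np.sqrt(rse ** 2 / sxx)
    se_beta0 = np.sqrt(rse ** 2 * (1.0 / n + x_bar ** 2 / sxx))
    ci_beta0 = (beta0 - 2 * se_beta0, beta0 + 2 * se_beta0)   # cf. (3.10)
    ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)   # cf. (3.8)
    return beta0, beta1, rse, se_beta0, se_beta1, ci_beta0, ci_beta1

# Invented toy data for illustration
x = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
y = np.array([7.1, 10.9, 14.2, 16.0, 19.3, 21.8])
print(simple_ls_summary(x, y))
```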

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 80 / 496

slide-17
SLIDE 17

Simple Linear Regression

Confidence intervals

  • 95% confidence interval: range of values containing true unknown value of

parameter with probability 95%.

  • For linear regression: 95% CI for β1 approximately

β̂1 ± 2 · SE(β̂1), (3.8)

i.e., with probability 95%,

β1 ∈ [β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1)]. (3.9)

  • Similarly, for β0, the 95% CI is approximately given by

β̂0 ± 2 · SE(β̂0). (3.10)

  • For advertising example: with 95% probability

β0 ∈ [6.130, 7.935], β1 ∈ [0.042, 0.053].

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 81 / 496

slide-18
SLIDE 18

Simple Linear Regression

Hypothesis tests

Use the SE to test the null hypothesis

H0 : there is no relationship between X and Y (3.11)

against the alternative hypothesis

Ha : there is some relationship between X and Y (3.12)

or, mathematically,

H0 : β1 = 0 vs. Ha : β1 ≠ 0.

  • Reject H0 if β̂1 is sufficiently far from 0 relative to SE(β̂1).

  • The t-statistic

t = (β̂1 − 0) / SE(β̂1) (3.13)

measures the distance of β̂1 from 0 in # standard deviations.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 82 / 496

slide-19
SLIDE 19

Simple Linear Regression

Hypothesis tests

  • β1 = 0 implies t follows t-distribution with n − 2 degrees of freedom.
  • We compute probability of observing |t| or larger under assumption β1 = 0,

its p-value.

  • Small p-value: unlikely to observe substantial relation between X and Y

due to purely random variation, unless the two actually are related.

  • In this case we reject H0.
  • Typical cutoffs for the p-value: 5%, 1%; for n = 30 these correspond to t-statistic (3.13) values of about 2 and 2.75, respectively.

For the TV sales data in the advertising data set:

       Estimate   SE       t-statistic   p-value
β0     7.0325     0.4578   15.36         < 0.0001
β1     0.0475     0.0027   17.67         < 0.0001
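A sketch, under the same invented-data assumptions as the earlier snippets, of computing the t-statistic (3.13) and its two-sided p-value; scipy.stats.t provides the t-distribution with n − 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Invented toy data
x = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
y = np.array([7.1, 10.9, 14.2, 16.0, 19.3, 21.8])

n = x.size
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0 = y_bar - beta1 * x_bar
rse = np.sqrt(np.sum((y - (beta0 + beta1 * x)) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

# t-statistic (3.13) and its two-sided p-value under H0: beta1 = 0
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)
```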

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 83 / 496

slide-20
SLIDE 20

Simple Linear Regression

Reminder: Student’s t distribution

  • Given X1, · · · , Xn i.i.d. ∼ N(µ, σ2)
  • Sample mean:

X̄ = (1/n) Σ_{i=1}^n Xi.

  • (Bessel-corrected) sample variance:

S² = 1/(n − 1) Σ_{i=1}^n (Xi − X̄)².

  • The RV (X̄ − µ)/(σ/√n) is distributed according to N(0, 1).

  • The RV (X̄ − µ)/(S/√n) is distributed according to Student’s t-distribution with n − 1 DoF.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 84 / 496

slide-21
SLIDE 21

Simple Linear Regression

Student’s t distribution

[PDF of Student’s t-distribution for 1, 2, 5 and 30 degrees of freedom, compared with the standard normal density.]

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 85 / 496

slide-22
SLIDE 22

Simple Linear Regression

Assessing model accuracy

  • Residual standard error: estimate of the standard deviation of ε (model error)

RSE = √( RSS / (n − 2) ) = √( 1/(n − 2) Σ_{i=1}^n (yi − ŷi)² ). (3.14)

  • For the TV data RSE = 3.26, i.e., sales deviate from the true regression line on average by 3,260 units (even if the exact β0, β1 were known).

This corresponds to a 3,260/14,000 ≈ 23% error relative to the mean value of all sales.

  • RSE measures lack of model fit.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 86 / 496

slide-23
SLIDE 23

Simple Linear Regression

Assessing model accuracy

  • R2 statistic: alternative measure of fit: proportion of variance explained.
  • ∈ [0, 1], independent of scale of Y .
  • Defined in terms of the total sum of squares (TSS) as

R² = (TSS − RSS) / TSS = 1 − RSS / TSS,  TSS = Σ_{i=1}^n (yi − ȳ)². (3.15)

  • TSS : total variance in the response Y,
RSS : amount of variability left unexplained after the regression,
TSS − RSS : response variability explained by the regression model,
R² : proportion of variability in Y explained using X.

  • R² ≈ 0: linear model wrong, and/or high model error variance.
  • For the TV data R² = 0.61: just under two thirds of the sales variability is explained by (linear regression on) the TV budget.

  • R² ∈ [0, 1], but what counts as a sufficiently large value is problem dependent.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 87 / 496

slide-24
SLIDE 24

Simple Linear Regression

Correlation

  • Measure of the linear relationship between X and Y: the (sample) correlation

Cor(X, Y) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² ). (3.16)

  • In simple linear regression: Cor(X, Y)² = R².
  • Correlation expresses the association between a single pair of variables; R² between a larger number of variables in multiple linear regression.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 88 / 496

slide-25
SLIDE 25

Contents

3 Linear Regression

  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 89 / 496

slide-26
SLIDE 26

Multiple Linear Regression

Justification

  • p > 1 predictor variables

(as in advertising data set: TV, newspaper, radio)

  • Easiest option: simple linear regression for each

For the radio sales data in the advertising data set:

       Estimate   SE      t-statistic   p-value
β0     9.312      0.563   16.54         < 0.0001
β1     0.203      0.020   9.92          < 0.0001

For the newspaper sales data in the advertising data set:

       Estimate   SE      t-statistic   p-value
β0     12.351     0.621   19.88         < 0.0001
β1     0.055      0.017   3.30          0.00115

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 90 / 496

slide-27
SLIDE 27

Multiple Linear Regression

Justification

  • How to predict total sales given 3 budgets?
  • Each separate regression equation ignores the other 2 media.
  • For correlated media budgets this can lead to misleading estimates of individual media effects.

Multiple linear regression model for p predictor variables:

Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε (3.17)

βj : average effect on Y of a 1-unit increase in Xj holding the other predictors fixed.

In the advertising example:

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper (3.18)

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 91 / 496

slide-28
SLIDE 28

Multiple Linear Regression

Estimating the coefficients

  • Given estimates β̂0, β̂1, . . . , β̂p, obtain the prediction formula

ŷ = β̂0 + β̂1x1 + · · · + β̂pxp. (3.19)

  • Same fitting approach: choose {β̂j}_{j=0}^p to minimize

RSS = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1xi,1 − · · · − β̂pxi,p)², (3.20)

yielding the multiple least squares regression coefficients.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 92 / 496

slide-29
SLIDE 29

Multiple Linear Regression

Example: multiple linear regression, 2 predictors, 1 response

[3D plot of observations of a response Y over two predictors X1, X2.]

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 93 / 496

slide-30
SLIDE 30

Multiple Linear Regression

Numerical methods for least squares fitting

  • Determining the coefficients {β̂j}_{j=0}^p that minimize the RSS in (3.20) is equivalent to minimizing ‖y − X β̂‖₂², where we have introduced the notation

y = (y1, . . . , yn)ᵀ ∈ Rⁿ,  β̂ = (β̂0, . . . , β̂p)ᵀ ∈ R^{p+1},

and X ∈ R^{n×(p+1)} is the matrix of predictor observations whose i-th row is (1, xi,1, . . . , xi,p).

  • The problem of finding a vector x ∈ Rⁿ such that b ≈ Ax for given A ∈ R^{m×n} and b ∈ Rᵐ is called a linear regression problem.

  • One (of many) possible approaches for achieving this is choosing x to minimize ‖b − Ax‖₂, which is a linear least squares problem.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 94 / 496

slide-31
SLIDE 31

Multiple Linear Regression

Numerical methods for least squares fitting

  • A somewhat more general fitting approach using a model

y ≈ β0 + β1 f1(x) + · · · + βp fp(x)

with fixed regression functions {fj}_{j=1}^p also leads to a linear regression problem, where now [X]i,j = fj(xi).

  • A linear least squares problem ‖b − Ax‖₂ → min with m ≥ n has a unique solution if the columns of A are linearly independent, i.e., when A has full rank, given by x = (AᵀA)⁻¹Aᵀb. In this case the solution can be computed using a Cholesky decomposition.

  • In the (nearly) rank-deficient case, more sophisticated techniques of numerical linear algebra like the QR decomposition or the SVD are required to obtain a (stable) solution.

  • When A is large and sparse or structured, iterative methods such as CGLS or LSQR can be employed, which require only matrix-vector products in place of manipulations of matrix entries.
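A minimal sketch, not from the slides, of the matrix formulation: build a design matrix with an intercept column and solve the least squares problem with NumPy's SVD-based np.linalg.lstsq; the data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: n observations of p = 2 predictors plus noise
n, p = 100, 2
Z = rng.uniform(0.0, 10.0, size=(n, p))          # predictor observations
y = 3.0 + 1.5 * Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(0.0, 1.0, size=n)

# Design matrix X with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), Z])

# Least squares solution of min ||y - X b||_2 (SVD-based, rank-revealing)
beta_hat, resid, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [3.0, 1.5, -0.5]
```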

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 95 / 496

slide-32
SLIDE 32

Multiple Linear Regression

Advertising data

                 Estimate   SE       t-statistic   p-value
β0               2.939      0.3119   9.42          < 0.0001
β1 (TV)          0.046      0.0014   32.81         < 0.0001
β2 (radio)       0.189      0.0086   21.89         < 0.0001
β3 (newspaper)   −0.001     0.0059   −0.18         0.8599

  • Newspaper slope differs from simple regression.

Small estimate, p-value no longer significant.

  • Now no relation between sales and newspaper budget. Contradiction?

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 96 / 496

slide-33
SLIDE 33

Multiple Linear Regression

Advertising data

Correlation matrix:

             TV       radio    newspaper   sales
TV           1.0000   0.0548   0.0567      0.7822
radio                 1.0000   0.3541      0.5762
newspaper                      1.0000      0.2283
sales                                      1.0000

  • Correlation between newspaper and radio: ≈ 0.35:

Tend to spend more on radio ads where more is spent on newspaper ads.

  • If correct, i.e., βnewspaper ≈ 0, βradio > 0, radio increases sales, and where the radio budget is high, the newspaper budget tends to also be high.

  • Simple linear regression: indicates newspaper associated with higher sales.

Multiple regression reveals no such effect.

  • Newspaper receives credit for radio’s effect on sales.

Sales due to newspaper advertising is a surrogate for sales due to radio advertising.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 97 / 496

slide-34
SLIDE 34

Multiple Linear Regression

Absurd example, same effect

  • Counterintuitive but not uncommon. Consider following (absurd) example.
  • Data on shark attacks versus ice cream sales at a beach community would show a similar positive relationship as newspaper and radio ads.

  • Should one ban ice cream sales to reduce risk of shark attacks?
  • Answer: High temperatures cause both (more people at beach for shark

encounters, more ice cream customers).

  • Multiple regression reveals ice cream sales are not a predictor for shark attacks after adjusting for temperature.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 98 / 496

slide-35
SLIDE 35

Multiple Linear Regression

Questions to consider

1 Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?

2 Do all predictors help to explain Y, or is only a subset of the predictors useful?

3 How well does the model fit the data?

4 Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 99 / 496

slide-36
SLIDE 36

Multiple Linear Regression

(1) Is there a relationship between response and predictors?

  • As for simple regression, perform a statistical hypothesis test: null hypothesis

H0 : β1 = β2 = · · · = βp = 0

versus the alternative

Ha : at least one βj (j = 1, . . . , p) is nonzero.

  • Such a test can be based on the F-statistic

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)] (3.21)

where, as before, TSS = Σ_{i=1}^n (yi − ȳ)², RSS = Σ_{i=1}^n (yi − ŷi)².

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 100 / 496

slide-37
SLIDE 37

Multiple Linear Regression

(1) Is there a relationship between response and predictors?

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]

  • Under the linear model assumption, one can show

E[ RSS / (n − p − 1) ] = σ².

  • If in addition H0 is true, one can show

E[ (TSS − RSS) / p ] = σ².

  • Hence F ≈ 1 if there is no relationship between response and predictors.

Alternatively, if Ha is true, E[(TSS − RSS)/p] > σ², hence F > 1.
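A small sketch, under the same invented-data assumptions as the earlier snippets, of evaluating (3.21) and its p-value; scipy.stats.f provides the F-distribution with (p, n − p − 1) degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented data: n observations, p predictors
n, p = 100, 3
Z = rng.normal(size=(n, p))
y = 2.0 + 1.0 * Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), Z])
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)

# F-statistic (3.21) and its p-value under H0
F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)
print(F, p_value)
```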

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 101 / 496

slide-38
SLIDE 38

Multiple Linear Regression

(1) Is there a relationship between response and predictors?

Statistics for multiple regression of sales onto radio, TV and newspaper in the advertising data set: Quantity Value RSE 1.69 R2 0.897 F 570

  • F ≫ 1 strong evidence against H0.
  • Proper threshold value for F depends on n, p.

Larger F needed to reject H0 for small n.

  • H0 true, εi Gaussian, then F follows F-distribution; calculate p-value using

statistical software.

  • Here, the p-value is ≈ 0 for this F value in this example, hence we safely reject H0.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 102 / 496

slide-39
SLIDE 39

Multiple Linear Regression

(1) Is there a relationship between response and predictors?

  • To test whether the subset of the last q < p coefficients is relevant, use the null hypothesis H0 : βp−q+1 = βp−q+2 = · · · = βp = 0.

  • Fit the model using all variables except the last q, obtaining the residual sum of squares RSS0.

  • The appropriate F-statistic is now

F = [(RSS0 − RSS)/q] / [RSS/(n − p − 1)].

  • For multiple regression, t-statistic and p values for each variable indicate

whether each predictor related to response after adjusting for the remaining variables. Equivalent to F-test omitting single variable (q = 1). Reports partial effect of adding each variable.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 103 / 496

slide-40
SLIDE 40

Multiple Linear Regression

(1) Is there a relationship between response and predictors?

What does F statistic tell us that individual p-values don’t?

  • Does single small p-value indicate at least one variable relevant? No.
  • Example: p = 100, H0 : β1 = · · · = βp = 0 true.

Then by chance, 5% of p-values below 0.05. Almost guaranteed that p < 0.05 for at least one variable by chance.

  • Thus, for large p, looking only at p-values of individual t-statistics tends to

discover spurious relationships.

  • For the F-statistic, if H0 is true, there is only a 5% chance of a p-value < 0.05, independently of n, p.

Note: F-statistic approach works for p < n.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 104 / 496

slide-41
SLIDE 41

Multiple Linear Regression

(2) Deciding on important variables

  • Typically, not all predictors related to response

(variable selection problem).

  • One approach: try all possible models, select best one. Criteria?

Mallow’s Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC) (later)

  • For p large, trying all 2^p models with subsets of variables is impractical.
  • Forward selection: Start with the null model (only β0), fit p simple regressions, add the variable leading to the lowest RSS, then add the variable leading to the two-variable model with the lowest RSS, and continue until a stopping criterion is met (see the sketch after this list).

  • Backward selection: Start with the full model, remove the variable with the largest p-value, fit the new (p − 1)-variable model, and keep removing the least significant variable until a stopping criterion is met.

  • Mixed selection: Start with the null model, adding variables with the best fit one by one; remove a variable whenever its p-value rises above a threshold, until the model contains only variables with low p-values and excludes those with high p-values.
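A compact sketch (not from the slides) of forward selection by RSS using NumPy; the stopping rule here is simply a fixed number of variables, and the data are invented.

```python
import numpy as np

def rss_of(X, y):
    """RSS of the least squares fit of y on X (X already contains an intercept column)."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

def forward_selection(Z, y, max_vars):
    """Greedily add the predictor (column of Z) that lowers the RSS the most."""
    n, p = Z.shape
    selected, remaining = [], list(range(p))
    X = np.ones((n, 1))                      # null model: intercept only
    for _ in range(max_vars):
        scores = [(rss_of(np.column_stack([X, Z[:, [j]]]), y), j) for j in remaining]
        best_rss, best_j = min(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        X = np.column_stack([X, Z[:, [best_j]]])
    return selected

rng = np.random.default_rng(2)
Z = rng.normal(size=(200, 6))
y = 1.0 + 2.0 * Z[:, 3] - 1.0 * Z[:, 0] + rng.normal(scale=0.5, size=200)
print(forward_selection(Z, y, max_vars=2))   # expected to pick columns 3 and 0
```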

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 105 / 496

slide-42
SLIDE 42

Multiple Linear Regression

(3) Model fit

RSE, R2 computed and interpreted as in simple linear regression.

  • R2 = Cor(X, Y )2 for simple linear regression.
  • R² = Cor(Ŷ, Y)² for multiple linear regression, maximized by the fitted model.

  • R² ≈ 1: model explains a large portion of the response variance.
  • Advertising example:

{TV, radio, newspaper}: R² = 0.8972
{TV, radio}: R² = 0.89719

Small increase on including newspaper (even though newspaper is not significant).

  • Note: R2 always increases when variables are added.
  • Tiny increase in R2 on including newspaper more evidence this variable

can be dropped.

  • Including redundant variables promotes overfitting.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 106 / 496

slide-43
SLIDE 43

Multiple Linear Regression

(3) Model fit

  • Advertising example:

{TV}: R² = 0.61
{TV, radio}: R² = 0.89719

Substantial improvement on adding radio. (Could also look at the p-value of radio’s coefficient in the last model.)

  • Advertising example:

{TV, radio, newspaper}: RSE = 1.686
{TV, radio}: RSE = 1.681
{TV}: RSE = 3.26

  • Note: for multiple linear regression the RSE is defined as

RSE = √( RSS / (n − p − 1) ).

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 107 / 496

slide-44
SLIDE 44

Multiple Linear Regression

(3) Model fit

{TV, radio}

[3D plot of sales over TV and radio budgets with the fitted least squares regression plane.]

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 108 / 496

slide-45
SLIDE 45

Multiple Linear Regression

(3) Model fit

Previous figure:

  • Some observations above, some below least squares regression plane.
  • Linear model overestimates sales where most of the budget is spent exclusively on either TV or radio.

  • Underestimation where the budget is split between the two media.
  • Such a nonlinear pattern is not reflected by the linear model; it suggests a synergy effect between these two media.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 109 / 496

slide-46
SLIDE 46

Multiple Linear Regression

(4) Predictions

We note three sources of prediction uncertainty:

1 Reducible error: Ŷ ≈ f(X) since β̂j ≈ βj. Can construct confidence intervals to ascertain the closeness of Ŷ to f(X).

2 Model bias: the linear model can only yield the best linear approximation.

3 Irreducible error: Y = f(X) + ε.

Assess prediction error with prediction intervals: these incorporate both reducible and irreducible errors.

Example: Prediction using the {TV, radio} model with XTV = 100,000 $, Xradio = 20,000 $.
Confidence interval on sales: 95% confidence interval [10.985, 11.528].
Prediction interval on sales: 95% prediction interval [7.930, 14.580].
Increased uncertainty about sales for a given city in contrast with average sales over many locations.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 110 / 496

slide-47
SLIDE 47

Contents

3 Linear Regression

  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 111 / 496

slide-48
SLIDE 48

Other Considerations in the Regression Model

Qualitative predictors

Credit data set:

  • Quantitative predictors:
  • balance: average credit card debt for a number of individuals
  • age
  • cards (# credit cards)
  • education (years of education)
  • income (in thousands of dollars)
  • limit (credit limit)
  • rating (credit rating)
  • Qualitative predictors:
  • gender
  • student (student status)
  • status (marital status)
  • ethnicity (Caucasian, African American or Asian)

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 112 / 496

slide-49
SLIDE 49

Other Considerations in the Regression Model

Qualitative predictors

[Scatterplot matrix of the quantitative Credit variables: balance, age, cards, education, income, limit, rating.]

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 113 / 496

slide-50
SLIDE 50

Other Considerations in the Regression Model

Two-valued predictors

  • Goal: investigate differences in credit card balance between males/females.
  • Gender (qualitative variable, factor) represented with an indicator (dummy variable)

xi = 1 if the i-th person is female, 0 if the i-th person is male. (3.22)

  • Using xi in the regression equation results in the model

yi = β0 + β1xi + εi = { β0 + β1 + εi if the i-th person is female, β0 + εi if the i-th person is male }. (3.23)

  • Interpretation:

β0 : average credit card balance among males,
β0 + β1 : average credit card balance among females,
β1 : average difference in credit card balance female vs. male.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 114 / 496

slide-51
SLIDE 51

Other Considerations in the Regression Model

Two-valued predictors

       Coefficient   Standard error   t-statistic   p-value
β0     509.80        33.13            15.389        < 0.0001
β1     19.73         46.05            0.429         0.6690

  • Average credit card debt males: $509.80.
  • Average additional credit card debt females: $19.73.
  • Total average female credit card debt: $529.53.
  • High p value for dummy variable. Conclusion?

Gender not a statistically significant factor for credit card debt.

  • Switching male/female coding yields estimates

β̂0 = $529.53, β̂1 = −$19.73, β̂0 + β̂1 = $509.80.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 115 / 496

slide-52
SLIDE 52

Other Considerations in the Regression Model

Two-valued predictors

Another alternative coding of the two-valued gender predictor:

xi = 1 if the i-th person is female, −1 if the i-th person is male.

Results in the model

yi = β0 + β1xi + εi = { β0 + β1 + εi if the i-th person is female, β0 − β1 + εi if the i-th person is male },

with interpretation
β0 : average credit card balance (ignoring gender),
β1 : amount females are above / males below this average,
giving estimates
β̂0 = $519.665 (half way between the male and female averages),
β̂1 = $9.865 (half of $19.73, the average male/female difference).

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 116 / 496

slide-53
SLIDE 53

Other Considerations in the Regression Model

Multi-valued qualitative predictors

To encode ethnicity ∈ {Caucasian, African American, Asian}, use multiple dummy variables (# values − 1):

xi,1 = 1 if the i-th person is Asian, 0 if not Asian, (3.24)
xi,2 = 1 if the i-th person is Caucasian, 0 if not Caucasian, (3.25)

resulting in the model

yi = β0 + β1xi,1 + β2xi,2 + εi =
  β0 + β1 + εi if the i-th person is Asian,
  β0 + β2 + εi if the i-th person is Caucasian,
  β0 + εi if the i-th person is African American. (3.26)

Interpretation:
β0 : average credit card balance for African Americans (baseline),
β1 : difference between Asians and African Americans,
β2 : difference between Caucasians and African Americans.
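A small sketch (assuming pandas is available; the miniature data frame is invented) of how such dummy variables can be generated with pandas.get_dummies, dropping one level to obtain the baseline coding used above.

```python
import pandas as pd

# Invented miniature version of the Credit data
df = pd.DataFrame({
    "balance":   [531, 512, 480, 525, 507],
    "ethnicity": ["African American", "Asian", "Caucasian", "Asian", "Caucasian"],
})

# One dummy column per level except the first (alphabetically sorted) level,
# which becomes the baseline; here that is "African American", cf. (3.24)-(3.26)
dummies = pd.get_dummies(df["ethnicity"], prefix="eth", drop_first=True)
X = pd.concat([df[["balance"]], dummies], axis=1)
print(X)
```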

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 117 / 496

slide-54
SLIDE 54

Other Considerations in the Regression Model

Multi-valued qualitative predictors

                 Coefficient   Standard error   t-statistic   p-value
β0               531.00        46.32            11.464        < 0.0001
β1 (Asian)       −18.69        65.02            −0.287        0.7740
β2 (Caucasian)   −12.50        56.68            −0.221        0.8260

  • Estimated balance for African Americans (baseline): $531.00.
  • Asians estimated to have $18.69 less debt than African Americans.
  • Caucasians estimated to have $12.50 less debt than African Americans.
  • β1, β2 have high p-values, indicating no statistical significance for ethnicity

as factor in credit card balance.

  • Coefficients and p-values depend on coding, result does not.

F-test to reject H0 : β1 = β2 = 0 has p-value 0.96 (cannot reject).

  • Dummy variable approach works for combining qualitative and quantitative

predictors. (Other coding schemes for qualitative variables possible.)

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 118 / 496

slide-55
SLIDE 55

Other Considerations in the Regression Model

Extending the linear model

  • Restrictive assumptions in linear model: linearity, additivity.
  • Additivity: effect on Y of changing Xj independent of remaining variables.
  • Linearity: rate of change in Y with respect to Xj constant in Xj.
  • Recall the advertising data set: indication that a higher radio budget made the effect of TV spending stronger (interaction effect, synergy).
  • Add an interaction term to the two-predictor model:

Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
  = β0 + (β1 + β3X2)X1 + β2X2 + ε
  = β0 + β̃1X1 + β2X2 + ε,  β̃1 := β1 + β3X2.

β̃1 changes with X2, hence the effect of X1 on Y changes with X2.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 119 / 496

slide-56
SLIDE 56

Other Considerations in the Regression Model

Extending the linear model: factory example

Example: factory productivity.

  • Predict # produced units based on # production lines and # workers.
  • Expected: the effect of an increase in # production lines will depend on # workers.
  • In the linear model for units, include an interaction term between lines and workers:

units ≈ 1.2 + 3.4 × lines + 0.22 × workers + 1.4 × (lines × workers)
      = 1.2 + (3.4 + 1.4 × workers) × lines + 0.22 × workers

  • Adding additional line will increase # produced units by 3.4+1.4×workers.

The more workers, the stronger the effect of adding a line.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 120 / 496

slide-57
SLIDE 57

Other Considerations in the Regression Model

Extending the linear model: advertising example

Linear model for sales predicted by interacting TV and radio terms:

sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε
      = β0 + (β1 + β3 × radio) × TV + β2 × radio + ε (3.27)

Interpretation of β3 : increase in the effectiveness of TV advertising for a one-unit increase in radio advertising.

     Coefficient   Standard error   t-statistic   p-value
β0   6.7502        0.248            27.23         < 0.0001
β1   0.0191        0.002            12.70         < 0.0001
β2   0.0289        0.009            3.24          0.0014
β3   0.0011        0.000            20.73         < 0.0001

  • Model with interaction term superior to that including only main effects.
  • Low p-value of interaction term strong evidence for rejecting H0 : β3 = 0.
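A sketch, with invented data, of fitting a model with an interaction term as in (3.27): the product column TV × radio is simply appended to the design matrix before the least squares solve.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented budgets and a response with a true interaction effect
n = 200
tv = rng.uniform(0.0, 300.0, size=n)
radio = rng.uniform(0.0, 50.0, size=n)
sales = 6.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, TV, radio and the interaction TV * radio
X = np.column_stack([np.ones(n), tv, radio, tv * radio])
beta_hat, _, _, _ = np.linalg.lstsq(X, sales, rcond=None)
print(beta_hat)   # approximately [6.0, 0.02, 0.03, 0.001]
```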

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 121 / 496

slide-58
SLIDE 58

Other Considerations in the Regression Model

Extending the linear model: advertising example

  • Model (3.27) has R2 = 96.8%

(vs. R2 = 89.7% for model without interaction term).

  • Interpretation: of the variability remaining after fitting the model without the interaction term,

(96.8% − 89.7%) / (100% − 89.7%) = 69%

is explained by model (3.27), which includes the interaction term.

  • A $1000 increase in the TV budget is associated with a sales increase of
(β̂1 + β̂3 × radio) × 1000 = 19 + 1.1 × radio units.
A $1000 increase in the radio budget is associated with a sales increase of
(β̂2 + β̂3 × TV) × 1000 = 29 + 1.1 × TV units.

  • Hierarchical principle: for every interaction term, include all associated main effects, even if the p-values of their coefficients are not significant.
Rationale: If X1X2 is related to the response, whether the coefficients of X1, X2 vanish is unimportant. X1X2 is typically correlated with X1, X2; leaving these out alters the meaning of the interaction.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 122 / 496

slide-59
SLIDE 59

Other Considerations in the Regression Model

Extending the linear model: credit example

Credit data set: predict balance using income (quantitative) and student (qualitative).

Without interaction term:

balancei ≈ β0 + β1 × incomei + { β2 if the i-th person is a student, 0 otherwise }
         = β1 × incomei + { β0 + β2 if the i-th person is a student, β0 otherwise }. (3.28)

  • Results in fitting two parallel lines to the data (one each for students and non-students).

  • Parallel implies: the average effect on balance of a one-unit increase in income is independent of student status.

  • Reflects a model shortcoming: a change in income may have a very different effect on credit card balance for students and non-students.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 123 / 496

slide-60
SLIDE 60

Other Considerations in the Regression Model

Extending the linear model: credit example

[Plots of balance versus income with separate fitted lines for students and non-students.]

With interaction term: multiply income with the dummy variable for student:

balancei ≈ β0 + β1 × incomei + { β2 + β3 × incomei if the i-th person is a student, 0 otherwise }
         = { (β0 + β2) + (β1 + β3) × incomei if the i-th person is a student, β0 + β1 × incomei otherwise }. (3.29)

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 124 / 496

slide-61
SLIDE 61

Other Considerations in the Regression Model

Extending the linear model: credit example

  • Now the two lines have different intercepts and different slopes.
  • Slope for students lower, indicates increases in income associated with

smaller increase in credit card balance than for non-students.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 125 / 496

slide-62
SLIDE 62

Other Considerations in the Regression Model

Extending the linear model: nonlinear relationships

Polynomial regression vs. linear regression:

[Scatterplot of miles per gallon versus horsepower with fitted curves: linear, degree 2 and degree 5 polynomial fits.]

Auto data set showing mpg (miles per gallon) versus horsepower for different cars.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 126 / 496

slide-63
SLIDE 63

Other Considerations in the Regression Model

Extending the linear model: nonlinear relationships

Since the data seem to suggest a curved relationship, add a quadratic term:

mpg = β0 + β1 × horsepower + β2 × horsepower² + ε. (3.30)

     Coefficient   Standard error   t-statistic   p-value
β0   56.9001       1.8004           31.6          < 0.0001
β1   −0.4662       0.0311           −15.0         < 0.0001
β2   0.0012        0.0001           10.1          < 0.0001

  • Linear fit has R2 = 0.606, quadratic fit has R2 = 0.688.
  • p-value for quadratic term highly significant.
  • Degree 5 fit more oscillatory, doesn’t appear to explain data any better

than quadratic.
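A sketch (invented data, model (3.30)) showing that polynomial regression is still linear regression: the quadratic term is just another column of the design matrix. np.polyfit would return the same coefficients in reverse (highest-degree-first) order.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented data with a quadratic trend, loosely mimicking mpg vs. horsepower
horsepower = rng.uniform(50.0, 230.0, size=300)
mpg = 57.0 - 0.47 * horsepower + 0.0012 * horsepower ** 2 + rng.normal(scale=2.0, size=300)

# Design matrix for (3.30): intercept, horsepower, horsepower^2
X = np.column_stack([np.ones_like(horsepower), horsepower, horsepower ** 2])
beta_hat, _, _, _ = np.linalg.lstsq(X, mpg, rcond=None)

# R^2 of the quadratic fit
rss = np.sum((mpg - X @ beta_hat) ** 2)
tss = np.sum((mpg - mpg.mean()) ** 2)
print(beta_hat, 1 - rss / tss)
```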

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 127 / 496

slide-64
SLIDE 64

Other Considerations in the Regression Model

Potential problems

Most common problems when fitting a linear regression model to a data set: (identification and solution as much an art as a science)

1 Nonlinear dependence of the response on the predictors
2 Correlated error terms
3 Non-constant variance of error terms
4 Outliers
5 High-leverage points
6 Collinearity

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 128 / 496

slide-65
SLIDE 65

Other Considerations in the Regression Model

Potential problems: (1) Nonlinear dependence

Inference and prediction from linear regression model suspect when true model nonlinear.

  • Identifying nonlinearity is aided by residual plots of

ei = yi − ŷi against the predictors xi.

  • For multiple regression models, plot the residuals against the predicted (fitted) values ŷi.

  • Ideal picture: no discernible pattern.
  • Pattern indicates possible problem with model.
  • When nonlinearity is suggested, introduce nonlinear functions of predictors

as regression functions into the regression model.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 129 / 496

slide-66
SLIDE 66

Other Considerations in the Regression Model

Potential problems: (1) Nonlinear dependence

[Two residual plots (residuals versus fitted values): “Residual Plot for Linear Fit” and “Residual Plot for Quadratic Fit”, with a few extreme observations labeled.]

Residuals versus predicted values for the Auto data set. The red line is a smooth fit to the residuals to aid in identifying trends. Left: linear regression of mpg on horsepower (strong pattern). Right: linear regression of mpg on horsepower and horsepower² (little pattern).

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 130 / 496

slide-67
SLIDE 67

Other Considerations in the Regression Model

Potential problems: (2) Correlated error terms

  • Linear regression assumes uncorrelated errors εi.
  • Computation of the SE for coefficient estimates and fitted values is based on this assumption. Otherwise the estimated SE tend to underestimate the true SE; confidence and prediction intervals are too optimistic (narrow), and p-values lower than they should be.

  • Extreme example: double the data (observations and error terms identical in pairs). The SE calculations then use sample size 2n in place of n, hence CIs are narrower by a factor of √2.

  • Detection for time series: plot the residuals as a function of time. No correlations implies no visible pattern; correlations lead to tracking of the residuals.
  • Example (next slide): time series with error correlation ρ = 0, 0.5, 0.9.
  • Example: study of persons’ heights predicted from their weights.

The uncorrelatedness assumption is violated if, e.g., individuals are related, or share the same diet or environmental factors.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 131 / 496

slide-68
SLIDE 68

Other Considerations in the Regression Model

Potential problems: (2) Correlated error terms

[Residual time series plots for error correlations ρ = 0.0, 0.5 and 0.9, residuals plotted against observation index.]

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 132 / 496

slide-69
SLIDE 69

Other Considerations in the Regression Model

Potential problems: (3) Non-constant variance of error terms

  • SE, CI, hypothesis tests associated with linear model rely on assumption

Var εi = σ2 (∀i).

  • Non-constant error variance (heteroscedasticity), e.g. increasing with the response value, leads to a funnel-shaped residual plot.

  • Possible solution: transform the response Y using a concave function such as log Y or √Y; this damps the larger responses, reducing heteroscedasticity.

  • When the variation of the response variance is known, e.g., the i-th response is the average of ni uncorrelated observations with variance σ², then this average has variance σi² = σ²/ni. Remedy: weighted least squares with weights proportional to the inverse variances.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 133 / 496

slide-70
SLIDE 70

Other Considerations in the Regression Model

Potential problems: (3) Non-constant variance of error terms

[Two residual plots (residuals versus fitted values): “Response Y” and “Response log(Y)”, with a few extreme observations labeled.]

Residual plots. Red: smooth fit of the residuals. Blue: tracks the outer quantiles of the residuals. Left: funnel shape indicating heteroscedasticity. Right: after log-transforming the response, the heteroscedasticity is removed.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 134 / 496

slide-71
SLIDE 71

Other Considerations in the Regression Model

Potential problems: (4) Outliers

  • Outlier: point where yi far from value predicted by model.
  • Possible causes: observation errors.

[Three panels: scatterplot of Y versus X with fitted lines, residuals versus fitted values, and studentized residuals versus fitted values; observation 20 is the outlier.]

Left: red solid line: least squares line with outlier, blue: without. Center: Residual plot identifies outlier. Right: Outlier seen to have studentized residual (divide ei by its estimated standard error) of 6 (between −3 and 3 expected). R2 declines from 0.892 to 0.805 on including outlier.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 135 / 496

slide-72
SLIDE 72

Other Considerations in the Regression Model

Potential problems: (5) High-leverage points

  • Outliers: observations where yi is unusual given xi.
  • Observations with high leverage have unusual value for xi.
  • If least squares line strongly affected by certain points, problems with these

may invalidate entire fit, hence important to identify such observations.

  • Simple linear regression: extremal x-values; multiple linear regression: within the range of each individual observation coordinate, but unusual as a combination (difficult to detect for more than two predictors).

  • A large value of the leverage statistic indicates high leverage. For simple linear regression:

hi = 1/n + (xi − x̄)² / Σ_{i′=1}^n (xi′ − x̄)² ∈ [1/n, 1]. (3.31)

The average value is always (p + 1)/n; deviation from the average indicates high leverage.
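A sketch (invented data) of computing leverage values: for simple regression via formula (3.31), or generally as the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ, which reduces to (3.31) in the simple case.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
x[-1] = 6.0                      # one artificial high-leverage point

# Leverage statistic (3.31) for simple linear regression
h_simple = 1.0 / x.size + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# General case: diagonal of the hat matrix H = X (X^T X)^{-1} X^T
X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h_general = np.diag(H)

print(np.allclose(h_simple, h_general))   # True: both give the same leverages
print(h_simple.max(), (1 + 1) / x.size)   # largest leverage vs. average (p+1)/n
```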

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 136 / 496

slide-73
SLIDE 73

Other Considerations in the Regression Model

Potential problems: (5) High-leverage points

−2 −1 1 2 3 4 5 10 20 41 −2 −1 1 2 −2 −1 1 2 0.00 0.05 0.10 0.15 0.20 0.25 −1 1 2 3 4 5 Leverage Studentized Residuals 20 41

X Y X1 X2

Left: Same data as previous figure, with added observation 41 (red) of high leverage. Red solid line is least squares fit with, blue dashed without observation 41. Center: two predictor variables, most observations within blue dashed ellipse, red ob- servation distinctly outside. Right: same data as in left panel, studentized redisulas vs. leverage statistic. Observa- tion 41 has high leverage and high residual, i.e., outlier and high-leverage point. Outlier observation 20 has low leverage.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 137 / 496

slide-74
SLIDE 74

Other Considerations in the Regression Model

Potential problems: (6) Collinearity

Collinearity: two or more predictor variables closely related.

[Two scatterplots from the Credit data set.]

From Credit data set. Left: limit vs. age. Right: limit vs. rating (strongly collinear).

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 138 / 496

slide-75
SLIDE 75

Other Considerations in the Regression Model

Potential problems: (6) Collinearity

Difficult to separate individual effects of collinear variables on response.

[Contour plots of the RSS over (βLimit, βAge) and (βLimit, βRating).]

Contour plot of the RSS associated with different coefficient estimates for the Credit data set. Axes scaled to include 4 SE on either side of the optimum. Left: for regression of balance on limit and age. Right: for regression of balance on limit and rating.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 139 / 496

slide-76
SLIDE 76

Other Considerations in the Regression Model

Potential problems: (6) Collinearity

  • Collinearity increases the SE, hence reduces the t-statistic, and we will more likely fail to reject H0 : βj = 0. This reduces the power of the hypothesis test, i.e., the probability of correctly detecting a nonzero coefficient.

                        Coefficient   Standard error   t-statistic   p-value
Model 1   β0            −173.411      43.828           −3.957        < 0.0001
          β1 (age)      −2.292        0.672            −3.407        0.0007
          β2 (limit)    0.173         0.005            34.496        < 0.0001
Model 2   β0            −377.537      45.254           −8.343        < 0.0001
          β1 (rating)   2.202         0.952            2.312         0.0213
          β2 (limit)    0.025         0.064            0.384         0.7012

  • Model 1: age and limit are both highly significant.

Model 2: collinearity between rating and limit increases the SE for the limit coefficient by a factor of 12, and the p-value increases to 0.701. Collinearity masks the importance of the limit variable.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 140 / 496

slide-77
SLIDE 77

Other Considerations in the Regression Model

Potential problems: (6) Collinearity

  • Important to detect collinearity when fitting a model.
  • Correlation matrix may give indication.
  • Multicollinearity: collinearity between 3 or more variables, which may exist even when all pairwise correlations are low.

  • Variance inflation factor (VIF): ratio of the variance of β̂j when fitting the full model to the variance of β̂j when fitted separately.

  • VIF ≥ 1, with the minimum attained in the complete absence of collinearity.

Problematic if the VIF exceeds 5 or 10.

  • VIF(β̂j) = 1 / (1 − R²_{Xj|X−j}), where R²_{Xj|X−j} is the R² from the regression of Xj onto all other predictors (see the sketch after this list).

  • In Credit data example: predictors have VIF values of 1.01, 160.67, 160.59.
  • Remedies: drop problematic variables, combine collinear variables into single

predictor.
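A small sketch (invented data) of the VIF computation described above: regress each predictor on the remaining ones and form 1/(1 − R²).

```python
import numpy as np

def vif(Z):
    """Variance inflation factor of each column of the predictor matrix Z."""
    n, p = Z.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(Z, j, axis=1)
        X = np.column_stack([np.ones(n), others])     # regress X_j on the rest
        beta, _, _, _ = np.linalg.lstsq(X, Z[:, j], rcond=None)
        resid = Z[:, j] - X @ beta
        r2 = 1 - np.sum(resid ** 2) / np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(6)
z1 = rng.normal(size=500)
z2 = rng.normal(size=500)
z3 = z1 + 0.05 * rng.normal(size=500)      # strongly collinear with z1
print(vif(np.column_stack([z1, z2, z3])))  # large VIF for columns 0 and 2
```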

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 141 / 496

slide-78
SLIDE 78

Contents

3 Linear Regression

  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 142 / 496

slide-79
SLIDE 79

Revisiting the Marketing Data Questions

Recall the seven questions relating to the Advertising data set we set out to answer on Slide 71:

1 Is there a relationship between advertising budget and sales?
2 How strong is this relationship between advertising budget and sales?
3 Which media contribute to sales?
4 How accurately can we estimate the effect of each medium on sales?
5 How accurately can we predict future sales?
6 Is the relationship linear?
7 Is there synergy among the advertising media?

We revisit each in turn.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 143 / 496

slide-80
SLIDE 80

Revisiting the Marketing Data Questions

1 Is there a relationship between advertising budget and sales?

  • Fit multiple regression model of sales onto TV, radio and newspaper.
  • Test hypothesis

H0 : βTV = βradio = βnewspaper = 0.

  • Rejection/non-rejection based on F-statistic (Slide 100).
  • For advertising data: low p-value of F-statistic (table on Slide 102) strong

evidence for rejecting H0.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 144 / 496

slide-81
SLIDE 81

Revisiting the Marketing Data Questions

2 How strong is this relationship between advertising budget and sales?

  • Measure of model error: RSE (see Slide 80), estimates the standard deviation of the response from the (true) population regression line.
  • Advertising data:

For the multiple regression model of sales on TV and radio, RSE = 1,681 units (Slide 107). Relative to the response sample mean of 14,022 units, this is an error of 12%.

  • Measure of model error: R2 (Slide 87), measures proportion of response

variability explained by model.

  • Advertising data:

For multiple regression model of sales on TV, radio and newspaper, R2 = 0.897, i.e., ≈ 90% of sales variability explained by multiple linear regression model (Slide 102).

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 145 / 496

slide-82
SLIDE 82

Revisiting the Marketing Data Questions

3 Which media contribute to sales?

  • p-values of t-statistic in multiple regression model of sales on TV, radio

and newspaper: small for TV and radio, large for newspaper.

  • Suggest only TV and radio budgets related to sales.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 146 / 496

slide-83
SLIDE 83

Revisiting the Marketing Data Questions

4 How accurately can we estimate the effect of each medium on sales?

  • Confidence intervals for βj constructed from the SE of β̂j.

  • Advertising data: 95% confidence intervals for the multiple regression coefficients are

TV: (0.043, 0.049)
radio: (0.172, 0.206)
newspaper: (−0.013, 0.011)

  • Wide SE due to collinearity? (Slide 138).

VIF scores for TV, radio and newspaper are 1.005, 1.145, 1.145, so not likely.

  • Separate simple regressions of sales on TV, radio and newspaper show a strong association of TV and radio with sales, and a mild association of newspaper with sales, when the remaining two predictors are ignored.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 147 / 496

slide-84
SLIDE 84

Revisiting the Marketing Data Questions

5 How accurately can we predict future sales?

  • Can use (3.19) for prediction.
  • Prediction intervals assess the accuracy of predicting individual responses

Y = f(X) + ε.

  • Confidence intervals assess the accuracy of predicting average responses

Y = f(X).

  • The former are always wider, since they account for the additional variability due to the irreducible error ε.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 148 / 496

slide-85
SLIDE 85

Revisiting the Marketing Data Questions

6 Is the relationship linear?

  • Identify nonlinearity using residual plots of linear model (Slide 129).
  • Advertising data:

Nonlinear effects visible in figure on Slide 108.

  • Discussed regression functions which are nonlinear in the predictor variables.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 149 / 496

slide-86
SLIDE 86

Revisiting the Marketing Data Questions

7 Is there synergy among the advertising media?

  • Non-additive relationships modeled by interaction term in model (Slide 119).
  • Presence of interaction (synergy) confirmed by small p-value of interaction

term.

  • Advertising data:

Including interaction term increased R2 from ≈ 90% to ≈ 97%.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 150 / 496

slide-87
SLIDE 87

Contents

3 Linear Regression

  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 151 / 496

slide-88
SLIDE 88

Linear Regression vs. K-Nearest Neighbors

Non-parametric approach

  • Linear regression is a parametric method.
  • Non-parametric methods make no strong a priori assumptions on functional

form of model Y ≈ f (X), more flexibility in adapting to data.

  • Here: K-nearest neighbors (KNN) regression (Cf. KNN classifier in

Chapter 2).

  • Given prediction point x0, first determine the set N0 consisting of the K

(K ∈ N) training observations closest to x0.

  • Predict ŷ0 to be the average training response in N0, i.e.,

f̂(x0) = (1/K) Σ_{xi ∈ N0} yi.
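A bare-bones sketch of KNN regression (invented data, Euclidean distance); libraries such as scikit-learn provide tuned implementations, but the method itself is just an average over the K nearest training responses.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """Average response of the k training points closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(7)
X_train = rng.uniform(-1.0, 1.0, size=(64, 2))   # 64 observations, p = 2 predictors
y_train = np.sin(np.pi * X_train[:, 0]) + 0.5 * X_train[:, 1] + rng.normal(scale=0.1, size=64)

x0 = np.array([0.2, -0.3])
print(knn_predict(X_train, y_train, x0, k=1))   # rough, interpolatory fit
print(knn_predict(X_train, y_train, x0, k=9))   # smoother prediction
```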

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 152 / 496

slide-89
SLIDE 89

Linear Regression vs. K-Nearest Neighbors

Non-parametric approach

[Two KNN regression surfaces ŷ over the predictors (x1, x2).]

Two KNN fits on a data set with 64 observations using p = 2 predictors. Left: K = 1. Interpolation, rough step-like function. Right: K = 9. Not interpolatory, smoother.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 153 / 496

slide-90
SLIDE 90

Linear Regression vs. K-Nearest Neighbors

Tuning K

  • Flexibility of the model is controlled by K: less flexible, smoother fit for large K.
  • Bias-variance tradeoff.
  • Flexible model: low bias, high variance (for K = 1 the prediction depends on only one nearby observation).

Inflexible model: high bias, low variance (changing one observation has a smaller effect; averaging introduces bias).

  • Optimal value of K? (later)

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 154 / 496

slide-91
SLIDE 91

Linear Regression vs. K-Nearest Neighbors

Parametric vs. non-parametric

Q: In what setting will a parametric approach outperform a non-parametric approach?
A: Depends on how closely the assumed form of f matches the true form.

[Two panels of y versus x.]

1D data, 100 observations (red), linear true model (black), KNN regression (blue). Left: K = 1, right: K = 9.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 155 / 496

slide-92
SLIDE 92

Linear Regression vs. K-Nearest Neighbors

Parametric vs. non-parametric

[Left panel: y versus x. Right panel: mean squared error versus 1/K.]

Left: same data, linear regression fit. Right: test set MSE for linear regression (dotted line) and KNN for different values of K (plotted against 1/K).

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 156 / 496

slide-93
SLIDE 93

Linear Regression vs. K-Nearest Neighbors

Parametric vs. non-parametric

[Left panel: y versus x. Right panel: mean squared error versus 1/K.]

Left: slightly nonlinear data, true model (black), KNN regression with K = 1 (blue) and K = 9 (red). Right: test set MSE for linear regression (dotted line) and KNN (against 1/K). KNN wins for K ≥ 4.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 157 / 496

slide-94
SLIDE 94

Linear Regression vs. K-Nearest Neighbors

Parametric vs. non-parametric

[Left panel: y versus x. Right panel: mean squared error versus 1/K.]

Left: strongly nonlinear data, true model (black), KNN regression with K = 1 (blue) and K = 9 (red). Right: test set MSE for linear regression (dotted line) and KNN (against 1/K). KNN wins for all K displayed.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 158 / 496

slide-95
SLIDE 95

Linear Regression vs. K-Nearest Neighbors

Parametric vs. non-parametric

[Six panels (p = 1, 2, 3, 4, 10, 20) of test mean squared error versus 1/K.]

Strongly nonlinear case, added noise predictors not associated with response. Linear regression MSE deteriorates only slightly as p rises, KNN regression MSE much more sensitive.

  • For p = 1 KNN seems at most slightly worse than linear regression. For

p > 1 this is no longer true.

  • Curse of dimensionality: for p = 20, many of the 100 observations have

no nearby observations.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 159 / 496

slide-96
SLIDE 96

Linear Regression vs. K-Nearest Neighbors

Parametric vs. non-parametric

  • General rule: parametric methods tend to outperform non-parametric methods when there is a small number of observations per predictor.

  • Even for small p, parametric methods offer the added advantage of better

interpretability.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 160 / 496