Introduction to Data Science
Winter Semester 2018/19 Oliver Ernst
TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik
Contents
1 What is Data Science?
2 Learning Theory
3 Linear Regression
4 Classification
5 Resampling Methods
Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 3 / 496
6 Linear Model Selection and Regularization
7 Nonlinear Regression Models
8 Tree-Based Methods
9 Support Vector Machines
10 Unsupervised Learning
3 Linear Regression
[Figure: Sales versus advertising budget for TV, Radio, and Newspaper (three panels).]
1 Is there a relationship between advertising budget and sales?
2 How strong is this relationship between advertising budget and sales?
3 Which media contribute to sales?
4 How accurately can we estimate the effect of each medium on sales?
5 How accurately can we predict future sales?
6 Is the relationship linear?
7 Is there synergy among the advertising media?
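$$\mathrm{RSS} = r_1^2 + \dots + r_n^2 = (y_1 - \hat\beta_0 - \hat\beta_1 x_1)^2 + \dots + (y_n - \hat\beta_0 - \hat\beta_1 x_n)^2$$
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}$$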
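The closed-form slope estimate can be checked numerically. The following is a minimal sketch, not the lecture's own code: the function names and the toy data are hypothetical, and the intercept is computed with the standard companion formula $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$.

```python
import numpy as np

def simple_least_squares(x, y):
    """Return (beta0_hat, beta1_hat) minimizing the RSS of y ~ beta0 + beta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: standard companion formula (assumed, see lead-in).
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def rss(x, y, beta0, beta1):
    """Residual sum of squares r_1^2 + ... + r_n^2."""
    r = np.asarray(y, dtype=float) - beta0 - beta1 * np.asarray(x, dtype=float)
    return float(np.sum(r ** 2))

# Hypothetical toy data lying exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
beta0_hat, beta1_hat = simple_least_squares(x, y)
rss_hat = rss(x, y, beta0_hat, beta1_hat)
```

Because the toy data are noise-free, the estimates recover the line and the RSS vanishes up to rounding; with noisy data the RSS is positive but still minimal over all lines.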
[Figure: RSS as a function of β0 and β1.]
[Figure: two panels of Y versus X.]
$$\hat\mu = \frac{1}{n} \sum_{i=1}^n y_i.$$
4 Standard deviation of the sampling distribution, i.e., the average amount by which $\hat\mu$ deviates from $\mu$.
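$$\mathrm{SE}(\hat\beta_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right], \qquad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2},$$
where $\sigma^2 = \operatorname{Var}(\varepsilon)$.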
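$$\operatorname{Cor}(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar y)^2}}.$$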
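Choose the coefficients $\{\hat\beta_j\}_{j=0}^p$ to minimize
$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_p x_{ip})^2. \qquad (3.20)$$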
[Figure: Y as a function of X1 and X2.]
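Choosing the coefficients $\{\hat\beta_j\}_{j=0}^p$ to minimize the RSS in (3.20) is equivalent to minimizing $\|y - X\beta\|_2^2$, where we have introduced the notation $y = (y_1, \dots, y_n)^\top$ for the vector of responses, $X \in \mathbb{R}^{n \times (p+1)}$ for the matrix whose $i$-th row is $(1, x_{i1}, \dots, x_{ip})$, and $\beta = (\beta_0, \dots, \beta_p)^\top$.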
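The matrix form of the least squares problem maps directly onto a numerical solver. The sketch below is an illustration under assumptions, not the lecture's implementation: it builds a design matrix with a leading column of ones and calls NumPy's `np.linalg.lstsq`; the helper name `fit_linear_model` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_linear_model(X_raw, y):
    """Least squares fit of y on the columns of X_raw plus an intercept.

    Returns the coefficient vector (beta_0, ..., beta_p) and the attained RSS.
    """
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones for the intercept, then the predictors.
    X = np.column_stack([np.ones(len(y)), X_raw])
    # lstsq minimizes the Euclidean norm of y - X @ beta.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    return beta_hat, rss

# Hypothetical noise-free data with known coefficients.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))
beta_true = np.array([1.0, 2.0, -3.0])
y = beta_true[0] + X_raw @ beta_true[1:]
beta_hat, rss_value = fit_linear_model(X_raw, y)
```

Using a least squares solver rather than forming and inverting the normal equations explicitly is a common design choice, since it behaves better when the predictors are nearly collinear.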
j=1 also leads to a linear regression
1 Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
2 Do all predictors help to explain Y, or is only a subset of the predictors useful?
3 How well does the model fit the data?
4 Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
1 Reducible error: the estimates $\hat\beta_0, \dots, \hat\beta_p$ only approximate the true coefficients.
2 Model bias: linear model can only yield best linear approximation.
3 Irreducible error: Y = f(X) + ε.
[Figure: scatterplot matrix of the variables Balance, Age, Cards, Education, Income, Limit, and Rating.]
[Figure: Balance versus Income for students and non-students (two panels).]
[Figure: Miles per gallon versus Horsepower, with linear, degree-2, and degree-5 fits.]
1 Nonlinear dependence of response on predictors
2 Correlated error terms
3 Non-constant variance of error terms
4 Outliers
5 High-leverage points
6 Collinearity
[Figure: residuals versus fitted values in two panels, "Residual Plot for Linear Fit" and "Residual Plot for Quadratic Fit".]
[Figure: residuals versus observation index for ρ = 0.0, ρ = 0.5, and ρ = 0.9 (three panels).]
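Error variance of the $i$-th response: $\sigma_i^2 = \sigma^2 / n_i$. Remedy: weighted least squares with weights proportional to the inverse variances, $w_i = n_i$.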
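Weighted least squares can be sketched by rescaling each observation with the square root of its weight and then solving an ordinary least squares problem. This is a minimal illustration under assumptions: the function name, the counts `n_i`, and the toy data are hypothetical, and the weights are taken as $w_i = n_i$, as when the $i$-th response is an average of $n_i$ raw observations.

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Minimize sum_i w_i * (y_i - x_i^T beta)^2 over beta.

    Multiplying row i of X and entry i of y by sqrt(w_i) turns the weighted
    problem into an ordinary least squares problem.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sqrt_w = np.sqrt(np.asarray(w, dtype=float))
    beta_hat, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return beta_hat

# Hypothetical setup: response i is an average of n_i raw observations,
# so its variance is sigma^2 / n_i and its weight is n_i.
n_i = np.array([1, 4, 9, 16, 25])
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x  # noise-free, so any positive weights give the same fit
beta_wls = weighted_least_squares(X, y, w=n_i)
beta_ols = weighted_least_squares(X, y, w=np.ones_like(x))
```

With noisy data the two fits differ: observations with larger $n_i$ pull the weighted fit more strongly because their responses are more precise.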
[Figure: residuals versus fitted values for Response Y and for Response log(Y) (two panels).]
[Figure: three panels: Y versus X; residuals versus fitted values; studentized residuals versus fitted values. Observation 20 is labeled.]
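$$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'=1}^n (x_{i'} - \bar x)^2} \in \left[\tfrac{1}{n}, 1\right].$$
Average leverage: $(p+1)/n$; deviation from average indicates high leverage.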
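The leverage statistic for simple linear regression is easy to compute directly. The sketch below also compares it with the diagonal of the hat matrix $X(X^\top X)^{-1}X^\top$, which is the standard generalization to several predictors and is an addition here, not something taken from the slides. Function names and data are hypothetical.

```python
import numpy as np

def leverage_simple(x):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2."""
    x = np.asarray(x, dtype=float)
    d2 = (x - x.mean()) ** 2
    return 1.0 / len(x) + d2 / d2.sum()

def leverage_hat_matrix(X):
    """Diagonal of the hat matrix, computed from a reduced QR factorization.

    For full-column-rank X = QR the hat matrix equals Q Q^T, so its diagonal
    is the vector of squared row norms of Q.
    """
    Q, _ = np.linalg.qr(np.asarray(X, dtype=float))
    return np.sum(Q ** 2, axis=1)

# Hypothetical predictor values; the last point lies far from the others.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
h = leverage_simple(x)
X = np.column_stack([np.ones_like(x), x])
h_general = leverage_hat_matrix(X)
average_leverage = h.mean()  # equals (p + 1) / n = 2 / 5 for p = 1
```

The isolated point at x = 10 has leverage far above the average, which is the pattern the rule of thumb is meant to flag.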
[Figure: three panels: Y versus X; X2 versus X1; studentized residuals versus leverage. Observations 20 and 41 are labeled.]
[Figure: Age versus Limit and Rating versus Limit (two panels).]
$$\mathrm{VIF}(\hat\beta_j) = \frac{1}{1 - R^2_{X_j|X_{-j}}}$$
$R^2_{X_j|X_{-j}}$: $R^2$ from regression of $X_j$ onto all other predictors.
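To make the role of this $R^2$ concrete, the sketch below regresses each predictor onto all the others and converts the result into a variance inflation factor using the standard definition $1/(1 - R^2_{X_j|X_{-j}})$. The helper names and the synthetic predictors are hypothetical and serve only as an illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of the least squares fit of y on the columns of X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # all predictors except X_j
        r2 = r_squared(others, X[:, j])   # regress X_j onto the others
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Hypothetical predictors: x2 is nearly collinear with x1, x3 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
```

The two nearly collinear predictors receive large factors, while the unrelated one stays close to the minimum value of 1.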
1 Is there a relationship between advertising budget and sales?
2 How strong is this relationship between advertising budget and sales?
3 Which media contribute to sales?
4 How accurately can we estimate the effect of each medium on sales?
5 How accurately can we predict future sales?
6 Is the relationship linear?
7 Is there synergy among the advertising media?
[Figure: two panels showing y over (x1, x2).]
[Figures (three slides): Mean Squared Error versus 1/K.]
[Figure: Mean Squared Error versus 1/K, one panel each for p = 1, 2, 3, 4, 10, 20.]