Introduction to Data Science
Winter Semester 2019/20 Oliver Ernst
TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik
1 What is Data Science?
2 Learning Theory
3 Linear Regression
4 Classification
5 Resampling Methods
Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 3 / 462
6 Linear Model Selection and Regularization
7 Nonlinear Regression Models
8 Tree-Based Methods
9 Unsupervised Learning
5 Resampling Methods
[Figure: Mean Squared Error versus Degree of Polynomial, two panels.]
Drawbacks of the validation set approach:
1 High variability of the validation set MSE under changing partitions.
2 Valuable data are not used to fit the model; we expect this to result in overestimation of the test error.
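The first drawback can be made visible with a small experiment. The following is a minimal sketch, not the lecture's own example: it assumes synthetic data from y = 2 + 3x + noise, a simple linear fit, and ten random half/half partitions. The helper names `fit_line` and `validation_mse` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def validation_mse(xs, ys, rng):
    """Fit on a random half of the data, return the MSE on the other half."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, valid = idx[:half], idx[half:]
    b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in valid) / len(valid)


# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + data_rng.gauss(0, 2) for x in xs]

# Ten different random partitions of the same data set.
mses = [validation_mse(xs, ys, random.Random(seed)) for seed in range(10)]
print(f"validation MSE: min = {min(mses):.2f}, max = {max(mses):.2f}")
```

The spread between the smallest and largest value comes only from the choice of partition, since the data set itself is fixed.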
Advantages of LOOCV:
1 Less bias, since each fit uses nearly all observations; less overestimation of the test error.
2 Well-defined approach, no arbitrariness in partitioning the data as in the validation set approach.
[Figure: LOOCV; Mean Squared Error versus Degree of Polynomial.]
Leverage of observation $i$ in simple linear regression:
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}.$$
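For least-squares linear fits, the LOOCV estimate can be obtained from a single fit using the leverages $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, via $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_i \big((y_i - \hat{y}_i)/(1 - h_i)\big)^2$. The sketch below compares this shortcut with the brute-force computation of n separate fits. The synthetic data and all function names are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def loocv_brute_force(xs, ys):
    """Refit n times, each time leaving out one observation."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total / n


def loocv_shortcut(xs, ys):
    """Single fit on all data; residuals are rescaled by 1 - h_i."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b0, b1 = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        h = 1 / n + (x - x_bar) ** 2 / sxx  # leverage of this observation
        total += ((y - (b0 + b1 * x)) / (1 - h)) ** 2
    return total / n


# Synthetic data (an assumption for illustration).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(30)]
ys = [2 + 3 * x + rng.gauss(0, 2) for x in xs]

cv_brute = loocv_brute_force(xs, ys)
cv_fast = loocv_shortcut(xs, ys)
print(f"brute force: {cv_brute:.6f}, shortcut: {cv_fast:.6f}")
```

Both values agree up to rounding error, while the shortcut needs one fit instead of n.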
[Figure: 10-fold CV panel.]
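A minimal sketch of k-fold cross-validation with k = 10, under assumptions not taken from the slides: synthetic data, a simple linear fit, and folds formed by striding through a shuffled index list. The function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def k_fold_cv(xs, ys, k, rng):
    """Average of the k held-out-fold MSEs."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[j::k] for j in range(k)]  # k folds of nearly equal size
    fold_mses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold) / len(fold)
        fold_mses.append(mse)
    return sum(fold_mses) / k


# Synthetic data (an assumption for illustration): noise variance is 4.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + data_rng.gauss(0, 2) for x in xs]

cv10 = k_fold_cv(xs, ys, 10, random.Random(1))
print(f"10-fold CV estimate of the test MSE: {cv10:.2f}")
```

Setting k equal to the number of observations puts one observation in each fold, which is LOOCV.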
[Figure: Mean Squared Error versus Flexibility, three panels.]
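For classification, LOOCV averages the misclassification indicators:
$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{y_i \neq \hat{y}_i\}}.$$
Logistic regression with quadratic terms in both predictors:
$$\log\frac{p}{1-p} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2.$$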
[Figure: panels Degree=1 and Degree=3.]
[Figure: two panels: Error Rate versus Order of Polynomials Used; Error Rate versus 1/K.]
6 B. Efron: A 250-Year Argument: Belief, Behavior, and the Bootstrap. Bull. AMS 50(1), 2013, pp. 129–146.
[Images. Sources: Things I can’t avoid blog; Wikipedia Commons.]
$$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}},$$
where $\sigma_X^2 = \operatorname{Var} X$, $\sigma_Y^2 = \operatorname{Var} Y$, $\sigma_{XY} = \operatorname{Cov}(X, Y)$.
Given estimates $\hat\sigma_X^2$, $\hat\sigma_Y^2$, $\hat\sigma_{XY}$:
$$\hat\alpha = \frac{\hat\sigma_Y^2 - \hat\sigma_{XY}}{\hat\sigma_X^2 + \hat\sigma_Y^2 - 2\hat\sigma_{XY}}.$$
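The plug-in estimate of α replaces the variances and the covariance by their sample counterparts. A minimal sketch follows; the function name `alpha_hat`, the sample size n = 100, the normal distribution and the seed are assumptions. The simulated pairs use Var X = 1, Var Y = 1.25 and Cov(X, Y) = 0.5, which gives α = 0.6.

```python
import random


def alpha_hat(xs, ys):
    """Plug-in estimate (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)


# Simulated pairs (an assumption for illustration):
# X = Z1 and Y = 0.5 * Z1 + Z2 with independent standard normal Z1, Z2,
# so Var X = 1, Var Y = 0.25 + 1 = 1.25, Cov(X, Y) = 0.5.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

estimate = alpha_hat(xs, ys)
print(f"alpha_hat = {estimate:.3f} (true value 0.6)")
```

The estimate differs from 0.6 because it is computed from a single finite sample; quantifying that variability is what the bootstrap is used for.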
[Figure: four panels with axes X and Y.]
[Figure: histograms and boxplots (True, Bootstrap) of estimates of $\alpha$. ($\sigma_X^2 = 1$, $\sigma_Y^2 = 1.25$, $\sigma_{XY} = 0.5$ $\Rightarrow$ $\alpha = 0.6$, solid vertical line.) Center: bootstrap histogram.]
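Bootstrap estimate of the standard error of $\hat\alpha$ from $B$ bootstrap estimates $\hat\alpha^{*1}, \dots, \hat\alpha^{*B}$:
$$\mathrm{SE}_B(\hat\alpha) = \sqrt{\frac{1}{B-1} \sum_{r=1}^{B} \Big( \hat\alpha^{*r} - \frac{1}{B} \sum_{r'=1}^{B} \hat\alpha^{*r'} \Big)^2}.$$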
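A minimal sketch of the bootstrap standard error of the estimate of α: draw B data sets of size n with replacement from the observed pairs, recompute the estimate on each, and take the sample standard deviation of the B values. The simulated data, B = 1000, the seeds and the function names are assumptions made for illustration.

```python
import random
import statistics


def alpha_hat(xs, ys):
    """Plug-in estimate (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return (var_y - cov_xy) / (var_x + var_y - 2 * cov_xy)


def bootstrap_se(xs, ys, B, rng):
    """Standard deviation of alpha_hat over B bootstrap data sets."""
    n = len(xs)
    estimates = []
    for _ in range(B):
        # Draw n observation indices with replacement; pairs stay together.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(alpha_hat([xs[i] for i in idx], [ys[i] for i in idx]))
    # statistics.stdev divides by B - 1.
    return statistics.stdev(estimates)


# Simulated pairs (an assumption): Var X = 1, Var Y = 1.25, Cov(X, Y) = 0.5.
data_rng = random.Random(0)
xs = [data_rng.gauss(0, 1) for _ in range(100)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

se = bootstrap_se(xs, ys, B=1000, rng=random.Random(1))
print(f"bootstrap SE of alpha_hat: {se:.3f}")
```

Only the single observed data set is used; no further draws from the underlying population are needed.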
Original Data (Z)
Obs  X    Y
1    4.3  2.4
2    2.1  1.1
3    5.3  2.8

Bootstrap samples (drawn with replacement from Z)
Obs  X    Y
3    5.3  2.8
1    4.3  2.4
3    5.3  2.8

Obs  X    Y
2    2.1  1.1
3    5.3  2.8
1    4.3  2.4

Obs  X    Y
2    2.1  1.1
2    2.1  1.1
1    4.3  2.4