Introduction to Data Science
Winter Semester 2019/20 Oliver Ernst
TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik
Lecture Slides
Contents
1 What is Data Science?
2 Learning Theory
  2.1 What is Statistical Learning?
3 Linear Regression
4 Classification
5 Resampling Methods
Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 3 / 463
6 Linear Model Selection and Regularization
7 Nonlinear Regression Models
8 Tree-Based Methods
9 Unsupervised Learning
9 Unsupervised Learning
Supervised learning: data set $\{(x_i, y_i)\}_{i=1}^{n}$, each consisting of features $x_i$ and a response $y_i$.
Unsupervised learning: only the features $\{x_i\}_{i=1}^{n}$.
First principal component: normalized linear combination of the features
$$Z_1 = \phi_{1,1} X_1 + \dots + \phi_{p,1} X_p, \qquad \sum_{j=1}^{p} \phi_{j,1}^2 = 1.$$
Loadings $\{\phi_{j,1}\}_{j=1}^{p}$ for first principal component determined as (normalized) linear combination of the features with largest sample variance.
For centered data $X = [x_{i,j}] \in \mathbb{R}^{n \times p}$, the loadings $\{\phi_{j,1}\}_{j=1}^{p}$ solve optimization problem
$$\max_{\phi_{1,1},\dots,\phi_{p,1}} \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j,1} x_{i,j} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j,1}^2 = 1.$$
In matrix notation (the factor $1/n$ does not change the maximizer):
$$\max_{\|\phi\|_2 = 1} \|X\phi\|_2^2 = \max_{\|\phi\|_2 = 1} \phi^\top X^\top X \phi.$$
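The maximization of $\phi^\top X^\top X \phi$ over unit vectors can be tried out numerically. The following is a minimal sketch, not the method used in the lecture: it approximates the first loading vector by power iteration on $X^\top X$, using only the Python standard library. The function names (`center_columns`, `objective`, `first_loading`) and the toy data are assumptions made for illustration.

```python
import math
import random

def center_columns(X):
    """Subtract the column mean from every feature (centered data is assumed)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def objective(X, phi):
    """Sample variance of the scores: (1/n) * sum_i (sum_j phi_j * x_ij)^2."""
    n = len(X)
    return sum(sum(f * x for f, x in zip(phi, row)) ** 2 for row in X) / n

def first_loading(X, iters=500, seed=0):
    """Approximate the unit vector maximizing phi^T X^T X phi by power iteration."""
    p = len(X[0])
    # Gram matrix G = X^T X (p x p, symmetric positive semidefinite)
    G = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    rng = random.Random(seed)
    phi = [rng.gauss(0.0, 1.0) for _ in range(p)]
    for _ in range(iters):
        # Multiply by G, then renormalize to unit length
        phi = [sum(G[a][b] * phi[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in phi))
        phi = [v / norm for v in phi]
    return phi

# Hypothetical toy data: two strongly correlated features
X = center_columns([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
phi1 = first_loading(X)
print(phi1, objective(X, phi1))
```

The sign of the returned vector is arbitrary, since $\phi$ and $-\phi$ give the same objective value. Power iteration is chosen here only because it needs no linear algebra library; it converges slowly when the two largest eigenvalues of $X^\top X$ are close.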
Loading vectors $\{\phi_j\}_{j=1}^{p} \subset \mathbb{R}^{p}$. Scores of the first principal component are obtained from the rows $\{x_i^\top\}_{i=1}^{n} \subset \mathbb{R}^{1 \times p}$ of $X$:
$$z_{1,1} = x_1^\top \phi_1, \quad \dots, \quad z_{n,1} = x_n^\top \phi_1.$$
Singular value decomposition $X = U \Sigma V^\top$ gives $X^\top X = V \Sigma^2 V^\top$ with $\Sigma^2 = \operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2)$.
$$\|X\|_F^2 = \sigma_1^2 + \dots + \sigma_p^2.$$
Loading vectors $\{\phi_j\}_{j=1}^{p}$ are given by the normalized eigenvectors of $X^\top X$, i.e., the columns of $V$.
[Figure: biplot of the US states data. Axes: First Principal Component, Second Principal Component. Points labeled by state (Alabama to Wyoming); loading vectors for Murder, Assault, UrbanPop, Rape.]
[Figure: axes First principal component, Second principal component.]
The first $M$ principal component score vectors and loading vectors give the best $M$-dimensional approximation of the (centered) data in the least-squares sense:
$$x_{i,j} \approx \sum_{m=1}^{M} z_{i,m} \phi_{j,m}.$$
[Figure: biplots of the first two principal components. Axes: First Principal Component, Second Principal Component; loading vectors for Murder, Assault, UrbanPop, Rape. Panels: Scaled, Unscaled.]
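Total variance in the (centered) data set:
$$\sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{i,j}^2 = \frac{1}{n} \|X\|_F^2.$$
Variance explained by the $m$-th principal component:
$$\frac{1}{n} \|z_m\|_2^2 = \frac{1}{n} \sum_{i=1}^{n} z_{i,m}^2 = \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j,m} x_{i,j} \Big)^2 = \frac{1}{n} \|X \phi_m\|_2^2.$$
Proportion of variance explained (PVE) by the $m$-th principal component:
$$\|X \phi_m\|_2^2 / \|X\|_F^2.$$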
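As a worked example of the ratio $\|X\phi_m\|_2^2 / \|X\|_F^2$, the sketch below computes the proportion of variance explained for a data set with only $p = 2$ features, where the loading vectors follow in closed form from the rotation angle that diagonalizes $X^\top X$. The function name `pve_two_features` and the toy data are assumptions; for general $p$ one would use an eigenvalue or SVD routine instead.

```python
import math

def pve_two_features(X):
    """PVE of both principal components for centered data with p = 2 features."""
    # Entries of the symmetric 2x2 Gram matrix X^T X = [[a, b], [b, d]]
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    # Angle of the direction that maximizes phi^T X^T X phi over unit vectors
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    phi1 = (math.cos(theta), math.sin(theta))    # first loading vector
    phi2 = (-math.sin(theta), math.cos(theta))   # second loading vector, orthogonal to phi1
    frob_sq = a + d                              # ||X||_F^2 = sum of all squared entries

    def score_norm_sq(phi):
        # ||X phi||_2^2 = sum_i (phi_1 * x_i1 + phi_2 * x_i2)^2
        return sum((phi[0] * r[0] + phi[1] * r[1]) ** 2 for r in X)

    return [score_norm_sq(phi1) / frob_sq, score_norm_sq(phi2) / frob_sq]

# Hypothetical toy data, already centered (both columns sum to zero)
X = [[-2.0, -1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
pve = pve_two_features(X)
print(pve, sum(pve))
```

For this data the first component explains roughly 98% of the variance, and the two proportions sum to one, which is what a cumulative PVE plot shows at its last point.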
[Figure: two panels over Principal Component (1 to 4), vertical range 0 to 1; second panel: Cumulative Prop. Variance Explained.]
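$\{C_k\}_{k=1}^{K}$ denote sets containing indices of the n observations in cluster k.
K-means clustering: choose the clusters with smallest total within-cluster variation $W(C_k)$,
$$\min_{C_1,\dots,C_K} \sum_{k=1}^{K} W(C_k).$$
With squared Euclidean distance as measure of within-cluster variation,
$$W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{i,j} - x_{i',j})^2,$$
this becomes
$$\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{i,j} - x_{i',j})^2.$$
The number of ways to partition n observations into K clusters grows (footnote 10) like $K^n$. There are, however, simple heuristics for finding good approximations.
10 These are known as the Stirling numbers of the second kind, $S(n, K) \sim K^n/K!$ as $n \to \infty$.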
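To get a feeling for how quickly the number of partitions grows, the Stirling numbers of the second kind can be computed exactly with the standard recurrence $S(n, K) = K\,S(n-1, K) + S(n-1, K-1)$ and compared with $K^n/K!$. This is an illustrative sketch; the function name `stirling2` is an assumption.

```python
from math import factorial

def stirling2(n, K):
    """Number of ways to partition n labeled observations into K non-empty clusters."""
    # Table S[i][k] filled with the recurrence S(i, k) = k * S(i-1, k) + S(i-1, k-1)
    S = [[0] * (K + 1) for _ in range(n + 1)]
    S[0][0] = 1
    for i in range(1, n + 1):
        for k in range(1, min(i, K) + 1):
            S[i][k] = k * S[i - 1][k] + S[i - 1][k - 1]
    return S[n][K]

# Compare the exact count with the asymptotic value K^n / K! for K = 3
for n in (5, 10, 20, 40):
    exact = stirling2(n, 3)
    approx = 3 ** n / factorial(3)
    print(n, exact, exact / approx)
```

Already for $n = 10$ and $K = 3$ there are 9330 possible partitions, and the ratio of the exact count to $K^n/K!$ approaches one as $n$ grows, so exhaustive search over all partitions is out of reach for realistic $n$.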
1 Randomly assign a number, from 1 to K, to each of the observations.
2 Iterate until the cluster assignments stop changing:
  a For each of the K clusters, compute the cluster centroid. The k-th cluster centroid is the vector of the p feature means for the observations in the k-th cluster.
  b Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
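The two steps of the K-means heuristic can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: clusters are numbered 0 to K-1 instead of 1 to K, an empty cluster is re-seeded with a randomly chosen observation (a choice the algorithm as stated does not cover), and the function name `kmeans` and the toy data are hypothetical.

```python
import random

def kmeans(X, K, seed=0, max_iter=100):
    """Heuristic K-means; returns (labels, centroids) with labels in 0..K-1."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Step 1: randomly assign a cluster number to each observation
    labels = [rng.randrange(K) for _ in range(n)]
    centroids = []
    for _ in range(max_iter):
        # Step 2a: centroid = vector of the p feature means of the cluster members
        centroids = []
        for k in range(K):
            members = [X[i] for i in range(n) if labels[i] == k]
            if not members:
                # Assumption: re-seed an empty cluster with a random observation
                members = [rng.choice(X)]
            centroids.append([sum(m[j] for m in members) / len(members) for j in range(p)])
        # Step 2b: assign each observation to the closest centroid (squared Euclidean distance)
        new_labels = [
            min(range(K), key=lambda k: sum((x[j] - centroids[k][j]) ** 2 for j in range(p)))
            for x in X
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three observations
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(X, K=2)
print(labels, centroids)
```

Because the result depends on the random initial assignment, it is common to run the procedure several times with different seeds and keep the clustering with the smallest objective value.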
[Figure: clustering results for K=2, K=3, K=4.]
The algorithm never increases the objective, since
$$\frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{i,j} - x_{i',j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{i,j} - \bar{x}_{k,j})^2,$$
where $\bar{x}_{k,j}$ is the mean of feature $j$ in cluster $C_k$.
[Figure: six panels with objective values 320.9, 235.8, 235.8, 235.8, 235.8, 310.9.]
[Figure: scatter plot, axes X1 and X2.]
[Figure: dendrogram with leaf order 3 4 1 6 9 2 8 5 7 (heights 0.0 to 3.0) and scatter plot of observations 1 to 9, axes X1 and X2.]
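1 Begin with n observations and a measure (such as Euclidean distance) of all n(n − 1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2 For i = n, n − 1, . . . , 2:
  a Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
  b Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.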
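The agglomerative procedure can be sketched directly from these steps. The version below is a deliberately naive illustration (it recomputes all inter-cluster dissimilarities on every pass) and assumes Euclidean distance and, by default, the maximum over all pairs as the inter-cluster dissimilarity. The function name `agglomerate`, its `linkage` parameter, and the toy data are hypothetical.

```python
import math
from itertools import combinations

def agglomerate(X, linkage=max):
    """Agglomerative clustering; returns the fusions as (cluster_a, cluster_b, height)."""
    # Step 1: every observation starts as its own cluster (a tuple of indices)
    clusters = [(i,) for i in range(len(X))]

    def dissimilarity(A, B):
        # `linkage` is applied to all pairwise Euclidean distances between A and B
        return linkage(math.dist(X[i], X[j]) for i in A for j in B)

    merges = []
    # Step 2: i = n, n-1, ..., 2 clusters remain at the start of each pass
    while len(clusters) > 1:
        # Step 2a: identify and fuse the least dissimilar pair of clusters
        A, B = min(combinations(clusters, 2), key=lambda pair: dissimilarity(*pair))
        merges.append((A, B, dissimilarity(A, B)))
        # Step 2b: keep the i-1 remaining clusters; dissimilarities are recomputed next pass
        clusters = [C for C in clusters if C not in (A, B)] + [A + B]
    return merges

# Hypothetical toy data: two close pairs and one distant observation
X = [[0, 0], [0, 1], [5, 5], [5, 6.5], [20, 20]]
merges = agglomerate(X)
for A, B, height in merges:
    print(A, B, round(height, 3))
```

Each recorded height is the dissimilarity at which the fusion would be drawn in the dendrogram. Passing `linkage=min` switches the sketch from the maximal to the minimal inter-cluster dissimilarity.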
Linkage    Description
Complete   Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.
Single     Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one-at-a-time.
Average    Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.
Centroid   Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
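The four linkage types differ only in how the pairwise dissimilarities between two clusters are summarized. A small sketch, assuming Euclidean distance as the dissimilarity; the names `pairwise`, `centroid`, `LINKAGES` and the two toy clusters are hypothetical and chosen so the values can be checked by hand.

```python
import math
from statistics import mean

def pairwise(A, B):
    """All pairwise Euclidean dissimilarities between observations in A and in B."""
    return [math.dist(a, b) for a in A for b in B]

def centroid(A):
    """Mean vector of length p of the observations in A."""
    return [mean(col) for col in zip(*A)]

LINKAGES = {
    "complete": lambda A, B: max(pairwise(A, B)),    # largest pairwise dissimilarity
    "single":   lambda A, B: min(pairwise(A, B)),    # smallest pairwise dissimilarity
    "average":  lambda A, B: mean(pairwise(A, B)),   # average pairwise dissimilarity
    "centroid": lambda A, B: math.dist(centroid(A), centroid(B)),
}

# Hypothetical clusters on a line: pairwise distances are 3, 5, 2 and 4
A = [[0.0, 0.0], [1.0, 0.0]]
B = [[3.0, 0.0], [5.0, 0.0]]
for name, d in LINKAGES.items():
    print(name, d(A, B))
```

For these two clusters complete linkage gives 5, single linkage gives 2, and average and centroid linkage both give 3.5; in general the last two do not coincide.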
[Figure: scatter plot of observations 1 to 9.]
[Figure: dendrograms for Average Linkage, Complete Linkage, Single Linkage.]
[Figure: Observation 1, Observation 2, Observation 3 plotted against Variable Index (1 to 20).]
[Figure: three panels comparing Socks and Computers.]