SLIDE 1

Statistical Machine Learning

Lecture 07: Clustering and Evaluation

Kristian Kersting TU Darmstadt

Summer Term 2020

  • K. Kersting based on Slides from J. Peters· Statistical Machine Learning· Summer Term 2020

1 / 38

SLIDE 2

Today’s Objectives

Make you understand how to find meaningful groups of data points and how to evaluate the performance of estimators

Covered topics:

  • Clustering
  • Bias & Variance
  • Cross-Validation

SLIDE 3

Outline

  • 1. Clustering
  • 2. Evaluation
  • 3. Wrap-Up
SLIDE 4
  • 1. Clustering

Outline

  • 1. Clustering
  • 2. Evaluation
  • 3. Wrap-Up
SLIDE 5
  • 1. Clustering

Clustering

We introduced mixture models as part of density estimation. They are also very useful for clustering:

  • Divide the feature space into meaningful groups
  • Find the group assignment

[Figure: two scatter plots of the same data, before and after cluster assignment]

SLIDE 6
  • 1. Clustering

Clustering

Clustering is a type of unsupervised learning

Examples:

  • k-Means
  • Mixture models

SLIDE 7
  • 1. Clustering

Simple Clustering Methods

Agglomerative Clustering

  • Make each point a separate cluster
  • While the clustering is not satisfactory:
    • Merge the two clusters with the smallest inter-cluster distance

Divisive Clustering

  • Construct a single cluster containing all points
  • While the clustering is not satisfactory:
    • Split the cluster that yields the two components with the largest inter-cluster distance
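The agglomerative loop above can be sketched in a few lines of Python. This is a hedged illustration, not the lecture's implementation: it assumes 1-D points, single-linkage distance, and a target number of clusters as the stopping criterion (the slide leaves "satisfactory" open); all names are illustrative.

```python
# Minimal agglomerative clustering sketch (single linkage, 1-D points).
def agglomerative(points, n_clusters):
    # Make each point a separate cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest inter-cluster
        # distance (single linkage: distance of the closest members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], 2))
```

Divisive clustering runs the same idea top-down, repeatedly splitting instead of merging.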

SLIDE 8
  • 1. Clustering

Mean Shift Clustering

Mean shift is a method for finding modes in a cloud of data points, i.e., the locations where the points are most dense

[Comaniciu & Meer, 02]

SLIDE 9
  • 1. Clustering

Mean Shift Clustering

The mean shift procedure tries to find the modes of a kernel density estimate through local search

[Comaniciu & Meer, 02]

The black lines indicate various search paths starting at different points. Paths that converge to the same point get assigned the same label.

SLIDE 10
  • 1. Clustering

Mean Shift Clustering

Start with the kernel density estimate

\hat{f}(x) = \frac{1}{N h^d} \sum_{i=1}^{N} k\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)

We can derive the mean shift procedure by taking the gradient of the kernel density estimate. For details see: D. Comaniciu, P. Meer, Mean Shift: A Robust Approach toward Feature Space Analysis, IEEE Trans. Pattern Analysis Machine Intell., Vol. 24, No. 5, 603-619, 2002.

SLIDE 11
  • 1. Clustering

Mean Shift Clustering

  • Start at a random data point x
  • Compute the mean shift vector:

m_{h,g}(x) = \frac{\sum_{i=1}^{N} x_i \, g\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{N} g\left( \left\| \frac{x - x_i}{h} \right\|^2 \right)} - x

where g(y) = -k'(y)

  • Move the current point by the mean shift vector: x ← x + m_{h,g}(x)
  • Repeat until convergence
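The procedure above can be sketched directly. This is an illustrative 1-D sketch under an assumed Gaussian profile k(y) = exp(-y/2), for which g(y) = -k'(y) ∝ exp(-y/2) (the constant factor cancels in the ratio); function and variable names are my own, not from the lecture.

```python
import numpy as np

def mean_shift_mode(x, data, h=1.0, tol=1e-6, max_iter=500):
    """Follow one mean shift search path from x to a mode of the KDE."""
    for _ in range(max_iter):
        # Weights g(||(x - xi)/h||^2) for the Gaussian profile.
        w = np.exp(-((x - data) / h) ** 2 / 2)
        # Mean shift vector m_{h,g}(x): weighted mean of the data minus x.
        m = np.sum(w * data) / np.sum(w) - x
        x = x + m                       # move by the mean shift vector
        if abs(m) < tol:                # converged to a mode
            break
    return x

data = np.array([0.0, 0.1, -0.1, 5.0, 5.2, 4.8])
# Paths started near either bump converge to that bump's mode; points
# whose paths converge to the same mode get the same cluster label.
print(mean_shift_mode(0.5, data, h=0.5), mean_shift_mode(4.5, data, h=0.5))
```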

SLIDE 12
  • 1. Clustering

Mean Shift Clustering: Illustration

[Animation legend: region of interest, center of mass, mean shift vector]

Objective: find the densest region

[Ukrainitz & Sarel]

SLIDE 13
  • 1. Clustering

Segmentation using Clustering

Clustering of simple image features, e.g. color & pixel position

  • [Comaniciu & Meer, 02]
  • K. Kersting based on Slides from J. Peters· Statistical Machine Learning· Summer Term 2020

13 / 38

slide-14
SLIDE 14
  • 2. Evaluation

Outline

  • 1. Clustering
  • 2. Evaluation
  • 3. Wrap-Up
SLIDE 15
  • 2. Evaluation

Evaluation

What have we seen so far...

  • Classification using the Bayes classifier: p(C_k | x) ∝ p(x | C_k) p(C_k)
  • Probability density estimation to estimate the class-conditional densities p(x | C_k)

How do we know how well we are carrying out each of these tasks? We need a way to evaluate performance:

  • For density estimation (or really, parameter estimation)
  • For the classifier as a whole
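As a tiny numeric reminder of the Bayes classifier recalled above, the sketch below evaluates p(C_k | x) ∝ p(x | C_k) p(C_k) for two classes with 1-D Gaussian class-conditionals. All numbers (priors, means, standard deviations) are illustrative assumptions, not from the lecture.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # 1-D Gaussian density N(x; mu, sigma^2).
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {"C1": 0.6, "C2": 0.4}                 # p(Ck), assumed
params = {"C1": (0.0, 1.0), "C2": (3.0, 1.0)}   # (mean, std) per class, assumed

def posterior(x):
    # Unnormalized posteriors p(x | Ck) p(Ck), then normalize over classes.
    unnorm = {c: gaussian_pdf(x, *params[c]) * priors[c] for c in priors}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

print(posterior(0.5))  # x near C1's mean, so p(C1 | x) dominates
```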

SLIDE 16
  • 2. Evaluation

Evaluation

Overfitting is everywhere. Is there a tank in the picture?

[DARPA Neural Network Study (1988-89), AFCEA International Press]

SLIDE 17
  • 2. Evaluation

Test Error vs Training Error

[Figure: mean squared error \frac{1}{2N} \sum_i (y_i - f(x_i))^2 vs. polynomial degree n, showing training error and test error. The training error keeps decreasing with n, while the test error is minimized at the optimal degree n = 5: underfitting to the left, overfitting to the right, about right in between.]


Small training error ⇒ good model? NO! We need to rethink model selection!

SLIDE 18
  • 2. Evaluation

Occam’s Razor and Model Selection

Model Selection Questions

  • Number of parameters, a.k.a. the degree of the polynomial n?
  • Model class not sufficiently rich? ⇒ Underfitting
  • Too rich? ⇒ Overfitting

Occam's Razor:

  • Always choose the simplest model that fits the data
  • Simplest = smallest model complexity!

William of Ockham (1285-1347)
SLIDE 19
  • 2. Evaluation

Bias and Variance

As we saw before, maximum likelihood is just one possible way to estimate a parameter

How can we assess how good an estimator is?

Assume that we have an estimator θ̂ that estimates the parameter θ from the data set X

  • Bias of an estimator: expected deviation from the true parameter

\mathrm{bias}(\hat{\theta}) = E_X\left[ \hat{\theta}(X) - \theta \right]

  • Variance of an estimator: expected squared error between the estimator and the mean estimator

\mathrm{var}(\hat{\theta}) = E_X\left[ \left( \hat{\theta}(X) - E_X[\hat{\theta}(X)] \right)^2 \right]

SLIDE 20
  • 2. Evaluation

Bias of an Estimator

The estimate θ̂(X) is a random variable, because we assumed that X is a random sample from a true underlying distribution

An estimator is biased if the expected value of the estimator E_X[\hat{\theta}(X)] differs from the true value of the parameter θ

Otherwise it is called unbiased, i.e., E_X[\hat{\theta}(X)] = \theta
SLIDE 21
  • 2. Evaluation

Variance of an Estimator

Ideally, we want an unbiased estimator with small variance. In practice, this is not that easy, as we will see shortly.

SLIDE 22
  • 2. Evaluation

Bias and Variance

SLIDE 23
  • 2. Evaluation

I am so BLUE...

An estimator with

  • zero bias
  • minimum variance

is called a Minimum Variance Unbiased Estimator (MVUE)

A Minimum Variance Unbiased Estimator that is additionally

  • linear in the features

is called a Best Linear Unbiased Estimator (BLUE)

SLIDE 24
  • 2. Evaluation

Maximum-Likelihood Estimation (MLE) of a Gaussian

Remember, the Gaussian has two parameters, the mean µ and the variance σ². Let's compute the bias of the maximum likelihood estimate of the mean of a Gaussian:

\hat{\mu}(X) = \frac{1}{N} \sum_{i=1}^{N} x_i

E_X\left[ \hat{\mu}(X) - \mu \right] = E\left[ \hat{\mu}(X) \right] - \mu = E\left[ \frac{1}{N} \sum_{i=1}^{N} x_i \right] - \mu = \frac{1}{N} \sum_{i=1}^{N} E[x_i] - \mu = \frac{1}{N} N \mu - \mu = 0

The MLE of the mean of a Gaussian is UNBIASED

SLIDE 25
  • 2. Evaluation

Maximum-Likelihood Estimation (MLE) of a Gaussian

Are all MLEs unbiased? No!

\hat{\sigma}^2(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2

E_X\left[ \hat{\sigma}^2(X) - \sigma^2 \right] = \ldots = \frac{N-1}{N} \sigma^2 - \sigma^2 = -\frac{1}{N} \sigma^2

The MLE of the variance of a Gaussian is BIASED

We can easily get an unbiased estimator:

\tilde{\sigma}^2(X) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \hat{\mu})^2
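A quick Monte Carlo check of these two results, the unbiased MLE of the mean and the biased MLE of the variance, can be done in plain Python. The sample size and number of trials below are arbitrary illustrative choices.

```python
import random

random.seed(0)
N, trials, mu, sigma2 = 5, 100_000, 0.0, 1.0

mean_est, var_mle, var_unb = 0.0, 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(mu, sigma2 ** 0.5) for _ in range(N)]
    m = sum(xs) / N                          # mu_hat: MLE of the mean
    s2 = sum((x - m) ** 2 for x in xs) / N   # sigma2_hat: MLE of the variance
    mean_est += m / trials                   # empirical E[mu_hat]
    var_mle += s2 / trials                   # empirical E[sigma2_hat]
    var_unb += s2 * N / (N - 1) / trials     # Bessel-corrected estimator

# E[mu_hat] ~ 0 (unbiased); E[sigma2_hat] ~ (N-1)/N = 0.8 (biased low);
# the corrected estimator ~ 1.0 (unbiased).
print(mean_est, var_mle, var_unb)
```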

SLIDE 26
  • 2. Evaluation

Bias and Variance (Regression Example)

Estimator f̂_D from training data D; the data is generated by y(x_q) = f(x_q) + ε with E[ε] = 0 and Var[ε] = σ_ε^2

Note that f(x) is not random, so E_{D,ε}[y(x_q)] = E_{D,ε}[f(x_q)] = f(x_q)

Expected squared error for a query x_q, estimated over all possible data sets D:

L(x_q) = E_{D,\varepsilon}\left[ \left( y(x_q) - \hat{f}_D(x_q) \right)^2 \right] = E_{D,\varepsilon}\left[ \left( y(x_q) - f(x_q) + f(x_q) - \hat{f}_D(x_q) \right)^2 \right]

SLIDE 27
  • 2. Evaluation

Bias and Variance (Regression Example)

Expected squared error for a query x_q, estimated over all possible data sets D:

L(x_q) = \underbrace{E_{D,\varepsilon}\left[ (y(x_q) - f(x_q))^2 \right]}_{= \sigma_\varepsilon^2} + E_D\left[ \left( f(x_q) - \hat{f}_D(x_q) \right)^2 \right] + 2 \underbrace{E_{D,\varepsilon}\left[ (y(x_q) - f(x_q))(f(x_q) - \hat{f}_D(x_q)) \right]}_{= 0}

= \sigma_\varepsilon^2 + E_D\left[ \left( f(x_q) - \hat{f}_D(x_q) \right)^2 \right] = \sigma_\varepsilon^2 + \mathrm{bias}^2\{\hat{f}_D(x_q)\} + \mathrm{var}\{\hat{f}_D(x_q)\}

Using \bar{f}(x_q) = E_D[\hat{f}_D(x_q)], we obtain

E_D\left[ \left( f(x_q) - \hat{f}_D(x_q) \right)^2 \right] = E_D\left[ \left( f(x_q) - \bar{f}(x_q) + \bar{f}(x_q) - \hat{f}_D(x_q) \right)^2 \right] = \underbrace{\left( f(x_q) - \bar{f}(x_q) \right)^2}_{= \mathrm{bias}^2\{\hat{f}_D(x_q)\}} + \underbrace{E_D\left[ \left( \bar{f}(x_q) - \hat{f}_D(x_q) \right)^2 \right]}_{= \mathrm{var}\{\hat{f}_D(x_q)\}}

SLIDE 28
  • 2. Evaluation

Bias-Variance Tradeoff

(Total) bias:

\mathrm{bias}^2\{\hat{f}_D\} = E_{x_q}\left[ \left( f(x_q) - E_D[\hat{f}_D(x_q)] \right)^2 \right]

  • Structure error: the model f̂_D(x_q) cannot do better

(Total) variance:

\mathrm{var}\{\hat{f}_D\} = E_{x_q, D}\left[ \left( \hat{f}_D(x_q) - E_{\tilde{D}}[\hat{f}_{\tilde{D}}(x_q)] \right)^2 \right]

  • Estimation error: finite data sets will always have errors

Expected total error ∝ bias² + variance

You typically cannot minimize both
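The tradeoff can be observed empirically: fit polynomials of several degrees to many noisy data sets and measure bias² (squared error of the average fit) and variance (spread of the fits around their average) on a query grid. The true function, noise level, and degrees below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
xq = np.linspace(0, np.pi, 50)    # query points x_q
f = np.sin(xq)                    # true function f(x_q), assumed known here

def bias2_and_var(degree, n_sets=300, n_pts=20, noise=0.3):
    # Fit the model to n_sets independent noisy data sets D.
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(0, np.pi, n_pts)
        y = np.sin(x) + rng.normal(0, noise, n_pts)  # y = f(x) + eps
        coef = np.polyfit(x, y, degree)              # least-squares fit f_D
        preds.append(np.polyval(coef, xq))
    preds = np.array(preds)
    fbar = preds.mean(axis=0)                        # average fit E_D[f_D]
    bias2 = np.mean((f - fbar) ** 2)                 # structure error
    var = np.mean((preds - fbar) ** 2)               # estimation error
    return bias2, var

for d in (1, 3, 7):
    print(d, bias2_and_var(d))
```

Low degrees show large bias² and small variance; high degrees show the reverse, so their sum is minimized somewhere in between.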


[Figure (log scale): bias², variance, and their sum vs. polynomial degree n; the sum is minimized at the optimal degree n* = 5.]

SLIDE 29
  • 2. Evaluation

Bias-Variance Tradeoff

Our learning algorithm will only generalize well if we find the right tradeoff between bias and variance

  • Simple enough to prevent overfitting to the particular training data set that we have
  • Yet expressive enough to represent the important properties of the data

To ensure that our learning algorithm works well, we have to evaluate it on test data

But what if we don’t have any?

SLIDE 30
  • 2. Evaluation

How Do We Choose the Model?

Goal: find a good model (e.g., a good set of features)

Split the dataset into:

  • Training Set
  • Validation Set
  • Test Set

  • 1. Training Set: Fit Parameters
  • 2. Validation Set: Choose model class or single parameters
  • 3. Test Set: Estimate prediction error of trained model

⇒ Error/loss L needs to be estimated on an independent set!
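The three-way split above can be sketched as follows. This is a hedged illustration; the 60/20/20 ratios and all names are my own choices, not from the lecture.

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)   # shuffle before splitting
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]                # 3. estimate prediction error
    val = data[n_test:n_test + n_val]   # 2. choose model class / parameters
    train = data[n_test + n_val:]       # 1. fit parameters
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```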

SLIDE 31
  • 2. Evaluation

Model Selection: Cross Validation

Partition the data into K sets D_κ; use K − 1 for training and 1 for validation

Training Set | Training Set | Validation Set

and compute

\theta_k(M_j) = \arg\min_{\theta \in M_j} \sum_{\kappa \neq k} \sum_{(x_i, y_i) \in D_\kappa} L\left( f_\theta(x_i), y_i \right)

L_k(M_j) = \sum_{(x_i, y_i) \in D_k} L\left( f_{\theta_k}(x_i), y_i \right)

  • Exhaustive cross validation: try all partitioning possibilities ⇒ computationally expensive
  • Bootstrap: randomly sample non-overlapping training / validation sets
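A from-scratch sketch of this procedure, applied to the polynomial model selection running through these slides; the data generation, squared loss, and candidate degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(2 * x) + rng.normal(0, 0.1, 40)  # assumed noisy ground truth

def cv_loss(degree, K=4):
    # Partition the indices into K sets D_k.
    folds = np.array_split(np.arange(len(x)), K)
    losses = []
    for k in range(K):
        val = folds[k]                                   # fold k: validation
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[trn], y[trn], degree)        # theta_k: fit on K-1 folds
        pred = np.polyval(coef, x[val])
        losses.append(np.mean((y[val] - pred) ** 2))     # validation loss L_k
    return np.mean(losses)                               # average over folds

# M* = argmin_M average validation loss over candidate degrees.
best = min(range(1, 10), key=cv_loss)
print(best, cv_loss(best))
```

Setting K = N turns the same loop into leave-one-out cross-validation.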

SLIDE 32
  • 2. Evaluation

Model Selection: K-fold Cross Validation

Cheapest reasonable approach: K-fold cross validation

[Diagram: K splits of the data; in each split, one fold is the validation set and the remaining K − 1 folds are the training set]

Compute the validation loss and choose the model with the smallest average validation loss:

M^* = \arg\min_M \frac{1}{K} \sum_{k=1}^{K} L_k(M)

Leave-one-out cross-validation (LOOCV): K = N ⇒ validation set size 1

SLIDE 33
  • 2. Evaluation

Model Selection: K-fold Cross Validation

4-fold Cross-Validation / Leave-one-out Cross-Validation

[Figures: mean squared error \frac{1}{2N} \sum_i (y_i - f(x_i))^2 vs. polynomial degree n, comparing training error, test error, and the cross-validation error (4-fold and leave-one-out); the optimal polynomial degree under the CV error is close to the one under the test error.]

SLIDE 34
  • 2. Evaluation

Machine Learning Cycle

[https://devpost.com]

SLIDE 35
  • 3. Wrap-Up

Outline

  • 1. Clustering
  • 2. Evaluation
  • 3. Wrap-Up
SLIDE 36
  • 3. Wrap-Up

You know now:

  • Different algorithms for clustering
  • How to compute the bias and variance of an estimator
  • What the bias-variance tradeoff is
  • What MVUE and BLUE mean
  • The difference between unbiased and biased estimators
  • How to mimic test data evaluation using cross-validation

SLIDE 37
  • 3. Wrap-Up

Self-Test Questions

  • How can we find meaningful clusters in the data?
  • How does density estimation with mixture models relate to clustering?
  • What is the bias-variance trade-off?
  • What is a BLUE estimator?
  • Are maximum likelihood estimators always unbiased?
  • What is leave-one-out cross-validation? What do we need it for?

SLIDE 38
  • 3. Wrap-Up

Homework

Reading assignment for next lecture:

  • Murphy, ch. 7
  • Bishop, ch. 3
