SLIDE 1

Lecture 2: Model-based classification

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data 28th March 2019

SLIDE 2

Reprise: Statistical Learning (I)

Regression

▶ Theoretically best regression function for squared error loss

  $\hat{g}(\mathbf{x}) = \mathbb{E}_{p(y|\mathbf{x})}[y]$

▶ Approximate (1) or make model assumptions (2)

  1. k-nearest neighbour regression

     $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} y_i$

  2. linear regression (viewpoint: generalized linear models (GLM))

     $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \mathbf{x}^T \boldsymbol\beta$

SLIDE 3

Reprise: Statistical Learning (II)

Classification

▶ Theoretically best classification rule for 0-1 loss and $K$ possible classes

  $\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} p(j \mid \mathbf{x})$

▶ Approximate (1) or make model assumptions (2)

  1. k-nearest neighbour classification (see the code sketch below)

     $p(j \mid \mathbf{x}) \approx \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} \mathbf{1}(c_i = j)$

  2. Instead of approximating $p(j \mid \mathbf{x})$ from data, can we make sensible model assumptions?

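To make the kNN approximation above concrete, here is a minimal NumPy sketch (added here, not part of the original slides); the function name `knn_classify` and the toy Gaussian data are illustrative assumptions.

```python
import numpy as np

def knn_classify(x_new, X, c, k=5):
    """Estimate p(j | x_new) by the fraction of the k nearest training
    points (Euclidean metric) in each class and return the arg max."""
    dists = np.linalg.norm(X - x_new, axis=1)   # distances to all training points
    neighbours = np.argsort(dists)[:k]          # indices of the k closest points
    classes, counts = np.unique(c[neighbours], return_counts=True)
    return classes[np.argmax(counts)]           # most frequent class among the neighbours

# toy example: two Gaussian point clouds (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
c = np.repeat([0, 1], 50)
print(knn_classify(np.array([2.5, 2.5]), X, c, k=7))
```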

SLIDE 4

Amendment: kNN methods

There are two choices to make when implementing a kNN method

  • 1. The metric to determine a neighbourhood

▶ e.g. Euclidean/$\ell_2$ norm, Manhattan/$\ell_1$ norm, max norm, …

  • 2. The number of neighbours, i.e. $k$

The choice of metric changes the underlying local model of the method, while $k$ is a tuning parameter.

SLIDE 5

Model-based classification

SLIDE 6

Classification as regression

▶ Consider a two-class problem, with $c_i = 0$ or $c_i = 1$

▶ Instead of 0-1 loss, use squared error loss, i.e.

  $\mathbb{E}_{p(c|\mathbf{x})}[c] = 0 \cdot p(0 \mid \mathbf{x}) + 1 \cdot p(1 \mid \mathbf{x}) = p(1 \mid \mathbf{x})$

  Note that $c$ has a discrete distribution.

▶ Linear regression model assumption

  $p(1 \mid \mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] \approx \mathbf{x}^T \boldsymbol\beta$

▶ Since we are approximating $p(1 \mid \mathbf{x})$ and $p(0 \mid \mathbf{x}) = 1 - p(1 \mid \mathbf{x}) \approx 1 - \mathbf{x}^T \boldsymbol\beta$, we have indirectly specified a model approximation for Bayes' rule as well:

  $\hat{c}(\mathbf{x}) = \begin{cases} 0 & \mathbf{x}^T \boldsymbol\beta \le \frac{1}{2} \\ 1 & \text{otherwise} \end{cases}$

  Note that $\mathbf{x}^T \boldsymbol\beta = \frac{1}{2}$ defines the decision boundary.
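
As a hedged illustration of this slide (added here, not from the original deck), the following NumPy sketch fits a least-squares line to 0/1 labels and classifies by thresholding $\mathbf{x}^T \boldsymbol\beta$ at 1/2; the simulated data are made up.

```python
import numpy as np

# 0-1 regression sketch: least-squares fit to 0/1 labels,
# classification by thresholding x^T beta at 1/2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
c = np.repeat([0.0, 1.0], 50)

X1 = np.column_stack([np.ones(len(X)), X])      # add an intercept column
beta, *_ = np.linalg.lstsq(X1, c, rcond=None)   # least-squares estimate of beta

scores = X1 @ beta                              # approximates p(1 | x)
pred = (scores > 0.5).astype(int)               # decision boundary at x^T beta = 1/2
print("training accuracy:", np.mean(pred == c))
```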

SLIDE 7

0-1 regression

[Figure: 0-1 regression on the iris data (species setosa and versicolor): one panel shows the 0-1 coding against Sepal Length, the other shows Sepal Width against Sepal Length.]

The solid black lines show the decision boundary.

SLIDE 8

0-1 regressions and outliers

[Figure: decision boundaries from 0-1 regression in the (x1, x2) plane for two cases: without and with an added outlier.]

SLIDE 9

Dummy encoding for categorical variables

In regression, when a predictor $x$ is categorical, i.e. takes one of $K$ values, it is common to use a dummy encoding.

Example: $x = 1 \to \mathbf{z} = (1, 0, 0)$, $x = 2 \to \mathbf{z} = (0, 1, 0)$, $x = 3 \to \mathbf{z} = (0, 0, 1)$

Idea: Turn a classification problem into a regression problem by representing the class outcomes $c_i$ in the training data $(c_i, \mathbf{x}_i)$ as vectors in dummy encoding.

SLIDE 10

Multiple classes

▶ This creates a sequence of 0-1 regressions (see blackboard). If there are $K$ classes then

  $z^{(1)}_i := \mathbf{1}(c_i = 1) \;\to\; p(z^{(1)} = 1 \mid \mathbf{x}) \approx \mathbf{x}^T \boldsymbol\beta^{(1)}$

  $\vdots$

  $z^{(K)}_i := \mathbf{1}(c_i = K) \;\to\; p(z^{(K)} = 1 \mid \mathbf{x}) \approx \mathbf{x}^T \boldsymbol\beta^{(K)}$

▶ Note that

  $p(j \mid \mathbf{x}) = p(z^{(j)} = 1 \mid \mathbf{x}) \approx \mathbf{x}^T \boldsymbol\beta^{(j)}$

▶ Classification rule (a code sketch follows below)

  $\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} p(j \mid \mathbf{x}) \approx \arg\max_{1 \le j \le K} \mathbf{x}^T \boldsymbol\beta^{(j)}$

  Decision boundaries are defined by $\mathbf{x}^T \boldsymbol\beta^{(j)} = \mathbf{x}^T \boldsymbol\beta^{(m)}$ for $j \ne m$.

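The classification rule above can be sketched in a few lines of NumPy (an illustration added here, not part of the slides): each dummy column is regressed on the features and new points are assigned by the arg max of the fitted scores.

```python
import numpy as np

# One least-squares fit per class: regress each dummy column 1(c_i = j)
# on the features, then classify by arg max_j of x^T beta^(j).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, (40, 2)) for m in (0, 3, 6)])
c = np.repeat([0, 1, 2], 40)

Z = (c[:, None] == np.arange(3)).astype(float)   # dummy encoding, one column per class
X1 = np.column_stack([np.ones(len(X)), X])       # intercept column
B, *_ = np.linalg.lstsq(X1, Z, rcond=None)       # columns of B are the beta^(j)

pred = np.argmax(X1 @ B, axis=1)                 # arg max over the fitted class scores
print("training accuracy:", np.mean(pred == c))
```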

SLIDE 11

Multiple 0-1 regressions

[Figure: multiple 0-1 regressions for a one-dimensional predictor and three classes; the 0-1 coding of each class is plotted against the predictor together with the fitted lines.]

SLIDE 12

Problems with 0-1 regression

Observations:

  • 1. $\mathbf{x}^T \boldsymbol\beta$ is unbounded but models a probability $p(j \mid \mathbf{x}) \in [0, 1]$
  • 2. Only values of $\mathbf{x}^T \boldsymbol\beta$ around 0.5 (for binary classification) or close to the maximal value (for multiple classes) are really of interest.
  • 3. Sensitive to points far away from the boundary (outliers)
  • 4. Masking: Classes can get buried among other classes (adding polynomial predictors can sometimes help, but this is arbitrary and data dependent)

Inspiration from GLM: Can we transform $\mathbf{x}^T \boldsymbol\beta$ such that the transformed values are in $[0, 1]$, are similar to the original values when close to 0.5, and are insensitive to outliers far away from the boundary?

SLIDE 13

Logistic function and Normal Distribution CDF

[Figure: the logistic function and the standard Normal CDF plotted for x between −4 and 4; both are S-shaped curves from 0 to 1.]

Logistic (sigmoid) function: $\sigma(x) = \dfrac{\exp(x)}{1 + \exp(x)}$

Standard Normal CDF: $\Phi(x) = \displaystyle\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right) \mathrm{d}z$

SLIDE 14

Logistic and probit regression

▶ We arrive at logistic regression when assuming

  $p(1 \mid \mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] = \sigma(\mathbf{x}^T \boldsymbol\beta)$

  or probit regression when assuming

  $p(1 \mid \mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] = \Phi(\mathbf{x}^T \boldsymbol\beta)$

▶ Parameters can be estimated by iteratively reweighted least squares (details in ESL Ch. 4.4.1; a code sketch follows below)

▶ A warning: a problematic situation in the two-class case (occurs seldom in practice)

  ▶ Assume the two classes can be separated perfectly in one or more predictors

  ▶ Logistic regression then tries to fit a step-like function, which forces the intercept towards $-\infty$ and the corresponding predictor coefficient towards $+\infty$.

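The IRLS/Newton iteration mentioned above can be sketched as follows (an added illustration assuming a plain Newton-Raphson update, not the slides' own code); ESL Ch. 4.4.1 gives the derivation.

```python
import numpy as np

def fit_logistic_irls(X, c, n_iter=20):
    """Binary logistic regression via iteratively reweighted least squares
    (Newton-Raphson); X is assumed to already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # current p(1 | x_i)
        W = p * (1.0 - p)                                 # diagonal of the weight matrix
        H = X.T @ (W[:, None] * X) + 1e-10 * np.eye(X.shape[1])  # X^T W X, tiny ridge for stability
        beta = beta + np.linalg.solve(H, X.T @ (c - p))   # Newton step
    return beta

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
X1 = np.column_stack([np.ones(len(X)), X])
c = np.repeat([0.0, 1.0], 50)
beta_hat = fit_logistic_irls(X1, c)
pred = (X1 @ beta_hat > 0).astype(int)   # sigma(x^T beta) > 1/2  <=>  x^T beta > 0
print("training accuracy:", np.mean(pred == c))
```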

SLIDE 15

Logistic regression and outliers

[Figure: decision boundaries in the (x1, x2) plane for three cases: 0-1 regression without the outlier, 0-1 regression with the outlier, and logistic regression with the outlier.]

SLIDE 16

Multi-class logistic regression

▶ In case of $K > 2$ classes, using dummy encoding for the outcome leads again to a series of regression problems.

▶ Requirement: probabilities should be modelled, i.e. $p(j \mid \mathbf{x}) \in [0, 1]$ for each class and $\sum_j p(j \mid \mathbf{x}) = 1$

▶ Softmax function $\boldsymbol\sigma : \mathbb{R}^K \mapsto [0, 1]^K$

  $\sigma_j(\mathbf{z}) = \dfrac{e^{z_j}}{\sum_{m=1}^{K} e^{z_m}} \quad\Leftrightarrow\quad \sigma_j(\mathbf{z}) = \dfrac{e^{z_j - z_K}}{1 + \sum_{m=1}^{K-1} e^{z_m - z_K}}$

▶ Model now (a code sketch of the softmax map follows below)

  $p(j \mid \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \boldsymbol\beta^{(j)}}}{\sum_{m=1}^{K} e^{\mathbf{x}^T \boldsymbol\beta^{(m)}}} \quad\text{or}\quad p(j \mid \mathbf{x}) = \dfrac{e^{\mathbf{x}^T (\boldsymbol\beta^{(j)} - \boldsymbol\beta^{(K)})}}{1 + \sum_{m=1}^{K-1} e^{\mathbf{x}^T (\boldsymbol\beta^{(m)} - \boldsymbol\beta^{(K)})}}$

▶ This method has many names: softmax regression, multinomial logistic regression, maximum entropy classifier, …

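A small sketch of the softmax map itself (added for illustration; fitting the coefficient vectors, e.g. by Newton or gradient methods, is not shown):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: maps K real scores to K probabilities in [0, 1]
    summing to one; subtracting the row maximum avoids overflow."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# made-up scores x^T beta^(j) for two observations and three classes
scores = np.array([[1.0, 0.2, -0.5],
                   [0.1, 2.3,  0.4]])
probs = softmax(scores)
print(probs)                  # class probabilities per row
print(probs.sum(axis=1))      # each row sums to 1
print(probs.argmax(axis=1))   # predicted classes
```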

SLIDE 17

Multi-class logistic regression: An example

[Figure: fitted class probabilities for setosa, versicolor and virginica as functions of Sepal Length in the iris data.]

SLIDE 18

Classification with focus on the feature/predictor space

SLIDE 19

Motivation for a different viewpoint: Nearest centroids

[Figure: iris data, Sepal Width against Sepal Length, coloured by species (setosa, versicolor, virginica).]

Determine the mean predictor vector per class,

$\hat{\boldsymbol\mu}_j = \frac{1}{n_j} \sum_{i:\, c_i = j} \mathbf{x}_i \quad\text{where}\quad n_j = \sum_{i=1}^{n} \mathbf{1}(c_i = j),$

and classify points to the class whose mean is closest (a code sketch follows below).

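The nearest-centroid rule above in a few lines of NumPy (an added sketch with made-up toy data):

```python
import numpy as np

def nearest_centroid(x_new, X, c):
    """Classify x_new to the class whose mean feature vector (centroid) is closest."""
    classes = np.unique(c)
    centroids = np.array([X[c == j].mean(axis=0) for j in classes])  # mu_hat_j per class
    dists = np.linalg.norm(centroids - x_new, axis=1)                # Euclidean distances
    return classes[np.argmin(dists)]

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1.0, (30, 2)) for m in (0, 3, 6)])
c = np.repeat([0, 1, 2], 30)
print(nearest_centroid(np.array([2.7, 2.9]), X, c))
```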

SLIDE 20

A change of scenery

Summary

▶ Classification can be approached through regression and approximation of $\mathbb{E}_{p(c|\mathbf{x})}[c]$

▶ Indirectly we approximated $p(j \mid \mathbf{x})$ and were able to use Bayes' rule

Observation: Good predictors group by class in feature space.

Change of focus: Let's model the density of $\mathbf{x}$ conditional on the class $j$ instead! How? Bayes' law.

SLIDE 21

The setting of Discriminant Analysis

Apply Bayes' law:

$p(j \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid j)\, p(j)}{\sum_{m=1}^{K} p(\mathbf{x} \mid m)\, p(m)}$

Instead of specifying $p(j \mid \mathbf{x})$ we can specify $p(\mathbf{x} \mid j)$ and $p(j)$.

The main assumption of Discriminant Analysis (DA) is

$p(\mathbf{x} \mid j) \sim \mathcal{N}(\boldsymbol\mu_j, \boldsymbol\Sigma_j)$

where $\boldsymbol\mu_j \in \mathbb{R}^p$ is the mean vector for class $j$ and $\boldsymbol\Sigma_j \in \mathbb{R}^{p \times p}$ the corresponding covariance matrix.

SLIDE 22

Finding the parameters of DA

▶ Notation: Write $p(j) = \pi_j$ and consider these as unknown parameters

▶ Given data $(c_i, \mathbf{x}_i)$, the likelihood maximization problem is

  $\arg\max_{\boldsymbol\mu, \boldsymbol\Sigma, \boldsymbol\pi} \prod_{i=1}^{n} \mathcal{N}(\mathbf{x}_i \mid \boldsymbol\mu_{c_i}, \boldsymbol\Sigma_{c_i})\, \pi_{c_i} \quad\text{subject to}\quad \sum_{j=1}^{K} \pi_j = 1.$

▶ Can be solved using a Lagrange multiplier (try it!) and leads to (a code sketch follows below)

  $\hat{\pi}_j = \dfrac{n_j}{n}, \quad\text{with}\quad n_j = \sum_{i=1}^{n} \mathbf{1}(c_i = j)$

  $\hat{\boldsymbol\mu}_j = \dfrac{1}{n_j} \sum_{i:\, c_i = j} \mathbf{x}_i$

  $\hat{\boldsymbol\Sigma}_j = \dfrac{1}{n_j - 1} \sum_{i:\, c_i = j} (\mathbf{x}_i - \hat{\boldsymbol\mu}_j)(\mathbf{x}_i - \hat{\boldsymbol\mu}_j)^T$

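A minimal sketch of these estimates in NumPy (added for illustration; `np.cov` with `rowvar=False` uses the $n_j - 1$ denominator from the slide):

```python
import numpy as np

def da_estimates(X, c):
    """Maximum-likelihood-style estimates for discriminant analysis:
    class priors pi_hat_j = n_j / n, class means mu_hat_j and
    class covariance matrices Sigma_hat_j (denominator n_j - 1)."""
    classes = np.unique(c)
    pi_hat = np.array([np.mean(c == j) for j in classes])
    mu_hat = np.array([X[c == j].mean(axis=0) for j in classes])
    Sigma_hat = np.array([np.cov(X[c == j], rowvar=False) for j in classes])
    return classes, pi_hat, mu_hat, Sigma_hat

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
c = np.repeat([0, 1], 40)
classes, pi_hat, mu_hat, Sigma_hat = da_estimates(X, c)
print(pi_hat, mu_hat, Sigma_hat.shape)
```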

SLIDE 23

Performing classification in DA

Bayes' rule implies the classification rule

$\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_j, \boldsymbol\Sigma_j)\, \pi_j$

Note that since log is strictly increasing, this is equivalent to

$\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} \delta_j(\mathbf{x})$

where

$\delta_j(\mathbf{x}) = \log \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_j, \boldsymbol\Sigma_j) + \log \pi_j = \log \pi_j - \frac{1}{2}(\mathbf{x} - \boldsymbol\mu_j)^T \boldsymbol\Sigma_j^{-1} (\mathbf{x} - \boldsymbol\mu_j) - \frac{1}{2} \log |\boldsymbol\Sigma_j| \; (+\, C)$

This is a quadratic function in $\mathbf{x}$ (a code sketch follows below).

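A sketch of evaluating these quadratic discriminants and classifying by arg max (an added illustration; the estimates could come from the `da_estimates` sketch above):

```python
import numpy as np

def qda_discriminants(x, pi_hat, mu_hat, Sigma_hat):
    """delta_j(x) = log pi_j - 0.5 (x - mu_j)^T Sigma_j^{-1} (x - mu_j)
    - 0.5 log|Sigma_j|, up to the constant C; classify by arg max."""
    deltas = []
    for pi_j, mu_j, S_j in zip(pi_hat, mu_hat, Sigma_hat):
        diff = x - mu_j
        deltas.append(np.log(pi_j)
                      - 0.5 * diff @ np.linalg.solve(S_j, diff)
                      - 0.5 * np.log(np.linalg.det(S_j)))
    return np.array(deltas)

# hypothetical estimates for two classes in two dimensions
pi_hat = np.array([0.5, 0.5])
mu_hat = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma_hat = np.array([np.eye(2), 2.0 * np.eye(2)])
deltas = qda_discriminants(np.array([1.0, 1.2]), pi_hat, mu_hat, Sigma_hat)
print(deltas, deltas.argmax())
```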

SLIDE 24

Different levels of complexity

▶ This method is called Quadratic Discriminant Analysis (QDA)

▶ Problem: many parameters, and they grow quickly with the dimension

  ▶ $K - 1$ for all $\pi_j$
  ▶ $p \cdot K$ for all $\boldsymbol\mu_j$
  ▶ $p(p + 1)/2 \cdot K$ for all $\boldsymbol\Sigma_j$ (most costly)

▶ Solution: Replace the covariance matrices $\boldsymbol\Sigma_j$ by a pooled estimate

  $\hat{\boldsymbol\Sigma} = \sum_{j=1}^{K} \hat{\boldsymbol\Sigma}_j \dfrac{n_j - 1}{n - K} = \dfrac{1}{n - K} \sum_{j=1}^{K} \sum_{i:\, c_i = j} (\mathbf{x}_i - \hat{\boldsymbol\mu}_j)(\mathbf{x}_i - \hat{\boldsymbol\mu}_j)^T$

▶ Simpler correlation and variance structure: all classes are assumed to have the same correlation structure between features

SLIDE 25

Performing classification in the simplified case

As before, consider

$\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} \delta_j(\mathbf{x})$

where now

$\delta_j(\mathbf{x}) = \log \pi_j + \mathbf{x}^T \boldsymbol\Sigma^{-1} \boldsymbol\mu_j - \frac{1}{2} \boldsymbol\mu_j^T \boldsymbol\Sigma^{-1} \boldsymbol\mu_j \; (+\, C)$

This is a linear function in $\mathbf{x}$. The method is therefore called Linear Discriminant Analysis (LDA). A code sketch using the pooled covariance estimate follows below.

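A compact LDA sketch combining the pooled covariance estimate from the previous slide with the linear discriminant above (an added illustration; the toy data are made up):

```python
import numpy as np

def lda_fit_predict(X, c, X_new):
    """Pooled-covariance LDA: delta_j(x) = log pi_j + x^T Sigma^{-1} mu_j
    - 0.5 mu_j^T Sigma^{-1} mu_j, evaluated for all rows of X_new."""
    classes = np.unique(c)
    n, K = len(X), len(classes)
    pi_hat = np.array([np.mean(c == j) for j in classes])
    mu_hat = np.array([X[c == j].mean(axis=0) for j in classes])
    Sigma = sum((X[c == j] - mu_hat[i]).T @ (X[c == j] - mu_hat[i])
                for i, j in enumerate(classes)) / (n - K)      # pooled estimate
    Sinv_mu = np.linalg.solve(Sigma, mu_hat.T)                 # Sigma^{-1} mu_j as columns
    deltas = np.log(pi_hat) + X_new @ Sinv_mu - 0.5 * np.sum(mu_hat.T * Sinv_mu, axis=0)
    return classes[np.argmax(deltas, axis=1)]

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
c = np.repeat([0, 1], 40)
print(lda_fit_predict(X, c, np.array([[1.4, 1.6], [3.1, 2.8]])))
```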

SLIDE 26

Even more simplifications

Other simplifications of the correlation structure are possible (a code sketch follows below):

▶ Ignore all correlations between features but allow different variances, i.e. $\boldsymbol\Sigma_j = \boldsymbol\Lambda_j$ for a diagonal matrix $\boldsymbol\Lambda_j$ (Diagonal QDA or Naive Bayes classifier)

▶ Ignore all correlations and make feature variances equal, i.e. $\boldsymbol\Sigma_j = \boldsymbol\Lambda$ for a diagonal matrix $\boldsymbol\Lambda$ (Diagonal LDA)

▶ Ignore correlations and variances, i.e. $\boldsymbol\Sigma_j = \sigma^2 \mathbf{I}_{p \times p}$ (Nearest Centroids adjusted for class frequencies $\pi_j$)

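To make the three restrictions concrete, here is an added sketch that maps per-class covariance estimates to the simpler structures; note that the unweighted average used as a pooled stand-in is a simplification of the weighted pooled estimate shown two slides back.

```python
import numpy as np

def restrict_covariances(Sigma_hat, kind):
    """Restrict per-class covariance estimates (array of shape (K, p, p)):
    'diag_qda'  - keep per-class variances, drop correlations (Lambda_j)
    'diag_lda'  - one shared diagonal matrix (Lambda)
    'spherical' - sigma^2 * I shared across classes (nearest centroids)."""
    K, p, _ = Sigma_hat.shape
    if kind == "diag_qda":
        return np.array([np.diag(np.diag(S)) for S in Sigma_hat])
    pooled = Sigma_hat.mean(axis=0)            # unweighted average as a pooled stand-in
    if kind == "diag_lda":
        return np.repeat(np.diag(np.diag(pooled))[None], K, axis=0)
    if kind == "spherical":
        sigma2 = np.mean(np.diag(pooled))      # single shared variance
        return np.repeat((sigma2 * np.eye(p))[None], K, axis=0)
    raise ValueError(f"unknown kind: {kind}")

Sigma_hat = np.array([[[2.0, 0.5], [0.5, 1.0]],
                      [[1.0, -0.3], [-0.3, 1.5]]])
print(restrict_covariances(Sigma_hat, "spherical")[0])
```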

SLIDE 27

Examples of LDA and QDA

[Figure: three panels showing Sepal Width against Sepal Length on the iris data (setosa, versicolor, virginica) with the decision regions of Nearest Centroids, LDA and QDA.]

Decision boundaries can be found from $\mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_j, \boldsymbol\Sigma_j)\, \pi_j = \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_m, \boldsymbol\Sigma_m)\, \pi_m$ for $j \ne m$, with $\boldsymbol\Sigma_j = \boldsymbol\Sigma$ for LDA and $\boldsymbol\Sigma_j = \sigma^2 \mathbf{I}_{p \times p}$ for Nearest Centroids.

SLIDE 28

Take-home message

▶ Classification can be achieved through the point of view of regression

▶ Modelling the conditional densities of the features instead of the classes leads to Discriminant Analysis (DA)

▶ There is a range of assumptions in DA about the correlation structure in feature space → a trade-off between stability and flexibility
