Lecture 2: Model-based classification
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
28th March 2019
Reprise: Statistical Learning (I)
Regression

▶ Theoretically best regression function for squared error loss:
  $\hat{f}(\mathbf{x}) = \mathbb{E}_{p(y|\mathbf{x})}[y]$
▶ Approximate (1) or make model assumptions (2):
  1. k-nearest neighbour regression,
     $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} y_i$
  2. linear regression (viewpoint: generalized linear models (GLM)),
     $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \mathbf{x}^T \boldsymbol{\beta}$
Reprise: Statistical Learning (II)
Classification

▶ Theoretically best classification rule for 0-1 loss and $L$ possible classes:
  $\hat{c}(\mathbf{x}) = \arg\max_{1 \le l \le L} p(l|\mathbf{x})$
▶ Approximate (1) or make model assumptions (2):
  1. k-nearest neighbour classification (a sketch follows below),
     $p(l|\mathbf{x}) \approx \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} \mathbb{1}(c_i = l)$
  2. Instead of approximating $p(l|\mathbf{x})$ from data, can we make sensible model assumptions instead?
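To make approximation (1) concrete, a minimal Python sketch of kNN classification (my own illustration, not code from the course): estimate $p(l|\mathbf{x})$ by the class frequencies among the k nearest training points and classify by the arg max.

import numpy as np

def knn_classify(X_train, c_train, x, k=5):
    # Euclidean distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbours = np.argsort(dists)[:k]   # indices of the k closest points
    # class frequencies among the neighbours estimate p(l|x)
    labels, counts = np.unique(c_train[neighbours], return_counts=True)
    return labels[np.argmax(counts)]     # arg max_l of the estimated p(l|x)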
Amendment: kNN methods
There are two choices to make when implementing a kNN method:

1. The metric to determine a neighbourhood
   ▶ e.g. Euclidean/$\ell_2$ norm, Manhattan/$\ell_1$ norm, max norm, …
2. The number of neighbours, i.e. $k$

The choice of metric changes the underlying local model of the method, while $k$ is a tuning parameter.
Model-based classification
Classification as regression
▶ Consider a two-class problem, with $c_i = 0$ or $c_i = 1$.
▶ Instead of 0-1 loss, use squared error loss, i.e.
  $\mathbb{E}_{p(c|\mathbf{x})}[c] = 0 \cdot p(0|\mathbf{x}) + 1 \cdot p(1|\mathbf{x}) = p(1|\mathbf{x})$
  Note that $c$ has a discrete distribution.
▶ Linear regression model assumption:
  $p(1|\mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] \approx \mathbf{x}^T \boldsymbol{\beta}$
▶ Since we are approximating $p(1|\mathbf{x})$, and $p(0|\mathbf{x}) = 1 - p(1|\mathbf{x}) \approx 1 - \mathbf{x}^T \boldsymbol{\beta}$, we have indirectly specified a model approximation for Bayes' rule as well:
  $\hat{c}(\mathbf{x}) = \begin{cases} 0 & \mathbf{x}^T \boldsymbol{\beta} \le \frac{1}{2} \\ 1 & \text{otherwise} \end{cases}$
  Note that $\mathbf{x}^T \boldsymbol{\beta} = \frac{1}{2}$ defines the decision boundary (a sketch follows below).
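A minimal Python sketch of the 0-1 regression classifier above (my own illustration, assuming labels coded as 0/1): fit ordinary least squares to the labels and threshold the fitted values at 1/2.

import numpy as np

def fit_01_regression(X, c):
    # append an intercept column and solve the least squares problem
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, c, rcond=None)
    return beta

def predict_01(beta, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    # decision boundary at x^T beta = 1/2
    return (X1 @ beta > 0.5).astype(int)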
0-1 regression
[Figure: 0-1 regression on the iris data (species setosa, versicolor); panels show the 0/1 coding against Sepal Length and Sepal Width against Sepal Length. The solid black lines show the decision boundary.]
0-1 regressions and outliers
[Figure: 0-1 regression decision boundaries in the (x1, x2) plane for two cases: no outlier and with outlier.]
Dummy encoding for categorical variables
In regression, when a predictor $x$ is categorical, i.e. takes one of $L$ values, it is common to use a dummy encoding.

Example:
$x = 1 \Rightarrow \mathbf{z} = (1, 0, 0)$
$x = 2 \Rightarrow \mathbf{z} = (0, 1, 0)$
$x = 3 \Rightarrow \mathbf{z} = (0, 0, 1)$

Idea: Turn a classification problem into a regression problem by representing the class outcomes $c_i$ in the training data $(c_i, \mathbf{x}_i)$ as vectors in dummy encoding.
Multiple classes
▶ This creates a sequence of 0-1 regressions (see blackboard). If there are $L$ classes then
  $z_i^{(1)} = \mathbb{1}(c_i = 1) \;\Rightarrow\; p(z^{(1)} = 1|\mathbf{x}) \approx \mathbf{x}^T \boldsymbol{\beta}^{(1)}$
  $\;\vdots$
  $z_i^{(L)} = \mathbb{1}(c_i = L) \;\Rightarrow\; p(z^{(L)} = 1|\mathbf{x}) \approx \mathbf{x}^T \boldsymbol{\beta}^{(L)}$
▶ Note that
  $p(l|\mathbf{x}) = p(z^{(l)} = 1|\mathbf{x}) \approx \mathbf{x}^T \boldsymbol{\beta}^{(l)}$
▶ Classification rule (a sketch follows below):
  $\hat{c}(\mathbf{x}) = \arg\max_{1 \le l \le L} p(l|\mathbf{x}) \approx \arg\max_{1 \le l \le L} \mathbf{x}^T \boldsymbol{\beta}^{(l)}$

Decision boundaries are defined by $\mathbf{x}^T \boldsymbol{\beta}^{(l)} = \mathbf{x}^T \boldsymbol{\beta}^{(m)}$ for $l \ne m$.
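A minimal Python sketch of this indicator-matrix approach (my own illustration, assuming classes are coded 0, ..., L-1): one least squares fit per dummy-encoded class, then classify to the largest fitted value.

import numpy as np

def fit_indicator_regression(X, c, L):
    X1 = np.column_stack([np.ones(len(X)), X])   # intercept column
    Z = np.eye(L)[c]                             # dummy encoding of the classes
    B, *_ = np.linalg.lstsq(X1, Z, rcond=None)   # column m holds beta^(m)
    return B

def predict_indicator(B, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.argmax(X1 @ B, axis=1)             # arg max_l x^T beta^(l)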
Multiple 0-1 regressions
[Figure: multiple 0-1 regressions for a three-class problem; fitted codings plotted against a one-dimensional predictor for classes 1, 2 and 3.]
Problems with 0-1 regression
Observations:
1. $\mathbf{x}^T \boldsymbol{\beta}$ is unbounded but models a probability $p(l|\mathbf{x}) \in [0, 1]$.
2. Only values of $\mathbf{x}^T \boldsymbol{\beta}$ around 0.5 (for binary classification) or close to the maximal value (for multiple classes) are really of interest.
3. Sensitive to points far away from the boundary (outliers).
4. Masking: Classes can get buried among other classes (adding polynomial predictors can sometimes help, but this is arbitrary and data dependent).

Inspiration from GLM: Can we transform $\mathbf{x}^T \boldsymbol{\beta}$ such that the transformed values are in $[0, 1]$, are similar to the original values when close to 0.5, and are insensitive to outliers far away from the boundary?
Logistic function and Normal Distribution CDF
[Figure: the logistic function and the standard normal CDF for $x \in [-4, 4]$; both are S-shaped curves with values in $(0, 1)$.]

Logistic (sigmoid) function:
$\sigma(x) = \frac{\exp(x)}{1 + \exp(x)}$

Standard normal CDF:
$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) \mathrm{d}z$
Logistic and probit regression
▶ We arrive at logistic regression when assuming
  $p(1|\mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] = \sigma(\mathbf{x}^T \boldsymbol{\beta})$
  or probit regression when assuming
  $p(1|\mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] = \Phi(\mathbf{x}^T \boldsymbol{\beta})$
  (in GLM terms, $\sigma^{-1}$ and $\Phi^{-1}$ are the link functions).
▶ Parameters can be estimated by iteratively reweighted least squares (details in ESL Ch. 4.4.1; a sketch follows below).
▶ A warning: a problematic situation in the two-class case (occurs seldom in practice):
  ▶ Assume two classes can be separated perfectly in one or more predictors.
  ▶ Logistic regression tries to fit a step-like function, which forces the intercept to $-\infty$ and the corresponding predictor coefficient to $+\infty$.
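A minimal Python sketch of iteratively reweighted least squares for logistic regression (my own rendering of the Newton iteration described in ESL Ch. 4.4.1, not course code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_irls(X, c, n_iter=25):
    X1 = np.column_stack([np.ones(len(X)), X])   # intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X1 @ beta)                   # current fitted probabilities
        W = p * (1 - p)                          # IRLS weights
        # Newton step: beta <- beta + (X^T W X)^{-1} X^T (c - p)
        H = X1.T @ (W[:, None] * X1)
        beta = beta + np.linalg.solve(H, X1.T @ (c - p))
    return beta

Note that on perfectly separated data this iteration diverges, illustrating the warning above.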
Logistic regression and outliers
[Figure: decision boundaries in the (x1, x2) plane for three cases: 0-1 regression without outlier, 0-1 regression with outlier, and logistic regression with outlier.]
Multi-class logistic regression
▶ In case of $L > 2$ classes, using dummy encoding for the outcome again leads to a series of regression problems.
▶ Requirement: Probabilities should be modelled, i.e. $p(l|\mathbf{x}) \in [0, 1]$ for each class and $\sum_l p(l|\mathbf{x}) = 1$.
▶ Softmax function $s \colon \mathbb{R}^L \to [0, 1]^L$ (a sketch follows below):
  $s_l(\mathbf{z}) = \frac{e^{z_l}}{\sum_{m=1}^{L} e^{z_m}} \quad\Leftrightarrow\quad s_l(\mathbf{z}) = \frac{e^{(z_l - z_L)}}{1 + \sum_{m=1}^{L-1} e^{(z_m - z_L)}}$
▶ Model now:
  $p(l|\mathbf{x}) = \frac{e^{\mathbf{x}^T \boldsymbol{\beta}^{(l)}}}{\sum_{m=1}^{L} e^{\mathbf{x}^T \boldsymbol{\beta}^{(m)}}}$
  or
  $p(l|\mathbf{x}) = \frac{e^{\mathbf{x}^T (\boldsymbol{\beta}^{(l)} - \boldsymbol{\beta}^{(L)})}}{1 + \sum_{m=1}^{L-1} e^{\mathbf{x}^T (\boldsymbol{\beta}^{(m)} - \boldsymbol{\beta}^{(L)})}}$
▶ This method has many names: softmax regression, multinomial logistic regression, maximum entropy classifier, …
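A minimal Python sketch of the softmax function (my own illustration; subtracting the maximum uses the shift-invariance visible in the second form above and avoids overflow in exp):

import numpy as np

def softmax(z):
    z = z - np.max(z)       # shift-invariant, prevents overflow
    e = np.exp(z)
    return e / e.sum()      # entries lie in [0, 1] and sum to one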
Multi-class logistic regression: An example
[Figure: fitted class probabilities from multi-class logistic regression on the iris data, plotted against Sepal Length for the species setosa, versicolor and virginica.]
Classification with focus on the feature/predictor space
Motivation for a different viewpoint: Nearest centroids
[Figure: iris data, Sepal Width against Sepal Length, coloured by species (setosa, versicolor, virginica).]

Determine the mean predictor vector per class,
$\hat{\boldsymbol{\mu}}_l = \frac{1}{n_l} \sum_{c_i = l} \mathbf{x}_i \quad\text{where}\quad n_l = \sum_{i=1}^{n} \mathbb{1}(c_i = l),$
and classify points to the class whose mean is closest (a sketch follows below).
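A minimal Python sketch of nearest centroid classification (my own illustration, not course code):

import numpy as np

def fit_centroids(X, c):
    classes = np.unique(c)
    centroids = np.array([X[c == l].mean(axis=0) for l in classes])
    return classes, centroids

def predict_nearest_centroid(classes, centroids, X):
    # distances from every point to every centroid, shape (n, L)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]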
A change of scenery
Summary
▶ Classification can be approached through regression and approximation of $\mathbb{E}_{p(c|\mathbf{x})}[c]$.
▶ Indirectly we approximated $p(l|\mathbf{x})$ and were able to use Bayes' rule.

Observation: Good predictors group by class in feature space.
Change of focus: Let's model the density of $\mathbf{x}$ conditionally on $l$ instead! How? Bayes' law.
The setting of Discriminant Analysis
Apply Bayes' law:
$p(l|\mathbf{x}) = \frac{p(\mathbf{x}|l)\, p(l)}{\sum_{m=1}^{L} p(\mathbf{x}|m)\, p(m)}$

Instead of specifying $p(l|\mathbf{x})$ we can specify $p(\mathbf{x}|l)$ and $p(l)$.

The main assumption of Discriminant Analysis (DA) is
$p(\mathbf{x}|l) \sim \mathcal{N}(\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$
where $\boldsymbol{\mu}_l \in \mathbb{R}^p$ is the mean vector for class $l$ and $\boldsymbol{\Sigma}_l \in \mathbb{R}^{p \times p}$ the corresponding covariance matrix.
Finding the parameters of DA
▶ Notation: Write $p(l) = \pi_l$ and consider these as unknown parameters.
▶ Given data $(c_i, \mathbf{x}_i)$ the likelihood maximization problem is
  $\arg\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}} \prod_{i=1}^{n} p(\mathbf{x}_i|\boldsymbol{\mu}_{c_i}, \boldsymbol{\Sigma}_{c_i})\, \pi_{c_i} \quad\text{subject to}\quad \sum_{l=1}^{L} \pi_l = 1.$
▶ This can be solved using a Lagrange multiplier (try it!) and leads to (a sketch follows below)
  $\hat{\pi}_l = \frac{n_l}{n}, \quad\text{with } n_l = \sum_{i=1}^{n} \mathbb{1}(c_i = l)$
  $\hat{\boldsymbol{\mu}}_l = \frac{1}{n_l} \sum_{c_i = l} \mathbf{x}_i$
  $\hat{\boldsymbol{\Sigma}}_l = \frac{1}{n_l - 1} \sum_{c_i = l} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_l)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_l)^T$
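A minimal Python sketch of these estimates (my own illustration; np.cov uses the 1/(n_l - 1) normalization from the slide):

import numpy as np

def fit_qda(X, c):
    classes, n = np.unique(c), len(c)
    params = {}
    for l in classes:
        Xl = X[c == l]
        params[l] = {
            "pi": len(Xl) / n,                  # prior: n_l / n
            "mu": Xl.mean(axis=0),              # class mean
            "Sigma": np.cov(Xl, rowvar=False),  # class covariance
        }
    return params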
Performing classification in DA
Bayes' rule implies the classification rule
$\hat{c}(\mathbf{x}) = \arg\max_{1 \le l \le L} p(\mathbf{x}|\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)\, \pi_l$

Note that since $\log$ is strictly increasing, this is equivalent to
$\hat{c}(\mathbf{x}) = \arg\max_{1 \le l \le L} \delta_l(\mathbf{x})$
where
$\delta_l(\mathbf{x}) = \log p(\mathbf{x}|\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l) + \log \pi_l = \log \pi_l - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_l)^T \boldsymbol{\Sigma}_l^{-1} (\mathbf{x} - \boldsymbol{\mu}_l) - \frac{1}{2} \log |\boldsymbol{\Sigma}_l| \;(+\,C)$

This is a quadratic function in $\mathbf{x}$ (a sketch follows below).
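A minimal Python sketch of the discriminant $\delta_l$ and the resulting rule (my own illustration, reusing the hypothetical fit_qda output from above):

import numpy as np

def qda_discriminant(x, pi, mu, Sigma):
    diff = x - mu
    return (np.log(pi)
            - 0.5 * diff @ np.linalg.solve(Sigma, diff)  # quadratic term
            - 0.5 * np.log(np.linalg.det(Sigma)))        # log-determinant term

def qda_classify(x, params):
    # arg max over classes of delta_l(x)
    return max(params, key=lambda l: qda_discriminant(
        x, params[l]["pi"], params[l]["mu"], params[l]["Sigma"]))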
Different levels of complexity
▶ This method is called Quadratic Discriminant Analysis (QDA).
▶ Problem: Many parameters that grow quickly with dimension:
  ▶ $L - 1$ for all $\pi_l$
  ▶ $p \cdot L$ for all $\boldsymbol{\mu}_l$
  ▶ $p(p+1)/2 \cdot L$ for all $\boldsymbol{\Sigma}_l$ (most costly)
▶ Solution: Replace the covariance matrices $\boldsymbol{\Sigma}_l$ by a pooled estimate (a sketch follows below):
  $\hat{\boldsymbol{\Sigma}} = \sum_{l=1}^{L} \hat{\boldsymbol{\Sigma}}_l \frac{n_l - 1}{n - L} = \frac{1}{n - L} \sum_{l=1}^{L} \sum_{c_i = l} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_l)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_l)^T$
▶ Simpler correlation and variance structure: All classes are assumed to have the same correlation structure between features.
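A minimal Python sketch of the pooled estimate (my own illustration):

import numpy as np

def pooled_covariance(X, c):
    classes = np.unique(c)
    n, L = len(c), len(classes)
    S = np.zeros((X.shape[1], X.shape[1]))
    for l in classes:
        Xl = X[c == l]
        diff = Xl - Xl.mean(axis=0)
        S += diff.T @ diff          # within-class sum of squares
    return S / (n - L)              # pooled estimate with denominator n - L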
Performing classification in the simplified case
As before, consider
$\hat{c}(\mathbf{x}) = \arg\max_{1 \le l \le L} \delta_l(\mathbf{x})$
where
$\delta_l(\mathbf{x}) = \log \pi_l + \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_l - \frac{1}{2} \boldsymbol{\mu}_l^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_l \;(+\,C)$

This is a linear function in $\mathbf{x}$. The method is therefore called Linear Discriminant Analysis (LDA); a sketch follows below.
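A minimal Python sketch of the linear discriminant (my own illustration, using the hypothetical pooled_covariance sketch above):

import numpy as np

def lda_discriminant(x, pi, mu, Sigma_pooled):
    a = np.linalg.solve(Sigma_pooled, mu)      # Sigma^{-1} mu_l
    return np.log(pi) + x @ a - 0.5 * mu @ a   # linear in x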
Even more simplifications
Other simplifications of the correlation structure are possible:

▶ Ignore all correlations between features but allow different variances, i.e. $\boldsymbol{\Sigma}_l = \boldsymbol{\Delta}_l$ for a diagonal matrix $\boldsymbol{\Delta}_l$ (diagonal QDA or naive Bayes classifier)
▶ Ignore all correlations and make feature variances equal, i.e. $\boldsymbol{\Sigma}_l = \boldsymbol{\Delta}$ for a diagonal matrix $\boldsymbol{\Delta}$ (diagonal LDA)
▶ Ignore correlations and variances, i.e. $\boldsymbol{\Sigma}_l = \sigma^2 \mathbf{I}_{p \times p}$ (nearest centroids adjusted for class frequencies $\pi_l$)
Examples of LDA and QDA
[Figure: decision regions on the iris data (Sepal Width against Sepal Length; species setosa, versicolor, virginica) for nearest centroids, LDA and QDA.]

Decision boundaries can be found with
$p(\mathbf{x}|\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)\, \pi_l = p(\mathbf{x}|\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)\, \pi_m$
for $l \ne m$, with $\boldsymbol{\Sigma}_l = \boldsymbol{\Sigma}$ for LDA and $\boldsymbol{\Sigma}_l = \sigma^2 \mathbf{I}_{p \times p}$ for nearest centroids.
Take-home message
▶ Classification can be achieved through the point of view of regression.
▶ Modelling the conditional densities of features instead of classes leads to Discriminant Analysis (DA).
▶ There is a range of assumptions in DA about the covariance structure, from the full per-class covariances of QDA down to nearest centroids.