SLIDE 1

Statistical Machine Learning

Lecture 09: Classification

Kristian Kersting TU Darmstadt

Summer Term 2020

  • K. Kersting, based on slides from J. Peters · Statistical Machine Learning · Summer Term 2020

SLIDE 2

Today’s Objectives

Make you understand how to build a discriminative classifier! Covered topics:

  • Discriminant Functions
  • Multi-Class Classification
  • Fisher Discriminant Analysis
  • Perceptrons
  • Logistic Regression

SLIDE 3

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 4
  • 1. Discriminant Functions

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 5
  • 1. Discriminant Functions

Reminder of Bayesian Decision Theory

We want to find the a-posteriori probability (posterior) of the class $C_k$ given the observation (feature) $x$:

$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}$$

  • $p(C_k \mid x)$ - class posterior
  • $p(x \mid C_k)$ - class-conditional probability (likelihood)
  • $p(C_k)$ - class prior
  • $p(x)$ - normalization term

SLIDE 6
  • 1. Discriminant Functions

Reminder of Bayesian Decision Theory

Decision rule

Decide $C_1$ if $p(C_1 \mid x) > p(C_2 \mid x)$. Using the definition of conditional distributions, this is equivalent to

$$p(x \mid C_1)\, p(C_1) > p(x \mid C_2)\, p(C_2) \quad\equiv\quad \frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{p(C_2)}{p(C_1)}$$

A classifier obeying this rule is called a Bayes optimal classifier

SLIDE 7
  • 1. Discriminant Functions

Reminder of Bayesian Decision Theory

Current approach

  • $p(C_k \mid x) = p(x \mid C_k)\, p(C_k) / p(x)$ (Bayes' rule)
  • Model and estimate the class-conditional density $p(x \mid C_k)$ and the class prior $p(C_k)$
  • Compute the posterior $p(C_k \mid x)$
  • Minimize the error probability by maximizing $p(C_k \mid x)$

New approach

  • Directly encode the decision boundary
  • Without modeling the densities directly
  • Still minimize the error probability

SLIDE 8
  • 1. Discriminant Functions

Discriminant Functions

Formulate classification using comparisons

Discriminant functions $y_1(x), \dots, y_K(x)$. Classify $x$ as class $C_k$ iff $y_k(x) > y_j(x)\ \forall j \neq k$

More formally, a discriminant maps a vector x to one of the K available classes

SLIDE 9
  • 1. Discriminant Functions

Discriminant Functions

Example of discriminant functions from the Bayes classifier:

$$y_k(x) = p(C_k \mid x), \qquad y_k(x) = p(x \mid C_k)\, p(C_k), \qquad y_k(x) = \log p(x \mid C_k) + \log p(C_k)$$

SLIDE 10
  • 1. Discriminant Functions

Discriminant Functions

Base case with 2 classes: decide class 1 if

$$y_1(x) > y_2(x) \quad\Longleftrightarrow\quad y_1(x) - y_2(x) > 0 \quad\Longleftrightarrow\quad y(x) > 0$$

Example from the Bayes classifier:

$$y(x) = p(C_1 \mid x) - p(C_2 \mid x), \qquad y(x) = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{p(C_1)}{p(C_2)}$$

SLIDE 11
  • 1. Discriminant Functions

Example - Bayes Classifier

Base case with 2 classes and Gaussian class-conditionals
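To make this concrete, here is a minimal sketch of a Bayes classifier for this base case, assuming two 1D Gaussian class-conditionals; the means, standard deviations, and priors below are illustrative assumptions, not values from the lecture.

```python
from scipy.stats import norm

# Hypothetical 1D example: two classes with Gaussian class-conditionals.
mu1, sigma1, prior1 = 0.0, 1.0, 0.6   # class C1 (assumed)
mu2, sigma2, prior2 = 2.5, 1.0, 0.4   # class C2 (assumed)

def bayes_decide(x):
    """Decide C1 iff p(x | C1) p(C1) > p(x | C2) p(C2)."""
    score1 = norm.pdf(x, loc=mu1, scale=sigma1) * prior1
    score2 = norm.pdf(x, loc=mu2, scale=sigma2) * prior2
    return "C1" if score1 > score2 else "C2"

print(bayes_decide(0.5))   # -> C1 (close to the C1 mean)
print(bayes_decide(2.0))   # -> C2 (closer to the C2 mean)
```

The decision boundary sits where the two prior-weighted densities are equal; changing the priors shifts it.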

SLIDE 12
  • 1. Discriminant Functions

Linear Discriminant Functions

Base case with 2 classes: if $y(x) > 0$ decide class 1, otherwise class 2. Simplest case: a linear decision boundary.

In linear discriminants, the decision surfaces are (hyper)planes.

Linear Discriminant Function: $y(x) = w^\top x + w_0$, where $w$ is the normal vector and $w_0$ the offset.

SLIDE 13
  • 1. Discriminant Functions

Linear Discriminant Functions

Illustration of the 2D case: $y(x) = w^\top x + w_0$, with $x = (x_1, x_2)^\top$.

[Figure: geometry of the linear decision boundary. The hyperplane $y = 0$ separates region $R_1$ ($y > 0$) from region $R_2$ ($y < 0$); $w$ is the normal vector, the signed distance of a point $x$ from the boundary is $y(x)/\lVert w \rVert$, and the boundary lies at distance $-w_0/\lVert w \rVert$ from the origin.]
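A minimal numeric sketch of this geometry, with hypothetical example values for $w$ and $w_0$:

```python
import numpy as np

# Geometry of y(x) = w^T x + w0 in 2D; w and w0 are illustrative assumptions.
w = np.array([1.0, 2.0])   # normal vector of the decision boundary
w0 = -1.0                  # offset

def signed_distance(x):
    """Signed distance of x from the hyperplane w^T x + w0 = 0 (positive in R1)."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(signed_distance(x))          # > 0, so x lies in region R1
print(-w0 / np.linalg.norm(w))     # distance of the boundary from the origin
```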

SLIDE 14
  • 1. Discriminant Functions

Linear Discriminant Functions

SLIDE 15
  • 1. Discriminant Functions

Discriminant Functions

Why might we want to use discriminant functions? We could easily fit the class-conditionals using Gaussians and use a Bayes classifier

SLIDE 16
  • 1. Discriminant Functions

Discriminant Functions

How about now? Do these points matter for making the decision between the two classes?

SLIDE 17
  • 1. Discriminant Functions

Distribution-free Classifiers

  • We do not necessarily need to model all the details of the class-conditional distributions to come up with a good decision boundary. (The class-conditionals may have many intricacies that do not matter at the end of the day.)
  • If we can learn where to place the decision boundary directly, we can avoid some of the complexity.
  • It would be unwise to believe that such classifiers are inherently superior to probabilistic ones. We shall see why later...

SLIDE 18
  • 1. Discriminant Functions

Multi-Class Case

What if we constructed a multi-class classifier from several 2-class classifiers? If we base our decision rule on binary decisions, this may lead to ambiguities, where a point receives votes for several classes, e.g. for C1 and C2, or for C1, C2, and C3.

SLIDE 19
  • 1. Discriminant Functions

Multi-Class Case - Better Solution

Use a discriminant function to encode how strongly we believe in each class: $y_1(x), \dots, y_K(x)$. Decision rule: decide $C_k$ if $y_k(x) > y_j(x)\ \forall j \neq k$ (see the sketch below). If the discriminant functions are linear, the decision regions are connected and convex.
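A minimal sketch of this decision rule with $K = 3$ linear discriminants; the weights and offsets are illustrative assumptions:

```python
import numpy as np

# Multi-class decision with linear discriminants y_k(x) = w_k^T x + w_k0:
# decide the class whose discriminant value is largest.
W = np.array([[ 1.0,  0.0],     # w_1 (assumed)
              [ 0.0,  1.0],     # w_2 (assumed)
              [-1.0, -1.0]])    # w_3 (assumed)
w0 = np.array([0.0, 0.0, 0.5])  # offsets (assumed)

def classify(x):
    scores = W @ x + w0            # y_1(x), ..., y_K(x)
    return int(np.argmax(scores))  # index k of the winning class

print(classify(np.array([2.0, 0.5])))  # -> 0, i.e. class C1
```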

SLIDE 20
  • 2. Fisher Discriminant Analysis

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 21
  • 2. Fisher Discriminant Analysis

Linear Discriminant Functions

Illustration of the 2D case again: $y(x) = w^\top x + w_0$, with $x = (x_1, x_2)^\top$.

[Figure: geometry of the linear decision boundary, as on Slide 13: the hyperplane $y = 0$ separates $R_1$ ($y > 0$) from $R_2$ ($y < 0$), and $y(x)/\lVert w \rVert$ is the signed distance from the boundary.]

SLIDE 22
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Try to achieve a certain value of the discriminant function:

$$y(x) = +1 \;\Leftrightarrow\; x \in C_1, \qquad y(x) = -1 \;\Leftrightarrow\; x \in C_2$$

Training data inputs: $X = \{x_1 \in \mathbb{R}^d, \dots, x_n\}$. Training data labels: $Y = \{y_1 \in \{-1, +1\}, \dots, y_n\}$

Linear Discriminant Function

Try to enforce $x_i^\top w + w_0 = y_i$ for all $i = 1, \dots, n$. There is one linear equation for each training data point/label pair.

SLIDE 23
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Linear system of equations: $x_i^\top w + w_0 = y_i$ for all $i = 1, \dots, n$

Define $\hat{x}_i = \begin{pmatrix} x_i \\ 1 \end{pmatrix} \in \mathbb{R}^{d+1}$ and $\hat{w} = \begin{pmatrix} w \\ w_0 \end{pmatrix} \in \mathbb{R}^{d+1}$, and rewrite the equation system as $\hat{x}_i^\top \hat{w} = y_i$ for all $i = 1, \dots, n$

In matrix-vector notation: $\hat{X}^\top \hat{w} = y$, with $\hat{X} = [\hat{x}_1, \dots, \hat{x}_n] \in \mathbb{R}^{(d+1) \times n}$ and $y = [y_1, \dots, y_n]^\top$

SLIDE 24
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

$\hat{X}^\top \hat{w} = y$ is an overdetermined system of equations: there are $n$ equations and $d + 1$ unknowns.

SLIDE 25
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Look for the least squares solution:

$$\hat{w}^* = \arg\min_{\hat{w}} \left\lVert \hat{X}^\top \hat{w} - y \right\rVert^2 = \arg\min_{\hat{w}} \left( \hat{X}^\top \hat{w} - y \right)^\top \left( \hat{X}^\top \hat{w} - y \right) = \arg\min_{\hat{w}} \; \hat{w}^\top \hat{X} \hat{X}^\top \hat{w} - 2\, y^\top \hat{X}^\top \hat{w} + y^\top y$$

Setting the gradient to zero,

$$\nabla_{\hat{w}} \left( \hat{w}^\top \hat{X} \hat{X}^\top \hat{w} - 2\, y^\top \hat{X}^\top \hat{w} + y^\top y \right) = 0 \quad\Longrightarrow\quad \hat{w} = \underbrace{\left( \hat{X} \hat{X}^\top \right)^{-1} \hat{X}}_{\text{pseudo-inverse}} \, y$$
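A minimal sketch of this least-squares classifier on synthetic data. Note one convention change: here the data matrix stacks the points as rows, so the pseudo-inverse is applied to the $n \times (d+1)$ matrix rather than to its transpose; all data values are illustrative assumptions.

```python
import numpy as np

# Least-squares classifier via the pseudo-inverse on synthetic 2D data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(20, 2)),     # class with label -1
               rng.normal([3, 3], 1.0, size=(20, 2))])    # class with label +1
y = np.concatenate([-np.ones(20), np.ones(20)])

X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # append a 1 for the offset w0
w_hat = np.linalg.pinv(X_hat) @ y                   # least-squares solution
w, w0 = w_hat[:2], w_hat[2]

pred = np.sign(X_hat @ w_hat)                       # decide by the sign of y(x)
print("training accuracy:", np.mean(pred == y))
```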

SLIDE 26
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Problem: Least-squares is very sensitive to outliers

[Figure: two scatter plots with the learned decision boundary. Without outliers the least-squares discriminant works; with outliers the least-squares discriminant breaks down.]

SLIDE 27
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

  • Take a different view on linear classification: find a linear projection of our data and classify the projected values
  • This is the same thing as a linear discriminant function
  • Projection: $y = w^\top x$
  • Checking against a threshold: $w^\top x \geq -w_0$, or $w^\top x + w_0 \geq 0$

SLIDE 28
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

What is a good projection w?

Idea: Maximize the “distance” between the two classes to allow for a good separation

First attempt: maximize the distance between the class means

$$m_1 = \frac{1}{|C_1|} \sum_{i \in C_1} x_i, \qquad m_2 = \frac{1}{|C_2|} \sum_{i \in C_2} x_i$$

Projection of the means onto the 1D line of real numbers: $\tilde{m}_1 = w^\top m_1$, $\tilde{m}_2 = w^\top m_2$

Maximize the squared distance between the projected means: $\max_w \, (\tilde{m}_1 - \tilde{m}_2)^2$

SLIDE 29
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Maximize the squared distance between the projected means:

$$w^* = \arg\max_w \left( w^\top m_1 - w^\top m_2 \right)^2$$

Obvious problem: this grows unboundedly with the norm of $w$. Obvious solution: fix the norm of $w$:

$$\max_w \left( w^\top m_1 - w^\top m_2 \right)^2 \quad \text{s.t.} \quad \lVert w \rVert^2 = 1$$

A constrained optimization problem!

SLIDE 30
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

$$\max_w \left( w^\top m_1 - w^\top m_2 \right)^2 \quad \text{s.t.} \quad \lVert w \rVert^2 = 1$$

Necessary condition (Lagrange): $\nabla_w f(w) + \lambda \nabla_w g(w) = 0$, i.e.

$$2 \left( w^\top m_1 - w^\top m_2 \right) (m_1 - m_2) + 2 \lambda w = 0$$

It follows that

$$w = \frac{m_1 - m_2}{\lVert m_1 - m_2 \rVert}$$

SLIDE 31
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Here’s what we get


Obvious problem: large class overlap

SLIDE 32
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Here’s what we could get


Much better separation between classes How do we get this?

Idea: Separate the means as far as possible while minimizing the variance of each class

SLIDE 33
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Second (and final) attempt:

Define the within-class variances of the projected data:

$$s_1^2 = \sum_{n \in C_1} \left( w^\top x_n - \tilde{m}_1 \right)^2, \qquad s_2^2 = \sum_{n \in C_2} \left( w^\top x_n - \tilde{m}_2 \right)^2$$

where $\tilde{m}_1 = w^\top m_1$ and $\tilde{m}_2 = w^\top m_2$.

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2}$$

SLIDE 34
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2}$$

Rewrite the numerator:

$$\left( \tilde{m}_1 - \tilde{m}_2 \right)^2 = \left( w^\top m_1 - w^\top m_2 \right)^2 = \left( w^\top (m_1 - m_2) \right)^2 = w^\top \underbrace{(m_1 - m_2)(m_1 - m_2)^\top}_{=:\, S_B \;\text{(between-class covariance)}} \, w$$

SLIDE 35
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2}$$

Rewrite the denominator:

$$s_1^2 + s_2^2 = \sum_{n \in C_1} \left( w^\top x_n - \tilde{m}_1 \right)^2 + \sum_{n \in C_2} \left( w^\top x_n - \tilde{m}_2 \right)^2 = \sum_{n \in C_1} \left( w^\top (x_n - m_1) \right)^2 + \sum_{n \in C_2} \left( w^\top (x_n - m_2) \right)^2$$

$$= w^\top \underbrace{\left( \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^\top + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^\top \right)}_{=:\, S_W \;\text{(within-class covariance)}} \, w$$

SLIDE 36
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2} = \frac{w^\top S_B w}{w^\top S_W w}$$

Differentiating w.r.t. $w$ and setting to $0$, we have

$$\left( w^\top S_B w \right) S_W w = \left( w^\top S_W w \right) S_B w$$

Since $w^\top S_B w$ and $w^\top S_W w$ are scalars, we have that $S_W w \parallel S_B w$, where $\parallel$ means collinearity.

SLIDE 37
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Also, we know that

$$S_B w = (m_1 - m_2)(m_1 - m_2)^\top w \quad\Longrightarrow\quad S_B w \parallel (m_1 - m_2)$$

Hence, we have $S_W w \parallel (m_1 - m_2)$, i.e. $w \parallel S_W^{-1} (m_1 - m_2)$.

Fisher's Linear Discriminant:

$$w \propto S_W^{-1} (m_1 - m_2)$$
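A minimal sketch of Fisher's linear discriminant on synthetic two-class data; the means and covariance used to generate the data, and the midpoint threshold, are illustrative assumptions:

```python
import numpy as np

# Fisher's linear discriminant: w ∝ S_W^{-1} (m1 - m2), then project onto w.
rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=50)  # class C1
X2 = rng.multivariate_normal([3, 1], [[2.0, 1.5], [1.5, 2.0]], size=50)  # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
w = np.linalg.solve(S_W, m1 - m2)                          # direction S_W^{-1}(m1 - m2)

# Project both classes onto w; a threshold still has to be chosen
# (here simply the midpoint of the projected means, see the next slide).
threshold = 0.5 * (w @ m1 + w @ m2)
pred_C1 = (X1 @ w) > threshold
print("fraction of C1 on its own side:", pred_C1.mean())
```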

SLIDE 38
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

$$w \propto S_W^{-1} (m_1 - m_2)$$

The Fisher linear discriminant only gives us a projection.

  • We still need to find the threshold, e.g., by using a Bayes classifier with Gaussian class-conditionals

Bayes optimality

  • Fisher's linear discriminant is Bayes optimal if the class-conditional distributions are Gaussian with equal covariance
  • Essentially equivalent to Linear Discriminant Analysis (LDA)

SLIDE 39
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

  • We won’t go through this here, but Fisher’s linear discriminant can be shown to be equivalent to a certain case of a least-squares linear classifier (see Bishop 4.1.5)
  • Problem with this method: it is still very sensitive to noise!
  • By the way: this method is a true classic (it dates back to 1936)

Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7: 179-188 (1936)
SLIDE 40
  • 3. Perceptron Algorithm

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 41
  • 3. Perceptron Algorithm

New Strategy

If our classes are linearly separable, we want to make sure that we find a separating (hyper)plane. The first such algorithm we will see:

The perceptron algorithm [Rosenblatt, 1962]

Rosenblatt [1928-1971]

SLIDE 42
  • 3. Perceptron Algorithm

Perceptron Algorithm

Perceptron discriminant function:

$$y(x) = \operatorname{sign}\left( w^\top x + b \right), \qquad \operatorname{sign}(a) = \begin{cases} +1 & a > 0 \\ 0 & a = 0 \\ -1 & a < 0 \end{cases}$$

SLIDE 43
  • 3. Perceptron Algorithm

Perceptron Algorithm

Perceptron Algorithm

  • Initialize the weight vector $w$ and bias $b$
  • For all pairs of data points $(x_i, y_i)$, where $y_i \in \{-1, +1\}$:
      • If $x_i$ is correctly classified, i.e., $y(x_i) = y_i$, do nothing
      • Else if $y_i = +1$, update the parameters with $w \leftarrow w + x_i$, $b \leftarrow b + 1$
      • Else if $y_i = -1$, update the parameters with $w \leftarrow w - x_i$, $b \leftarrow b - 1$
  • Repeat until convergence
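A minimal sketch of the algorithm above on a small, linearly separable toy dataset (the data points are illustrative assumptions; the two label-dependent updates are folded into one line via $y_i \cdot x_i$):

```python
import numpy as np

# Perceptron learning on a tiny linearly separable dataset (assumed values).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])

w = np.zeros(2)
b = 0.0
for _ in range(100):                      # "repeat until convergence" (capped)
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:        # misclassified (or on the boundary)
            w += yi * xi                  # w <- w ± xi, depending on the label
            b += yi                       # b <- b ± 1
            errors += 1
    if errors == 0:
        break

print(w, b, np.sign(X @ w + b))           # the signs should match y
```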

SLIDE 44
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 45
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 46
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 47
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 48
  • 3. Perceptron Algorithm

Perceptron Algorithm

Why does this algorithm work? We have an optimization problem over the set of misclassified points $\{x \in X : \langle w, x \rangle < 0\}$ (with the class labels absorbed into the $x$, so that $\langle w, x \rangle < 0$ marks a misclassified point):

$$\max_w \; J(w) = \sum_{x \in X :\, \langle w, x \rangle < 0} \langle w, x \rangle$$

and also a gradient method:

$$\frac{\partial J}{\partial w} = \sum_{x \in X :\, \langle w, x \rangle < 0} x$$

Each perceptron update is one gradient step on $J$ using a single misclassified point.

SLIDE 49
  • 3. Perceptron Algorithm

But is the Perceptron Algorithm useful?

  • How often is data linearly separable? A simple failure example is the XOR function
  • History: Minsky & Papert [1969] criticized the perceptron for not being able to handle this case, which halted research on this and related techniques for decades

SLIDE 50
  • 3. Perceptron Algorithm

Other Feature Spaces

  • It took a long time until people realized that there is a simple way out
  • Key idea: transform the input data nonlinearly so that the problem becomes linearly separable!
  • There is an important message to take away from this: create features instead of learning from raw data; neural networks do it automagically for you

SLIDE 51
  • 4. Logistic Regression

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 52
  • 4. Logistic Regression

Generative vs. Discriminative

There are two different views to solve the classification problem.

Generative modelling

  • We model the class-conditional distributions $p(x \mid C_1)$ and $p(x \mid C_2)$
  • We classify by computing the class posterior using Bayes' rule
  • E.g.: Naive Bayes

Discriminative modelling

  • We model the class posterior directly, e.g. $p(C_1 \mid x)$
  • Consequence: we only care about getting the classification right, and not whether we fit the class-conditionals well
  • E.g.: Logistic Regression

SLIDE 53
  • 4. Logistic Regression

Probabilistic Discriminative Models

For now, we will write the class posterior using Bayes' rule:

$$p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x)} = \frac{p(x \mid C_1)\, p(C_1)}{\sum_i p(x, C_i)} = \frac{p(x \mid C_1)\, p(C_1)}{\sum_i p(x \mid C_i)\, p(C_i)}$$

$$= \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \dfrac{p(x \mid C_2)\, p(C_2)}{p(x \mid C_1)\, p(C_1)}} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

the logistic sigmoid function, with $a = \log \dfrac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$.

SLIDE 54
  • 4. Logistic Regression

Sigmoid

Logistic / sigmoid function:

$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$

[Figure: plot of the sigmoid curve; source: Wikipedia]

The sigmoid is 'S-shaped' and squashes real numbers into the $(0, 1)$ interval.

SLIDE 55
  • 4. Logistic Regression

Probabilistic Discriminative Models

Class posterior: $p(C_1 \mid x) = \sigma(a)$ with $a = \log \dfrac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$

Logistic regression

  • Assume that $a$ is given by a linear discriminant function: $p(C_1 \mid x) = \sigma(w^\top x + w_0)$
  • Find $w$ and $w_0$ so that the class posterior is modeled best

When is this an appropriate assumption?

  • When the class-conditionals are Gaussians with equal covariance
  • But also for a number of other distributions
  • So the result is somewhat independent of the form of the class-conditionals

SLIDE 56
  • 4. Logistic Regression

Logistic Regression

Model the class posterior as $p(C_1 \mid x) = \sigma(w^\top x + w_0)$ and maximize the likelihood.

The data are (as always) i.i.d.; define

$$y_i = \begin{cases} 1 & x_i \text{ belongs to } C_1 \\ 0 & x_i \text{ belongs to } C_2 \end{cases}$$

$$p\left( Y \mid X; w, w_0 \right) = \prod_{i=1}^{N} p\left( y_i \mid x_i; w, w_0 \right) = \prod_{i=1}^{N} p\left( C_1 \mid x_i; w, w_0 \right)^{y_i} p\left( C_2 \mid x_i; w, w_0 \right)^{1 - y_i}$$

$$= \prod_{i=1}^{N} \sigma\left( w^\top x_i + w_0 \right)^{y_i} \left( 1 - \sigma\left( w^\top x_i + w_0 \right) \right)^{1 - y_i}$$
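A minimal sketch of maximum-likelihood training for this model on synthetic data, taking the logarithm and doing gradient descent on the negative log-likelihood (as the next slide describes); all data values and step sizes are illustrative assumptions:

```python
import numpy as np

# Logistic regression trained by gradient descent on the negative log-likelihood.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),   # class C1
               rng.normal([2, 2], 1.0, size=(50, 2))])  # class C2
y = np.concatenate([np.ones(50), np.zeros(50)])          # y=1 for C1, y=0 for C2

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
w0 = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + w0)        # p(C1 | x_i) for every point
    grad_w = X.T @ (p - y)         # gradient of the negative log-likelihood
    grad_w0 = np.sum(p - y)
    w -= lr * grad_w / len(y)      # (averaged) gradient-descent step
    w0 -= lr * grad_w0 / len(y)

print("training accuracy:", np.mean((sigmoid(X @ w + w0) > 0.5) == (y == 1)))
```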

SLIDE 57
  • 4. Logistic Regression

Logistic Regression

  • We won’t do the derivation here (see Bishop 4.3), but basically you can apply the logarithm to $p(Y \mid X; w, w_0)$ and do gradient descent
  • Similar to what we have seen in regression, we can get more robust classifiers by incorporating priors and taking a Bayesian approach
  • Later, we will turn to a very different interpretation of this: logistic regression as a neural network

SLIDE 58
  • 5. Wrap-Up

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 59
  • 5. Wrap-Up

You know now:

  • What a Bayes optimal classifier is
  • What a discriminant function is
  • How to formalize (with intuition and mathematically) the classification problem as linearly separable
  • How to compute the least squares solution for classification and why it fails
  • What Fisher’s Linear Discriminant is and how it differs from least squares
  • What the perceptron is, why it fails on the XOR problem, and how to overcome this with feature spaces
  • The difference between generative and discriminative modelling
  • What logistic regression is

SLIDE 60
  • 5. Wrap-Up

Self-Test Questions

  • How do we get from Bayes optimal decisions to discriminant functions?
  • How to derive a discriminant function from a probability distribution?
  • How to deal with more than two classes?
  • What does linearly separable mean?
  • What is Fisher discriminant analysis? How does it relate to regression?
  • Is Fisher’s linear discriminant Bayes optimal?
  • What are perceptrons? How can we train them?
  • What is logistic regression? How to derive the parameter update rule?

SLIDE 61
  • 5. Wrap-Up

Homework

Reading Assignment for next week

  • Bishop 7.1.5 and 12.1
  • Murphy 6.5 and 12.2
