SLIDE 1

Linear classification

Course of Machine Learning
Master Degree in Computer Science, University of Rome "Tor Vergata"
Giorgio Gambosi, a.a. 2018-2019

SLIDE 2

Introduction

SLIDE 3

Classification

• the value t to predict is from a discrete domain, where each value denotes a class
• most common case: disjoint classes, where each input has to be assigned to exactly one class
• the input space is partitioned into decision regions
• in linear classification models, decision boundaries are linear functions of the input x ((D - 1)-dimensional hyperplanes in the D-dimensional feature space)
• datasets whose classes correspond to regions which may be separated by linear decision boundaries are said to be linearly separable
SLIDE 4

Regression and classification

• Regression: the target variable t is a vector of reals
• Classification: several ways to represent classes (target variable values)
• Binary classification: a single variable t ∈ {0, 1}, where t = 0 denotes class C_0 and t = 1 denotes class C_1
• K > 2 classes: "1-of-K" coding; t is a vector of K bits, such that for each class C_j all bits are 0 except the j-th one (which is 1)
SLIDE 5

Approaches to classification

Three general approaches to classification

1. find a discriminant function f : X → {1, . . . , K} which maps each input x to some class C_i (such that i = f(x))
2. discriminative approach: determine the conditional probabilities p(C_j|x) (inference phase); use these distributions to assign an input to a class (decision phase)
3. generative approach: determine the class-conditional distributions p(x|C_j) and the class prior probabilities p(C_j); apply Bayes' formula to derive the class posterior probabilities p(C_j|x); use these distributions to assign an input to a class
SLIDE 6

Discriminative approaches

• Approaches 1 and 2 are discriminative: they tackle the classification problem by deriving from the training set conditions (such as decision boundaries) that, when applied to a point, discriminate each class from the others
• The boundaries between regions are specified by discriminant functions
SLIDE 7

Generalized linear models

• In linear regression, a model predicts the target value; the prediction is made through a linear function y(x) = w^T x + w_0 (linear basis functions could be applied)
• In classification, a model predicts probabilities of classes, that is, values in [0, 1]; the prediction is made through a generalized linear model y(x) = f(w^T x + w_0), where f is a nonlinear activation function with codomain [0, 1]
• boundaries correspond to the solutions of y(x) = c for some constant c; this results in w^T x + w_0 = f^{-1}(c), that is, a linear boundary. The inverse function f^{-1} is called the link function. A minimal sketch follows.
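As a concrete illustration, here is a minimal sketch of such a model in Python. The logistic sigmoid as activation f and the parameter values are illustrative assumptions, not fixed by the slides:

    import numpy as np

    def sigmoid(a):
        # activation f with codomain (0, 1)
        return 1.0 / (1.0 + np.exp(-a))

    w = np.array([2.0, -1.0])   # example weights
    w0 = 0.5                    # example bias

    def y(x):
        # generalized linear model: f(w^T x + w_0)
        return sigmoid(w @ x + w0)

    # The decision boundary y(x) = 0.5 is the set w^T x + w_0 = sigmoid^{-1}(0.5) = 0,
    # i.e. a linear boundary, as stated above.
    print(y(np.array([0.1, 0.7])))   # 0.5: this x lies exactly on the boundary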

SLIDE 8

Generative approaches

• Approach 3 is generative: it works by defining, from the training set, a model of the items of each class
• The model is a probability distribution (of the features conditioned on the class) and could be used for random generation of new items of the class
• By comparing an item to all models, it is possible to find the one that best fits
SLIDE 9

Discriminant functions

SLIDE 10

Linear discriminant functions in binary classification

• Decision boundary: the (D - 1)-dimensional hyperplane y(x) = 0, i.e. the set of all points s.t. w^T x + w_0 = 0
• Given x_1, x_2 on the hyperplane, y(x_1) = y(x_2) = 0. Hence w^T x_1 - w^T x_2 = w^T (x_1 - x_2) = 0; that is, x_1 - x_2 and w are orthogonal: w is normal to the hyperplane
• For any x, w^T x is the length of the projection of x in the direction of w (orthogonal to the hyperplane y(x) = 0), in multiples of ||w||_2
• By normalizing wrt ||w||_2 = \sqrt{\sum_i w_i^2}, we get the length of the projection of x in the direction orthogonal to the hyperplane, assuming ||w||_2 = 1
• For any x on the hyperplane, w^T x = -w_0, hence w^T x / ||w||_2 = -w_0 / ||w||_2; thus, the distance of the hyperplane from the origin is determined by the threshold w_0
SLIDE 11

Linear discriminant functions in binary classification

• In general, for any x, y(x) = w^T x + w_0 returns the signed distance (in multiples of ||w||_2) of x from the hyperplane
• The sign of the returned value discriminates in which of the two regions separated by the hyperplane the point lies (a numeric check follows)
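A quick numeric check of these formulas; w, w_0 and x are arbitrary example values:

    import numpy as np

    w = np.array([3.0, 4.0])    # example weight vector, ||w||_2 = 5
    w0 = -5.0                   # example threshold

    x = np.array([2.0, 1.5])
    y_x = w @ x + w0                       # y(x) = w^T x + w_0 = 7.0
    print(y_x / np.linalg.norm(w))         # signed distance of x from the hyperplane: 1.4
    print(-w0 / np.linalg.norm(w))         # distance of the hyperplane from the origin: 1.0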

SLIDE 12

Linear discriminant functions in multiclass classification

First approach

• Define K - 1 discriminant functions
• Function f_i (1 ≤ i ≤ K - 1) discriminates points belonging to class C_i from points belonging to all other classes: if f_i(x) > 0 then x ∈ C_i, otherwise x ∉ C_i
• Ambiguity: some regions may be claimed by more than one function (in the original figure, the green region belongs to both R_1 and R_2)
SLIDE 13

Linear discriminant functions in multiclass classification

Second approach

• Define K(K - 1)/2 discriminant functions, one for each pair of classes
• Function f_ij (1 ≤ i < j ≤ K) discriminates points which might belong to C_i from points which might belong to C_j
• Item x is classified on a majority basis
• Ambiguity remains: some regions are unassigned (the green region in the original figure)
SLIDE 14

Linear discriminant functions in multiclass classification

Third approach

• Define K linear functions
  y_i(x) = w_i^T x + w_{i0},   1 ≤ i ≤ K
  Item x is assigned to class C_k iff y_k(x) > y_j(x) for all j ≠ k; that is, k = argmax_j y_j(x) (as in the sketch below)
• Decision boundary between C_i and C_j: all points x s.t. y_i(x) = y_j(x), a (D - 1)-dimensional hyperplane (w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0
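A minimal sketch of this third approach; the weights below are made-up example values:

    import numpy as np

    K, D = 3, 2
    rng = np.random.default_rng(0)
    W = rng.normal(size=(K, D))    # one weight vector w_i per class
    b = rng.normal(size=K)         # one bias w_{i0} per class

    def classify(x):
        scores = W @ x + b             # y_i(x) = w_i^T x + w_{i0}
        return int(np.argmax(scores))  # k = argmax_j y_j(x)

    print(classify(np.array([0.5, -1.0])))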

SLIDE 15

Linear discriminant functions in multiclass classification

The resulting decision regions are connected and convex:

• Given x_A, x_B ∈ R_k, then y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for all j ≠ k
• Let x̂ = λ x_A + (1 - λ) x_B, with 0 ≤ λ ≤ 1
• For all i, since y_i is linear, y_i(x̂) = λ y_i(x_A) + (1 - λ) y_i(x_B)
• Then y_k(x̂) > y_j(x̂) for all j ≠ k; that is, x̂ ∈ R_k

[Figure: regions R_i, R_j, R_k, with x_A, x_B ∈ R_k and the point x̂ on the segment between them]
SLIDE 16

Generalized discriminant functions

• The definition can be extended to include terms relative to products of pairs of feature values (quadratic discriminant functions):
  y(x) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D} \sum_{j=1}^{i} w_{ij} x_i x_j
  This introduces D(D + 1)/2 additional parameters wrt the D + 1 original ones: decision boundaries can be more complex (see the sketch after this list)
• In general, generalized discriminant functions are defined through a set of basis functions φ_1, . . . , φ_M:
  y(x) = w_0 + \sum_{i=1}^{M} w_i φ_i(x)
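A sketch of the quadratic expansion; the helper name quadratic_features is mine, not from the slides:

    import numpy as np
    from itertools import combinations_with_replacement

    def quadratic_features(x):
        # [1, x_1, ..., x_D] plus the D(D+1)/2 products x_i x_j with i <= j
        D = len(x)
        pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(D), 2)]
        return np.concatenate(([1.0], x, pairs))

    print(quadratic_features(np.array([2.0, 3.0])))   # [1. 2. 3. 4. 6. 9.]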

SLIDE 17

Least squares and classification

SLIDE 18

Linear discriminant functions and regression

• Assume classification with K classes
• Classes are represented through a 1-of-K coding scheme: a set of variables z_1, . . . , z_K, with class C_i coded by the values z_i = 1, z_k = 0 for k ≠ i
• Discriminant functions y_i are derived as linear regression functions with the variables z_i as targets
• To each variable z_i a discriminant function y_i(x) = w_i^T x + w_{i0} is associated: x is assigned to the class C_k s.t. k = argmax_i y_i(x)
• The predicted coding is then z_k(x) = 1 and z_j(x) = 0 (j ≠ k), where k = argmax_i y_i(x)
• Grouping all parameters together: y(x) = W^T x
SLIDE 19

Linear discriminant functions and regression

• In general, a regression function provides an estimate E[t|x] of the target given the input
• The value y_i(x) can then be seen as a (poor) estimate of the conditional expectation E[z_i|x] of variable z_i given x; hence, y_i(x) is an estimate of p(C_i|x). However, y_i(x) is not a probability: it is not constrained to lie in [0, 1]
• In this case, dealing with a Bernoulli distribution, the expectation corresponds to the posterior probability:
  E[z_i|x] = P(z_i = 1|x) · 1 + P(z_i = 0|x) · 0 = P(z_i = 1|x) = P(C_i|x)
SLIDE 20

Learning functions yi

• Given a training set T, a regression function is derived by least squares
• An item of T is a pair (x_i, t_i), with x_i ∈ R^D and t_i ∈ {0, 1}^K
• W ∈ R^{(D+1)×K} is the matrix of the parameters of all functions y_i: the i-th column contains the D + 1 parameters w_{i0}, . . . , w_{iD} of y_i
  W = \begin{pmatrix} w_{10} & w_{20} & \cdots & w_{K0} \\ w_{11} & w_{21} & \cdots & w_{K1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1D} & w_{2D} & \cdots & w_{KD} \end{pmatrix}
• y(x) = W^T x, with x = (1, x_1, . . . , x_D)
SLIDE 21

Learning functions yi

• X ∈ R^{n×(D+1)} is the matrix of the feature values of all items in the training set
  X = \begin{pmatrix} 1 & x_1^{(1)} & \cdots & x_1^{(D)} \\ 1 & x_2^{(1)} & \cdots & x_2^{(D)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n^{(1)} & \cdots & x_n^{(D)} \end{pmatrix}
• Then, for the matrix XW, of size n × K, we have
  (XW)_{ij} = w_{j0} + \sum_{k=1}^{D} x_i^{(k)} w_{jk} = y_j(x_i)
SLIDE 22

Learning functions yi

• y_j(x_i) is compared to the entry T_{ij} of the n × K matrix T of target values, where row i is the 1-of-K coding of the class of item x_i:
  (XW - T)_{ij} = y_j(x_i) - t_{ij}
• Let us consider the diagonal entries of (XW - T)(XW - T)^T. Then,
  ((XW - T)(XW - T)^T)_{ii} = \sum_{j=1}^{K} (y_j(x_i) - t_{ij})^2
  That is, assuming x_i is in class C_k,
  ((XW - T)(XW - T)^T)_{ii} = (y_k(x_i) - 1)^2 + \sum_{j≠k} y_j(x_i)^2
SLIDE 23

Learning functions yi

• Summing all the elements on the diagonal of (XW - T)(XW - T)^T gives the overall sum, over all items of T, of the squared differences between the target values and the values computed by the model with parameters W
• This corresponds to the trace of (XW - T)^T (XW - T), since tr(AA^T) = tr(A^T A). Hence, we have to minimize:
  E(W) = (1/2) tr((XW - T)^T (XW - T))
• Standard approach: solve ∂E(W)/∂W = 0 (a sketch of the resulting classifier follows)
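A sketch of the resulting classifier: solving ∂E(W)/∂W = 0 yields the usual normal-equations solution W = (X^T X)^{-1} X^T T (a standard least-squares fact, not spelled out on the slide), computed here with a pseudo-inverse for numerical stability. The function names are my own:

    import numpy as np

    def fit_least_squares_classifier(X, labels, K):
        # X: n x D data matrix; labels: n integer class indices in {0, ..., K-1}
        n = X.shape[0]
        Xt = np.hstack([np.ones((n, 1)), X])   # prepend the bias column: x = (1, x_1, ..., x_D)
        T = np.eye(K)[labels]                  # n x K matrix of 1-of-K target rows
        W = np.linalg.pinv(Xt) @ T             # minimizes (1/2) tr((XW - T)^T (XW - T))
        return W

    def predict(W, X):
        Xt = np.hstack([np.ones((X.shape[0], 1)), X])
        return np.argmax(Xt @ W, axis=1)       # assign x to C_k, k = argmax_i y_i(x)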

SLIDE 24

Fisher's linear discriminant

SLIDE 25

Approach

• The idea of Linear Discriminant Analysis (LDA) is to find a linear projection of the training set into a suitable subspace where the classes are as linearly separable as possible
• A common approach is Fisher's linear discriminant, where all items in the training set (points in a D-dimensional space) are projected to one dimension, by means of a transformation of the type
  y = w · x = w^T x
  where w is the D-dimensional vector corresponding to the direction of projection (in the following, we will consider the one with unit norm)

SLIDE 26

LDA

If K = 2, given a threshold ỹ, item x is assigned to C_1 iff its projection y = w^T x is such that y > ỹ; otherwise, x is assigned to C_2.

SLIDE 27

LDA

Different line directions, that is, different parameters w, may induce quite different separability properties.
SLIDE 28

Deriving w in the binary case

Let n_1 be the number of items in the training set belonging to class C_1 and n_2 the number of items in class C_2. The mean points of the two classes are

m_1 = (1/n_1) \sum_{x∈C_1} x,   m_2 = (1/n_2) \sum_{x∈C_2} x

A simple measure of the separation of the classes, when the training set is projected onto a line, is the difference between the projected mean points

m̃_2 - m̃_1 = w^T (m_2 - m_1)

where m̃_i = w^T m_i is the projection of m_i onto the line (a small numeric sketch follows).
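A small numeric sketch of these quantities; the two point clouds are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))   # items of class C1
    X2 = rng.normal(loc=[3.0, 1.0], size=(60, 2))   # items of class C2

    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)       # class mean points
    w = np.array([1.0, 0.0])                        # an example unit projection direction
    print(w @ (m2 - m1))                            # projected separation m~2 - m~1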

SLIDE 29

Deriving w in the binary case

• We wish to find a line direction w such that m̃_2 - m̃_1 is maximum
• w^T (m_2 - m_1) can be made arbitrarily large by multiplying w by a suitable constant, while keeping the direction unchanged. To avoid this drawback, we consider unit vectors, introducing the constraint ||w||^2 = w^T w = 1
• This results in an optimization with a Lagrange multiplier: we wish to maximize the following function of w and λ:
  w^T (m_2 - m_1) + λ(1 - w^T w)
SLIDE 30

Deriving w in the binary case

Setting the gradient of the function wrt w to 0:
∂/∂w (w^T (m_2 - m_1) + λ(1 - w^T w)) = m_2 - m_1 - 2λw = 0
results in
w = (m_2 - m_1) / (2λ)
SLIDE 31

Deriving w in the binary case

Setting the derivative wrt λ to 0:
∂/∂λ (w^T (m_2 - m_1) + λ(1 - w^T w)) = 1 - w^T w = 0
Substituting w = (m_2 - m_1)/(2λ), this results in
1 - w^T w = 1 - (m_2 - m_1)^T (m_2 - m_1) / (4λ^2) = 0
that is,
λ = \sqrt{(m_2 - m_1)^T (m_2 - m_1)} / 2 = ||m_2 - m_1||_2 / 2
Combining with the result for the gradient,
w = (m_2 - m_1) / ||m_2 - m_1||_2
SLIDE 32

Deriving w in the binary case

The direction w of the line is thus the one from m_1 to m_2. This may result in a poor separation of the classes: the projections of the classes may be dispersed (high variance) along the direction of m_2 - m_1, which may result in a large overlap.
SLIDE 33

Deriving w in the binary case: refinement

• Choose directions s.t. the projections of the classes show as little dispersion as possible
• This is possible when the amount of class dispersion changes with the direction, that is, if the distribution of the points in a class is elongated
• We wish then to maximize a function which:
  • grows with the separation between the projected classes (for example, between their mean points)
  • decreases with the dispersion of the projections of the points of each class
SLIDE 34

Deriving w in the binary case: refinement

• The within-class variance of the projection of class C_i (i = 1, 2) is defined as
  s_i^2 = \sum_{x∈C_i} (w^T x - m̃_i)^2
  The total within-class variance is defined as s_1^2 + s_2^2
• Given a direction w, the Fisher criterion is the ratio between the (squared) class separation and the overall within-class variance along that direction:
  J(w) = (m̃_2 - m̃_1)^2 / (s_1^2 + s_2^2)
• Indeed, J(w) grows with class separation and decreases with within-class variance
SLIDE 35

Deriving w in the binary case: refinement

Let S_1, S_2 be the within-class covariance matrices, defined as
S_i = \sum_{x∈C_i} (x - m_i)(x - m_i)^T
Then,
s_i^2 = \sum_{x∈C_i} (w^T x - m̃_i)^2
      = \sum_{x∈C_i} (w^T x - w^T m_i)^2
      = \sum_{x∈C_i} (w^T x - w^T m_i)(x^T w - m_i^T w)
      = \sum_{x∈C_i} (w^T (x - m_i)) ((x - m_i)^T w)
      = w^T ( \sum_{x∈C_i} (x - m_i)(x - m_i)^T ) w
      = w^T S_i w
SLIDE 36

Deriving w in the binary case: refinement

Let also S_W = S_1 + S_2 be the total within-class covariance matrix and S_B = (m_2 - m_1)(m_2 - m_1)^T the between-class covariance matrix. Then,
J(w) = (m̃_2 - m̃_1)^2 / (s_1^2 + s_2^2)
     = (w^T m_2 - w^T m_1)^2 / (w^T S_1 w + w^T S_2 w)
     = (w^T (m_2 - m_1)(m_2 - m_1)^T w) / (w^T S_W w)
     = (w^T S_B w) / (w^T S_W w)
SLIDE 37

Deriving w in the binary case: refinement

As usual, J(w) is maximized wrt w by setting its gradient to 0:
∂/∂w [ (w^T S_B w) / (w^T S_W w) ] = (2(w^T S_W w) S_B w - 2(w^T S_B w) S_W w) / (w^T S_W w)^2 = 0
which results in
(w^T S_B w) S_W w = (w^T S_W w) S_B w
SLIDE 38

Deriving w in the binary case: refinement

Observe that:

• w^T S_B w is a scalar, say c_B
• w^T S_W w is a scalar, say c_W
• (m_2 - m_1)^T w is a scalar, say c_m

Then the condition (w^T S_B w) S_W w = (w^T S_W w) S_B w can be written as
c_B S_W w = c_W S_B w = c_W (m_2 - m_1)(m_2 - m_1)^T w = c_W c_m (m_2 - m_1)
which results in
w = (c_W c_m / c_B) S_W^{-1} (m_2 - m_1)
Since we are interested in the direction of w, that is, in any vector proportional to it, we may consider the solution (computed in the sketch below)
ŵ = S_W^{-1} (m_2 - m_1) = (S_1 + S_2)^{-1} (m_2 - m_1)
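A direct sketch of this solution; the function and variable names are mine:

    import numpy as np

    def fisher_direction(X1, X2):
        # X1, X2: arrays of the training items of C1 and C2 (one row per item)
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = (X1 - m1).T @ (X1 - m1)           # within-class covariance matrix of C1
        S2 = (X2 - m2).T @ (X2 - m2)           # within-class covariance matrix of C2
        w = np.linalg.solve(S1 + S2, m2 - m1)  # w^ = S_W^{-1} (m2 - m1)
        return w / np.linalg.norm(w)           # only the direction matters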

SLIDE 39

Deriving w in the binary case: choosing a threshold

Possible approach:

• model p(y|C_i) as a Gaussian; derive mean and variance by maximum likelihood:
  m_i = (1/n_i) \sum_{x∈C_i} w^T x,   σ_i^2 = (1/(n_i - 1)) \sum_{x∈C_i} (w^T x - m_i)^2
  where n_i is the number of items in the training set belonging to class C_i
• derive the class probabilities
  p(C_i|y) ∝ p(y|C_i) p(C_i) = p(y|C_i) · n_i/(n_1 + n_2) ∝ (n_i/σ_i) e^{-(y - m_i)^2/(2σ_i^2)}
• the threshold ỹ can be derived as the minimum y such that
  p(C_2|y)/p(C_1|y) = (n_2 p(y|C_2)) / (n_1 p(y|C_1)) > 1
A sketch of this procedure follows.
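A sketch of this threshold computation under the Gaussian model above; the grid search and all names are my own illustrative choices, and it assumes the threshold lies between the two projected means:

    import numpy as np

    def fisher_threshold(y1, y2):
        # y1, y2: projections w^T x of the training items of C1 and C2
        n1, n2 = len(y1), len(y2)
        m1, m2 = y1.mean(), y2.mean()
        s1, s2 = y1.std(ddof=1), y2.std(ddof=1)

        def ratio(y):
            # p(C2|y) / p(C1|y), up to a factor common to both classes
            p2 = n2 / s2 * np.exp(-(y - m2) ** 2 / (2 * s2 ** 2))
            p1 = n1 / s1 * np.exp(-(y - m1) ** 2 / (2 * s1 ** 2))
            return p2 / p1

        grid = np.linspace(min(m1, m2), max(m1, m2), 1000)
        return grid[np.argmax(ratio(grid) > 1)]   # minimum grid point with ratio > 1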

SLIDE 40

LDA and multiclass classification

Let K > 2 and assume D > K, that is, the number of features is greater than the number of classes. Let also D′, 1 < D′ < D, be the dimension of the projection space: then D′ linear transformations y_k = w_k^T x (k = 1, . . . , D′) are defined, which project a D-dimensional point x into a D′-dimensional point y = (y_1, . . . , y_{D′})^T. In short, if w_i is the i-th column of W,
y = W^T x
To apply the same criterion as in the binary case, we have to define within-class and between-class matrices, both in the D-dimensional and in the D′-dimensional spaces. The generalization of the within-class covariance matrix is immediate:
S_W = \sum_{i=1}^{K} S_i = \sum_{i=1}^{K} \sum_{x∈C_i} (x - m_i)(x - m_i)^T
where
m_i = (1/n_i) \sum_{x∈C_i} x
SLIDE 41

LDA and multiclass classification

As for the between-class covariance, we first define the total covariance matrix of the training set:
S_T = \sum_x (x - m)(x - m)^T = \sum_{i=1}^{K} \sum_{x∈C_i} (x - m)(x - m)^T
where m is the mean point of the whole training set:
m = (1/n) \sum_x x = (1/n) \sum_{i=1}^{K} n_i m_i
This matrix can be decomposed as follows:
S_T = \sum_{i=1}^{K} \sum_{x∈C_i} (x - m_i + m_i - m)(x - m_i + m_i - m)^T
    = \sum_{i=1}^{K} \sum_{x∈C_i} (x - m_i)(x - m_i)^T + \sum_{i=1}^{K} \sum_{x∈C_i} (m_i - m)(m_i - m)^T
    = S_W + \sum_{i=1}^{K} n_i (m_i - m)(m_i - m)^T
(the cross terms vanish, since \sum_{x∈C_i} (x - m_i) = 0)
SLIDE 42

LDA and multiclass classification

In the identity
S_T = S_W + \sum_{i=1}^{K} n_i (m_i - m)(m_i - m)^T
we may identify the share of the total covariance not due to within-class covariance as between-class covariance, thus defining the between-class covariance matrix as
S_B = \sum_{i=1}^{K} n_i (m_i - m)(m_i - m)^T
SLIDE 43

LDA and multiclass classification

In the projected D′-dimensional space, the corresponding matrices are
s_W = \sum_{i=1}^{K} s_i = \sum_{i=1}^{K} \sum_{x∈C_i} (W^T x - m_i)(W^T x - m_i)^T
s_B = \sum_{i=1}^{K} n_i (m_i - m)(m_i - m)^T
where now
m_i = (1/n_i) \sum_{x∈C_i} W^T x,   m = (1/n) \sum_x W^T x
It is also possible to prove that s_W = W^T S_W W and s_B = W^T S_B W.
SLIDE 44

LDA and multiclass classification

• Reminder: we need a matrix W that:
  1. increases the dispersion between classes (between-class covariance after projection)
  2. decreases the dispersion of points within classes (within-class covariance after projection)
• Different measures of dispersion can be introduced in this framework, such as:
  1. the ratio between the determinants of s_B and s_W:
     J(W) = |s_B| / |s_W| = |s_W^{-1} s_B| = |(W^T S_W W)^{-1} W^T S_B W|
     the determinant is the product of the eigenvalues (and, approximately, of the variances along the distribution axes in a Gaussian model)
  2. the trace of the "ratio" between s_B and s_W:
     J(W) = tr(s_W^{-1} s_B) = tr((W^T S_W W)^{-1} W^T S_B W)
     note that the trace is the sum of the eigenvalues

It is possible to prove that W is given by the eigenvectors of S_W^{-1} S_B corresponding to the D′ largest eigenvalues (see the sketch below).
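A sketch of the multiclass construction via the eigenvectors of S_W^{-1} S_B; the function and variable names are mine:

    import numpy as np

    def lda_projection(X, labels, d_prime):
        # X: n x D data matrix; labels: n integer class labels; returns W of size D x D'
        n, D = X.shape
        m = X.mean(axis=0)
        SW = np.zeros((D, D))
        SB = np.zeros((D, D))
        for c in np.unique(labels):
            Xc = X[labels == c]
            mc = Xc.mean(axis=0)
            SW += (Xc - mc).T @ (Xc - mc)             # accumulate S_i
            SB += len(Xc) * np.outer(mc - m, mc - m)  # n_i (m_i - m)(m_i - m)^T
        evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))
        order = np.argsort(-evals.real)               # D' largest eigenvalues first
        return evecs.real[:, order[:d_prime]]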

SLIDE 45

Perceptron

SLIDE 46

Perceptron

• Introduced in the '60s, it is at the basis of the neural network approach
• A simple model of a single neuron
• Its output has no direct probabilistic interpretation
• Works only in the case that the classes are linearly separable
SLIDE 47

Definition

It corresponds to a binary classification model where an item x is first transformed by a nonlinear function φ and then classified on the basis of the sign of the obtained value. That is,
y(x) = f(w^T φ(x))
where f is essentially the sign function:
f(a) = -1 if a < 0, +1 if a ≥ 0
The resulting model is a particular generalized linear model. A special case is the one where φ is the identity, that is, y(x) = f(w^T x). By the definition of the model, y(x) can only be ±1: we interpret y(x) = 1 as x ∈ C_1 and y(x) = -1 as x ∈ C_2. To each element x_i in the training set a target value t_i ∈ {-1, 1} is then associated.
SLIDE 48

Cost function

• A natural definition of the cost function would be the number of misclassified elements in the training set
• This would result in a piecewise constant function, and gradient optimization could not be applied (the gradient would be zero almost everywhere)
• A better choice is to use a piecewise linear cost function
SLIDE 49

Cost function

We would like to find a vector of parameters w such that, for any x_i, w^T φ(x_i) > 0 if x_i ∈ C_1 and w^T φ(x_i) < 0 if x_i ∈ C_2; in short, w^T φ(x_i) t_i > 0. Each element x_i contributes to the cost function as follows:

1. 0 if x_i is classified correctly by the model
2. -w^T φ(x_i) t_i > 0 if x_i is misclassified

Let M be the set of misclassified elements. Then the cost is
E_p(w) = - \sum_{x_i∈M} w^T φ(x_i) t_i
The contribution of x_i to the cost is 0 if x_i ∉ M and a linear function of w otherwise.
SLIDE 50

Gradient optimization

The minimum of E_p(w) can be found through gradient descent:
w^{(k+1)} = w^{(k)} - η ∂E_p(w)/∂w |_{w=w^{(k)}}
The gradient of the cost function wrt w is
∂E_p(w)/∂w = - \sum_{x_i∈M} φ(x_i) t_i
Then gradient descent can be expressed as
w^{(k+1)} = w^{(k)} + η \sum_{x_i∈M_k} φ(x_i) t_i
where M_k denotes the set of points misclassified by the model with parameters w^{(k)}.
SLIDE 51

Gradient optimization

Online version (stochastic gradient descent): at each step, only the gradient wrt a single item is considered:
w^{(k+1)} = w^{(k)} + η φ(x_i) t_i,   where x_i ∈ M_k
The method works by cyclically iterating over all elements and applying the above formula:

    initialize w^{(0)}; k := 0
    repeat
        k := k + 1
        i := (k mod n) + 1
        if f(w^{(k)T} φ(x_i)) · t_i > 0 then
            w^{(k+1)} := w^{(k)}                  (x_i correctly classified)
        else
            w^{(k+1)} := w^{(k)} + η φ(x_i) t_i
    until all elements are correctly classified

A runnable version follows.
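A runnable sketch of this loop, with φ supplied as precomputed feature rows and an epoch cap added so non-separable data cannot loop forever; the names are mine:

    import numpy as np

    def perceptron_train(Phi, t, eta=1.0, max_epochs=1000):
        # Phi: n x M matrix whose rows are phi(x_i); t: n targets in {-1, +1}
        n, M = Phi.shape
        w = np.zeros(M)                          # initialize w^(0)
        for _ in range(max_epochs):
            updated = False
            for i in range(n):                   # cycle over all elements
                if (w @ Phi[i]) * t[i] <= 0:     # x_i misclassified
                    w = w + eta * Phi[i] * t[i]  # w^(k+1) = w^(k) + eta phi(x_i) t_i
                    updated = True
            if not updated:                      # all elements correctly classified
                break
        return w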

SLIDE 52

Gradient optimization

[Figure: in black, the decision boundary and the corresponding parameter vector w; in red, the misclassified item vector φ(x_i), added by the algorithm to the parameter vector as η φ(x_i)]
SLIDE 53

Gradient optimization

At each step, if x_i is correctly classified then w^{(k)} is unchanged; otherwise, its contribution to the cost is modified as follows:
-(w^{(k+1)})^T φ(x_i) t_i = -(w^{(k)})^T φ(x_i) t_i - η (φ(x_i) t_i)^T φ(x_i) t_i
                          = -(w^{(k)})^T φ(x_i) t_i - η ||φ(x_i)||^2   (since t_i^2 = 1)
                          < -(w^{(k)})^T φ(x_i) t_i
This contribution decreases; however, this does not guarantee the convergence of the method, since the overall cost could increase due to some other element becoming misclassified when w^{(k+1)} is used.
SLIDE 54

Perceptron convergence theorem

It is possible to prove that, in the case the classes are linearly separable, the algorithm converges to a correct solution in a finite number of steps. Let ŵ be a solution (that is, it discriminates C_1 and C_2): if x_{k+1} is the element considered at iteration k + 1 and it is misclassified, then
w^{(k+1)} - αŵ = (w^{(k)} - αŵ) + η φ(x_{k+1}) t_{k+1}
where α > 0 is a constant, to be specified later.
SLIDE 55

Perceptron convergence theorem

By squaring both sides of the above formula, we get
||w^{(k+1)} - αŵ||^2 = ||w^{(k)} - αŵ||^2 + η^2 ||φ(x_{k+1})||^2 + 2η (w^{(k)} - αŵ)^T φ(x_{k+1}) t_{k+1}
                     = ||w^{(k)} - αŵ||^2 + η^2 ||φ(x_{k+1})||^2 + 2η (w^{(k)})^T φ(x_{k+1}) t_{k+1} - 2ηα ŵ^T φ(x_{k+1}) t_{k+1}
Since x_{k+1} was misclassified by hypothesis, (w^{(k)})^T φ(x_{k+1}) t_{k+1} < 0 and
||w^{(k+1)} - αŵ||^2 < ||w^{(k)} - αŵ||^2 + η^2 ||φ(x_{k+1})||^2 - 2ηα ŵ^T φ(x_{k+1}) t_{k+1}
SLIDE 56

Perceptron convergence theorem

Let γ be the minimum value of the signed dot product of ŵ with φ(x_i), where the sign depends on the class of x_i:
γ = min_i (ŵ^T φ(x_i) t_i) = min_i |ŵ^T φ(x_i)| > 0
Let δ be the length of the longest φ(x_i):
δ^2 = max_i ||φ(x_i)||^2
Then,
||w^{(k+1)} - αŵ||^2 < ||w^{(k)} - αŵ||^2 + η^2 δ^2 - 2ηαγ
SLIDE 57

Perceptron convergence theorem

By setting α = ηδ^2/γ we get
||w^{(k+1)} - αŵ||^2 < ||w^{(k)} - αŵ||^2 + η^2 δ^2 - 2η(ηδ^2/γ)γ = ||w^{(k)} - αŵ||^2 - η^2 δ^2
As can be seen, the squared distance between w^{(k)} and αŵ decreases at each step by an amount greater than η^2 δ^2.
SLIDE 58

Perceptron convergence theorem

Iterating the above property over all steps,
||w^{(k+1)} - αŵ||^2 < ||w^{(0)} - αŵ||^2 - (k + 1) η^2 δ^2
Note that after
k = ||w^{(0)} - αŵ||^2 / (η^2 δ^2) - 1
steps we get
||w^{(0)} - αŵ||^2 - (k + 1) η^2 δ^2 = 0
Since a squared distance cannot become negative, after at most k updates of w a separating decision boundary has been derived.
SLIDE 59

Perceptron convergence theorem

Setting w^{(0)} = 0, we have
k = α^2 ||ŵ||^2 / (η^2 δ^2) - 1 = (δ^2/γ^2) ||ŵ||^2 - 1 = (max_i ||φ(x_i)||^2 / (min_i ŵ^T φ(x_i) t_i)^2) ||ŵ||^2 - 1
The number of required steps is large if min_i (ŵ^T φ(x_i) t_i) is small, that is, if there exists some x_i such that φ(x_i) is (almost) orthogonal to ŵ.