

SLIDE 1

Linear classifiers

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2019

SLIDE 2

Topics

- Linear classifiers
- Perceptron
- Fisher (LDA)
- Multi-class classification

SVM will be covered in later lectures.

SLIDE 3

Classification problem

- Given: a training set of $N$ labeled input-output pairs $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$
  - $y \in \{1, \dots, K\}$
- Goal: given an input $\mathbf{x}$, assign it to one of $K$ classes
- Examples:
  - Spam filter
  - Handwritten digit recognition

SLIDE 4

Linear classifiers

- Decision boundaries are linear in $\mathbf{x}$, or linear in some given set of functions of $\mathbf{x}$
- Linearly separable data: data points that can be exactly classified by a linear decision surface
- Why linear classifiers?
  - Even when they are not optimal, their simplicity makes them useful
  - They are relatively easy to compute
  - In the absence of information suggesting otherwise, linear classifiers are attractive candidates for initial, trial classifiers.

SLIDE 5

Two Category

- $g(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x} + w_0 = w_0 + w_1 x_1 + \dots + w_d x_d$
  - $\mathbf{x} = [x_1\; x_2\; \dots\; x_d]$: input
  - $\mathbf{w} = [w_1\; w_2\; \dots\; w_d]$: weight vector
  - $w_0$: bias
- If $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then decide $\mathcal{C}_1$, else $\mathcal{C}_2$

Decision surface (boundary): $\mathbf{w}^T\mathbf{x} + w_0 = 0$

$\mathbf{w}$ is orthogonal to every vector lying within the decision surface.
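The two-category decision rule above can be sketched in a few lines of Python (a minimal illustration; the function name and example weights are hypothetical, not from the lecture):

```python
import numpy as np

def linear_decide(w, w0, x):
    """Decide class 1 if w^T x + w0 >= 0, else class 2."""
    return 1 if w @ x + w0 >= 0 else 2

# hypothetical weights: decision boundary x1 + x2 - 3 = 0
w, w0 = np.array([1.0, 1.0]), -3.0
linear_decide(w, w0, np.array([2.0, 2.0]))   # 2 + 2 - 3 >= 0 -> class 1
linear_decide(w, w0, np.array([0.5, 0.5]))   # 0.5 + 0.5 - 3 < 0 -> class 2
```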

SLIDE 6

Example

Figure: in the $(x_1, x_2)$ plane, the decision boundary is the line $3 - \frac{3}{4}x_1 - x_2 = 0$; if $\mathbf{w}^T\mathbf{x} + w_0 \geq 0$ then decide $\mathcal{C}_1$, else $\mathcal{C}_2$.

SLIDE 7

Linear classifier: Two Category

- Decision boundary is a $(d-1)$-dimensional hyperplane $H$ in the $d$-dimensional feature space
  - The orientation of $H$ is determined by the normal vector $[w_1, \dots, w_d]$
  - $w_0$ determines the location of the surface
- The normal distance from the origin to the decision surface is $\frac{|w_0|}{\|\mathbf{w}\|}$
- Writing $\mathbf{x} = \mathbf{x}_\perp + r\frac{\mathbf{w}}{\|\mathbf{w}\|}$ (with $\mathbf{x}_\perp$ the projection of $\mathbf{x}$ onto the surface $g(\mathbf{x}) = 0$) gives

$$\mathbf{w}^T\mathbf{x} + w_0 = r\|\mathbf{w}\| \;\Rightarrow\; r = \frac{\mathbf{w}^T\mathbf{x} + w_0}{\|\mathbf{w}\|}$$

a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface.

SLIDE 8

Linear boundary: geometry

Figure: the boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$ separates the region where $\mathbf{w}^T\mathbf{x} + w_0 > 0$ from the region where $\mathbf{w}^T\mathbf{x} + w_0 < 0$; $\mathbf{w}$ is normal to the boundary.

SLIDE 9

Non-linear decision boundary

- Choose non-linear features; the classifier is still linear in the parameters $\mathbf{w}$
- Example: $\boldsymbol{\phi}(\mathbf{x}) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]$ with $\mathbf{x} = [x_1, x_2]$
- $\mathbf{w} = [w_0, w_1, \dots, w_5] = [-1, 0, 0, 1, 1, 0]$ gives the boundary $-1 + x_1^2 + x_2^2 = 0$
- If $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) \geq 0$ then $y = 1$, else $y = -1$

SLIDE 10

Cost function for linear classification

- Finding linear classifiers can be formulated as an optimization problem:
  - Select how to measure the prediction loss
  - Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$, a cost function $J(\mathbf{w})$ is defined
  - Solve the resulting optimization problem to find the parameters: find the optimal $\hat{g}(\mathbf{x}) = g(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
- Criterion or cost functions for classification:
  - We will investigate several cost functions for the classification problem

SLIDE 11

SSE cost function for classification

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\right)^2 \qquad (K = 2,\; y \in \{-1, +1\})$$

The SSE cost function is not suitable for classification:
- Least-squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
- Least-squares loss also lacks robustness to noise
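The "too correct" penalty is easy to see numerically: with label $y = +1$, a confidently correct score of 5 incurs a far larger squared error than a barely correct score of 0.5 (a small sketch; the function name is illustrative):

```python
def sse_terms(scores, y):
    """Per-sample SSE terms (w^T x - y)^2 for a list of scores w^T x."""
    return [(s - y) ** 2 for s in scores]

# label y = +1; both scores are on the correct side of the boundary
sse_terms([0.5, 5.0], y=1)   # [0.25, 16.0] -> the very correct score is penalized more
```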

SLIDE 12

SSE cost function for classification

Figure [Bishop]: plots of the per-sample loss $(\mathbf{w}^T\mathbf{x} - y)^2$ against $\mathbf{w}^T\mathbf{x}$ for $y = 1$ and $y = -1$ ($K = 2$); correct predictions far from the boundary are still penalized by SSE.

SLIDE 13

SSE cost function for classification

- Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$?

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2 \qquad \operatorname{sign}(a) = \begin{cases} -1, & a < 0 \\ 1, & a \geq 0 \end{cases}$$

- $J(\mathbf{w})$ is now a piecewise constant function, proportional to the number of misclassifications, i.e., the training error incurred in classifying the training samples ($K = 2$)

SLIDE 14

SSE cost function

- Is it more suitable if we set $g(\mathbf{x}; \mathbf{w}) = \tau(\mathbf{w}^T\mathbf{x})$?

$$J(\mathbf{w}) = \sum_{i=1}^{N} \left(\tau(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2 \qquad \tau(a) = \frac{1 - e^{-a}}{1 + e^{-a}}$$

- We will see later in this lecture that the cost function of the logistic regression method is more suitable than this cost function for the classification problem ($K = 2$)

SLIDE 15

Perceptron algorithm

- Linear classifier, two-class: $y \in \{-1, 1\}$
  - $y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$
- Goal: $\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$ and $\forall i,\; \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
- $g(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x})$

SLIDE 16

Perceptron criterion

$$J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)} y^{(i)}$$

$\mathcal{M}$: subset of the training data that are misclassified.

There may be many solutions; which one do we find?

SLIDE 17

Cost function

Figure [Duda, Hart & Stork, 2002]: the number of misclassifications as a cost function (piecewise constant in $(w_0, w_1)$) versus the perceptron cost function $J_P(\mathbf{w})$. There may be many solutions under these cost functions.

SLIDE 18

Batch Perceptron

"Gradient descent" on the optimization problem:

$$\mathbf{w}^{t+1} = \mathbf{w}^{t} - \eta\, \nabla_{\mathbf{w}} J_P(\mathbf{w}^{t}), \qquad \nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$$

Batch Perceptron converges in a finite number of steps for linearly separable data:

Initialize $\mathbf{w}$
Repeat: $\mathbf{w} = \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
Until $\left\| \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)} \right\| < \theta$

SLIDE 19

Stochastic gradient descent for Perceptron

- Single-sample perceptron: if $\mathbf{x}^{(i)}$ is misclassified, $\mathbf{w}^{t+1} = \mathbf{w}^{t} + \eta\, \mathbf{x}^{(i)} y^{(i)}$
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps

Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works):

Initialize $\mathbf{w}$, $t \leftarrow 0$
repeat: $t \leftarrow t + 1$; $i \leftarrow t \bmod N$; if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
until all patterns are properly classified
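The fixed-increment single-sample rule can be sketched directly (a minimal sketch; zero initialization and epoch-wise cycling are simplifying assumptions):

```python
import numpy as np

def perceptron_single_sample(X, y, max_epochs=100):
    """Fixed-increment single-sample perceptron (eta = 1): visit the
    samples cyclically and update on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # x^(i) misclassified
                w = w + yi * xi      # w <- w + y^(i) x^(i)
                mistakes += 1
        if mistakes == 0:            # all patterns properly classified
            break
    return w

X = np.array([[1.0, 2.0, 0.0], [1.0, -2.0, 0.0]])  # bias absorbed via leading 1
y = np.array([1, -1])
w = perceptron_single_sample(X, y)
```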

SLIDE 20

Example

SLIDE 21

Perceptron: Example

Figure [Bishop]: each update changes $\mathbf{w}$ in a direction that corrects the error.

SLIDE 22

Online vs. Batch Learning

- Batch learning
  - Learn from all the examples at once
- Online learning
  - Gradually learn as each example is received
  - Examples: email classification; recommendation systems (recommending movies, predicting whether a user will be interested in a new news article)

SLIDE 23

Perceptron Convergence

- Assume there exists $\mathbf{w}^*$ with $\|\mathbf{w}^*\| = 1$ and some $\gamma > 0$ such that $y^{(t)}\, \mathbf{w}^{*T}\mathbf{x}^{(t)} \geq \gamma$ for all $t = 1, \dots, N$. Also assume that $\|\mathbf{x}^{(t)}\| \leq R$ for all $t = 1, \dots, N$. Then the Perceptron makes at most $\frac{R^2}{\gamma^2}$ errors.
SLIDE 24

Perceptron Convergence (more general case)

- It can be shown that the number of mistakes is bounded in terms of the quantities

$$R^2 = \max_{(\mathbf{x},y) \in D} \|\mathbf{x}\|^2, \qquad \nu = 2 \min_{(\mathbf{x},y) \in D} y\, (\hat{\mathbf{w}}^{0})^T\mathbf{x}, \qquad b = \min_{(\mathbf{x},y) \in D} y\, (\hat{\mathbf{w}}^{*})^T\mathbf{x}$$

where $\hat{\mathbf{w}}^{0}$ is the initial weight vector and $\hat{\mathbf{w}}^{*}$ is a separating solution.

SLIDE 25

Convergence of Perceptron

- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge [Duda, Hart & Stork, 2002]

SLIDE 26

Pocket algorithm

- For data that are not linearly separable due to noise: keep "in the pocket" the best $\mathbf{w}$ encountered so far.

Initialize $\mathbf{w}$
for $t = 1, \dots, T$:
  $i \leftarrow t \bmod N$
  if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w}_{\text{new}} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
  if $E_{\text{train}}(\mathbf{w}_{\text{new}}) < E_{\text{train}}(\mathbf{w})$ then $\mathbf{w} = \mathbf{w}_{\text{new}}$
end

$$E_{\text{train}}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\operatorname{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) \neq y^{(i)}\right]$$
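The pseudocode above can be sketched in Python (a minimal sketch following the slide's variant, in which a perceptron step is kept only when it lowers the training error; the toy data and zero initialization are assumptions):

```python
import numpy as np

def train_error(w, X, y):
    """E_train(w): fraction of misclassified samples."""
    return float(np.mean(np.sign(X @ w) != y))

def pocket(X, y, T=500):
    """Pocket algorithm: a tentative perceptron update is accepted
    only if it reduces the training error."""
    N = len(X)
    w = np.zeros(X.shape[1])
    for t in range(T):
        i = t % N
        if y[i] * (w @ X[i]) <= 0:               # x^(i) misclassified
            w_new = w + y[i] * X[i]              # tentative perceptron step
            if train_error(w_new, X, y) < train_error(w, X, y):
                w = w_new                        # pocket the better weights
    return w

X = np.array([[1.0, 2.0, 0.0], [1.0, -2.0, 0.0]])
y = np.array([1, -1])
w = pocket(X, y)
```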

SLIDE 27

Linear Discriminant Analysis (LDA)

- Fisher's Linear Discriminant Analysis:
  - Dimensionality reduction: finds linear combinations of features with large ratios of between-group scatter to within-group scatter (as new discriminant variables)
  - Classification: predicts the class of an observation $\mathbf{x}$ by first projecting it onto the space of discriminant variables and then classifying it in that space

SLIDE 28

Good Projection for Classification

- What is a good criterion?
  - Separating the different classes in the projected space

SLIDE 31

LDA Problem

- Problem definition:
  - $K = 2$ classes
  - $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($\mathcal{C}_1$) and $N_2$ samples from the second class ($\mathcal{C}_2$)
- Goal: find the direction $\mathbf{w}$ that we hope enables accurate classification
  - The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$
- What is a good measure of the separation between the projected points of the different classes?

SLIDE 32

Measure of Separation in the Projected Direction

- Is the direction of the line joining the class means a good candidate for $\mathbf{w}$? [Bishop]

SLIDE 33

Measure of Separation in the Projected Direction

- The direction of the line joining the class means is the solution of the following problem, which maximizes the separation of the projected class means:

$$\max_{\mathbf{w}} \; J(\mathbf{w}) = (\tilde{\mu}_1 - \tilde{\mu}_2)^2 \quad \text{s.t.} \; \|\mathbf{w}\| = 1$$

where $\tilde{\mu}_k = \mathbf{w}^T\boldsymbol{\mu}_k$ and $\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_k} \mathbf{x}^{(i)}$ for $k = 1, 2$.

- What is the problem with a criterion that considers only $\tilde{\mu}_1 - \tilde{\mu}_2$?
  - It does not consider the variances of the classes in the projected direction

SLIDE 34

LDA Criterion

- Fisher's idea: maximize a function that gives
  - a large separation between the projected class means,
  - while also achieving a small variance within each class, thereby minimizing the class overlap:

$$J(\mathbf{w}) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
SLIDE 35

LDA Criterion

- The scatters of the projected data are:

$$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 \qquad \tilde{s}_2^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_2} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2$$

SLIDE 36

LDA Criterion

$$J(\mathbf{w}) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$

$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(\mathbf{w}^T\boldsymbol{\mu}_1 - \mathbf{w}^T\boldsymbol{\mu}_2\right)^2 = \mathbf{w}^T (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{w}$$

$$\tilde{s}_1^2 = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} \left(\mathbf{w}^T\mathbf{x}^{(i)} - \mathbf{w}^T\boldsymbol{\mu}_1\right)^2 = \mathbf{w}^T \left[\sum_{\mathbf{x}^{(i)} \in \mathcal{C}_1} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1)^T\right] \mathbf{w}$$

SLIDE 37

LDA Criterion

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

- Between-class scatter matrix: $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
- Within-class scatter matrix: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, where $\mathbf{S}_k = \sum_{\mathbf{x}^{(i)} \in \mathcal{C}_k} (\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T$
- (A scatter matrix is $N$ times the covariance matrix)

SLIDE 38

LDA Derivation

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

Setting the gradient to zero:

$$\frac{\partial J}{\partial \mathbf{w}} = \frac{2\,\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_W\mathbf{w}) - 2\,\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_B\mathbf{w})}{(\mathbf{w}^T\mathbf{S}_W\mathbf{w})^2} = 0$$

$$\Rightarrow \mathbf{S}_B\mathbf{w} = \lambda\, \mathbf{S}_W\mathbf{w} \qquad \left(\lambda = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}\right)$$

SLIDE 39

LDA Derivation

- $\mathbf{S}_B\mathbf{w}$ (for any vector $\mathbf{w}$) points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:

$$\mathbf{S}_B\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{w} \propto \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$$

- Thus, if $\mathbf{S}_W$ is full rank, we can solve the generalized eigenvalue problem $\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$ (i.e., $\mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\mathbf{w}$) immediately:

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

SLIDE 40

LDA Algorithm

- Find $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ as the means of class 1 and class 2, respectively
- Find $\mathbf{S}_1$ and $\mathbf{S}_2$ as the scatter matrices of class 1 and class 2, respectively
- $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, $\mathbf{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$
- Feature extraction
  - $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$
- Classification
  - With $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, a threshold on $\mathbf{w}^T\mathbf{x}$ classifies $\mathbf{x}$
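The steps above can be sketched in a few lines of numpy (a minimal sketch; it assumes $\mathbf{S}_W$ is invertible, and the threshold at the midpoint of the projected class means is one simple, hypothetical choice, not prescribed by the slide):

```python
import numpy as np

def fisher_lda(X1, X2):
    """w = S_W^{-1} (mu1 - mu2) for two classes given as (N_k, d) arrays."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)      # class-1 scatter matrix
    S2 = (X2 - mu2).T @ (X2 - mu2)      # class-2 scatter matrix
    Sw = S1 + S2                        # within-class scatter
    return np.linalg.solve(Sw, mu1 - mu2), mu1, mu2

def lda_classify(x, w, mu1, mu2):
    """Threshold w^T x at the midpoint of the projected class means."""
    thresh = w @ (mu1 + mu2) / 2.0
    return 1 if w @ x >= thresh else 2

X1 = np.array([[4.0, 2.0], [4.0, 4.0], [6.0, 2.0], [6.0, 4.0]])  # toy class 1
X2 = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0]])  # toy class 2
w, mu1, mu2 = fisher_lda(X1, X2)
lda_classify(np.array([5.0, 3.0]), w, mu1, mu2)   # class 1
```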

SLIDE 41

Multi-class classification

- Solutions to multi-category problems:
  - Converting the problem to a set of two-class problems
  - Extending the learning algorithm to support multi-class: a function $g_j(\mathbf{x})$ is found for each class $j$, and $\hat{y} = \operatorname{argmax}_{j=1,\dots,K} g_j(\mathbf{x})$


SLIDE 43

Converting a multi-class problem to a set of two-class problems

- "One versus rest" (or "one against all")
  - For each class $\mathcal{C}_j$, a linear discriminant function separating the samples of $\mathcal{C}_j$ from all the other samples is found
  - Assumes the classes are totally linearly separable
- "One versus one"
  - $K(K-1)/2$ linear discriminant functions are used, one to separate the samples of each pair of classes
  - Assumes the classes are pairwise linearly separable
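Building the two sets of sub-problems is mechanical; a small sketch (function names are illustrative):

```python
from itertools import combinations

def one_vs_rest_labels(y, classes):
    """One two-class problem per class: +1 for class c, -1 for the rest."""
    return {c: [1 if yi == c else -1 for yi in y] for c in classes}

def one_vs_one_pairs(classes):
    """K(K-1)/2 pairwise problems, one per pair of classes."""
    return list(combinations(classes, 2))

classes = [1, 2, 3]
one_vs_rest_labels([1, 2, 3, 1], classes)[1]   # [1, -1, -1, 1]
one_vs_one_pairs(classes)                      # [(1, 2), (1, 3), (2, 3)], i.e. K(K-1)/2 = 3
```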

SLIDE 44

Multi-class classification

- One-vs-all (one-vs-rest)

Figure: three two-class problems in the $(x_1, x_2)$ plane, one per class (class 1 vs. rest, class 2 vs. rest, class 3 vs. rest).

SLIDE 45

Multi-class classification

- One-vs-one

Figure: three pairwise two-class problems in the $(x_1, x_2)$ plane, one per pair of classes (class 1 vs. 2, class 1 vs. 3, class 2 vs. 3).

SLIDE 46

Multi-class classification: ambiguity

- Converting the multi-class problem to a set of two-class problems can lead to regions in which the classification is undefined

Figure [Duda, Hart & Stork, 2002]: ambiguous regions for one versus rest and for one versus one.

SLIDE 47

Multi-class classification

- Solutions to multi-category problems:
  - Converting the problem to a set of two-class problems
  - Extending the learning algorithm to support multi-class: a function $g_j(\mathbf{x})$ is found for each class $j$, and $\hat{y} = \operatorname{argmax}_{j=1,\dots,K} g_j(\mathbf{x})$
- $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j$

SLIDE 48

Discriminant Functions

- Discriminant functions: a discriminant function $g_j(\mathbf{x})$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  - $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j$
- Thus, we can easily divide the feature space into $K$ decision regions ($\mathcal{R}_j$: region of the $j$-th class):

$$\forall \mathbf{x},\; g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j \;\Rightarrow\; \mathbf{x} \in \mathcal{R}_j$$

- Decision surfaces (or boundaries) can also be found using the discriminant functions
  - The boundary of $\mathcal{R}_j$ and $\mathcal{R}_k$, separating the samples of these two categories: $\forall \mathbf{x},\; g_j(\mathbf{x}) = g_k(\mathbf{x})$

SLIDE 49

Discriminant functions

- Discriminant functions can directly assign each vector $\mathbf{x}$ to a specific class
- A popular way of representing a classifier; many classification methods are based on discriminant functions
- Assumption: the classes are taken to be disjoint
  - The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces

SLIDE 50

Discriminant Functions: Two-Category

- Decision surface: $g(\mathbf{x}) = 0$
- For a two-category problem, we need only a single function $g : \mathbb{R}^d \to \mathbb{R}$:
  - $g_1(\mathbf{x}) = g(\mathbf{x})$
  - $g_2(\mathbf{x}) = -g(\mathbf{x})$

SLIDE 51

Multi-class classification: linear machine

- A linear discriminant function $g_j(\mathbf{x}) = \mathbf{w}_j^T\mathbf{x} + w_{j0}$ for each class $\mathcal{C}_j$ ($j = 1, \dots, K$):
  - $\mathbf{x}$ is assigned to class $\mathcal{C}_j$ if $g_j(\mathbf{x}) > g_k(\mathbf{x})\;\; \forall k \neq j$
- Decision surfaces (boundaries) can also be found using the discriminant functions
  - Boundary of the contiguous $\mathcal{R}_j$ and $\mathcal{R}_k$: $\forall \mathbf{x},\; g_j(\mathbf{x}) = g_k(\mathbf{x})$, i.e.

$$(\mathbf{w}_j - \mathbf{w}_k)^T\mathbf{x} + (w_{j0} - w_{k0}) = 0$$

- Decision regions are convex

SLIDE 52

Multi-class classification: linear machine

- Decision regions are convex
- Linear machines are most suitable for problems in which the class-conditional densities $p(\mathbf{x}|\mathcal{C}_j)$ are unimodal

Convex region definition: $\forall \mathbf{x}_A, \mathbf{x}_B \in \mathcal{R},\; 0 \leq \beta \leq 1 \;\Rightarrow\; \beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B \in \mathcal{R}$

Proof that $\mathcal{R}_j$ is convex (using the linearity of $g_j$):

$$\mathbf{x}_A, \mathbf{x}_B \in \mathcal{R}_j \;\Rightarrow\; \forall k \neq j:\;\; g_j(\mathbf{x}_A) \geq g_k(\mathbf{x}_A),\;\; g_j(\mathbf{x}_B) \geq g_k(\mathbf{x}_B)$$

$$\Rightarrow\; \beta g_j(\mathbf{x}_A) + (1-\beta)\, g_j(\mathbf{x}_B) \geq \beta g_k(\mathbf{x}_A) + (1-\beta)\, g_k(\mathbf{x}_B)$$

$$\Rightarrow\; g_j\big(\beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B\big) \geq g_k\big(\beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B\big) \;\Rightarrow\; \beta\mathbf{x}_A + (1-\beta)\mathbf{x}_B \in \mathcal{R}_j$$

SLIDE 53

Multi-class classification: linear machine

Figure [Duda, Hart & Stork, 2002]: decision regions of a linear machine.

SLIDE 54

Perceptron: multi-class

$$\hat{y} = \operatorname{argmax}_{j=1,\dots,K} \mathbf{w}_j^T\mathbf{x}$$

$$J_P(\mathbf{W}) = -\sum_{i \in \mathcal{M}} \left(\mathbf{w}_{y^{(i)}} - \mathbf{w}_{\hat{y}^{(i)}}\right)^T \mathbf{x}^{(i)}$$

$\mathcal{M}$: subset of the training data that are misclassified, $\mathcal{M} = \{i \mid \hat{y}^{(i)} \neq y^{(i)}\}$

Initialize $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$, $t \leftarrow 0$
repeat: $t \leftarrow (t + 1) \bmod N$; if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w}_{\hat{y}^{(i)}} = \mathbf{w}_{\hat{y}^{(i)}} - \mathbf{x}^{(i)}$ and $\mathbf{w}_{y^{(i)}} = \mathbf{w}_{y^{(i)}} + \mathbf{x}^{(i)}$
until all patterns are properly classified

SLIDE 55

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Chapter 4.1.