SLIDE 1 (1/44): LINEAR CLASSIFIER

Václav Hlaváč
Czech Technical University, Faculty of Electrical Engineering
Department of Cybernetics, Center for Machine Perception
121 35 Praha 2, Karlovo nám. 13, Czech Republic
hlavac@fel.cvut.cz, http://cmp.felk.cvut.cz

LECTURE PLAN

  • Rehearsal: linear classifiers and their importance.
  • Learning linear classifiers: three formulations.
  • Generalized Anderson task, first part.
  • Perceptron and Kozinec algorithms.
  • Generalized Anderson task, second part.
SLIDE 2 (2/44): CLASSIFIER

The analyzed object is represented by
  • X – the space of observations,
  • K – the set of hidden states.

The aim of classification is to determine a relation between X and K, i.e., to find a function f : X → K.

A classifier q : X → J maps observations X → the set of class indices J, J = {1, . . . , |K|}.

Mutual exclusion of classes: X = X1 ∪ X2 ∪ . . . ∪ X|K|, Xi ∩ Xj = ∅ for i ≠ j.

SLIDE 3 (3/44): CLASSIFIER, ILLUSTRATION

  • A classifier partitions the observation space X into class-labelled regions.
  • Classification determines to which region an observation vector x belongs.
  • Borders between regions are called decision boundaries.
SLIDE 4 (4/44): RECOGNITION (DECISION) STRATEGY

    fi(x) > fj(x) for x ∈ class i, i ≠ j.

[Figure: discriminant functions f1(x), f2(x), . . . , f|K|(x) feed a max block that maps an observation from X to a class index in K.]

Strategy:  j = argmax_j fj(x).

SLIDE 5 (5/44): WHY ARE LINEAR CLASSIFIERS IMPORTANT? (1)

Theoretical importance – the Bayesian decision rule decomposes the space of probabilities into convex cones.

[Figure: the space with coordinates pX|1(x), pX|2(x), pX|3(x), decomposed into convex cones, one per class.]

SLIDE 6 (6/44): WHY ARE LINEAR CLASSIFIERS IMPORTANT? (2)

  • For some statistical models, the Bayesian or non-Bayesian strategy is implemented by a linear discriminant function.
  • The capacity (VC dimension) of linear strategies in an n-dimensional space is n + 2. The learning task is therefore correct (well posed): a strategy tuned on a finite training multiset does not differ much from the correct strategy found for the underlying statistical model.
  • There are efficient learning algorithms for linear classifiers.
  • Some non-linear discriminant functions can be implemented as linear ones after straightening the feature space.

SLIDE 7 (7/44): LINEAR DISCRIMINANT FUNCTION q(x)

  • fj(x) = ⟨wj, x⟩ + bj, where ⟨·, ·⟩ denotes a scalar product.
  • The strategy j = argmax_j fj(x) divides X into |K| convex regions.

[Figure: the plane divided into six convex regions labelled k = 1, . . . , k = 6.]
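A minimal sketch of this strategy in Python (NumPy assumed; the weight matrix W, biases b, and the sample point are illustrative, not taken from the lecture):

```python
import numpy as np

def linear_classify(x, W, b):
    """Multi-class linear strategy: j = argmax_j <w_j, x> + b_j.

    W : (|K|, n) matrix whose rows are the weight vectors w_j.
    b : (|K|,) vector of biases b_j.
    Returns the index j of the winning discriminant function.
    """
    scores = W @ x + b          # f_j(x) = <w_j, x> + b_j for all j at once
    return int(np.argmax(scores))

# Illustrative 2-D example with three classes.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])
print(linear_classify(np.array([2.0, -0.5]), W, b))   # -> 0
```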

SLIDE 8 (8/44): DICHOTOMY, TWO CLASSES ONLY

|K| = 2, i.e. two hidden states (typically also classes):

    q(x) = k = 1, if ⟨w, x⟩ + b ≥ 0,
           k = 2, if ⟨w, x⟩ + b < 0.

[Figure: two point sets separated by the line ⟨w, x⟩ + b = 0 in the (x1, x2) plane.]
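The dichotomy reduces to a sign test. A small sketch (the weight vector, bias, and points are made up for illustration):

```python
import numpy as np

def dichotomy(x, w, b):
    """Two-class linear classifier: class 1 if <w, x> + b >= 0, else class 2."""
    return 1 if np.dot(w, x) + b >= 0 else 2

w = np.array([1.0, -2.0])   # illustrative weight vector
b = 0.5                     # illustrative bias
print(dichotomy(np.array([3.0, 1.0]), w, b))   # 3 - 2 + 0.5 =  1.5 >= 0 -> 1
print(dichotomy(np.array([0.0, 1.0]), w, b))   # 0 - 2 + 0.5 = -1.5 <  0 -> 2
```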

SLIDE 9 (9/44): LEARNING LINEAR CLASSIFIERS

The aim of learning is to estimate the classifier parameters wi, bi for all i. Learning algorithms differ by:

  • The character of the training set:
    1. A finite set consisting of individual observations and hidden states, i.e., {(x1, y1), . . . , (xL, yL)}.
    2. Infinite sets described by Gaussian distributions.
  • The learning task formulation (three formulations follow).
SLIDE 10 (10/44): 3 LEARNING TASK FORMULATIONS

  • Minimization of the empirical risk.
  • Minimization of a real risk margin (an upper bound on the true risk).
  • Generalized Anderson task.
SLIDE 11 (11/44): MINIMIZATION OF THE EMPIRICAL RISK

The true risk is approximated by the empirical risk

    Remp(q(x, Θ)) = (1/L) Σ_{i=1}^{L} W(q(xi, Θ), yi),

where W is a penalty function. Learning is based on the empirical risk minimization principle

    Θ* = argmin_Θ Remp(q(x, Θ)).

Examples of learning algorithms: Perceptron, back-propagation, etc.
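A small sketch of the empirical risk with a 0/1 penalty for the two-class linear strategy above (the data and parameters are made up for illustration):

```python
import numpy as np

def empirical_risk(X, y, w, b):
    """R_emp = (1/L) * sum_i W(q(x_i), y_i) with the 0/1 penalty W."""
    predictions = np.where(X @ w + b >= 0, 1, 2)   # dichotomy applied to every row
    return np.mean(predictions != y)               # fraction of misclassified samples

X = np.array([[2.0, 1.0], [0.5, -1.0], [-1.0, 0.2], [-2.0, -0.5]])
y = np.array([1, 1, 2, 2])
print(empirical_risk(X, y, w=np.array([1.0, 0.0]), b=0.0))   # 0.0: all four correct
```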

SLIDE 12 (12/44): OVERFITTING AND UNDERFITTING

  • How rich a class of strategies q(x, Θ) should be used?
  • The problem of generalization: a small empirical risk Remp does not imply a small true (expected) risk R.

[Figure: three fits to the same data, labelled underfit, fit, overfit.]
SLIDE 13 (13/44): MINIMIZATION OF A REAL RISK MARGIN (1)

The risk is

    R(f) = ∫_{x,k} W(q(x, Θ), k) p(x, k) dx dk,

where
  • W(q(x, Θ), k) is a loss function,
  • p(x, k) is not known.
SLIDE 14 (14/44): STRUCTURAL RISK MINIMIZATION PRINCIPLE

    R(f) = ∫_{x,k} W(q(x, Θ), k) p(x, k) dx dk,

where p(x, k) is not known. The margin (bound) according to Vapnik and Chervonenkis is

    R(f) ≤ Remp(f) + Rstr(h, 1/L),

where h is the VC dimension (capacity) of the class of strategies q.

SLIDE 15 (15/44): MINIMIZATION OF A REAL RISK MARGIN (2)

  • For linear discriminant functions the VC dimension (capacity) is bounded by

        h ≤ R²/m² + 1,

    where R is the radius of the smallest ball containing the data and m is the margin (both shown in the original figure).

  • Examples of learning algorithms: SVM or ε-Kozinec, which maximize the margin (a code sketch of the criterion follows):

        (w*, b*) = argmax_{w,b} min( min_{x∈X1} (⟨w, x⟩ + b)/|w| , min_{x∈X2} −(⟨w, x⟩ + b)/|w| ).
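A sketch that evaluates this margin criterion for one candidate (w, b) on two finite point sets (the sets and the candidate are illustrative):

```python
import numpy as np

def margin(w, b, X1, X2):
    """min over both classes of the signed distance to the hyperplane <w, x> + b = 0."""
    norm = np.linalg.norm(w)
    d1 = np.min((X1 @ w + b) / norm)    # distances of class-1 points (should be positive)
    d2 = np.min(-(X2 @ w + b) / norm)   # distances of class-2 points (should be positive)
    return min(d1, d2)

X1 = np.array([[2.0, 2.0], [3.0, 1.0]])
X2 = np.array([[-1.0, -2.0], [-2.0, 0.0]])
print(margin(np.array([1.0, 1.0]), 0.0, X1, X2))   # ~1.414 for this toy data
```

The SVM and ε-Kozinec algorithms then search for the (w, b) maximizing this quantity.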
SLIDE 16 (16/44): TOWARDS GENERALIZED ANDERSON TASK

  • X is an observation space (a multi-dimensional linear space).
  • x ∈ X is a single observation.
  • k ∈ K = {1, 2} is a hidden state.
  • It is assumed that pX|K(x | k) is a multi-dimensional Gaussian distribution.
  • The mathematical expectation µk and the covariance matrix σk, k = 1, 2, of these probability distributions are not known.

SLIDE 17 (17/44): TOWARDS GAndersonT (2)

  • It is known that the parameters (µ1, σ1) belong to a certain finite set of parameters {(µj, σj) | j ∈ J1}.
  • Similarly, (µ2, σ2) are unknown parameters belonging to the finite set {(µj, σj) | j ∈ J2}.
  • Superscript and subscript indices are used.
  • µ1 and σ1 (subscripts) denote the real, but unknown, parameters of an object that is in the first state.
  • The parameters (µj, σj), with superscript j, are the possible value pairs which the parameters can assume.
SLIDE 18 (18/44): GAndersonT ILLUSTRATED IN 2D SPACE

Illustration of the statistical model, i.e., a mixture of Gaussians.

[Figure: Gaussian components of the two classes k = 1 and k = 2 in the plane, separated by a linear strategy q.]

The weights of the Gaussian components are unknown.

SLIDE 19 (19/44): GAndersonT, FORMULATION (1)

  • If there is a non-random hidden parameter, then the Bayesian approach cannot be used: p(x | k, z) is influenced by a non-random intervention z.
  • Divide and conquer approach: the strategy partitions the observation space into sets X(k), k ∈ K,

        X = X(1) ∪ . . . ∪ X(|K|).

  • The probability of wrong classification for given k and z is

        ε(k, z) = Σ_{x ∉ X(k)} p(x | k, z).

SLIDE 20 (20/44): GAndersonT, FORMULATION (2)

  • The learning (minimax) task is

        (X*(k), k ∈ K) = argmin_{X(k), k∈K} max_k max_z ε(k, z).

  • Particularly for the Generalized Anderson task:
    • 2 hidden states only, K = {1, 2}.
    • Separation of X(k), k ∈ {1, 2}, by the hyperplane ⟨w, x⟩ + b = 0.
    • p(x | k, z) is a Gaussian distribution.
SLIDE 21 (21/44): GAndersonT, FORMULATION (3)

  • A strategy q : X → {1, 2} is sought which minimises

        max_{j∈J1∪J2} ε(j, µj, σj, q),

    where ε(j, µj, σj, q) is the probability that a Gaussian random vector x with mathematical expectation µj and covariance matrix σj satisfies either q(x) = 1 for j ∈ J2, or q(x) = 2 for j ∈ J1.

  • Additional constraint (linearity of the classifier):

        q(x) = 1, if ⟨w, x⟩ > b,
               2, if ⟨w, x⟩ < b.

SLIDE 22 (22/44): SIMPLIFICATION OF GAndersonT (1)

Anderson–Bahadur task

  • A special case of GAndersonT, solved in 1962.
  • For |J1| = |J2| = 1.
  • Anderson, T. and Bahadur, R.: Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics, 1962, 33:420–431.

SLIDE 23 (23/44): SIMPLIFICATION OF GAndersonT (2)

Optimal separation of finite sets of points

  • A finite set of observations X = {x1, x2, . . . , xn} has to be decomposed into X1 and X2, X1 ∩ X2 = ∅.
  • ⟨w, x⟩ > b for x ∈ X1;  ⟨w, x⟩ < b for x ∈ X2,
  • under the condition

        argmax_{w,b} min( min_{x∈X1} (⟨w, x⟩ − b)/|w| , min_{x∈X2} (b − ⟨w, x⟩)/|w| ).
SLIDE 24 (24/44): SIMPLIFICATION OF GAndersonT (3)

Simple separation of finite sets of points

  • A finite set of observations X = {x1, x2, . . . , xn} has to be decomposed into X1 and X2, X1 ∩ X2 = ∅.
  • ⟨w, x⟩ > b for x ∈ X1;  ⟨w, x⟩ < b for x ∈ X2.
  • The dividing hyperplane can lie anywhere between the sets X1 and X2.

SLIDE 25 (25/44): GOOD NEWS FOR GAndersonT

  • The minimised optimisation criterion will turn out to be unimodal.
  • Steepest-descent optimisation methods can therefore be used, and the optimum can be found without getting stuck in local extremes.
  • Bad news: the minimised unimodal criterion is neither convex nor differentiable. Therefore, neither computing the gradient nor requiring the gradient to vanish at the minimum can be applied directly.
  • Convex optimisation techniques allow the problem to be solved.
SLIDE 26 (26/44): EQUIVALENT FORMULATION OF GAndersonT

Without loss of generality, the recognition strategy based on comparing the value of the linear function ⟨w, x⟩ with the threshold b,

    ⟨w, x⟩ + b > 0,

can be replaced by an equivalent strategy that decides according to the sign of the linear function ⟨w′, x′⟩,

    ⟨w′, x′⟩ > 0.

SLIDE 27 (27/44): ORIGINAL FORMULATION

    ⟨w, xj⟩ + b > 0, j ∈ J1,
    ⟨w, xj⟩ + b < 0, j ∈ J2.

[Figure: the space X^n with the hyperplane ⟨w, x⟩ + b = 0.]

SLIDE 28 (28/44): EQUIVALENT FORMULATION, through the origin

Embedding into an (n + 1)-dimensional space:

    w′ = (w1, w2, . . . , wn, −b),
    x′ = (x1, x2, . . . , xn, 1),

    ⟨w′, x′j⟩ > 0, j ∈ J1,
    ⟨w′, x′j⟩ < 0, j ∈ J2.

[Figure: the space X^{n+1} with the hyperplane ⟨w′, x′⟩ = 0 passing through the origin.]
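A minimal sketch of this embedding (the helper names and the numbers are my own):

```python
import numpy as np

def embed_x(x):
    """x' = (x_1, ..., x_n, 1): append a constant 1 to the observation."""
    return np.append(x, 1.0)

def embed_w(w, b):
    """w' = (w_1, ..., w_n, -b): absorb the threshold into the weight vector."""
    return np.append(w, -b)

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([3.0, 1.0])
# <w, x> - b  equals  <w', x'>, so the threshold test becomes a sign test through the origin.
print(np.dot(w, x) - b, np.dot(embed_w(w, b), embed_x(x)))   # both 0.5
```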

SLIDE 29 (29/44): EQUIVALENT FORMULATION, mirrored

    µ′j = µj,   for j ∈ J1,
    µ′j = −µj,  for j ∈ J2,

    w′ = (w1, w2, . . . , wn, −b),
    x′ = (x1, x2, . . . , xn, 1).

Mirroring the second class through the origin turns the two sets of constraints ⟨w′, x′j⟩ > 0, j ∈ J1, and ⟨w′, x′j⟩ < 0, j ∈ J2, into a single set of inequalities ⟨w′, x′⟩ > 0.

[Figure: the space X^{n+1} with the hyperplane ⟨w′, x′⟩ = 0.]

SLIDE 30 (30/44): GAndersonT, REFORMULATION

For the ensemble {(µj, σj), j ∈ J}, a non-zero vector w* has to be sought which minimises the number max_j εj(w),

    w* = argmin_w max_j εj(w),

where εj(w) is the probability that a random Gaussian vector x with mathematical expectation µj and covariance matrix σj satisfies the inequality ⟨w, x⟩ ≤ 0, i.e., that it is classified into the incorrect class.

SLIDE 31 (31/44): TRAINING ALGORITHM 1 – PERCEPTRON

The algorithm creates a sequence of vectors w1, . . . , wt. Perceptron algorithm (Rosenblatt 1962; a code sketch follows the steps):

  1. w1 = 0.
  2. A wrongly classified observation xt is sought, i.e., one with ⟨wt, xj⟩ ≤ 0, j ∈ J.
  3. If there is no wrongly classified observation, the algorithm finishes; otherwise
         wt+1 = wt + xt.
  4. Go to step 2.
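A minimal sketch of the Perceptron as stated above, for data already embedded and mirrored so that ⟨w, x⟩ > 0 is required for every observation (the toy data are illustrative):

```python
import numpy as np

def perceptron(X, max_iter=1000):
    """Perceptron: find w with <w, x_j> > 0 for all rows x_j of X.

    X is assumed to be already embedded (homogeneous coordinate) and mirrored,
    so every row must end up on the positive side of the hyperplane.
    """
    w = np.zeros(X.shape[1])                            # step 1: w_1 = 0
    for _ in range(max_iter):
        wrong = [x for x in X if np.dot(w, x) <= 0]     # step 2: misclassified points
        if not wrong:                                   # step 3: none left -> finished
            return w
        w = w + wrong[0]                                # step 3: w_{t+1} = w_t + x_t
    raise RuntimeError("not separated within max_iter iterations")

# Toy, linearly separable data (class-2 points assumed already mirrored).
X = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.5, 0.5, -1.0]])
print(perceptron(X))   # -> [2. 1. 1.] after one update from zero
```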
SLIDE 32 (32/44): PERCEPTRON, PICTORIAL ILLUSTRATION

[Figure: one Perceptron update, showing wt, the misclassified observation xt, the new vector wt+1 = wt + xt, and the hyperplane ⟨wt, x⟩ = 0.]

SLIDE 33 (33/44): NOVIKOFF THEOREM

If the data are linearly separable, then there exists a number t* ≤ D²/m² such that the vector wt* satisfies the inequality ⟨wt*, xj⟩ > 0 for each j ∈ J.

[Figure: illustration of D (bound on the norm of the observations) and m (margin).]

SLIDE 34 (34/44): THE CLOSEST POINT TO THE CONVEX HULL

The optimal separation by a hyperplane

    w* = argmax_w min_j ⟨w/|w|, xj⟩                (1)

can be converted into seeking the point of the convex hull X̄ (denoted by the overline) closest to the origin,

    x* = argmin_{x ∈ X̄} |x|.

It holds that x* also solves problem (1).

SLIDE 35 (35/44): CONVEX HULL, ILLUSTRATION

    min_j ⟨w/|w|, xj⟩  ≤  m  ≤  |w| ,   w ∈ X̄,

i.e., for any w from the convex hull the left-hand side is a lower bound and |w| is an upper bound on the margin m.

[Figure: the convex hull X̄ of the data with the optimal vector w*, |w*| = m.]

SLIDE 36 (36/44): ε-SOLUTION

  • The aim is to speed up the algorithm.
  • An allowed uncertainty ε is introduced: the upper and lower bounds on the margin (previous slide) must differ by at most ε,

        |w| − min_j ⟨w/|w|, xj⟩ ≤ ε.
SLIDE 37 (37/44): TRAINING ALGORITHM 2 – KOZINEC (1973)

A code sketch follows the steps.

  1. w1 = xj, i.e., any observation.
  2. A wrongly classified observation xt is sought, i.e., one with ⟨wt, xj⟩ ≤ 0, j ∈ J.
  3. If there is no wrongly classified observation, the algorithm finishes; otherwise
         wt+1 = (1 − k) · wt + k · xt ,  where k = argmin_k |(1 − k) · wt + k · xt| , k ∈ ℝ.
  4. Go to step 2.
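A minimal sketch of the Kozinec algorithm as stated, using the closed-form minimizer of |(1 − k)·wt + k·xt| over k (the toy data are illustrative):

```python
import numpy as np

def kozinec(X, max_iter=1000):
    """Kozinec algorithm: find w with <w, x_j> > 0 for all rows x_j of X."""
    w = X[0].astype(float)                              # step 1: start from any observation
    for _ in range(max_iter):
        wrong = [x for x in X if np.dot(w, x) <= 0]     # step 2: misclassified points
        if not wrong:                                   # step 3: finished
            return w
        x = wrong[0]
        # k minimizing |(1-k) w + k x|: the point of the segment [w, x] closest to the origin.
        k = np.dot(w, w - x) / np.dot(w - x, w - x)
        w = (1.0 - k) * w + k * x
    raise RuntimeError("not separated within max_iter iterations")

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-0.5, 1.0]])
print(kozinec(X))   # -> [0.5 0.5] after one update
```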
SLIDE 38 (38/44): KOZINEC, PICTORIAL ILLUSTRATION

[Figure: one Kozinec update, showing wt, the misclassified observation xt, the new vector wt+1 on the segment between wt and xt, and the hyperplane ⟨wt, x⟩ = 0.]

SLIDE 39 (39/44): KOZINEC and ε-SOLUTION

The second step of the Kozinec algorithm is modified: an observation xt is sought for which

    |wt| − min_j ⟨wt/|wt|, xj⟩ ≥ ε.

[Figure: the upper bound |wt| and the lower bound converge towards the margin m as t grows; the algorithm stops once their gap drops below ε.]
SLIDE 40 (40/44): GAndersonT, ROLE OF THE MAHALANOBIS DISTANCE

Example: J1 = {1, 2}, J2 = {3}.

    ε(µj, σj) = ∫_{⟨w,x⟩+b<0} N(µj, σj) dx = ∫_r^∞ (1/√(2π)) exp(−x²/2) dx,

so ε(µj, σj) ∼ 1/r, where r is the Mahalanobis distance of µj from the separating hyperplane.

[Figure: Gaussians N(µ1, σ1), N(µ2, σ2) in region X1 and N(µ3, σ3) in region X2, separated by a hyperplane.]
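A sketch of these two quantities for one Gaussian component (SciPy's standard-normal tail is used for the integral; the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def mahalanobis_to_hyperplane(w, b, mu, sigma):
    """Mahalanobis distance r of N(mu, sigma) from the hyperplane <w, x> + b = 0,
    and the misclassification probability eps = integral_r^inf N(0, 1) dx."""
    r = (np.dot(w, mu) + b) / np.sqrt(w @ sigma @ w)
    eps = norm.sf(r)    # Gaussian tail: P(<w, x> + b < 0) for x ~ N(mu, sigma)
    return r, eps

w, b = np.array([1.0, 1.0]), -1.0
mu = np.array([2.0, 2.0])
sigma = np.eye(2)
print(mahalanobis_to_hyperplane(w, b, mu, sigma))   # r ~ 2.12, eps ~ 0.017
```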

SLIDE 41 (41/44): GAndersonT, SOLUTION

  • Optimal solution:

        (w*, b*) = argmin_{w,b} max_{j∈J1∪J2} ε(fw,b, µj, σj)
                 = argmax_{w,b} min_{j∈J1∪J2} r(fw,b, µj, σj).

  • ε-optimal solution: w, b are sought for which

        max_{j∈J1∪J2} ε(fw,b, µj, σj) < ε0.

    This is transformed (nasty integrals are avoided; r0 is the 'size' of the ellipse corresponding to ε0 ∼ 1/r0) into

        min_{j∈J1∪J2} r(fw,b, µj, σj) > r0.

SLIDE 42 (42/44): GAndersonT, ALGORITHM, OPTIMAL

  1. The algorithm starts with a w0 such that ε(w0, µj, σj) < 0.5, i.e., ⟨w0, µj⟩ > 0, j ∈ J. If such a w0 does not exist, then finish, because ε > 0.5.
  2. Find an improving direction Δw, i.e., one for which
         min_{j∈J} r(wt + k Δw, µj, σj) > min_{j∈J} r(wt, µj, σj),  k > 0, k ∈ ℝ.
     If no such Δw is found, then finish; wt is the solution.
  3. Find the optimal step k = argmax_{k>0} min_{j∈J} r(wt + k Δw, µj, σj) and adapt the weight vector, wt+1 = wt + k Δw.
  4. Go to step 2.

(A rough code illustration follows.)
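A rough, naive illustration of the improving-direction idea, using random candidate directions instead of the directed search meant in the lecture (all numbers are made up; this is not the actual algorithm):

```python
import numpy as np

def r(w, mu, sigma):
    """Mahalanobis distance of N(mu, sigma) from the hyperplane <w, x> = 0."""
    return np.dot(w, mu) / np.sqrt(w @ sigma @ w)

def improve(w, params, trials=200, steps=np.linspace(0.01, 1.0, 50), seed=0):
    """One crude pass: try random directions dw and step sizes k, keep the best
    improvement of the criterion min_j r(w + k*dw, mu_j, sigma_j)."""
    rng = np.random.default_rng(seed)
    best_w, best_val = w, min(r(w, m, s) for m, s in params)
    for _ in range(trials):
        dw = rng.normal(size=w.shape)
        for k in steps:
            cand = w + k * dw
            val = min(r(cand, m, s) for m, s in params)
            if val > best_val:
                best_w, best_val = cand, val
    return best_w, best_val

# Two illustrative components (class-2 means assumed already mirrored).
params = [(np.array([2.0, 1.0]), np.eye(2)),
          (np.array([1.0, 3.0]), 0.5 * np.eye(2))]
w0 = np.array([1.0, 1.0])     # <w0, mu_j> > 0 for both components
print(improve(w0, params))
```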
SLIDE 43 (43/44): GAndersonT, ε-SOLUTION USING THE KOZINEC ALGORITHM

The task can be converted into a problem of separating infinite sets:

    ⟨w, x⟩ > 0, x ∈ X1(r),
    ⟨w, x⟩ < 0, x ∈ X2(r),

where X1(r) = ∪_{j∈J1} Xj(r), X2(r) = ∪_{j∈J2} Xj(r), and

    Xj(r) = { x | ⟨(x − µj), (σj)⁻¹ (x − µj)⟩ ≤ r² },

    ε(µj, σj) = ∫_r^∞ (1/√(2π)) exp(−x²/2) dx ∼ 1/r.

[Figure: the ellipsoids Xj(r) of the two classes separated by the strategy q.]
SLIDE 44 (44/44): GAndersonT, ε-SOLUTION, ALGORITHM

A sketch of the adaptation step follows the list.

  1. Calculate r0 from ε0,
         ε0 = ∫_{r0}^∞ (1/√(2π)) exp(−x²/2) dx,
     and set w1 = µj for an arbitrary j ∈ J1 ∪ J2.
  2. Find a j for which r(wt, µj, σj) < r0.
  3. If no such j exists, then finish; wt is the solution. Otherwise seek the x ∈ Xj(r0) which minimizes ⟨wt, x⟩. Adaptation (Kozinec; the Perceptron rule can also be used):
         wt+1 = (1 − k) wt + k x ,  k = argmin_{0≤k≤1} |(1 − k) wt + k x|.
     Go to step 2.