SLIDE 1

MIRA, SVM, k-NN

Lirong Xia

SLIDE 2

Linear Classifiers (perceptrons)

  • Inputs are feature values
  • Each feature has a weight
  • Sum is the activation
  • If the activation is:
  • Positive: output +1
  • Negative: output -1

$\text{activation}_w(x) = \sum_i w_i \, f_i(x) = w \cdot f(x)$
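A minimal sketch of this activation and decision rule, assuming features and weights are stored as plain Python dicts keyed by feature name (the names and data layout are illustrative, not from the slides):

    def activation(weights, features):
        """Compute w . f(x): the sum of weight * feature value over all features."""
        return sum(weights.get(name, 0.0) * value for name, value in features.items())

    def classify(weights, features):
        """Binary decision rule: +1 if the activation is nonnegative, else -1."""
        return 1 if activation(weights, features) >= 0 else -1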

SLIDE 3

Classification: Weights

  • Binary case: compare features to a weight vector
  • Learning: figure out the weight vector from examples
SLIDE 4

Binary Decision Rule

  • In the space of feature vectors
  • Examples are points
  • Any weight vector defines a hyperplane (the decision boundary)
  • One side corresponds to Y = +1
  • Other corresponds to Y = -1
SLIDE 5

Learning: Binary Perceptron

  • Start with weights = 0
  • For each training instance:
  • Classify with current weights
  • If correct (i.e. y=y*), no change!
  • If wrong: adjust the weight vector by adding or subtracting the feature vector; subtract if y* is -1

$y = \begin{cases} +1 & \text{if } w \cdot f(x) \ge 0 \\ -1 & \text{if } w \cdot f(x) < 0 \end{cases}$

Update on a mistake: $w = w + y^* \cdot f(x)$
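A minimal sketch of the binary perceptron loop, assuming numpy feature vectors and labels in {+1, -1} (the variable names and fixed number of passes are illustrative):

    import numpy as np

    def train_binary_perceptron(examples, num_features, passes=10):
        """examples: list of (feature_vector, label) pairs with label in {+1, -1}."""
        w = np.zeros(num_features)                    # start with weights = 0
        for _ in range(passes):
            for f, y_star in examples:
                y = 1 if np.dot(w, f) >= 0 else -1    # classify with current weights
                if y != y_star:                       # if wrong, adjust the weight vector
                    w = w + y_star * f                # add f if y* = +1, subtract if y* = -1
        return w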

SLIDE 6

Multiclass Decision Rule

  • If we have multiple classes:
  • A weight vector for each class: $w_y$
  • Score (activation) of a class y: $w_y \cdot f(x)$
  • Prediction: highest score wins, $y = \arg\max_y \; w_y \cdot f(x)$

Binary = multiclass where the negative class has weight zero
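A minimal sketch of this decision rule, assuming one numpy weight vector per class stored in a dict (names illustrative):

    import numpy as np

    def predict(weights_by_class, f):
        """Return the class y whose weight vector gives the highest score w_y . f(x)."""
        return max(weights_by_class, key=lambda y: np.dot(weights_by_class[y], f))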

SLIDE 7

Learning: Multiclass Perceptron

  • Start with all weights = 0
  • Pick up training examples one by one
  • Predict with current weights
  • If correct, no change!
  • If wrong: lower the score of the wrong answer, raise the score of the right answer

Prediction: $y = \arg\max_y \; w_y \cdot f(x) = \arg\max_y \sum_i w_{y,i} \, f_i(x)$

Update on a mistake (guessed y, correct answer y*): $w_y = w_y - f(x)$, $\quad w_{y^*} = w_{y^*} + f(x)$
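A minimal sketch of the multiclass perceptron loop, assuming numpy feature vectors and a fixed list of classes (names and pass count illustrative):

    import numpy as np

    def train_multiclass_perceptron(examples, classes, num_features, passes=10):
        """examples: list of (feature_vector, true_class) pairs."""
        w = {y: np.zeros(num_features) for y in classes}          # start with all weights = 0
        for _ in range(passes):
            for f, y_star in examples:
                y = max(classes, key=lambda c: np.dot(w[c], f))   # predict with current weights
                if y != y_star:
                    w[y] = w[y] - f                               # lower the wrong answer's score
                    w[y_star] = w[y_star] + f                     # raise the right answer's score
        return w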

SLIDE 8

Today

  • Fixing the Perceptron: MIRA
  • Support Vector Machines
  • k-nearest neighbor (KNN)
SLIDE 9

Properties of Perceptrons

  • Separability: some setting of the parameters gets the training set perfectly correct
  • Convergence: if the training set is separable, the perceptron will eventually converge (binary case)

SLIDE 10

Examples: Perceptron

  • Non-Separable Case
SLIDE 11

Problems with the Perceptron

  • Noise: if the data isn’t separable, weights might thrash
  • Averaging weight vectors over time can help (averaged perceptron); see the sketch below
  • Mediocre generalization: finds a “barely” separating solution
  • Overtraining: test / held-out accuracy usually rises, then falls
  • Overtraining is a kind of overfitting
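A minimal sketch of the averaged-perceptron idea, assuming the binary setup from earlier; averaging the weight vector after every example is one common variant (the exact bookkeeping differs across implementations):

    import numpy as np

    def train_averaged_perceptron(examples, num_features, passes=10):
        """Binary perceptron that returns the average weight vector over all steps."""
        w = np.zeros(num_features)
        w_sum = np.zeros(num_features)        # running sum of the weight vector over time
        steps = 0
        for _ in range(passes):
            for f, y_star in examples:
                y = 1 if np.dot(w, f) >= 0 else -1
                if y != y_star:
                    w = w + y_star * f
                w_sum += w                    # accumulate after every example
                steps += 1
        return w_sum / steps                  # averaged weights are less prone to thrashing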
SLIDE 12

Fixing the Perceptron

  • Idea: adjust the weight update to mitigate these effects
  • MIRA*: choose an update size that fixes the current mistake
  • …but minimizes the change to w
  • The +1 helps to generalize

$\min_{w} \; \frac{1}{2} \sum_y \| w_y - w'_y \|^2$   subject to   $w_{y^*} \cdot f(x) \ge w_y \cdot f(x) + 1$

*Margin Infused Relaxed Algorithm

Guessed y instead of y* on example x with features f(x):

$w_y = w'_y - \tau \, f(x)$, $\quad w_{y^*} = w'_{y^*} + \tau \, f(x)$

SLIDE 13

Minimum Correcting Update

Substituting the updates $w_y = w'_y - \tau f(x)$ and $w_{y^*} = w'_{y^*} + \tau f(x)$ into

$\min_{w} \; \frac{1}{2} \sum_y \| w_y - w'_y \|^2$   subject to   $w_{y^*} \cdot f \ge w_y \cdot f + 1$

gives

$\min_{\tau} \; \| \tau f \|^2$   subject to   $(w'_{y^*} + \tau f) \cdot f \ge (w'_y - \tau f) \cdot f + 1$

τ = 0 cannot satisfy the constraint (otherwise no error would have been made), so the minimum is where the constraint holds with equality:

$\tau = \frac{(w'_y - w'_{y^*}) \cdot f + 1}{2 \, f \cdot f}$

SLIDE 14

Maximum Step Size

  • In practice, it’s also bad to make updates that are too large
  • Example may be labeled incorrectly
  • You may not have enough features
  • Solution: cap the maximum possible value of τ with some constant C
  • Corresponds to an optimization that assumes non-separable data
  • Usually converges faster than the perceptron
  • Usually better, especially on noisy data

$\tau^* = \min\left( \frac{(w'_y - w'_{y^*}) \cdot f + 1}{2 \, f \cdot f}, \; C \right)$
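A minimal sketch of one capped MIRA update, assuming numpy feature vectors, a dict of per-class weight vectors, and an illustrative cap C:

    import numpy as np

    def mira_update(w, f, y_star, C=1.0):
        """One MIRA step: w maps class -> weight vector, f is f(x), y_star is the true class."""
        y = max(w, key=lambda c: np.dot(w[c], f))       # predict with current weights
        if y == y_star:
            return w                                    # correct: no change
        # minimum correcting step size, capped at C
        tau = (np.dot(w[y] - w[y_star], f) + 1.0) / (2.0 * np.dot(f, f))
        tau = min(tau, C)
        w[y] = w[y] - tau * f                           # lower the wrong answer's score
        w[y_star] = w[y_star] + tau * f                 # raise the right answer's score
        return w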

SLIDE 15

Outline

  • Fixing the Perceptron: MIRA
  • Support Vector Machines
  • k-nearest neighbor (KNN)
SLIDE 16

Linear Separators

  • Which of these linear separators is optimal?
SLIDE 17

Support Vector Machines

  • Maximizing the margin: good according to intuition, theory, practice
  • Only support vectors matter; other training examples are ignorable
  • Support vector machines (SVMs) find the separator with maximum margin
  • Basically, SVMs are MIRA where you optimize over all examples at once

MIRA (one example at a time):

$\min_{w} \; \frac{1}{2} \sum_y \| w_y - w'_y \|^2$   subject to   $w_{y^*} \cdot f(x_i) \ge w_y \cdot f(x_i) + 1$

SVM (all examples at once):

$\min_{w} \; \frac{1}{2} \sum_y \| w_y \|^2$   subject to   $\forall i, y: \; w_{y_i^*} \cdot f(x_i) \ge w_y \cdot f(x_i) + 1$
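For a runnable example, one common option is scikit-learn's linear SVM. This is a sketch assuming scikit-learn is installed; it trains the standard hinge-loss formulation rather than the exact multiclass constraints written above, and the tiny dataset is made up purely for illustration:

    import numpy as np
    from sklearn.svm import LinearSVC

    # X: one feature vector per row; y: class labels (toy data for illustration only)
    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
    y = np.array([1, 1, 1, 0])

    clf = LinearSVC(C=1.0)          # C limits how strongly margin violations are penalized
    clf.fit(X, y)
    print(clf.predict([[0.5, 0.9]]))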

SLIDE 18

Classification: Comparison

  • Naive Bayes:
  • Builds a model of the training data
  • Gives prediction probabilities
  • Strong assumptions about feature independence
  • One pass through the data (counting)
  • Perceptrons / MIRA:
  • Makes fewer assumptions about the data
  • Mistake-driven learning
  • Multiple passes through data (prediction)
  • Often more accurate
SLIDE 19

Outline

  • Fixing the Perceptron: MIRA
  • Support Vector Machines
  • k-nearest neighbor (KNN)
SLIDE 20

Case-Based Reasoning

  • Similarity for classification
  • Case-based reasoning
  • Predict an instance’s label using similar instances
  • Nearest-neighbor classification
  • 1-NN: copy the label of the most similar data point
  • k-NN: let the k nearest neighbors vote (have to devise a weighting scheme)

  • Key issue: how to define similarity
  • Trade-off:
  • Small k gives relevant neighbors
  • Large k gives smoother functions

[Figure: generated data vs. 1-NN decision regions]
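A minimal k-NN sketch, assuming numpy feature vectors and a plain (unweighted) majority vote among the k most similar training points; the dot-product similarity is just one possible choice:

    import numpy as np
    from collections import Counter

    def knn_classify(train, x, k=3, sim=lambda a, b: np.dot(a, b)):
        """train: list of (feature_vector, label) pairs. Vote among the k most similar points."""
        neighbors = sorted(train, key=lambda ex: sim(ex[0], x), reverse=True)[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]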

SLIDE 21

Parametric / Non-parametric

  • Parametric models:
  • Fixed set of parameters
  • More data means better settings
  • Non-parametric models:
  • Complexity of the classifier increases with the amount of data

  • Better in the limit, often worse in the non-limit
  • (K)NN is non-parametric
SLIDE 22

Nearest-Neighbor Classification

  • Nearest neighbor for digits:
  • Take new image
  • Compare to all training images
  • Assign based on closest example
  • Encoding: image is vector of intensities:
  • What’s the similarity function?
  • Dot product of two image vectors?
  • Usually normalize vectors so ||x|| = 1
  • min = 0 (when?), max = 1 (when?)

Example intensity vector: x = (0.0, 0.0, 0.3, 0.8, 0.7, 0.1, 0.0)

$\text{sim}(x, x') = x \cdot x' = \sum_i x_i \, x'_i$

SLIDE 23

Basic Similarity

  • Many similarities are based on feature dot products: $\text{sim}(x, x') = f(x) \cdot f(x') = \sum_i f_i(x) \, f_i(x')$
  • If the features are just the pixels: $\text{sim}(x, x') = x \cdot x' = \sum_i x_i \, x'_i$
  • Note: not all similarities are of this form
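A minimal sketch of this dot-product similarity with the normalization mentioned on the previous slide, assuming images are numpy intensity vectors (so the value lies between 0 and 1 for nonnegative intensities):

    import numpy as np

    def normalize(x):
        """Scale a vector so that ||x|| = 1 (leave all-zero vectors unchanged)."""
        norm = np.linalg.norm(x)
        return x / norm if norm > 0 else x

    def sim(x, x_prime):
        """Dot-product similarity of two normalized intensity vectors."""
        return float(np.dot(normalize(x), normalize(x_prime)))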

SLIDE 24

Invariant Metrics

  • Better distances use knowledge about vision
  • Invariant metrics:
  • Similarities are invariant under certain transformations
  • Rotation, scaling, translation, stroke-thickness…
  • E.g.:
  • 16×16 = 256 pixels; a point in 256-dim space
  • Small similarity in R^256 (why?)
  • How to incorporate invariance into similarities?

This and next few slides adapted from Xiao Hu, UIUC

SLIDE 25

Invariant Metrics

  • Each example is now a curve in R^256 (the set of all its rotations)
  • Rotation-invariant similarity: $s'(x, x') = \max s(r(x), r(x'))$, maximizing over rotations r
  • E.g., the highest similarity between the images’ rotation lines
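A minimal sketch of a brute-force rotation-invariant similarity, assuming 16×16 grayscale images as 2-D numpy arrays and using scipy's image rotation; the angle grid, interpolation order, and normalization are illustrative choices, not from the slides:

    import numpy as np
    from scipy.ndimage import rotate

    def dot_sim(a, b):
        """Normalized dot-product similarity of two images flattened to vectors."""
        a, b = a.ravel(), b.ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom > 0 else 0.0

    def rotation_invariant_sim(x, x_prime, angles=range(-30, 31, 5)):
        """Approximate s'(x, x') = max over rotations r of s(r(x), r(x'))."""
        return max(dot_sim(rotate(x, a, reshape=False, order=1),
                           rotate(x_prime, b, reshape=False, order=1))
                   for a in angles for b in angles)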