SLIDE 1

Classification and Pattern Recognition

Léon Bottou

NEC Labs America

COS 424 – 2/23/2010

SLIDE 2

The machine learning mix and match

Goals
– Classification, clustering, regression, other.

Representation
– Parametric vs. kernels vs. nonparametric
– Probabilistic vs. nonprobabilistic
– Linear vs. nonlinear
– Deep vs. shallow

Capacity Control
– Explicit: architecture, feature selection
– Explicit: regularization, priors
– Implicit: approximate optimization
– Implicit: Bayesian averaging, ensembles

Operational Considerations
– Loss functions
– Budget constraints
– Online vs. offline

Computational Considerations
– Exact algorithms for small datasets.
– Stochastic algorithms for big datasets.
– Parallel algorithms.

SLIDE 3

Topics for today’s lecture

(The machine learning mix-and-match chart from the previous slide, shown again to situate today's topics.)

SLIDE 4

Summary

  1. Bayesian decision theory
  2. Nearest neighbours
  3. Parametric classifiers
  4. Surrogate loss functions
  5. ROC curve
  6. Multiclass and multilabel problems

SLIDE 5

Classification a.k.a. Pattern recognition

Association between patterns x ∈ X and classes y ∈ Y.

  • The pattern space X is unspecified. For instance, X = Rd.
  • The class space Y is an unordered finite set.

Examples:

  • Binary classification (Y = {±1}).

Fraud detection, anomaly detection,. . .

  • Multiclass classification: (Y = {C1, C2, . . . CM})

Object recognition, speaker identification, face recognition,. . .

  • Multilabel classification: (Y is a power set).

Document topic recognition,. . .

  • Sequence recognition: (Y contains sequences).

Speech recognition, signal identification, . . . .

SLIDE 6

Probabilistic model

Patterns and classes are represented by random variables X and Y.

  • P(X, Y) = P(X) P(Y|X) = P(Y) P(X|Y)

SLIDE 7

Bayes decision theory

Consider a classifier x ∈ X → f(x) ∈ Y. Maximize the probability of a correct answer:

  P{f(X) = Y} = ∫ 1I(f(x) = y) dP(x, y)
              = ∫ Σ_{y∈Y} 1I(f(x) = y) P{Y = y | X = x} dP(x)
              = ∫ P{Y = f(x) | X = x} dP(x)

Bayes optimal decision rule:  f*(x) = arg max_{y∈Y} P{Y = y | X = x}

Bayes optimal error rate:  B = 1 − ∫ max_{y∈Y} P{Y = y | X = x} dP(x).
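A tiny numeric illustration of these formulas (the joint distribution below is invented): enumerate P(X, Y) over a finite pattern space and read off the Bayes rule and the Bayes error.

```python
import numpy as np

# Joint distribution P(X = x, Y = y): rows index three pattern values,
# columns index two classes. Numbers invented for illustration.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.20],
                  [0.05, 0.30]])

posterior = joint / joint.sum(axis=1, keepdims=True)   # P(Y = y | X = x)
f_star = posterior.argmax(axis=1)                      # Bayes rule f*(x) per x
B = 1.0 - joint.max(axis=1).sum()                      # B = 1 - sum_x max_y P(x, y)

print(f_star)   # [0 1 1]: optimal class for each pattern value
print(B)        # 0.20 (up to float rounding)
```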

SLIDE 8

Bayes optimal decision rule

Comparing class densities py(x) scaled by the class priors Py = P {Y = y}:

[Figure: the two scaled densities P_y p_y(x); the hatched area represents the Bayes optimal error rate.]

SLIDE 9

How to build a classifier from data

Given a finite set of training examples {(x₁, y₁), . . . , (x_n, y_n)}?

  • Estimating probabilities:

– Find a plausible probability distribution (next lecture).
– Compute or approximate the optimal Bayes classifier.

  • Minimize empirical error:

– Choose a parametrized family of classification functions a priori.
– Pick one that minimizes the observed error rate.

  • Nearest neighbours:

– Determine class of x on the basis of the closest example(s).

SLIDE 10

Nearest neighbours

Let d(x, x′) be a distance on the patterns.

Nearest neighbour rule (1NN)
– Give x the class of the closest training example.
– f_nn(x) = y_nn(x) with nn(x) = arg min_i d(x, x_i).

K-Nearest neighbours rule (kNN)

– Give x the most frequent class among the K closest training examples.

K-Nearest neighbours variants

– Weighted votes (according to the distances); a brute-force sketch follows.
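A minimal sketch of the kNN rule above, assuming patterns are rows of a NumPy array; illustration only, not the lecture's code.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Return the most frequent class among the k nearest training examples."""
    d = np.sum((X_train - x) ** 2, axis=1)   # squared distances (no sqrt needed)
    nearest = np.argsort(d)[:k]              # indices of the k closest examples
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, +1, +1])
print(knn_classify(np.array([0.8, 0.9]), X, y, k=3))   # +1
```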

SLIDE 11

Voronoi tessellation

[Figures: Voronoi cells for the Euclidean distance in the plane and for the cosine distance on the sphere.]
– 1NN: piecewise constant classifier defined on the Voronoi cells.
– kNN: same, but with smaller cells and additional constraints.

SLIDE 12

1NN and Optimal Bayes Error

Theorem (Cover & Hart, 1967): Assume η_y(x) = P{Y = y | X = x} is continuous. When n → ∞,

  B ≤ P{f_nn(X) ≠ Y} ≤ 2B.

Easy proof when there are only two classes. Let η(x) = P{Y = +1 | X = x} and let x* denote the nearest neighbour of x, so that x* → x as n → ∞:

– B = ∫ min(η(x), 1 − η(x)) dP(x)
– P{f_nn(X) ≠ Y} = ∫ [ η(x)(1 − η(x*)) + (1 − η(x)) η(x*) ] dP(x)
                 ≈ ∫ 2 η(x)(1 − η(x)) dP(x)   (by continuity, η(x*) → η(x))
                 ≤ ∫ 2 min(η(x), 1 − η(x)) dP(x) = 2B.

SLIDE 13

1NN versus kNN

Using more neighbours
– Comes closer to the Bayes rule in the limit.
– Needs more examples to approach the condition η(x_knn(x)) ≈ η(x).

[Plot: error of the 1-nn, 3-nn, 5-nn, 7-nn, and 51-nn rules compared with the Bayes error ("Bayes") and twice the Bayes error ("Bayes*2").]

K is a capacity parameter – to be determined using a validation set.
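One way to pick K on a validation set, sketched here with scikit-learn's KNeighborsClassifier purely for convenience; any kNN implementation would do.

```python
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 51)):
    """Return the K with the lowest validation error (highest accuracy)."""
    def val_accuracy(k):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        return clf.score(X_val, y_val)       # accuracy on the validation set
    return max(candidates, key=val_accuracy)
```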

SLIDE 14

Computation

Straightforward implementation
– Computing f(x) requires n distance computations.
– (−) Grows with the number of examples.
– (+) Embarrassingly parallelizable.

Data structures to speed up the search: K-D trees
– (+) Very effective in low dimension.
– (−) Nearly useless in high dimension.

Shortcutting the computation of distances
– Stop computing as soon as a distance becomes non-competitive.
– Use the triangle inequality d(x, x_i) ≥ |d(x, x′) − d(x_i, x′)|:
– Pick r well spread patterns x_(1) . . . x_(r).
– Precompute d(x_i, x_(j)) for i = 1 . . . n and j = 1 . . . r.
– Lower bound: d(x, x_i) ≥ max_{j=1...r} |d(x, x_(j)) − d(x_i, x_(j))|.
– Shortcut if the lower bound is not competitive (sketched below).
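A sketch of the shortcut strategy under Euclidean distance; the pivot patterns stand in for the "well spread" x_(1) . . . x_(r) and are chosen arbitrarily here.

```python
import numpy as np

def nn_with_shortcut(x, X_train, pivots):
    """1-NN search that skips examples whose lower bound is not competitive."""
    d_x_pivots = np.linalg.norm(pivots - x, axis=1)           # d(x, x_(j))
    d_train_pivots = np.linalg.norm(
        X_train[:, None, :] - pivots[None, :, :], axis=2)     # d(x_i, x_(j)), precomputable
    best_i, best_d = -1, np.inf
    for i in range(len(X_train)):
        # Triangle inequality: d(x, x_i) >= max_j |d(x, x_(j)) - d(x_i, x_(j))|
        lower = np.max(np.abs(d_x_pivots - d_train_pivots[i]))
        if lower >= best_d:
            continue                                          # shortcut: cannot win
        d = np.linalg.norm(x - X_train[i])                    # full distance computation
        if d < best_d:
            best_i, best_d = i, d
    return best_i

X = np.random.randn(200, 10)
pivots = X[np.random.choice(len(X), 5, replace=False)]        # r = 5 arbitrary pivots
print(nn_with_shortcut(np.random.randn(10), X, pivots))
```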

SLIDE 15

Distances

Nearest neighbour performance is sensitive to the distance.

Euclidean distance: d(x, x′) = (x − x′)²
– do not take the square root!

Mahalanobis distance: d(x, x′) = (x − x′)⊤ A (x − x′)
– Mahalanobis distance: A = Σ⁻¹
– Safe variant: A = (Σ + εI)⁻¹

Dimensionality reduction:
– Diagonalize Σ = Q⊤ΛQ.
– Drop the low eigenvalues and the corresponding eigenvectors.
– Define x̃ = Λ^(−1/2) Q x. Precompute all the x̃_i.
– Compute d(x, x_i) = (x̃ − x̃_i)².
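A minimal sketch of this reduction, assuming Σ is estimated from the training patterns; the ε regularizer plays the role of the "safe variant" above.

```python
import numpy as np

def whitening_map(X, keep=10, eps=1e-6):
    """Return x -> x_tilde = Lambda^(-1/2) Q x restricted to the top `keep` directions."""
    Sigma = np.cov(X, rowvar=False)
    lam, V = np.linalg.eigh(Sigma)                    # Sigma = V diag(lam) V^T, lam ascending
    lam, V = lam[::-1][:keep], V[:, ::-1][:, :keep]   # keep the largest eigenvalues
    W = V / np.sqrt(lam + eps)                        # column j scaled by lam_j^(-1/2)
    return lambda x: x @ W                            # Euclidean distance on x_tilde
                                                      # approximates the Mahalanobis distance

X = np.random.randn(100, 20)
to_tilde = whitening_map(X, keep=5)
X_tilde = to_tilde(X)                                 # precompute all the x_tilde_i once
```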

SLIDE 16

Discriminant function

Binary classification: y = ±1

Discriminant function: f_w(x)
– Assigns class sign(f_w(x)) to pattern x.
– Symbol w represents parameters to be learnt.

Example: linear discriminant function
– f_w(x) = w⊤Φ(x).

SLIDE 17

Example: The Perceptron

The perceptron is a linear discriminant function

[Diagram: the retina feeds an associative area producing features x; a weighted sum w⊤x passes through a threshold element computing sign(w⊤x).]

SLIDE 18

The Perceptron Algorithm

– Initialize w ← 0.
– Loop:
  – Pick example (x_i, y_i).
  – If y_i w⊤Φ(x_i) ≤ 0 then w ← w + y_i Φ(x_i).
– Until all examples are correctly classified.

Perceptron theorem
– Guaranteed to stop if the training data is linearly separable.

Perceptron via stochastic gradient descent
– SGD for minimizing C(w) = Σ_i max(0, −y_i w⊤Φ(x_i)) gives:
– If y_i w⊤Φ(x_i) ≤ 0 then w ← w + γ y_i Φ(x_i).
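The algorithm above, transcribed directly into Python with Φ taken to be the identity; a sketch that terminates only if the data is linearly separable (perceptron theorem), hence the epoch cap.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron algorithm; rows of X are patterns, y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # misclassified (or on the boundary)
                w += yi * xi              # w <- w + y_i Phi(x_i)
                mistakes += 1
        if mistakes == 0:                 # all examples correctly classified
            break
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))
```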

SLIDE 19

The Perceptron Mark 1 (1957)

The Perceptron is not an algorithm. The Perceptron is a machine!

SLIDE 20

Minimize the empirical error rate

Empirical error rate

  min_w (1/n) Σ_{i=1}^{n} 1I{ y_i f(x_i, w) ≤ 0 }

Misclassification loss function
– Noncontinuous
– Nondifferentiable
– Nonconvex

SLIDE 21

Surrogate loss function

Minimize instead

  min_w (1/n) Σ_{i=1}^{n} ℓ( y_i f(x_i, w) )

Quadratic surrogate loss
– Quadratic: ℓ(z) = (z − 1)²

SLIDE 22

Surrogate loss functions

Exp loss and log loss
– Exp loss: ℓ(z) = exp(−z)
– Log loss: ℓ(z) = log(1 + exp(−z))

Hinges
– Perceptron loss: ℓ(z) = max(0, −z)
– Hinge loss: ℓ(z) = max(0, 1 − z)
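The surrogate losses of the last two slides, written as functions of the margin z = y f(x); a small sketch convenient for plotting or experimenting with them.

```python
import numpy as np

def quadratic_loss(z):  return (z - 1.0) ** 2
def exp_loss(z):        return np.exp(-z)
def log_loss(z):        return np.log1p(np.exp(-z))      # log(1 + exp(-z))
def perceptron_loss(z): return np.maximum(0.0, -z)
def hinge_loss(z):      return np.maximum(0.0, 1.0 - z)

z = np.linspace(-2.0, 3.0, 6)
print(hinge_loss(z))    # vanishes once the margin exceeds 1
```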

SLIDE 23

Surrogate loss function

Quadratic + sigmoid
– Let σ(z) = tanh(z). ℓ(z) = (σ(3z/2) − 1)²

Ramp
– Ramp loss: ℓ(z) = [1 − z]₊ − [s − z]₊

SLIDE 24

Choice of a surrogate loss function

Constraints from the optimization algorithm
– A convex loss with a convex f_w(x) ensures the uniqueness of the minimum.
– Optimization by gradient descent suggests differentiable losses.
– Dual optimization methods work well with hinges.

Class calibrated loss
– In the limit, minimize ∫ [ η(x) ℓ(f_w(x)) + (1 − η(x)) ℓ(−f_w(x)) ] dP(x).
– Define L(η, z) = η ℓ(z) + (1 − η) ℓ(−z).
– If we had an infinite training set and a fully flexible f_w(x), we would have
  f(x) = arg min_z L(P{Y = +1 | X = x}, z).
– Example below.
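One example, worked out here rather than on the slide: for the log loss the pointwise minimizer recovers the log-odds.

```latex
% Log loss: \ell(z) = \log(1 + e^{-z}).
L(\eta, z) = \eta \log(1 + e^{-z}) + (1 - \eta) \log(1 + e^{z}),
\qquad
\frac{\partial L}{\partial z} = \sigma(z) - \eta
\ \text{ with } \sigma(z) = \frac{1}{1 + e^{-z}}.
% Setting the derivative to zero:
\arg\min_z L(\eta, z) = \log \frac{\eta}{1 - \eta}
\quad\Longrightarrow\quad
f(x) = \log \frac{\eta(x)}{1 - \eta(x)}.
```

Since this minimizer is positive exactly when η(x) > 1/2, thresholding f at zero reproduces the Bayes rule: the log loss is class calibrated.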

SLIDE 25

Asymmetric cost problem

Binary classification
– Positive class y = +1, negative class y = −1.

Examples of positive classes
– fraudulent credit card transaction
– relevant document for a given query
– heart failure detection

Different kinds of errors have different costs
– False positive, false detection, false alarm.
– False negative, non-detection.

Costs are difficult to assess.

SLIDE 26

Receiver Operating Curve (ROC)

Changing the threshold
– Assigned class is sign(f(x) − b).
– True positives: F₊(b) = P{f(x) − b > 0 | Y = +1}
– False positives: F₋(b) = P{f(x) − b > 0 | Y = −1}
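A sketch of an empirical ROC: sort by decreasing score and sweep the threshold b across the data. Scores and labels below are invented for illustration.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) at every threshold."""
    order = np.argsort(-scores)              # decreasing scores
    labels = labels[order]
    tp = np.cumsum(labels == +1)             # true positives above each threshold
    fp = np.cumsum(labels == -1)             # false positives above each threshold
    return fp / max(fp[-1], 1), tp / max(tp[-1], 1)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
labels = np.array([+1, +1, -1, +1, -1, -1])
fpr, tpr = roc_points(scores, labels)
print(fpr, tpr)                              # points tracing the empirical ROC
```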

SLIDE 27

Optimal decision rule with asymmetric costs

Optimal asymmetric decision rule
– Let C_y be the cost of erroneously assigning class y to an example.
– We want to minimize

  ∫ Σ_{y=±1} C_y 1I(f(x) = y) P{Y ≠ y | X = x} dP(x).

– Hence

  f(x) = arg min_{y=±1} C_y P{Y ≠ y | X = x} = sign( η(x) − C₊ / (C₊ + C₋) ).

Optimal ROC curve
– The optimal decision rules have the form sign(f(x) − b).
– Therefore f(x) = η(x) = P{Y = +1 | X = x} gives the optimal ROC curve.
– Same for monotone transformations of f(x).
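A quick numeric instance of this rule (the costs are invented for illustration): if a false alarm costs C₊ = 1 and a missed detection costs C₋ = 9, then

```latex
f(x) = \operatorname{sign}\!\Big(\eta(x) - \frac{C_+}{C_+ + C_-}\Big)
     = \operatorname{sign}\!\Big(\eta(x) - \frac{1}{1 + 9}\Big)
     = \operatorname{sign}\big(\eta(x) - 0.1\big).
```

Costly misses push the threshold below 1/2: the classifier flags an example as positive as soon as η(x) exceeds 0.1.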

SLIDE 28

Empirical ROC

SLIDE 29

Ranking

Find a function f_w(x) with an ROC close to the optimal ROC.

Maximize the Area Under the Curve (AUC)
– We would like

  min_w Σ_{i∈P} Σ_{j∈N} 1I{ f(x_i, w) ≤ f(x_j, w) }

– With a surrogate:

  min_w Σ_{i∈P} Σ_{j∈N} ℓ( f(x_i, w) − f(x_j, w) )

Ranking the best instances
– AUC often optimizes useless parts of the ROC curve.
– Various algorithms have been proposed to do better.
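A sketch of the pairwise surrogate with ℓ taken to be the hinge; only the objective is evaluated here, the optimization over w is left out.

```python
import numpy as np

def pairwise_hinge(scores_pos, scores_neg):
    """Sum over positive i and negative j of max(0, 1 - (f(x_i) - f(x_j)))."""
    diffs = scores_pos[:, None] - scores_neg[None, :]   # all |P| x |N| score gaps
    return np.maximum(0.0, 1.0 - diffs).sum()

pos = np.array([2.0, 1.5, 0.3])   # scores f(x_i, w) of positive examples
neg = np.array([0.0, 0.8])        # scores f(x_j, w) of negative examples
print(pairwise_hinge(pos, neg))   # penalizes pairs ranked with margin < 1
```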

SLIDE 30

What to do with more than two classes?

Turning the problem into multiple binary classification problems.

  • One versus all (M classifiers)

– Classifier f_k(x) detects class k.
– Recognized class is arg max_k f_k(x).
– Each classifier is trained on the full dataset.
– Dubious principle. Works well in practice.

  • One versus one (M(M − 1)/2 classifiers)

– Classifier f_{k,k′} separates class k from class k′.
– Recognized class is arg max_k Σ_{k′} f_{k,k′}(x).
– Classifier f_{k,k′} is trained on examples from classes k and k′.
– Dubious principle. Often faster but slightly worse.
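A sketch of the one-versus-all reduction; scikit-learn's LogisticRegression stands in for the binary learner purely for illustration (any real-valued scorer f_k would do).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, classes):
    """One binary scorer f_k per class, trained on the full dataset."""
    models = {}
    for k in classes:
        yk = np.where(y == k, 1, -1)                 # class k versus the rest
        models[k] = LogisticRegression().fit(X, yk)
    return models

def predict_ova(models, X):
    """Recognized class is arg max_k f_k(x)."""
    keys = list(models)
    scores = [models[k].decision_function(X) for k in keys]   # shape (M, n)
    return np.array(keys)[np.argmax(scores, axis=0)]
```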

SLIDE 31

What to do with more than two classes?

Doing it right!
– Learn a function S_w(x, y) that measures how well y goes with x.
– Recognized class: arg max_y S_w(x, y).

Cost functions

Perceptron-like:

  min_w (1/n) Σ_{i=1}^{n} [ − S_w(x_i, y_i) + max_y S_w(x_i, y) ]

Hinge-like:

  min_w (1/n) Σ_{i=1}^{n} [ 1 − S_w(x_i, y_i) + max_{y≠y_i} S_w(x_i, y) ]₊

Logloss-like:

  min_w (1/n) Σ_{i=1}^{n} [ − S_w(x_i, y_i) + log Σ_y e^{S_w(x_i, y)} ]

Comments
– More costly than OVA.
– Not better than OVA in practice.
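The three costs above, sketched for a linear scoring function S_w(x, y) = w_y⊤x with one weight vector per class; this parametrization and the 0-based class indices are assumptions of the illustration.

```python
import numpy as np

def scores(W, x):
    """S_w(x, y) for all y; W holds one weight vector per class (rows)."""
    return W @ x

def perceptron_cost(W, x, y):
    s = scores(W, x)
    return -s[y] + s.max()                        # -S(x,y_i) + max_y S(x,y)

def hinge_cost(W, x, y):
    s = scores(W, x)
    runner_up = np.max(np.delete(s, y))           # max over y' != y_i
    return max(0.0, 1.0 - s[y] + runner_up)       # [1 - S(x,y_i) + max ...]_+

def logloss_cost(W, x, y):
    s = scores(W, x)
    return -s[y] + np.log(np.sum(np.exp(s)))      # log-sum-exp over all classes
```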

SLIDE 32

Multilabel Problems

Documents can treat multiple topics. Therefore y is a subset of the set of topics.

Simple approach
– One binary classifier for each topic.
– But labels are not independent: taxonomies, related topics.

Complex scoring functions
– f_k(x) gives a score for document x and topic k.
– R_w(y) measures the compatibility of the topic set y.
– Recognized topics:

  arg max_{y = {y₁ . . . y_k}} [ R_w(y) + Σ_{k∈y} f_k(x) ]

– Same loss functions as in the multiclass problem.
