SLIDE 1

Classification and Pattern Recognition

Léon Bottou

NEC Labs America

COS 424 – 2/23/2010

SLIDE 2

The machine learning mix and match

Goals
– Classification, clustering, regression, other.

Representation
– Parametric vs. kernels vs. nonparametric
– Probabilistic vs. nonprobabilistic
– Linear vs. nonlinear
– Deep vs. shallow

Capacity Control
– Explicit: architecture, feature selection
– Explicit: regularization, priors
– Implicit: approximate optimization
– Implicit: Bayesian averaging, ensembles

Operational Considerations
– Loss functions
– Budget constraints
– Online vs. offline

Computational Considerations
– Exact algorithms for small datasets.
– Stochastic algorithms for big datasets.
– Parallel algorithms.

SLIDE 3

Topics for today’s lecture

(The machine learning mix-and-match chart from the previous slide, shown again to situate today's topics.)

SLIDE 4

Summary

  1. Bayesian decision theory
  2. Nearest neighbours
  3. Parametric classifiers
  4. Surrogate loss functions
  5. ROC curve
  6. Multiclass and multilabel problems

SLIDE 5

Classification a.k.a. Pattern recognition

Association between patterns x ∈ X and classes y ∈ Y.

  • The pattern space X is unspecified. For instance, X = Rd.
  • The class space Y is an unordered finite set.

Examples:

  • Binary classification (Y = {±1}).

Fraud detection, anomaly detection,. . .

  • Multiclass classification: (Y = {C1, C2, . . . CM})

Object recognition, speaker identification, face recognition,. . .

  • Multilabel classification: (Y is a power set).

Document topic recognition,. . .

  • Sequence recognition: (Y contains sequences).

Speech recognition, signal identification, . . . .

SLIDE 6

Probabilistic model

Patterns and classes are represented by random variables X and Y.

  • P(X, Y) = P(X) P(Y|X) = P(Y) P(X|Y)

SLIDE 7

Bayes decision theory

Consider a classifier x ∈ X → f(x) ∈ Y. Maximize the probability of a correct answer:

  P{f(X) = Y} = ∫ 1I(f(x) = y) dP(x, y)
              = ∫ Σ_{y∈Y} 1I(f(x) = y) P{Y = y | X = x} dP(x)
              = ∫ P{Y = f(x) | X = x} dP(x)

Bayes optimal decision rule:  f*(x) = arg max_{y∈Y} P{Y = y | X = x}

Bayes optimal error rate:  B = 1 − ∫ max_{y∈Y} P{Y = y | X = x} dP(x).
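A tiny numeric illustration of these formulas (the joint distribution below is invented): enumerate P(X, Y) over a finite pattern space and read off the Bayes rule and the Bayes error.

```python
import numpy as np

# Joint distribution P(X = x, Y = y): rows index three pattern values,
# columns index two classes. Numbers invented for illustration.
joint = np.array([[0.30, 0.05],
                  [0.10, 0.20],
                  [0.05, 0.30]])

posterior = joint / joint.sum(axis=1, keepdims=True)   # P(Y = y | X = x)
f_star = posterior.argmax(axis=1)                      # Bayes rule f*(x) per x
B = 1.0 - joint.max(axis=1).sum()                      # B = 1 - sum_x max_y P(x, y)

print(f_star)   # [0 1 1]: optimal class for each pattern value
print(B)        # 0.20 (up to float rounding)
```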

SLIDE 8

Bayes optimal decision rule

Comparing class densities py(x) scaled by the class priors Py = P {Y = y}:

[Figure: the two scaled densities P_y p_y(x); the hatched area represents the Bayes optimal error rate.]

SLIDE 9

How to build a classifier from data

Given a finite set of training examples {(x₁, y₁), . . . , (x_n, y_n)}?

  • Estimating probabilities:

– Find a plausible probability distribution (next lecture).
– Compute or approximate the optimal Bayes classifier.

  • Minimize empirical error:

– Choose a parametrized family of classification functions a priori.
– Pick one that minimizes the observed error rate.

  • Nearest neighbours:

– Determine class of x on the basis of the closest example(s).

SLIDE 10

Nearest neighbours

Let d(x, x′) be a distance on the patterns.

Nearest neighbour rule (1NN)
– Give x the class of the closest training example.
– f_nn(x) = y_nn(x) with nn(x) = arg min_i d(x, x_i).

K-Nearest neighbours rule (kNN)

– Give x the most frequent class among the K closest training examples.

K-Nearest neighbours variants

– Weighted votes (according to the distances); a brute-force sketch follows.
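A minimal sketch of the kNN rule above, assuming patterns are rows of a NumPy array; illustration only, not the lecture's code.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Return the most frequent class among the k nearest training examples."""
    d = np.sum((X_train - x) ** 2, axis=1)   # squared distances (no sqrt needed)
    nearest = np.argsort(d)[:k]              # indices of the k closest examples
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, +1, +1])
print(knn_classify(np.array([0.8, 0.9]), X, y, k=3))   # +1
```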

SLIDE 11

Voronoi tessellation

[Figures: Voronoi cells for the Euclidean distance in the plane and for the cosine distance on the sphere.]
– 1NN: piecewise constant classifier defined on the Voronoi cells.
– kNN: same, but with smaller cells and additional constraints.

SLIDE 12

1NN and Optimal Bayes Error

Theorem (Cover & Hart, 1967): Assume η_y(x) = P{Y = y | X = x} is continuous. When n → ∞,

  B ≤ P{f_nn(X) ≠ Y} ≤ 2B.

Easy proof when there are only two classes. Let η(x) = P{Y = +1 | X = x} and let x* denote the nearest neighbour of x, so that x* → x as n → ∞:

– B = ∫ min(η(x), 1 − η(x)) dP(x)
– P{f_nn(X) ≠ Y} = ∫ [ η(x)(1 − η(x*)) + (1 − η(x)) η(x*) ] dP(x)
                 ≈ ∫ 2 η(x)(1 − η(x)) dP(x)   (by continuity, η(x*) → η(x))
                 ≤ ∫ 2 min(η(x), 1 − η(x)) dP(x) = 2B.

SLIDE 13

1NN versus kNN

Using more neighbours
– Comes closer to the Bayes rule in the limit.
– Needs more examples to approach the condition η(x_knn(x)) ≈ η(x).

[Plot: error of the 1-nn, 3-nn, 5-nn, 7-nn, and 51-nn rules compared with the Bayes error ("Bayes") and twice the Bayes error ("Bayes*2").]

K is a capacity parameter – to be determined using a validation set.
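One way to pick K on a validation set, sketched here with scikit-learn's KNeighborsClassifier purely for convenience; any kNN implementation would do.

```python
from sklearn.neighbors import KNeighborsClassifier

def pick_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 51)):
    """Return the K with the lowest validation error (highest accuracy)."""
    def val_accuracy(k):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        return clf.score(X_val, y_val)       # accuracy on the validation set
    return max(candidates, key=val_accuracy)
```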

SLIDE 14

Computation

Straightforward implementation
– Computing f(x) requires n distance computations.
– (−) Grows with the number of examples.
– (+) Embarrassingly parallelizable.

Data structures to speed up the search: K-D trees
– (+) Very effective in low dimension.
– (−) Nearly useless in high dimension.

Shortcutting the computation of distances
– Stop computing as soon as a distance becomes non-competitive.
– Use the triangle inequality d(x, x_i) ≥ |d(x, x′) − d(x_i, x′)|:
– Pick r well spread patterns x_(1) . . . x_(r).
– Precompute d(x_i, x_(j)) for i = 1 . . . n and j = 1 . . . r.
– Lower bound: d(x, x_i) ≥ max_{j=1...r} |d(x, x_(j)) − d(x_i, x_(j))|.
– Shortcut if the lower bound is not competitive (sketched below).
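A sketch of the shortcut strategy under Euclidean distance; the pivot patterns stand in for the "well spread" x_(1) . . . x_(r) and are chosen arbitrarily here.

```python
import numpy as np

def nn_with_shortcut(x, X_train, pivots):
    """1-NN search that skips examples whose lower bound is not competitive."""
    d_x_pivots = np.linalg.norm(pivots - x, axis=1)           # d(x, x_(j))
    d_train_pivots = np.linalg.norm(
        X_train[:, None, :] - pivots[None, :, :], axis=2)     # d(x_i, x_(j)), precomputable
    best_i, best_d = -1, np.inf
    for i in range(len(X_train)):
        # Triangle inequality: d(x, x_i) >= max_j |d(x, x_(j)) - d(x_i, x_(j))|
        lower = np.max(np.abs(d_x_pivots - d_train_pivots[i]))
        if lower >= best_d:
            continue                                          # shortcut: cannot win
        d = np.linalg.norm(x - X_train[i])                    # full distance computation
        if d < best_d:
            best_i, best_d = i, d
    return best_i

X = np.random.randn(200, 10)
pivots = X[np.random.choice(len(X), 5, replace=False)]        # r = 5 arbitrary pivots
print(nn_with_shortcut(np.random.randn(10), X, pivots))
```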

SLIDE 15

Distances

Nearest neighbour performance is sensitive to the distance.

Euclidean distance: d(x, x′) = (x − x′)²
– do not take the square root!

Mahalanobis distance: d(x, x′) = (x − x′)⊤ A (x − x′)
– Mahalanobis distance: A = Σ⁻¹
– Safe variant: A = (Σ + εI)⁻¹

Dimensionality reduction:
– Diagonalize Σ = Q⊤ΛQ.
– Drop the low eigenvalues and the corresponding eigenvectors.
– Define x̃ = Λ^(−1/2) Q x. Precompute all the x̃_i.
– Compute d(x, x_i) = (x̃ − x̃_i)².
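A minimal sketch of this reduction, assuming Σ is estimated from the training patterns; the ε regularizer plays the role of the "safe variant" above.

```python
import numpy as np

def whitening_map(X, keep=10, eps=1e-6):
    """Return x -> x_tilde = Lambda^(-1/2) Q x restricted to the top `keep` directions."""
    Sigma = np.cov(X, rowvar=False)
    lam, V = np.linalg.eigh(Sigma)                    # Sigma = V diag(lam) V^T, lam ascending
    lam, V = lam[::-1][:keep], V[:, ::-1][:, :keep]   # keep the largest eigenvalues
    W = V / np.sqrt(lam + eps)                        # column j scaled by lam_j^(-1/2)
    return lambda x: x @ W                            # Euclidean distance on x_tilde
                                                      # approximates the Mahalanobis distance

X = np.random.randn(100, 20)
to_tilde = whitening_map(X, keep=5)
X_tilde = to_tilde(X)                                 # precompute all the x_tilde_i once
```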

SLIDE 16

Discriminant function

Binary classification: y = ±1

Discriminant function: f_w(x)
– Assigns class sign(f_w(x)) to pattern x.
– Symbol w represents parameters to be learnt.

Example: linear discriminant function
– f_w(x) = w⊤Φ(x).

SLIDE 17

Example: The Perceptron

The perceptron is a linear discriminant function

[Diagram: the retina feeds an associative area producing features x; a weighted sum w⊤x passes through a threshold element computing sign(w⊤x).]

SLIDE 18

The Perceptron Algorithm

– Initialize w ← 0.
– Loop:
  – Pick example (x_i, y_i).
  – If y_i w⊤Φ(x_i) ≤ 0 then w ← w + y_i Φ(x_i).
– Until all examples are correctly classified.

Perceptron theorem
– Guaranteed to stop if the training data is linearly separable.

Perceptron via stochastic gradient descent
– SGD for minimizing C(w) = Σ_i max(0, −y_i w⊤Φ(x_i)) gives:
– If y_i w⊤Φ(x_i) ≤ 0 then w ← w + γ y_i Φ(x_i).
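The algorithm above, transcribed directly into Python with Φ taken to be the identity; a sketch that terminates only if the data is linearly separable (perceptron theorem), hence the epoch cap.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron algorithm; rows of X are patterns, y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # misclassified (or on the boundary)
                w += yi * xi              # w <- w + y_i Phi(x_i)
                mistakes += 1
        if mistakes == 0:                 # all examples correctly classified
            break
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))
```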

SLIDE 19

The Perceptron Mark 1 (1957)

The Perceptron is not an algorithm. The Perceptron is a machine!

SLIDE 20

Minimize the empirical error rate

Empirical error rate

  min_w (1/n) Σ_{i=1}^{n} 1I{ y_i f(x_i, w) ≤ 0 }

Misclassification loss function
– Noncontinuous
– Nondifferentiable
– Nonconvex

SLIDE 21

Surrogate loss function

Minimize instead

  min_w (1/n) Σ_{i=1}^{n} ℓ( y_i f(x_i, w) )

Quadratic surrogate loss
– Quadratic: ℓ(z) = (z − 1)²

SLIDE 22

Surrogate loss functions

Exp loss and log loss
– Exp loss: ℓ(z) = exp(−z)
– Log loss: ℓ(z) = log(1 + exp(−z))

Hinges
– Perceptron loss: ℓ(z) = max(0, −z)
– Hinge loss: ℓ(z) = max(0, 1 − z)
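The surrogate losses of the last two slides, written as functions of the margin z = y f(x); a small sketch convenient for plotting or experimenting with them.

```python
import numpy as np

def quadratic_loss(z):  return (z - 1.0) ** 2
def exp_loss(z):        return np.exp(-z)
def log_loss(z):        return np.log1p(np.exp(-z))      # log(1 + exp(-z))
def perceptron_loss(z): return np.maximum(0.0, -z)
def hinge_loss(z):      return np.maximum(0.0, 1.0 - z)

z = np.linspace(-2.0, 3.0, 6)
print(hinge_loss(z))    # vanishes once the margin exceeds 1
```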

SLIDE 23

Surrogate loss function

Quadratic + sigmoid
– Let σ(z) = tanh(z). ℓ(z) = (σ(3z/2) − 1)²

Ramp
– Ramp loss: ℓ(z) = [1 − z]₊ − [s − z]₊

SLIDE 24

Choice of a surrogate loss function

Constraints from the optimization algorithm
– A convex loss with a convex f_w(x) ensures the uniqueness of the minimum.
– Optimization by gradient descent suggests differentiable losses.
– Dual optimization methods work well with hinges.

Class calibrated loss
– In the limit, minimize ∫ [ η(x) ℓ(f_w(x)) + (1 − η(x)) ℓ(−f_w(x)) ] dP(x).
– Define L(η, z) = η ℓ(z) + (1 − η) ℓ(−z).
– If we had an infinite training set and a fully flexible f_w(x), we would have
  f(x) = arg min_z L(P{Y = +1 | X = x}, z).
– Example below.
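One example, worked out here rather than on the slide: for the log loss the pointwise minimizer recovers the log-odds.

```latex
% Log loss: \ell(z) = \log(1 + e^{-z}).
L(\eta, z) = \eta \log(1 + e^{-z}) + (1 - \eta) \log(1 + e^{z}),
\qquad
\frac{\partial L}{\partial z} = \sigma(z) - \eta
\ \text{ with } \sigma(z) = \frac{1}{1 + e^{-z}}.
% Setting the derivative to zero:
\arg\min_z L(\eta, z) = \log \frac{\eta}{1 - \eta}
\quad\Longrightarrow\quad
f(x) = \log \frac{\eta(x)}{1 - \eta(x)}.
```

Since this minimizer is positive exactly when η(x) > 1/2, thresholding f at zero reproduces the Bayes rule: the log loss is class calibrated.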

SLIDE 25

Asymmetric cost problem

Binary classification
– Positive class y = +1, negative class y = −1.

Examples of positive classes
– fraudulent credit card transaction
– relevant document for a given query
– heart failure detection

Different kinds of errors have different costs
– False positive, false detection, false alarm.
– False negative, non-detection.

Costs are difficult to assess.

SLIDE 26

Receiver Operating Curve (ROC)

Changing the threshold
– Assigned class is sign(f(x) − b).
– True positives: F₊(b) = P{f(x) − b > 0 | Y = +1}
– False positives: F₋(b) = P{f(x) − b > 0 | Y = −1}
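A sketch of an empirical ROC: sort by decreasing score and sweep the threshold b across the data. Scores and labels below are invented for illustration.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) at every threshold."""
    order = np.argsort(-scores)              # decreasing scores
    labels = labels[order]
    tp = np.cumsum(labels == +1)             # true positives above each threshold
    fp = np.cumsum(labels == -1)             # false positives above each threshold
    return fp / max(fp[-1], 1), tp / max(tp[-1], 1)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
labels = np.array([+1, +1, -1, +1, -1, -1])
fpr, tpr = roc_points(scores, labels)
print(fpr, tpr)                              # points tracing the empirical ROC
```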

SLIDE 27

Optimal decision rule with asymmetric costs

Optimal asymmetric decision rule
– Let C_y be the cost of erroneously assigning class y to an example.
– We want to minimize

  ∫ Σ_{y=±1} C_y 1I(f(x) = y) P{Y ≠ y | X = x} dP(x).

– Hence

  f(x) = arg min_{y=±1} C_y P{Y ≠ y | X = x} = sign( η(x) − C₊ / (C₊ + C₋) ).

Optimal ROC curve
– The optimal decision rules have the form sign(f(x) − b).
– Therefore f(x) = η(x) = P{Y = +1 | X = x} gives the optimal ROC curve.
– Same for monotone transformations of f(x).
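A quick numeric instance of this rule (the costs are invented for illustration): if a false alarm costs C₊ = 1 and a missed detection costs C₋ = 9, then

```latex
f(x) = \operatorname{sign}\!\Big(\eta(x) - \frac{C_+}{C_+ + C_-}\Big)
     = \operatorname{sign}\!\Big(\eta(x) - \frac{1}{1 + 9}\Big)
     = \operatorname{sign}\big(\eta(x) - 0.1\big).
```

Costly misses push the threshold below 1/2: the classifier flags an example as positive as soon as η(x) exceeds 0.1.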

SLIDE 28

Empirical ROC

SLIDE 29

Ranking

Find a function f_w(x) with an ROC close to the optimal ROC.

Maximize the Area Under the Curve (AUC)
– We would like

  min_w Σ_{i∈P} Σ_{j∈N} 1I{ f(x_i, w) ≤ f(x_j, w) }

– With a surrogate:

  min_w Σ_{i∈P} Σ_{j∈N} ℓ( f(x_i, w) − f(x_j, w) )

Ranking the best instances
– AUC often optimizes useless parts of the ROC curve.
– Various algorithms have been proposed to do better.
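A sketch of the pairwise surrogate with ℓ taken to be the hinge; only the objective is evaluated here, the optimization over w is left out.

```python
import numpy as np

def pairwise_hinge(scores_pos, scores_neg):
    """Sum over positive i and negative j of max(0, 1 - (f(x_i) - f(x_j)))."""
    diffs = scores_pos[:, None] - scores_neg[None, :]   # all |P| x |N| score gaps
    return np.maximum(0.0, 1.0 - diffs).sum()

pos = np.array([2.0, 1.5, 0.3])   # scores f(x_i, w) of positive examples
neg = np.array([0.0, 0.8])        # scores f(x_j, w) of negative examples
print(pairwise_hinge(pos, neg))   # penalizes pairs ranked with margin < 1
```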

SLIDE 30

What to do with more than two classes?

Turning the problem into multiple binary classification problems.

  • One versus all (M classifiers)

– Classifier f_k(x) detects class k.
– Recognized class is arg max_k f_k(x).
– Each classifier is trained on the full dataset.
– Dubious principle. Works well in practice.

  • One versus one (M(M − 1)/2 classifiers)

– Classifier f_{k,k′} separates class k from class k′.
– Recognized class is arg max_k Σ_{k′} f_{k,k′}(x).
– Classifier f_{k,k′} is trained on examples from classes k and k′.
– Dubious principle. Often faster but slightly worse.
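A sketch of the one-versus-all reduction; scikit-learn's LogisticRegression stands in for the binary learner purely for illustration (any real-valued scorer f_k would do).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, classes):
    """One binary scorer f_k per class, trained on the full dataset."""
    models = {}
    for k in classes:
        yk = np.where(y == k, 1, -1)                 # class k versus the rest
        models[k] = LogisticRegression().fit(X, yk)
    return models

def predict_ova(models, X):
    """Recognized class is arg max_k f_k(x)."""
    keys = list(models)
    scores = [models[k].decision_function(X) for k in keys]   # shape (M, n)
    return np.array(keys)[np.argmax(scores, axis=0)]
```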

SLIDE 31

What to do with more than two classes?

Doing it right!
– Learn a function S_w(x, y) that measures how well y goes with x.
– Recognized class: arg max_y S_w(x, y).

Cost functions

Perceptron-like:

  min_w (1/n) Σ_{i=1}^{n} [ − S_w(x_i, y_i) + max_y S_w(x_i, y) ]

Hinge-like:

  min_w (1/n) Σ_{i=1}^{n} [ 1 − S_w(x_i, y_i) + max_{y≠y_i} S_w(x_i, y) ]₊

Logloss-like:

  min_w (1/n) Σ_{i=1}^{n} [ − S_w(x_i, y_i) + log Σ_y e^{S_w(x_i, y)} ]

Comments
– More costly than OVA.
– Not better than OVA in practice.
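The three costs above, sketched for a linear scoring function S_w(x, y) = w_y⊤x with one weight vector per class; this parametrization and the 0-based class indices are assumptions of the illustration.

```python
import numpy as np

def scores(W, x):
    """S_w(x, y) for all y; W holds one weight vector per class (rows)."""
    return W @ x

def perceptron_cost(W, x, y):
    s = scores(W, x)
    return -s[y] + s.max()                        # -S(x,y_i) + max_y S(x,y)

def hinge_cost(W, x, y):
    s = scores(W, x)
    runner_up = np.max(np.delete(s, y))           # max over y' != y_i
    return max(0.0, 1.0 - s[y] + runner_up)       # [1 - S(x,y_i) + max ...]_+

def logloss_cost(W, x, y):
    s = scores(W, x)
    return -s[y] + np.log(np.sum(np.exp(s)))      # log-sum-exp over all classes
```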

SLIDE 32

Multilabel Problems

Documents can treat multiple topics. Therefore y is a subset of the set of topics.

Simple approach
– One binary classifier for each topic.
– But labels are not independent: taxonomies, related topics.

Complex scoring functions
– f_k(x) gives a score for document x and topic k.
– R_w(y) measures the compatibility of the topic set y.
– Recognized topics:

  arg max_{y = {y₁ . . . y_k}} [ R_w(y) + Σ_{k∈y} f_k(x) ]

– Same loss functions as in the multiclass problem.
