  1. Classification and Pattern Recognition
     Léon Bottou, NEC Labs America
     COS 424 – 2/23/2010

  2. The machine learning mix and match
     Goals: classification, clustering, regression, other.
     Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic;
       linear vs. nonlinear; deep vs. shallow.
     Capacity control: explicit (architecture, feature selection; regularization, priors);
       implicit (approximate optimization; Bayesian averaging, ensembles).
     Operational considerations: loss functions; budget constraints; online vs. offline.
     Computational considerations: exact algorithms for small datasets; stochastic algorithms
       for big datasets; parallel algorithms.

  3. Topics for today's lecture
     The same mix-and-match map as above, with classification highlighted as today's topic.

  4. Summary
     1. Bayesian decision theory
     2. Nearest neighbours
     3. Parametric classifiers
     4. Surrogate loss functions
     5. ROC curve
     6. Multiclass and multilabel problems

  5. Classification, a.k.a. pattern recognition
     Association between patterns x ∈ X and classes y ∈ Y.
     • The pattern space X is unspecified. For instance, X = R^d.
     • The class space Y is an unordered finite set.
     Examples:
     • Binary classification (Y = {±1}): fraud detection, anomaly detection, ...
     • Multiclass classification (Y = {C_1, C_2, ..., C_M}): object recognition, speaker
       identification, face recognition, ...
     • Multilabel classification (Y is a power set): document topic recognition, ...
     • Sequence recognition (Y contains sequences): speech recognition, signal identification, ...

  6. Probabilistic model
     Patterns and classes are represented by random variables X and Y.
     The joint distribution can be factored either way:
       P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y)
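A tiny numeric illustration of the two factorizations. The joint table below is made up for the example, not taken from the slides:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over 3 pattern values and 2 classes.
P_xy = np.array([[0.20, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.35]])    # rows: x, columns: y; entries sum to 1

P_x = P_xy.sum(axis=1)             # P(X)
P_y = P_xy.sum(axis=0)             # P(Y)
P_y_given_x = P_xy / P_x[:, None]  # P(Y | X)
P_x_given_y = P_xy / P_y[None, :]  # P(X | Y)

# Both factorizations recover the joint: P(X)P(Y|X) = P(Y)P(X|Y) = P(X, Y).
assert np.allclose(P_x[:, None] * P_y_given_x, P_xy)
assert np.allclose(P_y[None, :] * P_x_given_y, P_xy)
```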

  7. Bayes decision theory
     Consider a classifier x ∈ X ↦ f(x) ∈ Y.
     Maximize the probability of a correct answer:
       P{f(X) = Y} = ∫ 1I(f(x) = y) dP(x, y)
                   = ∫ Σ_{y ∈ Y} 1I(f(x) = y) P{Y = y | X = x} dP(x)
                   = ∫ P{Y = f(x) | X = x} dP(x)
     Bayes optimal decision rule:  f*(x) = arg max_{y ∈ Y} P{Y = y | X = x}
     Bayes optimal error rate:     B = ∫ ( 1 − max_{y ∈ Y} P{Y = y | X = x} ) dP(x)
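For a distribution with finitely many pattern values, the rule and the error rate can be computed directly. A minimal sketch, using a made-up joint table (the numbers are mine, chosen only for illustration):

```python
import numpy as np

# Toy joint distribution over 3 discrete patterns and 2 classes.
P_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.15, 0.15]])           # rows: x, columns: y

P_x = P_xy.sum(axis=1)
P_y_given_x = P_xy / P_x[:, None]

f_star = P_y_given_x.argmax(axis=1)       # Bayes optimal decision for each x
bayes_error = np.sum(P_x * (1.0 - P_y_given_x.max(axis=1)))

print("Bayes rule:", f_star)              # class index chosen for each pattern
print("Bayes error B:", bayes_error)      # here 0.10 + 0.05 + 0.15 = 0.30
```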

  8. Bayes optimal decision rule
     Compare the class densities p_y(x) scaled by the class priors P_y = P{Y = y}:
     the Bayes rule picks the class whose scaled density P_y p_y(x) is largest at x.
     [Figure: scaled class densities; the hatched area represents the Bayes optimal error rate.]

  9. How to build a classifier from data
     Given a finite set of training examples {(x_1, y_1), ..., (x_n, y_n)}:
     • Estimate probabilities:
       – Find a plausible probability distribution (next lecture).
       – Compute or approximate the optimal Bayes classifier.
     • Minimize the empirical error:
       – Choose a parametrized family of classification functions a priori.
       – Pick the one that minimizes the observed error rate.
     • Nearest neighbours:
       – Determine the class of x on the basis of the closest training example(s).

  10. Nearest neighbours
      Let d(x, x′) be a distance on the patterns.
      Nearest neighbour rule (1NN)
      – Give x the class of the closest training example:
        f_nn(x) = y_{nn(x)}  with  nn(x) = arg min_i d(x, x_i).
      K-nearest neighbours rule (kNN)
      – Give x the most frequent class among the K closest training examples.
      K-nearest neighbours variants
      – Weighted votes (according to the distances).
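The rule translates almost directly into code. A minimal numpy sketch of the kNN rule; the Euclidean distance and the toy data are my own choices, not from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """k-nearest-neighbour rule with Euclidean distance (majority vote)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)        # squared distances to x
    nearest = np.argsort(d2)[:k]                   # indices of the k closest examples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # most frequent class among them

# Tiny usage example with made-up 2-D data.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, +1, +1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # -> -1
```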

  11. Voronoi tessellation
      Euclidean distance in the plane; cosine distance on the sphere.
      – 1NN: piecewise constant classifier defined on the Voronoi cells.
      – kNN: same, but with smaller cells and additional constraints.

  12. 1NN and the optimal Bayes error
      Theorem (Cover & Hart, 1967): assume η_y(x) = P{Y = y | X = x} is continuous.
      When n → ∞,   B ≤ P{f_nn(X) ≠ Y} ≤ 2B.
      Easy proof when there are only two classes. Let η(x) = P{Y = +1 | X = x} and let x* denote
      the nearest training example of x.
      – B = ∫ min(η(x), 1 − η(x)) dP(x)
      – P{f_nn(X) ≠ Y} = ∫ [ η(x)(1 − η(x*)) + (1 − η(x)) η(x*) ] dP(x)
                       → ∫ 2 η(x)(1 − η(x)) dP(x)   (since η(x*) → η(x) as n → ∞)
      – Since 2 η(x)(1 − η(x)) ≤ 2 min(η(x), 1 − η(x)), the limit is at most 2B.
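A rough finite-sample check of the bound on a synthetic problem where η(x) is known. All choices below (X uniform on [0, 1], η(x) = x, sample sizes) are mine, picked only so that B is easy to compute exactly; the bound itself is asymptotic, so the estimate is only approximate:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = np.where(rng.uniform(size=n) < x, 1, -1)   # eta(x) = P{Y=+1|X=x} = x
    return x, y

x_tr, y_tr = sample(20000)                         # training set for the 1NN rule
x_te, y_te = sample(20000)                         # independent test set

# 1NN search in one dimension via a sorted training set.
order = np.argsort(x_tr)
xs, ys = x_tr[order], y_tr[order]
idx = np.clip(np.searchsorted(xs, x_te), 1, len(xs) - 1)
left_closer = (x_te - xs[idx - 1]) < (xs[idx] - x_te)
nn_label = np.where(left_closer, ys[idx - 1], ys[idx])

err_1nn = np.mean(nn_label != y_te)
B = 0.25      # exact Bayes error: integral of min(x, 1 - x) over [0, 1]
# Asymptotically B <= 1NN error <= 2B; here the limit is 2 * int x(1-x) dx = 1/3.
print(f"B = {B:.3f}   1NN test error ~ {err_1nn:.3f}   2B = {2 * B:.3f}")
```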

  13. 1NN versus kNN
      [Figure: error curves for the Bayes rule, twice the Bayes error, and k-NN with
      k = 1, 3, 5, 7, 51.]
      Using more neighbours
      – converges to the Bayes rule in the limit,
      – but needs more examples to approach the condition η(x_{k-nn(x)}) ≈ η(x).
      K is a capacity parameter, to be determined using a validation set.

  14. Computation
      Straightforward implementation
      – Computing f(x) requires n distance computations.
      – (−) Grows with the number of examples.
      – (+) Embarrassingly parallelizable.
      Data structures to speed up the search: k-d trees
      – (+) Very effective in low dimension.
      – (−) Nearly useless in high dimension.
      Shortcutting the computation of distances
      – Stop computing as soon as a distance becomes non-competitive.
      Using the triangle inequality d(x, x_i) ≥ | d(x, x′) − d(x_i, x′) |:
      – Pick r well-spread patterns x^(1) ... x^(r).
      – Precompute d(x_i, x^(j)) for i = 1...n and j = 1...r.
      – Lower bound: d(x, x_i) ≥ max_{j=1...r} | d(x, x^(j)) − d(x_i, x^(j)) |.
      – Shortcut if the lower bound is not competitive (see the sketch below).
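A sketch of the pivot-based pruning described above. The function names, the random pivots, and the toy data are my own; the slide only specifies the lower bound and the precomputation:

```python
import numpy as np

def precompute_pivot_distances(X_train, X_pivots):
    """d(x_i, x^(j)) for every training point and pivot, computed once up front."""
    return np.linalg.norm(X_train[:, None, :] - X_pivots[None, :, :], axis=2)

def nn_with_pruning(X_train, d_train_piv, X_pivots, x):
    """1NN search that skips training points whose triangle-inequality lower
    bound already exceeds the best distance found so far."""
    d_x_piv = np.linalg.norm(X_pivots - x, axis=1)                   # d(x, x^(j))
    lower = np.max(np.abs(d_x_piv[None, :] - d_train_piv), axis=1)   # lower bounds on d(x, x_i)
    best_i, best_d = -1, np.inf
    for i in np.argsort(lower):               # visit the most promising points first
        if lower[i] >= best_d:
            break                             # nothing left can be closer than best_d
        d = np.linalg.norm(X_train[i] - x)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# Usage sketch with random data; a real implementation would pick well-spread pivots.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 16))
X_pivots = X_train[rng.choice(1000, size=8, replace=False)]
d_train_piv = precompute_pivot_distances(X_train, X_pivots)
print(nn_with_pruning(X_train, d_train_piv, X_pivots, rng.normal(size=16)))
```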

  15. Distances
      Nearest neighbour performance is sensitive to the distance.
      Euclidean distance: d(x, x′) = (x − x′)², do not take the square root!
      Mahalanobis distance: d(x, x′) = (x − x′)ᵀ A (x − x′)
      – Mahalanobis distance: A = Σ⁻¹
      – Safe variant: A = (Σ + εI)⁻¹
      Dimensionality reduction:
      – Diagonalize Σ = Qᵀ Λ Q.
      – Drop the low eigenvalues and the corresponding eigenvectors.
      – Define x̃ = Λ^(−1/2) Q x. Precompute all the x̃_i.
      – Compute d(x, x_i) = (x̃ − x̃_i)².
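A sketch of the whitening trick under the safe variant (Σ + εI), assuming row-wise numpy data; keeping only the top-k eigenvalues gives the dimensionality reduction. The helper name and the random data are mine:

```python
import numpy as np

def whitening_map(X, eps=1e-3, k=None):
    """Build x -> Lambda^(-1/2) Q x from the regularized covariance (Sigma + eps*I).
    Keeping only the k largest eigenvalues performs the dimensionality reduction."""
    Sigma = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
    evals, evecs = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
    keep = np.argsort(evals)[::-1][:k]               # indices of the k largest
    Q, lam = evecs[:, keep].T, evals[keep]           # rows of Q are eigenvectors
    return lambda x: (Q @ np.atleast_2d(x).T).T / np.sqrt(lam)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # correlated features
to_tilde = whitening_map(X, k=5)
X_tilde = to_tilde(X)                                # precompute all x~_i once

x = rng.normal(size=10)
d2 = np.sum((to_tilde(x) - X_tilde) ** 2, axis=1)    # squared distances (x~ - x~_i)^2
print(X_tilde.shape, d2.argmin())
```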

  16. Discriminant function
      Binary classification: y = ±1.
      Discriminant function f_w(x)
      – assigns class sign(f_w(x)) to pattern x;
      – the symbol w represents the parameters to be learnt.
      Example: linear discriminant function f_w(x) = wᵀ Φ(x).

  17. Example: The Perceptron
      The perceptron is a linear discriminant function.
      [Diagram: retina, associative area producing x, threshold element computing sign(wᵀx).]

  18. The Perceptron Algorithm
      – Initialize w ← 0.
      – Loop:
        – Pick an example (x_i, y_i).
        – If y_i wᵀΦ(x_i) ≤ 0 then w ← w + y_i Φ(x_i).
      – Until all examples are correctly classified.
      Perceptron theorem
      – Guaranteed to stop if the training data is linearly separable.
      Perceptron via stochastic gradient descent
      – SGD for minimizing C(w) = Σ_i max(0, − y_i wᵀΦ(x_i)) gives:
        if y_i wᵀΦ(x_i) ≤ 0 then w ← w + γ y_i Φ(x_i).
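A minimal sketch of the update rule above in numpy. The feature map Φ (identity plus a bias feature), the epoch cap, and the separable toy data are my own choices for illustration:

```python
import numpy as np

def phi(x):
    """Feature map: here simply the input with a constant bias feature appended."""
    return np.append(x, 1.0)

def perceptron(X, y, max_epochs=100):
    w = np.zeros(X.shape[1] + 1)                 # w <- 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, phi(xi)) <= 0:     # misclassified (or on the boundary)
                w = w + yi * phi(xi)             # w <- w + y_i Phi(x_i)
                mistakes += 1
        if mistakes == 0:                        # all examples correctly classified
            break
    return w

# Linearly separable toy data: the class is determined by the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0.1, 1, -1)
w = perceptron(X, y)
print(np.mean(np.sign([np.dot(w, phi(x)) for x in X]) == y))   # training accuracy
```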

  19. The Perceptron Mark 1 (1957)
      The Perceptron is not an algorithm. The Perceptron is a machine!

  20. Minimize the empirical error rate
      Empirical error rate:
        min_w (1/n) Σ_{i=1}^{n} 1I{ y_i f(x_i, w) ≤ 0 }
      Misclassification loss function
      – noncontinuous,
      – nondifferentiable,
      – nonconvex.
      [Figure: the misclassification loss as a function of the margin y·f(x).]

  21. Surrogate loss function
      Minimize instead:
        min_w (1/n) Σ_{i=1}^{n} ℓ( y_i f(x_i, w) )
      Quadratic surrogate loss
      – Quadratic: ℓ(z) = (z − 1)²
      [Figure: the quadratic surrogate as a function of the margin y·f(x).]
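For a linear model with labels in {−1, +1}, minimizing the quadratic surrogate reduces to a least-squares fit of the labels, since (y wᵀΦ(x) − 1)² = (wᵀΦ(x) − y)². A short sketch; the feature map and toy data are my own choices:

```python
import numpy as np

def quadratic_surrogate(z):
    return (z - 1.0) ** 2

# Toy data, linear model f_w(x) = w . Phi(x) with Phi(x) = (x, 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
Phi = np.hstack([X, np.ones((len(X), 1))])

# Minimizer of the quadratic surrogate = least-squares fit of the labels.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

margins = y * (Phi @ w)
print("mean surrogate loss:", quadratic_surrogate(margins).mean())
print("training error rate:", np.mean(margins <= 0))
```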

  22. Surrogate loss functions
      Exp loss and log loss
      – Exp loss: ℓ(z) = exp(−z)
      – Log loss: ℓ(z) = log(1 + exp(−z))
      Hinges
      – Perceptron loss: ℓ(z) = max(0, −z)
      – Hinge loss: ℓ(z) = max(0, 1 − z)
      [Figures: each loss plotted as a function of the margin y·f(x).]

  23. Surrogate loss function
      Quadratic + sigmoid
      – Let σ(z) = tanh(z). Then ℓ(z) = ( σ(3z/2) − 1 )².
      Ramp
      – Ramp loss: ℓ(z) = [1 − z]₊ − [s − z]₊
      [Figures: each loss plotted as a function of the margin y·f(x).]
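The losses from the last three slides, written out as plain functions of the margin z = y f(x, w). The ramp parameter s is left as an argument since the slide does not fix it; its default value here is my own choice:

```python
import numpy as np

def misclassification(z): return (z <= 0).astype(float)   # 0-1 loss (not a surrogate)
def quadratic(z):         return (z - 1.0) ** 2
def exp_loss(z):          return np.exp(-z)
def log_loss(z):          return np.log1p(np.exp(-z))
def perceptron_loss(z):   return np.maximum(0.0, -z)
def hinge(z):             return np.maximum(0.0, 1.0 - z)
def quad_sigmoid(z):      return (np.tanh(1.5 * z) - 1.0) ** 2
def ramp(z, s=-1.0):      return np.maximum(0.0, 1.0 - z) - np.maximum(0.0, s - z)

z = np.linspace(-3, 3, 7)
for f in (misclassification, quadratic, exp_loss, log_loss,
          perceptron_loss, hinge, quad_sigmoid, ramp):
    print(f"{f.__name__:>16}: {np.round(f(z), 2)}")
```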
