


  1. Statistical Machine Learning: A Crash Course, Part II: Classification & SVMs
     Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS

  2. Bayesian Decision Theory
     ■ Decision rule: p(C_1 | x) > p(C_2 | x)
       • Decide C_1 if this holds; we do not need the normalization!
       • This is equivalent to p(x | C_1) p(C_1) > p(x | C_2) p(C_2)
       • Which is equivalent to p(x | C_1) / p(x | C_2) > p(C_2) / p(C_1)
     ■ Bayes optimal classifier:
       • A classifier obeying this rule is called a Bayes optimal classifier.
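
A minimal sketch of this decision rule in Python, assuming two 1-D Gaussian class-conditionals and prior values chosen purely for illustration (they are not taken from the slides):

```python
# Bayes-optimal decision: compare the unnormalized quantities p(x|C_k) p(C_k);
# the evidence p(x) is not needed. Class-conditionals and priors are assumed toy values.
from scipy.stats import norm

def bayes_decide(x, prior_1=0.6, prior_2=0.4):
    likelihood_1 = norm.pdf(x, loc=0.0, scale=1.0)  # assumed p(x | C_1)
    likelihood_2 = norm.pdf(x, loc=2.0, scale=1.0)  # assumed p(x | C_2)
    return "C_1" if likelihood_1 * prior_1 > likelihood_2 * prior_2 else "C_2"

print(bayes_decide(0.3))  # near the C_1 mean -> C_1
print(bayes_decide(2.5))  # near the C_2 mean -> C_2
```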

  3. Bayesian Decision Theory
     p(C_k | x) = p(x | C_k) p(C_k) / p(x)
     ■ Bayesian decision theory:
       • Model and estimate the class-conditional density p(x | C_k) as well as the class prior p(C_k)
       • Compute the posterior p(C_k | x)
       • Minimize the error probability by maximizing p(C_k | x)
     ■ New approach:
       • Directly encode the “decision boundary”
       • Without modeling the densities directly
       • Still minimize the error probability
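
Using the same assumed toy setup as above, a short sketch of this recipe: normalize p(x | C_k) p(C_k) by the evidence p(x) to obtain the posterior, then pick the class with the largest posterior:

```python
# Posterior p(C_k | x) = p(x | C_k) p(C_k) / p(x), with p(x) obtained by summing
# the joint terms over all classes. Means and priors are assumed toy values.
from scipy.stats import norm

def posteriors(x, means=(0.0, 2.0), priors=(0.6, 0.4)):
    joint = [norm.pdf(x, loc=m, scale=1.0) * p for m, p in zip(means, priors)]
    evidence = sum(joint)                  # p(x)
    return [j / evidence for j in joint]   # p(C_k | x), sums to 1

post = posteriors(0.8)
print(post)                                         # roughly [0.69, 0.31]
print("decide C_%d" % (1 + post.index(max(post))))  # -> decide C_1
```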

  4. Discriminant Functions
     ■ Formulate classification using comparisons:
       • Discriminant functions: y_1(x), ..., y_K(x)
       • Classify x as class C_k iff y_k(x) > y_j(x) ∀ j ≠ k
       • Example: discriminant functions from the Bayes classifier:
         y_k(x) = p(C_k | x)
         y_k(x) = p(x | C_k) p(C_k)
         y_k(x) = log p(x | C_k) + log p(C_k)
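
A small sketch of the comparison rule; the three discriminants below are assumed toy scores (negative squared distances to class centers, i.e. log-Gaussian scores up to constants):

```python
# Classify x by evaluating all discriminants and taking the argmax:
# choose C_k such that y_k(x) > y_j(x) for all j != k.
import numpy as np

def classify(x, discriminants):
    scores = [y_k(x) for y_k in discriminants]
    return int(np.argmax(scores)) + 1   # 1-based class index k

# Assumed toy discriminants for three classes with centers (0,0), (2,2), (-2,2).
discriminants = [
    lambda x: -np.sum((x - np.array([0.0, 0.0])) ** 2),
    lambda x: -np.sum((x - np.array([2.0, 2.0])) ** 2),
    lambda x: -np.sum((x - np.array([-2.0, 2.0])) ** 2),
]
print(classify(np.array([1.8, 1.5]), discriminants))  # -> 2
```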

  5. Discriminant Functions
     ■ Special case: 2 classes
       y_1(x) > y_2(x)  ⇔  y_1(x) − y_2(x) > 0  ⇔  y(x) > 0, with y(x) := y_1(x) − y_2(x)
       • Example: Bayes classifier
         y(x) = p(C_1 | x) − p(C_2 | x)
         y(x) = log [p(x | C_1) / p(x | C_2)] + log [p(C_1) / p(C_2)]

  6. Example: Bayes classifier
     ■ 2 classes, Gaussian class-conditionals
     [Figure: contour plots of p(x | C_1) p(C_1) and p(x | C_2) p(C_2), their difference p(x | C_1) p(C_1) − p(x | C_2) p(C_2), and the log-ratio log [p(x | C_1) p(C_1) / (p(x | C_2) p(C_2))]; the zero level set of the last two marks the decision boundary]
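
A sketch that evaluates the quantities shown in the figure on a grid; the means, covariances, and priors are assumed values, not the ones behind the original plot:

```python
# Evaluate p(x|C_1) p(C_1), p(x|C_2) p(C_2), their difference, and the log-ratio
# on a 2-D grid; the zero level set of the difference (or log-ratio) is the
# decision boundary. Means, covariances, and priors are assumed toy values.
import numpy as np
from scipy.stats import multivariate_normal

xx, yy = np.meshgrid(np.linspace(-3, 5, 200), np.linspace(-3, 5, 200))
grid = np.dstack([xx, yy])

joint_1 = 0.5 * multivariate_normal(mean=[0, 2], cov=[[1.0, 0.3], [0.3, 1.5]]).pdf(grid)
joint_2 = 0.5 * multivariate_normal(mean=[2, 0], cov=[[1.5, -0.2], [-0.2, 1.0]]).pdf(grid)

difference = joint_1 - joint_2         # p(x|C_1) p(C_1) - p(x|C_2) p(C_2)
log_ratio = np.log(joint_1 / joint_2)  # decision boundary where this equals 0
print("fraction of the grid assigned to C_1:", (log_ratio > 0).mean())
```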

  7. Linear Discriminant Functions
     ■ 2-class problem: y(x) > 0: decide class 1, otherwise class 2
     ■ Simplest case: linear decision boundary
       • Linear discriminant function: y(x) = w^T x + w_0
         (w: normal vector of the boundary, w_0: offset)

  8. Linear Discriminant Functions
     ■ Illustration for the 2D case: y(x) = w^T x + w_0
     [Figure: decision boundary y = 0 in the (x_1, x_2) plane separating region R_1 (y > 0) from R_2 (y < 0); w is the normal vector, y(x) / ||w|| is the signed distance of x to the decision boundary, and −w_0 / ||w|| is the signed distance of the boundary from the origin]
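
A tiny sketch of these quantities with assumed example values for w and w_0:

```python
# Linear discriminant y(x) = w^T x + w_0: the sign gives the class,
# y(x) / ||w|| is the signed distance of x to the decision boundary, and
# -w_0 / ||w|| is the signed distance of the boundary from the origin.
import numpy as np

w = np.array([1.0, 2.0])   # assumed normal vector of the boundary
w_0 = -1.0                 # assumed offset

def y(x):
    return w @ x + w_0

x = np.array([2.0, 1.0])
print("decide class", 1 if y(x) > 0 else 2)           # y(x) = 3 > 0 -> class 1
print("signed distance:", y(x) / np.linalg.norm(w))   # 3 / sqrt(5)
print("boundary offset from origin:", -w_0 / np.linalg.norm(w))
```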

  9. Linear Discriminant Functions
     ■ 2 basic cases:
     [Figure: two datasets, one linearly separable and one not linearly separable]

  10. Multi-Class Case
      ■ What if we constructed a multi-class classifier from several 2-class classifiers?
      [Figure: one-versus-all (one-versus-the-rest) and one-versus-one constructions; the regions marked “?” cannot be assigned a unique class]
        • If we base our decision rule on binary decisions, this may lead to ambiguities.

  11. Multi-Class Case
      ■ Better solution (we have seen this already):
        • Use discriminant functions y_1(x), ..., y_K(x) to encode how strongly we believe in each class
        • Decision rule: y_k(x) > y_j(x) ∀ j ≠ k
        • If the discriminant functions are linear, the decision regions R_k are connected and convex
      [Figure: decision regions R_i, R_j, R_k; any point x̂ on the line segment between two points x_A and x_B in R_k also lies in R_k]

  12. Discriminant Functions
      ■ Why might we want to use discriminant functions?
      ■ Example: 2 classes
        • We could easily fit the class-conditionals using Gaussians and use a Bayes classifier.
        • How about now?
        • Do these points matter for making the decision between the two classes?

  13. Distribution-free Classifiers
      ■ Main idea:
        • We do not necessarily need to model all details of the class-conditional distributions to come up with a good decision boundary.
          - The class-conditionals may have many intricacies that do not matter at the end of the day.
        • If we can learn where to place the decision boundary directly, we can avoid some of the complexity.
      ■ Nonetheless:
        • It would be unwise to believe that such classifiers are inherently superior to probabilistic ones.

  14. First Attempt: Least Squares
      ■ Try to achieve a certain value of the discriminant function:
        y(x) = +1 ⇔ x ∈ C_1
        y(x) = −1 ⇔ x ∈ C_2
        • Training data: X = {x_1, ..., x_n}, x_i ∈ R^d
        • Labels: Y = {y_1, ..., y_n}, y_i ∈ {−1, 1}
      ■ Linear discriminant function:
        • Try to enforce x_i^T w + w_0 = y_i, ∀ i = 1, ..., n
        • One linear equation for each training data point/label pair.

  15. First Attempt: Least Squares
      ■ Linear equation system: x_i^T w + w_0 = y_i, ∀ i = 1, ..., n
        • Define x̂_i = (x_i, 1)^T and ŵ = (w, w_0)^T
        • Rewrite the equation system: x̂_i^T ŵ = y_i, ∀ i = 1, ..., n
        • Or in matrix-vector notation: X̂^T ŵ = y, with X̂ = [x̂_1, ..., x̂_n] and y = (y_1, ..., y_n)^T

  16. First Attempt: Least Squares
      ■ Overdetermined system of equations: X̂^T ŵ = y
        • n equations, d + 1 unknowns
      ■ Look for the least-squares solution:
        ||X̂^T ŵ − y||^2 → min
        (X̂^T ŵ − y)^T (X̂^T ŵ − y) → min
        ŵ^T X̂ X̂^T ŵ − 2 y^T X̂^T ŵ + y^T y → min
        • Set the derivative to zero:
          2 X̂ X̂^T ŵ − 2 X̂ y := 0
          ŵ = (X̂ X̂^T)^{-1} X̂ y, where (X̂ X̂^T)^{-1} X̂ is the pseudo-inverse of X̂^T
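
A minimal sketch of the resulting procedure on assumed toy data (two Gaussian blobs); numpy's least-squares solver computes exactly the pseudo-inverse solution derived above:

```python
# Least-squares discriminant: append a 1 to every input, solve X̂^T ŵ = y for
# ŵ = (w, w_0) in the least-squares sense, and classify by sign(w^T x + w_0).
# The two Gaussian blobs below are assumed toy data.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(20, 2)),    # class +1
               rng.normal([4.0, 4.0], 1.0, size=(20, 2))])   # class -1
y = np.concatenate([np.ones(20), -np.ones(20)])

A = np.hstack([X, np.ones((X.shape[0], 1))])    # A = X̂^T: rows are x̂_i^T = (x_i^T, 1)
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)   # pseudo-inverse solution ŵ = (X̂ X̂^T)^{-1} X̂ y
w, w_0 = w_hat[:2], w_hat[2]

predictions = np.sign(X @ w + w_0)
print("training accuracy:", (predictions == y).mean())  # close to 1.0 for well-separated blobs
```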

  17. First Attempt: Least Squares
      ■ Problem: least squares is very sensitive to outliers
      [Figure: two panels. With no outliers the least-squares discriminant works; with outliers present it breaks down.]

  18. New Strategy
      ■ If our classes are linearly separable, we want to make sure that we find a separating (hyper)plane.
      ■ First such algorithm we will see:
        • The perceptron algorithm [Rosenblatt, 1962]
        • A true classic!

  19. Perceptron Algorithm
      ■ Perceptron discriminant function: y(x) = sign(w^T x + b)
      ■ Algorithm:
        • Start with some “weight” vector w and some bias b
        • For all data points x_i with class labels y_i ∈ {−1, 1}:
          - If x_i is correctly classified, i.e. y(x_i) = y_i, do nothing.
          - Otherwise, if y_i = 1, update the parameters using: w ← w + x_i, b ← b + 1
          - Otherwise, if y_i = −1, update the parameters using: w ← w − x_i, b ← b − 1
        • Repeat until convergence.
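
A minimal sketch of this algorithm; the two mistake cases are written compactly as w ← w + y_i x_i and b ← b + y_i, and the epoch cap is an added safeguard in case the data is not linearly separable:

```python
# Perceptron: cycle through the data and, whenever a point is misclassified,
# add it to (label +1) or subtract it from (label -1) the weight vector and
# adjust the bias; stop once a full pass makes no mistakes.
import numpy as np

def perceptron(X, y, max_epochs=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # misclassified (or exactly on the boundary)
                w += y_i * x_i             # w <- w + x_i if y_i = +1, w <- w - x_i if y_i = -1
                b += y_i                   # b <- b + 1  if y_i = +1, b <- b - 1  if y_i = -1
                mistakes += 1
        if mistakes == 0:                  # converged: every point correctly classified
            break
    return w, b

# Usage on assumed toy data (linearly separable):
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))   # -> (array([2., 1.]), 1.0)
```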

  20.-23. Perceptron Algorithm
      ■ Intuition: [Figure sequence over four slides: a 2-D toy dataset plotted on [−1, 1] × [−1, 1], with the decision boundary shown after successive perceptron updates]
