SLIDE 1

Statistical Machine Learning: A Crash Course
Part II: Classification & SVMs

Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS

SLIDE 2

Bayesian Decision Theory

■ Decision rule:
  • Decide C1 if p(C1|x) > p(C2|x)
  • This is equivalent to p(x|C1)p(C1) > p(x|C2)p(C2)
  • Which is equivalent to p(x|C1)/p(x|C2) > p(C2)/p(C1)
  • We do not need the normalization!

■ Bayes optimal classifier:
  • A classifier obeying this rule is called a Bayes optimal classifier.

SLIDE 3

Bayesian Decision Theory

■ Bayesian decision theory:
  • Model and estimate the class-conditional densities p(x|Ck) as well as the class priors p(Ck)
  • Compute the posterior p(Ck|x) = p(x|Ck)p(Ck) / p(x)
  • Minimize the error probability by maximizing p(Ck|x)

■ New approach:
  • Directly encode the "decision boundary"
  • Without modeling the densities directly
  • Still minimize the error probability

SLIDE 4

Discriminant Functions

■ Formulate classification using comparisons:
  • Discriminant functions: y1(x), ..., yK(x)
  • Classify x as class Ck iff yk(x) > yj(x), ∀j ≠ k
  • Example: Discriminant functions from the Bayes classifier:
    yk(x) = p(Ck|x)
    yk(x) = p(x|Ck)p(Ck)
    yk(x) = log p(x|Ck) + log p(Ck)

SLIDE 5

Discriminant Functions

■ Special case: 2 classes
  • y1(x) > y2(x)  ⇔  y1(x) − y2(x) > 0  ⇔  y(x) > 0
  • Example: Bayes classifier
    y(x) = p(C1|x) − p(C2|x)
    y(x) = log [p(x|C1)/p(x|C2)] + log [p(C1)/p(C2)]

SLIDE 6

Example: Bayes classifier

■ 2 classes, Gaussian class-conditionals:
  • (Figure: plots of p(x|C1)p(C1) and p(x|C2)p(C2), their difference p(x|C1)p(C1) − p(x|C2)p(C2), and the log-ratio log [p(x|C1)p(C1) / (p(x|C2)p(C2))], together with the resulting decision boundary; a small numeric sketch follows below.)
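
To make this concrete, here is a minimal sketch of such a Bayes classifier with two 1D Gaussian class-conditionals; the means, variances, and priors are made-up values for illustration, and SciPy is assumed to be available:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditionals p(x|C1), p(x|C2) and priors p(C1), p(C2)
mu1, sigma1, prior1 = 0.0, 1.0, 0.6
mu2, sigma2, prior2 = 2.0, 1.5, 0.4

def discriminant(x):
    # y(x) = log p(x|C1)p(C1) - log p(x|C2)p(C2); decide C1 if y(x) > 0
    return (norm.logpdf(x, mu1, sigma1) + np.log(prior1)
            - norm.logpdf(x, mu2, sigma2) - np.log(prior2))

for x in np.linspace(-3, 5, 9):
    y = discriminant(x)
    print(f"x = {x:+.1f}   y(x) = {y:+.3f}   decide {'C1' if y > 0 else 'C2'}")
```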

SLIDE 7

Linear Discriminant Functions

■ 2-class problem
■ Simplest case: linear decision boundary
  • Linear discriminant function: y(x) = wᵀx + w0
  • w is the normal vector of the boundary, w0 the offset.
  • y(x) > 0: decide class 1, otherwise class 2

SLIDE 8

Linear Discriminant Functions

■ Illustration for the 2D case: y(x) = wᵀx + w0
  • The decision boundary y = 0 separates the regions R1 (y > 0) and R2 (y < 0); w is normal to the boundary.
  • y(x)/‖w‖ is the signed distance of x to the decision boundary, and w0/‖w‖ the offset of the boundary from the origin (see the sketch below).
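
A tiny sketch of this computation (w, w0, and the query point are made-up values):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical normal vector
w0 = 0.5                    # hypothetical offset
x = np.array([1.0, 3.0])    # hypothetical query point

y = w @ x + w0                          # y(x) = w^T x + w0
signed_dist = y / np.linalg.norm(w)     # signed distance to the boundary y(x) = 0

print("y(x) =", y, "| class:", 1 if y > 0 else 2, "| signed distance:", signed_dist)
```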

SLIDE 9

Linear Discriminant Functions

■ 2 basic cases (figure): linearly separable vs. not linearly separable

SLIDE 10

Multi-Class Case

■ What if we constructed a multi-class classifier from several 2-class classifiers?
  • If we base our decision rule on binary decisions, this may lead to ambiguities (regions marked "?" in the figure).
  • One-versus-all (one-versus-the-rest): one classifier per class Ck against "not Ck"
  • One-versus-one: one classifier per pair of classes

SLIDE 11

Multi-Class Case

■ Better solution (we have seen this already):
  • Use discriminant functions y1(x), ..., yK(x) to encode how strongly we believe in each class.
  • Decision rule: decide Ck iff yk(x) > yj(x), ∀j ≠ k (see the sketch below).
  • If the discriminant functions are linear, the decision regions are connected and convex.
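
A minimal sketch of this argmax decision rule over K linear discriminants (the weights and the query point are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 2                      # number of classes, input dimension
W = rng.normal(size=(K, d))      # hypothetical weight vectors, one row per class
w0 = rng.normal(size=K)          # hypothetical offsets

def classify(x):
    # Evaluate all y_k(x) = w_k^T x + w0_k and decide for the largest one
    return int(np.argmax(W @ x + w0))

print("decide class", classify(np.array([0.5, -1.0])))
```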

SLIDE 12

Discriminant Functions

■ Why might we want to use discriminant functions?
■ Example: 2 classes
  • We could easily fit the class-conditionals using Gaussians and use a Bayes classifier.
  • How about now?
  • Do these points matter for making the decision between the two classes?

SLIDE 13

Distribution-free Classifiers

■ Main idea:
  • We do not necessarily need to model all details of the class-conditional distributions to come up with a good decision boundary.
  • The class-conditionals may have many intricacies that do not matter at the end of the day.
  • If we can learn where to place the decision boundary directly, we can avoid some of the complexity.

■ Nonetheless:
  • It would be unwise to believe that such classifiers are inherently superior to probabilistic ones.

SLIDE 14

First Attempt: Least Squares

■ Try to achieve a certain value of the discriminant function:
  • Training data: X = {x1, ..., xn}, xi ∈ R^d
  • Labels: Y = {y1, ..., yn}, yi ∈ {−1, 1}
  • y(x) = +1 ⇔ x ∈ C1,  y(x) = −1 ⇔ x ∈ C2

■ Linear discriminant function:
  • Try to enforce xiᵀw + w0 = yi, ∀i = 1, ..., n
  • One linear equation for each training data point/label pair.

SLIDE 15

First Attempt: Least Squares

■ Linear equation system:
  • Define x̂i = (xiᵀ, 1)ᵀ and ŵ = (wᵀ, w0)ᵀ
  • Rewrite the equation system: x̂iᵀŵ = yi, ∀i = 1, ..., n
  • Or in matrix-vector notation: X̂ᵀŵ = y, with X̂ = [x̂1, ..., x̂n] and y = (y1, ..., yn)ᵀ

SLIDE 16

First Attempt: Least Squares

■ Overdetermined system of equations X̂ᵀŵ = y:
  • n equations, d+1 unknowns

■ Look for the least squares solution:
  • ‖X̂ᵀŵ − y‖² → min
  • (X̂ᵀŵ − y)ᵀ(X̂ᵀŵ − y) → min
  • ŵᵀX̂X̂ᵀŵ − 2yᵀX̂ᵀŵ + yᵀy → min
  • Set the derivative to zero: 2X̂X̂ᵀŵ − 2X̂y = 0
  • ŵ = (X̂X̂ᵀ)⁻¹X̂y, where (X̂X̂ᵀ)⁻¹X̂ is the pseudo-inverse (see the sketch below).
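
A minimal numpy sketch of this least-squares classifier on made-up 2D data; note that the data matrix below stacks the augmented points as rows, i.e. it plays the role of X̂ᵀ from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2D training data: two blobs with labels +1 / -1
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([2, 2], 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

X_aug = np.hstack([X, np.ones((len(X), 1))])   # append a constant 1 to absorb w0

w_hat = np.linalg.pinv(X_aug) @ y              # least-squares solution via pseudo-inverse
w, w0 = w_hat[:2], w_hat[2]

pred = np.sign(X @ w + w0)
print("training accuracy:", np.mean(pred == y))
```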

SLIDE 17

First Attempt: Least Squares

■ Problem: Least squares is very sensitive to outliers.
  • (Figure: without outliers the least-squares discriminant works; with outliers present it breaks down.)

SLIDE 18

New Strategy

■ If our classes are linearly separable, we want to make sure that we find a separating (hyper)plane.
■ First such algorithm we will see:
  • The perceptron algorithm [Rosenblatt, 1962]
  • A true classic!

SLIDE 19

Perceptron Algorithm

■ Perceptron discriminant function: y(x) = sign(wᵀx + b)
■ Algorithm (a runnable sketch follows below):
  • Start with some "weight" vector w and some bias b.
  • For all data points xi with class labels yi ∈ {−1, 1}:
    • If xi is correctly classified, i.e. y(xi) = yi, do nothing.
    • Otherwise, if yi = 1, update the parameters using: w ← w + xi, b ← b + 1
    • Otherwise, if yi = −1, update the parameters using: w ← w − xi, b ← b − 1
  • Repeat until convergence.
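
A short numpy sketch of these updates on made-up, linearly separable data; both cases are written compactly as w ← w + yi·xi, b ← b + yi:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.4, (20, 2)), rng.normal([3, 3], 0.4, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w, b = np.zeros(2), 0.0
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:       # xi is misclassified (or on the boundary)
            w, b = w + yi * xi, b + yi   # perceptron update
            mistakes += 1
    if mistakes == 0:                    # converged: every point correctly classified
        break

print("epochs:", epoch + 1, "| w =", w, "| b =", b)
```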

SLIDES 20-23

Perceptron Algorithm

■ Intuition: (sequence of figures showing the perceptron updates on a 2D toy problem)

SLIDE 24

Does it work?

■ We just saw that the perceptron works in this simple case, but does it work in general?
  • Notational convenience as before: wᵀx + b = ŵᵀx̂

■ Novikoff's theorem [1962]:
  Theorem 2.1. Let S = {(x1, y1), ..., (xn, yn)} be a data set containing at least one positive and one negative example. Set R ≡ maxᵢ ‖xᵢ‖. Assume that there exists a weight vector w* with ‖w*‖ = 1 such that γᵢ = yᵢ⟨w*, xᵢ⟩ ≥ γ for all examples. Then the perceptron will not make more than (R/γ)² update steps.

  • This means that if the data is linearly separable, the perceptron algorithm will find a separating hyperplane.
  • If it is not separable, the algorithm will never converge.

SLIDE 25

But is it useful?

■ How often is real data linearly separable?
■ Simple failure example: the "XOR function"
  • History: Minsky & Papert [1969] criticized the perceptron for not being able to handle this case, which halted research on this and related techniques for decades.

SLIDE 26

Other Feature Spaces

■ It took a long time until people realized that there is a simple way out.
■ Key idea: Transform the input data nonlinearly so that the problem becomes linearly separable!

SLIDE 27

Generalized Linear Discriminant Functions

■ Instead of using the linear discriminant function y(x) = wᵀx + b, we use a generalized version: y(x) = wᵀφ(x) + b
  • Here φ(x) is a nonlinear transformation.

■ All our linear algorithms (i.e. least squares, the perceptron) can be used after the transformation:
  • We can represent and find nonlinear decision boundaries.

SLIDE 28

Generalized Linear Discriminant Functions

■ Simple example: Polar coordinates
  φ([x1, x2]ᵀ) = (√(x1² + x2²), arctan(x2/x1))ᵀ
  • If one class lies within a circle and the other one outside, we can use a linear classifier on the radius from the center to perfectly classify the data.
  • We use a linear classifier in the transformed feature space, but if we look at the original space, the classifier is nonlinear!

SLIDE 29

Polynomial Transformations

■ Let us consider a more general case:
  • Suppose we want to represent decision boundaries that are polynomials of a certain degree.
  • Simplest case (other than linear): degree 2.

■ Quadratic (decision) surfaces: y(x) = xᵀAx + bᵀx + c, with x, b ∈ R^n and A ∈ R^{n×n}
  • Assume we are in 2D:
    y(x) = A11·x1² + (A12 + A21)·x1x2 + A22·x2² + b1·x1 + b2·x2 + c

SLIDE 30

Polynomial Transformations

■ How can we find a suitable transformation φ(x) that maps x into a space in which this quadratic
  y(x) = A11·x1² + (A12 + A21)·x1x2 + A22·x2² + b1·x1 + b2·x2 + c
  can be represented by a linear decision boundary?
  • That is actually quite easy: φ(x) = (x1², x1x2, x2², x1, x2)ᵀ
  • Then we can use our linear classifiers to also find quadratic decision boundaries using y(x) = wᵀφ(x) + b

SLIDE 31

Polynomial Transformations

■ Example: Circular decision boundary (see the code sketch below):
  y(x) = x1² + x2² − r²
       = 1·x1² + 0·x1x2 + 1·x2² + 0·x1 + 0·x2 − r²
       = (1, 0, 1, 0, 0)·(x1², x1x2, x2², x1, x2)ᵀ − r²
       = (1, 0, 1, 0, 0)·φ(x) − r²
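
A small sketch of this quadratic feature map and the resulting circular boundary (the radius r is a made-up value):

```python
import numpy as np

def phi(x):
    # Quadratic feature map (x1^2, x1*x2, x2^2, x1, x2)
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2])

r = 1.0                                   # hypothetical radius
w = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # weights encoding x1^2 + x2^2
b = -r**2                                 # offset encoding -r^2

for x in ([0.2, 0.3], [1.5, 0.0], [0.7, 0.7]):
    y = w @ phi(np.array(x)) + b          # linear in phi(x), circular in x
    print(x, "inside" if y < 0 else "outside")
```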

SLIDE 32

Polynomial Transformations

■ What about higher dimensions (N > 2)?
■ What about higher-order polynomials (d > 2)?
■ We can use the same idea:
  • We transform the vector x = (x1, ..., xN)ᵀ into the space of all monomials of degree d.
  • What does that mean? All possible terms of the form x_{i1} · x_{i2} · ... · x_{id}
  • Problem: There are binom(d + N − 1, d) such terms!
  • Example: d = 5, N = 256 ⇒ ≈ 10^10 dimensions in the transformed space (checked below)
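
A quick check of that count (the number of unordered monomials of degree d in N variables is the binomial coefficient named above):

```python
from math import comb

d, N = 5, 256
print(comb(d + N - 1, d))   # 9525431552, i.e. roughly 10^10 monomials
```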

SLIDE 33

Polynomial Transformations

■ This is clearly too high-dimensional to be practical.
  • We would need to have 10^10 parameters...
  • Generally, we should have many more data points than we have parameters.
  • But wanting to classify 16 x 16 images of, say, handwritten digits doesn't sound like an unreasonable task.

■ What to do?
  • Let us revisit the perceptron.

SLIDE 34

Perceptron Algorithm (recap)

■ Perceptron discriminant function: y(x) = sign(wᵀx + b)
■ Algorithm:
  • Start with some "weight" vector w and some bias b.
  • For all data points xi with class labels yi ∈ {−1, 1}:
    • If xi is correctly classified, i.e. y(xi) = yi, do nothing.
    • Otherwise, if yi = 1, update the parameters using: w ← w + xi, b ← b + 1
    • Otherwise, if yi = −1, update the parameters using: w ← w − xi, b ← b − 1
  • Repeat until convergence.

SLIDE 35

Toward a Nonlinear Perceptron

■ Let us first rewrite the update rule: w ← w + yi·xi
  • This works for both cases: yi = +1 and yi = −1

■ Multiple rounds of the perceptron simply sum up data points:
  w = Σ_{i=1}^{n} αi yi xi
  • n: number of data points
  • αi: number of times that xi was selected for an update

SLIDE 36

Toward a Nonlinear Perceptron

■ Consequence:
  • We can express w as a weighted sum of the training data points xi:
    w = Σ_{i=1}^{n} αi yi xi

■ Discriminant function:
  y(x) = sign(wᵀx + b) = sign( Σ_{i=1}^{n} αi yi xiᵀx + b )

SLIDE 37

Dual Parameterization

■ We have a new, so-called dual parameterization of our discriminant function:
  y(x) = sign( Σ_{i=1}^{n} αi yi xiᵀx + b )
  • There are only as many parameters αi as there are training examples.
  • The dimensionality of x does not matter anymore.

■ We can now formulate a dual perceptron algorithm.

SLIDE 38

Dual Perceptron Algorithm

■ Dual perceptron discriminant function:
  y(x) = sign( Σ_{i=1}^{n} αi yi xiᵀx + b )
■ Algorithm:
  • Start with αi = 0 for all i and with b = 0.
  • For all data points xi with class labels yi ∈ {−1, 1}:
    • If xi is correctly classified, i.e. y(xi) = yi, do nothing.
    • Otherwise update the parameters using: αi ← αi + 1, b ← b + yi
  • Repeat until convergence.

SLIDE 39

Dual Perceptron Algorithm

■ Back to the nonlinear case:
  y(x) = sign( Σ_{i=1}^{n} αi yi φ(xi)ᵀφ(x) + b )
  • The feasibility of this dual nonlinear perceptron depends only on whether we can compute the scalar product φ(xi)ᵀφ(x) between the transformed features efficiently (see the sketch below).
  • Later: We will see that for a large class of mappings φ(x) this scalar product can be computed efficiently without having to go to the high-dimensional feature space.
  • In fact, we can even use "infinite-dimensional features".
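
A compact sketch of the dual perceptron with a pluggable scalar product; the kernel below is just the ordinary dot product, but any k(xi, xj) = φ(xi)ᵀφ(xj) could be substituted (the data and the kernel choice are illustrative assumptions):

```python
import numpy as np

def kernel(a, b):
    # Plain dot product; replace by any valid kernel k(a, b) = phi(a)^T phi(b)
    return a @ b

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.4, (15, 2)), rng.normal([2.5, 2.5], 0.4, (15, 2))])
y = np.hstack([-np.ones(15), np.ones(15)])

alpha, b = np.zeros(len(X)), 0.0

def predict(x):
    # Dual discriminant: sign( sum_i alpha_i y_i k(x_i, x) + b )
    return np.sign(sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(len(X))) + b)

for _ in range(100):
    mistakes = 0
    for i in range(len(X)):
        if predict(X[i]) != y[i]:
            alpha[i] += 1          # one more update counted for this point
            b += y[i]
            mistakes += 1
    if mistakes == 0:
        break

print("updates per point:", alpha.astype(int), "| b =", b)
```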

SLIDE 40

Generalization Abilities

■ Example: Classify students who will pass the exam. What are some common challenges?
  • Problem 1: Possibly irrelevant features (figure axes: shoe size, matriculation number)
  • Problem 2: Overfitting from an overly complex model
  • We are really interested in the generalization ability and the corresponding risk.

SLIDE 41

Cross Validation

■ Split the data into training and test sets:
  • Estimate the generalization abilities from the test error.
  • Choose the model with the minimal test error (figure: training and test error as a function of the # of model parameters).
  • Extreme case: Leave-one-out (figure: runs 1-4, each holding out a different part of the data for testing and training on the rest; a k-fold sketch follows below).
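
A minimal k-fold cross-validation sketch; the nearest-class-mean classifier and the data are placeholders, and any of the classifiers from these slides could be plugged in instead:

```python
import numpy as np

def cross_validate(X, y, fit, predict, k=4):
    # Split the data into k folds; train on k-1 folds, test on the held-out fold
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean(predict(model, X[test]) != y[test]))
    return np.mean(errors)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# Placeholder classifier: assign each point to the nearest class mean
fit = lambda X, y: (X[y == -1].mean(axis=0), X[y == 1].mean(axis=0))
predict = lambda m, X: np.where(np.linalg.norm(X - m[0], axis=1) <
                                np.linalg.norm(X - m[1], axis=1), -1, 1)
print("estimated test error:", cross_validate(X, y, fit, predict))
```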

SLIDE 42

Statistical Learning Theory

■ Statistical learning theory (Vapnik):
  • ... is concerned with the question of how one can control the generalization abilities (of a learning machine).
  • ... aims at a formal description of the generalization ability.
  • The goal is to develop a rigorous theory as opposed to commonly used heuristics.

■ Important note:
  • This may be a noble goal, but the theory unfortunately does not say that much about real problems...

SLIDE 43

Risk

■ Assessment of prediction performance:
  • Loss function L(y, f(x; w))
  • Empirical risk: R_emp(w) = (1/N) Σ_{i=1}^{N} L(yi, f(xi; w))
  • Example: quadratic loss function (see the snippet below):
    R_emp(w) = (1/N) Σ_{i=1}^{N} (yi − f(xi; w))²
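
As a tiny illustration, the empirical risk under the quadratic loss is just the mean of the squared residuals (the labels and predictions below are made up):

```python
import numpy as np

def empirical_risk(y_true, y_pred, loss=lambda y, f: (y - f) ** 2):
    # R_emp(w) = (1/N) * sum_i L(y_i, f(x_i; w))
    return np.mean(loss(np.asarray(y_true), np.asarray(y_pred)))

y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([0.8, -0.9, 0.3, 1.1, -0.2])   # hypothetical outputs f(x_i; w)
print("empirical risk (quadratic loss):", empirical_risk(y_true, y_pred))
```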

SLIDE 44

Risk

■ In reality, we are instead interested in the true risk:
  R(w) = ∫ L(y, f(x; w)) p(x, y) dx dy
  • p(x, y) is the true probability density of x and y.
  • The risk is the expected error over all data sets; it is the expectation of the generalization error.
  • Problem: p(x, y) is fixed, but usually unknown.
  • We cannot compute the (actual) risk directly.

SLIDE 45

Generalization Abilities

■ Empirical vs. true risk:
  • All 3 decision boundaries in the figure have zero empirical risk. Which one is preferable? Which one generalizes best to unseen data?

SLIDE 46

Risk

■ True risk:
  • Advantage: Measure of the generalization ability
  • Disadvantage: In general, we do not know p(x, y).

■ Empirical risk:
  • Disadvantage: No direct measure of the generalization ability
  • Advantage: Does not depend on p(x, y)
  • Learning algorithms often minimize the empirical risk.

■ We are interested in the dependencies between these two risks.

SLIDE 47

Risk Bound

■ Idea:
  • Determine an upper bound on the true risk based on the empirical risk:
    R(w) ≤ R_emp(w) + ε(N, p, h)
  • N: # of training data points
  • p: probability that the bound is met
  • h: "learning power" of the learning machine (formally: VC-dimension)

SLIDE 48

"Learning Power"

■ Illustration [Florian Markowetz]: classifiers that are too simple, too complex, or a good tradeoff (figure legend: positive examples, negative examples, a new patient to classify; "?" marks uncertain predictions).

SLIDE 49

VC-Dimension

■ VC stands for Vapnik-Chervonenkis.
■ (Informal) definition of the VC-dimension:
  • The VC-dimension of a family of functions is the maximal number of data points that can be correctly classified by a function from that family, no matter which label configuration the data points have.
  • The VC-dimension is a measure of the capacity or "learning power" of a classifier.

SLIDE 50

Risk Bound

■ (Figure illustrating the behavior of the risk bound and the empirical risk R_emp(w).)

SLIDE 51

Structural Risk Minimization

■ Principle:
  • Given a family of models fi(x; wi) with non-decreasing VC-dimension h1 ≤ h2 ≤ ... ≤ hn:
  • Minimize the empirical risk for every model.
  • Choose the model that minimizes the risk bound (the right-hand side of the inequality).
  • In general, this is not the same model that minimizes the empirical risk.

■ Note:
  • This is formally justified by the bound on the true risk.
  • The result is only sensible, however, if the upper bound on the true risk is a tight bound.

SLIDE 52

Structural Risk Minimization

■ How can we implement structural risk minimization?
  R(w) ≤ R_emp(w) + ε(N, p, h)

■ Classically:
  • Keep ε(N, p, h) constant and minimize R_emp(w).
  • By keeping some model parameters fixed (e.g. # of hidden nodes etc.), ε(N, p, h) is fixed.

■ Support Vector Machines (SVM):
  • Keep R_emp(w) constant and minimize ε(N, p, h).
  • In practice: R_emp(w) = 0 with separable data.
  • ε(N, p, h) is controlled by changing the VC-dimension ("capacity control").

SLIDE 53

Support Vector Machines

■ ...or short SVMs:
  • Linear classifiers (generalized later)
  • Approximate implementation of the structural risk minimization principle.
  • If the data is linearly separable, the empirical risk of SVM classifiers will be zero, and the risk bound will be approximately minimized.
  • Because of that, SVMs have built-in "guaranteed" generalization abilities.

SLIDE 54

Support Vector Machines

■ For now: linearly separable data
  • N training data points: {xi, yi}, i = 1, ..., N, with xi ∈ R^d and yi ∈ {−1, 1}
  • Hyperplane that separates the data: y(x) = wᵀx + w0 (figure as in slide 8: boundary y = 0 between regions R1 and R2)
  • Which hyperplane shall we use?
  • How can we minimize the VC-dimension?

SLIDE 55

Support Vector Machines

■ Intuitively:
  • We should find the hyperplane with the maximum "distance" to the data: maximize the margin, i.e. the distance to the closest data point.
  • This hyperplane generalizes best to unseen data (figure: hyperplanes y = −1, y = 0, y = 1 and the margin between them).

SLIDE 56

Support Vector Machines

■ Maximizing the margin:
  • Why does that make sense?
  • Why does this minimize the VC-dimension?

■ Key result (Vapnik):
  • If the data points lie in a sphere of radius R, i.e. ‖xi‖ < R,
  • and the margin of the linear classifier in d dimensions is γ,
  • then: h ≤ min(d, 4R²/γ²)

■ Maximizing the margin lowers a bound on the VC-dimension!

SLIDE 57

Support Vector Machines

■ Want to find a hyperplane so that the data is linearly separated:
  yi(wᵀxi + b) ≥ 1, ∀i
  • Enforce yi(wᵀxi + b) = 1 for at least one data point.

■ Then we can easily express the margin:
  • The distance of a point to the hyperplane is y(xi)/‖w‖ = (wᵀxi + b)/‖w‖.
  • This means that the margin is 1/‖w‖.

SLIDE 58

Support Vector Machines

■ Illustration (figure): hyperplanes y = −1, y = 0, y = 1.
■ Support vectors:
  • All data points that lie on the margin, i.e. yi(wᵀxi + b) = 1.

SLIDE 59

Support Vector Machines

■ Maximizing the margin 1/‖w‖ is equivalent to minimizing ‖w‖².
■ Formulate as a constrained optimization problem:
  arg min_{w,b} (1/2)‖w‖²  s.t.  yi(wᵀxi + b) − 1 ≥ 0
■ Lagrangian formulation:
  • Minimize the Lagrangian (with multipliers αi ≥ 0):
    L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi (yi(wᵀxi + b) − 1)

SLIDE 60

Support Vector Machines

■ Minimize L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi (yi(wᵀxi + b) − 1)
■ As usual, set the gradient to zero:
  • ∂L(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1}^{N} αi yi xi
  • ∂L(w, b, α)/∂b = 0  ⇒  Σ_{i=1}^{N} αi yi = 0
  • The separating hyperplane is a linear combination of the input data.
  • But what are the αi?

SLIDE 61

Sparsity

■ Important property:
  • Almost all of the αi are zero.
  • There are only a few support vectors.
  • But the hyperplane was written as w = Σ_{i=1}^{N} αi yi xi.

■ SVMs are sparse learning machines:
  • The classifier only depends on a few data points.

SLIDE 62

Dual Form

■ Let's rewrite the Lagrangian:
  L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi (yi(wᵀxi + b) − 1)
             = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi wᵀxi − b Σ_{i=1}^{N} αi yi + Σ_{i=1}^{N} αi
■ But we know that Σ_{i=1}^{N} αi yi = 0, so:
  L̂(w, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi wᵀxi + Σ_{i=1}^{N} αi

SLIDE 63

Dual Form

■ Use the constraint that w = Σ_{i=1}^{N} αi yi xi:
  L̂(w, α) = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi wᵀxi + Σ_{i=1}^{N} αi
           = (1/2)‖w‖² − Σ_{i=1}^{N} αi yi Σ_{j=1}^{N} αj yj xjᵀxi + Σ_{i=1}^{N} αi
           = (1/2)‖w‖² − Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi) + Σ_{i=1}^{N} αi

SLIDE 64

Dual Form

■ We further use the fact that:
  (1/2)‖w‖² = (1/2) wᵀw = (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
■ From which we obtain the so-called Wolfe dual by substitution:
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • We now solve the original problem by maximizing this dual function.

SLIDE 65

Support Vector Machines

■ Maximize
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • under the conditions that Σ_{i=1}^{N} αi yi = 0 and αi ≥ 0.

■ The separating hyperplane is given by the N_S support vectors (a numeric check follows below):
  w = Σ_{i=1}^{N_S} αi yi xi
  • b can be calculated as well, but we will skip that here...
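
As an illustration beyond the slides, one can solve this quadratic program with an off-the-shelf SVM solver and check the relation w = Σ αi yi xi; the sketch assumes scikit-learn is available and uses a large C to approximate the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Hypothetical linearly separable data
X = np.vstack([rng.normal([0, 0], 0.4, (25, 2)), rng.normal([3, 3], 0.4, (25, 2))])
y = np.hstack([-np.ones(25), np.ones(25)])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("number of support vectors:", len(clf.support_vectors_))
print("w from the dual variables:", w_from_dual.ravel())
print("w reported by the solver: ", clf.coef_.ravel())
```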

SLIDE 66

SVMs so far

■ Both the original SVM formulation (primal) as well as the derived dual formulation are quadratic programming problems.
  • They have unique solutions
  • ... which can be computed efficiently.

■ Why did we bother to go to the dual form?
  • To go beyond linear classifiers!

SLIDE 67

Nonlinear SVMs

■ Nonlinear transformation of the data: φ : R^d → H, applied to x ∈ R^d
■ Hyperplane wᵀφ(x) + b = 0 in H (linear classifier in H).
■ Nonlinear classifier in R^d.
■ This is the same trick that we applied to the perceptron.
  • What is so special here?

SLIDE 68

Nonlinear SVMs

■ Dual form:
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • subject to Σ_{i=1}^{N} αi yi = 0 and αi ≥ 0

■ With a nonlinear transformation, we obtain:
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (φ(xj)ᵀφ(xi))
  • φ(xi) only appears in scalar products with another φ(xj).
  • We only need to be able to evaluate scalar products.

SLIDE 69

Nonlinear SVMs

■ But what about the discriminant function y(x) = wᵀφ(x) + b?
  • We still need to represent w. No?
  • No, we can represent w differently: w = Σ_{i=1}^{N_S} αi yi φ(xi), where N_S is the # of support vectors.
  • The nonlinear discriminant function can thus be written as:
    y(x) = Σ_{i=1}^{N_S} αi yi φ(xi)ᵀφ(x) + b
  • It can also be written with scalar products of the nonlinear features only.

SLIDE 70

Nonlinear SVMs

■ Both the dual optimization problem and the discriminant function can be written in terms of scalar products of the features.
  • We have already seen this when we talked about the dual version of the perceptron.
  • In fact, the discriminant function even has the very same functional form:
    y(x) = Σ_{i=1}^{N_S} αi yi φ(xi)ᵀφ(x) + b
  • Key difference: In an SVM the parameters αi maximize the margin of the classifier.
  • SVMs therefore have built-in generalization properties.

SLIDE 71

Kernel Trick

■ "Kernel trick": Replace every occurrence of a scalar product between features with a kernel function:
  K(xi, xj) = φ(xi)ᵀφ(xj)
  • If we can find a kernel function that is equivalent to this scalar product, we can avoid mapping into a high-dimensional space and instead compute the scalar product directly.

■ What are examples of such kernels and when do they exist?

SLIDE 72

Example: Polynomial Kernel

■ Polynomial kernel of 2nd degree: K(x, y) = (xᵀy)², with x, y ∈ R²
  • Equivalence (verified numerically below):
    K(x, y) = (xᵀy)² = x1²y1² + 2x1x2y1y2 + x2²y2²
            = φ(x)ᵀφ(y)  with  φ(x) = (x1², √2·x1x2, x2²)ᵀ
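
A quick numeric check of this equivalence on random 2D vectors (purely illustrative):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2nd-degree polynomial kernel in 2D
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(6)
x, y = rng.normal(size=2), rng.normal(size=2)

print((x @ y) ** 2)      # kernel evaluated directly
print(phi(x) @ phi(y))   # scalar product in the transformed space (same value)
```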

SLIDE 73

Example: Polynomial Kernel

■ Polynomial kernel of 2nd degree: K(x, y) = (xᵀy)², with x, y ∈ R²
  • Another equivalent feature map:
    K(x, y) = (xᵀy)² = x1²y1² + 2x1x2y1y2 + x2²y2² = φ(x)ᵀφ(y)
    with φ(x) = (1/√2)·(x1² − x2², 2x1x2, x1² + x2²)ᵀ
  • This means that φ(x) is not unique for a given K(x, y).

SLIDE 74

Example: Polynomial Kernel

■ In general: Polynomials of degree d
  • Let C_d(x) be the transformation that maps a vector into the space of all ordered monomials of degree d.
  • We can represent all polynomials of degree d as linear functions in this transformed space.
  • Example (d = 2):
    • Ordered monomials: x1², x1x2, x2x1, x2²
    • Unordered monomials: x1², x1x2, x2²
  • The kernel K(x, y) = (xᵀy)^d lets us compute arbitrary scalar products without doing the explicit mapping:
    K(x, y) = (xᵀy)^d = C_d(x)ᵀC_d(y)

SLIDE 75

Example: Polynomial Kernel

■ Polynomials of degree d: K(x, y) = (xᵀy)^d
  • Dimensionality of the transformed space H: binom(d + N − 1, d)
  • Example: N = 16 × 16 = 256, d = 4 ⇒ dim(H) = 183,181,376
  • The classifier has VC-dimension dim(H) + 1!

SLIDE 76

SVM: Linear Case

■ (Figure.)

SLIDE 77

SVM with Kernels

■ Polynomial kernel of degree 3:
  • (Figure captions: linearly separable, classifier almost linear; not linearly separable in the original space.)

SLIDE 78

Constructing Kernels

■ So far we proceeded as follows:
  • We identify some nonlinear transformation φ(x) that we think will be useful.
  • Then we find a kernel K(xi, xj) that allows us to compute the scalar product without making the mapping explicit:
    K(xi, xj) = φ(xi)ᵀφ(xj)

■ What do kernels do?
  • They measure similarity (in a transformed space).
  • But what if we have a notion of similarity and want to encode this in a kernel function K(xi, xj) directly?

SLIDE 79

Radial Basis Function Kernel

■ Gaussian radial basis function (RBF) kernel:
  K(x, y) = exp( −‖x − y‖² / (2σ²) )
  • Measures similarities (a Gram-matrix sketch follows below).
  • Interesting property: H is infinite-dimensional.
  • Intuition: exp(x) = 1 + x/1! + x²/2! + ... + xⁿ/n! + ...
  • Since we only use the kernel function, this is no problem.
  • But: The hyperplane also has infinite VC-dimension!
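
A small sketch computing the RBF Gram matrix for a handful of made-up points (σ is an illustrative choice):

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    # K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0]])
print(np.round(rbf_gram(X, X), 3))   # nearby points give values near 1, distant points near 0
```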

SLIDE 80

Radial Basis Function Kernel

■ (Figure.)

SLIDE 81

VC-Dimension for Gaussian RBF Kernel

■ Intuition:
  • If we can make the radius of the kernel arbitrarily small, then at some point every data point will have its "own" kernel, so any labeling of the data can be realized.
  • But in contrast: If we bound the radius of the RBF from below, we can limit the VC-dimension!

SLIDE 82

Kernels

■ Question: Is the Gaussian RBF kernel a valid kernel, i.e. is there a mapping {H, φ} with φ : R^d → H so that K(x, y) = φ(x)ᵀφ(y)?
  • How can we assess this more generally?

■ Mercer's Condition:
  • A function K(x, y) is a valid kernel if for every g(x) with ∫ g(x)² dx < ∞ it holds that:
    ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0
  • (A finite-sample check follows below.)
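
In practice one often checks the finite-sample counterpart of this condition: the Gram matrix on any finite set of points must be positive semi-definite. A small sketch with illustrative random points:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 3))                       # arbitrary sample points
K = np.array([[rbf(a, b) for b in X] for a in X])  # Gram matrix

print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 (up to rounding) for a valid kernel
```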

SLIDE 83

Kernels satisfying Mercer's condition

■ Inhomogeneous polynomial kernel: K(x, y) = (xᵀy + c)^d
  • Can also represent polynomials of degree < d.
■ Gaussian RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
■ Hyperbolic tangent kernel: K(x, y) = tanh(a·xᵀy + b)

SLIDE 84

Combining Kernels

■ It may not always be easy to check whether Mercer's condition is satisfied, but it is possible to construct new kernels out of known ones.
■ If K1(x, y) and K2(x, y) are valid kernels, then so are (see the sketch below):
  • c · K1(x, y)
  • K1(x, y) + K2(x, y)
  • K1(x, y) · K2(x, y)
  • f(x) · K1(x, y) · f(y)
  • Etc.
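
A short sketch building combined Gram matrices from two base kernels on made-up points; the combinations listed above stay positive semi-definite:

```python
import numpy as np

def linear_gram(X):
    return X @ X.T

def rbf_gram(X, sigma=1.0):
    d2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

X = np.random.default_rng(8).normal(size=(6, 2))
K1, K2 = linear_gram(X), rbf_gram(X)

K_sum, K_prod, K_scaled = K1 + K2, K1 * K2, 2.0 * K1
for K in (K_sum, K_prod, K_scaled):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # still positive semi-definite
```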

SLIDE 85

Non-separable data

■ What if our data is not linearly separable?
  • Simple solution: Transform the features into a space in which they become linearly separable, e.g. using an RBF kernel with a small kernel radius.
  • Problem: Such a classifier will have a very high VC-dimension, and thus a large capacity. This will lead to overfitting.
  • Solution: Allow data points to "violate the margin".

SLIDE 86

Support Vector Machines

■ Instead of requiring that the data is perfectly linearly separable,
  wᵀxi + b ≥ +1 for yi = +1
  wᵀxi + b ≤ −1 for yi = −1,
■ we allow for small violations ξi from perfect separation:
  wᵀxi + b ≥ +1 − ξi for yi = +1
  wᵀxi + b ≤ −1 + ξi for yi = −1
  ξi ≥ 0 ∀i

SLIDE 87

Support Vector Machines

■ More concisely, we require that:
  yi(wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0  ∀i
  • The ξi are called "slack variables".

■ Illustration (figure): hyperplanes y = −1, y = 0, y = 1; points with ξ = 0 respect the margin, points with ξ < 1 violate the margin but are still correctly classified, and points with ξ > 1 are misclassified.

SLIDE 88

Support Vector Machines

■ We have to penalize the deviations:
  arg min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{N} ξi   s.t.  yi(wᵀxi + b) − 1 + ξi ≥ 0,  ξi ≥ 0
  • Maximize the margin while minimizing the penalty for all data points that violate the margin.
  • The weight C allows us to specify a trade-off; it is typically determined through cross-validation (see the sketch below).
  • Even if the data is separable, it may be better to allow for an occasional penalty.
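
A minimal sketch of choosing C by cross-validation, using scikit-learn's soft-margin SVM as the solver (the data and the grid of C values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
# Hypothetical overlapping classes (not linearly separable)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 0.1, 1, 10, 100):
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C = {C:>6}: cross-validated accuracy = {scores.mean():.3f}")
```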

SLIDE 89

Support Vector Machines

■ Dual formulation: Maximize
  L̃(α) = Σ_{i=1}^{N} αi − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} αi αj yi yj (xjᵀxi)
  • under the conditions that Σ_{i=1}^{N} αi yi = 0 and 0 ≤ αi ≤ C (the "box constraint").

■ The separating hyperplane is given by the N_S support vectors:
  w = Σ_{i=1}^{N_S} αi yi xi

SLIDE 90

Application Example 1

■ Text classification: Joachims (1997)
  • Problem: Classify documents into a number of categories (topics, etc.)
  • The text is represented using word statistics, i.e. histograms of the word frequency:
    • We count how often every word occurs and ignore their order ("bag of words").
    • Very high-dimensional feature space (roughly 10,000 dimensions)
    • Very few features that are not relevant (difficult to apply feature selection or dimensionality reduction)

SLIDE 91

Application Example 1

■ (Figure.)

SLIDE 92

Application Example 2

■ Handwritten digit classification
■ U.S. Postal Service Database

SLIDE 93

Application Example 2

■ Human performance: 2.5% error
■ Various learning algorithms:
  • 16.2%: Decision tree (C4.5)
  • 5.9%: 2-layer neural network
  • 5.1%: LeNet 1 - 5-layer neural network

■ Various SVM results:
  • 4.0%: Polynomial kernel (p=3, 274 support vectors)
  • 4.1%: Gaussian kernel (σ=0.3, 291 support vectors)

SLIDE 94

Application Example 2

■ Very little overfitting
  • Good generalization.

SLIDE 95

Application Example 2

■ Even better results:
  • Supply knowledge about invariances in the data (geometric deformations, etc.).
  • 2.7% error: Elastic matching (no learning)
    • Uses knowledge of how digits can deform
    • Classifies a test digit by finding the template that requires the least deformation

■ Recent results:
  • With more training data, better modeling of invariances, etc.
  • Error down to about 0.5% with SVMs and 0.4% with neural networks.