SLIDE 1

Statistical Machine Learning

Lecture 09: Classification

Kristian Kersting TU Darmstadt

Summer Term 2020

  • K. Kersting, based on slides from J. Peters · Statistical Machine Learning · Summer Term 2020

SLIDE 2

Today’s Objectives

Make you understand how to build a discriminative classifier! Covered topics:

  • Discriminant Functions
  • Multi-Class Classification
  • Fisher Discriminant Analysis
  • Perceptrons
  • Logistic Regression

SLIDE 3

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 4
  • 1. Discriminant Functions

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 5
  • 1. Discriminant Functions

Reminder of Bayesian Decision Theory

We want to find the a-posteriori probability (posterior) of the class $C_k$ given the observation (feature) $x$:

$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}$$

  • $p(C_k \mid x)$ - class posterior
  • $p(x \mid C_k)$ - class-conditional probability (likelihood)
  • $p(C_k)$ - class prior
  • $p(x)$ - normalization term

SLIDE 6
  • 1. Discriminant Functions

Reminder of Bayesian Decision Theory

Decision rule

Decide $C_1$ if $p(C_1 \mid x) > p(C_2 \mid x)$. Using the definition of conditional distributions, this is equivalent to

$$p(x \mid C_1)\, p(C_1) > p(x \mid C_2)\, p(C_2) \quad\equiv\quad \frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{p(C_2)}{p(C_1)}$$

A classifier obeying this rule is called a Bayes optimal classifier

SLIDE 7
  • 1. Discriminant Functions

Reminder of Bayesian Decision Theory

Current approach

  • $p(C_k \mid x) = p(x \mid C_k)\, p(C_k) / p(x)$ (Bayes' rule)
  • Model and estimate the class-conditional density $p(x \mid C_k)$ and the class prior $p(C_k)$
  • Compute the posterior $p(C_k \mid x)$
  • Minimize the error probability by maximizing $p(C_k \mid x)$

New approach

  • Directly encode the decision boundary
  • Without modeling the densities directly
  • Still minimize the error probability

SLIDE 8
  • 1. Discriminant Functions

Discriminant Functions

Formulate classification using comparisons

Discriminant functions $y_1(x), \dots, y_K(x)$. Classify $x$ as class $C_k$ iff $y_k(x) > y_j(x)\ \forall j \neq k$

More formally, a discriminant maps a vector x to one of the K available classes

SLIDE 9
  • 1. Discriminant Functions

Discriminant Functions

Example of discriminant functions from the Bayes classifier:

$$y_k(x) = p(C_k \mid x), \qquad y_k(x) = p(x \mid C_k)\, p(C_k), \qquad y_k(x) = \log p(x \mid C_k) + \log p(C_k)$$

SLIDE 10
  • 1. Discriminant Functions

Discriminant Functions

Base case with 2 classes: decide class 1 if

$$y_1(x) > y_2(x) \quad\Longleftrightarrow\quad y_1(x) - y_2(x) > 0 \quad\Longleftrightarrow\quad y(x) > 0$$

Example from the Bayes classifier:

$$y(x) = p(C_1 \mid x) - p(C_2 \mid x), \qquad y(x) = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{p(C_1)}{p(C_2)}$$

SLIDE 11
  • 1. Discriminant Functions

Example - Bayes Classifier

Base case with 2 classes and Gaussian class-conditionals
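To make this concrete, here is a minimal sketch of a Bayes classifier for this base case, assuming two 1D Gaussian class-conditionals; the means, standard deviations, and priors below are illustrative assumptions, not values from the lecture.

```python
from scipy.stats import norm

# Hypothetical 1D example: two classes with Gaussian class-conditionals.
mu1, sigma1, prior1 = 0.0, 1.0, 0.6   # class C1 (assumed)
mu2, sigma2, prior2 = 2.5, 1.0, 0.4   # class C2 (assumed)

def bayes_decide(x):
    """Decide C1 iff p(x | C1) p(C1) > p(x | C2) p(C2)."""
    score1 = norm.pdf(x, loc=mu1, scale=sigma1) * prior1
    score2 = norm.pdf(x, loc=mu2, scale=sigma2) * prior2
    return "C1" if score1 > score2 else "C2"

print(bayes_decide(0.5))   # -> C1 (close to the C1 mean)
print(bayes_decide(2.0))   # -> C2 (closer to the C2 mean)
```

The decision boundary sits where the two prior-weighted densities are equal; changing the priors shifts it.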

SLIDE 12
  • 1. Discriminant Functions

Linear Discriminant Functions

Base case with 2 classes: if $y(x) > 0$ decide class 1, otherwise class 2. Simplest case: a linear decision boundary.

In linear discriminants, the decision surfaces are (hyper)planes.

Linear Discriminant Function: $y(x) = w^\top x + w_0$, where $w$ is the normal vector and $w_0$ the offset.

SLIDE 13
  • 1. Discriminant Functions

Linear Discriminant Functions

Illustration of the 2D case: $y(x) = w^\top x + w_0$, with $x = (x_1, x_2)^\top$.

[Figure: geometry of the linear decision boundary. The hyperplane $y = 0$ separates region $R_1$ ($y > 0$) from region $R_2$ ($y < 0$); $w$ is the normal vector, the signed distance of a point $x$ from the boundary is $y(x)/\lVert w \rVert$, and the boundary lies at distance $-w_0/\lVert w \rVert$ from the origin.]
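A minimal numeric sketch of this geometry, with hypothetical example values for $w$ and $w_0$:

```python
import numpy as np

# Geometry of y(x) = w^T x + w0 in 2D; w and w0 are illustrative assumptions.
w = np.array([1.0, 2.0])   # normal vector of the decision boundary
w0 = -1.0                  # offset

def signed_distance(x):
    """Signed distance of x from the hyperplane w^T x + w0 = 0 (positive in R1)."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(signed_distance(x))          # > 0, so x lies in region R1
print(-w0 / np.linalg.norm(w))     # distance of the boundary from the origin
```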

SLIDE 14
  • 1. Discriminant Functions

Linear Discriminant Functions

SLIDE 15
  • 1. Discriminant Functions

Discriminant Functions

Why might we want to use discriminant functions? We could easily fit the class-conditionals using Gaussians and use a Bayes classifier

SLIDE 16
  • 1. Discriminant Functions

Discriminant Functions

How about now? Do these points matter for making the decision between the two classes?

SLIDE 17
  • 1. Discriminant Functions

Distribution-free Classifiers

  • We do not necessarily need to model all the details of the class-conditional distributions to come up with a good decision boundary. (The class-conditionals may have many intricacies that do not matter at the end of the day.)
  • If we can learn where to place the decision boundary directly, we can avoid some of the complexity.
  • It would be unwise to believe that such classifiers are inherently superior to probabilistic ones. We shall see why later...

SLIDE 18
  • 1. Discriminant Functions

Multi-Class Case

What if we constructed a multi-class classifier from several 2-class classifiers? If we base our decision rule on binary decisions, this may lead to ambiguities, where a point receives votes for several classes, e.g. for C1 and C2, or for C1, C2, and C3.

SLIDE 19
  • 1. Discriminant Functions

Multi-Class Case - Better Solution

Use a discriminant function to encode how strongly we believe in each class: $y_1(x), \dots, y_K(x)$. Decision rule: decide $C_k$ if $y_k(x) > y_j(x)\ \forall j \neq k$ (see the sketch below). If the discriminant functions are linear, the decision regions are connected and convex.
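A minimal sketch of this decision rule with $K = 3$ linear discriminants; the weights and offsets are illustrative assumptions:

```python
import numpy as np

# Multi-class decision with linear discriminants y_k(x) = w_k^T x + w_k0:
# decide the class whose discriminant value is largest.
W = np.array([[ 1.0,  0.0],     # w_1 (assumed)
              [ 0.0,  1.0],     # w_2 (assumed)
              [-1.0, -1.0]])    # w_3 (assumed)
w0 = np.array([0.0, 0.0, 0.5])  # offsets (assumed)

def classify(x):
    scores = W @ x + w0            # y_1(x), ..., y_K(x)
    return int(np.argmax(scores))  # index k of the winning class

print(classify(np.array([2.0, 0.5])))  # -> 0, i.e. class C1
```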

SLIDE 20
  • 2. Fisher Discriminant Analysis

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 21
  • 2. Fisher Discriminant Analysis

Linear Discriminant Functions

Illustration of the 2D case again: $y(x) = w^\top x + w_0$, with $x = (x_1, x_2)^\top$.

[Figure: geometry of the linear decision boundary, as on Slide 13: the hyperplane $y = 0$ separates $R_1$ ($y > 0$) from $R_2$ ($y < 0$), and $y(x)/\lVert w \rVert$ is the signed distance from the boundary.]

SLIDE 22
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Try to achieve a certain value of the discriminant function:

$$y(x) = +1 \;\Leftrightarrow\; x \in C_1, \qquad y(x) = -1 \;\Leftrightarrow\; x \in C_2$$

Training data inputs: $X = \{x_1 \in \mathbb{R}^d, \dots, x_n\}$. Training data labels: $Y = \{y_1 \in \{-1, +1\}, \dots, y_n\}$

Linear Discriminant Function

Try to enforce $x_i^\top w + w_0 = y_i$ for all $i = 1, \dots, n$. There is one linear equation for each training data point/label pair.

SLIDE 23
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Linear system of equations: $x_i^\top w + w_0 = y_i$ for all $i = 1, \dots, n$

Define $\hat{x}_i = \begin{pmatrix} x_i \\ 1 \end{pmatrix} \in \mathbb{R}^{d+1}$ and $\hat{w} = \begin{pmatrix} w \\ w_0 \end{pmatrix} \in \mathbb{R}^{d+1}$, and rewrite the equation system as $\hat{x}_i^\top \hat{w} = y_i$ for all $i = 1, \dots, n$

In matrix-vector notation: $\hat{X}^\top \hat{w} = y$, with $\hat{X} = [\hat{x}_1, \dots, \hat{x}_n] \in \mathbb{R}^{(d+1) \times n}$ and $y = [y_1, \dots, y_n]^\top$

SLIDE 24
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

$\hat{X}^\top \hat{w} = y$ is an overdetermined system of equations: there are $n$ equations and $d + 1$ unknowns.

SLIDE 25
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Look for the least squares solution:

$$\hat{w}^* = \arg\min_{\hat{w}} \left\lVert \hat{X}^\top \hat{w} - y \right\rVert^2 = \arg\min_{\hat{w}} \left( \hat{X}^\top \hat{w} - y \right)^\top \left( \hat{X}^\top \hat{w} - y \right) = \arg\min_{\hat{w}} \; \hat{w}^\top \hat{X} \hat{X}^\top \hat{w} - 2\, y^\top \hat{X}^\top \hat{w} + y^\top y$$

Setting the gradient to zero,

$$\nabla_{\hat{w}} \left( \hat{w}^\top \hat{X} \hat{X}^\top \hat{w} - 2\, y^\top \hat{X}^\top \hat{w} + y^\top y \right) = 0 \quad\Longrightarrow\quad \hat{w} = \underbrace{\left( \hat{X} \hat{X}^\top \right)^{-1} \hat{X}}_{\text{pseudo-inverse}} \, y$$
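A minimal sketch of this least-squares classifier on synthetic data. Note one convention change: here the data matrix stacks the points as rows, so the pseudo-inverse is applied to the $n \times (d+1)$ matrix rather than to its transpose; all data values are illustrative assumptions.

```python
import numpy as np

# Least-squares classifier via the pseudo-inverse on synthetic 2D data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(20, 2)),     # class with label -1
               rng.normal([3, 3], 1.0, size=(20, 2))])    # class with label +1
y = np.concatenate([-np.ones(20), np.ones(20)])

X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # append a 1 for the offset w0
w_hat = np.linalg.pinv(X_hat) @ y                   # least-squares solution
w, w0 = w_hat[:2], w_hat[2]

pred = np.sign(X_hat @ w_hat)                       # decide by the sign of y(x)
print("training accuracy:", np.mean(pred == y))
```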

SLIDE 26
  • 2. Fisher Discriminant Analysis

First Attempt: Least Squares

Problem: Least-squares is very sensitive to outliers

[Figure: two scatter plots with the learned decision boundary. Without outliers the least-squares discriminant works; with outliers the least-squares discriminant breaks down.]

SLIDE 27
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

  • Take a different view on linear classification: find a linear projection of our data and classify the projected values
  • This is the same thing as a linear discriminant function
  • Projection: $y = w^\top x$
  • Checking against a threshold: $w^\top x \geq -w_0$, or $w^\top x + w_0 \geq 0$

SLIDE 28
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

What is a good projection w?

Idea: Maximize the “distance” between the two classes to allow for a good separation

First attempt: maximize the distance between the class means

$$m_1 = \frac{1}{|C_1|} \sum_{i \in C_1} x_i, \qquad m_2 = \frac{1}{|C_2|} \sum_{i \in C_2} x_i$$

Projection of the means onto the 1D line of real numbers: $\tilde{m}_1 = w^\top m_1$, $\tilde{m}_2 = w^\top m_2$

Maximize the squared distance between the projected means: $\max_w \, (\tilde{m}_1 - \tilde{m}_2)^2$

SLIDE 29
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Maximize the squared distance between the projected means:

$$w^* = \arg\max_w \left( w^\top m_1 - w^\top m_2 \right)^2$$

Obvious problem: this grows unboundedly with the norm of $w$. Obvious solution: fix the norm of $w$:

$$\max_w \left( w^\top m_1 - w^\top m_2 \right)^2 \quad \text{s.t.} \quad \lVert w \rVert^2 = 1$$

A constrained optimization problem!

SLIDE 30
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

$$\max_w \left( w^\top m_1 - w^\top m_2 \right)^2 \quad \text{s.t.} \quad \lVert w \rVert^2 = 1$$

Necessary condition (Lagrange): $\nabla_w f(w) + \lambda \nabla_w g(w) = 0$, i.e.

$$2 \left( w^\top m_1 - w^\top m_2 \right) (m_1 - m_2) + 2 \lambda w = 0$$

It follows that

$$w = \frac{m_1 - m_2}{\lVert m_1 - m_2 \rVert}$$

SLIDE 31
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Here’s what we get


Obvious problem: large class overlap

SLIDE 32
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Here’s what we could get


Much better separation between classes How do we get this?

Idea: Separate the means as far as possible while minimizing the variance of each class

SLIDE 33
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Second (and final) attempt:

Define the within-class variances of the projected data:

$$s_1^2 = \sum_{n \in C_1} \left( w^\top x_n - \tilde{m}_1 \right)^2, \qquad s_2^2 = \sum_{n \in C_2} \left( w^\top x_n - \tilde{m}_2 \right)^2$$

where $\tilde{m}_1 = w^\top m_1$ and $\tilde{m}_2 = w^\top m_2$.

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2}$$

SLIDE 34
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2}$$

Rewrite the numerator:

$$\left( \tilde{m}_1 - \tilde{m}_2 \right)^2 = \left( w^\top m_1 - w^\top m_2 \right)^2 = \left( w^\top (m_1 - m_2) \right)^2 = w^\top \underbrace{(m_1 - m_2)(m_1 - m_2)^\top}_{=:\, S_B \;\text{(between-class covariance)}} \, w$$

SLIDE 35
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2}$$

Rewrite the denominator:

$$s_1^2 + s_2^2 = \sum_{n \in C_1} \left( w^\top x_n - \tilde{m}_1 \right)^2 + \sum_{n \in C_2} \left( w^\top x_n - \tilde{m}_2 \right)^2 = \sum_{n \in C_1} \left( w^\top (x_n - m_1) \right)^2 + \sum_{n \in C_2} \left( w^\top (x_n - m_2) \right)^2$$

$$= w^\top \underbrace{\left( \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^\top + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^\top \right)}_{=:\, S_W \;\text{(within-class covariance)}} \, w$$

SLIDE 36
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Fisher criterion:

$$\max_w J(w) = \frac{\left( \tilde{m}_1 - \tilde{m}_2 \right)^2}{s_1^2 + s_2^2} = \frac{w^\top S_B w}{w^\top S_W w}$$

Differentiating w.r.t. $w$ and setting to $0$, we have

$$\left( w^\top S_B w \right) S_W w = \left( w^\top S_W w \right) S_B w$$

Since $w^\top S_B w$ and $w^\top S_W w$ are scalars, we have that $S_W w \parallel S_B w$, where $\parallel$ means collinearity.

SLIDE 37
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

Also, we know that

$$S_B w = (m_1 - m_2)(m_1 - m_2)^\top w \quad\Longrightarrow\quad S_B w \parallel (m_1 - m_2)$$

Hence, we have $S_W w \parallel (m_1 - m_2)$, i.e. $w \parallel S_W^{-1} (m_1 - m_2)$.

Fisher's Linear Discriminant:

$$w \propto S_W^{-1} (m_1 - m_2)$$
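A minimal sketch of Fisher's linear discriminant on synthetic two-class data; the means and covariance used to generate the data, and the midpoint threshold, are illustrative assumptions:

```python
import numpy as np

# Fisher's linear discriminant: w ∝ S_W^{-1} (m1 - m2), then project onto w.
rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=50)  # class C1
X2 = rng.multivariate_normal([3, 1], [[2.0, 1.5], [1.5, 2.0]], size=50)  # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
w = np.linalg.solve(S_W, m1 - m2)                          # direction S_W^{-1}(m1 - m2)

# Project both classes onto w; a threshold still has to be chosen
# (here simply the midpoint of the projected means, see the next slide).
threshold = 0.5 * (w @ m1 + w @ m2)
pred_C1 = (X1 @ w) > threshold
print("fraction of C1 on its own side:", pred_C1.mean())
```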

SLIDE 38
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

$$w \propto S_W^{-1} (m_1 - m_2)$$

The Fisher linear discriminant only gives us a projection.

  • We still need to find the threshold, e.g., by using a Bayes classifier with Gaussian class-conditionals

Bayes optimality

  • Fisher's linear discriminant is Bayes optimal if the class-conditional distributions are Gaussian with equal covariance
  • Essentially equivalent to Linear Discriminant Analysis (LDA)

SLIDE 39
  • 2. Fisher Discriminant Analysis

Fisher’s Linear Discriminant

  • We won’t go through this here, but Fisher’s linear discriminant can be shown to be equivalent to a certain case of a least-squares linear classifier (see Bishop 4.1.5)
  • Problem with this method: it is still very sensitive to noise!
  • By the way: this method is a true classic (it dates back to 1936)

Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7: 179-188 (1936)
SLIDE 40
  • 3. Perceptron Algorithm

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 41
  • 3. Perceptron Algorithm

New Strategy

If our classes are linearly separable, we want to make sure that we find a separating (hyper)plane. The first such algorithm we will see:

The perceptron algorithm [Rosenblatt, 1962]

Rosenblatt [1928-1971]

SLIDE 42
  • 3. Perceptron Algorithm

Perceptron Algorithm

Perceptron discriminant function:

$$y(x) = \operatorname{sign}\left( w^\top x + b \right), \qquad \operatorname{sign}(a) = \begin{cases} +1 & a > 0 \\ 0 & a = 0 \\ -1 & a < 0 \end{cases}$$

SLIDE 43
  • 3. Perceptron Algorithm

Perceptron Algorithm

Perceptron Algorithm

  • Initialize the weight vector $w$ and bias $b$
  • For all pairs of data points $(x_i, y_i)$, where $y_i \in \{-1, +1\}$:
      • If $x_i$ is correctly classified, i.e., $y(x_i) = y_i$, do nothing
      • Else if $y_i = +1$, update the parameters with $w \leftarrow w + x_i$, $b \leftarrow b + 1$
      • Else if $y_i = -1$, update the parameters with $w \leftarrow w - x_i$, $b \leftarrow b - 1$
  • Repeat until convergence
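A minimal sketch of the algorithm above on a small, linearly separable toy dataset (the data points are illustrative assumptions; the two label-dependent updates are folded into one line via $y_i \cdot x_i$):

```python
import numpy as np

# Perceptron learning on a tiny linearly separable dataset (assumed values).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])

w = np.zeros(2)
b = 0.0
for _ in range(100):                      # "repeat until convergence" (capped)
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:        # misclassified (or on the boundary)
            w += yi * xi                  # w <- w ± xi, depending on the label
            b += yi                       # b <- b ± 1
            errors += 1
    if errors == 0:
        break

print(w, b, np.sign(X @ w + b))           # the signs should match y
```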

SLIDE 44
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 45
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 46
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 47
  • 3. Perceptron Algorithm

Perceptron Algorithm - Intuition

SLIDE 48
  • 3. Perceptron Algorithm

Perceptron Algorithm

Why does this algorithm work? We have an optimization problem over the set of misclassified points $\{x \in X : \langle w, x \rangle < 0\}$ (with the class labels absorbed into the $x$, so that $\langle w, x \rangle < 0$ marks a misclassified point):

$$\max_w \; J(w) = \sum_{x \in X :\, \langle w, x \rangle < 0} \langle w, x \rangle$$

and also a gradient method:

$$\frac{\partial J}{\partial w} = \sum_{x \in X :\, \langle w, x \rangle < 0} x$$

Each perceptron update is one gradient step on $J$ using a single misclassified point.

SLIDE 49
  • 3. Perceptron Algorithm

But is the Perceptron Algorithm useful?

  • How often is data linearly separable? A simple failure example is the XOR function
  • History: Minsky & Papert [1969] criticized the perceptron for not being able to handle this case, which halted research on this and related techniques for decades

SLIDE 50
  • 3. Perceptron Algorithm

Other Feature Spaces

  • It took a long time until people realized that there is a simple way out
  • Key idea: transform the input data nonlinearly so that the problem becomes linearly separable!
  • There is an important message to take away from this: create features instead of learning from raw data; neural networks do it automagically for you

SLIDE 51
  • 4. Logistic Regression

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 52
  • 4. Logistic Regression

Generative vs. Discriminative

There are two different views to solve the classification problem.

Generative modelling

  • We model the class-conditional distributions $p(x \mid C_1)$ and $p(x \mid C_2)$
  • We classify by computing the class posterior using Bayes' rule
  • E.g.: Naive Bayes

Discriminative modelling

  • We model the class posterior directly, e.g. $p(C_1 \mid x)$
  • Consequence: we only care about getting the classification right, and not whether we fit the class-conditionals well
  • E.g.: Logistic Regression

SLIDE 53
  • 4. Logistic Regression

Probabilistic Discriminative Models

For now, we will write the class posterior using Bayes' rule:

$$p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x)} = \frac{p(x \mid C_1)\, p(C_1)}{\sum_i p(x, C_i)} = \frac{p(x \mid C_1)\, p(C_1)}{\sum_i p(x \mid C_i)\, p(C_i)}$$

$$= \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \dfrac{p(x \mid C_2)\, p(C_2)}{p(x \mid C_1)\, p(C_1)}} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$

the logistic sigmoid function, with $a = \log \dfrac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$.

SLIDE 54
  • 4. Logistic Regression

Sigmoid

Logistic / sigmoid function:

$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$

[Figure: plot of the sigmoid curve; source: Wikipedia]

The sigmoid is 'S-shaped' and squashes real numbers into the $(0, 1)$ interval.

SLIDE 55
  • 4. Logistic Regression

Probabilistic Discriminative Models

Class posterior: $p(C_1 \mid x) = \sigma(a)$ with $a = \log \dfrac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$

Logistic regression

  • Assume that $a$ is given by a linear discriminant function: $p(C_1 \mid x) = \sigma(w^\top x + w_0)$
  • Find $w$ and $w_0$ so that the class posterior is modeled best

When is this an appropriate assumption?

  • When the class-conditionals are Gaussians with equal covariance
  • But also for a number of other distributions
  • So the result is somewhat independent of the form of the class-conditionals

SLIDE 56
  • 4. Logistic Regression

Logistic Regression

Model the class posterior as $p(C_1 \mid x) = \sigma(w^\top x + w_0)$ and maximize the likelihood.

The data are (as always) i.i.d.; define

$$y_i = \begin{cases} 1 & x_i \text{ belongs to } C_1 \\ 0 & x_i \text{ belongs to } C_2 \end{cases}$$

$$p\left( Y \mid X; w, w_0 \right) = \prod_{i=1}^{N} p\left( y_i \mid x_i; w, w_0 \right) = \prod_{i=1}^{N} p\left( C_1 \mid x_i; w, w_0 \right)^{y_i} p\left( C_2 \mid x_i; w, w_0 \right)^{1 - y_i}$$

$$= \prod_{i=1}^{N} \sigma\left( w^\top x_i + w_0 \right)^{y_i} \left( 1 - \sigma\left( w^\top x_i + w_0 \right) \right)^{1 - y_i}$$
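A minimal sketch of maximum-likelihood training for this model on synthetic data, taking the logarithm and doing gradient descent on the negative log-likelihood (as the next slide describes); all data values and step sizes are illustrative assumptions:

```python
import numpy as np

# Logistic regression trained by gradient descent on the negative log-likelihood.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),   # class C1
               rng.normal([2, 2], 1.0, size=(50, 2))])  # class C2
y = np.concatenate([np.ones(50), np.zeros(50)])          # y=1 for C1, y=0 for C2

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
w0 = 0.0
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ w + w0)        # p(C1 | x_i) for every point
    grad_w = X.T @ (p - y)         # gradient of the negative log-likelihood
    grad_w0 = np.sum(p - y)
    w -= lr * grad_w / len(y)      # (averaged) gradient-descent step
    w0 -= lr * grad_w0 / len(y)

print("training accuracy:", np.mean((sigmoid(X @ w + w0) > 0.5) == (y == 1)))
```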

SLIDE 57
  • 4. Logistic Regression

Logistic Regression

  • We won’t do the derivation here (see Bishop 4.3), but basically you can apply the logarithm to $p(Y \mid X; w, w_0)$ and do gradient descent
  • Similar to what we have seen in regression, we can get more robust classifiers by incorporating priors and taking a Bayesian approach
  • Later, we will turn to a very different interpretation of this: logistic regression as a neural network

SLIDE 58
  • 5. Wrap-Up

Outline

  • 1. Discriminant Functions
  • 2. Fisher Discriminant Analysis
  • 3. Perceptron Algorithm
  • 4. Logistic Regression
  • 5. Wrap-Up
SLIDE 59
  • 5. Wrap-Up

You know now:

  • What a Bayes optimal classifier is
  • What a discriminant function is
  • How to formalize (with intuition and mathematically) the classification problem as linearly separable
  • How to compute the least squares solution for classification and why it fails
  • What Fisher’s Linear Discriminant is and how it differs from least squares
  • What the perceptron is, why it fails on the XOR problem, and how to overcome this with feature spaces
  • The difference between generative and discriminative modelling
  • What logistic regression is

SLIDE 60
  • 5. Wrap-Up

Self-Test Questions

  • How do we get from Bayes optimal decisions to discriminant functions?
  • How to derive a discriminant function from a probability distribution?
  • How to deal with more than two classes?
  • What does linearly separable mean?
  • What is Fisher discriminant analysis? How does it relate to regression?
  • Is Fisher’s linear discriminant Bayes optimal?
  • What are perceptrons? How can we train them?
  • What is logistic regression? How to derive the parameter update rule?

SLIDE 61
  • 5. Wrap-Up

Homework

Reading Assignment for next week

  • Bishop 7.1.5 and 12.1
  • Murphy 6.5 and 12.2
