Statistical Machine Learning
Lecture 05: Bayesian Decision Theory


SLIDE 1

Statistical Machine Learning

Lecture 05: Bayesian Decision Theory

Kristian Kersting TU Darmstadt

Summer Term 2020

  • K. Kersting, based on slides from J. Peters · Statistical Machine Learning · Summer Term 2020

SLIDE 2

Today’s Objectives

Make you understand how to make an optimal decision! Covered topics:

  • Bayesian Optimal Decisions
  • Classification from a Bayesian point of view
  • Risk-based Classification

SLIDE 3

Outline

  • 1. Bayesian Decision Theory
  • 2. Risk Minimization
  • 3. Wrap-Up

SLIDE 4

Outline

  • 1. Bayesian Decision Theory
  • 2. Risk Minimization
  • 3. Wrap-Up

SLIDE 5

Statistical Methods

Statistical methods in machine learning all have in common that they assume the process that “generates” the data is governed by the rules of probability. The data is understood to be a set of random samples from some underlying probability distribution. Today will be all about probabilities, but in future lectures the use of probability will sometimes be much less explicit. Nonetheless, the basic assumption about how the data is generated is always there, even if you don’t see a single probability distribution anywhere.
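
To make the “data = random samples from an underlying distribution” view concrete, here is a minimal sketch (not from the slides) of such a generative process for a two-class problem; the priors and the Gaussian class-conditionals are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed generative process: draw a class label from the prior p(C_k),
# then draw a 1-D feature x from that class's class-conditional density.
priors = {"a": 0.75, "b": 0.25}        # hypothetical p(C_k)
means  = {"a": 0.0, "b": 2.0}          # hypothetical Gaussian class-conditionals
stds   = {"a": 1.0, "b": 0.5}

def sample_dataset(n):
    labels = rng.choice(list(priors), size=n, p=list(priors.values()))
    x = np.array([rng.normal(means[c], stds[c]) for c in labels])
    return x, labels

X, y = sample_dataset(1000)   # a "dataset" = random samples from this process
```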

SLIDE 6

Character Recognition

Goal: classify a new letter so that the probability of a wrong classification is minimized

SLIDE 7

Class conditional probabilities

Probability of making an observation x knowing that it comes from some class Ck, i.e. p(x|Ck). Here x is often a feature vector, which measures/describes properties of the data, e.g. the number of black pixels, the height-width ratio, ...
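
As a concrete, purely hypothetical illustration of turning raw data into such a feature vector, the sketch below computes the two example features (black-pixel count and height-width ratio) from a binary character image; the image and function are invented, not part of the lecture.

```python
import numpy as np

def extract_features(img):
    """Map a binary character image (1 = black pixel) to a feature vector x:
    number of black pixels and height-to-width ratio of the bounding box."""
    n_black = int(img.sum())
    rows = np.flatnonzero(img.any(axis=1))
    cols = np.flatnonzero(img.any(axis=0))
    ratio = (rows.max() - rows.min() + 1) / (cols.max() - cols.min() + 1)
    return np.array([n_black, ratio])

img = np.array([[0, 1, 0]] * 5)     # toy 5x3 image of a vertical stroke
print(extract_features(img))        # -> [5. 5.]
```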

SLIDE 8

Class conditional probabilities

Example: How do we decide which class the data point belongs to? Here, we should decide for class a.

SLIDE 9

Class conditional probabilities

Example: How do we decide which class the data point belongs to? Since p(x|a) is a lot smaller than p(x|b), we should now decide for class b.

SLIDE 10

Class conditional probabilities

Example: How do we decide which class the data point belongs to?

SLIDE 11

Class priors

The a priori probability of a data point belonging to a particular class is called the class prior.

Example:

abaaababaaaabbaaaaaa

What are p(a) and p(b)?

$C_1 = a$: $p(C_1) = 0.75$
$C_2 = b$: $p(C_2) = 0.25$

Class priors sum to one: $\sum_k p(C_k) = 1$
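
A quick sketch of estimating these priors by counting, using the label sequence from the slide:

```python
from collections import Counter

labels = "abaaababaaaabbaaaaaa"                    # label sequence from the slide
priors = {c: n / len(labels) for c, n in Counter(labels).items()}
print(priors)                                      # {'a': 0.75, 'b': 0.25}
assert abs(sum(priors.values()) - 1.0) < 1e-12     # priors sum to one
```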

SLIDE 12

Back to our problem...

Example: How do we decide which class the data point belongs to? If p(a) = 0.75 and p(b) = 0.25, we should decide for class a.

SLIDE 13

Bayesian Decision Theory

Bayes’ Theorem lets us formalize the previous intuitive decision. We want to find the a-posteriori probability (posterior) of the class Ck given the observation (feature) x:

$$p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{p(x)} = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)}$$

  • class prior: $p(C_k)$
  • class-conditional probability (likelihood): $p(x \mid C_k)$
  • class posterior: $p(C_k \mid x)$
  • normalization term: $p(x)$
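
A minimal numerical sketch of this formula; the likelihood and prior values below are made up for illustration.

```python
import numpy as np

likelihood = np.array([0.05, 0.20])   # assumed p(x|C_1), p(x|C_2) at some x
prior      = np.array([0.75, 0.25])   # assumed p(C_1), p(C_2)

joint     = likelihood * prior        # p(x|C_k) p(C_k)
evidence  = joint.sum()               # p(x) = sum_j p(x|C_j) p(C_j)
posterior = joint / evidence          # p(C_k|x), sums to one
print(posterior)                      # -> [0.42857143 0.57142857]
```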

SLIDE 14

Bayesian Decision Theory

SLIDE 15

Bayesian Decision Theory

Why is it called this way?

To some extent, because it involves applying Bayes’ rule. But this is not the whole story... The real reason is that it is built on so-called Bayesian probabilities.

SLIDE 16

Bayesian Probabilities

Probability is not just interpreted as the frequency of a certain event happening. Rather, it is seen as a degree of belief in an outcome. Only this allows us to assert a prior belief in a data point coming from a certain class. Even though this might seem easy for you to accept now, this interpretation was quite contentious in statistics for a long time.

SLIDE 17

Bayesian Decision Theory

Goal: Minimize the misclassification rate (the probability of classifying wrongly)

[Figure: joint densities p(x, C1) and p(x, C2) over x, with decision regions R1 and R2 separated by the decision boundary x0]

$$
\begin{aligned}
p(\text{error}) &= p(x \in R_1, C_2) + p(x \in R_2, C_1) \\
&= \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx \\
&= \int_{R_1} p(x \mid C_2)\,p(C_2)\,dx + \int_{R_2} p(x \mid C_1)\,p(C_1)\,dx
\end{aligned}
$$
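
To make the integrals tangible, here is a hedged numerical sketch with assumed Gaussian class-conditionals and equal priors (none of these numbers come from the slide); it picks the regions by comparing p(x|C_k) p(C_k) and approximates the two integrals on a grid.

```python
import numpy as np
from scipy.stats import norm

p1, p2 = 0.5, 0.5                                  # assumed priors
pdf1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # assumed p(x|C_1)
pdf2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # assumed p(x|C_2)

x = np.linspace(-8.0, 10.0, 20001)
dx = x[1] - x[0]
decide_1 = pdf1(x) * p1 > pdf2(x) * p2             # x in R_1 where we decide C_1

# p(error) = int_{R_1} p(x|C_2) p(C_2) dx + int_{R_2} p(x|C_1) p(C_1) dx
integrand = np.where(decide_1, pdf2(x) * p2, pdf1(x) * p1)
print(integrand.sum() * dx)                        # approx. 0.159 for these densities
```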

SLIDE 18

Bayesian Decision Theory

Decision rule: decide $C_1$ if $p(C_1 \mid x) > p(C_2 \mid x)$. Equivalent to

$$\frac{p(x \mid C_1)\,p(C_1)}{p(x)} > \frac{p(x \mid C_2)\,p(C_2)}{p(x)}$$

$$p(x \mid C_1)\,p(C_1) > p(x \mid C_2)\,p(C_2)$$

$$\frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{p(C_2)}{p(C_1)}$$

A classifier obeying this rule is called a Bayes Optimal Classifier.
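
A small sketch of this rule as code; the class-conditional densities and priors are again assumptions, not from the slide.

```python
from scipy.stats import norm

p1, p2 = 0.75, 0.25                                # assumed priors
pdf1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # assumed p(x|C_1)
pdf2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # assumed p(x|C_2)

def decide(x):
    # decide C_1 iff p(x|C_1) p(C_1) > p(x|C_2) p(C_2)
    return "C1" if pdf1(x) * p1 > pdf2(x) * p2 else "C2"

print(decide(0.5), decide(1.8))                    # -> C1 C2
```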

SLIDE 19

Bayesian Decision Theory

$$\frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{p(C_2)}{p(C_1)}$$

Special cases:

  • If $p(x \mid C_1) = p(x \mid C_2)$, then use $p(C_1) > p(C_2)$
  • If $p(C_1) = p(C_2)$, then use $p(x \mid C_1) > p(x \mid C_2)$

SLIDE 20

More than two Classes

Generalization to more than 2 classes:

Decide for class k iff it has the highest a-posteriori probability:

$$p(C_k \mid x) > p(C_j \mid x) \quad \forall j \neq k$$

Equivalent to

$$p(x \mid C_k)\,p(C_k) > p(x \mid C_j)\,p(C_j) \quad \forall j \neq k$$

$$\frac{p(x \mid C_k)}{p(x \mid C_j)} > \frac{p(C_j)}{p(C_k)} \quad \forall j \neq k$$
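
In code, the multi-class rule is just an argmax over p(x|C_k) p(C_k); the numbers below are made up.

```python
import numpy as np

likelihoods = np.array([0.02, 0.10, 0.05])   # assumed p(x|C_k) at some x
priors      = np.array([0.50, 0.20, 0.30])   # assumed p(C_k)

k = int(np.argmax(likelihoods * priors))     # class with the largest posterior
print(f"decide C{k + 1}")                    # -> decide C2
```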

SLIDE 21

More than two Classes

Decision regions: R1, R2, R3, ...

SLIDE 22

High Dimensional Features

So far we have only considered one-dimensional features, i.e., $x \in \mathbb{R}$. We can use more features and generalize to an arbitrary D-dimensional feature space, i.e., $x \in \mathbb{R}^D$.

For instance, in the salmon vs. sea-bass classification task, $x = (x_1\ x_2)^\top \in \mathbb{R}^2$, where $x_1$ is the width and $x_2$ is the lightness.

The decision rule we devised still applies for $x \in \mathbb{R}^D$. We just need to use multivariate class-conditional densities $p(x \mid C_k)$.
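
A sketch of the same rule with a 2-D feature vector, using assumed multivariate Gaussian class-conditionals for the salmon vs. sea-bass example; the means, covariances, and priors are invented.

```python
import numpy as np
from scipy.stats import multivariate_normal

p_salmon, p_bass = 0.4, 0.6                                       # assumed priors
pdf_salmon = multivariate_normal(mean=[3.0, 6.0], cov=np.eye(2))  # assumed p(x|salmon)
pdf_bass   = multivariate_normal(mean=[5.0, 3.0], cov=np.eye(2))  # assumed p(x|sea bass)

x = np.array([4.0, 5.0])                                          # x = (width, lightness)
decide_salmon = pdf_salmon.pdf(x) * p_salmon > pdf_bass.pdf(x) * p_bass
print("salmon" if decide_salmon else "sea bass")                  # -> salmon
```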

SLIDE 23

Dummy Classes

There are also applications where it may be advantageous to have a dummy class denoted “don’t know” or “don’t care”.

This is also called a reject option.

It is not a common case, though, and we will not cover it in this class.

SLIDE 24

Outline

  • 1. Bayesian Decision Theory
  • 2. Risk Minimization
  • 3. Wrap-Up

SLIDE 25

2. Risk Minimization

So far, we have tried to minimize the misclassification rate. There are many cases where not every misclassification is equally bad.

Smoke detector:

  • If there is a fire, we need to be very sure that we classify it as such
  • If there is no fire, it is OK to occasionally have a false alarm

Medical diagnosis:

  • If the patient is sick, we need to be very sure that we report them as sick
  • If they are healthy, it is OK to classify them as sick and order further testing that may help clear this up

SLIDE 26

Loss Functions

Key idea: we have to construct a loss function in a way that expresses what we want to achieve

$$\text{loss}(\text{decision} = \text{healthy} \mid \text{patient} = \text{sick}) \gg \text{loss}(\text{decision} = \text{sick} \mid \text{patient} = \text{healthy})$$

  • Possible decisions: $\alpha_i$
  • True classes: $C_j$
  • Loss function: $\lambda(\alpha_i \mid C_j)$

Expected loss of making a decision $\alpha_i$:

$$R(\alpha_i \mid x) = \mathbb{E}_{C_k \sim p(C_k \mid x)}\left[\lambda(\alpha_i \mid C_k)\right] = \sum_j \lambda(\alpha_i \mid C_j)\,p(C_j \mid x)$$
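
A small numerical sketch of this expected loss; the loss matrix and posterior below are made-up numbers for a sick-vs-healthy style problem.

```python
import numpy as np

# Rows: decisions alpha_i, columns: true classes C_j (C_1 = sick, C_2 = healthy).
loss = np.array([[0.0,   1.0],    # alpha_1: report "sick"
                 [100.0, 0.0]])   # alpha_2: report "healthy"
posterior = np.array([0.1, 0.9])  # assumed p(C_1|x), p(C_2|x)

risk = loss @ posterior           # R(alpha_i|x) = sum_j lambda_ij p(C_j|x)
print(risk)                       # -> [ 0.9 10. ]
print("decide alpha_%d" % (int(np.argmin(risk)) + 1))   # -> decide alpha_1
```

Even though this hypothetical patient is probably healthy, the asymmetric loss makes “report sick” the lower-risk decision.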

SLIDE 27

Risk Minimization

The expected loss of a decision is also called the risk of making a decision. Instead of minimizing the misclassification rate

$$p(\text{error}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx = \int_{R_1} p(x \mid C_2)\,p(C_2)\,dx + \int_{R_2} p(x \mid C_1)\,p(C_1)\,dx$$

we minimize the overall risk

$$R(\alpha_i \mid x) = \mathbb{E}_{C_k \sim p(C_k \mid x)}\left[\lambda(\alpha_i \mid C_k)\right] = \sum_j \lambda(\alpha_i \mid C_j)\,p(C_j \mid x)$$

SLIDE 28

Risk Minimization

  • 2 classes: $C_1, C_2$
  • 2 decisions: $\alpha_1, \alpha_2$
  • Loss function: $\lambda(\alpha_i \mid C_j) = \lambda_{ij}$

Risk of both decisions:

$$R(\alpha_1 \mid x) = \lambda_{11}\,p(C_1 \mid x) + \lambda_{12}\,p(C_2 \mid x)$$
$$R(\alpha_2 \mid x) = \lambda_{21}\,p(C_1 \mid x) + \lambda_{22}\,p(C_2 \mid x)$$

Goal: create a decision rule so that the overall risk is minimized.

Decide $\alpha_1$ if $R(\alpha_2 \mid x) > R(\alpha_1 \mid x)$

SLIDE 29

Risk Minimization

$$R(\alpha_2 \mid x) > R(\alpha_1 \mid x)$$
$$\lambda_{21}\,p(C_1 \mid x) + \lambda_{22}\,p(C_2 \mid x) > \lambda_{11}\,p(C_1 \mid x) + \lambda_{12}\,p(C_2 \mid x)$$
$$(\lambda_{21} - \lambda_{11})\,p(C_1 \mid x) > (\lambda_{12} - \lambda_{22})\,p(C_2 \mid x)$$
$$\frac{\lambda_{21} - \lambda_{11}}{\lambda_{12} - \lambda_{22}} > \frac{p(C_2 \mid x)}{p(C_1 \mid x)} = \frac{p(x \mid C_2)\,p(C_2)}{p(x \mid C_1)\,p(C_1)}$$
$$\frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{p(C_2)}{p(C_1)}$$

It is reasonable to assume that the loss of a correct decision is smaller than that of a wrong decision: $\lambda_{ij} > \lambda_{ii} \;\; \forall j \neq i$
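
A quick numerical sanity check (with invented losses, priors, and likelihood values) that comparing the two risks directly and using the rearranged likelihood-ratio threshold give the same decision:

```python
import numpy as np

lam = np.array([[0.0, 2.0],        # lambda_11, lambda_12 (assumed)
                [5.0, 0.0]])       # lambda_21, lambda_22 (assumed)
p1, p2   = 0.6, 0.4                # assumed priors p(C_1), p(C_2)
lx1, lx2 = 0.3, 0.5                # assumed likelihoods p(x|C_1), p(x|C_2)

post = np.array([lx1 * p1, lx2 * p2])
post /= post.sum()                 # posteriors p(C_1|x), p(C_2|x)

risk = lam @ post                  # R(alpha_1|x), R(alpha_2|x)
by_risk  = risk[1] > risk[0]       # decide alpha_1 by comparing risks
by_ratio = lx1 / lx2 > (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * p2 / p1

print(by_risk, by_ratio)           # -> True True (the two rules agree)
```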

SLIDE 30

Risk Minimization 0-1 Loss

$$\frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{p(C_2)}{p(C_1)}$$

With the 0-1 loss

$$\lambda(\alpha_i \mid C_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}$$

decide $\alpha_1$ if

$$\frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{p(C_2)}{p(C_1)}$$

The 0-1 loss leads to the same decision rule that minimized the misclassification rate.
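
A tiny check of this statement: under the 0-1 loss, $R(\alpha_i \mid x) = 1 - p(C_i \mid x)$, so minimizing the risk picks the class with the largest posterior (the posterior values below are made up).

```python
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])            # assumed p(C_j|x)
loss = 1.0 - np.eye(len(posterior))              # 0-1 loss: 0 if i == j else 1

risk = loss @ posterior                          # R(alpha_i|x) = 1 - p(C_i|x)
print(risk)                                      # -> [0.8 0.5 0.7]
print(np.argmin(risk) == np.argmax(posterior))   # -> True
```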

SLIDE 31

Bayesian Decision Theory

Are we done with classification?

  • We have decision rules for simple and general loss functions, even “Bayes optimal” ones
  • We can deal with 2 or more classes
  • We can deal with high-dimensional feature vectors
  • We can incorporate prior knowledge on the class distribution

What are we going to do for the rest of the semester? Where is the catch? Where do we get the probability distributions from?

SLIDE 32

Training Data

[Figure: scatter plot of labeled training data]

How do we get the probability distributions from this so that we can classify with them?
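
One hedged sketch of an answer, previewing the density-estimation topics of the upcoming lectures (Bishop, ch. 2): estimate the priors by counting and fit a simple parametric density, e.g. a Gaussian, to each class's feature values. The toy data below is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.5, 0.2, 30), rng.normal(1.5, 0.3, 10)])  # toy features
y = np.array([0] * 30 + [1] * 10)                                         # toy class labels

priors = np.bincount(y) / len(y)                                 # p(C_k) from class counts
params = [(x[y == k].mean(), x[y == k].std()) for k in (0, 1)]   # per-class (mu, sigma)
print(priors)   # -> [0.75 0.25]
print(params)   # simple Gaussian fits used as class-conditionals p(x|C_k)
```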

SLIDE 33

Outline

  • 1. Bayesian Decision Theory
  • 2. Risk Minimization
  • 3. Wrap-Up

SLIDE 34

3. Wrap-Up

You now know:

  • What class-conditional probabilities, class priors, and class posteriors are
  • What Bayesian Decision Theory is
  • How to use Bayes’ Theorem for classification
  • What the misclassification rate is
  • What a Bayes Optimal Classifier is
  • How to generalize the decision to more than 2 classes
  • What risk is, and how it relates to misclassification

SLIDE 35

Self-Test Questions

  • How can we decide how to classify a query based on simple and general loss functions?
  • What does “Bayes optimal” mean?
  • How do we deal with 2 or more classes?
  • How do we deal with high-dimensional feature vectors?
  • How can we incorporate prior knowledge on the class distribution?
  • What are the equations for the misclassification rate and the risk?

SLIDE 36

Homework

Reading assignment for the next lecture:

Bishop, ch. 2 (Probability Distributions) and ch. 9 (Mixture Models and EM)
