 
THE SUPERVISED LEARNING PROBLEM
Matthieu R Bloch
January 7, 2020
WHY ML?
- Traditional engineering is top-down
  - We use fundamental principles (mathematics, physics) to build models and abstractions
  - Design is performed based on models
  - Example: building a communication system
- Machine learning is bottom-up
  - We think there is a model to be found
  - The model is too complex to describe or identify from fundamental principles
  - We have data
  - Example: classifying cats and dogs
- There are plenty of problems that do not require ML
  - We should probably not try to learn laws of physics with ML
- There are plenty of situations in which ML can help
  - Engineering design based on heuristics
  - Example: Computer Aided Design
[Figure: Adversarial examples [Elsayed et al. '18]]
ML IN THE REAL WORLD
- Match making
- Movie recommendations
- Autonomous vehicles
- Credit card fraud detection
- Handwriting recognition
- Cooking
- Painting
- Teaching
TYPES OF LEARNING
- Supervised learning
  - Given input data $\{x_i\}_{i=1}^N$ representing observation of phenomenon
  - Given output data $\{y_i\}_{i=1}^N$ representing "label" attached to observation
  - Goal is to identify input-output relationship from training data $\{(x_i, y_i)\}_{i=1}^N$ and generalize
- Unsupervised learning
  - Given input data $\{x_i\}_{i=1}^N$ representing observation of phenomenon
  - No output data!
  - Goal is to understand structure in data, or infer some characteristic of underlying probability distribution
- Other types of learning
  - semi-supervised learning
  - active learning
  - online learning
  - reinforcement learning
  - transfer learning
  - imitation learning
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. An unknown function $f: \mathcal{X} \to \mathcal{Y}: x \mapsto y = f(x)$ to learn
   - The formula to distinguish cats from dogs
2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$
   - $x_i \in \mathcal{X} \triangleq \mathbb{R}^d$: picture of cat/dog
   - $y_i \in \mathcal{Y} \triangleq \mathbb{R}$: the corresponding label cat/dog
3. A set of hypotheses $\mathcal{H}$ as to what the function could be
   - Example: deep neural nets with AlexNet architecture
4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$
Terminology:
- $\mathcal{Y} = \mathbb{R}$: regression problem
- $|\mathcal{Y}| = 2$: binary classification problem
- The goal is to generalize, i.e., be able to classify inputs we have not seen.
[Figure: Learning model #1]
A LEARNING PUZZLE
Learning seems impossible without additional assumptions!
POSSIBLE VS PROBABLE
- Flip a biased coin that lands on head with unknown probability $p \in [0, 1]$: $P(\text{head}) = p$ and $P(\text{tail}) = 1 - p$
- Say we flip the coin $N$ times, can we estimate $p$?
  $$\hat{p} = \frac{\#\text{head}}{N}$$
- Can we relate $\hat{p}$ to $p$?
  - The law of large numbers tells us that $\hat{p}$ converges in probability to $p$ as $N$ gets large
    $$\forall \epsilon > 0 \quad P(|\hat{p} - p| > \epsilon) \underset{N \to \infty}{\longrightarrow} 0.$$
  - It is possible that $\hat{p}$ is completely off but it is not probable
[Figure: https://xkcd.com/221/]
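The coin-flip estimate can be checked numerically. The sketch below is only an illustration: the function name `estimate_head_probability`, the value p = 0.3, and the seed are assumptions, not part of the slides. It computes the fraction of heads for increasing N so that the gap between the estimate and p can be observed shrinking.

```python
import random


def estimate_head_probability(p, n_flips, seed=0):
    """Flip a coin with P(head) = p exactly n_flips times; return #head / N."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = sum(1 for _ in range(n_flips) if rng.random() < p)
    return heads / n_flips


if __name__ == "__main__":
    p = 0.3  # assumed "unknown" bias, chosen only for the demo
    for n in (10, 100, 10_000, 100_000):
        p_hat = estimate_head_probability(p, n)
        print(f"N={n:>7}: p_hat={p_hat:.4f}  |p_hat - p|={abs(p_hat - p):.4f}")
```

For small N the estimate may be far from p (possible), while for large N a large gap becomes very unlikely (not probable).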
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. An unknown function $f: \mathcal{X} \to \mathcal{Y}: x \mapsto y = f(x)$ to learn
2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$
   - $\{x_i\}_{i=1}^N$ drawn i.i.d. from unknown distribution $P_x$ on $\mathcal{X}$
   - $\{y_i\}_{i=1}^N$ are the corresponding targets $y_i \in \mathcal{Y} \triangleq \mathbb{R}$
3. A set of hypotheses $\mathcal{H}$ as to what the function could be
4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$
[Figure: Learning model #2]
ANOTHER LEARNING PUZZLE
Which color is the dress?
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. An unknown conditional distribution $P_{y|x}$ to learn
   - $P_{y|x}$ models $f: \mathcal{X} \to \mathcal{Y}$ with noise
2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$
   - $\{x_i\}_{i=1}^N$ drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$
   - $\{y_i\}_{i=1}^N$ are the corresponding targets $y_i \in \mathcal{Y} \triangleq \mathbb{R}$
3. A set of hypotheses $\mathcal{H}$ as to what the function could be
4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$
The roles of $P_{y|x}$ and $P_x$ are different
- $P_{y|x}$ is what we want to learn, captures the underlying function and the noise added to it
- $P_x$ models sampling of dataset, need not be learned
[Figure: Learning model #3]
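To make the two distributions concrete, here is a minimal sketch of how a dataset could be generated under this model. Everything specific in it is an assumption made for illustration: the underlying function `f`, the uniform choice for P_x, and the additive Gaussian noise standing in for P_{y|x}. The slides do not specify any of these.

```python
import random


def f(x):
    """Hypothetical underlying function (an assumption for this demo)."""
    return 2.0 * x + 1.0


def sample_dataset(n, noise_std=0.5, seed=0):
    """Draw n i.i.d. pairs (x_i, y_i).

    x_i is drawn from P_x, assumed here to be uniform on [-1, 1].
    y_i is drawn from P_{y|x}, assumed here to be f(x_i) plus Gaussian noise.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)            # sampling of the input: P_x
        y = f(x) + rng.gauss(0.0, noise_std)  # noisy target: P_{y|x}
        data.append((x, y))
    return data


if __name__ == "__main__":
    for x, y in sample_dataset(5):
        print(f"x={x:+.3f}  y={y:+.3f}  f(x)={f(x):+.3f}")
```

A learner only ever sees the pairs (x, y); it never observes `f` or the noise level directly, and it does not need to model how the x values were sampled.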
YET ANOTHER LEARNING PUZZLE
- Assume that you are designing a fingerprint authentication system
- You trained your system with a fancy machine learning system
  - The probability of wrongly authenticating is 1%
  - The probability of correctly authenticating is 60%
- Is this a good system? It depends!
  - If you are GTRI, this might be ok (security matters more)
  - If you are Apple, this is not acceptable (user convenience matters too)
- There is an application-dependent cost that can affect the design
[Figure: Biometric authentication system]
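One way to see why the answer depends on the application is to attach costs to the two kinds of error and compare expected costs. The sketch below uses the 1% and 60% figures from the slide; the cost values, the fraction of impostor attempts, and the second "alternative" system are hypothetical numbers chosen only to illustrate the point.

```python
def expected_cost(false_accept_rate, true_accept_rate,
                  cost_false_accept, cost_false_reject, impostor_fraction):
    """Average cost per authentication attempt.

    Impostors are wrongly accepted with probability false_accept_rate;
    legitimate users are wrongly rejected with probability 1 - true_accept_rate.
    """
    false_reject_rate = 1.0 - true_accept_rate
    return (impostor_fraction * false_accept_rate * cost_false_accept
            + (1.0 - impostor_fraction) * false_reject_rate * cost_false_reject)


if __name__ == "__main__":
    slide_system = (0.01, 0.60)  # rates stated on the slide
    alternative = (0.05, 0.99)   # hypothetical more permissive system
    impostors = 0.01             # assumed fraction of impostor attempts

    # Hypothetical cost profiles: (cost of false accept, cost of false reject)
    profiles = {"security-focused": (10_000.0, 1.0),
                "convenience-focused": (10.0, 1.0)}
    for name, (c_fa, c_fr) in profiles.items():
        a = expected_cost(*slide_system, c_fa, c_fr, impostors)
        b = expected_cost(*alternative, c_fa, c_fr, impostors)
        print(f"{name}: slide system={a:.4f}, alternative={b:.4f}")
```

With these assumed numbers the slide's system has the lower expected cost under the security-focused profile and the higher one under the convenience-focused profile, which is the sense in which the cost can affect the design.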
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$
   - $\{x_i\}_{i=1}^N$ drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$
   - $\{y_i\}_{i=1}^N$ are the corresponding targets $y_i \in \mathcal{Y} \triangleq \mathbb{R}$
2. An unknown conditional distribution $P_{y|x}$
   - $P_{y|x}$ models $f: \mathcal{X} \to \mathcal{Y}$ with noise
3. A set of hypotheses $\mathcal{H}$ as to what the function could be
4. A loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ capturing the "cost" of prediction
5. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$
[Figure: Learning model]
THE LEARNING PROBLEM
- Learning is not memorizing
  - Our goal is not to find $h \in \mathcal{H}$ that accurately assigns values to elements of $\mathcal{D}$
  - Our goal is to find the best $h \in \mathcal{H}$ that accurately predicts values of unseen samples
- Consider hypothesis $h \in \mathcal{H}$. We can easily compute the empirical risk (a.k.a. in-sample error)
  $$\hat{R}_N(h) \triangleq \frac{1}{N} \sum_{i=1}^N \ell(y_i, h(x_i))$$
- What we really care about is the true risk (a.k.a. out-sample error)
  $$R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))]$$
- Question #1: Can we generalize?
  - For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
- Question #2: Can we learn well?
  - Given $\mathcal{H}$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$
  - Our algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$
  - Is $\hat{R}_N(h^*)$ close to $R(h^\sharp)$?
  - Is $R(h^\sharp) \approx 0$?
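The empirical risk and its minimizer can be sketched on a toy problem. The setup below is an assumption made for illustration and does not come from the slides: inputs uniform on [0, 1], labels given by a threshold at 0.5 with 10% of labels flipped, a finite hypothesis set of threshold classifiers, and the 0-1 loss. The true risk is not computed exactly; it is only approximated by the empirical risk on a large fresh sample.

```python
import random


def zero_one_loss(y, y_pred):
    """0-1 loss: 0 if the prediction matches the label, 1 otherwise."""
    return 0.0 if y == y_pred else 1.0


def empirical_risk(h, data, loss=zero_one_loss):
    """Average loss of hypothesis h over the dataset (in-sample error)."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)


def make_threshold(t):
    """Hypothesis h_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0


def sample(n, rng, true_threshold=0.5, flip_prob=0.1):
    """Assumed data model: x uniform on [0, 1], label flipped w.p. flip_prob."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x >= true_threshold else 0
        if rng.random() < flip_prob:
            y = 1 - y  # label noise
        data.append((x, y))
    return data


def erm(thresholds, data):
    """Return the threshold whose classifier minimizes the empirical risk."""
    return min(thresholds, key=lambda t: empirical_risk(make_threshold(t), data))


if __name__ == "__main__":
    rng = random.Random(0)
    train = sample(200, rng)
    fresh = sample(20_000, rng)  # large sample used to approximate the true risk
    thresholds = [k / 20 for k in range(21)]  # finite hypothesis set
    t_star = erm(thresholds, train)
    h_star = make_threshold(t_star)
    print(f"selected threshold: {t_star}")
    print(f"empirical risk on training data: {empirical_risk(h_star, train):.3f}")
    print(f"approximate true risk:           {empirical_risk(h_star, fresh):.3f}")
```

Comparing the last two printed values gives a feel for Question #1 on this toy problem. Because 10% of the labels are flipped, no threshold classifier can reach zero true risk here, which relates to the final question on the slide.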