THE SUPERVISED LEARNING PROBLEM
Matthieu R Bloch
January 7, 2020

WHY ML?
Adversarial examples [Elsayed et al. ’18]
Traditional engineering is top-down
- We use fundamental principles (mathematics, physics) to build models and abstractions
- Design is performed based on models
- Example: building a communication system

Machine learning is bottom-up
- We think there is a model to be found
- The model is too complex to describe or identify from fundamental principles
- We have data
- Example: classifying cats and dogs

There are plenty of problems that do not require ML
- We should probably not try to learn the laws of physics with ML

There are plenty of situations in which ML can help
- Engineering design based on heuristics
- Example: Computer-Aided Design
- Credit card fraud detection
- Movie recommendations
- Autonomous vehicles
- Match making
- Handwriting recognition
- Cooking
- Painting
- Teaching
Supervised learning
- Given input data {x_i}_{i=1}^N representing observations of a phenomenon
- Given output data {y_i}_{i=1}^N representing "labels" attached to the observations
- Goal: identify the input-output relationship from the training data {(x_i, y_i)}_{i=1}^N and generalize

Unsupervised learning
- Given input data {x_i}_{i=1}^N representing observations of a phenomenon
- No output data!
- Goal: understand structure in the data, or infer some characteristic of the underlying probability distribution

Other types of learning: semi-supervised learning, active learning, reinforcement learning, transfer learning, imitation learning
Learning model #1
An unknown function f : X → Y : x ↦ y = f(x) to learn
- The formula to distinguish cats from dogs

A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
- x_i ∈ X ≜ R^d: picture of a cat/dog
- y_i ∈ Y ≜ R: the corresponding label cat/dog

A set of hypotheses H as to what the function could be
- Example: deep neural nets with the AlexNet architecture

An algorithm ALG to find the best h ∈ H that explains f

Terminology:
- Y = R: regression problem
- |Y| = 2: binary classification problem

The goal is to generalize, i.e., be able to classify inputs we have not seen.
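The ingredients of this learning model (dataset D, hypothesis set H, algorithm ALG) can be made concrete on a toy problem. The sketch below is an illustrative assumption, not the slide's cats-and-dogs/AlexNet setup: 1-D inputs, binary labels, a hypothesis set of threshold classifiers, and a grid-search algorithm.

```python
import numpy as np

# Toy dataset D = {(x_i, y_i)}: 1-D features with binary labels
# (an illustrative stand-in for pictures and cat/dog labels).
X = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.75])
y = np.array([0, 0, 0, 1, 1, 1])

# Hypothesis set H: threshold classifiers h_t(x) = 1{x >= t}.
def h(t, x):
    return (x >= t).astype(int)

# Algorithm ALG: search a grid of thresholds for the h in H
# that best explains the training data (fewest errors).
def alg(X, y):
    thresholds = np.linspace(0.0, 1.0, 101)
    errors = [np.mean(h(t, X) != y) for t in thresholds]
    return thresholds[int(np.argmin(errors))]

t_star = alg(X, y)
```

Here ALG achieves zero training error, but whether h_{t_star} generalizes to unseen inputs is exactly the question the rest of the lecture develops.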
Learning seems impossible without additional assumptions!
https://xkcd.com/221/
Flip a biased coin that lands on heads with unknown probability p ∈ [0, 1]: P(heads) = p and P(tails) = 1 − p.
Say we flip the coin N times; can we estimate p with p̂ ≜ (# heads)/N?
Can we relate p̂ to p? The law of large numbers tells us that p̂ converges in probability to p as N gets large:
∀ε > 0, P(|p̂ − p| > ε) ⟶ 0 as N → ∞.
It is possible that p̂ is completely off, but it is not probable.
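The convergence in probability above is easy to check numerically; the bias p = 0.3 and the seed below are arbitrary simulation choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # the "unknown" bias, fixed here so we can check the estimate

# Estimate p_hat = (# heads) / N for increasingly many flips;
# the estimate concentrates around p as N grows.
for N in [10, 1000, 100000]:
    flips = rng.random(N) < p   # each flip is heads with probability p
    p_hat = flips.mean()
    print(f"N = {N:6d}   p_hat = {p_hat:.4f}   |p_hat - p| = {abs(p_hat - p):.4f}")
```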
Learning model #2
An unknown function f : X → Y : x ↦ y = f(x) to learn

A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
- {x_i}_{i=1}^N drawn i.i.d. from an unknown distribution P_x on X
- {y_i}_{i=1}^N are the corresponding targets: y_i ∈ Y ≜ R

A set of hypotheses H as to what the function could be

An algorithm ALG to find the best h ∈ H that explains f
Which color is the dress?
Learning model #3
An unknown conditional distribution P_{y|x} to learn
- P_{y|x} models f : X → Y with noise

A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
- {x_i}_{i=1}^N drawn i.i.d. from an unknown probability distribution P_x on X
- {y_i}_{i=1}^N are the corresponding targets: y_i ∈ Y ≜ R

A set of hypotheses H as to what the function could be

An algorithm ALG to find the best h ∈ H that explains f

The roles of P_x and P_{y|x} are different:
- P_{y|x} is what we want to learn; it captures the underlying function and the noise added to it
- P_x models the sampling of the dataset and need not be learned
Biometric authentication system
Assume that you are designing a fingerprint authentication system.
- You trained your system with a fancy machine learning algorithm
- The probability of wrongly authenticating is 1%
- The probability of correctly authenticating is 60%

Is this a good system? It depends!
- If you are GTRI, this might be OK (security matters more)
- If you are Apple, this is not acceptable (user convenience matters too)

There is an application-dependent cost that can affect the design.
Learning model
A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
- {x_i}_{i=1}^N drawn i.i.d. from an unknown probability distribution P_x on X
- {y_i}_{i=1}^N are the corresponding targets: y_i ∈ Y ≜ R

An unknown conditional distribution P_{y|x}
- P_{y|x} models f : X → Y with noise

A set of hypotheses H as to what the function could be

A loss function ℓ : Y × Y → R+ capturing the "cost" of prediction

An algorithm ALG to find the best h ∈ H that explains f
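One way to fold the application-dependent cost from the fingerprint example into ℓ is an asymmetric loss. The cost weights below are hypothetical illustration values, chosen only to show a false accept being penalized more heavily than a false reject.

```python
# Asymmetric loss l : Y x Y -> R+ for binary authentication,
# with Y = {0, 1} (0 = impostor, 1 = legitimate user).
# The cost weights are hypothetical illustration values.
def loss(y_true, y_pred, c_false_accept=10.0, c_false_reject=1.0):
    if y_pred == y_true:
        return 0.0                 # correct decision costs nothing
    if y_pred == 1 and y_true == 0:
        return c_false_accept      # impostor wrongly authenticated
    return c_false_reject          # legitimate user wrongly rejected
```

A security-minded designer (GTRI) might raise c_false_accept further, while a convenience-minded one (Apple) might raise c_false_reject; the learning problem itself is unchanged, only ℓ differs.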
Learning is not memorizing
- Our goal is not to find h ∈ H that accurately assigns values to elements of D
- Our goal is to find the best h ∈ H that accurately predicts values of unseen samples

Consider a hypothesis h ∈ H. We can easily compute the empirical risk (a.k.a. in-sample error)
R̂_N(h) ≜ (1/N) ∑_{i=1}^N ℓ(y_i, h(x_i)).
What we really care about is the true risk (a.k.a. out-sample error)
R(h) ≜ E_{xy}[ℓ(y, h(x))].

Question #1: Can we generalize? For a given h, is R̂_N(h) close to R(h)?

Question #2: Can we learn well? Given H, the best hypothesis is h♯ ≜ argmin_{h∈H} R(h). Our algorithm can only find h* ≜ argmin_{h∈H} R̂_N(h). Is R̂_N(h*) close to R(h♯)? Is R(h♯) ≈ 0?
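The gap between the empirical risk R̂_N(h) and the true risk R(h) can be seen numerically. The sketch below assumes a linear underlying function f(x) = 2x, Gaussian noise (a simple choice of P_{y|x}), a uniform P_x, and squared loss; a large fresh sample stands in for the expectation defining R(h).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # x_i drawn i.i.d. from Px (here uniform on [-1, 1]);
    # y_i drawn from Py|x: the underlying f(x) = 2x plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.1, n)
    return x, y

def empirical_risk(h, x, y):
    # R_hat_N(h) = (1/N) * sum_i l(y_i, h(x_i)), with squared loss
    return np.mean((y - h(x)) ** 2)

h = lambda x: 2.0 * x                   # a hypothesis matching f exactly
x_tr, y_tr = sample(50)                 # small training set: in-sample error
x_te, y_te = sample(100000)             # huge fresh sample: approximates R(h)

r_hat = empirical_risk(h, x_tr, y_tr)   # R_hat_N(h): fluctuates with the draw
r_true = empirical_risk(h, x_te, y_te)  # close to R(h) = noise variance = 0.01
```

Even for this perfect hypothesis, r_hat varies from draw to draw of the 50-point training set, while r_true is pinned near the noise floor; bounding that fluctuation uniformly over H is what generalization theory is about.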