Matthieu R Bloch, Thursday January 9, 2020
WHY SUPERVISED LEARNING MAY WORK
1
Registration update
- Please decide soon whether you want to take the class or not
- Still many people on the waiting list!
Lecture videos on Canvas available in “Media Gallery”
- Please keep coming to class!
Self-assessment online
- Due Friday January 17, 2020 (11:59PM EST) (Friday January 24, 2020 for DL)
- I don’t expect you to do the assignment without refreshing your memory first
http://www.phdcomics.com
2
Learning model #1
- An unknown function f : X → Y : x ↦ y = f(x) to learn
  - Example: the formula to distinguish cats from dogs
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - x_i ∈ X ≜ ℝ^d: picture of a cat/dog
  - y_i ∈ Y ≜ ℝ: the corresponding label cat/dog
- A set of hypotheses H as to what the function could be
  - Example: deep neural nets with the AlexNet architecture
- An algorithm ALG to find the best h ∈ H that explains f
- Terminology:
  - Y = ℝ: regression problem
  - |Y| < ∞: classification problem
  - |Y| = 2: binary classification problem
The goal is to generalize, i.e., be able to classify inputs we have not seen.
3
Learning seems impossible without additional assumptions!
4
https://xkcd.com/221/
- Flip a biased coin that lands on heads with unknown probability p ∈ [0, 1]: P(head) = p and P(tail) = 1 − p
- Say we flip the coin N times: can we estimate p with p̂ ≜ (# heads)/N? Can we relate p̂ to p?
- The law of large numbers tells us that p̂ converges in probability to p as N gets large:
  ∀ε > 0, P(|p̂ − p| > ε) ⟶ 0 as N → ∞
- It is possible that p̂ is completely off, but it is not probable
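The coin-flip experiment is easy to simulate; a minimal sketch in Python (the function name and the seed are my own choices):

```python
import random

def estimate_p(p, n, rng):
    """Estimate the bias p of a coin from n i.i.d. flips: p_hat = (# heads)/n."""
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / n

rng = random.Random(0)
for n in (10, 1000, 100000):
    # The estimate p_hat concentrates around p = 0.3 as n grows.
    print(n, estimate_p(0.3, n, rng))
```

For any single run p̂ can still be far from p; the law of large numbers only says that this becomes improbable as n grows.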
5
Learning model #2
- An unknown function f : X → Y : x ↦ y = f(x) to learn
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown distribution P_x on X
  - y_i ∈ Y ≜ ℝ are the corresponding targets
- A set of hypotheses H as to what the function could be
- An algorithm ALG to find the best h ∈ H that explains f
6
Which color is the dress?
7
Learning model #3
- An unknown conditional distribution P_{y|x} to learn
  - P_{y|x} models f : X → Y with noise
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown probability distribution P_x on X
  - y_i ∈ Y ≜ ℝ are the corresponding targets
- A set of hypotheses H as to what the function could be
- An algorithm ALG to find the best h ∈ H that explains f
- The roles of P_x and P_{y|x} are different
  - P_{y|x} is what we want to learn: it captures the underlying function and the noise added to it
  - P_x models the sampling of the dataset and need not be learned
8
Biometric authentication system
Assume that you are designing a fingerprint authentication system
- You trained your system with a fancy machine learning algorithm
- The probability of wrongly authenticating is 1%
- The probability of correctly authenticating is 60%
Is this a good system? It depends!
- If you are GTRI, this might be OK (security matters more)
- If you are Apple, this is not acceptable (user convenience matters too)
There is an application-dependent cost that can affect the design
9
Final supervised learning model
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown probability distribution P_x on X
  - y_i ∈ Y ≜ ℝ are the corresponding targets
- An unknown conditional distribution P_{y|x} to learn
  - P_{y|x} models f : X → Y with noise
- A set of hypotheses H as to what the function could be
- A loss function ℓ : Y × Y → ℝ⁺ capturing the “cost” of prediction
- An algorithm ALG to find the best h ∈ H that explains f
10
Learning is not memorizing
- Our goal is not to find h ∈ H that accurately assigns values to the elements of D
- Our goal is to find the best h ∈ H that accurately predicts values of unseen samples
- Consider a hypothesis h ∈ H. We can easily compute the empirical risk (a.k.a. in-sample error)
  R̂_N(h) ≜ (1/N) Σ_{i=1}^N ℓ(y_i, h(x_i))
- What we really care about is the true risk (a.k.a. out-sample error)
  R(h) ≜ E_{xy}[ℓ(y, h(x))]
- Question #1: Can we generalize? For a given h, is R̂_N(h) close to R(h)?
- Question #2: Can we learn well?
  - Given H, the best hypothesis is h♯ ≜ argmin_{h∈H} R(h)
  - Our algorithm can only find h* ≜ argmin_{h∈H} R̂_N(h)
  - Is R̂_N(h*) close to R(h♯)? Is R(h♯) ≈ 0?
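The empirical risk is directly computable from the dataset; a quick sketch (the toy hypothesis and data are mine, not from the lecture):

```python
def empirical_risk(h, data, loss):
    """R_hat_N(h) = (1/N) * sum of loss(y_i, h(x_i)) over the dataset."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

# Binary loss 1{y1 != y2} and a toy threshold hypothesis on scalar inputs.
loss01 = lambda y1, y2: 1 if y1 != y2 else 0
h = lambda x: 1 if x > 0.5 else 0
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 0)]
print(empirical_risk(h, data, loss01))  # 0.25: one of the four points is misclassified
```

The true risk R(h), by contrast, is an expectation over the unknown distribution and cannot be computed from D alone; that gap is exactly what the generalization question is about.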
11
Quick demo: nearest neighbor classification
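The demo itself is not reproduced here, but a 1-nearest-neighbor classifier can be sketched in a few lines (the toy training set and labels are my own):

```python
import math

def nearest_neighbor_classify(x, train):
    """1-NN rule: return the label of the training point closest to x."""
    return min(train, key=lambda pt: math.dist(x, pt[0]))[1]

# Toy 2-D training set of labeled points.
train = [((0.0, 0.0), "cat"), ((1.0, 1.0), "dog"), ((0.9, 0.2), "cat")]
print(nearest_neighbor_classify((0.8, 0.9), train))  # "dog"
```

Note that 1-NN has zero empirical risk on its own training set (it memorizes D), which makes it a good illustration of why in-sample error alone says nothing about generalization.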
12
Probabilities are not that old: the axiomatic theory was laid out by Kolmogorov in 1933.

Definition (events). Let Ω ≠ ∅ be a sample space. The class of subsets of Ω that constitutes events satisfies the following axioms:
1. Ω is an event;
2. for any countable collection of events {A_i}_{i≥1} in Ω, ∪_{i=1}^∞ A_i is an event;
3. for any event A in Ω, A^c is an event.

Definition (probability rule). Let Ω ≠ ∅ be a sample space and F a class of events satisfying the axioms for events. A probability rule is a function P : F → ℝ⁺ such that:
1. P(Ω) = 1;
2. for any A ∈ F, P(A) ≥ 0;
3. for any countable collection {A_i}_{i=1}^∞ of pairwise disjoint events in F, P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Proposition (Union bound). Let (Ω, F, P) be a probability space. For any events {A_i}_{i≥1} we have P(∪_{i≥1} A_i) ≤ Σ_{i≥1} P(A_i).
13
Definition (conditional probability). Let (Ω, F, P) be a probability space. The conditional probability of event A given event B is, if P(B) > 0, P(A|B) ≜ P(A ∩ B)/P(B).

Proposition (Bayes' rule). Let (Ω, F, P) be a probability space and A, B events with non-zero probability. Then P(A|B) = P(B|A) P(A)/P(B).

Definition (independence). Let (Ω, F, P) be a probability space. Event A is independent of event B if P(A ∩ B) = P(A) P(B). For n > 2, the events {A_i}_{i=1}^n are mutually independent if for all S ⊂ [1; n] with |S| ≥ 2, P(∩_{i∈S} A_i) = ∏_{i∈S} P(A_i).
14
Definition (random variable). Let (Ω, F, P) be a probability space. A random variable is a function X : Ω → ℝ such that:
1. X may be undefined or infinite on a subset of zero probability;
2. {ω ∈ Ω : X(ω) ≤ x} must be an event for all x ∈ ℝ;
3. for several random variables {X_i}_{i=1}^n, the set {ω : X_1(ω) ≤ x_1, ⋯, X_n(ω) ≤ x_n} must be an event for all {x_i}_{i=1}^n.

Definition (CDF, PMF, PDF). Let (Ω, F, P) be a probability space and X a random variable. The CDF of X is the function F_X : ℝ → ℝ : x ↦ P(ω ∈ Ω : X(ω) ≤ x) ≜ P(X ≤ x).
If |X| < ∞, X takes a finite number of values {x_i}_{i=1}^{|X|}, and P_X(x_i) ≜ P(X = x_i) is called the probability mass function (PMF) of X.
If the CDF of X has a finite derivative at x, the derivative is called the probability density function (PDF), denoted p_X. If F_X has a derivative at every x ∈ ℝ, X is continuous.
We often don’t need to specify (Ω, F, P): all we need is a CDF (or PMF or PDF).
15
Let be a random variable with PMF . Then . Let be a random variable with PDF . Then . Expectation of a function of a discrete is (and idem for PDFs).
Let be a random variable. The th moment of is . The variance is the second centered moment . Proposition (Expectation of indicator function) Let be a random variable and . Then 11th commandment: thou shall denote random variables by capital letters 12th commandment: but sometimes not
X PX [X] ≜ x (x) E ∑x∈X PX X pX [X] ≜ x (x)dx E ∫x∈X pX f X [f(X)] = f(x) (x) E ∑x∈X PX X m X [ ] E Xm Var(X) ≜ [(X − [X] ] = [ ] − E E )2 E X2 [X] E
2
X E ⊂ R [1{X ∈ E}] = (X ∈ E) E P
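The indicator identity E[1{X ∈ E}] = P(X ∈ E) is easy to check by simulation; a quick sketch using the event X > 1 for a standard Gaussian (sample size and seed are my own choices):

```python
import random

rng = random.Random(2)
samples = [rng.gauss(0.0, 1.0) for _ in range(200000)]

# The sample mean of the indicator 1{X > 1} estimates P(X > 1).
indicator_mean = sum(1 for x in samples if x > 1.0) / len(samples)
print(indicator_mean)  # close to P(X > 1) ≈ 0.1587 for a standard Gaussian
```

This identity is used repeatedly below: it is what turns the true risk E[1{h(x) ≠ y}] into the error probability P(h(x) ≠ y).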
16
Consider a special case of the general supervised learning problem
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown P_x on X
  - {y_i}_{i=1}^N are the corresponding labels, with Y = {0, 1} (binary classification)
- An unknown f : X → Y to learn, no noise
- A finite set of hypotheses H, |H| = M < ∞, H ≜ {h_i}_{i=1}^M
- The binary loss ℓ : Y × Y → ℝ⁺ : (y_1, y_2) ↦ 1{y_1 ≠ y_2}
In this very specific case, the true risk simplifies to
  R(h) ≜ E_{xy}[1{h(x) ≠ y}] = P_{xy}(h(x) ≠ y)
and the empirical risk becomes
  R̂_N(h) = (1/N) Σ_{i=1}^N 1{h(x_i) ≠ y_i}
17
- Our objective is to find a hypothesis h* ≜ argmin_{h∈H} R̂_N(h) that ensures a small risk
- For a fixed h_j ∈ H, how does R̂_N(h_j) compare to R(h_j)?
- Observe that for h_j ∈ H, R̂_N(h_j) = (1/N) Σ_{i=1}^N 1{h_j(x_i) ≠ y_i}
  - The empirical risk is a sum of i.i.d. random variables, with E[R̂_N(h_j)] = R(h_j)
  - P(|R̂_N(h_j) − R(h_j)| > ε) is a statement about the deviation of a normalized sum of i.i.d. random variables from its mean
- We’re in luck! Such bounds, a.k.a. concentration inequalities, are a well-studied subject
18
Lemma (Markov's inequality). Let X be a non-negative real-valued random variable. Then for all t > 0, P(X ≥ t) ≤ E[X]/t.

Lemma (Chebyshev's inequality). Let X be a real-valued random variable. Then for all t > 0, P(|X − E[X]| ≥ t) ≤ Var(X)/t².

Proposition (Weak law of large numbers). Let {X_i}_{i=1}^N be i.i.d. real-valued random variables with finite mean μ and finite variance σ². Then for all ε > 0,
  P(|(1/N) Σ_{i=1}^N X_i − μ| ≥ ε) ≤ σ²/(Nε²), so that lim_{N→∞} P(|(1/N) Σ_{i=1}^N X_i − μ| ≥ ε) = 0.
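The Chebyshev bound behind the weak law can be checked numerically; a rough sketch, where the Gaussian distribution, trial count, and seed are my own choices:

```python
import random

def deviation_prob(mu, sigma, n, eps, trials, rng):
    """Empirical P(|sample mean of n draws - mu| >= eps), estimated over many trials."""
    count = 0
    for _ in range(trials):
        mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(mean - mu) >= eps:
            count += 1
    return count / trials

rng = random.Random(1)
emp = deviation_prob(0.0, 1.0, n=100, eps=0.3, trials=2000, rng=rng)
bound = 1.0**2 / (100 * 0.3**2)  # Chebyshev: sigma^2 / (N * eps^2) ≈ 0.111
print(emp, bound)
```

The empirical deviation probability comes out far below the Chebyshev bound, which is consistent with the next slide: Chebyshev is valid but loose, and Hoeffding will do much better.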
19
- By the law of large numbers, we know that ∀ε > 0,
  P(|R̂_N(h_j) − R(h_j)| ≥ ε) ≤ Var(1{h_j(x_1) ≠ y_1})/(Nε²) ≤ 1/(Nε²)
  - Given enough data, we can generalize
  - How much data? Choose N = 1/(δε²) to ensure P(|R̂_N(h_j) − R(h_j)| ≥ ε) ≤ δ
- That’s not quite enough! We care about R̂_N(h*), where h* ≜ argmin_{h∈H} R̂_N(h)
  - If M = |H| is large, we should expect the existence of h_k ∈ H such that R̂_N(h_k) ≪ R(h_k)
  - P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ ?
- By the union bound, P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ P(∃j : |R̂_N(h_j) − R(h_j)| ≥ ε) ≤ M/(Nε²)
- If we choose N ≥ M/(δε²), we can ensure P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ δ. That’s a lot of samples!
20
We can obtain much better bounds than with Chebyshev.

Lemma (Hoeffding's inequality). Let {X_i}_{i=1}^N be i.i.d. real-valued zero-mean random variables such that X_i ∈ [a_i; b_i]. Then for all ε > 0,
  P(|(1/N) Σ_{i=1}^N X_i| ≥ ε) ≤ 2 exp(−2N²ε² / Σ_{i=1}^N (b_i − a_i)²).

In our learning problem:
- ∀ε > 0, P(|R̂_N(h_j) − R(h_j)| ≥ ε) ≤ 2 exp(−2Nε²)
- ∀ε > 0, P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ 2M exp(−2Nε²)
- We can now choose N ≥ (1/(2ε²))(log M + log(2/δ))
- M can be quite large (almost exponential in N) and, with enough data, we can generalize h*. How about learning h♯ ≜ argmin_{h∈H} R(h)?
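The sample-size rule N ≥ (1/(2ε²))(log M + log(2/δ)) is easy to evaluate; a small sketch (the function name is mine) showing the logarithmic dependence on M:

```python
import math

def hoeffding_sample_size(m, eps, delta):
    """Smallest integer N with N >= (1/(2 eps^2)) * (log m + log(2/delta)),
    which by union bound + Hoeffding gives P(|R_hat(h*) - R(h*)| >= eps) <= delta."""
    return math.ceil((math.log(m) + math.log(2 / delta)) / (2 * eps**2))

# Growing the hypothesis class from 10 to 10^12 hypotheses costs only
# a modest increase in N, unlike the linear-in-M Chebyshev requirement.
for m in (10, 10**6, 10**12):
    print(m, hoeffding_sample_size(m, eps=0.05, delta=0.01))
```

Compare with the Chebyshev-based requirement N ≥ M/(δε²) from the previous slide, which for M = 10^6 would demand on the order of 10^11 samples rather than a few thousand.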
21
Lemma. If for all h_j ∈ H we have |R̂_N(h_j) − R(h_j)| ≤ ε, then |R(h*) − R(h♯)| ≤ 2ε.
(Indeed, R(h*) ≤ R̂_N(h*) + ε ≤ R̂_N(h♯) + ε ≤ R(h♯) + 2ε, while R(h♯) ≤ R(h*) by definition of h♯.)
- How do we make R(h♯) small? Need a bigger hypothesis class H! (Could we take M → ∞?)
- Fundamental trade-off of learning
22
- Ideally, we want |H| small so that R(h*) ≈ R(h♯), and to get lucky so that R(h♯) ≈ 0
- In general this is not possible: remember, we usually have to learn P_{y|x}, not a function f
- Next time:
  - What is the optimal binary classification hypothesis class?
  - How small can R(h*) be?
23