WHY SUPERVISED LEARNING MAY WORK
Matthieu R Bloch
Tuesday, January 14, 2020
LOGISTICS

TAs and office hours
- Monday: Mehrdad (TSRB), 2pm-3:15pm
- Tuesday: TJ (VL), 1:30pm-2:45pm
- Wednesday: Matthieu (TSRB), 12pm-1:15pm
- Thursday: Hossein (VL), 10:45am-12:00pm
- Friday: Brighton (TSRB), 12pm-1:15pm

Pass/fail policy
- Same homework/exam requirements as letter grade; B required to pass

Self-assessment online
- Due Friday, January 17, 2020, 11:59pm EST (Friday, January 24, 2020 for DL)
[Comic: http://www.phdcomics.com]
Learning model

- A dataset $D \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_x$ on $X$, and the $\{y_i\}_{i=1}^N \in Y \triangleq \mathbb{R}$ are the corresponding targets
- An unknown conditional distribution $P_{y|x}$ generating the targets: $P_{y|x}$ models an underlying function $f: X \to Y$ with noise
- A set $H$ of hypotheses as to what the function could be
- A loss function $\ell: Y \times Y \to \mathbb{R}^+$ capturing the "cost" of prediction
- An algorithm ALG to find the best $h \in H$ that explains $f$
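To make these ingredients concrete, here is a minimal Python sketch (mine, not from the slides; the names `hypothesis_set` and `erm` and the specific distributions are illustrative assumptions): a dataset of pairs, a finite hypothesis set, a loss, and an algorithm that picks the hypothesis with the smallest average loss on the data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Dataset D = {(x_1, y_1), ..., (x_N, y_N)}: x_i drawn i.i.d. from P_x,
    # y_i generated by a noisy P_{y|x} (here: a linear function plus Gaussian noise).
    N = 100
    x = rng.uniform(-1, 1, size=N)                # samples from P_x
    y = 2.0 * x + rng.normal(0, 0.1, size=N)      # targets from P_{y|x}

    # Hypothesis set H: a small family of candidate functions f: X -> Y.
    hypothesis_set = [lambda x, a=a: a * x for a in np.linspace(-3, 3, 61)]

    # Loss l: Y x Y -> R+, here the squared error.
    def loss(y_true, y_pred):
        return (y_true - y_pred) ** 2

    # ALG: pick the h in H minimizing the average loss on D.
    def erm(hypotheses, x, y):
        return min(hypotheses, key=lambda h: np.mean(loss(y, h(x))))

    h_best = erm(hypothesis_set, x, y)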
Learning is not memorizing

- Our goal is not to find $h \in H$ that accurately assigns values to elements of $D$
- Our goal is to find the best $h \in H$ that accurately predicts values of unseen samples
- Consider a hypothesis $h \in H$. We can easily compute the empirical risk (a.k.a. in-sample error) $\hat{R}_N(h) \triangleq \frac{1}{N}\sum_{i=1}^N \ell(y_i, h(x_i))$
- What we really care about is the true risk (a.k.a. out-sample error) $R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))]$
- Question #1: Can $h$ generalize? For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
- Question #2: Can we learn well? Given $H$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in H} R(h)$. Our Empirical Risk Minimization (ERM) algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in H} \hat{R}_N(h)$. Is $\hat{R}_N(h^*)$ close to $R(h^\sharp)$? Is $R(h^\sharp) \approx 0$?
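A toy experiment (my sketch, not the slides') makes the distinction vivid: a hypothesis that memorizes the training pairs drives the empirical risk to zero, while its true risk stays at chance level on unseen samples.

    import numpy as np

    rng = np.random.default_rng(1)

    # Binary labels that are pure noise: every hypothesis has true risk 1/2.
    N = 200
    x_train = rng.uniform(0, 1, size=N)
    y_train = rng.integers(0, 2, size=N)

    # A "memorizer": perfect on training points, arbitrary (say 0) elsewhere.
    lookup = dict(zip(x_train.tolist(), y_train.tolist()))
    def h_memorize(x):
        return np.array([lookup.get(float(xi), 0) for xi in x])

    # Empirical risk (in-sample error) with 0-1 loss: exactly 0 by construction.
    emp_risk = np.mean(h_memorize(x_train) != y_train)

    # Estimate of the true risk (out-sample error) on fresh samples: about 1/2.
    x_test = rng.uniform(0, 1, size=10_000)
    y_test = rng.integers(0, 2, size=10_000)
    true_risk_est = np.mean(h_memorize(x_test) != y_test)

    print(emp_risk, true_risk_est)   # e.g. 0.0 and ~0.5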
Consider a special case of the general supervised learning problem:
- a dataset $D \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ drawn i.i.d. from an unknown $P_x$ on $X$
- labels $\{y_i\}_{i=1}^N$ with $Y = \{0, 1\}$ (binary classification)
- an unknown $f: X \to Y$, no noise
- a finite hypothesis set $H \triangleq \{h_i\}_{i=1}^M$, $|H| = M < \infty$
- the 0-1 loss $\ell: Y \times Y \to \mathbb{R}^+ : (y_1, y_2) \mapsto \mathbf{1}\{y_1 \neq y_2\}$

In this very specific case, the true risk simplifies to $R(h) \triangleq \mathbb{E}_{xy}[\mathbf{1}\{h(x) \neq y\}] = P_{xy}(h(x) \neq y)$. The empirical risk becomes $\hat{R}_N(h) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{h(x_i) \neq y_i\}$.
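Under the 0-1 loss, both risks are just misclassification probabilities. A short sketch (illustrative; the threshold classifier and the toy $P_{xy}$ are my assumptions, not the slides') shows the empirical risk as a plain error rate:

    import numpy as np

    rng = np.random.default_rng(2)

    # A known toy P_xy: x ~ Uniform(0,1), y = 1{x > 0.5} with labels flipped w.p. 0.1.
    N = 1_000
    x = rng.uniform(0, 1, size=N)
    y = (x > 0.5).astype(int) ^ (rng.random(N) < 0.1).astype(int)

    # A candidate hypothesis from a finite class of threshold classifiers.
    def h(x, t=0.4):
        return (x > t).astype(int)

    # Empirical risk with 0-1 loss = fraction of misclassified samples.
    # Here R(h) is computable exactly: 0.1 * 0.9 + 0.9 * 0.1 = 0.18.
    emp_risk = np.mean(h(x) != y)   # estimates R(h) = P_xy(h(x) != y)
    print(emp_risk)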
Our objective is to find a hypothesis $h^* = \operatorname{argmin}_{h \in H} \hat{R}_N(h)$ that ensures a small risk.
- For a fixed $h_j \in H$, how does $\hat{R}_N(h_j)$ compare to $R(h_j)$?
- Observe that for $h_j \in H$, the empirical risk $\hat{R}_N(h_j) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{h_j(x_i) \neq y_i\}$ is a sum of i.i.d. random variables with $\mathbb{E}[\hat{R}_N(h_j)] = R(h_j)$
- $P(|\hat{R}_N(h_j) - R(h_j)| > \epsilon)$ is a statement about the deviation of a normalized sum of i.i.d. random variables from its mean
- We're in luck! Such bounds, known as concentration inequalities, are a well-studied subject
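A quick simulation (mine, not from the slides; the value of $R(h_j)$ is assumed for illustration) shows this concentration at work: for a fixed hypothesis, the deviation probability shrinks as $N$ grows.

    import numpy as np

    rng = np.random.default_rng(3)

    # For a fixed h_j, each indicator 1{h_j(x_i) != y_i} is a Bernoulli(R(h_j))
    # random variable; the empirical risk is their sample mean.
    true_risk = 0.3   # R(h_j), assumed known for the simulation
    eps = 0.05

    for N in [10, 100, 1_000, 10_000]:
        # 5000 independent draws of the empirical risk \hat{R}_N(h_j).
        emp_risks = rng.binomial(N, true_risk, size=5_000) / N
        deviation_prob = np.mean(np.abs(emp_risks - true_risk) > eps)
        print(N, deviation_prob)   # decreases toward 0 as N grows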
Lemma (Markov's inequality). Let $X$ be a non-negative real-valued random variable. Then for all $t > 0$,
$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$

Lemma (Chebyshev's inequality). Let $X$ be a real-valued random variable. Then for all $t > 0$,
$P(|X - \mathbb{E}[X]| \geq t) \leq \frac{\operatorname{Var}(X)}{t^2}.$

Proposition (Weak law of large numbers). Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued random variables with finite mean $\mu$ and finite variance $\sigma^2$. Then
$P\left(\left|\frac{1}{N}\sum_{i=1}^N X_i - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{N\epsilon^2}, \quad \text{so} \quad \lim_{N \to \infty} P\left(\left|\frac{1}{N}\sum_{i=1}^N X_i - \mu\right| \geq \epsilon\right) = 0.$
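The proposition follows by applying Chebyshev's inequality to the sample mean, which has expectation $\mu$ and, by independence, variance $\sigma^2/N$ (a standard one-step derivation, spelled out here for completeness):

    % Chebyshev applied to S_N = (1/N) \sum_{i=1}^N X_i,
    % with E[S_N] = \mu and Var(S_N) = \sigma^2 / N:
    \[
      P\left( \left| \frac{1}{N}\sum_{i=1}^N X_i - \mu \right| \geq \epsilon \right)
      \leq \frac{\operatorname{Var}(S_N)}{\epsilon^2}
      = \frac{\sigma^2}{N\epsilon^2}
      \xrightarrow[N \to \infty]{} 0.
    \]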
By the law of large numbers, we know that $\forall \epsilon > 0$, $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \frac{\operatorname{Var}(\mathbf{1}\{h_j(x_1) \neq y_1\})}{N\epsilon^2} \leq \frac{1}{N\epsilon^2}$.
- Given enough data, we can generalize. How much data? Choose $N = \lceil \frac{1}{\delta\epsilon^2} \rceil$ to ensure $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \delta$.
- That's not quite enough! We care about $\hat{R}_N(h^*)$ where $h^* = \operatorname{argmin}_{h \in H} \hat{R}_N(h)$
- If $M = |H|$ is large we should expect the existence of $h_k \in H$ such that $\hat{R}_N(h_k) \ll R(h_k)$
- So what is $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon)$? By the union bound, $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq P(\exists j : |\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \frac{M}{N\epsilon^2}$
- If we choose $N \geq \lceil \frac{M}{\delta\epsilon^2} \rceil$ we can ensure $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq \delta$. That's a lot of samples!
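To see why that's "a lot of samples", plug in illustrative numbers (my choice, not the slides'): with $M = 1000$ hypotheses, $\epsilon = 0.05$, and $\delta = 0.05$, the Chebyshev-plus-union-bound recipe demands eight million samples.

    import math

    # Sample complexity from Chebyshev + union bound: N >= ceil(M / (delta * eps^2)).
    # Illustrative values (not from the slides):
    M, eps, delta = 1_000, 0.05, 0.05
    N_chebyshev = math.ceil(M / (delta * eps**2))
    print(N_chebyshev)   # 8000000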
We can obtain much better bounds than with Chebyshev's inequality.

Lemma (Hoeffding's inequality). Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued zero-mean random variables such that $X_i \in [a_i; b_i]$. Then for all $\epsilon > 0$,
$P\left(\left|\frac{1}{N}\sum_{i=1}^N X_i\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2N^2\epsilon^2}{\sum_{i=1}^N (b_i - a_i)^2}\right).$

In our learning problem:
- $\forall \epsilon > 0$, $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq 2\exp(-2N\epsilon^2)$
- $\forall \epsilon > 0$, $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq 2M\exp(-2N\epsilon^2)$

We can now choose $N \geq \lceil \frac{1}{2\epsilon^2} \ln\frac{2M}{\delta} \rceil$: $M$ can be quite large (almost exponential in $N$) and, with enough data, we can generalize $h^*$. How about learning $h^\sharp \triangleq \operatorname{argmin}_{h \in H} R(h)$?
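Comparing the two sample-complexity formulas with the same illustrative numbers as before shows the payoff of Hoeffding's exponential bound: the dependence on $M$ drops from linear to logarithmic.

    import math

    # Same illustrative values as before: M hypotheses, accuracy eps, confidence delta.
    M, eps, delta = 1_000, 0.05, 0.05

    # Chebyshev + union bound: N >= ceil(M / (delta * eps^2)).
    N_chebyshev = math.ceil(M / (delta * eps**2))

    # Hoeffding + union bound: N >= ceil(ln(2M / delta) / (2 * eps^2)).
    N_hoeffding = math.ceil(math.log(2 * M / delta) / (2 * eps**2))

    print(N_chebyshev, N_hoeffding)   # 8000000 vs 2120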