 
WHY SUPERVISED LEARNING MAY WORK
Matthieu R Bloch
Tuesday, January 14, 2020

LOGISTICS
TAs and office hours
  Monday: Mehrdad (TSRB) - 2pm-3:15pm
  Tuesday: TJ (VL) - 1:30pm-2:45pm
  Wednesday: Matthieu (TSRB) - 12pm-1:15pm
  Thursday: Hossein (VL) - 10:45am-12:00pm
  Friday: Brighton (TSRB) - 12pm-1:15pm
Pass/fail policy
  Same homework/exam requirements as letter grade, B required to pass
Self-assessment available online
  Due Friday January 17, 2020 (11:59PM EST) (Friday January 24, 2020 for DL)
http://www.phdcomics.com

RECAP: COMPONENTS OF SUPERVISED MACHINE LEARNING
1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$
   $\{x_i\}_{i=1}^N$ drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$
   $\{y_i\}_{i=1}^N$ are the corresponding targets, $y_i \in \mathcal{Y} \triangleq \mathbb{R}$
2. An unknown conditional distribution $P_{y|x}$
   $P_{y|x}$ models $f : \mathcal{X} \to \mathcal{Y}$ with noise
3. A set of hypotheses $\mathcal{H}$ as to what the function could be
4. A loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ capturing the "cost" of prediction
5. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$
[Figure: Learning model]

RECAP: THE SUPERVISED LEARNING PROBLEM
Learning is not memorizing
  Our goal is not to find $h \in \mathcal{H}$ that accurately assigns values to elements of $\mathcal{D}$
  Our goal is to find the best $h \in \mathcal{H}$ that accurately predicts values of unseen samples
Consider a hypothesis $h \in \mathcal{H}$. We can easily compute the empirical risk (a.k.a. in-sample error)
  $\hat{R}_N(h) \triangleq \frac{1}{N} \sum_{i=1}^N \ell(y_i, h(x_i))$
What we really care about is the true risk (a.k.a. out-of-sample error)
  $R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))]$
Question #1: Can we generalize? For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
Question #2: Can we learn well?
  Given $\mathcal{H}$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$
  Our Empirical Risk Minimization (ERM) algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$
  Is $\hat{R}_N(h^*)$ close to $R(h^\sharp)$?
  Is $R(h^\sharp) \approx 0$?

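As a minimal sketch of the empirical risk defined above, the average loss over a dataset can be computed directly. The squared loss, the toy dataset, and the two hypotheses `h_good` and `h_bad` are illustrative assumptions, not part of the lecture.

```python
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # one (x_i, y_i) pair


def empirical_risk(h: Callable[[float], float],
                   data: Sequence[Sample],
                   loss: Callable[[float, float], float]) -> float:
    """In-sample error: (1/N) * sum of loss(y_i, h(x_i)) over the dataset."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)


def squared_loss(y: float, y_hat: float) -> float:
    """Assumed loss for this sketch; the slides leave the loss generic."""
    return (y - y_hat) ** 2


# Hypothetical toy dataset generated by y = 2x without noise.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def h_good(x: float) -> float:
    return 2.0 * x  # matches the data exactly


def h_bad(x: float) -> float:
    return x  # underestimates every target except at x = 0


print(empirical_risk(h_good, data, squared_loss))  # 0.0
print(empirical_risk(h_bad, data, squared_loss))   # 3.5
```

The true risk cannot be computed this way because it is an expectation over the unknown distribution; only the empirical risk is available to the learner.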
A SIMPLER SUPERVISED LEARNING PROBLEM
Consider a special case of the general supervised learning problem
1. Dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$
   $\{x_i\}_{i=1}^N$ drawn i.i.d. from unknown $P_x$ on $\mathcal{X}$
   $\{y_i\}_{i=1}^N$ labels with $\mathcal{Y} = \{0, 1\}$ (binary classification)
2. Unknown $f : \mathcal{X} \to \mathcal{Y}$, no noise.
3. Finite set of hypotheses $\mathcal{H}$, $|\mathcal{H}| = M < \infty$
   $\mathcal{H} \triangleq \{h_i\}_{i=1}^M$
4. Binary loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+ : (y_1, y_2) \mapsto \mathbb{1}\{y_1 \neq y_2\}$
In this very specific case, the true risk simplifies to
  $R(h) \triangleq \mathbb{E}_{xy}[\mathbb{1}\{h(x) \neq y\}] = \mathbb{P}_{xy}(h(x) \neq y)$
The empirical risk becomes
  $\hat{R}_N(h) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\{h(x_i) \neq y_i\}$

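The simplified setting can be sketched in a few lines. Everything concrete here is an assumption made for illustration: inputs uniform on [0, 1], a noiseless target that thresholds at 0.3, and a finite class of M = 11 threshold classifiers. Under these assumptions the true risk of a threshold t is |t - 0.3|, so it can be compared with the empirical risk.

```python
import random


def zero_one_loss(y: int, y_hat: int) -> int:
    """Binary loss: 1 if the prediction differs from the label, else 0."""
    return 1 if y != y_hat else 0


def make_threshold(t: float):
    """Hypothetical classifier h_t(x) = 1{x >= t}."""
    return lambda x: 1 if x >= t else 0


def empirical_risk(h, sample) -> float:
    """Fraction of misclassified points in the sample."""
    return sum(zero_one_loss(y, h(x)) for x, y in sample) / len(sample)


def true_risk(t: float) -> float:
    """P(h_t(x) != f(x)) for x uniform on [0, 1] and f = h_0.3 (assumed)."""
    return abs(t - 0.3)


rng = random.Random(0)
f = make_threshold(0.3)  # assumed noiseless target
sample = [(x, f(x)) for x in (rng.random() for _ in range(200))]

thresholds = [k / 10 for k in range(11)]  # finite class, M = 11
risks = {t: empirical_risk(make_threshold(t), sample) for t in thresholds}

# Empirical risk minimization over the finite class.
t_star = min(thresholds, key=lambda t: risks[t])

for t in thresholds:
    print(f"t={t:.1f}  empirical={risks[t]:.3f}  true={true_risk(t):.3f}")
print("ERM picks t =", t_star)
```

Because the target itself belongs to the class in this sketch, the minimum empirical risk is exactly zero; that need not hold in general.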
CAN WE LEARN?
Our objective is to find a hypothesis $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$ that ensures a small risk
For a fixed $h_j \in \mathcal{H}$, how does $\hat{R}_N(h_j)$ compare to $R(h_j)$?
Observe that for $h_j \in \mathcal{H}$
  The empirical risk is a normalized sum of i.i.d. random variables
    $\hat{R}_N(h_j) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\{h_j(x_i) \neq y_i\}$
  $\mathbb{E}[\hat{R}_N(h_j)] = R(h_j)$
$\mathbb{P}(|\hat{R}_N(h_j) - R(h_j)| > \epsilon)$ is a statement about the deviation of a normalized sum of i.i.d. random variables from its mean
We're in luck! Such bounds, known as concentration inequalities, are a well-studied subject

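A small Monte Carlo experiment can illustrate both observations: the empirical risk of a fixed hypothesis averages out to its true risk, and large deviations become rarer as N grows. The target `f`, the hypothesis `h_j`, and the uniform input distribution are assumptions chosen so that the true risk is known to be 0.2.

```python
import random


def f(x: float) -> int:
    """Assumed noiseless target: 1{x >= 0.3}."""
    return 1 if x >= 0.3 else 0


def h_j(x: float) -> int:
    """Fixed hypothesis 1{x >= 0.5}; it errs exactly when 0.3 <= x < 0.5."""
    return 1 if x >= 0.5 else 0


TRUE_RISK = 0.2  # P(0.3 <= x < 0.5) for x uniform on [0, 1]


def empirical_risk_01(h, sample) -> float:
    """Fraction of sample points on which h disagrees with the label."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def simulate(n: int, trials: int, eps: float, seed: int = 0):
    """Draw `trials` datasets of size n; return the mean empirical risk and
    the fraction of datasets whose empirical risk deviates by more than eps."""
    rng = random.Random(seed)
    risks = []
    for _ in range(trials):
        sample = [(x, f(x)) for x in (rng.random() for _ in range(n))]
        risks.append(empirical_risk_01(h_j, sample))
    mean_risk = sum(risks) / trials
    dev_freq = sum(1 for r in risks if abs(r - TRUE_RISK) > eps) / trials
    return mean_risk, dev_freq


mean_small, dev_small = simulate(n=50, trials=2000, eps=0.1)
mean_large, dev_large = simulate(n=500, trials=2000, eps=0.1)
print("N=50 :", mean_small, dev_small)
print("N=500:", mean_large, dev_large)
```

The deviation frequency is an estimate of the probability that the concentration inequalities bound.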
CONCENTRATION INEQUALITIES 101
Lemma (Markov's inequality)
  Let $X$ be a non-negative real-valued random variable. Then for all $t > 0$
    $\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$
Lemma (Chebyshev's inequality)
  Let $X$ be a real-valued random variable. Then for all $t > 0$
    $\mathbb{P}(|X - \mathbb{E}[X]| \geq t) \leq \frac{\mathrm{Var}(X)}{t^2}.$
Proposition (Weak law of large numbers)
  Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued random variables with finite mean $\mu$ and finite variance $\sigma^2$. Then
    $\mathbb{P}\left(\left|\frac{1}{N} \sum_{i=1}^N X_i - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{N\epsilon^2}$ and $\lim_{N \to \infty} \mathbb{P}\left(\left|\frac{1}{N} \sum_{i=1}^N X_i - \mu\right| \geq \epsilon\right) = 0.$

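The bound in the weak law follows from Chebyshev's inequality applied to the sample mean. A short derivation, using only the assumptions stated in the proposition and any $\epsilon > 0$:

```latex
\begin{align*}
\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N X_i\right]
  &= \frac{1}{N}\sum_{i=1}^N \mathbb{E}[X_i] = \mu, \\
\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^N X_i\right)
  &= \frac{1}{N^2}\sum_{i=1}^N \mathrm{Var}(X_i) = \frac{\sigma^2}{N}
  \quad \text{(by independence)}, \\
\mathbb{P}\left(\left|\frac{1}{N}\sum_{i=1}^N X_i - \mu\right| \geq \epsilon\right)
  &\leq \frac{\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^N X_i\right)}{\epsilon^2}
  = \frac{\sigma^2}{N\epsilon^2}
  \xrightarrow[N \to \infty]{} 0.
\end{align*}
```

Chebyshev's inequality itself follows from Markov's inequality applied to the non-negative random variable $(X - \mathbb{E}[X])^2$ with threshold $t^2$.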
BACK TO LEARNING
By the law of large numbers, we know that
  $\forall \epsilon > 0 \quad \mathbb{P}_{\{(x_i, y_i)\}}\left(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon\right) \leq \frac{\mathrm{Var}(\mathbb{1}\{h_j(x_1) \neq y_1\})}{N\epsilon^2} \leq \frac{1}{N\epsilon^2}$
Given enough data, we can generalize
How much data? $N = \frac{1}{\delta\epsilon^2}$ to ensure $\mathbb{P}(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \delta$.
That's not quite enough! We care about $\hat{R}_N(h^*)$ where $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$
  If $M = |\mathcal{H}|$ is large, we should expect the existence of $h_k \in \mathcal{H}$ such that $\hat{R}_N(h_k) \ll R(h_k)$
  $\mathbb{P}(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq\ ?$
Since $h^*$ is one of the $h_j$,
  $\mathbb{P}(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq \mathbb{P}(\exists j : |\hat{R}_N(h_j) - R(h_j)| \geq \epsilon)$
and, by the union bound over the $M$ hypotheses,
  $\mathbb{P}(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq \frac{M}{N\epsilon^2}$
If we choose $N \geq \left\lceil \frac{M}{\delta\epsilon^2} \right\rceil$ we can ensure $\mathbb{P}(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq \delta$.
That's a lot of samples!

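To get a feel for how demanding this requirement is, the sample size $N \geq \lceil M / (\delta\epsilon^2) \rceil$ can be evaluated directly. The function name and the example values of M, epsilon, and delta are illustrative assumptions.

```python
import math


def chebyshev_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    """Smallest N with M / (N * eps^2) <= delta, i.e. N >= ceil(M / (delta * eps^2)).

    This is the Chebyshev-plus-union-bound requirement from the slide.
    """
    if num_hypotheses < 1 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("need M >= 1, eps > 0 and 0 < delta < 1")
    return math.ceil(num_hypotheses / (delta * eps ** 2))


# The requirement grows linearly in the number of hypotheses M.
for m in (1, 10, 1000):
    print(m, chebyshev_sample_size(m, eps=0.1, delta=0.05))
```

With a tolerance of 0.1 and a confidence parameter of 0.05, even a single hypothesis needs on the order of a few thousand samples under this bound, and the count scales proportionally with M.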
CONCENTRATION INEQUALITIES 102
We can obtain much better bounds than with Chebyshev
Lemma (Hoeffding's inequality)
  Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued zero-mean random variables such that $X_i \in [a_i; b_i]$. Then for all $\epsilon > 0$
    $\mathbb{P}\left(\left|\frac{1}{N} \sum_{i=1}^N X_i\right| \geq \epsilon\right) \leq 2 \exp\left(-\frac{2N^2\epsilon^2}{\sum_{i=1}^N (b_i - a_i)^2}\right).$
In our learning problem
  $\forall \epsilon > 0 \quad \mathbb{P}(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq 2 \exp(-2N\epsilon^2)$
  $\forall \epsilon > 0 \quad \mathbb{P}(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq 2M \exp(-2N\epsilon^2)$
We can now choose $N \geq \left\lceil \frac{1}{2\epsilon^2} \ln\frac{2M}{\delta} \right\rceil$
$M$ can be quite large (almost exponential in $N$) and, with enough data, we can generalize $h^*$.
How about learning $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$?

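The specialization to the learning problem uses the centered indicators $\mathbb{1}\{h_j(x_i) \neq y_i\} - R(h_j)$, which are zero-mean and lie in an interval of width $b_i - a_i = 1$, so the exponent becomes $-2N^2\epsilon^2 / N = -2N\epsilon^2$. Solving $2M \exp(-2N\epsilon^2) \leq \delta$ for $N$ gives the sample size above. The sketch below compares it with the Chebyshev-based requirement $\lceil M / (\delta\epsilon^2) \rceil$ from the previous slide; the function names and example values are illustrative assumptions.

```python
import math


def hoeffding_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    """Smallest N with 2 * M * exp(-2 * N * eps^2) <= delta."""
    if num_hypotheses < 1 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("need M >= 1, eps > 0 and 0 < delta < 1")
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))


def chebyshev_sample_size(num_hypotheses: int, eps: float, delta: float) -> int:
    """Chebyshev-plus-union-bound requirement: N >= ceil(M / (delta * eps^2))."""
    if num_hypotheses < 1 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("need M >= 1, eps > 0 and 0 < delta < 1")
    return math.ceil(num_hypotheses / (delta * eps ** 2))


# Hoeffding grows logarithmically in M, Chebyshev linearly.
for m in (1, 1000, 10 ** 6):
    print(m,
          hoeffding_sample_size(m, eps=0.1, delta=0.05),
          chebyshev_sample_size(m, eps=0.1, delta=0.05))
```

For example, with M = 1000, a tolerance of 0.1, and a confidence parameter of 0.05, the Hoeffding-based requirement is 530 samples, whereas the Chebyshev-based one is on the order of two million.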