WHY SUPERVISED LEARNING MAY WORK
Matthieu R Bloch
Tuesday, January 14, 2020
LOGISTICS

TAs and office hours
- Monday: Mehrdad (TSRB), 2pm-3:15pm
- Tuesday: TJ (VL), 1:30pm-2:45pm
- Wednesday: Matthieu (TSRB), 12pm-1:15pm
- Thursday: Hossein (VL), 10:45am-12:00pm
- Friday: Brighton (TSRB), 12pm-1:15pm

Pass/fail policy
- Same homework/exam requirements as letter grade; B required to pass

Self-assessment online
- Due Friday, January 17, 2020, 11:59pm EST (Friday, January 24, 2020 for DL)
[Comic: http://www.phdcomics.com]
Learning model

- A dataset $D \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_x$ on $X$, and the $\{y_i\}_{i=1}^N \in Y \triangleq \mathbb{R}$ are the corresponding targets
- An unknown conditional distribution $P_{y|x}$ generating the targets: $P_{y|x}$ models an underlying function $f: X \to Y$ with noise
- A set $H$ of hypotheses as to what the function could be
- A loss function $\ell: Y \times Y \to \mathbb{R}^+$ capturing the "cost" of prediction
- An algorithm ALG to find the best $h \in H$ that explains $f$
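To make these ingredients concrete, here is a minimal Python sketch (mine, not from the slides; the names `hypothesis_set` and `erm` and the specific distributions are illustrative assumptions): a dataset of pairs, a finite hypothesis set, a loss, and an algorithm that picks the hypothesis with the smallest average loss on the data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Dataset D = {(x_1, y_1), ..., (x_N, y_N)}: x_i drawn i.i.d. from P_x,
    # y_i generated by a noisy P_{y|x} (here: a linear function plus Gaussian noise).
    N = 100
    x = rng.uniform(-1, 1, size=N)                # samples from P_x
    y = 2.0 * x + rng.normal(0, 0.1, size=N)      # targets from P_{y|x}

    # Hypothesis set H: a small family of candidate functions f: X -> Y.
    hypothesis_set = [lambda x, a=a: a * x for a in np.linspace(-3, 3, 61)]

    # Loss l: Y x Y -> R+, here the squared error.
    def loss(y_true, y_pred):
        return (y_true - y_pred) ** 2

    # ALG: pick the h in H minimizing the average loss on D.
    def erm(hypotheses, x, y):
        return min(hypotheses, key=lambda h: np.mean(loss(y, h(x))))

    h_best = erm(hypothesis_set, x, y)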
Learning is not memorizing

- Our goal is not to find $h \in H$ that accurately assigns values to elements of $D$
- Our goal is to find the best $h \in H$ that accurately predicts values of unseen samples
- Consider a hypothesis $h \in H$. We can easily compute the empirical risk (a.k.a. in-sample error) $\hat{R}_N(h) \triangleq \frac{1}{N}\sum_{i=1}^N \ell(y_i, h(x_i))$
- What we really care about is the true risk (a.k.a. out-sample error) $R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))]$
- Question #1: Can $h$ generalize? For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
- Question #2: Can we learn well? Given $H$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in H} R(h)$. Our Empirical Risk Minimization (ERM) algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in H} \hat{R}_N(h)$. Is $\hat{R}_N(h^*)$ close to $R(h^\sharp)$? Is $R(h^\sharp) \approx 0$?
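A toy experiment (my sketch, not the slides') makes the distinction vivid: a hypothesis that memorizes the training pairs drives the empirical risk to zero, while its true risk stays at chance level on unseen samples.

    import numpy as np

    rng = np.random.default_rng(1)

    # Binary labels that are pure noise: every hypothesis has true risk 1/2.
    N = 200
    x_train = rng.uniform(0, 1, size=N)
    y_train = rng.integers(0, 2, size=N)

    # A "memorizer": perfect on training points, arbitrary (say 0) elsewhere.
    lookup = dict(zip(x_train.tolist(), y_train.tolist()))
    def h_memorize(x):
        return np.array([lookup.get(float(xi), 0) for xi in x])

    # Empirical risk (in-sample error) with 0-1 loss: exactly 0 by construction.
    emp_risk = np.mean(h_memorize(x_train) != y_train)

    # Estimate of the true risk (out-sample error) on fresh samples: about 1/2.
    x_test = rng.uniform(0, 1, size=10_000)
    y_test = rng.integers(0, 2, size=10_000)
    true_risk_est = np.mean(h_memorize(x_test) != y_test)

    print(emp_risk, true_risk_est)   # e.g. 0.0 and ~0.5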
Consider a special case of the general supervised learning problem:
- a dataset $D \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ drawn i.i.d. from an unknown $P_x$ on $X$
- labels $\{y_i\}_{i=1}^N$ with $Y = \{0, 1\}$ (binary classification)
- an unknown $f: X \to Y$, no noise
- a finite hypothesis set $H \triangleq \{h_i\}_{i=1}^M$, $|H| = M < \infty$
- the 0-1 loss $\ell: Y \times Y \to \mathbb{R}^+ : (y_1, y_2) \mapsto \mathbf{1}\{y_1 \neq y_2\}$

In this very specific case, the true risk simplifies to $R(h) \triangleq \mathbb{E}_{xy}[\mathbf{1}\{h(x) \neq y\}] = P_{xy}(h(x) \neq y)$. The empirical risk becomes $\hat{R}_N(h) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{h(x_i) \neq y_i\}$.
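Under the 0-1 loss, both risks are just misclassification probabilities. A short sketch (illustrative; the threshold classifier and the toy $P_{xy}$ are my assumptions, not the slides') shows the empirical risk as a plain error rate:

    import numpy as np

    rng = np.random.default_rng(2)

    # A known toy P_xy: x ~ Uniform(0,1), y = 1{x > 0.5} with labels flipped w.p. 0.1.
    N = 1_000
    x = rng.uniform(0, 1, size=N)
    y = (x > 0.5).astype(int) ^ (rng.random(N) < 0.1).astype(int)

    # A candidate hypothesis from a finite class of threshold classifiers.
    def h(x, t=0.4):
        return (x > t).astype(int)

    # Empirical risk with 0-1 loss = fraction of misclassified samples.
    # Here R(h) is computable exactly: 0.1 * 0.9 + 0.9 * 0.1 = 0.18.
    emp_risk = np.mean(h(x) != y)   # estimates R(h) = P_xy(h(x) != y)
    print(emp_risk)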
Our objective is to find a hypothesis $h^* = \operatorname{argmin}_{h \in H} \hat{R}_N(h)$ that ensures a small risk.
- For a fixed $h_j \in H$, how does $\hat{R}_N(h_j)$ compare to $R(h_j)$?
- Observe that for $h_j \in H$, the empirical risk $\hat{R}_N(h_j) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{h_j(x_i) \neq y_i\}$ is a sum of i.i.d. random variables with $\mathbb{E}[\hat{R}_N(h_j)] = R(h_j)$
- $P(|\hat{R}_N(h_j) - R(h_j)| > \epsilon)$ is a statement about the deviation of a normalized sum of i.i.d. random variables from its mean
- We're in luck! Such bounds, known as concentration inequalities, are a well-studied subject
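A quick simulation (mine, not from the slides; the value of $R(h_j)$ is assumed for illustration) shows this concentration at work: for a fixed hypothesis, the deviation probability shrinks as $N$ grows.

    import numpy as np

    rng = np.random.default_rng(3)

    # For a fixed h_j, each indicator 1{h_j(x_i) != y_i} is a Bernoulli(R(h_j))
    # random variable; the empirical risk is their sample mean.
    true_risk = 0.3   # R(h_j), assumed known for the simulation
    eps = 0.05

    for N in [10, 100, 1_000, 10_000]:
        # 5000 independent draws of the empirical risk \hat{R}_N(h_j).
        emp_risks = rng.binomial(N, true_risk, size=5_000) / N
        deviation_prob = np.mean(np.abs(emp_risks - true_risk) > eps)
        print(N, deviation_prob)   # decreases toward 0 as N grows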
Lemma (Markov's inequality). Let $X$ be a non-negative real-valued random variable. Then for all $t > 0$,
$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$

Lemma (Chebyshev's inequality). Let $X$ be a real-valued random variable. Then for all $t > 0$,
$P(|X - \mathbb{E}[X]| \geq t) \leq \frac{\operatorname{Var}(X)}{t^2}.$

Proposition (Weak law of large numbers). Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued random variables with finite mean $\mu$ and finite variance $\sigma^2$. Then
$P\left(\left|\frac{1}{N}\sum_{i=1}^N X_i - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{N\epsilon^2}, \quad \text{so} \quad \lim_{N \to \infty} P\left(\left|\frac{1}{N}\sum_{i=1}^N X_i - \mu\right| \geq \epsilon\right) = 0.$
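The proposition follows by applying Chebyshev's inequality to the sample mean, which has expectation $\mu$ and, by independence, variance $\sigma^2/N$ (a standard one-step derivation, spelled out here for completeness):

    % Chebyshev applied to S_N = (1/N) \sum_{i=1}^N X_i,
    % with E[S_N] = \mu and Var(S_N) = \sigma^2 / N:
    \[
      P\left( \left| \frac{1}{N}\sum_{i=1}^N X_i - \mu \right| \geq \epsilon \right)
      \leq \frac{\operatorname{Var}(S_N)}{\epsilon^2}
      = \frac{\sigma^2}{N\epsilon^2}
      \xrightarrow[N \to \infty]{} 0.
    \]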
By the law of large numbers, we know that $\forall \epsilon > 0$, $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \frac{\operatorname{Var}(\mathbf{1}\{h_j(x_1) \neq y_1\})}{N\epsilon^2} \leq \frac{1}{N\epsilon^2}$.
- Given enough data, we can generalize. How much data? Choose $N = \lceil \frac{1}{\delta\epsilon^2} \rceil$ to ensure $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \delta$.
- That's not quite enough! We care about $\hat{R}_N(h^*)$ where $h^* = \operatorname{argmin}_{h \in H} \hat{R}_N(h)$
- If $M = |H|$ is large we should expect the existence of $h_k \in H$ such that $\hat{R}_N(h_k) \ll R(h_k)$
- So what is $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon)$? By the union bound, $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq P(\exists j : |\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \frac{M}{N\epsilon^2}$
- If we choose $N \geq \lceil \frac{M}{\delta\epsilon^2} \rceil$ we can ensure $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq \delta$. That's a lot of samples!
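To see why that's "a lot of samples", plug in illustrative numbers (my choice, not the slides'): with $M = 1000$ hypotheses, $\epsilon = 0.05$, and $\delta = 0.05$, the Chebyshev-plus-union-bound recipe demands eight million samples.

    import math

    # Sample complexity from Chebyshev + union bound: N >= ceil(M / (delta * eps^2)).
    # Illustrative values (not from the slides):
    M, eps, delta = 1_000, 0.05, 0.05
    N_chebyshev = math.ceil(M / (delta * eps**2))
    print(N_chebyshev)   # 8000000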
We can obtain much better bounds than with Chebyshev's inequality.

Lemma (Hoeffding's inequality). Let $\{X_i\}_{i=1}^N$ be i.i.d. real-valued zero-mean random variables such that $X_i \in [a_i; b_i]$. Then for all $\epsilon > 0$,
$P\left(\left|\frac{1}{N}\sum_{i=1}^N X_i\right| \geq \epsilon\right) \leq 2\exp\left(-\frac{2N^2\epsilon^2}{\sum_{i=1}^N (b_i - a_i)^2}\right).$

In our learning problem:
- $\forall \epsilon > 0$, $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq 2\exp(-2N\epsilon^2)$
- $\forall \epsilon > 0$, $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq 2M\exp(-2N\epsilon^2)$

We can now choose $N \geq \lceil \frac{1}{2\epsilon^2} \ln\frac{2M}{\delta} \rceil$: $M$ can be quite large (almost exponential in $N$) and, with enough data, we can generalize $h^*$. How about learning $h^\sharp \triangleq \operatorname{argmin}_{h \in H} R(h)$?
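Comparing the two sample-complexity formulas with the same illustrative numbers as before shows the payoff of Hoeffding's exponential bound: the dependence on $M$ drops from linear to logarithmic.

    import math

    # Same illustrative values as before: M hypotheses, accuracy eps, confidence delta.
    M, eps, delta = 1_000, 0.05, 0.05

    # Chebyshev + union bound: N >= ceil(M / (delta * eps^2)).
    N_chebyshev = math.ceil(M / (delta * eps**2))

    # Hoeffding + union bound: N >= ceil(ln(2M / delta) / (2 * eps^2)).
    N_hoeffding = math.ceil(math.log(2 * M / delta) / (2 * eps**2))

    print(N_chebyshev, N_hoeffding)   # 8000000 vs 2120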