Matthieu R Bloch, Thursday January 9, 2020
WHY SUPERVISED LEARNING MAY WORK
1
Registration update
- Please decide soon whether you want to take the class or not
- Still many people on the waiting list!
Lecture videos on Canvas available in “Media Gallery”
- Please keep coming to class!
Self-assessment online
- Due Friday January 17, 2020 (11:59PM EST) (Friday January 24, 2020 for DL)
- I don’t expect you to do the assignment without refreshing your memory first
http://www.phdcomics.com
2
Learning model #1
- An unknown function f : X → Y : x ↦ y = f(x) to learn
  - Example: the formula to distinguish cats from dogs
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - x_i ∈ X ≜ ℝ^d: picture of a cat/dog
  - y_i ∈ Y ≜ ℝ: the corresponding label cat/dog
- A set of hypotheses H as to what the function could be
  - Example: deep neural nets with the AlexNet architecture
- An algorithm ALG to find the best h ∈ H that explains f
- Terminology:
  - Y = ℝ: regression problem
  - |Y| < ∞: classification problem
  - |Y| = 2: binary classification problem
The goal is to generalize, i.e., be able to classify inputs we have not seen.
3
Learning seems impossible without additional assumptions!
4
https://xkcd.com/221/
- Flip a biased coin that lands on heads with unknown probability p ∈ [0, 1]: P(head) = p and P(tail) = 1 − p
- Say we flip the coin N times: can we estimate p with p̂ ≜ (# heads)/N? Can we relate p̂ to p?
- The law of large numbers tells us that p̂ converges in probability to p as N gets large:
  ∀ε > 0, P(|p̂ − p| > ε) ⟶ 0 as N → ∞
- It is possible that p̂ is completely off, but it is not probable
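The coin-flip experiment is easy to simulate; a minimal sketch in Python (the function name and the seed are my own choices):

```python
import random

def estimate_p(p, n, rng):
    """Estimate the bias p of a coin from n i.i.d. flips: p_hat = (# heads)/n."""
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / n

rng = random.Random(0)
for n in (10, 1000, 100000):
    # The estimate p_hat concentrates around p = 0.3 as n grows.
    print(n, estimate_p(0.3, n, rng))
```

For any single run p̂ can still be far from p; the law of large numbers only says that this becomes improbable as n grows.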
5
Learning model #2
- An unknown function f : X → Y : x ↦ y = f(x) to learn
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown distribution P_x on X
  - y_i ∈ Y ≜ ℝ are the corresponding targets
- A set of hypotheses H as to what the function could be
- An algorithm ALG to find the best h ∈ H that explains f
6
Which color is the dress?
7
Learning model #3
- An unknown conditional distribution P_{y|x} to learn
  - P_{y|x} models f : X → Y with noise
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown probability distribution P_x on X
  - y_i ∈ Y ≜ ℝ are the corresponding targets
- A set of hypotheses H as to what the function could be
- An algorithm ALG to find the best h ∈ H that explains f
- The roles of P_x and P_{y|x} are different
  - P_{y|x} is what we want to learn: it captures the underlying function and the noise added to it
  - P_x models the sampling of the dataset and need not be learned
8
Biometric authentication system
Assume that you are designing a fingerprint authentication system
- You trained your system with a fancy machine learning algorithm
- The probability of wrongly authenticating is 1%
- The probability of correctly authenticating is 60%
Is this a good system? It depends!
- If you are GTRI, this might be OK (security matters more)
- If you are Apple, this is not acceptable (user convenience matters too)
There is an application-dependent cost that can affect the design
9
Final supervised learning model
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown probability distribution P_x on X
  - y_i ∈ Y ≜ ℝ are the corresponding targets
- An unknown conditional distribution P_{y|x} to learn
  - P_{y|x} models f : X → Y with noise
- A set of hypotheses H as to what the function could be
- A loss function ℓ : Y × Y → ℝ⁺ capturing the “cost” of prediction
- An algorithm ALG to find the best h ∈ H that explains f
10
Learning is not memorizing
- Our goal is not to find h ∈ H that accurately assigns values to the elements of D
- Our goal is to find the best h ∈ H that accurately predicts values of unseen samples
- Consider a hypothesis h ∈ H. We can easily compute the empirical risk (a.k.a. in-sample error)
  R̂_N(h) ≜ (1/N) Σ_{i=1}^N ℓ(y_i, h(x_i))
- What we really care about is the true risk (a.k.a. out-sample error)
  R(h) ≜ E_{xy}[ℓ(y, h(x))]
- Question #1: Can we generalize? For a given h, is R̂_N(h) close to R(h)?
- Question #2: Can we learn well?
  - Given H, the best hypothesis is h♯ ≜ argmin_{h∈H} R(h)
  - Our algorithm can only find h* ≜ argmin_{h∈H} R̂_N(h)
  - Is R̂_N(h*) close to R(h♯)? Is R(h♯) ≈ 0?
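The empirical risk is directly computable from the dataset; a quick sketch (the toy hypothesis and data are mine, not from the lecture):

```python
def empirical_risk(h, data, loss):
    """R_hat_N(h) = (1/N) * sum of loss(y_i, h(x_i)) over the dataset."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

# Binary loss 1{y1 != y2} and a toy threshold hypothesis on scalar inputs.
loss01 = lambda y1, y2: 1 if y1 != y2 else 0
h = lambda x: 1 if x > 0.5 else 0
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 0)]
print(empirical_risk(h, data, loss01))  # 0.25: one of the four points is misclassified
```

The true risk R(h), by contrast, is an expectation over the unknown distribution and cannot be computed from D alone; that gap is exactly what the generalization question is about.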
11
Quick demo: nearest neighbor classification
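The demo itself is not reproduced here, but a 1-nearest-neighbor classifier can be sketched in a few lines (the toy training set and labels are my own):

```python
import math

def nearest_neighbor_classify(x, train):
    """1-NN rule: return the label of the training point closest to x."""
    return min(train, key=lambda pt: math.dist(x, pt[0]))[1]

# Toy 2-D training set of labeled points.
train = [((0.0, 0.0), "cat"), ((1.0, 1.0), "dog"), ((0.9, 0.2), "cat")]
print(nearest_neighbor_classify((0.8, 0.9), train))  # "dog"
```

Note that 1-NN has zero empirical risk on its own training set (it memorizes D), which makes it a good illustration of why in-sample error alone says nothing about generalization.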
12
Probabilities are not that old: the axiomatic theory was laid out by Kolmogorov in 1933.

Definition (events). Let Ω ≠ ∅ be a sample space. The class of subsets of Ω that constitutes events satisfies the following axioms:
1. Ω is an event;
2. for any countable collection of events {A_i}_{i≥1} in Ω, ∪_{i=1}^∞ A_i is an event;
3. for any event A in Ω, A^c is an event.

Definition (probability rule). Let Ω ≠ ∅ be a sample space and F a class of events satisfying the axioms for events. A probability rule is a function P : F → ℝ⁺ such that:
1. P(Ω) = 1;
2. for any A ∈ F, P(A) ≥ 0;
3. for any countable collection {A_i}_{i=1}^∞ of pairwise disjoint events in F, P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Proposition (Union bound). Let (Ω, F, P) be a probability space. For any events {A_i}_{i≥1} we have P(∪_{i≥1} A_i) ≤ Σ_{i≥1} P(A_i).
13
Definition (conditional probability). Let (Ω, F, P) be a probability space. The conditional probability of event A given event B is, if P(B) > 0, P(A|B) ≜ P(A ∩ B)/P(B).

Proposition (Bayes' rule). Let (Ω, F, P) be a probability space and A, B events with non-zero probability. Then P(A|B) = P(B|A) P(A)/P(B).

Definition (independence). Let (Ω, F, P) be a probability space. Event A is independent of event B if P(A ∩ B) = P(A) P(B). For n > 2, the events {A_i}_{i=1}^n are mutually independent if for all S ⊂ [1; n] with |S| ≥ 2, P(∩_{i∈S} A_i) = ∏_{i∈S} P(A_i).
14
Definition (random variable). Let (Ω, F, P) be a probability space. A random variable is a function X : Ω → ℝ such that:
1. X may be undefined or infinite on a subset of zero probability;
2. {ω ∈ Ω : X(ω) ≤ x} must be an event for all x ∈ ℝ;
3. for several random variables {X_i}_{i=1}^n, the set {ω : X_1(ω) ≤ x_1, ⋯, X_n(ω) ≤ x_n} must be an event for all {x_i}_{i=1}^n.

Definition (CDF, PMF, PDF). Let (Ω, F, P) be a probability space and X a random variable. The CDF of X is the function F_X : ℝ → ℝ : x ↦ P(ω ∈ Ω : X(ω) ≤ x) ≜ P(X ≤ x).
If |X| < ∞, X takes a finite number of values {x_i}_{i=1}^{|X|}, and P_X(x_i) ≜ P(X = x_i) is called the probability mass function (PMF) of X.
If the CDF of X has a finite derivative at x, the derivative is called the probability density function (PDF), denoted p_X. If F_X has a derivative at every x ∈ ℝ, X is continuous.
We often don’t need to specify (Ω, F, P): all we need is a CDF (or PMF or PDF).
15
Let be a random variable with PMF . Then . Let be a random variable with PDF . Then . Expectation of a function of a discrete is (and idem for PDFs).
Let be a random variable. The th moment of is . The variance is the second centered moment . Proposition (Expectation of indicator function) Let be a random variable and . Then 11th commandment: thou shall denote random variables by capital letters 12th commandment: but sometimes not
X PX [X] ≜ x (x) E ∑x∈X PX X pX [X] ≜ x (x)dx E ∫x∈X pX f X [f(X)] = f(x) (x) E ∑x∈X PX X m X [ ] E Xm Var(X) ≜ [(X − [X] ] = [ ] − E E )2 E X2 [X] E
2
X E ⊂ R [1{X ∈ E}] = (X ∈ E) E P
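The indicator identity E[1{X ∈ E}] = P(X ∈ E) is easy to check by simulation; a quick sketch using the event X > 1 for a standard Gaussian (sample size and seed are my own choices):

```python
import random

rng = random.Random(2)
samples = [rng.gauss(0.0, 1.0) for _ in range(200000)]

# The sample mean of the indicator 1{X > 1} estimates P(X > 1).
indicator_mean = sum(1 for x in samples if x > 1.0) / len(samples)
print(indicator_mean)  # close to P(X > 1) ≈ 0.1587 for a standard Gaussian
```

This identity is used repeatedly below: it is what turns the true risk E[1{h(x) ≠ y}] into the error probability P(h(x) ≠ y).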
16
Consider a special case of the general supervised learning problem
- A dataset D ≜ {(x_1, y_1), ⋯, (x_N, y_N)}
  - {x_i}_{i=1}^N drawn i.i.d. from an unknown P_x on X
  - {y_i}_{i=1}^N are the corresponding labels, with Y = {0, 1} (binary classification)
- An unknown f : X → Y to learn, no noise
- A finite set of hypotheses H, |H| = M < ∞, H ≜ {h_i}_{i=1}^M
- The binary loss ℓ : Y × Y → ℝ⁺ : (y_1, y_2) ↦ 1{y_1 ≠ y_2}
In this very specific case, the true risk simplifies to
  R(h) ≜ E_{xy}[1{h(x) ≠ y}] = P_{xy}(h(x) ≠ y)
and the empirical risk becomes
  R̂_N(h) = (1/N) Σ_{i=1}^N 1{h(x_i) ≠ y_i}
17
- Our objective is to find a hypothesis h* ≜ argmin_{h∈H} R̂_N(h) that ensures a small risk
- For a fixed h_j ∈ H, how does R̂_N(h_j) compare to R(h_j)?
- Observe that for h_j ∈ H, R̂_N(h_j) = (1/N) Σ_{i=1}^N 1{h_j(x_i) ≠ y_i}
  - The empirical risk is a sum of i.i.d. random variables, with E[R̂_N(h_j)] = R(h_j)
  - P(|R̂_N(h_j) − R(h_j)| > ε) is a statement about the deviation of a normalized sum of i.i.d. random variables from its mean
- We’re in luck! Such bounds, a.k.a. concentration inequalities, are a well-studied subject
18
Lemma (Markov's inequality). Let X be a non-negative real-valued random variable. Then for all t > 0, P(X ≥ t) ≤ E[X]/t.

Lemma (Chebyshev's inequality). Let X be a real-valued random variable. Then for all t > 0, P(|X − E[X]| ≥ t) ≤ Var(X)/t².

Proposition (Weak law of large numbers). Let {X_i}_{i=1}^N be i.i.d. real-valued random variables with finite mean μ and finite variance σ². Then for all ε > 0,
  P(|(1/N) Σ_{i=1}^N X_i − μ| ≥ ε) ≤ σ²/(Nε²), so that lim_{N→∞} P(|(1/N) Σ_{i=1}^N X_i − μ| ≥ ε) = 0.
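The Chebyshev bound behind the weak law can be checked numerically; a rough sketch, where the Gaussian distribution, trial count, and seed are my own choices:

```python
import random

def deviation_prob(mu, sigma, n, eps, trials, rng):
    """Empirical P(|sample mean of n draws - mu| >= eps), estimated over many trials."""
    count = 0
    for _ in range(trials):
        mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(mean - mu) >= eps:
            count += 1
    return count / trials

rng = random.Random(1)
emp = deviation_prob(0.0, 1.0, n=100, eps=0.3, trials=2000, rng=rng)
bound = 1.0**2 / (100 * 0.3**2)  # Chebyshev: sigma^2 / (N * eps^2) ≈ 0.111
print(emp, bound)
```

The empirical deviation probability comes out far below the Chebyshev bound, which is consistent with the next slide: Chebyshev is valid but loose, and Hoeffding will do much better.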
19
- By the law of large numbers, we know that ∀ε > 0,
  P(|R̂_N(h_j) − R(h_j)| ≥ ε) ≤ Var(1{h_j(x_1) ≠ y_1})/(Nε²) ≤ 1/(Nε²)
  - Given enough data, we can generalize
  - How much data? Choose N = 1/(δε²) to ensure P(|R̂_N(h_j) − R(h_j)| ≥ ε) ≤ δ
- That’s not quite enough! We care about R̂_N(h*), where h* ≜ argmin_{h∈H} R̂_N(h)
  - If M = |H| is large, we should expect the existence of h_k ∈ H such that R̂_N(h_k) ≪ R(h_k)
  - P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ ?
- By the union bound, P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ P(∃j : |R̂_N(h_j) − R(h_j)| ≥ ε) ≤ M/(Nε²)
- If we choose N ≥ M/(δε²), we can ensure P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ δ. That’s a lot of samples!
20
We can obtain much better bounds than with Chebyshev.

Lemma (Hoeffding's inequality). Let {X_i}_{i=1}^N be i.i.d. real-valued zero-mean random variables such that X_i ∈ [a_i; b_i]. Then for all ε > 0,
  P(|(1/N) Σ_{i=1}^N X_i| ≥ ε) ≤ 2 exp(−2N²ε² / Σ_{i=1}^N (b_i − a_i)²).

In our learning problem:
- ∀ε > 0, P(|R̂_N(h_j) − R(h_j)| ≥ ε) ≤ 2 exp(−2Nε²)
- ∀ε > 0, P(|R̂_N(h*) − R(h*)| ≥ ε) ≤ 2M exp(−2Nε²)
- We can now choose N ≥ (1/(2ε²))(log M + log(2/δ))
- M can be quite large (almost exponential in N) and, with enough data, we can generalize h*. How about learning h♯ ≜ argmin_{h∈H} R(h)?
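The sample-size rule N ≥ (1/(2ε²))(log M + log(2/δ)) is easy to evaluate; a small sketch (the function name is mine) showing the logarithmic dependence on M:

```python
import math

def hoeffding_sample_size(m, eps, delta):
    """Smallest integer N with N >= (1/(2 eps^2)) * (log m + log(2/delta)),
    which by union bound + Hoeffding gives P(|R_hat(h*) - R(h*)| >= eps) <= delta."""
    return math.ceil((math.log(m) + math.log(2 / delta)) / (2 * eps**2))

# Growing the hypothesis class from 10 to 10^12 hypotheses costs only
# a modest increase in N, unlike the linear-in-M Chebyshev requirement.
for m in (10, 10**6, 10**12):
    print(m, hoeffding_sample_size(m, eps=0.05, delta=0.01))
```

Compare with the Chebyshev-based requirement N ≥ M/(δε²) from the previous slide, which for M = 10^6 would demand on the order of 10^11 samples rather than a few thousand.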
21
Lemma. If for all h_j ∈ H we have |R̂_N(h_j) − R(h_j)| ≤ ε, then |R(h*) − R(h♯)| ≤ 2ε.
(Indeed, R(h*) ≤ R̂_N(h*) + ε ≤ R̂_N(h♯) + ε ≤ R(h♯) + 2ε, while R(h♯) ≤ R(h*) by definition of h♯.)
- How do we make R(h♯) small? Need a bigger hypothesis class H! (Could we take M → ∞?)
- Fundamental trade-off of learning
22
- Ideally, we want |H| small so that R(h*) ≈ R(h♯), and to get lucky so that R(h♯) ≈ 0
- In general this is not possible: remember, we usually have to learn P_{y|x}, not a function f
- Next time:
  - What is the optimal binary classification hypothesis class?
  - How small can R(h*) be?
23