WHY SUPERVISED LEARNING MAY WORK



SLIDE 1

Matthieu R Bloch, Thursday January 9, 2020

WHY SUPERVISED LEARNING MAY WORK


SLIDE 2

LOGISTICS

  • Registration update: please decide soon if you want to take the class or not. Still many people on the waiting list!
  • Lecture videos on Canvas, available in “Media Gallery”. Please keep coming to class!
  • Self-assessment online, due Friday January 17, 2020 (11:59PM EST) (Friday January 24, 2020 for DL). I don’t expect you to do the assignment without refreshing your memory first.

http://www.phdcomics.com



SLIDE 3

Learning model #1

COMPONENTS OF SUPERVISED MACHINE LEARNING

  • 1. An unknown function $f : \mathcal{X} \to \mathcal{Y} : x \mapsto y = f(x)$ to learn. Example: the formula to distinguish cats from dogs.

  • 2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: $x_i \in \mathcal{X} \triangleq \mathbb{R}^d$ is a picture of a cat/dog, $y_i \in \mathcal{Y} \triangleq \mathbb{R}$ is the corresponding label cat/dog.

  • 3. A set of hypotheses $\mathcal{H}$ as to what the function could be. Example: deep neural nets with the AlexNet architecture.

  • 4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.

Terminology: $\mathcal{Y} = \mathbb{R}$: regression problem; $|\mathcal{Y}| < \infty$: classification problem; $|\mathcal{Y}| = 2$: binary classification problem. The goal is to generalize, i.e., be able to classify inputs we have not seen.


SLIDE 4

A LEARNING PUZZLE

Learning seems impossible without additional assumptions!


SLIDE 5

https://xkcd.com/221/

POSSIBLE VS PROBABLE

Flip a biased coin that lands on heads with unknown probability $p \in [0, 1]$: $P(\text{head}) = p$ and $P(\text{tail}) = 1 - p$. Say we flip the coin $N$ times; can we estimate $p$ with $\hat{p} = \frac{\#\text{heads}}{N}$? Can we relate $\hat{p}$ to $p$? The law of large numbers tells us that $\hat{p}$ converges in probability to $p$ as $N$ gets large:
$$\forall \epsilon > 0 \quad P(|\hat{p} - p| > \epsilon) \xrightarrow{N \to \infty} 0.$$
It is possible that $\hat{p}$ is completely off, but it is not probable.
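The distinction between possible and probable is easy to see numerically. A minimal simulation sketch (the function name, seed, and parameter values are illustrative, not from the slides):

```python
import random

def estimate_p(p, n, seed=0):
    """Flip a coin with P(head) = p a total of n times and return p_hat."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n) if rng.random() < p)
    return heads / n

# As N grows, p_hat concentrates around the true p.
for n in (10, 1000, 100_000):
    print(n, estimate_p(0.3, n))
```

For small $N$ the estimate can be far off; for large $N$ a large deviation is still possible but increasingly improbable.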


SLIDE 6

Learning model #2

COMPONENTS OF SUPERVISED MACHINE LEARNING

  • 1. An unknown function $f : \mathcal{X} \to \mathcal{Y} : x \mapsto y = f(x)$ to learn.

  • 2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: $\{x_i\}_{i=1}^{N}$ drawn i.i.d. from an unknown distribution $P_x$ on $\mathcal{X}$; $\{y_i\}_{i=1}^{N}$, with $y_i \in \mathcal{Y} \triangleq \mathbb{R}$, are the corresponding targets.

  • 3. A set of hypotheses $\mathcal{H}$ as to what the function could be.

  • 4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.


SLIDE 7

ANOTHER LEARNING PUZZLE

Which color is the dress?


SLIDE 8

Learning model #3

COMPONENTS OF SUPERVISED MACHINE LEARNING

  • 1. An unknown conditional distribution $P_{y|x}$ to learn; $P_{y|x}$ models $f : \mathcal{X} \to \mathcal{Y}$ with noise.

  • 2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: $\{x_i\}_{i=1}^{N}$ drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$; $\{y_i\}_{i=1}^{N}$, with $y_i \in \mathcal{Y} \triangleq \mathbb{R}$, are the corresponding targets.

  • 3. A set of hypotheses $\mathcal{H}$ as to what the function could be.

  • 4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.

The roles of $P_{y|x}$ and $P_x$ are different: $P_{y|x}$ is what we want to learn, since it captures the underlying function and the noise added to it; $P_x$ models the sampling of the dataset and need not be learned.


SLIDE 9

Biometric authentication system

YET ANOTHER LEARNING PUZZLE

Assume that you are designing a fingerprint authentication system, and you trained it with a fancy machine learning method. The probability of wrongly authenticating is 1%; the probability of correctly authenticating is 60%. Is this a good system? It depends! If you are GTRI, this might be OK (security matters more). If you are Apple, this is not acceptable (user convenience matters too). There is an application-dependent cost that can affect the design.


SLIDE 10

Final supervised learning model

COMPONENTS OF SUPERVISED MACHINE LEARNING

  • 1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: $\{x_i\}_{i=1}^{N}$ drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$; $\{y_i\}_{i=1}^{N}$, with $y_i \in \mathcal{Y} \triangleq \mathbb{R}$, are the corresponding targets.

  • 2. An unknown conditional distribution $P_{y|x}$; $P_{y|x}$ models $f : \mathcal{X} \to \mathcal{Y}$ with noise.

  • 3. A set of hypotheses $\mathcal{H}$ as to what the function could be.

  • 4. A loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ capturing the “cost” of prediction.

  • 5. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.


SLIDE 11

THE SUPERVISED LEARNING PROBLEM

Learning is not memorizing: our goal is not to find $h \in \mathcal{H}$ that accurately assigns values to elements of $\mathcal{D}$, but to find the best $h \in \mathcal{H}$ that accurately predicts values of unseen samples.

Consider a hypothesis $h \in \mathcal{H}$. We can easily compute the empirical risk (a.k.a. in-sample error)
$$\hat{R}_N(h) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i)).$$
What we really care about is the true risk (a.k.a. out-sample error)
$$R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))].$$
Question #1: Can we generalize? For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
Question #2: Can we learn well? Given $\mathcal{H}$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$. Our algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$. Is $\hat{R}_N(h^*)$ close to $R(h^\sharp)$? Is $R(h^\sharp) \approx 0$?
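The two risks can be contrasted in a few lines of code. A hypothetical sketch with a squared loss, where the target, hypothesis, and distribution are made up so that the true risk has a closed form:

```python
import random

def empirical_risk(h, data, loss):
    """R_hat_N(h): average loss of h over the dataset."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

rng = random.Random(1)
f = lambda x: 3 * x                      # unknown target function (illustrative)
h = lambda x: 2 * x                      # one candidate hypothesis
sq_loss = lambda y, y_hat: (y - y_hat) ** 2

data = [(x, f(x)) for x in (rng.random() for _ in range(50_000))]

# True risk: R(h) = E[(3x - 2x)^2] = E[x^2] = 1/3 for x ~ Uniform[0, 1];
# the empirical risk concentrates around it as N grows.
print(empirical_risk(h, data, sq_loss))
```

The empirical risk is computable from data alone; the true risk here is only computable because we invented the distribution.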


SLIDE 12

WHY THE QUESTIONS MATTER

Quick demo: nearest neighbor classification
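The demo itself is not in the notes, but the point it illustrates can be reproduced with a toy 1-nearest-neighbor classifier (a hypothetical sketch, not the code used in class): the empirical risk is zero by construction, which by itself says nothing about the risk on unseen points.

```python
def nn_classify(x, data):
    """1-nearest-neighbor: predict the label of the closest training point."""
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

# Toy 1-D training set (made up): label = 1 iff x >= 0.5.
train = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)]

# 1-NN always fits the training data perfectly: empirical risk is 0 ...
in_sample = sum(nn_classify(x, train) != y for x, y in train) / len(train)
# ... but that alone answers neither Question #1 nor Question #2.
print(in_sample)  # 0.0
```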


SLIDE 13

DETOUR: PROBABILITIES

Probabilities are not that old. The axiomatic theory was carried out by Kolmogorov in 1932

  • Definition. (Axioms for events) Let $\Omega \neq \emptyset$ be a sample space. The class of subsets of $\Omega$ that constitutes events satisfies the following axioms:
  1. $\Omega$ is an event;
  2. For any countable collection of events $\{A_i\}_{i \geq 1}$ in $\Omega$, $\cup_{i=1}^{\infty} A_i$ is an event;
  3. For every event $A$ in $\Omega$, $A^c$ is an event.

  • Definition. (Axioms for probability) Let $\Omega \neq \emptyset$ be a sample space and $\mathcal{F}$ a class of events satisfying the axioms for events. A probability rule is a function $P : \mathcal{F} \to \mathbb{R}^+$ such that:
  1. $P(\Omega) = 1$;
  2. For every $A \in \mathcal{F}$, $P(A) \geq 0$;
  3. For any disjoint events $\{A_i\}_{i=1}^{\infty}$ in $\mathcal{F}$, $P(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.

  • Proposition. (Union bound) Let $(\Omega, \mathcal{F}, P)$ be a probability space. For any events $\{A_i\}_{i \geq 1}$ we have
$$P(\cup_{i \geq 1} A_i) \leq \sum_{i \geq 1} P(A_i).$$
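The union bound can be sanity-checked empirically. A small sketch with two overlapping events (the events and their probabilities are made up for illustration):

```python
import random

rng = random.Random(2)
trials = 100_000
hits_union = hits_a = hits_b = 0
for _ in range(trials):
    u = rng.random()
    a = u < 0.30            # event A, P(A) = 0.30
    b = 0.20 < u < 0.60     # event B, P(B) = 0.40, overlapping A
    hits_a += a
    hits_b += b
    hits_union += a or b

p_union = hits_union / trials
bound = hits_a / trials + hits_b / trials
print(p_union, bound)  # empirically, P(A or B) <= P(A) + P(B)
```

The bound is loose exactly by the double-counted overlap $P(A \cap B)$.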


SLIDE 14

DETOUR: PROBABILITIES

  • Definition. (Conditional probability) Let $(\Omega, \mathcal{F}, P)$ be a probability space. The conditional probability of event $A$ given event $B$ is, if $P(B) > 0$, $P(A|B) = P(A \cap B)/P(B)$.

  • Definition. (Bayes' rule) Let $(\Omega, \mathcal{F}, P)$ be a probability space and $A, B$ events with non-zero probability. Then
$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}.$$

  • Definition. (Independence) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Event $A$ is independent of event $B$ if $P(A \cap B) = P(A) P(B)$. For $n > 2$, the events $\{A_i\}_{i=1}^{n}$ are independent if for all $S \subset [1; n]$ such that $|S| \geq 2$,
$$P(\cap_{i \in S} A_i) = \prod_{i \in S} P(A_i).$$
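Bayes' rule is worth one worked number. A hypothetical screening-test example (all probabilities are made up for illustration):

```python
# P(A|B) = P(B|A) P(A) / P(B), on a small discrete example:
# A = "has condition", B = "test is positive".
p_a = 0.01                  # P(A): 1% base rate
p_b_given_a = 0.95          # P(B|A): sensitivity
p_b_given_not_a = 0.10      # P(B|A^c): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
# Bayes' rule:
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # about 0.0876: still small despite a "positive"
```

The small posterior despite a sensitive test is the classic base-rate effect, which is one reason conditioning deserves a careful definition.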


SLIDE 15

DETOUR: RANDOM VARIABLES

  • Definition. (Random variable) Let $(\Omega, \mathcal{F}, P)$ be a probability space. A random variable is a function $X : \Omega \to \mathbb{R}$ such that:
  1. $X$ might be undefined or infinite on a subset of zero probability;
  2. $\{\omega \in \Omega : X(\omega) \leq x\}$ must be an event for all $x \in \mathbb{R}$;
  3. For a finite set of random variables $\{X_i\}_{i=1}^{n}$, the set $\{\omega : X_1(\omega) \leq x_1, \cdots, X_n(\omega) \leq x_n\}$ must be an event for all $\{x_i\}_{i=1}^{n}$.

  • Definition. (Cumulative distribution function) Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X$ a random variable. The CDF of $X$ is the function $F_X : \mathbb{R} \to \mathbb{R} : x \mapsto P(\omega \in \Omega : X(\omega) \leq x) \triangleq P(X \leq x)$.

If $\mathcal{X}$ is countable, the random variable is discrete. We can write $\mathcal{X} = \{x_i\}_{i=1}^{|\mathcal{X}|}$, and $P_X(x_i) \triangleq P(X = x_i)$ is called the probability mass function (PMF) of $X$. If the CDF of $X$ has a finite derivative at $x$, the derivative is called the probability density function (PDF), denoted by $p_X$. If $F_X$ has a derivative for every $x \in \mathbb{R}$, $X$ is continuous. We often don't need to specify $(\Omega, \mathcal{F}, P)$: all we need is a CDF (or PMF or PDF).


SLIDE 16

DETOUR: RANDOM VARIABLES

  • Definition. (Expectation/Mean) Let $X$ be a discrete random variable with PMF $P_X$. Then $\mathbb{E}[X] \triangleq \sum_{x \in \mathcal{X}} x P_X(x)$. Let $X$ be a continuous random variable with PDF $p_X$. Then $\mathbb{E}[X] \triangleq \int_{x \in \mathcal{X}} x\, p_X(x)\, dx$. The expectation of a function $f$ of a discrete random variable $X$ is $\mathbb{E}[f(X)] = \sum_{x \in \mathcal{X}} f(x) P_X(x)$ (and idem for PDFs).

  • Definition. (Moment) Let $X$ be a random variable. The $m$th moment of $X$ is $\mathbb{E}[X^m]$. The variance is the second centered moment $\mathrm{Var}(X) \triangleq \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

  • Proposition. (Expectation of indicator function) Let $X$ be a random variable and $E \subset \mathbb{R}$. Then $\mathbb{E}[1\{X \in E\}] = P(X \in E)$.

11th commandment: thou shall denote random variables by capital letters. 12th commandment: but sometimes not.
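These definitions can be checked exactly on a small PMF. A sketch using a fair six-sided die (the example is mine, not from the slides), with exact arithmetic:

```python
from fractions import Fraction

# Fair six-sided die: PMF P_X(x) = 1/6 for x in {1, ..., 6}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())               # E[X]
second = sum(x ** 2 * p for x, p in pmf.items())        # E[X^2]
var = second - mean ** 2                                # Var(X) = E[X^2] - E[X]^2

# Expectation of an indicator equals a probability:
p_even = sum(p for x, p in pmf.items() if x % 2 == 0)   # E[1{X even}] = P(X even)
print(mean, var, p_even)  # 7/2 35/12 1/2
```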


SLIDE 17

A SIMPLER SUPERVISED LEARNING PROBLEM

Consider a special case of the general supervised learning problem

  • 1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: $\{x_i\}_{i=1}^{N}$ drawn i.i.d. from an unknown $P_x$ on $\mathcal{X}$; labels $\{y_i\}_{i=1}^{N}$ with $\mathcal{Y} = \{0, 1\}$ (binary classification).

  • 2. An unknown $f : \mathcal{X} \to \mathcal{Y}$, no noise.

  • 3. A finite set of hypotheses $\mathcal{H} \triangleq \{h_i\}_{i=1}^{M}$, $|\mathcal{H}| = M < \infty$.

  • 4. The binary loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+ : (y_1, y_2) \mapsto 1\{y_1 \neq y_2\}$.

In this very specific case, the true risk simplifies to
$$R(h) \triangleq \mathbb{E}_{xy}[1\{h(x) \neq y\}] = P_{xy}(h(x) \neq y).$$
The empirical risk becomes
$$\hat{R}_N(h) = \frac{1}{N} \sum_{i=1}^{N} 1\{h(x_i) \neq y_i\}.$$
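Empirical risk minimization over a finite class fits in a few lines. A hypothetical sketch with threshold classifiers $h_t(x) = 1\{x \geq t\}$ (the class, target, and parameters are made up for illustration):

```python
import random

rng = random.Random(3)

# Hypothetical finite class: threshold classifiers h_t(x) = 1{x >= t}.
thresholds = [i / 10 for i in range(11)]        # M = 11 hypotheses
f = lambda x: int(x >= 0.35)                     # unknown noiseless target

data = [(x, f(x)) for x in (rng.random() for _ in range(500))]

def risk_hat(t, data):
    """Empirical 0-1 risk of h_t on the dataset."""
    return sum(int(x >= t) != y for x, y in data) / len(data)

# h* = argmin over the finite class of the empirical risk:
t_star = min(thresholds, key=lambda t: risk_hat(t, data))
print(t_star, risk_hat(t_star, data))
```

Since the true threshold 0.35 is not in the class, even the best $h_t$ keeps a small residual risk; the question of the next slides is how far $\hat{R}_N(h^*)$ can be from $R(h^*)$.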


SLIDE 18


CAN WE LEARN?

Our objective is to find a hypothesis $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$ that ensures a small risk.

For a fixed $h_j \in \mathcal{H}$, how does $\hat{R}_N(h_j)$ compare to $R(h_j)$? Observe that for $h_j \in \mathcal{H}$,
$$\hat{R}_N(h_j) = \frac{1}{N} \sum_{i=1}^{N} 1\{h_j(x_i) \neq y_i\}:$$
the empirical risk is a sum of i.i.d. random variables, and $\mathbb{E}[\hat{R}_N(h_j)] = R(h_j)$. Thus $P(|\hat{R}_N(h_j) - R(h_j)| > \epsilon)$ is a statement about the deviation of a normalized sum of i.i.d. random variables from its mean. We're in luck! Such bounds, known as concentration inequalities, are a well-studied subject.


SLIDE 19

CONCENTRATION INEQUALITIES 101

Lemma (Markov's inequality). Let $X$ be a non-negative real-valued random variable. Then for all $t > 0$,
$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$$
Lemma (Chebyshev's inequality). Let $X$ be a real-valued random variable. Then for all $t > 0$,
$$P(|X - \mathbb{E}[X]| \geq t) \leq \frac{\mathrm{Var}(X)}{t^2}.$$
Proposition (Weak law of large numbers). Let $\{X_i\}_{i=1}^{N}$ be i.i.d. real-valued random variables with finite mean $\mu$ and finite variance $\sigma^2$. Then
$$P\left(\left|\frac{1}{N} \sum_{i=1}^{N} X_i - \mu\right| \geq \epsilon\right) \leq \frac{\sigma^2}{N \epsilon^2}, \qquad \text{hence} \quad \lim_{N \to \infty} P\left(\left|\frac{1}{N} \sum_{i=1}^{N} X_i - \mu\right| \geq \epsilon\right) = 0.$$
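Chebyshev's bound is valid but loose, which is easy to verify by simulation. A sketch with uniform random variables (the parameter values are illustrative):

```python
import random

rng = random.Random(4)
n, eps, trials = 100, 0.1, 20_000

# X_i ~ Uniform[0, 1]: mu = 1/2, sigma^2 = 1/12.
deviations = 0
for _ in range(trials):
    sample_mean = sum(rng.random() for _ in range(n)) / n
    deviations += abs(sample_mean - 0.5) >= eps

freq = deviations / trials                 # empirical deviation probability
bound = (1 / 12) / (n * eps * eps)         # Chebyshev: sigma^2 / (N eps^2)
print(freq, bound)                         # freq is far below the bound
```

The empirical frequency is orders of magnitude below $\sigma^2/(N\epsilon^2)$, which motivates looking for tighter bounds (next detour).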


SLIDE 20

BACK TO LEARNING

By the law of large numbers, we know that for all $\epsilon > 0$,
$$P_{\{(x_i, y_i)\}}\left(\left|\hat{R}_N(h_j) - R(h_j)\right| \geq \epsilon\right) \leq \frac{\mathrm{Var}(1\{h_j(x_1) \neq y\})}{N \epsilon^2} \leq \frac{1}{N \epsilon^2}.$$
Given enough data, we can generalize. How much data? Choosing $N = \frac{1}{\delta \epsilon^2}$ ensures $P(|\hat{R}_N(h_j) - R(h_j)| \geq \epsilon) \leq \delta$.

That's not quite enough! We care about $\hat{R}_N(h^*)$, where $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$. If $M = |\mathcal{H}|$ is large, we should expect the existence of $h_k \in \mathcal{H}$ such that $\hat{R}_N(h_k) \ll R(h_k)$. By the union bound,
$$P\left(\left|\hat{R}_N(h^*) - R(h^*)\right| \geq \epsilon\right) \leq P\left(\exists j : \left|\hat{R}_N(h_j) - R(h_j)\right| \geq \epsilon\right) \leq \frac{M}{N \epsilon^2}.$$
If we choose $N \geq \frac{M}{\delta \epsilon^2}$ we can ensure $P(|\hat{R}_N(h^*) - R(h^*)| \geq \epsilon) \leq \delta$. That's a lot of samples!
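For concreteness, the Chebyshev-plus-union-bound requirement $N \geq M/(\delta \epsilon^2)$ grows quickly with the class size. A quick computation (parameter values chosen for illustration):

```python
from math import ceil

def n_required(M, eps, delta):
    """Chebyshev + union bound: N >= M / (delta * eps^2)."""
    return ceil(M / (delta * eps ** 2))

# Even a modest hypothesis class is expensive under this bound:
for M in (10, 100, 1000):
    print(M, n_required(M, eps=0.05, delta=0.05))  # grows linearly in M
```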


SLIDE 21

CONCENTRATION INEQUALITIES 102

We can obtain much better bounds than with Chebyshev.

Lemma (Hoeffding's inequality). Let $\{X_i\}_{i=1}^{N}$ be i.i.d. real-valued zero-mean random variables such that $X_i \in [a_i; b_i]$. Then for all $\epsilon > 0$,
$$P\left(\left|\frac{1}{N} \sum_{i=1}^{N} X_i\right| \geq \epsilon\right) \leq 2 \exp\left(-\frac{2 N^2 \epsilon^2}{\sum_{i=1}^{N} (b_i - a_i)^2}\right).$$
In our learning problem, for all $\epsilon > 0$,
$$P\left(\left|\hat{R}_N(h_j) - R(h_j)\right| \geq \epsilon\right) \leq 2 \exp(-2 N \epsilon^2) \quad \text{and} \quad P\left(\left|\hat{R}_N(h^*) - R(h^*)\right| \geq \epsilon\right) \leq 2 M \exp(-2 N \epsilon^2).$$
We can now choose $N \geq \frac{1}{2 \epsilon^2}\left(\log M + \log \frac{2}{\delta}\right)$: $M$ can be quite large (almost exponential in $N$) and, with enough data, we can generalize $h^*$. How about learning $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$?
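The improvement from Hoeffding is dramatic: the sample requirement depends on $\log M$ rather than $M$. A quick numerical comparison (parameter values chosen for illustration):

```python
from math import ceil, log

def n_hoeffding(M, eps, delta):
    """Hoeffding + union bound: N >= (log M + log(2/delta)) / (2 eps^2)."""
    return ceil((log(M) + log(2 / delta)) / (2 * eps ** 2))

def n_chebyshev(M, eps, delta):
    """Chebyshev + union bound, for comparison: N >= M / (delta * eps^2)."""
    return ceil(M / (delta * eps ** 2))

# With M = 1000 hypotheses, eps = 0.05, delta = 0.05:
print(n_hoeffding(1000, 0.05, 0.05))   # about 2,120 samples
print(n_chebyshev(1000, 0.05, 0.05))   # about 8,000,000 samples
```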


SLIDE 22

LEARNING CAN WORK!

Lemma. If $\forall j$, $|\hat{R}_N(h_j) - R(h_j)| \leq \epsilon$, then $|R(h^*) - R(h^\sharp)| \leq 2\epsilon$. (Indeed, $R(h^*) \leq \hat{R}_N(h^*) + \epsilon \leq \hat{R}_N(h^\sharp) + \epsilon \leq R(h^\sharp) + 2\epsilon$.)

How do we make $R(h^\sharp)$ small? We need a bigger hypothesis class $\mathcal{H}$! (Could we take $M \to \infty$?) This is the fundamental trade-off of learning.


SLIDE 23

WHAT IS A GOOD HYPOTHESIS?

Ideally, we want $|\mathcal{H}|$ small so that $R(h^*) \approx R(h^\sharp)$, and we want to get lucky so that $R(h^*) \approx 0$. In general this is not possible. Remember, we usually have to learn $P_{y|x}$, not a function $f$.

Next time: What is the optimal binary classification hypothesis class? How small can $R(h^*)$ be?


 