THE SUPERVISED LEARNING PROBLEM



SLIDE 1

THE SUPERVISED LEARNING PROBLEM

Matthieu R. Bloch, January 7, 2020

SLIDE 2

[Figure: adversarial examples (Elsayed et al. '18)]

WHY ML?

• Traditional engineering is top-down: we use fundamental principles (mathematics, physics) to build models and abstractions, and design is performed based on those models. Example: building a communication system.
• Machine learning is bottom-up: we think there is a model to be found, but the model is too complex to describe or identify from fundamental principles, and we have data. Example: classifying cats and dogs.
• There are plenty of problems that do not require ML; we should probably not try to learn the laws of physics with ML.
• There are plenty of situations in which ML can help, such as engineering design based on heuristics. Example: computer-aided design.

SLIDE 3

ML IN THE REAL WORLD

• Credit card fraud detection
• Movie recommendations
• Autonomous vehicles
• Matchmaking
• Handwriting recognition
• Cooking
• Painting
• Teaching

SLIDE 4

TYPES OF LEARNING

• Supervised learning: given input data $\{x_i\}_{i=1}^N$ representing observations of a phenomenon, and output data $\{y_i\}_{i=1}^N$ representing "labels" attached to the observations, the goal is to identify the input-output relationship from the training data $\{(x_i, y_i)\}_{i=1}^N$ and generalize.
• Unsupervised learning: given input data $\{x_i\}_{i=1}^N$ representing observations of a phenomenon, but no output data, the goal is to understand structure in the data, or infer some characteristic of the underlying probability distribution.
• Other types of learning: semi-supervised learning, active learning, online learning, reinforcement learning, transfer learning, imitation learning.
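To make the supervised/unsupervised distinction concrete, here is a minimal sketch in Python; the use of NumPy and the synthetic data are assumptions made for illustration, not part of the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 2  # hypothetical dataset size and input dimension

    # Supervised: inputs {x_i} come with labels {y_i} (e.g., cat = 0, dog = 1)
    X = rng.standard_normal((N, d))    # observations x_i in R^d
    y = rng.integers(0, 2, size=N)     # labels y_i attached to each x_i
    supervised_data = list(zip(X, y))  # training pairs {(x_i, y_i)}

    # Unsupervised: the same inputs, but no labels; we can only look
    # for structure in the x_i (clusters, density, low-dimensional shape)
    unsupervised_data = X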

SLIDE 5

Learning model #1

COMPONENTS OF SUPERVISED MACHINE LEARNING

• 1. An unknown function $f : \mathcal{X} \to \mathcal{Y} : x \mapsto y = f(x)$ to learn. Example: the formula to distinguish cats from dogs.
• 2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} \triangleq \mathbb{R}^d$ is a picture of a cat/dog and $y_i \in \mathcal{Y} \triangleq \mathbb{R}$ is the corresponding label cat/dog.
• 3. A set of hypotheses $\mathcal{H}$ as to what the function could be. Example: deep neural nets with the AlexNet architecture.
• 4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.

Terminology: $\mathcal{Y} = \mathbb{R}$ is a regression problem; $|\mathcal{Y}| = 2$ is a binary classification problem. The goal is to generalize, i.e., be able to classify inputs we have not seen.
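As a purely illustrative sketch, the four components can be mirrored in code; the toy target function, the hypothesis class of threshold rules, and the brute-force search below are all assumptions chosen for brevity, not anything prescribed by the slides.

    import numpy as np

    # 1. An unknown function f (pretend we cannot read this definition)
    def f(x):
        return 1.0 if x.sum() > 0 else 0.0

    # 2. A dataset D = {(x_1, y_1), ..., (x_N, y_N)} labeled by f
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 2))
    y = np.array([f(x) for x in X])

    # 3. A set of hypotheses H: threshold rules h_t(x) = 1{sum(x) > t}
    def h(t):
        return lambda x: 1.0 if x.sum() > t else 0.0

    # 4. An algorithm ALG: pick the t in a grid that best explains D
    grid = np.linspace(-2.0, 2.0, 81)
    errors = [np.mean([h(t)(x) != yi for x, yi in zip(X, y)]) for t in grid]
    best_h = h(grid[int(np.argmin(errors))])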

SLIDE 6

A LEARNING PUZZLE

Learning seems impossible without additional assumptions!

SLIDE 7

https://xkcd.com/221/

POSSIBLE VS PROBABLE

• Flip a biased coin that lands on heads with unknown probability $p \in [0, 1]$: $P(\text{head}) = p$ and $P(\text{tail}) = 1 - p$.
• Say we flip the coin $N$ times. Can we estimate $p$ with $\hat{p} = \frac{\#\text{heads}}{N}$? Can we relate $\hat{p}$ to $p$?
• The law of large numbers tells us that $\hat{p}$ converges in probability to $p$ as $N$ gets large: $\forall \epsilon > 0, \; P(|\hat{p} - p| > \epsilon) \xrightarrow{N \to \infty} 0$.
• It is possible that $\hat{p}$ is completely off, but it is not probable.
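A quick simulation makes "possible but not probable" tangible. This is a minimal sketch in Python; the choice of $p$ and the sample sizes are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3  # the "unknown" bias, fixed here so we can check the estimate

    for N in [10, 100, 1000, 100000]:
        flips = rng.random(N) < p  # N coin flips, True = head
        p_hat = flips.mean()       # estimate p_hat = (# heads) / N
        print(f"N = {N:6d}  p_hat = {p_hat:.4f}  |p_hat - p| = {abs(p_hat - p):.4f}")

    # The deviation |p_hat - p| shrinks as N grows, as the law of large
    # numbers predicts; a single run can still be far off, just improbably so.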

SLIDE 8

Learning model #2

COMPONENTS OF SUPERVISED MACHINE LEARNING

• 1. An unknown function $f : \mathcal{X} \to \mathcal{Y} : x \mapsto y = f(x)$ to learn.
• 2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown distribution $P_x$ on $\mathcal{X}$, and the $\{y_i\}_{i=1}^N \in \mathcal{Y} \triangleq \mathbb{R}$ are the corresponding targets.
• 3. A set of hypotheses $\mathcal{H}$ as to what the function could be.
• 4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.
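The only change from learning model #1 is how the dataset arises. A short sketch of the i.i.d. sampling assumption follows; the particular $P_x$ and $f$ are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100

    # x_i drawn i.i.d. from a distribution P_x on R^2 (unknown in practice)
    X = rng.standard_normal((N, 2))

    # y_i = f(x_i): the targets are still a deterministic function of the inputs
    def f(x):
        return 1.0 if x.sum() > 0 else 0.0

    y = np.array([f(x) for x in X])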

SLIDE 9

ANOTHER LEARNING PUZZLE

Which color is the dress?

SLIDE 10

Learning model #3

COMPONENTS OF SUPERVISED MACHINE LEARNING

• 1. An unknown conditional distribution $P_{y|x}$ to learn: $P_{y|x}$ models $f : \mathcal{X} \to \mathcal{Y}$ with noise.
• 2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$, and the $\{y_i\}_{i=1}^N \in \mathcal{Y} \triangleq \mathbb{R}$ are the corresponding targets.
• 3. A set of hypotheses $\mathcal{H}$ as to what the function could be.
• 4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.

The roles of $P_{y|x}$ and $P_x$ are different: $P_{y|x}$ is what we want to learn, since it captures the underlying function and the noise added to it; $P_x$ models the sampling of the dataset and need not be learned.
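A sketch of what sampling from $P_{y|x}$ can look like in a regression setting; the additive Gaussian noise below is an assumed noise model, one of many that $P_{y|x}$ can capture.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100

    X = rng.standard_normal(N)               # x_i ~ P_x, i.i.d.
    f = np.sin                               # underlying function (unknown in practice)
    y = f(X) + 0.1 * rng.standard_normal(N)  # y_i ~ P_{y|x = x_i}: f(x_i) plus noise

    # P_x only governs where we observe the phenomenon; P_{y|x} is the
    # object we want to learn, combining f and the noise around it.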

SLIDE 11

Biometric authentication system

YET ANOTHER LEARNING PUZZLE

• Assume that you are designing a fingerprint authentication system and you trained it with a fancy machine learning system.
• The probability of wrongly authenticating (accepting an impostor) is 1%; the probability of correctly authenticating (accepting a legitimate user) is 60%.
• Is this a good system? It depends! If you are GTRI, this might be OK (security matters more). If you are Apple, this is not acceptable (user convenience matters too).
• There is an application-dependent cost that can affect the design.
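One way to formalize "it depends" is to attach application-specific costs to the two error events and compare weighted error costs. The cost values below are invented purely to illustrate how the same error probabilities can be acceptable in one setting and not in another.

    # Operating point from the slide
    p_false_accept = 0.01  # wrongly authenticating (impostor accepted)
    p_false_reject = 0.40  # failing to authenticate a legitimate user (1 - 0.60)

    def weighted_cost(c_false_accept, c_false_reject):
        # Weighted combination of the two error rates
        return c_false_accept * p_false_accept + c_false_reject * p_false_reject

    # Security-critical setting (GTRI-like): false accepts are very expensive
    print(weighted_cost(c_false_accept=1000.0, c_false_reject=1.0))  # 10.4

    # Consumer setting (Apple-like): annoying legitimate users dominates
    print(weighted_cost(c_false_accept=10.0, c_false_reject=5.0))    # 2.1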

SLIDE 12

Learning model

COMPONENTS OF SUPERVISED MACHINE LEARNING

• 1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$, and the $\{y_i\}_{i=1}^N \in \mathcal{Y} \triangleq \mathbb{R}$ are the corresponding targets.
• 2. An unknown conditional distribution $P_{y|x}$: $P_{y|x}$ models $f : \mathcal{X} \to \mathcal{Y}$ with noise.
• 3. A set of hypotheses $\mathcal{H}$ as to what the function could be.
• 4. A loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ capturing the "cost" of prediction.
• 5. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.
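Two standard instances of $\ell$, shown as a sketch; these are the usual textbook choices for regression and binary classification, not singled out by the slide itself.

    def squared_loss(y, y_hat):
        # Common regression loss: penalizes large deviations quadratically
        return (y - y_hat) ** 2

    def zero_one_loss(y, y_hat):
        # Common binary classification loss: 1 on a mistake, 0 otherwise
        return 0.0 if y == y_hat else 1.0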

SLIDE 13

THE LEARNING PROBLEM

• Learning is not memorizing: our goal is not to find $h \in \mathcal{H}$ that accurately assigns values to elements of $\mathcal{D}$; our goal is to find the best $h \in \mathcal{H}$ that accurately predicts values of unseen samples.
• Consider a hypothesis $h \in \mathcal{H}$. We can easily compute the empirical risk (a.k.a. in-sample error) $\hat{R}_N(h) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, h(x_i))$.
• What we really care about is the true risk (a.k.a. out-sample error) $R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))]$.
• Question #1: can $h$ generalize? For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
• Question #2: can we learn well? Given $\mathcal{H}$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$, but our algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$. Is $\hat{R}_N(h^*)$ close to $R(h^\sharp)$? Is $R(h^\sharp) \approx 0$?
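The gap between empirical and true risk is easy to exhibit numerically. In this sketch every modeling choice is an assumption: a degree-9 polynomial fit on 15 noisy points, with a large fresh sample standing in for the expectation that defines $R(h)$.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # x_i ~ P_x, y_i ~ P_{y|x}: a noisy sine, as in the earlier sketches
        x = rng.uniform(-3.0, 3.0, n)
        return x, np.sin(x) + 0.1 * rng.standard_normal(n)

    # A rich hypothesis class and a small dataset: fit a degree-9 polynomial
    x_train, y_train = sample(15)
    h = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

    # Empirical risk (in-sample error): average squared loss on the training set
    R_hat = np.mean((y_train - h(x_train)) ** 2)

    # True risk (out-sample error), approximated on a large fresh sample
    x_test, y_test = sample(100_000)
    R = np.mean((y_test - h(x_test)) ** 2)

    print(f"empirical risk = {R_hat:.4f}, true risk ~ {R:.4f}")
    # Typically R_hat is much smaller than R here: h nearly memorizes the
    # 15 training points but predicts unseen samples poorly, which is
    # exactly the gap Questions #1 and #2 are about.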


 