Learning From Data Lecture 3: Is Learning Feasible?
Outside the Data · Probability to the Rescue · Learning vs. Verification · Selection Bias – A Cartoon
M. Magdon-Ismail
CSCI 4100/6100
recap: The Perceptron Learning Algorithm

[Figure: the PLA update on a misclassified point (x∗, y∗) in the Age–Income plane: w(t + 1) = w(t) + y∗x∗, illustrated for y∗ = +1 and y∗ = −1.]

PLA finds a linear separator in finite time.
– Separating the data amounts to “memorizing the data”: g ≈ f only on D.
– What we really want is g ≈ f outside the data.
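The update rule recapped above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code; the toy data set and function names are made up for the example.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm: while some point is misclassified,
    pick one such (x*, y*) and apply w(t+1) = w(t) + y* x*."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend a bias coordinate
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        mis = np.where(np.sign(Xb @ w) != y)[0]
        if len(mis) == 0:
            return w                  # all points separated
        i = mis[0]
        w = w + y[i] * Xb[i]          # the PLA update
    return w

# Toy linearly separable data (two features, e.g. age and income)
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = pla(X, y)
```

On separable data like this, the loop terminates with a weight vector whose sign agrees with every label, which is exactly the “memorizing the data” point above: g ≈ f on D says nothing yet about points outside D.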
© AML Creator: Malik Magdon-Ismail
Puzzle – Outside the data

[Figure: training points labeled f = −1 and f = +1; a new point with f = ?]
±1 both possible

[Figure: two targets consistent with the data, one giving f = −1 and one giving f = +1 on the new point.]

Who is correct? We cannot rule out either possibility.
No Free Lunch

For every f that fits the data and is “+1” on the new point, there is one that fits the data and is “−1”. Since f is unknown, it can take on any value outside the data, no matter how large the data set.
You cannot know anything for sure about f outside the data without making assumptions.
Is there any hope to know anything about f outside the data set without making assumptions about f?
Yes, if we are willing to give up the “for sure”.
The big question
Population mean

The BIN model:
[Figure: a bin of red and green marbles, and a sample drawn from it.]
µ: probability to pick a red marble (the BIN → outside the data).
ν: fraction of red marbles in the sample (the sample → the data set).

Can we say anything about µ (outside the data) after observing ν (the data)?
ANSWER: No. It is possible for the sample to be all green marbles while the bin is mostly red.
Then why do we trust polling (e.g. to predict the outcome of a presidential election)?
ANSWER: The bad case is possible, but not probable.
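The “possible, but not probable” claim is easy to check numerically. A minimal sketch, with illustrative numbers (µ = 0.9, N = 10) that are not from the slide:

```python
import random

def all_green_probability(mu, N, trials=100_000, seed=0):
    """Estimate the probability that a sample of N marbles contains
    no red marbles, when each draw is red with probability mu."""
    rng = random.Random(seed)
    bad = sum(all(rng.random() >= mu for _ in range(N)) for _ in range(trials))
    return bad / trials

# Bin mostly red (mu = 0.9), sample of N = 10 marbles.
est = all_green_probability(0.9, 10)
exact = (1 - 0.9) ** 10   # = 1e-10: the bad case is possible, but not probable
```

Even this modest sample size makes the all-green bad case vanishingly rare; the simulation essentially never observes it.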
Hoeffding

Hoeffding (and Chernoff) proved that, most of the time, ν cannot be too far from µ:

P[|ν − µ| > ε] ≤ 2e^{−2ε²N},  for any ε > 0.

Equivalently,

P[|ν − µ| ≤ ε] ≥ 1 − 2e^{−2ε²N},  for any ε > 0.

(box it and memorize it)

We get to select any ε we want.

newsflash: ν ≈ µ ⟹ µ ≈ ν. The statement “µ ≈ ν” is probably approximately correct (PAC learning).
Hoeffding example

P[|ν − µ| > ε] ≤ 2e^{−2ε²N} and P[|ν − µ| ≤ ε] ≥ 1 − 2e^{−2ε²N}, for any ε > 0.

Example: N = 1,000; draw a sample and observe ν.
– 99% of the time, µ − 0.05 ≤ ν ≤ µ + 0.05 (ε = 0.05).
– 99.9999996% of the time, µ − 0.10 ≤ ν ≤ µ + 0.10 (ε = 0.10).

What does this mean? If I repeatedly pick a sample of size 1,000, observe ν, and claim that µ ∈ [ν − 0.05, ν + 0.05] (an error bar of ±0.05), I will be right 99% of the time. On any particular sample you may be wrong, but not often.
We learned something. From ν, we reached outside the data to µ.
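The slide's confidence figures follow directly from plugging N = 1,000 into the bound; a quick check:

```python
import math

def confidence(eps, N):
    """Hoeffding lower bound on P[|nu - mu| <= eps]: 1 - 2e^{-2 eps^2 N}."""
    return 1 - 2 * math.exp(-2 * eps**2 * N)

N = 1000
c05 = confidence(0.05, N)   # about 0.9865 (the bound; the true probability is higher)
c10 = confidence(0.10, N)   # about 0.999999996, matching the slide
```

Note the ε = 0.05 bound itself gives about 98.65%; the 99% on the slide is the rounded confidence for that error bar.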
Probability rescued us

If the sample is constructed in some arbitrary fashion, then indeed we cannot say anything. Even with independent draws, ν can take on arbitrary values; but some values are far more likely than others. This is what allows us to learn something: it is likely that ν ≈ µ.

The bin can be infinite. It is great that the bound does not depend on µ, because µ is unknown; and we mean unknown.

If N → ∞, then µ ≈ ν with very, very high probability, but never for sure. Can you live with a 10^{−100} probability of error?

We should probably have said “independence to the rescue”.
Bin and learning

[Figure: the target function f (UNKNOWN) and a fixed hypothesis h (KNOWN), each over the Age–Income plane.]

In learning, the unknown is an entire function f; in the bin it was a single number µ.
The error function

[Figure: target f and fixed hypothesis h over the Age–Income plane; the region where they disagree is shaded red.]

green: h(x) = f(x);  red: h(x) ≠ f(x).

E(h) = P_x[h(x) ≠ f(x)]   (the “size” of the red region)

P(x) is UNKNOWN.
Errors = red ‘marbles’

[Figure: target f and a fixed hypothesis h over the Age–Income plane.]

green “marble”: h(x) = f(x);  red “marble”: h(x) ≠ f(x).  BIN: the input space X.

Eout(h) = P_x[h(x) ≠ f(x)]   (UNKNOWN)
Data

[Figure: the data points overlaid on the target f and the fixed hypothesis h in the Age–Income plane.]
Data = sample of marbles

[Figure: the data points overlaid on f and h in the Age–Income plane.]

green data: h(xn) = f(xn);  red data: h(xn) ≠ f(xn).

Ein(h) = fraction of red data (the in-sample misclassified fraction). KNOWN!
Learning vs. bin

Unknown f and P(x); fixed h.

Learning ↔ Bin Model:
– input space X ↔ the bin of marbles
– x for which h(x) = f(x) ↔ green marble; x for which h(x) ≠ f(x) ↔ red marble
– P(x) ↔ randomly picking a marble
– data set D ↔ sample of N marbles
– out-of-sample error Eout(h) = P_x[h(x) ≠ f(x)] ↔ µ, the probability of picking a red marble
– in-sample error Ein(h) = (1/N) Σₙ [h(xn) ≠ f(xn)] ↔ ν, the fraction of red marbles in the sample
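The correspondence above can be made concrete. In this sketch the target f, the hypothesis h, and the uniform P(x) are all illustrative choices, not from the lecture; a large held-out sample stands in for the unknowable µ:

```python
import random

# Bin analogy: a point x is a "red marble" exactly when h(x) != f(x).
f = lambda x: 1 if x[0] + x[1] > 0 else -1   # unknown target (illustrative)
h = lambda x: 1 if x[0] > 0 else -1          # fixed hypothesis

def draw(rng):
    """P(x): uniform on the square [-1, 1] x [-1, 1]."""
    return (rng.uniform(-1, 1), rng.uniform(-1, 1))

rng = random.Random(2)
sample = [draw(rng) for _ in range(100)]
E_in = sum(h(x) != f(x) for x in sample) / len(sample)   # nu: known from the data

big = [draw(rng) for _ in range(200_000)]                # stand-in for the whole bin
E_out = sum(h(x) != f(x) for x in big) / len(big)        # approximates mu = 0.25 here
```

For these particular f and h, the disagreement region has probability exactly 1/4 under the uniform P(x), and Ein lands near it, just as ν tracks µ in the bin.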
Hoeffding for Ein

P[|Ein(h) − Eout(h)| > ε] ≤ 2e^{−2ε²N},  for any ε > 0.
P[|Ein(h) − Eout(h)| ≤ ε] ≥ 1 − 2e^{−2ε²N},  for any ε > 0.

Ein is random, but known; Eout is fixed, but unknown.

If Ein ≈ 0, then Eout ≈ 0 (with high probability), i.e. P_x[h(x) ≠ f(x)] ≈ 0: we have learned something about the entire f, namely f ≈ h over X (outside D).

If Ein is large, we have still learned something about the entire f, namely that f and h disagree; it is just not very useful.

Questions: Suppose that Ein ≈ 1; have we learned something about the entire f that is useful? What is the worst Ein for inferring something about f?
Verification vs. learning

The entire previous argument assumed a FIXED h, and then came the data.

If Ein is small, h is good, with high confidence. If Ein is large, h is bad, with high confidence.

We have no control over Ein. It is what it is.

g results from searching an entire hypothesis set H for a hypothesis with small Ein.

Verification                      Real Learning
fixed single hypothesis h         fixed hypothesis set H
h to be certified                 g to be certified
h does not depend on D            g results after searching H to fit D
no control over Ein               pick the best (smallest) Ein

Verification: we can say something outside the data about h.
Learning: can we say something outside the data about g?
Real learning – finite model

[Figure: hypotheses h1, h2, h3, …, hM, each with its out-of-sample error Eout(hm) and its in-sample error on the same 9 data points: Ein(h1) = 2/9, Ein(h2) = 0, Ein(h3) = 5/9, …, Ein(hM) = 6/9.]

Pick the hypothesis with minimum Ein; will Eout be small?
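A quick simulation shows why minimizing Ein over many hypotheses is dangerous. This sketch is illustrative (the slide does not specify it): every hypothesis is assumed useless out of sample, Eout(hm) = 0.5, so its in-sample errors behave like N fair coin flips.

```python
import random

rng = random.Random(3)
N, M = 20, 1000   # N data points, M hypotheses (illustrative sizes)

def e_in(rng, N):
    """In-sample error of one useless hypothesis: each of the N points
    is misclassified independently with probability 1/2."""
    return sum(rng.random() < 0.5 for _ in range(N)) / N

# Pick the hypothesis with minimum E_in, as real learning does.
best = min(e_in(rng, N) for _ in range(M))
# best comes out far below 0.5, yet Eout of the chosen hypothesis is still 0.5:
# minimizing E_in over many hypotheses biases E_in downward.
```

The gap between the selected Ein and the fixed Eout = 0.5 is exactly the selection-bias effect the next slides dramatize with coins.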
Coin experiment
Is it a freak coin?

Do we expect P[heads] ≈ 0? Let’s toss this coin (this coin has never come up heads). Heads: you give me $2; tails: I give you $1. Who wants this bet? (We’re going to play this game 100 times.)
Selection bias

Coin tossing example:
– One fair coin tossed N times: P[no heads in N tosses] = 1/2^N, which is tiny. So if a coin never shows heads, we expect it is biased: P[heads] ≈ 0.
– Now select, out of many fair coins, one that never showed heads: P[some coin shows no heads] = 1 − (1 − 1/2^N)^(number of coins), which need not be small. Do we expect P[heads] ≈ 0 for the selected coin?

Similar to the “birthday problem”: among 30 people, two will likely share the same birthday.

Selection bias is a very serious trap, for example in iterated medical screening.

If we select an h ∈ H with smallest Ein, can we expect Eout to be small?
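The selected-coin probability is a one-liner to evaluate. The counts below (1,000 coins, N = 10 tosses) are illustrative choices, not the slide's own numbers:

```python
import math

def p_some_coin_all_tails(num_coins, N):
    """P[at least one of num_coins fair coins shows no heads in N tosses]
    = 1 - (1 - 1/2^N)^num_coins."""
    return 1 - (1 - 0.5 ** N) ** num_coins

one = 0.5 ** 10                           # a single coin: 1/1024, very unlikely
many = p_some_coin_all_tails(1000, 10)    # about 0.62: quite likely for the selected coin
```

A single freak coin is a one-in-a-thousand event, but among a thousand coins one such freak is the expected outcome, which is why the selected coin's perfect record says almost nothing about its bias.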
Jelly beans and acne – a cartoon

[Cartoon sequence over four slides: False → Try again → Eventually true → News headline!]