Lecture 6:
− Learning Theory
− Probability Review
Aykut Erdem
October 2016 Hacettepe University
Last time… Regularization, Cross-Validation
E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (λ/2) ∥w∥²,   where ∥w∥² = w_0² + w_1² + · · · + w_M²
− The coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term.
[Table: coefficients w⋆_0, …, w⋆_9 of the fitted M = 9 polynomial for ln λ = −∞, −18, and 0. Without regularization the coefficients reach huge magnitudes (e.g. w⋆_5 = 640042.26, w⋆_7 = 1042400.18), while for ln λ = 0 they all shrink below 1 in magnitude (e.g. w⋆_0 = 0.13, w⋆_9 = 0.01).]
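As a quick illustration of this effect (not part of the original slides), here is a minimal NumPy sketch that fits an M = 9 polynomial with the regularized least-squares objective above. The sine-curve toy data and the specific constants are assumed for the example, and the closed-form solution w = (Φᵀ Φ + λ I)⁻¹ Φᵀ t is the standard ridge-regression formula.

import numpy as np

# Toy data: N noisy samples of a sine curve (an assumed stand-in for the
# curve-fitting data used in the lecture, not the actual data).
rng = np.random.default_rng(0)
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

Phi = np.vander(x, M + 1, increasing=True)  # design matrix: Phi[n, j] = x_n ** j

def fit_regularized(lam):
    # Minimizes (1/2) sum_n (y(x_n, w) - t_n)^2 + (lam/2) ||w||^2 for the
    # polynomial y(x, w) = sum_j w_j x^j, via the closed-form ridge solution.
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

for ln_lam in (-np.inf, -18.0, 0.0):
    w = fit_regularized(np.exp(ln_lam))      # exp(-inf) = 0, i.e. no regularization
    print(f"ln lambda = {ln_lam:6.1f}  ->  max |w_j| = {np.abs(w).max():.2f}")

Increasing λ shrinks the largest coefficient magnitudes dramatically, mirroring the qualitative pattern in the table above.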
[Figure: the training data, a nearest-neighbor (NN) classifier, and a 5-NN classifier. Figure credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson]
This lecture:
− Learning theory (why ML works)
− Probability review
Learning theory: the mathematical analysis of machine learning algorithms
− PAC (probably approximately correct) learning → boosting
− VC (Vapnik–Chervonenkis) theory → support vector machines
slide by Eric Eaton
Annual conference: Conference on Learning Theory (COLT)
Learning theory tries to answer questions such as:
− Can we guarantee that the learning algorithm will do a good job learning?
− Will my test performance be much worse than my training performance?
The key idea that underlies all these answers is that simple functions generalize well.
adapted from Hal Daume III
− It can justify and help us understand why common practice works.
− It can also suggest new algorithms and approaches that turn out to work well in practice.
adapted from Hal Daume III
− Theory after: practitioners find that something works surprisingly well, and theoreticians then try to understand it and prove something about it.
  ✦ In the process, they make it better or find new algorithms.
− Theory before: theory points out what’s possible and what’s not possible.
− Often, it turns out to be a mix!
adapted from Hal Daume III
− We want to know whether there is an “ultimate” learning algorithm, A^awesome, that solves the binary classification problem.
− What would it take for such an A^awesome to be out there? It should:
  ✦ Take in a data set D and produce a function f.
  ✦ No matter what D looks like, this function f should get perfect classification on all future examples drawn from the same distribution that produced D.
adapted from Hal Daume III
− Suppose 80% of the data points in this distribution have x = y and 20% don’t.
− Then no matter what function f the algorithm returns, there’s no way that it can do better than 20% error on this data.
− So no A^awesome exists that always achieves an error rate of zero.
− The best that we can hope for is that the error rate is not “too large.”
adapted from Hal Daume III
Example distribution:  D(⟨+1⟩, +1) = 0.4,  D(⟨+1⟩, −1) = 0.1,  D(⟨−1⟩, −1) = 0.4,  D(⟨−1⟩, +1) = 0.1
− A second difficulty comes from the random nature of sampling.
− When trying to learn about a distribution, you only get to see data points drawn from that distribution.
− You know that “eventually” you will see enough data points that your sample is representative of the distribution, but it might not happen immediately.
− For instance, it is entirely plausible that in a sequence of four coin flips you never see a tails, or perhaps only see one tails.
adapted from Hal Daume III
− Because of this, we cannot expect A^awesome to always work.
− In particular, if we happen to get a lousy sample of data from D, we need to allow A^awesome to do something completely unreasonable.
− Thus, we cannot expect it to do well every single time.
adapted from Hal Daume III
− The best we can reasonably hope of A^awesome is that it will do pretty well, most of the time.
− What we ask of an algorithm is that it does a good job (approximately correct) most of the time (probably).
− Suppose we run a hypothetical learning algorithm on 10 different binary classification data sets.
− For each one, it comes back with functions f1, f2, . . . , f10.
  ✦ For some reason, whenever you run f4 on a test point, it crashes your computer; for the other nine functions, the test error is always at most 5%.
  ✦ If this situation is guaranteed to happen, then this hypothetical learning algorithm is a PAC learning algorithm.
    ✤ It satisfies probably because it only failed in one out of ten cases, and it’s approximate because it achieved low, but non-zero, error on the remainder of the cases.
adapted from Hal Daume III
There are two notions of efficiency that matter:
− Computational complexity: prefer an algorithm that runs quickly to one that takes forever.
− Sample complexity: the number of examples required for your algorithm to achieve its goals.
adapted from Hal Daume III
Definition 1: An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a “bad function” is at most δ, where a “bad” function is one with test error rate more than ε on D.
Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ.
In other words, if you want your algorithm to achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor.
Example: PAC Learning of Conjunctions
− Goal: learn a conjunction of Boolean variables (e.g., x1 ∧ x2 ∧ x5)
− Each example is a binary vector x = ⟨x1, x2, . . . , xD⟩, labeled by an unknown conjunction c: y = c(x)
− From the data in Table 10.1 below: clearly, the true formula cannot include the terms x1, x2, ¬x3, ¬x4
adapted from Hal Daume III
 y   x1  x2  x3  x4
+1    0   0   1   1
+1    0   1   1   1
−1    1   1   0   1
Table 10.1: Data set for learning conjunctions.
Example: PAC Learning of Conjunctions (continued)
f0(x) = x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ x3 ∧ ¬x3 ∧ x4 ∧ ¬x4
f1(x) = ¬x1 ∧ ¬x2 ∧ x3 ∧ x4
f2(x) = ¬x1 ∧ x3 ∧ x4
f3(x) = ¬x1 ∧ x3 ∧ x4
− The final function f3 classifies every training example correctly (provided that there is no noise).
− Given a data set of N examples in D dimensions, it takes O(ND) time to process the data. This is linear in the size of the data set.
Algorithm 30 BinaryConjunctionTrain(D)
  f ← x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xD ∧ ¬xD    // initialize function
  for all positive examples (x, +1) in D do
    for d = 1 . . . D do
      if xd = 0 then
        f ← f without term “xd”
      else
        f ← f without term “¬xd”
      end if
    end for
  end for
  return f
adapted from Hal Daume III
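Here is a short Python sketch of the same “throw out bad terms” procedure (my own translation of Algorithm 30, not code from the lecture). A conjunction is stored as a set of literals, and each positive example removes the literals it contradicts; run on the three examples of Table 10.1 it recovers ¬x1 ∧ x3 ∧ x4.

def train_conjunction(examples):
    # examples: list of (x, y), where x is a tuple of 0/1 values and y is +1 or -1.
    D = len(examples[0][0])
    # Start from the conjunction containing every literal:
    # ('pos', d) stands for the term x_{d+1}, ('neg', d) for the term not-x_{d+1}.
    f = {('pos', d) for d in range(D)} | {('neg', d) for d in range(D)}
    for x, y in examples:
        if y != +1:
            continue                      # only positive examples remove terms
        for d in range(D):
            if x[d] == 0:
                f.discard(('pos', d))     # f <- f without term "x_d"
            else:
                f.discard(('neg', d))     # f <- f without term "not x_d"
    return f

def predict(f, x):
    # The conjunction is true iff every remaining literal is satisfied by x.
    ok = all(x[d] == 1 if sign == 'pos' else x[d] == 0 for sign, d in f)
    return +1 if ok else -1

data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]  # Table 10.1
f = train_conjunction(data)
print(sorted(f))                          # [('neg', 0), ('pos', 2), ('pos', 3)]: not-x1, x3, x4
print([predict(f, x) for x, _ in data])   # [1, 1, -1]: all training examples classified correctly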
“Throw Out Bad Terms”
− How many examples N do you need to see in order to guarantee that the algorithm achieves an error rate of at most ε (in all but δ-many cases)?
− Perhaps N has to be gigantic (like 2^{2D}/ε) to (probably) guarantee a small error.
adapted from Hal Daume III
“Throw Out Bad Terms”
− We want to show that the number of examples needed to (probably) achieve a small error is not-too-big.
− Say there is some term (say ¬x8) that should have been thrown out, but wasn’t.
− If this was the case, then you must not have seen any positive training examples with x8 = 1 (i.e., with ¬x8 = 0).
− So examples with x8 = 1 must have low probability (otherwise you would have seen them); such examples are not that common, so the leftover term causes errors only rarely.
adapted from Hal Daume III
“Throw Out Bad Terms”
− The hypothesis class of Boolean conjunctions is finite: each of the D-many variables can appear positively, appear negated, or be left out entirely, so there are only finitely many conjunctions.
− The hypothesis class of linear classifiers, in contrast, is infinite.
− For Occam’s razor, we can only work with finite hypothesis classes.
adapted from Hal Daume III
William of Occam (c. 1288 – c. 1348)
“If one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it; i.e., one should always opt for an explanation in terms of the fewest possible causes, factors, or variables.”
Theorem 14 (Occam’s Bound). Suppose A is an algorithm that learns a function f from some finite hypothesis class H. Suppose the learned function always gets zero error on the training data. Then, the sample complexity of f is at most log |H|.
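To see where the log |H| dependence comes from, here is the standard union-bound argument, sketched in LaTeX (my own filling-in of the reasoning; it also makes explicit the ε and δ dependence that the informal statement above suppresses):

% A fixed hypothesis with true error greater than \epsilon is consistent with
% N i.i.d. training examples with probability at most (1-\epsilon)^N.
% Union-bounding over the finite class H:
\Pr\bigl[\exists\, f \in H:\ f \text{ consistent with the data and } \operatorname{err}(f) > \epsilon\bigr]
  \le |H|\,(1-\epsilon)^N \le |H|\, e^{-\epsilon N}.
% Requiring this failure probability to be at most \delta and solving for N:
N \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).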
− Suppose that instead of a Boolean conjunction, you represent your hypothesis as a conjunction of inequalities over real-valued features.
− Instead of having x1 ∧ ¬x2 ∧ x5, you have [x1 > 0.2] ∧ [x2 < 0.77] ∧ [x5 < π/4].
− In this representation, for each feature, you need to choose an inequality (< or >) and a threshold.
− Since the thresholds can be arbitrary real values, there are now infinitely many possibilities: |H| = 2D × ∞ = ∞
adapted from Hal Daume III
− The Vapnik–Chervonenkis (VC) dimension is a measure of the complexity of an infinite hypothesis class, based on the following intuition.
− The idea is to look at a finite set of unlabeled examples and ask: no matter how these points were labeled, would we be able to find a hypothesis that correctly classifies them?
− As you add more and more points, being able to represent an arbitrary labeling becomes harder and harder.
adapted from Hal Daume III
Definition 2. For data drawn from some space 𝒳, the VC dimension of a hypothesis space H over 𝒳 is the maximal K such that there exists a set X ⊆ 𝒳 of size |X| = K, such that for all binary labelings of X, there exists a function f ∈ H that matches this labeling (we say that H shatters X).
− Example: linear classifiers in the plane can shatter 3 points, but cannot shatter any set of 4 points.
− A hypothesis class that can shatter some set of 4 points hence has VC dimension greater than 3.
adapted from Trevor Hastie, Robert Tibshirani, Jerome Friedman
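To make the definition concrete, here is a tiny brute-force Python sketch (my own, not from the slides): for the simple class of 1-D threshold classifiers sign(x − t), one point can be shattered but no pair of points can, so this class has VC dimension 1.

from itertools import product

def labels_for_threshold(points, t):
    # 1-D threshold classifier: predict +1 iff x > t.
    return tuple(+1 if x > t else -1 for x in points)

def shatters(points, thresholds):
    # H shatters `points` if every +/-1 labeling of them is realized by
    # some classifier in the candidate set of thresholds.
    achievable = {labels_for_threshold(points, t) for t in thresholds}
    return all(lab in achievable
               for lab in product((-1, +1), repeat=len(points)))

thresholds = [-10.0, 0.5, 1.5, 10.0]        # candidate thresholds around the points
print(shatters([1.0], thresholds))          # True: a single point is shattered
print(shatters([1.0, 2.0], thresholds))     # False: the labeling (+1, -1) is unachievable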
− Can think of A as a Boolean-valued variable
  ✦ A = your next patient has cancer
  ✦ A = Rafael Nadal wins French Open 2015
slide by Dhruv Batra
“If I flip this coin, the probability that it will come up heads is 0.5.”
− Frequentist interpretation: if we flip this coin many times, it will come up heads about half the time. Probabilities are the expected frequencies of events over repeated trials.
− Bayesian interpretation: we believe the coin is equally likely to come up heads or tails. Probabilities quantify subjective beliefs about single events.
− The Bayesian view lets us encode domain knowledge and automatically derive learning algorithms.
− The frequentist view is useful for analyzing the behavior of such learning algorithms, in the limit of large datasets.
slide by Erik Suddherth
The Axioms of Probability
slide by Andrew Moore
slide by Dhruv Batra
− Imagine the space of all possible worlds; its total area is 1.
− Some worlds are ones in which A is true; the others are worlds in which A is false.
− P(A) = area of the reddish oval containing the worlds in which A is true.
slide by Dhruv Batra
− P(A) can’t get any smaller than 0, and a zero area would mean no world could ever have A true.
slide by Dhruv Batra
− P(A) can’t get any bigger than 1, and an area of 1 would mean all worlds will have A true.
slide by Dhruv Batra
− P(A or B) = P(A) + P(B) − P(A and B)
− This follows by simple addition and subtraction of the areas of A, B, and their overlap P(A and B).
slide by Dhruv Batra
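A quick numerical sanity check of the addition rule (my own sketch, with assumed toy events; not from the slides):

# Assumed toy events over 100 equally likely "worlds" w in {0, ..., 99}:
# A = "w is even", B = "w < 30".
worlds = range(100)
A = {w for w in worlds if w % 2 == 0}
B = {w for w in worlds if w < 30}

def P(event):
    # With equally likely worlds, probability is just the fraction (area) of worlds.
    return len(event) / len(worlds)

print(P(A | B))                  # 0.65
print(P(A) + P(B) - P(A & B))    # 0.65, matching P(A or B) = P(A) + P(B) - P(A and B)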
Discrete Random Variables
− X: a discrete random variable
− 𝒳: the sample space of possible outcomes, which may be finite or countably infinite
− x ∈ 𝒳: a particular outcome
slide by Erik Suddherth
Discrete Random Variables
− p(X = x), written p(x) as shorthand when there is no ambiguity, is the probability distribution (probability mass function)
− 0 ≤ p(x) ≤ 1 for all x ∈ 𝒳
− Σ_{x ∈ 𝒳} p(x) = 1
− Examples on 𝒳 = {1, 2, 3, 4}: a uniform distribution and a degenerate distribution (all mass on a single outcome)
slide by Erik Suddherth
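A minimal Python sketch of these conditions (my own, using the 𝒳 = {1, 2, 3, 4} example mentioned above): a pmf can be stored as a dictionary mapping outcomes to probabilities.

# Two pmfs on the sample space {1, 2, 3, 4}:
uniform = {x: 0.25 for x in (1, 2, 3, 4)}        # uniform distribution
degenerate = {1: 0.0, 2: 1.0, 3: 0.0, 4: 0.0}    # all probability mass on x = 2

def is_valid_pmf(p, tol=1e-12):
    # Checks 0 <= p(x) <= 1 for every outcome and that the probabilities sum to 1.
    return (all(0.0 <= v <= 1.0 for v in p.values())
            and abs(sum(p.values()) - 1.0) < tol)

print(is_valid_pmf(uniform), is_valid_pmf(degenerate))   # True True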
slide by Dhruv Batra
− Events: P(A) = P(A and B) + P(A and not B)
− Random variables: P(X = x) = Σ_y P(X = x, Y = y)
slide by Dhruv Batra
p(x, y) = Σ_{z ∈ 𝒵} p(x, y, z)
p(x) = Σ_{y ∈ 𝒴} p(x, y)
slide by Erik Suddherth
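As a concrete sketch of the marginalization rule (my own toy numbers, not from the slides), a joint distribution stored as a NumPy array is marginalized by summing over the axis of the variable being removed:

import numpy as np

# Toy joint distribution p(x, y) with x in {0, 1, 2} and y in {0, 1} (assumed numbers).
p_xy = np.array([[0.10, 0.20],
                 [0.25, 0.05],
                 [0.30, 0.10]])
assert np.isclose(p_xy.sum(), 1.0)

p_x = p_xy.sum(axis=1)    # p(x) = sum_y p(x, y)  ->  [0.30, 0.30, 0.40]
p_y = p_xy.sum(axis=0)    # p(y) = sum_x p(x, y)  ->  [0.65, 0.35]
print(p_x, p_y)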
− He has won the French Open 9 of the 10 times he has played there
− Novak Djokovic is ranked #1 and just won the Australian Open
− I offered a similar analysis last year, and Nadal won
slide by Dhruv Batra
− P(A | B) = the fraction of worlds in which B is true that also have A true
− H: “Have a headache” − F: “Coming down with Flu”
P(F) = 1/40,  P(H | F) = 1/2
“Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.”
slide by Dhruv Batra
Conditional Distributions
slide by Erik Suddherth
Independence: X ⊥ Y means p(x, y) = p(x) p(y) for all x ∈ 𝒳, y ∈ 𝒴
Equivalent conditions on conditional probabilities:
p(x | Y = y) = p(x) for all y ∈ 𝒴 with p(y) > 0
p(y | X = x) = p(y) for all x ∈ 𝒳 with p(x) > 0
slide by Erik Suddherth
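A short sketch checking the definition numerically (my own toy numbers): a joint table built as an outer product of two marginals satisfies both p(x, y) = p(x) p(y) and p(x | y) = p(x).

import numpy as np

p_x = np.array([0.2, 0.3, 0.5])
p_y = np.array([0.6, 0.4])
p_xy = np.outer(p_x, p_y)          # independent joint: p(x, y) = p(x) * p(y)

# Defining condition, recomputed from the joint's own marginals:
print(np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))))   # True
# Equivalent conditional condition: p(x | Y = y) = p(x) for every y with p(y) > 0.
print(np.allclose(p_xy / p_y, p_x[:, None]))                             # True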
Bayes Rule (Bayes Theorem)
− Y: unknown parameters we would like to infer; X = x: observed data
− p(y): prior distribution (domain knowledge)
− p(x | y): likelihood function (measurement model)
− p(y | x): posterior distribution (learned information)

p(x, y) = p(x) p(y | x) = p(y) p(x | y)

p(y | x) = p(x, y) / p(x) = p(x | y) p(y) / Σ_{y′ ∈ 𝒴} p(x | y′) p(y′) ∝ p(x | y) p(y)
slide by Erik Suddherth
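Applying the rule to the earlier headache/flu numbers (P(F) = 1/40 and P(H | F) = 1/2 come from the slide above; the marginal P(H) = 1/10 is an assumed value, since it is not stated above):

# Bayes rule: p(F | H) = p(H | F) p(F) / p(H)
p_F = 1 / 40          # prior probability of flu (from the earlier slide)
p_H_given_F = 1 / 2   # probability of headache given flu (from the earlier slide)
p_H = 1 / 10          # marginal probability of headache (assumed value)

p_F_given_H = p_H_given_F * p_F / p_H
print(p_F_given_H)    # 0.125: observing a headache raises P(flu) from 2.5% to 12.5%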
Bernoulli Distribution: toss a single (possibly biased) coin once
Ber(x | θ) = θ^{δ(x,1)} (1 − θ)^{δ(x,0)},   x ∈ 𝒳 = {0, 1},   0 ≤ θ ≤ 1

Binomial Distribution: toss a single (possibly biased) coin n times, and report the number k of times it comes up heads
Bin(k | n, θ) = (n choose k) θ^k (1 − θ)^{n−k},   k ∈ {0, 1, 2, . . . , n},   where (n choose k) = n! / ((n − k)! k!)
slide by Erik Suddherth
Binomial Distributions
[Plots of the Binomial(n = 10, θ) pmf for θ = 0.25, θ = 0.50, and θ = 0.90.]
slide by Erik Suddherth
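A small sketch (my own) that evaluates the binomial pmf directly from the formula above, for one of the parameter settings shown in the plots:

from math import comb

def binom_pmf(k, n, theta):
    # Bin(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

n, theta = 10, 0.25
pmf = [binom_pmf(k, n, theta) for k in range(n + 1)]
print(sum(pmf))                                   # 1.0 (up to floating point)
print(max(range(n + 1), key=lambda k: pmf[k]))    # most likely count: k = 2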
http://en.wikipedia.org/wiki/Bean_machine
Multinoulli Distribution: a single roll of a (possibly biased) K-sided die, recorded with a binary vector encoding
x ∈ 𝒳 = {0, 1}^K with Σ_{k=1}^{K} xk = 1
θ = (θ1, θ2, . . . , θK),  θk ≥ 0,  Σ_{k=1}^{K} θk = 1
Cat(x | θ) = Π_{k=1}^{K} θk^{xk}

Multinomial Distribution: roll a single (possibly biased) die n times, and record the number nk of each possible outcome
nk = Σ_{i=1}^{n} xik
Mu(x | n, θ) = (n choose n1 . . . nK) Π_{k=1}^{K} θk^{nk}
slide by Erik Suddherth
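A short sketch (my own) evaluating the multinomial pmf above for a toy die, using the count vector n1, . . . , nK and the multinomial coefficient:

from math import factorial, prod

def multinomial_pmf(counts, theta):
    # Mu(counts | n, theta) = n! / (n_1! ... n_K!) * prod_k theta_k ^ n_k
    n = sum(counts)
    coeff = factorial(n) // prod(factorial(nk) for nk in counts)
    return coeff * prod(t**nk for t, nk in zip(theta, counts))

theta = [1 / 6] * 6                     # a fair six-sided die
counts = [2, 1, 0, 3, 0, 0]             # outcome counts over n = 6 rolls
print(multinomial_pmf(counts, theta))   # probability of exactly these counts (about 0.00129)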
Multinomial Model of DNA
[Figure: a sequence logo for a multinomial model of DNA; the horizontal axis is sequence position (1–15), the vertical axis is information content in bits (0–2).]
slide by Erik Suddherth