Lecture 6: Learning Theory, Probability Review. Aykut Erdem, October 2016, Hacettepe University. PowerPoint PPT Presentation.


SLIDE 1

Lecture 6:

− Learning Theory
− Probability Review

Aykut Erdem

October 2016 Hacettepe University

SLIDE 2

Last time… Regularization, Cross-Validation

E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) − t_n}² + (λ/2) ∥w∥²

where ∥w∥² ≡ wᵀw = w_0² + w_1² + · · · + w_M²

λ governs the relative importance of the regularization term compared with the sum-of-squares error term.

Coefficients w⋆ of the fitted polynomial for varying regularization strength:

          ln λ = −∞    ln λ = −18    ln λ = 0
w⋆0            0.35          0.35        0.13
w⋆1          232.37          4.74       −0.05
w⋆2        −5321.83         −0.77       −0.06
w⋆3        48568.31        −31.97       −0.05
w⋆4      −231639.30         −3.89       −0.03
w⋆5       640042.26         55.28       −0.02
w⋆6     −1061800.52         41.32       −0.01
w⋆7      1042400.18        −45.95       −0.00
w⋆8      −557682.99        −91.53        0.00
w⋆9       125201.43         72.68        0.01

[Figure: the data, NN classifier, 5-NN classifier. Figure credit: Fei-Fei Li, Andrej Karpathy, Justin Johnson]

SLIDE 3

Today

  • Learning Theory

− Why ML works

  • Probability Review

SLIDE 4

Learning Theory: 
 Why ML Works

SLIDE 5

Computational Learning 
 Theory

  • Entire subfield devoted to the mathematical analysis of machine learning algorithms

  • Has led to several practical methods:

− PAC (probably approximately correct) learning → boosting
− VC (Vapnik–Chervonenkis) theory → support vector machines


slide by Eric Eaton

Annual conference: Conference on Learning Theory (COLT)

SLIDE 6

Computational Learning Theory

  • Is learning always possible?
  • How many training examples will I need to do a good job learning?
  • Is my test performance going to be much worse than my training performance?

The key idea that underlies all these answers is that simple functions generalize well.

adapted from Hal Daume III

SLIDE 7

The Role of Theory

  • Theory can serve two roles:

− It can justify and help understand why common practice works.

− It can also serve to suggest new algorithms and approaches that turn out to work well in practice.

adapted from Hal Daume III


Often, it turns out to be a mix!

SLIDE 8

The Role of Theory

  • Practitioners discover something that works surprisingly well.
  • Theorists figure out why it works and prove something about it.

− In the process, they make it better or find new algorithms.

  • Theory can also help you understand what’s possible and what’s not possible.

adapted from Hal Daume III

SLIDE 9

Induction is Impossible

  • From an algorithmic perspective, a natural question is

− whether there is an “ultimate” learning algorithm, A^awesome, that solves the Binary Classification problem.

  • Have you been wasting your time learning about KNN and other methods like the Perceptron and decision trees, when A^awesome is out there?
  • What would such an ultimate learning algorithm do?

− Take in a data set D and produce a function f.
− No matter what D looks like, this function f should get perfect classification on all future examples drawn from the same distribution that produced D.

adapted from Hal Daume III

SLIDE 10

Induction is Impossible

(The same slide as before, now with the verdict stamped across it: Impossible.)

adapted from Hal Daume III

SLIDE 11

Label Noise

  • Let X = {−1, +1} (i.e., data points are one-dimensional and binary), with distribution D:

D(⟨+1⟩, +1) = 0.4    D(⟨+1⟩, −1) = 0.1
D(⟨−1⟩, −1) = 0.4    D(⟨−1⟩, +1) = 0.1

− 80% of data points in this distribution have x = y and 20% don’t.

  • No matter what function your learning algorithm produces, there’s no way that it can do better than 20% error on this data.

− No A^awesome exists that always achieves an error rate of zero.
− The best that we can hope is that the error rate is not “too large.”

adapted from Hal Daume III
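The 20% irreducible error can be checked numerically. Here is a minimal simulation (my own illustration, not from the slides) that samples from the distribution D above and scores the best possible predictor, f(x) = x:

```python
import random

# The four (x, y) outcomes of the distribution D above, with probabilities.
outcomes = [(+1, +1, 0.4), (+1, -1, 0.1), (-1, -1, 0.4), (-1, +1, 0.1)]

def sample(rng):
    r, acc = rng.random(), 0.0
    for x, y, p in outcomes:
        acc += p
        if r < acc:
            return x, y
    return outcomes[-1][0], outcomes[-1][1]

rng = random.Random(0)
n = 100_000
# The best possible predictor here is f(x) = x; it errs exactly when the
# label is flipped, which happens with probability 0.2.
err = sum(x != y for x, y in (sample(rng) for _ in range(n))) / n
print(err)  # close to 0.2
```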

SLIDE 12

Sampling

  • Another source of difficulty comes from the fact that the only access we have to the data distribution is through sampling.

− When trying to learn about a distribution, you only get to see data points drawn from that distribution.
− You know that “eventually” you will see enough data points that your sample is representative of the distribution, but it might not happen immediately.

  • For instance, even though a fair coin will come up heads only with probability 1/2, it’s completely plausible that in a sequence of four coin flips you never see a tails, or perhaps only see one tails.

adapted from Hal Daume III
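The four-flips claim is a one-line binomial computation; a quick sketch (mine, not from the slides):

```python
from math import comb

# P(at most one tails in four fair flips) = P(0 tails) + P(1 tails)
# under Bin(k | n=4, theta=0.5); each sequence has probability 0.5**4.
p = sum(comb(4, k) * 0.5**4 for k in (0, 1))
print(p)  # 5/16 = 0.3125
```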

SLIDE 13

Induction is Impossible

  • We need to understand that A^awesome will not always work.

− In particular, if we happen to get a lousy sample of data from D, we need to allow A^awesome to do something completely unreasonable.

  • We cannot hope that A^awesome will do perfectly, every time.

adapted from Hal Daume III

The best we can reasonably hope of Aawesome is that it will do pretty well, most of the time.

SLIDE 14

Probably Approximately Correct 
 (PAC) Learning

  • A formalism based on the realization that the best we can hope of an algorithm is that

− it does a good job most of the time (probably approximately correct).

  • Consider a hypothetical learning algorithm:

− We have 10 different binary classification data sets.
− For each one, it comes back with functions f1, f2, . . . , f10.

✦ For some reason, whenever you run f4 on a test point, it crashes your computer.
✦ For the other learned functions, their performance on test data is always at most 5% error.
✦ If this situation is guaranteed to happen, then this hypothetical learning algorithm is a PAC learning algorithm.

✤ It satisfies probably because it only failed in one out of ten cases, and it’s approximate because it achieved low, but non-zero, error on the remainder of the cases.

adapted from Hal Daume III

SLIDE 15

PAC Learning

  • Two notions of efficiency

− Computational complexity: Prefer an algorithm that runs quickly

to one that takes forever

− Sample complexity: The number of examples required for your

algorithm to achieve its goals

adapted from Hal Daume III

Definition 1: An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a “bad function” is at most δ; where a “bad” function is one with test error rate more than ε on D.

Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ.

In other words, to let your algorithm achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor!

SLIDE 16

Example: PAC Learning of Conjunctions

  • Data points are binary vectors, for instance x = ⟨0, 1, 1, 0, 1⟩.
  • Some Boolean conjunction defines the true labeling of this data (e.g., x1 ⋀ x2 ⋀ x5).
  • There is some distribution D_X over binary data points (vectors) x = ⟨x1, x2, . . . , xD⟩.
  • There is a fixed concept conjunction c that we are trying to learn.
  • There is no noise, so for any example x, its true label is simply y = c(x).
  • Example (see Table 10.1):

− Clearly, the true formula cannot include the terms x1, x2, ¬x3, ¬x4.
adapted from Hal Daume III

y   x1  x2  x3  x4
+1   0   0   1   1
+1   0   1   1   1
−1   1   1   0   1

Table 10.1: Data set for learning conjunctions.

SLIDE 17

Example: PAC Learning of Conjunctions

f0(x) = x1 ⋀ ¬x1 ⋀ x2 ⋀ ¬x2 ⋀ x3 ⋀ ¬x3 ⋀ x4 ⋀ ¬x4
f1(x) = ¬x1 ⋀ ¬x2 ⋀ x3 ⋀ x4
f2(x) = ¬x1 ⋀ x3 ⋀ x4
f3(x) = ¬x1 ⋀ x3 ⋀ x4

  • After processing an example, it is guaranteed to classify that example correctly (provided that there is no noise).
  • Computationally very efficient:

− Given a data set of N examples in D dimensions, it takes O(ND) time to process the data. This is linear in the size of the data set.

Algorithm 30 BinaryConjunctionTrain(D)
1: f ← x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xD ∧ ¬xD    // initialize function
2: for all positive examples (x, +1) in D do
3:   for d = 1 . . . D do
4:     if xd = 0 then
5:       f ← f without term “xd”
6:     else
7:       f ← f without term “¬xd”
8:     end if
9:   end for
10: end for
11: return f

adapted from Hal Daume III


“Throw Out Bad Terms”
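Algorithm 30 translates almost line for line into Python. The sketch below is my own rendering (the function names are mine), run on the Table 10.1 data set:

```python
# A direct sketch of Algorithm 30 (BinaryConjunctionTrain).
# A conjunction is a set of literals: d means "x_d", -d means "not x_d"
# (features are numbered from 1).

def binary_conjunction_train(data):
    """data: list of (x, y) pairs; x is a tuple of 0/1, y is +1 or -1."""
    D = len(data[0][0])
    # Start with the most specific conjunction: every literal and its negation.
    f = {lit for d in range(1, D + 1) for lit in (d, -d)}
    for x, y in data:
        if y != +1:
            continue  # only positive examples shrink the conjunction
        for d in range(1, D + 1):
            # Throw out the literal this positive example contradicts.
            f.discard(d if x[d - 1] == 0 else -d)
    return f

def predict(f, x):
    return +1 if all((x[abs(l) - 1] == 1) == (l > 0) for l in f) else -1

# The data set from Table 10.1:
data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
f = binary_conjunction_train(data)
print(sorted(f))  # [-1, 3, 4], i.e. not x1 AND x3 AND x4
```

This reproduces f3(x) = ¬x1 ⋀ x3 ⋀ x4 from the slide, and the learned conjunction classifies all three training examples correctly.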

SLIDE 18
  • Is this an efficient (ε, δ)-PAC learning algorithm?
  • What about sample complexity?

− How many examples N do you need to see in order to guarantee that it achieves an error rate of at most ε (in all but δ-many cases)?
− Perhaps N has to be gigantic (like 2^(2D)/ε) to (probably) guarantee a small error.

adapted from Hal Daume III


SLIDE 19
  • Prove that the number of samples N required to (probably) achieve a small error is not too big.
  • Sketch of the proof:

− Say there is some term (say ¬x8) that should have been thrown out, but wasn’t.
− If this was the case, then you must not have seen any positive training examples with x8 = 1.
− So examples with x8 = 1 must have low probability (otherwise you would have seen them). So such a thing is not that common.

adapted from Hal Daume III


SLIDE 20

Occam’s Razor

  • Simple solutions generalize well.
  • The hypothesis class H is the set of all Boolean formulae over D-many variables.

− The hypothesis class for Boolean conjunctions is finite; the hypothesis class for linear classifiers is infinite.
− For Occam’s razor, we can only work with finite hypothesis classes.

adapted from Hal Daume III

William of Occam 
 (c. 1288 – c. 1348)

“If one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it; i.e., one should always opt for an explanation in terms of the fewest possible number of causes, factors, or variables.”

Theorem 14 (Occam’s Bound). Suppose A is an algorithm that learns a function f from some finite hypothesis class H. Suppose the learned function always gets zero error on the training data. Then, the sample complexity of f is at most log |H|.
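Occam’s Bound is usually applied in its (ε, δ) form, N ≥ (ln|H| + ln(1/δ))/ε; that form, and the worked numbers below, are my own addition rather than part of the slide. For conjunctions over D Boolean variables, |H| = 3^D, since each variable appears positively, negatively, or not at all:

```python
from math import ceil, log

def occam_samples(log_H, eps, delta):
    # N >= (ln|H| + ln(1/delta)) / eps, rounded up.
    return ceil((log_H + log(1.0 / delta)) / eps)

D = 20               # number of Boolean variables (my example value)
log_H = D * log(3)   # ln|H| for conjunctions: |H| = 3^D
print(occam_samples(log_H, eps=0.05, delta=0.01))  # 532
```

So a few hundred examples suffice here, not the feared exponential 2^(2D)/ε.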

SLIDE 21

Complexity of Infinite 
 Hypothesis Spaces

  • Occam’s Bound is completely useless when |H| = ∞.
  • In our example, instead of representing your hypothesis as a Boolean conjunction, represent it as a conjunction of inequalities.

− Instead of having x1 ∧ ¬x2 ∧ x5, you have [x1 > 0.2] ∧ [x2 < 0.77] ∧ [x5 < π/4].
− In this representation, for each feature, you need to choose an inequality (< or >) and a threshold.
− Since the thresholds can be arbitrary real values, there are now infinitely many possibilities: |H| = 2D×∞ = ∞.

adapted from Hal Daume III

SLIDE 22

Vapnik-Chervonenkis 
 (VC) Dimension

  • A classic measure of the complexity of infinite hypothesis classes, based on this intuition.
  • The VC dimension is a very classification-oriented notion of complexity.

− The idea is to look at a finite set of unlabeled examples and ask: no matter how these points were labeled, would we be able to find a hypothesis that correctly classifies them?

  • The idea is that as you add more points, being able to represent an arbitrary labeling becomes harder and harder.

adapted from Hal Daume III

Definition 2. For data drawn from some space 𝒳, the VC dimension of a hypothesis space H over 𝒳 is the maximal K such that: there exists a set X ⊆ 𝒳 of size |X| = K, such that for all binary labelings of X, there exists a function f ∈ H that matches this labeling.

SLIDE 23

VC Dimension Example

  • The first 3 examples show that the class of lines in the plane can shatter 3 points.
  • However, the last example shows that this class cannot shatter 4 points.
  • Hence the VC dimension of the class of straight lines in the plane is 3.
  • Note that a class of nonlinear curves could shatter four points, and hence has VC dimension greater than 3.

adapted from Trevor Hastie, Robert Tibshirani, Jerome Friedman
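These shattering claims can be checked by brute force. The sketch below (my own, not from the slides) tests every labeling with a perceptron, which is guaranteed to converge whenever the labeling is linearly separable; the XOR-style labelings of four points have no linear separator, so the iteration cap flags them:

```python
import itertools

# Linear classifier sign(w1*x1 + w2*x2 + b), trained by the perceptron rule.
def separable(points, labels, epochs=1000):
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:
                w1 += y * x1; w2 += y * x2; b += y
                mistakes += 1
        if mistakes == 0:
            return True   # perfect separator found
    return False          # no separator within the cap (non-separable here)

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

shatters3 = all(separable(three, lab)
                for lab in itertools.product((-1, 1), repeat=3))
shatters4 = all(separable(four, lab)
                for lab in itertools.product((-1, 1), repeat=4))
print(shatters3, shatters4)  # True False
```

All 8 labelings of the 3 points are realized by some line, while the 4 corner points fail on the XOR labeling, matching the VC dimension of 3.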

SLIDE 24

Basic Probability
 Review

SLIDE 25

Probability

  • A is a non-deterministic event

− Can think of A as a boolean-valued variable

  • Examples

− A = your next patient has cancer
− A = Rafael Nadal wins French Open 2015

slide by Dhruv Batra

SLIDE 26

Interpreting Probabilities

“If I flip this coin, the probability that it will come up heads is 0.5.”

  • Frequentist interpretation: If we flip this coin many times, it will come up heads about half the time. Probabilities are the expected frequencies of events over repeated trials.
  • Bayesian interpretation: I believe that my next toss of this coin is equally likely to come up heads or tails. Probabilities quantify subjective beliefs about single events.
  • The viewpoints play complementary roles in machine learning:

− The Bayesian view is used to build models based on domain knowledge, and to automatically derive learning algorithms.
− The frequentist view is used to analyze the worst-case behavior of learning algorithms, in the limit of large datasets.

  • From either view, the basic mathematics is the same!

slide by Erik Suddherth

SLIDE 27

The Axioms of Probability

slide by Andrew Moore

SLIDE 28

Axioms of Probability

  • 0 ≤ P(A) ≤ 1
  • P(empty-set) = 0
  • P(everything) = 1
  • P(A or B) = P(A) + P(B) − P(A and B)

slide by Dhruv Batra
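The axioms can be verified mechanically on any finite "worlds" model; here is a tiny sketch (my own illustration, not from the slides) with six equally likely worlds, where an event is a set of worlds and P is normalized counting:

```python
from fractions import Fraction

# Six equally likely "worlds"; events are subsets; P is normalized counting.
worlds = frozenset(range(6))

def P(event):
    return Fraction(len(event & worlds), len(worlds))

A, B = {0, 1, 2}, {2, 3}  # two example events (my own choice)

assert 0 <= P(A) <= 1
assert P(set()) == 0                         # P(empty-set) = 0
assert P(worlds) == 1                        # P(everything) = 1
assert P(A | B) == P(A) + P(B) - P(A & B)    # inclusion-exclusion
print(P(A | B))  # 2/3
```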

SLIDE 29

Interpreting the Axioms

  • Event space of all possible worlds; its area is 1.
  • Worlds in which A is false; worlds in which A is true.
  • P(A) = area of the reddish oval.

slide by Dhruv Batra

SLIDE 30

Interpreting the Axioms

  • The area of A can’t get any smaller than 0, and a zero area would mean no world could ever have A true.

slide by Dhruv Batra

SLIDE 31

Interpreting the Axioms

  • The area of A can’t get any bigger than 1, and an area of 1 would mean all worlds would have A true.

slide by Dhruv Batra

SLIDE 32

Interpreting the Axioms

  • P(A or B) is the total area covered by A and B; the overlap P(A and B) is counted twice, so subtract it once. Simple addition and subtraction:
  • P(A or B) = P(A) + P(B) − P(A and B)

slide by Dhruv Batra

SLIDE 33

Discrete Random Variables

X : a discrete random variable
𝒳 : the sample space of possible outcomes, which may be finite or countably infinite
x ∈ 𝒳 : an outcome (a sample) of the discrete random variable

slide by Erik Suddherth

SLIDE 34

Discrete Random Variables

p(X = x) : probability distribution (probability mass function)
p(x) : shorthand used when no ambiguity

0 ≤ p(x) ≤ 1 for all x ∈ 𝒳
Σ_{x∈𝒳} p(x) = 1

Example, for 𝒳 = {1, 2, 3, 4}: a uniform distribution; a degenerate distribution.

slide by Erik Suddherth
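The two pmf conditions can be checked directly on the 𝒳 = {1, 2, 3, 4} example; the specific uniform and degenerate pmfs below are my own illustration:

```python
from fractions import Fraction

# Two pmfs on the sample space X = {1, 2, 3, 4}.
uniform = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}
degenerate = {x: Fraction(x == 2) for x in (1, 2, 3, 4)}  # all mass on x = 2

for p in (uniform, degenerate):
    assert all(0 <= v <= 1 for v in p.values())  # 0 <= p(x) <= 1
    assert sum(p.values()) == 1                  # sum over X is 1
print(uniform[3], degenerate[3])  # 1/4 0
```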

SLIDE 35

Joint Distribution


slide by Dhruv Batra

SLIDE 36

Marginalization

  • Marginalization

− Events: P(A) = P(A and B) + P(A and not B)
− Random variables: P(X = x) = Σ_y P(X = x, Y = y)

slide by Dhruv Batra

SLIDE 37

Marginal Distributions

p(x, y) = Σ_{z∈𝒵} p(x, y, z)

p(x) = Σ_{y∈𝒴} p(x, y)

slide by Erik Suddherth
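Marginalization is just summing a joint table along one axis; the sketch below uses a made-up joint distribution (my own example values, not from the slides):

```python
from fractions import Fraction as F

# A made-up joint table p(x, y) over weather and temperature.
p_xy = {("sun", "hot"): F(4, 10), ("sun", "cold"): F(1, 10),
        ("rain", "hot"): F(1, 10), ("rain", "cold"): F(4, 10)}

def marginal_x(p_xy):
    p_x = {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, F(0)) + p  # p(x) = sum over y of p(x, y)
    return p_x

p_x = marginal_x(p_xy)
print(p_x["sun"], p_x["rain"])  # 1/2 1/2
```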

SLIDE 38

Conditional Probabilities

  • P(Y = y | X = x)
  • What do you believe about Y = y, if I tell you X = x?
  • P(Rafael Nadal wins French Open 2015)?
  • What if I tell you:

− He has won the French Open 9 of the 10 times he has played there.
− Novak Djokovic is ranked 1, and just won the Australian Open.
− I offered a similar analysis last year and Nadal won.

slide by Dhruv Batra

SLIDE 39

Conditional Probabilities

  • P(A | B) = in worlds where B is true, the fraction where A is also true.
  • Example:

− H: “Have a headache”
− F: “Coming down with flu”
− P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2

  • Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.

slide by Dhruv Batra
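The three numbers on the slide determine the reverse conditional as well; deriving P(F | H) from them via Bayes' rule (the derivation is my addition, the inputs are the slide's):

```python
from fractions import Fraction as F

# The quantities given on the slide.
P_H, P_F, P_H_given_F = F(1, 10), F(1, 40), F(1, 2)

# Bayes' rule: P(F | H) = P(H | F) P(F) / P(H).
P_F_given_H = P_H_given_F * P_F / P_H
print(P_F_given_H)  # 1/8
```

So observing a headache raises the probability of flu from 1/40 to 1/8.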

SLIDE 40

Conditional Distributions


slide by Erik Suddherth

SLIDE 41

Independent Random Variables

X ⊥ Y  ⟺  p(x, y) = p(x) p(y) for all x ∈ 𝒳, y ∈ 𝒴

Equivalent conditions on conditional probabilities:
p(x | Y = y) = p(x) for all y ∈ 𝒴 with p(y) > 0
p(y | X = x) = p(y) for all x ∈ 𝒳 with p(x) > 0

slide by Erik Suddherth

SLIDE 42

Bayes Rule (Bayes Theorem)

  • A basic identity that follows from the definition of conditional probability:

p(x, y) = p(x) p(y | x) = p(y) p(x | y)

p(y | x) = p(x, y) / p(x) = p(x | y) p(y) / Σ_{y′∈𝒴} p(y′) p(x | y′) ∝ p(x | y) p(y)

  • Used in ways that have nothing to do with Bayesian statistics!
  • Typical application to learning and data analysis:

− Y : unknown parameters we would like to infer
− X = x : observed data available for learning
− p(y) : prior distribution (domain knowledge)
− p(x | y) : likelihood function (measurement model)
− p(y | x) : posterior distribution (learned information)

slide by Erik Suddherth
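The prior/likelihood/posterior recipe above can be sketched end to end; the two-hypothesis coin example and its numbers are my own illustration:

```python
from fractions import Fraction as F

# y: which coin we hold (two hypotheses); x: an observed heads.
prior = {"fair": F(1, 2), "biased": F(1, 2)}        # p(y)
likelihood = {"fair": F(1, 2), "biased": F(9, 10)}  # p(x = heads | y)

# Bayes' rule: p(y | x) = p(y) p(x | y) / sum over y' of p(y') p(x | y').
evidence = sum(prior[y] * likelihood[y] for y in prior)             # p(x)
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}
print(posterior["biased"])  # 9/14
```

One observed heads shifts belief from 1/2 to 9/14 in favor of the biased coin.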

SLIDE 43

Binary Random Variables

  • Bernoulli distribution: single toss of a (possibly biased) coin:

Ber(x | θ) = θ^{δ(x,1)} (1 − θ)^{δ(x,0)},   𝒳 = {0, 1},   0 ≤ θ ≤ 1

  • Binomial distribution: toss a single (possibly biased) coin n times, and report the number k of times it comes up heads:

Bin(k | n, θ) = (n choose k) θ^k (1 − θ)^{n−k},   (n choose k) = n! / ((n − k)! k!),   k ∈ {0, 1, 2, . . . , n}
slide by Erik Suddherth
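Both pmfs are one-liners; this sketch implements the formulas as defined above and sanity-checks that each sums to 1 over its sample space (the θ values are my own example choices):

```python
from math import comb

def bernoulli(x, theta):
    # Ber(x | theta) = theta if x = 1, else 1 - theta.
    return theta if x == 1 else 1.0 - theta

def binomial(k, n, theta):
    # Bin(k | n, theta) = C(n, k) theta^k (1 - theta)^(n - k).
    return comb(n, k) * theta**k * (1.0 - theta)**(n - k)

# Each pmf sums to 1 over its sample space.
assert abs(sum(bernoulli(x, 0.3) for x in (0, 1)) - 1.0) < 1e-12
assert abs(sum(binomial(k, 10, 0.25) for k in range(11)) - 1.0) < 1e-12
print(round(binomial(2, 10, 0.25), 4))  # 0.2816, the mode for theta = 0.25
```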

SLIDE 44

Binomial Distributions

[Figure: Binomial pmfs Bin(k | n = 10, θ) for θ = 0.250, θ = 0.500, and θ = 0.900.]
slide by Erik Suddherth

SLIDE 45

Bean Machine (Sir Francis Galton)


http://en.wikipedia.org/wiki/Bean_machine

SLIDE 46

Categorical Random Variables

  • Multinoulli distribution: single roll of a (possibly biased) die, with the outcome written as a one-hot binary vector:

x ∈ 𝒳 = {0, 1}^K,   Σ_{k=1}^{K} x_k = 1
θ = (θ_1, θ_2, . . . , θ_K),   θ_k ≥ 0,   Σ_{k=1}^{K} θ_k = 1

Cat(x | θ) = Π_{k=1}^{K} θ_k^{x_k}

  • Multinomial distribution: roll a single (possibly biased) die n times, and record the number n_k of each possible outcome:

n_k = Σ_{i=1}^{n} x_{ik}

Mu(x | n, θ) = (n choose n_1 . . . n_K) Π_{k=1}^{K} θ_k^{n_k}

slide by Erik Suddherth
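The Cat and Mu formulas can be sketched directly; the 3-sided die and its θ values below are my own example:

```python
from math import factorial, prod

def cat(x, theta):
    # Cat(x | theta) for a one-hot x: product of theta_k ** x_k.
    return prod(t**xk for t, xk in zip(theta, x))

def multinomial_pmf(counts, n, theta):
    # Mu(counts | n, theta): multinomial coefficient times product of
    # theta_k ** n_k, with counts = (n_1, ..., n_K) summing to n.
    coef = factorial(n) // prod(factorial(nk) for nk in counts)
    return coef * prod(t**nk for t, nk in zip(theta, counts))

theta = (0.5, 0.25, 0.25)            # a biased 3-sided die
print(cat((0, 1, 0), theta))         # 0.25, one roll landing on side 2
print(multinomial_pmf((2, 1, 0), 3, theta))  # 3 * 0.5**2 * 0.25 = 0.1875
```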

SLIDE 47

Aligned DNA Sequences


slide by Erik Suddherth

SLIDE 48

Multinomial Model of DNA

[Figure: multinomial model of DNA; x-axis: sequence position (1-15), y-axis: bits (0-2).]

slide by Erik Suddherth