Foundations of Machine Learning


SLIDE 1: This Lecture

  • Basic definitions and concepts.
  • Introduction to the problem of learning.
  • Probability tools.


SLIDE 2: Definitions

Spaces: input space $X$, output space $Y$.
Loss function: $L \colon Y \times Y \to \mathbb{R}$.

  • $L(\hat{y}, y)$: cost of predicting $\hat{y}$ instead of $y$.
  • binary classification: 0-1 loss, $L(y, y') = 1_{y \neq y'}$.
  • regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.

Hypothesis set: $H \subseteq Y^X$, subset of functions out of which the learner selects its hypothesis.

  • depends on features.
  • represents prior knowledge about the task.
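As a concrete illustration, here is a minimal Python sketch of the two loss functions named above; the function names are our own, not from the slides.

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss for binary classification: 1 if the prediction is wrong."""
    return 1 if y_pred != y_true else 0

def squared_loss(y_pred, y_true):
    """Squared loss for regression: (y' - y)^2."""
    return (y_pred - y_true) ** 2

print(zero_one_loss(1, 0))     # 1 (misclassification)
print(zero_one_loss(1, 1))     # 0 (correct prediction)
print(squared_loss(2.5, 3.0))  # 0.25
```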

SLIDE 3: Supervised Learning Set-Up

Training data: sample $S$ of size $m$ drawn i.i.d. from $X \times Y$ according to distribution $D$:
$$S = ((x_1, y_1), \ldots, (x_m, y_m)).$$
Problem: find hypothesis $h \in H$ with small generalization error.

  • deterministic case: output label deterministic function of input, $y = f(x)$.
  • stochastic case: output probabilistic function of input.
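To make the set-up concrete, here is a small sketch that draws such a sample; the uniform input distribution, the threshold target function, and the flip probability are illustrative assumptions, not part of the slide.

```python
import random

def draw_sample(m, stochastic=False, flip_prob=0.1, seed=0):
    """Draw an i.i.d. sample S = ((x1, y1), ..., (xm, ym)) of size m.

    Deterministic case: y = f(x) exactly.
    Stochastic case: the label is flipped with probability flip_prob,
    so y is only a probabilistic function of x.
    """
    rng = random.Random(seed)
    f = lambda x: 1 if x >= 0.5 else 0  # hypothetical target function
    sample = []
    for _ in range(m):
        x = rng.random()                # x drawn uniformly: our choice of D
        y = f(x)
        if stochastic and rng.random() < flip_prob:
            y = 1 - y                   # label noise
        sample.append((x, y))
    return sample

print(draw_sample(5))                   # deterministic labels
print(draw_sample(5, stochastic=True))  # noisy labels
```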

SLIDE 4: Errors

Generalization error: for $h \in H$, it is defined by
$$R(h) = \mathop{\mathbb{E}}_{(x,y) \sim D}[L(h(x), y)].$$
Empirical error: for $h \in H$ and sample $S$, it is
$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$$
Bayes error:
$$R^\star = \inf_{h \text{ measurable}} R(h).$$

  • in deterministic case, $R^\star = 0$.
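A minimal sketch of these two quantities under the 0-1 loss, with a Monte Carlo stand-in for the expectation defining $R(h)$; the distribution and the hypothesis used here are hypothetical illustrations.

```python
import random

def draw_pair(rng, flip_prob=0.1):
    """One draw (x, y) ~ D: uniform x, noisy threshold label (illustrative)."""
    x = rng.random()
    y = 1 if x >= 0.5 else 0
    if rng.random() < flip_prob:
        y = 1 - y
    return x, y

def empirical_error(h, sample):
    """hat{R}(h): average 0-1 loss of h over the sample S."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def generalization_error(h, n=100_000, seed=1):
    """Monte Carlo estimate of R(h) = E_{(x,y)~D}[L(h(x), y)]."""
    rng = random.Random(seed)
    return sum(h(x) != y for x, y in (draw_pair(rng) for _ in range(n))) / n

h = lambda x: 1 if x >= 0.4 else 0  # a fixed, slightly misplaced threshold
rng = random.Random(0)
S = [draw_pair(rng) for _ in range(50)]
print(empirical_error(h, S))        # hat{R}(h): noisy, sample-dependent
print(generalization_error(h))      # ~ R(h) = 0.18 for these choices
```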

SLIDE 5: Noise

Noise:

  • in binary classification, for any $x \in X$, $\mathrm{noise}(x) = \min\{\Pr[1|x], \Pr[0|x]\}$.
  • observe that $\mathbb{E}[\mathrm{noise}(x)] = R^\star$.

SLIDE 6: Learning ≠ Fitting


Notion of simplicity/complexity. How do we define complexity?

SLIDE 7: Generalization

Observations:

  • the best hypothesis on the sample may not be the best overall.
  • generalization is not memorization.
  • complex rules (very complex separation surfaces) can be poor predictors.
  • trade-off: complexity of hypothesis set vs. sample size (underfitting/overfitting).

SLIDE 8: Model Selection

General equality: for any $h \in H$,
$$R(h) - R^\star = \underbrace{[R(h) - R(h^\star)]}_{\text{estimation}} + \underbrace{[R(h^\star) - R^\star]}_{\text{approximation}},$$
where $h^\star$ is the best-in-class hypothesis in $H$.

Approximation: not a random variable, only depends on $H$.
Estimation: only term we can hope to bound.

SLIDE 9: Empirical Risk Minimization

Select hypothesis set $H$. Find hypothesis $h \in H$ minimizing empirical error:
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h).$$

  • but $H$ may be too complex.
  • the sample size may not be large enough.
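A minimal ERM sketch over a finite, hypothetical hypothesis set of threshold classifiers; with a finite $H$, the argmin can be computed by direct enumeration.

```python
import random

def erm(sample, hypotheses):
    """Return argmin over h in H of the empirical error hat{R}(h) (0-1 loss)."""
    emp_err = lambda h: sum(h(x) != y for x, y in sample) / len(sample)
    return min(hypotheses, key=emp_err)

# Hypothetical finite H: threshold classifiers x -> 1_{x >= theta}.
thresholds = [i / 10 for i in range(11)]
H = [lambda x, t=t: 1 if x >= t else 0 for t in thresholds]

rng = random.Random(0)
S = [(x, 1 if x >= 0.5 else 0) for x in (rng.random() for _ in range(100))]
h_hat = erm(S, H)
print(sum(h_hat(x) != y for x, y in S) / len(S))  # empirical error of ERM pick
```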

SLIDE 10: Generalization Bounds

Definition: upper bound on
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \hat{R}(h)\big| > \epsilon\Big].$$
Bound on estimation error for hypothesis $h_0$ given by ERM:
$$R(h_0) - R(h^\star) = R(h_0) - \hat{R}(h_0) + \hat{R}(h_0) - R(h^\star) \leq R(h_0) - \hat{R}(h_0) + \hat{R}(h^\star) - R(h^\star) \leq 2 \sup_{h \in H} \big|R(h) - \hat{R}(h)\big|,$$
where the first inequality holds since $h_0$ minimizes the empirical error, so $\hat{R}(h_0) \leq \hat{R}(h^\star)$.

How should we choose $H$? (model selection problem)
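For a finite hypothesis set, a standard route combines the union bound with Hoeffding's inequality (both covered later in this lecture) to bound this supremum; a small sketch of the resulting sample-size calculation, as a hedged illustration:

```python
import math

def uniform_deviation(H_size, m, delta):
    """For finite H, the union bound over h in H plus Hoeffding's inequality
    give Pr[sup_h |R(h) - hat{R}(h)| > eps] <= 2 |H| exp(-2 m eps^2).
    Inverting: with probability at least 1 - delta,
        sup_h |R(h) - hat{R}(h)| <= sqrt(log(2 |H| / delta) / (2 m)).
    """
    return math.sqrt(math.log(2 * H_size / delta) / (2 * m))

print(uniform_deviation(H_size=1000, m=10_000, delta=0.05))  # ~ 0.023
```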

SLIDE 11: Model Selection

$$H = \bigcup_{\gamma \in \Gamma} H_\gamma.$$

[Figure: error as a function of the complexity parameter $\gamma$, showing the estimation error, the approximation error, and their upper bound, with the optimum at $\gamma^\star$.]

SLIDE 12: Structural Risk Minimization (Vapnik, 1995)

Principle: consider an infinite sequence of hypothesis sets ordered for inclusion,
$$H_1 \subset H_2 \subset \cdots \subset H_n \subset \cdots$$
and select
$$h = \operatorname*{argmin}_{h \in H_n,\, n \in \mathbb{N}} \hat{R}(h) + \mathrm{penalty}(H_n, m).$$

  • strong theoretical guarantees.
  • typically computationally hard.

SLIDE 13: General Algorithm Families

Empirical risk minimization (ERM):
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h).$$
Structural risk minimization (SRM): $H_n \subseteq H_{n+1}$,
$$h = \operatorname*{argmin}_{h \in H_n,\, n \in \mathbb{N}} \hat{R}(h) + \mathrm{penalty}(H_n, m).$$
Regularization-based algorithms: $\lambda \geq 0$,
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h) + \lambda \|h\|^2.$$
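As one concrete member of the regularization family, ridge regression pairs the empirical squared loss with the $\lambda\|\cdot\|^2$ penalty over linear hypotheses; a minimal sketch (our own construction, using the well-known closed-form minimizer):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize hat{R}(w) + lam * ||w||^2 with hat{R} the mean squared error.
    Setting the gradient (2/m) X^T (X w - y) + 2 lam w to zero yields
        w = (X^T X + lam * m * I)^{-1} X^T y.
    """
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=200)
print(ridge_fit(X, y, lam=0.01))  # close to w_true; larger lam shrinks w to 0
```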

SLIDE 14: This Lecture

  • Basic definitions and concepts.
  • Introduction to the problem of learning.
  • Probability tools.


SLIDE 15: Basic Properties

Union bound: $\Pr[A \vee B] \leq \Pr[A] + \Pr[B]$.

Inversion: if $\Pr[X \geq \epsilon] \leq f(\epsilon)$, then, for any $\delta > 0$, with probability at least $1 - \delta$, $X \leq f^{-1}(\delta)$.

Jensen's inequality: if $f$ is convex, $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$.

Expectation: if $X \geq 0$, $\mathbb{E}[X] = \int_0^{+\infty} \Pr[X > t]\, dt$.
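As a quick worked example of inversion, take the two-sided bound $\Pr[|\hat{R}(h) - R(h)| \geq \epsilon] \leq 2e^{-2m\epsilon^2}$ proved later in this lecture. Setting $\delta = 2e^{-2m\epsilon^2}$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\log(2/\delta)/(2m)}$, so with probability at least $1 - \delta$,
$$\big|\hat{R}(h) - R(h)\big| \leq \sqrt{\frac{\log(2/\delta)}{2m}}.$$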

SLIDE 16: Basic Inequalities

Markov's inequality: if $X \geq 0$ and $\epsilon > 0$, then
$$\Pr[X \geq \epsilon] \leq \frac{\mathbb{E}[X]}{\epsilon}.$$
Chebyshev's inequality: for any $\epsilon > 0$,
$$\Pr[|X - \mathbb{E}[X]| \geq \epsilon] \leq \frac{\sigma_X^2}{\epsilon^2}.$$
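A quick Monte Carlo sanity check of both inequalities; the exponential distribution (mean 1, variance 1) is an arbitrary illustrative choice.

```python
import random

rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(100_000)]  # E[X] = Var[X] = 1
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

eps = 3.0
markov_lhs = sum(x >= eps for x in xs) / len(xs)
print(markov_lhs, "<=", mean / eps)    # Markov: Pr[X >= eps] <= E[X]/eps

cheb_lhs = sum(abs(x - mean) >= eps for x in xs) / len(xs)
print(cheb_lhs, "<=", var / eps ** 2)  # Chebyshev: <= sigma^2 / eps^2
```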

SLIDE 17: Hoeffding's Inequality

Theorem: let $X_1, \ldots, X_m$ be independent random variables with the same expectation $\mu$ and $X_i \in [a, b]$ ($a < b$). Then, for any $\epsilon > 0$, the following inequalities hold:
$$\Pr\Big[\mu - \frac{1}{m}\sum_{i=1}^{m} X_i > \epsilon\Big] \leq \exp\Big(\!-\frac{2m\epsilon^2}{(b - a)^2}\Big)$$
$$\Pr\Big[\frac{1}{m}\sum_{i=1}^{m} X_i - \mu > \epsilon\Big] \leq \exp\Big(\!-\frac{2m\epsilon^2}{(b - a)^2}\Big).$$
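A hedged empirical check with Bernoulli(1/2) variables, so $[a, b] = [0, 1]$ and $\mu = 1/2$; the simulated tail probability should sit below the bound.

```python
import math
import random

m, eps, trials = 100, 0.1, 20_000
rng = random.Random(0)

exceed = sum(
    (sum(rng.random() < 0.5 for _ in range(m)) / m - 0.5) > eps
    for _ in range(trials)
)
print("empirical Pr[mean - mu > eps]:", exceed / trials)  # ~ 0.02
print("Hoeffding bound:", math.exp(-2 * m * eps ** 2))    # e^{-2} ~ 0.135
```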

SLIDE 18: McDiarmid's Inequality (McDiarmid, 1989)

Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \leq 2\exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$

SLIDE 19: Appendix

SLIDE 20: Markov's Inequality

Theorem: let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. Then, for all $t > 0$,
$$\Pr[X \geq t\,\mathbb{E}[X]] \leq \frac{1}{t}.$$
Proof:
$$\Pr[X \geq t\,\mathbb{E}[X]] = \sum_{x \geq t\,\mathbb{E}[X]} \Pr[X = x] \leq \sum_{x \geq t\,\mathbb{E}[X]} \Pr[X = x]\,\frac{x}{t\,\mathbb{E}[X]} \leq \sum_{x} \Pr[X = x]\,\frac{x}{t\,\mathbb{E}[X]} = \mathbb{E}\Big[\frac{X}{t\,\mathbb{E}[X]}\Big] = \frac{1}{t}.$$

SLIDE 21: Chebyshev's Inequality

Theorem: let $X$ be a random variable with $\mathrm{Var}[X] < \infty$. Then, for all $t > 0$,
$$\Pr[|X - \mathbb{E}[X]| \geq t\sigma_X] \leq \frac{1}{t^2}.$$
Proof: observe that
$$\Pr[|X - \mathbb{E}[X]| \geq t\sigma_X] = \Pr[(X - \mathbb{E}[X])^2 \geq t^2\sigma_X^2].$$
The result follows from Markov's inequality.

SLIDE 22: Weak Law of Large Numbers

Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for any $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| \geq \epsilon] = 0.$$
Proof: since the variables are independent,
$$\mathrm{Var}[\bar{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[\frac{X_i}{n}\Big] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
Thus, by Chebyshev's inequality,
$$\Pr[|\bar{X}_n - \mu| \geq \epsilon] \leq \frac{\sigma^2}{n\epsilon^2}.$$
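A short simulation of this convergence for fair-coin flips ($\mu = 1/2$, $\sigma^2 = 1/4$), an illustrative choice: the deviation probability falls with $n$ and stays below the Chebyshev bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.05, 100_000
for n in [100, 1_000, 10_000]:
    means = rng.binomial(n, 0.5, size=trials) / n    # sample means of n flips
    dev_prob = np.mean(np.abs(means - 0.5) >= eps)   # Pr[|mean - mu| >= eps]
    print(n, dev_prob, "<=", 0.25 / (n * eps ** 2))  # Chebyshev: sigma^2/(n eps^2)
```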

SLIDE 23: Concentration Inequalities

Some general tools for error analysis and bounds:

  • Hoeffding’s inequality (additive).
  • Chernoff bounds (multiplicative).
  • McDiarmid’s inequality (more general).


SLIDE 24: Hoeffding's Lemma

Lemma: let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $X \in [a, b]$, $b \neq a$. Then, for any $t > 0$,
$$\mathbb{E}[e^{tX}] \leq e^{\frac{t^2(b - a)^2}{8}}.$$
Proof: by convexity of $x \mapsto e^{tx}$, for all $a \leq x \leq b$,
$$e^{tx} \leq \frac{b - x}{b - a}\,e^{ta} + \frac{x - a}{b - a}\,e^{tb}.$$
Thus,
$$\mathbb{E}[e^{tX}] \leq \mathbb{E}\Big[\frac{b - X}{b - a}\,e^{ta} + \frac{X - a}{b - a}\,e^{tb}\Big] = \frac{b}{b - a}\,e^{ta} + \frac{-a}{b - a}\,e^{tb} = e^{\phi(t)},$$
with
$$\phi(t) = \log\Big(\frac{b}{b - a}\,e^{ta} + \frac{-a}{b - a}\,e^{tb}\Big) = ta + \log\Big(\frac{b}{b - a} + \frac{-a}{b - a}\,e^{t(b - a)}\Big).$$

SLIDE 25 (Hoeffding's Lemma, continued)

Taking the derivative gives:
$$\phi'(t) = a - \frac{a\,e^{t(b - a)}}{\frac{b}{b - a} - \frac{a}{b - a}\,e^{t(b - a)}} = a - \frac{a}{\frac{b}{b - a}\,e^{-t(b - a)} - \frac{a}{b - a}}.$$
Note that:
$$\phi(0) = 0 \quad \text{and} \quad \phi'(0) = 0.$$
Furthermore, with $\alpha = \frac{-a}{b - a}$,
$$\phi''(t) = \frac{-ab\,e^{-t(b - a)}}{\big[\frac{b}{b - a}\,e^{-t(b - a)} - \frac{a}{b - a}\big]^2} = \frac{\alpha(1 - \alpha)\,e^{-t(b - a)}\,(b - a)^2}{\big[(1 - \alpha)\,e^{-t(b - a)} + \alpha\big]^2} = \frac{\alpha}{(1 - \alpha)\,e^{-t(b - a)} + \alpha} \cdot \frac{(1 - \alpha)\,e^{-t(b - a)}}{(1 - \alpha)\,e^{-t(b - a)} + \alpha}\,(b - a)^2 = u(1 - u)(b - a)^2 \leq \frac{(b - a)^2}{4}.$$
There exists $0 \leq \theta \leq t$ such that:
$$\phi(t) = \phi(0) + t\phi'(0) + \frac{t^2}{2}\phi''(\theta) \leq t^2\,\frac{(b - a)^2}{8}.$$

SLIDE 26: Hoeffding's Theorem

Theorem: let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for $S_m = \sum_{i=1}^{m} X_i$ and any $\epsilon > 0$, the following inequalities hold:
$$\Pr[S_m - \mathbb{E}[S_m] \geq \epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2}$$
$$\Pr[S_m - \mathbb{E}[S_m] \leq -\epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2}.$$
Proof: the proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality,
$$\Pr[X \geq \epsilon] = \Pr[e^{tX} \geq e^{t\epsilon}] \leq \frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}},$$
and select $t$ to minimize the resulting bound.

SLIDE 27 (Hoeffding's Theorem, continued)

Using this scheme and the independence of the random variables gives
$$\Pr[S_m - \mathbb{E}[S_m] \geq \epsilon] \leq e^{-t\epsilon}\,\mathbb{E}\big[e^{t(S_m - \mathbb{E}[S_m])}\big] = e^{-t\epsilon} \prod_{i=1}^{m} \mathbb{E}\big[e^{t(X_i - \mathbb{E}[X_i])}\big]$$
$$\leq e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2(b_i - a_i)^2/8} \qquad \text{(lemma applied to $X_i - \mathbb{E}[X_i]$)}$$
$$= e^{-t\epsilon}\,e^{t^2 \sum_{i=1}^{m}(b_i - a_i)^2/8} \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2},$$
choosing $t = 4\epsilon / \sum_{i=1}^{m}(b_i - a_i)^2$.
The second inequality is proved in a similar way.
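As a check of this choice of $t$: the exponent is $g(t) = -t\epsilon + t^2 C/8$ with $C = \sum_{i=1}^{m}(b_i - a_i)^2$; setting $g'(t) = -\epsilon + tC/4 = 0$ gives $t = 4\epsilon/C$, and then
$$g(4\epsilon/C) = -\frac{4\epsilon^2}{C} + \frac{16\epsilon^2}{C^2}\cdot\frac{C}{8} = -\frac{2\epsilon^2}{C},$$
which is exactly the exponent in the theorem.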

SLIDE 28: Hoeffding's Inequality

Corollary: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$$\Pr[\hat{R}(h) - R(h) \geq \epsilon] \leq e^{-2m\epsilon^2}$$
$$\Pr[\hat{R}(h) - R(h) \leq -\epsilon] \leq e^{-2m\epsilon^2}.$$
Proof: follows directly from Hoeffding's theorem. Combining these one-sided inequalities yields
$$\Pr\big[|\hat{R}(h) - R(h)| \geq \epsilon\big] \leq 2e^{-2m\epsilon^2}.$$

SLIDE 29: Chernoff's Inequality

Theorem: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$$\Pr[\hat{R}(h) \geq (1 + \epsilon)R(h)] \leq e^{-m\,R(h)\,\epsilon^2/3}$$
$$\Pr[\hat{R}(h) \leq (1 - \epsilon)R(h)] \leq e^{-m\,R(h)\,\epsilon^2/2}.$$
Proof: based on Chernoff's bounding technique.

SLIDE 30: McDiarmid's Inequality (McDiarmid, 1989)

Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \leq c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \leq 2\exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$

SLIDE 31

Comments:

  • Proof: uses Hoeffding's lemma.
  • Hoeffding's inequality is a special case of McDiarmid's with
$$f(x_1, \ldots, x_m) = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{and} \quad c_i = \frac{|b_i - a_i|}{m}.$$
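Indeed, plugging this $f$ and these $c_i$ into McDiarmid's bound gives
$$2\exp\Big(\!-\frac{2\epsilon^2}{\sum_{i=1}^{m}(b_i - a_i)^2/m^2}\Big) = 2\exp\Big(\!-\frac{2m^2\epsilon^2}{\sum_{i=1}^{m}(b_i - a_i)^2}\Big),$$
which, when all $[a_i, b_i] = [a, b]$, is $2\exp(-2m\epsilon^2/(b - a)^2)$: the two-sided version of Hoeffding's inequality for the mean.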

SLIDE 32: Jensen's Inequality

Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)].$$
Proof: uses the definition of convexity, continuity of convex functions, and density of finite distributions.