Introduction to Machine Learning CMU-10701 Stochastic Convergence - - PowerPoint PPT Presentation



SLIDE 1

Introduction to Machine Learning CMU-10701

Stochastic Convergence and Tail Bounds

Barnabás Póczos

SLIDE 2

Basic Estimation Theory

SLIDE 3

Rolling a Dice: Estimation of the Parameters θ1, θ2, …, θ6

[Figure: estimated parameters after 12, 24, 60, and 120 rolls]

Does the MLE estimate converge to the right value? How fast does it converge?

SLIDE 4

Rolling a Dice: Calculating the Empirical Average

Does the empirical average converge to the true mean? How fast does it converge?

SLIDE 5

Rolling a Dice: Calculating the Empirical Average

[Figure: 5 sample traces of the empirical average]

How fast do they converge?
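The sample traces on this slide come from a figure; a minimal simulation sketch (not from the deck, function name mine) reproduces the idea: several independent runs of the empirical average of fair-die rolls all drift toward the true mean 3.5.

```python
import random

def running_averages(n_rolls, n_traces=5, seed=0):
    """Simulate n_traces independent sequences of fair-die rolls and
    return the empirical average of each trace after n_rolls rolls."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_traces):
        total = sum(rng.randint(1, 6) for _ in range(n_rolls))
        averages.append(total / n_rolls)
    return averages

# Short runs scatter widely; long runs cluster near the true mean 3.5.
print(running_averages(10))
print(running_averages(100000))
```

How fast the traces tighten around 3.5 is exactly the rate-of-convergence question the tail bounds later in the lecture answer.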

SLIDE 6

Key Questions

  • Do empirical averages converge?
  • Does the MLE converge in the dice rolling problem?
  • What do we mean by convergence?
  • What is the rate of convergence?

I want to know the coin parameter θ ∈ [0, 1] within ε = 0.1 error, with probability at least 1 − δ = 0.95.

How many flips do I need?
SLIDE 7

Outline

Theory:

  • Stochastic Convergences:
    – Weak convergence = Convergence in distribution
    – Convergence in probability
    – Strong (almost surely)
    – Convergence in Lp norm
  • Limit theorems:
    – Law of large numbers
    – Central limit theorem
  • Tail bounds:
    – Markov, Chebyshev

SLIDE 8

Stochastic convergence definitions and properties

SLIDE 9

Convergence of vectors

SLIDE 10

Convergence in Distribution = Convergence Weakly = Convergence in Law

Notation:

Let {Z, Z1, Z2, …} be a sequence of random variables.

Definition: This is the “weakest” convergence.
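The notation and definition on this slide are equation images in the original; the standard statement they refer to is:

```latex
% Z_n converges to Z in distribution (weakly, in law), written Z_n \xrightarrow{d} Z, iff
F_{Z_n}(t) = P(Z_n \le t) \;\xrightarrow[n\to\infty]{}\; F_Z(t)
\quad \text{at every point } t \text{ where } F_Z \text{ is continuous.}
```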

SLIDE 11

Convergence in Distribution = Convergence Weakly = Convergence in Law

Only the distribution functions converge! (NOT the values of the random variables)

[Figure: a sequence of distribution functions converging at a point a]

SLIDE 12

Convergence in Distribution = Convergence Weakly = Convergence in Law

Continuity is important!

Example: [Figure: the density of Zn, supported on [0, 1/n]]

Proof: The limit random variable is the constant 0.

In this example the limit Z is discrete, not random (the constant 0), although Zn is a continuous random variable.
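The worked example is an image in the original. A standard reconstruction consistent with the surrounding text (Zn continuous, limit the constant 0) is Zn uniform on [0, 1/n]; treat the exact distribution as an assumption read off the figure residue:

```latex
% Assume Z_n \sim \mathrm{Uniform}(0, 1/n) and Z \equiv 0. Then
F_{Z_n}(t) \to \begin{cases} 0 & t < 0, \\ 1 & t > 0, \end{cases}
\qquad \text{while } F_{Z_n}(0) = 0 \not\to 1 = F_Z(0).
% Z_n \xrightarrow{d} 0 still holds: t = 0 is not a continuity
% point of F_Z, so it is excluded from the definition.
```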

SLIDE 13

Convergence in Distribution = Convergence Weakly = Convergence in Law

Properties

Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution

Example: (Central Limit Theorem)

Zn and Z can still be independent even if their distributions are the same!

SLIDE 14

Convergence in Probability

Notation: Definition:

This indeed measures how far the values Zn(ω) and Z(ω) are from each other.
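The definition is an equation image in the original; the standard statement is:

```latex
% Z_n converges to Z in probability, written Z_n \xrightarrow{P} Z, iff for every \epsilon > 0
P\bigl(|Z_n - Z| > \epsilon\bigr) \;\xrightarrow[n\to\infty]{}\; 0.
```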

SLIDE 15

Almost Sure Convergence

Notation: Definition:
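The definition is an equation image in the original; the standard statement is:

```latex
% Z_n converges to Z almost surely, written Z_n \xrightarrow{a.s.} Z, iff
P\Bigl(\bigl\{\omega : \lim_{n\to\infty} Z_n(\omega) = Z(\omega)\bigr\}\Bigr) = 1.
```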

SLIDE 16

Convergence in p-th Mean, Lp Norm

Notation: Definition: Properties:
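The definition and properties are equation images in the original; the standard statements are:

```latex
% Z_n converges to Z in the p-th mean (L^p norm), p \ge 1, iff
\mathbb{E}\bigl[\,|Z_n - Z|^p\,\bigr] \;\xrightarrow[n\to\infty]{}\; 0.
% Standard properties: for q \ge p \ge 1,
%   L^q convergence \Rightarrow L^p convergence \Rightarrow convergence in probability.
```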

SLIDE 17

Counter Examples

SLIDE 18

Further Readings on Stochastic Convergence

  • http://en.wikipedia.org/wiki/Convergence_of_random_variables
  • Patrick Billingsley: Probability and Measure
  • Patrick Billingsley: Convergence of Probability Measures

SLIDE 19

Finite sample tail bounds

Useful tools!

SLIDE 20

Gauss-Markov Inequality

If X is a nonnegative random variable and a > 0, then

Proof: Decompose the expectation.

Corollary: Chebyshev's inequality
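The inequality and its proof are equation images in the original; the standard statement and the expectation-decomposition argument are:

```latex
% Markov's inequality: X \ge 0, a > 0.
P(X \ge a) \le \frac{\mathbb{E}[X]}{a}.
% Proof by decomposing the expectation:
\mathbb{E}[X] \;\ge\; \mathbb{E}\bigl[X\,\mathbf{1}_{\{X \ge a\}}\bigr]
\;\ge\; a\,\mathbb{E}\bigl[\mathbf{1}_{\{X \ge a\}}\bigr] \;=\; a\,P(X \ge a).
```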

SLIDE 21

Chebyshev's Inequality

If X is a random variable with finite variance and a > 0, then

Proof: Here Var(X) is the variance of X, defined as:
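The inequality itself is an equation image in the original; the standard statement, obtained by applying Markov's inequality to the nonnegative variable (X − EX)², is:

```latex
P\bigl(|X - \mathbb{E}X| \ge a\bigr)
= P\bigl((X - \mathbb{E}X)^2 \ge a^2\bigr)
\le \frac{\mathrm{Var}(X)}{a^2},
\qquad \mathrm{Var}(X) = \mathbb{E}\bigl[(X - \mathbb{E}X)^2\bigr].
```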

SLIDE 22

Generalizations of Chebyshev's Inequality

Chebyshev:

Asymmetric two-sided case (X has an asymmetric distribution)

Symmetric two-sided case (X has a symmetric distribution)

This is equivalent to:

There are lots of other generalizations, for example multivariate X.

SLIDE 23

Higher Moments?

Chebyshev: Markov: Higher moments: where n ≥ 1

Other functions instead of polynomials? Exp function:

Proof: (Markov ineq.)
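The bounds on this slide are equation images in the original; the standard statements are:

```latex
% Higher moments (n \ge 1):
P\bigl(|X| \ge a\bigr) \le \frac{\mathbb{E}\,|X|^n}{a^n}.
% Exponential function (the Chernoff trick), for any s > 0:
P(X \ge a) = P\bigl(e^{sX} \ge e^{sa}\bigr)
\le e^{-sa}\,\mathbb{E}\bigl[e^{sX}\bigr]
\quad \text{(Markov ineq.)}
```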

SLIDE 24

Law of Large Numbers

SLIDE 25

Do empirical averages converge?

Chebyshev's inequality is good enough to study the question: Do the empirical averages converge to the true mean?

Answer: Yes, they do. (Law of large numbers)

SLIDE 26

Law of Large Numbers

Strong Law of Large Numbers: Weak Law of Large Numbers:
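The two statements are equation images in the original; in standard notation:

```latex
% X_1, X_2, \dots \ \text{i.i.d.\ with mean } \mu, \qquad
% \bar S_n = \tfrac{1}{n}\sum_{i=1}^{n} X_i.
% Strong law: \bar S_n \xrightarrow{a.s.} \mu
% Weak law:   \bar S_n \xrightarrow{P} \mu
```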

SLIDE 27

Weak Law of Large Numbers

Proof I: Assume finite variance. (Not very important)

Therefore, as n approaches infinity, this expression approaches 1.
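The displayed steps are equation images in the original; the Chebyshev argument they outline is:

```latex
% Chebyshev applied to \bar S_n, using Var(\bar S_n) = \sigma^2/n:
P\bigl(|\bar S_n - \mu| \ge \epsilon\bigr) \le \frac{\sigma^2}{n\epsilon^2}
\xrightarrow[n\to\infty]{} 0,
\qquad\text{so}\qquad
P\bigl(|\bar S_n - \mu| < \epsilon\bigr) \to 1.
```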

SLIDE 28

Fourier Transform and Characteristic Function

SLIDE 29

Fourier Transform

Fourier transform Inverse Fourier transform

Other conventions: Where to put the 2π?

Not preferred: not a unitary transform, doesn't preserve the inner product
unitary transform
unitary transform

SLIDE 30

Fourier Transform

Fourier transform Inverse Fourier transform

The inverse is really an inverse:

Properties:

and lots of other important ones…

The Fourier transform will be used to define the characteristic function and to represent distributions in an alternative way.

SLIDE 31

Characteristic Function

How can we describe a random variable?

  • cumulative distribution function (cdf)
  • probability density function (pdf)

The characteristic function provides an alternative way of describing a random variable.

Definition: The Fourier transform of the density
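The definition is an equation image in the original; the standard form (the Fourier transform of the density, up to the sign convention of the transform) is:

```latex
\varphi_X(t) \;=\; \mathbb{E}\bigl[e^{itX}\bigr]
\;=\; \int_{-\infty}^{\infty} e^{itx}\, f(x)\,dx.
```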

SLIDE 32

Characteristic Function

Properties:

Exists even when the mean does not: the Cauchy distribution has no mean but still has a characteristic function.
Continuous on the entire space, even if X is not continuous.
Bounded, even if X is not bounded.
Lévy's continuity theorem
Characteristic function of the constant a:

SLIDE 33

Weak Law of Large Numbers

Proof II:

The characteristic function
Taylor's theorem for complex functions
Properties of characteristic functions
Lévy's continuity theorem ⇒ the limit is a constant distribution with mean μ
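The computation itself is an equation image in the original; the standard characteristic-function argument it outlines is:

```latex
% Independence and the first-order Taylor expansion
% \varphi_X(s) = 1 + is\mu + o(s) give
\varphi_{\bar S_n}(t)
= \Bigl(\varphi_X\bigl(\tfrac{t}{n}\bigr)\Bigr)^{n}
= \Bigl(1 + \frac{it\mu}{n} + o\bigl(\tfrac{1}{n}\bigr)\Bigr)^{n}
\;\xrightarrow[n\to\infty]{}\; e^{it\mu},
% the characteristic function of the constant \mu;
% Lévy's continuity theorem then gives \bar S_n \xrightarrow{d} \mu.
```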

SLIDE 34

“Convergence rate” for LLN

Gauss-Markov: doesn't give a rate

Chebyshev: with probability 1 − δ

Can we get a smaller, logarithmic error in δ?

SLIDE 35

Further Readings on LLN, Characteristic Functions, etc.

  • http://en.wikipedia.org/wiki/Levy_continuity_theorem
  • http://en.wikipedia.org/wiki/Law_of_large_numbers
  • http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)
  • http://en.wikipedia.org/wiki/Fourier_transform

SLIDE 36

More tail bounds

More useful tools!

SLIDE 37

Hoeffding's Inequality (1963)

It only involves the range of the variables, not their variances.
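The inequality is an equation image in the original; the standard statement, together with the LLN rate the next slide derives from it, is:

```latex
% Hoeffding (1963): X_1,\dots,X_n independent, a_i \le X_i \le b_i.
P\Bigl(\bigl|\bar S_n - \mathbb{E}\bar S_n\bigr| \ge \epsilon\Bigr)
\;\le\; 2\exp\Bigl(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\Bigr).
% For a common range b_i - a_i = c this is 2\exp(-2n\epsilon^2/c^2),
% so with probability 1-\delta:
% |\bar S_n - \mathbb{E}\bar S_n| \le c\sqrt{\ln(2/\delta)/(2n)}.
```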

SLIDE 38

“Convergence rate” for LLN from Hoeffding

Hoeffding

SLIDE 39

Proof of Hoeffding's Inequality

A few minutes of calculations.

SLIDE 40

Bernstein's Inequality (1946)

It also involves the variances, and can give tighter bounds than Hoeffding's inequality.

SLIDE 41

Bennett's Inequality (1962)

Bennett's inequality ⇒ Bernstein's inequality.

Proof:

SLIDE 42

McDiarmid's Bounded Difference Inequality

It follows that

SLIDE 43

Further Readings on Tail Bounds

  • http://en.wikipedia.org/wiki/Hoeffding's_inequality
  • http://en.wikipedia.org/wiki/Doob_martingale (McDiarmid)
  • http://en.wikipedia.org/wiki/Bennett%27s_inequality
  • http://en.wikipedia.org/wiki/Markov%27s_inequality
  • http://en.wikipedia.org/wiki/Chebyshev%27s_inequality
  • http://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)

SLIDE 44

Limit Distribution?

SLIDE 45

Central Limit Theorem

Lindeberg-Lévy CLT:

Lyapunov CLT: + some other conditions

Generalizations: multiple dimensions, time processes
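The theorem statements are equation images in the original; the standard Lindeberg-Lévy form is:

```latex
% X_1, X_2, \dots i.i.d.\ with mean \mu and variance \sigma^2 < \infty:
\sqrt{n}\,\bigl(\bar S_n - \mu\bigr) \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2).
```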

SLIDE 46

Central Limit Theorem in Practice

[Figure: histograms of the unscaled and scaled sums]

SLIDE 47

Proof of CLT

The characteristic function
From the Taylor series around 0:
Properties of characteristic functions
Lévy's continuity theorem + uniqueness ⇒ CLT

φ converges to the characteristic function of the Gauss distribution
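The computation itself is an equation image in the original; the standard argument the outline refers to is:

```latex
% Standardize: Y_i = (X_i - \mu)/\sigma, \quad Z_n = \tfrac{1}{\sqrt n}\sum_{i=1}^n Y_i.
% Taylor series of \varphi_Y around 0 (\mathbb{E}Y = 0, \mathbb{E}Y^2 = 1):
\varphi_{Z_n}(t)
= \Bigl(\varphi_Y\bigl(\tfrac{t}{\sqrt n}\bigr)\Bigr)^{n}
= \Bigl(1 - \frac{t^2}{2n} + o\bigl(\tfrac{1}{n}\bigr)\Bigr)^{n}
\;\xrightarrow[n\to\infty]{}\; e^{-t^2/2},
% the characteristic function of \mathcal{N}(0,1);
% Lévy's continuity theorem + uniqueness \Rightarrow CLT.
```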
SLIDE 48

Berry-Esseen Theorem

How fast do we converge to the Gauss distribution?

CLT: It doesn't tell us anything about the convergence rate.

Independently discovered by A. C. Berry (in 1941) and C.-G. Esseen (1942)
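The theorem statement is an equation image in the original; in standard form:

```latex
% X_i i.i.d., \mathbb{E}X_i = \mu, \mathrm{Var}(X_i) = \sigma^2,
% \rho = \mathbb{E}|X_i - \mu|^3 < \infty. For some absolute constant C:
\sup_{x}\,\Bigl|\,
P\Bigl(\frac{\sqrt{n}\,(\bar S_n - \mu)}{\sigma} \le x\Bigr) - \Phi(x)
\Bigr| \;\le\; \frac{C\,\rho}{\sigma^{3}\sqrt{n}}.
```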

SLIDE 49

Did we answer the questions we asked?

  • Do empirical averages converge?
  • What do we mean by convergence?
  • What is the rate of convergence?
  • What is the limit distribution of “standardized” averages?

Next time we will continue with these questions:

  • How good are the ML algorithms on unknown test sets?
  • How many training samples do we need to achieve small error?
  • What is the smallest possible error we can achieve?

SLIDE 50

Further Readings on CLT

  • http://en.wikipedia.org/wiki/Central_limit_theorem
  • http://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm

SLIDE 51

Tail bounds in practice

SLIDE 52

A/B Testing

Experiment:

  • Two possible webpage layouts
  • Which layout is better?
  • Some users see design A
  • The others see design B

How many trials do we need to decide which page attracts more clicks?

SLIDE 53

A/B Testing

Assume that in group A, p(click|A) = 0.10 and p(noclick|A) = 0.90.
Assume that in group B, p(click|B) = 0.11 and p(noclick|B) = 0.89.
Assume also that we know these probabilities in group A, but we don't yet know them in group B.

Let us simplify the question a bit: we want to estimate p(click|B) with less than 0.01 error.

SLIDE 54

Chebyshev Inequality

  • In group B the click probability is θ = 0.11 (we don't know this yet)
  • We want an error below ε = 0.01 with failure probability δ = 5%

Chebyshev:

  • If we have no prior knowledge, we can only bound the variance by σ² = 0.25 (the uniform distribution has the largest variance, 0.25).
  • If we know that the click probability is < 0.15, then we can bound σ² by 0.15 × 0.85 = 0.1275. This requires at least 25,500 users.
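The sample-size computation behind the 25,500 figure follows from solving Chebyshev's bound Var/(nε²) ≤ δ for n. A minimal sketch (function name mine, numbers as on this slide):

```python
import math

def chebyshev_sample_size(var_bound, eps, delta):
    """Smallest n with var_bound/(n*eps^2) <= delta, i.e. the Chebyshev
    guarantee P(|empirical mean - true mean| >= eps) <= delta."""
    return math.ceil(var_bound / (eps ** 2 * delta))

eps, delta = 0.01, 0.05
print(chebyshev_sample_size(0.25, eps, delta))    # no prior knowledge: 50,000 users
print(chebyshev_sample_size(0.1275, eps, delta))  # click prob < 0.15 known: 25,500 users
```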

SLIDE 55

Hoeffding's Bound

  • The random variable has bounded range [0, 1] (click or no click), hence c = 1
  • Hoeffding: solve Hoeffding's inequality for n:

This is better than Chebyshev.
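Solving 2·exp(−2nε²/c²) ≤ δ for n gives n ≥ c²·ln(2/δ)/(2ε²). A minimal sketch of that computation (function name mine), with the same ε = 0.01 and δ = 0.05 as the Chebyshev slide:

```python
import math

def hoeffding_sample_size(c, eps, delta):
    """Smallest n with 2*exp(-2*n*eps^2/c^2) <= delta, i.e. the
    Hoeffding guarantee for variables with range c."""
    return math.ceil(c ** 2 * math.log(2 / delta) / (2 * eps ** 2))

# Range c = 1 (click / no click):
print(hoeffding_sample_size(1.0, 0.01, 0.05))  # about 18,445 users, beating Chebyshev's 25,500
```

The improvement comes from the logarithmic dependence on δ, exactly the question raised on the "Convergence rate for LLN" slide.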
SLIDE 56

What we have learned today

Theory:

  • Stochastic Convergences:
    – Weak convergence = Convergence in distribution
    – Convergence in probability
    – Strong (almost surely)
    – Convergence in Lp norm
  • Limit theorems:
    – Law of large numbers
    – Central limit theorem
  • Tail bounds:
    – Markov, Chebyshev
    – Hoeffding, Bennett, McDiarmid

SLIDE 57

Thanks for your attention!