Introduction to Machine Learning CMU-10701: Stochastic Convergence - PowerPoint Presentation



SLIDE 1

Introduction to Machine Learning CMU-10701

Stochastic Convergence

Barnabás Póczos

SLIDE 2

Motivation

SLIDE 3

What have we seen so far?

Several algorithms that seem to work fine on training datasets:

  • Linear regression
  • Naïve Bayes classifier
  • Perceptron classifier
  • Support Vector Machines for regression and classification

How good are these algorithms on unknown test sets? How many training samples do we need to achieve small error? What is the smallest possible error we can achieve?

To answer these questions, we will need a few powerful tools ⇒ Learning Theory.

SLIDE 4

Basic Estimation Theory

SLIDE 5

Tossing a Die: Estimation of the Parameters θ1, θ2, …, θ6

[Figure: observed counts for the faces, e.g. 24, 120, 60, 12]

Does the MLE converge to the right value? How fast does it converge?

SLIDE 6

Tossing a Die: Calculating the Empirical Average

Does the empirical average converge to the true mean? How fast does it converge?
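As a quick numerical check (my own sketch, not from the slides), we can simulate fair die tosses and watch the empirical average approach the true mean of 3.5:

```python
import random

random.seed(0)

def empirical_average(n_rolls):
    """Empirical average of n_rolls tosses of a fair six-sided die."""
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# The true mean is (1 + 2 + ... + 6) / 6 = 3.5; the empirical average
# drifts toward it as the number of tosses grows.
for n in (10, 1000, 100000):
    print(n, empirical_average(n))
```

Re-running with different seeds produces different sample traces, like the five shown on the next slide.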

SLIDE 7

Tossing a Die: Calculating the Empirical Average

[Figure: 5 sample traces of the empirical average]

How fast do they converge?

SLIDE 8

Key Questions

I want to know the coin parameter p ∈ [0, 1] within ε = 0.1 error, with probability at least 1 − δ = 0.95.

How many flips do I need?

  • Do empirical averages converge?
  • Does the MLE converge in the die-tossing problem?
  • What do we mean by convergence?
  • What is the rate of convergence?

Applications:

  • drug testing (Does this drug modify the average blood pressure?)
  • user interface design (We will see this later.)
SLIDE 9

Outline

Theory:

  • Stochastic convergence:

    – Convergence in distribution (weak convergence)
    – Convergence in probability
    – Almost sure (strong) convergence

  • Limit theorems:

    – Law of large numbers
    – Central limit theorem

  • Tail bounds:

    – Markov, Chebyshev, Chernoff, Hoeffding, Bernstein, and McDiarmid inequalities

Application:

  • A/B testing for page layout

SLIDE 10

Stochastic convergence: definitions and properties

SLIDE 11

Convergence of vectors

SLIDE 12

Convergence in Distribution = Weak Convergence = Convergence in Law

Let {Z, Z1, Z2, …} be a sequence of random variables, and let Fn and F be the distribution functions of Zn and Z.

Notation: Zn →d Z

Definition: Zn →d Z if lim_{n→∞} Fn(x) = F(x) at every point x where F is continuous.

This is the “weakest” notion of convergence.

SLIDE 13

Convergence in Distribution = Weak Convergence = Convergence in Law

Only the distribution functions converge! (NOT the values of the random variables)

[Figure: distribution functions Fn approaching the limit F]

SLIDE 14

Convergence in Distribution = Weak Convergence = Convergence in Law

Continuity is important!

Example: let Zn be a continuous random variable supported on [0, 1/n] (e.g. uniform, as the slide's sketch suggests). The limit random variable Z is the constant 0, so F(x) = 1 for x ≥ 0 and F(x) = 0 for x < 0. Then Fn(0) = 0 does not converge to F(0) = 1, but x = 0 is a discontinuity point of F, so Zn →d Z still holds.

In this example the limit Z is discrete and not random (the constant 0), although each Zn is a continuous random variable.
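A numeric sketch of this example (assuming Zn uniform on [0, 1/n], as the slide's figure suggests): the CDFs converge at every continuity point of the limit, but not at the discontinuity x = 0.

```python
def cdf_uniform(n, x):
    """CDF of Z_n ~ Uniform(0, 1/n)."""
    if x < 0:
        return 0.0
    if x > 1.0 / n:
        return 1.0
    return n * x

# The limit Z is the constant 0, with CDF F(x) = 1 for x >= 0, else 0.
# At the continuity point x = 0.01, F_n(0.01) -> 1 = F(0.01):
print([cdf_uniform(n, 0.01) for n in (10, 100, 1000)])
# At the discontinuity x = 0, F_n(0) = 0 for every n while F(0) = 1,
# but that does not spoil convergence in distribution:
print([cdf_uniform(n, 0.0) for n in (10, 100, 1000)])
```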

SLIDE 15

Convergence in Distribution = Weak Convergence = Convergence in Law

Properties

Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution.

Example: the Central Limit Theorem.

Note: Zn and Z can still be independent even if their distributions are the same!

SLIDE 16

Convergence in Probability

Notation: Zn →p Z

Definition: Zn →p Z if for every ε > 0, lim_{n→∞} P(|Zn − Z| > ε) = 0.

This indeed measures how far the values Zn(ω) and Z(ω) are from each other.
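A Monte Carlo sketch (assumptions mine: Zn is the mean of n Uniform(0,1) draws and the limit Z is the constant 0.5) shows the deviation probability shrinking with n:

```python
import random

random.seed(1)

def prob_deviation(n, eps, trials=2000):
    """Monte Carlo estimate of P(|Z_n - 0.5| > eps), where Z_n is the
    mean of n Uniform(0, 1) draws and 0.5 is the true mean."""
    count = 0
    for _ in range(trials):
        zn = sum(random.random() for _ in range(n)) / n
        if abs(zn - 0.5) > eps:
            count += 1
    return count / trials

# P(|Z_n - 0.5| > 0.05) shrinks toward 0 as n grows:
print([prob_deviation(n, 0.05) for n in (10, 100, 1000)])
```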

SLIDE 17

Almost Sure (Strong) Convergence

Notation: Zn →a.s. Z

Definition: Zn →a.s. Z if P(ω : lim_{n→∞} Zn(ω) = Z(ω)) = 1.

SLIDE 18

Convergence in p-th Mean (Lp norm)

Notation: Zn →Lp Z

Definition: Zn →Lp Z if lim_{n→∞} E[|Zn − Z|^p] = 0.

Properties: convergence in Lp implies convergence in probability (via Markov's inequality), and for q > p ≥ 1, convergence in Lq implies convergence in Lp.

SLIDE 19

Counterexamples

SLIDE 20

Further Readings on Stochastic Convergence

  • http://en.wikipedia.org/wiki/Convergence_of_random_variables
  • Patrick Billingsley: Probability and Measure
  • Patrick Billingsley: Convergence of Probability Measures

SLIDE 21

Finite Sample Tail Bounds

Useful tools!

SLIDE 22

Markov's inequality

If X is any nonnegative random variable and a > 0, then

P(X ≥ a) ≤ E[X] / a.

Proof idea: decompose the expectation: E[X] ≥ E[X · 1{X ≥ a}] ≥ a · P(X ≥ a).

Corollary: Chebyshev's inequality.
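A simulation sketch (my own, with X ~ Exponential(1), so E[X] = 1): the empirical tail probability always sits below the Markov bound E[X]/a.

```python
import random

random.seed(2)

# Markov: for nonnegative X and a > 0, P(X >= a) <= E[X] / a.
samples = [random.expovariate(1.0) for _ in range(100000)]
mean = sum(samples) / len(samples)
for a in (1.0, 2.0, 4.0):
    tail = sum(x >= a for x in samples) / len(samples)
    print(a, tail, mean / a)  # empirical tail vs. Markov bound
```

The bound even holds exactly for the empirical distribution, since the indicator 1{x ≥ a} is at most x/a whenever x ≥ 0.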

SLIDE 23

Chebyshev's inequality

If X is any random variable with finite variance and a > 0, then

P(|X − E[X]| ≥ a) ≤ Var(X) / a².

Proof: apply Markov's inequality to the nonnegative random variable (X − E[X])². Here Var(X) is the variance of X, defined as Var(X) = E[(X − E[X])²].

SLIDE 24

Generalizations of Chebyshev's inequality

Chebyshev: P(|X − E[X]| ≥ a) ≤ Var(X) / a².

There is an asymmetric two-sided case (for X with an asymmetric distribution) and a symmetric two-sided case (for X with a symmetric distribution), and there are lots of other generalizations, for example to multivariate X.

SLIDE 25

Higher moments?

Markov: P(X ≥ a) ≤ E[X] / a for nonnegative X.
Chebyshev: P(|X − E[X]| ≥ a) ≤ Var(X) / a².
Higher moments: P(|X − E[X]| ≥ a) ≤ E[|X − E[X]|^n] / a^n, where n ≥ 1.

Other functions instead of polynomials? The exp function: for any t > 0,

P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ e^{−ta} E[e^{tX}]   (Markov's inequality).

SLIDE 26

Law of Large Numbers

SLIDE 27

Do empirical averages converge?

Chebyshev's inequality is good enough to study the question: do the empirical averages converge to the true mean?

Answer: Yes, they do. (Law of large numbers)

SLIDE 28

Law of Large Numbers

Let X1, X2, … be i.i.d. random variables with mean μ, and let Z̄n = (1/n) Σ_{i=1}^n Xi be the empirical average.

Weak Law of Large Numbers: Z̄n →p μ.

Strong Law of Large Numbers: Z̄n →a.s. μ.

SLIDE 29

Weak Law of Large Numbers

Proof I: Assume finite variance σ². (Not a crucial assumption.) By Chebyshev's inequality, for any ε > 0,

P(|Z̄n − μ| ≥ ε) ≤ σ² / (n ε²).

Therefore P(|Z̄n − μ| < ε) ≥ 1 − σ²/(n ε²), and as n approaches infinity, this expression approaches 1.

SLIDE 30

Fourier Transform and Characteristic Function

SLIDE 31

Fourier Transform

Fourier transform: F(ω) = ∫ f(x) e^{−iωx} dx
Inverse Fourier transform: f(x) = (1/2π) ∫ F(ω) e^{iωx} dω

Other conventions differ in where to put the 2π. The convention above is not preferred: it is not a unitary transform and doesn't preserve the inner product. The symmetric conventions (a 1/√(2π) factor on both transforms, or a 2π in the exponent) are unitary transforms.

SLIDE 32

Fourier Transform

The inverse is really an inverse: applying the Fourier transform and then the inverse transform recovers f.

Properties: … and lots of other important ones.

The Fourier transform will be used to define the characteristic function and to represent distributions in an alternative way.

SLIDE 33

Characteristic function

How can we describe a random variable?

  • cumulative distribution function (cdf)
  • probability density function (pdf)

The characteristic function provides an alternative way of describing a random variable.

Definition: the Fourier transform of the density: φ_X(t) = E[e^{itX}].

SLIDE 34

Characteristic function

Properties:

  • It always exists: for example, the Cauchy distribution doesn't have a mean but still has a characteristic function.
  • It is continuous on the entire space, even if X is not continuous.
  • It is bounded, even if X is not bounded.
  • Lévy's continuity theorem links convergence of characteristic functions to convergence in distribution.
  • The characteristic function of a constant a is φ(t) = e^{ita}.
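A quick empirical check of the definition (a sketch under my own setup: X ~ N(0, 1), whose characteristic function is exp(−t²/2)):

```python
import cmath
import math
import random

random.seed(3)

# Characteristic function: phi_X(t) = E[exp(i t X)].
# Compare the empirical average of exp(i t X) over simulated draws
# with the closed form exp(-t^2 / 2) for X ~ N(0, 1).
xs = [random.gauss(0.0, 1.0) for _ in range(100000)]

def empirical_cf(t):
    return sum(cmath.exp(1j * t * x) for x in xs) / len(xs)

for t in (0.5, 1.0, 2.0):
    print(t, empirical_cf(t), math.exp(-t * t / 2))
```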

SLIDE 35

Weak Law of Large Numbers

Proof II: Apply Taylor's theorem for complex functions to the characteristic function of Z̄n, and use the properties of characteristic functions. Lévy's continuity theorem ⇒ the limit is the constant distribution concentrated at the mean μ.

SLIDE 36

“Convergence rate” for the LLN

Markov: doesn't give a rate.
Chebyshev: with probability at least 1 − δ, |Z̄n − μ| ≤ σ / √(nδ).

Can we get a smaller, logarithmic dependence on δ?
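Rearranging Chebyshev's bound σ²/(nε²) ≤ δ gives the sample size needed for ε accuracy with confidence 1 − δ; a sketch (the helper name is mine):

```python
import math

def chebyshev_sample_size(sigma2, eps, delta):
    """Smallest n with sigma^2 / (n * eps^2) <= delta: Chebyshev then
    guarantees |empirical mean - mu| < eps with prob. >= 1 - delta."""
    n = sigma2 / (eps ** 2 * delta)
    return math.ceil(n - 1e-9)  # tiny slack absorbs floating-point error

# Slide 8's coin: worst-case variance 0.25, eps = 0.1, delta = 0.05.
print(chebyshev_sample_size(0.25, 0.1, 0.05))  # -> 500 flips
```

The 1/δ factor is the weakness the slide points at: Hoeffding's inequality, coming up, replaces it with log(2/δ).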

SLIDE 37

Further Readings on LLN, Characteristic Functions, etc.

  • http://en.wikipedia.org/wiki/Levy_continuity_theorem
  • http://en.wikipedia.org/wiki/Law_of_large_numbers
  • http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)
  • http://en.wikipedia.org/wiki/Fourier_transform

SLIDE 38

More Tail Bounds

More useful tools!

SLIDE 39

Hoeffding’s inequality (1963)

Let X1, …, Xn be independent random variables with Xi ∈ [ai, bi]. Then for any ε > 0,

P(|(1/n) Σ Xi − E[(1/n) Σ Xi]| ≥ ε) ≤ 2 exp(−2n²ε² / Σ (bi − ai)²).

It only involves the range of the variables, not their variances.

SLIDE 40

“Convergence rate” for the LLN from Hoeffding

Hoeffding, with range c for each variable: with probability at least 1 − δ,

|Z̄n − μ| ≤ c √(log(2/δ) / (2n)).
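Solving 2 exp(−2nε²/c²) ≤ δ for n gives the Hoeffding sample size; a sketch (the helper name is mine), again for slide 8's coin (c = 1, ε = 0.1, δ = 0.05):

```python
import math

def hoeffding_sample_size(c, eps, delta):
    """Smallest n with 2 * exp(-2 n eps^2 / c^2) <= delta, for variables
    with range c: |empirical mean - mu| < eps with prob. >= 1 - delta."""
    n = c ** 2 * math.log(2.0 / delta) / (2 * eps ** 2)
    return math.ceil(n - 1e-9)  # tiny slack absorbs floating-point error

print(hoeffding_sample_size(1.0, 0.1, 0.05))  # -> 185 flips
```

185 flips, versus the 500 that Chebyshev required: the dependence on δ is now logarithmic.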

SLIDE 41

Proof of Hoeffding’s Inequality

A few minutes of calculations.

SLIDE 42

Bernstein’s inequality (1946)

It also involves the variances, and can give tighter bounds than Hoeffding.

SLIDE 43

Bennett’s inequality (1962)

Bennett’s inequality ⇒ Bernstein’s inequality.

SLIDE 44

McDiarmid’s Bounded Difference Inequality

If a function of independent random variables changes by a bounded amount when any single argument is changed, the function concentrates around its mean; Hoeffding's inequality follows as a special case.

SLIDE 45

Further Readings on Tail Bounds

  • http://en.wikipedia.org/wiki/Markov%27s_inequality
  • http://en.wikipedia.org/wiki/Chebyshev%27s_inequality
  • http://en.wikipedia.org/wiki/Hoeffding's_inequality
  • http://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)
  • http://en.wikipedia.org/wiki/Bennett%27s_inequality
  • http://en.wikipedia.org/wiki/Doob_martingale (McDiarmid)

SLIDE 46

Limit Distribution?

SLIDE 47

Central Limit Theorem

Lindeberg-Lévy CLT: for i.i.d. Xi with mean μ and variance σ² < ∞, √n (Z̄n − μ) →d N(0, σ²).

Lyapunov CLT: the variables need not be identically distributed, provided some other conditions hold.

Generalizations: multidimensional versions, time processes.

SLIDE 48

Central Limit Theorem in Practice

[Figure: histograms of unscaled and scaled sums; the scaled sums approach a Gaussian]
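A simulation sketch (setup mine: sums of Uniform(0,1) draws, so μ = 0.5 and σ² = 1/12) showing the scaled sums matching the standard normal CDF:

```python
import math
import random

random.seed(4)

def standardized_sum(n):
    """(sum of n Uniform(0,1) draws - n*mu) / (sigma * sqrt(n)),
    with mu = 0.5 and sigma^2 = 1/12."""
    s = sum(random.random() for _ in range(n))
    return (s - n * 0.5) / math.sqrt(n / 12.0)

# The fraction of standardized sums below 0 should approach Phi(0) = 0.5,
# and the fraction below 1 should approach Phi(1) ~ 0.8413.
vals = [standardized_sum(50) for _ in range(20000)]
print(sum(v <= 0 for v in vals) / len(vals))
print(sum(v <= 1 for v in vals) / len(vals))
```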

SLIDE 49

Proof of CLT

Take the characteristic function of the standardized sum, expand it with the Taylor series around 0, and use the properties of characteristic functions. The limit is the characteristic function of the Gauss distribution, so Lévy's continuity theorem + uniqueness ⇒ CLT.
SLIDE 50

Berry-Esseen Theorem

How fast do we converge to the Gauss distribution? The CLT itself doesn't tell us anything about the convergence rate; the Berry-Esseen theorem does. It was independently discovered by A. C. Berry (in 1941) and C.-G. Esseen (1942).

SLIDE 51

Did we answer the questions we asked?

  • Do empirical averages converge?
  • What do we mean by convergence?
  • What is the rate of convergence?
  • What is the limit distribution of “standardized” averages?

Next time we will continue with these questions:

  • How good are the ML algorithms on unknown test sets?
  • How many training samples do we need to achieve small error?
  • What is the smallest possible error we can achieve?

SLIDE 52

Further Readings on CLT

  • http://en.wikipedia.org/wiki/Central_limit_theorem
  • http://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm

SLIDE 53

Tail bounds in practice

SLIDE 54

A/B testing

  • Two possible webpage layouts
  • Which layout is better?

Experiment:

  • Some users see design A
  • The others see design B

How many trials do we need to decide which page attracts more clicks?

SLIDE 55

A/B testing

Assume that in group A, p(click|A) = 0.10 and p(noclick|A) = 0.90.
Assume that in group B, p(click|B) = 0.11 and p(noclick|B) = 0.89.
Assume also that we know these probabilities in group A, but we don’t know them yet in group B.

Let us simplify the question a bit: we want to estimate p(click|B) with less than 0.01 error.

SLIDE 56

Chebyshev Inequality

  • In group B the click probability is p(click|B) = 0.11 (we don’t know this yet)
  • We want an error below ε = 0.01, with failure probability δ = 5%

Chebyshev: P(|estimate − p| ≥ ε) ≤ σ²/(nε²) ≤ δ, so n ≥ σ²/(ε²δ) users suffice.

  • If we have no prior knowledge, we can only bound the variance by σ² = 0.25 (the largest possible variance of a [0, 1]-valued random variable, attained by a fair coin).
  • If we know that the click probability is < 0.15, then we can bound σ² by 0.15 · 0.85 = 0.1275. This requires at most 25,500 users.

SLIDE 57

Hoeffding’s bound

  • The random variable has bounded range [0, 1] (click or no click), hence c = 1
  • Solve Hoeffding’s inequality 2 exp(−2nε²/c²) ≤ δ for n

This is better than Chebyshev.
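The sample sizes from the last two slides can be reproduced numerically (helper names are mine; the ceil carries a tiny slack for floating-point error):

```python
import math

def chebyshev_n(sigma2, eps, delta):
    """Users needed so that sigma^2 / (n * eps^2) <= delta."""
    return math.ceil(sigma2 / (eps ** 2 * delta) - 1e-9)

def hoeffding_n(c, eps, delta):
    """Users needed so that 2 * exp(-2 n eps^2 / c^2) <= delta."""
    return math.ceil(c ** 2 * math.log(2.0 / delta) / (2 * eps ** 2) - 1e-9)

eps, delta = 0.01, 0.05
print(chebyshev_n(0.25, eps, delta))         # no prior knowledge: 50,000 users
print(chebyshev_n(0.15 * 0.85, eps, delta))  # click prob. < 0.15: 25,500 users
print(hoeffding_n(1.0, eps, delta))          # Hoeffding, range c = 1: 18,445 users
```

Hoeffding needs roughly 18,500 users, beating even the variance-informed Chebyshev bound of 25,500.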
SLIDE 58

Thanks for your attention!