SLIDE 1

Introduction to Machine Learning CMU-10701

  • 8. Stochastic Convergence

Barnabás Póczos

SLIDE 2

Motivation


SLIDE 3

What have we seen so far?


Several algorithms that seem to work fine on training datasets:

  • Linear regression
  • Naïve Bayes classifier
  • Perceptron classifier
  • Support Vector Machines for regression and classification

How good are these algorithms on unknown test sets? How many training samples do we need to achieve small error? What is the smallest possible error we can achieve?

⇒ Learning Theory

To answer these questions, we will need a few powerful tools

SLIDE 4

Basic Estimation Theory


SLIDE 5

Rolling a Die: Estimation of the Parameters θ1, θ2, …, θ6

[Figure: MLE example with observed face counts 24, 120, 60, 12]


Does the MLE converge to the right value? How fast does it converge?
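These questions can be probed numerically. A minimal sketch (not from the slides; the die parameters below are hypothetical): the MLE of each θi is simply the empirical frequency of face i, and the estimate tightens as the number of rolls grows.

```python
import random

random.seed(0)
theta = [0.10, 0.50, 0.25, 0.05, 0.05, 0.05]  # hypothetical true die parameters

def mle_estimate(n):
    """Roll the die n times; the MLE of theta_i is the empirical frequency n_i / n."""
    counts = [0] * 6
    for face in random.choices(range(6), weights=theta, k=n):
        counts[face] += 1
    return [c / n for c in counts]

for n in (100, 10_000, 100_000):
    est = mle_estimate(n)
    err = max(abs(e - t) for e, t in zip(est, theta))
    print(f"n={n:>6}  max|MLE - theta| = {err:.4f}")
```

The maximum error typically shrinks at roughly the 1/√n rate that the tail bounds later in this lecture quantify.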

SLIDE 6


Rolling a Die: Calculating the Empirical Average

Does the empirical average converge to the true mean? How fast does it converge?

SLIDE 7

Rolling a Die: Calculating the Empirical Average

5 sample traces

How fast do they converge?
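A quick simulation of such sample traces (an added illustration, not from the slides): each trace is the running empirical average of independent fair-die rolls, and every trace drifts toward the true mean 3.5.

```python
import random

random.seed(1)
TRUE_MEAN = 3.5  # fair six-sided die: (1 + 2 + ... + 6) / 6

def running_average_trace(n_rolls):
    """One sample trace: the empirical average after each of n_rolls rolls."""
    total, trace = 0, []
    for k in range(1, n_rolls + 1):
        total += random.randint(1, 6)
        trace.append(total / k)
    return trace

traces = [running_average_trace(10_000) for _ in range(5)]
for t in traces:
    # each trace after 10, 100, and 10,000 rolls
    print(round(t[9], 3), round(t[99], 3), round(t[9_999], 3))
```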

SLIDE 8

Key Questions

I want to know the coin parameter θ ∈ [0, 1] within ε = 0.1 error, with probability at least 1 − δ = 0.95.

How many flips do I need?


  • Do empirical averages converge?
  • Does the MLE converge in the dice rolling problem?
  • What do we mean by convergence?
  • What is the rate of convergence?

Applications:

  • drug testing (Does this drug modify the average blood pressure?)
  • user interface design (we will see later)
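The coin question can be answered with the tail bounds introduced later in this lecture. A sketch, using the worst-case Bernoulli variance θ(1 − θ) ≤ 1/4 for the Chebyshev-based answer; the Hoeffding bound (not covered in these slides) is included only for comparison.

```python
import math

def flips_needed_chebyshev(eps, delta):
    """Smallest n with P(|empirical mean - theta| >= eps) <= delta via Chebyshev,
    using the worst-case Bernoulli variance theta*(1-theta) <= 1/4."""
    return math.ceil(1.0 / (4 * eps**2 * delta))

def flips_needed_hoeffding(eps, delta):
    """Tighter answer from Hoeffding's inequality: 2*exp(-2*n*eps^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(flips_needed_chebyshev(0.1, 0.05))  # 500 flips
print(flips_needed_hoeffding(0.1, 0.05))  # 185 flips
```

With ε = 0.1 and δ = 0.05, Chebyshev requires 500 flips; the exponential bound cuts that to 185.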
SLIDE 9

Outline

Theory:

  • Stochastic Convergences:

– Weak convergence = Convergence in distribution
– Convergence in probability
– Strong (almost surely)
– Convergence in Lp norm


  • Limit theorems:

– Law of large numbers
– Central limit theorem

  • Tail bounds:

– Markov, Chebyshev

SLIDE 10

Stochastic convergence definitions and properties


SLIDE 11

Convergence of vectors


SLIDE 12

Convergence in Distribution = Convergence Weakly = Convergence in Law


Notation: Zn →d Z.

Let {Z, Z1, Z2, …} be a sequence of random variables with distribution functions F_Z, F_{Z1}, F_{Z2}, …

Definition: Zn converges to Z in distribution iff lim_{n→∞} F_{Zn}(t) = F_Z(t) at every point t where F_Z is continuous. This is the "weakest" convergence.
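A numerical illustration of this definition (an added sketch; the toy choice Zn ~ Uniform(0, 1/n), which converges in distribution to the constant 0, is an assumption): the empirical distribution function converges to F_Z at every continuity point t ≠ 0.

```python
import random

random.seed(2)

def empirical_cdf(samples, t):
    """Empirical distribution function: fraction of samples <= t."""
    return sum(s <= t for s in samples) / len(samples)

# Zn ~ Uniform(0, 1/n) converges in distribution to the constant Z = 0:
# F_Z(t) = 0 for t < 0 and F_Z(t) = 1 for t > 0 (discontinuous at t = 0).
for n in (1, 10, 1000):
    zn = [random.uniform(0, 1 / n) for _ in range(50_000)]
    # check convergence at the continuity points t = -0.1 and t = 0.1
    print(n, empirical_cdf(zn, -0.1), empirical_cdf(zn, 0.1))
```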

SLIDE 13

Convergence in Distribution = Convergence Weakly = Convergence in Law

Only the distribution functions converge! (NOT the values of the random variables)

[Figure: distribution functions converging to a step function that jumps from 0 to 1 at a]

SLIDE 14


Convergence in Distribution = Convergence Weakly = Convergence in Law

Continuity is important!

Example: let Zn be uniform on [0, 1/n]. The limit random variable is the constant 0: F_{Zn}(t) → F_Z(t) for every t ≠ 0, but F_{Zn}(0) = 0 does not converge to F_Z(0) = 1, so convergence fails exactly at the discontinuity point of F_Z. This is why the definition requires convergence only at continuity points.

In this example the limit Z is discrete, not random (constant 0), although each Zn is a continuous random variable.

SLIDE 15


Convergence in Distribution = Convergence Weakly = Convergence in Law

Properties

Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution.

Example: the Central Limit Theorem.

Zn and Z can still be independent even if their distributions are the same!

SLIDE 16

Convergence in Probability


Notation: Zn →P Z.

Definition: Zn converges to Z in probability iff for every ε > 0, lim_{n→∞} P(|Zn − Z| > ε) = 0.

This indeed measures how far the values of Zn(ω) and Z(ω) are from each other.
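A toy illustration of this definition (added here, not from the slides): for Zn = max of n Uniform(0,1) variables, Zn → 1 in probability, and the tail probability P(|Zn − 1| > ε) = (1 − ε)^n can be checked by Monte Carlo.

```python
import random

random.seed(3)
EPS = 0.1

def tail_prob(n, trials=20_000):
    """Monte Carlo estimate of P(|Zn - 1| > EPS) for Zn = max of n Uniform(0,1)s."""
    bad = 0
    for _ in range(trials):
        zn = max(random.random() for _ in range(n))
        bad += abs(zn - 1) > EPS
    return bad / trials

for n in (5, 20, 100):
    exact = (1 - EPS) ** n  # P(max < 1 - EPS) = (1 - EPS)^n
    print(n, round(tail_prob(n), 4), round(exact, 4))
```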

SLIDE 17

Almost Sure Convergence


Notation: Zn →a.s. Z.

Definition: Zn converges to Z almost surely iff P({ω : lim_{n→∞} Zn(ω) = Z(ω)}) = 1, i.e. the set of outcomes on which the values converge has probability one.

SLIDE 18

Convergence in p-th mean, Lp norm

Notation: Zn →Lp Z.

Definition: Zn converges to Z in the p-th mean (p ≥ 1) iff lim_{n→∞} E[|Zn − Z|^p] = 0.

Properties: convergence in Lp implies convergence in probability, and for q > p ≥ 1, convergence in Lq implies convergence in Lp.

SLIDE 19

Counter Examples


SLIDE 20

Further Readings on Stochastic convergence

  • http://en.wikipedia.org/wiki/Convergence_of_random_variables
  • Patrick Billingsley: Probability and Measure
  • Patrick Billingsley: Convergence of Probability Measures


SLIDE 21

Finite sample tail bounds

Useful tools!


SLIDE 22

Markov inequality

If X is any nonnegative random variable and a > 0, then

    P(X ≥ a) ≤ E[X] / a.

Proof: decompose the expectation:

    E[X] ≥ E[X · 1{X ≥ a}] ≥ a · P(X ≥ a).

Corollary: Chebyshev's inequality.

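A quick numerical sanity check of the Markov inequality (an added illustration; the Exponential(1) test distribution is an arbitrary choice).

```python
import random

random.seed(4)

# Test distribution: X ~ Exponential(1), nonnegative with E[X] = 1.
samples = [random.expovariate(1.0) for _ in range(200_000)]
mean = sum(samples) / len(samples)

for a in (1.0, 2.0, 4.0):
    tail = sum(x >= a for x in samples) / len(samples)  # empirical P(X >= a)
    bound = mean / a                                    # Markov: P(X >= a) <= E[X]/a
    print(a, round(tail, 4), "<=", round(bound, 4))
```

The bound always holds, but with a large margin: Markov is crude, which is what motivates the sharper bounds on the next slides.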

SLIDE 23

Chebyshev inequality

If X is any random variable with finite variance and a > 0, then

    P(|X − E[X]| ≥ a) ≤ Var(X) / a².

Proof: apply the Markov inequality to the nonnegative random variable (X − E[X])². Here Var(X) is the variance of X, defined as Var(X) = E[(X − E[X])²].


SLIDE 24

Generalizations of Chebyshev's inequality

Chebyshev:

    P(|X − E[X]| ≥ a) ≤ Var(X) / a².

– Asymmetric two-sided case (X has an asymmetric distribution)
– Symmetric two-sided case (X has a symmetric distribution)

There are lots of other generalizations, for example for multivariate X.


SLIDE 25

Higher moments?

Chebyshev:

    P(|X − E[X]| ≥ a) ≤ Var(X) / a².

Markov:

    P(|X| ≥ a) ≤ E[|X|] / a.

Higher moments:

    P(|X − E[X]| ≥ a) ≤ E[|X − E[X]|^n] / a^n,  where n ≥ 1.

Other functions instead of polynomials? The exp function: for any t > 0,

    P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}] / e^{ta}.

Proof: apply the Markov inequality to the nonnegative random variable e^{tX}.

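For light-tailed distributions the exponential bound is far tighter than the polynomial ones. A sketch for X ~ N(0,1) (an added illustration, not from the slides): here E[e^{tX}] = e^{t²/2}, so the bound is e^{t²/2 − ta}, minimized at t = a to give P(X ≥ a) ≤ e^{−a²/2}.

```python
import math
import random

random.seed(5)
a = 3.0

# Exponential bound for X ~ N(0,1): minimize exp(t**2/2 - t*a) over t > 0 at t = a.
exp_bound = math.exp(-a * a / 2)
cheb_bound = 1 / a**2  # Chebyshev with Var(X) = 1 (bounds even the two-sided tail)

samples = [random.gauss(0.0, 1.0) for _ in range(500_000)]
tail = sum(x >= a for x in samples) / len(samples)  # empirical P(X >= a)
print(round(tail, 5), "<=", round(exp_bound, 5), "<=", round(cheb_bound, 5))
```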

SLIDE 26

Law of Large Numbers


SLIDE 27

Do empirical averages converge?

Chebyshev's inequality is good enough to study the question: do the empirical averages converge to the true mean?

Answer: Yes, they do. (Law of large numbers)

SLIDE 28

Law of Large Numbers


Weak Law of Large Numbers: the empirical average converges to the true mean in probability.

Strong Law of Large Numbers: the empirical average converges to the true mean almost surely.

SLIDE 29

Weak Law of Large Numbers

Proof I:

Assume finite variance σ². (Not very important; it just simplifies the proof.) Let X̄n denote the empirical average of n i.i.d. samples with mean μ. By Chebyshev's inequality, for any ε > 0,

    P(|X̄n − μ| < ε) ≥ 1 − σ² / (n ε²).

As n approaches infinity, this expression approaches 1; therefore X̄n converges to μ in probability.
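The Chebyshev bound in this proof can be checked by simulation (an added sketch; the fair die with μ = 3.5 and σ² = 35/12 is an assumed example): the empirical frequency of landing within ε of the mean always sits above the bound 1 − σ²/(nε²).

```python
import random

random.seed(6)
MU, SIGMA2 = 3.5, 35 / 12  # mean and variance of a fair six-sided die roll
EPS = 0.25

def hit_rate(n, trials=2_000):
    """Fraction of runs whose empirical average lands within EPS of MU."""
    good = 0
    for _ in range(trials):
        avg = sum(random.randint(1, 6) for _ in range(n)) / n
        good += abs(avg - MU) < EPS
    return good / trials

for n in (64, 256, 1024):
    bound = max(0.0, 1 - SIGMA2 / (n * EPS * EPS))  # Chebyshev lower bound
    print(n, round(hit_rate(n), 3), ">=", round(bound, 3))
```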

SLIDE 30

What we have learned today


Theory:

  • Stochastic Convergences:

– Weak convergence = Convergence in distribution
– Convergence in probability
– Strong (almost surely)
– Convergence in Lp norm

  • Limit theorems:

– Law of large numbers
– Central limit theorem

  • Tail bounds:

– Markov, Chebyshev

SLIDE 31

Thanks for your attention!
