Introduction to Machine Learning CMU-10701
Stochastic Convergence and Tail Bounds
Barnabás Póczos
Basic Estimation Theory

Rolling a die: estimation of the parameters θ1, θ2, …, θ6
[Figure: MLE estimates of the die parameters θ1, …, θ6 after 12, 24, 60, and 120 rolls]
Does the MLE estimate converge to the right value? How fast does it converge?
Does the empirical average converge to the true mean? How fast does it converge?
[Figure: 5 sample traces of the empirical average as the number of samples grows]
How fast do they converge?
I want to know the coin parameter θ ∈ [0, 1] within ε = 0.1 error, with probability at least 1 − δ = 0.95.
How many flips do I need?
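This question can be explored with a quick Monte Carlo sketch (the fair-coin assumption p = 0.5 and the helper name `prob_within` are illustrative choices, not from the slides):

```python
import random

def prob_within(n_flips, p=0.5, eps=0.1, trials=2000, seed=0):
    """Estimate P(|p_hat - p| <= eps) over repeated coin-flip experiments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _ in range(n_flips))
        if abs(heads / n_flips - p) <= eps:
            hits += 1
    return hits / trials

# Success probability grows with the number of flips
for n in (10, 50, 100, 200):
    print(n, prob_within(n))
```

In this simulation, a couple hundred flips already keep the estimate within ε = 0.1 of p far more than 95% of the time; the tail bounds later in the deck make this quantitative.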
Theory:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm
– Law of large numbers
– Central limit theorem
– Markov, Chebyshev
Notation:
Let {Z, Z1, Z2, …} be a sequence of random variables.
Definition (convergence in distribution): Zn converges to Z in distribution if
lim_{n→∞} F_{Zn}(t) = F_Z(t)
at every point t where F_Z is continuous. This is the “weakest” convergence.
Only the distribution functions converge! (NOT the values of the random variables)
Continuity is important!
Example: Let Zn be a continuous random variable supported on [0, 1/n] (e.g., uniform on [0, 1/n]).
Proof: The limit random variable is the constant 0: for any fixed t > 0, F_{Zn}(t) = 1 once 1/n < t, and F_{Zn}(t) = 0 for t < 0.
In this example the limit Z is discrete, not random (constant 0), although each Zn is a continuous random variable. At the discontinuity point t = 0 we have F_{Zn}(0) = 0 for every n while F_Z(0) = 1, which is why convergence is only required at the continuity points of F_Z.
[Figure: the CDF of Zn rising from 0 to 1 on the interval [0, 1/n]]
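Taking Zn uniform on [0, 1/n] as a concrete instance (an illustrative choice), its CDF can be checked numerically:

```python
def cdf_zn(t, n):
    """CDF of Zn ~ Uniform(0, 1/n): rises linearly from 0 to 1 on [0, 1/n]."""
    return min(max(n * t, 0.0), 1.0)

# At any fixed t > 0 the CDFs tend to F_Z(t) = 1 (the CDF of the constant 0)...
for n in (1, 10, 100, 10000):
    print(n, cdf_zn(0.05, n))

# ...but at the discontinuity point t = 0 they stay at 0, not F_Z(0) = 1.
print(cdf_zn(0.0, 10000))
```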
Properties
Scheffé's theorem:
convergence of the probability density functions ⇒ convergence in distribution
Example:
(Central Limit Theorem)
Zn and Z can still be independent even if their distributions are the same!
Notation: Zn → Z in probability.
Definition: Zn converges to Z in probability if for every ε > 0,
lim_{n→∞} P(|Zn − Z| > ε) = 0.
This indeed measures how far the values of Zn(ω) and Z(ω) are from each other.
Notation: Zn → Z almost surely.
Definition (almost sure / strong convergence): P(lim_{n→∞} Zn = Z) = 1.
Definition (convergence in Lp norm): lim_{n→∞} E[|Zn − Z|^p] = 0.
Properties: almost sure convergence ⇒ convergence in probability ⇒ convergence in distribution; convergence in Lp norm ⇒ convergence in probability.
Useful tools!
Markov's inequality: If X is a nonnegative random variable and a > 0, then
P(X ≥ a) ≤ E[X] / a.
Proof: decompose the expectation:
E[X] ≥ E[X · 1{X ≥ a}] ≥ a · P(X ≥ a).
Corollary: Chebyshev's inequality
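A small numerical sanity check of Markov's inequality, using X ~ Exp(1) (an illustrative choice with E[X] = 1; the helper name is not from the slides):

```python
import random

def markov_check(a, trials=100000, seed=1):
    """Compare the tail P(X >= a) with the Markov bound E[X]/a for X ~ Exp(1)."""
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(trials)]
    tail = sum(x >= a for x in xs) / trials       # empirical P(X >= a)
    bound = (sum(xs) / trials) / a                # empirical E[X] / a
    return tail, bound

for a in (1.0, 2.0, 5.0):
    tail, bound = markov_check(a)
    print(a, round(tail, 4), round(bound, 4))
```

The true tail e^(−a) sits well below the 1/a bound, illustrating that Markov is valid but often loose.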
Chebyshev's inequality: If X is any random variable and a > 0, then
P(|X − E[X]| ≥ a) ≤ Var(X) / a².
Proof: apply Markov's inequality to the nonnegative variable (X − E[X])². Here Var(X) is the variance of X, defined as Var(X) = E[(X − E[X])²].
Chebyshev's inequality, variants:
– asymmetric two-sided case (X has an asymmetric distribution)
– symmetric two-sided case (X has a symmetric distribution), which is equivalent to the standard two-sided form
There are lots of other generalizations, for example for multivariate X.
Generalizations of Chebyshev and Markov with higher moments:
P(|X − E[X]| ≥ a) ≤ E[|X − E[X]|^n] / a^n, where n ≥ 1.
Other functions instead of polynomials? The exp function:
P(X ≥ a) ≤ E[e^{tX}] / e^{ta} for every t > 0.
Proof: e^{tx} is monotone increasing in x, so P(X ≥ a) = P(e^{tX} ≥ e^{ta}), and the Markov inequality applies to the nonnegative variable e^{tX}.
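To see how much the exponential (Chernoff-style) bound can gain over the first- and second-moment bounds, here is a sketch for P(S ≥ 75) with S ~ Binomial(100, 1/2); the grid search over t is a simple illustrative choice:

```python
import math

# Bound P(S >= 75) for S ~ Binomial(100, 1/2); E[S] = 50, Var(S) = 25.
n, a = 100, 75
mean, var = n / 2, n / 4

markov = mean / a                      # first moment only
chebyshev = var / (a - mean) ** 2      # second moment
# Chernoff: min over t > 0 of E[e^{tS}] / e^{ta}, with E[e^{tS}] = ((1+e^t)/2)^n
chernoff = min(((1 + math.exp(t)) / 2) ** n / math.exp(t * a)
               for t in (i / 1000 for i in range(1, 3000)))

print(markov, chebyshev, chernoff)
```

Each extra piece of information tightens the bound by orders of magnitude: the exponential bound is astronomically smaller than Markov's 2/3.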
Chebyshev's inequality is good enough to study the question:
Do the empirical averages converge to the true mean?
Answer: Yes, they do. (Law of large numbers)
Weak Law of Large Numbers: the empirical average converges in probability to the true mean μ.
Strong Law of Large Numbers: the empirical average converges almost surely to the true mean μ.
Proof I (weak law):
Assume finite variance σ². (Not a crucial restriction.) By Chebyshev's inequality, the empirical average Ȳn of n i.i.d. samples satisfies
P(|Ȳn − μ| ≥ ε) ≤ σ² / (n ε²).
Therefore, as n approaches infinity, P(|Ȳn − μ| < ε) approaches 1.
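The Chebyshev step of the proof can be checked by simulation, here with Bernoulli(1/2) samples (σ² = 0.25); the helper name and parameter values are illustrative:

```python
import random

def tail_prob(n, eps=0.1, trials=2000, seed=2):
    """Estimate P(|average of n Bernoulli(1/2) draws - 1/2| >= eps)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        m = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(m - 0.5) >= eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000):
    simulated = tail_prob(n)
    chebyshev_bound = 0.25 / (n * 0.1 ** 2)   # sigma^2 / (n * eps^2)
    print(n, simulated, min(chebyshev_bound, 1.0))
```

The simulated tail probability always sits below the Chebyshev bound and shrinks to 0 as n grows, exactly as the weak law asserts.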
Fourier transform Inverse Fourier transform
Other conventions: where to put the 2π?
– Not preferred: not a unitary transform; it doesn't preserve the inner product.
– Preferred: the unitary versions of the transform.
Fourier transform: F(ω) = (1/√(2π)) ∫ f(x) e^{−iωx} dx
Inverse Fourier transform: f(x) = (1/√(2π)) ∫ F(ω) e^{iωx} dω
Inverse is really inverse:
Properties:
and lots of other important ones…
The Fourier transform will be used to define the characteristic function and to represent distributions in an alternative way.
How can we describe a random variable? The characteristic function provides an alternative way of describing a random variable.
Definition: the characteristic function of X is the Fourier transform of its density:
φX(t) = E[e^{itX}]
Properties
– The characteristic function always exists; for example, the Cauchy distribution doesn't have a mean but still has a characteristic function.
– It is continuous on the entire space, even if X is not continuous.
– It is bounded, even if X is not bounded.
– Lévy's continuity theorem.
– The characteristic function of the constant a is φ(t) = e^{ita}.
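A sketch that checks one concrete characteristic function empirically, assuming X ~ N(0, 1), whose characteristic function is φ(t) = e^{−t²/2} (the helper name is illustrative):

```python
import cmath, math, random

def empirical_cf(t, samples):
    """Empirical characteristic function: the average of e^{i t X}."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(50000)]

for t in (0.0, 0.5, 1.0, 2.0):
    exact = math.exp(-t * t / 2)      # cf of N(0,1): phi(t) = e^{-t^2/2}
    print(t, abs(empirical_cf(t, xs) - exact))
```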
Proof II:
Taylor's theorem for complex functions, applied to the characteristic function.
Properties of characteristic functions + Lévy's continuity theorem ⇒ the limit is the constant distribution concentrated at the mean.
Gauss–Markov: doesn't give a rate. Chebyshev: gives a rate, but the required sample size scales as Var(X)/(ε²δ), i.e., polynomially in 1/δ.
Can we get a smaller, logarithmic dependence on δ (for an error guarantee that holds with probability 1 − δ)?
More useful tools!
Hoeffding's inequality: Let X1, …, Xn be independent random variables with ai ≤ Xi ≤ bi. Then for Sn = X1 + … + Xn,
P(|Sn − E[Sn]| ≥ t) ≤ 2 exp(−2t² / Σi (bi − ai)²).
It only contains the range of the variables, but not the variances.
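Applying Hoeffding's inequality to the earlier coin question (ε = 0.1, δ = 0.05): solving 2·exp(−2nε²) ≤ δ gives n ≥ ln(2/δ)/(2ε²). A sketch (the helper name is illustrative):

```python
import math

def hoeffding_n(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta ([0,1]-valued samples)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# The coin question from the start: error eps = 0.1 with probability 1 - 0.05
print(hoeffding_n(0.1, 0.05))
```

Note the logarithmic dependence on 1/δ, in contrast with Chebyshev's polynomial 1/δ.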
A few minutes of calculations.
Bernstein's inequality involves the variances too, and can give tighter bounds than Hoeffding's.
Benett’s inequality ) Bernstein’s inequality.
41
Proof:
It follows that
http://en.wikipedia.org/wiki/Hoeffding's_inequality
http://en.wikipedia.org/wiki/Doob_martingale (McDiarmid)
http://en.wikipedia.org/wiki/Bennett%27s_inequality
http://en.wikipedia.org/wiki/Markov%27s_inequality
http://en.wikipedia.org/wiki/Chebyshev%27s_inequality
http://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)
Lindeberg–Lévy CLT:
Lyapunov CLT: (+ some other conditions)
Generalizations: multidimensional versions, time processes
[Figure: histograms of the unscaled and the scaled sums]
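A CLT sketch by simulation: standardized sums of Uniform(0,1) draws compared against the N(0,1) CDF (n = 30 and the sample counts are illustrative choices):

```python
import math, random

def standardized_mean(n, rng):
    """(S_n - E[S_n]) / sqrt(Var(S_n)) for S_n = sum of n Uniform(0,1) draws."""
    s = sum(rng.random() for _ in range(n))
    return (s - n / 2) / math.sqrt(n / 12)   # E[S_n] = n/2, Var(S_n) = n/12

rng = random.Random(4)
zs = [standardized_mean(30, rng) for _ in range(20000)]

def normal_cdf(t):
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

for t in (-1.0, 0.0, 1.0):
    empirical = sum(z <= t for z in zs) / len(zs)
    print(t, round(empirical, 3), round(normal_cdf(t), 3))
```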
From the Taylor series of the characteristic function around 0 + properties of characteristic functions: Lévy's continuity theorem + uniqueness ⇒ CLT.
Berry-Esseen Theorem
Independently discovered by A. C. Berry (in 1941) and C.-G. Esseen (1942)
CLT: it doesn't tell us anything about the convergence rate. The Berry–Esseen theorem does: if E[X1] = 0, E[X1²] = σ² > 0, and E[|X1|³] = ρ < ∞, then the CDF Fn of the standardized sum satisfies
sup_x |Fn(x) − Φ(x)| ≤ C ρ / (σ³ √n)
for an absolute constant C.
Next time we will continue with these questions:
– How good are the ML algorithms on unknown test sets?
– How many training samples do we need to achieve small error?
– What is the smallest possible error we can achieve?
Experiment
How many trials do we need to decide which page attracts more clicks?
Assume that in group A, p(click|A) = 0.10 and p(noclick|A) = 0.90.
Assume that in group B, p(click|B) = 0.11 and p(noclick|B) = 0.89.
Assume also that we know these probabilities in group A, but we don't know them yet in group B.
Let us simplify the question a bit: we want to estimate p(click|B) with less than 0.01 error.
Chebyshev:
– Worst case: a random variable with values in [0, 1] has variance at most 0.25.
– Since p(click|B) ≈ 0.11 ≤ 0.15, we may use the bound Var ≤ 0.15 · 0.85 = 0.1275. This requires at least 25,500 users.
The click indicators take values in [0, 1], hence c = 1.
This is better than Chebyshev.
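The two sample-size computations from these slides side by side (ε = 0.01, δ = 0.05, and the slides' variance bound 0.15 · 0.85 = 0.1275):

```python
import math

eps, delta = 0.01, 0.05

# Chebyshev: n >= Var / (eps^2 * delta), with the variance bound 0.1275
n_chebyshev = 0.1275 / (eps ** 2 * delta)
# Hoeffding: 2 * exp(-2 * n * eps**2) <= delta
n_hoeffding = math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(round(n_chebyshev), n_hoeffding)
```

Hoeffding needs roughly 18,445 users instead of Chebyshev's 25,500, thanks to the logarithmic dependence on 1/δ.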
Theory:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm
– Law of large numbers
– Central limit theorem
– Markov, Chebyshev
– Hoeffding, Bennett, McDiarmid