10701 Machine Learning Recitation 7 - Tail bounds and Averages

  1. 10701 Machine Learning Recitation 7 - Tail bounds and Averages • Ahmed Hefny • Slides mostly by Alex Smola

  2. Why this stuff? • Can machine learning work?

  3. Why this stuff? • Can machine learning work? • Yes, otherwise: • No Google • No spam filters • No face detectors • No 701 midterm • I’d be living my life

  4. Why this stuff? • Will machine learning always work?

  5. Why this stuff? • Will machine learning always work? • No [figure: an ambiguous example, Class 1? Class 2?]

  6. Why this stuff? • We need some theory to analyze machine learning algorithms. • We will go through basic tools used to build theory. • How well can we estimate stuff from data? • What is the convergence behavior of empirical averages?

  7. Outline • Estimation Example • Convergence of Averages • Law of Large Numbers • Central Limit Theorem • Inequalities and Tail Bounds • Markov Inequality • Chebyshev’s Inequality • Hoeffding’s and McDiarmid’s Inequalities • Proof Tools • Union Bound • Fourier Analysis • Characteristic Functions

  8. Estimating Probabilities

  9. Discrete Distribution • n outcomes (e.g. faces of a die) • Data likelihood • Maximum Likelihood Estimation • Constrained optimization problem ... or ... • Incorporate the constraint via a Lagrange multiplier • Taking derivatives yields
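
A minimal sketch of the derivation these bullets refer to, with notation chosen here (n_i = count of outcome i, n = total number of draws); the slide's own formulas did not survive extraction:

$$\max_{\pi}\ \sum_i n_i \log \pi_i \ \text{ s.t. } \sum_i \pi_i = 1 \;\;\Rightarrow\;\; \frac{n_i}{\pi_i} = \lambda \;\;\Rightarrow\;\; \hat{\pi}_i = \frac{n_i}{n}.$$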

  10. Tossing a Die [figures: empirical frequencies after 12, 24, 60, and 120 tosses]

  11. Tossing a Die [figure continued]

  12. Tossing a Die [figure continued]

  13. Tossing a Die [figure continued]

  14. Empirical average for a die

  15. Empirical average for a die: is it guaranteed to converge? How quickly does it converge?
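
A minimal simulation (not from the slides) illustrating both questions: the running average of fair die rolls drifts toward the true mean 3.5, and the fluctuations shrink roughly like 1/sqrt(n).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    rolls = rng.integers(1, 7, size=n)              # fair die: faces 1..6
    running_avg = np.cumsum(rolls) / np.arange(1, n + 1)

    # deviation from the true mean 3.5 at a few sample sizes
    for k in (12, 120, 1_200, 10_000):
        print(k, abs(running_avg[k - 1] - 3.5))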

  16. Convergence of Empirical Averages

  17. Expectations • Random variable x with probability measure p • Expected value of f(x) • Special case: discrete probability mass (same trick works for intervals) • Draw x_i independently and identically from p • Empirical average
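
The formulas on this slide did not extract; the standard definitions, in notation chosen here, are:

$$\mathbf{E}_{x\sim p}[f(x)] = \int f(x)\,dp(x) \;\left(= \sum_x p(x)\,f(x) \text{ in the discrete case}\right), \qquad \hat{\mathbf{E}}[f] = \frac{1}{n}\sum_{i=1}^{n} f(x_i), \quad x_i \overset{\text{iid}}{\sim} p.$$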

  18. Law of Large Numbers • Random variables x_i with mean μ • Empirical average • Weak Law of Large Numbers: convergence in probability • Strong Law of Large Numbers: almost sure convergence
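
A standard statement of both laws (notation assumed here): with empirical average $\bar{X}_n = \frac{1}{n}\sum_i x_i$ and mean μ,

$$\text{weak: } \lim_{n\to\infty} \Pr\left(\left|\bar{X}_n - \mu\right| > \epsilon\right) = 0 \ \ \forall \epsilon > 0, \qquad \text{strong: } \Pr\left(\lim_{n\to\infty} \bar{X}_n = \mu\right) = 1.$$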

  19. Empirical average for a die [figure: 5 sample traces]

  20. Central Limit Theorem • Independent random variables x_i with mean μ_i and standard deviation σ_i • The suitably rescaled sum converges to a Normal distribution • Special case: IID random variables and their empirical average
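
In the IID special case the statement reads (notation chosen here):

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1),$$

i.e. the fluctuations of the empirical average around μ are of order σ/√n.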

  21. Central Limit Theorem in Practice [figures: unscaled vs. scaled sums]

  22. Tail Bounds

  23. Simple tail bounds • Gauss-Markov inequality: non-negative random variable X with mean μ • Proof: decompose the expectation
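
The inequality and the one-line decomposition it refers to, in standard form:

$$\Pr(X \geq \epsilon) \;\leq\; \frac{\mu}{\epsilon}, \qquad \text{since } \mu = \mathbf{E}[X] \;\geq\; \mathbf{E}\left[X\,\mathbf{1}\{X \geq \epsilon\}\right] \;\geq\; \epsilon \Pr(X \geq \epsilon).$$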

  24. Simple tail bounds • Chebyshev inequality: random variable X with mean μ and variance σ² • Proof: applying Gauss-Markov to Y = (X − μ)² with confidence ε² yields the result.

  25. Simple tail bounds (build) • Correct?

  26. Simple tail bounds (build) • Approximately Correct?

  27. Simple tail bounds (build) • Probably Approximately Correct!
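
The Chebyshev statement that slides 24-27 build up, written out:

$$\Pr\left(|X - \mu| \geq \epsilon\right) \;\leq\; \frac{\sigma^2}{\epsilon^2},$$

obtained by applying the Gauss-Markov bound to Y = (X − μ)² with threshold ε².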

  28. Scaling behavior • Gauss-Markov: scales properly in μ but is expensive in δ • Chebyshev: proper scaling in σ but still bad in δ • Can we get logarithmic scaling in δ?

  29. Chernoff bound • For a Bernoulli random variable with P(x=1) = p • Example: n independent tosses of a biased coin with probability p of heads: Pr(ν_n − p ≥ ε) ≤ exp(−2nε²), where ν_n is the fraction of heads
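
A quick worked instance of the bound (numbers chosen here for illustration): with n = 200 tosses and deviation ε = 0.1,

$$\Pr\left(\nu_n - p \geq 0.1\right) \;\leq\; \exp(-2 \cdot 200 \cdot 0.1^2) \;=\; \exp(-4) \;\approx\; 0.018.$$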

  30. Chernoff bound • Proof sketch: bound the tail probability in terms of a KL-divergence between the empirical and true bias, then apply Pinsker’s inequality to obtain the exponent −2nε²

  31. Hoeffding’s Inequality • If the X_i have bounded range c

  32. Hoeffding’s Inequality • Scaling behavior • This helps when we need to combine several tail bounds, since we only pay logarithmically for their combination.
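
The inequality itself, in a standard form consistent with the bounded-range assumption above (each X_i has range of size c, empirical average $\bar{X}_n$, mean μ):

$$\Pr\left(\left|\bar{X}_n - \mu\right| \geq \epsilon\right) \;\leq\; 2\exp\left(-\frac{2 n \epsilon^2}{c^2}\right),$$

so fixing a failure probability δ gives ε = c·sqrt(log(2/δ) / (2n)): the promised logarithmic dependence on δ.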

  33. McDiarmid’s Inequality • Generalization of Hoeffding’s Inequality • Independent random variables X_i • Function f(X_1, ..., X_n) • Deviation from its expected value • Here C is given by the bounded-difference constants of f • If f is the average and the X_i have bounded range c ⇒ Hoeffding’s Inequality
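
A standard statement of the bounded-differences form (notation chosen here): if changing the i-th argument of f changes its value by at most c_i, then

$$\Pr\left(\left|f(X_1,\dots,X_n) - \mathbf{E}\,f\right| \geq \epsilon\right) \;\leq\; 2\exp\left(-\frac{2\epsilon^2}{\sum_i c_i^2}\right);$$

for the average of variables with range c we have c_i = c/n, and the bound reduces to Hoeffding’s 2·exp(−2nε²/c²).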

  34. More tail bounds • Higher order moments • Bernstein inequality (needs a variance bound): Pr(|X̄_n − μ| ≥ ε) ≤ 2 exp(−(nε²/2) / (σ̄² + Cε/3)), where σ̄² = (1/n) Σ_j Var[X_j] and |X_j − E X_j| ≤ C • Absolute / relative error bounds • Bounds for (weakly) dependent random variables

  35. Summary • Markov [X is non-negative] • Chebyshev [finite variance] • Hoeffding [bound on range] • Bernstein [bound on range + bound on second moment] • Tighter bounds ⇔ more assumptions

  36. Tools for the proof

  37. Union Bound: P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B). In general, P(⋃_i A_i) ≤ Σ_i P(A_i).
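
Why this matters for tail bounds: to make K deviation bounds hold simultaneously with probability 1 − δ, let each fail with probability at most δ/K; by the union bound the total failure probability is at most

$$\Pr\left(\bigcup_{k=1}^{K} \text{fail}_k\right) \;\leq\; \sum_{k=1}^{K} \frac{\delta}{K} \;=\; \delta,$$

and since Hoeffding-type bounds depend on the failure probability only through log(1/δ), this costs only an extra log K.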

  38. Fourier Transform • Fourier transform relations • Useful identities • Identity • Derivative • Convolution (also holds for inverse transform)

  39. The Characteristic Function Method • Characteristic function • For X and Y independent we have • Distribution of the sum is a convolution • Characteristic function is a product • Proof: plug in the definition of the Fourier transform • Characteristic function is unique
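
The definitions referenced on this slide, written out in the usual way:

$$\phi_X(\omega) = \mathbf{E}\left[e^{i\omega X}\right], \qquad X \perp Y \;\Rightarrow\; p_{X+Y} = p_X * p_Y \ \text{ and } \ \phi_{X+Y}(\omega) = \phi_X(\omega)\,\phi_Y(\omega).$$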

  40. Proof - Weak law of large numbers • Require that the expectation exists • Taylor expansion of the exponential (need to assume that we can bound the tail) • Average of random variables ⇒ convolution of their distributions • Limit is a constant distribution: higher order terms vanish and only the mean remains
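
A compressed version of the standard argument (notation chosen here): for IID x_i with mean μ,

$$\phi_{\bar{X}_n}(\omega) = \left[\phi_X\!\left(\tfrac{\omega}{n}\right)\right]^{n} = \left[1 + \tfrac{i\omega\mu}{n} + o\!\left(\tfrac{1}{n}\right)\right]^{n} \;\longrightarrow\; e^{i\omega\mu},$$

the characteristic function of the constant μ, so $\bar{X}_n$ converges to μ in distribution and hence in probability (the limit being a constant).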

  41. Warning • Moments may not always exist • Cauchy distribution • For the mean to exist the following integral would have to converge
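
Concretely, for the standard Cauchy density (stated here as a reminder, not reproduced from the slide):

$$p(x) = \frac{1}{\pi(1 + x^2)}, \qquad \int_{-\infty}^{\infty} \frac{|x|}{\pi(1 + x^2)}\,dx = \infty,$$

so the mean does not exist and the law of large numbers does not apply.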

  42. Proof - Central limit theorem • Require that second order moment exists (we assume they’re all identical WLOG) • Characteristic function • Subtract out mean (centering) • This is the FT of a Normal Distribution
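
The corresponding second-order expansion (standard argument, notation chosen here): for centered, unit-variance IID X_i and Z_n = n^{-1/2} Σ_i X_i,

$$\phi_{Z_n}(\omega) = \left[\phi_X\!\left(\tfrac{\omega}{\sqrt{n}}\right)\right]^{n} = \left[1 - \tfrac{\omega^2}{2n} + o\!\left(\tfrac{1}{n}\right)\right]^{n} \;\longrightarrow\; e^{-\omega^2/2},$$

which is the characteristic function (Fourier transform) of the standard Normal distribution.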

  43. Conclusion & what’s next? We looked at basic building blocks of learning theory: - Convergence of empirical averages - Tail bounds - Union bound

  44. Conclusion & what’s next? Evaluate classifier C on N data points and estimate accuracy. Can we upper-bound estimation error?

  45. Conclusion & what’s next? Evaluate classifier C on N data points and estimate accuracy. Can we upper-bound estimation error? Yes, Chernoff bound / Hoeffding’s inequality
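
Concretely (a standard consequence, notation chosen here): the per-example correctness indicators lie in [0, 1], so Hoeffding gives, with probability at least 1 − δ,

$$\left|\widehat{\mathrm{acc}} - \mathrm{acc}\right| \;\leq\; \sqrt{\frac{\log(2/\delta)}{2N}}.$$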

  46. Conclusion & what’s next? Evaluate a set of classifiers on N data points and pick the one with best accuracy. Can we upper-bound estimation error?

  47. Conclusion & what’s next? Evaluate a set of classifiers on N data points and pick the one with best accuracy. Can we upper-bound estimation error? Yes, Chernoff bound / Hoeffding’s inequality + union bound
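
For a finite set of K classifiers, applying the previous bound with δ/K per classifier and a union bound gives, with probability at least 1 − δ, simultaneously for all k = 1, …, K,

$$\left|\widehat{\mathrm{acc}}_k - \mathrm{acc}_k\right| \;\leq\; \sqrt{\frac{\log(2K/\delta)}{2N}},$$

and in particular this bounds the estimation error of the classifier picked by best empirical accuracy.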

  48. Conclusion & what’s next? What if the set of classifiers is infinite?
