SLIDE 1

Introduction to Machine Learning CMU-10701

  • 9. Tail Bounds

Barnabás Póczos

SLIDE 2

Fourier Transform and Characteristic Function

SLIDE 3

Fourier Transform

Fourier transform (unitary convention):  \hat f(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x)\, e^{-i\omega x}\, dx

Inverse Fourier transform:  f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \hat f(\omega)\, e^{i\omega x}\, d\omega

Other conventions: Where to put 2π?

Not preferred: putting the whole 1/(2π) factor into the inverse transform only; that version is not a unitary transform and doesn't preserve the inner product. Putting 1/\sqrt{2\pi} in front of both directions, or putting the 2π into the exponent, gives a unitary transform.

SLIDE 4

Fourier Transform

Fourier transform and inverse Fourier transform, as defined on the previous slide.

The inverse is really an inverse:  \mathcal{F}^{-1}[\mathcal{F}[f]] = f.

Properties: linearity, the shift and scaling rules, the convolution theorem, Parseval/Plancherel (the unitary transform preserves the inner product), and lots of other important ones…

The Fourier transform will be used to define the characteristic function, which represents a distribution in an alternative way.
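As a quick numerical sanity check (not part of the slides; a minimal sketch assuming the unitary 1/\sqrt{2\pi} convention above), one can approximate both integrals on a grid and verify that the standard Gaussian is mapped to itself and that the inverse transform recovers the original function:

import numpy as np

# Grids for numerical integration (trapezoidal rule).
x = np.linspace(-20.0, 20.0, 2001)
w = np.linspace(-8.0, 8.0, 801)

f = np.exp(-x**2 / 2)  # standard Gaussian f(x) = exp(-x^2/2)

# Unitary Fourier transform: F(w) = 1/sqrt(2*pi) * integral of f(x) exp(-i*w*x) dx
F = np.trapz(f * np.exp(-1j * np.outer(w, x)), x, axis=1) / np.sqrt(2 * np.pi)
print("Gaussian maps to Gaussian:", np.abs(F - np.exp(-w**2 / 2)).max())  # small

# Inverse transform: f(x) = 1/sqrt(2*pi) * integral of F(w) exp(+i*w*x) dw
f_back = np.trapz(F * np.exp(1j * np.outer(x, w)), w, axis=1) / np.sqrt(2 * np.pi)
print("inverse recovers f:       ", np.abs(f_back - f).max())             # small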

SLIDE 5

Characteristic function

The characteristic function provides an alternative way of describing a random variable.

How can we describe a random variable?

  • cumulative distribution function (cdf)
  • probability density function (pdf)

Definition: the characteristic function is the Fourier transform of the density,  \varphi_X(t) = \mathbb{E}\!\left[e^{itX}\right] = \int e^{itx} f_X(x)\, dx.
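To make the definition concrete, here is a small sketch (mine, not from the slides) comparing the empirical characteristic function of Gaussian samples with the known closed form \varphi(t) = e^{i\mu t - \sigma^2 t^2/2}:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 20_000
X = rng.normal(mu, sigma, size=n)

t = np.linspace(-3.0, 3.0, 61)

# Empirical characteristic function: (1/n) * sum_j exp(i * t * X_j)
phi_emp = np.exp(1j * np.outer(t, X)).mean(axis=1)

# Closed form for a Gaussian random variable.
phi_true = np.exp(1j * mu * t - sigma**2 * t**2 / 2)

print("max deviation:", np.abs(phi_emp - phi_true).max())  # shrinks like 1/sqrt(n)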

SLIDE 6

Characteristic function

Properties

  • It always exists, even when the mean does not. For example, the Cauchy distribution doesn't have a mean but still has a characteristic function.
  • It is continuous on the entire space, even if X is not continuous.
  • It is bounded (|\varphi_X(t)| \le 1), even if X is not bounded.
  • Lévy's continuity theorem: pointwise convergence of characteristic functions corresponds to convergence in distribution.
  • Characteristic function of a constant a: \varphi_a(t) = e^{ita}.

SLIDE 7

Weak Law of Large Numbers

Proof II:

The proof uses Taylor's theorem for complex functions, applied to the characteristic function of the average.

Properties of characteristic functions + Lévy's continuity theorem ⇒ the limit is the constant distribution with mean µ.
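In compact form (a reconstruction of the standard characteristic-function argument, not a transcript of the slide, assuming i.i.d. X_i with finite mean \mu):

\[
\varphi_{\bar X_n}(t)
= \left[\varphi_X\!\left(\frac{t}{n}\right)\right]^n
= \left[1 + \frac{i\mu t}{n} + o\!\left(\frac{1}{n}\right)\right]^n
\;\longrightarrow\; e^{i\mu t},
\]

the characteristic function of the constant \mu. Lévy's continuity theorem then gives \bar X_n \to \mu in distribution, and convergence in distribution to a constant implies convergence in probability, which is the weak law.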

SLIDE 8

“Convergence rate” for LLN

Gauss-Markov: doesn't give a rate.

Chebyshev: with probability 1-δ,  |\bar X_n - \mu| \le \sqrt{\sigma^2 / (n\,\delta)}.

Can we get a smaller, logarithmic error in δ?
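Where the Chebyshev rate comes from (a one-line derivation, not on the slide): set the tail bound equal to δ and solve for ε,

\[
\Pr\!\left(|\bar X_n - \mu| \ge \varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2} = \delta
\quad\Longrightarrow\quad
\varepsilon = \sqrt{\frac{\sigma^2}{n\,\delta}} ,
\]

so the dependence on δ is only polynomial (1/\sqrt{\delta}); the exponential tail bounds on the following slides replace this with \sqrt{\log(1/\delta)}.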

SLIDE 9

Further Readings on LLN, Characteristic Functions, etc.

  • http://en.wikipedia.org/wiki/Levy_continuity_theorem
  • http://en.wikipedia.org/wiki/Law_of_large_numbers
  • http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)
  • http://en.wikipedia.org/wiki/Fourier_transform

SLIDE 10

More tail bounds

More useful tools!

SLIDE 11

Hoeffding’s inequality (1963)

For independent X_1, \dots, X_n with X_i \in [a_i, b_i]:

\Pr\!\left( \left| \frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}X_i) \right| \ge \varepsilon \right) \le 2\exp\!\left( - \frac{2 n^2 \varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).

It only involves the range of the variables, not their variances.
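A Monte-Carlo sanity check of the bound (my own sketch, not from the lecture), using uniform [0, 1] variables so that b_i - a_i = 1:

import numpy as np

rng = np.random.default_rng(1)
n, trials, eps = 100, 50_000, 0.1

X = rng.uniform(0.0, 1.0, size=(trials, n))      # X_i in [0, 1], mean 1/2
dev = np.abs(X.mean(axis=1) - 0.5)               # |sample mean - expected mean|

empirical = (dev >= eps).mean()
hoeffding = 2.0 * np.exp(-2.0 * n * eps**2)      # bound with sum (b_i - a_i)^2 = n

print(f"empirical P(|mean - 1/2| >= {eps}): {empirical:.5f}")
print(f"Hoeffding bound:                    {hoeffding:.5f}")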

SLIDE 12

“Convergence rate” for LLN from Hoeffding

Hoeffding: for variables with range c, setting 2\exp(-2n\varepsilon^2/c^2) = \delta and solving for \varepsilon gives, with probability at least 1-δ,

|\bar X_n - \mathbb{E}\bar X_n| \le c\,\sqrt{\frac{\ln(2/\delta)}{2n}} ,

which is the logarithmic dependence on 1/δ we asked for.

SLIDE 13

Proof of Hoeffding's Inequality

A few minutes of calculations.
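The calculations themselves aren't reproduced in this transcript; the standard argument (Chernoff's exponential trick plus Hoeffding's lemma) runs roughly as follows. For any s > 0, by Markov's inequality and independence,

\[
\Pr\!\left(\bar X_n - \mathbb{E}\bar X_n \ge \varepsilon\right)
\le e^{-s\varepsilon}\, \mathbb{E}\!\left[e^{\,s(\bar X_n - \mathbb{E}\bar X_n)}\right]
= e^{-s\varepsilon} \prod_{i=1}^n \mathbb{E}\!\left[e^{\,\frac{s}{n}(X_i - \mathbb{E}X_i)}\right]
\le e^{-s\varepsilon} \prod_{i=1}^n e^{\, s^2 (b_i - a_i)^2 / (8 n^2)} ,
\]

where the last step is Hoeffding's lemma, \mathbb{E}\,e^{\lambda (X - \mathbb{E}X)} \le e^{\lambda^2 (b-a)^2/8} for X \in [a, b]. Minimizing the exponent over s and repeating the argument for the lower tail gives the inequality on the previous slides.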

SLIDE 14

Bernstein’s inequality (1946)

It also involves the variances, and can therefore give tighter bounds than Hoeffding.
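The slide's exact formulation is not in this transcript; a commonly quoted form (constants vary slightly across references) for independent zero-mean X_i with |X_i| \le M is:

\[
\Pr\!\left(\sum_{i=1}^n X_i \ge t\right)
\le \exp\!\left(-\,\frac{t^2/2}{\sum_{i=1}^n \mathbb{E}[X_i^2] \;+\; M t / 3}\right).
\]

When the total variance \sum_i \mathbb{E}[X_i^2] is small, this is much tighter than a bound that only uses the range.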

SLIDE 15

Bennett's inequality (1962)

Bennett's inequality ⇒ Bernstein's inequality.

Proof:
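Neither the statement nor the proof is preserved in this transcript; for reference, a common form of Bennett's inequality, for independent zero-mean X_i with |X_i| \le M and \sigma^2 = \sum_i \mathbb{E}[X_i^2], is:

\[
\Pr\!\left(\sum_{i=1}^n X_i \ge t\right)
\le \exp\!\left(-\frac{\sigma^2}{M^2}\, h\!\left(\frac{M t}{\sigma^2}\right)\right),
\qquad h(u) = (1+u)\log(1+u) - u .
\]

Bernstein's inequality then follows from the elementary bound h(u) \ge \frac{u^2}{2(1 + u/3)}.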

SLIDE 16

McDiarmid’s Bounded Difference Inequality

If changing the i-th argument of f changes its value by at most c_i (the bounded differences property), then for independent X_1, \dots, X_n it follows that

\Pr\!\left( \left| f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \right| \ge \varepsilon \right) \le 2\exp\!\left( -\frac{2\varepsilon^2}{\sum_{i=1}^n c_i^2} \right).
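As a quick check (a standard special case, added here, not from the slide): taking f(x_1, \dots, x_n) = \frac{1}{n}\sum_i x_i with x_i \in [a_i, b_i] gives c_i = (b_i - a_i)/n, so the bound becomes 2\exp\!\left(-2n^2\varepsilon^2 / \sum_i (b_i - a_i)^2\right), i.e. exactly Hoeffding's inequality.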

SLIDE 17

Further Readings on Tail Bounds

  • http://en.wikipedia.org/wiki/Hoeffding's_inequality
  • http://en.wikipedia.org/wiki/Doob_martingale (McDiarmid)
  • http://en.wikipedia.org/wiki/Bennett%27s_inequality
  • http://en.wikipedia.org/wiki/Markov%27s_inequality
  • http://en.wikipedia.org/wiki/Chebyshev%27s_inequality
  • http://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)

SLIDE 18

Limit Distribution?

SLIDE 19

Central Limit Theorem

Lindeberg-Lévy CLT: for i.i.d. X_1, X_2, \dots with mean \mu and finite variance \sigma^2,  \sqrt{n}\,(\bar X_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2).

Lyapunov CLT: the variables only need to be independent, not identically distributed, + some other conditions (a moment condition on the X_i).

Generalizations: multi-dimensional versions, time processes.

SLIDE 20

Central Limit Theorem in Practice

[Figure: empirical distributions of the averages, unscaled vs. scaled (standardized).]
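A small simulation along these lines (mine, not the slide's figure): standardized averages of very non-Gaussian samples already look Gaussian for moderate n.

import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 100_000

# Exponential(1) samples: mean 1, variance 1, strongly skewed.
X = rng.exponential(1.0, size=(trials, n))

# Standardized averages: sqrt(n) * (sample mean - mu) / sigma
Z = np.sqrt(n) * (X.mean(axis=1) - 1.0) / 1.0

# Compare a few empirical quantiles with standard normal quantiles.
for q, z in [(0.025, -1.96), (0.50, 0.00), (0.975, 1.96)]:
    print(f"{q:.3f}-quantile: empirical {np.quantile(Z, q):+.2f}, N(0,1) {z:+.2f}")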

SLIDE 21

Proof of CLT

From the Taylor series around 0, the properties of characteristic functions, and Lévy's continuity theorem + uniqueness ⇒ CLT: the characteristic function of the standardized average converges to the characteristic function of the Gauss distribution.
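In compressed form (a reconstruction of the standard argument, not a transcript of the slide, for i.i.d. variables with mean 0 and variance \sigma^2 after centering):

\[
\varphi_{\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i}(t)
= \left[\varphi_X\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n
= \left[1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right]^n
\;\longrightarrow\; e^{-t^2/2},
\]

which is the characteristic function of \mathcal{N}(0, 1); Lévy's continuity theorem and the uniqueness of characteristic functions then give convergence in distribution.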
SLIDE 22

How fast do we converge to the Gauss distribution?

Berry-Esseen Theorem

Independently discovered by A. C. Berry (in 1941) and C.-G. Esseen (1942).

The CLT says the standardized averages converge in distribution to the Gaussian, but it doesn't tell us anything about the rate of this convergence.
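The theorem's statement is not preserved in this transcript; a standard form (the best known constant keeps being improved) is: for i.i.d. X_i with mean \mu, variance \sigma^2, and \rho = \mathbb{E}|X_i - \mu|^3 < \infty,

\[
\sup_{x} \left| \Pr\!\left(\frac{\sqrt{n}\,(\bar X_n - \mu)}{\sigma} \le x\right) - \Phi(x) \right|
\;\le\; \frac{C\,\rho}{\sigma^3 \sqrt{n}}
\]

for an absolute constant C, i.e. the distance to the Gaussian cdf shrinks at rate 1/\sqrt{n}.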

SLIDE 23

Did we answer the questions we asked?

  • Do empirical averages converge?
  • What do we mean on convergence?
  • What is the rate of convergence?
  • What is the limit distrib. of “standardized” averages?

Next time we will continue with these questions:

  • How good are the ML algorithms on unknown test sets?
  • How many training samples do we need to achieve small error?
  • What is the smallest possible error we can achieve?

SLIDE 24

Further Readings on CLT

  • http://en.wikipedia.org/wiki/Central_limit_theorem
  • http://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm

SLIDE 25

Tail bounds in practice

SLIDE 26

A/B testing

  • Two possible webpage layouts
  • Which layout is better?

Experiment:

  • Some users see design A
  • The others see design B

How many trials do we need to decide which page attracts more clicks?

SLIDE 27

A/B testing

Assume that in group A, p(click|A) = 0.10 and p(noclick|A) = 0.90.
Assume that in group B, p(click|B) = 0.11 and p(noclick|B) = 0.89.
Assume also that we know these probabilities in group A, but we don't yet know them in group B.

Let us simplify this question a bit: we want to estimate p(click|B) with less than 0.01 error.

SLIDE 28

Chebyshev Inequality

  • In group B the click probability is µ = 0.11 (we don't know this yet)
  • Want failure probability of δ = 5%

Chebyshev:  \Pr\!\left(|\bar X_n - \mu| \ge \varepsilon\right) \le \frac{\sigma^2}{n\varepsilon^2} \le \delta, so we need n \ge \sigma^2 / (\delta\,\varepsilon^2) users for error ε = 0.01.

  • If we have no prior knowledge, we can only bound the variance by σ² = 0.25 (the uniform distribution on {0, 1} has the largest variance, 0.25). This requires at least 0.25 / (0.05 · 0.01²) = 50,000 users.
  • If we know that the click probability is < 0.15, then we can bound σ² by 0.15 · 0.85 = 0.1275. This requires at least 25,500 users.

SLIDE 29

Hoeffding’s bound

  • The random variable has bounded range [0, 1] (click or no click), hence c = 1
  • Solve Hoeffding's inequality for n: from 2\exp(-2n\varepsilon^2/c^2) \le \delta we get n \ge \frac{\ln(2/\delta)}{2\varepsilon^2} = \frac{\ln 40}{2 \cdot 0.01^2} \approx 18{,}445 users.

This is better than Chebyshev.

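Putting the two sample-size calculations side by side (a small helper script of mine, using the numbers from these slides: ε = 0.01, δ = 0.05):

import math

eps, delta = 0.01, 0.05          # target error and failure probability

def n_chebyshev(var):
    # Chebyshev: var / (n * eps^2) <= delta  =>  n >= var / (delta * eps^2)
    return math.ceil(var / (delta * eps**2))

def n_hoeffding(c=1.0):
    # Hoeffding: 2 * exp(-2 * n * eps^2 / c^2) <= delta  =>  solve for n
    return math.ceil(c**2 * math.log(2.0 / delta) / (2.0 * eps**2))

print("Chebyshev, sigma^2 = 0.25   :", n_chebyshev(0.25))    # 50000
print("Chebyshev, sigma^2 = 0.1275 :", n_chebyshev(0.1275))  # 25500
print("Hoeffding, range c = 1      :", n_hoeffding())        # 18445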
SLIDE 30

Thanks for your attention!