Introduction to Machine Learning CMU-10701
8. Stochastic Convergence
Barnabás Póczos
What have we seen so far? Several algorithms that seem to work fine on training datasets:
– Linear regression
– Naïve Bayes classifier
– How good are these algorithms on unknown test sets?
– How many training samples do we need to achieve small error?
– What is the smallest possible error we can achieve?

To answer these questions, we will need a few powerful tools.
Does the MLE estimate converge to the right value? How fast does it converge?
Does the empirical average converge to the true mean? How fast does it converge?
[Figure: 5 sample traces of the empirical average converging to the true mean]
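These traces are easy to reproduce. Below is a minimal sketch; the Bernoulli(0.5) source, the trace count, and the sample sizes are illustrative assumptions, not from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Simulate 5 sample traces of the running empirical average of fair coin
# flips and report how far each trace is from the true mean.
true_mean = 0.5
n_max = 10_000

for trace in range(5):
    flips = rng.random(n_max) < true_mean            # Bernoulli(0.5) draws
    running_avg = np.cumsum(flips) / np.arange(1, n_max + 1)
    print(f"trace {trace}: |avg - mean| = {abs(running_avg[99] - true_mean):.4f} "
          f"at n=100, {abs(running_avg[-1] - true_mean):.4f} at n=10000")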
How fast do they converge?
I want to know the coin parameter θ ∈ [0,1] within ε = 0.1 error, with probability at least 1 − δ = 0.95.
How many flips do I need?
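Chebyshev's inequality (derived later in this lecture) already gives a concrete answer. A minimal sketch, assuming only the variance bound θ(1−θ) ≤ 1/4 for a coin:

import math

# Chebyshev: P(|empirical average - theta| >= eps) <= Var(X)/(n eps^2)
#           <= 1/(4 n eps^2),
# since a Bernoulli(theta) flip has variance theta(1-theta) <= 1/4.
# Requiring 1/(4 n eps^2) <= delta gives n >= 1/(4 eps^2 delta).
eps, delta = 0.1, 0.05

n_flips = math.ceil(1 / (4 * eps**2 * delta))
print(n_flips)   # 500 flips suffice by this bound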
Theory:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm

Applications:
– Law of large numbers
– Central limit theorem

Tools:
– Markov and Chebyshev inequalities
Notation:
Let {Z, Z1, Z2, …} be a sequence of random variables, and let F, F1, F2, … denote their distribution functions.

Definition: Zn converges to Z in distribution (written Zn →d Z) if
lim_{n→∞} Fn(t) = F(t)
at every point t where F is continuous.

This is the “weakest” convergence. Only the distribution functions converge! (NOT the values of the random variables)
Continuity is important!

Example: Let Zn be uniform on [0, 1/n]. The limit random variable is the constant 0: Zn →d 0. In this example the limit Z is discrete, not random (constant 0), although Zn is a continuous random variable.

Proof: For t < 0, Fn(t) = 0 = F(t); for t > 0, Fn(t) = 1 = F(t) once n > 1/t. So Fn(t) → F(t) at every continuity point of F. At the discontinuity point t = 0, however, Fn(0) = 0 for all n while F(0) = 1, which is exactly why the definition only requires convergence at continuity points.
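A short numerical check of this example (a sketch; the evaluation points are arbitrary):

def F_n(t, n):
    """CDF of Uniform[0, 1/n], evaluated at t."""
    if t <= 0:
        return 0.0
    return min(t * n, 1.0)

# F_n(t) -> 1 for every t > 0 and F_n(t) -> 0 for t < 0, matching the CDF
# of the constant 0 everywhere except at its discontinuity point t = 0,
# where F_n(0) = 0 for all n but F(0) = 1.
for t in (-0.1, 0.0, 0.1, 0.5):
    print(t, [F_n(t, n) for n in (1, 10, 100, 1000)])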
Properties
Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution.
Example (Central Limit Theorem): if X1, X2, … are i.i.d. with mean μ and finite variance σ², then
(X1 + … + Xn − nμ) / (σ√n) →d N(0, 1).
Convergence in distribution says nothing about the joint behavior of Zn and Z: Zn and Z can still be independent of each other even if their distributions are the same!
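A quick simulation of the CLT statement above (a sketch; the Uniform[0,1] source distribution and the sample counts are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Standardized sums of i.i.d. Uniform[0,1] variables approach N(0,1):
# the fraction of standardized sums within one standard deviation should
# approach P(|N(0,1)| <= 1), about 0.683, as n grows.
mu, sigma = 0.5, (1 / 12) ** 0.5        # mean and std of Uniform[0,1]

for n in (1, 2, 10, 100):
    sums = rng.random((100_000, n)).sum(axis=1)
    Z = (sums - n * mu) / (sigma * n ** 0.5)
    print(n, np.mean(np.abs(Z) <= 1.0))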
Notation: Zn →P Z (convergence in probability).

Definition: Zn converges to Z in probability if for every ε > 0,
lim_{n→∞} P(|Zn − Z| > ε) = 0.

This indeed measures how far the values of Zn(ω) and Z(ω) are from each other.
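A small simulation of this definition (a sketch; the additive Gaussian noise model Zn = Z + N(0, 1/n) is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)

# Zn = Z + Gaussian noise with variance 1/n: the probability that Zn is
# farther than eps from Z shrinks to 0, i.e. Zn -> Z in probability.
eps = 0.1
Z = rng.standard_normal(100_000)

for n in (10, 100, 1000, 10_000):
    Zn = Z + rng.standard_normal(100_000) / np.sqrt(n)
    print(n, np.mean(np.abs(Zn - Z) > eps))   # -> 0 as n grows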
Notation: Zn →a.s. Z (almost sure, or strong, convergence).

Definition: Zn converges to Z almost surely if
P(ω : lim_{n→∞} Zn(ω) = Z(ω)) = 1.

Definition: Zn converges to Z in Lp norm if
lim_{n→∞} E[|Zn − Z|^p] = 0.
Properties: almost sure convergence ⇒ convergence in probability ⇒ convergence in distribution, and convergence in Lp norm ⇒ convergence in probability. The reverse implications do not hold in general.
Useful tools!
Markov's inequality: If X is any nonnegative random variable and a > 0, then
P(X ≥ a) ≤ E[X] / a.

Proof: Decompose the expectation:
E[X] = E[X·1{X ≥ a}] + E[X·1{X < a}] ≥ E[X·1{X ≥ a}] ≥ a·P(X ≥ a),
and divide both sides by a.

Corollary: Chebyshev's inequality.
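A Monte Carlo sanity check of Markov's inequality (a sketch; the Exponential(1) distribution is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)

# Check P(X >= a) <= E[X]/a on a nonnegative variable, X ~ Exponential(1).
X = rng.exponential(scale=1.0, size=1_000_000)   # E[X] = 1

for a in (1.0, 2.0, 5.0):
    print(a, np.mean(X >= a), "<=", X.mean() / a)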
Chebyshev's inequality: If X is any random variable with finite variance and a > 0, then
P(|X − E[X]| ≥ a) ≤ Var(X) / a².

Proof: Apply Markov's inequality to the nonnegative random variable (X − E[X])² with threshold a². Here Var(X) is the variance of X, defined as:
Var(X) = E[(X − E[X])²].
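And the analogous check for Chebyshev's inequality (a sketch; the standard normal choice for X is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)

# Check P(|X - E[X]| >= a) <= Var(X)/a^2 with X ~ N(0,1), so Var(X) = 1.
X = rng.standard_normal(1_000_000)

for a in (1.0, 2.0, 3.0):
    print(a, np.mean(np.abs(X) >= a), "<=", 1 / a**2)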
Chebyshev's inequality has several variants:
– asymmetric two-sided case (X has an asymmetric distribution)
– symmetric two-sided case (X has a symmetric distribution)
There are lots of other generalizations, for example for multivariate X.
Markov's and Chebyshev's inequalities extend to higher moments:
P(|X| ≥ a) ≤ E[|X|^n] / a^n, where n ≥ 1.

Other functions instead of polynomials? The exp function gives, for any t > 0,
P(X ≥ a) ≤ E[e^{tX}] / e^{ta}.

Proof: Since t > 0, the event X ≥ a is the same as e^{tX} ≥ e^{ta}, so the bound follows from Markov's inequality applied to the nonnegative random variable e^{tX}.
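To see why the exponential version is worth having, the sketch below compares the bounds on a Gaussian tail (the N(0,1) choice, and using E[e^{tX}] = e^{t²/2} with the optimal t = a, are illustrative assumptions):

import numpy as np
from math import exp

rng = np.random.default_rng(0)

# For X ~ N(0,1): true tail P(X >= a), the Chebyshev bound 1/a^2
# (valid since P(X >= a) <= P(|X| >= a)), and the exponential-moment
# bound min_t E[e^{tX}]/e^{ta} = e^{-a^2/2}, attained at t = a.
X = rng.standard_normal(1_000_000)

for a in (1.0, 2.0, 3.0):
    print(f"a={a}: tail={np.mean(X >= a):.4f}, "
          f"Chebyshev={1 / a**2:.4f}, exp bound={exp(-a**2 / 2):.4f}")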
Chebyshev's inequality is good enough to study the question: Do the empirical averages converge to the true mean?

Answer: Yes, they do. (Law of large numbers)
Weak Law of Large Numbers: the empirical average (X1 + … + Xn)/n of i.i.d. samples with mean μ converges to μ in probability.

Strong Law of Large Numbers: the empirical average converges to μ almost surely.

Proof I (of the weak law): Assume finite variance σ². (Not very important.) The average has mean μ and variance σ²/n, so by Chebyshev's inequality, for any ε > 0,
P(|(X1 + … + Xn)/n − μ| ≥ ε) ≤ σ² / (nε²).
Therefore,
P(|(X1 + … + Xn)/n − μ| < ε) ≥ 1 − σ² / (nε²).
As n approaches infinity, this expression approaches 1.
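A quick empirical look at this bound (a sketch; the fair-coin Bernoulli source with σ² = 1/4 and the trial counts are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Compare the observed frequency of |average - mu| >= eps for fair coin
# flips against the Chebyshev bound sigma^2 / (n eps^2).
eps, mu, var = 0.05, 0.5, 0.25

for n in (100, 1000, 10_000):
    avgs = rng.binomial(n, mu, size=20_000) / n   # 20k empirical averages
    print(n, np.mean(np.abs(avgs - mu) >= eps), "<=", var / (n * eps**2))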
Theory:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm

Applications:
– Law of large numbers
– Central limit theorem

Tools:
– Markov and Chebyshev inequalities