Introduction to Machine Learning CMU-10701
Stochastic Convergence and Tail Bounds
Barnabás Póczos
Basic Estimation Theory

Rolling a die: estimation of the parameters θ1, θ2, …, θ6
[Figure: MLE estimates of the die parameters θ1, …, θ6 after 12, 24, 60, and 120 rolls]
Does the MLE estimate converge to the right value? How fast does it converge?
Does the empirical average converge to the true mean? How fast does it converge?
[Figure: 5 sample traces of the empirical average as the number of samples grows]
How fast do they converge?
I want to know the coin parameter θ ∈ [0, 1] within ε = 0.1 error, with probability at least 1 − δ = 0.95.
How many flips do I need?
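This question can be explored with a quick Monte Carlo sketch (the fair-coin assumption p = 0.5 and the helper name `prob_within` are illustrative choices, not from the slides):

```python
import random

def prob_within(n_flips, p=0.5, eps=0.1, trials=2000, seed=0):
    """Estimate P(|p_hat - p| <= eps) over repeated coin-flip experiments."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _ in range(n_flips))
        if abs(heads / n_flips - p) <= eps:
            hits += 1
    return hits / trials

# Success probability grows with the number of flips
for n in (10, 50, 100, 200):
    print(n, prob_within(n))
```

In this simulation, a couple hundred flips already keep the estimate within ε = 0.1 of p far more than 95% of the time; the tail bounds later in the deck make this quantitative.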
Theory:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm
– Law of large numbers
– Central limit theorem
– Markov, Chebyshev
Notation:
Let {Z, Z1, Z2, …} be a sequence of random variables.
Definition (convergence in distribution): Zn converges to Z in distribution if
lim_{n→∞} F_{Zn}(t) = F_Z(t)
at every point t where F_Z is continuous. This is the “weakest” convergence.
Only the distribution functions converge! (NOT the values of the random variables)
Continuity is important!
Example: Let Zn be a continuous random variable supported on [0, 1/n] (e.g., uniform on [0, 1/n]).
Proof: The limit random variable is the constant 0: for any fixed t > 0, F_{Zn}(t) = 1 once 1/n < t, and F_{Zn}(t) = 0 for t < 0.
In this example the limit Z is discrete, not random (constant 0), although each Zn is a continuous random variable. At the discontinuity point t = 0 we have F_{Zn}(0) = 0 for every n while F_Z(0) = 1, which is why convergence is only required at the continuity points of F_Z.
[Figure: the CDF of Zn rising from 0 to 1 on the interval [0, 1/n]]
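Taking Zn uniform on [0, 1/n] as a concrete instance (an illustrative choice), its CDF can be checked numerically:

```python
def cdf_zn(t, n):
    """CDF of Zn ~ Uniform(0, 1/n): rises linearly from 0 to 1 on [0, 1/n]."""
    return min(max(n * t, 0.0), 1.0)

# At any fixed t > 0 the CDFs tend to F_Z(t) = 1 (the CDF of the constant 0)...
for n in (1, 10, 100, 10000):
    print(n, cdf_zn(0.05, n))

# ...but at the discontinuity point t = 0 they stay at 0, not F_Z(0) = 1.
print(cdf_zn(0.0, 10000))
```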
Properties
Scheffé's theorem:
convergence of the probability density functions ⇒ convergence in distribution
Example:
(Central Limit Theorem)
Zn and Z can still be independent even if their distributions are the same!
Notation: Zn → Z in probability.
Definition: Zn converges to Z in probability if for every ε > 0,
lim_{n→∞} P(|Zn − Z| > ε) = 0.
This indeed measures how far the values of Zn(ω) and Z(ω) are from each other.
Notation: Zn → Z almost surely.
Definition (almost sure / strong convergence): P(lim_{n→∞} Zn = Z) = 1.
Definition (convergence in Lp norm): lim_{n→∞} E[|Zn − Z|^p] = 0.
Properties: almost sure convergence ⇒ convergence in probability ⇒ convergence in distribution; convergence in Lp norm ⇒ convergence in probability.
Useful tools!
Markov's inequality: If X is a nonnegative random variable and a > 0, then
P(X ≥ a) ≤ E[X] / a.
Proof: decompose the expectation:
E[X] ≥ E[X · 1{X ≥ a}] ≥ a · P(X ≥ a).
Corollary: Chebyshev's inequality
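A small numerical sanity check of Markov's inequality, using X ~ Exp(1) (an illustrative choice with E[X] = 1; the helper name is not from the slides):

```python
import random

def markov_check(a, trials=100000, seed=1):
    """Compare the tail P(X >= a) with the Markov bound E[X]/a for X ~ Exp(1)."""
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(trials)]
    tail = sum(x >= a for x in xs) / trials       # empirical P(X >= a)
    bound = (sum(xs) / trials) / a                # empirical E[X] / a
    return tail, bound

for a in (1.0, 2.0, 5.0):
    tail, bound = markov_check(a)
    print(a, round(tail, 4), round(bound, 4))
```

The true tail e^(−a) sits well below the 1/a bound, illustrating that Markov is valid but often loose.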
Chebyshev's inequality: If X is any random variable and a > 0, then
P(|X − E[X]| ≥ a) ≤ Var(X) / a².
Proof: apply Markov's inequality to the nonnegative variable (X − E[X])². Here Var(X) is the variance of X, defined as Var(X) = E[(X − E[X])²].
Chebyshev's inequality, variants:
– asymmetric two-sided case (X has an asymmetric distribution)
– symmetric two-sided case (X has a symmetric distribution), which is equivalent to the standard two-sided form
There are lots of other generalizations, for example for multivariate X.
Generalizations of Chebyshev and Markov with higher moments:
P(|X − E[X]| ≥ a) ≤ E[|X − E[X]|^n] / a^n, where n ≥ 1.
Other functions instead of polynomials? The exp function:
P(X ≥ a) ≤ E[e^{tX}] / e^{ta} for every t > 0.
Proof: e^{tx} is monotone increasing in x, so P(X ≥ a) = P(e^{tX} ≥ e^{ta}), and the Markov inequality applies to the nonnegative variable e^{tX}.
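To see how much the exponential (Chernoff-style) bound can gain over the first- and second-moment bounds, here is a sketch for P(S ≥ 75) with S ~ Binomial(100, 1/2); the grid search over t is a simple illustrative choice:

```python
import math

# Bound P(S >= 75) for S ~ Binomial(100, 1/2); E[S] = 50, Var(S) = 25.
n, a = 100, 75
mean, var = n / 2, n / 4

markov = mean / a                      # first moment only
chebyshev = var / (a - mean) ** 2      # second moment
# Chernoff: min over t > 0 of E[e^{tS}] / e^{ta}, with E[e^{tS}] = ((1+e^t)/2)^n
chernoff = min(((1 + math.exp(t)) / 2) ** n / math.exp(t * a)
               for t in (i / 1000 for i in range(1, 3000)))

print(markov, chebyshev, chernoff)
```

Each extra piece of information tightens the bound by orders of magnitude: the exponential bound is astronomically smaller than Markov's 2/3.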
Chebyshev's inequality is good enough to study the question:
Do the empirical averages converge to the true mean?
Answer: Yes, they do. (Law of large numbers)
Weak Law of Large Numbers: the empirical average converges in probability to the true mean μ.
Strong Law of Large Numbers: the empirical average converges almost surely to the true mean μ.
Proof I (weak law):
Assume finite variance σ². (Not a crucial restriction.) By Chebyshev's inequality, the empirical average Ȳn of n i.i.d. samples satisfies
P(|Ȳn − μ| ≥ ε) ≤ σ² / (n ε²).
Therefore, as n approaches infinity, P(|Ȳn − μ| < ε) approaches 1.
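The Chebyshev step of the proof can be checked by simulation, here with Bernoulli(1/2) samples (σ² = 0.25); the helper name and parameter values are illustrative:

```python
import random

def tail_prob(n, eps=0.1, trials=2000, seed=2):
    """Estimate P(|average of n Bernoulli(1/2) draws - 1/2| >= eps)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        m = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(m - 0.5) >= eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000):
    simulated = tail_prob(n)
    chebyshev_bound = 0.25 / (n * 0.1 ** 2)   # sigma^2 / (n * eps^2)
    print(n, simulated, min(chebyshev_bound, 1.0))
```

The simulated tail probability always sits below the Chebyshev bound and shrinks to 0 as n grows, exactly as the weak law asserts.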
Fourier transform Inverse Fourier transform
Other conventions: where to put the 2π?
– Not preferred: not a unitary transform; it doesn't preserve the inner product.
– Preferred: the unitary versions of the transform.
Fourier transform: F(ω) = (1/√(2π)) ∫ f(x) e^{−iωx} dx
Inverse Fourier transform: f(x) = (1/√(2π)) ∫ F(ω) e^{iωx} dω
Inverse is really inverse:
Properties:
and lots of other important ones…
The Fourier transform will be used to define the characteristic function and to represent distributions in an alternative way.
How can we describe a random variable? The characteristic function provides an alternative way of describing a random variable.
Definition: the characteristic function of X is the Fourier transform of its density:
φX(t) = E[e^{itX}]
Properties
– The characteristic function always exists; for example, the Cauchy distribution doesn't have a mean but still has a characteristic function.
– It is continuous on the entire space, even if X is not continuous.
– It is bounded, even if X is not bounded.
– Lévy's continuity theorem.
– The characteristic function of the constant a is φ(t) = e^{ita}.
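A sketch that checks one concrete characteristic function empirically, assuming X ~ N(0, 1), whose characteristic function is φ(t) = e^{−t²/2} (the helper name is illustrative):

```python
import cmath, math, random

def empirical_cf(t, samples):
    """Empirical characteristic function: the average of e^{i t X}."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

rng = random.Random(3)
xs = [rng.gauss(0.0, 1.0) for _ in range(50000)]

for t in (0.0, 0.5, 1.0, 2.0):
    exact = math.exp(-t * t / 2)      # cf of N(0,1): phi(t) = e^{-t^2/2}
    print(t, abs(empirical_cf(t, xs) - exact))
```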
Proof II:
Taylor's theorem for complex functions, applied to the characteristic function.
Properties of characteristic functions + Lévy's continuity theorem ⇒ the limit is the constant distribution concentrated at the mean.
Gauss–Markov: doesn't give a rate. Chebyshev: gives a rate, but the required sample size scales as Var(X)/(ε²δ), i.e., polynomially in 1/δ.
Can we get a smaller, logarithmic dependence on δ (for an error guarantee that holds with probability 1 − δ)?
More useful tools!
Hoeffding's inequality: Let X1, …, Xn be independent random variables with ai ≤ Xi ≤ bi. Then for Sn = X1 + … + Xn,
P(|Sn − E[Sn]| ≥ t) ≤ 2 exp(−2t² / Σi (bi − ai)²).
It only contains the range of the variables, but not the variances.
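Applying Hoeffding's inequality to the earlier coin question (ε = 0.1, δ = 0.05): solving 2·exp(−2nε²) ≤ δ gives n ≥ ln(2/δ)/(2ε²). A sketch (the helper name is illustrative):

```python
import math

def hoeffding_n(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta ([0,1]-valued samples)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# The coin question from the start: error eps = 0.1 with probability 1 - 0.05
print(hoeffding_n(0.1, 0.05))
```

Note the logarithmic dependence on 1/δ, in contrast with Chebyshev's polynomial 1/δ.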
A few minutes of calculations.
Bernstein's inequality involves the variances too, and can give tighter bounds than Hoeffding's.
Benett’s inequality ) Bernstein’s inequality.
41
Proof:
It follows that
http://en.wikipedia.org/wiki/Hoeffding's_inequality
http://en.wikipedia.org/wiki/Doob_martingale (McDiarmid)
http://en.wikipedia.org/wiki/Bennett%27s_inequality
http://en.wikipedia.org/wiki/Markov%27s_inequality
http://en.wikipedia.org/wiki/Chebyshev%27s_inequality
http://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)
Lindeberg–Lévy CLT:
Lyapunov CLT: (+ some other conditions)
Generalizations: multidimensional versions, time processes
[Figure: histograms of the unscaled and the scaled sums]
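A CLT sketch by simulation: standardized sums of Uniform(0,1) draws compared against the N(0,1) CDF (n = 30 and the sample counts are illustrative choices):

```python
import math, random

def standardized_mean(n, rng):
    """(S_n - E[S_n]) / sqrt(Var(S_n)) for S_n = sum of n Uniform(0,1) draws."""
    s = sum(rng.random() for _ in range(n))
    return (s - n / 2) / math.sqrt(n / 12)   # E[S_n] = n/2, Var(S_n) = n/12

rng = random.Random(4)
zs = [standardized_mean(30, rng) for _ in range(20000)]

def normal_cdf(t):
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

for t in (-1.0, 0.0, 1.0):
    empirical = sum(z <= t for z in zs) / len(zs)
    print(t, round(empirical, 3), round(normal_cdf(t), 3))
```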
From the Taylor series of the characteristic function around 0 + properties of characteristic functions: Lévy's continuity theorem + uniqueness ⇒ CLT.
Berry-Esseen Theorem
Independently discovered by A. C. Berry (in 1941) and C.-G. Esseen (1942)
CLT: it doesn't tell us anything about the convergence rate. The Berry–Esseen theorem does: if E[X1] = 0, E[X1²] = σ² > 0, and E[|X1|³] = ρ < ∞, then the CDF Fn of the standardized sum satisfies
sup_x |Fn(x) − Φ(x)| ≤ C ρ / (σ³ √n)
for an absolute constant C.
Next time we will continue with these questions:
– How good are the ML algorithms on unknown test sets?
– How many training samples do we need to achieve small error?
– What is the smallest possible error we can achieve?
Experiment
How many trials do we need to decide which page attracts more clicks?
Assume that in group A, p(click|A) = 0.10 and p(noclick|A) = 0.90.
Assume that in group B, p(click|B) = 0.11 and p(noclick|B) = 0.89.
Assume also that we know these probabilities in group A, but we don't know them yet in group B.
Let us simplify the question a bit: we want to estimate p(click|B) with less than 0.01 error.
Chebyshev:
– Worst case: a random variable with values in [0, 1] has variance at most 0.25.
– Since p(click|B) ≈ 0.11 ≤ 0.15, we may use the bound Var ≤ 0.15 · 0.85 = 0.1275. This requires at least 25,500 users.
The click indicators take values in [0, 1], hence c = 1.
This is better than Chebyshev.
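The two sample-size computations from these slides side by side (ε = 0.01, δ = 0.05, and the slides' variance bound 0.15 · 0.85 = 0.1275):

```python
import math

eps, delta = 0.01, 0.05

# Chebyshev: n >= Var / (eps^2 * delta), with the variance bound 0.1275
n_chebyshev = 0.1275 / (eps ** 2 * delta)
# Hoeffding: 2 * exp(-2 * n * eps**2) <= delta
n_hoeffding = math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(round(n_chebyshev), n_hoeffding)
```

Hoeffding needs roughly 18,445 users instead of Chebyshev's 25,500, thanks to the logarithmic dependence on 1/δ.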
Theory:
– Weak convergence = convergence in distribution
– Convergence in probability
– Strong convergence (almost surely)
– Convergence in Lp norm
– Law of large numbers
– Central limit theorem
– Markov, Chebyshev
– Hoeffding, Bennett, McDiarmid