Estimation: Sample Averages, Bias, and Concentration Inequalities



SLIDE 1

Estimation: Sample Averages, Bias, and Concentration Inequalities

CMPUT 296: Basics of Machine Learning

Textbook §3.1-3.3

SLIDE 2

Logistics

Reminders:

  • Thought Question 1 (due Thursday, September 17)
  • Assignment 1 (due Thursday, September 24)

New:

  • Group Slack channel: #cmput296-fall20 (on Amii workspace)
SLIDE 3

Recap

  • Random variables are functions from the sample space to some value
  • Upshot: A random variable takes different values with some probability
  • The value of one variable can be informative about the value of another (because they are both functions of the same sample)
  • Distributions of multiple random variables are described by the joint probability distribution (joint PMF or joint PDF)
  • Conditioning on a random variable gives a new distribution over others
  • X is independent of Y: conditioning on X does not give a new distribution over Y
  • X is conditionally independent of Y given Z: P(Y ∣ X, Z) = P(Y ∣ Z)
  • The expected value of a random variable is an average over its values, weighted by the probability of each value

SLIDE 4

Outline

  • 1. Recap & Logistics
  • 2. Variance and Correlation
  • 3. Estimators
  • 4. Concentration Inequalities
  • 5. Consistency
SLIDE 5

Variance

Definition: The variance of a random variable X is

Var(X) = 𝔼[(X − 𝔼[X])²],

i.e., 𝔼[f(X)] where f(x) = (x − 𝔼[X])². Equivalently (why?),

Var(X) = 𝔼[X²] − (𝔼[X])².
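The two forms of the variance can be checked numerically. A minimal sketch, using a small made-up discrete distribution (the values and probabilities below are illustrative, not from the slides):

```python
# Check Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
# for a small discrete distribution.
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

mean = sum(p * x for x, p in zip(values, probs))
var_def = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))   # definition
var_alt = sum(p * x**2 for x, p in zip(values, probs)) - mean**2    # identity

assert abs(var_def - var_alt) < 1e-9
```

Expanding (x − 𝔼[X])² and using linearity of expectation is exactly the "(why?)" step above.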

SLIDE 6

Covariance

Definition: The covariance of two random variables X and Y is

Cov(X, Y) = 𝔼[(X − 𝔼[X])(Y − 𝔼[Y])] = 𝔼[XY] − 𝔼[X]𝔼[Y].

Question: What is the range of Cov(X, Y)?

SLIDE 7

Correlation

Definition: The correlation of two random variables X and Y is

Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y)).

Question: What is the range of Corr(X, Y)? (hint: Var(X) = Cov(X, X))
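One way to build intuition for the range question is to compute an empirical correlation. A sketch, with a hypothetical pair Y = 2X + Z (X and Z standard normal; not an example from the slides):

```python
import random
import statistics

# Empirical correlation of X and a noisy linear function of X.
random.seed(4)
xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

mx, my = statistics.fmean(xs), statistics.fmean(ys)
cov = statistics.fmean((x - mx) * (y - my) for x, y in zip(xs, ys))
corr = cov / (statistics.stdev(xs) * statistics.stdev(ys))

assert -1.0 <= corr <= 1.0   # correlation always lies in [-1, 1]
```

Here the true correlation is 2/√5 ≈ 0.894; the estimate lands near that, and no choice of X and Y can push it outside [−1, 1].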

SLIDE 8

Independence and Decorrelation

  • Independent RVs have zero correlation (why?)
    hint: Cov[X, Y] = 𝔼[XY] − 𝔼[X]𝔼[Y]
  • Uncorrelated RVs (i.e., Cov(X, Y) = 0) might be dependent (i.e., p(x, y) ≠ p(x)p(y))
  • Correlation (Pearson's correlation coefficient) captures linear relationships, but can miss nonlinear relationships
  • Example: X ∼ Uniform{−2, −1, 0, 1, 2}, Y = X²
  • 𝔼[XY] = .2(−2 × 4) + .2(2 × 4) + .2(−1 × 1) + .2(1 × 1) + .2(0 × 0) = 0, and 𝔼[X] = 0
  • So Cov(X, Y) = 𝔼[XY] − 𝔼[X]𝔼[Y] = 0 − 0 ⋅ 𝔼[Y] = 0, even though Y is completely determined by X
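The uncorrelated-but-dependent example above can be verified exactly, since the distribution is a small finite one:

```python
# X uniform on {-2,-1,0,1,2}, Y = X^2: zero covariance, yet dependent.
xs = [-2, -1, 0, 1, 2]

e_x = sum(xs) / len(xs)                      # E[X]   = 0
e_y = sum(x**2 for x in xs) / len(xs)        # E[Y]   = E[X^2] = 2
e_xy = sum(x * x**2 for x in xs) / len(xs)   # E[XY]  = E[X^3] = 0
cov = e_xy - e_x * e_y

assert cov == 0.0
# Dependence: P(Y = 4 | X = 2) = 1, but P(Y = 4) = 2/5,
# so p(x, y) != p(x) p(y) even though Cov(X, Y) = 0.
```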

SLIDE 9

Properties of Variances

  • Var[c] = 0 for constant c
  • Var[cX] = c²Var[X] for constant c
  • Var[X + Y] = Var[X] + Var[Y] + 2Cov[X, Y]
  • For independent X, Y: Var[X + Y] = Var[X] + Var[Y] (why?)
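These properties can be checked by simulation. A sketch with a hypothetical choice of distributions (X ∼ N(0, 1), Y ∼ N(0, 2²) independent, c = 3; none of these numbers come from the slides):

```python
import random
import statistics

# Numerically check Var[cX] = c^2 Var[X] and, for independent X and Y,
# Var[X + Y] = Var[X] + Var[Y].
random.seed(5)
N = 100_000
xs = [random.gauss(0, 1) for _ in range(N)]   # Var[X] = 1
ys = [random.gauss(0, 2) for _ in range(N)]   # Var[Y] = 4, independent of X

var_cx = statistics.variance(3 * x for x in xs)               # ~ 9 = 3^2 * 1
var_sum = statistics.variance(x + y for x, y in zip(xs, ys))  # ~ 5 = 1 + 4
```

For dependent X and Y the second check would fail by exactly the 2Cov[X, Y] term.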

SLIDE 10

Estimators

Definition: An estimator is a procedure for estimating an unobserved quantity based on data. (Note: an estimator is itself a random variable!)

Example: Estimating 𝔼[X] for r.v. X ∈ ℝ.

Questions: Suppose we can observe a different variable Y. Is Y a good estimator of 𝔼[X] in the following cases? Why or why not?

1. Y ∼ Uniform[0, 10]
2. Y = 𝔼[X] + Z, where Z ∼ Uniform[0, 1]
3. Y = 𝔼[X] + Z, where Z ∼ N(0, 100²)
4. Y = X
5. How would you estimate 𝔼[X]?

SLIDE 11

Bias

Definition: The bias of an estimator X̂ is its expected difference from the true value of the estimated quantity X:

Bias(X̂) = 𝔼[X̂ − X].

  • Bias can be positive, negative, or zero
  • When Bias(X̂) = 0, we say that the estimator X̂ is unbiased

Questions: What is the bias of the following estimators of 𝔼[X]?

1. Y ∼ Uniform[0, 10]
2. Y = 𝔼[X] + Z, where Z ∼ Uniform[0, 1]
3. Y = 𝔼[X] + Z, where Z ∼ N(0, 100²)
4. Y = X

SLIDE 12

Independent and Identically Distributed (i.i.d.) Samples

  • We usually won't try to estimate anything about a distribution based on only a single sample
  • Usually, we use multiple samples from the same distribution
  • Multiple samples: This gives us more information
  • Same distribution: We want to learn about a single population
  • One additional condition: the samples must be independent (why?)

Definition: When a set of random variables X1, X2, … are all independent, and each has the same distribution F, we say they are i.i.d. (independent and identically distributed), written X1, X2, … ∼ F i.i.d.

SLIDE 13

Estimating Expected Value via the Sample Mean

Example: We have n i.i.d. samples from the same distribution F: X1, X2, …, Xn ∼ F (i.i.d.), with 𝔼[Xi] = μ and Var(Xi) = σ² for each Xi. We want to estimate μ.

Let's use the sample mean X̄ = (1/n) ∑_{i=1}^{n} Xi to estimate μ.

Question: Is this estimator unbiased?

𝔼[X̄] = 𝔼[(1/n) ∑_{i=1}^{n} Xi] = (1/n) ∑_{i=1}^{n} 𝔼[Xi] = (1/n) ∑_{i=1}^{n} μ = (1/n) nμ = μ. ∎

Question: Are more samples better? Why?
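The unbiasedness claim can be seen empirically: average many independent sample means and compare against μ. A sketch with a hypothetical skewed distribution (exponential with mean μ = 2; the numbers are illustrative):

```python
import random
import statistics

# Average 5000 independent sample means (each from n = 20 draws) and
# compare to the true mean mu. The average deviation should be near 0.
random.seed(3)
MU, n, reps = 2.0, 20, 5000

means = [
    statistics.fmean(random.expovariate(1 / MU) for _ in range(n))
    for _ in range(reps)
]
bias_hat = statistics.fmean(means) - MU   # close to 0: X̄ is unbiased
```

Note that each individual X̄ is still random; unbiasedness only says its expectation equals μ.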

SLIDE 14

Variance of the Estimator

  • Intuitively, more samples should make the estimator "closer" to the estimated quantity
  • We can formalize this intuition partly by characterizing the variance Var[X̂] of the estimator itself
  • The variance of the estimator should decrease as the number of samples increases
  • Example: for X̄ estimating μ:

Var[X̄] = Var[(1/n) ∑_{i=1}^{n} Xi] = (1/n²) Var[∑_{i=1}^{n} Xi] = (1/n²) ∑_{i=1}^{n} Var[Xi] (by independence) = (1/n²) ∑_{i=1}^{n} σ² = (1/n²) nσ² = (1/n) σ²

  • The variance of the estimator shrinks linearly as the number of samples grows.
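The 1/n shrinkage is easy to see in simulation. A sketch, assuming a hypothetical Xi ∼ N(0, 2²) so that σ² = 4 (illustrative numbers only):

```python
import random
import statistics

# Simulate the sample mean for several n and compare its empirical
# variance (over 2000 repetitions) to the predicted sigma^2 / n.
random.seed(0)
SIGMA2 = 4.0

def sample_mean(n):
    return statistics.fmean(random.gauss(0, 2) for _ in range(n))

emp_var = {
    n: statistics.variance(sample_mean(n) for _ in range(2000))
    for n in (10, 100, 1000)
}
for n, v in emp_var.items():
    print(f"n={n:5d}  empirical Var={v:.4f}  predicted={SIGMA2 / n:.4f}")
```

Each tenfold increase in n cuts the estimator's variance by roughly a factor of ten, matching Var[X̄] = σ²/n.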

SLIDE 15

Concentration Inequalities

  • We would like to be able to claim Pr(|X̄ − μ| < ϵ) > 1 − δ for some δ, ϵ > 0
  • Var[X̄] = (1/n) σ² means that with "enough" data, Pr(|X̄ − μ| < ϵ) > 1 − δ for any δ, ϵ > 0 that we pick (why?)
  • Suppose we have n = 10 samples, and we know σ² = 81; so Var[X̄] = 8.1.
  • Question: What is Pr(|X̄ − μ| < 2)?

SLIDE 16

Variance Is Not Enough

Knowing Var[X̄] = 8.1 is not enough to compute Pr(|X̄ − μ| < 2)! Examples:

p(x̄) = 0.9 if x̄ = μ, 0.05 if x̄ = μ ± 9  ⟹  Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.9

p(x̄) = 0.999 if x̄ = μ, 0.0005 if x̄ = μ ± 90  ⟹  Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.999

p(x̄) = 0.1 if x̄ = μ, 0.45 if x̄ = μ ± 3  ⟹  Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.1

SLIDE 17

Hoeffding's Inequality

Theorem: Hoeffding's Inequality
Suppose that X1, …, Xn are distributed i.i.d., with a ≤ Xi ≤ b. Then for any ϵ > 0,

Pr(|X̄ − 𝔼[X̄]| ≥ ϵ) ≤ 2 exp(−2nϵ² / (b − a)²).

Equivalently,

Pr(|X̄ − 𝔼[X̄]| ≤ (b − a) √(ln(2/δ) / (2n))) ≥ 1 − δ.
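The bound can be checked against simulation. A sketch, assuming a hypothetical bounded variable (fair coin flips, so Xi ∈ [0, 1]; n, ϵ, and the trial count are illustrative):

```python
import math
import random

# Empirical frequency of |X̄ - E[X̄]| >= eps for n fair coin flips,
# compared against the Hoeffding bound 2*exp(-2*n*eps^2 / (b-a)^2).
random.seed(1)
n, eps, trials = 50, 0.15, 4000
a, b = 0, 1

bound = 2 * math.exp(-2 * n * eps**2 / (b - a) ** 2)

hits = sum(
    abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= eps
    for _ in range(trials)
)
freq = hits / trials   # stays below the Hoeffding bound
```

The empirical frequency is well under the bound; Hoeffding holds for any distribution on [a, b], so it is typically not tight for a specific one.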

SLIDE 18

Chebyshev's Inequality

Theorem: Chebyshev's Inequality
Suppose that X1, …, Xn are distributed i.i.d. with variance σ². Then for any ϵ > 0,

Pr(|X̄ − 𝔼[X̄]| ≥ ϵ) ≤ σ² / (nϵ²).

Equivalently,

Pr(|X̄ − 𝔼[X̄]| ≤ √(σ² / (δn))) ≥ 1 − δ.

SLIDE 19

When to Use Chebyshev, When to Use Hoeffding?

  • If a ≤ Xi ≤ b, then Var[Xi] ≤ (1/4)(b − a)²
  • Hoeffding's inequality gives ϵ = (b − a) √(ln(2/δ) / (2n)) = √(ln(2/δ)/2) ⋅ (b − a) ⋅ (1/√n); Chebyshev's inequality gives ϵ = √(σ² / (δn)) ≤ √((b − a)² / (4δn)) = (1 / (2√δ)) ⋅ (b − a) ⋅ (1/√n)
  • Hoeffding's inequality gives a tighter bound*, but it can only be used on bounded random variables
    ✴ whenever √(ln(2/δ)/2) < 1/(2√δ) ⟺ δ < ∼0.232
  • Chebyshev's inequality can be applied even for unbounded variables
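The comparison can be made concrete by plugging numbers into both ϵ expressions. A sketch with hypothetical values (n = 100 samples bounded in [0, 1], δ = 0.05; chosen only for illustration):

```python
import math

# Epsilon granted at confidence 1 - delta by each inequality,
# using the worst-case variance (b - a)^2 / 4 for Chebyshev.
n, delta = 100, 0.05
b_minus_a = 1.0
sigma2_max = b_minus_a**2 / 4

eps_hoeffding = b_minus_a * math.sqrt(math.log(2 / delta) / (2 * n))
eps_chebyshev = math.sqrt(sigma2_max / (delta * n))
# Hoeffding gives the smaller (tighter) epsilon here, since
# delta = 0.05 < ~0.232.
```

Raising δ above roughly 0.232 flips the comparison, matching the condition in the starred footnote.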

SLIDE 20

Consistency

Definition: A sequence of random variables X1, X2, … converges in probability to a random variable X (written Xn →p X) if for all ϵ > 0,

lim_{n→∞} Pr(|Xn − X| > ϵ) = 0.

Definition: An estimator X̂ for a quantity X is consistent if X̂ →p X.

SLIDE 21

Weak Law of Large Numbers

Theorem: Weak Law of Large Numbers
Let X1, …, Xn be distributed i.i.d. with 𝔼[Xi] = μ and Var[Xi] = σ². Then the sample mean X̄ = (1/n) ∑_{i=1}^{n} Xi is a consistent estimator for μ.

Proof:

  • 1. We have already shown that 𝔼[X̄] = μ
  • 2. By Chebyshev, Pr(|X̄ − 𝔼[X̄]| ≥ ϵ) ≤ σ² / (nϵ²) for arbitrary ϵ > 0
  • 3. Hence lim_{n→∞} Pr(|X̄ − μ| ≥ ϵ) = 0 for any ϵ > 0
  • 4. Hence X̄ →p μ. ∎
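Convergence in probability can be watched directly: fix ϵ and track how often |X̄ − μ| > ϵ as n grows. A sketch under a hypothetical setup (Xi ∼ Uniform[0, 1], so μ = 0.5; ϵ and trial counts are illustrative):

```python
import random

# Frequency of |X̄ - mu| > eps across 2000 simulated sample means,
# for increasing sample sizes n.
random.seed(2)
MU, EPS, TRIALS = 0.5, 0.05, 2000

def dev_freq(n):
    bad = sum(
        abs(sum(random.random() for _ in range(n)) / n - MU) > EPS
        for _ in range(TRIALS)
    )
    return bad / TRIALS

freqs = {n: dev_freq(n) for n in (10, 100, 1000)}
# The deviation frequency shrinks toward 0 as n grows, as the WLLN predicts.
```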

SLIDE 22

Summary

  • The variance Var[X] of a random variable X is its expected squared distance from the mean
  • An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data
  • Concentration inequalities let us bound the probability of a given estimator being at least ϵ away from the estimated quantity
  • An estimator is consistent if it converges in probability to the estimated quantity