

1. Estimation: Sample Averages, Bias, and Concentration Inequalities
CMPUT 296: Basics of Machine Learning
Textbook §3.1-3.3

2. Logistics
Reminders:
• Thought Question 1 (due Thursday, September 17)
• Assignment 1 (due Thursday, September 24)
New:
• Group Slack channel: #cmput296-fall20 (on Amii workspace)

3. Recap
• Random variables are functions from the sample space to some value
• Upshot: A random variable takes different values with some probability
• The value of one variable can be informative about the value of another (because they are both functions of the same sample)
• Distributions of multiple random variables are described by the joint probability distribution (joint PMF or joint PDF)
• Conditioning on a random variable gives a new distribution over the others
• X is independent of Y: conditioning on Y does not give a new distribution over X
• X is conditionally independent of Y given Z: P(Y | X, Z) = P(Y | Z)
• The expected value of a random variable is an average over its values, weighted by the probability of each value

4. Outline
1. Recap & Logistics
2. Variance and Correlation
3. Estimators
4. Concentration Inequalities
5. Consistency

5. Variance
Definition: The variance of a random variable X is
Var(X) = 𝔼[(X − 𝔼[X])²],
i.e., 𝔼[f(X)] where f(x) = (x − 𝔼[X])².
Equivalently, Var(X) = 𝔼[X²] − (𝔼[X])² (why?)
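
As a quick sanity check, here is a minimal numerical sketch (not from the slides) verifying that the two forms of the variance agree; the Exponential distribution and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any distribution would do

# Definitional form: E[(X - E[X])^2]
var_def = np.mean((x - x.mean()) ** 2)
# Equivalent form: E[X^2] - (E[X])^2
var_alt = np.mean(x ** 2) - x.mean() ** 2

print(var_def, var_alt)  # both approach sigma^2 = 4 for Exponential(scale=2)
```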

6. Covariance
Definition: The covariance of two random variables X and Y is
Cov(X, Y) = 𝔼[(X − 𝔼[X])(Y − 𝔼[Y])] = 𝔼[XY] − 𝔼[X]𝔼[Y].
Question: What is the range of Cov(X, Y)?

7. Correlation
Definition: The correlation of two random variables X and Y is
Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
Question: What is the range of Corr(X, Y)? (hint: Var(X) = Cov(X, X))

8. Independence and Decorrelation
• Independent RVs have zero correlation (why?) (hint: Cov[X, Y] = 𝔼[XY] − 𝔼[X]𝔼[Y])
• Uncorrelated RVs (i.e., Cov(X, Y) = 0) might still be dependent (i.e., p(x, y) ≠ p(x)p(y))
• Correlation (Pearson's correlation coefficient) captures linear relationships, but can miss nonlinear relationships
• Example: X ∼ Uniform{−2, −1, 0, 1, 2}, Y = X²
  • 𝔼[XY] = 0.2(−2 × 4) + 0.2(2 × 4) + 0.2(−1 × 1) + 0.2(1 × 1) + 0.2(0 × 0) = 0
  • 𝔼[X] = 0
  • So Cov(X, Y) = 𝔼[XY] − 𝔼[X]𝔼[Y] = 0 − 0 ⋅ 𝔼[Y] = 0, even though Y is a deterministic function of X
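
The example above can be checked directly; a small sketch (the arrays and probabilities just encode the example's PMF):

```python
import numpy as np

xs = np.array([-2, -1, 0, 1, 2])  # X ~ Uniform{-2,...,2}, each with p = 0.2
ys = xs ** 2                      # Y = X^2: fully dependent on X
p = 0.2

e_xy = np.sum(p * xs * ys)  # E[XY] = E[X^3] = 0
e_x = np.sum(p * xs)        # E[X] = 0
e_y = np.sum(p * ys)        # E[Y] = 2
print(e_xy - e_x * e_y)     # 0.0: zero covariance despite full dependence
```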

9. Properties of Variances
• Var[c] = 0 for constant c
• Var[cX] = c² Var[X] for constant c
• Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]
• For independent X, Y: Var[X + Y] = Var[X] + Var[Y] (why?)
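
These properties are easy to verify by simulation. A minimal sketch, with the normal distributions and constants chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 2, size=1_000_000)  # Var[X] = 4
y = rng.normal(0, 3, size=1_000_000)  # Var[Y] = 9, independent of X

print(np.var(3 * x))   # ~36 = 3^2 * Var[X]
print(np.var(x + y))   # ~13 = Var[X] + Var[Y], since X, Y independent

z = x + 0.5 * y        # a variable correlated with x
# General rule: Var[X + Z] = Var[X] + Var[Z] + 2 Cov[X, Z]
print(np.var(x + z), np.var(x) + np.var(z) + 2 * np.cov(x, z)[0, 1])
```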

10. Estimators
Definition: An estimator is a procedure for estimating an unobserved quantity based on data.
Example: Estimating 𝔼[X] for a random variable X ∈ ℝ.
Questions: Suppose we can observe a different random variable Y. Is Y a good estimator of 𝔼[X] in the following cases? Why or why not?
1. Y ∼ Uniform[0, 10]
2. Y = 𝔼[X] + Z, where Z ∼ Uniform[0, 1]
3. Y = 𝔼[X] + Z, where Z ∼ N(0, 100²)
4. Y = X
5. How would you estimate 𝔼[X]?

11. Bias
Definition: The bias of an estimator X̂ is its expected difference from the true value of the estimated quantity X:
Bias(X̂) = 𝔼[X̂ − X]
• Bias can be positive or negative or zero
• When Bias(X̂) = 0, we say that the estimator X̂ is unbiased
Questions: What is the bias of the following estimators of 𝔼[X]?
1. Y ∼ Uniform[0, 10]
2. Y = 𝔼[X] + Z, where Z ∼ Uniform[0, 1]
3. Y = 𝔼[X] + Z, where Z ∼ N(0, 100²)
4. Y = X
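
A simulation sketch of these four estimators, assuming (hypothetically) that X ∼ N(5, 1), so 𝔼[X] = 5; note that estimator 1 is unbiased here only because 𝔼[Uniform[0, 10]] happens to equal 5:

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 1_000_000
mu = 5.0                                 # E[X], assuming X ~ N(5, 1)

y1 = rng.uniform(0, 10, trials)          # bias = 5 - mu: zero only because mu = 5
y2 = mu + rng.uniform(0, 1, trials)      # bias = E[Z] = 0.5
y3 = mu + rng.normal(0, 100, trials)     # bias = 0, but enormous variance
y4 = rng.normal(mu, 1, trials)           # Y = X itself: bias = 0, variance Var[X]

for y in (y1, y2, y3, y4):
    print(y.mean() - mu)                 # empirical bias of each estimator
```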

12. Independent and Identically Distributed (i.i.d.) Samples
• We usually won't try to estimate anything about a distribution based on only a single sample
• Usually, we use multiple samples from the same distribution
• Multiple samples: this gives us more information
• Same distribution: we want to learn about a single population
• One additional condition: the samples must be independent (why?)
Definition: When a set of random variables X₁, X₂, … are all independent, and each has the same distribution F, we say they are i.i.d. (independent and identically distributed), written X₁, X₂, … ∼ F (i.i.d.).

13. Estimating Expected Value via the Sample Mean
Example: We have n i.i.d. samples X₁, X₂, …, Xₙ ∼ F from the same distribution F, with 𝔼[Xᵢ] = μ and Var(Xᵢ) = σ² for each Xᵢ. We want to estimate μ.
Let's use the sample mean X̄ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ to estimate μ.
Question: Is this estimator unbiased?
𝔼[X̄] = 𝔼[(1/n) ∑ᵢ₌₁ⁿ Xᵢ] = (1/n) ∑ᵢ₌₁ⁿ 𝔼[Xᵢ] = (1/n) ∑ᵢ₌₁ⁿ μ = (1/n) nμ = μ. ∎
Question: Are more samples better? Why?
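
A quick empirical check of unbiasedness (a sketch with an arbitrary normal distribution; any distribution with mean μ would do):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 5.0, 9.0, 10

# Draw many independent datasets of size n and compute each sample mean
xbars = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
print(xbars.mean())  # ~5.0: the sample mean is unbiased for mu
```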

14. Variance of the Estimator
• Intuitively, more samples should make the estimator "closer" to the estimated quantity
• We can formalize this intuition partly by characterizing the variance of the estimator itself
• The variance of the estimator should decrease as the number of samples increases
• Example: for X̄ estimating μ:
Var[X̄] = Var[(1/n) ∑ᵢ₌₁ⁿ Xᵢ] = (1/n²) ∑ᵢ₌₁ⁿ Var[Xᵢ] (by independence) = (1/n²) nσ² = σ²/n
• The variance of the estimator shrinks linearly as the number of samples grows.
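
A sketch checking Var[X̄] = σ²/n empirically; σ² = 81 matches the example on the next slide, while the normal distribution and trial counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.0, 9.0  # sigma^2 = 81

for n in (10, 100, 1000):
    xbars = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    print(n, xbars.var(), sigma**2 / n)  # empirical Var[Xbar] vs sigma^2/n
```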

15. Concentration Inequalities
• We would like to be able to claim Pr(|X̄ − μ| < ϵ) > 1 − δ for some δ, ϵ > 0
• Var[X̄] = σ²/n means that with "enough" data, Pr(|X̄ − μ| < ϵ) > 1 − δ for any δ, ϵ > 0 that we pick (why?)
• Suppose we have n = 10 samples, and we know σ² = 81; so Var[X̄] = 8.1.
• Question: What is Pr(|X̄ − μ| < 2)?

16. Variance Is Not Enough
Knowing Var[X̄] = 8.1 is not enough to compute Pr(|X̄ − μ| < 2)!
Examples:
p(x̄) = {0.9 if x̄ = μ; 0.05 if x̄ = μ ± 9} ⟹ Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.9
p(x̄) = {0.999 if x̄ = μ; 0.0005 if x̄ = μ ± 90} ⟹ Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.999
p(x̄) = {0.1 if x̄ = μ; 0.45 if x̄ = μ ± 3} ⟹ Var[X̄] = 8.1 and Pr(|X̄ − μ| < 2) = 0.1
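
These three PMFs can be verified directly; a small sketch encoding them:

```python
import numpy as np

mu = 0.0
# Three distributions of Xbar, all with Var = 8.1 but very different
# concentration: (support values, probabilities)
dists = [
    (np.array([0.0, -9.0, 9.0]),   np.array([0.9, 0.05, 0.05])),
    (np.array([0.0, -90.0, 90.0]), np.array([0.999, 0.0005, 0.0005])),
    (np.array([0.0, -3.0, 3.0]),   np.array([0.1, 0.45, 0.45])),
]
for v, p in dists:
    var = np.sum(p * (v - mu) ** 2)
    prob_close = np.sum(p[np.abs(v - mu) < 2])
    print(var, prob_close)  # 8.1 each, but Pr(|Xbar - mu| < 2) = 0.9, 0.999, 0.1
```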

17. Hoeffding's Inequality
Theorem (Hoeffding's Inequality): Suppose that X₁, …, Xₙ are distributed i.i.d., with a ≤ Xᵢ ≤ b. Then for any ϵ > 0,
Pr(|X̄ − 𝔼[X̄]| ≥ ϵ) ≤ 2 exp(−2nϵ² / (b − a)²).
Equivalently, Pr(|X̄ − 𝔼[X̄]| ≤ (b − a) √(ln(2/δ) / (2n))) ≥ 1 − δ.
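
A helper computing the Hoeffding width ϵ from the equivalent form (the function name and example numbers are my own, for illustration):

```python
import numpy as np

def hoeffding_epsilon(n, delta, a, b):
    """Width eps with Pr(|Xbar - E[Xbar]| <= eps) >= 1 - delta, by Hoeffding."""
    return (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

# e.g. n = 100 samples bounded in [0, 1], 95% confidence:
print(hoeffding_epsilon(n=100, delta=0.05, a=0.0, b=1.0))  # ~0.136
```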

18. Chebyshev's Inequality
Theorem (Chebyshev's Inequality): Suppose that X₁, …, Xₙ are distributed i.i.d. with variance σ². Then for any ϵ > 0,
Pr(|X̄ − 𝔼[X̄]| ≥ ϵ) ≤ σ² / (nϵ²).
Equivalently, Pr(|X̄ − 𝔼[X̄]| ≤ √(σ² / (δn))) ≥ 1 − δ.
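
The analogous helper for the Chebyshev width (again, names and numbers are illustrative):

```python
import numpy as np

def chebyshev_epsilon(n, delta, sigma2):
    """Width eps with Pr(|Xbar - E[Xbar]| <= eps) >= 1 - delta, by Chebyshev."""
    return np.sqrt(sigma2 / (delta * n))

# The slide-15 setting: n = 10, sigma^2 = 81, at 95% confidence the
# guarantee is very loose:
print(chebyshev_epsilon(n=10, delta=0.05, sigma2=81.0))  # ~12.7
```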

19. When to Use Chebyshev, When to Use Hoeffding?
• If a ≤ Xᵢ ≤ b, then Var[Xᵢ] ≤ (b − a)²/4
• Hoeffding's inequality gives ϵ = (b − a) √(ln(2/δ) / (2n));
  Chebyshev's inequality gives ϵ = √(σ² / (δn)) ≤ √((b − a)² / (4δn)) = (b − a) (1/2) √(1/(δn))
• Hoeffding's inequality gives a tighter bound*, but it can only be used on bounded random variables
  * tighter whenever √(ln(2/δ)/(2n)) < (1/2)√(1/(δn)) ⟺ 2δ ln(2/δ) < 1 ⟺ δ < ∼0.232
• Chebyshev's inequality can be applied even for unbounded variables
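
A sketch comparing the two widths for a bounded variable with a = 0, b = 1 and worst-case variance σ² = 1/4 (hypothetical numbers; the crossover near δ ≈ 0.232 is visible in the output):

```python
import numpy as np

n = 100
for delta in (0.5, 0.232, 0.05, 0.01):
    hoeff = 1.0 * np.sqrt(np.log(2 / delta) / (2 * n))  # (b - a) = 1
    cheby = np.sqrt(0.25 / (delta * n))                  # sigma^2 = 1/4
    print(delta, hoeff, cheby)  # Hoeffding is tighter once delta < ~0.232
```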

20. Consistency
Definition: A sequence of random variables X₁, X₂, … converges in probability to a random variable X (written Xₙ →ᵖ X) if for all ϵ > 0,
lim_{n→∞} Pr(|Xₙ − X| > ϵ) = 0.
Definition: An estimator X̂ₙ for a quantity X is consistent if X̂ₙ →ᵖ X.

21. Weak Law of Large Numbers
Theorem (Weak Law of Large Numbers): Let X₁, …, Xₙ be distributed i.i.d. with 𝔼[Xᵢ] = μ and Var[Xᵢ] = σ². Then the sample mean X̄ = (1/n) ∑ᵢ₌₁ⁿ Xᵢ is a consistent estimator for μ.
Proof:
1. We have already shown that 𝔼[X̄] = μ
2. By Chebyshev, Pr(|X̄ − 𝔼[X̄]| ≥ ϵ) ≤ σ²/(nϵ²) for arbitrary ϵ > 0
3. Hence lim_{n→∞} Pr(|X̄ − μ| ≥ ϵ) = 0 for any ϵ > 0
4. Hence X̄ →ᵖ μ. ∎
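
A simulation sketch of the WLLN in action (arbitrary normal distribution and ϵ), showing Pr(|X̄ − μ| ≥ ϵ) shrinking toward 0 as n grows:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, eps = 5.0, 9.0, 0.5

for n in (10, 100, 1000, 10_000):
    xbars = rng.normal(mu, sigma, size=(2_000, n)).mean(axis=1)
    # Fraction of sample means at least eps away from mu
    print(n, np.mean(np.abs(xbars - mu) >= eps))
```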

22. Summary
• The variance Var[X] of a random variable X is its expected squared distance from the mean
• An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data
• Concentration inequalities let us bound the probability of a given estimator being at least ϵ away from the estimated quantity
• An estimator is consistent if it converges in probability to the estimated quantity
