  1. Estimation: Sample Complexity and the Bias-Variance Tradeoff (CMPUT 296: Basics of Machine Learning, Textbook §3.4-3.5)

  2. Logistics
  Reminders:
  • Thought Question 1 (due Thursday, September 17)
  • Assignment 1 (due Thursday, September 24)

  3. Recap
  • The variance Var[X] of a random variable X is its expected squared distance from its mean
  • An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data
  • Concentration inequalities let us bound the probability of a given estimator being at least ϵ away from the estimated quantity
  • An estimator is consistent if it converges in probability to the estimated quantity

  4. When to Use Chebyshev, When to Use Hoeffding?
  Popoviciu's inequality: If a ≤ X_i ≤ b, then Var[X_i] ≤ (1/4)(b − a)²
  Hoeffding's inequality: ϵ = (b − a) √(ln(2/δ) / (2n))
  Chebyshev's inequality: ϵ = √(σ² / (δn)) ≤ (b − a) √(1 / (4δn))
  • Hoeffding's inequality gives a tighter bound*, but it can only be used on bounded random variables
    ✴ namely, whenever ln(2/δ)/2 < 1/(4δ)
    ✴ E.g., if Var[X_i] ≈ (1/4)(b − a)², this holds whenever δ < ~0.232
  • Chebyshev's inequality can be applied even for unbounded variables, or for bounded variables with known, small σ²
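
A quick numerical check of the crossover, not part of the original slides: the sketch below (variable names and the choice of n are my own) plugs a few values of δ into both ϵ bounds, using Popoviciu's worst-case variance σ² = (b − a)²/4 for Chebyshev, and shows Hoeffding becoming the tighter bound once δ drops below roughly 0.232.

```python
import numpy as np

def hoeffding_eps(delta, n, a=0.0, b=1.0):
    # epsilon = (b - a) * sqrt(ln(2/delta) / (2n))
    return (b - a) * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def chebyshev_eps(delta, n, var):
    # epsilon = sqrt(var / (delta * n))
    return np.sqrt(var / (delta * n))

n, a, b = 100, 0.0, 1.0
worst_var = (b - a) ** 2 / 4.0          # Popoviciu's bound on Var[X_i]

for delta in (0.5, 0.232, 0.05, 0.01):
    eh = hoeffding_eps(delta, n, a, b)
    ec = chebyshev_eps(delta, n, worst_var)
    print(f"delta={delta:5.3f}  Hoeffding eps={eh:.4f}  Chebyshev eps={ec:.4f}")
# At delta ~0.232 the two bounds coincide; below that, Hoeffding's is smaller.
```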

  5. Outline 1. Recap & Logistics 2. Sample Complexity 3. Bias-Variance Tradeoff

  6. Sample Complexity
  Definition: The sample complexity of an estimator is the number of samples needed to guarantee an error of at most ϵ with probability at least 1 − δ, for given ϵ and δ.
  • We want sample complexity to be small (why?)
  • Sample complexity is determined by:
    1. The estimator itself
      • Smarter estimators can sometimes improve sample complexity
    2. Properties of the data generating process
      • If the data are high-variance, we need more samples for an accurate estimate
      • But we can reduce the sample complexity if we can bias our estimate toward the correct value

  7. Convergence Rate via Chebyshev
  The convergence rate indicates how quickly the error in an estimator decays as the number of samples grows.
  Example: Estimating the mean of a distribution using X̄ = (1/n) ∑_{i=1}^n X_i
  • Recall that Chebyshev's inequality guarantees Pr( |X̄ − 𝔼[X̄]| ≤ √(σ²/(δn)) ) ≥ 1 − δ
  • The convergence rate is thus O(1/√n) (why?)
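
As an illustrative sanity check, not from the slides: the sketch below simulates the sample mean of a Uniform(0, 1) variable (an arbitrary bounded example of mine) and shows the average error shrinking roughly in proportion to 1/√n, as the Chebyshev rate predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5                                  # true mean of Uniform(0, 1)

for n in (10, 100, 1000, 10000):
    # average absolute error of the sample mean over many repetitions
    errors = [abs(rng.uniform(0, 1, n).mean() - mu) for _ in range(2000)]
    print(f"n={n:6d}  mean |error| = {np.mean(errors):.4f}  "
          f"(1/sqrt(n) = {1/np.sqrt(n):.4f})")
# The error column shrinks by about sqrt(10) each time n grows by a factor of 10.
```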

  8. Convergence Rate via Gaussian
  Example: Now assume that we know X_i ~ N(μ, σ²) i.i.d., and we know σ² but not μ.
  Questions:
  1. What is the expected value of X̄ = (1/n) ∑_{i=1}^n X_i?
  2. What is the variance of X̄?
  3. What is the distribution of X̄?  ⟹  X̄ ~ N(μ, σ²/n)
  Find ϵ such that Pr(|X̄ − μ| < ϵ) = 0.95 by finding ϵ such that ∫_{−∞}^{−ϵ} p(x) dx = 0.025 (why?), where p is the density of X̄ − μ
  ⟹ (inverse CDF) ϵ = 1.96 σ/√n
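
The 1.96 factor is just the standard normal inverse CDF evaluated at 0.975 (equivalently, 0.025 of probability in each tail). A minimal sketch of that lookup, assuming scipy is available; the example values of σ and n are mine:

```python
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 10
# Two-sided 95% interval: put 0.025 of probability in each tail.
z = norm.ppf(0.975)                  # equals -norm.ppf(0.025), roughly 1.96
eps = z * sigma / np.sqrt(n)
print(f"z = {z:.3f}, eps = {eps:.3f}")   # eps is about 0.62 for sigma=1, n=10
```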

  9. Sample Complexity
  Definition: The sample complexity of an estimator is the number of samples needed to guarantee an error of at most ϵ with probability at least 1 − δ, for given ϵ and δ.
  For δ = 0.05, Chebyshev gives:
  ϵ = √(σ²/(δn)) = σ/√(0.05 n)  ⟺  √n = 4.47 σ/ϵ  ⟺  n = 19.98 σ²/ϵ²
  With the Gaussian assumption and δ = 0.05:
  ϵ = 1.96 σ/√n  ⟺  √n = 1.96 σ/ϵ  ⟺  n = 3.84 σ²/ϵ²
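
The two sample-complexity expressions are easy to compare numerically. A small sketch of the slide's arithmetic (the concrete values of σ and ϵ are my own example):

```python
def n_chebyshev(sigma, eps, delta=0.05):
    # eps = sqrt(sigma^2 / (delta * n))  =>  n = sigma^2 / (delta * eps^2)
    return sigma ** 2 / (delta * eps ** 2)

def n_gaussian(sigma, eps, delta=0.05):
    # eps = 1.96 * sigma / sqrt(n)       =>  n = (1.96 * sigma / eps)^2
    return (1.96 * sigma / eps) ** 2

sigma, eps = 1.0, 0.1
print("Chebyshev:", n_chebyshev(sigma, eps))   # 20 * sigma^2 / eps^2 = 2000
print("Gaussian: ", n_gaussian(sigma, eps))    # 3.84 * sigma^2 / eps^2 ~ 384
```

Knowing the distribution family buys roughly a factor of five fewer samples at δ = 0.05.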

  10. Mean-Squared Error
  • Bias: whether an estimator is correct in expectation
  • Consistency: whether an estimator is correct in the limit of infinite data
  • Convergence rate: how fast the estimator approaches its own mean
    • For an unbiased estimator, this is also how fast its error bounds shrink
  • We don't necessarily care about an estimator's being unbiased
  • Often, what we care about is our estimator's accuracy in expectation
  Definition: The mean squared error of an estimator X̂ of a quantity X is MSE(X̂) = 𝔼[(X̂ − 𝔼[X])²]
  (note: the inner expectation is 𝔼[X], not 𝔼[X̂]; these are different, so MSE is not the same as Var[X̂])

  11. Bias-Variance Decomposition
  Sometimes a biased estimator can be closer to the estimated quantity than an unbiased one.
  MSE(X̂) = 𝔼[(X̂ − 𝔼[X])²]
         = 𝔼[(X̂ − μ)²]                                    (μ = 𝔼[X])
         = 𝔼[(X̂ − 𝔼[X̂] + 𝔼[X̂] − μ)²]                      (−𝔼[X̂] + 𝔼[X̂] = 0)
         = 𝔼[((X̂ − 𝔼[X̂]) + b)²]                           (b = Bias(X̂) = 𝔼[X̂] − μ)
         = 𝔼[(X̂ − 𝔼[X̂])² + 2b(X̂ − 𝔼[X̂]) + b²]
         = 𝔼[(X̂ − 𝔼[X̂])²] + 𝔼[2b(X̂ − 𝔼[X̂])] + 𝔼[b²]       (linearity of 𝔼)
         = 𝔼[(X̂ − 𝔼[X̂])²] + 2b 𝔼[X̂ − 𝔼[X̂]] + b²           (constants come out of 𝔼)
         = Var[X̂] + 2b 𝔼[X̂ − 𝔼[X̂]] + b²                   (def. of variance)
         = Var[X̂] + 2b (𝔼[X̂] − 𝔼[X̂]) + b²                 (linearity of 𝔼)
         = Var[X̂] + b²
         = Var[X̂] + Bias(X̂)²   ∎
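
The identity can also be checked empirically. The sketch below is my own illustration, not from the slides: it uses a deliberately biased estimator (the sample mean shrunk by a factor of 0.9, an arbitrary choice) and confirms that the Monte Carlo estimate of MSE matches Var + Bias².

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 1.0, 20, 200_000

# A deliberately biased estimator of mu: shrink the sample mean toward 0.
estimates = 0.9 * rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

mse  = np.mean((estimates - mu) ** 2)           # E[(X_hat - mu)^2]
var  = np.var(estimates)                        # Var[X_hat]
bias = np.mean(estimates) - mu                  # E[X_hat] - mu

print(f"MSE          = {mse:.5f}")
print(f"Var + Bias^2 = {var + bias ** 2:.5f}")  # the two numbers should agree
```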

  12. Bias-Variance Tradeoff
  MSE(X̂) = Var[X̂] + Bias(X̂)²
  • If we can decrease bias without increasing variance, error goes down
  • If we can decrease variance without increasing bias, error goes down
  • Question: Would we ever want to increase bias?
  • YES. If we can increase (squared) bias in a way that decreases variance more, then error goes down!
  • Interpretation: Biasing the estimator toward values that are more likely to be true (based on prior information)

  13. Downward-Biased Mean Estimation
  Example: Let's estimate μ given i.i.d. X_1, …, X_n with 𝔼[X_i] = μ, using Y = (1/(n + 100)) ∑_{i=1}^n X_i
  This estimator is biased:
  𝔼[Y] = 𝔼[ (1/(n + 100)) ∑ X_i ] = (1/(n + 100)) ∑ 𝔼[X_i] = n μ / (n + 100)
  Bias(Y) = n μ / (n + 100) − μ = −100 μ / (n + 100)
  This estimator has low variance:
  Var(Y) = Var[ (1/(n + 100)) ∑ X_i ] = (1/(n + 100)²) ∑ Var[X_i] = n σ² / (n + 100)²
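
A short simulation, again my own sketch with arbitrary values of μ and σ, lines the analytical bias and variance of Y up against empirical estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 3.0, 2.0, 10, 200_000

# Y = (1 / (n + 100)) * sum_i X_i, computed for many independent samples
Y = rng.normal(mu, sigma, size=(trials, n)).sum(axis=1) / (n + 100)

print("bias: analytic", -100 * mu / (n + 100), " empirical", Y.mean() - mu)
print("var:  analytic", n * sigma ** 2 / (n + 100) ** 2, " empirical", Y.var())
```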

  14. Estimating μ Near 0
  Example: Suppose that σ = 1, n = 10, and μ = 0.1. Then Bias(X̄) = 0, and:
  MSE(X̄) = Var(X̄) + Bias(X̄)² = Var(X̄) = σ²/n = 1/10 = 0.1
  MSE(Y) = Var(Y) + Bias(Y)² = n σ²/(n + 100)² + (100 μ/(n + 100))² = 10/110² + (100 · 0.1/110)² ≈ 9.1 × 10⁻³
  Here the biased estimator Y has much lower error than X̄.

  15. Prior Information and Bias: There's No Free Lunch
  Example: Suppose that σ = 1, n = 10, and μ = 5.
  MSE(X̄) = Var(X̄) + Bias(X̄)² = σ²/n = 1/10 = 0.1
  MSE(Y) = Var(Y) + Bias(Y)² = n σ²/(n + 100)² + (100 μ/(n + 100))² = 10/110² + (100 · 5/110)² ≈ 20.66
  Whoa! What went wrong?
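
Plugging both scenarios into the decomposition makes the "no free lunch" point concrete. A minimal sketch of the slides' arithmetic for the two values of μ:

```python
def mse_sample_mean(sigma, n):
    return sigma ** 2 / n                       # unbiased, so MSE = variance

def mse_Y(sigma, n, mu):
    var  = n * sigma ** 2 / (n + 100) ** 2      # Var(Y)
    bias = -100 * mu / (n + 100)                # Bias(Y)
    return var + bias ** 2

sigma, n = 1.0, 10
for mu in (0.1, 5.0):
    print(f"mu={mu}: MSE(sample mean)={mse_sample_mean(sigma, n):.4f}  "
          f"MSE(Y)={mse_Y(sigma, n, mu):.4f}")
# mu=0.1: Y wins (about 0.009 vs 0.1); mu=5: Y loses badly (about 20.66 vs 0.1).
```

The downward bias only helps when the true mean really is near 0; when it is not, the squared bias term dominates the error.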

  16. Summary
  • Sample complexity is the number of samples needed to attain a desired error bound ϵ with a desired probability 1 − δ
  • The mean squared error of an estimator decomposes into bias (squared) plus variance
  • A biased estimator can have lower error than an unbiased one
  • Bias the estimator based on some prior information
    • But this only helps if the prior information is correct
    • We cannot reduce error by adding arbitrary bias
