Estimation: Sample Complexity and the Bias-Variance Tradeoff


  • Estimation: Sample Complexity and the Bias-Variance Tradeoff (CMPUT 296: Basics of Machine Learning, Textbook §3.4-3.5)

  • Logistics Reminders: • Thought Question 1 (due Thursday, September 17) • Assignment 1 (due Thursday, September 24)

  • Recap • The variance Var[X] of a random variable X is its expected squared distance from the mean • An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data • Concentration inequalities let us bound the probability of a given estimator being at least ε away from the estimated quantity • An estimator is consistent if it converges in probability to the estimated quantity
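As a concrete illustration of these recap points (an editorial sketch, not from the slides; the distribution and its parameters are arbitrary), the sample mean below is an estimator of an unobserved mean, and it concentrates around that mean as more data are observed:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5      # "true" mean and standard deviation (unknown to the estimator)

for n in [10, 100, 1000, 10000]:
    # One draw of the estimator: the sample mean of n i.i.d. observations
    x_bar = rng.normal(mu, sigma, size=n).mean()
    print(f"n={n:6d}  sample mean={x_bar:.4f}  |error|={abs(x_bar - mu):.4f}")
```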

  • When to Use Chebyshev, When to Use Hoeffding? • Popoviciu's inequality: if a ≤ X_i ≤ b, then Var[X_i] ≤ (1/4)(b − a)² • Hoeffding's inequality: ε = (b − a) √(ln(2/δ) / (2n)) • Chebyshev's inequality: ε = √(σ² / (δn)) ≤ (b − a) √(1 / (4δn)) • Hoeffding's inequality gives a tighter bound*, but it can only be used on bounded random variables ✴ namely whenever ln(2/δ)/2 < 1/(4δ), comparing the two bounds above (n and (b − a) cancel) ✴ E.g., if Var[X_i] ≈ (1/4)(b − a)², then Hoeffding is tighter whenever δ ≲ 0.232 • Chebyshev's inequality can be applied even for unbounded variables, or for bounded variables with known, small σ²
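To make the comparison concrete, here is a small Python sketch (an editorial illustration, not from the slides; the interval, n, and δ values are arbitrary) that evaluates both ε bounds, using Popoviciu's worst-case variance inside Chebyshev:

```python
import numpy as np

def hoeffding_eps(n, delta, a, b):
    # Hoeffding: eps = (b - a) * sqrt(ln(2/delta) / (2n))
    return (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

def chebyshev_eps(n, delta, var):
    # Chebyshev: eps = sqrt(var / (delta * n))
    return np.sqrt(var / (delta * n))

a, b, n = 0.0, 1.0, 100
worst_case_var = (b - a) ** 2 / 4      # Popoviciu's bound on the variance
for delta in [0.3, 0.232, 0.05, 0.01]:
    h = hoeffding_eps(n, delta, a, b)
    c = chebyshev_eps(n, delta, worst_case_var)
    print(f"delta={delta:5.3f}  Hoeffding eps={h:.4f}  Chebyshev eps={c:.4f}")
# Below delta ~ 0.232, Hoeffding gives the smaller (tighter) eps.
```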

  • Outline 1. Recap & Logistics 2. Sample Complexity 3. Bias-Variance Tradeoff

  • Sample Complexity Definition: The sample complexity of an estimator is the number of samples required to guarantee an error of at most ε with probability at least 1 − δ, for given ε and δ. • We want sample complexity to be small (why?) • Sample complexity is determined by: 1. The estimator itself • Smarter estimators can sometimes improve sample complexity 2. Properties of the data-generating process • If the data are high-variance, we need more samples for an accurate estimate • But we can reduce the sample complexity if we can bias our estimate toward the correct value
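As an editorial illustration of the definition (not from the slides), the sketch below estimates the sample complexity of the sample-mean estimator by simulation; the distribution, the grid of candidate n values, and the helper name empirical_sample_complexity are all made up for this example:

```python
import numpy as np

def empirical_sample_complexity(eps, delta, sigma, n_grid, trials=2000, seed=0):
    """Smallest n in n_grid for which |sample mean - mu| <= eps holds
    in at least a (1 - delta) fraction of simulated trials."""
    rng = np.random.default_rng(seed)
    mu = 0.0                                   # true mean, used only to measure the error
    for n in n_grid:
        errs = np.abs(rng.normal(mu, sigma, size=(trials, n)).mean(axis=1) - mu)
        if np.mean(errs <= eps) >= 1 - delta:
            return n
    return None

print(empirical_sample_complexity(eps=0.1, delta=0.05, sigma=1.0,
                                  n_grid=range(50, 1001, 50)))
# Compare with the Gaussian-based formula (1.96 * sigma / eps)^2 ≈ 384 from a later slide.
```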

  • Convergence Rate via Chebyshev The convergence rate indicates how quickly the error in an estimator decays as the number of samples grows. Example: Estimating the mean of a distribution using X̄ = (1/n) ∑_{i=1}^{n} X_i • Recall that Chebyshev's inequality guarantees Pr(|X̄ − 𝔼[X̄]| ≤ √(σ²/(δn))) ≥ 1 − δ • The convergence rate is thus O(1/√n) (why?)
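The O(1/√n) behaviour is easy to see by simulation; a minimal sketch (editorial illustration; σ and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, trials = 2.0, 2000

for n in [25, 100, 400, 1600]:
    # Root-mean-squared error of the sample mean across many simulated datasets
    means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
    rmse = np.sqrt(np.mean(means ** 2))
    print(f"n={n:5d}  RMSE={rmse:.4f}  sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
# Quadrupling n roughly halves the error, consistent with an O(1/sqrt(n)) rate.
```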

  • Convergence Rate via Gaussian Example: Now assume that we know X_i ∼ N(μ, σ²), i.i.d., and we know σ² but not μ. Questions: 1. What is the expected value of X̄ = (1/n) ∑_{i=1}^{n} X_i? 2. What is the variance of X̄? 3. What is the distribution of X̄? ⟹ X̄ ∼ N(μ, σ²/n) Find ε such that Pr(|X̄ − μ| < ε) = 0.95 by finding ε such that ∫_{−∞}^{−ε} p(x) dx = 0.025, where p is the density of X̄ − μ ∼ N(0, σ²/n) (why?) ⟹ by the inverse CDF, ε = 1.96 σ/√n
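The inverse-CDF step can be carried out numerically; a short sketch (editorial illustration; assumes scipy is available, and σ = 1, n = 25 are arbitrary values):

```python
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25
std_err = sigma / np.sqrt(n)        # standard deviation of Xbar - mu

# Pr(|Xbar - mu| < eps) = 0.95  <=>  Pr(Xbar - mu < -eps) = 0.025,
# so -eps is the 0.025 quantile of N(0, sigma^2 / n).
eps = -norm.ppf(0.025, loc=0.0, scale=std_err)
print(eps, 1.96 * sigma / np.sqrt(n))   # both are approximately 0.392
```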

  • Sample Complexity Definition: The sample complexity of an estimator is the number of samples required to guarantee an error of at most ε with probability at least 1 − δ, for given ε and δ.
    For δ = 0.05, Chebyshev gives: ε = √(σ²/(0.05 n)) ≈ 4.47 σ/√n ⟺ √n ≈ 4.47 σ/ε ⟺ n ≈ 19.98 σ²/ε²
    With the Gaussian assumption and δ = 0.05: ε = 1.96 σ/√n ⟺ √n = 1.96 σ/ε ⟺ n ≈ 3.84 σ²/ε²
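The two formulas are easy to compare directly; a small sketch (editorial illustration; σ and ε are arbitrary, and the function names are made up):

```python
def n_chebyshev(eps, delta, sigma):
    # From eps = sigma / sqrt(delta * n)  =>  n = sigma^2 / (delta * eps^2)
    return sigma ** 2 / (delta * eps ** 2)

def n_gaussian(eps, sigma, z=1.96):
    # From eps = z * sigma / sqrt(n)  =>  n = (z * sigma / eps)^2   (z = 1.96 for delta = 0.05)
    return (z * sigma / eps) ** 2

sigma, eps, delta = 1.0, 0.1, 0.05
print("Chebyshev:", n_chebyshev(eps, delta, sigma))   # 2000 samples
print("Gaussian: ", n_gaussian(eps, sigma))           # about 384 samples
```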

  • Mean-Squared Error • Bias: whether an estimator is correct in expectation • Consistency: whether an estimator is correct in the limit of infinite data • Convergence rate: how fast the estimator approaches its own mean • For an unbiased estimator, this is also how fast its error bounds shrink • We don't necessarily care about an estimator being unbiased • Often, what we care about is our estimator's accuracy in expectation Definition: The mean squared error of an estimator X̂ of a quantity X is MSE(X̂) = 𝔼[(X̂ − 𝔼[X])²] (note: this uses 𝔼[X], not 𝔼[X̂], so it is different from Var[X̂])

  • Bias-Variance Decomposition Sometimes a biased estimator can be closer to the estimated quantity than an unbiased one.
    MSE(X̂) = 𝔼[(X̂ − 𝔼[X])²] = 𝔼[(X̂ − μ)²]                      (μ = 𝔼[X])
            = 𝔼[(X̂ − 𝔼[X̂] + 𝔼[X̂] − μ)²]                        (−𝔼[X̂] + 𝔼[X̂] = 0)
            = 𝔼[((X̂ − 𝔼[X̂]) + b)²]                              (b = Bias(X̂) = 𝔼[X̂] − μ)
            = 𝔼[(X̂ − 𝔼[X̂])² + 2b(X̂ − 𝔼[X̂]) + b²]
            = 𝔼[(X̂ − 𝔼[X̂])²] + 𝔼[2b(X̂ − 𝔼[X̂])] + 𝔼[b²]          (linearity of 𝔼)
            = 𝔼[(X̂ − 𝔼[X̂])²] + 2b 𝔼[X̂ − 𝔼[X̂]] + b²              (constants come out of 𝔼)
            = Var[X̂] + 2b (𝔼[X̂] − 𝔼[X̂]) + b²                    (def. of variance; linearity of 𝔼)
            = Var[X̂] + b²
            = Var[X̂] + Bias(X̂)² ∎
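The decomposition can be checked numerically; a minimal Monte Carlo sketch (editorial illustration; the deliberately biased estimator, which shrinks the sample mean by using n + 5 in the denominator, is made up for this example):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 3.0, 1.0, 20, 200_000

# Many independent draws of a deliberately biased estimator of mu
samples = rng.normal(mu, sigma, size=(trials, n))
x_hat = samples.sum(axis=1) / (n + 5)

mse = np.mean((x_hat - mu) ** 2)
var = np.var(x_hat)
bias = np.mean(x_hat) - mu
print(f"MSE = {mse:.4f}   Var + Bias^2 = {var + bias ** 2:.4f}")   # the two agree closely
```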

  • Bias-Variance Tradeoff MSE(X̂) = Var[X̂] + Bias(X̂)² • If we can decrease bias without increasing variance, error goes down • If we can decrease variance without increasing bias, error goes down • Question: Would we ever want to increase bias? • YES. If we can increase (squared) bias in a way that decreases variance more, then error goes down! • Interpretation: Biasing the estimator toward values that are more likely to be true (based on prior information)

  • Downward-biased Mean Estimation Example: Let's estimate μ given i.i.d. X_1, …, X_n with 𝔼[X_i] = μ, using Y = (1/(n + 100)) ∑_{i=1}^{n} X_i
    This estimator is biased: 𝔼[Y] = (1/(n + 100)) ∑_{i=1}^{n} 𝔼[X_i] = n μ / (n + 100), so Bias(Y) = n μ / (n + 100) − μ = −100 μ / (n + 100)
    This estimator has low variance: Var(Y) = Var[(1/(n + 100)) ∑_{i=1}^{n} X_i] = (1/(n + 100)²) ∑_{i=1}^{n} Var[X_i] = n σ² / (n + 100)²
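A short sketch (editorial illustration; the helper name y_stats and the parameter values are made up) that evaluates these formulas for Y:

```python
def y_stats(mu, sigma, n):
    """Analytic bias, variance, and MSE of Y = (1/(n + 100)) * sum of the X_i."""
    bias = n * mu / (n + 100) - mu            # equals -100 * mu / (n + 100)
    var = n * sigma ** 2 / (n + 100) ** 2
    return bias, var, var + bias ** 2

# With mu near 0 the bias is small and the variance is tiny
print(y_stats(mu=0.1, sigma=1.0, n=10))
```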

  • Estimating μ Near 0 Example: Suppose that σ = 1, n = 10, and μ = 0.1
    Bias(X̄) = 0, so MSE(X̄) = Var(X̄) + Bias(X̄)² = σ²/n = 1/10 = 0.1
    MSE(Y) = Var(Y) + Bias(Y)² = n σ²/(n + 100)² + (−100 μ/(n + 100))² = 10/110² + (−10/110)² ≈ 9.1 × 10⁻³
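A quick arithmetic check of this case (editorial illustration, plain Python):

```python
sigma, n, mu = 1.0, 10, 0.1

mse_xbar = sigma ** 2 / n                                 # unbiased: MSE = variance = 0.1
bias_y = -100 * mu / (n + 100)                            # -10/110
mse_y = n * sigma ** 2 / (n + 100) ** 2 + bias_y ** 2     # about 0.0091
print(mse_xbar, mse_y)                                    # Y has roughly 11x lower MSE here
```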

  • Prior Information and Bias: There's No Free Lunch Example: Suppose that σ = 1, n = 10, and μ = 5
    MSE(X̄) = Var(X̄) + Bias(X̄)² = σ²/n = 1/10 = 0.1
    MSE(Y) = Var(Y) + Bias(Y)² = n σ²/(n + 100)² + (−100 μ/(n + 100))² = 10/110² + (−500/110)² ≈ 20.66
    Whoa! What went wrong?
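A simulation makes the contrast between the two regimes explicit; a minimal sketch (editorial illustration; all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, trials = 1.0, 10, 200_000

for mu in [0.1, 5.0]:
    x = rng.normal(mu, sigma, size=(trials, n))
    mse_xbar = np.mean((x.mean(axis=1) - mu) ** 2)             # unbiased sample mean
    mse_y = np.mean((x.sum(axis=1) / (n + 100) - mu) ** 2)     # estimator shrunk toward 0
    print(f"mu={mu}:  MSE(Xbar)={mse_xbar:.4f}  MSE(Y)={mse_y:.4f}")
# Shrinking toward 0 helps when mu really is near 0 and hurts badly when it is not.
```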

  • Summary • Sample complexity is the number of samples needed to attain a desired error bound ε with a desired probability 1 − δ • The mean squared error of an estimator decomposes into bias (squared) plus variance • A biased estimator can have lower error than an unbiased estimator • Bias the estimator based on prior information • But this only helps if the prior information is correct • We cannot reduce error by adding arbitrary bias