Estimation: Sample Complexity and the Bias-Variance Tradeoff


  • Estimation: Sample Complexity and the Bias-Variance Tradeoff (CMPUT 296: Basics of Machine Learning, Textbook §3.4-3.5)

  • Logistics Reminders: • Thought Question 1 (due Thursday, September 17) • Assignment 1 (due Thursday, September 24)

  • Recap • The variance Var[X] of a random variable X is its expected squared distance from the mean • An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data • Concentration inequalities let us bound the probability of a given estimator being at least ε away from the estimated quantity • An estimator is consistent if it converges in probability to the estimated quantity
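As a concrete illustration of these recap points (an editorial sketch, not from the slides; the distribution and its parameters are arbitrary), the sample mean below is an estimator of an unobserved mean, and it concentrates around that mean as more data are observed:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5      # "true" mean and standard deviation (unknown to the estimator)

for n in [10, 100, 1000, 10000]:
    # One draw of the estimator: the sample mean of n i.i.d. observations
    x_bar = rng.normal(mu, sigma, size=n).mean()
    print(f"n={n:6d}  sample mean={x_bar:.4f}  |error|={abs(x_bar - mu):.4f}")
```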

  • When to Use Chebyshev, When to Use Hoeffding? • Popoviciu's inequality: if a ≤ X_i ≤ b, then Var[X_i] ≤ (1/4)(b − a)² • Hoeffding's inequality: ε = (b − a) √(ln(2/δ) / (2n)) • Chebyshev's inequality: ε = √(σ² / (δn)) ≤ (b − a) √(1 / (4δn)) • Hoeffding's inequality gives a tighter bound*, but it can only be used on bounded random variables ✴ namely whenever ln(2/δ)/2 < 1/(4δ), comparing the two bounds above (n and (b − a) cancel) ✴ E.g., if Var[X_i] ≈ (1/4)(b − a)², then Hoeffding is tighter whenever δ ≲ 0.232 • Chebyshev's inequality can be applied even for unbounded variables, or for bounded variables with known, small σ²
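To make the comparison concrete, here is a small Python sketch (an editorial illustration, not from the slides; the interval, n, and δ values are arbitrary) that evaluates both ε bounds, using Popoviciu's worst-case variance inside Chebyshev:

```python
import numpy as np

def hoeffding_eps(n, delta, a, b):
    # Hoeffding: eps = (b - a) * sqrt(ln(2/delta) / (2n))
    return (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

def chebyshev_eps(n, delta, var):
    # Chebyshev: eps = sqrt(var / (delta * n))
    return np.sqrt(var / (delta * n))

a, b, n = 0.0, 1.0, 100
worst_case_var = (b - a) ** 2 / 4      # Popoviciu's bound on the variance
for delta in [0.3, 0.232, 0.05, 0.01]:
    h = hoeffding_eps(n, delta, a, b)
    c = chebyshev_eps(n, delta, worst_case_var)
    print(f"delta={delta:5.3f}  Hoeffding eps={h:.4f}  Chebyshev eps={c:.4f}")
# Below delta ~ 0.232, Hoeffding gives the smaller (tighter) eps.
```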

  • Outline 1. Recap & Logistics 2. Sample Complexity 3. Bias-Variance Tradeoff

  • Sample Complexity Definition: The sample complexity of an estimator is the number of samples required to guarantee an error of at most ε with probability at least 1 − δ, for given ε and δ. • We want sample complexity to be small (why?) • Sample complexity is determined by: 1. The estimator itself • Smarter estimators can sometimes improve sample complexity 2. Properties of the data-generating process • If the data are high-variance, we need more samples for an accurate estimate • But we can reduce the sample complexity if we can bias our estimate toward the correct value
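As an editorial illustration of the definition (not from the slides), the sketch below estimates the sample complexity of the sample-mean estimator by simulation; the distribution, the grid of candidate n values, and the helper name empirical_sample_complexity are all made up for this example:

```python
import numpy as np

def empirical_sample_complexity(eps, delta, sigma, n_grid, trials=2000, seed=0):
    """Smallest n in n_grid for which |sample mean - mu| <= eps holds
    in at least a (1 - delta) fraction of simulated trials."""
    rng = np.random.default_rng(seed)
    mu = 0.0                                   # true mean, used only to measure the error
    for n in n_grid:
        errs = np.abs(rng.normal(mu, sigma, size=(trials, n)).mean(axis=1) - mu)
        if np.mean(errs <= eps) >= 1 - delta:
            return n
    return None

print(empirical_sample_complexity(eps=0.1, delta=0.05, sigma=1.0,
                                  n_grid=range(50, 1001, 50)))
# Compare with the Gaussian-based formula (1.96 * sigma / eps)^2 ≈ 384 from a later slide.
```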

  • Convergence Rate via Chebyshev The convergence rate indicates how quickly the error in an estimator decays as the number of samples grows. Example: Estimating the mean of a distribution using X̄ = (1/n) ∑_{i=1}^{n} X_i • Recall that Chebyshev's inequality guarantees Pr(|X̄ − 𝔼[X̄]| ≤ √(σ²/(δn))) ≥ 1 − δ • The convergence rate is thus O(1/√n) (why?)
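The O(1/√n) behaviour is easy to see by simulation; a minimal sketch (editorial illustration; σ and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, trials = 2.0, 2000

for n in [25, 100, 400, 1600]:
    # Root-mean-squared error of the sample mean across many simulated datasets
    means = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
    rmse = np.sqrt(np.mean(means ** 2))
    print(f"n={n:5d}  RMSE={rmse:.4f}  sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
# Quadrupling n roughly halves the error, consistent with an O(1/sqrt(n)) rate.
```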

  • Convergence Rate via Gaussian Example: Now assume that we know X_i ∼ N(μ, σ²), i.i.d., and we know σ² but not μ. Questions: 1. What is the expected value of X̄ = (1/n) ∑_{i=1}^{n} X_i? 2. What is the variance of X̄? 3. What is the distribution of X̄? ⟹ X̄ ∼ N(μ, σ²/n) Find ε such that Pr(|X̄ − μ| < ε) = 0.95 by finding ε such that ∫_{−∞}^{−ε} p(x) dx = 0.025, where p is the density of X̄ − μ ∼ N(0, σ²/n) (why?) ⟹ by the inverse CDF, ε = 1.96 σ/√n
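The inverse-CDF step can be carried out numerically; a short sketch (editorial illustration; assumes scipy is available, and σ = 1, n = 25 are arbitrary values):

```python
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 25
std_err = sigma / np.sqrt(n)        # standard deviation of Xbar - mu

# Pr(|Xbar - mu| < eps) = 0.95  <=>  Pr(Xbar - mu < -eps) = 0.025,
# so -eps is the 0.025 quantile of N(0, sigma^2 / n).
eps = -norm.ppf(0.025, loc=0.0, scale=std_err)
print(eps, 1.96 * sigma / np.sqrt(n))   # both are approximately 0.392
```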

  • Sample Complexity Definition: The sample complexity of an estimator is the number of samples required to guarantee an error of at most ε with probability at least 1 − δ, for given ε and δ.
    For δ = 0.05, Chebyshev gives: ε = √(σ²/(0.05 n)) ≈ 4.47 σ/√n ⟺ √n ≈ 4.47 σ/ε ⟺ n ≈ 19.98 σ²/ε²
    With the Gaussian assumption and δ = 0.05: ε = 1.96 σ/√n ⟺ √n = 1.96 σ/ε ⟺ n ≈ 3.84 σ²/ε²
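The two formulas are easy to compare directly; a small sketch (editorial illustration; σ and ε are arbitrary, and the function names are made up):

```python
def n_chebyshev(eps, delta, sigma):
    # From eps = sigma / sqrt(delta * n)  =>  n = sigma^2 / (delta * eps^2)
    return sigma ** 2 / (delta * eps ** 2)

def n_gaussian(eps, sigma, z=1.96):
    # From eps = z * sigma / sqrt(n)  =>  n = (z * sigma / eps)^2   (z = 1.96 for delta = 0.05)
    return (z * sigma / eps) ** 2

sigma, eps, delta = 1.0, 0.1, 0.05
print("Chebyshev:", n_chebyshev(eps, delta, sigma))   # 2000 samples
print("Gaussian: ", n_gaussian(eps, sigma))           # about 384 samples
```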

  • Mean-Squared Error • Bias: whether an estimator is correct in expectation • Consistency: whether an estimator is correct in the limit of infinite data • Convergence rate: how fast the estimator approaches its own mean • For an unbiased estimator, this is also how fast its error bounds shrink • We don't necessarily care about an estimator being unbiased • Often, what we care about is our estimator's accuracy in expectation Definition: The mean squared error of an estimator X̂ of a quantity X is MSE(X̂) = 𝔼[(X̂ − 𝔼[X])²] (note: this uses 𝔼[X], not 𝔼[X̂], so it is different from Var[X̂])

  • Bias-Variance Decomposition Sometimes a biased estimator can be closer to the estimated quantity than an unbiased one.
    MSE(X̂) = 𝔼[(X̂ − 𝔼[X])²] = 𝔼[(X̂ − μ)²]                      (μ = 𝔼[X])
            = 𝔼[(X̂ − 𝔼[X̂] + 𝔼[X̂] − μ)²]                        (−𝔼[X̂] + 𝔼[X̂] = 0)
            = 𝔼[((X̂ − 𝔼[X̂]) + b)²]                              (b = Bias(X̂) = 𝔼[X̂] − μ)
            = 𝔼[(X̂ − 𝔼[X̂])² + 2b(X̂ − 𝔼[X̂]) + b²]
            = 𝔼[(X̂ − 𝔼[X̂])²] + 𝔼[2b(X̂ − 𝔼[X̂])] + 𝔼[b²]          (linearity of 𝔼)
            = 𝔼[(X̂ − 𝔼[X̂])²] + 2b 𝔼[X̂ − 𝔼[X̂]] + b²              (constants come out of 𝔼)
            = Var[X̂] + 2b (𝔼[X̂] − 𝔼[X̂]) + b²                    (def. of variance; linearity of 𝔼)
            = Var[X̂] + b²
            = Var[X̂] + Bias(X̂)² ∎
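The decomposition can be checked numerically; a minimal Monte Carlo sketch (editorial illustration; the deliberately biased estimator, which shrinks the sample mean by using n + 5 in the denominator, is made up for this example):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 3.0, 1.0, 20, 200_000

# Many independent draws of a deliberately biased estimator of mu
samples = rng.normal(mu, sigma, size=(trials, n))
x_hat = samples.sum(axis=1) / (n + 5)

mse = np.mean((x_hat - mu) ** 2)
var = np.var(x_hat)
bias = np.mean(x_hat) - mu
print(f"MSE = {mse:.4f}   Var + Bias^2 = {var + bias ** 2:.4f}")   # the two agree closely
```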

  • Bias-Variance Tradeoff MSE(X̂) = Var[X̂] + Bias(X̂)² • If we can decrease bias without increasing variance, error goes down • If we can decrease variance without increasing bias, error goes down • Question: Would we ever want to increase bias? • YES. If we can increase (squared) bias in a way that decreases variance more, then error goes down! • Interpretation: Biasing the estimator toward values that are more likely to be true (based on prior information)

  • Downward-biased Mean Estimation Example: Let's estimate μ given i.i.d. X_1, …, X_n with 𝔼[X_i] = μ, using Y = (1/(n + 100)) ∑_{i=1}^{n} X_i
    This estimator is biased: 𝔼[Y] = (1/(n + 100)) ∑_{i=1}^{n} 𝔼[X_i] = n μ / (n + 100), so Bias(Y) = n μ / (n + 100) − μ = −100 μ / (n + 100)
    This estimator has low variance: Var(Y) = Var[(1/(n + 100)) ∑_{i=1}^{n} X_i] = (1/(n + 100)²) ∑_{i=1}^{n} Var[X_i] = n σ² / (n + 100)²
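A short sketch (editorial illustration; the helper name y_stats and the parameter values are made up) that evaluates these formulas for Y:

```python
def y_stats(mu, sigma, n):
    """Analytic bias, variance, and MSE of Y = (1/(n + 100)) * sum of the X_i."""
    bias = n * mu / (n + 100) - mu            # equals -100 * mu / (n + 100)
    var = n * sigma ** 2 / (n + 100) ** 2
    return bias, var, var + bias ** 2

# With mu near 0 the bias is small and the variance is tiny
print(y_stats(mu=0.1, sigma=1.0, n=10))
```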

  • Estimating μ Near 0 Example: Suppose that σ = 1, n = 10, and μ = 0.1
    Bias(X̄) = 0, so MSE(X̄) = Var(X̄) + Bias(X̄)² = σ²/n = 1/10 = 0.1
    MSE(Y) = Var(Y) + Bias(Y)² = n σ²/(n + 100)² + (−100 μ/(n + 100))² = 10/110² + (−10/110)² ≈ 9.1 × 10⁻³
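A quick arithmetic check of this case (editorial illustration, plain Python):

```python
sigma, n, mu = 1.0, 10, 0.1

mse_xbar = sigma ** 2 / n                                 # unbiased: MSE = variance = 0.1
bias_y = -100 * mu / (n + 100)                            # -10/110
mse_y = n * sigma ** 2 / (n + 100) ** 2 + bias_y ** 2     # about 0.0091
print(mse_xbar, mse_y)                                    # Y has roughly 11x lower MSE here
```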

  • Prior Information and Bias: There's No Free Lunch Example: Suppose that σ = 1, n = 10, and μ = 5
    MSE(X̄) = Var(X̄) + Bias(X̄)² = σ²/n = 1/10 = 0.1
    MSE(Y) = Var(Y) + Bias(Y)² = n σ²/(n + 100)² + (−100 μ/(n + 100))² = 10/110² + (−500/110)² ≈ 20.66
    Whoa! What went wrong?
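A simulation makes the contrast between the two regimes explicit; a minimal sketch (editorial illustration; all parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n, trials = 1.0, 10, 200_000

for mu in [0.1, 5.0]:
    x = rng.normal(mu, sigma, size=(trials, n))
    mse_xbar = np.mean((x.mean(axis=1) - mu) ** 2)             # unbiased sample mean
    mse_y = np.mean((x.sum(axis=1) / (n + 100) - mu) ** 2)     # estimator shrunk toward 0
    print(f"mu={mu}:  MSE(Xbar)={mse_xbar:.4f}  MSE(Y)={mse_y:.4f}")
# Shrinking toward 0 helps when mu really is near 0 and hurts badly when it is not.
```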

  • Summary • Sample complexity is the number of samples needed to attain a desired error bound ε with a desired probability 1 − δ • The mean squared error of an estimator decomposes into bias (squared) plus variance • A biased estimator can have lower error than an unbiased estimator • Bias the estimator based on prior information • But this only helps if the prior information is correct • We cannot reduce error by adding arbitrary bias