 
Bounding Methods I
- $\min_f C[f] = C[f^*]$ is called the Bayes error
  - Larger than 0 when there is randomness in $P(y \mid x)$
  - E.g., in our regression problem: $y = f^*(x; w) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
  - Cannot be avoided even if we know the ground-truth $P(x, y)$
- So, our target is to make $C[f_N]$ as close to $C[f^*]$ as possible

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 9 / 44

Bounding Methods II
- Let $\mathcal{E} = C[f_N] - C[f^*]$ be the excess error
- We have
$$
\mathcal{E} = \underbrace{C[f^*_{\mathbb{F}}] - C[f^*]}_{\mathcal{E}_{\text{app}}} + \underbrace{C[f_N] - C[f^*_{\mathbb{F}}]}_{\mathcal{E}_{\text{est}}}
$$
  - $\mathcal{E}_{\text{app}}$ is called the approximation error
  - $\mathcal{E}_{\text{est}}$ is called the estimation error
- How to reduce these errors?
- We can reduce $\mathcal{E}_{\text{app}}$ by choosing a more complex $\mathbb{F}$
  - A complex $\mathbb{F}$ has a larger capacity
  - E.g., larger polynomial degree $P$ in polynomial regression
- How to reduce $\mathcal{E}_{\text{est}}$?
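
The split of the excess error into approximation and estimation parts can be estimated numerically. The sketch below is only an illustration under assumed settings that are not from the slides: a hypothetical target $f^*(x) = \sin(2\pi x)$, noise level 0.3, a degree-1 polynomial family as $\mathbb{F}$, and a very large sample standing in for the unknown $f^*_{\mathbb{F}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.3  # assumed noise standard deviation


def f_star(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)


def sample(n):
    # Draw n i.i.d. pairs (x, y) with y = f*(x) + Gaussian noise.
    x = rng.uniform(0.0, 1.0, size=n)
    return x, f_star(x) + rng.normal(0.0, SIGMA, size=n)


def risk(predict, x, y):
    # Monte Carlo estimate of C[f] under the squared loss.
    return np.mean((predict(x) - y) ** 2)


x_test, y_test = sample(200_000)

degree = 1                                   # the family F: straight lines
x_big, y_big = sample(200_000)
w_best = np.polyfit(x_big, y_big, degree)    # stands in for f*_F
x_small, y_small = sample(20)
w_n = np.polyfit(x_small, y_small, degree)   # f_N learned from N = 20 points

c_star = risk(f_star, x_test, y_test)
c_best_in_f = risk(lambda x: np.polyval(w_best, x), x_test, y_test)
c_n = risk(lambda x: np.polyval(w_n, x), x_test, y_test)

e_app = c_best_in_f - c_star   # approximation error
e_est = c_n - c_best_in_f      # estimation error
excess = c_n - c_star          # excess error

print(f"Bayes error ~ {c_star:.3f}")
print(f"E_app ~ {e_app:.3f}, E_est ~ {e_est:.3f}, excess ~ {excess:.3f}")
```

Raising `degree` should shrink the approximation part, while the estimation part depends on how many training points are used.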

Bounding Methods III
- Bounds of $\mathcal{E}_{\text{est}}$ for, e.g., binary classifiers [1, 2, 3]:
$$
\mathcal{E}_{\text{est}} = O\!\left(\left[\frac{\mathrm{Complexity}(\mathbb{F}) \log N}{N}\right]^{\alpha}\right), \quad \alpha \in \left[\tfrac{1}{2}, 1\right], \quad \text{with high probability}
$$
- So, to reduce $\mathcal{E}_{\text{est}}$, we should either have
  - Simpler model (e.g., smaller polynomial degree $P$), or
  - Larger training set
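
To get a feel for how the bound scales, the quantity inside $O(\cdot)$ can be evaluated directly. This is a rough sketch: the function name `estimation_bound` and the complexity value 10 are made up for illustration, and the constant hidden by $O(\cdot)$ is ignored, so only the trend is meaningful.

```python
import math


def estimation_bound(complexity, n, alpha):
    """Return (complexity * log(n) / n) ** alpha, without the hidden constant."""
    return (complexity * math.log(n) / n) ** alpha


# The value shrinks as N grows and grows with the complexity of the model class.
for n in (100, 1_000, 10_000):
    slow = estimation_bound(10, n, 0.5)   # alpha = 1/2
    fast = estimation_bound(10, n, 1.0)   # alpha = 1
    print(f"N={n:>6}: alpha=1/2 -> {slow:.4f}, alpha=1 -> {fast:.4f}")
```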

Model Complexity, Overfit, and Underfit
- Too simple a model leads to high $\mathcal{E}_{\text{app}}$ due to underfitting
  - $f_N$ fails to capture the shape of $f^*$
  - High training error; high testing error (given a sufficiently large $N$)
- Too complex a model leads to high $\mathcal{E}_{\text{est}}$ due to overfitting
  - $f_N$ captures not only the shape of $f^*$ but also some spurious patterns (e.g., noise) local to a particular training set
  - Low training error; high testing error
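
The two failure modes can be seen by fitting polynomials of different degree to one small training set. This is a minimal sketch under assumed settings (a sine target, noise level 0.3, 15 training points, degrees 1, 3, and 9); none of these values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.3  # assumed noise standard deviation


def f_star(x):
    # Hypothetical ground-truth function.
    return np.sin(2 * np.pi * x)


def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, f_star(x) + rng.normal(0.0, SIGMA, size=n)


x_train, y_train = sample(15)
x_test, y_test = sample(2000)

errors = {}
for degree in (1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error can only go down as the degree grows, because each family contains the previous one. The testing error is what reveals underfitting at low degree and, typically, overfitting at high degree.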

Sample Complexity and Learning Curves
- How many training examples ($N$) are sufficient?
- Different models/algorithms may have different sample complexity
  - I.e., the $N$ required to learn a target function with specified generalizability
- Can be visualized using the learning curves
- Too small an $N$ results in overfitting regardless of model complexity
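
A learning curve can be tabulated by refitting the same model on training sets of growing size. The sketch below assumes a sine target, noise level 0.3, a degree-5 polynomial, and an average over 30 random training sets per size; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 0.3   # assumed noise standard deviation
DEGREE = 5    # assumed model complexity
TRIALS = 30   # training sets averaged per size


def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, SIGMA, size=n)


def mse(w, x, y):
    return np.mean((np.polyval(w, x) - y) ** 2)


x_test, y_test = sample(5000)

curve = {}
for n in (10, 40, 160, 640):
    train_errs, test_errs = [], []
    for _ in range(TRIALS):
        x, y = sample(n)
        w = np.polyfit(x, y, DEGREE)
        train_errs.append(mse(w, x, y))
        test_errs.append(mse(w, x_test, y_test))
    curve[n] = (np.mean(train_errs), np.mean(test_errs))
    print(f"N={n:>4}: train = {curve[n][0]:.3f}, test = {curve[n][1]:.3f}")
```

As $N$ grows, the gap between the two averages should close and both should approach the noise level $\sigma^2$.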

Decomposition Methods
- Bounding methods analyze $C[f_N]$ qualitatively
  - General, as no (or weak) assumption on data distribution is made
  - However, in practice, these bounds are too loose to quantify $C[f_N]$
- In some particular situations, we can decompose $C[f_N]$ into multiple meaningful terms
  - Assume particular
    - Loss function $\mathrm{loss}(\cdot)$, and
    - Data generating distribution $P(x, y)$
  - Require knowledge about the point estimation

Outline
1 Learning Theory
2 Point Estimation: Bias and Variance
    Consistency*
3 Decomposing Generalization Error
4 Regularization
    Weight Decay
    Validation

Sample Mean and Variance
- Point estimation is the attempt to estimate some fixed but unknown quantity $\theta$ of a random variable by using sample data
- Let $\mathbb{X} = \{x^{(1)}, \cdots, x^{(n)}\}$ be a set of $n$ i.i.d. samples of a random variable $x$. A point estimator or statistic is a function of the data:
$$
\hat{\theta}_n = g(x^{(1)}, \cdots, x^{(n)})
$$
  - The value $\hat{\theta}_n$ is called the estimate of $\theta$
- Sample mean: $\hat{\mu}_x = \frac{1}{n} \sum_i x^{(i)}$
- Sample variance: $\hat{\sigma}^2_x = \frac{1}{n} \sum_i (x^{(i)} - \hat{\mu}_x)^2$
- How good are these estimators?
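
Both estimators are plain functions of the data. The sketch below computes them from the definitions on hypothetical Gaussian data (mean 5, standard deviation 2, chosen only for illustration) and compares against NumPy's built-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # hypothetical i.i.d. samples

n = x.size
mu_hat = x.sum() / n                        # sample mean
sigma2_hat = ((x - mu_hat) ** 2).sum() / n  # sample variance with the 1/n factor

# np.var uses the same 1/n factor by default (ddof=0).
print(mu_hat, np.mean(x))
print(sigma2_hat, np.var(x))
```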

Bias & Variance
- Bias of an estimator:
$$
\mathrm{bias}(\hat{\theta}_n) = \mathrm{E}_{\mathbb{X}}(\hat{\theta}_n) - \theta
$$
  - Here, the expectation is defined over all possible $\mathbb{X}$'s of size $n$, i.e., $\mathrm{E}_{\mathbb{X}}(\hat{\theta}_n) = \int \hat{\theta}_n \, dP(\mathbb{X})$
  - We call a statistic an unbiased estimator iff it has zero bias
- Variance of an estimator:
$$
\mathrm{Var}_{\mathbb{X}}(\hat{\theta}_n) = \mathrm{E}_{\mathbb{X}}\big[(\hat{\theta}_n - \mathrm{E}_{\mathbb{X}}[\hat{\theta}_n])^2\big]
$$
- Is $\hat{\mu}_x = \frac{1}{n} \sum_i x^{(i)}$ an unbiased estimator of $\mu_x$? Yes [Homework]
- How large is $\mathrm{Var}_{\mathbb{X}}(\hat{\mu}_x)$?
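
Variance of $\hat{\mu}_x$
$$
\begin{aligned}
\mathrm{Var}_{\mathbb{X}}(\hat{\mu}) &= \mathrm{E}_{\mathbb{X}}\big[(\hat{\mu} - \mathrm{E}_{\mathbb{X}}[\hat{\mu}])^2\big] = \mathrm{E}[\hat{\mu}^2 - 2\hat{\mu}\mu + \mu^2] = \mathrm{E}[\hat{\mu}^2] - \mu^2 \\
&= \mathrm{E}\Big[\tfrac{1}{n^2} \sum_{i,j} x^{(i)} x^{(j)}\Big] - \mu^2 = \tfrac{1}{n^2} \sum_{i,j} \mathrm{E}[x^{(i)} x^{(j)}] - \mu^2 \\
&= \tfrac{1}{n^2} \Big( \sum_{i=j} \mathrm{E}[x^{(i)} x^{(j)}] + \sum_{i \neq j} \mathrm{E}[x^{(i)} x^{(j)}] \Big) - \mu^2 \\
&= \tfrac{1}{n^2} \Big( \sum_{i} \mathrm{E}[x^{(i)2}] + n(n-1)\, \mathrm{E}[x^{(i)}]\, \mathrm{E}[x^{(j)}] \Big) - \mu^2 \\
&= \tfrac{1}{n} \mathrm{E}[x^2] + \tfrac{n-1}{n} \mu^2 - \mu^2 = \tfrac{1}{n} \big( \mathrm{E}[x^2] - \mu^2 \big) = \tfrac{1}{n} \sigma^2_x
\end{aligned}
$$
- The variance of $\hat{\mu}_x$ diminishes as $n \to \infty$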
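
The result $\mathrm{Var}_{\mathbb{X}}(\hat{\mu}_x) = \sigma^2_x / n$ can be checked by simulation: draw many datasets of size $n$, compute the sample mean of each, and measure how much those means vary. The distribution (Gaussian, mean 1, standard deviation 2) and the number of repetitions are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, trials = 1.0, 2.0, 20_000  # assumed distribution and repetition count

results = {}
for n in (5, 25, 100):
    data = rng.normal(mu, sigma, size=(trials, n))  # one dataset per row
    mu_hats = data.mean(axis=1)                     # one sample mean per dataset
    results[n] = (mu_hats.var(), sigma ** 2 / n)    # (empirical, predicted)
    print(f"n={n:>3}: empirical = {results[n][0]:.4f}, sigma^2/n = {results[n][1]:.4f}")
```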
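
Unbiased Estimator of $\sigma^2_x$
- Is $\hat{\sigma}^2_x = \frac{1}{n} \sum_i (x^{(i)} - \hat{\mu}_x)^2$ an unbiased estimator of $\sigma^2_x$? No
$$
\begin{aligned}
\mathrm{E}_{\mathbb{X}}[\hat{\sigma}^2] &= \mathrm{E}\Big[\tfrac{1}{n} \sum_i (x^{(i)} - \hat{\mu})^2\Big] = \mathrm{E}\Big[\tfrac{1}{n} \Big( \sum_i x^{(i)2} - 2 \sum_i x^{(i)} \hat{\mu} + \sum_i \hat{\mu}^2 \Big)\Big] \\
&= \mathrm{E}\Big[\tfrac{1}{n} \Big( \sum_i x^{(i)2} - n \hat{\mu}^2 \Big)\Big] = \tfrac{1}{n} \Big( \sum_i \mathrm{E}[x^{(i)2}] - n\, \mathrm{E}[\hat{\mu}^2] \Big) \\
&= \mathrm{E}[x^2] - \mathrm{E}[\hat{\mu}^2] = \mathrm{E}[(x - \mu)^2 + 2x\mu - \mu^2] - \mathrm{E}[\hat{\mu}^2] = (\sigma^2 + \mu^2) - (\mathrm{Var}[\hat{\mu}] + \mathrm{E}[\hat{\mu}]^2) \\
&= \sigma^2 + \mu^2 - \tfrac{1}{n} \sigma^2 - \mu^2 = \tfrac{n-1}{n} \sigma^2 \neq \sigma^2
\end{aligned}
$$
- What's the unbiased estimator of $\sigma^2_x$?
$$
\hat{\sigma}^2_x = \frac{n}{n-1} \Big( \frac{1}{n} \sum_i (x^{(i)} - \hat{\mu}_x)^2 \Big) = \frac{1}{n-1} \sum_i (x^{(i)} - \hat{\mu}_x)^2
$$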
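
The factor $\frac{n-1}{n}$ shows up clearly in a simulation with small $n$. The sketch below assumes Gaussian data with $\sigma^2 = 4$ and $n = 5$ (illustrative values) and uses NumPy's `ddof` argument to switch between the $\frac{1}{n}$ and $\frac{1}{n-1}$ versions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200_000  # assumed sample size, true variance, repetitions

data = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))  # one dataset per row

biased = data.var(axis=1, ddof=0).mean()    # average of the 1/n estimator
unbiased = data.var(axis=1, ddof=1).mean()  # average of the 1/(n-1) estimator

print(f"1/n estimator:     {biased:.3f} (theory: {(n - 1) / n * sigma2:.3f})")
print(f"1/(n-1) estimator: {unbiased:.3f} (theory: {sigma2:.3f})")
```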
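
Mean Square Error
- Mean square error of an estimator:
$$
\mathrm{MSE}(\hat{\theta}_n) = \mathrm{E}_{\mathbb{X}}\big[(\hat{\theta}_n - \theta)^2\big]
$$
- Can be decomposed into the bias and variance:
$$
\begin{aligned}
\mathrm{E}_{\mathbb{X}}\big[(\hat{\theta}_n - \theta)^2\big] &= \mathrm{E}\big[(\hat{\theta}_n - \mathrm{E}[\hat{\theta}_n] + \mathrm{E}[\hat{\theta}_n] - \theta)^2\big] \\
&= \mathrm{E}\big[(\hat{\theta}_n - \mathrm{E}[\hat{\theta}_n])^2 + (\mathrm{E}[\hat{\theta}_n] - \theta)^2 + 2(\hat{\theta}_n - \mathrm{E}[\hat{\theta}_n])(\mathrm{E}[\hat{\theta}_n] - \theta)\big] \\
&= \mathrm{E}\big[(\hat{\theta}_n - \mathrm{E}[\hat{\theta}_n])^2\big] + \mathrm{E}\big[(\mathrm{E}[\hat{\theta}_n] - \theta)^2\big] + 2\,\mathrm{E}\big[\hat{\theta}_n - \mathrm{E}[\hat{\theta}_n]\big](\mathrm{E}[\hat{\theta}_n] - \theta) \\
&= \mathrm{E}\big[(\hat{\theta}_n - \mathrm{E}[\hat{\theta}_n])^2\big] + \big(\mathrm{E}[\hat{\theta}_n] - \theta\big)^2 + 2 \cdot 0 \cdot (\mathrm{E}[\hat{\theta}_n] - \theta) \\
&= \mathrm{Var}_{\mathbb{X}}(\hat{\theta}_n) + \mathrm{bias}(\hat{\theta}_n)^2
\end{aligned}
$$
- MSE of an unbiased estimator is its variance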
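
The identity $\mathrm{MSE} = \mathrm{Var} + \mathrm{bias}^2$ also holds exactly for empirical averages, which makes it easy to verify. As a sketch, the estimator used here is the $\frac{1}{n}$ sample variance applied to assumed Gaussian data with $\sigma^2 = 4$ and $n = 5$; any other estimator and target would work the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 100_000  # assumed sample size, true value, repetitions

data = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
theta_hat = data.var(axis=1, ddof=0)  # one (biased) variance estimate per dataset

mse = np.mean((theta_hat - sigma2) ** 2)  # mean square error w.r.t. the true value
variance = theta_hat.var()                # spread of the estimates
bias = theta_hat.mean() - sigma2          # systematic offset of the estimates

print(f"MSE          = {mse:.4f}")
print(f"Var + bias^2 = {variance + bias ** 2:.4f}")
```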

Consistency
- So far, we discussed the "goodness" of an estimator based on samples of fixed size
- If we have more samples, will the estimate become more accurate?
- An estimator is (weakly) consistent iff:
$$
\lim_{n \to \infty} \hat{\theta}_n \xrightarrow{\Pr} \theta,
$$
  where $\xrightarrow{\Pr}$ means "converges in probability"
  - Strongly consistent iff it "converges almost surely"

Law of Large Numbers
- Theorem (Weak Law of Large Numbers)
  The sample mean $\hat{\mu}_x = \frac{1}{n} \sum_i x^{(i)}$ is a consistent estimator of $\mu_x$, i.e., $\lim_{n \to \infty} \Pr(|\hat{\mu}_{x,n} - \mu_x| < \varepsilon) = 1$ for any $\varepsilon > 0$.
- Theorem (Strong Law of Large Numbers)
  In addition, $\hat{\mu}_x$ is a strongly consistent estimator: $\Pr(\lim_{n \to \infty} \hat{\mu}_{x,n} = \mu_x) = 1$.
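
The weak law can be illustrated by estimating $\Pr(|\hat{\mu}_{x,n} - \mu_x| < \varepsilon)$ for growing $n$. The sketch below assumes an exponential distribution with mean 1, $\varepsilon = 0.1$, and 500 independent runs; these are illustrative choices, and a simulation only suggests the limit, it does not prove it.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps, trials, n_max = 1.0, 0.1, 500, 2000  # assumed settings

samples = rng.exponential(scale=mu, size=(trials, n_max))
# running_mean[t, n-1] is the sample mean of the first n draws in run t.
running_mean = np.cumsum(samples, axis=1) / np.arange(1, n_max + 1)

probs = {}
for n in (10, 100, 1000, 2000):
    # Fraction of runs whose sample mean lies within eps of the true mean.
    probs[n] = np.mean(np.abs(running_mean[:, n - 1] - mu) < eps)
    print(f"n={n:>4}: Pr(|mu_hat - mu| < {eps}) ~ {probs[n]:.3f}")
```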

Expected Generalization Error
- In ML, we get $f_N = \arg\min_{f \in \mathbb{F}} C_N[f]$ by minimizing the empirical error over a training set of size $N$
- How to decompose the generalization error $C[f_N]$?
- Regard $f_N(x)$ as an estimate of the true label $y$ given $x$
  - $f_N$ is an estimator mapped from i.i.d. samples in the training set $\mathbb{X}$
- To evaluate the estimator $f_N$, we consider the expected generalization error:
$$
\begin{aligned}
\mathrm{E}_{\mathbb{X}}(C[f_N]) &= \mathrm{E}_{\mathbb{X}}\Big[\int \mathrm{loss}(f_N(x) - y) \, dP(x, y)\Big] \\
&= \mathrm{E}_{\mathbb{X}, x, y}\big[\mathrm{loss}(f_N(x) - y)\big] \\
&= \mathrm{E}_{x}\Big(\mathrm{E}_{\mathbb{X}, y}\big[\mathrm{loss}(f_N(x) - y) \mid x\big]\Big)
\end{aligned}
$$
- There's a simple decomposition of $\mathrm{E}_{\mathbb{X}, y}[\mathrm{loss}(f_N(x) - y) \mid x]$ for linear/polynomial regression
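
Example: Linear/Polynomial Regression
- In linear/polynomial regression, we have
  - $\mathrm{loss}(\cdot) = (\cdot)^2$, a squared loss
  - $y = f^*(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, thus $\mathrm{E}_y[y \mid x] = f^*(x)$ and $\mathrm{Var}_y[y \mid x] = \sigma^2$
- We can decompose the mean square error:
$$
\begin{aligned}
\mathrm{E}_{\mathbb{X}, y}[\mathrm{loss}(f_N(x) - y) \mid x] &= \mathrm{E}_{\mathbb{X}, y}[(f_N(x) - y)^2 \mid x] \\
&= \mathrm{E}_{\mathbb{X}, y}[y^2 + f_N(x)^2 - 2 f_N(x) y \mid x] \\
&= \mathrm{E}_{y}[y^2 \mid x] + \mathrm{E}_{\mathbb{X}}[f_N(x)^2 \mid x] - 2\, \mathrm{E}_{\mathbb{X}, y}[f_N(x) y \mid x] \\
&= (\mathrm{Var}_y[y \mid x] + \mathrm{E}_y[y \mid x]^2) + (\mathrm{Var}_{\mathbb{X}}[f_N(x) \mid x] + \mathrm{E}_{\mathbb{X}}[f_N(x) \mid x]^2) - 2\, \mathrm{E}_y[y \mid x]\, \mathrm{E}_{\mathbb{X}}[f_N(x) \mid x] \\
&= \mathrm{Var}_y[y \mid x] + \mathrm{Var}_{\mathbb{X}}[f_N(x) \mid x] + (\mathrm{E}_{\mathbb{X}}[f_N(x) \mid x] - \mathrm{E}_y[y \mid x])^2 \\
&= \mathrm{Var}_y[y \mid x] + \mathrm{Var}_{\mathbb{X}}[f_N(x) \mid x] + \mathrm{E}_{\mathbb{X}}[f_N(x) - f^*(x) \mid x]^2 \\
&= \sigma^2 + \mathrm{Var}_{\mathbb{X}}[f_N(x) \mid x] + \mathrm{bias}[f_N(x) \mid x]^2
\end{aligned}
$$
  - The cross term factorizes in the fourth line because, given $x$, the test label $y$ is independent of the training set $\mathbb{X}$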
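
The decomposition can be checked by simulation at a single input point: train on many independent training sets, record the prediction at that point, and compare the average squared error with $\sigma^2 + \mathrm{Var} + \mathrm{bias}^2$. Everything concrete below (sine target, noise level 0.3, $N = 40$, query point 0.3, degrees 1, 3, 7) is an assumption made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA, N, TRIALS, X0 = 0.3, 40, 3000, 0.3  # assumed noise, train size, runs, query point


def f_star(x):
    # Hypothetical ground-truth function.
    return np.sin(2 * np.pi * x)


report = {}
for degree in (1, 3, 7):
    preds = np.empty(TRIALS)
    for t in range(TRIALS):
        # A fresh training set for every run.
        x = rng.uniform(0.0, 1.0, size=N)
        y = f_star(x) + rng.normal(0.0, SIGMA, size=N)
        preds[t] = np.polyval(np.polyfit(x, y, degree), X0)
    # Fresh test labels at X0, independent of every training set.
    y0 = f_star(X0) + rng.normal(0.0, SIGMA, size=TRIALS)

    mse = np.mean((preds - y0) ** 2)
    variance = preds.var()
    bias2 = (preds.mean() - f_star(X0)) ** 2
    report[degree] = (mse, SIGMA ** 2 + variance + bias2, variance, bias2)
    print(f"degree {degree}: MSE = {mse:.3f}, "
          f"sigma^2 + Var + bias^2 = {report[degree][1]:.3f} "
          f"(Var = {variance:.3f}, bias^2 = {bias2:.3f})")
```

The two columns agree only up to Monte Carlo noise, since the noise term and the cross term are replaced by their expectations on the right-hand side.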

Bias-Variance Tradeoff I
$$
\begin{aligned}
\mathrm{E}_{\mathbb{X}}(C[f_N]) &= \mathrm{E}_{x}\Big(\mathrm{E}_{\mathbb{X}, y}\big[\mathrm{loss}(f_N(x) - y) \mid x\big]\Big) \\
&= \mathrm{E}_{x}\Big(\sigma^2 + \mathrm{Var}_{\mathbb{X}}[f_N(x) \mid x] + \mathrm{bias}[f_N(x) \mid x]^2\Big)
\end{aligned}
$$
- The first term cannot be avoided when $P(y \mid x)$ is stochastic
- Model complexity controls the tradeoff between variance and bias
- E.g., polynomial regressors
  [Figure: polynomial regressors; dotted line = average training error]

Bias-Variance Tradeoff II
- Provides another way to understand the generalization/testing error
- Too simple a model leads to high bias or underfitting
  - High training error; high testing error (given a sufficiently large $N$)
- Too complex a model leads to high variance or overfitting
  - Low training error; high testing error

Regularization
- We get $f_N = \arg\min_{f \in \mathbb{F}} C_N[f]$ by minimizing the empirical error