Learning Theory & Regularization

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 1 / 44

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation


Which Polynomial Degree Is Better? I

Given a training set X = {(x^(i), y^(i))}_{i=1}^N i.i.d. sampled from P(x, y)

Assume P(x, y) = P(y|x)P(x), where
  P(x) ∼ Uniform(−1, 1)
  P(y|x): y = sin(πx) + ε, ε ∼ N(0, σ²)
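The data-generating process above can be sketched in a few lines of NumPy (a hypothetical helper; the noise level `sigma` and the seed are illustrative choices, not from the slides):

```python
import numpy as np

def make_dataset(n, sigma=0.2, seed=0):
    """Draw n i.i.d. pairs with x ~ Uniform(-1, 1), y = sin(pi*x) + eps."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    eps = rng.normal(0.0, sigma, size=n)   # eps ~ N(0, sigma^2)
    return x, np.sin(np.pi * x) + eps

x, y = make_dataset(10)
```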

Which Polynomial Degree Is Better? II

Consider 3 unregularized polynomial regressors of degrees P = 1, 3, and 10. Which one would you pick?

Probably not P = 1 nor P = 10. Note that P = 10 has zero training error:
any N points can be perfectly fitted by a polynomial of degree N − 1.
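The claim that any N points (with distinct inputs) are fitted exactly by a degree-(N − 1) polynomial is easy to check numerically (a sketch using `numpy.polyfit`; the specific N and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
x = np.linspace(-1.0, 1.0, N)                # N distinct inputs
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, N)
coef = np.polyfit(x, y, deg=N - 1)           # degree N-1 interpolant
train_mse = np.mean((np.polyval(coef, x) - y) ** 2)
print(train_mse)                             # zero up to floating point
```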

Empirical Error vs. Generalization Error

In ML, we usually "learn" a function by minimizing the empirical error/risk defined over a training set of size N:

  C_N(w) or C_N[f] = (1/N) ∑_{i=1}^N loss(f(x^(i); w), y^(i))

E.g., C_N(w) = (1/2) ∑_{i=1}^N (y^(i) − w^T x^(i))² in linear regression

But our goal is to have a low generalization error/risk defined over the underlying data distribution:

  C(w) or C[f] = ∫ loss(f(x; w), y) dP(x, y)

It can be estimated by the testing error

  C_{N'}(w) = (1/N') ∑_{i=1}^{N'} loss(f(x'^(i); w), y'^(i))

defined over the testing set X' = {(x'^(i), y'^(i))}_{i=1}^{N'}

Does a low C_N[f] imply a low C[f]? No, as P = 10 indicates.
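The P = 10 example can be reproduced with a small simulation (a sketch; the training-set size, noise level, and seed are illustrative assumptions): a large held-out set stands in for C[f], and the degree-10 fit drives the training error to near zero while its testing error blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, sigma=0.3):
    x = rng.uniform(-1.0, 1.0, n)
    return x, np.sin(np.pi * x) + rng.normal(0.0, sigma, n)

x_tr, y_tr = sample(11)        # small training set
x_te, y_te = sample(10_000)    # large testing set estimates C[f]

errors = {}
for P in (1, 3, 10):
    coef = np.polyfit(x_tr, y_tr, deg=P)         # empirical-error minimizer
    tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)   # C_N estimate
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)   # C[f] estimate
    errors[P] = (tr, te)
    print(P, (tr, te))
```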

No-Free-Lunch Theorem

Why is C[f] defined over a particular data-generating distribution P?

Theorem (No-Free-Lunch Theorem [4])
Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying unseen points.

No machine learning algorithm is universally better than any other. The goal of ML is therefore not to seek a universally good learning algorithm, but an algorithm that performs well on data drawn from the particular P we care about.

Learning Theory

Let f* = argmin_f C[f] be the best possible function we can get.

Since we are seeking a prediction function in a model (hypothesis space) F, this is the best we can have:

  f*_F = argmin_{f∈F} C[f]

But we only minimize empirical errors on limited examples of size N, so this is what we actually have:

  f_N = argmin_{f∈F} C_N[f]

(ignoring numerical errors due to, e.g., numerical optimization)

Learning theory: how to characterize

  C[f_N] = ∫ loss(f_N(x; w), y) dP(x, y)?

Do not confuse C[f_N] with C_N[f]. Two families of approaches: bounding methods and decomposition methods.

Bounding Methods I

min_f C[f] = C[f*] is called the Bayes error.

It is larger than 0 when there is randomness in P(y|x); e.g., in our regression problem, y = f*(x; w) + ε, ε ∼ N(0, σ²). It cannot be avoided even if we know the ground-truth P(x, y).

So, our target is to make C[f_N] as close to C[f*] as possible.

Bounding Methods II

Let E = C[f_N] − C[f*] be the excess error. We have

  E = (C[f*_F] − C[f*]) + (C[f_N] − C[f*_F])
    =        E_app       +        E_est

E_app is called the approximation error; E_est is called the estimation error.

How to reduce these errors? We can reduce E_app by choosing a more complex F: a complex F has a larger capacity, e.g., a larger polynomial degree P in polynomial regression.

How to reduce E_est?

Bounding Methods III

Bounds of E_est for, e.g., binary classifiers [1, 2, 3]:

  E_est = O( ((Complexity(F) · log N) / N)^α ),  α ∈ [1/2, 1],

with high probability.

So, to reduce E_est, we should either have a simpler model (e.g., smaller polynomial degree P), or a larger training set.

Model Complexity, Overfit, and Underfit

Too simple a model leads to high E_app due to underfitting:
  f_N fails to capture the shape of f*
  High training error; high testing error (given a sufficiently large N)

Too complex a model leads to high E_est due to overfitting:
  f_N captures not only the shape of f* but also some spurious patterns (e.g., noise) local to a particular training set
  Low training error; high testing error

Sample Complexity and Learning Curves

How many training examples (N) are sufficient? Different models/algorithms may have different sample complexity, i.e., the N required to learn a target function with a specified generalizability.

This can be visualized using learning curves. Too small an N results in overfitting regardless of model complexity.
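The learning-curve picture can be sketched numerically (the training sizes, noise level, and seed are illustrative assumptions): for a fixed model, training error rises and testing error falls toward each other as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
x_te = rng.uniform(-1.0, 1.0, 20_000)                 # held-out set
y_te = np.sin(np.pi * x_te) + rng.normal(0.0, sigma, x_te.size)

def errors_at(N, P=3):
    """Train a degree-P regressor on N fresh points; return (train, test) MSE."""
    x = rng.uniform(-1.0, 1.0, N)
    y = np.sin(np.pi * x) + rng.normal(0.0, sigma, N)
    coef = np.polyfit(x, y, deg=P)
    tr = np.mean((np.polyval(coef, x) - y) ** 2)
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    return tr, te

curve = {N: errors_at(N) for N in (10, 100, 1000)}
print(curve)
```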

Decomposition Methods

Bounding methods analyze C[f_N] qualitatively. They are general, as no (or only weak) assumptions about the data distribution are made. However, in practice, these bounds are too loose to quantify C[f_N].

In some particular situations, we can decompose C[f_N] into multiple meaningful terms. This assumes a particular loss function loss(·) and data-generating distribution P(x, y), and requires knowledge about point estimation.


Sample Mean and Variance

Point estimation is the attempt to estimate some fixed but unknown quantity θ of a random variable by using sample data.

Let X = {x^(1), …, x^(n)} be a set of n i.i.d. samples of a random variable x. A point estimator or statistic is a function of the data:

  θ̂_n = g(x^(1), …, x^(n))

The value θ̂_n is called the estimate of θ.

Sample mean: μ̂_x = (1/n) ∑_i x^(i)
Sample variance: σ̂²_x = (1/n) ∑_i (x^(i) − μ̂_x)²

How good are these estimators?

Bias & Variance

Bias of an estimator: bias(θ̂_n) = E_X(θ̂_n) − θ

Here, the expectation is taken over all possible X's of size n, i.e., E_X(θ̂_n) = ∫ θ̂_n dP(X).

We call a statistic an unbiased estimator iff it has zero bias.

Variance of an estimator: Var_X(θ̂_n) = E_X[(θ̂_n − E_X[θ̂_n])²]

Is μ̂_x = (1/n) ∑_i x^(i) an unbiased estimator of μ_x? Yes [Homework].

What is Var_X(μ̂_x)?

Variance of μ̂_x

Var_X(μ̂) = E_X[(μ̂ − E_X[μ̂])²] = E[μ̂² − 2μ̂μ + μ²] = E[μ̂²] − μ²
  = E[(1/n²) ∑_{i,j} x^(i) x^(j)] − μ² = (1/n²) ∑_{i,j} E[x^(i) x^(j)] − μ²
  = (1/n²) ( ∑_{i=j} E[x^(i) x^(j)] + ∑_{i≠j} E[x^(i) x^(j)] ) − μ²
  = (1/n²) ( ∑_i E[x^(i)²] + n(n−1) E[x^(i)] E[x^(j)] ) − μ²
  = (1/n) E[x²] + ((n−1)/n) μ² − μ² = (1/n) (E[x²] − μ²)
  = (1/n) σ²_x

The variance of μ̂_x diminishes as n → ∞.
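A Monte-Carlo check of Var_X(μ̂_x) = σ²_x / n (the parameters are illustrative): draw many samples of size n, take each sample's mean, and compare the variance of those means against σ²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 50, 50_000
samples = rng.normal(5.0, sigma, size=(trials, n))   # trials samples of size n
mu_hats = samples.mean(axis=1)                       # one sample mean per trial
print(mu_hats.var(), sigma**2 / n)                   # both close to 0.08
```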

Unbiased Estimator of σ²_x

Is σ̂²_x = (1/n) ∑_i (x^(i) − μ̂_x)² an unbiased estimator of σ²_x? No:

E_X[σ̂²] = E[(1/n) ∑_i (x^(i) − μ̂)²] = E[(1/n)(∑_i x^(i)² − 2 ∑_i x^(i) μ̂ + ∑_i μ̂²)]
  = E[(1/n)(∑_i x^(i)² − n μ̂²)] = (1/n)(∑_i E[x^(i)²] − n E[μ̂²])
  = E[x²] − E[μ̂²] = E[(x − μ)² + 2xμ − μ²] − E[μ̂²] = (σ² + μ²) − (Var[μ̂] + E[μ̂]²)
  = σ² + μ² − (1/n)σ² − μ² = ((n−1)/n) σ² ≠ σ²

What is the unbiased estimator of σ²_x?

  σ̂²_x = (n/(n−1)) · (1/n) ∑_i (x^(i) − μ̂_x)² = (1/(n−1)) ∑_i (x^(i) − μ̂_x)²
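The (n−1)/n bias is easy to see by simulation (illustrative parameters): average the 1/n estimator over many samples and compare with σ² = 4; rescaling by n/(n−1) removes the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
dev = x - x.mean(axis=1, keepdims=True)      # deviations from each sample mean
biased = (dev ** 2).mean(axis=1)             # (1/n)   * sum (x - mu_hat)^2
unbiased = biased * n / (n - 1)              # (1/(n-1)) * sum (x - mu_hat)^2
print(biased.mean(), unbiased.mean())        # ~ (n-1)/n * 4 = 3.2 vs ~ 4.0
```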

Mean Square Error

Mean square error of an estimator: MSE(θ̂_n) = E_X[(θ̂_n − θ)²]

It can be decomposed into the bias and variance:

E_X[(θ̂_n − θ)²] = E[(θ̂_n − E[θ̂_n] + E[θ̂_n] − θ)²]
  = E[(θ̂_n − E[θ̂_n])² + (E[θ̂_n] − θ)² + 2(θ̂_n − E[θ̂_n])(E[θ̂_n] − θ)]
  = E[(θ̂_n − E[θ̂_n])²] + (E[θ̂_n] − θ)² + 2 · E[θ̂_n − E[θ̂_n]] · (E[θ̂_n] − θ)
  = E[(θ̂_n − E[θ̂_n])²] + (E[θ̂_n] − θ)² + 2 · 0 · (E[θ̂_n] − θ)
  = Var_X(θ̂_n) + bias(θ̂_n)²

The MSE of an unbiased estimator is its variance.
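The decomposition can be verified numerically for the biased 1/n variance estimator (illustrative parameters); with empirical moments, MSE = Var + bias² holds exactly up to floating point.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 100_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
est = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # 1/n estimator
mse = np.mean((est - sigma2) ** 2)   # MSE(theta_hat)
var = est.var()                      # Var_X(theta_hat)
bias = est.mean() - sigma2           # bias(theta_hat), about -sigma2/n
print(mse, var + bias**2)            # equal up to floating point
```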


Consistency

So far, we discussed the "goodness" of an estimator based on samples of fixed size. If we have more samples, will the estimate become more accurate?

An estimator is (weakly) consistent iff

  θ̂_n →^Pr θ as n → ∞,

where →^Pr means "converges in probability". It is strongly consistent iff θ̂_n converges to θ almost surely.

Law of Large Numbers

Theorem (Weak Law of Large Numbers)
The sample mean μ̂_x = (1/n) ∑_i x^(i) is a consistent estimator of μ_x, i.e., lim_{n→∞} Pr(|μ̂_{x,n} − μ_x| < ε) = 1 for any ε > 0.

Theorem (Strong Law of Large Numbers)
In addition, μ̂_x is a strongly consistent estimator: Pr(lim_{n→∞} μ̂_{x,n} = μ_x) = 1.
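A quick numerical illustration of the law of large numbers (the distribution and sample size are illustrative): the running sample mean drifts toward the true mean μ as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0
x = rng.exponential(mu, size=100_000)            # i.i.d. draws, true mean 3.0
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)   # mu_hat_{x,n} for each n
print(running_mean[9], running_mean[-1])         # the latter is near 3.0
```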


slide-85
SLIDE 85

Expected Generalization Error

In ML, we get f_N = argmin_{f∈F} C_N[f] by minimizing the empirical error over a training set of size N.

How to decompose the generalization error C[f_N]? Regard f_N(x) as an estimate of the true label y given x:

f_N is an estimator mapped from the i.i.d. samples in the training set X.

To evaluate the estimator f_N, we consider the expected generalization error:

E_X(C[f_N]) = E_X[∫ loss(f_N(x), y) dP(x, y)]
= E_{X,x,y}[loss(f_N(x), y)]
= E_x[E_{X,y}[loss(f_N(x), y) | x]]

There's a simple decomposition of E_{X,y}[loss(f_N(x), y) | x] for
linear/polynomial regression

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 25 / 44

slide-92
SLIDE 92

Example: Linear/Polynomial Regression

In linear/polynomial regression, we have:

loss(·) = (·)², the squared loss
y = f*(x) + ε, ε ~ N(0, σ²); thus E_y[y|x] = f*(x) and Var_y[y|x] = σ²

We can decompose the mean squared error:

E_{X,y}[loss(f_N(x), y) | x] = E_{X,y}[(f_N(x) − y)² | x]
= E_{X,y}[y² + f_N(x)² − 2 f_N(x) y | x]
= E_y[y² | x] + E_X[f_N(x)² | x] − 2 E_{X,y}[f_N(x) y | x]
= (Var_y[y|x] + E_y[y|x]²) + (Var_X[f_N(x)|x] + E_X[f_N(x)|x]²) − 2 E_y[y|x] E_X[f_N(x)|x]
= Var_y[y|x] + Var_X[f_N(x)|x] + (E_X[f_N(x)|x] − E_y[y|x])²
= Var_y[y|x] + Var_X[f_N(x)|x] + E_X[f_N(x) − f*(x) | x]²
= σ² + Var_X[f_N(x)|x] + bias[f_N(x)|x]²

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 26 / 44
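The variance and squared-bias terms can be estimated by Monte Carlo under the slides' setup (y = sin(πx) + ε, ε ~ N(0, σ²)): fit an unregularized degree-P polynomial on many independent training sets and inspect the spread and the offset of its predictions at a fixed query point. A hedged sketch, with σ, N, x₀, and the trial count chosen for illustration:

```python
# Monte Carlo estimate of Var_X[f_N(x0)|x0] and bias[f_N(x0)|x0]^2
# for polynomial regressors of degree P, under y = sin(pi x) + noise.
import numpy as np

rng = np.random.default_rng(1)
sigma, N, x0 = 0.3, 30, 0.5

def f_star(x):
    return np.sin(np.pi * x)

def fit_and_predict(P):
    # draw one training set, fit a degree-P polynomial, predict at x0
    x = rng.uniform(-1, 1, N)
    y = f_star(x) + rng.normal(0, sigma, N)
    coef = np.polyfit(x, y, deg=P)
    return np.polyval(coef, x0)

def bias2_and_var(P, trials=300):
    preds = np.array([fit_and_predict(P) for _ in range(trials)])
    bias2 = (preds.mean() - f_star(x0)) ** 2
    return bias2, preds.var()

b1, v1 = bias2_and_var(1)     # too simple: large bias, small variance
b10, v10 = bias2_and_var(10)  # too complex: small bias, large variance
```

This makes the tradeoff of the next slide concrete: the simple model's error is dominated by bias², the complex model's by variance.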

slide-95
SLIDE 95

Bias-Variance Tradeoff I

E_X(C[f_N]) = E_x[E_{X,y}[loss(f_N(x), y) | x]] = E_x[σ² + Var_X[f_N(x)|x] + bias[f_N(x)|x]²]

The first term cannot be avoided when P(y|x) is stochastic.

Model complexity controls the tradeoff between variance and bias. E.g., polynomial regressors (figure omitted; dotted line = average training error).

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 27 / 44

slide-98
SLIDE 98

Bias-Variance Tradeoff II

Provides another way to understand the generalization/testing error:

Too simple a model leads to high bias, i.e., underfitting: high training error and high testing error (given a sufficiently large N).

Too complex a model leads to high variance, i.e., overfitting: low training error but high testing error.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 28 / 44

slide-99
SLIDE 99

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 29 / 44

slide-103
SLIDE 103

Regularization

We get f_N = argmin_{f∈F} C_N[f] by minimizing the empirical error, but what we really care about is the generalization error C[f_N].

Regularization refers to any technique designed to improve the generalizability of f_N. Any idea inspired by the learning theory?

Regularization in the cost function: weight decay.
Regularization during the training process: validation.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 30 / 44

slide-104
SLIDE 104

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 31 / 44

slide-106
SLIDE 106

Penalizing Complex Functions

Occam's razor: among equally performing models, the simplest one should be selected.

Idea: add a term to the cost function that penalizes complex functions. Then, with a sufficiently complex F:

Minimizing the empirical error term reduces bias.
Minimizing the penalty term reduces variance.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 32 / 44

slide-111
SLIDE 111

What to Penalize?

What impacts Complexity(F) in a model?

Some constants in the model F:
E.g., the degree P in polynomial regression.
These restrict the capacity of F; however, they cannot be penalized in the cost function since they are fixed.

Alternatively, the function parameters:
E.g., the parameter w of a function f(·; w) ∈ F.
These also restrict the capacity of F, and they can be penalized.

But which w implies a complex model?

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 33 / 44

slide-117
SLIDE 117

Weight Decay

In practice, w = 0 is usually the "simplest" function.

E.g., in binary classification with labels {−1, 1}, a perceptron with w = 0 means random guessing.

Weight decay: penalize the norm of w, which is nonnegative and equals 0 when w = 0. E.g., Ridge regression:

argmin_{w,b} (1/2) ‖y − (Xw + b1)‖²  subject to ‖w‖² ≤ T

for some constant T > 0. In practice, we usually solve a simpler problem:

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + (α/2) ‖w‖²

where α > 0 is a constant representing both T and the KKT multiplier.

What does a larger α mean? That we prefer a simpler function.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 34 / 44
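The unconstrained Ridge objective has a closed-form solution: setting the gradient with respect to w to zero gives (XᵀX/N + αI) w = Xᵀ(y − b1)/N. A minimal sketch on illustrative data (centering is used here to handle the unregularized bias b, an assumption consistent with the "flat regressor" discussion below):

```python
# Closed-form Ridge regression; larger alpha shrinks ||w|| toward 0.
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 1.0 + rng.normal(0, 0.1, N)

def ridge(X, y, alpha):
    n, d = X.shape
    # center features and labels so b is recovered as the mean offset
    Xc, yc = X - X.mean(0), y - y.mean()
    w = np.linalg.solve(Xc.T @ Xc / n + alpha * np.eye(d), Xc.T @ yc / n)
    b = y.mean() - X.mean(0) @ w
    return w, b

w_small, _ = ridge(X, y, alpha=1e-3)  # close to the unregularized solution
w_large, _ = ridge(X, y, alpha=10.0)  # heavily shrunk weights
```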

slide-120
SLIDE 120

Flat Regressors

argmin_{w,b} (1/2) ‖y − (Xw + b1)‖² + α‖w‖²

The bias b is not regularized. Why?

We want the simplest function, and w = 0 means "a dummy regressor that predicts the average". Remember R² (the coefficient of determination)? However, the labels y may not be standardized to have zero mean.

This explains why we prefer a "flat" hyperplane in the previous lecture. We have discussed how to solve the Ridge regression problem.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 35 / 44
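A quick numerical check of the "dummy regressor" claim: with w = 0 the cost reduces to (1/2N)‖y − b1‖², which is minimized at b = mean(y), the averaging baseline from R². Toy labels and grid points are illustrative:

```python
# With w = 0, min_b (1/2N)||y - b 1||^2 is attained at b = mean(y).
import numpy as np

y = np.array([1.0, 3.0, 2.0, 6.0])          # mean(y) = 3.0
costs = {b: 0.5 * np.mean((y - b) ** 2) for b in (0.0, 2.0, 3.0, 4.0)}
b_best = min(costs, key=costs.get)           # 3.0, i.e., mean(y)
```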

slide-121
SLIDE 121

Sparse Weight Decay

Alternatively, we can minimize the L1-norm in weight decay. E.g., LASSO (least absolute shrinkage and selection operator):

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + α‖w‖₁

for some constant α > 0. This usually results in a sparse w that has many zero entries. Why?

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 36 / 44
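One way to see the sparsity mechanically is a proximal-gradient (ISTA) solver for the LASSO objective: each iteration takes a gradient step on the squared loss, then soft-thresholds, which sets small coordinates exactly to zero. This solver and its data are an illustrative sketch, not the slides' method (the slides argue sparsity geometrically on the next slide); the bias b is omitted for brevity:

```python
# ISTA for min_w (1/2N)||y - Xw||^2 + alpha*||w||_1 on synthetic data
# where only the first 2 of 10 features are relevant.
import numpy as np

rng = np.random.default_rng(3)
N, D = 100, 10
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + rng.normal(0, 0.1, N)

def soft_threshold(v, t):
    # prox of t*||.||_1: shrink toward 0, snapping |v| <= t to exactly 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, iters=500):
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L for the (1/2N)||.||^2 loss
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w

w_lasso = lasso_ista(X, y, alpha=0.5)  # irrelevant coordinates end up at 0
```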

slide-122
SLIDE 122

Sparsity

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + α‖w‖₁

The surface of the cost function is the sum of the SSE (blue contours) and the L1-norm (red contours) in the omitted figure. The optimal point tends to lie on an axis, where some coordinates of w are exactly zero.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 37 / 44

slide-126
SLIDE 126

Elastic Net**

LASSO can be used as a feature selection technique: the sparse w selects the explanatory variables that are most correlated to the target variable.

Limitations:

1. Selects at most N variables if D > N.
2. No group selection, which is important in some applications, e.g., gene selection problems.

Elastic net combines Ridge and LASSO:

argmin_{w,b} (1/2N) ‖y − (Xw + b1)‖² + α(β‖w‖₁ + ((1−β)/2)‖w‖²)

for some constant β ∈ (0, 1). It still gives a sparse w, and highly correlated variables will have similar values in w.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 38 / 44
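Under the parameterization above, the proximal step for the elastic-net penalty factors into soft-thresholding by step·α·β followed by scaling by 1/(1 + step·α·(1−β)). A hedged sketch reusing the ISTA idea; the data and constants are illustrative, and the bias b is again omitted:

```python
# Proximal gradient for the elastic net
#   (1/2N)||y - Xw||^2 + alpha*(beta*||w||_1 + (1-beta)/2*||w||^2).
import numpy as np

rng = np.random.default_rng(4)
N, D = 100, 10
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + rng.normal(0, 0.1, N)

def enet_prox(v, step, alpha, beta):
    # prox of step*alpha*(beta*|.|_1 + (1-beta)/2*|.|^2):
    # soft-threshold, then shrink multiplicatively
    s = np.sign(v) * np.maximum(np.abs(v) - step * alpha * beta, 0.0)
    return s / (1.0 + step * alpha * (1.0 - beta))

def elastic_net(X, y, alpha, beta, iters=500):
    n, d = X.shape
    w = np.zeros(d)
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = enet_prox(w - step * grad, step, alpha, beta)
    return w

w_enet = elastic_net(X, y, alpha=0.5, beta=0.8)  # sparse, like LASSO
```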

slide-127
SLIDE 127

Outline

1. Learning Theory
2. Point Estimation: Bias and Variance; Consistency*
3. Decomposing Generalization Error
4. Regularization: Weight Decay; Validation

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 39 / 44

slide-131
SLIDE 131

Tuning Hyperparameters

In ML, we call the constants that are fixed in a model the hyperparameters:

The degree P in polynomial regression
The coefficient α of the weight decay term in the cost function of Ridge, LASSO, etc.

They usually reflect some assumptions about the model. Changing their values changes the model complexity, and therefore the generalization performance.

How to set appropriate values? Train the model many times with different hyperparameters, and choose the function with the best generalizability. This is very time consuming; can we have heuristics to speed up the process?

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 40 / 44

slide-132
SLIDE 132

Structured Risk Minimization

Consider again Occam's razor. Structured risk minimization: start from the simplest model, gradually increase its complexity, and stop when overfitting begins.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 41 / 44
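A toy rendition of this heuristic: grow the polynomial degree from 1 upward and stop as soon as the held-out error increases. The data mirror the slides' setup (sin(πx) + noise); the split sizes and the simple first-increase stopping rule are illustrative choices:

```python
# Structured risk minimization sketch: increase polynomial degree P
# until the validation error starts to rise (overfitting sets in).
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, 60)
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

def val_error(P):
    coef = np.polyfit(x_tr, y_tr, deg=P)
    return np.mean((np.polyval(coef, x_va) - y_va) ** 2)

best_P, best_err = 1, val_error(1)
for P in range(2, 11):
    err = val_error(P)
    if err > best_err:   # validation error rose: stop growing the model
        break
    best_P, best_err = P, err
```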

slide-136
SLIDE 136

Validation Set

Pitfall: we peek at the testing set during the training process.

The final function will overfit the testing set, giving an optimistic testing error.

Fix? Split a validation set from the training set and use it for hyperparameter selection.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 42 / 44
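A minimal sketch of this recipe for choosing Ridge's α: hold out part of the training set, fit on the rest for each candidate α, and keep the α with the lowest validation error; the testing set is never touched. Data, split sizes, and the α grid are illustrative:

```python
# Hyperparameter selection on a validation split (no peeking at test data).
import numpy as np

rng = np.random.default_rng(6)
N, D = 80, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(0, 0.5, N)
X_tr, y_tr, X_va, y_va = X[:60], y[:60], X[60:], y[60:]

def ridge_fit(X, y, alpha):
    n, d = X.shape
    Xc, yc = X - X.mean(0), y - y.mean()   # center to handle the bias b
    w = np.linalg.solve(Xc.T @ Xc / n + alpha * np.eye(d), Xc.T @ yc / n)
    return w, y.mean() - X.mean(0) @ w

val_err = {}
for alpha in (1e-4, 1e-2, 1.0, 100.0):
    w, b = ridge_fit(X_tr, y_tr, alpha)
    val_err[alpha] = np.mean((X_va @ w + b - y_va) ** 2)
best_alpha = min(val_err, key=val_err.get)  # over-shrinking loses badly here
```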

slide-137
SLIDE 137

Reference I

[1] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. Ph.D. thesis, École Polytechnique, Palaiseau, France, 2002.
[2] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des sciences de Toulouse: Mathématiques, 9:245–303, 2000.
[3] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 43 / 44

slide-138
SLIDE 138

Reference II

[4] David H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.

Shan-Hung Wu (CS, NTHU) Learning Theory & Regularization Machine Learning 44 / 44