
STK-IN4300 - Statistical Learning Methods in Data Science, Lecture 3: Cross-Validation, Bootstrap Methods, and Methods using Derived Input Directions



Outline of the lecture

- Model Assessment and Selection
  - Cross-Validation
  - Bootstrap Methods
- Methods using Derived Input Directions
  - Principal Component Regression
  - Partial Least Squares
- Shrinkage Methods
  - Ridge Regression

Lecturer: Riccardo De Bin (debin@math.uio.no)

Cross-Validation: k-fold cross-validation

Cross-validation aims at estimating the expected test error, $\mathrm{Err} = \mathrm{E}[L(Y, \hat{f}(X))]$.

- With enough data, we can split them into a training set and a test set.
- Since this is usually not the case, we mimic the split using the limited amount of data we have:
  - split the data into K folds $F_1, \dots, F_K$ of approximately the same size;
  - use, in turn, K - 1 folds to train the model (i.e., derive $\hat{f}^{-k}(X)$);
  - evaluate the model on the remaining fold,

    $\mathrm{CV}(\hat{f}^{-k}) = \frac{1}{|F_k|} \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i));$

  - estimate the expected test error as an average,

    $\mathrm{CV}(\hat{f}) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i)) \overset{|F_k| = N/K}{=} \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-k(i)}(x_i)).$

(Figure: illustration of the k-fold data split; from http://qingkaikong.blogspot.com/2017/02/machine-learning-9-more-on-artificial.html)
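As a concrete illustration of the formulas above, here is a minimal base-R sketch of K-fold cross-validation with squared-error loss. The simulated data, the choice K = 10, and the use of lm() are illustrative assumptions, not taken from the lecture.

```r
set.seed(1)
N <- 100
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N)
dat <- data.frame(x = x, y = y)

K <- 10
fold <- sample(rep(1:K, length.out = N))       # assign each observation to a fold F_k

cv_k <- numeric(K)
for (k in 1:K) {
  fit  <- lm(y ~ x, data = dat[fold != k, ])   # f-hat^{-k}: trained without fold k
  pred <- predict(fit, newdata = dat[fold == k, ])
  cv_k[k] <- mean((dat$y[fold == k] - pred)^2) # CV(f-hat^{-k}) with squared-error loss
}
cv_err <- mean(cv_k)                           # CV(f-hat): average over the K folds
```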

Cross-Validation: choice of K

How to choose K?

- There is no clear-cut answer.
- Bias-variance trade-off:
  - the smaller K, the smaller the variance (but the larger the bias);
  - the larger K, the smaller the bias (but the larger the variance);
  - extreme cases:
    - K = 2: half of the observations for training, half for testing;
    - K = N: leave-one-out cross-validation (LOOCV);
  - LOOCV estimates the expected test error approximately unbiasedly;
  - LOOCV has very large variance (the "training sets" are very similar to one another).
- Usual choices are K = 5 and K = 10.

Cross-Validation: further aspects

If we want to select a tuning parameter $\alpha$ (e.g., the number of neighbours):

- train $\hat{f}^{-k}(X, \alpha)$ for each $\alpha$;
- compute $\mathrm{CV}(\hat{f}, \alpha) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|F_k|} \sum_{i \in F_k} L(y_i, \hat{f}^{-k}(x_i, \alpha))$;
- obtain $\hat{\alpha} = \operatorname{argmin}_\alpha \mathrm{CV}(\hat{f}, \alpha)$.

The generalized cross-validation (GCV),

  $\mathrm{GCV}(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{y_i - \hat{f}(x_i)}{1 - \mathrm{trace}(S)/N} \right]^2,$

- is a convenient approximation of LOOCV for linear fitting under squared loss;
- has computational advantages.

Cross-Validation: the wrong and the right way to do cross-validation

Consider the following procedure:

1. find a subset of good (= most correlated with the outcome) predictors;
2. use the selected predictors to build a classifier;
3. use cross-validation to compute the prediction error.

Practical example (see the R file; a sketch is given below):

- generate X, an [N = 50] x [p = 5000] data matrix;
- generate $y_i$, $i = 1, \dots, 50$, $y_i \in \{0, 1\}$, independently of X;
- the true test error is therefore 0.50;
- implement the procedure above. What happens?
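The following base-R sketch mimics the "wrong" procedure of the practical example, with the sizes from the slide (N = 50, p = 5000, binary outcome independent of the predictors). The 1-nearest-neighbour classifier and the choice to keep 100 screened predictors are assumptions made here for illustration; the original R file may differ.

```r
set.seed(1)
N <- 50; p <- 5000
X <- matrix(rnorm(N * p), N, p)
y <- sample(0:1, N, replace = TRUE)             # outcome independent of X: true error 0.50

## step 1 (the WRONG part): screen predictors using ALL the data
screen <- order(abs(cor(X, y)), decreasing = TRUE)[1:100]
D <- as.matrix(dist(X[, screen]))               # distances on the screened predictors

## steps 2-3: cross-validate a 1-NN classifier built on the screened predictors
K <- 5
fold <- sample(rep(1:K, length.out = N))
err <- numeric(K)
for (k in 1:K) {
  te <- fold == k
  nn <- apply(D[te, !te, drop = FALSE], 1, which.min)   # nearest training neighbour
  err[k] <- mean(y[te] != y[!te][nn])
}
mean(err)   # far below the true 0.50, because the screening already saw the test folds
```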

Cross-Validation: the wrong and the right way to do cross-validation (cont.)

Why is this procedure not correct?

- Training and test sets are NOT independent!
- Observations in the test sets are used twice (first for the predictor selection, then for the error evaluation).

Correct way to proceed:

- divide the sample into K folds;
- perform both the variable selection and the construction of the classifier using the observations from the K - 1 training folds
  - (this includes a possible choice of the tuning parameter);
- compute the prediction error on the remaining fold.

Bootstrap Methods: bootstrap

IDEA: generate pseudo-samples from the empirical distribution function computed on the original sample,

- by sampling with replacement from the original dataset,
- to mimic new experiments.

Suppose $Z = \{(x_1, y_1), \dots, (x_N, y_N)\} = \{z_1, \dots, z_N\}$ is the training set:

- by sampling with replacement, obtain $Z^{*1} = \{(x^*_1, y^*_1), \dots, (x^*_N, y^*_N)\} = \{z^*_1, \dots, z^*_N\}$;
- ...
- by sampling with replacement, obtain $Z^{*B} = \{(x^*_1, y^*_1), \dots, (x^*_N, y^*_N)\}$;
- use the B bootstrap samples $Z^{*1}, \dots, Z^{*B}$ to estimate any aspect of the distribution of a map $S(Z)$.

Bootstrap Methods: bootstrap (cont.)

For example, to estimate the variance of $S(Z)$,

  $\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B - 1} \sum_{b=1}^{B} \left( S(Z^{*b}) - \bar{S}^* \right)^2,$

where $\bar{S}^* = \frac{1}{B} \sum_{b=1}^{B} S(Z^{*b})$.

Note that $\widehat{\mathrm{Var}}[S(Z)]$ is the Monte Carlo estimate of $\mathrm{Var}[S(Z)]$ under sampling from the empirical distribution $\hat{F}$.
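A minimal sketch of the bootstrap variance estimate above, in base R. Taking S to be the sample median of an illustrative data vector is an assumption made here; any statistic computed from Z would do.

```r
set.seed(1)
z <- rnorm(200)                  # original sample (illustrative)
B <- 1000
S_star  <- replicate(B, median(sample(z, replace = TRUE)))  # S(Z*_b), b = 1, ..., B
S_bar   <- mean(S_star)                                     # mean of the bootstrap replicates
var_hat <- sum((S_star - S_bar)^2) / (B - 1)                # Var-hat[S(Z)]
```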

Bootstrap Methods: estimate of the prediction error

Very simple:

- generate B bootstrap samples $Z^{*1}, \dots, Z^{*B}$;
- apply the prediction rule to each bootstrap sample to derive the predictions $\hat{f}^{*b}(x_i)$, $b = 1, \dots, B$;
- compute the error for each point, and take the average,

  $\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)).$

Is it correct? NO!!! Again, training and test sets are NOT independent!

Bootstrap Methods: example

Consider a classification problem:

- two classes with the same number of observations;
- predictors and class label independent, so Err = 0.5.

Using the 1-nearest neighbour classifier:

- if $y_i \in Z^{*b}$, then $\widehat{\mathrm{Err}} = 0$;
- if $y_i \notin Z^{*b}$, then $\widehat{\mathrm{Err}} = 0.5$.

Therefore,

  $\widehat{\mathrm{Err}}_{\mathrm{boot}} = 0 \cdot \Pr[Y_i \in Z^{*b}] + 0.5 \cdot \Pr[Y_i \notin Z^{*b}] = 0.5 \cdot 0.368 = 0.184.$

Bootstrap Methods: why 0.368

Pr[observation i does not belong to the bootstrap sample b] ≈ 0.368.

Since $\Pr[z^{*b}_j \neq z_i] = \frac{N-1}{N}$ holds for each position j of the bootstrap sample, then

  $\Pr[Y_i \notin Z^{*b}] = \left( \frac{N-1}{N} \right)^{N} \xrightarrow{N \to \infty} e^{-1} \approx 0.368.$

Consequently, Pr[observation i is in the bootstrap sample b] ≈ 0.632.

Bootstrap Methods: correct estimate of the prediction error

Note:

- each bootstrap sample has N observations;
- some of the original observations are included more than once;
- some of them (on average, 0.368 N) are not included at all;
  - these are not used to compute the predictions;
  - they can therefore be used as a test set,

  $\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i)),$

where $C^{-i}$ is the set of indices of the bootstrap samples which do not contain observation i, and $|C^{-i}|$ denotes its cardinality.
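A base-R sketch of both the naive bootstrap error and the leave-one-out bootstrap error $\widehat{\mathrm{Err}}^{(1)}$ on the independent-label example, using 1-nearest neighbour. The sample size, B = 200, and the two-dimensional predictors are illustrative assumptions; with this setup the naive estimate comes out near 0.184 while $\widehat{\mathrm{Err}}^{(1)}$ stays near 0.50, as argued above.

```r
set.seed(1)
N <- 50
X <- matrix(rnorm(N * 2), N, 2)
y <- sample(0:1, N, replace = TRUE)            # labels independent of X: true error 0.50
D <- as.matrix(dist(X))                        # pairwise distances between observations

B <- 200
loss  <- matrix(NA, B, N)                      # loss[b, i] = L(y_i, f*_b(x_i))
inbag <- matrix(FALSE, B, N)                   # is observation i in bootstrap sample b?
for (b in 1:B) {
  idx <- sample(1:N, replace = TRUE)           # bootstrap sample Z*_b
  inbag[b, ] <- 1:N %in% idx
  nn <- apply(D[, idx, drop = FALSE], 1, which.min)  # 1-NN of each x_i within Z*_b
  loss[b, ] <- as.numeric(y != y[idx][nn])
}
err_boot <- mean(loss)                                                 # approx 0.184
err_1    <- mean(sapply(1:N, function(i) mean(loss[!inbag[, i], i])))  # approx 0.50
```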

Bootstrap Methods: 0.632 bootstrap

Issue:

- the average number of unique observations in a bootstrap sample is 0.632 N, not so far from the 0.5 N of 2-fold CV;
- $\widehat{\mathrm{Err}}^{(1)}$ therefore has bias issues similar to those of 2-fold CV;
- $\widehat{\mathrm{Err}}^{(1)}$ slightly overestimates the prediction error.

To solve this, the 0.632 bootstrap estimator has been developed,

  $\widehat{\mathrm{Err}}^{(0.632)} = 0.368 \, \overline{\mathrm{err}} + 0.632 \, \widehat{\mathrm{Err}}^{(1)},$

where $\overline{\mathrm{err}}$ is the training error.

- In practice it works well.
- In case of strong overfitting, it can break down:
  - consider again the previous classification problem example;
  - with 1-nearest neighbour, $\overline{\mathrm{err}} = 0$;
  - $\widehat{\mathrm{Err}}^{(0.632)} = 0.632 \, \widehat{\mathrm{Err}}^{(1)} = 0.632 \cdot 0.5 = 0.316 \neq 0.5$.

Bootstrap Methods: 0.632+ bootstrap

Further improvement, the 0.632+ bootstrap:

- it is based on the no-information error rate $\gamma$;
- $\gamma$ takes into account the amount of overfitting;
- $\gamma$ is the error rate we would have if predictors and response were independent;
- it is computed by considering all combinations of $x_i$ and $y_i$,

  $\hat{\gamma} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} L(y_i, \hat{f}(x_{i'})).$

The quantity $\hat{\gamma}$ is used to estimate the relative overfitting rate,

  $\hat{R} = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat{\gamma} - \overline{\mathrm{err}}},$

which is then used in the 0.632+ bootstrap estimator,

  $\widehat{\mathrm{Err}}^{(0.632+)} = (1 - \hat{w}) \, \overline{\mathrm{err}} + \hat{w} \, \widehat{\mathrm{Err}}^{(1)}, \qquad \hat{w} = \frac{0.632}{1 - 0.368 \hat{R}}.$

Methods using Derived Input Directions: summary

- Principal Components Regression
- Partial Least Squares
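The 1-NN example on the slides can be worked through numerically with the formulas above. The inputs err-bar = 0, Err^(1) = 0.5, and gamma-hat = 0.5 are taken from, or implied by, the example (balanced classes, labels independent of the predictors); this is a sketch of the calculation, not a general implementation.

```r
## Worked 0.632 / 0.632+ calculation for the 1-NN overfitting example
err_bar <- 0        # training error of 1-NN (each point is its own nearest neighbour)
err_1   <- 0.5      # leave-one-out bootstrap error
err_632 <- 0.368 * err_bar + 0.632 * err_1            # = 0.316, too optimistic here

gamma_hat <- 0.5                                       # no-information error rate
R_hat <- (err_1 - err_bar) / (gamma_hat - err_bar)     # relative overfitting rate, = 1
w_hat <- 0.632 / (1 - 0.368 * R_hat)                   # = 1
err_632plus <- (1 - w_hat) * err_bar + w_hat * err_1   # = 0.5, matching the true error
```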
