

  1. STK-IN4300 Statistical Learning Methods in Data Science, lecture 8
     Riccardo De Bin, debin@math.uio.no

  2. Outline of the lecture
     Generalized Additive Models: definition; fitting algorithm.
     Tree-based Methods: background; how to grow a regression tree.
     Bagging: bootstrap aggregation; bootstrap trees.

  3. Generalized Additive Models: introduction
     From the previous lecture:
     - linear regression models are simple and effective;
     - often the effect of a predictor on the response is not linear,
       hence local polynomials and splines.
     Generalized additive models are:
     - flexible statistical methods to identify and characterize nonlinear regression effects;
     - a larger class than the generalized linear models.

  4. Generalized Additive Models: additive models
     Consider the usual framework:
     - $X_1, \dots, X_p$ are the predictors;
     - $Y$ is the response variable;
     - $f_1(\cdot), \dots, f_p(\cdot)$ are unspecified smooth functions.
     Then, an additive model has the form
     $$E[Y \mid X_1, \dots, X_p] = \alpha + f_1(X_1) + \dots + f_p(X_p).$$

  5. Generalized Additive Models: more generally
     As linear models are extended to generalized linear models, additive
     models can be generalized to generalized additive models,
     $$g(\mu(X_1, \dots, X_p)) = \alpha + f_1(X_1) + \dots + f_p(X_p),$$
     where:
     - $\mu(X_1, \dots, X_p) = E[Y \mid X_1, \dots, X_p]$ is the conditional mean;
     - $g(\cdot)$ is the link function;
     - classical examples:
       - $g(\mu) = \mu$: identity link, Gaussian models;
       - $g(\mu) = \log(\mu / (1 - \mu))$: logit link, binomial models;
       - $g(\mu) = \Phi^{-1}(\mu)$: probit link, binomial models;
       - $g(\mu) = \log(\mu)$: logarithmic link, Poisson models;
       - ...
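As a minimal sketch of these models in practice (the lecture prescribes no software; the third-party `pygam` package is one option, assumed installed, and all data here are simulated for illustration):

```python
import numpy as np
from pygam import LinearGAM, LogisticGAM, s  # s(j): smooth term for column j

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=500)

# Identity link (Gaussian model): E[Y|X] = alpha + f1(X1) + f2(X2)
gam = LinearGAM(s(0) + s(1)).fit(X, y)

# Logit link (binomial model): same additive predictor, different g(mu)
y_bin = (y > y.mean()).astype(int)
gam_bin = LogisticGAM(s(0) + s(1)).fit(X, y_bin)
```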

  6. Generalized Additive Models: semiparametric models
     Generalized additive models are very flexible:
     - not all functions $f_j(\cdot)$ must be nonlinear,
       $$g(\mu) = X^T \beta + f(Z),$$
       in which case we talk about semiparametric models;
     - nonlinear effects can be combined with qualitative inputs,
       $$g(\mu) = f(X) + g_k(Z) = f(X) + g(V, Z),$$
       where $k$ indexes the levels of a qualitative variable $V$.
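A sketch of the semiparametric form $g(\mu) = X^T \beta + f(Z)$, again with `pygam` (an assumption, as above): `l(j)` forces a strictly linear term, `s(j)` a smooth one.

```python
import numpy as np
from pygam import LinearGAM, l, s

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=300), rng.uniform(-2, 2, 300)])
y = 1.5 * X[:, 0] + np.sin(2 * X[:, 1]) + rng.normal(scale=0.2, size=300)

# column 0 enters linearly (the X^T beta part), column 1 via a smooth f(Z)
semi = LinearGAM(l(0) + s(1)).fit(X, y)
```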

  7. Fitting algorithm: difference with splines
     When implementing splines:
     - each function is modelled by a basis expansion;
     - the resulting model can be fitted with least squares.
     Here the approach is different:
     - each function is modelled with a smoother (smoothing splines, kernel smoothers, ...);
     - all $p$ functions are fitted simultaneously via an algorithm.

  8. Fitting algorithm: ingredients
     Consider an additive model
     $$Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \epsilon.$$
     We can define a loss function,
     $$\sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \sum_{j=1}^{p} \lambda_j \int \{ f_j''(t_j) \}^2 \, dt_j,$$
     where:
     - the $\lambda_j$ are tuning parameters;
     - the minimizer is an additive cubic spline model:
       each $f_j(X_j)$ is a cubic spline with knots at the (unique) $x_{ij}$'s.

  9. Fitting algorithm: constraints
     The parameter $\alpha$ is in general not identifiable:
     - we get the same result by adding a constant to each $f_j(X_j)$ and subtracting it from $\alpha$;
     - by convention, $\sum_{i=1}^{N} f_j(x_{ij}) = 0$ for all $j$:
       - the functions average 0 over the data;
       - $\alpha$ is therefore identifiable;
       - in particular, $\hat{\alpha} = \bar{y}$.
     If this holds and the matrix of inputs $X$ has full rank:
     - the loss function is convex;
     - the minimizer is unique.

  10. Fitting algorithm: backfitting algorithm
      The backfitting algorithm:
      1. Initialization: $\hat{\alpha} = N^{-1} \sum_{i=1}^{N} y_i$ and $\hat{f}_j \equiv 0$ for all $j$.
      2. Cycle over $j = 1, \dots, p, 1, \dots, p, \dots$:
         $$\hat{f}_j \leftarrow S_j\Big[ \Big\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik}) \Big\}_{i=1}^{N} \Big],$$
         $$\hat{f}_j \leftarrow \hat{f}_j - N^{-1} \sum_{i=1}^{N} \hat{f}_j(x_{ij}),$$
      until the $\hat{f}_j$ change less than a pre-specified threshold.
      $S_j$ is usually a cubic smoothing spline, but other smoothing operators can be used.
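To make the algorithm concrete, here is a bare-bones backfitting sketch in NumPy. The smoother is a Gaussian-kernel (Nadaraya-Watson) smoother rather than the cubic smoothing spline of the slides; the function names, bandwidth, and data are all illustrative.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.3):
    """Smooth partial residuals r against x, evaluated at the training points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_iter=20, tol=1e-6):
    N, p = X.shape
    alpha = y.mean()                       # step 1: alpha_hat = mean(y)
    f = np.zeros((N, p))                   # step 1: f_j = 0 for all j
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):                 # step 2: cycle over j = 1..p
            # partial residual: y - alpha - sum_{k != j} f_k(x_ik)
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = kernel_smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()      # re-center so f_j averages to 0
        if np.abs(f - f_old).max() < tol:  # stop when changes are small
            break
    return alpha, f

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=200)
alpha, f = backfit(X, y)
```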

  11. Fitting algorithm: remarks
      Note:
      - the smoother $S_j$ can be represented (when applied only at the training points) by the $N \times N$ smoothing matrix $S_j$:
        the degrees of freedom for the $j$-th term are $\mathrm{trace}(S_j)$;
      - for the generalized additive model, the loss function is the penalized log-likelihood;
      - the backfitting algorithm fits all predictors:
        it is not feasible when $p \gg N$.
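For a linear smoother, the degrees of freedom can be read directly off the trace of its smoothing matrix; a small sketch, reusing the Gaussian-kernel smoother from above (bandwidth 0.3 is again an arbitrary illustration):

```python
import numpy as np

x = np.sort(np.random.default_rng(0).uniform(-2, 2, 200))
w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.3) ** 2)
S = w / w.sum(axis=1, keepdims=True)  # N x N smoothing matrix, y_hat = S y
df = np.trace(S)                      # degrees of freedom = trace(S_j)
```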

  12. Tree-based Methods: introduction
      Consider a regression problem, with $Y$ the response and $X$ the input matrix.
      A tree is a recursive binary partition of the feature space:
      - at each step, a region is divided into two (or more) regions,
        until a stopping criterion applies;
      - at the end, the input space is split into $M$ regions $R_m$;
      - a constant $c_m$ is fitted in each region $R_m$.
      The final prediction is
      $$\hat{f}(X) = \sum_{m=1}^{M} \hat{c}_m \, 1(X \in R_m),$$
      where $\hat{c}_m$ is an estimate for the region $R_m$ (e.g., $\mathrm{ave}(y_i \mid x_i \in R_m)$).
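A tiny, self-contained illustration of this prediction formula, with a hand-made one-dimensional partition (the cut points and data are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 3, 1.0, np.where(x < 7, 4.0, 2.0)) + rng.normal(scale=0.3, size=200)

cuts = [(0, 3), (3, 7), (7, 10)]                         # the regions R_m
c_hat = [y[(x >= a) & (x < b)].mean() for a, b in cuts]  # ave(y_i | x_i in R_m)

def predict(x0):
    # f_hat(x0) = sum_m c_hat_m * 1(x0 in R_m)
    return sum(c * ((a <= x0) & (x0 < b)) for (a, b), c in zip(cuts, c_hat))
```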

  13. Tree-based Methods: introduction
      [Figure]

  14. Tree-based Methods: introduction
      Note:
      - each split can be represented as a junction of a tree;
      - this representation works for $p > 2$;
      - each observation is assigned to a branch at each junction;
      - the model is easy to interpret.

  15. Tree-based Methods: introduction
      [Figure]

  16. How to grow a regression tree: split
      How to grow a regression tree:
      - we need to automatically decide the splitting variables...
      - ...and the splitting points;
      - we need to decide the shape (topology) of the tree.
      Using a sum-of-squares criterion, $\sum_{i=1}^{N} (y_i - f(x_i))^2$:
      - the best estimate is $\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)$;
      - finding the best partition in terms of minimum sum of squares is generally computationally infeasible,
        so we go greedy.

  17. How to grow a regression tree: greedy algorithm
      Starting with all the data:
      - for each $X_j$, find the best split point $s$:
        - define the two half-planes,
          $R_1(j, s) = \{ X \mid X_j \le s \}$ and $R_2(j, s) = \{ X \mid X_j > s \}$;
        - the choice of $s$ can be done really quickly;
      - for each $j$ and $s$, solve
        $$\min_{j,\, s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big];$$
      - the inner minimization is solved by
        $\hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s))$ and $\hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s))$;
      - the identification of the best pair $(j, s)$ is feasible.
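A sketch of this greedy search in NumPy (an illustration, not the exact implementation used by any particular package): for each feature $j$, candidate split points $s$ are taken halfway between consecutive sorted unique values, and the pair minimizing the two-region sum of squares wins.

```python
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)              # (j, s, sum of squares)
    for j in range(X.shape[1]):
        xj = X[:, j]
        vals = np.unique(xj)
        for s in (vals[:-1] + vals[1:]) / 2:  # candidate split points
            left, right = y[xj <= s], y[xj > s]
            # inner minimization: c1_hat, c2_hat are the region means
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = np.where(X[:, 1] > 0.5, 2.0, -1.0) + rng.normal(scale=0.1, size=100)
j, s, sse = best_split(X, y)                 # expect j == 1, s near 0.5
```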

  18. How to grow a regression tree: when to stop
      The tree size:
      - is a tuning parameter;
      - controls the model complexity;
      - its optimal value should be chosen from the data.
      Naive approach:
      - split the tree nodes only if there is a sufficient decrease in the sum of squares (e.g., larger than a pre-specified threshold):
        - intuitive;
        - short-sighted (a split can be preparatory for a better split below).
      Preferred strategy:
      - grow a large (pre-specified number of nodes) or complete tree $T_0$;
      - prune it (remove branches) to find the best tree.

  19. How to grow a regression tree: cost-complexity pruning
      Consider a subtree $T \subset T_0$, obtained by pruning $T_0$, and define:
      - $R_m$, the region defined by the terminal node $m$;
      - $|T|$, the number of terminal nodes in $T$;
      - $N_m$, the number of observations in $R_m$, $N_m = \#\{ x_i \in R_m \}$;
      - $\hat{c}_m$, the estimate in $R_m$, $\hat{c}_m = N_m^{-1} \sum_{x_i \in R_m} y_i$;
      - $Q_m(T)$, the loss in $R_m$, $Q_m(T) = N_m^{-1} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2$.
      Then, the cost-complexity criterion is
      $$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|.$$
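As a made-up numerical illustration of the criterion (all numbers invented): for a subtree with $|T| = 2$ terminal nodes, $N_1 = 4$, $Q_1(T) = 0.5$, $N_2 = 6$, $Q_2(T) = 0.2$, and $\alpha = 1$,
$$C_\alpha(T) = 4 \cdot 0.5 + 6 \cdot 0.2 + 1 \cdot 2 = 5.2.$$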

  20. How to grow a regression tree: cost-complexity pruning
      The idea is to find the subtree $T_{\hat{\alpha}} \subset T_0$ which minimizes $C_\alpha(T)$:
      - for each $\alpha$, find the unique subtree $T_\alpha$ which minimizes $C_\alpha(T)$;
      - this is done through weakest-link pruning:
        - successively collapse the internal node that produces the smallest increase in $\sum_{m=1}^{|T|} N_m Q_m(T)$,
        - until reaching the single-node tree;
        - $T_\alpha$ is found within this sequence;
      - find $\hat{\alpha}$ via cross-validation (see the sketch below).
      Here the tuning parameter $\alpha$:
      - governs the trade-off between tree size and goodness of fit;
      - larger values of $\alpha$ correspond to smaller trees;
      - $\alpha = 0$ gives the full tree.
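A sketch of this procedure with scikit-learn (simulated data; the details of the CV grid are illustrative): `cost_complexity_pruning_path` returns the effective values of $\alpha$ at which the optimal subtree changes, and cross-validation then picks $\hat{\alpha}$ among them.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.3, size=300)

# effective alphas of the weakest-link pruning sequence
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# choose alpha_hat by cross-validation over the candidate alphas
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]
alpha_hat = path.ccp_alphas[int(np.argmax(cv_scores))]

tree = DecisionTreeRegressor(ccp_alpha=alpha_hat, random_state=0).fit(X, y)
```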
