  1. Machine Learning in R
     The mlr package
     Lars Kotthoff [1]
     University of Wyoming
     larsko@uwyo.edu
     St Andrews, 24 July 2018
     [1] with slides from Bernd Bischl

  2. Outline
     ▷ Overview
     ▷ Basic Usage
     ▷ Wrappers
     ▷ Preprocessing with mlrCPO
     ▷ Feature Importance
     ▷ Parameter Optimization

  3. Don’t reinvent the wheel.

  4. Motivation
     The good news:
     ▷ hundreds of packages available in R
     ▷ often high-quality implementations of state-of-the-art methods
     The bad news:
     ▷ no common API (although very similar in many cases)
     ▷ not all learners work with all kinds of data and predictions
     ▷ what data, predictions, hyperparameters, etc. are supported is not easily available
     ⇒ mlr provides a domain-specific language for ML in R

  5. Overview
     ▷ https://github.com/mlr-org/mlr
     ▷ 8-10 main developers, > 50 contributors, 5 GSoC projects
     ▷ unified interface for the basic building blocks: tasks, learners, hyperparameters, …

  6. Basic Usage
     head(iris)
     ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
     ## 1          5.1         3.5          1.4         0.2  setosa
     ## 2          4.9         3.0          1.4         0.2  setosa
     ## 3          4.7         3.2          1.3         0.2  setosa
     ## 4          4.6         3.1          1.5         0.2  setosa
     ## 5          5.0         3.6          1.4         0.2  setosa
     ## 6          5.4         3.9          1.7         0.4  setosa

     # create task
     task = makeClassifTask(id = "iris", iris, target = "Species")
     # create learner
     learner = makeLearner("classif.randomForest")
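     (Aside, not on the original slide: with the task and learner above, the core mlr workflow is the train/predict/performance triple; a minimal sketch.)

     # fit a random forest on the task
     model = train(learner, task)
     # predict on the task's own data (for illustration only)
     pred = predict(model, task = task)
     # mean misclassification error of these predictions
     performance(pred, measures = mmce)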

  7. Basic Usage
     # build model and evaluate
     holdout(learner, task)
     ## Resampling: holdout
     ## Measures:             mmce
     ## [Resample] iter 1:    0.0400000
     ##
     ## Aggregated Result: mmce.test.mean=0.0400000
     ##
     ## Resample Result
     ## Task: iris
     ## Learner: classif.randomForest
     ## Aggr perf: mmce.test.mean=0.0400000
     ## Runtime: 0.0425465

  8. Basic Usage
     # measure accuracy
     holdout(learner, task, measures = acc)
     ## Resampling: holdout
     ## Measures:             acc
     ## [Resample] iter 1:    0.9800000
     ##
     ## Aggregated Result: acc.test.mean=0.9800000
     ##
     ## Resample Result
     ## Task: iris
     ## Learner: classif.randomForest
     ## Aggr perf: acc.test.mean=0.9800000
     ## Runtime: 0.0333493

  9. Basic Usage
     # 10 fold cross-validation
     crossval(learner, task, measures = acc)
     ## Resampling: cross-validation
     ## Measures:             acc
     ## [Resample] iter 1:    1.0000000
     ## [Resample] iter 2:    1.0000000
     ## [Resample] iter 3:    1.0000000
     ## [Resample] iter 4:    1.0000000
     ## [Resample] iter 5:    0.8000000
     ## [Resample] iter 6:    0.9333333
     ## [Resample] iter 7:    1.0000000
     ## [Resample] iter 8:    0.9333333
     ## [Resample] iter 9:    1.0000000
     ## [Resample] iter 10:   0.9333333
     ##
     ## Aggregated Result: acc.test.mean=0.9600000
     ##
     ## Resample Result
     ## Task: iris
     ## Learner: classif.randomForest
     ## Aggr perf: acc.test.mean=0.9600000
     ## Runtime: 0.530509

  10. Basic Usage
     # more general -- resample description
     rdesc = makeResampleDesc("CV", iters = 8)
     resample(learner, task, rdesc, measures = list(acc, mmce))
     ## Resampling: cross-validation
     ## Measures:             acc       mmce
     ## [Resample] iter 1:    1.0000000 0.0000000
     ## [Resample] iter 2:    0.9473684 0.0526316
     ## [Resample] iter 3:    0.9473684 0.0526316
     ## [Resample] iter 4:    0.9473684 0.0526316
     ## [Resample] iter 5:    0.9473684 0.0526316
     ## [Resample] iter 6:    1.0000000 0.0000000
     ## [Resample] iter 7:    0.9444444 0.0555556
     ## [Resample] iter 8:    0.8947368 0.1052632
     ##
     ## Aggregated Result: acc.test.mean=0.9535819,mmce.test.mean=0.0464181
     ##
     ## Resample Result
     ## Task: iris
     ## Learner: classif.randomForest
     ## Aggr perf: acc.test.mean=0.9535819,mmce.test.mean=0.0464181
     ## Runtime: 0.28359
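     (Aside, not on the original slides: makeResampleDesc supports further schemes; a sketch using mlr's standard resample descriptions.)

     # 3 times repeated 10-fold cross-validation
     rdesc.rep = makeResampleDesc("RepCV", reps = 3, folds = 10)
     # stratified CV preserves class proportions in each fold
     rdesc.strat = makeResampleDesc("CV", iters = 10, stratify = TRUE)
     resample(learner, task, rdesc.strat, measures = list(acc, mmce))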

  11. Finding Your Way Around
     listMeasures(task)
     ##  [1] "featperc"         "multiclass.aunu"  "lsr"
     ##  [4] "bac"              "qsr"              "timeboth"
     ##  [7] "multiclass.aunp"  "timetrain"        "timepredict"
     ## [10] "ber"              "multiclass.brier" "ssr"
     ## [13] "acc"              "logloss"          "wkappa"
     ## [16] "multiclass.au1p"  "multiclass.au1u"  "kappa"
     ## [19] "mmce"

     listLearners(task)[1:5, c(1, 3, 4)]
     ##                class short.name      package
     ## 1 classif.adaboostm1 adaboostm1        RWeka
     ## 2   classif.boosting     adabag adabag,rpart
     ## 3        classif.C50        C50          C50
     ## 4    classif.cforest    cforest        party
     ## 5      classif.ctree      ctree        party
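     (Aside, not on the original slide: listLearners can also filter by learner properties; the property names below are mlr's standard ones.)

     # classification learners that predict probabilities and support multiclass
     listLearners("classif", properties = c("prob", "multiclass"))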

  12. Integrated Learners
     Classification:
     ▷ LDA, QDA, RDA, MDA
     ▷ Trees and forests
     ▷ Boosting (different variants)
     ▷ SVMs (different variants)
     ▷ …
     Regression:
     ▷ Linear, lasso and ridge
     ▷ Boosting
     ▷ Trees and forests
     ▷ Gaussian processes
     ▷ …
     Clustering:
     ▷ K-Means
     ▷ EM
     ▷ DBscan
     ▷ X-Means
     ▷ …
     Survival:
     ▷ Cox-PH
     ▷ Cox-Boost
     ▷ Random survival forest
     ▷ Penalized regression
     ▷ …

  13. Learner Hyperparameters
     getParamSet(learner)
     ##                      Type  len   Def   Constr Req Tunable Trafo
     ## ntree             integer    -   500 1 to Inf   -    TRUE     -
     ## mtry              integer    -     - 1 to Inf   -    TRUE     -
     ## replace           logical    -  TRUE        -   -    TRUE     -
     ## classwt     numericvector <NA>     - 0 to Inf   -    TRUE     -
     ## cutoff      numericvector <NA>     -   0 to 1   -    TRUE     -
     ## strata            untyped    -     -        -   -   FALSE     -
     ## sampsize    integervector <NA>     - 1 to Inf   -    TRUE     -
     ## nodesize          integer    -     1 1 to Inf   -    TRUE     -
     ## maxnodes          integer    -     - 1 to Inf   -    TRUE     -
     ## importance        logical    - FALSE        -   -    TRUE     -
     ## localImp          logical    - FALSE        -   -    TRUE     -
     ## proximity         logical    - FALSE        -   -   FALSE     -
     ## oob.prox          logical    -     -        -   Y   FALSE     -
     ## norm.votes        logical    -  TRUE        -   -   FALSE     -
     ## do.trace          logical    - FALSE        -   -   FALSE     -
     ## keep.forest       logical    -  TRUE        -   -   FALSE     -
     ## keep.inbag        logical    - FALSE        -   -   FALSE     -

  14. Learner Hyperparameters
     lrn = makeLearner("classif.randomForest", ntree = 100, mtry = 10)
     lrn = setHyperPars(lrn, ntree = 100, mtry = 10)
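     (Aside, not on the original slide: the two lines above are equivalent, setting hyperparameters at construction time or afterwards; getHyperPars inspects the current settings.)

     # returns a named list, here containing ntree = 100 and mtry = 10
     getHyperPars(lrn)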

  15. Wrappers
     ▷ extend the functionality of learners
     ▷ e.g. wrap a learner that cannot handle missing values with an impute wrapper (see the sketch below)
     ▷ hyperparameter spaces of learner and wrapper are joined
     ▷ can be nested
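     (A minimal sketch of the impute wrapper mentioned above, assuming mlr's makeImputeWrapper and impute method constructors.)

     # randomForest cannot handle missing values; the wrapper imputes
     # numerics with the median and factors with the mode before training
     lrn = makeLearner("classif.randomForest")
     lrn.imp = makeImputeWrapper(lrn,
       classes = list(numeric = imputeMedian(), factor = imputeMode()))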

  16. Wrappers: Available Wrappers
     ▷ Preprocessing: PCA, normalization (z-transformation)
     ▷ Parameter Tuning: grid, optim, random search, genetic algorithms, CMAES, iRace, MBO (sketch below)
     ▷ Filter: correlation- and entropy-based, χ²-test, mRMR, …
     ▷ Feature Selection: (floating) sequential forward/backward, exhaustive search, genetic algorithms, …
     ▷ Impute: dummy variables, imputations with mean, median, min, max, empirical distribution or other learners
     ▷ Bagging to fuse learners on bootstrapped samples
     ▷ Stacking to combine models in heterogeneous ensembles
     ▷ Over- and Undersampling for unbalanced classification
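     (A hedged sketch of a parameter tuning wrapper, assuming mlr's makeTuneWrapper, makeParamSet, and makeTuneControlRandom; the parameter ranges are illustrative.)

     # the wrapped learner tunes ntree and mtry on each training set
     ps = makeParamSet(
       makeIntegerParam("ntree", lower = 100, upper = 1000),
       makeIntegerParam("mtry", lower = 1, upper = 4)
     )
     ctrl = makeTuneControlRandom(maxit = 10L)
     tuned.lrn = makeTuneWrapper(makeLearner("classif.randomForest"),
       resampling = makeResampleDesc("CV", iters = 3),
       par.set = ps, control = ctrl)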

  17. Preprocessing with mlrCPO
     ▷ Composable Preprocessing Operators for mlr – mlrCPO
     ▷ https://github.com/mlr-org/mlrCPO
     ▷ separate R package due to complexity
     ▷ preprocessing operations (e.g. imputation or PCA) as R objects with their own hyperparameters

     operation = cpoScale()
     print(operation)
     ## scale(center = TRUE, scale = TRUE)
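     (Aside, not on the original slide: the operator above can be applied directly to a task with the %>>% operator introduced on the next slide; iris.task is mlr's built-in example task.)

     # centre and scale all numeric features of the iris task
     scaled.task = iris.task %>>% cpoScale()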

  18. Preprocessing with mlrCPO
     ▷ objects are handled using the “piping” operator %>>%
     ▷ composition:
       imputing.pca = cpoImputeMedian() %>>% cpoPca()
     ▷ application to data:
       task %>>% imputing.pca
     ▷ combination with a Learner to form a machine learning pipeline:
       pca.rf = imputing.pca %>>% makeLearner("classif.randomForest")
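     (A hedged detail, not on the slide: preprocessing fitted on training data can be re-applied to new data through mlrCPO's retrafo object; new.data is hypothetical here.)

     transformed = task %>>% imputing.pca
     # retrafo stores the learned imputation values and PCA rotation
     ret = retrafo(transformed)
     # new.transformed = new.data %>>% ret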

  19. mlrCPO Example: Titanic
     # drop uninteresting columns
     dropcol.cpo = cpoSelect(names = c("Cabin", "Ticket", "Name"), invert = TRUE)
     # impute
     impute.cpo = cpoImputeMedian(affect.type = "numeric") %>>%
       cpoImputeConstant("__miss__", affect.type = "factor")

  20. mlrCPO Example: Titanic
     train.task = makeClassifTask("Titanic", train.data, target = "Survived")
     pp.task = train.task %>>% dropcol.cpo %>>% impute.cpo
     print(pp.task)
     ## Supervised task: Titanic
     ## Type: classif
     ## Target: Survived
     ## Observations: 872
     ## Features:
     ##    numerics     factors     ordered functionals
     ##           4           3           0           0
     ## Missings: FALSE
     ## Has weights: FALSE
     ## Has blocking: FALSE
     ## Has coordinates: FALSE
     ## Classes: 2
     ##   0   1
     ## 541 331
     ## Positive class: 0

  21. Combination with Learners
     ▷ attach one or more CPOs to a learner to build machine learning pipelines
     ▷ automatically handles preprocessing of test data

     learner = dropcol.cpo %>>% impute.cpo %>>%
       makeLearner("classif.randomForest", predict.type = "prob")
     # train using the task that was not preprocessed
     pp.mod = train(learner, train.task)
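     (A hedged continuation: prediction on held-out data, where the attached CPOs re-apply the learned preprocessing automatically; test.data is hypothetical, not defined on the slides.)

     # columns are dropped and values imputed on the test data as well
     pred = predict(pp.mod, newdata = test.data)
     performance(pred, measures = auc)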

  22. mlrCPO Summary
     ▷ listCPO() to show available CPOs
     ▷ currently 69 CPOs, and growing: imputation, feature type conversion, target value transformation, over/undersampling, ...
     ▷ CPO “multiplexer” enables combination of different distinct preprocessing operations selectable through a hyperparameter (see the sketch below)
     ▷ custom CPOs can be created using makeCPO()
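     (A sketch of the multiplexer idea, assuming mlrCPO's cpoMultiplex; the exact signature may differ between package versions.)

     # choose between scaling and PCA via a hyperparameter,
     # which can then be tuned like any other
     multi = cpoMultiplex(list(cpoScale(), cpoPca()))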

  23. Feature Importance
     model = train(makeLearner("classif.randomForest"), iris.task)
     getFeatureImportance(model)
     ## FeatureImportance:
     ## Task: iris-example
     ##
     ## Learner: classif.randomForest
     ## Measure: NA
     ## Contrast: NA
     ## Aggregation: function (x) x
     ## Replace: NA
     ## Number of Monte-Carlo iterations: NA
     ## Local: FALSE
     ##
     ##   Sepal.Length Sepal.Width Petal.Length Petal.Width
     ## 1     9.857828    2.282677     42.51918    44.58139
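     (An adjacent approach, not on this slide: mlr also provides model-agnostic filter scores; a sketch assuming mlr's generateFilterValuesData, whose available method names vary across versions.)

     # information-gain filter scores for each feature of the iris task
     fv = generateFilterValuesData(iris.task, method = "information.gain")
     fv$data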
