About the caret Package
Max Kuhn
max.kuhn@pfizer.com
Pfizer Global R & D
Research Statistics
Groton, CT

The Package
caret is short for classification and regression training.
It is not on CRAN yet, but it will be this year.
It is a package of miscellaneous functions that I find useful for building predictive models.
The package vignette has much more information and detail. Load the package via library(caret) and type vignette("caret") to see it.

Pre-Processing
There are a few simple functions to pre-process data, such as centering and scaling.
There are also some methods for unsupervised feature selection:
If there are highly correlated predictors, as is the case in quantitative structure-activity relationship (QSAR) models and in gene expression studies, caret has an algorithm to identify a subset of predictors whose pairwise absolute correlations are below a threshold.
There is a function to enumerate linear dependencies among predictors so that they can be removed.
Some numeric predictors have sparse, discrete distributions. We call these "near-zero-variance" predictors, and there is a function to identify them.
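The checks above can be sketched in base R. This is an illustration only: the toy data, the helper names near_zero_var and drop_correlated, and the cutoffs (95/5 frequency ratio, 10% unique values, 0.9 correlation) are assumptions made for this example and are not taken from caret.

```r
# Base R sketch of the pre-processing ideas; caret itself is not used here.
set.seed(1)
n <- 50
x <- data.frame(a = rnorm(n), b = rnorm(n))
x$c <- x$a + rnorm(n, sd = 0.01)   # almost a copy of a (highly correlated)
x$d <- c(1, rep(0, n - 1))         # sparse, nearly constant predictor

# Centering and scaling: each column ends up with mean 0 and sd 1.
scaled <- scale(x[, c("a", "b", "c")])

# Near-zero-variance rule (assumed): the most common value dominates the
# second most common one AND there are few distinct values.
near_zero_var <- function(v, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(v), decreasing = TRUE)
  if (length(counts) == 1) return(TRUE)   # constant column
  freq_ratio <- as.numeric(counts[1]) / as.numeric(counts[2])
  pct_unique <- 100 * length(counts) / length(v)
  freq_ratio > freq_cut && pct_unique < unique_cut
}
nzv <- vapply(x, near_zero_var, logical(1))

# Greedy correlation filter (assumed rule): while any pair exceeds the cutoff,
# drop the member of the worst pair with the larger mean absolute correlation.
drop_correlated <- function(dat, cutoff = 0.9) {
  keep <- colnames(dat)
  repeat {
    if (length(keep) < 2) break
    cm <- abs(cor(dat[, keep, drop = FALSE]))
    diag(cm) <- 0
    if (max(cm) <= cutoff) break
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    avg <- rowMeans(cm[pair, , drop = FALSE])
    keep <- keep[-pair[which.max(avg)]]
  }
  keep
}
kept <- drop_correlated(x[, c("a", "b", "c")])

# Linear dependency: a matrix rank below the column count signals one.
x_dep <- cbind(a = x$a, b = x$b, a_plus_b = x$a + x$b)
rank_dep <- qr(x_dep)$rank
```

Here nzv flags only column d, kept retains b plus one of a and c, and rank_dep is smaller than ncol(x_dep).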

Transforming Predictors
Transforming variables can help some models. One way of doing this is the "spatial-sign" transformation.
Let x be a vector containing the predictors for a single sample. The transformation is x* = x / ||x||.
Samples are projected onto a unit circle (a unit sphere when there are more than two predictors):
[Figure: two scatter plots of Wap versus D.Dr06 with Active and Inactive samples marked, one on the original scale and one after the transformation, where both axes run from -1.0 to 1.0.]
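A minimal base R sketch of the formula x* = x / ||x||, applied row by row. The function name spatial_sign and the toy data are assumptions for this example; centering and scaling first is a common choice assumed here, not something the slide states.

```r
# Divide each sample (row) by its Euclidean norm.
spatial_sign <- function(x) {
  x <- as.matrix(x)
  norms <- sqrt(rowSums(x^2))
  norms[norms == 0] <- 1   # leave all-zero rows at the origin
  x / norms                # the vector recycles down columns, so row i is divided by norms[i]
}

set.seed(2)
raw <- cbind(x1 = rnorm(20, mean = 2), x2 = rexp(20))
raw[1, ] <- c(40, -35)     # an artificial outlier

z <- scale(raw)            # center and scale before projecting (assumed step)
ss <- spatial_sign(z)

range(rowSums(ss^2))       # every sample now has squared norm 1
```

The outlier in row 1 ends up on the unit circle with all other samples, so it can no longer dominate a distance-based or variance-based model.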

Training Models
The main function in the package is called train. It has two main purposes:
1. To be a uniform interface to numerous regression and classification models, so that many different models can be evaluated with minimal code modifications.
2. To choose values for model tuning parameters (if any) using resampling techniques, such as cross-validation or bootstrapping.
There are similar R and Bioconductor packages: ipred, e1071 and MLInterfaces.
Also, caret was built so that it takes minimal effort to extend it to your favorite parallel processing library (such as nws or Rlsf).

The Basic Idea
Create multiple splits or resamples of the data;
Create a grid of model complexity parameters;
for each complexity parameter combination do
    for each data split/resample do
        Train a model with the current complexity parameter combination;
        Predict the held-back samples;
    end
    Calculate performance (e.g. accuracy, R^2) over the held-back samples;
end
Determine the complexity parameters with the best performance;
Refit the model using these parameters on the entire data set;
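The loop above can be sketched in base R without caret. Everything specific here is an assumption for illustration: the simulated data, polynomial degree as the complexity parameter, lm as the model, bootstrap resampling, and RMSE (averaged over resamples) as the performance measure.

```r
set.seed(3)
n <- 200
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- sin(2 * dat$x) + rnorm(n, sd = 0.3)

# Create multiple resamples: bootstrap row indices; rows never drawn are held back.
n_resamples <- 25
resamples <- lapply(seq_len(n_resamples), function(i) sample(n, replace = TRUE))

# Create a grid of model complexity parameters (polynomial degree).
grid <- data.frame(degree = 1:8)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# For each parameter value and each resample: train, predict the held-back
# samples, then summarise performance across resamples.
grid$rmse <- vapply(grid$degree, function(d) {
  per_resample <- vapply(resamples, function(idx) {
    held_back <- dat[-unique(idx), , drop = FALSE]
    fit <- lm(y ~ poly(x, d), data = dat[idx, , drop = FALSE])
    rmse(held_back$y, predict(fit, newdata = held_back))
  }, numeric(1))
  mean(per_resample)
}, numeric(1))

# Determine the best parameter and refit on the entire data set.
best <- grid$degree[which.min(grid$rmse)]
final_fit <- lm(y ~ poly(x, best), data = dat)
```

The final model is fit once on all rows; the resampled RMSE values in grid are used only to pick the degree.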

Example
As a QSAR example, the multidrug resistance reversal (MDRR) agent data was used to predict a specific type of chemical activity. Given a set of compounds with known activity data, the molecular structures were used to predict activity in new (or virtual) compounds.
To fit a support vector machine with a radial basis function, we need to determine the value of the cost (aka regularization) parameter. (There is also an RBF parameter, but we fix that value up-front based on an analytical solution.)
We used bootstrapping: Accuracy = 77%, Kappa = 0.53
[Figure: bootstrap-resampled training kappa (y-axis, about 0.35 to 0.50) plotted against Cost (x-axis, log scale from 0.1 to 1e+05).]
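For readers unfamiliar with the Kappa statistic reported above, here is a small base R sketch of how accuracy and Cohen's kappa are computed from observed and predicted classes. The function name and the class counts are made up for illustration; they are not the MDRR results.

```r
# Accuracy and Cohen's kappa from observed and predicted class labels.
accuracy_and_kappa <- function(obs, pred) {
  lev <- sort(unique(c(as.character(obs), as.character(pred))))
  tab <- table(factor(obs, levels = lev), factor(pred, levels = lev))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  c(accuracy = p_obs, kappa = (p_obs - p_exp) / (1 - p_exp))
}

# Hypothetical predictions: 40 of 50 actives and 35 of 50 inactives are correct.
obs  <- c(rep("Active", 50), rep("Inactive", 50))
pred <- c(rep("Active", 40), rep("Inactive", 10),
          rep("Active", 15), rep("Inactive", 35))

stats <- accuracy_and_kappa(obs, pred)
stats   # accuracy 0.75, kappa 0.5
```

Kappa rescales accuracy by the agreement expected from the class frequencies alone, which is why it is lower than the raw accuracy.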

Other Functions
There are a few different functions for data splitting (and a few more to come), a class for confusion matrices and functions to calculate ROC curves.
A wrapper for partial least squares is included so that there is a formula interface. This function also enables classification models using PLS.
There is a variable importance class. This has specific methods for several models (trees, bagged trees, boosted trees, random forests, MARS, PLS, OLS) and generic methods for other models.
There is a set of functions to apply RMA-like signal processing methods to Affymetrix gene chip data. This method is not batch-oriented, but does require a training set.
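As one illustration of data splitting, here is a base R sketch of a split that samples within each class so the training set keeps the class proportions. The slide does not say which splitting schemes the package offers; the function name stratified_split, the 75% proportion and the toy outcome are assumptions for this example.

```r
# Pick a proportion p of the row indices within each class of y.
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Index into idx rather than calling sample(idx), which misbehaves
    # when a class has a single member.
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(4)
y <- factor(c(rep("a", 80), rep("b", 20)))   # imbalanced toy outcome
train_idx <- stratified_split(y, p = 0.75)

table(y[train_idx])    # 60 of class a, 15 of class b
table(y[-train_idx])   # the remaining 20 and 5 form the test set
```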
