SLIDE 1

Using Meta-learning for Model Type Selection in Predictive Big Data Analytics

Mustafa Nural, Hao Peng, John A. Miller Department of Computer Science University of Georgia

SLIDE 2

What is Predictive Analytics?

  • The process of building a statistical model from data to capture the relationships between variables, in order to
  • make sense of the data
  • predict outcomes
  • Model

y = f(x) + ε

  • Modeling Technique / Model Type
  • E.g., OLS Regression, Lasso Regression
  • Classification
  • Target outcome of the model is a categorical variable
  • Prediction
  • Target outcome of the model is a non-categorical variable
  • Includes many types of regression models
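The model form y = f(x) + ε can be made concrete with a minimal sketch. This uses numpy rather than the ScalaTion/R implementations from the talk, and fits f as a linear function (OLS Regression, the simplest model type on the list):

```python
import numpy as np

# Toy illustration of the model y = f(x) + eps, with f linear.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# OLS Regression: minimize ||y - X beta||^2 via least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Since the target here is non-categorical, this is a prediction (regression) problem in the slide's terminology; a categorical target would make it classification.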
SLIDE 3

What is the Problem?

  • Choosing the most predictive model from a set of candidate models is non-trivial

  • No free lunch theorem (Wolpert & Macready, 1997)
  • No single modeling technique can consistently outperform others
  • Different restrictions per problem
  • Interpretability
  • Parsimony
  • Etc.
SLIDE 4

Meta-learning

  • β€œLearning to learn”
  • Active area of research in machine learning
  • Learning performance of classification algorithms
  • Hyper-parameter optimization
  • Pre-processing of datasets
  • Little focus has been given to prediction algorithms
  • No previous work on the regression family
  • OLS Regression
  • Regression with regularization
  • Generalized Linear Models
SLIDE 5

Overview of Meta-learning

The pipeline has two phases:

  • Training phase: Training Datasets → Feature Extraction (Meta-features); the same datasets are run through the candidate Modeling Techniques to collect Performance Statistics, reporting the most predictive technique for each dataset. Together these form the Training Set used to train the Meta-learner.
  • Suggestion phase: Candidate Dataset → Meta-features → Suggestion Engine → Most Predictive Technique(s)
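The two phases of the pipeline can be sketched end to end in a toy form. This is a hedged Python stand-in, not the system from the talk (which uses ScalaTion and R): `extract_meta_features` computes only two placeholder features, the "best technique" labels come from a made-up oracle, and a 1-nearest-neighbor classifier plays the meta-learner:

```python
import numpy as np

def extract_meta_features(X, y):
    # Stand-in feature extractor: two toy meta-features per dataset.
    m, n = X.shape
    return np.array([np.log(n / m), y.std() / abs(y.mean())])

# Offline (training) phase: meta-features + best-technique label per dataset.
rng = np.random.default_rng(4)
train_meta, labels = [], []
for _ in range(20):
    X = rng.normal(size=(rng.integers(50, 200), rng.integers(2, 10)))
    y = rng.normal(loc=5.0, size=len(X))
    train_meta.append(extract_meta_features(X, y))
    labels.append("lasso" if X.shape[1] > 5 else "ols")  # toy oracle label
train_meta = np.array(train_meta)

# Online (suggestion) phase: 1-NN meta-learner picks a technique
# for a previously unseen candidate dataset.
Xc = rng.normal(size=(120, 8))
yc = rng.normal(loc=5.0, size=120)
q = extract_meta_features(Xc, yc)
nearest = np.argmin(np.linalg.norm(train_meta - q, axis=1))
suggestion = labels[nearest]
```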

SLIDE 6

Meta-feature Extraction

  • Features from the literature
  • base df, base rdf, non-negative response, domain of response, distinct ratio of response, % numeric, % categorical, % binary variables

  • Grand mean: stddev, mean, skewness, and kurtosis of numeric variables

  • Grand mean: min, max, mean, stddev of categorical variables
  • Additional features particularly relevant for regression problems

  • Log Dimensionality
  • Matrix Condition Number
  • Skewness & Kurtosis of Response
  • Coefficient of Variation of Response
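The regression-specific meta-features above can be sketched in numpy. The exact definitions used in the paper are not given on the slide, so the formulas below (e.g. log dimensionality as log(n/m)) are plausible reconstructions, labeled as such:

```python
import numpy as np

def regression_meta_features(X, y):
    """Sketch of the slide's regression-specific meta-features.
    The formulas are plausible reconstructions, not the paper's exact code."""
    m, n = X.shape
    mu, sd = y.mean(), y.std()
    z = (y - mu) / sd
    return {
        # Log dimensionality: variables per instance (one common definition).
        "log_dimensionality": np.log(n / m),
        # Condition number of the design matrix (max/min singular value).
        "condition_number": np.linalg.cond(X),
        # Skewness and (excess) kurtosis of the response.
        "response_skewness": (z ** 3).mean(),
        "response_kurtosis": (z ** 4).mean() - 3.0,
        # Coefficient of variation of the response.
        "response_cv": sd / mu,
    }

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.abs(rng.normal(loc=5.0, size=200))
feats = regression_meta_features(X, y)
```

A large condition number flags near-collinear predictors, which is exactly the regime where regularized techniques (Ridge, Lasso) tend to beat plain OLS.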
SLIDE 7

Target Modeling Techniques

  • Ordinary Least Squares Regression (ScalaTion)
  • Weighted Least Squares Regression (ScalaTion)
  • Back-elim Regression (ScalaTion)
  • Response Surface Analysis (Quadratic Expansion) (ScalaTion)
  • Response Surface Analysis (Cubic Expansion) (ScalaTion)
  • Log Transformed Regression (ScalaTion)
  • Root Transformed Regression (ScalaTion)
  • Exponential Regression (R)
  • Poisson Regression (R)
  • Inverse Gaussian Regression (R)
  • Gamma Regression (R)
  • Ridge Regression (R, ScalaTion)
  • Lasso Regression (R, ScalaTion)
  • Partial Least Squares Regression (R)
  • Principal Components Regression (R)

SLIDE 8

Generating Training Set

  • Performance metrics
  • Root mean squared error (RMSE)
  • Root relative squared error (RRSE), i.e., √(1 − R²)
  • 15 modeling techniques
  • 114 datasets
  • UCI, OpenML, R, Luis Torgo collection, Bilkent Univ. collection, etc.
  • https://github.com/scalation/data
  • 10-fold cross validation repeated 10 times per dataset/technique to get more reliable estimates
  • Hyper-parameter optimization is done by some modeling techniques
  • E.g., λ penalty for L1 (Lasso) and L2 (Ridge) regularization
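The evaluation protocol above (RRSE under 10-fold cross validation repeated 10 times) can be sketched in numpy only, with plain OLS standing in for an arbitrary modeling technique:

```python
import numpy as np

def rrse(y_true, y_pred):
    # Root relative squared error: sqrt(SSE / SST), i.e. sqrt(1 - R^2).
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return np.sqrt(sse / sst)

def repeated_cv_rrse(X, y, folds=10, repeats=10, seed=0):
    """Mean RRSE over `repeats` runs of `folds`-fold cross validation."""
    rng = np.random.default_rng(seed)
    m = len(y)
    scores = []
    for _ in range(repeats):
        idx = rng.permutation(m)
        for f in range(folds):
            test = idx[f::folds]                  # every folds-th index
            train = np.setdiff1d(idx, test)       # the rest
            beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            scores.append(rrse(y[test], X[test] @ beta))
    return np.mean(scores)

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=150)
score = repeated_cv_rrse(X, y)
```

Averaging over 10 repetitions reduces the variance introduced by any one random fold assignment, which is why the slide's protocol repeats the 10-fold split.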
SLIDE 9

Training the Meta-learner

  • Meta-features are used as predictors
  • Top-performing modeling technique as the response
  • Random Forest Classifier, k-NN Classifier
  • Evaluation metrics
  • Mean Average Precision (MAP@k)
  • Rank-wise precision
  • Loose Accuracy (LA@k)
  • If any of the top-k predictions match the actual top-1 => 1
  • Otherwise => 0
  • Normalized Discounted Cumulative Gain (NDCG@k)
  • Graded penalty if rankings are out of order
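Loose Accuracy and NDCG can be pinned down with a short sketch (technique names and relevance grades below are made-up for illustration):

```python
import numpy as np

def loose_accuracy_at_k(ranked_preds, actual_best, k):
    """LA@k: 1 if the actual top technique is among the top-k predictions."""
    return 1.0 if actual_best in ranked_preds[:k] else 0.0

def ndcg_at_k(ranked_preds, relevance, k):
    """NDCG@k: discounted gain of the predicted ranking, normalized by
    the gain of the ideal ranking, so out-of-order items pay a graded penalty."""
    gains = [relevance[t] for t in ranked_preds[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg

preds = ["lasso", "ols", "ridge"]           # meta-learner's ranking
rel = {"ols": 3, "lasso": 2, "ridge": 1}    # true graded relevance
la1 = loose_accuracy_at_k(preds, "ols", 1)  # 0.0: top-1 prediction misses
la3 = loose_accuracy_at_k(preds, "ols", 3)  # 1.0: the true best is in the top 3
ndcg = ndcg_at_k(preds, rel, 3)             # < 1.0: ranking is slightly out of order
```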
SLIDE 10

Results (Cont’d)

Meta-learner     LA@1   LA@3   MAP@3   NDCG@1   NDCG@3
Random Forest    0.53   0.77   0.56    0.70     0.84
kNN              0.45   0.74   0.55    0.65     0.83

SLIDE 11

Conclusions & Future Work

  • Meta-learning can be used for predictive analytics, including the regression family of techniques
  • Random forest classifier is a viable alternative as a meta-learner for prediction
  • Dimensionality and characteristics of the response variable are the most important meta-features.
  • Generalized Linear Models have specific assumptions on the response variable.
  • Low dimensionality and negative base degrees of freedom are important indicators for using a regularization technique such as Lasso or Ridge.
  • Future work includes:
  • More thorough comparison with AutoWEKA
  • Comparison with Ontology-based and Subsampling-based approaches
SLIDE 12

Questions?

SLIDE 13

Current Approaches

  • Exhaustive Search
  • Meta-learning
  • Ontology-based Semantics
  • Other/Proprietary
SLIDE 14

Exhaustive Search

  • NaΓ―ve approach
  • Build a model using each modeling technique to find the optimal model
  • 238 models in the R caret package
  • > 10,000 packages in R total
  • Examples: AutoWEKA, caret (R), performanceEstimation (R), SPSS Auto Modeler, Data Robot, …

  • PROS
  • Conceptually simple
  • Not complex to implement
  • CONS
  • Might be tedious to implement
  • Time consuming
  • Doesn’t scale well w.r.t. dataset size and number of techniques
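The naive exhaustive-search loop can be sketched directly: fit every candidate technique, score each on a holdout set, and keep the winner. The two candidates below are toy stand-ins (plain OLS and a log-transformed variant), not any package's actual search:

```python
import numpy as np

def ols(Xtr, ytr, Xte):
    b, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return Xte @ b

def log_transformed(Xtr, ytr, Xte):
    # Fit on log(y), then map predictions back to the original scale.
    b, *_ = np.linalg.lstsq(Xtr, np.log(ytr), rcond=None)
    return np.exp(Xte @ b)

candidates = {"OLS": ols, "LogTrans": log_transformed}

# Synthetic data whose response is exponential in the predictors,
# so the log-transformed technique should win.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = np.exp(X @ np.array([0.5, 0.3, -0.2]) + rng.normal(scale=0.05, size=200))
train, test = np.arange(150), np.arange(150, 200)

def rmse(name):
    pred = candidates[name](X[train], y[train], X[test])
    return np.sqrt(np.mean((y[test] - pred) ** 2))

best = min(candidates, key=rmse)  # exhaustive search over all candidates
```

The cost is one full fit-and-evaluate cycle per technique per dataset, which is exactly why this approach fails to scale with dataset size and the number of candidate techniques.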
SLIDE 15

Meta-learning

  • Applying a learning algorithm to pick a base machine learning algorithm
  • Learns a mapping between dataset characteristics and top-performing technique(s) among candidates
  • Has been studied extensively for classification problems.

  • Limited work on
  • predictive models & regression based models
  • mapping data to a model (rather than a technique)
SLIDE 16

Meta-learning (cont’d)

  • PROS
  • Fast once trained
  • Scalable w.r.t. dataset size (let m be the number of instances and n the number of variables)
  • CONS
  • Training required
  • Adding new techniques not possible without re-training
SLIDE 17

Ontology-based Semantics

  • Leverage domain expertise captured formally in an ontology

  • Use logical reasoning to suggest optimal model(s)
SLIDE 18

Ontology-based Semantics (cont’d)

  • PROS
  • Fast
  • Scalable
  • Extending is straightforward
  • CONS
  • Requires manual curation
SLIDE 19

Other/Proprietary: A More Modern Approach

  • No expertise needed
  • Limited analysis capabilities
  • Doesn’t let you change default model criteria and diagnostics
  • Not transparent
  • Doesn’t walk you through the decisions it’s making
  • Therefore limited statistical insight
  • Emphasizes Text Analysis

Screenshot taken from Watson Analytics platform

SLIDE 20

Other/Proprietary (cont’d)

  • Examples: IBM Watson Analytics, Google Prediction API, …

  • PROS
  • Very simple to use
  • CONS
  • Decision-making process is not transparent (Watson Analytics, Google Prediction API)
  • The chosen technique is not known (Google Prediction API)

SLIDE 21

Generating Training Set

  • 114 datasets
  • 43 datasets from UCI Machine Learning Repository
  • 17 datasets from OpenML
  • 16 datasets from publicly available packages in R
  • 12 datasets from Luis Torgo Regression datasets collection
  • 9 datasets from Bilkent University Function Approximation Library
  • 9 datasets from NCI-60 Cell Line panel:
  • Similar to (Lee et al. 2011), we used gene expression data obtained from Affymetrix HG-U133A and B chips, normalized using the GCRMA method, as predictors of the 9 highest-variance proteins obtained from Reverse-phase protein lysate arrays (RPLA).

  • 8 datasets from various other sources

https://github.com/scalation/data