CSE 291D/234 Data Systems for Machine Learning, Arun Kumar, Topic 3




SLIDE 1

CSE 291D/234 Data Systems for Machine Learning

Topic 3: Feature Engineering and Model Selection Systems (Readings: DL book; Chapters 8.2 and 8.3 of MLSys book)

Arun Kumar

SLIDE 2

Model Selection in the Lifecycle

Data acquisition -> Data preparation -> Feature Engineering -> Training & Inference -> Model Selection -> Serving -> Monitoring

SLIDE 3

Model Selection in the Big Picture

SLIDE 4

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 5

Bias-Variance-Noise Decomposition

ML (Test) Error = Bias + Variance + Bayes Noise
❖ Bayes noise: discriminability of examples, e.g., x = (a,b,c); y = +1 vs. x = (a,b,c); y = -1 (identical features, conflicting labels)
❖ Bias and variance: complexity of the model/hypothesis space

SLIDE 6

Hypothesis Space of Functions

❖ A trained ML model is a parametric prediction function:

f : D_W × D_X → D_Y


❖ Hypothesis Space: the set of all possible functions f that can be represented by a model
❖ Training: picks one f from the hypo. space; needs an estimation procedure (e.g., optimization, greedy search, etc.)
❖ Factors that determine the hypo. space:
  ❖ Feature representation
  ❖ Inductive bias of the model
  ❖ Regularization

(The hypothesis space is denoted H.)
SLIDE 7

Another View of Bias-Variance

❖ Bias arises because the hypo. space does not contain the "truth"
  ❖ Shrinking the hypo. space raises bias
❖ Variance arises due to the finite training sample
  ❖ Estimation only approximately reaches the best function in the hypo. space
  ❖ Shrinking the hypo. space lowers variance

SLIDE 8

3 Ways to Control Learning/Accuracy

❖ Reduce Bayes Noise:
  ❖ Augment with new useful features from the application
❖ Reduce Bias:
  ❖ Enhance the hypo. space: derive different features; use a more complex model
  ❖ Reduce shrinkage (less regularization)
❖ Reduce Variance:
  ❖ Shrink the hypo. space: derive different features; drop features; use a less complex model
  ❖ Enhance shrinkage (more regularization)

SLIDE 9

The Double Descent Phenomenon

❖ DL and some other ML model families can get arbitrarily complex
  ❖ Can "memorize" the entire training set
❖ Curiously, variance can drop after rising; bias goes to 0!
❖ The "interpolation regime" is an open question in ML theory

https://arxiv.org/pdf/1812.11118.pdf

SLIDE 10

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 11

Unpredictability of Model Selection

❖ Recall the 3 ways to control ML accuracy: reduce bias, reduce variance, reduce Bayes noise
❖ Alas, the exact rises/drops in error on a given training task and sample are not predictable
❖ Need empirical comparisons of configurations on the data
  ❖ Train-validation-test splits; cross-validation procedures
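The cross-validation procedure mentioned above can be sketched with a hand-rolled k-fold splitter (stdlib only; the function name and seed handling are illustrative, not from any particular library):

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, valid) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)           # fixed seed: reproducible splits
    folds = [idx[i::k] for i in range(k)]      # k roughly equal folds
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

# Each candidate config is scored on each fold's validation set and the
# k scores are averaged; the best-scoring config wins the comparison.
splits = list(kfold_indices(n=10, k=5))
```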

SLIDE 12

The Model Selection Triple

❖ The data scientist/AutoML procedure must steer 3 key activities to alter the Model Selection Triple (MST):
1. Feature Engineering (FE): What is/are the domain(s) of the hypo. space(s) to consider?
2. Algorithm/Architecture Selection (AS): What exact hypo. space to use (model type/ANN architecture)?
3. Hyper-parameter Tuning (HT): How to configure hypo. space shrinkage and the estimation procedure's approximation?

https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf

SLIDE 13

The Model Selection Triple

❖ The data scientist/AutoML procedure must steer 3 key activities to explore the Model Selection Triple (MST)

https://adalabucsd.github.io/papers/2015_MSMS_SIGMODRecord.pdf

[Figure: the model selection loop. Candidate MSTs {FE1, FE2, ...} x {AS1, AS2, ...} x {HT1, HT2, ...} are sent to an ML system to train and test model config(s); results are post-processed and consumed, feeding the next iteration.]

❖ Stopping criterion is application-/user-specific on a Pareto surface: time, cost, accuracy, tiredness (!), etc.

SLIDE 14

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 15

Feature Engineering

❖ Process of converting prepared data into a feature vector representation for ML training/inference
❖ Aka feature extraction, representation extraction, etc.
❖ Activities vary based on data type:
  ❖ Temporal feature extraction
  ❖ Joins and Group Bys
  ❖ Feature interactions
  ❖ Feature selection
  ❖ Value recoding
  ❖ Dimensionality reduction

SLIDE 16

Feature Engineering

❖ Process of converting prepared data into a feature vector representation for ML training/inference
❖ Aka feature extraction, representation extraction, etc.
❖ Activities vary based on data type:
  ❖ Signal processing-based features
  ❖ Deep learning
  ❖ Transfer learning
  ❖ Bag of words
  ❖ N-grams
  ❖ Parsing-based features

SLIDE 17

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 18

Hyperparameter Tuning

❖ Most ML models have hyper-parameter knobs
  ❖ Example knobs across model types: learning rate; regularization; complexity; dropout prob.; number of trees; max height/min split
❖ Most of them raise bias slightly but reduce variance more
❖ No hyp.par. settings are universally best for all tasks/data

SLIDE 19

Hyperparameter Tuning

❖ Common methods to tune hyp.par. configs:

https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf http://gael-varoquaux.info/science/survey-of-machine-learning-experimental-methods-at-neurips2019-and-iclr2020.html

Grid search “Random” search
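The two methods differ only in how configs are enumerated. A minimal sketch (stdlib only; the knob names, value lists, and sampling ranges are made up for illustration):

```python
import itertools
import random

# Hypothetical search space: learning rate and L2 regularization strength.
grid = {"lr": [1e-3, 1e-2, 1e-1], "l2": [1e-4, 1e-2, 1.0]}

# Grid search: the Cartesian product of the per-knob value lists.
grid_configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]

# Random search: sample each knob independently (log-uniform here), with
# the budget chosen by the user rather than fixed by the grid shape.
rng = random.Random(0)
def sample_config():
    return {"lr": 10 ** rng.uniform(-4, 0), "l2": 10 ** rng.uniform(-5, 1)}

random_configs = [sample_config() for _ in range(9)]
```

Random search covers each individual knob with more distinct values for the same budget, which is why it often beats grid search when only a few knobs matter.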

SLIDE 20

Hyperband

❖ An automated ML (AutoML) procedure for tuning hyp.par.
❖ Basic Idea: For iterative training procedures (e.g., SGD), stop non-promising hyp.par. configs at earlier epochs
❖ Based on the multi-armed bandit idea from gambling/RL
❖ Benefits:
  ❖ Reapportioning resources with early stopping may help reach better overall accuracy sooner
  ❖ Total resource use may be lower vs. grid/random search
❖ 2 knobs as input:
  ❖ R: max budget per config (e.g., # SGD epochs)
  ❖ η: stop rate for configs

https://arxiv.org/pdf/1603.06560.pdf
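The bracket schedules that Hyperband derives from R and η can be computed directly. A short sketch (stdlib only; assumes R is a power of η, as in the R = 81, η = 3 example used on the next slides):

```python
import math

def hyperband_schedule(R, eta):
    """For each bracket s, list the successive-halving rounds as (n_i, r_i):
    n_i = # surviving configs, r_i = budget (e.g., epochs) each one gets."""
    s_max = 0
    while eta ** (s_max + 1) <= R:      # s_max = floor(log_eta R), computed exactly
        s_max += 1
    brackets = {}
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # initial # configs
        r = R // eta ** s                                # initial budget per config
        brackets[s] = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
    return brackets

sched = hyperband_schedule(R=81, eta=3)
```

The most aggressive bracket (s = 4) starts 81 configs at 1 epoch each and repeatedly keeps the best 1/η of them with η-times the budget; the most conservative bracket (s = 0) runs a few configs at the full budget, like plain random search.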

SLIDE 21

Hyperband

Brackets: independent trials, akin to random search. Within a bracket: survival of the fittest!

https://arxiv.org/pdf/1603.06560.pdf

SLIDE 22

Hyperband

https://arxiv.org/pdf/1603.06560.pdf

R = 81; η = 3. n_i: # hyp.par. configs run; r_i: # epochs per config
❖ Still not as popular as grid/random search; the latter are simpler and easier to use (e.g., how to set R and η?)

SLIDE 23

Review Zoom Poll

SLIDE 24

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 25

Algorithm Selection

❖ Basic Goal: AutoML procedure to pick among a set of interchangeable models (hyp.par. tuning included) ❖ Automate a data scientist’s intuition on feature preprocessing, missing values, hyp.par. tuning, etc. ❖ Many heuristics: AutoWeka, AutoSKLearn, DataRobot, etc.

https://www.cs.ubc.ca/labs/beta/Projects/autoweka/papers/autoweka.pdf

AutoWeka

SLIDE 26

Algorithm Selection

❖ AutoSKLearn uses a more sequential Bayesian optimization approach

http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf

SLIDE 27

NAS and AutoKeras

❖ A DL neural computational graph (NCG) arch. is akin to a model family in classical ML
❖ Some AutoML tools aim to automate NCG design too (neural architecture search, NAS)

https://arxiv.org/pdf/1611.01578.pdf https://arxiv.org/pdf/1806.10282.pdf

Google's NAS uses RL to construct and evaluate NCGs; AutoKeras uses Bayesian optimization and has an optimized implementation
❖ Not that popular in practice; compute-intensive; hard to debug

SLIDE 28

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 29

Systems Aspects of Model Selection

❖ ML/data mining folks have studied model selection from an algorithmic automation/accuracy standpoint
❖ But its resource efficiency is a pressing ML systems issue:
  ❖ Long running times; need lots of CPUs/GPUs
  ❖ Cost and energy footprints are non-trivial
  ❖ If a user is in the loop, latency matters too
❖ Need to raise the throughput of exploring training configs with minimal resource expenses

SLIDE 30

Asynchronous Successive Halving (ASHA)

❖ Successor to Hyperband that uses resources more fully
❖ Issues -> New Ideas:
  ❖ Top-k evals in Hyperband are a sync. point bottleneck when configs are diverse -> asynchronous top-k check; better for diverse configs
  ❖ Fewer and fewer configs towards bracket end (lower degree of parallelism) -> add new hyp.par. configs on the fly; keep all workers busy
❖ ASHA adapts the AutoML procedure to the cluster setting for massively parallel hyp.par. tuning

https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/
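The asynchronous top-k check can be sketched as a promotion rule: whenever a worker frees up, promote any config that is in the top 1/η of the configs that have finished its rung, without waiting for the rest of the bracket. A toy sketch (the data structure and function names are illustrative, not from the ASHA implementation):

```python
ETA = 3  # stop rate, as in Hyperband

def promotable(rungs, eta=ETA):
    """Return (rung, config) eligible for promotion, or None.
    rungs: list of dicts mapping config id -> validation loss at that rung."""
    for r in range(len(rungs) - 2, -1, -1):        # prefer higher rungs
        done = sorted(rungs[r], key=rungs[r].get)  # finished configs, best loss first
        top_k = len(done) // eta                   # top 1/eta of *completed* configs
        for cfg in done[:top_k]:
            if cfg not in rungs[r + 1]:            # not promoted yet
                return r, cfg
    return None                                    # idle worker starts a new config
```

When `promotable` returns None, the freed worker simply starts a brand-new config at the bottom rung, which is how ASHA keeps all workers busy.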

SLIDE 31

Asynchronous Successive Halving (ASHA)

[Figure: ASHA experiments with 25 and 500 workers; total time of weeks!]

SLIDE 32

Introducing Cerebro

❖ Key Observation: False dichotomy of 2 main parallelism paradigms in ML for scalable training / model selection

Task Parallelism (Dask, Hyperband, ASHA, Vizier, etc.):
+ High throughput model selection
+ Best accuracy from sequential SGD
- Low data scalability; wastes space (copy) or network (remote read)

Data Parallelism (RDBMS, Spark, PS, Horovod, etc.):
+ High data scalability via sharding
- BSP does not converge; mini-batch level has high communication costs
- Low throughput overall

SLIDE 33

Q: Can we get the best of both worlds?

SLIDE 34

Cerebro’s Model Hopper Parallelism

❖ A new hybrid of task- and data-parallelism for SGD

[Figure: 4 workers hold data shards D1-D4; DNN configs 1-4 start epoch 1 in parallel, one per worker, then hop between workers across sub-epochs (1.1, 1.2, ...).]

SLIDE 35

Cerebro's Model Hopper Parallelism

❖ Key Insight: SGD is robust to the randomness of data ordering
❖ Properties of Model Hopper Parallelism (MOP):
  ❖ All configs visit the dataset in some sequential order; ensures similar accuracy as task parallelism
  ❖ Scheduler keeps all workers busy on their shards, just like data parallelism
  ❖ No sync. point within an epoch of training all configs; very little worker idling due to 1 comm. step per epoch
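One way to realize the scheduler's guarantee (every config visits every shard exactly once per epoch, all workers busy) is a round-robin, Latin-square schedule. A toy sketch under the simplifying assumption of one config per worker (the function name is illustrative; Cerebro's actual scheduler is more general):

```python
def mop_schedule(n_workers, configs):
    """One MOP epoch as a list of sub-epochs: in sub-epoch t, worker w
    trains config (w + t) mod n on its local shard. The result is a Latin
    square: each config visits each worker/shard exactly once per epoch."""
    n = len(configs)
    assert n == n_workers  # simplest case: as many configs as workers
    return [[configs[(w + t) % n] for w in range(n_workers)] for t in range(n)]

sched = mop_schedule(4, ["A", "B", "C", "D"])
```

Between sub-epochs, only the model states "hop" across workers; the data shards never move, which is where the communication savings come from.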

SLIDE 36

Communication Cost Analysis of MOP

❖ p workers; |S| configs; k epochs; b batch size; m model size

Data-parallel (BSP) communication cost: 2km(p-1)|S|⌈|D|/(bp)⌉ (e.g., 72 PB)
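Plugging the symbols above into the formulas gives a quick comparison. A sketch (the BSP formula is the one on the slide; the MOP cost below is an assumption based on the one-model-transfer-per-hop description, roughly p hops of size m per config per epoch):

```python
import math

def bsp_data_parallel_cost(k, m, p, S, D, b):
    """Slide's formula: 2 model-sized transfers per non-master worker per
    mini-batch sync, over all k epochs and |S| configs."""
    return 2 * k * m * (p - 1) * S * math.ceil(D / (b * p))

def mop_cost(k, m, p, S):
    """Assumed MOP cost: each config's model hops across the p workers
    once per epoch (p transfers of size m), independent of batch size b."""
    return k * m * p * S
```

Because MOP's cost has no ⌈|D|/(bp)⌉ factor, it is independent of the SGD batch size, unlike mini-batch-level data parallelism.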
SLIDE 37

Empirical Results

❖ Cerebro/MOP is near Pareto-optimal on completion time, memory/space efficiency, and network cost

SLIDE 38

Discussion on Cerebro paper

SLIDE 39

Vision of Cerebro Platform

[Figure: CEREBRO platform stack.
Interfaces: CLIs; GUIs; Explanation Engine.
High-level Model Building APIs: Transfer Learning; Ablation Analysis; Sequence Analysis; Feature Transfer; Hyperparameter Tuning; Architecture Search; Grouped Learning; Multi-task Batching.
Optimization and Scheduling Layer: Model Hopper Parallelism (MOP); MOP Hybrids; Materialization and Memory Manager; Scheduler; Metadata Manager; Fault Tolerance and Elasticity Manager.
Execution and Storage Layer: AutoDiff and SGD Execution; Direct Filesystem Access (EXT + NFS; HDFS); Dataflow Engines; Cloud Native (EC2, EBS, Lambda, S3).]

SLIDE 40

Determined AI Training Platform

https://determined.ai/

SLIDE 41

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 42

Feature Engineering Systems

❖ Received less attention than model building systems
❖ Key issues they address:
  ❖ Usability: higher-level specification of feature eng. ops
  ❖ Efficiency: automated systems-level optimization
❖ Challenges:
  ❖ Feature eng. is very heterogeneous; tough for one tool to capture all ops, data types, etc.
  ❖ Turing-complete code is rampant in feature eng.; tough for automated optimization

SLIDE 43

Feature Engineering Systems

❖ Sample of feature engineering systems:
  ❖ Columbus (joins, feature interactions, feature selection)
  ❖ KeystoneML (textual / signal proc. features)
  ❖ Vista (deep transfer learning)

SLIDE 44

Feature Selection in Columbus

❖ Setting: exploratory feature subset selection for GLMs on tabular data in R (or NumPy/Pandas)
❖ Goal: reduce compute redundancy and data access at scale
❖ Approach: an embedded domain-specific language (DSL) with "logical" ops. [Figure: example program in the Columbus DSL.]

SLIDE 45

Feature Selection in Columbus

❖ Optimization techniques: some logical ops have alternate physical ops with different runtimes; Columbus picks automatically
  ❖ Exact: batching, subset materialization, QR decomposition
  ❖ Approx.: coreset sampling, warm starting
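The batching idea can be made concrete for linear regression: one pass over the data yields the Gram matrix and correlations, and then every feature subset's normal equations are just sub-blocks of those, with no re-scan of the data. A sketch in that spirit (assuming NumPy; the data and helper name are made up for illustration, and this is a simplification of Columbus's actual physical ops):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=1000)

# Shared precomputation over the full feature matrix: one data scan.
G = X.T @ X   # Gram matrix
c = X.T @ y   # feature-target correlations

def fit_subset(cols):
    """Solve least squares for a feature subset using only sub-blocks
    of the precomputed G and c (normal equations: G_SS w = c_S)."""
    cols = list(cols)
    return np.linalg.solve(G[np.ix_(cols, cols)], c[cols])

w = fit_subset([0, 2, 5])   # no pass over X needed per subset
```

Exploring hundreds of subsets then costs one scan of the data total, instead of one scan per subset.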

SLIDE 46

Feature Pipelines in KeystoneML

❖ Similar to Columbus but more general: a larger set of classical ML training and feat. eng. ops on top of Spark
❖ Supports text and signal proc.-based image features

https://amplab.cs.berkeley.edu/wp-content/uploads/2017/01/ICDE_2017_CameraReady_475.pdf

❖ Optimizations: different distributed linear solvers at the op level; at the full-pipeline level: materializing and caching intermediates, sampling, and common sub-expression elimination

SLIDE 47

Feature Transfer in Vista

❖ Setting: Pre-trained CNNs are commonly used to extract image feature repr. for multimodal analytics ❖ Issue: No single layer of CNN is universally best for downstream accuracy; need to compare multiple layers

SLIDE 48

Feature Transfer in Vista

[Figure: multimodal analytics. Structured data (brand, tags, price) is combined with image data featurized by a pre-trained deep CNN and fed to downstream ML model training. But no single CNN layer is always best for accuracy.]

SLIDE 49

Feature Transfer in Vista

❖ Approach: Vista casts feature transfer as a multi-query optimization problem and creates materialized views

❖ Optimizations: Staging out layer materializations avoids compute redundancy; automated memory management

[Figure: naive prior approach vs. Vista's multi-query optimization.]
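The staging idea can be shown with a toy stand-in for a CNN: a chain of layer functions whose intermediate outputs feed downstream models. The naive approach re-runs the prefix of the chain once per requested layer; the staged approach runs the chain once and materializes only the requested outputs (all names and the toy "layers" are illustrative, not Vista's implementation):

```python
# Toy "CNN": a chain of layer functions on a scalar input.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
wanted = {1, 3}   # indices of layers whose outputs feed downstream models

def naive(x):
    """One full prefix re-computation per requested layer (redundant work)."""
    outs = {}
    for j in wanted:
        h = x
        for layer in layers[: j + 1]:
            h = layer(h)
        outs[j] = h
    return outs

def staged(x):
    """Vista-style staging: a single pass, materializing requested outputs."""
    outs, h = {}, x
    for j, layer in enumerate(layers):
        h = layer(h)
        if j in wanted:
            outs[j] = h
    return outs
```

With L requested layers, the naive plan does O(L) partial forward passes per image while the staged plan does one, which is exactly the compute redundancy Vista's optimizer removes.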

SLIDE 50

Tradeoffs of Feature Eng. Systems

❖ Pros:
  ❖ High-level ops may help improve ML user productivity
  ❖ Automated resource optimization reduces costs
❖ Cons:
  ❖ Lack of sufficient generality
  ❖ ML user needs to (re)learn new APIs; may be complex
  ❖ Extra dependencies and maintenance issues
❖ Some companies now have in-house custom APIs/tools or general code/notebook orchestration for feat. eng. pipelines (not really optimized). More on "feature stores" in Topic 6.

SLIDE 51

Example Industrial Feat. Eng. Sys.

SLIDE 52

Outline

❖ Recap: Bias-Variance-Noise Decomposition
❖ The Model Selection Triple
  ❖ Feature Engineering
  ❖ Hyperparameter Tuning
  ❖ Algorithm/Architecture Selection
❖ Model Selection Systems
❖ Feature Engineering Systems
❖ Advanced Model Selection Systems Issues

SLIDE 53

End-to-End AutoML

❖ Some tools claim to automate data preparation, feat. eng., and model building holistically

❖ Unclear how effective they are; no public benchmarks ❖ Unclear if they do any holistic optimizations, e.g., caching common intermediates, logical-physical separation ❖ Open questions on systematizing and optimizing end-to-end AutoML

SLIDE 54

Cloud-Native Model Selection

❖ ML resource availability is now flexible and heterogeneous
  ❖ Local machine -> on-premise cluster -> cloud
❖ Cloud-native offers new opportunities/challenges:
  ❖ Elasticity: upscale/downscale compute/RAM as needed
  ❖ Cheap decoupled storage (e.g., S3)
  ❖ Cheap ephemeral compute (e.g., Spot, Serverless)
❖ Need to redesign model sel. sys. to be cloud-native:
  ❖ Open questions on optimizing resource efficiency vs. runtimes vs. total cost

SLIDE 55

More Effective Architecture Selection

❖ Most DL users still hand-craft NCGs for AS
  ❖ Analogous to manual feat. eng. in classical ML
❖ NAS / AutoKeras still have only limited adoption

https://www.youtube.com/watch?v=r5aEkpEkDzI&feature=emb_title

❖ Open questions on bridging the usability gap
  ❖ Need fast human-in-the-loop tools
  ❖ Domain-specific GUI-based AS tools?

SLIDE 56

Review Questions

❖ Name 3 model sel. systems/approaches for SGD-based ML discussed in class whose communication complexity is independent of SGD batch size.
❖ Briefly explain 2 cons of building separate feat. eng. systems.
❖ Briefly explain one common systems-level optimization seen in many feat. eng. systems.
❖ Why bother redesigning model sel. systems for the cloud?