How Developers Iterate on Machine Learning Workflows -- A Survey of the Applied Machine Learning Literature


SLIDE 1

How Developers Iterate on Machine Learning Workflows

A Survey of the Applied Machine Learning Literature

Doris Xin (1), Litian Ma (1), Shuchen Song (1), Rong Ma (2), Aditya Parameswaran (1)

(1) University of Illinois at Urbana-Champaign, (2) Peking University

SLIDE 2

Developing Machine Learning Applications is Iterative

  • Input data: images, text, logs, tables, etc.
  • Data Preprocessing: data cleaning, feature engineering, etc. Iterate: add features, scale features, etc.
  • Learning/Inference: train ML models, make predictions w/ trained models. Iterate: add regularization, change model type, etc.
  • Post Processing: evaluation, interpretation & explanation. Iterate: change metrics, drill down on results, etc.

SLIDE 3

Developing Machine Learning Applications is Iterative

Input data → Data Preprocessing (DPR) → Learning/Inference (L/I) → Post Processing (PPR)

Interactive!

Creating systems to enhance interactivity requires a statistical characterization of how developers iterate on ML workflows.

[Figure: the workflow pipeline annotated with the number of iterations (Num. Iterations) per stage]

SLIDE 4

How Do Developers Iterate on Machine Learning Workflows?

  • Computer Vision: try a new neural net architecture
  • Web App: data cleaning!
  • NLP: BLEU vs. ROUGE
  • Natural Sciences: better features!

SLIDE 5

How Do Developers Iterate on Machine Learning Workflows?

Our approach: study iterations by collecting statistics from applied ML papers, grouped by application domain.


SLIDE 6

Outline

  • Data & Limitations
  • Methodology

○ Statistics ○ Estimation

  • Results
  • Conclusion & Future Work

6

SLIDE 7

Outline

  • Data & Limitations
  • Methodology

○ Statistics ○ Estimation

  • Results
  • Conclusion & Future Work

7

SLIDE 8

Corpus: 105 Papers from 2016

Limitations

  • Incomplete picture of iterations
    ○ Papers focus on ML and omit DPR
  • Results presented side-by-side
    ○ Can't determine the order of iterations
  • # papers / domain is small
    ○ May lead to spurious results

Remedies

  • Multiple surveyors to reduce the chance of spurious results
  • Iteration estimators that do not rely on order
SLIDE 9

Outline

  • Data & Limitations
  • Methodology

○ Statistics ○ Estimation

  • Results
  • Conclusion & Future Work

9

SLIDE 10

Collecting Statistics

For each surveyed paper, record the operations mentioned in each workflow component, along with the paper's number of tables and figures:

  • Data Prep.: norm., impute, ...
  • ML Model Class: LSTM, SVM, ...
  • ML Tuning: reg., learn. rate, ...
  • Evaluation Metrics: AUC, ...
  • # tables, # figs per paper (e.g., 5 tables, 2 figures)

Aggregate these per-paper statistics across the corpus.

Open source dataset at https://github.com/helix-ml/AppliedMLSurvey
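To make the aggregation step concrete, here is a minimal Python sketch of computing per-domain operation frequencies from per-paper records. The record fields and example values are hypothetical; the released dataset's actual schema may differ.

```python
from collections import Counter, defaultdict

# Hypothetical per-paper records; the actual survey dataset's schema may differ.
papers = [
    {"domain": "NLP", "data_prep": ["bow", "join"], "model_class": ["lstm"],
     "tuning": ["learn_rate", "batch_size"], "metrics": ["bleu"], "n_tables": 5, "n_figs": 2},
    {"domain": "NLP", "data_prep": ["feat_def"], "model_class": ["svm"],
     "tuning": ["kernel"], "metrics": ["accuracy"], "n_tables": 3, "n_figs": 1},
]

# Aggregate: for each domain, how often is each operation mentioned per component?
counts = defaultdict(Counter)
for paper in papers:
    for component in ("data_prep", "model_class", "tuning", "metrics"):
        counts[(paper["domain"], component)].update(paper[component])

# Report relative frequencies, as in the per-domain tables on the following slides.
for (domain, component), counter in sorted(counts.items()):
    total = sum(counter.values())
    for op, n in counter.most_common():
        print(f"{domain:>4} {component:<12} {op:<12} {n / total:.1%}")
```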

SLIDE 11

Estimating Iterations

From the same per-paper records (operations per component, # tables, # figs), aggregate to estimate:

  • Number of data prep. iterations
  • Number of ML iterations
  • Number of post proc. iterations
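The survey's exact estimators are defined in the paper and technical report; as an illustration of an estimator that does not rely on iteration order (one of the remedies above), one might lower-bound iterations by the number of distinct alternatives reported side-by-side for each component. The field names below are hypothetical.

```python
# Illustrative, order-free iteration estimate (not the paper's exact estimator):
# each distinct alternative reported for a component implies at least one
# iteration, regardless of the order in which the alternatives were tried.

def estimate_iterations(paper: dict) -> dict:
    """Lower-bound iteration counts per workflow stage from side-by-side results."""
    return {
        "data_prep": len(set(paper["data_prep"])),
        "ml": len(set(paper["model_class"])) + len(set(paper["tuning"])),
        "post_proc": len(set(paper["metrics"])),
    }

print(estimate_iterations({
    "data_prep": ["bow", "join"], "model_class": ["lstm", "svm"],
    "tuning": ["learn_rate"], "metrics": ["bleu", "rouge"],
}))  # {'data_prep': 2, 'ml': 3, 'post_proc': 2}
```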

SLIDE 12

Outline

  • Data & Limitations
  • Methodology

○ Statistics ○ Estimation

  • Results
  • Conclusion & Future Work

12

SLIDE 13

Mean Iteration Count by Domain

SLIDE 14


Data Preprocessing

  • Feat. Def. = human-defined features from raw attributes
    ○ e.g., adult = true if age >= 18 (see the sketch after the table below)

Top data preprocessing operations by domain:

  • Social Sciences: Join (31.0%), Feat. Def. (27.6%), Normalize (17.2%), Impute (6.9%)
  • Natural Sciences: Feat. Def. (40.6%), Univar. FS (18.8%), Normalize (12.5%), PCA (9.4%)
  • Web Apps: Feat. Def. (36.1%), Join (22.2%), Normalize (13.9%), Discretize (8.3%)
  • NLP: Feat. Def. (32.1%), BOW (17.9%), Join (14.3%), Normalize (10.7%)
  • Computer Vision: Feat. Def. (37.5%), BOW (25.0%), Interaction (25.0%), Join (12.5%)
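To make "Feat. Def." concrete, here is a minimal pandas sketch of the adult example from the bullet above, plus a normalization step; the DataFrame and column names are illustrative, not drawn from any surveyed paper.

```python
import pandas as pd

# Illustrative raw attributes; column names are hypothetical.
df = pd.DataFrame({"age": [12, 25, 18, 40]})

# Feature definition: derive a human-defined feature from a raw attribute,
# e.g. adult = true if age >= 18.
df["adult"] = df["age"] >= 18

# A typical follow-up iteration might normalize a numeric feature as well.
df["age_norm"] = (df["age"] - df["age"].mean()) / df["age"].std()
print(df)
```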

SLIDE 15


ML Model Classes

  • Generalized linear models (GLM): logistic regression, linear regression, etc.
  • SVMs are popular (especially in natural sciences!) possibly due to kernels; see the sketch after the table below
  • Deep learning is only popular in NLP and computer vision so far

Top ML model classes by domain:

  • Social Sciences: GLM (36.0%), SVM (28.0%), RF (20.0%), Decision Tree (12.0%)
  • Natural Sciences: SVM (32.7%), GLM (15.4%), RF (13.5%), DNN (13.5%)
  • Web Apps: GLM (37.0%), SVM (11.1%), RF (11.1%), Matrix Factor. (11.1%)
  • NLP: RNN (32.4%), GLM (14.7%), SVM (11.8%), CNN (8.8%)
  • Computer Vision: CNN (38.2%), SVM (17.6%), RNN (17.6%), RF (5.9%)
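A minimal scikit-learn sketch of why kernels can make SVMs attractive on non-linear data, comparing a GLM (logistic regression) against an RBF-kernel SVM on synthetic data; illustrative only, not a claim about the surveyed papers.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic non-linearly-separable data, standing in for structured measurements.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

# A generalized linear model struggles without hand-engineered features...
glm = LogisticRegression()
# ...while the RBF kernel lets the SVM fit the non-linear boundary directly.
svm = SVC(kernel="rbf")

print("GLM accuracy:", cross_val_score(glm, X, y, cv=5).mean())
print("SVM accuracy:", cross_val_score(svm, X, y, cv=5).mean())
```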

SLIDE 16


ML Model Tuning

  • Learning Rate + Batch Size → looking for faster training (a sketch of such a tuning sweep follows the table below)

Top ML tuning operations by domain:

  • Social Sciences: Regularize (40.0%), Cross Val. (30.0%), Learn Rate (10.0%), Batch Size (10.0%)
  • Natural Sciences: Cross Val. (31.8%), Learn Rate (22.7%), DNN Arch. (18.2%), Kernel (9.1%)
  • Web Apps: Regularize (41.2%), Learn Rate (23.5%), Batch Size (11.8%), Cross Val. (11.8%)
  • NLP: Learn Rate (39.4%), Batch Size (24.2%), DNN Arch. (18.2%), Kernel (6.1%)
  • Computer Vision: Learn Rate (46.2%), Batch Size (30.8%), DNN Arch. (11.5%), Regularize (11.5%)
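As a sketch of what a learning-rate/batch-size sweep looks like, each configuration below would count as one ML-tuning iteration; `train()` is a hypothetical stand-in for an actual training routine, not part of the survey.

```python
from itertools import product

# Hypothetical sweep over the two most common deep learning tuning knobs;
# each (learning rate, batch size) configuration is one ML-tuning iteration.
learn_rates = [1e-1, 1e-2, 1e-3]
batch_sizes = [32, 128, 512]

for lr, bs in product(learn_rates, batch_sizes):
    # metrics = train(lr=lr, batch_size=bs)  # hypothetical training routine
    print(f"tuning iteration: lr={lr}, batch_size={bs}")
```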

SLIDE 17


Post Processing

  • Precision/Recall & Accuracy → coarse-grained evaluation
  • Case Studies & Visualization → fine-grained evaluation

Top post processing operations by domain:

  • Social Sciences: Prec/Rec (25.7%), Accuracy (20.0%), Feat. Contrib. (17.1%), Visualiz. (14.3%)
  • Natural Sciences: Accuracy (28.6%), Prec/Rec (18.6%), Visualiz. (15.7%), Correlation (11.4%)
  • Web Apps: Accuracy (20.8%), Prec/Rec (20.8%), Case Studies (13.2%), DCG (9.4%)
  • NLP: Prec/Rec (29.2%), Accuracy (27.1%), Case Studies (14.6%), Human Eval (8.3%)
  • Computer Vision: Visualiz. (33.3%), Accuracy (29.8%), Prec/Rec (17.5%), Case Studies (12.3%)

SLIDE 18

Takeaways

  • Study iteration using empirical evidence from applied ML papers
    ○ Grouping by domain gives better insights

  • Lessons from results

    ○ Data prep: fine-grained feature engineering, efficient joins
    ○ ML: explainable models and fast training
    ○ Eval: fine-grained evals are as common as coarse-grained metrics

  • Open source dataset at https://github.com/helix-ml/AppliedMLSurvey


SLIDE 19

Future Work

  • Refine statistics and estimators
  • Develop insights and trends into a benchmark
  • Look at code repositories (e.g. Kaggle) for a more complete picture

Helix: Accelerate Iterative Execution via Intermediate Reuse

  • Addresses user needs discovered in our survey
  • Selectively materializes intermediate results for reuse in future iterations
  • Architecture: Intermediate Code Gen., DAG Optimizer, Materialization Optimizer

https://helix-ml.github.io
More on Helix in the technical report: http://data-people.cs.illinois.edu/helix-tr.pdf
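As a rough illustration of intermediate materialization and reuse (Helix's actual optimizers decide *selectively* what to materialize; see the technical report), here is a minimal Python sketch with a hypothetical caching decorator:

```python
import hashlib
import os
import pickle

# Minimal sketch of materializing a workflow step's result so later iterations
# that leave the step unchanged can reuse it instead of recomputing.
CACHE_DIR = ".intermediates"

def materialized(step):
    """Run a workflow step, reusing its result if its inputs haven't changed."""
    def wrapper(*args):
        key = hashlib.sha256(pickle.dumps((step.__name__, args))).hexdigest()
        path = os.path.join(CACHE_DIR, key)
        if os.path.exists(path):                      # reuse in a later iteration
            with open(path, "rb") as f:
                return pickle.load(f)
        result = step(*args)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:                   # materialize for future reuse
            pickle.dump(result, f)
        return result
    return wrapper

@materialized
def preprocess(raw):
    return [x * 2 for x in raw]  # stand-in for an expensive data prep step

print(preprocess((1, 2, 3)))  # computed and materialized
print(preprocess((1, 2, 3)))  # reused on the next iteration
```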
