

  1. How much data is enough? Predicting accuracy on large datasets from smaller pilot data
     Mark Johnson, Peter Anderson, Mark Dras, Mark Steedman
     Macquarie University, Sydney, Australia
     July 12, 2018

  2. Outline
     • Introduction
     • Empirical models of accuracy vs training data size
     • Accuracy extrapolation task
     • Conclusions and future work

  3. ML as an engineering discipline
     • A mature engineering discipline should be able to predict the cost of a project before it starts
     • Collecting/producing training data is typically the most expensive part of an ML or NLP project
     • We usually have only the vaguest idea of how accuracy is related to training data size and quality
       ◮ More data produces better accuracy
       ◮ Higher-quality data (closer domain, less noise) produces better accuracy
       ◮ But we usually have no idea how much data, or of what quality, is required to achieve a given performance goal
     • Imagine if engineers designed bridges the way we build systems!
     See statistical power analysis for experimental design, e.g., Cohen (1992)

  4. Goals of this research project
     • Given desiderata for an ML/NLP system (accuracy, speed, computational and data resource pricing, etc.), design a system that meets them
     • Example: design a semantic parser for a target application domain that achieves 95% accuracy across a given range of queries
       ◮ What hardware/software should I use?
       ◮ How many labelled training examples do I need?
     • Idea: extrapolate performance from small pilot data to predict performance on much larger data

  5. What this paper contributes
     • Studies methods for predicting accuracy on a full dataset from results on a small pilot dataset
     • Proposes a new accuracy extrapolation task and provides results for 9 extrapolation methods on 8 text corpora
       ◮ Uses the fastText document classifier and corpora (Joulin et al., 2016)
     • Investigates three extrapolation models and three item weighting functions for predicting accuracy as a function of training data size
       ◮ The fitted models are easily inverted to estimate the training size required to achieve a target accuracy (see the sketch below)
     • Highlights the importance of hyperparameter tuning and item weighting in extrapolation
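A minimal sketch of the inversion idea mentioned above, assuming the biased power-law model ê(n) = a + b·n^c introduced on slide 7. The parameter values in the example are illustrative, not taken from the paper:

```python
def required_training_size(target_error: float, a: float, b: float, c: float) -> float:
    """Invert e_hat(n) = a + b * n**c to find the n that reaches a target error.

    Only meaningful when target_error > a (the model's asymptotic error)
    and the fitted exponent c is negative (error falls as data grows).
    """
    if target_error <= a:
        raise ValueError("target error is below the model's asymptote a")
    return ((target_error - a) / b) ** (1.0 / c)

# e.g. with illustrative parameters a=0.05, b=2.0, c=-0.5, reaching 8% error
# requires ((0.08 - 0.05) / 2.0) ** (1 / -0.5) = 4444.4... training examples
print(required_training_size(0.08, a=0.05, b=2.0, c=-0.5))
```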

  6. Outline
     • Introduction
     • Empirical models of accuracy vs training data size
     • Accuracy extrapolation task
     • Conclusions and future work

  7. Overview
     • Extrapolation models of how error e (= 1 − accuracy) depends on training data size n:
       ◮ Power law: ê(n) = b·n^c
       ◮ Inverse square root: ê(n) = a + b·n^(−1/2)
       ◮ Biased power law: ê(n) = a + b·n^c
     • Each extrapolation model is estimated from multiple runs using weighted least squares regression:
       ◮ The classifier is trained on different-sized subsets of the pilot data
       ◮ The same test set is used to evaluate each run
       ◮ Each training/test run contributes one data point for fitting the extrapolation model
     • Weighting functions for least squares regression:
       ◮ constant weight: 1
       ◮ linear weight: n
       ◮ binomial weight: n / (e(1 − e))
     See e.g. Haussler et al. (1996); Mukherjee et al. (2003); Figueroa et al. (2012); Beleites et al. (2013); Hajian-Tilaki (2014); Cho et al. (2015); Sun et al. (2017); Barone et al. (2017); Hestness et al. (2017)
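As a sketch of how such a weighted least-squares fit might look, the snippet below fits the biased power law with scipy.optimize.curve_fit. The (n, e) data points and the starting values are illustrative stand-ins for real pilot-data runs; a regression weight w is expressed to curve_fit as sigma = 1/√w:

```python
import numpy as np
from scipy.optimize import curve_fit

def biased_power_law(n, a, b, c):
    return a + b * np.power(n, c)

# Illustrative (subset size, observed test error) pairs from pilot-data runs
n = np.array([1_000, 2_000, 5_000, 10_000, 20_000, 50_000], dtype=float)
e = np.array([0.27, 0.22, 0.18, 0.155, 0.137, 0.12])

# Binomial weights n / (e(1 - e)); the constant and linear alternatives
# would be np.ones_like(n) and n respectively.
w = n / (e * (1.0 - e))

(a, b, c), _ = curve_fit(
    biased_power_law, n, e,
    p0=(0.05, 1.0, -0.5),    # rough starting point for the optimiser
    sigma=1.0 / np.sqrt(w),  # per-point sigma encodes the regression weights
    maxfev=10_000,
)
print(f"fitted: e_hat(n) = {a:.4f} + {b:.3f} * n^{c:.3f}")
print("predicted error at n = 500,000:", biased_power_law(500_000.0, a, b, c))
```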

  8. Outline
     • Introduction
     • Empirical models of accuracy vs training data size
     • Accuracy extrapolation task
     • Conclusions and future work

  9. Accuracy extrapolation task
     • FastText document classifier and data, with Joulin et al. (2016)'s train/test division:

       Corpus                    Labels   Train (K)   Test (K)
       Development
         ag_news                      4         120        7.6
         dbpedia                     14         560         70
         amazon_review_full           5       3,000        650
         yelp_review_polarity         2         560         38
       Evaluation
         amazon_review_polarity       2       3,600        400
         sogou_news                   5         450         60
         yahoo_answers               10       1,400         60
         yelp_review_full             5         650         50

     • Pilot data is 0.5 or 0.1 of the training data
     • Goal: use the pilot data to predict test accuracy when the classifier is trained on the full training data
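The paper trains fastText on nested subsets of the pilot data and evaluates every run on the same test set. The sketch below illustrates that data-collection loop, with scikit-learn's SGDClassifier over bag-of-words features standing in for fastText; `pilot_texts`, `pilot_labels`, `test_texts`, and `test_labels` are assumed to be loaded elsewhere:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

def learning_curve_points(pilot_texts, pilot_labels, test_texts, test_labels,
                          fractions=(0.125, 0.25, 0.5, 1.0), seed=0):
    """Train on nested subsets of the pilot data; return (size, error) pairs."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(pilot_texts))
    points = []
    for frac in fractions:
        idx = order[: int(frac * len(pilot_texts))]
        vec = CountVectorizer()
        X_train = vec.fit_transform([pilot_texts[i] for i in idx])
        clf = SGDClassifier(random_state=seed)
        clf.fit(X_train, [pilot_labels[i] for i in idx])
        # Same held-out test set for every run, as in the task definition
        error = 1.0 - clf.score(vec.transform(test_texts), test_labels)
        points.append((len(idx), error))  # one data point for the regression
    return points
```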

  10. Extrapolation on the ag_news corpus
     • Extrapolation with the biased power-law model (ê(n) = a + b·n^c) and binomial weights (n / (e(1 − e)))
     [Figure: error rate (0.10–0.30) vs pilot data size (10^3–10^5, log scale), one curve per condition: = 0.1, ≤ 0.1, = 0.5, ≤ 0.5, where ≤ indicates hyperparameters re-optimised at each subset size and = indicates hyperparameters tuned once on the full pilot set]
     • Extrapolation from 0.5 of the training data is generally good
     • Extrapolation from 0.1 of the training data is poor unless hyperparameters are optimised at each subset of the pilot data

  11. Relative residuals (ê/e − 1) on the dev corpora
     [Figure: grid of panels, one row per dev corpus (ag_news, amazon_review_full, dbpedia, yelp_review_polarity) and one column per pilot-data condition (= 0.1, ≤ 0.1, = 0.5, ≤ 0.5); each panel plots the relative residuals of the three extrapolation models (b·n^c, a + b·n^(−1/2), a + b·n^c) under the three weightings (1, n, n/(e(1 − e)))]

  12. RMS relative residuals on the test corpora

       Pilot data   sogou_news   yahoo_answers   amazon_review_polarity   yelp_review_full   Overall
       = 0.1            0.1016          0.2752                   0.0519             0.0496    0.1510
       ≤ 0.1            0.0209          0.1900                   0.0264             0.0406    0.0986
       = 0.5            0.0338          0.0438                   0.0254             0.0160    0.0315
       ≤ 0.5            0.0049          0.0390                   0.0053             0.0046    0.0200

     • Based on the dev corpora results, we use:
       ◮ the biased power-law model (ê(n) = a + b·n^c)
       ◮ binomial item weights (n / (e(1 − e)))
     • Extrapolations are evaluated with the RMS of the relative residuals (ê/e − 1)
     • Larger pilot data ⇒ smaller extrapolation error
     • Optimising hyperparameters at each pilot subset (the ≤ conditions) ⇒ smaller extrapolation error
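For concreteness, the evaluation metric above as code; the arrays in the usage example are illustrative predicted and observed error rates, not values from the paper:

```python
import numpy as np

def rms_relative_residual(e_hat: np.ndarray, e: np.ndarray) -> float:
    """Root-mean-square of the relative residuals e_hat/e - 1."""
    return float(np.sqrt(np.mean((e_hat / e - 1.0) ** 2)))

# e.g. predicted vs observed full-data error rates on four corpora
print(rms_relative_residual(np.array([0.032, 0.281, 0.044, 0.398]),
                            np.array([0.033, 0.277, 0.046, 0.400])))
```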

  13. Outline
     • Introduction
     • Empirical models of accuracy vs training data size
     • Accuracy extrapolation task
     • Conclusions and future work

  14. Conclusions and future work
     • The field needs methods for predicting how much training data a system requires to achieve a target performance
     • We introduced an extrapolation task for predicting a classifier's accuracy on a large dataset from a small pilot dataset
     • Our results highlight the importance of hyperparameter tuning and item weighting
     • Future work: extrapolation methods that don't require expensive hyperparameter optimisation

  15. We are recruiting PhD students and Postdocs!
     Centre for Research in AI and Language (CRAIL), Macquarie University
     Parsing, Dialog, Deep Unsupervised Learning, Language in Context, Vision and Language, Language for Robot Control
     • We are recruiting top PhD students and Postdoc researchers
       ◮ With generous pay and top-up scholarships to $41K tax-free
     • Send CV and sample papers to Mark.Johnson@MQ.edu.au

  16. References
     Barone, A. V. M., Haddow, B., Germann, U., and Sennrich, R. (2017). Regularization techniques for fine-tuning in neural machine translation. CoRR, abs/1707.09920.
     Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C., and Popp, J. (2013). Sample size planning for classification models. Analytica Chimica Acta, 760:25–33.
     Cho, J., Lee, K., Shin, E., Choy, G., and Do, S. (2015). How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv:1511.06348.
     Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1):155.
     Figueroa, R. L., Zeng-Treitler, Q., Kandula, S., and Ngo, L. H. (2012). Predicting sample size required for classification performance. BMC Medical Informatics and Decision Making, 12(1):8.
     Hajian-Tilaki, K. (2014). Sample size estimation in diagnostic test studies of biomedical informatics. Journal of Biomedical Informatics, 48:193–204.
     Haussler, D., Kearns, M., Seung, H. S., and Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25(2).
     Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. (2017). Deep learning scaling is predictable, empirically. arXiv:1712.00409.
     Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv:1607.01759.
     Mukherjee, S., Tamayo, P., Rogers, S., Rifkin, R., Engle, A., Campbell, C., Golub, T. R., and Mesirov, J. P. (2003). Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology, 10(2):119–142.
     Sun, C., Shrivastava, A., Singh, S., and Gupta, A. (2017). Revisiting unreasonable effectiveness of data in deep learning era. arXiv:1707.02968.
