
Complexity vs. Performance: Empirical Analysis of Machine Learning as a Service
Yuanshun Yao, Zhujun Xiao, Bolun Wang*, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao
The University of Chicago; *University of California, Santa Barbara


  1. Complexity vs. Performance: Empirical Analysis of Machine Learning as a Service
     Yuanshun Yao, Zhujun Xiao, Bolun Wang*, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao
     The University of Chicago; *University of California, Santa Barbara
     ysyao@cs.uchicago.edu

  2. ML in Network Research
     • Congestion control protocols: Sivaraman et al., SIGCOMM'14; Winstein & Balakrishnan, SIGCOMM'13
     • Network link prediction: Liu et al., IMC'16
     • User behavior analysis: Wang et al., IMC'14; Zhao et al., IMC'12; Zannettou et al., IMC'17
     • …

  3. Running ML is Hard
     • Solution: Machine Learning as a Service (ML-as-a-Service)
     [Diagram: dataset → model]

  4. ML-as-a-Service
     [Diagram: training data + user input (model, parameters, etc.) → ML-as-a-Service → trained model]

  5. Why Study ML-as-a-Service?
     • Is my model good enough?
     • Q: How well do these platforms perform?
     • Q: How much does the amount of user control impact ML performance?

  6. ML-as-a-Service Platforms
     • Google Prediction, Amazon ML, ABM, BigML, PIO, Microsoft
     [Axis: platforms arranged from less to more amount of user input]

  7. Control in ML
     [Diagram: training data → ? → trained model]

  8. Control in ML
     • Data Cleaning: invalid/duplicate/missing data
     [Diagram: training data → ? → trained model]
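A minimal sketch of the data-cleaning step in pandas; the library choice, column names, and values are illustrative assumptions, not from the slides:

```python
import numpy as np
import pandas as pd

# Hypothetical raw training data exhibiting the three issues named on
# the slide: duplicate rows, an invalid value, and a missing value.
df = pd.DataFrame({
    "age":   [25, 25, -3, 40, np.nan],
    "label": [0,  0,  1,  1,  0],
})

df = df.drop_duplicates()    # duplicate rows
df = df.dropna()             # missing values
df = df[df["age"] >= 0]      # invalid values (negative age)
```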

  9. Control in ML
     • Data Cleaning: invalid/duplicate/missing data
     • Feature Selection: mutual information, Pearson correlation, chi-square…
     [Diagram: training data → ? → trained model]
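One of the criteria the slide names, mutual information, sketched with scikit-learn; the synthetic dataset and k=5 are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Keep the 5 features sharing the most mutual information with the label;
# Pearson correlation or chi-square scores could be swapped in here.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 5)
```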

  10. Control in ML
     • Data Cleaning: invalid/duplicate/missing data
     • Feature Selection: mutual information, Pearson correlation, chi-square…
     • Classifier Choice: Logistic Regression, Decision Tree, kNN…
     [Diagram: training data → ? → trained model]

  11. Control in ML
     • Data Cleaning: invalid/duplicate/missing data
     • Feature Selection: mutual information, Pearson correlation, chi-square…
     • Classifier Choice: Logistic Regression, Decision Tree, kNN…
     • Parameter Tuning: e.g. Logistic Regression: L1, L2, max_iter…
     [Diagram: training data → ? → trained model]
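The last two control dimensions, classifier choice and parameter tuning, sketched as a scikit-learn grid search over the knobs the slide names for logistic regression; the C values and dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Sweep the logistic regression parameters named on the slide:
# L1 vs. L2 regularization and the iteration cap.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both penalties
    param_grid={
        "penalty": ["l1", "l2"],
        "C": [0.01, 0.1, 1, 10],
        "max_iter": [100, 500],
    },
    scoring="f1",
)
grid.fit(X, y)
print(grid.best_params_)
```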

  12. Control in ML-as-a-Service
     • Complexity vs. Performance?

     Control dimension  | ABM | Google | Amazon | PIO | BigML | Microsoft
     Data Cleaning      |  ✖  |   ✖    |   ✖    |  ✖  |   ✖   |    ✖
     Feature Selection  |  ✖  |   ✖    |   ✖    |  ✖  |   ✖   |    ✔
     Classifier Choice  |  ✖  |   ✖    |   ✖    |  ✔  |   ✔   |    ✔
     Parameter Tuning   |  ✖  |   ✖    |   ✔    |  ✔  |   ✔   |    ✔
     (columns ordered from low to high user control/complexity)

  13. Performance Measurement

  14. Characterizing Performance
     • Theoretical modeling is hard
       • Output of an ML model depends on the dataset
       • No access to implementation details
     • Empirical, data-driven analysis
       • Simulate a real-world scenario from end to end
       • Needs a large number of diverse datasets
     • Focus on binary classification

  15. Datasets
     • 119 datasets from diverse application domains
     • Sample size: 15 - 245K; number of features: 1 - 4K
     • 79% of them are from the UCI ML Repository
     [Domain breakdown: Life Science 37%, Computer Applications 15%, Artificial Test 14%, Other 11%, Social Science 9%, Physical Science 8%, Financial & Business 6%]

  16. Methodology
     • Tune all available control dimensions through each platform's API
       (example API: Feature Selection ✖, Classifier Choice ✔, Parameter Tuning ✔)
       • Classifier Choice: Logistic Regression, kNN, SVM, …
       • Parameter Tuning: L1_reg, L2_reg, max_iter, …
     [Diagram: training data → API → trained model]

  17. Methodology
     • Tune all available control dimensions
       (example API: Feature Selection ✖, Classifier Choice ✔, Parameter Tuning ✔)
     [Diagram: training data → API → trained model; testing data → API → predictions]
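A sketch of this measurement loop, with local scikit-learn models standing in for a platform's API; the classifier families follow the slide, while the parameter grids and dataset are illustrative assumptions:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One parameter grid per classifier choice, mirroring the two tunable
# control dimensions exposed by the example API.
search_space = {
    LogisticRegression: {"C": [0.01, 0.1, 1, 10], "max_iter": [100, 500]},
    KNeighborsClassifier: {"n_neighbors": [1, 5, 15]},
    SVC: {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
}

scores = []
for clf_cls, param_grid in search_space.items():
    names, values = zip(*param_grid.items())
    for combo in product(*values):
        clf = clf_cls(**dict(zip(names, combo))).fit(X_train, y_train)
        scores.append(f1_score(y_test, clf.predict(X_test)))

print(f"optimized F-score over all configurations: {max(scores):.3f}")
```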

  18. Trade-offs between Complexity and Performance

  19. Complexity vs. Performance
     • Q: How does complexity correlate with performance?
     • Finding: high complexity → high performance
     [Bar chart: optimized vs. average F-score (y-axis 0.5 to 1.0) per platform; x-axis: ABM, Google, Amazon, BigML, PIO, Microsoft, Scikit, ordered from low to high complexity]

  20. Complexity vs. Risk
     • Q: How does risk correlate with complexity?
     • Finding: high complexity → high risk
     [Bar chart: performance variance in F-score (y-axis 0 to 0.5) per platform; x-axis: ABM, Google, Amazon, BigML, PIO, Microsoft, Scikit, ordered from low to high complexity]
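The two quantities behind these plots, average F-score (slide 19) and performance variance as the risk proxy (slide 20), can be computed over the per-configuration scores from the methodology sketch above; the numbers here are illustrative placeholders:

```python
from statistics import mean, pvariance

# Illustrative F-scores across tuned configurations for one platform,
# e.g. the `scores` list produced by the methodology sketch above.
scores = [0.62, 0.71, 0.88, 0.55, 0.79]

print(f"average F-score:      {mean(scores):.3f}")      # performance (slide 19)
print(f"performance variance: {pvariance(scores):.3f}")  # risk (slide 20)
```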

  21. Understanding Server-side Optimization

  22. Reverse-engineering Optimization
     • Q: Does the server side adapt to different datasets?
     • Reverse-engineer using synthetic datasets
       • Create synthetic datasets
       • Use prediction results to infer classifier information
     [Scatter plots: a linearly separable dataset (left) and a circular, non-linearly separable dataset (right), each with Class 0 and Class 1]
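Synthetic datasets of the two shapes shown on the slide can be generated with scikit-learn; the sample counts and noise levels are illustrative assumptions:

```python
from sklearn.datasets import make_circles, make_classification

# Linearly separable dataset (left panel).
X_lin, y_lin = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    class_sep=2.0, random_state=0)

# Circular dataset that no linear classifier can separate (right panel).
X_circ, y_circ = make_circles(
    n_samples=400, noise=0.05, factor=0.4, random_state=0)
```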

  23. Understanding Optimization
     [Scatter plots: Google's decision boundaries on the linear and circular synthetic datasets]
     • Google switches between classifiers based on the dataset
     • Use supervised learning to infer the classifier family used
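One plausible way to realize this inference, sketched locally: probe the service on a dense grid, then check which candidate classifier family best reproduces its labels. The slides do not spell out the exact procedure, and `query_blackbox` is a hypothetical stand-in for the platform's prediction API, simulated here with a linear boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def query_blackbox(points):
    # Hypothetical stand-in for the platform's prediction API; here we
    # simulate a black box that happens to use a linear boundary.
    return (points[:, 0] + points[:, 1] > 0).astype(int)

# Dense probe grid over the 2-D feature space of the synthetic dataset.
xs, ys = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
grid = np.c_[xs.ravel(), ys.ravel()]
labels = query_blackbox(grid)

# Fit one candidate per classifier family to the black box's own labels;
# the family that agrees most closely is the likely match.
candidates = {
    "linear (logistic regression)": LogisticRegression(),
    "kNN": KNeighborsClassifier(),
    "RBF SVM": SVC(kernel="rbf"),
}
for name, clf in candidates.items():
    clf.fit(grid, labels)
    agreement = (clf.predict(grid) == labels).mean()
    print(f"{name}: {agreement:.2%} agreement")
```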

  24. Takeaways
     • ML-as-a-Service is an attractive tool for reducing workload
     • But user control still has a large impact on performance
     • Fully automated systems are less risky

  25. Thank you! Questions?
