
Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale



  1. Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale. Stephen Bach (Brown University); Daniel Rodriguez (Google); Yintao Liu (Google); Chong Luo (Google); Haidong Shao (Google); Cassandra Xia (Google); Souvik Sen (Google); Alex Ratner (Stanford University); Braden Hancock (Stanford University); Houman Alborzi (Google); Rahul Kuchhal (Google); Chris Ré (Stanford University); Rob Malkin (Google)

  2. This Talk • Weakly supervised machine learning seeks to train classifiers without hand-labeled training data • What impact can it have on industry and other organizations that use machine learning? What challenges arise? • It can save hand-labeling tens of thousands of examples without sacrificing prediction quality!

  3. Training Data is the Bottleneck for Industrial Machine Learning

  4. Supervised Machine Learning [Diagram: Labeled Training Data → Learning Algorithm → Classifier for Unlabeled Data]

  5. Today’s Organizations: Many Classifiers

  6. Weak Supervision with Rules

  7. Open-Source Framework: Snorkel • Open-source framework to program classifiers by writing rules that label data • Results: State-of-the-art performance on benchmark tasks and new applications without any hand-labeled training data • snorkel.stanford.edu • Reference: Snorkel: Rapid Training Data Creation with Weak Supervision. A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré. PVLDB 11(3):269-282, 2017. Best of VLDB 2018

  8. Supervised Machine Learning Pipeline: Labeled Training Data → Learning Algorithm → Classifier for Unlabeled Data

  9. Weakly Supervised Machine Learning Pipeline: Unlabeled Data + Labeling Functions → Labeled Training Data → Learning Algorithm → Classifier for Unlabeled Data

  10. Example Task: Celebrity News. Is this headline about celebrity news? Headlines with their true labels (X marks a negative label): • Lori Loughlin’s ‘Fuller House’ Future Dim Due To Elite School Bribery Scandal • HOW TO WATCH TESLA’S MODEL Y REVEAL TONIGHT: X • Morris Dees, a Co-Founder of the Southern Poverty Law Center, Is Ousted: X

  11. Example Labeling Function: Keywords. Are there any gossip-related keywords in the headlines? Headlines with their votes (? marks an abstention): • Lori Loughlin’s ‘Fuller House’ Future Dim Due To Elite School Bribery Scandal • HOW TO WATCH TESLA’S MODEL Y REVEAL TONIGHT: ? • Morris Dees, a Co-Founder of the Southern Poverty Law Center, Is Ousted
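
A keyword rule of this kind can be sketched in a few lines of Python. The keyword list, the vote encoding, and the function name below are illustrative assumptions, not taken from the talk; the function votes positive when a gossip-related keyword appears and abstains otherwise.

```python
import re

# Vote encoding assumed for illustration: +1 positive, -1 negative, 0 abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

# Hypothetical gossip-related keywords; a real list would be curated by domain experts.
GOSSIP_KEYWORDS = {"scandal", "divorce", "breakup", "red carpet", "paparazzi"}


def lf_gossip_keywords(headline: str) -> int:
    """Vote POSITIVE if any gossip keyword occurs in the headline, else ABSTAIN."""
    text = headline.lower()
    for keyword in GOSSIP_KEYWORDS:
        # Match on word boundaries so a keyword does not fire inside a longer word.
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            return POSITIVE
    return ABSTAIN


headlines = [
    "Lori Loughlin's 'Fuller House' Future Dim Due To Elite School Bribery Scandal",
    "HOW TO WATCH TESLA'S MODEL Y REVEAL TONIGHT",
    "Morris Dees, a Co-Founder of the Southern Poverty Law Center, Is Ousted",
]
votes = [lf_gossip_keywords(h) for h in headlines]
```

Abstaining rather than voting negative when no keyword is found keeps the rule from asserting more than it knows; other labeling functions can then cover the examples it skips.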

  12. In the Industrial Setting… How Can We: 1. Manage the Proliferation of Supervision Sources? 2. Turn the Many Overlapping Sources into an Advantage? [Diagram: Labeling Function Set 1, Labeling Function Set 2, Labeling Function Set 3]

  13. Don’t Start from Scratch!

  14. Knowledge Resources • Web Crawlers • Related Classifiers • Knowledge Graphs • Rules (If Pattern(data) Then data.label = True) • Aggregate Stats • Topic Models

  15. Example: Related Classifier. If it doesn’t mention a person, it’s probably not about celebrities! Headlines with their votes (X marks a negative vote, ? an abstention): • Lori Loughlin’s ‘Fuller House’ Future Dim Due To Elite School Bribery Scandal: ? • HOW TO WATCH TESLA’S MODEL Y REVEAL TONIGHT: X • Morris Dees, a Co-Founder of the Southern Poverty Law Center, Is Ousted: ?

  16. Snorkel DryBell

  17. Snorkel DryBell Architecture [Architecture diagram. Components: Unlabeled Examples; Knowledge Resources (Web Crawlers, Knowledge Graphs, Related Classifiers); Labeling Function Templates (Abstract Labeling Function, NLP Labeling Function, …); Labeling Function Binary; Generative Model with latent label Z and labeling function outputs μ₁, μ₂, μ₃; Probabilistic Training Labels; Production ML Systems]
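
To make the role of the generative model concrete, here is a deliberately simplified sketch of how several labeling function votes can be combined into one probabilistic label. It assumes the labeling functions are conditionally independent given the true label and that their accuracies are already known; the actual system estimates such parameters from unlabeled data, which this sketch does not attempt. The function name and the accuracy values are hypothetical.

```python
import math


def probabilistic_label(votes, accuracies, prior=0.5):
    """Return P(label = positive | votes) for votes in {+1, -1, 0 (abstain)}.

    Assumes conditionally independent labeling functions whose accuracies
    (probability of a non-abstaining vote being correct) are given.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for vote, accuracy in zip(votes, accuracies):
        if vote == 0:
            continue  # an abstention carries no evidence in this simplified model
        # A vote from a labeling function with accuracy a shifts the log-odds
        # by log(a / (1 - a)) in the direction of the vote.
        log_odds += vote * math.log(accuracy / (1.0 - accuracy))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical accuracies for three labeling functions.
accuracies = [0.9, 0.8, 0.7]
p_agree = probabilistic_label([1, 1, 0], accuracies)
p_conflict = probabilistic_label([1, -1, 0], [0.8, 0.8, 0.7])
p_abstain = probabilistic_label([0, 0, 0], accuracies)
```

Two agreeing votes push the label probability well above the prior, two equally accurate conflicting votes cancel out, and an example on which every function abstains stays at the prior.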

  18. Resources Come in Diverse Forms • Related classifiers need their own servers • Knowledge Graph has REST API • Web crawlers maintained by separate team

  19. Snorkel DryBell Provides Templates. Example: NLP Labeling Function • The user defines the text to analyze • The user writes the rule: “If the text doesn’t mention any people, vote negative” • The template launches a MapReduce pipeline, starts an NLP classifier server on each worker, and saves the results
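
The division of labor in such a template can be sketched as an abstract base class: the user supplies the text to analyze and the voting rule, and the template owns execution. Everything named here (`NLPLabelingFunction`, `get_text`, `vote`, `run`, `toy_nlp_model`) is a hypothetical illustration, not the DryBell API, and a plain loop stands in for the MapReduce pipeline and per-worker NLP server.

```python
from abc import ABC, abstractmethod

# Vote encoding assumed for illustration: +1 positive, -1 negative, 0 abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0


class NLPLabelingFunction(ABC):
    """Hypothetical template: subclasses define the text and the rule only."""

    @abstractmethod
    def get_text(self, example: dict) -> str:
        """Select the text of the example that the NLP model should analyze."""

    @abstractmethod
    def vote(self, nlp_result: dict) -> int:
        """Turn the NLP model's output into a vote."""

    def run(self, examples, nlp_model):
        # Stand-in for the distributed pipeline: apply the model and the rule
        # to every example and collect the votes.
        return [self.vote(nlp_model(self.get_text(ex))) for ex in examples]


class NoPeopleLF(NLPLabelingFunction):
    """If the text doesn't mention any people, vote negative."""

    def get_text(self, example):
        return example["title"]

    def vote(self, nlp_result):
        return NEGATIVE if not nlp_result["people"] else ABSTAIN


def toy_nlp_model(text):
    """Toy stand-in for an NLP classifier: looks up a fixed list of names."""
    known_people = ["Lori Loughlin", "Morris Dees"]
    return {"people": [name for name in known_people if name in text]}


examples = [
    {"title": "Lori Loughlin's 'Fuller House' Future Dim Due To Elite School Bribery Scandal"},
    {"title": "HOW TO WATCH TESLA'S MODEL Y REVEAL TONIGHT"},
    {"title": "Morris Dees, a Co-Founder of the Southern Poverty Law Center, Is Ousted"},
]
votes = NoPeopleLF().run(examples, toy_nlp_model)
```

Keeping execution inside `run` means the same two-method subclass could later be backed by a different runner without changing the rule itself.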

  20. Resources are Often Not Servable • Servable: Service-Level Agreement; Fixed-Size Input • Not Servable: No Service-Level Agreement; Model Input Varies in Size; Input Expensive to Collect • Resources pictured: Related Classifiers, Knowledge Graphs, Aggregate Stats, Web Crawlers, Topic Models • Diagram output: Predicted Label

  21. Knowledge Transfers to Servable Models [Diagram. DEVELOPMENT: labeling function outputs μ₁, μ₂, μ₃ and latent label Z. PRODUCTION: servable model producing the Predicted Label]
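
One common way to transfer knowledge from probabilistic labels to a servable model is to train the model on servable features alone, minimizing cross-entropy against the probabilistic labels instead of hard labels. The sketch below does this for a small logistic regression in plain Python; the feature vectors, label values, and hyperparameters are made up for illustration, and the production models in the talk are not described at this level of detail.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_on_soft_labels(features, soft_labels, epochs=500, learning_rate=0.5):
    """Fit logistic regression by gradient descent on cross-entropy
    against probabilistic labels in [0, 1]."""
    n_examples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, p in zip(features, soft_labels):
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = prediction - p  # gradient of the loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_examples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_examples
    return weights, bias


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical servable features (two indicator features per example) and
# probabilistic labels produced during development.
features = [[1, 0], [1, 0], [0, 1], [0, 1]]
soft_labels = [0.9, 0.8, 0.1, 0.2]
weights, bias = train_on_soft_labels(features, soft_labels)
p_first_group = predict(weights, bias, [1, 0])
p_second_group = predict(weights, bias, [0, 1])
```

At serving time only `predict` and the servable features are needed; the labeling functions and the resources behind them stay in development.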

  22. Experimental Study

  23. Case Studies at Google • Collaborated with an engineering team responsible for 100+ classifiers in production • Looked at two recent instances where strategic decisions necessitated new classifiers • Due to the sensitive nature of the applications, we describe them at a high level and report relative scores

  24. Case #1: Product Classification • Existing classifier used to detect products in a certain category of interest (Previous: Products) • Goal: expand label to include accessories (New: Products + Accessories) • Instant depreciation of investment in labels!

  25. Case #2: Topic Classification • Emerging topic of interest in Google content • Goal: develop new classifier to identify topic • Default procedure is to collect hundreds of thousands of labels for new topic!

  26. Setup • Since these are production tasks, large labeled data sets were available • Training data: hundreds of thousands to millions of examples, which were treated as unlabeled by Snorkel DryBell • ~10k labeled validation set • ~10k labeled test set

  27. Comparison with Baselines • Train on Val. Data: Products Rel. F1 100%; Topics Rel. F1 100% • Generative Model: Products Rel. F1 103% (Lift +3%); Topics Rel. F1 94% (Lift -6%) • Snorkel DryBell: Products Rel. F1 105% (Lift +5%); Topics Rel. F1 118% (Lift +18%)
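
The relative scores in this comparison can be read as ratios against the baseline trained on validation data. The sketch below shows that arithmetic under the assumption, inferred from the table rather than stated in the talk, that relative F1 is the model's F1 divided by the baseline's F1 and that lift is the percentage change against a reference score. The confusion counts are hypothetical, since the talk reports no absolute scores.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def relative_f1(model_f1, baseline_f1):
    """Model F1 as a percentage of the baseline F1."""
    return 100.0 * model_f1 / baseline_f1


def lift(score, reference_score):
    """Percentage change of a score against a reference score."""
    return 100.0 * (score / reference_score - 1.0)


# Hypothetical confusion counts, chosen only to illustrate the arithmetic.
baseline = f1_score(80, 20, 20)
model = f1_score(84, 16, 16)
rel = relative_f1(model, baseline)   # about 105
gain = lift(rel, 100.0)              # about +5
```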

  28. Break-Even Point [Plots of Relative F1 vs. Number of Hand-Labeled Training Examples, comparing Fully Supervised with Snorkel DryBell. Product Classification: Snorkel DryBell uses 6.5M unlabeled examples; x-axis from 7K to 21K. Topic Classification: Snorkel DryBell uses 684K unlabeled examples; x-axis from 25K to 145K]

  29. Break-Even Point [Plot for Topic Classification: Relative F1 vs. Number of Hand-Labeled Training Examples, comparing Fully Supervised with Snorkel DryBell (684K unlabeled examples); x-axis from 25K to 145K]

  30. Summary

  31. Summary • Snorkel DryBell is a new system for industrial workloads, enabling users to transfer knowledge from organizational resources to machine learning classifiers • Our study shows that Snorkel DryBell can save hand-labeling tens of thousands of training examples • The key lesson for other organizations: knowledge resources are abundant, so take advantage of them!

  32. More Information • Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale. S. H. Bach, et al. SIGMOD 2019 Industrial Track. https://arxiv.org/abs/1812.00417 • snorkel.stanford.edu • Thank you!

  33. Appendix

  34. Snorkel DryBell Scales Up to Big Data • Using Google’s distributed compute environment, we can, for example, label and fit the generative model for 5 million+ examples in ~30 minutes. • Scalability of the generative model relies on a new, TensorFlow-based implementation

  35. Non-Servable Resources • Servable Resources: Products Rel. F1 63%; Topics Rel. F1 86% • + Non-Servable: Products Rel. F1 105% (Lift +68%); Topics Rel. F1 118% (Lift +36%)

  36. Labeling Function Details: Topic • 10 labeling functions • Examples: • URL-based: heuristics regarding URLs in the content • NER tagger-based: heuristics based on the presence of named entities • Topic model-based: heuristics based on a coarse-grained topic model

  37. Labeling Function Details: Product • 8 labeling functions • Examples: • Keyword-based: rules looking for product-related keywords • Knowledge Graph-based: queries the Knowledge Graph for names of related products and their translations in 10 languages for which the classifier is used • Topic model-based: heuristics based on a coarse-grained topic model

  38. Example 2: Knowledge Graph. If it mentions a known celebrity, it’s probably about celebrities! Headlines with their votes (? marks an abstention): • Lori Loughlin’s ‘Fuller House’ Future Dim Due To Elite School Bribery Scandal • HOW TO WATCH TESLA’S MODEL Y REVEAL TONIGHT: ? • Morris Dees, a Co-Founder of the Southern Poverty Law Center, Is Ousted: ?

  39. Example 3: Web Crawler. If it points to a page that mentions lots of celebrities, it is probably about celebrities! Headlines with their votes (? marks an abstention): • Lori Loughlin’s ‘Fuller House’ Future Dim Due To Elite School Bribery Scandal • HOW TO WATCH TESLA’S MODEL Y REVEAL TONIGHT • Morris Dees, a Co-Founder of the Southern Poverty Law Center, Is Ousted: ?
