SLIDE 1 Better Machine Learning Through Data
Saleema Amershi, Machine Teaching Group, Microsoft Research
August 14, 2016
SLIDE 2
Making better sense of data. Better data makes better machine learning.
SLIDE 3
Data + Algorithm = Model
Data Model Algorithm
SLIDE 4
Data + Algorithm = Model
Data Model Algorithm Machine learning research often takes the data as given.
SLIDE 5 When Algorithms Discriminate – The New York Times, 2015 Big Data’s all-too-human failings – Reuters, 2016 Artificial Intelligence’s White Guy Problem
– The New York Times, 2016
Mapping Crime – Or Stirring Hate?– Financial Times, 2014
SLIDE 6
Making better sense of data. Better data makes better machine learning. The most influence practitioners have on machine learning is through data.
SLIDE 7
Data + Algorithm = Model
Data Model Algorithm In research, data is often taken as given.
SLIDE 8
Algorithm
Data + Algorithm = Model
Model Data In practice, the algorithm is often taken as given. In research, data is often taken as given.
SLIDE 9 Algorithm
Data + Algorithm = Model
Model Data
“Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data.”
– New York Times, 2014
In practice, the algorithm is often taken as given.
SLIDE 10
Algorithm
Data + Algorithm = Model
Model Data
SLIDE 11 [Patel et al., CHI 2008]
SLIDE 12
Algorithm
Data + Algorithm = Model
Model Data
SLIDE 13
Algorithm
Data + Algorithm = Model
Data Model Iterations are driven by evaluating models on data.
SLIDE 14
Algorithm
Data + Algorithm = Model
Model Data Iterations are driven by evaluating models on data. In practice, most effort is spent crafting input data.
SLIDE 15
Algorithm Model Data
Machine learning in theory
SLIDE 16 Algorithm Collect & Label Samples Create Features Evaluate Results
Machine learning in practice
SLIDE 17 Algorithm Evaluate Results Collect & Label Samples Create Features
SLIDE 18 Algorithm Evaluate Results Collect & Label Samples Create Features Structured Labeling [CHI 2014] Feature Insight [VAST 2015] ModelTracker [CHI 2015, VAST 2016]
SLIDE 19 Evaluate Results Create Features Algorithm Feature Insight [VAST 2015] ModelTracker [CHI 2015, VAST 2016] Collect & Label Samples Structured Labeling [CHI 2014]
SLIDE 20 Traditional Labeling
Pre-defined high-level categories.
[Diagram: a stream of items labeled into Cat / Not Cat in answer to “Is this a Cat?”]
SLIDE 36 Traditional Labeling
Pre-defined high-level categories.
[Diagram: items labeled into Cat / Not Cat]
Does not support concept evolution (refining the target concept as data is labeled).
SLIDE 37 How common is concept evolution?
Nine machine learning experts labeled the same 200 pages in two sessions (4 weeks apart). Average consistency: 81.7% (SD=6.8%). 6 out of 9 people’s labels changed significantly (via chi-square test of symmetry).
[Chart: per-participant labeling consistency, 25–100%]
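The test of symmetry used above can be sketched concretely. For binary labels it reduces to McNemar's test, which only looks at the items whose label flipped between sessions; the session data below is hypothetical, not the study's.

```python
# McNemar's chi-square test of symmetry for two labeling sessions.
# Only discordant pairs matter:
#   b = items labeled "cat" in session 1 but "not cat" in session 2
#   c = items labeled "not cat" in session 1 but "cat" in session 2

def mcnemar_chi2(session1, session2):
    """Chi-square statistic for label symmetry between two sessions."""
    b = sum(1 for s1, s2 in zip(session1, session2)
            if s1 == "cat" and s2 == "not cat")
    c = sum(1 for s1, s2 in zip(session1, session2)
            if s1 == "not cat" and s2 == "cat")
    if b + c == 0:
        return 0.0  # perfectly symmetric: no label changes
    return (b - c) ** 2 / (b + c)

# Hypothetical labeler: 15 items flipped cat -> not cat, 5 flipped back,
# 80 labeled the same way both times (80% consistency).
s1 = ["cat"] * 15 + ["not cat"] * 5 + ["cat"] * 80
s2 = ["not cat"] * 15 + ["cat"] * 5 + ["cat"] * 80
chi2 = mcnemar_chi2(s1, s2)
# chi2 = (15 - 5)**2 / 20 = 5.0, which exceeds 3.84 (df=1, alpha=.05):
# this labeler's concept changed significantly between sessions.
```

A statistic this simple makes concept evolution measurable: a labeler can be highly "consistent" overall while still drifting systematically in one direction.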
SLIDE 38 Proposed Solution – Structured Labeling
Enable people to explicitly organize their concept via grouping and tagging within a traditional labeling scheme.
SLIDE 39 Traditional Labeling
Pre-defined high-level categories.
[Diagram: items labeled directly into Cat / Not Cat]
SLIDE 40 Structured Labeling
[Diagram: the same labeling task, with items organized into groups “Definitely Cat” and “Definitely Not Cat” within the high-level categories]
SLIDE 46 Structured Labeling
[Diagram: items organized into tagged groups such as “Definitely Cat”, “Lions”, “Blogs”, “Cat Poster”, and “Definitely Not Cat” within the high-level categories]
Grouping within high-level categories. User-provided tags on groups aid recall.
Can move, merge, and split groups as desired.
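The operations above can be sketched as a small data structure: groups nested under the high-level categories, with move and merge operations. Class and method names here are hypothetical, not the paper's implementation.

```python
# Minimal sketch of structured labeling state: each high-level category
# holds named groups of items; groups can be moved and merged.

class StructuredLabels:
    def __init__(self, categories):
        # e.g. {"cat": {"Definitely Cat": [...]}, "not cat": {...}}
        self.groups = {c: {} for c in categories}

    def add(self, category, group, item):
        self.groups[category].setdefault(group, []).append(item)

    def merge(self, category, src, dst):
        """Fold group `src` into group `dst` within the same category."""
        self.groups[category].setdefault(dst, []).extend(
            self.groups[category].pop(src, []))

    def move(self, category, group, item, new_category, new_group):
        """Relabel an item by moving it to another group."""
        self.groups[category][group].remove(item)
        self.add(new_category, new_group, item)

labels = StructuredLabels(["cat", "not cat"])
labels.add("cat", "Definitely Cat", "img_01")
labels.add("cat", "Lions", "img_02")
labels.add("not cat", "Cat Poster", "img_03")
labels.merge("cat", "Lions", "Definitely Cat")
labels.move("not cat", "Cat Poster", "img_03", "not cat", "Definitely Not Cat")
# The final label of every item is still just its category, so the
# structure adds organization without changing the labeling scheme.
```

Note that collapsing the structure (ignoring groups and tags) recovers an ordinary labeled dataset, which is why this can sit inside a traditional labeling workflow.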
SLIDE 47 Assisted Structured Labeling
[Diagram: the structured labeling interface with a grouping recommendation highlighted]
Grouping recommendations to improve label consistency.
SLIDE 48 Assisted Structured Labeling
[Diagram: the interface showing similar items alongside the item being labeled]
Grouping recommendations to improve label consistency. Similar items to help users make decisions.
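One plausible way to generate such grouping recommendations (the paper's actual method may differ) is to suggest the existing group whose centroid is most similar to the item being labeled. All feature vectors below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def recommend_group(item_vec, groups):
    """Return the group whose centroid is most similar to item_vec."""
    return max(groups, key=lambda g: cosine(item_vec, centroid(groups[g])))

# Hypothetical feature vectors for already-grouped items.
groups = {
    "Lions":      [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Cat Poster": [[0.1, 0.9, 0.8], [0.0, 0.8, 0.9]],
}
print(recommend_group([0.85, 0.15, 0.05], groups))  # -> Lions
```

The recommendation nudges the labeler toward placing similar items together, which is exactly the mechanism by which consistency improves.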
SLIDE 49 Findings
People revised labels significantly more with structured labeling. People labeled more consistently. People preferred it over traditional labeling.
[Charts: label consistency (X2=6.53, df=2, p<.038; X2=20.19, df=2, p<.001), mean # groups, and # revisions (X2=12, df=2, p<.002)]
SLIDE 50 Structured Labeling Summary
Current tools do not support concept evolution. Structured labeling helps people refine their concepts by surfacing labeling decisions and aiding recall. People used structured labeling when it was available and labeled more consistently. Structure contains additional information (e.g., group-related features, group-related accuracy, decisions made…).
SLIDE 51 Evaluate Results Algorithm ModelTracker [CHI 2015, VAST 2016] Collect & Label Samples Create Features Feature Insight [VAST 2015] Structured labeling improves consistency [CHI 2014]
SLIDE 52 “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
[Domingos, CACM 2012]
…yet, little guidance or best practices exist.
SLIDE 53
How do people come up with features?
Look for features used in related domains. Use intuition or domain knowledge. Apply automated techniques. Feature ideation – think of and experiment with custom features (a “black art”).
SLIDE 54
Proposed Solution – Feature Insight
Support compare and contrast of data.
SLIDE 55
What makes a cat a cat?
SLIDE 56
What makes a cat a cat?
SLIDE 57
Proposed Solution – Feature Insight
Support compare and contrast of data. Comparing pairs vs sets?
SLIDE 58 Comparing Pairs vs Sets
Sets may help people think of generalizable features.
[Diagram: a single positive/negative pair vs. sets of positives and negatives]
SLIDE 59
Proposed Solution – Feature Insight
Support compare and contrast of data. Comparing pairs vs sets? Raw data vs visual summaries?
SLIDE 60 Looking at Raw Data vs. Visual Summaries
Visual summaries may reveal relevant characteristics and hide irrelevant noise.
[Diagram: raw data vs. a visual summary of the same data]
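As a concrete sketch of what a set-level summary can surface: aggregate token statistics over the positive and negative sets highlight candidate features that no single pair comparison would reveal. The documents and function names below are hypothetical, not from Feature Insight.

```python
from collections import Counter

def class_frequencies(documents):
    """Fraction of documents in which each token appears."""
    counts = Counter(tok for doc in documents for tok in set(doc.split()))
    return {tok: c / len(documents) for tok, c in counts.items()}

def discriminative_tokens(positives, negatives, k=3):
    """Tokens whose document frequency differs most between the sets."""
    pos, neg = class_frequencies(positives), class_frequencies(negatives)
    tokens = set(pos) | set(neg)
    return sorted(tokens,
                  key=lambda t: abs(pos.get(t, 0) - neg.get(t, 0)),
                  reverse=True)[:k]

# Toy "cat page" vs "not cat page" documents.
positives = ["cat whiskers fur", "cat fur tail", "whiskers fur cat"]
negatives = ["dog fur tail", "dog bone tail", "dog fur bone"]
top = discriminative_tokens(positives, negatives)
# "cat" and "dog" separate the sets perfectly; "fur" and "tail",
# common to both, rank low -- the summary hides irrelevant noise.
```

This is the set-vs-pair argument in miniature: any single positive/negative pair also differs on "fur" or "tail", but only the aggregate view shows those differences do not generalize.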
SLIDE 61
[Study design: Raw Data vs. Visual Summaries, crossed with Individual vs. Set Comparison]
SLIDE 62 [Charts: classifier performance (p<.01), feature count, and preference rank (smaller is better, p=.03) for the four conditions Raw + Individual, Raw + Set, Visual + Individual, Visual + Set]
Findings
Visual summaries led to better features. Visual summaries were preferred over looking at raw data. Sets were useful only in combination with visual summaries.
SLIDE 63
Feature Insight Summary
Featuring is arguably the most important step in machine learning, but there is little guidance on feature ideation. Feature Insight supports error comparison, examination of sets, and visual summaries. Visual summaries help people create better quality features.
SLIDE 64 Algorithm Collect & Label Samples Structured Labeling [CHI 2014] Create Features Feature Insight [VAST 2015] Evaluate Results ModelTracker [CHI 2015, VAST 2016]
SLIDE 65 Algorithm Evaluate Results ModelTracker [CHI 2015, VAST 2016] Collect & Label Samples Structured Labeling [CHI 2014] Create Features Feature Insight [VAST 2015]
How do people evaluate performance?
SLIDE 66 Algorithm Evaluate Results Collect & Label Samples Create Features
[Summary statistics: 0.71, 0.67, 0.70, ???]
Confusion matrix: actual positive: 143 predicted positive, 72 predicted negative; actual negative: 35 predicted positive, 190 predicted negative.
How do people evaluate performance?
Summary statistics hide important information about model behavior.
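The point can be made concrete with the confusion matrix on the slide: several defensible summary statistics are derivable from the same four counts, and none of them identifies which examples fail or why.

```python
# Confusion matrix from the slide: rows = actual, columns = predicted.
tp, fn = 143, 72   # actual positive
fp, tn = 35, 190   # actual negative

accuracy  = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.76 precision=0.80 recall=0.67 f1=0.73
# Four different "scores" for one model -- and none of them says
# *which* 107 examples were misclassified, or whether they share a cause.
```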
SLIDE 68 Algorithm Evaluate Results Collect & Label Samples Create Features
Summary statistics hide important information about model behavior. Switching tools to examine data is disruptive and leads to a trial-and-error approach [Patel et al., AAAI 2008].
How do people evaluate performance?
SLIDE 69
Example: Predicting Income Levels
SLIDE 70
SLIDE 71
Decision Tree 86% Accuracy Support Vector Machine 85% Accuracy
SLIDE 73
ModelTracker Demo
SLIDE 74 Significantly faster and more accurate performance analysis
[Chart: ModelTracker vs. a common confusion matrix display]
SLIDE 75
ModelTracker Summary
Current tools for performance analysis and debugging hide a lot of important information about model behavior. ModelTracker supports estimating performance at multiple levels of granularity while enabling direct access to data. People are significantly faster and more accurate at performance analysis with ModelTracker.
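ModelTracker's core idea of laying every example out by its prediction score can be sketched in a few lines of text-mode Python. The scores and labels below are made up; this illustrates the layout, not the actual tool.

```python
# Text-mode sketch of a ModelTracker-style view: examples binned by
# prediction score, with true labels shown per bin so mistakes are
# visible where they cluster (high-score negatives, low-score positives).

def score_bins(examples, n_bins=5):
    """examples: list of (score in [0, 1], true_label) pairs."""
    bins = [[] for _ in range(n_bins)]
    for score, label in examples:
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append(label)
    return bins

examples = [(0.05, "-"), (0.15, "-"), (0.35, "-"), (0.45, "+"),
            (0.55, "-"), (0.65, "+"), (0.85, "+"), (0.95, "+")]
for i, labels in enumerate(score_bins(examples)):
    lo, hi = i / 5, (i + 1) / 5
    print(f"[{lo:.1f}-{hi:.1f}) {''.join(labels)}")
```

Unlike a confusion matrix, this view keeps every example addressable: a suspicious symbol in a bin can be clicked (here, indexed) to reach the underlying data directly, which is the "direct access" the summary above refers to.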
SLIDE 76 Algorithm Evaluate Results ModelTracker [CHI 2015, VAST 2016] Collect & Label Samples Structured Labeling [CHI 2014] Create Features Feature Insight [VAST 2015]
SLIDE 77 [Diagram: the broader pipeline: Collect, Clean, Label, Feature, Train, Tune, Evaluate, Deploy]
Many more opportunities to better support machine learning in practice.
SLIDE 79 [Diagram: the broader pipeline: Collect, Clean, Label, Feature, Train, Tune, Evaluate, Deploy]
Many more opportunities to better support machine learning in practice and theory.
SLIDE 80
Making better sense of data. Better data means better machine learning. The most influence practitioners have on machine learning is through data. Many more opportunities!
SLIDE 81 Better Machine Learning Through Data
Saleema Amershi, samershi@microsoft.com
Machine Teaching Group, Microsoft Research
August 14, 2016
Thanks! Questions?