Accelerating Machine Learning with Training Data Management
Data Council 4/17/19 | Alexander Ratner

SLIDE 1

Accelerating Machine Learning with Training Data Management

Alex Ratner, Stanford University

SLIDE 2

Training data is the key ingredient in ML, but it’s created and managed in manual, ad hoc ways.

SLIDE 3

KEY RESEARCH QUESTION

Can we add mathematical & systems structure to the way people build & manage training sets today?

SLIDE 4

Running Example: Chest X-Ray Triage

Example label: “Abnormal”

Motivation: Case prioritization, e.g. for low-resource hospitals

[Dunnmon et al., Radiology 2018; Dunnmon & Ratner et al., 2019; Khandwala et al., NeurIPS ML4H 2017]

SLIDE 5

Running Example: Chest X-Ray Triage

Unlabeled data (multi-modal) → Training set creation → Model development (2-3 days) → Model (e.g. ResNet)

  • ± 1 point due to model choice

(All scores: ROC AUC)

Model dev is often radically easier today!

[Dunnmon et al., Radiology 2018; Dunnmon & Ratner et al., 2019; Khandwala et al., NeurIPS ML4H 2017]

SLIDE 6

Running Example: Chest X-Ray Triage

Unlabeled data (multi-modal) → Training set creation (8 months) → Model development (2-3 days) → Model (e.g. ResNet)

  • ± 1 point due to model choice
  • ± 9 points due to training set size
  • ± 8 points due to training set quality

(All scores: ROC AUC)

Training data is often the key differentiator

[Dunnmon et al., Radiology 2018; Dunnmon & Ratner et al., 2019; Khandwala et al., NeurIPS ML4H 2017]

SLIDE 7

Challenges of Training Data Management

  • Volume is critical
      • But training data is largely hand-labeled: slow & expensive
  • Quality is critical
      • But this is challenging to assess
  • Flexibility is critical
      • But training sets are completely static

$Y \in \{\text{“Abnormal”}, \text{“Normal”}\}$
$Y \in \{\text{“Urgent”}, \text{“Emergent”}, \text{“Normal”}\}$

SLIDE 8

Our research: building systems that

  1. Let users specify training sets in higher-level, programmatic ways
  2. Clean and integrate this input
  3. Use it as training data for ML models

A new way to specify ML models, in hours rather than months

SLIDE 9

Our Research: Training Data Management Systems

[Diagram: Unlabeled data → Labeling, Augmentation, Multi-Task Supervision → Model]

This talk: Three systems that support and accelerate critical steps of training data creation & management

SLIDE 10

Our Research: Training Data Management Systems

[Diagram: Unlabeled data → Labeling, Augmentation, Multi-Task Supervision → Model, with example output “Normal”]

  1. Labeling (Snorkel): Programmatically label training data

SLIDE 11

Our Research: Training Data Management Systems

  1. Labeling (Snorkel)
  2. Augmentation (TANDA): Programmatically transform training data

SLIDE 12

Our Research: Training Data Management Systems

  1. Labeling (Snorkel)
  2. Augmentation (TANDA)
  3. Multi-Task Supervision (MeTaL): Programmatically integrate training data across multiple tasks $Y_1, Y_2, Y_3$

SLIDE 13

Our Research: Training Data Management Systems

Deployments: Industry, Government, Medicine

SLIDE 14

Our Research: Training Data Management Systems

[Diagram: Unlabeled data → (1) Labeling: Snorkel, (2) Augmentation: TANDA, (3) Multi-Task Supervision: MeTaL → Model]

SLIDE 15

Problem: Hand-labeling is slow, expensive, & static
Idea: Enable users to label training data programmatically

SLIDE 16

KEY TECHNICAL IDEA:

View training set labeling as a noisy programmatic process that we can model

SLIDE 17

The Snorkel Pipeline

snorkel.stanford.edu

  1. LABELING FUNCTIONS: Users write labeling functions to heuristically label the UNLABELED DATA
  2. LABEL MODEL: Snorkel cleans and combines the LF labels
  3. END MODEL: The resulting TRAINING DATABASE is used to train an ML model

Note: No hand-labeled training data!

    import re

    # x is one unlabeled report; DISEASES and x.words are assumed to be sets.
    # An LF that returns nothing (None) abstains.

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

    def LF_short_report(x):
        if len(x.words) < 15:
            return "NORMAL"

    def LF_ontology(x):
        if DISEASES & x.words:
            return "ABNORMAL"

    def LF_off_shelf_classifier(x):
        if off_shelf_classifier(x) == 1:
            return "NORMAL"

SLIDE 18

Snorkel: Real-World Deployments

snorkel.stanford.edu

Industry, Government, Science & Medicine

In many cases: From person-months of hand-labeling to hours

SLIDE 19

(1) Writing Labeling Functions

[Snorkel pipeline diagram (labeling functions → label model → end model), step 1 highlighted; same labeling functions as on slide 17]

SLIDE 20

(1) Writing Labeling Functions

LABELING FUNCTIONS: LF_pneumo, LF_short_report, LF_ontology, LF_off_shelf_classifier (as on slide 17)

Labeling function: $\lambda : \mathcal{X} \mapsto \mathcal{Y} \cup \{0\}$, where $\mathcal{X}$ is the data, $\mathcal{Y}$ the labels, and $0$ means abstain

A simple abstraction for expressing domain heuristics or other noisy label sources
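
The labeling-function abstraction above can be sketched in a few lines of plain Python. This is only an illustration, not Snorkel's API: the `ABSTAIN` constant, the `DISEASES` set, the `words` helper, and `apply_lfs` are hypothetical names, and the LFs take the raw report text rather than the `x.text` / `x.words` objects used on the slides.

```python
import re

ABSTAIN = 0  # the "0" in the LF signature: the LF casts no vote

# Hypothetical ontology of disease terms, for illustration only.
DISEASES = {"pneumothorax", "effusion", "consolidation"}


def words(text):
    """Lowercased word tokens of a report."""
    return re.findall(r"[a-z]+", text.lower())


def lf_pneumo(text):
    return "ABNORMAL" if re.search(r"pneumo", text.lower()) else ABSTAIN


def lf_short_report(text):
    return "NORMAL" if len(words(text)) < 15 else ABSTAIN


def lf_ontology(text):
    return "ABNORMAL" if DISEASES & set(words(text)) else ABSTAIN


LFS = [lf_pneumo, lf_short_report, lf_ontology]


def apply_lfs(reports):
    """Label matrix: one row per report, one column per labeling function."""
    return [[lf(r) for lf in LFS] for r in reports]


reports = [
    "Indication: Chest pain. Findings: Focal consolidation and pneumothorax.",
    "Indication: Cough. Findings: Lungs are clear.",
]
label_matrix = apply_lfs(reports)
```

On the first report the LFs already disagree (`lf_short_report` votes "NORMAL" while the other two vote "ABNORMAL"), which is the kind of conflict the label model is meant to resolve.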

SLIDE 21

Simple Example: Pattern Matching

“Indication: Chest pain. Findings: Focal consolidation and pneumothorax.”

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

Labeling functions (LFs) are simple UDFs for expressing domain expertise

SLIDE 22

Simple Example: Pattern Matching

“Indication: Chest pain. Findings: No focal consolidation or pneumothorax…”

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

LFs can also be noisy: we can estimate their accuracies to handle this (next)
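
To make the noise concrete, here is a small sketch that runs the pattern-matching LF on both example reports. It assumes the LF receives the raw report text and uses a hypothetical `ABSTAIN = 0` return value; the regular expression is the one from the slides.

```python
import re

ABSTAIN = 0  # hypothetical "no vote" value


def lf_pneumo(text):
    """Vote ABNORMAL on any 'pneumo...' mention; otherwise abstain."""
    return "ABNORMAL" if re.search(r"pneumo.*", text) else ABSTAIN


positive = "Indication: Chest pain. Findings: Focal consolidation and pneumothorax."
negated = "Indication: Chest pain. Findings: No focal consolidation or pneumothorax."

votes = [lf_pneumo(positive), lf_pneumo(negated)]
```

Both reports receive an "ABNORMAL" vote, although the second one explicitly rules out pneumothorax: the pattern cannot see the negation, so the LF is useful but not always correct.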

SLIDE 23

A Simple Formalism for Weak Supervision Strategies

  • Pattern matching [e.g. Hearst 1992, Zhang 2017]
  • Distant supervision [e.g. Mintz 2009]
  • Domain heuristics
  • Functions of features [e.g. Varma 2017]

    def LF_short_report(x):
        if len(x.words) < 15:
            return "NORMAL"

    def LF_ontology(x):
        if DISEASES & x.words:
            return "ABNORMAL"

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

    def LF_circular_mass(x):
        c = off_shelf_circle_finder(x)[0]
        if c.radius > 1:
            return "ABNORMAL"

And many others: crowdsourcing, other models, etc.

SLIDE 24

LABELING FUNCTIONS: LF_pneumo, LF_short_report, LF_ontology, LF_off_shelf_classifier (as on slide 17)

Result: Supervision as Code

But, very messy supervision…

SLIDE 25

Challenges of Supervision as Code

“Indication: Chest pain. Findings: No focal consolidation or pneumothorax.”

    def LF_pneumo(x):
        if re.search(r"pneumo.*", x.text):
            return "ABNORMAL"

    def LF_ontology(x):
        if DISEASES & x.words:
            return "ABNORMAL"

    def LF_short_report(x):
        if len(x.words) < 15:
            return "NORMAL"

True Label:

  • Different unknown accuracies
  • Different unknown correlations
  • No ground truth

A new type of data cleaning and integration problem

SLIDE 26

(1) Writing Labeling Functions

[Snorkel pipeline diagram (labeling functions → label model → end model), step 1 highlighted; same labeling functions as on slide 17]

SLIDE 27

(2) Clean & integrate noisy labels

[Snorkel pipeline diagram, step 2 (label model) highlighted]

How can we do this without ground-truth labels?

SLIDE 28

A Simple Labeling Model

[Graphical model: LF outputs $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ (from LF_pneumo, LF_short_report, LF_ontology, LF_off_shelf_classifier) connected to the unobserved true label $Y$]

Simple generative model of the labeling process

  • Represent the LF outputs as RVs
  • Model with a single parameter:
    $P(\lambda_i, Y) = \begin{cases} \alpha_i, & \lambda_i = Y \\ 1 - \alpha_i, & \lambda_i \neq Y \end{cases}$
  • Can be extended

Include pairwise dependencies, e.g. $P(\lambda_3, \lambda_4 \mid Y)$

  • Assume edges are known
  • We (provably) estimate from unlabeled data [ICML ’17, Arxiv ’19]

This talk: How to learn the model without observing $Y$?
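
As a rough illustration of this generative view, the sketch below simulates the labeling process for a binary task: a hidden label Y is drawn, and each LF independently reports Y with probability α_i and the opposite label otherwise. The accuracy values, the ±1 label encoding, and the function name are assumptions made for the example; abstentions and pairwise dependencies are left out.

```python
import random


def sample_labeling_process(accuracies, n, seed=0):
    """Draw n true labels Y in {-1, +1} and conditionally independent LF votes.

    Each LF i returns Y with probability accuracies[i], and -Y otherwise.
    """
    rng = random.Random(seed)
    ys, votes = [], []
    for _ in range(n):
        y = rng.choice([-1, 1])
        ys.append(y)
        votes.append([y if rng.random() < a else -y for a in accuracies])
    return ys, votes


alphas = [0.9, 0.75, 0.6]  # illustrative accuracies, not values from the talk
ys, votes = sample_labeling_process(alphas, n=20000)

# Empirical accuracy of each LF. This is computable here only because the
# simulation exposes Y, which is exactly what is unavailable in practice.
empirical = [
    sum(v[i] == y for v, y in zip(votes, ys)) / len(ys)
    for i in range(len(alphas))
]
```

With Y visible, the empirical accuracies land close to the chosen α values; the question posed on the slide is how to get the same numbers when only `votes` is observed.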

SLIDE 29

Prior Work on Weak Supervision Modeling

Some prior work:

  1. Crowdsourcing
       1. EM-based [e.g. Dawid & Skene, 1979]
       2. Spectral [e.g. Ghosh, 2011; Anandkumar 2012]
  2. Data fusion [e.g. Dong, 2015; Rekatsinas 2017]
  3. Others (see snorkel.stanford.edu)

Differences highlighted in this portion of the talk:

  1. Complex dependencies between LFs
  2. End-to-end theoretical guarantees

[Graphical models: $Y$ with $\lambda_1, \lambda_2, \lambda_3$; $Y$ with $\lambda_1, \ldots, \lambda_5$]

SLIDE 30

[Graphical model: LF outputs $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ with unobserved true label $Y$]

Key idea: Learn from the agreements & disagreements between the LFs

[Ratner et al., AAAI ’19] [Ratner et al., NeurIPS ’16]
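
One way to see why agreements and disagreements carry enough signal is the sketch below. It is not the estimator presented in this talk: it swaps in a simpler moment identity that holds for three conditionally independent LFs with ±1 votes, and the accuracies used to simulate the votes are made up. The estimator only ever reads the votes, never the true label.

```python
import math
import random

rng = random.Random(0)
alphas = [0.9, 0.75, 0.6]  # hidden accuracies, used only to simulate votes
n = 100000

votes = []
for _ in range(n):
    y = rng.choice([-1, 1])  # true label, never read by the estimator
    votes.append([y if rng.random() < a else -y for a in alphas])


def agreement(i, j):
    """Mean of vote_i * vote_j: +1 for an agreement, -1 for a disagreement."""
    return sum(v[i] * v[j] for v in votes) / len(votes)


def estimate_accuracy(i, j, k):
    """Accuracy of LF i from its agreement rates with LFs j and k.

    Under conditional independence, E[v_i v_j] = a_i * a_j with
    a_i = 2 * alpha_i - 1, so a_i**2 = E[v_i v_j] * E[v_i v_k] / E[v_j v_k].
    Taking the positive root assumes LF i is better than chance.
    """
    a_i = math.sqrt(agreement(i, j) * agreement(i, k) / agreement(j, k))
    return (1 + a_i) / 2


estimates = [
    estimate_accuracy(0, 1, 2),
    estimate_accuracy(1, 0, 2),
    estimate_accuracy(2, 0, 1),
]
```

The recovered values approximate the hidden accuracies, and the approximation tightens as more unlabeled points are added, since only the agreement rates need to be estimated.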

SLIDE 31

Operationally: Look at the covariance

[Figure: covariance matrix $\Sigma$ over the LF outputs $\lambda_1, \lambda_2, \lambda_3, \lambda_4$, the clique $(\lambda_3, \lambda_4)$, and $Y$, split into an observed block $\Sigma_O$ and unobserved entries involving $Y$]

  • Observed: $\Sigma_O$ encodes the observed LF agreements / disagreements
  • Unobserved: parameters to solve for

SLIDE 32

Idea: Use graph-sparsity of $\Sigma^{-1}$

[Figure: inverse covariance matrix $\Sigma^{-1}$ over the same variables, with observed and unobserved blocks]

We know the zeros of $\Sigma^{-1}$ from our model [Loh & Wainwright 2013, Ratner 2019]

This encodes our knowledge of the dependency structure

SLIDE 33

Putting the pieces together

[Figure: block matrices $\Sigma$ and $\Sigma^{-1}$, observed and unobserved blocks]

$\Sigma_O^{-1} + zz^T = (\Sigma^{-1})_O$

Let $\Omega$ be the set of 0 entry indices in $\Sigma^{-1}$

SLIDE 34

Putting the pieces together

$(\Sigma_O^{-1} + zz^T)_\Omega = 0$

This is similar to a matrix completion problem!

SLIDE 35

Result: Recovering the LF accuracies & correlations

$\operatorname{argmin}_z \left\| \left( \Sigma_O^{-1} + zz^T \right)_\Omega \right\|$

Let:

  • $n$ = number of unlabeled data points
  • $d$ = number of LF cliques ($\Sigma_O$ is $d \times d$)

Why is this nice?

  • Simple to optimize: E.g. SGD
  • Scalable: No dependence on $n$!
  • Theoretical guarantees: We can leverage random matrix & perturbation tools to prove convergence
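
A toy version of this optimization is sketched below, under strong simplifying assumptions: Ω is taken to be every off-diagonal pair (conditionally independent LFs), the squared error is used as the norm, and `target[i][j]` stands in for $-(\Sigma_O^{-1})_{ij}$, so the constraint reads $z_i z_j = \text{target}_{ij}$. The function name, step size, and synthetic values are all illustrative, and plain gradient descent is used rather than SGD.

```python
def fit_z(target, steps=5000, lr=0.05):
    """Gradient descent on the sum over i < j of (z_i * z_j - target[i][j])**2."""
    d = len(target)
    z = [1.0] * d  # positive start, in the spirit of better-than-chance LFs
    for _ in range(steps):
        grad = [
            sum(2 * (z[i] * z[j] - target[i][j]) * z[j] for j in range(d) if j != i)
            for i in range(d)
        ]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


# Synthetic check: build the off-diagonal targets from a known z, then recover it.
z_true = [0.9, 0.6, 0.7]
target = [[zi * zj for zj in z_true] for zi in z_true]
z_hat = fit_z(target)
```

Only off-diagonal entries of `target` are read, mirroring the matrix-completion flavor of the problem: the diagonal is never observed, yet z is still pinned down. The cost per step depends on d, not on the number of data points.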

SLIDE 36

Recovery Results (Informal)

Given:

  • n unlabeled data points
  • A set of LFs that are on average better than 50% accurate
  • A sufficiently sparse dependency structure (per deterministic test)

[Figure: fully-connected dependency graph over $\lambda_1, \ldots, \lambda_4$ and $Y$ (bad) vs. conditionally independent graph (good)]

[Ratner et al., AAAI 2019; Ratner et al., NeurIPS 2016]

SLIDE 37

Recovery Results (Informal)

Given:

  • n unlabeled data points
  • A set of LFs that are on average better than 50% accurate
  • A sufficiently sparse dependency structure (per deterministic test)

Then:

$E\left[ \left\| \hat{z} - z^* \right\| \right] = O\left( \frac{1}{\sqrt{n}} \right)$

Parameter (LF accuracy & correlation strength) estimation error decreases with unlabeled data!

[Ratner et al., AAAI 2019; Ratner et al., NeurIPS 2016]

SLIDE 38

Result: Recovering the LF accuracies and correlations

This gives us a simple, and provably consistent, way to clean and integrate the LF outputs

  • Labeling functions: LF_pneumo, LF_short_report, LF_ontology, LF_off_shelf_classifier (as on slide 17)
  • Model: [graphical model over $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ and unobserved $Y$]
  • Parameters (accuracies & correlations): $\operatorname{argmin}_z \left\| \left( \Sigma_O^{-1} + zz^T \right)_\Omega \right\|$
  • Training label: $\tilde{Y} = P_z(\cdot \mid \lambda)$

SLIDE 39

(2) Clean & integrate noisy labels

[Snorkel pipeline diagram, step 2 (label model) highlighted]

SLIDE 40

(3) Train end model w/ training DB

[Snorkel pipeline diagram, step 3 (end model) highlighted]

Key question: How do we communicate the lineage (quality) of the training labels?

SLIDE 41

Ex: Importance of Label Lineage

  • 1M points, 60% accuracy
  • 10K points, 90% accuracy

Result: Average training label quality barely better than 60% accuracy

SLIDE 42

Solution: Modify Loss to Use Probabilistic Labels

  • Standard ERM: Minimize $\frac{1}{n} \sum_{i=1}^{n} l(x_i, y_i)$
  • We use a noise-aware loss: Minimize $\frac{1}{n} \sum_{i=1}^{n} E_{\tilde{y} \sim p_z(\cdot \mid \lambda_i)} \left[ l(x_i, \tilde{y}) \right]$

[Figure: PROBABILISTIC TRAINING LABEL over Normal / Abnormal fed to the END MODEL]

A simple tweak to the loss function to communicate lineage / quality!
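
A minimal sketch of the noise-aware loss for a single data point, assuming cross-entropy as the per-example loss and dictionaries as a stand-in for the end model's predicted distribution and the label model's probabilistic label (both are illustrative choices, not taken from the talk):

```python
import math


def cross_entropy(pred_probs, label):
    """Standard loss l(x, y): negative log-probability of the given label."""
    return -math.log(pred_probs[label])


def noise_aware_loss(pred_probs, label_probs):
    """Expected loss under a probabilistic label: sum over y of p(y) * l(x, y)."""
    return sum(
        p * cross_entropy(pred_probs, y) for y, p in label_probs.items() if p > 0
    )


pred = {"NORMAL": 0.8, "ABNORMAL": 0.2}       # end model's prediction for one x
confident = {"NORMAL": 1.0, "ABNORMAL": 0.0}  # label model is certain
uncertain = {"NORMAL": 0.6, "ABNORMAL": 0.4}  # label model is unsure

loss_confident = noise_aware_loss(pred, confident)
loss_uncertain = noise_aware_loss(pred, uncertain)
```

When the label model is certain, the expression reduces to the standard loss on that label. When it is unsure, the loss is a weighted mix over both labels, so the end model is not pushed as hard toward a label that may be wrong.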

SLIDE 43

Recap: The Snorkel Pipeline

  1. Users write labeling functions to heuristically label data
  2. Snorkel cleans and combines the LF labels
  3. The resulting training database is used to train an ML model

SLIDE 44

End-to-End Recovery Results (Informal)

$E\left[ l(\hat{w}) - l(w^*) \right] = O\left( \frac{1}{\sqrt{n}} \right)$

Expected test error of the end model decreases with unlabeled data!

Result: Given conditions from before, and some loose assumptions about the end model, generalization error decreases at the same rate

Same asymptotic rate as with labeled data!

[Ratner et al., AAAI 2019; Ratner et al., NeurIPS 2016]

SLIDE 45

Question: Why train a final model at all?

[Snorkel pipeline diagram: labeling functions → label model → end model]

SLIDE 46

Question: Why train a final model at all?

  • (1) Generalization
      • Often hard to write good, high-coverage LFs
      • We can leverage commodity ML models and tools to do better! [VLDB 2018]
  • (2) Cross-modal transfer
      • Write LFs over one feature set → train model over a different one

SLIDE 47

Highlight: Cross-Modal Transfer

[Snorkel pipeline diagram: labeling functions → label model → end model]

“Indication: Chest pain. Findings: Focal consolidation…”

Training data as a medium of transferring domain knowledge across modalities

SLIDE 48

Cross-Modal Chest X-ray Classification

[Figure annotation: Months]

SLIDE 49

Cross-Modal Chest X-ray Classification

[Figure annotations: Months, Years]

SLIDE 50

Cross-Modal Chest X-ray Classification

[Figure annotations: Months, Years]

20 Labeling Functions

Indication: Chest pain. Findings: Mediastinal contours are within normal limits. Heart size is within normal limits. No focal consolidation, pneumothorax or pleural effusion. Impression: No acute cardiopulmonary abnormality.

SLIDE 51

Cross-Modal Chest X-ray Classification

[Figure annotations: Months, Years, Days]

20 Labeling Functions (same example report as on slide 50)

SLIDE 52

Applications: Diversity and Real-World Impact

Medical Monitoring (image, video, time series)

[Dunnmon & Ratner, 2019] [Khandwala, NeurIPS ML4H 2017] [Fries, 2018]

Industry (web, text, other)

[Bach, SIGMOD Industry 2019] [Mallinar, AAAI 2019] [Bringer, 2019]

Knowledge Base Construction (text, tables, PDFs, HTML)

[Ratner, VLDB 2018] [Wu, SIGMOD 2018] [Kuleshov, NeurIPS ML4H 2016]

In many cases: exceeds the efficacy of person-months of labeled data in hours

SLIDE 53

Highlight: Scaling with unlabeled data

Takeaway: Add more unlabeled data, without changing the LFs, and get better end performance!

SLIDE 54

NIH Snorkel Workshop and User Study

Novice users: 45.5% better on average using Snorkel vs. hand-labeling

SLIDE 55

Snorkel: A System for Rapidly Creating Training Sets

EXPERT KNOWLEDGE & DATA: External KBs, Patterns & dictionaries, Natural Language

  • “causes”, “induces”, “linked to”, “aggravates”, …
  • “Chemicals of type A should be harmless…”
  • Pattern(“{{0}} reacts with”)
  • Subset A, Subset B, Subset C

Expert Developers

Snorkel can enable a more accessible, faster, and more powerful way of building ML applications

SLIDE 56

Our Research: Training Data Management Systems

[Diagram: Unlabeled data → (1) Labeling: Snorkel, (2) Augmentation: TANDA, (3) Multi-Task Supervision: MeTaL → Model]

SLIDE 57

Our Research: Training Data Management Systems

[Same diagram as on slide 56]

SLIDE 58

One Critical Tool: Data Augmentation

Ex: 13.4 pt. avg. accuracy gain from data augmentation across top ten CIFAR-100 models

SLIDE 59

Problem: Data augmentation is critical, but hard to hand-tune
Idea: Users provide transformations which we automatically tune and compose

SLIDE 60

Automatic Data Augmentation from User-Specified Invariances

Users write transformation functions (TFs), e.g. Rotate 2.5°, Rescale 1.1x, Blur 1.1x

SLIDE 61

TFs can express a diverse range of invariances

  • Images: Rotations, Crops, Hue shifts
  • NLP: Synonymy swaps, Language model swaps, Positional swaps
  • Medical: Standard image TFs, Parameterized value shifts, Move segmented mass

How do we tune & compose these?

SLIDE 62

Automating Tuning & Composing

How do we generate diverse but realistic transformed images?

Idea #1: Treat this as a sequence modeling problem

SLIDE 63

Automating Tuning & Composing

Idea #2: Use adversarial approach to learn to generate realistic images from unlabeled data

[Diagram: generative model $G$ producing TF sequences $h_{\tau_1}, \ldots, h_{\tau_L}$, and $D_\phi$ fed with unlabeled data]

SLIDE 64

Automatic Data Augmentation from User-Specified Invariances

  1. Users write transformation functions (TFs), e.g. Rotate 2.5°, Rescale 1.1x, Blur 1.1x
  2. We learn a generative model to tune & compose the TFs
  3. The learned data augmentation policy is used for training the end model

[Diagram: $G$ producing TF sequences $h_{\tau_1}, \ldots, h_{\tau_L}$, and $D_\phi$ fed with unlabeled data]
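
The interface this implies can be sketched as follows: TFs are plain functions from an example to a transformed example, and an augmentation policy picks a sequence of them to apply. Everything here is illustrative: the toy image is a nested list of pixel values, the three TFs are hypothetical, and the uniformly random choice of sequence is a placeholder for the learned generative model described in the talk.

```python
import random


def flip_horizontal(img):
    return [row[::-1] for row in img]


def brighten(img, delta=10):
    return [[min(255, p + delta) for p in row] for row in img]


def shift_right(img):
    """Shift every row one pixel to the right, padding with 0."""
    return [[0] + row[:-1] for row in img]


TFS = [flip_horizontal, brighten, shift_right]


def augment(img, rng, length=3):
    """Apply a sequence of `length` TFs drawn uniformly at random.

    Uniform sampling is a stand-in: a learned policy would choose which
    sequences to apply instead.
    """
    applied = []
    for _ in range(length):
        tf = rng.choice(TFS)
        img = tf(img)
        applied.append(tf.__name__)
    return img, applied


image = [[0, 50, 100], [150, 200, 250]]
augmented, applied = augment(image, random.Random(0))
```

Because each TF returns a new image, the original example is left untouched and many augmented variants can be drawn from it by re-sampling the sequence.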

SLIDE 65

Empirical Results: Gains Across Domains

  • Gains over heuristic data augmentation approaches:
      • 4 pts. in accuracy on CIFAR-10
      • 1.4 F1 score pts. on a text relation extraction problem
      • 3.8 pts. in accuracy on a clinical mammography classification task
  • Our core ideas have since been adopted in industry:
      • Google’s AutoAugment, yielding new SOTA on ImageNet

SLIDE 66

Our Research: Training Data Management Systems

[Diagram: Unlabeled data → (1) Labeling: Snorkel, (2) Augmentation: TANDA, (3) Multi-Task Supervision: MeTaL → Model]

SLIDE 67

Our Research: Training Data Management Systems

[Same diagram as on slide 66]

slide-68
SLIDE 68

Accelerating Machine Learning with Training Data Management Data Council 4/17/19 | Alexander Ratner

Most real-world settings: multiple related modeling tasks

“Indication: Chest pain. Findings: Focal consolidation and pneumothorax…”

$Y_1$ = “Abnormal”?

$Y_2$ = “Consolidation”?

$Y_3$ = “Pneumothorax”?

How do we approach these multiple related modeling tasks?

slide-69
SLIDE 69

Basic approach: T pipelines for T tasks

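[Diagram: one label model per task: $P_\mu(Y_1 \mid \boldsymbol{\lambda}_1)$ for $Y_1$, $P_\mu(Y_2 \mid \boldsymbol{\lambda}_2)$ for $Y_2$, $P_\mu(Y_3 \mid \boldsymbol{\lambda}_3)$ for $Y_3$]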

For ex: three separate Snorkel pipelines
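A minimal sketch of what three separate pipelines could look like for the chest X-ray report example. The labeling functions, keyword heuristics, and the +1 / -1 / 0 (abstain) vote encoding are all hypothetical, and a plain majority vote stands in for Snorkel's learned label model; none of this is Snorkel's actual API.

```python
ABSTAIN, POS, NEG = 0, 1, -1  # assumed vote encoding; 0 means the LF abstains

# Hypothetical labeling functions (LFs), written separately for each task.
def lf_abnormal_findings(report):
    text = report.lower()
    return POS if ("consolidation" in text or "pneumothorax" in text) else ABSTAIN

def lf_abnormal_no_acute(report):
    return NEG if "no acute" in report.lower() else ABSTAIN

def lf_consolidation_keyword(report):
    return POS if "consolidation" in report.lower() else ABSTAIN

def lf_pneumothorax_keyword(report):
    return POS if "pneumothorax" in report.lower() else ABSTAIN

def lf_pneumothorax_negated(report):
    return NEG if "no pneumothorax" in report.lower() else ABSTAIN

# One independent pipeline (its own set of LFs) per task.
PIPELINES = {
    "Abnormal": [lf_abnormal_findings, lf_abnormal_no_acute],
    "Consolidation": [lf_consolidation_keyword],
    "Pneumothorax": [lf_pneumothorax_keyword, lf_pneumothorax_negated],
}

def majority_vote(votes):
    """Stand-in for a learned label model: sign of the summed votes."""
    total = sum(votes)
    return POS if total > 0 else NEG if total < 0 else ABSTAIN

def label_report(report):
    """Run every task's pipeline independently on one report."""
    return {
        task: majority_vote([lf(report) for lf in lfs])
        for task, lfs in PIPELINES.items()
    }
```

Note how each task needs its own LFs even though the tasks are clearly related (a consolidation finding already implies an abnormal study), so supervision effort grows with the number of tasks.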

slide-70
SLIDE 70

Problem: We have to provide supervision (write LFs) for multiple tasks

Idea: Jointly model across multiple related tasks to do better with less

slide-71
SLIDE 71

Basic approach: T pipelines for T tasks

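[Diagram: one label model per task: $P_z(Y_1 \mid \boldsymbol{\lambda}_1)$ for $Y_1$, $P_z(Y_2 \mid \boldsymbol{\lambda}_2)$ for $Y_2$, $P_z(Y_3 \mid \boldsymbol{\lambda}_3)$ for $Y_3$]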

slide-72
SLIDE 72

Snorkel MeTaL: A System for Multi-Task Supervision

[Diagram: a single label model $P_z(Y \mid \lambda)$ covering tasks $Y_1$, $Y_2$, $Y_3$]

https://github.com/HazyResearch/metal

Multi-task model for shared representation

Joint modeling of multi-task LFs

slide-73
SLIDE 73

Key Idea: Cross-Task Overlaps

$P_z(Y \mid \lambda)$

https://github.com/HazyResearch/metal

$\lambda_1$ = “Normal”, $\lambda_2$ = “Consolidation”, $\lambda_3 = 0$

Use cross-task agreements / disagreements in an extended version of the matrix completion approach
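To make the cross-task signal concrete, here is a small sketch that turns LF outputs from different tasks into pairwise agreement and disagreement values. The implication table (for example, that a “Consolidation” vote implies “Abnormal”) and the +1 / -1 / 0 encoding are assumptions for illustration; this only shows the kind of signal involved and is not the matrix completion algorithm used in Snorkel MeTaL.

```python
ABSTAIN = 0  # an LF that does not vote

# Assumed task structure: the top-level (Normal vs. Abnormal) label
# that each LF output is consistent with.
IMPLIED_TOP_LEVEL = {
    "Normal": "Normal",
    "Abnormal": "Abnormal",
    "Consolidation": "Abnormal",
    "Pneumothorax": "Abnormal",
}

def cross_task_agreement(a, b):
    """Return +1 if two LF outputs agree, -1 if they conflict, 0 if either abstains."""
    if a == ABSTAIN or b == ABSTAIN:
        return 0
    return 1 if IMPLIED_TOP_LEVEL[a] == IMPLIED_TOP_LEVEL[b] else -1

def agreement_matrix(outputs):
    """Pairwise agreement matrix over the LF outputs for one data point."""
    n = len(outputs)
    return [
        [cross_task_agreement(outputs[i], outputs[j]) for j in range(n)]
        for i in range(n)
    ]

# The slide's example: one LF says "Normal", another says "Consolidation",
# and the third abstains.
example = agreement_matrix(["Normal", "Consolidation", ABSTAIN])
```

For the slide's example, the first two LFs conflict even though they vote on different tasks, because a consolidation finding cannot occur in a normal study; statistics of this kind, aggregated over many unlabeled points, are what a joint label model can exploit.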

slide-74
SLIDE 74

Snorkel MeTaL: A System for Multi-Task Supervision

[Diagram: a single label model $P_z(Y \mid \lambda)$ covering tasks $Y_1$, $Y_2$, $Y_3$]

https://github.com/HazyResearch/metal

Empirical results: Strong gains over single-task approach (avg. 4 F1 points) and easier interface

slide-75
SLIDE 75

Our Research: Training Data Management Systems

[Diagram: Unlabeled data → Model (output: “Normal”), supported by three systems: 1. Labeling (Snorkel); 2. Augmentation (TANDA); 3. Multi-Task Supervision (MeTaL), with tasks $Y_1$, $Y_2$, $Y_3$]

slide-76
SLIDE 76

Research Agenda

Make real-world ML applications radically easier and faster to build with data management systems that support critical steps outside of the model

Training Data | Models | Model Ecosystems

slide-77
SLIDE 77

Programming Stack (low-level to high-level)   Supervision Stack (manual to automated)
Machine Language                              Individual Labels
Assembly Language                             LFs Coded Directly
High-Level Language                           LFs Built on Advanced Primitives
Declarative Language                          LFs Compiled from Natural Language
Application Interfaces                        LFs Auto-Generated from User Behavior

Building up the code-as-supervision stack

From supervision as labels to supervision as code

Training Data

slide-78
SLIDE 78

Programming Stack (low-level to high-level)   Supervision Stack (manual to automated)
Machine Language                              Individual Labels
Assembly Language                             LFs Coded Directly
High-Level Language                           LFs Built on Advanced Primitives
Declarative Language                          LFs Compiled from Natural Language
Application Interfaces                        LFs Auto-Generated from User Behavior

Building up the code-as-supervision stack

SELECT * FROM table WHERE val > 3;

?

Goal: Move up the stack to make ML radically easier to program

Training Data

slide-79
SLIDE 79

Formalize and support other critical “preprocessing” steps of ML

Goal: Build data management systems that accelerate where ML developers actually spend their time

Training Set Management (this talk) | Data Selection | Data Preprocessing | Candidate Extraction

Models

slide-80
SLIDE 80

The Massively Multi-Task Ecosystem

[Diagram: interacting models: Segment, Triage, Disease A?, Disease B?, Disease C?]

As it becomes faster to build training sets, there will be tens to hundreds of interacting models

New challenges:

  • Incremental maintenance
  • Handling complex data dependencies
  • New formalisms

Goal: Support new model ecosystems at massive scale

[Figure labels: Today; x 10; Tomorrow]

Model Ecosystems

slide-81
SLIDE 81

Our Research: Training Data Management Systems

[Diagram: Unlabeled data → Model (output: “Normal”), supported by three systems: 1. Labeling (Snorkel); 2. Augmentation (TANDA); 3. Multi-Task Supervision (MeTaL), with tasks $Y_1$, $Y_2$, $Y_3$]

Key Idea: Add mathematical & systems structure to training data creation & management

slide-82
SLIDE 82

Our Research: Training Data Management Systems

1. Data Labeling

2. Data Augmentation

3. Multi-Task Supervision

Thank you!

More info:

  • snorkel.stanford.edu
  • github/HazyResearch/metal
  • github/HazyResearch/tanda

Thank you to: Chris Ré, Daniel Rubin, Kunle Olukotun, John Duchi, Chris De Sa, Sen Wu, Daniel Selsam, Henry Ehrenberg, Jason Fries, Bryan He, Braden Hancock, Theo Rekatsinas, Paroma Varma, Fred Sala, Jared Dunnmon, the Stanford Bio-X SIG Fellowship, and the many other contributors, users, and sponsors of Snorkel