
SLIDE 1

FlyingSquid: Speeding Up Weak Supervision with Triplet Methods

Dan Fu*, Mayee Chen*, Fred Sala, Sarah Hooper, Kayvon Fatahalian, Chris Ré

Daniel Y. Fu*, Mayee F. Chen*, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, and Christopher Ré. Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods. ICML 2020. * Denotes Equal Contribution

SLIDE 2


The Training Data Bottleneck in ML

Collecting training data can be slow and expensive

SLIDE 3


Weak Supervision - A Response

# Labeling functions vote on each comment; ABSTAIN (0) means no vote
SPAM, NOT_SPAM, ABSTAIN = +1, -1, 0

def L_1(comment): return SPAM if "http" in comment else ABSTAIN
def L_2(comment): return NOT_SPAM if "love" in comment else ABSTAIN

User-Defined Functions | Crowd Workers | External Knowledge Bases

How to best use multiple noisy sources of supervision?
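For instance, a naive way to combine the two spam labeling functions above is a majority vote. A toy sketch with hypothetical comments; the tie on the third comment shows why naive combination is not enough:

comments = ["win money at http://spam.example", "love this song", "love this http link"]
for c in comments:
    votes = [v for v in (L_1(c), L_2(c)) if v != ABSTAIN]
    consensus = SPAM if sum(votes) > 0 else NOT_SPAM if sum(votes) < 0 else ABSTAIN
    print(c, "->", consensus)  # third comment: the two labeling functions conflict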

SLIDE 4


Data Programming: Unifying Weak Supervision [1]

[Pipeline figure: Unlabeled Input (X1, X2, X3) → Labeling Functions → Label Model (a latent variable model with parameters μ(λ1, Y1), μ(λ2, Y1), μ(λ3, Y1), μ(λ4, λ5, Y2), μ(λ6, Y2)) → Probabilistic Training Labels (0.95, 0.87, 0.09) → End Model]

Example labeling functions:

def S_1: label +1 if Bernie on screen
def S_2: label +1 if background blue
def S_3: return CROWD_WORKER_VOTE

1. Users write labeling functions
2. Model labeling function behavior to de-noise them
3. Use the probabilistic labels to train a downstream model

[1] Ratner et al. Snorkel: Rapid Training Data Creation with Weak Supervision. VLDB 2018.

SLIDE 6

Weak Supervision: A Response

[Same pipeline figure as Slide 4; an annotation, "Expensive SGD iterations", points at step 2, the label model]

Modeling labeling functions is critical, but can be slow…

SLIDE 7


FlyingSquid: Reduce the Turnaround Time

▪ Background: labeling functions and graphical models
▪ Closed-form solution to model parameters, no SGD
▪ Theoretical bounds and guarantees
▪ Runs orders of magnitude faster without losing accuracy, enabling weakly-supervised online learning (see the usage sketch below)
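As a concrete entry point, here is a minimal usage sketch based on the examples in the public FlyingSquid repository (the current API may differ; the synthetic votes are a stand-in for real labeling function outputs):

import numpy as np
from flyingsquid.label_model import LabelModel

# Synthetic data: 5 voters that each agree with a hidden label y 80% of the time
y = np.random.choice([1, -1], size=1000)
L_train = np.where(np.random.rand(1000, 5) < 0.8, y[:, None], -y[:, None])

label_model = LabelModel(L_train.shape[1])  # m = number of labeling functions
label_model.fit(L_train)                    # closed-form triplet estimation, no SGD
preds = label_model.predict(L_train)        # de-noised labels for training an end model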


SLIDE 8


Background

SLIDE 9


Problem Setup

Unlabeled data: X1, …, Xn
m labeling functions: S1 : 𝒳 → λ1 ∈ {±1, 0}, …, Sm : 𝒳 → λm ∈ {±1, 0} (0 = abstain)
Probabilistic labels: Ŷ1, …, Ŷn
Downstream end model: fw : 𝒳 → 𝒴

We want to learn the joint distribution P(λ, Y), without observing Y!
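Concretely, the observable input reduces to an (n × m) vote matrix with entries in {+1, −1, 0}; a small helper sketch (make_label_matrix is a hypothetical name):

import numpy as np

def make_label_matrix(xs, lfs):
    # Rows = data points, columns = labeling functions; entries in {+1, -1, 0}
    return np.array([[lf(x) for lf in lfs] for x in xs])

# e.g. L = make_label_matrix(comments, [L_1, L_2])  ->  shape (n, m)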

SLIDE 10


Model Labeling Functions with Latent Graphical Models

[Figure: single-task label model: one hidden variable Y (the true label) with observed labeling function outputs λ1–λ5]

[Figure: sequential label model: hidden variables Y1, Y2, Y3 linked by temporal dependencies, with observed outputs λ1–λ9]

Technical problem: learning the parameters of these graphical models. Main challenge: recovering the accuracies μ of the labeling functions.

[2] Varma et al. Learning Dependency Structures for Weak Supervision Models. ICML 2019.
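For intuition, each accuracy parameter is a moment of the form E[λiY]. If the true labels were observed, estimating it would be a one-liner; the whole difficulty is that Y is latent. A hypothetical sketch:

import numpy as np

def accuracy_moments(L, y):
    # mu_i = E[lambda_i * Y]: trivial WITH true labels y, impossible without them
    return (L * y[:, None]).mean(axis=0)  # shape (m,)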

SLIDE 11


Parameter Recovery

SLIDE 12


Existing Iterative Approaches Can Be Slow

Prior approaches learn the parameters with SGD over a loss function: Ratner et al. 2016; Ratner et al. 2018; Ratner et al. 2019; Bach et al. 2019 (Gibbs sampling); Sala et al. 2019; Zhan et al. 2019; Safranchik et al. 2020.

Disadvantages: SGD can take a long time, and there are many hyperparameters (learning rate, momentum, etc.) to tune.

SLIDE 13


Solve Triplets of Labeling Function Parameters at a Time

Method of moments: break the problem into pieces and get closed-form solutions.

Solve triplets: for a triplet of conditionally independent labeling functions,

E[λ1Y1] E[λ2Y1] = E[λ1λ2]
E[λ2Y1] E[λ3Y1] = E[λ2λ3]
E[λ3Y1] E[λ1Y1] = E[λ3λ1]

[Figure: the label model from Slide 4, with parameters μ(λ1, Y1), μ(λ2, Y1), μ(λ3, Y1), μ(λ4, λ5, Y2), μ(λ6, Y2) solved triplet by triplet]

SLIDE 14


Triplets of Conditionally-Independent Labeling Functions

Form triplets of these equations:

μ1 μ4 = N1,4
μ1 μ5 = N1,5
μ4 μ5 = N4,5

Get closed-form solutions:

|μ1| = √(N1,4 N1,5 / N4,5)
|μ4| = √(N1,4 N4,5 / N1,5)
|μ5| = √(N1,5 N4,5 / N1,4)

(multiply two of the equations and divide by the third: e.g., (μ1μ4)(μ1μ5)/(μ4μ5) = μ1²)

All we need to do is count how often the labeling functions agree - no SGD!

Each equation relates unobservable accuracy parameters to an observable agreement moment:

μi μj = Ni,j, where Ni,j = E[λiλj] is the observable moment.
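A minimal NumPy sketch of this recovery (assumptions: votes in {±1} with no abstains, a conditionally independent triplet, and better-than-random labeling functions so the positive root is the right sign):

import numpy as np

def triplet_recover(L, i, j, k):
    # Observable agreement moments: N[a, b] estimates E[lambda_a * lambda_b]
    N = L.T @ L / L.shape[0]
    mu_i = np.sqrt(abs(N[i, j] * N[i, k] / N[j, k]))
    mu_j = np.sqrt(abs(N[i, j] * N[j, k] / N[i, k]))
    mu_k = np.sqrt(abs(N[i, k] * N[j, k] / N[i, j]))
    return mu_i, mu_j, mu_k  # positive roots: assume better-than-random LFs

On the synthetic 80%-accurate voters from the earlier sketch, each recovered value comes out near E[λY] = 0.8 − 0.2 = 0.6.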

SLIDE 15


Theoretical Analysis

SLIDE 16


Bounding Sampling Error (Informal)

Theorem 1 (How Sampling Error Scales in n):

E[∥μ̂ − μ∥2] ≤ O(n^(−1/2))

where ∥μ̂ − μ∥2 is the error in the parameter estimate and n is the number of unlabeled data points.

Theorem 2 (Optimal Scaling Rate):

E[∥μ̂ − μ∥2] ≥ Ω(n^(−1/2))

[Figure: conditionally independent labeling functions λ1–λ5 with a single hidden label Y]

Takeaways: this is the best possible scaling rate with unlabeled data; the bound is tight.

SLIDE 17


End Model Generalization Error (Informal)

E[L(ŵ, X, Y) − L(w*, X, Y)] = O(n^(−1/2))

(end model generalization error)

This is the same asymptotic rate as with supervised data!

More theory nuggets (check out our paper for details):

We can achieve these rates even with model misspecification (graph is incorrect)

Bounds for distributional drift over time in the online setting

Theorem 3 (End Model Generalization Error): if you use the estimated parameters μ̂ to generate probabilistic labels Ŷ and train an end model f_ŵ, the generalization bound above holds.

SLIDE 18


Evaluation & Implications

SLIDE 19


We run faster and get high quality

Label model training times (s):

              Snorkel   Temporal Snorkel   FlyingSquid
Benchmarks      3.0          n/a              0.06
Video Tasks    41.5        292.3              0.20

End model accuracies (F1):

              Snorkel   Temporal Snorkel   FlyingSquid
Benchmarks     74.6          n/a             77.0
Video Tasks    47.4         75.2             76.2

SLIDE 20


Re-training in the End Model Training Loop

No SGD → we can re-train the label model inside the training loop of the end model (sketched below). PyTorch integration: a FlyingSquid loss layer.

[Figure: unlabeled inputs X1, X2, X3 flow through the end model; loss gradients flow back through the FlyingSquid loss layer]
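A minimal sketch of what that loop can look like, assuming the label model exposes per-example marginal probabilities via predict_proba_marginalized (check the repo for the exact loss-layer API; the data here is synthetic):

import numpy as np
import torch
from flyingsquid.label_model import LabelModel

# Toy setup: 1000 examples, 128 features, m = 5 labeling functions
X = torch.randn(1000, 128)
y_true = np.random.choice([1, -1], size=1000)
L = np.where(np.random.rand(1000, 5) < 0.8, y_true[:, None], -y_true[:, None])

end_model = torch.nn.Linear(128, 1)
optimizer = torch.optim.Adam(end_model.parameters())
label_model = LabelModel(5)

for start in range(0, 1000, 100):            # mini-batches of 100, for illustration
    X_b, L_b = X[start:start+100], L[start:start+100]
    label_model.fit(L_b)                     # closed-form refit each step: cheap, no SGD
    y_prob = torch.as_tensor(
        label_model.predict_proba_marginalized(L_b),  # P(Y = +1); assumed API
        dtype=torch.float32)
    p = torch.sigmoid(end_model(X_b)).squeeze(1)      # end model's predicted P(Y = +1)
    loss = torch.nn.functional.binary_cross_entropy(p, y_prob)
    optimizer.zero_grad(); loss.backward(); optimizer.step()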

SLIDE 21


Speedups enable online learning

Online learning: re-train on a rolling window to adapt to distributional drift over time (see the sketch below).
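A minimal sketch of the rolling-window refit (the window size and vote_stream are hypothetical stand-ins):

import numpy as np
from collections import deque
from flyingsquid.label_model import LabelModel

label_model = LabelModel(5)                # m = 5 labeling functions
window = deque(maxlen=1000)                # keep only the most recent votes

def vote_stream(p=0.8):
    # Hypothetical stand-in for a live stream of (m,) labeling function votes
    while True:
        y = np.random.choice([1, -1])
        yield np.where(np.random.rand(5) < p, y, -y)

for t, L_t in zip(range(5000), vote_stream()):
    window.append(L_t)
    if len(window) == window.maxlen:
        label_model.fit(np.array(window))  # cheap closed-form refit on recent data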

SLIDE 22

Contact: Dan Fu (danfu@cs.stanford.edu, @realDanFu)
Code: https://github.com/HazyResearch/flyingsquid
Blog post (Towards Interactive Weak Supervision with FlyingSquid): http://hazyresearch.stanford.edu/flyingsquid
Paper (Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods): https://arxiv.org/abs/2002.11955

Thank you!


Dan Fu Mayee Chen Fred Sala Sarah Hooper