SLIDE 1

Active Learning with Disagreement Graphs

Corinna Cortes1, Giulia DeSalvo1, Claudio Gentile1, Mehryar Mohri1,2, Ningshan Zhang3

1 Google Research 2 Courant Institute, NYU 3 NYU

ICML, June 12, 2019

SLIDE 2

On-line Active Learning Setup

◮ At each round t ∈ [T], receive an unlabeled point xt ∼ DX, drawn i.i.d.
◮ Decide whether to request the label:
◮ If the label is requested, receive yt.
◮ After T rounds, return a hypothesis hT ∈ H.

Objective:

◮ Generalization error:
◮ Accurate predictor hT: small expected loss R(hT) = E_{x,y}[ℓ(hT(x), y)].
◮ Close to the best-in-class h∗ = argmin_{h ∈ H} R(h).
◮ Label complexity: few label requests.
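The protocol above can be sketched as a generic loop; the `stream` and `learner` interfaces here are illustrative, not from the talk:

```python
import random

def online_active_learning(stream, learner, T):
    """Generic online active-learning loop: at each round the learner sees an
    unlabeled point, decides whether to pay for its label, and updates."""
    num_requests = 0
    for t in range(T):
        x = stream.draw_unlabeled()          # x_t ~ D_X, i.i.d.
        p = learner.request_probability(x)   # algorithm-specific bias in [0, 1]
        if random.random() < p:              # flip the coin Q_t ~ Ber(p_t)
            y = stream.query_label(x)        # y_t is revealed only on request
            learner.update(x, y, p)
            num_requests += 1
    return learner.final_hypothesis(), num_requests
```

Any concrete algorithm (IWAL, IWAL-D, IZOOM) plugs in its own `request_probability` and `update`.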

SLIDE 3

Disagreement-based Active Learning

Key idea: Request label when there is some disagreement among hypotheses. Examples:

◮ Separable case: CAL (Cohn et al., 1994).
◮ Non-separable case: A2 (Balcan et al., 2006), DHM (Dasgupta et al., 2008).
◮ IWAL (Beygelzimer et al., 2009).

Can we improve upon existing disagreement-based algorithms, such as IWAL?

◮ Better guarantees?
◮ Leverage average disagreements?

SLIDE 4

This talk

◮ IWAL-D algorithm: enhanced IWAL with the disagreement graph.
◮ IZOOM algorithm: enhanced IWAL-D with zooming-in.
◮ Better generalization and label complexity guarantees.
◮ Experimental results.

SLIDE 5

Disagreement Graph (D-Graph)

◮ Vertices: hypotheses in H (a finite hypothesis set).
◮ Edges: fully connected. The edge between h, h′ ∈ H is weighted by their expected disagreement:

L(h, h′) = E_{x∼DX}[ max_{y ∈ Y} |ℓ(h(x), y) − ℓ(h′(x), y)| ].

L is symmetric, and ℓ ≤ 1 implies L ≤ 1.

◮ D-Graph can be accurately estimated using unlabeled data.
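Since L(h, h′) involves no labels, each edge weight can be estimated by a plain Monte Carlo average over an unlabeled sample. A minimal sketch, assuming a generic bounded loss and a finite label set (names are illustrative):

```python
def estimate_disagreement(h1, h2, unlabeled_xs, loss, labels=(0, 1)):
    """Monte Carlo estimate of the expected disagreement
    L(h1, h2) = E_x[ max_y |loss(h1(x), y) - loss(h2(x), y)| ],
    averaged over an unlabeled sample."""
    total = 0.0
    for x in unlabeled_xs:
        total += max(abs(loss(h1(x), y) - loss(h2(x), y)) for y in labels)
    return total / len(unlabeled_xs)
```

With the 0-1 loss this reduces to the fraction of the sample on which h1 and h2 predict differently.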

SLIDE 6

Disagreement Graph (D-Graph)

One favorable scenario:

◮ The best-in-class h∗ lies within an isolated cluster;
◮ L(h, h∗) is small within that cluster.

SLIDE 7

IWAL-D Algorithm: IWAL with D-Graph

◮ At round t ∈ [T], receive xt.

1. Flip a coin Qt ∼ Ber(pt), with disagreement-based bias

   pt = max_{h, h′ ∈ Ht} max_{y ∈ Y} |ℓ(h(xt), y) − ℓ(h′(xt), y)|.

2. If Qt = 1, request the label yt.
3. Trim the version space:

   Ht+1 = { h ∈ Ht : L̂t(h) ≤ L̂t(ĥt) + [1 + L(h, ĥt)] ∆t },

   which uses the importance-weighted empirical risk
   L̂t(h) = (1/t) Σ_{s=1}^{t} (Qs / ps) ℓ(h(xs), ys),
   with ĥt = argmin_{h ∈ Ht} L̂t(h) and ∆t = O(√(log(T|H|) / t)).

◮ After T rounds, return ĥT.
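The two ingredients of a round, the query bias pt and the trimming step, can be sketched as follows for a finite hypothesis list. This is an illustrative simplification, not the authors' code; `emp_risk` stands for the importance-weighted empirical risk L̂t and `pair_disagreement` for the graph weight L:

```python
def request_probability(x_t, hypotheses, loss, labels):
    """p_t: largest loss gap over hypothesis pairs in H_t and labels y."""
    return max(
        abs(loss(h(x_t), y) - loss(hp(x_t), y))
        for h in hypotheses for hp in hypotheses for y in labels
    )

def trim(hypotheses, emp_risk, pair_disagreement, delta_t):
    """Keep h with emp_risk(h) <= emp_risk(h_best) + (1 + L(h, h_best)) * delta_t,
    i.e. the disagreement-weighted version-space trimming of IWAL-D."""
    h_best = min(hypotheses, key=emp_risk)
    threshold_base = emp_risk(h_best)
    return [h for h in hypotheses
            if emp_risk(h) <= threshold_base
            + (1.0 + pair_disagreement(h, h_best)) * delta_t]
```

Note how a small L(h, ĥt) tightens the threshold, so hypotheses close to ĥt in the graph are trimmed more aggressively than under plain IWAL.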

SLIDE 8

IWAL-D vs. IWAL: Quantify the Improvement

Theorem (IWAL-D). With high probability,

R(ĥT) ≤ R∗ + [1 + L(ĥT, h∗)] ∆T,

E_{x∼DX}[pt | Ft−1] ≤ 2θ [ 2R∗ + max_{h ∈ Ht} (2 + L(h, ĥt−1) + L(h, h∗)) ∆t−1 ].

◮ θ: disagreement coefficient (Hanneke, 2007).
◮ More aggressive trimming of the version space.
◮ Slightly better generalization guarantee and label complexity.

SLIDE 9

IWAL and IWAL-D

Problem:

◮ Theoretical guarantees only hold for finite hypothesis sets.
◮ An ε-cover is needed to extend them to infinite hypothesis sets.
◮ Constructing an ε-cover is expensive in practice.

Can we adaptively enrich the hypothesis set, with theoretical guarantees?

SLIDE 10

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.

SLIDE 11

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.
◮ H′t+1 ← Trim(Ht)

SLIDE 12

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.
◮ H′t+1 ← Trim(Ht)
◮ H′′t+1 ← Resample(H′t+1)

Resample(H′t+1): sample new h ∈ ConvexHull(H′t+1).
◮ E.g., a random convex combination of ĥt and some h ∈ H′t+1.
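For hypotheses represented as weight vectors, the resampling step can be sketched as drawing random convex combinations of the current best ĥt with surviving hypotheses. This is an illustrative sketch assuming a vector parameterization; the function and argument names are not from the talk:

```python
import random

def resample(h_best, survivors, num_new, seed=0):
    """Sample num_new hypotheses of the form alpha * h_best + (1 - alpha) * h,
    with h drawn from the trimmed set and alpha uniform in [0, 1); every new
    hypothesis lies in ConvexHull({h_best} U survivors)."""
    rng = random.Random(seed)
    new_hs = []
    for _ in range(num_new):
        h = rng.choice(survivors)
        alpha = rng.random()
        new_hs.append(tuple(alpha * wb + (1 - alpha) * w
                            for wb, w in zip(h_best, h)))
    return new_hs
```

Because the losses of interest are convex in the parameters for models such as logistic regression, combinations of good hypotheses are themselves reasonable candidates.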

SLIDE 13

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.
◮ H′t+1 ← Trim(Ht)
◮ H′′t+1 ← Resample(H′t+1)
◮ Ht+1 ← H′t+1 ∪ H′′t+1

SLIDE 14

IZOOM vs. IWAL-D

Let H̄t = ∪_{s=1}^{t} Hs, i.e., all the hypotheses ever considered up to time t, and let h∗t = argmin_{h ∈ H̄t} R(h).

Theorem (IZOOM). With high probability,

R(ĥT) ≤ R∗T + [1 + L(ĥT, h∗T)] ∆T + O(1/T),

E_{x∼DX}[pt+1 | Ft] ≤ 2θt [ 2R∗t + max_{h ∈ Ht+1} (2 + L(h, ĥt) + L(h, h∗t)) ∆t ] + O(1/T).

◮ R∗t = min_{h ∈ H̄t} R(h) is smaller than R∗ = min_{h ∈ H0} R(h).
◮ More accurate ĥT, with fewer label requests.

SLIDE 15

Experiments

Tasks: 8 binary classification datasets from the UCI repository.

◮ ℓ: logistic loss rescaled to [0, 1].
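The analysis assumes losses bounded in [0, 1]; one simple way to rescale the logistic loss accordingly (the exact rescaling is not specified on the slide, so the cap here is an assumption) is to clip it and divide by the cap:

```python
import math

def rescaled_logistic_loss(score, y, cap=10.0):
    """Logistic loss log(1 + exp(-y * score)) for y in {-1, +1},
    clipped at `cap` and divided by `cap` so the result lies in [0, 1].
    The cap value is an illustrative choice, not from the slides."""
    raw = math.log1p(math.exp(-y * score))
    return min(raw, cap) / cap
```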

Baselines:

◮ IWAL with 3,000 hypotheses.
◮ IWAL with 12,000 hypotheses.
◮ IZOOM with 3,000 hypotheses.

Performance measure:

◮ 0-1 loss on test data vs. number of label requests.

SLIDE 16

Experiments

[Figure: Misclassification loss on test data vs. log2(number of label requests), comparing IWAL 3000, IWAL 12000, and IZOOM 3000 on the nomao, codrna, skin, covtype, magic04, and a9a datasets.]

SLIDE 17

Conclusion

◮ Introduced the disagreement graph and its role in active learning.
◮ More favorable generalization and label complexity guarantees.
◮ Substantial empirical performance improvements.
◮ Effective solutions for active learning.

Poster: Pacific Ballroom #265.
KDD workshop (Alaska, August 2019): Active Learning: Data Collection, Curation, and Labeling for Mining and Learning.