
Active Learning for Classification with Abstention

Shubhanshu Shekhar¹ (University of California, San Diego), Mohammad Ghavamzadeh (Facebook AI Research), Tara Javidi (University of California, San Diego)

ISIT 2020

¹ shshekha@eng.ucsd.edu


Introduction: Classification with Abstention

Feature space X, a joint distribution P_XY on X × {0, 1}, and augmented label set Ȳ = {0, 1, ∆}, where ∆ is the 'abstain' option.

Risk of an abstaining classifier g : X → Ȳ:

R_λ(g) := P_XY(g(X) ≠ Y, g(X) ≠ ∆) + λ P_X(g(X) = ∆),

where the first term is the risk from misclassification and the second is the risk from abstention. The cost of misclassification is 1, while the abstention cost is λ ∈ (0, 1/2).

Goal: Given access only to a training sample D_n = {(X_i, Y_i) : 1 ≤ i ≤ n}, construct a classifier g_n with small risk.
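As a quick illustration of R_λ, a minimal empirical version of this risk (the names and the sentinel standing in for the abstain option ∆ are our own conventions, not the paper's):

```python
import numpy as np

ABSTAIN = -1  # sentinel standing in for the abstain label "Delta"

def empirical_abstention_risk(y_true, y_pred, lam):
    """Empirical analogue of R_lambda(g): misclassification mass on
    non-abstained points plus lam times the abstention mass."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    abstained = y_pred == ABSTAIN
    misclassified = (~abstained) & (y_pred != y_true)
    return misclassified.mean() + lam * abstained.mean()

# e.g., 1 mistake and 1 abstention out of 4 points with lam = 0.3:
# risk = 1/4 + 0.3 * 1/4 = 0.325
print(empirical_abstention_risk([1, 0, 1, 0], [1, 1, ABSTAIN, 0], 0.3))
```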


Introduction: Active and Passive Learning

Passive Learning: D_n := {(X_i, Y_i) | 1 ≤ i ≤ n} is generated in an i.i.d. manner.
Active Learning: the learner constructs D_n by sequentially querying a labelling oracle.

Three commonly used active learning models:
• Membership query: request the label at any x ∈ X.
• Stream-based: query or discard inputs from a stream X_t ∼ P_X.
• Pool-based: query points from a given pool of inputs.

This talk: membership query. Paper: all three models.
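For concreteness, the three query models can be thought of as three different oracle interfaces seen by the learner. The sketch below uses our own names, and takes the feature space to be one-dimensional purely for illustration:

```python
from typing import Callable, Iterator, Sequence

# Membership query: the learner may request the label of any point x in X.
MembershipOracle = Callable[[float], int]   # x -> y in {0, 1}

# Stream-based: inputs X_t ~ P_X arrive one at a time; query or discard each.
Stream = Iterator[float]

# Pool-based: a fixed pool of unlabeled inputs from which queries are chosen.
Pool = Sequence[float]
```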


Prior Work

Prior work has focused mainly on the passive case:
• Chow (1957): derived the Bayes optimal classifier.
• Chow (1970): trade-off between the error and abstention rates.
• Herbei & Wegkamp (2006): plug-in and ERM classifiers.
• Bartlett & Wegkamp (2008), Yuan (2010), Cortes et al. (2016): calibrated convex surrogates.
• Geifman & El-Yaniv (2017): abstention incorporated into neural networks.

Active learning for classification with abstention is largely unexplored.

Questions
1. Performance limits of learning algorithms in both settings.
2. Design an active algorithm which matches the performance limit.
3. Characterize the performance gain of active over passive learning.
4. Computationally efficient active algorithms via convex surrogates.
5. Active learning for neural networks with abstention.


Introduction: Bayes Optimal Classifier

Define the regression function η(x) := P_XY(Y = 1 | X = x). For a fixed cost of abstention λ ∈ (0, 1/2), the Bayes optimal classifier g*_λ is defined as:

g*_λ(x) = 1 if η(x) ≥ 1 − λ,  0 if η(x) ≤ λ,  ∆ if η(x) ∈ (λ, 1 − λ).

The optimal classifier can equivalently be represented by the triplet (G*_0, G*_1, G*_∆), where G*_v := {x ∈ X | g*_λ(x) = v} for v ∈ {0, 1, ∆}.
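A minimal plug-in version of g*_λ, assuming η is available as an array of values at the points of interest (the abstain sentinel below is our own convention):

```python
import numpy as np

ABSTAIN = -1  # sentinel for the abstain option "Delta"

def bayes_abstaining_classifier(eta_values, lam):
    """g*_lambda: predict 1 where eta >= 1 - lam, 0 where eta <= lam,
    and abstain where eta lies in (lam, 1 - lam)."""
    eta_values = np.asarray(eta_values)
    g = np.full(eta_values.shape, ABSTAIN)
    g[eta_values >= 1 - lam] = 1
    g[eta_values <= lam] = 0
    return g

print(bayes_abstaining_classifier([0.05, 0.3, 0.95], lam=0.2))  # -> [0, -1, 1]
```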

Introduction: Bayes Optimal Classifier

[Two figure-only slides illustrating the Bayes optimal classifier.]


Assumptions

We make the following two assumptions on the joint distribution P_XY.

(HÖ) The regression function η is Hölder continuous with parameters L > 0 and 0 < β ≤ 1, i.e.,
|η(x₁) − η(x₂)| ≤ L d(x₁, x₂)^β for all x₁, x₂ ∈ (X, d).

(MA) The joint distribution P_XY of the input-label pair satisfies the margin assumption with parameters C₀ > 0 and α₀ ≥ 0: for γ ∈ {λ, 1 − λ},
P_X(|η(X) − γ| ≤ t) ≤ C₀ t^{α₀} for all 0 < t ≤ 1.

We will use P(α₀, β) to denote the class of distributions P_XY satisfying these conditions.
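To make the (MA) condition concrete, here is a small Monte Carlo check of the mass that P_X places near the thresholds γ ∈ {λ, 1 − λ}; the regression function and marginal below are toy choices of ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # toy regression function on X = [0, 1]; Hölder with beta = 1 (Lipschitz, L = 1)
    return np.clip(x, 0.0, 1.0)

lam, t = 0.2, 0.05
X = rng.uniform(0.0, 1.0, size=100_000)        # samples from a toy P_X
for gamma in (lam, 1 - lam):
    mass = np.mean(np.abs(eta(X) - gamma) <= t)
    print(f"P_X(|eta(X) - {gamma}| <= {t}) ~ {mass:.3f}")
# For this eta and uniform P_X the mass is about 2t, so (MA) holds with
# alpha0 = 1 and C0 = 2.
```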


Proposed Algorithm: Outline

Repeat for t = 1, 2, . . . :

• Q_t^(c) (classified cells) and Q_t^(u) (unclassified cells) form a partition of X.
• For each cell E ∈ Q_t^(u), compute
  – η̂_t(E) = (1/n_t(E)) Σ_{X_i ∈ E} Y_i,
  – u_t(E) = η̂_t(E) + term1 + term2,
  – ℓ_t(E) = η̂_t(E) − term1 − term2.
• If u_t(E) < λ or ℓ_t(E) > 1 − λ, add E to Q_t^(c); otherwise keep it in Q_t^(u).
• Select E_t = arg max_{E ∈ Q_t^(u)} u_t(E) − ℓ_t(E).
• Refine E_t if term1 < term2; otherwise query the label at x_t ∼ Unif(E_t).

Classifier Definition.

ĝ_n(x) = 1 if ℓ_{t_n}(x) > 1 − λ,  0 if u_{t_n}(x) < λ,  ∆ otherwise.   (1)
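A minimal sketch of this outline, with several choices made purely for illustration: the feature space is the interval [0, 1] refined by halving cells, term1 is taken to be a Hoeffding-style confidence width, term2 a Hölder bias term L · diam(E)^β, and the constants are arbitrary. It is not the paper's exact algorithm, but it shows how the classify / refine / query loop and the final classifier (1) fit together:

```python
import numpy as np

ABSTAIN = -1  # sentinel standing in for the abstain option "Delta"

def active_abstention_sketch(query_label, n_rounds, lam,
                             L=1.0, beta=1.0, delta=0.05, rng=None):
    """Illustrative sketch of the outline above on X = [0, 1].

    query_label(x) -> {0, 1} plays the role of the labelling oracle.  Each
    cell E is a dict holding its interval, query count n, label sum s, and
    its decision once it has been moved to the 'classified' partition."""
    rng = rng or np.random.default_rng(0)
    cells = [dict(lo=0.0, hi=1.0, n=0, s=0.0, decision=None)]

    def bounds(c):
        eta_hat = c["s"] / c["n"] if c["n"] else 0.5
        term1 = np.sqrt(np.log(2.0 / delta) / max(c["n"], 1))  # confidence width (assumed form)
        term2 = L * (c["hi"] - c["lo"]) ** beta                 # smoothness bias term (assumed form)
        return eta_hat - term1 - term2, eta_hat + term1 + term2, term1, term2

    for _ in range(n_rounds):
        # Move cells whose interval clears lambda or 1 - lambda into the classified partition.
        for c in cells:
            if c["decision"] is None:
                l, u, _, _ = bounds(c)
                if u < lam:
                    c["decision"] = 0
                elif l > 1 - lam:
                    c["decision"] = 1
        unclassified = [c for c in cells if c["decision"] is None]
        if not unclassified:
            break
        # E_t: the unclassified cell with the widest interval u_t - l_t.
        E = max(unclassified, key=lambda c: bounds(c)[1] - bounds(c)[0])
        l, u, term1, term2 = bounds(E)
        if term1 < term2:
            # Refine: split E into two child cells.
            mid = 0.5 * (E["lo"] + E["hi"])
            cells.remove(E)
            cells.append(dict(lo=E["lo"], hi=mid, n=0, s=0.0, decision=None))
            cells.append(dict(lo=mid, hi=E["hi"], n=0, s=0.0, decision=None))
        else:
            # Query: request a label at a uniformly drawn point of E.
            x = rng.uniform(E["lo"], E["hi"])
            E["n"] += 1
            E["s"] += query_label(x)

    def classifier(x):
        # Classifier (1): use the decision of the cell containing x; abstain otherwise.
        for c in cells:
            if c["lo"] <= x <= c["hi"]:
                return ABSTAIN if c["decision"] is None else c["decision"]
        return ABSTAIN

    return classifier
```

Here `query_label` would wrap the labelling oracle, e.g. `lambda x: int(rng.random() < true_eta(x))` for a simulated regression function; cells that remain unclassified when the budget runs out are exactly those on which the returned classifier abstains.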


Bound on Excess Risk

Theorem

Under the margin and smoothness assumptions, and for n large enough, the classifier ĝ_n defined by (1) satisfies

E[R_λ(ĝ_n) − R_λ(g*_λ)] = Õ( n^{−β(1+α₀)/(2β+D̃)} ).

Conversely, for any algorithm we have the lower bound

sup_{P ∈ P(α₀, β)} E[R_λ(g) − R_λ(g*_λ)] = Ω( n^{−β(1+α₀)/(2β+D)} ).

Here D̃ ≤ D is a measure of the dimensionality of X near the classifier boundaries {x : η(x) ∈ {λ, 1 − λ}}; it depends on the parameters of both the (MA) and (HÖ) assumptions, and there exist cases with D̃ < D as well as cases with D̃ = D.
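As a concrete check of how these exponents behave (the parameter values are illustrative, not taken from the slides): with β = 1, α₀ = 1, D = 2 and D̃ = D, the upper bound becomes Õ(n^{−2/(2+2)}) = Õ(n^{−1/2}) while the lower bound is Ω(n^{−2/(2+2)}) = Ω(n^{−1/2}), so the two rates match up to logarithmic factors whenever D̃ = D; a smaller D̃ only improves the upper bound.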


Lower Bound: Key Inequality

Lemma

Suppose g = (G_0, G_1, G_∆) is a classifier constructed by an algorithm A, and g*_λ = (G*_0, G*_1, G*_∆). Then, for any λ ∈ (0, 1/2), the excess risk E(A, P_XY, n) := R_λ(g) − R_λ(g*_λ) satisfies

R_λ(g) − R_λ(g*_λ) ≥ c · P_X( (G*_∆ \ G_∆) ∪ (G_∆ \ G*_∆) )^{(1+α₀)/α₀}.

The rest of the proof relies on a reduction to multiple hypothesis testing. This lemma suggests an appropriate pseudo-metric and motivates the construction of the hypothesis set.


Improvement Over Passive

Corollary

For passive algorithms A_p, we can obtain the following lower bound on the excess risk E:

inf_{A_p} sup_{P ∈ P(α₀, β)} E[E(A_p, P, n)] = Ω( n^{−β(1+α₀)/(D+2β+α₀β)} ).

This rate is slower than the rate Õ( n^{−β(1+α₀)/(D+2β)} ) achieved by our active algorithm, for all D, α₀, β. In particular, the gain due to active learning increases with (i) decreasing D, (ii) increasing α₀, and (iii) increasing β. Under an additional strong density assumption, further improvements are possible.
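A tiny helper (our own, not from the paper) that evaluates the two exponents from the bounds above; larger exponent means a faster rate, and the comparison in the plots that follow can be reproduced by sweeping D, α₀, or β:

```python
def rate_exponents(D, alpha0, beta):
    """Exponents of n in the active upper bound and the passive lower bound."""
    active = beta * (1 + alpha0) / (D + 2 * beta)
    passive = beta * (1 + alpha0) / (D + 2 * beta + alpha0 * beta)
    return active, passive

# e.g., alpha0 = 1, beta = 0.5, D = 5 (as in the later plots):
# active exponent = 1/6 ~ 0.167, passive exponent = 1/6.5 ~ 0.154
print(rate_exponents(D=5, alpha0=1, beta=0.5))
```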

Improvement over passive: α₀ = 1, β = 0.5, D ∈ {1, 2, . . . , 50}

[Plot: lim_{n→∞} (log E)/(log n) versus the dimension D.]

Improvement over passive: α₀ ∈ [1, 10], β = 0.5, D = 5

[Plot: lim_{n→∞} (log E)/(log n) versus the margin parameter α₀.]

Improvement over passive: α₀ = 1, β ∈ [0.1, 1], D = 5

[Plot: lim_{n→∞} (log E)/(log n) versus the smoothness parameter β.]


Extensions

We also study the following extensions:

Adaptivity to the (HÖ) parameter: Under an additional quality assumption, we present a modification of our algorithm which automatically adapts to the smoothness β.

Bounded rate of abstention: the constrained formulation

min_g P_XY(g(X) ≠ Y, g(X) ≠ ∆)  subject to  P_X(g(X) = ∆) ≤ δ.

We propose and analyze an algorithm for this problem which
• works in a semi-supervised setting,
• uses the fixed-cost algorithm as a building block,
• uses unlabeled samples to ensure that the constraint is satisfied w.h.p. (see the sketch below).
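The bounded-rate extension uses unlabeled data to decide how aggressively the classifier may abstain. A minimal sketch of one way such a calibration step could look (our illustration with a plug-in estimate of η; it is not the paper's semi-supervised algorithm, which also accounts for estimation error so the constraint holds w.h.p.):

```python
import numpy as np

def calibrate_lambda(eta_hat, X_unlabeled, delta_rate, grid=None):
    """Hypothetical helper: choose the smallest abstention cost lambda whose
    induced abstention region {x : lambda < eta_hat(x) < 1 - lambda} covers
    at most a delta_rate fraction of the unlabeled sample."""
    if grid is None:
        grid = np.linspace(0.01, 0.49, 49)
    eta = np.asarray([eta_hat(x) for x in X_unlabeled])
    for lam in grid:  # increasing lambda shrinks the abstention region
        if np.mean((eta > lam) & (eta < 1 - lam)) <= delta_rate:
            return lam  # smallest lambda meeting the abstention budget
    return grid[-1]     # fall back to the smallest abstention region on the grid
```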


Summary

Questions

1. Performance limits of learning algorithms in both settings? ✓
   Active: Ω( n^{−β(1+α₀)/(D+2β)} );  Passive: Ω( n^{−β(1+α₀)/(D+2β+α₀β)} )
2. Design an active learning algorithm matching the performance limits. ✓
3. Characterize the performance gain achievable in the active over the passive setting? ✓
   Õ( n^{−β(1+α₀)/(D̃+2β)} ) vs. Ω( n^{−β(1+α₀)/(D+2β+α₀β)} )
4. Computationally efficient active algorithms via convex surrogates. ?
5. Active learning for neural networks with abstention. ?