Learning from Corrupted Binary Labels via Class-Probability Estimation - PowerPoint PPT Presentation



SLIDE 1

Learning from Corrupted Binary Labels via Class-Probability Estimation

Aditya Krishna Menon, Brendan van Rooyen, Cheng Soon Ong, Robert C. Williamson

National ICT Australia and The Australian National University

SLIDE 2

Learning from binary labels

[Figure: instances labelled positive (+) and negative (−)]

SLIDE 3

Learning from binary labels

[Figure: instances labelled positive (+) and negative (−), with one unlabelled instance (?)]

SLIDE 4

Learning from binary labels

[Figure: instances labelled positive (+) and negative (−)]

SLIDE 5

Learning from noisy labels

[Figure: instances labelled positive (+) and negative (−)]

SLIDE 6

Learning from positive and unlabelled data

[Figure: two instances labelled positive (+), the rest unlabelled (?)]

SLIDE 7

Learning from binary labels

[Figure: instances labelled + and −]

[Diagram: nature → learner]

S ~ D^n

Goal: good classification wrt distribution D

SLIDE 8

Learning from corrupted labels

[Figure: instances labelled + and −]

[Diagram: nature → corruptor → learner]

S ~ D^n, corrupted to S̄ ~ D̄^n

Goal: good classification wrt (unobserved) distribution D

SLIDE 9

Paper summary

Can we learn a good classifier from corrupted samples?

SLIDE 10

Paper summary

Can we learn a good classifier from corrupted samples?

Prior work: in special cases (with a rich enough model), yes!

SLIDE 11

Paper summary

Can we learn a good classifier from corrupted samples?

Prior work: in special cases (with a rich enough model), yes!

    can treat samples as if uncorrupted! (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014) ...

SLIDE 12

Paper summary

Can we learn a good classifier from corrupted samples?

Prior work: in special cases (with a rich enough model), yes!

    can treat samples as if uncorrupted! (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014) ...

This work: unified treatment via class-probability estimation

    analysis for a general class of corruptions

SLIDE 13

Assumed corruption model

SLIDE 14

Learning from binary labels: distributions

Fix instance space X (e.g. R^N)
Underlying distribution D over X × {±1}
Constituent components of D:

    (P(x), Q(x), π) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1])

SLIDE 15

Learning from binary labels: distributions

Fix instance space X (e.g. R^N)
Underlying distribution D over X × {±1}
Constituent components of D:

    (P(x), Q(x), π) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1])
    (M(x), η(x)) = (P[X = x], P[Y = 1 | X = x])

SLIDE 16

Learning from corrupted binary labels

[Diagram: nature → corruptor → learner]

S ~ D^n, corrupted to S̄ ~ D̄^n

Samples from corrupted distribution D̄ = (P̄, Q̄, π̄)
Goal: good classification wrt (unobserved) distribution D

SLIDE 17

Learning from corrupted binary labels

[Diagram: nature → corruptor → learner]

S ~ D^n, corrupted to S̄ ~ D̄^n

Samples from corrupted distribution D̄ = (P̄, Q̄, π̄), where

    P̄ = (1 − α)·P + α·Q
    Q̄ = β·P + (1 − β)·Q

and π̄ is arbitrary; α, β are noise rates

    mutually contaminated distributions (Scott et al., 2013)

Goal: good classification wrt (unobserved) distribution D
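The corruption model above is easy to state as a sampling procedure. A minimal sketch: the function and helper names (`sample_contaminated`, `sample_p`, `sample_q`) are mine, and the 1-D Gaussian class-conditionals are an illustrative choice, not from the slides.

```python
import random

def sample_contaminated(n, alpha, beta, pi_bar, sample_p, sample_q):
    """Draw n samples (x, ybar) from the corrupted distribution
    (Pbar, Qbar, pi_bar), where
        Pbar = (1 - alpha)*P + alpha*Q
        Qbar = beta*P + (1 - beta)*Q.
    sample_p / sample_q each draw one instance from P / Q."""
    data = []
    for _ in range(n):
        ybar = +1 if random.random() < pi_bar else -1
        if ybar == +1:
            # corrupted positive: truly from P w.p. 1 - alpha, else from Q
            x = sample_p() if random.random() < 1 - alpha else sample_q()
        else:
            # corrupted negative: truly from P w.p. beta, else from Q
            x = sample_p() if random.random() < beta else sample_q()
        data.append((x, ybar))
    return data

# Illustrative class-conditionals: 1-D Gaussians centred at +1 and -1.
random.seed(0)
P = lambda: random.gauss(+1.0, 1.0)
Q = lambda: random.gauss(-1.0, 1.0)
S_corr = sample_contaminated(10000, alpha=0.2, beta=0.1, pi_bar=0.5,
                             sample_p=P, sample_q=Q)
pos = [x for x, y in S_corr if y == +1]
# Corrupted positives have mean (1-alpha)*(+1) + alpha*(-1) = 0.6 in expectation.
print(sum(pos) / len(pos))
```

The corrupted positives are visibly a mixture of the two clean classes, which is exactly what the mutually contaminated model asserts.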

SLIDE 18

Special cases

Label noise (labels flipped w.p. ρ):

    π̄ = (1 − 2ρ)·π + ρ
    α = π̄⁻¹·(1 − π)·ρ
    β = (1 − π̄)⁻¹·π·ρ

PU learning (observe M instead of Q):

    π̄ = arbitrary
    P̄ = 1·P + 0·Q
    Q̄ = M = π·P + (1 − π)·Q

[Figures: noisily labelled sample (+/−); positive and unlabelled sample (+/?)]

SLIDE 19

Corrupted class-probabilities

Structure of corrupted class-probabilities underpins analysis

SLIDE 20

Corrupted class-probabilities

Structure of corrupted class-probabilities underpins analysis

Proposition

For any D, D̄,

    η̄(x) = φ_{α,β,π̄}(η(x))

where φ_{α,β,π̄} is strictly monotone for fixed α, β, π̄.

SLIDE 21

Corrupted class-probabilities

Structure of corrupted class-probabilities underpins analysis

Proposition

For any D, D̄,

    η̄(x) = φ_{α,β,π̄}(η(x))

where φ_{α,β,π̄} is strictly monotone for fixed α, β, π̄.

Follows from Bayes' rule:

    η̄(x) / (1 − η̄(x)) = π̄/(1 − π̄) · P̄(x)/Q̄(x)

SLIDE 22

Corrupted class-probabilities

Structure of corrupted class-probabilities underpins analysis

Proposition

For any D, D̄,

    η̄(x) = φ_{α,β,π̄}(η(x))

where φ_{α,β,π̄} is strictly monotone for fixed α, β, π̄.

Follows from Bayes' rule:

    η̄(x) / (1 − η̄(x)) = π̄/(1 − π̄) · P̄(x)/Q̄(x)
                       = π̄/(1 − π̄) · [(1 − α)·P(x)/Q(x) + α] / [β·P(x)/Q(x) + (1 − β)].
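The odds identity above pins down φ_{α,β,π̄} completely. A small sketch (the function name `phi` is mine; parameter values are arbitrary illustrative choices) that computes η̄ from η and checks strict monotonicity on a grid, assuming α + β < 1:

```python
def phi(eta, alpha, beta, pi, pi_bar):
    """Corrupted class-probability eta_bar = phi_{alpha,beta,pi_bar}(eta).
    Assumes alpha + beta < 1 and 0 < eta < 1."""
    # Recover the likelihood ratio r = P(x)/Q(x) from the clean eta via
    # Bayes' rule: eta/(1-eta) = pi/(1-pi) * r.
    r = (eta / (1 - eta)) * ((1 - pi) / pi)
    # Corrupted odds, per the derivation on the slide.
    odds = (pi_bar / (1 - pi_bar)) * ((1 - alpha) * r + alpha) \
           / (beta * r + (1 - beta))
    return odds / (1 + odds)

# phi is strictly increasing in eta (checked on a grid).
etas = [i / 100 for i in range(1, 100)]
vals = [phi(e, alpha=0.2, beta=0.1, pi=0.4, pi_bar=0.5) for e in etas]
print(all(a < b for a, b in zip(vals, vals[1:])))  # → True
```

Monotonicity holds because the corrupted odds have derivative proportional to (1 − α)(1 − β) − αβ = 1 − α − β > 0 in the likelihood ratio r.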

SLIDE 23

Corrupted class-probabilities: special cases

Label noise:

    η̄(x) = (1 − 2ρ)·η(x) + ρ
    ρ unknown
    (Natarajan et al., 2013)

PU learning:

    η̄(x) = π̄·η(x) / (π̄·η(x) + (1 − π̄)·π)
    π unknown
    (Ward et al., 2009)

SLIDE 24

Roadmap

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂, kernel logistic regression) → classifier]

SLIDE 25

Roadmap

Exploit the monotone relationship between η and η̄

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂, kernel logistic regression) → classifier (?)]

SLIDE 26

Classification with noise rates

SLIDE 27

Class-probabilities and classification

Many classification measures optimised by sign(η(x) − t):

    0-1 error → t = 1/2
    Balanced error → t = π
    F-score → optimal t depends on D
        (Lipton et al., 2014), (Koyejo et al., 2014)

SLIDE 28

Class-probabilities and classification

Many classification measures optimised by sign(η(x) − t):

    0-1 error → t = 1/2
    Balanced error → t = π
    F-score → optimal t depends on D
        (Lipton et al., 2014), (Koyejo et al., 2014)

We can relate this to thresholding of η̄!

SLIDE 29

Corrupted class-probabilities and classification

By the monotone relationship,

    η(x) > t  ⟺  η̄(x) > φ_{α,β,π̄}(t).

Threshold η̄ at φ_{α,β,π̄}(t) → optimal classification on D
Can translate into a regret bound, e.g. for 0-1 loss
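The threshold correspondence can be verified numerically. A sketch, with φ restated from the earlier slides so the snippet is self-contained (parameter values are arbitrary illustrative choices):

```python
def phi(eta, alpha, beta, pi, pi_bar):
    # eta_bar = phi_{alpha,beta,pi_bar}(eta), via the odds derivation;
    # assumes alpha + beta < 1 and 0 < eta < 1.
    r = (eta / (1 - eta)) * ((1 - pi) / pi)
    odds = (pi_bar / (1 - pi_bar)) * ((1 - alpha) * r + alpha) \
           / (beta * r + (1 - beta))
    return odds / (1 + odds)

params = dict(alpha=0.25, beta=0.1, pi=0.4, pi_bar=0.55)
t = 0.5                     # clean threshold, e.g. for 0-1 error
t_corr = phi(t, **params)   # corrected threshold for corrupted probabilities

# Thresholding the clean eta at t agrees with thresholding the corrupted
# phi(eta) at phi(t), on a grid of clean class-probabilities.
agree = all((e > t) == (phi(e, **params) > t_corr)
            for e in [i / 1000 for i in range(1, 1000)])
print(agree)  # → True
```

This is exactly why a class-probability estimator trained on corrupted data suffices: only the threshold needs correcting.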

SLIDE 30

Story so far

Classification scheme requires: η̄, t, α, β, π̄

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂) → classifier sign(η̂(x) − φ_{α̂,β̂,π̂}(t)); a noise oracle supplies α̂, β̂, π̂]

SLIDE 31

Story so far

Classification scheme requires:

    η̄ → class-probability estimation
    t
    α, β, π̄

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂, kernel logistic regression) → classifier sign(η̂(x) − φ_{α̂,β̂,π̂}(t)); a noise oracle supplies α̂, β̂, π̂]

SLIDE 32

Story so far

Classification scheme requires:

    η̄ → class-probability estimation
    t → if unknown, alternate approach (see poster)
    α, β, π̄

[Diagram: as above]

SLIDE 33

Story so far

Classification scheme requires:

    η̄ → class-probability estimation
    t → if unknown, alternate approach (see poster)
    α, β, π̄ → can we estimate these?

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂, kernel logistic regression) → classifier sign(η̂(x) − φ_{α̂,β̂,π̂}(t)); noise estimator supplying α̂, β̂, π̂: ?]

SLIDE 34

Estimating noise rates: some bad news

π strongly non-identifiable!

    π̄ allowed to be arbitrary (e.g. PU learning)

α, β non-identifiable without assumptions (Scott et al., 2013)

Can we estimate α, β under assumptions?

SLIDE 35

Weak separability assumption

Assume that D is "weakly separable":

    min_{x∈X} η(x) = 0
    max_{x∈X} η(x) = 1

i.e. ∃ deterministically +'ve and −'ve instances
weaker than full separability

SLIDE 36

Weak separability assumption

Assume that D is "weakly separable":

    min_{x∈X} η(x) = 0
    max_{x∈X} η(x) = 1

i.e. ∃ deterministically +'ve and −'ve instances
weaker than full separability

Assumed range of η constrains observed range of η̄!

SLIDE 37

Estimating noise rates

Proposition

Pick any weakly separable D. Then, for any D̄,

    α = η̄_min·(η̄_max − π̄) / (π̄·(η̄_max − η̄_min))
    β = (1 − η̄_max)·(π̄ − η̄_min) / ((1 − π̄)·(η̄_max − η̄_min))

where

    η̄_min = min_{x∈X} η̄(x)
    η̄_max = max_{x∈X} η̄(x)

α, β can be estimated from corrupted data alone
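The estimators in the proposition are straightforward to implement once η̄_min, η̄_max, and π̄ are known (or estimated from the range of η̂). A sketch (the function name is mine), sanity-checked against the symmetric label-noise case of slide 18, where η̄_min = ρ and η̄_max = 1 − ρ:

```python
def estimate_noise_rates(eta_bar_min, eta_bar_max, pi_bar):
    """alpha, beta from the proposition; assumes D is weakly separable,
    so that eta_bar_min / eta_bar_max are attained corrupted
    class-probabilities."""
    spread = eta_bar_max - eta_bar_min
    alpha = eta_bar_min * (eta_bar_max - pi_bar) / (pi_bar * spread)
    beta = (1 - eta_bar_max) * (pi_bar - eta_bar_min) / ((1 - pi_bar) * spread)
    return alpha, beta

# Symmetric label noise with flip rate rho and clean base rate pi:
# eta_bar = (1 - 2 rho) eta + rho, so eta_bar_min = rho,
# eta_bar_max = 1 - rho, and pi_bar = (1 - 2 rho) pi + rho.
pi, rho = 0.3, 0.2
pi_bar = (1 - 2 * rho) * pi + rho
a, b = estimate_noise_rates(rho, 1 - rho, pi_bar)
# Slide 18 gives alpha = rho (1 - pi) / pi_bar, beta = rho pi / (1 - pi_bar).
print(abs(a - rho * (1 - pi) / pi_bar) < 1e-12,
      abs(b - rho * pi / (1 - pi_bar)) < 1e-12)  # → True True
```

In practice η̄_min and η̄_max are replaced by the extremes of the estimated η̂ over the corrupted sample, which is why reliable class-probability estimation matters.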

SLIDE 38

Estimating noise rates: special cases

Label noise:

    ρ = 1 − η̄_max = η̄_min
    π = (π̄ − η̄_min) / (η̄_max − η̄_min)

PU learning:

    α = 0
    β = π = (1 − η̄_max)/η̄_max · π̄/(1 − π̄)

(Elkan and Noto, 2008), (Liu and Tao, 2014)
c.f. mixture proportion estimate of (Scott et al., 2013)
In these cases, π can be estimated as well

SLIDE 39

Story so far

Optimal classification in general requires α, β, π̄

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂, kernel logistic regression) → classifier sign(η̂(x) − φ_{α̂,β̂,π̂}(t)); noise estimator obtains α̂, β̂, π̂ from the range of η̂]

SLIDE 40

Story so far

Optimal classification in general requires α, β, π̄

    when does φ_{α,β,π̄}(t) not depend on α, β, π̄?

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂, kernel logistic regression) → classifier sign(η̂(x) − φ_{α̂,β̂,π̂}(t)); noise estimator obtains α̂, β̂, π̂ from the range of η̂]

SLIDE 41

Classification without noise rates

SLIDE 42

Balanced error (BER) of classifier

Balanced error (BER) of a classifier f : X → {±1} is:

    BER_D(f) = (FPR_D(f) + FNR_D(f)) / 2

for false positive and negative rates FPR_D(f), FNR_D(f)

    average classification performance on each class
    optimal classifier is sign(η(x) − π)

SLIDE 43

BER "immunity" under corruption

Proposition (c.f. (Zhang and Lee, 2008))

For any D, D̄, and classifier f : X → {±1},

    BER_D̄(f) = (1 − α − β)·BER_D(f) + (α + β)/2

SLIDE 44

BER "immunity" under corruption

Proposition (c.f. (Zhang and Lee, 2008))

For any D, D̄, and classifier f : X → {±1},

    BER_D̄(f) = (1 − α − β)·BER_D(f) + (α + β)/2

BER-optimal classifiers on clean and corrupted distributions coincide:

    sign(η(x) − π) = sign(η̄(x) − π̄)

SLIDE 45

BER "immunity" under corruption

Proposition (c.f. (Zhang and Lee, 2008))

For any D, D̄, and classifier f : X → {±1},

    BER_D̄(f) = (1 − α − β)·BER_D(f) + (α + β)/2

BER-optimal classifiers on clean and corrupted distributions coincide:

    sign(η(x) − π) = sign(η̄(x) − π̄)

Minimise clean BER → don't need to know corruption rates!

    threshold on η̄ does not need α, β, π̄
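The proposition can be checked at the population level directly from the corruption model of slide 17; the class-wise error rates and noise rates below are arbitrary illustrative values, not from the slides.

```python
# Pick clean class-wise error rates for some classifier f, and noise
# rates with alpha + beta < 1.
alpha, beta = 0.2, 0.1
fpr, fnr = 0.15, 0.25  # FPR_D(f), FNR_D(f)

# From the corruption model Pbar = (1-alpha) P + alpha Q and
# Qbar = beta P + (1-beta) Q:
#   FNR_Dbar(f) = Pbar(f = -1) = (1 - alpha) FNR_D + alpha (1 - FPR_D)
#   FPR_Dbar(f) = Qbar(f = +1) = beta (1 - FNR_D) + (1 - beta) FPR_D
fnr_bar = (1 - alpha) * fnr + alpha * (1 - fpr)
fpr_bar = beta * (1 - fnr) + (1 - beta) * fpr

ber_corrupted = (fpr_bar + fnr_bar) / 2
ber_affine = (1 - alpha - beta) * (fpr + fnr) / 2 + (alpha + beta) / 2
print(ber_corrupted, ber_affine)  # the two quantities agree
```

Since the corrupted BER is an increasing affine transform of the clean BER, ranking classifiers by corrupted BER ranks them by clean BER, with no knowledge of α or β needed.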

SLIDE 46

BER "immunity" & class-probability estimation

Trivially, we also have

    regret^BER_D(f) = (1 − α − β)⁻¹ · regret^BER_D̄(f).

i.e. good corrupted BER ⟹ good clean BER

    can make regret^BER_D̄(f) → 0 by class-probability estimation

Similar result for AUC (see poster)

SLIDE 47

BER "immunity" under corruption: proof

From (Scott et al., 2013),

    [FPR_D̄(f), FNR_D̄(f)] = [FPR_D(f), FNR_D(f)] · [1−β, −α; −β, 1−α] + [β, α],

SLIDE 48

BER "immunity" under corruption: proof

From (Scott et al., 2013),

    [FPR_D̄(f), FNR_D̄(f)] = [FPR_D(f), FNR_D(f)] · [1−β, −α; −β, 1−α] + [β, α],

and

    (1, 1)^T is an eigenvector of [1−β, −α; −β, 1−α], with eigenvalue 1 − α − β
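The eigenvector claim is a two-line check; the matrix below is written in the row-vector convention of the proof, and the noise rates are arbitrary illustrative values.

```python
alpha, beta = 0.2, 0.1

# Corruption matrix from the proof, with alpha + beta < 1.
M = [[1 - beta, -alpha],
     [-beta, 1 - alpha]]

# Applying M to (1, 1)^T sums each row; both sums equal 1 - alpha - beta,
# so (1, 1)^T is an eigenvector with eigenvalue 1 - alpha - beta.
v = [sum(row) for row in M]
print(v)
```

Summing FPR and FNR projects onto this eigenvector, which is why BER, and only BER among class-wise-rate averages of this form, transforms affinely under corruption.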
SLIDE 49

Are other measures "immune"?

BER is the only (non-trivial) performance measure for which:

    corrupted risk = affine transform of clean risk
        because of the eigenvector interpretation
    corrupted threshold is independent of α, β, π̄
        because of the nature of φ_{α,β,π̄}

    (see poster)

Other performance measures → need (one of) α, β, π̄

SLIDE 50

Experiments

SLIDE 51

Experimental setup

Injected label noise on UCI datasets
Estimate corrupted class-probabilities via neural network

    well-specified if D is linearly separable:

        η(x) = σ(⟨w, x⟩) ⟹ η̄(x) = a·σ(⟨w, x⟩) + b

Evaluate:

    reliability of noise estimates
    BER performance on clean test set
        corrupted data used for training and validation
    0-1 performance on clean test set (see poster)

SLIDE 52

Experimental results: noise rates

Estimated noise rates are generally reliable

[Figure: bias of noise estimate (mean and median) vs ground-truth noise rate in {0.1, 0.2, 0.3, 0.4, 0.49}, on segment, spambase, and mnist]

SLIDE 53

Experimental results: BER immunity

Generally, low observed degradation in BER

Dataset    Noise                     1 − AUC (%)     BER (%)
segment    None                      0.00 ± 0.00     0.00 ± 0.00
           (ρ+, ρ−) = (0.1, 0.0)     0.00 ± 0.00     0.01 ± 0.00
           (ρ+, ρ−) = (0.1, 0.2)     0.02 ± 0.01     0.90 ± 0.08
           (ρ+, ρ−) = (0.2, 0.4)     0.03 ± 0.01     3.24 ± 0.20
spambase   None                      2.49 ± 0.00     6.93 ± 0.00
           (ρ+, ρ−) = (0.1, 0.0)     2.67 ± 0.02     7.10 ± 0.03
           (ρ+, ρ−) = (0.1, 0.2)     3.01 ± 0.03     7.66 ± 0.05
           (ρ+, ρ−) = (0.2, 0.4)     4.91 ± 0.09     10.52 ± 0.13
mnist      None                      0.92 ± 0.00     3.63 ± 0.00
           (ρ+, ρ−) = (0.1, 0.0)     0.95 ± 0.01     3.56 ± 0.01
           (ρ+, ρ−) = (0.1, 0.2)     0.97 ± 0.01     3.63 ± 0.02
           (ρ+, ρ−) = (0.2, 0.4)     1.17 ± 0.02     4.06 ± 0.03

SLIDE 54

Conclusion

SLIDE 55

Learning from corrupted binary labels

Monotone relationship η̄(x) = φ_{α,β,π̄}(η(x)) facilitates:

[Diagram: nature (D) → corruptor (D̄) → class-prob estimator (η̂, kernel logistic regression) → classifier sign(η̂(x) − φ_{α̂,β̂,π̂}(t)); noise estimator obtains α̂, β̂, π̂ from the range of η̂ (omit for BER)]

SLIDE 56

Future work

Better noise estimators in special cases?

    c.f. (Elkan and Noto, 2008) when D separable

Fusion with the "loss transfer" approach (Natarajan et al., 2013)

    assumes noise rates known; better for misspecified models?
        c.f. non-robustness of convex surrogate minimisation

SLIDE 57

Thanks!

Drop by the poster for more (Paper ID 69)