Loss factorization, weakly supervised learning and label noise robustness


SLIDE 1

Loss factorization, weakly supervised learning and label noise robustness

  • Giorgio Patrini, Frank Nielsen, Richard Nock, Marcello Carioni

Australian National University, Data61 (ex NICTA), Ecole Polytechnique, Sony CS Labs, Max Planck Institute of Mathematics in the Sciences

SLIDE 2

In 1 slide

  • Loss functions factor, and so do their risks, isolating a sufficient statistic for the labels, µ.

[Figure: the logistic loss log(1 + e^{−x}) decomposed into its even part ℓ_e(x) and odd part ℓ_o(x)]

$$\ell(x) = \ell_e(x) + \ell_o(x)$$

SLIDE 3

In 1 slide

  • Loss functions factor (previous slide): $\ell(x) = \ell_e(x) + \ell_o(x)$, isolating a sufficient statistic µ for the labels.
  • Weakly supervised learning: (1) estimate µ and (2) plug it into ℓ, then call standard algorithms. E.g., SGD:

$$\theta_{t+1} \leftarrow \theta_t - \eta\,\nabla\ell(\pm\langle\theta_t, x_i\rangle) - \tfrac{1}{2}\eta a \mu$$

SLIDE 4

In 1 slide

  • For asymmetric label noise with rates p₊, p₋, an unbiased estimator of µ (to use in step (1) above) is

$$\hat\mu \doteq \mathbb{E}_{(x,\tilde y)}\!\left[\frac{\tilde y - (p_- - p_+)}{1 - p_- - p_+}\, x\right]$$
SLIDE 5
Preliminary

  • Binary classification: $S = \{(x_i, y_i),\ i \in [m]\}$ sampled from $D$ over $\mathbb{R}^d \times \{-1, 1\}$, where $[m] \doteq \{1, \dots, m\}$
  • Learn a linear (or kernel) model $h \in H$
  • Minimize the empirical risk associated with a surrogate loss $\ell$:

$$\operatorname*{argmin}_{h \in H} \mathbb{E}_S[\ell(y\,h(x))] = \operatorname*{argmin}_{h \in H} R_{S,\ell}(h)$$
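To make the setup concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper) of empirical risk minimization with a logistic surrogate and a linear model, by plain gradient descent:

```python
import numpy as np
from scipy.special import expit  # sigmoid

def logistic_loss(z):
    # surrogate loss l(z) = log(1 + exp(-z)), numerically stable
    return np.logaddexp(0.0, -z)

def empirical_risk(theta, X, y):
    # R_{S,l}(theta) = (1/m) * sum_i l(y_i <theta, x_i>)
    return logistic_loss(y * (X @ theta)).mean()

def erm(X, y, lr=0.1, iters=500):
    # argmin over linear models h(x) = <theta, x> of the empirical surrogate risk
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = y * (X @ theta)
        grad = -(X * (y * expit(-z))[:, None]).mean(axis=0)  # gradient of the risk in theta
        theta -= lr * grad
    return theta
```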

SLIDE 6

Mean operator & linear-odd losses

  • Mean operator:

$$\mu_S \doteq \mathbb{E}_S[y\,x] = \frac{1}{m}\sum_{i=1}^{m} y_i x_i$$

SLIDE 7

Mean operator & linear-odd losses

  • Mean operator:

$$\mu_S \doteq \mathbb{E}_S[y\,x] = \frac{1}{m}\sum_{i=1}^{m} y_i x_i$$

  • a-linear-odd loss (a-lol): the odd part of ℓ is linear in its (generic) argument x,

$$\tfrac{1}{2}\big(\ell(x) - \ell(-x)\big) = \ell_o(x) = a x, \quad \text{for any } a \in \mathbb{R}$$
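A small NumPy sketch (illustrative names, not from the paper) of the two objects above: the mean operator, and a numerical check that the logistic loss is linear-odd with a = −1/2:

```python
import numpy as np

def mean_operator(X, y):
    # mu_S = E_S[y x] = (1/m) * sum_i y_i x_i -- the only label-dependent statistic needed later
    return (y[:, None] * X).mean(axis=0)

def odd_part(loss, x):
    # l_o(x) = (l(x) - l(-x)) / 2; equals a*x for an a-linear-odd loss
    return 0.5 * (loss(x) - loss(-x))

logistic = lambda z: np.logaddexp(0.0, -z)   # log(1 + e^{-z})
xs = np.linspace(-3.0, 3.0, 7)
print(np.allclose(odd_part(logistic, xs), -xs / 2))   # True: a = -1/2 for the logistic loss
```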

SLIDE 8

Loss factorization

  • Linear model $h$
  • Linear-odd loss: $\tfrac{1}{2}\big(\ell(x) - \ell(-x)\big) = \ell_o(x) = ax$ (neither smoothness nor convexity required)
  • Define the “double sample”:

$$S_{2x} \doteq \{(x_i, \sigma),\ i \in [m],\ \sigma \in \{\pm 1\}\}$$

SLIDE 9

Loss factorization

  • Linear model $h$
  • Linear-odd loss: $\tfrac{1}{2}\big(\ell(x) - \ell(-x)\big) = \ell_o(x) = ax$ (neither smoothness nor convexity required)
  • Define the “double sample”: $S_{2x} \doteq \{(x_i, \sigma),\ i \in [m],\ \sigma \in \{\pm 1\}\}$

Then the empirical risk factors as

$$R_{S,\ell}(h) = \tfrac{1}{2} R_{S_{2x},\ell}(h) + a \cdot h(\mu_S)$$
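The factorization is easy to check numerically. A sketch (illustrative NumPy; logistic loss, so a = −1/2; it assumes the double-sample risk sums over the 2m sign-augmented points and divides by m, which makes the identity above exact):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, a = 200, 5, -0.5                       # a = -1/2 for the logistic loss
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
theta = rng.normal(size=d)                   # any linear model h(x) = <theta, x>

logistic = lambda z: np.logaddexp(0.0, -z)
z = X @ theta

lhs = logistic(y * z).mean()                   # R_{S,l}(h)
risk_2x = (logistic(z) + logistic(-z)).mean()  # R_{S2x,l}(h): sum over the 2m points, divided by m
mu_S = (y[:, None] * X).mean(axis=0)           # mean operator
rhs = 0.5 * risk_2x + a * (theta @ mu_S)

print(np.isclose(lhs, rhs))   # True -- the labels enter the risk only through mu_S
```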

SLIDE 10

Loss factorization: proof

$$\begin{aligned}
R_{S,\ell}(h) &= \mathbb{E}_S\big[\ell(y\,h(x))\big] \\
&= \tfrac{1}{2}\,\mathbb{E}_S\big[\ell(y\,h(x)) + \ell(-y\,h(x)) + \ell(y\,h(x)) - \ell(-y\,h(x))\big] \\
&= \tfrac{1}{2}\,\mathbb{E}_{S_{2x}}\big[\ell(h(x))\big] + \mathbb{E}_S\big[\ell_o(h(y\,x))\big] \\
&= \tfrac{1}{2} R_{S_{2x},\ell}(h) + a \cdot h(\mu_S)
\end{aligned}$$

SLIDE 11

Loss factorization: proof

  • The second line of the derivation (previous slide) splits the loss into its even and odd parts.


SLIDE 13

Loss factorization: proof

  • The last steps use the linearity of ℓ_o and of h, and show the sufficiency of µ for the labels y.

SLIDE 14

Linear-odd losses: examples

  • Logistic loss & exponential family:

$$\sum_{i=1}^{m} \log \sum_{y \in \mathcal{Y}} e^{y\langle\theta, x_i\rangle} - \langle\theta, \mu\rangle = \sum_{i=1}^{m} \log\!\left(1 + e^{-2 y_i \langle\theta, x_i\rangle}\right)$$

SLIDE 15

Linear-odd losses: examples

  • Logistic loss & exponential family (identity on the previous slide).
  • Examples of linear-odd losses and their odd terms:

loss                          ℓ(x)                         odd term ℓ_o(x)
a-lol (generic)               ℓ(x)                         ax
ρ-loss (ρ ≥ 0)                ρ|x| − ρx + 1                −ρx
unhinged                      1 − x                        −x
perceptron                    max(0, −x)                   −x
double-hinge                  max(−x, ½ max(0, 1 − x))     −x
spl (symmetric proper loss)   a_ℓ + ℓ*(−x)/b_ℓ             −x/(2b_ℓ)
logistic                      log(1 + e^{−x})              −x/2
square                        (1 − x)²                     −2x
Matsushita                    √(1 + x²) − x                −x
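The logistic row connects directly to the exponential-family identity above; a short check (assuming, as the identity suggests, that µ here denotes the unnormalized sum $\sum_i y_i x_i$):

```latex
\begin{align*}
\sum_{i=1}^m \log\!\sum_{y\in\{\pm1\}} e^{y\langle\theta,x_i\rangle} - \langle\theta,\mu\rangle
 &= \sum_{i=1}^m \Big[\log\big(e^{\langle\theta,x_i\rangle} + e^{-\langle\theta,x_i\rangle}\big) - y_i\langle\theta,x_i\rangle\Big] \\
 &= \sum_{i=1}^m \log\big(e^{(1-y_i)\langle\theta,x_i\rangle} + e^{-(1+y_i)\langle\theta,x_i\rangle}\big) \\
 &= \sum_{i=1}^m \log\big(1 + e^{-2 y_i\langle\theta,x_i\rangle}\big),
\end{align*}
```

since for $y_i \in \{\pm 1\}$ exactly one of the two exponents vanishes.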

SLIDE 16

Generalization bound

  • Loss: ℓ is a-lol and L-Lipschitz
  • Bounds: $\mathbb{R}^d \supseteq \mathcal{X} = \{x : \|x\|_2 \le X < \infty\}$ and $H = \{\theta : \|\theta\|_2 \le B < \infty\}$
  • Bounded loss: $c(X, B) \doteq \max_{y \in \{\pm 1\}} \ell(y X B)$
  • Let $\hat\theta \doteq \operatorname*{argmin}_{\theta \in H} R_{S,\ell}(\theta)$

SLIDE 17

Generalization bound

  • Same setup as the previous slide (ℓ a-lol and L-Lipschitz, $\|x\|_2 \le X$, $\|\theta\|_2 \le B$, $\hat\theta \doteq \operatorname*{argmin}_{\theta\in H} R_{S,\ell}(\theta)$).

Then for any δ > 0, with probability at least 1 − δ:

$$R_{D,\ell}(\hat\theta) - \inf_{\theta \in H} R_{D,\ell}(\theta) \le \left(\sqrt{2} + \frac{1}{4}\right)\frac{XBL}{\sqrt{m}} + \frac{c(X,B)\,L}{2}\sqrt{\frac{1}{m}\log\frac{1}{\delta}} + 2|a|B\,\|\mu_D - \mu_S\|_2$$

SLIDE 18

Generalization bound

  • Same setup and bound as the previous slide.
  • The first terms capture the complexity of the space H; the last, label-dependent term concentrates, since with high probability

$$2|a|B\,\|\mu_D - \mu_S\|_2 \;\le\; 2|a|XB\sqrt{\frac{d}{m}\log\frac{2d}{\delta}}$$
SLIDE 19

Weakly supervised learning

  • Weak labels may be wrong (noisy), missing, multi-instance, etc.
  • $D \xrightarrow{\text{corrupt}} \tilde{D} \xrightarrow{\text{sample}} \tilde{S}$

SLIDE 20

Weakly supervised learning

  • Weak labels may be wrong (noisy), missing, multi-instance, etc.: $D \xrightarrow{\text{corrupt}} \tilde{D} \xrightarrow{\text{sample}} \tilde{S}$
  • 2-step approach:
    (1) Estimate $\tilde\mu_S$ from the weak labels
    (2) Plug it into ℓ and call any algorithm for risk minimization on $S_{2x}$

SLIDE 21

Example: SGD (step 2)

Algorithm 1 (µSGD)
  Input: S_2x, µ, ℓ is a-lol; θ_0 ← 0
  For t = 1, 2, … until convergence:
    Pick i ∈ {1, …, |S_2x|} at random
    η ← 1/t
    Pick any v ∈ ∂ℓ(y_i⟨θ_t, x_i⟩)
    θ_{t+1} ← θ_t − η (v + aµ/2)
  Output: θ_{t+1}
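A minimal NumPy sketch of this loop (illustrative, not the authors' code: it assumes the logistic loss, so a = −1/2, and keeps the aµ/2 scaling exactly as written in Algorithm 1). The constant mean-operator shift is the only difference from plain SGD:

```python
import numpy as np
from scipy.special import expit  # sigmoid

def mu_sgd(X, mu, a=-0.5, iters=10000, seed=0):
    """SGD over the label-free double sample S_2x, plus the mean-operator term.
    X: (m, d) features; mu: (estimated) mean operator; a: linear-odd constant of the loss."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(m)
        sigma = rng.choice([-1.0, 1.0])            # uniform pick from S_2x = {(x_i, ±1)}
        eta = 1.0 / t
        z = sigma * (X[i] @ theta)
        v = -sigma * expit(-z) * X[i]              # (sub)gradient of log(1 + e^{-z}) w.r.t. theta
        theta = theta - eta * (v + a * mu / 2.0)   # the a*mu/2 shift is the only change w.r.t. SGD
    return theta
```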

SLIDE 22

Example: SGD (step 2)

  • In the paper: proximal algorithms
  • The mean-operator term aµ/2 is the only change w.r.t. standard SGD (Algorithm 1 above)

SLIDE 23

A unifying approach

  • Learning from label proportions with
    • logistic loss [N. Quadrianto et al. ’09]
    • symmetric proper loss [G. Patrini et al. ’14]
  • Learning with noisy labels with
    • logistic loss [Gao et al. ’16]
SLIDE 24

Asymmetric label noise

  • Sample $\tilde{S} = \{(x_i, \tilde{y}_i)\}_{i=1}^{m}$ corrupted by asymmetric noise rates $p_+, p_-$.

SLIDE 25

Asymmetric label noise

  • Sample $\tilde{S} = \{(x_i, \tilde{y}_i)\}_{i=1}^{m}$ corrupted by asymmetric noise rates $p_+, p_-$.
  • By the method of [Natarajan et al. ’13], an unbiased estimator of $\mu_S$ is

$$\hat\mu_S \doteq \mathbb{E}_{\tilde{S}}\!\left[\frac{\tilde y - (p_- - p_+)}{1 - p_- - p_+}\, x\right]$$

  • This is step (1); then run µSGD for step (2).
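An illustrative NumPy sketch of step (1) on synthetic data (assumed convention: p₊ is the flip probability of true +1 labels and p₋ of true −1 labels; names are hypothetical):

```python
import numpy as np

def mu_hat(X, y_tilde, p_plus, p_minus):
    # unbiased estimator of mu_S from noisy labels [Natarajan et al. '13]:
    # mu_hat = E_{S~}[ (y~ - (p_- - p_+)) / (1 - p_- - p_+) * x ]
    w = (y_tilde - (p_minus - p_plus)) / (1.0 - p_minus - p_plus)
    return (w[:, None] * X).mean(axis=0)

# tiny sanity check on synthetic data
rng = np.random.default_rng(0)
m, p_plus, p_minus = 50000, 0.2, 0.4
X = rng.normal(size=(m, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
flip = rng.random(m) < np.where(y > 0, p_plus, p_minus)   # asymmetric label flips
y_tilde = np.where(flip, -y, y)

print(mu_hat(X, y_tilde, p_plus, p_minus))   # approximately the clean mean operator below
print((y[:, None] * X).mean(axis=0))
```

The estimate $\hat\mu_S$ is then handed to µSGD for step (2).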
SLIDE 26

Generalization bound under noise

  • Same as before, except that now $\hat\theta = \operatorname*{argmin}_{\theta \in H} \hat{R}_{\tilde{S},\ell}(\theta)$.

Then for any δ > 0, with probability at least 1 − δ:

$$R_{D,\ell}(\hat\theta) - \inf_{\theta \in H} R_{D,\ell}(\theta) \le \left(\sqrt{2} + \frac{1}{4}\right)\frac{XBL}{\sqrt{m}} + \frac{c(X,B)\,L}{2}\sqrt{\frac{1}{m}\log\frac{2}{\delta}} + \frac{2|a|XB}{1 - p_- - p_+}\sqrt{\frac{d}{m}\log\frac{2d}{\delta}}$$

SLIDE 27

Generalization bound under noise

  • Same bound as the previous slide, with $\hat\theta = \operatorname*{argmin}_{\theta \in H} \hat{R}_{\tilde{S},\ell}(\theta)$.
  • Noise affects the linear (mean-operator) term only; the complexity terms are untouched.
SLIDE 28

Empirics

  • Artificially corrupted data; noise rates up to ~50%
  • SGD vs µSGD with the same parameters
  • Test error (sgd columns) and average difference (µsgd columns), over 25 runs

(p−, p+) →    (.00,.00)   (.20,.00)   (.20,.10)   (.20,.20)   (.20,.30)   (.20,.40)   (.20,.49)
dataset       sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd
australian    0.13 +.01   0.15 −.01   0.14 ±.00   0.14 +.01   0.16 −.01   0.26 −.09   0.45 −.25
breast-can.   0.02 +.01   0.03 ±.00   0.03 ±.00   0.03 ±.00   0.05 −.01   0.11 −.06   0.17 −.08
diabetes      0.28 −.03   0.29 −.03   0.29 −.03   0.27 −.02   0.28 −.02   0.39 −.13   0.59 −.22
german        0.27 −.02   0.26 ±.00   0.27 −.02   0.29 −.02   0.31 −.01   0.31 ±.00   0.31 ±.00
heart         0.15 +.01   0.17 −.01   0.16 ±.00   0.17 ±.00   0.18 −.01   0.26 −.08   0.35 −.15
housing       0.17 −.03   0.23 −.05   0.22 −.04   0.20 −.02   0.22 −.03   0.34 −.12   0.41 −.13
ionosphere    0.14 +.05   0.19 −.05   0.20 −.05   0.20 −.03   0.21 −.03   0.35 −.13   0.54 −.29
sonar         0.27 ±.00   0.29 +.02   0.29 +.01   0.34 −.04   0.36 −.03   0.43 −.10   0.45 −.05

SLIDE 29

Empirics

  • Artificially corrupted data; noise rates up to ~50%
  • SGD vs µSGD with the same parameters; test error and average difference over 25 runs (table on the previous slide)
  • ⇒ Still able to learn even when one label is flipped almost at random (p₊ = .49)

SLIDE 30

Bonus: data-dependent robustness

  • The mean operator, a data-dependent statistic, bounds the effect of noise.
  • Let $\theta^\star$ and $\tilde\theta^\star$ be the risk minimizers under $D$ and $\tilde{D}$, and let $\epsilon = 4|a|B \max\{p_+, p_-\}\,\|\mu_D\|_2$. Any a-lol ℓ is such that

$$R_{\tilde{D},\ell}(\theta^\star) - R_{\tilde{D},\ell}(\tilde\theta^\star) \le \epsilon$$

  Moreover, if ℓ is differentiable and λ-strongly convex, then

$$\|\theta^\star - \tilde\theta^\star\|_2^2 \le \frac{2}{\lambda}\,\epsilon$$

SLIDE 31

More in the paper

  • Mean and covariance operators
  • Non-linear models & kernel mean map
  • Learning reductions
  • Proximal algorithms
  • Ongoing work: unsupervised component of losses?