Loss factorization, weakly supervised learning and label noise robustness


SLIDE 1

Loss factorization, weakly supervised learning and label noise robustness

  • Giorgio Patrini, Frank Nielsen, Richard Nock, Marcello Carioni

Australian National University, Data61 (ex NICTA), Ecole Polytechnique, Sony CS Labs, Max Planck Institute of Mathematics in the Sciences

SLIDE 2

In 1 slide

  • Loss functions factor, and so do their risks, isolating a sufficient statistic for the labels, µ.

[Figure: the logistic loss log(1 + e^{−x}) decomposed into its even part ℓ_e(x) and odd part ℓ_o(x)]

$$\ell(x) = \ell_e(x) + \ell_o(x)$$

SLIDE 3

In 1 slide

  • Loss functions factor (previous slide): $\ell(x) = \ell_e(x) + \ell_o(x)$, isolating a sufficient statistic µ for the labels.
  • Weakly supervised learning: (1) estimate µ and (2) plug it into ℓ, then call standard algorithms. E.g., SGD:

$$\theta_{t+1} \leftarrow \theta_t - \eta\,\nabla\ell(\pm\langle\theta_t, x_i\rangle) - \tfrac{1}{2}\eta a \mu$$

SLIDE 4

In 1 slide

  • For asymmetric label noise with rates p₊, p₋, an unbiased estimator of µ (to use in step (1) above) is

$$\hat\mu \doteq \mathbb{E}_{(x,\tilde y)}\!\left[\frac{\tilde y - (p_- - p_+)}{1 - p_- - p_+}\, x\right]$$
SLIDE 5
Preliminary

  • Binary classification: $S = \{(x_i, y_i),\ i \in [m]\}$ sampled from $D$ over $\mathbb{R}^d \times \{-1, 1\}$, where $[m] \doteq \{1, \dots, m\}$
  • Learn a linear (or kernel) model $h \in H$
  • Minimize the empirical risk associated with a surrogate loss $\ell$:

$$\operatorname*{argmin}_{h \in H} \mathbb{E}_S[\ell(y\,h(x))] = \operatorname*{argmin}_{h \in H} R_{S,\ell}(h)$$
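To make the setup concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper) of empirical risk minimization with a logistic surrogate and a linear model, by plain gradient descent:

```python
import numpy as np
from scipy.special import expit  # sigmoid

def logistic_loss(z):
    # surrogate loss l(z) = log(1 + exp(-z)), numerically stable
    return np.logaddexp(0.0, -z)

def empirical_risk(theta, X, y):
    # R_{S,l}(theta) = (1/m) * sum_i l(y_i <theta, x_i>)
    return logistic_loss(y * (X @ theta)).mean()

def erm(X, y, lr=0.1, iters=500):
    # argmin over linear models h(x) = <theta, x> of the empirical surrogate risk
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = y * (X @ theta)
        grad = -(X * (y * expit(-z))[:, None]).mean(axis=0)  # gradient of the risk in theta
        theta -= lr * grad
    return theta
```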

SLIDE 6

Mean operator & linear-odd losses

  • Mean operator:

$$\mu_S \doteq \mathbb{E}_S[y\,x] = \frac{1}{m}\sum_{i=1}^{m} y_i x_i$$

SLIDE 7

Mean operator & linear-odd losses

  • Mean operator:

$$\mu_S \doteq \mathbb{E}_S[y\,x] = \frac{1}{m}\sum_{i=1}^{m} y_i x_i$$

  • a-linear-odd loss (a-lol): the odd part of ℓ is linear in its (generic) argument x,

$$\tfrac{1}{2}\big(\ell(x) - \ell(-x)\big) = \ell_o(x) = a x, \quad \text{for any } a \in \mathbb{R}$$
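A small NumPy sketch (illustrative names, not from the paper) of the two objects above: the mean operator, and a numerical check that the logistic loss is linear-odd with a = −1/2:

```python
import numpy as np

def mean_operator(X, y):
    # mu_S = E_S[y x] = (1/m) * sum_i y_i x_i -- the only label-dependent statistic needed later
    return (y[:, None] * X).mean(axis=0)

def odd_part(loss, x):
    # l_o(x) = (l(x) - l(-x)) / 2; equals a*x for an a-linear-odd loss
    return 0.5 * (loss(x) - loss(-x))

logistic = lambda z: np.logaddexp(0.0, -z)   # log(1 + e^{-z})
xs = np.linspace(-3.0, 3.0, 7)
print(np.allclose(odd_part(logistic, xs), -xs / 2))   # True: a = -1/2 for the logistic loss
```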

SLIDE 8

Loss factorization

  • Linear model $h$
  • Linear-odd loss: $\tfrac{1}{2}\big(\ell(x) - \ell(-x)\big) = \ell_o(x) = ax$ (neither smoothness nor convexity required)
  • Define the “double sample”:

$$S_{2x} \doteq \{(x_i, \sigma),\ i \in [m],\ \sigma \in \{\pm 1\}\}$$

SLIDE 9

Loss factorization

  • Linear model $h$
  • Linear-odd loss: $\tfrac{1}{2}\big(\ell(x) - \ell(-x)\big) = \ell_o(x) = ax$ (neither smoothness nor convexity required)
  • Define the “double sample”: $S_{2x} \doteq \{(x_i, \sigma),\ i \in [m],\ \sigma \in \{\pm 1\}\}$

Then the empirical risk factors as

$$R_{S,\ell}(h) = \tfrac{1}{2} R_{S_{2x},\ell}(h) + a \cdot h(\mu_S)$$
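The factorization is easy to check numerically. A sketch (illustrative NumPy; logistic loss, so a = −1/2; it assumes the double-sample risk sums over the 2m sign-augmented points and divides by m, which makes the identity above exact):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, a = 200, 5, -0.5                       # a = -1/2 for the logistic loss
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
theta = rng.normal(size=d)                   # any linear model h(x) = <theta, x>

logistic = lambda z: np.logaddexp(0.0, -z)
z = X @ theta

lhs = logistic(y * z).mean()                   # R_{S,l}(h)
risk_2x = (logistic(z) + logistic(-z)).mean()  # R_{S2x,l}(h): sum over the 2m points, divided by m
mu_S = (y[:, None] * X).mean(axis=0)           # mean operator
rhs = 0.5 * risk_2x + a * (theta @ mu_S)

print(np.isclose(lhs, rhs))   # True -- the labels enter the risk only through mu_S
```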

SLIDE 10

Loss factorization: proof

$$\begin{aligned}
R_{S,\ell}(h) &= \mathbb{E}_S\big[\ell(y\,h(x))\big] \\
&= \tfrac{1}{2}\,\mathbb{E}_S\big[\ell(y\,h(x)) + \ell(-y\,h(x)) + \ell(y\,h(x)) - \ell(-y\,h(x))\big] \\
&= \tfrac{1}{2}\,\mathbb{E}_{S_{2x}}\big[\ell(h(x))\big] + \mathbb{E}_S\big[\ell_o(h(y\,x))\big] \\
&= \tfrac{1}{2} R_{S_{2x},\ell}(h) + a \cdot h(\mu_S)
\end{aligned}$$

SLIDE 11

Loss factorization: proof

  • The second line of the derivation (previous slide) splits the loss into its even and odd parts.


SLIDE 13

Loss factorization: proof

  • The last steps use the linearity of ℓ_o and of h, and show the sufficiency of µ for the labels y.

SLIDE 14

Linear-odd losses: examples

  • Logistic loss & exponential family:

$$\sum_{i=1}^{m} \log \sum_{y \in \mathcal{Y}} e^{y\langle\theta, x_i\rangle} - \langle\theta, \mu\rangle = \sum_{i=1}^{m} \log\!\left(1 + e^{-2 y_i \langle\theta, x_i\rangle}\right)$$

SLIDE 15

Linear-odd losses: examples

  • Logistic loss & exponential family (identity on the previous slide).
  • Examples of linear-odd losses and their odd terms:

loss                          ℓ(x)                         odd term ℓ_o(x)
a-lol (generic)               ℓ(x)                         ax
ρ-loss (ρ ≥ 0)                ρ|x| − ρx + 1                −ρx
unhinged                      1 − x                        −x
perceptron                    max(0, −x)                   −x
double-hinge                  max(−x, ½ max(0, 1 − x))     −x
spl (symmetric proper loss)   a_ℓ + ℓ*(−x)/b_ℓ             −x/(2b_ℓ)
logistic                      log(1 + e^{−x})              −x/2
square                        (1 − x)²                     −2x
Matsushita                    √(1 + x²) − x                −x
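The logistic row connects directly to the exponential-family identity above; a short check (assuming, as the identity suggests, that µ here denotes the unnormalized sum $\sum_i y_i x_i$):

```latex
\begin{align*}
\sum_{i=1}^m \log\!\sum_{y\in\{\pm1\}} e^{y\langle\theta,x_i\rangle} - \langle\theta,\mu\rangle
 &= \sum_{i=1}^m \Big[\log\big(e^{\langle\theta,x_i\rangle} + e^{-\langle\theta,x_i\rangle}\big) - y_i\langle\theta,x_i\rangle\Big] \\
 &= \sum_{i=1}^m \log\big(e^{(1-y_i)\langle\theta,x_i\rangle} + e^{-(1+y_i)\langle\theta,x_i\rangle}\big) \\
 &= \sum_{i=1}^m \log\big(1 + e^{-2 y_i\langle\theta,x_i\rangle}\big),
\end{align*}
```

since for $y_i \in \{\pm 1\}$ exactly one of the two exponents vanishes.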

SLIDE 16

Generalization bound

  • Loss: ℓ is a-lol and L-Lipschitz
  • Bounds: $\mathbb{R}^d \supseteq \mathcal{X} = \{x : \|x\|_2 \le X < \infty\}$ and $H = \{\theta : \|\theta\|_2 \le B < \infty\}$
  • Bounded loss: $c(X, B) \doteq \max_{y \in \{\pm 1\}} \ell(y X B)$
  • Let $\hat\theta \doteq \operatorname*{argmin}_{\theta \in H} R_{S,\ell}(\theta)$

SLIDE 17

Generalization bound

  • Same setup as the previous slide (ℓ a-lol and L-Lipschitz, $\|x\|_2 \le X$, $\|\theta\|_2 \le B$, $\hat\theta \doteq \operatorname*{argmin}_{\theta\in H} R_{S,\ell}(\theta)$).

Then for any δ > 0, with probability at least 1 − δ:

$$R_{D,\ell}(\hat\theta) - \inf_{\theta \in H} R_{D,\ell}(\theta) \le \left(\sqrt{2} + \frac{1}{4}\right)\frac{XBL}{\sqrt{m}} + \frac{c(X,B)\,L}{2}\sqrt{\frac{1}{m}\log\frac{1}{\delta}} + 2|a|B\,\|\mu_D - \mu_S\|_2$$

SLIDE 18

Generalization bound

  • Same setup and bound as the previous slide.
  • The first terms capture the complexity of the space H; the last, label-dependent term concentrates, since with high probability

$$2|a|B\,\|\mu_D - \mu_S\|_2 \;\le\; 2|a|XB\sqrt{\frac{d}{m}\log\frac{2d}{\delta}}$$
SLIDE 19

Weakly supervised learning

  • Weak labels may be wrong (noisy), missing, multi-instance, etc.
  • $D \xrightarrow{\text{corrupt}} \tilde{D} \xrightarrow{\text{sample}} \tilde{S}$

SLIDE 20

Weakly supervised learning

  • Weak labels may be wrong (noisy), missing, multi-instance, etc.: $D \xrightarrow{\text{corrupt}} \tilde{D} \xrightarrow{\text{sample}} \tilde{S}$
  • 2-step approach:
    (1) Estimate $\tilde\mu_S$ from the weak labels
    (2) Plug it into ℓ and call any algorithm for risk minimization on $S_{2x}$

SLIDE 21

Example: SGD (step 2)

Algorithm 1 (µSGD)
  Input: S_2x, µ, ℓ is a-lol; θ_0 ← 0
  For t = 1, 2, … until convergence:
    Pick i ∈ {1, …, |S_2x|} at random
    η ← 1/t
    Pick any v ∈ ∂ℓ(y_i⟨θ_t, x_i⟩)
    θ_{t+1} ← θ_t − η (v + aµ/2)
  Output: θ_{t+1}
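A minimal NumPy sketch of this loop (illustrative, not the authors' code: it assumes the logistic loss, so a = −1/2, and keeps the aµ/2 scaling exactly as written in Algorithm 1). The constant mean-operator shift is the only difference from plain SGD:

```python
import numpy as np
from scipy.special import expit  # sigmoid

def mu_sgd(X, mu, a=-0.5, iters=10000, seed=0):
    """SGD over the label-free double sample S_2x, plus the mean-operator term.
    X: (m, d) features; mu: (estimated) mean operator; a: linear-odd constant of the loss."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(m)
        sigma = rng.choice([-1.0, 1.0])            # uniform pick from S_2x = {(x_i, ±1)}
        eta = 1.0 / t
        z = sigma * (X[i] @ theta)
        v = -sigma * expit(-z) * X[i]              # (sub)gradient of log(1 + e^{-z}) w.r.t. theta
        theta = theta - eta * (v + a * mu / 2.0)   # the a*mu/2 shift is the only change w.r.t. SGD
    return theta
```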

SLIDE 22

Example: SGD (step 2)

  • In the paper: proximal algorithms
  • The mean-operator term aµ/2 is the only change w.r.t. standard SGD (Algorithm 1 above)

SLIDE 23

A unifying approach

  • Learning from label proportions with
    • logistic loss [N. Quadrianto et al. ’09]
    • symmetric proper loss [G. Patrini et al. ’14]
  • Learning with noisy labels with
    • logistic loss [Gao et al. ’16]
SLIDE 24

Asymmetric label noise

  • Sample $\tilde{S} = \{(x_i, \tilde{y}_i)\}_{i=1}^{m}$ corrupted by asymmetric noise rates $p_+, p_-$.

SLIDE 25

Asymmetric label noise

  • Sample $\tilde{S} = \{(x_i, \tilde{y}_i)\}_{i=1}^{m}$ corrupted by asymmetric noise rates $p_+, p_-$.
  • By the method of [Natarajan et al. ’13], an unbiased estimator of $\mu_S$ is

$$\hat\mu_S \doteq \mathbb{E}_{\tilde{S}}\!\left[\frac{\tilde y - (p_- - p_+)}{1 - p_- - p_+}\, x\right]$$

  • This is step (1); then run µSGD for step (2).
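An illustrative NumPy sketch of step (1) on synthetic data (assumed convention: p₊ is the flip probability of true +1 labels and p₋ of true −1 labels; names are hypothetical):

```python
import numpy as np

def mu_hat(X, y_tilde, p_plus, p_minus):
    # unbiased estimator of mu_S from noisy labels [Natarajan et al. '13]:
    # mu_hat = E_{S~}[ (y~ - (p_- - p_+)) / (1 - p_- - p_+) * x ]
    w = (y_tilde - (p_minus - p_plus)) / (1.0 - p_minus - p_plus)
    return (w[:, None] * X).mean(axis=0)

# tiny sanity check on synthetic data
rng = np.random.default_rng(0)
m, p_plus, p_minus = 50000, 0.2, 0.4
X = rng.normal(size=(m, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
flip = rng.random(m) < np.where(y > 0, p_plus, p_minus)   # asymmetric label flips
y_tilde = np.where(flip, -y, y)

print(mu_hat(X, y_tilde, p_plus, p_minus))   # approximately the clean mean operator below
print((y[:, None] * X).mean(axis=0))
```

The estimate $\hat\mu_S$ is then handed to µSGD for step (2).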
SLIDE 26

Generalization bound under noise

  • Same as before, except that now $\hat\theta = \operatorname*{argmin}_{\theta \in H} \hat{R}_{\tilde{S},\ell}(\theta)$.

Then for any δ > 0, with probability at least 1 − δ:

$$R_{D,\ell}(\hat\theta) - \inf_{\theta \in H} R_{D,\ell}(\theta) \le \left(\sqrt{2} + \frac{1}{4}\right)\frac{XBL}{\sqrt{m}} + \frac{c(X,B)\,L}{2}\sqrt{\frac{1}{m}\log\frac{2}{\delta}} + \frac{2|a|XB}{1 - p_- - p_+}\sqrt{\frac{d}{m}\log\frac{2d}{\delta}}$$

SLIDE 27

Generalization bound under noise

  • Same bound as the previous slide, with $\hat\theta = \operatorname*{argmin}_{\theta \in H} \hat{R}_{\tilde{S},\ell}(\theta)$.
  • Noise affects the linear (mean-operator) term only; the complexity terms are untouched.
SLIDE 28

Empirics

  • Artificially corrupted data; noise rates up to ~50%
  • SGD vs µSGD with the same parameters
  • Test error (sgd columns) and average difference (µsgd columns), over 25 runs

(p−, p+) →    (.00,.00)   (.20,.00)   (.20,.10)   (.20,.20)   (.20,.30)   (.20,.40)   (.20,.49)
dataset       sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd   sgd  µsgd
australian    0.13 +.01   0.15 −.01   0.14 ±.00   0.14 +.01   0.16 −.01   0.26 −.09   0.45 −.25
breast-can.   0.02 +.01   0.03 ±.00   0.03 ±.00   0.03 ±.00   0.05 −.01   0.11 −.06   0.17 −.08
diabetes      0.28 −.03   0.29 −.03   0.29 −.03   0.27 −.02   0.28 −.02   0.39 −.13   0.59 −.22
german        0.27 −.02   0.26 ±.00   0.27 −.02   0.29 −.02   0.31 −.01   0.31 ±.00   0.31 ±.00
heart         0.15 +.01   0.17 −.01   0.16 ±.00   0.17 ±.00   0.18 −.01   0.26 −.08   0.35 −.15
housing       0.17 −.03   0.23 −.05   0.22 −.04   0.20 −.02   0.22 −.03   0.34 −.12   0.41 −.13
ionosphere    0.14 +.05   0.19 −.05   0.20 −.05   0.20 −.03   0.21 −.03   0.35 −.13   0.54 −.29
sonar         0.27 ±.00   0.29 +.02   0.29 +.01   0.34 −.04   0.36 −.03   0.43 −.10   0.45 −.05

SLIDE 29

Empirics

  • Artificially corrupted data; noise rates up to ~50%
  • SGD vs µSGD with the same parameters; test error and average difference over 25 runs (table on the previous slide)
  • ⇒ Still able to learn even when one label is flipped almost at random (p₊ = .49)

SLIDE 30

Bonus: data-dependent robustness

  • The mean operator, a data-dependent statistic, bounds the effect of noise.
  • Let $\theta^\star$ and $\tilde\theta^\star$ be the risk minimizers under $D$ and $\tilde{D}$, and let $\epsilon = 4|a|B \max\{p_+, p_-\}\,\|\mu_D\|_2$. Any a-lol ℓ is such that

$$R_{\tilde{D},\ell}(\theta^\star) - R_{\tilde{D},\ell}(\tilde\theta^\star) \le \epsilon$$

  Moreover, if ℓ is differentiable and λ-strongly convex, then

$$\|\theta^\star - \tilde\theta^\star\|_2^2 \le \frac{2}{\lambda}\,\epsilon$$

SLIDE 31

More in the paper

  • Mean and covariance operators
  • Non-linear models & kernel mean map
  • Learning reductions
  • Proximal algorithms
  • Ongoing work: unsupervised component of losses?