Loss factorization, weakly supervised learning and label noise robustness
Giorgio Patrini, Frank Nielsen, Richard Nock, Marcello Carioni
Australian National University, Data61 (ex NICTA), École Polytechnique, Sony CS Labs, Max Planck Institute
Loss functions factor
Every loss $\ell$ splits into an even and an odd part, $\ell(x) = \ell_e(x) + \ell_o(x)$, with $\ell_e(x) = \frac{1}{2}(\ell(x) + \ell(-x))$ and $\ell_o(x) = \frac{1}{2}(\ell(x) - \ell(-x))$. Since $\ell_e(yx) = \ell_e(x)$ for $y \in \{\pm 1\}$, only the odd part carries information about the labels, and the mean operator $\mu$ is a sufficient statistic for them.
[Figure: the logistic loss $\log(1 + e^{-x})$ with its even part $\ell_e(x)$ and odd part $\ell_o(x)$.]
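The decomposition is easy to verify numerically; below is a minimal sketch (not part of the poster) using the logistic loss, whose odd part is exactly $-x/2$:

```python
import numpy as np

def logistic(x):
    # numerically stable log(1 + exp(-x))
    return np.logaddexp(0.0, -x)

x = np.linspace(-3.0, 3.0, 61)
even = 0.5 * (logistic(x) + logistic(-x))
odd = 0.5 * (logistic(x) - logistic(-x))

print(np.allclose(even + odd, logistic(x)))  # True: the loss factors
print(np.allclose(odd, -x / 2))              # True: the odd part is linear
```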
Weakly supervised learning
In the factored risk the labels appear only through the mean operator $\mu$, so weak supervision reduces to: (1) estimate $\hat\mu$ and (2) plug it into the factored risk and call standard algorithms unchanged. E.g., SGD becomes
$\theta_{t+1} \leftarrow \theta_t - \eta \nabla \ell(\pm\langle\theta_t, x_i\rangle) - \frac{1}{2}\eta a \hat\mu.$
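A minimal sketch of this update for the logistic loss ($a = -1/2$); the function name and hyperparameters are illustrative, not from the poster:

```python
import numpy as np

def mu_sgd(X, mu_hat, a=-0.5, eta=0.1, epochs=10, seed=0):
    """SGD on the factored risk: the labels enter only through mu_hat."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            sigma = rng.choice([-1.0, 1.0])           # sign is drawn label-free
            z = sigma * (X[i] @ theta)
            grad = -sigma * X[i] / (1.0 + np.exp(z))  # grad of log(1 + e^{-z}) wrt theta
            theta -= eta * grad + 0.5 * eta * a * mu_hat
    return theta
```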
Label noise robustness
Under asymmetric label noise with flip probabilities $p_+$ and $p_-$, the mean operator is still recoverable from the noisy labels $\tilde y$:
$\mu = \mathbb{E}_{(x, \tilde y)}\!\left[\frac{\tilde y - (p_- - p_+)}{1 - p_- - p_+}\, x\right],$
so the same recipe applies: estimate $\hat\mu$ with this correction and run the standard algorithms unchanged.
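On a finite sample the corrected estimator is one line; a sketch assuming the flip rates are known:

```python
import numpy as np

def mu_hat_noisy(X, y_tilde, p_minus, p_plus):
    """Mean-operator estimate from noisy labels y_tilde in {-1, +1}."""
    w = (y_tilde - (p_minus - p_plus)) / (1.0 - p_minus - p_plus)
    return (w[:, None] * X).mean(axis=0)
```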
Factorization theorem
Write $[m] := \{1, \ldots, m\}$ and define the label-free doubled sample $S^{2x} := \{(x_i, \sigma),\, i \in [m],\, \sigma \in \{\pm 1\}\}$. A loss $\ell$ is $a$-linear-odd ($a$-lol) when, for a generic argument $x$,
$\frac{1}{2}(\ell(x) - \ell(-x)) = \ell_o(x) = ax.$
For any $a$-lol loss and predictor $h$, the empirical risk on $S$ factors into the risk on $S^{2x}$, which ignores the labels, plus a linear label-dependent term; neither smoothness nor convexity of $\ell$ is required.
[Poster labels: even + odd; linearity of $\ell_o$ and of $h$ gives sufficiency.]
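The factorization is simple to check numerically for a linear predictor; a self-contained sketch with the logistic loss ($a = -1/2$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
theta = rng.normal(size=d)

loss = lambda z: np.logaddexp(0.0, -z)  # logistic loss, a-lol with a = -1/2
a = -0.5

risk = loss(y * (X @ theta)).mean()                             # labeled empirical risk
risk_even = 0.5 * (loss(X @ theta) + loss(-(X @ theta))).mean() # risk on S^{2x}, label-free
mu = (y[:, None] * X).mean(axis=0)                              # mean operator
print(np.allclose(risk, risk_even + a * (theta @ mu)))          # True
```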
Sufficiency of the mean operator
With the mean operator $\mu_S := \frac{1}{m}\sum_{i=1}^m y_i x_i$, the factorization for a linear predictor reads
$R_{S,\ell}(\theta) = \frac{1}{2m}\sum_{i=1}^m \sum_{y \in \mathcal{Y}} \ell(y\langle\theta, x_i\rangle) + a\langle\theta, \mu_S\rangle.$

Examples of $a$-lol losses:

loss           $\ell(x)$                              $\ell_o(x) = ax$
ρ-loss         $\rho|x| - \rho x + 1$                 $-\rho x$  ($\rho \ge 0$)
unhinged       $1 - x$                                $-x$
perceptron     $\max(0, -x)$                          $-x/2$
double-hinge   $\max(-x, \frac{1}{2}\max(0, 1 - x))$  $-x/2$
spl            $a_\ell + \ell^\star(-x)/b_\ell$       $-x/(2b_\ell)$
logistic       $\log(1 + e^{-x})$                     $-x/2$
square         $(1 - x)^2$                            $-2x$
Matsushita     $\sqrt{1 + x^2} - x$                   $-x$
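Each row of the table can be verified by computing the odd part numerically; a sketch for a few of the losses:

```python
import numpy as np

losses = {
    # name: (loss, slope a of the odd part)
    "unhinged":   (lambda x: 1.0 - x,                    -1.0),
    "perceptron": (lambda x: np.maximum(0.0, -x),        -0.5),
    "logistic":   (lambda x: np.logaddexp(0.0, -x),      -0.5),
    "square":     (lambda x: (1.0 - x) ** 2,             -2.0),
    "Matsushita": (lambda x: np.sqrt(1.0 + x ** 2) - x,  -1.0),
}

x = np.linspace(-5.0, 5.0, 101)
for name, (loss, a) in losses.items():
    odd = 0.5 * (loss(x) - loss(-x))
    print(name, np.allclose(odd, a * x))  # True for every row
```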
Generalization
Assume $\ell$ is $a$-lol and $L$-Lipschitz, $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \le X < \infty\}$, $\mathcal{H} = \{\theta : \|\theta\|_2 \le B < \infty\}$, $c(X, B) := \max_{y\in\{\pm 1\}} \ell(yXB)$, and let $\hat\theta := \mathrm{argmin}_{\theta\in\mathcal{H}} R_{S,\ell}(\theta)$. Then, with probability at least $1 - \delta$,
$R_{D,\ell}(\hat\theta) \le \min_{\theta\in\mathcal{H}} R_{D,\ell}(\theta) + (\text{complexity term, matching standard SGD}) + 2|a|XB\sqrt{\tfrac{d}{m}\log\tfrac{2d}{\delta}},$
where the last addend is the price of the non-linearity. Label noise affects the linear term only; the complexity term is untouched.
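For a sense of scale, the non-linearity term is easy to evaluate; a worked example with assumed values ($|a| = 1/2$, $X = B = 1$, $d = 10$, $m = 1000$, $\delta = 0.05$):

```python
import math

a, X, B, d, m, delta = 0.5, 1.0, 1.0, 10, 1000, 0.05
term = 2 * abs(a) * X * B * math.sqrt(d / m * math.log(2 * d / delta))
print(f"{term:.3f}")  # ~0.245, decaying as 1/sqrt(m) for fixed d
```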
Test error under asymmetric label noise (flip probabilities $(p_-, p_+)$). For each noise level, sgd is the test error of plain SGD on the noisy sample and µsgd is the difference in test error of µSGD relative to sgd (negative is better):

(p−, p+) →   (.00,.00)  (.20,.00)  (.20,.10)  (.20,.20)  (.20,.30)  (.20,.40)  (.20,.49)
dataset      sgd µsgd   sgd µsgd   sgd µsgd   sgd µsgd   sgd µsgd   sgd µsgd   sgd µsgd
australian   0.13 +.01  0.15 −.01  0.14 ±.00  0.14 +.01  0.16 −.01  0.26 −.09  0.45 −.25
breast-can.  0.02 +.01  0.03 ±.00  0.03 ±.00  0.03 ±.00  0.05 −.01  0.11 −.06  0.17 −.08
diabetes     0.28 −.03  0.29 −.03  0.29 −.03  0.27 −.02  0.28 −.02  0.39 −.13  0.59 −.22
german       0.27 −.02  0.26 ±.00  0.27 −.02  0.29 −.02  0.31 −.01  0.31 ±.00  0.31 ±.00
heart        0.15 +.01  0.17 −.01  0.16 ±.00  0.17 ±.00  0.18 −.01  0.26 −.08  0.35 −.15
housing      0.17 −.03  0.23 −.05  0.22 −.04  0.20 −.02  0.22 −.03  0.34 −.12  0.41 −.13
ionosphere   0.14 +.05  0.19 −.05  0.20 −.05  0.20 −.03  0.21 −.03  0.35 −.13  0.54 −.29
sonar        0.27 ±.00  0.29 +.02  0.29 +.01  0.34 −.04  0.36 −.03  0.43 −.10  0.45 −.05
Robustness of minimizers
$\mu$ is a data-dependent sufficient statistic for the labels. Let $\theta^\star$ and $\tilde\theta^\star$ be the risk minimizers under $D$ and a perturbed $\tilde D$; the gap between $R_{D,\ell}(\theta^\star)$ and $R_{\tilde D,\ell}(\tilde\theta^\star)$ is controlled by the size $\epsilon$ of the perturbation of the mean operator.