

SLIDE 1

A convex relaxation for weakly supervised classifiers

Armand Joulin and Francis Bach

SIERRA group, INRIA - École Normale Supérieure

ICML 2012

SLIDE 2

Weakly supervised classification

We address the problem of weak supervision: instances are grouped into bags, and each bag is associated with an observable partial labelling.

We assume that each instance possesses its own true latent label.

SLIDE 3

Example

Bags = images; instances = pixels.

[Figure: four example images with partial labels y = {horse, background}, y = {background}, y = {human, background}, y = {horse, background}, contrasting partially labeled data with fully labeled data.]

Set of partial labellings = 2^{horse, human, background}; set of true labels = {horse, human, background}.

SLIDE 4

Weakly supervised classification: Examples

[Figure: a set of latent true labels and, for each weakly supervised problem, the corresponding set of partial labellings: semi-supervised learning, multiple-instance learning, and unsupervised learning.]

Examples of partial labellings for different weakly supervised problems.

SLIDE 5

Inferring the labels and learning the model

[Figure: a latent true labelling set of two classes and the corresponding classifier.]

The goal is to jointly estimate the true latent labels and learn a classifier based on them. This usually leads to non-convex formulations, typically optimized with an EM procedure that only converges to a local minimum.

SLIDE 6

Our approach

We propose:
  • A general weakly supervised framework based on the likelihood of a probabilistic model,
  • A convex relaxation of the related cost function,
  • A dedicated optimization scheme.

SLIDE 7

Notations

[Figure: bags of instances with their partial labelling sets.]

I bags of instances. Each instance n is associated with:
  • a feature x_n ∈ X,
  • a weight π_n,
  • a partial label y_n, common to its bag,
  • a latent label z_n depending on y_n.

SLIDE 8

Discriminative classifier

We consider a regularized discriminative classifier:

$$L(z, w^\top \phi(x) + b) + \frac{\lambda}{2}\,\|w\|_F^2$$

SLIDE 9

Discriminative classifier

We consider a regularized discriminative classifier:

$$L(z, w, b) + \frac{\lambda}{2}\,\|w\|_F^2$$

where the loss function L(z, w, b) is the reweighted soft-max loss:

$$L(z, w, b) = -\sum_{n=1}^{N} \pi_n \sum_{l \in \mathcal{L}} y_{nl} \sum_{p \in P_l} z_{np} \log\left( \frac{\exp(w_p^\top \phi(x_n) + b_p)}{\sum_{k \in P_l} \exp(w_k^\top \phi(x_n) + b_k)} \right)$$

Our cost function is the negative log-likelihood of a multinomial logistic model.

Legend: (w, b) = parameters, x = feature, φ = feature map, y_n = label, z_n = latent label, π_n = weight.
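To make the loss concrete, here is a minimal NumPy sketch of the reweighted soft-max loss. The array shapes and the dense encodings of y, z, and the label sets P_l are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def softmax_loss(z, y, W, b, phi_x, pi, label_sets):
    """Reweighted soft-max loss L(z, w, b) -- a sketch, not the authors' code.

    z          : (N, K) latent label indicators z_{np} over all K classes
    y          : (N, L) partial-label indicators y_{nl} (shared within a bag)
    W, b       : (D, K) weights and (K,) intercepts
    phi_x      : (N, D) mapped features phi(x_n)
    pi         : (N,) instance weights pi_n
    label_sets : list of L index arrays; label_sets[l] = P_l, the classes
                 allowed under partial label l
    """
    scores = phi_x @ W + b                       # (N, K): w_k^T phi(x_n) + b_k
    loss = 0.0
    for l, P_l in enumerate(label_sets):
        s = scores[:, P_l]                       # soft-max restricted to P_l
        log_prob = s - logsumexp(s, axis=1, keepdims=True)
        loss -= np.sum(pi[:, None] * y[:, [l]] * z[:, P_l] * log_prob)
    return loss
```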

SLIDE 10

Cluster size balancing term

In unsupervised learning or MIL, nothing prevents the classifier from assigning the same latent label to all the instances.

SLIDE 11

Cluster size balancing term

In unsupervised learning or MIL, nothing prevents the classifier from assigning the same latent label to all the instances => perfect separation (a degenerate solution).

SLIDE 12

Cluster size balancing term

We penalize by the entropy of the proportion of instances per class and per bag (Joulin et al., 2010):

$$H(z) = -\sum_{i \in I} \sum_{k \in P} \Big( \sum_{n \in N_i} \pi_n z_{nk} \Big) \log \Big( \sum_{n \in N_i} \pi_n z_{nk} \Big)$$

This penalization is related to a graphical model (x → z → y): no additional parameter.

Legend: i = bag, n = instance, x_n = feature, I = set of bags, z_n = latent label, π_n = weight.
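The balancing term is straightforward to compute. A sketch under the same illustrative encoding as above (z as an (N, K) assignment matrix, bags as index arrays):

```python
import numpy as np

def cluster_balance_entropy(z, pi, bags):
    """Entropy of the weighted class proportions per bag -- a sketch of H(z).

    z    : (N, K) latent label assignments (hard or soft)
    pi   : (N,) instance weights
    bags : list of index arrays; bags[i] = N_i, the instances in bag i
    """
    H = 0.0
    for N_i in bags:
        p = (pi[N_i, None] * z[N_i]).sum(axis=0)  # weighted class mass in bag i
        p = p[p > 0]                              # convention: 0 log 0 = 0
        H -= np.sum(p * np.log(p))
    return H
```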

SLIDE 13

Overall problem

Our overall problem is formulated as:

$$\min_{z \in \mathcal{P}} \ \min_{w, b} \ f(z, w, b) = L(z, w, b) - H(z) + \frac{\lambda}{2}\,\|w\|_F^2$$

It is not jointly convex in z and (w, b).

Legend: H = cluster size balancing term, L = loss function, (w, b) = classifier parameters, λ = regularization parameter, P = set of latent labels, z = latent labels.
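Putting the pieces together, a sketch of f(z, w, b) under the same illustrative encodings, reusing the two helper functions sketched above (lambda_reg is our name for λ):

```python
import numpy as np

# A sketch of the overall (non-convex) objective f(z, w, b), assembled from
# the softmax_loss and cluster_balance_entropy sketches above.
def overall_objective(z, y, W, b, phi_x, pi, label_sets, bags, lambda_reg):
    return (softmax_loss(z, y, W, b, phi_x, pi, label_sets)
            - cluster_balance_entropy(z, pi, bags)
            + 0.5 * lambda_reg * np.sum(W ** 2))
```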

SLIDE 14

Convex relaxation - Overview

$$\min_{z \in \mathcal{P}} \ \min_{w, b} \ f(z, w, b) = L(z, w, b) - H(z) + \frac{\lambda}{2}\,\|w\|_F^2$$

  • We use a dual formulation based on Fenchel duality,
  • We reparametrize the problem following Guo and Schuurmans (2008),
  • Finally, we relax it to a semidefinite program (SDP).

SLIDE 15

Duality with Fenchel conjugate

The Fenchel conjugate of the log-partition function is:

$$\log\Big(\sum_k \exp(t_k)\Big) = \max_{q \in \Delta} \ \sum_k q_k t_k - \sum_k q_k \log(q_k)$$

The minimization in (w, b) leading to the dual formulation is in closed form:

$$\min_{z \in \mathcal{P}} \ \max_{\substack{q \in \mathcal{S}_P^N \\ (q - z)^\top \pi = 0}} \ -H(z) + \sum_{i \in I} \sum_{n \in N_i} \pi_n h(q_n) - \frac{1}{\lambda} \operatorname{tr}\big[(q - z)(q - z)^\top K\big]$$

where $K_{nm} = \langle \pi_n \phi(x_n), \pi_m \phi(x_m) \rangle$.

Legend: i = bag, n = instance, x_n = feature, φ = feature map, π_n = weight, z_n = latent label, S_P = simplex, h = entropy, (w, b) = classifier parameters.
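The conjugacy between log-sum-exp and negative entropy is easy to check numerically: the maximizing q is the soft-max distribution. A small verification sketch (our illustration of the identity, not part of the paper):

```python
import numpy as np
from scipy.special import logsumexp

# Check: log sum_k exp(t_k) = max_{q in simplex} q.t - sum_k q_k log q_k,
# with the maximum attained at the soft-max distribution q*.
rng = np.random.default_rng(0)
t = rng.normal(size=5)

q_star = np.exp(t - logsumexp(t))          # soft-max attains the maximum
dual_value = q_star @ t - np.sum(q_star * np.log(q_star))

assert np.allclose(dual_value, logsumexp(t))
```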

SLIDE 16

Sources of non-convexity

$$\min_{z \in \mathcal{P}} \ \max_{\substack{q \in \mathcal{S}_P^N \\ (q - z)^\top \pi = 0}} \ -H(z) + \sum_{i \in I} \sum_{n \in N_i} \pi_n h(q_n) - \frac{1}{\lambda} \operatorname{tr}\big[(q - z)(q - z)^\top K\big]$$

Two sources of non-convexity:
  • a constraint coupling a variable of the convex problem with a variable of the concave problem,
  • a function which is not jointly convex/concave in z/q.

Proposed solution (Guo and Schuurmans, 2008):
  • reparametrization in q,
  • SDP relaxation in z.

SLIDE 17

Reparametrization in q

We reparametrize the problem by introducing an N × N matrix Ω such that q = Ωz. The constraints on q become convex constraints over Ω. The problem becomes:

$$\min_{z \in \mathcal{P}} \ \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi, \ \Omega 1_N = 1_N}} \ -H(z) + \sum_{i \in I} \sum_{n \in N_i} \pi_n h(\Omega_n z) - \frac{1}{\lambda} \operatorname{tr}\big[(I - \Omega)\, z z^\top (I - \Omega)^\top K\big]$$

Legend: q = dual variables, z = latent label, π = weights.
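For intuition about the constraint set on Ω: when the weights π are uniform, the constraints Ω ≥ 0, Ω^⊤π = π, Ω1_N = 1_N say exactly that Ω is doubly stochastic, and Sinkhorn scaling is a standard way to produce such a matrix. The sketch below only illustrates the feasible set under that uniform-weight assumption; it is not the authors' optimization step.

```python
import numpy as np

def sinkhorn_doubly_stochastic(M, n_iters=200):
    """Scale a positive matrix M to (approximately) doubly stochastic form.
    With uniform weights pi this matches the constraint set on Omega;
    an illustration only, not the authors' update."""
    Omega = np.asarray(M, dtype=float).copy()
    for _ in range(n_iters):
        Omega /= Omega.sum(axis=1, keepdims=True)   # enforce Omega 1_N = 1_N
        Omega /= Omega.sum(axis=0, keepdims=True)   # enforce Omega^T 1_N = 1_N
    return Omega

Omega = sinkhorn_doubly_stochastic(np.random.rand(6, 6) + 1e-3)
```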

SLIDE 18

Tight upper-bound on the entropy

The entropy term is not convex in Ω and z. We use the following upper bound:

$$\sum_{i \in I} \sum_{n \in N_i} \pi_n h(\Omega_n z) \le -\sum_n \pi_n h(\Omega_n) + H(z) + C_0$$

This upper bound is tight for discrete values of z. Substituting it into the objective cancels the −H(z) term and leads to:

$$\min_{z \in \mathcal{P}^N} \ \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi, \ \Omega 1_N = 1_N}} \ -\sum_n \pi_n h(\Omega_n) - \frac{1}{\lambda} \operatorname{tr}\big[(I - \Omega)\, z z^\top (I - \Omega)^\top K\big]$$

Legend: h = entropy, z = latent label, π = weights.

SLIDE 19

Reparametrization in z

The cost function depends on z only through zz^T, so we set Z = zz^T:

$$\min_{Z \in \mathcal{Z}} \ \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi, \ \Omega 1_N = 1_N}} \ -\sum_n \pi_n h(\Omega_n) - \frac{1}{\lambda} \operatorname{tr}\big[(I - \Omega)\, Z\, (I - \Omega)^\top K\big]$$

The cost function is a maximum of functions linear in Z, hence convex in Z. We relax the set $\mathcal{Z}$ to a convex outer approximation, the elliptope:

$$\mathcal{E}_N = \{ Z \in \mathbb{R}^{N \times N} \mid \operatorname{diag}(Z) = 1_N, \ Z \succeq 0 \}$$

Legend: h = entropy, z = latent label, π = weights.
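A quick sanity check of the elliptope, and of why it contains the rank-one matrices zz^⊤ (here for a binary ±1 encoding of z, our illustrative assumption):

```python
import numpy as np

def in_elliptope(Z, tol=1e-8):
    """Check membership in E_N = {Z : diag(Z) = 1_N, Z PSD} -- a sketch."""
    unit_diag = np.allclose(np.diag(Z), 1.0, atol=tol)
    psd = np.linalg.eigvalsh(Z).min() >= -tol
    return unit_diag and psd

# Any sign vector z gives a rank-one point zz^T inside the elliptope.
z = np.sign(np.random.randn(8))
assert in_elliptope(np.outer(z, z))
```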

SLIDE 20

Optimization - Overview

$$\min_{Z \in \mathcal{E}_N} \ \max_{\substack{\Omega \in \mathbb{R}_+^{N \times N} \\ \Omega^\top \pi = \pi, \ \Omega 1_N = 1_N}} \ -\sum_n \pi_n h(\Omega_n) - \frac{1}{\lambda} \operatorname{tr}\big[(I - \Omega)\, Z\, (I - \Omega)^\top K\big]$$

It is a saddle-point problem with no explicit form for either the primal or the dual.

Optimization in a nutshell:
  • We use a proximal method with a Kullback-Leibler divergence for both the maximization and the minimization,
  • The overall complexity is O(N³).
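The slides do not spell out the update formulas, but the basic ingredient of a proximal method with a KL divergence is the entropic (mirror-descent) step below, shown on a probability vector. This is a generic sketch of the flavor of update, not the authors' exact scheme.

```python
import numpy as np

def kl_prox_step(q, grad, step):
    """One KL-proximal step on the simplex:
    argmin_p <grad, p> + (1/step) * KL(p || q),
    whose closed form is a multiplicative (entropic) update."""
    p = q * np.exp(-step * grad)
    return p / p.sum()

q = np.full(4, 0.25)
q = kl_prox_step(q, grad=np.array([1.0, 0.0, 0.0, 0.0]), step=0.5)
```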

SLIDE 21

Rounding

We use k-means clustering on the eigenvectors associated with the k largest eigenvalues of Z. We then run an EM procedure on the original problem to obtain a better solution.
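A minimal sketch of this rounding step: eigendecompose the relaxed Z, embed with the top-k eigenvectors (scaled by the square roots of the eigenvalues, a common choice we assume here), and cluster with k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def round_solution(Z, k, random_state=0):
    """Round a relaxed solution Z: k-means on the eigenvectors of the
    k largest eigenvalues (a sketch of the rounding described above)."""
    eigvals, eigvecs = np.linalg.eigh(Z)           # ascending eigenvalues
    V = eigvecs[:, -k:] * np.sqrt(np.maximum(eigvals[-k:], 0.0))
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(V)
    return labels  # then refine with EM on the original problem
```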

SLIDE 22

Results - Overview

  • Toy example on an unsupervised problem,
  • MIL and SSL results on classical datasets.

SLIDE 23

Unsupervised classification

[Figure: five panels (a)-(e); (b) K = xx^T, (c) the matrix Z obtained with Bach and Harchaoui (2007), (d) with no intercept, and (e) with our method.]

Bach and Harchaoui (2007) use the square loss instead of the logistic loss. Our method with no intercept is similar to Guo and Schuurmans (2008).

SLIDE 24

Multiple-instance learning

Algorithm                               | Musk1       | Tiger       | Elephant    | Fox         | Trec1
Citation k-NN (Wang and Zucker, 2000)   | 91.3        | 78.0        | 80.5        | 60.0        | 87.0
EM-DD (Zhang and Goldman, 2001)         | 84.8        | 72.1        | 78.3        | 56.1        | 85.8
mi-SVM (Andrews et al., 2003)           | 87.4        | 78.9        | 82.0        | 58.2        | 93.6
MI-SVM (Andrews et al., 2003)           | 77.9        | 84.0        | 81.4        | 59.4        | 93.9
PPMM Kernel (Wang et al., 2008)         | 95.6        | 80.2        | 82.4        | 60.3        | 93.3
Rand. init / Uniform                    | 71.1        | 69.0        | 74.5        | 61.0        | 81.3
Rand. init / Weight                     | 76.6        | 71.0        | 74.5        | 59.0        | 84.4
No inter. / Uniform                     | 75.0 ± 19.5 | 67.8 ± 10.4 | 77.3 ± 9.2  | 51.3 ± 6.4  | 87.5 ± 5.2
No inter. / Weight                      | 77.8 ± 15.7 | 71.0 ± 10.8 | 78.9 ± 9.8  | 52.1 ± 5.0  | 87.3 ± 5.6
Ours / Uniform                          | 84.4 ± 14.0 | 73.0 ± 8.2  | 86.7 ± 3.5  | 57.5 ± 5.9  | 93.0 ± 4.7
Ours / Weight                           | 87.7 ± 13.3 | 78.0 ± 5.4  | 83.9 ± 4.2  | 62.5 ± 6.4  | 89.0 ± 6.2

Figure: Accuracy of our approach and of standard methods for MIL. We evaluate our method with and without the intercept and with two types of weights. In bold, the significantly best performances.

SLIDE 25

Semi-supervised learning

      | Dataset | Lin. | Nonlin. | Ent-Reg. | Ours (Linear) | Ours (Nonlinear)
l=10  | Digit1  | 79.4 | 82.2    | 75.6     | 84.6 ± 0.7    | 75.5 ± 2.9
      | BCI     | 50.0 | 50.9    | 52.3     | 52.2 ± 1.1    | 50.2 ± 1.1
      | g241c   | 79.1 | 75.3    | 52.6     | 87.2 ± 0.2    | 87.3 ± 0.4
      | g241d   | 53.7 | 49.9    | 54.2     | 54.4 ± 9.1    | 53.2 ± 10.1
      | USPS    | 69.3 | 74.8    | 79.8     | 57.1 ± 13.3   | 79.5 ± 0.5
l=100 | Digit1  | 82.0 | 93.9    | 92.7     | 91.2 ± 1.7    | 93.31 ± 1.0
      | BCI     | 57.3 | 66.8    | 71.1     | 78.1 ± 2.3    | 64.0 ± 0.8
      | g241c   | 81.8 | 81.5    | 79.0     | 86.0 ± 0.7    | 85.1 ± 0.7
      | g241d   | 76.2 | 77.6    | 74.6     | 77.1 ± 1.7    | 73.0 ± 3.0
      | USPS    | 78.9 | 90.2    | 87.8     | 71.6 ± 2.6    | 73.0 ± 0.2

Figure: Comparison in accuracy on SSL databases with methods proposed in Chapelle et al. (2006). In bold, the significantly best performances.

SLIDE 26

Conclusion

  • A general weakly supervised framework,
  • A convex formulation,
  • Competitive performance on different problems.

Limitation:
  • Cannot scale to more than ~10,000 instances.

How it should be used:
  • On a subset of the data, as a robust initialization for EM,
  • After a dimension-reduction algorithm (such as k-means).

SLIDE 27

Thank you.
