
SLIDE 1

Active Learning for Sparse Bayesian Multilabel Classification

Deepak Vasisht, MIT & IIT Delhi; Andreas Damianou, University of Sheffield; Manik Varma, MSR India; Ashish Kapoor, MSR Redmond

SLIDE 2

Multilabel Classification

Given a set of datapoints, the goal is to annotate them with a set of labels.

SLIDE 6

Multilabel Classification

Given a set of datapoints, the goal is to annotate them with a set of labels.

Example labels: Iraq, Flowers, Human, Brick, Sea, Sun, Sky

xᵢ ∈ Rᵈ : feature vector, where d is the dimension of the feature space

SLIDE 7

Training


SLIDE 9

Training

WikiLSHTC has 325k labels. Good luck with that!!


SLIDE 11

Training Is Expensive

  • Training data can also be very expensive to obtain, e.g. genomic or chemical data
  • Getting each label incurs additional cost

Need to reduce the required training data.

SLIDE 12

Active Learning

[Figure: an N × L grid of datapoints (rows 1 … N) against labels (columns 1 … L), with example labels Iraq, Flowers, Sun, Sky]

SLIDE 15

Active Learning

Which data points should I label?

SLIDE 16

Active Learning

For a particular datapoint, which labels should I reveal?

SLIDE 17

Active Learning

Can I choose datapoint-label pairs to annotate?

SLIDE 18

In this talk

  • An active learner for multilabel classification that:
  • Answers all three questions above
  • Is computationally cheap
  • Is non-myopic and near-optimal
  • Incorporates label sparsity
  • Achieves higher accuracy than the state of the art
SLIDE 19

Classification

SLIDE 20

Classification Model*

[Graphical model: the feature vector xᵢ maps through W to a compressed space zᵢ¹ … zᵢᵏ, which Φ connects to the labels yᵢ¹ … yᵢᴸ; each label yᵢʲ carries a sparsity precision αᵢʲ]

*Kapoor et al., NIPS 2012

SLIDE 24

Classification Model: Potentials

f_xᵢ(W, zᵢ) = exp(−||Wᵀxᵢ − zᵢ||² / 2σ²)

g_Φ(yᵢ, zᵢ) = exp(−||Φyᵢ − zᵢ||² / 2χ²)

SLIDE 26

Classification Model: Priors

yᵢʲ ∼ N(0, 1/αᵢʲ)

αᵢʲ ∼ Γ(αᵢʲ; a₀, b₀)

SLIDE 28

Sparsity Priors

a₀ = 10⁻⁶, b₀ = 10⁻⁶
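To see why these near-zero hyperparameters induce sparsity, here is a hedged numpy sketch: a toy ARD linear regression, not the paper's multilabel model, with all data, shapes, and the noise variance invented for illustration. The Gamma–Gaussian hierarchy with a₀ = b₀ = 10⁻⁶ prunes the weights the data does not support.

```python
import numpy as np

rng = np.random.default_rng(1)
a0 = b0 = 1e-6  # the near-uninformative Gamma hyperparameters from the slide

# Toy regression: only the first 2 of 10 features are relevant
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]
t = X @ w_true + 0.1 * rng.normal(size=n)

sigma2 = 0.01  # assume the noise variance is known
alpha = np.ones(d)
for _ in range(50):
    # Gaussian posterior over the weights given the current precisions
    S = np.linalg.inv(X.T @ X / sigma2 + np.diag(alpha))
    mu = S @ X.T @ t / sigma2
    # Variational update of each Gamma precision (standard ARD form)
    alpha = (1.0 + 2.0 * a0) / (mu**2 + np.diag(S) + 2.0 * b0)

# Precisions of the 8 irrelevant weights blow up, driving their means to ~0
print(np.round(mu, 3))
```

The same mechanism, applied per label yᵢʲ, is what lets the model exploit label sparsity.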


SLIDE 30

Classification Model

[Graphical model as above, with W, Φ, and the sparsity precisions αᵢʲ]

yᵢʲ ∼ N(0, 1/αᵢʲ),  αᵢʲ ∼ Γ(αᵢʲ; a₀, b₀)

f_xᵢ(W, zᵢ) = exp(−||Wᵀxᵢ − zᵢ||² / 2σ²)

g_Φ(yᵢ, zᵢ) = exp(−||Φyᵢ − zᵢ||² / 2χ²)

Problem: Exact inference is intractable.
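Putting the two potentials into code makes the shapes concrete. A minimal sketch with invented dimensions (d features, k compressed dimensions, L labels) and random placeholder values for x, W, Φ, y, z:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, L = 8, 3, 12               # feature dim, compressed dim, number of labels
x = rng.normal(size=d)
W = rng.normal(size=(d, k))      # maps features into the compressed space
Phi = rng.normal(size=(k, L))    # compression matrix acting on the label vector
y = rng.integers(0, 2, size=L).astype(float)
z = rng.normal(size=k)
sigma2, chi2 = 1.0, 1.0

# f_x(W, z)   = exp(-||W^T x - z||^2 / (2 sigma^2))
# g_Phi(y, z) = exp(-||Phi y - z||^2 / (2 chi^2))
f = np.exp(-np.sum((W.T @ x - z) ** 2) / (2.0 * sigma2))
g = np.exp(-np.sum((Phi @ y - z) ** 2) / (2.0 * chi2))
print(f, g)
```

Both potentials are unnormalized Gaussians in z, which is what makes the later integrations tractable.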

SLIDE 31

Inference: Variational Bayes

[Graphical model as above; the variational posteriors over the W/z layer and the label layer are approximate Gaussians, and the posterior over the precisions αᵢʲ is an approximate Gamma]

SLIDE 36

Active Learning Criteria

  • Entropy is a measure of uncertainty. For a random variable X, the entropy H is given by:

H(X) = −Σᵢ P(xᵢ) log P(xᵢ)

  • Picks points far apart from each other
  • For a Gaussian process, H = ½ log(|Σ|) + const
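The Gaussian case is a one-liner; a small sketch (function name is mine) of H = ½ log|Σ| plus the usual additive constant, checked against the closed-form 1-D entropy:

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma): 0.5*log|Sigma| + (d/2)(1 + log 2*pi)."""
    d = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * logdet + 0.5 * d * (1.0 + np.log(2.0 * np.pi))

# For a 1-D Gaussian with variance s2 this reduces to 0.5*log(2*pi*e*s2)
s2 = 2.0
h = gaussian_entropy(np.array([[s2]]))
print(h)
```

For selection purposes only the ½ log|Σ| term matters; the constant cancels in any comparison.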

SLIDE 37

Active Learning Criteria

  • Mutual information measures the reduction in uncertainty over the unlabeled space:

MI(A, B) = H(A) − H(A|B)

  • Used successfully in past work for regression
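For jointly Gaussian variables this definition reduces to a ratio of determinants, MI(A, B) = ½ log(|Σ_AA||Σ_BB| / |Σ|). A self-contained sketch (function name is mine), checked against the textbook bivariate result MI = −½ log(1 − ρ²):

```python
import numpy as np

def gaussian_mi(Sigma, idx_a, idx_b):
    """MI(A, B) = H(A) - H(A|B) for jointly Gaussian blocks of Sigma."""
    Saa = Sigma[np.ix_(idx_a, idx_a)]
    Sbb = Sigma[np.ix_(idx_b, idx_b)]
    Sab = Sigma[np.ix_(idx_a + idx_b, idx_a + idx_b)]
    return 0.5 * (np.linalg.slogdet(Saa)[1] + np.linalg.slogdet(Sbb)[1]
                  - np.linalg.slogdet(Sab)[1])

# Two unit-variance variables with correlation rho
rho = 0.8
Sigma = np.array([[1.0, rho], [rho, 1.0]])
mi = gaussian_mi(Sigma, [0], [1])
print(mi)
```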

SLIDE 38

Active Learning: Mutual Information

  • We have already modeled the distribution over the labels, Y, as a Gaussian process
  • The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space:

A* = argmax_{A ⊆ U} H(Y_{U\A}) − H(Y_{U\A} | Y_A)

SLIDE 39

Active Learning: Mutual Information

A* = argmax_{A ⊆ U} H(Y_{U\A}) − H(Y_{U\A} | Y_A)

Problem: Variance is not preserved across layers.

SLIDE 40

Idea: Collapsed Variational Bayes


SLIDE 44

Idea: Collapsed Variational Bayes

[Graphical model as above]

yᵢʲ ∼ N(0, 1/αᵢʲ),  αᵢʲ ∼ Γ(αᵢʲ; a₀, b₀)

f_xᵢ(W, zᵢ) = exp(−||Wᵀxᵢ − zᵢ||² / 2σ²)

g_Φ(yᵢ, zᵢ) = exp(−||Φyᵢ − zᵢ||² / 2χ²)

Integrate to get a Gaussian distribution over Y; use Variational Bayes for sparsity.
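A sketch of the collapsing idea in the simplest possible setting (a single linear layer with a standard Gaussian prior on the weights; this is an illustration, not the paper's derivation): integrating the weights out analytically leaves a joint Gaussian over the outputs whose covariance is available in closed form, which is exactly what the entropy-based criteria need. The closed form is verified here by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 3
X = rng.normal(size=(n, d))
sigma2 = 0.1

# z_i = w^T x_i + eps_i with w ~ N(0, I): integrating w out gives
# z ~ N(0, X X^T + sigma2 * I) -- a Gaussian with a closed-form covariance
K = X @ X.T + sigma2 * np.eye(n)

# Monte Carlo check: sample many (w, eps) pairs and compare covariances
W = rng.normal(size=(d, 200000))
Z = X @ W + np.sqrt(sigma2) * rng.normal(size=(n, 200000))
err = np.max(np.abs(np.cov(Z) - K))
print(err)
```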


SLIDE 46

Active Learning: Mutual Information

A* = argmax_{A ⊆ U} H(Y_{U\A}) − H(Y_{U\A} | Y_A)

Problem: Computing mutual information still needs exponential time.

SLIDE 47

Solution: Approximate Mutual Information

  • Approximate the final distribution over Y by a Gaussian
  • Use the Gaussian to estimate the mutual information
  • Theorem 1: M̂I → MI as a₀ → 0, b₀ → 0

SLIDE 48

Active Learning: Mutual Information

A* = argmax_{A ⊆ U} H(Y_{U\A}) − H(Y_{U\A} | Y_A)

Problem: The subset selection problem is NP-complete.

SLIDE 49

Solution: Use Submodularity

  • Under some weak conditions, the objective is submodular
  • Submodularity ensures that the greedy solution is within a constant factor of the optimal solution
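The diminishing-returns property behind that guarantee can be checked numerically. The sketch below uses the Gaussian entropy set function H(A) = ½ log|Σ_AA|, a textbook submodular function standing in for the paper's MI objective (covariance and index sets are invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
X = rng.normal(size=(n, n))
Sigma = X @ X.T + 0.5 * np.eye(n)  # a random positive-definite covariance

def H(A):
    """Gaussian entropy of a subset, up to additive constants: 0.5*log|Sigma_AA|."""
    if not A:
        return 0.0
    return 0.5 * np.linalg.slogdet(Sigma[np.ix_(A, A)])[1]

# Diminishing returns: the gain from adding x to a set A is at least the
# gain from adding x to any superset B -- the defining submodular property
A, B, x = [0], [0, 1, 2], 3
gain_A = H(A + [x]) - H(A)
gain_B = H(B + [x]) - H(B)
print(gain_A >= gain_B)
```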

SLIDE 50

Algorithm

  • Input: feature vectors for a set of unlabeled instances, U, and a budget n
  • Iteratively add the datapoint x to the labeled set A that leads to the maximum increase in MI:

x ← argmax_{x ∈ U\A} M̂I(A ∪ {x}) − M̂I(A)
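The greedy loop above can be sketched end to end. This illustration uses the exact Gaussian MI on a small toy covariance rather than the paper's approximation M̂I, and all names and data are invented:

```python
import numpy as np

def logdet_entropy(S):
    return 0.5 * np.linalg.slogdet(S)[1]  # Gaussian entropy up to constants

def mi_score(Sigma, A):
    """MI between the selected set A and the remaining indices, joint Gaussian."""
    R = [i for i in range(Sigma.shape[0]) if i not in A]
    if not A or not R:
        return 0.0
    SA, SRA = Sigma[np.ix_(A, A)], Sigma[np.ix_(R, A)]
    # H(R | A) via the Schur complement of the A-block
    cond = Sigma[np.ix_(R, R)] - SRA @ np.linalg.solve(SA, SRA.T)
    return logdet_entropy(Sigma[np.ix_(R, R)]) - logdet_entropy(cond)

def greedy_select(Sigma, budget):
    A = []
    for _ in range(budget):
        best = max((i for i in range(Sigma.shape[0]) if i not in A),
                   key=lambda i: mi_score(Sigma, A + [i]))
        A.append(best)
    return A

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
Sigma = X @ X.T + 0.1 * np.eye(30)  # toy covariance over 30 candidate points
sel = greedy_select(Sigma, 3)
print(sel)
```

Each greedy step costs one marginal-gain evaluation per remaining candidate, which is what keeps the method computationally cheap relative to exhaustive subset search.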

SLIDE 51

Performance Evaluation

SLIDE 52

Datasets

Dataset    Type     Instances  Features  Labels
Yeast      Biology  2417       103       14
MSRC       Image    591        1024      23
Medical    Text     978        1449      45
Enron      Text     1702       1001      53
Mediamill  Video    43907      120       101
RCV1       Text     6000       47236     101
Bookmarks  Text     87856      2150      208
Delicious  Text     16105      500       983

SLIDE 53

Setup

  • Unlabeled pool size: 4000 points; test size: 2000 points
  • For smaller datasets, the entire dataset was in the unlabeled pool, with testing on all unlabeled data
  • Initial seed size: 500 points
SLIDE 54

Compared Algorithms

  • MIML: Mutual Information for Multilabel Classification (proposed method)
  • Uncert: Uncertainty sampling (entropy based)
  • Rand: Random sampling
  • Li-Adaptive*: SVM-based adaptive active learning

*Li et al., IJCAI 2013

SLIDE 55

Traditional Active Learning

Which data points should I label?

SLIDE 56

Traditional Active Learning

SLIDE 57

Traditional Active Learning

[Plot: Mean Precision vs. #points (50–250) on Delicious (983 labels) and Yeast (14 labels), comparing Rand, Li-Adaptive, Uncert, and MIML]

SLIDE 60

Active Learning

For a particular datapoint, which labels should I reveal?

SLIDE 61

Active Diagnosis

SLIDE 62

Active Diagnosis

[Plot: F score vs. #labels (5–30) on RCV1, comparing Rand, Uncert, and MIML]

SLIDE 64

Generalized Active Learning

Can I choose datapoint-label pairs to annotate?

SLIDE 65

Generalized Active Learning

SLIDE 66

Generalized Active Learning

[Plot: F score vs. #points (5–30) on RCV1, comparing Rand, Uncert, and MIML]

SLIDE 68

Time Complexity

Dataset    Labels  MIML     Li-Adaptive
Yeast      14      3m 25s   1m 54s
Mediamill  101     41m 29s  54m 35s
RCV1       101     30m 45s  37m 35s
Bookmarks  208     48m 58s  3h 57m
Delicious  983     1h 11m   20h 15m

SLIDE 69

Related Work

  • SVM-based active learning: Li et al. [IJCAI 2013], Yang et al. [KDD 2009], Esuli et al. [ECIR 2009], Li et al. [ICIP 2004], …
  • Mutual information: Krause et al. [UAI 2005], Krause et al. [JMLR 2008], Singh et al. [JAIR 2009], …

SLIDE 70

Conclusion

  • Proposed mutual-information-based active learning for multilabel classification
  • Used Collapsed Variational Bayes to infer variances
  • Gave a theoretical analysis of the mutual information approximation, showing that it is near-optimal
  • Showed significant empirical improvements over the state of the art