

SLIDE 1

Multimodal semi-supervised learning for image classification

Matthieu Guillaumin, Jakob Verbeek, Cordelia Schmid

LEAR team, INRIA Grenoble, France

SLIDE 2

Motivation and goal

Images often come with additional textual information: videos with scripts and subtitles, ...

SLIDE 3

Goal of this work

Visual object category recognition, leveraging user tags available on Flickr:

Tags: wow, San Fransisco, Golden Gate Bridge, SBP2005, top-f50, fog, SF Chronicle, 96 hours

SLIDE 4

Overview of the talk

(A) Data sets and features
(B) Learning scenarios using images with tags
(1) Supervised multimodal classification
(2) Multimodal semi-supervised scenario
(3) Weakly supervised learning

SLIDE 5

Data sets of images with tags

PASCAL VOC 07, ≈10000 images, 804 Flickr tags, 20 classes.

Example 1: Flickr tags "india"; class label: cow. Example 2: Flickr tags "aviation, airplane, airport"; class label: aeroplane.

MIR Flickr, 25000 images, 457 Flickr tags, 38 classes.

Example 1: Flickr tags "desert, nature, landscape, sky"; class labels: clouds, plant life, sky, tree. Example 2: Flickr tags "rose, pink"; class labels: flower, plant life.

SLIDE 6

Flickr tags as textual features

Restrict to the most frequent tags.

[Log-log plot: tag frequency vs. sorted tag index for the PASCAL VOC'07 tags]

Binary vector of tag presence/absence. Linear kernel counts the number of shared tags.
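The textual features above can be sketched in NumPy; the mini-vocabulary here is purely illustrative (the real one keeps the 804 or 457 most frequent Flickr tags):

```python
import numpy as np

# Hypothetical mini-vocabulary; the real vocabulary keeps the 804
# (PASCAL VOC'07) or 457 (MIR Flickr) most frequent tags.
vocab = ["dog", "puppy", "pet", "car", "train", "sky"]

def tag_vector(tags):
    """Binary presence/absence vector over the tag vocabulary."""
    present = set(tags)
    return np.array([1.0 if t in present else 0.0 for t in vocab])

def linear_tag_kernel(tags_a, tags_b):
    """Linear kernel on binary vectors = number of shared tags."""
    return float(tag_vector(tags_a) @ tag_vector(tags_b))

# "dog" and "pet" are shared, so the kernel value is 2.
k = linear_tag_kernel(["dog", "puppy", "pet"], ["dog", "pet", "car"])
```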

SLIDE 7

Combination of several visual features

RBF kernel on the average distance between 15 image representations:

Bag-of-features histograms: Harris interest points and dense grid, SIFT [Lowe, 2004] and Hue [van de Weijer & Schmid, 2006] descriptors, K-means quantization.

Color histograms: RGB, HSV and Lab colorspaces, 16 bins per channel.

GIST [Oliva & Torralba, 2001].

Spatial layouts: global, and 3 horizontal regions [Lazebnik et al., 2006]; only global for GIST.
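The kernel construction above can be sketched as follows; the mean-distance bandwidth is a common heuristic assumed here for illustration, not necessarily the talk's exact choice:

```python
import numpy as np

def rbf_on_average_distance(dist_matrices, sigma=None):
    """RBF kernel on the average of several per-feature distance
    matrices. The bandwidth defaults to the mean distance, a common
    heuristic; the exact choice in the talk may differ."""
    d_bar = np.mean(dist_matrices, axis=0)   # average over channels
    if sigma is None:
        sigma = d_bar.mean()                 # heuristic bandwidth
    return np.exp(-d_bar / sigma)

# Two toy 3x3 distance matrices standing in for two of the 15 channels.
d1 = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
d2 = np.array([[0., 2., 4.], [2., 0., 2.], [4., 2., 0.]])
K = rbf_on_average_distance([d1, d2])
```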

SLIDE 8

Learning scenarios using images with tags

1. Supervised multimodal classification
2. Multimodal semi-supervised scenario
3. Weakly supervised learning

SLIDE 9

Supervised multimodal classification

Flickr tags serve as additional features for classification. Tags are also available at test time; MKL combines the visual and textual kernels.

Training: DOG (+1) vs. not DOG (−1); test: DOG? Example image tags: greyhound running; athlete sport; horse vermont; cars racing; dog rottweiler pets; computer dual monitor; yacht; canine pet; locomotive; black puppy cute dog.
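A sketch of the kernel combination: real MKL learns the weights jointly with the classifier; the 2x2 kernels and the weights `beta` below are fixed toy values:

```python
import numpy as np

# Toy 2x2 visual and textual kernels; real MKL learns the combination
# weights jointly with the classifier, here they are fixed.
K_visual = np.array([[1.0, 0.2],
                     [0.2, 1.0]])
K_tags = np.array([[1.0, 0.8],
                   [0.8, 1.0]])
beta = np.array([0.6, 0.4])        # kernel weights, sum to 1
K_mkl = beta[0] * K_visual + beta[1] * K_tags
```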

SLIDE 10

Results of multimodal classification on PASCAL VOC 2007

[Bar chart: per-class Average Precision on PASCAL VOC'07 (aeroplane through tvmonitor, plus the mean) for tags only, image only, and image+tags]

Mean AP: tags (0.43) < image (0.53) < image+tags (0.67). Winner of PASCAL VOC'07: 0.59. Similar observations hold for MIR Flickr.

SLIDE 11

Learning scenarios using images with tags

1. Supervised multimodal classification
2. Multimodal semi-supervised scenario
3. Weakly supervised learning

SLIDE 12

Multimodal semi-supervised scenario

Large pool of additional unlabeled images with tags. Tags are NOT available at test time: purely visual categorization.

Training: DOG (+1) and not DOG (−1) labeled images, plus a large unlabeled pool; test: DOG? Example image tags: greyhound running; athlete sport; vermont horse; dog rottweiler pets; canine pet; puppy dog; computer dual monitor; railroads train; locomotive; car auto.

SLIDE 13

Three-step learning process

In a nutshell, predict labels for the unlabeled images:

1. Train an MKL classifier on labeled images and tags.
2. Score the unlabeled data.
3. Train an image-only classifier, with two options:

SVM: use the unlabeled data with labels taken from the sign of the MKL score; using only the sign dismisses the confidence of the classification.

LSR: least-squares regression of the MKL scores using the visual kernel, regularized using a KPCA projection.
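The three steps above can be sketched in NumPy. Option LSR is shown; a plain ridge term stands in for the KPCA-projection regularizer, and all data (features, MKL scores) are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a visual kernel over labeled+unlabeled images, and MKL
# scores for the pool (step 2). The real pipeline uses the kernels and
# MKL classifier described earlier; everything here is synthetic.
n = 20
X = rng.normal(size=(n, 2))                        # fake image features
K_visual = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))
mkl_scores = X[:, 0]                               # pretend MKL output

# Step 3, option LSR: regress the MKL scores from the visual kernel.
# Ridge regularization stands in for the paper's KPCA projection.
lam = 1e-2
alpha = np.linalg.solve(K_visual + lam * np.eye(n), mkl_scores)
visual_scores = K_visual @ alpha                   # image-only predictions
```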

SLIDE 14

Experimental comparison

Baselines:

1. Supervised, image-only: SVM.
2. Semi-supervised, image-only: SVM+SVM.
3. Semi-supervised, multimodal: co-training [Blum & Mitchell, 98], with an SVM on images and an SVM on tags.

Our three-step learning approach (semi-supervised, multimodal):

1. MKL learned on labeled images with tags, followed by a visual-only SVM trained on labeled and unlabeled images: MKL+SVM.
2. MKL, followed by LSR: MKL+LSR.

SLIDE 15

Results of semi-supervised learning

[Plots: mean AP (20%–45%) vs. number of labeled training examples (40, 100, 200) on PASCAL VOC'07 and MIR Flickr, for SVM, SVM+SVM, Co-training, MKL+SVM, and MKL+LSR]

SVM+SVM is worse than the supervised SVM baseline. With little supervision, MKL+LSR is significantly better; with more supervision, the differences shrink.

SLIDE 16

Learning scenarios using images with tags

1. Supervised multimodal classification
2. Multimodal semi-supervised scenario
3. Weakly supervised learning

SLIDE 17

Weakly supervised scenario

For learning: no manual annotations, only Flickr tags; the other tags are used as additional features. For evaluation: ground-truth labels. Test: DOG?

Example image tags: greyhound running; athlete sport; vermont horse; dog rottweiler pets; canine pet; locomotive; puppy dog; computer dual monitor; railroads train.

SLIDE 18

Weakly supervised setting

Tags are noisy annotations:

Tag presence is relatively clean (82.0% precision); tag absence is relatively uninformative (17.8% recall).

Our approach, modified:

1. Learn a multimodal MKL with tag annotations.
2. Rank the training images and remove those that yield the highest MKL scores but do not have the tag.
3. Fit LSR.

Baseline: visual-only SVM learned on images with tag annotations.
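Step 2 of the modified approach (removing high-scoring tag-absent training images, which are likely false negatives) can be sketched as follows; all scores and counts are illustrative:

```python
import numpy as np

# Illustrative MKL scores and tag-presence flags for 5 training images.
scores = np.array([0.9, -1.2, 2.1, 0.3, -0.5])
has_tag = np.array([True, False, False, False, False])
n_remove = 2                                  # how many negatives to drop

neg_idx = np.flatnonzero(~has_tag)            # images without the tag
# Rank tag-absent images by descending MKL score; the top-scoring ones
# are likely false negatives, so they are removed from training.
to_remove = neg_idx[np.argsort(-scores[neg_idx])][:n_remove]
keep = np.setdiff1d(np.arange(len(scores)), to_remove)
```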

SLIDE 19

Results on 18 classes of MIR Flickr

[Plot: mean AP (37%–41%) vs. number of removed training negatives (2000–10000), for the SVM baseline and MKL+LSR]

On average, MKL+LSR outperforms the SVM baseline: the SVM baseline is better for 4 classes (up to +5.6%); MKL+LSR is better for 14 classes (up to +9.8%).

SLIDE 20

Conclusion

We considered using Flickr tags in three scenarios:

1. Supervised classification,
2. Semi-supervised learning of visual classifiers,
3. Weakly supervised learning of visual classifiers.

We proposed a three-step learning process:

1. Training of a multimodal classifier on the labeled data,
2. Classification of the unlabeled data,
3. Regression of the multimodal classifier's scores.

Our multimodal approach using Flickr tags improves over a visual-only SVM in all three scenarios, and over co-training for semi-supervised learning.

SLIDE 21

Multimodal semi-supervised learning for image classification

Matthieu Guillaumin, Jakob Verbeek, Cordelia Schmid

LEAR team, INRIA Grenoble, France