On the use of semantic features for the semantic indexing task
On the use of semantic features for the semantic indexing task. Bahjat Safadi, Nadia Derbas, Abdelkader Hamadi, Mateusz Budnik, Philippe Mulhem and Georges Quénot, UJF-LIG, and many other people from the IRIM group of GDR 720 ISIS. 10 November
Outline
- System overview
- Semantic features
- Contrast experiments
- Engineered versus learned features
- Conclusion
Main runs scores 2014 (from NIST)
Median = 0.206 (mean InfAP)
[Bar chart: mean InfAP for all runs, ordered from D_MediaMill.14_1 down to A_FIU_UM.14_3. * non-LIG runs submitted in 2013, scored against 2014 test data (progress runs); * LIG (Quaero) runs submitted in 2013, scored against 2014 test data (progress runs); * LIG runs submitted in 2014, scored against 2014 test data (main runs)]
Basic classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Classification → Late fusion → Classification score]
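The late-fusion step above amounts to combining per-modality classification scores into one score per shot. A minimal sketch, assuming a simple weighted average; the weights here are illustrative, not the ones used in the actual system:

```python
# Minimal sketch of late fusion: per-modality classification scores for
# one shot are combined by a weighted average. The weights here are
# illustrative, not the ones used in the actual system.

def late_fusion(scores, weights):
    """scores, weights: dicts keyed by modality ('image', 'audio', 'text')."""
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

shot_scores = {"image": 0.8, "audio": 0.3, "text": 0.6}
fused = late_fusion(shot_scores, {"image": 0.6, "audio": 0.1, "text": 0.3})
```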
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Classification score]
+ hierarchical fusion [Strat et al., ECCV/IFCVCR workshop 2012, Springer 2014]
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score]
+ Temporal re-ranking [Safadi et al., CIKM 2011; Wang et al., TV 2009]: update shot scores considering other shots’ scores for the same concept
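The idea can be sketched as follows (a toy version, not the exact formula of the cited papers): consecutive shots of a video often share concepts, so each shot's score for a concept is pulled toward the best score found among its temporal neighbors.

```python
# Toy temporal re-scoring for one concept over the shots of one video:
# each score is pulled toward the maximum score in a small temporal
# neighborhood. alpha and window are illustrative parameters, not the
# tuned values of the published method.

def temporal_rescore(scores, alpha=0.5, window=1):
    n = len(scores)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        context = max(scores[lo:hi])          # best score nearby (incl. self)
        out.append((1 - alpha) * scores[i] + alpha * context)
    return out

smoothed = temporal_rescore([0.1, 0.9, 0.2])  # the middle shot boosts its neighbors
```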
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score]
+ Descriptor optimization [Safadi et al., MTAP 2014]: combination of PCA-based dimensionality reduction and pre- and post-power transformations
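A minimal sketch of this transformation chain, with illustrative exponents and output dimension rather than the tuned values of the paper:

```python
import numpy as np

# Sketch of descriptor optimization: a pre power transform, PCA-based
# dimensionality reduction, then a post power transform. Exponents and
# target dimension are illustrative, not the tuned values.

def power_transform(X, p):
    # signed power: preserves sign, compresses large magnitudes for p < 1
    return np.sign(X) * np.abs(X) ** p

def optimize_descriptors(X, dim=16, pre=0.5, post=0.5):
    X = power_transform(X, pre)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    X = (X - mean) @ Vt[:dim].T              # PCA projection onto `dim` axes
    return power_transform(X, post)

rng = np.random.default_rng(0)
Z = optimize_descriptors(rng.random((100, 128)), dim=16)
```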
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score, with a conceptual feedback loop]
+ conceptual feedback [Hamadi et al., MTAP, 2014]
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score, with a conceptual feedback loop]
+ Conceptual re-ranking [Hamadi et al., MTAP 2014]: update concept scores considering other concepts’ scores for the same shot
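As a toy illustration (not the learned model of the cited paper): a shot's score for a concept can be raised when strongly related concepts also score high in that same shot.

```python
# Toy conceptual re-scoring for one shot: a concept's score is blended
# with the best contribution from related concepts detected in the same
# shot. Relation weights and alpha are illustrative; the published
# method learns inter-concept relations from data.

def conceptual_rescore(shot_scores, relations, alpha=0.3):
    """shot_scores: concept -> score; relations: (concept, other) -> weight."""
    out = {}
    for concept, score in shot_scores.items():
        support = [w * shot_scores[other]
                   for (c, other), w in relations.items() if c == concept]
        context = max(support) if support else score
        out[concept] = (1 - alpha) * score + alpha * context
    return out

scores = {"boat": 0.4, "water": 0.9}
rescored = conceptual_rescore(scores, {("boat", "water"): 0.8})
```

The strong "water" detection lifts the weaker "boat" score, while "water" itself is unchanged.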
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score, with a conceptual feedback loop]
+ semantic descriptors [TRECVid 2013 and 2014]
Conceptual feedback: unfolded graph
[Diagram, unfolded feedback graph: the full pipeline (descriptor extraction, descriptor transformation, classification, fusion stages, re-ranking) runs once on text/audio/image to produce the iteration-0 (original) scores; those scores then feed a second pass of the same pipeline, yielding the iteration-1 (feedback) scores]
Conceptual feedback: semantic descriptor
[Diagram: the iteration-0 pipeline produces classification scores from the image/audio/text descriptors; a semantic descriptor extracted from those scores enters the standard descriptor processing of the iteration-1 pipeline; the initial descriptor extraction is a shared component (computed only once)]
Semantic descriptor: general case
[Diagram: any classification system, using any approach, trained on any annotated data for any target concept set, maps image/audio/text input to classification scores; those scores are used as a semantic descriptor entering the standard descriptor processing of the pipeline]
Model vectors [Smith et al. ICME 2003]
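The model-vector idea reduces to stacking the outputs of a bank of concept classifiers into one feature vector. A minimal sketch, where the linear "classifiers" are toy stand-ins for real trained models:

```python
# Sketch of a "model vector" semantic descriptor: the scores of a bank
# of concept classifiers, applied to one keyframe's features, are
# stacked into a single vector and used as the descriptor for the next
# classification stage. The "classifiers" below are toy stand-ins.

def clamp(v):
    return min(1.0, max(0.0, v))

def semantic_descriptor(features, classifier_bank):
    """classifier_bank: list of callables mapping a feature vector to a score."""
    return [clf(features) for clf in classifier_bank]

bank = [
    lambda x: clamp(0.5 * x[0] + 0.5 * x[1]),   # toy "sky" classifier
    lambda x: clamp(x[0] - x[1]),               # toy "building" classifier
    lambda x: clamp(x[1]),                      # toy "water" classifier
]
descriptor = semantic_descriptor([0.6, 0.2], bank)
```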
Semantic descriptors trained on ImageNet
- Fisher Vector based descriptor [Perronnin, IJCV 2013]:
- XEROX/ilsvrc2010: vectors of 1000 scores trained on
ILSVRC10 and applied to key frames, kindly produced by Florent Perronnin from Xerox (XRCE)
- XEROX/imagenet10174: same with 10,174 concept scores trained on ImageNet
- Deep-learning-based descriptors, computed by Eurecom and LIG using the Berkeley Caffe tool [Jia et al., 2013]:
- EUR/caffe1000: vectors of 1000 scores trained on
ILSVRC12 and applied to key frames, fusing outputs for 10 variants of each input image
- LIG/caffe1000b: same with a different version of the tool
and using only one variant of each input image
“Quasi-semantic” descriptors from deep learning and ImageNet
[Krizhevsky et al., 2012]
- 7 hidden layers, 650K units, 630M connections, 60M parameters
- GPU implementation (50× speed-up over CPU)
- Trained on two GPUs for a week
[Diagram: the network’s layers fc5, fc6, fc7 and its 1000-way output]
“Quasi-semantic” descriptors from deep learning and ImageNet
- Deep-learning-based descriptors, computed by LIG using the Berkeley Caffe tool [Jia et al., 2013]:
- LIG/caffe_fc7b_4096: 4096 values of the last hidden layer (non-convolutional)
- LIG/caffe_fc6b_4096: 4096 values of the second-to-last hidden layer (non-convolutional)
- LIG/caffe_fc5b_43264: 43264 values of the third-to-last hidden layer (convolutional, 13×13×256)
- Not strictly semantic, since these are activations rather than classification scores, but close to the semantic level
- Expected to perform better than the output layer: no (or less) information loss from targeting different and/or unrelated concepts
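The extraction itself is simple: run the forward pass and keep a hidden layer's activations instead of the output scores. A toy numpy illustration with a random stand-in network (the real features come from the trained Caffe model):

```python
import numpy as np

# Toy illustration of quasi-semantic features: keep a hidden layer's
# activations as the descriptor instead of the network's output scores.
# The random two-layer MLP below is a stand-in for the trained network;
# real features come from layers like fc6/fc7 of the Caffe model.

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((128, 64)), np.zeros(64)   # hidden layer
W2, b2 = rng.standard_normal((64, 10)), np.zeros(10)    # output layer

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU activations ("fc7"-like)
    return hidden, hidden @ W2 + b2         # (descriptor, class scores)

x = rng.standard_normal(128)                # stand-in keyframe features
descriptor, scores = forward(x)
```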
Local semantic descriptors trained on TRECVid 2003
- Scores for 15 TRECVid 2003 concepts (sky, building, water, greenery, ...) on image patches, trained using local annotations [Ayache et al., IVP 2007]
- LIG/percepts*: computed at various resolutions in a
pyramidal way, aggregated by concatenation
- Computed using local color and texture descriptors
- No longer state of the art
Experiments
- Use of SIN 2013 development data only (no
tuning on SIN 2013 test data) and various components using ImageNet annotated data → D type submissions
- Evaluation on SIN 2013 and 2014 test data
- Use of a combination of kNN and MSVM for
classification [Safadi, RIAO 2010]
- Use of uploader information: multiplicative factor
at the video level, weighted at 10%, provided by Eurecom [Niaz, TV 2012]
Performance of “low-level” descriptors
[Bar chart: MAP 2013 and MAP 2014 for low-level "engineered" descriptors: combination of 13 descriptors; ETIS VLAT (vector of locally aggregated tensors); LISTIC SIFT with retina masking; ETIS color (Lab BoW) and texture (wavelets); CEA-LIST pyramidal bag of SIFT; LIG opponent SIFT; LIRIS OC-LBP]
Performance of semantic descriptors
[Bar chart: MAP 2013 and MAP 2014 for semantic descriptors, compared to the 13 "low-level" engineered descriptors: LIG/concepts second iteration (includes Xerox); LIG/concepts first iteration (includes Xerox); Caffe semantic features, last two hidden layers; Caffe quasi-semantic hidden layer 7 (4096); Caffe quasi-semantic hidden layer 6 (4096); Caffe quasi-semantic hidden layer 5 (43264); Caffe semantic features, output layer (fused); Caffe semantic features, Eurecom; Caffe semantic features, LIG; Xerox semantic features (fused); Xerox semantic features, ImageNet 10174; Xerox semantic features, ILSVRC 1000]
Temporal re-scoring on semantic descriptors
[Bar chart: MAP 2014 after temporal re-scoring, for the same set of descriptors: LIG/concepts iterations, Caffe output and hidden-layer features, Xerox semantic features, and the 13 "low-level" engineered descriptors]
Combination of improvement methods
The relative gain brought by each improvement method depends upon the order in which they are applied. The final conceptual re-scoring did not further improve results on 2014 data.
[Bar chart: MAP 2013 and MAP 2014 as improvement methods are added in sequence: 13 "low-level" engineered descriptors; plus Xerox semantic features; plus Caffe output layer; plus Caffe last two hidden layers; plus conceptual feedback, first iteration; plus conceptual feedback, second iteration; plus temporal re-scoring (D_LIG.14_4); plus use of uploader model (D_LIG.14_3); plus conceptual re-scoring (D_LIG.14_2); plus use of uploader model (D_LIG.14_1)]
Use of semantic features for the semantic indexing task
- Fisher-vector-based descriptors are on par with deep-learning-based descriptors
- Both are on par with a combination of 13 types of low-level engineered descriptors, some of which are state of the art
- Any single engineered descriptor performs significantly worse than any semantic descriptor → why? Maybe a question of training data (more, and cleaner, in ImageNet)
- Conceptual-feedback-based semantic descriptors are better than all others (even when not including other semantic ones)
- Fusion and combination with other methods (e.g. temporal re-scoring) improves results further
- Direct application of FV and deep learning on SIN training data is ongoing but unlikely to compete
- Very small gain from the uploader field