On the use of semantic features for the semantic indexing task
On the use of semantic features for the semantic indexing task. Bahjat Safadi, Nadia Derbas, Abdelkader Hamadi, Mateusz Budnik, Philippe Mulhem and Georges Quénot, UJF-LIG, and many other people from the IRIM group of GDR 720 ISIS. 10 November
Outline
- System overview
- Semantic features
- Contrast experiments
- Engineered versus learned features
- Conclusion
Main runs scores 2014 (from NIST)
Median = 0.206 (mean InfAP)
[Bar chart: mean InfAP for all runs, ordered from D_MediaMill.14_1 down to A_FIU_UM.14_3. * non-LIG runs submitted in 2013, scored against 2014 test data (progress runs); * LIG (Quaero) runs submitted in 2013, scored against 2014 test data (progress runs); * LIG runs submitted in 2014, scored against 2014 test data (main runs)]
Basic classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Classification → Late fusion → Classification score]
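The late-fusion step above amounts to combining per-modality classification scores into one score per shot. A minimal sketch, assuming a simple weighted average; the weights here are illustrative, not the ones used in the actual system:

```python
# Minimal sketch of late fusion: per-modality classification scores for
# one shot are combined by a weighted average. The weights here are
# illustrative, not the ones used in the actual system.

def late_fusion(scores, weights):
    """scores, weights: dicts keyed by modality ('image', 'audio', 'text')."""
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

shot_scores = {"image": 0.8, "audio": 0.3, "text": 0.6}
fused = late_fusion(shot_scores, {"image": 0.6, "audio": 0.1, "text": 0.3})
```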
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Classification score]
+ hierarchical fusion [Strat et al., ECCV/IFCVCR workshop 2012, Springer 2014]
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score]
+ Temporal re-ranking [Safadi et al., CIKM 2011; Wang et al., TV 2009]: update shot scores considering other shots’ scores for the same concept
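The idea can be sketched as follows (a toy version, not the exact formula of the cited papers): consecutive shots of a video often share concepts, so each shot's score for a concept is pulled toward the best score found among its temporal neighbors.

```python
# Toy temporal re-scoring for one concept over the shots of one video:
# each score is pulled toward the maximum score in a small temporal
# neighborhood. alpha and window are illustrative parameters, not the
# tuned values of the published method.

def temporal_rescore(scores, alpha=0.5, window=1):
    n = len(scores)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        context = max(scores[lo:hi])          # best score nearby (incl. self)
        out.append((1 - alpha) * scores[i] + alpha * context)
    return out

smoothed = temporal_rescore([0.1, 0.9, 0.2])  # the middle shot boosts its neighbors
```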
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score]
+ Descriptor optimization [Safadi et al., MTAP 2014]: combination of PCA-based dimensionality reduction and pre- and post-power transformations
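A minimal sketch of this transformation chain, with illustrative exponents and output dimension rather than the tuned values of the paper:

```python
import numpy as np

# Sketch of descriptor optimization: a pre power transform, PCA-based
# dimensionality reduction, then a post power transform. Exponents and
# target dimension are illustrative, not the tuned values.

def power_transform(X, p):
    # signed power: preserves sign, compresses large magnitudes for p < 1
    return np.sign(X) * np.abs(X) ** p

def optimize_descriptors(X, dim=16, pre=0.5, post=0.5):
    X = power_transform(X, pre)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    X = (X - mean) @ Vt[:dim].T              # PCA projection onto `dim` axes
    return power_transform(X, post)

rng = np.random.default_rng(0)
Z = optimize_descriptors(rng.random((100, 128)), dim=16)
```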
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score, with a conceptual feedback loop]
+ conceptual feedback [Hamadi et al., MTAP, 2014]
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score, with a conceptual feedback loop]
+ Conceptual re-ranking [Hamadi et al., MTAP 2014]: update concept scores considering other concepts’ scores for the same shot
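As a toy illustration (not the learned model of the cited paper): a shot's score for a concept can be raised when strongly related concepts also score high in that same shot.

```python
# Toy conceptual re-scoring for one shot: a concept's score is blended
# with the best contribution from related concepts detected in the same
# shot. Relation weights and alpha are illustrative; the published
# method learns inter-concept relations from data.

def conceptual_rescore(shot_scores, relations, alpha=0.3):
    """shot_scores: concept -> score; relations: (concept, other) -> weight."""
    out = {}
    for concept, score in shot_scores.items():
        support = [w * shot_scores[other]
                   for (c, other), w in relations.items() if c == concept]
        context = max(support) if support else score
        out[concept] = (1 - alpha) * score + alpha * context
    return out

scores = {"boat": 0.4, "water": 0.9}
rescored = conceptual_rescore(scores, {("boat", "water"): 0.8})
```

The strong "water" detection lifts the weaker "boat" score, while "water" itself is unchanged.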
LIG/Quaero/IRIM classification pipeline
[Diagram: Text / Audio / Image → Descriptor extraction → Descriptor transformation → Classification → Descriptor and classifier variant fusion → Higher-level hierarchical fusion → Re-ranking (re-scoring) → Classification score, with a conceptual feedback loop]
+ semantic descriptors [TRECVid 2013 and 2014]
Conceptual feedback: unfolded graph
[Diagram, unfolded feedback graph: the full pipeline (descriptor extraction, descriptor transformation, classification, fusion stages, re-ranking) runs once on text/audio/image to produce the iteration-0 (original) scores; those scores then feed a second pass of the same pipeline, yielding the iteration-1 (feedback) scores]
Conceptual feedback: semantic descriptor
[Diagram: the iteration-0 pipeline produces classification scores from the image/audio/text descriptors; a semantic descriptor extracted from those scores enters the standard descriptor processing of the iteration-1 pipeline; the initial descriptor extraction is a shared component (computed only once)]
Semantic descriptor: general case
[Diagram: any classification system, using any approach, trained on any annotated data for any target concept set, maps image/audio/text input to classification scores; those scores are used as a semantic descriptor entering the standard descriptor processing of the pipeline]
Model vectors [Smith et al. ICME 2003]
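The model-vector idea reduces to stacking the outputs of a bank of concept classifiers into one feature vector. A minimal sketch, where the linear "classifiers" are toy stand-ins for real trained models:

```python
# Sketch of a "model vector" semantic descriptor: the scores of a bank
# of concept classifiers, applied to one keyframe's features, are
# stacked into a single vector and used as the descriptor for the next
# classification stage. The "classifiers" below are toy stand-ins.

def clamp(v):
    return min(1.0, max(0.0, v))

def semantic_descriptor(features, classifier_bank):
    """classifier_bank: list of callables mapping a feature vector to a score."""
    return [clf(features) for clf in classifier_bank]

bank = [
    lambda x: clamp(0.5 * x[0] + 0.5 * x[1]),   # toy "sky" classifier
    lambda x: clamp(x[0] - x[1]),               # toy "building" classifier
    lambda x: clamp(x[1]),                      # toy "water" classifier
]
descriptor = semantic_descriptor([0.6, 0.2], bank)
```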
Semantic descriptors trained on ImageNet
- Fisher Vector based descriptor [Perronnin, IJCV 2013]:
- XEROX/ilsvrc2010: vectors of 1000 scores trained on
ILSVRC10 and applied to key frames, kindly produced by Florent Perronnin from Xerox (XRCE)
- XEROX/imagenet10174: same with 10,174 concept scores trained on ImageNet
- Deep-learning-based descriptors, computed by Eurecom and LIG using the Berkeley Caffe tool [Jia et al., 2013]:
- EUR/caffe1000: vectors of 1000 scores trained on
ILSVRC12 and applied to key frames, fusing outputs for 10 variants of each input image
- LIG/caffe1000b: same with a different version of the tool
and using only one variant of each input image
“Quasi-semantic” descriptors from deep learning and ImageNet
[Krizhevsky et al., 2012]
- 7 hidden layers, 650K units, 630M connections, 60M parameters
- GPU implementation (50× speed-up over CPU)
- Trained on two GPUs for a week
[Diagram: the network’s layers fc5, fc6, fc7 and its 1000-way output]
“Quasi-semantic” descriptors from deep learning and ImageNet
- Deep-learning-based descriptors, computed by LIG using the Berkeley Caffe tool [Jia et al., 2013]:
- LIG/caffe_fc7b_4096: 4096 values of the last hidden layer (non-convolutional)
- LIG/caffe_fc6b_4096: 4096 values of the second-to-last hidden layer (non-convolutional)
- LIG/caffe_fc5b_43264: 43264 values of the third-to-last hidden layer (convolutional, 13×13×256)
- Not strictly semantic, since these are activations rather than classification scores, but close to the semantic level
- Expected to perform better than the output layer: no (or less) information loss from targeting different and/or unrelated concepts
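The extraction itself is simple: run the forward pass and keep a hidden layer's activations instead of the output scores. A toy numpy illustration with a random stand-in network (the real features come from the trained Caffe model):

```python
import numpy as np

# Toy illustration of quasi-semantic features: keep a hidden layer's
# activations as the descriptor instead of the network's output scores.
# The random two-layer MLP below is a stand-in for the trained network;
# real features come from layers like fc6/fc7 of the Caffe model.

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((128, 64)), np.zeros(64)   # hidden layer
W2, b2 = rng.standard_normal((64, 10)), np.zeros(10)    # output layer

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU activations ("fc7"-like)
    return hidden, hidden @ W2 + b2         # (descriptor, class scores)

x = rng.standard_normal(128)                # stand-in keyframe features
descriptor, scores = forward(x)
```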
Local semantic descriptors trained on TRECVid 2003
- Scores for 15 TRECVid 2003 concepts (sky, building, water, greenery, ...) on image patches, trained using local annotations [Ayache et al., IVP 2007]
- LIG/percepts*: computed at various resolutions in a
pyramidal way, aggregated by concatenation
- Computed using local color and texture descriptors
- No longer state of the art
Experiments
- Use of SIN 2013 development data only (no
tuning on SIN 2013 test data) and various components using ImageNet annotated data → D type submissions
- Evaluation on SIN 2013 and 2014 test data
- Use of a combination of kNN and MSVM for
classification [Safadi, RIAO 2010]
- Use of uploader information: multiplicative factor
at the video level, weighted at 10%, provided by Eurecom [Niaz, TV 2012]
Performance of “low-level” descriptors
[Bar chart: MAP 2013 and MAP 2014 for low-level "engineered" descriptors: combination of 13 descriptors; ETIS VLAT (vector of locally aggregated tensors); LISTIC SIFT with retina masking; ETIS color (Lab BoW) and texture (wavelets); CEA-LIST pyramidal bag of SIFT; LIG opponent SIFT; LIRIS OC-LBP]
Performance of semantic descriptors
[Bar chart: MAP 2013 and MAP 2014 for semantic descriptors, compared to the 13 "low-level" engineered descriptors: LIG/concepts second iteration (includes Xerox); LIG/concepts first iteration (includes Xerox); Caffe semantic features, last two hidden layers; Caffe quasi-semantic hidden layer 7 (4096); Caffe quasi-semantic hidden layer 6 (4096); Caffe quasi-semantic hidden layer 5 (43264); Caffe semantic features, output layer (fused); Caffe semantic features, Eurecom; Caffe semantic features, LIG; Xerox semantic features (fused); Xerox semantic features, ImageNet 10174; Xerox semantic features, ILSVRC 1000]
Temporal re-scoring on semantic descriptors
[Bar chart: MAP 2014 after temporal re-scoring, for the same set of descriptors: LIG/concepts iterations, Caffe output and hidden-layer features, Xerox semantic features, and the 13 "low-level" engineered descriptors]
Combination of improvement methods
The relative gain brought by each improvement method depends upon the order in which they are applied. The final conceptual re-scoring did not further improve results on 2014 data.
[Bar chart: MAP 2013 and MAP 2014 as improvement methods are added in sequence: 13 "low-level" engineered descriptors; plus Xerox semantic features; plus Caffe output layer; plus Caffe last two hidden layers; plus conceptual feedback, first iteration; plus conceptual feedback, second iteration; plus temporal re-scoring (D_LIG.14_4); plus use of uploader model (D_LIG.14_3); plus conceptual re-scoring (D_LIG.14_2); plus use of uploader model (D_LIG.14_1)]
Use of semantic features for the semantic indexing task
- Fisher-vector-based descriptors are on par with deep-learning-based descriptors
- Both are on par with a combination of 13 types of low-level engineered descriptors, some of which are state of the art
- Any single engineered descriptor performs significantly worse than any semantic descriptor → why? Maybe a question of training data (more, and cleaner, in ImageNet)
- Conceptual-feedback-based semantic descriptors are better than all others (even when not including other semantic ones)
- Fusion and combination with other methods (e.g. temporal re-scoring) improves results further
- Direct application of FV and deep learning on SIN training data is ongoing but unlikely to compete
- Very small gain from the uploader field