Kobe University, NICT, and University of Siegen at TRECVID 2016 - PowerPoint PPT Presentation



SLIDE 1

Kobe University, NICT, and University of Siegen at TRECVID 2016 AVS Task

Yasuyuki Matsumoto, Kuniaki Uehara (Kobe University), Takashi Shinozaki (NICT), Kimiaki Shirahama, Marcin Grzegorzek (University of Siegen)

SLIDE 2

Our Contribution

  • A method of using small-scale neural networks to greatly accelerate concept classifier training.
  • Transfer learning can be used to acquire temporal characteristics efficiently by combining small networks and LSTM.
  • Evaluate the effectiveness of using balanced examples at the time of training.
SLIDE 3

The Problem

Using pre-trained neural networks to extract features is a very popular approach. However, training the classifiers takes a long time, and it gets even worse when many classifiers are required.

[Diagram: pre-trained network → extract feature]

SLIDE 4

Micro Neural Networks

A micro neural network (microNN) is a fully-connected neural network with a single hidden layer. It is a binary classifier that outputs two values to predict the presence or absence of a concept. Dropout is used to avoid overfitting. Calculation time can be greatly reduced (hours -> minutes).
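As a rough illustration, the forward pass of such a microNN can be sketched in a few lines of NumPy. The layer sizes (4096-d input, 128 hidden units) and the 0.5 dropout rate are assumptions for illustration, not settings confirmed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

class MicroNN:
    """Sketch of a 'micro' network: one fully-connected hidden layer
    with dropout, two softmax outputs (concept present / absent)."""

    def __init__(self, in_dim=4096, hidden=128, p_drop=0.5):
        self.W1 = rng.normal(0, 0.01, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.01, (hidden, 2))
        self.b2 = np.zeros(2)
        self.p_drop = p_drop

    def forward(self, x, train=False):
        h = np.maximum(0.0, x @ self.W1 + self.b1)   # ReLU hidden layer
        if train:                                    # inverted dropout
            mask = rng.random(h.shape) > self.p_drop
            h = h * mask / (1.0 - self.p_drop)
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)     # softmax over 2 classes

net = MicroNN()
probs = net.forward(rng.normal(size=(3, 4096)))  # 3 feature vectors in
print(probs.shape)                               # (3, 2)
```

With only one small hidden layer to train per concept, this is far cheaper than fitting a heavyweight classifier, which is what makes the hours-to-minutes speedup plausible.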

SLIDE 5

Our Approach - Overview

Overview of our method for the TRECVID 2016 AVS task:

Query → [Manual selection] → Concept → [Feature extraction, MicroNN training, LSTM] → Model → [Shot retrieval] → Precision

SLIDE 6

Our Approach - Overview

How we extracted concepts from the queries:

Query → [Manual selection] → Concept → [Feature extraction, MicroNN training, LSTM] → Model → [Shot retrieval] → Precision

SLIDE 7

Our Approach - Manual Selection

Begin with manually selecting relevant concepts for each query. A simple rule is used to make it easier to automate the concept selection in the future.

Query (502): "Find shots of a man indoors looking at camera where a bookcase is behind him"
  • Pick only nouns and verbs: "man", "look" (base form)
  • Synonyms (from ImageNet): "bookcase", "bookshelf", "furniture"

SLIDE 8

Our Approach - Manual Selection

Query (502): "Find shots of a man indoors looking at camera where a bookcase is behind him"
  • Pick only nouns and verbs: "man", "look" (base form)
  • Synonyms (from ImageNet): "bookcase", "bookshelf", "furniture"
  • Selected concepts: Indoor, Speaking_to_camera, Bookshelf, Furniture
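The selection rule above (keep nouns and verbs, reduce to base form, expand with synonyms) could be sketched as follows; the tiny POS, base-form, and synonym tables are hypothetical stand-ins for a real tagger and the ImageNet synonym sets:

```python
# Illustrative lookup tables (a real system would use a POS tagger,
# a lemmatizer, and ImageNet's synonym sets instead).
POS = {"man": "NOUN", "looking": "VERB", "bookcase": "NOUN",
       "shots": "NOUN", "camera": "NOUN", "indoors": "ADV"}
BASE = {"looking": "look", "shots": "shot"}
SYNONYMS = {"bookcase": ["bookshelf", "furniture"]}

def select_terms(query):
    """Keep only nouns/verbs, map to base form, expand with synonyms."""
    terms = []
    for word in query.lower().split():
        if POS.get(word) in ("NOUN", "VERB"):
            base = BASE.get(word, word)
            terms.append(base)
            terms.extend(SYNONYMS.get(base, []))
    return terms

q = "Find shots of a man indoors looking at camera where a bookcase is behind him"
print(select_terms(q))  # ['shot', 'man', 'look', 'camera', 'bookcase', 'bookshelf', 'furniture']
```

The resulting terms are then matched by hand against the available trained concepts (Indoor, Speaking_to_camera, Bookshelf, Furniture in this example); the simple rule keeps that final step easy to automate later.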

SLIDE 9

Our Approach - Overview

Overview of our method for the TRECVID 2016 AVS task:

Query → [Manual selection] → Concept → [Feature extraction, MicroNN training, LSTM] → Model → [Shot retrieval] → Precision

SLIDE 10

Our Approach - Overview

Combine the concepts from each query:

Query → [Manual selection] → Concept → [Feature extraction, MicroNN training, LSTM] → Model → [Shot retrieval] → Precision

SLIDE 11

Our Approach - Feature Extraction

Use the pre-trained VGGNet:
  • ILSVRC 2014
  • CNN with a very deep architecture
  • The 16-layer version is used
  • FC7: use the output of the second fully-connected layer

A pre-trained network is usually transferred into classifiers suitable for the target problem.

Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → FC6 → FC7 → FC8 → Softmax

  • K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition"
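The FC7 step amounts to stopping VGG's forward pass one layer before the classifier head. The sketch below uses scaled-down layer sizes (real VGG-16 uses 25088→4096→4096→1000) and random weights standing in for the pre-trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for VGG-16's fully-connected stack; random
# weights here play the role of the pre-trained parameters.
fc6 = rng.normal(0, 0.1, (512, 128))
fc7 = rng.normal(0, 0.1, (128, 128))
fc8 = rng.normal(0, 0.1, (128, 10))  # ImageNet classifier head (discarded)

def extract_fc7(conv_out):
    """Stop the forward pass at FC7: its activation is the feature
    vector; the FC8/softmax classifier head is never applied."""
    h6 = np.maximum(0.0, conv_out @ fc6)   # FC6 + ReLU
    h7 = np.maximum(0.0, h6 @ fc7)         # FC7 + ReLU -> feature
    return h7

feature = extract_fc7(rng.normal(size=(1, 512)))
print(feature.shape)   # (1, 128); 4096-d with real VGG-16
```

Each image or video frame thus becomes a fixed-length vector that the microNNs consume, regardless of the original classification task VGGNet was trained for.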
SLIDE 12

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps:
① Start with training the microNN using images.

[Diagram: Image → VGG Net]

SLIDE 13

Previous Approach - SVM Training

Previous studies have trained classifiers such as SVMs on the extracted features. This requires a lot of time.

[Diagram: Image → VGG Net → SVM (until now)]

SLIDE 14

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps:
① Start with training the microNN using images.

[Diagram: Image → VGG Net → microNN]


SLIDE 16

Our Approach - MicroNN Training

② Refine the microNN using shots in the video dataset.

SLIDE 17

Our Approach - MicroNN Training

② Refine the microNN using shots in the video dataset. The microNN takes the weight parameters (W, b) learned in the first step as its initial values.

[Diagram: Video → VGG Net → microNN (W, b)]
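This warm start can be sketched minimally: step 2 copies step 1's parameters instead of re-initializing at random, then fine-tunes them. The parameter shapes and the "update" below are placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step-1 microNN parameters (learned on images; random placeholders here).
step1 = {"W1": rng.normal(0, 0.01, (128, 64)), "b1": np.zeros(64),
         "W2": rng.normal(0, 0.01, (64, 2)),  "b2": np.zeros(2)}

# Warm start: step 2 begins from copies of step 1's weights rather than
# a fresh random initialization, then fine-tunes them on video shots.
step2 = {name: w.copy() for name, w in step1.items()}

# Stand-in for one fine-tuning update on video data (not a real gradient).
step2["W1"] -= 0.01 * rng.normal(0, 0.001, step2["W1"].shape)

print(np.allclose(step1["W1"], step2["W1"]))   # False: fine-tuning moved W1
print(np.allclose(step1["W2"], step2["W2"]))   # True: not yet updated
```

Starting from already-useful weights is what makes the refinement on video "gradual": only a small correction toward the video domain is needed, not training from scratch.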

SLIDE 18

Our Approach - MicroNN Training

③ Further, the hidden layer of the microNN is replaced with an LSTM to acquire temporal characteristics. Refine the microNN starting with the weight parameters (W, b) learned in the second step as initial values.

[Diagram: Video → VGG Net → microNN with LSTM hidden layer]

SLIDE 19

Our Approach - Overview

Overview of our method for the TRECVID 2016 AVS task:

Query → [Manual selection] → Concept → [Feature extraction, MicroNN training, LSTM] → Model → [Shot retrieval] → Precision

SLIDE 20

Our Approach - Overview

How we go from a shot's concept relevance to its search score:

Query → [Manual selection] → Concept → [Feature extraction, MicroNN training, LSTM] → Model → [Shot retrieval] → Precision

SLIDE 21

Our Approach - Shot Retrieval

For each shot, calculate the average of the output values of the microNNs for the concepts selected for the query. MicroNN outputs are normalized to [-1, 1] to balance between different concepts.

Concept: Indoor, Speaking_to_camera, Bookshelf, Furniture
Output values: 0.7, 0.1, 0.4, 0.6

SLIDE 22

Our Approach - Shot Retrieval

Concept: Indoor, Speaking_to_camera, Bookshelf, Furniture
Output values: 0.7, 0.1, 0.4, 0.6
Average of output values (search score): (0.7 + 0.1 + 0.4 + 0.6) / 4 = 0.45

Calculate the average of the output values and use it as the overall search score; shots are then compared and ranked by this score.
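The slide's worked example reduces to a one-line mean over the selected concepts' (already normalized) microNN outputs:

```python
# Search score of one shot: the mean of the microNN outputs for the
# concepts selected for the query, using the slide's example values.
outputs = {"Indoor": 0.7, "Speaking_to_camera": 0.1,
           "Bookshelf": 0.4, "Furniture": 0.6}

score = sum(outputs.values()) / len(outputs)
print(round(score, 2))   # 0.45
```

Because every shot is scored with the same formula, the scores are directly comparable, and ranking shots by this value produces the retrieval list.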

SLIDE 23

Purpose of Experiment

  • 1. Evaluate the learning speed.
  • 2. Evaluate the effectiveness of using LSTM to acquire temporal characteristics.
  • 3. Evaluate whether using the same number of positive and negative examples ("Balanced") for training improves classification.

SLIDE 24

Experiment - Three Runs

Submitted the following for the TRECVID 2016 AVS task:

  • kobe_nict_siegen_D_M_1 (Imbalanced): fine-tuning is carried out using imbalanced numbers of positive and negative examples (30,000 total).
  • kobe_nict_siegen_D_M_2 (Balanced): fine-tuning is carried out using balanced numbers of positive and negative examples (30,000 total).
  • kobe_nict_siegen_D_M_3 (LSTM, imbalanced): unlike max-pooling, LSTM obtains temporal characteristics. LSTM-based microNNs are trained only for the 14 concepts for which temporal relations among video frames are important.
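The difference between the Imbalanced and Balanced runs is just the sampling step. A sketch of the "Balanced" variant, on a synthetic pool (the real runs used 30,000 examples in total; the pool sizes and batch size here are illustrative):

```python
import random

random.seed(0)

# Hypothetical training pool: few positives, many negatives, as is
# typical for rare concepts. Each item is (example_id, label).
pool = [(i, 1) for i in range(500)] + [(i, 0) for i in range(500, 30000)]

def balanced_sample(pool, total):
    """Draw the same number of positives and negatives ('Balanced' run).
    The 'Imbalanced' run would instead sample from the pool as-is."""
    pos = [x for x in pool if x[1] == 1]
    neg = [x for x in pool if x[1] == 0]
    k = min(total // 2, len(pos), len(neg))
    return random.sample(pos, k) + random.sample(neg, k)

batch = balanced_sample(pool, 1000)
labels = [y for _, y in batch]
print(labels.count(1), labels.count(0))   # 500 500
```

Balancing caps the batch at twice the rarer class, so it discards most negatives; keeping the natural imbalance preserves more training data, which is one plausible reading of why the Imbalanced run scored higher in the results that follow.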

SLIDE 25

Experiment - Dataset

Used in this study:
  • TRECVID IACC (video data): 61 concepts
  • ImageNet (image data): 39 concepts
  • UCF 101 (video data): 5 concepts

SLIDE 26

Experiment - Dataset

  • TRECVID IACC (video data): 61 concepts
  • ImageNet (image data): 39 concepts
  • UCF 101 (video data): 5 concepts

Training time (30,000 shots): 2 sec / concept, 3 min / concept

SLIDE 27

Experiment - Dataset

List of some concepts selected for each query (drawn from ImageNet, TRECVID, and UCF 101):
  • 501: Outdoor, playingGuitar
  • 502: bookshelf, Indoor, Speaking_to_camera, Furniture
  • 503: drum, Indoor, drumming

SLIDE 28

Experiment - Result

Performance comparison between Imbalanced, Balanced, and LSTM on each of the 30 queries (501-530).

[Chart: AP per query, y-axis 0.05-0.35, for the Imbalanced, Balanced, and LSTM runs]

SLIDE 29

Experiment - Result

[Chart: AP per query (501-530) for the Imbalanced, Balanced, and LSTM runs]

Using imbalanced training examples leads to higher average precision than using balanced ones.

SLIDE 30

Experiment - Result

[Chart: AP per query (501-530) for the Imbalanced, Balanced, and LSTM runs]

The average precisions obtained using LSTM are more than three times higher than those obtained without LSTM.

SLIDE 31

Experiment - Result

Performance comparison between our method and the other methods developed for the manually-assisted category of the AVS task.

[Chart: MAP, 0.02-0.2, for our runs (LSTM, Imbalanced, Balanced) vs. the others]

SLIDE 32

Experiment - Result

Performance comparison between our method and the other methods developed for the AVS task.

[Chart: MAP (0.02-0.2) for all submitted runs, including Waseda, NII_Hitachi_UIT, ITI_CERTH, IMOTION, vitrivr, VIREO, INF, MediaMill, EURECOM, FIU_UM, UEC, and ITEC_UNIKLU; our runs are kobe_nict_siegen.16 1-3]

SLIDE 33

Conclusion

Video search through efficient transfer learning using microNNs:
  • fast
  • flexible

Imbalanced examples are more useful than balanced examples. The temporal characteristics acquired by LSTM are valid.

SLIDE 34

Future work

Further experiments using LSTM on a reduced frame interval:
  • currently one video frame every 30 frames in a shot
  • move to more densely sampled video frames

SLIDE 35

Future work

Acquiring temporal characteristics using optical flow. Before detecting objects in a scene, we can first classify its environment to improve the performance.

[Diagram: optical flow → scene]