Word2VisualVec for Video-To-Text Matching and Ranking, Jianfeng Dong, Xirong Li, et al.



SLIDE 1

Word2VisualVec for Video-To-Text Matching and Ranking

Jianfeng Dong1, Xirong Li2, Xiaoxu Wang2, Qijie Wei2, Weiyu Lan2, Cees G. M. Snoek3

Zhejiang University1, Renmin University of China2, University of Amsterdam3

SLIDE 2

Our idea

Project sentences into a video feature space

Match sentences and videos in this space
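The matching step can be illustrated with a minimal sketch. It assumes that sentences have already been projected into the video feature space and that cosine similarity is the matching score; the slides do not name the similarity measure, so that choice, the function names, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed matching score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_sentences(video_feature, sentence_vectors):
    """Rank candidate sentences for one video, best match first.

    sentence_vectors are assumed to live in the same space as video_feature.
    Returns (indices sorted by decreasing score, raw scores).
    """
    scores = [cosine(video_feature, s) for s in sentence_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy 3-dimensional example (real video features are much larger).
video = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the video feature
    [2.0, 0.0, 2.0],  # same direction as the video feature
    [1.0, 1.0, 0.0],  # partially aligned
]
order, scores = rank_sentences(video, candidates)
```

Because both sides are compared in the video feature space, the video feature itself needs no extra projection at query time.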

SLIDE 3

Solution: Word2VisualVec

Transform text into a video feature vector

Text → word matrix → pooling → s(q)
h1(q) = σ(W1*s(q)+b1)
Output: σ(W2*h1(q)+b2)
Video → CNN → Φ(x)

  • J. Dong, X. Li, C. Snoek, Word2VisualVec: Cross-Media Retrieval by Visual Feature Prediction, arXiv:1604.06838, 2016

SLIDE 4

Word2VisualVec

Transform text into a video feature vector

Text → word matrix (word2vec) → pooling → s(q)
h1(q) = σ(W1*s(q)+b1)
Output: σ(W2*h1(q)+b2)
Video → CNN → Φ(x)

SLIDE 5

Word2VisualVec

Transform text into a video feature vector

Text → word matrix → pooling → s(q)
h1(q) = σ(W1*s(q)+b1)
Output: σ(W2*h1(q)+b2)
Video → CNN → Φ(x)

word2vec + Multi-layer perceptron

Minimize Mean Squared Error between text vector and video vector
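A minimal sketch of this pipeline is given below. It assumes mean pooling over the word vectors and ReLU for σ; the slide says only "pooling" and "σ", so both choices are assumptions, and the tiny weights and all function names are hypothetical illustrations rather than the authors' implementation.

```python
def relu(z):
    """Assumed choice for the activation σ."""
    return [max(0.0, x) for x in z]

def mean_pool(word_vectors):
    """Assumed pooling: average the rows of the word matrix to get s(q)."""
    n = len(word_vectors)
    return [sum(col) / n for col in zip(*word_vectors)]

def affine(W, x, b):
    """Compute W*x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def word2visualvec(word_vectors, W1, b1, W2, b2, sigma=relu):
    """Map a sentence (as word vectors) to a predicted video feature."""
    s = mean_pool(word_vectors)        # s(q)
    h1 = sigma(affine(W1, s, b1))      # h1(q) = σ(W1*s(q)+b1)
    return sigma(affine(W2, h1, b2))   # σ(W2*h1(q)+b2)

def mse(pred, target):
    """Mean squared error between the text vector and the video vector Φ(x)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Toy example: two 2-dim word vectors, hidden size 3, output size 2.
words = [[1.0, 0.0], [0.0, 1.0]]
W1, b1 = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]], [0.0, 0.0, 0.0]
W2, b2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, 0.5]
phi_x = [2.0, 1.5]  # stand-in for the CNN feature of the paired video

predicted = word2visualvec(words, W1, b1, W2, b2)
loss = mse(predicted, phi_x)
```

Training would adjust W1, b1, W2, and b2 to reduce this loss over sentence-video pairs; the optimizer is not specified on the slide and is left out here.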

SLIDE 6

Implementation

Two video features

  • Visual: Mean pooling over frame-level CNN features extracted by GoogleNet-shuffle [Mettes et al. ICMR16]

  • Visual + Audio: GoogleNet-shuffle + Bag of quantized MFCC

Word2Vec

  • 500-dim, trained on user tags of 30 million Flickr images

Word2VisualVec architecture

  • For predicting the visual feature: 500-1000-1024
  • For predicting the visual + audio feature: 500-1000-2048

Training set

  • MSR-VTT training set of 6,513 videos [Xu et al. CVPR16]

Validation set

  • TRECVID 200 training videos
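The two video features listed above can be sketched as follows. This is an illustration under stated assumptions: MFCC frames are quantized by nearest codeword in a given codebook, the resulting histogram is normalized to sum to one, and the audio part is concatenated after the visual part. The slides do not specify the codebook, the normalization, or the fusion details, and all names and toy values here are hypothetical.

```python
def mean_pool(frames):
    """Average frame-level CNN features into one video-level vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def bag_of_quantized(mfcc_frames, codebook):
    """Histogram of nearest-codeword assignments (assumed quantization scheme)."""
    hist = [0.0] * len(codebook)
    for frame in mfcc_frames:
        dists = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in codebook]
        hist[dists.index(min(dists))] += 1.0
    total = sum(hist)
    # Normalizing by the frame count is an assumption, not stated in the slides.
    return [h / total for h in hist] if total else hist

def video_feature(cnn_frames, mfcc_frames=None, codebook=None):
    """Visual-only feature, or visual + audio when MFCC frames are supplied."""
    visual = mean_pool(cnn_frames)
    if mfcc_frames is None:
        return visual
    return visual + bag_of_quantized(mfcc_frames, codebook)

# Toy example: 2-dim "CNN" frames and 1-dim "MFCC" frames.
cnn_frames = [[1.0, 3.0], [3.0, 5.0]]
mfcc_frames = [[0.0], [0.1], [1.0]]
codebook = [[0.0], [1.0]]

visual_only = video_feature(cnn_frames)
visual_audio = video_feature(cnn_frames, mfcc_frames, codebook)
```

In the slides the corresponding output sizes are 1024 for the visual feature and 2048 for the visual + audio feature, which is why the two Word2VisualVec variants differ only in their last layer.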
SLIDE 7

Video-to-text results

Adding the audio feature provides some improvement

[Figure: results on set A and set B]

Word2VisualVec is effective

SLIDE 8

Video-to-text results

Text → Visual: a man with a beard is wearing glasses
Text → Visual + Audio: man talks into the camera

Text → Visual: soccer players are blocking the ball on a soccer field
Text → Visual + Audio: a soccer player scores a goal on a soccer field

More results at http://lixirong.net/demo/vtt/tv16.html

SLIDE 9

Video Description Generation

  • J. Dong, X. Li, W. Lan, Y. Huo, C. Snoek, Early embedding and late reranking for video captioning, ACM Multimedia 2016

SLIDE 10

Idea: Re-use Video Tags for Captioning

Predicted tags: track, race, field, woman
Generated caption: a group of people are running in a race track

Predicted tags: dance, people, woman, dancing
Generated caption: people are dancing on a stage

Predicted tags: soccer, player, game, playing
Generated caption: a soccer player is playing a goal on a soccer field

SLIDE 11

Our solution

Google’s model [Vinyals et al. CVPR 2015]

  • models are walking down the runway
  • models are walking on the runway
  • a woman is walking down the runway
  • a woman is dancing
  • …
  • models are walking in a fashion show
  • models are walking on the ramp

Google’s model for sentence generation

GoogleNet-shuffle

SLIDE 12

Our solution

Google’s model [Vinyals et al. CVPR 2015]

fashion walking model

  • models are walking down the runway
  • models are walking on the runway
  • a woman is walking down the runway
  • a woman is dancing
  • …
  • models are walking in a fashion show
  • models are walking on the ramp

Re-encoding by Word2VisualVec

Better initialization by tag embedding

SLIDE 13

Our solution

Google’s model [Vinyals et al. CVPR 2015]

  • models are walking down the runway
  • models are walking on the runway
  • a woman is walking down the runway
  • a woman is dancing
  • …
  • models are walking in a fashion show
  • models are walking on the ramp

Maximize tag matches

models are walking in a fashion show

fashion walking model

Rerank sentences by matching with video tags

Re-encoding by Word2VisualVec
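The tag-based reranking step can be sketched as below. This is a simplified illustration that scores each candidate only by how many predicted tags it covers, using a crude prefix test so that the tag "model" matches the word "models". The slides do not say how a tag match is counted or how it is combined with the caption model's own score, so the matching rule and function names are assumptions.

```python
def tag_matches(sentence, tags):
    """Count the tags that prefix-match at least one word of the sentence."""
    words = sentence.lower().split()
    return sum(any(w.startswith(t) for w in words) for t in tags)

def rerank(candidates, tags):
    """Sort candidates by decreasing tag matches (stable for ties)."""
    return sorted(candidates, key=lambda s: tag_matches(s, tags), reverse=True)

# Candidates and tags taken from the slide.
tags = ["fashion", "walking", "model"]
candidates = [
    "models are walking down the runway",
    "models are walking on the runway",
    "a woman is walking down the runway",
    "a woman is dancing",
    "models are walking in a fashion show",
    "models are walking on the ramp",
]
best = rerank(candidates, tags)[0]
```

With these inputs the sentence covering all three tags moves to the top, which agrees with the sentence highlighted on the slide.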

SLIDE 14

Heuristics to add ‘where’

Two simple rules to append a ‘where’ description to the end of the generated sentences:

1. Add “on a $sport_name field” if $sport_name appears in the sentence, such as basketball, baseball, and football.
2. Add “on a stage” if “sing” or “dance” appears in the sentence.
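The two rules can be sketched as a small post-processing function. Several details are assumptions not stated on the slide: the sport list contains only the three named examples, the sport rule takes priority over the stage rule, "sing" and "dance" are matched through the stems "sing" and "danc" so that "dancing" also triggers the rule, and a sentence that already ends with the phrase is left unchanged.

```python
SPORTS = ("basketball", "baseball", "football")  # examples named on the slide
STAGE_STEMS = ("sing", "danc")                   # assumed stems for "sing"/"dance"

def add_where(sentence):
    """Append a 'where' phrase to a generated sentence when a rule fires."""
    words = sentence.lower().split()
    for sport in SPORTS:
        if sport in words:
            suffix = f"on a {sport} field"       # rule 1
            break
    else:
        # Rule 2 is checked only when no sport was found (assumed priority).
        stage = any(w.startswith(STAGE_STEMS) for w in words)
        suffix = "on a stage" if stage else None
    if suffix is None or sentence.lower().endswith(suffix):
        return sentence
    return f"{sentence} {suffix}"

examples = [
    add_where("a man is playing basketball"),
    add_where("people are dancing"),
    add_where("a man is cooking"),
]
```

Prefix matching is deliberately crude: a word such as "single" would also trigger the stage rule, so a real system would likely need a stricter word test.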

SLIDE 15

Description generation results

Adding “where” improves the performance

SLIDE 16

Live demo

http://lixirong.net/demo/vtt

Accepts video files smaller than 10 MB

SLIDE 17

Conclusion

Word2VisualVec for video-to-text matching in video space

Early embedding and late reranking improve LSTM-based video captioning

Winning results in the VTT task

Xirong Li