Deep Learning Based Semantic Video Indexing and Retrieval




SLIDE 1

Deep Learning Based Semantic Video Indexing and Retrieval

Anna Podlesnaya, Sergey Podlesnyy Cinema and Photo Research Institute (NIKFI)

This work was funded by Russian Federation Ministry of Culture Contract No. 2214-01-41/06-15

SLIDE 2

Fast Track

Contribution 1

Video Segmentation: Feature vectors extracted by GoogLeNet contain enough semantic information to segment raw video into shots with 0.94 precision compared to MPEG-4 I-frames.

Contribution 2

Video Indexing: A graph-based database is proposed for indexing temporal, spatial and semantic properties. Cost-efficient pipeline.

Contribution 3

Search by Examples: Video retrieval by sample video clip at 0.86 precision. Online learning of new concepts: video retrieval by sample photos at 0.64 precision.

SLIDE 3

Relevance

Archives are Huge

  • Russian Documentary Archive: 250K items (dated from 1910)
  • Russian TV Archive: 100K items
  • YouTube: users upload 100 hours of video every minute (as of 2013)

Production Needs

  • Everyday need for footage in TV production
  • Non-fiction movie production relies on historical and cultural heritage content
  • Education, research, art...

MPEG-7 Query Format

  • QueryByFreeText
  • QueryByMedia
  • SpatialQuery
  • TemporalQuery

ISO/IEC 15938-5:2003 Information technology -- Multimedia content description interface -- Part 5: Multimedia description schemes

SLIDE 4

Video Segmentation

With semantic features

  • Semantic feature extraction by a deep neural network
  • Shots cut at spikes in the vector distance between frames
  • Temporal pooling to summarize shot semantics
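The slide does not say which pooling operator summarizes a shot. As a minimal sketch, assuming simple mean pooling over the per-frame feature vectors (the function name `mean_pool` and the toy 3-dimensional vectors are illustrative, not from the slides):

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one shot-level vector.

    Assumption: mean pooling; the slides only say "temporal pooling".
    """
    if not frames:
        raise ValueError("a shot needs at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


# Toy shot with two frames; real vectors would have 1024 components.
shot = [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]]
pooled = mean_pool(shot)
```

Max pooling would be an equally plausible reading; mean pooling is chosen here only because it keeps the shot vector on the same scale as the frame vectors.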

SLIDE 5

Deep Neural Network

SLIDE 6

Semantics Feature Vector

SLIDE 7

Distance Between Frames’ Feature Vectors

SLIDE 8

Segmenting Algorithm Details

{x_0, x_1, …, x_n}: feature vectors of successive frames
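The rest of this slide's detail did not survive extraction, so the following is only a sketch of the general idea stated earlier (cut where the distance between successive feature vectors spikes). The cosine distance, the fixed `threshold` value, and the function names are assumptions, not the authors' actual rule:

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def find_cuts(frames: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Indices i where the distance between x_{i-1} and x_i exceeds the threshold."""
    return [
        i
        for i in range(1, len(frames))
        if cosine_distance(frames[i - 1], frames[i]) > threshold
    ]


def split_shots(frames: Sequence[Sequence[float]], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Half-open (start, end) frame ranges, one per detected shot."""
    if not frames:
        return []
    bounds = [0] + find_cuts(frames, threshold) + [len(frames)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]


# Toy 2-dimensional "feature vectors": the content changes between frames 1 and 2.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
cuts = find_cuts(frames)
shots = split_shots(frames)
```

A fixed threshold is the simplest possible spike detector; an adaptive rule (for example, comparing each distance with a running average) may be closer to what a production system needs.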

SLIDE 9

Robustness to Camera Movement

[Figure: example shots with camera movement. Panels: Zoom, Pan/Rotate, Pan, Pan, Zoom/Pan]

SLIDE 10

Video Indexing

With graph database

  • Apache Cassandra storage for feature vectors and thumbnails
  • Neo4j graph database for the movie archive
  • Structured queries for keyword-based retrieval

SLIDE 11

  • Digitizing: starting with film or tape
  • FV Extraction: store per-frame timecodes and feature vectors in Cassandra
  • Segmentation: store per-scene data structure in Neo4j
  • Indexing: may use additional classifiers for faces, places, salient objects etc.
  • BK-Tree Building: add edges to the Neo4j graph to speed up nearest-neighbor search

SLIDE 12

Graph-Based Index

SLIDE 13

Neo4j Graph

SLIDE 14

Neo4j Graph

SLIDE 15

Neo4j Query

Find Scenes with Zebra

MATCH (s:Shot)-[c:Category]->(w:Wordnet {synset: "zebra"})
WHERE c.weight > 0.1
RETURN s
ORDER BY s.duration DESC

Pattern (ASCII art): (s)-[c]->(w)

SLIDE 16

Neo4j Query

Find Scenes with a Lion to the Left of a Zebra

MATCH (s:Shot)-->(zebra_obj:Salient_obj)-->(zebra_w:Wordnet {synset: "zebra"})
MATCH (s)-->(lion_obj:Salient_obj)-->(lion_w:Wordnet {synset: "lion"})
MATCH (zebra_obj)-[:Left]->(lion_obj)
RETURN s
ORDER BY s.duration DESC

Pattern (ASCII art):
(s)-->(zebra_obj)-->(zebra_w)
(s)-->(lion_obj)-->(lion_w)
(zebra_obj)-[:Left]->(lion_obj)

SLIDE 17

Search by Example

One picture is better than 100 words

  • Find similar clips
  • Find near-duplicates
  • Online learning of new concepts

SLIDE 18

Use Case 1

  • Keyword Search: search for ELEPHANT
  • Select Sample Clip: need elephant herd, forest, sky
  • Find Similar Clips: found clips with the required characteristics

SLIDE 19

Find Similar Clip

Quick Search

  • 31-bit random projection hash (RPH)
  • BK-tree on RPH Hamming distance from the sample clip
  • Quick, incomplete search
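The quick-search idea can be sketched as follows. This is an illustration under stated assumptions: Gaussian random hyperplanes for the hash, a plain in-memory BK-tree (the slides store the tree as Neo4j edges instead), and hypothetical names such as `make_hyperplanes`, `rph` and `BKTree`:

```python
import random
from typing import List, Optional, Sequence, Tuple


def make_hyperplanes(dim: int, bits: int = 31, seed: int = 0) -> List[List[float]]:
    """One random Gaussian hyperplane per hash bit (assumed construction)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def rph(vec: Sequence[float], planes: Sequence[Sequence[float]]) -> int:
    """Random projection hash: bit k records which side of hyperplane k vec lies on."""
    code = 0
    for plane in planes:
        side = sum(p * v for p, v in zip(plane, vec)) >= 0.0
        code = (code << 1) | int(side)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


class BKTree:
    """BK-tree over integer codes with Hamming distance as the metric."""

    def __init__(self) -> None:
        self.root: Optional[list] = None  # node = [code, {distance: child node}]

    def add(self, code: int) -> None:
        if self.root is None:
            self.root = [code, {}]
            return
        node = self.root
        while True:
            d = hamming(code, node[0])
            if d == 0:
                return  # code already present; a real index would keep scene ids here
            child = node[1].get(d)
            if child is None:
                node[1][d] = [code, {}]
                return
            node = child

    def query(self, code: int, radius: int) -> List[Tuple[int, int]]:
        """All (distance, code) pairs within `radius`, closest first."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            d = hamming(code, node[0])
            if d <= radius:
                found.append((d, node[0]))
            # Triangle inequality: only subtrees with edge in [d - radius, d + radius] can match.
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(found)


planes = make_hyperplanes(dim=4)
v = [0.2, -1.0, 0.5, 3.0]
code = rph(v, planes)

tree = BKTree()
for c in (0b0000, 0b0001, 0b0011, 0b1111):
    tree.add(c)
near = tree.query(0b0000, radius=1)
```

The search is "incomplete" in the sense that two scenes with similar feature vectors can still land more than `radius` bits apart, so some true neighbors are missed in exchange for speed.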

Exhaustive Search

  • Feature vectors pooled by scene (R^1024)
  • Cosine distance between the sample clip and every other scene in the archive, sort descending
  • Well, slow
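A minimal sketch of the exhaustive pass, assuming the archive fits in a dictionary of scene id to pooled vector. The names `exhaustive_search` and `archive` are illustrative. Results are ordered closest first (that is, by descending similarity), which is one reading of "sort descending"; the 0.3 cut-off matches the cosine distance threshold quoted for the evaluation:

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(
    query: Sequence[float],
    archive: Dict[str, Sequence[float]],
    max_distance: float = 0.3,
) -> List[Tuple[float, str]]:
    """Compare the query with every scene; keep those within max_distance, closest first."""
    hits = [(cosine_distance(query, vec), scene_id) for scene_id, vec in archive.items()]
    return sorted((d, scene_id) for d, scene_id in hits if d <= max_distance)


# Toy 3-dimensional archive; real pooled vectors would have 1024 components.
archive = {
    "scene_a": [1.0, 0.0, 0.0],
    "scene_b": [0.9, 0.1, 0.0],
    "scene_c": [0.0, 1.0, 0.0],
}
results = exhaustive_search([1.0, 0.0, 0.0], archive)
ranked_ids = [scene_id for _, scene_id in results]
```

The cost is one distance computation per scene in the archive, which is why the slide calls this path slow compared with the hash-based quick search.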

The average precision of search by video sample was 0.86. Precision was evaluated by searching by a keyword and then searching by one of the resulting shots with a cosine distance threshold of 0.3. A human expert counted the true and false positives.

SLIDE 20

Use Case 2

  • Extract Features: show sample clip
  • Exhaustive Search: sort by cosine distance
  • Found near-duplicates, robust to resampling, vignetting, hue/saturation augmentation etc.

Values shown with the results (presumably cosine distances): 0.0057, 0.0124, 0.0152, 0.0583

SLIDE 21

Use Case 3

  • Google Image Search: show sample images of unknown concept
  • Train Linear Classifier: Vowpal Wabbit
  • Exhaustive Search: found video clips matching the classifier trained on image feature vectors
  • AP 0.64
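The slides name Vowpal Wabbit for the linear classifier. The sketch below does not call Vowpal Wabbit; it swaps in a small hand-written logistic regression trained by stochastic gradient descent, only to illustrate the flow of training on image feature vectors and then scoring pooled scene vectors. All names, hyperparameters and the toy 2-dimensional data are assumptions:

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Fit weights and bias of a linear classifier with plain SGD on log loss."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
            g = p - labels[i]  # gradient of log loss with respect to the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, x)]
            b -= lr * g
    return w, b


def score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability that feature vector x shows the new concept."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy image feature vectors: label 1 = sample images of the new concept, 0 = other images.
images = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = [1, 1, 0, 0]
w, b = train_logistic(images, labels)

# Exhaustive pass over pooled scene vectors, best match first.
scenes: Dict[str, List[float]] = {"scene_a": [0.95, 0.1], "scene_b": [0.1, 0.9]}
ranked = sorted(scenes, key=lambda sid: score(w, b, scenes[sid]), reverse=True)
```

Because images and video frames pass through the same feature extractor, a classifier trained on image vectors can be applied directly to pooled scene vectors, which is what makes learning a new concept from a handful of photos possible.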

SLIDE 22

Thank you!

Questions welcome

Future Work

  • Faces
  • Places
  • Video to text annotations