Deep Learning Based Semantic Video Indexing and Retrieval
Anna Podlesnaya, Sergey Podlesnyy Cinema and Photo Research Institute (NIKFI)
This work was funded by Russian Federation Ministry of Culture Contract No. 2214-01-41/06-15
Fast Track
Contribution 1
Video Segmentation The feature vector extracted by GoogLeNet contains enough semantic information to segment raw video into shots with 0.94 precision relative to MPEG-4 i-frames.
Contribution 2
Video Indexing A graph database for indexing temporal, spatial and semantic properties is proposed. Cost-efficient pipeline.
Contribution 3
Search by Examples Video retrieval by a sample video clip @ 0.86 precision. Online learning of new concepts: video retrieval by sample photos @ 0.64 precision.
Archives are Huge
Russian Documentary Archive: 250K items (dating from 1910)
Russian TV Archive: 100K items
YouTube: users upload 100 hours of video every minute (as of 2013)
Production Needs
Everyday need for footage in TV production
Non-fiction movie production relies on historical and cultural heritage content
Education, research, art...
MPEG-7 Query Format
ISO/IEC 15938-5:2003 Information technology -- Multimedia content description interface -- Part 5: Multimedia description schemes
Semantic feature extraction by a deep neural network
Shot cuts detected by vector-distance spikes between frames
Temporal pooling to summarize shot semantics
{x0, x1, … xn} — feature vectors of successive frames
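The segmentation step above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the spike threshold value and the use of cosine distance between successive GoogLeNet feature vectors are assumptions, and mean pooling stands in for the deck's "temporal pooling".

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two frame feature vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def segment_shots(features, threshold=0.3):
    """Cut the frame sequence {x0, x1, ..., xn} into shots wherever the
    distance between successive feature vectors spikes above `threshold`
    (threshold value is an assumption for illustration)."""
    cuts = [0]
    for i in range(1, len(features)):
        if cosine_distance(features[i - 1], features[i]) > threshold:
            cuts.append(i)
    # Return (start, end) frame-index pairs, one per shot
    return list(zip(cuts, cuts[1:] + [len(features)]))

def pool_shot(features, start, end):
    # Temporal pooling: summarize a shot by averaging its frame vectors
    return np.mean(features[start:end], axis=0)
```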
[Figure: example shots with camera motion: Zoom, Pan/Rotate, Pan, Zoom/Pan]
Apache Cassandra storage for feature vectors and thumbnails
Neo4j graph database for the movie archive
Structured queries for keyword-based retrieval
Digitizing: starting with film or tape
FV Extraction: store per-frame timecodes and feature vectors in Cassandra
Segmentation: store a per-scene data structure in Neo4j
Indexing: may use additional classifiers for faces, places, salient objects etc.
BK-Tree Building: add edges to the Neo4j graph to speed up nearest-neighbor search
Find Scenes with Zebra
MATCH (s:Shot)-[c:Category]->(w:Wordnet {synset: "zebra"})
WHERE c.weight > 0.1
RETURN s ORDER BY s.duration DESC
Find Scenes with a Lion to the Left of a Zebra
MATCH (s:Shot)-->(zebra_obj:Salient_obj)-->(wz:Wordnet {synset: "zebra"})
MATCH (s)-->(lion_obj:Salient_obj)-->(wl:Wordnet {synset: "lion"})
MATCH (zebra_obj)-[:Left]->(lion_obj)
RETURN s ORDER BY s.duration DESC
Find similar clip Find near-duplicates Online learning of new concepts
Keyword Search: search for ELEPHANT
Select Sample Clip: need an elephant herd, forest, sky
Find Similar Clips: found clips with the required characteristics
Quick Search
31-bit random projection hash (RPH)
BK-tree on RPH Hamming distance from the sample clip
Quick but incomplete search
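The quick-search index above can be sketched as follows: a 31-bit random projection hash (one bit per random hyperplane) plus a BK-tree queried by Hamming radius. This is a minimal illustration under assumed details; the hyperplane sampling, tree layout, and radius are not taken from the talk.

```python
import numpy as np

def random_projection_hash(vec, planes):
    # One bit per random hyperplane: sign of the projection
    bits = planes @ vec > 0
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming(a, b):
    # Hamming distance between two integer hashes
    return bin(a ^ b).count("1")

class BKTree:
    """BK-tree over Hamming distance for quick (but incomplete
    beyond the chosen radius) nearest-neighbor search."""
    def __init__(self):
        self.root = None  # node = (hash, payload, {distance: child})

    def add(self, h, payload):
        if self.root is None:
            self.root = (h, payload, {})
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[2]:
                node = node[2][d]
            else:
                node[2][d] = (h, payload, {})
                return

    def query(self, h, radius):
        # Collect all payloads within `radius` Hamming distance of h,
        # pruning subtrees by the triangle inequality
        found, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = hamming(h, node[0])
            if d <= radius:
                found.append(node[1])
            for dist, child in node[2].items():
                if d - radius <= dist <= d + radius:
                    stack.append(child)
        return found
```

For 31 bits, `planes` would be a 31 x 1024 matrix of random normals applied to the pooled shot vectors.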
Exhaustive Search
Feature vectors pooled by scene (R^1024)
Cosine distance between the sample clip and every scene in the archive, sorted descending
Well, slow
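The exhaustive search can be sketched as below: rank every pooled scene vector by cosine distance to the sample clip, keeping matches under the 0.3 threshold mentioned in the evaluation. A minimal sketch, not the production code; the array layout is an assumption.

```python
import numpy as np

def exhaustive_search(sample, archive, threshold=0.3):
    """Rank every scene in `archive` (rows = pooled feature vectors,
    R^1024 in the talk) by cosine distance to the sample clip's vector.
    Slow: touches every row, hence the BK-tree shortcut elsewhere."""
    sims = archive @ sample / (
        np.linalg.norm(archive, axis=1) * np.linalg.norm(sample))
    dists = 1.0 - sims
    order = np.argsort(dists)  # ascending distance = most similar first
    return [(int(i), float(dists[i])) for i in order if dists[i] <= threshold]
```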
The average precision of search by video sample was 0.86. Precision was evaluated by searching by a keyword and then searching by one of the resulting shots with a cosine distance threshold of 0.3. A human expert counted true and false positives.
Extract Features: show a sample clip
Exhaustive Search: sort by cosine distance
Found near-duplicates robust to resampling, vignetting, hue/saturation augmentation etc.
[Figure: retrieved near-duplicates with cosine distances 0.0057, 0.0124, 0.0152, 0.0583]
Google Image Search: show sample images
Train Linear Classifier
Exhaustive Search: found video clips matching the classifier trained on image feature vectors, AP 0.64
Vowpal Wabbit
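The talk trains the linear classifier with Vowpal Wabbit; as a self-contained stand-in, the sketch below fits a plain logistic regression in NumPy on image feature vectors (positives from image search, negatives assumed to be random archive shots) and ranks archive shots by its score. Hyperparameters and the negative-sampling scheme are assumptions.

```python
import numpy as np

def train_concept_classifier(pos, neg, lr=0.1, epochs=200):
    """Online learning of a new concept: fit a linear classifier on
    image feature vectors.  Logistic regression via gradient descent
    stands in here for Vowpal Wabbit used in the talk."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def score_shots(archive, w, b):
    # Rank archive shots by classifier score, highest first
    scores = archive @ w + b
    return np.argsort(-scores)
```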
Future Work