Category-specific video summarization
Speaker: Danila Potapov Joint work with: Matthijs Douze Zaid Harchaoui Cordelia Schmid LEAR team, Inria Grenoble Rhône-Alpes Christmas Colloquium on Computer Vision Moscow, 28.12.2015
1 / 22
◮ size of video data is growing
◮ 300 hours of video are uploaded to YouTube every minute
◮ types of video data: user-generated, sports, news, movies
◮ common need for structuring video data
2 / 22
3 / 22
◮ Recognize events accurately and efficiently ◮ Identify the most important moments in videos ◮ Quantitative evaluation of video analysis algorithms
4 / 22
◮ supervised approach to video summarization
◮ temporal localization at test time
◮ MED-Summaries dataset for evaluation of video summarization
◮ D. Potapov, M. Douze, Z. Harchaoui, C. Schmid. Category-specific video summarization. ECCV 2014
◮ MED-Summaries dataset online http://lear.inrialpes.fr/people/potapov/med_summaries
5 / 22
◮ evaluation benchmark for video summarization ◮ subset of TRECVID Multimedia Event Detection 2011 dataset ◮ 10 categories
[Figure: comparison of summarization datasets — UTE, SumMe, YouTube Highlights, and MED-Summaries — by total duration, number of annotators per video, and number of segments]
6 / 22
◮ built from a subset of temporal segments of the original video
◮ conveys the most important details of the video
7 / 22
◮ produce visually coherent temporal segments
◮ no shot boundaries, camera shake, etc. inside segments
◮ identify important parts
◮ category-specific importance: a measure of relevance to the category
[Figure: pipeline — the input video (category: Working on a sewing project) is cut into KTS segments; per-segment classification scores are computed; maxima of the scores yield the output summary]
8 / 22
◮ specialized domains
◮ Lu and Grauman [2013], Lee et al. [2012]: summarization of egocentric video
◮ Khosla et al. [2013]: keyframe summaries, canonical views mined from web images
◮ Sun et al. [2014] “Ranking Domain-specific Highlights by
◮ automatic approach for harvesting data ◮ highlight detection vs. temporally coherent summarization
◮ Gygli et al. [2014] “Creating Summaries from User Videos”
◮ cinematic rules for segmentation ◮ small set of informative descriptors 9 / 22
◮ goal: group similar frames so that semantic changes occur only at segment boundaries
◮ kernelized Multiple Change-Point Detection algorithm
◮ change-points divide the video into temporal segments
◮ input: robust frame descriptor (SIFT + Fisher Vector)
10 / 22
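The segmentation idea above can be sketched as a dynamic program. The sketch below is a deliberately simplified version: it uses the plain squared-Euclidean scatter of each segment instead of a kernel Gram matrix, and a fixed number of change points instead of the penalized count used in the actual algorithm.

```python
import numpy as np

def kts_sketch(X, n_cp):
    """Choose n_cp change points minimizing total within-segment scatter.

    Simplified sketch of Kernel Temporal Segmentation: plain Euclidean
    scatter, fixed number of change points (both simplifications).
    """
    n, d = X.shape
    # Prefix sums give O(1) evaluation of any segment's scatter.
    csum = np.vstack([np.zeros((1, d)), np.cumsum(X, axis=0)])
    csq = np.concatenate([[0.0], np.cumsum((X ** 2).sum(axis=1))])

    def cost(t, tp):  # scatter of frames X[t:tp]
        s = csum[tp] - csum[t]
        return csq[tp] - csq[t] - (s @ s) / (tp - t)

    n_seg = n_cp + 1
    INF = float("inf")
    # dp[k][t]: best cost of splitting X[:t] into k segments
    dp = [[INF] * (n + 1) for _ in range(n_seg + 1)]
    back = [[0] * (n + 1) for _ in range(n_seg + 1)]
    dp[0][0] = 0.0
    for k in range(1, n_seg + 1):
        for t in range(k, n + 1):
            for s0 in range(k - 1, t):
                if dp[k - 1][s0] == INF:
                    continue
                c = dp[k - 1][s0] + cost(s0, t)
                if c < dp[k][t]:
                    dp[k][t] = c
                    back[k][t] = s0
    # Backtrack to recover the change points.
    cps, t = [], n
    for k in range(n_seg, 0, -1):
        t = back[k][t]
        cps.append(t)
    return sorted(cps)[1:]  # drop the leading 0
```

On a toy signal of 10 constant frames followed by 10 different constant frames, `kts_sketch(X, 1)` recovers the boundary at index 10.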
◮ within-segment variance of segment $[t, t')$, in kernelized form:

$$v_{t,t'} \;=\; \sum_{i=t}^{t'-1} K(x_i, x_i) \;-\; \frac{1}{t'-t} \sum_{i,j=t}^{t'-1} K(x_i, x_j)$$

◮ objective: minimize the total variance $\sum_i v_{t_i, t_{i+1}}$ over the change points, plus a penalty on the number of segments
11 / 22
◮ Training: train a linear SVM from a set of videos with just video-level category labels
◮ Testing: score segment descriptors with the classifiers
[Figure: pipeline example — per-segment classification scores on KTS segments; maxima of the scores form the output summary (category: Working on a sewing project)]
12 / 22
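The "maxima" step above can be sketched as follows. This is a hypothetical helper, not the paper's code: it keeps segments whose classification score is a strict local maximum and above a threshold.

```python
import numpy as np

def select_summary_segments(scores, min_score=0.0):
    """Indices of segments whose score is a strict local maximum and at
    least min_score. Hypothetical sketch of the 'maxima' selection step."""
    scores = np.asarray(scores, dtype=float)
    keep = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else -np.inf
        right = scores[i + 1] if i + 1 < len(scores) else -np.inf
        if s > left and s > right and s >= min_score:
            keep.append(i)
    return keep
```

For scores [0.1, 0.9, 0.2, 0.5, 0.7, 0.3] with threshold 0.4, segments 1 and 4 are selected.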
◮ 100 test videos (≈ 4 hours) from TRECVID MED 2011
◮ multiple annotators
◮ 2 annotation tasks:
◮ segment boundaries (median duration: 3.5 sec.)
◮ segment importance (grades from 0 to 3)
◮ 0 = not relevant to the category
◮ 3 = highest relevance
13 / 22
14 / 22
15 / 22
◮ often based on user studies
◮ time-consuming, costly and hard to reproduce
◮ Our approach: rely on the annotation of test videos
◮ ground-truth segments {S_i}, i = 1..m
◮ computed summary {S̄_j}, j = 1..m̄
◮ coverage criterion:
[Figure: matching between the ground-truth segmentation and the summary — a summary segment matches a ground-truth segment when it covers that segment's period; a ground-truth segment not covered by the summary yields no match]
◮ importance ratio for summary evaluation
16 / 22
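The importance ratio can be illustrated with a small sketch. This is an assumption-laden reading of the criterion, not the released evaluation code: the numerator sums the importance grades of ground-truth segments matched by the summary, and the denominator is the best total importance achievable by matching the same number of segments.

```python
def importance_ratio(matched_importances, all_importances):
    """Sketch (not the exact MED-Summaries code): summed importance of
    the matched ground-truth segments, divided by the best total
    achievable when matching the same number of segments."""
    n = len(matched_importances)
    best = sum(sorted(all_importances, reverse=True)[:n])
    return sum(matched_importances) / best if best else 0.0
```

If the video's segments have grades [1, 3, 2, 0] and the summary matches the grade-3 and grade-1 segments, the ratio is (3 + 1) / (3 + 2) = 0.8.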
◮ a meaningful summary covers a ground-truth segment of the highest importance
[Figure: example — ground-truth segments with importance grades 1, 3, 2 and classification scores 0.7, 0.5, 0.9; here 3 segments must be selected before the summary covers an importance-3 segment]
◮ segmentation F-score: segments match when overlap/union > β
17 / 22
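The segmentation F-score can be sketched like this: temporal intersection-over-union between segments, a greedy one-to-one matching at threshold β, then the harmonic mean of precision and recall. This is an illustrative reading of the criterion, not the released evaluation code.

```python
def iou(a, b):
    """Temporal intersection-over-union of two intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def segmentation_fscore(gt, pred, beta=0.5):
    """Greedily match each ground-truth segment to the best unused
    predicted segment with IoU > beta; return the F-score."""
    if not gt or not pred:
        return 0.0
    used, matches = set(), 0
    for g in gt:
        best_j, best_v = None, beta
        for j, p in enumerate(pred):
            if j not in used and iou(g, p) > best_v:
                best_j, best_v = j, iou(g, p)
        if best_j is not None:
            used.add(best_j)
            matches += 1
    prec, rec = matches / len(pred), matches / len(gt)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

With ground truth [(0, 10), (10, 20)] and predictions [(0, 9), (12, 20)], both pairs overlap with IoU above 0.5, so the F-score is 1.0.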
◮ Users: keep 1 user in turn as ground truth for evaluation of the remaining users
◮ SD + SVM: shot detector of Massoudi et al. [2006] for temporal segmentation
◮ KTS + Cluster: Kernel Temporal Segmentation + k-means
◮ sort segments by increasing distance to centroid
18 / 22
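The ranking step of the KTS + Cluster baseline could look like the sketch below: segments are ordered by increasing distance to their cluster centroid, so the most "typical" segments come first. The labels and centroids are assumed to come from any k-means run over the segment descriptors; this is not the authors' implementation.

```python
import numpy as np

def rank_by_centroid_distance(descs, labels, centroids):
    """Order segment indices by increasing distance to the centroid of
    their k-means cluster. Sketch of the KTS + Cluster baseline ranking;
    labels/centroids would come from a k-means run over descs."""
    d = np.linalg.norm(descs - centroids[labels], axis=1)
    return list(np.argsort(d))
```

For four descriptors at distances 0, 1, 0.5, and 4 from their centroids, the ranking is [0, 2, 1, 3].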
[Figure: results — summary duration in seconds (lower is better) and importance ratio (higher is better) for Users, SD + SVM, KTS + Cluster, KVS-SIFT, and KVS-MBH]
19 / 22
20 / 22
◮ KVS delivers short and highly informative summaries, with a good trade-off between duration and importance
◮ the temporal segmentation algorithm produces visually coherent segments
◮ KVS is trained in a weakly-supervised way
◮ does not require segment annotations in the training set
◮ MED-Summaries — a dataset for evaluation of video summarization
◮ annotations and evaluation code available online
◮ D. Potapov, M. Douze, Z. Harchaoui, C. Schmid. Category-specific video summarization. ECCV 2014
◮ MED-Summaries dataset online http://lear.inrialpes.fr/people/potapov/med_summaries
21 / 22
22 / 22