Video Object Segmentation
CV3DST | Prof. Leal-Taixé 1
Object Detection, Object Tracking, Object Segmentation, Video Object Segmentation (this lecture)
Goal: generate pixel masks for objects in a video sequence.
Hard to make assumptions about
Semi-supervised (one-shot) video object segmentation: we get the first-frame ground truth mask, so we know what object to segment.
Unsupervised (zero-shot) video object segmentation: we have to find the objects as well as their masks.
Motion segmentation, salient object detection, ...
This lecture: semi-supervised video object segmentation
– Given: segmentation mask of target object(s) in the first frame
– Goal: pixel-accurate segmentation of the entire video
– Currently a major testing ground for segmentation-based tracking
Figure: given the first-frame ground truth, the goal is the complete video segmentation.
learning-based methods
– DAVIS 2016 (30/20, single objects, first frames)
– DAVIS 2017 (60/90, multiple objects)
– YouTube-VOS 2018 (3471/982, multiple objects, first frame where object appears)
– https://davischallenge.org
– https://youtube-vos.org
Optical flow is a concept in computer vision that we need to know first.
image B
motion of the object
channels
How to combine the information from both images?
Fixed operation. No learnable weights!
two feature vectors are
Useful for finding image correspondences
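The idea of a fixed, weight-free comparison between feature vectors of two images can be illustrated with a plain dot product. This is a minimal sketch, not the layer of any particular network: the helper names `correlation` and `best_match` and the toy feature values are assumptions made for this example.

```python
from typing import List, Sequence

def correlation(f_a: Sequence[float], f_b: Sequence[float]) -> float:
    """Dot product of two feature vectors: a fixed operation, nothing to learn."""
    return sum(a * b for a, b in zip(f_a, f_b))

def best_match(f_a: Sequence[float], feats_b: List[Sequence[float]]) -> int:
    """Index of the feature vector from image B that is most similar to f_a."""
    scores = [correlation(f_a, f_b) for f_b in feats_b]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: one feature vector from image A, three candidates from image B.
feat_a = [1.0, 0.0, 2.0]
feats_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.5], [-1.0, 0.0, -2.0]]
match = best_match(feat_a, feats_b)  # candidate with the highest correlation
```

A high score means the two vectors point in a similar direction, which is why such a comparison is useful for finding correspondences between images.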
Find a transformation from image A to image B
How to combine the information from both images?
How to obtain high-quality results?
flow of the object
segmentation and OF iteratively (no DL yet)
Y.H. Tsai et al. “Video Segmentation via Object Flow”. CVPR 2016
– Pre-training for ‘objectness’.
– First-frame adaptation to specific object-of-interest using fine-tuning.
1. Base Network: pre-trained on ImageNet. Learns edges and basic image features.
2. Parent Network: trained on the DAVIS training set. Learns how to do video segmentation.
3. Test Network: fine-tuned on frame 1 of the test sequence. Learns which object to segment.
Stages: pre-trained → training → fine-tuning
the first frame of the test sequence. Overfitting is therefore used to learn the appearance of the foreground
no temporal information
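First-frame fine-tuning can be sketched with a toy per-pixel classifier. This is only an illustration under strong assumptions: a linear model on RGB values stands in for the segmentation network, zero weights stand in for the parent network, and the names `predict` and `finetune` as well as all pixel values are made up for the example.

```python
import math
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # RGB values in [0, 1]

def predict(w: List[float], b: float, px: Pixel) -> float:
    """Foreground probability of one pixel under a linear per-pixel model."""
    z = b + sum(wi * xi for wi, xi in zip(w, px))
    return 1.0 / (1.0 + math.exp(-z))

def finetune(w, b, pixels, labels, steps=500, lr=1.0):
    """Fit (deliberately overfit) the model to the first-frame mask."""
    w = list(w)
    n = len(pixels)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for px, y in zip(pixels, labels):
            err = predict(w, b, px) - y  # gradient of the cross-entropy loss
            grad_b += err
            for i, xi in enumerate(px):
                grad_w[i] += err * xi
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# First frame: a reddish object on a bluish background (1 = foreground).
frame0 = [(0.9, 0.1, 0.1), (0.8, 0.2, 0.1), (0.1, 0.1, 0.9), (0.2, 0.2, 0.8)]
mask0 = [1, 1, 0, 0]
w, b = finetune([0.0, 0.0, 0.0], 0.0, frame0, mask0)

# Later frame: every pixel is classified on its own, with no temporal information.
frame_t = [(0.85, 0.15, 0.1), (0.15, 0.1, 0.85)]
mask_t = [int(predict(w, b, px) > 0.5) for px in frame_t]
```

The model only knows the appearance seen in the first frame, so it works as long as the object and the background keep looking roughly the same.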
propagation or optical flow-based methods)
Two camels! Another annotation, where the 2nd camel is background. Another annotation: the mask is refined.
102ms – One forward pass (parent network)
DAVIS dataset
11.8 pp.
Object flow
If the foreground (or the background) appearance changes too much, the method fails
Introducing Semantics
First frame
He was occluded in the first frame, therefore the network never learned he was background.
Figure (pipeline diagram), panel labels: Input Image; Foreground Estimation CNN; Appearance Model; First-Round Foreground Estimation; Semantic Instance Segmentation; Semantic Prior; Instance Proposals; Selection & Propagation; Top Matching Instances; Conditional; Result.
K.-K. Maninis et al. “Video object segmentation without temporal information”. TPAMI 2018
Semantic prior branch that gives us proposals to select from.
Prior: semantics stay coherent throughout the sequence.
Figure: Semantic Selection (ground truth; instance segmentation proposals; selected instances: person and motorbike) and Semantic Propagation (instance segmentation proposals; first-round foreground estimation; top person and motorbike), shown for frames 0, 18, 24, 30 and 36.
through pose or camera changes), then the model is no longer powerful enough
Why not gradually update the model?
changes every frame – not just the first frame.
every frame.
Blue = background samples, red = foreground samples
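One way to picture how foreground and background samples could be collected in every frame for a gradual model update is to threshold the model's own prediction. The sketch below is an assumption-based illustration, not the published selection rule: the thresholds, the `IGNORE` label and the function name `select_online_samples` are hypothetical, and a real method may pick background samples differently (for example by distance from the last mask).

```python
from typing import List

IGNORE = -1  # label for uncertain pixels that are left out of the update

def select_online_samples(probs: List[List[float]],
                          fg_thresh: float = 0.97,
                          bg_thresh: float = 0.05) -> List[List[int]]:
    """Turn a foreground-probability map into training labels.

    1 = foreground sample, 0 = background sample, IGNORE = not used.
    """
    labels = []
    for row in probs:
        labels.append([1 if p >= fg_thresh else 0 if p <= bg_thresh else IGNORE
                       for p in row])
    return labels

# Toy 2x3 probability map predicted for the current frame.
probs = [[0.99, 0.60, 0.02],
         [0.98, 0.40, 0.01]]
labels = select_online_samples(probs)
```

Only confident pixels become training samples, which limits the risk of the model drifting when it is updated on its own predictions.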
lot from frame to frame.
from previous frame or from coarse estimate).
refinement network to accurately refine the mask estimate.
segmentation at a higher resolution.
Why the name?
– Like displacements to train the regressor of Faster-RCNN
– Very similar in spirit to Tracktor
and appearance for fully automatic segmentation of generic objects in videos”. CVPR 2017. → Optical flow propagation
Object Segmentation”. IJCV 2019. → Clever data augmentation
identification”. CVPRW 2017. → Use re-identification techniques to recover from occlusions
give object instance segmentation proposals.
taking these proposals in each frame and then linking them over time using a merging algorithm.
Until now:
to refine
Now: the inputs are proposals. The goal is to “link” them (much like we did in tracking-by-detection).
principles and gives state-of-the-art results.
– First-frame fine-tuning
– Mask Refinement
– Optical Flow Mask Propagation
– Data Augmentation
– Object Appearance Re-Identification
– Proposal Generation
– Category-agnostic Mask R-CNN proposals
– Fully-convolutional segmentation network trained to refine the segmentation given a proposal bounding box
Proposal generation → Refinement → Merging
– Greedy decision process, chooses proposal(s) with best score
– Optional proposal expansion through Optical Flow propagation
– Proposal score as combination of
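The greedy merging step can be sketched as picking, in each frame, the proposal with the highest combined score. The slide does not list the score terms at this point, so the three cues used here (objectness, mask overlap with the previous frame, re-identification similarity), the equal weights and all names (`proposal_score`, `pick_proposal`) are assumptions for illustration only.

```python
from typing import List, Set, Tuple

Mask = Set[Tuple[int, int]]  # a mask as a set of (row, col) pixels

def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def proposal_score(objectness: float, overlap: float, reid: float,
                   weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the cues; the weights are hypothetical."""
    w_obj, w_ov, w_reid = weights
    return w_obj * objectness + w_ov * overlap + w_reid * reid

def pick_proposal(prev_mask: Mask, proposals: List[dict]) -> int:
    """Greedy step: index of the best-scoring proposal in the current frame."""
    scores = [proposal_score(p["objectness"], iou(prev_mask, p["mask"]), p["reid"])
              for p in proposals]
    return max(range(len(scores)), key=scores.__getitem__)

# Mask of the object in the previous frame and two proposals in the current one.
prev = {(0, 0), (0, 1), (1, 0), (1, 1)}
proposals = [
    {"mask": {(0, 1), (1, 1), (0, 2), (1, 2)}, "objectness": 0.9, "reid": 0.8},
    {"mask": {(5, 5), (5, 6)}, "objectness": 0.95, "reid": 0.1},
]
chosen = pick_proposal(prev, proposals)
```

The second proposal has the higher objectness, but the first one overlaps the previous mask and looks like the target, so the combined score prefers it.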
– Deep-learning based region proposal generators are fit for the task
– Experimented with SharpMask and Mask R-CNN
– Region overlap works as a consistency measure
– Optical flow based propagation really helps
– ReID score also helpful
– PReMVOS has no notion of 3D objects moving through 3D space.
– Track initialization / termination logic needed for real tracking.
– How to obtain the initial segmentation?
Slide from: Jonathon Luiten
region proposals work really well.
embedding for every pixel.
mask, scribble…
together and separate them from background pixels
and perform a nearest neighbor search for the test pixels.
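The nearest neighbor search over pixel embeddings can be sketched as follows. This is a minimal illustration with made-up 2-D embeddings: in practice the embeddings would come from a trained network, and the helper names `sq_dist` and `nn_label` are assumptions.

```python
from typing import List, Sequence, Tuple

def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nn_label(query: Sequence[float],
             reference: List[Tuple[Sequence[float], int]]) -> int:
    """Label of the closest reference embedding (1 = foreground, 0 = background)."""
    return min(reference, key=lambda item: sq_dist(query, item[0]))[1]

# Reference pixels from the annotated frame: (embedding, label).
reference = [([0.9, 0.8], 1), ([1.0, 1.1], 1), ([-1.0, -0.9], 0), ([-0.8, -1.2], 0)]

# Embeddings of pixels in a test frame.
test_pixels = [[0.7, 1.0], [-0.9, -1.0], [0.2, 0.4]]
pred = [nn_label(e, reference) for e in test_pixels]
```

Because classification is a lookup against the reference pixels, nothing has to be trained at test time.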
We do not need to retrain the model for each sequence, nor finetune
independently
for video object segmentation”. CVPR 2019
Instance Segmentation”. ECCV 2016
and temporal coherence are both trained end-to-end
processed once (unlike ConvLSTM example before)
similarly used for one-shot VOS.
– OSVOS: First-frame fine-tuning (appearance model)
– OSVOS-S: + semantic guidance through proposals (shape)
– OnAVOS: Online Adaptation (stronger appearance model)
– MaskTrack: Mask Refinement
– Lucid: clever data augmentation
– ReID-VOS: Object Appearance Re-Identification
– PReMVOS: putting it all together
– Seq2seq and RVOS: recurrent architectures
Appearance, Motion, Matching, Shape
Region similarity: Jaccard index (IoU) of ground truth mask and predicted mask.
Contour accuracy: measures the precision and recall
measure.
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot \text{Prec} \cdot \text{Rec}}{\text{Prec} + \text{Rec}}$$
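Both measures can be computed from pixel sets. The sketch below is a simplified illustration: masks are sets of pixel coordinates, the F-measure compares the two sets exactly (a contour measure would apply it to boundary pixels, usually with some matching tolerance), and returning 1.0 for two empty masks is an assumed convention.

```python
from typing import Set, Tuple

Pixels = Set[Tuple[int, int]]  # a mask as a set of (row, col) pixels

def jaccard(gt: Pixels, pred: Pixels) -> float:
    """Region similarity: intersection over union of the two masks."""
    union = len(gt | pred)
    return len(gt & pred) / union if union else 1.0

def f_measure(gt: Pixels, pred: Pixels) -> float:
    """F from precision and recall of predicted vs. ground truth pixels."""
    tp = len(gt & pred)   # predicted and in the ground truth
    fp = len(pred - gt)   # predicted but not in the ground truth
    fn = len(gt - pred)   # in the ground truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 1), (1, 1), (0, 2)}
j = jaccard(gt, pred)    # 2 shared pixels out of 5 in the union
f = f_measure(gt, pred)  # precision 2/3, recall 1/2
```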
Temporal stability: measures the evolution of object shapes, i.e., how stable the boundaries are in time.
– Estimate the deformation of the mask from t to t+1
– If the transformation is smooth and precise, the result is considered stable.
– A bad result is a jittery mask evolution
– Note: this measure has been dropped due to its instability during occlusions.
Region similarity: Jaccard index (IoU) of ground truth mask and predicted mask.
– Mean: average for the dataset
– Decay: quantifies the performance loss (or gain) over time. → This is currently used to judge temporal stability
– Recall: fraction of sequences scoring higher than a threshold
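The three statistics can be sketched on a list of scores ordered in time. Two details are assumptions not stated on the slide: decay is taken as the mean of the first quarter minus the mean of the last quarter, and the recall threshold defaults to 0.5. Whether the entries are per-frame or per-sequence scores depends on the benchmark protocol; the function name `summarize` is hypothetical.

```python
from statistics import mean
from typing import List

def summarize(scores: List[float], threshold: float = 0.5) -> dict:
    """Mean, recall and decay of a list of scores ordered in time."""
    q = max(1, len(scores) // 4)  # size of the first and last quarter
    return {
        "mean": mean(scores),
        "recall": sum(s > threshold for s in scores) / len(scores),
        "decay": mean(scores[:q]) - mean(scores[-q:]),  # > 0: quality drops over time
    }

# Toy region-similarity values for eight consecutive frames.
j_per_frame = [0.9, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.4]
stats = summarize(j_per_frame)
```

A positive decay means the scores fall towards the end of the sequence, which is how the measure reflects stability over time.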
– First frame mask given (in the supervised case)
– Short video clips with objects present in almost all frames
– Objects in a video are (mostly) of different categories
– Few objects to track (max around 7 per video)
– Scenarios with a large number of objects (20-40), mostly of the same category (e.g., pedestrians)
– Long sequences
– No first frame annotation provided, one has to deal with appearing and disappearing objects.
tracking dataset
– Prof. Xavier Giró, Technical University of Catalonia (UPC)
– Jonathon Luiten, RWTH Aachen