Video Object Segmentation


SLIDE 1

Video Object Segmentation

CV3DST | Prof. Leal-Taixé 1

SLIDE 2

Video Object Segmentation

Topics: Object Detection, Object Tracking, Object Segmentation, Video Object Segmentation (this lecture)

SLIDE 3

Video Object Segmentation

  • Goal: Generate accurate and temporally consistent pixel masks for objects in a video sequence.

SLIDE 4

VOS: some challenges

  • Strong viewpoint/appearance changes

SLIDE 5

VOS: some challenges

  • Strong viewpoint/appearance changes
  • Occlusions

SLIDE 6

VOS: some challenges

  • Strong viewpoint/appearance changes
  • Occlusions
  • Scale changes

SLIDE 7

VOS: some challenges

  • Strong viewpoint/appearance changes
  • Occlusions
  • Scale changes
  • Illumination
  • Shape

Hard to make assumptions about the object's appearance.
Hard to make assumptions about the object's motion.

SLIDE 8

VOS: tasks

  • Semi-supervised (one-shot) video object segmentation: we get the first-frame ground truth mask, so we know which object to segment.
  • Unsupervised (zero-shot) video object segmentation: we have to find the objects as well as their masks.

SLIDE 9

VOS: tasks

  • Semi-supervised (one-shot) video object segmentation: we get the first-frame ground truth mask, so we know which object to segment.
  • Unsupervised (zero-shot) video object segmentation: we have to find the objects as well as their masks (related: motion segmentation, salient object detection…).

SLIDE 10

VOS: tasks

  • Semi-supervised (one-shot) video object segmentation: we get the first-frame ground truth mask, so we know which object to segment. (This lecture)
  • Unsupervised (zero-shot) video object segmentation: we have to find the objects as well as their masks.

SLIDE 11

Supervised Video Object Segmentation

  • Task formulation
    – Given: segmentation mask of the target object(s) in the first frame
    – Goal: pixel-accurate segmentation of the entire video
    – Currently a major testing ground for segmentation-based tracking

SLIDE 12

VOS Datasets

  • Remember that large-scale datasets are needed for learning-based methods:
    – DAVIS 2016 (30/20, single objects, first-frame masks)
    – DAVIS 2017 (60/90, multiple objects, first-frame masks)
    – YouTube-VOS 2018 (3471/982, multiple objects, mask given in the first frame where each object appears)

  https://davischallenge.org
  https://youtube-vos.org

SLIDE 13

Before we get started…

  • Pixel-wise output
  • If we talk about pixel-wise outputs and motion, there is a concept in Computer Vision that we need to know first.

SLIDE 14

Optical flow

SLIDE 15

Optical flow

  • Input: 2 consecutive images (e.g. from a video)
  • Output: displacement of every pixel from image A to image B
  • Results in the "perceived" 2D motion, not the real motion of the object
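Since optical flow is later used to warp masks between frames, it helps to see the definition as code. A minimal NumPy sketch (the helper `warp_with_flow` and the nearest-neighbor sampling are editor simplifications, not from the lecture; real systems interpolate bilinearly):

```python
import numpy as np

def warp_with_flow(img_b, flow):
    """Warp image B back to image A's frame using a dense flow field.

    flow[y, x] = (dy, dx) is the displacement of pixel (y, x) from A to B,
    so A(y, x) ~= B(y + dy, x + dx).  Nearest-neighbor sampling for brevity.
    """
    h, w = img_b.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return img_b[src_y, src_x]

# Toy example: image B is image A shifted right by 1 pixel.
img_a = np.zeros((4, 4)); img_a[1, 1] = 1.0
img_b = np.zeros((4, 4)); img_b[1, 2] = 1.0
flow = np.zeros((4, 4, 2)); flow[..., 1] = 1.0  # every pixel moved +1 in x
warped = warp_with_flow(img_b, flow)
```

With the correct flow, warping B recovers A, which is exactly the property mask-propagation methods exploit.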

SLIDE 16

Optical flow

SLIDE 17

Optical flow

SLIDE 18

Optical flow with CNNs

  • End-to-end supervised learning of optical flow

P. Fischer et al. "FlowNet: Learning Optical Flow With Convolutional Networks". ICCV 2015
SLIDE 19

Optical flow with CNNs

P. Fischer et al. "FlowNet: Learning Optical Flow With Convolutional Networks". ICCV 2015
SLIDE 20

FlowNet: architecture 1

  • Stack both images → the input is now 2 × RGB = 6 channels
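The channel stacking is a one-liner; a small NumPy sketch of the input tensor (array sizes are arbitrary, and a real FlowNet input would be a batched, channels-first tensor):

```python
import numpy as np

# Two RGB frames (H x W x 3), e.g. consecutive video frames.
frame_a = np.random.rand(64, 64, 3)
frame_b = np.random.rand(64, 64, 3)

# FlowNetSimple-style input: concatenate along the channel axis,
# so the first conv layer sees 2 x RGB = 6 channels.
stacked = np.concatenate([frame_a, frame_b], axis=-1)
```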

SLIDE 21

FlowNet: architecture 2

  • Siamese architecture

SLIDE 22

FlowNet: architecture 2

  • Two key design choices
    – How to combine the information from both images?

SLIDE 23

Correlation layer

  • Multiplies a feature vector with another feature vector
  • Fixed operation: no learnable weights!

SLIDE 24

Correlation layer

  • The matching score represents how correlated these two feature vectors are
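The correlation layer can be written directly as dot products over a displacement neighborhood. A NumPy sketch under editor-chosen assumptions (a hypothetical `max_disp` search range, no normalization or striding, which FlowNet adds for efficiency):

```python
import numpy as np

def correlation(feat_a, feat_b, max_disp=3):
    """Correlation layer: for each location in feat_a, compute dot products
    with feat_b features in a (2*max_disp+1)^2 displacement neighborhood.
    A fixed operation with no learnable weights."""
    h, w, c = feat_a.shape
    d = 2 * max_disp + 1
    out = np.zeros((h, w, d * d))
    padded = np.pad(feat_b, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    for i, dy in enumerate(range(-max_disp, max_disp + 1)):
        for j, dx in enumerate(range(-max_disp, max_disp + 1)):
            shifted = padded[max_disp + dy : max_disp + dy + h,
                             max_disp + dx : max_disp + dx + w]
            out[..., i * d + j] = (feat_a * shifted).sum(axis=-1)
    return out

feat_a = np.random.rand(8, 8, 16)
scores = correlation(feat_a, feat_a, max_disp=2)
```

The zero-displacement channel of `correlation(f, f)` is simply the squared feature norm at every location, which is a handy sanity check.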

SLIDE 25

Correlation layer

  • Hint for anyone interested in 3D reconstruction: useful for finding image correspondences, e.g. to find a transformation from image A to image B.

I. Rocco et al. "Convolutional neural network architecture for geometric matching". CVPR 2017

SLIDE 26

FlowNet: architecture 2

  • Two key design choices
    – How to combine the information from both images?
    – How to obtain high-quality results?

SLIDE 27

Can we do VOS with optical flow?

  • Indeed!
  • Better if we focus on the flow of the object
  • We can improve segmentation and optical flow iteratively (no deep learning yet)

Y.H. Tsai et al. "Video Segmentation via Object Flow". CVPR 2016

SLIDE 28

OSVOS

SLIDE 29

First-frame fine-tuning

  • Goal: learn the appearance of the object to track
  • Main contribution: separate training steps
    – Pre-training for 'objectness'
    – First-frame adaptation to the specific object of interest using fine-tuning
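To make the fine-tuning idea concrete, here is a deliberately tiny stand-in: a per-pixel logistic classifier that is overfit to first-frame labels. This only illustrates the training schedule; it is not the OSVOS network, and the features, learning rate, and step count are made up:

```python
import numpy as np

def finetune_pixel_classifier(w, feats, mask, lr=0.5, steps=200):
    """First-frame fine-tuning in miniature: deliberately overfit a per-pixel
    logistic classifier to the frame-1 mask.
    feats: (N, C) per-pixel features; mask: (N,) labels in {0, 1}."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-feats @ w))        # sigmoid prediction
        w -= lr * feats.T @ (p - mask) / len(mask)  # logistic-loss gradient
    return w

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))          # toy per-pixel features of frame 1
mask = (feats[:, 0] > 0).astype(float)     # toy "object" defined by feature 0
w = np.zeros(8)                            # stands in for pre-trained weights
w = finetune_pixel_classifier(w, feats, mask)
pred = feats @ w > 0                       # per-pixel foreground decision
```

Every later frame of the sequence is then classified with this frozen, overfit model, which is why the method works only as long as the appearance stays similar.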

SLIDE 30

One-shot VOS

  • (1) Base Network: pre-trained on ImageNet → learns edges and basic image features.
  • (2) Parent Network: trained on the DAVIS training set → learns how to do video segmentation.
  • (3) Test Network: fine-tuned on frame 1 of the test sequence → learns which object to segment; results are then produced on frame N of the test sequence.

S. Caelles et al. "One-shot video object segmentation". CVPR 2017

SLIDE 31

One-shot VOS

  • One-shot: we see the first-frame ground truth.
  • Fine-tuning step: this is used to deliberately overfit to the first frame of the test sequence. Overfitting is therefore used to learn the appearance of the foreground object (and the background!).
  • Test time: each frame is processed independently → no temporal information.

S. Caelles et al. "One-shot video object segmentation". CVPR 2017

SLIDE 32

Frame-based segmentation

  • PRO: it recovers well from occlusions (unlike mask propagation or optical flow-based methods)
  • CON: it is temporally inconsistent

SLIDE 33

Experiments: highly dynamic scenes

SLIDE 34

Experiments: accuracy vs. annotations

  • Two camels! With another annotation in which the 2nd camel is marked as background, the mask is refined.

SLIDE 35

Fine-tuning time (DAVIS dataset)

  • 102 ms: one forward pass (parent network)
  • 11.8 pp. better than Object Flow

SLIDE 36

Observations

  • OSVOS does not have a notion of object shape.
  • It is a pure appearance-based method: if the foreground (or the background) appearance changes too much, the method fails.

SLIDE 37

Introducing Semantics (first frame)

SLIDE 38

Introducing Semantics

  • He was occluded in the first frame, therefore the network never learned that he was background.

SLIDE 39

But wait…

  • We have already seen models that have an idea of object shape…
  • Instance segmentation methods!

SLIDE 40

OSVOS-S: Semantic propagation

  • Pipeline: the input image goes through a Foreground Estimation CNN (appearance model) to produce a first-round foreground estimation; in parallel, semantic instance segmentation produces instance proposals, from which the top matching instances are chosen via conditional semantic selection & propagation (semantic prior) to give the result.
  • Semantic prior branch gives us proposals to select from. Prior: semantics stay coherent throughout the sequence.

K.-K. Maninis et al. "Video object segmentation without temporal information". TPAMI 2018

SLIDE 41

OSVOS-S: Semantic propagation

  • Semantic Selection: compare instance segmentation proposals against the ground truth; here the selected instances are Person and Motorbike.
  • Semantic Propagation: the top Person and Motorbike proposals are combined with the first-round foreground estimation across the sequence (frames 0, 18, 24, 30, 36).

K.-K. Maninis et al. "Video object segmentation without temporal information". TPAMI 2018

SLIDE 42

Drifting problem

  • If the object greatly changes its appearance (e.g., through pose or camera changes), then the model is no longer powerful enough.
  • But this change was gradual…

SLIDE 43

Drifting problem

  • If the object greatly changes its appearance (e.g., through pose or camera changes), then the model is no longer powerful enough.
  • Why not gradually update the model?

SLIDE 44

OnAVOS: Online Adaptation

  • Online adaptation: adapt the model to appearance changes every frame, not just the first frame.
  • Iteratively fine-tune the model on the previous prediction every frame.
  • CON: Extremely slow.

P. Voigtlaender and B. Leibe. "Online adaptation of convolutional neural networks for video object segmentation". BMVC 2017
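The adaptation loop can be sketched with the same kind of toy linear model used above: confident predictions on the new frame become pseudo-labels for a few fine-tuning steps. All thresholds, step counts, and the linear model itself are illustrative; the actual method fine-tunes a segmentation CNN:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_adapt(w, frame_feats, thresh=0.9, lr=0.1, steps=10):
    """Online adaptation in miniature: on each new frame, treat very
    confident predictions as pseudo-labels and fine-tune on them."""
    p = sigmoid(frame_feats @ w)
    confident = (p > thresh) | (p < 1 - thresh)   # confident fg / bg pixels
    if not confident.any():
        return w                                  # nothing safe to adapt on
    x = frame_feats[confident]
    y = (p[confident] > 0.5).astype(float)        # pseudo-labels
    for _ in range(steps):
        w -= lr * x.T @ (sigmoid(x @ w) - y) / len(y)
    return w

rng = np.random.default_rng(1)
w = np.array([2.0, 0.0])                   # model after first-frame fine-tuning
frames = [rng.normal(size=(50, 2)) for _ in range(5)]
for f in frames:                           # adapt a little on every frame
    w = online_adapt(w, f)
```

Because the model trains on its own predictions, mistakes can be reinforced; OnAVOS counters this with conservative thresholds, which is also why it is slow.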
SLIDE 45

OnAVOS: Online Adaptation

  • Blue = background samples, red = foreground samples

P. Voigtlaender and B. Leibe. "Online adaptation of convolutional neural networks for video object segmentation". BMVC 2017
SLIDE 46

Mask Refinement

  • Assumption: an object, i.e., a mask, does not move a lot from frame to frame.
  • We can often start with an approximate mask (either from the previous frame or from a coarse estimate).
  • We can then use a refinement network to accurately refine the mask estimate.
  • This can take advantage of crop-and-zoom to do segmentation at a higher resolution.
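The crop-and-zoom step is simple enough to sketch exactly. The `margin` parameter is an illustrative choice, and a real system would also resize the crop to the network input resolution before refinement:

```python
import numpy as np

def crop_around_mask(image, mask, margin=0.25):
    """Crop-and-zoom helper: take the bounding box of the previous-frame
    mask, enlarged by a relative margin, so a refinement network can work
    at a higher effective resolution. Assumes a non-empty mask."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    my = int(round((y1 - y0) * margin))     # vertical margin in pixels
    mx = int(round((x1 - x0) * margin))     # horizontal margin in pixels
    y0, y1 = max(0, y0 - my), min(mask.shape[0], y1 + my)
    x0, x1 = max(0, x0 - mx), min(mask.shape[1], x1 + mx)
    return image[y0:y1, x0:x1], (y0, y1, x0, x1)

img = np.arange(100).reshape(10, 10)
mask = np.zeros((10, 10), bool); mask[3:7, 2:8] = True   # previous-frame mask
crop, box = crop_around_mask(img, mask)
```

The returned box is also what lets the refined mask be pasted back into full-image coordinates.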

SLIDE 47

MaskTrack

  • Why the name?

A. Khoreva et al. "Learning Video Object Segmentation from Static Images". CVPR 2017

SLIDE 48

MaskTrack

  • Training inputs can be simulated!
    – Like the displacements used to train the regressor of Faster R-CNN
    – Very similar in spirit to Tracktor

A. Khoreva et al. "Learning Video Object Segmentation from Static Images". CVPR 2017

SLIDE 49

Worth reading

  • S. Jain et al. "FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos". CVPR 2017 → optical flow propagation
  • A. Khoreva et al. "Lucid Data Dreaming for Video Object Segmentation". IJCV 2019 → clever data augmentation
  • X. Li et al. "Video object segmentation with re-identification". CVPRW 2017 → use re-identification techniques to recover from occlusions

SLIDE 50

Proposal-based approaches

SLIDE 51

Proposal Generation

  • Instance segmentation networks (e.g. Mask R-CNN) give object instance segmentation proposals.
  • One can approach video object segmentation by taking these proposals in each frame and then linking them over time using a merging algorithm.
  • Until now, the input was the whole image and proposals were put on top just to refine. Now the inputs are the proposals themselves, and the goal is to "link" them (much like we did in tracking-by-detection).
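A minimal sketch of such linking: greedily extend each track with the current-frame proposal that best overlaps its last mask, exactly as in IoU-based tracking-by-detection (the threshold and the greedy strategy are illustrative simplifications of real merging algorithms):

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_proposals(tracks, proposals, iou_thresh=0.3):
    """Greedily extend each track (a list of masks) with the best-overlapping
    unused proposal of the current frame."""
    used = set()
    for track in tracks:
        ious = [mask_iou(track[-1], p) if i not in used else -1.0
                for i, p in enumerate(proposals)]
        if not ious:
            continue
        best = int(np.argmax(ious))
        if ious[best] >= iou_thresh:
            track.append(proposals[best])
            used.add(best)
    return tracks

# Two tracks, and two slightly shifted proposals in the next frame.
m1 = np.zeros((6, 6), bool); m1[0:3, 0:3] = True
m2 = np.zeros((6, 6), bool); m2[3:6, 3:6] = True
p1 = np.zeros((6, 6), bool); p1[0:3, 1:4] = True
p2 = np.zeros((6, 6), bool); p2[3:6, 2:5] = True
tracks = link_proposals([[m1], [m2]], [p2, p1])
```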

SLIDE 52

PReMVOS

  • An approach that combines all of the previous VOS principles and gives state-of-the-art results.
  • Combines the following principles:
    – First-frame fine-tuning
    – Mask refinement
    – Optical flow mask propagation
    – Data augmentation
    – Object appearance re-identification
    – Proposal generation

J. Luiten et al. "PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation". ACCV 2018
SLIDE 53

PReMVOS: Overview (Proposal generation → Refinement → Merging)

  • Proposal generation
    – Category-agnostic Mask R-CNN proposals
  • Refinement
    – Fully-convolutional segmentation network trained to refine the segmentation given a proposal bounding box

SLIDE 54

PReMVOS: Overview (Proposal generation → Refinement → Merging)

  • Merging
    – Greedy decision process: chooses the proposal(s) with the best score
    – Optional proposal expansion through optical flow propagation
    – Proposal score as a combination of:
      • Objectness score
      • Mask propagation IoU score (optical flow warping)
      • ReID score
      • Object-object interaction scores
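A sketch of one greedy merging step with a simple weighted sum of the cues listed above. The weights, the dictionary layout, and the function name are made up for illustration; the paper uses its own tuned combination and also includes interaction terms:

```python
import numpy as np

def proposal_score(objectness, warp_iou, reid_sim, weights=(1.0, 1.0, 1.0)):
    """Combined proposal score in the spirit of greedy merging: a weighted
    sum of objectness, mask-propagation IoU (from optical-flow warping of
    the previous mask), and ReID similarity to the tracked object."""
    a, b, c = weights
    return a * objectness + b * warp_iou + c * reid_sim

# One greedy step: pick the best-scoring proposal for the current track.
proposals = [
    dict(objectness=0.9, warp_iou=0.2, reid_sim=0.3),  # looks good, wrong place
    dict(objectness=0.7, warp_iou=0.8, reid_sim=0.9),  # consistent with track
]
scores = [proposal_score(**p) for p in proposals]
best = int(np.argmax(scores))
```

Note how the propagation and ReID terms let a slightly weaker detection win because it is consistent with the track's history.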

SLIDE 55

PReMVOS: results

  • Very complex, but a winner:
    – DAVIS Challenge 2018 winner
    – YouTube-VOS Challenge 2018 winner

SLIDE 56

PReMVOS: Visual Results

SLIDE 57

Lessons Learned

  • Challenge 1: How to generate proposals?
    – Deep-learning-based region proposal generators are fit for the task
    – Experimented with SharpMask and Mask R-CNN
  • Challenge 2: How to track region proposals?
    – Region overlap works as a consistency measure
    – Optical flow-based propagation really helps
    – ReID score is also helpful
  • Open issues
    – PReMVOS has no notion of 3D objects moving through 3D space.
    – Track initialization/termination logic is needed for real tracking.
    – How to obtain the initial segmentation?

Slide from: Jonathon Luiten

SLIDE 58

Retrieval approaches

SLIDE 59

Pixel-wise retrieval

  • Re-identification networks based on bounding-box region proposals work really well.
  • This idea can be extended to a re-identification embedding for every pixel.

SLIDE 60

Pixel-wise retrieval

  • The user input can be in any form: first-frame ground-truth mask, scribble…

Y. Chen et al. "Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning". CVPR 2018
SLIDE 61

Pixel-wise retrieval

  • Training: use the triplet loss to bring foreground pixels together and separate them from background pixels.

Y. Chen et al. "Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning". CVPR 2018
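The loss itself is the standard triplet loss applied to pixel embeddings; a NumPy sketch on two-dimensional toy embeddings (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss on pixel embeddings: pull the anchor towards
    a same-class pixel (positive) and push it away from an opposite-class
    pixel (negative) by at least `margin` in squared distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

fg_a = np.array([1.0, 0.0])   # a foreground-pixel embedding (anchor)
fg_b = np.array([0.9, 0.1])   # another foreground pixel
bg = np.array([0.0, 1.0])     # a background pixel
good = triplet_loss(fg_a, fg_b, bg)   # well separated: loss is zero
bad = triplet_loss(fg_a, bg, fg_b)    # violated triplet: positive loss
```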
SLIDE 62

Pixel-wise retrieval

  • Test: embed pixels from both the annotated frame and the test frame, and perform a nearest-neighbor search for the test pixels.
  • We do not need to retrain the model for each sequence, nor fine-tune it.

Y. Chen et al. "Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning". CVPR 2018
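Test-time segmentation then reduces to nearest-neighbor retrieval in embedding space. A brute-force NumPy sketch (real systems use an approximate nearest-neighbor index for speed, and the 2-D embeddings here are toy values):

```python
import numpy as np

def segment_by_retrieval(ref_emb, ref_labels, test_emb):
    """Label each test pixel with the label of its nearest reference-pixel
    embedding; the annotated first frame acts as the retrieval database."""
    # Pairwise squared distances, shape (n_test, n_ref).
    d = ((test_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(-1)
    return ref_labels[d.argmin(axis=1)]

ref_emb = np.array([[0.0, 0.0], [1.0, 1.0]])   # one bg, one fg reference pixel
ref_labels = np.array([0, 1])                  # 0 = background, 1 = foreground
test_emb = np.array([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8]])
labels = segment_by_retrieval(ref_emb, ref_labels, test_emb)
```

Because only the search changes per sequence, no retraining or fine-tuning is needed, which is exactly the speed advantage the slide points out.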

SLIDE 63

We are dealing with video

  • Which is a sequence of images…
  • And we have not talked about…
  • Recurrent Neural Networks!

SLIDE 64

Spatio-temporal approaches

SLIDE 65

Temporal LSTM

  • One-shot video object segmentation
  • If we have multiple objects, each of them is predicted independently

N. Xu et al. "YouTube-VOS: Sequence-to-sequence video object segmentation". ECCV 2018
SLIDE 66

RVOS: temporal and spatial LSTM

C. Ventura et al. "RVOS: end-to-end recurrent network for video object segmentation". CVPR 2019
B. Romera-Paredes and P.H.S. Torr. "Recurrent Instance Segmentation". ECCV 2016

SLIDE 67

RVOS: temporal and spatial LSTM

  • Instance generation and temporal coherence are both trained end-to-end.
  • The image only needs to be processed once (unlike the ConvLSTM example before).

C. Ventura et al. "RVOS: end-to-end recurrent network for video object segmentation". CVPR 2019

SLIDE 68

Transformers for VOS

S. Oh et al. "Video Object Segmentation using Space-Time Memory Networks". ICCV 2019
SLIDE 69

Graph Attention Networks for VOS

  • They use it for zero-shot segmentation, but it could similarly be used for one-shot VOS.

W. Wang et al. "Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks". ICCV 2019

SLIDE 70

Overview of the methods

  • Video Object Segmentation (VOS)
    – OSVOS: first-frame fine-tuning (appearance model)
    – OSVOS-S: + semantic guidance through proposals (shape)
    – OnAVOS: online adaptation (stronger appearance model)
    – MaskTrack: mask refinement
    – Lucid: clever data augmentation
    – ReID-VOS: object appearance re-identification
    – PReMVOS: putting it all together
    – Seq2seq and RVOS: recurrent architectures

Method ingredients: appearance, motion, matching, shape.

SLIDE 71

Evaluation and metrics

SLIDE 72

Metrics for VOS

  • Region similarity: Jaccard index (IoU) of the ground truth mask and the predicted mask.
  • Contour accuracy: measures the precision and recall of the boundary pixels. This is put together in the F-measure:

  Precision = TP / (TP + FP)
  Recall = TP / (TP + FN)
  F = (2 · Prec · Rec) / (Prec + Rec)
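Both measures are easy to compute from binary masks. A NumPy sketch (boundary extraction is omitted for brevity, so `f_measure` just combines given boundary precision/recall values):

```python
import numpy as np

def jaccard(gt, pred):
    """Region similarity J: IoU of ground-truth and predicted masks."""
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return inter / union if union else 1.0   # two empty masks agree perfectly

def f_measure(precision, recall):
    """Contour accuracy F: harmonic mean of boundary precision and recall."""
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0

gt = np.zeros((4, 4), bool); gt[1:3, 1:3] = True     # 4-pixel object
pred = np.zeros((4, 4), bool); pred[1:3, 1:4] = True # 6-pixel prediction
j = jaccard(gt, pred)        # intersection 4, union 6
f = f_measure(0.5, 1.0)      # toy boundary precision/recall
```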

SLIDE 73

Metrics for VOS

  • Temporal stability: measures the evolution of object shapes, i.e., how stable the boundaries are in time.
    – Estimate the deformation of the mask from t to t+1.
    – If the transformation is smooth and precise, the result is considered stable.
    – A bad result is a jittery mask evolution.
    – Note: this measure has been dropped due to its instability during occlusions.

SLIDE 74

Metrics for VOS

  • You can use error measure statistics.
  • Region similarity: Jaccard index (IoU) of the ground truth mask and the predicted mask.
    – Mean: average over the dataset
    – Decay: quantifies the performance loss (or gain) over time → currently used to judge temporal stability
    – Recall: fraction of sequences scoring higher than a threshold
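A sketch of these statistics over a sequence of per-frame Jaccard scores. Note the exact binning used for Decay in the official DAVIS toolkit may differ from this simple first-vs-last split:

```python
import numpy as np

def j_statistics(per_frame_j, thresh=0.5, n_bins=4):
    """Mean / Recall / Decay statistics over per-frame Jaccard scores."""
    j = np.asarray(per_frame_j, dtype=float)
    mean = j.mean()                               # average performance
    recall = (j > thresh).mean()                  # fraction above threshold
    bins = np.array_split(j, n_bins)              # split the sequence in time
    decay = bins[0].mean() - bins[-1].mean()      # early minus late performance
    return mean, recall, decay

# A slowly degrading track: positive decay signals drift over time.
j = [0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4]
mean, recall, decay = j_statistics(j)
```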

SLIDE 75

Tracking and Segmentation

SLIDE 76

VOS → MOTS

  • Video Object Segmentation (VOS) is limited by:
    – First-frame mask given (in the supervised case)
    – Short video clips with objects present in almost all frames
    – Objects in a video are (mostly) of different categories
    – Few objects to track (max. around 7 per video)
  • Multi-Object Tracking and Segmentation (MOTS):
    – Scenarios with a large number of objects (20-40), mostly of the same category (e.g., pedestrians)
    – Long sequences
    – No first-frame annotation provided; one has to deal with appearing and disappearing objects.

SLIDE 77

MOTS dataset

  • Segmentations coming to the MOTChallenge pedestrian tracking dataset

P. Voigtlaender et al. "MOTS: Multi-Object Tracking and Segmentation". CVPR 2019
SLIDE 78

Video Object Segmentation

SLIDE 79

Disclaimer

  • This lecture was done borrowing material from:
    – Prof. Xavier Giró, Technical University of Catalonia (UPC)
    – Jonathon Luiten, RWTH Aachen