Video Object Segmentation


SLIDE 1

Video Object Segmentation

CV3DST | Prof. Leal-Taixé 1

SLIDE 2

Video Object Segmentation

Topics: Object Detection, Object Tracking, Object Segmentation, Video Object Segmentation (this lecture)

SLIDE 3

Video Object Segmentation

  • Goal: Generate accurate and temporally consistent pixel masks for objects in a video sequence.

SLIDE 4

VOS: some challenges

  • Strong viewpoint/appearance changes

SLIDE 5

VOS: some challenges

  • Strong viewpoint/appearance changes
  • Occlusions

SLIDE 6

VOS: some challenges

  • Strong viewpoint/appearance changes
  • Occlusions
  • Scale changes

SLIDE 7

VOS: some challenges

  • Strong viewpoint/appearance changes
  • Occlusions
  • Scale changes
  • Illumination
  • Shape

Hard to make assumptions about the object's appearance.
Hard to make assumptions about the object's motion.

SLIDE 8

VOS: tasks

  • Semi-supervised (one-shot) video object segmentation: we get the first-frame ground truth mask, so we know which object to segment.
  • Unsupervised (zero-shot) video object segmentation: we have to find the objects as well as their masks.

SLIDE 9

VOS: tasks

  • Semi-supervised (one-shot) video object segmentation: we get the first-frame ground truth mask, so we know which object to segment.
  • Unsupervised (zero-shot) video object segmentation: we have to find the objects as well as their masks (related: motion segmentation, salient object detection…).

SLIDE 10

VOS: tasks

  • Semi-supervised (one-shot) video object segmentation: we get the first-frame ground truth mask, so we know which object to segment. (This lecture)
  • Unsupervised (zero-shot) video object segmentation: we have to find the objects as well as their masks.

SLIDE 11

Supervised Video Object Segmentation

  • Task formulation
    – Given: segmentation mask of the target object(s) in the first frame
    – Goal: pixel-accurate segmentation of the entire video
    – Currently a major testing ground for segmentation-based tracking

SLIDE 12

VOS Datasets

  • Remember that large-scale datasets are needed for learning-based methods:
    – DAVIS 2016 (30/20, single objects, first-frame masks)
    – DAVIS 2017 (60/90, multiple objects, first-frame masks)
    – YouTube-VOS 2018 (3471/982, multiple objects, mask given in the first frame where each object appears)

  https://davischallenge.org
  https://youtube-vos.org

SLIDE 13

Before we get started…

  • Pixel-wise output
  • If we talk about pixel-wise outputs and motion, there is a concept in Computer Vision that we need to know first.

SLIDE 14

Optical flow

SLIDE 15

Optical flow

  • Input: 2 consecutive images (e.g. from a video)
  • Output: displacement of every pixel from image A to image B
  • Results in the "perceived" 2D motion, not the real motion of the object
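Since optical flow is later used to warp masks between frames, it helps to see the definition as code. A minimal NumPy sketch (the helper `warp_with_flow` and the nearest-neighbor sampling are editor simplifications, not from the lecture; real systems interpolate bilinearly):

```python
import numpy as np

def warp_with_flow(img_b, flow):
    """Warp image B back to image A's frame using a dense flow field.

    flow[y, x] = (dy, dx) is the displacement of pixel (y, x) from A to B,
    so A(y, x) ~= B(y + dy, x + dx).  Nearest-neighbor sampling for brevity.
    """
    h, w = img_b.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return img_b[src_y, src_x]

# Toy example: image B is image A shifted right by 1 pixel.
img_a = np.zeros((4, 4)); img_a[1, 1] = 1.0
img_b = np.zeros((4, 4)); img_b[1, 2] = 1.0
flow = np.zeros((4, 4, 2)); flow[..., 1] = 1.0  # every pixel moved +1 in x
warped = warp_with_flow(img_b, flow)
```

With the correct flow, warping B recovers A, which is exactly the property mask-propagation methods exploit.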

SLIDE 16

Optical flow

SLIDE 17

Optical flow

SLIDE 18

Optical flow with CNNs

  • End-to-end supervised learning of optical flow

P. Fischer et al. "FlowNet: Learning Optical Flow With Convolutional Networks". ICCV 2015
SLIDE 19

Optical flow with CNNs

P. Fischer et al. "FlowNet: Learning Optical Flow With Convolutional Networks". ICCV 2015
SLIDE 20

FlowNet: architecture 1

  • Stack both images → the input is now 2 × RGB = 6 channels
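The channel stacking is a one-liner; a small NumPy sketch of the input tensor (array sizes are arbitrary, and a real FlowNet input would be a batched, channels-first tensor):

```python
import numpy as np

# Two RGB frames (H x W x 3), e.g. consecutive video frames.
frame_a = np.random.rand(64, 64, 3)
frame_b = np.random.rand(64, 64, 3)

# FlowNetSimple-style input: concatenate along the channel axis,
# so the first conv layer sees 2 x RGB = 6 channels.
stacked = np.concatenate([frame_a, frame_b], axis=-1)
```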

SLIDE 21

FlowNet: architecture 2

  • Siamese architecture

SLIDE 22

FlowNet: architecture 2

  • Two key design choices
    – How to combine the information from both images?

SLIDE 23

Correlation layer

  • Multiplies a feature vector with another feature vector
  • Fixed operation: no learnable weights!

SLIDE 24

Correlation layer

  • The matching score represents how correlated these two feature vectors are
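The correlation layer can be written directly as dot products over a displacement neighborhood. A NumPy sketch under editor-chosen assumptions (a hypothetical `max_disp` search range, no normalization or striding, which FlowNet adds for efficiency):

```python
import numpy as np

def correlation(feat_a, feat_b, max_disp=3):
    """Correlation layer: for each location in feat_a, compute dot products
    with feat_b features in a (2*max_disp+1)^2 displacement neighborhood.
    A fixed operation with no learnable weights."""
    h, w, c = feat_a.shape
    d = 2 * max_disp + 1
    out = np.zeros((h, w, d * d))
    padded = np.pad(feat_b, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    for i, dy in enumerate(range(-max_disp, max_disp + 1)):
        for j, dx in enumerate(range(-max_disp, max_disp + 1)):
            shifted = padded[max_disp + dy : max_disp + dy + h,
                             max_disp + dx : max_disp + dx + w]
            out[..., i * d + j] = (feat_a * shifted).sum(axis=-1)
    return out

feat_a = np.random.rand(8, 8, 16)
scores = correlation(feat_a, feat_a, max_disp=2)
```

The zero-displacement channel of `correlation(f, f)` is simply the squared feature norm at every location, which is a handy sanity check.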

SLIDE 25

Correlation layer

  • Hint for anyone interested in 3D reconstruction: useful for finding image correspondences, e.g. to find a transformation from image A to image B.

I. Rocco et al. "Convolutional neural network architecture for geometric matching". CVPR 2017

SLIDE 26

FlowNet: architecture 2

  • Two key design choices
    – How to combine the information from both images?
    – How to obtain high-quality results?

SLIDE 27

Can we do VOS with optical flow?

  • Indeed!
  • Better if we focus on the flow of the object
  • We can improve segmentation and optical flow iteratively (no deep learning yet)

Y.H. Tsai et al. "Video Segmentation via Object Flow". CVPR 2016

SLIDE 28

OSVOS

SLIDE 29

First-frame fine-tuning

  • Goal: learn the appearance of the object to track
  • Main contribution: separate training steps
    – Pre-training for 'objectness'
    – First-frame adaptation to the specific object of interest using fine-tuning
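To make the fine-tuning idea concrete, here is a deliberately tiny stand-in: a per-pixel logistic classifier that is overfit to first-frame labels. This only illustrates the training schedule; it is not the OSVOS network, and the features, learning rate, and step count are made up:

```python
import numpy as np

def finetune_pixel_classifier(w, feats, mask, lr=0.5, steps=200):
    """First-frame fine-tuning in miniature: deliberately overfit a per-pixel
    logistic classifier to the frame-1 mask.
    feats: (N, C) per-pixel features; mask: (N,) labels in {0, 1}."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-feats @ w))        # sigmoid prediction
        w -= lr * feats.T @ (p - mask) / len(mask)  # logistic-loss gradient
    return w

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))          # toy per-pixel features of frame 1
mask = (feats[:, 0] > 0).astype(float)     # toy "object" defined by feature 0
w = np.zeros(8)                            # stands in for pre-trained weights
w = finetune_pixel_classifier(w, feats, mask)
pred = feats @ w > 0                       # per-pixel foreground decision
```

Every later frame of the sequence is then classified with this frozen, overfit model, which is why the method works only as long as the appearance stays similar.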

SLIDE 30

One-shot VOS

  • (1) Base Network: pre-trained on ImageNet → learns edges and basic image features.
  • (2) Parent Network: trained on the DAVIS training set → learns how to do video segmentation.
  • (3) Test Network: fine-tuned on frame 1 of the test sequence → learns which object to segment; results are then produced on frame N of the test sequence.

S. Caelles et al. "One-shot video object segmentation". CVPR 2017

SLIDE 31

One-shot VOS

  • One-shot: we see the first-frame ground truth.
  • Fine-tuning step: this is used to deliberately overfit to the first frame of the test sequence. Overfitting is therefore used to learn the appearance of the foreground object (and the background!).
  • Test time: each frame is processed independently → no temporal information.

S. Caelles et al. "One-shot video object segmentation". CVPR 2017

SLIDE 32

Frame-based segmentation

  • PRO: it recovers well from occlusions (unlike mask propagation or optical flow-based methods)
  • CON: it is temporally inconsistent

SLIDE 33

Experiments: highly dynamic scenes

SLIDE 34

Experiments: accuracy vs. annotations

  • Two camels! With another annotation in which the 2nd camel is marked as background, the mask is refined.

SLIDE 35

Fine-tuning time (DAVIS dataset)

  • 102 ms: one forward pass (parent network)
  • 11.8 pp. better than Object Flow

SLIDE 36

Observations

  • OSVOS does not have a notion of object shape.
  • It is a pure appearance-based method: if the foreground (or the background) appearance changes too much, the method fails.

SLIDE 37

Introducing Semantics (first frame)

SLIDE 38

Introducing Semantics

  • He was occluded in the first frame, therefore the network never learned that he was background.

SLIDE 39

But wait…

  • We have already seen models that have an idea of object shape…
  • Instance segmentation methods!

SLIDE 40

OSVOS-S: Semantic propagation

  • Pipeline: the input image goes through a Foreground Estimation CNN (appearance model) to produce a first-round foreground estimation; in parallel, semantic instance segmentation produces instance proposals, from which the top matching instances are chosen via conditional semantic selection & propagation (semantic prior) to give the result.
  • Semantic prior branch gives us proposals to select from. Prior: semantics stay coherent throughout the sequence.

K.-K. Maninis et al. "Video object segmentation without temporal information". TPAMI 2018

SLIDE 41

OSVOS-S: Semantic propagation

  • Semantic Selection: compare instance segmentation proposals against the ground truth; here the selected instances are Person and Motorbike.
  • Semantic Propagation: the top Person and Motorbike proposals are combined with the first-round foreground estimation across the sequence (frames 0, 18, 24, 30, 36).

K.-K. Maninis et al. "Video object segmentation without temporal information". TPAMI 2018

SLIDE 42

Drifting problem

  • If the object greatly changes its appearance (e.g., through pose or camera changes), then the model is no longer powerful enough.
  • But this change was gradual…

SLIDE 43

Drifting problem

  • If the object greatly changes its appearance (e.g., through pose or camera changes), then the model is no longer powerful enough.
  • Why not gradually update the model?

SLIDE 44

OnAVOS: Online Adaptation

  • Online adaptation: adapt the model to appearance changes every frame, not just the first frame.
  • Iteratively fine-tune the model on the previous prediction every frame.
  • CON: Extremely slow.

P. Voigtlaender and B. Leibe. "Online adaptation of convolutional neural networks for video object segmentation". BMVC 2017
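The adaptation loop can be sketched with the same kind of toy linear model used above: confident predictions on the new frame become pseudo-labels for a few fine-tuning steps. All thresholds, step counts, and the linear model itself are illustrative; the actual method fine-tunes a segmentation CNN:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_adapt(w, frame_feats, thresh=0.9, lr=0.1, steps=10):
    """Online adaptation in miniature: on each new frame, treat very
    confident predictions as pseudo-labels and fine-tune on them."""
    p = sigmoid(frame_feats @ w)
    confident = (p > thresh) | (p < 1 - thresh)   # confident fg / bg pixels
    if not confident.any():
        return w                                  # nothing safe to adapt on
    x = frame_feats[confident]
    y = (p[confident] > 0.5).astype(float)        # pseudo-labels
    for _ in range(steps):
        w -= lr * x.T @ (sigmoid(x @ w) - y) / len(y)
    return w

rng = np.random.default_rng(1)
w = np.array([2.0, 0.0])                   # model after first-frame fine-tuning
frames = [rng.normal(size=(50, 2)) for _ in range(5)]
for f in frames:                           # adapt a little on every frame
    w = online_adapt(w, f)
```

Because the model trains on its own predictions, mistakes can be reinforced; OnAVOS counters this with conservative thresholds, which is also why it is slow.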
SLIDE 45

OnAVOS: Online Adaptation

  • Blue = background samples, red = foreground samples

P. Voigtlaender and B. Leibe. "Online adaptation of convolutional neural networks for video object segmentation". BMVC 2017
SLIDE 46

Mask Refinement

  • Assumption: an object, i.e., a mask, does not move a lot from frame to frame.
  • We can often start with an approximate mask (either from the previous frame or from a coarse estimate).
  • We can then use a refinement network to accurately refine the mask estimate.
  • This can take advantage of crop-and-zoom to do segmentation at a higher resolution.
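The crop-and-zoom step is simple enough to sketch exactly. The `margin` parameter is an illustrative choice, and a real system would also resize the crop to the network input resolution before refinement:

```python
import numpy as np

def crop_around_mask(image, mask, margin=0.25):
    """Crop-and-zoom helper: take the bounding box of the previous-frame
    mask, enlarged by a relative margin, so a refinement network can work
    at a higher effective resolution. Assumes a non-empty mask."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    my = int(round((y1 - y0) * margin))     # vertical margin in pixels
    mx = int(round((x1 - x0) * margin))     # horizontal margin in pixels
    y0, y1 = max(0, y0 - my), min(mask.shape[0], y1 + my)
    x0, x1 = max(0, x0 - mx), min(mask.shape[1], x1 + mx)
    return image[y0:y1, x0:x1], (y0, y1, x0, x1)

img = np.arange(100).reshape(10, 10)
mask = np.zeros((10, 10), bool); mask[3:7, 2:8] = True   # previous-frame mask
crop, box = crop_around_mask(img, mask)
```

The returned box is also what lets the refined mask be pasted back into full-image coordinates.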

SLIDE 47

MaskTrack

  • Why the name?

A. Khoreva et al. "Learning Video Object Segmentation from Static Images". CVPR 2017

SLIDE 48

MaskTrack

  • Training inputs can be simulated!
    – Like the displacements used to train the regressor of Faster R-CNN
    – Very similar in spirit to Tracktor

A. Khoreva et al. "Learning Video Object Segmentation from Static Images". CVPR 2017

SLIDE 49

Worth reading

  • S. Jain et al. "FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos". CVPR 2017 → optical flow propagation
  • A. Khoreva et al. "Lucid Data Dreaming for Video Object Segmentation". IJCV 2019 → clever data augmentation
  • X. Li et al. "Video object segmentation with re-identification". CVPRW 2017 → use re-identification techniques to recover from occlusions

SLIDE 50

Proposal-based approaches

SLIDE 51

Proposal Generation

  • Instance segmentation networks (e.g. Mask R-CNN) give object instance segmentation proposals.
  • One can approach video object segmentation by taking these proposals in each frame and then linking them over time using a merging algorithm.
  • Until now, the input was the whole image and proposals were put on top just to refine. Now the inputs are the proposals themselves, and the goal is to "link" them (much like we did in tracking-by-detection).
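A minimal sketch of such linking: greedily extend each track with the current-frame proposal that best overlaps its last mask, exactly as in IoU-based tracking-by-detection (the threshold and the greedy strategy are illustrative simplifications of real merging algorithms):

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def link_proposals(tracks, proposals, iou_thresh=0.3):
    """Greedily extend each track (a list of masks) with the best-overlapping
    unused proposal of the current frame."""
    used = set()
    for track in tracks:
        ious = [mask_iou(track[-1], p) if i not in used else -1.0
                for i, p in enumerate(proposals)]
        if not ious:
            continue
        best = int(np.argmax(ious))
        if ious[best] >= iou_thresh:
            track.append(proposals[best])
            used.add(best)
    return tracks

# Two tracks, and two slightly shifted proposals in the next frame.
m1 = np.zeros((6, 6), bool); m1[0:3, 0:3] = True
m2 = np.zeros((6, 6), bool); m2[3:6, 3:6] = True
p1 = np.zeros((6, 6), bool); p1[0:3, 1:4] = True
p2 = np.zeros((6, 6), bool); p2[3:6, 2:5] = True
tracks = link_proposals([[m1], [m2]], [p2, p1])
```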

SLIDE 52

PReMVOS

  • An approach that combines all of the previous VOS principles and gives state-of-the-art results.
  • Combines the following principles:
    – First-frame fine-tuning
    – Mask refinement
    – Optical flow mask propagation
    – Data augmentation
    – Object appearance re-identification
    – Proposal generation

J. Luiten et al. "PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation". ACCV 2018
SLIDE 53

PReMVOS: Overview (Proposal generation → Refinement → Merging)

  • Proposal generation
    – Category-agnostic Mask R-CNN proposals
  • Refinement
    – Fully-convolutional segmentation network trained to refine the segmentation given a proposal bounding box

SLIDE 54

PReMVOS: Overview (Proposal generation → Refinement → Merging)

  • Merging
    – Greedy decision process: chooses the proposal(s) with the best score
    – Optional proposal expansion through optical flow propagation
    – Proposal score as a combination of:
      • Objectness score
      • Mask propagation IoU score (optical flow warping)
      • ReID score
      • Object-object interaction scores
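A sketch of one greedy merging step with a simple weighted sum of the cues listed above. The weights, the dictionary layout, and the function name are made up for illustration; the paper uses its own tuned combination and also includes interaction terms:

```python
import numpy as np

def proposal_score(objectness, warp_iou, reid_sim, weights=(1.0, 1.0, 1.0)):
    """Combined proposal score in the spirit of greedy merging: a weighted
    sum of objectness, mask-propagation IoU (from optical-flow warping of
    the previous mask), and ReID similarity to the tracked object."""
    a, b, c = weights
    return a * objectness + b * warp_iou + c * reid_sim

# One greedy step: pick the best-scoring proposal for the current track.
proposals = [
    dict(objectness=0.9, warp_iou=0.2, reid_sim=0.3),  # looks good, wrong place
    dict(objectness=0.7, warp_iou=0.8, reid_sim=0.9),  # consistent with track
]
scores = [proposal_score(**p) for p in proposals]
best = int(np.argmax(scores))
```

Note how the propagation and ReID terms let a slightly weaker detection win because it is consistent with the track's history.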

SLIDE 55

PReMVOS: results

  • Very complex, but a winner:
    – DAVIS Challenge 2018 winner
    – YouTube-VOS Challenge 2018 winner

SLIDE 56

PReMVOS: Visual Results

SLIDE 57

Lessons Learned

  • Challenge 1: How to generate proposals?
    – Deep-learning-based region proposal generators are fit for the task
    – Experimented with SharpMask and Mask R-CNN
  • Challenge 2: How to track region proposals?
    – Region overlap works as a consistency measure
    – Optical flow-based propagation really helps
    – ReID score is also helpful
  • Open issues
    – PReMVOS has no notion of 3D objects moving through 3D space.
    – Track initialization/termination logic is needed for real tracking.
    – How to obtain the initial segmentation?

Slide from: Jonathon Luiten

SLIDE 58

Retrieval approaches

SLIDE 59

Pixel-wise retrieval

  • Re-identification networks based on bounding-box region proposals work really well.
  • This idea can be extended to a re-identification embedding for every pixel.

SLIDE 60

Pixel-wise retrieval

  • The user input can be in any form: first-frame ground-truth mask, scribble…

Y. Chen et al. "Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning". CVPR 2018
SLIDE 61

Pixel-wise retrieval

  • Training: use the triplet loss to bring foreground pixels together and separate them from background pixels.

Y. Chen et al. "Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning". CVPR 2018
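The loss itself is the standard triplet loss applied to pixel embeddings; a NumPy sketch on two-dimensional toy embeddings (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss on pixel embeddings: pull the anchor towards
    a same-class pixel (positive) and push it away from an opposite-class
    pixel (negative) by at least `margin` in squared distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

fg_a = np.array([1.0, 0.0])   # a foreground-pixel embedding (anchor)
fg_b = np.array([0.9, 0.1])   # another foreground pixel
bg = np.array([0.0, 1.0])     # a background pixel
good = triplet_loss(fg_a, fg_b, bg)   # well separated: loss is zero
bad = triplet_loss(fg_a, bg, fg_b)    # violated triplet: positive loss
```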
SLIDE 62

Pixel-wise retrieval

  • Test: embed pixels from both the annotated frame and the test frame, and perform a nearest-neighbor search for the test pixels.
  • We do not need to retrain the model for each sequence, nor fine-tune it.

Y. Chen et al. "Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning". CVPR 2018
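Test-time segmentation then reduces to nearest-neighbor retrieval in embedding space. A brute-force NumPy sketch (real systems use an approximate nearest-neighbor index for speed, and the 2-D embeddings here are toy values):

```python
import numpy as np

def segment_by_retrieval(ref_emb, ref_labels, test_emb):
    """Label each test pixel with the label of its nearest reference-pixel
    embedding; the annotated first frame acts as the retrieval database."""
    # Pairwise squared distances, shape (n_test, n_ref).
    d = ((test_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(-1)
    return ref_labels[d.argmin(axis=1)]

ref_emb = np.array([[0.0, 0.0], [1.0, 1.0]])   # one bg, one fg reference pixel
ref_labels = np.array([0, 1])                  # 0 = background, 1 = foreground
test_emb = np.array([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8]])
labels = segment_by_retrieval(ref_emb, ref_labels, test_emb)
```

Because only the search changes per sequence, no retraining or fine-tuning is needed, which is exactly the speed advantage the slide points out.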

SLIDE 63

We are dealing with video

  • Which is a sequence of images…
  • And we have not talked about…
  • Recurrent Neural Networks!

SLIDE 64

Spatio-temporal approaches

SLIDE 65

Temporal LSTM

  • One-shot video object segmentation
  • If we have multiple objects, each of them is predicted independently

N. Xu et al. "YouTube-VOS: Sequence-to-sequence video object segmentation". ECCV 2018
SLIDE 66

RVOS: temporal and spatial LSTM

C. Ventura et al. "RVOS: end-to-end recurrent network for video object segmentation". CVPR 2019
B. Romera-Paredes and P.H.S. Torr. "Recurrent Instance Segmentation". ECCV 2016

SLIDE 67

RVOS: temporal and spatial LSTM

  • Instance generation and temporal coherence are both trained end-to-end.
  • The image only needs to be processed once (unlike the ConvLSTM example before).

C. Ventura et al. "RVOS: end-to-end recurrent network for video object segmentation". CVPR 2019

SLIDE 68

Transformers for VOS

S. Oh et al. "Video Object Segmentation using Space-Time Memory Networks". ICCV 2019
SLIDE 69

Graph Attention Networks for VOS

  • They use it for zero-shot segmentation, but it could similarly be used for one-shot VOS.

W. Wang et al. "Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks". ICCV 2019

SLIDE 70

Overview of the methods

  • Video Object Segmentation (VOS)
    – OSVOS: first-frame fine-tuning (appearance model)
    – OSVOS-S: + semantic guidance through proposals (shape)
    – OnAVOS: online adaptation (stronger appearance model)
    – MaskTrack: mask refinement
    – Lucid: clever data augmentation
    – ReID-VOS: object appearance re-identification
    – PReMVOS: putting it all together
    – Seq2seq and RVOS: recurrent architectures

Method ingredients: appearance, motion, matching, shape.

SLIDE 71

Evaluation and metrics

SLIDE 72

Metrics for VOS

  • Region similarity: Jaccard index (IoU) of the ground truth mask and the predicted mask.
  • Contour accuracy: measures the precision and recall of the boundary pixels. This is put together in the F-measure:

  Precision = TP / (TP + FP)
  Recall = TP / (TP + FN)
  F = (2 · Prec · Rec) / (Prec + Rec)
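Both measures are easy to compute from binary masks. A NumPy sketch (boundary extraction is omitted for brevity, so `f_measure` just combines given boundary precision/recall values):

```python
import numpy as np

def jaccard(gt, pred):
    """Region similarity J: IoU of ground-truth and predicted masks."""
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return inter / union if union else 1.0   # two empty masks agree perfectly

def f_measure(precision, recall):
    """Contour accuracy F: harmonic mean of boundary precision and recall."""
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0

gt = np.zeros((4, 4), bool); gt[1:3, 1:3] = True     # 4-pixel object
pred = np.zeros((4, 4), bool); pred[1:3, 1:4] = True # 6-pixel prediction
j = jaccard(gt, pred)        # intersection 4, union 6
f = f_measure(0.5, 1.0)      # toy boundary precision/recall
```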

SLIDE 73

Metrics for VOS

  • Temporal stability: measures the evolution of object shapes, i.e., how stable the boundaries are in time.
    – Estimate the deformation of the mask from t to t+1.
    – If the transformation is smooth and precise, the result is considered stable.
    – A bad result is a jittery mask evolution.
    – Note: this measure has been dropped due to its instability during occlusions.

SLIDE 74

Metrics for VOS

  • You can use error measure statistics.
  • Region similarity: Jaccard index (IoU) of the ground truth mask and the predicted mask.
    – Mean: average over the dataset
    – Decay: quantifies the performance loss (or gain) over time → currently used to judge temporal stability
    – Recall: fraction of sequences scoring higher than a threshold
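A sketch of these statistics over a sequence of per-frame Jaccard scores. Note the exact binning used for Decay in the official DAVIS toolkit may differ from this simple first-vs-last split:

```python
import numpy as np

def j_statistics(per_frame_j, thresh=0.5, n_bins=4):
    """Mean / Recall / Decay statistics over per-frame Jaccard scores."""
    j = np.asarray(per_frame_j, dtype=float)
    mean = j.mean()                               # average performance
    recall = (j > thresh).mean()                  # fraction above threshold
    bins = np.array_split(j, n_bins)              # split the sequence in time
    decay = bins[0].mean() - bins[-1].mean()      # early minus late performance
    return mean, recall, decay

# A slowly degrading track: positive decay signals drift over time.
j = [0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4]
mean, recall, decay = j_statistics(j)
```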

SLIDE 75

Tracking and Segmentation

SLIDE 76

VOS → MOTS

  • Video Object Segmentation (VOS) is limited by:
    – First-frame mask given (in the supervised case)
    – Short video clips with objects present in almost all frames
    – Objects in a video are (mostly) of different categories
    – Few objects to track (max. around 7 per video)
  • Multi-Object Tracking and Segmentation (MOTS):
    – Scenarios with a large number of objects (20-40), mostly of the same category (e.g., pedestrians)
    – Long sequences
    – No first-frame annotation provided; one has to deal with appearing and disappearing objects.

SLIDE 77

MOTS dataset

  • Segmentations coming to the MOTChallenge pedestrian tracking dataset

P. Voigtlaender et al. "MOTS: Multi-Object Tracking and Segmentation". CVPR 2019
SLIDE 78

Video Object Segmentation

SLIDE 79

Disclaimer

  • This lecture was done borrowing material from:
    – Prof. Xavier Giró, Technical University of Catalonia (UPC)
    – Jonathon Luiten, RWTH Aachen