Object tracking
CV3DST | Prof. Leal-Taixé — PowerPoint presentation


SLIDE 1

Object tracking

CV3DST | Prof. Leal-Taixé 1

SLIDE 2

Problem statement

  • Given a video, find out which parts of the image depict the same object in different frames
  • Often we use detectors as starting points

(Figure: frames t and t+1)

SLIDE 3

Why do we need tracking?

  • To model objects when detection fails:
– Occlusions
– Viewpoint/pose/blur/illumination variations (in a few frames of a sequence)
– Background clutter
  • To reason about the dynamic world, e.g., trajectory prediction (is the person going to cross the street?)

SLIDE 4

Tracking is…

  • Similarity measurement
  • Correlation
  • Correspondence
– Story time: A young graduate student asked Takeo Kanade what the three most important problems in computer vision are. Kanade replied: “Correspondence, correspondence, correspondence!”
  • Matching/retrieval
  • Data association

SLIDE 5

Tracking is also…

  • Learning to model our target's:
– Appearance: we need to know what the target looks like (single object tracking, re-identification)
– Motion: to make predictions of where the target goes (trajectory prediction, lecture 6)

SLIDE 6

Single Target Tracking

  • STT (1) as a matching/correspondence problem:
– GOTURN: no online appearance modeling
  • STT (2) as an appearance learning problem:
– MDNet: quick online finetuning of the network
  • STT (3) as a (temporal) prediction problem:
– ROLO = CNN + LSTM

SLIDE 7

Single Target Tracking 1

  • Input: what to track? Crop the object to be tracked – initialization of our tracker
  • D. Held, S. Thrun, S. Savarese. “Learning to Track at 100 FPS with Deep Regression Networks”. ECCV 2016.

SLIDE 8

Single Target Tracking 1

  • Where do I search for the object? Assume smooth, “slow” motion: the object cannot be far away from where it was in the previous frame. Use the position at t-1 to crop frame t.
  • D. Held, S. Thrun, S. Savarese. “Learning to Track at 100 FPS with Deep Regression Networks”. ECCV 2016.
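The search-region cropping described above can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code: boxes are assumed to be (cx, cy, w, h), and the context factor of 2.0 is an assumed value chosen so the target can move a little between frames.

```python
import numpy as np

def crop_search_region(frame, prev_box, context=2.0):
    """Crop a search window around the previous frame's box.

    frame: H x W x 3 array; prev_box: (cx, cy, w, h) from frame t-1.
    `context` enlarges the window beyond the box so the target stays
    inside it under slow motion (the factor here is an assumption).
    """
    cx, cy, w, h = prev_box
    half_w, half_h = context * w / 2, context * h / 2
    H, W = frame.shape[:2]
    x1, y1 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    x2, y2 = min(W, int(cx + half_w)), min(H, int(cy + half_h))
    return frame[y1:y2, x1:x2]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop_search_region(frame, prev_box=(320, 240, 60, 100))
print(patch.shape)  # (200, 120, 3): twice the box size in each dimension
```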

SLIDE 9

Single Target Tracking 1

  • Architecture: conv + concatenate + FC. Check the original paper for the exact parameterization of the output.
  • D. Held, S. Thrun, S. Savarese. “Learning to Track at 100 FPS with Deep Regression Networks”. ECCV 2016.

SLIDE 10

Single Target Tracking 1

  • PROS of GOTURN:
– No online training required.
– Tracking is done by comparison, so we do not need to retrain or finetune our model for every new object.
– Close to the template-matching approach that we saw in the first lectures for object detection.
– This makes it very fast!
  • CONS:
– We have a motion assumption. If the object moves fast and goes out of our search window, we cannot recover.
  • D. Held, S. Thrun, S. Savarese. “Learning to Track at 100 FPS with Deep Regression Networks”. ECCV 2016.

SLIDE 11

SOT 1.2 - Unsupervised

  • Forward cycle and backward cycle should be consistent!
  • X. Wang, A. Jabri, A. Efros. “Learning correspondence from the cycle-consistency of time”. CVPR 2019

SLIDE 12

Single Target Tracking 2

  • Online appearance model learning entails training your CNN at test time.
– Slow: not suitable for real-time applications
– Solution: train as few layers as possible
  • H. Nam and B. Han. “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking”. CVPR 2016

SLIDE 13

Single Target Tracking 2

  • Shared layers + scene-specific layers
  • H. Nam and B. Han. “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking”. CVPR 2016

SLIDE 14

Single Target Tracking 2

  • Backpropagation is independent per sequence (figure: Sequence 1)
  • H. Nam and B. Han. “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking”. CVPR 2016

SLIDE 15

Single Target Tracking 2

  • Backpropagation is independent per sequence (figure: Sequence k)
  • H. Nam and B. Han. “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking”. CVPR 2016

SLIDE 16

Single Target Tracking 2

  • At test time, we need to train fc6 (up to fc4 if desired) on the new test sequence.
  • H. Nam and B. Han. “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking”. CVPR 2016

SLIDE 17

Single Target Tracking 2

  • Online tracking: R-CNN type of regression
  • H. Nam and B. Han. “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking”. CVPR 2016

SLIDE 18

Single Target Tracking 2

  • PROS of MDNet:
– No previous-location assumption; the object can move anywhere in the image
– Fine-tuning step is comparatively cheap
– Winner of the VOT Challenge 2015 (http://www.votchallenge.net)
  • CONS:
– Not as fast as GOTURN
  • H. Nam and B. Han. “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking”. CVPR 2016

SLIDE 19

Single Target Tracking 3

  • CNN for appearance + LSTM for motion: Recurrent YOLO (ROLO)
  • G. Ning, Z. Zhang, C. Huang, Z. He. “Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking”. arXiv:1607.05781. 2016

SLIDE 20

Single Target Tracking 3

  • The LSTM receives the heatmap for the object’s position and the 4096-d descriptor of the image
  • G. Ning, Z. Zhang, C. Huang, Z. He. “Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking”. arXiv:1607.05781. 2016

SLIDE 21

Multiple object tracking

SLIDE 22

Different challenges

  • Multiple objects of the same type
  • Heavy occlusions
  • Appearance is often very similar

SLIDE 23

Tracking-by-detection

  • We will focus on algorithms where a set of detections is provided
– Remember detections are not perfect!
  • Find detections that match and form a trajectory

SLIDE 24

Online vs offline tracking

  • Online tracking
– Processes two frames at a time
– For real-time applications
– Prone to drifting → hard to recover from errors or occlusions
  • Offline tracking
– Processes a batch of frames
– Good to recover from occlusions (short ones, as we will see)
– Not suitable for real-time applications
– Suitable for video analysis

SLIDE 25

Online tracking

  • 1. Track initialization (e.g. using a detector)

(Figure: frames t, t+1, t+2)

SLIDE 26

Online tracking

  • 1. Track initialization (e.g. using a detector)
  • 2. Prediction of the next position (motion model)

(Figure: frames t, t+1, t+2)

SLIDE 27

Online tracking

  • 1. Track initialization (e.g. using a detector)
  • 2. Prediction of the next position (motion model)
  • 3. Matching predictions with detections (appearance model)

(Figure: frames t, t+1, t+2)

SLIDE 28

Online tracking

  • 2. Prediction of the next position (motion model)
– Classic: Kalman filter
– Nowadays: recurrent architectures
– For now: we will assume a constant velocity model (spoiler alert: it works really well at high framerates and without occlusions!)
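The constant velocity assumption above can be sketched in a couple of lines. This is a simplified illustration: tracks are reduced to (cx, cy) box centers, whereas a real tracker would also carry box size and often a Kalman filter state.

```python
import numpy as np

def predict_constant_velocity(track):
    """Predict the next position under a constant velocity model.

    `track` is a list of (cx, cy) box centers, one per past frame (a
    simplification for this sketch).
    """
    track = np.asarray(track, dtype=float)
    if len(track) < 2:
        return track[-1]              # no velocity estimate yet
    velocity = track[-1] - track[-2]  # displacement over the last frame
    return track[-1] + velocity

pred = predict_constant_velocity([(10, 5), (14, 7)])
print(pred)  # predicted center: (18, 9)
```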

SLIDE 29

Online tracking

  • 3. Matching predictions with detections (appearance model)

(Figure: detections vs. predictions)

SLIDE 30

Online tracking

  • Bipartite matching
– Define distances between boxes (e.g., IoU, pixel distance, 3D distance)

Cost matrix (detections × predictions):
0.9 0.8 0.8 0.1
0.5 0.4 0.3 0.8
0.2 0.1 0.4 0.8
0.1 0.2 0.5 0.9

SLIDE 31

Online tracking

  • Bipartite matching
– Define distances between boxes (e.g., IoU, pixel distance, 3D distance)
  • Solve the unique matching with, e.g., the Hungarian algorithm*

Cost matrix (detections × predictions):
0.9 0.8 0.8 0.1
0.5 0.4 0.3 0.8
0.2 0.1 0.4 0.8
0.1 0.2 0.5 0.9

*Demo: http://www.hungarianalgorithm.com/solve.php

SLIDE 32

Online tracking

  • Bipartite matching
– Define distances between boxes (e.g., IoU, pixel distance, 3D distance)
  • Solve the unique matching with, e.g., the Hungarian algorithm*
  • Solutions are the unique assignments that minimize the total cost

Cost matrix (detections × predictions):
0.9 0.8 0.8 0.1
0.5 0.4 0.3 0.8
0.2 0.1 0.4 0.8
0.1 0.2 0.5 0.9

*Demo: http://www.hungarianalgorithm.com/solve.php
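The minimum-cost assignment on the slide's 4×4 matrix can be checked with a brute-force search over all one-to-one assignments. This is an illustration only: exhaustive search is fine for a toy 4×4 matrix, while a real tracker would use the Hungarian algorithm (O(n³), e.g. SciPy's `linear_sum_assignment`).

```python
import numpy as np
from itertools import permutations

# Cost of matching detection i to prediction j (values from the slide).
cost = np.array([[0.9, 0.8, 0.8, 0.1],
                 [0.5, 0.4, 0.3, 0.8],
                 [0.2, 0.1, 0.4, 0.8],
                 [0.1, 0.2, 0.5, 0.9]])

def min_cost_assignment(cost):
    """Exhaustive search over all one-to-one assignments (toy sizes only)."""
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(enumerate(best)), sum(cost[i, best[i]] for i in range(n))

matches, total = min_cost_assignment(cost)
print(matches, round(total, 2))  # [(0, 3), (1, 2), (2, 1), (3, 0)] 0.6
```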

SLIDE 33

Online tracking

  • Bipartite matching
– What happens if we are missing a prediction?

Cost matrix (4 detections × 3 predictions):
0.9 0.8 0.8
0.5 0.4 0.3
0.2 0.1 0.4
0.1 0.2 0.5

*Demo: http://www.hungarianalgorithm.com/solve.php

SLIDE 34

Online tracking

  • Bipartite matching
– What happens if we are missing a prediction?
– What happens if no prediction is suitable for the match?

Cost matrix (4 detections × 3 predictions):
0.9 0.8 0.8
0.5 0.4 0.7
0.2 0.1 0.4
0.1 0.2 0.5

*Demo: http://www.hungarianalgorithm.com/solve.php

SLIDE 35

Online tracking

  • Bipartite matching
– What happens if we are missing a prediction?
– What happens if no prediction is suitable for the match?
– Introduce extra nodes with a threshold cost

Augmented cost matrix (dummy entries at threshold 0.3):
0.9 0.8 0.8 0.3 0.3
0.5 0.4 0.7 0.3 0.3
0.2 0.1 0.4 0.3 0.3
0.1 0.2 0.5 0.3 0.3
0.3 0.3 0.3 0.3 0.3

*Demo: http://www.hungarianalgorithm.com/solve.php

SLIDE 36

Online tracking

  • Bipartite matching
– What happens if we are missing a prediction?
– What happens if no prediction is suitable for the match?
– Introduce extra nodes with a threshold cost
– Apply the Hungarian algorithm
– Result: two detections have no matched prediction

Augmented cost matrix (dummy entries at threshold 0.3):
0.9 0.8 0.8 0.3 0.3
0.5 0.4 0.7 0.3 0.3
0.2 0.1 0.4 0.3 0.3
0.1 0.2 0.5 0.3 0.3
0.3 0.3 0.3 0.3 0.3

*Demo: http://www.hungarianalgorithm.com/solve.php
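The dummy-node trick can be reproduced on the slide's 4×3 example: pad the cost matrix to a square with entries at the threshold cost, solve the assignment, and read off which detections landed on a dummy column. Brute force is used here only because the example is tiny; the padding scheme (one dummy per real node, all at 0.3) mirrors the slide but is one of several valid constructions.

```python
import numpy as np
from itertools import permutations

def match_with_threshold(cost, thresh=0.3):
    """Pad the cost matrix with dummy nodes at a threshold cost, then
    solve the assignment by brute force (use Hungarian in practice).

    Detections assigned to a dummy column are 'unmatched' and can,
    e.g., start new tracks.
    """
    n_det, n_pred = cost.shape
    n = n_det + n_pred                 # enough dummies for every real node
    padded = np.full((n, n), thresh)
    padded[:n_det, :n_pred] = cost
    best = min(permutations(range(n)),
               key=lambda p: sum(padded[i, p[i]] for i in range(n)))
    return [i for i in range(n_det) if best[i] >= n_pred]

cost = np.array([[0.9, 0.8, 0.8],      # the slide's 4 detections x
                 [0.5, 0.4, 0.7],      # 3 predictions cost matrix
                 [0.2, 0.1, 0.4],
                 [0.1, 0.2, 0.5]])
unmatched = match_with_threshold(cost)
print(unmatched)  # [0, 1]: two detections get no prediction
```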

SLIDE 37

Online tracking: what is DL's role?

  • 1. Track initialization (e.g. using a detector)
– Deep Learning has provided us with better detectors
  • 2. Prediction of the next position (motion model)
– Trajectory prediction will be covered in Lecture 6
  • 3. Matching predictions with detections (appearance model)
– Improving appearance models → Re-Identification (in a few slides)
– Matching still happens separately from learning; we will see how to couple both steps in the next lecture → MOT as a graph problem

(Diagram: adding computational complexity, feature complexity, temporal complexity)

SLIDE 38

Tracktor: striving for simplicity

  • P. Bergmann, T. Meinhardt and L. Leal-Taixé. ICCV, 2019
SLIDE 39

Recall two-step detectors

(Diagram: input image → convolutions → feature representation → region proposal → classification head / regression head)

SLIDE 40

Recall two-step detectors

(Diagram: input image → convolutions → feature representation → region proposal → classification head / regression head)

SLIDE 41

Recall two-step detectors

(Diagram: input image → convolutions → feature representation → regressed bounding box, via classification head / regression head)

SLIDE 42

Making a detector into a tracktor

  • Tracktor: a method trained as a detector but with tracking capabilities
  • P. Bergmann, T. Meinhardt and L. Leal-Taixé. ICCV, 2019

SLIDE 43

Making a detector into a tracktor

  • Frame t+1: use the detections of frame t as proposals
  • P. Bergmann, T. Meinhardt and L. Leal-Taixé. ICCV, 2019
SLIDE 44

Making a detector into a tracktor

  • Bounding box regression: where did the detection with ID 1 go in the next frame? Tracking!
  • P. Bergmann, T. Meinhardt and L. Leal-Taixé. ICCV, 2019
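The per-frame idea can be sketched as follows. `regress_fn` is a hypothetical stand-in for the detector's box-regression head applied to frame t+1's features (an assumption for illustration, not the paper's code); the point is that feeding frame t's boxes as proposals carries identities over for free.

```python
import numpy as np

def tracktor_step(boxes_t, regress_fn):
    """One Tracktor-style step: frame t's boxes become proposals for t+1.

    Box i in, box i out, so the track ID is preserved without any
    explicit data association.
    """
    return [regress_fn(b) for b in boxes_t]

# Toy stand-in regressor: just shifts every box slightly to the right.
shift_right = lambda box: box + np.array([5.0, 0.0, 5.0, 0.0])

boxes = [np.array([10.0, 20.0, 50.0, 80.0])]   # (x1, y1, x2, y2), ID 0
new_boxes = tracktor_step(boxes, shift_right)
print(new_boxes)  # ID 0's box, refined for frame t+1
```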
SLIDE 45

Pros and cons

  • PRO We can reuse an extremely well-trained regressor
– We get well-positioned bounding boxes
  • PRO We can train our model on still images → easier annotation!
  • PRO Tracktor is online

SLIDE 46

Pros and cons

  • CON There is no notion of “identity” in the model
– Confusion in crowded spaces
  • CON As with any online tracker, the track is killed if the target becomes occluded
– Need to close small gaps and occlusions → Re-ID
  • CON The regressor only shifts the box by a small amount
– Large camera motions
– Large displacements due to low framerate → motion model

SLIDE 47

Back to…

  • Modeling appearance → Re-ID
  • Modeling motion → motion model

SLIDE 48

Re-ID

SLIDE 49

Problem statement

  • Viewing tracking as a retrieval problem: given a detected person, retrieve the top matches among images of different identities. How do we measure (learn) the distance between two images?

SLIDE 50

Can we use classification?

  • We can do image classification to get attributes
– A: person, face, female, brunette
– B: person, face, female, blonde

SLIDE 51

Problems with classification

1) What if the classes used during inference are different from those used during training?
2) What if the dataset is not that big?
3) What if the classes used during inference change?
4) What if we are only interested in knowing whether two samples are from the same class (similar), rather than which class they belong to?

SLIDE 52

(Deep) similarity learning

  • Third type of problem: given A and B, is it the same person?

SLIDE 53

(Deep) similarity learning

  • Similarity Learning: learn a function that measures how similar two objects are.
  • Deep Metric Learning: learn a distance function over objects.

SLIDE 54

Similarity Learning: when and why?

  • Application: unlocking your iPhone with your face (Training)

SLIDE 55

Similarity Learning: when and why?

  • Application: unlocking your iPhone with your face (Testing: is A the same person as B? YES/NO)
  • Can be solved as a classification problem

SLIDE 56

Similarity Learning: when and why?

  • Application: a face recognition system so students can enter the exam room without the need for an ID check (Training: Person 1, Person 2, Person 3)

SLIDE 57

Similarity Learning: when and why?

  • Application: a face recognition system so students can enter the exam room without the need for an ID check
  • What is the problem with this approach? Scalability – we need to retrain our model every time a new student registers for the course

SLIDE 58

Similarity Learning: when and why?

  • Application: a face recognition system so students can enter the exam room without the need for an ID check
  • Can we train one model and use it every year?

SLIDE 59

Similarity Learning: when and why?

  • Learn a similarity function: a non-matching pair (A, B) gets a low similarity score (large distance); a matching pair gets a high similarity score (small distance)

SLIDE 60

Similarity Learning: when and why?

  • Learn a similarity function: testing
  • d(A, B) > τ → not the same person

SLIDE 61

Similarity Learning: when and why?

  • Learn a similarity function
  • d(A, B) < τ → same person

SLIDE 62

Similarity Learning: when and why?

  • How about retrieval and re-ID?
  • 1. Compute the distance between the query image and the database images
  • 2. Use k-nearest neighbors, i.e., retrieve the top images based on distance
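The two retrieval steps above can be sketched in numpy. The embeddings are assumed to come from an already-trained network f(·); random vectors stand in for them here purely to exercise the distance-and-rank logic.

```python
import numpy as np

def retrieve_top_k(query_emb, gallery_embs, k=3):
    """Rank gallery images by embedding distance to the query."""
    dists = np.linalg.norm(gallery_embs - query_emb, axis=1)
    return np.argsort(dists)[:k]          # indices of the k closest images

rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 128))      # 10 gallery images, 128-d embeddings
query = gallery[4] + 0.01 * rng.normal(size=128)  # near-duplicate of image 4
top = retrieve_top_k(query, gallery)
print(top[0])  # image 4 ranks first
```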

SLIDE 63

Similarity learning

  • How do we train a network to learn similarity?

SLIDE 64

Similarity learning

  • How do we train a network to learn similarity?
  • CNN → FC → representation of my face in 128 values: f(A)
  • Taigman et al. “DeepFace: closing the gap to human level performance”. CVPR 2014

SLIDE 65

Similarity learning

  • How do we train a network to learn similarity? Encode both images: A → f(A), B → f(B)
  • Taigman et al. “DeepFace: closing the gap to human level performance”. CVPR 2014

SLIDE 66

Similarity learning

  • Siamese network = shared weights: the same network produces f(A) and f(B)
  • Taigman et al. “DeepFace: closing the gap to human level performance”. CVPR 2014

SLIDE 67

Similarity learning

  • Siamese network = shared weights
  • We use the same network to obtain encodings of the two images, f(A) and f(B) respectively.
  • To be done: compare the encodings
  • Taigman et al. “DeepFace: closing the gap to human level performance”. CVPR 2014

SLIDE 68

Similarity learning: losses

  • Distance function: d(A, B) = ||f(A) − f(B)||2
  • Training: learn the parameters of f such that
– If A and B depict the same person, d(A, B) is small
– If A and B depict a different person, d(A, B) is large
  • Taigman et al. “DeepFace: closing the gap to human level performance”. CVPR 2014

SLIDE 69

Similarity learning: losses

  • Loss function for a positive pair:
– If A and B depict the same person, d(A, B) should be small:
L(A, B) = ||f(A) − f(B)||2

SLIDE 70

Similarity learning: losses

  • Loss function for a negative pair:
– If A and B depict a different person, d(A, B) should be large
– Better: use a Hinge loss:
L(A, B) = max(0, m2 − ||f(A) − f(B)||2)

If two elements are already far away, do not spend energy pulling them even further apart

SLIDE 71

Similarity learning: losses

  • Contrastive loss:
L(A, B) = y∗||f(A) − f(B)||2 + (1 − y∗)max(0, m2 − ||f(A) − f(B)||2)

– Positive pair (y∗ = 1): reduce the distance between the elements
– Negative pair (y∗ = 0): push the elements further apart, up to a margin

Chopra, Hadsell and LeCun. “Learning a similarity metric discriminatively, with application to face verification”, CVPR 2005.
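The contrastive loss above can be written directly in numpy. This is a sketch on precomputed embeddings (the toy vectors and margin are made up); in training, the gradient of this scalar would flow back through the shared encoder f.

```python
import numpy as np

def contrastive_loss(fa, fb, same, margin=1.0):
    """Contrastive loss on a pair of embeddings (squared-distance form).

    same=1 pulls the pair together; same=0 pushes it apart, but only
    until the distance reaches `margin` (the hinge).
    """
    d2 = np.sum((fa - fb) ** 2)                    # ||f(A) - f(B)||^2
    return same * d2 + (1 - same) * max(0.0, margin ** 2 - d2)

fa, fb = np.array([0.0, 0.0]), np.array([0.6, 0.8])  # squared distance 1.0
pos_loss = contrastive_loss(fa, fb, same=1)              # pulls the pair together
neg_loss = contrastive_loss(fa, fb, same=0, margin=2.0)  # pushes it apart
print(pos_loss, neg_loss)
```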

SLIDE 72

Triplet loss

  • Triplet loss allows us to learn a ranking. Given an anchor (A), a positive (P) and a negative (N), we want:
||f(A) − f(P)||2 < ||f(A) − f(N)||2

Schroff et al. “FaceNet: a unified embedding for face recognition and clustering”. CVPR 2015

SLIDE 73

Triplet loss

  • Triplet loss allows us to learn a ranking
||f(A) − f(P)||2 < ||f(A) − f(N)||2
||f(A) − f(P)||2 − ||f(A) − f(N)||2 < 0
||f(A) − f(P)||2 − ||f(A) − f(N)||2 + m < 0   (m: margin)

Schroff et al. “FaceNet: a unified embedding for face recognition and clustering”. CVPR 2015

SLIDE 74

Triplet loss

  • Triplet loss allows us to learn a ranking
||f(A) − f(P)||2 < ||f(A) − f(N)||2
||f(A) − f(P)||2 − ||f(A) − f(N)||2 + m < 0
L(A, P, N) = max(0, ||f(A) − f(P)||2 − ||f(A) − f(N)||2 + m)

Schroff et al. “FaceNet: a unified embedding for face recognition and clustering”. CVPR 2015
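The final triplet loss formula maps directly to code. Again a sketch on precomputed embeddings; the margin m = 0.2 and the toy vectors are illustrative values, not FaceNet's settings.

```python
import numpy as np

def triplet_loss(fa, fp, fn, m=0.2):
    """Triplet loss with margin m (squared-distance form, as on the slide)."""
    d_ap = np.sum((fa - fp) ** 2)   # anchor-positive distance
    d_an = np.sum((fa - fn) ** 2)   # anchor-negative distance
    return max(0.0, d_ap - d_an + m)

anchor   = np.array([1.0, 0.0])
positive = np.array([1.1, 0.0])   # close to the anchor
negative = np.array([0.0, 1.0])   # far from the anchor
loss = triplet_loss(anchor, positive, negative)
print(loss)  # 0.0: the ranking constraint is already satisfied by margin m
```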

SLIDE 75

Triplet loss

  • Training with hard cases:
– Train for a few epochs
– Choose the hard cases, where d(A, P) ≈ d(A, N)
– Train with those to refine the learned distance

L(A, P, N) = max(0, ||f(A) − f(P)||2 − ||f(A) − f(N)||2 + m)
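The hard-case selection d(A, P) ≈ d(A, N) can be sketched as a brute-force search. The tolerance `tol` and the toy embeddings are made-up illustrative values; real pipelines mine hard triplets inside each mini-batch rather than over the whole dataset.

```python
import numpy as np

def mine_hard_triplets(embs, labels, tol=0.5):
    """Pick hard triplets: d(A, P) and d(A, N) are within `tol` of each other."""
    n = len(labels)
    hard = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue                       # P must share A's identity
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue                   # N must be a different identity
                d_ap = np.sum((embs[a] - embs[p]) ** 2)
                d_an = np.sum((embs[a] - embs[neg]) ** 2)
                if abs(d_ap - d_an) < tol:     # hard: distances nearly equal
                    hard.append((a, p, neg))
    return hard

embs = np.array([[0.0, 0.0], [1.0, 0.0], [1.1, 0.0], [5.0, 5.0]])
labels = [0, 0, 1, 1]
hard = mine_hard_triplets(embs, labels)
print(hard)  # [(0, 1, 2)]: sample 2 is about as close to 0 as its positive 1
```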

SLIDE 76

Triplet loss

(Figure: training with anchor, positive and negative examples)

SLIDE 77

Problems with these losses

  • The number of pairwise relations in a mini-batch is O(n^2), but the contrastive loss considers only O(n/2) relations, and the triplet loss considers only O(2n/3) relations.
  • Too much information is thrown away.
  • To train these networks, many tricks are required: hard-negative mining, intelligent sampling and multi-task learning.

SLIDE 78

Extra: Loss Functions

  • Lifted Structure (Song et al., CVPR 2016)
  • NPairs (Sohn et al., NIPS 2016)
  • Facility Location (Song et al., CVPR 2017)
  • Angular Loss (Wang et al., ICCV 2017)
  • Proxy-NCA (Movshovitz-Attias et al., ICCV 2017)
  • Deep Spectral Clustering (Law et al., ICML 2017)
  • Bias Triplet (Yu et al., ECCV 2018)

SLIDE 79

Extra: Loss Functions + Sampling

  • Sampling Matters (Manmatha et al., ICCV 2017)
  • Hierarchical Triplet Loss (Ge et al., ECCV 2018)
  • DAMLRRM (Xu et al., CVPR 2019)
  • DE-DSP (Duan et al., CVPR 2019)
  • GPW (Wang et al., CVPR 2019)

SLIDE 80

Extra: Loss Functions + Ensembles + (potentially) Sampling

  • HDC (Yuan et al., ICCV 2017)
  • BIER (Opitz et al., ICCV 2017)
  • DRE (Xuan et al., ECCV 2018)
  • ABE (Kim et al., ECCV 2018)
  • D and C (Sanakoyeu et al., CVPR 2019)
  • RLL (Wang et al., CVPR 2019)

SLIDE 81

Group Loss

Elezi et al. “The Group Loss for deep metric learning”. arXiv:1912.00385. 2019

SLIDE 82

Overview

  • We want to take into account the similarity of all samples with respect to all other samples!

Elezi et al. “The Group Loss for deep metric learning”. arXiv:1912.00385. 2019

SLIDE 83

Summary of the model

1) Initialization: initialize X, the image-label assignment, using the softmax outputs of the neural network. Compute the n × n pairwise similarity matrix W using the neural network embedding.
2) Refinement: iteratively refine X considering the similarities between all the mini-batch images, as encoded in W, as well as their labeling preferences.
3) Loss computation: compute the cross-entropy loss of the refined probabilities and update the weights of the neural network using backpropagation.

SLIDE 84

Step 2: Refinement

SLIDE 85

Step 2: Refinement

  • B and C are much more similar than A and B, or A and C
  • Maybe the net is untrained and initialized the probabilities to a uniform distribution.

SLIDE 86

Step 2: Refinement - math

  • Use replicator dynamics*: propagate information from the similarity matrix, then normalize in order to stay in the standard simplex. Iterate t + 1 times.
  • The support term measures the support that the current mini-batch gives to image i belonging to class λ.

*J.W. Weibull. Evolutionary Game Theory. MIT Press, 1997
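The refinement step can be sketched in numpy under the usual replicator-dynamics formulation, x ← x ∘ (Wx), row-normalized back onto the simplex. The toy W and X below are made up (sample 0 is confident, samples 1 and 2 start uniform); this is an illustration of the update rule, not the paper's implementation.

```python
import numpy as np

def refine(X, W, n_iters=5):
    """Replicator-dynamics refinement of soft label assignments.

    X: n x c label assignments (each row on the simplex).
    W: n x n non-negative similarity matrix.
    Each iteration scales every assignment by the support it receives
    from similar samples, then renormalizes each row.
    """
    for _ in range(n_iters):
        support = W @ X                          # support from the mini-batch
        X = X * support
        X = X / X.sum(axis=1, keepdims=True)     # stay in the standard simplex
    return X

# Toy batch: sample 1 is weakly similar to confident sample 0.
W = np.array([[0.0, 0.1, 0.1],
              [0.1, 0.0, 0.9],
              [0.1, 0.9, 0.0]])
X = np.array([[0.9, 0.1],     # sample 0: confident in class 0
              [0.5, 0.5],     # samples 1, 2: uniform (untrained net)
              [0.5, 0.5]])
X = refine(X, W)
print(X.round(2))  # samples 1 and 2 drift toward class 0
```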

SLIDE 87

Step 3: Loss and backprop

SLIDE 88

Step 3: Loss

  • Compute the cross-entropy over the refined probabilities.
  • Backpropagate through the entire net.
  • The Group Loss has no parameters to learn, but it propagates gradients through the network.
  • Result: state-of-the-art results in retrieval

Elezi et al. “The Group Loss for deep metric learning”. arXiv:1912.00385. 2019

SLIDE 89

Overview

  • Mostly focused on online tracking
  • Better appearance models with similarity learning (re-ID)
  • How about offline tracking? Taking into account more than two frames to compute the matching (to recover from errors, missing detections, wrong predictions) → Lecture 5: MOT with graphs
  • Better motion models? → Lecture 6: Trajectory prediction

SLIDE 90

Next lectures

  • Multiple object tracking with graph networks
  • Next lecture on Friday.
  • If you are interested in working on a project with us, send an email to dst@dvl.in.tum.de with your CV, transcript, and research interests!