Object tracking
CV3DST | Prof. Leal-Taixé 1
Problem statement: given a video, find out which parts of the image depict the same object in different frames. Often we use detectors as starting points.
Challenges:
– Occlusions
– Viewpoint/pose/blur/illumination variations (in a few frames of a sequence)
– Background clutter
Tracking enables prediction (is the person going to cross the street?)
– Story time: a young graduate student asked Takeo Kanade what the three most important problems in computer vision are. Kanade replied: "Correspondence, correspondence, correspondence!"
Appearance: we need to know what the target looks like
– Single object tracking
– Re-identification
Motion: we need to predict where the target goes
– Trajectory prediction (lecture 6)
– GOTURN: no online appearance modeling
– MDNet: quick online finetuning of the network
– ROLO = CNN + LSTM
Crop the object to be tracked – Initialization of our tracker
Assume smooth, "slow" motion: the object will not be far away from where it was in the previous frame. Use the position at t-1 to crop frame t.
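As a rough sketch of this cropping step (the box format (x, y, w, h) and the 2x window scale are illustrative assumptions, not the paper's exact parameterization):

```python
def search_region(prev_box, img_w, img_h, scale=2.0):
    """Crop window for frame t, centered on the box from frame t-1.

    prev_box: (x, y, w, h) of the target at t-1 (assumed format).
    scale: how much larger the search window is than the box
    (2.0 is an illustrative choice, not the paper's value).
    """
    x, y, w, h = prev_box
    cx, cy = x + w / 2, y + h / 2    # box center
    sw, sh = w * scale, h * scale    # enlarged search window
    x0 = max(0, int(cx - sw / 2))
    y0 = max(0, int(cy - sh / 2))
    x1 = min(img_w, int(cx + sw / 2))
    y1 = min(img_h, int(cy + sh / 2))
    return x0, y0, x1, y1

print(search_region((100, 100, 50, 50), 640, 480))  # -> (75, 75, 175, 175)
```

The clamping to the image borders handles targets near the edge of the frame.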
Check the original paper for the exact parameterization of the output
Pros:
– No online training required; tracking is done by comparison, so we do not need to retrain
– Close to the template-matching approach that we saw in the first lectures for object detection; this makes it very fast!
Cons:
– We have a motion assumption: if the object moves fast and goes out of our search window, we cannot recover
MDNet: within a sequence, the appearance of the target stays consistent! Idea: finetune your CNN at test time.
– Slow: not suitable for real-time applications
– Solution: train as few layers as possible
(Figures: multi-domain training on Sequence 1 … Sequence k, then finetuning on a new test sequence.)
R-CNN type of regression
Pros:
– No previous-location assumption: the object can move anywhere in the image
– The fine-tuning step is comparatively cheap
– Winner of the VOT Challenge 2015 (http://www.votchallenge.net)
Cons:
– Not as fast as GOTURN
ROLO: Recurrent YOLO
The LSTM receives the detection and the 4096-dimensional descriptor of the image.
Multiple object tracking: detections are provided
– Remember: detections are not perfect!
Goal: find detections that match and form a trajectory
Online tracking:
– Processes two frames at a time
– For real-time applications
– Prone to drifting → hard to recover from errors or occlusions
Batch tracking:
– Processes a batch of frames
– Good at recovering from occlusions (short ones, as we will see)
– Not suitable for real-time applications
– Suitable for video analysis
Initialization (e.g., using a detector)
Prediction
Matching predictions with detections (appearance model)
– Classic: Kalman filter
– Nowadays: recurrent architectures
– For now: we will assume a constant velocity model (spoiler alert: it works really well at high framerates and without occlusions!)
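A constant velocity model fits in a few lines (the box format (x, y, w, h) is an assumption for illustration; the model simply extrapolates the displacement observed between the last two frames):

```python
def constant_velocity_predict(box_prev, box_curr):
    """Predict the box at t+1 from the boxes at t-1 and t by
    assuming the per-coordinate velocity stays constant."""
    return tuple(2 * c - p for p, c in zip(box_prev, box_curr))

# The target moved +5 px in x between the two frames, so we
# expect it to move another +5 px.
print(constant_velocity_predict((10, 20, 30, 40), (15, 20, 30, 40)))
# -> (20, 20, 30, 40)
```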
Matching predictions with detections (appearance model):
– Define distances between boxes (e.g., IoU, pixel distance, 3D distance)
Example cost matrix (rows: detections, columns: predictions):
0.9 0.8 0.8 0.1
0.5 0.4 0.3 0.8
0.2 0.1 0.4 0.8
0.1 0.2 0.5 0.9
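For instance, an IoU-based cost between two boxes (corner format (x0, y0, x1, y1) assumed here) can be computed as:

```python
def iou(a, b):
    """Intersection over union of two boxes in (x0, y0, x1, y1) format."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def iou_cost(a, b):
    """Turn the similarity into a cost for the matching step."""
    return 1.0 - iou(a, b)

print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # -> 0.333...
```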
– Bipartite matching with, e.g., the Hungarian algorithm (demo: http://www.hungarianalgorithm.com/solve.php)
– The Hungarian algorithm finds the assignments that minimize the total cost
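Using the cost matrix from the slides, the optimal assignment can be found by brute force for tiny problems (a sketch only; in practice one would use a proper Hungarian implementation such as scipy.optimize.linear_sum_assignment):

```python
from itertools import permutations

# Cost matrix from the slides: rows = detections, columns = predictions.
COST = [
    [0.9, 0.8, 0.8, 0.1],
    [0.5, 0.4, 0.3, 0.8],
    [0.2, 0.1, 0.4, 0.8],
    [0.1, 0.2, 0.5, 0.9],
]

def min_cost_assignment(cost):
    """Minimum-cost bipartite matching by enumerating permutations
    (exponential, fine only for tiny matrices)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best), sum(cost[i][best[i]] for i in range(n))

assignment, total = min_cost_assignment(COST)
print(assignment, round(total, 2))  # -> [3, 2, 1, 0] 0.6
```

Each detection is matched to the prediction that makes the overall sum of costs, not each individual cost, as small as possible.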
– What happens if we are missing a prediction? The cost matrix becomes rectangular (4 detections, 3 predictions):
0.9 0.8 0.8
0.5 0.4 0.3
0.2 0.1 0.4
0.1 0.2 0.5
– What happens if no prediction is suitable for the match (all its costs are high)?
– Solution: introduce extra (dummy) nodes with a threshold cost (here 0.3):
0.9 0.8 0.8 0.3 0.3
0.5 0.4 0.7 0.3 0.3
0.2 0.1 0.4 0.3 0.3
0.1 0.2 0.5 0.3 0.3
0.3 0.3 0.3 0.3 0.3
– Apply the Hungarian algorithm on the padded matrix
– Result: two detections have no matched prediction (they are assigned to dummy nodes)
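The dummy-node trick can be sketched the same way: pad the 4x3 cost matrix to the 5x5 matrix shown on the slide with the threshold cost, solve, and read off which detections landed on dummy columns (brute-force matching again, for illustration only):

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Brute-force minimum-cost matching (tiny matrices only)."""
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n))))

THRESH = 0.3  # threshold cost of the dummy nodes (value from the slides)
cost = [            # 4 detections x 3 predictions
    [0.9, 0.8, 0.8],
    [0.5, 0.4, 0.7],
    [0.2, 0.1, 0.4],
    [0.1, 0.2, 0.5],
]
n_pred = len(cost[0])
# Pad to 5x5: two dummy columns plus one dummy row, all at THRESH.
padded = [row + [THRESH, THRESH] for row in cost] + [[THRESH] * 5]

assignment = min_cost_assignment(padded)
unmatched = [i for i in range(len(cost)) if assignment[i] >= n_pred]
print(unmatched)  # -> [0, 1]: two detections have no matched prediction
```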
– Deep learning has provided us with better detectors
– Trajectory prediction will be covered in Lecture 6
– Improving appearance models → re-identification (in a few slides)
– Matching still happens separately from learning; we will see how to couple both steps in the next lecture → MOT as a graph problem
– Adding computational complexity
– Adding feature complexity
– Adding temporal complexity
(Figures: detection pipeline: input image → convolutions → feature representation → region proposals → classification head and regression head → regressed bounding box.)
Idea: exploit the detector's regression head to obtain tracking capabilities.
Frame t+1: use the detections of frame t as proposals.
Bounding-box regression tells us where the detection with ID 1 went in the next frame. Tracking!
Pros:
– We can reuse an extremely well-trained regressor
– We get well-positioned bounding boxes
– No need for tracking-specific annotation!
Challenges:
– Confusion in crowded spaces
– Hard to recover once the target becomes occluded
– Need to close small gaps and occlusions
– Assumes the object moves only by a small quantity between frames
– Large camera motions
– Large displacements due to low framerate
Remedies: motion model and re-identification (Re-ID)
Modeling motion → motion model
Modeling appearance → Re-ID
Re-identification: given a detected person, retrieve the top matches among images of different identities. How do we measure (learn) the distance between two images?
A → classification: person, face, female, brunette
B → classification: person, face, female, blonde
1) What if the classes used during inference are different from those used during training?
2) What if the dataset is not that big?
3) What if the classes used during inference change?
4) What if we are interested only in knowing whether two samples are from the same class (similar), rather than which class they belong to?
A, B: is it the same person?
Learning a distance function: a function that measures how similar two objects are.
Training
Testing: given A and B, output YES/NO. This can be solved as a classification problem.
Example: let registered students enter the exam room without the need for an ID check. Training: Person 1, Person 2, Person 3.
What is the problem with this approach? Scalability: we need to retrain our model every time a new student registers for the course.
Can we train one model and use it every year?
Different persons A and B: low similarity score (large distance).
Same person in A and B: high similarity score (small distance).
If d(A, B) > τ: not the same person.
If d(A, B) < τ: same person.
Compute distances between the query image and the database images, i.e., retrieve the top images based on distance.
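Retrieval is then a nearest-neighbor search in embedding space (a minimal sketch; the gallery layout, an id-to-embedding dict, is an assumed data structure):

```python
import math

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve_top_k(query_emb, gallery, k=2):
    """Rank gallery identities by distance to the query embedding."""
    ranked = sorted(gallery, key=lambda pid: l2(query_emb, gallery[pid]))
    return ranked[:k]

# Toy 2-D embeddings standing in for the 128-D network outputs.
gallery = {"p1": [0.0, 1.0], "p2": [1.0, 1.0], "p3": [5.0, 5.0]}
print(retrieve_top_k([0.1, 0.9], gallery))  # -> ['p1', 'p2']
```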
Taigman et al. "DeepFace: Closing the Gap to Human-Level Performance in Face Verification". CVPR 2014
A → CNN → FC → representation f(A): 128 values
A and B are passed through the same network, producing embeddings f(A) and f(B).
Define the distance d(A, B) = ||f(A) − f(B)||², where f(A) and f(B) are the embeddings of the images A and B, respectively.
– If A and B depict the same person, d(A, B) is small
– If A and B depict a different person, d(A, B) is large
– If A and B depict the same person, the distance should be small. Loss for a positive pair: L(A, B) = ||f(A) − f(B)||²
– If A and B depict a different person, the distance should be large. Better: use a hinge loss: L(A, B) = max(0, m² − ||f(A) − f(B)||²)
If two elements are already far apart (beyond the margin m), do not spend energy pushing them even further apart.
Contrastive loss: L(A, B) = y* ||f(A) − f(B)||² + (1 − y*) max(0, m² − ||f(A) − f(B)||²)
– Positive pair (y* = 1): reduce the distance between the elements
– Negative pair (y* = 0): push the elements further apart, up to a margin
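The contrastive loss translates directly into code (a sketch on plain Python lists; m = 1.0 is an illustrative margin, not a value from the lecture):

```python
def contrastive_loss(fa, fb, y, m=1.0):
    """y = 1 for a positive pair (same identity), 0 for a negative
    pair; m is the margin beyond which negatives incur no loss."""
    d2 = sum((a - b) ** 2 for a, b in zip(fa, fb))  # squared L2 distance
    return y * d2 + (1 - y) * max(0.0, m ** 2 - d2)

# Positive pair: penalized by its squared distance.
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], y=1))  # -> 1.0
# Negative pair already beyond the margin: no gradient is wasted.
print(contrastive_loss([0.0, 0.0], [2.0, 0.0], y=0))  # -> 0.0
```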
Chopra, Hadsell and LeCun. “Learning a similarity metric discriminatively, with application to Face verification”, CVPR 2005.
Given a triplet of Anchor (A), Positive (P), and Negative (N), we want:
||f(A) − f(P)||² < ||f(A) − f(N)||²
Schroff et al. "FaceNet: A Unified Embedding for Face Recognition and Clustering". CVPR 2015
||f(A) − f(P)||² < ||f(A) − f(N)||²
||f(A) − f(P)||² − ||f(A) − f(N)||² < 0
With a margin m: ||f(A) − f(P)||² − ||f(A) − f(N)||² + m < 0
Triplet loss: L(A, P, N) = max(0, ||f(A) − f(P)||² − ||f(A) − f(N)||² + m)
L(A, P, N) = max(0, ||f(A) − f(P)||² − ||f(A) − f(N)||² + m)
The loss is active when d(A, P) ≈ d(A, N), i.e., when the positive and the negative are not yet separated by the margin.
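The triplet loss is just as direct (m = 0.2 is an assumed margin for illustration):

```python
def triplet_loss(fa, fp, fn, m=0.2):
    """Hinge on the gap between the anchor-positive and
    anchor-negative squared distances."""
    d_ap = sum((a - p) ** 2 for a, p in zip(fa, fp))
    d_an = sum((a - n) ** 2 for a, n in zip(fa, fn))
    return max(0.0, d_ap - d_an + m)

# Easy triplet: the negative is already far away, so the loss is 0.
print(triplet_loss([0.0], [0.1], [2.0]))  # -> 0.0
# Hard triplet: positive and negative are equally distant,
# so the loss equals the margin.
print(triplet_loss([0.0], [0.5], [-0.5]))  # -> 0.2
```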
(Figure: training batches built from anchor, positive, and negative samples.)
A mini-batch of n samples contains O(n²) pairwise relations, but the contrastive loss considers only O(n/2) relations, and the triplet loss only O(2n/3). To work well in practice, additional techniques are required: hard-negative mining, intelligent sampling, and multi-task learning.
Elezi et al. "The Group Loss for Deep Metric Learning". arXiv:1912.00385, 2019
We want to take into account the similarity of all samples with respect to all other samples!
1) Initialization: initialize X, the image-label assignment, using the softmax outputs of the neural network. Compute the n × n pairwise similarity matrix W using the neural network embedding.
2) Refinement: iteratively refine X, considering the similarities between all the mini-batch images, as encoded in W, as well as their labeling preferences.
3) Loss computation: compute the cross-entropy loss of the refined probabilities and update the weights of the neural network using backpropagation.
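The refinement step can be sketched with a replicator-dynamics update on plain Python lists (an illustrative implementation, not the paper's exact code): each sample's label distribution is multiplied elementwise by the support (W X) it receives from the mini-batch, then renormalized onto the simplex.

```python
def refine(X, W, iters=5):
    """X: n x c label probabilities, W: n x n pairwise similarities.
    Repeats: x_i <- x_i * (W X)_i, renormalized per sample."""
    n, c = len(X), len(X[0])
    for _ in range(iters):
        # Support that the batch gives to sample i for class l.
        support = [[sum(W[i][j] * X[j][l] for j in range(n))
                    for l in range(c)] for i in range(n)]
        X = [[X[i][l] * support[i][l] for l in range(c)]
             for i in range(n)]
        X = [[v / sum(row) for v in row] for row in X]  # back to simplex
    return X

# Sample 0 is confident about class 0; sample 1 is similar to it
# (W[0][1] = 0.9) and gets pulled toward class 0 as well.
W = [[1.0, 0.9], [0.9, 1.0]]
X = [[0.9, 0.1], [0.5, 0.5]]
refined = refine(X, W)
print(refined[1][0] > 0.5)  # -> True
```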
B and C are much more similar than A and B or A and C. Maybe the net is untrained and initialized the probabilities to a uniform distribution.
Propagate information and normalize in the standard simplex. Iterate t + 1 times.
*J.W. Weibull. Evolutionary Game Theory. MIT Press, 1997
Computed from the similarity matrix W, where λ denotes the class: it measures the support that the current mini-batch gives to image i belonging to class λ.
Compute the cross-entropy loss on the refined probabilities; backpropagation then propagates the gradients over the network.
Lecture 5: MOT with graphs, using multiple frames to compute the matching (to recover from errors, missing detections, wrong predictions)
Lecture 6: Trajectory prediction
Send an email to dst@dvl.in.tum.de with your CV and transcript, and tell us what your research interests are!