

SLIDE 1: MOTS: Multi-Object Tracking and Segmentation

Visual Computing Institute, Computer Vision

Paul Voigtlaender
RWTH Aachen University

Joint work with M. Krause, A. Ošep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe

CVPR 2019 main conference poster #122, Wednesday 15:20

SLIDE 2: Motivation

[Figure: the same video sequence over time t, annotated with bounding boxes (top) vs. pixel-level segmentation masks (bottom)]

◮ Many datasets for multi-object tracking are now available
  ◮ MOTChallenges
    ◮ MOT15 [Leal-Taixé et al., 2015]
    ◮ MOT16, MOT17 [Milan et al., 2016]
    ◮ CVPR19 [Dendorfer et al., 2019]
  ◮ KITTI Tracking [Geiger et al., 2012]
  ◮ VisDrone2018 [Zhu et al., 2018]
  ◮ DukeMTMC [Ristani et al., 2016]
  ◮ UA-DETRAC [Wen et al., 2015]

◮ But annotations are only on the bounding box level

SLIDE 3: Motivation

◮ In difficult cases, bounding boxes are a very coarse approximation
◮ Most of the pixels of the bounding box belong to other objects

SLIDE 4: So let there be Annotations

◮ Dense pixel-wise annotations are super expensive...
◮ But we did it!

SLIDE 5: So let there be Annotations

◮ Dense pixel-wise annotations are super expensive...
◮ But we did it!
◮ How?

◮ Semi-automatic annotation procedure

                                      KITTI MOTS           MOTSChallenge
                                      train      val
# Sequences                           12         9         4
# Frames                              5,027      2,981     2,862
# Tracks Pedestrian                   99         68        228
# Masks Pedestrian (total)            8,073      3,347     26,894
# Masks Pedestrian (manually annot.)  1,312      647       3,930
# Tracks Car                          431        151       -
# Masks Car (total)                   18,831     8,068     -
# Masks Car (manually annot.)         1,509      593       -

SLIDE 6: Outline

◮ Semi-automatic Annotation Procedure
◮ Evaluation Measures
◮ TrackR-CNN Baseline Method
◮ Results

SLIDE 7: Semi-automatic Annotation Procedure

◮ Starting point: existing box-level tracking annotations
◮ Fully convolutional network (Box2Seg) converts bounding boxes to segmentation masks

SLIDE 8: Semi-automatic Annotation Procedure

◮ Starting point: dataset with existing box-level tracking annotations

[Figure: annotation workflow. Start: a track with bounding boxes and some polygon annotations; global Box2Seg training on the polygons; for each frame, Box2Seg (eval) segments the bounding boxes into pixel-level object masks; quality assurance picks erroneous masks, for which annotators manually annotate additional polygons; these polygons are used to fine-tune Box2Seg (train), and the loop repeats until the quality standards are reached (End).]

SLIDE 9: Semi-automatic Annotation Procedure

◮ Manual corrections ensure consistent and high quality
◮ Large savings in time
  ◮ KITTI MOTS: only 13% of car boxes / 17% of pedestrian boxes manually annotated
  ◮ MOTSChallenge: 15% of pedestrian boxes manually annotated

SLIDE 10: Evaluation Measures

◮ We consider mask-based variants of the CLEAR MOT metrics [Bernardin and Stiefelhagen, 2008]
◮ Need to establish correspondences between hypothesized and ground-truth objects
  ◮ Box-based tracking: non-trivial, because boxes are allowed to overlap
    ◮ Hungarian matching needed
  ◮ Mask-based: we require disjoint masks!
    ◮ Correspondences are unique and straightforward
    ◮ Hypothesized and ground-truth masks are matched iff mask IoU > 0.5 (see the sketch below)
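Since both the ground-truth and the hypothesized masks are required to be pairwise disjoint, at most one ground-truth mask can overlap a given hypothesis with IoU > 0.5, so a simple greedy scan already yields the unique matching. A minimal sketch (an illustration, not the official evaluation code), assuming boolean NumPy masks of equal shape:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boolean masks of the same shape."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def match_masks(hyps, gts, thresh=0.5):
    """Match hypothesis masks to ground-truth masks at IoU > thresh.

    IoU > 0.5 means the intersection covers more than half of each of
    the two masks, which disjointness makes impossible for two
    ground-truth masks at once, so the first hit is the unique match
    and no Hungarian matching is required.
    """
    matches = {}  # hypothesis index -> ground-truth index
    for i, h in enumerate(hyps):
        for j, g in enumerate(gts):
            if mask_iou(h, g) > thresh:
                matches[i] = j
                break
    return matches
```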

SLIDE 11: Evaluation Measures

◮ MOTSA: Multi-Object Tracking and Segmentation Accuracy

  MOTSA = 1 − (|FN| + |FP| + |IDS|) / |M| = (|TP| − |FP| − |IDS|) / |M|

◮ Like MOTA, but with mask-based IoU instead of box IoU
◮ TP: true positives
◮ FN: false negatives
◮ FP: false positives
◮ IDS: ID switches
◮ M: set of ground-truth segmentation masks
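As an illustration with made-up counts: for |M| = 100 ground-truth masks with |FN| = 10, |FP| = 5, and |IDS| = 2, we have |TP| = |M| − |FN| = 90, and both forms agree: MOTSA = 1 − (10 + 5 + 2)/100 = (90 − 5 − 2)/100 = 0.83.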

SLIDE 12: Evaluation Measures

◮ \widetilde{TP}: soft number of true positives

  \widetilde{TP} = \sum_{h \in TP} IoU(h, c(h))

◮ c: unique mapping from hypotheses to ground truth

SLIDE 13: Evaluation Measures

◮ \widetilde{TP}: soft number of true positives

  \widetilde{TP} = \sum_{h \in TP} IoU(h, c(h))

◮ MOTSP: Multi-Object Tracking and Segmentation Precision

  MOTSP = \widetilde{TP} / |TP|

◮ c: unique mapping from hypotheses to ground truth

SLIDE 14: Evaluation Measures

◮ sMOTSA: Soft Multi-Object Tracking and Segmentation Accuracy

  sMOTSA = (\widetilde{TP} − |FP| − |IDS|) / |M|

◮ Combines tracking and segmentation quality into a single measure (see the sketch below)
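All three measures follow directly from the matched IoU values and the error counts. A hedged sketch (not the official evaluation script), assuming `tp_ious` holds one IoU(h, c(h)) value per true positive:

```python
import numpy as np

def mots_metrics(tp_ious, num_fp, num_ids, num_gt_masks):
    """Compute MOTSA, MOTSP and sMOTSA from per-match IoUs and error counts."""
    num_tp = len(tp_ious)
    soft_tp = float(np.sum(tp_ious))  # soft number of true positives
    motsa = (num_tp - num_fp - num_ids) / num_gt_masks
    motsp = soft_tp / num_tp if num_tp > 0 else 0.0
    smotsa = (soft_tp - num_fp - num_ids) / num_gt_masks
    return {"MOTSA": motsa, "MOTSP": motsp, "sMOTSA": smotsa}
```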

SLIDE 15: Baseline Method: TrackR-CNN

◮ Idea: detection, segmentation, and data association with a single convolutional network
◮ Extend Mask R-CNN by 3D convolutions and an association head
◮ ResNet-101 backbone, Mask R-CNN pre-trained on Mapillary
◮ Speed: ∼2 fps

[Figure: TrackR-CNN architecture. Feature extraction with shared weights on frames t-1, t, t+1; two 3D convolution layers (2x 3D Conv) turn the image features into temporally enhanced image features; a Region Proposal Network feeds bounding box regression, classification + scoring (e.g., CAR: 0.99), mask generation, and 128-D association vectors. During training, losses come from image instance segmentation ground truth and video tracking ground truth; during evaluation, detections are associated online with previously tracked objects via the association embedding.]
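The temporal component on its own is small. A minimal PyTorch sketch of the "2x 3D Conv" block; the channel count and kernel sizes here are illustrative assumptions, not values stated on the slides:

```python
import torch
import torch.nn as nn

class TemporalComponent(nn.Module):
    """Two 3D convolutions over stacked per-frame backbone features."""

    def __init__(self, channels: int = 256):  # channel count is a placeholder
        super().__init__()
        # 3x3x3 kernels with padding 1 keep (time, height, width) unchanged.
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, time, height, width) image features;
        # output: temporally enhanced features of the same shape.
        return self.relu(self.conv2(self.relu(self.conv1(feats))))
```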

SLIDE 16: TrackR-CNN

[Figure: detail of the temporal component. Feature extraction with shared weights on frames t-1, t, t+1 yields image features; the 2x 3D Conv layers produce temporally enhanced image features, which feed the Region Proposal Network.]

SLIDE 17: TrackR-CNN

[Figure: detail of the heads. On top of the Region Proposal Network sit bounding box regression, classification + scoring (e.g., CAR: 0.99), mask generation, and 128-D association vectors. During training, losses are computed against image instance segmentation ground truth and video tracking ground truth; during evaluation, the association embedding links detections online to previously tracked objects.]

SLIDE 18: Association Head

◮ Predict a Re-ID association vector for each detection
◮ Detections of the same instance should be close in embedding space
◮ Detections of distinct instances should be far away in embedding space
◮ Learned using the batch-hard triplet loss [Hermans et al., 2017]
◮ Associate detections over time using Euclidean distance + Hungarian matching (very simple; see the sketch below)
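A minimal sketch of this association step, assuming one (N, D) array of embedding vectors per frame; the distance threshold `gamma` for rejecting matches (and starting new tracks) is a hypothetical parameter, not a value from the slides:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_emb: np.ndarray, cur_emb: np.ndarray, gamma: float = 1.0):
    """Match current detections to tracked objects by embedding distance.

    prev_emb: (N, D) association vectors of previously tracked objects.
    cur_emb:  (M, D) association vectors of current detections.
    Returns (prev_idx, cur_idx) pairs; unmatched detections start new tracks.
    """
    # Pairwise Euclidean distances, shape (N, M).
    dists = np.linalg.norm(prev_emb[:, None, :] - cur_emb[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dists)  # Hungarian matching
    return [(r, c) for r, c in zip(rows, cols) if dists[r, c] < gamma]
```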

SLIDE 19: Association Head: Embedding Visualization

SLIDE 20: Qualitative Results: TrackR-CNN

SLIDE 21: Qualitative Comparison

◮ (a), (c): TrackR-CNN trained with boxes only + mask generation by Mask R-CNN
◮ (b), (d): TrackR-CNN trained with segmentation masks on KITTI MOTS

[Figure: qualitative comparison, panels (a)-(d)]

SLIDE 22: Quantitative Results (KITTI MOTS)

                         sMOTSA          MOTSA           MOTSP
                         Car     Ped     Car     Ped     Car     Ped
TrackR-CNN (ours)        76.2    46.8    87.8    65.1    87.2    75.7

SLIDE 23: Quantitative Results (KITTI MOTS)

                         sMOTSA          MOTSA           MOTSP
                         Car     Ped     Car     Ped     Car     Ped
TrackR-CNN (ours)        76.2    46.8    87.8    65.1    87.2    75.7
TrackR-CNN (box) + MG    75.0    41.2    87.0    57.9    86.8    76.3

◮ +MG: mask generation from bounding boxes using Mask R-CNN
◮ TrackR-CNN (box) + MG: training using box-based tracking data with post-hoc mask generation

SLIDE 24: Quantitative Results (KITTI MOTS)

                         sMOTSA          MOTSA           MOTSP
                         Car     Ped     Car     Ped     Car     Ped
TrackR-CNN (ours)        76.2    46.8    87.8    65.1    87.2    75.7
TrackR-CNN (box) + MG    75.0    41.2    87.0    57.9    86.8    76.3
Mask R-CNN + maskprop    75.1    45.0    86.6    63.5    87.1    75.6

◮ +MG: mask generation from bounding boxes using Mask R-CNN
◮ TrackR-CNN (box) + MG: training using box-based tracking data with post-hoc mask generation
◮ Mask R-CNN + maskprop: training using instance segmentation data only

SLIDE 25: Quantitative Results (KITTI MOTS)

                         sMOTSA          MOTSA           MOTSP
                         Car     Ped     Car     Ped     Car     Ped
TrackR-CNN (ours)        76.2    46.8    87.8    65.1    87.2    75.7
TrackR-CNN (box) + MG    75.0    41.2    87.0    57.9    86.8    76.3
Mask R-CNN + maskprop    75.1    45.0    86.6    63.5    87.1    75.6
GT Boxes + MG            82.5    50.0    95.3    71.1    86.9    75.4

◮ +MG: mask generation from bounding boxes using Mask R-CNN
◮ TrackR-CNN (box) + MG: training using box-based tracking data with post-hoc mask generation
◮ Mask R-CNN + maskprop: training using instance segmentation data only
◮ GT Boxes + MG: segment ground-truth bounding boxes

SLIDE 26: Quantitative Results (MOTSChallenge)

                                    sMOTSA   MOTSA   MOTSP
TrackR-CNN (ours)                   52.7     66.9    80.2
MHT-DAM [Kim et al., 2015] + MG     48.0     62.7    79.8
FWT [Henschel et al., 2018] + MG    49.3     64.0    79.7
MOTDT [Long et al., 2018] + MG      47.8     61.1    80.0
jCC [Keuper et al., 2018] + MG      48.3     63.0    79.9
GT Boxes + MG                       55.8     74.5    78.6

◮ +MG: mask generation from bounding boxes using Mask R-CNN
◮ MOTS is hard, even when given perfect ground-truth bounding boxes!
◮ Segmenting pedestrians in a crowd is difficult!

SLIDE 27: TrackR-CNN Ablations: Temporal Component

Temporal component    sMOTSA          MOTSA           MOTSP
                      Car     Ped     Car     Ped     Car     Ped
1x Conv3D             76.1    46.3    87.8    64.5    87.1    75.7
2x Conv3D             76.2    46.8    87.8    65.1    87.2    75.7
1x ConvLSTM           75.7    45.0    87.3    63.4    87.2    75.6
2x ConvLSTM           76.1    44.8    87.9    63.3    87.0    75.2
None                  76.4    44.8    87.9    63.2    87.3    75.5

◮ Conv3D shows improvements for pedestrians
◮ But the overall effect is limited
◮ Need better ways to incorporate temporal context

SLIDE 28: TrackR-CNN Ablations: Association Mechanism

Association mechanism          sMOTSA          MOTSA           MOTSP
                               Car     Ped     Car     Ped     Car     Ped
Association head               76.2    46.8    87.8    65.1    87.2    75.7
Mask IoU                       75.5    46.1    87.1    64.4    87.2    75.7
Mask IoU (train w/o assoc.)    74.9    44.9    86.5    63.3    87.1    75.6
Bbox IoU                       75.4    45.9    87.0    64.3    87.2    75.7
Bbox Center                    74.3    43.3    86.0    61.7    87.2    75.7

◮ Mask IoU: warp mask using optical flow into the next frame and associate based on mask IoU (a sketch of this variant follows)
◮ Bbox IoU: warp bounding box using median optical flow and associate based on box IoU
◮ Bbox Center: associate based on (unwarped) bounding box center distance
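A hedged sketch of the Mask IoU variant (an illustration, not the exact implementation): warp each previous-frame mask into the current frame with a dense optical flow field of shape (H, W, 2), then solve the assignment on mask IoU; `min_iou` is an assumed rejection threshold:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def warp_mask(mask: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Forward-warp a boolean mask with a dense flow field of shape (H, W, 2)."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    xs2 = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    ys2 = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    warped = np.zeros_like(mask)
    warped[ys2, xs2] = True
    return warped

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def associate_by_mask_iou(prev_masks, flow, cur_masks, min_iou=0.5):
    """Hungarian matching on IoU between warped previous and current masks."""
    if not prev_masks or not cur_masks:
        return []
    warped = [warp_mask(m, flow) for m in prev_masks]
    cost = np.array([[1.0 - mask_iou(w, c) for c in cur_masks] for w in warped])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
```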

SLIDE 29: Conclusion

◮ MOTS: new task, new annotations, metrics, and baseline
◮ Training benefits from time-consistent instance segmentations compared to
  ◮ Single-image instance segmentations
  ◮ Box-based tracking data
◮ CVPR 2019 main conference poster #122, Wednesday 15:20
◮ Get the new annotations and our code now at https://www.vision.rwth-aachen.de/page/mots
◮ KITTI MOTS test-set evaluation server coming soon!
◮ MOTSChallenge test-set evaluation server planned!

SLIDE 30: References I

Bernardin, K. and Stiefelhagen, R. (2008). Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing.

Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., and Leal-Taixé, L. (2019). CVPR19 tracking and detection challenge: How crowded can it get? arXiv:1906.04567.

Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.

SLIDE 31: References II

Henschel, R., Leal-Taixé, L., Cremers, D., and Rosenhahn, B. (2018). Fusion of head and full-body detectors for multi-object tracking. In CVPRW.

Hermans, A., Beyer, L., and Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.

Keuper, M., Tang, S., Andres, B., Brox, T., and Schiele, B. (2018). Motion segmentation & multiple object tracking by correlation co-clustering. PAMI.

SLIDE 32: References III

Kim, C., Li, F., Ciptadi, A., and Rehg, J. M. (2015). Multiple hypothesis tracking revisited. In ICCV.

Leal-Taixé, L., Milan, A., Reid, I., Roth, S., and Schindler, K. (2015). MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942.

Long, C., Haizhou, A., Zijie, Z., and Chong, S. (2018). Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME.

SLIDE 33: References IV

Milan, A., Leal-Taixé, L., Reid, I., Roth, S., and Schindler, K. (2016). MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831.

Ristani, E., Solera, F., Zou, R., Cucchiara, R., and Tomasi, C. (2016). Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision Workshop on Benchmarking Multi-Target Tracking.

Wen, L., Du, D., Cai, Z., Lei, Z., Chang, M., Qi, H., Lim, J., Yang, M., and Lyu, S. (2015). UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. arXiv preprint arXiv:1511.04136.

SLIDE 34: References V

Zhu, P., Wen, L., Bian, X., Haibin, L., and Hu, Q. (2018). Vision meets drones: A challenge. arXiv preprint arXiv:1804.07437.
