Beyond Detection: Towards Multi-Object Tracking and Segmentation


  1. Beyond Detection: Towards Multi-Object Tracking and Segmentation
     Andreas Geiger
     Autonomous Vision Group, University of Tübingen / MPI for Intelligent Systems
     June 17, 2018

  2. MOTS: Multi-Object Tracking and Segmentation [Voigtlaender, Krause, Osep, Luiten, Sekar, Geiger & Leibe, CVPR 2019]

  5. Motivation
     ◮ Datasets for multi-object tracking
       ◮ MOTChallenges
         ◮ MOT15 [Leal-Taixe et al., 2015]
         ◮ MOT16, MOT17 [Milan et al., 2016]
         ◮ CVPR19 [Dendorfer et al., 2019]
       ◮ KITTI Tracking [Geiger et al., 2012]
       ◮ VisDrone2018 [Zhu et al., 2018]
       ◮ DukeMTMC [Ristani et al., 2016]
       ◮ UA-DETRAC [Wen et al., 2015]
       ◮ ...
     ◮ Led to great progress in the community
     ◮ But annotations are only on the bounding box level

  6. Are bounding boxes enough?

  7. Object Tracking vs. Segmentation
     ◮ In difficult cases, bounding boxes are a very coarse approximation
     ◮ Most pixels of the bounding box belong to other objects

  8. Two Communities
     ◮ Object Tracking
     ◮ Semantic Segmentation / Instance Segmentation

  9. Can we unite the two?

  10. MOTS: Multi-Object Tracking and Segmentation
      ◮ Dense pixel-wise annotations are tedious, hard work... but we did it!
      [Figure: KITTI MOTS annotation example]

  11. MOTS: Multi-Object Tracking and Segmentation
      ◮ Dense pixel-wise annotations are tedious, hard work... but we did it!
      [Figure: MOTSChallenge annotation example]

  12. MOTS: Multi-Object Tracking and Segmentation
      ◮ How? 4 student assistants & semi-automatic annotation procedure

                                     KITTI MOTS    KITTI MOTS    MOTSChallenge
                                     train         val           train
      # Sequences                        12             9              4
      # Frames                        5,027         2,981          2,862
      # Tracks Pedestrian                99            68            228
      # Masks Pedestrian (total)      8,073         3,347         26,894
      # Masks Pedestrian (annot.)     1,312           647          3,930
      # Tracks Car                      431           151              -
      # Masks Car (total)            18,831         8,068              -
      # Masks Car (annot.)            1,509           593              -

  13. Data Annotation

  17. Data Annotation
      ◮ Starting point: existing box-level tracking annotations
      ◮ Fully convolutional network (FCN) converts bounding boxes to segmentation masks
      ◮ First, 2 instances per track are manually annotated
      ◮ However, the trained segmentation model will not be perfect
      ◮ Repeat until annotations are good:
        1. Annotators fix worst errors with polygon annotations
        2. Add new annotations to training set of FCN
        3. Re-train FCN (pre-train on all, fine-tune per object)
           ⇒ Allows for adaptation to appearance and context of each object
        4. Re-generate masks using FCN
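The refinement loop on this slide can be sketched as a small driver function. This is only an illustrative skeleton, not the authors' tooling: the names `refine_annotations`, `train`, `predict`, `pick_worst`, and `annotate` are hypothetical placeholders for the FCN training step, mask generation, the annotators' choice of worst errors, and manual polygon annotation, and `max_rounds` is an assumed safety limit.

```python
def refine_annotations(boxes, seeds, train, predict, pick_worst, annotate,
                       max_rounds=10):
    """Semi-automatic annotation loop (hypothetical sketch).

    boxes:      ids of all boxes that need a mask
    seeds:      {box_id: mask} for the initial manual annotations
    train:      callable(annotations) -> model (stands in for FCN training)
    predict:    callable(model, boxes) -> {box_id: mask}
    pick_worst: callable(masks, annotations) -> box ids judged worst
    annotate:   callable(box_id) -> manually drawn mask
    """
    annotations = dict(seeds)
    model = train(annotations)
    masks = predict(model, boxes)
    for _ in range(max_rounds):
        worst = pick_worst(masks, annotations)
        if not worst:                      # annotations are judged good
            break
        for box_id in worst:               # 1. fix worst errors manually
            annotations[box_id] = annotate(box_id)   # 2. grow training set
        model = train(annotations)         # 3. re-train
        masks = predict(model, boxes)      # 4. re-generate masks
    return masks, annotations
```

Passing the steps in as callables keeps the sketch independent of any particular network or annotation tool; only the control flow mirrors the slide.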

  19. Data Annotation
      ◮ Manual corrections ensure consistency and high quality
      ◮ Large savings in annotation time
        ◮ KITTI MOTS: only 13% of car boxes / 17% of pedestrian boxes manually annotated
        ◮ MOTSChallenge: 15% of pedestrian boxes manually annotated

  20. Evaluation Metrics

  23. Evaluation Metrics
      ◮ We consider mask-based variants of the CLEAR MOT metrics [Bernardin and Stiefelhagen, 2008]
      ◮ Need to associate predictions to ground truth instances
        ◮ Box-based tracking: boxes might overlap
          ◮ Requires bi-partite matching
        ◮ Mask-based tracking: masks are disjoint
          ◮ Establishing correspondences is greatly simplified
          ◮ Hypothesized and ground truth masks are matched iff mask IoU > 0.5
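To illustrate why disjoint masks simplify matching, here is a minimal sketch that represents each mask as a set of (row, col) pixels. Because the hypothesis masks are disjoint, at most one of them can cover more than half of a ground truth mask, so no assignment solver is needed. The function names and the set-based mask representation are assumptions made for this example.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_masks(hyps, gts, thresh=0.5):
    """Return {hyp_id: gt_id} for all pairs with mask IoU > thresh.

    Assumes masks within `hyps` are disjoint, and likewise within `gts`,
    so each mask has at most one partner above the 0.5 threshold.
    """
    matches = {}
    for h_id, h in hyps.items():
        for g_id, g in gts.items():
            if mask_iou(h, g) > thresh:
                matches[h_id] = g_id
                break  # no other ground truth mask can also exceed 0.5
    return matches


gts = {"g1": {(0, 0), (0, 1), (1, 0), (1, 1)}, "g2": {(5, 5), (5, 6)}}
hyps = {"h1": {(0, 0), (0, 1), (1, 0)}, "h2": {(5, 6), (5, 7), (5, 8)}}
print(match_masks(hyps, gts))  # h1 matches g1 (IoU 0.75); h2 stays unmatched (IoU 0.25)
```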

  24. Evaluation Metrics
      (Soft) Multi-Object Tracking and Segmentation Accuracy / Precision:

      $$\mathrm{MOTSA} = 1 - \frac{|FN| + |FP| + |IDS|}{|M|} = \frac{|TP| - |FP| - |IDS|}{|M|}$$

      $$\mathrm{MOTSP} = \frac{\widetilde{TP}}{|TP|} \qquad \mathrm{sMOTSA} = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|} \qquad \widetilde{TP} = \sum_{h \in TP} \mathrm{IoU}(h, c(h))$$

      ◮ c: mapping from hypotheses to ground truth
      ◮ TP: true positives, $\widetilde{TP}$: soft number of true positives
      ◮ FN: false negatives, FP: false positives, IDS: ID switches
      ◮ M: set of ground truth segmentation masks
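As a worked example of these definitions, the sketch below computes the three scores from the IoUs of the matched pairs and the error counts. The function name `mots_metrics` and its argument layout are assumptions for illustration; returning 0 for MOTSP when there are no true positives is also a choice made here, not something stated on the slide.

```python
def mots_metrics(tp_ious, num_fp, num_ids, num_gt):
    """Compute (MOTSA, MOTSP, sMOTSA).

    tp_ious: IoU(h, c(h)) for every true positive hypothesis h
    num_fp:  |FP|, number of false positives
    num_ids: |IDS|, number of ID switches
    num_gt:  |M|, number of ground truth masks (must be > 0)
    """
    num_tp = len(tp_ious)
    soft_tp = sum(tp_ious)                      # soft number of true positives
    motsa = (num_tp - num_fp - num_ids) / num_gt
    smotsa = (soft_tp - num_fp - num_ids) / num_gt
    motsp = soft_tp / num_tp if num_tp else 0.0  # assumption: 0 when no TP
    return motsa, motsp, smotsa


# 4 ground truth masks, 3 matched with IoUs 0.9 / 0.8 / 0.7, 1 false positive.
motsa, motsp, smotsa = mots_metrics([0.9, 0.8, 0.7], num_fp=1, num_ids=0, num_gt=4)
print(round(motsa, 3), round(motsp, 3), round(smotsa, 3))  # 0.5 0.8 0.35
```

In this example |FN| = |M| - |TP| = 1, so the first form of MOTSA gives 1 - (1 + 1 + 0) / 4 = 0.5, which agrees with the second form (3 - 1 - 0) / 4.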

  25. TrackR-CNN Baseline

  26. TrackR-CNN
      [Figure: TrackR-CNN architecture diagram. Labels recoverable from the diagram: Image Features; Feature Extraction for frames t-1, t, t+1 (shared weights); 2x 3D Conv; Temporally Enhanced Image Features; Region Proposal Network; Classification + Scoring (e.g. CAR: 0.99); Bounding Box Regression; Mask Generation; Association Embedding (128-D Association Vectors); During Training: Instance Segmentation Loss and Video Tracking Loss against Ground Truth; During Evaluation: Online Track Association with Previously Tracked Objects.]
      Key Idea:
      ◮ Detection, segmentation, and data association with a single ConvNet
      ◮ Extend Mask R-CNN by 3D convolutions and association head

  29. TrackR-CNN
      Association Head:
      ◮ Predict association vector for each detection
      ◮ Detections of same instance should be close in embedding space
      ◮ Detections of distinct instances should be distant from each other

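  31. TrackR-CNN
      Training:
      ◮ Learned using batch-hard triplet loss [Hermans et al., 2017]:

        $$\frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \max\left( \max_{\substack{e \in \mathcal{D}: \\ id_e = id_d}} \| a_e - a_d \|_2 \; - \min_{\substack{e \in \mathcal{D}: \\ id_e \neq id_d}} \| a_e - a_d \|_2 + \alpha,\; 0 \right)$$

        where $\mathcal{D}$ is the set of detections in the mini-batch, $a_d$ the association vector of detection $d$, and $id_d$ its ground truth track ID
      ◮ Mini-batch: 8 consecutive frames
      ◮ Mine furthest detection of same instance and closest detection of other instance
      ◮ Require separation by at least margin α
      Inference:
      ◮ Associate detections over time based on Euclidean distance in embedding space and bi-partite graph matching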
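The two steps on this slide can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the TrackR-CNN implementation: embeddings are short tuples rather than 128-D network outputs, detections with no other instance in the batch are skipped in the loss, the bi-partite matching is solved by exhaustive search (feasible only for a handful of detections; a practical system would use a polynomial-time solver such as the Hungarian algorithm), and the `max_dist` gating threshold is a hypothetical parameter.

```python
import math
from itertools import permutations


def batch_hard_triplet_loss(embeddings, ids, margin):
    """Mean over detections d of max(hardest_pos - hardest_neg + margin, 0)."""
    total = 0.0
    for a_d, id_d in zip(embeddings, ids):
        pos = [math.dist(a_e, a_d) for a_e, i in zip(embeddings, ids) if i == id_d]
        neg = [math.dist(a_e, a_d) for a_e, i in zip(embeddings, ids) if i != id_d]
        if not neg:
            continue  # assumption: skip terms without a negative in the batch
        # furthest same-instance detection vs. closest other-instance detection
        total += max(max(pos) - min(neg) + margin, 0.0)
    return total / len(embeddings)


def associate(track_embs, det_embs, max_dist):
    """Minimum-cost one-to-one matching by exhaustive search (tiny inputs only).

    Returns {detection_index: track_index}. Leaving a detection unmatched
    costs max_dist, so pairs farther apart than that are never chosen.
    """
    slots = list(range(len(track_embs))) + [None] * len(det_embs)
    best_cost, best = math.inf, {}
    for perm in permutations(slots, len(det_embs)):
        cost, pairs = 0.0, {}
        for d, t in enumerate(perm):
            if t is None:
                cost += max_dist
            else:
                cost += math.dist(det_embs[d], track_embs[t])
                pairs[d] = t
        if cost < best_cost:
            best_cost, best = cost, pairs
    return best


embs = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(batch_hard_triplet_loss(embs, [1, 1, 2, 2], margin=2.0))  # 0.0: well separated
print(associate([(0.0, 0.0), (5.0, 0.0)],
                [(5.2, 0.0), (0.1, 0.0), (20.0, 0.0)], max_dist=1.0))
```

In the association example, detection 0 joins track 1, detection 1 joins track 0, and the far-away detection 2 stays unmatched, which in an online tracker would typically start a new track.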

  32. Experimental Evaluation

  33. Results of TrackR-CNN on MOTSChallenge
      ◮ Crowded scenes can lead to missing detections and ID switches

  41. Results of TrackR-CNN on KITTI MOTS
      ◮ Most objects distinguished well but some erroneous detections remain (red)

  45. Results of TrackR-CNN on KITTI MOTS
      ◮ Continuation of track with same ID after missing detection (red)
