 
ReMOTS: Refining Multi-Object Tracking and Segmentation (1st Place Solution for MOTS 2020 Challenge 1)
Fan Yang 1,2, Xin Chang 1, Chenyu Dang 1, Ziqiang Zheng 3, Sakriani Sakti 1,2, Satoshi Nakamura 1,2, Yang Wu 4
1 Nara Institute of Science and Technology, Japan
2 RIKEN Center for Advanced Intelligence Project, Japan
3 UISEE Technology (Beijing) Co. Ltd., China
4 Kyoto University, Japan
Background of Multi-Object Tracking and Segmentation (MOTS)
• Problem: detect, segment, and track multiple objects in videos.
• Input: a video sequence that contains multiple RGB images.
• Output: a 2D mask and the corresponding track ID at each frame.
• Applications: action recognition, autonomous driving, and others.
[Figure: input video data (frames k and k+1) → instance segmentation (detect + segment) → MOTS (detect + segment + track), with track_id 1 and track_id 2 shown in both frames.]
Our solution for MOTS
Step 1: use off-the-shelf models for instance segmentation.
[Figure: input video data (frames k and k+1) → instance segmentation (detect + segment).]
Instance Segmentation
We take off-the-shelf models: X-101-64x4d-FPN from MMDetection and Mask R-CNN X152 from Detectron2, both publicly available detection and segmentation methods.
But how do we fuse instance masks from different models?
• Fusing boxes: use NMS.
• Fusing masks: NMS can also be used, but with IoU replaced by IoM (Intersection over Minimum).
[Figure: two pairs of instance masks.
Pair 1 (two masks of area 2 overlapping by 1): Pixel_IoU = 1/3 = 0.33, Pixel_IoM = 1/2 = 0.5.
Pair 2 (a 0.1 × 0.1 mask inside a mask of area 2): Pixel_IoU = 0.01/2 = 0.005, Pixel_IoM = 0.01/0.01 = 1.
IoU-based suppression is acceptable for bounding boxes, but not for masks.]
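The mask-fusion idea can be illustrated with a minimal sketch of greedy NMS that suppresses by IoM instead of IoU. This is not the authors' implementation: the set-of-pixels mask representation, the function names, the confidence scores, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col); hypothetical mask representation


def pixel_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pixel_iom(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over the area of the smaller mask."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def mask_nms(masks: List[Set[Pixel]], scores: List[float],
             iom_thresh: float = 0.5) -> List[int]:
    """Greedy NMS over masks using IoM; returns kept indices, best score first.

    The threshold value is an assumption, not taken from the slides.
    """
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep mask i only if it does not overlap too much with any kept mask.
        if all(pixel_iom(masks[i], masks[k]) <= iom_thresh for k in keep):
            keep.append(i)
    return keep


def rect(r0: int, c0: int, h: int, w: int) -> Set[Pixel]:
    """Helper: a filled rectangle as a pixel set."""
    return {(r, c) for r in range(r0, r0 + h) for c in range(c0, c0 + w)}


big = rect(0, 0, 20, 10)   # 200 pixels
small = rect(5, 5, 1, 1)   # 1 pixel, fully inside `big`
kept = mask_nms([big, small], scores=[0.9, 0.8])
```

Here `pixel_iou(big, small)` is 1/200 = 0.005, so IoU-based NMS would keep both masks, while `pixel_iom(big, small)` is 1 and the small mask is suppressed. This mirrors the second mask pair on the slide.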
Our solution for MOTS
We propose an offline method, ReMOTS (Refining Multi-Object Tracking and Segmentation).
Our main contributions:
1. Refine appearance features.
2. Automatically decide the threshold.
[Figure: Step 1, input video data → instance segmentation (detect + segment); Step 2, instance segmentation → MOTS (detect + segment + track), with track_id 1 and track_id 2 in frames k and k+1.]
Intra-frame Training and Short-term Tracking
[Figure: pipeline overview. Raw object-instance segmentation (frames t1 to t8) → intra-frame training of the appearance encoder → short-term tracker → short-term tracklets.]
Intra-frame training:
• Appearance encoder: BoT-Reid (TMM 2020).
• Form a mini batch from ground-truth tracklets in the training set and estimated bounding boxes in the test set, at a ratio of 1:1.
• Training set: inter-tracklet sampling over temporally overlapped and non-overlapped tracklets.
• Test set: intra-frame sampling over the observations of frame k, with augmentation.
Short-term tracking:
• Embed appearance features with the appearance encoder and build the appearance distance matrix from cosine distances (example rows: 0.2 0.4 inf / inf inf 0.3).
• For masks of frame k, consider all masks of frame k+1 with IoU > 0 for matching.
• Match by linear assignment.
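A minimal sketch of the frame-to-frame association described above, assuming appearance features are plain lists of floats and the mask IoU between frames k and k+1 is already computed. All function names are hypothetical. The exhaustive search below is only a stand-in for a real linear-assignment solver (for example the Hungarian algorithm) and is practical only for a handful of objects.

```python
import math
from itertools import permutations
from typing import List, Optional, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def appearance_distance_matrix(feats_k, feats_k1, mask_iou) -> List[List[float]]:
    """Cosine distance where the masks overlap (IoU > 0), inf elsewhere."""
    return [
        [cosine_distance(f, g) if mask_iou[i][j] > 0 else math.inf
         for j, g in enumerate(feats_k1)]
        for i, f in enumerate(feats_k)
    ]


def linear_assignment(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Exhaustive matching: most finite-cost pairs first, then lowest total cost.

    Each row gets a distinct column or stays unmatched (None).
    """
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    options: List[Optional[int]] = list(range(n_cols)) + [None] * n_rows
    best_pairs: List[Tuple[int, int]] = []
    best_key = (0, 0.0)
    for perm in permutations(options, n_rows):
        pairs = [(i, j) for i, j in enumerate(perm)
                 if j is not None and not math.isinf(cost[i][j])]
        key = (-len(pairs), sum(cost[i][j] for i, j in pairs))
        if key < best_key:
            best_pairs, best_key = pairs, key
    return best_pairs


# Distance matrix shown on the slide: 2 masks in frame k, 3 in frame k+1.
slide_matrix = [[0.2, 0.4, math.inf],
                [math.inf, math.inf, 0.3]]
matches = linear_assignment(slide_matrix)
```

With the slide's example matrix, mask 0 of frame k is matched to mask 0 of frame k+1 and mask 1 to mask 2; pairs with an `inf` entry are never matched.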
Inter-short-term-tracklet Training
[Figure: pipeline overview. Raw object-instance segmentation → intra-frame training → short-term tracker → short-term tracklets → inter-short-term-tracklet training of the appearance encoder.]
• Form a mini batch from ground-truth tracklets in the training set and estimated short-term tracklets in the test set, at a ratio of 1:1.
• Training set: inter-tracklet sampling over temporally overlapped and non-overlapped tracklets.
• Test set: inter-short-term-tracklet sampling over temporally overlapped short-term tracklets.
• Distance matrix W_long between short-term tracklets, built from the encoder's cosine appearance similarity:
    if same split tracklet ID: set inf
    elif temporal overlapping: set inf
    else: set cosine distance
  Example:
    inf inf 0.1 0.4
    inf inf 0.5 0.2
    0.1 0.5 inf inf
    0.4 0.2 inf inf
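The three rules for W_long can be sketched as follows. This is an illustration under assumptions: the `Tracklet` fields (`split_id`, inclusive `start` and `end` frames, a single `feature` vector per tracklet) and the function names are hypothetical, and the slides do not say how a tracklet-level feature is obtained.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Tracklet:
    split_id: int          # ID of the tracklet this piece was split from (assumed field)
    start: int             # first frame, inclusive
    end: int               # last frame, inclusive
    feature: List[float]   # one appearance vector per tracklet (assumption)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def temporally_overlap(a: Tracklet, b: Tracklet) -> bool:
    """True if the two frame ranges share at least one frame."""
    return a.start <= b.end and b.start <= a.end


def build_w_long(tracklets: List[Tracklet]) -> List[List[float]]:
    """Apply the slide's rules: inf for same split ID, inf for overlap, else cosine distance."""
    n = len(tracklets)
    w = [[math.inf] * n for _ in range(n)]
    for i, a in enumerate(tracklets):
        for j, b in enumerate(tracklets):
            if a.split_id == b.split_id:
                continue  # stays inf (this also covers the diagonal)
            if temporally_overlap(a, b):
                continue  # stays inf
            w[i][j] = cosine_distance(a.feature, b.feature)
    return w


tracklets = [
    Tracklet(split_id=0, start=1, end=4, feature=[1.0, 0.0]),
    Tracklet(split_id=1, start=1, end=4, feature=[0.0, 1.0]),
    Tracklet(split_id=2, start=5, end=8, feature=[1.0, 0.1]),
    Tracklet(split_id=3, start=5, end=8, feature=[0.1, 1.0]),
]
w_long = build_w_long(tracklets)
```

The resulting matrix has the same block pattern as the slide's example: `inf` between tracklets that share frames, and finite cosine distances between tracklets from different time spans.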
What Happened in Each Step of Appearance Training
J(H1, H2) represents the Jaccard index of two normalized histograms H1 and H2.
[Figure: histograms for intra-frame instance masks and intra-short-tracklet instance masks at three stages:
(1) after training on the training set only;
(2) after intra-frame training on the test set without labels;
(3) after inter-short-tracklet training on the test set with pseudo labels.]
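The slide does not spell out how J(H1, H2) is computed. One common definition for histograms, assumed here, is the weighted Jaccard index: the sum of bin-wise minima divided by the sum of bin-wise maxima. The sketch below uses that definition; the binning helper, its bin count, and its value range are likewise assumptions.

```python
from typing import List, Sequence


def normalized_histogram(values: Sequence[float], bins: int = 10,
                         lo: float = -1.0, hi: float = 1.0) -> List[float]:
    """Histogram over [lo, hi] whose bins sum to 1 (all zeros if `values` is empty)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp values on or outside the edges
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def jaccard_index(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Weighted Jaccard index: sum of bin-wise minima over sum of bin-wise maxima."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    inter = sum(min(a, b) for a, b in zip(h1, h2))
    union = sum(max(a, b) for a, b in zip(h1, h2))
    return inter / union if union else 0.0


# Two toy sets of cosine affinities (made-up numbers, for illustration only).
h_a = normalized_histogram([0.9, 0.95, -0.9], bins=2)
h_b = normalized_histogram([0.8, 0.7], bins=2)
overlap = jaccard_index(h_a, h_b)
```

Under this definition, identical histograms give 1 and histograms with no shared mass give 0, so a lower value means the two distributions are better separated.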
Merging Short-term Tracklets
[Figure: pipeline overview. Raw object-instance segmentation → intra-frame training → short-term tracker → short-term tracklets → inter-short-term-tracklet training → hierarchical clustering → long-term tracklets.]
• Compute the distance matrix W_long between short-term tracklets (inf for the same split tracklet ID or for temporal overlap, cosine distance otherwise).
• Run hierarchical clustering on W_long and cut at the cutting threshold; the resulting clusters are the long-term tracklets.
• The slide shows the cutting threshold next to the intra-frame cosine affinity and the intra-short-tracklet cosine affinity of temporally overlapped short-term tracklets [formula garbled in extraction: "… = 2 − …"].
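A minimal sketch of the merging step, assuming complete linkage (the slides do not name the linkage) so that an `inf` entry between any two members keeps their clusters apart. The function name and the cut value of 0.3 are hypothetical; in ReMOTS the threshold is decided automatically, which this sketch does not reproduce.

```python
import math
from typing import List


def merge_tracklets(dist: List[List[float]], cut: float) -> List[List[int]]:
    """Agglomerative clustering with complete linkage, stopped at `cut`.

    Returns clusters as lists of tracklet indices. Complete linkage is an
    assumption; it guarantees that pairs with infinite distance never merge.
    """
    clusters: List[List[int]] = [[i] for i in range(len(dist))]

    def linkage(a: List[int], b: List[int]) -> float:
        # Complete linkage: the largest pairwise distance between the clusters.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cut:
            break  # the closest remaining pair is above the cutting threshold
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


inf = math.inf
# Example distance matrix W_long from the slide.
w_long = [
    [inf, inf, 0.1, 0.4],
    [inf, inf, 0.5, 0.2],
    [0.1, 0.5, inf, inf],
    [0.4, 0.2, inf, inf],
]
long_term = merge_tracklets(w_long, cut=0.3)
```

With the slide's example matrix and a cut of 0.3, tracklets 0 and 2 merge (distance 0.1), then tracklets 1 and 3 merge (distance 0.2), and the two resulting clusters stay separate because they contain temporally overlapping members.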
Comparison with others on MOTSChallenge 1
[Table: results not recoverable from the extracted text. Annotations on the table: "May benefit from refinement" and "May benefit from mask fusion".]
Since our strategy can be easily adapted to other methods, would they achieve better performance by applying our appearance encoder and merging?
Limitations of ReMOTS
1. It is an offline approach. It is worth exploring how to turn it into an online approach.
2. It is challenging for ReMOTS to handle objects with similar appearance, e.g., it works well for persons (who wear different clothes) but is less useful for vehicles (similar textures).
3. Trajectory is not considered in our short-term tracker, so it fails to associate fast-moving objects.
[Figure: a fast-moving car with similar appearance (marked "?") vs. a slowly moving person with diverse clothes.]
Conclusion
• Unlabeled target videos can be used to learn better appearance features, but care is needed because they may introduce noise.
• Suitable hyperparameters for data association may vary from case to case, and the statistical information of tracklets can be used to adjust them.
• It would be desirable to carry some insights of ReMOTS over to online MOTS.
Thank you for listening.