

SLIDE 1

ReMOTS: Refining Multi-Object Tracking and Segmentation (1st Place Solution for MOTS 2020 Challenge 1)

Fan Yang1,2, Xin Chang1, Chenyu Dang1, Ziqiang Zheng3, Sakriani Sakti1,2, Satoshi Nakamura1,2, Yang Wu4

1 Nara Institute of Science and Technology, Japan; 2 RIKEN Center for Advanced Intelligence Project, Japan; 3 UISEE Technology (Beijing) Co. Ltd., China; 4 Kyoto University, Japan

SLIDE 2

  • Problem: detect, segment, and track multiple objects in videos.
  • Input: a video sequence containing multiple RGB images.
  • Output: a 2D mask and corresponding track ID at each frame.
  • Application: action recognition, autonomous driving, and others.

Background of Multi-Object Tracking and Segmentation (MOTS)

[Figure: input video data (frames k and k+1) → instance segmentation (detect + segment) → MOTS (detect + segment + track), where each mask keeps its track ID (track_id 1, track_id 2) across frames]

SLIDE 3

Our solution for MOTS

Step 1: instance segmentation — use off-the-shelf models to detect and segment objects in each frame.

[Figure: input video data (frames k and k+1) → instance segmentation (detect + segment)]

SLIDE 4

We use off-the-shelf models: X-101-64x4d-FPN from MMDetection and Mask R-CNN X152 from Detectron2, i.e., publicly available detection and segmentation methods.

Instance Segmentation

But how do we fuse instance masks from different models?

  • Fusing boxes: use NMS.
  • Fusing masks: NMS may also be used, but with IoU replaced by IoM (Intersection over Minimum).

Why IoM? For two overlapping instance masks of similar size, Pixel_IoU = 1/3 = 0.33 while Pixel_IoM = 1/2 = 0.5; for a tiny mask (area 0.01) lying entirely inside a larger mask (area 2), Pixel_IoU = 0.01/2 = 0.005 while Pixel_IoM = 0.01/0.01 = 1. Missing such nested duplicates is acceptable for bounding boxes, but not for masks.
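The mask fusion above can be sketched as greedy NMS with IoM in place of IoU. The function names, the boolean-mask representation, and the 0.5 threshold are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pixel_iom(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over Minimum (IoM) of two non-empty boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return inter / min(mask_a.sum(), mask_b.sum())

def mask_nms_iom(masks, scores, iom_thresh=0.5):
    """Greedy NMS over instance masks, suppressing by IoM instead of IoU."""
    order = np.argsort(scores)[::-1]  # visit highest-scoring masks first
    keep = []
    for i in order:
        # Keep mask i only if it is not a (near-)duplicate of any kept mask.
        if all(pixel_iom(masks[i], masks[j]) < iom_thresh for j in keep):
            keep.append(int(i))
    return keep

# A small mask nested inside a large one has tiny IoU but IoM = 1,
# so IoM-based suppression removes the duplicate while IoU would miss it.
big = np.zeros((10, 10), dtype=bool);   big[2:8, 2:8] = True
small = np.zeros((10, 10), dtype=bool); small[4:6, 4:6] = True
print(mask_nms_iom([big, small], [0.9, 0.8]))  # → [0]
```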

SLIDE 5

We propose an offline method, ReMOTS (Refining Multi-Object Tracking and Segmentation).

Our solution for MOTS

Step 1: Instance Segmentation (detect + segment) on the input video data.
Step 2: MOTS (detect + segment + track), assigning track IDs (track_id 1, track_id 2) consistently across frames k and k+1.

Our main contributions:

  • 1. Refine appearance features
  • 2. Automatically decide the threshold
SLIDE 6

Intra-frame Training and Short-term Tracking

Short-term tracker:
  • For the masks of frame k, consider all masks of frame k+1 with IoU > 0 as matching candidates.
  • An appearance encoder (BoT-ReID, TMM 2020) embeds each mask into appearance features; pairwise cosine distances form the appearance distance matrix, with infeasible pairs set to inf (e.g., [[0.2, 0.4, inf], [inf, inf, 0.3]] for frames k and k+1).
  • Matches are obtained by linear assignment on this matrix, linking masks into short-term tracklets t1 ... t8.

Intra-frame training (of the appearance encoder, on the test set without labels):
  • Intra-frame sampling with augmentation: masks observed in the same frame must belong to different identities, so they yield negative (N) pairs, while augmented copies of the same mask yield positive (P) pairs.
  • Inter-tracklet sampling on ground-truth tracklets (temporally overlapped & non-overlapped) in the training set provides additional P/N pairs; estimated bounding boxes from the test set and ground-truth tracklets form a mini-batch input at the ratio 1:1.

[Figure: raw object-instance segmentation → short-term tracker → short-term tracklets t1–t8, with hypothesis ids matching ground-truth IDs (ID2, ID3)]
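The adjacent-frame matching step can be sketched with SciPy's assignment solver. The gating value `max_dist = 0.7` and the use of a large finite constant for infeasible pairs are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Infeasible pairs (mask IoU == 0) get a large finite cost, which keeps
# the sketch independent of how the solver treats inf entries.
BIG = 1e6

def match_frames(dist: np.ndarray, max_dist: float = 0.7):
    """Return (row, col) pairs of matched masks whose distance is small."""
    rows, cols = linear_sum_assignment(dist)  # minimize total cosine distance
    return [(int(r), int(c)) for r, c in zip(rows, cols) if dist[r, c] < max_dist]

# Distance matrix from the slide: masks of frame k (rows) vs frame k+1 (cols).
dist = np.array([[0.2, 0.4, BIG],
                 [BIG, BIG, 0.3]])
print(match_frames(dist))  # → [(0, 0), (1, 2)]
```

Matched pairs are then chained across frames to grow short-term tracklets.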

SLIDE 7

Inter-short-term-tracklet Training

After short-term tracking, the appearance encoder is trained a second time with tracklet-level pseudo-labels:

  • Inter-short-term-tracklet sampling: temporally overlapped short-term tracklets in the test set cannot share an identity, so observations drawn across them form negative (N) pairs, while observations within one tracklet form positive (P) pairs.
  • As before, inter-tracklet sampling on ground-truth tracklets (temporally overlapped & non-overlapped) from the training set is mixed with the estimated short-term tracklets from the test set to form a mini-batch input at the ratio 1:1.

The refined encoder then re-embeds the short-term tracklets t1 ... t8, and their pairwise cosine similarities define the distance matrix Wlong:

    if same split tracklet ID: set inf
    elif temporally overlapping: set inf
    else: set cosine distance

[Figure: short-term tracklets t1–t8 → appearance encoder → cosine similarity → symmetric 4x4 distance matrix, rows (inf, inf, 0.1, 0.4), (inf, inf, 0.5, 0.2), (0.1, 0.5, inf, inf), (0.4, 0.2, inf, inf)]
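The Wlong construction rule above can be sketched directly. The tracklet fields (`emb`, `start`, `end`, `split_id`) are illustrative assumptions about what a short-term tracklet record carries, not names from the paper:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def temporally_overlap(ta, tb):
    """True if the two tracklets' frame intervals intersect."""
    return ta["start"] <= tb["end"] and tb["start"] <= ta["end"]

def build_w_long(tracklets):
    """Pairwise distance matrix following the slide's if/elif/else rule."""
    n = len(tracklets)
    w = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(n):
            ti, tj = tracklets[i], tracklets[j]
            if i == j or ti["split_id"] == tj["split_id"]:
                continue                      # same split tracklet ID: inf
            if temporally_overlap(ti, tj):
                continue                      # temporal overlap: inf
            w[i, j] = cosine_distance(ti["emb"], tj["emb"])
    return w

t1 = {"emb": np.array([1.0, 0.0]), "start": 0, "end": 4, "split_id": 0}
t2 = {"emb": np.array([1.0, 0.0]), "start": 5, "end": 9, "split_id": 1}
t3 = {"emb": np.array([0.0, 1.0]), "start": 3, "end": 7, "split_id": 2}
w = build_w_long([t1, t2, t3])
```

Here t1 and t2 are compatible (distance 0.0, identical embeddings), while t3 overlaps both in time and so can never be merged with either.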

SLIDE 8

What Happened in Each Step of Appearance Training

J(H1, H2) denotes the Jaccard index of two normalized histograms H1 and H2, compared here between the intra-frame and intra-short-tracklet affinity histograms:

  (1) after training on the training set only;
  (2) after intra-frame training on the test set without labels;
  (3) after inter-short-tracklet training on the test set with pseudo-labels.

[Figure: affinity histograms of intra-frame instance masks vs. intra-short-tracklet instance masks at each of the three stages]
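One common bin-wise definition of the Jaccard index for histograms is the sum of minima over the sum of maxima; the slide does not spell out the formula, so this sketch is an assumption:

```python
import numpy as np

def jaccard_histograms(h1: np.ndarray, h2: np.ndarray) -> float:
    """Generalized Jaccard index of two histograms, after normalization.

    Returns 1.0 for identical distributions and 0.0 for disjoint ones,
    so a smaller J means the two affinity distributions overlap less.
    """
    h1 = h1 / h1.sum()                      # normalize to unit mass
    h2 = h2 / h2.sum()
    return np.minimum(h1, h2).sum() / np.maximum(h1, h2).sum()
```

Under this reading, a shrinking J(H1, H2) across steps (1)–(3) would indicate that positive-pair and negative-pair affinities become easier to separate.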

SLIDE 9

Merging Short-term Tracklets

Short-term tracklets t1 ... t8 are merged into long-term tracklets by hierarchical clustering:

  • The refined appearance encoder embeds each short-term tracklet; pairwise cosine similarities give the distance matrix Wlong (if same split tracklet ID: set inf; elif temporally overlapping: set inf; else: set cosine distance).
  • Hierarchical clustering is applied to Wlong, and the dendrogram is cut at an automatically decided threshold: cutting threshold = 2 − θ_app, where θ_app is derived from the statistics of the intra-frame and intra-short-tracklet cosine affinities.
  • Each resulting cluster becomes one long-term tracklet; temporally overlapped short-term tracklets can never fall into the same cluster, since their distance is inf.

[Figure: raw object-instance segmentation → short-term tracker → short-term tracklets t1–t8 → appearance encoder → distance matrix Wlong (e.g., rows (inf, inf, 0.1, 0.4), (inf, inf, 0.5, 0.2), (0.1, 0.5, inf, inf), (0.4, 0.2, inf, inf)) → hierarchical clustering with the cutting threshold → long-term tracklets as clusters over t1–t8]
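The merging step might be sketched with SciPy's hierarchical clustering. Capping inf at a finite `BIG`, the `average` linkage, and the demo cutting threshold of 0.3 are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

BIG = 10.0  # stand-in for inf; far above any cosine distance

def merge_tracklets(w_long: np.ndarray, cutting_threshold: float):
    """Return one long-term cluster label per short-term tracklet."""
    w = np.minimum(w_long, BIG)              # make forbidden pairs finite
    np.fill_diagonal(w, 0.0)                 # self-distance must be zero
    condensed = squareform(w, checks=False)  # condensed distance vector
    z = linkage(condensed, method="average")
    return fcluster(z, t=cutting_threshold, criterion="distance")

# Distance matrix from the slide (inf replaced by BIG):
w = np.array([[0.0, BIG, 0.1, 0.4],
              [BIG, 0.0, 0.5, 0.2],
              [0.1, 0.5, 0.0, BIG],
              [0.4, 0.2, BIG, 0.0]])
labels = merge_tracklets(w, cutting_threshold=0.3)
```

With this matrix, tracklets 0 and 2 merge (distance 0.1) and tracklets 1 and 3 merge (distance 0.2), while the BIG entries keep the two groups apart below the threshold.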

SLIDE 10

Comparison with others on MOTSChallenge 1

Since our strategy can easily be adapted to other trackers, will other methods also perform better when our appearance encoder and tracklet merging are applied?

[Table: MOTSChallenge 1 results; other methods may benefit from our mask fusion and from our refinement]

SLIDE 11

Limitations of ReMOTS

  1. It is an offline approach. It is worth exploring how to bring it to an online setting.
  2. It is challenging for ReMOTS to handle objects with similar appearance: it works well for persons (who wear different clothes) but is less useful for vehicles (similar textures).
  3. Trajectory is not considered in our short-term tracker, so it may fail to associate fast-moving objects.

[Figure: a slowly moving person with distinctive clothes vs. a fast-moving car with similar appearance]

SLIDE 12

Conclusion

  • Unlabeled target videos can be used to learn better appearance features, but care must be taken not to introduce noise.
  • The suitable hyperparameters for data association may vary from case to case; the statistical information of tracklets can be used to adjust them.
  • It would be worthwhile to bring some insights of ReMOTS to online MOTS.

SLIDE 13

Thanks for listening!