Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation - PowerPoint PPT Presentation


SLIDE 1

Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation

Min-Hung Chen1∗ Baopu Li2 Yingze Bao2 Ghassan AlRegib1 Zsolt Kira1

1Georgia Institute of Technology 2Baidu USA

June 17, 2020

∗Work done during an internship at Baidu USA

ghassanalregib.info/

[Paper] https://arxiv.org/abs/2003.02824
[Project] https://minhungchen.netlify.app/project/cdas

SLIDE 2

Action segmentation = Action Recognition + Temporal Segmentation

[Figure: an input video ("make milk") goes through a segmentation model, which outputs per-frame action predictions over time: background, take cup, spoon powder, pour milk]
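
Action segmentation assigns one action label to every frame; contiguous runs of the same label then form the temporal segments. A minimal sketch of this output format (the toy labels are assumed for illustration, not taken from the paper):

```python
# Collapse per-frame action predictions into (label, start, end) segments.

def frames_to_segments(frame_labels):
    """Turn a per-frame label sequence into (label, start, end) segments,
    with `end` exclusive."""
    segments = []
    start = 0
    for t in range(1, len(frame_labels) + 1):
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            segments.append((frame_labels[start], start, t))
            start = t
    return segments

# A toy "make milk" video: 3 background frames, 4 "take cup", 5 "pour milk".
preds = ["background"] * 3 + ["take cup"] * 4 + ["pour milk"] * 5
print(frames_to_segments(preds))
# -> [('background', 0, 3), ('take cup', 3, 7), ('pour milk', 7, 12)]
```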

SLIDE 3

Action Segmentation

[Diagram: labeled source videos → fully-supervised learning → source model]

Standard action segmentation focuses on architecture design with fully-supervised learning.

SLIDE 4

Challenge

[Diagram: labeled source videos → fully-supervised learning → source model; unlabeled target videos → same source model]

Source vs. Target: different people perform the same action in different styles, so a source model trained with fully-supervised learning may not work on target videos.

SLIDE 5

Adapting Source Model

[Diagram: labeled source videos → fully-supervised learning → source model; unlabeled target videos → model adaptation → target model]

Goal: adapt the source model without additional labels, since annotating data is time consuming!

SLIDE 6

[Diagram: labeled source videos → fully-supervised learning → source model; unlabeled target videos → model adaptation → target model]

Goal: adapt the source model with unlabeled data.

Previous Works:
  • Adversarial-based domain adaptation - does not consider dependencies between adapted features
  • Self-supervised learning - does not consider cross-domain discrepancy

SLIDE 7

Temporal Domain Permutation

[Figure: source (domain 0) and target (domain 1) videos are segmented and shuffled; the model predicts the resulting domain permutation (loss L_gd) among: [0,0,1,1], [0,1,1,0], [1,1,0,0], [0,1,0,1], [1,0,1,0], [1,0,0,1]]

Our Method: predict temporal permutations of domains.

Previous Works:
  • Predict temporal orders
  • Predict binary domains

Intuition: DA for classification ➔ domain classification; DA for segmentation ➔ domain segmentation.
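
The segment-and-shuffle pretext task can be sketched as follows. The two-segments-per-domain split, the toy inputs, and the helper name `make_pretext_sample` are assumptions for illustration; the self-supervised label is the index of the resulting domain ordering among the C(4,2) = 6 permutations listed above.

```python
import itertools
import random

# All orderings of two source (0) and two target (1) segments: C(4,2) = 6.
DOMAIN_PERMS = sorted(set(itertools.permutations([0, 0, 1, 1])))

def make_pretext_sample(src_video, tgt_video, rng=random):
    """Split each video into 2 segments, shuffle all 4 segments, and return
    them with the self-supervised label: the index of their domain ordering."""
    hs, ht = len(src_video) // 2, len(tgt_video) // 2
    segments = [(src_video[:hs], 0), (src_video[hs:], 0),   # source -> domain 0
                (tgt_video[:ht], 1), (tgt_video[ht:], 1)]   # target -> domain 1
    rng.shuffle(segments)
    clips = [seg for seg, _ in segments]
    label = DOMAIN_PERMS.index(tuple(d for _, d in segments))
    return clips, label  # the model must recover `label` from `clips`

clips, label = make_pretext_sample(list("AABB"), list("XXYY"), random.Random(0))
```

Predicting which of the six orderings was sampled forces the features to encode where the domain changes along time, i.e. domain *segmentation* rather than per-clip domain classification.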

SLIDE 8

Self-Supervised Temporal Domain Adaptation (SSTDA)

[Figure: original source and target videos feed two adaptation branches:
  • Local SSTDA: frame-level features → ADC → binary domain prediction (loss L_ld)
  • Global SSTDA: segment & shuffle → domain-shuffled video-level features → ADC → sequential domain prediction (loss L_gd)]

ADC: Adversarial Domain Confusion [1]; L_ld: local domain loss; L_gd: global domain loss

Training: adversarial training with the prediction loss L_y, L_ld, and L_gd

[1] JMLR 16
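
The ADC blocks confuse a domain discriminator via adversarial training, typically implemented with a gradient reversal layer as in DANN [1]. A conceptual sketch with scalar gradients and hypothetical beta weights, not a real network:

```python
# Gradient reversal layer (GRL), the usual trick behind adversarial domain
# confusion: identity on the forward pass, gradient scaled by -beta on the
# backward pass, so minimising the domain loss w.r.t. the domain classifier
# simultaneously *maximises* it w.r.t. the shared feature extractor.

class GradReverse:
    def __init__(self, beta=1.0):
        self.beta = beta  # trade-off weight (hypothetical value)

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.beta * grad_output  # reversed, scaled gradient

def total_loss(l_y, l_ld, l_gd, beta_l=0.5, beta_g=0.5):
    """Overall objective: prediction loss minus weighted domain losses
    (the minus sign is what the GRL realises during backprop)."""
    return l_y - (beta_l * l_ld + beta_g * l_gd)

grl = GradReverse(beta=0.5)
```

In a deep-learning framework the GRL would be a custom autograd op inserted between the feature extractor and each domain classifier; this sketch only shows the sign flip.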

SLIDE 9

Our Approach: SSTDA

[Diagram: labeled source videos → fully-supervised learning → source model; unlabeled target videos (video variations) → SSTDA → target model → action predictions]

Self-Supervised Temporal Domain Adaptation (SSTDA): video-based domain adaptation with self-supervised learning to reduce variations in videos.

SLIDE 10

Experimental Results

Effectively exploit unlabeled target videos for action segmentation.

50Salads [1]:

  Method           F1@10  F1@25  F1@50  Edit
  Source-only [2]  75.4   73.4   65.2   68.9
  SSTDA (65%)      77.7   75.0   66.2   69.3
  Local SSTDA      79.2   77.8   70.3   72.0
  SSTDA            83.0   81.5   73.8   75.8

Source-only: results from directly running the official released code of MS-TCN [2].

[1] UbiComp 13, [2] CVPR 19
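
F1@k above is the segmental F1 score: a predicted segment counts as a true positive when its IoU with an unmatched ground-truth segment of the same class exceeds the threshold k. A simplified single-video sketch (the benchmark protocol averages over videos and cross-validation splits):

```python
# Segmental F1@k: greedy matching of predicted segments to ground truth.
# Segments are (label, start, end) tuples with `end` exclusive.

def f1_at_k(pred_segs, gt_segs, k=0.10):
    matched = [False] * len(gt_segs)
    tp = 0
    for p_label, p_s, p_e in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, g_s, g_e) in enumerate(gt_segs):
            if matched[j] or g_label != p_label:
                continue
            inter = max(0, min(p_e, g_e) - max(p_s, g_s))
            union = max(p_e, g_e) - min(p_s, g_s)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > k:
            tp += 1
            matched[best_j] = True
    fp = len(pred_segs) - tp
    fn = len(gt_segs) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Prediction overlaps ground truth ("cut", IoU = 8/12) -> counts at k = 0.5.
print(f1_at_k([("cut", 0, 10)], [("cut", 2, 12)], k=0.5))  # -> 1.0
```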

SLIDE 11

Comparison: Unlabeled Target Videos

  50Salads     F1@10  F1@25  F1@50  Edit
  Source-only  75.4   73.4   65.2   68.9
  VCOP [1]     75.8   73.8   65.9   68.4
  DANN [2]     79.2   77.8   70.3   72.0
  JAN [3]      80.9   79.4   72.4   73.5
  MADA [4]     79.6   77.4   70.0   72.4
  MSTN [5]     79.3   77.6   71.5   72.1
  MCD [6]      78.2   75.5   67.1   70.8
  SWD [7]      78.2   76.2   67.4   71.6
  SSTDA        83.0   81.5   73.8   75.8

[1] CVPR 19, [2] JMLR 16, [3] ICML 17, [4] AAAI 18, [5] ICML 18, [6] CVPR 18, [7] CVPR 19

Jointly adapting domains at multiple temporal scales can better address discrepancy problems for videos.
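
The Edit column in these tables is the segmental edit score: 100 times one minus the normalised Levenshtein distance between the ordered sequences of predicted and ground-truth segment labels, so over-segmentation is penalised even when frame accuracy is high. A sketch:

```python
# Segmental edit score via Levenshtein distance on segment-label sequences.

def edit_score(pred_labels, gt_labels):
    """100 * (1 - normalised edit distance) between label sequences."""
    m, n = len(pred_labels), len(gt_labels)
    if m == 0 and n == 0:
        return 100.0
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_labels[i - 1] == gt_labels[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + cost) # substitution
    return 100 * (1 - D[m][n] / max(m, n))
```

For example, missing one of two ground-truth segments costs one edit, halving the score.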

SLIDE 12

Visualization: 50Salads

[Figure: color-coded segmentation timelines for Ground Truth, MS-TCN [1], Local SSTDA, SSTDA, and Expectation, over the action classes: action start, add oil, add vinegar, add pepper, mix dressing, peel cucumber, place cucumber into bowl, cut cucumber, cut tomato, place tomato into bowl, cut cheese, place cheese into bowl, cut lettuce, place lettuce into bowl, mix ingredients, serve salad onto plate, add dressing, action end]

Only highlight the difference from the ground truth.

[1] CVPR 19

SLIDE 13

Summary

  • Goal: adapt action segmentation models using unlabeled videos
  • Approach: Self-Supervised Temporal Domain Adaptation (SSTDA)
  • Perform domain adaptation for multiple temporal scales
  • Learn feature representations with domain-invariant temporal dynamics
  • Outperform other self-supervised methods and image-based DA methods
  • Improve action segmentation by large margins using unlabeled target videos

[Paper] https://arxiv.org/abs/2003.02824
[Project] https://minhungchen.netlify.app/project/cdas
[Code] https://github.com/cmhungsteve/SSTDA

Poster: #93 @ Session 2.4, June 17 (Wed.), Q&A Time: 16 - 18 & 04 - 06