SLIDE 1


Yogesh S Rawat, Aayush Rana, Praveen Tirupattur, and Mubarak Shah

Center for Research in Computer Vision, University of Central Florida

SLIDE 2

Contents

  • Activity Detection in Untrimmed Videos
  • AD Task
  • Activity Object Detection in Untrimmed Videos
  • AOD Task
SLIDE 3

Contents

  • Activity Detection in Untrimmed Videos
  • AD Task
  • Activity Object Detection in Untrimmed Videos
  • AOD Task
SLIDE 4

Activity Detection (AD) in Untrimmed Videos


SLIDE 5

Action Analysis in Video

Given untrimmed videos

  • Containing multiple actors, multiple actions, and multiple action labels per actor
  • With actions of varying length
  • From an unbalanced dataset (few samples for some classes)

We want to

  • Localize all actions
  • Classify each action
SLIDE 6

Key Points

  • Bottom-up foreground/background segmentation
  • Detect action tubes from long untrimmed videos
  • Classify each instance individually
  • Activity tube generation


SLIDE 7

Overview of Architecture

[Pipeline diagram: untrimmed video → clips of shape (8, 448, 800, 3) → foreground/background segmentation network → tube extraction → tubes of shape (8, 112, 112, 3) → action classifier → stitched output (Tube 1, Tube 2, ..., Tube T linked into long tubes)]

  • Divide video into smaller clips
  • Send one clip at a time as input
  • Perform foreground segmentation
  • Find connected components
  • Classify each tube (resized to 112 x 112)
  • Stitch classified tubes
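
A minimal sketch of this clip-wise loop, assuming the four stages are available as callables; the slide names the stages but no API, so every function name below is illustrative:

```python
# Sketch of the clip-wise pipeline on this slide. segment_fn, extract_tubes_fn,
# classify_fn, and stitch_fn stand in for the segmentation network, tube
# extraction, action classifier, and tube stitching stages (names are ours).

def detect_activities(video, segment_fn, extract_tubes_fn, classify_fn,
                      stitch_fn, clip_len=8):
    """video: array of shape (T, 448, 800, 3)."""
    tubes = []
    for start in range(0, video.shape[0] - clip_len + 1, clip_len):
        clip = video[start:start + clip_len]        # (8, 448, 800, 3)
        mask = segment_fn(clip)                     # foreground/background mask
        for tube in extract_tubes_fn(clip, mask):   # (8, 112, 112, 3) each
            tubes.append((start, tube, classify_fn(tube)))
    return stitch_fn(tubes)                         # link into long tubes
```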


SLIDE 8

Foreground/Background Segmentation Network

[Network diagram: encoder blocks → decoder blocks, connected by skip connections → localization output]
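
A minimal PyTorch sketch of an encoder-decoder with a skip connection, in the spirit of this slide; the depth and channel widths are illustrative, not the authors' exact network:

```python
import torch
import torch.nn as nn

class FgBgSegNet(nn.Module):
    """Tiny 3D encoder-decoder with one skip connection (illustrative)."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool3d((1, 2, 2))              # downsample space only
        self.enc2 = nn.Sequential(nn.Conv3d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose3d(32, 16, (1, 2, 2), stride=(1, 2, 2))
        self.dec = nn.Sequential(nn.Conv3d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(16, 1, 1))    # 1-channel mask logits

    def forward(self, x):               # x: (B, 3, 8, 448, 800)
        s1 = self.enc1(x)               # kept for the skip connection
        e2 = self.enc2(self.pool(s1))
        d = self.up(e2)                 # back to the input resolution
        d = torch.cat([d, s1], dim=1)   # skip connection
        return self.dec(d)              # localization logits (B, 1, 8, 448, 800)
```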


SLIDE 9

Tube Extraction

[Diagram: input video (448 x 800 x 3) and localization mask (448 x 800 x 1) → multiply and segment foreground → connected components → output tubes (112 x 112 x 3)]
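
A sketch of this step using scipy's connected-components labeling; the nearest-neighbor resize is a simplification, and the names and details are ours, not from the slides:

```python
import numpy as np
from scipy import ndimage

def extract_tubes(clip, mask, out_size=112):
    """clip: (8, 448, 800, 3) video clip; mask: (448, 800) binary foreground."""
    fg = clip * mask[None, :, :, None]       # multiply out the background
    comp, n = ndimage.label(mask)            # connected components
    tubes = []
    for c in range(1, n + 1):
        ys, xs = np.where(comp == c)         # pixels of one component
        crop = fg[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        t, h, w, _ = crop.shape
        yi = np.arange(out_size) * h // out_size   # crude nearest-neighbor
        xi = np.arange(out_size) * w // out_size   # resize to 112 x 112
        tubes.append(crop[:, yi][:, :, xi])        # (8, 112, 112, 3)
    return tubes
```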

SLIDE 10

Action Classification

[Diagram: output tubes (8 x 112 x 112 x 3) → network (ResNet 3D) → extracted features → classification block → classification output; example scores: Transport HeavyCarry 0.69, Walking 0.81, Vehicle Moving 0.86, Standing 0.73, Talking 0.65, Interacts 0.77]
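
A sketch of the classification stage using torchvision's r3d_18 as a stand-in for the ResNet 3D on the slide; the sigmoid head (rather than softmax) is our assumption so that one tube can carry several labels, matching the multi-label scores shown, and the class count is illustrative:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_CLASSES = 19  # illustrative; the slide does not give the class count

# ResNet-3D backbone with a multi-label classification head.
model = r3d_18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

tube = torch.randn(1, 3, 8, 112, 112)   # (batch, channels, frames, H, W)
scores = torch.sigmoid(model(tube))     # per-class confidences, e.g. Walking 0.81
```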

SLIDE 11

Tube Stitching

[Diagram: per-clip tubes (Tube 1, Tube 2, ..., Tube T) stitched into a long tube; example linked labels: Vehicle Stopping, Vehicle Turning Left, Vehicle Turning Right]
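
The slides do not specify the linking rule; a common choice, sketched here as an assumption, is greedy matching of same-label tubes across consecutive clips by spatial IoU:

```python
def iou(a, b):
    """Intersection over union of boxes (y0, x0, y1, x1)."""
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-9)

def stitch(tubes_per_clip, thr=0.3):
    """tubes_per_clip: one list of (box, label) pairs per clip.
    Greedily extends a long tube when the next clip has a same-label
    tube with IoU >= thr (the threshold is an assumption)."""
    long_tubes, active = [], []
    for t, dets in enumerate(tubes_per_clip):
        nxt = []
        for box, label in dets:
            for tube in active:
                _, pbox, plabel = tube[-1]
                if plabel == label and iou(pbox, box) >= thr:
                    tube.append((t, box, label))
                    nxt.append(tube)
                    break
            else:                        # no match: start a new long tube
                tube = [(t, box, label)]
                long_tubes.append(tube)
                nxt.append(tube)
        active = nxt
    return long_tubes
```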

SLIDE 12

Final Output (Example-1)


SLIDE 13

Final Output (Example-2)


SLIDE 14

Final Output (Example-3)


SLIDE 15

NIST Evaluation on Validation Set

Metric name               Metric value
Mean-p_miss @ 0.01 rfa    0.9066
Mean-p_miss @ 0.03 rfa    0.8478
Mean-p_miss @ 0.1 rfa     0.6973
Mean-p_miss @ 0.15 rfa    0.6608
Mean-p_miss @ 0.2 rfa     0.6279
Mean-p_miss @ 1 rfa       0.4633
N-MIDE                    0.2045
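
As a reading aid (this gloss is ours, not from the slides): in the NIST ActEV evaluation, p_miss is the fraction of true activity instances missed and rfa is the time-based false-alarm rate,

\[
P_{\text{miss}}(\tau) = \frac{N_{\text{miss}}(\tau)}{N_{\text{true}}}, \qquad
\text{RFA}(\tau) = \frac{N_{\text{FA}}(\tau)}{\text{minutes of video}},
\]

so mean-p_miss @ x rfa is \(P_{\text{miss}}\) at the confidence threshold \(\tau\) where RFA equals x, averaged over activity classes (lower is better). N-MIDE is NIST's normalized multiple-instance detection error, a temporal-alignment measure (also lower is better).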


SLIDE 16

Issues

  • Imbalanced dataset: extremely few samples for some classes
  • Similar activities confused by the classifier
  • Activities far from the camera appear very small and are hard to localize


SLIDE 17

Contents

  • Activity Detection in Untrimmed Videos
  • AD Task
  • Activity Object Detection in Untrimmed Videos
  • AOD Task
SLIDE 18

Activity Detection based on Actor-Object Interaction


SLIDE 19

Actor Object Interaction in Videos

  • Given an untrimmed video, localize all actors present and all objects interacted with
  • Classify activities based on the actor-object interaction
SLIDE 20

Challenges

  • Multiple actor-object instances in a single clip (multiple actors and objects)
  • The same actor-object combination in multiple classes (e.g., opening door vs. closing door)
  • The same actor-object instance with multiple labels (e.g., exiting and closing door)
SLIDE 21

Approaches

  • Region proposals
      • Based on bounding-box proposals: T-CNN [1], Mask R-CNN [2]
  • Bottom-up approach
      • Regression over the full space
      • Encoder-decoder for unified semantic segmentation: ST-CNN [3], SegNet [4]
      • Issue with multiple activity instances: requires connected components and post-processing

[1] Hou et al., "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos," ICCV 2017.
[2] He et al., "Mask R-CNN," ICCV 2017, pp. 2980-2988.
[3] Hou et al., "An End-to-End 3D Convolutional Neural Network for Action Detection and Segmentation in Videos," arXiv:1712.01111, 2017.
[4] Badrinarayanan et al., "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation," arXiv:1511.00561, 2015.

SLIDE 22

Motivation

  • End-to-end training framework
      • Completely removes region proposals and ToI/RoI pooling
      • Uses actor-object attention instead
  • Multiple tasks: foreground/background segmentation, objects, actions
  • Model convergence using multiple losses (see the loss sketch below)
  • Joint actor-object action classification
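
One way to read the multiple-losses bullet (our gloss; the slides do not give a formula) is a weighted sum of the three task losses:

\[
\mathcal{L} = \lambda_{fg}\,\mathcal{L}_{fg/bg} + \lambda_{obj}\,\mathcal{L}_{obj} + \lambda_{act}\,\mathcal{L}_{act},
\]

where \(\mathcal{L}_{fg/bg}\) and \(\mathcal{L}_{obj}\) are pixel-wise segmentation losses, \(\mathcal{L}_{act}\) is the action-classification loss, and the \(\lambda\)'s are task weights.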


SLIDE 23

Action Classification in Videos


[Example 1: object: vehicle; action: Vehicle Turning Left]

[Example 2: objects: persons; actions: Activity Talking (red), Activity Carrying (green)]

SLIDE 24

Overview of Proposed Architecture

[Architecture diagram: input video → encoder (C3D or I3D backbone) → decoder with skip connections → foreground/background segmentation and object classification outputs]

  • Take an 8-frame video clip as input
  • Generate a foreground/background segmentation mask
  • Generate an object segmentation mask for each object type
  • Use the fg/bg segmentation as feature attention (see the sketch below)
  • Classify the action using the actor-object information
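
A minimal sketch of what using the fg/bg segmentation as feature attention could look like (our reading; the exact mechanism is not specified on the slide): the predicted foreground probability is resized to the feature resolution and multiplied into the encoder features before classification.

```python
import torch
import torch.nn.functional as F

def attend_features(features, fg_logits):
    """features:  (B, C, T, H', W') encoder features
    fg_logits: (B, 1, T, H, W) foreground/background logits."""
    attn = torch.sigmoid(fg_logits)                  # foreground probability
    attn = F.interpolate(attn, size=features.shape[2:],
                         mode="trilinear", align_corners=False)
    return features * attn                           # broadcasts over channels
```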
SLIDE 25

End-to-end network for Video Action Segmentation

[Network diagram: encoder block → decoder block with skip connections → object classification branch]

  • Encode video features (Conv 3D)
  • Decode features (Deconv 3D) with skip connection
  • Segment foreground/background
  • Segment each object class
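
A minimal sketch of the decoder's output heads as we read this slide: one channel of logits for foreground/background plus one channel per object class; the input channel width is an assumption.

```python
import torch.nn as nn

class SegHeads(nn.Module):
    """Foreground/background head plus per-object-class head (illustrative)."""
    def __init__(self, in_ch=16, num_objects=2):   # objects: person, vehicle
        super().__init__()
        self.fg_head = nn.Conv3d(in_ch, 1, kernel_size=1)
        self.obj_head = nn.Conv3d(in_ch, num_objects, kernel_size=1)

    def forward(self, decoded):                    # decoded: (B, in_ch, T, H, W)
        return self.fg_head(decoded), self.obj_head(decoded)
```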
SLIDE 26

Quantitative Results

  • DIVA data subset
      • Smaller clips focused on the activity (128 x 192 resolution)
      • 64 training videos, 55 validation videos
      • 19 action classes (DIVA 1B set)
      • 2 object classes (person and vehicle)
  • Action-object localization IoU: 0.64
  • Classification F1 score (19 classes): 0.46


SLIDE 27

Qualitative Results

[Example: input frame → foreground/background segmentation (only moving objects) → object segmentation (3 people) → action classification: Talking (red), Carrying (green)]


SLIDE 28

Qualitative Results

[Example: input frame → foreground/background segmentation (only moving objects) → object segmentation: vehicle (green), person (red) → action classification: Vehicle Turning Left]


SLIDE 29

NIST Evaluation on Validation Set (Activity Detection)

Metric                    Value
mean-p_miss @ 0.01 rfa    0.9543
mean-p_miss @ 0.03 rfa    0.9251
mean-p_miss @ 0.1 rfa     0.7845
mean-p_miss @ 0.15 rfa    0.7571
mean-p_miss @ 0.2 rfa     0.7400
mean-p_miss @ 1 rfa       0.6060

SLIDE 30

NIST Evaluation on Validation Set (Object Detection)

Metric                                Value
mean-mean-object-p_miss @ 0.033 rfa   0.7398
mean-mean-object-p_miss @ 0.1 rfa     0.6734
mean-mean-object-p_miss @ 0.2 rfa     0.6250
mean-mean-object-p_miss @ 0.5 rfa     0.5383

SLIDE 31

Thank you!