Anticipating Visual Representations from Unlabeled Data


SLIDE 1

Anticipating Visual Representations from Unlabeled Data

Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

SLIDE 2

Overview

  • Problem
  • Key Insight
  • Methods
  • Experiments
SLIDE 3

Problem: Predict future actions and objects

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf

SLIDE 4

Related Work

  • Unlabeled video prediction

○ Motion and trajectory prediction
○ Pixel-level prediction

  • Action prediction

○ Intention inference
○ Semantic context for action prediction

  • Path and motion prediction

○ Optical flow

SLIDE 5

Applications

  • Robotics

○ Path planning
○ Human-robot interaction
○ Obstacle avoidance

  • Surveillance

○ Warning systems

SLIDE 6

Overview

  • Problem
  • Key Insight
  • Methods
  • Experiments
SLIDE 7

Key Insight: Don’t predict images

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf

SLIDE 8

Key Insight: Predict Intermediate Representation

Predict AlexNet Representation

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf

SLIDE 9

Key Insight: Predict Intermediate Representation

Diagram labels: Predict AlexNet Representation; Classifier

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf

SLIDE 10

Overview

  • Problem
  • Key Insight
  • Methods
  • Experiments
SLIDE 11

Use Unlabeled Video as Training Data

  • The internet is full of unlabeled videos

○ Used 600 hours of popular TV shows on YouTube

  • Get supervision for free!

(Because they all go forward in time)

  • Can then use predicted representation for action or object detection
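
A minimal sketch of how forward-moving video supplies training pairs without labels: each frame is paired with the frame a fixed interval later, whose representation becomes the prediction target. The function name, the frame rate, and the index-based interface are assumptions for illustration, not details from the talk.

```python
def make_training_pairs(num_frames, fps, delta_seconds):
    """Return (t, t + offset) frame-index pairs for self-supervised training.

    The frame at index t is the network input; the frame `delta_seconds`
    later supplies the target representation. No human labels are needed.
    """
    offset = int(round(fps * delta_seconds))
    if offset <= 0:
        raise ValueError("delta_seconds must span at least one frame")
    # Frames too close to the end of the video have no future frame to pair with.
    return [(t, t + offset) for t in range(num_frames - offset)]


# Hypothetical 10-frame clip sampled at 2 frames per second, predicting 1 s ahead.
pairs = make_training_pairs(num_frames=10, fps=2, delta_seconds=1.0)
```

Every unlabeled clip yields pairs this way, which is why supervision comes "for free".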
SLIDE 12

Multiple Futures

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf

SLIDE 13

Multiple Futures

  • Train network to predict K representations for the future
  • Classify all K representations
  • Predict class with highest marginal probability

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf
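
The three steps on this slide can be sketched as follows, assuming the K predicted futures are weighted equally when marginalizing (the slides do not state the weighting). The function name and the example probabilities are illustrative only.

```python
def classify_multiple_futures(class_probs_per_future):
    """Pick the class with the highest probability marginalized over K futures.

    class_probs_per_future: list of K lists, each holding the classifier's
    per-class probabilities for one predicted future representation.
    Assumption: all K futures carry equal weight.
    """
    k = len(class_probs_per_future)
    num_classes = len(class_probs_per_future[0])
    # Average each class's probability across the K predicted futures.
    marginal = [
        sum(probs[c] for probs in class_probs_per_future) / k
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=lambda c: marginal[c])
    return best, marginal


# Toy classifier outputs for K = 3 futures over 4 classes (made-up numbers).
probs = [
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
]
best_class, marginal = classify_multiple_futures(probs)
```

In this toy case the first future favors class 0, but the marginal over all three futures favors class 1.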

SLIDE 14

Network Architecture

AlexNet fc7

  • AlexNet with additional fully connected layers
  • Loss: squared error between the predicted and the actual future representation; training seeks the weights that minimize it (the argmin).

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf
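
A small sketch of a squared-error loss extended to K candidate futures. It assumes that only the prediction closest to the observed future representation is penalized, which is one possible reading of the loss described on this slide; the helper names and toy vectors are hypothetical.

```python
def squared_error(predicted, target):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def multi_future_loss(predictions, target):
    """Return (loss, index) for the closest of K predicted representations.

    Assumption: the loss charges only the best-matching prediction, so the
    other K - 1 outputs are free to model alternative futures.
    """
    errors = [squared_error(p, target) for p in predictions]
    best_k = min(range(len(errors)), key=lambda k: errors[k])
    return errors[best_k], best_k


# Toy 2-D "representations": three predicted futures and one observed future.
predictions = [[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]]
target = [1.0, 0.0]
loss, best_k = multi_future_loss(predictions, target)
```

With K = 1 this reduces to the plain squared error between the single prediction and the target.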

SLIDE 15

Overview

  • Problem
  • Key Insight
  • Methods
  • Experiments
SLIDE 16

Action Forecasting Experiment

  • Dataset: TV Human Interactions

○ 300 separate videos
○ People do one of: {high fiving, hugging, shaking hands, kissing}

  • Goal: Predict activity 1 second in the future
  • Baselines:

○ SVM, Nearest Neighbor, Max Margin Event Detector, Linear Regression

SLIDE 17

Normal vs. Adapted Training: Normal Training

Diagram labels: Classifier; Ground Truth Future Representation

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf

SLIDE 18

Normal vs. Adapted Training: Adapted Training

Diagram labels: Classifier; Ground Truth Future Representation; Predicted Future Representation; CNN Future Predictor

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf
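
The difference between the two regimes can be sketched as a choice of which features the classifier is trained on. This assumes, as the diagram labels suggest, that normal training uses ground-truth future representations while adapted training uses the future predictor's outputs. Both callables below are hypothetical stand-ins for the CNN, and the clips are toy data.

```python
def build_classifier_training_set(clips, true_future_repr, predict_future_repr, adapted):
    """Assemble (feature, label) examples for the action classifier.

    clips: list of (current_frame, future_frame, label) tuples.
    adapted=False: features come from the actual future frame.
    adapted=True:  features come from the predictor applied to the current
                   frame, matching what the classifier sees at test time.
    """
    examples = []
    for current_frame, future_frame, label in clips:
        if adapted:
            feature = predict_future_repr(current_frame)
        else:
            feature = true_future_repr(future_frame)
        examples.append((feature, label))
    return examples


# Toy stand-ins: frames are numbers and representations are 1-D vectors.
true_repr = lambda frame: [float(frame)]
predicted_repr = lambda frame: [frame + 0.5]  # an imperfect guess at the future
clips = [(1, 2, "hug"), (5, 6, "kiss")]

normal_set = build_classifier_training_set(clips, true_repr, predicted_repr, adapted=False)
adapted_set = build_classifier_training_set(clips, true_repr, predicted_repr, adapted=True)
```

The labels are identical in both sets; only the features differ.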

SLIDE 19

Action Forecasting Results

  • Deep, adapted networks outperform all baselines
  • Much effort is still needed to approach human-level performance

SLIDE 20

Action Forecasting Results

SLIDE 21

Action Forecasting Results?

SLIDE 22

Object Forecasting Experiment

  • Dataset: Activities of Daily Living (ADL) dataset

○ Egocentric video
○ Segments featuring 1 of 14 objects

  • Goal: Predict object on screen 5 seconds in the future
  • Baselines:

○ SVM, Scene features, Linear Classifier

  • Normal & Adapted, as before
SLIDE 23

Object Forecasting Results

  • Performance indicates that this is a difficult task

○ Still, the approach outperformed all other methods tested.

SLIDE 24

Object Forecasting Results

Image from Vondrick et al., http://web.mit.edu/vondrick/prediction/paper.pdf

SLIDE 25

Future Work

  • More robust experimentation

○ Comparison to other action prediction systems
○ Improved performance on egocentric dataset
○ Datasets (e.g., THUMOS) where semantic roles are not implicit

  • Extension to real world problems

○ Robotics, surveillance, etc.

  • Video Generation
SLIDE 26

Conclusion

  • Problem

○ Predicting future actions or objects in video

  • Key Insight

○ Learn to predict intermediate representations from unlabeled data

  • Methods

○ AlexNet with additional FC layers

  • Experiments

○ Outperformed baselines on action forecasting; work remains to reach human performance
○ Object forecasting proved challenging, but the method still outperformed baselines

SLIDE 27

The End.