Activity Recognition
Computer Vision Fall 2018, Columbia University


SLIDE 1

Activity Recognition

Computer Vision Fall 2018 Columbia University

Many slides from Bolei Zhou

SLIDE 2

Project

  • How are they going? About 30 teams have requested GPU credit so far
  • Final presentations on December 5th and 10th
  • We will assign you to dates soon
  • Final report due December 10 at midnight
  • Details here: http://w4731.cs.columbia.edu/project
SLIDE 3

Challenge for Image Recognition

  • Variation in appearance.
SLIDE 4

Challenge for Activity Recognition

  • Describing activity at the proper level: image recognition? No motion needed? Skeleton recognition? Which activities?

SLIDE 5

Challenge for Activity Recognition

  • Describing activity at the proper level

A chain of events: making chocolate cookies

SLIDE 6

What are they doing?

SLIDE 7

What are they doing?

Barker and Wright, 1954

SLIDE 8

Vision or Cognition?

SLIDE 9

Video Recognition Datasets

  • KTH Dataset: recognition of human actions
  • 6 classes, 2391 videos

Recognizing Human Actions: A Local SVM Approach. ICPR 2004 https://www.youtube.com/watch?v=Jm69kbCC17s

SLIDE 10
Video Recognition Datasets

  • UCF101 from University of Central Florida
  • 101 classes, 9,511 videos in training

UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. 2012 https://www.youtube.com/watch?v=hGhuUaxocIE

SLIDE 11
Video Recognition Datasets

  • Kinetics from Google DeepMind
  • 400 classes, 239,956 videos in training

https://deepmind.com/research/open-source/open-source-datasets/kinetics/

SLIDE 12
Video Recognition Datasets

  • Charades dataset: Hollywood in Homes
  • Crowdsourced video dataset

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV’16 http://allenai.org/plato/charades/

SLIDE 13
Video Recognition Datasets

  • Charades dataset: Hollywood in Homes
  • Crowdsourced video dataset

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV’16

SLIDE 14
SLIDE 15

Video Recognition Datasets

  • Something-Something dataset: human-object interaction
  • 174 categories, 100,000 videos

▪ Holding something ▪ Turning something upside down ▪ Turning the camera left while filming something ▪ Opening something ▪ Poking a stack of something so the stack collapses ▪ Plugging something into something

https://www.twentybn.com/datasets/something-something

SLIDE 16

A video is a Width × Height × Time volume. How do we map it to activity labels?

SLIDE 17

Single-frame image model

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014
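The single-frame baseline can be sketched in a few lines: classify every frame independently with a 2D CNN and average the class scores over time. Everything below is a toy stand-in (the `cnn_scores` function fakes a trained network); only the per-frame-then-average scheme comes from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_scores(frame):
    # Stand-in for a trained per-frame CNN: maps one frame to 10 class
    # scores (hypothetical toy scorer, not the model from the paper).
    w = np.linspace(1.0, 2.0, 10)
    return w * frame.mean()

def single_frame_prediction(video):
    # Score every frame independently, then average the scores over time.
    scores = np.stack([cnn_scores(f) for f in video])  # (T, num_classes)
    return scores.mean(axis=0).argmax()

video = rng.random((16, 32, 32, 3))  # 16 frames of 32x32 RGB
label = single_frame_prediction(video)
```

Note that this model sees no motion at all, which is why the multi-frame fusion variants on the next slides exist.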

SLIDE 18

Multi-frame fusion model

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

41.1%

SLIDE 19

Multi-frame fusion model

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

41.1%

SLIDE 20

Multi-frame fusion model

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

41.1% 40.7%

SLIDE 21

Multi-frame fusion model

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

41.1% 40.7%

SLIDE 22

Multi-frame fusion model

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

41.1% 40.7% 38.9%

SLIDE 23

Multi-frame fusion model

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

41.1% (single frame) 40.7% (late fusion) 38.9% (early fusion) 41.9% (slow fusion)

SLIDE 24

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014
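Of the fusion variants compared above, early fusion is the simplest to sketch: stack the T frames along the channel axis so the first layer sees all of them at once. A minimal sketch using a 1×1 "convolution" as the filter bank (shapes and values are illustrative, not from the paper):

```python
import numpy as np

def early_fusion(video, filters):
    # Early fusion: stack T frames along the channel axis, then apply one
    # filter bank across the fused input (a 1x1 convolution here, i.e. a
    # linear map over the stacked channel dimension).
    T, H, W, C = video.shape
    stacked = video.transpose(1, 2, 0, 3).reshape(H, W, T * C)
    return stacked @ filters  # (H, W, num_filters)

video = np.ones((4, 8, 8, 3))         # 4 frames of 8x8 RGB
filters = np.ones((4 * 3, 5)) / 12.0  # (T*C, 5 output channels)
out = early_fusion(video, filters)    # (8, 8, 5)
```

Late fusion would instead run a 2D network per frame and merge the features near the classifier; slow fusion interpolates between the two.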

SLIDE 25

Sequence of frames?

Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015

SLIDE 26

Recurrent Neural Networks (RNNs)

Credit: Christopher Olah

SLIDE 27

Recurrent Neural Networks (RNNs)

Credit: Christopher Olah

A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor
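That unrolling can be written directly: the same cell (same weights) is applied at every time step, passing the hidden state forward. A minimal plain-RNN sketch with random placeholder weights:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    # One step of a plain RNN: mix the current input with the previous
    # hidden state through a tanh nonlinearity.
    return np.tanh(x @ Wx + h @ Wh + b)

def run_rnn(xs, Wx, Wh, b):
    # Unroll the same cell over the sequence -- the "multiple copies of
    # the same network" in the figure, each passing its state onward.
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = rnn_step(x, h, Wx, Wh, b)
    return h

rng = np.random.default_rng(0)
xs = rng.random((5, 3))  # 5 time steps of 3-dim inputs (placeholder data)
Wx, Wh, b = rng.random((3, 4)), rng.random((4, 4)), np.zeros(4)
h_final = run_rnn(xs, Wx, Wh, b)
```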

SLIDE 28

Recurrent Neural Networks (RNNs)

Credit: Christopher Olah

When the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information

SLIDE 29

Long-term dependencies - hard to model!

Credit: Christopher Olah

But there are also cases where we need more context.

SLIDE 30

From plain RNNs to LSTMs

(LSTM: Long Short-Term Memory networks)

Credit: Christopher Olah

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 31

From plain RNNs to LSTMs

Credit: Christopher Olah

(LSTM: Long Short Term Memory Networks)

SLIDE 32

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates

Credit: Christopher Olah

Cell State / Memory

LSTMs Step by Step: Memory

SLIDE 33

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.”

Credit: Christopher Olah

Should we continue to remember this “bit” of information or not?

LSTMs Step by Step: Forget Gate

SLIDE 34

LSTMs Step by Step: Input Gate

Credit: Christopher Olah

Should we update this “bit” of information or not? If so, with what?

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state.

SLIDE 35

LSTMs Step by Step: Memory Update

Forget that Memorize this

Decide what will be kept in the cell state/memory

Credit: Christopher Olah

SLIDE 36

LSTMs Step by Step: Output Gate

Credit: Christopher Olah

Should we output this “bit” of information?

This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
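The four stages above fit in one function: forget gate, input gate plus candidate values, cell-state update, output gate. A minimal NumPy sketch with all four gate weights packed into one matrix (random placeholder weights, not a trained cell):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    # One LSTM step, following the slides. W packs the four weight
    # matrices column-wise; b packs the four biases.
    n = h.shape[0]
    z = np.concatenate([x, h]) @ W + b
    f = sigmoid(z[0:n])            # forget gate: keep this bit of memory?
    i = sigmoid(z[n:2*n])          # input gate: update this bit?
    c_tilde = np.tanh(z[2*n:3*n])  # candidate values
    o = sigmoid(z[3*n:4*n])        # output gate: output this bit?
    c_new = f * c + i * c_tilde    # forget some memory, memorize some new
    h_new = o * np.tanh(c_new)     # filtered view of the cell state
    return h_new, c_new

rng = np.random.default_rng(0)
d, n = 3, 4                        # input and hidden sizes
W = rng.standard_normal((d + n, 4 * n))
b = np.zeros(4 * n)
h = np.zeros(n)
c = np.zeros(n)
for x in rng.standard_normal((6, d)):  # run the cell over 6 frames
    h, c = lstm_step(x, h, c, W, b)
```

The additive cell-state update (`f * c + i * c_tilde`) is what lets gradients flow over long gaps, unlike the plain RNN's repeated matrix multiplications.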

SLIDE 37

Complete LSTM - A pretty sophisticated cell

Credit: Christopher Olah

SLIDE 38

Show and Tell: A Neural Image Caption Generator

Show and Tell: A Neural Image Caption Generator, Vinyals et al., CVPR 2015

SLIDE 39

Multi-frame LSTM fusion model

Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015

Tumbling

SLIDE 40

Motivation: Separate visual pathways in nature

→ Ventral (‘what’) stream performs object recognition
→ Dorsal (‘where/how’) stream recognizes motion and locates objects
→ “Interconnection”, e.g. in the STS area

OPTICAL FLOW STIMULI

Sources: “Sensitivity of MST neurons to optic flow stimuli. I. A continuum of response selectivity to large-field stimuli.” Journal of Neurophysiology 65.6 (1991). “A cortical representation of the local visual environment”, Nature 392 (6676): 598–601, 1998. https://en.wikipedia.org/wiki/Two-streams_hypothesis

SLIDE 41

2-Stream Network

Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014
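At test time the two streams are fused late: the spatial (RGB) network and the temporal (optical-flow) network each produce class scores, and the scores are combined. A minimal sketch of that late fusion (the weighted average is one of the paper's fusion options; the numbers are illustrative):

```python
import numpy as np

def two_stream_predict(rgb_scores, flow_scores, w_flow=0.5):
    # Late fusion: weighted average of the spatial stream's class scores
    # and the temporal (optical-flow) stream's class scores.
    fused = (1 - w_flow) * rgb_scores + w_flow * flow_scores
    return fused.argmax()

rgb = np.array([0.1, 0.7, 0.2])   # spatial stream favors class 1
flow = np.array([0.1, 0.1, 0.8])  # temporal stream strongly favors class 2
pred = two_stream_predict(rgb, flow)
```

Here the motion evidence is strong enough to override the appearance evidence, which is exactly the case the temporal stream is there to handle.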

SLIDE 42

Temporal segment network

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV 2016
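The segment idea can be sketched directly: split the video into K segments, take one snippet score from each, and aggregate them with a consensus function (an average here; the paper samples snippets randomly during training, while this sketch takes the middle snippet for determinism):

```python
import numpy as np

def segment_consensus(snippet_scores, num_segments=3):
    # Temporal segment network in miniature: divide the video into
    # segments, sample one snippet per segment (the middle one here),
    # and average the snippet scores as the consensus.
    T = len(snippet_scores)
    bounds = np.linspace(0, T, num_segments + 1).astype(int)
    sampled = [snippet_scores[(a + b) // 2]
               for a, b in zip(bounds[:-1], bounds[1:])]
    return np.mean(sampled, axis=0)

scores = np.arange(12, dtype=float).reshape(12, 1)  # 12 snippet scores
consensus = segment_consensus(scores, num_segments=3)
```

Sparse segment sampling covers the whole video at a fraction of the cost of running a network on every frame.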

SLIDE 43

3D Convolutional Networks

Learning Spatiotemporal Features with 3D Convolutional Networks, ICCV 2015

2D convolutions vs. 3D convolutions
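The distinction can be made concrete: a 3D convolution slides its kernel over time as well as space, so the output keeps a temporal dimension instead of collapsing it the way a 2D convolution over stacked frames does. A naive "valid" 3D convolution (loops instead of an optimized kernel, single channel for brevity):

```python
import numpy as np

def conv3d_valid(video, kernel):
    # Naive "valid" 3D convolution (cross-correlation, as in deep nets):
    # the kernel slides over time as well as space, so the output keeps
    # a temporal dimension.
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

video = np.ones((8, 16, 16))                     # 8 grayscale frames
out3d = conv3d_valid(video, np.ones((3, 3, 3)))  # 3x3x3 kernel
```

Because the output is still a (time, height, width) volume, 3D convolutions can be stacked layer after layer, letting later layers see longer motions.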

SLIDE 44
3D Convolutional Networks

  • 3D filters at the first layer.

Learning Spatiotemporal Features with 3D Convolutional Networks, ICCV 2015

SLIDE 45

Temporal Relational Reasoning

  • Infer the temporal relation between frames.

Poking a stack of something so it collapses

SLIDE 46
Temporal Relational Reasoning

  • It is the temporal transformation/relation that defines the activity, rather than the appearance of objects.

Poking a stack of something so it collapses

SLIDE 47

Temporal Relations in Videos

Pretending to put something next to something: 2-frame, 3-frame, and 4-frame relations

SLIDE 48

Framework of Temporal Relation Networks
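The framework can be sketched for the 2-frame case: a small relation function g is applied to ordered pairs of sampled frame features, and the results are summed. Here g is a single linear layer with random placeholder weights (the paper uses MLPs and also 3-frame and higher-order relations):

```python
import numpy as np

def two_frame_relation(fa, fb, W):
    # g in the relation network, applied to one pair of frame features.
    # A single linear layer here; the paper uses a small MLP.
    return np.concatenate([fa, fb]) @ W

def trn_two_frame(features, W, rng, num_pairs=4):
    # Sum the relation function over sampled ordered frame pairs;
    # keeping i < j preserves temporal order, which the task depends on.
    T = len(features)
    total = np.zeros(W.shape[1])
    for _ in range(num_pairs):
        i, j = sorted(rng.choice(T, size=2, replace=False))
        total += two_frame_relation(features[i], features[j], W)
    return total

rng = np.random.default_rng(0)
features = rng.random((8, 5))  # 8 sampled frames, 5-dim features each
W = rng.random((10, 3))        # relation head mapping pairs to 3 classes
scores = trn_two_frame(features, W, rng)
```

Because g sees ordered pairs, reversing the frames changes the prediction, which is the point of the "importance of temporal orders" results below.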

SLIDE 49

Something-Something Dataset

  • 100K videos from 174 human-object interaction classes.

Pulling two ends of something so that it gets stretched Plugging something into something Moving something away from something

SLIDE 50

Jester Dataset

  • 140K videos from 27 gesture classes.

Drumming fingers Thumb down Zooming in with two fingers

SLIDE 51

Experimental Results

  • On Something-Something dataset
SLIDE 52

Experimental Results

  • On Jester dataset
SLIDE 53

Importance of temporal orders

SLIDE 54

How well are they diving?

Olympic judge’s score

Pirsiavash, Vondrick, Torralba. Assessing Quality of Actions, ECCV 2014

SLIDE 55

How well are they diving?

  • 1. Track and compute human pose

SLIDE 56

How well are they diving?

  • 1. Track and compute human pose

  • 2. Extract temporal features
  • take FT and histogram?
  • use deep network?
SLIDE 57

How well are they diving?

  • 1. Track and compute human pose

  • 2. Extract temporal features
  • take FT and histogram?
  • use deep network?
  • 3. Train regression model to predict expert quality score
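Step 3 can be sketched as ordinary least-squares regression from temporal pose features to the judge's score. The features and the linear model here are toy stand-ins for the paper's pipeline, with made-up training data that follows a known linear relation:

```python
import numpy as np

def fit_quality_regressor(pose_features, judge_scores):
    # Step 3: least-squares regression from temporal pose features to
    # the expert's score (a toy linear model with a bias column).
    X = np.hstack([pose_features, np.ones((len(pose_features), 1))])
    w, *_ = np.linalg.lstsq(X, judge_scores, rcond=None)
    return w

def predict_quality(w, feat):
    return np.append(feat, 1.0) @ w

# Hypothetical training data with a known linear relation to the score.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([3.0, 2.0]) + 1.0   # score = 3*f1 + 2*f2 + 1
w = fit_quality_regressor(X, y)
pred = predict_quality(w, np.array([1.0, 2.0]))
```

Any regressor could replace the linear model; the interesting part is that the features summarize pose over time rather than a single frame.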

SLIDE 58

Assessing diving

SLIDE 59

Feedback

SLIDE 60

Summarizing

SLIDE 61

Assessing figure skating