Learning to Detect Activity in Untrimmed Video (Prof. Bernard Ghanem)


SLIDE 1

Learning to Detect Activity in Untrimmed Video

  • Prof. Bernard Ghanem

SLIDE 2

Bernard Ghanem

An image is worth a thousand words
A video is worth a million words

Image: “a tiger attacking a person on a grass field”
Video: “the tiger is being playful”

Source: YouTube

SLIDE 3

Fun facts about video

  • By 2017, online video will account for 74% of all online traffic [3]
  • 45% of people watch more than an hour of Facebook or YouTube videos a week [2]
  • Almost 50% of internet users look for videos related to a product or service before visiting a store [4]
  • 85% of Facebook video is watched without sound [5]
  • 55% of people watch videos online every day [1]

Source: 1) MWP Statistics, 2015; 2) HubSpot, 2016; 3) KPCB, 2016; 4) Google, 2016; 5) DIGIDAY, 2016

SLIDE 4

Problem: Detecting Human Activities in Video

[Figure: Input video]

SLIDE 5

Problem: Detecting Human Activities in Video

[Figure: Input video → Output]

Output:
  • Class: Pole Vault
  • Bounds: (23.1s, 25.2s)
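
The output on this slide, a class label plus temporal bounds, can be sketched as a small record. The `Detection` type and the `temporal_iou` helper are illustrative names that do not come from the talk. Temporal intersection-over-union is included as one common way to compare a predicted segment with a ground-truth one, and the ground-truth segment in the usage lines is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected activity: a class label and its temporal bounds in seconds."""
    label: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def temporal_iou(a: Detection, b: Detection) -> float:
    """Intersection-over-union of two segments on the time axis."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = a.duration + b.duration - intersection
    return intersection / union if union > 0 else 0.0


# The detection shown on the slide.
predicted = Detection("Pole Vault", 23.1, 25.2)
# Hypothetical ground-truth segment, for illustration only.
ground_truth = Detection("Pole Vault", 23.0, 25.0)
overlap = temporal_iou(predicted, ground_truth)
```

A detection is often counted as correct when its label matches and the overlap exceeds a chosen threshold; the threshold itself is a design choice that the slide does not specify.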

SLIDE 6

Why Activity Detection?

SLIDE 7

SLIDE 8

Challenges of Detecting Human Activities

  • 1. Not enough large-scale training data
  • 2. Large number of activities
  • 3. Real-time processing is not enough

[Figure: Input video → Output]

SLIDE 9

  • 1. Not enough large-scale training data

1st Version (R1.1):

  • ~200 classes
  • ~850 hours
  • class hierarchy

ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding [CVPR 2015]

SLIDE 10

  • 1. Not enough large-scale training data

At CVPR 2017 (July 26 – afternoon) http://activity-net.org/challenges/2017

Sponsored by:

ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding [CVPR 2015]

SLIDE 11

Classical Activity Detection Pipeline

[Figure: input video → Basketball Dunk Classifier, Volleyball Spiking Classifier, ...]

SLIDE 13

Using proposals is important

[Figure: input video → Action Proposal → Basketball Dunk Classifier, Volleyball Spiking Classifier]
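
A rough way to see why proposals matter is to count classifier evaluations. The sketch below assumes that the classical pipeline scores every sliding window with every class-specific classifier, which the slides do not state explicitly. The video length, window lengths, stride, and proposal count are made-up numbers; the 200 classes echo the approximate size of ActivityNet quoted earlier.

```python
from typing import Iterable


def count_sliding_windows(num_frames: int, window_lengths: Iterable[int], stride: int) -> int:
    """Number of windows when each window length slides over the video with a fixed stride."""
    total = 0
    for length in window_lengths:
        if length <= num_frames:
            total += (num_frames - length) // stride + 1
    return total


def classifier_evaluations(num_segments: int, num_classes: int) -> int:
    """Every segment is scored by every class-specific classifier."""
    return num_segments * num_classes


num_frames = 9000  # hypothetical: 5 minutes of video at 30 FPS
windows = count_sliding_windows(num_frames, window_lengths=[30, 60, 120, 240], stride=15)

exhaustive = classifier_evaluations(windows, num_classes=200)
with_proposals = classifier_evaluations(100, num_classes=200)  # hypothetical: 100 proposals
```

Under these assumptions the proposal stage cuts the number of classifier calls by more than an order of magnitude, although the cost still grows linearly with the number of classes.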

SLIDE 14

What have we done?

  • Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos [CVPR 2016]: proposals are represented as sparse combinations of STIPs (10 FPS on a single CPU core)
  • DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]: multi-scale (sparse) proposals are output by an LSTM in one pass (130 FPS on a single GPU)
  • SST: Single-Stream Temporal Action Proposals [CVPR 2017]: multi-scale (dense) proposals are scored by a GRU in one pass + streaming (300 FPS on a single GPU)

SLIDE 15

SST: Single Stream Temporal Action Proposals

[Figure: Untrimmed Input Video → SST → Temporal Action Proposals → classifier → Localized Action Detections]

[Figure: SST architecture. Labels: Input video, Visual Encoder, Seq. Encoder, Output Proposals; output c_t (time step t); k proposals (per output); maximum proposal size k · δ; ϕ repeated along the Time axis]
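
The figure labels suggest that each time step t emits a confidence vector c_t over k proposals, with a maximum proposal size of k · δ. The sketch below decodes such confidences into segments. It assumes that the k proposals at step t all end at step t and span δ, 2δ, ..., kδ; that layout, the function name `decode_proposals`, the threshold, and the sample scores are assumptions made for illustration and are not taken from the slides.

```python
from typing import List, Tuple

Proposal = Tuple[float, float, float]  # (start, end, score)


def decode_proposals(conf: List[List[float]], delta: float, threshold: float = 0.5) -> List[Proposal]:
    """Turn per-step confidences into temporal proposals.

    conf[t][j] scores the proposal that ends at step t and spans (j + 1) * delta.
    """
    proposals: List[Proposal] = []
    for t, c_t in enumerate(conf):
        end = (t + 1) * delta
        for j, score in enumerate(c_t):
            start = end - (j + 1) * delta
            # Skip proposals that would start before the video or score too low.
            if start < 0 or score < threshold:
                continue
            proposals.append((start, end, score))
    return proposals


# Hypothetical confidences for 3 time steps and k = 3 proposal sizes.
conf = [
    [0.1, 0.0, 0.0],
    [0.2, 0.9, 0.0],
    [0.7, 0.3, 0.8],
]
proposals = decode_proposals(conf, delta=16, threshold=0.5)
```

Because every step only looks at scores produced at that step, this decoding can run in a single streaming pass over the video.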

SLIDE 16

SS-TAD: Single Stream Temporal Action Detection

  • End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos [BMVC 2017]: multi-scale (dense) detections are scored in one pass + streaming (700 FPS on a TitanX GPU)

[Figure: panels (a), (b), (c), each from Untrimmed Video Input to Action Detections. Labels: Proposals, Classifiers, Frame-level Classifiers, Merging/Smoothing, SS-TAD]

SLIDE 17

SS-TAD: Single Stream Temporal Action Detection

[Video demo. Key: Detection, Ground-truth. Axis: Time. Actions are played at 1x speed; background video is sped up.]

SLIDE 18

  • 2. Large number of activities
  • Applying activity detectors for a large number of activity classes is expensive.
  • Can we do better than linear computational growth with the number of activity classes?

SLIDE 19

Activity-Object and Activity-Scene Relations

SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]
DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]

SLIDE 20

Typical Activity Detection Pipeline

[Figure: Video Sequence → Action Proposals (Stage 1) → Action Proposals → Action Classifiers (Stage 2); branch: Reject]

SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]
DAPs: Deep Action Proposals for Action Understanding [ECCV 2016]
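
The two-stage pipeline on this slide, together with the activity-object relations mentioned on the previous one, suggests a cascade in which cheap context cues prune the classifiers that are run on each proposal. The sketch below is a toy illustration of that idea and not the SCC method itself: the `RELATED_OBJECTS` table, the object sets attached to each proposal, and the pruning rule are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical activity-object relations.
RELATED_OBJECTS: Dict[str, Set[str]] = {
    "basketball dunk": {"ball", "hoop"},
    "volleyball spiking": {"ball", "net"},
    "washing dishes": {"sink", "plate"},
}


def candidate_classes(objects_in_proposal: Set[str]) -> List[str]:
    """Keep only the activities that share at least one object with the proposal."""
    return sorted(
        activity
        for activity, objects in RELATED_OBJECTS.items()
        if objects & objects_in_proposal
    )


def cascade(proposals: List[Set[str]]) -> List[List[str]]:
    """List the classifiers to run per proposal; an empty list means Reject."""
    return [candidate_classes(objects) for objects in proposals]


# Each proposal is summarized by the objects assumed to be detected inside it.
proposals = [{"ball", "hoop"}, {"tree"}, {"plate"}]
plan = cascade(proposals)
evaluations = sum(len(classes) for classes in plan)
exhaustive = len(proposals) * len(RELATED_OBJECTS)
```

In this toy setting only 3 of the 9 possible proposal-classifier pairs are evaluated, which hints at how context can break the linear growth in the number of classes.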

SLIDE 21

SCC: Semantic Context Cascade

SCC: Semantic Context Cascade for Efficient Action Detection [CVPR 2017]

SLIDE 24

  • 3. Real-time processing is not enough
  • In the past, real-time processing was a “good-to-have”, i.e. 1 min video → 1 min processing
  • But, not anymore!
  • We need to stay ahead of the increasing video upload rate. How?
      • hardware acceleration (GPUs)
      • more efficient implementation
      • do we need to visit every frame?

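
The "1 min video → 1 min processing" rule can be turned into a small calculation using the throughput figures quoted on the earlier slides (10, 130, 300 and 700 FPS). The sketch below assumes a 30 FPS source video, which the slides do not state; the function names are illustrative.

```python
def processing_seconds(video_seconds: float, video_fps: float, throughput_fps: float) -> float:
    """Time needed to process a video, given how many frames per second a method handles."""
    return video_seconds * video_fps / throughput_fps


def realtime_factor(video_fps: float, throughput_fps: float) -> float:
    """How many times faster than playback a method runs (1.0 means real time)."""
    return throughput_fps / video_fps


VIDEO_FPS = 30.0  # assumption: frame rate of the source video

# Throughputs quoted on the slides, in frames per second.
throughputs = {
    "Fast Temporal Activity Proposals": 10.0,
    "DAPs": 130.0,
    "SST": 300.0,
    "SS-TAD": 700.0,
}

# Seconds needed to process one minute of video with each method.
seconds_per_minute = {
    name: processing_seconds(60.0, VIDEO_FPS, fps) for name, fps in throughputs.items()
}
```

Under the 30 FPS assumption, a method running at 10 FPS is slower than real time, while the GPU-based methods process a minute of video in a few seconds.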
SLIDE 25

Do we have to visit every frame?

[Figure: Search History along time t]

  • Log how a human annotator moves the time slider instead of throwing it away
  • Can we learn from how humans move the slider to localize activities?

Action Search: Learning to Search for Human Activities in Untrimmed Videos [arXiv 2017][To be submitted to CVPR2018]

SLIDE 26

Action Search: Learning to Search for Human Activities in Untrimmed Videos [arXiv 2017][To be submitted to CVPR2018]

SLIDE 27

[Figure: Action Search model unrolled over steps 𝑗−3, …, 𝑗+1 along time 𝑢. Blocks: 3D ConvNet, LSTM. Input: Target Activity]

Legend:
  • 𝒀: Visual Observation
  • 𝒘: Feature Vector
  • 𝒊: LSTM State
  • 𝑔(𝒀): Temporal Location

Action Search: Learning to Search for Human Activities in Untrimmed Videos [arXiv 2017][To be submitted to CVPR2018]
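
The figure describes a loop: observe the video at the current temporal location, turn the observation into a feature vector, update a recurrent state, and predict the next temporal location. The sketch below mirrors only that control flow. The learned 3D ConvNet and LSTM are replaced by hand-written toy stand-ins (`toy_observe`, `toy_step`), and the video duration, target segment, and stopping rule are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Observe = Callable[[float], int]
Step = Callable[[int, Optional[float], float], Tuple[float, float]]


def action_search(observe: Observe, step: Step, start: float, max_steps: int = 20) -> List[float]:
    """Return the temporal locations visited until the search stops moving."""
    visited: List[float] = []
    location: float = start
    state: Optional[float] = None
    for _ in range(max_steps):
        feature = observe(location)  # stand-in for the visual feature vector
        visited.append(location)
        next_location, state = step(feature, state, location)
        if next_location == location:  # the search has settled
            break
        location = next_location
    return visited


DURATION = 1000.0        # hypothetical video length in seconds
TARGET = (620.0, 640.0)  # hypothetical location of the target activity


def toy_observe(location: float) -> int:
    """Toy cue: +1 if the activity lies later, -1 if earlier, 0 if inside it."""
    if location < TARGET[0]:
        return 1
    if location > TARGET[1]:
        return -1
    return 0


def toy_step(feature: int, state: Optional[float], location: float) -> Tuple[float, float]:
    """Toy policy: halve the jump size at every step; the state stores the last jump."""
    jump = DURATION / 4 if state is None else state / 2
    return location + feature * jump, jump


visited = action_search(toy_observe, toy_step, start=DURATION / 2)
```

In this toy run the search inspects three locations instead of scanning the whole video, which is the kind of saving the "do we need to visit every frame?" question points at.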

SLIDE 28

Action Search or Action Spotting

[Figure panels: Activity: “shot put”; Activity: “shot put”; Activity: “basketball dunk”]

Action Search: Learning to Search for Human Activities in Untrimmed Videos [arXiv 2017][To be submitted to CVPR2018]

SLIDE 29

SPONSORS

SLIDE 30

  • Prof. Bernard Ghanem

bernard.ghanem@kaust.edu.sa
ivul.kaust.edu.sa

[Figure: example activities: baseball throw, dunk, shoveling, washing dishes, pole vault, dancing]