Video: Tracking and Action Recognition
EECS 442 – David Fouhey Fall 2019, University of Michigan
http://web.eecs.umich.edu/~fouhey/teaching/EECS442_F19/
Today: Tracking Objects
Goal: Locating a moving object/part across video frames
Slide credit: D. Hoiem
Video credit: B. Babenko
Slide credit: B. Babenko
Slide credit: D. Hoiem
Tracking by detection:
Slide credit: D. Hoiem
Based on motion, predict the object's location, assuming smoothness of motion
Slide credit: D. Hoiem
Slide credit: D. Hoiem
State X: actual state of the object that we want to estimate. Could be: pose, viewpoint, velocity, acceleration. Observation Y: our "measurement" of state X. Can be noisy. At each time step t, the state changes to Xt and we get a measurement Yt.
Slide credit: D. Hoiem
Prediction: What's the next state of the object given past measurements?
P(Xt | Y0 = y0, …, Yt-1 = yt-1)
Correction: Compute an updated estimate of the state from the prediction and the current measurement
P(Xt | Y0 = y0, …, Yt-1 = yt-1, Yt = yt)
Slide credit: D. Hoiem
Only the immediate past matters (Markovian):
P(Xt | X0, …, Xt-1) = P(Xt | Xt-1)
The measurement depends only on the current state (Independence):
P(Yt | X0, Y0, …, Xt-1, Yt-1, Xt) = P(Yt | Xt)
Graphical model: X0 → X1 → … → Xt-1 → Xt, with each Yt generated from Xt
Slide credit: D. Hoiem
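These two assumptions can be sketched with a toy linear-Gaussian state-space model. The constant-velocity dynamics, noise levels, and dimensions below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D constant-velocity model, for illustration only.
# State x_t = [position, velocity]; observation y_t = noisy position.
A = np.array([[1.0, 1.0],   # position += velocity
              [0.0, 1.0]])  # velocity unchanged
Q = 0.01 * np.eye(2)        # transition noise for P(x_t | x_{t-1})
R = 0.25                    # observation noise for P(y_t | x_t)

def simulate(T=50):
    """Sample a trajectory from the Markov chain and its observations."""
    x = np.zeros(2)
    xs, ys = [], []
    for _ in range(T):
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)  # depends only on x_{t-1}
        y = x[0] + rng.normal(0.0, np.sqrt(R))               # depends only on x_t
        xs.append(x.copy())
        ys.append(y)
    return np.array(xs), np.array(ys)

states, observations = simulate()
```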
Have models for:
(1) Transition: P(Xt | Xt-1), the next state given the current state
(2) Observation: P(Yt | Xt), the observation given the state
Want to recover, for each timestep t: P(Xt | y0, …, yt)
Slide credit: D. Hoiem
measurement: Y0=y0
Slide credit: D. Hoiem
Prediction: Given P(Xt-1|y0,…,yt-1), want P(Xt|y0,…,yt-1)

P(Xt|y0,…,yt-1)
= ∫ P(Xt, Xt-1|y0,…,yt-1) dXt-1                          [Total probability]
= ∫ P(Xt|Xt-1, y0,…,yt-1) P(Xt-1|y0,…,yt-1) dXt-1        [Condition on Xt-1]
= ∫ P(Xt|Xt-1) P(Xt-1|y0,…,yt-1) dXt-1                   [Markovian]

Here P(Xt|Xt-1) is the dynamics model and P(Xt-1|y0,…,yt-1) is the corrected estimate from the previous step.
Slide credit: D. Hoiem
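The prediction integral can be approximated numerically on a 1-D grid. The Gaussian transition model, grid bounds, and prior below are illustrative assumptions:

```python
import numpy as np

# Grid approximation of the prediction step:
#   P(x_t | y_0..y_{t-1}) = ∫ P(x_t | x_{t-1}) P(x_{t-1} | y_0..y_{t-1}) dx_{t-1}
grid = np.linspace(-5, 5, 201)
dx = grid[1] - grid[0]

def gaussian(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def predict(belief, transition_var=0.5):
    # Assumed transition model: P(x_t | x_{t-1}) Gaussian around x_{t-1}.
    trans = gaussian(grid[:, None], grid[None, :], transition_var)  # rows x_t, cols x_{t-1}
    predicted = trans @ belief * dx              # numerical total-probability integral
    return predicted / (predicted.sum() * dx)    # renormalize against grid truncation

prior = gaussian(grid, 0.0, 1.0)  # corrected estimate from the previous step
pred = predict(prior)             # broader than the prior: uncertainty grows
```

Note that prediction only ever widens the belief; the transition noise is added on top of the previous step's uncertainty.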
Correction: Given P(Xt|y0,…,yt-1), want P(Xt|y0,…,yt-1,yt)

P(Xt|y0,…,yt)
= P(yt|Xt, y0,…,yt-1) P(Xt|y0,…,yt-1) / P(yt|y0,…,yt-1)         [Bayes rule]
= P(yt|Xt) P(Xt|y0,…,yt-1) / P(yt|y0,…,yt-1)                    [Independence assumption]
= P(yt|Xt) P(Xt|y0,…,yt-1) / ∫ P(yt|Xt) P(Xt|y0,…,yt-1) dXt     [Condition on Xt]

Here P(yt|Xt) is the observation model.
Slide credit: D. Hoiem
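The correction step is a pointwise product plus renormalization, which is easy to see on the same kind of 1-D grid. The Gaussian observation model and the specific measurement value are assumptions for illustration:

```python
import numpy as np

# Grid approximation of the correction step:
#   P(x_t | y_0..y_t) ∝ P(y_t | x_t) P(x_t | y_0..y_{t-1})
grid = np.linspace(-5, 5, 201)
dx = grid[1] - grid[0]

def gaussian(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def correct(predicted, y, obs_var=0.25):
    likelihood = gaussian(grid, y, obs_var)    # observation model P(y_t | x_t)
    posterior = likelihood * predicted         # Bayes-rule numerator
    return posterior / (posterior.sum() * dx)  # denominator is just the normalizer

predicted = gaussian(grid, 0.0, 1.0)  # predicted estimate
posterior = correct(predicted, y=1.0) # pulled toward the measurement, and sharper
```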
Correction (observation model × predicted estimate / normalization factor):
P(Xt|y0,…,yt) = P(yt|Xt) P(Xt|y0,…,yt-1) / ∫ P(yt|Xt) P(Xt|y0,…,yt-1) dXt

Prediction (dynamics model × corrected estimate from the previous step):
P(Xt|y0,…,yt-1) = ∫ P(Xt|Xt-1) P(Xt-1|y0,…,yt-1) dXt-1
Prediction: P(state given past). Correction: P(state given past + present).
Nasty integrals! Also, these are full probability distributions, as are the transition and observation models.
What if everything is a multivariate Gaussian?
Kalman filter: assume everything's Gaussian
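Under the Gaussian assumption, each belief collapses to a mean and a variance and the integrals become closed-form. A minimal 1-D sketch of the resulting predict/correct cycle (random-walk dynamics and the noise values are assumed for illustration):

```python
import numpy as np

# Minimal 1-D Kalman filter sketch: random-walk dynamics, so each belief
# is just a (mean, variance) pair instead of a full distribution.
def kalman_step(mean, var, y, q=0.1, r=0.5):
    """One predict+correct cycle; q = transition noise, r = observation noise."""
    # Predict: dynamics add uncertainty.
    mean_pred, var_pred = mean, var + q
    # Correct: the Kalman gain blends prediction and observation.
    k = var_pred / (var_pred + r)
    mean_new = mean_pred + k * (y - mean_pred)
    var_new = (1 - k) * var_pred
    return mean_new, var_new

mean, var = 0.0, 1.0
for y in [0.9, 1.1, 1.0, 0.95]:
    mean, var = kalman_step(mean, var, y)
# Consistent measurements near 1.0 pull the mean up and shrink the variance.
```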
“The Apollo computer used 2k of magnetic core RAM and 36k wire rope [...]. The CPU was built from ICs [...]. Clock speed was under 100 kHz”
Photo, Quote credit: Wikipedia
Correction Observation Ground Truth
Slide credit: D. Hoiem
Observation Prediction Correction Ground Truth
Slide credit: D. Hoiem
Loop: current state → expected change + uncertainty → observation and correction
Decent model if there is just one object, but localization is imprecise
Slide credit: D. Hoiem
Particle filtering: represent the state distribution non-parametrically, as a set of weighted samples. Propagate samples from the previous state through the dynamics, then reweight them by P(yt|Xt).
Isard and Blake, CONDENSATION – conditional density propagation for visual tracking, IJCV 29(1):5-28, 1998
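A minimal sketch of this sample-propagate-reweight loop. The 1-D model, noise levels, and multinomial resampling choice are illustrative assumptions, not the CONDENSATION implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy particle filter: represent P(x_t | y_0..y_t) by samples, not a Gaussian.
def particle_filter_step(particles, y, trans_std=0.3, obs_std=0.5):
    # 1. Propagate each sample through the dynamics model P(x_t | x_{t-1}).
    particles = particles + rng.normal(0.0, trans_std, size=particles.shape)
    # 2. Weight samples by the observation likelihood P(y_t | x_t).
    weights = np.exp(-0.5 * ((y - particles) / obs_std) ** 2)
    weights /= weights.sum()
    # 3. Resample in proportion to the weights (avoids weight degeneracy).
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

particles = rng.normal(0.0, 1.0, size=500)  # samples of the initial belief
for y in [0.5, 0.8, 1.0]:
    particles = particle_filter_step(particles, y)
```

Because the belief is a sample set, it can stay multi-modal when the observations are ambiguous, which is exactly the case the Gaussian Kalman filter cannot represent.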
Loop: current state → expected change + uncertainty → observation and correction
Good if there are multiple, confusable hypotheses
Slide credit: D. Hoiem
If everything is Gaussian, everything is nice; otherwise the math becomes intractable
K different targets
Nam and Han, CVPR 2016, Learning Multi-Domain Convolutional Neural Networks For Visual Tracking
Slide credit: D. Hoiem
has many advantages
at a time
(e.g. Monet)
Slide credit: A. Efros
Video is a space-time volume: each pixel location corresponds to a single ray in space, and its intensity (0–255) traces out a function of time t.
Slide credit: A. Efros
Slide credit: A. Efros
Average image vs. median image
Slide credit: A. Efros
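The average-vs-median idea can be sketched on a synthetic frame stack. The data below is made up; with real video, the frames would come from a decoder such as OpenCV's VideoCapture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "video": a static background plus sensor noise, with a bright
# moving object occasionally covering pixel (1, 1).
background = np.full((4, 4), 100.0)
frames = np.stack([background + rng.normal(0, 2, background.shape)
                   for _ in range(20)])
frames[::5, 1, 1] = 255.0  # object passes through every 5th frame

average_image = frames.mean(axis=0)       # skewed upward by the bright outliers
median_image = np.median(frames, axis=0)  # robust: recovers the background
```

The median survives transient foreground objects because fewer than half of the frames are contaminated at any pixel, which is why it makes a better background model.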
Slide credit: A. Efros
Slide credit: A. Efros
detector
Slide credit: D. Hoiem
Failure modes: too strong a dynamics model; too strong an observation model
Slide credit: D. Hoiem
filters good for this
Slide credit: D. Hoiem
Slide credit: D. Hoiem
Slide credit: D. Hoiem
#Classes: 6, #Videos: 2,391, Source: Lab, Year: 2004 Recognizing Human Actions: A Local SVM Approach
#Classes: 101, #Videos: 9,511, Source: YouTube, Year: 2012 UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild.
#Classes: 400, #Videos: 240K, Source: YouTube, Year: 2017 The Kinetics Human Action Video Dataset
Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman,
A recurrent network as a sequence modeler (also used in language tasks, e.g., sentence → sentiment), here run on CNN activations as opposed to words
Diagram credit: J. Carreira, A. Zisserman
Appearance stream: takes an HxWx3 RGB image
Motion stream: takes an HxWx2N stack of optical flow (N frames, x and y components)
Diagram credit: J. Carreira, A. Zisserman
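The two streams' input shapes can be sketched as follows; the specific H, W, and N values are typical choices assumed here, not fixed by the slides:

```python
import numpy as np

# Input tensors for a two-stream network (placeholder values).
N, H, W = 10, 224, 224
rgb_input = np.zeros((H, W, 3))                        # appearance: one RGB frame
flow_fields = [np.zeros((H, W, 2)) for _ in range(N)]  # per-frame (x, y) optical flow
flow_input = np.concatenate(flow_fields, axis=2)       # motion: HxWx2N stacked flow
```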
Diagram credit: J. Carreira, A. Zisserman
3D ConvNet: takes an HxWx3xN space-time tensor (N frames)
Diagram credit: J. Carreira, A. Zisserman
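A toy single-filter 3D convolution over a grayscale HxWxN volume shows how one filter can respond to temporal as well as spatial structure. The naive loop implementation and the temporal-difference filter are illustrative assumptions:

```python
import numpy as np

# Naive 3-D convolution of one filter over an H x W x N volume (valid padding).
def conv3d_single(volume, kernel):
    H, W, N = volume.shape
    kh, kw, kt = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1, N - kt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for t in range(out.shape[2]):
                out[i, j, t] = np.sum(volume[i:i+kh, j:j+kw, t:t+kt] * kernel)
    return out

volume = np.ones((8, 8, 5))  # a static (unchanging) 8x8 video of 5 frames
temporal_diff = np.zeros((1, 1, 2))
temporal_diff[0, 0] = [1.0, -1.0]  # responds only to change over time
response = conv3d_single(volume, temporal_diff)  # static input -> zero response
```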
Two-stream 3D ConvNet: learns both appearance patterns and motion patterns
RGB stream: HxWx3xN tensor
Flow stream: HxWx2xN tensor
Diagram credit: J. Carreira, A. Zisserman
Take-homes:
Just looking at independent frames does shockingly well.
Using optical flow as input improves things. If flow is so important, can’t it just learn this on its own?