[PPT] - Deep Tracking & Flow Instructor - Simon Lucey 16-623 - PowerPoint Presentation

SLIDE 1

Deep Tracking & Flow

Instructor - Simon Lucey

16-623 - Designing Computer Vision Apps

SLIDE 2

Today

Deep Features
Deep Tracking
Deep Flow

SLIDE 3

Primary Visual Cortex

            

SLIDE 4

Spatial Sensitivity

Which image has the greatest distortion with respect to the template?

Kingdom, Field, Olmos, 2007

SLIDE 5

Spatial Sensitivity

Kingdom, Field, Olmos, 2007

SLIDE 6

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of the sum of squared differences (SSD) degrades under significant viewpoint change.

Two options to match local image patches:

“1D Patch” “Distorted 1D Patch”

Source: A. C. Berg

SLIDE 7

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint change.

Two options to match local image patches:

“1D Patch” “Distorted 1D Patch”

Source: A. C. Berg

“match”

SLIDE 8

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint change.

Two options to match local image patches:
1. Simultaneously estimate the distortion and position of the matching patch.

“1D Patch” “Distorted 1D Patch”

SLIDE 9

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint change.

Two options to match local image patches:
1. Simultaneously estimate the distortion and position of the matching patch.

“1D Patch” “Distorted 1D Patch” “align”

SLIDE 11

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint change.

Two options to match local image patches:
1. Simultaneously estimate the distortion and position of the matching patch.

“1D Patch” “Distorted 1D Patch” “align” “match”

SLIDE 12

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint and/or illumination change.

Two options to match patches:
1. Simultaneously estimate the distortion and position of the matching patch.
2. “Blur” the template window and perform matching coarse-to-fine.

“1D Patch” “Distorted 1D Patch”

SLIDE 13

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint and/or illumination change.

Two options to match patches:
1. Simultaneously estimate the distortion and position of the matching patch.
2. “Blur” the template window and perform matching coarse-to-fine.

“1D Patch” “Distorted 1D Patch” “blur”

SLIDE 14

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint and/or illumination change.

Two options to match patches:
1. Simultaneously estimate the distortion and position of the matching patch.
2. “Blur” the template window and perform matching coarse-to-fine.

“1D Patch” “Distorted 1D Patch” “blur” “match”

SLIDE 15

Handling Geometric Distortion

As pointed out in the seminal work of Berg and Malik (CVPR’01), the effectiveness of SSD degrades under significant viewpoint and/or illumination change.

Two options to match patches:
1. Simultaneously estimate the distortion and position of the matching patch.
2. “Blur” the template window and perform matching coarse-to-fine.

“1D Patch” “Distorted 1D Patch” “blur” “match”

Option 2 is attractive: low computational cost!

SLIDE 16

Sparseness and Positiveness

Blurring only works if the signals being matched are sparse and positive.

Unfortunately, natural images are neither.
A combination of oriented filter banks and rectification can remedy this problem with little loss in performance.

[Figure: signals x and y, each passed through a filter bank (e.g., oriented gradients, Gabor filters) followed by “rectification” (e.g., sigmoid, squared, relu, etc.).]
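The claim that blurring needs sparse, positive signals can be illustrated with a small sketch. It assumes a toy 1D filter response (an adjacent +1 / -1 pair) and a hypothetical 3-tap box blur; neither comes from the slides. Blurring the signed response cancels most of it, while blurring the rectified channels keeps all of the response mass.

```python
def box_blur(signal, width=3):
    """Centered moving average with zero padding at both ends."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = [signal[j] for j in range(i - half, i + half + 1)
                  if 0 <= j < len(signal)]
        out.append(sum(window) / width)
    return out


def rectify(signal):
    """Split a signed signal into two non-negative channels."""
    pos = [max(0.0, v) for v in signal]
    neg = [max(0.0, -v) for v in signal]
    return pos, neg


# Toy response of a gradient-like filter: neither sparse-positive nor blur-friendly.
signed = [0.0, 0.0, 1.0, -1.0, 0.0, 0.0]

blurred_signed = box_blur(signed)
pos, neg = rectify(signed)
blurred_pos, blurred_neg = box_blur(pos), box_blur(neg)

# Total absolute response that survives the blur in each case.
energy_signed = sum(abs(v) for v in blurred_signed)
energy_rectified = sum(blurred_pos) + sum(blurred_neg)
print(energy_signed, energy_rectified)
```

In this toy case the signed signal loses two thirds of its absolute response to cancellation, whereas the two rectified channels keep all of it.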

SLIDE 17

Sparseness and Positiveness

Blurring only works if the signals being matched are sparse and positive.

Unfortunately, natural images are neither.
A combination of oriented filter banks and rectification can remedy this problem with little loss in performance.

SLIDE 18

Reminder: Convolution

x = [8 4 6 2 7]   (“signal”)
h = [1 2]   (“filter”)
∗   (“convolution operator”)

SLIDE 19

Reminder: Convolution

x = [8 4 6 2 7]   (“signal”)
h = [1 2]   (“filter”)
∗   (“convolution operator”)

>> conv(x,h,'valid')
ans = 20 14 14 11
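For readers without MATLAB, the same result can be reproduced with a minimal pure-Python sketch. The helper name `conv_valid` is an illustrative choice, not a library function. It flips the filter, as convolution requires, and keeps only the positions where the filter fully overlaps the signal.

```python
def conv_valid(x, h):
    """1D convolution, keeping only outputs where the flipped filter fits inside x."""
    flipped = h[::-1]
    n_out = len(x) - len(h) + 1
    return [sum(x[i + j] * flipped[j] for j in range(len(h)))
            for i in range(n_out)]


x = [8, 4, 6, 2, 7]   # "signal"
h = [1, 2]            # "filter"
print(conv_valid(x, h))  # [20, 14, 14, 11]
```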

SLIDE 20

Multi-Channel Convolution

x (“multi-channel signal”) ∗ h (“multi-channel filter”) = y (“single-channel response”)

SLIDE 21

Multi-Channel Convolution

x (“multi-channel signal”, K-channels) ∗ h (“multi-channel filter”, K-channels) = y (“single-channel response”)

SLIDE 22

Multi-Channel Convolution

$$y = \sum_{k=1}^{K} x^{(k)} \ast h^{(k)}$$

SLIDE 23

Multi-Channel Convolution

x (“multi-channel signal”) ∗ h (“multi-channel filter”) = y (“multi-channel response”)

SLIDE 24

Multi-Channel Convolution

x (“multi-channel signal”) ∗ h (“multi-channel filter”) = y (“multi-channel response”, L-channels)

SLIDE 25

Multi-Channel Convolution

$$y^{(l)} = \sum_{k=1}^{K} x^{(k)} \ast h^{(k,l)} \quad \text{for } l = 1, \dots, L$$
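The multi-channel formula can be sketched in a few lines of pure Python. The 1D signals, the filter values, and the layout `h[k][l]` (filter from input channel k to output channel l) are assumptions made for illustration only.

```python
def conv_valid(x, h):
    """1D 'valid' convolution (filter is flipped)."""
    flipped = h[::-1]
    n_out = len(x) - len(h) + 1
    return [sum(x[i + j] * flipped[j] for j in range(len(h)))
            for i in range(n_out)]


def multi_channel_conv(x, h):
    """y[l] = sum over k of x[k] * h[k][l], for K input and L output channels."""
    K, L = len(x), len(h[0])
    y = []
    for l in range(L):
        per_input = [conv_valid(x[k], h[k][l]) for k in range(K)]
        # Sum the K single-channel responses sample by sample.
        y.append([sum(vals) for vals in zip(*per_input)])
    return y


x = [[8, 4, 6, 2, 7],      # input channel 1
     [1, 0, 1, 0, 1]]      # input channel 2  (K = 2)
h = [[[1, 2], [1, 0]],     # filters from input channel 1 to outputs 1 and 2
     [[0, 1], [1, 1]]]     # filters from input channel 2 to outputs 1 and 2  (L = 2)

y = multi_channel_conv(x, h)
print(y)  # [[21, 14, 15, 11], [5, 7, 3, 8]]
```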

SLIDE 26

CNNs for Object Detection

SLIDE 27

CNNs for Object Detection

image patch 3@ (224x224)

SLIDE 28

CNNs for Object Detection

image patch 3@ (224x224)

conv L@ (NxM)

L-channels M-pixels N-pixels

SLIDE 29

$$D \cdot \eta\left( \sum_{k=1}^{K} x^{(k)} \ast h^{(k,l)} \right)$$

CNNs for Object Detection

image patch 3@ (224x224)

conv L@ (NxM)

η{} → non-linear function (relu, max pooling)

SLIDE 30

ReLU - Sparse and Positive

Rectified Linear Unit

relu{x} = max(0, x)

Connection to LASSO and sparsity?

$$\|y - Ax\|_2^2 + \frac{\lambda}{2}\|x\|_1$$
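One possible reading of the question above is sketched here under two assumptions that are not on the slide: A is orthonormal, so the problem decouples per coefficient, and x is constrained to be non-negative. Under those assumptions each coefficient minimises $(b - x)^2 + \frac{\lambda}{2}|x|$ with $b = (A^T y)_i$, and the minimiser is a shifted relu, $\max(0, b - \lambda/4)$. The brute-force grid search is there only to check the closed form numerically.

```python
def relu(v):
    return max(0.0, v)


def objective(x, b, lam):
    """Scalar case of ||y - Ax||_2^2 + (lam / 2) * ||x||_1 with orthonormal A."""
    return (b - x) ** 2 + 0.5 * lam * abs(x)


def brute_force_nonneg_min(b, lam, step=1e-3, upper=5.0):
    """Grid search over x >= 0; a numerical check, not an efficient solver."""
    n = int(upper / step)
    candidates = [i * step for i in range(n + 1)]
    return min(candidates, key=lambda x: objective(x, b, lam))


lam = 2.0  # hypothetical regularisation weight, giving a threshold of lam / 4 = 0.5
results = []
for b in (-1.0, 0.3, 0.5, 2.0):
    closed_form = relu(b - lam / 4.0)
    numeric = brute_force_nonneg_min(b, lam)
    results.append((b, closed_form, numeric))
    print(b, closed_form, round(numeric, 3))
```

Inputs below the threshold are set exactly to zero, which is the sparsity-inducing behaviour the LASSO penalty is known for.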

SLIDE 31

Max Pooling - Down Sampling

Input image Convolutional layer Sub-sampling layer

LeCun 1980

Max Pool Convolutional Layer Input Image max( )
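A minimal sketch of max pooling as down sampling, assuming a 2x2 window with stride 2 (a common choice, not stated on the slide) and a made-up 4x4 feature map:

```python
def max_pool_2d(image, size=2, stride=2):
    """Keep the maximum of each size-by-size window, stepping by stride."""
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [[max(image[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```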

SLIDE 32

Hierarchical Learning

View-tuned cells Complex Simple

Bob Crimi

SLIDE 33

Hierarchical Learning

View-tuned cells Complex Simple

Bob Crimi

V1

V2/V4

IT

Ventral Visual Stream

SLIDE 34

Current State of the Art

image patch 3@ (227x227)

conv1 96@ (55x55) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13) fc-6 (4096) fc-7 (K)

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

“car” “bird” “cat”

. . .
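The spatial sizes listed above follow from the usual output-size formula for convolution and pooling layers. The kernel sizes, strides, and padding below are assumptions taken from the commonly cited description of the Krizhevsky et al. network; they do not appear on the slide.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer on an n-by-n input."""
    return (n + 2 * pad - kernel) // stride + 1


n = 227
conv1 = out_size(n, kernel=11, stride=4)      # 55
pool1 = out_size(conv1, kernel=3, stride=2)   # 27
conv2 = out_size(pool1, kernel=5, pad=2)      # 27
pool2 = out_size(conv2, kernel=3, stride=2)   # 13
conv3 = out_size(pool2, kernel=3, pad=1)      # 13
print(conv1, conv2, conv3)
```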

SLIDE 36

Current State of the Art

image patch 3@ (227x227)

conv1 96@ (55x55) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13) fc-6 (4096) fc-7 (K)

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

     1 . . .     

K × 1

SLIDE 37

Current State of the Art - Pose Selection

image patch 3@ (224x224)

conv1 64@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13) fc-6 (4096) fc-7 (4096) fc-8

“car” “bird” “cat”

. . .

K. Chatfield, K. Simonyan, A. Vedaldi and A. Zisserman. “Return of the Devil in the Details: Delving Deep into Convolutional Networks.” In BMVC, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

SLIDE 38

Impact on Object Recognition

[Figure: ImageNet Challenge results by year, split into “BC” (before ConvNets) and “AD” (after deep learning); 6.8% is marked.]

SLIDE 39

Visualizing CNNs

SLIDE 40

CNNs as Feature Extraction

image patch 3@ (224x224)

conv1 96@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13)

K. Chatfield, K. Simonyan, A. Vedaldi and A. Zisserman. “Return of the Devil in the Details: Delving Deep into Convolutional Networks.” In BMVC, 2014.

parameters to learn pre-learned parameters (VGG)

fc-6 (4096) fc-7 (4096) fc-8

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

?

SLIDE 41

Today

Deep Features
Deep Tracking
Deep Flow

SLIDE 42

Drawback to Conventional Methods

Most methods for object tracking employ “online” learning.
Online methods are expensive and have to make simplifying assumptions (e.g., circulant Toeplitz structure) to be efficient.

x (“known signal”) ∗ h (“unknown filter”) = y (“known response”)
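To see why the circulant assumption buys efficiency, here is a hedged sketch of learning the unknown filter from a known signal and a known response. Under circular convolution the problem decouples into one scalar division per frequency. The function names, the ridge term `reg`, and the toy data are all hypothetical, and a naive O(n^2) DFT stands in for the FFT a real tracker would use.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (an FFT would be used in practice)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_conv(x, h):
    """Circular convolution, i.e. multiplication by a circulant matrix."""
    n = len(x)
    return [sum(x[(i - j) % n] * h[j] for j in range(n)) for i in range(n)]


def learn_filter(x, y, reg=1e-9):
    """Ridge-regularised h with circular_conv(x, h) ~ y, solved per frequency."""
    X, Y = dft(x), dft(y)
    H = [Xk.conjugate() * Yk / (abs(Xk) ** 2 + reg) for Xk, Yk in zip(X, Y)]
    return [c.real for c in dft(H, inverse=True)]


x = [8.0, 4.0, 6.0, 2.0, 7.0, 1.0]        # known signal
h_true = [1.0, 2.0, 0.0, 0.0, 0.0, 0.0]   # filter we pretend not to know
y = circular_conv(x, h_true)              # known response
h_est = learn_filter(x, y)
print([round(v, 4) for v in h_est])
```

Without the circulant assumption the same least-squares problem would need a general linear solve rather than element-wise divisions.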

SLIDE 43

Deep Tracking Methods

Recent work has explored tracking with deep learning features.

As efficiency is key, the strategy is to learn offline from a large ensemble of labeled videos.

Of particular interest are two papers:
1. D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.
2. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

conv1 96@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13)

SLIDE 44

Deep Regression Networks

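[Figure: the previous frame is cropped to the target (“what to track”) and the current frame is cropped to the search region; each crop passes through conv layers, and fully-connected layers output the predicted location of the target within the search region.]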

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

SLIDE 45

Deep Regression Networks

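[Figure: the previous frame is cropped to the target (“what to track”) and the current frame is cropped to the search region; each crop passes through conv layers, and fully-connected layers output the predicted location of the target within the search region.]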

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

What does this method remind you of?

SLIDE 46

Deep Regression Networks

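[Figure: the previous frame is cropped to the target (“what to track”) and the current frame is cropped to the search region; each crop passes through conv layers, and fully-connected layers output the predicted location of the target within the search region.]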

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

What does this method remind you of? Why is it fast?

SLIDE 47

Supervised Descent Method (SDM)

SDM assumes a linear relationship between appearance and geometry:

… …

$$\Delta p = R\,[I(p) - T(0)]$$

Iteratively updates until convergence.
However, $R^{(k)}$ is iteration specific.

Xiong & De la Torre 2013
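The iteration-specific regressors can be illustrated with a toy 1D sketch. Everything here is an assumption made for illustration: the "image" is I(x) = tanh(x), the template sits at p = 0, the regressor is a scalar, and each stage is fit by ordinary least squares on a fixed set of perturbations. It is not the implementation of Xiong & De la Torre.

```python
import math


def features(p):
    """Toy appearance difference I(p) - T(0) for the 1D 'image' I(x) = tanh(x)."""
    return math.tanh(p) - math.tanh(0.0)


def fit_regressor(ps):
    """Least-squares scalar R minimising sum((dp - R * f)^2), with target dp = -p."""
    num = sum(-p * features(p) for p in ps)
    den = sum(features(p) ** 2 for p in ps)
    return num / den


# Training perturbations around the true alignment p = 0.
ps = [-2.0 + 0.1 * i for i in range(41)]

regressors, errors = [], []
for k in range(4):
    errors.append(sum(p * p for p in ps) / len(ps))
    R_k = fit_regressor(ps)                   # a different R for each iteration k
    regressors.append(R_k)
    ps = [p + R_k * features(p) for p in ps]  # p <- p + R[I(p) - T(0)]
errors.append(sum(p * p for p in ps) / len(ps))

print([round(r, 3) for r in regressors])
print(errors)
```

Because each stage is refit on the residual perturbations left by the previous one, the regressors differ from stage to stage, which is the sense in which the regressor is iteration specific.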

SLIDE 48

Supervised Descent Method (SDM)

SDM assumes a linear relationship between appearance and geometry:

… …

$$\Delta p = R\,[I(p) - T(0)]$$

Iteratively updates until convergence.
However, $R^{(k)}$ is iteration specific.

Xiong & De la Torre 2013

What is a potential issue here?

SLIDE 49

Deep Regression Networks

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

Previous video frame centered on object

Current video frame, shifted, with ground-truth bounding box

SLIDE 50

Deep Regression Networks

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

Image centered on object

Shifted image with ground-truth bounding box
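A minimal sketch of how such a training pair could be built from a single annotated image: crop around the object, shift the crop, and express the ground-truth box in the shifted crop's coordinates as the regression target. The context factor of 2, the shift values, and the helper names are hypothetical; the slide does not specify them.

```python
def crop_window(cx, cy, w, h, context=2.0):
    """Crop of size (context * w, context * h) centered at (cx, cy)."""
    cw, ch = context * w, context * h
    return (cx - cw / 2, cy - ch / 2, cx + cw / 2, cy + ch / 2)


def box_in_crop(box, crop):
    """Express an image-space box (x1, y1, x2, y2) in crop coordinates scaled to [0, 1]."""
    x1, y1, x2, y2 = box
    cx1, cy1, cx2, cy2 = crop
    cw, ch = cx2 - cx1, cy2 - cy1
    return ((x1 - cx1) / cw, (y1 - cy1) / ch,
            (x2 - cx1) / cw, (y2 - cy1) / ch)


box = (40.0, 60.0, 80.0, 100.0)   # ground-truth box, 40 x 40 pixels
cx, cy = 60.0, 80.0               # box center

centered = crop_window(cx, cy, 40.0, 40.0)
shifted = crop_window(cx + 10.0, cy - 5.0, 40.0, 40.0)  # hypothetical random shift

target_centered = box_in_crop(box, centered)
target_shifted = box_in_crop(box, shifted)
print(target_centered)  # (0.25, 0.25, 0.75, 0.75)
print(target_shifted)   # (0.125, 0.3125, 0.625, 0.8125)
```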

SLIDE 51

Deep Regression Networks

[Figure: accuracy rank vs. robustness rank, with GOTURN (Ours) highlighted.]

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

SLIDE 52

How does it work?

Two hypotheses:
1. The network compares the previous frame to the current frame to find the target object in the current frame.
2. The network acts as a local generic “object detector” and simply locates the nearest “object.”

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

SLIDE 53

[Figure: overall errors by sequence attribute (none, illumination change, camera motion, motion change, size change, occlusion), comparing “current frame only” with “current + previous frame”.]

How does it work?

Two hypotheses:
1. The network compares the previous frame to the current frame to find the target object in the current frame.
2. The network acts as a local generic “object detector” and simply locates the nearest “object.”

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

SLIDE 54

Generality vs. Specificity

[Figure: overall tracking errors vs. number of training videos (50 to 300), for “same object class in training and test sets” and “different object class in training and test sets”.]

D. Held, S. Thrun, and S. Savarese “Learning to Track at 100 FPS with Deep Regression Networks”, ECCV 2016.

SLIDE 55

Fully Convolutional Siamese Networks

Template: 127x127x3 → 6x6x128
Source image: 255x255x3 → 22x22x128
Score map: 17x17x1

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

SLIDE 56

Fully Convolutional Siamese Networks

“Template”: 127x127x3 → 6x6x128
“Source Image”: 255x255x3 → 22x22x128
Score map: 17x17x1

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

“Siamese” as they apply an identical transformation to both inputs.
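The 17x17 score map follows from sliding the 6x6 template embedding over the 22x22 source embedding (22 - 6 + 1 = 17). The sketch below uses a single channel instead of 128 and a synthetic embedding with the template planted at a known offset; both are simplifying assumptions, and `cross_correlate` is an illustrative helper rather than the authors' code.

```python
def cross_correlate(search, template):
    """Slide the template over the search map, keeping valid positions only."""
    H, W = len(search), len(search[0])
    h, w = len(template), len(template[0])
    return [[sum(search[r + i][c + j] * template[i][j]
                 for i in range(h) for j in range(w))
             for c in range(W - w + 1)]
            for r in range(H - h + 1)]


# Synthetic single-channel embeddings (the network on the slide uses 128 channels).
template = [[float(i * 6 + j + 1) for j in range(6)] for i in range(6)]
search = [[0.0] * 22 for _ in range(22)]
for i in range(6):
    for j in range(6):
        search[5 + i][9 + j] = template[i][j]   # plant the target at row 5, column 9

scores = cross_correlate(search, template)
best = max(((r, c) for r in range(len(scores)) for c in range(len(scores[0]))),
           key=lambda rc: scores[rc[0]][rc[1]])
print(len(scores), len(scores[0]), best)  # 17 17 (5, 9)
```

The peak of the score map gives the target location inside the source image.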

SLIDE 57

Fully Convolutional

image patch 3@ (224x224)

conv1 96@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13)

?

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

SLIDE 58

Fully Convolutional

image patch 3@ (224x224)

?

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

SLIDE 59

Fully Convolutional

image patch 3@ (224x224)

?

$$\phi\{e_i \ast x\} = e_i \ast \phi\{x\}, \qquad e_i = [0, \dots, 1, \dots, 0]^T \ \text{(1 at the } i\text{-th index)}$$

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.
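The fully convolutional property says that shifting the input and then applying the network gives the same result as applying the network and then shifting the output. A small sketch checks this for a toy layer, assuming circular convolution followed by relu as a stand-in for the network; the filter values are made up.

```python
def circular_shift(x, i):
    """Circular convolution with the unit impulse e_i, i.e. a shift by i samples."""
    n = len(x)
    return [x[(t - i) % n] for t in range(n)]


def phi(x, h=(1.0, -2.0, 1.0)):
    """Toy fully convolutional layer: circular convolution followed by relu."""
    n = len(x)
    conv = [sum(h[j] * x[(t - j) % n] for j in range(len(h))) for t in range(n)]
    return [max(0.0, v) for v in conv]


x = [8.0, 4.0, 6.0, 2.0, 7.0, 1.0]
lhs = phi(circular_shift(x, 2))   # shift, then transform
rhs = circular_shift(phi(x), 2)   # transform, then shift
print(lhs == rhs)  # True
```

A fully-connected layer would not satisfy this identity in general, which is why the architecture avoids one.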

SLIDE 60

Fully Convolutional Siamese Networks

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.
Example training sequences.

SLIDE 61

Fully Convolutional Siamese Networks

[Figure: success plots of OPE; success rate vs. overlap threshold, both axes from 0 to 1.]

SiamFC (ours) [0.612]
LCT (2015) [0.612]
SiamFC_3s (ours) [0.608]
CCT (2015) [0.605]
Staple (2016) [0.600]
SCT4 (2016) [0.595]
KCFDP (2015) [0.581]
DSST (2014) [0.554]
DLSSVM_NU (2016) [0.550]
SCM (2012) [0.499]

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

SLIDE 62

Fully Convolutional Siamese Networks

[Figure: AR plot for experiment baseline (mean); accuracy vs. robustness (S = 30.00).]

Trackers shown: SiamFC (ours), SiamFC_3s (ours), Staple, GOTURN, ACAT, DGT, DSST, eASMS, HMMTxD, KCF, MCT, PLT_13, PLT_14, SAMF

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

SLIDE 63

Fully Convolutional Siamese Networks

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

Frame 1 (init.) Frame 50 Frame 100 Frame 200

SLIDE 64

Fully Convolutional Siamese Networks

L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr “Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.

Frame 1 (init.) Frame 50 Frame 100 Frame 200

SLIDE 65

Today

Deep Features
Deep Tracking
Deep Flow

SLIDE 66

SLIDE 67

SLIDE 68

Flow = Parts Based Registration

$$\min_{x} \sum_{i=1}^{N} D_i(x_i) + \lambda R(x)$$

SLIDE 69

Flow = Parts Based Registration

$$\min_{x} \sum_{i=1}^{N} D_i(x_i) + \lambda R(x)$$

$D_i(x_i)$: does the image at $x_i$ look like the $i$-th part?

SLIDE 70

Flow = Parts Based Registration

$$\min_{x} \sum_{i=1}^{N} D_i(x_i) + \lambda R(x)$$

$D_i(x_i)$: does the image at $x_i$ look like the $i$-th part?

$R(x)$: do the joint locations of the parts match the object?

SLIDE 71

Reminder - Exhaustive Search

“We can do much better than this if the graph is sparse.”

p1 p2 p3 p4 p5

$O(M^N)$

SLIDE 72

Reminder - Exhaustive Search

“We can do much better than this if the graph is sparse.”

p1 p2 p3 p4 p5

$O(NM^2)$
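The gap between the two complexities can be sketched for the simplest sparse graph, a chain. The sketch assumes a pairwise regulariser R(x) = sum of |x_i - x_{i+1}| over neighbouring parts, a made-up cost table D with N = 5 parts and M = 4 candidate locations, and hypothetical function names. Dynamic programming visits every pair of neighbouring locations once per part; the exhaustive search enumerates every joint assignment.

```python
from itertools import product


def chain_dp(D, lam):
    """Minimise sum_i D[i][x_i] + lam * sum_i |x_i - x_{i+1}| in O(N * M^2)."""
    N, M = len(D), len(D[0])
    cost = list(D[0])
    back = []
    for i in range(1, N):
        new_cost, pointers = [], []
        for b in range(M):
            # Best predecessor location a for placing part i at location b.
            a_best = min(range(M), key=lambda a: cost[a] + lam * abs(a - b))
            new_cost.append(cost[a_best] + lam * abs(a_best - b) + D[i][b])
            pointers.append(a_best)
        cost = new_cost
        back.append(pointers)
    # Trace the optimal assignment backwards from the last part.
    x = [min(range(M), key=lambda b: cost[b])]
    for pointers in reversed(back):
        x.append(pointers[x[-1]])
    return x[::-1], min(cost)


def exhaustive(D, lam):
    """Reference O(M^N) search over every joint assignment."""
    N, M = len(D), len(D[0])

    def total(x):
        unary = sum(D[i][x[i]] for i in range(N))
        pairwise = sum(abs(x[i] - x[i + 1]) for i in range(N - 1))
        return unary + lam * pairwise

    best = min(product(range(M), repeat=N), key=total)
    return list(best), total(best)


D = [[3, 1, 4, 2],
     [2, 5, 1, 3],
     [4, 2, 3, 1],
     [1, 3, 2, 4],
     [2, 4, 1, 3]]

x_dp, cost_dp = chain_dp(D, lam=1)
x_ex, cost_ex = exhaustive(D, lam=1)
print(x_dp, cost_dp, cost_ex)
```

Both searches reach the same minimum cost; only the amount of work differs.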

SLIDE 73

Learning

$\{D_i(x_i)\}_{i=1}^{N}$

for every pixel u

patch at pixel u (positive) all other patches (negatives)

H. Bristow, J. Valmadre, and S. Lucey “Dense Semantic Correspondence where Every Pixel is a Classifier”, ICCV 2015.

SLIDE 74

Learning

$\{D_i(x_i)\}_{i=1}^{N}$

J. Žbontar & Y. LeCun “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches”, JMLR 2015.

Left input patch → Convolution, ReLU → Convolution, ReLU → Convolution, ReLU
Right input patch → Convolution, ReLU → Convolution, ReLU → Convolution, ReLU
Concatenate → Fully-connected, ReLU → Fully-connected, ReLU → Fully-connected, ReLU → Fully-connected, Sigmoid → Similarity score

SLIDE 75

Learning

$\{D_i(x_i)\}_{i=1}^{N}$

J. Žbontar & Y. LeCun “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches”, JMLR 2015.

“Siamese” network

Left input patch → Convolution, ReLU → Convolution, ReLU → Convolution, ReLU
Right input patch → Convolution, ReLU → Convolution, ReLU → Convolution, ReLU
Concatenate → Fully-connected, ReLU → Fully-connected, ReLU → Fully-connected, ReLU → Fully-connected, Sigmoid → Similarity score

SLIDE 76

Fast Architecture

Left input patch → Convolution, ReLU → Convolution, ReLU → Convolution → Normalize
Right input patch → Convolution, ReLU → Convolution, ReLU → Convolution → Normalize
Dot product → Similarity score

J. Žbontar & Y. LeCun “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches”, JMLR 2015.
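The final two stages of the fast architecture, normalize and dot product, amount to a cosine similarity between the two feature vectors. A minimal sketch with made-up three-dimensional features (the real features come from the convolutional layers):

```python
import math


def normalize(v):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def similarity(left_features, right_features):
    """Dot product of unit-length vectors, i.e. their cosine similarity."""
    l, r = normalize(left_features), normalize(right_features)
    return sum(a * b for a, b in zip(l, r))


left = [1.0, 2.0, 2.0]
same_direction = similarity(left, [2.0, 4.0, 4.0])   # close to 1
orthogonal = similarity(left, [2.0, -1.0, 0.0])      # close to 0
print(same_direction, orthogonal)
```

Since the only interaction between the two branches is a dot product, each patch's normalized features can be computed once and reused across comparisons.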

SLIDE 77

Results - KITTI 2015

Left input image Right input image Ground truth

Fast architecture: error 2.79 %
Accurate architecture: error 2.36 %

J. Žbontar & Y. LeCun “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches”, JMLR 2015.

SLIDE 78

More Data?

N. Mayer, D. Cremers, T. Brox, et al. “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation”, CVPR 2016.

SLIDE 79

More Data?

N. Mayer, D. Cremers, T. Brox, et al. “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation”, CVPR 2016.

SLIDE 80

FlowNet

P. Fischer, D. Cremers, T. Brox, et al. “FlowNet: Learning Optical Flow with Convolutional Networks”, ICCV 2015.

SLIDE 81

FlowNet - Refinement

P. Fischer, D. Cremers, T. Brox, et al. “FlowNet: Learning Optical Flow with Convolutional Networks”, ICCV 2015.

SLIDE 82

FlowNet - Results

Images Ground truth EpicFlow FlowNetS FlowNetC

P. Fischer, D. Cremers, T. Brox, et al. “FlowNet: Learning Optical Flow with Convolutional Networks”, ICCV 2015.