Deep Tracking & Flow
Instructor - Simon Lucey
16-423 - Designing Computer Vision Apps
Today
Deep Features
Deep Tracking
Deep Flow
Primary Visual Cortex
Spatial Sensitivity
template?
Kingdom, Field, Olmos, 2007
Handling Geometric Distortion
(CVPR’01) the effectiveness of SSD will degrade with significant viewpoint and/or illumination change.
“1D Patch” “Distorted 1D Patch”
Source: A. C. Berg
Direct: “match”
Option 1: “align”, then “match”
Option 2: “blur”, then “match”
Option 2 is attractive, low computational cost!
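The "blur, then match" option can be illustrated with a minimal sketch. The spike patches, the box blur, and the helper names `ssd` and `box_blur` are illustrative assumptions, not taken from the slides; the point is only that a small shift that ruins the raw SSD costs far less after blurring.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length 1D patches."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def box_blur(x, radius=1):
    """Box blur with zero padding; the output has the same length as x."""
    n = len(x)
    width = 2 * radius + 1
    return [
        sum(x[j] for j in range(max(0, i - radius), min(n, i + radius + 1))) / width
        for i in range(n)
    ]


patch = [0, 0, 0, 1, 0, 0, 0, 0]      # "1D Patch": a single spike
distorted = [0, 0, 0, 0, 1, 0, 0, 0]  # "Distorted 1D Patch": spike moved by one sample

raw_ssd = ssd(patch, distorted)
blurred_ssd = ssd(box_blur(patch), box_blur(distorted))
print(raw_ssd, blurred_ssd)
```

Blurring also lowers the overall signal energy, so this is a sketch of the tolerance effect rather than a calibrated comparison.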
Sparseness and Positiveness
sparse and positive.
remedy this problem with little loss in performance.
Filtering (∗), e.g., oriented gradients, Gabor filters
“Rectification”, e.g., sigmoid, squared, relu, etc.
Reminder: Convolution
“signal” x = [8 4 6 2 7]
“filter” h = [1 2]
“convolution”
>> conv(x,h,'valid')
ans =
    20    14    14    11
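For readers without MATLAB, the same "valid" convolution can be sketched in plain Python. The function name `conv_valid` is an illustrative assumption; the signal and filter are the ones on the slide.

```python
def conv_valid(x, h):
    """1D convolution keeping only positions where h fully overlaps x."""
    n, m = len(x), len(h)
    # Convolution flips the filter: y[i] = sum_j h[j] * x[i + m - 1 - j]
    return [
        sum(h[j] * x[i + m - 1 - j] for j in range(m))
        for i in range(n - m + 1)
    ]


x = [8, 4, 6, 2, 7]  # "signal"
h = [1, 2]           # "filter"
print(conv_valid(x, h))  # [20, 14, 14, 11]
```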
Multi-Channel Convolution
“multi-channel signal” x (K-channels) ∗ “multi-channel filter” h (K-channels) = “single-channel response” y
$$ y = \sum_{k=1}^{K} x^{(k)} \ast h^{(k)} $$
“multi-channel signal” x ∗ “multi-channel filter” h = “multi-channel response” y (L-channels)
$$ y^{(l)} = \sum_{k=1}^{K} x^{(k)} \ast h^{(k,l)} \quad \text{for } l = 1 : L $$
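The multi-channel sum can be sketched for 1D signals as follows. The helper names, the toy signal, and the filter bank layout `h[k][l]` are illustrative assumptions; each output channel is the sum over input channels of a single-channel "valid" convolution.

```python
def conv_valid(x, h):
    """Single-channel 1D 'valid' convolution (filter is flipped)."""
    n, m = len(x), len(h)
    return [
        sum(h[j] * x[i + m - 1 - j] for j in range(m))
        for i in range(n - m + 1)
    ]


def multi_channel_conv(x, h):
    """x: list of K channels; h[k][l]: filter from input channel k to output channel l."""
    K, L = len(x), len(h[0])
    out = []
    for l in range(L):
        per_channel = [conv_valid(x[k], h[k][l]) for k in range(K)]
        # Sum the K single-channel responses sample by sample.
        out.append([sum(vals) for vals in zip(*per_channel)])
    return out


x = [[8, 4, 6, 2, 7],
     [1, 0, 1, 0, 1]]        # K = 2 input channels
h = [[[1, 2], [1, 0]],
     [[0, 1], [1, 1]]]       # h[k][l] with L = 2 output channels
y = multi_channel_conv(x, h)
print(y)
```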
CNNs for Object Detection
image patch 3@ (224x224)
conv L@ (NxM): L-channels, M-pixels, N-pixels
$$ D \cdot \eta\Big( \sum_{k=1}^{K} x^{(k)} \ast h^{(k,l)} \Big) $$
η{} → non-linear function (relu, max pooling)
ReLU - Sparse and Positive
relu{x} = max(0, x)
$$ \|y - Ax\|_2^2 + \frac{\lambda}{2} \|x\|_1 $$
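One way to read the pairing of ReLU with the sparse objective is the scalar case. Assuming A = 1 and a non-negativity constraint x ≥ 0 (both assumptions, not stated on the slide), the minimiser is a shifted ReLU, relu(y − λ/4). The sketch below checks that closed form against a brute-force grid search; all names are illustrative.

```python
def relu(x):
    return max(0.0, x)


def objective(x, y, lam):
    """Scalar case with A = 1: (y - x)^2 + (lam / 2) * |x|."""
    return (y - x) ** 2 + 0.5 * lam * abs(x)


def grid_argmin(y, lam, step=0.001, upper=5.0):
    """Brute-force minimiser over x >= 0 on a regular grid."""
    n = int(upper / step)
    return min((i * step for i in range(n + 1)), key=lambda x: objective(x, y, lam))


lam = 2.0
for y in (-1.0, 0.25, 1.0, 3.0):
    closed_form = relu(y - lam / 4)  # assumed closed form for the constrained scalar problem
    print(y, closed_form, round(grid_argmin(y, lam), 3))
```

Inputs below the threshold map to exactly zero and the rest stay positive, which is the "sparse and positive" behaviour the slide title refers to.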
Max Pooling - Down Sampling
[Figure: Input image → Convolutional layer → Sub-sampling layer (LeCun 1980)]
[Figure: Input Image → Convolutional Layer → Max Pool, max( )]
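Max pooling as down-sampling can be sketched in a few lines. The 2x2 non-overlapping window and the toy feature map are assumptions for illustration; the slides do not state a window size.

```python
def max_pool_2d(image, size=2):
    """Non-overlapping size x size max pooling.

    Trailing rows or columns that do not fill a whole window are dropped.
    """
    rows, cols = len(image) // size, len(image[0]) // size
    return [
        [
            max(image[r * size + dr][c * size + dc]
                for dr in range(size) for dc in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]
print(max_pool_2d(feature_map))  # [[4, 2], [2, 8]]
```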
Hierarchical Learning
View-tuned cells Complex Simple
Bob Crimi
V1
V2/V4
IT
Ventral Visual Stream
Current State of the Art
image patch 3@ (227x227)
conv1 96@ (55x55)
conv2 256@ (27x27)
conv3 384@ (13x13)
conv4 384@ (13x13)
conv5 256@ (13x13)
fc-6 (4096)
fc-7 (K)
Output: K × 1 vector (“car”, “bird”, “cat”, . . .)
Current State of the Art - Pose Selection
image patch 3@ (224x224)
conv1 64@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13) fc-6 (4096) fc-7 (4096) fc-8
“car” “bird” “cat”
. . .
In BMVC, 2014.
Impact on Object Recognition
[Figure: ImageNet Challenge results by year, “BC” (before ConvNets) vs. “AD” (after deep learning); 6.8%]
Visualizing CNNs
CNNs as Feature Extraction
image patch 3@ (224x224)
conv1 96@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13)
In BMVC, 2014.
fc-6 (4096) fc-7 (4096) fc-8
Legend: parameters to learn; pre-learned parameters (VGG)
Today
Drawback to Conventional Methods
assumptions (e.g. circulant Toeplitz) to make things efficient.
x (“known signal”) ∗ h (“unknown filter”) = y (“known response”)
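To see why a circulant assumption makes the "unknown filter" problem cheap, here is a minimal sketch: if the convolution is circular, the filter can be recovered by dividing spectra element-wise. The naive DFT, the function names, and the toy filter are illustrative assumptions, and the sketch assumes no DFT coefficient of the signal is zero.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_conv(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[(i - j) % n] * h[j] for j in range(n)) for i in range(n)]


def solve_filter(x, y):
    """Recover h from y = circular_conv(x, h) by element-wise spectral division."""
    X, Y = dft(x), dft(y)
    H = [Yf / Xf for Xf, Yf in zip(X, Y)]  # assumes every X[f] is non-zero
    return [v.real for v in dft(H, inverse=True)]


x = [8, 4, 6, 2, 7]        # known signal
h_true = [1, 2, 0, 0, 0]   # filter we pretend not to know
y = circular_conv(x, h_true)  # known response
h_est = solve_filter(x, y)
print([round(v, 6) for v in h_est])
```

With a fast Fourier transform in place of the naive DFT, the cost drops to O(n log n), which is the efficiency the circulant assumption buys.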
Deep Tracking Methods
the employment of tracking using deep learning features.
ensemble of labeled offline videos.
with Deep Regression Networks”, ECCV 2016.
“Fully-Convolutional Siamese Networks for Object Tracking”, ArXiv 2016.
conv1 96@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13)
Deep Regression Networks
[Figure: Previous frame → Crop → “What to track” → Conv Layers; Current frame → Crop → “Search Region” → Conv Layers; both → Fully-Connected Layers → Predicted location within search region]
What does this method remind you of? Why is it fast?
Supervised Descent Method (SDM)
geometry:
∆p = R[I(p) − T (0)]
R(k)
Xiong & De la Torre 2013
What is a potential issue here?
Deep Regression Networks
Previous video frame centered on
Current video frame, shifted, with ground-truth bounding box
Deep Regression Networks
Image centered on
Shifted image with ground-truth bounding box
Deep Regression Networks
[Figure: Accuracy Rank vs. Robustness Rank; GOTURN (Ours)]
How does it work?
the target object in the current frame.
locates the nearest “object.”
[Figure: Overall Errors by condition (None, Illumination change, Camera motion, Motion change, Size change, Occlusion); Current frame only vs. Current + previous frame]
Generality vs. Specificity
[Figure: Tracking Errors (Overall Errors) vs. Number of training videos; Same object class in training and test sets vs. Different object class in training and test sets]
Fully Convolutional Siamese Networks
“Template”: 127x127x3 → 6x6x128
“Source Image”: 255x255x3 → 22x22x128
Output: 17x17x1
“Siamese” as they apply an identical transformation to both inputs
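The 17x17x1 output is consistent with sliding the 6x6x128 template features over the 22x22x128 source features at every valid position (22 − 6 + 1 = 17). A minimal sketch of that cross-correlation on toy-sized feature maps follows; the function name, the `[row][col][channel]` layout, and the toy values are assumptions for illustration.

```python
def cross_correlate(search, template):
    """Slide a multi-channel template over a multi-channel search map.

    Both inputs are indexed [row][col][channel]; only valid positions are
    evaluated, and the result is a single-channel score map.
    """
    H, W = len(search), len(search[0])
    h, w = len(template), len(template[0])
    C = len(template[0][0])
    return [
        [
            sum(search[r + i][c + j][k] * template[i][j][k]
                for i in range(h) for j in range(w) for k in range(C))
            for c in range(W - w + 1)
        ]
        for r in range(H - h + 1)
    ]


template = [[[1, 2], [0, 1]],
            [[2, 1], [1, 1]]]                       # toy 2 x 2 x 2 template features
search = [[[0, 0] for _ in range(5)] for _ in range(5)]  # toy 5 x 5 x 2 search features
for i in range(2):
    for j in range(2):
        search[2 + i][1 + j] = list(template[i][j])  # plant the template at row 2, col 1

scores = cross_correlate(search, template)
peak = max(
    ((r, c) for r in range(len(scores)) for c in range(len(scores[0]))),
    key=lambda rc: scores[rc[0]][rc[1]],
)
print(len(scores), len(scores[0]), peak)
```

The peak of the score map marks where the template best matches the search region.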
Fully Convolutional
image patch 3@ (224x224)
conv1 96@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13)
$$ \phi\{e_i \ast x\} = e_i \ast \phi\{x\}, \qquad e_i = [0, \ldots, 1, \ldots, 0]^T \ \text{(1 at the $i$-th index)} $$
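The shift property can be checked numerically on a toy map. The sketch assumes circular boundaries and a single convolution followed by ReLU standing in for ϕ; real networks with strides and padding satisfy the property only approximately. All names and values are illustrative.

```python
def circular_shift(x, i):
    """Same as circularly convolving x with the unit impulse e_i."""
    n = len(x)
    return [x[(t - i) % n] for t in range(n)]


def phi(x, h=(1, -2, 1)):
    """Toy fully convolutional map: circular convolution followed by ReLU."""
    n = len(x)
    return [
        max(0, sum(h[j] * x[(t - j) % n] for j in range(len(h))))
        for t in range(n)
    ]


x = [8, 4, 6, 2, 7, 1]
# Shifting the input and then applying phi equals applying phi and then shifting.
equivariant = all(
    phi(circular_shift(x, i)) == circular_shift(phi(x), i)
    for i in range(len(x))
)
print(equivariant)
```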
Fully Convolutional Siamese Networks
[Figure: Success plots of OPE; x-axis: Overlap threshold, y-axis: Success rate]
SiamFC (ours) [0.612]
LCT (2015) [0.612]
SiamFC_3s (ours) [0.608]
CCT (2015) [0.605]
Staple (2016) [0.600]
SCT4 (2016) [0.595]
KCFDP (2015) [0.581]
DSST (2014) [0.554]
DLSSVM_NU (2016) [0.550]
SCM (2012) [0.499]
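For readers unfamiliar with success plots, each point on the curve is the fraction of frames whose bounding-box overlap with the ground truth exceeds a threshold. The sketch below assumes overlap means intersection over union and boxes are given as (x, y, width, height); the boxes are made-up toy data, not benchmark results.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_rate(pred, truth, threshold):
    """Fraction of frames whose overlap exceeds the threshold."""
    overlaps = [iou(p, t) for p, t in zip(pred, truth)]
    return sum(o > threshold for o in overlaps) / len(overlaps)


# Toy data: a perfect frame, a partially overlapping frame, and a lost target.
truth = [(0, 0, 10, 10), (5, 5, 10, 10), (10, 10, 10, 10)]
pred = [(0, 0, 10, 10), (10, 5, 10, 10), (30, 30, 10, 10)]

for threshold in (0.2, 0.5, 0.8):
    print(threshold, success_rate(pred, truth, threshold))
```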
Fully Convolutional Siamese Networks
[Figure: AR plot for experiment baseline (mean); x-axis: Robustness (S = 30.00), y-axis: Accuracy]
Trackers shown: SiamFC (ours), SiamFC_3s (ours), Staple, GOTURN, ACAT, DGT, DSST, eASMS, HMMTxD, KCF, MCT, PLT_13, PLT_14, SAMF
Fully Convolutional Siamese Networks
[Figure: tracking results at Frame 1 (init.), Frame 50, Frame 100, Frame 200]
Sony Xperia Phone - 960 FPS - https://www.facebook.com/bbcnews/videos/662989307239621/
Release - “Need for Speed: Benchmark for High Frame Rate Object Tracking”
Today
Flow = Parts Based Registration
$$ \min_{x} \sum_{i=1}^{N} D_i(x_i) + \lambda R(x) $$
$D_i(x_i)$: Does the image at $x_i$ look like the $i$-th part?
$R(x)$: Do the joint locations of the parts match the object?
Reminder - Exhaustive Search
p1 p2 p3 p4 p5
Exhaustive search: $O(M^N)$
“We can do much better than this if the graph is sparse.”: $O(NM^2)$
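The gap between the two complexities can be sketched on a chain of parts, where each part interacts only with its neighbour. The pairwise term `abs(a - b)`, the cost table, and the function names are hypothetical choices for illustration; the slides do not specify R. Dynamic programming returns a labeling with the same total cost as exhaustive search while doing O(NM²) work.

```python
from itertools import product


def pairwise(a, b):
    """Hypothetical smoothness term between neighbouring parts."""
    return abs(a - b)


def total_cost(x, D, lam):
    """Sum of unary costs D[i][x_i] plus lam times the chain smoothness."""
    unary = sum(D[i][xi] for i, xi in enumerate(x))
    smooth = sum(pairwise(x[i], x[i + 1]) for i in range(len(x) - 1))
    return unary + lam * smooth


def exhaustive(D, lam):
    """Try all M^N labelings."""
    M = len(D[0])
    return min(product(range(M), repeat=len(D)), key=lambda x: total_cost(x, D, lam))


def chain_dp(D, lam):
    """Dynamic programming on a chain: N parts, M candidates each, O(N M^2) work."""
    N, M = len(D), len(D[0])
    cost = list(D[0])
    back = []
    for i in range(1, N):
        # Best predecessor label for every label b of part i.
        prev = [min(range(M), key=lambda a: cost[a] + lam * pairwise(a, b))
                for b in range(M)]
        cost = [cost[prev[b]] + lam * pairwise(prev[b], b) + D[i][b]
                for b in range(M)]
        back.append(prev)
    x = [min(range(M), key=lambda b: cost[b])]
    for prev in reversed(back):
        x.append(prev[x[-1]])
    return tuple(reversed(x))


D = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]  # toy unary costs, N = 4, M = 3
lam = 1
print(chain_dp(D, lam), exhaustive(D, lam))
```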
Learning
$\{D_i(x_i)\}_{i=1}^{N}$
for every pixel u: patch at pixel u (positive), all other patches (negatives)
“Siamese” network
Left input patch → Convolution, ReLU → Convolution, ReLU → Convolution, ReLU
Right input patch → Convolution, ReLU → Convolution, ReLU → Convolution, ReLU
Concatenate → Fully-connected, ReLU → Fully-connected, ReLU → Fully-connected, ReLU → Fully-connected, Sigmoid → Similarity score
Fast Architecture
Left input patch: Convolution; Convolution, ReLU; Convolution, ReLU; Normalize
Right input patch: Convolution; Convolution, ReLU; Convolution, ReLU; Normalize
Dot product → Similarity score
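The last two stages of the fast architecture, normalize then dot product, amount to a cosine similarity between the two feature vectors. A minimal sketch, assuming non-zero feature vectors and with illustrative names and values:

```python
import math


def normalize(v):
    """Scale a feature vector to unit Euclidean length (assumes v is non-zero)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def similarity(left_features, right_features):
    """Dot product of the two normalized feature vectors."""
    left, right = normalize(left_features), normalize(right_features)
    return sum(a * b for a, b in zip(left, right))


print(similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel features
print(similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal features
```

Because the score is a plain dot product, features for each patch can be computed once and reused across all candidate matches, which is one plausible reason this variant is the fast one.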
Results - KITTI 2015
[Figure: Left input image, Right input image, Ground truth]
Fast architecture: Error 2.79 %
Accurate architecture: Error 2.36 %
More Data?
FlowNet
FlowNet - Refinement
FlowNet - Results
[Figure: Images, Ground truth, EpicFlow, FlowNetS, FlowNetC]