

SLIDE 1

Deep Learning beyond Classification

Cees Snoek, UvA Efstratios Gavves, UvA Laurens van de Maaten, Facebook

SLIDE 2

Standard inference

N-way classification

Dog? Cat? Bike? Car? Plane?

SLIDE 3

Standard inference

N-way classification Regression

How popular will this movie be in IMDB?

SLIDE 4

Standard inference

N-way classification Regression Ranking …

Who is older?

SLIDE 5

Quiz: What is common?

N-way classification Regression Ranking …

SLIDE 6

Quiz: What is common?

They all make “single value” predictions. Do all our machine learning tasks boil down to “single value” predictions?

SLIDE 7

Beyond “single value” predictions?

Do all our machine learning tasks boil down to “single value” predictions? Are there tasks where the outputs are somehow correlated? Is there some structure in these output correlations? How can we predict such structures?

→ Structured prediction

SLIDE 8

Quiz: Examples?

SLIDE 9

Object detection

Predict a box around an object

Images
- Spatial location
- b(ounding) box

Videos
- Spatio-temporal location
- bbox@t, bbox@t+1, …

SLIDE 10

Object segmentation

SLIDE 11

Optical flow & motion estimation

SLIDE 12

Depth estimation

Godard et al., Unsupervised Monocular Depth Estimation with Left-Right Consistency, 2016

SLIDE 13

Normals and reflectance estimation

SLIDE 14

Structured prediction

Prediction goes beyond asking for “single values”. Outputs are complex and the output dimensions are correlated. Output dimensions have latent structure. Can we make deep networks return structured predictions?

SLIDE 16

Convnets for structured prediction

SLIDE 17

Sliding window on feature maps

Selective Search Object Proposals [Uijlings2013] SPPnet [He2014] Fast R-CNN [Girshick2015]

SLIDE 18

Fast R-CNN: Steps

Process the whole image up to conv5

Conv 1 Conv 2 Conv 3 Conv 4 Conv 5

Conv 5 feature map

SLIDE 19

Fast R-CNN: Steps

Process the whole image up to conv5. Compute possible locations for objects

Conv 1 Conv 2 Conv 3 Conv 4 Conv 5

Conv 5 feature map

SLIDE 20

Fast R-CNN: Steps

Process the whole image up to conv5. Compute possible locations for objects

- some correct, most wrong

Conv 1 Conv 2 Conv 3 Conv 4 Conv 5

Conv 5 feature map

SLIDE 21

Fast R-CNN: Steps

Process the whole image up to conv5. Compute possible locations for objects

- some correct, most wrong

Given a single location → the ROI pooling module extracts a fixed-length feature

Conv 1 Conv 2 Conv 3 Conv 4 Conv 5

Conv 5 feature map

Always 4x4 no matter the size

(figure: a candidate location on the conv5 feature map)

SLIDE 22

Fast R-CNN: Steps

Process the whole image up to conv5. Compute possible locations for objects

- some correct, most wrong

Given a single location → the ROI pooling module extracts a fixed-length feature

Conv 1 Conv 2 Conv 3 Conv 4 Conv 5

Conv 5 feature map

Always 4x4 no matter the size

(figure: a candidate location on the conv5 feature map → ROI Pooling Module)


SLIDE 24

Fast R-CNN: Steps

Process the whole image up to conv5. Compute possible locations for objects

- some correct, most wrong

Given a single location → the ROI pooling module extracts a fixed-length feature

Conv 1 Conv 2 Conv 3 Conv 4 Conv 5

Conv 5 feature map

Always 4x4 no matter the size

(figure: candidate location → ROI Pooling Module → outputs: car/dog/bicycle classification and new box coordinates)

SLIDE 25

Divide the feature map into 3×3 cells

- Cell size changes depending on the size of the candidate location

Always 3×3 no matter the size of the candidate location
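The cell-division trick above can be sketched in plain NumPy. This is a toy single-channel version for illustration, not the authors' implementation; the `roi_pool` name, the box coordinates, and the 3×3 output size are assumptions:

```python
import numpy as np

def roi_pool(feature_map, box, output_size=3):
    """Max-pool an arbitrary-sized box on a feature map into a fixed
    output_size x output_size grid (a sketch of Fast R-CNN's ROI pooling)."""
    x0, y0, x1, y1 = box  # box in feature-map coordinates
    h, w = y1 - y0, x1 - x0
    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # cell boundaries scale with the box size (ceil on the far edge)
            ys = y0 + (i * h) // output_size
            ye = y0 + ((i + 1) * h + output_size - 1) // output_size
            xs = x0 + (j * w) // output_size
            xe = x0 + ((j + 1) * w + output_size - 1) // output_size
            out[i, j] = feature_map[ys:max(ye, ys + 1), xs:max(xe, xs + 1)].max()
    return out

fm = np.arange(100, dtype=float).reshape(10, 10)
pooled = roi_pool(fm, box=(1, 1, 9, 7), output_size=3)  # always 3x3 output
```

Whatever the box size, the output is always 3×3, which is exactly what lets a fixed-size classifier head follow.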

SLIDE 26

Some results

SLIDE 27

Fast R-CNN

Reuse convolutions for different candidate boxes

- Compute feature maps only once

Region-of-Interest pooling

- Define the stride relatively → box width divided by a predefined number of “poolings” T (e.g., T = 5)
- Fixed-length vector

End-to-end training! (Very) accurate object detection. (Very) fast:

- Less than a second per image

External box proposals needed

SLIDE 28

Faster R-CNN [Girshick2016]

Fast R-CNN
- external candidate locations

Faster R-CNN
- a deep network proposes candidate locations

Slide over the feature map
- k anchor boxes per sliding position

Region Proposal Network
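The anchor idea can be sketched as follows: k = (number of scales) × (number of aspect ratios) boxes centred at every sliding position of the feature map. The `stride`, `scales`, and `ratios` values here are illustrative assumptions, not the values from the paper:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchor boxes (cx, cy, w, h)
    centred at every feature-map position, as in a region proposal network."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # centre of this sliding position in image coordinates
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w = s * np.sqrt(r)   # width/height chosen so that
                    h = s / np.sqrt(r)   # the anchor area stays s * s
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

A = make_anchors(2, 3)  # 2x3 positions, k = 6 anchors each -> 36 boxes
```

The RPN then scores each anchor as object/background and regresses box offsets, replacing the external proposals of Fast R-CNN.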

SLIDE 29

Going Fully Convolutional

[LongCVPR2014] Image larger than the network input

- slide the network

Conv 1 Conv 2 Conv 3 Conv 4 Conv 5

fc1 fc2

Is this pixel a camel? Yes! No!

SLIDE 34

Fully Convolutional Networks

[LongCVPR2014] Connect intermediate layers to output

SLIDE 35

Fully Convolutional Networks

Output is too coarse

- Image size 500×500, AlexNet input size 227×227 → output 10×10

How to obtain dense predictions? Upconvolution

- Other names: deconvolution, transposed convolution, fractionally-strided convolution
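Upconvolution can be sketched as the scatter-and-sum dual of convolution: each input pixel stamps a scaled copy of the kernel into the output, so an n×n input becomes (n−1)·stride + k. A minimal NumPy version, assuming a single channel and no padding:

```python
import numpy as np

def upconv2d(x, kernel, stride=2):
    """Transposed convolution ('upconvolution'): each input pixel scatters a
    scaled copy of the kernel into the output; overlapping stamps are summed.
    Output size: (n - 1) * stride + k for an n x n input and k x k kernel."""
    n = x.shape[0]
    k = kernel.shape[0]
    out = np.zeros(((n - 1) * stride + k,) * 2)
    for i in range(n):
        for j in range(n):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

x = np.ones((7, 7))
y = upconv2d(x, np.ones((2, 2)), stride=2)   # 7x7 -> 14x14
```

Two such 2× upconvolutions take a 7×7 map to 28×28, which is how the coarse fully-convolutional output is densified.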

SLIDE 36

Deconvolutional modules

Convolution, no padding, no strides: image → output

https://github.com/vdumoulin/conv_arithmetic

Upconvolution, no padding, no strides. Upconvolution, padding, strides.

SLIDE 37

Coarse → Fine Output

Upconvolution 2x, upconvolution 2x: 7×7 → 14×14 → … → 224×224

Compare the pixel label probabilities (e.g., 0.8, 0.1, 0.9) with the ground-truth pixel labels (1): a prediction close to the ground truth generates a small loss, a prediction far from it generates a large loss.

SLIDE 38

Structured losses

SLIDE 39

Deep ConvNets with CRF loss

[Chen, Papandreou 2016] Segmentation map is good but not pixel-precise

- Details around boundaries are lost

Cast the fully convolutional outputs as unary potentials. Consider pairwise potentials between output dimensions.

SLIDE 40

Deep ConvNets with CRF loss

[Chen, Papandreou 2016]

SLIDE 41

Deep ConvNets with CRF loss

[Chen, Papandreou 2016] Segmentation map is good but not pixel-precise

– Details around boundaries are lost

Cast the fully convolutional outputs as unary potentials. Consider pairwise potentials between output dimensions. Include a fully connected CRF loss to refine the segmentation:

E(x) = ∑_i θ_i(x_i) + ∑_{ij} θ_ij(x_i, x_j)

(unary loss + pairwise loss = total loss)

θ_ij(x_i, x_j) ~ w₁ exp(−α‖p_i − p_j‖² − β‖I_i − I_j‖²) + w₂ exp(−γ‖p_i − p_j‖²)
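A sketch of the dense-CRF pairwise weight between two pixels, assuming Gaussian kernels over position `p` and colour `I` in the DeepLab style; the weights `w1`, `w2` and bandwidths `alpha`, `beta`, `gamma` are hypothetical values, not tuned parameters:

```python
import numpy as np

def pairwise_kernel(p_i, p_j, I_i, I_j, w1=1.0, w2=1.0, alpha=0.5, beta=0.5, gamma=0.5):
    """Pairwise weight between pixels i and j: an appearance (bilateral)
    kernel that compares both position and colour, plus a smoothness kernel
    that only compares position."""
    dp2 = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    dI2 = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    bilateral = w1 * np.exp(-alpha * dp2 - beta * dI2)   # nearby AND similar colour
    smoothness = w2 * np.exp(-gamma * dp2)               # nearby, colour-agnostic
    return bilateral + smoothness

same = pairwise_kernel((0, 0), (0, 0), (10, 10, 10), (10, 10, 10))
far  = pairwise_kernel((0, 0), (30, 30), (10, 10, 10), (200, 0, 0))
```

Nearby pixels with similar colour get a large coupling (so they are pushed to the same label), while distant or dissimilar pixels get almost none, which is what sharpens the segmentation around boundaries.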

SLIDE 42

Examples

SLIDE 43

Mask R-CNN

State-of-the-art in instance segmentation. Heavily relies on Faster R-CNN. Can work with different architectures, also ResNet. Runs at 195 ms per image on an Nvidia Tesla M40 GPU. Can also be used for human pose estimation.

SLIDE 44

Mask R-CNN: R-CNN + 2 layers

SLIDE 45

Mask R-CNN: ROI Align

SLIDE 46

Mask R-CNN

SLIDE 49

SINT: Siamese Networks for Tracking

While tracking, the only certainly correct training example is the first frame

- all others are inferred by the algorithm

If the “inferred positives” are correct, then the model is already good enough and no update is needed. If the “inferred positives” are incorrect, updating the model with wrong positive examples will eventually destroy the model.

Siamese Instance Search for Tracking, R. Tao, E. Gavves, A.W.M. Smeulders, CVPR 2016
SLIDE 50

Basic Idea

No model updates through time, to avoid model contamination. Instead, learn an invariance model f(x):

- invariances shared between objects
- reliable, external, rich, category-independent data

Assumption

- The appearance variations are shared among objects and categories
- Learning can be accurate enough to identify common appearance variations

Solution: use a Siamese network to compare patches between images

- Then “tracking” equals finding the most similar patch at each frame (no temporal modelling)

SLIDE 51

Training

(figure: two CNN branches f(·) embed patches x_j and x_k; a loss compares f(x_j) and f(x_k))

Marginal contrastive loss:

L(x_j, x_k, y_jk) = ½ y_jk D² + ½ (1 − y_jk) max(0, ε − D²)

with y_jk ∈ {0, 1} and D = ‖f(x_j) − f(x_k)‖₂.

Matching function (after learning): m(x_j, x_k) = f(x_j)ᵀ f(x_k)
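The loss above can be sketched directly, assuming the embeddings have already been computed; the margin value is illustrative:

```python
import numpy as np

def contrastive_loss(f_j, f_k, y_jk, margin=1.0):
    """Marginal contrastive loss on two embeddings: pull matching pairs
    (y_jk = 1) together, push non-matching pairs (y_jk = 0) at least
    `margin` apart in squared distance."""
    d2 = np.sum((f_j - f_k) ** 2)          # squared Euclidean distance D^2
    return 0.5 * y_jk * d2 + 0.5 * (1 - y_jk) * max(0.0, margin - d2)

a, b = np.array([0.0, 0.0]), np.array([0.3, 0.4])  # D^2 = 0.25
match_loss = contrastive_loss(a, b, y_jk=1)        # 0.5 * 0.25  = 0.125
nonmatch_loss = contrastive_loss(a, b, y_jk=0)     # 0.5 * 0.75  = 0.375
```

A matching pair is penalized for any distance; a non-matching pair is only penalized while it is still inside the margin.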

SLIDE 55

Spatial Transform Networks

SLIDE 56

Problem

ConvNets are sometimes not robust enough to input changes

- While pooling gives some invariance, only in deeper layers is the pooling receptive field large enough for this invariance to be noteworthy
- One way to improve robustness: data augmentation

A smarter way: Spatial Transformer Networks

SLIDE 57

Basic Idea

Define a geometric transformation matrix

Θ = [θ₁₁ θ₁₂ θ₁₃; θ₂₁ θ₂₂ θ₂₃]

Four interesting transformations:

- Identity: Θ = [1 0 0; 0 1 0]
- Rotation, e.g. Θ ≈ [0.7 −0.7 0; 0.7 0.7 0] for 45°, as cos(45°) ≈ 0.7
- Zooming in, e.g. Θ = [0.5 0 0; 0 0.5 0] for 2× zooming in
- Zooming out, e.g. Θ = [2 0 0; 0 2 0] for 2× zooming out

SLIDE 58

Basic Idea

Then, define a mesh grid (x_i^t, y_i^t) on the original image and apply the geometric transformation:

(x_i^s, y_i^s) = Θ · (x_i^t, y_i^t, 1)ᵀ

Produce the new image using the transformation above and an interpolation method. Learn the parameters Θ and the mesh grid from the data: a localization network learns to predict Θ given a new image.
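The grid transformation can be sketched in NumPy, assuming the usual normalized [−1, 1] target coordinates; interpolation and the localization network are omitted:

```python
import numpy as np

def transform_grid(theta, h, w):
    """Apply a 2x3 affine matrix theta to a normalized target mesh grid,
    producing the source sampling coordinates (x_s, y_s) = theta @ [x_t, y_t, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x (h*w) homogeneous coords
    return np.asarray(theta, float) @ grid                     # 2 x (h*w) source coords

identity = [[1, 0, 0], [0, 1, 0]]
zoom_in  = [[0.5, 0, 0], [0, 0.5, 0]]   # sample only the central half: 2x zoom
g_id   = transform_grid(identity, 4, 4)
g_zoom = transform_grid(zoom_in, 4, 4)
```

With the zoom matrix, every sampling coordinate is halved, so the output image samples only the central region of the input, which is exactly the 2× zoom-in case from the previous slide.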

SLIDE 59

Sequential data

SLIDE 62

Or …

What about inputs that appear in sequences, such as text? Could neural networks handle such modalities?

SLIDE 63

Memory needed

What about inputs that appear in sequences, such as text? Could neural networks handle such modalities?

Pr(x) = ∏_t Pr(x_t | x_1, …, x_{t−1})

SLIDE 64

Recurrent Networks

Simplest model

- Input x_t with parameters U
- Memory embedding c_t with parameters W
- Output y_t with parameters V

(figure: input x_t → [input parameters U] → memory c_t → [output parameters V] → output y_t; W are the memory parameters)


SLIDE 66

Recurrent Networks

Simplest RNN

- Input x_t with parameters U
- Memory embedding c_t with parameters W
- Output y_t with parameters V

(figure: the network unrolled over steps t−3, t−2, t−1, t; the same U, W, V are reused at every step)

SLIDE 67

Folding the memory

(figure: the unrolled/unfolded network over steps t, t−1, t−2 versus the equivalent folded network, where the memory c_t feeds back into itself through W)

Unrolled/Unfolded Network vs. Folded Network
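The folded recurrence can be sketched as a loop that reuses the same U, W, V at every step. The dimensions and random weights here are illustrative assumptions:

```python
import numpy as np

def rnn_forward(xs, U, W, V, c0):
    """Unrolled simple RNN: the same parameters U (input), W (memory),
    V (output) are shared across all time steps."""
    c, ys = c0, []
    for x in xs:
        c = np.tanh(U @ x + W @ c)   # memory update: new state from input and old state
        ys.append(V @ c)             # output at this step
    return np.stack(ys), c

rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
xs = rng.normal(size=(5, 2))         # a sequence of 5 two-dimensional inputs
ys, c_last = rnn_forward(xs, U, W, V, np.zeros(3))
```

Unrolling this loop for T steps gives exactly the unfolded network of the figure: T copies of the same cell, one per step.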

SLIDE 68

RNN vs NN

What is really different?

- Steps instead of layers
- Step parameters are shared, whereas in a multi-layer network they are different

(figure: a 3-gram unrolled recurrent network, with “layer/step” 1–3 sharing the same parameters, versus a 3-layer neural network with different parameters per layer)

SLIDE 69

Training an RNN

Cross-entropy loss:

P = ∏_{t,k} y_{t,k}^{l_{t,k}}  ⇒  ℒ = −log P = ∑_t ℒ_t,  with ℒ_t = −(1/T) ∑_k l_{t,k} log y_{t,k}

Backpropagation Through Time (BPTT). Be careful of the recursion: the non-linearity is influencing itself, so the gradient at one time step depends on the gradients at previous time steps.

- Like in a NN → chain rule
- Only difference: gradients survive over time steps

SLIDE 70

RNN Gradients

∂ℒ_t/∂W = ∑_{τ=1}^{t} (∂ℒ_t/∂y_t) (∂y_t/∂c_t) (∂c_t/∂c_τ) (∂c_τ/∂W),  with  ∂c_t/∂c_τ = ∏_{k=τ+1}^{t} ∂c_k/∂c_{k−1} ≤ λ^{t−τ}

The RNN gradient is a recursive product of ∂c_k/∂c_{k−1}.

SLIDE 71

Vanishing/Exploding gradients

∂ℒ_t/∂c_τ = (∂ℒ_t/∂c_t) · (∂c_t/∂c_{t−1}) · (∂c_{t−1}/∂c_{t−2}) · … · (∂c_{τ+1}/∂c_τ)

If every factor is < 1:  ‖∂ℒ/∂W‖ ≪ 1  ⟹  vanishing gradient

If every factor is > 1:  ‖∂ℒ/∂W‖ ≫ 1  ⟹  exploding gradient
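The effect is easy to see numerically. In this toy sketch each per-step Jacobian is just a scalar w, so the backpropagated gradient over T steps is w**T; the factors 0.9 and 1.1 are illustrative:

```python
import numpy as np

def gradient_norm_through_time(w, steps):
    """The backpropagated RNN gradient is a product of per-step Jacobians;
    here each Jacobian is a scalar w, so the product is w ** steps."""
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

vanish  = gradient_norm_through_time(0.9, 100)   # factors < 1 -> gradient ~ 0
explode = gradient_norm_through_time(1.1, 100)   # factors > 1 -> gradient blows up
```

Even factors barely below or above 1 drive the 100-step gradient to essentially zero or to huge values, which is why plain RNNs struggle with long sequences.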

SLIDE 72

RNN & Chaotic Systems

The latent memory space is composed of multiple dimensions. A subspace of the memory state space can store information if multiple basins of attraction exist in some dimensions. Gradients must be strong near the basin boundaries.


SLIDE 74

Advanced RNN: LSTM

σ ∈ (0, 1): control gate, something like a switch. tanh ∈ (−1, 1): recurrent nonlinearity.

i = σ(x_t U^(i) + h_{t−1} W^(i))
f = σ(x_t U^(f) + h_{t−1} W^(f))
o = σ(x_t U^(o) + h_{t−1} W^(o))
c̃_t = tanh(x_t U^(c) + h_{t−1} W^(c))
c_t = c_{t−1} ⊙ f + c̃_t ⊙ i
h_t = tanh(c_t) ⊙ o
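One LSTM step can be sketched directly from these equations. The 4-dimensional state and the random weight dictionary `P` are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step with input (i), forget (f), output (o) gates; P holds
    the input weights U_* and recurrent weights W_* for each gate."""
    i = sigmoid(P["Ui"] @ x + P["Wi"] @ h_prev)      # how much new content to write
    f = sigmoid(P["Uf"] @ x + P["Wf"] @ h_prev)      # how much old memory to keep
    o = sigmoid(P["Uo"] @ x + P["Wo"] @ h_prev)      # how much memory to expose
    c_tilde = np.tanh(P["Uc"] @ x + P["Wc"] @ h_prev)
    c = c_prev * f + c_tilde * i                     # gated memory update
    h = np.tanh(c) * o                               # gated output
    return h, c

rng = np.random.default_rng(1)
P = {k: rng.normal(size=(4, 4)) for k in ("Ui", "Wi", "Uf", "Wf", "Uo", "Wo", "Uc", "Wc")}
h, c = lstm_step(rng.normal(size=4), np.zeros(4), np.zeros(4), P)
```

Because the memory update is additive (c_{t−1}·f + c̃_t·i) rather than a repeated matrix product, gradients can flow through c over many steps without vanishing as fast as in a plain RNN.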

SLIDE 75

Discovering structure

SLIDE 76

Standard Autoencoder

Encoder h, Decoder d

Input x → encoder → decoder → output: reconstruction x̂, with error ℒ:

ℒ = ∑_i ℓ(x_i, d(h(x_i))),  ℓ = ‖x − x̂‖²
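A minimal sketch of this reconstruction loss for a linear autoencoder; the encoder/decoder matrices and data sizes are illustrative assumptions:

```python
import numpy as np

def autoencoder_loss(X, We, Wd):
    """Reconstruction loss of a linear autoencoder: encode with We,
    decode with Wd, sum squared errors over the batch."""
    Z = X @ We          # bottleneck codes
    X_hat = Z @ Wd      # reconstructions
    return np.sum((X - X_hat) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))
We = rng.normal(size=(4, 2))                           # undercomplete: 4 -> 2 bottleneck
perfect = autoencoder_loss(X, np.eye(4), np.eye(4))    # identity mapping -> zero loss
lossy   = autoencoder_loss(X, We, We.T)                # 2-d bottleneck -> nonzero loss
```

The identity case is exactly the degenerate solution the next slide warns about: without a bottleneck or regularization, the autoencoder can reach zero loss without learning anything useful.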

SLIDE 77

Standard Autoencoder

The latent space should have fewer dimensions than the input

- Undercomplete representation
- Bottleneck architecture

Otherwise an (overcomplete) autoencoder might learn the identity function:

d ∘ h = identity  ⟹  x̂ = x  ⟹  ℒ = 0

- Assuming no regularization
- In practice it often still works, though

Also, if z = Wx + b, the (linear) autoencoder learns the same subspace as PCA

SLIDE 78

Denoising Autoencoder

Error ℒ between the output (reconstruction x̂) and the clean input x

Corrupted input x̃, obtained with noise ε: C(x̃ | x, ε)

Encoder h, Decoder d

SLIDE 79

Denoising Autoencoder

The network does not simply memorize the data

- Can even use overcomplete latent spaces

The model is forced to learn more intelligent, robust representations

- Learn to ignore noise and trivial solutions (identity)
- Focus on the “underlying” data-generation process

SLIDE 80

Variational Autoencoder

We want to model the data distribution p(x) = ∫ p_θ(z) p_θ(x|z) dz. The posterior p_θ(z|x) is intractable for complicated likelihood functions p_θ(x|z), e.g. a neural network → p(x) is also intractable. Introduce an inference machine q_φ(z|x) (e.g. another neural network) that learns to approximate the posterior p_θ(z|x).

- Since we cannot know p_θ(z|x), define a variational lower bound to optimize instead:

ℒ(θ, φ, x) = −D_KL( q_φ(z|x) ‖ p_θ(z) ) + E_{q_φ(z|x)}[ log p_θ(x|z) ]

(regularization term + reconstruction term)
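For the common choice of a diagonal-Gaussian q_φ(z|x) and a standard-normal prior p_θ(z), the regularization term has a closed form, sketched here (the `mu`/`logvar` inputs stand in for an encoder's outputs):

```python
import numpy as np

def vae_regularizer(mu, logvar):
    """Closed-form KL( q_phi(z|x) || p_theta(z) ) between a diagonal Gaussian
    N(mu, exp(logvar)) and the standard normal prior: the regularization
    term of the variational lower bound."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

kl_match = vae_regularizer(np.zeros(3), np.zeros(3))     # q equals the prior -> KL = 0
kl_off   = vae_regularizer(np.array([2.0, 0, 0]), np.zeros(3))  # shifted mean -> KL > 0
```

The KL term pulls the encoder's posterior toward the prior, while the reconstruction term pulls the decoder toward explaining the data; training balances the two.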

SLIDE 81

Examples

SLIDE 82

Generative Adversarial Networks

Composed of two successive networks

- Generator network (like the decoder half of an autoencoder)
- Discriminator network (like a ConvNet)

Learning

- Sample “noise” vectors z
- For each z the generator produces a sample x
- Make a batch where half the samples are real and half are the generated ones
- The discriminator needs to predict what is real and what is fake

SLIDE 83

(figure: noise z → Generator → sample → Discriminator)

Generative Adversarial Networks

SLIDE 84

“Police vs Thief”

Generator and discriminator networks are optimized together

- The generator (thief) tries to fool the discriminator
- The discriminator (police) tries not to get fooled by the generator

Mathematically:

min_G max_D V(D, G) = E_{x∼p_data(x)}[ log D(x) ] + E_{z∼p_z(z)}[ log(1 − D(G(z))) ]

SLIDE 85

Examples

Bedrooms

SLIDE 86

Image “arithmetics”

SLIDE 87

Take away message

Deep Learning is good not only for classifying things. Structured prediction is also possible. Multi-task structured prediction allows for unified networks. Discovering structure in data is also possible. Training neural networks on sequences is possible with recurrent nets.

SLIDE 88

Thank you!