

SLIDE 1

Learning Based Vision II

Computer Vision Fall 2018 Columbia University

SLIDE 2

Project

  • Project Proposals due October 31
  • Pick one of our suggested projects, or pitch your own
  • Must use something in this course
  • Groups of 2 strongly recommended
  • If you want help finding a team, see post on Piazza
  • We’ll give you Google Cloud credits once you turn in your project proposal

  • Details here: http://w4731.cs.columbia.edu/project
SLIDE 3

Neural Networks

SLIDE 4

[Figure: input 224×224×3 → conv1 55×55×96 → conv2 27×27×256 → conv3 13×13×384 → conv4 13×13×384 → conv5 13×13×256 → “fc6” 1×1×4096 → “fc7” 1×1×4096 → output 1×1×1000]

Red layers are followed by max pooling.

The visualization hides the dimensions of the filters.

Convolutional Network (AlexNet)

Slide credit: Deva Ramanan

SLIDE 5

x_{i+1} = w ∗ x_i

Each of the K filters w_k ∈ ℝ^{w×h×D}; the input x_i ∈ ℝ^{W×H×D}; the output x_{i+1} ∈ ℝ^{W×H×K}.

Convolutional Layer
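To make the shapes concrete, here is a minimal PyTorch sketch of such a layer (sizes are hypothetical, loosely echoing AlexNet’s conv1):

```python
import torch
import torch.nn as nn

# A convolutional layer maps x_i (W x H x D) to x_{i+1} (W x H x K)
# using K filters of size w x h x D.
D, K = 3, 96                                   # input depth, number of filters
conv = nn.Conv2d(in_channels=D, out_channels=K,
                 kernel_size=11, padding=5)    # padding keeps W x H fixed

x = torch.randn(1, D, 224, 224)                # one 224x224x3 input image
y = conv(x)
print(y.shape)                                 # torch.Size([1, 96, 224, 224])
```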

SLIDE 6

Learning

min_θ ∑_i ℒ(f(x_i; θ), y_i) + λ∥θ∥²₂

x_i: input (image)    y_i: target (labels)    θ: parameters    f(x_i; θ): prediction

ℒ: loss function, e.g. cross-entropy ℒ(z, y) = −∑_j y_j log z_j
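As a rough illustration, a single training step of this objective in PyTorch, with a hypothetical stand-in model; CrossEntropyLoss plays the role of ℒ and weight_decay the role of the λ∥θ∥² term:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in f(x; theta)
loss_fn = nn.CrossEntropyLoss()      # L(z, y) = -sum_j y_j log z_j
opt = torch.optim.SGD(model.parameters(), lr=0.01,
                      weight_decay=1e-4)        # lambda ||theta||^2 term

x = torch.randn(8, 3, 32, 32)        # inputs x_i (images)
y = torch.randint(0, 10, (8,))       # targets y_i (labels)

opt.zero_grad()
loss = loss_fn(model(x), y)          # sum over the mini-batch
loss.backward()                      # gradients of the loss w.r.t. theta
opt.step()                           # one step of the minimization over theta
```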

SLIDE 7

SLIDE 8

Slide from Rob Fergus, NYU

SLIDE 9

Let’s break them

SLIDE 10

“school bus”

SLIDE 11

“school bus” “ostrich”

SLIDE 12

[Figure: “school bus” image + perturbation (scaled for visualization) = image classified as “ostrich”]

SLIDE 13

Images on the left are correctly classified; images on the right are incorrectly classified as ostrich.

SLIDE 14

How can we find these?

max_Δ ℒ(f(x + Δ), y) − λ∥Δ∥²₂

Solve an optimization problem to find the minimal change Δ that maximizes the loss.
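A common one-step approximation of this search is the fast gradient sign method; the sketch below assumes some trained `model` and `loss_fn` (the slide’s formulation would iterate this optimization instead):

```python
import torch

def adversarial_example(model, loss_fn, x, y, eps=0.007):
    """Take one step on Delta in the direction that increases the loss,
    keeping the change small (bounded by eps per pixel)."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    delta = eps * x.grad.sign()          # small Delta that raises the loss
    return (x + delta).detach()          # looks the same, classified differently
```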

SLIDE 15

99% confidence!

Nguyen et al., “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”

SLIDE 16

99% confidence! Also 99% confidence!

Nguyen et al., “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”

SLIDE 17

Nguyen et al., “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”

SLIDE 18

Universal attacks

Moosavi-Dezfooli et al., arXiv:1610.08401

SLIDE 19

Universal attacks

The attack is agnostic to the image content.

Moosavi-Dezfooli et al., arXiv:1610.08401

SLIDE 20

Change just one pixel

Su et al, “One pixel attack for fooling deep neural networks”

SLIDE 21

In the physical world

SLIDE 22

In the 3D physical world

SLIDE 23

Neural network camouflage

https://cvdazzle.com/

SLIDE 24

Which Pixels in the Input Affect the Neuron the Most?

  • Rephrased: which pixels would make the neuron not turn on if they had been different?
  • In other words, for which inputs is ∂neuron/∂x_i large?

SLIDE 25

Typical Gradient of a Neuron

  • Visualize the gradient of a particular neuron with respect to the input x
  • Do a forward pass
  • Compute the gradient of that particular neuron using backprop
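A minimal PyTorch sketch of these two steps; the network and the choice of neuron are stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten())  # stand-in network

x = torch.randn(1, 3, 224, 224, requires_grad=True)
acts = model(x)                  # forward pass
acts[0, 42].backward()           # backprop from one particular neuron
saliency = x.grad.abs()          # large entries: pixels that affect the neuron most
```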
SLIDE 26

“Guided Backpropagation”

  • Idea: neurons act like detectors of particular image features
  • We are only interested in what image features the neuron detects, not in what kind of stuff it doesn’t detect
  • So when propagating the gradient, we set all the negative gradients to 0
  • We don’t care if a pixel “suppresses” a neuron somewhere along the path to our neuron

SLIDE 27

Guided Backpropagation

At each layer on the way back: compute the gradient, zero out the negatives, backpropagate.
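One way to realize this rule is a backward hook that clamps gradients at every ReLU; a sketch with a stand-in network (real implementations differ in details):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                      nn.Conv2d(8, 8, 3), nn.ReLU(), nn.Flatten())  # stand-in

def guided_relu(module, grad_in, grad_out):
    # Zero out negative gradients flowing back through each ReLU
    return (torch.clamp(grad_in[0], min=0.0),)

for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.register_full_backward_hook(guided_relu)

x = torch.randn(1, 3, 64, 64, requires_grad=True)
model(x)[0, 0].backward()        # guided gradient now lands in x.grad
```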

SLIDE 28

Guided Backpropagation

[Figure: standard backprop vs. guided backprop visualizations]

SLIDE 29

Guided Backpropagation

Springenberg et al., Striving for Simplicity: The All Convolutional Net (ICLR 2015 workshops)

SLIDE 30

What About Doing Gradient Descent?

  • Want to maximize the i-th output of the softmax
  • Can compute the gradient of the i-th output of the softmax with respect to the input x (the W’s and b’s are fixed to make classification as good as possible)
  • Perform gradient descent on the input
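A sketch of this procedure with a stand-in classifier and a hypothetical class index (ascent on the score is written as descent on its negative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # stand-in classifier
class_idx = 3                                                    # hypothetical class i

x = torch.zeros(1, 3, 64, 64, requires_grad=True)
opt = torch.optim.SGD([x], lr=1.0)       # optimize the image; weights stay fixed
for _ in range(100):
    opt.zero_grad()
    score = model(x)[0, class_idx]       # i-th output (pre-softmax here)
    (-score).backward()                  # descent on -score = ascent on score
    opt.step()
```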
SLIDE 31

Yosinski et al, Understanding Neural Networks Through Deep Visualization (ICML 2015)

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

SLIDE 37

Image → ConvNet → P(category)

SLIDE 38

Image → ConvNet → P(category)

What if we learn to generate adversarial examples?

SLIDE 39

Noise → ConvNet → Image → ConvNet → P(category)

What if we learn to generate adversarial examples?

SLIDE 40

Generative Adversarial Networks

Goodfellow et al. [Diagram: Noise → G → generated image → D → P(real)]
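A minimal sketch of one training step of this two-player setup, with stand-in G and D networks (real GANs use convolutional architectures and many more steps):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())      # stand-in generator
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())     # stand-in discriminator
bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

real = torch.randn(64, 784)              # stand-in for a batch of real images
fake = G(torch.randn(64, 100))           # G turns noise into images

# D: push P(real) toward 1 on real data, toward 0 on generated data
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
d_loss.backward()
opt_d.step()

# G: fool D into outputting P(real) = 1 on generated data
opt_g.zero_grad()
g_loss = bce(D(fake), torch.ones(64, 1))
g_loss.backward()
opt_g.step()
```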

SLIDE 41

Generated images

Trained with CIFAR-10

SLIDE 42

Introduced a form of ConvNet more stable under adversarial training than previous attempts.

SLIDE 43

Generator

Random uniform vector (100 numbers)

SLIDE 44

Synthesized images

SLIDE 45

Transposed-convolution

SLIDE 46

Transposed-convolution

[Figure: convolution (left) vs. transposed convolution (right)]
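A quick shape check in PyTorch: a strided convolution downsamples, and a transposed convolution with the same settings upsamples back (sizes are illustrative):

```python
import torch
import torch.nn as nn

# Convolution with stride 2 halves spatial size; transposed convolution
# with stride 2 doubles it (used by the DCGAN generator to upsample).
down = nn.Conv2d(3, 16, kernel_size=4, stride=2, padding=1)
up = nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 3, 64, 64)
h = down(x)        # torch.Size([1, 16, 32, 32])
y = up(h)          # torch.Size([1, 3, 64, 64])
```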

SLIDE 47

Generated Images

Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis

SLIDE 48

Image Interpolation

SLIDE 49

Image Interpolation

SLIDE 50

Nearest Neighbors

SLIDE 51

Nearest Neighbors

SLIDE 52

Generating Dynamics

SLIDE 53

SLIDE 54

Two components

[Diagram: Generator → network to visualize (conv1–conv5, fc6, fc7, classification layer); class unit: “car”]

SLIDE 55

Two components

[Diagram: Generator → network to visualize (conv1–conv5, fc6, fc7, classification layer); class unit: “table lamp”]

SLIDE 56

Two components

[Diagram: Generator → network to visualize (conv1–conv5, fc6, fc7, classification layer); unit to visualize: “table lamp”]

SLIDE 57

Synthesizing Images Preferred by CNN

Nguyen A, Dosovitskiy A, Yosinski J, Brox T, Clune J. (2016). “Synthesizing the preferred inputs for neurons in neural networks via deep generator networks”. arXiv:1605.09304.

ImageNet-Alexnet-final units (class units)

SLIDE 58

Where to start training?

SLIDE 59

Gradient Descent

θ ← θ − α ∂ℒ/∂θ

How to pick where to start?

SLIDE 60

Idea 0: Train many models

SLIDE 61

Drop-out regularization

(a) Standard Neural Net (b) After applying dropout.

Intuition: we should really train a family of models with different architectures and average their predictions (c.f. model averaging from machine learning).

Practical implementation: learn a single “superset” architecture that randomly removes nodes (by randomly zeroing out activations) during gradient updates.

Slide credit: Deva Ramanan
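A minimal sketch of that practical implementation (inverted dropout, written by hand here; frameworks provide this as a built-in layer):

```python
import torch

def dropout(x, p=0.5, train=True):
    # Train: randomly zero activations; rescale so the expected value
    # matches test time (inverted dropout). Test: identity.
    if not train:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1 - p)

h = torch.randn(8, 256)
h_train = dropout(h, train=True)     # a different random subnetwork each update
h_test = dropout(h, train=False)     # deterministic "averaged" network
```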

SLIDE 62

Idea 1: Carefully pick starting point

SLIDE 63

Backprop

The network applies layers f_1 … f_L with weights w_1 … w_L to the input x_0, producing intermediates x_1, …, x_L and a scalar loss z ∈ ℝ:

dz/dw_l = d/dw_l [ℓ_y ∘ f_L(· ; w_L) ∘ ⋯ ∘ f_2(· ; w_2) ∘ f_1(x_0 ; w_1)]

dz/dw_l = dz/d(vec x_L)ᵀ · d vec x_L/d(vec x_{L−1})ᵀ ⋯ d vec x_{l+1}/d(vec x_l)ᵀ · d vec x_l/d(w_l)ᵀ

Slide credit: Deva Ramanan

SLIDE 64

Idea 1: Carefully pick starting point

He et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

SLIDE 65

Exploding and vanishing gradient

  • How does the determinant of each layer’s Jacobian affect the final gradient?
  • What if the determinant is less than one?
  • What if the determinant is greater than one?

dz/dw_l = dz/d(vec x_L)ᵀ · d vec x_L/d(vec x_{L−1})ᵀ ⋯ d vec x_{l+1}/d(vec x_l)ᵀ · d vec x_l/d(w_l)ᵀ

SLIDE 66

Exploding and vanishing gradient

Source: Roger Grosse

SLIDE 67

Initialization

  • Key idea: initialize weights so that the variance of activations is one at each layer
  • You can derive what this should be for different layers and nonlinearities
  • For ReLU: w_i ∼ 𝒩(0, 2/k), b_i = 0, where k is the number of inputs to each unit

He et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
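A sketch of this initialization in PyTorch, applied to a hypothetical layer; the built-in kaiming_normal_ implements the same rule:

```python
import torch
import torch.nn as nn

# He initialization for a ReLU layer: w ~ N(0, 2/k), b = 0,
# where k is the fan-in (number of inputs to each unit).
layer = nn.Linear(4096, 1000)
k = layer.weight.shape[1]                       # fan-in
nn.init.normal_(layer.weight, mean=0.0, std=(2.0 / k) ** 0.5)
nn.init.zeros_(layer.bias)
# Equivalent built-in:
# nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
```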

SLIDE 68

Idea 2: How to maintain this throughout training?

SLIDE 69

Batch Normalization

! " ! = ! − % & ' = (" ! + *

  • %: mean of ! in mini-batch
  • &: std of ! in mini-batch
  • (: scale
  • *: shift
  • %, &: functions of !,

analogous to responses

  • (, *: parameters to be learned,

analogous to weights

Ioffe & Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. ICML 2015

SLIDE 70

Batch Normalization

2 modes of BN:

  • Train mode: μ, σ are functions of a batch of x
  • Test mode: μ, σ are pre-computed on the training set

Caution: make sure your BN usage is correct! (This has caused many bugs in my research experience!)

x̂ = (x − μ)/σ,  y = γx̂ + β

Ioffe & Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. ICML 2015

SLIDE 71

Batch Normalization

Figure credit: Ioffe & Szegedy

[Figure: accuracy vs. iteration, with and without BN; training with BN converges much faster]

Ioffe & Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. ICML 2015

SLIDE 72

Back to breaking things…

SLIDE 73

Architecture of Krizhevsky et al.

  • 8 layers total
  • Trained on ImageNet dataset [Deng et al. CVPR’09]
  • 18.2% top-5 error
  • Our reimplementation: 18.1% top-5 error

Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output

SLIDE 74

Architecture of Krizhevsky et al.

  • Remove top fully connected layer (Layer 7)
  • Drop 16 million parameters
  • Only 1.1% drop in performance!

Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Softmax Output

SLIDE 75

Architecture of Krizhevsky et al.

  • Remove both fully connected layers (Layers 6 & 7)
  • Drop ~50 million parameters
  • 5.7% drop in performance

Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Softmax Output

SLIDE 76

Architecture of Krizhevsky et al.

  • Now try removing upper feature extractor layers (Layers 3 & 4)
  • Drop ~1 million parameters
  • 3.0% drop in performance

Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output

SLIDE 77

Architecture of Krizhevsky et al.

  • Now try removing upper feature extractor layers & fully connected layers (Layers 3, 4, 6, 7)
  • Now only 4 layers
  • 33.5% drop in performance

→ Depth of network is key

Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 5: Conv + Pool → Softmax Output

SLIDE 78

[Figure 1: training error (left) and test error (right) on CIFAR-10, error (%) vs. iter. (1e4); the 56-layer net has higher error than the 20-layer net]

What should happen if I train a deeper network?

SLIDE 79

[Figure 1: training error (left) and test error (right) on CIFAR-10, error (%) vs. iter. (1e4); the 56-layer net has higher error than the 20-layer net]

What should happen if I train a deeper network?

SLIDE 80

Simply stacking layers?

[Figures: error (%) vs. iter. (1e4) for plain-20/32/44/56 nets on CIFAR-10 and plain-18/34 nets on ImageNet-1000; solid: test/val, dashed: train]

  • “Overly deep” plain nets have higher training error
  • A general phenomenon, observed in many datasets

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

SLIDE 81

[Diagram: a shallower model (18 layers) beside a deeper counterpart (34 layers), each running from 7×7 conv, 64, /2 through stacks of 3×3 convs to fc 1000; the deeper model contains “extra” layers]

  • Richer solution space
  • A deeper model should not have higher training error
  • A solution by construction: original layers copied from a learned shallower model; extra layers set as identity; at least the same training error
  • Optimization difficulties: solvers cannot find the solution when going deeper…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

SLIDE 82

Deep Residual Learning

  • Plain net: any two stacked layers

x → weight layer → relu → weight layer → relu → H(x)

H(x) is any desired mapping; hope the 2 weight layers fit H(x)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

SLIDE 83

Deep Residual Learning

  • ! " is a residual mapping w.r.t. identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

  • If identity were optimal,

easy to set weights as 0

  • If optimal mapping is closer to identity,

easier to find small fluctuations weight layer weight layer

relu relu

" # " = ! " + "

identity

" !(")

SLIDE 84

CIFAR-10 experiments

[Figures: error (%) vs. iter. (1e4) on CIFAR-10; plain-20/32/44/56 nets (left) and ResNet-20/32/44/56/110 (right); solid: test, dashed: train]

  • Deep ResNets can be trained without difficulties
  • Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

SLIDE 85

ImageNet experiments

[Figures: error (%) vs. iter. (1e4) on ImageNet; plain-18/34 nets (left) and ResNet-18/34 (right); solid: test, dashed: train]

  • Deep ResNets can be trained without difficulties
  • Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.

SLIDE 86

How much data do you need?

Systematic evaluation of CNN advances on the ImageNet

SLIDE 87

How much data do you need?

CNN Features off-the-shelf: an Astounding Baseline for Recognition

SLIDE 88

Next Class

Neural networks for visual recognition