
SLIDE 1

Language and Vision

EECS 442 – Prof. David Fouhey Winter 2019, University of Michigan

http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/

SLIDE 2

Administrivia

  • Last class!
  • Poster session later today
  • Turn in project reports anytime up until Sunday. We’ll try to grade them as they come in.
  • Fill out course feedback forms if you haven’t already
  • Enjoy your summers. Remember to relax (for everyone) and celebrate (for those graduating)

SLIDE 3

Project Reports

  • Look at the syllabus for roughly what we’re looking for. Make sure you cover everything.
  • Pictures (take up space and are really important): half my papers are pictures
  • Copy/paste your proposal and progress report in, smooth out the text, add a few results.

SLIDE 4

Quick – what’s this?

Dog image credit: T. Gupta

SLIDE 5

Previously on EECS 442

Diagram by: Karpathy, Fei-Fei

[Diagram: a weight matrix W (one row per class: cat, dog, hat weight vectors) multiplies the feature vector x = (56, 231, 24, 2, 1) from the image, giving scores: cat 96.8, dog 437.9, hat 61.95.]

The weight matrix is a collection of scoring functions, one per class. The prediction is the vector Wx, whose jth component (Wx)_j is the “score” for the jth class.

SLIDE 6

Previously on EECS 442

Converting scores to a “probability distribution”:

Scores: cat −0.9, dog 0.6, hat 0.4
exp(x): e^−0.9 = 0.41, e^0.6 = 1.82, e^0.4 = 1.49, with Σ = 3.72
Normalize: P(cat) = 0.11, P(dog) = 0.49, P(hat) = 0.40

Generally, P(class k) = exp((Wx)_k) / Σ_l exp((Wx)_l)
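A minimal NumPy sketch of this softmax computation (variable names are my own, not from the course):

    import numpy as np

    scores = np.array([-0.9, 0.6, 0.4])   # cat, dog, hat scores from the slide
    exps = np.exp(scores)                 # [0.41, 1.82, 1.49]
    probs = exps / exps.sum()             # sum = 3.72 -> [0.11, 0.49, 0.40]
    print(probs, probs.sum())             # the probabilities sum to 1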

SLIDE 7

What’s a Big Issue?

Is it a dog? Is it a hat?

SLIDE 8

Take 2

[Diagram: the same linear classifier (diagram by Karpathy, Fei-Fei). The weight matrix W (cat, dog, hat weight vectors) times the feature vector x = (56, 231, 24, 2, 1) gives cat/dog/hat scores 96.8, 437.9, 61.95. The weight matrix is a collection of scoring functions, one per class; the prediction is the vector whose jth component is the “score” for the jth class.]

SLIDE 9

Take 2

Converting scores to a “probability distribution”, take 2:

Scores: cat −1.9, dog 1.2, hat 0.9
sgm(x): P(cat) = 0.13, P(dog) = 0.77, P(hat) = 0.71

77% dog, 71% hat, 13% cat? With independent sigmoids, the probabilities no longer sum to one.
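The same numbers with independent sigmoids (a sketch with my own names), showing the outputs don’t form a single distribution:

    import numpy as np

    scores = np.array([-1.9, 1.2, 0.9])      # cat, dog, hat scores from the slide
    probs = 1.0 / (1.0 + np.exp(-scores))    # [0.13, 0.77, 0.71]
    print(probs.sum())                       # ~1.61, not 1: each class is an independent yes/no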

SLIDE 10

Hmm…

  • We’d like to say: “dog with a hat” or “husky wearing a hat” or something else.
  • Naïve approach (given N words to choose from and up to C words): one classifier per possible sentence. How many?
  • Σ_{j=1}^{C} N^j classes to choose from (≈ N^C)
  • N = 10k, C = 5 → 10^20, i.e., 100 billion billion (checked below)
  • Can’t train 100 billion billion classifiers
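A quick check of that count (a one-off sketch):

    N, C = 10_000, 5
    print(sum(N ** j for j in range(1, C + 1)))   # ~1.0e20: 100 billion billion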
SLIDE 11

Hmm…

  • Pick an N-word dictionary; call the words class 1, …, N
  • New goal: emit a sequence of C N-way classification outputs
  • Dictionary could be:
  • All the words that appear in the training set
  • All the ASCII characters
  • Typically includes special “words”: START, END, UNK (see the sketch after this list)
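A minimal sketch of building such a dictionary (toy data and names of my own):

    # Build a word -> class-index dictionary from training captions.
    special = ["START", "END", "UNK"]
    captions = ["a dog in a hat", "a husky wearing a hat"]   # toy training set

    words = sorted({w for cap in captions for w in cap.split()})
    vocab = {w: i for i, w in enumerate(special + words)}

    def encode(caption, vocab):
        """Map a caption to class indices, with UNK for unseen words."""
        ids = [vocab.get(w, vocab["UNK"]) for w in caption.split()]
        return [vocab["START"]] + ids + [vocab["END"]]

    print(encode("a giraffe in a hat", vocab))   # 'giraffe' maps to the UNK index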

SLIDE 12

Option 1 – Sequence Modeling

h_i = f(W_HX x_i + W_HH h_{i−1})

Hidden state at i is a linear function of the previous hidden state and the input at i, plus a nonlinearity.

y_i = W_YH h_i

Output at i is a linear transformation of the hidden state.

[Diagram: x_i and h_{i−1} feed h_i through W_HX and W_HH; h_i feeds y_i through W_YH.]

SLIDE 13

Option 1 – Sequence Modeling

h_i = f(W_HX x_i + W_HH h_{i−1}),   y_i = W_YH h_i

[Diagram: the same cell unrolled at steps i and i+1, sharing the same weights.]

Can stack arbitrarily to create a function of multiple inputs with multiple outputs that’s in terms of the parameters W_HX, W_HH, W_YH. (A minimal sketch follows below.)
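A minimal NumPy sketch of these equations (sizes and names are my own):

    import numpy as np

    H, X, Y = 8, 5, 3                          # hidden, input, output sizes (arbitrary)
    rng = np.random.default_rng(0)
    W_HX = rng.normal(0, 0.1, (H, X))
    W_HH = rng.normal(0, 0.1, (H, H))
    W_YH = rng.normal(0, 0.1, (Y, H))

    def rnn_forward(xs, h):
        """Run the same cell over the whole sequence, reusing W_HX, W_HH, W_YH."""
        ys = []
        for x in xs:
            h = np.tanh(W_HX @ x + W_HH @ h)   # h_i = f(W_HX x_i + W_HH h_{i-1})
            ys.append(W_YH @ h)                # y_i = W_YH h_i
        return ys, h

    ys, h = rnn_forward([rng.normal(size=X) for _ in range(4)], np.zeros(H))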

SLIDE 14

Option 1 – Sequence Modeling

h_i = f(W_HX x_i + W_HH h_{i−1}),   y_i = W_YH h_i

[Diagram: the unrolled cell again, now with a loss attached to each output, Loss i and Loss i+1.]

Can define a loss with respect to each output and differentiate with respect to all the weights: backpropagation through time.

SLIDE 15

Captioning

[Diagram: a CNN maps the image to a feature g(I) ∈ R^4096, which initializes the hidden state h_0. Given START as x_1 and then each previous word (Dog, in, a, hat) as input, the RNN emits Dog, in, a, hat, END.]

SLIDE 16

Captioning

[Diagram: zooming in on one step. Given the CNN feature and hidden state h_3, the input x_4 = “a” produces the output “hat”.]

Each step: look at the input and hidden state (more on that in a second) and decide the output. Can learn through the CNN! (A greedy decoding sketch follows below.)
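A minimal greedy-decoding sketch in this style (rnn_step, embed, and the vocab dictionaries are hypothetical stand-ins for the pieces above):

    import numpy as np

    def greedy_caption(h0, rnn_step, embed, vocab, inv_vocab, max_len=20):
        """Decode one word at a time, feeding each chosen word back as the next input."""
        h, word, out = h0, vocab["START"], []
        for _ in range(max_len):
            h, scores = rnn_step(h, embed[word])   # one RNN step: new state + word scores
            word = int(np.argmax(scores))          # greedy: always take the best word
            if word == vocab["END"]:
                break
            out.append(inv_vocab[word])
        return " ".join(out)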

SLIDE 17

Results

Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Donahue et al. TPAMI, CVPR 2015.

SLIDE 18

Results

Long-term Recurrent Convolutional Networks for Visual Recognition and Description. Donahue et al. TPAMI, CVPR 2015.

SLIDE 19

Captioning – Looking at Each Step

[Diagram: the same single decoding step as before; hidden state h_3 plus input “a” yields output “hat”.]

Why might this be better than doing billions of classification problems?

SLIDE 20

What Goes On Inside?

  • Great repo for playing with RNNs (Char-RNN)
  • https://github.com/karpathy/char-rnn
  • (Or search char-rnn numpy)
  • Tokens are just the characters that appear in the training set

Result credit: A. Karpathy

SLIDE 21

/* * If this error is set, we will need anything right after that BSD. */ static void action_new_function(struct s_stat_info *wb) { unsigned long flags; int lel_idx_bit = e->edd, *sys & ~((unsigned long) *FIRST_COMPAT); buf[0] = 0xFFFFFFFF & (bit << 4); min(inc, slist->bytes); printk(KERN_WARNING "Memory allocated %02x/%02x, " "original MLL instead\n"), min(min(multi_run - s->len, max) * num_data_in), frame_pos, sz + first_seg); div_u64_w(val, inb_p); spin_unlock(&disk->queue_lock); mutex_unlock(&s->sock->mutex); mutex_unlock(&func->mutex); return disassemble(info->pending_bh); }

Sample Trained on Linux Code

Result credit: A. Karpathy

SLIDE 22

Sample Trained on Names

Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales

Result credit: A. Karpathy

SLIDE 23

What Goes on Inside

Outputs of an RNN. Blue to red shows the timesteps where a given cell is active. What’s this?

Result credit: A. Karpathy

SLIDE 24

What Goes on Inside

Outputs of an RNN. Blue to red shows the timesteps where a given cell is active. What’s this?

Result credit: A. Karpathy

SLIDE 25

What Goes on Inside

Outputs of an RNN. Blue to red shows the timesteps where a given cell is active. What’s this?

Result credit: A. Karpathy

SLIDE 26

Nagging Detail #1 – Depth

[Diagram: an RNN unrolled one step per character over a long string; the unrolled network is very deep.]

What happens to really deep networks? Remember g^n for g ≠ 1: gradients explode / vanish. (A two-line illustration follows below.)
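A two-line illustration of the g^n point (my own numbers):

    for g in (1.1, 0.9):
        print(g, g ** 100)   # 1.1^100 ~ 1.4e4 (explodes); 0.9^100 ~ 2.7e-5 (vanishes)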

SLIDE 27

Nagging Detail #1 – Depth

  • Typically use more complex methods that better manage gradient flow (LSTM, GRU)
  • General strategy: pass the hidden state to the next timestep as unchanged as possible, only adding updates as necessary (a sketch of this gated update follows below)
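To make “as unchanged as possible” concrete, here is a minimal gated-update sketch in the GRU spirit (not the exact LSTM/GRU equations; names are my own):

    import numpy as np

    def gated_step(h_prev, x, W_z, W_h):
        """Keep most of h_prev; a learned gate z decides how much gets updated."""
        v = np.concatenate([h_prev, x])
        z = 1.0 / (1.0 + np.exp(-(W_z @ v)))    # update gate in (0, 1)
        h_tilde = np.tanh(W_h @ v)              # candidate new state
        return (1 - z) * h_prev + z * h_tilde   # mostly h_prev when z is near 0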

SLIDE 28

Nagging Detail #2

  • A dog in a hat
  • A dog wearing a hat
  • Husky wearing a hat
  • Husky holding a camera, sitting in grass
  • A dog that’s in a hat, sitting on a lawn with a camera

Lots of captions are in principle possible!

SLIDE 29

Nagging Detail #2 – Sampling

[Diagram: the first decoding step. The CNN feature initializes h_0; the input is START; the output distribution is Dog (P=0.3), A (P=0.2), Husky (P=0.15), ….]

  • Pick proportional to the probability of each word (a sampling sketch follows below)
  • Can adjust a “temperature” parameter, exp(score/t), to equalize probabilities
  • exp(5) / exp(1) → 54.6
  • exp(5/5) / exp(1/5) → 2.2
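A minimal sketch of temperature sampling (my own names; the scores would come from the RNN):

    import numpy as np

    rng = np.random.default_rng()

    def sample_word(scores, t=1.0):
        """Sample an index from softmax(scores / t); higher t flattens the distribution."""
        p = np.exp(scores / t)
        return rng.choice(len(scores), p=p / p.sum())

    scores = np.array([5.0, 1.0, 0.5])
    print(sample_word(scores, t=1.0))   # index 0 almost every time
    print(sample_word(scores, t=5.0))   # much more evenly spread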
SLIDE 30

Effect of Temperature

  • Train on essays about startups and investing
  • Normal temperature: “The surprised in investors weren’t going to raise money. I’m not the company with the time there are all interesting quickly, don’t have to get off the same programmers. There’s a super-angel round fundraising, why do you can do.”
  • Low temperature: “is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same”

SLIDE 31

Nagging Detail #2 – Sampling

[Diagram: the second decoding step. Having emitted “A” at the first step and fed it back in as x_2, the next distribution is Dog (P=0.4), Husky (P=0.3), …: the model conditions on the word already chosen.]

SLIDE 32

Nagging Detail #2 – Sampling

[Diagram: having emitted “Dog” and fed it back in, the next distribution is wearing (P=0.5), in (P=0.3), ….]

Each evaluation gives P(w_i | w_1, …, w_{i−1}). Can expand a finite tree of possibilities (beam search) and pick the most likely sequence. (A sketch follows below.)
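A minimal beam-search sketch (the step function, returning log-probabilities over the vocabulary given a prefix, is a hypothetical stand-in for the RNN):

    import numpy as np

    def beam_search(step, end_id, width=3, max_len=20):
        """Keep the `width` best prefixes by total log-probability at each step."""
        beams = [([], 0.0)]                          # (prefix, log-prob)
        for _ in range(max_len):
            candidates = []
            for prefix, lp in beams:
                if prefix and prefix[-1] == end_id:
                    candidates.append((prefix, lp))  # finished: carry over unchanged
                    continue
                logp = step(prefix)                  # log P(next word | prefix)
                for w in np.argsort(logp)[-width:]:
                    candidates.append((prefix + [int(w)], lp + float(logp[w])))
            beams = sorted(candidates, key=lambda c: c[1])[-width:]
        return max(beams, key=lambda c: c[1])[0]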

SLIDE 33

Nagging Detail #3 – Evaluation

Computer: “A husky in a hat”. Human: “A dog in a hat”. How do you decide?

1) Ask humans. Why might this be an issue?
2) In practice: use something like precision (how many generated words appear in ground-truth sentences) or recall. Details are very important to prevent gaming (e.g., “A a a a a”). (A clipped-precision sketch follows below.)
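A sketch of clipped unigram precision, the BLEU-style fix for the “A a a a a” exploit (my own minimal version):

    from collections import Counter

    def clipped_precision(generated, reference):
        """Each generated word only counts up to its frequency in the reference."""
        gen = Counter(generated.lower().split())
        ref = Counter(reference.lower().split())
        matched = sum(min(c, ref[w]) for w, c in gen.items())
        return matched / sum(gen.values())

    print(clipped_precision("A a a a a", "A dog in a hat"))        # 0.4, not 1.0
    print(clipped_precision("A husky in a hat", "A dog in a hat")) # 0.8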

SLIDE 34

More General Sequence Models

[Diagram: the words “I loved my meal here” fed one per step; a single output at the end reads “Positive Review”.]

Can have multiple inputs, single output

SLIDE 35

More General Sequence Models

[Diagram: the question “What is the dog wearing” fed one word per step; the final hidden state summarizes the question.]

Could be a feature vector!

SLIDE 36

More General Models

[Diagram: the same question “What is the dog wearing”, now producing a distribution over answers: Bat 0.03, Dolphin 0.00, Hat 0.5, …, Grass 0.2.]

SLIDE 37

Visual Question-Answering

VQA: Visual Question Answering. S. Antol, A. Agrawal et al. ICCV 2015

SLIDE 38

Top-Performing Methods

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Anderson et al. 2018.

Top methods now look at objects in the image as opposed to one big image vector.
SLIDE 39

Top-Performing Methods

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Anderson et al. 2018.

SLIDE 40

Let’s Revisit A Number

  • How many 20-word sentences with a vocabulary of 10k words are there really?
  • Is it really (10k)^20? Why not?
  • Let’s look at some giraffes (I swear this is relevant)

Giraffe example credit: L. Zitnick

SLIDE 41

What do Giraffes Do All Day?

  • A giraffe grazing in its enclosure
  • A giraffe wandering around
  • A giraffe sitting and resting

With apologies to both giraffes and people who study giraffes, I’m sure they’re fascinating

SLIDE 42

Alternate Idea – Retrieval

[Diagram: training images with captions (“Giraffe that’s out standing in its field”, “Giraffe sitting & relaxing”) and a new “test” image, all embedded as points in R^4096.]

SLIDE 43

Alternate Idea – Retrieval

[Diagram: the caption of the nearest training image in R^4096, “Giraffe sitting & relaxing”, is transferred to the test image.]

(A retrieval sketch follows below.)
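A minimal nearest-neighbor captioning sketch (random features here stand in for real CNN features):

    import numpy as np

    def retrieve_caption(test_feat, feats, captions):
        """Return the caption of the training image nearest in feature space."""
        dists = np.linalg.norm(feats - test_feat, axis=1)   # feats: (N, 4096)
        return captions[int(np.argmin(dists))]

    feats = np.random.rand(3, 4096)
    captions = ["giraffe grazing", "giraffe wandering", "giraffe sitting & relaxing"]
    print(retrieve_caption(np.random.rand(4096), feats, captions))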

SLIDE 44

Retrieval Results

Exploring Nearest Neighbor Approaches for Image Captioning. Devlin et al. 2015

SLIDE 45

Retrieval Results

Exploring Nearest Neighbor Approaches for Image Captioning. Devlin et al. 2015

SLIDE 46

Retrieval Results

  • In practice: humans don’t like retrieved captions as much
  • Can’t generate anything new!
SLIDE 47

Novel Captions

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. L. Hendricks et al. CVPR 2016.
SLIDE 48

Simple Baseline for VQA

  • Construct a vocabulary of the 5000 most frequent answers
  • Extract the information from the image, I
  • Construct an image representation using a CNN
  • Represent the question, q, with a bag of words (BoW)
  • Compute a distribution over answers, P(a | q, I) (a sketch follows below)

Zhou, Bolei, et al. "Simple baseline for visual question answering." arXiv preprint arXiv:1512.02167 (2015).

Slide credit: T. Gupta
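A minimal sketch of the baseline’s wiring (my own variable names; the real model is trained end-to-end, but the shape is concatenate, then one linear layer, then softmax):

    import numpy as np

    def vqa_baseline(img_feat, q_bow, W, b):
        """Concatenate CNN image feature and question bag-of-words; linear layer + softmax."""
        v = np.concatenate([img_feat, q_bow])
        scores = W @ v + b                    # W: (5000, d_img + d_vocab), one row per answer
        e = np.exp(scores - scores.max())     # numerically stable softmax
        return e / e.sum()                    # P(a | q, I) over the 5000 answers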

SLIDE 49

Qualitative Results

Zhou, Bolei, et al. "Simple baseline for visual question answering." arXiv preprint arXiv:1512.02167 (2015).

Slide credit: T. Gupta

SLIDE 50

Qualitative Results

  • Language prior prunes the answer space significantly

Zhou, Bolei, et al. "Simple baseline for visual question answering." arXiv preprint arXiv:1512.02167 (2015).

Slide credit: T. Gupta

SLIDE 51

Quantitative Evaluation

Evaluated on the VQA dataset

Slide credit: T. Gupta

SLIDE 52

Does the model learn to localize?

Class Activation Mapping applied to VQA Baseline

Slide credit: T. Gupta

SLIDE 53

Recent Developments

Can balance data to make things difficult

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. Goyal et al. 2017

SLIDE 54

Some Concluding Thoughts

  • Getting this right is really hard!
  • Deep learning is trying to solve any problem you pose with as little effort as possible.
  • A lot of this has to do with the data
SLIDE 55

Some Concluding Thoughts, In General

SLIDE 56

What We’ve Seen

[Diagram: pinhole camera; light passes through a small hole onto photosensitive material.]

What happens, math-wise, when you take pictures by poking a hole in barriers

SLIDE 57

What We’ve Seen

How to line up two images by finding local regions and matching them

SLIDE 58

What We’ve Seen

How to fit functions to data by computing derivatives with respect to a loss function and how that lets you learn things

[Diagram: a stack of layers, x → f(W_1 x + b_1) → f(W_2 · + b_2) → f(W_3 · + b_3).]

SLIDE 59

What We’ve Seen

How to find the motion between two images, I(x,y,t) and I(x,y,t+1), that are nearby in time

SLIDE 60

What We’ve Seen

[Diagram: epipolar geometry, with epipoles e, e′ and corresponding points p, p′.]

What happens mathematically when 2+ cameras see the same scene, and how to get depth from this

SLIDE 61

Some Take-Homes

  • Computer vision, even the most magical parts, isn’t magic
  • It’s linear algebra, data, and figuring out what problems to ask
  • The class did a great job implementing stuff and I’m looking forward to seeing the projects
  • Many of you will probably be asked to make decisions involving at least machine learning as part of your jobs down the road. Be aware!