DeViSE: A Deep Visual-Semantic Embedding Model, Frome et al., Google (PowerPoint presentation)



SLIDE 1

DeViSE: A Deep Visual-Semantic Embedding Model

Frome et al., Google Research

Presented by: Tushar Nagarajan

SLIDE 2

The year is 2012...

Krizhevsky et al. 2012

SLIDE 3

The year is 2012...

Koala? Cat? Giraffe? Yes Of course! Don’t be silly. What’s that?

SLIDE 4

The year is 2012...

Horse? :| Collect more giraffe data

SLIDE 5

Re-training networks is annoying

ImageNet 1k

  • Only 1000 classes
  • 3-year-olds already have a 1k-word vocabulary

Doesn’t scale easily

Getting data is hard

Label: “This thing”

SLIDE 6

Structure in Labels

SLIDE 7

Label Structure - Similarity

Hospital Room, Dorm Room, Crevasse, Vegetable Garden, Formal Garden, Snowfield

SUN dataset, Xiao et al.

SLIDE 8

Label Structure - Similarity

Hospital Room, Dorm Room, Crevasse, Vegetable Garden, Formal Garden, Crevasse-like?

SUN dataset, Xiao et al.

SLIDE 9

Label Structure - Similarity

similar(Crevasse, Snowfield): visual
similar(Guitar, Harp): semantic

SLIDE 10

Label Structure - Hierarchy

SLIDE 11

Label Structure - Hierarchy

Hwang et al., 2011

parents? siblings?

SLIDE 12

Does Softmax Care?

Chair, Dog, Clock

SLIDE 13

Clock, Guitar, Harp

Does Softmax Care?

Completely independent?

SLIDE 14

Clock, Guitar, Harp

Does Softmax Care?

Are labels independent? Not really: guitar and harp are more closely related than guitar and clock. Abandon softmax and move to label space.
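The independence the slide points at can be made concrete: with a one-hot cross-entropy loss over a softmax, swapping the scores of any two *wrong* labels leaves the loss unchanged, so "harp" and "clock" are penalized identically when the true class is "guitar". A minimal sketch (class names and logit values are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift by max for numerical stability
    return e / e.sum()

def cross_entropy(logits, true_idx):
    return -np.log(softmax(logits)[true_idx])

# classes: [guitar, harp, clock]; the true class is guitar (index 0)
logits = np.array([2.0, 1.5, 0.5])
swapped = np.array([2.0, 0.5, 1.5])   # harp and clock scores exchanged
# the loss is identical either way: wrong labels are interchangeable
print(np.isclose(cross_entropy(logits, 0), cross_entropy(swapped, 0)))
```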

SLIDE 15

Regress to Label Space

Step 1: Train a CNN for classification

  • Regular CNN for object classification
  • 1000 way softmax output

Hu et al., Remote Sens. 2015

SLIDE 16

Regress to Label Space

Step 2: Abandon Softmax

Step 1: Train a CNN for classification (Hu et al., Remote Sens. 2015)

SLIDE 17

Regress to Label Space

Step 2: Abandon Softmax. What regression labels?

Step 1: Train a CNN for classification (Hu et al., Remote Sens. 2015)

SLIDE 18

Clock, Guitar, Harp

Label Space

We didn’t think this through… Where do we get this space from? Hint: ImageNet classes are words!

SLIDE 19

Word Embeddings - Skip-gram

The quick brown fox jumps over the lazy dog.
Center word: fox; context: quick, brown, jumps


Mikolov et al., 2013
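The windowing in the fox example can be sketched as (center, context) pair generation, the raw training signal for a skip-gram model. A minimal sketch; the window size here is illustrative, not the paper's 20-word setting:

```python
def skipgram_pairs(tokens, window=2):
    # for each center word, emit (center, context) pairs within the window
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
fox_contexts = [c for w, c in skipgram_pairs(sentence, window=2) if w == "fox"]
print(fox_contexts)   # ['quick', 'brown', 'jumps', 'over']
```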

SLIDE 20

Word Embeddings - Skip-gram

Gender is encoded in a subspace; so is comparative/superlative information.

Mikolov et al., 2013

SLIDE 21

Word Embeddings - Skip-gram

Sebastian Ruder

SLIDE 22

Word Embeddings - Skip-gram

Step 3: Train an LM on 5.7M documents from Wikipedia

  • 20 word window
  • Hierarchical Softmax
  • 500D vectors

Q: What about multi-word classes like “snow leopard”?

Step 1: Train a CNN for classification. Step 2: Abandon Softmax. (Frome et al., 2013)

SLIDE 23

Word Embeddings - Skip-gram

Step 1: Train a CNN for classification. Step 2: Abandon Softmax. Step 3: Train a skip-gram LM.

Nearest neighbors of “tiger shark”: bull shark, blacktip shark, shark, blue shark, ...
Nearest neighbors of “car”: cars, muscle car, sports car, automobile, ...

Frome et al., 2013

SLIDE 24

Step 4: Surgery

Step 1: Train a CNN for classification. Step 2: Abandon Softmax. Step 3: Train a skip-gram LM.

Image → v_image; “Guitar” → v_label; contrastive loss between them.

SLIDE 25

Step 4: Surgery

Step 1: Train a CNN for classification. Step 2: Abandon Softmax. Step 3: Train a skip-gram LM.

Contrastive loss between v_label and v_image, with a margin against a random incorrect class.
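The margin objective on this slide can be sketched as a per-pair hinge rank loss; DeViSE sums this over sampled incorrect classes. The vectors and margin value below are illustrative:

```python
import numpy as np

def hinge_rank_loss(v_image, v_label, v_wrong, margin=0.1):
    # zero loss once the correct label outscores the wrong one by >= margin
    return max(0.0, margin - v_label @ v_image + v_wrong @ v_image)

v_image = np.array([1.0, 0.0])           # projected CNN output
v_guitar = np.array([1.0, 0.0])          # correct label embedding
v_clock = np.array([0.0, 1.0])           # random incorrect class
print(hinge_rank_loss(v_image, v_guitar, v_clock))   # 0.0: margin already satisfied
v_harp = np.array([0.95, 0.1])           # a near-miss wrong class
print(hinge_rank_loss(v_image, v_guitar, v_harp))    # positive: still inside the margin
```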

SLIDE 26

Inference - ZSL

When a new image comes in: 1. Push it through the CNN to get v_image.

SLIDE 27

Inference - ZSL

When a new image comes in: 1. Push it through the CNN to get v_image. (Label vectors: v_guitar, v_harp, v_banjo, v_violin.)

SLIDE 28

Inference - ZSL

When a new image comes in:
1. Push it through the CNN to get v_image.
2. Find the nearest v_label (v_guitar, v_harp, v_banjo, v_violin) to v_image. Potentially unseen labels!
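The inference steps above amount to a nearest-neighbor lookup over label embeddings; because unseen classes also have word vectors, the same lookup performs zero-shot prediction. A sketch using cosine similarity and toy 3-D embeddings:

```python
import numpy as np

def predict_label(v_image, label_vecs):
    # return the label whose embedding has the highest cosine similarity
    names = list(label_vecs)
    M = np.stack([label_vecs[n] for n in names])
    sims = (M @ v_image) / (np.linalg.norm(M, axis=1) * np.linalg.norm(v_image))
    return names[int(np.argmax(sims))]

label_vecs = {
    "guitar": np.array([1.0, 0.0, 0.0]),
    "harp":   np.array([0.0, 1.0, 0.0]),
    "banjo":  np.array([0.0, 0.0, 1.0]),   # could be a class never seen in training
}
v_image = np.array([0.9, 0.3, 0.1])        # projected CNN output
print(predict_label(v_image, label_vecs))  # guitar
```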

SLIDE 29

Results

SLIDE 30

Evaluation Metrics

  • Flat hit @ k: regular precision (is the true label in the top k?)
  • Hierarchical precision @ k: credit top-k predictions that are semantically close to the true label in the ImageNet hierarchy

Evaluated at k = 1, 3, 8
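Both metrics can be sketched in a few lines; the hierarchical variant below is a simplified stand-in for the paper's exact definition, with the true label's neighborhood (`valid_set`) supplied by hand rather than derived from the hierarchy:

```python
def flat_hit_at_k(ranked, true_label, k):
    # 1 if the true label is among the top-k predictions, else 0
    return int(true_label in ranked[:k])

def hier_precision_at_k(ranked, valid_set, k):
    # fraction of the top-k predictions inside the true label's
    # semantic neighborhood (e.g. nearby nodes in the ImageNet hierarchy)
    return sum(p in valid_set for p in ranked[:k]) / k

ranked = ["harp", "guitar", "banjo", "clock"]
print(flat_hit_at_k(ranked, "guitar", 1))                           # 0
print(flat_hit_at_k(ranked, "guitar", 3))                           # 1
print(hier_precision_at_k(ranked, {"guitar", "harp", "banjo"}, 3))  # 1.0
```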

SLIDE 31

Results on ImageNet

Softmax is hard to beat on raw classification over 1k classes. DeViSE gets pretty close with a regression model!

Frome et al., 2013

SLIDE 32

Results - ImageNet Classification

Hierarchical precision tells a different story: DeViSE finds labels that are semantically relevant.

Frome et al., 2013

SLIDE 33

Results - ImageNet ZSL

Correct label @1

Frome et al., 2013

garbage?

SLIDE 34

Results - ImageNet ZSL

Frome et al., 2013

SLIDE 35

Results - ImageNet ZSL

3-hop: unknown classes 3 hops away from ImageNet labels. ImageNet 21k: ALL unknown classes.

Chance: 0.00047. DeViSE does 168× better!

Frome et al., 2013

SLIDE 36

Step 1: Train a CNN for classification
Step 2: Abandon Softmax
Step 3: Train a skip-gram LM
Step 4: Surgery
Step 5: Profit?

Summary

Image → v_image; “Guitar” → v_label

The Register, 2013

SLIDE 37

Discussion

Embeddings are not fine-tuned during training; semantic similarity is a happy coincidence.

  • sim(cat, kitten) = 0.746
  • sim(cat, dog) =

Semantic similarity is also a depressing coincidence: sim(happy, depressing) = 0.761 (!!)

SLIDE 38

Discussion

Nearest neighbors of pineapple: Pineapples, papaya, mango, avocado, banana ...

Frome et al., 2013

SLIDE 39

Discussion

Categories are fine-grained; we TRUST softmax to distinguish them.

Stanford Dogs Dataset - Khosla et al., 2011

SLIDE 40

Conclusion

Label spaces embed semantic information. Shared embedding spaces provide background knowledge for ZSL.

Zedonk

SLIDE 41

Thank you

Questions?

SLIDE 42

Bonus: ConSE

Norouzi et al., 2013

Softmax scores: guitar 0.5, harp 0.2, chair 0.01
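ConSE keeps the softmax: an image is embedded as the convex combination of the top classes' word vectors, weighted by their renormalized softmax scores, then classified by the same nearest-label lookup. A sketch using this slide's scores (the 2-D word vectors are toy values):

```python
import numpy as np

def conse_embedding(probs, word_vecs, top=2):
    # convex combination of the top-scoring classes' word vectors
    names = sorted(probs, key=probs.get, reverse=True)[:top]
    w = np.array([probs[n] for n in names])
    w = w / w.sum()                          # renormalize to a convex combination
    return sum(wi * word_vecs[n] for wi, n in zip(w, names))

probs = {"guitar": 0.5, "harp": 0.2, "chair": 0.01}
word_vecs = {"guitar": np.array([1.0, 0.0]),
             "harp":   np.array([0.0, 1.0]),
             "chair":  np.array([0.5, 0.5])}
v = conse_embedding(probs, word_vecs, top=2)
print(v)   # weighted toward guitar: [5/7, 2/7]
```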