DeViSE: A Deep Visual-Semantic Embedding Model Frome et al., Google

DeViSE: A Deep Visual-Semantic Embedding Model Frome et al., Google Research Presented by: Tushar Nagarajan

The year is 2012... Krizhevsky et al. 2012

The year is 2012... Koala? Cat? Giraffe? Yes Of course! Don’t be silly. What’s that?

The year is 2012... Horse? :| Collect more giraffe data

Imagenet 1k: only 1000 classes, while 3 year olds have a 1k word vocabulary. Re-training networks is annoying. Getting data is hard. Doesn’t scale easily. Label: “This thing”

Structure in Labels

Label Structure - Similarity Hospital Room Crevasse Formal Garden Dorm Room Snowfield Vegetable Garden SUN dataset, Xiao et al.

Label Structure - Similarity Hospital Room Crevasse Formal Garden Crevasse-like? Dorm Room Vegetable Garden SUN dataset, Xiao et al.

Label Structure - Similarity Visual: similar(Crevasse, Snowfield) Semantic: similar(Guitar, Harp)

Label Structure - Hierarchy

Label Structure - Hierarchy siblings? parents? Hwang et al., 2011

Does Softmax Care? Dog Clock Chair

Does Softmax Care? Guitar Completely independent? Clock Harp

Does Softmax Care? Are labels independent? Not really - guitar and harp are more closely related than guitar and clock. Abandon softmax - move to label space. Guitar Harp Clock

Regress to Label Space Step 1: Train a CNN for classification - Regular CNN for object classification - 1000 way softmax output Hu et al., Remote Sens. 2015

Regress to Label Space Step 1: Train a CNN for classification Step 2: Abandon Softmax Hu et al., Remote Sens. 2015

Regress to Label Space Step 1: Train a CNN for classification Step 2: Abandon Softmax What regression labels? Hu et al., Remote Sens. 2015

Label Space We didn’t think this through… Where do we get this space from? Hint: Imagenet classes are words! Guitar Harp Clock

Word Embeddings - Skip-gram The quick brown fox jumps over the lazy dog. quick brown fox jumps over Mikolov et al., 2013
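The windowing on this slide can be sketched in a few lines. This is a minimal illustration, assuming a symmetric window of 2 words around "fox" (which matches the highlighted words on the slide); the function name `skipgram_pairs` is hypothetical, and the sketch only builds the (center, context) training pairs, not the skip-gram model itself.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # clip the window at the sentence start
        hi = min(len(tokens), i + window + 1)   # clip the window at the sentence end
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words the model would be asked to predict from "fox".
fox_context = [ctx for center, ctx in pairs if center == "fox"]
print(fox_context)
```

A skip-gram model is then trained to predict each context word from its center word, so words that appear in similar contexts end up with similar vectors.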

Word Embeddings - Skip-gram Gender encoded into subspace comparative - superlative info Mikolov et al., 2013

Word Embeddings - Skip-gram Sebastian Ruder

Word Embeddings - Skip-gram Step 1: Train a CNN for classification Step 2: Abandon Softmax Step 3: Train a LM on 5.7M documents from Wikipedia - 20 word window - Hierarchical Softmax - 500D vectors Q: What about multi-word classes like “snow leopard”? Frome et al., 2013

Word Embeddings - Skip-gram Step 1: Train a CNN for classification Step 2: Abandon Softmax Step 3: Train a skip-gram LM Nearest neighbors of Tiger Shark : Bull shark, Blacktip shark, Shark, Blue shark ... Nearest neighbors of Car : Cars, Muscle car, Sports car, Automobile ... Frome et al., 2013

Step 1: Train a CNN for classification Step 2: Abandon Softmax Step 3: Train a skip-gram LM Step 4: Surgery Image gives v_image, “Guitar” gives v_label; contrastive loss between v_image and v_label

Step 1: Train a CNN for classification Step 2: Abandon Softmax Step 3: Train a skip-gram LM Step 4: Surgery Contrastive loss on v_image and v_label, with a margin and a random incorrect class
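One way to read the contrastive loss on this slide is as a hinge rank loss: the image vector should score higher against its true label vector than against a randomly drawn incorrect one, by at least a margin. The sketch below is an illustration under assumptions, not the authors' code: it uses dot-product similarity, a single sampled incorrect class, toy 2D vectors, and a default margin of 0.1. The names `hinge_rank_loss` and `label_vectors` are hypothetical.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hinge_rank_loss(v_image, v_label, v_wrong, margin=0.1):
    """Zero once the true label beats the wrong label by at least `margin`."""
    return max(0.0, margin - dot(v_image, v_label) + dot(v_image, v_wrong))


# Toy label embeddings (assumed values, for illustration only).
label_vectors = {"guitar": [1.0, 0.0], "harp": [0.8, 0.6], "clock": [0.0, 1.0]}
v_image = [0.9, 0.1]          # stand-in for the CNN output after projection
true_label = "guitar"

rng = random.Random(0)        # fixed seed so the sampled class is reproducible
wrong_label = rng.choice([name for name in label_vectors if name != true_label])
loss = hinge_rank_loss(v_image, label_vectors[true_label], label_vectors[wrong_label])
print(wrong_label, loss)
```

The loss is zero for image vectors that already sit closer to the correct label, so only margin-violating pairs contribute a gradient.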

Inference - ZSL When a new image comes in: 1. Push it through the CNN, get v_image

Inference - ZSL When a new image comes in: 1. Push it through the CNN, get v_image Candidate label vectors: v_harp, v_banjo, v_violin, v_guitar

Inference - ZSL When a new image comes in: 1. Push it through the CNN, get v_image 2. Find the nearest v_label to v_image Candidate label vectors: v_harp, v_banjo, v_violin, v_guitar Potentially unseen labels!
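The two inference steps can be sketched as a nearest-neighbour lookup in the label space. This is a minimal illustration with assumed toy vectors; cosine similarity is an assumption about the distance measure, and `nearest_labels` is a hypothetical helper. In the toy data, "banjo" stands in for a label that had no training images but still has a word vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_labels(v_image, label_vectors, k=1):
    """Return the k label names whose vectors are most similar to v_image."""
    ranked = sorted(
        label_vectors,
        key=lambda name: cosine(v_image, label_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy label embeddings (assumed values); any word with a vector is a candidate.
label_vectors = {
    "guitar": [1.0, 0.1],
    "banjo": [0.9, 0.3],
    "violin": [0.7, 0.7],
    "harp": [0.2, 1.0],
}
v_image = [0.85, 0.35]   # stand-in for step 1: the CNN output for a new image

print(nearest_labels(v_image, label_vectors, k=2))
```

Because the candidate set is just a dictionary of word vectors, adding a new class only requires its embedding, not new images or retraining.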

Results

Evaluation Metrics - Flat hit @ k : Regular precision - Hierarchical precision @ k: k=1 k=3 k=8
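As a rough sketch of the first metric, flat hit @ k can be computed as the fraction of test images whose true label appears among the top k predictions. The function name `flat_hit_at_k` and the toy predictions are assumptions for illustration; hierarchical precision @ k additionally needs the label hierarchy and is not sketched here.

```python
def flat_hit_at_k(ranked_predictions, true_labels, k):
    """Fraction of examples whose true label is in the top-k predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy ranked outputs for three images (assumed values, best guess first).
ranked_predictions = [
    ["guitar", "banjo", "harp"],
    ["clock", "chair", "dog"],
    ["harp", "violin", "guitar"],
]
true_labels = ["banjo", "dog", "harp"]

for k in (1, 2, 3):
    print(k, flat_hit_at_k(ranked_predictions, true_labels, k))
```

Flat hit @ k treats every wrong label as equally wrong, which is why the hierarchical variant is reported alongside it.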

Results on Imagenet Softmax is hard to beat on raw classification on 1k classes DeViSE gets pretty close with a regression model! Frome et al., 2013

Results - Imagenet Classification Hierarchical precision tells a different story DeViSE finds labels that are semantically relevant Frome et al., 2013

Results - Imagenet ZSL Correct label @1 garbage? Frome et al., 2013

Results - Imagenet ZSL Frome et al., 2013

Results - Imagenet ZSL 3-hop: Unknown classes 3 hops away from imagenet labels Imagenet 21k: ALL unknown classes Chance: 0.00047 168x better! Frome et al., 2013

Summary Step 1: Train a CNN for classification Step 2: Abandon Softmax Step 3: Train a skip-gram LM Step 4: Surgery Step 5: Profit? Diagram: Image maps to v_image, “Guitar” maps to v_label The Register, 2013

Discussion Embeddings are not fine-tuned during training. Semantic similarity is a happy coincidence - sim(cat, kitten) = 0.746 - sim(cat, dog) = 0.761 (!!) Semantic similarity is a depressing coincidence - sim(happy, depressing) = ?

Discussion Nearest neighbors of pineapple : Pineapples, papaya, mango, avocado, banana ... Frome et al., 2013

Discussion Categories are fine-grained We TRUST softmax to distinguish them Stanford Dogs Dataset - Khosla et al., 2011

Conclusion Label spaces to embed semantic information Shared embedding spaces background knowledge for ZSL Zedonk

Thank you Questions?

Bonus: ConSE 0.2 harp 0.01 chair 0.5 guitar Norouzi et al., 2013
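ConSE (convex combination of semantic embeddings) keeps the softmax classifier and places an image in the label space by averaging the word vectors of its top predicted classes, weighted by their probabilities. The sketch below is a hedged illustration: it reads the numbers on the slide as classifier probabilities (an assumption), uses toy 2D label vectors, and the name `conse_embedding` and the choice of `top_t=2` are hypothetical.

```python
def conse_embedding(class_probs, label_vectors, top_t):
    """Probability-weighted average of the top-T predicted classes' label vectors."""
    top = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)[:top_t]
    total = sum(p for _, p in top)              # renormalise over the kept classes
    dim = len(next(iter(label_vectors.values())))
    combined = [0.0] * dim
    for name, p in top:
        for d in range(dim):
            combined[d] += (p / total) * label_vectors[name][d]
    return combined


# Probabilities as shown on the slide; label vectors are assumed toy values.
class_probs = {"harp": 0.2, "chair": 0.01, "guitar": 0.5}
label_vectors = {"harp": [0.0, 1.0], "chair": [-1.0, 0.0], "guitar": [1.0, 0.0]}

embedding = conse_embedding(class_probs, label_vectors, top_t=2)
print(embedding)
```

The resulting vector can then be matched against unseen label vectors by nearest-neighbour search, without training any regression layer.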
