

SLIDE 1

Introduction Methods Results Saliency Map

Image Identification with Natural Language Specification

Qi Feng, Donghyun Kim

Department of Computer Science, Boston University

fung@bu.edu, donhk@bu.edu

December 08, 2017

SLIDE 2

Outline

1. Introduction
2. Methods
3. Results
4. Saliency Map

SLIDE 3

Photo Search

Figure: Screenshot of a natural language search on Google Photos.

SLIDE 4

The Problem

Figure: Identification of the target image by natural language specification.

SLIDE 5

GloVe Embedding

GloVe is an unsupervised learning algorithm for obtaining vector representations of words [2].

Figure: Projection of word embeddings into 2D space.

SLIDE 6

The Baseline Model

▶ Cosine similarity
  ▶ between averages of word embeddings [2] of
    ▶ the input query
    ▶ a generated caption for an image [4]
▶ Inception v3
  ▶ pretrained on ILSVRC-2012-CLS [3]
▶ The language model
  ▶ trained for 20,000 iterations on MSCOCO [1]
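
The baseline scoring step can be sketched as follows. This is a minimal illustration, not the authors' code: the captions are assumed to have already been generated by the captioning model, the function names are hypothetical, and the 3-dimensional vectors are made-up stand-ins for real GloVe embeddings.

```python
import math


def average_embedding(tokens, embeddings):
    """Mean of the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has a known embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify_image(query, captions, embeddings):
    """Return (index of the best-matching caption, all scores)."""
    q = average_embedding(query.lower().split(), embeddings)
    scores = [
        cosine_similarity(q, average_embedding(c.lower().split(), embeddings))
        for c in captions
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy vectors standing in for GloVe embeddings (values are invented).
TOY_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "cat": [0.1, 0.9, 0.0],
    "grass": [0.2, 0.1, 0.9],
    "sofa": [0.0, 0.3, 0.8],
}

# One (hypothetical) generated caption per candidate image.
captions = ["a dog on the grass", "a cat on the sofa"]
best, scores = identify_image("puppy grass", captions, TOY_EMBEDDINGS)
```

Words without an embedding are skipped here; that choice is an assumption, since the slides do not say how out-of-vocabulary words were handled.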

SLIDE 7

The Proposed Model

[Diagram labels: Image, CNN, Visual Representation, Query, concat, Language Model (LSTM), Similarity]

Figure: The proposed model. Red rounded rectangles are inputs to the model. The blue rectangle is the intermediate result from the convolutional neural network.
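
One way to read the diagram is sketched below. The slides do not say where the concatenation happens or how the similarity is produced, so this sketch assumes that the visual representation is concatenated with each word embedding before it enters the LSTM, and that the final hidden state is mapped to a score in (0, 1) by a sigmoid. The class name, the dimensions, the missing bias terms, and the random untrained weights are all illustrative choices, not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyLSTMScorer:
    """Single-layer LSTM (no biases) followed by a sigmoid similarity score."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # One weight matrix per gate, applied to [input ; previous hidden].
        self.gates = {name: mat(hidden_dim, input_dim + hidden_dim)
                      for name in ("i", "f", "o", "g")}
        self.out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        z = x + h  # list concatenation: [input ; previous hidden]
        i = [sigmoid(v) for v in matvec(self.gates["i"], z)]    # input gate
        f = [sigmoid(v) for v in matvec(self.gates["f"], z)]    # forget gate
        o = [sigmoid(v) for v in matvec(self.gates["o"], z)]    # output gate
        g = [math.tanh(v) for v in matvec(self.gates["g"], z)]  # candidate
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def score(self, visual_features, word_vectors):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for word in word_vectors:
            # Assumed "concat": image features joined with each word vector.
            h, c = self.step(visual_features + word, h, c)
        return sigmoid(sum(w * v for w, v in zip(self.out, h)))


# 2-d stand-in for CNN features, 3-d stand-ins for word embeddings.
model = TinyLSTMScorer(input_dim=2 + 3, hidden_dim=4)
score = model.score([0.2, 0.7], [[0.1, 0.0, 0.3], [0.5, 0.2, 0.1]])
```

With untrained weights the score is meaningless; the sketch only shows how an image and a query of any length can be reduced to a single similarity value.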

SLIDE 8

Training and Testing

Figure: Positive training data.

SLIDE 9

Results

SLIDE 10

Results (cont.)

▶ The Baseline Model: 91.1%
▶ The Proposed Model: 93.4%

SLIDE 11

Excitation Back-propagation for Saliency Map

▶ Goal
  ▶ Find the salient region of the input that explains the model's prediction, using back-propagation.
▶ Assumptions
  ▶ The response of an activation neuron is non-negative.
  ▶ An activation neuron is tuned to detect certain visual features. Its response is positively correlated with its confidence in the detection.
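
The two assumptions above are what make the usual excitation backprop propagation rule well defined: each upper-layer neuron passes its "winning probability" down to its inputs in proportion to activation times positive weight, and negative (inhibitory) weights are ignored. The sketch below applies that rule to a single fully connected layer. It is a minimal illustration under those assumptions, with a hypothetical function name and made-up numbers, and it does not reproduce how the authors applied the method to their CNN and LSTM.

```python
def excitation_backprop_layer(activations, weights, top_probs):
    """Propagate winning probabilities through one fully connected layer.

    activations: non-negative responses a_j of the lower-layer neurons.
    weights:     weights[i][j] connects lower neuron j to upper neuron i.
    top_probs:   winning probability P(a_i) of each upper-layer neuron.
    Returns the winning probability P(a_j) of each lower-layer neuron.
    """
    n_in = len(activations)
    lower = [0.0] * n_in
    for i, p_top in enumerate(top_probs):
        # Only excitatory (positive-weight) connections receive probability.
        excitation = [activations[j] * max(weights[i][j], 0.0)
                      for j in range(n_in)]
        total = sum(excitation)
        if total == 0.0:
            continue  # no excitatory input: this neuron's mass is dropped
        for j in range(n_in):
            lower[j] += p_top * excitation[j] / total
    return lower


# Toy layer: 3 lower neurons, 2 upper neurons (all values are invented).
activations = [1.0, 2.0, 0.5]
weights = [
    [0.5, -1.0, 1.0],  # upper neuron 0: inhibited by lower neuron 1
    [0.0, 1.0, 0.0],   # upper neuron 1: excited only by lower neuron 1
]
lower = excitation_backprop_layer(activations, weights, top_probs=[0.75, 0.25])
```

When every upper neuron has at least one excitatory input, the lower-layer values sum to the same total as the upper-layer ones, so repeating the step layer by layer down to the input yields a saliency map that can be read as a probability distribution.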

SLIDE 12

Spatial and Temporal Saliency

Figure: Spatial and temporal saliency on MS-COCO. Original images are on the left and saliency maps on the right. The query is shown under each image. The word in red has the maximum temporal saliency.

SLIDE 13

Conclusion

▶ A model that identifies an image from a natural language specification.
▶ An RNN that measures the similarity between images and queries.
▶ Excitation back-propagation for finding spatial and temporal groundings.

SLIDE 14

[1] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.

[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[3] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[4] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. CoRR, abs/1609.06647, 2016.
