
SLIDE 1

Merging language and vision modalities: Last year's work

Raffaella Bernardi

University of Trento

November, 2017

SLIDE 2

Last time

Last time we introduced the first computational work on Language and Vision integration. Today, we look at the new tasks that have been proposed more recently.

SLIDE 3

Cross Modal Mapping

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 4

Cross Modal Mapping

Cross-modal mapping: Generalization

Angeliki Lazaridou, Elia Bruni and Marco Baroni (ACL 2014). Transferring knowledge acquired in one modality to the other one: learn to project one space onto the other, here from the visual space onto the language space. Two tasks: Zero-Shot Learning and Fast Mapping. In both tasks, the projected vector of the unseen concept is labeled with the word associated with its cosine-based nearest neighbor vector in the corresponding semantic space.

SLIDE 5

Cross Modal Mapping

Zero-Shot Learning: the task

SLIDE 7

Cross Modal Mapping

Zero-Shot Learning

Learn a classifier X → Y, s.t. X are images, Y are language vectors. Label an image of an unseen concept with the word associated to its cosine-based nearest neighbor vector in the language space. For a subset of concepts (e.g., a set of animals, a set of vehicles), we possess information related to both their linguistic and visual representations. During training, this cross-modal vocabulary is used to induce a projection function, which intuitively represents a mapping between visual and linguistic dimensions. Thus, this function, given a visual vector, returns its corresponding linguistic representation. At test time, the system is presented with a previously unseen object (e.g., wampimuk). This object is projected onto the linguistic space and associated with the word label of the nearest neighbor in that space (containing all the unseen and seen concepts).
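To make the recipe concrete, here is a minimal sketch of the zero-shot pipeline (an illustration in NumPy, not Lazaridou et al.'s exact implementation): a linear projection from visual vectors to word vectors is estimated on the seen concepts by regularized least squares, and an unseen image is then labeled with the word whose vector is its cosine-based nearest neighbor.

```python
import numpy as np

def fit_projection(V_train, W_train, reg=1.0):
    """Ridge-style least squares: find M such that V_train @ M ~ W_train.
    V_train: (n, dv) visual vectors of seen concepts
    W_train: (n, dw) word vectors of the same concepts."""
    dv = V_train.shape[1]
    return np.linalg.solve(V_train.T @ V_train + reg * np.eye(dv),
                           V_train.T @ W_train)

def label_unseen(v_img, M, word_vecs, vocab):
    """Project a visual vector into the language space and return the word
    whose vector is the cosine-based nearest neighbor."""
    proj = v_img @ M
    W = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    p = proj / np.linalg.norm(proj)
    return vocab[int(np.argmax(W @ p))]

# toy usage: random vectors stand in for CNN features and word embeddings
rng = np.random.default_rng(0)
V_train, W_train = rng.normal(size=(50, 128)), rng.normal(size=(50, 64))
M = fit_projection(V_train, W_train)
vocab = ["wampimuk", "cat", "truck"]
word_vecs = rng.normal(size=(3, 64))
print(label_unseen(rng.normal(size=128), M, word_vecs, vocab))
```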

SLIDE 11

Cross Modal Mapping

Zero-shot learning: linear mapping

SLIDE 13

Cross Modal Mapping

Zero-shot learning: example

SLIDE 15

Cross Modal Mapping

Dataset

SLIDE 17

Cross Modal Mapping

Fast Mapping

SLIDE 18

Cross Modal Mapping

Fast Mapping

Learn a word vector from just a few sentences and associate it with the referring image by exploiting its cosine-based nearest neighbor vector in the visual space. The fast mapping setting can be seen as a special case of the zero-shot task. Whereas for the latter the system assumes that all concepts have rich linguistic representations (i.e., representations estimated from a large corpus), in the case of the former, new concepts are assumed to be encountered in a limited linguistic context and therefore lack rich linguistic representations. This is operationalized by constructing the text-based vector for these concepts from a context of just a few occurrences. In this way, we simulate the first encounter of a learner with a concept that is new in both visual and linguistic terms.

New paper: Multimodal semantic learning from child-directed input. Angeliki Lazaridou, Grzegorz Chrupala, Raquel Fernandez and Marco Baroni. NAACL 2016 (short). http://clic.cimec.unitn.it/marco/publications/lazaridou-etal-multimodal-learning-from-cdi-naacl2016.pdf
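A similar sketch for fast mapping, under stated assumptions (averaging a few context-word vectors and mapping from language to vision are illustrative choices, not necessarily the paper's exact setup): the new word's text vector is built from only a few occurrences and then matched against candidate images by cosine similarity.

```python
import numpy as np

def few_shot_word_vector(contexts, emb, dim=300):
    """Text vector for a new word from only a few sentence contexts:
    here simply the average of the known context words' embeddings."""
    vecs = [emb[w] for sent in contexts for w in sent if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def retrieve_image(word_vec, M_lang2vis, image_vecs):
    """Project the word vector into the visual space and return the index of
    the cosine-nearest candidate image vector."""
    q = word_vec @ M_lang2vis
    q = q / (np.linalg.norm(q) + 1e-9)
    I = image_vecs / (np.linalg.norm(image_vecs, axis=1, keepdims=True) + 1e-9)
    return int(np.argmax(I @ q))

# toy usage: 'wampimuk' seen in two sentences, matched against three images
rng = np.random.default_rng(1)
emb = {w: rng.normal(size=300) for w in ["small", "furry", "animal", "tree", "sat"]}
contexts = [["a", "small", "furry", "animal"], ["it", "sat", "in", "a", "tree"]]
wv = few_shot_word_vector(contexts, emb)
M = rng.normal(size=(300, 128))      # stand-in for a pre-trained language-to-vision map
images = rng.normal(size=(3, 128))   # stand-in for CNN features of candidate images
print("best image:", retrieve_image(wv, M, images))
```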

SLIDE 21

Visual Phrases

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 22

Visual Phrases

Images as Visual Phrases

Given the visual representation of an object, can we “decompose” it into attribute and object? Can we learn the visual representation of attributes and learn to compose them with the visual representation of an object?

SLIDE 23

Visual Phrases

Visual Phrase: Decomposition

A. Lazaridou, G. Dinu, A. Liska, M. Baroni (TACL 2015)

First intuition: the vision and language spaces have similar structures (also w.r.t. attributes/adjectives). Second intuition: objects are bundles of attributes; hence, attributes are implicitly learned together with objects.

SLIDE 24

Visual Phrases

Decomposition Model: attribute annotation

Evaluation: (unseen) object/noun and attribute/adjective retrieval.

SLIDE 26

Visual Phrases

Images as Visual Phrases: Composition

Coloring Objects: Adjective-Noun Visual Semantic Compositionality (VL’14). D.T. Nguyen, A. Lazaridou and R. Bernardi.

1. Assumption from linguistics: adjectives are noun modifiers. They are functions from N into N.
2. From COMPOSES: adjectives can be learned from (ADJ N, N) inputs.
3. Applied to images: a Compositional Visual Model?

SLIDE 29

Visual Phrases

Visual Composition

From the visual representation: Dense-SIFT feature vectors as noun vectors (e.g. car, light); Color-SIFT feature vectors as phrase vectors (e.g. red car, red light). Learn the function (the color) that maps the noun to the phrase. Apply that function to new (unseen) objects (e.g. red truck) and retrieve the image. We compare the composed visual vector (ATT OBJ) vs. the composed linguistic vector (ADJ N) vs. the observed linguistic vector.

New paper: Misra et al. “From Red Wine to Red Tomato: Composition with Context”. CVPR 2017.
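A sketch of the composition step, under the assumption that each attribute (e.g. red) is a linear function learned from (noun vector, phrase vector) pairs; the actual feature extraction (Dense-SIFT / Color-SIFT) is outside the snippet and replaced by random toy vectors.

```python
import numpy as np

def learn_attribute_function(noun_vecs, phrase_vecs, reg=1.0):
    """Learn a linear map A for one attribute (e.g. 'red') such that
    applying A to a noun vector approximates the attribute-noun phrase vector."""
    d = noun_vecs.shape[1]
    return np.linalg.solve(noun_vecs.T @ noun_vecs + reg * np.eye(d),
                           noun_vecs.T @ phrase_vecs)

def compose(A, noun_vec):
    """Apply the attribute function to a new (unseen) noun, e.g. red(truck)."""
    return noun_vec @ A

# toy usage: train 'red' on a handful of (noun, red-noun) vector pairs,
# compose it with an unseen noun, and retrieve the closest candidate image
rng = np.random.default_rng(2)
nouns = rng.normal(size=(10, 64))        # stand-ins for Dense-SIFT vectors (car, light, ...)
red_phrases = rng.normal(size=(10, 64))  # stand-ins for Color-SIFT vectors (red car, red light, ...)
A_red = learn_attribute_function(nouns, red_phrases)
red_truck = compose(A_red, rng.normal(size=64))
candidates = rng.normal(size=(5, 64))
sims = (candidates @ red_truck) / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(red_truck))
print("retrieved image:", int(np.argmax(sims)))
```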

SLIDE 32

Tasks

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 33

Tasks

New evaluation tasks

Image Captioning (IC)
Visual Question Answering (VQA)
Visual Reasoning

SLIDE 34

Tasks

Image Captioning (IC)

SLIDE 36

Tasks

IC: Overview

Datasets: Flickr, Pascal, MS-COCO (164K images, 5 captions each). Survey: “Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures”, Bernardi et al., JAIR 2016. Very good talk by Karpathy (2015): https://www.youtube.com/watch?v=ZkY7fAoaNcg

SLIDE 38

Tasks

IC: Approaches

Approaches: Retrieve vs. Generate. Frameworks: Pipeline of predictions vs. End-to-end.

SLIDE 39

Tasks

IC approaches

Pipeline

E.g., Kulkarni et al. (2011)
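The pipeline idea can be caricatured as follows; every stage is a stub standing in for a trained component (a schematic illustration, not Kulkarni et al.'s actual system): object detection, attribute and relation prediction, then template-based realization.

```python
# Schematic pipeline captioner: each stage is a stub standing in for a trained model.

def detect_objects(image):
    # stand-in for an object detector
    return ["dog", "frisbee"]

def predict_attribute(image, obj):
    # stand-in for per-object attribute classifiers
    return {"dog": "brown", "frisbee": "yellow"}[obj]

def predict_relation(image, obj1, obj2):
    # stand-in for a preposition/relation classifier
    return "catching"

def realize(objects, attributes, relation):
    # simple template-based surface realization
    o1, o2 = objects
    return f"A {attributes[o1]} {o1} is {relation} a {attributes[o2]} {o2}."

def caption(image):
    objects = detect_objects(image)
    attributes = {o: predict_attribute(image, o) for o in objects}
    relation = predict_relation(image, *objects)
    return realize(objects, attributes, relation)

print(caption("img_001.jpg"))  # -> "A brown dog is catching a yellow frisbee."
```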

SLIDE 41

Tasks

IC approaches

End-to-end

E.g., Karpathy and Fei-Fei (2015)
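A minimal end-to-end sketch (assuming PyTorch is available) of the generic CNN-encoder / LSTM-decoder recipe, not Karpathy and Fei-Fei's exact architecture: precomputed CNN features initialize the LSTM, which is trained to predict the next caption word.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hid_dim)   # image feature -> initial hidden state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (B, feat_dim) from a pretrained CNN; captions: (B, T) word ids
        h0 = torch.tanh(self.img_proj(img_feats)).unsqueeze(0)   # (1, B, hid)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                               # (B, T, emb)
        out, _ = self.lstm(emb, (h0, c0))
        return self.out(out)                                     # (B, T, vocab) next-word scores

# toy training step with random "CNN" features and caption ids
model = CaptionModel(vocab_size=1000)
feats = torch.randn(4, 2048)
caps = torch.randint(0, 1000, (4, 12))
logits = model(feats, caps[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), caps[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```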

SLIDE 43

Tasks

IC: limitations

Evaluation measures: BLEU, ROUGE, etc., but they are not precise. No reasoning involved.
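A small illustration of the imprecision (assuming NLTK is installed): sentence-level BLEU rewards a caption that copies the reference but is false of the image far more than a correct paraphrase.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a man is riding a horse on the beach".split()
paraphrase = "someone rides an animal along the shore".split()   # true of the image
near_copy = "a man is riding a horse on the road".split()        # false of the image

smooth = SmoothingFunction().method1
print("paraphrase:", sentence_bleu([reference], paraphrase, smoothing_function=smooth))
print("near copy :", sentence_bleu([reference], near_copy, smoothing_function=smooth))
# The false near-copy scores far higher than the true paraphrase:
# n-gram overlap rewards surface similarity, not correctness or reasoning.
```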

SLIDE 45

Tasks

Visual Question Answering (VQA)

VQA: Visual Question Answering. Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh (2016)

SLIDE 46

Tasks

VQA: Overview

Datasets: DAQUAR 2014, COCO-QA, VQA, Visual7W, Visual Genome. Survey: “Visual Question Answering: A Survey of Methods and Datasets”, Wu et al. (2016).

SLIDE 47

Tasks

VQA: Model
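A minimal joint-embedding VQA sketch (assuming PyTorch), in the spirit of the common CNN+LSTM baseline rather than the exact model on the slide: the LSTM-encoded question is fused with projected image features, and the answer is predicted as a class over the most frequent answers.

```python
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, vocab_size, n_answers=1000, feat_dim=2048, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.img_proj = nn.Linear(feat_dim, hid_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, n_answers))

    def forward(self, img_feats, question_ids):
        _, (h, _) = self.lstm(self.embed(question_ids))   # h: (1, B, hid)
        q = h[-1]                                          # question encoding (B, hid)
        v = torch.tanh(self.img_proj(img_feats))           # image encoding (B, hid)
        return self.classifier(q * v)                      # fuse by elementwise product

# toy usage: the answer is predicted as one of the N most frequent training answers
model = VQABaseline(vocab_size=5000)
logits = model(torch.randn(2, 2048), torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # torch.Size([2, 1000])
```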

SLIDE 48

Tasks

VQA

Limitations

Language prior problem: blind models perform pretty well (50% accuracy on COCO-VQA!). Development of new real-image datasets: VQA2, FOIL, TDIUC, NLVR. Development of synthetic datasets: SHAPES, CLEVR, Yin and Yang.
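The language-prior point can be reproduced with a "blind" baseline that never looks at the image; this sketch predicts the most frequent training answer for a crude question type (the first two words). The data and heuristics are made up purely for illustration.

```python
from collections import Counter, defaultdict

def question_type(question):
    # crude question-type key: the first two words ("what color", "is the", ...)
    return " ".join(question.lower().split()[:2])

def fit_blind_baseline(train_pairs):
    """train_pairs: list of (question, answer). Returns type -> most common answer."""
    by_type = defaultdict(Counter)
    for q, a in train_pairs:
        by_type[question_type(q)][a] += 1
    return {t: c.most_common(1)[0][0] for t, c in by_type.items()}

def predict(model, question, default="yes"):
    # note: the image is never used
    return model.get(question_type(question), default)

# toy usage on made-up examples
train = [("Is the man smiling?", "yes"), ("Is the dog asleep?", "yes"),
         ("What color is the banana?", "yellow"), ("What color is the sky?", "blue")]
model = fit_blind_baseline(train)
print(predict(model, "Is the cat hungry?"))        # -> "yes"
print(predict(model, "What color is the grass?"))  # -> the most frequent color answer
```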

SLIDE 51

Tasks

Best performing VQA model now

SLIDE 52

Intermezzo

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 53

Intermezzo

Question Answering

History

In 1965 Simmons reviewed 15 QA systems. In the ’60s/’70s, work on QA as a front-end to databases. In the ’90s, decreased interest in NLIDB; nowadays, NLIDB on controlled natural language. In the ’90s the rise of search engines brought QA over unstructured data to re-emerge: the TREC QA track, challenges of incremental difficulty (the answer is in the document, the answer is not in the document, the answer is spread across various documents; factual questions, other types of questions, etc.), pipelines of modules.

SLIDE 54

Intermezzo

Question Answering

Sample QA pipeline architecture

[Figure: question classification, query construction, information retrieval over a document collection, answer extraction, and answer re-ranking and selection, supported by specialised KBs extracted off-line.]

In the last year: a boom of end-to-end systems, but let’s not forget the ideas proposed in the past.
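A schematic sketch of such a pipeline of modules; every component is a stub (the stopword list, keyword retrieval and candidate extraction are placeholders, not a real system), but the control flow mirrors the architecture above.

```python
# Schematic QA pipeline; every stage is a stub standing in for a real module.

def classify_question(question):
    # e.g. expected answer type: DATE, PERSON, LOCATION, ...
    return "DATE" if question.lower().startswith("when") else "OTHER"

def construct_query(question):
    # drop stopwords to build a keyword query
    stop = {"when", "did", "the", "a", "an", "of", "is", "was"}
    return [w for w in question.lower().strip("?").split() if w not in stop]

def retrieve(query, collection):
    # rank documents by naive keyword overlap
    return sorted(collection, key=lambda d: -sum(w in d.lower() for w in query))[:3]

def extract_candidates(documents, answer_type):
    # stub: pretend typed candidate answers were extracted from the documents
    return [("1969", 0.9), ("1968", 0.4)] if answer_type == "DATE" else [("unknown", 0.1)]

def answer(question, collection):
    atype = classify_question(question)
    docs = retrieve(construct_query(question), collection)
    candidates = extract_candidates(docs, atype)
    return max(candidates, key=lambda c: c[1])[0]   # re-ranking and selection

docs = ["Apollo 11 landed on the Moon in 1969.", "The Moon has no atmosphere."]
print(answer("When did Apollo 11 land on the Moon?", docs))  # -> "1969"
```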

SLIDE 55

Find their limitations

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 56

Find their limitations

Happy?

After the great success on IC and VQA, people have started proposing tasks that highlight the weaknesses of current Language and Vision models.

SLIDE 57

Find their limitations

FOIL

One image is associated with very similar captions, but one is T and the other is F.

SLIDE 58

Find their limitations

Irrelevant questions

Mahendru et al., EMNLP 2017. One caption is associated with very similar images for which the sentence is T/F.

SLIDE 59

Find their limitations

Synthetic dataset

CLEVR

SLIDE 60

New task: Visual Reasoning

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 61

New task: Visual Reasoning

Visual Reasoning

NLVR

Suhr et al. “A Corpus of Natural Language for Visual Reasoning”. ACL 2017. Binary task: T or F.

SLIDE 62

New task: Visual Reasoning

Visual Reasoning

Results

Best performing model: Neural Module Networks (Andreas et al. 2016).

SLIDE 63

Others

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 64

Others

Other applications

Spoken VQA (posted on arXiv on the 1st of May), Multimodal Machine Translation, Image Generation.

SLIDE 65

Others

Cutting-edge fancy models’ ingredients

Memory & Attention: to focus on some parts of the visual vectors (stored in the memory), e.g. by using the linguistic query to “see” the image. Generative adversarial networks (GANs): two neural networks competing against each other in a game framework. Reinforcement Learning.

SLIDE 66

Others

Attention
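A minimal sketch of soft attention over image regions (assuming PyTorch), matching the ingredient described on the previous slide: the linguistic query scores each region vector, and the softmax-weighted sum is the attended visual summary passed downstream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, region_dim=2048, query_dim=512, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, attn_dim)
        self.proj_q = nn.Linear(query_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, query):
        # regions: (B, R, region_dim) feature vectors of image regions
        # query:   (B, query_dim) encoding of the question / linguistic query
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(query).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (B, R)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)     # (B, region_dim)
        return attended, weights

# toy usage: 36 region vectors attended with a question encoding
attn = SoftAttention()
attended, weights = attn(torch.randn(2, 36, 2048), torch.randn(2, 512))
print(attended.shape, weights.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```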

SLIDE 67

Conclusion

Layout

1. Cross Modal Mapping
2. Visual Phrases
3. Tasks
4. Intermezzo
5. Find their limitations
6. New task: Visual Reasoning
7. Others
8. Conclusion

SLIDE 68

Conclusion

Conclusion

Impressive progress. Hard but fun to learn. A land of new ideas can be explored. My wish: combine language (pragmatics) with vision. In January we will start a reading group on Language and Vision. Next week (Wed. 29th): talk by Sandro and Claudio (Ravi will also join us).

SLIDE 72

Conclusion

Other Useful Links

Neural Networks

http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
http://www.iro.umontreal.ca/~bengioy/dlbook/
http://www.vlfeat.org/matconvnet/matconvnet-manual.pdf
Blog posts: http://colah.github.io/

SLIDE 73

Conclusion

Other Useful Links

Language and Vision

Describing Images in Sentences, by Julia Hockenmaier: http://nlp.cs.illinois.edu/HockenmaierGroup/EACLTutorial2014/index.html
Vision and Language Summer Schools: 2nd edition 2016 (Malta). COST-ACTION.
“Multimodal Learning and Reasoning”, Desmond Elliott, Douwe Kiela, and Angeliki Lazaridou (Tutorial at ACL 2016): http://acl2016.org/index.php?article_id=59
Ferraro, F., Mostafazadeh, N., Huang, T., Vanderwende, L., Devlin, J., Galley, M., and Mitchell, M. (2015). “A Survey of Current Datasets for Vision and Language Research”. Proceedings of EMNLP 2015.
“How we teach computers to understand pictures”, TED Talk by Fei-Fei Li.

SLIDE 74

Conclusion

Language and Vision Research Groups

Stanford Vision Lab – Fei-Fei Li: http://vision.stanford.edu/
MIT – Antonio Torralba: http://web.mit.edu/torralba/www/
University of North Carolina – Tamara Berg: http://www.tamaraberg.com/
Virginia Tech – Devi Parikh: https://filebox.ece.vt.edu/~parikh/CVL.html
CLIC (us): http://clic.cimec.unitn.it/lavi/
Edinburgh University (M. Lapata, F. Keller)
Facebook, Google, DeepMind
More on the iV&L Net COST Action: http://www.cost.eu/COST_Actions/ict/Actions/IC1307