Day 4 Lecture 3
Language and Vision
Xavier Giró-i-Nieto
Acknowledgments
Santi Pascual
In lecture D2L6 RNNs...
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
Language IN → Language OUT
Motivation
Much earlier than lecture D2L6 RNNs...
Ñeco, R.P. and Forcada, M.L. "Asynchronous translations with recurrent neural nets." In International Conference on Neural Networks (Vol. 4, pp. 2535-2540). IEEE, 1997.
Encoder-Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
For clarity, let's study a Neural Machine Translation (NMT) case: the encoder compresses the source sentence into a representation or embedding, which the decoder expands into the target sentence.
Encoder: One-hot encoding
One-hot encoding: a binary representation of the words in a vocabulary where only combinations with a single hot (1) bit and all other bits cold (0) are allowed.

Word    Binary   One-hot encoding
zero    00       0001
one     01       0010
two     10       0100
three   11       1000
Encoder: One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the vocabulary (K).

Word       One-hot encoding
economic   000010...
growth     001000...
has        100000...
slowed     000001...
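A minimal sketch of this encoding in Python (the four-word vocabulary is a toy example; real NMT vocabularies have tens of thousands of entries):

```python
import numpy as np

vocab = ["economic", "growth", "has", "slowed"]   # toy vocabulary, K = 4
K = len(vocab)
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the K-dimensional one-hot vector for `word`."""
    v = np.zeros(K)
    v[word_to_index[word]] = 1.0
    return v

# A sentence becomes a (T, K) matrix with one one-hot row per word.
X = np.stack([one_hot(w) for w in ["economic", "growth", "has", "slowed"]])

# Every pair of distinct words is at the same distance, sqrt(2):
print(np.linalg.norm(one_hot("economic") - one_hot("growth")))  # 1.4142...
```

The distance check anticipates the next slide's point: one-hot vectors carry no notion of word similarity.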
Encoder: One-hot encoding
One-hot is a very simple representation: every word is equidistant from every other word.
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Encoder: Projection to continuous space
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The one-hot vector w_i (dimension K) is linearly projected to a space of lower dimension (typically 100-500) with a matrix E of learned weights: s_i = E w_i.
Encoder: Projection to continuous space
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Projection matrix E corresponds to a fully connected layer, so its parameters are learned during training.
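A sketch of the projection with toy dimensions (K = 4 words, M = 3), which also shows why this fully connected layer reduces to a table lookup:

```python
import numpy as np

K, M = 4, 3                   # vocabulary size; embedding dim (100-500 in practice)
rng = np.random.default_rng(0)
E = rng.normal(size=(M, K))   # projection matrix, learned during training

w_i = np.zeros(K)
w_i[2] = 1.0                  # one-hot vector for the word with index 2

# Multiplying E by a one-hot vector simply selects one column of E,
# so in practice the projection is implemented as an embedding lookup.
s_i = E @ w_i
assert np.allclose(s_i, E[:, 2])
```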
Encoder: Projection to continuous space
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Sequence of words → sequence of continuous-space word representations
Encoder: Recurrence
Figure: an RNN unrolled over a sequence. Christopher Olah, “Understanding LSTM Networks” (2015)
Encoder: Recurrence
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
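A minimal sketch of the recurrence with a vanilla tanh RNN; Cho et al. actually use gated units (GRU), but the recurrence pattern is the same:

```python
import numpy as np

M, H = 3, 5                   # embedding dim, hidden-state dim (toy sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(H, H))
U = rng.normal(size=(H, M))
b = np.zeros(H)

def encode(word_vectors):
    """Run the recurrence h_t = f(h_{t-1}, s_t) over the word vectors s_1..s_T."""
    h = np.zeros(H)                       # h_0
    for s in word_vectors:
        h = np.tanh(W @ h + U @ s + b)
    return h                              # the final state summarizes the sentence

sentence = [rng.normal(size=M) for _ in range(4)]
h_T = encode(sentence)
```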
Encoder: Recurrence
(Figure: the recurrence unrolled over time, drawn as a front view and, after a 90° rotation, a side view.)
Encoder: Recurrence
(Figure: rotating the front view 90° gives the side view.) The final hidden state is the representation or embedding of the sentence.
Sentence Embedding
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014
Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states.
(Word Embeddings)
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." NIPS 2013.
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
The RNN’s internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
With z_i ready, we can score each word k in the vocabulary with a dot product between the RNN internal state z_i and the neuron weights for word k: e(k) = w_k^T z_i.
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989
...and finally normalize to word probabilities with a softmax: the probability that the i-th word is word k, given the previous words and the hidden state, is

p(w_i = k | w_{<i}, h) = exp(e(k)) / Σ_j exp(e(j))

where e(k) is the score for word k.
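A numerically stable version of this normalization, scoring a 10-word toy vocabulary in one matrix product:

```python
import numpy as np

def softmax(scores):
    """Turn raw word scores e(k) into probabilities that sum to 1."""
    scores = scores - scores.max()     # stabilize the exponentials
    exp = np.exp(scores)
    return exp / exp.sum()

rng = np.random.default_rng(0)
z_i = rng.normal(size=5)               # decoder internal state (toy size)
W_out = rng.normal(size=(10, 5))       # one weight row w_k per vocabulary word
e = W_out @ z_i                        # e(k) = w_k . z_i, for all k at once
p = softmax(e)
assert np.isclose(p.sum(), 1.0)
```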
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
More words for the decoded sentence are generated, one at a time, until an <EOS> (End Of Sentence) “word” is predicted.
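Putting the decoder pieces together, a greedy-decoding sketch with toy dimensions. All matrices here are illustrative stand-ins; real systems use gated units and beam search rather than this plain tanh RNN with argmax:

```python
import numpy as np

H, M, K = 5, 3, 10                     # state dim, embedding dim, vocab size (toy)
EOS = 0                                # index reserved for the <EOS> token
rng = np.random.default_rng(0)
W_z = rng.normal(size=(H, H))          # z_{i-1} -> z_i
U_z = rng.normal(size=(H, M))          # previous word u_{i-1} -> z_i
C = rng.normal(size=(H, H))            # sentence embedding h -> z_i
E_out = rng.normal(size=(M, K))        # target-side word embeddings
W_out = rng.normal(size=(K, H))        # scoring weights, one row w_k per word

def decode(h, max_len=20):
    """Greedily emit words until <EOS> is predicted."""
    z = np.zeros(H)
    u_prev = np.zeros(M)               # no previous word yet
    words = []
    for _ in range(max_len):
        z = np.tanh(W_z @ z + U_z @ u_prev + C @ h)  # z_i from z_{i-1}, u_{i-1}, h
        k = int(np.argmax(W_out @ z))  # argmax of scores = argmax of softmax
        if k == EOS:
            break
        words.append(k)
        u_prev = E_out[:, k]           # feed the chosen word back in
    return words

print(decode(rng.normal(size=H)))
```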
Encoder-Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Encoder-Decoder: Training
The model is trained on a dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
Encoder-Decoder: Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
Encoder-Decoder: Beyond text
Captioning: DeepImageSent
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
Captioning: DeepImageSent
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
Multimodal Recurrent Neural Network: it only takes the image features into account in the first hidden state.
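A sketch of that design choice, with hypothetical dimensions: the CNN feature vector sets the first hidden state, and the recurrence afterwards runs as a plain language model:

```python
import numpy as np

D, H, M = 4096, 5, 3                   # CNN feature dim; toy state/embedding dims
rng = np.random.default_rng(0)
W_ih = rng.normal(size=(H, D)) * 0.01  # maps CNN features to the initial state
W_z = rng.normal(size=(H, H))
U_z = rng.normal(size=(H, M))

cnn_features = rng.normal(size=D)      # stand-in for a CNN fc7 vector of the image

# The image enters only through h_0; later steps see it only via the state.
h = np.tanh(W_ih @ cnn_features)
u_prev = np.zeros(M)                   # start-token embedding
for _ in range(3):                     # a few decoding steps (word choice omitted)
    h = np.tanh(W_z @ h + U_z @ u_prev)
```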
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
Captioning: LSTM for image & video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016
Example captions. XAVI: "man has short hair", "man with short hair". AMAIA: "a woman wearing a black shirt". BOTH: "two men wearing black glasses".
Captioning (+ Retrieval): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016
Captioning: HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
(Figure: a second-layer LSTM unit reads the hidden state at t = T of each first-layer chunk of data, from t = 1 to t = T.)
Visual Question Answering
(Figure: the question "Is economic growth decreasing?" is encoded as a sequence [z_1, z_2, … z_N], the image is encoded as well, and the answer "Yes" is decoded as a sequence [y_1, y_2, … y_M].)
Visual Question Answering
Slide credit: Issey Masuda
Pipeline: extract visual features and embed the question, merge both representations, and predict the answer. Example question: "What object is flying?" Answer: "Kite".
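As the merge-and-classify pipeline above suggests, a minimal VQA model fits in a few lines of Keras. This is a sketch with hypothetical sizes (vocabulary, sequence length, feature dimension, answer set), not the architecture of any specific paper cited here:

```python
from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

VOCAB, SEQ_LEN, IMG_DIM, N_ANSWERS = 10000, 20, 4096, 1000  # hypothetical sizes

# Question branch: embed the word indices and encode them with an LSTM.
question = Input(shape=(SEQ_LEN,), dtype='int32')
q = Embedding(input_dim=VOCAB, output_dim=300)(question)
q = LSTM(256)(q)

# Image branch: precomputed CNN features (e.g. fc7), projected down.
image = Input(shape=(IMG_DIM,))
v = Dense(256, activation='relu')(image)

# Merge both modalities and treat answering as classification over answers.
merged = concatenate([q, v])
answer = Dense(N_ANSWERS, activation='softmax')(merged)

model = Model(inputs=[question, image], outputs=answer)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```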
Visual Question Answering
Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19): (1) rescale the input to 448×448; (2) take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features to the textual space of the question "q".
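A sketch of that feature-extraction step. The pooling output is a random stand-in here (a real pipeline would run VGG-19), and the 300-dimensional textual space is a hypothetical size:

```python
import numpy as np

# Stand-in for the VGG-19 last-pooling-layer output on a 448x448 image:
rng = np.random.default_rng(0)
pool5 = rng.normal(size=(512, 14, 14))

# Flatten the 14x14 spatial grid into 196 local region vectors of 512 dims.
regions = pool5.reshape(512, 196).T          # shape (196, 512)

# Project every region into the question ("q") textual space with matrix W.
q_dim = 300                                  # hypothetical textual-space dim
W = rng.normal(size=(512, q_dim)) * 0.01
region_embeddings = regions @ W              # shape (196, 300)
```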
Visual Question Answering: Grounded
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
Datasets: Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
Datasets: Microsoft SIND
Microsoft SIND
Challenge: Microsoft COCO
Captioning
Challenge: Storytelling
Storytelling
Challenge: Movie Description
Movie Description, Retrieval and Fill-in-the-blank
Challenge: Movie Question Answering
Movie Question Answering
Challenge: Visual Question Answering
Visual Question Answering
Challenge: Visual Question Answering
Results on the VQA challenge (accuracy):
Humans: 83.30%
UC Berkeley & Sony: 66.47%
Baseline LSTM&CNN: 54.06%
I. Masuda-Mora, "Open-Ended Visual Question-Answering", submitted as BSc ETSETB thesis: 53.62% [clean code in Keras, perfect for beginners!]
Baseline Nearest neighbor: 42.85%
Baseline Prior per question type: 37.47%
Baseline All yes: 29.88%
Summary
- Embedding language and vision into a shared semantic space enables learning their fusion.
- Very high interest among researchers: a great topic for your thesis.
- Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
Conclusions
New Turing test? How to evaluate AI’s image understanding?
Slide credit: Issey Masuda