
Multi-modal Reasoning: Bridging Vision and Language Heming Zhang - PowerPoint PPT Presentation


  1. Multi-modal Reasoning: Bridging Vision and Language. Heming Zhang, Media Communications Lab, University of Southern California

  2. Personal Assistant – AI Touchstone

  3. The mass of an electron is approximately 9.109 × 10^-31 kg.

  4. Has Personal Assistant Come True? (Illustration by Fiona Carswell)

  5. Vision & Language in MCL
     • Vision: object detection, semantic segmentation, video segmentation
     • Vision & Language: visual dialogue, Vision & Language navigation, multi-modal machine translation
     • Language: text classification, language graph learning

  6. What is Visual Dialogue?
     • Dialogue that is grounded in vision
     • Example caption: "A man wearing leather jacket standing next to a motorcycle"
     • Example dialogue: "Is it colored leather?" "Yes, it is." "What color is his leather?"

  7. Why Visual Dialogue?
     • Aiding visually impaired users
       – "Daisy just sent you some pictures of her new house." "Great, is the living room large?" "Yes, there is a large living room with fireplace."
     • Aiding analysts
       – "Did anyone pass the gate yesterday?" "Yes, 45 instances logged on camera." "Were any of them carrying a cardboard box?"

  8. From Information Point of View (diagram: Image, Text)

  9. Previous Work
     • Encoder-decoder framework (Das et al., 2017; Lu et al., 2017; Wu et al., 2018; etc.)
     • Diagram: inputs Q_t, I, H_t → Encoder → embedding E_t → Decoder → answer
       – Encoder: embeds image, question and dialogue history
       – Decoder: decodes the embedding to answers in natural language

  10. Previous multi-modal encoders
     • Lu et al., 2017; Wu et al., 2018; etc.
       – Use one input as guidance to compute attention on another input

  11. Attention
     • Weighted-sum over features (diagram: features h_c, weights w_c)
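The weighted sum on this slide can be illustrated with a minimal sketch. The slide only names features h_c and weights w_c; normalizing raw scores with a softmax, and the function names `softmax` and `attend`, are assumptions made for illustration rather than details taken from the talk.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1 (assumed normalization)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Return the weighted sum of the features h_c with weights w_c = softmax(scores)."""
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Three toy feature vectors h_c with equal scores, so each weight w_c is 1/3.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attend(features, [0.0, 0.0, 0.0])
```

With equal scores the result is simply the mean of the feature vectors; unequal scores shift the sum toward the higher-scoring features.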

  12. Attention with Guidance (diagram: guidance g and features h_c determine the weights w_c)
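A small sketch of guided attention, under stated assumptions: the slide shows only that a guidance vector g influences the weights w_c over the features h_c. Scoring each feature by its dot product with g and normalizing with a softmax is one common choice, assumed here for illustration; the cited encoders may use a learned scoring function instead. The name `guided_attention` and the toy vectors are hypothetical.

```python
import math

def guided_attention(features, guidance):
    """Weight each feature h_c by its (assumed) dot-product similarity to the guidance g."""
    scores = [sum(g * x for g, x in zip(guidance, h)) for h in features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    attended = [sum(w * h[d] for w, h in zip(weights, features)) for d in range(dim)]
    return attended, weights

# Toy example: a question vector acts as guidance over two image-region features.
image_regions = [[1.0, 0.0], [0.0, 1.0]]
question_vec = [5.0, 0.0]
attended, weights = guided_attention(image_regions, question_vec)
# The first region aligns with the guidance, so it receives almost all of the weight.
```

The same function can be reused with any modality as guidance, for example the dialogue history guiding attention over question words.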

  13. Previous multi-modal encoders
     • Lu et al., 2017; Wu et al., 2018; etc.
       – Use one input as guidance to compute attention on another input
       – Process inputs sequentially in pre-defined orders

  14. Encoders with Sequential Attention
     • Lu et al., 2017 (diagram inputs: question "What color is his leather?"; caption "A man wearing leather jacket standing next to a motorcycle"; history "Is it colored leather?" "Yes, it is.")

  15. Encoders with Sequential Attention
     • Wu et al., 2018 (diagram inputs: question "What color is his leather?"; caption "A man wearing leather jacket standing next to a motorcycle"; history "Is it colored leather?" "Yes, it is.")

  16. Previous multi-modal encoders
     • Cannot accommodate different scenarios
       – "How many people are there in the image?"
       – "Is there anything else on the table?"

  17. Adaptive reasoning (flowchart: features F_Q, F_I, F_H; three Guided Attention blocks producing f_Q,i, f_I,i, f_H,i; Comprehension, with f_QIH,i; Exploration, with f_g,i; RNN; decision "i = i_max?": No, continue Reasoning; Yes, output E)
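The loop structure of this flowchart can be sketched roughly as follows. Only the overall shape comes from the slide: guided attention over F_Q, F_I and F_H at each step i, a comprehension step, an exploration step, and a stop test at i = i_max. Everything else is an assumption for illustration: the dot-product attention, the element-wise mean used as a stand-in for comprehension, the simple blend used as a stand-in for exploration, the omission of the RNN, and the names `reason` and `init_guidance`.

```python
import math

def guided_attention(features, guidance):
    """Softmax-weighted sum of features, scored by dot product with the guidance (assumed)."""
    scores = [sum(g * x for g, x in zip(guidance, h)) for h in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * h[d] for w, h in zip(weights, features)) for d in range(dim)]

def reason(F_Q, F_I, F_H, init_guidance, i_max=2):
    """Run i_max reasoning steps and return the final fused embedding E."""
    if i_max < 1:
        raise ValueError("i_max must be at least 1")
    guidance = init_guidance
    for i in range(1, i_max + 1):
        # Guided attention over question, image and history features.
        f_Q = guided_attention(F_Q, guidance)
        f_I = guided_attention(F_I, guidance)
        f_H = guided_attention(F_H, guidance)
        # Placeholder for "Comprehension": fuse the attended features (element-wise mean).
        f_QIH = [(q + v + h) / 3.0 for q, v, h in zip(f_Q, f_I, f_H)]
        # Placeholder for "Exploration": derive the guidance for the next step.
        guidance = [0.5 * g + 0.5 * f for g, f in zip(guidance, f_QIH)]
    return f_QIH  # E, emitted once i reaches i_max

# Toy 2-dimensional features for question words, image regions and history rounds.
F_Q = [[1.0, 0.0], [0.0, 1.0]]
F_I = [[1.0, 1.0], [0.0, 1.0]]
F_H = [[0.0, 0.0], [1.0, 0.0]]
E = reason(F_Q, F_I, F_H, init_guidance=[1.0, 0.0], i_max=2)
```

The point of the sketch is that the guidance is recomputed at every step from the fused features, so the order in which modalities influence each other is not fixed in advance.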

  18. Attention Visualization: "Is the little boy on a beach?" / "How old does he look?"

  19. Attention Visualization: "What color hair does he have?" / "How old does he look?"

  20. Attention Visualization: "What color hair does he have?" / "Is he dressed for summer?"

  21. Attention Visualization: "What color is the airplane?" (time step i = 1)

  22. Attention Visualization: "What color is the airplane?" (time step i = 2)

  23. Qualitative results
     • Caption: "4 ducks are in a grassy island of a parking lot with their heads down"

  24. Qualitative results
     • Caption: "4 ducks are in a grassy island of a parking lot with their heads down"
     • Question: "Any grass?" Human: "Yes". Ours: "Yes, a lot of grass".
     • Question: "What color grass?" Human: "It is green with brownish dead spots". Ours: "Green and brown".

  25. Qualitative results
     • Caption: "4 ducks are in a grassy island of a parking lot with their heads down"
     • Question: "Any vehicles on the lot?" Human: "Yes". Ours: "Yes, there are a lot of cars".
     • Question: "Do they look new or old?" Human: "They look new". Ours: "They look new".

  26. IJCAI 2019: "Generative Visual Dialogue System via Weighted Likelihood Estimation". Heming Zhang, Shalini Ghosh, Larry Heck, Stephen Walsh, Junting Zhang, Jie Zhang, C.-C. Jay Kuo. Thursday, Aug. 15th, 09:30 - 10:30 AM. Session: CV|LV - Language and Vision 2 (2501-2502)

  27. Vision-grounded Problems Revisited
     • What is visual dialogue?
     • Dialogue that is grounded in vision
     • Example caption: "A man wearing leather jacket standing next to a motorcycle"
     • Example dialogue: "Is it colored leather?" "Yes, it is." "What color is his leather?"

  28. Vision-grounded Problems Revisited
     • From information point of view (diagram: Image, Text)

  29. Vision-grounded Problems Revisited
     • No alignment between image & text manifolds (diagram: Image, Text; feature extractors shown: SIFT, RNN, BoW, Transformer, CNN, …)

  30. Bridging Vision & Language (diagram: Image ? Text)

  31. Bridging Vision & Language
     • Manifold alignment (diagram: Image, Joint, Text)

  32. Bridging Vision & Language
     • Usually one-to-one mapping in other manifold alignment problems
       – E.g. machine translation (diagram: English, Joint, Dutch)
       – "I like you" and "Ik hou van jou"
       – "I take you with me" and "Ik neem je mee"

  33. Bridging Vision & Language
     • Alignment between vision and language (diagram: Image, Joint, Text)
       – No one-to-one mapping

  34. Attention Revisited
     • Weighted-sum over features (diagram: features h_c, weights w_c)

  35. Bridging Vision & Language
     • Alignment by attention (diagram: Image, Joint, Text)
       – Joint learning of attention and alignment

  36. Related Research in MCL
     • Vision: object detection, semantic segmentation, video segmentation
     • Vision & Language: visual dialogue, Vision & Language navigation, multi-modal machine translation
     • Language: text classification, language graph learning

  37. Vision-and-language Navigation
     • Instructions in natural language
       – "Walk down and turn right."
     • Surrounding environment in vision

  38. Co-attention between Vision & Language
     • "Leave the room into the hall and go straight."
     • "Head towards the stairs."
     • "Stop on the round rug next to the flowers."

  39. Unsupervised Multi-modal Neural Machine Translation

  40. Media Communications Lab
     • Lab director: Prof. C.-C. Jay Kuo
     • Visiting scholars
     • PhD students
     • Master students

  41. Thank you for listening. Visit us at http://mcl.usc.edu/
