Visually Grounded Neural Syntax Acquisition


  1. Visually Grounded Neural Syntax Acquisition. Haoyue Shi*, Jiayuan Mao*, Kevin Gimpel, Karen Livescu. July 29th, 2019 @ ACL

  2. When we were children… A cat is on the lawn.

  3. When we were children… A cat is on the lawn. A cat sleeps outside.

  4. When we were children… A cat is on the lawn. A cat sleeps outside. A cat, as a whole, means something concrete.

  5. When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat, as a whole, means something concrete.

  6. When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat, as a whole, means something concrete. A cat, as a whole, functions as a single unit in sentences.

  7. When we were children… A cat is on the lawn. A cat is staring at you. A cat plays with a ball. A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground. A cat was chasing a mouse. A dog was chasing a cat. A cat was chased by a dog. … A cat, as a whole, functions as a single unit in sentences.

  8. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? Figure credit: Ding et al. (2018)

  9. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? A cat is on the lawn

  10. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? A cat is on the lawn

  11. Problem Definition • Given a large set of parallel image-text data (e.g., MSCOCO), can we generate linguistically plausible structure for the text? A cat is on the lawn

  12. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. [Model diagram] The caption “A cat is on the lawn” goes through a Text Encoder, giving constituent embeddings c1: a cat, c2: the lawn, c3: on the lawn, …; the Image goes through an Image Encoder (ResNet-101; He et al., 2015). Both live in a Joint Embedding Space, from which estimated concreteness is used as scores for the Parser, which outputs the Constituency Parse Tree.

  13. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. The caption “A cat is on the lawn” is fed to the Parser, which outputs a Constituency Parse Tree. c1: a cat; c2: the lawn; c3: on the lawn; …

  14. Greedy Bottom-Up Parser. a cat is on the lawn

  15. Greedy Bottom-Up Parser. Compute score: FFN(w_a, w_cat) = 4.5. Scores so far: 4.5. a cat is on the lawn

  16. Greedy Bottom-Up Parser. Compute score: FFN(w_cat, w_is) = 0.5. Scores so far: 4.5, 0.5. a cat is on the lawn

  17. Greedy Bottom-Up Parser. Compute score: FFN(w_is, w_on) = 1. Scores so far: 4.5, 0.5, 1. a cat is on the lawn

  18. Greedy Bottom-Up Parser. Compute score: FFN(w_on, w_the) = 1. Scores so far: 4.5, 0.5, 1, 1. a cat is on the lawn

  19. Greedy Bottom-Up Parser. Compute score: FFN(w_the, w_lawn) = 3. Scores so far: 4.5, 0.5, 1, 1, 3. a cat is on the lawn

  20. Greedy Bottom-Up Parser. Normalized to a probability distribution over adjacent pairs: 0.45, 0.05, 0.1, 0.1, 0.3. a cat is on the lawn
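
A minimal PyTorch sketch of this scoring-and-normalization step (slides 15-20). It assumes the FFN is a small two-layer network over the concatenated embeddings of each adjacent pair and that the positive scores are normalized by their sum, which matches the 4.5, 0.5, 1, 1, 3 → 0.45, 0.05, 0.1, 0.1, 0.3 example; the layer sizes and the Softplus output are assumptions, not something the slides specify.

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores every pair of adjacent constituents and normalizes the scores
    into a probability distribution (slides 15-20). The 2-layer FFN and the
    Softplus output (to keep scores positive) are assumptions."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Softplus(),  # positive scores, e.g. [4.5, 0.5, 1, 1, 3]
        )

    def forward(self, constituents: torch.Tensor) -> torch.Tensor:
        # constituents: (n, dim) embeddings of the current constituents
        pairs = torch.cat([constituents[:-1], constituents[1:]], dim=-1)
        scores = self.ffn(pairs).squeeze(-1)   # one score per adjacent pair
        return scores / scores.sum()           # e.g. [0.45, 0.05, 0.1, 0.1, 0.3]
```

For the six words of “a cat is on the lawn”, the output has five entries, one per adjacent pair.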

  21. Greedy Bottom-Up Parser. 0.45, 0.05, 0.1, 0.1, 0.3. Sample a pair to combine (training); greedily combine (inference). a cat is on the lawn

  22. Greedy Bottom-Up Parser. Textual representation: normalized sum of children. w_(a cat) = (w_a + w_cat) / ||w_a + w_cat||_2. 0.45, 0.05, 0.1, 0.1, 0.3. a cat is on the lawn

  23. Greedy Bottom-Up Parser. Textual representation: normalized sum of children. w_(a cat) = (w_a + w_cat) / ||w_a + w_cat||_2. (a cat) is on the lawn. 0.45, 0.05, 0.1, 0.1, 0.3. a cat is on the lawn

  24. Greedy Bottom-Up Parser. Compute probability: 0.25, 0.15, 0.15, 0.45. (a cat) is on the lawn. 0.45, 0.05, 0.1, 0.1, 0.3. a cat is on the lawn

  25. Greedy Bottom-Up Parser. Combine: (a cat) is on (the lawn). 0.25, 0.15, 0.15, 0.45. (a cat) is on the lawn. 0.45, 0.05, 0.1, 0.1, 0.3. a cat is on the lawn

  26. Greedy Bottom-Up Parser. Finished! ((a cat) (is (on (the lawn)))). … (a cat) is on (the lawn). 0.25, 0.15, 0.15, 0.45. (a cat) is on the lawn. 0.45, 0.05, 0.1, 0.1, 0.3. a cat is on the lawn
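
Putting slides 14-26 together, here is a minimal sketch of the greedy bottom-up loop, assuming a scorer like the one sketched above (any callable that maps the current constituent embeddings to a distribution over adjacent pairs will do); merging by the normalized sum of the children follows slide 22, and the bracketed-string output is just for illustration.

```python
import torch

def combine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Textual representation of a merged span: normalized sum of its children."""
    s = a + b
    return s / s.norm(p=2)

def parse(words, embeddings, score_fn, training: bool = False) -> str:
    """Greedy bottom-up parsing (slides 14-26).

    words:      list of tokens, e.g. ["a", "cat", "is", "on", "the", "lawn"]
    embeddings: (len(words), dim) tensor of word embeddings
    score_fn:   maps (n, dim) constituent embeddings to a probability
                distribution over the n-1 adjacent pairs
    """
    nodes = list(words)            # bracketed string per constituent
    vecs = list(embeddings)        # one embedding per constituent
    while len(nodes) > 1:
        probs = score_fn(torch.stack(vecs))
        if training:
            k = int(torch.multinomial(probs, 1))   # sample a pair to combine
        else:
            k = int(torch.argmax(probs))           # greedily combine
        nodes[k:k + 2] = [f"({nodes[k]} {nodes[k + 1]})"]
        vecs[k:k + 2] = [combine(vecs[k], vecs[k + 1])]
    return nodes[0]   # e.g. "((a cat) (is (on (the lawn))))"
```

During training a pair is sampled so that the concreteness signal from the joint embedding space (next slides) can be used to update the scorer; at inference the most probable pair is merged at every step.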

  27. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. The caption “A cat is on the lawn” is fed to the Parser, which outputs a Constituency Parse Tree. c1: a cat; c2: the lawn; c3: on the lawn; …

  28. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. The caption “A cat is on the lawn” is fed to the Parser, which outputs a Constituency Parse Tree; an Image accompanies the caption. c1: a cat; c2: the lawn; c3: on the lawn; …

  29. Visually Grounded Neural Syntax Learner • Concrete spans are more likely to be constituents. [Model diagram] The caption “A cat is on the lawn” goes through a Text Encoder, giving constituent embeddings c1: a cat, c2: the lawn, c3: on the lawn, …; the Image goes through an Image Encoder (ResNet-101; He et al., 2015). Both map into a Joint Embedding Space; the Parser outputs the Constituency Parse Tree.
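
As a concrete reading of the image side of this diagram, here is a minimal sketch that pools ResNet-101 features and projects them into the joint embedding space; only the ResNet-101 backbone is named on the slide, so the linear projection, the embedding size, and the L2 normalization are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """ResNet-101 image features mapped into the joint embedding space (slide 29).
    The linear projection head and the unit normalization are assumptions; in
    practice the backbone would be initialized with pretrained weights."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet101()        # He et al. (2015)
        backbone.fc = nn.Identity()          # expose the 2048-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> (batch, embed_dim), unit-normalized
        feats = self.proj(self.backbone(images))
        return feats / feats.norm(dim=-1, keepdim=True)
```

The constituent embeddings c1, c2, c3 from the text side are compared against these image embeddings with cosine similarity, as the next slides show.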

  30. The Joint Embedding Space. Hinge-based triplet loss between images and captions for visual-semantic embeddings (VSE; Kiros et al., 2015): ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·).

  31. The Joint Embedding Space. Hinge-based triplet loss between images and captions for visual-semantic embeddings (VSE; Kiros et al., 2015). √ matching image with “A cat is on the lawn.” ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·).

  32. The Joint Embedding Space. Hinge-based triplet loss between images and captions for visual-semantic embeddings (VSE; Kiros et al., 2015). √ matching image with “A cat is on the lawn.” × a different image with “A cat is on the lawn.” ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·).

  33. The Joint Embedding Space. Hinge-based triplet loss between images and captions for visual-semantic embeddings (VSE; Kiros et al., 2015). √ matching image with “A cat is on the lawn.” × the same image with “A dog is on the lawn.” × a different image with “A cat is on the lawn.” ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·).
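
A batched sketch of this hinge-based triplet loss, assuming the image and caption embeddings are L2-normalized so that a dot product equals cosine similarity, and that the contrastive images i′ and captions c′ are simply the other examples in the batch; the margin value is an assumption.

```python
import torch

def vse_triplet_loss(img: torch.Tensor, cap: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge-based triplet loss between image and caption embeddings (slides 30-33).

    img, cap: (batch, dim), L2-normalized; row k of img matches row k of cap.
    Contrastive images i' and captions c' are the other rows of the batch.
    """
    sim = img @ cap.t()                     # sim[i, c] = cos(image_i, caption_c)
    pos = sim.diag().unsqueeze(1)           # sim(i, c) for each matching pair
    cost_im = (sim - pos.t() + margin).clamp(min=0)   # [sim(i', c) - sim(i, c) + eps]_+
    cost_cap = (sim - pos + margin).clamp(min=0)      # [sim(i, c') - sim(i, c) + eps]_+
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # a matching pair is not its own contrastive example
    return cost_im.masked_fill(mask, 0).sum() + cost_cap.masked_fill(mask, 0).sum()
```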

  34. Concreteness Estimation in the Joint Embedding Space. Hinge-based triplet loss between images and constituents (replacing captions) for visual-semantic embeddings: ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·).

  35. Concreteness Estimation in the Joint Embedding Space. √ a cat. Hinge-based triplet loss between images and constituents (replacing captions) for visual-semantic embeddings: ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·).

  36. Concreteness Estimation in the Joint Embedding Space. √ a cat. ? on the. Hinge-based triplet loss between images and constituents (replacing captions) for visual-semantic embeddings: ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·).

  37. Concreteness Estimation in the Joint Embedding Space. √ a cat. ? on the. Hinge-based triplet loss between images and constituents (replacing captions) for visual-semantic embeddings: ℒ(i, c) = Σ_{(i′, c′) ≠ (i, c)} [sim(i′, c) − sim(i, c) + ε]₊ + [sim(i, c′) − sim(i, c) + ε]₊, where [·]₊ = max(·, 0) and sim(·, ·) = cos(·, ·). Abstractness: local hinge loss between constituents and images, abstract(c; i) = ℒ(i, c).
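
Slide 37's abstractness of a single constituent, abstract(c; i) = ℒ(i, c), can be sketched as below. The contrastive images and constituents are passed in explicitly, and turning abstractness into the concreteness score the parser uses (here simply its negation) is an assumption; the slides only say that estimated concreteness serves as the parser's scores.

```python
import torch

def abstractness(img: torch.Tensor, span: torch.Tensor,
                 neg_imgs: torch.Tensor, neg_spans: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    """Local hinge loss between one constituent and its image (slide 37):
    abstract(c; i) = L(i, c). All embeddings are assumed L2-normalized.

    img, span:           (dim,) embeddings of the matching image and constituent
    neg_imgs, neg_spans: (n, dim) contrastive images i' and constituents c'
    """
    pos = img @ span                                          # sim(i, c)
    cost_im = (neg_imgs @ span - pos + margin).clamp(min=0)   # [sim(i', c) - sim(i, c) + eps]_+
    cost_sp = (neg_spans @ img - pos + margin).clamp(min=0)   # [sim(i, c') - sim(i, c) + eps]_+
    return cost_im.sum() + cost_sp.sum()

def concreteness(img, span, neg_imgs, neg_spans, margin: float = 0.2):
    """One plausible way to turn abstractness into a concreteness score (assumption)."""
    return -abstractness(img, span, neg_imgs, neg_spans, margin)
```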
