Visually Grounded Neural Syntax Acquisition
Haoyue Shi Jiayuan Mao Kevin Gimpel Karen Livescu
July 29th, 2019 @ACL
When we were children:
A cat is on the lawn. A cat sleeps outside.
A cat is on the lawn. A cat is staring at you. A cat plays with a ball.
A cat sleeps outside. A cat is on the ground. There is a cat sleeping on the ground.
A cat was chasing a mouse. A dog was chasing a cat. A cat was chased by a dog. …
Figure credit: Ding et al. (2018)
Model overview:
Caption: "A cat is on the lawn" → Parser → Constituency Parse Tree
Constituents c1: a cat, c2: the lawn, c3: on the lawn, …
Image → Image Encoder: ResNet-101 (He et al., 2015)
Constituents → Text Encoder
The image and the constituent embeddings c1, c2, c3 are mapped into a Joint Embedding Space; estimated concreteness is fed back to the parser as scores.
Compute a score for each pair of adjacent constituents with a feed-forward network:
$\mathrm{FFN}([\mathbf{w}_a; \mathbf{w}_{cat}]) = 4.5$
$\mathrm{FFN}([\mathbf{w}_{cat}; \mathbf{w}_{is}]) = 0.5$
$\mathrm{FFN}([\mathbf{w}_{is}; \mathbf{w}_{on}]) = 1$
$\mathrm{FFN}([\mathbf{w}_{on}; \mathbf{w}_{the}]) = 1$
$\mathrm{FFN}([\mathbf{w}_{the}; \mathbf{w}_{lawn}]) = 3$
The scores are normalized to a probability distribution: (0.45, 0.05, 0.1, 0.1, 0.3).
Sample a pair to combine (training); greedily combine the most probable pair (inference).
The textual representation of the new constituent is the normalized sum of its children:
$\mathbf{w}_{a\,cat} = \dfrac{\mathbf{w}_a + \mathbf{w}_{cat}}{\lVert \mathbf{w}_a + \mathbf{w}_{cat} \rVert_2}$
Recompute the probabilities over the reduced sequence, e.g. (0.25, 0.15, 0.15, 0.45), combine again, and repeat until a single constituent remains. Finished!
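The combine-and-rescore loop above can be sketched as a toy Python implementation (hypothetical: `score_fn` stands in for the talk's trained FFN over constituent embeddings, and the scores below mirror the slide's example):

```python
import math

def normalize(scores):
    """Turn raw pair scores into a probability distribution (as on the slide)."""
    total = sum(scores)
    return [s / total for s in scores]

def combine(left, right):
    """New constituent's textual representation: normalized sum of children."""
    summed = [a + b for a, b in zip(left, right)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed]

def greedy_parse(tokens, score_fn):
    """Inference-time parsing: repeatedly merge the most probable adjacent
    pair until one constituent spans the sentence. (Training instead
    *samples* the pair to merge from the normalized distribution.)"""
    spans = list(tokens)
    while len(spans) > 1:
        probs = normalize([score_fn(spans[i], spans[i + 1])
                           for i in range(len(spans) - 1)])
        best = max(range(len(probs)), key=probs.__getitem__)
        spans[best:best + 2] = [(spans[best], spans[best + 1])]
    return spans[0]

# The slide's first step: raw scores (4.5, 0.5, 1, 1, 3) normalize to
# (0.45, 0.05, 0.1, 0.1, 0.3), so "a" and "cat" are merged first.
probs = normalize([4.5, 0.5, 1, 1, 3])
```

With a constant `score_fn` the loop degenerates to a left-branching tree, e.g. `greedy_parse(["a", "cat", "is"], lambda l, r: 1.0)` returns `(("a", "cat"), "is")`.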
Hinge-based triplet loss between images and captions for visual-semantic embeddings (VSE; Kiros et al., 2015):
$\mathcal{L}(i, c) = \sum_{(i', c') \neq (i, c)} \left[\mathrm{sim}(i', c) - \mathrm{sim}(i, c) + \epsilon\right]_+ + \left[\mathrm{sim}(i, c') - \mathrm{sim}(i, c) + \epsilon\right]_+$
where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$.
A matched pair, such as an image of a cat on a lawn with "A cat is on the lawn.", should score higher than a mismatched pair, such as the same image with "A dog is on the lawn."
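A minimal Python sketch of this loss (the toy 2-D vectors and the margin value `eps = 0.2` are assumptions for illustration, not values from the talk):

```python
import math

def cos_sim(u, v):
    """sim(., .) = cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hinge(x):
    """[.]_+ = max(., 0)."""
    return max(x, 0.0)

def vse_loss(i, c, negatives, eps=0.2):
    """Hinge-based triplet loss: the matched pair (i, c) should beat every
    mismatched image i' and caption c' by a margin eps."""
    pos = cos_sim(i, c)
    loss = 0.0
    for i_neg, c_neg in negatives:
        loss += hinge(cos_sim(i_neg, c) - pos + eps)  # contrast against other images
        loss += hinge(cos_sim(i, c_neg) - pos + eps)  # contrast against other captions
    return loss
```

A perfectly aligned pair measured against an orthogonal negative incurs zero loss; a misaligned pair is penalized on both terms.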
The same hinge-based triplet loss is applied between images and constituents (rather than full captions).
Abstractness: the local hinge loss between a constituent and images,
$\mathrm{abstract}(c; i) = \mathcal{L}(i, c)$.
Example constituent: "a cat" ✓
Concreteness is defined similarly, with the margin reversed:
$\mathrm{concrete}(c; i) = \sum_{(i', c') \neq (i, c)} \left[\mathrm{sim}(i, c) - \mathrm{sim}(i', c) - \epsilon\right]_+ + \left[\mathrm{sim}(i, c) - \mathrm{sim}(i, c') - \epsilon\right]_+$
where $[\cdot]_+ = \max(\cdot, 0)$ and $\mathrm{sim}(\cdot, \cdot) = \cos(\cdot, \cdot)$.
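The two scores can be sketched as mirror-image hinge sums (a hypothetical Python sketch; `eps = 0.2` and the toy vectors are assumptions):

```python
import math

def cos_sim(u, v):
    """sim(., .) = cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hinge(x):
    """[.]_+ = max(., 0)."""
    return max(x, 0.0)

def abstractness(c, i, negatives, eps=0.2):
    """abstract(c; i): high when mismatched pairs score nearly as well as (i, c)."""
    pos = cos_sim(i, c)
    return sum(hinge(cos_sim(i_n, c) - pos + eps) +
               hinge(cos_sim(i, c_n) - pos + eps)
               for i_n, c_n in negatives)

def concreteness(c, i, negatives, eps=0.2):
    """concrete(c; i): signs flipped, so it is high when the matched pair
    beats mismatched pairs by more than the margin."""
    pos = cos_sim(i, c)
    return sum(hinge(pos - cos_sim(i_n, c) - eps) +
               hinge(pos - cos_sim(i, c_n) - eps)
               for i_n, c_n in negatives)
```

A constituent that aligns with its image while all negatives are orthogonal gets high concreteness and zero abstractness.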
Fact #1: On is the head of on the lawn.
Fact #2: English is strongly head-initial; many other Indo-European languages are head-initial as well.
Fact #3: Under visual grounding, most abstract words are function words (e.g., prepositions, determiners, complementizers).
Empirical solution (Head-Initial; HI): discourage abstract words from combining to the front, i.e., penalize merges whose right child is abstract.
Without HI, a constituent's reward is its concreteness: $\mathrm{reward}(c) = \mathrm{concrete}(c; i)$.
With HI, for $c = [c_{\mathit{left}}; c_{\mathit{right}}]$:
$\mathrm{reward}(c) = \dfrac{\mathrm{concrete}(c; i)}{\lambda \cdot \mathrm{abstract}(c_{\mathit{right}}; i) + 1}, \quad \lambda > 0$
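In score form the HI adjustment is just a division (a minimal sketch; the example scores and the default λ are made up for illustration):

```python
def reward(concrete_c):
    """Plain VG-NSL reward: the merged constituent's concreteness."""
    return concrete_c

def reward_hi(concrete_c, abstract_right, lam=1.0):
    """VG-NSL+HI: divide by the right child's abstractness (lam > 0),
    so merges whose right child is abstract are discouraged."""
    assert lam > 0
    return concrete_c / (lam * abstract_right + 1.0)
```

A concrete constituent whose right child is abstract gets its reward shrunk: `reward_hi(2.0, 3.0)` is 0.5, versus `reward(2.0)` = 2.0, while a fully concrete right child (`abstract_right = 0`) leaves the reward unchanged.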
Each model is run 5 times with different random seeds.
F1: average agreement with Benepar (Kitaev and Klein, 2018).
Std: standard deviation of the F1 scores.
Self-F1: average agreement across the $\binom{5}{2} = 10$ pairs of runs.
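These metrics can be made concrete with a small bracket-F1 sketch (hypothetical helper names; constituent spans are represented as (start, end) pairs):

```python
from itertools import combinations

def bracket_f1(pred, gold):
    """Unlabeled bracket F1 between two sets of constituent spans."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def self_f1(parses):
    """Average pairwise F1 across runs; with 5 runs there are C(5, 2) = 10 pairs."""
    pairs = list(combinations(parses, 2))
    return sum(bracket_f1(a, b) for a, b in pairs) / len(pairs)
```

For example, two parses sharing one of two brackets each score F1 = 0.5, and five identical runs give Self-F1 = 1.0.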
Datasets:
Dataset                           Language     # Images (train/dev/test)   # Captions (train/dev/test)
MSCOCO (Lin et al., 2014)         EN           80K/1K/1K                   400K/5K/5K
Multi30K (Elliott et al., 2016)   EN, DE, FR   28K/1K/1K                   28K/1K/1K
[Bar chart: F1 (agreement with Benepar; Kitaev and Klein, 2018), Std, and Self-F1 for trivial structures, language-modeling baselines, and VG-NSL (ours). PRPN: Shen et al. (2018); ON-LSTM: Shen et al. (2019); FastText: Joulin et al. (2016).]
[Bar chart: F1 on English, French, and German for PRPN (Shen et al., 2018), ON-LSTM (Shen et al., 2019), VG-NSL (ours), and VG-NSL+HI (ours).]
F1 = 95.65%
Pearson correlation (ρ) between concreteness estimates:
                           Turney et al. (2011)   Brysbaert et al. (2014)   VG-NSL+HI
Turney et al. (2011)             1.00                    0.84                  0.72
Brysbaert et al. (2014)                                  1.00                  0.71
VG-NSL+HI (ours)                                                              1.00
Turney et al. (2011): semi-supervised concreteness estimation. Brysbaert et al. (2014): manually labeled concreteness scores.
(Kim et al., 2019)