Visually Grounded Meaning Representation
Qi Huang, Ryan Rock
Outline
1. Motivation
2. Visually Grounded Autoencoders
3. Constructing Visual Attributes
4. Constructing Textual Attributes
5. Experiment: Similarity
6. Experiment: Categorization
7. Takeaway & Critiques
Problem: text-only distributional models learn the meaning of a word from the statistical pattern of its surrounding context, without grounding words to their real-world referents, which are usually accessible in modalities other than text.
As a result, the learned representations are shaped by (and fundamentally limited by) the training corpus’ statistical patterns; they cannot generalize well.
Solution: ground word meaning in multisensory information about a word that is not present in the surrounding text.
○ Example: apples are “green”, “red”, “round”, “shiny”.
Learn bimodal representations by mapping words and images (represented as attributes) into a common hidden space.
An autoencoder is an unsupervised feed-forward neural network trained to reconstruct a given input from its hidden representation. Encoder: map the input vector x to a hidden representation h. Decoder: reconstruct x (as the output y) from the hidden representation h. Minimize the reconstruction loss, e.g. the squared error L(x, y) = ‖x − y‖².
Denoising: reconstruct the clean input given a corrupted input, e.g. one with some elements randomly masked out. Effect: the model learns to activate knowledge about a concept when exposed to partial information.
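A minimal sketch of such a denoising autoencoder, assuming PyTorch; the layer sizes, sigmoid activations, and masking rate are illustrative choices, not taken from the paper:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())

    def forward(self, x, mask_prob=0.2):
        # Corrupt the input by randomly masking elements to zero.
        corrupted = x * (torch.rand_like(x) > mask_prob).float()
        h = self.encoder(corrupted)  # hidden representation h
        y = self.decoder(h)          # reconstruction y
        return y, h

model = DenoisingAutoencoder(input_dim=500, hidden_dim=100)
x = torch.rand(32, 500)                 # a batch of attribute vectors
y, h = model(x)
loss = nn.functional.mse_loss(y, x)     # reconstruct the *clean* input
```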
Stacked autoencoders can then be fine-tuned end-to-end with an unsupervised criterion (global reconstruction) or a supervised criterion.
Train a text autoencoder and an image autoencoder, each with two hidden layers, separately.
Feed their respective encodings as input to a bimodal autoencoder, which learns a joint (bimodal) encoding.
Fine-tune the whole model with the global reconstruction loss, plus label prediction as a supervised signal. The bimodal encoding is used as the final word representation.
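A minimal sketch of this stacked bimodal architecture, assuming PyTorch; all dimensions, the loss weighting, and the classifier head size are illustrative, not the paper’s:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BimodalSAE(nn.Module):
    def __init__(self, text_dim, img_dim, uni_dim, joint_dim, n_labels):
        super().__init__()
        # Unimodal encoders/decoders (pretrained separately, then stacked).
        self.text_enc = nn.Sequential(nn.Linear(text_dim, uni_dim), nn.Sigmoid())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, uni_dim), nn.Sigmoid())
        self.text_dec = nn.Linear(uni_dim, text_dim)
        self.img_dec = nn.Linear(uni_dim, img_dim)
        # Bimodal layer over the concatenated unimodal encodings.
        self.joint_enc = nn.Sequential(nn.Linear(2 * uni_dim, joint_dim), nn.Sigmoid())
        self.joint_dec = nn.Sequential(nn.Linear(joint_dim, 2 * uni_dim), nn.Sigmoid())
        # Object-label prediction supplies the supervised signal.
        self.classifier = nn.Linear(joint_dim, n_labels)

    def forward(self, text_x, img_x):
        h = self.joint_enc(torch.cat([self.text_enc(text_x),
                                      self.img_enc(img_x)], dim=-1))
        t_h, i_h = self.joint_dec(h).chunk(2, dim=-1)
        return self.text_dec(t_h), self.img_dec(i_h), self.classifier(h), h

model = BimodalSAE(text_dim=2000, img_dim=400, uni_dim=700,
                   joint_dim=500, n_labels=541)
text_x, img_x = torch.rand(8, 2000), torch.rand(8, 400)
labels = torch.randint(0, 541, (8,))
t_rec, i_rec, logits, h = model(text_x, img_x)  # h is the word representation
loss = (F.mse_loss(t_rec, text_x) + F.mse_loss(i_rec, img_x)
        + F.cross_entropy(logits, labels))      # reconstruction + supervision
```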
The visual version of a word’s input is the centroid of the embeddings of the multiple images containing that object.
In our case, the two modalities are unified in their representation by natural-language attributes (in vector form). Goal: generalize to new instances for which no training examples are available.
Data: McRae’s feature norms, and images from ImageNet that represent McRae’s concepts.
McRae’s feature norms: a collection of concepts with vectorized representations, where each entry corresponds to a property.
ImageNet subset: roughly 700K images; each concept has between 5 (“prune”) and 2,149 (“closet”) images.
The concepts and attributes essentially form a bipartite graph
Train an SVM classifier for each attribute, treating images annotated with that attribute as positive examples and all other images as negative examples, over features produced by a hand-crafted feature extractor. A concept’s visual attribute representation is the average of the visual vectors of its constituent images.
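A minimal sketch of these per-attribute classifiers, assuming scikit-learn; the features, attribute annotations, and the choice of LinearSVC are illustrative placeholders, not the paper’s pipeline:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Illustrative placeholders: 1,000 images with 50-dim hand-crafted features,
# each annotated with a subset of attributes.
rng = np.random.default_rng(0)
features = rng.random((1000, 50))
attributes = ["green", "round", "shiny"]
image_attrs = [set(rng.choice(attributes, size=rng.integers(1, 3), replace=False))
               for _ in range(1000)]

# One classifier per attribute: images bearing the attribute are positives.
classifiers = {}
for attr in attributes:
    y = np.array([attr in s for s in image_attrs])
    classifiers[attr] = LinearSVC().fit(features, y)

# A concept's visual representation: average the attribute scores over all
# of its images (here, an illustrative subset of the feature matrix).
concept_imgs = features[:20]
visual_vec = np.stack([classifiers[a].decision_function(concept_imgs)
                       for a in attributes], axis=1).mean(axis=0)
```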
Textual attributes are built like “vector-based models except that collocates of a concept are established by relations to other concepts interpreted as properties”. Each attribute value encodes the strength of association between the concept and the property, estimated from the training corpus.
Experiment: similarity comparison against human judgments. Annotators rated word pairs for semantic and visual similarity on a scale from 1 to 5; each model’s predicted similarities are correlated against these ratings (see the sketch below).
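A minimal sketch of this evaluation using Spearman’s ρ, assuming SciPy; the word vectors, pairs, and ratings are illustrative placeholders:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors and human similarity ratings (1-5 scale).
rng = np.random.default_rng(0)
vectors = {w: rng.random(100) for w in ["apple", "pear", "car"]}
pairs = [("apple", "pear"), ("apple", "car"), ("pear", "car")]
human_ratings = [4.5, 1.2, 1.0]

model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
rho, _ = spearmanr(model_scores, human_ratings)  # correlation with humans
```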
Models compared: ○ Kernelized canonical correlation analysis (kCCA) ○ Deep canonical correlation analysis (DCCA) ○ SVD projection ○ Bimodal SAE ○ Unimodal SAE
○ CCA: linear projections ○ kCCA: nonlinear projections ○ DCCA: nonlinear deep projections
○ X1: textual attributes ○ X2: visual attributes
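A minimal sketch of the linear CCA variant on these two views, assuming scikit-learn; the matrix shapes and component count are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Illustrative shapes: 500 concepts, 300-dim textual and 200-dim visual attributes.
rng = np.random.default_rng(0)
X1 = rng.random((500, 300))   # textual attributes
X2 = rng.random((500, 200))   # visual attributes

cca = CCA(n_components=50)
T, V = cca.fit_transform(X1, X2)   # maximally correlated linear projections
bimodal = np.hstack([T, V])        # one joint vector per concept
```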
○ Compute the SVD of the concatenated attribute matrix ○ Use the right singular vectors to project the attributes
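A minimal sketch of this projection, assuming NumPy; the concatenated attribute matrix and the number of retained components are illustrative:

```python
import numpy as np

# Illustrative: concatenated textual+visual attribute matrix for 500 concepts.
rng = np.random.default_rng(0)
C = rng.random((500, 500))

U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 100
projected = C @ Vt[:k].T   # project onto the top-k right singular vectors
```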
Correlation with human similarity ratings: ○ Semantic: 0.77 ○ Visual: 0.66
Experiment: categorization by clustering a similarity graph: ○ node: object ○ edge: semantic or visual similarity weight
○ Clusters are induced with Chinese Whispers: as in the children’s game, the original message (each node’s cluster label) is iteratively distorted as nodes repeatedly adopt the dominant label among their neighbors (sketch below)
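A minimal sketch of Chinese Whispers clustering; the toy graph and the weighted adjacency-list input format are assumptions for illustration:

```python
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    """graph: {node: [(neighbor, weight), ...]} -- a similarity graph."""
    random.seed(seed)
    labels = {node: node for node in graph}  # each node starts as its own class
    nodes = list(graph)
    for _ in range(iterations):
        random.shuffle(nodes)
        for node in nodes:
            # Adopt the label with the highest total edge weight among neighbors.
            scores = defaultdict(float)
            for neighbor, weight in graph[node]:
                scores[labels[neighbor]] += weight
            if scores:
                labels[node] = max(scores, key=scores.get)
    return labels

graph = {"dog": [("cat", 0.9), ("car", 0.1)],
         "cat": [("dog", 0.9), ("car", 0.2)],
         "car": [("dog", 0.1), ("cat", 0.2)]}
clusters = chinese_whispers(graph)
```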
○ Evaluate with the F-score, the harmonic mean of precision and recall, adapted to clustering: for a class s and a cluster h, with n(·) denoting size, P(s, h) = n(s, h) / n(h) and R(s, h) = n(s, h) / n(s), so F(s, h) = 2 · P(s, h) · R(s, h) / (P(s, h) + R(s, h)); each class takes its best-matching cluster, and the overall score weights classes by their size
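A minimal sketch of this set-matching F-score; the gold classes and induced clusters below are toy placeholders:

```python
def cluster_f_score(gold, predicted):
    """gold, predicted: dicts mapping item -> class / cluster label."""
    total = len(gold)
    score = 0.0
    for s in set(gold.values()):
        members = {i for i, c in gold.items() if c == s}
        best = 0.0
        for h in set(predicted.values()):
            cluster = {i for i, c in predicted.items() if c == h}
            overlap = len(members & cluster)
            if overlap:
                p, r = overlap / len(cluster), overlap / len(members)
                best = max(best, 2 * p * r / (p + r))
        score += (len(members) / total) * best  # weight classes by size
    return score

gold = {"dog": "animal", "cat": "animal", "car": "vehicle"}
pred = {"dog": 0, "cat": 0, "car": 1}
score = cluster_f_score(gold, pred)  # 1.0 for this perfect clustering
```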
○ Result: an F-score of 0.48
○ Induce a representation for a missing modality by learning the statistical redundancy between modalities ○ Predicts reasonable, interpretable textual attributes given only visual attributes ■ jellyfish: swim, fish, ocean ■ currant: fruit, ripe, cultivate
○ No evaluation on inductive inference
○ SAE performance is bottlenecked by the performance of the SVM attribute classifiers
○ Does the approach extend beyond concrete concepts to abstract words?
○ Attribute norms capture only concept–property relations, with only 500+ concepts