  1. Visually Grounded Meaning Representation Qi Huang Ryan Rock

  2. Outline 1. Motivation 2. Visually Grounded Autoencoders 3. Constructing Visual Attributes 4. Constructing Textual Attributes 5. Experiment: Similarity 6. Experiment: Categorization 7. Takeaway & Critiques

  3. Motivation
  ● Word embeddings are "disembodied": they represent a word's meaning only as the statistical pattern of its surrounding context, without grounding in its real-world referents, which are usually accessible in modalities other than text
  ● Problem: the distribution of word representations overfits (and is fundamentally limited by) the training corpus' statistical patterns; the learned representations cannot generalize well

  4. Motivation Question: Can we use information from other modalities as input when building word representations?

  5. Motivation Solution:
  ● Cognitive science studies show that semantic attributes can represent multisensory information about a word that is not present in the surrounding text
    ○ Example: apples are "green", "red", "round", "shiny"
  ● Model: a (stacked) autoencoder-based model that learns high-level meaning representations by mapping words and images (represented as attributes) into a common hidden space
  ● Verification: experiments on word similarity and concept categorization

  6. Autoencoders An autoencoder is an unsupervised feed-forward neural network trained to reconstruct its input from its hidden representation. Encoder: map the input vector x to a hidden representation h. Decoder: reconstruct x (as y) from the hidden representation h. Minimize the reconstruction loss between x and its reconstruction y.
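
  A minimal numpy sketch of a single autoencoder layer (not the authors' implementation; the layer sizes, the sigmoid nonlinearity, and the squared-error loss are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    d_in, d_hidden = 100, 50                             # illustrative sizes
    W  = rng.normal(scale=0.1, size=(d_hidden, d_in))    # encoder weights
    b  = np.zeros(d_hidden)
    Wp = rng.normal(scale=0.1, size=(d_in, d_hidden))    # decoder weights
    bp = np.zeros(d_in)

    x = rng.random(d_in)                  # input vector
    h = sigmoid(W @ x + b)                # encoder: hidden representation h
    y = sigmoid(Wp @ h + bp)              # decoder: reconstruction y of x
    loss = np.sum((x - y) ** 2)           # reconstruction loss to be minimized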

  7. Denoising in autoencoders Denoising: reconstruct the clean input given a corrupted input. For example, randomly mask out some elements of the input. Effect: the model learns to activate knowledge about a concept when exposed to partial information
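
  A small sketch of the masking corruption described above (the 30% corruption rate is an assumed value, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def corrupt(x, mask_prob=0.3):
        # randomly zero out elements of the input; the autoencoder is then
        # trained to reconstruct the original, uncorrupted x
        mask = rng.random(x.shape) >= mask_prob
        return x * mask

    x = rng.random(10)
    x_tilde = corrupt(x)   # corrupted input fed to the encoder
    # the reconstruction loss is still computed against the clean x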

  8. Stacked autoencoders
  ● A stacked autoencoder is essentially a collection of autoencoders "stacked" on top of each other
  ● To initialize the weights of each layer, train a collection of autoencoders one at a time, feeding one autoencoder's output as the next autoencoder's input
  ● Fine-tune the model end-to-end afterwards with an unsupervised training criterion (global reconstruction) or a supervised criterion
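
  A sketch of how the stacking works (the weights below stand in for encoders that would each have been pretrained as a single autoencoder; sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # pretend W1/W2 were each obtained by training one autoencoder at a time
    W1, b1 = rng.normal(scale=0.1, size=(100, 64)), np.zeros(64)
    W2, b2 = rng.normal(scale=0.1, size=(64, 32)),  np.zeros(32)

    X  = rng.random((8, 100))      # toy input batch
    H1 = sigmoid(X  @ W1 + b1)     # output of the first autoencoder ...
    H2 = sigmoid(H1 @ W2 + b2)     # ... becomes the input of the second
    # the stacked weights are then fine-tuned end-to-end, either with a
    # global reconstruction loss or with a supervised criterion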

  9. Visually Grounded Autoencoders Train a text autoencoder and an image autoencoder, each with two hidden layers, separately

  10. Visually Grounded Autoencoders Feed their respective encodings as input to obtain the bimodal encoding

  11. Visually Grounded Autoencoders Fine-tune the whole model with a global reconstruction loss plus label prediction as a supervised signal. The bimodal encoding is used as the final word representation

  12. Some training details
  ● The weights of each AE are tied between encoder and decoder
  ● Denoising for the image modality: treat x itself as "corrupted"; the "clean" version is the centroid of the embeddings of the multiple images containing that object
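
  A rough sketch of the bimodal layer and its tied-weight reconstruction (the dimensions, the nonlinearity, and the omission of a decoder bias are simplifying assumptions, not the paper's exact setup):

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    d_txt, d_img, d_bi = 50, 50, 40       # illustrative sizes
    h_txt = rng.random(d_txt)             # top-layer encoding from the text SAE
    h_img = rng.random(d_img)             # top-layer encoding from the image SAE

    W = rng.normal(scale=0.1, size=(d_bi, d_txt + d_img))
    b = np.zeros(d_bi)
    x = np.concatenate([h_txt, h_img])    # concatenated unimodal encodings
    h_bimodal = sigmoid(W @ x + b)        # used as the final word representation

    y = sigmoid(W.T @ h_bimodal)          # tied weights: decoder reuses W.T
    recon_loss = np.sum((x - y) ** 2)     # global reconstruction term
    # fine-tuning adds a supervised label-prediction term on top of this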

  13. Constructing visual & textual attribute representations In this work, the two modalities are unified in their representation by natural-language attributes (in vector form). Goal: allow generalization to new instances when no training examples are available

  14. Constructing visual attribute representation
  ● VISA dataset, built from the McRae feature norms and images from ImageNet that represent McRae's concepts
  ● McRae feature norms: a collection of concepts with vectorized representations, where each entry corresponds to a property of that concept
  ● 541 concepts represented by 700K images; each concept has between 5 ("prune") and 2,149 ("closet") images

  15. Constructing visual attribute representation The concepts and attributes essentially form a bipartite graph

  16. Constructing visual attribute representation Train an SVM classifier for each attribute: images with that attribute serve as positive examples and all other images as negative examples, on images processed by a hand-crafted feature extractor. A concept's visual attribute representation is the average of the visual vectors of all its constituent images
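
  A toy sketch of this attribute-classifier pipeline (the random features and labels, the LinearSVC choice, and the decision-function scores are assumptions standing in for the paper's hand-crafted features and SVM setup):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n_images, d_feat, n_attributes = 200, 64, 5
    feats  = rng.random((n_images, d_feat))                      # image features
    labels = rng.integers(0, 2, size=(n_images, n_attributes))   # toy attribute labels

    # one binary SVM per attribute: images with the attribute are positives,
    # all remaining images are negatives
    classifiers = [LinearSVC().fit(feats, labels[:, a]) for a in range(n_attributes)]

    def image_attribute_vector(x):
        # one classifier score per attribute for a single image
        return np.array([clf.decision_function([x])[0] for clf in classifiers])

    # a concept's visual representation is the average over its images
    concept_images = feats[:10]        # images depicting one concept (toy subset)
    concept_vector = np.mean([image_attribute_vector(x) for x in concept_images], axis=0)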

  17. Constructing textual attribute representation
  ● Textual attributes are obtained from Strudel, "a distributional model akin to other vector-based models except that collocates of a concept are established by relations to other concepts interpreted as properties"
  ● Vector representation: each element represents the strength of association of a concept–attribute pair
  ● Attributes are automatically discovered by the Strudel algorithm from the training corpus
  ● Word embeddings from a continuous skip-gram model are also taken for comparison

  18. Visual Attributes & Textual Attributes Comparison

  19. Experiments: Similarity and Categorization

  20. Experiment 1: Similarity ● Create dataset by pairing concrete McRae nouns and assigning semantic and visual similarity scores from 1 to 5

  21. Comparison Models ● Compare against ○ Kernelized canonical correlation analysis (kCCA) ○ Deep canonical correlation analysis (DCCA) ○ SVD projection ○ Bimodal SAE ○ Unimodal SAE

  22. CCA
  ● Find projections that maximally correlate two random vectors
    ○ CCA: linear projections
    ○ kCCA: nonlinear projections
    ○ DCCA: nonlinear deep projections
  ● Vectors
    ○ X1: textual attributes
    ○ X2: visual attributes
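
  A minimal linear-CCA sketch with scikit-learn (toy random matrices; kCCA and DCCA replace the linear maps with kernel and deep nonlinear maps, which are not shown here):

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    X1 = rng.random((100, 30))      # textual attribute vectors (toy)
    X2 = rng.random((100, 20))      # visual attribute vectors (toy)

    cca = CCA(n_components=10)      # plain linear CCA
    cca.fit(X1, X2)
    T1, T2 = cca.transform(X1, X2)  # maximally correlated projections
    joint = np.concatenate([T1, T2], axis=1)   # one possible bimodal representation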

  23. SVD on tAttrib + vAttrib
  ● Create a matrix with each object's textual + visual attributes as a row
  ● Compute the SVD
  ● Use the right singular vectors to project the attributes
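
  A short numpy sketch of this SVD projection (toy matrices; the rank k is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    t_attrib = rng.random((100, 30))   # textual attributes, one row per object
    v_attrib = rng.random((100, 20))   # visual attributes, one row per object

    M = np.concatenate([t_attrib, v_attrib], axis=1)   # objects x (textual + visual)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    k = 10                             # illustrative rank
    projected = M @ Vt[:k].T           # project onto the top-k right singular vectors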

  24. Bruni's SVD
  ● Collect a textual co-occurrence matrix from ukWaC and WaCkypedia
  ● Collect visual information through SIFT bag-of-visual-words on ESP
  ● Harvest text-based and image-based semantic vectors for target words
  ● Concatenate textual and visual information by row
  ● Take the SVD and project objects to lower-rank approximations

  25. Similarity Metric ● Take cosine of angle between vectors ● Calculate correlation coefficient against humans
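
  A small sketch of this evaluation step (the slides do not name the correlation coefficient; Spearman's rho is assumed here, and the vectors and ratings are toy stand-ins):

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    rng = np.random.default_rng(0)
    pairs = [(rng.random(40), rng.random(40)) for _ in range(50)]  # word-pair vectors
    human = rng.uniform(1, 5, size=50)                             # human similarity ratings

    model_scores = [cosine(a, b) for a, b in pairs]
    rho, _ = spearmanr(model_scores, human)   # correlation against human judgments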

  26. Similarity Metrics
  ● Bimodal SAE (skip-gram, vAttrib) gets the best results
    ○ Semantic: 0.77
    ○ Visual: 0.66

  27. Experiment 2: Categorization
  ● Unimodal classifiers exist – ResNet
  ● Can bimodal classifiers perform better?
  ● Use the same comparison models

  28. Experiment 2: Categorization ● Create a graph ○ node: object ○ edge: semantic or visual similarity weight ● Use Chinese Whispers algorithm

  29. Chinese Whispers
  ● Gets its name from the 'Chinese Whispers' or 'telephone' game
    ○ The original message is iteratively distorted
  ● Nodes iteratively take on the class with the maximum weight in their neighborhood
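
  A minimal self-contained implementation of the algorithm as described (the toy graph and iteration count are illustrative):

    import random

    def chinese_whispers(nodes, edges, iterations=20, seed=0):
        # edges is {(u, v): weight}; every node starts in its own class, then
        # repeatedly adopts the class with the highest total weight among its neighbours
        rng = random.Random(seed)
        label = {n: n for n in nodes}
        neigh = {n: [] for n in nodes}
        for (u, v), w in edges.items():
            neigh[u].append((v, w))
            neigh[v].append((u, w))
        for _ in range(iterations):
            order = list(nodes)
            rng.shuffle(order)                       # visit nodes in random order
            for n in order:
                if not neigh[n]:
                    continue
                votes = {}
                for m, w in neigh[n]:
                    votes[label[m]] = votes.get(label[m], 0.0) + w
                label[n] = max(votes, key=votes.get)  # strongest class wins
        return label

    # toy graph: two tightly connected pairs joined by a weak edge
    nodes = ["a", "b", "c", "d"]
    edges = {("a", "b"): 1.0, ("c", "d"): 1.0, ("b", "c"): 0.1}
    print(chinese_whispers(nodes, edges))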

  30. Chinese Whispers

  31. Categorization Results
  ● Compare classifications against AMT categories
    ○ Use the F-score – the harmonic mean of precision and recall
  ● Variables
    ○ s: class
    ○ n: size
    ○ h: cluster
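
  The slide only names the variables; one common instantiation of a cluster-level F-score consistent with them (each gold class s of size n is matched against its best cluster h) is sketched below, as an assumption rather than the paper's exact definition:

    def cluster_f_score(gold, predicted):
        # size-weighted best-match F-score between gold classes and induced clusters
        n = len(gold)
        class_items, cluster_items = {}, {}
        for i, (s, h) in enumerate(zip(gold, predicted)):
            class_items.setdefault(s, set()).add(i)
            cluster_items.setdefault(h, set()).add(i)
        total = 0.0
        for s, items_s in class_items.items():
            best = 0.0
            for h, items_h in cluster_items.items():
                overlap = len(items_s & items_h)
                if overlap == 0:
                    continue
                p = overlap / len(items_h)     # precision of cluster h w.r.t. class s
                r = overlap / len(items_s)     # recall of cluster h w.r.t. class s
                best = max(best, 2 * p * r / (p + r))   # harmonic mean
            total += len(items_s) / n * best
        return total

    print(cluster_f_score(["animal", "animal", "tool"], ["c1", "c1", "c2"]))  # 1.0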

  32. Categorization Results ● Bimodal SAE (skip-gram, vAttrib) performs best ○ F-score of 0.48

  33. Conclusion ● Bimodal SAE performs ‘inductive inference’ ○ Induce representation for missing mode by learning statistical redundancy between modes ○ Predict reasonable, interpretable textual attributes given visual attributes ■ jellyfish: swim, fish, ocean ■ currant: fruit, ripe, cultivate

  34. Critiques
  ● Evaluation mainly on unimodal or bimodal data
    ○ No evaluation of inductive inference
  ● Skip-gram + vAttrib outperforms tAttrib + vAttrib
  ● Visual attributes are bootstrapped with SVMs
    ○ This bottlenecks SAE performance on SVM performance

  35. Critiques
  ● Textual "attributes" and visual attributes do not share the same vocabulary -- intentional or a mistake? Are these textual attributes or just co-occurring words?
  ● Why not extract image features using neural networks instead of a specialized feature extractor?
  ● Data quality & size -- two authors manually inspected the concept–visual-attribute relations, with only 500+ concepts
  ● If this is only a proof of concept -- can it generalize well to abstract concepts?
