SLIDE 1

Visually Grounded Meaning Representation

Qi Huang Ryan Rock

SLIDE 2

Outline

1. Motivation
2. Visually Grounded Autoencoders
3. Constructing Visual Attributes
4. Constructing Textual Attributes
5. Experiment: Similarity
6. Experiment: Categorization
7. Takeaway & Critiques

SLIDE 3

Motivation

  • Word embeddings are “disembodied”: they represent a word’s meaning only as the statistical pattern of its surrounding context, without grounding to real-world referents, which are usually accessible in modalities other than text

  • Problem: the word representations overfit (and are fundamentally limited by) the training corpus’s statistical patterns, so the learned representations cannot generalize well

SLIDE 4

Motivation

Question: Can we take information from other modalities as input when building word representations?

SLIDE 5

Motivation

Solution:

  • Cognitive science studies show that semantic attributes can represent multisensory information about a word that is not present in the surrounding text
○ Example: apples are “green”, “red”, “round”, “shiny”
  • Model: a (stacked) autoencoder-based model that learns high-level meaning representations by mapping words and images (represented as attributes) into a common hidden space
  • Verification: experiments on word similarity and concept categorization

SLIDE 6

Autoencoders

An autoencoder is an unsupervised feed-forward neural network trained to reconstruct a given input from its hidden representation. Encoder: map the input vector x to a hidden representation h = f(Wx + b). Decoder: reconstruct x as the output y = g(W'h + b'). Minimize the reconstruction loss between x and y, e.g. the squared error ||x − y||^2 (a sketch follows below).
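
A minimal single-layer autoencoder sketch in NumPy (illustrative only: the toy dimensions, sigmoid activations, and squared-error loss are assumptions, not the paper's exact configuration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_hid = 10, 4                       # toy dimensions (assumed)
W = rng.normal(0, 0.1, (d_hid, d_in))     # encoder weights
b = np.zeros(d_hid)                       # encoder bias
b_out = np.zeros(d_in)                    # decoder bias

def encode(x):
    return sigmoid(W @ x + b)             # h = f(Wx + b)

def decode(h):
    return sigmoid(W.T @ h + b_out)       # tied weights: W' = W^T

x = rng.random(d_in)
y = decode(encode(x))
loss = np.sum((x - y) ** 2)               # squared-error reconstruction loss
```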

SLIDE 7

Denoising in autoencoders

Denoising: reconstruct the clean input given a corrupted input, for example one with some elements randomly masked out. Effect: the model learns to activate knowledge about a concept when exposed to partial information (see the sketch below).
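
A sketch of the masking corruption (the 20% masking rate is an arbitrary assumption):

```python
import numpy as np

def corrupt(x, mask_prob=0.2, rng=np.random.default_rng(0)):
    """Randomly zero out elements of the input vector."""
    keep = rng.random(x.shape) >= mask_prob   # keep each element with prob 1 - mask_prob
    return x * keep

x = np.ones(8)
x_tilde = corrupt(x)    # corrupted input fed to the encoder;
                        # the training target remains the clean x
```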

SLIDE 8

Stacked autoencoders

  • A stacked autoencoder is essentially a collection of autoencoders “stacked” on top of each other
  • To initialize the weights of each layer, train the autoencoders one at a time, feeding one autoencoder’s encoding in as the next autoencoder’s input
  • Fine-tune the model end-to-end afterwards with an unsupervised training criterion (global reconstruction) or a supervised criterion (see the sketch below)
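
A minimal sketch of the greedy layer-wise pretraining loop (toy data; tied weights, sigmoid activations, and plain gradient descent are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, d_hidden, lr=0.1, epochs=50):
    """One tied-weight autoencoder trained by plain gradient descent."""
    W = rng.normal(0, 0.1, (d_hidden, X.shape[1]))
    for _ in range(epochs):
        H = sigmoid(X @ W.T)              # encode
        Y = sigmoid(H @ W)                # decode with tied weights
        dY = (Y - X) * Y * (1 - Y)        # gradient of (1/2)||Y - X||^2 at the output
        dH = (dY @ W.T) * H * (1 - H)     # backprop into the hidden layer
        W -= lr * (H.T @ dY + dH.T @ X) / len(X)
    return W

# Greedy layer-wise pretraining: each autoencoder learns to
# reconstruct the previous layer's encoding.
X = rng.random((32, 100))                 # toy data: 32 examples, 100 dims
weights = []
for d_hidden in (50, 25):                 # assumed layer sizes
    W = train_autoencoder(X, d_hidden)
    weights.append(W)
    X = sigmoid(X @ W.T)                  # encodings feed the next layer
# afterwards: fine-tune all layers end-to-end on the global
# reconstruction (or a supervised) criterion
```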

SLIDE 9

Visually Grounded Autoencoders

Train the text autoencoder and the image autoencoder, each with two hidden layers, separately

SLIDE 10

Visually Grounded Autoencoders

Feed their respective encodings as input to a further autoencoder to obtain the bimodal encoding

SLIDE 11

Visually Grounded Autoencoders

Fine-tune the whole model with the global reconstruction loss, plus label prediction as a supervised signal. The bimodal encoding is used as the final word representation (a fusion sketch follows below).
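
A minimal sketch of the bimodal fusion step (the concatenate-then-encode scheme follows the architecture described above; dimensions and activations are toy assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_text, d_img, d_bimodal = 50, 50, 40     # assumed toy dimensions

h_text = rng.random(d_text)               # encoding from the text SAE
h_img = rng.random(d_img)                 # encoding from the image SAE

W_b = rng.normal(0, 0.1, (d_bimodal, d_text + d_img))
b_b = np.zeros(d_bimodal)

# concatenate the unimodal encodings and encode them jointly;
# h_bimodal serves as the final word representation
h_bimodal = sigmoid(W_b @ np.concatenate([h_text, h_img]) + b_b)
```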

SLIDE 12

Some training details:

  • The weights of each AE are tied between encoder and decoder: W' = W^T
  • Denoising for the image modality: treat x itself as the “corrupted” input, where the “clean” version is the centroid of the embeddings of the multiple images containing that object (see the sketch below)
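
A sketch of the centroid-as-clean-target idea (toy shapes, not the paper's actual features):

```python
import numpy as np

rng = np.random.default_rng(0)

# attribute vectors for several images of the same object (toy data)
image_vectors = rng.random((12, 30))       # 12 images, 30 visual attributes

clean_target = image_vectors.mean(axis=0)  # centroid = the "clean" version
x = image_vectors[0]                       # any single image acts as the
                                           # "corrupted" input to reconstruct from
```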

SLIDE 13

Constructing visual & textual attribute representation

In our case, the two modalities are unified in their representation by natural-language attributes (in vector form). Goal: allow the model to generalize to new instances for which no training examples are available.

SLIDE 14

Constructing visual attribute representation

  • VISA dataset, built from the McRae feature norms and images from ImageNet that represent McRae’s concepts
  • McRae feature norms: a collection of concepts with vectorized representations, where each entry corresponds to a property of that concept
  • 541 concepts represented by 700K images; each concept has between 5 (“prune”) and 2,149 (“closet”) images

SLIDE 15

Constructing visual attribute representation

The concepts and attributes essentially form a bipartite graph

SLIDE 16

Constructing visual attribute representation

Train an SVM classifier for each attribute, with images of concepts having that attribute as positive examples and the rest as negative examples, on top of hand-crafted image features. A concept’s visual attribute representation is the average of the visual vectors of all its constituent images.
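
A sketch of the per-attribute classifier setup with scikit-learn (toy data; the hand-crafted feature extraction is abstracted into random features, and averaging decision values over a concept's images is one plausible reading of the aggregation step):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

X = rng.random((200, 64))            # hand-crafted features, one row per image (toy)
has_attr = rng.random(200) < 0.3     # which images show the attribute (toy labels)

clf = LinearSVC()                    # one classifier per attribute
clf.fit(X, has_attr)

# a concept's score for this attribute: average the classifier's
# decision values over all images of that concept
concept_images = X[:10]
attr_score = clf.decision_function(concept_images).mean()
```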

SLIDE 17

Constructing textual attribute representation

  • Textual attributes are obtained from Strudel, “a distributional model akin to other vector-based models except that collocates of a concept are established by relations to other concepts interpreted as properties”

  • Vector representation: each element represents the strength of association of a concept-attribute pair

  • Attributes are discovered automatically by the Strudel algorithm from the training corpus

  • Word embeddings from a continuous skip-gram model are also used for comparison

SLIDE 18

Visual Attributes & Textual Attributes Comparison

SLIDE 19

Experiments: Similarity and Categorization

SLIDE 20

Experiment 1: Similarity

  • Create a dataset by pairing concrete McRae nouns and assigning semantic and visual similarity scores from 1 to 5

SLIDE 21

Comparison Models

  • Compare against

○ Kernelized canonical correlation analysis (kCCA)
○ Deep canonical correlation analysis (DCCA)
○ SVD projection
○ Bimodal SAE
○ Unimodal SAE

SLIDE 22

CCA

  • Find projections that maximally correlate two random vectors

○ CCA: linear projections
○ kCCA: nonlinear projections
○ DCCA: nonlinear deep projections

  • Vectors

○ X1: textual attributes
○ X2: visual attributes
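
A sketch of the linear CCA baseline with scikit-learn (toy data; kCCA and DCCA replace the linear projections with kernelized or deep ones):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

X1 = rng.random((100, 30))    # textual attribute vectors (toy)
X2 = rng.random((100, 20))    # visual attribute vectors (toy)

cca = CCA(n_components=10)    # find 10 maximally correlated projections
T1, T2 = cca.fit_transform(X1, X2)
# T1 and T2 live in the shared space; concatenate or average
# them to obtain a bimodal representation
```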

SLIDE 23

SVD on tAttrib + vAttrib

  • Create a matrix of all objects’ textual + visual attributes, one object per row

○ Compute the SVD
○ Use the right singular vectors to project the attributes
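
A NumPy sketch of this baseline (dimensions and the target rank k are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

t_attrib = rng.random((100, 30))          # textual attributes per object (toy)
v_attrib = rng.random((100, 20))          # visual attributes per object (toy)

M = np.hstack([t_attrib, v_attrib])       # one object per row
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 10                                    # assumed target rank
projected = M @ Vt[:k].T                  # project onto top-k right singular vectors
```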

SLIDE 24

Bruni’s SVD

  • Collect a textual co-occurrence matrix from ukWaC and WaCkypedia
  • Collect visual information through SIFT bag-of-visual-words features on the ESP Game dataset
  • Harvest text-based and image-based semantic vectors for the target words
  • Concatenate the textual and visual information row-wise
  • Take the SVD and project objects to lower-rank approximations

SLIDE 25

Similarity Metric

  • Take the cosine of the angle between vectors
  • Calculate the correlation coefficient against human similarity ratings, as in the sketch below
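
A sketch of the evaluation loop (toy vectors; Spearman's rank correlation is assumed here as the correlation coefficient):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
pairs = [(rng.random(40), rng.random(40)) for _ in range(50)]  # toy word-vector pairs
human = rng.uniform(1, 5, 50)                                  # toy human ratings

model = [cosine(a, b) for a, b in pairs]
rho, _ = spearmanr(model, human)          # correlation against humans
```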

SLIDE 26

Similarity Results

  • Bimodal SAE (skip-gram, vAttrib) gets best results

○ Semantic: 0.77
○ Visual: 0.66

SLIDE 27

Experiment 2: Categorization

  • Unimodal classifiers exist – e.g. ResNet
  • Can bimodal classifiers perform better?
  • Use the same comparison models

SLIDE 28

Experiment 2: Categorization

  • Create a graph
○ node: object
○ edge: weighted by semantic or visual similarity
  • Use the Chinese Whispers clustering algorithm

SLIDE 29

Chinese Whispers

  • Gets its name from the ‘Chinese Whispers’ (or ‘telephone’) game
○ The original message is iteratively distorted
  • Nodes iteratively take on the class with the maximum weight in their neighborhood, as in the sketch below
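
A minimal sketch of the Chinese Whispers update (toy graph; the number of iterations and tie-breaking are simplified):

```python
import random

def chinese_whispers(nodes, edges, iterations=20, seed=0):
    """edges maps (u, v) -> similarity weight of an undirected edge."""
    rng = random.Random(seed)
    label = {n: n for n in nodes}          # every node starts in its own class
    neighbors = {n: [] for n in nodes}
    for (u, v), w in edges.items():
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))
    for _ in range(iterations):
        order = list(nodes)
        rng.shuffle(order)                 # visit nodes in random order
        for n in order:
            if neighbors[n]:
                scores = {}                # total edge weight per neighboring class
                for m, w in neighbors[n]:
                    scores[label[m]] = scores.get(label[m], 0.0) + w
                label[n] = max(scores, key=scores.get)
    return label

# toy graph: two triangles joined by one weak edge
nodes = ["a", "b", "c", "x", "y", "z"]
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
         ("x", "y"): 1.0, ("y", "z"): 1.0, ("x", "z"): 1.0,
         ("c", "x"): 0.1}
print(chinese_whispers(nodes, edges))      # two clusters emerge
```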

SLIDE 30

Chinese Whispers

SLIDE 31

Categorization Results

  • Compare the induced clusters against AMT (Amazon Mechanical Turk) categories
○ Use the F-score – the harmonic mean of precision and recall
  • Variables
○ s: class
○ n: size
○ h: cluster
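
The slide's formula did not survive extraction; one common form of the clustering F-score, assumed here rather than copied from the paper, weights each gold class s (of size n_s, out of n objects) by its size and scores it against its best-matching cluster h: F = Σ_s (n_s / n) · max_h F(s, h), where F(s, h) is the harmonic mean of the precision and recall of h against s. A sketch:

```python
def f_score(clusters, classes):
    """Clustering F-score: weight each gold class by its size and
    take its best-matching cluster (one common definition, assumed here)."""
    n = sum(len(s) for s in classes)
    total = 0.0
    for s in classes:                      # s: gold class (a set of objects)
        best = 0.0
        for h in clusters:                 # h: induced cluster
            overlap = len(s & h)
            if overlap == 0:
                continue
            p, r = overlap / len(h), overlap / len(s)
            best = max(best, 2 * p * r / (p + r))
        total += len(s) / n * best
    return total

# toy usage with sets of object names
classes = [{"dog", "cat"}, {"car", "bus"}]
clusters = [{"dog", "cat", "car"}, {"bus"}]
print(f_score(clusters, classes))
```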

SLIDE 32

Categorization Results

  • Bimodal SAE (skip-gram, vAttrib) performs best

○ F-score of 0.48

SLIDE 33

Conclusion

  • Bimodal SAE performs ‘inductive inference’

○ Induces a representation for the missing modality by learning statistical redundancy between modalities
○ Predicts reasonable, interpretable textual attributes given visual attributes
■ jellyfish: swim, fish, ocean
■ currant: fruit, ripe, cultivate

SLIDE 34

Critiques

  • Evaluation is mainly on unimodal or bimodal data
○ No evaluation of inductive inference
  • Skip-gram + vAttrib outperforms tAttrib + vAttrib
  • Visual attributes are bootstrapped with SVMs
○ SAE performance is bottlenecked by SVM performance

SLIDE 35

Critiques

  • Textual “attributes” and visual attributes do not share the same vocabulary: intentional or a mistake? Are these textual attributes or just co-occurring words?
  • Why not extract image features using neural networks instead of a specialized feature extractor?
  • Data quality & size: two authors manually inspected the concept-visual-attribute relations, and there are only 500+ concepts
  • If it is only a proof of concept, can it generalize well to abstract concepts?