
SLIDE 1

Toward Artificial Synesthesia: Linking Images and Sounds via Words

Han Xiao and Thomas Stibor

Fakultät für Informatik, Technische Universität München
{xiao,stibor}@in.tum.de

December 10, 2010

slide-2
SLIDE 2

Synesthesia

Synesthesia: A perceptual experience in which a stimulus in one modality gives rise to an experience in a different sensory modality. Examples:

  • A picture of a golden beach might stimulate hearing by evoking the sound of waves crashing against the shore.
  • The sound of a baaing sheep might illustrate a green hillside.

Images and sounds are distinct modalities; however, both capture the same underlying concept.

SLIDE 3

Explicit/Implicit Linking between Images and Sounds

  • Explicit: Images and sounds are directly associated (without intermediate links).
  • Implicit: Images and sounds are not directly associated; instead, they are linked through another, intermediate but obscure, modality.

[Diagram: implicit chain J.S. BACH → COMPOSER / VIOLINIST → VIOLIN → STRING INSTRUMENT]

  • Natural language is based on visual and auditory stimuli ⇒ link images and sounds with text.

SLIDE 4

Related Work

Domain: Linking an image with associated text (e.g., image annotation, multimedia information retrieval, object recognition).

  • Probability of associating words with image grids [Hironobu et al., 1999].
  • Predicting words from images [Barnard et al., 2003].
  • Modeling the generative process of image regions and words in the same latent space [Blei et al., 2003].
  • Jointly modeling image, class label, and annotations (supervised topic model) [Wang et al., 2009].

Consider images and text as two different languages: linking images and words can then be viewed as translating from a visual vocabulary to a textual vocabulary.

Inspiration: Probabilistic models for text/image analysis (LDA, Corr-LDA).
Representation: Bag-of-words models of images and text.

SLIDE 5

Input Representation and Preprocessing

Build a visual vocabulary and an auditory vocabulary for representing images and sounds as bags of words.

Image representation:

  • Divide the image into patches and compute a SIFT descriptor (128-dim.) for each patch.
  • Quantize the SIFT descriptors of the collection using k-means; the centroids of the learned clusters compose the visual vocabulary.

Sound representation:

  • Each sound snippet is cut into frames (sequences of 1024 audio samples).
  • For each frame, compute Mel-frequency cepstral coefficients (MFCCs).
  • Each sound snippet is thus represented as a set of 25-dimensional feature vectors.
  • Cluster all feature vectors in the collection using k-means to obtain the auditory words.
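A minimal sketch of this vocabulary-building step, assuming scikit-learn's KMeans and feature matrices that have already been extracted (the function and variable names here are illustrative, not from the talk):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptors, n_words):
        # Quantize local descriptors (one row per patch or frame) into
        # n_words cluster centroids, which form the vocabulary.
        return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

    def to_bag_of_words(vocab, descriptors):
        # Map each descriptor to its nearest centroid and count occurrences,
        # yielding the bag-of-words histogram of one image or sound snippet.
        word_ids = vocab.predict(descriptors)
        return np.bincount(word_ids, minlength=vocab.n_clusters)

    # Hypothetical usage: all_sift is an (n_patches, 128) matrix of SIFT
    # descriptors pooled over the collection, all_mfcc an (n_frames, 25)
    # matrix of MFCC vectors; the vocabulary sizes are just examples.
    # visual_vocab = build_vocabulary(all_sift, n_words=241)
    # auditory_vocab = build_vocabulary(all_mfcc, n_words=89)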

SLIDE 6

Notations

An annotated image I consists of M visual words and N textual words (annotations):

    I = {v1, . . . , vM; w1, . . . , wN},

where the vm are visual words and the wn are annotations.

A captioned sound snippet S consists of M auditory words and N textual words:

    S = {u1, . . . , uM; w1, . . . , wN},

where the um are auditory words and the wn are sound tags.

The training collection T = {I1, . . . , IK; S1, . . . , SL} contains K annotated images and L tagged sounds. Denote by Wi the vocabulary of image annotations and by Ws the vocabulary of sound tags; the complete textual vocabulary is W = Wi ∪ Ws.

SLIDE 7

Linking Images and Sounds via Text

Image composition: Given an un-annotated image I∗ ∉ T, estimate the conditional probability p(S | I∗) for every sound snippet S ∈ T.

Sound illustration: Given an un-tagged sound S∗ ∉ T, estimate the conditional probability p(I | S∗) for every image I ∈ T.

Problem: We cannot estimate p(S | I∗) and p(I | S∗) directly, as no explicit correspondences exist.

Idea: “Translate” the image into natural language text, then “translate” the text back into sound (and vice versa), that is

    p(S | I∗) ≈ Σ_{w ∈ Wi} Σ_{w′ ∈ Ws} p(S | w′) p(w′ | w) p(w | I∗),

    p(I | S∗) ≈ Σ_{w ∈ Ws} Σ_{w′ ∈ Wi} p(I | w′) p(w′ | w) p(w | S∗).
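A hedged sketch of this double marginalization, assuming the three conditional distributions have already been estimated and are passed in as dictionaries (all names here are illustrative):

    def score_sound(p_S_given_w2, p_rel, p_w_given_img, vocab_i, vocab_s):
        # p(S | I*) ~ sum over w in Wi, w' in Ws of
        #             p(S | w') * p(w' | w) * p(w | I*)
        # p_S_given_w2[w2] = p(S | w') for the fixed sound S,
        # p_rel[(w2, w)] = p(w' | w), p_w_given_img[w] = p(w | I*).
        total = 0.0
        for w in vocab_i:
            pw = p_w_given_img.get(w, 0.0)
            for w2 in vocab_s:
                total += p_S_given_w2.get(w2, 0.0) * p_rel.get((w2, w), 0.0) * pw
        return total

Ranking all training sounds for an unseen image then amounts to evaluating this score once per sound and sorting; the sound illustration task is symmetric.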

SLIDE 8

Modeling Images/Text and Sounds/Text with Corr-LDA

Generative process of an annotated image I = {v1, . . . , vM; w1, . . . , wN}:

1. Draw topic proportions θ ∼ Dirichlet(α)
2. For each visual word vm, m ∈ {1, . . . , M}:
   2.1 Draw topic assignment zm | θ ∼ Multinomial(θ)
   2.2 Draw visual word vm | zm ∼ Multinomial(πzm) [1]
3. For each textual word wn, n ∈ {1, . . . , N}:
   3.1 Draw discrete indexing variable yn ∼ Uniform(1, . . . , M)
   3.2 Draw textual word wn ∼ Multinomial(βzyn)

[Plate diagram of the Corr-LDA model: variables θ, z, v, y, w with parameters α, π, β; plates M, D, K.]

Exchanging the visual word vm for an auditory word um yields the generative process for modeling sounds and text.

[1] Orig. Corr-LDA uses a multivariate Gaussian.
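A minimal numpy sketch of the generative process above; the parameters α, π, β are assumed to be given, and M, N are per-document counts:

    import numpy as np

    def generate_annotated_image(alpha, pi, beta, M, N, rng=None):
        # alpha: (K,) Dirichlet prior; pi: (K, V_visual) topic-to-visual-word
        # distributions; beta: (K, V_text) topic-to-textual-word distributions.
        rng = rng or np.random.default_rng()
        theta = rng.dirichlet(alpha)                  # 1. topic proportions
        z = rng.choice(len(alpha), size=M, p=theta)   # 2.1 topic per visual word
        v = np.array([rng.choice(pi.shape[1], p=pi[zm]) for zm in z])          # 2.2
        y = rng.integers(0, M, size=N)                # 3.1 0-based index into the M visual words
        w = np.array([rng.choice(beta.shape[1], p=beta[z[yn]]) for yn in y])   # 3.2
        return v, w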

SLIDE 9

Modeling Images/Text and Sounds/Text with Corr-LDA (cont.)

The trained model gives the distributions of interest, p(I | w) and p(w | I∗), where I ∈ T and I∗ ∉ T. Specifically, the distribution over words conditioned on an unseen image is approximated by

    p(w | I∗) ≈ Σ_{m=1}^{M} Σ_{zm} p(zm | θ) p(w | zm, β).

Using Bayes rule for p(I | w) gives

    p(I | w) = p(w | I) p(I) / Σ_{I′ ∈ T} p(w | I′) p(I′),

where

    p(I) = p(θ | α) Π_{m=1}^{M} p(zm | θ) p(vm | zm, π) Π_{n=1}^{N} p(yn | M) p(wn | zyn, β).
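A small sketch of this Bayes-rule ranking, assuming the per-image likelihoods p(w | I) and priors p(I) have been precomputed as arrays over the training collection (names illustrative):

    import numpy as np

    def rank_images(p_w_given_I, p_I):
        # p(I | w) is proportional to p(w | I) * p(I), normalized over T.
        joint = np.asarray(p_w_given_I) * np.asarray(p_I)
        posterior = joint / joint.sum()
        return posterior, np.argsort(-posterior)  # posterior and ranking, best first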

SLIDE 10

Modeling Text

Recall:

    p(S | I∗) ≈ Σ_{w ∈ Wi} Σ_{w′ ∈ Ws} p(S | w′) p(w′ | w) p(w | I∗),

    p(I | S∗) ≈ Σ_{w ∈ Ws} Σ_{w′ ∈ Wi} p(I | w′) p(w′ | w) p(w | S∗).

Remaining problem: Estimate p(w′ | w), the semantic relatedness between two words.

Approach: An LDA model trained on a data set D containing only the captions of all images and sounds. Generative process of a document (caption set) D ∈ D:

1. Draw topic proportions θ ∼ Dirichlet(α)
2. For each textual word wn, n ∈ {1, . . . , N}:
   2.1 Draw topic assignment zn | θ ∼ Multinomial(θ)
   2.2 Draw textual word wn | zn ∼ Multinomial(βzn)

Two sets of parameters are estimated: ΘD = p(z | D) (mixing proportions over topics) and β = p(w | z) (word distributions over topics).

SLIDE 11

Modeling Text (cont.)

Given the trained LDA model, the word relatedness between w and w′ is

    pLDA(w | w′) = (1/C) Σ_{zn} p(w | zn) (nw′ / nzn) p(w′ | zn),

where nw′ is the number of occurrences of w′ in D, nzn is the number of words assigned to topic zn, and C is a normalization factor.

The relatedness is calculated on a small data set (problematic), so p(w | w′) is smoothed using the WordNet dictionary:

    p(w | w′) = σ pLDA(w | w′) + (1 − σ) pWordNet(w | w′),

where σ is the smoothing parameter.

[Table: exemplary word-relatedness outputs pLDA(w | rain) and pWordNet(w | rain).]
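A sketch of the smoothed relatedness, assuming a trained topic-word matrix beta (K × V, with beta[z, w] = p(w | z)), per-topic word counts n_z, word counts n_w over D, and a precomputed WordNet table p_wordnet; how pWordNet is derived is not specified on the slide, so that table is a stand-in:

    import numpy as np

    def p_lda(beta, n_z, n_w, w2):
        # p_LDA(w | w') = (1/C) * sum_z p(w | z) * (n_{w'} / n_z) * p(w' | z)
        weight = (n_w[w2] / n_z) * beta[:, w2]   # per-topic contribution
        scores = beta.T @ weight                 # sum over topics, for every w
        return scores / scores.sum()             # C is the normalization constant

    def relatedness(beta, n_z, n_w, w2, p_wordnet, sigma=0.8):
        # p(w | w') = sigma * p_LDA(w | w') + (1 - sigma) * p_WordNet(w | w')
        return sigma * p_lda(beta, n_z, n_w, w2) + (1 - sigma) * p_wordnet[:, w2]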

SLIDE 12

Putting Everything in a Probabilistic Framework

  • !"
  • #

#

  • $

$ % %

&

& &

SLIDE 13

Experimental Results (Data Sets)

Image data set: Three classes of images from the LabelMe data set [2] (“street”, “coast”, “forest”); 300 images randomly selected from each class.

  • The average annotation length is 7 tokens per image.
  • The textual vocabulary of all annotations has size 156.

Sound data set: 831 audio snippets downloaded from the Freesound Project [3]; most are natural sounds and synthetic sound effects.

  • Sound snippets are converted to 44.1 kHz mono WAV format.
  • Each snippet is tagged by the uploader or other online users.
  • The average number of tags per sound is 6 tokens.
  • The textual vocabulary of sound tags has size 1576.

[2] http://labelme.csail.mit.edu/
[3] http://www.freesound.org

SLIDE 14

Experimental Results (Details)

  • 20% of the data was held out for testing; the remaining 80% was used to estimate parameters.
  • First, a Corr-LDA model was trained on the annotated images.
  • Second, a Corr-LDA model was trained on the tagged sounds.
  • Third, an LDA model was trained on all annotations and tags.

Model parameters:

  • The patch size for computing SIFT descriptors is 16 × 16.
  • Clustering the SIFT descriptors and audio feature vectors yields 241 visual words and 89 auditory words in total.
  • Dirichlet prior α = 0.1 for Corr-LDA and LDA; smoothing parameter σ = 0.8.
  • The number of topics is 40; the maximum number of iterations for variational inference and the EM algorithm [4] is 100.

[4] See the technical note http://home.in.tum.de/~xiaoh/pub/derivation.pdf

SLIDE 15

Illustrative Example

(a) Image composition task: a good prediction (left) and a bad prediction (right).

Good prediction, top-5 sounds:
1. waterfall flowing
2. wave splashing, powerboat engine booming
3. wood stick breaking
4. wave splashing
5. stream flowing

Bad prediction, top-5 sounds:
1. wood stick breaking
2. bell ringing
3. ice cube shaking in glass
4. child speaking
5. glass shattering

(b) Sound illustration task: a good prediction (top, “waterfall flowing”) and a bad prediction (bottom, “vehicles passing”). The images are ranked by conditional probability from highest (leftmost) to lowest (rightmost).

[Figure: the corresponding query images and retrieved images are shown on the slide.]

SLIDE 16

Online Evaluation System (http://yulei.appspot.com)

Allows humans to judge the predicted sounds/images for a randomly given scene.

Image composition task:

  • The website randomly draws an image from the test set and presents the 10 sound snippets with the highest probability under p(S | I∗).
  • Users listen to the sound snippets and decide whether they are acceptable or not.

Sound illustration task:

  • The website randomly presents a sound from the test set and provides the 10 images with the highest probability under p(I | S∗).

The website occasionally draws random images and sounds from the data set as “intruders” and presents them to the users (random baseline).

SLIDE 17

Results (Precision and Recall)

[Plots: recall and precision over the top # of retrieved images and sounds, comparing our approach against the random baseline.]

(a) Image composition task. (b) Sound illustration task.
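For reference, a hedged sketch of how precision and recall at the top k would be computed from the collected user judgments (the slide does not spell out the exact definitions; this is the standard top-k retrieval reading):

    def precision_recall_at_k(ranked, relevant, k):
        # ranked: predicted items, best first; relevant: items users judged acceptable.
        hits = sum(1 for item in ranked[:k] if item in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall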

SLIDE 18

Conclusion & Future Work

  • Developed a framework based on latent Dirichlet models that enables implicit linking of images and sounds via text.
  • The framework allows new probabilistic models to be integrated straightforwardly.
  • An online evaluation system was developed so that humans can evaluate the model’s performance.

Some problems we ran into:

  • The performance of Corr-LDA varies across data sets; images with clutter are particularly difficult.

Future ideas:

  • Explore a suitable way of mixing relevant sounds into a single track to compose a lifelike environmental sound effect.
  • Automatically paint a single collage by selecting segments from relevant images.

SLIDE 19

Thank you for your attention. Questions?

SLIDE 20

Bibliography

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D.M. Blei, and M.I. Jordan. Matching words and pictures. JMLR, 3:1107–1135, 2003.

D.M. Blei and M.I. Jordan. Modeling annotated data. In SIGIR, pages 127–134. ACM, 2003.

Y.M. Hironobu, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.

C. Wang, D. Blei, and F.F. Li. Simultaneous image classification and annotation. In CVPR, pages 1903–1910. IEEE, 2009.
