  1. Toward Artificial Synesthesia: Linking Images and Sounds via Words
  Han Xiao and Thomas Stibor, Fakultät für Informatik, Technische Universität München, {xiao,stibor}@in.tum.de
  December 10, 2010

  2. Synesthesia
  Synesthesia: a perceptual experience in which a stimulus in one modality gives rise to an experience in a different sensory modality. Examples:
  • A picture of a golden beach might stimulate hearing by evoking the sound of waves crashing against the shore.
  • The sound of a baaing sheep might illustrate a green hillside.
  Images and sounds represent distinct modalities; however, both capture the same underlying concept.
  H.Xiao and T.Stibor (TUM), Artificial Synesthesia, December 10, 2010, 2 / 21

  3. Explicit/Implicit Linking between Images and Sounds
  • Explicit: images and sounds are directly associated (without intermediate links).
  • Implicit: images and sounds are not directly associated, but are linked through another intermediate, obscure modality.
  [Diagram: implicit-link example in which J.S. BACH (COMPOSER) and VIOLIN (STRING INSTRUMENT) are connected via the intermediate concept VIOLINIST.]
  Natural language is based on visual and auditive stimuli ⇒ link images and sounds with text.

  4. Related Work
  Domain: linking an image with associated text (e.g. image annotation, multimedia information retrieval, object recognition).
  • Probability of associating words with image grids [Hironobu et al., 1999].
  • Predicting words from images [Barnard et al., 2003].
  • Modeling the generative process of image regions and words in the same latent space [Blei et al., 2003].
  • Jointly modeling image, class label and annotations (supervised topic model) [Wang et al., 2009].
  Consider images and text as two different languages: linking images and words can then be viewed as translating from a visual vocabulary to a textual vocabulary.
  Inspiration: probabilistic models for text/image analysis (LDA, Corr-LDA).
  Representation: bags-of-words model of images and text.

  5. Input Representation and Preprocessing
  Build a visual vocabulary and an auditory vocabulary for representing images and sounds as bags-of-words.
  Image representation:
  • Divide the image into patches and compute a SIFT descriptor (128-dim.) for each patch.
  • Quantize the SIFT descriptors in the collection using k-means; the centroids of the learned clusters compose the visual vocabulary.
  Sound representation:
  • Cut each sound snippet into frames (sequences of 1024 audio samples).
  • For each frame, compute Mel-Frequency Cepstral Coefficients.
  • Each sound snippet is thus represented as a set of 25-dimensional feature vectors.
  • Cluster all feature vectors in the collection using k-means to obtain the auditory words.
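The vocabulary-building step above can be sketched with a plain Lloyd's-algorithm k-means; the function names, array sizes and the random "descriptors" below are illustrative assumptions, not the authors' data or code:

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Quantize feature descriptors (e.g. 128-dim SIFT or 25-dim MFCC
    vectors) into k cluster centroids, i.e. the 'words' of the vocabulary."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def to_bag_of_words(descriptors, centroids):
    """Map each descriptor of one image/snippet to its nearest vocabulary word."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# toy data standing in for the SIFT descriptors of one image
rng = np.random.default_rng(1)
descs = rng.normal(size=(200, 128))
vocab = build_vocabulary(descs, k=10)
words = to_bag_of_words(descs, vocab)
```

The same two functions cover the auditory side by feeding 25-dim MFCC vectors instead of SIFT descriptors.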

  6. Notation
  An annotated image I consists of M visual words and N textual words (annotations): I = {v_1, ..., v_M; w_1, ..., w_N}.
  A captioned sound snippet S consists of M auditory words and N textual words (sound tags): S = {u_1, ..., u_M; w_1, ..., w_N}.
  Training collection T = {I_1, ..., I_K; S_1, ..., S_L}: K annotated images, L tagged sounds.
  W_i denotes the vocabulary of image annotations and W_s the vocabulary of sound tags; the complete textual vocabulary is W = W_i ∪ W_s.

  7. Linking Images and Sounds via Text
  Image composition: given an un-annotated image I* ∉ T, estimate the conditional probability p(S | I*) for every sound snippet S ∈ T.
  Sound illustration: given an un-tagged sound S* ∉ T, estimate the conditional probability p(I | S*) for every image I ∈ T.
  Problem: we cannot estimate p(S | I*) and p(I | S*) directly, as no explicit correspondences exist.
  Idea: "translate" the image into natural language text, then "translate" the text back into sound, that is
    p(S | I*) ≈ Σ_{w' ∈ W_s} Σ_{w ∈ W_i} p(S | w') p(w' | w) p(w | I*),
    p(I | S*) ≈ Σ_{w ∈ W_s} Σ_{w' ∈ W_i} p(I | w') p(w' | w) p(w | S*).
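Once the three conditional distributions are tabulated, the double sum for p(S | I*) collapses into two matrix-vector products. A minimal sketch; all probability tables below are randomly generated placeholders with toy dimensions, not trained values:

```python
import numpy as np

# toy dimensions (assumed): 4 image-annotation words, 5 sound tags,
# 3 candidate training sound snippets
rng = np.random.default_rng(0)
p_w_given_Istar = rng.dirichlet(np.ones(4))            # p(w | I*),  w  in W_i
p_wp_given_w    = rng.dirichlet(np.ones(5), size=4).T  # p(w' | w),  w' in W_s  (5 x 4)
p_S_given_wp    = rng.dirichlet(np.ones(3), size=5).T  # p(S | w')              (3 x 5)

# p(S | I*) ≈ Σ_{w'} Σ_w  p(S | w') p(w' | w) p(w | I*)
scores = p_S_given_wp @ (p_wp_given_w @ p_w_given_Istar)
best_snippet = int(scores.argmax())
```

Because every table is column-normalized, the scores form a proper distribution over the candidate snippets; the sound-illustration direction is the same computation with the roles of W_i and W_s swapped.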

  8. Modeling Images/Text and Sounds/Text with Corr-LDA
  Generative process of an annotated image I = {v_1, ..., v_M; w_1, ..., w_N}:
  1. Draw topic proportions θ ∼ Dirichlet(α)
  2. For each visual word v_m, m ∈ {1, ..., M}:
     1. Draw topic assignment z_m | θ ∼ Multinomial(θ)
     2. Draw visual word v_m | z_m ∼ Multinomial(π_{z_m})¹
  3. For each textual word w_n, n ∈ {1, ..., N}:
     1. Draw discrete indexing variable y_n ∼ Uniform(1, ..., M)
     2. Draw textual word w_n ∼ Multinomial(β_{z_{y_n}})
  [Plate diagram: α → θ → z → v with parameters π (K plate); y → w with parameters β; inner plates M and N, outer plate D.]
  Exchanging the visual word v_m for an auditory word u_m yields the generative process for modeling sounds and text.
  ¹ The original Corr-LDA uses a multivariate Gaussian.
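The generative process above can be sampled directly. A minimal sketch with toy vocabulary and topic sizes (the function name and all dimensions are illustrative assumptions):

```python
import numpy as np

def sample_annotated_image(alpha, pi, beta, M, N, rng):
    """Sample one annotated image from the Corr-LDA generative process
    (multinomial variant): pi[k] is topic k's visual-word distribution,
    beta[k] its textual-word distribution."""
    K = len(alpha)
    theta = rng.dirichlet(alpha)                        # 1.  topic proportions
    z = rng.choice(K, size=M, p=theta)                  # 2.1 topic per visual word
    v = np.array([rng.choice(pi.shape[1], p=pi[zm]) for zm in z])        # 2.2
    y = rng.integers(0, M, size=N)                      # 3.1 index into visual words
    w = np.array([rng.choice(beta.shape[1], p=beta[z[yn]]) for yn in y]) # 3.2
    return v, w

rng = np.random.default_rng(0)
K, Vv, Vt = 3, 8, 6                         # topics, visual vocab, text vocab (toy)
pi   = rng.dirichlet(np.ones(Vv), size=K)
beta = rng.dirichlet(np.ones(Vt), size=K)
v, w = sample_annotated_image(np.ones(K), pi, beta, M=10, N=4, rng=rng)
```

The indexing variable y_n is what ties each annotation word to the topic of one concrete visual word, which is the correspondence that plain LDA lacks.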

  9. Modeling Images/Text and Sounds/Text with Corr-LDA (cont.)
  The trained model gives the distributions of interest p(I | w) and p(w | I*), where I ∈ T and I* ∉ T. Specifically, the distribution over words conditioned on an unseen image is approximated by
    p(w | I*) ≈ Σ_{m=1}^{M} Σ_{z_m} p(z_m | θ) p(w | z_m, β).
  Using Bayes' rule for p(I | w) gives
    p(I | w) = p(w | I) p(I) / Σ_{I' ∈ T} p(w | I') p(I'),
  where
    p(I) = p(θ | α) Π_{m=1}^{M} p(z_m | θ) p(v_m | z_m, π) Π_{n=1}^{N} p(y_n | M) p(w_n | z_{y_n}, β).
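The Bayes'-rule step reduces to an elementwise product followed by normalization over the training images. A sketch with made-up likelihood values (the numbers are purely illustrative):

```python
import numpy as np

# toy likelihoods (assumed): p(w | I) for one query word over 4 training
# images, and a uniform prior p(I)
p_w_given_I = np.array([0.02, 0.30, 0.05, 0.10])
p_I         = np.full(4, 0.25)

# Bayes' rule:  p(I | w) = p(w | I) p(I) / Σ_{I'} p(w | I') p(I')
joint = p_w_given_I * p_I
p_I_given_w = joint / joint.sum()
```

With a uniform prior, the posterior simply renormalizes the likelihoods, so the image most likely to have generated the word gets the highest posterior mass.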

  10. Modeling Text
  Recall:
    p(S | I*) ≈ Σ_{w' ∈ W_s} Σ_{w ∈ W_i} p(S | w') p(w' | w) p(w | I*),
    p(I | S*) ≈ Σ_{w ∈ W_s} Σ_{w' ∈ W_i} p(I | w') p(w' | w) p(w | S*).
  Remaining problem: estimate p(w' | w), the semantic relatedness between two words.
  Approach: an LDA model trained on a data set D containing only the captions of all images and sounds. Generative process of a document (caption) D ∈ D:
  1. Draw topic proportions θ ∼ Dirichlet(α)
  2. For each textual word w_n, n ∈ {1, ..., N}:
     1. Draw topic assignment z_n | θ ∼ Multinomial(θ)
     2. Draw textual word w_n | z_n ∼ Multinomial(β_{z_n})
  Two sets of parameters are estimated: Θ_D = p(z | D) (mixing proportions over topics) and β = p(w | z) (word distributions over topics).
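Plain LDA differs from Corr-LDA only in drawing each word's topic directly from θ, with no indexing variable. A toy sampler (function name and dimensions assumed for illustration):

```python
import numpy as np

def sample_caption(alpha, beta, N, rng):
    """Sample one caption of N words from the plain LDA generative
    process; beta[k] is topic k's distribution over the text vocabulary."""
    theta = rng.dirichlet(alpha)                   # 1.  topic proportions
    z = rng.choice(len(alpha), size=N, p=theta)    # 2.1 topic per word
    w = np.array([rng.choice(beta.shape[1], p=beta[zn]) for zn in z])  # 2.2
    return z, w

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(7), size=3)           # 3 topics, 7-word toy vocab
z, w = sample_caption(np.ones(3), beta, N=12, rng=rng)
```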

  11. Modeling Text (cont.)
  Given the trained LDA model, the word relatedness between w and w' is
    p_LDA(w | w') = (1/C) Σ_{z_n} p(w | z_n) (n_{w'} / n_{z_n}) p(w' | z_n),
  where n_{w'} is the number of occurrences of w' in D, n_{z_n} is the number of words assigned to topic z_n, and C is a normalization factor.
  The relatedness is calculated on a small data set (problematic), so p(w | w') is smoothed using the WordNet dictionary:
    p(w | w') = σ p_LDA(w | w') + (1 − σ) p_WordNet(w | w'),
  where σ is a smoothing parameter.
  [Bar charts: exemplary word relatedness p_LDA(w | rain) and p_WordNet(w | rain).]
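The smoothing step is a convex combination of the two relatedness tables. A sketch with made-up probability rows for a fixed w' and an assumed σ = 0.7 (neither the values nor σ come from the presentation):

```python
import numpy as np

# toy relatedness rows (assumed) for a fixed w' = "rain" over a 5-word vocab
p_lda     = np.array([0.50, 0.20, 0.15, 0.10, 0.05])   # p_LDA(w | rain)
p_wordnet = np.array([0.10, 0.40, 0.20, 0.20, 0.10])   # p_WordNet(w | rain)

sigma = 0.7                                            # smoothing parameter
p_smoothed = sigma * p_lda + (1 - sigma) * p_wordnet
```

Since both inputs are normalized distributions over w, any σ ∈ [0, 1] yields another valid distribution; σ trades off the corpus-specific LDA estimate against the corpus-independent WordNet prior.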

  12. Putting Everything in a Probabilistic Framework
  [Diagram: the overall probabilistic framework, chaining the trained image/text, text/text and sound/text models to link images and sounds via words.]
