
SLIDE 1

Toward Artificial Synesthesia: Linking Images and Sounds via Words

Han Xiao and Thomas Stibor

Fakultät für Informatik, Technische Universität München
{xiao,stibor}@in.tum.de

December 10, 2010

slide-2
SLIDE 2

Synesthesia

Synesthesia: A perceptual experience in which a stimulus in one modality gives rise to an experience in a different sensory modality. Examples:

  • A picture of a golden beach might stimulate hearing by evoking the sound of waves crashing against the shore.
  • The sound of a baaing sheep might illustrate a green hillside.

Images and sounds are distinct modalities; however, both capture the same underlying concept.

SLIDE 3

Explicit/Implicit Linking between Images and Sounds

  • Explicit: Images and sounds are directly associated (without intermediate links).
  • Implicit: Images and sounds are not directly associated; instead, they are linked through another, intermediate but obscure, modality.

[Diagram: implicit chain J.S. BACH → COMPOSER / VIOLINIST → VIOLIN → STRING INSTRUMENT]

  • Natural language is based on visual and auditory stimuli ⇒ link images and sounds with text.

SLIDE 4

Related Work

Domain: Linking an image with associated text (e.g., image annotation, multimedia information retrieval, object recognition).

  • Probability of associating words with image grids [Hironobu et al., 1999].
  • Predicting words from images [Barnard et al., 2003].
  • Modeling the generative process of image regions and words in the same latent space [Blei et al., 2003].
  • Jointly modeling image, class label, and annotations (supervised topic model) [Wang et al., 2009].

Consider images and text as two different languages: linking images and words can then be viewed as translating from a visual vocabulary to a textual vocabulary.

Inspiration: Probabilistic models for text/image analysis (LDA, Corr-LDA).
Representation: Bag-of-words models of images and text.

SLIDE 5

Input Representation and Preprocessing

Build a visual vocabulary and an auditory vocabulary for representing images and sounds as bags of words.

Image representation:

  • Divide the image into patches and compute a SIFT descriptor (128-dim.) for each patch.
  • Quantize the SIFT descriptors of the collection using k-means; the centroids of the learned clusters compose the visual vocabulary.

Sound representation:

  • Each sound snippet is cut into frames (sequences of 1024 audio samples).
  • For each frame, compute Mel-frequency cepstral coefficients (MFCCs).
  • Each sound snippet is thus represented as a set of 25-dimensional feature vectors.
  • Cluster all feature vectors in the collection using k-means to obtain the auditory words.
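A minimal sketch of this vocabulary-building step, assuming scikit-learn's KMeans and feature matrices that have already been extracted (the function and variable names here are illustrative, not from the talk):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(descriptors, n_words):
        # Quantize local descriptors (one row per patch or frame) into
        # n_words cluster centroids, which form the vocabulary.
        return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

    def to_bag_of_words(vocab, descriptors):
        # Map each descriptor to its nearest centroid and count occurrences,
        # yielding the bag-of-words histogram of one image or sound snippet.
        word_ids = vocab.predict(descriptors)
        return np.bincount(word_ids, minlength=vocab.n_clusters)

    # Hypothetical usage: all_sift is an (n_patches, 128) matrix of SIFT
    # descriptors pooled over the collection, all_mfcc an (n_frames, 25)
    # matrix of MFCC vectors; the vocabulary sizes are just examples.
    # visual_vocab = build_vocabulary(all_sift, n_words=241)
    # auditory_vocab = build_vocabulary(all_mfcc, n_words=89)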

SLIDE 6

Notations

An annotated image I consists of M visual words and N textual words (annotations):

    I = {v1, . . . , vM; w1, . . . , wN},

where the vm are visual words and the wn are annotations.

A captioned sound snippet S consists of M auditory words and N textual words:

    S = {u1, . . . , uM; w1, . . . , wN},

where the um are auditory words and the wn are sound tags.

The training collection T = {I1, . . . , IK; S1, . . . , SL} contains K annotated images and L tagged sounds. Denote by Wi the vocabulary of image annotations and by Ws the vocabulary of sound tags; the complete textual vocabulary is W = Wi ∪ Ws.

SLIDE 7

Linking Images and Sounds via Text

Image composition: Given an un-annotated image I∗ ∉ T, estimate the conditional probability p(S | I∗) for every sound snippet S ∈ T.

Sound illustration: Given an un-tagged sound S∗ ∉ T, estimate the conditional probability p(I | S∗) for every image I ∈ T.

Problem: We cannot estimate p(S | I∗) and p(I | S∗) directly, as no explicit correspondences exist.

Idea: “Translate” the image into natural language text, then “translate” the text back into sound (and vice versa), that is

    p(S | I∗) ≈ Σ_{w ∈ Wi} Σ_{w′ ∈ Ws} p(S | w′) p(w′ | w) p(w | I∗),

    p(I | S∗) ≈ Σ_{w ∈ Ws} Σ_{w′ ∈ Wi} p(I | w′) p(w′ | w) p(w | S∗).
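A hedged sketch of this double marginalization, assuming the three conditional distributions have already been estimated and are passed in as dictionaries (all names here are illustrative):

    def score_sound(p_S_given_w2, p_rel, p_w_given_img, vocab_i, vocab_s):
        # p(S | I*) ~ sum over w in Wi, w' in Ws of
        #             p(S | w') * p(w' | w) * p(w | I*)
        # p_S_given_w2[w2] = p(S | w') for the fixed sound S,
        # p_rel[(w2, w)] = p(w' | w), p_w_given_img[w] = p(w | I*).
        total = 0.0
        for w in vocab_i:
            pw = p_w_given_img.get(w, 0.0)
            for w2 in vocab_s:
                total += p_S_given_w2.get(w2, 0.0) * p_rel.get((w2, w), 0.0) * pw
        return total

Ranking all training sounds for an unseen image then amounts to evaluating this score once per sound and sorting; the sound illustration task is symmetric.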

SLIDE 8

Modeling Images/Text and Sounds/Text with Corr-LDA

Generative process of an annotated image I = {v1, . . . , vM; w1, . . . , wN}:

1. Draw topic proportions θ ∼ Dirichlet(α)
2. For each visual word vm, m ∈ {1, . . . , M}:
   2.1 Draw topic assignment zm | θ ∼ Multinomial(θ)
   2.2 Draw visual word vm | zm ∼ Multinomial(πzm) [1]
3. For each textual word wn, n ∈ {1, . . . , N}:
   3.1 Draw discrete indexing variable yn ∼ Uniform(1, . . . , M)
   3.2 Draw textual word wn ∼ Multinomial(βzyn)

[Plate diagram of the Corr-LDA model: variables θ, z, v, y, w with parameters α, π, β; plates M, D, K.]

Exchanging the visual word vm for an auditory word um yields the generative process for modeling sounds and text.

[1] Orig. Corr-LDA uses a multivariate Gaussian.
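A minimal numpy sketch of the generative process above; the parameters α, π, β are assumed to be given, and M, N are per-document counts:

    import numpy as np

    def generate_annotated_image(alpha, pi, beta, M, N, rng=None):
        # alpha: (K,) Dirichlet prior; pi: (K, V_visual) topic-to-visual-word
        # distributions; beta: (K, V_text) topic-to-textual-word distributions.
        rng = rng or np.random.default_rng()
        theta = rng.dirichlet(alpha)                  # 1. topic proportions
        z = rng.choice(len(alpha), size=M, p=theta)   # 2.1 topic per visual word
        v = np.array([rng.choice(pi.shape[1], p=pi[zm]) for zm in z])          # 2.2
        y = rng.integers(0, M, size=N)                # 3.1 0-based index into the M visual words
        w = np.array([rng.choice(beta.shape[1], p=beta[z[yn]]) for yn in y])   # 3.2
        return v, w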

SLIDE 9

Modeling Images/Text and Sounds/Text with Corr-LDA (cont.)

The trained model gives the distributions of interest, p(I | w) and p(w | I∗), where I ∈ T and I∗ ∉ T. Specifically, the distribution over words conditioned on an unseen image is approximated by

    p(w | I∗) ≈ Σ_{m=1}^{M} Σ_{zm} p(zm | θ) p(w | zm, β).

Using Bayes rule for p(I | w) gives

    p(I | w) = p(w | I) p(I) / Σ_{I′ ∈ T} p(w | I′) p(I′),

where

    p(I) = p(θ | α) Π_{m=1}^{M} p(zm | θ) p(vm | zm, π) Π_{n=1}^{N} p(yn | M) p(wn | zyn, β).
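A small sketch of this Bayes-rule ranking, assuming the per-image likelihoods p(w | I) and priors p(I) have been precomputed as arrays over the training collection (names illustrative):

    import numpy as np

    def rank_images(p_w_given_I, p_I):
        # p(I | w) is proportional to p(w | I) * p(I), normalized over T.
        joint = np.asarray(p_w_given_I) * np.asarray(p_I)
        posterior = joint / joint.sum()
        return posterior, np.argsort(-posterior)  # posterior and ranking, best first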

SLIDE 10

Modeling Text

Recall:

    p(S | I∗) ≈ Σ_{w ∈ Wi} Σ_{w′ ∈ Ws} p(S | w′) p(w′ | w) p(w | I∗),

    p(I | S∗) ≈ Σ_{w ∈ Ws} Σ_{w′ ∈ Wi} p(I | w′) p(w′ | w) p(w | S∗).

Remaining problem: Estimate p(w′ | w), the semantic relatedness between two words.

Approach: An LDA model trained on a data set D containing only the captions of all images and sounds. Generative process of a document (caption set) D ∈ D:

1. Draw topic proportions θ ∼ Dirichlet(α)
2. For each textual word wn, n ∈ {1, . . . , N}:
   2.1 Draw topic assignment zn | θ ∼ Multinomial(θ)
   2.2 Draw textual word wn | zn ∼ Multinomial(βzn)

Two sets of parameters are estimated: ΘD = p(z | D) (mixing proportions over topics) and β = p(w | z) (word distributions over topics).

SLIDE 11

Modeling Text (cont.)

Given the trained LDA model, the word relatedness between w and w′ is

    pLDA(w | w′) = (1/C) Σ_{zn} p(w | zn) (nw′ / nzn) p(w′ | zn),

where nw′ is the number of occurrences of w′ in D, nzn is the number of words assigned to topic zn, and C is a normalization factor.

The relatedness is calculated on a small data set (problematic), so p(w | w′) is smoothed using the WordNet dictionary:

    p(w | w′) = σ pLDA(w | w′) + (1 − σ) pWordNet(w | w′),

where σ is the smoothing parameter.

[Table: exemplary word-relatedness outputs pLDA(w | rain) and pWordNet(w | rain).]
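A sketch of the smoothed relatedness, assuming a trained topic-word matrix beta (K × V, with beta[z, w] = p(w | z)), per-topic word counts n_z, word counts n_w over D, and a precomputed WordNet table p_wordnet; how pWordNet is derived is not specified on the slide, so that table is a stand-in:

    import numpy as np

    def p_lda(beta, n_z, n_w, w2):
        # p_LDA(w | w') = (1/C) * sum_z p(w | z) * (n_{w'} / n_z) * p(w' | z)
        weight = (n_w[w2] / n_z) * beta[:, w2]   # per-topic contribution
        scores = beta.T @ weight                 # sum over topics, for every w
        return scores / scores.sum()             # C is the normalization constant

    def relatedness(beta, n_z, n_w, w2, p_wordnet, sigma=0.8):
        # p(w | w') = sigma * p_LDA(w | w') + (1 - sigma) * p_WordNet(w | w')
        return sigma * p_lda(beta, n_z, n_w, w2) + (1 - sigma) * p_wordnet[:, w2]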

SLIDE 12

Putting Everything in a Probabilistic Framework

  • !"
  • #

#

  • $

$ % %

&

& &

SLIDE 13

Experimental Results (Data Sets)

Image data set: Three classes of images from the LabelMe data set [2] (“street”, “coast”, “forest”); 300 images randomly selected from each class.

  • The average annotation length is 7 tokens per image.
  • The textual vocabulary of all annotations has size 156.

Sound data set: 831 audio snippets downloaded from the Freesound Project [3]; most are natural sounds and synthetic sound effects.

  • Sound snippets are converted to 44.1 kHz mono WAV format.
  • Each snippet is tagged by the uploader or other online users.
  • The average number of tags per sound is 6 tokens.
  • The textual vocabulary of sound tags has size 1576.

[2] http://labelme.csail.mit.edu/
[3] http://www.freesound.org

SLIDE 14

Experimental Results (Details)

  • 20% of the data was held out for testing; the remaining 80% was used to estimate parameters.
  • First, a Corr-LDA model was trained on the annotated images.
  • Second, a Corr-LDA model was trained on the tagged sounds.
  • Third, an LDA model was trained on all annotations and tags.

Model parameters:

  • The patch size for computing SIFT descriptors is 16 × 16.
  • Clustering the SIFT descriptors and audio feature vectors yields 241 visual words and 89 auditory words in total.
  • Dirichlet prior α = 0.1 for Corr-LDA and LDA; smoothing parameter σ = 0.8.
  • The number of topics is 40; the maximum number of iterations for variational inference and the EM algorithm [4] is 100.

[4] See the technical note http://home.in.tum.de/~xiaoh/pub/derivation.pdf

SLIDE 15

Illustrative Example

(a) Image composition task: a good prediction (left) and a bad prediction (right).

Good prediction, top-5 sounds:
1. waterfall flowing
2. wave splashing, powerboat engine booming
3. wood stick breaking
4. wave splashing
5. stream flowing

Bad prediction, top-5 sounds:
1. wood stick breaking
2. bell ringing
3. ice cube shaking in glass
4. child speaking
5. glass shattering

(b) Sound illustration task: a good prediction (top, “waterfall flowing”) and a bad prediction (bottom, “vehicles passing”). The images are ranked by conditional probability from highest (leftmost) to lowest (rightmost).

[Figure: the corresponding query images and retrieved images are shown on the slide.]

SLIDE 16

Online Evaluation System (http://yulei.appspot.com)

Allows humans to judge the predicted sounds/images for a randomly given scene.

Image composition task:

  • The website randomly draws an image from the test set and presents the 10 sound snippets with the highest probability under p(S | I∗).
  • Users listen to the sound snippets and decide whether they are acceptable or not.

Sound illustration task:

  • The website randomly presents a sound from the test set and provides the 10 images with the highest probability under p(I | S∗).

The website occasionally draws random images and sounds from the data set as “intruders” and presents them to the users (random baseline).

SLIDE 17

Results (Precision and Recall)

[Plots: recall and precision over the top # of retrieved images and sounds, comparing our approach against the random baseline.]

(a) Image composition task. (b) Sound illustration task.
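For reference, a hedged sketch of how precision and recall at the top k would be computed from the collected user judgments (the slide does not spell out the exact definitions; this is the standard top-k retrieval reading):

    def precision_recall_at_k(ranked, relevant, k):
        # ranked: predicted items, best first; relevant: items users judged acceptable.
        hits = sum(1 for item in ranked[:k] if item in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall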

SLIDE 18

Conclusion & Future Work

  • Developed a framework based on latent Dirichlet models that enables implicit linking of images and sounds via text.
  • The framework allows new probabilistic models to be integrated straightforwardly.
  • An online evaluation system was developed so that humans can evaluate the model’s performance.

Some problems we ran into:

  • The performance of Corr-LDA varies across data sets; images with clutter are particularly difficult.

Future ideas:

  • Explore a suitable way of mixing relevant sounds into a single track to compose a lifelike environmental sound effect.
  • Automatically paint a single collage by selecting segments from relevant images.

SLIDE 19

Thank you for your attention. Questions?

SLIDE 20

Bibliography

K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D.M. Blei, and M.I. Jordan. Matching words and pictures. JMLR, 3:1107–1135, 2003.

D.M. Blei and M.I. Jordan. Modeling annotated data. In SIGIR, pages 127–134. ACM, 2003.

Y.M. Hironobu, H. Takahashi, and R. Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.

C. Wang, D. Blei, and F.F. Li. Simultaneous image classification and annotation. In CVPR, pages 1903–1910. IEEE, 2009.
