Multimodal Machine Learning, Louis-Philippe (LP) Morency, CMU



  1. Multimodal Machine Learning. Louis-Philippe (LP) Morency, CMU Multimodal Communication and Machine Learning Laboratory [MultiComp Lab]

  2. CMU Course 11-777: Multimodal Machine Learning

  3. Lecture Objectives
  ▪ What is Multimodal?
  ▪ Multimodal: core technical challenges
    ▪ Representation learning, translation, alignment, fusion, and co-learning
  ▪ Multimodal representation learning
    ▪ Joint and coordinated representations
    ▪ Multimodal autoencoder and tensor representation
    ▪ Deep canonical correlation analysis
  ▪ Fusion and temporal modeling
    ▪ Multi-view LSTM and memory-based fusion
    ▪ Fusion with multiple attentions

  4. What is Multimodal?

  5. What is Multimodal? Multimodal distribution ➢ Multiple modes, i.e., distinct “peaks” (local maxima) in the probability density function

  6. What is Multimodal? Sensory Modalities

  7. Multimodal Communicative Behaviors
  Verbal
    ▪ Lexicon: words
    ▪ Syntax: part-of-speech, dependencies
    ▪ Pragmatics: discourse acts
  Vocal
    ▪ Prosody: intonation, voice quality
    ▪ Vocal expressions: laughter, moans
  Visual
    ▪ Gestures: head gestures, eye gestures, arm gestures
    ▪ Body language: body posture, proxemics
    ▪ Eye contact: head gaze, eye gaze
    ▪ Facial expressions: FACS action units; smile, frowning

  8. What is Multimodal?
  Modality: the way in which something happens or is experienced.
    • Modality refers to a certain type of information and/or the representation format in which information is stored.
    • Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication.
  Medium (“middle”): a means or instrumentality for storing or communicating information; a system of communication/transmission.
    • The medium is the means whereby information is delivered to the senses of the interpreter.

  9. Multiple Communities and Modalities: Psychology, Medical, Speech, Vision, Language, Multimedia, Robotics, Learning

  10. Examples of Modalities
  ❑ Natural language (either spoken or written)
  ❑ Visual (from images or videos)
  ❑ Auditory (including voice, sounds, and music)
  ❑ Haptics / touch
  ❑ Smell, taste, and self-motion
  ❑ Physiological signals
    ▪ Electrocardiogram (ECG), skin conductance
  ❑ Other modalities
    ▪ Infrared images, depth images, fMRI

  11. Prior Research on “Multimodal”: four eras of multimodal research
  ➢ The “behavioral” era (1970s until late 1980s)
  ➢ The “computational” era (late 1980s until 2000)
  ➢ The “interaction” era (2000 to 2010)
  ➢ The “deep learning” era (2010s until ...) ❖ main focus of this tutorial

  12. The McGurk Effect (1976): “Hearing Lips and Seeing Voices”, Nature

  13. The McGurk Effect (1976), continued: “Hearing Lips and Seeing Voices”, Nature

  14. The “Computational” Era (late 1980s until 2000): 1) Audio-Visual Speech Recognition (AVSR)

  15. Core Technical Challenges

  16. Core Challenges in “Deep” Multimodal ML. Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, https://arxiv.org/abs/1705.09406. These challenges are not mutually exclusive.

  17. Core Challenge 1: Representation. Definition: learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy between modalities. A) Joint representations: modality 1 and modality 2 are mapped together into a single representation.

  18. Joint Multimodal Representation. [Figure: inputs such as “I like it!” spoken in a joyful tone and “Wow!” with a tensed voice are mapped into a joint representation (multimodal space).]

  19. Joint Multimodal Representations
  ▪ Audio-visual speech recognition [Ngiam et al., ICML 2011]: bimodal deep belief network over audio and video
  ▪ Image captioning [Srivastava and Salakhutdinov, NIPS 2012]: multimodal deep Boltzmann machine over visual and verbal inputs
  ▪ Audio-visual emotion recognition [Kim et al., ICASSP 2013]: deep Boltzmann machine

  20. Multimodal Vector Space Arithmetic. [Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
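Embeddings that support this kind of cross-modal arithmetic come from coordinated training of the image and text encoders. As a minimal sketch (not the deck's exact formulation), here is a pairwise ranking loss in the spirit of Kiros et al., 2014; the function name, the margin value, and the random toy embeddings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Coordinate two embedding spaces: matching image/text pairs should
    score higher (by at least `margin`) than mismatched pairs.
    img_emb, txt_emb: (batch, dim) embeddings where row i of each matches."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    scores = img @ txt.t()                 # cosine similarity matrix
    pos = scores.diag().view(-1, 1)        # similarities of the true pairs
    # Hinge cost for ranking every mismatched pair below the true pair.
    cost = (margin - pos + scores).clamp(min=0)
    cost.fill_diagonal_(0)                 # don't penalize the true pairs
    return cost.mean()

# Toy usage: random "image" and "caption" embeddings for 8 matched pairs.
loss = pairwise_ranking_loss(torch.randn(8, 64), torch.randn(8, 64))
```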

  21. Core Challenge 1: Representation. Definition: learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy. A) Joint representations: both modalities are projected into a single shared representation. B) Coordinated representations: each modality keeps its own representation, and the two are coordinated through a constraint such as similarity.

  22. Coordinated Representation: Deep CCA. Learn linear projections that are maximally correlated:

  $(\mathbf{v}^*, \mathbf{w}^*) = \underset{\mathbf{v}, \mathbf{w}}{\operatorname{argmax}} \ \operatorname{corr}(\mathbf{v}^\top \mathbf{X}, \mathbf{w}^\top \mathbf{Y})$

  where $\mathbf{X}$ (e.g., text) and $\mathbf{Y}$ (e.g., image) are the two views; in Deep CCA the projections are applied to representations $\mathbf{H}_x$ and $\mathbf{H}_y$ produced by deep networks over the two views. [Andrew et al., ICML 2013]
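To make the objective concrete, here is a minimal NumPy sketch of classic linear CCA via whitening and SVD; the function name, the regularizer, and the toy data are illustrative assumptions. Deep CCA (Andrew et al., ICML 2013) keeps the same correlation objective but feeds it the outputs of two deep networks instead of the raw views.

```python
import numpy as np

def linear_cca(X, Y, reg=1e-6):
    """First pair of canonical directions (v, w) maximizing corr(Xv, Yw).
    X: (n, dx) and Y: (n, dy) hold n paired samples from the two views."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten each view (Cholesky); the SVD of the whitened
    # cross-covariance then yields the canonical correlations.
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T)
    v = np.linalg.solve(Lx.T, U[:, 0])   # map back from whitened space
    w = np.linalg.solve(Ly.T, Vt[0])
    return v, w, s[0]                    # s[0] is the top canonical corr.

# Toy usage: two views sharing a one-dimensional latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(500, 4))
v, w, corr = linear_cca(X, Y)   # corr should be close to 1
```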

  23. Core Challenge 2: Alignment. Definition: identify the direct relations between (sub)elements from two or more different modalities. A) Explicit alignment: the goal is to directly find correspondences between elements of different modalities (e.g., matching time steps t1 ... tn across two sequences). B) Implicit alignment: uses an internal, latent alignment of modalities in order to better solve a different problem.

  24. Temporal sequence alignment. Applications:
  - Re-aligning asynchronous data
  - Finding similar data across modalities (we can estimate the alignment cost)
  - Event reconstruction from multiple sources
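The slide does not name a specific algorithm, but the classic tool for re-aligning two asynchronous sequences is dynamic time warping (DTW), which multimodal extensions such as canonical time warping build on. A minimal sketch, with an illustrative element-wise distance:

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping: minimal cumulative cost of aligning
    sequence x to sequence y, plus the warping path itself."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(x[i - 1], y[j - 1])
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from the end to recover which elements were matched.
    path, i, j = [], n, m
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return D[n, m], path[::-1]

# Toy usage: the same signal at two different speeds aligns cheaply.
cost, path = dtw([0, 1, 2, 3], [0, 0, 1, 1, 2, 3])
```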

  25. Alignment examples (multimodal)

  26. Implicit Alignment. Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, https://arxiv.org/pdf/1406.5679.pdf

  27. Core Challenge 3: Fusion. Definition: to join information from two or more modalities to perform a prediction task.
  A) Model-agnostic approaches:
  1) Early fusion: concatenate the features from each modality and train a single classifier.
  2) Late fusion: train one classifier per modality, then combine their decisions.
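A minimal sketch of both model-agnostic options on toy data; the feature dimensions, the logistic-regression classifiers, and the probability-averaging rule for late fusion are illustrative assumptions, not choices from the deck.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                   # toy binary labels
X_audio = rng.normal(size=(200, 16)) + y[:, None]  # toy acoustic features
X_video = rng.normal(size=(200, 32)) + y[:, None]  # toy visual features

# 1) Early fusion: concatenate modality features, train one classifier.
early = LogisticRegression(max_iter=1000)
early.fit(np.hstack([X_audio, X_video]), y)

# 2) Late fusion: one classifier per modality, combine their decisions
#    (here, by averaging the predicted class probabilities).
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
late_proba = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
late_pred = late_proba.argmax(axis=1)
```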

  28. Core Challenge 3: Fusion. Definition: to join information from two or more modalities to perform a prediction task.
  B) Model-based (intermediate) approaches:
  1) Deep neural networks
  2) Kernel-based methods (e.g., multiple kernel learning)
  3) Graphical models (e.g., multi-view hidden CRF)
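For the neural-network flavor of model-based fusion, here is a minimal PyTorch sketch in which fusion happens on intermediate hidden representations rather than on raw features (early fusion) or final decisions (late fusion); the module name and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    """Each modality gets its own encoder; their hidden representations
    are merged inside the network before the prediction head."""
    def __init__(self, d_audio=16, d_video=32, d_hidden=64, n_classes=2):
        super().__init__()
        self.enc_audio = nn.Sequential(nn.Linear(d_audio, d_hidden), nn.ReLU())
        self.enc_video = nn.Sequential(nn.Linear(d_video, d_hidden), nn.ReLU())
        self.head = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, x_audio, x_video):
        h = torch.cat([self.enc_audio(x_audio), self.enc_video(x_video)], dim=-1)
        return self.head(h)

# Toy usage: a batch of 8 paired audio/video feature vectors.
logits = IntermediateFusion()(torch.randn(8, 16), torch.randn(8, 32))
```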

  29. Core Challenge 4: Translation. Definition: process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective. A) Example-based. B) Model-driven.

  30. Core Challenge 4: Translation. Example: generating visual gestures (both speaker and listener gestures) from transcriptions and audio streams. Marsella et al., Virtual Character Performance from Speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013

  31. Core Challenge 5: Co-Learning. Definition: transfer knowledge between modalities, including their representations and predictive models.

  32. Core Challenge 5: Co-Learning. Data settings: A) parallel, B) non-parallel, C) hybrid.

  33. [Figure: language, visual, and acoustic input sequences over time steps t1 ... tn feed a model that makes a prediction such as “Big dog on the beach”.]

  34. Taxonomy of Multimodal Research [https://arxiv.org/abs/1705.09406]
  ▪ Representation
    o Joint: neural networks, graphical models, sequential
    o Coordinated: similarity, structured
  ▪ Translation
    o Example-based: retrieval, combination
    o Model-based: grammar-based, encoder-decoder, online prediction
  ▪ Alignment
    o Explicit: unsupervised, supervised
    o Implicit: graphical models, neural networks
  ▪ Fusion
    o Model agnostic: early fusion, late fusion, hybrid fusion
    o Model-based: kernel-based, graphical models, neural networks
  ▪ Co-learning
    o Parallel data: co-training, transfer learning
    o Non-parallel data: zero-shot learning, concept grounding, transfer learning
    o Hybrid data: bridging
  Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

  35. Multimodal Applications [https://arxiv.org/abs/1705.09406]. Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

  36. Multimodal Representations

  37. Core Challenge: Representation. Definition: learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy. A) Joint representations: both modalities are projected into a single shared representation. B) Coordinated representations: each modality keeps its own representation, coordinated through a constraint such as similarity.

  38. Deep Multimodal Autoencoders
  ▪ A deep representation learning approach
  ▪ A bimodal autoencoder
  ▪ Used for audio-visual speech recognition
  [Ngiam et al., Multimodal Deep Learning, 2011]

  39. Deep Multimodal Autoencoders: Training
  ▪ Individual modalities can be pretrained
    ▪ RBMs
    ▪ Denoising autoencoders
  ▪ To train the model to reconstruct the other modality, alternate the training inputs:
    ▪ Use both modalities
    ▪ Remove the audio (video-only input), while still reconstructing both modalities
  [Ngiam et al., Multimodal Deep Learning, 2011]
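A simplified PyTorch sketch of this training regime, assuming MSE reconstruction and zeroing-out as the way to "remove" the audio input; Ngiam et al. use RBM pretraining and a deeper architecture, so this only illustrates the cross-modal reconstruction idea.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    """Both modalities are encoded into one shared code, and the code
    must reconstruct both modalities (even when one input is missing)."""
    def __init__(self, d_audio=100, d_video=300, d_joint=128):
        super().__init__()
        self.enc_audio = nn.Linear(d_audio, d_joint)
        self.enc_video = nn.Linear(d_video, d_joint)
        self.dec_audio = nn.Linear(d_joint, d_audio)
        self.dec_video = nn.Linear(d_joint, d_video)

    def forward(self, x_audio, x_video):
        z = torch.relu(self.enc_audio(x_audio) + self.enc_video(x_video))
        return self.dec_audio(z), self.dec_video(z)

model = BimodalAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
x_audio, x_video = torch.randn(32, 100), torch.randn(32, 300)  # toy batch
for step in range(200):
    # Alternate the inputs: on odd steps, zero out the audio so the
    # network learns to reconstruct both modalities from video alone.
    a_in = x_audio if step % 2 == 0 else torch.zeros_like(x_audio)
    rec_a, rec_v = model(a_in, x_video)
    loss = mse(rec_a, x_audio) + mse(rec_v, x_video)
    opt.zero_grad()
    loss.backward()
    opt.step()
```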
