Multimodal Machine Learning Main Goal Define a common taxonomy for - PowerPoint PPT Presentation

Multimodal Machine Learning

Main Goal: Define a common taxonomy for multimodal machine learning and provide an overview of research in this area.

Introduction: Preliminary Terms
Modality: the way in which something happens or is experienced.
Multimodal machine learning (MML): building models that process and relate information from multiple modalities.

History of MML
Audio-Visual Speech Recognition (AVSR)
● McGurk effect
● Visual information improved performance when the speech signal was noisy
Multimedia content indexing and retrieval
● Searching visual and multimodal content directly
Multimodal interaction
● Understanding human multimodal behaviors (facial expressions, speech, etc.) during social interactions
Media Description
● Image captioning
● Challenging problem to evaluate

Five Main Challenges of MML
1. Representation: representing and summarizing multimodal data
2. Translation: mapping from one modality to another (e.g., image captioning)
3. Alignment: identifying the corresponding elements between modalities (e.g., recipe steps to the correct video frame)
4. Fusion: joining information from multiple modalities to predict (e.g., using lip motion and speech to predict spoken words)
5. Co-learning: transferring knowledge between modalities, their representation, and their predictive models
These challenges need to be tackled for the field to progress.

Representation
Multimodal representation: a representation of data using information from multiple entities (an image, word/sentence, audio sample, etc.).
We need to represent multimodal data in a meaningful way to have good models. This is challenging because multimodal data are heterogeneous.

Representation: Joint vs. Coordinated
Example constraints for coordinated representations: minimize cosine distance, maximize correlation
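A coordinated representation keeps one embedding per modality and trains them so that matching items end up close together. The following is a minimal sketch of the cosine distance that such a constraint would drive down. The three-dimensional vectors and their names are hypothetical values made up for illustration, not taken from the slides.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical embeddings (assumption): one from an image encoder,
# two from a text encoder, all living in the same coordinated space.
image_dog = [0.9, 0.1, 0.0]
text_dog = [0.8, 0.2, 0.1]
text_car = [0.0, 0.1, 0.9]

# A well-coordinated space gives the matching pair the smaller distance.
print(cosine_distance(image_dog, text_dog))
print(cosine_distance(image_dog, text_car))
```

A training objective would typically minimize this distance for matching pairs, often alongside a term that pushes non-matching pairs apart.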

Joint Representation
Mostly used when multimodal data is present during training and inference.
Methods:
● Simple concatenation
● Neural networks
● Probabilistic graphical models
● Sequential representation
Neural networks are often pre-trained using an autoencoder on unsupervised data.

Coordinated Representation
Similarity Models
Enforce similarity between representations by minimizing the distance between modalities in the coordinated space.
Structured Coordinated Space Models
Enforce additional constraints between modalities.
Example: cross-modal hashing. Additional constraints are:
● N-dimensional Hamming space
● The same object from different modalities has to have a similar hash code
● Similarity-preserving
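To make the cross-modal hashing constraints concrete, here is a small sketch of retrieval in a Hamming space. It assumes each item has already been mapped to a binary code by some modality-specific hash function. The 8-bit codes and file names are hypothetical.

```python
def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length bit strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))

def retrieve(query_code, database):
    """Return the key whose hash code is closest to the query code."""
    return min(database, key=lambda name: hamming_distance(query_code, database[name]))

# Hypothetical 8-bit codes (assumption): images hashed by one function,
# the text query hashed by another, both into the same Hamming space.
image_codes = {"dog.jpg": "10110010", "car.jpg": "01001101"}
text_query_dog = "10110011"

print(retrieve(text_query_dog, image_codes))
```

The "similar hash code" constraint is what makes this lookup meaningful: the text code for a concept should differ from the image code of the same concept in only a few bits.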

Translation: Mapping from one modality to another (e.g., image captioning)
Example-based: use a dictionary to translate between modalities.
Generative: construct a model that translates between modalities.

Example-Based Translation
Retrieval-Based
Use the retrieved translation without modification.
Problem: Often requires an extra processing step (e.g., re-ranking of retrieved translations), because similarity in the unimodal space does not always mean a good translation.
Solution: Use an intermediate semantic space for similarity comparison. Performs better because the space reflects both modalities and allows for bi-directional translation. Requires manual construction or learning of the space, which needs large training dictionaries.
Combination-Based
Combines retrievals from the dictionary in a meaningful way to create a better translation.
Rules are often hand-crafted or heuristic.
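Retrieval-based translation can be sketched as a nearest-neighbour lookup: find the dictionary entry whose source-side features are closest to the query and return its paired translation unchanged. The feature vectors, captions, and the function name `retrieve_translation` below are hypothetical, and Euclidean distance in the unimodal feature space is an assumption chosen for simplicity.

```python
import math

def retrieve_translation(query_features, dictionary):
    """Return the target paired with the nearest source entry.

    dictionary is a list of (source_features, target) pairs.
    """
    _, target = min(dictionary, key=lambda pair: math.dist(query_features, pair[0]))
    return target

# Hypothetical dictionary (assumption): image features paired with captions.
dictionary = [
    ([0.9, 0.1], "a dog runs on grass"),
    ([0.1, 0.9], "a car parked on a street"),
]

print(retrieve_translation([0.8, 0.3], dictionary))
```

Because the comparison happens only in the source modality, a close neighbour is not guaranteed to carry a good caption, which is the problem the intermediate semantic space is meant to address.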

Generative Translation
Constructing models that perform multimodal translation on a unimodal source. Requires the ability to understand the source and generate the target.
Grammar-Based
Detect high-level concepts from source and generate a target using a pre-defined grammar.
More likely to generate logically correct targets.
Formulaic translations, need complex pipelines for concept detection.
Example: video description of who did what to whom and where and how.
Encoder-Decoder
Encode the source modality into a latent representation, then decode that representation into the target modality (one pass).
Encoders: RNNs, DBNs, CNNs
Decoders: RNNs, LSTMs
May be memorizing the data.
Require lots of data for training.
Continuous Generation
Generate target modality at every timestep based on a stream of source modality inputs.
HMMs, RNNs, encoder-decoders

Translation Evaluation: A Major Challenge
There are often multiple correct translations.
Evaluation methods
● Human evaluation: impractical and biased
● BLEU, ROUGE, Meteor, CIDEr: low correlation to human judgment, require a high number of reference translations
● Retrieval: better reflects human judgments
○ Rank the available captions and assess if the correct captions get a high rank
● Visual question-answering for image captioning: ambiguity in questions and answers, question bias
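The retrieval-style evaluation can be sketched as follows: for each image, the model scores every candidate caption, and we check whether the correct caption lands within the top k. Recall@k is one common way to summarize this. It is used here as an assumption, since the slides do not name a specific metric, and the scores are made up.

```python
def rank_of_correct(scores, correct_index):
    """1-based rank of the correct caption when sorted by descending score."""
    correct_score = scores[correct_index]
    return 1 + sum(s > correct_score for s in scores)

def recall_at_k(all_scores, correct_indices, k):
    """Fraction of queries whose correct caption is ranked in the top k."""
    hits = sum(
        rank_of_correct(scores, correct) <= k
        for scores, correct in zip(all_scores, correct_indices)
    )
    return hits / len(all_scores)

# Hypothetical model scores (assumption): one row per image,
# one column per candidate caption.
all_scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.5],
    [0.6, 0.7, 0.1],
]
correct_indices = [0, 2, 2]

print(recall_at_k(all_scores, correct_indices, k=1))
print(recall_at_k(all_scores, correct_indices, k=2))
```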

Alignment
“Finding relationships and correspondences between sub-components of instances from two or more modalities.”
Examples:
● Given an image and caption, find the areas of the image corresponding to the caption.
● Given a movie, align it to the book chapters it was based on.

Explicit Alignment (unsupervised and supervised)
Unsupervised: no direct alignment labels.
Supervised: direct alignment labels.
Most approaches are inspired by work on statistical machine translation and genome sequences.
If there is no similarity metric between modalities, canonical correlation analysis (CCA) is used to map the modalities to a shared space. CCA finds the linear combinations of the data that maximize their correlation.
Example applications:
● Spoken words ↔ visual objects in images
● Movie shots and scenes ↔ screenplay
● Recipes ↔ cooking videos
● Speakers ↔ videos
● Sentences ↔ video frames
● Image regions ↔ phrases
● Speakers in audio ↔ locations in video
● Objects in 3D scenes ↔ nouns in text
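The quantity CCA maximizes can be illustrated with a small sketch. This is not a CCA solver. It only evaluates the objective, the correlation between one linear combination of each modality, for weight vectors supplied by hand. The data, weights, and function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def project(samples, weights):
    """Apply one linear combination to every sample."""
    return [sum(w * x for w, x in zip(weights, sample)) for sample in samples]

def cca_objective(X, Y, w_x, w_y):
    """Correlation between the projections of the two modalities."""
    return pearson(project(X, w_x), project(Y, w_y))

# Hypothetical paired samples (assumption): row i of X and row i of Y
# describe the same instance in two different modalities.
X = [[1, 5], [2, 3], [3, 8], [4, 1]]
Y = [[2, 4], [4, 9], [6, 5], [8, 6]]

print(cca_objective(X, Y, [1, 0], [1, 0]))  # first features only
print(cca_objective(X, Y, [0, 1], [0, 1]))  # second features only
```

A real CCA implementation would search over the weight vectors instead of taking them as input, usually via an eigenvalue or singular value decomposition.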

Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

Implicit Alignment
Used as an intermediate step for another task.
Does not rely on supervised alignment examples.
Data is latently aligned during model training.
Useful for speech recognition, machine translation, media description, visual question-answering.
Example: alignment of words and image regions before performing image retrieval based on text descriptions.
Difficulties in alignment:
● Few datasets with explicitly annotated alignments
● Difficult to design similarity metrics
● May exist 0, 1, or many correct alignments

Fusion
● Early fusion: features integrated immediately (concatenation)
● Late fusion: each modality makes an independent decision (averaging, voting schemes, weighted combinations, other ensemble techniques)
● Hybrid fusion: exploits advantages of both
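The difference between early and late fusion can be shown in a few lines. In this sketch, early fusion concatenates feature vectors before any model sees them, while late fusion combines per-modality class probabilities with a weighted average. The feature values, probabilities, and weights are hypothetical.

```python
def early_fusion(audio_features, visual_features):
    """Concatenate unimodal features into one joint feature vector."""
    return list(audio_features) + list(visual_features)

def late_fusion(probabilities, weights):
    """Weighted average of per-modality class probability vectors."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]

# Early fusion: a single downstream model would consume this vector.
print(early_fusion([0.1, 0.2], [0.3]))

# Late fusion: hypothetical outputs (assumption) of two independent
# classifiers over the same two classes, with the visual one trusted more.
audio_probs = [0.6, 0.4]
visual_probs = [0.2, 0.8]
print(late_fusion([audio_probs, visual_probs], weights=[1, 3]))
```

A hybrid scheme would feed both the concatenated features and the unimodal decisions into the final predictor.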

Fusion Techniques
Multiple kernel learning (MKL):
● An extension of kernel support vector machines
● Kernels function as similarity functions between data
● Modality-specific kernels allow for better fusion
MKL Application: performing musical artist similarity ranking from acoustic, semantic, and social view data. (McFee et al., Learning Multi-modal Similarity)
Neural networks (RNN/LSTM) can learn the multimodal representation and fusion component end-to-end. They achieve good performance but require large datasets and are less interpretable.
LSTM Applications:
● Audio-visual emotion classification
● Neural image captioning
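The idea of modality-specific kernels can be sketched as a weighted sum of one kernel per modality. MKL would learn the weights jointly with the classifier. Here they are fixed by hand, which is a simplifying assumption. The choice of an RBF kernel for audio and a linear kernel for text, as well as the sample values, are hypothetical.

```python
import math

def linear_kernel(u, v):
    """Dot-product similarity."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian similarity that decays with squared Euclidean distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def combined_kernel(sample_a, sample_b, weights):
    """Weighted sum of one kernel per modality."""
    return (
        weights["audio"] * rbf_kernel(sample_a["audio"], sample_b["audio"])
        + weights["text"] * linear_kernel(sample_a["text"], sample_b["text"])
    )

# Hypothetical samples (assumption): each has audio and text features.
x1 = {"audio": [0.5, 0.5], "text": [1.0, 0.0]}
x2 = {"audio": [0.4, 0.6], "text": [0.0, 1.0]}
weights = {"audio": 0.7, "text": 0.3}  # fixed here, learned in real MKL

print(combined_kernel(x1, x1, weights))
print(combined_kernel(x1, x2, weights))
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the combined function could be passed to any kernel method such as a support vector machine.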

Co-learning
Modeling a resource-poor modality by exploiting a resource-rich modality.
Used to address lack of annotated data, noisy data, and unreliable labels.
Can generate more labeled data, but can also lead to overfitting.

Co-learning examples
Transfer learning application: using text to improve visual representations for image classification by coordinating CNN features with word2vec features.
Conceptual grounding: learning meanings/concepts based on vision, sound, or smell (not just on language).
Zero-shot learning (ZSL): recognizing a class without having seen a labeled example of it.
ZSL Example: using an intermediate semantic space to predict unseen words people are thinking about from fMRI data.
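Zero-shot learning through an intermediate semantic space can be sketched as a nearest-neighbour lookup. A model, not shown here, maps an input to a point in the semantic space, and the prediction is the class whose semantic vector is closest, even if that class had no labeled training examples. The attribute vectors, class names, and predicted point below are hypothetical.

```python
import math

def zero_shot_predict(semantic_point, class_vectors):
    """Return the class whose semantic vector is nearest to the point."""
    return min(
        class_vectors,
        key=lambda name: math.dist(semantic_point, class_vectors[name]),
    )

# Hypothetical semantic space (assumption) with three attributes:
# [is_animal, is_pet, has_stripes]. "zebra" has no training examples,
# but its semantic vector is known from side information such as text.
class_vectors = {
    "dog": [1.0, 1.0, 0.0],
    "cat": [1.0, 0.0, 0.0],
    "zebra": [1.0, 0.0, 1.0],
}

# Output of a hypothetical input-to-semantic-space model for a new input.
predicted_point = [0.9, 0.1, 0.8]

print(zero_shot_predict(predicted_point, class_vectors))
```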

Zero-Shot Learning with Semantic Output Codes

Grounding Semantics in Olfactory Perception “This work opens up interesting possibilities in analyzing smell and even taste. It could be applied in a variety of settings beyond semantic similarity, from chemical information retrieval to metaphor interpretation to cognitive modelling. A speculative blue-sky application based on this, and other multi-modal models, would be an NLG application describing a wine based on its chemical composition, and perhaps other information such as its color and country of origin.”

Paper Critique This paper is very thorough in its survey of MML challenges and what researchers have done to approach them. MML is central to the advancement of AI; thus, this area must be studied in order to make progress. Future research directions include any MML projects that make headway in the five challenge areas.

Questions?
