Multimodal learning from images and speech, KU Leuven & UPF

Task 2: Keyword spotting
Keyword   Example of matched utterance                          Type
beach     a boy in a yellow shirt is walking on a beach . . .   correct
behind    a surfer does a flip on a wave                        mistake
bike      a dirt biker flies through the air                    variant
boys      two children play soccer in the park                  semantic
large     . . . a rocky cliff overlooking a body of water       semantic
play      children playing in a ball pit                        variant
sitting   two people are seated at a table with drinks          semantic
yellow    a tan dog jumping over a red and blue toy             mistake
young     a little girl on a kid swing                          semantic

Task 3: Semantic speech retrieval
Written query: fire
[Figure: examples labelled "burning"]
[Kamper et al., TASLP’19]

Human (MTurk) evaluation
Keyword   Top retrieved utterance                                            Human label
ocean     man falling off a blue surfboard in the ocean                      5 / 5
snowy     a skier catches air over the snow                                  5 / 5
bike      a dirt biker rides through some trees                              4 / 5
children  a group of young boys playing soccer                               4 / 5
field     two white dogs running in the grass together                       3 / 5
swimming  a woman holding a young boy slide down a water slide into a pool   3 / 5
carrying  small dog running in the grass with a toy in its mouth             2 / 5 ∗
large     a group of people on a zig path through the mountains              1 / 5 ∗
hair      two women and a man smile for the camera                           0 / 5 ∗

Task 3: Semantic speech retrieval
[Bar chart: P@10 (axis 0 to 100) for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP, TextParagram]

Task 3: Semantic speech retrieval
[Bar chart: Spearman’s ρ (axis 0 to 30) for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP, TextParagram]
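
The two metrics named on these slides, P@10 and Spearman’s ρ, can be illustrated with a minimal sketch. It assumes binary relevance labels for P@10 and average ranks for ties in Spearman’s ρ; the function names and toy numbers are illustrative and are not taken from the talk.

```python
def precision_at_k(ranked_relevance, k=10):
    """Fraction of the top-k retrieved items that are relevant (1) rather than not (0)."""
    return sum(ranked_relevance[:k]) / k


def average_ranks(values):
    """Ranks 1..n in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data: relevance of the ten top-ranked utterances for one keyword,
# and model scores against human ratings for four utterances.
p_at_10 = precision_at_k([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
rho = spearman_rho([0.9, 0.2, 0.6, 0.4], [5, 1, 3, 4])
```

On the toy data, six of the ten top-ranked items are relevant, and the model ranking disagrees with the human ranking on one adjacent pair.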

But this model is trained for English?
[Figure: network diagram. Speech input $X$ passes through conv, max and feedfwd layers to give $f(X)$; a VGG branch gives $y_{\mathrm{vis}}$; a loss $L$ connects the two.]
[Kamper et al., Interspeech’17]

Task 4: Cross-lingual keyword spotting
Arapaho speech collection (want to search). Given English keyword: ‘Disease’.
English speech collection (want to search). Given German keyword: ‘Hunde’.
[Kamper and Roth, SLTU’18]

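Task 4: Cross-lingual keyword spotting
Cross-lingual keyword spotter
[Figure: network diagram. English speech $X$ passes through conv, max and feedfwd layers to give $f(X)$; image $I$ passes through VGG-16 to give German (text) tags $\hat{y}_{\mathrm{de}}$, e.g. "springt", "Hunde"; a loss $\ell$ connects the two.]
[Kamper and Roth, SLTU’18]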
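
The diagram on this slide places a loss between the speech network output f(X) and the German tags derived from the paired image. The slide does not say which loss is used, so the sketch below simply assumes a summed sigmoid cross-entropy against soft targets. The two-word vocabulary, the scores, and the 0.5 detection threshold are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_target_cross_entropy(logits, soft_targets):
    """Summed binary cross-entropy between sigmoid(logits) and targets in [0, 1]."""
    loss = 0.0
    for z, y in zip(logits, soft_targets):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss


def spot_keyword(logits, vocab, keyword, threshold=0.5):
    """Flag an utterance as containing the keyword if its score exceeds the threshold."""
    return sigmoid(logits[vocab.index(keyword)]) > threshold


# Hypothetical values: German tag vocabulary, soft tags from the image branch,
# and speech-network outputs (pre-sigmoid) for one English utterance.
vocab = ["Hunde", "springt"]
y_hat_de = [0.9, 0.2]
logits_agree = [2.0, -1.5]     # speech output that agrees with the tags
logits_disagree = [-2.0, 1.5]  # speech output that disagrees

loss_agree = soft_target_cross_entropy(logits_agree, y_hat_de)
loss_disagree = soft_target_cross_entropy(logits_disagree, y_hat_de)
```

In this setup the English utterance is never transcribed: the only supervision reaching the speech network is the tag vector from the image side, so a German keyword such as "Hunde" can be scored directly on English audio.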

2. Multimodal One-Shot Learning from Images and Speech
Ryan Eloff, Herman Engelbrecht

You are the robot

Unimodal one-shot learning and classification
One-shot speech learning: support set with one example each of "three", "one", "five", "two", "four"
One-shot speech classification: query (spoken "two"), ŷ = two
[Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14]
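
A minimal sketch of one-shot classification as set up on this slide: the query takes the label of its nearest neighbour in the support set. The 2-D feature vectors and the Euclidean distance are stand-ins chosen for illustration, since the slide does not specify features or a metric.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_shot_classify(query, support_set, distance):
    """Return the label of the support example closest to the query."""
    label, _ = min(
        ((label, distance(query, example)) for label, example in support_set),
        key=lambda pair: pair[1],
    )
    return label


# One labelled example per class (toy feature vectors, not real speech features).
support_set = [
    ("three", [0.9, 0.1]),
    ("one", [0.1, 0.2]),
    ("five", [0.5, 0.9]),
    ("two", [0.3, 0.6]),
    ("four", [0.8, 0.7]),
]
query = [0.32, 0.55]
predicted = one_shot_classify(query, support_set, euclidean)
```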

Multimodal one-shot learning and matching
Multimodal one-shot learning: support set
Multimodal one-shot matching: query (spoken "two"); which item in the matching set does it match?
[Eloff et al., arXiv’18]

Our framework
[Figure: query (spoken "two"), support set, and matching set, covering multimodal one-shot learning and multimodal one-shot matching]
[Eloff et al., arXiv’18]
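
One way to read this framework is as two nearest-neighbour steps, each within a single modality: the spoken query is compared with the spoken items in the support set, and the image paired with the closest one is then compared with the images in the matching set. The sketch below follows that reading. The pairing of speech and images in the support set, the toy vectors, and the squared Euclidean distance are assumptions for illustration.

```python
def nearest(item, candidates, distance):
    """Index of the candidate closest to item under the given distance."""
    return min(range(len(candidates)), key=lambda i: distance(item, candidates[i]))


def multimodal_match(query_speech, support_pairs, matching_images,
                     speech_distance, image_distance):
    """Match a spoken query to an image using only within-modality comparisons."""
    support_speech = [speech for speech, _ in support_pairs]
    # Step 1 (speech vs. speech): find the closest spoken support item.
    k = nearest(query_speech, support_speech, speech_distance)
    paired_image = support_pairs[k][1]
    # Step 2 (image vs. image): find the matching-set image closest to its pair.
    return nearest(paired_image, matching_images, image_distance)


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy support set of (speech features, image features) pairs.
support_pairs = [
    ([0.1, 0.2], [1.0, 0.0, 0.0]),  # "one"
    ([0.3, 0.6], [0.0, 1.0, 0.0]),  # "two"
    ([0.9, 0.1], [0.0, 0.0, 1.0]),  # "three"
]
# Unseen images, in a different order from the support set.
matching_images = [
    [0.1, 0.0, 0.9],  # a "three"
    [0.9, 0.1, 0.0],  # a "one"
    [0.1, 0.8, 0.1],  # a "two"
]
query_speech = [0.28, 0.62]  # a new spoken "two"
match_index = multimodal_match(query_speech, support_pairs, matching_images,
                               sq_euclidean, sq_euclidean)
```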

Our approach to multimodal one-shot learning
• Requires within-modality distance metrics
• Can be done directly over features: DTW over speech, cosine distance over image pixels
• Or distance metrics can be learned from background data
• Compare these on TIDigits (speech) paired with MNIST (images)
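
The two direct-feature distances named on this slide can be sketched as follows. This is a plain textbook version of dynamic time warping (DTW) with no path-length normalisation or band constraint, using cosine distance between frames; those choices, and the toy sequences, are assumptions rather than details from the talk. Inputs are assumed to be non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; usable on flattened image pixels or on speech frames."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dtw_distance(seq_a, seq_b, frame_distance):
    """Accumulated cost of the best monotonic alignment between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or diagonal match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy "utterances": sequences of 2-D frames of different lengths.
word = [[1.0, 0.0], [0.0, 1.0]]
word_slow = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same frames, first one held longer
other = [[0.0, 1.0], [1.0, 0.0]]                  # frames in the opposite order

same_cost = dtw_distance(word, word_slow, cosine_distance)
diff_cost = dtw_distance(word, other, cosine_distance)
```

DTW lets the two versions of the same toy word align despite their different lengths, while the reordered sequence accumulates a large cost.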

Background data
Omniglot (no digits): [figure]
Isolated labelled words (no digits): [figure]

Models for metric learning
Classifier network: input $X$, output a class label (e.g. "cricket")
Siamese network: inputs $X_1$ and $X_2$ are embedded as $y_1 = f(X_1)$ and $y_2 = f(X_2)$, and the distance $d(y_1, y_2)$ is computed between them
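
The Siamese setup on this slide applies one shared function f to both inputs and measures a distance d(y1, y2) between the resulting embeddings. The sketch below uses a single linear layer as a stand-in for the network and a triplet hinge loss as the training objective. Both are assumptions made for illustration; the slide does not state the architecture details or the loss.

```python
import math


def embed(x, weights):
    """Shared embedding f(X): one linear layer standing in for the real network."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def distance(y1, y2):
    """Euclidean distance d(y1, y2) between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))


def triplet_hinge_loss(anchor, positive, negative, weights, margin=1.0):
    """Zero once the same-class pair is closer than the different-class pair by the margin."""
    # The same weights embed all three inputs; that sharing is what makes it Siamese.
    ya, yp, yn = (embed(x, weights) for x in (anchor, positive, negative))
    return max(0.0, margin + distance(ya, yp) - distance(ya, yn))


# Hypothetical weights and inputs.
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
x_anchor = [1.0, 0.0, 0.0]
x_same = [0.9, 0.1, 0.0]  # same class as the anchor
x_diff = [0.0, 1.0, 0.0]  # different class

loss_satisfied = triplet_hinge_loss(x_anchor, x_same, x_diff, weights)
loss_violated = triplet_hinge_loss(x_anchor, x_diff, x_same, weights)
```

After training on background data, only `embed` and `distance` would be needed at test time, giving a learned within-modality metric in place of DTW or pixel-level cosine distance.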

Multimodal one-shot matching
[Bar chart: accuracy (%) (axis 0 to 100) for DTW + Pixels, FFNN Classifier, CNN Classifier, Siamese CNN (offline), Siamese CNN (online)]

Multimodal learning from images and speech
KU Leuven & UPF Barcelona, January 2019
Herman Kamper, E&E Engineering, Stellenbosch University, South Africa
http://www.kamperh.com/
