Multimodal learning from images and speech
KU Leuven & UPF Barcelona, January 2019
Herman Kamper
E&E Engineering, Stellenbosch University, South Africa
http://www.kamperh.com/
Advances in speech recognition
- Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]
- Sometimes not possible, e.g., for unwritten languages
“Zero-resource” speech processing
[Kamper et al., TASLP’16]
Why learn without labels?
- Get insight into human language acquisition [Räsänen and Rasilo, ’15]
- Language acquisition in robots [Roy, ’99]; [Renkens and Van hamme, ’15]
- Analysis of audio for unwritten languages [Besacier et al., ’14]
- New insights and models for speech processing [Jansen et al., ’13]
- but . . . what about context?
1. Visually Grounded Keyword Spotting
Shane Settle, Michael Roth, Greg Shakhnarovich, Karen Livescu
Images as weak labels for speech
- Can we use images as weak labels in low-resource settings?
- This type of data may not be enough for full ASR, but it may still be usable for other tasks
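Map images and speech into common space
[Figure: the image passes through VGG to give y_vis; the speech X passes through convolutional, max-pooling and feedforward layers to give y_spch; the two embeddings are compared with a distance d(y_vis, y_spch)]
[Harwath et al., NIPS’16]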
Retrieval in common (semantic) space
- Both embeddings y_vis and y_spch are vectors y ∈ R^D in the same D-dimensional space
[Harwath et al., NIPS’16]
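Retrieval in the shared space amounts to a nearest-neighbour search. A minimal sketch, assuming cosine distance for d (the slides do not say which distance is used) and made-up 3-dimensional embeddings; `cosine_distance` and `retrieve` are hypothetical helper names:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieve(query, candidates, k=1):
    """Return the indices of the k candidates closest to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine_distance(query, candidates[i]))
    return order[:k]

# Toy data (made up): one speech embedding and three image embeddings.
y_spch = [0.9, 0.1, 0.0]
y_vis_all = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0], [0.0, 0.0, 1.0]]
print(retrieve(y_spch, y_vis_all, k=2))  # indices of the two closest images
```

The same function works in either direction (speech query over images, or image query over speech), since both modalities live in the same space.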
Can we use a (supervised) vision model to get labels?
[Figure: the same two-branch model, with VGG giving y_vis, the speech network giving y_spch, and a distance d(y_vis, y_spch)]
- This model cannot produce textual labels for the speech
Word prediction from images and speech
[Figure, training: the image passes through VGG to give soft word targets y_vis (example words: hat, man, shirt; example scores: 0.85, 0.8, 0.9); the speech X passes through convolutional, max-pooling and feedforward layers to give f(X); a loss L compares f(X) with y_vis]
[Figure, testing: only the speech network is used, mapping X to f(X) (example output words: man, hat)]
- f(X) ∈ R^W is a vector of word probabilities
- I.e., a spoken bag-of-words (BoW) classifier
[Kamper et al., Interspeech’17]
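A minimal sketch of how the speech network's output could be scored against the vision tagger's output. The slide only names a loss L, so the summed cross-entropy used here is an assumption, as are the function name `bow_cross_entropy` and the toy values:

```python
import math

def bow_cross_entropy(y_vis, f_x, eps=1e-12):
    """Summed per-word cross-entropy between soft targets and predictions.

    y_vis: soft word targets in [0, 1] from the vision model (one per word).
    f_x:   word probabilities from the speech network (one per word).
    """
    loss = 0.0
    for target, pred in zip(y_vis, f_x):
        pred = min(max(pred, eps), 1.0 - eps)  # keep log() finite
        loss -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return loss

# Toy vocabulary of W = 3 words; the values are illustrative only.
y_vis = [0.85, 0.8, 0.9]
good = bow_cross_entropy(y_vis, [0.85, 0.8, 0.9])  # predictions match targets
bad = bow_cross_entropy(y_vis, [0.1, 0.2, 0.1])    # predictions far from targets
print(good < bad)
```

Because the targets come from images rather than transcriptions, no text labels for the speech are needed at training time.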
Images paired with untranscribed speech
We are still in this setting:
- We do not use any of the speech transcriptions during model training (only for evaluation)
- But our resulting model can make bag-of-words (BoW) predictions
Task 1: Spoken bag-of-words prediction
Input utterance | Predicted BoW labels
man on bicycle is doing tricks in an old building | bicycle, bike, man, riding, wearing
a little girl is climbing a ladder | child, girl, little, young
a rock climber standing in a crevasse | climbing, man, rock
a dog running in the grass around sheep | dog, field, grass, running
a man in a miami basketball uniform looking to the right | ball, basketball, man, player, uniform, wearing
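One way to turn the probability vector f(X) into a set of predicted labels is to threshold it. This is a sketch only: the threshold of 0.5, the function name `predict_bow`, and the probabilities are assumptions, not values from the talk:

```python
def predict_bow(f_x, vocab, threshold=0.5):
    """Return the words whose predicted probability reaches the threshold."""
    return sorted(w for w, p in zip(vocab, f_x) if p >= threshold)

vocab = ["bicycle", "bike", "dog", "man", "riding", "wearing"]
f_x = [0.91, 0.77, 0.03, 0.88, 0.64, 0.52]  # made-up probabilities
print(predict_bow(f_x, vocab))
```

Raising the threshold trades recall for precision; the predicted words need not appear in the utterance itself, only be plausible given the paired image.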
Task 2: Keyword spotting
Keyword | Example of matched utterance (one of top 10) | Type
beach | a boy in a yellow shirt is walking on a beach . . . | correct
behind | a surfer does a flip on a wave | mistake
bike | a dirt biker flies through the air | variant
boys | two children play soccer in the park | semantic
large | . . . a rocky cliff overlooking a body of water | semantic
play | children playing in a ball pit | variant
sitting | two people are seated at a table with drinks | semantic
yellow | a tan dog jumping over a red and blue toy | mistake
young | a little girl on a kid swing | semantic
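Keyword spotting with a BoW classifier can be sketched as ranking utterances by the probability the model assigns to the keyword. The helper name `spot_keyword`, the utterance ids, and the scores below are hypothetical:

```python
def spot_keyword(keyword, vocab, utterance_probs, top_k=10):
    """Rank utterances by the probability assigned to `keyword`.

    utterance_probs: dict mapping utterance id -> f(X) as a list over vocab.
    Returns the ids of the top_k highest-scoring utterances.
    """
    idx = vocab.index(keyword)
    ranked = sorted(utterance_probs,
                    key=lambda u: utterance_probs[u][idx],
                    reverse=True)
    return ranked[:top_k]

vocab = ["beach", "bike", "young"]
utterance_probs = {  # made-up model outputs
    "utt1": [0.92, 0.05, 0.40],
    "utt2": [0.10, 0.81, 0.07],
    "utt3": [0.55, 0.02, 0.73],
}
print(spot_keyword("beach", vocab, utterance_probs, top_k=2))
```

Since the scores reflect visual context rather than exact word identity, highly ranked utterances can be semantic matches or variants as well as exact matches, as in the table above.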
Task 3: Semantic speech retrieval
[Figure: written query "burning"; example retrieved matches labelled burning, burning, fire]
[Kamper et al., TASLP’19]
Human (MTurk) evaluation
Keyword | Top retrieved utterance | Human label
ocean | man falling off a blue surfboard in the ocean | 5 / 5
snowy | a skier catches air over the snow | 5 / 5
bike | a dirt biker rides through some trees | 4 / 5
children | a group of young boys playing soccer | 4 / 5
field | two white dogs running in the grass together | 3 / 5
swimming | a woman holding a young boy slide down a water slide into a pool | 3 / 5
carrying | small dog running in the grass with a toy in its mouth | 2 / 5 ∗
large | a group of people on a zig path through the mountains | 1 / 5 ∗
hair | two women and a man smile for the camera | 0 / 5 ∗
Task 3: Semantic speech retrieval
[Figure: bar chart of P@10 for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP, TextParagram]
[Figure: bar chart of Spearman’s ρ for the same seven models]
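The two metrics in these plots can be sketched as follows. The slides do not spell out how relevance is defined or which scores are correlated, so the inputs here are placeholders; `precision_at_k` and `spearman_rho` are hypothetical names, and the Spearman implementation assumes there are no tied scores:

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for u in top if u in relevant_ids) / len(top)

def spearman_rho(scores_a, scores_b):
    """Spearman's rank correlation, assuming no tied scores."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Toy example: 2 of the top-4 retrieved utterances are relevant.
print(precision_at_k(["u1", "u2", "u3", "u4"], {"u1", "u4"}, k=4))
# Toy example: model scores that order items exactly as human ratings do.
print(spearman_rho([0.9, 0.5, 0.1], [5, 3, 1]))
```

P@10 only looks at the top of the ranking, while Spearman’s ρ compares the full ordering against graded judgements.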
But this model is trained for English?
[Figure: the word-prediction training setup, with VGG giving targets y_vis (example words: hat, man, shirt), the speech network giving f(X), and a loss L between them]
[Kamper et al., Interspeech’17]
Task 4: Cross-lingual keyword spotting
- Arapaho speech collection (want to search), given English keyword: ‘Disease’
- English speech collection (want to search), given German keyword: ‘Hunde’
[Figure: cross-lingual keyword spotter. The image I passes through VGG-16 to give German (text) tags ŷ_de (example tags: Feld, Hunde, springt); the English speech X passes through convolutional, max-pooling and feedforward layers to give f(X); a loss ℓ compares f(X) with ŷ_de]
[Kamper and Roth, SLTU’18]
2. Multimodal One-Shot Learning from Images and Speech
Ryan Eloff, Herman Engelbrecht
You are the robot
Unimodal one-shot learning and classification
- One-shot speech learning: a support set with one spoken example per class (three, one, five, two, four)
- One-shot speech classification: given a spoken query (two), predict its label ŷ from the support set; here ŷ = two
[Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14]
Multimodal one-shot learning and matching
- Multimodal one-shot learning: a support set is given
- Multimodal one-shot matching: given a spoken query (two), which item in the matching set does it correspond to?
[Eloff et al., arXiv’18]
Our framework
[Figure: query (two), support set and matching set; multimodal one-shot learning followed by multimodal one-shot matching]
[Eloff et al., arXiv’18]
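One way to realise multimodal matching with only within-modality comparisons is a two-step lookup: find the support item whose speech is closest to the query, then use that item's paired image to search the matching set. This two-step reading of the framework figure, the assumption that the support set holds (speech, image) pairs, and all names and toy values below are assumptions rather than details stated on the slides:

```python
def nearest(item, candidates, dist):
    """Index of the candidate with the smallest distance to item."""
    return min(range(len(candidates)), key=lambda i: dist(item, candidates[i]))

def multimodal_match(query_speech, support_set, matching_images,
                     speech_dist, image_dist):
    """Two-step matching using only within-modality distances.

    support_set: list of (speech, image) pairs, one per class.
    Returns the index of the predicted item in matching_images.
    """
    # Step 1: find the support pair whose speech is closest to the query.
    support_speech = [speech for speech, _ in support_set]
    best = nearest(query_speech, support_speech, speech_dist)
    # Step 2: use that pair's image to search the matching set.
    _, support_image = support_set[best]
    return nearest(support_image, matching_images, image_dist)

# Toy example with scalar "features" and absolute difference as distance.
abs_dist = lambda a, b: abs(a - b)
support = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]  # (speech, image) pairs
matching = [31.0, 9.0, 19.5]
print(multimodal_match(2.1, support, matching, abs_dist, abs_dist))
```

The distance functions are passed in as arguments, so the same procedure works with distances computed directly over features or with learned metrics.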
Our approach to multimodal one-shot learning
- Requires within-modality distance metrics
- Can be done directly over features: DTW over speech, cosine over image pixels
- Or distance metrics can be learned from background data
- Compare these on TIDigits (speech) paired with MNIST (images)
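The two direct-feature metrics named above can be sketched in a few lines. This is a basic, unconstrained DTW with a Euclidean frame distance (the frame-level distance is an assumption; the slides only say "DTW over speech"), and the toy sequences stand in for real acoustic features and pixel vectors:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, frame_dist=euclidean):
    """Dynamic time warping cost between two sequences of frames."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def cosine_distance(a, b):
    """1 - cosine similarity, e.g. between two flattened pixel vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(y * y for y in b)))

# Toy speech: the second sequence is the first with its first frame repeated.
seq_a = [[0.0], [1.0], [2.0]]
seq_b = [[0.0], [0.0], [1.0], [2.0]]
seq_c = [[5.0], [5.0]]
print(dtw_distance(seq_a, seq_b), dtw_distance(seq_a, seq_c))
```

DTW absorbs differences in speaking rate by letting frames repeat along the alignment path, which is why the time-stretched copy has zero cost here.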
Background data
- Images: Omniglot (no digits)
- Speech: isolated labelled words (no digits)
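Models for metric learning
- Classifier network: [Figure: input X is mapped to class labels from the background data, e.g. cricket, standing, large]
- Siamese network: [Figure: inputs X1 and X2 pass through the same network f to give y1 = f(X1) and y2 = f(X2), which are compared with a distance d(y1, y2)]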
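A minimal sketch of the Siamese idea: the same function f embeds both inputs, and training pushes same-class pairs together and different-class pairs apart. The slide only shows d(y1, y2); the triplet hinge loss, the single linear layer standing in for the CNN, and all names and values here are assumptions for illustration:

```python
import math

def embed(x, weights):
    """Shared embedding f: a single linear layer (stand-in for the CNN)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def euclidean(y1, y2):
    """Distance d(y1, y2) between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))

def triplet_hinge_loss(anchor, positive, negative, weights, margin=1.0):
    """max(0, margin + d(anchor, positive) - d(anchor, negative))."""
    # The same weights embed all three inputs: that sharing is the Siamese part.
    ya, yp, yn = (embed(x, weights) for x in (anchor, positive, negative))
    return max(0.0, margin + euclidean(ya, yp) - euclidean(ya, yn))

weights = [[1.0, 0.0], [0.0, 1.0]]  # identity map, illustration only
# Positive already close and negative far: no loss.
loss_easy = triplet_hinge_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0], weights)
# Positive farther than negative: the loss is positive.
loss_hard = triplet_hinge_loss([0.0, 0.0], [2.0, 0.0], [0.5, 0.0], weights)
print(loss_easy, loss_hard)
```

After training on background data, only `embed` and the distance are needed at test time, so the learned metric can replace DTW or pixel-level cosine in the matching step.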
Multimodal one-shot matching
[Figure: bar chart of accuracy (%) for DTW + Pixels, FFNN Classifier, CNN Classifier, Siamese CNN (offline), Siamese CNN (online)]
Multimodal five-shot matching
[Figure: bar chart of accuracy (%) for the same five models]
Takeaways and future work
What to take away from this talk:
- Visual grounding is useful for dealing with unlabelled speech
- Some things are better when using visual grounding, e.g., one-shot learning, semantic search (?)
- Some things are impossible without it, e.g., keyword prediction from unlabelled speech
Future work:
- Visual grounding of speech paired with videos
- Language universal/agnostic vision systems
- Meta-learning and unsupervised background modelling for one-shot learning
- Developing practical tools for low-resource languages
http://www.kamperh.com/
https://github.com/kamperh/recipe_semantic_flickraudio
https://github.com/rpeloff/multimodal_one_shot_learning
Unimodal one-shot speech classification
[Figure: bar chart of accuracy (%) for DTW, FFNN Classifier, CNN Classifier, Siamese CNN (offline), Siamese CNN (online)]