Multimodal learning from images and speech KU Leuven & UPF - - PowerPoint PPT Presentation




slide-1
SLIDE 1

Multimodal learning from images and speech

KU Leuven & UPF Barcelona, January 2019

Herman Kamper
E&E Engineering, Stellenbosch University, South Africa
http://www.kamperh.com/

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-6
SLIDE 6

Advances in speech recognition

  • Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]

  • Sometimes not possible, e.g., for unwritten languages

3 / 35

slide-7
SLIDE 7
slide-10
SLIDE 10

“Zero-resource” speech processing

[Kamper et al., TASLP’16] 5 / 35

slide-17
SLIDE 17

Why learn without labels?

  • Get insight into human language acquisition [Räsänen and Rasilo, ’15]
  • Language acquisition in robots [Roy, ’99]; [Renkens and Van hamme, ’15]
  • Analysis of audio for unwritten languages [Besacier et al., ’14]
  • New insights and models for speech processing [Jansen et al., ’13]
  • But ... what about context?

6 / 35

slide-18
SLIDE 18
slide-20
SLIDE 20
1. Visually Grounded Keyword Spotting

Shane Settle, Michael Roth, Greg Shakhnarovich, Karen Livescu

slide-23
SLIDE 23

Images as weak labels for speech

Can we use images as weak labels in low-resource settings?

This type of data may not be enough for full ASR, but could it be used for other tasks?

9 / 35

slide-25
SLIDE 25

Map images and speech into common space

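[Figure: the speech input $X$ passes through a network of convolutional, max-pooling and feedforward layers to give $y_{\mathrm{spch}}$; the paired image passes through VGG to give $y_{\mathrm{vis}}$. The two embeddings are compared with a distance $d(y_{\mathrm{vis}}, y_{\mathrm{spch}})$.]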

[Harwath et al., NIPS’16] 10 / 35

slide-26
SLIDE 26

Retrieval in common (semantic) space

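$y \in \mathbb{R}^D$: the image embedding $y_{\mathrm{vis}}$ and the speech embedding $y_{\mathrm{spch}}$ lie in the same $D$-dimensional space.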

[Harwath et al., NIPS’16] 11 / 35
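Retrieval in a shared embedding space can be sketched as a nearest-neighbour search. The snippet below is only an illustration: the choice of cosine distance, the function names and the toy three-dimensional embeddings are assumptions, not details taken from the slides or from Harwath et al.

```python
import numpy as np

def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between rows of a (n, D) and rows of b (m, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def retrieve(query, candidates, k=1):
    """Indices of the k candidate embeddings closest to one query embedding."""
    distances = cosine_distance_matrix(query[None, :], candidates)[0]
    return np.argsort(distances)[:k]

# Toy embeddings, made up for illustration (D = 3).
y_spch = np.array([0.9, 0.1, 0.0])   # one spoken caption
y_vis = np.array([
    [0.0, 1.0, 0.0],                 # image 0
    [1.0, 0.0, 0.0],                 # image 1
    [0.0, 0.0, 1.0],                 # image 2
])

top = retrieve(y_spch, y_vis, k=2)   # images ranked by closeness to the speech
```

The same function works in the other direction (image query, speech candidates), since both modalities share one space.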

slide-27
SLIDE 27

Can we use (supervised) vision model to get labels?

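[Figure: the speech input $X$ passes through a network of convolutional, max-pooling and feedforward layers to give $y_{\mathrm{spch}}$; the paired image passes through VGG to give $y_{\mathrm{vis}}$. The two embeddings are compared with a distance $d(y_{\mathrm{vis}}, y_{\mathrm{spch}})$.]

Cannot obtain textual labels for the speech using this model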

12 / 35

slide-30
SLIDE 30

Word prediction from images and speech

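[Figure: the image is passed through VGG, which outputs soft scores $y_{\mathrm{vis}}$ for word tags such as "hat", "man" and "shirt"; the scores shown are 0.85, 0.8 and 0.9.]

[Kamper et al., Interspeech’17] 13 / 35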

slide-32
SLIDE 32

Word prediction from images and speech

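[Figure: the image is passed through VGG to give soft word targets $y_{\mathrm{vis}}$ (e.g., "hat", "man", "shirt"). The paired speech $X$ is passed through a network of convolutional, max-pooling and feedforward layers to give $f(X)$. A loss $L$ between $f(X)$ and $y_{\mathrm{vis}}$ trains the speech network.]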

[Kamper et al., Interspeech’17] 13 / 35
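The slide shows a loss between the speech network output f(X) and the visual targets yvis, but does not spell out its form. One natural choice for soft multi-label targets is a summed binary cross-entropy, sketched below under that assumption; the function name and the example vectors are illustrative only.

```python
import numpy as np

def soft_target_cross_entropy(f_x, y_vis, eps=1e-12):
    """Summed binary cross-entropy between speech outputs f(X) and soft
    visual targets y_vis. Both are vectors with one entry per word.
    This loss form is an assumption, not confirmed by the slides."""
    f_x = np.clip(f_x, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.sum(y_vis * np.log(f_x)
                         + (1.0 - y_vis) * np.log(1.0 - f_x)))

# Illustrative soft tags from the vision system for three words.
y_vis = np.array([0.85, 0.8, 0.9])

# A speech output that agrees with the visual tags gives a lower loss
# than one that disagrees.
loss_agree = soft_target_cross_entropy(np.array([0.85, 0.8, 0.9]), y_vis)
loss_disagree = soft_target_cross_entropy(np.array([0.1, 0.2, 0.1]), y_vis)
```

No transcriptions enter this computation: the only supervision for the speech network is the output of the vision system on the paired image.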

slide-36
SLIDE 36

Word prediction from images and speech

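[Figure: speech $X$ is passed through the convolutional, max-pooling and feedforward layers to give $f(X)$, with outputs for words such as "man" and "hat".]

$f(X) \in \mathbb{R}^W$ is a vector of word probabilities, i.e., a spoken bag-of-words (BoW) classifier.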

[Kamper et al., Interspeech’17] 13 / 35

slide-37
SLIDE 37

Images paired with untranscribed speech

We are still in this setting:

  • We do not use any of the speech transcriptions during model training (only for evaluation)

  • But our resulting model can make bag-of-words (BoW) predictions

14 / 35

slide-42
SLIDE 42

Task 1: Spoken bag-of-words prediction

Input utterance | Predicted BoW labels
man on bicycle is doing tricks in an old building | bicycle, bike, man, riding, wearing
a little girl is climbing a ladder | child, girl, little, young
a rock climber standing in a crevasse | climbing, man, rock
a dog running in the grass around sheep | dog, field, grass, running
a man in a miami basketball uniform looking to the right | ball, basketball, man, player, uniform, wearing

15 / 35
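Turning the word-probability vector f(X) into a set of predicted labels can be done by thresholding. The sketch below assumes a threshold of 0.5 and a tiny made-up vocabulary with invented scores; none of these values come from the talk.

```python
import numpy as np

def predict_bow(word_probs, vocab, threshold=0.5):
    """Return the words whose predicted probability reaches the threshold.
    The threshold value is a hypothetical choice for illustration."""
    keep = np.flatnonzero(word_probs >= threshold)
    return sorted(vocab[i] for i in keep)

# Hypothetical vocabulary and network output for one utterance.
vocab = ["bicycle", "bike", "dog", "man", "riding", "wearing"]
f_x = np.array([0.91, 0.74, 0.03, 0.88, 0.66, 0.52])

labels = predict_bow(f_x, vocab)
```

Raising the threshold trades recall for precision, so in practice it would be tuned on held-out data.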

slide-56
SLIDE 56

Task 2: Keyword spotting

Keyword | Example of matched utterance | Type
beach | a boy in a yellow shirt is walking on a beach . . . | correct
behind | a surfer does a flip on a wave | mistake
bike | a dirt biker flies through the air | variant
boys | two children play soccer in the park | semantic
large | . . . a rocky cliff overlooking a body of water | semantic
play | children playing in a ball pit | variant
sitting | two people are seated at a table with drinks | semantic
yellow | a tan dog jumping over a red and blue toy | mistake
young | a little girl on a kid swing | semantic

16 / 35
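Keyword spotting with a bag-of-words model amounts to ranking utterances by the score of one output word. The following is a minimal sketch of that idea with a precision-at-k measure; the score matrix, the vocabulary and the set of relevant utterances are invented for illustration.

```python
import numpy as np

def spot_keyword(word_probs, vocab, keyword, k=10):
    """Rank utterances by the predicted probability of `keyword` and
    return the indices of the top k. word_probs has shape (utterances, words)."""
    w = vocab.index(keyword)
    order = np.argsort(-word_probs[:, w], kind="stable")
    return order[:k].tolist()

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked utterances that are in `relevant`."""
    top = ranked[:k]
    return sum(1 for i in top if i in relevant) / max(len(top), 1)

# Hypothetical scores for four utterances over a three-word vocabulary.
vocab = ["beach", "bike", "young"]
word_probs = np.array([
    [0.92, 0.05, 0.30],
    [0.10, 0.81, 0.12],
    [0.75, 0.02, 0.64],
    [0.20, 0.60, 0.08],
])

ranked = spot_keyword(word_probs, vocab, "beach", k=2)
p_at_2 = precision_at_k(ranked, relevant={0, 3}, k=2)
```

Whether a retrieved utterance counts as relevant depends on the task: exact keyword spotting needs the word itself to occur, while semantic retrieval also accepts related utterances.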

slide-57
SLIDE 57

Task 3: Semantic speech retrieval

Written query: burning

[Figure: retrieved utterances, labelled "burning", "burning" and "fire".]

[Kamper et al., TASLP’19] 17 / 35

slide-59
SLIDE 59

Human (MTurk) evaluation

Keyword | Top retrieved utterance | Human label
ocean | man falling off a blue surfboard in the ocean | 5 / 5
snowy | a skier catches air over the snow | 5 / 5
bike | a dirt biker rides through some trees | 4 / 5
children | a group of young boys playing soccer | 4 / 5
field | two white dogs running in the grass together | 3 / 5
swimming | a woman holding a young boy slide down a water slide into a pool | 3 / 5
carrying | small dog running in the grass with a toy in its mouth | 2 / 5 ∗
large | a group of people on a zig path through the mountains | 1 / 5 ∗
hair | two women and a man smile for the camera | 0 / 5 ∗

18 / 35

slide-61
SLIDE 61

Task 3: Semantic speech retrieval

[Bar chart: P@10 (%) for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP and TextParagram; bar values were not preserved.]

19 / 35

slide-62
SLIDE 62

Task 3: Semantic speech retrieval

[Bar chart: Spearman’s ρ for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP and TextParagram; bar values were not preserved.]

20 / 35
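Spearman's ρ is the Pearson correlation between two rank orderings. The sketch below computes it with average ranks for ties, assuming the comparison is between model scores and human relevance counts (such as the 0 to 5 MTurk labels); the numbers are made up and the pairing of scores with counts is an assumption about how the metric is applied here.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical model scores and human relevance counts for five utterances.
model_scores = [0.9, 0.7, 0.4, 0.2, 0.1]
human_counts = [5, 4, 4, 1, 0]

rho = spearman_rho(model_scores, human_counts)
```

Unlike P@10, this measure uses the graded human judgements rather than a binary relevant/irrelevant decision.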

slide-63
SLIDE 63

But this model is trained for English?

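[Figure: the image is passed through VGG to give soft word targets $y_{\mathrm{vis}}$ (e.g., "hat", "man", "shirt"). The paired speech $X$ is passed through a network of convolutional, max-pooling and feedforward layers to give $f(X)$. A loss $L$ between $f(X)$ and $y_{\mathrm{vis}}$ trains the speech network.]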

[Kamper et al., Interspeech’17] 21 / 35

slide-64
SLIDE 64

Task 4: Cross-lingual keyword spotting

Arapaho speech collection (want to search)

Given English keyword: ‘Disease’

[Kamper and Roth, SLTU’18] 22 / 35

slide-65
SLIDE 65

Task 4: Cross-lingual keyword spotting

English speech collection (want to search)

Given German keyword: ‘Hunde’

[Kamper and Roth, SLTU’18] 22 / 35

slide-66
SLIDE 66

Task 4: Cross-lingual keyword spotting

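[Figure: cross-lingual keyword spotter. The image $I$ is passed through VGG-16 to give German (text) tags $\hat{y}_{\mathrm{de}}$ (e.g., "Feld", "Hunde", "springt"). The paired English speech $X$ is passed through a network of convolutional, max-pooling and feedforward layers to give $f(X)$, which is trained with a loss against $\hat{y}_{\mathrm{de}}$.]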

[Kamper and Roth, SLTU’18] 23 / 35

slide-68
SLIDE 68
2. Multimodal One-Shot Learning from Images and Speech

Ryan Eloff, Herman Engelbrecht

slide-69
SLIDE 69
slide-77
SLIDE 77

You are the robot

?

26 / 35

slide-83
SLIDE 83

Unimodal one-shot learning and classification

One-shot speech learning (support set): one spoken example each of "three", "one", "five", "two" and "four"

One-shot speech classification: query (a spoken "two"): $\hat{y}$ = ?

[Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14] 27 / 35

slide-84
SLIDE 84

Unimodal one-shot learning and classification

One-shot speech learning (support set): one spoken example each of "three", "one", "five", "two" and "four"

One-shot speech classification: query (a spoken "two"): $\hat{y}$ = two

[Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14] 27 / 35

slide-87
SLIDE 87

Multimodal one-shot learning and matching

Multimodal one-shot learning: support set

Multimodal one-shot matching: query (a spoken "two") against the matching set: ?

[Eloff et al., arXiv’18] 28 / 35

slide-93
SLIDE 93

Our framework

Multimodal one-shot learning: support set

Multimodal one-shot matching: query (a spoken "two") against the matching set

[Eloff et al., arXiv’18] 29 / 35
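One way to read the framework is as two nearest-neighbour steps: the spoken query is compared with the speech items in the support set, and the image paired with the closest one is then compared with the images in the matching set. The sketch below follows that reading. It assumes fixed-size embeddings and cosine distance in both modalities, and all names and vectors are invented for illustration.

```python
import numpy as np

def cosine_distances(query, items):
    """Cosine distance from one vector to each row of `items`."""
    query = query / np.linalg.norm(query)
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    return 1.0 - items @ query

def one_shot_match(query_speech, support_speech, support_images, matching_images):
    """Return the index of the matching-set image predicted for a spoken query."""
    # Step 1 (within speech): find the support item closest to the query.
    nearest = int(np.argmin(cosine_distances(query_speech, support_speech)))
    # Step 2 (within images): compare that item's paired image to the matching set.
    pivot_image = support_images[nearest]
    return int(np.argmin(cosine_distances(pivot_image, matching_images)))

# Toy support set with two paired (speech, image) examples.
support_speech = np.array([[1.0, 0.0], [0.0, 1.0]])
support_images = np.array([[1.0, 0.2, 0.0], [0.0, 0.1, 1.0]])
# Unseen images to choose from, and a spoken query close to support item 1.
matching_images = np.array([[0.1, 0.0, 0.9], [0.9, 0.3, 0.1]])
query_speech = np.array([0.2, 0.9])

match = one_shot_match(query_speech, support_speech, support_images, matching_images)
```

No distance is ever computed across modalities, which is why only within-modality metrics are needed.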

slide-95
SLIDE 95

Our approach to multimodal one-shot learning

  • Requires within-modality distance metrics
  • Can be done directly over features: DTW over speech, cosine over image pixels
  • Or distance metrics can be learned from background data
  • Compare these on TIDigits (speech) paired with MNIST (images)

30 / 35
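The two direct-feature metrics named above can be sketched in a few lines. This is a plain textbook dynamic time warping (DTW) with Euclidean frame distances and no path constraints or length normalisation; those choices, and the toy sequences, are assumptions rather than the settings used in the talk.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW alignment cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = frame_dist + min(cost[i - 1, j],
                                          cost[i, j - 1],
                                          cost[i - 1, j - 1])
    return float(cost[n, m])

def pixel_cosine_distance(img_a, img_b):
    """Cosine distance between two images compared as flat pixel vectors."""
    u = img_a.ravel().astype(float)
    v = img_b.ravel().astype(float)
    return float(1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The same one-dimensional "word" at two speaking rates, and a different one.
slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0]])
fast = np.array([[0.0], [1.0], [2.0]])
other = np.array([[2.0], [1.0], [0.0]])

d_same = dtw_distance(slow, fast)    # warping absorbs the rate difference
d_diff = dtw_distance(slow, other)
```

DTW handles sequences of different lengths, which is what makes it usable on raw speech features without a fixed-size embedding.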

slide-97
SLIDE 97

Background data

Images: Omniglot (no digits)

Speech: isolated labelled words (no digits)

31 / 35

slide-99
SLIDE 99

Models for metric learning

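Classifier network: input $X$ is mapped to word classes (e.g., "cricket", "standing", "large")

Siamese network: inputs $X_1$ and $X_2$ pass through the same network, giving $y_1 = f(X_1)$ and $y_2 = f(X_2)$, which are compared with the distance $d(y_1, y_2)$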

32 / 35
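The slide shows a Siamese network producing a distance between two embeddings but not the training objective. A classic pair-based option is the contrastive loss, sketched below as an assumption; the talk's models may use a different objective. A single linear layer stands in for the shared network f, and the weights and inputs are invented.

```python
import numpy as np

def embed(x, weights):
    """Shared embedding function f; a linear layer stands in for the CNN."""
    return x @ weights

def contrastive_loss(x1, x2, same_class, weights, margin=1.0):
    """Pull same-class pairs together; push different-class pairs apart
    until they are at least `margin` away (an assumed objective)."""
    d = np.linalg.norm(embed(x1, weights) - embed(x2, weights))
    if same_class:
        return float(d ** 2)
    return float(max(0.0, margin - d) ** 2)

# Hypothetical weights mapping 3-dimensional inputs to 2-dimensional embeddings.
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
x_a = np.array([1.0, 0.0, 0.0])
x_b = np.array([1.0, 0.1, 0.0])   # close to x_a
x_c = np.array([0.0, 3.0, 0.0])   # far from x_a

loss_same = contrastive_loss(x_a, x_b, True, weights)        # small: pair is close
loss_diff_far = contrastive_loss(x_a, x_c, False, weights)   # zero: beyond margin
loss_diff_near = contrastive_loss(x_a, x_b, False, weights)  # penalised: too close
```

Because both inputs pass through the same weights, the learned distance can be applied at test time to classes that never appeared in the background data.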

slide-100
SLIDE 100

Multimodal one-shot matching

[Bar chart: accuracy (%) for DTW + Pixels, FFNN Classifier, CNN Classifier, Siamese CNN (offline) and Siamese CNN (online); bar values were not preserved.]

33 / 35

slide-101
SLIDE 101

Multimodal five-shot matching

[Bar chart: accuracy (%) for DTW + Pixels, FFNN Classifier, CNN Classifier, Siamese CNN (offline) and Siamese CNN (online); bar values were not preserved.]

34 / 35

slide-104
SLIDE 104

Takeaways and future work

What to take away from this talk:

  • Visual grounding is useful for dealing with unlabelled speech
  • Some things are better when using visual grounding, e.g., one-shot learning, semantic search (?)
  • Some things are impossible without it, e.g., keyword prediction from unlabelled speech

Future work:

  • Visual grounding of speech paired with videos
  • Language universal/agnostic vision systems
  • Meta-learning and unsupervised background modelling for one-shot learning
  • Developing practical tools for low-resource languages

35 / 35

slide-105
SLIDE 105

http://www.kamperh.com/
https://github.com/kamperh/recipe_semantic_flickraudio
https://github.com/rpeloff/multimodal_one_shot_learning

slide-106
SLIDE 106

Unimodal one-shot speech classification

[Bar chart: accuracy (%) for DTW, FFNN Classifier, CNN Classifier, Siamese CNN (offline) and Siamese CNN (online); bar values were not preserved.]

37 / 35