SLIDE 1

Visually grounded cross-lingual keyword spotting in speech

SLTU, August 2018

Herman Kamper¹ and Michael Roth²

¹E&E Engineering, Stellenbosch University, South Africa; ²Saarland University, Germany

http://www.kamperh.com/

SLIDE 5

Advances in speech recognition

Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]

Very different from the “supervision” infants use to learn language
Sometimes not possible, e.g., for unwritten languages

1 / 12

SLIDE 9

Images as weak labels for speech

Can we use images as weak labels in low-resource settings?

Maybe we cannot use this type of data for full ASR, but maybe it can be used for other tasks?

Goal: Use this type of data for cross-lingual keyword spotting

2 / 12

SLIDE 10

Cross-lingual keyword spotting

[Figure: a Swahili speech corpus is searched with the written query “burning” (English); labels shown on the speech: kuwaka kuwaka]

3 / 12

SLIDE 13

Cross-lingual word prediction from images

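[Figure: an image is passed through a VGG-based visual tagger, producing scores y_vis over word tags; tags shown: hat, man, shirt; scores shown: 0.85, 0.8, 0.9]

4 / 12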

SLIDE 15

Cross-lingual word prediction from images

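[Figure: speech X passes through a network (layers labeled max, conv, max, feedfwd) to give f(X); the paired image passes through VGG to give y_vis over word tags (shown: hat, man, shirt); the loss ℓ compares f(X) with y_vis]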

4 / 12

SLIDE 20

Cross-lingual word prediction from images

[Figure: Swahili speech X passes through the network (layers labeled max, conv, max, feedfwd) to give f(X); predicted words shown: man, hat]

f(X) ∈ ℝ^W is a vector of word probabilities
I.e., a cross-lingual spoken bag-of-words (BoW) classifier

4 / 12
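
A minimal sketch of what a bag-of-words output could look like. It assumes (this is not stated on the slide) that each of the W vocabulary words gets an independent sigmoid score, so several words can be active for the same utterance. The vocabulary, the logits and the 0.5 threshold are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bow_probabilities(logits):
    """Turn W output logits into W independent word probabilities, i.e. f(X)."""
    return [sigmoid(z) for z in logits]


def predicted_words(probs, vocab, threshold=0.5):
    """Return the words whose probability reaches the (assumed) threshold."""
    return [w for w, p in zip(vocab, probs) if p >= threshold]


# Hypothetical vocabulary and network outputs for one utterance X.
vocab = ["hat", "man", "shirt", "dog"]
logits = [1.2, 2.0, -0.3, -3.0]

probs = bow_probabilities(logits)
print(predicted_words(probs, vocab))
```

Because the scores are independent rather than normalised across words, the output describes a set of words with no order, which is what makes it a bag of words.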

SLIDE 24

Experimental details

Goal: Use visual grounding for cross-lingual keyword spotting
Proof-of-concept: Use English speech with German queries:

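[Figure: English speech X passes through the cross-lingual keyword spotter (layers labeled max, conv, max, feedfwd) to give f(X); image I passes through VGG-16 to give German (text) tags ŷ_de (shown: Feld, Hunde, springt); the loss ℓ compares f(X) with ŷ_de]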

Data: 8000 images with 5 English spoken captions (∼37 hours)
Weak labels: German visual tagger trained on German Multi30k

5 / 12
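
The slides name the loss ℓ without giving its form. As a hedged illustration only, the sketch below assumes a summed binary cross-entropy between the speech network output f(X) and the soft tagger scores ŷ_de; the function name and the example numbers are hypothetical.

```python
import math


def soft_bce_loss(pred, target, eps=1e-12):
    """Summed binary cross-entropy between predictions and soft targets.

    pred   -- f(X): per-word probabilities from the speech network
    target -- soft per-word scores from the visual tagger (values in [0, 1])
    """
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


# Hypothetical tagger scores for three German tags, e.g. Feld, Hunde, springt.
y_de = [0.9, 0.8, 0.1]

good = soft_bce_loss([0.9, 0.8, 0.1], y_de)  # prediction agrees with tagger
bad = soft_bce_loss([0.1, 0.2, 0.9], y_de)   # prediction disagrees
print(good < bad)
```

With soft targets the loss is smallest when f(X) matches the tagger scores, but it does not reach zero there, since the targets themselves are not 0 or 1.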

SLIDE 25

Predictions on test data

English speech collection (want to search): X1, X2, X3, ... with outputs f(X1), f(X2), f(X3), ...

Given German keyword w (e.g. ‘Hunde’): f_w(X_i) = P_θ(w | X_i) is the score for whether (English) speech X_i contains a translation of the given (German) keyword w; it corresponds to dimension w of f(X_i)

Evaluation: Does predicted keyword occur in reference translation?

6 / 12
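
The retrieval and evaluation steps above can be sketched as follows: rank utterances by the score in the keyword's dimension, then compute precision at k. The score matrix, the set of relevant utterances and k=2 are invented for illustration; the talk reports P@10.

```python
def rank_utterances(scores, keyword_index):
    """Return utterance indices sorted by f_w(X_i), highest score first.

    scores        -- list of per-utterance vectors f(X_i)
    keyword_index -- dimension w of the query keyword
    """
    return sorted(
        range(len(scores)),
        key=lambda i: scores[i][keyword_index],
        reverse=True,
    )


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved utterances that are relevant."""
    hits = sum(1 for i in ranked_ids[:k] if i in relevant_ids)
    return hits / k


# Hypothetical outputs f(X1), f(X2), f(X3) over a two-word vocabulary.
scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
]
ranking = rank_utterances(scores, keyword_index=0)

# Utterances whose reference translation contains the keyword (assumed).
relevant = {0, 1}
p_at_2 = precision_at_k(ranking, relevant, k=2)
print(ranking, p_at_2)
```

Here the top two retrievals are utterances 0 and 2, of which only utterance 0 is relevant, giving P@2 = 0.5.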

SLIDE 34

Example predictions (top retrievals)

Task: Given written German keyword, find utterances in an unseen English speech collection containing that keyword

Input: Fahrrad (bicycle)
Output (in top 10):

man riding a bicycle on a foggy day
a biker does a trick on a ramp
a person is doing tricks on a bicycle in a city

Input: Straße (street)
Output (in top 10):

a woman in black and red listens to an ipod walks down the street
people on the city street walk past a puppet theater
an asian woman rides a bicycle in front of two cars

7 / 12

SLIDE 35

Cross-lingual keyword spotting performance

[Bar chart: P@10 (%) for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN]

8 / 12

SLIDE 47

A few more example predictions

Input: Feld (field)
Output:

a team of baseball players in blue uniforms walking together on field
a brown and black dog running through a grassy field ∗ (1)
two small children walk away in a field

Input: grün(en) (green)
Output:

boy wearing a green and white soccer uniform running through the grass
a girl is screaming as she comes off the water slide ∗ (3)
a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)
Output:

a large crowd of people ice skating outdoors
a surfer catching a large wave in the ocean
a small group of people sitting together outside ∗ (3)

9 / 12

SLIDE 48

Error analysis by annotator

[Bar chart: P@10 (%) for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN]

10 / 12

SLIDE 51

Cross-lingual keyword spotting

[Figure: written query “burning” (English); labels shown on the speech: kuwaka kuwaka moto]

11 / 12

SLIDE 55

Conclusions and future work

Visual grounding makes it possible to perform cross-lingual keyword spotting without any parallel speech and text or translations

Future: Apply approach to a truly low-resource language
Perform error analysis on larger scale
Visual tagger improvements: language-agnostic visual recognition

12 / 12

SLIDE 56

https://github.com/kamperh/

SLIDE 57

Training: Visually grounded model

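[Figure: English speech X passes through the cross-lingual keyword spotter (layers labeled max, conv, max, feedfwd) to give f(X); image I passes through VGG-16 to give German (text) tags ŷ_de (shown: Feld, Hunde, springt); the loss ℓ compares f(X) with ŷ_de]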

SLIDE 58

Testing: Cross-lingual keyword spotting

English speech collection (want to search): X1, X2, X3, ... with outputs f(X1), f(X2), f(X3), ...

Given German keyword w (e.g. ‘Hunde’): f_w(X_i) = P_θ(w | X_i) is the score for whether (English) speech X_i contains a translation of the given (German) keyword w; it corresponds to dimension w of f(X_i)