Visually grounded cross-lingual keyword spotting in speech

SLTU, August 2018. Herman Kamper¹ and Michael Roth². ¹E&E Engineering, Stellenbosch University, South Africa; ²Saarland University, Germany. http://www.kamperh.com/


slide-1
SLIDE 1

Visually grounded cross-lingual keyword spotting in speech

SLTU, August 2018

Herman Kamper¹ and Michael Roth²

¹E&E Engineering, Stellenbosch University, South Africa
²Saarland University, Germany

http://www.kamperh.com/

slide-5
SLIDE 5

Advances in speech recognition

  • Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]

  • Very different from the “supervision” infants use to learn language
  • Sometimes not possible, e.g., for unwritten languages

1 / 12

slide-9
SLIDE 9

Images as weak labels for speech

Can we use images as weak labels in low-resource settings?

  • Maybe we cannot use this type of data for full ASR, but maybe it can be used for other tasks?

  • Goal: Use this type of data for cross-lingual keyword spotting

2 / 12

slide-10
SLIDE 10

Cross-lingual keyword spotting

[Figure: a written English query, “burning”, is used to search a Swahili speech corpus; the matching speech is labelled “kuwaka”.]

3 / 12

slide-15
SLIDE 15

Cross-lingual word prediction from images

[Figure: training. An image is passed through VGG to give word targets $y_{\mathrm{vis}}$ (e.g. hat, man, shirt, shown with scores 0.85, 0.8, 0.9). The paired speech $X$ is passed through a network with blocks labelled max, conv, max, feedfwd to give $f(X)$. A loss is computed between $f(X)$ and $y_{\mathrm{vis}}$.]

4 / 12
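
The training signal on this slide can be sketched roughly as follows. The slide only labels the box "Loss"; this sketch assumes a summed per-word cross-entropy between the speech network's outputs f(X) and the soft visual targets. The function name `soft_bce_loss`, the four-word vocabulary, and all numbers are illustrative and not taken from the talk.

```python
import math


def soft_bce_loss(f_x, y_vis, eps=1e-12):
    """Summed cross-entropy between predicted word probabilities f_x
    and soft visual targets y_vis (two equal-length lists in [0, 1])."""
    if len(f_x) != len(y_vis):
        raise ValueError("f_x and y_vis must have the same length")
    total = 0.0
    for p, y in zip(f_x, y_vis):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total


# Hypothetical vocabulary: ["hat", "man", "shirt", "dog"]
y_vis = [0.85, 0.8, 0.9, 0.05]   # soft targets from the image tagger
good = [0.8, 0.8, 0.9, 0.1]      # speech output that agrees with the tagger
bad = [0.1, 0.2, 0.1, 0.9]       # speech output that disagrees

print(soft_bce_loss(good, y_vis) < soft_bce_loss(bad, y_vis))
```

Because the targets are soft scores rather than hard 0/1 labels, the loss is smallest when the speech output reproduces the tagger's confidence, not when it saturates at 0 or 1.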

slide-20
SLIDE 20

Cross-lingual word prediction from images

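[Figure: testing. Swahili speech $X$ is passed through the speech network (blocks labelled max, conv, max, feedfwd) to give $f(X)$, which predicts words such as “man” and “hat”.]

$f(X) \in \mathbb{R}^{W}$ is a vector of word probabilities

I.e., a cross-lingual spoken bag-of-words (BoW) classifier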

4 / 12
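
One way to read a vector of word probabilities as a bag of words is sketched below. It assumes the network ends in independent sigmoid units, one per vocabulary word, so that several words can be active for the same utterance. The vocabulary, the logits, and the 0.5 threshold are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bag_of_words(logits, vocab, threshold=0.5):
    """Return the set of words whose probability reaches the threshold,
    together with the full list of per-word probabilities."""
    probs = [sigmoid(z) for z in logits]
    words = {w for w, p in zip(vocab, probs) if p >= threshold}
    return words, probs


vocab = ["man", "hat", "shirt", "dog"]   # hypothetical W = 4 vocabulary
logits = [2.0, 1.5, -1.0, -3.0]          # hypothetical network outputs
words, probs = bag_of_words(logits, vocab)
print(sorted(words))  # ['hat', 'man']
```

Unlike a softmax, the per-word probabilities here do not need to sum to one, which matches the bag-of-words view of an utterance containing several keywords at once.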

slide-24
SLIDE 24

Experimental details

  • Goal: Use visual grounding for cross-lingual keyword spotting
  • Proof-of-concept: Use English speech with German queries:

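[Figure: training setup. An image $I$ is passed through VGG-16 to give German (text) tags $\hat{y}_{\mathrm{de}}$ (e.g. Feld, Hunde, springt). The paired English speech $X$ is passed through a network with blocks labelled max, conv, max, feedfwd to give $f(X)$; this network is the cross-lingual keyword spotter. A loss is computed between $f(X)$ and $\hat{y}_{\mathrm{de}}$.]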

  • Data: 8000 images, each with 5 English spoken captions (∼37 hours)
  • Weak labels: German visual tagger trained on German Multi30k

5 / 12

slide-25
SLIDE 25

Predictions on test data

[Figure: each utterance $X_1, X_2, X_3$ in the English speech collection (want to search) is mapped to $f(X_1), f(X_2), f(X_3)$.]

Given German keyword: ‘Hunde’ (corresponds to dim. $w$)

$f_w(X_i) = P_\theta(w \mid X_i)$: score for whether (English) speech $X_i$ contains a translation of the given (German) keyword $w$

Evaluation: Does predicted keyword occur in reference translation?

6 / 12
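
The retrieval and evaluation steps on this slide can be sketched as below: rank the utterances by their score for the keyword's dimension w, then measure precision in the top k. The scores and the relevance set are invented for illustration; "relevant" here stands for utterances whose reference translation contains the keyword, following the evaluation question on the slide. The function names are hypothetical.

```python
def rank_utterances(scores_w):
    """Return utterance indices sorted by decreasing score f_w(X_i)."""
    return sorted(range(len(scores_w)), key=lambda i: scores_w[i], reverse=True)


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked utterances that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if i in relevant) / len(top)


# Hypothetical scores f_w(X_i) for one German keyword over five utterances.
scores_w = [0.9, 0.1, 0.75, 0.3, 0.6]
# Hypothetical indices of utterances whose reference translation has the keyword.
relevant = {0, 2, 3}

ranked = rank_utterances(scores_w)
print(ranked)
print(precision_at_k(ranked, relevant, k=3))
```

With k = 10 this gives the P@10 measure used in the results plots; averaging over keywords is left out of the sketch.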

slide-34
SLIDE 34

Example predictions (top retrievals)

Task: Given a written German keyword, find utterances in an unseen English speech collection containing a translation of that keyword

Input: Fahrrad
Output (in top 10):

  • man riding a bicycle on a foggy day
  • a biker does a trick on a ramp
  • a person is doing tricks on a bicycle in a city

Input: Straße (street)
Output (in top 10):

  • a woman in black and red listens to an ipod walks down the street
  • people on the city street walk past a puppet theater
  • an asian woman rides a bicycle in front of two cars

7 / 12

slide-35
SLIDE 35

Cross-lingual keyword spotting performance

[Figure: plot of P@10 (%) for DETextPrior, DEVisionCNN, XVisionSpeechCNN, and XBoWCNN.]

8 / 12

slide-47
SLIDE 47

A few more example predictions

Input: Feld (field)
Output:

  • a team of baseball players in blue uniforms walking together on field
  • a brown and black dog running through a grassy field ∗ (1)
  • two small children walk away in a field

Input: grün(en) (green)
Output:

  • boy wearing a green and white soccer uniform running through the grass
  • a girl is screaming as she comes off the water slide ∗ (3)
  • a brown dog is chasing a red frisbee across a grassy field ∗ (2)

Input: groß(en) (big)
Output:

  • a large crowd of people ice skating outdoors
  • a surfer catching a large wave in the ocean
  • a small group of people sitting together outside ∗ (3)

9 / 12

slide-48
SLIDE 48

Error analysis by annotator

[Figure: plot of P@10 (%) for DETextPrior, DEVisionCNN, XVisionSpeechCNN, and XBoWCNN.]

10 / 12

slide-51
SLIDE 51

Cross-lingual keyword spotting

[Figure: a written English query, “burning”, is matched against Swahili speech labelled “kuwaka” and “kuwaka moto”.]

11 / 12

slide-55
SLIDE 55

Conclusions and future work

  • Visual grounding makes it possible to perform cross-lingual keyword spotting without any parallel speech and text or translations

  • Future: Apply approach to a truly low-resource language
  • Perform error analysis on a larger scale
  • Visual tagger improvements: language-agnostic visual recognition

12 / 12

slide-56
SLIDE 56

https://github.com/kamperh/

slide-57
SLIDE 57

Training: Visually grounded model

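[Figure: an image $I$ is passed through VGG-16 to give German (text) tags $\hat{y}_{\mathrm{de}}$ (e.g. Feld, Hunde, springt). The paired English speech $X$ is passed through a network with blocks labelled max, conv, max, feedfwd to give $f(X)$; this network is the cross-lingual keyword spotter. A loss is computed between $f(X)$ and $\hat{y}_{\mathrm{de}}$.]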

slide-58
SLIDE 58

Testing: Cross-lingual keyword spotting

[Figure: each utterance $X_1, X_2, X_3$ in the English speech collection (want to search) is mapped to $f(X_1), f(X_2), f(X_3)$.]

Given German keyword: ‘Hunde’ (corresponds to dim. $w$)

$f_w(X_i) = P_\theta(w \mid X_i)$: score for whether (English) speech $X_i$ contains a translation of the given (German) keyword $w$