Visually grounded cross-lingual keyword spotting in speech

SLTU, August 2018
Herman Kamper (1) and Michael Roth (2)
(1) E&E Engineering, Stellenbosch University, South Africa
(2) Saarland University, Germany
http://www.kamperh.com/

Advances in speech recognition

• Addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text [Xiong et al., TASLP’17]
• Very different from the “supervision” infants use to learn language
• Sometimes not possible, e.g., for unwritten languages

Images as weak labels for speech

Can we use images as weak labels in low-resource settings?

• Maybe we cannot use this type of data for full ASR, but maybe it can be used for other tasks?
• Goal: Use this type of data for cross-lingual keyword spotting

Cross-lingual keyword spotting

[Figure: written query "burning" (English) is used to search a Swahili speech corpus; matching utterances contain "kuwaka".]

Cross-lingual word prediction from images

[Figure: an image is passed through VGG to give visual tag scores $y_{\mathrm{vis}}$ (example tags: hat, man, shirt; example scores 0.85, 0.8, 0.9). Speech $X$ is passed through conv, max and feedfwd layers to give $f(X)$, which is trained with loss $\ell$ against $y_{\mathrm{vis}}$. Afterwards only the speech branch is applied, here to Swahili speech.]

$f(X) \in \mathbb{R}^W$ is a vector of word probabilities
I.e., a cross-lingual spoken bag-of-words (BoW) classifier
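The speech branch in the figure can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the single convolutional layer, the ReLU, the sigmoid output, the layer sizes, and the summed cross-entropy loss with soft targets are all assumptions, since the slide only names "conv", "max", "feedfwd" and a loss ℓ. The names `spoken_bow`, `soft_target_loss` and `params` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_relu(X, kernels, bias):
    """Valid 1-D convolution over time, followed by ReLU (assumed non-linearity).

    X: (T, D) acoustic frames; kernels: (K, D, C); bias: (C,).
    Returns an array of shape (T - K + 1, C).
    """
    K = kernels.shape[0]
    T = X.shape[0]
    out = np.stack([
        np.tensordot(X[t:t + K], kernels, axes=([0, 1], [0, 1]))
        for t in range(T - K + 1)
    ])
    return np.maximum(out + bias, 0.0)

def spoken_bow(X, params):
    """Map a variable-length utterance X to f(X), a length-W vector of word scores."""
    h = conv1d_relu(X, params["kernels"], params["conv_bias"])
    pooled = h.max(axis=0)  # max over time gives a fixed-size vector
    logits = pooled @ params["W_out"] + params["b_out"]  # feedforward output layer
    return sigmoid(logits)  # independent per-word probabilities (assumption)

def soft_target_loss(f_x, y_vis, eps=1e-8):
    """Summed cross-entropy between f(X) and soft visual tag scores (assumed loss)."""
    return -np.sum(y_vis * np.log(f_x + eps) + (1.0 - y_vis) * np.log(1.0 - f_x + eps))

# Toy example with made-up sizes: 13-dim frames, 16 filters, W = 3 words.
rng = np.random.default_rng(0)
D, K, C, W = 13, 9, 16, 3
params = {
    "kernels": rng.normal(scale=0.1, size=(K, D, C)),
    "conv_bias": np.zeros(C),
    "W_out": rng.normal(scale=0.1, size=(C, W)),
    "b_out": np.zeros(W),
}
X = rng.normal(size=(50, D))        # one utterance with 50 frames
y_vis = np.array([0.85, 0.8, 0.9])  # soft tag scores from the figure
f_x = spoken_bow(X, params)
loss = soft_target_loss(f_x, y_vis)
```

Max pooling over time is what lets utterances of different lengths map to the same W-dimensional output, so no word alignment or transcription is needed during training.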

Experimental details

• Goal: Use visual grounding for cross-lingual keyword spotting
• Proof-of-concept: Use English speech with German queries

[Figure: cross-lingual keyword spotter. Image $I$ is passed through VGG-16 to give German (text) tags $\hat{y}_{\mathrm{de}}$ (example tags: Hunde, Feld, springt). English speech $X$ is passed through conv, max and feedfwd layers to give $f(X)$, trained with loss $\ell$ against $\hat{y}_{\mathrm{de}}$.]

• Data: 8000 images, each with 5 English spoken captions (∼37 hours)
• Weak labels: German visual tagger trained on German Multi30k

Predictions on test data

Given German keyword: ‘Hunde’

[Figure: English speech collection (want to search). Utterances $X_1, X_2, X_3$ are mapped to $f(X_1), f(X_2), f(X_3)$; the keyword corresponds to dimension $w$ of each output.]

$f_w(X_i) = P_\theta(w \mid X_i)$: score for whether (English) speech $X_i$ contains translation of given (German) keyword $w$

Evaluation: Does predicted keyword occur in reference translation?
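Keyword search with these scores amounts to reading off one output dimension per utterance and sorting. The sketch below assumes the outputs $f(X_i)$ have already been stacked into a matrix; the vocabulary, the score values and the function name `rank_utterances` are made up for illustration.

```python
import numpy as np

def rank_utterances(scores, keyword_index, top_k=10):
    """Return indices of the top_k utterances for one keyword.

    scores: (N, W) array whose row i is f(X_i).
    keyword_index: the dimension w corresponding to the written keyword.
    """
    column = scores[:, keyword_index]            # f_w(X_i) for every utterance
    order = np.argsort(-column, kind="stable")   # highest score first
    return order[:top_k].tolist()

# Hypothetical three-word German vocabulary and four English utterances.
vocab = ["Fahrrad", "Hunde", "Straße"]
scores = np.array([
    [0.10, 0.92, 0.05],
    [0.70, 0.15, 0.40],
    [0.05, 0.81, 0.10],
    [0.20, 0.03, 0.88],
])
w = vocab.index("Hunde")
top = rank_utterances(scores, w, top_k=2)  # utterances 0 and 2 score highest for 'Hunde'
```

No speech recognition or translation step appears at search time: the German keyword only selects which column of the output to rank by.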

Example predictions (top retrievals)

Task: Given written German keyword, find utterances in an unseen English speech collection containing that keyword

Input: Fahrrad (bicycle)
Output (in top 10):
• man riding a bicycle on a foggy day
• a biker does a trick on a ramp
• a person is doing tricks on a bicycle in a city

Input: Straße (street)
Output (in top 10):
• a woman in black and red listens to an ipod walks down the street
• people on the city street walk past a puppet theater
• an asian woman rides a bicycle in front of two cars

Cross-lingual keyword spotting performance

[Figure: bar chart of P@10 (%), axis from 0 to 100, for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN.]
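P@10 is the fraction of the ten highest-ranked utterances that count as correct for a keyword. A minimal sketch follows, assuming a retrieval is correct when the keyword appears as a whitespace-separated token in the utterance's reference translation. The exact matching rule used in the talk is not stated, and the German reference sentences and function names below are invented for illustration.

```python
def keyword_in_reference(keyword, reference_translation):
    """True if the keyword occurs as a token in the reference translation."""
    tokens = reference_translation.lower().split()
    return keyword.lower() in tokens

def precision_at_k(ranked_utterances, keyword, references, k=10):
    """Fraction of the top-k retrieved utterances whose reference contains the keyword."""
    top = ranked_utterances[:k]
    hits = sum(keyword_in_reference(keyword, references[u]) for u in top)
    return hits / k

# Hypothetical German reference translations for three retrieved utterances.
references = {
    "utt1": "ein mann fährt fahrrad an einem nebligen tag",
    "utt2": "ein biker macht einen trick auf einer rampe",
    "utt3": "eine person macht tricks auf einem fahrrad in einer stadt",
}
ranked = ["utt1", "utt2", "utt3"]
p_at_3 = precision_at_k(ranked, "Fahrrad", references, k=3)  # 2 of 3 contain the keyword
```

Exact matching of this kind is strict: a retrieval can look sensible to a human and still count as an error when the keyword itself is absent from the reference.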

Example predictions marked as errors

Input: Feld (field)
Output:
• a brown and black dog running through a grassy field *

Input: grün(en) (green)
Output:
• a brown dog is chasing a red frisbee across a grassy field *

Input: groß(en) (big)
Output:
• a small group of people sitting together outside *

Error analysis by annotator

[Figure: bar chart of P@10 (%), axis from 0 to 100, for DETextPrior, DEVisionCNN, XVisionSpeechCNN and XBoWCNN.]

Cross-lingual keyword spotting

[Figure: written query "burning" (English); Swahili words shown: "moto", "kuwaka".]

Conclusions and future work

• Visual grounding makes it possible to perform cross-lingual keyword spotting without any parallel speech and text or translations
