Visually grounded learning of keyword prediction from untranscribed speech - PowerPoint PPT Presentation



slide-1
SLIDE 1

Visually grounded learning of keyword prediction from untranscribed speech

Interspeech, August 2017

Herman Kamper1, Shane Settle2, Gregory Shakhnarovich2, Karen Livescu2

1Stellenbosch University, South Africa 2Toyota Technological Institute at Chicago, USA

http://www.kamperh.com/

slide-10
SLIDE 10

Success in speech recognition

[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]

  • Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
  • Data: 2000 hours transcribed speech audio; ∼350M/560M words text
  • Can we do this for all 7000 languages spoken in the world?

1 / 13



slide-14
SLIDE 14

What can we learn from weak labels?

  • Weak labels: Speech paired with other signal (e.g. images)
  • Criticism: You always have some labelled data, but. . .
  • Get insight into human language acquisition [Räsänen and Rasilo, ’15]

  • Language acquisition in robots [Roy, ’99]; [Renkens and Van hamme, ’15]
  • Analysis of audio for unwritten languages [Besacier et al., ’14]
  • New insights and models for speech processing [Jansen et al., ’13]

2 / 13


slide-17
SLIDE 17

Using images to ground language

  • Image captioning: Generate written natural language description of a given image [Vinyals et al., CVPR’15]
  • Grounding written language using images [Bernardi et al., JAIR’16]
  • We consider images paired with unlabelled spoken captions:

[audio: example spoken caption]

3 / 13


slide-22
SLIDE 22

Word prediction from images and speech

[Figure: the vision network (VGG) maps the image to soft word targets yvis (words such as “hat”, “man”, “shirt”); the speech network (conv → max-pool → feedforward layers) maps utterance X to f(X); a loss L compares f(X) to yvis]

4 / 13


slide-26
SLIDE 26

Word prediction from images and speech

[Figure: at test time the speech network alone (conv → max-pool → feedforward layers) maps utterance X to f(X), predicting words such as “man” and “hat”]

f(X) ∈ R^W is a vector of word probabilities, i.e. a spoken bag-of-words (BoW) classifier

4 / 13


slide-28
SLIDE 28

Images paired with untranscribed speech

We are still in this setting:

  • We do not use any of the speech transcriptions during model training (only for evaluation)
  • But our resulting model can make bag-of-words (BoW) predictions
  • Note: Vision system could be seen as language independent (future)

5 / 13

slide-29
SLIDE 29

Experimental details

  • Data: 8000 images with 5 spoken captions each, divided into train, development and test sets [Harwath and Glass, ASRU’15]

  • Prediction: Output words w where fw(X) > α
  • Tasks: Spoken bag-of-words prediction; keyword spotting
  • Evaluation: Compare to words in transcriptions of test data
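The thresholded prediction rule can be sketched as follows. This is a minimal illustration, not the authors’ code; the vocabulary and probabilities are made up:

```python
import numpy as np

def predict_bow(f_X, vocab, alpha=0.4):
    """Output the words w whose probability f_w(X) exceeds alpha.

    f_X   : length-W array of per-word probabilities from the speech network
    vocab : list of W word types, aligned with f_X
    alpha : decision threshold (the slides use values such as 0.4 and 0.7)
    """
    return {w for w, p in zip(vocab, f_X) if p > alpha}

# Toy example with made-up probabilities:
vocab = ["man", "bicycle", "dog", "grass"]
f_X = np.array([0.9, 0.85, 0.1, 0.05])
predicted = predict_bow(f_X, vocab, alpha=0.4)  # contains "man" and "bicycle"
```

Raising α trades recall for precision, which is why the precision results below are reported at more than one threshold.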

6 / 13


slide-34
SLIDE 34

Task 1: Spoken bag-of-words prediction

Input utterance                                            Predicted BoW labels
man on bicycle is doing tricks in an old building          bicycle, bike, man, riding, wearing
a little girl is climbing a ladder                         child, girl, little, young
a rock climber standing in a crevasse                      climbing, man, rock
a dog running in the grass around sheep                    dog, field, grass, running
a man in a miami basketball uniform looking to the right   ball, basketball, man, player, uniform, wearing

7 / 13

slide-35
SLIDE 35

Task 1: Spoken bag-of-words prediction

[Bar chart: bag-of-words precision (%) of the unigram baseline, VisionSpeechCNN, and OracleSpeechCNN at thresholds α = 0.4 and α = 0.7; axis spans 20–80%]

8 / 13



slide-40
SLIDE 40

Task 1: Spoken bag-of-words prediction

False alarm keywords and words in corresponding utterances:

[Figure: false alarm keywords (e.g. dog, man, young, bike, girl, water, grass) shown alongside words occurring in the corresponding utterances (e.g. dogs, biker, ocean, snowy, children, girls)]

9 / 13


slide-54
SLIDE 54

Task 2: Keyword spotting

Keyword   Example of matched utterance                          Type
beach     a boy in a yellow shirt is walking on a beach . . .   correct
behind    a surfer does a flip on a wave                        mistake
bike      a dirt biker flies through the air                    variant
boys      two children play soccer in the park                  semantic
large     . . . a rocky cliff overlooking a body of water       semantic
play      children playing in a ball pit                        variant
sitting   two people are seated at a table with drinks          semantic
yellow    a tan dog jumping over a red and blue toy             mistake
young     a little girl on a kid swing                          semantic

10 / 13

slide-55
SLIDE 55

Task 2: Keyword spotting

Model              P@10   P@N    EER
Unigram baseline    5.0    3.5   50.0
VisionSpeechCNN    54.5   33.1   22.3
OracleSpeechCNN    96.5   83.0    4.1
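These metrics can be sketched as follows, assuming per-utterance keyword scores and binary occurrence labels (P@N is precision at k = number of true occurrences). This is an illustration of the standard definitions, not the paper’s evaluation code:

```python
import numpy as np

def precision_at_k(scores, labels, k=10):
    """Precision among the k highest-scoring utterances for a keyword.
    scores: per-utterance detection scores; labels: 1 if the keyword
    truly occurs in the utterance's transcription."""
    top = np.argsort(scores)[::-1][:k]
    return labels[top].mean()

def equal_error_rate(scores, labels):
    """Sweep thresholds; EER is (approximately) the operating point
    where the false-alarm rate equals the miss rate."""
    pos = (labels == 1).sum()
    neg = (labels == 0).sum()
    eer = 1.0
    for t in np.unique(scores):
        fa = ((scores >= t) & (labels == 0)).sum() / neg   # false-alarm rate
        miss = ((scores < t) & (labels == 1)).sum() / pos  # miss rate
        eer = min(eer, max(fa, miss))  # closest crossing of the two rates
    return eer

# Toy separable example: both positives rank above both negatives.
scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
```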

11 / 13


slide-58
SLIDE 58

Task 3: (Towards) semantic keyword spotting

Retrieve all utterances in a set containing content related in meaning to a given textual keyword.

Model              P@10
Unigram baseline   10.0
VisionSpeechCNN    82.5
OracleSpeechCNN    99.5

Future work: formalising this task.

12 / 13


slide-62
SLIDE 62

Conclusions and future work

  • Visual grounding makes it possible to develop a word prediction model without any parallel speech and text
  • Future: Thorough analysis of VisionSpeech models to see if they learn something about semantics; multi-lingual aspects

  • What can we learn about language acquisition in humans?
  • Language acquisition in robots

13 / 13

slide-63
SLIDE 63

https://github.com/kamperh/recipe_vision_speech_flickr

slide-64
SLIDE 64

The vision tagging system

  • VGG-16 input layers (1.3M images) [Simonyan and Zisserman, arXiv’14]
  • Train on Flickr30k (caption BoW labels)
  • Targets: W = 1000 most common word types after removing stop words
  • Note: Vision system could be seen as language independent (future work)

[Figure: the VGG network maps an image to yvis, a W-dimensional vector of word probabilities (e.g. for “hat”, “man”, “shirt”)]
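The caption BoW targets can be sketched as follows. This is a toy illustration: the stop-word list and captions are made up, whereas the paper uses the Flickr30k captions and W = 1000 word types:

```python
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of"}

def build_vocab(captions, W=1000):
    """Keep the W most common non-stop word types across all captions."""
    counts = Counter(w for c in captions
                     for w in c.lower().split() if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(W)]

def bow_target(caption, vocab):
    """Multi-hot target: 1.0 if the word type occurs in the caption."""
    words = set(caption.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

captions = ["a man on a bicycle", "a dog in the grass", "a man and a dog"]
vocab = build_vocab(captions, W=5)
target = bow_target("a man on a bicycle", vocab)
```

The vision system is then trained against such multi-hot caption labels, and its soft outputs in turn supervise the speech network.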

slide-65
SLIDE 65

Word prediction from images and speech

Vision system outputs yvis, giving probability of word w for image I: yvis,w = P(w|I, γ)

slide-66
SLIDE 66

Word prediction from images and speech

Vision system outputs yvis, giving probability of word w for image I: yvis,w = P(w|I, γ) Interpret dimension w of the speech network output f(X) as: fw(X) = P(w|X, θ)

slide-67
SLIDE 67

Word prediction from images and speech

Vision system outputs yvis, giving probability of word w for image I: yvis,w = P(w|I, γ) Interpret dimension w of the speech network output f(X) as: fw(X) = P(w|X, θ) Train using cross-entropy loss (i.e. soft targets): L(f(X), yvis) = −

W

  • w=1

{yvis,w log fw(X) + (1 − yvis,w) log [1 − fw(X)]}

slide-68
SLIDE 68

Word prediction from images and speech

Vision system outputs yvis, giving the probability of word w for image I:

    yvis,w = P(w | I, γ)

Interpret dimension w of the speech network output f(X) as:

    fw(X) = P(w | X, θ)

Train using the cross-entropy loss (i.e. soft targets):

    L(f(X), yvis) = − Σ_{w=1}^{W} { yvis,w log fw(X) + (1 − yvis,w) log [1 − fw(X)] }

If yvis,w ∈ {0, 1}, this is the summed log loss of W binary classifiers.
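The loss on this slide can be written out directly. A NumPy sketch of the equation, not the authors’ implementation:

```python
import numpy as np

def soft_bow_loss(f_X, y_vis, eps=1e-12):
    """Summed binary cross-entropy against the vision system's soft targets:
        L = - sum_w [ y_vis,w log f_w(X) + (1 - y_vis,w) log(1 - f_w(X)) ]
    f_X, y_vis: length-W arrays of probabilities."""
    f = np.clip(f_X, eps, 1 - eps)  # guard the logarithms
    return -np.sum(y_vis * np.log(f) + (1 - y_vis) * np.log(1 - f))
```

With hard targets y_vis,w ∈ {0, 1} this reduces to the summed log loss of W independent binary classifiers, exactly as the slide notes; with soft vision targets the minimum is at f(X) = yvis.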

slide-69
SLIDE 69

Map images and speech into common space

[Figure: the VGG network maps the image to an embedding yvis, and the speech network (conv → max-pool → feedforward) maps utterance X to yspch; training minimises a distance d(yvis, yspch) in the common space]

[Harwath et al., NIPS’16]

slide-70
SLIDE 70

Retrieval in common (semantic) space

[Figure: embeddings y ∈ R^D; image embedding yvis and speech embedding yspch live in the same D-dimensional space, enabling retrieval across modalities]

[Harwath et al., NIPS’16]
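Retrieval in the shared space can be sketched with cosine similarity; this illustrates the idea only, and the similarity actually used in [Harwath et al., NIPS’16] may differ:

```python
import numpy as np

def retrieve(y_query, y_spch_all, k=10):
    """Rank all speech embeddings by cosine similarity to a query
    embedding in the shared D-dimensional space; return top-k indices."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = unit(y_spch_all) @ unit(y_query)   # cosine similarities
    return np.argsort(sims)[::-1][:k]         # best matches first

# Toy 2-D embeddings: the query points along the first axis.
query = np.array([1.0, 0.0])
speech = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
top = retrieve(query, speech, k=2)
```

The same function works whether the query embedding comes from an image or from a written keyword mapped into the space.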

slide-71
SLIDE 71

References I

  • R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank, “Automatic description generation from images: A survey of models, datasets, and evaluation measures,” J. Artif. Intell. Res., vol. 55, pp. 409–442, 2016.
  • L. Besacier, E. Barnard, A. Karpov, and T. Schultz, “Automatic speech recognition for under-resourced languages: A survey,” Speech Commun., vol. 56, pp. 85–100, 2014.
  • D. Harwath, A. Torralba, and J. R. Glass, “Unsupervised learning of spoken language with visual context,” in Proc. NIPS, 2016.
  • D. Harwath and J. Glass, “Deep multimodal semantic embeddings for speech and images,” in Proc. ASRU, 2015.
  • A. Jansen et al., “A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition,” in Proc. ICASSP, 2013.
  • D. Palaz, G. Synnaeve, and R. Collobert, “Jointly learning to locate and classify words using convolutional networks,” in Proc. Interspeech, 2016.
  • O. Räsänen and H. Rasilo, “A joint model of word segmentation and meaning acquisition through cross-situational learning,” Psychol. Rev., vol. 122, no. 4, pp. 792–829, 2015.
  • V. Renkens and H. Van hamme, “Mutually exclusive grounding for weakly supervised non-negative matrix factorisation,” in Proc. Interspeech, 2015.

slide-72
SLIDE 72

References II

  • D. Roy, “Learning from sights and sounds: A computational model,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1999.
  • G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, “English conversational telephone speech recognition by humans and machines,” arXiv preprint arXiv:1703.02136, 2017.
  • K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. CVPR, 2015.
  • W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.