SLIDE 1

Learning from unlabelled speech, with and without visual cues

Herman Kamper
Toyota Technological Institute at Chicago
http://www.kamperh.com/

Ohio State University, May 2017

SLIDE 9

Success in speech recognition

[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]

  • Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
  • Data: 2000 hours transcribed speech audio; ∼350M/560M words text
  • Can we do this for all 7000 languages spoken in the world?

SLIDE 16

Learning from raw speech with no or weak labels

Unsupervised, or zero-resource, speech processing:

  • What can we learn directly from raw speech?
  • Unsupervised representation learning
  • Query-by-example search
  • Unsupervised segmentation and clustering (word discovery)

Learning from weak (distant) labels:

  • What can we learn from speech paired with another modality?
  • E.g. translations or images

SLIDE 19

Why learn with no or weak labels?

  • Criticism: You always have some labelled data, but. . .
  • Get insight into human language acquisition [Räsänen and Rasilo, ’15]
  • Language acquisition in robots [Roy, ’99]; [Renkens and Van hamme, ’15]
  • Analysis of audio for unwritten languages [Besacier et al., ’14]
  • New insights and models for speech processing [Jansen et al., ’13]

SLIDE 24

Example: Query-by-example search

Spoken query:

Useful speech system, not requiring any transcribed speech

[Jansen and Van Durme, IS’12]

SLIDE 27

Learning from unlabelled speech with and without visual cues

Talk outline:

  • 1. Unsupervised segmentation and clustering of speech (without)
  • 2. Using images to visually ground untranscribed speech (with)

SLIDE 29

Unsupervised segmentation and clustering:

Segmental Bayesian Speech Model

Aren Jansen, Sharon Goldwater

SLIDE 32

Full-coverage segmentation and clustering

SLIDE 34

Bayesian models for full-coverage segmentation

Previous models use explicit subword discovery directly on speech features, e.g. [Lee et al., TACL’15]. Our approach uses whole-word segmental representations, i.e. acoustic word embeddings [Kamper et al., TASLP’16].

SLIDE 37

Acoustic word embeddings

[Diagram: variable-duration speech segments Y1 and Y2 are mapped by fe(·) to fixed-dimensional embeddings x1, x2 with x ∈ RD]

Dynamic programming alignment has quadratic time complexity, while embedding comparison is linear. Standard vector-space clustering methods can then be used.

SLIDE 44

Unsupervised segmental Bayesian model

[Diagram: speech waveform → acoustic frames y1:M via fa(·) → acoustic word embeddings xi = fe(yt1:t2) via fe(·) → Bayesian Gaussian mixture model giving p(xi | h−); acoustic modelling and word segmentation are performed jointly]
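
The acoustic model on this slide is a Bayesian Gaussian mixture over the acoustic word embeddings, with components scored through the collapsed posterior predictive p(xi | h−). Below is a minimal sketch of collapsed Gibbs sampling for such a model, assuming fixed spherical component covariances and a zero-mean conjugate prior on component means; the full segmental model interleaves this with word-boundary sampling, and the hyperparameters here are purely illustrative.

```python
import numpy as np

def gibbs_bgmm(X, K=20, alpha=1.0, sigma2=1.0, sigma02=5.0, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for a Bayesian GMM with fixed spherical
    component covariance sigma2*I and a conjugate N(0, sigma02*I) prior
    on component means. A simplified sketch of the acoustic model, not
    the full segmental system."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    z = rng.integers(0, K, N)  # random initial assignments
    counts = np.bincount(z, minlength=K).astype(float)
    sums = np.zeros((K, D))
    for i in range(N):
        sums[z[i]] += X[i]

    for _ in range(n_iter):
        for i in range(N):
            # Remove point i from its current component
            counts[z[i]] -= 1
            sums[z[i]] -= X[i]
            # Posterior predictive p(x_i | h-) under each component
            var_post = 1.0 / (1.0 / sigma02 + counts / sigma2)   # (K,)
            mu_post = var_post[:, None] * (sums / sigma2)        # (K, D)
            var_pred = var_post + sigma2                         # (K,)
            log_pred = -0.5 * (D * np.log(2 * np.pi * var_pred)
                               + ((X[i] - mu_post) ** 2).sum(axis=1) / var_pred)
            # Dirichlet(alpha/K) prior over assignments, collapsed
            log_prob = np.log(counts + alpha / K) + log_pred
            p = np.exp(log_prob - log_prob.max())
            z[i] = rng.choice(K, p=p / p.sum())
            # Add point i back under its (possibly new) assignment
            counts[z[i]] += 1
            sums[z[i]] += X[i]
    return z
```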

SLIDE 45

Acoustic word embeddings: Downsampling

[Diagram: frames from fa(·) are uniformly downsampled and flattened to form the embedding fe(·)]

  • Simple embedding approach also used in other studies, e.g. [Abdel-Hamid et al., 2013]
  • Downsampling is simple, but actually hard to beat (unsupervised)
  • Ongoing work, e.g. [Levin et al., ASRU’13]; [Kamper et al., ICASSP’16]; [Settle and Livescu, SLT’16]
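
The downsampling embedding on this slide fits in a few lines; a minimal sketch (the function name and the choice of 10 samples are illustrative):

```python
import numpy as np

def downsample_embed(frames, n_keep=10):
    """Map a variable-length segment of acoustic frames (T x d array)
    to a fixed-dimensional vector: keep n_keep uniformly spaced frames
    and flatten them into a single (n_keep * d) vector."""
    T = frames.shape[0]
    idx = np.round(np.linspace(0, T - 1, n_keep)).astype(int)
    return frames[idx].flatten()

# Segments of different durations map to same-sized vectors, so they
# can be compared directly, e.g. with cosine distance:
# x1, x2 = downsample_embed(seg1), downsample_embed(seg2)
```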

SLIDE 47

Evaluation

[Figure: a ground truth alignment (word-level labels such as "yeah", "i mean") compared against the unsupervised prediction (cluster-level labels such as Cluster 931, Cluster 477)]

Metrics:

  • Unsupervised word error rate (WER)
  • Word token precision, recall, F-score
  • Word type precision, recall, F-score
  • Word boundary precision, recall, F-score
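
As an example of the last metric, word boundary precision, recall and F-score compare hypothesised boundary positions against the reference alignment, usually within a small tolerance; a sketch (tolerance handling varies across papers):

```python
def boundary_scores(ref, hyp, tol=2):
    """Precision, recall and F-score of hypothesised word boundaries
    (frame positions) against reference boundaries; a hypothesis is a
    hit if it lies within `tol` frames of an unused reference boundary."""
    hits, used = 0, set()
    for b in hyp:
        for r in ref:
            if r not in used and abs(b - r) <= tol:
                hits += 1
                used.add(r)
                break
    p = hits / len(hyp) if hyp else 0.0
    r = hits / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```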

SLIDE 49

Small-vocabulary segmentation and clustering

[Bar chart: WER (%), roughly in the 5-35% range, for the discrete HMM and for BayesSeg with K = 11 and K = 100 components]

Discrete HMM: [Walter et al., ASRU’13]. BayesSeg: [Kamper et al., TASLP’16].

SLIDE 50

Small-vocabulary segmentation and clustering

[Figure: discovered cluster IDs (33, 12, 47, 60, 66, 27, 83, 51, 92, 38, 63, 14, 89, 24, 85) mapped to ground truth digit types (one, two, three, four, five, six, seven, eight, nine, zero, oh)]

[Kamper et al., TASLP’16]

SLIDE 51

Large-vocabulary: English

[Bar chart: token, type and boundary F-scores (%) for ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD) and BayesSegMinDur-cAE (SI)]

ZRSBaselineUTD: [Versteegh et al., IS’15]. UTDGraphCC: [Lyzinski et al., IS’15]. SyllableSegOsc+: [Räsänen et al., IS’15]. BayesSeg: [Kamper et al., arXiv’16].

SLIDE 52

Large-vocabulary: Xitsonga

[Bar chart: token, type and boundary F-scores (%) for ZRSBaselineUTD (SI), UTDGraphCC (SI), SyllableSegOsc+ (SD), BayesSegMinDur-MFCC (SD) and BayesSegMinDur-cAE (SI)]

ZRSBaselineUTD: [Versteegh et al., IS’15]. UTDGraphCC: [Lyzinski et al., IS’15]. SyllableSegOsc+: [Räsänen et al., IS’15]. BayesSeg: [Kamper et al., arXiv’16].

SLIDE 53

Listen to discovered clusters

  • Data for small-vocabulary experiments: [Play]
  • Small-vocabulary cluster 45: [Play]
  • Large-vocabulary English cluster 1214: [Play]
  • Large-vocabulary Xitsonga cluster 629: [Play]

SLIDE 54

The true (less rosy) picture

[Figure: the embedding dimensions of a word embedding from cluster 33 (→ "one"), alongside embeddings close to it that correspond to non-word segments]

[Levin et al., ASRU’13]; [Kamper et al., ICASSP’16]; [Settle and Livescu, SLT’16]

SLIDE 56

Using visual cues to learn from untranscribed speech:

Visually Grounded Keyword Prediction

Shane Settle, Greg Shakhnarovich, Karen Livescu

SLIDE 57

Arrival

SLIDE 60

Using images for grounding language

  • Image captioning: Generate written natural language description of a given image [Vinyals et al., CVPR’15]
  • Grounding written language using images [Bernardi et al., JAIR’16]
  • We consider images paired with unlabelled spoken captions: [Play]

SLIDE 62

Map images and speech into common space

[Diagram: an image is passed through VGG to give yvis; the spoken caption X is passed through convolutional, max-pooling and feedforward layers to give yspch; training minimises the distance d(yvis, yspch) for matched pairs]

[Harwath et al., NIPS’16]
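
[Harwath et al., NIPS’16] train this network so that matched image-speech pairs score higher than mismatched pairs. Below is a hedged sketch of a margin-based ranking loss of that general kind, computed over a batch of matched pairs; it is illustrative, not their exact objective:

```python
import numpy as np

def ranking_loss(y_vis, y_spch, margin=1.0):
    """Margin ranking loss over n matched (image, speech) embedding
    pairs: each matched pair's similarity (dot product) should beat
    every mismatched pair in the batch by at least `margin`."""
    scores = y_vis @ y_spch.T  # (n, n): scores[i, j] = sim(image i, speech j)
    matched = np.diag(scores)  # similarities of the matched pairs
    n = scores.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                loss += max(0.0, margin - matched[i] + scores[i, j])  # impostor speech
                loss += max(0.0, margin - matched[i] + scores[j, i])  # impostor image
    return loss / n
```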

SLIDE 63

Retrieval in common (semantic) space

[Diagram: image embeddings yvis and speech embeddings yspch lie in the same D-dimensional space y ∈ RD, so cross-modal retrieval reduces to nearest-neighbour search]

[Harwath et al., NIPS’16]

SLIDE 64

Can we use a (supervised) vision model to get labels?

[Diagram as on slide 62: image → VGG → yvis; speech X → conv/max/feedforward → yspch; distance d(yvis, yspch)]

Textual labels for the speech cannot be obtained using this model.

SLIDE 69

Word prediction from images and speech

[Diagram (training): an image is passed through VGG to give soft word targets yvis (probabilities such as 0.85, 0.8, 0.9 for words like "hat", "man", "shirt"); the spoken caption X is passed through convolutional, max-pooling and feedforward layers to give f(X); the two outputs are compared with a loss L]

[Kamper et al., arXiv’17]

SLIDE 73

Word prediction from images and speech

[Diagram (test): the speech X alone is passed through convolutional, max-pooling and feedforward layers to give f(X), with outputs for words such as "man" and "hat"]

f(X) ∈ RW is a vector of word probabilities, i.e. a spoken bag-of-words (BoW) classifier

[Kamper et al., arXiv’17]

SLIDE 77

Word prediction from images and speech

The vision system outputs yvis, giving the probability of word w for image I:

yvis,w = P(w | I, γ)

Interpret dimension w of the speech network output f(X) as:

fw(X) = P(w | X, θ)

Train using the cross-entropy loss (i.e. soft targets):

L(f(X), yvis) = − Σ_{w=1}^{W} { yvis,w log fw(X) + (1 − yvis,w) log [1 − fw(X)] }

If yvis,w ∈ {0, 1}, this is the summed log loss of W binary classifiers.

[Kamper et al., arXiv’17]
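
A minimal numpy sketch of this loss (the clipping constant is an added numerical-stability detail, not part of the slide's formula):

```python
import numpy as np

def bow_loss(f_X, y_vis, eps=1e-8):
    """Cross-entropy between the speech network's word probabilities
    f(X) and the vision system's soft targets yvis (both length-W
    vectors), summed over the W words as on the slide."""
    f_X = np.clip(f_X, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_vis * np.log(f_X) + (1 - y_vis) * np.log(1 - f_X))
```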

SLIDE 78

Images paired with untranscribed speech

We are still in this setting:

  • I.e., we do not use any of the speech transcriptions during model training (only for evaluation)
  • But our resulting model can make bag-of-words (BoW) predictions

SLIDE 79

The vision system

  • VGG-16 input layers (1.3M images) [Simonyan and Zisserman, arXiv’14]
  • Train on Flickr30k (caption BoW labels)
  • Targets: W = 1000 most common word types after removing stop words
  • Note: Vision system could be seen as language independent (future work)

[Diagram: image → VGG → yvis, a vector over the W words (e.g. "hat", "man", "shirt")]

SLIDE 80

Experimental details

  • Data: 8000 images, each with 5 spoken captions, divided into train, development and test sets [Harwath and Glass, ASRU’15]
  • Prediction: Output words w where fw(X) > α
  • Tasks: Spoken bag-of-words prediction; keyword spotting
  • Evaluation: Compare to words in transcriptions of test data
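
The prediction rule in the second bullet is a simple threshold over the output vector; a sketch, where `vocab` and the function name are illustrative:

```python
def predict_bow(f_X, vocab, alpha=0.7):
    """Output the BoW labels for one utterance: every word w whose
    predicted probability f_w(X) exceeds the threshold alpha."""
    return sorted(w for w, p in zip(vocab, f_X) if p > alpha)
```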

SLIDE 85

Task 1: Spoken bag-of-words prediction

Input utterance → Predicted BoW labels:

  • man on bicycle is doing tricks in an old building → bicycle, bike, man, riding, wearing
  • a little girl is climbing a ladder → child, girl, little, young
  • a rock climber standing in a crevasse → climbing, man, rock
  • a dog running in the grass around sheep → dog, field, grass, running
  • a man in a miami basketball uniform looking to the right → ball, basketball, man, player, uniform, wearing

SLIDE 89

Task 1: Spoken bag-of-words prediction

[Bar chart: precision (%) at thresholds α = 0.4 and α = 0.7 for the Unigram baseline, VisionSpeechCNN and OracleSpeechCNN]

SLIDE 91

Task 1: Spoken bag-of-words prediction

False alarm keywords and words in corresponding utterances:

[Table: false-alarm keywords (e.g. dog, man, bike, girl, white, snow, grass) shown alongside the words of the utterances that triggered them (e.g. "brown", "young", "two", "playing", "three", "running", "ocean", "lake", "snowboarder", "grassy")]

SLIDE 105

Task 2: Keyword spotting

Keyword   Example of matched utterance                           Type
beach     a boy in a yellow shirt is walking on a beach . . .   correct
behind    a surfer does a flip on a wave                        mistake
bike      a dirt biker flies through the air                    variant
boys      two children play soccer in the park                  semantic
large     . . . a rocky cliff overlooking a body of water       semantic
play      children playing in a ball pit                        variant
sitting   two people are seated at a table with drinks          semantic
yellow    a tan dog jumping over a red and blue toy             mistake
young     a little girl on a kid swing                          semantic

SLIDE 106

Task 2: Keyword spotting

Model              P@10   P@N    EER
Unigram baseline    5.0    3.5   50.0
VisionSpeechCNN    54.5   33.1   22.3
OracleSpeechCNN    96.5   83.0    4.1
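
For reference: P@10 is the precision of the 10 highest-scoring utterances for a keyword, and P@N the precision of the top N, with N taken as the number of utterances that truly contain the keyword (the standard usage in the keyword search literature; stated here as an assumption about the table). A sketch:

```python
import numpy as np

def precision_at(scores, labels, k):
    """Precision of the k top-scoring utterances for one keyword.
    scores: detection score per utterance; labels: 1 if the keyword
    occurs in the utterance's transcription, else 0 (numpy arrays)."""
    top = np.argsort(scores)[::-1][:k]
    return labels[top].mean()

# P@10, and P@N with N = number of true occurrences:
# p_10 = precision_at(scores, labels, 10)
# p_N = precision_at(scores, labels, int(labels.sum()))
```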

SLIDE 109

Task 3: (Towards) semantic keyword spotting

Retrieve all utterances in a set containing content related in meaning to a given textual keyword.

Model              P@10
Unigram baseline   10.0
VisionSpeechCNN    82.5
OracleSpeechCNN    99.5

Thoughts on this task are very welcome!

SLIDE 110

Conclusions and Future Work

SLIDE 111

Summary and conclusion

  • We are able to discover (some) structure directly from raw speech audio (segmentation and clustering) [Kamper et al., TASLP’16; arXiv’16]
  • Visual grounding makes it possible to develop a word prediction model without any parallel speech and text [Kamper et al., arXiv’17]
  • Useful to look at speech processing from a different perspective

SLIDE 116

Looking forward

  • Thorough analysis of VisionSpeech models to see if they learn something about semantics; multi-lingual aspects
  • BayesSeg learns from acoustics, VisionSpeech captures something about semantics: can we combine these?
  • Building audio analysis tools for field linguists
  • What can we learn about language acquisition in humans?
  • Language acquisition in robots

SLIDE 117

Code: https://github.com/kamperh/

SLIDE 118

References I

  • O. Abdel-Hamid, L. Deng, D. Yu, and H. Jiang, “Deep segmental neural networks for speech recognition,” in Proc. Interspeech, 2013.
  • R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank, “Automatic description generation from images: A survey of models, datasets, and evaluation measures,” J. Artif. Intell. Res., vol. 55, pp. 409–442, 2016.
  • L. Besacier, E. Barnard, A. Karpov, and T. Schultz, “Automatic speech recognition for under-resourced languages: A survey,” Speech Commun., vol. 56, pp. 85–100, 2014.
  • D. Harwath, A. Torralba, and J. R. Glass, “Unsupervised learning of spoken language with visual context,” in Proc. NIPS, 2016.
  • D. Harwath and J. Glass, “Deep multimodal semantic embeddings for speech and images,” in Proc. ASRU, 2015.
  • A. Jansen and B. Van Durme, “Indexing raw acoustic features for scalable zero resource search,” in Proc. Interspeech, 2012.
  • A. Jansen et al., “A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition,” in Proc. ICASSP, 2013.
  • H. Kamper, M. Elsner, A. Jansen, and S. J. Goldwater, “Unsupervised neural network based feature extraction using weak top-down constraints,” in Proc. ICASSP, 2015.

SLIDE 119

References II

  • H. Kamper, A. Jansen, and S. J. Goldwater, “Unsupervised word segmentation and lexicon discovery using acoustic word embeddings,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 4, pp. 669–679, 2016.
  • H. Kamper, W. Wang, and K. Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in Proc. ICASSP, 2016.
  • H. Kamper, S. J. Goldwater, and A. Jansen, “Fully unsupervised small-vocabulary speech recognition using a segmental Bayesian model,” in Proc. Interspeech, 2015.
  • H. Kamper, A. Jansen, and S. J. Goldwater, “A segmental framework for fully-unsupervised large-vocabulary speech recognition,” arXiv preprint arXiv:1606.06950, 2016.
  • H. Kamper, S. Settle, G. Shakhnarovich, and K. Livescu, “Visually grounded learning of keyword prediction from untranscribed speech,” arXiv preprint arXiv:1703.08136, 2017.
  • C.-y. Lee, T. O’Donnell, and J. R. Glass, “Unsupervised lexicon discovery from acoustic input,” Trans. ACL, vol. 3, pp. 389–403, 2015.
  • K. Levin, K. Henry, A. Jansen, and K. Livescu, “Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings,” in Proc. ASRU, 2013.
  • V. Lyzinski, G. Sell, and A. Jansen, “An evaluation of graph clustering methods for unsupervised term discovery,” in Proc. Interspeech, 2015.

SLIDE 120

References III

  • D. Palaz, G. Synnaeve, and R. Collobert, “Jointly learning to locate and classify words using convolutional networks,” in Proc. Interspeech, 2016.
  • O. J. Räsänen, G. Doyle, and M. C. Frank, “Unsupervised word discovery from speech using automatic segmentation into syllable-like units,” in Proc. Interspeech, 2015.
  • O. Räsänen and H. Rasilo, “A joint model of word segmentation and meaning acquisition through cross-situational learning,” Psychol. Rev., vol. 122, no. 4, pp. 792–829, 2015.
  • V. Renkens and H. Van hamme, “Mutually exclusive grounding for weakly supervised non-negative matrix factorisation,” in Proc. Interspeech, 2015.
  • D. Roy, “Learning from sights and sounds: A computational model,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1999.
  • G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, “English conversational telephone speech recognition by humans and machines,” arXiv preprint arXiv:1703.02136, 2017.
  • S. Settle and K. Livescu, “Discriminative acoustic word embeddings: Recurrent neural network-based approaches,” in Proc. SLT, 2016.
  • K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

SLIDE 121

References IV

  • M. Versteegh, R. Thiollière, T. Schatz, X. N. Cao, X. Anguera, A. Jansen, and E. Dupoux, “The Zero Resource Speech Challenge 2015,” in Proc. Interspeech, 2015.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proc. CVPR, 2015.
  • O. Walter, T. Korthals, R. Haeb-Umbach, and B. Raj, “A hierarchical system for word discovery exploiting DTW-based initialization,” in Proc. ASRU, 2013.
  • W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Achieving human parity in conversational speech recognition,” arXiv preprint arXiv:1610.05256, 2016.