
Deep convolutional acoustic word embeddings using word-pair side information

Herman Kamper1, Weiran Wang2, Karen Livescu2

1CSTR and ILCC, School of Informatics, University of Edinburgh, UK 2Toyota Technological Institute at Chicago, USA

ICASSP 2016


Introduction

◮ Most speech processing systems rely on deep architectures to classify speech frames into subword units (HMM triphone states).

◮ This requires a pronunciation dictionary for breaking words into subwords, and in many cases still makes frame-level independence assumptions.

◮ Some studies have started to reconsider whole words as the basic modelling unit [Heigold et al., 2012; Chen et al., 2015].


Segmental automatic speech recognition

Segmental conditional random field ASR [Maas et al., 2012]:

[Figure: segmental CRF word lattice with segment-level features, e.g. hypotheses "Andrew" (f1 = 0) and "ran" (f1 = 1).]

Whole-word lattice rescoring [Bengio and Heigold, 2014]:


Segmental query-by-example search

From [Levin et al., 2015]:

[Fig. 1. Diagram of the S-RAILS audio search system: audio segments and the query audio are embedded using LapEig; the segment embeddings are stored in an NN index, which is searched with the query embedding to produce the query result(s).]

[Chen et al., 2015]: Similar scheme for “Okay Google” using LSTMs. In this work, we also use a query-related task for evaluation.


Acoustic word embedding problem

Map each variable-duration speech segment Yi to an embedding xi = f(Yi) ∈ R^d in a fixed d-dimensional space, such that the embeddings f(Y1) and f(Y2) of segments of the same word type lie close together.


Reference vector method [Levin et al., 2013]

To embed a segment y_t1:t2, compute its distances Dist1, Dist2, . . . , Distm to each segment in a reference set Yref, giving a distance vector in R^m. Dimensionality reduction with a projection P ∈ R^(m×d) then yields the embedding xi = f(y_t1:t2) ∈ R^d in a fixed d-dimensional space.
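A rough sketch of this pipeline, not the authors' implementation: a minimal DTW gives the distances to the reference set, and an illustrative random projection stands in for the learned dimensionality reduction of Levin et al.

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal dynamic time warping distance between two sequences of
    feature frames a (T1 x d) and b (T2 x d), with Euclidean frame cost
    and length normalization."""
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2] / (T1 + T2)

def reference_vector_embedding(segment, ref_set, P):
    """Embed a variable-length segment as its vector of distances to the
    m reference segments, followed by a linear projection P (m x d).
    P is a stand-in for the learned dimensionality reduction."""
    dists = np.array([dtw_distance(segment, ref) for ref in ref_set])  # in R^m
    return dists @ P                                                   # in R^d

rng = np.random.default_rng(0)
ref_set = [rng.standard_normal((rng.integers(20, 40), 13)) for _ in range(10)]
P = rng.standard_normal((10, 4))  # m = 10 references -> d = 4 embedding
x = reference_vector_embedding(rng.standard_normal((30, 13)), ref_set, P)
print(x.shape)  # (4,)
```

Any two segments, whatever their durations, now map to vectors of the same fixed dimensionality.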

Word classification CNN [Bengio and Heigold, 2014]

An input segment Yi is passed through convolutional and max-pooling layers (×nconv), fully connected layers (×nfull), and a final softmax over word types, trained against the one-hot word label wi = [0 0 0 · · · 1 · · · 0 0]. The embedding xi = f(Yi) is taken from a layer below the softmax.

Supervision and side information

◮ The word classifier CNN assumes a corpus of labelled word segments.

◮ In some cases these might not be available.

◮ A weaker form of supervision we sometimes have (e.g. [Thiollière et al., 2015]) is known word pairs: Strain = {(m, n) : (Ym, Yn) are of the same type}

◮ This also aligns with the query / word discrimination task: do two speech segments contain instances of the same word? (We don't care about word identity.)

Can we use this weak supervision (sometimes called side information) to train an acoustic word embedding function f?
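The structure of Strain can be sketched as follows. In practice the pairs might come from an unsupervised term discovery system; here, purely for illustration, they are derived from a hypothetical list of type labels, and only pair membership (not the identities) would be used for training.

```python
from collections import defaultdict
from itertools import combinations

def same_word_pairs(labels):
    """Build S_train = {(m, n) : segments m and n are of the same type}
    from a list of type labels. Only the pairing itself is the side
    information; the labels are never used directly."""
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)
    pairs = []
    for indices in by_type.values():
        pairs.extend(combinations(indices, 2))
    return pairs

labels = ["apple", "pie", "apple", "grape", "apple"]
print(same_word_pairs(labels))  # [(0, 2), (0, 4), (2, 4)]
```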

Word similarity Siamese CNN

Use idea of Siamese networks [Bromley et al., 1993].

Two networks with tied weights map the two segments of a word pair to embeddings x1 = f(Y1) and x2 = f(Y2); a distance-based loss l(x1, x2) is then computed between the embeddings.


Loss functions

The coscos2 loss [Synnaeve et al., 2014]:

l_coscos2(x1, x2) = (1 − cos(x1, x2)) / 2   if same
l_coscos2(x1, x2) = cos^2(x1, x2)           if different

Margin-based hinge loss [Mikolov, 2013]:

l_cos hinge = max {0, m + d_cos(x1, x2) − d_cos(x1, x3)}

where d_cos(x1, x2) = (1 − cos(x1, x2)) / 2 is the cosine distance between x1 and x2, and m is a margin parameter. Pair (x1, x2) are of the same word type, while (x1, x3) are of different types.
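Both losses can be written directly from these definitions. A minimal sketch; the margin value m = 0.15 is illustrative, not the value used in the paper.

```python
import numpy as np

def cos_sim(x1, x2):
    """Cosine similarity between two embedding vectors."""
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

def coscos2_loss(x1, x2, same):
    """The coscos2 loss: (1 - cos)/2 pulls same-word pairs together,
    while cos^2 pushes different-word pairs towards orthogonality."""
    return (1 - cos_sim(x1, x2)) / 2 if same else cos_sim(x1, x2) ** 2

def cos_hinge_loss(x1, x2, x3, m=0.15):
    """Margin-based hinge loss: (x1, x2) are the same word, (x1, x3) are
    different words; the loss is zero once the different pair is at least
    margin m further apart (in cosine distance) than the same pair."""
    d = lambda a, b: (1 - cos_sim(a, b)) / 2  # cosine distance in [0, 1]
    return max(0.0, m + d(x1, x2) - d(x1, x3))
```

For identical same-pair embeddings and opposite different-pair embeddings, both losses are zero, as intended.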


Embedding evaluation: the same-different task

Proposed in [Carlin et al., 2011] and also used in [Levin et al., 2013].

Each word token (e.g. "apple") is treated in turn as a query, with the remaining tokens ("pie", "grape", "apple", "apple", "like", . . . ) as the terms to search. For every pair, the cosine distance di between the embeddings is computed, and the pair is predicted as the same word if di < threshold, otherwise as different. Sweeping the threshold and comparing predictions to the true labels yields a precision-recall curve, summarized as average precision (AP).
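One way to compute this metric, as a sketch: given precomputed pair distances and same/different labels, rank the pairs by distance and take the average of the precisions at the ranks where true same-word pairs are retrieved (the usual average precision definition).

```python
import numpy as np

def average_precision(distances, same_labels):
    """Same-different evaluation: rank all segment pairs by distance
    (smallest first) and compute the average precision of retrieving
    same-word pairs, i.e. the area under the precision-recall curve
    traced out by sweeping the decision threshold."""
    order = np.argsort(distances)
    labels = np.asarray(same_labels)[order]
    tp = np.cumsum(labels)                          # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)  # precision at each rank
    # Average the precision over ranks where a true same pair appears.
    return float(np.sum(precision * labels) / np.sum(labels))

# Toy example: three pairs; the single same-word pair has the smallest
# distance, so every threshold that retrieves it has precision 1.
print(average_precision([0.1, 0.7, 0.4], [True, False, False]))  # 1.0
```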

Experimental setup

◮ Speech from Switchboard is used for evaluation.

◮ Training set: 10k word tokens; sampled 100k training word pairs.

◮ Test set for same-different evaluation: 11k word tokens, 60.7M pairs, 3% produced by the same speaker.

◮ A comparable development set was used.


Network architectures: Word classifier CNN

39-dimensional padded MFCCs, npad = 200
→ 1-D convolution: 96 ReLU filters over 9 frames
→ Max-pooling over 3 units
→ 1-D convolution: 96 ReLU filters over 8 units
→ Max-pooling over 3 units
→ Fully connected: 1024 ReLU
→ Linear bottleneck (optional)
→ Softmax: 1061 classes
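The actual implementation used Theano; the following numpy sketch with untrained random weights only checks that the layer shapes above compose, and reads the embedding from the 1024-unit hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(X, W):
    """1-D convolution over time: X is (T, d_in), W is (n_filters, width, d_in).
    Returns (T - width + 1, n_filters) after ReLU."""
    width = W.shape[1]
    windows = np.stack([X[t:t + width].ravel()
                        for t in range(len(X) - width + 1)])
    return np.maximum(0.0, windows @ W.reshape(W.shape[0], -1).T)

def maxpool(X, size):
    """Non-overlapping max-pooling over time, dropping any remainder frames."""
    T = (len(X) // size) * size
    return X[:T].reshape(-1, size, X.shape[1]).max(axis=1)

# Illustrative (untrained) weights with the layer shapes from the slide.
W1 = rng.standard_normal((96, 9, 39)) * 0.01   # 96 filters over 9 frames
W2 = rng.standard_normal((96, 8, 96)) * 0.01   # 96 filters over 8 units
X = rng.standard_normal((200, 39))             # npad = 200 padded MFCC frames

h = maxpool(conv1d_relu(X, W1), 3)
h = maxpool(conv1d_relu(h, W2), 3)
h = h.ravel()                                  # flatten before the dense layers
Wf = rng.standard_normal((h.size, 1024)) * 0.01
x_embed = np.maximum(0.0, h @ Wf)              # 1024-unit ReLU layer = embedding
logits = x_embed @ (rng.standard_normal((1024, 1061)) * 0.01)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the 1061 word classes
print(x_embed.shape, probs.shape)
```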


Network architectures: Siamese CNN

Two towers with tied weights, each:
39-dimensional padded MFCCs, npad = 200
→ 1-D convolution: 96 ReLU filters over 9 frames
→ Max-pooling over 3 units
→ 1-D convolution: 96 ReLU filters over 8 units
→ Max-pooling over 3 units
→ Fully connected: 2048 ReLU
→ 1024 Linear
The two outputs x1 and x2 feed the distance-based loss l(x1, x2).


Results

Representation                                      Dim    AP
DTW:
  MFCCs with CMVN                                    39   0.214
  Correspondence autoencoder [Kamper et al., 2015]  100   0.469
Acoustic word embeddings:
  Reference vector approach [Levin et al., 2013]     50   0.365
  Word classifier CNN                              1061   0.532 ± 0.014
  Word classifier CNN                                50   0.474 ± 0.012
  Siamese CNN, l_coscos2 loss                      1024   0.342 ± 0.026
  Siamese CNN, l_cos hinge loss                    1024   0.549 ± 0.011
  Siamese CNN, l_cos hinge loss                      50   0.504 ± 0.011
  LDA on l_cos hinge, d = 1024                      100   0.545 ± 0.011

Varying dimensionalities on development data

[Plot: average precision (AP) against dimensionality of the acoustic embedding (log scale, 10 to 3000), for the word classifier CNN, the Siamese CNN with l_coscos2, the Siamese CNN with l_cos hinge, and the Siamese CNN with LDA.]


Summary and conclusion

◮ Introduced the Siamese CNN for obtaining acoustic word embeddings, and evaluated different cost functions.

◮ Evaluated using a word discrimination task, and showed similar performance to the word classifier CNN.

◮ For smaller dimensionalities, the Siamese CNN outperformed the classifier CNN.

◮ Self-criticism: evaluated on a small dataset (low-resource setting).

◮ Future work: sequence models, using embeddings for search and ASR.


Code

Neural networks (Theano): https://github.com/kamperh/couscous
Complete recipe: https://github.com/kamperh/recipe_swbd_wordembeds