Deep convolutional acoustic word embeddings using word-pair side information
Herman Kamper¹, Weiran Wang², Karen Livescu²
¹CSTR and ILCC, School of Informatics, University of Edinburgh, UK   ²Toyota Technological Institute at Chicago, USA
◮ Most speech processing systems rely on deep architectures to classify speech
◮ Requires a pronunciation dictionary for breaking words into subwords; in many low-resource settings such a dictionary is not available
◮ Some studies have started to reconsider whole words as the basic modelling unit
[Figure: query-by-example search pipeline — search audio segments are mapped by the network (LapEig / NN) to segment embeddings, which are stored in an index; query audio is mapped to a query embedding, which is searched against the index to produce the query result(s).]
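Once every archive segment has a fixed-dimensional embedding, query-by-example search reduces to nearest-neighbour lookup. A minimal sketch (the `search` helper and the toy 2-D embeddings are illustrative, not from the talk):

```python
import math

def cosine_sim(u, v):
    # cosine similarity between two equal-length embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search(query_emb, index, top_k=1):
    # index: list of (segment_id, embedding) with pre-computed embeddings;
    # rank all indexed segments by similarity to the query embedding
    ranked = sorted(index, key=lambda item: cosine_sim(query_emb, item[1]),
                    reverse=True)
    return [seg_id for seg_id, _ in ranked[:top_k]]

# toy index of three archive segments in a 2-D embedding space
index = [("seg1", [1.0, 0.0]), ("seg2", [0.0, 1.0]), ("seg3", [0.9, 0.1])]
print(search([1.0, 0.05], index, top_k=2))  # ['seg1', 'seg3']
```

A real index over many segments would use an approximate nearest-neighbour structure rather than an exhaustive sort, but the interface is the same.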
[Figure: variable-duration speech segments Y1 and Y2 are mapped to fixed-dimensional acoustic word embeddings f(Y1) and f(Y2).]
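The key property of f is that segments of different durations map to vectors of the same dimensionality. One simple way to handle the variable duration before a fixed-input CNN is to pad or crop every segment to a fixed number of frames — a sketch (the pad length of 100 frames is an assumption for illustration):

```python
def pad_to_fixed(frames, n_frames=100):
    # Zero-pad (or crop) a variable-length list of acoustic feature
    # vectors to a fixed number of frames, so a CNN with a fixed input
    # size can map every segment Y to one embedding f(Y).
    dim = len(frames[0])
    frames = frames[:n_frames]
    return frames + [[0.0] * dim] * (n_frames - len(frames))

short = [[1.0, 2.0]] * 60    # a 60-frame segment of 2-D features
long = [[1.0, 2.0]] * 150    # a 150-frame segment
print(len(pad_to_fixed(short)), len(pad_to_fixed(long)))  # 100 100
```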
◮ The word classifier CNN assumes a corpus of labelled word segments.
◮ In some cases these might not be available.
◮ A weaker form of supervision we sometimes have (e.g. [Thiollière et al.]): pairs of speech segments known to be of the same or different word type.
◮ Also aligns with the query / word discrimination task: do two speech segments contain the same word?
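With only same/different pair supervision, a Siamese network can be trained with a distance-based loss over embedding triplets: pull a same-word pair together, push a different-word pair apart. A minimal sketch of a cosine hinge loss (the margin value here is illustrative, not from the slides):

```python
import math

def cos_dist(u, v):
    # cosine distance scaled to [0, 1]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (1.0 - dot / norm) / 2.0

def cos_hinge_loss(anchor, same, diff, margin=0.5):
    # zero loss once the same-word pair is closer than the
    # different-word pair by at least the margin
    return max(0.0, margin + cos_dist(anchor, same) - cos_dist(anchor, diff))

# already well separated: loss is zero
print(cos_hinge_loss([1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]))  # 0.0
```

The loss is a function of embedding distances only, so it never needs word labels — only the knowledge that two segments match or do not.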
[Figure: same-different evaluation example — word segments “apple”, “pie”, “grape”, “apple”, “apple”, “like”. One “apple” segment is treated as the query; the remaining segments are treated as the terms to search. For each term, the cosine distance d_i between the query embedding and the term embedding is computed; if d_i < threshold, the pair is predicted “same”, otherwise “different”.]
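Rather than fixing one threshold, the same-different task is usually summarised by average precision over all pairs ranked by distance. A sketch of that scoring in pure Python (function names are illustrative):

```python
import math

def cos_dist(u, v):
    # cosine distance scaled to [0, 1]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return (1.0 - dot / norm) / 2.0

def average_precision(embeddings, words):
    # score every segment pair by embedding distance; a good embedding
    # ranks same-word pairs before different-word pairs
    scored = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            scored.append((cos_dist(embeddings[i], embeddings[j]),
                           words[i] == words[j]))
    scored.sort(key=lambda p: p[0])          # ascending distance
    hits, precisions = 0, []
    for rank, (_, is_same) in enumerate(scored, start=1):
        if is_same:
            hits += 1
            precisions.append(hits / rank)   # precision at each same-pair hit
    return sum(precisions) / len(precisions) if precisions else 0.0

embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
words = ["apple", "apple", "pie"]
print(average_precision(embs, words))  # 1.0: the same-word pair ranks first
```

Average precision is 1.0 exactly when every same-word pair is closer than every different-word pair, which makes it a convenient threshold-free figure of merit.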
◮ Speech from Switchboard is used for evaluation.
◮ Training set: 10k word tokens; sampled 100k training word pairs.
◮ Test set for same-different evaluation: 11k word tokens, 60.7M pairs, 3%
◮ Used a comparable development set.
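The 100k training pairs can be drawn from the 10k labelled tokens by pairing distinct segments that share a word label — a sketch (the function name and fixed seed are hypothetical, and a real sampler would also draw the different-word pairs):

```python
import random
from collections import defaultdict

def sample_same_word_pairs(tokens, n_pairs, seed=0):
    # tokens: list of (segment_id, word_label) for the labelled training set;
    # draw pairs of distinct segments that share a word label
    rng = random.Random(seed)
    by_word = defaultdict(list)
    for seg, word in tokens:
        by_word[word].append(seg)
    candidates = [w for w, segs in by_word.items() if len(segs) >= 2]
    if not candidates:
        return []
    pairs = []
    for _ in range(n_pairs):
        word = rng.choice(candidates)
        pairs.append(tuple(rng.sample(by_word[word], 2)))
    return pairs

tokens = [("a1", "apple"), ("a2", "apple"), ("a3", "apple"), ("p1", "pie")]
print(sample_same_word_pairs(tokens, 5))  # five pairs of distinct "apple" segments
```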
◮ Introduced the Siamese CNN for obtaining acoustic word embeddings, trained only with word-pair side information.
◮ Evaluated using a word discrimination task, and showed similar performance to the word classifier CNN.
◮ For smaller dimensionalities: Siamese CNN outperformed the classifier CNN.
◮ Self-criticism: evaluated on a small dataset (low-resource setting).
◮ Future work: sequence models, using embeddings for search and ASR.