Unsupervised neural feature learning for speech using weak top-down constraints - PowerPoint PPT Presentation
Unsupervised neural feature learning for speech using weak top-down constraints
Maties Machine Learning (MML), Oct. 2017
Herman Kamper
Stellenbosch University
http://www.kamperh.com/
Success in speech recognition
[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]
- Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
- An addiction to labels: 2000 hours transcribed speech audio; ∼350M/560M words text
- But, there are around 7000 languages spoken in the world today
1 / 18
Why learn without labels?
- Get insight into human language acquisition [Räsänen and Rasilo, '15]
- Language acquisition in robots [Roy, '99]; [Renkens and Van hamme, '15]
- Analysis of audio for unwritten languages [Besacier et al., '14]
- New insights and models for speech processing [Jansen et al., '13]
3 / 18
Unsupervised term discovery (UTD)
[Park and Glass, TASLP’08] 4 / 18
Example: Query-by-example search
Spoken query:
A useful speech system that does not require any transcribed speech
[Jansen and Van Durme, IS’12; Saeb et al., IS’17; Settle et al., IS’17]
5 / 18
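Query-by-example search of this kind can be prototyped with dynamic time warping (DTW) over frame sequences: the corpus entries closest to the spoken query under the DTW distance are returned first. A minimal pure-Python sketch, assuming Euclidean frame distances and toy feature vectors (illustrative only; real systems search over segments within utterances and use far richer features):

```python
def frame_dist(x, y):
    """Euclidean distance between two feature frames (lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def dtw_cost(seq1, seq2):
    """DTW alignment cost between two frame sequences, normalized by path length."""
    n, m = len(seq1), len(seq2)
    INF = float("inf")
    cost = [[INF] * m for _ in range(n)]   # best accumulated cost to cell (i, j)
    steps = [[0] * m for _ in range(n)]    # length of the best path to (i, j)
    cost[0][0], steps[0][0] = frame_dist(seq1[0], seq2[0]), 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, best_steps = INF, 0
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, best_steps = cost[pi][pj], steps[pi][pj]
            cost[i][j] = best + frame_dist(seq1[i], seq2[j])
            steps[i][j] = best_steps + 1
    return cost[n - 1][m - 1] / steps[n - 1][m - 1]

def rank_by_query(query, corpus):
    """Rank corpus utterance indices by DTW distance to the spoken query."""
    return [idx for _, idx in
            sorted((dtw_cost(query, utt), idx) for idx, utt in enumerate(corpus))]
```

Since both the query and the corpus are raw audio features, nothing here needs a transcription, which is what makes the approach usable without labels.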
Unsupervised speech processing: Two problems
- 1. Unsupervised frame-level representation learning: [Diagram: speech frames mapped through a learned feature extractor f_a(·) into a "cool model"]
- 2. Unsupervised segmentation and clustering: How do we discover meaningful units in unlabelled speech?
6 / 18
Unsupervised frame-level representation learning:
The Correspondence Autoencoder
Micha Elsner, Daniel Renshaw, Aren Jansen, Sharon Goldwater
Supervised representation learning using DNNs
[Diagram: input speech frame(s), e.g. MFCCs or filterbanks; a feature extractor f_a(·) learned from data; a jointly learned phone classifier predicting phone states such as "ay", "ey", "k", "v"]
Unsupervised modelling: no phone class targets to train the network on
8 / 18
Autoencoder (AE) neural network
[Diagram: an input speech frame is encoded and the network is trained to reconstruct its input]
[Badino et al., ICASSP'14]
- Completely unsupervised
- But purely bottom-up
- Can we use top-down information?
- Idea: Unsupervised term discovery
9 / 18
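The AE idea can be sketched with a tiny numpy network (illustrative data and dimensions, not the architecture from the talk): the network is trained to reconstruct its own input frame, and the hidden activations then serve as purely bottom-up features.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(frames, hidden_dim=8, lr=0.1, epochs=200):
    """One-hidden-layer autoencoder: tanh encoder, linear decoder, MSE loss."""
    d = frames.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden_dim)); b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.1, (hidden_dim, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(frames @ W1 + b1)        # encoder: the feature extractor
        out = h @ W2 + b2                    # decoder reconstructs the input
        err = out - frames
        losses.append(float(np.mean(err ** 2)))
        # Plain gradient descent on the reconstruction error
        g_out = 2.0 * err / len(frames)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (frames.T @ g_h); b1 -= lr * g_h.sum(axis=0)

    def encode(x):
        return np.tanh(x @ W1 + b1)          # learned frame-level features
    return encode, losses

# Toy "speech frames": noisy points on a 1-D curve in 5-D space
t = rng.uniform(-1.0, 1.0, (64, 1))
frames = np.hstack([t, 2 * t, -t, 0.5 * t, t]) + rng.normal(0.0, 0.01, (64, 5))
encode, losses = train_autoencoder(frames)
```

The reconstruction loss falls as training proceeds, but the training signal comes only from the frames themselves, which is the "purely bottom-up" limitation the slide points out.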
Unsupervised term discovery (UTD)
Can we use these discovered word pairs to give weak top-down supervision?
10 / 18
Weak top-down supervision: Align frames
[Jansen et al., ICASSP’13] 11 / 18
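The frame-alignment step can be sketched in pure Python: DTW finds a minimum-cost warping path between the two discovered word segments, and each (i, j) step on the path pairs a frame of one word with a frame of the other. A sketch assuming Euclidean frame distances and toy features:

```python
def frame_dist(x, y):
    """Euclidean distance between two feature frames."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def dtw_align(seq_a, seq_b):
    """Return the minimum-cost DTW path as a list of (i, j) frame-index pairs."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]   # predecessor of each cell
    cost[0][0] = frame_dist(seq_a[0], seq_b[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, arg = INF, None
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, arg = cost[pi][pj], (pi, pj)
            cost[i][j] = best + frame_dist(seq_a[i], seq_b[j])
            back[i][j] = arg
    path, ij = [], (n - 1, m - 1)           # trace back from the final cell
    while ij is not None:
        path.append(ij)
        ij = back[ij[0]][ij[1]]
    return path[::-1]

# Aligned frame pairs become (input, target) training pairs downstream
word_a = [[0.0], [1.0], [2.0]]
word_b = [[0.0], [2.0]]
pairs = [(word_a[i], word_b[j]) for i, j in dtw_align(word_a, word_b)]
```

The resulting frame pairs are the weak top-down supervision: no transcriptions were used, only the discovered word pairs.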
Autoencoder (AE)
[Diagram: an input speech frame is reconstructed at the output]
12 / 18
Correspondence autoencoder (cAE)
[Diagram: the input is a frame from one word; the network, whose hidden layers give the unsupervised feature extractor f_a(·), reconstructs the aligned frame from the other word in the pair]
Combine top-down and bottom-up information
13 / 18
Correspondence autoencoder (cAE)
[Diagram with audio examples: a frame from one word is mapped through the unsupervised feature extractor to the aligned frame from the other word in the pair]
14 / 18
Correspondence autoencoder (cAE)
Training pipeline:
(1) Train a stacked autoencoder on the speech corpus (pretraining)
(2) Run unsupervised term discovery on the corpus to find word pairs
(3) Align the frames of each discovered word pair
(4) Initialize with the pretrained weights and train the correspondence autoencoder, giving the unsupervised feature extractor
[Kamper et al., ICASSP'15]
15 / 18
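The cAE objective can be sketched with a tiny numpy network (illustrative data and dimensions, not the talk's exact setup): unlike a plain autoencoder, the target is not the input frame but the DTW-aligned frame from the other word in the discovered pair.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_cae(inputs, targets, hidden_dim=8, lr=0.1, epochs=200):
    """One-hidden-layer net mapping a frame of one word to the aligned
    frame of its discovered pair (tanh hidden layer, MSE loss)."""
    d = inputs.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden_dim)); b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.1, (hidden_dim, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(inputs @ W1 + b1)
        out = h @ W2 + b2
        err = out - targets                  # reconstruct the *other* frame
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / len(inputs)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)
        W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (inputs.T @ g_h); b1 -= lr * g_h.sum(axis=0)

    def encode(x):
        return np.tanh(x @ W1 + b1)          # the unsupervised feature extractor
    return encode, losses

# Toy aligned pairs: two noisy realizations of the same underlying frames,
# standing in for DTW-aligned frames from two instances of one word
t = rng.uniform(-1.0, 1.0, (64, 1))
clean = np.hstack([t, -t, 2 * t, t, 0.5 * t])
x_a = clean + rng.normal(0.0, 0.05, clean.shape)
x_b = clean + rng.normal(0.0, 0.05, clean.shape)
encode, losses = train_cae(x_a, x_b)
```

Because the noise in the two frames is independent, the network is pushed to keep what the pair shares (here the underlying curve; in speech, word identity) and to discard nuisance variation.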
Evaluation: Query-by-example search
Spoken query:
[Jansen and Van Durme, IS'12; Saeb et al., IS'17; Settle et al., IS'17]
16 / 18
Evaluation: Isolated word query-by-example
[Bar chart: average precision (0.0 to 0.5) for Autoencoder, UBM-GMM, TopUBM and cAE features]
Extended: [Renshaw et al., IS'15] and [Yuan et al., IS'16]
17 / 18
Summary and conclusion
- Introduced the correspondence autoencoder (cAE) for unsupervised frame-level representation learning
- Uses top-down information from an unsupervised term discovery system
- Uses bottom-up initialization on a large speech corpus
- An unsupervised neural network model that combines top-down and bottom-up information results in large intrinsic improvements
- Links with language acquisition research
- Future: More analysis; different domains; practical search systems
18 / 18
http://www.kamperh.com/ https://github.com/kamperh
Evaluation of features: same-different task
- Collect a set of spoken word tokens, e.g. "apple", "pie", "grape", "apple", "apple", "like"
- Treat each token in turn as the query (e.g. "apple"), and the remaining tokens ("pie", "grape", "apple", "apple", "like") as the terms to search
- Compute the DTW distance d_i between the query and every search term, giving d_1, d_2, d_3, ..., d_N
- If d_i < threshold, predict "same word"; otherwise predict "different" - each prediction is then correct (✓) or incorrect (✗) given the true word identities
- Sweeping the threshold over all distances summarizes feature quality as an average precision
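The threshold sweep in the same-different task can be summarized without choosing any single threshold: rank all word pairs by increasing DTW distance and compute the average precision of retrieving the true same-word pairs. A minimal sketch (ties in distance are not handled specially; names are illustrative):

```python
def average_precision(distances, same_labels):
    """Average precision for retrieving same-word pairs when all word pairs
    are ranked by increasing DTW distance (1 = same word, 0 = different)."""
    ranked = sorted(zip(distances, same_labels))
    total_same = sum(same_labels)
    hits, ap = 0, 0.0
    for rank, (_, is_same) in enumerate(ranked, start=1):
        if is_same:
            hits += 1
            ap += hits / rank        # precision at each correct retrieval
    return ap / total_same

# Perfectly separated distances give an average precision of 1.0
ap_good = average_precision([0.1, 0.2, 0.9, 1.0], [1, 1, 0, 0])
```

Better frame-level features shrink the DTW distances between same-word pairs relative to different-word pairs, which is exactly what this score rewards.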
Maties Machine Learning (MML)
- Send “subscribe mml” in subject line to sympa@sympa.sun.ac.za
- Mailing list: mml@sympa.sun.ac.za
- Bring together machine learning researchers from across Stellenbosch University
- Format: Short (in)formal talks every second Friday over lunch
- Focus on machine learning research
- If you want to give a talk, or have any ideas, please let us know!
- kamperh@sun.ac.za, wbrink@sun.ac.za