SLIDE 1 Acoustic modelling of English-accented and Afrikaans-accented South African English
- H. Kamper, F.J. Muamba Mukanya and T.R. Niesler
Digital Signal Processing Laboratory, Department of Electrical and Electronic Engineering, Stellenbosch University
SLIDE 2
Introduction
South African accents of English
English is the language of government, commerce and science
Only 8.2% of the population use English as a first language
This results in various accents (not regionally bound)
Multi-accent speech recognition is particularly relevant in SA
SLIDE 3
Introduction
Aim of research
Determine whether data from different South African English accents can be combined to improve speech recognition performance in any one accent
◮ Afrikaans-accented English (AE)
◮ South African English (EE)
SLIDE 4 Speech Databases
African Speech Technology (AST) databases
◮ Afrikaans English (AE) database
◮ South African English (EE) database
Training set: approximately 6 hours of speech in both accents
Test set: approximately 24 minutes of speech from 20 speakers in each accent
Development set: used to optimise recognition parameters
SLIDE 5 Decision-Tree State Clustering
Acoustic modelling of context-dependent phones
Acoustic modelling of triphones, e.g. [j]−[i]+[k]
Problems:
◮ Not all triphones occur in the training data
◮ Not enough data for some triphones which do occur
Want to determine clusters of similar triphones which can then be used to obtain individual models
SLIDE 6 Decision-Tree State Clustering
Solution
Use decision-tree state clustering
SLIDE 7
Decision-Tree State Clustering
[Diagram: root node pooling all ∗−i+∗ triphone states]
Begin by pooling all triphones with the same basephone
SLIDE 8
Decision-Tree State Clustering
[Diagram: root ∗−i+∗ node with candidate question "Left voiced?"]
Use linguistically-motivated questions to split clusters
SLIDE 9
Decision-Tree State Clustering
[Diagram: ∗−i+∗ node split into yes/no children by "Left voiced?"]
Choose the question yielding the greatest likelihood improvement and split
SLIDE 10
Decision-Tree State Clustering
[Diagram: tree grown further, splitting a child node on "Right plosive?"]
Repeat until the likelihood improvement becomes too small
SLIDE 11
Decision-Tree State Clustering
[Diagram: completed tree for ∗−i+∗ with questions "Left voiced?" and "Right plosive?"]
Each tree leaf corresponds to a cluster of HMM states
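The greedy splitting loop described on these slides can be sketched in code. This is a minimal, hypothetical illustration: each state carries sufficient statistics (frame count, sum, sum of squares) for a single feature dimension, whereas a real system such as HTK uses multi-dimensional Gaussian statistics from Baum-Welch occupancies; all function and variable names here are my own.

```python
import math

# A "state" is (frame_count, sum_x, sum_x2, context), where context maps
# question names to True/False, e.g. {"left_voiced": True}.
# (Hypothetical 1-D sufficient statistics for illustration only.)

def pooled_log_likelihood(states):
    """Log likelihood of all frames under a single ML-fitted pooled Gaussian."""
    n = sum(s[0] for s in states)
    if n == 0:
        return 0.0
    mean = sum(s[1] for s in states) / n
    var = max(sum(s[2] for s in states) / n - mean ** 2, 1e-8)
    # Standard result for a Gaussian evaluated at its own ML parameters
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def grow_tree(states, questions, threshold):
    """Greedily split a pooled cluster; return the list of leaf clusters."""
    base = pooled_log_likelihood(states)
    best_gain, best_split = 0.0, None
    for q in questions:
        yes = [s for s in states if s[3].get(q)]
        no = [s for s in states if not s[3].get(q)]
        if not yes or not no:
            continue  # question does not split this cluster
        gain = pooled_log_likelihood(yes) + pooled_log_likelihood(no) - base
        if gain > best_gain:
            best_gain, best_split = gain, (yes, no)
    if best_split is None or best_gain < threshold:
        return [states]  # improvement too small: this node is a leaf
    yes, no = best_split
    return grow_tree(yes, questions, threshold) + grow_tree(no, questions, threshold)
```

Each leaf returned by grow_tree would then be trained as one shared (tied) HMM state.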
SLIDE 12
Multi-Accent Decision-Tree State Clustering
[Diagram: decision tree for ∗−i+∗ states using both a phonetic question ("Left voiced?") and an accent question ("Afrikaans accent?")]
Tag phones with accent before pooling at root nodes
Allow decision-tree questions regarding accent as well as phonetic context
Automatically determine whether triphone states from different accents are similar
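Tagging states with their accent and letting the tree ask accent questions can be illustrated with a self-contained toy example (hypothetical one-dimensional statistics; names are my own): at each node the clustering simply picks whichever question, phonetic or accent-based, yields the larger likelihood gain.

```python
import math

def gauss_ll(n, sx, sx2):
    """Log likelihood of n frames under their ML single Gaussian."""
    if n == 0:
        return 0.0
    mean, var = sx / n, max(sx2 / n - (sx / n) ** 2, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# Hypothetical pooled [i] states, tagged with accent before pooling.
# Each entry: (accent, left_voiced, frame_count, sum_x, sum_x2)
states = [
    ("AE", True,  100, 120.0, 180.0),
    ("AE", False, 100, 115.0, 170.0),
    ("EE", True,  100, 300.0, 940.0),
    ("EE", False, 100, 310.0, 1000.0),
]

def split_gain(pred):
    """Likelihood gain from splitting the pool by a yes/no predicate."""
    yes = [s for s in states if pred(s)]
    no = [s for s in states if not pred(s)]
    total = lambda group: (sum(s[2] for s in group),
                           sum(s[3] for s in group),
                           sum(s[4] for s in group))
    return (gauss_ll(*total(yes)) + gauss_ll(*total(no))
            - gauss_ll(*total(states)))

# Accent questions compete directly with phonetic questions at the node
gains = {
    "Afrikaans accent?": split_gain(lambda s: s[0] == "AE"),
    "Left voiced?":      split_gain(lambda s: s[1]),
}
best = max(gains, key=gains.get)
```

In this toy data the two accents differ strongly, so the accent question wins; where the accents are acoustically similar, the phonetic questions would win and the states would be shared across accents.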
SLIDE 13 Multi-Accent Acoustic Models
[Diagram: two 3-state left-to-right HMMs with shared transition probabilities — an EE HMM and an AE HMM for triphone [j]−[i]+[k]]
Allows sharing between accents
A single set of decision-trees is grown for both accents
Clustering process employs questions relating to both accent and phonetic context
States corresponding to the same basephone but different accents may be shared or kept separate
SLIDE 14 Accent-Specific Acoustic Models
[Diagram: two separate 3-state left-to-right HMMs with accent-specific transition probabilities — an EE HMM and an AE HMM for triphone [j]−[i]+[k]]
Allows no sharing between accents
Separate decision-trees are grown for each accent
Clustering process employs only questions relating to phonetic context
Completely separate set of acoustic models for each accent
SLIDE 15 Accent-Independent Acoustic Models
[Diagram: identical 3-state left-to-right HMMs — the EE and AE HMMs for triphone [j]−[i]+[k] are the same model]
Data are pooled across both accents
A single set of decision-trees is grown for both accents
Clustering process employs only questions relating to phonetic context
Single set of acoustic models for both accents
SLIDE 16
Experiments
Common setup of systems
Decision-tree likelihood threshold varied to produce models with different numbers of clustered states
8-mixture cross-word triphone HMMs used
Speech parameterisation: MFCCs with 1st and 2nd order derivatives, per-utterance CMN
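The parameterisation step can be sketched as follows — a minimal numpy illustration of standard regression-formula delta coefficients and per-utterance cepstral mean normalisation. The function name and window size are assumptions, and the static MFCCs are taken as given rather than computed from audio (a real front end would use e.g. HTK).

```python
import numpy as np

def add_deltas_and_cmn(mfcc, delta_window=2):
    """Append 1st/2nd order derivatives and apply per-utterance CMN.

    mfcc: (num_frames, num_coeffs) array of static MFCCs.
    """
    # Regression-based delta: weighted slope over a +/- delta_window context
    k = np.arange(-delta_window, delta_window + 1)
    denom = np.sum(k ** 2)
    padded = np.pad(mfcc, ((delta_window, delta_window), (0, 0)), mode="edge")
    delta = np.stack([
        np.sum(k[:, None] * padded[t:t + len(k)], axis=0) / denom
        for t in range(mfcc.shape[0])
    ])
    # Second derivatives: the same regression applied to the deltas
    padded_d = np.pad(delta, ((delta_window, delta_window), (0, 0)), mode="edge")
    delta2 = np.stack([
        np.sum(k[:, None] * padded_d[t:t + len(k)], axis=0) / denom
        for t in range(delta.shape[0])
    ])
    feats = np.hstack([mfcc, delta, delta2])
    # Per-utterance CMN: subtract this utterance's mean from every frame
    return feats - feats.mean(axis=0, keepdims=True)
```

Subtracting the per-utterance mean removes stationary channel effects, which matters here because the AST databases contain telephone speech from varied handsets.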
SLIDE 17
Results: Phone Recognition Performance
[Plot: phone recognition accuracy (%) vs number of physical states (2000–10000) for accent-specific, accent-independent and multi-accent HMMs]
SLIDE 18
Analysis of Decision Trees
[Plot: absolute increase in overall log likelihood (×10^6) vs depth within decision tree (root = 0), for phonetically-based and accent-based questions]
SLIDE 19
Analysis of Decision Trees
[Plot: percentage of clustered states combining data from both accents vs number of clustered states in the multi-accent HMM set]
SLIDE 20
Results: Word Recognition Performance
[Plot: word recognition accuracy (%) vs number of physical states (2000–10000) for accent-specific, accent-independent and multi-accent HMMs]
SLIDE 21
Conclusions and Future Work
Conclusions
Accent-specific modelling performs worst
Accent-independent and multi-accent acoustic modelling yield similar improvements (Afrikaans speaker proficiency)
Inclusion of accent-based questions (selective sharing) does not impair recognition performance, but does not yield a significant gain either
Supports the current practice of simply pooling English accents
SLIDE 22
Conclusions and Future Work
Future work
Less similar accents: Black English (BE) and South African English (EE)
Multi-accent acoustic modelling of all five SA English accents
SLIDE 23
Phone Recognition Performance: BE & EE
[Plot: phone recognition accuracy (%) vs number of physical states for BE & EE — accent-specific, accent-independent and multi-accent HMMs]
SLIDE 24
Language Modelling: Phone Recognition of AE
[Plot: AE phone recognition accuracy (%) vs number of physical states for Normal, Pool and Cross language-modelling configurations]