 
Acoustic modelling of English-accented and Afrikaans-accented South African English
H. Kamper, F.J. Muamba Mukanya and T.R. Niesler
Digital Signal Processing Laboratory, Department of Electrical and Electronic Engineering, Stellenbosch University
Introduction
South African accents of English
English is the language of government, commerce and science
Only 8.2% of the population use English as a first language
This results in various accents (not regionally bound)
Multi-accent speech recognition is particularly relevant in SA
Aim of research
Determine whether data from different South African English accents can be combined to improve speech recognition performance in any one accent
◮ Afrikaans-accented English (AE)
◮ South African English (EE)
Speech Databases
African Speech Technology (AST) databases
◮ Afrikaans English (AE) database
◮ South African English (EE) database
Training set: Approximately 6 hours of speech in both accents
Test set: Approximately 24 minutes of speech from 20 speakers in each accent
Development set: Used to optimise recognition parameters
Decision-Tree State Clustering
Acoustic modelling of context-dependent phones
Acoustic modelling of triphones: [j] − [i] + [k]
Problems:
◮ Not all triphones occur in the training data
◮ Not enough data for some triphones which do occur
Want to determine clusters of similar triphones which can then be used to obtain individual models
Solution: use decision-tree state clustering
Decision-Tree State Clustering
Begin by pooling all triphones ∗ − i + ∗ with the same basephone
Use linguistically-motivated questions to split clusters
Choose the question yielding the greatest likelihood improvement and split
Repeat until the likelihood improvement is too small
Each tree leaf corresponds to a cluster of HMM states
[Diagram: example decision tree with the question "Left voiced?" at the root and "Right plosive?" at a child node, each with yes and no branches]
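The greedy procedure on this slide can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: each triphone state is summarised by one-dimensional sufficient statistics (count, sum, sum of squares), each cluster is scored with a single maximum-likelihood Gaussian, and the names `grow`, `best_split`, the toy statistics and the threshold value are all hypothetical.

```python
import math

def gaussian_loglik(stats):
    """Log likelihood of data summarised by (n, sum, sum_sq) under its own ML 1-D Gaussian."""
    n, s, ss = stats
    mean = s / n
    var = max(ss / n - mean * mean, 1e-6)  # variance floor (assumed value)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def pool(stats_list):
    """Pool sufficient statistics by summing them component-wise."""
    stats_list = list(stats_list)
    return (sum(st[0] for st in stats_list),
            sum(st[1] for st in stats_list),
            sum(st[2] for st in stats_list))

def context(triphone):
    """Split 'l-b+r' into (left, base, right)."""
    left, rest = triphone.split("-")
    base, right = rest.split("+")
    return left, base, right

def best_split(cluster, questions):
    """Return (gain, question, yes, no) for the question with the largest likelihood gain."""
    parent = gaussian_loglik(pool(cluster.values()))
    best = None
    for name, pred in questions.items():
        yes = {t: s for t, s in cluster.items() if pred(*context(t))}
        no = {t: s for t, s in cluster.items() if t not in yes}
        if not yes or not no:
            continue  # question does not actually divide this cluster
        gain = (gaussian_loglik(pool(yes.values()))
                + gaussian_loglik(pool(no.values())) - parent)
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best

def grow(cluster, questions, threshold):
    """Split recursively until the best gain falls below the threshold; return the leaves."""
    split = best_split(cluster, questions)
    if split is None or split[0] < threshold:
        return [sorted(cluster)]
    _, _, yes, no = split
    return grow(yes, questions, threshold) + grow(no, questions, threshold)

# Hypothetical phonetic questions and toy statistics for basephone i.
questions = {
    "Left voiced?": lambda left, base, right: left in {"b", "d", "g"},
    "Right plosive?": lambda left, base, right: right in {"p", "t", "k"},
}
states = {
    "b-i+t": (100, 100.0, 125.0),
    "d-i+s": (100, 100.0, 125.0),
    "p-i+t": (100, -100.0, 125.0),
    "s-i+k": (100, -100.0, 125.0),
}
leaves = grow(states, questions, threshold=10.0)
print(leaves)
```

In this toy data the left context determines the statistics, so only the "Left voiced?" split clears the threshold and two leaves remain. Raising or lowering the threshold changes the number of leaves, which is how the number of clustered states is controlled.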
Multi-Accent Decision-Tree State Clustering
Tag phones with accent before pooling at root nodes
Allow decision-tree questions regarding accent as well as phonetic context
Automatically determine if triphone states from different accents are similar
[Diagram: decision tree for the pool ∗ − i + ∗ with the phonetic question "Left voiced?" and the accent question "Afrikaans accent?", each with yes and no branches]
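One way to picture the accent tagging is shown below. This is a sketch under assumptions rather than the authors' code: the label format `AE:b-i+t`, the function names and the toy one-dimensional statistics are hypothetical, and clusters are scored with a single Gaussian. The point is only that the accent question competes with phonetic questions on the same likelihood-gain criterion, so it is selected where the accents differ and ignored where they do not.

```python
import math

def loglik(n, s, ss):
    """Log likelihood of (count, sum, sum_sq) under its own ML 1-D Gaussian."""
    mean = s / n
    var = max(ss / n - mean * mean, 1e-6)  # variance floor (assumed value)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def pooled(stats):
    """Sum sufficient statistics component-wise."""
    return tuple(sum(col) for col in zip(*stats))

def parse(label):
    """Split an accent-tagged label 'AE:l-b+r' into (accent, left, base, right)."""
    accent, triphone = label.split(":")
    left, rest = triphone.split("-")
    base, right = rest.split("+")
    return accent, left, base, right

# Phonetic and accent questions share one question set (both hypothetical).
QUESTIONS = {
    "Left voiced?": lambda accent, left, right: left in {"b", "d", "g"},
    "Afrikaans accent?": lambda accent, left, right: accent == "AE",
}

def question_gains(cluster):
    """Likelihood gain of every question that actually divides the cluster."""
    parent = loglik(*pooled(cluster.values()))
    gains = {}
    for name, pred in QUESTIONS.items():
        yes, no = [], []
        for label, stats in cluster.items():
            accent, left, _, right = parse(label)
            (yes if pred(accent, left, right) else no).append(stats)
        if yes and no:
            gains[name] = loglik(*pooled(yes)) + loglik(*pooled(no)) - parent
    return gains

# Toy pool where the two accents behave alike: context matters, accent does not.
similar = {
    "EE:b-i+t": (100, 100.0, 125.0), "AE:b-i+t": (100, 100.0, 125.0),
    "EE:p-i+t": (100, -100.0, 125.0), "AE:p-i+t": (100, -100.0, 125.0),
}
# Toy pool where the accents differ regardless of context.
different = {
    "EE:b-i+t": (100, 100.0, 125.0), "AE:b-i+t": (100, -100.0, 125.0),
    "EE:p-i+t": (100, 100.0, 125.0), "AE:p-i+t": (100, -100.0, 125.0),
}
gains_similar = question_gains(similar)
gains_different = question_gains(different)
print(max(gains_similar, key=gains_similar.get))
print(max(gains_different, key=gains_different.get))
```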
Multi-Accent Acoustic Models
Allow sharing between accents
Single set of decision-trees is grown for both accents
Clustering process employs questions relating to both accent and phonetic context
States corresponding to the same basephone but different accents may be shared or kept separate
[Diagram: EE HMM and AE HMM for triphone [j] − [i] + [k], each with three states s1, s2, s3 and transition probabilities a11, a12, a22, a23, a33]
Accent-Specific Acoustic Models
Allows no sharing between accents
Separate decision-trees are grown for each accent
Clustering process employs only questions relating to phonetic context
Completely separate set of acoustic models for each accent
[Diagram: EE HMM and AE HMM for triphone [j] − [i] + [k], each with three states s1, s2, s3; transition probabilities carry an accent superscript, e for EE and a for AE]
Accent-Independent Acoustic Models
Data are pooled across both accents
Single set of decision-trees is grown for both accents
Clustering process employs only questions relating to phonetic context
Single set of acoustic models for both accents
[Diagram: EE HMM and AE HMM for triphone [j] − [i] + [k], each with three states s1, s2, s3 and transition probabilities a11, a12, a22, a23, a33]
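The three modelling approaches differ mainly in how triphones are grouped at the root of each decision tree. The sketch below makes that difference concrete; the function `root_pools`, the strategy names as strings and the `AE:` / `EE:` label prefix are illustrative assumptions, not part of the original system.

```python
from collections import defaultdict

def root_pools(observations, strategy):
    """Group (accent, triphone) pairs into the pools that seed each decision tree."""
    pools = defaultdict(set)
    for accent, triphone in observations:
        base = triphone.split("-")[1].split("+")[0]
        if strategy == "accent-specific":
            # One tree per accent and basephone: no sharing is possible.
            pools[(accent, base)].add(triphone)
        elif strategy == "accent-independent":
            # One tree per basephone, accent identity discarded.
            pools[base].add(triphone)
        elif strategy == "multi-accent":
            # One tree per basephone, accent kept as a tag that questions can test.
            pools[base].add(f"{accent}:{triphone}")
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return {key: sorted(members) for key, members in pools.items()}

observations = [("EE", "j-i+k"), ("AE", "j-i+k"), ("AE", "b-i+t")]
for strategy in ("accent-specific", "accent-independent", "multi-accent"):
    print(strategy, root_pools(observations, strategy))
```

Only the multi-accent pools retain enough information for the clustering to decide, state by state, whether the accents should share a model.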
Experiments
Common setup of systems
Decision-tree likelihood threshold varied to produce models with different numbers of clustered states
Used 8-mixture cross-word triphone HMMs
Speech parameterisation: MFCCs, 1st and 2nd order derivatives, per-utterance CMN
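The parameterisation listed above can be illustrated with a small sketch. It assumes the MFCCs have already been computed and are given as a list of frames; the regression window of 2 frames and the edge handling (repeating the first and last frame) are assumptions, since the slides do not state them. The delta formula is the standard linear-regression form.

```python
def cmn(frames):
    """Per-utterance cepstral mean normalisation: subtract each coefficient's utterance mean."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]

def deltas(frames, window=2):
    """Regression-based derivative; frames beyond the edges repeat the first or last frame."""
    denom = 2 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            num = sum(
                k * (frames[min(t + k, last)][d] - frames[max(t - k, 0)][d])
                for k in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out

def parameterise(mfcc):
    """Stack normalised static features with 1st and 2nd order derivatives."""
    static = cmn(mfcc)
    first = deltas(static)
    second = deltas(first)
    return [s + a + b for s, a, b in zip(static, first, second)]

# Toy one-coefficient utterance: a linear ramp.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
features = parameterise(ramp)
print(len(features), len(features[0]))
```

Because the derivatives are differences, they are unaffected by the mean subtraction, so applying CMN before or after computing them gives the same derivative values.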
Results: Phone Recognition Performance
[Figure: phone recognition accuracy (%) against number of physical states (0 to 10000) for accent-specific, accent-independent and multi-accent HMMs]
Analysis of Decision Trees
[Figure: absolute increase in overall log likelihood (×10^6) against depth within decision tree (root = 0), shown separately for phonetically-based and accent-based questions]
Analysis of Decision Trees
[Figure: clustered states combining data (%) against number of clustered states in the multi-accent HMM set (0 to 10000)]
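The quantity on the vertical axis of this plot can be read as the share of tree leaves whose member states come from both accents. The sketch below computes it under that assumed definition; the function name, the `AE:` / `EE:` label prefix and the example leaves are hypothetical.

```python
def percent_combining(leaves):
    """Percentage of clustered states (leaves) that contain states from more than one accent."""
    if not leaves:
        return 0.0
    shared = sum(
        1 for leaf in leaves
        if len({label.split(":")[0] for label in leaf}) > 1
    )
    return 100.0 * shared / len(leaves)

# Toy leaves from a multi-accent tree: two shared, two accent-specific.
example_leaves = [
    ["AE:b-i+t", "EE:b-i+t"],
    ["AE:p-i+t"],
    ["EE:p-i+t"],
    ["EE:s-i+k", "AE:s-i+k"],
]
print(percent_combining(example_leaves))
```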
Results: Word Recognition Performance
[Figure: word recognition accuracy (%) against number of physical states (0 to 10000) for accent-specific, accent-independent and multi-accent HMMs]
Conclusions and Future Work
Conclusions
Accent-specific modelling performs worst
Accent-independent and multi-accent acoustic modelling yield similar improvements (Afrikaans speaker proficiency)
Inclusion of accent-based questions (selective sharing) does not impair recognition performance, but does not yield a significant gain either
Supports the current practice of simply pooling English accents
Future work
Less similar accents: Black English and South African English
Multi-accent acoustic modelling of all five SA English accents
Phone Recognition Performance: BE & EE
[Figure: phone recognition accuracy (%) against number of physical states (0 to 10000) for accent-specific, accent-independent and multi-accent HMMs]
Language Modelling: Phone Recognition of AE
[Figure: phone recognition accuracy (%) against number of physical states (0 to 9000) for the curves labelled Normal, Pool and Cross]