SLIDE 1 Acoustic modelling of English-accented and Afrikaans-accented South African English
- H. Kamper, F.J. Muamba Mukanya and T.R. Niesler
Digital Signal Processing Laboratory, Department of Electrical and Electronic Engineering, Stellenbosch University
SLIDE 2
Introduction
South African accents of English
English is the language of government, commerce and science
Only 8.2% of the population use English as a first language
This results in various accents (not regionally bound)
Multi-accent speech recognition is particularly relevant in SA
SLIDE 3
Introduction
Aim of research
Determine whether data from different South African English accents can be combined to improve speech recognition performance in any one accent
◮ Afrikaans-accented English (AE)
◮ South African English (EE)
SLIDE 4 Speech Databases
African Speech Technology (AST) databases
◮ Afrikaans English (AE) database
◮ South African English (EE) database
Training set: approximately 6 hours of speech in both accents
Test set: approximately 24 minutes of speech from 20 speakers in each accent
Development set: used to optimise recognition parameters
SLIDE 5 Decision-Tree State Clustering
Acoustic modelling of context-dependent phones
Acoustic modelling of triphones, e.g. [j]−[i]+[k]
Problems:
◮ Not all triphones occur in the training data
◮ Not enough data for some triphones which do occur
Want to determine clusters of similar triphones which can then be used to obtain individual models
SLIDE 6 Decision-Tree State Clustering
Solution
Use decision-tree state clustering
SLIDE 7
Decision-Tree State Clustering
[Diagram: root node pooling all ∗−i+∗ triphone states]
Begin by pooling all triphones with the same basephone
SLIDE 8
Decision-Tree State Clustering
[Diagram: root ∗−i+∗ node with candidate question "Left voiced?"]
Use linguistically-motivated questions to split clusters
SLIDE 9
Decision-Tree State Clustering
[Diagram: ∗−i+∗ node split into yes/no children by "Left voiced?"]
Choose the question yielding the greatest likelihood improvement and split
SLIDE 10
Decision-Tree State Clustering
[Diagram: tree grown further, splitting a child node on "Right plosive?"]
Repeat until the likelihood improvement becomes too small
SLIDE 11
Decision-Tree State Clustering
[Diagram: completed tree for ∗−i+∗ with questions "Left voiced?" and "Right plosive?"]
Each tree leaf corresponds to a cluster of HMM states
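The greedy splitting loop described on these slides can be sketched in code. This is a minimal, hypothetical illustration: each state carries sufficient statistics (frame count, sum, sum of squares) for a single feature dimension, whereas a real system such as HTK uses multi-dimensional Gaussian statistics from Baum-Welch occupancies; all function and variable names here are my own.

```python
import math

# A "state" is (frame_count, sum_x, sum_x2, context), where context maps
# question names to True/False, e.g. {"left_voiced": True}.
# (Hypothetical 1-D sufficient statistics for illustration only.)

def pooled_log_likelihood(states):
    """Log likelihood of all frames under a single ML-fitted pooled Gaussian."""
    n = sum(s[0] for s in states)
    if n == 0:
        return 0.0
    mean = sum(s[1] for s in states) / n
    var = max(sum(s[2] for s in states) / n - mean ** 2, 1e-8)
    # Standard result for a Gaussian evaluated at its own ML parameters
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def grow_tree(states, questions, threshold):
    """Greedily split a pooled cluster; return the list of leaf clusters."""
    base = pooled_log_likelihood(states)
    best_gain, best_split = 0.0, None
    for q in questions:
        yes = [s for s in states if s[3].get(q)]
        no = [s for s in states if not s[3].get(q)]
        if not yes or not no:
            continue  # question does not split this cluster
        gain = pooled_log_likelihood(yes) + pooled_log_likelihood(no) - base
        if gain > best_gain:
            best_gain, best_split = gain, (yes, no)
    if best_split is None or best_gain < threshold:
        return [states]  # improvement too small: this node is a leaf
    yes, no = best_split
    return grow_tree(yes, questions, threshold) + grow_tree(no, questions, threshold)
```

Each leaf returned by grow_tree would then be trained as one shared (tied) HMM state.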
SLIDE 12
Multi-Accent Decision-Tree State Clustering
[Diagram: decision tree for ∗−i+∗ states using both a phonetic question ("Left voiced?") and an accent question ("Afrikaans accent?")]
Tag phones with accent before pooling at root nodes
Allow decision-tree questions regarding accent as well as phonetic context
Automatically determine whether triphone states from different accents are similar
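Tagging states with their accent and letting the tree ask accent questions can be illustrated with a self-contained toy example (hypothetical one-dimensional statistics; names are my own): at each node the clustering simply picks whichever question, phonetic or accent-based, yields the larger likelihood gain.

```python
import math

def gauss_ll(n, sx, sx2):
    """Log likelihood of n frames under their ML single Gaussian."""
    if n == 0:
        return 0.0
    mean, var = sx / n, max(sx2 / n - (sx / n) ** 2, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# Hypothetical pooled [i] states, tagged with accent before pooling.
# Each entry: (accent, left_voiced, frame_count, sum_x, sum_x2)
states = [
    ("AE", True,  100, 120.0, 180.0),
    ("AE", False, 100, 115.0, 170.0),
    ("EE", True,  100, 300.0, 940.0),
    ("EE", False, 100, 310.0, 1000.0),
]

def split_gain(pred):
    """Likelihood gain from splitting the pool by a yes/no predicate."""
    yes = [s for s in states if pred(s)]
    no = [s for s in states if not pred(s)]
    total = lambda group: (sum(s[2] for s in group),
                           sum(s[3] for s in group),
                           sum(s[4] for s in group))
    return (gauss_ll(*total(yes)) + gauss_ll(*total(no))
            - gauss_ll(*total(states)))

# Accent questions compete directly with phonetic questions at the node
gains = {
    "Afrikaans accent?": split_gain(lambda s: s[0] == "AE"),
    "Left voiced?":      split_gain(lambda s: s[1]),
}
best = max(gains, key=gains.get)
```

In this toy data the two accents differ strongly, so the accent question wins; where the accents are acoustically similar, the phonetic questions would win and the states would be shared across accents.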
SLIDE 13 Multi-Accent Acoustic Models
[Diagram: two 3-state left-to-right HMMs with shared transition probabilities — an EE HMM and an AE HMM for triphone [j]−[i]+[k]]
Allows sharing between accents
A single set of decision-trees is grown for both accents
Clustering process employs questions relating to both accent and phonetic context
States corresponding to the same basephone but different accents may be shared or kept separate
SLIDE 14 Accent-Specific Acoustic Models
[Diagram: two separate 3-state left-to-right HMMs with accent-specific transition probabilities — an EE HMM and an AE HMM for triphone [j]−[i]+[k]]
Allows no sharing between accents
Separate decision-trees are grown for each accent
Clustering process employs only questions relating to phonetic context
Completely separate set of acoustic models for each accent
SLIDE 15 Accent-Independent Acoustic Models
[Diagram: identical 3-state left-to-right HMMs — the EE and AE HMMs for triphone [j]−[i]+[k] are the same model]
Data are pooled across both accents
A single set of decision-trees is grown for both accents
Clustering process employs only questions relating to phonetic context
Single set of acoustic models for both accents
SLIDE 16
Experiments
Common setup of systems
Decision-tree likelihood threshold varied to produce models with different numbers of clustered states
8-mixture cross-word triphone HMMs used
Speech parameterisation: MFCCs with 1st and 2nd order derivatives, per-utterance CMN
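The parameterisation step can be sketched as follows — a minimal numpy illustration of standard regression-formula delta coefficients and per-utterance cepstral mean normalisation. The function name and window size are assumptions, and the static MFCCs are taken as given rather than computed from audio (a real front end would use e.g. HTK).

```python
import numpy as np

def add_deltas_and_cmn(mfcc, delta_window=2):
    """Append 1st/2nd order derivatives and apply per-utterance CMN.

    mfcc: (num_frames, num_coeffs) array of static MFCCs.
    """
    # Regression-based delta: weighted slope over a +/- delta_window context
    k = np.arange(-delta_window, delta_window + 1)
    denom = np.sum(k ** 2)
    padded = np.pad(mfcc, ((delta_window, delta_window), (0, 0)), mode="edge")
    delta = np.stack([
        np.sum(k[:, None] * padded[t:t + len(k)], axis=0) / denom
        for t in range(mfcc.shape[0])
    ])
    # Second derivatives: the same regression applied to the deltas
    padded_d = np.pad(delta, ((delta_window, delta_window), (0, 0)), mode="edge")
    delta2 = np.stack([
        np.sum(k[:, None] * padded_d[t:t + len(k)], axis=0) / denom
        for t in range(delta.shape[0])
    ])
    feats = np.hstack([mfcc, delta, delta2])
    # Per-utterance CMN: subtract this utterance's mean from every frame
    return feats - feats.mean(axis=0, keepdims=True)
```

Subtracting the per-utterance mean removes stationary channel effects, which matters here because the AST databases contain telephone speech from varied handsets.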
SLIDE 17
Results: Phone Recognition Performance
[Plot: phone recognition accuracy (%) vs number of physical states (2000–10000) for accent-specific, accent-independent and multi-accent HMMs]
SLIDE 18
Analysis of Decision Trees
[Plot: absolute increase in overall log likelihood (×10^6) vs depth within decision tree (root = 0), for phonetically-based and accent-based questions]
SLIDE 19
Analysis of Decision Trees
[Plot: percentage of clustered states combining data from both accents vs number of clustered states in the multi-accent HMM set]
SLIDE 20
Results: Word Recognition Performance
[Plot: word recognition accuracy (%) vs number of physical states (2000–10000) for accent-specific, accent-independent and multi-accent HMMs]
SLIDE 21
Conclusions and Future Work
Conclusions
Accent-specific modelling performs worst
Accent-independent and multi-accent acoustic modelling yield similar improvements (Afrikaans speaker proficiency)
Inclusion of accent-based questions (selective sharing) does not impair recognition performance, but does not yield a significant gain either
Supports the current practice of simply pooling English accents
SLIDE 22
Conclusions and Future Work
Future work
Less similar accents: Black English (BE) and South African English (EE)
Multi-accent acoustic modelling of all five SA English accents
SLIDE 23
Phone Recognition Performance: BE & EE
[Plot: phone recognition accuracy (%) vs number of physical states for BE & EE — accent-specific, accent-independent and multi-accent HMMs]
SLIDE 24
Language Modelling: Phone Recognition of AE
[Plot: AE phone recognition accuracy (%) vs number of physical states for Normal, Pool and Cross language-modelling configurations]