Acoustic modelling of English-accented and Afrikaans-accented South African English
H. Kamper, F.J. Muamba Mukanya and T.R. Niesler
Digital Signal Processing Laboratory, Department of Electrical and Electronic Engineering, Stellenbosch University


SLIDE 1

Acoustic modelling of English-accented and Afrikaans-accented South African English

  • H. Kamper, F.J. Muamba Mukanya and T.R. Niesler

Digital Signal Processing Laboratory Department of Electrical and Electronic Engineering Stellenbosch University

SLIDE 2

Introduction

South African accents of English

  • English is the language of government, commerce and science
  • Only 8.2% of the population use English as a first language
  • This results in a variety of accents (not regionally bound)
  • Multi-accent speech recognition is therefore particularly relevant in SA

SLIDE 3

Introduction

South African accents of English

  • English is the language of government, commerce and science
  • Only 8.2% of the population use English as a first language
  • This results in a variety of accents (not regionally bound)
  • Multi-accent speech recognition is therefore particularly relevant in SA

Aim of research

  • Determine whether data from different South African English accents can be combined to improve speech recognition performance in any one accent
  • Accents considered: Afrikaans-accented English (AE) and South African English (EE)

SLIDE 4

Speech Databases

African Speech Technology (AST) databases

◮ Afrikaans English (AE) database
◮ South African English (EE) database

  • Training set: approximately 6 hours of speech in both accents
  • Test set: approximately 24 minutes of speech from 20 speakers in each accent
  • Development set: used to optimise recognition parameters

SLIDE 5

Decision-Tree State Clustering

Acoustic modelling of context-dependent phones

Acoustic modelling of triphones, e.g. [j]−[i]+[k]

Problems:

◮ Not all triphones occur in the training data
◮ Not enough data for some triphones which do occur

Want to determine clusters of similar triphones which can then be used to obtain individual models
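The [j]−[i]+[k] notation denotes the base phone [i] with left context [j] and right context [k]. A minimal sketch of parsing such HTK-style triphone labels (the function name is illustrative, not from the paper):

```python
def parse_triphone(label):
    """Split an HTK-style triphone label such as 'j-i+k' into
    (left context, base phone, right context)."""
    left, rest = label.split("-")
    base, right = rest.split("+")
    return left, base, right
```

Triphones that share the same base phone, e.g. all labels parsing to a middle element of "i", are the ones pooled together at a single decision-tree root.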

SLIDE 6

Decision-Tree State Clustering

Acoustic modelling of context-dependent phones

Acoustic modelling of triphones, e.g. [j]−[i]+[k]

Problems:

◮ Not all triphones occur in the training data
◮ Not enough data for some triphones which do occur

Want to determine clusters of similar triphones which can then be used to obtain individual models

Solution

Use decision-tree state clustering

SLIDE 7

Decision-Tree State Clustering

[Tree diagram: all ∗−i+∗ triphone states pooled at the root node]

Begin by pooling all triphones with the same basephone

SLIDE 8

Decision-Tree State Clustering

[Tree diagram: root node ∗−i+∗ with candidate question "Left voiced?"]

  • Begin by pooling all triphones with the same basephone
  • Use linguistically-motivated questions to split clusters

SLIDE 9

Decision-Tree State Clustering

[Tree diagram: root ∗−i+∗ split by "Left voiced?" into yes/no child nodes]

  • Begin by pooling all triphones with the same basephone
  • Use linguistically-motivated questions to split clusters
  • Choose question yielding greatest likelihood improvement and split

SLIDE 10

Decision-Tree State Clustering

[Tree diagram: root ∗−i+∗ split by "Left voiced?", with a child node further split by "Right plosive?"]

  • Begin by pooling all triphones with the same basephone
  • Use linguistically-motivated questions to split clusters
  • Choose question yielding greatest likelihood improvement and split
  • Repeat until likelihood improvement too small

SLIDE 11

Decision-Tree State Clustering

[Tree diagram: root ∗−i+∗ split by "Left voiced?", with a child node further split by "Right plosive?"]

  • Begin by pooling all triphones with the same basephone
  • Use linguistically-motivated questions to split clusters
  • Choose question yielding greatest likelihood improvement and split
  • Repeat until likelihood improvement too small
  • Each tree leaf corresponds to a cluster of HMM states
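One step of the greedy splitting loop above can be sketched as follows. This toy version models each pool of states as a single one-dimensional Gaussian and picks the question giving the largest log-likelihood gain; the state statistics and dictionary layout are illustrative stand-ins, not the paper's actual implementation:

```python
import math

def pool_loglik(states):
    """Log likelihood of a pool of triphone states under one shared
    Gaussian, computed from per-state (count, mean, var) statistics."""
    n = sum(s["count"] for s in states)
    mean = sum(s["count"] * s["mean"] for s in states) / n
    var = sum(s["count"] * (s["var"] + (s["mean"] - mean) ** 2)
              for s in states) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(states, questions):
    """Pick the question whose yes/no split of the pool yields the
    greatest log-likelihood improvement (one step of tree growing)."""
    base = pool_loglik(states)
    best = None
    for name, q in questions:
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        if not yes or not no:
            continue  # question does not actually split this pool
        gain = pool_loglik(yes) + pool_loglik(no) - base
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best  # (gain, question, yes-pool, no-pool), or None
```

Recursing on the yes/no child pools until the best gain falls below a threshold yields the tree; its leaves are the clustered states.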

SLIDE 12

Multi-Accent Decision-Tree State Clustering

[Tree diagram: root ∗−i+∗ split by "Left voiced?" and by the accent question "Afrikaans accent?"]

  • Tag phones with accent before pooling at root nodes
  • Allow decision-tree questions regarding accent as well as phonetic context
  • Automatically determine if triphone states from different accents are similar
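In this scheme the candidate question set is simply extended with accent questions, so the same greedy splitting machinery decides whether accents share a state. A sketch, in which the single phonetic question and voiced-phone set stand in for the full linguistically-motivated question set:

```python
VOICED = {"b", "d", "g", "v", "z"}

def make_questions(multi_accent=True):
    """Candidate splitting questions. Phonetic-context questions are
    always present; an accent question is added only when growing
    multi-accent trees, so similar states from the two accents can
    stay merged while accent-distinct states split apart."""
    questions = [
        ("Left voiced?",
         lambda s: s["label"].split("-")[0] in VOICED),
    ]
    if multi_accent:
        questions.append(
            ("Afrikaans accent?", lambda s: s["accent"] == "AE"))
    return questions
```

With `multi_accent=False` the same function gives the question set used for accent-independent (pooled) clustering.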

SLIDE 13

Multi-Accent Acoustic Models

[Diagram: two 3-state left-to-right HMMs (states s1, s2, s3; transition probabilities a11, a12, a22, a23, a33): an EE HMM and an AE HMM for the triphone [j]−[i]+[k], with some states shared between the accents]

  • Allow sharing between accents
  • Single set of decision-trees is grown for both accents
  • Clustering process employs questions relating to both accent and phonetic context
  • States corresponding to the same basephone but different accents may be shared or kept separate

SLIDE 14

Accent-Specific Acoustic Models

[Diagram: two fully separate 3-state left-to-right HMMs for the triphone [j]−[i]+[k]: an EE HMM with transition probabilities a^e_11, a^e_12, a^e_22, a^e_23, a^e_33 and an AE HMM with transition probabilities a^a_11, a^a_12, a^a_22, a^a_23, a^a_33]

  • Allows no sharing between accents
  • Separate decision-trees are grown for each accent
  • Clustering process employs only questions relating to phonetic context
  • Completely separate set of acoustic models for each accent

SLIDE 15

Accent-Independent Acoustic Models

[Diagram: two identical 3-state left-to-right HMMs (states s1, s2, s3; transition probabilities a11, a12, a22, a23, a33): the EE HMM and AE HMM for the triphone [j]−[i]+[k] share all parameters]

  • Data are pooled across both accents
  • Single set of decision-trees is grown for both accents
  • Clustering process employs only questions relating to phonetic context
  • Single set of acoustic models for both accents

SLIDE 16

Experiments

Common setup of systems

  • Decision-tree likelihood threshold varied to produce models with different numbers of clustered states
  • Used 8-mixture cross-word triphone HMMs
  • Speech parameterisation: MFCCs, 1st and 2nd order derivatives, per-utterance CMN
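As a sketch of the last point, per-utterance CMN and regression-based delta (derivative) features can be computed as follows. A single cepstral coefficient per frame is used for brevity; real MFCC vectors apply this per dimension, and the window width is an illustrative default:

```python
def cmn(frames):
    """Per-utterance cepstral mean normalisation: subtract the
    utterance-level mean from every frame."""
    mean = sum(frames) / len(frames)
    return [f - mean for f in frames]

def deltas(frames, width=2):
    """Regression-based delta coefficients:
    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with edge frames replicated at the utterance boundaries."""
    norm = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k in range(1, width + 1):
            plus = frames[min(t + k, len(frames) - 1)]
            minus = frames[max(t - k, 0)]
            acc += k * (plus - minus)
        out.append(acc / norm)
    return out
```

Second-order derivatives are obtained by applying `deltas` to the first-order deltas.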

SLIDE 17

Results: Phone Recognition Performance

[Plot: phone recognition accuracy (%) against number of physical states (2000 to 10000), comparing accent-specific, accent-independent and multi-accent HMMs]

SLIDE 18

Analysis of Decision Trees

[Plot: absolute increase in overall log likelihood (x10^6) against depth within the decision tree (root = 0), for phonetically-based versus accent-based questions]

SLIDE 19

Analysis of Decision Trees

[Plot: percentage of clustered states combining data from both accents against number of clustered states in the multi-accent HMM set]

SLIDE 20

Results: Word Recognition Performance

[Plot: word recognition accuracy (%) against number of physical states (2000 to 10000), comparing accent-specific, accent-independent and multi-accent HMMs]

SLIDE 21

Conclusions and Future Work

Conclusions

  • Accent-specific modelling performs worst
  • Accent-independent and multi-accent acoustic modelling yield similar improvements (Afrikaans speaker proficiency)
  • Inclusion of accent-based questions (selective sharing) does not impair recognition performance, but does not yield a significant gain either
  • Supports the current practice of simply pooling English accents

SLIDE 22

Conclusions and Future Work

Conclusions

  • Accent-specific modelling performs worst
  • Accent-independent and multi-accent acoustic modelling yield similar improvements (Afrikaans speaker proficiency)
  • Inclusion of accent-based questions (selective sharing) does not impair recognition performance, but does not yield a significant gain either
  • Supports the current practice of simply pooling English accents

Future work

  • Less similar accents: Black English (BE) and South African English (EE)
  • Multi-accent acoustic modelling of all five SA English accents

SLIDE 23

Phone Recognition Performance: BE & EE

[Plot: phone recognition accuracy (%) against number of physical states (2000 to 10000) for BE and EE, comparing accent-specific, accent-independent and multi-accent HMMs]

SLIDE 24

Language Modelling: Phone Recognition of AE

[Plot: phone recognition accuracy (%) against number of physical states (1000 to 9000) for AE, comparing "Normal", "Pool" and "Cross" language-modelling configurations]