SLIDE 1

Resource development and experiments in automatic South African broadcast news transcription

SLTU 2012, Cape Town, South Africa

Herman Kamper (1), Febe de Wet (1, 2), Thomas Hain (3), Thomas Niesler (1)

(1) Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa
(2) Human Language Technology Competency Area, CSIR Meraka Institute, Pretoria, South Africa
(3) Department of Computer Science, University of Sheffield, United Kingdom


SLIDE 4

Introduction

Broadcast news domain:
  • Provides a ready source of speech audio data
  • Covers a variety of speech styles and quality, from careful newsreader speech to noisy spontaneous speech
  • Useful as a component of subsequent speech technologies

South African (English) broadcast news:
  • Several prevalent English accents
  • South African English is an under-resourced variety of English

Motivation

Report on baseline results of a straightforward system:
  • Use resources collected at Stellenbosch University (2000 – present)
  • Aim is to use the baseline as a reference point for further comparative studies

  • H. Kamper (Stellenbosch University)

South African broadcast news (SABN) SLTU 2012, Cape Town, South Africa 2 / 14

SLIDE 5

Accents of English in South Africa

Five major accents of South African English are identified in the literature:
  • Afrikaans English (AE)
  • Black South African English (BE)
  • Cape Flats English (CE)
  • White South African English (EE)
  • Indian South African English (IE)

[Figure: pie chart covering the accents above and "Other"; segment values shown are 77.8%, 5.7%, 8.8%, 3.8%, 2.3% and 1.6%]

SLIDE 6

South African broadcast news data

20 hours of SAFM broadcasts from 1996 to 2006:
  • RD: newsreader speech, prepared; 27 speakers, 12.9 hours (BE, EE, IE)
  • SI: studio interview speech, fairly spontaneous; 61 speakers, 0.6 hours
  • NST: non-studio telephone speech, spontaneous; 262 speakers, 2.07 hours
  • NS: non-studio wideband speech, noisy; 208 speakers, 1.54 hours

Accent is annotated for each sentence-level segment.
The test set (∼2.7 hours) is similar in composition to the training set.
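
As a quick consistency check on the figures above, the per-condition hours can be summed and compared with the 20-hour total. The sketch below assumes (the slide does not say so explicitly) that the four per-condition figures describe the training portion and that the ∼2.7-hour test set is counted separately; the variable names are illustrative only.

```python
# Hours per channel condition, as listed on the slide.
hours = {"RD": 12.9, "SI": 0.6, "NST": 2.07, "NS": 1.54}

# Approximate test-set duration from the slide.
# Assumption: it is not included in the per-condition figures above.
test_hours = 2.7

listed_total = sum(hours.values())
combined_total = listed_total + test_hours

print(f"listed conditions: {listed_total:.2f} h")
print(f"plus test set:     {combined_total:.2f} h")
```

Under that assumption the combined figure comes to just under 20 hours, which is consistent with the corpus size quoted on the slide.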

SLIDE 10

System development

Speech recognition problem

\[ \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} p(X \mid W)\, P(W) \]

Models required

1. Language model for P(W): 109M word corpus of newspaper text
2. Pronunciation dictionary for p(X|W): 60k word pronunciation dictionary
3. Acoustic model for p(X|W): 20h SABN corpus (previous slide)
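
The second equality in the recognition equation follows from Bayes' rule. The intermediate step, which the slide omits, can be written out as below; the evidence term p(X) is a symbol introduced here for the derivation and does not appear on the slide.

```latex
\begin{align}
\hat{W} &= \arg\max_{W} P(W \mid X) \\
        &= \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)} \\
        &= \arg\max_{W} p(X \mid W)\, P(W)
\end{align}
```

The last step holds because p(X) does not depend on the word sequence W, so dividing by it cannot change which W attains the maximum.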

SLIDE 12

Language modelling

  • 109M word corpus from South African newspapers, collected 2000 – 2005: The Financial Mail, Business Day, The Sunday Times, The Times, Sunday World, The Sowetan, The Herald, The Algoa Sun and The Daily Dispatch
  • SRILM toolkit used to train trigram language models on the above text as well as on the transcriptions of the acoustic training set (185k words)
  • Also considered interpolation of the two language models

Language model                        Perplexity
Trained on 109M newspaper corpus      162.9
Trained on acoustic training set      328.9
Interpolation of the above two        139.9
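
To illustrate why an interpolated model can reach a lower perplexity than either component, here is a minimal sketch. It assumes simple linear interpolation with a fixed weight; the slides do not state the interpolation method or weight used, and the per-word probabilities below are made-up toy values, not results from the SABN system.

```python
import math

def interpolate(p_a, p_b, lam):
    """Linear interpolation of two probabilities for the same word in context."""
    return lam * p_a + (1.0 - lam) * p_b

def perplexity(word_probs):
    """Perplexity = exp of the negative mean log-probability per word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Hypothetical probabilities that two models assign to the same 4-word held-out text.
p_news = [0.10, 0.02, 0.05, 0.01]    # stand-in for the newspaper-text model
p_trans = [0.01, 0.08, 0.02, 0.06]   # stand-in for the transcription model
lam = 0.5                            # assumed weight, chosen only for illustration

p_mix = [interpolate(a, b, lam) for a, b in zip(p_news, p_trans)]

ppl_news = perplexity(p_news)
ppl_trans = perplexity(p_trans)
ppl_mix = perplexity(p_mix)
print(f"news: {ppl_news:.1f}  transcripts: {ppl_trans:.1f}  interpolated: {ppl_mix:.1f}")
```

In this toy case each model is confident on different words, so the mixture avoids the very low probabilities that dominate either model's perplexity. In practice the weight would be tuned on held-out data rather than fixed at 0.5.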

SLIDE 13

Pronunciation dictionary

  • Pronunciation dictionaries developed by a phonetic expert
  • Reflect typical EE pronunciation
  • Phone set: 45 ARPABET phones
  • Training pronunciation dictionary: 15k words
  • Recognition pronunciation dictionary: 60k words
  • Average number of pronunciations per word: 1.25
  • Out-of-vocabulary rate on test set: 1.02%
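
The two dictionary statistics quoted above (average pronunciations per word and out-of-vocabulary rate) can be computed as in the following sketch. The tiny lexicon and test sentence are hypothetical examples written in ARPABET-style symbols; they are not taken from the authors' dictionary.

```python
# Hypothetical lexicon: word -> list of pronunciations, each a list of phones.
lexicon = {
    "news": [["N", "UW", "Z"], ["N", "Y", "UW", "Z"]],
    "south": [["S", "AW", "TH"]],
    "africa": [["AE", "F", "R", "IH", "K", "AH"]],
    "the": [["DH", "AH"], ["DH", "IY"]],
}

def average_pronunciations(lex):
    """Mean number of pronunciation variants per dictionary word."""
    return sum(len(variants) for variants in lex.values()) / len(lex)

def oov_rate(tokens, lex):
    """Fraction of running words in the text that are missing from the dictionary."""
    missing = sum(1 for token in tokens if token.lower() not in lex)
    return missing / len(tokens)

# Hypothetical test text, split on whitespace.
test_tokens = "the news from south africa and the world".split()

print(average_pronunciations(lexicon))   # 1.5 for this toy lexicon
print(oov_rate(test_tokens, lexicon))    # 0.375: 'from', 'and', 'world' are missing
```

Note that the out-of-vocabulary rate here is counted over running words (tokens), not over distinct word types; the slide does not say which convention the reported 1.02% uses.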

SLIDE 19

Acoustic modelling

Used HTK to train cross-word triphone HMMs

[Diagram: initial triphone HMMs, followed by single-pass retraining, giving MFCC HMMs and MF-PLP HMMs, followed by four normalisation variants]

Word error rates shown in the diagram:
  • MFCC HMMs: 28.9%
  • MF-PLP HMMs: 27.7%
  • Per-segment CMN: 25.1%
  • Per-segment CMN, per-bulletin CVN: 24.6%
  • Per-bulletin CMN: 26.9%
  • Per-bulletin CMN, per-bulletin CVN: 26.4%
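
The normalisation variants compared above differ in the span over which statistics are gathered. The following is a minimal pure-Python sketch of one reading of "per-segment CMN, per-bulletin CVN": each segment is mean-centred on its own, then every dimension is scaled by a standard deviation pooled over the whole bulletin. The function names and the toy feature values are assumptions for illustration; the actual system used HTK, whose configuration is not shown on the slides.

```python
from statistics import fmean, pstdev

def cmn(frames):
    """Cepstral mean normalisation: subtract the per-dimension mean over these frames."""
    dims = range(len(frames[0]))
    means = [fmean([frame[d] for frame in frames]) for d in dims]
    return [[frame[d] - means[d] for d in dims] for frame in frames]

def pooled_std(frames):
    """Per-dimension standard deviation over all given frames."""
    dims = range(len(frames[0]))
    return [pstdev([frame[d] for frame in frames]) for d in dims]

def normalise_bulletin(segments):
    """Per-segment CMN followed by per-bulletin CVN.

    segments: list of segments; each segment is a list of feature vectors.
    Assumes every dimension varies within the bulletin (no zero variance).
    """
    centred = [cmn(segment) for segment in segments]
    all_frames = [frame for segment in centred for frame in segment]
    stds = pooled_std(all_frames)
    return [[[x / s for x, s in zip(frame, stds)] for frame in segment]
            for segment in centred]

# Toy bulletin: two segments, two frames each, two feature dimensions.
bulletin = [
    [[1.0, 10.0], [3.0, 14.0]],
    [[10.0, 0.0], [14.0, 8.0]],
]
normalised = normalise_bulletin(bulletin)
print(normalised)
```

After this step each segment has zero mean in every dimension, and each dimension has unit variance when measured over the whole bulletin.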

SLIDE 23

Experimental results

Final system

  • Acoustic model set: 2624 states
  • Features: mel-frequency perceptual linear prediction (MF-PLP)
  • Normalisation: per-segment CMN, per-bulletin CVN

Evaluation

  • Used the first-best output from the HTK HDecode decoder
  • Measured WERs separately for each accent and channel condition
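
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recogniser output, divided by the number of reference words. The sketch below illustrates the definition only; it is not the scoring tool used by the authors, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One substitution ('six' -> 'sex') and one deletion ('tonight') over 5 reference words.
print(word_error_rate("the news at six tonight", "the news at sex"))  # 0.4
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.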

SLIDE 24

System performance

WER (%) per accent and channel condition ("-" marks an empty cell):

Accent    RD     SI     NST    NS     Overall
AE        -      -      60.7   67.0   63.3
BE        13.7   19.6   64.3   56.9   29.4
CE        -      ?      ?      -      61.7
EE        14.1   -      54.1   41.6   17.2
IE        12.7   -      59.2   -      16.6
UKE       -      17.7   22.7   32.2   23.8
USE       -      39.3   -      50.5   48.0
Other     -      -      63.0   66.7   65.3
Overall   13.6   19.5   57.3   52.0   24.6

CE has a single per-condition value, 61.7, under either SI or NST; the slide text does not preserve which.

SLIDE 25

System performance

[Figure: word error rate (%) against training data (minutes), with series labelled F017, F008 and F006]

SLIDE 26

MP3 audio compression

WER (%) per channel condition:

MP3 bit-rate   RD     SI     NST    NS     Overall
128 kbps       13.6   18.9   57.0   51.9   24.6
64 kbps        13.4   18.8   57.8   52.3   24.6
32 kbps        14.3   20.8   58.7   50.7   25.3
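
To put the MP3 figures in perspective, the change in overall WER between the highest and lowest bit-rate can be expressed both in absolute points and relative to the 128 kbps result. The sketch below uses only the "Overall" column from the table above; the helper name is illustrative.

```python
# Overall WER (%) by MP3 bit-rate in kbps, copied from the table above.
overall_wer = {128: 24.6, 64: 24.6, 32: 25.3}

def relative_change(baseline, value):
    """Change of value with respect to baseline, as a fraction of the baseline."""
    return (value - baseline) / baseline

abs_change = overall_wer[32] - overall_wer[128]
rel_change = relative_change(overall_wer[128], overall_wer[32])

print(f"{abs_change:.1f} points absolute, {100 * rel_change:.1f}% relative")
# prints "0.7 points absolute, 2.8% relative"
```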

SLIDE 28

Summary and conclusions

Summary:
  • Described compilation of resources and subsequent language, pronunciation and acoustic modelling
  • Compared MFCC and MF-PLP parametrisation
  • Normalisation: compared CMN and CVN
  • Considered system performance on MP3 compressed audio

Main findings

  • Final system: MF-PLP, per-segment CMN, per-bulletin CVN
  • WER of 24.6%, with poor performance on spontaneous and telephone speech
  • MP3 compression: system maintains performance except at very low bit-rates (overall WER 24.6% at 128 and 64 kbps, 25.3% at 32 kbps)

SLIDE 31

Future work

Improvements to current system:
  • Presence of several accents: pronunciation and acoustic modelling
  • Single pronunciation dictionary is currently employed

Comparative study:
  • Contrast performance with similarly trained UK and US systems
  • Identify how resources from well-resourced UK and US English varieties can be used in the poorly-resourced SA environment

Comparison and substitution

  • SABN system: SA acoustic model + SA language model + SA dictionary
  • USBN system: US acoustic model + US language model + US dictionary
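
One way to organise such a substitution study is to enumerate every combination of SA and US resources across the three components and decode the same test set with each. The sketch below only generates the configurations; it is an illustrative assumption about how the comparison could be laid out, not the authors' stated experimental plan, and the names are hypothetical.

```python
from itertools import product

COMPONENTS = ("acoustic model", "language model", "dictionary")
SOURCES = ("SA", "US")

def substitution_grid():
    """Yield every assignment of an SA or US resource to each system component."""
    for choice in product(SOURCES, repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, choice))

configs = list(substitution_grid())
for config in configs:
    print(" + ".join(f"{source} {name}" for name, source in config.items()))
```

The first configuration produced is the all-SA system and the last is the all-US system; the six in between are the mixed systems whose performance penalty the study would measure.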

SLIDE 35

Future work

Comparison and substitution

[Diagram, final frame: the USBN and SABN systems, with the components SA language model, SA dictionary, US acoustic model and US dictionary]

Performance penalty?