Speech Processing 15-492/18-492 Multilinguality SPICE: making it easier



slide-1
SLIDE 1

Speech Processing 15-492/18-492

Multilinguality SPICE: making it easier

slide-2
SLIDE 2

Dealing with *all* Languages

  • Over 6000 Languages
  • Maybe not all commercially interesting … now
  • Major languages (economic)
  • Cell phone manufacturers list 46 languages
  • But even those not all covered

slide-3
SLIDE 3

Motivation

  • Computerization: Speech is key technology
  • Mobile Devices, Ubiquitous Information Access
  • Globalization: Multilinguality
  • More than 6000 Languages in the world
  • Multiple official languages
  • Europe has 20+ official languages
  • South Africa has 11 official languages

⇒ Speech Processing in multiple Languages

  • Cross-cultural Human-Human Interaction
  • Human-Machine Interface in mother tongue

slide-4
SLIDE 4

Challenges

  • Algorithms language independent but require data
      • Dozens of hours of audio recordings and corresponding transcriptions
      • Pronunciation dictionaries for large vocabularies (>100,000 words)
      • Millions of words of written text corpora in the various domains in question
      • Bilingual aligned text corpora
  • BUT: Such data only available in very few languages
      • Audio data ≤ 40 languages; transcriptions take up to 40x real time
      • Large vocabulary pronunciation dictionaries ≤ 20 languages
      • Small text corpora ≤ 100 languages, large corpora ≤ 30 languages
      • Bilingual corpora in very few language pairs, pivot mostly English
  • Additional complications:
      • Combinatorial explosion (domain, speaking style, accent, dialect, ...)
      • Few native speakers at hand for minority (endangered) languages
      • Languages without writing systems

slide-5
SLIDE 5

Solution: Learning Systems

⇒ Systems that learn a language from the user

  • Efficient learning algorithms for speech processing
  • Learning:
      • Interactive learning with user in the loop
      • Statistical modeling approaches
  • Efficiency:
      • Reduce amount of data (save time and costs): by a factor of 10
      • Speed up development cycles: days rather than months

⇒ Rapid Language Adaptation from universal models

  • Bridge the gap: language and technology experts
      • Technology experts do not speak all languages in question
      • Native users are not in control of the technology

slide-6
SLIDE 6

Sharing data between modules

[Diagram: Speech-to-Speech Translation between a source language (Ls) and a target language (Lt), with Input Ls, Input Lt, and Output Ls. Modules shown for each language: AM, Dict (word → phone sequence), and LM (N-grams), i.e. AMs, Dicts, LMs and AMt, Dictt, LMt. Translation lexicons Lexst and Lexts map Word s ↔ Word t.]

slide-7
SLIDE 7

SPICE

Speech Processing: Interactive Creation and Evaluation toolkit

  • National Science Foundation, Grant 10/2004, 3 years
  • Principal Investigators Tanja Schultz and Alan Black
  • Bridge the gap between technology experts → language experts
  • Automatic Speech Recognition (ASR),
  • Machine Translation (MT),
  • Text-to-Speech (TTS)
  • Develop web-based intelligent systems
  • Interactive Learning with user in the loop
  • Rapid Adaptation of universal models to unseen languages
  • SPICE webpage http://cmuspice.org
slide-8
SLIDE 8

Spice Project Page

slide-9
SLIDE 9

Speech Processing Systems

[Diagram: Input: Speech → AM, Lex, LM → NLP / MT → TTS → Output: Speech & Text ("Hello"). Resources shown: phone set & speech data; pronunciation rules (e.g. hi /h/ /ai/, you /j/ /u/, we /w/ /i/); text data (e.g. "hi you", "you are", "I am").]

slide-10
SLIDE 10

Rapid Portability: Data

[Diagram: Input: Speech → AM, Lex, LM → NLP / MT → TTS → Output: Speech & Text ("Hello"), with example entries hi /h/ /ai/, you /j/ /u/, we /w/ /i/ and "hi you", "you are", "I am". Resource shown: phone set & speech data.]

slide-11
SLIDE 11

Finding “Nice” Prompts

  • From very large text databases
  • Find “nice” sentences:
  • Containing only high frequency words
  • 5-15 words
  • Find grapheme/phoneme balanced set
  • Select sentences with best triphone/graph
  • 500-1000 sentences
  • Collect for ASR and TTS acoustic modeling
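
The selection steps above can be sketched as a greedy coverage search. This is only an illustration under stated assumptions, not the SPICE implementation: the function name `select_prompts` and its parameters are hypothetical, and letter trigrams stand in for triphones because no pronunciation dictionary is assumed to exist yet.

```python
from collections import Counter


def select_prompts(sentences, top_words=1000, min_len=5, max_len=15, max_prompts=1000):
    """Greedily pick 'nice' sentences that add the most unseen letter trigrams."""
    tokenized = [s.lower().split() for s in sentences]
    # High-frequency vocabulary: the top_words most common word types.
    counts = Counter(w for words in tokenized for w in words)
    frequent = {w for w, _ in counts.most_common(top_words)}

    def trigrams(words):
        text = " ".join(words)
        return {text[i:i + 3] for i in range(len(text) - 2)}

    # Keep only sentences of min_len..max_len words made entirely of frequent words.
    candidates = {
        i: trigrams(words)
        for i, words in enumerate(tokenized)
        if min_len <= len(words) <= max_len and all(w in frequent for w in words)
    }

    covered, chosen = set(), []
    while candidates and len(chosen) < max_prompts:
        # Prefer the sentence with the most new trigrams; ties go to the earlier one.
        best = max(candidates, key=lambda i: (len(candidates[i] - covered), -i))
        if not candidates[best] - covered:
            break  # no remaining sentence adds new coverage
        covered |= candidates.pop(best)
        chosen.append(sentences[best])
    return chosen
```

The output of such a filter would still need review by a native speaker before recording.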

slide-12
SLIDE 12

Prompt Selection Issues

  • Need good text
  • De-htmlify, well-written, no misspelling
  • Need word segmentation
  • Japanese, Chinese, Thai
  • Natural text is often mixed language
  • Hindi Newspaper Text has lots of English words
  • Automatic selection has errors
  • Need Speaker to do further selection
  • E.g. lots of telephone numbers, formatting commands
  • CMU Arctic used similar methods

slide-13
SLIDE 13

Recording Prompts

slide-14
SLIDE 14

GlobalPhone

Multilingual Database

  • Widespread languages
  • Native Speakers
  • Uniform Data
  • Broad Domain
  • Large Text Resources: Internet, Newspaper

Corpus

  • 19 Languages … counting
  • ≥ 1800 native speakers
  • ≥ 400 hrs Audio data
  • Read Speech
  • Filled pauses annotated

Languages: Arabic, Ch-Mandarin, Ch-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech, Turkish + Thai + Creole + Polish + Bulgarian + ... ???

Now available from ELRA !!

slide-15
SLIDE 15

Speech Recognition in 17 Languages

[Bar chart: Word Error Rate [%] (axis 10 to 40) for Japanese, German, English, Thai, Korean, Ch-Mandarin, Turkish, French, Portuguese, Croatian, Spanish, Bulgarian, Russian, Afrikaans, Chinese, Arabic, Iraqi. Bar values as extracted: 10, 11.8, 14, 14, 14.5, 14.5, 16.9, 18, 19, 20, 20, 29, 33.5, 20, 21.7, 23.4, 29.]

slide-16
SLIDE 16

Rapid Portability: Acoustic Models

[Diagram: Input: Speech → AM, Lex, LM → NLP / MT → TTS → Output: Speech & Text ("Hello"), with example entries hi /h/ /ai/, you /j/ /u/, we /w/ /i/ and "hi you", "you are", "I am". Resource shown: phone set & speech data.]

slide-17
SLIDE 17

Universal Sound Inventory

Speech Production is independent from Language ⇒ IPA

  1) IPA-based Universal Sound Inventory
  2) Each sound class is trained by data sharing

  • Reduction from 485 to 162 sound classes
  • m, n, s, l appear in all 12 languages
  • p, b, t, d, k, g, f and i, u, e, a, o in almost all

Problem: Context of sounds is language specific. Context dependent models for new languages?

Solution:

  1) Multilingual Decision Context Trees
  2) Specialize decision tree by Adaptation

[Diagram: decision tree for the contexts of /k/. The questions "-1 = plosive?" and "+2 = vowel?" (no/yes branches) split k(0) into k(1) and k(2). Example contexts: "lau k ra", "ut k le", "ot k or", "in k ar", from the German words Blaukraut, Brautkleid, Brotkorb, Weinkarte.]
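
The idea of sharing data across languages through IPA symbols can be illustrated with a small sketch. The inventories below are toy data invented for this example, and `universal_inventory` is a hypothetical helper name. It shows only the bookkeeping of merging per-language phone sets, not the training of the shared acoustic models.

```python
def universal_inventory(inventories):
    """Merge per-language phone sets into one inventory keyed by IPA symbol."""
    shared = {}
    for language, phones in inventories.items():
        for phone in phones:
            # Each IPA symbol records every language that contributes data to it.
            shared.setdefault(phone, set()).add(language)
    return shared


# Toy inventories, invented for illustration only.
toy = {
    "German": {"m", "n", "s", "l", "k", "ç"},
    "Spanish": {"m", "n", "s", "l", "k", "ɲ"},
    "Japanese": {"m", "n", "s", "k", "ɕ"},
}
shared = universal_inventory(toy)

# Number of models if every language keeps its own set, versus the shared set.
language_specific_total = sum(len(p) for p in toy.values())
# Phones that occur in every language of the toy set.
universal = sorted(p for p, langs in shared.items() if len(langs) == len(toy))
```

In this toy case the merged inventory is smaller than the sum of the per-language sets, which is the same kind of reduction the slide reports for 12 languages.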

slide-18
SLIDE 18

Choosing Phonemes

slide-19
SLIDE 19

Rapid Portability: Acoustic Model

[Bar chart: Word Error Rate [%] (axis 20 to 100) for the configurations Ø Tree, ML-Tree, Po-Tree, PDTS, with time labels 0:15, 0:15, 0:25, 0:25, 0:25, 1:30, 16:30. Bar values as extracted: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19.]

slide-20
SLIDE 20

Rapid Portability: Pronunciation Dictionary

[Diagram: Input: Speech → AM, Lex, LM → NLP / MT → TTS → Output: Speech & Text ("Hello"), with example entries hi /h/ /ai/, you /j/ /u/, we /w/ /i/ and "hi you", "you are", "I am". Resources shown: pronunciation rules, text data.]

  • "adios" /a/ /d/ /i/ /o/ /s/
  • "Hallo" /h/ /a/ /l/ /o/
  • "Phydough" ???

slide-21
SLIDE 21

Phoneme- vs Grapheme based ASR

[Bar chart: Word Error Rate [%] (axis 0.0 to 50.0) for Phoneme, Grapheme, and Grapheme (FTT) systems on English, Spanish, German, Russian, Thai. Bar values as extracted: 11.5, 19.2, 18.4, 24.5, 26.8, 15.6, 14, 12.7, 33, 36.4, 32.8, 16, 26.4, 18.3.]

Problem:

  • 1 Grapheme ≠ 1 Phoneme

Flexible Tree Tying (FTT): One decision tree

  • Improved parameter tying
  • Less over specification
  • Fewer inconsistencies

[Diagram: a single decision tree with the questions 0=vowel?, 0=obstruent?, 0=begin-state?, -1=syllabic?, 0=mid-state?, -1=obstruent?, 0=end-state? and the leaves AX-m, IX-m, AX-b.]
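
In a grapheme-based system the pronunciation dictionary can be generated directly from the spelling. A minimal sketch, assuming an alphabetic script and treating every letter as one unit (the helper name `grapheme_dictionary` is hypothetical):

```python
def grapheme_dictionary(words):
    """Map each word to its letter sequence, used in place of a phone sequence."""
    return {word: [ch for ch in word.lower() if ch.isalpha()] for word in words}
```

For example, `grapheme_dictionary(["Hallo"])` yields the units h, a, l, l, o. This needs no expert knowledge, but it inherits the problem named above: one grapheme does not always correspond to one phoneme.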

slide-22
SLIDE 22

Dictionary: Interactive Learning

  • Follow the work of Davel & Barnard
  • Word list: extract from text
  • Update after each wi → more effective training
  • Kominek & Black

[Flowchart: User and Word list W → select best word wi → generate pronunciation P(wi) with G-2-P → play it via TTS → "P(wi) okay?" Yes: store in Lex and delete wi from W. No: improve P(wi) and update G-2-P. Skip: delete wi.]

G-2-P (grapheme context):

  • explicit mapping rules
  • neural networks
  • decision trees
  • instance learning
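
The interactive loop can be sketched as follows. This is a simplified illustration, not the Lex Learner code. `LetterG2P` is a toy stand-in for the G-2-P learner that assumes one phone per letter, `ask_user` is a hypothetical callback that returns the accepted phone sequence (or `None` for "Skip"), and "best select" is reduced to plain list order.

```python
from collections import Counter, defaultdict


class LetterG2P:
    """Toy G2P: each letter maps to the phone it was most often paired with."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def predict(self, word):
        # Unseen letters fall back to the letter itself.
        return [self.counts[ch].most_common(1)[0][0] if self.counts[ch] else ch
                for ch in word]

    def update(self, word, phones):
        # Assumes one phone per letter; a real G2P learner needs an alignment step.
        for ch, ph in zip(word, phones):
            self.counts[ch][ph] += 1


def build_lexicon(word_list, ask_user):
    """Predict, let the user confirm or correct, update the model, repeat."""
    g2p, lexicon, corrections = LetterG2P(), {}, 0
    for word in word_list:
        guess = g2p.predict(word)
        verified = ask_user(word, guess)
        if verified is None:        # user chose "Skip"
            continue
        if verified != guess:
            corrections += 1
        lexicon[word] = verified
        g2p.update(word, verified)  # update after each word
    return lexicon, corrections
```

Because the model is updated after every word, a correction made on one word can already improve the guess for the next one, which is the point of updating after each wi.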

slide-23
SLIDE 23

Spice: Lex Learner

slide-24
SLIDE 24

Spice: Lex Learner

slide-25
SLIDE 25

Issues and Challenges

  • How to make best use of the human?
  • Definition of successful completion
  • Which words to present in what order
  • How to be robust against mistakes
  • Feedback that keeps users motivated to continue
  • How many words?
  • G2P complexity language dependent
  • 80% coverage: hundred (SP) to thousands (EN)
  • G2P rule system perplexity

    Language     Perplexity
    English      50.11
    Italian       3.52
    Spanish       1.21
    Afrikaans    11.48
    German       16.70
    Dutch        16.80
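
The slide does not say how its perplexity figures were computed. As a rough intuition for why languages differ, one simple related measure is the perplexity of the phone given the letter, 2^H(phone | letter), over aligned letter/phone pairs. The sketch below computes that simpler measure. It is an assumption made for illustration and is not expected to reproduce the numbers in the table.

```python
import math
from collections import Counter, defaultdict


def g2p_perplexity(aligned_pairs):
    """Return 2**H(phone | letter) over (letter, phone) pairs."""
    by_letter = defaultdict(Counter)
    for letter, phone in aligned_pairs:
        by_letter[letter][phone] += 1
    total = sum(sum(c.values()) for c in by_letter.values())
    entropy = 0.0
    for phones in by_letter.values():
        n = sum(phones.values())
        for count in phones.values():
            # p(letter, phone) * log2 p(phone | letter)
            entropy -= (count / total) * math.log2(count / n)
    return 2 ** entropy
```

A fully regular spelling, where every letter always maps to the same phone, gives a value of 1. A letter that is pronounced two ways equally often pushes the value towards 2.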

slide-26
SLIDE 26

Rapid Portability: LM

[Diagram: Input: Speech → AM, Lex, LM → NLP / MT → TTS → Output: Speech & Text ("Hello"), with example entries hi /h/ /ai/, you /j/ /u/, we /w/ /i/ and "hi you", "you are", "I am". Resource shown: text data from Internet / TV, obtained by Inquiry and Automatic Extraction for the LM.]

  • Bridge Languages: Resource rich languages ↔ Resource low languages

slide-27
SLIDE 27

Parametric TTS

  • Text-to-speech for G2P Learning:
      • Technique: phoneme-by-phoneme concatenation, speech not natural but understandable (Marelie Davel)
      • Units are based on IPA phoneme examples
      • PRO: covers languages through simple adaptation
      • CONS: not good enough for speech applications
  • Text-to-speech for Applications:
      • Statistical Parametric Systems: clustergen
      • Clusters representing context-dependent allophones
      • PRO: can work with little speech (10 minutes)
      • PRO: robust to erroneous data.
      • CONS: speech sounds buzzy, lacks natural prosody

slide-28
SLIDE 28

SPICE: Afrikaans - English

  • Goal: Build Afrikaans – English S2S using SPICE
      • Cooperation with University Stellenbosch and ARMSCOR
      • Bilingual PhD visited CMU for 3 months (Herman Engelbrecht)
      • Afrikaans: Related to Dutch and English, g-2-p very close, regular grammar, simple morphology
  • SPICE, all components apply statistical modeling paradigm
      • ASR: HMMs, N-gram LM (JRTk-ISL)
      • MT: Statistical MT (SMT-ISL)
      • TTS: Unit-Selection (Festival)
      • Dictionary: G-2-P rules using CART decision trees
  • Text: 39 hansard; 680k words;
  • 43k bilingual aligned sentence pairs;
  • Audio: 6 hours read speech; 10k utterances,

slide-29
SLIDE 29

SPICE: Time effort

  • Results: ASR 20% WER; MT A-E (E-A) BLEU 34.1 (34.7), NIST 7.6 (7.9)
  • Shared pronunciation dictionaries (for ASR+TTS) and LM (for ASR+MT)
  • Most time consuming process: data preparation → reduce amount of data!
  • Still too much expert knowledge required (e.g. ASR parameter tuning!)

[Bar chart: time effort in days per component AM (ASR), Lex, LM (ASR, MT), TM (MT), TTS, S-2-S, broken down into Data, Training, Tuning, Evaluation, Prototype. Numbers as extracted (bar values followed by axis ticks): 5 8 7 3 11 5 5 5 10 15 20 25.]

slide-30
SLIDE 30

Current Tests

  • 11 students in CMU class
  • Hindi (2), Vietnamese (2), French, German (2), Bulgarian, Telugu, Cantonese, Mandarin.
  • Build complete S2S system
  • Teams of 2 for translation on small domain
  • Translation is simple phrase-based
  • Purpose:
  • Have students get full experience
  • Find bugs/limitations in the system
  • Evaluate resulting systems for development time and accuracy

slide-31
SLIDE 31
slide-32
SLIDE 32

HW2: TTS

  • Due 3:30pm Monday October 20th
  • Install Festival and Festvox
  • Find 10 errors in each of two different synthesizers
  • Build a voice
  • A Talking Clock
  • A general voice
  • (or both)

slide-33
SLIDE 33