Speech Processing
Using Speech with Computers

Overview
- Speech vs Text
  - Same but different
- Core Speech Technologies
  - Speech Recognition
  - Speech Synthesis
  - Dialog Systems

Pronunciation Lexicon
- List of words and their pronunciation
  - ("pencil" n (p eh1 n s ih l))
  - ("table" n (t ey1 b ax l))
- Need the right phoneme set
- Need other information
  - Part of speech
  - Lexical stress
  - Other information (Tone, Lexical accent …)
  - Syllable boundaries
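
A minimal sketch of reading entries in the format shown above into a structure that keeps part of speech and lexical stress separate. The `Entry` type and `parse_entry` function are hypothetical names introduced for illustration, and the sketch assumes entries use straight double quotes and mark stress with a trailing digit on the phone.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    word: str
    pos: str
    # Each phone is stored as (phone, stress); stress is None when unmarked.
    phones: List[Tuple[str, Optional[int]]]


# Matches: ("word" pos (phone phone ...))
ENTRY_RE = re.compile(r'\(\s*"([^"]+)"\s+(\S+)\s+\(([^)]*)\)\s*\)')


def parse_entry(line: str) -> Entry:
    """Parse one lexicon line into word, part of speech, and phones."""
    match = ENTRY_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a lexicon entry: {line!r}")
    word, pos, phone_str = match.groups()
    phones = []
    for token in phone_str.split():
        if token[-1].isdigit():
            # A trailing digit is treated as a stress mark, e.g. eh1.
            phones.append((token[:-1], int(token[-1])))
        else:
            phones.append((token, None))
    return Entry(word, pos, phones)
```

For example, `parse_entry('("table" n (t ey1 b ax l))')` yields the word `table`, the part of speech `n`, and five phones, of which only `ey` carries stress.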

Homograph Representation
- Must distinguish different pronunciations
  - ("project" n (p r aa1 jh eh k t))
  - ("project" v (p r ax jh eh1 k t))
  - ("bass" n_music (b ey1 s))
  - ("bass" n_fish (b ae1 s))
- ASR multiple pronunciations
  - ("route" n (r uw t))
  - ("route(2)" n (r aw t))
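
One way to picture the homograph entries is a lexicon keyed by both word and tag, so that synthesis can pick one pronunciation while recognition can list them all. This is only a sketch: the dictionary layout and the function names `pronounce` and `all_pronunciations` are assumptions, not part of any particular toolkit.

```python
# Keyed by (word, tag) so that homographs stay distinct.
LEXICON = {
    ("project", "n"): ["p", "r", "aa1", "jh", "eh", "k", "t"],
    ("project", "v"): ["p", "r", "ax", "jh", "eh1", "k", "t"],
    ("bass", "n_music"): ["b", "ey1", "s"],
    ("bass", "n_fish"): ["b", "ae1", "s"],
}


def pronounce(word, tag):
    """Synthesis view: return the single pronunciation for this word and tag."""
    return LEXICON.get((word, tag))


def all_pronunciations(word):
    """Recognition view: return every pronunciation listed for the word."""
    return [phones for (w, _tag), phones in LEXICON.items() if w == word]
```

Calling `pronounce("project", "v")` returns the verb form with stress on `eh1`, while `all_pronunciations("bass")` returns both variants.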

Pronunciation of Unknown Words
- How do you pronounce new words?
  - 4% of tokens (in news) are new
- You can't synthesize them without pronunciations
- You can't recognize them without pronunciations
- Letter-to-Sound rules
- Grapheme-to-Phoneme rules

LTS: Hand written
- Hand-written rules
  - [LeftContext] X [RightContext] -> Y
- e.g. Pronunciation of letter "c"
  - c [h r] -> k
  - c [h] -> ch
  - c [i] -> s
  - c -> k
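
The four rules for "c" can be sketched as an ordered rule list in which the first matching rule wins, so the most specific contexts come first. The tuple layout and the name `apply_rules` are assumptions made for this illustration, and the sketch only covers the letter "c" with empty left contexts, as in the example.

```python
# Each rule is (left_context, letter, right_context, phone).
# Order matters: the first rule whose contexts match is used.
RULES = [
    ("", "c", "hr", "k"),   # c [h r] -> k
    ("", "c", "h", "ch"),   # c [h]   -> ch
    ("", "c", "i", "s"),    # c [i]   -> s
    ("", "c", "", "k"),     # c       -> k (default)
]


def apply_rules(word, index, rules=RULES):
    """Return the phone predicted for word[index], or None if no rule applies."""
    for left, letter, right, phone in rules:
        if word[index] != letter:
            continue
        if not word[:index].endswith(left):
            continue
        if not word[index + 1:].startswith(right):
            continue
        return phone
    return None
```

With these rules the initial "c" of "chrome" maps to `k`, of "check" to `ch`, of "city" to `s`, and of "cat" to the default `k`.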

LTS: Machine Learning Techniques
- Need an existing lexicon
  - Pronunciations: words and phones
  - But different number of letters and phones
- Need an alignment
  - Between letters and phones
  - checked -> ch eh k t

LTS: alignment
- checked -> ch eh k t

    c  h  e  c  k  e  d
    ch _  eh k  _  _  t

- Some letters go to nothing
- Some letters go to two phones
  - box -> b aa k-s
  - table -> t ey b ax-l -

Find alignment automatically
- Epsilon scattering
  - Find all possible alignments
  - Estimate p(L,P) on each alignment
  - Find most probable alignment
- Hand seed
  - Hand specify allowable pairs
  - Estimate p(L,P) on each possible alignment
  - Find most probable alignment
- Statistical Machine Translation (IBM model 1)
  - Estimate p(L,P) on each possible alignment
  - Find most probable alignment
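
The epsilon scattering steps can be sketched roughly as follows: enumerate every way to pad the phone string with epsilons, count letter/phone pairs over all candidates to estimate p(L,P), then keep the highest-scoring alignment. This is a single-pass simplification under stated assumptions; the function names are hypothetical, dual phones such as `k-s` are assumed to be already merged into one symbol, and a real system would likely re-estimate the probabilities over several iterations.

```python
from collections import Counter
from itertools import combinations
from math import prod

EPS = "_"


def candidate_alignments(letters, phones):
    """Yield every order-preserving way to pad phones with epsilons to len(letters)."""
    n, m = len(letters), len(phones)
    for slots in combinations(range(n), m):
        aligned = [EPS] * n
        for slot, phone in zip(slots, phones):
            aligned[slot] = phone
        yield aligned


def estimate_pair_probs(lexicon):
    """Estimate p(letter, phone) by counting pairs over all candidate alignments."""
    counts = Counter()
    for letters, phones in lexicon:
        for aligned in candidate_alignments(letters, phones):
            counts.update(zip(letters, aligned))
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def best_alignment(letters, phones, probs):
    """Pick the candidate whose letter/phone pairs have the highest product of probabilities."""
    return max(
        candidate_alignments(letters, phones),
        key=lambda aligned: prod(probs.get(pair, 0.0) for pair in zip(letters, aligned)),
    )
```

For "checked" with four phones there are 35 candidate alignments (7 choose 4); the more lexicon entries contribute counts, the more the plausible pairs such as (c, ch) dominate the estimate.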

Not everything aligns
- 0, 1, and 2 letter cases
  - e -> epsilon "moved"
  - x -> k-s, g-z "box" "example"
  - e -> y-uw "askew"
- Some alignments aren't sensible
  - dept -> d ih p aa r t m ax n t
  - cmu -> s iy eh m y uw

Training LTS models
- Use CART trees
  - One model for each letter
- Predict phone (epsilon, phone, dual phone)
  - From letter 3-context (and POS)
  - # # # c h e c -> ch
  - # # c h e c k -> _
  - # c h e c k e -> eh
  - c h e c k e d -> k
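
The training examples listed above can be produced from an aligned entry by sliding a window of three letters on each side over the word. The sketch below only builds the (window, phone) pairs; it does not train a CART tree, and the function name `context_windows` is a hypothetical one chosen for illustration.

```python
def context_windows(letters, aligned_phones, width=3, pad="#"):
    """Build one (letter window, phone) training example per letter.

    The window holds `width` letters of left context, the letter itself,
    and `width` letters of right context, padded with `pad` at the edges.
    """
    padded = [pad] * width + list(letters) + [pad] * width
    examples = []
    for i, phone in enumerate(aligned_phones):
        window = tuple(padded[i:i + 2 * width + 1])
        examples.append((window, phone))
    return examples


# Aligned entry for "checked": one phone (or epsilon "_") per letter.
examples = context_windows("checked", ["ch", "_", "eh", "k", "_", "_", "t"])
```

Grouping the examples by the middle letter of each window would give the per-letter training sets that the "one model for each letter" design calls for.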

LTS results
- Split lexicon into train/test 90%/10%
  - i.e. every tenth entry is extracted for testing

    Lexicon     Letter Acc   Word Acc
    OALD        95.80%       75.56%
    CMUDICT     91.99%       57.80%
    BRULEX      99.00%       93.03%
    DE-CELEX    98.79%       89.38%
    Thai        95.60%       68.76%
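
A rough sketch of the evaluation protocol: hold out every tenth entry, then score predictions per letter and per word. A word counts as correct only if every letter's phone is right, which is why word accuracy is always at or below letter accuracy. The function names are hypothetical, and predictions are assumed to be aligned phone lists with one symbol per letter.

```python
def split_every_tenth(entries):
    """Hold out every tenth entry for testing; the rest is training data."""
    train = [e for i, e in enumerate(entries) if i % 10 != 9]
    test = [e for i, e in enumerate(entries) if i % 10 == 9]
    return train, test


def accuracies(predicted, reference):
    """Return (letter accuracy, word accuracy) for aligned phone lists."""
    total_letters = correct_letters = correct_words = 0
    for pred, ref in zip(predicted, reference):
        total_letters += len(ref)
        correct_letters += sum(p == r for p, r in zip(pred, ref))
        # A word is correct only when the whole phone sequence matches.
        correct_words += int(pred == ref)
    return correct_letters / total_letters, correct_words / len(reference)
```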

Example Tree (figure not reproduced)

But we need more than phones
- What about lexical stress
  - p r aa1 j eh k t -> p r aa j eh1 k t
- Two possibilities
  - A separate prediction model (LTP+S)
  - Joint model: introduce eh/eh1 (LTPS) (BETTER)

                LTP+S     LTPS
    L no S      96.36%    96.27%
    Letter      ---       95.80%
    W no S      76.92%    74.69%
    Word        63.68%    74.56%

  ("L" = letter accuracy, "W" = word accuracy, "no S" = ignoring stress)

Does it really work
- 40K words from Time Magazine
- 1775 (4.6%) not in OALD
- LTS gets 70% correct (test set was 74%)

                   Occurs    %
    Names          1360      76.6
    Unknown        351       19.8
    US Spelling    57        3.2
    Typos          7         0.4

Spoken Dialog Systems
- Information giving
  - Flights, buses, stocks, weather
  - Driving directions
  - News
- Information navigators
  - Read your mail
  - Search the web
  - Answer questions
- Provide personalities
  - Game characters (NPC), toys, robots, chatbots
- Speech-to-speech translation
  - Cross-lingual interaction

Dialog Types
- System initiative
  - Form-filling paradigm
  - Can switch language models at each turn
  - Can "know" which is likely to be said
- Mixed initiative
  - Users can go where they like
  - System or user can lead the discussion
- Classifying:
  - Users can say what they like
  - But really only "N" operations possible
  - E.g. AT&T "How may I help you?"
- Non-task oriented
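
A system-initiative, form-filling dialog can be sketched as a loop over unfilled slots: the system asks for the next missing slot and, because it knows what it just asked, can switch to a language model suited to the expected answer. The slot names, prompts, and grammar labels below are hypothetical placeholders, not taken from any deployed system.

```python
# Slots are filled in a fixed order chosen by the system.
SLOTS = ["departure", "destination", "time"]

PROMPTS = {
    "departure": "Where are you leaving from?",
    "destination": "Where are you going?",
    "time": "When do you want to travel?",
}

# Hypothetical names of the language model to load for each expected answer.
GRAMMARS = {
    "departure": "stop_names",
    "destination": "stop_names",
    "time": "times",
}


def next_turn(form):
    """Return (slot, prompt, grammar) for the first unfilled slot, or None when done."""
    for slot in SLOTS:
        if slot not in form:
            return slot, PROMPTS[slot], GRAMMARS[slot]
    return None
```

Starting from an empty form, `next_turn({})` asks for the departure stop with the `stop_names` model; once all three slots are present it returns `None` and the system can query the database.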

System Initiative
- Let's Go Bus Information
  - 412 268 3526
  - Provides bus information for Pittsburgh
- Tell Me
  - Company getting others to build systems
  - Stocks, weather, entertainment
  - 1 800 555 8355

SDS Architecture
(Diagram components: Recognition, Interpretation, Dialog Manager, Generation, Synthesis)

SDS Components
- Interpretation
  - Parsing and Information Extraction
  - (Ignore politeness and find the departure stop)
- Generation
  - From SQL table output from DB
  - Generate "nice" text to say
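
The two components can be illustrated with a deliberately small sketch: interpretation skips the polite wrapping and pulls out the departure stop, and generation turns one database row into a sentence. The stop list, the row fields, and both function names are assumptions for illustration; a real system would use a proper parser and a richer generator.

```python
import re

# Hypothetical stop names the system knows about.
STOPS = ["forbes and murray", "downtown", "airport"]


def find_departure(utterance):
    """Return the stop mentioned after 'from', ignoring everything else."""
    text = utterance.lower()
    for stop in STOPS:
        pattern = r"\bfrom\s+(the\s+)?" + re.escape(stop) + r"\b"
        if re.search(pattern, text):
            return stop
    return None


def generate(row):
    """Turn one database row (a dict with route, stop, time) into text to say."""
    return f"The next {row['route']} leaves {row['stop']} at {row['time']}."
```

For instance, `find_departure("Hi, could you please tell me the next bus from Forbes and Murray to downtown, thanks")` returns `forbes and murray`, since only the stop after "from" is treated as the departure.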

Siri-like Assistants
- Advantages
  - Hard to type/select things on phone
  - Can use context (location, contacts, calendar)
- Target common tasks
  - Calling, sending messages, calendar
  - Fall back on Google lookup
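
The "target common tasks, otherwise fall back" design can be sketched as a tiny router: a few keywords map to the supported tasks, and anything unrecognized goes to web search. The keyword table and the task labels are hypothetical; real assistants use trained classifiers rather than keyword matching.

```python
# Hypothetical keyword-to-task table for the common tasks.
TASK_KEYWORDS = {
    "call": "call",
    "text": "message",
    "message": "message",
    "schedule": "calendar",
    "calendar": "calendar",
}


def route(utterance):
    """Map an utterance to a supported task, or fall back to web search."""
    words = utterance.lower().split()
    for keyword, task in TASK_KEYWORDS.items():
        if keyword in words:
            return task
    return "web_search"
```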
