Phonetic Modeling in ASR Chuck Wooters 3/16/05 EECS 225d

Introduction
VARIATION: the central issue in Automatic Speech Recognition

Many Types of Variation
- channel/microphone type
- environmental noise
- speaking style
- vocal anatomy
- gender
- accent
- health
- etc.

Focus Today
“You say pot[ey]to, I say pot[a]to...”
How can we model variation in pronunciation?

Pronunciation Variation
A careful transcription of conversational speech by trained linguists has revealed...

80 Ways To Say “and”
From “Speaking in Shorthand - A Syllable-Centric Perspective for Understanding Pronunciation Variation” by Steve Greenberg

Outline
- Phonetic Modeling
- Sub-Word models
  - Phones (mono-, bi-, di- and triphones)
  - Syllables
  - Data-driven units
  - Cross-word modeling
- Whole-word models
- Lexicons (Dictionaries) for ASR

Phonetic Modeling

Phonetic Modeling
How do we select the basic units for recognition?
- Units should be accurate
- Units should be trainable
- Units should be generalizable
We often have to balance these against each other.

Sub-Word Models

Sub-Word Models
- Phones
  - Context Independent
  - Context Dependent
- Syllables
- Data-driven units
- Cross-word modeling

Phones

Phones
Note: “phones” != “phonemes” (see G&M pg. 310)
E.g.:
  Phoneme: Ascii-65
  Phone: A A A A A

“Flavors” of Phones
- Context Independent:
  - Monophones
- Context Dependent:
  - Biphones
  - Diphones
  - Triphones

Context Independent Phones

Context Independent “Monophones”
- “cat” = [k ae t]
- Easy to train: only about 40 monophones for English
- The basis of other sub-word units
- Easy to add new pronunciations to lexicon

Typical English Phone Set

Phone  Example    Phone  Example    Phone  Example
iy     feel       ih     fill       ae     gas
aa     father     ah     bud        ao     caught
ay     bite       ax     comply     ey     day
eh     ten        er     turn       ow     tone
aw     how        oy     coin       uh     book
uw     tool       b      big        p      pig
d      dig        t      sat        g      gut
k      cut        f      fork       v      vat
s      sit        z      zap        th     thin
dh     then       sh     she        zh     genre
l      lid        r      red        y      yacht
w      with       hh     help       m      mat
n      no         ng     sing       ch     chin
jh     edge

Adapted from “Spoken Language Processing” by Xuedong Huang et al.

Monophones
Major Drawback
- Not very powerful for modeling variation
- Example: “key” vs “coo” (the [k] is articulated differently before [iy] than before [uw], yet one monophone model has to cover both)

Context Dependent Phones

Biphones
Taking into account the context (what sounds are to the right or left) in which the phone occurs.
- Left biphone of [ae] in “cat”: k_ae
- Right biphone of [ae] in “cat”: ae_t
- “key” = k_iy iy_#
- “coo” = k_uw uw_#

Biphones
- More difficult to train than monophones: roughly (40^2 + 40^2) biphones for English
- If there is not enough training data for a biphone model, can “back off” to the monophone
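The biphone labels above can be generated mechanically from a monophone pronunciation. The following is a minimal sketch; the function names and the use of `#` as the boundary symbol on both sides of the word are assumptions chosen to match the “key” = k_iy iy_# example.

```python
def right_biphones(phones, boundary="#"):
    """Label each phone with its right-hand neighbour, e.g. k_iy iy_#."""
    contexts = list(phones[1:]) + [boundary]
    return [f"{phone}_{right}" for phone, right in zip(phones, contexts)]


def left_biphones(phones, boundary="#"):
    """Label each phone with its left-hand neighbour, e.g. #_k k_ae ae_t."""
    contexts = [boundary] + list(phones[:-1])
    return [f"{left}_{phone}" for left, phone in zip(contexts, phones)]


print(right_biphones(["k", "iy"]))      # ['k_iy', 'iy_#']
print(right_biphones(["k", "uw"]))      # ['k_uw', 'uw_#']
print(left_biphones(["k", "ae", "t"]))  # ['#_k', 'k_ae', 'ae_t']
print(40 ** 2 + 40 ** 2)                # 3200 possible biphones for 40 phones
```

Note that a label such as ae_t is ambiguous on its own: it is both the right biphone of [ae] and the left biphone of [t] in “cat”, so a real system would have to record which kind of unit it means.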

Triphones
- Consider the sounds to the left AND right
- Good modeling of variation
- Most widely used in ASR systems
- “key” = #_k_iy k_iy_#
- “coo” = #_k_uw k_uw_#

Triphones
- Can be difficult to train: there are LOTS of possible triphones (roughly 40^3)
  - Not all occur
- If not enough data to train a triphone, typically back-off to left or right biphone
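A minimal sketch of triphone expansion with back-off follows. Units are stored as (left, phone, right) tuples, with `None` meaning "context ignored", to avoid the ambiguity of string labels. The back-off order (triphone, left biphone, right biphone, monophone) and the contents of the `trained` set are assumptions for illustration, not something the slides specify.

```python
def triphones(phones, boundary="#"):
    """Return (left, phone, right) tuples; '#' marks the word boundary."""
    padded = [boundary] + list(phones) + [boundary]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


def back_off(unit, trained):
    """Pick the most specific model in `trained` for one triphone."""
    left, phone, right = unit
    candidates = [
        (left, phone, right),  # full triphone
        (left, phone, None),   # left biphone
        (None, phone, right),  # right biphone
        (None, phone, None),   # monophone
    ]
    for candidate in candidates:
        if candidate in trained:
            return candidate
    raise KeyError(f"no model available for {phone!r}")


print(40 ** 3)                 # 64000 possible triphones for a 40-phone set
print(triphones(["k", "iy"]))  # [('#', 'k', 'iy'), ('k', 'iy', '#')]

# Hypothetical inventory: k_iy_# was never seen, but a left biphone was.
trained = {("#", "k", "iy"), ("k", "iy", None), (None, "iy", None)}
print([back_off(u, trained) for u in triphones(["k", "iy"])])
# [('#', 'k', 'iy'), ('k', 'iy', None)]
```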

Triphones
- Don’t always capture variation: “that rock” (ae_t_r) vs. “theatrical” (ae_t_r)
- Sometimes helps to cluster similar triphones

Diphones
- Modeling the transitions between phones
- Extend from middle of one phone to the middle of the next
- “key” = #_k k_iy iy_#
- “coo” = #_k k_uw uw_#
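Because a diphone covers one transition, a word with N phones yields N + 1 diphones once the boundaries are included. A minimal sketch, assuming `#` marks silence or the word boundary as in the examples above (the function name is illustrative):

```python
def diphones(phones, boundary="#"):
    """Return one unit per phone-to-phone transition, including boundaries."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}_{b}" for a, b in zip(padded, padded[1:])]


print(diphones(["k", "iy"]))  # ['#_k', 'k_iy', 'iy_#']
print(diphones(["k", "uw"]))  # ['#_k', 'k_uw', 'uw_#']
```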

Syllables

Syllables
Syllable = [Onset] + Rime
Rime = Nucleus + [Coda]
Example: “Strengths”
  Onset: str
  Nucleus: eh
  Coda: ng th s
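The syllable structure above maps naturally onto a small record type. This is only a sketch: the class and field names are assumptions, and the onset "str" is split into the three phones s, t, r from the phone set shown earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Syllable:
    nucleus: List[str]                              # required
    onset: List[str] = field(default_factory=list)  # optional
    coda: List[str] = field(default_factory=list)   # optional

    @property
    def rime(self) -> List[str]:
        """Rime = nucleus followed by the (possibly empty) coda."""
        return self.nucleus + self.coda

    @property
    def phones(self) -> List[str]:
        """Full phone sequence: onset followed by rime."""
        return self.onset + self.rime


strengths = Syllable(onset=["s", "t", "r"], nucleus=["eh"], coda=["ng", "th", "s"])
print(strengths.rime)    # ['eh', 'ng', 'th', 's']
print(strengths.phones)  # ['s', 't', 'r', 'eh', 'ng', 'th', 's']
```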

Syllables
- Good modeling of variation
- Somewhere between triphones and whole-word models
- Can be difficult to train (like triphones)
- Practical experiments have not shown improvements over triphone-based systems.

Data-driven Sub-Word Units

Data-driven Sub-Word Units
- Basic Idea: More accurate modeling of acoustic variation
- Cluster data into homogeneous “groups”
  - sounds with similar acoustics should group together
- Use these automatically-derived units instead of linguistically-based sub-word units
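The slides do not name a clustering algorithm. As one possible illustration of "sounds with similar acoustics should group together", here is a plain k-means sketch over toy two-dimensional feature vectors. The choice of k-means, the seeding from the first k points, and the made-up `frames` data are all assumptions.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each frame joins its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy "acoustic frames": two obvious groups.
frames = [[1.0, 1.1], [5.0, 5.2], [0.9, 1.0], [5.1, 4.9], [1.2, 0.8]]
labels, centroids = kmeans(frames, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

Each resulting cluster index would play the role of an automatically derived unit; nothing in the labels says which linguistic sound, if any, a cluster corresponds to.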

Data-driven Sub-Word Units
Difficulties:
- Can have problems with training, depending on number of units
- Real problem: generalizability
  - How do we add words to the system when we don’t know what the units “mean”?
  - Create a mapping from phones?

Cross-word Modeling

Cross-word Modeling
- Co-articulation spans word boundaries:
  - “Did you eat yet?” -> jeatyet
  - “could you” -> couldja
  - “I don’t know” -> idunno
- We can achieve better modeling by looking across word boundaries
- More difficult to implement: what would the dictionary look like?
- Usually use lattices when doing cross-word modeling
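To see why cross-word modeling complicates the dictionary, compare word-internal triphones with triphones whose context reaches into the neighbouring word. This is a sketch under assumptions: the two-word `LEXICON`, its canonical pronunciations, and the function names are illustrative only.

```python
LEXICON = {"did": ["d", "ih", "d"], "you": ["y", "uw"]}


def _triphones(phones, boundary="#"):
    padded = [boundary] + list(phones) + [boundary]
    return ["_".join(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


def word_internal_triphones(words, lexicon):
    """Each word is expanded alone, so context stops at the word edge."""
    units = []
    for word in words:
        units.extend(_triphones(lexicon[word]))
    return units


def cross_word_triphones(words, lexicon):
    """Phones are concatenated first, so context spans word boundaries."""
    phones = [p for word in words for p in lexicon[word]]
    return _triphones(phones)


print(word_internal_triphones(["did", "you"], LEXICON))
# ['#_d_ih', 'd_ih_d', 'ih_d_#', '#_y_uw', 'y_uw_#']
print(cross_word_triphones(["did", "you"], LEXICON))
# ['#_d_ih', 'd_ih_d', 'ih_d_y', 'd_y_uw', 'y_uw_#']
```

In the cross-word case the units for “did” depend on which word follows, so a word no longer has a single fixed expansion that can be stored in the dictionary ahead of time.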

Whole-word Models

Whole-word Models
- In some sense, the most “natural” unit
- Good modeling of coarticulation within the word
- If context dependent, good modeling across words
- Good when vocabulary is small, e.g. digits: 10 words
  - Context dependent: 10x10x10 = 1000 models
  - Not a huge problem for training
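The 10x10x10 figure is the number of (left word, word, right word) combinations. A small sketch of that arithmetic shows how quickly it grows; the 1000-word vocabulary is a hypothetical comparison point, not a figure from the slides.

```python
def context_dependent_word_models(vocabulary_size):
    """One model per (left word, word, right word) combination."""
    return vocabulary_size ** 3


print(context_dependent_word_models(10))    # 1000 (the digit example)
print(context_dependent_word_models(1000))  # 1000000000 (hypothetical task)
```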

Whole-word Models
Problems:
- difficult to train: needs lots of examples of *every* word
- not generalizable: adding new words requires more data collection

Lexicons

Lexicons for ASR
Contains:
- words
- pronunciations
- optionally:
  - alternate pronunciations
  - pronunciation probabilities
No definitions
Example entries:
  cat: k ae t
  key: k iy
  coo: k uw
  the: 0.6 dh iy
       0.4 dh ax
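A lexicon of this kind can be held as a mapping from each word to a list of (probability, phones) pairs. The sketch below assumes a simple one-entry-per-line text format in which the word is repeated for each alternate pronunciation and a missing probability means 1.0. That format and the function name are assumptions, not a standard dictionary format.

```python
from collections import defaultdict

LEXICON_TEXT = """\
cat: k ae t
key: k iy
coo: k uw
the: 0.6 dh iy
the: 0.4 dh ax
"""


def parse_lexicon(text):
    """Map each word to a list of (probability, phones) pronunciations."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        word, rest = line.split(":", 1)
        fields = rest.split()
        try:
            # An optional leading number is the pronunciation probability.
            probability, phones = float(fields[0]), fields[1:]
        except ValueError:
            probability, phones = 1.0, fields
        lexicon[word.strip()].append((probability, phones))
    return dict(lexicon)


lexicon = parse_lexicon(LEXICON_TEXT)
print(lexicon["cat"])  # [(1.0, ['k', 'ae', 't'])]
print(lexicon["the"])  # [(0.6, ['dh', 'iy']), (0.4, ['dh', 'ax'])]
```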

Lexicon Generation
Where do lexical entries come from?
- Hand labeling
- Rule generated
Not too bad for English, but can be a big expense when building a recognizer for a new language
For a small task, may want to consider whole-word models to bypass lexicon generation
