Speech recognition (briefly): Chapter 15, Section 6


  1. Speech recognition (briefly)
     Chapter 15, Section 6

  2. Outline
     ♦ Speech as probabilistic inference
     ♦ Speech sounds
     ♦ Word pronunciation
     ♦ Word sequences

  3. Speech as probabilistic inference
     "It's not easy to wreck a nice beach"
     ♦ Speech signals are noisy, variable, and ambiguous
     ♦ What is the most likely word sequence, given the speech signal?
       I.e., choose Words to maximize P(Words | signal)
     ♦ Words is the hidden state sequence; signal is the observation sequence
     ♦ We need to define an acoustic model (sensor model) and a language model (transition model)
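
     As a toy illustration (not from the slides), the decision rule can be written as an
     argmax over candidate word sequences, scored in log space by an acoustic model plus a
     language model. The candidate sequences and scores below are invented placeholders.

import math

def best_word_sequence(candidates, acoustic_loglik, language_logprob):
    """Pick the word sequence maximizing log P(signal | Words) + log P(Words)."""
    return max(candidates,
               key=lambda ws: acoustic_loglik(ws) + language_logprob(ws))

# Hypothetical scores for the classic ambiguity on this slide
candidates = [("recognize", "speech"), ("wreck", "a", "nice", "beach")]
acoustic = {candidates[0]: math.log(0.4), candidates[1]: math.log(0.5)}   # P(signal | Words)
language = {candidates[0]: math.log(1e-6), candidates[1]: math.log(1e-9)} # P(Words)
print(best_word_sequence(candidates, acoustic.get, language.get))
# The language model breaks the acoustic near-tie in favour of "recognize speech".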

  4. Phones
     ♦ All human speech is composed from 40-50 phones, determined by the configuration of
       articulators (lips, teeth, tongue, vocal cords, air flow)
     ♦ Phones form an intermediate level of hidden states between words and signal
       ⇒ acoustic model = pronunciation model + phone model
     ♦ ARPAbet designed for American English:
         [iy]  beat      [b]  bet      [p]  pet
         [ih]  bit       [ch] Chet     [r]  rat
         [ey]  bet       [d]  debt     [s]  set
         [ao]  bought    [hh] hat      [th] thick
         [ow]  boat      [hv] high     [dh] that
         [er]  Bert      [l]  let      [w]  wet
         [ix]  roses     [ng] sing     [en] button
         ...             ...           ...
     ♦ E.g., "ceiling" is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]

  5. Speech sounds
     ♦ Raw signal is the microphone displacement as a function of time; processed into
       overlapping 30ms frames, each described by features
     ♦ [Figure: analog acoustic signal → sampled, quantized digital signal → frames with features]
     ♦ Frame features are typically formants (peaks in the power spectrum)
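
     A rough Python sketch of the framing step described above. The slide only says 30ms
     overlapping frames, so the 10ms frame step and 8kHz sampling rate below are assumptions,
     and the random signal is a stand-in for real audio.

import numpy as np

def frame_signal(samples, sample_rate=8000, frame_ms=30, step_ms=10):
    """Cut a sampled signal into overlapping frames (assumed step size)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)         # samples between frame starts
    n_frames = 1 + max(0, (len(samples) - frame_len) // step)
    return np.stack([samples[i * step : i * step + frame_len]
                     for i in range(n_frames)])

signal = np.random.randn(8000)      # one second of fake audio
frames = frame_signal(signal)
print(frames.shape)                 # about 98 overlapping frames of 240 samples each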

  6. Phone models
     ♦ Frame features in P(features | phone) summarized by
       – an integer in [0 ... 255] (using vector quantization); or
       – the parameters of a mixture of Gaussians
     ♦ Three-state phones: each phone has three phases (Onset, Mid, End)
       E.g., [t] has silent Onset, explosive Mid, hissing End
       ⇒ P(features | phone, phase)
     ♦ Triphone context: each phone becomes n² distinct phones, depending on the phones to
       its left and right
       E.g., [t] in "star" is written [t(s,aa)] (different from "tar"!)
     ♦ Triphones are useful for handling coarticulation effects: the articulators have
       inertia and cannot switch instantaneously between positions
       E.g., [t] in "eighth" has tongue against front teeth
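
     The mixture-of-Gaussians option for P(features | phone, phase) could look roughly like
     the following sketch; the component weights, means, and variances are made up, and a
     real system would learn them from data.

import numpy as np

def gmm_density(x, weights, means, variances):
    """Diagonal-covariance Gaussian mixture density at feature vector x."""
    x = np.asarray(x, dtype=float)
    dens = 0.0
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        norm = np.prod(1.0 / np.sqrt(2 * np.pi * var))
        dens += w * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return dens

# Hypothetical 2-component mixture standing in for P(features | phone=[t], phase=Mid)
weights = [0.6, 0.4]
means = [[1.0, 0.5], [2.0, -0.3]]
variances = [[0.5, 0.5], [1.0, 0.8]]
print(gmm_density([1.2, 0.4], weights, means, variances))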

  7. Phone model example
     ♦ Phone HMM for [m]:
         Onset --0.7--> Mid --0.1--> End --0.6--> FINAL
         (self-loop probabilities: Onset 0.3, Mid 0.9, End 0.4)
     ♦ Output probabilities for the phone HMM:
         Onset:  C1: 0.5   C2: 0.2   C3: 0.3
         Mid:    C3: 0.2   C4: 0.7   C5: 0.1
         End:    C4: 0.1   C6: 0.5   C7: 0.4
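
     To make the example concrete, here is a sketch that encodes the [m] phone HMM above as
     matrices and computes P(observations | phone = [m]) with the standard forward algorithm.
     The state layout follows the reconstruction above; the observation sequence is invented.

import numpy as np

states = ["Onset", "Mid", "End"]
# Transition probabilities among emitting states; leftover mass on each row
# goes to the absorbing FINAL state (only End -> FINAL here, with 0.6).
A = np.array([[0.3, 0.7, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 0.4]])
final = np.array([0.0, 0.0, 0.6])          # P(state -> FINAL)
# Output probabilities over VQ labels C1..C7 (columns C1..C7)
B = np.array([[0.5, 0.2, 0.3, 0.0, 0.0, 0.0, 0.0],   # Onset
              [0.0, 0.0, 0.2, 0.7, 0.1, 0.0, 0.0],   # Mid
              [0.0, 0.0, 0.0, 0.1, 0.0, 0.5, 0.4]])  # End

def forward_likelihood(obs):
    """obs: list of VQ labels in 1..7; returns P(obs, reach FINAL | phone=[m])."""
    alpha = np.array([1.0, 0.0, 0.0]) * B[:, obs[0] - 1]   # start in Onset
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o - 1]
    return float(alpha @ final)

print(forward_likelihood([1, 3, 4, 6]))    # e.g. the sequence C1 C3 C4 C6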

  8. Word pronunciation models
     ♦ Each word is described as a distribution over phone sequences
     ♦ Distribution represented as an HMM transition model, e.g. for "tomato":
         [t] → [ow] (0.2) or [ah] (0.8) → [m] → [ey] (0.5) or [aa] (0.5) → [t] → [ow]
     ♦ P([towmeytow] | "tomato") = P([towmaatow] | "tomato") = 0.1
       P([tahmeytow] | "tomato") = P([tahmaatow] | "tomato") = 0.4
     ♦ Structure is created manually; transition probabilities are learned from data
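
     The two quoted probabilities follow directly from multiplying branch probabilities along
     a path through the word HMM; a tiny sketch of that calculation:

# Branch probabilities from the "tomato" model on this slide
branches = {
    "second_phone": {"ow": 0.2, "ah": 0.8},
    "fourth_phone": {"ey": 0.5, "aa": 0.5},
}

def p_pronunciation(v2, v4):
    """P([t v2 m v4 t ow] | 'tomato'); all other arcs have probability 1."""
    return branches["second_phone"][v2] * branches["fourth_phone"][v4]

print(p_pronunciation("ow", "ey"))   # 0.1, i.e. [t ow m ey t ow]
print(p_pronunciation("ah", "aa"))   # 0.4, i.e. [t ah m aa t ow]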

  9. Isolated words
     ♦ Phone models + word models fix the likelihood P(e_1:t | word) for an isolated word:
         P(word | e_1:t) = α P(e_1:t | word) P(word)
     ♦ Prior probability P(word) obtained simply by counting word frequencies
     ♦ P(e_1:t | word) is computed using the HMM for each word
     ♦ Isolated-word dictation systems with training reach 95-99% accuracy
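
     A minimal sketch of this posterior computation. The per-word likelihoods and frequency
     counts below are hypothetical; in a real system each likelihood P(e_1:t | word) would
     come from running the forward algorithm on that word's HMM.

def word_posterior(likelihoods, counts):
    """P(word | e_1:t) = alpha * P(e_1:t | word) * P(word)."""
    total = sum(counts.values())
    unnorm = {w: likelihoods[w] * counts[w] / total for w in likelihoods}
    alpha = 1.0 / sum(unnorm.values())          # normalization constant
    return {w: alpha * p for w, p in unnorm.items()}

likelihoods = {"tomato": 2.1e-7, "potato": 0.4e-7}   # P(e_1:t | word), made up
counts = {"tomato": 300, "potato": 700}              # word frequencies, made up
print(word_posterior(likelihoods, counts))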

  10. Continuous speech
      ♦ Not just a sequence of isolated-word recognition problems!
        – Adjacent words are highly correlated
        – The sequence of most likely words ≠ the most likely sequence of words
        – Segmentation: there are few gaps in speech
        – Cross-word coarticulation, e.g., "next thing"
      ♦ Continuous speech systems manage 60-80% accuracy on a good day

  11. Language model
      ♦ Prior probability of a word sequence is given by the chain rule:
          P(w_1 ··· w_n) = ∏_{i=1}^{n} P(w_i | w_1 ··· w_{i-1})
      ♦ Bigram model: P(w_i | w_1 ··· w_{i-1}) ≈ P(w_i | w_{i-1})
      ♦ Train by counting all word pairs in a large text corpus
      ♦ More sophisticated models (trigrams, grammars, etc.) help a little bit
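
      Training the bigram model by counting word pairs could look like this sketch; the
      three-sentence corpus is invented for illustration, and no smoothing is applied.

from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}, *)."""
    pair_counts = defaultdict(Counter)
    for sent in sentences:
        words = ["<s>"] + sent.lower().split()
        for prev, cur in zip(words, words[1:]):
            pair_counts[prev][cur] += 1
    return {prev: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for prev, cnt in pair_counts.items()}

corpus = ["recognize speech", "wreck a nice beach", "recognize a beach"]
model = train_bigram(corpus)
print(model["recognize"])   # {'speech': 0.5, 'a': 0.5}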

  12. Trigram model
      ♦ Mark V. Shaney, a program created by Bruce Ellis at Bell Labs
        – Used alt.singles as its training corpus
        – Generated text using a trigram model (second-order model)
      ♦ Examples:
        "As I've commented before, really relating to someone involves standing next to
         impossible."
        "I spent an interesting evening recently with a grain of salt"
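
      For illustration only (this is not Shaney's actual code), second-order generation can
      be sketched as sampling each word conditioned on the previous two; the tiny corpus
      below is invented.

import random
from collections import defaultdict

def build_trigrams(text):
    """Map each pair of consecutive words to the words that can follow it."""
    words = text.split()
    table = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        table[(a, b)].append(c)
    return table

def generate(table, seed_pair, length=12):
    a, b = seed_pair
    out = [a, b]
    for _ in range(length):
        choices = table.get((a, b))
        if not choices:
            break
        a, b = b, random.choice(choices)
        out.append(b)
    return " ".join(out)

corpus = ("really relating to someone involves standing next to someone "
          "standing next to impossible deadlines")
print(generate(build_trigrams(corpus), ("standing", "next")))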

  13. Combined HMM
      ♦ States of the combined language+word+phone model are labelled by the word we're in
        + the phone in that word + the phone state in that phone
      ♦ Viterbi algorithm finds the most likely phone state sequence
      ♦ Does segmentation by considering all possible word sequences and boundaries
      ♦ Doesn't always give the most likely word sequence, because each word sequence is a
        sum over many state sequences
      ♦ Jelinek invented a way to find the most likely word sequence using A*, where the
        "step cost" is −log P(w_i | w_{i-1})
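
      A generic Viterbi sketch over an HMM given as initial, transition, and emission
      probabilities; the two-state example at the bottom is invented and vastly smaller
      than a real combined language+word+phone model.

import numpy as np

def viterbi(pi, A, B, obs):
    """pi: initial probs (n,), A: transitions (n,n), B: emissions (n,m),
    obs: list of observation indices. Returns the most likely state path."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)      # scores[i, j]: best path ending in j via i
        back.append(np.argmax(scores, axis=0))
        delta = np.max(scores, axis=0) + np.log(B[:, o])
    path = [int(np.argmax(delta))]
    for ptr in reversed(back):                   # follow back-pointers
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

# Tiny two-state example with made-up numbers
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi(pi, A, B, [0, 1, 1]))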

  14. DBNs for speech recognition
      ♦ [Figure: dynamic Bayesian network for speech recognition, with variables for the
        phoneme index, phoneme transition, end-of-word observation (P(OBS | index = 2) = 1,
        P(OBS | not 2) = 0), phoneme, articulators (tongue, lips), and acoustic observation;
        some conditional distributions are deterministic and fixed, others stochastic and learned]
      ♦ Also easy to add variables for, e.g., gender, accent, speed

  15. Summary
      ♦ Since the mid-1970s, speech recognition has been formulated as probabilistic inference
      ♦ Evidence = speech signal, hidden variables = word and phone sequences
      ♦ "Context" effects (coarticulation etc.) are handled by augmenting state
      ♦ Variability in human speech (speed, timbre, etc.) and background noise make continuous
        speech recognition in real settings an open problem
