
An Overview of Speech Technologies
Aren Jansen
Thanks to Brian Kingsbury (IBM) and Hynek Hermansky (JHU) for some of the materials contained in this lecture.

Core Automatic Speech Technologies
• Speech Processing
  – Synthesis
  – Coding
  – Recognition
    • Speaker Recognition
    • Language Recognition
    • Speech Recognition: Acoustic Modeling, Keyword Search, Language Modeling

From i.i.d. Samples to i.i.d. Time Series
• Most of what you covered so far:
  – Given: Z = {(x_i, y_i)} pairs (for supervised case)
  – Learn: f(x) → y
• Speech recognition: vector time series, categorical labels not at the sample level:
  – Given: X_i = x_1 x_2 … x_T where x_t ∈ R^d and Y_i = y_1 y_2 … y_n
  – Notice: n != T
  – Learn: f(X) → Y
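To make the mismatch between n and T concrete, here is a minimal sketch of one training pair. The feature dimension, the number of frames, and the word labels are all assumed toy values, not taken from the lecture.

```python
import random

random.seed(0)
d = 13   # feature dimension (assumed value)
T = 50   # number of acoustic frames in the utterance (assumed value)

# X: a time series of T vectors, each in R^d (random numbers stand in for real features).
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(T)]

# Y: a categorical label sequence for the whole utterance, not one label per frame.
Y = ["he", "sold", "encyclopedias", "to", "her"]
n = len(Y)

print(f"T = {len(X)} frames of dimension {len(X[0])}, n = {n} labels")
```

Nothing in the pair says which frames belong to which label, which is why the alignment between X and Y has to be inferred.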

Speech is Rich with Structure
• Observed: (figure)
• Acoustic: (figure)
• Acoustic-phonetic: voiced unvoiced voiced unvoiced voiced unvoiced voiced unvoiced
• Phonetic: en s ai k l ow p iy d iy aa s
• Lexical: encyclopedias
• Grammatical: (he sold, NP, to her)
• Semantic: {book, reference, knowledge, wikipedia}
Speech recognition requires modeling of all levels of hierarchy

Automatic Speech Recognition Pipeline
Acoustic Front-End → x_1 x_2 … x_T (where x_t ∈ R^d) → Decoder → w_1 w_2 w_3 …
Decoder inputs: Acoustic Model, Pron. Lexicon, Language Model

All Speech Processing Begins Here (in one form or another)
• Fourier Analysis: decompose the signal into a sum of sinusoids across the whole range of frequency
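A minimal sketch of this decomposition, assuming NumPy and a synthetic signal built from two sinusoids. The sampling rate, frequencies, and amplitudes are illustrative values only.

```python
import numpy as np

fs = 8000                      # sampling rate in Hz (assumed)
t = np.arange(fs) / fs         # one second of sample times

# Synthetic signal: a 440 Hz sinusoid plus a weaker 1000 Hz sinusoid (assumed values).
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.fft.rfft(signal)                     # complex Fourier coefficients
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)   # frequency of each coefficient in Hz
magnitude = np.abs(spectrum) * 2 / len(signal)     # amplitude of each sinusoidal component

# The two largest magnitudes sit at the two component frequencies.
peaks = sorted(float(f) for f in freqs[np.argsort(magnitude)[-2:]])
print(peaks)
```

Because the signal lasts exactly one second, the frequency bins are 1 Hz apart and both components fall on a bin, so the recovered amplitudes are close to 1.0 and 0.5.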

Beads on a String
Speech is a quasi-stationary signal
Time →

Short-time Analysis for Quasi-stationary Signals
[Figure: a short segment of duration ΔT at time t_0 is Fourier transformed to give s(f, t_0), the spectrum of the short segment; axes: frequency, time]
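The short-time analysis above can be sketched as follows, assuming NumPy. The function name `short_time_spectrum`, the 25 ms window, the 10 ms hop, and the Hamming taper are illustrative choices, not values stated in the lecture.

```python
import numpy as np

def short_time_spectrum(signal, fs, win_ms=25.0, hop_ms=10.0):
    """Return (|s(f, t0)|, freqs): one magnitude spectrum per short window.

    Rows index the window start time t0, columns index frequency.
    """
    win = int(round(fs * win_ms / 1000))    # window length ΔT in samples
    hop = int(round(fs * hop_ms / 1000))    # shift between successive windows
    window = np.hamming(win)                # taper to reduce edge effects
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    return np.abs(np.fft.rfft(frames, axis=1)), freqs

fs = 8000
t = np.arange(fs) / fs
# Toy quasi-stationary signal: 400 Hz for the first half second, then 1600 Hz.
signal = np.where(t < 0.5, np.sin(2 * np.pi * 400 * t), np.sin(2 * np.pi * 1600 * t))

S, freqs = short_time_spectrum(signal, fs)
print(S.shape)                                        # (frames, frequency bins)
print(freqs[S[0].argmax()], freqs[S[-1].argmax()])    # dominant frequency, first and last frame
```

A single Fourier transform of the whole signal would show both frequencies but not when each occurs; the per-window spectra keep that timing.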

[Figure: frequency [kHz], 0 to 4, versus time [s], 0 to 6, for the vowels /a:/ /ε:/ /i:/ /o:/ /u:/]

Removing Speaker Characteristics
• All speech recognition front-ends attempt to remove speaker dependent factors (so do speaker recognizers!)
• Typically accomplished using spectral smoothing of various types
[Figure: frequency [kHz], 0 to 4, versus time [s], 0 to 6]

Acoustic Front-ends
• The standards:
  – Mel Frequency Cepstral Coefficients (MFCCs)
    • Mel Scale and discrete cosine transform
  – Perceptual Linear Prediction (PLPs)
    • Bark Scale and linear prediction (and typically DCT)
• Data driven:
  – Neural Network Posteriorgrams
    • Use phonetically transcribed training data to train ANNs
• Recent trends:
  – Deep belief network pre-training
  – Spectro-temporal receptive fields (2D Gabors)
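As a rough illustration of the "Mel Scale and discrete cosine transform" recipe, here is a simplified MFCC sketch assuming NumPy. The mel formula used is one common convention, and the filter count, coefficient count, and helper names are assumptions. Production front-ends add steps (pre-emphasis, liftering, energy terms) that are omitted here.

```python
import numpy as np

def hz_to_mel(f):
    # One widely used mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters equally spaced on the mel scale, shape (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):        # rising edge of the triangle
            fb[i, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge of the triangle
            fb[i, k] = (right - k) / (right - center)
    return fb

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(power_spectrum, fs, n_filters=26, n_ceps=13):
    """power_spectrum: (frames, n_fft//2 + 1). Returns (frames, n_ceps)."""
    n_fft = 2 * (power_spectrum.shape[-1] - 1)
    fb = mel_filterbank(n_filters, n_fft, fs)
    log_energies = np.log(power_spectrum @ fb.T + 1e-10)   # smooth, then compress
    return dct_ii(log_energies, n_ceps)                    # decorrelate and truncate

rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 256))                # stand-in for 5 windowed speech frames
power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
coeffs = mfcc(power, fs=8000)
print(coeffs.shape)   # (5, 13)
```

Both the wide mel filters and the truncation to a few low-order cepstral coefficients act as spectral smoothing.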

Standard Front-end Tricks
• Velocity and Acceleration features
  – Interested in changes (edges)
  – Instantaneous is noisy, so we average (slope of line fit to several points in trajectory)
• Temporal Context (+ LDA or PCA)
  – Form supervectors from several neighboring observations
  – Learn to reduce dimension with or without labeled data
• Cepstral Mean Subtraction
  – Compensation for convolutional noise (e.g. channel/microphone variation)
  – s' = s * n ⇒ S'(f) = S(f) N(f) ⇒ <log S'(f)> = <log S(f)> + log N(f), where * denotes convolution
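The three tricks above can be sketched in a few lines, assuming NumPy and a feature matrix with one row per frame. The function names, the window sizes, and the edge-padding choice are illustrative assumptions.

```python
import numpy as np

def delta(features, half_window=2):
    """Velocity features: slope of a least-squares line fit over 2*half_window+1 frames."""
    T = len(features)
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")
    offsets = np.arange(-half_window, half_window + 1)
    out = np.zeros(features.shape, dtype=float)
    for k in offsets:
        out += k * padded[half_window + k : half_window + k + T]
    return out / np.sum(offsets ** 2)

def stack_context(features, context=4):
    """Supervectors: concatenate each frame with `context` neighbors on either side."""
    T = len(features)
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i : i + T] for i in range(2 * context + 1)])

def cepstral_mean_subtract(cepstra):
    """Remove the per-utterance mean, which absorbs a fixed channel term log N(f)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
cepstra = rng.standard_normal((100, 13))        # stand-in for log-domain features
channel = rng.standard_normal((1, 13))          # constant offset added by a channel

# The same speech through a different channel gives the same features after CMS.
same = np.allclose(cepstral_mean_subtract(cepstra),
                   cepstral_mean_subtract(cepstra + channel))
print(same)
print(delta(cepstra).shape, stack_context(cepstra).shape)
```

Applying `delta` to its own output gives acceleration features. A dimension-reducing projection such as PCA or LDA would then be learned on the stacked supervectors.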

A Biologically Inspired Alternative

Filters Inspired by STRFs

Spectro-Temporal Modulation Features

Decoder
Acoustic Front-End → X = x_1 x_2 … x_T (where x_t ∈ R^d) → Decoder → W = w_1 w_2 w_3 …
Decoder inputs: Acoustic Model, Pron. Lexicon, Language Model
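The slide does not state the decoding criterion. The standard formulation, given here as an assumption about what the diagram implies, picks the word sequence with the highest posterior probability given the observations:

```latex
\[
  \hat{W} \;=\; \arg\max_{W} \, P(W \mid X)
          \;=\; \arg\max_{W} \, p(X \mid W) \, P(W)
\]
```

In this reading, p(X | W) is supplied by the acoustic model together with the pronunciation lexicon, which maps each word in W to phonetic units, and P(W) is supplied by the language model.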

Acoustic Model
• Most acoustic models (AMs) are characterized in terms of phonemes
  – Phonemes are the atomic sounds of a given language
  – E.g. Cat = /k ae t/, Robot = /r ow b aa t/, The = /dh ah/ OR /th iy/
  – Natural classes exist in terms of confusions and production mechanisms
  – About 45 phones in English (depends on how you count)
• Phonetic AMs allow sharing of observations across context, reducing training data dependence
• Phonetic AMs allow generalization to new words (given pronunciation lexicon)
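A toy illustration of how a pronunciation lexicon connects words to phonetic units, using only the three example entries from the slide. The dictionary layout and the function name `pronunciations` are hypothetical.

```python
# Toy pronunciation lexicon built from the slide's examples; each word maps to
# one or more phone sequences.
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "robot": [["r", "ow", "b", "aa", "t"]],
    "the": [["dh", "ah"], ["th", "iy"]],
}

def pronunciations(words, lexicon=LEXICON):
    """Expand a word sequence into every phone sequence the lexicon allows."""
    results = [[]]
    for word in words:
        variants = lexicon[word.lower()]   # raises KeyError for out-of-lexicon words
        results = [prefix + variant for prefix in results for variant in variants]
    return results

print(pronunciations(["the", "cat"]))
# [['dh', 'ah', 'k', 'ae', 't'], ['th', 'iy', 'k', 'ae', 't']]
```

Supporting a new word only requires a new lexicon entry, since the acoustic models for its phones are already trained.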

Modeling Individual Observations
• Each phonetic class is modeled with a Gaussian Mixture Model (GMM)
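A minimal sketch of scoring one observation under a phonetic class's GMM, assuming NumPy. Diagonal covariances are an assumption made here for brevity, and the two toy classes and their parameters are invented for illustration.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under a diagonal-covariance GMM.

    x: (d,)   weights: (M,)   means, variances: (M, d)
    """
    d = x.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    m = log_components.max()                       # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(log_components - m)))

# Two hypothetical phonetic classes, each with a 2-component GMM in 2 dimensions.
classes = {
    "aa": (np.array([0.5, 0.5]),
           np.array([[0.0, 0.0], [1.0, 1.0]]),
           np.ones((2, 2))),
    "s": (np.array([0.5, 0.5]),
          np.array([[5.0, 5.0], [6.0, 4.0]]),
          np.ones((2, 2))),
}

x = np.array([0.8, 0.9])
scores = {name: gmm_log_likelihood(x, *params) for name, params in classes.items()}
best = max(scores, key=scores.get)
print(best)
```

Evaluating every class's GMM on a frame gives the per-frame acoustic scores that the temporal model later combines.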

Context Dependent Phonemes
• Increase model complexity with context dependent phones:
  – One class for each phone in a particular phonetic context
  – E.g. triphones: (aa: k, t) OR (t: s, iy)
  – Not all 45^3 possibilities occur, so a fair amount of pruning is done
• Typically: pool of Gaussians shared by GMMs for all context dependent phonetic units (simple means of parameter sharing)
• Decision trees typically used to prune and determine how best to share parameters
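A small sketch of expanding a phone sequence into triphone units and of the size of the unpruned inventory. It assumes the slide's notation means (center: left, right). The boundary symbol "sil" and the function name `triphones` are hypothetical.

```python
def triphones(phones, boundary="sil"):
    """Return one (center: left, right) unit per phone, padding the ends with a boundary symbol."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"({padded[i]}: {padded[i - 1]}, {padded[i + 1]})"
            for i in range(1, len(padded) - 1)]

print(triphones(["k", "ae", "t"]))
# ['(k: sil, ae)', '(ae: k, t)', '(t: ae, sil)']

n_phones = 45
n_possible = n_phones ** 3   # every (left, center, right) combination
print(n_possible)            # 91125
```

With 91125 possible units, most would have few or no training examples, which motivates the pruning and parameter sharing.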

GMM Training w/ Expectation-Maximization
• E-step: Given current GMM parameters θ, compute the posterior probability of each GMM component given the observation:
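The equation from the slide did not survive extraction. The standard form of this posterior is given below as a reconstruction, with assumed notation: x_t is the observation, M the number of components, and w_m, μ_m, Σ_m the weight, mean, and covariance of component m.

```latex
\[
  \gamma_t(m) \;=\; P(m \mid x_t; \theta)
  \;=\; \frac{w_m \, \mathcal{N}(x_t; \mu_m, \Sigma_m)}
             {\sum_{j=1}^{M} w_j \, \mathcal{N}(x_t; \mu_j, \Sigma_j)}
\]
```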

GMM Training w/ Expectation-Maximization
• M-step: Compute the new expected maximum likelihood estimates θ' of the GMM means and covariances:
• Iterate E and M step until the total data likelihood converges
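The slide's update equations did not survive extraction, so the following is a sketch of the usual EM loop rather than the lecture's exact formulas. It assumes NumPy and diagonal covariances, and it also re-estimates the mixture weights. The function names, the caller-supplied initial means, and the variance floor are illustrative choices.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of each row of X (N, d) under each diagonal Gaussian (M, d); returns (N, M)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                   + np.sum(diff ** 2 / variances[None, :, :], axis=2))

def em_step(X, weights, means, variances, var_floor=1e-6):
    # E-step: posterior probability of each component given each observation.
    log_joint = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1)
    post = np.exp(log_joint - log_evidence[:, None])          # (N, M)

    # M-step: posterior-weighted maximum likelihood estimates.
    counts = post.sum(axis=0)                                  # soft counts per component
    new_weights = counts / len(X)
    new_means = (post.T @ X) / counts[:, None]
    new_vars = (post.T @ X ** 2) / counts[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, var_floor)                 # guard against collapse

    # Total log-likelihood of the data under the parameters passed in.
    return new_weights, new_means, new_vars, log_evidence.sum()

def fit_gmm(X, init_means, init_var=1.0, n_iter=50, tol=1e-6):
    means = np.array(init_means, dtype=float)
    variances = np.full(means.shape, float(init_var))
    weights = np.full(len(means), 1.0 / len(means))
    history, prev = [], -np.inf
    for _ in range(n_iter):
        weights, means, variances, ll = em_step(X, weights, means, variances)
        history.append(ll)
        if ll - prev < tol:          # stop once the likelihood has converged
            break
        prev = ll
    return weights, means, variances, history

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-5.0, 1.0, (200, 1)), rng.normal(5.0, 1.0, (200, 1))])
weights, means, variances, history = fit_gmm(X, init_means=[[-1.0], [1.0]])
print(np.sort(means.ravel()), len(history))
```

The recorded log-likelihoods never decrease from one iteration to the next, which is the convergence behavior the stopping rule relies on.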

But We Don't Have Frame Labels!
• We will also need to use E-M algorithm to decide which frames in training data belong to which phonetic class
• But: We first have some temporal constraints to exploit

Trajectories should be smooth

Modeling Temporal Dynamics
• Beads on a string model:
  Time →
• Enter the Hidden Markov Model (HMM):

Typical Phone HMM Topology
• Three states per context dependent phone unit (states model entry, stable part, and exit of the phoneme)
• In total, three states per context dependent phone, O(48^3) context dependent units per phoneme (a very large number)
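A sketch of the three-state left-to-right topology and of using it to assign frames to states with the Viterbi algorithm, assuming NumPy. The self-loop probability, the toy emission scores, and the choice to make the last state absorbing (a real phone model would have an exit transition into the next phone) are all illustrative assumptions.

```python
import numpy as np

def left_to_right_transitions(n_states=3, self_loop=0.6):
    """Each state either repeats or advances to the next; the last state only repeats here."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states):
        if s + 1 < n_states:
            A[s, s] = self_loop
            A[s, s + 1] = 1.0 - self_loop
        else:
            A[s, s] = 1.0
    return A

def viterbi_align(log_obs, A):
    """Most likely state path that starts in state 0 and ends in the last state.

    log_obs: (T, S) log-likelihood of each frame under each state.
    """
    T, S = log_obs.shape
    with np.errstate(divide="ignore"):
        log_A = np.log(A)                       # forbidden transitions become -inf
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_obs[0, 0]                 # must start in the entry state
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_A    # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_obs[t]
    path = [S - 1]                              # must end in the exit state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

A = left_to_right_transitions()

# Toy emission scores for 6 frames: each row strongly favors one state.
favored = [0, 0, 1, 1, 1, 2]
obs = np.full((6, 3), 0.1)
obs[np.arange(6), favored] = 0.8
path = viterbi_align(np.log(obs), A)
print(path)   # [0, 0, 1, 1, 1, 2]
```

The topology forces the path to move through entry, stable part, and exit in order, so frames can be assigned to states without frame-level labels.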
