
Speech and Language (CS 188: Artificial Intelligence, Lecture 18: Speech)



  1. CS 188: Artificial Intelligence
     Lecture 18: Speech
     Pieter Abbeel --- UC Berkeley
     Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew Moore

     Speech and Language
     § Speech technologies
       § Automatic speech recognition (ASR)
       § Text-to-speech synthesis (TTS)
       § Dialog systems
     § Language processing technologies
       § Machine translation
       § Information extraction
       § Web search, question answering
       § Text classification, spam filtering, etc …

     Digitizing Speech
     Speech in an Hour
     § Speech input is an acoustic wave form
     [Figure: waveform of "speech lab", segmented as s, p, ee, ch, l, a, b, with a close-up of the "l" to "a" transition]
     Graphs from Simon Arnfield's web tutorial on speech, Sheffield: http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/

     Spectral Analysis
     § Frequency gives pitch; amplitude gives volume
       § Sampling at ~8 kHz (phone), ~16 kHz (mic) (kHz = 1000 cycles/sec)
     § Fourier transform of wave displayed as a spectrogram
       § Darkness indicates energy at each frequency
     [Figure: amplitude and frequency plots of "speech lab", segmented as s, p, ee, ch, l, a, b] [demo]

     Part of [ae] from "lab"
     § Complex wave repeating nine times
       § Plus smaller wave that repeats 4x for every large cycle
     § Large wave: freq of 250 Hz (9 times in .036 seconds)
     § Small wave roughly 4 times this, or roughly 1000 Hz
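The arithmetic on the [ae] slide can be checked numerically. The following is a minimal sketch, not part of the original slides: it synthesizes a 250 Hz wave plus a weaker 1000 Hz wave for 0.036 seconds at the 8 kHz phone rate, then takes a naive discrete Fourier transform of that single window (one column of a spectrogram) and reports the two strongest frequencies. The amplitudes 1.0 and 0.4 and all function names are assumptions chosen for illustration.

```python
import cmath
import math

SAMPLE_RATE = 8000                    # Hz, the "phone" rate quoted on the slide
DURATION = 0.036                      # seconds, the window from the [ae] slide
N = round(SAMPLE_RATE * DURATION)     # 288 samples


def synth_wave(n_samples, rate):
    """Large 250 Hz wave plus a smaller 1000 Hz wave (amplitudes are assumed)."""
    return [
        1.0 * math.sin(2 * math.pi * 250 * t / rate)
        + 0.4 * math.sin(2 * math.pi * 1000 * t / rate)
        for t in range(n_samples)
    ]


def dft_magnitudes(signal):
    """Naive O(N^2) discrete Fourier transform; returns |X[k]| for k < N/2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2)
    ]


def top_frequencies(signal, rate, count=2):
    """Frequencies (Hz) of the `count` strongest DFT bins, in ascending order."""
    mags = dft_magnitudes(signal)
    bin_width = rate / len(signal)    # Hz covered by each DFT bin
    strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)
    return sorted(k * bin_width for k in strongest[:count])


peaks = top_frequencies(synth_wave(N, SAMPLE_RATE), SAMPLE_RATE)
print(peaks)   # the two dominant frequencies, in Hz
```

A real spectrogram repeats this transform over successive short windows and plots the magnitudes as darkness; a fast Fourier transform would replace the naive loop in practice.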

  2. Resonances of the vocal tract
     § The human vocal tract as an open tube
       [Figure: tube with a closed end and an open end, length 17.5 cm]
     § Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube.
     § Constraint: Pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end.
     Figure from W. Barry Speech Science slides

     [demo] From Mark Liberman's website

     Vowel [i] sung at successively higher pitches
     [Figure: spectrogram panels (frequency axis) for pitches F#2, A2, C3, F#3, A3, C4 (middle C), A4]
     Figures from Ratree Wayland

     Acoustic Feature Sequence
     § Time slices are translated into acoustic feature vectors (~39 real numbers per slice)
       … e12 e13 e14 e15 e16 …
     § These are the observations; now we need the hidden states X

     State Space
     HMMs for Speech
     § P(E|X) encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
     § P(X|X') encodes how sounds can be strung together
     § We will have one state for each sound in each word
     § From some state x, can only:
       § Stay in the same state (e.g. speaking slowly)
       § Move to the next position in the word
       § At the end of the word, move to the start of the next word
     § We build a little state graph for each word and chain them together to form our state space X
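The state-space rules above can be sketched in a few lines. This is an illustrative construction under stated assumptions, not the lecture's implementation: a state is a (word, position) pair, the two-word lexicon and its phoneme spellings are hypothetical examples, and `build_state_graph` is a name invented here. Each state may stay put, advance within its word, or, at the end of a word, jump to the start of any word.

```python
def build_state_graph(lexicon, vocabulary):
    """Return {state: [allowed successor states]}; a state is (word, position)."""
    graph = {}
    for word in vocabulary:
        phones = lexicon[word]
        for pos in range(len(phones)):
            state = (word, pos)
            successors = [state]                       # stay (e.g. speaking slowly)
            if pos + 1 < len(phones):
                successors.append((word, pos + 1))     # next position in the word
            else:
                # End of the word: move to the start of the next word.
                successors.extend((w, 0) for w in vocabulary)
            graph[state] = successors
    return graph


# Hypothetical two-word lexicon; the phoneme spellings are assumptions.
LEXICON = {"speech": ["s", "p", "iy", "ch"], "lab": ["l", "ae", "b"]}
graph = build_state_graph(LEXICON, ["speech", "lab"])

for state, successors in graph.items():
    print(state, "->", successors)
```

Attaching a probability to each listed edge gives P(X|X'); every edge not listed has probability zero, which is what keeps the state space small.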

  3. Transitions with Bigrams
     Training Counts
       198015222    the first
       194623024    the same
       168504105    the following
       158562063    the world
       …
       14112454     the door
       -----------------
       23135851162  the *

     Decoding
     § While there are some practical issues, finding the words given the acoustics is an HMM inference problem
     § We want to know which state sequence x_{1:T} is most likely given the evidence e_{1:T}:
         x*_{1:T} = argmax over x_{1:T} of P(x_{1:T} | e_{1:T})
     § From the sequence x, we can simply read off the words
     Figure from Huang et al page 618
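Both ideas on this page can be tried on toy numbers. The sketch below is an assumption-laden illustration rather than the lecture's code: `bigram_prob` divides a count from the table by the "the *" total to estimate a word-to-word transition probability, and `viterbi` is a standard log-space Viterbi decoder for the most likely state sequence. The three-state model for "lab", its transition and emission probabilities, and the discrete observation symbols "L", "A", "B" are all invented for the example (real systems emit ~39-dimensional feature vectors).

```python
import math

# Counts copied from the slide: count("the", w) and the total count("the", *).
THE_COUNTS = {"first": 198015222, "same": 194623024, "following": 168504105,
              "world": 158562063, "door": 14112454}
THE_TOTAL = 23135851162


def bigram_prob(word):
    """Relative-frequency estimate of P(word | "the")."""
    return THE_COUNTS[word] / THE_TOTAL


def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising an error."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(states, start_p, trans_p, emit_p, observations):
    """Most likely state sequence given the observations (log-space Viterbi)."""
    best = {s: safe_log(start_p[s]) + safe_log(emit_p[s][observations[0]])
            for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor r for state s at this time step.
            r = max(states, key=lambda q: prev[q] + safe_log(trans_p[q][s]))
            best[s] = prev[r] + safe_log(trans_p[r][s]) + safe_log(emit_p[s][obs])
            pointers[s] = r
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model for the word "lab"; every probability below is an assumption.
STATES = ["l", "ae", "b"]
START = {"l": 1.0, "ae": 0.0, "b": 0.0}
TRANS = {"l": {"l": 0.5, "ae": 0.5, "b": 0.0},
         "ae": {"l": 0.0, "ae": 0.5, "b": 0.5},
         "b": {"l": 0.0, "ae": 0.0, "b": 1.0}}
EMIT = {"l": {"L": 0.8, "A": 0.1, "B": 0.1},
        "ae": {"L": 0.1, "A": 0.8, "B": 0.1},
        "b": {"L": 0.1, "A": 0.1, "B": 0.8}}

decoded = viterbi(STATES, START, TRANS, EMIT, ["L", "L", "A", "A", "A", "B"])
print(bigram_prob("door"))   # roughly 0.0006
print(decoded)
```

Working in log space avoids numerical underflow on long utterances; collapsing repeated states in the decoded path recovers the phoneme string, from which the words can be read off.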
