
Speech recognition (Computer Literacy 1) - PDF document



  1. Speech recognition
     Computer Literacy 1, Lecture 22, 10/11/2008

     Topics
     - Definition of speech recognition
     - Brief history
     - Technology
     - How does speech recognition work
     - Speaker recognition
     - Problems of speech and speaker recognition

     Definition
     - Can also be called automatic speech recognition or computer speech recognition
     - Definition: speech recognition converts spoken words into machine-readable input by using binary code

     History - Homer Dudley
     - In the 1930s Homer Dudley created the first human voice synthesizer at Bell Labs
     - He started experimenting with electromechanical devices to produce analogues of human speech in the 1920s
     - His findings led to the patent for the “Vocoder” (voice + encoder), a method of reproducing speech through electronic means and allowing it to be transmitted over distances (e.g. telephone lines)

  2. Speech recognition - Voice recognition
     - What you can already see is that speech recognition and voice recognition can refer to the same technology
     - So you can treat these terms as synonyms
     - BUT there is also speaker recognition (which falls into the area of speech/voice recognition)

     The Vocoder
     - Originally developed as a speech coder for telecommunication
     - Primary use was secure radio communication, where the voice has to be encrypted before it is transmitted
     - Was used in the SIGSALY system for high-level communications during WW-II
     - Additionally, the Vocoder's hardware and software have been used as an electronic music instrument (Robert Moog, Kraftwerk, Pink Floyd)

     Technology
     - A speech signal is recorded by a microphone and captured with a sound card

     More Technology
     - The speech signal now has to pass through various stages
     - Here various mathematical and statistical methods are applied

  3. Inside the computer
     - After the voice input is captured on your sound card, the digital audio output of your card is processed using an FFT (Fast Fourier Transform)
     - This now already fine-tuned signal is further processed by an HMM (Hidden Markov Model)

     Fast Fourier Transforms (FFT)
     - The Fourier Transform is, in mathematics, an operation that transforms one function of a real variable into another
     - It works similarly to the way a chord of music we hear can be transcribed as the notes that are being played
     - The FFT is an algorithm to compute the Discrete Fourier Transform (DFT), which is one form of Fourier analysis

     Hidden Markov Model (HMM)
     - Simply said: an HMM figures out when speech starts and stops
     - It is a statistical model
     - An HMM can be considered the simplest dynamic Bayesian network

     HMM
     - x = states; y = possible observations; a = state transition probabilities; b = output probabilities
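The two stages on this page (frequency analysis first, then an HMM) can be sketched end to end in a few lines of Python. This is only a toy under stated assumptions: every probability is invented, the names `dft`, `tone`, `label` and `viterbi` are hypothetical, a naive DFT stands in for a real FFT (same result, just slower), and Viterbi decoding is used as one standard way to find the most likely state sequence, which the slides themselves do not name. The symbols `a` and `b` follow the slide's notation; the hidden states x are "silence" and "speech", the observations y are "quiet" and "loud".

```python
import cmath
import math

def dft(samples):
    """Naive Discrete Fourier Transform; an FFT computes the same values faster."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def tone(amplitude, cycles=1, n=8):
    """One frame of a sine wave completing `cycles` periods in `n` samples."""
    return [amplitude * math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

def label(frame, threshold=1.0):
    """Turn a frame into the observation 'loud' or 'quiet' via its strongest frequency."""
    peak = max(abs(c) for c in dft(frame)[: len(frame) // 2 + 1])
    return "loud" if peak > threshold else "quiet"

# Toy HMM parameters (all numbers invented for illustration).
states = ("silence", "speech")
start = {"silence": 0.8, "speech": 0.2}
a = {  # state transition probabilities
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
}
b = {  # output probabilities
    "silence": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.3, "loud": 0.7},
}

def viterbi(observations):
    """Return the most probable hidden-state sequence for the observations."""
    # best[s] holds (probability, path) of the best path that ends in state s.
    best = {s: (start[s] * b[s][observations[0]], [s]) for s in states}
    for y in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[p][0] * a[p][s] * b[s][y], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Two near-silent frames, three frames containing a tone, two near-silent frames.
frames = [tone(0.01)] * 2 + [tone(1.0)] * 3 + [tone(0.01)] * 2
observations = [label(frame) for frame in frames]
path = viterbi(observations)
print(observations)
print(path)
```

In this toy run the decoded path switches from "silence" to "speech" where the loud frames begin and back again where they end, which is the "when speech starts and stops" idea from the slide. A real recogniser works on far richer features than a single loud/quiet label.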

  4. Sound
     - Sound itself is analogue, which is why the signal has to be translated into a digital signal that speech recognition software can read
     - The FFT then transforms the incoming signal into a band of frequencies
     - When this is done, the next step is recognising these bands

     How does this work?
     - The speech recognition software has a database containing thousands of frequencies (phonemes)
     - A phoneme is the smallest unit of speech in a language or dialect
     - The sound of one phoneme is usually different from another, and this can change the meaning of a word
     - E.g. the sound ‘b’ in bat, ‘r’ in rat
     - The phoneme database is matched against the audio frequency bands that were sampled
     - Each phoneme is tagged with a feature number

     How does it figure out the right sound?
     - The software has to use complex techniques to approximate the sound and figure out which phonemes are used
     - One way of identifying relevant phonemes is to train your speech recognition software
     - Or you could prune your software for a particular speech

     Pruning
     - When pruning, the software generates several hypotheses on what could have been spoken
     - It then generates scores for these hypotheses and decides to go for the one with the highest score
     - The ones with the lower scores get pruned out
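The pruning step described above can be sketched as follows. This is a minimal illustration, not how any particular recogniser does it: the per-position phoneme probabilities in `observed` are invented, the word list is hypothetical, and scoring a word by multiplying its phoneme probabilities is an assumed, simple scoring rule.

```python
import math

# Hypothetical output of the phoneme-matching step: for each position in the
# utterance, how likely each phoneme is (invented numbers).
observed = [
    {"b": 0.6, "r": 0.3, "p": 0.1},
    {"a": 0.9, "i": 0.1},
    {"t": 0.8, "d": 0.2},
]

# Candidate words (hypotheses) and the phonemes they consist of.
hypotheses = {
    "bat": ["b", "a", "t"],
    "rat": ["r", "a", "t"],
    "bid": ["b", "i", "d"],
    "pad": ["p", "a", "d"],
}

def score(phonemes):
    """Score a hypothesis as the product of its per-position phoneme probabilities."""
    return math.prod(frame.get(p, 0.0) for frame, p in zip(observed, phonemes))

def prune(candidates, keep):
    """Rank hypotheses by score and keep only the `keep` best ones."""
    ranked = sorted(candidates, key=lambda word: score(candidates[word]), reverse=True)
    return ranked[:keep]

survivors = prune(hypotheses, keep=2)  # lower-scoring hypotheses are pruned out
best = prune(hypotheses, keep=1)[0]    # the hypothesis the software goes for
print(survivors, best)
```

With these made-up numbers "bat" scores highest, "rat" survives a pruning to two candidates, and "bid" and "pad" are dropped.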

  5. Train your Speech Recogniser
     - So your software has applied feature numbers to frequency bands
     - Now it uses statistics to figure out the probability of a particular feature number appearing in a phoneme
     - The feature number with the highest probability would correspond to the phoneme you've spoken

     More training
     - When you train your software, you feed it many variations of the same phoneme, and your software analyses all of these through statistical methods (e.g. using an HMM)
     - With the help of this great amount of training phonemes, your software again assigns feature numbers to specific frequency bands

     Speaker recognition
     - Speaker recognition = WHO is speaking
     - Speech recognition = WHAT is said
     - Identifying characteristics of one voice
     - Characteristics of a voice are e.g. pitch, melody, hoarse vs soft, frequency

     The 2 phases of speaker recognition
     - The speaker's voice is recorded and a number of individual features (characteristics) of the voice are used to make a voice print
     - In speaker verification this print is compared to a previously recorded template to verify your voice
     - In speaker identification your voice print is compared to multiple voice prints in order to determine the best match
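The difference between verification (one-to-one) and identification (one-to-many) can be sketched like this. It assumes a voice print is simply a short list of numeric features and that two prints are compared with cosine similarity; both are illustrative assumptions, and the names, the threshold and all feature values are invented.

```python
import math

def similarity(print_a, print_b):
    """Cosine similarity between two voice prints (feature vectors)."""
    dot = sum(x * y for x, y in zip(print_a, print_b))
    norm_a = math.sqrt(sum(x * x for x in print_a))
    norm_b = math.sqrt(sum(y * y for y in print_b))
    return dot / (norm_a * norm_b)

# Previously recorded templates (hypothetical speakers and feature values).
enrolled = {
    "alice": [0.9, 0.2, 0.4],
    "bob": [0.1, 0.8, 0.5],
    "carol": [0.3, 0.3, 0.9],
}

def verify(sample, claimed_speaker, threshold=0.95):
    """Verification: compare the sample to ONE template, the claimed speaker's."""
    return similarity(sample, enrolled[claimed_speaker]) >= threshold

def identify(sample):
    """Identification: compare the sample to ALL templates and return the best match."""
    return max(enrolled, key=lambda name: similarity(sample, enrolled[name]))

sample = [0.85, 0.25, 0.38]  # a new recording, turned into a voice print
print(verify(sample, "alice"), verify(sample, "bob"), identify(sample))
```

Verification answers a yes/no question about a claimed identity, so it needs a threshold; identification only needs the best-scoring template.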

  6. Possible Problems of Speech and Speaker Recognition
     - Speech recognition can't work perfectly, since people speak in different dialects and use all kinds of different pronunciations, HMMs can't always distinguish when speech starts and ends because background noise can be confused with speech, etc.
     - Speaker recognition fails as soon as your voice quality is different from your sample, e.g. when you have a cold, or because aging can have an effect on your voice, etc.

     Key points
     - The Vocoder, first speech synthesizer
     - Speech recognition and its technology
     - Fast Fourier Transformation
     - The Hidden Markov Model
     - Train and prune your recogniser
     - Speaker recognition involves verification and identification
     - We all speak so differently, and our voices change through life, which makes it very hard to be a good speech recogniser
