

SLIDE 1: Speech recognition in systems for human-computer interaction

Ubiquitous Computing Seminar FS2014
Niklas Hofmann | 13.5.2014

SLIDE 2: Why speech recognition?

Example: Google Voice Search on Android

Source: http://www.freepixels.com/index.php?action=showpic&cat=20&pic

SLIDE 3: Speech processing

§ Speaker recognition
  § Speaker identification
  § Speaker verification
§ Speech recognition

SLIDE 4: Speaker verification

§ User claims an identity
§ Binary decision: either the identity claim is accepted or «access» is denied
§ Enrollment
§ Text-dependent vs. text-independent

SLIDE 5: Speaker identification

§ No a priori identity claim
§ Enrollment
§ Open vs. closed group
§ Text-dependent vs. text-independent

SLIDE 6: Speech recognition

§ Recognize spoken language
§ Speaker-independent vs. speaker-dependent
§ Restricted input vs. «speech-to-text»
§ No predefined usage:
  § Commands
  § Data input
  § Transcription

SLIDE 7: Speech processing stages

Signal generation → Signal capturing → Preconditioning → Feature extraction → «Pattern matching» → System output

SLIDE 8: Signal generation

Source: Discrete-time speech signal processing | T. Quatieri | 2002

SLIDE 9: Signal generation

§ Simplified vocal tract, time-invariant for a short time
§ Source modeled as:
  § Periodic signal
  § Noise
§ Speech as an overlay of source and resonance

Source: Introduction to Speech Processing | Ricardo Gutierrez-Osuna | CSE@TAMU 2011

SLIDE 10: Signal capturing / preconditioning

§ Microphone
  § Bandwidth
  § Quality (better quality → easier to detect features)
§ Ambience
  § Noise
  § Echo
§ Start/endpoint detection
§ Normalization
§ Emphasize relevant frequencies (similar to human hearing)

SLIDE 11: Feature extraction

§ Signal framing
  § Vocal tract static within a small frame (20-40 ms)
§ Performed on either:
  § Waveform
  § Spectrum
  § Cepstrum
  § Mix of all
§ Techniques used:
  § Linear prediction
  § Cepstral coefficients
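The framing step described above can be sketched in a few lines. The 16 kHz sample rate and the 25 ms frame / 10 ms hop are assumptions for illustration (the slide only gives 20-40 ms as the typical frame range):

```python
# Hypothetical framing sketch: split a signal into overlapping frames,
# assuming 16 kHz audio, 25 ms frames (400 samples) and a 10 ms hop (160).

def frame_signal(signal, frame_len=400, hop=160):
    """Return a list of fixed-length frames (trailing samples are dropped)."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames

signal = [0.0] * 16000          # one second of "audio" at 16 kHz
frames = frame_signal(signal)   # 98 overlapping 25 ms frames
```

Overlapping frames are used so that no speech event falls on a frame boundary; each frame is then treated as quasi-stationary, as the slide states.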

SLIDE 12: Framing

SLIDE 13: Framing

SLIDE 14: Waveform

SLIDE 15: Spectrum

§ Transform a sample from the time domain to the frequency domain
§ Invention of the FFT (1965) very helpful
§ Gives insight into the periodicity of a signal
§ Sensitive to framing (→ window functions)
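A minimal sketch of the time-to-frequency transform: a naive DFT is used for readability (a real system would use the FFT the slide mentions), and a Hamming window softens the framing sensitivity. The tone frequency and frame length are made up:

```python
# Naive DFT of a Hamming-windowed frame. Illustrative only: O(N^2),
# whereas an FFT does the same job in O(N log N).
import cmath
import math

def hamming(n, N):
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))

def dft_magnitudes(frame):
    N = len(frame)
    windowed = [x * hamming(n, N) for n, x in enumerate(frame)]
    return [abs(sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2)]   # keep the non-redundant half

# A pure 1 kHz tone sampled at 8 kHz: energy concentrates in one bin.
N = 64
frame = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(N)]
mags = dft_magnitudes(frame)       # peak at bin 1000 * 64 / 8000 = 8
```

For speech, the interesting output is which frequency bins carry energy; the window trades a slightly wider peak for much smaller leakage into neighboring bins.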

SLIDE 16: Spectrum

SLIDE 17: Linear prediction

Source: Linear Prediction | Alan O Cinnéide | Dublin Institute of Technology | 2008
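As an illustration of the linear-prediction idea (a sketch, not the cited source's method): predict each sample as a weighted sum of past samples. An order-1 predictor fitted by least squares is shown; real speech front-ends use order ~10-16 and solve the normal equations, e.g. via Levinson-Durbin:

```python
# Order-1 linear prediction: find a so that x[n] ≈ a * x[n-1] in the
# least-squares sense. The closed form is the ratio of two correlations.

def lp_order1_coefficient(x):
    """Least-squares coefficient a minimizing sum (x[n] - a*x[n-1])^2."""
    num = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    den = sum(x[n - 1] ** 2 for n in range(1, len(x)))
    return num / den

# An exponentially decaying signal is perfectly predicted by a = 0.9.
x = [0.9 ** n for n in range(50)]
a = lp_order1_coefficient(x)
```

The predictor coefficients compactly describe the vocal-tract resonances, which is why they work as features.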

SLIDE 18: Cepstral coefficients
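One common definition, sketched here under the assumption that the slide means the real cepstrum: the inverse transform of the log-magnitude spectrum. Naive DFTs are used for readability; production systems use FFTs and usually mel-filtered variants (MFCCs):

```python
# Real cepstrum sketch: cepstrum = IDFT(log |DFT(frame)|).
# Low cepstral coefficients capture the vocal-tract envelope.
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def real_cepstrum(frame):
    # Small epsilon guards log(0) for empty bins.
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(frame)]
    N = len(log_mag)
    # Inverse DFT of a real, even sequence is real: keep the real part.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N)
                for k in range(N)).real / N
            for n in range(N)]

frame = [math.sin(2 * math.pi * n / 8) for n in range(32)]
ceps = real_cepstrum(frame)
```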

SLIDE 19: «Pattern matching»

§ «Detect» speech units (phonemes / words) out of a series of feature vectors
§ Two main ideas:
  § Template matching
    § «Simple» matching
    § Dynamic time warping
  § Statistical
    § Hidden Markov model

SLIDE 20: «Simple» matching

§ Calculates the distance from sample to template
§ Simple to implement
§ Assumes sample and template have the same length / speed
§ Very sensitive to different speech patterns (length, pronunciation)
§ No widespread use anymore
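The «simple» matching described above can be sketched directly; the two-dimensional feature vectors and the words "yes"/"no" are made up for illustration:

```python
# «Simple» template matching: sum of per-frame Euclidean distances between
# equal-length feature sequences, then nearest-template classification.
import math

def distance(sample, template):
    """Equal-length sequences only - the key limitation of this method."""
    assert len(sample) == len(template)
    return sum(math.dist(s, t) for s, t in zip(sample, template))

templates = {
    "yes": [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2)],   # made-up feature vectors
    "no":  [(0.0, 1.0), (0.1, 0.9), (0.2, 0.8)],
}
sample = [(0.95, 0.05), (0.85, 0.15), (0.75, 0.25)]
best = min(templates, key=lambda w: distance(sample, templates[w]))
```

The `assert` makes the weakness explicit: a slightly slower utterance has more frames than the template and simply cannot be compared, which motivates dynamic time warping.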

SLIDE 21: Dynamic time warping (DTW)

§ Tries to «correct» a slower/faster sample with respect to the template
§ Uses metrics to disallow too much «warping»
§ Still calculates a «distance» between sample and template
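The warping idea can be sketched with the classic dynamic-programming recurrence. Unit step weights are assumed; production systems add slope constraints, the «metrics» mentioned above, to disallow extreme warps:

```python
# Dynamic time warping: O(n*m) DP that aligns a slower/faster sample
# against a template and returns the total alignment cost.
def dtw(sample, template, dist=lambda a, b: abs(a - b)):
    n, m = len(sample), len(template)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(sample[i - 1], template[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # sample frame repeated
                                 D[i][j - 1],      # template frame skipped
                                 D[i - 1][j - 1])  # frames matched
    return D[n][m]

# A sample spoken at half speed still aligns perfectly with the template.
template = [1, 2, 3, 2, 1]
slow_sample = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
```

`dtw(slow_sample, template)` is 0 here because the warp absorbs the speed difference entirely; «simple» matching could not compare these two at all.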

SLIDE 22: Dynamic time warping (DTW)

Source: Speech Synthesis and Recognition | John Holmes and Wendy Holmes | 2nd Edition

SLIDE 23: Hidden Markov model (HMM)

§ Models speech as a process with hidden states and observable features
§ Each unit (e.g. word) matched to its own process
§ Gives the probability that the sample was generated by a certain process
§ Described by:
  § A set of N states S_1, …, S_N
  § A state transition matrix A
  § A probability density function for the observations in each state, b_j

SLIDE 24: Hidden Markov model (HMM)

§ Example: weather
  § State 1: rain / snow
  § State 2: cloudy
  § State 3: sunny

(State diagram: Rain, Cloudy, Sunny)
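The weather example can be written down as a plain (fully observable) Markov chain; the transition probabilities below are illustrative, not taken from the slide:

```python
# Weather as a Markov chain: three states and a row-stochastic transition
# matrix A, where A[s][t] = P(tomorrow = t | today = s).
A = {
    "rain":   {"rain": 0.4, "cloudy": 0.3, "sunny": 0.3},  # illustrative
    "cloudy": {"rain": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rain": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(seq):
    """P(seq[1:] | seq[0]) under the chain: product of step probabilities."""
    p = 1.0
    for today, tomorrow in zip(seq, seq[1:]):
        p *= A[today][tomorrow]
    return p

p = sequence_probability(["sunny", "sunny", "rain", "cloudy"])
# p = 0.8 * 0.1 * 0.3 = 0.024
```

Here the state *is* the observation; the next slide removes exactly that assumption, which is what makes the model "hidden".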

SLIDE 25: Hidden Markov model (HMM)

§ A state is not necessarily mapped to one observation:
  § Multiple observations possible in one state
  § Each observation has a different probability of being seen
  § E.g. a series of «heads» and «tails» can be generated by a single coin or by two or more different coins (we do not know which coin is thrown when)

Source: Tutorial on Hidden Markov Models | L. R. Rabiner | 1989
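The coin example can be turned into a true HMM, with the forward algorithm summing over all hidden coin sequences to get the probability of an observation sequence. Every number below is made up for illustration:

```python
# Two hidden coins; we see only heads/tails, never which coin was thrown.
pi = [0.5, 0.5]                      # initial coin probabilities
A = [[0.7, 0.3],                     # coin-switching probabilities
     [0.3, 0.7]]
B = [{"H": 0.9, "T": 0.1},           # coin 0 is biased toward heads
     {"H": 0.2, "T": 0.8}]           # coin 1 is biased toward tails

def forward(obs):
    """P(obs) via the forward algorithm: O(T * N^2) instead of O(N^T)."""
    alpha = [pi[s] * B[s][obs[0]] for s in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(2)) * B[t][o]
                 for t in range(2)]
    return sum(alpha)

p_heads = forward(["H", "H", "H"])
p_tails = forward(["T", "T", "T"])
```

In speech recognition this is exactly the quantity from the previous slide: the probability that a feature-vector sequence was generated by a given unit's HMM.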

SLIDE 26: Applying HMM to speech recognition

§ Idea: generate one HMM per word
  § Very complex for longer words
  § Recognition of words not in the training set impossible/improbable
§ Divide words into subunits (phonemes)
  § E.g. cat → /k/ + /a/ + /t/
§ Train one HMM per phoneme (~45 for English)
§ Chain HMMs together to recognize words / sentences

SLIDE 27: Applying HMM to speech recognition

§ One possible model:
  § 1 state for the transition in: /sil/ → /a/
  § 1 state for the middle: /a/
  § 1 state for the transition out: /a/ → /sil/
§ Phoneme-level HMMs still not accurate enough
§ Context can alter the sound of a phoneme
§ Use context-dependent models

SLIDE 28: Applying HMM to speech recognition

§ Triphone example: cat
  § First triphone: /sil/ → /k/ → /a/
  § Second triphone: /k/ → /a/ → /t/
  § Third triphone: /a/ → /t/ → /sil/
§ Solves context sensitivity but at high computation cost:
  § 45 phonemes → 45³ = 91,125 different models (not all needed)
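The triphone expansion shown above for "cat" can be generated mechanically; /sil/ padding at the word boundaries is assumed, as on the slide:

```python
# Expand a phoneme sequence into triphones, padding with silence (/sil/)
# so that word-initial and word-final context is explicit.
def triphones(phonemes, sil="sil"):
    padded = [sil] + list(phonemes) + [sil]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

cat = triphones(["k", "a", "t"])
# [('sil', 'k', 'a'), ('k', 'a', 't'), ('a', 't', 'sil')]
```

A word of n phonemes always yields n triphones, but the *model inventory* grows cubically, which is the 45³ = 91,125 figure on the slide; real systems cluster similar triphones to keep this tractable.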

SLIDE 29: DTW vs. HMM

§ Performed with 16 speakers (8:8 male:female)
§ Utterances of the digits 0-9
§ Also compared linear prediction to cepstral coefficients

Source: Comparison of DTW and HMM | S. C. Sajjan | 2012

SLIDE 30: Speech processing stages

Signal generation → Signal capturing → Preconditioning → Feature extraction → «Pattern matching» → System output

SLIDE 31: Speech recognition on mobile devices

§ Limited power supply
  § Prevent frequent unneeded activation of the system
§ Limited storage
  § Tradeoff between size and performance of speech and language models
§ Limited computing power
  § Tradeoff between accuracy and speed
§ Long training undesirable

SLIDE 32: Performance on mobile device

§ Comparison of DTW to HMM on a mobile device (2009)
  § 500 MHz CPU
§ Detection of keywords of a specific user
§ Data set of 30 people
  § 7 females and 23 males
  § Speaking 6 words (4-11 phonemes)
  § Each word repeated 10 times

SLIDE 33: Real time factor

Source: Voice Trigger System | H. Lee, S. Chang, D. Yook, Y. Kim | 2009

SLIDE 34: Error rate

§ Measured the «equal error rate»
  § Acceptance threshold set to equalize the false positive and false negative rates
§ Dynamic time warping: ~14% error rate
§ Hidden Markov model: down to ~9% error rate
§ Heavily dependent on the amount of training data
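A hedged sketch of how an equal error rate can be found by sweeping the acceptance threshold; the score lists are invented, not the cited paper's data:

```python
# Equal error rate (EER): sweep the acceptance threshold until the
# false-positive rate (impostors accepted) meets the false-negative rate
# (genuine users rejected). Higher score = more likely the claimed speaker.
def rates(genuine, impostor, threshold):
    fnr = sum(s < threshold for s in genuine) / len(genuine)
    fpr = sum(s >= threshold for s in impostor) / len(impostor)
    return fpr, fnr

def equal_error_rate(genuine, impostor, steps=1000):
    lo, hi = min(impostor), max(genuine)
    best = None
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        fpr, fnr = rates(genuine, impostor, t)
        if best is None or abs(fpr - fnr) < abs(best[0] - best[1]):
            best = (fpr, fnr)
    return (best[0] + best[1]) / 2

genuine = [0.9, 0.8, 0.75, 0.6, 0.4]    # made-up true-speaker scores
impostor = [0.5, 0.45, 0.3, 0.2, 0.1]   # made-up impostor scores
eer = equal_error_rate(genuine, impostor)
```

The EER is a single-number summary: moving the threshold trades false acceptances against false rejections, and the EER is the point where neither error type is favored.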

SLIDE 35: Hidden Markov model

Source: Voice Trigger System | H. Lee, S. Chang, D. Yook, Y. Kim | 2009

SLIDE 36: What about modern cloud-based systems?

§ Multiple «consumer grade» systems deployed:
  § 2008: Google Voice Search for Mobile app on the iPhone
  § 2011: Apple launches Siri on iOS
  § 2011: Google adds Voice Search to Google.com

SLIDE 37: A closer look at Google Voice Search

§ Experiments done with 39-dimensional LP-cepstral coefficients
§ Uses a triphone system
§ Relies heavily on a language model to decrease computation and increase accuracy

SLIDE 38: Language model

§ Learned from typed search queries on google.com
§ Trained on over 230 billion words
§ Also accounts for different locales

Out-Of-Vocabulary (OOV) rate, i.e. percentage of words unknown to the language model:

                     Test Locale
Training Locale    USA    GBR    AUS
USA                0.7    1.3    1.6
GBR                1.3    0.7    1.3
AUS                1.3    1.1    0.7

Source: Google Search by Voice: A case study | Google Inc.
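The out-of-vocabulary rate reported in the table is a simple lookup against the model's vocabulary; the vocabulary and query below are made up for illustration:

```python
# OOV rate: the share of test words missing from the language model's
# vocabulary, expressed as a percentage.
def oov_rate(test_words, vocabulary):
    unknown = sum(w not in vocabulary for w in test_words)
    return 100.0 * unknown / len(test_words)

vocab = {"weather", "in", "zurich", "pizza", "near", "me"}   # toy vocabulary
query = ["weather", "in", "timbuktu", "tonight"]             # toy test set
rate = oov_rate(query, vocab)   # 2 of 4 words unknown -> 50.0
```

An OOV word can never be recognized correctly, which is why the table's cross-locale rates roughly double relative to the matched-locale diagonal.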

SLIDE 39: A look into the future

§ Modern capabilities of computers enable more complex systems than ever
§ Rediscovery of artificial neural networks
§ But the problem is still not solved:
  § No automatic transcription of dialog

SLIDE 40: Thank you