EECS E6870 Speech Recognition
Michael Picheny, Stanley F. Chen, Bhuvana Ramabhadran
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
{picheny,stanchen,bhuvana}@us.ibm.com
8 September 2009
■ converting speech to text
■ what it’s not
■ speech is potentially the fastest way people can communicate with machines
■ remote speech access is ubiquitous
■ archiving/indexing/compressing/understanding human speech
■ cover fundamentals of ASR in depth (weeks 1–9)
■ survey state-of-the-art techniques (weeks 10–13)
■ force you, the student, to implement key algorithms in C++
■ too much knowledge to fit in one brain
■ three lecturers (no TA?)
■ from IBM T.J. Watson Research Center, Yorktown Heights, NY
■ 1300 Mudd; 4:10-6:40pm Tuesday
■ hardcopy of slides distributed at each lecture
■ four programming assignments (80% of grade)
■ final reading project (undecided; 20% of grade)
■ weekly readings
■ C++ (g++ compiler) on x86 PC’s running Linux
■ extensive code infrastructure in C++, with SWIG to make it accessible from scripting languages
■ get account on ILAB computer cluster
■ labs due Wednesday at 6pm
■ PDF versions of readings will be available on the web site
■ recommended text (bookstore):
■ reference texts (library, online, bookstore, EE?):
■ in E-mail, prefix subject line with “EECS E6870:” !!!
■ Michael Picheny — picheny@us.ibm.com
■ Stanley F. Chen — stanchen@us.ibm.com
■ Bhuvana Ramabhadran — bhuvana@us.ibm.com
■ office hours: right after class; or before class by appointment
■ Courseworks
■ syllabus
■ slides from lectures (PDF)
■ lab assignments (PDF)
■ reading assignments (PDF)
■ feedback questionnaire after each lecture (2 questions)
■ EE’s may find CS parts challenging, and vice versa
■ you, the student, are partially responsible for quality of course
■ together, we can get through this
■ let’s go!
■ why is speech recognition hard?
■ ad hoc methods
■ maturation of statistical methods; basic HMM/GMM framework developed
■ more processing power, data
■ variations on a theme; tuning
■ demand from downstream technologies (search, translation)
■ speaker-independent single-word recognizer (“Rex”)
■ simple signal processing/feature extraction
■ many ideas central to modern ASR introduced, but not used all together
■ small vocabulary
■ not tested with many speakers (usually <10)
■ error rates < 10%
■ killed ASR research at Bell Labs for many years
■ partially served as impetus for first (D)ARPA program (1971–1976) funding speech understanding research
■ four competitors
■ HARPY won hands down
■ view speech recognition as . . .
■ downfall of trying to manually encode intensive amounts of linguistic and domain knowledge
■ basic paradigm/algorithms developed during this time still used today
■ then, computer power still catching up to algorithms
■ dramatic growth in available computing power
■ dramatic growth in transcribed data sets available
■ basic algorithmic framework remains the same as in the 1980’s
■ speaker dependent vs. speaker independent
■ small vs. large vocabulary
■ constrained vs. unconstrained domain
■ isolated vs. continuous
■ read vs. spontaneous
■ 1995 — Dragon, IBM release speaker-dependent isolated word large-vocabulary ASR
■ 1997 — Dragon, IBM release speaker-dependent continuous word large-vocabulary ASR
■ late 1990’s — speaker-independent continuous small-vocab ASR available
■ late 1990’s — limited-domain speaker-independent continuous large-vocabulary ASR
■ to get reasonable performance, must constrain something
■ different sites compete on a common test set
■ harder and harder problems over time
■ each system has been extensively tuned to that domain!
■ still a ways to go until unconstrained large-vocabulary speaker-independent ASR is solved
■ for humans, one system fits all
¹ string error rates
² isolated letters presented to humans, continuous for machine
³ phone error rates
■ speech recognition as pattern classification
■ why is speech recognition so difficult?
■ key problems in speech recognition
■ consider isolated digit recognition
■ classification
■ What does an audio signal look like?
[waveform plot: amplitude (−1 to 1) vs. sample index (up to 2.5 × 10⁴)]
■ discretizing in time
■ discretizing in magnitude (A/D conversion)
■ one-second audio signal (sampled at 16 kHz): A ∈ R^16000
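A one-second signal at 16 kHz really is just a point in R^16000. A minimal sketch of both discretization steps, using a synthetic tone (my own illustration, not course code):

```python
import numpy as np

# Discretizing in time: 16000 samples per second for one second.
SAMPLE_RATE = 16000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
a = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # a 440 Hz tone; a is a point in R^16000

# Discretizing in magnitude (A/D conversion): 16-bit signed integers.
a_quantized = np.round(a * 32767).astype(np.int16)

print(a.shape)  # (16000,)
```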
■ speech recognition ⇔ building a classifier
■ speech recognition ⇔ design discriminant function SCORE_c(A)
■ can use concepts, tools from pattern classification
■ a simple classifier
■ discriminant function SCORE_c(A) = DISTANCE(A, A_c)
■ e.g., Euclidean distance: DISTANCE(A, A_c) = (Σ_{i=1}^{16000} (a_i − a_{i,c})²)^{1/2}
■ pick class whose example is closest to A
■ e.g., scenario for cell phone name recognition
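The nearest-template scheme above can be sketched in a few lines. This is an illustrative toy (fake 1-D "signals", hypothetical class names), not the course implementation:

```python
import numpy as np

# One stored canonical example per class; Euclidean distance as the
# discriminant; classify by picking the closest template.
def euclidean_distance(a, a_c):
    return np.sqrt(np.sum((a - a_c) ** 2))

def classify(a, templates):
    """templates: dict mapping class label -> stored example signal."""
    return min(templates, key=lambda c: euclidean_distance(a, templates[c]))

# Toy "cell phone name recognition" scenario.
rng = np.random.default_rng(0)
templates = {"alice": rng.standard_normal(100),
             "bob": rng.standard_normal(100)}
test = templates["alice"] + 0.01 * rng.standard_normal(100)  # noisy "alice"
print(classify(test, templates))  # -> alice
```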
[time-domain waveforms of three example utterances (1000 samples each)]
■ wait, taking Euclidean distance in the time domain is dumb!
■ what about the frequency domain?
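Moving to the frequency domain means looking at FFT magnitudes, which discard the exact time alignment that makes time-domain Euclidean distance so fragile. A quick sketch with a synthetic tone (my own illustration):

```python
import numpy as np

SAMPLE_RATE = 16000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
a = np.sin(2 * np.pi * 440.0 * t)          # 440 Hz tone

spectrum = np.abs(np.fft.rfft(a))          # magnitude spectrum
freqs = np.fft.rfftfreq(len(a), d=1.0 / SAMPLE_RATE)
peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # -> 440.0
```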
[spectrograms of the three utterances: frequency in Hz (0–3500) vs. time in seconds]
■ taking Euclidean distance in the frequency domain doesn’t work well either
■ can we extract cogent features A ⇒ (f1, . . . , fk)?
■ this turns out to be remarkably difficult!
■ there is an enormous range of ways a particular word can be realized
■ sources of variability
■ screwing with any one of these factors can make ASR accuracy go to hell
■ for each word w, collect many examples; summarize with set of canonical examples Aw,i
■ to recognize audio signal A, find word w that minimizes DISTANCE(A, Aw,i)
■ converting audio signals A into a set of cogent feature values (f1, . . . , fk)
■ coming up with good distance measures DISTANCE(·, ·)
■ coming up with good canonical representatives Aw,i for each class
■ what if we don’t have examples for each word? (sparse data)
■ efficiently finding the closest word
■ using knowledge that not all words or word sequences are equally probable
■ find features of speech such that . . .
■ discard stuff that doesn’t matter
■ look at human production and perception for insight
■ air comes out of lungs
■ vocal cords tensed (vibrate ⇒ voicing) or relaxed (unvoiced)
■ modulated by vocal tract (glottis → lips); resonates
■ phonemes
■ may be realized differently based on context
■ voicing
■ stops/plosives
■ spectrogram shows energy at each frequency over time
■ voiced sounds have pitch (F0); formants (F1, F2, F3)
■ trained humans can do recognition on spectrograms with high accuracy
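A spectrogram is just short-time Fourier analysis: slice the signal into overlapping frames, window each, and take FFT magnitudes. A minimal sketch; the 25 ms / 10 ms frame sizes are typical choices, not values from the slides:

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):   # 25 ms / 10 ms at 16 kHz
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))     # energy per frequency, per frame

sig = np.sin(2 * np.pi * 1000.0 * np.arange(16000) / 16000.0)  # 1 kHz tone
S = spectrogram(sig)
print(S.shape)  # (98, 201): 98 frames, 201 frequency bins
```

With 400-point frames at 16 kHz each bin is 40 Hz wide, so the 1 kHz tone peaks in bin 25 of every frame.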
■ What can the machine do? Here is a sample on TIMIT:
■ vowels — EE, AH, etc.
■ consonants
■ realization of a phoneme can differ very much depending on context
■ where articulators were for the last phone affects how they transition to the next
■ insight into what features to use?
■ influences how signal processing is done
■ as it turns out, the features that work well . . .
■ e.g., Mel Frequency Cepstral Coefficients (MFCC)
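The standard MFCC recipe can be sketched end-to-end for a single frame: magnitude spectrum → triangular mel filter bank → log compression → DCT. All parameter choices below (26 filters, 13 coefficients, Hann window, the 2595/700 mel formula) are common conventions assumed by me, not values from this course:

```python
import numpy as np

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_ceps=13):
    # 1. magnitude spectrum of the windowed frame
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    # 2. triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((len(frame) + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        weights = np.zeros(len(spec))
        weights[lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)
        weights[mid:hi] = np.linspace(1, 0, hi - mid, endpoint=False)
        energies[i] = spec @ weights
    # 3. log compression, then a DCT to decorrelate -> cepstral coefficients
    log_e = np.log(energies + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e

frame = np.sin(2 * np.pi * 300.0 * np.arange(400) / 16000.0)
print(mfcc_frame(frame).shape)  # (13,)
```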
■ sound comes in ear, converted into vibrations in fluid in cochlea
■ in fluid is basilar membrane, with ∼30,000 little hairs
■ human physiology used as justification for frequency analysis ubiquitous in ASR front ends
■ limited knowledge of higher-level processing
■ 0 dB sound pressure level (SPL) ⇔ threshold of hearing
■ tells us what range of frequencies people can detect
■ equal loudness contours
■ tells us what range of frequencies might be good to focus on
■ adjust pitch of one tone until twice/half pitch of other tone
■ Mel scale — frequencies equally spaced in Mel scale are equally spaced perceptually
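The mel scale has a common analytic approximation (O'Shaughnessy's formula); this specific form is a standard convention, not taken verbatim from the slides:

```python
import numpy as np

# Equal steps in mel correspond to roughly equal perceived pitch steps.
# Below ~1000 Hz the scale is near-linear; above, roughly logarithmic.
def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000.0)))  # -> 1000  (1000 Hz ~ 1000 mel by construction)
```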
■ use controlled stimuli to see what features humans use to distinguish sounds
■ Haskins Laboratories (1940–1950’s), Pattern Playback machine
■ demonstrated importance of formants, formant transitions, trajectories in speech perception
■ just as human physiology has its quirks, so does machine “physiology”
■ sources of distortion
■ input distortion can still be a significant problem
■ enough said
■ now that we see what humans do
■ let’s discuss what signal processing has been found to work well empirically
■ goal: ignoring time alignment issues . . .
■ start with some mathematical background
■ discrete-time Fourier transform (DTFT) pair:
  X(e^{jω}) = Σ_{n=−∞}^{∞} x[n] e^{−jωn}
  x[n] = (1/2π) ∫_{−π}^{π} X(e^{jω}) e^{jωn} dω
■ a linear shift-invariant system with impulse response h[n] is stable iff Σ_{n=−∞}^{∞} |h[n]| < ∞
■ convolution: y[n] = Σ_{k=−∞}^{∞} x[k] h[n−k]
■ convolution theorem: Y(e^{jω}) = X(e^{jω}) H(e^{jω})
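The convolution theorem in this Fourier background can be checked numerically: convolving in time equals multiplying transforms, provided we zero-pad to avoid circular wrap-around. A quick sketch (my own illustration, not course code):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.5, -1.0])

direct = np.convolve(x, h)                 # length 3 + 2 - 1 = 4

# Same result via FFTs, zero-padded to the full output length.
n = len(x) + len(h) - 1
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

print(np.allclose(direct, via_fft))  # -> True
```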
■ discrete Fourier transform (DFT) of a length-N signal:
  X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N},  0 ≤ k < N
■ inverse DFT:
  x[n] = (1/N) Σ_{k=0}^{N−1} X[k] e^{j2πkn/N},  0 ≤ n < N
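The DFT can be implemented directly from the definition and checked against NumPy's FFT (illustrative sketch, not course code):

```python
import numpy as np

def dft(x):
    # X[k] = sum_n x[n] e^{-j 2 pi k n / N}, written as a matrix-vector product.
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # W_N^{kn}
    return W @ x

x = np.array([1.0, 2.0, 0.0, -1.0])
print(np.allclose(dft(x), np.fft.fft(x)))  # -> True
```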
■ fast Fourier transform (FFT): define twiddle factors W_N^{kn} = e^{−j2πkn/N}, and split the sum into even and odd samples:
  X[k] = Σ_{n=0}^{N/2−1} x[2n] W_{N/2}^{kn} + W_N^k Σ_{n=0}^{N/2−1} x[2n+1] W_{N/2}^{kn} = G[k] + W_N^k H[k]
■ G[k], H[k] are N/2-point DFTs; recursing gives O(N log N) instead of O(N²)
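The decimation-in-time butterfly maps directly onto a short recursive implementation (a sketch assuming power-of-two lengths, checked against NumPy):

```python
import numpy as np

def fft_recursive(x):
    # Radix-2 decimation-in-time: X[k] = G[k] + W_N^k H[k],
    # X[k + N/2] = G[k] - W_N^k H[k].
    N = len(x)                        # assumed a power of two
    if N == 1:
        return x.astype(complex)
    G = fft_recursive(x[0::2])        # N/2-point DFT of even samples
    H = fft_recursive(x[1::2])        # N/2-point DFT of odd samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors
    return np.concatenate([G + W * H, G - W * H])

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0, -2.0, 0.5])
print(np.allclose(fft_recursive(x), np.fft.fft(x)))  # -> True
```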
■ discrete cosine transform (DCT):
  C[k] = Σ_{n=0}^{N−1} x[n] cos(πk(2n+1)/(2N)),  0 ≤ k < N
■ computable via a 2N-point DFT of the mirrored signal: Y[k] = 2 e^{jπk/(2N)} C[k],  0 ≤ k < N
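The mirrored-signal identity is the standard way to compute a DCT-II with an FFT; a sketch verifying it against the direct definition (my own illustration):

```python
import numpy as np

def dct2_direct(x):
    # C[k] = sum_n x[n] cos(pi k (2n+1) / (2N)), straight from the definition.
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])

def dct2_via_fft(x):
    # Mirror x to length 2N, take a 2N-point FFT, then undo the phase rotation.
    N = len(x)
    y = np.concatenate([x, x[::-1]])
    Y = np.fft.fft(y)[:N]
    k = np.arange(N)
    return 0.5 * np.real(np.exp(-1j * np.pi * k / (2 * N)) * Y)

x = np.array([1.0, 2.0, 0.5, -1.0])
print(np.allclose(dct2_direct(x), dct2_via_fft(x)))  # -> True
```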