ELEN E6884/COMS 86884 Speech Recognition
Michael Picheny, Ellen Eide, Stanley F. Chen IBM T.J. Watson Research Center Yorktown Heights, NY, USA {picheny,eeide,stanchen}@us.ibm.com 8 September 2005
■ converting speech to text
■ what it’s not
■ speech is potentially the fastest way people can communicate
■ remote speech access is ubiquitous
■ archiving/indexing/compressing human speech
■ cover fundamentals of ASR in depth (weeks 1–9)
■ survey state-of-the-art techniques (weeks 10–13)
■ force you, the student, to implement key algorithms in C++
■ too much knowledge to fit in one brain
■ three lecturers (no TA?)
■ from IBM T.J. Watson Research Center, Yorktown Heights, NY
■ 1306 Mudd; 4:10-6:40pm Thursday
■ hardcopy of slides distributed at each lecture
■ four programming assignments (80% of grade)
■ final reading project (20% of grade)
■ weekly readings
■ C++ (g++ compiler) on x86 PCs running Linux
■ extensive code infrastructure (provided by IBM)
■ get account on ILAB computer cluster
■ labs due Fridays at 6pm
■ will be mailed out when ILAB accounts are ready
■ due next Friday (9/16) 6pm
■ getting acquainted
■ PDF versions of readings will be available on the web site
■ recommended text (bookstore):
■ reference texts (library, EE?):
■ in E-mail, prefix subject line with “ELEN E6884:” !!!
■ Michael Picheny — picheny@us.ibm.com
■ Ellen Eide — eeide@us.ibm.com
■ Stanley F. Chen — stanchen@watson.ibm.com
■ office hours: right after class; or before class by appointment
■ Courseworks
■ syllabus
■ slides from lectures (PDF)
■ lab assignments (PDF)
■ reading assignments (PDF)
■ feedback questionnaire after each lecture (2 questions)
■ EEs may find CS parts challenging, and vice versa
■ you, the student, are partially responsible for quality of course
■ together, we can get through this
■ let’s go!
■ why is speech recognition hard?
■ ad hoc methods
■ maturation of statistical methods; basic HMM/GMM framework
■ more processing power, data
■ variations on a theme; tuning
■ speaker-independent single-word recognizer (“Rex”)
■ simple signal processing/feature extraction
■ many ideas central to modern ASR were introduced, but not all were used
■ small vocabulary
■ not tested with many speakers (usually <10)
■ error rates < 10%
■ killed ASR research at Bell Labs for many years
■ partially served as impetus for first (D)ARPA program (1971–1976)
■ four competitors
■ HARPY won hands down
■ view speech recognition as . . .
■ downfall of trying to manually encode intensive amounts of knowledge
■ basic paradigm/algorithms developed during this time still used
■ then, computer power still catching up to algorithms
■ dramatic growth in available computing power
■ dramatic growth in transcribed data sets available
■ basic algorithmic framework remains the same as in the 1980s
■ speaker dependent vs. speaker independent
■ small vs. large vocabulary
■ constrained vs. unconstrained domain
■ isolated vs. continuous
■ read vs. spontaneous
■ 1995 — Dragon, IBM release speaker-dependent isolated-word dictation products
■ 1997 — Dragon, IBM release speaker-dependent continuous dictation products
■ late 1990s — speaker-independent continuous small-vocabulary systems deployed
■ late 1990s — limited-domain speaker-independent continuous systems deployed
■ to get reasonable performance, must constrain something
■ different sites compete on a common test set
■ harder and harder problems over time
■ each system has been extensively tuned to that domain!
■ still a ways to go until unconstrained large-vocabulary speaker-independent ASR is solved
■ for humans, one system fits all
¹string error rates
²isolated letters presented to humans, continuous for machine
■ speech recognition as pattern classification
■ why is speech recognition so difficult?
■ key problems in speech recognition
■ consider isolated digit recognition
■ classification
■ e.g., turn on microphone for exactly one second
■ microphone converts instantaneous air pressure into real value
[figure: example audio waveform (amplitude vs. sample index)]
■ discretizing in time
■ discretizing in magnitude (A/D conversion)
■ one-second audio signal A ∈ R^{16000}
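The discretization above can be sketched in C++ (the labs’ language); the 1 kHz test tone, the function name, and its parameters are illustrative, not course code:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch (not the course's lab code): "record" one second
// of a 1 kHz sine wave by sampling at 16 kHz (discretizing time) and
// quantizing to 16-bit integers (discretizing magnitude), yielding the
// 16000-dimensional signal A described above.
std::vector<int16_t> sample_and_quantize(double freq_hz,
                                         int sample_rate = 16000,
                                         double seconds = 1.0) {
    int n_samples = static_cast<int>(sample_rate * seconds);
    std::vector<int16_t> signal(n_samples);
    for (int n = 0; n < n_samples; ++n) {
        double t = static_cast<double>(n) / sample_rate;   // sample times
        double x = std::sin(2.0 * M_PI * freq_hz * t);     // "air pressure"
        signal[n] = static_cast<int16_t>(32767.0 * x);     // 16-bit A/D
    }
    return signal;
}
```

A one-second call yields exactly 16000 samples, matching A ∈ R^{16000}.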
■ speech recognition ⇔ building a classifier
■ speech recognition ⇔ design discriminant function SCORE_c(A)
■ can use concepts, tools from pattern classification
■ a simple classifier
■ discriminant function SCORE_c(A) = DISTANCE(A, A_c)
■ e.g., Euclidean distance: DISTANCE(A, A_c) = (Σ_{i=1}^{16000} (a_i − a_{i,c})²)^{1/2}
■ pick class whose example is closest to A
■ e.g., scenario for cell phone name recognition
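A minimal C++ sketch of this nearest-example classifier (function names are hypothetical, not the course infrastructure):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two equal-length signals, as in
// DISTANCE(A, A_c) above.
double euclidean_distance(const std::vector<double>& a,
                          const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// examples[c] holds a stored example A_c for class c; return the class
// whose example is closest to the test signal (argmin of SCORE_c(A)).
int classify(const std::vector<double>& a,
             const std::vector<std::vector<double>>& examples) {
    int best_class = -1;
    double best_score = 1e300;
    for (std::size_t c = 0; c < examples.size(); ++c) {
        double score = euclidean_distance(a, examples[c]);  // SCORE_c(A)
        if (score < best_score) {
            best_score = score;
            best_class = static_cast<int>(c);
        }
    }
    return best_class;
}
```

With several stored examples per word, one would take the minimum distance over that word’s examples before comparing across words.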
[figure: time-domain waveforms of three example utterances]
■ wait, taking Euclidean distance in the time domain is dumb!
■ what about the frequency domain?
[figure: spectrograms of three example utterances (frequency in Hz vs. time in seconds, 0–3500 Hz)]
■ taking Euclidean distance in the frequency domain doesn’t work either
■ can we extract cogent features A ⇒ (f1, . . . , fk)
■ this turns out to be remarkably difficult!
■ there is an enormous range of ways a particular word can be realized
■ sources of variability
■ screwing with any one of these factors can make ASR accuracy plummet
■ for each word w, collect many examples; summarize with a set of canonical examples A_{w,i}
■ to recognize audio signal A, find word w that minimizes distance to its examples
■ converting audio signals A into a set of cogent feature values
■ coming up with good distance measures DISTANCE(·, ·)
■ coming up with good canonical representatives A_{w,i} for each word
■ what if we don’t have examples for each word? (sparse data)
■ efficiently finding the closest word
■ using knowledge that not all words or word sequences are equally likely
■ find features of speech such that . . .
■ discard stuff that doesn’t matter
■ look at human production and perception for insight
■ air comes out of lungs
■ vocal cords tensed (vibrate ⇒ voicing) or relaxed (unvoiced)
■ modulated by vocal tract (glottis → lips); resonates
■ phonemes
■ may be realized differently based on context
■ voicing
■ stops/plosives
■ spectrogram shows energy at each frequency over time
■ voiced sounds have pitch (F0); formants (F1, F2, F3)
■ trained humans can do recognition on spectrograms with high accuracy
■ vowels — EE, AH, etc.
■ consonants
■ realization of a phoneme can differ very much depending on context
■ where articulators were for the last phone affects how they transition to the next
■ insight into what features to use?
■ influences how signal processing is done
■ as it turns out, the features that work well . . .
■ e.g., Mel Frequency Cepstral Coefficients (MFCC)
■ sound comes in ear, converted into vibrations in fluid in cochlea
■ in fluid is basilar membrane, with ∼30,000 little hairs
■ human physiology used as justification for frequency analysis
■ limited knowledge of higher-level processing
■ 0 dB sound pressure level (SPL) ⇔ threshold of hearing
■ tells us what range of frequencies people can detect
■ equal loudness contours
■ tells us what range of frequencies might be good to focus on
■ adjust pitch of one tone until twice/half pitch of other tone
■ Mel scale — frequencies equally spaced on the Mel scale are perceived as equally spaced in pitch
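One common analytic fit for the Mel scale (an assumption here; the slide only describes the experimental procedure) is mel(f) = 2595 log₁₀(1 + f/700). As a C++ sketch:

```cpp
#include <cmath>

// A widely used analytic approximation of the Mel scale (assumed here,
// not taken from the slides): equal steps in mel correspond roughly to
// equal steps in perceived pitch.
double hz_to_mel(double f_hz) {
    return 2595.0 * std::log10(1.0 + f_hz / 700.0);
}

// Inverse mapping, from mel back to Hz.
double mel_to_hz(double mel) {
    return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
}
```

By construction the two functions invert each other, and 1000 Hz maps to roughly 1000 mel.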
■ use controlled stimuli to see what features humans use to distinguish speech sounds
■ Haskins Laboratories (1940s–1950s), Pattern Playback machine
■ demonstrated importance of formants, formant transitions for speech perception
■ just as human physiology has its quirks, so does machine audio capture
■ sources of distortion
■ input distortion can still be a significant problem
■ enough said
■ now that we see what humans do
■ let’s discuss what signal processing has been found to work
■ goal: ignoring time alignment issues . . .
■ start with some mathematical background
■ an LTI system is stable iff Σ_{n=−∞}^{∞} |h[n]| < ∞
■ discrete Fourier transform (DFT): X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}, 0 ≤ k < N
■ inverse DFT: x[n] = (1/N) Σ_{k=0}^{N−1} X[k] e^{j2πkn/N}
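The DFT analysis equation can be evaluated directly; an O(N²) C++ sketch (illustrative, not the course's lab code):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Direct O(N^2) evaluation of the DFT:
//   X[k] = sum_{n=0}^{N-1} x[n] e^{-j 2 pi k n / N}
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const std::size_t N = x.size();
    std::vector<std::complex<double>> X(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            double angle = -2.0 * M_PI * static_cast<double>(k * n) / N;
            sum += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = sum;
    }
    return X;
}
```

As a sanity check, a constant input concentrates all its energy in X[0].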
■ twiddle factors: W_N^{kn} = e^{−j2πkn/N}
■ split a length-N DFT into two length-N/2 DFTs over the even and odd samples: X[k] = G[k] + W_N^k H[k]
■ recursing yields the fast Fourier transform (FFT): O(N log N) instead of O(N²)
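The even/odd split leads directly to a recursive radix-2 FFT; a C++ sketch assuming N is a power of two (not the course's lab code):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT: compute the DFTs G and H
// of the even- and odd-indexed samples, then combine them with the
// twiddle factors W_N^k = e^{-j 2 pi k / N}. Assumes N is a power of 2.
std::vector<cd> fft(const std::vector<cd>& x) {
    const std::size_t N = x.size();
    if (N == 1) return x;
    std::vector<cd> even(N / 2), odd(N / 2);
    for (std::size_t i = 0; i < N / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }
    std::vector<cd> G = fft(even);               // length-N/2 DFT of evens
    std::vector<cd> H = fft(odd);                // length-N/2 DFT of odds
    std::vector<cd> X(N);
    for (std::size_t k = 0; k < N / 2; ++k) {
        cd w = std::polar(1.0, -2.0 * M_PI * k / N);  // twiddle factor W_N^k
        X[k] = G[k] + w * H[k];
        X[k + N / 2] = G[k] - w * H[k];          // W_N^{k+N/2} = -W_N^k
    }
    return X;
}
```

On power-of-two lengths this agrees with the direct DFT but costs O(N log N).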
■ discrete cosine transform (DCT): C[k] = Σ_{n=0}^{N−1} x[n] cos(πk(2n+1)/2N), 0 ≤ k < N
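A direct C++ sketch of this transform (illustrative only; normalization constants are omitted, as conventions vary):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Direct evaluation of the (unnormalized) DCT-II:
//   C[k] = sum_{n=0}^{N-1} x[n] cos(pi k (2n + 1) / (2N))
std::vector<double> dct(const std::vector<double>& x) {
    const std::size_t N = x.size();
    std::vector<double> C(N, 0.0);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            C[k] += x[n] * std::cos(M_PI * k * (2.0 * n + 1.0) / (2.0 * N));
    return C;
}
```

As with the DFT, a constant input puts all its energy in the k = 0 coefficient.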