Algorithms for NLP Automatic Speech Recognition Yulia Tsvetkov CMU - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Automatic Speech Recognition

Yulia Tsvetkov – CMU Slides: Preethi Jyothi – IIT Bombay, Dan Klein – UC Berkeley

Algorithms for NLP

slide-2
SLIDE 2

Skip-gram Prediction

slide-3
SLIDE 3

Skip-gram Prediction

▪ Training data

(w_t, w_{t-2}), (w_t, w_{t-1}), (w_t, w_{t+1}), (w_t, w_{t+2}), ...
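The training pairs listed above can be generated with a sliding window. This is a minimal sketch: the function name `skipgram_pairs` and the example sentence are illustrative choices, and the window size of 2 is read off the offsets t-2 ... t+2 on the slide.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for t, target in enumerate(tokens):
        for offset in range(-window, window + 1):
            c = t + offset
            # Skip the target itself and positions outside the sentence.
            if offset != 0 and 0 <= c < len(tokens):
                pairs.append((target, tokens[c]))
    return pairs

pairs = skipgram_pairs(["she", "sells", "sea", "shells"], window=2)
```

Words near the sentence boundary simply get fewer context pairs.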

slide-4
SLIDE 4

Skip-gram Prediction

slide-5
SLIDE 5

How to compute p(+|t,c)?
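The answer to this question did not survive in the transcript. A common answer, used in skip-gram with negative sampling as presented in J&M, is to pass the dot product of the target and context embeddings through a sigmoid. The sketch below assumes that formulation; `prob_positive` and the toy vectors are hypothetical.

```python
import math

def prob_positive(t_vec, c_vec):
    """p(+ | t, c) = sigmoid(t . c), assuming the negative-sampling formulation."""
    dot = sum(t * c for t, c in zip(t_vec, c_vec))
    return 1.0 / (1.0 + math.exp(-dot))

p_unrelated = prob_positive([0.0, 0.0], [1.0, 1.0])   # dot product 0
p_similar = prob_positive([1.0, 2.0], [3.0, 4.0])     # dot product 11
```

Under this assumption, p(-|t,c) is 1 - p(+|t,c).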

slide-6
SLIDE 6

FastText: Motivation

slide-7
SLIDE 7

Subword Representation

skiing = {^skiing$, ^ski, skii, kiin, iing, ing$}
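The decomposition above can be reproduced with a few lines of code. This sketch follows the slide's notation (`^` and `$` as boundary markers, character 4-grams plus the whole word); the released fastText tool uses different boundary symbols and a range of n-gram lengths, so treat `char_ngrams` as an illustration rather than the library's behaviour.

```python
def char_ngrams(word, n=4):
    """Whole marked word followed by all of its character n-grams."""
    marked = "^" + word + "$"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return [marked] + grams

subwords = char_ngrams("skiing")
```

A word vector is then built from the vectors of these subword units, which lets rare or unseen words share parameters with words that have similar spellings.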

slide-8
SLIDE 8

FastText

slide-9
SLIDE 9

ELMo

[Figure: the sentence “The Broadway play premiered yesterday .” is fed through stacked LSTM layers; ELMo = λ0 ( ) + λ1 ( ) + λ2 ( ), a weighted sum of three layer representations.]

slide-10
SLIDE 10

Announcements

▪ HW1 due Sept 24
▪ HW2 out Oct 2

slide-11
SLIDE 11

Automatic Speech Recognition (ASR)

▪ Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically a word sequence
▪ Downstream applications of ASR:

▪ Speech understanding
▪ Audio information retrieval
▪ Speech translation
▪ Keyword search

[Figure: speech signal → speech transcript (“She sells sea shells”)]

slide-12
SLIDE 12

What ASR is Not

Slide credit: Preethi Jyothi

slide-13
SLIDE 13

ASR is the Front Engine

Slide credit: Preethi Jyothi

slide-14
SLIDE 14

Why is ASR a Challenging Problem?

▪ Style:

▪ Read speech vs spontaneous (conversational) speech
▪ Command & control vs continuous natural speech

▪ Speaker characteristics:

▪ Rate of speech, accent, prosody (stress, intonation), speaker age, pronunciation variability even when the same speaker speaks the same word

▪ Channel characteristics:

▪ Background noise, room acoustics, microphone properties, interfering speakers

▪ Task specifics:

▪ Vocabulary size (very large number of words to be recognized), language-specific complexity, resource limitations

Slide credit: Preethi Jyothi

slide-15
SLIDE 15

History of ASR

The very first ASR

Slide credit: Preethi Jyothi

slide-16
SLIDE 16

History of ASR

Slide credit: Preethi Jyothi

slide-17
SLIDE 17

History of ASR

Slide credit: Preethi Jyothi

slide-18
SLIDE 18

History of ASR

Slide credit: Preethi Jyothi

slide-19
SLIDE 19

Statistical ASR : The Noisy Channel Model

Acoustic model
Language model: distributions over sequences of words (sentences)

~80s
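The slide's equation is not preserved in the transcript. The standard noisy channel decomposition, written here with A for the acoustic observations and W for a word sequence (symbol names are an assumption; W* matches the notation used later for evaluation), is:

```latex
W^{*} = \arg\max_{W} P(W \mid A)
      = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
      = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}} \;
                     \underbrace{P(W)}_{\text{language model}}
```

P(A) can be dropped because it does not depend on W.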

slide-20
SLIDE 20

History of ASR

Slide credit: Preethi Jyothi

slide-21
SLIDE 21

History of ASR

Slide credit: Preethi Jyothi

slide-22
SLIDE 22

Evaluating an ASR system

▪ Word/Phone error rate (ER)

▪ Uses the Levenshtein distance: what is the minimum number of edits (insertions/deletions/substitutions) required to convert W* to Wref?

From J&M
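The error rate described above can be sketched as a word-level Levenshtein distance divided by the reference length. This is a minimal sketch assuming unit cost for every edit and whitespace tokenisation; `edit_distance` and `word_error_rate` are illustrative names, not a standard scoring tool.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two token lists, unit cost per edit."""
    prev = list(range(len(ref) + 1))           # distance from empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]                             # distance to empty reference
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # drop h
                            curr[j - 1] + 1,           # add r
                            prev[j - 1] + (h != r)))   # substitute or match
        prev = curr
    return prev[-1]

def word_error_rate(hyp, ref):
    """Edits needed to turn the hypothesis into the reference, per reference word."""
    ref_words = ref.split()
    return edit_distance(hyp.split(), ref_words) / len(ref_words)

wer = word_error_rate("she sell sea", "she sells sea shells")
```

Because insertions count as errors, the rate can exceed 1 when the hypothesis is much longer than the reference. Running the same function over phone sequences gives a phone error rate.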

slide-23
SLIDE 23

NIST ASR Benchmark Test History

slide-24
SLIDE 24

What’s Next?

Slide credit: Preethi Jyothi

slide-25
SLIDE 25

What’s Next?

https://www.youtube.com/watch?v=gNx0huL9qsQ Link credit: Preethi Jyothi

▪ accented speech
▪ low-resource
▪ speaker separation
▪ short queries
▪ etc.

slide-26
SLIDE 26

In our course

slide-27
SLIDE 27

Statistical ASR

Slide by Preethi Jyothi

slide-28
SLIDE 28

ASR Topics

Slide by Preethi Jyothi

slide-29
SLIDE 29

In our course

Slide by Preethi Jyothi

slide-30
SLIDE 30

Acoustic Analysis

Slide by Preethi Jyothi

slide-31
SLIDE 31

What is speech - physical realisation

▪ Waves of changing air pressure
▪ Realised through excitation from the vocal cords
▪ Modulated by the vocal tract, the articulators (tongue, teeth, lips)
▪ Vowels: open vocal tract
▪ Consonants are constrictions of vocal tract

▪ Representation:
▪ acoustics
▪ linguistics

slide-32
SLIDE 32

Acoustics

slide-33
SLIDE 33

Simple Periodic Waves of Sound

▪ Y axis: Amplitude = amount of air pressure at that point in time
▪ X axis: Time
▪ Frequency = number of cycles per second

▪ 20 cycles in 0.02 seconds = 1000 cycles/second = 1000 Hz

slide-34
SLIDE 34

Complex Waves: 100Hz+1000Hz

amplitude

slide-35
SLIDE 35

Spectrum

[Figure: amplitude vs. frequency in Hz; the frequency components (100 Hz and 1000 Hz) appear on the x-axis.]
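The two components of a 100 Hz + 1000 Hz wave can be recovered numerically with a discrete Fourier transform. The sketch below uses only the standard library and a direct (slow) DFT; the sample rate of 4000 Hz and the 0.1 s duration are assumptions chosen so that each frequency bin is exactly 10 Hz wide.

```python
import cmath
import math

SAMPLE_RATE = 4000   # samples per second (assumed)
N = 400              # 0.1 s of audio, so each DFT bin is 10 Hz wide

# Complex wave: sum of a 100 Hz and a 1000 Hz sine.
signal = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE)
          + math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
          for n in range(N)]

def dft_magnitudes(x):
    """Magnitudes of the first len(x)/2 DFT bins, straight from the definition."""
    size = len(x)
    mags = []
    for k in range(size // 2):
        total = sum(x[n] * cmath.exp(-2j * math.pi * k * n / size)
                    for n in range(size))
        mags.append(abs(total))
    return mags

mags = dft_magnitudes(signal)
bin_width = SAMPLE_RATE / N
# Indices of the two largest bins, converted back to Hz.
top_two = sorted(sorted(range(len(mags)), key=lambda k: mags[k])[-2:])
peak_freqs = [k * bin_width for k in top_two]
```

In practice an FFT routine would replace the quadratic-time loop; the result is the same spectrum.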

slide-36
SLIDE 36

“She just had a baby”

▪ What can we learn from a wavefile?

▪ No gaps between words (!)
▪ Vowels are voiced, long, loud
▪ Voicing: regular peaks in amplitude
▪ When stops are closed: no peaks, silence
▪ Peaks = voicing: from second .46 to .58 (vowel [iy]), from second .65 to .74 (vowel [ax]), and so on
▪ Silence of stop closure (1.06 to 1.08 for first [b], or 1.26 to 1.28 for second [b])
▪ Fricatives like [sh]: intense irregular pattern; see .33 to .46

slide-37
SLIDE 37

Part of [ae] waveform from “had”

▪ Note complex wave repeating nine times in figure
▪ Plus smaller waves which repeat 4 times for every large pattern
▪ Large wave has frequency of 250 Hz (9 times in .036 seconds)
▪ Small wave roughly 4 times this, or roughly 1000 Hz
▪ Two little tiny waves on top of peak of 1000 Hz waves

[Figure: waveform, amplitude vs. time]

slide-38
SLIDE 38

Spectrum of an Actual Speech

Coefficient

slide-39
SLIDE 39

Spectrograms

[Figure: a time slice of the waveform (amplitude vs. time) is passed through an FFT to give a spectrum (coefficient vs. frequency).]

slide-40
SLIDE 40

Spectrograms

[Figure: waveform, amplitude vs. time]

slide-41
SLIDE 41

Spectrograms

[Figure: spectrogram (frequency vs. time) shown with the waveform (amplitude vs. time)]

slide-42
SLIDE 42

Types of Graphs

[Figure: three graph types: spectrogram (frequency vs. time), waveform (amplitude vs. time), spectrum (coefficient vs. frequency)]

slide-43
SLIDE 43

Speech in a Slide

▪ Frequency gives pitch; amplitude gives volume
▪ Frequencies at each time slice processed into observation vectors

[Figure: waveform (amplitude) and spectrogram (frequency) of “speech lab”, with the resulting observation sequence ... x12 x13 x12 x14 x14 ...]

slide-44
SLIDE 44

Articulation

slide-45
SLIDE 45

Articulatory System

[Figure: sagittal section of the vocal tract (Techmer 1880), labelled: nasal cavity, oral cavity, pharynx, vocal folds (in the larynx), trachea, lungs]

Text from Ohala, Sept 2001, from Sharon Rose slide

slide-46
SLIDE 46

Space of Phonemes

▪ Standard international phonetic alphabet (IPA) chart of consonants

slide-47
SLIDE 47

Place

slide-48
SLIDE 48

Places of Articulation

labial dental alveolar post-alveolar/palatal velar uvular pharyngeal laryngeal/glottal

Figure thanks to Jennifer Venditti

slide-49
SLIDE 49

Labial place

bilabial labiodental

Figure thanks to Jennifer Venditti

Bilabial: p, b, m
Labiodental: f, v

slide-50
SLIDE 50

Coronal place

dental alveolar post-alveolar/palatal

Figure thanks to Jennifer Venditti

Dental: th/dh
Alveolar: t/d/s/z/l/n
Post: sh/zh/y

slide-51
SLIDE 51

Dorsal Place

velar uvular pharyngeal

Figure thanks to Jennifer Venditti

Velar: k/g/ng

slide-52
SLIDE 52

Space of Phonemes

▪ Standard international phonetic alphabet (IPA) chart of consonants

slide-53
SLIDE 53

Manner

slide-54
SLIDE 54

Manner of Articulation

▪ In addition to varying by place, sounds vary by manner
▪ Stop: complete closure of articulators, no air escapes via mouth

▪ Oral stop: palate is raised (p, t, k, b, d, g)
▪ Nasal stop: oral closure, but palate is lowered (m, n, ng)

▪ Fricatives: substantial closure, turbulent (f, v, s, z)
▪ Approximants: slight closure, sonorant (l, r, w)
▪ Vowels: no closure, sonorant (i, e, a)

slide-55
SLIDE 55

Space of Phonemes

▪ Standard international phonetic alphabet (IPA) chart of consonants

slide-56
SLIDE 56

Vowels

slide-57
SLIDE 57

Vowel Space

slide-58
SLIDE 58

Seeing Formants: the Spectrogram

slide-59
SLIDE 59

Vowel Space

slide-60
SLIDE 60

Spectrograms

slide-61
SLIDE 61

Pronunciation is Context Dependent

▪ [bab]: closure of lips lowers all formants, so there is a rapid increase in all formants at the beginning of “bab”
▪ [dad]: first formant increases, but F2 and F3 fall slightly
▪ [gag]: F2 and F3 come together, which is characteristic of velars. Formant transitions take longer in velars than in alveolars or labials

From Ladefoged “A Course in Phonetics”

slide-62
SLIDE 62

Dialect Issues

▪ Speech varies from dialect to dialect (examples are American vs. British English)
▪ Syntactic (“I could” vs. “I could do”)
▪ Lexical (“elevator” vs. “lift”)
▪ Phonological
▪ Phonetic
▪ Mismatch between training and testing dialects can cause a large increase in error rate

[Figure: American vs. British, for the words “all” and “old”]
slide-63
SLIDE 63

Acoustic Analysis

Slide by Preethi Jyothi

slide-64
SLIDE 64

Frame Extraction

▪ A frame (25 ms wide) extracted every 10 ms

25 ms 10ms

a1 a2 a3

Figure: Simon Arnfield

Preview of feature extraction for each frame:
1) DFT (Spectrum)
2) Log (Calibrate)
3) another DFT (!!??)
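The framing step (a 25 ms window advanced by 10 ms) can be sketched as plain list slicing. The 16 kHz sample rate and the name `extract_frames` are assumptions for illustration; the slide does not state a sample rate.

```python
def extract_frames(samples, sample_rate, frame_ms=25, step_ms=10):
    """Cut a signal into overlapping frames of frame_ms, one every step_ms."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    step = sample_rate * step_ms // 1000         # samples between frame starts
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, step)]

one_second = [0.0] * 16000                       # placeholder audio, 1 s at 16 kHz
frames = extract_frames(one_second, sample_rate=16000)
```

Adjacent frames overlap by 15 ms, so each sample contributes to two or three frames. Trailing samples that do not fill a whole frame are dropped in this sketch.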

slide-65
SLIDE 65

Why these Peaks?

▪ Articulation process:

▪ The vocal cord vibrations create harmonics
▪ The mouth is an amplifier
▪ Depending on shape of mouth, some harmonics are amplified more than others

slide-66
SLIDE 66

Figures from Ratree Wayland

A3 A4 A2 C4 C3 F#3 F#2

Vowel [i] at increasing pitches

slide-67
SLIDE 67

Deconvolution / The Cepstrum

slide-68
SLIDE 68

Deconvolution / The Cepstrum

Graphs from Dan Ellis
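The DFT, log, second transform recipe can be tried on a toy signal. The sketch below computes a plain real cepstrum with a direct DFT from the standard library; it is an illustration under simplifying assumptions, since MFCC pipelines typically also apply a mel filterbank and use a DCT for the last step. The impulse train stands in for a periodic voicing source, and `floor` is a hypothetical guard against taking the log of zero.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct (quadratic-time) DFT or inverse DFT of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude spectrum of one frame."""
    spectrum = dft(frame)
    log_mag = [math.log(abs(v) + floor) for v in spectrum]
    return [v.real for v in dft(log_mag, inverse=True)]

# Toy "voiced" frame: one impulse every 32 samples.
frame = [1.0 if t % 32 == 0 else 0.0 for t in range(256)]
cep = real_cepstrum(frame)
```

The regularly spaced harmonics in the spectrum turn into a peak at index 32 of `cep`, the period of the source, while neighbouring indices stay near zero. This separation of periodic source from slowly varying spectral envelope is what makes the second transform useful.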

slide-69
SLIDE 69

Final Feature Vector

▪ 39 (real) features per 25 ms frame:

▪ 12 MFCC features
▪ 12 delta MFCC features
▪ 12 delta-delta MFCC features
▪ 1 (log) frame energy
▪ 1 delta (log) frame energy
▪ 1 delta-delta (log) frame energy

▪ So each frame is represented by a 39D vector
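The 39 dimensions come from 13 static values (12 MFCCs plus log energy) and their first and second time derivatives. The sketch below approximates the derivatives with a simple central difference, which is an assumption; toolkits often use a regression over a wider window. The input values are placeholders rather than real MFCCs, and the output groups static, delta and delta-delta blocks together rather than following the slide's ordering.

```python
def deltas(frames):
    """Central-difference time derivative; edge frames reuse their nearest neighbour."""
    last = len(frames) - 1
    return [[(nxt - prv) / 2.0
             for prv, nxt in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
            for t in range(len(frames))]

def stack_features(static):
    """Turn 13-dim frames into 39-dim frames: static + delta + delta-delta."""
    d = deltas(static)
    dd = deltas(d)
    return [s + d1 + d2 for s, d1, d2 in zip(static, d, dd)]

static = [[float(t)] * 13 for t in range(5)]   # placeholder values, not real MFCCs
features = stack_features(static)
```

The delta features give the model a local view of how the spectrum is changing, which a single 25 ms frame cannot show.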

slide-70
SLIDE 70

Acoustic Analysis

Slide by Preethi Jyothi

slide-71
SLIDE 71

Phonetic Analysis

Slide by Preethi Jyothi

slide-72
SLIDE 72

CMU Pronunciation Dict

slide-73
SLIDE 73

Speech Model

Words: w1 w2
Sound types: s1 s2 s3 s4 s5 s6 s7
Acoustic observations: a1 a2 a3 a4 a5 a6 a7

Language model, Acoustic model

slide-74
SLIDE 74

Acoustic Modeling

Slide by Preethi Jyothi

slide-75
SLIDE 75

Vector Quantization

▪ Idea: discretization

▪ Map MFCC vectors onto discrete symbols
▪ Compute probabilities just by counting

▪ This is called vector quantization or VQ
▪ Not used for ASR any more
▪ But: useful to consider as a starting point
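The idea can be sketched in a few lines: assign each feature vector to its nearest codebook entry, then estimate symbol probabilities by counting. The 2-D codebook and frames below are hypothetical toy values (real features would be 39-dimensional and the codebook would be learned, for example by clustering).

```python
from collections import Counter

def quantize(vector, codebook):
    """Index of the nearest codebook entry under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]            # hypothetical codebook
frames = [[0.1, -0.2], [0.9, 1.2], [1.1, 0.8], [3.7, 4.2]]  # hypothetical features

symbols = [quantize(f, codebook) for f in frames]
counts = Counter(symbols)
probs = {s: c / len(symbols) for s, c in counts.items()}    # relative frequencies
```

In an HMM the counts would be collected separately for each state, giving a discrete emission distribution per state.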

slide-76
SLIDE 76

Next class: HMMs for Continuous Observations

▪ Feature vectors are real-valued
▪ Solution 1: discretization
▪ Solution 2: continuous emissions

▪ Gaussians
▪ Multivariate Gaussians
▪ Mixtures of multivariate Gaussians
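As a preview of continuous emissions, the sketch below scores a feature vector under a single Gaussian and under a mixture. It assumes diagonal covariance matrices, a common simplification that the slide does not state; the function names and parameters are illustrative.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a multivariate Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_logpdf(x, weights, means, variances):
    """Log density of a mixture of diagonal Gaussians, via log-sum-exp."""
    comps = [math.log(w) + diag_gaussian_logpdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(comps)
    return top + math.log(sum(math.exp(c - top) for c in comps))

single = diag_gaussian_logpdf([0.0], [0.0], [1.0])
mixture = gmm_logpdf([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

Working in log space avoids underflow when many frame likelihoods are multiplied along an HMM path.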