Speech Processing 11-492/18-492

Speech Recognition Grammars / Other ASR Techniques
But not just acoustics
- But not all phones are equi-probable
- Find the word sequence W that maximizes P(W|A)
- Using Bayes’ Law
- Combine models
  - Use HMMs to provide the acoustic score P(A|W)
  - Use language model to provide P(W)
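The combination above can be sketched as follows: pick the hypothesis maximizing log P(A|W) + log P(W). The hypotheses and scores here are invented for illustration; real recognizers search over a lattice, not a short list.

```python
# Toy noisy-channel rescoring: combine acoustic and language model scores
# in the log domain and pick the best hypothesis. All numbers are made up.
hypotheses = {
    "recognize speech": {"log_p_acoustic": -12.1, "log_p_lm": -4.2},
    "wreck a nice beach": {"log_p_acoustic": -11.8, "log_p_lm": -9.7},
}

def combined_score(h):
    s = hypotheses[h]
    return s["log_p_acoustic"] + s["log_p_lm"]  # log P(A|W) + log P(W)

best = max(hypotheses, key=combined_score)
print(best)  # "recognize speech": slightly worse acoustics, much better LM
```

Note how the language model overrules a small acoustic preference for the acoustically similar but implausible word sequence.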
Beyond n-grams
- Tri-gram language models
  - Good for general ASR
- More targeted models for dialog systems
  - Look for more structure
Formal Language Theory
- Chomsky Hierarchy
  - Finite State Machines
  - Context Free Grammars
  - Context Sensitive Grammars
  - Generalized Rewrite Rules/Turing Machines
- As LM or as understanding mechanism
- Folded into the ASR or only run on its output
Finite State Machines
- A trigram model is an FSM with a state per word pair (vocabulary² states)
- FSM for greeting
  [FSM diagram: "Hello" | "Good" -> "Morning"/"Afternoon"]
Finite State Grammar
Sentences -> Start Greeting End
Greeting -> “Hello”
Greeting -> “Good” TOD
TOD -> “Morning”
TOD -> “Afternoon”
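The greeting grammar above compiles directly into a finite state machine. A minimal sketch, with hypothetical state names (not from any ASR toolkit):

```python
# Arcs of the greeting FSM: (state, word) -> next state.
ARCS = {
    ("start", "hello"): "end",
    ("start", "good"): "tod",
    ("tod", "morning"): "end",
    ("tod", "afternoon"): "end",
}

def accepts(words):
    """Return True if the word sequence is accepted by the greeting FSM."""
    state = "start"
    for w in words:
        nxt = ARCS.get((state, w.lower()))
        if nxt is None:
            return False  # no arc with this label leaves the current state
        state = nxt
    return state == "end"

print(accepts(["Hello"]))            # True
print(accepts(["Good", "morning"]))  # True
print(accepts(["Good", "evening"]))  # False: not in the grammar
```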
Context Free Grammar
X -> Y Z
Y -> “Terminal”
Y -> NonTerminal NonTerminal
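What context-free rules buy you over finite state is recursion. A sketch (grammar and code are illustrative, not from the lecture): the language aⁿbⁿ is generated by S -> "a" S "b" | "", which no FSM can recognize exactly.

```python
def match_s(s, i=0):
    """Try to match S -> "a" S "b" | "" at position i.

    Returns the index just past the matched span; the empty production
    always succeeds, so a failed "a...b" attempt falls back to it.
    """
    if i < len(s) and s[i] == "a":
        j = match_s(s, i + 1)        # recursive S in the middle
        if j < len(s) and s[j] == "b":
            return j + 1
    return i                          # empty production

def accepts_anbn(s):
    return match_s(s) == len(s)

print(accepts_anbn("aabb"))  # True
print(accepts_anbn("aab"))   # False: unbalanced
```

The recursive call depth tracks the nesting depth, which is exactly the unbounded counting a finite number of states cannot do.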
JSGF
- Simple grammar formalism for ASR
- Standard for writing ASR grammars
- Actually finite state
- http://www.w3.org/TR/jsgf
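The greeting grammar from the earlier slide might be written in JSGF roughly as follows (rule names are illustrative):

```jsgf
#JSGF V1.0;
grammar greeting;

// Public rule: the top-level pattern the recognizer matches.
public <greeting> = hello | good <tod>;

<tod> = morning | afternoon;
```

Despite the BNF-like notation, a JSGF grammar with no recursive rules like this one defines a finite state language.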
Finite State Machines
Finite State Machines:
- Deterministic
  - Each arc leaving a state has a unique label
  - There always exists a deterministic machine representing a non-deterministic one
- Minimal
  - There exists an FSM with fewer (or equally many) states that accepts the same language
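The determinization claim is constructive: subset construction builds a deterministic machine whose states are sets of the original states. A minimal sketch, with a made-up example NFA:

```python
from collections import deque

def determinize(start, accept, arcs):
    """Subset construction. arcs: (state, label) -> set of next states.

    Returns DFA arcs keyed by (frozenset of NFA states, label), plus the
    set of accepting DFA states.
    """
    dfa_arcs, dfa_accept = {}, set()
    start_set = frozenset([start])
    queue, seen = deque([start_set]), {start_set}
    while queue:
        cur = queue.popleft()
        if cur & accept:
            dfa_accept.add(cur)
        labels = {lab for (s, lab) in arcs if s in cur}
        for lab in labels:
            # Union of all NFA states reachable from cur on this label.
            nxt = frozenset(t for s in cur for t in arcs.get((s, lab), ()))
            dfa_arcs[(cur, lab)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dfa_arcs, dfa_accept

# Non-deterministic: two "good" arcs leave state 0.
nfa = {(0, "good"): {1, 2}, (1, "morning"): {3}, (2, "afternoon"): {3}}
dfa_arcs, dfa_accept = determinize(0, {3}, nfa)
print(dfa_arcs[(frozenset({0}), "good")])  # a single arc to the set {1, 2}
```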
Probabilistic FSMs
- Each arc has a label and a probability
- Collect probabilities from data
- Can do smoothing, as with n-grams
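Scoring a word sequence with a probabilistic FSM then means summing log arc probabilities along the path. A sketch reusing the greeting machine, with invented probabilities:

```python
import math

# (state, word) -> (next state, arc probability). Numbers are illustrative;
# in practice they would be estimated (and smoothed) from data.
ARCS = {
    ("start", "hello"): ("end", 0.6),
    ("start", "good"): ("tod", 0.4),
    ("tod", "morning"): ("end", 0.7),
    ("tod", "afternoon"): ("end", 0.3),
}

def sequence_log_prob(words):
    """Log probability of the accepting path, or None if rejected."""
    state, logp = "start", 0.0
    for w in words:
        if (state, w) not in ARCS:
            return None
        state, p = ARCS[(state, w)]
        logp += math.log(p)
    return logp if state == "end" else None

print(sequence_log_prob(["good", "morning"]))  # log(0.4) + log(0.7)
```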
Natural Language Processing
- Probably mildly context sensitive
  - i.e. you need context sensitive rules
- But if we only accept context free
  - Probably OK
- If we only accept finite state
  - Probably OK too
Writing Grammars for Speech
- What do people say?
- No, what do people *really* say!
- Write examples
  - Please, I’d like a flight to Boston
  - I want to fly to Boston
  - What do you have going to Boston
  - What about Boston
  - Boston
- Write rules grouping things together
Ignore the unimportant things
- I’m terribly sorry but I would greatly appreciate if you might be able to help me find an acceptable flight to Boston.
- I, I wanna want to go to ehm Boston.
What do people really say
A: see who else will somebody else important all the {mumble} the whole school are out for a week
B: oh really
A: {lipsmack} {breath} yeah
B: okay {breath} well when are you going to come up then
A: um let’s see well I guess I I could come up actually anytime
B: okay well how about now
A: now
B: yeah
A: have to work tonight {laugh}
Class based language models
- Conflate all words in the same class
  - Cities, names, numbers, etc.
- Can be automatic or designed
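The idea can be sketched as a preprocessing step: replace class members with a class token before counting n-grams, so that "fly to Boston" and "fly to Denver" share statistics. The word list and token name below are illustrative.

```python
# Hypothetical designed class: city names.
CITY = {"boston", "denver", "pittsburgh"}

def classify(words):
    """Map each word to its class token, or keep it (lowercased) as-is."""
    return ["[CITY]" if w.lower() in CITY else w.lower() for w in words]

print(classify("I want to fly to Boston".split()))
# ['i', 'want', 'to', 'fly', 'to', '[CITY]']
```

At recognition time, a separate per-class model (here, a list of cities) expands the token back into actual words.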
Adaptive Language Models
- Update with new news stories
  - Update your language model every day
- Update your language model with daily use
  - Using user-generated data (if ASR is good)
Combining models
- Use “background” model
  - General tri-gram/neural model
- Use specific model
  - Grammar based
  - Very localized
- Combine
  - Interpolated (just a weight factor)
  - More elaborate combinations
    - Maximum entropy models
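The interpolated case really is just a weight factor. A sketch, with hypothetical probabilities and weight:

```python
def interpolate(p_background, p_specific, lam=0.7):
    """Linear interpolation: P(w|h) = lam * P_bg(w|h) + (1 - lam) * P_spec(w|h)."""
    return lam * p_background + (1 - lam) * p_specific

# Hypothetical next-word probabilities for "boston": rare in the general
# background model, common in the localized flight-domain model.
print(interpolate(0.001, 0.2))  # about 0.0607
```

Since each component is a proper distribution and the weights sum to one, the mixture is a proper distribution too; lam is typically tuned on held-out data.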
Vocabulary size
- Command and control
  - < 100 words, grammar based
- Simple dialog
  - < 1000 words, grammar/tri-gram
- Complex dialog
  - < 10K words, tri-gram (some grammar for control)
- Dictation
  - < 64K words, tri-gram
- Broadcast News
  - 256K plus, tri-gram/neural (and lots of other possibilities)
Homework 1
Build a speech recognition system:
- An acoustic model
- A pronunciation lexicon
- A language model
Note it takes time to build.
- What is your initial WER?
- How did you improve it?
Two stages:
- Fri 25th Sep 3:30pm: install and run all software
- Fri 2nd Oct 3:30pm: final submission