Speech Processing 11-492/18-495
Speech Processing Current Topics and Future challenges
Commercial and Research
Current and Future
What are the hot topics in Speech
What currently works
What could work soon (5-10 years)
What are the industry hot topics
What are the research challenges
Spoken Dialog: Now
Industry:
Location-based querying
On phone: Apple (Siri)
In home: Amazon (Echo)
Smartphones, Tablets:
(Owners have money)
IoT deployment
How do you make money out of this …
Spoken Dialog: Now
Research
Error recovery
Adaptive systems
Rapid deployment
Learning dialog structure from data
Non-task-oriented dialog
ASR: Now
Industry
Adapting cloud ASR per app.
Broadcast news transcription
Robust speech recognition:
In car, outside, in noisy office, far field
LM adaptation from other sources
Using click through and search queries
Pronunciation variants (“wrong” ones too)
Medical transcription
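As an illustration of the "LM adaptation from other sources" bullet above, here is a minimal sketch of one common approach: linearly interpolating a general language model with one estimated from in-domain text such as logged search queries. The slides do not say which method is meant, so the technique, the toy data, and the names `unigram_lm`, `interpolate`, and `weight` are all assumptions made for illustration. Real systems would use higher-order n-gram or neural models and tune the weight on held-out data.

```python
from collections import Counter


def unigram_lm(sentences):
    """Maximum-likelihood unigram probabilities from whitespace-tokenized text."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def interpolate(base_lm, domain_lm, weight):
    """Linear interpolation: P(w) = (1 - weight) * P_base(w) + weight * P_domain(w)."""
    vocab = set(base_lm) | set(domain_lm)
    return {
        w: (1.0 - weight) * base_lm.get(w, 0.0) + weight * domain_lm.get(w, 0.0)
        for w in vocab
    }


# Hypothetical toy data: a small general corpus and a few logged search queries.
general_text = ["the weather is nice today", "play the news"]
search_queries = ["pizza near me", "coffee near me", "weather near me"]

base = unigram_lm(general_text)
domain = unigram_lm(search_queries)

# weight=0.5 is an arbitrary choice for the sketch, not a recommended setting.
adapted = interpolate(base, domain, weight=0.5)
```

After adaptation, words that are frequent in the query log (such as "near") receive probability mass they did not have in the general model, while the distribution still sums to one.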
ASR: Now
Research:
Discriminative training
Acoustic parameter projections to discriminate between the correct answers and competitors
Robust recognition
Far field microphones
Blind source separation
Out of vocabulary words
Unsupervised training
Deep Learning (Neural Nets)
Zero-resource ASR
TTS: Now
Industry
Building custom voices (and your voice)
Multilingual on small devices
E.g. for GPS Navigation over Europe
Easy methods to build new languages
Conversational Speech
TTS: Now
Research
Improving neural synthesis
- Quality/Resources/Runtime computation
Rapid support in new languages
Emotional speech synthesis
Automatic building of voices from data
Without any human intervention
Languages without Orthography
Synthesis beyond the sentence
Synthesis with more text analysis
Speech to Speech Translation
Industry
One way systems, domain limited systems
Simple targeted cell phone systems
YouTube/Broadcast translation
Skype translation
Research
Two way systems, large domains
One way lecture/broadcast news
VC and SID: Now
Voice conversion
Cross Lingual Voice Conversion
Emotion/style conversion
Conversion without training data
Speaker ID
Accuracy on large data sets (> 1000 speakers)
Cross channel/language ID
More information in ID (prosody, vocab)
CALL: Now
Industry
Pronunciation training
Scenario practicing
Research
Game-based tools
Measuring educational contribution
Speech Processing Future
Hard challenges (PhD topics and beyond)
All on the research side
But maybe in Research Labs
Speech Reco without Speech
Using other modalities
Lip movement, muscle movement
Silent speech
No generated audio
Just think about the words
Gesture recognition
Brain Computer Interfaces
ASR without text
Find “….” in all this audio
Beyond the Words
Recognition of more than words
Intent, style, emotion
Human-Machine
Frustration, confidence, agreement
Human-Human
Rapport, relationships, persuasion
Truth and lies
Sentiment
Conversational Systems
Participant in a meeting
True conversational speech
Appropriate non-word speech generation
Know when to speak, when to laugh, when to listen
Appropriately timed conversation
Able to interrupt when having something to say
Have something to say
Summaries and Discussions
Describe a paper/movie/event
Appropriate summary
Allow questions
Know when to use style/emotion
Not just speech<->text
Understand more of the text content
Answer complex questions
Engage user and discuss topic