Speech Processing 11-492/18-495 Speech Processing Current Topics - - PowerPoint PPT Presentation

SLIDE 1

Speech Processing 11-492/18-495

Speech Processing Current Topics and Future challenges Commercial and Research

SLIDE 2

Current and Future

• What are the hot topics in Speech
• What currently works
• What could work soon (5-10 years)
• What are the industry hot topics
• What are the research challenges

SLIDE 3

Spoken Dialog: Now

• Industry:
  • Location based querying
  • On phone: Apple (Siri)
  • In home: Amazon (Echo)
  • Smartphones, Tablets:
    • (Owners have money)
  • IoT deployment
  • How do you make money out of this …

SLIDE 4

Spoken Dialog: Now

• Research
  • Error recovery
  • Adaptive systems
  • Rapid deployment
  • Learning dialog structure from data
  • Non-task oriented dialog

SLIDE 5

ASR: Now

• Industry
  • Adapting cloud ASR per app.
  • Broadcast news transcription
  • Robust speech recognition:
    • In car, outside, in noisy office, far field
  • LM adaptation from other sources
    • Using click through and search queries
  • Pronunciation variants (“wrong” ones too)
  • Medical transcription

SLIDE 6

ASR: Now

• Research:
  • Discriminative training
    • Acoustic parameter projections to discriminate between the correct answers and competitors
  • Robust recognition
    • Far field microphones
    • Blind source separation
  • Out of vocabulary words
  • Unsupervised training
  • Deep Learning (Neural Nets)
  • Zero-resource ASR

SLIDE 7

TTS: Now

• Industry
  • Building custom voices (and your voice)
  • Multilingual on small devices
    • E.g. for GPS Navigation over Europe
  • Easy methods to build new languages
  • Conversational Speech

SLIDE 8

TTS: Now

• Research
  • Improving neural synthesis
    • Quality/Resources/Runtime computation
  • Rapid support in new languages
  • Emotional speech synthesis
  • Automatic building of voices from data
    • Without any human intervention
  • Languages without Orthography
  • Synthesis beyond the sentence
  • Synthesis with more text analysis

SLIDE 9

Speech to Speech Translation

• Industry
  • One way systems, domain limited systems
  • Simple targeted cell phone systems
  • Youtube/Broadcast translation
  • Skype translation
• Research
  • Two way systems, large domains
  • One way lecture/broadcast news

SLIDE 10

VC and SID: Now

• Voice conversion
  • Cross Lingual Voice Conversion
  • Emotion/style conversion
  • Conversion without training data
• Speaker ID
  • Accuracy on large data sets (> 1000 speakers)
  • Cross channel/language ID
  • More information in ID (prosody, vocab)

SLIDE 11

CALL: Now

• Industry
  • Pronunciation training
  • Scenario practicing
• Research
  • Game based tools
  • Measuring educational contribution

SLIDE 12

Speech Processing Future

• Hard challenges (PhD topics and beyond)
• All on the research side
• But maybe in Research Labs

SLIDE 13

Speech Reco without Speech

• Using other modalities
  • Lip movement, muscle movement
• Silent speech
  • No generated audio
  • Just think about the words
• Gesture recognition
• Brain Computer Interfaces
• ASR without text
  • Find “….” in all this audio

SLIDE 14

Beyond the Words

• Recognition of more than words
  • Intent, style, emotion
• Human-Machine
  • Frustration, confidence, agreement
• Human-Human
  • Rapport, relationships, persuasion
• Truth and lies
• Sentiment

SLIDE 15

Conversational Systems

• Participant in a meeting
• True conversational speech
• Appropriate non-word speech generation
• Know when to speak, when to laugh, when to listen
• Appropriate timing conversation
• Able to interrupt when having something to say
• Have something to say

SLIDE 16

Summaries and Discussions

• Describe a paper/movie/event
  • Appropriate summary
  • Allow questions
  • Know when to use style/emotion
• Not just speech<->text
  • Understand more of the text content
  • Answer complex questions
  • Engage user and discuss topic

SLIDE 17