Speech Processing 15-492/18-492
Speech Translation Case study: Transtac Details
Transtac: Two-Way S2S System
- Developed under DARPA for
  - Checkpoints, medical, and civil defense
- Requirements
  - Two-way
  - Eyes-free (no screen)
  - Portable
  - Usable by real users
Transtac System
- Laptop secured in backpack
- Optional speech control
- Push-to-talk buttons
- Close-talking microphone
- Small, powerful speakers
Transtac System Details
- Two-way system
  - 2 ASR systems: English and Iraqi
  - 2-way statistical translation
  - 2 synthesizers
- Push-to-talk system
  - (Users don’t like “translate everything” mode)
- Echo back the ASR result
  - And then the translation
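The turn order above (recognize, echo back the ASR result, then speak the translation) can be sketched as a small driver function. This is a minimal illustration, not the Transtac code: `push_to_talk_turn` and the callables it takes are hypothetical stand-ins for the real ASR, MT, and TTS components.

```python
from typing import Callable, List


def push_to_talk_turn(audio: bytes,
                      asr: Callable[[bytes], str],
                      translate: Callable[[str], str],
                      speak_source: Callable[[str], None],
                      speak_target: Callable[[str], None]) -> str:
    """Handle one push-to-talk turn and return the translation.

    All five arguments are hypothetical stand-ins for real components.
    """
    hypothesis = asr(audio)            # recognize only audio captured while the button was held
    speak_source(hypothesis)           # 1. echo the ASR result back to the speaker
    translation = translate(hypothesis)
    speak_target(translation)          # 2. then speak the translation to the listener
    return translation


# Usage with stand-in components that just record what would be played.
played: List[str] = []
result = push_to_talk_turn(
    b"<audio captured while button held>",
    asr=lambda audio: "where do you live",
    translate=lambda text: "<Iraqi Arabic translation>",
    speak_source=lambda text: played.append("echo: " + text),
    speak_target=lambda text: played.append("translation: " + text),
)
```

Keeping the echo as a separate step lets the speaker catch a recognition error before the listener hears a wrong translation.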
Iraqi Language
- Iraqi Arabic is a dialect
  - Most Iraqis write Modern Standard Arabic (MSA)
  - Most Iraqis do not write their own dialect
- No standardized spelling
  - The Transtac project invented one
  - But Iraqis may not be used to it
- Arabic (MSA and dialects)
  - Short vowels are not written in words
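Because short vowels are normally left out, a fully vowelled word and its everyday written form differ only in optional diacritic marks. A minimal sketch of that normalization, assuming the text is Unicode and that the marks to drop are the Arabic harakat in the range U+064B to U+0652; `strip_short_vowels` is a hypothetical helper name.

```python
import re

# Arabic diacritics: tanween (U+064B-U+064D), the short vowels fatha, damma and
# kasra (U+064E-U+0650), shadda (U+0651) and sukun (U+0652).
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_short_vowels(text: str) -> str:
    """Return the text as it is usually written, without short-vowel marks."""
    return ARABIC_DIACRITICS.sub("", text)


vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote", with fatha marks
print(strip_short_vowels(vowelled))  # prints كتب (U+0643 U+062A U+0628)
```

The unvowelled spelling كتب can be read as kataba ("he wrote"), kutiba ("it was written"), or kutub ("books"), which is why one written form can need several pronunciations in the lexicon.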
Data for Training
- Collected human-mediated dialogs
  - A human acts as the machine
  - Passed a microphone back and forth
  - Tried to get people not to talk at the same time
- Large number of collections (over 4 years)
  - 650 thousand sentence pairs
  - Many different speakers
- Hand-transcribed by experts (in Iraqi spelling)
- Hand-translated (source sentences and the interpreter’s)
Iraqi ASR
- Acoustic model from Iraqi data
  - Based on the MSA phoneset
  - Models need to be small and fast
  - Discriminative training
  - Speaker-specific adaptation
- Lexicon
  - Based on the LDC-provided lexicon
  - Multiple pronunciations/typos still a problem
  - Statistically trained LTS (letter-to-sound) rules
- Language model
  - Trained on Iraqi input (and translated output)
English ASR
- Acoustic model
  - Originally used other models
  - Then trained from collected data (mostly military personnel)
- Lexicon
  - Existing lexicon, but military terms had to be added: MRAP, IED
- Language model
  - Trained from the provided data
  - Trained from “similar” data found on the web
  - Trained from hand-created “typical” examples
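One common way to combine language-model data from several sources like these is linear interpolation, P(w | h) = Σ_i λ_i P_i(w | h) with Σ_i λ_i = 1. The slide does not say how Transtac combined its sources, so this sketch is an assumption: unsmoothed maximum-likelihood bigram models, made-up example sentences, and hypothetical weights.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Bigram = Tuple[str, str]


def bigram_model(sentences: List[str]) -> Dict[Bigram, float]:
    """Maximum-likelihood bigram probabilities P(w | prev), with <s> and </s> markers."""
    pair_counts: Counter = Counter()
    prev_counts: Counter = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, w in zip(words, words[1:]):
            pair_counts[(prev, w)] += 1
            prev_counts[prev] += 1
    return {pair: c / prev_counts[pair[0]] for pair, c in pair_counts.items()}


def interpolate(models: Sequence[Dict[Bigram, float]],
                weights: Sequence[float]) -> Callable[[str, str], float]:
    """Linear interpolation: P(w | prev) = sum_i weight_i * P_i(w | prev)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")

    def prob(prev: str, w: str) -> float:
        return sum(lam * m.get((prev, w), 0.0) for m, lam in zip(models, weights))

    return prob


# Hypothetical example data for the three sources on the slide.
collected = ["stop the vehicle", "show me your id"]  # provided dialog data
web = ["stop the car"]                               # "similar" web text
handmade = ["show me your hands"]                    # hand-written "typical" examples

lm = interpolate([bigram_model(collected), bigram_model(web), bigram_model(handmade)],
                 weights=[0.6, 0.2, 0.2])
print(lm("the", "car"))  # 0.2: only the web data has this bigram
```

In practice the weights are usually tuned on held-out in-domain data (for example with EM to minimize perplexity), and each component model would be smoothed rather than left as raw maximum likelihood.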
TTS
- Standard English TTS
  - Appropriate “command” voice
  - Unit selection
  - Added lots of military vocabulary
- Iraqi TTS
  - Recorded from an Iraqi radio announcer
  - Based on example sentences in the domain
  - LDC lexicon and LTS rules (same as ASR)
  - Hand-tuned
S2S Interface Issues
- How do you teach people to use the system?
  - “Transtac, say instructions”
  - Not really sufficient
- How can you tell it translated correctly?
  - Give (speech) feedback
    - Backtranslation
    - ASR echo back
S2S Interface Issues
- How do you translate names?
  - A correct translation/transliteration is hard to understand
  - Mark names in translations
    - “My name is … Abdullah”
    - “He lives on … al-Aqar … street”
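Marking a name can be as simple as putting a pause marker on each side of it before the text goes to the synthesizer, which reproduces the two examples above. A minimal sketch, assuming the name spans are already known (for example from the MT system or a named-entity tagger) and that the synthesizer renders `…` as a short break; `mark_names` and `PAUSE` are hypothetical.

```python
from typing import List, Tuple

PAUSE = "…"  # hypothetical marker the synthesizer renders as a short break


def mark_names(tokens: List[str], name_spans: List[Tuple[int, int]]) -> str:
    """Surround each name span [start, end) with pause markers so the listener
    hears the name set apart from the translated words. No trailing pause is
    added when a name ends the utterance."""
    starts = {s for s, _ in name_spans}
    ends = {e for _, e in name_spans}
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append(PAUSE)
        out.append(tok)
        if i + 1 in ends and i + 1 < len(tokens):
            out.append(PAUSE)
    return " ".join(out)


print(mark_names(["My", "name", "is", "Abdullah"], [(3, 4)]))
print(mark_names(["He", "lives", "on", "al-Aqar", "street"], [(3, 4)]))
```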
S2S Evaluation (Transtac)
- Offline tests
  - ASR->Text and Text->Text
  - Compare to translation references
  - WER and “BLEU” score
- Online tests
  - Concept transfer (through defined scenarios)
  - Speed (number of concepts per minute)
  - (English speech masking)
- Utility tests
  - Does it really work?
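For the offline ASR tests, WER is the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. A minimal sketch of that standard definition; real scoring tools such as NIST sclite also normalize the text first, which is omitted here, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution (your -> you) and one deletion (card) over 5 reference words.
print(word_error_rate("show me your id card", "show me you id"))  # 0.4
```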
Transtac Participants
- Developer groups
  - IBM
  - SRI
  - BBN
  - CMU
  - USC
- Evaluations
  - Twice a year in Iraqi Arabic (somewhere in DC)
  - One surprise language (Farsi, Bahasa Malay)
  - Other evaluations with military groups
Does it work?
- Yes, mostly
  - 27 concepts out of 30-ish turns
- Systems are mostly similar
  - But some are better than others
- Other techniques
  - Belt/holster-based PC with handheld speaker
  - Small PC in pouch
  - Chest-mounted array microphone
S2S ASR Advanced Issues
- Tight coupling
  - ASR should output N-best
  - Translate all (lattice)
  - Choose the best translation
  - (MT as an LM for ASR)
- Remove disfluencies/hesitations
- Add more relevant data
  - Automatically convert past-tense/third-person data to present-tense/first+second-person …
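Tight coupling can be approximated with an N-best list instead of a full lattice: translate every ASR hypothesis and keep the one whose combined ASR and MT score is highest, so the translation model acts as an extra language model for recognition. This is a sketch under assumptions: both scores are log-probabilities, `mt_weight` is a hypothetical tuning weight, and the toy lookup table stands in for a real MT system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    source: str       # ASR hypothesis
    translation: str  # MT output for that hypothesis
    score: float      # combined log-score


def rerank_nbest(nbest: List[Tuple[str, float]],
                 translate: Callable[[str], Tuple[str, float]],
                 mt_weight: float = 1.0) -> Optional[Candidate]:
    """Translate every ASR hypothesis and keep the pair with the highest
    combined score log P_asr + mt_weight * log P_mt."""
    best: Optional[Candidate] = None
    for hyp, asr_logprob in nbest:
        translation, mt_logprob = translate(hyp)
        total = asr_logprob + mt_weight * mt_logprob
        if best is None or total > best.score:
            best = Candidate(hyp, translation, total)
    return best


# Toy stand-ins: an ASR N-best list and a lookup table instead of a real MT system.
nbest = [("open the drunk please", -3.9), ("open the trunk please", -4.1)]
toy_mt = {
    "open the drunk please": ("<translation B>", -6.5),
    "open the trunk please": ("<translation A>", -2.0),
}
best = rerank_nbest(nbest, translate=lambda hyp: toy_mt[hyp])
print(best.source)  # open the trunk please
```

ASR alone prefers "drunk" (-3.9 vs -4.1), but that hypothesis translates poorly, so the combined score (-6.1 vs -10.4) picks "trunk".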
S2S TTS Advanced Issues
- MT output isn’t grammatical
  - TTS doesn’t care and just says it
  - TTS should try to say MT output with more breaks
- TTS (unit selection)
  - As an LM on MT output
  - Choose the translation that can be said best
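Using the synthesizer as a language model means preferring, among several candidate translations, the one the unit-selection voice can say most smoothly. Real unit selection scores target and join costs over acoustic units; this sketch swaps in a much cruder, hypothetical word-level proxy: the fraction of adjacent word pairs that never occur together in the voice's recorded prompts, since each such pair forces a join between separate recordings. The prompts and candidates are made-up examples.

```python
from typing import Iterable, List, Set, Tuple


def recorded_bigrams(prompts: Iterable[str]) -> Set[Tuple[str, str]]:
    """Word pairs that occur contiguously somewhere in the voice's recorded prompts."""
    pairs: Set[Tuple[str, str]] = set()
    for prompt in prompts:
        words = prompt.lower().split()
        pairs.update(zip(words, words[1:]))
    return pairs


def join_cost(candidate: str, pairs: Set[Tuple[str, str]]) -> float:
    """Hypothetical proxy for unit-selection quality: fraction of adjacent word
    pairs in the candidate that were never recorded together."""
    words = candidate.lower().split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return sum(b not in pairs for b in bigrams) / len(bigrams)


def pick_most_speakable(candidates: List[str], prompts: Iterable[str]) -> str:
    """Choose the candidate translation with the lowest proxy join cost."""
    pairs = recorded_bigrams(prompts)
    return min(candidates, key=lambda c: join_cost(c, pairs))


prompts = ["please step out of the vehicle", "show me your hands"]
candidates = ["step out of vehicle please", "please step out of the vehicle now"]
print(pick_most_speakable(candidates, prompts))  # please step out of the vehicle now
```

The first candidate has 2 unrecorded pairs out of 4 (cost 0.5); the second has only 1 out of 6, so it is chosen.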