Simultaneous Speech Translation: Graham Neubig, Nara Institute of Science and Technology (NAIST)



SLIDE 1

Simultaneous Speech Translation

Graham Neubig Nara Institute of Science and Technology (NAIST) 10/16/2015

Joint Work With: Satoshi Nakamura, Tomoki Toda, Sakriani Sakti, Tomoki Fujita, Hiroaki Shimizu, Yusuke Oda, Takashi Mieno, Quoc Truong Do

SLIDE 2

Background

SLIDE 3

Speech Translation

Source: Microsoft Research http://research.microsoft.com/en-us/news/features/translator-052714.aspx
Source: NICT http://www.nict.go.jp/press/2010/06/29-1.html
Source: Karlsruhe Institute of Technology http://isl.anthropomatik.kit.edu/english/1520.php

SLIDE 4

Traditional Speech Translation

ASR → こんにちは、駅はどこですか? → MT → Hello, where is the station? → TTS

Divide at sentence boundaries

SLIDE 5

Problem: Delay (Ear-Voice Span)

ASR → こんにちは、駅はどこですか? → MT → Hello, where is the station? → TTS

Delay

SLIDE 6

Speech Translation Example

SLIDE 7

Simultaneous Speech Translation

ASR output is translated and synthesized segment by segment:

こんにちは、 → MT → Hello, → TTS
駅は → MT → the station → TTS
どこですか? → MT → where is it? → TTS

Delay: Reduced

But, this is not easy!

SLIDE 8

Professional Simultaneous Interpretation

Photo Credit:
https://www.flickr.com/photos/joi/2027679714
https://www.flickr.com/photos/european_parliament/4268490015

SLIDE 9

Simultaneous Interpretation Data [Shimizu+ LREC14]

  • Recorded data
    - About 10 hours of TED Talks (English-Japanese, Japanese-English)
  • Simultaneous interpreters
    - 3 pros with varying years of experience
    - Ranked S, A, and B

    Experience   Rank
    15 years     S rank
    4 years      A rank
    1 year       B rank

Freely available for research purposes: http://ahclab.naist.jp/resource/stc/

SLIDE 10

Simultaneous Interpreter Example

SLIDE 11

So How do Simultaneous Interpreters Do It?

Source:
今ご覧いただいたこの映像は今から五年前、日本で世間を賑わせていた裁判員制度が始まる一年前、大学四年生だった私が模擬裁判用の資料として作った物です

Translation:
Five years ago, as a college senior, I created the video that you just saw as a reference material for a mock trial, one year before the much-talked-about jury system commenced in Japan.

Interpretation:
You just saw this video clip. Five years ago, at that time in Japan, the ordinary people's justice system, jury system, was very much talked about in Japan, and I created this video as a reference material for that.

Techniques seen in the interpretation: Segmentation, Prediction, Rewording, Summarization

Predict NP

SLIDE 12

Can We Do the Same in Speech Translation Systems?

Four problems in this talk:

  • Segmentation: When do we start translating?
  • Prediction: Can we predict things that haven't been said?
  • Rewording: Can we reword sentences to be conducive to simultaneous translation?
  • Evaluation: How do we decide which results are better?

SLIDE 13

Segmentation

SLIDE 14

Heuristic Segmentation Strategies

Example input: hello where is the station

  • Division on pauses [Fugen+ 07, Bangalore+ 12]
  • Division on predicted commas [Sridhar+ 13] (comma / no comma)
  • Division based on reordering probabilities [Fujita+ 13]
    hello → probability of reordering 0.1
    where → probability of reordering 0.8
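
As a rough illustration of division based on reordering probabilities, the sketch below places a boundary after every word whose reordering probability falls below a threshold. Only the values 0.1 and 0.8 come from the slide; the other probabilities, the threshold, the decision rule, and the name `segment_by_reordering` are assumptions made for this example and are not taken from [Fujita+ 13].

```python
def segment_by_reordering(words, reorder_prob, threshold=0.5):
    """Split `words` after each word whose reordering probability is low.

    Assumption: a low probability means the words seen so far can be
    translated without waiting for more input.
    """
    segments, current = [], []
    for word in words:
        current.append(word)
        if reorder_prob.get(word, 1.0) < threshold:
            segments.append(current)
            current = []
    if current:  # flush any trailing words into a final segment
        segments.append(current)
    return segments


# 0.1 and 0.8 are the slide's values; the remaining ones are made up.
probs = {"hello": 0.1, "where": 0.8, "is": 0.7, "the": 0.9, "station": 0.6}
segments = segment_by_reordering("hello where is the station".split(), probs)
print(segments)
```

With these toy values the only boundary falls after "hello", so the greeting can be translated before the question is complete.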

SLIDE 15

Optimizing Segmentation Strategies for Simultaneous Speech Translation [Oda+ ACL14]

  • All previous segmentation strategies were based on heuristics
  • Don't directly take into account effect on translation accuracy

What if we could directly optimize sentence segmentation for translation accuracy?

SLIDE 16

Training/Testing Framework

Training:

  • Training Corpus (src/trg sentence pairs)
  • Find segmentation S* that maximizes MT accuracy
  • Train segmentation model on S* → Model

Testing:

  • Testing Corpus (src) → Segment (using the Model) → Segmented Test
  • Segmented Test → Translate → Translated Test (trg)

SLIDE 17

S* Search Method 1: Greedy Search

[Figure: greedy search for boundaries in "I ate lunch but she left" (reference: 私は昼食を食べたが彼女は帰った). Each candidate boundary position is tried, the resulting segments are translated, and each candidate is scored (example scores shown: 0.7, 0.4, 0.6, 1.0, 0.2, 0.9, 0.3, 0.6, 0.2); the best-scoring boundary is kept.]

Train an SVM classifier to recover the boundaries ("/") at test time
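
A minimal sketch of the greedy idea, assuming boundaries are added one at a time and each candidate is scored by an accuracy function. The scoring function `toy_accuracy` is a hypothetical stand-in for "translate the segments and measure accuracy", and the names `greedy_boundaries` and `toy_accuracy` are illustrative only.

```python
def greedy_boundaries(n_words, n_boundaries, accuracy):
    """Greedily pick boundary positions (1 .. n_words - 1), one per step.

    `accuracy(boundaries)` is a caller-supplied function that scores a
    set of boundary positions; higher is better.
    """
    chosen = set()
    for _ in range(n_boundaries):
        candidates = [p for p in range(1, n_words) if p not in chosen]
        if not candidates:
            break
        # Keep the single boundary that gives the best score so far.
        best = max(candidates, key=lambda p: accuracy(chosen | {p}))
        chosen.add(best)
    return sorted(chosen)


words = "I ate lunch but she left".split()


def toy_accuracy(boundaries):
    # Stand-in for real MT scoring: rewards a boundary before "but"
    # (position 3) and penalizes every other position.
    return sum(1.0 if p == 3 else -0.5 for p in boundaries)


best = greedy_boundaries(len(words), 1, toy_accuracy)
print(best, words[:best[0]], words[best[0]:])
```

In a real system the scoring call is the expensive part, since every candidate requires translating the segmented data again.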

SLIDE 18

S* Search Method 2: Grouping by Features

    I    ate  lunch  but  she  left
    PRN  VBD  NN     CC   PRN  VBD

    I    ate  an   apple  and  an   orange
    PRN  VBD  DET  NN     CC   DET  NN

Boundary groups: Pronoun + Verb, Noun + Conjunction, Determiner + Noun

  • Because MT/Evaluation is complicated, there is the potential to overfit
  • Solution: group boundaries by features
  • Search can be performed using dynamic programming
  • Features for the model are trivial, so no learning is needed
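
The grouping idea can be sketched as follows, assuming the feature of a boundary is the pair of POS tags on either side of it. Once a set of tag pairs has been selected, every position with a matching pair becomes a boundary. The search that selects the pairs is not shown, and the name `segment_by_pos_bigrams` is an assumption for this example.

```python
def segment_by_pos_bigrams(tagged, boundary_bigrams):
    """Split a POS-tagged sentence wherever the (left tag, right tag)
    pair of two adjacent words is in `boundary_bigrams`."""
    segments, current = [], []
    for i, (word, tag) in enumerate(tagged):
        current.append(word)
        has_next = i + 1 < len(tagged)
        if has_next and (tag, tagged[i + 1][1]) in boundary_bigrams:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments


sent1 = [("I", "PRN"), ("ate", "VBD"), ("lunch", "NN"),
         ("but", "CC"), ("she", "PRN"), ("left", "VBD")]
sent2 = [("I", "PRN"), ("ate", "VBD"), ("an", "DET"), ("apple", "NN"),
         ("and", "CC"), ("an", "DET"), ("orange", "NN")]

selected = {("NN", "CC")}  # the "Noun + Conjunction" group
print(segment_by_pos_bigrams(sent1, selected))
print(segment_by_pos_bigrams(sent2, selected))
```

Because one selected pair applies to every sentence that contains it, a decision is shared across the corpus rather than fitted to a single sentence.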

SLIDE 19

Results on TED Talks

→ 2-3 times faster with no loss in BLEU

SLIDE 20

Simultaneous Translation Demo

  • Greedy+Grouping at 10 words

SLIDE 21

Future Contributions to Segmentation?

  • Speech:

Optimized models using acoustic features?

  • Parsing:

Incorporation with incremental parsing? e.g. [Ryu+ 06]

  • Machine Learning:

Smarter models: neural networks?

  • Algorithms:

Integration with incremental decoding? e.g. [Sankaran+ 10]

SLIDE 22

Prediction

SLIDE 23

What Kind of Prediction do Simultaneous Interpreters Do? [Wilss 78, Chernov+ 04]

  • Structural prediction
  • Lexical prediction

Example 1
    Source:      サイエンスを正しく楽しく、これを合い言葉にサイエンス CG クリエーターとして活動しています。
    Gloss:       science factual fun / this keyword as / science CG creator as working
    Interpreter: then what I wanted to do is to promote fun and factual science, that's my keyword. I'm a …

Example 2
    Source:      今 ご覧頂いた 映像
    Gloss:       now you saw video
    Interpreter: you just saw a video clip

SLIDE 24

Predicting Sentence-final Verbs [Grissom et al., EMNLP14]

  • Method for translating from verb-final languages (e.g. German)
  • Train a classifier to predict the sentence-final verb
  • Use reinforcement learning to decide whether to “wait”, “predict”, or “commit”

SLIDE 25

Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic Constituents [Oda+ ACL15]

  • Predict unseen syntax constituents

Input: In the next 18 minutes I

[Figure: the input parsed as-is (root PP), and the parse obtained after predicting a following VP (root S, with the predicted constituent shown as "(VP)")]

  • Translate from correct tree

    Without prediction: 今 から 18 分 私
    With predicted VP:  今 から 18 分 で 私 は (VP)

SLIDE 26

Why is Syntax Necessary?

  • Tree-to-string (T2S) MT framework

[Figure: "This is NP" is parsed into a tree (labels S, NP, VP, DT, VBZ, NP) and translated by MT into これ は NP で す]

  • Obtains state-of-the-art results on syntactically distant language pairs (cf. phrase-based translation; PBMT)
  • Possible to use additional syntactic constituents explicitly
  • Additional heuristic to wait for more input based on when translation requires reordering

SLIDE 27

Making Training Data for Syntax Prediction

  • Decompose gold trees in the treebank

[Figure: gold tree for "This is a pen" (labels S, NP, VP, DT, VBZ, NP, DT, NN) with a selected leaf span]

  1. Select any leaf span in the tree
  2. Find the path between leftmost/rightmost leaves
  3. Delete the outside subtree
  4. Replace inside subtrees with topmost phrase label
  5. Finally we obtain:

     Left syntax   Leaf span   Right syntax
     nil           is a        NN nil
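
One possible reading of the five steps above, sketched on the "This is a pen" tree: find the smallest subtree that covers the chosen leaf span, then collapse every subtree lying wholly outside the span into its topmost label. The tuple encoding of trees and the names `annotate` and `unseen_constituents` are assumptions for this example; empty lists play the role of "nil".

```python
def annotate(tree, lo=0):
    """Turn (label, child, ...) into (label, lo, hi, kids) with leaf offsets."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):  # preterminal: (POS, word)
        return (label, lo, lo + 1, [])
    kids, pos = [], lo
    for child in rest:
        node = annotate(child, pos)
        kids.append(node)
        pos = node[2]
    return (label, lo, pos, kids)


def unseen_constituents(tree, start, end):
    """Return (left, right) label lists for the leaf span [start, end)."""
    node = annotate(tree)
    # Descend to the smallest subtree that still covers the whole span;
    # everything outside that subtree is dropped.
    while True:
        covering = [k for k in node[3] if k[1] <= start and end <= k[2]]
        if not covering:
            break
        node = covering[0]
    left, right = [], []

    def collect(n):
        label, lo, hi, kids = n
        if hi <= start:          # wholly left of the span: keep top label
            left.append(label)
        elif lo >= end:          # wholly right of the span: keep top label
            right.append(label)
        elif not (start <= lo and hi <= end):  # straddles a span edge
            for k in kids:
                collect(k)

    for k in node[3]:
        collect(k)
    return left, right


gold = ("S", ("NP", ("DT", "This")),
        ("VP", ("VBZ", "is"), ("NP", ("DT", "a"), ("NN", "pen"))))

print(unseen_constituents(gold, 1, 3))  # leaf span "is a"
print(unseen_constituents(gold, 0, 2))  # leaf span "This is"
```

For the span "is a" this yields an empty left side and NN on the right, matching the slide's example; for "This is" it yields an NP on the right.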

SLIDE 28

Syntax Prediction Process

Input translation unit: in the next 18 minutes I

  1. Parse the input as-is
     [Figure: parse of the input, root PP (labels PP, IN, NP, DT, JJ, CD, NNS, NN)]
  2. Extract features
     Word:R1=I POS:R1=NN Word:R1-2=I,minutes POS:R1-2=NN,NNS ... ROOT=PP ROOT-L=IN ROOT-R=NP ...
  3. Predict the next tag (linear SVM)
     VP ... 0.65    NP ... 0.28    nil ... 0.04    ...
  4. Append to sequence (here: VP)
  5. Repeat until nil
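
The repeat-until-nil loop in steps 3 to 5 can be sketched as below. The classifier is passed in as a plain callable; `toy_classifier` is a hypothetical rule standing in for the trained linear SVM, and the `max_len` guard is an added assumption to keep the loop bounded.

```python
def predict_constituents(features, predict_next, max_len=10):
    """Predict constituent labels one at a time until "nil" is returned.

    `predict_next(features, so_far)` stands in for the classifier and
    must return a label string.
    """
    sequence = []
    while len(sequence) < max_len:
        label = predict_next(features, sequence)
        if label == "nil":
            break
        sequence.append(label)
    return sequence


def toy_classifier(features, so_far):
    # Hypothetical rule, not a trained model: for a PP-rooted input,
    # expect one VP and then stop.
    if features.get("ROOT") == "PP" and not so_far:
        return "VP"
    return "nil"


features = {"Word:R1": "I", "POS:R1": "NN",
            "ROOT": "PP", "ROOT-L": "IN", "ROOT-R": "NP"}
predicted = predict_constituents(features, toy_classifier)
print(predicted)
```

Passing the labels predicted so far back into the classifier lets each prediction depend on the previous ones.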

SLIDE 29

Results: Translation Trade-off (1)

[Figure: translation accuracy (BLEU and RIBES panels) against mean #words in inputs (∝ delay, from short to long) for PBMT]

  • Short inputs reduce translation accuracies

Using N-words segmentation (not-optimized)
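
For reference, a minimal sketch of N-words segmentation, assuming it means cutting the input into consecutive chunks of N words with a shorter final chunk. The function name `n_word_segments` is illustrative.

```python
def n_word_segments(words, n):
    """Cut `words` into consecutive chunks of at most `n` words."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [words[i:i + n] for i in range(0, len(words), n)]


chunks = n_word_segments("in the next 18 minutes I".split(), 4)
print(chunks)
```

Smaller N means each chunk is sent to translation sooner, which is why the mean number of words per input serves as a proxy for delay.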

SLIDE 30

Results: Translation Trade-off (2)

[Figure: translation accuracy (BLEU and RIBES panels) against mean #words in inputs (∝ delay, from short to long) for T2S and PBMT]

  • Long phrase ... T2S > PBMT
  • Short phrase ... T2S < PBMT

SLIDE 31

Results: Translation Trade-off (3)

[Figure: translation accuracy (BLEU and RIBES panels) against mean #words in inputs (∝ delay, from short to long) for T2S, PBMT, and Proposed]

  • Prevents accuracy from decreasing on short phrases
  • More robust to reordering

SLIDE 32

Future Contributions to Prediction?

  • Language Modeling:

More sophisticated models for lexical prediction.

  • Lexical Simplification:

Predict a more general word, then replace it later?

  • Machine Learning:

End-to-end reinforcement learning of the whole system? Application of neural MT models?

SLIDE 33

Rewording

SLIDE 34

What Kinds of Rewording May Be Helpful?

  • Passivization [He+ 15]
    Source:     私は (I) 昨日 (yesterday) 安い (a cheap) 本を (book) 買った (bought)
    Standard:   I bought a cheap book yesterday
    Passivized: yesterday a cheap book was bought by me
  • Conjunction Clauses [Shimizu+ 13]
    Y dakara X → X nazenaraba Y (X because Y)
  • etc.

SLIDE 35

Constructing a Speech Translation System using Simultaneous Interpretation Data [Shimizu+ IWSLT13]

  • Approach: incorporate simultaneous interpretation data in training the MT system
  • [Paulik+ 08] use interpretation data, but to improve accuracy

[Figure: Traditional: the translation system is trained on translated data. Proposed: the translation system is trained on interpreted data, giving interpretation-like results for an input.]

SLIDE 36

Incorporating Interpretation Data

  • Tuning (Tu)
    - Tune the parameters of the translation systems to match the interpretation data
  • Language Model (LM): Linear Interpolation
    - Match the style of simultaneous interpreters
  • Translation Model (TM): fill-up [Bisazza+ 11]
    - Like the LM, adapt the TM to match interpretation data

Interpretation data is small, so use adaptation techniques
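
Linear interpolation of language models combines two probability estimates with a mixing weight: p(w) = λ · p_interp(w) + (1 − λ) · p_general(w). The sketch below shows this for unigram tables; the toy probabilities, the value of λ, and the function names are assumptions for illustration and do not come from the talk.

```python
import math


def interpolate(p_interp, p_general, lam):
    """Linearly interpolate two probabilities with weight `lam` in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_interp + (1.0 - lam) * p_general


def interpolated_unigram(word, interp_lm, general_lm, lam):
    """Look up `word` in both unigram tables and mix the two estimates."""
    return interpolate(interp_lm.get(word, 0.0), general_lm.get(word, 0.0), lam)


# Toy unigram tables (made-up numbers, each summing to 1).
interp_lm = {"ok": 0.2, "and": 0.3, "hotel": 0.5}
general_lm = {"ok": 0.1, "and": 0.2, "hotel": 0.7}

p_ok = interpolated_unigram("ok", interp_lm, general_lm, lam=0.3)
total = sum(interpolated_unigram(w, interp_lm, general_lm, 0.3) for w in interp_lm)
print(p_ok, total)
```

Since both inputs are proper distributions and the weights sum to one, the mixture is again a proper distribution, which lets a small in-style model shift a large general model without replacing it.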

SLIDE 37

Experimental Evaluation

Accuracy measured against simultaneous interpretation reference

[Figure: results chart with labels Phrase and Sentence]

SLIDE 38

Examples of Learned Traits

  • Shortening
  • Starting sentences with “OK” or “And” (also done by the interpreter in 25% of sentences)

SLIDE 39

Syntax-based Rewriting for Simultaneous Machine Translation [He+ EMNLP15]

  • Reword the target language to be closer to source
  • Passivizing, changing order of clauses when beneficial

SLIDE 40

Future Contributions in Rewording?

  • Paraphrasing:

More generalized models of structural paraphrasing?

  • Semantic Similarity:

How can we evaluate semantic similarity between sentences structurally different from the reference?

SLIDE 41

Evaluation

SLIDE 42

Speed vs. Accuracy

  • Tradeoff between speed and accuracy.

Source: もっと 手頃な ホテルは ありませんか (gloss: more / cheap / hotel / is there)

    Strategy                   Delay   Accuracy   Output
    Don’t split the sentence   Long    High       do you have a more reasonable hotel ? /
    Split the sentence         Short   Low        more / reasonable / is there a hotel ? /

  • Given two systems of different speed and accuracy, which is better?

SLIDE 43

Speed or Accuracy? A Study in Evaluation of Simultaneous Speech Translation Systems [Mieno+ InterSpeech15]

  • Based on speed and accuracy, determine which system is better

[Figure: three systems compared, each with its own accuracy and delay levels (high to low)]

SLIDE 44

How to Create an Evaluation Function? (Based on Data)

[Figure: pipeline for building the evaluation function]

  • Movie data: translations with various delays and accuracies
  • Training data: features (delay, accuracy) paired with manual evaluation results
  • Machine learning on the training data yields the evaluation function

SLIDE 45

Manual Evaluation Format

  • Rank-based evaluation
    - Perform comparative evaluation of which output is “better”
    - Allows for consideration of both speed and accuracy

[Figure: an input video is processed by System A, System B, and System C; evaluators rank Output A, Output B, and Output C (example ranking: 2, 1, 3)]

SLIDE 46

Evaluation Sheet Example

SLIDE 47

Learning an Evaluation Function

Define a linear function that takes a video as input and returns a score:

    score(displayed video) = weight vector · features useful in evaluation (i.e., delay and accuracy)

This function can be learned from ranked data using “learning to rank”
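
To make the idea concrete, here is a minimal pairwise ranking sketch: each training pair says one output was ranked above another, and the weight vector is nudged whenever the linear score disagrees. A simple perceptron update is used here as a stand-in; the talk does not say which learning-to-rank algorithm was used, and the feature values and function names below are made up for illustration.

```python
def score(w, phi):
    """Linear score: dot product of weight vector and feature vector."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def train_pairwise_perceptron(pairs, dim, epochs=10, lr=1.0):
    """Learn weights so that score(better) > score(worse) for each pair."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            if score(w, better) <= score(w, worse):  # ranked wrongly (or tied)
                w = [wi + lr * (b - c) for wi, b, c in zip(w, better, worse)]
    return w


# Features are (accuracy, delay in seconds); each pair is (preferred, other).
pairs = [
    ((4.0, 2.0), (3.0, 2.0)),  # same delay, higher accuracy preferred
    ((3.0, 1.0), (3.0, 5.0)),  # same accuracy, lower delay preferred
]
w = train_pairwise_perceptron(pairs, dim=2)
print(w)
```

On this toy data the learned weight for accuracy is positive and the weight for delay is negative, so the score rewards accuracy and penalizes delay.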

SLIDE 48

Experimental Setup

  • Target video: TED Talks

English → Japanese
    ① Realtime trans. is important
    ② Often used in MT evaluation

  • Translation data (5 varieties)
    - Translator
    - Interpreter 1 (S Rank)
    - Interpreter 2 (A Rank)
    - Syntax-based MT
    - Phrase-based MT

  • Gathered data

    Video        20 types    20-30 seconds
    Delay        7 types     0, 1, 2, 3, 5, 7, 10 seconds
    Accuracy     3 types     Auto: BLEU/RIBES; Man: Adequacy
    Subjects     15 Japanese speakers
    Modalities   Subtitled, Dubbed

SLIDE 49

Evaluation of Evaluation

[Figure: accuracy of the learned evaluation function (0.1 to 1) for Text Subtitles and Speech, comparing Acc. features against Delay+Acc. features; accuracy measures: None, BLEU+1, RIBES, Adeq.]

SLIDE 50

Q1: Is Delay Important in S2S Translation?

[Figure: accuracy of the learned evaluation function (0.1 to 1) for Text Subtitles and Speech, comparing Acc. features against Delay+Acc. features; accuracy measures: None, BLEU+1, RIBES, Adeq.]

A: Yes! In all cases, the scoring function considering delay did as well as or better than the one considering accuracy alone.

SLIDE 51

Q2: Does Importance Depend on Modality of Presentation?

[Figure: accuracy of the learned evaluation function (0.1 to 1) for Text Subtitles and Speech, comparing Acc. features against Delay+Acc. features; accuracy measures: None, BLEU+1, RIBES, Adeq.; annotated gains: Avg. +7% and Avg. +3%]

A: Yes! Considering delay was more useful when presenting results through subtitles.

Why?: Probably because when watching subtitles, it is possible to hear the original speech.

SLIDE 52

Learned Evaluation Functions (for Adequacy)

[Figure: learned evaluation functions for Speech Output and Subtitle Output; axes: 5 Level Adequacy (1 to 5) against Delay (s) (2 to 10)]

Learned weights:

    Output            Accuracy   Delay     1 point of adequacy =
    Subtitle Output   1.40       -0.059    8.0 sec. of delay
    Speech Output     1.99       -0.018    28.5 sec. of delay

SLIDE 53

Future Contributions in Evaluation?

  • Adaptation:

A more flexible evaluation measure that generalizes to many modalities, genres, tasks.

  • Machine Learning:

Non-linear regression functions?

  • Speech/UI:

Other factors including presentation modality (avatars?), synthesis quality play a large role.

SLIDE 54

Conclusion

SLIDE 55

Conclusion

  • The problem of high-accuracy simultaneous translation covers many fields of NLP/Speech: parsing, machine learning, language modeling, prosody, paraphrasing.
  • Still a new field, lots of opportunities for interesting applications of NLP tech!