Microblogs as Parallel Corpora Wang Ling, Guang Xiang, Chris Dyer, - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Microblogs as Parallel Corpora

Wang Ling, Guang Xiang, Chris Dyer, Isabel Trancoso, Alan W Black
Carnegie Mellon University
Instituto Superior Tecnico

slide-5
SLIDE 5

In this talk we will...

  • Crawl large amounts of microblog parallel data for free
    ○ Crawl Sina Weibo (Chinese Twitter)
    ○ English-Mandarin Pair

slide-6
SLIDE 6

Background

slide-7
SLIDE 7

Parallel Data in MT

Parallel Corpora (Training) → Translation Model
Parallel Corpora (Devel) → Tuning
Parallel Corpora (Test) → Decoding → Evaluation

slide-11
SLIDE 11

Why do we need Parallel Data from Microblogs?

MT Model

  • Problem: Current parallel corpora are generally clean and formal.

In 2011, Quebec fell victim to half of the closures and reductions in hours.

slide-12
SLIDE 12

Why do we need Parallel Data from Microblogs?

MT Model
Input: shoutotut to the fans i met today. love u

  • Problem: Current parallel corpora are generally clean and formal. But microblogs are noisy and informal.

slide-14
SLIDE 14

Why do we need Parallel Data from Microblogs?

Input: msg 4 Warren G his cday is today 1 yr older.
Google Translate output: 味精4沃伦G他的cday是今日1年岁。

slide-20
SLIDE 20

Problem with Parallel Data

  • Parallel data is a scarce resource
  • Most parallel data is crawled from
    ○ Parallel websites (Resnik 1999) (Fukushima 2006)
    ○ Patents (Macken 2007)
    ○ Parliament data (Koehn 2005)
    ○ ...
  • Crowdsourcing translation (Zaidan 2011) is an alternative, but it requires a budget

slide-21
SLIDE 21

Microblog Parallel Data Extraction

slide-23
SLIDE 23

How can we get Parallel Data in this domain for free?

  • ...and we found this
slide-25
SLIDE 25

Is there Parallel Data in Sina Weibo?

  • Does this also happen in Sina Weibo?

Skydiving was incredible! Such an amazing feeling! I loving being adventurous! ;D - 高空跳伞太不可思议了!真是一种奇妙的感觉!我喜欢冒险! ;D

Meeting Yao Ming for the first time! So great to be back in China for the Mission Hills World Celebrity Pro-Am. Will post pictures soon! 第一次和姚明见面!又回到中国的感觉太棒了!这次是为观澜湖世界名人赛。照片稍等片后! Thanks.

slide-26
SLIDE 26

Is there Parallel Data in Sina Weibo?

  • Formal and Informal

"I am the light and I am the dark. And beyond the light and the dark, I am and God is." 我是光明,我也是黑暗。超越光明和黑暗,我是,神是。

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the... - 发信息给 Warren G,今天是他的生日,又老了一岁了。生日快乐,愿上帝保佑你和...

slide-28
SLIDE 28

Is there Parallel Data in Sina Weibo?

  • Multiple Language Pairs

Summer Stand, The Drenched Show 2012 2012 싸이 훨씬 더 흠뻑 쇼

进进进进进进球了!克罗斯为拜仁破门! Toooooooooooooooooooooor 6:1! Kroos trifft für den FC Bayern!

slide-32
SLIDE 32

But there is a catch...

  • Not all multilingual tweets are parallel

[GD's Twitter] ONE OF A KIND 的 M/V 马上就要公开了!! Y’all Ready for this? 呃啊啊啊,好紧张啊~还请大家多多支持!

转发微博《南方小羊牧场》11月9号北美上映。 Showtime is coming up soon...

slide-34
SLIDE 34

But there is a catch...

  • Not all multilingual tweets are parallel
  • Finding the parallel segments in the message is not trivial

I wanna be here every year if possible~! 많은 분들의 걱정처럼 '순간반짝'일지라도 열심히 해보겠습니다... 지나고보면 다 순간이니까요...^^ 可能的话,我想每年来这里~!就算像有的人担心的那样我只是“昙花一现",我还是会非常努力的...因为回头看的话,一切都只是一瞬的...^^

slide-38
SLIDE 38

Content-based Matching

  • Given two sentences, calculate their similarity:
    ○ Compute Viterbi Alignments
    ○ Compute Similarity Score

je vais manger
I am going to eat
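
The two steps on this slide can be sketched as follows. This is only an illustration: it assumes an IBM Model 1-style lexical table, the probabilities in `LEX` are invented, and averaging the link probabilities is an assumed stand-in for the similarity score, which the slides do not define.

```python
# Toy lexical translation table p(english_word | french_word).
# All values are invented for illustration.
LEX = {
    ("je", "I"): 0.8,
    ("vais", "am"): 0.3,
    ("vais", "going"): 0.5,
    ("vais", "to"): 0.1,
    ("manger", "to"): 0.2,
    ("manger", "eat"): 0.9,
}
FLOOR = 1e-4  # assumed probability for word pairs missing from the table


def viterbi_alignment(src, tgt):
    """For each target word, pick the most probable source word (Model 1 style)."""
    links = []
    for t in tgt:
        best = max(src, key=lambda s: LEX.get((s, t), FLOOR))
        links.append((best, t, LEX.get((best, t), FLOOR)))
    return links


def similarity(src, tgt):
    """Average lexical probability over the Viterbi alignment links."""
    links = viterbi_alignment(src, tgt)
    return sum(p for _, _, p in links) / len(links)


src = "je vais manger".split()
tgt = "I am going to eat".split()
for s, t, p in viterbi_alignment(src, tgt):
    print(f"{t} <- {s} ({p})")
print(round(similarity(src, tgt), 3))
```

A real system would estimate the lexical table from existing parallel data instead of hard-coding it.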

slide-40
SLIDE 40

Content-based Matching

  • But, previous work assumes that a pair of documents will be given
  • In our case, only one document is provided

je vais manger I am going to eat

slide-44
SLIDE 44

Microblog Alignment Model

  • Solution: Consider all spans for matching

je vais manger I am going to eat

je vais / going to: Score=0.2

slide-46
SLIDE 46

Microblog Alignment Model

  • Solution: Consider all spans for matching

je vais manger I am going to eat

je vais manger / I am going to: Score=0.3

slide-48
SLIDE 48

Microblog Alignment Model

  • Solution: Consider all spans for matching

je vais manger I am going to eat

je vais manger / I am going to eat: Score=0.6
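
A brute-force version of "consider all spans" might look like the sketch below. The dictionary and the scoring function are hypothetical placeholders, chosen only so that the example has a clear best answer. They are not the models used in the talk.

```python
# Hypothetical bilingual dictionary; the entries are invented for this example.
DICTIONARY = {
    ("je", "I"), ("vais", "am"), ("vais", "going"),
    ("manger", "to"), ("manger", "eat"),
}


def score(left, right, n):
    """Toy score: linked words count for the pair, unlinked words against it."""
    linked = sum(any((w, v) in DICTIONARY for v in right) for w in left)
    linked += sum(any((w, v) in DICTIONARY for w in left) for v in right)
    unlinked = len(left) + len(right) - linked
    return (linked - unlinked) / n


def best_span_pair(tokens):
    """Try every pair of non-overlapping spans tokens[a:b] and tokens[c:d]."""
    n = len(tokens)
    best_score, best_spans = float("-inf"), None
    for a in range(n):
        for b in range(a + 1, n + 1):
            for c in range(b, n):
                for d in range(c + 1, n + 1):
                    s = score(tokens[a:b], tokens[c:d], n)
                    if s > best_score:
                        best_score, best_spans = s, (a, b, c, d)
    return best_score, best_spans


tokens = "je vais manger I am going to eat :D".split()
best_score, (a, b, c, d) = best_span_pair(tokens)
print(tokens[a:b], tokens[c:d], round(best_score, 2))
```

With this toy score the trailing ":D" is left out of the selected spans, because an unlinked word lowers the score.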

slide-50
SLIDE 50

Microblog Alignment Model

  • Solution: Consider all spans for matching
  • Problem: Running the Viterbi Alignments for all possible spans is intractable O(N^6):
    ○ Number of spans = N^4
    ○ Viterbi alignments = N^2
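
The N^4 figure can be checked by counting. The sketch below assumes a span pair is defined by four boundaries a < b <= c < d in a message of N tokens. Under that assumption the exact count is C(N+2, 4), which grows as N^4, and multiplying by an N^2 alignment cost gives the N^6 total.

```python
from math import comb


def count_span_pairs(n):
    """Count pairs of non-overlapping spans [a, b) and [c, d) in n tokens."""
    count = 0
    for a in range(n):
        for b in range(a + 1, n + 1):
            for c in range(b, n):
                for d in range(c + 1, n + 1):
                    count += 1
    return count


for n in (5, 10, 20):
    pairs = count_span_pairs(n)
    assert pairs == comb(n + 2, 4)  # closed form for the same count
    # pairs * n * n approximates the work if each pair needs an N^2 alignment
    print(n, pairs, pairs * n * n)
```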

slide-52
SLIDE 52

Microblog Alignment Model

  • Solution: Consider all spans for matching
  • Problem: Running the Viterbi Alignments for all possible spans is intractable O(N^6):
  • Answer: Dynamic Programming
    ○ Reuse Viterbi Alignments for previously processed spans

je vais / going to
je vais manger / I am going to

slide-54
SLIDE 54

Microblog Alignment Model

  • Solution: Consider all spans for matching
  • Problem: Running the Viterbi Alignments for all possible spans is intractable O(N^6):
  • Answer: Dynamic Programming
    ○ Reuse Viterbi Alignments for previously processed spans
    ○ Reduces Complexity from O(N^6) to O(N^4)
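
One way to see how reuse brings the cost down to O(N^4) is sketched here. The slides do not say which alignment model is used, so this assumes a Model 1-style model in which each word of the right span links to its best word in the left span. `LEX`, `prob`, and the averaged score are illustrative assumptions.

```python
# Toy lexical table p(right_word | left_word); values invented for illustration.
LEX = {
    ("je", "I"): 0.8, ("vais", "am"): 0.3, ("vais", "going"): 0.5,
    ("manger", "to"): 0.2, ("manger", "eat"): 0.9,
}


def prob(left_word, right_word):
    return LEX.get((left_word, right_word), 1e-4)


def naive_score(tokens, a, b, c, d):
    """Recompute every best link from scratch for spans [a, b) and [c, d)."""
    left, right = tokens[a:b], tokens[c:d]
    return sum(max(prob(s, t) for s in left) for t in right) / len(right)


def all_span_scores(tokens):
    """Score every span pair while reusing work from smaller spans."""
    n = len(tokens)
    scores = {}
    for a in range(n):
        # best[j]: best link for tokens[j] against the current left span [a, b)
        best = [0.0] * n
        for b in range(a + 1, n + 1):
            # Growing the left span by one word needs only one max per word.
            for j in range(b, n):
                best[j] = max(best[j], prob(tokens[b - 1], tokens[j]))
            for c in range(b, n):
                total = 0.0
                for d in range(c + 1, n + 1):
                    # Growing the right span by one word adds one cached link.
                    total += best[d - 1]
                    scores[(a, b, c, d)] = total / (d - c)
    return scores


tokens = "je vais manger I am going to eat".split()
scores = all_span_scores(tokens)
print(len(scores), round(scores[(0, 3, 3, 8)], 3))
```

Each of the O(N^4) span pairs is scored in constant time here, whereas calling `naive_score` for every pair redoes up to N^2 link comparisons each time.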

slide-56
SLIDE 56

Microblog Alignment Model

  • Final score computed by various models
  • Extract pairs by thresholding the score

English | Mandarin | Score
You know what? | 知道吗? | 0.6
You have to remember where you come from b4 u know where u going... | 你在知道要去哪里之前先要记得自己从哪里来... | 0.5
To DanielVeuleman yea iknw imma work on that | 对DanielVeuleman说,是的,我知道,我正在向那方面努力 | 0.3
just eat it, delicious noodles... | 不管多晚,饿了不吃,就是睡不着... | 0.2

slide-57
SLIDE 57

Microblog Alignment Model

  • Final score computed by various models
  • Extract pairs by thresholding the score

English | Mandarin | Score
You know what? | 知道吗? | 0.6
You have to remember where you come from b4 u know where u going... | 你在知道要去哪里之前先要记得自己从哪里来... | 0.5
To DanielVeuleman yea iknw imma work on that | 对DanielVeuleman说,是的,我知道,我正在向那方面努力 | 0.3
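
The thresholding step amounts to a simple filter over scored candidates. In the sketch below the rows are the ones shown on the slide, while the cut-off of 0.25 is a hypothetical value picked only because it separates the 0.3 row from the 0.2 row.

```python
# Scored candidate pairs taken from the slide.
candidates = [
    ("You know what?", "知道吗?", 0.6),
    ("You have to remember where you come from b4 u know where u going...",
     "你在知道要去哪里之前先要记得自己从哪里来...", 0.5),
    ("To DanielVeuleman yea iknw imma work on that",
     "对DanielVeuleman说,是的,我知道,我正在向那方面努力", 0.3),
    ("just eat it, delicious noodles...", "不管多晚,饿了不吃,就是睡不着...", 0.2),
]

THRESHOLD = 0.25  # hypothetical cut-off, not a value given in the talk

# Keep only the pairs whose score reaches the threshold.
kept = [(en, zh) for en, zh, score in candidates if score >= THRESHOLD]
for en, zh in kept:
    print(en, "|", zh)
```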

slide-58
SLIDE 58

Experimental Results

slide-62
SLIDE 62

Parallel Sentence Extraction Results

  • Dataset
    ○ Crawled 65 million targeted tweets from Sina Weibo
    ○ Filtered out all tweets without both a Mandarin trigram and an English trigram
  • Annotated 2000 tweets sampled uniformly
    ○ Is the tweet parallel?
    ○ Where are the parallel spans?
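
The trigram filter could be approximated with two regular expressions. The slides do not define "trigram", so this sketch assumes three consecutive CJK ideographs for Mandarin and three consecutive alphabetic words for English. The function name `is_candidate` is made up for the example.

```python
import re

# Assumption: a Mandarin trigram is three consecutive CJK ideographs and an
# English trigram is three consecutive whitespace-separated alphabetic words.
MANDARIN_TRIGRAM = re.compile(r"[\u4e00-\u9fff]{3}")
ENGLISH_TRIGRAM = re.compile(r"[A-Za-z']+(?:\s+[A-Za-z']+){2}")


def is_candidate(tweet):
    """Keep a tweet only if it contains both kinds of trigram."""
    has_mandarin = MANDARIN_TRIGRAM.search(tweet) is not None
    has_english = ENGLISH_TRIGRAM.search(tweet) is not None
    return has_mandarin and has_english


tweets = [
    "You know what? 知道吗?",
    "进球了! Tor!",
    "hello world",
]
print([t for t in tweets if is_candidate(t)])
```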

slide-63
SLIDE 63

Parallel Sentence Extraction Results

  • Parallel Tweet Detection
slide-64
SLIDE 64

Parallel Sentence Extraction Results

  • Keeping 30% of the data is a good trade-off
slide-65
SLIDE 65

Parallel Sentence Extraction Results

  • 30% of the tweets are parallel
slide-69
SLIDE 69

Parallel Sentence Extraction Results

  • Span detection:
    ○ Metric: Average Word Error Rate (no substitutions)
    ○ WER = 11.4%

Reference: je vais manger I am going to eat
Insertion Error: je vais manger :D I am going to eat
Deletion Error: je vais I am going to eat
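
A word error rate without substitutions can be computed from the longest common subsequence, as in the sketch below. It scores the whole example message against the reference. How the talk segments and averages the annotated spans is not stated, so treat the normalisation by reference length as an assumption.

```python
def lcs_length(ref, hyp):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(hyp) + 1)
    for r in ref:
        cur = [0]
        for j, h in enumerate(hyp, start=1):
            cur.append(prev[j - 1] + 1 if r == h else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def wer_no_substitutions(ref, hyp):
    """(insertions + deletions) / reference length, with no substitutions."""
    common = lcs_length(ref, hyp)
    deletions = len(ref) - common   # reference words missing from the hypothesis
    insertions = len(hyp) - common  # extra words in the hypothesis
    return (insertions + deletions) / len(ref)


reference = "je vais manger I am going to eat".split()
with_insertion = "je vais manger :D I am going to eat".split()
with_deletion = "je vais I am going to eat".split()
print(wer_no_substitutions(reference, with_insertion))
print(wer_no_substitutions(reference, with_deletion))
```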

slide-70
SLIDE 70

MT Results

slide-71
SLIDE 71

MT Results

  • Baseline
    ○ Parallel Corpora (Training) → Translation Model
    ○ Parallel Corpora (Devel) → Tuning
    ○ Parallel Corpora (Test) → Decoding → Evaluation

slide-72
SLIDE 72

MT Results

  • Baseline
    ○ Parallel Corpora (Training) → PSMT (Moses)
    ○ Parallel Corpora (Devel) → MERT
    ○ Parallel Corpora (Test) → MOSES → BLEU
slide-74
SLIDE 74

Results (Extrinsic)

  • Training Parallel Data
    ○ From Sina Weibo
      ■ Approximately 1M multilingual tweets
      ■ Expect 337K parallel sentences
      ■ Microblog Domain
    ○ FBIS dataset
      ■ 300K parallel sentences
      ■ News Domain
    ○ NIST dataset
      ■ 8M parallel sentences (including FBIS)
      ■ News Domain

slide-76
SLIDE 76

Results (Extrinsic)

  • Development and Test sets
    ○ Weibo
      ■ Built by annotating Weibo tweets manually
      ■ 1000 dev
      ■ 1000 test
      ■ Microblog domain
    ○ Syndicate
      ■ Extracted from Project Syndicate (parallel website)
      ■ 1000 dev
      ■ 1000 test
      ■ News and political domain

slide-77
SLIDE 77

Results (Extrinsic)

  • MT experiments
    ○ Significant improvements (30-40%) on microblogs (in-domain)

Training data   Syndicate ZH-EN   Syndicate EN-ZH   Weibo ZH-EN   Weibo EN-ZH
FBIS            9.4               18.6              10.4          12.3
NIST            11.5              21.2              11.4          13.9
Weibo           8.8               15.9              15.7          17.2

slide-78
SLIDE 78

Results (Extrinsic)

  • MT experiments
    ○ Worse results on the Syndicate data (out-of-domain)

Training data   Syndicate ZH-EN   Syndicate EN-ZH   Weibo ZH-EN   Weibo EN-ZH
FBIS            9.4               18.6              10.4          12.3
NIST            11.5              21.2              11.4          13.9
Weibo           8.8               15.9              15.7          17.2

slide-79
SLIDE 79

Results (Extrinsic)

  • MT experiments
    ○ Better results in both datasets by combining parallel data

Training data   Syndicate ZH-EN   Syndicate EN-ZH   Weibo ZH-EN   Weibo EN-ZH
FBIS            9.4               18.6              10.4          12.3
NIST            11.5              21.2              11.4          13.9
Weibo           8.8               15.9              15.7          17.2
FBIS+Weibo      11.7              19.2              16.5          17.8
NIST+Weibo      13.3              21.5              16.9          17.9

slide-81
SLIDE 81

New Translations?

  • Abbreviations

谢=thx, 你=u

have u ever really lived in beijing ? 你是否真的住过北京
To Colton Lopez, thx for the love! 对 Colton Lopez说,谢谢你的爱

slide-83
SLIDE 83

New Translations?

  • Abbreviations

TMD=damn, TM=damn
他妈的 - Ta Ma De

slide-84
SLIDE 84

New Translations?

  • Abbreviations

TMD=damn,TM=damn

Life is like the game ''Angry Birds''. When you fail, there are always some damn stupid pigs laughing at you.
人生就像''愤怒的小鸟'',当你失败时,总有 TMD 几只笨猪在笑

slide-86
SLIDE 86

New Translations?

  • Abbreviations
  • Jargon

囧=embarrassed

I'm so embarrassed. 我囧死了。

slide-89
SLIDE 89

New Translations?

  • Abbreviations
  • Jargon

囧=embarrassed, 屌丝=loser

Today I heard a male foreign loser roaring in anger on the phone, "You are a liar! You don't love me at all! All you want to do is practise oral English!!!
今天在地铁站,看到一个外国男屌丝在电话咆哮:你是个骗子!你一点都不爱我!你只是想和我练口语!

slide-90
SLIDE 90

Related Work

  • Jehl et al., 2012, describe a CLIR method to find tweets that are parallel
    ○ Dataset not available (Tweets cannot be made public)
    ○ Poster in this ACL (make sure to check it out!)

slide-92
SLIDE 92

Conclusion

  • Presented an automatic method to extract parallel sentences from microblogs
    ○ Large amounts of parallel data for free
    ○ Improvements for the ZH-EN pair
  • μtopia - Microblog Translated Posts Corpora
    ○ @ http://www.cs.cmu.edu/~lingwang/microtopia/
    ○ 1.5 Million Parallel Sentences from Twitter + Weibo
      ■ English
      ■ Mandarin
      ■ Arabic
      ■ 7 other languages

slide-93
SLIDE 93

Future Work

  • Online Microblog Translation System will be available
    ○ @ http://www.microblogtranslation.org

slide-95
SLIDE 95

Thx y’all 4 ur attention ;)

Corpora - http://www.cs.cmu.edu/~lingwang/microtopia/
MT system - http://www.microblogtranslation.org