Paraphrasing 4 Microblog Normalization Wang Ling Carnegie Mellon - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Paraphrasing 4 Microblog Normalization

Wang Ling, Chris Dyer, Alan Black, Isabel Trancoso
Carnegie Mellon University, Instituto Superior Técnico, INESC-ID L2F

slide-2
SLIDE 2

In a nutshell...

slide-3
SLIDE 3

In a nutshell...

Automatically create a normalization corpus

Original → Normalization
gonna b nutz ! ! → going to be nuts!!
sunday morn . wrking → Sunday morning. Working
I went 2 da national army → I went to the national army

slide-4
SLIDE 4

In a nutshell...

Automatically create a normalization corpus
Build a paraphrase model for normalization

Original → Normalization
gonna b nutz ! ! → going to be nuts!!
sunday morn . wrking → Sunday morning. Working
I went 2 da national army → I went to the national army

Normalization Model

slide-5
SLIDE 5

In a nutshell...

Leavin canada goin bac to cali → [Normalization Model]

slide-6
SLIDE 6

In a nutshell...

Leavin canada goin bac to cali → [Normalization Model] → Leaving Canada, going back to California

slide-7
SLIDE 7

Why Normalize?

msg 4 Warren G his cday is today 1 yr older.

slide-8
SLIDE 8

Why Normalize?

msg 4 Warren G his cday is today 1 yr older. Bing Translator

slide-9
SLIDE 9

Why Normalize?

msg 4 Warren G his cday is today 1 yr older. 味精 4 沃伦 G 他 cday 今天是较 旧的 1 年。 Bing Translator

slide-10
SLIDE 10

Why Normalize?

msg 4 Warren G his cday is today 1 yr older.
味精 4 沃伦 G 他 cday 今天是较 旧的 1 年。 (Bing Translator)
monosodium glutamate 4 warren G he cday today is older 1 year.

slide-11
SLIDE 11

Related Work: Lexical Normalization

  • msg → message
  • 4 → for
  • yr → year
slide-12
SLIDE 12

Related Work: Lexical Normalization

  • Yang and Eisenstein, 2013; Hassan and Menezes, 2013; Liu et al., 2012; Han et al., 2013; Han et al., 2012; Han and Baldwin, 2011; Gouws et al., 2011; Beaufort et al., 2010; Liu et al., 2010; Contractor et al., 2010; Kaufmann, 2010; Cook and Stevenson, 2009; Kobus et al., 2008; Choudhury et al., 2007; Aw et al., 2006 ... and many, many more

slide-13
SLIDE 13

Lexical Normalization

  • Finding lexical normalizations
  • Is lexical normalization enough???
slide-14
SLIDE 14

Is lexical normalization enough???

msg 4 Warren G his cday is today 1 yr older. 味精 4 沃伦 G 他 cday 今天是较 旧的 1 年。 Bing Translator

slide-15
SLIDE 15

Is lexical normalization enough???

message for Warren G his birthday is today 1 year older. Bing Translator

  • Lexical normalization
slide-16
SLIDE 16

Is lexical normalization enough???

message for Warren G his birthday is today 1 year older.
他的生日今天是大 1 岁的沃伦 G 消息。 (Bing Translator)
His birthday today is 1 year older Warren G message.

  • Lexical normalization
slide-17
SLIDE 17

How to make it better???

message for Warren G his birthday is today 1 year older.

slide-18
SLIDE 18

How to make it better???

A message for Warren G, his birthday is today, he is 1 year older.
沃伦 G 的消息,他的生日是今天, 他是大 1 岁。 (Bing Translator)
A message to/from Warren G, because his birthday is today, he is one year older.

  • Punctuation
  • Pro-drop
  • Missing articles
slide-19
SLIDE 19

Previous work

  • Beam Search Decoder combining different models (Wang and Ng, 2013)

Model: Example
Pronunciation: 2day → today
Missing word “Be”: I late → I am late
Retokenization: u.where → u . where
Apostrophe: im → I’m
Abbreviation: lol → laughing out loud
Time: 1130am → 11:30 am

slide-20
SLIDE 20

Previous work

  • Other possible models/features:

Model: Example
Swapping letters: concsience → conscience
Letter Repetition: gooooood → good
Context: rly → really
Visual: g00d → good
Measure Units: 100k → 100000

slide-21
SLIDE 21

Problem

slide-22
SLIDE 22

Problem

  • I am too lazy
slide-23
SLIDE 23

Problem

  • I am too lazy

○ Need to look at the data and look for stuff to normalize

slide-24
SLIDE 24

Problem

  • I am too lazy

○ Need to look at the data and look for stuff to normalize
○ Implement models/features to address each case

slide-25
SLIDE 25

Problem

  • I am too lazy

○ Need to look at the data and look for stuff to normalize
○ Implement models/features to address each case
○ Repeat again for every new language to normalize

slide-26
SLIDE 26

Problem

  • I am too lazy

○ Need to look at the data and look for stuff to normalize
○ Implement models/features to address each case
○ Repeat again for every new language to normalize

  • How to build a good normalizer without working that hard???

slide-27
SLIDE 27

Dream

slide-28
SLIDE 28

Dream

Original → Normalization
gonna b nutz ! ! → going to be nuts!!
sunday morn . wrking → Sunday morning. Working
I went 2 da national army → I went to the national army

  • Obtain data of tweets paired with normalizations

slide-29
SLIDE 29

Dream

  • Obtain data of tweets paired with normalizations
  • Build a model that learns to normalize based on these examples

Original → Normalization
gonna b nutz ! ! → going to be nuts!!
sunday morn . wrking → Sunday morning. Working
I went 2 da national army → I went to the national army

Normalization Model Data

slide-30
SLIDE 30

But how can we get the data?

Original → Normalization
gonna b nutz ! ! → going to be nuts!!
sunday morn . wrking → Sunday morning. Working
I went 2 da national army → I went to the national army

slide-31
SLIDE 31

But how can we get the data?

  • Annotate (Wang and Ng, 2013)

○ Not scalable

Original → Normalization
gonna b nutz ! ! → going to be nuts!!
sunday morn . wrking → Sunday morning. Working
I went 2 da national army → I went to the national army

slide-32
SLIDE 32

But how can we get the data?

  • Annotate (Wang and Ng, 2013)

○ Not scalable

  • Extract Paraphrases in Twitter (Xu et al., 2013)

○ No distinction between original tweet and normalization
○ Works in practice by using a formal language model

Original → Normalization
gonna b nutz ! ! → going to be nuts!!
sunday morn . wrking → Sunday morning. Working
I went 2 da national army → I went to the national army

slide-33
SLIDE 33

Do people normalize text online?

slide-34
SLIDE 34

Do people normalize text online?

  • Claim: Yes, but they do it in another language.

slide-35
SLIDE 35

Translationese

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the... - 发信息给 Warren G , 今天是他的生日,又 老了一岁了。生日快乐,愿上帝保佑 你和 ...

  • Standardization during translations (Laviosa, 1998; Volansky et al., 2013)

  • No equivalent idiosyncrasies
slide-36
SLIDE 36

Translationese

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the... 味精 4 沃伦 G 他 cday 今天是较 旧的 1 年。快乐 cday 5 月上帝保 佑你和... Bing Translator

slide-37
SLIDE 37

Translationese

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the... 发信息给 Warren G ,今天是他的 生日,又 老了一岁了。生日快乐, 愿上帝保佑你和 ... 味精 4 沃伦 G 他 cday 今天是较 旧的 1 年。快乐 cday 5 月上帝保 佑你和... Bing Translator

slide-38
SLIDE 38

Translationese

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the... 发信息给 Warren G ,今天是他的 生日,又 老了一岁了。生日快乐, 愿上帝保佑你和 ... 味精 4 沃伦 G 他 cday 今天是较 旧的 1 年。快乐 cday 5 月上帝保 佑你和... Bing Translator Bing Translator

slide-39
SLIDE 39

Translationese

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the...
发信息给 Warren G ,今天是他的 生日,又 老了一岁了。生日快乐, 愿上帝保佑你和 ...
味精 4 沃伦 G 他 cday 今天是较 旧的 1 年。快乐 cday 5 月上帝保 佑你和... (Bing Translator)
Send a message to Warren g, today is his birthday, old and a one year old. Happy birthday and may God bless you and ... (Bing Translator)

slide-40
SLIDE 40

Translationese

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the...
Send a message to Warren g, today is his birthday, old and a one year old. Happy birthday and may God bless you and ...

slide-41
SLIDE 41

Translationese

msg 4 Warren G → Send a message to Warren g,
happy cday may god bless u and the... → Happy birthday and may God bless you and ...
his cday is today → today is his birthday,
1 yr older. → old and a one year old.
slide-42
SLIDE 42

Translationese

msg 4 Warren G → Send a message to Warren g,
happy cday may god bless u and the... → Happy birthday and may God bless you and ...
his cday is today → today is his birthday,
1 yr older. → old and a one year old.

abbreviation, punctuation, completion

slide-43
SLIDE 43

Translationese

msg 4 Warren G → Send a message to Warren g,
happy cday may god bless u and the... → Happy birthday and may God bless you and ...
his cday is today → today is his birthday,
1 yr older. → old and a one year old.

jargon, punctuation, phrasing abbreviation, punctuation, completion

slide-44
SLIDE 44

Translationese

msg 4 Warren G → Send a message to Warren g,
happy cday may god bless u and the... → Happy birthday and may God bless you and ...
his cday is today → today is his birthday,
1 yr older. → old and a one year old.

abbreviation, machine translation sucks jargon, punctuation, phrasing abbreviation, punctuation, completion

slide-45
SLIDE 45

Translationese

msg 4 Warren G → Send a message to Warren g,
happy cday may god bless u and the... → Happy birthday and may God bless you and ...
his cday is today → today is his birthday,
1 yr older. → old and a one year old.

abbreviation, machine translation sucks jargon, punctuation, phrasing abbreviation, punctuation, completion jargon, abbreviation, conjunction

slide-46
SLIDE 46

Where do these come from?

slide-47
SLIDE 47

Where do these come from?

msg 4 Warren G his cday is today 1 yr older. happy cday may god bless u and the... - 发信息给 Warren G , 今天是他的生日,又 老了一岁了。生日快乐,愿上帝保佑 你和 ...

【2013.9.7】It was an important result for #Portugal in a very special date for me. I would like to dedicate this victory to my father that has left us 8 years ago. 这是一场重要的胜利,在今天这个对我来说很特别的日子.我要把这场胜利献给我的父亲,他八年前离开了我们.

  • Microblogs as Parallel Corpora (Ling et al., 2013)
slide-48
SLIDE 48

Building a Normalizer

slide-49
SLIDE 49

Building a Normalizer

English | Other Language
To DanielVeuleman yea iknw imma work on that | 对DanielVeuleman说,是的我知道,我 正在向那方面努力
Warren G his cday is today 1 yr older. | 发信息给Warren G,今天是他的生日, 又老了一岁了。
Where the hell have you been all these years? | 这些年你TMD到哪去了
onni this gift only 4 u | أوني هذة الهدية فقط لك
Next Monday I am gonna see a movie in German language at the cinema. | В понедельник я буду смотреть фильм на немецком в кинотеатре.
oppa!! I love u (*≧∀≦*) | 사랑해요( `・ω・´ )
huv a gd time | 素敵♡貴重な時間やね

  • Get parallel corpora
slide-50
SLIDE 50

Building a Normalizer

English | Normalized English
To DanielVeuleman yea iknw imma work on that | Right DanielVeuleman, yes I know, I'm Xiangna efforts
Warren G his cday is today 1 yr older. | Send a message to Warren G, today is his birthday, old and a one year old.
Where the hell have you been all these years? | TMD these years you where to go
onni this gift only 4 u | Unni this gift just for you
Next Monday I am gonna see a movie in German language at the cinema. | On Monday, I'll be watching a movie in a movie theater in German.
oppa!! I love u (*≧∀≦*) | I love (`· ω · ')
huv a gd time | It or a nice ♡ valuable time

  • Translate into English (Google, Bing, Youdao)
slide-51
SLIDE 51

Building a Normalizer

English | Normalized English
To DanielVeuleman yea iknw imma work on that | Right DanielVeuleman, yes I know, I'm Xiangna efforts
Warren G his cday is today 1 yr older. | Send a message to Warren G, today is his birthday, old and a one year old.
Where the hell have you been all these years? | TMD these years you where to go
onni this gift only 4 u | Unni this gift just for you
Next Monday I am gonna see a movie in German language at the cinema. | On Monday, I'll be watching a movie in a movie theater in German.
oppa!! I love u (*≧∀≦*) | I love (`· ω · ')
huv a gd time | It or a nice ♡ valuable time

  • Filter bad examples (Phrasal Alignment)
slide-52
SLIDE 52

Building a Normalizer

English | Normalized English
To DanielVeuleman yea iknw imma work on that | Right DanielVeuleman, yes I know, I'm Xiangna efforts
Warren G his cday is today 1 yr older. | Send a message to Warren G, today is his birthday, old and a one year old.
Where the hell have you been all these years? | TMD these years you where to go
onni this gift only 4 u | Unni this gift just for you
Next Monday I am gonna see a movie in German language at the cinema. | On Monday, I'll be watching a movie in a movie theater in German.

  • Train a Phrase-based SMT model
slide-53
SLIDE 53

Normalization Model (Phrase)

  • Phrase-based SMT model (Koehn, 2003)
slide-54
SLIDE 54

Normalization Model (Phrase)

  • Phrase-based SMT model (Koehn, 2003)

We r 2 young to talk about 4ever → We 're too young , can not talk about forever

slide-55
SLIDE 55

Normalization Model (Phrase)

  • What we can achieve with this model

Normalization Model

slide-56
SLIDE 56

Normalization Model (Phrase)

  • What we can achieve with this model

○ Abbreviations

I wanna go 4 Pizza 2day → [Normalization Model] → I want to go for Pizza today

slide-57
SLIDE 57

Normalization Model (Phrase)

  • What we can achieve with this model

○ Abbreviations
○ Punctuation

I’ll cook it brotha! → [Normalization Model] → I will cook it, brother!

slide-58
SLIDE 58

Normalization Model (Phrase)

  • What we can achieve with this model

○ Abbreviations
○ Punctuation
○ Apostrophe

hes not very well → [Normalization Model] → He is not very well

slide-59
SLIDE 59

Normalization Model (Phrase)

  • What we can achieve with this model

○ Abbreviations
○ Punctuation
○ Apostrophe
○ Context-based

Adam n me n Vegas → [Normalization Model] → Adam and me in vegas

slide-60
SLIDE 60

Normalization Model (Phrase)

  • What we can achieve with this model

○ Abbreviations
○ Punctuation
○ Apostrophe
○ Context-based
○ Orthographic errors
○ …….

Adidas goin hard this year !!! → [Normalization Model] → Adidas going hard this year!!!

slide-61
SLIDE 61

Limitation of this Model

slide-62
SLIDE 62

Limitation of this Model

  • Can only normalize seen words

○ Even though we know that:

Original → Normalization
4ever → forever
4get → forget

slide-63
SLIDE 63

Limitation of this Model

  • Can only normalize seen words

○ Even though we know that:

Original → Normalization
4ever → forever
4get → forget

○ We cannot infer that:

Original → Normalization
4got → ??????
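
The limitation on this slide can be illustrated with a deliberately simplified sketch. It is not the phrase-based SMT model itself: it assumes a plain word-level lookup table (the hypothetical `SEEN_PAIRS`) holding only the two pairs shown above, which is enough to show why an unseen form such as `4got` passes through unchanged.

```python
# Hypothetical lookup table holding only the pairs seen on the slide.
# A real phrase table would also carry scores and multi-word phrases.
SEEN_PAIRS = {"4ever": "forever", "4get": "forget"}


def normalize_seen_only(sentence, table=SEEN_PAIRS):
    """Replace each whitespace-separated token found in the table.

    Tokens that were never observed are copied through untouched,
    which is exactly the limitation described on the slide.
    """
    return " ".join(table.get(token, token) for token in sentence.split())


if __name__ == "__main__":
    # "4got" shares the "4 -> for" pattern but was never seen as a whole word.
    print(normalize_seen_only("4ever 4get 4got"))
```

Nothing in a word-level table encodes that `4` tends to stand for `for`, so the shared pattern between `4ever`, `4get` and `4got` is invisible at this granularity.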

slide-64
SLIDE 64

Normalization Model (Character)

  • A character-based normalization model:

○ Use phrase pairs as parallel segments

Original → Normalization
4ever → forever
4get → forget
goingfor → going for
goooood → good

slide-65
SLIDE 65

Normalization Model (Character)

  • A character-based normalization model:

○ Use phrase pairs as parallel segments
○ Build a phrase-based SMT model on characters

4 e v e r → f o r e v e r
g o i n g f o r → g o i n g <s> f o r
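
The character-level representation shown above can be sketched as follows. This is a minimal illustration, assuming each character becomes one token and a literal space is written as the `<s>` marker seen on the slide; the function names `to_char_tokens` and `from_char_tokens` are made up for this example and are not taken from the authors' system.

```python
SPACE_MARKER = "<s>"  # space symbol as it appears on the slide


def to_char_tokens(text):
    """Turn a phrase into a space-separated character sequence.

    Real spaces are replaced by SPACE_MARKER so that word boundaries
    survive once every character is its own token.
    """
    return " ".join(SPACE_MARKER if ch == " " else ch for ch in text)


def from_char_tokens(tokens):
    """Invert to_char_tokens: join characters and restore spaces."""
    return "".join(" " if tok == SPACE_MARKER else tok for tok in tokens.split())


if __name__ == "__main__":
    # Build the source and target sides of character-level training pairs.
    for original, normalized in [("4ever", "forever"), ("goingfor", "going for")]:
        print(to_char_tokens(original), "→", to_char_tokens(normalized))
```

Pairs prepared this way could then be handed to any phrase-based SMT toolkit, with characters playing the role that words normally play.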

slide-66
SLIDE 66

Normalization Model (Character)

  • A character-based normalization model:

○ Use phrase pairs as parallel segments
○ Build a phrase-based SMT model on characters
○ Lattice generation

I wanna to meeeeet DanielVeuleman

slide-67
SLIDE 67

Normalization Model (Character)

  • A character-based normalization model:

○ Use phrase pairs as parallel segments
○ Build a phrase-based SMT model on characters
○ Lattice generation
○ Phrase-based Normalization

Normalization Model

I want to meet Daniel Veuleman

I am a lattice!

slide-68
SLIDE 68

Normalization Model (Character)

  • What can we learn from this model now?
slide-69
SLIDE 69

Normalization Model (Character)

  • What can we learn from this model now?

○ Phonetically Similar Substitutions

Original → Normalization: Probability
s → s: 0.87
s → c: 0.04
s → z: 0.02

slide-70
SLIDE 70

Normalization Model (Character)

  • What can we learn from this model now?

○ Phonetically Similar Substitutions
○ Logographic Substitutions

Original → Normalization: Probability
4 → 4: 0.71
4 → f o r: 0.06
4 e → f o r e: 0.89

slide-71
SLIDE 71

Normalization Model (Character)

  • What can we learn from this model now?

○ Phonetically Similar Substitutions
○ Logographic Substitutions
○ Visual Substitutions

Original Normalization Probability

  • 0.86
  • 0.01
slide-72
SLIDE 72

Normalization Model (Character)

  • What can we learn from this model now?

○ Phonetically Similar Substitutions
○ Logographic Substitutions
○ Visual Substitutions
○ Segmentation
○ ...

Original Normalization Probability i n g f o r i n g f o r 0.45 g f g f 0.01

slide-73
SLIDE 73

Results (Normalization)

  • Normalization Dataset

○ Training set: 1.3M normalization pairs
○ Test set: 1290 normalization pairs/translation pairs
■ annotation from previous work + normalization

slide-74
SLIDE 74

Results (Normalization)

  • Evaluation with BLEU

Orig: I wanna meeeeet DanielVeuleman
Norm: I want to met Daniel Veuleman
Ref: I want to meet Daniel Veuleman

BLEU Normalizer Score: X.XX
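
To make the scoring idea concrete, here is a rough sketch of BLEU for a single hypothesis against a single reference. It is an assumption-laden simplification: whitespace tokenization, no smoothing, one reference, sentence level. The reported results are presumably corpus-level scores from a standard tool, so this sketch is not expected to reproduce them. The function name `sentence_bleu_sketch` is hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(hypothesis, reference, max_n=4):
    """Unsmoothed BLEU for one hypothesis and one reference.

    Geometric mean of clipped n-gram precisions (n = 1..max_n)
    multiplied by the brevity penalty. Returns 0.0 as soon as any
    n-gram order has no match.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    norm = "I want to met Daniel Veuleman"
    ref = "I want to meet Daniel Veuleman"
    print(sentence_bleu_sketch(norm, ref, max_n=2))
    print(sentence_bleu_sketch(ref, ref))
```

With the slide's example, the single wrong word `met` breaks every 4-gram, so the unsmoothed 4-gram score for that one sentence is zero; this is one reason BLEU is normally computed over a whole test set rather than per sentence.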

slide-75
SLIDE 75

Results (Normalization)

  • Setups

○ Phrase-based Model (P)

I wanna meeeet Daniel Veuleman → [Phrase-based Model] → I want to met Daniel Veuleman

slide-76
SLIDE 76

Results (Normalization)

  • Setups

○ Phrase-based Model (P)
○ Phrase-based Model + Character-based Model (PC)

I wanna meeeeet DanielVeuleman → [Character-based Model] → I wanna meet Daniel Veuleman → [Phrase-based Model] → I want to meet Daniel Veuleman

slide-77
SLIDE 77

Results (Normalization)

  • Normalization Results

Setup | BLEU
No normalization | 19.90
Normalization (P) | 21.96
Normalization (PC) | 22.39

slide-78
SLIDE 78

Results (Translation)

  • Translation Systems

○ Moses (NIST)
○ Moses (NIST+Weibo)
○ Online Systems (Google, Bing, Youdao)
■ Online A, Online B and Online C

slide-79
SLIDE 79

Results (Translation)

  • Translation Results

Setup | Ours (NIST) | Ours (NIST+Weibo) | Online A | Online B | Online C
No normalization | 15.1 | 24.4 | 20.1 | 17.9 | 18.8

slide-80
SLIDE 80

Results (Translation)

  • Translation Results

○ wanna=want to
○ gotta=going to
○ u=you
○ 4=for

Setup | Ours (NIST) | Ours (NIST+Weibo) | Online A | Online B | Online C
No normalization | 15.1 | 24.4 | 20.1 | 17.9 | 18.8
Normalization (P) | 15.7 | 24.3 | 20.5 | 18.1 | 18.9

slide-81
SLIDE 81

Results (Translation)

  • Translation Results

○ representin=representing
○ peaceof=peace of

Setup | Ours (NIST) | Ours (NIST+Weibo) | Online A | Online B | Online C
No normalization | 15.1 | 24.4 | 20.1 | 17.9 | 18.8
Normalization (P) | 15.7 | 24.3 | 20.5 | 18.1 | 18.9
Normalization (PC) | 15.9 | 24.4 | 20.6 | 18.2 | 19.1

slide-82
SLIDE 82

Conclusion

  • Presented a method to build a normalization system using translation

○ Corpora built using translation
○ Model built using Machine Translation on the phrase level and character level

slide-83
SLIDE 83

Conclusion

  • Presented a method to build a normalization system using translation

○ Corpora built using translation
○ Model built using Machine Translation on the phrase level and character level
○ Does not address each problem specifically, but learns from examples.

slide-84
SLIDE 84

Conclusion

  • Presented a method to build a normalization system using translation

○ Corpora built using translation
○ Model built using Machine Translation on the phrase level and character level
○ Does not address each problem specifically, but learns from examples.

  • More Data = better

○ … and we are crawling every day

slide-85
SLIDE 85

Future Work

  • Some problems in the model

○ No null translations (missing words or punctuation)

slide-86
SLIDE 86

Thx very much 4 ur attention :)

System and API available at: www.microblogtranslation.org

Thank you very much for your attention :)