Learning and Generating Paraphrases From Twitter and Beyond
Wei Xu
Guest Lecture @ Penn MT class April-2-2015
Computer and Information Science, University of Pennsylvania
Research Overview
TACL 15, NAACL 15, TACL 14, ACL 14, ACL 13, BUCC 13, LASM 13, COLING 12, IJCNLP 11, EMNLP 11, ACL 06
Social Media Paraphrase Information Extraction
word: wealthy / rich
phrase: the king’s speech / His Majesty’s address
sentence: … the forced resignation Harry Stonecipher, for … / … after Boeing Co. Chief Executive Harry Stonecipher was ousted from …
Information Extraction end_job (Harry Stonecipher, Boeing)
Wei Xu, Raphael Hoffmann, Le Zhao, Ralph Grishman. “Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction” In ACL (2013)
extract
… the forced resignation
Harry Stonecipher, for … … after Boeing Co. Chief Executive Harry Stonecipher was ousted from …
Question Answering Who is the CEO stepping down from Boeing?
match
… the forced resignation
Harry Stonecipher, for … … after Boeing Co. Chief Executive Harry Stonecipher was ousted from …
Text Simplification
They are culturally akin to the coastal peoples of Papua New Guinea. Their culture is like that of the coastal peoples of Papua New Guinea.
Wei Xu, Chris Callison-Burch. “Problems in Current Text Simplification Research: New Data Can Help” to appear in TACL (2015)
NSF EAGER: “Simplification as Machine Translation” (2014 ~ 2015)
Stylistic Rewriting
Palpatine: If you will not be turned, you will be destroyed! If you will not be turn’d, you will be undone!
Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, Colin Cherry. “Paraphrasing for Style” In COLING (2012)
Luke: Father, please! Help me! Father, I pray you! Help me!
But, primarily for formal language usage and well-edited text
Numerous publications on paraphrase identification, extraction, generation and various applications
report big events using formal language
(Dolan, Quirk and Brockett, 2004; Dolan and Brockett, 2005; Brockett and Dolan, 2005)
Wei Xu, Alan Ritter, Ralph Grishman. “A Preliminary Study of Tweet Summarization using Information Extraction” In LASM (2014)
thousands of users talk about both big and micro events using formal, informal, erroneous language
Very diverse!
Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)
pittsburgh pgh pittsburg pixburgh pit steelers against the steelers against pittsburgh
Wei Xu, Alan Ritter, Ralph Grishman. “Gathering and Generating Paraphrases from Twitter with Application to Normalization” In BUCC (2013)
Information Retrieval
Noisy Text Normalization
Oscar-nominated documentary don’t want for don’t wait for
Wei Xu, Joel Tetreault, Martin Chodorow, Ralph Grishman, Le Zhao. “Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models” In EMNLP (2011)
who wants to get a beer?
Human-computer Interaction
want to get a beer? who else wants to get a beer? who wants to go get a beer? trying to get a beer? who wants to buy a beer? who else wants to get a beer? … (21 different ways)
Language Education
Aaaaaaaaand stephen curry is on fire What a incredible performance from Stephen Curry
Wowsers to this nets bulls game This nets vs bulls game is great
Sentiment Analysis
This Nets vs Bulls game is nuts This Nets and Bulls game is a good game this Nets vs Bulls game is too live This NetsBulls series is intense This netsbulls game is too good
Mancini has been sacked by Manchester City Mancini gets the boot from Man City
identify parallel sentences automatically from Twitter’s big data stream
WORLD OF JENKS IS ON AT 11 World of Jenks is my favorite show on tv
(Zanzotto, Pennacchiotti and Tsioutsiouliklis, 2011)
(Xu, Ritter and Grishman, 2013)
(Ling, Dyer, Black and Trancoso, 2013)
Mancini has been sacked by Manchester City Mancini gets the boot from Man City
very short
lexically divergent (less word overlap, even in high-dimensional space)
two sentences about the same topic are paraphrases if and only if they contain at least one word pair that is a paraphrase anchor
That boy Brook Lopez with a deep 3 brook lopez hit a 3
At-least-one-anchor Assumption
Not every word pair of similar meaning indicates a sentence-level paraphrase.
Solution: a discriminative model using word-level features
Iron Man 3 was brilliant fun Iron Man 3 tonight see what this is like
[Figure: the model applied to one sentence pair. The sentence-pair label Y (paraphrase / non-paraphrase) sits above latent word-pair labels Z1, Z2, Z3, Z4, ... (each 0 or 1), one per word pair such as manti | teo, be | is, next | new, manti | little. Each word pair carries features, e.g. diff_word, same_pos_nn, both_sig, same_stem, same_pos_be, not_both_sig, same_pos_jj, diff_pos_nn, diff_pos_jj.]
Manti bout to be the next Junior Seau Teo is the little new Junior Seau
Negative Bags / Positive Bags
A bag is labeled positive if there is at least one positive example.
A bag is labeled negative if all the examples in it are negative.
Instead of labels on each individual instance, the learner only observes labels on bags of instances.
(Dietterich et al., 1997)
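The bag-labeling rule above can be sketched in a few lines of Python. The function name `bag_label` and the 0/1 encoding of instance labels are assumptions made for illustration only.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Return 1 if at least one instance in the bag is positive, else 0."""
    return int(any(label == 1 for label in instance_labels))


# A positive bag needs only one positive instance;
# a negative bag must contain negative instances only.
positive_bag = [0, 0, 1]
negative_bag = [0, 0, 0]
print(bag_label(positive_bag), bag_label(negative_bag))
```

During training the instance labels are not observed, so a learner has to infer them from the bag label alone.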
Latent Variable Model
[Figure: positive bag. Instance labels (latent) Z1, Z2, Z3 with values ?, ?, 1; bag label (observed) Y = 1; features attach to the instances and constraints tie the instance labels to the bag label.]
A bag is labeled positive if there is at least one positive example.
[Figure: negative bag. Instance labels (latent) Z1, Z2, Z3 with values 0, 0, 0; bag label (observed) Y = 0; same features and constraints.]
A bag is labeled negative if all the examples in it are negative.
Maria Pershina, Bonan Min, Wei Xu, Ralph Grishman. “Infusion of Labeled Data into Distant Supervision for Relation Extraction” In ACL (2014)
Wei Xu, Raphael Hoffmann, Le Zhao, Ralph Grishman. “Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction” In ACL (2013)
Wei Xu, Ralph Grishman, Le Zhao. “Passage Retrieval for Information Extraction using Distant Supervision” In IJCNLP (2011)
Distantly Supervised Information Extraction
[Figure: plate diagram with variables xi, zi, hi, yi and G, plates of size |xi|, |R| and n, and two layers labeled relation level and mention level.]
Model the assumption: sentence-level paraphrase is anchored by at-least-one word pair
[Figure: graphical model. Word-pair instance labels Zi, Zj (latent) connect through a deterministic OR to the sentence-pair bag label Y (observed); plates W×W and S×S.]
[Equation labels: features; parameters; deterministic OR; jth word pair; ith sentence pair’s label (observed or to be predicted); latent labels for all word pairs in the ith sentence pair]
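The equation itself did not survive extraction. One way to write a model that matches these labels and the at-least-one-anchor assumption is sketched below; the symbols w, z, y, f, θ and σ are notational assumptions, not recovered from the slide.

```latex
% w_j: jth word pair of the ith sentence pair, z_j: its latent label,
% y_i: the sentence pair's label, f: feature function, \theta: parameters
P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta)
  \;\propto\; \prod_{j=1}^{m} \exp\bigl(\theta \cdot f(z_j, w_j)\bigr)
  \times \sigma(\mathbf{z}_i, y_i)

% deterministic OR: the sentence label must agree with the word-pair labels
\sigma(\mathbf{z}_i, y_i) =
\begin{cases}
  1 & \text{if } y_i = 1 \text{ and } \exists j : z_j = 1 \\
  1 & \text{if } y_i = 0 \text{ and } \forall j : z_j = 0 \\
  0 & \text{otherwise}
\end{cases}
```

Under this form, any assignment in which the sentence label contradicts the OR of the word-pair labels receives zero probability.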
Objective: learn the parameters that maximize likelihood over the training corpus
[Equation labels: ith training sentence pair; all possible values]
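As a hedged sketch, a maximum-likelihood objective with latent word-pair labels is usually written as below: a product over training sentence pairs, with a sum over all possible values of the latent labels. The symbols are assumptions chosen for illustration (w_i for the word pairs of the ith sentence pair, z_i for their latent labels, y_i for the observed sentence label, θ for the parameters).

```latex
\theta^{*}
  = \arg\max_{\theta} \prod_{i} P(y_i \mid \mathbf{w}_i; \theta)
  = \arg\max_{\theta} \prod_{i} \sum_{\mathbf{z}_i}
      P(\mathbf{z}_i, y_i \mid \mathbf{w}_i; \theta)
```

The inner sum ranges over exponentially many label assignments, which is the usual motivation for replacing it with a single best assignment.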
reward correct (conditioned on labels)
Perceptron-style Update: Viterbi approximation + online learning, O(# word pairs)
penalize wrong (ignoring labels)
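A minimal Python sketch of a perceptron-style update with a Viterbi approximation follows. It is an illustration under stated assumptions, not the authors' implementation: features are binary and given as strings, a word pair is labeled 1 when its score is positive (the score of label 0 is fixed at 0), and all function names and the learning rate `lr` are hypothetical. Each update touches every word pair once, in line with the O(# word pairs) cost.

```python
from typing import Dict, List

Features = List[str]  # active binary features of one word pair


def score(weights: Dict[str, float], feats: Features) -> float:
    """Linear score for labeling a word pair as a paraphrase anchor (z = 1)."""
    return sum(weights.get(f, 0.0) for f in feats)


def predict_latent(weights: Dict[str, float], pairs: List[Features]) -> List[int]:
    """Best word-pair labels ignoring the sentence label."""
    return [1 if score(weights, feats) > 0 else 0 for feats in pairs]


def predict_sentence(weights: Dict[str, float], pairs: List[Features]) -> int:
    """Deterministic OR over the predicted word-pair labels."""
    return int(any(predict_latent(weights, pairs)))


def constrained_latent(weights: Dict[str, float], pairs: List[Features], y: int) -> List[int]:
    """Best word-pair labels that are consistent with the observed label y."""
    if y == 0:
        return [0] * len(pairs)  # negative bag: every word pair is negative
    z = predict_latent(weights, pairs)
    if pairs and not any(z):
        # Positive bag: force the highest-scoring pair to act as the anchor.
        best = max(range(len(pairs)), key=lambda j: score(weights, pairs[j]))
        z[best] = 1
    return z


def perceptron_update(weights: Dict[str, float], pairs: List[Features], y: int, lr: float = 1.0) -> None:
    """Reward the label-consistent assignment, penalize the unconstrained one."""
    z_pred = predict_latent(weights, pairs)
    z_gold = constrained_latent(weights, pairs, y)
    for feats, zp, zg in zip(pairs, z_pred, z_gold):
        delta = zg - zp
        if delta:
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * delta


# Toy usage with feature names borrowed from the slides.
weights: Dict[str, float] = {}
positive = [["same_stem", "same_pos_nn"], ["diff_word", "diff_pos_nn"]]
negative = [["diff_word", "diff_pos_nn"]]
for _ in range(3):
    perceptron_update(weights, positive, 1)
    perceptron_update(weights, negative, 0)
print(predict_sentence(weights, positive), predict_sentence(weights, negative))
```

When the two assignments agree, the update leaves the weights unchanged, which is the usual behavior of a perceptron on correctly classified examples.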
Crowdsourcing
(Courtesy: The Sheep Market by Aaron Koblin)
Wei Xu. “Data-driven Approaches for Paraphrasing Across Language Variations” PhD Thesis, New York University. (2014)
Crowdsourcing
have similar meaning hurts both quantity and quality
non-experts lower their bars
[Figure: percentages of paraphrases by topic (Netflix, Jeff Green, Ryu, The Clippers, Reggie Miller), axis 0.2 to 0.8; series: Random, w/ Selection]
SumBasic Algorithm 8% 16%
Multi-Armed Bandits 16% 34%
18,762 sentence pairs labeled
cost only $200
1/3 paraphrase, 2/3 non-paraphrase (very balanced)
including a very broad range of paraphrases: synonyms, misspellings, slang, acronyms and colloquialisms
important but difficult to obtain
[Figure: bar chart of Precision and Recall (axis 40 to 100) for (Das & Smith, 2009), (Guo & Diab, 2012), (Ji & Eisenstein, 2013), Our Model and Human Upperbound; bar values as extracted: 90.8, 72.6, 62.8, 65.5, 63.2, 75.2, 72.2, 66.4, 52.5, 62.9]
state-of-the-art of paraphrase identification
SemEval 2015 shared task on “Paraphrase in Twitter”
19 + 1 teams participated
100+ research groups have requested the data since Nov 2014
paraphrase identification (0 or 1): rank 1
semantic similarity (0 ~ 1): rank 4
Wei Xu, Chris Callison-Burch, Bill Dolan. “SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)” In SemEval (2015)
That boy Brook Lopez with a deep 3 brook lopez hit a 3
Multi-instance Learning Paraphrase Model (MultiP)
(a lot of space for future work)
Mancini has been sacked by Manchester City Mancini gets the boot from Man City
align
has been sacked by / gets the boot from
manchester city / man city
4 / for
4 / four
hostes / hostess
Hostes is going biz .
Hostess is going business .
translate
Bilingual Monolingual studied sensitive to error
straightforward sophisticated less more less more a lot more recently has standard evaluation yes not quite yet naturally available parallel text
(Paraphrase =)
complex / simple, stylistic / plain, noisy / standard, erroneous / correct
and more (future work) …
(Xu et al. 2013) (Xu et al. 2012) (Xu et al. 2015) (Xu et al. 2011)
Wandering through rows of stalls examining workhorses and prize hogs may seem to … have been a strange way for a scientist to spend an afternoon, but there was a certain logic to it.
hogs may seem a bit strange through rows of stalls
Quanze Chen, Chenyang Lei, Wei Xu, Ellie Pavlick and Chris Callison-Burch. “Poetry of the Crowd: A Human Computation Algorithm to Convert Prose into Rhyming Verse” In AAAI's HCOMP (2012)
[Rhyme] balls falls installs walls …
state-of-the-art (since 2010)
state-of-the-art (since 2010) is suboptimal
is not all that simple
all code and data are available on my homepage: http://www.cis.upenn.edu/~xwe/
Chris Callison-Burch (UPenn)
Ralph Grishman (NYU)
Bill Dolan (MSR)
Alan Ritter (UW / OSU)
Raphael Hoffmann (UW / AI2 Incubator)
Joel Tetreault (ETS / Yahoo!)
Le Zhao (CMU / Google)
Maria Pershina (NYU)
Martin Chodorow (CUNY)
Colin Cherry (NRC)
Yangfeng Ji (GaTech)
Ellie Pavlick (UPenn)
Mingkun Gao (UPenn)
Quanze Chen (UPenn)
thanks thanking you appreciate it thnx thx tyvm thank you very much thanks a lot 3x say thanks am grateful wawwww thankkkkkkkkkkk you alotttttttttttt! thank u 4 ur time gratitude