Alan Ritter ◦ socialmedia-class.org
Social Media & Text Analysis
lecture 7 - Paraphrase Identification and Linear Regression
CSE 5539-0010 Ohio State University Instructor: Alan Ritter Website: socialmedia-class.org
(Recap)
“sentences or phrases that convey approximately the same meaning using different words” — (Bhagat & Hovy, 2012)
word: wealthy / rich
phrase: the king’s speech / His Majesty’s address
sentence: … the forced resignation … Harry Stonecipher, for … / … after Boeing Co. Chief Executive Harry Stonecipher was ousted from …
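(Recap)
[Timeline of paraphrase data sources: WordNet (80s), Novels ('01), Web ('01), News ('04), Bi-Text ('05, '13), Video ('11), Style ('12*), Twitter ('13*, '14*), Simple ('15*, '16*)]
[Author names on the timeline: Xu, Ritter, Callison-Burch, Dolan, Ji; Xu, Ritter, Dolan, Grishman, Cherry; Xu, Callison-Burch, Napoles]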
modified by adjectives: additional, administrative, assigned, assumed, collective, congressional, constitutional ...
assert, assign, assume, attend to, avoid, become, breach ...
Dekang Lin and Patrick Pantel. “DIRT - Discovery of Inference Rules from Text” In KDD (2001)
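[Figure: German-English bi-text alignment]
German: ... fünf Landwirte festgenommen, weil ...
English: ... 5 farmers were thrown into jail in Ireland ...
German: ... festgenommen, gefoltert ...
English: ... imprisoned, tortured ...
The shared German word "festgenommen" is aligned with both "thrown into jail" and "imprisoned".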
Source: Chris Callison-Burch
insect, beetle, pest, mosquito, fly
squealer, snitch, rat, mole
microphone, tracker, mic, wire, earpiece, cookie
glitch, error, malfunction, fault, failure
bother, annoy, pester
microbe, virus, bacterium, germ, parasite
Source: Chris Callison-Burch
Mancini has been sacked by Manchester City / Mancini gets the boot from Man City
WORLD OF JENKS IS ON AT 11 / World of Jenks is my favorite show on tv
Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)
(meaningful) non-paraphrases are needed to train classifiers!
[Timeline of paraphrase data sources, with Twitter now at '13*, '14*, '17*]
Microsoft Research Paraphrase Corpus
(Dolan, Quirk and Brockett, 2004; Dolan and Brockett, 2005; Brockett and Dolan, 2005)
also contains some non-paraphrases
Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)
also contains a lot of non-paraphrases
Paraphrase Identification:
negative (non-paraphrases) positive (paraphrases)
Classification Method:
$(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$
(Recap) Classification Method:
features $t_i$ are assumed independent given the class $y$
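The independence assumption stated here can be written as a factorization, which is the form commonly used by Naive Bayes classifiers. This is a sketch in standard notation; the symbol $n$ (number of features) and the probability notation $P(\cdot)$ are assumptions, since the formula itself did not survive in the slide text.

```latex
P(y \mid t_1, \dots, t_n) \;\propto\; P(y) \prod_{i=1}^{n} P(t_i \mid y)
```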
Classification Method:
learning algorithm for classification!
(better performance on various datasets).
Paraphrase Identification:
common
A problem closely related to Paraphrase Identification:
5: completely equivalent in meaning
4: mostly equivalent, but some unimportant details differ
3: roughly equivalent, some important information differs/missing
2: not equivalent, but share some details
1: not equivalent, but are on the same topic
0: completely dissimilar
A Simpler Model:
[Plot: Sentence Similarity (rated by Human), scale 1 to 5, against #words in common (feature), 5 to 20]
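One possible way to compute the "#words in common" feature is sketched below. The function name `words_in_common` and the choices of lowercasing and whitespace tokenization are assumptions made for illustration; the slides do not specify how words are counted. The sentence pair is the Mancini example from earlier in the lecture.

```python
def words_in_common(sentence_a: str, sentence_b: str) -> int:
    """Count distinct lowercased whitespace tokens that occur in both sentences."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    return len(tokens_a & tokens_b)


overlap = words_in_common(
    "Mancini has been sacked by Manchester City",
    "Mancini gets the boot from Man City",
)
print(overlap)  # shared tokens here are "mancini" and "city"
```

A real system would likely use a proper tokenizer; whitespace splitting is only the simplest stand-in.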
(Classification: predict discrete-valued output)
threshold ➞ Classification
similarity above the threshold ➞ paraphrase; below the threshold ➞ non-paraphrase
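Turning a real-valued similarity score into a class label can be sketched as a single comparison. The function name `classify` and the cutoff value 3.0 are hypothetical; the slides do not state which threshold is used or how ties at the threshold are handled.

```python
def classify(similarity: float, threshold: float = 3.0) -> str:
    """Map a predicted similarity score to a discrete label.

    The default threshold of 3.0 is an illustrative assumption,
    and scores equal to the threshold are treated as paraphrases.
    """
    return "paraphrase" if similarity >= threshold else "non-paraphrase"


labels = [classify(score) for score in (0.5, 2.9, 3.0, 4.7)]
print(labels)
```

In practice the threshold would be tuned on held-out labeled pairs rather than fixed in advance.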
#words in common (x) Sentence Similarity (y)
1 4 1 13 4 18 5 … …
Source: NLTK Book
(Recap)
training set
[Diagram: the training set yields a function h (also called the hypothesis); h takes x (#words in common) and outputs the (estimated) y (Sentence Similarity)]
Linear Regression w/ one variable:
hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
$\theta_0, \theta_1$: parameters
Source: many following slides are adapted from Andrew Ng
Choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for training examples $(x, y)$
squared error function, with parameters $\theta_0, \theta_1$:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
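The squared error function can be computed directly from its definition. The sketch below assumes the usual one-variable hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$; the function names and the four data points are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def hypothesis(theta0: float, theta1: float, x: float) -> float:
    """Linear hypothesis h(x) = theta0 + theta1 * x (assumed form)."""
    return theta0 + theta1 * x


def cost(theta0: float, theta1: float, xs: list, ys: list) -> float:
    """Squared error J = 1/(2m) * sum over examples of (h(x) - y)^2."""
    m = len(xs)
    squared_errors = (
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    )
    return sum(squared_errors) / (2 * m)


# Hypothetical training set: made-up (#words in common, similarity) pairs.
xs = [5.0, 10.0, 15.0, 20.0]
ys = [1.0, 2.0, 3.0, 4.0]

cost_flat_line = cost(0.0, 0.0, xs, ys)  # predicts 0 for every x
cost_good_line = cost(0.0, 0.2, xs, ys)  # passes through every point
print(cost_flat_line, cost_good_line)
```

The line that passes through all four points has a cost near zero, while the flat line pays for every residual.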
parameters: $\theta_0, \theta_1$
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$
Simplified: fix $\theta_0 = 0$, so that $h_\theta(x) = \theta_1 x$ and the only parameter is $\theta_1$
$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
$$\min_{\theta_1} J(\theta_1)$$
Hypothesis $h_\theta(x)$ (for fixed $\theta_1$, this is a function of $x$) and Cost Function $J(\theta_1)$ (function of the parameter $\theta_1$)
[Paired plots: $h_\theta(x)$ against $x$ on the left, $J(\theta_1)$ against $\theta_1$ on the right, shown for $\theta_1 = 1$ and then $\theta_1 = 0.5$]
$$\min_{\theta_1} J(\theta_1)$$
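To see how each choice of $\theta_1$ becomes one point on the cost curve, the simplified cost can be evaluated at a few values. This sketch assumes the simplified hypothesis $h_\theta(x) = \theta_1 x$ and three hypothetical training points lying on the line $y = x$; the actual points plotted on the slides are not recoverable from the text.

```python
def cost_simplified(theta1: float, xs: list, ys: list) -> float:
    """J(theta1) = 1/(2m) * sum of (theta1 * x - y)^2, assuming theta0 = 0."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Hypothetical training points on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

curve = {theta1: cost_simplified(theta1, xs, ys) for theta1 in (0.0, 0.5, 1.0, 1.5)}
for theta1, value in curve.items():
    print(f"theta1={theta1:.1f}  J={value:.4f}")
```

With these points the cost is zero at $\theta_1 = 1$ and rises symmetrically on either side, which is the bowl shape the cost-function panel describes.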
Hypothesis $h_\theta(x)$ (for fixed $\theta_0, \theta_1$, this is a function of $x$) and Cost Function $J(\theta_0, \theta_1)$ (function of the parameters $\theta_0, \theta_1$)
[Surface plot of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$, followed by contour plots of the same function; figures adapted from Andrew Ng]
Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum
Simplified: $\min_{\theta_1} J(\theta_1)$
[Plot: $J(\theta_1)$ against $\theta_1$; each step is scaled by the learning rate]
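The role of the learning rate can be illustrated with gradient descent on the simplified one-parameter cost. This sketch assumes the standard update $\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$ with $h_\theta(x) = \theta_1 x$; the function names, the data points, and the values of `alpha` are hypothetical.

```python
def gradient(theta1: float, xs: list, ys: list) -> float:
    """dJ/dtheta1 = 1/m * sum of (theta1 * x - y) * x, assuming theta0 = 0."""
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m


def gradient_descent_1d(xs, ys, alpha=0.1, theta1=0.0, steps=100):
    """Repeat theta1 := theta1 - alpha * dJ/dtheta1 for a fixed number of steps."""
    for _ in range(steps):
        theta1 = theta1 - alpha * gradient(theta1, xs, ys)
    return theta1


# Hypothetical training points on the line y = x, so the minimum is at theta1 = 1.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

small_rate = gradient_descent_1d(xs, ys, alpha=0.1)
large_rate = gradient_descent_1d(xs, ys, alpha=0.5, steps=20)
print(small_rate, large_rate)
```

On this data the smaller learning rate settles at the minimum, while the larger one overshoots further on every step and moves away from it.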
[Surface plots of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$, adapted from Andrew Ng]
$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$    (simultaneous update for j=0 and j=1)
}
$\alpha$: learning rate
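Linear Regression w/ one variable:
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = ?$    $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = ?$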
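The two partial derivatives follow from the chain rule applied to the squared error function. The derivation below is a sketch that assumes the one-variable hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$, so that $\partial h_\theta / \partial \theta_0 = 1$ and $\partial h_\theta / \partial \theta_1 = x$.

```latex
\begin{align*}
\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right)
     \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) \\
\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)
  &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
  &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{align*}
```

The factor of 2 from differentiating the square cancels the $\frac{1}{2}$ in the cost function, which is why the cost is defined with $\frac{1}{2m}$ rather than $\frac{1}{m}$.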
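Linear Regression w/ one variable:
repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
}
(simultaneous update of $\theta_0, \theta_1$; $\alpha$ is the learning rate)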
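The update loop for linear regression with one variable can be sketched in a few lines. This assumes the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ and replaces "until convergence" with a fixed number of steps for simplicity; the function name, learning rate, step count, and data are all hypothetical.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Gradient descent for h(x) = theta0 + theta1 * x using all m examples per step."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Residuals h(x) - y under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before either
        # parameter is changed.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

Updating `theta0` first and then using the new value when computing the gradient for `theta1` would be a different algorithm, which is why the gradients are computed together before the assignment.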
The cost function $J(\theta_0, \theta_1)$ is convex: a bowl-shaped surface over $\theta_0$ and $\theta_1$ with no local minima other than the global minimum.
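One practical consequence of convexity is that the starting point should not change where gradient descent ends up, provided the learning rate is small enough. The sketch below checks this on hypothetical data from three made-up starting points; the function name `descend`, the learning rate, and the step count are assumptions.

```python
def descend(xs, ys, theta0, theta1, alpha=0.05, steps=5000):
    """Gradient descent for h(x) = theta0 + theta1 * x from a given starting point."""
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical noisy data; no single line passes through every point.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

starts = [(0.0, 0.0), (10.0, -10.0), (-5.0, 8.0)]
results = [descend(xs, ys, t0, t1) for t0, t1 in starts]
for start, (t0, t1) in zip(starts, results):
    print(start, "->", round(t0, 4), round(t1, 4))
```

All three runs arrive at the same parameters, which would not be guaranteed for a cost function with several local minima.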
[Sequence of paired plots adapted from Andrew Ng: the hypothesis $h_\theta(x)$ (for fixed $\theta_0, \theta_1$, this is a function of $x$) next to $J(\theta_0, \theta_1)$ (function of the parameters $\theta_0, \theta_1$), one frame per setting of $\theta_0, \theta_1$]
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$, a sum over all $m$ training examples
(Recap)
(Classification: predict discrete-valued output)
[Plot: Sentence Similarity, scale 1 to 5, against #words in common (feature), 5 to 20]
threshold ➞ Classification: paraphrase or non-paraphrase
(Recap)
parameters: $\theta_0, \theta_1$
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$
(Recap)
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$    (simultaneous update for j=0 and j=1)
}
$\alpha$: learning rate