

SLIDE 1


Social Media & Text Analysis

Lecture 7: Paraphrase Identification and Linear Regression

CSE 5539-0010, Ohio State University
Instructor: Alan Ritter
Website: socialmedia-class.org

SLIDE 2

(Recap)

what is a Paraphrase?

“sentences or phrases that convey approximately the same meaning using different words” — (Bhagat & Hovy, 2012)

SLIDE 3

(Recap)

what is a Paraphrase?

wealthy ↔ rich    (word)

“sentences or phrases that convey approximately the same meaning using different words” — (Bhagat & Hovy, 2012)

SLIDE 4

(Recap)

what is a Paraphrase?

wealthy ↔ rich    (word)

the king’s speech ↔ His Majesty’s address    (phrase)

“sentences or phrases that convey approximately the same meaning using different words” — (Bhagat & Hovy, 2012)

SLIDE 5

(Recap)

what is a Paraphrase?

wealthy ↔ rich    (word)

the king’s speech ↔ His Majesty’s address    (phrase)

… the forced resignation of the CEO of Boeing, Harry Stonecipher, for …
↔ … after Boeing Co. Chief Executive Harry Stonecipher was ousted from …    (sentence)

“sentences or phrases that convey approximately the same meaning using different words” — (Bhagat & Hovy, 2012)

SLIDE 6

The Ideal

SLIDE 7

(Recap)

Paraphrase Research

[timeline of paraphrase research by resource and year: WordNet (80s), Novels ('01), Web ('01), News ('04), Bi-Text ('05, '13), Video ('11), Style ('12*), Twitter ('13*, '14*), Simple ('15*, '16*); contributors include Xu, Ritter, Callison-Burch, Dolan, Ji, Grishman, Cherry, Napoles]

SLIDE 8

Distributional Similarity

Lin and Pantel (2001) operationalize the Distributional Hypothesis using dependency relationships to define similar environments. Duty and responsibility share a similar set of dependency contexts in large volumes of text:

  • modified by adjectives: additional, administrative, assigned, assumed, collective, congressional, constitutional ...
  • objects of verbs: assert, assign, assume, attend to, avoid, become, breach ...

Dekang Lin and Patrick Pantel. “DIRT - Discovery of Inference Rules from Text” In KDD (2001)
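
As a rough sketch of this idea (using hypothetical toy context counts, not Lin and Pantel’s actual data), two words can be compared by the cosine similarity of their dependency-context count vectors:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical counts of dependency contexts (relation:word) for each noun
duty = Counter({"amod:additional": 3, "amod:administrative": 2,
                "dobj-of:assume": 5, "dobj-of:assign": 2})
responsibility = Counter({"amod:additional": 2, "amod:collective": 1,
                          "dobj-of:assume": 4, "dobj-of:avoid": 1})

print(cosine(duty, responsibility))  # high value -> similar environments
```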

SLIDE 9

Bilingual Pivoting

[word-alignment figure: a German sentence containing “fünf Landwirte … festgenommen” aligns with “… 5 farmers were thrown into jail in Ireland …”; elsewhere in the bitext “festgenommen” also aligns with “imprisoned” (and “gefoltert” with “tortured”); pivoting through the shared German word links the English paraphrases “thrown into jail” ↔ “imprisoned”]

Source: Chris Callison-Burch
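
The pivoting intuition can be made probabilistic (Bannard & Callison-Burch, 2005): p(e2 | e1) ≈ Σ_f p(e2 | f) · p(f | e1), summing over foreign phrases f that e1 translates to. A minimal sketch with hypothetical translation probabilities, not real alignment counts:

```python
# Hypothetical phrase translation tables estimated from word-aligned bitext
p_f_given_e = {"thrown into jail": {"festgenommen": 0.4}}
p_e_given_f = {"festgenommen": {"imprisoned": 0.5, "arrested": 0.3,
                                "thrown into jail": 0.2}}

def paraphrase_prob(e1: str, e2: str) -> float:
    """p(e2 | e1) ~ sum over pivot phrases f of p(e2 | f) * p(f | e1)."""
    return sum(p_e_given_f.get(f, {}).get(e2, 0.0) * p_f
               for f, p_f in p_f_given_e.get(e1, {}).items())

print(paraphrase_prob("thrown into jail", "imprisoned"))  # 0.4 * 0.5 = 0.2
```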

SLIDE 14

Key Limitations of PPDB?

SLIDE 15

Key Limitations of PPDB?

bug:

  • insect, beetle, pest, mosquito, fly
  • squealer, snitch, rat, mole
  • microphone, tracker, mic, wire, earpiece, cookie
  • glitch, error, malfunction, fault, failure
  • bother, annoy, pester
  • microbe, virus, bacterium, germ, parasite

Source: Chris Callison-Burch

word sense

SLIDE 16

Another Key Limitation

[timeline of paraphrase corpora (as on slide 7)]

  • only paraphrases, no non-paraphrases

SLIDE 17

Paraphrase Identification

  • obtain sentential paraphrases automatically

“Mancini has been sacked by Manchester City”
“Mancini gets the boot from Man City”
→ Yes!

“WORLD OF JENKS IS ON AT 11”
“World of Jenks is my favorite show on tv”
→ No!

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

(meaningful) non-paraphrases are needed to train classifiers!

SLIDE 18

Also Non-Paraphrases

[timeline of paraphrase corpora (as on slide 7), with Twitter '13*, '14*, '17*]

(meaningful) non-paraphrases are needed to train classifiers!

SLIDE 19

News Paraphrase Corpus

Microsoft Research Paraphrase Corpus

(Dolan, Quirk and Brockett, 2004; Dolan and Brockett, 2005; Brockett and Dolan, 2005)

also contains some non-paraphrases

SLIDE 20

Twitter Paraphrase Corpus

Wei Xu, Alan Ritter, Chris Callison-Burch, Bill Dolan, Yangfeng Ji. “Extracting Lexically Divergent Paraphrases from Twitter” In TACL (2014)

also contains a lot of non-paraphrases

SLIDE 21

Paraphrase Identification:

A Binary Classification Problem

  • Input:
  • a sentence pair x
  • a fixed set of binary classes Y = {0, 1}
  • Output:
  • a predicted class y ∈ Y (y = 0 or y = 1)
SLIDE 22

Paraphrase Identification:

A Binary Classification Problem

  • Input:
  • a sentence pair x
  • a fixed set of binary classes Y = {0, 1}
  • Output:
  • a predicted class y ∈ Y (y = 0: negative, non-paraphrases; y = 1: positive, paraphrases)
SLIDE 25

Classification Method:

Supervised Machine Learning

  • Input:
  • a sentence pair x (represented by features)
  • a fixed set of binary classes Y = {0, 1}
  • a training set of m hand-labeled sentence pairs (x(1), y(1)), … , (x(m), y(m))
  • Output:
  • a learned classifier 𝜹: x → y ∈ Y (y = 0 or y = 1)
SLIDE 27

(Recap) Classification Method:

Supervised Machine Learning

  • Naïve Bayes
  • Logistic Regression
  • Support Vector Machines (SVM)
SLIDE 28

(Recap)

Naïve Bayes

  • Cons: features t1, …, tn are assumed independent given the class y:

P(t1,t2,...,tn | y) = P(t1 | y)⋅P(t2 | y)⋅...⋅P(tn | y)

  • This will cause problems:
  • correlated features ➞ double-counted evidence
  • while parameters are estimated independently
  • hurts the classifier’s accuracy
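
To see the double-counting concretely, here is a small sketch with hypothetical probabilities: adding a second feature that is a perfect copy of the first makes the posterior more extreme, even though no new evidence arrived.

```python
def nb_posterior(prior_pos, likelihoods_pos, likelihoods_neg):
    """P(y=1 | features) under the Naive Bayes independence assumption."""
    pos, neg = prior_pos, 1.0 - prior_pos
    for lp, ln in zip(likelihoods_pos, likelihoods_neg):
        pos *= lp   # multiply in P(t_i | y=1)
        neg *= ln   # multiply in P(t_i | y=0)
    return pos / (pos + neg)

# one feature t1 with P(t1|y=1)=0.8, P(t1|y=0)=0.2 (hypothetical values)
print(nb_posterior(0.5, [0.8], [0.2]))            # 0.8
# t2 is a perfect copy of t1: its evidence is counted a second time
print(nb_posterior(0.5, [0.8, 0.8], [0.2, 0.2]))  # ~0.94, overconfident
```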

SLIDE 29

Classification Method:

Supervised Machine Learning

  • Naïve Bayes
  • Logistic Regression
  • Support Vector Machines (SVM)
SLIDE 30

Logistic Regression

  • One of the most useful supervised machine learning algorithms for classification!
  • Generally high performance for a lot of problems.
  • Much more robust than Naïve Bayes (better performance on various datasets).

SLIDE 31

Let’s start with something simpler!

Before Logistic Regression

SLIDE 32

Paraphrase Identification:

Simplified Features

  • We use only one feature:
  • the number of words that the two sentences share in common
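
A minimal sketch of this feature, assuming simple lowercasing and whitespace tokenization:

```python
def words_in_common(s1: str, s2: str) -> int:
    """Count distinct words appearing in both sentences."""
    return len(set(s1.lower().split()) & set(s2.lower().split()))

print(words_in_common("Mancini has been sacked by Manchester City",
                      "Mancini gets the boot from Man City"))  # 2: mancini, city
```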

SLIDE 33

A closely related problem to Paraphrase Identification:

Semantic Textual Similarity

  • How similar (close in meaning) are two sentences?

5: completely equivalent in meaning
4: mostly equivalent, but some unimportant details differ
3: roughly equivalent, some important information differs/missing
2: not equivalent, but share some details
1: not equivalent, but are on the same topic
0: completely dissimilar

SLIDE 34

A Simpler Model:

Linear Regression

[plot: #words in common (feature, x: 5–20) vs. Sentence Similarity rated by human (y: 1–5)]

SLIDE 35

A Simpler Model:

Linear Regression

  • also supervised learning (learn from annotated data)
  • but for Regression: predict real-valued output (Classification: predict discrete-valued output)

[plot: #words in common (feature, 5–20) vs. Sentence Similarity (1–5)]

SLIDE 36

A Simpler Model:

Linear Regression

  • also supervised learning (learn from labeled data)
  • but for Regression: predict real-valued output (Classification: predict discrete-valued output)
  • threshold ➞ Classification: scores above the threshold are labeled paraphrase, scores below non-paraphrase

[plot: fitted line over #words in common (feature, 5–20) vs. Sentence Similarity (1–5), with the threshold splitting paraphrase from non-paraphrase]
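
A minimal sketch of the thresholding step; the cutoff value 3.0 is an assumption for illustration, not a tuned choice:

```python
def classify(predicted_similarity: float, threshold: float = 3.0) -> int:
    """Map a real-valued similarity score to a binary paraphrase label."""
    return 1 if predicted_similarity >= threshold else 0

print(classify(4.2))  # 1 -> paraphrase
print(classify(1.5))  # 0 -> non-paraphrase
```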

SLIDE 39

Training Set

  • m hand-labeled sentence pairs (x(1), y(1)), … , (x(m), y(m))
  • x’s: “input” variable / features
  • y’s: “output”/“target” variable

[table: example training pairs, columns #words in common (x) and Sentence Similarity (y)]

SLIDE 40

Source: NLTK Book

(Recap)

Supervised Machine Learning

[diagram (NLTK book): training set with features and labels → machine learning algorithm → classifier model]

SLIDE 41

Supervised Machine Learning

[diagram: training set → learning algorithm → hypothesis h (the learned function); input x (#words in common) → h → estimated output y (sentence similarity)]


SLIDE 44

Linear Regression:

Model Representation

[diagram: training set → learning algorithm → hypothesis h; input x → h → estimated y]

  • How to represent h ?

hθ(x) = θ0 + θ1x     (Linear Regression w/ one variable)

SLIDE 45

Linear Regression w/ one variable:

Model Representation

hθ(x) = θ0 + θ1x

  • m hand-labeled sentence pairs (x(1), y(1)), … , (x(m), y(m))
  • θ’s: parameters

[table: example training pairs, columns #words in common (x) and Sentence Similarity (y)]

Source: many following slides are adapted from Andrew Ng
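
As code, the hypothesis is a one-line function (the parameter values below are toy assumptions):

```python
def h(theta0: float, theta1: float, x: float) -> float:
    """h_theta(x) = theta0 + theta1 * x"""
    return theta0 + theta1 * x

print(h(0.5, 0.2, 10))  # predicted similarity when 10 words are in common
```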

SLIDE 47

Linear Regression w/ one variable:

Model Representation

[figure: example lines hθ(x) = θ0 + θ1x plotted for different settings of θ0 and θ1]

SLIDE 48

Linear Regression w/ one variable:

Cost Function

  • Idea: choose θ0, θ1 so that hθ(x) is close to y for training examples (x, y)

[plot: #words in common (feature, 5–20) vs. human-rated sentence similarity (1–5)]

SLIDE 50

Linear Regression w/ one variable:

Cost Function

  • Idea: choose θ0, θ1 so that hθ(x) is close to y for training examples (x, y)

[plot: #words in common (feature) vs. human-rated sentence similarity]

squared error function:

J(θ0,θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²

Goal:  minimize over θ0, θ1:  min J(θ0,θ1)
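
A minimal sketch of this cost function in code; the training points are toy assumptions, not values from an actual corpus:

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h(x_i) - y_i)^2"""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 4, 13, 18]        # #words in common (toy data)
ys = [1.0, 1.5, 4.0, 5.0]  # human similarity ratings (toy data)
print(cost(0.0, 0.3, xs, ys))
```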

SLIDE 51

Linear Regression

  • Hypothesis:  hθ(x) = θ0 + θ1x
  • Parameters:  θ0, θ1
  • Cost Function:  J(θ0,θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²
  • Goal:  minimize over θ0, θ1:  min J(θ0,θ1)

SLIDE 52

Linear Regression

  • Hypothesis:  hθ(x) = θ0 + θ1x
  • Parameters:  θ0, θ1
  • Cost Function:  J(θ0,θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²
  • Goal:  minimize over θ0, θ1:  min J(θ0,θ1)

Simplified (set θ0 = 0):

  • Hypothesis:  hθ(x) = θ1x
  • Parameter:  θ1
  • Cost Function:  J(θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²
  • Goal:  minimize over θ1:  min J(θ1)

SLIDE 53

Hypothesis and Cost Function (θ1 = 1)

[left plot: hθ(x) as a function of x, for fixed θ1 = 1; right plot: J(θ1) as a function of the parameter θ1]

SLIDE 56

Hypothesis and Cost Function (θ1 = 0.5)

[left plot: hθ(x) as a function of x, for fixed θ1 = 0.5; right plot: J(θ1) as a function of the parameter θ1]

SLIDE 58

[plots: hθ(x) for several values of θ1 and the resulting cost curve J(θ1)]

minimize over θ1:  min J(θ1)

SLIDE 60

[left plot: hθ(x) as a function of x, for fixed θ0, θ1; right plot: 3D surface of J(θ0,θ1) over the parameters θ0 and θ1]

SLIDE 61

[left plot: hθ(x) as a function of x, for fixed θ0, θ1; right: contour plot of J(θ0,θ1)]

SLIDE 64

Parameter Learning

  • Have some function J(θ0,θ1)
  • Want min over θ0, θ1 of J(θ0,θ1)
  • Outline:
  • Start with some θ0, θ1
  • Keep changing θ0, θ1 to reduce J(θ0,θ1) until we hopefully end up at a minimum

SLIDE 65

Gradient Descent (Simplified)

[plot: cost curve J(θ1) over θ1]

minimize over θ1:  min J(θ1)

SLIDE 66

Gradient Descent (Simplified)

[plot: cost curve J(θ1) over θ1]

minimize over θ1:  min J(θ1)

θ1 := θ1 − α (∂/∂θ1) J(θ1)

α: learning rate

SLIDE 67

Gradient Descent

minimize over θ0, θ1:  min J(θ0,θ1)

[3D surface plot of J(θ0,θ1) over θ0 and θ1]

SLIDE 69

Gradient Descent

repeat until convergence {
    θj := θj − α (∂/∂θj) J(θ0,θ1)     (simultaneous update for j = 0 and j = 1)
}

α: learning rate

SLIDE 70

Linear Regression w/ one variable:

Gradient Descent

(∂/∂θ0) J(θ0,θ1) = ?     (∂/∂θ1) J(θ0,θ1) = ?

Cost Function:

hθ(x) = θ0 + θ1x

J(θ0,θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²

SLIDE 71

Linear Regression w/ one variable:

Gradient Descent

repeat until convergence {
    θ0 := θ0 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i))
    θ1 := θ1 − α (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) ⋅ x(i)
}     (simultaneous update of θ0, θ1)
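
A minimal sketch of these two update rules as a training loop; the learning rate and the data are toy assumptions:

```python
def gradient_descent(xs, ys, alpha=0.01, iters=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iters):
        preds = [theta0 + theta1 * x for x in xs]
        grad0 = sum(p - y for p, y in zip(preds, ys)) / m
        grad1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs = [1, 4, 13, 18]        # #words in common (toy data)
ys = [1.0, 1.5, 4.0, 5.0]  # similarity ratings (toy data)
print(gradient_descent(xs, ys))
```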

SLIDE 72

Linear Regression

[surface plot of J(θ0,θ1): a single global minimum]

cost function is convex

SLIDE 73

[figure sequence: left, the line hθ(x) for the current θ0, θ1; right, contour plot of J(θ0,θ1); each gradient descent step moves the parameters closer to the minimum and the fitted line closer to the data]

SLIDE 82

Batch Gradient Descent

  • Each step of gradient descent uses all the training examples

Cost Function:

J(θ0,θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²
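
A vectorized sketch of one batch step (numpy; the design matrix and labels are toy assumptions): the gradient is computed from all m examples at once.

```python
import numpy as np

def batch_step(theta, X, y, alpha=0.01):
    """One gradient-descent step; X has a leading bias column of ones."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m   # gradient of J over the full batch
    return theta - alpha * grad

X = np.array([[1.0, 4.0], [1.0, 13.0], [1.0, 18.0]])  # rows: [1, x] (toy)
y = np.array([1.0, 4.0, 5.0])
print(batch_step(np.zeros(2), X, y))
```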

SLIDE 83

(Recap)

Linear Regression

  • also supervised learning (learn from annotated data)
  • but for Regression: predict real-valued output (Classification: predict discrete-valued output)
  • threshold ➞ Classification

[plot: fitted line over #words in common (feature) vs. sentence similarity; predictions above the threshold are labeled paraphrase, below it non-paraphrase]

SLIDE 86

(Recap)

Linear Regression

  • Hypothesis:  hθ(x) = θ0 + θ1x
  • Parameters:  θ0, θ1
  • Cost Function:  J(θ0,θ1) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²
  • Goal:  minimize over θ0, θ1:  min J(θ0,θ1)

[surface plot of J(θ0,θ1)]

SLIDE 87

(Recap)

Gradient Descent

repeat until convergence {
    θj := θj − α (∂/∂θj) J(θ0,θ1)     (simultaneous update for j = 0 and j = 1)
}

α: learning rate

SLIDE 88

socialmedia-class.org

Next Class:

  • Logistic Regression