Distributed Representations of Sentences and Documents

Quoc Le, Tomas Mikolov. Presenters: Amin and Ali (2017-11-27)

Outline

▶ Introduction
▶ Algorithm
  • Learning Vector Representation of Words
  • Paragraph Vector: A Distributed Memory Model
  • Paragraph Vector without Word Ordering: Distributed Bag of Words
▶ Experiments
▶ Conclusion
▶ Demo


Introduction

▶ Many machine learning algorithms require the input to be represented as a fixed-length feature vector.

▶ When it comes to texts, one of the most common fixed-length features is bag-of-words.


Bag of Words



Bag of Words Disadvantages

▶ The word order is lost, and thus different sentences can have exactly the same representation as long as the same words are used (see the sketch below).

▶ Even though bag-of-n-grams considers the word order in short contexts, it suffers from data sparsity and high dimensionality.

▶ Bag-of-words and bag-of-n-grams have very little sense of the semantics of the words or, more formally, of the distances between the words (powerful, Paris, strong).
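As a side illustration (not from the original slides), here is a minimal Python sketch showing that bag-of-words drops word order, so two sentences built from the same words get identical representations:

```python
from collections import Counter

# Two sentences with different meanings but exactly the same words.
s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()

# Bag-of-words keeps only word counts, so the order is lost and both
# sentences receive exactly the same representation.
print(Counter(s1))                 # Counter({'the': 2, 'dog': 1, 'bit': 1, 'man': 1})
print(Counter(s1) == Counter(s2))  # True
```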


Word Embedding

[Figure: the CBOW and Skipgram architectures. CBOW sums the projections of the context words word(i-k) … word(i+k) to predict word(i); Skipgram projects word(i) to predict the surrounding context words.]
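To make the two architectures concrete, here is a small illustrative sketch (helper names are assumptions, not from the slides) of how training pairs would be generated from a sentence with window size k: CBOW predicts the centre word from its surrounding words, while Skipgram predicts each surrounding word from the centre word.

```python
def cbow_pairs(tokens, k=2):
    # CBOW: (context words, centre word) training pairs.
    return [([tokens[j] for j in range(max(0, i - k), min(len(tokens), i + k + 1)) if j != i],
             tokens[i])
            for i in range(len(tokens))]

def skipgram_pairs(tokens, k=2):
    # Skipgram: (centre word, one context word) training pairs.
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(max(0, i - k), min(len(tokens), i + k + 1)) if j != i]

sentence = "the cat sat on the mat".split()
print(cbow_pairs(sentence)[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram_pairs(sentence)[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```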



Word Embedding


Proposed Method

▶ The Distributed Representations of Sentences and Documents model was proposed.

▶ Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text.

▶ The proposed algorithm represents each document by a dense vector which is trained to predict words in the document.



Learning Vector Representation of Words

▶ The task is to predict a word given the other words in a context.


Paragraph Vector: A Distributed Memory Model (PV-DM)


▶ Paragraph vectors are used for prediction.

▶ Every paragraph is mapped to a unique vector.

▶ Every word is also mapped to a unique vector.


Paragraph Vector: A Distributed Memory Model (PV-DM)

▶ The contexts are sampled from a sliding window over the paragraph.

▶ The paragraph vector is shared across all contexts from the same paragraph.

▶ Word vectors are shared across paragraphs (see the sketch below).
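A minimal numerical sketch of one PV-DM prediction step (toy dimensions and random weights, not the authors' implementation; the paper allows concatenating or averaging the paragraph and word vectors, and this sketch averages):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, P = 1000, 50, 10              # toy vocabulary size, vector size, number of paragraphs
W = rng.normal(0, 0.01, (V, N))     # word vectors, shared across all paragraphs
D = rng.normal(0, 0.01, (P, N))     # one vector per paragraph
U = rng.normal(0, 0.01, (N, V))     # softmax weights

def pv_dm_predict(paragraph_id, context_word_ids):
    # Average the paragraph vector with the context word vectors...
    h = (D[paragraph_id] + W[context_word_ids].sum(axis=0)) / (1 + len(context_word_ids))
    # ...then score every vocabulary word and normalise with a softmax.
    scores = h @ U
    e = np.exp(scores - scores.max())
    return e / e.sum()               # probability of each word being the predicted word

probs = pv_dm_predict(paragraph_id=3, context_word_ids=[17, 42, 256])
print(probs.shape)                   # (1000,)
```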


Advantages over BOW

  • Semantics of the words: in this space, “powerful” is closer to “strong” than to “Paris”.
  • Takes the word order into consideration.



Paragraph Vector: Distributed Bag of Words (PV-DBOW)

▶ In this version, the paragraph vector is trained to predict the words in a small window (see the sketch below).
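Both variants are available in the gensim library's Doc2Vec class; a minimal usage sketch (assuming gensim 4.x and a toy corpus) is shown below, where dm=0 selects PV-DBOW and dm=1 selects PV-DM.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is tagged so it gets its own paragraph vector.
docs = [
    TaggedDocument(words="the cat sat on the mat".split(), tags=["doc0"]),
    TaggedDocument(words="dogs bark at the mailman".split(), tags=["doc1"]),
]

# dm=0 -> PV-DBOW: the paragraph vector alone is trained to predict words from the paragraph.
model = Doc2Vec(docs, dm=0, vector_size=100, window=5, min_count=1, epochs=40)

vec = model.dv["doc0"]                                   # learned paragraph vector, shape (100,)
new_vec = model.infer_vector("a cat on a mat".split())   # vector for an unseen paragraph
```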


Experiments

▶ Each paragraph vector is a combination of two vectors: one learned by PV-DM and one learned by PV-DBOW.

▶ Sentiment Analysis
  • Stanford Sentiment Treebank: 11,855 sentences
  • IMDB: 100,000 movie reviews

▶ Information Retrieval



Stanford Sentiment Treebank

▶ Learn the representations for all the sentences.

▶ The paragraph vector is the concatenation of two vectors, one from PV-DBOW and one from PV-DM (see the sketch below).

▶ Logistic regression was used for prediction.

▶ Every sentence has a label that ranges from 0.0 to 1.0.
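A self-contained sketch of this pipeline (random placeholder arrays stand in for the learned PV-DM and PV-DBOW representations and the sentiment labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_sentences = 200

# Placeholders for the representations learned by the two models and for the labels.
vecs_dm = rng.normal(size=(n_sentences, 400))     # PV-DM paragraph vectors
vecs_dbow = rng.normal(size=(n_sentences, 400))   # PV-DBOW paragraph vectors
labels = rng.integers(0, 2, size=n_sentences)     # sentiment labels

# The final paragraph vector is the concatenation of the two.
features = np.concatenate([vecs_dm, vecs_dbow], axis=1)  # 800 dimensions per sentence

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.score(features, labels))
```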


Stanford Sentiment Treebank



IMDB

▶ Neural networks and logistic regression were used for prediction.

▶ The paragraph vector is the concatenation of two vectors, one from PV-DBOW and one from PV-DM.


IMDB



Information Retrieval

Example snippets from the information retrieval experiment (the first two paragraphs come from results of the same query, the third from a different query; the task is to identify the paragraph that does not belong):

  • “calls from ( 000 ) 000 - 0000 . 3913 calls reported from this number . according to 4 reports the identity of this caller is american airlines .”

  • “do you want to find out who called you from +1 000 - 000 - 0000 , +1 0000000000 or ( 000) 000 - 0000 ? see reports and share information you have about this caller”

  • “allina health clinic patients for your convenience , you can pay your allina health clinic bill online . pay your clinic bill now , question and answers...”


Observations

▶ PV-DM is consistently better than PV-DBOW.

▶ PV-DM alone can achieve good results.

▶ The combination of PV-DM and PV-DBOW achieves the best results.

▶ A good guess for the window size is between 5 and 12.

▶ The proposed method must be run in parallel (see the sketch below).
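For reference, the window-size and parallelism observations map directly onto gensim Doc2Vec hyperparameters; a hedged sketch with a toy corpus, assuming gensim 4.x:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny toy corpus; in practice this would be the full training set.
train_docs = [
    TaggedDocument(words="the cat sat on the mat".split(), tags=[0]),
    TaggedDocument(words="dogs bark at the mailman".split(), tags=[1]),
]

# window chosen from the 5-12 range suggested above; workers > 1 trains in parallel threads.
model = Doc2Vec(train_docs, dm=1, vector_size=400, window=8,
                min_count=1, workers=8, epochs=20)
```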



Advantages and Disadvantages

Advantages:

  • The proposed method is competitive with state-of-the-art methods.
  • The good performance demonstrates the merits of Paragraph Vector in capturing the semantics of paragraphs.
  • It is scalable (sentences, paragraphs, and documents).
  • Paragraph vectors have the potential to overcome many weaknesses of bag-of-words (word order, word meaning, …).

Disadvantages:

  • Paragraph Vector can be expensive.
  • Too many parameters.
  • If the input corpus is one with lots of misspellings, like tweets, this algorithm may not be a good choice.


Demo



[Slides 23-24: CBOW walkthrough. The input layer takes one-hot vectors (V-dim) for the context words “cat” and “on”; a hidden layer of size N (N will be the size of the word vector) feeds a V-dim output layer that predicts “sat”. We must learn the weight matrices W and W’.]


Note on slide 23 (Vagelis Hristidis, 2016-11-06): one-hot encoding is used to encode categorical integer features using a one-hot, a.k.a. one-of-K, scheme. Suppose you have a ‘color’ feature which can take the values ‘green’, ‘red’, and ‘blue’. One-hot encoding converts this ‘color’ feature into three features, namely ‘is_green’, ‘is_red’, and ‘is_blue’, which are all binary.
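A tiny sketch of the ‘color’ example from the note (illustrative only, not from the slides):

```python
import numpy as np

values = ["green", "red", "blue", "red"]   # the raw 'color' feature
categories = ["green", "red", "blue"]      # the K possible values

# One-of-K encoding: each value becomes a binary indicator vector whose
# columns play the role of is_green, is_red, is_blue.
one_hot = np.array([[1 if v == c else 0 for c in categories] for v in values])
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]
```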


[Slides 25-26: the one-hot inputs x_cat and x_on each select a row of the V×N weight matrix W (shown with illustrative numeric entries); the selected N-dim word vectors are added in the hidden layer.]


[Slides 27-28: the N-dim hidden layer is multiplied by W’ to produce a V-dim output; after the softmax, the output distribution (e.g. 0.01, 0.02, 0.00, …, 0.7, …) should be close to the one-hot vector of the target word “sat”.]


[Slide 29: the same network with the trained weight matrix W shown.]

  • W and W’ contain the words’ vectors.
  • We can consider either W or W’ as the word’s representation, or even take the average (see the sketch below).
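A minimal numerical sketch of the forward pass these slides walk through (toy sizes and random weights, not the original slide code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                             # toy vocabulary size and word-vector size
W = rng.normal(0, 0.01, (V, N))          # input -> hidden weights; row i is word i's vector
W_prime = rng.normal(0, 0.01, (N, V))    # hidden -> output weights

def cbow_forward(context_ids):
    # Multiplying a one-hot vector by W just selects a row, so the hidden
    # layer is the average of the context words' rows of W.
    h = W[context_ids].mean(axis=0)
    scores = h @ W_prime                 # one score per vocabulary word
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # softmax; training pushes this toward the target's one-hot

# e.g. predict the middle word given the contexts "cat" (id 2) and "on" (id 5).
probs = cbow_forward([2, 5])
print(probs.argmax(), probs.max())
```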