Distributed Representations of Sentences and Documents




  1. Distributed Representations of Sentences and Documents
   Quoc Le, Tomas Mikolov
   Presenters: Amin and Ali (2017-11-27)
   Outline
   ▶ Introduction
   ▶ Algorithm
     - Learning Vector Representation of Words
     - Paragraph Vector: a distributed memory model
     - Paragraph Vector without word ordering: distributed bag of words
   ▶ Experiments
   ▶ Conclusion
   ▶ Demo

  2. Introduction
   ▶ Many machine learning algorithms require the input to be represented as a fixed-length feature vector.
   ▶ When it comes to text, one of the most common fixed-length features is bag-of-words; a sketch follows below.
   Bag of Words
   [Diagram: a sentence is mapped to a vector of word counts over the vocabulary.]
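
   To make this concrete, here is a minimal bag-of-words sketch in Python; the two-sentence corpus is an invented toy example, not data from the paper:

       from collections import Counter

       corpus = ["the cat sat on the mat", "the dog sat"]  # toy corpus (illustrative)
       vocab = sorted({w for text in corpus for w in text.split()})

       def bow_vector(text):
           """Map a text to a fixed-length vector of word counts over the shared vocabulary."""
           counts = Counter(text.split())
           return [counts[w] for w in vocab]

       print(vocab)                                 # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
       print(bow_vector("the cat sat on the mat"))  # [1, 0, 1, 1, 1, 2]
       print(bow_vector("the dog sat"))             # [0, 1, 0, 0, 1, 1]

   Every text, whatever its length, becomes a vector of length |vocab|, which is exactly the fixed-length property the algorithms above require.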

  3. Bag of Words: Disadvantages
   ▶ Word order is lost, so different sentences can have exactly the same representation as long as they use the same words.
   ▶ Even though bag-of-n-grams considers word order in a short context, it suffers from data sparsity and high dimensionality.
   ▶ Bag-of-words and bag-of-n-grams have very little sense of the semantics of the words or, more formally, the distances between words: "powerful", "Paris", and "strong" are all equally distant.
   Word Embedding
   [Diagram: CBOW sums the projections of the context words word(i-k), ..., word(i+k) to predict word(i); Skip-gram projects word(i) to predict each context word. A sketch of the two context schemes follows below.]
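
   As a sketch of the difference between the two architectures in the diagram: with a sliding window of (assumed) size k = 2, CBOW forms (context words -> center word) training pairs, while Skip-gram forms (center word -> each context word) pairs:

       # Toy sentence; window size k = 2 is an assumption for illustration.
       words = "the cat sat on the mat".split()
       k = 2

       for i, center in enumerate(words):
           left, right = max(0, i - k), min(len(words), i + k + 1)
           context = [words[j] for j in range(left, right) if j != i]
           print("CBOW pair:      ", (context, center))               # context -> center
           print("Skip-gram pairs:", [(center, c) for c in context])  # center -> each context word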

  4. Word Embedding
   Proposed Method
   ▶ The paper "Distributed Representations of Sentences and Documents" proposes Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text.
   ▶ The algorithm represents each document by a dense vector which is trained to predict words in the document.
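
   For readers who want to try this, here is a minimal sketch using gensim's Doc2Vec, a popular independent reimplementation of Paragraph Vector (not the authors' code); the corpus and hyperparameters are illustrative assumptions:

       from gensim.models.doc2vec import Doc2Vec, TaggedDocument

       corpus = ["the cat sat on the mat", "dogs chase cats"]  # toy documents (assumed)
       docs = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

       # dm=1 selects the PV-DM variant; vector_size/window/epochs are illustrative.
       model = Doc2Vec(docs, vector_size=50, window=5, min_count=1, epochs=40, dm=1)

       vec = model.infer_vector("a cat on a mat".split())  # fixed-length vector for unseen text
       print(vec.shape)  # (50,)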

  5. Learning Vector Representation of Words
   ▶ The task is to predict a word given the other words in its context.
   Paragraph Vector: A Distributed Memory Model (PV-DM)
   ▶ Paragraph vectors are used for prediction; a numpy sketch of one prediction step follows below.
   ▶ Every paragraph is mapped to a unique vector.
   ▶ Every word is also mapped to a unique vector.
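
   A sketch of one PV-DM prediction step in numpy; sizes are toy values, and averaging is assumed as the combination step (the paper also allows concatenating the vectors):

       import numpy as np

       V, N, P = 1000, 50, 100        # vocab size, vector size, paragraph count (toy)
       D = np.random.randn(P, N)      # paragraph vectors: one per paragraph
       W = np.random.randn(V, N)      # word vectors: shared across paragraphs
       U = np.random.randn(V, N)      # softmax output weights

       para_id, context_ids = 7, [3, 12, 48]   # a paragraph and a sampled context window
       # Average the paragraph vector with the context word vectors.
       h = (D[para_id] + W[context_ids].sum(axis=0)) / (1 + len(context_ids))

       scores = U @ h                 # one score per vocabulary word
       probs = np.exp(scores - scores.max())
       probs /= probs.sum()           # softmax: distribution over the predicted word
       print(int(probs.argmax()))     # id of the predicted word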

  6. Paragraph Vector: A Distributed Memory Model (PV-DM)
   ▶ The contexts are sampled from a sliding window over the paragraph.
   ▶ The paragraph vector is shared across all contexts from the same paragraph.
   ▶ Word vectors are shared across paragraphs.
   Advantages over BOW
   ▶ Semantics of the words: in this space, "powerful" is closer to "strong" than to "Paris".
   ▶ Word order is taken into consideration.

  7. Paragraph Vector: Distributed Bag of Words (PV-DBOW)
   ▶ In this version, the paragraph vector is trained to predict the words in a small window.
   Experiments
   ▶ Each paragraph vector is a combination of two vectors: one learned by PV-DM and one learned by PV-DBOW (see the sketch below).
   ▶ Sentiment analysis: Stanford Sentiment Treebank (11,855 sentences) and IMDB (100,000 movie reviews).
   ▶ Information retrieval.
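
   A sketch of that combination using gensim (attribute names follow gensim 4.x; data and hyperparameters are illustrative assumptions): train one PV-DM model and one PV-DBOW model, then concatenate the two vectors for each paragraph:

       import numpy as np
       from gensim.models.doc2vec import Doc2Vec, TaggedDocument

       corpus = ["an excellent movie", "a terrible movie"]  # toy documents (assumed)
       docs = [TaggedDocument(text.split(), [i]) for i, text in enumerate(corpus)]

       pv_dm = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40, dm=1)    # PV-DM
       pv_dbow = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40, dm=0)  # PV-DBOW

       # One 100-dim feature vector per paragraph: 50 from PV-DM + 50 from PV-DBOW.
       features = np.hstack([
           [pv_dm.dv[i] for i in range(len(corpus))],
           [pv_dbow.dv[i] for i in range(len(corpus))],
       ])
       print(features.shape)  # (2, 100)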

  8. Stanford Sentiment Treebank
   ▶ Learn the representations for all the sentences.
   ▶ The paragraph vector is the concatenation of the two vectors from PV-DBOW and PV-DM.
   ▶ Logistic regression was used for prediction (a sketch follows below).
   ▶ Every sentence has a label ranging from 0.0 to 1.0.
   Stanford Sentiment Treebank
   [Results table]
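
   A sketch of the prediction step with scikit-learn; X and y are placeholders standing in for the learned paragraph vectors and the sentiment labels:

       import numpy as np
       from sklearn.linear_model import LogisticRegression

       rng = np.random.default_rng(0)
       X = rng.standard_normal((100, 100))   # placeholder: concatenated paragraph vectors
       y = rng.integers(0, 2, 100)           # placeholder: binary sentiment labels

       clf = LogisticRegression(max_iter=1000).fit(X, y)
       print(clf.predict(X[:5]))             # predicted labels for the first five sentences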

  9. IMDB
   ▶ Neural networks and logistic regression were used for prediction.
   ▶ The paragraph vector is the concatenation of the two vectors from PV-DBOW and PV-DM.
   IMDB
   [Results table]

  10. Information Retrieval
   Example paragraphs returned as search-result snippets:
   ▶ calls from ( 000 ) 000 - 0000 . 3913 calls reported from this number . according to 4 reports the identity of this caller is american airlines .
   ▶ do you want to find out who called you from +1 000 - 000 - 0000 , +1 0000000000 or ( 000 ) 000 - 0000 ? see reports and share information you have about this caller
   ▶ allina health clinic patients for your convenience , you can pay your allina health clinic bill online . pay your clinic bill now , question and answers ...
   The first two paragraphs come from results for the same query; the task is to recognize that the third does not belong with them.
   Observations
   ▶ PV-DM is consistently better than PV-DBOW.
   ▶ PV-DM alone can achieve good results.
   ▶ The combination of PV-DM and PV-DBOW gives the best results.
   ▶ A good guess for the window size is between 5 and 12.
   ▶ The proposed method must be run in parallel.

  11. Advantages and Disadvantages
   ▶ The proposed method is competitive with state-of-the-art methods.
   ▶ The good performance demonstrates the merits of Paragraph Vector in capturing the semantics of paragraphs.
   ▶ It is scalable (sentences, paragraphs, and documents).
   ▶ Paragraph vectors have the potential to overcome many weaknesses of bag-of-words (word order, word meaning, ...).
   ▶ Paragraph Vector can be expensive: there are too many parameters.
   ▶ If the input corpus has lots of misspellings, as tweets do, this algorithm may not be a good choice.
   Demo

  12. [Diagram, slide 23: the context words "cat" and "on" from "the cat sat on the mat" are encoded as V-dimensional one-hot vectors (a 1 at the word's index in the vocabulary, 0 elsewhere) and fed through the hidden layer to predict the center word "sat".]
   We must learn W and W': the one-hot input (V-dim) is multiplied by $W_{V \times N}$ to produce the hidden layer (N-dim), which is multiplied by $W'_{N \times V}$ to produce the output layer (V-dim). N will be the size of the word vector.
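
   A sketch of what the one-hot multiplication does: multiplying a one-hot vector by W simply selects one row of W, the word's N-dimensional vector (toy sizes; the index of "cat" is an assumption):

       import numpy as np

       V, N = 10, 4               # toy vocabulary size and word-vector size
       W = np.random.randn(V, N)  # input-to-hidden weights: one row per word

       cat_index = 3              # assumed index of "cat" in the vocabulary
       x = np.zeros(V)
       x[cat_index] = 1.0         # one-hot encoding of "cat"

       h = W.T @ x                # hidden layer: exactly row cat_index of W
       assert np.allclose(h, W[cat_index])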

  13. Slide 23, note vh2: One-hot encoding is used to encode categorical integer features using a one-of-K scheme. Suppose you have a 'color' feature which can take the values 'green', 'red', and 'blue'. One-hot encoding converts this 'color' feature into three binary features: 'is_green', 'is_red', and 'is_blue'. (vagelis hristidis, 2016-11-06)
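
   The note's color example, written out as a small sketch:

       # Turn one categorical 'color' feature into three binary features.
       categories = ["green", "red", "blue"]

       def one_hot(value):
           return {f"is_{c}": int(value == c) for c in categories}

       print(one_hot("red"))  # {'is_green': 0, 'is_red': 1, 'is_blue': 0}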

  14. [Slides 25-26: a worked numerical example of the input side. The one-hot vector for "cat" multiplied by W selects the row of W that is cat's word vector; the one-hot vector for "on" selects on's row. The hidden layer is the average of the two context vectors, $h = \frac{1}{2} W^\top (x_{cat} + x_{on})$, and the output layer then scores the center word "sat".]
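
   The same computation as a numpy sketch (toy numbers, not the values shown on the slides; the vocabulary indices are assumptions):

       import numpy as np

       V, N = 6, 3
       W = np.random.randn(V, N)   # one word vector per row
       cat_idx, on_idx = 1, 3      # assumed indices of "cat" and "on"

       v_cat = W[cat_idx]          # x_cat^T @ W selects cat's row
       v_on = W[on_idx]            # x_on^T @ W selects on's row
       h = (v_cat + v_on) / 2      # hidden layer: average of the context vectors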

  15. [Slides 27-28: the output side. The hidden layer is multiplied by W' and passed through a softmax, $\hat{y} = \mathrm{softmax}(W'^\top h)$, giving a probability for every word in the vocabulary. We would prefer $\hat{y}$ to be close to the true one-hot target $y$; in the slide's example the trained network puts most of the probability mass (0.7) on the correct word and small values (0.00-0.02) elsewhere. N will be the size of the word vector.]
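
   A sketch of that output step (toy sizes; the index of "sat" is an assumption):

       import numpy as np

       V, N = 6, 3
       W_prime = np.random.randn(N, V)   # hidden-to-output weights
       h = np.random.randn(N)            # hidden layer from the previous step

       z = W_prime.T @ h                 # one score per vocabulary word
       y_hat = np.exp(z - z.max())
       y_hat /= y_hat.sum()              # softmax: probabilities summing to 1

       sat_idx = 2                       # assumed index of the target word "sat"
       loss = -np.log(y_hat[sat_idx])    # cross-entropy against the one-hot target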

  16. [Slide 29: the rows of W contain the words' vectors.]
   We can consider either W or W' as the word's representation, or even take the average of the two.
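
   As a final sketch, both matrices hold one vector per word, so the average is a one-liner:

       import numpy as np

       V, N = 6, 3
       W = np.random.randn(V, N)         # V x N: one row per word
       W_prime = np.random.randn(N, V)   # N x V: one column per word

       embeddings = (W + W_prime.T) / 2  # average of the two representations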
