 
Vector Space Models
Prof. Sameer Singh
CS 295: STATISTICAL NLP WINTER 2017
January 19, 2017
Based on slides from Jacob Eisenstein, Noah Smith, Mohit Bansal, Richard Socher, and everyone else they copied from.

Outline
• Latent Semantic Analysis
• Vector Models for Words
• Reducing the Dimensions
• Direct Embeddings
CS 295: STATISTICAL NLP (WINTER 2017)

Outline: Latent Semantic Analysis

Example: Documents
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
From http://lsa.colorado.edu/papers/dp1.LSAintro.pdf

Example: Term-Doc Matrix
Columns (documents): c1 c2 c3 c4 c5 m1 m2 m3 m4
Rows (terms): human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors

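The term-document matrix for the nine example titles can be rebuilt by counting how often each index term occurs in each title. The following is a minimal sketch, assuming lowercase alphabetic tokenization; the names `DOCS`, `TERMS`, and `term_doc_matrix` are illustrative and not from the slides.

```python
import re

# The nine example titles from the "Example: Documents" slide.
DOCS = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

# The twelve index terms listed on the slide, lowercased.
TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]


def term_doc_matrix(docs, terms):
    """Return {term: {doc_id: count}} using lowercase alphabetic tokens."""
    matrix = {t: {d: 0 for d in docs} for t in terms}
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in matrix:
                matrix[token][doc_id] += 1
    return matrix


matrix = term_doc_matrix(DOCS, TERMS)
print(" " * 10, " ".join(DOCS))
for term in TERMS:
    print(f"{term:<10}", "  ".join(str(matrix[term][d]) for d in DOCS))
```

Each column of the printed table is the sparse vector for one document, and each row is the vector for one term.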
Problems with Sparse Vectors
c2: A survey of user opinion of computer system response time
c1: Human machine interface for ABC computer applications
m4: Graph minors: A survey

Example: Distance Matrix
Rows and columns (documents): c1 c2 c3 c4 c5 m1 m2 m3 m4

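The weakness of sparse vectors shows up as soon as pairwise similarities are computed. The sketch below assumes cosine similarity as the measure (the slide's own distance function is not preserved in the text) and uses hand-entered counts for four of the titles, restricted to the twelve index terms.

```python
import math

# Sparse bag-of-words vectors over the index terms; every listed term
# occurs exactly once in these four titles.
docs = {
    "c1": {"human": 1, "interface": 1, "computer": 1},
    "c2": {"survey": 1, "user": 1, "computer": 1, "system": 1,
           "response": 1, "time": 1},
    "c5": {"user": 1, "response": 1, "time": 1},
    "m4": {"graph": 1, "minors": 1, "survey": 1},
}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


print(f"{cosine(docs['c1'], docs['c5']):.3f}")  # 0.000: no shared terms
print(f"{cosine(docs['c1'], docs['c2']):.3f}")  # 0.236: share "computer"
print(f"{cosine(docs['c2'], docs['m4']):.3f}")  # 0.236: share "survey"
```

Two human-computer interaction titles (c1 and c5) come out with zero similarity, while c2 looks exactly as close to the graph-theory title m4 as to c1, purely because of one shared word.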
Option 2: SVD

Latent Semantic Analysis (LSA)

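The equations on this slide did not survive extraction. As a reminder of the standard formulation, LSA takes the singular value decomposition of the term-document matrix and keeps only the top k singular values. The symbols below are assumptions chosen for this note (X is the m-by-n term-document matrix with m terms and n documents, and e_j is the j-th standard basis vector), not necessarily the slide's own notation.

```latex
X = U \Sigma V^{\top},
\qquad
X_k = U_k \Sigma_k V_k^{\top} \approx X,
\qquad
\hat{d}_j = \Sigma_k V_k^{\top} e_j \in \mathbb{R}^{k}
```

Here X_k is the best rank-k approximation of X in the least-squares sense, and \hat{d}_j is the k-dimensional vector for document j.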
Example: Decomposition
Columns (documents): c1 c2 c3 c4 c5 m1 m2 m3 m4
Rows (terms): human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors

New Document Vectors

Example: Reconstruction
Rows (terms): human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors

Example: Distance Matrix

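The decomposition, new document vectors, reconstruction, and revised distances from the preceding slides can be reproduced with a few lines of NumPy. This is a sketch under stated assumptions: the counts are entered by hand from the nine example titles, k = 2 is chosen for illustration, and cosine similarity stands in for whatever measure the slides used.

```python
import numpy as np

docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-document counts (rows: terms, columns: documents in the order above).
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

k = 2  # number of latent dimensions kept (an illustrative choice)

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# New document vectors: one k-dimensional row per document.
doc_vecs = (s[:k, None] * Vt[:k]).T

# Rank-k reconstruction of the original matrix.
A_k = (U[:, :k] * s[:k]) @ Vt[:k]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def doc_similarity(d1, d2):
    """Cosine similarity of two documents in the k-dimensional space."""
    return cosine(doc_vecs[docs.index(d1)], doc_vecs[docs.index(d2)])


print("raw   c1 vs c5:", round(cosine(A[:, 0], A[:, 4]), 2))
print("rank2 c1 vs c5:", round(doc_similarity("c1", "c5"), 2))
print("rank2 c1 vs m1:", round(doc_similarity("c1", "m1"), 2))
```

In the raw space c1 and c5 share no terms, so their similarity is zero; in the reduced space they end up much closer to each other than c1 is to the graph-theory title m1.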
Outline: Vector Models for Words

Let’s look at words
A bottle of tezguino is on the table.
Everybody likes tezguino.
Tezguino makes you drunk.
We make tezguino out of corn.
What does tezguino mean?
Loud, motor oil, tortillas, choices, wine
“You shall know a word by the company it keeps.” (Firth, 1957)

Term-Context Matrix
Contexts (columns):
C1: A bottle of ______ is on the table.
C2: Everybody likes ______.
C3: _____ makes you drunk.
C4: We make _____ out of corn.
Words (rows): tezguino, loud, motor oil, tortillas, choices, wine

What is a “Context”?
A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.
Can be anything you want!
• Entire contents of the sentence
• One word before and after
• Words in the same sentence
• Document it appears in
• Many other variations…

What is a “Context”?
A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.
Can be anything you want!
• Entire contents of the sentence
  • Unlikely to occur again!
• One word before and after
• Words in the same sentence
• Document ID it appears in
• Many other variations…
Matrix columns: C1 C2 C3 C4; rows: tezguino, wine

What is a “Context”?
A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.
Can be anything you want!
• Entire contents of the sentence
• One word before and after
  • Or n-words
• Words in the same sentence
• Document it appears in
• Many other variations…
Matrix columns: bottle-of, is-of, makes-you, and-got, the-terrible, is-on; rows: tezguino, wine

What is a “Context”?
A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.
Can be anything you want!
• Entire contents of the sentence
• One word before and after
• Words in the same sentence
  • Filter: nouns and verbs?
  • Bag of words in a window
• Document it appears in
• Many other variations…
Matrix columns: bottle, table, you, drunk, fancy, night, terrible; rows: tezguino, wine

What is a “Context”?
A bottle of tezguino is on the table.
Tezguino makes you drunk.
…
I had a fancy bottle of wine and got drunk last night!
The terrible wine is on the table.
Can be anything you want!
• Entire contents of the sentence
• One word before and after
• Words in the same sentence
• Document it appears in
  • Term-document matrix!
  • Latent Semantic Analysis
• Many other variations…
Matrix columns: D1 D2 D3 D4; rows: tezguino, table, bottle, drunk, wine

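The context choices listed on these slides differ mainly in how wide a window around the target word is counted. The sketch below assumes lowercase alphabetic tokenization and a symmetric window; `context_counts` is an illustrative helper, not something defined in the lecture. A window of 1 gives "one word before and after", and a window wider than any sentence gives "words in the same sentence".

```python
import re
from collections import Counter, defaultdict

sentences = [
    "A bottle of tezguino is on the table.",
    "Tezguino makes you drunk.",
    "I had a fancy bottle of wine and got drunk last night!",
    "The terrible wine is on the table.",
]


def context_counts(sentences, targets, window):
    """Count words within `window` positions of each target word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token in targets:
                start = max(0, i - window)
                neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
                counts[token].update(neighbours)
    return counts


targets = {"tezguino", "wine"}
narrow = context_counts(sentences, targets, window=1)   # adjacent words only
wide = context_counts(sentences, targets, window=10)    # whole sentence here

print(sorted(narrow["tezguino"]))
print(sorted(narrow["wine"]))
print(sorted(set(wide["tezguino"]) & set(wide["wine"])))
```

With the wide window, tezguino and wine share content words such as bottle, drunk, and table, which is the kind of overlap that makes their rows in the term-context matrix look alike.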
Pointwise Mutual Information
Raw counts are not good
• Skewed towards common words/contexts
• Many of them are not informative
• is, the, it, they, …
PMI(w,c)
• How much more likely is w to occur in c, than just randomly?

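The formula itself is missing from the extracted slide. The standard definition is PMI(w,c) = log [ P(w,c) / (P(w) P(c)) ], with probabilities estimated from co-occurrence counts. Below is a minimal sketch of that definition; the counts are hypothetical numbers invented for illustration, and base-2 logarithms are an arbitrary choice.

```python
import math

# Hypothetical (word, context) co-occurrence counts, invented for illustration.
counts = {
    ("wine", "the"): 4, ("wine", "drunk"): 2,
    ("table", "the"): 4, ("table", "drunk"): 0,
}


def pmi(word, context, counts):
    """PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ), estimated from counts."""
    total = sum(counts.values())
    word_total = sum(n for (w, _), n in counts.items() if w == word)
    context_total = sum(n for (_, c), n in counts.items() if c == context)
    joint = counts.get((word, context), 0)
    if joint == 0:
        return float("-inf")  # the pair was never observed
    return math.log2(joint * total / (word_total * context_total))


print(f"{pmi('wine', 'the', counts):.2f}")    # -0.26
print(f"{pmi('wine', 'drunk', counts):.2f}")  # 0.74
```

By raw count, "the" is the stronger context for wine (4 versus 2), but it co-occurs with everything; PMI ranks "drunk" higher because wine appears with it more often than chance would predict.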
Outline: Reducing the Dimensions

Option 1: Revisiting Clustering

Hierarchical Clustering

Example

Brown Clusters for Twitter
http://www.cs.cmu.edu/~ark/TweetNLP/cluster_viewer.html

Option 2: SVD

Example Word Projection

Problem with SVD & Clustering
Computational Complexity
• SVD: O(mn^2)
• Clustering: O(knm) per iteration, or O(n^3)
• But, n can be 100,000!
“One shot”
• Difficult to add new documents or words
• Cannot work with streaming data

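To get a feel for these bounds, the operation counts can be worked out for the n = 100,000 quoted on the slide. This is back-of-the-envelope arithmetic only: m is assumed equal to n for illustration, constant factors are ignored, and the matrix is assumed to be stored densely as 8-byte floats.

```python
n = 100_000  # vocabulary size quoted on the slide
m = 100_000  # number of contexts; assumed equal to n for illustration

svd_ops = m * n ** 2         # O(mn^2) bound for SVD
cluster_ops = n ** 3         # O(n^3) bound for clustering
dense_bytes = m * n * 8      # one float64 per matrix entry

print(f"SVD:        about {svd_ops:.0e} operations")
print(f"Clustering: about {cluster_ops:.0e} operations")
print(f"Dense m x n matrix: {dense_bytes / 1e9:.0f} GB")
```

Under these assumptions both bounds come to 10^15 basic operations, and simply holding the dense matrix in memory takes 80 GB.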
Outline: Direct Embeddings

Predict surrounding words
A bottle of tezguino is on the table.
Vectors: u, v

Estimating the Word Vectors

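The estimation details on this slide were lost in extraction. One common way to learn vectors by predicting surrounding words is the skip-gram model with a full softmax, p(c | w) proportional to exp(u_c · v_w), trained by stochastic gradient descent on the negative log-likelihood. The sketch below assumes that formulation, and assumes v holds word vectors and u holds context vectors; the slide's u and v labels suggest this but the text does not confirm it. The corpus, window size, dimension, and learning rate are all illustrative choices.

```python
import math
import random
import re

random.seed(0)

sentences = [
    "A bottle of tezguino is on the table.",
    "Tezguino makes you drunk.",
    "I had a fancy bottle of wine and got drunk last night!",
    "The terrible wine is on the table.",
]
tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
vocab = sorted({t for sent in tokenized for t in sent})
index = {w: i for i, w in enumerate(vocab)}


def skipgram_pairs(tokenized, window=2):
    """Return (word_id, context_id) pairs for words within the window."""
    pairs = []
    for sent in tokenized:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    pairs.append((index[word], index[sent[j]]))
    return pairs


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train(pairs, vocab_size, dim=8, epochs=50, lr=0.1):
    """SGD on -log p(c | w); returns word vectors V, context vectors U, losses."""
    V = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    U = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for w, c in pairs:
            v = V[w]
            probs = softmax([sum(a * b for a, b in zip(u, v)) for u in U])
            total -= math.log(probs[c])
            grad_v = [0.0] * dim
            for j, u in enumerate(U):
                err = probs[j] - (1.0 if j == c else 0.0)
                for d in range(dim):
                    grad_v[d] += err * u[d]   # uses u before its update
                    u[d] -= lr * err * v[d]
            for d in range(dim):
                v[d] -= lr * grad_v[d]
        losses.append(total / len(pairs))
    return V, U, losses


pairs = skipgram_pairs(tokenized)
V, U, losses = train(pairs, len(vocab))
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

Because each update touches only one (word, context) pair, new text can be folded in incrementally. The full softmax used here costs time proportional to the vocabulary size per update, so practical implementations replace it with cheaper approximations.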
Similar Meaning = Close

Similar Meaning = Close
https://siddhant7.github.io/Vector-Representation-of-Words/

Vectors “know” Gender
male : female :: king : queen
king - male + female ≈ queen
https://siddhant7.github.io/Vector-Representation-of-Words/

They “know” Tenses!
walking : walked :: swimming : swam
swimming - walking + walked ≈ swam
https://siddhant7.github.io/Vector-Representation-of-Words/

They “know” Facts!
Country - Capital (figure labels: Spain, Madrid)
https://siddhant7.github.io/Vector-Representation-of-Words/

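The analogy arithmetic on these slides amounts to a vector offset followed by a nearest-neighbour search. The sketch below uses hypothetical three-dimensional vectors invented so that the offset works out exactly; real learned embeddings have hundreds of dimensions and only satisfy such relations approximately. `analogy` is an illustrative helper, not a library function.

```python
import math

# Hypothetical 3-d vectors invented for illustration: (royalty, gender, food).
vectors = {
    "male": (0.0, 1.0, 0.0),
    "female": (0.0, -1.0, 0.0),
    "king": (1.0, 1.0, 0.0),
    "queen": (1.0, -1.0, 0.0),
    "apple": (0.0, 0.0, 1.0),
}


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? as the word nearest to vec(c) - vec(a) + vec(b)."""
    target = tuple(vc - va + vb
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("male", "female", "king", vectors))  # queen
```

The three query words are excluded from the candidates, which is the usual convention; without that exclusion the nearest neighbour of the target is often one of the inputs.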
Upcoming…
Homework
• Homework 1 is up! No more material will be covered
• Due: January 26, 2017
Project
• Project pitch is due January 23, 2017!
• Start assembling teams now
• Tons of datasets on the “projects” page on website