Words vs. Terms

Information Retrieval cares about “terms”. You search for ’em, Google indexes ’em.

Query: What kind of monkeys live in Costa Rica?

For that query, what should the terms be?

  • words?
  • content words?
  • word stems?
  • word clusters?
  • multi-word phrases?
  • thematic content? (this is a “habitat question”)

Finding Phrases (“collocations”)

  • kick the bucket
  • directed graph
  • iambic pentameter
  • Osama bin Laden
  • United Nations
  • real estate
  • quality control
  • international best practice
  • …

These have their own meanings, translations, etc.

Just use common bigrams? Doesn’t work:

  80871  of the
  58841  in the
  26430  to the
  …
  15494  to be
  …
  12622  from the
  11428  New York
  10007  he said

Possible correction – just drop function words!

Better correction – filter by tags: A N, N N, N P N, …

  11487  New York
   7261  United States
   5412  Los Angeles
   3301  last year
  …
   1074  chief executive
   1073  real estate
  …
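A minimal sketch of that tag filter, assuming NLTK and its tagged Brown corpus as stand-ins for the NY Times data on these slides:

```python
# Count bigrams but keep only tag patterns typical of collocations
# (A N and N N). Brown corpus stands in for the NY Times data.
from collections import Counter
import nltk

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)
tagged = nltk.corpus.brown.tagged_words(tagset="universal")

KEEP = {("ADJ", "NOUN"), ("NOUN", "NOUN")}   # the A N and N N patterns

counts = Counter(
    (w1.lower(), w2.lower())
    for (w1, t1), (w2, t2) in nltk.bigrams(tagged)
    if (t1, t2) in KEEP
)

for (w1, w2), c in counts.most_common(10):
    print(f"{c:6d} {w1} {w2}")
```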

Still want to filter out “new companies”: these words occur together reasonably often, but only because both are frequent. Do they occur more often than you would expect by chance [among A N pairs?]?

  Expect by chance:   p(new) p(companies)
  Actually observed:  p(new companies) = p(new) p(companies | new)

Two tests: mutual information, and a binomial significance test.

(Pointwise) Mutual Information

Data from the Manning & Schütze textbook (14 million words of NY Times):

                     new ___      ¬new ___        TOTAL
  ___ companies            8         4,667        4,675
  ___ ¬companies      15,820    14,287,181   14,303,001
  TOTAL               15,828    14,291,848   14,307,676

(The ¬new column covers bigrams like “old companies” and “old machines”.)

Is p(new companies) = p(new) p(companies)?

  MI = log2 [ p(new companies) / (p(new) p(companies)) ]
     = log2 [ (8/N) / ((15828/N)(4675/N)) ]       with N = 14,307,676
     = log2 1.55 = 0.63

  • MI > 0 if and only if p(co’s | new) > p(co’s) > p(co’s | ¬new)
  • Here MI is positive but small. Would be larger for stronger collocations.
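A quick check of that arithmetic, as a minimal sketch in plain Python with the counts hard-coded from the table above:

```python
import math

N = 14_307_676        # total bigrams in the corpus
c_new = 15_828        # bigrams whose first word is "new"
c_companies = 4_675   # bigrams whose second word is "companies"
c_both = 8            # bigrams "new companies"

# MI = log2 [ p(new companies) / (p(new) p(companies)) ]
mi = math.log2((c_both / N) / ((c_new / N) * (c_companies / N)))
print(f"MI(new, companies) = {mi:.2f}")   # 0.63, as on the slide
```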

Significance Tests

Sparse data! In fact, suppose we divided all counts by 8:

                     new ___      ¬new ___        TOTAL
  ___ companies            1           583          584
  ___ ¬companies       1,978     1,785,898    1,787,876
  TOTAL                1,979     1,786,481    1,788,460

Would MI change? No, yet we should be less confident it’s a real collocation. Extreme case: what happens if 2 novel words appear next to each other?

So do a significance test! It takes sample size into account.

Binomial Significance (“Coin Flips”)

(Back to the full 14-million-word table.)

  • Assume we have 2 coins that were used when generating the text.
  • Following new, we flip coin A to decide whether companies is next.
  • Following ¬new, we flip coin B to decide whether companies is next.
  • We can see that A was flipped 15,828 times and got 8 heads.
      – Probability of this: p^8 (1−p)^15820 · 15828! / (8! 15820!)
  • We can see that B was flipped 14,291,848 times and got 4,667 heads.
  • Our question: do the two coins have different weights? (Equivalently, are there really two separate coins, or just one?)

  • Null hypothesis: same coin
      – assume p_null(co’s | new) = p_null(co’s | ¬new) = p_null(co’s) = 4675/14307676
      – p_null(data) = p_null(8 out of 15828) · p_null(4667 out of 14291848) = .00042
  • Collocation hypothesis: different coins
      – assume p_coll(co’s | new) = 8/15828, p_coll(co’s | ¬new) = 4667/14291848
      – p_coll(data) = p_coll(8 out of 15828) · p_coll(4667 out of 14291848) = .00081
  • So the collocation hypothesis doubles p(data).
  • We can sort bigrams by the log-likelihood ratio: log p_coll(data) / p_null(data)
      – i.e., how sure are we that “companies” is more likely after “new”?
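These two likelihoods are easy to reproduce; a minimal sketch using SciPy’s binomial pmf, with the counts from the table:

```python
import math
from scipy.stats import binom

k1, n1 = 8, 15_828            # heads and flips for coin A (after "new")
k2, n2 = 4_667, 14_291_848    # heads and flips for coin B (after ¬new)

# Null hypothesis: one shared coin.
p_null = (k1 + k2) / (n1 + n2)                     # = 4675/14307676
L_null = binom.pmf(k1, n1, p_null) * binom.pmf(k2, n2, p_null)

# Collocation hypothesis: each context gets its own coin.
L_coll = binom.pmf(k1, n1, k1 / n1) * binom.pmf(k2, n2, k2 / n2)

print(L_null)                      # ≈ .00042, as on the slide
print(L_coll)                      # ≈ .00081, as on the slide
print(math.log(L_coll / L_null))   # log-likelihood ratio for sorting bigrams
```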

Binomial Significance (“Coin Flips”)

Now repeat with the divided-by-8 table from Significance Tests above:

  • Null hypothesis: same coin
      – assume p_null(co’s | new) = p_null(co’s | ¬new) = p_null(co’s) = 584/1788460
      – p_null(data) = p_null(1 out of 1979) · p_null(583 out of 1786481) = .0056
  • Collocation hypothesis: different coins
      – assume p_coll(co’s | new) = 1/1979, p_coll(co’s | ¬new) = 583/1786481
      – p_coll(data) = p_coll(1 out of 1979) · p_coll(583 out of 1786481) = .0061
  • The collocation hypothesis still increases p(data), but only slightly now.
  • If we don’t have much data, the 2-coin model can’t be much better at explaining it.
  • Pointwise mutual information is as strong as before, but based on much less data.

So it’s now reasonable to believe the null hypothesis that it’s a coincidence.

Binomial Significance (“Coin Flips”)

Back to the full-size table, where p_null(data) = .00042 under the one-coin hypothesis and p_coll(data) = .00081 under the two-coin hypothesis, as computed above:

  • Does this mean that the collocation hypothesis is twice as likely?
  • No, as it’s far less probable a priori! (Most bigrams ain’t collocations.)
  • Bayes: p(coll | data) = p(coll) · p(data | coll) / p(data) isn’t twice p(null | data).

Function vs. Content Words

Might want to eliminate function words, or reduce their influence on a search. Tests for a content word:

  • Does it appear rarely?
      – no: c(beneath) < c(Kennedy) ≈ c(aside) « c(oil) in WSJ
  • Does it appear in only a few documents?
      – better: Kennedy tokens are concentrated in a few docs. This is the traditional solution in IR.
  • Does its frequency vary a lot among documents?
      – best: content words come in bursts (when it rains, it pours?). The probability of Kennedy is increased if Kennedy appeared in the preceding text – it is a “self-trigger” – whereas beneath isn’t. (The last two tests are sketched below.)
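A minimal sketch of those two document-level tests, assuming a hypothetical toy corpus of tokenized documents; it scores a term by document frequency and by the burstiness (variance-to-mean ratio) of its per-document counts:

```python
from collections import Counter
import statistics

# Hypothetical toy corpus: one token list per document.
docs = [
    "kennedy said kennedy kennedy airport".split(),
    "oil prices rose beneath expectations".split(),
    "the budget lay beneath the surface".split(),
    "oil fell beneath forecasts aside kennedy".split(),
]

def doc_freq(term):
    """How many documents contain the term? (the traditional IR test)"""
    return sum(term in doc for doc in docs)

def burstiness(term):
    """Variance-to-mean ratio of per-document counts: content words
    come in bursts, so their counts vary more than function words'."""
    counts = [Counter(doc)[term] for doc in docs]
    mean = statistics.mean(counts)
    return statistics.pvariance(counts) / mean if mean else 0.0

print(doc_freq("kennedy"), burstiness("kennedy"))   # concentrated and bursty
print(doc_freq("beneath"), burstiness("beneath"))   # spread out and flat
```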

Latent Semantic Analysis

A trick from Information Retrieval. Each document in the corpus is a length-k vector (or each paragraph, or whatever):

  (0, 3, 3, 1, 0, 7, …, 1, 0)   – a single document

with one coordinate per vocabulary entry: aardvark, abacus, abandoned, abbot, abduct, above, …, zygote, zymurgy.

Plot all documents in the corpus.

[Figure: “true plot in k dimensions” beside a “reduced-dimensionality plot”.]

The reduced plot is a perspective drawing of the true plot: it projects the true plot onto a few axes. ∃ a best choice of axes – the one that shows the most variation in the data. It is found by linear algebra: “Singular Value Decomposition” (SVD), sketched below.
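A minimal sketch of that projection with NumPy, using a hypothetical toy term-document count matrix (4 terms, 4 documents) that the later examples below reuse:

```python
import numpy as np

# Toy strength matrix M: rows are terms, columns are documents.
M = np.array([[2, 0, 1, 0],
              [3, 1, 0, 0],
              [0, 2, 0, 3],
              [0, 1, 1, 2]], dtype=float)

# SVD factors M into orthonormal A, a diagonal D (returned as a vector),
# and orthonormal B', with D sorted largest-first.
A, D, Bt = np.linalg.svd(M, full_matrices=False)

j = 2                                   # keep the 2 best axes
doc_coords = np.diag(D[:j]) @ Bt[:j]    # each column: one document in 2-D
print(np.round(doc_coords, 2))
```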

The SVD plot allows the best possible reconstruction of the true plot (i.e., we can recover the 3-D coordinates with minimal distortion). It ignores variation along the axes that it didn’t pick; hope that variation’s just noise and we want to ignore it.

[Figure: true plot with axes word 1, word 2, word 3; reduced plot with axes theme A, theme B.]

SVD finds a small number of theme vectors and approximates each doc as a linear combination of themes. Coordinates in the reduced plot = the linear coefficients.

  • How much of theme A in this document? How much of theme B?
  • Each theme is a collection of words that tend to appear together.

The new coordinates might actually be useful for Info Retrieval. To compare 2 documents, or a query and a document: project both into the reduced space. Do they have themes in common? Even if they have no words in common! (See the sketch below.)
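Continuing the NumPy sketch above: one common variant is to fold a query into the same theme space through the term-to-theme map A, then compare by cosine similarity (the names q_theme and cosine are ours, not the slides’):

```python
# A query is a tiny document: a count vector over the 4 toy terms.
query = np.array([1, 0, 0, 1], dtype=float)

# Project the query through the term-to-theme map learned by the SVD.
q_theme = A[:, :j].T @ query

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare the query with every document in theme space.
for d in range(doc_coords.shape[1]):
    print(d, round(cosine(q_theme, doc_coords[:, d]), 2))
```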

Themes extracted for IR might help sense disambiguation.

  • Each word is like a tiny document: (0,0,0,1,0,0,…)
  • Express the word as a linear combination of themes. Each theme corresponds to a sense?
      – E.g., “Jordan” has Mideast and Sports themes (plus an Advertising theme, alas, which is the same sense as Sports).
  • The word’s sense in a document: which of its themes are strongest in the document?
  • This groups senses as well as splitting them: one word has several themes, and many words have the same theme.

Another perspective (similar to neural networks): draw terms and documents as a two-layer network, with a connection from each term to each document. Each connection has a weight given by the matrix of strengths (how strong is each term in each document?).

[Figure: bipartite network of terms 1–9 and documents 1–7.]

Which documents is term 5 strong in? Docs 2, 5, and 6 light up strongest.

Which documents are terms 5 and 8 strong in? This answers a query consisting of terms 5 and 8! It’s really just matrix multiplication: term vector (query) × strength matrix = doc vector.
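In the running toy NumPy example (which has only 4 terms rather than 9), that multiplication is one line:

```python
# Query "on" the 1st and 4th terms: a 0/1 term vector times the matrix.
query = np.array([1, 0, 0, 1], dtype=float)
doc_vector = query @ M        # one strength score per document
print(doc_vector)
```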

Conversely, what terms are strong in document 5? Activating document 5 gives doc 5’s coordinates!

SVD approximates this by a smaller 3-layer network, with a narrow layer of theme units between the terms and the documents. It forces the sparse data through a bottleneck, smoothing it.

[Figure: the full term–document network beside the 3-layer term–theme–document network.]

I.e., smooth the sparse data by a matrix approximation: M ≈ A B. A encodes the camera angle; B gives each doc’s new coordinates.

[Figure: matrix M (terms × documents) ≈ A (terms × themes) · B (themes × documents).]
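In the running NumPy sketch, these factors fall straight out of the truncated SVD (folding the diagonal into B, as the later slide’s M ≈ A(DB′) = AB does):

```python
A_j = A[:, :j]                    # terms x themes: the "camera angle"
B_j = np.diag(D[:j]) @ Bt[:j]     # themes x documents: new coordinates
print(np.round(A_j @ B_j, 2))     # best rank-j approximation to M
```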

Completely symmetric! Regard A and B as projecting terms and docs into a low-dimensional “theme space” where their similarity can be judged. We can then:

  • Cluster documents (helps the sparsity problem!)
  • Cluster words
  • Compare a word with a doc
  • Identify a word’s themes with its senses
      – sense disambiguation by looking at the document’s senses
  • Identify a document’s themes with its topics
      – topic categorization

If you’ve seen SVD before …

SVD actually decomposes M = A D B′ exactly, where A = camera angle (orthonormal), D is diagonal, and B′ is orthonormal.

[Figure: matrix M = A · D · B′.]
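The exact decomposition and those properties are easy to check in the running NumPy sketch (np.linalg.svd returns D as a vector and B′ already transposed):

```python
assert np.allclose(M, A @ np.diag(D) @ Bt)           # exact: M = A D B'
assert np.allclose(A.T @ A, np.eye(A.shape[1]))      # A orthonormal
assert np.allclose(Bt @ Bt.T, np.eye(Bt.shape[0]))   # B' orthonormal
```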

Keep only the largest j < k diagonal elements of D. This gives the best possible approximation to M using only j blue units.

[Figure: M ≈ A · D · B′ with all but the top j diagonal elements of D zeroed out.]

To simplify the picture, we can write M ≈ A (DB′) = A B, with B = DB′.

  • How should you pick j (the number of blue units)?
  • Just like picking the number of clusters: how well does the system work with each j (on held-out data)? A sketch of that loop follows.
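A minimal sketch of that selection loop, continuing the NumPy example and using held-out reconstruction error as a stand-in for whatever task metric the system actually optimizes:

```python
def heldout_error(M_train, M_heldout, j):
    """Fit a rank-j SVD on training documents, then measure how well
    held-out documents survive projection through the learned themes."""
    A, D, Bt = np.linalg.svd(M_train, full_matrices=False)
    A_j = A[:, :j]
    reconstructed = A_j @ (A_j.T @ M_heldout)   # into theme space and back
    return np.linalg.norm(M_heldout - reconstructed)

# Hypothetical split of the toy matrix: 3 training docs, 1 held out.
M_train, M_heldout = M[:, :3], M[:, 3:]
for j in (1, 2, 3):
    print(j, round(heldout_error(M_train, M_heldout, j), 3))
```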