Lecture 38 - tf/idf and information retrieval
Mark Hasegawa-Johnson 5/1/2020 CC-BY 4.0: you may remix or redistribute if you cite the source
Outline:
- similarity vs. semantic field: word2vec at different scales
- term frequency (tf): …
Similarity = words can be used interchangeably in most contexts. How do we measure that in practice? Answer: extract examples of word x1, +/- k words (2 ≤ k ≤ 5, for example):
…hot, although iced coffee is a popular…
…indicate that moderate coffee consumption is benign…
…and of x2:
…consumed as iced tea. Sweet tea is…
…national average of tea consumption in Ireland…
The words "iced" and "consumption" appear in both contexts, so we can conclude that s(coffee, tea) > 0. No other words are shared, so we can conclude s(coffee, tea) < 1.
Levy & Goldberg (2014) trained word2vec in three different ways: with a bag-of-words context window of k=5 (BoW5), with a smaller window of k=2 (BoW2), and by parsing the sentence to get syntactic dependency structure (Deps). They tested all three methods for the similarity vs. relatedness of the nearest neighbor of each word.
[Figures: Precision vs. Recall on the WordSim-353 database, in which word pairs may be either related or similar (Fig. 2(a), Levy & Goldberg 2014), and on the Chiarello et al. database (Fig. 2(b), Levy & Goldberg 2014).]
A smaller context window (k=2) produces vectors whose nearest neighbors are more similar (they can be used identically in a sentence). A larger context window (k=5) produces vectors whose nearest neighbors are related, not just similar. Related word pairs are said to inhabit the same semantic field: a set of words that refer to the same subject.
w = hogwarts ("…harry potter studied at hogwarts…")

Vector nearest neighbors, context k=2:
  evernight    …studied at evernight, a castle…
  sunnydale    …studied at sunnydale…
  garderobe    …a castle garderobe…
  blandings    …lives at blandings, a castle…
  collinwood   …lives at collinwood, a castle…

Vector nearest neighbors, context k=5:
  dumbledore   …harry potter learned from dumbledore…
  hallows      …harry potter and the deathly hallows…
  half-blood   …harry potter and the half-blood…
  malfoy       …harry potter said to malfoy…
  snape        …harry potter said to snape…

Examples of k=2 and k=5 nearest neighbors, from (Levy & Goldberg, 2014)
What if we want the embedding to capture semantic field, as in the k=5 examples (not similar usage, like the k=2 examples)? Larger contexts are better, so what if the context = the whole document?
Hogwarts School of Witchcraft and Wizardry, commonly shortened to Hogwarts, is a fictional British school of magic for students aged eleven to eighteen, and is the primary setting for the first six books in J. K. Rowling's Harry Potter series…

Albus Percival Wulfric Brian Dumbledore is a fictional character in J. K. Rowling's Harry Potter series. For most of the series, he is the headmaster of the wizarding school Hogwarts. As part of his backstory, it is revealed that he is the founder and leader of…

Collinwood Mansion is a fictional house featured in the Gothic horror soap opera Dark Shadows (1966-1971). Built in 1795 by Joshua Collins, Collinwood has been home to the Collins family, and other sometimes unwelcome supernatural visitors…

Term-document matrix (blank = 0):

term        Hogwarts  Dumbledore  Collinwood
a               1         1           1
of              1         2
in              1         1           2
is              2         4           1
fictional       1         1           1
school          1
rowling's       1         1
harry           1         1
potter          1         1
series          1         1
house                                 1
featured                              1
gothic                                1
From the term-document matrix, we can define each term vector to be just the corresponding row of the matrix:

w(i) = [tf(i,1), …, tf(i,D)]

…where we now define the term frequency (of term i in document j) to be the number of times the term occurs in the document:

tf(i,j) = Count(word i in document j)

For example:

w(a) = [1,1,1]
w(of) = [1,2,0]
w(potter) = [1,1,0]
The relatedness of two words can now be measured using their cosine similarity. For example:

s(rowling's, harry) = cos(w(rowling's), w(harry)) = w(rowling's)·w(harry) / (|w(rowling's)| |w(harry)|)
                    = (1×1 + 1×1 + 0×0) / (√2 × √2) = 1

s(harry, gothic) = cos(w(harry), w(gothic)) = w(harry)·w(gothic) / (|w(harry)| |w(gothic)|)
                 = (1×0 + 1×0 + 0×1) / (√2 × 1) = 0
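As a quick sketch of this computation (the dictionary below just transcribes the term-document matrix above; the helper name `cosine` is mine, not from the lecture):

```python
import math

# Term-document counts from the example: each term maps to its counts in
# the Hogwarts, Dumbledore, and Collinwood documents.
tf = {
    "a": [1, 1, 1], "of": [1, 2, 0], "in": [1, 1, 2], "is": [2, 4, 1],
    "fictional": [1, 1, 1], "school": [1, 0, 0], "rowling's": [1, 1, 0],
    "harry": [1, 1, 0], "potter": [1, 1, 0], "series": [1, 1, 0],
    "house": [0, 0, 1], "featured": [0, 0, 1], "gothic": [0, 0, 1],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(tf["rowling's"], tf["harry"]))  # ~1.0 (identical rows)
print(cosine(tf["harry"], tf["gothic"]))     # 0.0 (no shared documents)
```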
Now let's try something different. Let's define a vector for each document, rather than for each term:

d(j) = [tf(1,j), …, tf(V,j)]

Thus,

d(H) = [1,1,1,2,1,1,1,1,1,1,0,0,0]
d(D) = [1,2,1,4,1,0,1,1,1,1,0,0,0]
d(C) = [1,0,2,1,1,0,0,0,0,0,1,1,1]
Document vectors are useful because they allow us to retrieve a document, based on the degree to which it matches a query. For example, the query "What school did Harry Potter attend?" can be written as a query vector:

q = [0,0,0,0,0,1,0,1,1,0,0,0,0]

We can (sometimes) find the most relevant document using cosine similarity:

q·d(H) / (|q| |d(H)|) = 3 / (√3 √13) = 0.48
q·d(D) / (|q| |d(D)|) = 2 / (√3 √27) = 0.22
q·d(C) / (|q| |d(C)|) = 0 / (√3 √10) = 0.00
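The retrieval step can be sketched in the same way (the vectors transcribe the example above; variable names are mine):

```python
import math

# Document count vectors over the 13-term vocabulary
# (a, of, in, is, fictional, school, rowling's, harry, potter, series,
#  house, featured, gothic), and the query vector for
# "What school did Harry Potter attend?"
docs = {
    "Hogwarts":   [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "Dumbledore": [1, 2, 1, 4, 1, 0, 1, 1, 1, 1, 0, 0, 0],
    "Collinwood": [1, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1],
}
q = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Score every document against the query and keep the best match.
scores = {name: cosine(q, d) for name, d in docs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # Hogwarts 0.48
```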
Suppose that we find a new document:

Dark Shadows is an American Gothic soap opera that originally aired weekdays on the ABC television network, from June 27, 1966, to April 2, 1971. The show depicted the lives, loves, trials, and tribulations of…

Now we want to determine whether this document is about the Dark Shadows soap opera, or about the Harry Potter series. How?
Term counts by document class (blank = 0):

term        Harry Potter  Dark Shadows
a                2             1
of               3
in               2             2
is               6             1
fictional        2             1
school           1
rowling's        2
harry            2
potter           2
series           2
house                          1
featured                       1
gothic                         1
To start with, let's create a single merged document class vector for each class, by just adding together all of the document vectors in that class:

y(Harry Potter) = d(H) + d(D)
y(Dark Shadows) = d(C)
Now we turn the new document into a vector with the same dimensions. The Dark Shadows excerpt above contains "of," "is," and "gothic," so:

x = [0,1,0,1,0,0,0,0,0,0,0,0,1]
Now let's just compute the cosine similarity with each document class:

x·y(HP) / (|x| |y(HP)|) = (1×3 + 1×6 + 1×0) / (√3 √74) = 0.60
x·y(DS) / (|x| |y(DS)|) = (1×0 + 1×1 + 1×1) / (√3 √10) = 0.37

…oops…
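A sketch of the failure: merging raw-count class vectors and scoring the test document picks the wrong class (vectors transcribed from the tables above; variable names are mine):

```python
import math

# Merged class vectors (raw counts): Harry Potter = Hogwarts + Dumbledore,
# Dark Shadows = Collinwood. x is the new Dark Shadows test document,
# which contains "of", "is", and "gothic".
y_hp = [2, 3, 2, 6, 2, 1, 2, 2, 2, 2, 0, 0, 0]
y_ds = [1, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
x    = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Raw counts let the frequent function words dominate: the Harry Potter
# class wins even though x is a Dark Shadows document.
print(round(cosine(x, y_hp), 2), round(cosine(x, y_ds), 2))  # 0.6 0.37
```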
Here's the problem: the difference between tf(HP, gothic) = 0 and tf(DS, gothic) = 1 is much more important than the difference between tf(HP, is) = 6 and tf(DS, is) = 1. It's not the difference between term frequencies that matters, it's their ratio:

6 − 1 ≫ 1 − 0, but 6/1 ≪ 1/0
We can emphasize ratios, rather than differences, by measuring the log of tf, rather than the raw frequencies:

log 6 − log 1 ≪ log 1 − log 0

So let's redefine the term frequency to be

tf(i,d) = log10 Count(word i in document d)

The use of a base-10 logarithm is a sort of convention, dating back to the way this definition was first published in 1972. Really, though, the base of the logarithm doesn't matter much.
Log term frequencies, tf = log10(Count):

term        Harry Potter  Dark Shadows
a               0.3            0
of              0.5           −∞
in              0.3           0.3
is              0.8            0
fictional       0.3            0
school           0            −∞
rowling's       0.3           −∞
harry           0.3           −∞
potter          0.3           −∞
series          0.3           −∞
house           −∞             0
featured        −∞             0
gothic          −∞             0
All those −∞ terms are annoying and numerically awful. There are two standard ways to deal with them:

1. If you believe that the difference between 0 and 1 is unimportant, and the difference between 1 and 10 is about the same as the difference between 10 and 100:

   tf(i,d) = 1 + max(0, log10 Count(i,d))

2. If you're working with short documents (like queries, for example), where the difference between 0 and 1 is about as important as the difference between 1 and 3:

   tf(i,d) = log10(1 + Count(i,d))
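Both variants can be sketched directly (the function names are mine; setting the zero-count case of the first variant to 0 is my reading of how to avoid −∞ there):

```python
import math

def tf_unit(count):
    """Variant 1: 1 + log10(count) for count >= 1; 0 when the word is absent
    (my choice for count == 0, to avoid log10(0) = -inf)."""
    return 1.0 + math.log10(count) if count > 0 else 0.0

def tf_smooth(count):
    """Variant 2: log10(1 + count), used for the rest of the lecture."""
    return math.log10(1 + count)

# Reproduces the smoothed table entries: counts 1, 2, 3, 6
# map to 0.3, 0.5, 0.6, 0.8 after rounding.
print([round(tf_smooth(c), 1) for c in [1, 2, 3, 6]])  # [0.3, 0.5, 0.6, 0.8]
```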
Smoothed log term frequencies, tf = log10(1 + Count) (blank = 0):

term        Harry Potter  Dark Shadows
a               0.5           0.3
of              0.6
in              0.5           0.5
is              0.8           0.3
fictional       0.5           0.3
school          0.3
rowling's       0.5
harry           0.5
potter          0.5
series          0.5
house                         0.3
featured                      0.3
gothic                        0.3
Using this new notation, our query vector is:

q = [0,0.3,0,0.3,0,0,0,0,0,0,0,0,0.3]

q·y(HP) / (|q| |y(HP)|) = (0.18 + 0.24 + 0) / (√0.27 √2.84) = 0.48
q·y(DS) / (|q| |y(DS)|) = (0 + 0.09 + 0.09) / (√0.27 √0.79) = 0.39

So, now the "Dark Shadows" class is closer to correctly claiming this query. But we're not quite there yet…
Did you notice that most words occur in a query either once, or zero times? So every element of the query vector is either log10(1+0) = 0 or log10(1+1) = 0.3. So, for q but not for x, let's return it to binary, q = [0,1,0,…]. Then:

q·y(c) = Σ_{i=1}^{V} Count(i,q) log10(1 + Count(i,c)) = log10 Π_{i=1}^{V} (1 + Count(i,c))^Count(i,q)

Just for the heck of it, let's divide by (V + N(c))^N(q), where V is the vocabulary size, N(c) is the number of words in class c, and N(q) is the number of words in the query. That gives us:

q·y(c) − N(q) log10(V + N(c)) = log10 Π_{i=1}^{V} ((1 + Count(i,c)) / (V + N(c)))^Count(i,q)
                              = log10 Π_{i: word i is in the query} P(word i | class c)

…where P(word i | class c) = (1 + Count(i,c)) / (V + N(c)) is the Laplace-smoothed (add-one) estimate of the probability of word i given class c.
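The identity can be checked numerically. This sketch uses a toy class (the counts and names are mine, not the lecture's), assuming tf(i,c) = log10(1 + Count(i,c)) and a binary query:

```python
import math

# Toy word counts for one class, and a binary query over two of the words.
counts = {"school": 1, "harry": 2, "potter": 2, "gothic": 0}
query = ["school", "harry"]

V = len(counts)               # vocabulary size
N_c = sum(counts.values())    # number of words in the class
N_q = len(query)              # number of words in the query

# Dot product of a binary query with the log-tf class vector...
dot = sum(math.log10(1 + counts[w]) for w in query)
# ...equals the log of a product over the query words.
assert abs(dot - math.log10(math.prod(1 + counts[w] for w in query))) < 1e-12

# Dividing each factor by (V + N(c)), i.e. subtracting N(q)*log10(V + N(c)),
# yields the log of the Laplace-smoothed naive Bayes likelihood.
log_likelihood = dot - N_q * math.log10(V + N_c)
probs = math.prod((1 + counts[w]) / (V + N_c) for w in query)
assert abs(log_likelihood - math.log10(probs)) < 1e-12
print(round(log_likelihood, 3))
```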
We saw that putting tf on a log scale is not quite enough for us to correctly classify the test document as being part of class βDark Shadows,β so letβs look for more problems to fix. Hereβs a problem: why do the words βa,β βof,β βin,β βisβ count more than βpotterβ and βgothicβ? Those function words are used by all classes, so we shouldnβt really pay so much attention to them.
Inverse document frequency (idf) is a discount weight, meant to reduce the importance of any word that's used equally across all classes. A typical definition is:

idf(i) = log10(D / df(i))

…where D is the number of document classes (2, in our example), and df(i) is the number of documents in which the ith word appears.
Smoothed log term frequencies, with idf in parentheses after each term (blank = 0):

term (idf)        Harry Potter  Dark Shadows
a (0)                 0.5           0.3
of (0.3)              0.6
in (0)                0.5           0.5
is (0)                0.8           0.3
fictional (0)         0.5           0.3
school (0.3)          0.3
rowling's (0.3)       0.5
harry (0.3)           0.5
potter (0.3)          0.5
series (0.3)          0.5
house (0.3)                         0.3
featured (0.3)                      0.3
gothic (0.3)                        0.3
With that definition, we get

tf(i,c) idf(i) = log10(1 + Count(i,c)) × log10(D / df(i))

…and the document class vectors are now

y(c) = [tf(1,c) idf(1), …, tf(V,c) idf(V)]
tf-idf class vectors (blank = 0):

term (idf)        Harry Potter  Dark Shadows
a (0)
of (0.3)             0.18
in (0)
is (0)
fictional (0)
school (0.3)         0.09
rowling's (0.3)      0.15
harry (0.3)          0.15
potter (0.3)         0.15
series (0.3)         0.15
house (0.3)                        0.09
featured (0.3)                     0.09
gothic (0.3)                       0.09
Remember, the original word counts in our test document were

x = [0,1,0,1,0,0,0,0,0,0,0,0,1]

If we convert those into tf-idf, we get

x = [0,0.09,0,0,0,0,0,0,0,0,0,0,0.09]

Then:

x·y(HP) / (|x| |y(HP)|) = (0.0162 + 0 + 0) / (√0.0162 √0.1305) = 0.35
x·y(DS) / (|x| |y(DS)|) = (0 + 0 + 0.0081) / (√0.0162 √0.0243) = 0.41

It worked! We got the right answer!
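The whole tf-idf pipeline for this example can be sketched end-to-end (vectors transcribed from the tables above; variable names are mine; carrying full precision gives 0.36 for the Harry Potter score, vs. the 0.35 obtained from the rounded table values):

```python
import math

# Class term counts (Harry Potter = Hogwarts + Dumbledore, Dark Shadows =
# Collinwood) and the test-document counts, over the 13-term vocabulary.
y_hp = [2, 3, 2, 6, 2, 1, 2, 2, 2, 2, 0, 0, 0]
y_ds = [1, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
x    = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

D = 2  # number of document classes
df = [(h > 0) + (d > 0) for h, d in zip(y_hp, y_ds)]  # classes containing each term
idf = [math.log10(D / f) for f in df]

def tfidf(counts):
    """tf-idf vector: smoothed log counts discounted by idf."""
    return [math.log10(1 + c) * w for c, w in zip(counts, idf)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

s_hp = cosine(tfidf(x), tfidf(y_hp))
s_ds = cosine(tfidf(x), tfidf(y_ds))
# With idf discounting, the Dark Shadows class now scores higher.
print(round(s_hp, 2), round(s_ds, 2))
```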
Summary: it's not the difference between counts that matters, it's the ratio. So instead of raw counts, use log counts:

tf(i,d) = log10(1 + Count(i,d))

Words that occur in all documents are unimportant. Discount them by the factor

idf(i) = log10(D / df(i))
Now that we understand information retrieval, let's go back to our original question: how can we determine whether or not two words are related?
Instead of creating a term-document matrix, let's create a matrix that shows how often each pair of words co-occurs in the same documents. This will be

S(i,j) = Σ_{d=1}^{D} Count(i,d) Count(j,d)

For example, for the words i = a and j = of,

S(a, of) = 1×1 + 1×2 + 1×0 = 3
term 1 \ term 2    a   of   in   school  harry  potter  house  gothic
a                  3    3    4     1       2      2       1      1
of                 3    5    3     1       3      3
in                 4    3    6     1       2      2       2      2
school             1    1    1     1       1      1
harry              2    3    2     1       2      2
potter             2    3    2     1       2      2
house              1         2                            1      1
gothic             1         2                            1      1
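The co-occurrence score is just a dot product of rows of the term-document count matrix; a sketch for the subset of terms shown (names are mine):

```python
# Term-document counts (Hogwarts, Dumbledore, Collinwood) for the subset
# of terms in the co-occurrence table.
tf = {
    "a": [1, 1, 1], "of": [1, 2, 0], "in": [1, 1, 2], "school": [1, 0, 0],
    "harry": [1, 1, 0], "potter": [1, 1, 0], "house": [0, 0, 1],
    "gothic": [0, 0, 1],
}

def S(i, j):
    """Co-occurrence score: sum over documents of Count(i,d) * Count(j,d)."""
    return sum(a * b for a, b in zip(tf[i], tf[j]))

print(S("a", "of"))      # 3  (= 1*1 + 1*2 + 1*0)
print(S("in", "in"))     # 6
print(S("house", "of"))  # 0  (they never share a document)
```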
Here's a subset of the word co-occurrence matrix. Notice that this seems, again, to give too much credit to the function words. Let's reduce their importance using tf-idf.
term 1 \ term 2    a      of     in    school   harry   potter   house   gothic
a
of                       0.032         0.018    0.024   0.024
in
school                   0.018         0.027    0.018   0.018
harry                    0.024         0.018    0.020   0.020
potter                   0.024         0.018    0.020   0.020
house                                                            0.027   0.027
gothic                                                           0.027   0.027

…where now

S(i,j) = log10(1 + Σ_{d=1}^{D} Count(i,d) Count(j,d)) × log10(D / df(i)) × log10(D / df(j))
In this example, we have D=3 documents, so the possible values of idf(i) are:

log10(3/3) = 0
log10(3/2) ≈ 0.2
log10(3/1) ≈ 0.3
Summary of equations:

s(rowling's, harry) = cos(w(rowling's), w(harry)) = w(rowling's)·w(harry) / (|w(rowling's)| |w(harry)|)

tf(i,d) = log10(1 + Count(word i in document d))

idf(i) = log10(D / df(i))

S(i,j) = Σ_{d=1}^{D} Count(i,d) Count(j,d)