Lecture 38 - tf/idf and information retrieval
Mark Hasegawa-Johnson 5/1/2020 CC-BY 4.0: you may remix or redistribute if you cite the source
Outline:
- similarity vs. semantic field: word2vec at different scales
- term frequency (tf): …
Similarity = words can be used interchangeably in most contexts. How do we measure that in practice? Answer: extract examples of word x1, +/- k words (2 ≤ k ≤ 5, for example):
…hot, although iced coffee is a popular…
…indicate that moderate coffee consumption is benign…
…and of x2:
…consumed as iced tea. Sweet tea is…
…national average of tea consumption in Ireland…
The words "iced" and "consumption" appear in both contexts, so we can conclude that s(coffee, tea) > 0. No other words are shared, so we can conclude s(coffee, tea) < 1.
Levy & Goldberg (2014) trained word2vec in three different ways: with a bag-of-words context window of k=5 (BoW5), with a smaller window of k=2 (BoW2), and by parsing the sentence to get syntactic dependency structure (Deps). They tested all three methods for the similarity vs. relatedness of the nearest neighbor of each word.
[Figures: Precision vs. Recall on the WordSim-353 database, in which word pairs may be either related or similar (Fig. 2(a), Levy & Goldberg 2014), and on the Chiarello et al. database (Fig. 2(b), Levy & Goldberg 2014).]
A smaller context window (k=2) produces vectors whose nearest neighbors are more similar (they can be used identically in a sentence). A larger context window (k=5) produces vectors whose nearest neighbors are related, not just similar. Related word pairs are said to inhabit the same semantic field: a set of words that refer to the same subject.
w = hogwarts ("…harry potter studied at hogwarts…")

Vector nearest neighbors, context k=2:
  evernight    …studied at evernight, a castle…
  sunnydale    …studied at sunnydale…
  garderobe    …a castle garderobe…
  blandings    …lives at blandings, a castle…
  collinwood   …lives at collinwood, a castle…

Vector nearest neighbors, context k=5:
  dumbledore   …harry potter learned from dumbledore…
  hallows      …harry potter and the deathly hallows…
  half-blood   …harry potter and the half-blood…
  malfoy       …harry potter said to malfoy…
  snape        …harry potter said to snape…

Examples of k=2 and k=5 nearest neighbors, from (Levy & Goldberg, 2014)
What if we want the embedding to capture semantic field, as in the k=5 examples (not similar usage, like the k=2 examples)? Larger contexts are better, so what if the context = the whole document?
Hogwarts School of Witchcraft and Wizardry, commonly shortened to Hogwarts, is a fictional British school of magic for students aged eleven to eighteen, and is the primary setting for the first six books in J. K. Rowling's Harry Potter series…

Albus Percival Wulfric Brian Dumbledore is a fictional character in J. K. Rowling's Harry Potter series. For most of the series, he is the headmaster of the wizarding school Hogwarts. As part of his backstory, it is revealed that he is the founder and leader of…

Collinwood Mansion is a fictional house featured in the Gothic horror soap opera Dark Shadows (1966-1971). Built in 1795 by Joshua Collins, Collinwood has been home to the Collins family, and other sometimes unwelcome supernatural visitors…

Term-document matrix (blank = 0):

term        Hogwarts  Dumbledore  Collinwood
a               1         1           1
of              1         2
in              1         1           2
is              2         4           1
fictional       1         1           1
school          1
rowling's       1         1
harry           1         1
potter          1         1
series          1         1
house                                 1
featured                              1
gothic                                1
From the term-document matrix, we can define each term vector to be just the corresponding row of the matrix:

w(i) = [tf(i,1), …, tf(i,D)]

…where we now define the term frequency (of term i in document j) to be the number of times the term occurs in the document:

tf(i,j) = Count(word i in document j)

For example:

w(a) = [1,1,1]
w(of) = [1,2,0]
w(potter) = [1,1,0]
The relatedness of two words can now be measured using their cosine similarity. For example:

s(rowling's, harry) = cos(w(rowling's), w(harry)) = w(rowling's)·w(harry) / (|w(rowling's)| |w(harry)|)
                    = (1×1 + 1×1 + 0×0) / (√2 × √2) = 1

s(harry, gothic) = cos(w(harry), w(gothic)) = w(harry)·w(gothic) / (|w(harry)| |w(gothic)|)
                 = (1×0 + 1×0 + 0×1) / (√2 × 1) = 0
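As a quick sketch of this computation (the dictionary below just transcribes the term-document matrix above; the helper name `cosine` is mine, not from the lecture):

```python
import math

# Term-document counts from the example: each term maps to its counts in
# the Hogwarts, Dumbledore, and Collinwood documents.
tf = {
    "a": [1, 1, 1], "of": [1, 2, 0], "in": [1, 1, 2], "is": [2, 4, 1],
    "fictional": [1, 1, 1], "school": [1, 0, 0], "rowling's": [1, 1, 0],
    "harry": [1, 1, 0], "potter": [1, 1, 0], "series": [1, 1, 0],
    "house": [0, 0, 1], "featured": [0, 0, 1], "gothic": [0, 0, 1],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(tf["rowling's"], tf["harry"]))  # ~1.0 (identical rows)
print(cosine(tf["harry"], tf["gothic"]))     # 0.0 (no shared documents)
```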
Now let's try something different. Let's define a vector for each document, rather than for each term:

d(j) = [tf(1,j), …, tf(V,j)]

Thus,

d(H) = [1,1,1,2,1,1,1,1,1,1,0,0,0]
d(D) = [1,2,1,4,1,0,1,1,1,1,0,0,0]
d(C) = [1,0,2,1,1,0,0,0,0,0,1,1,1]
Document vectors are useful because they allow us to retrieve a document, based on the degree to which it matches a query. For example, the query "What school did Harry Potter attend?" can be written as a query vector:

q = [0,0,0,0,0,1,0,1,1,0,0,0,0]

We can (sometimes) find the most relevant document using cosine similarity:

q·d(H) / (|q| |d(H)|) = 3 / (√3 √13) = 0.48
q·d(D) / (|q| |d(D)|) = 2 / (√3 √27) = 0.22
q·d(C) / (|q| |d(C)|) = 0 / (√3 √10) = 0.00
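The retrieval step can be sketched in the same way (the vectors transcribe the example above; variable names are mine):

```python
import math

# Document count vectors over the 13-term vocabulary
# (a, of, in, is, fictional, school, rowling's, harry, potter, series,
#  house, featured, gothic), and the query vector for
# "What school did Harry Potter attend?"
docs = {
    "Hogwarts":   [1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "Dumbledore": [1, 2, 1, 4, 1, 0, 1, 1, 1, 1, 0, 0, 0],
    "Collinwood": [1, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1],
}
q = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Score every document against the query and keep the best match.
scores = {name: cosine(q, d) for name, d in docs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))  # Hogwarts 0.48
```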
Suppose that we find a new document:

Dark Shadows is an American Gothic soap opera that originally aired weekdays on the ABC television network, from June 27, 1966, to April 2, 1971. The show depicted the lives, loves, trials, and tribulations of…

Now we want to determine whether this document is about the Dark Shadows soap opera, or about the Harry Potter series. How?
Term counts by document class (blank = 0):

term        Harry Potter  Dark Shadows
a                2             1
of               3
in               2             2
is               6             1
fictional        2             1
school           1
rowling's        2
harry            2
potter           2
series           2
house                          1
featured                       1
gothic                         1
To start with, let's create a single merged document class vector for each class, by just adding together all of the document vectors in that class:

y(Harry Potter) = d(H) + d(D)
y(Dark Shadows) = d(C)
Now we turn the new document into a vector with the same dimensions. The Dark Shadows excerpt above contains "of," "is," and "gothic," so:

x = [0,1,0,1,0,0,0,0,0,0,0,0,1]
Now let's just compute the cosine similarity with each document class:

x·y(HP) / (|x| |y(HP)|) = (1×3 + 1×6 + 1×0) / (√3 √74) = 0.60
x·y(DS) / (|x| |y(DS)|) = (1×0 + 1×1 + 1×1) / (√3 √10) = 0.37

…oops…
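A sketch of the failure: merging raw-count class vectors and scoring the test document picks the wrong class (vectors transcribed from the tables above; variable names are mine):

```python
import math

# Merged class vectors (raw counts): Harry Potter = Hogwarts + Dumbledore,
# Dark Shadows = Collinwood. x is the new Dark Shadows test document,
# which contains "of", "is", and "gothic".
y_hp = [2, 3, 2, 6, 2, 1, 2, 2, 2, 2, 0, 0, 0]
y_ds = [1, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
x    = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Raw counts let the frequent function words dominate: the Harry Potter
# class wins even though x is a Dark Shadows document.
print(round(cosine(x, y_hp), 2), round(cosine(x, y_ds), 2))  # 0.6 0.37
```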
Here's the problem: the difference between tf(HP, gothic) = 0 and tf(DS, gothic) = 1 is much more important than the difference between tf(HP, is) = 6 and tf(DS, is) = 1. It's not the difference between term frequencies that matters, it's their ratio:

6 − 1 ≫ 1 − 0, but 6/1 ≪ 1/0
We can emphasize ratios, rather than differences, by measuring the log of tf, rather than the raw frequencies:

log 6 − log 1 ≪ log 1 − log 0

So let's redefine the term frequency to be

tf(i,d) = log10 Count(word i in document d)

The use of a base-10 logarithm is a sort of convention, dating back to the way this definition was first published in 1972. Really, though, the base of the logarithm doesn't matter much.
Log term frequencies, tf = log10(Count):

term        Harry Potter  Dark Shadows
a               0.3            0
of              0.5           −∞
in              0.3           0.3
is              0.8            0
fictional       0.3            0
school           0            −∞
rowling's       0.3           −∞
harry           0.3           −∞
potter          0.3           −∞
series          0.3           −∞
house           −∞             0
featured        −∞             0
gothic          −∞             0
All those −∞ terms are annoying and numerically awful. There are two standard ways to deal with them:

1. If you believe that the difference between 0 and 1 is unimportant, and the difference between 1 and 10 is about the same as the difference between 10 and 100:

   tf(i,d) = 1 + max(0, log10 Count(i,d))

2. If you're working with short documents (like queries, for example), where the difference between 0 and 1 is about as important as the difference between 1 and 3:

   tf(i,d) = log10(1 + Count(i,d))
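Both variants can be sketched directly (the function names are mine; setting the zero-count case of the first variant to 0 is my reading of how to avoid −∞ there):

```python
import math

def tf_unit(count):
    """Variant 1: 1 + log10(count) for count >= 1; 0 when the word is absent
    (my choice for count == 0, to avoid log10(0) = -inf)."""
    return 1.0 + math.log10(count) if count > 0 else 0.0

def tf_smooth(count):
    """Variant 2: log10(1 + count), used for the rest of the lecture."""
    return math.log10(1 + count)

# Reproduces the smoothed table entries: counts 1, 2, 3, 6
# map to 0.3, 0.5, 0.6, 0.8 after rounding.
print([round(tf_smooth(c), 1) for c in [1, 2, 3, 6]])  # [0.3, 0.5, 0.6, 0.8]
```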
Smoothed log term frequencies, tf = log10(1 + Count) (blank = 0):

term        Harry Potter  Dark Shadows
a               0.5           0.3
of              0.6
in              0.5           0.5
is              0.8           0.3
fictional       0.5           0.3
school          0.3
rowling's       0.5
harry           0.5
potter          0.5
series          0.5
house                         0.3
featured                      0.3
gothic                        0.3
Using this new notation, our query vector is:

q = [0,0.3,0,0.3,0,0,0,0,0,0,0,0,0.3]

q·y(HP) / (|q| |y(HP)|) = (0.18 + 0.24 + 0) / (√0.27 √2.84) = 0.48
q·y(DS) / (|q| |y(DS)|) = (0 + 0.09 + 0.09) / (√0.27 √0.79) = 0.39

So, now the "Dark Shadows" class is closer to correctly claiming this query. But we're not quite there yet…
Did you notice that most words occur in a query either once, or zero times? So every element of the query vector is either log10(1+0) = 0 or log10(1+1) = 0.3. So, for q but not for x, let's return it to binary, q = [0,1,0,…]. Then:

q·y(c) = Σ_{i=1}^{V} Count(i,q) log10(1 + Count(i,c)) = log10 Π_{i=1}^{V} (1 + Count(i,c))^Count(i,q)

Just for the heck of it, let's divide by (V + N(c))^N(q), where V is the vocabulary size, N(c) is the number of words in class c, and N(q) is the number of words in the query. That gives us:

q·y(c) − N(q) log10(V + N(c)) = log10 Π_{i=1}^{V} ((1 + Count(i,c)) / (V + N(c)))^Count(i,q)
                              = log10 Π_{i: word i is in the query} P(word i | class c)

…where P(word i | class c) = (1 + Count(i,c)) / (V + N(c)) is the Laplace-smoothed (add-one) estimate of the probability of word i given class c.
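The identity can be checked numerically. This sketch uses a toy class (the counts and names are mine, not the lecture's), assuming tf(i,c) = log10(1 + Count(i,c)) and a binary query:

```python
import math

# Toy word counts for one class, and a binary query over two of the words.
counts = {"school": 1, "harry": 2, "potter": 2, "gothic": 0}
query = ["school", "harry"]

V = len(counts)               # vocabulary size
N_c = sum(counts.values())    # number of words in the class
N_q = len(query)              # number of words in the query

# Dot product of a binary query with the log-tf class vector...
dot = sum(math.log10(1 + counts[w]) for w in query)
# ...equals the log of a product over the query words.
assert abs(dot - math.log10(math.prod(1 + counts[w] for w in query))) < 1e-12

# Dividing each factor by (V + N(c)), i.e. subtracting N(q)*log10(V + N(c)),
# yields the log of the Laplace-smoothed naive Bayes likelihood.
log_likelihood = dot - N_q * math.log10(V + N_c)
probs = math.prod((1 + counts[w]) / (V + N_c) for w in query)
assert abs(log_likelihood - math.log10(probs)) < 1e-12
print(round(log_likelihood, 3))
```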
We saw that putting tf on a log scale is not quite enough for us to correctly classify the test document as being part of class βDark Shadows,β so letβs look for more problems to fix. Hereβs a problem: why do the words βa,β βof,β βin,β βisβ count more than βpotterβ and βgothicβ? Those function words are used by all classes, so we shouldnβt really pay so much attention to them.
Inverse document frequency (idf) is a discount weight, meant to reduce the importance of any word that's used equally across all classes. A typical definition is:

idf(i) = log10(D / df(i))

…where D is the number of document classes (2, in our example), and df(i) is the number of documents in which the ith word appears.
Smoothed log term frequencies, with idf in parentheses after each term (blank = 0):

term (idf)        Harry Potter  Dark Shadows
a (0)                 0.5           0.3
of (0.3)              0.6
in (0)                0.5           0.5
is (0)                0.8           0.3
fictional (0)         0.5           0.3
school (0.3)          0.3
rowling's (0.3)       0.5
harry (0.3)           0.5
potter (0.3)          0.5
series (0.3)          0.5
house (0.3)                         0.3
featured (0.3)                      0.3
gothic (0.3)                        0.3
With that definition, we get

tf(i,c) idf(i) = log10(1 + Count(i,c)) × log10(D / df(i))

…and the document class vectors are now

y(c) = [tf(1,c) idf(1), …, tf(V,c) idf(V)]
tf-idf class vectors (blank = 0):

term (idf)        Harry Potter  Dark Shadows
a (0)
of (0.3)             0.18
in (0)
is (0)
fictional (0)
school (0.3)         0.09
rowling's (0.3)      0.15
harry (0.3)          0.15
potter (0.3)         0.15
series (0.3)         0.15
house (0.3)                        0.09
featured (0.3)                     0.09
gothic (0.3)                       0.09
Remember, the original word counts in our test document were

x = [0,1,0,1,0,0,0,0,0,0,0,0,1]

If we convert those into tf-idf, we get

x = [0,0.09,0,0,0,0,0,0,0,0,0,0,0.09]

Then:

x·y(HP) / (|x| |y(HP)|) = (0.0162 + 0 + 0) / (√0.0162 √0.1305) = 0.35
x·y(DS) / (|x| |y(DS)|) = (0 + 0 + 0.0081) / (√0.0162 √0.0243) = 0.41

It worked! We got the right answer!
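The whole tf-idf pipeline for this example can be sketched end-to-end (vectors transcribed from the tables above; variable names are mine; carrying full precision gives 0.36 for the Harry Potter score, vs. the 0.35 obtained from the rounded table values):

```python
import math

# Class term counts (Harry Potter = Hogwarts + Dumbledore, Dark Shadows =
# Collinwood) and the test-document counts, over the 13-term vocabulary.
y_hp = [2, 3, 2, 6, 2, 1, 2, 2, 2, 2, 0, 0, 0]
y_ds = [1, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
x    = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

D = 2  # number of document classes
df = [(h > 0) + (d > 0) for h, d in zip(y_hp, y_ds)]  # classes containing each term
idf = [math.log10(D / f) for f in df]

def tfidf(counts):
    """tf-idf vector: smoothed log counts discounted by idf."""
    return [math.log10(1 + c) * w for c, w in zip(counts, idf)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

s_hp = cosine(tfidf(x), tfidf(y_hp))
s_ds = cosine(tfidf(x), tfidf(y_ds))
# With idf discounting, the Dark Shadows class now scores higher.
print(round(s_hp, 2), round(s_ds, 2))
```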
Summary: it's not the difference between counts that matters, it's the ratio. So instead of raw counts, use log counts:

tf(i,d) = log10(1 + Count(i,d))

Words that occur in all documents are unimportant. Discount them by the factor

idf(i) = log10(D / df(i))
Now that we understand information retrieval, let's go back to our original question: how can we determine whether or not two words are related?
Instead of creating a term-document matrix, let's create a matrix that shows how often each pair of words co-occurs in the same documents. This will be

S(i,j) = Σ_{d=1}^{D} Count(i,d) Count(j,d)

For example, for the words i = a and j = of,

S(a, of) = 1×1 + 1×2 + 1×0 = 3
term 1 \ term 2    a   of   in   school  harry  potter  house  gothic
a                  3    3    4     1       2      2       1      1
of                 3    5    3     1       3      3
in                 4    3    6     1       2      2       2      2
school             1    1    1     1       1      1
harry              2    3    2     1       2      2
potter             2    3    2     1       2      2
house              1         2                            1      1
gothic             1         2                            1      1
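The co-occurrence score is just a dot product of rows of the term-document count matrix; a sketch for the subset of terms shown (names are mine):

```python
# Term-document counts (Hogwarts, Dumbledore, Collinwood) for the subset
# of terms in the co-occurrence table.
tf = {
    "a": [1, 1, 1], "of": [1, 2, 0], "in": [1, 1, 2], "school": [1, 0, 0],
    "harry": [1, 1, 0], "potter": [1, 1, 0], "house": [0, 0, 1],
    "gothic": [0, 0, 1],
}

def S(i, j):
    """Co-occurrence score: sum over documents of Count(i,d) * Count(j,d)."""
    return sum(a * b for a, b in zip(tf[i], tf[j]))

print(S("a", "of"))      # 3  (= 1*1 + 1*2 + 1*0)
print(S("in", "in"))     # 6
print(S("house", "of"))  # 0  (they never share a document)
```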
Here's a subset of the word co-occurrence matrix. Notice that this seems, again, to give too much credit to the function words. Let's reduce their importance using tf-idf.
term 1 \ term 2    a      of     in    school   harry   potter   house   gothic
a
of                       0.032         0.018    0.024   0.024
in
school                   0.018         0.027    0.018   0.018
harry                    0.024         0.018    0.020   0.020
potter                   0.024         0.018    0.020   0.020
house                                                            0.027   0.027
gothic                                                           0.027   0.027

…where now

S(i,j) = log10(1 + Σ_{d=1}^{D} Count(i,d) Count(j,d)) × log10(D / df(i)) × log10(D / df(j))
In this example, we have D=3 documents, so the possible values of idf(i) are:

log10(3/3) = 0
log10(3/2) ≈ 0.2
log10(3/1) ≈ 0.3
Summary of equations:

s(rowling's, harry) = cos(w(rowling's), w(harry)) = w(rowling's)·w(harry) / (|w(rowling's)| |w(harry)|)

tf(i,d) = log10(1 + Count(word i in document d))

idf(i) = log10(D / df(i))

S(i,j) = Σ_{d=1}^{D} Count(i,d) Count(j,d)