Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining
CSE 6240: Web Search and Text Mining. Spring 2020
- Prof. Srijan Kumar
Introduction to Information Retrieval: IR Basics and Evaluation
Some slides from today’s lecture are inspired by Prof. Hongyuan Zha’s past offerings of this course
Tokenization and linguistic processing determine the terms considered for retrieval
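A minimal sketch of how tokenization choices decide which terms exist for retrieval. The regex-based `tokenize` helper, the lowercasing step, and the tiny `STOP_WORDS` set are illustrative assumptions, not the pipeline used in the course.

```python
import re

# Hypothetical stop-word list, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

def tokenize(text):
    """Case-fold the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(text, remove_stop_words=True):
    """Return the terms that would be indexed for this text."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(tokenize("Friends, Romans, Countrymen."))
# ['friends', 'romans', 'countrymen']
print(index_terms("The risk of heart attacks"))
# ['risk', 'heart', 'attacks']
```

With this tokenizer a query for "Romans" matches "romans", but "attack" does not match "attacks"; adding stemming or lemmatization would change that, which is the sense in which linguistic processing determines the retrievable terms.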
Bit vectors: 110100, 110111, and NOT 010000 = 101111
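The bit-vector arithmetic can be sketched in a few lines. This assumes each 6-bit string is a term's incidence vector over six documents and that the three vectors are combined with a bitwise AND, as in a query of the form "term1 AND term2 AND NOT term3"; the variable names are illustrative.

```python
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # keeps only the lowest 6 bits

def to_bits(x):
    """Format an integer as a zero-padded 6-bit string."""
    return format(x, f"0{N_DOCS}b")

term1 = int("110100", 2)
term2 = int("110111", 2)
term3 = int("010000", 2)

# Python's ~ yields a negative number, so mask back to 6 bits.
not_term3 = ~term3 & MASK
result = term1 & term2 & not_term3

print(to_bits(not_term3))  # 101111
print(to_bits(result))     # 100100
```

Under these assumptions the result 100100 selects the first and fourth documents.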
– Inverse document frequency (idf): a measure of how informative a term is, based on its rarity across the whole corpus
– High idf = the term is rare; low idf = the term is common
– Formulation 1: the raw count of documents the term occurs in (the document frequency)
– Formulation 2: logarithmically scaled inverse fraction of the documents that contain the term
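A minimal sketch of the two formulations, assuming idf(t) = log(N / df(t)) with the natural logarithm and no smoothing. The log base, the absence of smoothing, and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of terms.
docs = [
    ["red", "wine", "heart"],
    ["white", "wine"],
    ["heart", "attack", "risk"],
    ["wine", "tasting"],
]

def document_frequency(docs):
    """Formulation 1: number of documents each term occurs in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    return df

def inverse_document_frequency(docs):
    """Formulation 2: log of (total documents / document frequency)."""
    n = len(docs)
    return {t: math.log(n / c) for t, c in document_frequency(docs).items()}

df = document_frequency(docs)
idf = inverse_document_frequency(docs)
print(df["wine"], df["attack"])        # 3 1
print(idf["wine"] < idf["attack"])     # True: the rarer term has higher idf
```

Real systems often add smoothing to avoid division by zero for unseen terms; that variation is left out here.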
– E.g., information need: I’m looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
– Query: wine red white heart attack effective
– You evaluate whether the doc addresses the information need, not whether it has those words
– perfect, excellent, good, fair, bad
                Relevant               Not Relevant
Retrieved       True Positive (TP)     False Positive (FP)
Not Retrieved   False Negative (FN)    True Negative (TN)
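The table above is usually summarized with precision, recall, and F1. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN); the counts are hypothetical and the function names are illustrative.

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 40 documents retrieved, 30 of them relevant,
# and 20 relevant documents missed.
tp, fp, fn = 30, 10, 20
p = precision(tp, fp)
r = recall(tp, fn)
print(p, r)  # 0.75 0.6
```

True negatives do not appear in either measure, which matters in retrieval because almost every document in a large corpus is a true negative for any given query.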
matches on the first one or two results pages
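When only the first one or two results pages matter, one common way to quantify quality is precision at k: the fraction of the top k results that are relevant. The sketch below assumes binary relevance judgments in ranked order; the judgments list is made up for illustration.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked results judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_relevance[:k]) / k

# Hypothetical judgments for a ranked list of 10 results (True = relevant).
judgments = [True, False, True, True, False,
             False, False, True, False, False]

print(precision_at_k(judgments, 5))   # 0.6
print(precision_at_k(judgments, 10))  # 0.4
```

The measure ignores everything below rank k, so two systems with the same top-k results score the same regardless of how many relevant documents they miss further down.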