

SLIDE 1

INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 26/26: Feature Selection and Exam Overview

Paul Ginsparg

Cornell University, Ithaca, NY

3 Dec 2009

1 / 32

SLIDE 2

Administrativa

Assignment 4 due Fri 4 Dec (extended to Sun 6 Dec).

2 / 32

SLIDE 3

Combiner in Simulator

“Can be added, but makes less sense to have a combiner in a simulator. Combiners help to speed things by providing local (in-memory) partial reduces. In a simulator we are not really concerned about efficiency.”

Hadoop Wiki: “When the map operation outputs its pairs they are already available in memory. For efficiency reasons, sometimes it makes sense to take advantage of this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used then the map key-value pairs are not immediately written to the output. Instead they will be collected in lists, one list per each key value. When a certain number of key-value pairs have been written, this buffer is flushed by passing all the values of each key to the combiner’s reduce method and outputting the key-value pairs of the combine operation as if they were created by the original map operation.”
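For context, a minimal word-count sketch of what a combiner buys (plain Python, not Hadoop and not the assignment's simulator code): each mapper's output is partially reduced locally, so fewer key-value pairs reach the shuffle and the final reduce.

    from collections import defaultdict

    def map_phase(doc):
        # emit (word, 1) for every token in the document
        return [(word, 1) for word in doc.split()]

    def combine(pairs):
        # local (in-memory) partial reduce: sum the counts per key before the shuffle
        partial = defaultdict(int)
        for key, value in pairs:
            partial[key] += value
        return list(partial.items())

    def reduce_phase(key, values):
        return key, sum(values)

    docs = ["to be or not to be", "to search is to find"]
    shuffled = defaultdict(list)
    for doc in docs:
        for key, value in combine(map_phase(doc)):   # the combiner runs per mapper
            shuffled[key].append(value)
    print(sorted(reduce_phase(k, v) for k, v in shuffled.items()))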

3 / 32

SLIDE 4

Assignment 3

The page rank r_j of page j is determined self-consistently by the equation

    r_j = α/n + (1 − α) · Σ_{i: i→j} r_i / d_i

where α is a number between 0 and 1 (originally taken to be .15), the sum on i is over pages i pointing to j, and d_i is the outgoing degree of page i.

Incidence matrix: A_ij = 1 if i points to j, otherwise A_ij = 0.

Transition probability from page i to page j:

    P_ij = (α/n) O_ij + (1 − α) (1/d_i) A_ij

where n = total # of pages, d_i is the outdegree of node i, and O_ij = 1 (∀ i, j).

The matrix eigenvector relation r P = r, or r = P^T r, is equivalent to the equation above (with r normalized as a probability, so that Σ_i r_i O_ij = Σ_i r_i = 1).
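A minimal power-iteration sketch of this recursion (an illustrative toy graph, not the Assignment 3 data; it assumes every page has at least one outlink):

    import numpy as np

    def pagerank(A, alpha=0.15, iters=100):
        # A[i, j] = 1 if page i links to page j; assumes no dangling nodes (d_i > 0)
        n = A.shape[0]
        d = A.sum(axis=1)                     # outdegrees d_i
        r = np.full(n, 1.0 / n)               # start from the uniform distribution
        for _ in range(iters):
            # r_j = alpha/n + (1 - alpha) * sum_{i -> j} r_i / d_i
            r = alpha / n + (1 - alpha) * (A.T @ (r / d))
        return r

    # three pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
    A = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]], dtype=float)
    print(pagerank(A))                        # entries sum to 1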

4 / 32

SLIDE 5

Overview

1. Recap
2. Feature selection
3. Structured Retrieval
4. Exam Overview

5 / 32

SLIDE 6

Outline

1. Recap
2. Feature selection
3. Structured Retrieval
4. Exam Overview

6 / 32

SLIDE 7

More Data

Figure 1: Learning Curves for Confusion Set Disambiguation

From M. Banko and E. Brill (2001), “Scaling to Very Very Large Corpora for Natural Language Disambiguation”, http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf

7 / 32

SLIDE 8

Statistical Learning

Spelling with Statistical Learning
Google Sets
Statistical Machine Translation
Canonical image selection from the web
Learning people annotation from the web via consistency learning
and others . . .

8 / 32

SLIDE 9

Outline

1. Recap
2. Feature selection
3. Structured Retrieval
4. Exam Overview

9 / 32

SLIDE 10

Feature selection

In text classification, we usually represent documents in a high-dimensional space, with each dimension corresponding to a term. In this lecture: axis = dimension = word = term = feature Many dimensions correspond to rare words. Rare words can mislead the classifier. Rare misleading features are called noise features. Eliminating noise features from the representation increases efficiency and effectiveness of text classification. Eliminating features is called feature selection.

10 / 32

SLIDE 11

Different feature selection methods

A feature selection method is mainly defined by the feature utility measures it employs.

Feature utility measures:
Frequency – select the most frequent terms
Mutual information – select the terms with the highest mutual information (also called information gain in this context)
Chi-square (χ²)

11 / 32

SLIDE 12

Information

Entropy

    H[p] = Σ_{i=1..n} −p_i log2 p_i

measures information uncertainty (p. 91 in book); it has maximum H = log2 n for all p_i = 1/n.

Consider two probability distributions: p(x) for x ∈ X and p(y) for y ∈ Y.

MI: I[X; Y] = H[p(x)] + H[p(y)] − H[p(x, y)] measures how much information p(x) gives about p(y) (and vice versa).

MI is zero iff p(x, y) = p(x) p(y), i.e., x and y are independent for all x ∈ X and y ∈ Y, and it can be as large as H[p(x)] or H[p(y)].

    I[X; Y] = Σ_{x ∈ X, y ∈ Y} p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]
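A quick numeric check of these identities (a sketch, not part of the original slides):

    import numpy as np

    def entropy(p):
        # H[p] = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    def mutual_information(pxy):
        # I[X;Y] = H[p(x)] + H[p(y)] - H[p(x,y)] for a joint distribution pxy[x, y]
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        return entropy(px) + entropy(py) - entropy(pxy.ravel())

    pxy_independent = np.outer([0.5, 0.5], [0.25, 0.75])   # p(x,y) = p(x)p(y)
    pxy_dependent = np.array([[0.5, 0.0], [0.0, 0.5]])     # y is determined by x
    print(mutual_information(pxy_independent))             # ≈ 0 (independent)
    print(mutual_information(pxy_dependent))               # 1.0 bit = H[p(x)]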

12 / 32

SLIDE 13

Mutual information

Compute the feature utility A(t, c) as the expected mutual information (MI) of term t and class c. MI tells us “how much information” the term contains about the class and vice versa. For example, if a term’s occurrence is independent of the class (same proportion of docs within/without class contain the term), then MI is 0.

Definition:

    I(U; C) = Σ_{e_t ∈ {1,0}} Σ_{e_c ∈ {1,0}} P(U = e_t, C = e_c) log2 [ P(U = e_t, C = e_c) / (P(U = e_t) P(C = e_c)) ]

13 / 32

SLIDE 14

How to compute MI values

Based on maximum likelihood estimates, the formula we actually use is:

    I(U; C) = (N11/N) log2 [N·N11 / (N1.·N.1)] + (N01/N) log2 [N·N01 / (N0.·N.1)]
            + (N10/N) log2 [N·N10 / (N1.·N.0)] + (N00/N) log2 [N·N00 / (N0.·N.0)]

where
N11: number of documents that contain t (et = 1) and are in c (ec = 1);
N10: number of documents that contain t (et = 1) and are not in c (ec = 0);
N01: number of documents that do not contain t (et = 0) and are in c (ec = 1);
N00: number of documents that do not contain t (et = 0) and are not in c (ec = 0);
N = N00 + N01 + N10 + N11.
The dot marginals sum over the missing index, e.g. N1. = N11 + N10 (documents containing t) and N.1 = N11 + N01 (documents in c).
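A small helper implementing this formula (a sketch; the function name is mine, not from the assignment or MRS):

    from math import log2

    def mi_from_counts(n11, n10, n01, n00):
        # expected mutual information I(U;C) from the 2x2 document counts
        n = n11 + n10 + n01 + n00
        n1_, n0_ = n11 + n10, n01 + n00      # docs with / without the term
        n_1, n_0 = n11 + n01, n10 + n00      # docs in / not in the class
        def term(nij, row, col):
            # a zero cell contributes 0 to the sum
            return 0.0 if nij == 0 else (nij / n) * log2(n * nij / (row * col))
        return (term(n11, n1_, n_1) + term(n01, n0_, n_1)
                + term(n10, n1_, n_0) + term(n00, n0_, n_0))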

14 / 32

SLIDE 15

MI example for poultry/export in Reuters

Document counts in the Reuters collection:

                          ec = epoultry = 1    ec = epoultry = 0
    et = eexport = 1      N11 = 49             N10 = 27,652
    et = eexport = 0      N01 = 141            N00 = 774,106

Plug these values into the formula (N = 801,948):

    I(U; C) = (49/801,948) log2 [801,948 · 49 / ((49 + 27,652)(49 + 141))]
            + (141/801,948) log2 [801,948 · 141 / ((141 + 774,106)(49 + 141))]
            + (27,652/801,948) log2 [801,948 · 27,652 / ((49 + 27,652)(27,652 + 774,106))]
            + (774,106/801,948) log2 [801,948 · 774,106 / ((141 + 774,106)(27,652 + 774,106))]
            ≈ 0.0001105
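For a quick check, plugging these counts into the helper sketched on the previous slide reproduces the value:

    print(mi_from_counts(n11=49, n10=27652, n01=141, n00=774106))   # ≈ 0.0001105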

15 / 32

SLIDE 16

MI feature selection on Reuters

Terms with the highest MI scores for two Reuters classes:

    Class "coffee":  coffee 0.0111, bags 0.0042, growers 0.0025, kg 0.0019, colombia 0.0018,
                     brazil 0.0016, export 0.0014, exporters 0.0013, exports 0.0013, crop 0.0012
    Class "sports":  soccer 0.0681, cup 0.0515, match 0.0441, matches 0.0408, played 0.0388,
                     league 0.0386, beat 0.0301, game 0.0299, games 0.0284, team 0.0264

16 / 32

SLIDE 17

χ2 Feature selection

χ² tests the independence of two events, p(A, B) = p(A) p(B) (equivalently p(A|B) = p(A), p(B|A) = p(B)). Here the two events are occurrence of the term and occurrence of the class; terms are ranked w.r.t.:

    X²(D, t, c) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c}

where N = observed frequency in D and E = expected frequency (e.g., E11 is the expected frequency of t and c occurring together in a document, assuming term and class are independent).

A high value of X² indicates that the independence hypothesis is incorrect, i.e., observed and expected are not similar. If occurrence of the term and occurrence of the class are dependent events, then occurrence of the term makes the class more (or less) likely, hence the term is helpful as a feature.

17 / 32

SLIDE 18

χ2 Feature selection, example

Observed counts (as before):

                          ec = epoultry = 1    ec = epoultry = 0
    et = eexport = 1      N11 = 49             N10 = 27,652
    et = eexport = 0      N01 = 141            N00 = 774,106

    E11 = N · P(t) · P(c) = N · (N11 + N10)/N · (N11 + N01)/N
        = N · (49 + 27,652)/N · (49 + 141)/N ≈ 6.6

Expected counts under independence:

                          ec = epoultry = 1    ec = epoultry = 0
    et = eexport = 1      E11 ≈ 6.6            E10 ≈ 27,694.4
    et = eexport = 0      E01 ≈ 183.4          E00 ≈ 774,063.6

    X²(D, t, c) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} (N_{e_t e_c} − E_{e_t e_c})² / E_{e_t e_c} ≈ 284
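The same computation as a small helper (a sketch; the function name is mine, and it assumes all marginals are non-zero):

    def chi_square_from_counts(n11, n10, n01, n00):
        # X^2 statistic for term/class independence from the 2x2 document counts
        n = n11 + n10 + n01 + n00
        observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
        row = {1: n11 + n10, 0: n01 + n00}   # docs with / without the term
        col = {1: n11 + n01, 0: n10 + n00}   # docs in / not in the class
        x2 = 0.0
        for (et, ec), n_obs in observed.items():
            expected = row[et] * col[ec] / n  # E under independence
            x2 += (n_obs - expected) ** 2 / expected
        return x2

    print(chi_square_from_counts(49, 27652, 141, 774106))   # ≈ 284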

18 / 32

SLIDE 19

Naive Bayes: Effect of feature selection

[Figure: F1 measure (y-axis, 0.0 to 0.8) versus number of features selected (x-axis, 1 to 10,000, log scale), for multinomial NB with MI, chi-square, and frequency-based feature selection, and for binomial NB with MI.]

(multinomial = multinomial Naive Bayes)

19 / 32

SLIDE 20

Feature selection for Naive Bayes

In general, feature selection is necessary for Naive Bayes to get decent performance. Also true for most other learning methods in text classification: you need feature selection for optimal performance.
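A rough end-to-end illustration of this point (a sketch assuming scikit-learn and a tiny made-up corpus, neither prescribed by the course): keep only the k terms with the highest χ² score, then train multinomial Naive Bayes on the reduced representation.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    docs = ["coffee exports from brazil rose",
            "colombia coffee crop and growers",
            "soccer cup match played in the league",
            "the team beat them in both games"]
    labels = ["coffee", "coffee", "sports", "sports"]

    # keep only the 5 highest-scoring terms (chi-square), then train multinomial NB
    model = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=5), MultinomialNB())
    model.fit(docs, labels)
    print(model.predict(["growers shipped the coffee crop"]))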

20 / 32

SLIDE 21

Outline

1. Recap
2. Feature selection
3. Structured Retrieval
4. Exam Overview

21 / 32

SLIDE 22

XML markup

    <play>
      <author>Shakespeare</author>
      <title>Macbeth</title>
      <act number="I">
        <scene number="vii">
          <title>Macbeth's castle</title>
          <verse>Will I with wine and wassail ...</verse>
        </scene>
      </act>
    </play>

22 / 32

SLIDE 23

XML Doc as DOM object
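The DOM figure itself is not reproduced here. As a rough illustration of the idea (a sketch using Python's standard library, not the original figure), the markup from the previous slide can be parsed into a DOM tree and its element nodes traversed:

    from xml.dom.minidom import parseString

    doc = parseString(
        "<play><author>Shakespeare</author><title>Macbeth</title>"
        "<act number='I'><scene number='vii'><title>Macbeth's castle</title>"
        "<verse>Will I with wine and wassail ...</verse></scene></act></play>")

    def walk(node, depth=0):
        # print each element node at its depth in the DOM tree
        if node.nodeType == node.ELEMENT_NODE:
            print("  " * depth + node.tagName)
        for child in node.childNodes:
            walk(child, depth + 1)

    walk(doc.documentElement)   # play > author, title, act > scene > title, verse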

23 / 32

SLIDE 24

Outline

1. Recap
2. Feature selection
3. Structured Retrieval
4. Exam Overview

24 / 32

SLIDE 25

Definition of information retrieval (from Lecture 1)

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Three scales: web, enterprise/inst/domain, personal.

25 / 32

SLIDE 26

“Plan” (from Lecture 1)

Search full text: basic concepts
Web search
Probabilistic Retrieval
Interfaces
Metadata / Semantics
IR ⇔ NLP ⇔ ML

Prereqs: Introductory courses in data structures and algorithms, in linear algebra, and in probability theory

26 / 32

SLIDE 27

1st Half

Searching full text: dictionaries, inverted files, postings, implementation and algorithms, term weighting, Vector Space Model, similarity, ranking
Word Statistics
MRS 1: Boolean retrieval
MRS 2: The term vocabulary and postings lists
MRS 3: Dictionaries and tolerant retrieval
MRS 5: Index compression
MRS 6: Scoring, term weighting, and the vector space model
MRS 7: Computing scores in a complete search system

27 / 32

SLIDE 28

1st Half, cont’d

Evaluation of retrieval effectiveness
MRS 8: Evaluation in information retrieval
Latent semantic indexing
MRS 18: Matrix decompositions and latent semantic indexing
Discussion 2: SMART
Discussion 3: IDF
Discussion 4: Latent semantic indexing

28 / 32

SLIDE 29

2nd Half

MRS 3: Tolerant retrieval
MRS 9: Relevance feedback and query expansion
MRS 11: Probabilistic information retrieval
Web search: anchor text and links, citation and link analysis, web crawling
MRS 19: Web search basics
MRS 21: Link analysis

29 / 32

SLIDE 30

2nd Half, cont’d

Classification, categorization, clustering
MRS 13: Text classification and Naive Bayes
MRS 14: Vector space classification
MRS 16: Flat clustering
MRS 17: Hierarchical clustering
(Structured retrieval: MRS 10, XML retrieval)
Discussion 5: Google
Discussion 6: MapReduce
Discussion 7: Statistical spell correction

30 / 32

SLIDE 31

Midterm

1) Term-document matrix, VSM, tf.idf
2) Recall/precision
3) LSI
4) Word statistics (Heaps, Zipf)

31 / 32

SLIDE 32

Final Exam, 3 or 4 questions from these topics

CS4300/INFO4300, Tue 15 Dec, 7:00-9:30 PM, Olin Hall 255

issues in personal/enterprise/webscale searching, recall/precision, and how related to info/nav/trans needs
issues for modern search engines (e.g., w.r.t. web scale, tf.idf? recall/precision?)
MapReduce
probabilistic reasoning: naive Bayes
web indexing and retrieval: link analysis, adversarial IR
vector space classification (Rocchio, kNN)
types of text classification (curated, rule-based, statistical)
clustering: flat, hierarchical (k-means, agglomerative); evaluation of clustering, measures of cluster similarity (single link, complete link, average, group average)
classification, clustering (make a dendrogram based on similarity)
cluster labeling, feature selection

32 / 32