SLIDE 1

Introduction Rare Term Vector Replacement Evaluation Summary & Conclusions

Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms

Tobias Berka, Marian Vajteršic
April 30, 2011

Tobias Berka, Marian Vajteršic: Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms

SLIDE 2

Outline

1 Introduction
    Dimensionality Reduction
2 Rare Term Vector Replacement
    Zipf’s Law
    Replacement Vectors
    Rare Term Replacement
3 Evaluation
    Retrieval Performance
    Computational Performance
    Stability
4 Summary & Conclusions

SLIDE 3

Introduction

SLIDE 4

Goals

Reduce dimensionality.
Preserve or improve:
    Pair-wise distances,
    Cross-class scatter,
    Retrieval / clustering / classification performance.
Detect:
    Contributing factors,
    Individual components,
    Signals or noise.

SLIDE 5

Methods

Great Classics: Linear Methods,
    Singular Value Decomposition,
    Principal Component Analysis (PCA),
    Non-negative Matrix Factorization(s),
    Independent Component Analysis.
Canonical Extension:
    Kernel Methods.
Maps:
    Mesh Fitting,
    Self-Organization.
Manifold Learning:
    Local Linearization,
    Local Non-linear Reduction.

SLIDE 6

My Interest

Better retrieval,
More complete retrieval.
Dynamic searching,
Less reliance on static indices.

SLIDE 7

My Interest

Interactive semi-supervised clustering, Exploratory data analysis, Search.

SLIDE 8

My Interest

Sparse → dense.
Good for super-scalar CPUs,
More efficient parallelism.

SLIDE 9

Rare Term Vector Replacement

SLIDE 10

Zipf’s Law

“The [document] frequency of a word is reciprocally proportional to its frequency rank.”

$$f_i \propto \frac{1}{\operatorname{rank}(f_i)}$$

SLIDE 11

Zipf’s Law in Practice

“Most words occur only in a very small number of documents.”
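The observation can be checked on any term-document matrix by counting, for each term, the number of documents it occurs in. The following is a minimal sketch on a toy corpus; the matrix `X`, the helper `document_frequencies` and the threshold `cutoff` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def document_frequencies(X):
    """Number of documents in which each term occurs (X is documents x terms)."""
    return np.count_nonzero(X, axis=0)

# Toy corpus (assumption): 6 documents as rows, 5 terms as columns.
X = np.array([
    [1, 0, 0, 2, 0],
    [1, 1, 0, 0, 0],
    [3, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [2, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
])

df = document_frequencies(X)

# Quartiles of the document frequencies, analogous to Q1, Q2, Q3 on the plots.
q1, q2, q3 = np.percentile(df, [25, 50, 75])

# Hypothetical rule: a term is "rare" if it occurs in fewer than `cutoff` documents.
cutoff = 2
rare = np.flatnonzero(df < cutoff)
common = np.flatnonzero(df >= cutoff)

print(df, (q1, q2, q3), rare, common)
```

Whether the cut-off is applied with a strict or non-strict comparison is not stated on the slides; the strict form here is a choice made for the sketch.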

SLIDE 12

Zipf’s Law in Pictures

[Figure: "occurrences" (log scale, 1 to 10,000) against "feature (relative)" (0.1 to 1). Q1=1, Q2=2, Q3=7, mean=75.93, cut-off=694.]

SLIDE 13

Zipf’s Law in Pictures

[Figure: "occurrences" (log scale, 1 to 100) against "feature (relative)" (0.1 to 1). Q1=1, Q2=1, Q3=4, mean=6.52, cut-off=10.]

SLIDE 14

Zipf’s Law vs. Dimensionality Reduction

Eliminate rare terms? High importance for information retrieval! Can we compress them?

SLIDE 15

Replacement Vectors

Let’s compute replacement vectors!

SLIDE 16

Centroid Summarization

We operate on a corpus in vector form.

SLIDE 17

Centroid Summarization

Select the vectors containing a rare term.

SLIDE 18

Centroid Summarization

Compute the centroid.

SLIDE 19

Vector Truncation

Discard the rare features.

[Figure: "occurrences" (log scale, 1 to 100) against "feature (relative)" (0.1 to 1). Q1=1, Q2=1, Q3=4, mean=6.52, cut-off=10.]

SLIDE 20

Computing Replacement Vectors

For all rare terms, we compute the following:
    Select all documents containing the rare term,
    Compute the (weighted) average vector,
    Truncate all rare terms from this vector.
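The three steps can be sketched directly with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the corpus is a small dense matrix `X` (documents x terms), the index lists `rare` and `common` are given, and the "(weighted)" average is assumed to weight each document by the weight of the rare term in it.

```python
import numpy as np

def replacement_vectors(X, rare, common):
    """Return one replacement vector per rare term, expressed in common terms only.

    Assumes every rare term occurs in at least one document.
    """
    R = np.zeros((len(rare), len(common)))
    for i, t in enumerate(rare):
        weights = X[:, t]                # weight of rare term t in every document
        docs = np.flatnonzero(weights)   # select the documents containing t
        # Weighted average of the selected documents; taking only the common
        # columns truncates all rare terms from the result.
        R[i] = np.average(X[docs][:, common], axis=0, weights=weights[docs])
    return R

# Toy corpus (assumption): terms 0 and 1 are rare, terms 2 and 3 are common.
X = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 3.0, 1.0],
])
rare, common = [0, 1], [2, 3]

R = replacement_vectors(X, rare, common)
print(R)
```

Term 0 occurs only in the first document, so its replacement vector is that document's common part; term 1 averages the second and third documents with weights 1 and 2.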

SLIDE 21

A More Efficient Algorithm

For all documents,
    For all rare terms,
        Add the common terms to the average vector.
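The loop order above visits each document once and accumulates into the average vectors of the rare terms it contains. A minimal sketch follows, assuming a dense toy matrix, given index lists `rare` and `common`, and the same weighting as a per-term weighted average (weighting by the rare term's weight is an assumption; the slides only say "(weighted)").

```python
import numpy as np

def replacement_vectors_single_pass(X, rare, common):
    """Accumulate all replacement vectors in one pass over the documents."""
    sums = np.zeros((len(rare), len(common)))  # running weighted sums
    totals = np.zeros(len(rare))               # running sums of weights
    for doc in X:                              # for all documents
        common_part = doc[common]
        for i, t in enumerate(rare):           # for all rare terms
            w = doc[t]
            if w != 0:                         # only rare terms present in the document
                sums[i] += w * common_part     # add the common terms to the average vector
                totals[i] += w
    # Assumes every rare term occurs at least once, so no total is zero.
    return sums / totals[:, None]

# Toy corpus (assumption): terms 0 and 1 are rare, terms 2 and 3 are common.
X = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 3.0, 1.0],
])
rare, common = [0, 1], [2, 3]

R = replacement_vectors_single_pass(X, rare, common)
print(R)
```

With a sparse document representation the inner loop would iterate only over the rare terms stored in the document, which is what makes the single pass cheap; the dense scan here is only for brevity.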

SLIDE 22

New Document Representation

For all documents, we compute the following:
    Truncate all rare terms from the document vectors (i.e. retain only common terms),
    Add the linear combination of all replacement vectors,
        For all rare terms in the document,
        Scaled by the weighted term frequency,
    Normalize the result to unit length.
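In matrix form, the new representation is the common part of each document plus its rare-term weights multiplied by the replacement vectors, followed by row normalization. The sketch below assumes a dense toy matrix `X` and a precomputed replacement matrix `R` (one row per rare term); the function name `rtvr_transform` is hypothetical.

```python
import numpy as np

def rtvr_transform(X, R, rare, common):
    """Map documents to the reduced space spanned by the common terms."""
    truncated = X[:, common]           # retain only common terms
    replaced = X[:, rare] @ R          # replacement vectors scaled by rare-term weights
    Y = truncated + replaced
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # leave all-zero documents unchanged
    return Y / norms                   # normalize to unit length

# Toy data (assumption): terms 0 and 1 are rare, terms 2 and 3 are common.
X = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 3.0, 1.0],
])
rare, common = [0, 1], [2, 3]
R = np.array([
    [2.0, 0.0],              # replacement vector for rare term 0
    [7.0 / 3.0, 2.0 / 3.0],  # replacement vector for rare term 1
])

Y = rtvr_transform(X, R, rare, common)
print(Y)
```

The result has one column per common term, so the dimensionality drops from the full vocabulary to the number of common terms.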

SLIDE 23

Subsequent Rank Reduction

Once we have computed the replacement vectors, we compute a rank-reduced PCA:
    Reduces number of features by 50%,
    Improves the retrieval performance,
    Low number of features, dense data matrix – use a symmetric eigensolver,
    In LAPACK terms: xSPEVX, xSYEV, etc.
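A rank-reduced PCA on a small, dense feature matrix can be sketched with a symmetric eigensolver as follows. This is an illustration under assumptions: `numpy.linalg.eigh` stands in for the LAPACK routines named on the slide, the data are mean-centered before the covariance is formed (the slides do not say whether centering is used), and the random matrix `Y` is a placeholder for the vector-replaced corpus.

```python
import numpy as np

def rank_reduce(Y, k):
    """Project the rows of Y onto the k leading principal components."""
    Yc = Y - Y.mean(axis=0)                  # center the features
    C = Yc.T @ Yc / (Y.shape[0] - 1)         # dense symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # symmetric eigensolver
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    P = eigvecs[:, top]                      # projection matrix (features x k)
    return Yc @ P, P

# Placeholder data (assumption): 20 documents with 6 features each.
rng = np.random.default_rng(0)
Y = rng.random((20, 6))

Z, P = rank_reduce(Y, 3)   # keep half of the features
print(Z.shape)
```

Because the covariance matrix has only as many rows and columns as there are common terms, a dense eigensolver stays affordable even for a large number of documents.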

SLIDE 24

Evaluation

SLIDE 25

Reuters Corpus

Reuters corpus, training set, all categories.

[Figure: mean precision (0.5 to 0.8) against hit list rank (10 to 100). Series: sparse TF-IDF (47,236 features), vector replacement (535), rank-reduced vector replacement (392).]
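Precision at a given hit list rank is the fraction of the top-ranked documents that are relevant; the plotted quantity is presumably this value averaged over queries. The sketch below shows one way to compute it for a single query. The ranking by dot product on unit-length vectors, the toy data and the function name `precision_at_k` are assumptions, since the slides do not describe the evaluation protocol.

```python
import numpy as np

def precision_at_k(D, q, relevant, k):
    """Fraction of the k highest-scoring documents that are relevant."""
    scores = D @ q                    # cosine similarity for unit-length vectors
    top = np.argsort(-scores)[:k]     # hit list: indices of the k best documents
    return float(np.mean(np.isin(top, relevant)))

# Toy data (assumption): three unit-length documents and one unit-length query.
D = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
])
q = np.array([1.0, 0.0])
relevant = [0, 2]   # hypothetical relevance judgements for this query

p1 = precision_at_k(D, q, relevant, 1)
p2 = precision_at_k(D, q, relevant, 2)
print(p1, p2)
```

Averaging `precision_at_k` over all queries for each `k` gives one curve of the kind shown in the figure.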

SLIDE 26

MEDLARS Corpus

MEDLARS corpus, very small!

[Figure: mean precision (0.1 to 0.9) against hit list rank (5 to 30). Series: sparse TF (8,742 features), vector replacement (1,136), rank-reduced vector replacement (500).]

SLIDE 27

Run-Time Measurements

Single-pass algorithm, extended Reuters corpus.

[Figure: time [s] (600 to 1,200) against documents (400,000 to 850,000).]

SLIDE 28

Choice of Parameters

Reuters corpus, various thresholds.

[Figure: mean average precision (0.15 to 0.65) against features (log scale, 10 to 10,000). Series by threshold, with number of features: 0.1% (5,859), 0.5% (2,287), 1% (1,430), 3% (535), 5% (299), 7% (185), 9% (149), 10% (109), and TF-IDF (47,236).]

SLIDE 29

Singular Values

Singular values on a log-scale.

[Figure: singular values (log scale, 0.1 to 100); horizontal axis ticks from 100 to 600.]

SLIDE 30

Summary & Conclusions

SLIDE 31

Summary

Background: dimensionality reduction,
Motivation: Zipf’s law, centroid summarization,
Method: rare term vector replacement (RTVR),
Detailed Algorithms: see paper,
Evaluation: retrieval performance, run-time and stability w.r.t. choice of parameters.

SLIDE 32

Results

Significant reduction in the number of features,
Clear improvement in retrieval quality,
Fairly stable w.r.t. choice of parameters,
Efficient and scalable algorithm.

SLIDE 33

Outlook

Parallelization is well under way,
Term/Document updating/downdating,
More experiments – performance shoot-out?

SLIDE 34

My Interest

Is it good enough to replace ranking? Lots of missing pieces...

SLIDE 35

Thank you!

Questions?
