
CS490W: Web Information Systems

Course Review

Luo Si Department of Computer Science Purdue University


Basic Concepts of IR: Outline

Basic Concepts of Information Retrieval:

 Task definition of Ad-hoc IR

  • Terminologies and concepts
  • Overview of retrieval models

 Text representation

  • Indexing
  • Text preprocessing

 Evaluation

  • Evaluation methodology
  • Evaluation metrics

Ad-hoc IR: Terminologies

Terminologies:

 Query

  • Representative data of the user’s information need: text (default) and other media

 Document

  • Data that is a candidate to satisfy the user’s information need: text (default) and other media

 Database|Collection|Corpus

  • A set of documents

 Corpora

  • A set of databases
  • Valuable corpora from TREC (Text REtrieval Conference)


Some core concepts of IR

[Diagram: information need → query representation; indexed objects → document representation; a retrieval model matches the two to produce retrieved objects → returned results → evaluation/feedback]


Text Representation: Indexing

Statistical Properties of Text

Zipf’s law: relates a term’s frequency to its rank

 Rank all terms by their frequencies in descending order; for a term at a specific rank r, collect and calculate:

  f_r : term frequency at rank r
  p_r = f_r / N : relative term frequency, where N is the total number of words

 Zipf’s law (by observation): p_r ≈ A / r, with A ≈ 0.1

So f_r = p_r · N ≈ A·N / r, i.e. r · f_r ≈ A·N, and log(f_r) ≈ log(A·N) − log(r)

So Rank × Frequency ≈ Constant


Text Representation: Text Preprocessing

Text Preprocessing: extract representative index terms

 Parse query/document for useful structure

  • E.g., title, anchor text, link, tags in XML…

 Tokenization

  • For most western languages, words are separated by spaces; deal with punctuation, capitalization, hyphenation
  • For Chinese, Japanese: more complex word segmentation…

 Remove stopwords: remove “the”, “is”, …, using an existing standard list

 Morphological analysis (e.g., stemming)

  • Stemming: determine the stem form of given inflected forms

 Other: extract phrases; decompounding for some European languages


Evaluation

Evaluation criteria

 Effectiveness

  • Favor returned ranked lists with more relevant documents at the top
  • Objective measures: recall and precision, mean average precision, rank-based precision

For documents in a subset of a ranked list, if we know the truth:

 Precision = (relevant docs retrieved) / (retrieved docs)
 Recall = (relevant docs retrieved) / (relevant docs)
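The two ratios can be sketched directly from their definitions (doc ids and the function name are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: list of doc ids returned by the system
    relevant:  set of doc ids judged relevant (the truth)
    """
    hits = sum(1 for d in retrieved if d in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of 4 retrieved docs are relevant; 3 of 5 relevant docs were found.
p, r = precision_recall(["d1", "d2", "d3", "d9"],
                        {"d1", "d2", "d3", "d4", "d5"})
# p = 0.75, r = 0.6
```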


Evaluation

Pooling Strategy

 Retrieve documents using multiple methods

 Judge the top n documents from each method

 The whole judged set is the union of the top retrieved documents from all methods

 Problem: the set of judged relevant documents may not be complete

 It is possible to estimate the number of truly relevant documents by random sampling


Evaluation

Single value metrics

 Mean average precision

  • Calculate precision at each relevant document; average over all precision values

 11-point interpolated average precision

  • Calculate precision at standard recall points (e.g., 10%, 20%, …); smooth the values; estimate the 0% point by interpolation
  • Average the results

 Rank-based precision

  • Calculate precision at top ranked documents (e.g., top 5, 10, 15, …)
  • Desirable when users care more about top ranked documents
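Average precision and rank-based precision follow mechanically from the definitions above; a sketch (doc ids are illustrative):

```python
def average_precision(ranked, relevant):
    """Precision at each relevant document, averaged over all relevant docs."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k):
    """Rank-based precision: fraction of the top k that is relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

ranked = ["d1", "d9", "d2", "d8", "d3"]
truth = {"d1", "d2", "d3"}
ap = average_precision(ranked, truth)   # (1/1 + 2/3 + 3/5) / 3
p5 = precision_at_k(ranked, truth, 5)   # 3/5
```

Mean average precision is then the mean of `average_precision` over all test queries.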

Evaluation

Sample Results


Retrieval Models: Outline

Retrieval Models

 Exact-match retrieval method

  • Unranked Boolean retrieval method
  • Ranked Boolean retrieval method

 Best-match retrieval method

  • Vector space retrieval method
  • Latent semantic indexing

Retrieval Models: Unranked Boolean

Unranked Boolean: Exact match method

 Selection Model

  • Retrieve a document iff it matches the precise query
  • Often return unranked documents (or with chronological order)

 Operators

  • Logical operators: AND, OR, NOT
  • Proximity operators: #1(white house) (i.e., within one word distance, a phrase); #sen(Iraq weapon) (i.e., within a sentence)
  • String matching operators: wildcard (e.g., ind* for India and Indonesia)
  • Field operators: title(information and retrieval)…

Retrieval Models: Unranked Boolean

Advantages:

 Works well if the user knows exactly what to retrieve

 Predictable; easy to explain

 Very efficient

Disadvantages:

 It is difficult to design the query: a loose query gives high recall but low precision; a strict query gives low recall but high precision

 Results are unordered; hard to find the useful ones

 Users may be too optimistic about strict queries: the few returned documents are very relevant, but many more are missing


Retrieval Models: Ranked Boolean

Ranked Boolean: Exact match

 Similar to unranked Boolean, but documents are ordered by some criterion that reflects the importance of the document’s words

Example: query (Thailand AND stock AND market) against the Wall Street Journal collection. Which word is more important?

 Term Frequency (TF): number of occurrences in the query/doc; a larger number means the term is more important

 Inverse Document Frequency (IDF): based on (total number of docs) / (number of docs containing the term); larger means more important

 Many docs contain “stock” and “market”, but fewer contain “Thailand”; the rarer term may be more indicative

 There are many variants of TF and IDF, e.g., considering document length


Retrieval Models: Ranked Boolean

Ranked Boolean: Calculate doc score

 Term evidence: evidence for term i occurring in doc j: (tf_ij) or (tf_ij * idf_i)

 AND weight: minimum of the argument weights

 OR weight: maximum of the argument weights

Example with term evidence (0.2, 0.6, 0.4):

 AND → min = 0.2
 OR → max = 0.6

Query: (Thailand AND stock AND market)
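The min/max scoring rule is a one-liner per operator; a sketch using the slide’s example values:

```python
def and_score(weights):
    """AND evidence: minimum of the argument term weights."""
    return min(weights)

def or_score(weights):
    """OR evidence: maximum of the argument term weights."""
    return max(weights)

# Term evidence for (Thailand, stock, market) in one document.
evidence = [0.2, 0.6, 0.4]
score_and = and_score(evidence)  # (Thailand AND stock AND market) -> 0.2
score_or = or_score(evidence)    # (Thailand OR stock OR market)  -> 0.6
```

Nested queries score recursively: evaluate each sub-expression’s weight, then apply min or max at the enclosing operator.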


Retrieval Models: Ranked Boolean

Advantages:

 All advantages of the unranked Boolean algorithm

  • Works well when the query is precise; predictable; efficient

 Results in a ranked list (not a flat set); easier to browse and find the most relevant documents than with unranked Boolean

 The ranking criterion is flexible: e.g., different variants of term evidence

Disadvantages:

 Still an exact match (document selection) model: recall and precision remain inversely correlated for strict vs. loose queries

 Predictability makes users overestimate retrieval quality


Retrieval Models: Vector Space Model

Vector space model

 Any text object can be represented by a term vector

  • Documents, queries, passages, sentences
  • A query can be seen as a short document

 Similarity is determined by distance in the vector space

  • Example: cosine of the angle between two vectors

 The SMART system

  • Developed at Cornell University: 1960-1999
  • Still quite popular

Retrieval Models: Vector Space Model

Vector representation

[Diagram: documents D1, D2, D3 and the query plotted in a vector space with axes “Java”, “Sun”, “Starbucks”]


Retrieval Models: Vector Space Model

Given the query and document vectors

 q = (q_1, q_2, …, q_n)
 d_j = (d_j1, d_j2, …, d_jn)

calculate the similarity as the cosine of the angle between the two vectors:

 sim(q, d_j) = cos(θ(q, d_j))
             = (q_1·d_j1 + q_2·d_j2 + … + q_n·d_jn) / ( sqrt(q_1² + … + q_n²) · sqrt(d_j1² + … + d_jn²) )
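The cosine formula above translates directly into code; a sketch with dense vectors (real systems use sparse term vectors, but the arithmetic is the same):

```python
import math

def cosine_similarity(q, d):
    """cos(theta) between query vector q and document vector d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_q * norm_d)

sim = cosine_similarity([1.0, 1.0, 0.0], [2.0, 2.0, 0.0])  # parallel -> 1.0
```

Because both vectors are length-normalized, documents of different lengths are comparable; only the direction (term mix) matters.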


Retrieval Models: Vector Space Model

Vector Coefficients

The coefficients (vector elements) represent term evidence/term importance.

Each coefficient is derived from several components:

  • Document term weight: evidence of the term in the document/query
  • Collection term weight: importance of the term from observation of the whole collection
  • Length normalization: reduce document-length bias

Naming convention for the coefficients q_k and d_jk: DCL.DCL, where each triple names the document term weight, collection term weight, and length normalization; the first triple describes the query term and the second the document term.


Retrieval Models: Vector Space Model

Common vector weight components:

lnc.ltc: widely used term weight

  • “l”: log(tf)+1
  • “n”: no weight/normalization
  • “t”: log(N/df)
  • “c”: cosine normalization

 sim(q, d_j) = Σ_k q_k · d_jk

with, before cosine normalization,

 q_k  = (log(tf_k,q) + 1) · log(N / df_k)   [“ltc” query weight]
 d_jk = log(tf_k,j) + 1                     [“lnc” document weight]

and each vector divided by its Euclidean length (the “c” step).
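A minimal sketch of lnc.ltc scoring under those definitions (dense vectors over a fixed vocabulary; function names are illustrative):

```python
import math

def ltc_query_weights(tf, df, n_docs):
    """'ltc' query weights: (log(tf)+1) * log(N/df), cosine-normalized."""
    raw = [(math.log(t) + 1) * math.log(n_docs / d) if t > 0 else 0.0
           for t, d in zip(tf, df)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm > 0 else raw

def lnc_doc_weights(tf):
    """'lnc' document weights: log(tf)+1, no collection weight, cosine-normalized."""
    raw = [math.log(t) + 1 if t > 0 else 0.0 for t in tf]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm > 0 else raw

def lnc_ltc_score(q_tf, d_tf, df, n_docs):
    """Inner product of the normalized query and document vectors."""
    q = ltc_query_weights(q_tf, df, n_docs)
    d = lnc_doc_weights(d_tf)
    return sum(qi * di for qi, di in zip(q, d))

# One shared term with matching direction -> similarity 1.0.
score = lnc_ltc_score([1, 0], [1, 0], df=[1, 10], n_docs=10)
```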


Retrieval Models: Vector Space Model

Advantages:

 Best-match method; it does not need a precise query

 Generates ranked lists; easy to explore the results

 Simplicity: easy to implement

 Effectiveness: often works well

 Flexibility: can utilize different term weighting methods

 Used in a wide range of IR tasks: retrieval, classification, summarization, content-based filtering…


Retrieval Models: Vector Space Model

Disadvantages:

 Hard to choose the dimensions of the vector (the “basic concepts”); terms may not be the best choice

 Assumes independence among terms

 Vector operations are chosen heuristically

  • Choice of term weights
  • Choice of similarity function

 Assumes a query and a document can be treated in the same way


Retrieval Models: Latent Semantic Indexing

Latent Semantic Indexing (LSI): explore the correlation between terms and documents

 Two terms are correlated (may share similar semantic concepts) if they often co-occur

 Two documents are correlated (share similar topics) if they have many common words

LSI associates each term and each document with a small number of semantic concepts/topics.


Query Expansion: Outline

Query Expansion via Relevance Feedback

 Relevance Feedback

 Blind/Pseudo Relevance Feedback

Query Expansion via External Resources

 Thesaurus

  • “Industrial Chemical Thesaurus”, “Medical Subject Headings” (MeSH)

 Semantic network

  • WordNet

Query Expansion: Relevance Feedback

Goal: Move the new query closer to relevant documents and away from irrelevant documents.

Approach: The new query is a weighted combination of the original query and the relevant and non-relevant document vectors.

Rocchio formula (vector space model):

 q' = q + (1/|R|) Σ_{d_i ∈ R} d_i − (1/|NR|) Σ_{d_i ∈ NR} d_i
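The Rocchio update, in the unweighted form shown above, can be sketched as follows (the toy vectors are illustrative; practical systems also scale the three components with tunable weights):

```python
def rocchio(query, rel_docs, nonrel_docs):
    """q' = q + mean(relevant vectors) - mean(non-relevant vectors)."""
    n = len(query)
    new_q = list(query)
    if rel_docs:  # move toward the centroid of relevant docs
        for i in range(n):
            new_q[i] += sum(d[i] for d in rel_docs) / len(rel_docs)
    if nonrel_docs:  # move away from the centroid of non-relevant docs
        for i in range(n):
            new_q[i] -= sum(d[i] for d in nonrel_docs) / len(nonrel_docs)
    return new_q

q2 = rocchio([1.0, 0.0], rel_docs=[[0.0, 2.0]], nonrel_docs=[[1.0, 0.0]])
# -> [0.0, 2.0]: the query moved toward the relevant document
```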


Web Search

Spam Detection

  • Content spam; link spam; …

Source size estimation

Capture-Recapture Model

  • What is the assumption?
  • How to calculate?

Duplicate detection
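For source size estimation, the capture-recapture model can be sketched with the Lincoln-Petersen estimator; the key assumption (per the slide’s question) is that the two samples are independent random draws from the same collection. The function name and toy samples are illustrative:

```python
def capture_recapture(sample1, sample2):
    """Lincoln-Petersen estimate of collection size.

    Assumes sample1 and sample2 are independent uniform samples.
    N_hat = |s1| * |s2| / |s1 & s2|
    """
    s1, s2 = set(sample1), set(sample2)
    overlap = len(s1 & s2)
    if overlap == 0:
        raise ValueError("no overlap: cannot estimate size")
    return len(s1) * len(s2) / overlap

# 50 docs per sample, 5 in common -> estimated size 50*50/5 = 500.
est = capture_recapture(range(0, 50), range(45, 95))
```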


Text Categorization (I)

Outline

 Introduction to the task of text categorization

  • Manual vs. automatic text categorization

 Text categorization applications

 Evaluation of text categorization

 K nearest neighbor text categorization method


Text Categorization

 Automatic text categorization

  • Learn an algorithm to automatically assign predefined categories to text documents/objects
  • Automatic or semi-automatic

 Procedure

  • Training: given a set of categories and labeled document examples, learn a method that maps a document to the correct category (or categories)
  • Testing: predict the category (or categories) of a new document

 Automatic or semi-automatic categorization can significantly reduce manual effort


K-Nearest Neighbor Classifier

 Idea: determine your language from the language your neighbors speak

 [Diagram: the same test point classified with k=1 and with k=5]

 Use the K nearest neighbors to vote


K Nearest Neighbor: Technical Elements

 Document representation

 Document distance measure: closer documents should have similar labels; neighbors speak the same language

 Number of nearest neighbors (value of K)

 Decision threshold


K Nearest Neighbor: Framework

Training data: D = {(x_i, y_i)}, x_i ∈ R^M (docs), y_i ∈ {0, 1}

Test data: x ∈ R^M

Scoring function, where the neighborhood D_k(x) ⊆ D is the set of k training documents closest to x:

 ŷ(x) = (1/k) Σ_{x_i ∈ D_k(x)} sim(x, x_i) · y_i

Classification:

 assign label 1 if ŷ(x) > t, 0 otherwise

Document representation: each dimension of x_i uses tf.idf weighting.
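The scoring and classification rules above can be sketched directly, using cosine similarity as `sim` (the toy training vectors are illustrative; a real system would use tf.idf vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_score(x, train, k):
    """y_hat(x): similarity-weighted vote of the k nearest neighbors.

    train: list of (vector, label) pairs with labels in {0, 1}.
    """
    neighbors = sorted(train, key=lambda xy: cosine(x, xy[0]),
                       reverse=True)[:k]
    return sum(cosine(x, xi) * yi for xi, yi in neighbors) / k

def knn_classify(x, train, k, threshold=0.5):
    """Assign 1 if the score exceeds the decision threshold t."""
    return 1 if knn_score(x, train, k) > threshold else 0

train = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0)]
label = knn_classify([1.0, 0.05], train, k=2)  # near the label-1 cluster
```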


Collaborative Filtering

Outline

 Introduction to collaborative filtering

 Main framework

 Memory-based collaborative filtering approach

 Model-based collaborative filtering approach

  • Aspect model & two-way clustering model
  • Flexible mixture model
  • Decoupled model

 Unified filtering by combining content and collaborative filtering


Federated Search

Outline

 Introduction to federated search

 Main research problems

  • Resource Representation
  • Resource Selection
  • Results Merging

 A unified utility maximization framework for federated search

 Modeling search engine effectiveness


Components of a Federated Search System and Two Important Applications

[Diagram: Engines 1…N, each summarized by (1) Resource Representation; a broker performs (2) Resource Selection over the engine summaries and (3) Results Merging over the returned ranked lists]

Information source recommendation: recommend information sources for users’ text queries (e.g., completeplanet.com): steps 1 and 2

Federated document retrieval: also search the selected sources and merge the individual ranked lists into a single list: steps 1, 2 and 3


Clustering

Document clustering

– Motivations
– Document representations
– Success criteria

Clustering algorithms

– K-means
– Model-based clustering (EM clustering)


Link Analysis

Outline

 The characteristics of Web structure (small world)

 Hub & Authority algorithms

  • Authority value, hubness value

 PageRank algorithm (PageRank value)

 Relation to the computation of eigenvectors
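The eigenvector connection mentioned above can be illustrated with power iteration: repeatedly applying the damped transition rule converges to the principal eigenvector of the transition matrix, which is the PageRank vector. A minimal sketch (the three-page graph is made up for illustration):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank.

    links: dict mapping each page to the list of pages it links to.
    The fixed point is the principal eigenvector of the damped
    transition matrix.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform vector
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}  # random-jump component
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
pr = pagerank(graph)  # ranks sum to ~1; "a" receives the most rank here
```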


Information Extraction

 Concept of Information Extraction

  • Goes beyond retrieval; deep analysis; from unstructured data to semi-structured or structured data

 Named Entity Recognition and a simple solution

  • Sliding window algorithm; connection with text categorization

 Probabilistic methods for information extraction

  • Finite state models