CMPT 354: Database I -- Information Retrieval 2
What Is Clustering?
- Group data into clusters
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
– Unsupervised learning: no predefined classes
Application Examples
- A stand-alone tool: explore data distribution
- A preprocessing step for other algorithms
- Pattern recognition, spatial data analysis,
image processing, market research, WWW, …
– Cluster documents
– Cluster web log data to discover groups of similar access patterns
What Is Good Clustering?
- High intra-class similarity and low inter-class
similarity
– Depending on the similarity measure
- The ability to discover some or all of the
hidden patterns
Partitioning Algorithms: Basics
- Partition n objects into k clusters
– Optimize the chosen partitioning criterion
- Global optimal: examine all possible partitions
– (k^n − (k−1)^n − … − 1) possible partitions, too expensive!
- Heuristic methods: k-means and k-medoids
– K-means: a cluster is represented by the center
– K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster
K-Means: Example
[Figure: K-means with K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until the clusters no longer change.]
K-means
- Arbitrarily choose k objects as the initial
cluster centers
- Until no change, do
– (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
– Update the cluster means, i.e., calculate the mean value of the objects for each cluster
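The loop above can be sketched in Python. The 2-D point set and the squared Euclidean distance are illustrative choices, not part of the slides:

```python
import random

def kmeans(points, k, iters=100):
    """Minimal k-means sketch: assign each point to the nearest center,
    then update each center to the mean of its cluster, until no change."""
    centers = random.sample(points, k)  # arbitrarily choose k objects as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # (Re)assign each object to the cluster with the most similar (nearest) center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            clusters[j].append(p)
        # Update the cluster means (keep the old center if a cluster went empty)
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # no change: done
            break
        centers = new_centers
    return centers, clusters
```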
Pros and Cons of K-means
- Relatively efficient: O(tkn)
– n: # objects, k: # clusters, t: # iterations; k, t << n.
- Often terminates at a local optimum
- Applicable only when mean is defined
– What about categorical data?
- Need to specify the number of clusters
- Unable to handle noisy data and outliers
- Unsuitable for discovering non-convex clusters
Information retrieval
- Dealing with the representation, storage,
organization of, and access to information
items
– Information instead of just data
– “Interpret” the contents of the documents
– Rank documents according to a degree of relevance to the user query
- The notion of relevance is at the center of
information retrieval
Information Retrieval History
- Simple information retrieval aids: tables of
contents in books, index cards, and traditional library management systems
– Computer-centered view: building efficient indexes
– Human-centered view: understand the behavior of the user and his information needs
- The Web and digital libraries
Information Retrieval Systems
- Information retrieval (IR) systems use a simpler data model
than database systems
– Information organized as a collection of documents
– Documents are unstructured, no schema
- Information retrieval locates relevant documents, on the
basis of user input such as keywords or example documents
– e.g., find documents containing the words “database systems”
- Can be used even on textual descriptions provided with
non-textual data such as images
- Web search engines are the most familiar example of IR
systems
IR versus DB
- IR systems do not deal with transactional updates
(including concurrency control and recovery)
- Database systems deal with structured data, with
schemas that define the data organization
- IR systems deal with some querying issues not
generally addressed by database systems
– Approximate searching by keywords
– Ranking of retrieved answers by estimated degree of relevance
Data and Queries
- Structured queries over structured data (relational data): relational databases; XML for semi-structured data
- Unstructured queries (keywords only) over structured data: IR in DB (new direction in DB research and development)
- Unstructured queries over unstructured data (e.g., free text, multimedia): information retrieval
Keyword Search
- In full text retrieval, all the words in each document are
considered to be keywords
- Information-retrieval systems typically allow query
expressions formed using keywords and the logical connectives and, or, and not
– Ands are implicit, even if not explicitly specified
- Relevance ranking is based on factors such as
– Term frequency
- Frequency of occurrence of query keyword in document
– Inverse document frequency
- How many documents the query keyword occurs in
– Occurrence in fewer documents gives the keyword more importance
– Hyperlinks to documents
- More links to a document means the document is more important
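A conjunctive (implicit AND) keyword query can be sketched with an inverted index. The three-document corpus below is hypothetical:

```python
# Hypothetical toy corpus; in full text retrieval every word is a keyword.
docs = {
    1: "database systems store structured data",
    2: "information retrieval systems rank documents",
    3: "web search engines are information retrieval systems",
}

# Build an inverted index: keyword -> set of ids of documents containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def keyword_query(*keywords):
    """Implicit-AND keyword query: intersect the posting sets of all keywords."""
    result = None
    for kw in keywords:
        postings = index.get(kw, set())
        result = postings if result is None else result & postings
    return sorted(result or [])

print(keyword_query("information", "retrieval"))  # → [2, 3]
```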
TF-IDF
- Term frequency/Inverse Document frequency ranking
- Let n(d) = number of terms in the document d
- n(d, t) = number of occurrences of term t in the document d
- Relevance of a document d to a term t
The log factor is to avoid excessive weight to frequent terms
- Relevance of document to query Q
TF(d, t) = log(1 + n(d, t) / n(d))

r(d, Q) = ∑_{t ∈ Q} TF(d, t) / n(t), where n(t) is the number of documents that contain term t
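Under these definitions (TF(d, t) = log(1 + n(d, t)/n(d)); r(d, Q) = ∑_{t∈Q} TF(d, t)/n(t), with n(t) the number of documents containing t), the ranking can be transcribed directly. The toy corpus is hypothetical:

```python
from math import log

docs = {  # hypothetical corpus: document id -> list of terms
    1: "big data big databases".split(),
    2: "information retrieval ranks documents".split(),
    3: "big web data".split(),
}

def tf(d, t):
    """TF(d, t) = log(1 + n(d, t) / n(d)); the log damps frequent terms."""
    terms = docs[d]
    return log(1 + terms.count(t) / len(terms))

def n_t(t):
    """n(t): the number of documents in which term t occurs."""
    return sum(1 for terms in docs.values() if t in terms)

def relevance(d, query):
    """r(d, Q) = sum over t in Q of TF(d, t) / n(t)."""
    return sum(tf(d, t) / n_t(t) for t in query if n_t(t) > 0)
```

Document 1 mentions "big" twice, so it outranks document 3 for the query ["big", "data"], while document 2 scores zero.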
Relevance Ranking Using Terms
- Most systems also consider
– Words that occur in title, author list, section headings, etc. are given greater importance
– Words whose first occurrence is late in the document are given lower importance
– Very common words (stop words) such as “a”, “an”, “the”, “it”, etc. are eliminated
– Proximity: if keywords in the query occur close together in the document, the document has higher importance than if they occur far apart
- Documents are returned in decreasing order of relevance
score
– Usually only top few documents are returned, not all
Similarity Based Retrieval
- Similarity based retrieval - retrieve
documents similar to a given document
– Similarity may be defined on the basis of common words: e.g., find the k terms in A with the highest TF(d, t)/n(t) and use these terms to find the relevance of other documents
- Relevance feedback: Similarity can be used
to refine answer set to keyword query
– User selects a few relevant documents from those retrieved by keyword query, and system finds other documents similar to these
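Both ideas can be sketched on a made-up three-document corpus: take the k highest-weight terms of a document (weight TF(d, t)/n(t) as above) and score the other documents with them. Corpus, names, and k are illustrative:

```python
from math import log

docs = {  # hypothetical corpus: id -> list of terms
    "A": "clustering groups similar objects into clusters".split(),
    "B": "clustering partitions objects into groups of similar objects".split(),
    "C": "web crawlers gather pages for indexing".split(),
}

def weight(d, t):
    """TF(d, t) / n(t): the per-term weight used to pick representative terms."""
    terms = docs[d]
    tf = log(1 + terms.count(t) / len(terms))
    n_t = sum(1 for ts in docs.values() if t in ts)
    return tf / n_t

def similar_docs(d, k=3):
    """Use the k highest-weight terms of d as a query against the other documents."""
    top_terms = sorted(set(docs[d]), key=lambda t: weight(d, t), reverse=True)[:k]
    scores = {e: sum(weight(e, t) for t in top_terms) for e in docs if e != d}
    return sorted(scores, key=scores.get, reverse=True)
```

For relevance feedback, the same scoring would be rerun with the terms of the user-selected relevant documents added to the query.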
Vector space model
- Define an n-dimensional space, where n is
the number of words in the document set
- Vector for document d goes from the origin to a
point whose i-th coordinate is TF(d, t_i) / n(t_i)
- The cosine of the angle between the vectors
of two documents is used as a measure of
their similarity
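A sketch of the cosine measure over sparse document vectors; representing a vector as a dict from term to weight is an implementation choice, not part of the slides:

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two document vectors,
    each given as a sparse dict: term -> weight."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Identical vectors give cosine 1 (angle 0); documents with no terms in common give cosine 0.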
Relevance Using Hyperlinks
- The number of documents relevant to a query can
be enormous if only term frequencies are taken into account
- Using term frequencies makes “spamming” easy
– E.g. a travel agency can add many occurrences of the word “travel” to its page to make its rank very high
- People often look for pages from popular sites
- Idea: use popularity of Web site (e.g. how many
people visit it) to rank site pages that match given keywords
– Problem: hard to find actual popularity of site
Relevance Using Hyperlinks
- Use the number of hyperlinks to a site as a
measure of the popularity or prestige of the site
– Count only one hyperlink from each site (why?)
– Popularity measure is for the site, not for an individual page
- But, most hyperlinks are to root of site
- Also, concept of “site” is difficult to define since a URL prefix like
cs.sfu.ca contains many unrelated pages of varying popularity
- Refinements
– When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige
- Definition is circular
- Set up and solve system of simultaneous linear equations
PageRank
- Simulate a user navigating randomly in the
web who jumps to a random page with probability q or follows a random hyperlink with probability (1-q)
- C(a) is the number of outgoing links of page
a
- Page a is pointed to by pages p1 to pn
PR(a) = q + (1 − q) · ∑_{i=1}^{n} PR(p_i) / C(p_i)
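With q the jump probability, the recurrence PR(a) = q + (1 − q) · ∑ PR(p_i)/C(p_i) over the pages p_i linking to a can be solved by simple iteration. The link graph below is a made-up example, and every page is assumed to have at least one outgoing link:

```python
def pagerank(links, q=0.15, iters=50):
    """Iterate PR(a) = q + (1 - q) * sum(PR(p) / C(p)) over pages p linking to a.

    links: dict page -> list of pages it links to; C(p) = len(links[p]),
    assumed > 0 for every page.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {}
        for a in pages:
            incoming = sum(pr[p] / len(links[p]) for p in pages if a in links[p])
            new[a] = q + (1 - q) * incoming
        pr = new
    return pr
```

In the graph A ↔ B with C also linking to A, page A ends up with the highest rank and C (no incoming links) the lowest.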
Relevance Using Hyperlinks
- Connections to social networking theories
that ranked prestige of people
– E.g. the President of the U.S.A. has a high prestige since many people know him
- Someone known by multiple prestigious
people has high prestige
Rethinking Search Engines
- High recall, low precision
– Many mildly relevant or irrelevant documents may be returned
– “Too much can easily become as bad as too little”
- Low or no recall, often when combinations of
keywords are used
- Results are highly sensitive to vocabulary
– A search engine does not know that “XML data” is “semi-structured data”
- Results are single web pages
– How to find information spread over various documents, e.g., a survey on the latest XML initiatives
Web Information Processing
- The amount of web content outpaces
technological progress
- Information retrieval? No, it is just a location finder
– The Web search results are often not accessible by other software tools, e.g., OLAP tools
– Search engines are relatively isolated applications
- Main goal of Web information processing:
make the Web content machine accessible
Semantic Web Initiative
- To represent Web content in a machine-
processable form and to use intelligent techniques to take advantage of these representations
– “Machine-understandable” versus machine-processable
- A fast track for R & D
– DARPA Agent Markup Language Project (DAML)
– European Union’s Sixth Framework Programme
Retrieval Effectiveness
- Information-retrieval systems save space by using index
structures that support only approximate retrieval
– False negative (false drop): some relevant documents may not be retrieved
– False positive: some irrelevant documents may be retrieved
– For many applications a good index should not permit any false drops, but may permit a few false positives
- Relevant performance metrics:
– Precision: what percentage of the retrieved documents are relevant to the query
– Recall: what percentage of the documents relevant to the query were retrieved
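These two percentages can be written out directly; the document-id sets in the example are made up:

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved ∩ relevant| / |retrieved|
       recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 documents retrieved, 3 documents actually relevant, 2 in common:
# precision = 2/4 = 0.5, recall = 2/3.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6})
```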
Retrieval Effectiveness
- Recall vs. precision tradeoff: can improve recall by
retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
- Measures of retrieval effectiveness:
– Recall as a function of number of documents fetched, or precision as a function of recall, e.g., “precision of 75% at recall of 50%, and 60% at a recall of 75%”
- Problem: it is not easy to determine which
documents are actually relevant, and which are not
Web Search Engines
- Web crawlers are programs that locate and gather
information on the Web
– Recursively follow hyperlinks present in known documents, to find
other documents, starting from a seed set of documents
– Fetched documents are
- Handed over to an indexing system
- Discarded after indexing, or stored as a cached copy
- Crawling the entire Web would take a very large amount of
time and a very large amount of space
– Search engines typically cover only a part of the Web, not all of it
– A single crawl can take months to perform
Web Crawling
- Crawling is done by multiple processes on multiple
machines, running in parallel
– Set of links to be crawled stored in a database
– New links found in crawled pages added to this set, to be crawled later
- Indexing process also runs on multiple machines
– Create a new copy of the index instead of modifying the old index
– Old index is used to answer queries
– After a crawl is “completed”, the new index becomes the “old” index
- Multiple machines used to answer queries
– Indices may be kept in memory
– Queries may be routed to different machines for load balancing
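The crawling loop itself (single process, for illustration only) is a breadth-first traversal of the link graph. Here `get_links` stands in for fetching a page and extracting its hyperlinks; a real crawler would download over HTTP and hand each fetched page to the indexing system:

```python
from collections import deque

def crawl(seed_pages, get_links, limit=100):
    """BFS crawl sketch: start from seed pages, follow links in fetched
    pages, and stop after `limit` pages have been crawled."""
    frontier = deque(seed_pages)          # set of links still to be crawled
    crawled = set()
    while frontier and len(crawled) < limit:
        page = frontier.popleft()
        if page in crawled:
            continue
        crawled.add(page)                 # "fetch" and index the page here
        for link in get_links(page):      # new links found in crawled pages
            if link not in crawled:
                frontier.append(link)
    return crawled
```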
IR and Structured Data
- IR systems originally treated documents as a
collection of words
- Information extraction systems infer structure from
documents, e.g., extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
- Relations or XML structures are used to store
extracted data
– System seeks connections among data to answer queries
Directories
- Storing related documents together in a
library facilitates browsing
– Users can see not only the requested document but also related ones
- Browsing is facilitated by classification
system that organizes logically related documents together.
- Organization is hierarchical: classification
hierarchy
Classification Hierarchy Example
Classification DAG
- Documents can reside in multiple places in
a hierarchy in an information retrieval system, since physical location is not important
- Classification hierarchy is thus Directed
Acyclic Graph (DAG)
A Classification DAG
Web Directories
- A Web directory is just a classification
directory on Web pages
– E.g. Yahoo! Directory, Open Directory project
– Issues:
- What should the directory hierarchy be?
- Given a document, which nodes of the directory are
categories relevant to the document?
– Often done manually
- Classification of documents into a hierarchy may be
done based on term similarity
Summary
- Data analysis is important in many applications – decision
support systems
- OLAP data analysis
- Data warehousing
- Data mining: association rules, classification, clustering
- Information retrieval
– Retrieve relevant documents
– Rank relevant documents
– Applications: Web, digital libraries
- Techniques
– Relevance measure
– Similarity search
– Semantic web
To-Do List
- Use the Mint-TPC data set, for each order,
collect the following information
– Customer, date, total number of items in the
order, the list of items in the order, total price
- Build a data cube using customer, date and
total number of items as dimensions, and average price per order as the measure
- Mine frequent itemsets in orders
To-Do List
- Read Chapter 19 in the textbook
- Using the email archive of this course, can