
SLIDE 1

Information Retrieval

SLIDE 2

CMPT 354: Database I -- Information Retrieval 2

What Is Clustering?

  • Group data into clusters

– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
– Unsupervised learning: no predefined classes

[Figure: two clusters (Cluster 1, Cluster 2) and a few outliers]

SLIDE 3

Application Examples

  • A stand-alone tool: explore data distribution
  • A preprocessing step for other algorithms
  • Pattern recognition, spatial data analysis, image processing, market research, WWW, …

– Cluster documents
– Cluster web log data to discover groups of similar access patterns

SLIDE 4

What Is Good Clustering?

  • High intra-class similarity and low inter-class similarity

– Depending on the similarity measure

  • The ability to discover some or all of the hidden patterns

SLIDE 5

Partitioning Algorithms: Basics

  • Partition n objects into k clusters

– Optimize the chosen partitioning criterion

  • Global optimum: examine all possible partitions

– (k^n − (k−1)^n − … − 1) possible partitions, too expensive!

  • Heuristic methods: k-means and k-medoids

– K-means: each cluster is represented by its center
– K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster

SLIDE 6

K-Means: Example

[Figure: K-means with K = 2 on a small 2-D data set. Arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign; repeat until the assignment stabilizes.]

SLIDE 7

K-means

  • Arbitrarily choose k objects as the initial cluster centers
  • Until no change, do

– (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
– Update the cluster means, i.e., calculate the mean value of the objects for each cluster
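The loop above can be sketched in a few lines of Python. This is a minimal 2-D illustration (Lloyd's algorithm with squared Euclidean distance standing in for "most similar"), not a production implementation; the input format and distance choice are assumptions.

```python
import random

def k_means(points, k, max_iter=100):
    """k-means on a list of (x, y) tuples; returns (centers, clusters)."""
    # Arbitrarily choose k objects as the initial cluster centers
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2 +
                                            (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # Update the cluster means (keep the old center if a cluster emptied)
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # no change: stop
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data the loop converges in a handful of iterations; on real data it only reaches a local optimum, as the next slide notes.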

SLIDE 8

Pros and Cons of K-means

  • Relatively efficient: O(tkn)

– n: # objects, k: # clusters, t: # iterations; k, t << n

  • Often terminates at a local optimum
  • Applicable only when the mean is defined

– What about categorical data?

  • Need to specify the number of clusters
  • Unable to handle noisy data and outliers
  • Unsuitable for discovering non-convex clusters
SLIDE 9

Information retrieval

  • Dealing with the representation, storage, organization of, and access to information items

– Information instead of just data
– “Interpret” the contents of the documents
– Rank documents according to a degree of relevance to the user query

  • The notion of relevance is at the center of information retrieval

SLIDE 10

Information Retrieval History

  • Simple information retrieval functions: book tables of contents, index cards, and traditional library management systems

– Computer-centered view: building efficient indexes
– Human-centered view: understand the behavior of the user and his information needs

  • The Web and digital libraries
SLIDE 11

Information Retrieval Systems

  • Information retrieval (IR) systems use a simpler data model than database systems

– Information organized as a collection of documents
– Documents are unstructured, no schema

  • Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents

– E.g., find documents containing the words “database systems”

  • Can be used even on textual descriptions provided with non-textual data such as images
  • Web search engines are the most familiar example of IR systems

SLIDE 12

IR versus DB

  • IR systems do not deal with transactional updates (including concurrency control and recovery)
  • Database systems deal with structured data, with schemas that define the data organization
  • IR systems deal with some querying issues not generally addressed by database systems

– Approximate searching by keywords
– Ranking of retrieved answers by estimated degree of relevance

SLIDE 13

Data and Queries

                           Structured data              Unstructured data
                           (relational data)            (e.g., free text, multimedia)

Structured queries         Relational databases;
                           XML for semi-structured
                           data

Unstructured queries       IR in DB (new direction      Information retrieval
(keywords only)            in DB research and
                           development)

SLIDE 14

Keyword Search

  • In full text retrieval, all the words in each document are considered to be keywords
  • Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not

– Ands are implicit, even if not explicitly specified
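A minimal sketch of these connectives in Python, assuming documents are plain strings and queries are bare lists of terms; `exclude` models not, and `any_of` switches from the implicit and to or:

```python
def keyword_match(doc_text, terms, exclude=(), any_of=False):
    """Full-text keyword matching: every word of the document is a keyword.
    Bare query terms are combined with an implicit AND (any_of=False) or
    with OR (any_of=True); `exclude` lists NOT-ed terms."""
    words = set(doc_text.lower().split())
    if any(t.lower() in words for t in exclude):
        return False                       # NOT: an excluded term is present
    hits = [t.lower() in words for t in terms]
    return any(hits) if any_of else all(hits)
```

Real systems evaluate such queries against an inverted index rather than scanning each document, but the connective semantics are the same.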

  • Relevance ranking is based on factors such as

– Term frequency

  • Frequency of occurrence of the query keyword in the document

– Inverse document frequency

  • How many documents the query keyword occurs in
  • Fewer documents give more importance to the keyword

– Hyperlinks to documents

  • More links to a document imply the document is more important
SLIDE 15

TF-IDF

  • Term frequency / inverse document frequency ranking
  • Let n(d) = number of terms in the document d
  • n(d, t) = number of occurrences of term t in the document d
  • Relevance of a document d to a term t:

TF(d, t) = log( 1 + n(d, t) / n(d) )

– The log factor is to avoid giving excessive weight to frequent terms

  • Relevance of a document d to a query Q:

r(d, Q) = Σ_{t ∈ Q} TF(d, t) / n(t)
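The two formulas translate directly into Python. Documents are assumed to be lists of terms, and n(t) is taken to be the number of documents containing term t (the usual reading; the slide does not spell it out):

```python
import math

def tf(doc, t):
    """TF(d, t) = log(1 + n(d, t) / n(d)), with doc a list of terms."""
    return math.log(1 + doc.count(t) / len(doc))

def relevance(doc, query, corpus):
    """r(d, Q) = sum over t in Q of TF(d, t) / n(t),
    where n(t) = number of documents in the corpus containing t."""
    score = 0.0
    for t in set(query):
        n_t = sum(1 for d in corpus if t in d)
        if n_t:                       # skip terms absent from the corpus
            score += tf(doc, t) / n_t
    return score
```

Dividing by n(t) is what makes rare query terms count for more, matching the inverse-document-frequency intuition on the previous slide.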

SLIDE 16

Relevance Ranking Using Terms

  • Most systems also consider

– Words that occur in the title, author list, section headings, etc. are given greater importance
– Words whose first occurrence is late in the document are given lower importance
– Very common words (stop words) such as “a”, “an”, “the”, “it”, etc. are eliminated
– Proximity: if the query keywords occur close together in the document, the document has higher importance than if they occur far apart

  • Documents are returned in decreasing order of relevance score

– Usually only the top few documents are returned, not all

SLIDE 17

Similarity Based Retrieval

  • Similarity-based retrieval: retrieve documents similar to a given document

– Similarity may be defined on the basis of common words: e.g., find the k terms in A with the highest TF(d, t) / n(t) and use these terms to find the relevance of other documents

  • Relevance feedback: similarity can be used to refine the answer set to a keyword query

– The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these
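The "find k terms, then score other documents" idea above can be sketched as follows. The sharing count used to rank the other documents is an assumption for illustration; real systems would score with TF-IDF weights instead:

```python
import math

def top_terms(doc, corpus, k):
    """The k terms of `doc` with the highest TF(d, t) / n(t) weight."""
    def weight(t):
        n_t = sum(1 for d in corpus if t in d)   # n(t): documents containing t
        return math.log(1 + doc.count(t) / len(doc)) / n_t
    # sort alphabetically first so ties break deterministically
    return sorted(sorted(set(doc)), key=weight, reverse=True)[:k]

def similar_docs(doc, corpus, k=3):
    """Rank the other documents by how many of doc's top terms they share."""
    terms = top_terms(doc, corpus, k)
    scored = [(sum(t in d for t in terms), d) for d in corpus if d is not doc]
    return [d for s, d in sorted(scored, key=lambda x: -x[0]) if s > 0]
```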

SLIDE 18

Vector space model

  • Define an n-dimensional space, where n is the number of words in the document set
  • The vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t_i) / n(t_i)
  • The cosine of the angle between the vectors of two documents is used as a measure of their similarity
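A small sketch of the vector space model under the same assumptions as before (documents as term lists, n(t) = number of documents containing t):

```python
import math

def doc_vector(doc, vocab, corpus):
    """One coordinate per vocabulary word: the i-th is TF(d, t_i) / n(t_i)."""
    vec = []
    for t in vocab:
        n_t = sum(1 for d in corpus if t in d)   # n(t): docs containing t
        w = math.log(1 + doc.count(t) / len(doc)) / n_t if n_t else 0.0
        vec.append(w)
    return vec

def cosine(u, v):
    """Cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Identical documents get cosine 1, documents sharing no words get cosine 0, which is why the cosine works as a similarity score.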

SLIDE 19

Relevance Using Hyperlinks

  • The number of documents relevant to a query can be enormous if only term frequencies are taken into account
  • Using term frequencies makes “spamming” easy

– E.g., a travel agency can add many occurrences of the word “travel” to its page to make its rank very high

  • People often look for pages from popular sites
  • Idea: use the popularity of a Web site (e.g., how many people visit it) to rank site pages that match given keywords

– Problem: it is hard to find the actual popularity of a site

SLIDE 20

Relevance Using Hyperlinks

  • Use the number of hyperlinks to a site as a measure of the popularity or prestige of the site

– Count only one hyperlink from each site (why?)
– The popularity measure is for the site, not for an individual page

  • But most hyperlinks are to the root of a site
  • Also, the concept of a “site” is difficult to define, since a URL prefix like cs.sfu.ca contains many unrelated pages of varying popularity

  • Refinements

– When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige

  • The definition is circular
  • Set up and solve a system of simultaneous linear equations
SLIDE 21

PageRank

  • Simulate a user navigating randomly on the Web who jumps to a random page with probability q or follows a random hyperlink with probability (1 − q)
  • C(a) is the number of outgoing links of page a
  • Page a is pointed to by pages p1 to pn:

PR(a) = q + (1 − q) Σ_{i=1}^{n} PR(p_i) / C(p_i)
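The circular definition is resolved by iterating until the scores stop changing. A sketch of that fixed-point iteration, using the slide's simplified (unnormalized) formula; the link-graph representation and iteration count are assumptions:

```python
def pagerank(links, q=0.15, iters=50):
    """links: dict mapping each page to the list of pages it points to.
    Iterates PR(a) = q + (1 - q) * sum_i PR(p_i) / C(p_i) to a fixed point."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {}
        for a in pages:
            # sum over the pages p that link to a, each divided by C(p)
            s = sum(pr[p] / len(links[p]) for p in pages if a in links[p])
            new[a] = q + (1 - q) * s
        pr = new
    return pr
```

Each sweep is one application of the linear system mentioned on the previous slide, so repeated sweeps converge to its solution.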

SLIDE 22

Relevance Using Hyperlinks

  • Connections to social networking theories that ranked the prestige of people

– E.g., the president of the U.S.A. has high prestige since many people know him

  • Someone known by multiple prestigious people has high prestige

SLIDE 23

Rethinking Search Engines

  • High recall, low precision

– Many mildly relevant or irrelevant documents may be returned
– “Too much can easily become as bad as too little”

  • Low or no recall, often when combinations of keywords are used
  • Results are highly sensitive to vocabulary

– A search engine does not know that “XML data” is “semi-structured data”

  • Results are single web pages

– How to find information spread over various documents, e.g., a survey of the latest XML initiatives?

SLIDE 24

Web Information Processing

  • The amount of web content outpaces technological progress
  • Information retrieval? No, it is a location finder

– Web search results are often not accessible by other software tools, e.g., OLAP tools
– Search engines are relatively isolated applications

  • Main goal of Web information processing: make Web content machine-accessible

SLIDE 25

Semantic Web Initiative

  • To represent Web content in a machine-processable form and to use intelligent techniques to take advantage of these representations

– “Machine-understandable” versus processable

  • A fast track for R & D

– DARPA Agent Markup Language Project (DAML)
– European Union’s Sixth Framework Programme

SLIDE 26

Retrieval Effectiveness

  • Information-retrieval systems save space by using index structures that support only approximate retrieval

– False negative (false drop): some relevant documents may not be retrieved
– False positive: some irrelevant documents may be retrieved
– For many applications a good index should not permit any false drops, but may permit a few false positives

  • Relevant performance metrics:

– Precision: what percentage of the retrieved documents are relevant to the query
– Recall: what percentage of the documents relevant to the query were retrieved
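The two metrics are one line each; a sketch, with the usual convention of treating an empty retrieved or relevant set as perfect:

```python
def precision_recall(retrieved, relevant):
    """precision: fraction of retrieved docs that are relevant;
    recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall
```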

SLIDE 27

Retrieval Effectiveness

  • Recall vs. precision tradeoff: recall can be improved by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
  • Measures of retrieval effectiveness:

– Recall as a function of the number of documents fetched, or precision as a function of recall, e.g., “precision of 75% at a recall of 50%, and 60% at a recall of 75%”

  • Problem: it is not easy to determine which documents are actually relevant and which are not

SLIDE 28

Web Search Engines

  • Web crawlers are programs that locate and gather information on the Web

– Recursively follow hyperlinks present in known documents to find other documents, starting from a seed set of documents
– Fetched documents are

  • Handed over to an indexing system
  • Discarded after indexing, or stored as a cached copy

  • Crawling the entire Web would take a very large amount of time and a huge amount of space

– Search engines typically cover only a part of the Web, not all of it
– A single crawl can take months to perform

SLIDE 29

Web Crawling

  • Crawling is done by multiple processes on multiple machines, running in parallel

– The set of links to be crawled is stored in a database
– New links found in crawled pages are added to this set, to be crawled later

  • The indexing process also runs on multiple machines

– A new copy of the index is created instead of modifying the old index
– The old index is used to answer queries
– After a crawl is “completed”, the new index becomes the “old” index

  • Multiple machines are used to answer queries

– Indices may be kept in memory
– Queries may be routed to different machines for load balancing
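The core frontier loop of a single crawler process can be sketched as below. `fetch` is a hypothetical callback that downloads a page and returns the hyperlinks found in it; real crawlers run many such loops in parallel and keep the frontier in a database, as described above:

```python
from collections import deque

def crawl(seeds, fetch, limit=1000):
    """Single-process crawl sketch: breadth-first over the link graph."""
    frontier = deque(seeds)        # set of links to be crawled
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < limit:
        url = frontier.popleft()
        links = fetch(url)          # fetched document handed to the indexer here
        crawled.append(url)
        for link in links:          # new links found, to be crawled later
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```

The `seen` set is what keeps a recursive follow of hyperlinks from looping forever on cyclic link structures.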

SLIDE 30

IR and Structured Data

  • IR systems originally treated documents as a collection of words
  • Information extraction systems infer structure from documents, e.g., extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
  • Relations or XML structures are used to store extracted data

– The system seeks connections among data to answer queries

SLIDE 31

Directories

  • Storing related documents together in a library facilitates browsing

– Users can see not only the requested document but also related ones

  • Browsing is facilitated by a classification system that organizes logically related documents together
  • The organization is hierarchical: a classification hierarchy

SLIDE 32

Classification Hierarchy Example

SLIDE 33

Classification DAG

  • Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important
  • The classification hierarchy is thus a Directed Acyclic Graph (DAG)

SLIDE 34

A Classification DAG

SLIDE 35

Web Directories

  • A Web directory is just a classification hierarchy applied to Web pages

– E.g., Yahoo! Directory, Open Directory Project
– Issues:

  • What should the directory hierarchy be?
  • Given a document, which nodes of the directory are categories relevant to the document?

– Often done manually

  • Classification of documents into a hierarchy may be done based on term similarity

SLIDE 36

Summary

  • Data analysis is important in many applications, e.g., decision support systems

  • OLAP data analysis
  • Data warehousing
  • Data mining: association rules, classification, clustering
  • Information retrieval

– Retrieve relevant documents
– Rank relevant documents
– Applications: Web, digital libraries

  • Techniques

– Relevance measures
– Similarity search
– Semantic web

SLIDE 37

To-Do List

  • Using the Mint-TPC data set, for each order, collect the following information

– Customer, date, total number of items in the order, the list of items in the order, total price

  • Build a data cube using customer, date, and total number of items as dimensions, and average price per order as the measure
  • Mine frequent itemsets in orders
SLIDE 38

To-Do List

  • Read Chapter 19 in the textbook
  • Using the email archive of this course, can you suggest

– How to find the important emails?
– How to search the relevant emails?