Web Mining - PowerPoint PPT Presentation




1

Web Mining

2

What is Web Mining?

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services

(Etzioni, 1996, CACM 39(11))

3

What is Web Mining?

Motivation / Opportunity

The WWW is a huge, widely distributed, global information service centre and, therefore, constitutes a rich source for data mining

Personalization, recommendation engines
Web-commerce applications
Building the Semantic Web
Intelligent Web search
Hypertext classification and categorization
Information / trend monitoring
Analysis of online communities

4

The Web

Over 1 billion HTML pages, 15 terabytes
Wealth of information
  Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, ...
Diverse media types: text, images, audio, video
Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3
Highly dynamic
  1 million new pages each day
  The average page changes in a few weeks
Graph structure with links between pages
  The average page has 7-10 links
  In-links and out-links follow a power-law distribution
Hundreds of millions of queries per day


5

Abundance and authority crisis

Liberal and informal culture of content generation and dissemination
Redundancy and non-standard form and content
Millions of qualifying pages for most broad queries
  Example: java or kayaking
No authoritative information about the reliability of a site
Little support for adapting to the background of specific users

6

How do you suggest we could estimate the size of the web?

7

One Interesting Approach

The number of web servers was estimated by sampling and testing random IP addresses, and determining the fraction of such tests that successfully located a web server.
The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment.
Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.
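The arithmetic behind this two-step estimate can be sketched as follows; the sampling numbers below are invented for illustration, not the figures from the study:

```python
# Sketch of the two-step size estimate: (number of servers, from random IP
# sampling) x (average pages per server, from crawling a server sample).
# All input numbers below are illustrative assumptions.
IP_SPACE = 2**32            # total IPv4 addresses that could be sampled

def estimate_web_size(ips_tested, servers_found, pages_crawled, servers_crawled):
    """Return an estimate of the total number of pages on the web."""
    server_fraction = servers_found / ips_tested     # fraction of IPs hosting a server
    n_servers = server_fraction * IP_SPACE           # scaled to the whole IP space
    avg_pages = pages_crawled / servers_crawled      # pages per server in the crawl sample
    return n_servers * avg_pages

# e.g. 1 server per 1000 tested IPs, 300 pages per sampled server on average
print(estimate_web_size(1_000_000, 1_000, 30_000, 100))
```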

8

The Web

The Web is a huge collection of documents, plus:
  Hyper-link information
  Access and usage information
Lots of data on user access patterns
  Web logs contain sequences of URLs accessed by users
Challenge: develop new Web mining algorithms, and adapt traditional data mining algorithms, to exploit hyper-links and access patterns


9

Applications of web mining

E-commerce (infrastructure)
  Generate user profiles -> improve customization and provide users with pages and advertisements of interest
  Targeted advertising -> ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites; Internet advertising is probably the "hottest" web mining application today
  Fraud -> maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought); if the buying pattern changes significantly, signal fraud
Network management
  Performance management -> annual bandwidth demand is increasing ten-fold, but on average annual bandwidth supply is rising only by a factor of three; the result is frequent congestion. During a major event (e.g., the World Cup), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  Fault management -> analyze alarm and traffic data to carry out root-cause analysis of faults

10

Applications of web mining

Information retrieval (Search) on the Web

Automated generation of topic hierarchies
Web knowledge bases

11

Why is Web Information Retrieval Important?

According to most predictions, the majority of human information will be available on the Web in ten years
Effective information retrieval can aid in:
  Research: find all papers about web mining
  Health/Medicine: what could be the reason for symptoms of "yellow eyes", high fever and frequent vomiting?
  Travel: find information on the tropical island of St. Lucia
  Business: find companies that manufacture digital signal processors
  Entertainment: find all movies starring Marilyn Monroe between 1960 and 1970
  Arts: find all short stories written by Jhumpa Lahiri

12

Why is Web Information Retrieval Difficult?

The abundance problem (99% of information is of no interest to 99% of people)
  Hundreds of irrelevant documents returned in response to a search query
Limited coverage of the Web (Internet sources hidden behind search interfaces)
  The largest crawlers cover less than 18% of Web pages
The Web is extremely dynamic
  Lots of pages added, removed and changed every day
Very high dimensionality (thousands of dimensions)
Limited query interface based on keyword-oriented search
Limited customization to individual users


13

http://www.searchengineshowdown.com/stats/size.shtml

Search Engine Relative Size

14

Search Engine Web Coverage Overlap

From http://www.searchengineshowdown.com/stats/overlap.shtml
Coverage: about 40% in 1999
4 searches were defined that returned 141 web pages.

15

End Of Size Wars? Google Says Most Comprehensive But Drops Home Page Count
http://searchenginewatch.com/searchday/article.php/3551586
By Danny Sullivan, Editor, September 27, 2005
How do you measure comprehensiveness?
  Rare words
  The duplicate content issue
  Counting pages indexed per site

16

Web Mining Taxonomy

Web Mining:
  Web Content Mining
  Web Structure Mining
  Web Usage Mining


17

Web Mining Taxonomy

Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion (text mining)
Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality that is available in the form of hyperlinks
Web usage mining: focuses on techniques to study user behaviour when navigating the web (also known as Web log mining and clickstream analysis)

18

Web Content Mining

Examines the content of web pages as well as results of web searching.

19

Web Content Mining

Can be thought of as extending the work performed by basic search engines
Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
Web content mining is the process of extracting knowledge from web contents

20

Semi-Structured Data

Content is, in general, semi-structured
Example (a document record):
  Title, Author, Publication_Date, Length, Category: structured attribute/value pairs
  Abstract, Content: unstructured


21

Structuring Textual Information

Many methods are designed to analyze structured data
If we can represent documents by a set of attributes we will be able to use existing data mining methods
How to represent a document?
  Vector-based representation (referred to as "bag of words" as it is invariant to permutations)
Use statistics to add a numerical dimension to unstructured text:
  Term frequency
  Document frequency
  Document length
  Term proximity

22

Document Representation

A document representation aims to capture what the document is about
One possible approach:
  Each entry describes a document
  Attributes describe whether or not a term appears in the document
Example: a binary term-document table with terms (Digital Camera, Memory, ..., Pixel) as columns and documents as rows; an entry is 1 when the document contains the term

23

Document Representation

Another approach:
  Each entry describes a document
  Attributes represent the frequency with which a term appears in the document
Example: a term frequency table, with raw term counts per document in place of the 0/1 indicators

24

Document Representation

But a term is mentioned more times in longer documents
Therefore, use relative frequency (% of document):
  No. of occurrences / No. of words in the document
Example: the same table with each count divided by the document length (entries such as 0.03, 0.02, 0.01, ...)
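The step from counts to relative frequencies can be sketched in a few lines; the two toy documents and the small vocabulary below are invented for illustration:

```python
# Bag-of-words sketch: relative term frequency = count / document length.
# Documents and vocabulary are invented toy data.
doc1 = "digital camera with digital zoom and large memory"
doc2 = "print photos from memory card print print"

def rel_freq(doc, vocab):
    """Map each vocabulary term to its relative frequency in the document."""
    words = doc.split()
    return {t: words.count(t) / len(words) for t in vocab}

vocab = ["digital", "memory", "print"]
print(rel_freq(doc1, vocab))   # e.g. "digital" occurs 2 times in 8 words -> 0.25
print(rel_freq(doc2, vocab))
```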


25

More on Document Representation

Stop word removal: many words are not informative and thus irrelevant for document representation
  the, and, a, an, is, of, that, ...
Stemming: reducing words to their root form (reduces dimensionality)
  A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword fishing
  Different words that share the same word stem should be represented by the stem (fish) instead of the actual word
For the Portuguese language these techniques are less studied
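A toy version of these two preprocessing steps; the stop-word list follows the slide, while the suffix-stripping rules are a crude invented stand-in for a real stemmer such as Porter's:

```python
# Stop-word removal plus naive suffix stripping (illustrative only; real
# systems use a proper stemmer, e.g. the Porter stemmer).
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

def stem(word):
    """Strip a few common suffixes, keeping at least a 3-letter root."""
    for suffix in ("ers", "ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("The fisher is fishing and catches fishes"))
```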

26

Weighting Scheme for Term Frequencies

TF-IDF weighting: give higher weight to terms that are rare
  TF: term frequency (increases the weight of frequent terms)
  IDF: inverse document frequency (if a term is frequent in lots of documents it has no discriminative power)

For a given term j and document i:
  n_ij is the number of occurrences of term j in document i
  n_i is the number of words in document i
  n is the number of documents
  d_j is the number of documents that contain term j

  TF_ij = n_ij / n_i
  IDF_j = log(n / d_j)
  w_ij = TF_ij × IDF_j

There is no compelling motivation for this method but it has been shown to be superior to other methods
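The weighting can be sketched directly from these definitions; the three toy documents below are invented:

```python
import math

# TF-IDF sketch: TF = term count / document length, IDF = log(n / d_j),
# weight = TF * IDF. The toy corpus is invented.
def tf_idf(docs):
    n = len(docs)
    df = {}                                  # d_j: documents containing term j
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {t: (doc.count(t) / len(doc)) * math.log(n / df[t])
             for t in set(doc)}
        weights.append(w)
    return weights

docs = [["web", "mining", "web"], ["data", "mining"], ["web", "search"]]
w = tf_idf(docs)
print(w[0]["web"])    # "web" is frequent across documents -> dampened weight
print(w[1]["data"])   # "data" appears in one document only -> higher IDF
```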

27

Locating Relevant Documents

Given a set of keywords:
  Use a similarity/distance measure to find similar/relevant documents
  Rank documents by their relevance/similarity
How to determine if two documents are similar?

28

Distance Based Matching

In order to retrieve documents similar to a given document we need a measure of similarity
Euclidean distance (an example of a metric distance):
The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

  D(X, Y) = sqrt( Σ_{i=1..n} (xi − yi)² )

Properties of a metric distance:
  D(X, X) = 0
  D(X, Y) = D(Y, X)
  D(X, Z) + D(Z, Y) ≥ D(X, Y)
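A minimal sketch of this distance for plain term vectors (the example vectors are invented):

```python
import math

# Euclidean distance between two equal-length term vectors.
def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

a = [1.0, 0.0, 2.0]
b = [1.0, 3.0, 2.0]
print(euclidean(a, b))   # sqrt(0 + 9 + 0) -> 3.0
print(euclidean(a, a))   # D(X, X) = 0
```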

29

Angle Based Matching

Cosine of the angle between the vectors representing the document and the query
Documents "in the same direction" are closely related
Transforms the angular measure into a measure ranging from 1 for the highest similarity to 0 for the lowest

  D(X, Y) = cos(X, Y) = (Xᵀ · Y) / (|X| · |Y|) = Σ xi·yi / sqrt( Σ xi² · Σ yi² )
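The cosine measure can be sketched the same way (the example vectors are invented):

```python
import math

# Cosine similarity: dot product divided by the product of vector norms.
def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm

print(cosine([1, 2, 0], [2, 4, 0]))   # parallel vectors: similarity ≈ 1
print(cosine([1, 0, 0], [0, 1, 0]))   # orthogonal vectors -> 0.0
```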

30

Performance Measure

The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of a collection can be compared by the two following measures:

  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
    the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
    the percentage of documents that are relevant to the query and were, in fact, retrieved

(Venn diagram: all documents, retrieved documents, relevant documents, relevant & retrieved)

31
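Both measures are one-liners over sets of document ids; the retrieved and relevant sets below are invented:

```python
# Precision and recall over document-id sets, following the definitions above.
def precision_recall(retrieved, relevant):
    hit = retrieved & relevant                     # relevant AND retrieved
    return len(hit) / len(retrieved), len(hit) / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5, 6, 7, 8}
p, r = precision_recall(retrieved, relevant)
print(p, r)   # precision 2/4, recall 2/6
```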

Text Mining

Document classification
Document clustering
Keyword-based association rules

32

Web Search

Domain-specific search engines

www.buildingonline.com www.lawcrawler.com www.drkoop.com (medical)

Meta-searching

Connects to multiple search engines and combines the search results
www.metacrawler.com www.dogpile.com www.37.com


33

Web Search

Post-retrieval analysis and visualization

www.vivisimo.com www.tumba.pt www.kartoo.com

Natural language processing

www.askjeeves.com

Search Agents

Instead of storing a search index, search agents can perform real-time searches on the Web.

Fresher data, but slower response time and lower coverage.

34

Focused Crawling

(Diagram: a breadth-first crawl visits pages 1-7 in breadth order from root R; a focused crawl prunes off-topic branches, marked X, and visits only on-topic pages)

Threshold: a page is on-topic if its correlation to the closest centroid is above this value
Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than this value

35

Database Approaches

One approach is to build a local knowledge base: model data on the web and integrate it in a way that enables specifically designed query languages to query the data
Store locally abstract characterizations of web pages; a query language enables querying the local repository at several levels of abstraction. As a result of a query, the system may have to request pages from the web if more detail is needed
Zaiane, O. R. and Han, J. (2000). WebML: Querying the world-wide web for resources and knowledge. In Proc. Workshop on Web Information and Data Management, pages 9–12.

36

Agent-Based Approach

Agents search for relevant information using domain characteristics and user profiles
Example: a system for extracting a relation from the web, such as a list of all the books referenced on the web. The system is given a set of training examples, which are used to search the web for similar documents. Another application of this tool could be to build a relation with the names and addresses of the restaurants referenced on the web.
Brin, S. (1998). Extracting patterns and relations from the world wide web. In Int. Workshop on Web and Databases, pages 172–183.


37

Web Structure Mining

Exploiting Hyperlink Structure

38

First generation of search engines

Early days: keyword-based searches
  Keywords: "web mining" retrieves documents containing "web" and "mining"
Later on: cope with
  the synonymy problem
  the polysemy problem
  stop words
Common characteristic: only information on the pages themselves is used

39

Modern search engines

Link structure is very important:
  Adding a link is a deliberate act
  It is harder to fool systems that use in-links
  A link is a "quality mark"
Modern search engines use link structure as an important source of information

40

Central Question:

What useful information can be derived from the link structure of the web?


41

Some answers

1. Structure of the Internet
2. Google
3. HITS: Hubs and Authorities

42

1. The Web Structure

A study was conducted on a graph inferred from two large AltaVista crawls.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web. In Proc. WWW Conference.
The study confirmed the hypothesis that the number of in-links and out-links of a page approximately follows a Zipf distribution (a particular case of a power law)

43

Power Laws

44

In-Links


45

Out-Links

46

1. The Web Structure

If the web is treated as an undirected graph, 90% of the pages form a single connected component
If the web is treated as a directed graph, four distinct components are identified, all four of similar size

47

General Topology

(Bow-tie diagram: SCC 56M pages, IN 44M, OUT 44M, Tendrils 44M, plus tubes and disconnected components)

SCC: set of pages that can reach one another
IN: pages that have a path to the SCC but not from it
OUT: pages that can be reached from the SCC but cannot reach it
TENDRILS: pages that can neither reach nor be reached from the SCC pages

48

Some statistics

A connecting path exists between only about 25% of page pairs
BUT, if there is a path:
  Directed: average length < 17
  Undirected: average length < 7 (!!!)
It's a "small world": between two people there is a chain of only length 6!
Small world graphs:
  High number of relatively small cliques
  Small diameter
The Internet (SCC) is a small world graph


49

2. Google

• A search engine that uses link structure to calculate a quality ranking (PageRank) for each page
• Intuition: PageRank can be seen as the probability that a "random surfer" visits a page
• Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc. WWW Conference, pages 107–117.
• Keywords w entered by user
• Select pages containing w and pages that have in-links with caption w
• Anchor text
  • Provides more accurate descriptions of Web pages
  • Anchors exist for un-indexable documents (e.g., images)
• Font sizes of words in text: words in a larger or bolder font are assigned higher weights
• Rank pages according to importance

50

PageRank

Link i → j:
  i considers j important
  the more important i is, the more important j becomes
  if i has many out-links, each of its links counts for less

Initially all importances pi = 1; iteratively, pi is refined.

PageRank: a page is important if many important pages link to it.

  PageRank(j) = (1 − p) + p · Σ_{i → j} PageRank(i) / OutDegree(i)

(PageRank) + (Website Content) = Overall Rank in Results

51

PageRank

Let OutDegree(i) = number of out-links of page i
Adjust pj: the weighted sum of the importance of the pages referring to pj
(1 − p) is the probability that the surfer gets bored and starts on a new random page; p is the probability that the random surfer follows a link on the current page

  PageRank(j) = (1 − p) + p · Σ_{i → j} PageRank(i) / OutDegree(i)

52

PageRank

Repeat until the PageRank vector converges…
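The iteration can be sketched as a simple fixed-point computation over a small invented link graph; p = 0.85 is an assumed damping value, not one stated on the slides:

```python
# PageRank power iteration following the slides' formula:
#   PR(j) = (1 - p) + p * sum over links i -> j of PR(i) / OutDegree(i)
# The three-page link graph below is invented for illustration.
def pagerank(links, p=0.85, iters=50):
    pages = list(links)
    pr = {page: 1.0 for page in pages}       # initially all importances = 1
    for _ in range(iters):
        new = {}
        for j in pages:
            incoming = sum(pr[i] / len(links[i]) for i in pages if j in links[i])
            new[j] = (1 - p) + p * incoming
        pr = new                             # refine until it converges
    return pr

links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
pr = pagerank(links)
print(sorted(pr, key=pr.get, reverse=True))  # C has the most incoming weight
```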


53

3. HITS (Hyperlink-Induced Topic Search)

HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5): 604–632.
Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
  Authorities: highly-referenced pages on a topic
  Hubs: pages that "point" to authorities
A good authority is pointed to by many good hubs; a good hub points to many good authorities

54

Hubs and Authorities

(Diagram: hub pages pointing to authority pages)
Authorities are targets of hub pages
Hub pages point to interesting links to authorities (= relevant pages)

55

HITS

Steps for discovering hubs and authorities on a specific topic:
  Collect a seed set of pages S (returned by a search engine)
  Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set (removing links inside a site)
  Iteratively update the hub weight h(p) and authority weight a(p) for each page:

    a(p) = Σ_{q → p} h(q)        h(p) = Σ_{p → q} a(q)

  After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
Extensions proposed in Clever:
  Assign links different weights based on the relevance of the link anchor text
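The update rule can be sketched as follows; the tiny link graph is invented, and the per-round normalisation is a common practical addition rather than something stated on the slide:

```python
# HITS iteration: a(p) = sum of h(q) over links q -> p,
#                 h(p) = sum of a(q) over links p -> q.
# Weights are normalised each round to keep them bounded (an assumption,
# common in practice). The link graph is invented.
def hits(links, iters=30):
    pages = list(links)
    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        na, nh = sum(auth.values()), sum(hub.values())
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# h1 points to both authorities, h2 to one: h1 is the better hub,
# a1 the better authority.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```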

56

Applications of HITS

Search engine querying
Finding web communities
Finding related pages
Populating categories in web directories
Citation analysis


57

Web Usage Mining

analyzing user web navigation

58

Web Usage Mining

Pages contain information; links are "roads"
How do people navigate over the Internet? ⇒ Web usage mining (clickstream analysis)
Information on navigation paths is available in log files
Logs can be examined from either a client or a server perspective

59

Website Usage Analysis

Why analyze Website usage?
Knowledge about how visitors use a Website can:
  Provide guidelines for web site reorganization
  Help prevent disorientation
  Help designers place important information where the visitors look for it
  Support pre-fetching and caching of web pages
  Provide an adaptive Website (personalization)
Questions that could be answered:
  What are the differences in usage and access patterns among users?
  Which user behaviors change over time?
  How do usage patterns change with quality of service (slow/fast)?
  What is the distribution of network traffic over time?

60

Website Usage Analysis


61

Data Sources

62

Data Sources

Server level collection: the server stores data regarding the requests performed by clients, so the data generally regards just one source
Client level collection: the client itself sends information regarding the user's behaviour to a repository (this can be implemented using a remote agent, such as JavaScript or Java applets, or by modifying the source code of an existing browser, such as Mosaic or Mozilla, to enhance its data collection capabilities)
Proxy level collection: information is stored at the proxy side, so the Web data covers several Websites, but only for users whose Web clients pass through the proxy

63

An Example of a Web Server Log

64

Analog – Web Log File Analyser

Gives basic statistics such as:
  number of hits
  average hits per time period
  what the popular pages in your site are
  who is visiting your site
  what keywords users are searching for to get to you
  what is being downloaded
http://www.analog.cx/


65

Web Usage Mining Process

(Diagram: Web Server Log → Data Preparation → Clean Data → Data Mining → Usage Patterns, with Site Data as an additional input)

66

Data Preparation

Data cleaning
  By checking the suffix of the URL name: for example, remove all log entries with filename suffixes such as gif, jpeg, etc.
User identification
  If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
  Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users
Transaction identification
  All of the page references made by a user during a single visit to a site
  The size of a transaction can range from a single page reference to all of the page references in the visit
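The suffix-based cleaning step can be sketched in a few lines; the log entries and the suffix list below are illustrative:

```python
# Log cleaning by URL suffix: drop requests for image files, as the
# data-preparation step suggests. Entries and suffixes are invented.
IMAGE_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png")

def clean(log_entries):
    """Keep only entries whose URL does not end in an image suffix."""
    return [e for e in log_entries
            if not e["url"].lower().endswith(IMAGE_SUFFIXES)]

log = [{"ip": "1.2.3.4", "url": "/index.html"},
       {"ip": "1.2.3.4", "url": "/logo.gif"},
       {"ip": "1.2.3.4", "url": "/products.html"}]
print([e["url"] for e in clean(log)])   # image request removed
```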

67

Sessionizing

Main questions:
  how to identify unique users
  how to identify/define a user transaction
Problems:
  user ids are often suppressed due to security concerns
  individual IP addresses are sometimes hidden behind proxy servers
  client-side and proxy caching make server log data less reliable
Standard solutions/practices:
  user registration: practical????
  client-side cookies: not foolproof
  cache busting: increases network traffic

68

Sessionizing

Time oriented
  By total duration of session: not more than 30 minutes
  By page stay times (good for short sessions): not more than 10 minutes per page
Navigation oriented (good for short sessions and when timestamps are unreliable)
  Referrer is the previous page in the session, or
  Referrer is undefined but the request is within 10 secs, or
  There is a link from the previous to the current page in the web site
The task of identifying the sequence of requests from a user is not trivial: see Berendt et al., Measuring the Accuracy of Sessionizers for Web Usage Analysis, SIAM-DM01
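A sketch of the page-stay-time heuristic: the 10-minute timeout follows the slide, while the request list is invented:

```python
from datetime import datetime, timedelta

# Time-oriented sessionizing: start a new session when the gap between
# consecutive requests from the same user exceeds a page-stay timeout.
def sessionize(requests, timeout=timedelta(minutes=10)):
    """requests: list of (timestamp, url) sorted by time, for one user."""
    sessions, current = [], []
    for ts, url in requests:
        if current and ts - current[-1][0] > timeout:
            sessions.append(current)     # gap too long: close the session
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2005, 1, 1, 12, 0)
reqs = [(t0, "/"), (t0 + timedelta(minutes=2), "/a"),
        (t0 + timedelta(minutes=40), "/b")]
print(len(sessionize(reqs)))   # 38-minute gap before /b -> 2 sessions
```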


69

Web Usage Mining

Commonly used approaches:
  Preprocessing data and adapting existing data mining techniques
    For example, association rules do not take into account the order of the page requests
  Developing novel data mining models

70

Association Rules

Find frequent patterns/associations/correlations among sets of items
Find correlations between pages not directly connected
Reveal associations between groups of users with specific interests
e.g.: /events/ski.html, travel/ski_resorts.html → /equipment/ski_boots.html (85%, 3%)
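The two numbers attached to such a rule (confidence, support) can be computed as follows; the sessions and page names below are invented, simplified stand-ins:

```python
# Support and confidence of an association rule over page-sets per session.
# Sessions and page names are invented toy data.
def rule_stats(sessions, antecedent, consequent):
    """Return (support, confidence) for antecedent -> consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for s in sessions if a <= s)       # sessions with the antecedent
    n_both = sum(1 for s in sessions if both <= s) # sessions with all the pages
    support = n_both / len(sessions)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence

sessions = [frozenset(s) for s in (
    {"/ski.html", "/resorts.html", "/boots.html"},
    {"/ski.html", "/resorts.html"},
    {"/news.html"},
)]
print(rule_stats(sessions, {"/ski.html", "/resorts.html"}, {"/boots.html"}))
```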

71

Clustering

Group together items with similar characteristics:
  user clusters (similar navigational behaviour)
  page clusters (groups of pages conceptually related)
72

An Example of Preprocessing Data and Adapting Existing Data Mining Techniques

Chen, M.-S., Park, J. S., and Yu, P. S. (1998). Efficient data mining for traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2): 209–221.

The log data is converted into a tree, from which a set of maximal forward references is inferred. The maximal forward references are then processed by existing association rules techniques. Two algorithms are given to mine for the rules, which in this context consist of large itemsets with the additional restriction that references must be consecutive in a transaction.


73

Mining Navigation Patterns

Each session induces a user trail through the site
A trail is a sequence of web pages followed by a user during a session, ordered by time of access
A pattern in this context is a frequent trail
Co-occurrence of web pages is important, e.g. shopping basket and checkout
Use a Markov chain model, inferred from log data, to model the user navigation records

74

Ngram Model

We make use of the Ngram concept in order to improve the model's accuracy in representing user sessions. The Ngram model assumes that only the previous n−1 visited pages have a direct effect on the probability of the next page chosen.
A state corresponds to a navigation trail with n−1 pages
A chi-square test is used to assess the order of the model (in most cases N = 3 is enough)
Experiments have shown that the number of states is manageable
75

Ngram Model

Example user trails:
  A1→A2→A3→A4
  A1→A5→A3→A4
  A5→A2→A4→A6
  A5→A2→A3
  A5→A2→A3→A6
  A4→A1→A5→A3

76

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: states S, A, B, C, F with the number of traversals on each link; S and F are artificial start and final states)


77

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: traversal counts updated as more streams are processed, now including state D)

78

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: traversal counts after further streams are processed)

79

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: final model with traversal counts and transition probabilities in parentheses: S→A 3 (0.5), S→E 3 (0.5), A→B 3 (1), E→B 3 (1), B→C 3 (0.5), B→D 3 (0.5), C→F 3 (1), D→F 3 (1))
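The counting scheme with artificial start and final states can be sketched as follows, using the input streams from the slides:

```python
from collections import defaultdict

# First-order Markov model: count traversals between pages, with artificial
# start (S) and final (F) states, then normalise into transition probabilities.
def first_order_model(streams):
    counts = defaultdict(lambda: defaultdict(int))
    for stream in streams:
        trail = ["S"] + stream + ["F"]           # add artificial states
        for i, j in zip(trail, trail[1:]):
            counts[i][j] += 1
    probs = {i: {j: c / sum(out.values()) for j, c in out.items()}
             for i, out in counts.items()}
    return counts, probs

streams = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"],
           ["E", "B", "D"], ["E", "B", "C"], ["E", "B", "D"]]
counts, probs = first_order_model(streams)
print(probs["S"]["A"])   # 3 of the 6 sessions start at A -> 0.5
print(probs["B"]["C"])   # B -> C in 3 of 6 traversals -> 0.5
```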

80

Second Order Evaluation

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: the same first-order model)

In the data, P(C | A,B) = 0.67, but the first-order model gives P(C | B) = 0.5: not accurate


81

Cloning

Duplicate states to separate in-links whose second-order probabilities diverge (state B cloned into B' based on link A,B)

(Diagram: states S, A, E, B, B', C, D, F; the clone reached from A has P(C) = 0.67 and P(D) = 0.33, while the clone reached from E has P(C) = 0.33 and P(D) = 0.67)

The numbers of traversals are updated according to the input data.

82

Clustering-Based Cloning

In cases where a state has more than two in-links we use clustering to assign in-links to clones.
We use a state accuracy parameter, which sets the maximum admissible difference between corresponding first- and second-order probabilities.

83

Clustering-Based Cloning

Input sessions (occurrences): A,B,C ×6; A,B,D ×3; E,B,C ×7; E,B,D ×4; G,B,C ×4; G,B,D ×7; H,B,C ×3; H,B,D ×6

(Diagram: first-order model with a single state B: B→C 20 (0.5), B→D 20 (0.5): not accurate)

Second-order probabilities:
  P(D|H,B) = 0.67   P(C|H,B) = 0.33
  P(D|G,B) = 0.64   P(C|G,B) = 0.36
  P(D|E,B) = 0.36   P(C|E,B) = 0.64
  P(D|A,B) = 0.33   P(C|A,B) = 0.67

Clustering-Based Cloning

(Diagram: B cloned into B and B'; in-links with similar second-order behaviour are assigned to the same clone, and each clone's out-link probabilities approximate the second-order probabilities of its in-links)


85

Clustering-Based Cloning

(Diagram: final clustered model; one clone has P(C) = 0.65 with 13 traversals and P(D) = 0.35 with 7, the other has P(C) = 0.35 with 7 traversals and P(D) = 0.65 with 13)

A nice trade-off between the number of states and accuracy

86

Applications of Markov Models

Provide guidelines for the optimisation of a web site's structure
Work as a model of the user's preferences in the creation of adaptive web sites
Improve search engine technologies by enhancing the random-surfer concept
Web personal assistant
Visualisation tool
Use the model to learn access patterns and predict future accesses:
  Pre-fetch predicted pages to reduce latency
  Also cache results of popular search engine queries

87

Summary

The Web is huge and dynamic
Web mining makes use of data mining techniques to automatically discover and extract information from Web documents/services:
  Web content mining
  Web structure mining
  Web usage mining
Semantic web: "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." – Tim Berners-Lee, James Hendler, Ora Lassila

88

References

Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)
Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti (Morgan Kaufmann Publishers)


89

Thank you!!!