Web Mining - PowerPoint PPT Presentation




1

Web Mining

2

What is Web Mining?

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services

(Etzioni, 1996, CACM 39(11))

3

What is Web Mining?

Motivation / Opportunity

The WWW is a huge, widely distributed, global information service centre and, therefore, constitutes a rich source for data mining

Personalization, recommendation engines
Web-commerce applications
Building the Semantic Web
Intelligent Web search
Hypertext classification and categorization
Information / trend monitoring
Analysis of online communities

4

The Web

Over 1 billion HTML pages, 15 terabytes
Wealth of information
  Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, ...
Diverse media types: text, images, audio, video
Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3
Highly dynamic
  1 million new pages each day
  The average page changes in a few weeks
Graph structure with links between pages
  The average page has 7-10 links
  In-links and out-links follow a power-law distribution
Hundreds of millions of queries per day


5

Abundance and authority crisis

Liberal and informal culture of content generation and dissemination
Redundancy and non-standard form and content
Millions of qualifying pages for most broad queries
  Example: java or kayaking
No authoritative information about the reliability of a site
Little support for adapting to the background of specific users

6

How do you suggest we could estimate the size of the web?

7

One Interesting Approach

The number of web servers was estimated by sampling and testing random IP addresses, and determining the fraction of such tests that successfully located a web server.
The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment.
Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.
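The arithmetic behind this two-step estimate can be sketched as follows; the sampling numbers below are invented for illustration, not the figures from the study:

```python
# Sketch of the two-step size estimate: (number of servers, from random IP
# sampling) x (average pages per server, from crawling a server sample).
# All input numbers below are illustrative assumptions.
IP_SPACE = 2**32            # total IPv4 addresses that could be sampled

def estimate_web_size(ips_tested, servers_found, pages_crawled, servers_crawled):
    """Return an estimate of the total number of pages on the web."""
    server_fraction = servers_found / ips_tested     # fraction of IPs hosting a server
    n_servers = server_fraction * IP_SPACE           # scaled to the whole IP space
    avg_pages = pages_crawled / servers_crawled      # pages per server in the crawl sample
    return n_servers * avg_pages

# e.g. 1 server per 1000 tested IPs, 300 pages per sampled server on average
print(estimate_web_size(1_000_000, 1_000, 30_000, 100))
```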

8

The Web

The Web is a huge collection of documents, plus:
  Hyper-link information
  Access and usage information
Lots of data on user access patterns
  Web logs contain sequences of URLs accessed by users
Challenge: develop new Web mining algorithms, and adapt traditional data mining algorithms, to exploit hyper-links and access patterns


9

Applications of web mining

E-commerce (infrastructure)
  Generate user profiles -> improve customization and provide users with pages and advertisements of interest
  Targeted advertising -> ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites; Internet advertising is probably the "hottest" web mining application today
  Fraud -> maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought); if the buying pattern changes significantly, signal fraud
Network management
  Performance management -> annual bandwidth demand is increasing ten-fold, but on average annual bandwidth supply is rising only by a factor of three; the result is frequent congestion. During a major event (e.g., the World Cup), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  Fault management -> analyze alarm and traffic data to carry out root-cause analysis of faults

10

Applications of web mining

Information retrieval (Search) on the Web

Automated generation of topic hierarchies
Web knowledge bases

11

Why is Web Information Retrieval Important?

According to most predictions, the majority of human information will be available on the Web in ten years
Effective information retrieval can aid in:
  Research: find all papers about web mining
  Health/Medicine: what could be the reason for symptoms of "yellow eyes", high fever and frequent vomiting?
  Travel: find information on the tropical island of St. Lucia
  Business: find companies that manufacture digital signal processors
  Entertainment: find all movies starring Marilyn Monroe between 1960 and 1970
  Arts: find all short stories written by Jhumpa Lahiri

12

Why is Web Information Retrieval Difficult?

The abundance problem (99% of information is of no interest to 99% of people)
  Hundreds of irrelevant documents returned in response to a search query
Limited coverage of the Web (Internet sources hidden behind search interfaces)
  The largest crawlers cover less than 18% of Web pages
The Web is extremely dynamic
  Lots of pages added, removed and changed every day
Very high dimensionality (thousands of dimensions)
Limited query interface based on keyword-oriented search
Limited customization to individual users


13

http://www.searchengineshowdown.com/stats/size.shtml

Search Engine Relative Size

14

Search Engine Web Coverage Overlap

From http://www.searchengineshowdown.com/stats/overlap.shtml
Coverage: about 40% in 1999
4 searches were defined that returned 141 web pages.

15

End Of Size Wars? Google Says Most Comprehensive But Drops Home Page Count
http://searchenginewatch.com/searchday/article.php/3551586
By Danny Sullivan, Editor, September 27, 2005
How do you measure comprehensiveness?
  Rare words
  The duplicate content issue
  Counting pages indexed per site

16

Web Mining Taxonomy

Web Mining:
  Web Content Mining
  Web Structure Mining
  Web Usage Mining


17

Web Mining Taxonomy

Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion (text mining)
Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality that is available in the form of hyperlinks
Web usage mining: focuses on techniques to study user behaviour when navigating the web (also known as Web log mining and clickstream analysis)

18

Web Content Mining

Examines the content of web pages as well as results of web searching.

19

Web Content Mining

Can be thought of as extending the work performed by basic search engines
Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
Web content mining is the process of extracting knowledge from web contents

20

Semi-Structured Data

Content is, in general, semi-structured
Example (a document record):
  Title, Author, Publication_Date, Length, Category: structured attribute/value pairs
  Abstract, Content: unstructured


21

Structuring Textual Information

Many methods are designed to analyze structured data
If we can represent documents by a set of attributes we will be able to use existing data mining methods
How to represent a document?
  Vector-based representation (referred to as "bag of words" as it is invariant to permutations)
Use statistics to add a numerical dimension to unstructured text:
  Term frequency
  Document frequency
  Document length
  Term proximity

22

Document Representation

A document representation aims to capture what the document is about
One possible approach:
  Each entry describes a document
  Attributes describe whether or not a term appears in the document
Example: a binary term-document table with terms (Digital Camera, Memory, ..., Pixel) as columns and documents as rows; an entry is 1 when the document contains the term

23

Document Representation

Another approach:
  Each entry describes a document
  Attributes represent the frequency with which a term appears in the document
Example: a term frequency table, with raw term counts per document in place of the 0/1 indicators

24

Document Representation

But a term is mentioned more times in longer documents
Therefore, use relative frequency (% of document):
  No. of occurrences / No. of words in the document
Example: the same table with each count divided by the document length (entries such as 0.03, 0.02, 0.01, ...)
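The step from counts to relative frequencies can be sketched in a few lines; the two toy documents and the small vocabulary below are invented for illustration:

```python
# Bag-of-words sketch: relative term frequency = count / document length.
# Documents and vocabulary are invented toy data.
doc1 = "digital camera with digital zoom and large memory"
doc2 = "print photos from memory card print print"

def rel_freq(doc, vocab):
    """Map each vocabulary term to its relative frequency in the document."""
    words = doc.split()
    return {t: words.count(t) / len(words) for t in vocab}

vocab = ["digital", "memory", "print"]
print(rel_freq(doc1, vocab))   # e.g. "digital" occurs 2 times in 8 words -> 0.25
print(rel_freq(doc2, vocab))
```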


25

More on Document Representation

Stop word removal: many words are not informative and thus irrelevant for document representation
  the, and, a, an, is, of, that, ...
Stemming: reducing words to their root form (reduces dimensionality)
  A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword fishing
  Different words that share the same word stem should be represented by the stem (fish) instead of the actual word
For the Portuguese language these techniques are less studied
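A toy version of these two preprocessing steps; the stop-word list follows the slide, while the suffix-stripping rules are a crude invented stand-in for a real stemmer such as Porter's:

```python
# Stop-word removal plus naive suffix stripping (illustrative only; real
# systems use a proper stemmer, e.g. the Porter stemmer).
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

def stem(word):
    """Strip a few common suffixes, keeping at least a 3-letter root."""
    for suffix in ("ers", "ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("The fisher is fishing and catches fishes"))
```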

26

Weighting Scheme for Term Frequencies

TF-IDF weighting: give higher weight to terms that are rare
  TF: term frequency (increases the weight of frequent terms)
  IDF: inverse document frequency (if a term is frequent in lots of documents it has no discriminative power)

For a given term j and document i:
  n_ij is the number of occurrences of term j in document i
  n_i is the number of words in document i
  n is the number of documents
  d_j is the number of documents that contain term j

  TF_ij = n_ij / n_i
  IDF_j = log(n / d_j)
  w_ij = TF_ij × IDF_j

There is no compelling motivation for this method but it has been shown to be superior to other methods
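The weighting can be sketched directly from these definitions; the three toy documents below are invented:

```python
import math

# TF-IDF sketch: TF = term count / document length, IDF = log(n / d_j),
# weight = TF * IDF. The toy corpus is invented.
def tf_idf(docs):
    n = len(docs)
    df = {}                                  # d_j: documents containing term j
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {t: (doc.count(t) / len(doc)) * math.log(n / df[t])
             for t in set(doc)}
        weights.append(w)
    return weights

docs = [["web", "mining", "web"], ["data", "mining"], ["web", "search"]]
w = tf_idf(docs)
print(w[0]["web"])    # "web" is frequent across documents -> dampened weight
print(w[1]["data"])   # "data" appears in one document only -> higher IDF
```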

27

Locating Relevant Documents

Given a set of keywords:
  Use a similarity/distance measure to find similar/relevant documents
  Rank documents by their relevance/similarity
How to determine if two documents are similar?

28

Distance Based Matching

In order to retrieve documents similar to a given document we need a measure of similarity
Euclidean distance (an example of a metric distance):
The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

  D(X, Y) = sqrt( Σ_{i=1..n} (xi − yi)² )

Properties of a metric distance:
  D(X, X) = 0
  D(X, Y) = D(Y, X)
  D(X, Z) + D(Z, Y) ≥ D(X, Y)
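A minimal sketch of this distance for plain term vectors (the example vectors are invented):

```python
import math

# Euclidean distance between two equal-length term vectors.
def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

a = [1.0, 0.0, 2.0]
b = [1.0, 3.0, 2.0]
print(euclidean(a, b))   # sqrt(0 + 9 + 0) -> 3.0
print(euclidean(a, a))   # D(X, X) = 0
```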

29

Angle Based Matching

Cosine of the angle between the vectors representing the document and the query
Documents "in the same direction" are closely related
Transforms the angular measure into a measure ranging from 1 for the highest similarity to 0 for the lowest

  D(X, Y) = cos(X, Y) = (Xᵀ · Y) / (|X| · |Y|) = Σ xi·yi / sqrt( Σ xi² · Σ yi² )
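The cosine measure can be sketched the same way (the example vectors are invented):

```python
import math

# Cosine similarity: dot product divided by the product of vector norms.
def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm

print(cosine([1, 2, 0], [2, 4, 0]))   # parallel vectors: similarity ≈ 1
print(cosine([1, 0, 0], [0, 1, 0]))   # orthogonal vectors -> 0.0
```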

30

Performance Measure

The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of a collection can be compared by the two following measures:

  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
    the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
    the percentage of documents that are relevant to the query and were, in fact, retrieved

(Venn diagram: all documents, retrieved documents, relevant documents, relevant & retrieved)

31
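Both measures are one-liners over sets of document ids; the retrieved and relevant sets below are invented:

```python
# Precision and recall over document-id sets, following the definitions above.
def precision_recall(retrieved, relevant):
    hit = retrieved & relevant                     # relevant AND retrieved
    return len(hit) / len(retrieved), len(hit) / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {3, 4, 5, 6, 7, 8}
p, r = precision_recall(retrieved, relevant)
print(p, r)   # precision 2/4, recall 2/6
```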

Text Mining

Document classification
Document clustering
Keyword-based association rules

32

Web Search

Domain-specific search engines

www.buildingonline.com www.lawcrawler.com www.drkoop.com (medical)

Meta-searching

Connects to multiple search engines and combines the search results
www.metacrawler.com www.dogpile.com www.37.com


33

Web Search

Post-retrieval analysis and visualization

www.vivisimo.com www.tumba.pt www.kartoo.com

Natural language processing

www.askjeeves.com

Search Agents

Instead of storing a search index, search agents can perform real-time searches on the Web.

Fresher data, but slower response time and lower coverage.

34

Focused Crawling

(Diagram: a breadth-first crawl visits pages 1-7 in breadth order from root R; a focused crawl prunes off-topic branches, marked X, and visits only on-topic pages)

Threshold: a page is on-topic if its correlation to the closest centroid is above this value
Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than this value

35

Database Approaches

One approach is to build a local knowledge base: model data on the web and integrate it in a way that enables specifically designed query languages to query the data
Store locally abstract characterizations of web pages; a query language enables querying the local repository at several levels of abstraction. As a result of a query, the system may have to request pages from the web if more detail is needed
Zaiane, O. R. and Han, J. (2000). WebML: Querying the world-wide web for resources and knowledge. In Proc. Workshop on Web Information and Data Management, pages 9–12.

36

Agent-Based Approach

Agents search for relevant information using domain characteristics and user profiles
Example: a system for extracting a relation from the web, such as a list of all the books referenced on the web. The system is given a set of training examples, which are used to search the web for similar documents. Another application of this tool could be to build a relation with the names and addresses of the restaurants referenced on the web.
Brin, S. (1998). Extracting patterns and relations from the world wide web. In Int. Workshop on Web and Databases, pages 172–183.


37

Web Structure Mining

Exploiting Hyperlink Structure

38

First generation of search engines

Early days: keyword-based searches
  Keywords: "web mining" retrieves documents containing "web" and "mining"
Later on: cope with
  the synonymy problem
  the polysemy problem
  stop words
Common characteristic: only information on the pages themselves is used

39

Modern search engines

Link structure is very important:
  Adding a link is a deliberate act
  It is harder to fool systems that use in-links
  A link is a "quality mark"
Modern search engines use link structure as an important source of information

40

Central Question:

What useful information can be derived from the link structure of the web?


41

Some answers

1. Structure of the Internet
2. Google
3. HITS: Hubs and Authorities

42

1. The Web Structure

A study was conducted on a graph inferred from two large AltaVista crawls.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web. In Proc. WWW Conference.
The study confirmed the hypothesis that the number of in-links and out-links of a page approximately follows a Zipf distribution (a particular case of a power law)

43

Power Laws

44

In-Links


45

Out-Links

46

1. The Web Structure

If the web is treated as an undirected graph, 90% of the pages form a single connected component
If the web is treated as a directed graph, four distinct components are identified, all four of similar size

47

General Topology

(Bow-tie diagram: SCC 56M pages, IN 44M, OUT 44M, Tendrils 44M, plus tubes and disconnected components)

SCC: set of pages that can reach one another
IN: pages that have a path to the SCC but not from it
OUT: pages that can be reached from the SCC but cannot reach it
TENDRILS: pages that can neither reach nor be reached from the SCC pages

48

Some statistics

A connecting path exists between only about 25% of page pairs
BUT, if there is a path:
  Directed: average length < 17
  Undirected: average length < 7 (!!!)
It's a "small world": between two people there is a chain of only length 6!
Small world graphs:
  High number of relatively small cliques
  Small diameter
The Internet (SCC) is a small world graph


49

2. Google

• A search engine that uses link structure to calculate a quality ranking (PageRank) for each page
• Intuition: PageRank can be seen as the probability that a "random surfer" visits a page
• Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc. WWW Conference, pages 107–117.
• Keywords w entered by user
• Select pages containing w and pages that have in-links with caption w
• Anchor text
  • Provides more accurate descriptions of Web pages
  • Anchors exist for un-indexable documents (e.g., images)
• Font sizes of words in text: words in a larger or bolder font are assigned higher weights
• Rank pages according to importance

50

PageRank

Link i → j:
  i considers j important
  the more important i is, the more important j becomes
  if i has many out-links, each of its links counts for less

Initially all importances pi = 1; iteratively, pi is refined.

PageRank: a page is important if many important pages link to it.

  PageRank(j) = (1 − p) + p · Σ_{i → j} PageRank(i) / OutDegree(i)

(PageRank) + (Website Content) = Overall Rank in Results

51

PageRank

Let OutDegree(i) = number of out-links of page i
Adjust pj: the weighted sum of the importance of the pages referring to pj
(1 − p) is the probability that the surfer gets bored and starts on a new random page; p is the probability that the random surfer follows a link on the current page

  PageRank(j) = (1 − p) + p · Σ_{i → j} PageRank(i) / OutDegree(i)

52

PageRank

Repeat until the PageRank vector converges…
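The iteration can be sketched as a simple fixed-point computation over a small invented link graph; p = 0.85 is an assumed damping value, not one stated on the slides:

```python
# PageRank power iteration following the slides' formula:
#   PR(j) = (1 - p) + p * sum over links i -> j of PR(i) / OutDegree(i)
# The three-page link graph below is invented for illustration.
def pagerank(links, p=0.85, iters=50):
    pages = list(links)
    pr = {page: 1.0 for page in pages}       # initially all importances = 1
    for _ in range(iters):
        new = {}
        for j in pages:
            incoming = sum(pr[i] / len(links[i]) for i in pages if j in links[i])
            new[j] = (1 - p) + p * incoming
        pr = new                             # refine until it converges
    return pr

links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
pr = pagerank(links)
print(sorted(pr, key=pr.get, reverse=True))  # C has the most incoming weight
```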


53

3. HITS (Hyperlink-Induced Topic Search)

HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5): 604–632.
Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
  Authorities: highly-referenced pages on a topic
  Hubs: pages that "point" to authorities
A good authority is pointed to by many good hubs; a good hub points to many good authorities

54

Hubs and Authorities

(Diagram: hub pages pointing to authority pages)
Authorities are targets of hub pages
Hub pages point to interesting links to authorities (= relevant pages)

55

HITS

Steps for discovering hubs and authorities on a specific topic:
  Collect a seed set of pages S (returned by a search engine)
  Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set (removing links inside a site)
  Iteratively update the hub weight h(p) and authority weight a(p) for each page:

    a(p) = Σ_{q → p} h(q)        h(p) = Σ_{p → q} a(q)

  After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
Extensions proposed in Clever:
  Assign links different weights based on the relevance of the link anchor text
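The update rule can be sketched as follows; the tiny link graph is invented, and the per-round normalisation is a common practical addition rather than something stated on the slide:

```python
# HITS iteration: a(p) = sum of h(q) over links q -> p,
#                 h(p) = sum of a(q) over links p -> q.
# Weights are normalised each round to keep them bounded (an assumption,
# common in practice). The link graph is invented.
def hits(links, iters=30):
    pages = list(links)
    auth = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        na, nh = sum(auth.values()), sum(hub.values())
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# h1 points to both authorities, h2 to one: h1 is the better hub,
# a1 the better authority.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```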

56

Applications of HITS

Search engine querying
Finding web communities
Finding related pages
Populating categories in web directories
Citation analysis


57

Web Usage Mining

analyzing user web navigation

58

Web Usage Mining

Pages contain information; links are "roads"
How do people navigate over the Internet? ⇒ Web usage mining (clickstream analysis)
Information on navigation paths is available in log files
Logs can be examined from either a client or a server perspective

59

Website Usage Analysis

Why analyze Website usage?
Knowledge about how visitors use a Website can:
  Provide guidelines for web site reorganization
  Help prevent disorientation
  Help designers place important information where the visitors look for it
  Support pre-fetching and caching of web pages
  Provide an adaptive Website (personalization)
Questions that could be answered:
  What are the differences in usage and access patterns among users?
  Which user behaviors change over time?
  How do usage patterns change with quality of service (slow/fast)?
  What is the distribution of network traffic over time?

60

Website Usage Analysis


61

Data Sources

62

Data Sources

Server level collection: the server stores data regarding the requests performed by clients, so the data generally regards just one source
Client level collection: the client itself sends information regarding the user's behaviour to a repository (this can be implemented using a remote agent, such as JavaScript or Java applets, or by modifying the source code of an existing browser, such as Mosaic or Mozilla, to enhance its data collection capabilities)
Proxy level collection: information is stored at the proxy side, so the Web data covers several Websites, but only for users whose Web clients pass through the proxy

63

An Example of a Web Server Log

64

Analog – Web Log File Analyser

Gives basic statistics such as:
  number of hits
  average hits per time period
  what the popular pages in your site are
  who is visiting your site
  what keywords users are searching for to get to you
  what is being downloaded
http://www.analog.cx/


65

Web Usage Mining Process

(Diagram: Web Server Log → Data Preparation → Clean Data → Data Mining → Usage Patterns, with Site Data as an additional input)

66

Data Preparation

Data cleaning
  By checking the suffix of the URL name: for example, remove all log entries with filename suffixes such as gif, jpeg, etc.
User identification
  If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
  Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users
Transaction identification
  All of the page references made by a user during a single visit to a site
  The size of a transaction can range from a single page reference to all of the page references in the visit
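The suffix-based cleaning step can be sketched in a few lines; the log entries and the suffix list below are illustrative:

```python
# Log cleaning by URL suffix: drop requests for image files, as the
# data-preparation step suggests. Entries and suffixes are invented.
IMAGE_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png")

def clean(log_entries):
    """Keep only entries whose URL does not end in an image suffix."""
    return [e for e in log_entries
            if not e["url"].lower().endswith(IMAGE_SUFFIXES)]

log = [{"ip": "1.2.3.4", "url": "/index.html"},
       {"ip": "1.2.3.4", "url": "/logo.gif"},
       {"ip": "1.2.3.4", "url": "/products.html"}]
print([e["url"] for e in clean(log)])   # image request removed
```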

67

Sessionizing

Main questions:
  how to identify unique users
  how to identify/define a user transaction
Problems:
  user ids are often suppressed due to security concerns
  individual IP addresses are sometimes hidden behind proxy servers
  client-side and proxy caching make server log data less reliable
Standard solutions/practices:
  user registration: practical????
  client-side cookies: not foolproof
  cache busting: increases network traffic

68

Sessionizing

Time oriented
  By total duration of session: not more than 30 minutes
  By page stay times (good for short sessions): not more than 10 minutes per page
Navigation oriented (good for short sessions and when timestamps are unreliable)
  Referrer is the previous page in the session, or
  Referrer is undefined but the request is within 10 secs, or
  There is a link from the previous to the current page in the web site
The task of identifying the sequence of requests from a user is not trivial: see Berendt et al., Measuring the Accuracy of Sessionizers for Web Usage Analysis, SIAM-DM01
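A sketch of the page-stay-time heuristic: the 10-minute timeout follows the slide, while the request list is invented:

```python
from datetime import datetime, timedelta

# Time-oriented sessionizing: start a new session when the gap between
# consecutive requests from the same user exceeds a page-stay timeout.
def sessionize(requests, timeout=timedelta(minutes=10)):
    """requests: list of (timestamp, url) sorted by time, for one user."""
    sessions, current = [], []
    for ts, url in requests:
        if current and ts - current[-1][0] > timeout:
            sessions.append(current)     # gap too long: close the session
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2005, 1, 1, 12, 0)
reqs = [(t0, "/"), (t0 + timedelta(minutes=2), "/a"),
        (t0 + timedelta(minutes=40), "/b")]
print(len(sessionize(reqs)))   # 38-minute gap before /b -> 2 sessions
```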


69

Web Usage Mining

Commonly used approaches:
  Preprocessing data and adapting existing data mining techniques
    For example, association rules do not take into account the order of the page requests
  Developing novel data mining models

70

Association Rules

Find frequent patterns/associations/correlations among sets of items
Find correlations between pages not directly connected
Reveal associations between groups of users with specific interests
e.g.: /events/ski.html, travel/ski_resorts.html → /equipment/ski_boots.html (85%, 3%)
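The two numbers attached to such a rule (confidence, support) can be computed as follows; the sessions and page names below are invented, simplified stand-ins:

```python
# Support and confidence of an association rule over page-sets per session.
# Sessions and page names are invented toy data.
def rule_stats(sessions, antecedent, consequent):
    """Return (support, confidence) for antecedent -> consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for s in sessions if a <= s)       # sessions with the antecedent
    n_both = sum(1 for s in sessions if both <= s) # sessions with all the pages
    support = n_both / len(sessions)
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence

sessions = [frozenset(s) for s in (
    {"/ski.html", "/resorts.html", "/boots.html"},
    {"/ski.html", "/resorts.html"},
    {"/news.html"},
)]
print(rule_stats(sessions, {"/ski.html", "/resorts.html"}, {"/boots.html"}))
```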

71

Clustering

Group together items with similar characteristics:
  user clusters (similar navigational behaviour)
  page clusters (groups of pages conceptually related)
72

An Example of Preprocessing Data and Adapting Existing Data Mining Techniques

Chen, M.-S., Park, J. S., and Yu, P. S. (1998). Efficient data mining for traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2): 209–221.

The log data is converted into a tree, from which a set of maximal forward references is inferred. The maximal forward references are then processed by existing association rules techniques. Two algorithms are given to mine for the rules, which in this context consist of large itemsets with the additional restriction that references must be consecutive in a transaction.


73

Mining Navigation Patterns

Each session induces a user trail through the site
A trail is a sequence of web pages followed by a user during a session, ordered by time of access
A pattern in this context is a frequent trail
Co-occurrence of web pages is important, e.g. shopping basket and checkout
Use a Markov chain model, inferred from log data, to model the user navigation records

74

Ngram Model

We make use of the Ngram concept in order to improve the model's accuracy in representing user sessions. The Ngram model assumes that only the previous n−1 visited pages have a direct effect on the probability of the next page chosen.
A state corresponds to a navigation trail with n−1 pages
A chi-square test is used to assess the order of the model (in most cases N = 3 is enough)
Experiments have shown that the number of states is manageable
75

Ngram Model

Example user trails:
  A1→A2→A3→A4
  A1→A5→A3→A4
  A5→A2→A4→A6
  A5→A2→A3
  A5→A2→A3→A6
  A4→A1→A5→A3

76

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: states S, A, B, C, F with the number of traversals on each link; S and F are artificial start and final states)


77

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: traversal counts updated as more streams are processed, now including state D)

78

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: traversal counts after further streams are processed)

79

First-Order Model

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: final model with traversal counts and transition probabilities in parentheses: S→A 3 (0.5), S→E 3 (0.5), A→B 3 (1), E→B 3 (1), B→C 3 (0.5), B→D 3 (0.5), C→F 3 (1), D→F 3 (1))
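The counting scheme with artificial start and final states can be sketched as follows, using the input streams from the slides:

```python
from collections import defaultdict

# First-order Markov model: count traversals between pages, with artificial
# start (S) and final (F) states, then normalise into transition probabilities.
def first_order_model(streams):
    counts = defaultdict(lambda: defaultdict(int))
    for stream in streams:
        trail = ["S"] + stream + ["F"]           # add artificial states
        for i, j in zip(trail, trail[1:]):
            counts[i][j] += 1
    probs = {i: {j: c / sum(out.values()) for j, c in out.items()}
             for i, out in counts.items()}
    return counts, probs

streams = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"],
           ["E", "B", "D"], ["E", "B", "C"], ["E", "B", "D"]]
counts, probs = first_order_model(streams)
print(probs["S"]["A"])   # 3 of the 6 sessions start at A -> 0.5
print(probs["B"]["C"])   # B -> C in 3 of 6 traversals -> 0.5
```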

80

Second Order Evaluation

Input streams: A,B,C  A,B,D  A,B,C  E,B,D  E,B,C  E,B,D

(Diagram: the same first-order model)

In the data, P(C | A,B) = 0.67, but the first-order model gives P(C | B) = 0.5: not accurate


81

Cloning

Duplicate states to separate in-links whose second-order probabilities diverge (state B cloned into B' based on link A,B)

(Diagram: states S, A, E, B, B', C, D, F; the clone reached from A has P(C) = 0.67 and P(D) = 0.33, while the clone reached from E has P(C) = 0.33 and P(D) = 0.67)

The numbers of traversals are updated according to the input data.

82

Clustering-Based Cloning

In cases where a state has more than two in-links we use clustering to assign in-links to clones.
We use a state accuracy parameter, which sets the maximum admissible difference between corresponding first- and second-order probabilities.

83

Clustering-Based Cloning

Input sessions (occurrences): A,B,C ×6; A,B,D ×3; E,B,C ×7; E,B,D ×4; G,B,C ×4; G,B,D ×7; H,B,C ×3; H,B,D ×6

(Diagram: first-order model with a single state B: B→C 20 (0.5), B→D 20 (0.5): not accurate)

Second-order probabilities:
  P(D|H,B) = 0.67   P(C|H,B) = 0.33
  P(D|G,B) = 0.64   P(C|G,B) = 0.36
  P(D|E,B) = 0.36   P(C|E,B) = 0.64
  P(D|A,B) = 0.33   P(C|A,B) = 0.67

Clustering-Based Cloning

(Diagram: B cloned into B and B'; in-links with similar second-order behaviour are assigned to the same clone, and each clone's out-link probabilities approximate the second-order probabilities of its in-links)


85

Clustering-Based Cloning

(Diagram: final clustered model; one clone has P(C) = 0.65 with 13 traversals and P(D) = 0.35 with 7, the other has P(C) = 0.35 with 7 traversals and P(D) = 0.65 with 13)

A nice trade-off between the number of states and accuracy

86

Applications of Markov Models

Provide guidelines for the optimisation of a web site's structure
Work as a model of the user's preferences in the creation of adaptive web sites
Improve search engine technologies by enhancing the random-surfer concept
Web personal assistant
Visualisation tool
Use the model to learn access patterns and predict future accesses:
  Pre-fetch predicted pages to reduce latency
  Also cache results of popular search engine queries

87

Summary

The Web is huge and dynamic
Web mining makes use of data mining techniques to automatically discover and extract information from Web documents/services:
  Web content mining
  Web structure mining
  Web usage mining
Semantic web: "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." – Tim Berners-Lee, James Hendler, Ora Lassila

88

References

Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)
Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti (Morgan Kaufmann Publishers)


89

Thank you!!!