web content mining
Web Content Mining, Dr. Ahmed Rafea - PowerPoint PPT Presentation


  1. Web Content Mining Dr. Ahmed Rafea

  2. Outline • Introduction • The Web: Opportunities & Challenges • Techniques • Applications

  3. Introduction • The Web is perhaps the single largest data source in the world. • Web mining aims to extract and mine useful knowledge from the Web. • It is a multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc. • Because Web data is heterogeneous and lacks structure, mining it is a challenging task.

  4. The Web: Opportunities & Challenges (1) • The Web offers an unprecedented opportunity and challenge to data mining: – The amount of information on the Web is huge. – The coverage of Web information is very wide and diverse. – Information/data of almost all types exist on the Web. – Much of the Web information is semi-structured. – Much of the Web information is linked. – Much of the Web information is redundant. – The Web is noisy. – The Web consists of the surface Web and the deep Web. – The Web is also about services. – The Web is dynamic. – Above all, the Web is a virtual society.

  5. Techniques • Classification of Multimedia Content and Websites • Focused Crawling • Clustering Web Objects • Wrapper Induction • Automatic Data Extraction • NLP techniques for sentiment classification • Sentiment classification using ML methods • NLP for Customer Reviews Analysis

  6. Classification of Multimedia Content and Websites • In order to retrieve relevant knowledge, a system has to analyze web content first. • Classification of web objects offers an automatic way to decide the relevance of web objects. • Since websites are usually represented by multiple pages, classifying whole websites on top of web page classification demands new algorithms.

  7. Focused Crawling • A focused web crawler takes a set of well-selected web pages exemplifying the user interest. • The focused crawler starts from the given pages and recursively explores the linked web pages. • While general-purpose crawlers perform a breadth-first search of the whole web, a focused crawler explores only a small portion of the web using a best-first search guided by the user interest. • Crawling can also retrieve multimedia content on the web, instead of plain HTML documents.
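
The best-first search described on this slide can be sketched with a priority queue. This is a minimal illustration, not the method from the presentation: the in-memory `LINKS` graph stands in for fetching and parsing pages, and `toy_relevance` is a hypothetical stand-in for a model of the user interest. Scoring a page before fetching it is also a simplification; a real crawler would have to estimate relevance from the linking page or anchor text.

```python
import heapq

def focused_crawl(seeds, links, relevance, max_pages=10):
    """Best-first crawl over an in-memory link graph.

    links:     dict mapping a page to the pages it links to (assumed stand-in
               for downloading and parsing HTML).
    relevance: function page -> float, higher means closer to the user interest
               (hypothetical scoring model).
    """
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-relevance(p), p) for p in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-relevance(nxt), nxt))
    return visited

# Toy link graph and relevance function (both made up for illustration).
LINKS = {
    "seed": ["sports", "ml-intro"],
    "ml-intro": ["ml-svm", "cooking"],
    "sports": ["football"],
    "ml-svm": [],
}

def toy_relevance(page):
    return 1.0 if page.startswith("ml") or page == "seed" else 0.1

order = focused_crawl(["seed"], LINKS, toy_relevance, max_pages=4)
print(order)
```

With a page budget of four, the crawl follows the relevant "ml-" pages before any low-scoring page and never reaches "football", whereas a breadth-first crawl would visit "sports" before "ml-svm".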

  8. Clustering Web Objects • Focused crawling retrieves large amounts of relevant data. • In order to offer fast and more specific access to the query results, clustering is an established method to group the retrieved information to achieve better understanding. • If the query results are websites or combined objects like images and their text descriptions, algorithms are needed to handle these combined data types to find meaningful clusterings.

  9. Wrapper Induction • A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database. • Given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
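
As a rough illustration of learning an extraction rule from labeled pages, the sketch below induces a left and a right delimiter around a labeled value. The delimiter-based rule, the function names, and the sample pages are all assumptions made for this example; the slide does not specify which learning method is used.

```python
import os

def common_suffix(strings):
    # Longest common suffix, computed as the common prefix of reversed strings.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_lr_rule(examples):
    """examples: list of (page, labeled_value). Returns (left, right) delimiters."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        lefts.append(page[:start])                 # text before the labeled value
        rights.append(page[start + len(value):])   # text after the labeled value
    return common_suffix(lefts), os.path.commonprefix(rights)

def extract(page, rule):
    """Apply a learned (left, right) rule to a new page."""
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

# Two manually labeled pages (made-up markup for illustration).
labeled = [
    ("<html><b>Price:</b> <i>10.50</i></html>", "10.50"),
    ("<html><p>Sale!</p><b>Price:</b> <i>7.25</i></html>", "7.25"),
]
rule = learn_lr_rule(labeled)
new_page = "<html><b>Price:</b> <i>99.00</i></html>"
print(rule, extract(new_page, rule))
```

The learned rule keeps only the markup shared by all labeled examples, so it generalizes to an unseen page with the same template.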

  10. Automatic Data Extraction • Given a set of positive pages, generate extraction patterns. • Given only a single page with multiple data records, generate extraction patterns.

  11. NLP techniques for sentiment classification • The approach: three steps – Step 1: part-of-speech tagging; extract two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns. – Step 2: estimate the semantic orientation (SO) of the extracted phrases. – Step 3: compute the average SO of all phrases; classify the review as recommended if the average SO is positive, not recommended otherwise.
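
The three steps can be sketched as follows. Several things here are assumptions for illustration only: the input is already part-of-speech tagged, the two tag patterns in `PATTERNS` are examples rather than the patterns the slide refers to, and the `SO_SCORES` table is a hypothetical stand-in for the SO estimation in step 2.

```python
# Illustrative tag patterns (assumption): adjective + noun, adverb + adjective.
PATTERNS = {("JJ", "NN"), ("RB", "JJ")}

# Hypothetical semantic-orientation scores; a real system would estimate these.
SO_SCORES = {
    "great camera": 1.5,
    "very sharp": 1.0,
    "poor battery": -2.0,
    "really slow": -1.2,
}

def extract_phrases(tagged):
    """Step 1: keep consecutive word pairs whose tags match a pattern."""
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PATTERNS
    ]

def classify(tagged, so_scores=SO_SCORES):
    phrases = extract_phrases(tagged)
    # Step 2: look up the SO of each phrase (unknown phrases count as 0.0).
    scores = [so_scores.get(p, 0.0) for p in phrases]
    # Step 3: average the scores and threshold at zero.
    avg = sum(scores) / len(scores) if scores else 0.0
    return "recommended" if avg > 0 else "not recommended"

review_a = [("great", "JJ"), ("camera", "NN"), ("but", "CC"),
            ("poor", "JJ"), ("battery", "NN")]
review_b = [("very", "RB"), ("sharp", "JJ"), ("lens", "NN")]
print(classify(review_a), classify(review_b))
```

In `review_a` the negative phrase outweighs the positive one, so the average SO is below zero and the review is labeled not recommended.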

  12. Sentiment classification using ML methods • Three classification techniques were tried: – Naïve Bayes – Maximum entropy – Support vector machine
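
Of the three techniques listed, Naïve Bayes is the simplest to sketch without external libraries. The version below is a minimal multinomial Naïve Bayes with add-one smoothing; the tiny `TRAIN` set and the function names are made up for illustration and say nothing about the data or setup used in the original experiments.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior plus summed log likelihoods with add-one smoothing.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training reviews (assumption, for illustration only).
TRAIN = [
    ("great phone love it".split(), "pos"),
    ("excellent screen great battery".split(), "pos"),
    ("terrible phone hate it".split(), "neg"),
    ("poor screen awful battery".split(), "neg"),
]
model = train_nb(TRAIN)
print(predict_nb(model, "great battery".split()))
print(predict_nb(model, "awful phone".split()))
```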

  13. NLP for Customer Reviews Analysis • Mining product features – Part-of-speech tagging – Features are nouns and noun phrases • Identifying the orientation of an opinion sentence – Use the dominant orientation of opinion words (e.g., adjectives) as the sentence orientation.
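
Both ideas on this slide can be sketched on pre-tagged sentences. The word lists `POSITIVE` and `NEGATIVE` are a hypothetical opinion lexicon, treating a run of consecutive nouns as a noun phrase is a simplification, and the "neutral" outcome for ties is an assumption the slide does not cover.

```python
# Hypothetical opinion lexicon (assumption, for illustration only).
POSITIVE = {"great", "sharp", "good"}
NEGATIVE = {"poor", "slow", "bad"}

def mine_features(tagged):
    """Candidate product features: nouns, with consecutive nouns merged."""
    features, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
        elif current:
            features.append(" ".join(current))
            current = []
    if current:
        features.append(" ".join(current))
    return features

def sentence_orientation(tagged):
    """Dominant orientation of the adjectives in the sentence."""
    score = 0
    for word, tag in tagged:
        if tag.startswith("JJ"):
            if word in POSITIVE:
                score += 1
            elif word in NEGATIVE:
                score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

sentence = [("the", "DT"), ("battery", "NN"), ("life", "NN"), ("is", "VBZ"),
            ("great", "JJ"), ("but", "CC"), ("the", "DT"), ("screen", "NN"),
            ("is", "VBZ"), ("good", "JJ")]
print(mine_features(sentence), sentence_orientation(sentence))
```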

  14. Applications • Automatic Maintenance of Topic Specific Directory Services • Data extraction • Sentiment classification, analysis • Summarization of consumer reviews • Information integration and schema matching • Knowledge synthesis • Template detection and page segmentation
