Web Content Mining Dr. Ahmed Rafea Outline Introduction The - - PowerPoint PPT Presentation

SLIDE 1

Web Content Mining

  • Dr. Ahmed Rafea
SLIDE 2

Outline

  • Introduction
  • The Web: Opportunities & Challenges
  • Techniques
  • Applications
SLIDE 3

Introduction

  • The Web is perhaps the single largest data source in the world.
  • Web mining aims to extract and mine useful knowledge from the Web.
  • A multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc.
  • Due to the heterogeneity and lack of structure of Web data, mining is a challenging task.
SLIDE 4

The Web: Opportunities & Challenges (1)

  • The Web offers an unprecedented opportunity and challenge to data mining:

– The amount of information on the Web is huge.
– The coverage of Web information is very wide and diverse.
– Information and data of almost all types exist on the Web.
– Much of the Web information is semi-structured.
– Much of the Web information is linked.
– Much of the Web information is redundant.
– The Web is noisy.
– The Web consists of the surface Web and the deep Web.
– The Web is also about services.
– The Web is dynamic.
– Above all, the Web is a virtual society.

SLIDE 5

Techniques

  • Classification of Multimedia Content and Websites
  • Focused Crawling
  • Clustering Web Objects
  • Wrapper Induction
  • Automatic Data Extraction
  • NLP techniques for sentiment classification
  • Sentiment classification using ML methods
  • NLP for Customer Reviews Analysis
SLIDE 6

Classification of Multimedia Content and Websites

  • In order to retrieve relevant knowledge, a system has to analyze web content first.
  • Classification of web objects offers an automatic way to decide the relevance of web objects.
  • Since websites are usually represented by multiple pages, classifying whole websites on top of web page classification demands new algorithms.

SLIDE 7

Focused Crawling

  • A focused web crawler takes a set of well-selected web pages exemplifying the user interest.
  • The focused crawler starts from the given pages and recursively explores the linked web pages.
  • While standard crawlers perform a breadth-first search of the whole web, a focused crawler explores only a small portion of the web using a best-first search guided by the user interest.
  • Crawling can also target multimedia content on the web, instead of plain HTML documents.
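The best-first strategy described above can be sketched roughly as follows. This is a minimal illustration, not the crawler from the slides: the in-memory link graph, the page names, and the relevance scores are all hypothetical stand-ins for real pages and a real relevance model.

```python
import heapq

def focused_crawl(graph, seeds, relevance, max_pages=10):
    """Best-first crawl over an in-memory link graph (a stand-in for the web)."""
    # heapq is a min-heap, so scores are negated to pop the most relevant page first.
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited

# Hypothetical toy data: page -> outgoing links, plus a made-up relevance score per page.
TOY_GRAPH = {
    "seed": ["sports", "mining-intro"],
    "mining-intro": ["wrappers", "news"],
    "sports": ["football"],
    "wrappers": [],
    "news": [],
    "football": [],
}
TOY_SCORES = {"seed": 1.0, "mining-intro": 0.9, "wrappers": 0.8,
              "news": 0.2, "sports": 0.1, "football": 0.05}

order = focused_crawl(TOY_GRAPH, ["seed"], TOY_SCORES.get, max_pages=3)
print(order)  # ['seed', 'mining-intro', 'wrappers']
```

With a budget of three pages the sketch never visits the low-scoring "sports" branch, whereas a breadth-first crawler would reach it before "wrappers".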

SLIDE 8

Clustering Web Objects

  • Focused crawling retrieves large amounts of relevant data.
  • In order to offer fast and more specific access to the query results, clustering is an established method to group the retrieved information to achieve better understanding.
  • If the query results are websites or combined objects like images and their text descriptions, algorithms are needed to handle these combined data types to find meaningful clusterings.
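As a rough illustration of grouping retrieved results, the sketch below uses a simple single-pass ("leader") clustering with Jaccard similarity over word sets. This technique is swapped in for brevity and is not named in the slides; the threshold and the sample result titles are assumptions, and it handles plain text only, not the combined data types mentioned above.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_cluster(texts, threshold=0.3):
    """Assign each text to the first cluster whose representative is similar enough."""
    clusters = []  # each entry: (representative word set, list of member texts)
    for text in texts:
        words = set(text.lower().split())
        for representative, members in clusters:
            if jaccard(words, representative) >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster is close enough, so this text starts a new one.
            clusters.append((words, [text]))
    return [members for _, members in clusters]

# Hypothetical query results.
results = [
    "web mining tutorial",
    "web mining course",
    "football match results",
    "football match report",
]
groups = leader_cluster(results)
print(groups)
```

The result depends on input order and on the threshold, which is one reason more robust clustering algorithms are usually preferred in practice.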

SLIDE 9

Wrapper Induction

  • A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database.
  • Given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
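One very simple form of learned extraction rule is a pair of left and right delimiters around the target value. The sketch below, under that assumption, learns the longest delimiters shared by all labeled pages; the page snippets and function names are hypothetical, and real wrapper induction systems learn considerably richer rules.

```python
import os
import re

def common_suffix(strings):
    # Longest suffix shared by all strings: reverse, take the common prefix, reverse back.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def learn_lr_rule(labeled_pages):
    """Learn (left, right) delimiters from (page, target_value) pairs."""
    lefts, rights = [], []
    for page, value in labeled_pages:
        start = page.index(value)  # raises ValueError if the label is not in the page
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)

def extract(page, rule):
    """Apply a learned (left, right) rule to a new page."""
    left, right = rule
    match = re.search(re.escape(left) + "(.*?)" + re.escape(right), page)
    return match.group(1) if match else None

# Hypothetical manually labeled pages: (page source, value to extract).
labeled = [
    ("<h1>Book A</h1><b>Price:</b> <i>$10</i><p>In stock</p>", "$10"),
    ("<h1>Book B</h1><b>Price:</b> <i>$250</i><p>Sold out</p>", "$250"),
]
rule = learn_lr_rule(labeled)
print(rule)  # ('</h1><b>Price:</b> <i>', '</i><p>')
print(extract("<h1>Book C</h1><b>Price:</b> <i>$7</i><p>In stock</p>", rule))  # $7
```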

SLIDE 10

Automatic Data Extraction

  • Given a set of positive pages, generate extraction patterns.
  • Given only a single page with multiple data records, generate extraction patterns.
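For the single-page case, one crude heuristic is to treat tag paths that repeat within the page as candidate fields of data records. The sketch below illustrates only that idea; the heuristic, the class name, and the sample page are assumptions, and this is far simpler than published automatic extraction methods.

```python
from collections import defaultdict
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Groups text nodes by the tag path that leads to them."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop on a matching close; this sketch assumes well-nested HTML.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.by_path["/".join(self.stack)].append(text)

def repeated_paths(html, min_count=2):
    """Return tag paths that occur at least min_count times, with their text values."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return {p: v for p, v in collector.by_path.items() if len(v) >= min_count}

# Hypothetical page containing three data records.
page = """
<html><body><h1>Catalogue</h1>
<ul>
<li><b>Book A</b><i>$10</i></li>
<li><b>Book B</b><i>$25</i></li>
<li><b>Book C</b><i>$7</i></li>
</ul></body></html>
"""
patterns = repeated_paths(page)
for path, values in patterns.items():
    print(path, values)
```

The page heading appears under a path that occurs only once, so it is not reported as part of a record.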

SLIDE 11

NLP techniques for sentiment classification

  • The approach: Three steps

– Step 1:

  • Part-of-speech tagging
  • Extracting two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns

– Step 2:

  • Estimate the semantic orientation (SO) of the extracted phrases

– Step 3:

  • Compute the average SO of all phrases
  • Classify the review as recommended if the average SO is positive, and as not recommended otherwise.
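The three steps can be sketched as follows. The input is assumed to be already part-of-speech tagged, the two tag patterns are illustrative rather than the full set used by the method, and the semantic orientation scores are hypothetical values that a real system would estimate from data.

```python
# Illustrative tag patterns (assumed): adjective + noun, adverb + adjective.
PATTERNS = {("JJ", "NN"), ("RB", "JJ")}

# Hypothetical semantic orientation (SO) scores for extracted phrases.
TOY_SO = {"great camera": 2.0, "poor battery": -1.5, "very sharp": 1.0}

def extract_phrases(tagged):
    """Step 1: keep consecutive word pairs whose tags match a pattern."""
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PATTERNS
    ]

def classify_review(tagged, so_scores):
    """Steps 2 and 3: look up the SO of each phrase, average, and classify."""
    phrases = extract_phrases(tagged)
    scores = [so_scores.get(p, 0.0) for p in phrases]  # unknown phrases count as 0
    average = sum(scores) / len(scores) if scores else 0.0
    label = "recommended" if average > 0 else "not recommended"
    return label, average

# Hypothetical pre-tagged review.
review = [
    ("great", "JJ"), ("camera", "NN"), ("but", "CC"),
    ("poor", "JJ"), ("battery", "NN"), ("and", "CC"),
    ("very", "RB"), ("sharp", "JJ"), ("photos", "NNS"),
]
label, average = classify_review(review, TOY_SO)
print(extract_phrases(review))  # ['great camera', 'poor battery', 'very sharp']
print(label, average)           # recommended 0.5
```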

SLIDE 12

Sentiment classification using ML methods

  • Three classification techniques were tried:

– Naïve Bayes
– Maximum entropy
– Support vector machine
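To make the first of these techniques concrete, here is a minimal multinomial Naïve Bayes sketch written from scratch. The add-one (Laplace) smoothing, the whitespace tokenization, and the tiny training set are assumptions made for illustration; they are not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns the counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-probability under the model."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled reviews.
TRAIN = [
    ("great phone love it", "pos"),
    ("excellent screen great battery", "pos"),
    ("terrible phone hate it", "neg"),
    ("poor screen awful battery", "neg"),
]
model = train_nb(TRAIN)
print(predict_nb(model, "great screen"))  # pos
print(predict_nb(model, "awful phone"))   # neg
```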

SLIDE 13

NLP for Customer Reviews Analysis

  • Mining product features

– Part-of-speech tagging
– Features are nouns and noun phrases

  • Identify Orientation of an Opinion Sentence

– Use dominant orientation of opinion words (e.g., adjectives) as sentence orientation.
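The dominant-orientation rule can be sketched as a simple count over an opinion lexicon. The lexicon below is a hypothetical toy list, and returning "neutral" on a tie or when no opinion word is found is an assumption, since the slide does not say how such sentences are handled.

```python
# Hypothetical opinion lexicon: +1 for positive words, -1 for negative words.
OPINION_WORDS = {
    "great": 1, "sharp": 1, "good": 1,
    "poor": -1, "blurry": -1, "bad": -1,
}

def sentence_orientation(sentence, lexicon=OPINION_WORDS):
    """Label a sentence by the dominant orientation of its opinion words."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    total = sum(lexicon.get(w, 0) for w in words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"  # assumed fallback for ties and sentences without opinion words

print(sentence_orientation("The pictures are sharp and the colors are great."))  # positive
print(sentence_orientation("The zoom is poor, blurry but the case is good."))    # negative
print(sentence_orientation("The camera has a lens."))                            # neutral
```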

SLIDE 14

Applications

  • Automatic maintenance of topic-specific directory services

  • Data extraction
  • Sentiment classification and analysis
  • Summarization of consumer reviews
  • Information integration and schema matching
  • Knowledge synthesis
  • Template detection and page segmentation