Mar 01, 2023

SLIDE 1

Web Content Mining

Dr. Ahmed Rafea

SLIDE 2

Outline

Introduction
The Web: Opportunities & Challenges
Techniques
Applications

SLIDE 3

Introduction

The Web is perhaps the single largest data source in the world.
Web mining aims to extract and mine useful knowledge from the Web.
A multidisciplinary field: data mining, machine learning, natural language processing, statistics, databases, information retrieval, multimedia, etc.
Due to the heterogeneity and lack of structure of Web data, mining is a challenging task.

SLIDE 4

The Web: Opportunities & Challenges (1)

The Web offers an unprecedented opportunity and challenge to data mining:

– The amount of information on the Web is huge.
– The coverage of Web information is very wide and diverse.
– Information/data of almost all types exist on the Web.
– Much of the Web information is semi-structured.
– Much of the Web information is linked.
– Much of the Web information is redundant.
– The Web is noisy.
– The Web consists of surface Web and deep Web.
– The Web is also about services.
– The Web is dynamic.
– Above all, the Web is a virtual society.

SLIDE 5

Techniques

Classification of Multimedia Content and Websites
Focused Crawling
Clustering Web Objects
Wrapper Induction
Automatic Data Extraction
NLP techniques for sentiment classification
Sentiment classification using ML methods
NLP for Customer Reviews Analysis

SLIDE 6

Classification of Multimedia Content and Websites

In order to retrieve relevant knowledge, a system has to analyze web content first.
Classification of web objects offers an automatic way to decide the relevance of web objects.
Since websites are usually represented by multiple pages, classifying websites on top of web page classification demands new algorithms.

SLIDE 7

Focused Crawling

A focused web crawler takes a set of well-selected web pages exemplifying the user interest.
The focused crawler starts from the given pages and recursively explores the linked web pages.
While standard crawlers perform a breadth-first search of the whole web, a focused crawler explores only a small portion of the web using a best-first search guided by the user interest.
Focused crawling can also retrieve multimedia content on the web, instead of plain HTML documents.
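The best-first strategy above can be sketched as follows. This is a minimal illustration, not the method of any particular crawler: the in-memory `TOY_WEB` dictionary stands in for real page fetching, and the keyword-overlap `relevance` function, the `threshold`, and the choice to rank a link by its parent's relevance are all assumptions made for the example.

```python
import heapq

# Hypothetical toy "web": url -> (page text, outgoing links). Stands in for HTTP fetches.
TOY_WEB = {
    "a": ("web mining tutorial", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("data mining of web logs", ["e"]),
    "d": ("more recipes", []),
    "e": ("web content mining survey", []),
}

def relevance(text, topic_terms):
    """Fraction of topic terms that occur in the page text (assumed scoring rule)."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)

def focused_crawl(web, seeds, topic_terms, max_pages=10, threshold=0.3):
    """Best-first crawl: always expand the frontier link with the highest priority."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]  # seeds get the top priority
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        fetched += 1
        score = relevance(text, topic_terms)
        if score >= threshold:
            relevant.append(url)
        for link in links:
            if link not in seen and link in web:
                seen.add(link)
                # Assumption: a link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant

result = focused_crawl(TOY_WEB, ["a"], {"web", "mining"}, max_pages=4)
print(result)
```

With a budget of four pages, links found on the off-topic page "b" sink to the bottom of the frontier, so "d" is never fetched, whereas a breadth-first crawl with the same budget would have spent a fetch on it.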

SLIDE 8

Clustering Web Objects

Focused crawling retrieves large amounts of relevant data.
To offer fast and more specific access to the query results, clustering is an established method for grouping the retrieved information so that it is easier to understand.
If the query results are websites or combined objects like images and their text descriptions, algorithms are needed to handle these combined data types to find a meaningful clustering.
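As a rough illustration of grouping retrieved results, the sketch below clusters short text descriptions with single-pass "leader" clustering on Jaccard word overlap. This technique is swapped in for brevity and is not claimed to be the method the slides have in mind; the snippets and the `threshold` value are made up, and handling combined objects (for example an image plus its description) would need a richer similarity function.

```python
def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_cluster(snippets, threshold=0.2):
    """Assign each snippet to the first cluster whose leader is similar enough,
    otherwise start a new cluster with the snippet as its leader."""
    clusters = []  # list of (leader word set, member indices)
    for i, snippet in enumerate(snippets):
        words = set(snippet.lower().split())
        for leader, members in clusters:
            if jaccard(words, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((words, [i]))
    return [members for _, members in clusters]

# Hypothetical query results (text descriptions only).
SNIPPETS = [
    "cheap flights to cairo",
    "python web crawler tutorial",
    "cairo flights and hotels",
    "build a web crawler in python",
]

groups = leader_cluster(SNIPPETS)
print(groups)
```

The result depends on the order of the input and on the threshold, which is the usual trade-off of single-pass methods against iterative ones such as k-means.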

SLIDE 9

Wrapper Induction

A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database.
Given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
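One simple way to picture learning an extraction rule from labeled pages is a left/right delimiter wrapper: take the text that consistently precedes and follows the labeled value across all examples. The sketch below assumes one labeled value per page and pages that share a template; the page strings and function names are hypothetical, and real wrapper induction systems learn far more robust rules.

```python
import os

def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def learn_wrapper(labeled_pages):
    """labeled_pages: list of (page_html, labeled_value).
    Returns (left delimiter, right delimiter) shared by all examples."""
    lefts, rights = [], []
    for page, value in labeled_pages:
        start = page.index(value)  # raises ValueError if the label is not on the page
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)

def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters, or None if they do not match."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end != -1 else None

# Hypothetical manually labeled training pages.
TRAINING = [
    ("<html><b>Name:</b> <i>Nile Cafe</i><br>Phone: 555</html>", "Nile Cafe"),
    ("<html><b>Name:</b> <i>Giza Grill</i><br>Phone: 777</html>", "Giza Grill"),
]

wrapper = learn_wrapper(TRAINING)
extracted = apply_wrapper(wrapper, "<html><b>Name:</b> <i>Luxor Diner</i><br>Phone: 999</html>")
print(wrapper, extracted)
```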

SLIDE 10

Automatic Data Extraction

Given a set of positive pages, generate extraction patterns.
Given only a single page with multiple data records, generate extraction patterns.

SLIDE 11

NLP techniques for sentiment classification

The approach: Three steps

– Step 1:

Part-of-speech tagging
Extract two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns

– Step 2:

Estimate the semantic orientation (SO) of the extracted phrases

– Step 3:

Compute the average SO of all phrases
Classify the review as recommended if the average SO is positive, and as not recommended otherwise.
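The three steps above can be sketched as follows. The slides do not say which tag patterns are used or how SO is estimated, so both are assumptions here: `PATTERNS` is a simplified, illustrative set of tag pairs, `TOY_SO` is a made-up lookup table standing in for a real SO estimator, and the review is assumed to be already part-of-speech tagged.

```python
# Assumed, simplified tag patterns: (tag of first word, tag of second word).
PATTERNS = {("JJ", "NN"), ("RB", "JJ"), ("RB", "VB")}

# Made-up semantic orientation scores standing in for a real SO estimator.
TOY_SO = {
    "great picture": 1.5,
    "poor battery": -2.0,
}

def extract_phrases(tagged_review):
    """Step 1: keep consecutive word pairs whose tags match a pattern."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_review, tagged_review[1:]):
        if (t1, t2) in PATTERNS:
            phrases.append(f"{w1} {w2}".lower())
    return phrases

def classify_review(tagged_review, so_scores):
    """Steps 2 and 3: look up the SO of each phrase, average, and classify."""
    phrases = extract_phrases(tagged_review)
    if not phrases:
        return "not recommended", 0.0
    # Phrases missing from the table count as neutral (0.0); this is an assumption.
    average = sum(so_scores.get(p, 0.0) for p in phrases) / len(phrases)
    label = "recommended" if average > 0 else "not recommended"
    return label, average

# Hypothetical pre-tagged review.
REVIEW = [
    ("The", "DT"), ("camera", "NN"), ("takes", "VBZ"),
    ("great", "JJ"), ("picture", "NN"), ("but", "CC"),
    ("poor", "JJ"), ("battery", "NN"),
]

phrases = extract_phrases(REVIEW)
label, average = classify_review(REVIEW, TOY_SO)
print(phrases, label, average)
```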

SLIDE 12

Sentiment classification using ML methods

Three classification techniques were tried:

– Naïve Bayes
– Maximum entropy
– Support vector machine
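To make the first of these techniques concrete, here is a minimal multinomial Naïve Bayes sentiment classifier with add-one (Laplace) smoothing, written in plain Python. The training sentences are invented for illustration, and whitespace tokenization and skipping unseen words are simplifying assumptions; the slides do not describe the features or data actually used.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training reviews.
TRAIN = [
    ("great phone love it", "pos"),
    ("excellent screen great battery", "pos"),
    ("terrible phone hate it", "neg"),
    ("poor screen bad battery", "neg"),
]

model = train_nb(TRAIN)
print(predict_nb(model, "great screen"), predict_nb(model, "bad phone"))
```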

SLIDE 13

NLP for Customer Reviews Analysis

Mining product features

– Part-of-speech tagging: features are nouns and noun phrases

Identify Orientation of an Opinion Sentence

– Use dominant orientation of opinion words (e.g., adjectives) as sentence orientation.
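Both ideas can be illustrated on pre-tagged sentences. In this sketch, candidate features are single nouns that occur at least `min_count` times (the frequency filter and the restriction to single nouns are assumptions), and a sentence takes the orientation of the majority of its opinion adjectives, looked up in a tiny made-up lexicon.

```python
from collections import Counter

# Tiny made-up opinion lexicon.
POSITIVE = {"great", "sharp", "good"}
NEGATIVE = {"poor", "bad", "short"}

def candidate_features(tagged_sentences, min_count=2):
    """Nouns (tags starting with NN) that occur at least min_count times."""
    counts = Counter(
        word.lower()
        for sentence in tagged_sentences
        for word, tag in sentence
        if tag.startswith("NN")
    )
    return [word for word, count in counts.items() if count >= min_count]

def sentence_orientation(tagged_sentence):
    """Dominant orientation of the adjectives (tags starting with JJ) in a sentence."""
    score = 0
    for word, tag in tagged_sentence:
        if tag.startswith("JJ"):
            if word.lower() in POSITIVE:
                score += 1
            elif word.lower() in NEGATIVE:
                score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical pre-tagged review sentences.
SENTENCES = [
    [("The", "DT"), ("screen", "NN"), ("is", "VBZ"), ("great", "JJ"), ("and", "CC"), ("sharp", "JJ")],
    [("The", "DT"), ("battery", "NN"), ("is", "VBZ"), ("poor", "JJ")],
    [("Great", "JJ"), ("screen", "NN"), ("but", "CC"), ("short", "JJ"), ("battery", "NN"), ("life", "NN")],
]

features = candidate_features(SENTENCES)
orientations = [sentence_orientation(s) for s in SENTENCES]
print(features, orientations)
```

The third sentence comes out neutral because its one positive and one negative adjective cancel; a fuller treatment would tie each opinion word to the feature it modifies.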

SLIDE 14

Applications

Automatic maintenance of topic-specific directory services

Data extraction
Sentiment classification, analysis
Summarization of consumer reviews
Information integration and schema matching
Knowledge synthesis
Template detection and page segmentation