WIMS ’11
Page Segmentation by Web Content Clustering
Sadet Alcic
Heinrich-Heine-University of Duesseldorf Department of Computer Science Institute for Databases and Information Systems
May 26, 2011
Introduction Motivation
Introduction Related Work
◮ KDD’02: Lin and Ho. Discovering Informative Content Blocks from
◮ APWeb’03: Cai et al. Extracting content structure for web pages based
◮ TKDE’05: Kao et al. Web Intrapage Informative Structure Mining
◮ CIKM’02: Li et al. Using Micro Information Units for Internet Search
◮ WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage
◮ CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web
◮ should the block be
◮ if yes, where to separate?
Web Page Segmentation by Clustering
Web Page Segmentation by Clustering General Idea
Web Page Segmentation by Clustering Distance functions for web contents
◮ web browser puts every object of a web page in a 2-dim plane
◮ extract the bounding rectangle for each object
◮ elements in DOM contain some textual contents
◮ extract keywords from the corresponding text
◮ each object is a node in the DOM tree of the page
◮ use the position of the object in DOM tree to characterize it
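[Figure: bounding rectangles R, with corners (rx, ry) and (rx', ry'), and S, with corners (sx, sy) and (sx', sy'), in the x-y plane with origin (0, 0); mindist(R, S) is marked between them. The accompanying formula survives only as the fragment "... 2, with t_i = ...".]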
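A minimal sketch of how mindist(R, S) between two bounding rectangles could be computed. It assumes the usual minimum distance between axis-aligned rectangles: a per-axis gap t_i that is zero when the rectangles overlap on that axis, combined by the Euclidean norm. The slide's own formula is not fully recoverable, so this form, the `Rect` type, and the helper `axis_gap` are illustrative assumptions rather than the talk's definition.

```python
from math import hypot
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle (hypothetical type).

    (x, y) is the corner with the smaller coordinates and
    (x2, y2) the opposite corner, so x <= x2 and y <= y2.
    """
    x: float
    y: float
    x2: float
    y2: float


def axis_gap(lo_a: float, hi_a: float, lo_b: float, hi_b: float) -> float:
    """Gap between intervals [lo_a, hi_a] and [lo_b, hi_b]; 0 if they overlap."""
    if lo_a > hi_b:
        return lo_a - hi_b
    if lo_b > hi_a:
        return lo_b - hi_a
    return 0.0


def mindist(r: Rect, s: Rect) -> float:
    """Assumed form: Euclidean norm of the per-axis gaps between r and s."""
    tx = axis_gap(r.x, r.x2, s.x, s.x2)
    ty = axis_gap(r.y, r.y2, s.y, s.y2)
    return hypot(tx, ty)


# Two rectangles separated by 3 units in x and 4 units in y.
far = mindist(Rect(0, 0, 2, 2), Rect(5, 6, 7, 8))
# Two overlapping rectangles have distance 0.
overlapping = mindist(Rect(0, 0, 2, 2), Rect(1, 1, 3, 3))
```

Under this assumption, objects whose rectangles touch or overlap are at distance 0, and the distance grows with the visual gap between them on the rendered page.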
◮ Lexical word-to-word matching → sim(T1, T2) = 0
◮ Words are mapped to concepts in WordNet (concept-to-concept
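The contrast between the two bullets can be illustrated with a small sketch. It assumes Jaccard overlap of keyword sets as the similarity measure, which the slides do not specify. The dictionary `TOY_CONCEPTS` is a hypothetical stand-in for a WordNet lookup and is not a real WordNet API.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical word-to-concept table standing in for a WordNet lookup.
TOY_CONCEPTS = {
    "car": "motor_vehicle",
    "automobile": "motor_vehicle",
    "price": "cost",
    "cost": "cost",
}


def to_concepts(words):
    """Map each word to its concept; unknown words stay as they are."""
    return {TOY_CONCEPTS.get(w, w) for w in words}


t1 = ["car", "price"]
t2 = ["automobile", "cost"]

# No literal word is shared, so word-to-word matching sees no similarity.
lexical = jaccard(t1, t2)
# After mapping to concepts, both texts describe the same two concepts.
conceptual = jaccard(to_concepts(t1), to_concepts(t2))
```

Here the two keyword sets share no literal word, so word-to-word matching gives similarity 0, while matching on concepts treats them as equivalent.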
[Figure: DOM tree with node levels (level = 0, 1, 2) and level degrees (2, 3, 0)]
Web Page Segmentation by Clustering Clustering methods
◮ k-medoid (similar to k-means, but cluster representatives are real objects)
◮ single link method applied to compute distance between sets of objects
◮ DBSCAN variant (able to find clusters of different density levels)
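A minimal sketch of the single link idea: the distance between two sets of objects is the smallest pairwise distance between their members, and the closest sets are merged repeatedly. The precomputed distance matrix `dist` and the stopping rule based on `threshold` are assumptions made for illustration; the slides do not state how merging is stopped.

```python
def single_link(cluster_a, cluster_b, dist):
    """Single link: smallest pairwise distance between two sets of objects."""
    return min(dist[i][j] for i in cluster_a for j in cluster_b)


def agglomerate(dist, threshold):
    """Merge the two closest clusters until the closest pair is farther
    apart than `threshold` (assumed stopping rule)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        # b > a, so deleting index b leaves index a untouched.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy example: five objects at positions 0, 1, 2, 10, 11 on a line.
positions = [0, 1, 2, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
blocks = agglomerate(dist, threshold=2)
```

In this toy example the first three objects end up in one block and the last two in another, because the gap of 8 between the groups exceeds the threshold. Any of the distance functions discussed earlier could supply `dist`.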
Evaluation Studies
Evaluation Studies Distance functions
[Figure: example web page whose content objects are labelled by index range: 1-19, 20-28, 29-43, 44-54, 55-56, 57-65, 66-68]
[Figure: three panels, both axes ranging over object indices 1 to 60: a) DOM-Distance, b) Geometric-Distance, c) Semantic-Distance]
Evaluation Studies Clustering
◮ compute clustering of the web contents into page blocks
◮ the distances between objects that belong together vary in the
◮ DB clustering is able to find clusters of different densities
Conclusion and Future work