Page Segmentation by Web Content Clustering, by Sadet Alcic


  1. Page Segmentation by Web Content Clustering Sadet Alcic Heinrich-Heine-University of Duesseldorf Department of Computer Science Institute for Databases and Information Systems May 26, 2011 WIMS ’11 1 / 19

  2. Outline
     1 Introduction: Motivation, Related Work
     2 Web Page Segmentation by Clustering: General Idea, Distance functions for web contents, Clustering methods
     3 Evaluation Studies: Distance functions, Clustering
     4 Conclusion and Future work

  4. Motivation: a web page is cluttered with different kinds of content
     ◮ Different news articles
     ◮ Link lists
     ◮ Commercials
     ◮ Template elements
     ◮ Functional elements

  5. Motivation: Web Page Segmentation
     ◮ Separation of web contents into structurally and semantically cohesive blocks

  7. Motivation: Applications
     ◮ Web Content Search
     ◮ Web Page Categorization
     ◮ Web Page Adaptation for Mobile Devices
     ◮ Web Image Indexing
     ◮ ...

  8. Related Work: Overview of Related Work on Web Page Segmentation
     ◮ TOP-DOWN page segmentation:
       ◮ KDD’02: Lin and Ho. Discovering Informative Content Blocks from Web Documents (Table properties)
       ◮ APWeb’03: Cai et al. Extracting content structure for web pages based on visual representation (Heuristic rules on visual and DOM properties)
       ◮ TKDE’05: Kao et al. Web Intrapage Informative Structure Mining Based on DOM (Term entropy based on heuristics)
     ◮ BOTTOM-UP page segmentation:
       ◮ CIKM’02: Li et al. Using Micro Information Units for Internet Search (Heuristic rules)
       ◮ WWW’08: Chakrabarti et al. A graph-theoretic approach to webpage segmentation (Graph partitioning)
       ◮ CIKM’08: Kohlschuetter and Nejdl. A densitometric approach to web page segmentation (Partitioning of a histogram of text density)

  9. Related Work: TOP-DOWN page segmentation, basic idea
     ◮ Start with the complete page as the initial block
     ◮ Decide for each block:
       ◮ Should the block be separated?
       ◮ If yes, where to separate?
     ◮ Both decisions are based on heuristics

  10. Related Work: BOTTOM-UP page segmentation, basic idea
     ◮ Start with the smallest content units (e.g., DOM leaves)
     ◮ Group them into blocks of coherent content
     ◮ How? Heuristic rules (Li et al.), graph partitioning (Chakrabarti et al.), or partitioning of a text-density histogram (Kohlschuetter and Nejdl)

  11. Related Work: Our Approach
     ◮ Belongs to the BOTTOM-UP methods
     ◮ DOM leaves are used as basic web objects
     ◮ Idea: group web objects into blocks by clustering

  12. Web Page Segmentation by Clustering

  13. General Idea: Page Segmentation by Clustering
     General Definition: Clustering
     ◮ Clustering is the process of organizing objects into groups whose members are similar in some way
     ◮ A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters
     Open questions addressed in this work
     ◮ How can the similarity (or dissimilarity) of web objects be estimated?
     ◮ Which representation is most suitable for web objects?
     ◮ Which clustering method should be applied?

  14. Distance functions for web contents: Different Representations of Web Objects
     ◮ Geometric Representation
       ◮ the web browser places every object of a web page in a 2-dimensional plane
       ◮ extract the bounding rectangle for each object
     ◮ Semantic Representation
       ◮ elements in the DOM contain textual content
       ◮ extract keywords from the corresponding text
     ◮ DOM-based Representation
       ◮ each object is a node in the DOM tree of the page
       ◮ use the position of the object in the DOM tree to characterize it
     ⇒ Different distance measures are possible

  15. Distance functions for web contents: Geometric Distance
     ◮ Let $R = [(r_x, r_y), (r_x', r_y')]$ and $S = [(s_x, s_y), (s_x', s_y')]$ be two bounding rectangles
     ◮ The geometric distance of $R$ to $S$ is given by
       $$\mathrm{dist}(R, S) = \Big( \sum_{i \in \{x, y\}} t_i^2 \Big)^{1/2}, \quad \text{with } t_i = \begin{cases} r_i - s_i' & \text{if } r_i > s_i' \\ s_i - r_i' & \text{if } r_i' < s_i \\ 0 & \text{otherwise} \end{cases}$$
     ◮ Visually: [Figure: rectangles R and S in the x-y plane with origin (0, 0), with mindist(R, S) marked between them]
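
The geometric distance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Rect`, `axis_gap`, and `geometric_distance` are hypothetical and not from the talk, and each rectangle is assumed to be given with its first corner having the smaller coordinates on both axes. Per axis, the gap is the distance between the two intervals (zero if they overlap), and the two gaps are combined with the Euclidean norm.

```python
import math
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding rectangle [(x, y), (x2, y2)], assuming x <= x2 and y <= y2."""
    x: float
    y: float
    x2: float
    y2: float


def axis_gap(r_lo: float, r_hi: float, s_lo: float, s_hi: float) -> float:
    """Gap t_i between the intervals [r_lo, r_hi] and [s_lo, s_hi] on one axis."""
    if r_lo > s_hi:       # R starts after S ends on this axis
        return r_lo - s_hi
    if r_hi < s_lo:       # R ends before S starts on this axis
        return s_lo - r_hi
    return 0.0            # the intervals overlap on this axis


def geometric_distance(r: Rect, s: Rect) -> float:
    """Minimum distance between two rectangles: sqrt(t_x^2 + t_y^2)."""
    t_x = axis_gap(r.x, r.x2, s.x, s.x2)
    t_y = axis_gap(r.y, r.y2, s.y, s.y2)
    return math.hypot(t_x, t_y)


if __name__ == "__main__":
    r = Rect(0, 0, 10, 10)
    s = Rect(13, 14, 20, 20)   # 3 units to the right of r and 4 units below it
    print(geometric_distance(r, s))
```

Rectangles that overlap or touch get distance 0, so objects that are visually adjacent on the rendered page end up close together regardless of their size.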

  16. Distance functions for web contents: Semantic Distance
     ◮ Given $T_1$ = (dog, run, street), $T_2$ = (puppy, walk, road)
     ◮ Cosine Similarity Measure (Information Retrieval)
       ◮ Lexical word-to-word matching → $\mathrm{sim}(T_1, T_2) = 0$
       ◮ too strict: e.g., synonym and hyponym relationships are not considered
     ◮ Instead: text similarity measure based on WordNet [Corley 05]
       ◮ Words are mapped to concepts in WordNet (concept-to-concept matching)
       $$\mathrm{sim}(T_1, T_2) = \frac{\sum_{w_i \in T_1} \mathrm{maxSim}(w_i, T_2) \cdot \mathrm{idf}(w_i)}{\sum_{w_i \in T_1} \mathrm{idf}(w_i)}$$
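
A minimal sketch of the idf-weighted similarity from this slide follows. It does not call WordNet: the word-to-word scores in `TOY_WORD_SIM` and the idf values are invented purely for illustration, and all function names are hypothetical. The point is only to show why exact lexical matching yields 0 for the example while a concept-aware word similarity does not.

```python
from typing import Callable, Dict, Sequence

# Hypothetical word-to-word similarities standing in for a WordNet-based measure.
TOY_WORD_SIM = {
    frozenset(("dog", "puppy")): 0.9,
    frozenset(("run", "walk")): 0.6,
    frozenset(("street", "road")): 0.8,
}


def toy_word_sim(a: str, b: str) -> float:
    """Illustrative concept-aware similarity: 1 for equal words, else a table lookup."""
    if a == b:
        return 1.0
    return TOY_WORD_SIM.get(frozenset((a, b)), 0.0)


def exact_match(a: str, b: str) -> float:
    """Strict lexical matching: only identical words count as similar."""
    return 1.0 if a == b else 0.0


def text_similarity(
    t1: Sequence[str],
    t2: Sequence[str],
    word_sim: Callable[[str, str], float],
    idf: Dict[str, float],
) -> float:
    """sim(T1, T2) = sum(maxSim(w, T2) * idf(w)) / sum(idf(w)) over w in T1."""
    weighted = sum(
        max((word_sim(w, v) for v in t2), default=0.0) * idf[w] for w in t1
    )
    total = sum(idf[w] for w in t1)
    return weighted / total if total else 0.0


if __name__ == "__main__":
    t1 = ["dog", "run", "street"]
    t2 = ["puppy", "walk", "road"]
    idf = {"dog": 2.0, "run": 1.0, "street": 1.0}   # made-up idf values
    print(text_similarity(t1, t2, exact_match, idf))    # strict lexical matching
    print(text_similarity(t1, t2, toy_word_sim, idf))   # concept-aware matching
```

The formula as shown is directed (it iterates over the words of T1 only), so swapping the arguments can give a different value.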

  17. Distance functions for web contents: DOM-based Distance
     [Figure: example DOM tree. Root A (level 0, level degree 2) has the children B and C (level 1, level degree 3); B has the leaves 1, 2, 3 and C has the leaves 4, 5, 6 (level 2, level degree 0)]

  18. DOM-based Distance: Requirements
     ◮ Nodes under the same parent are closer than nodes under different parents
     ◮ Nodes on a higher tree level are closer than nodes on a lower level

  19. DOM-based Distance: Preorder traversal
     ◮ Traverse the DOM tree in preorder: $P = (A, B, 1, 2, 3, C, 4, 5, 6)$

  20. DOM-based Distance: Node weights
     ◮ For each element $p_i$ of the preorder sequence $P$, define a weight $w_{p_i}$ that expresses the cost of reaching $p_i$ from its predecessor in $P$

  21. DOM-based Distance: Definition
     ◮ The distance between $p_a, p_b \in P$ (w.l.o.g. $a < b$) is defined as
       $$d(p_a, p_b) = \sum_{i=a+1}^{b} w_{p_i}$$
     ◮ Example: $d(2, 4) = w_3 + w_C + w_4$

  22. DOM-based Distance: Weight function
     ◮ The weight $w_{p_i}$ of a node $p_i \in P$ depends on the level $l$ and the level degree $d_l$ of $p_i$:
       $$w(l) = \begin{cases} c & : d_l = 0 \\ d_l \cdot w(l+1) & : d_l > 0 \end{cases} \tag{1}$$
     ◮ E.g., $w(2) = c$, $w(1) = 3 \cdot w(2) = 3c$, $w(0) = 2 \cdot w(1) = 6c$
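
The DOM-based distance can be traced on the example tree with the short sketch below. The slides do not define the level degree precisely; the code assumes it is the largest number of children of any node on that level, which matches the example values (2, 3, 0). The names `Node`, `preorder`, `level_weights`, and `dom_distance` are hypothetical, and the constant c is set to 1.

```python
from typing import Dict, List, Tuple


class Node:
    """Minimal stand-in for a DOM node: a label plus child nodes."""

    def __init__(self, label: str, children: Tuple["Node", ...] = ()):
        self.label = label
        self.children = list(children)


def preorder(root: Node) -> List[Tuple[Node, int]]:
    """Return (node, level) pairs in preorder; the root is on level 0."""
    out: List[Tuple[Node, int]] = []

    def visit(node: Node, level: int) -> None:
        out.append((node, level))
        for child in node.children:
            visit(child, level + 1)

    visit(root, 0)
    return out


def level_weights(order: List[Tuple[Node, int]], c: float = 1.0) -> Dict[int, float]:
    """w(l) = c if d_l == 0, else d_l * w(l + 1).

    Assumption: the level degree d_l is the maximum child count on level l.
    """
    degree: Dict[int, int] = {}
    for node, level in order:
        degree[level] = max(degree.get(level, 0), len(node.children))
    weights: Dict[int, float] = {}
    for level in sorted(degree, reverse=True):   # deepest level first
        d_l = degree[level]
        weights[level] = c if d_l == 0 else d_l * weights[level + 1]
    return weights


def dom_distance(root: Node, a: str, b: str, c: float = 1.0) -> float:
    """Sum the weights of all nodes after the earlier and up to the later node in preorder."""
    order = preorder(root)
    weights = level_weights(order, c)
    labels = [node.label for node, _ in order]
    i, j = sorted((labels.index(a), labels.index(b)))
    return sum(weights[level] for _, level in order[i + 1 : j + 1])


# Example tree from the slides: A -> (B -> 1, 2, 3), (C -> 4, 5, 6)
TREE = Node("A", (
    Node("B", (Node("1"), Node("2"), Node("3"))),
    Node("C", (Node("4"), Node("5"), Node("6"))),
))

if __name__ == "__main__":
    print(level_weights(preorder(TREE)))
    print(dom_distance(TREE, "1", "2"))   # siblings under B
    print(dom_distance(TREE, "2", "4"))   # leaves under different parents
```

With c = 1, moving between siblings costs w(2) = 1, while crossing from a leaf under B to a leaf under C also pays w(1) = 3 for passing C, which is how the first requirement (same parent means closer) is met.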

  23. Clustering methods
     ◮ Partitioning Clustering
       ◮ k-medoid (similar to k-means, but the cluster representatives are real objects)
     ◮ Agglomerative Hierarchical Clustering
       ◮ single-link method used to compute the distance between sets of objects
     ◮ Density-based Clustering
       ◮ DBSCAN variant (able to find clusters of different density levels)
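
As one concrete instance of the methods listed, here is a naive sketch of single-link agglomerative clustering on a precomputed distance matrix. The stopping rule (stop once the closest pair of clusters is farther apart than `threshold`) is an assumption made for this example; the slides do not say how the hierarchy is cut. The function name `single_link` is hypothetical.

```python
from typing import List, Sequence, Set


def single_link(dist: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Merge the two closest clusters until their distance exceeds `threshold`.

    The distance between two clusters is the minimum pairwise distance of
    their members (single link). Returns clusters as sorted index lists.
    """
    clusters: List[Set[int]] = [{i} for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:   # assumed stopping criterion
            break
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Four objects on a line at positions 0, 1, 5, 6 (illustrative data).
    positions = [0, 1, 5, 6]
    dist = [[abs(p - q) for q in positions] for p in positions]
    print(single_link(dist, threshold=2))
```

Because the function only sees the distance matrix, any of the geometric, semantic, or DOM-based distances can be plugged in without changing the clustering code.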

  24. Evaluation Studies

  25. Distance functions: Distance-Matrix Visualization
     ◮ A distance matrix contains all pairwise distances of the objects to be clustered, e.g.

            a     b     c
       a    0    1.9   1.1
       b   1.9    0    2.3
       c   1.1   2.3    0
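
A distance matrix like the one on this slide can be built from any pairwise distance function. The sketch below is illustrative only: `distance_matrix`, `closest_pair`, and `slide_dist` are hypothetical names, and `slide_dist` simply looks up the three example values for a, b, and c.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def distance_matrix(objects: Sequence[T], dist: Callable[[T, T], float]) -> List[List[float]]:
    """Symmetric matrix of all pairwise distances, with zeros on the diagonal."""
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dist(objects[i], objects[j])
    return matrix


def closest_pair(objects: Sequence[T], matrix: Sequence[Sequence[float]]) -> Tuple[T, T]:
    """Return the two distinct objects with the smallest distance."""
    n = len(objects)
    _, i, j = min((matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    return objects[i], objects[j]


# Example values from the slide, stored per unordered pair.
SLIDE_DISTANCES = {frozenset("ab"): 1.9, frozenset("ac"): 1.1, frozenset("bc"): 2.3}


def slide_dist(p: str, q: str) -> float:
    return SLIDE_DISTANCES[frozenset((p, q))]


if __name__ == "__main__":
    objects = ["a", "b", "c"]
    matrix = distance_matrix(objects, slide_dist)
    for label, row in zip(objects, matrix):
        print(label, row)
    print(closest_pair(objects, matrix))
```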

  26. Distance functions: [Figure: distance-matrix visualization; the labeled object index ranges are 1-19, 20-28, 29-43, 44-54, 55-56, 57-65, 66-68]
