
slide-1
SLIDE 1

Extracting Logical Hierarchical Structure of HTML Documents Based on Headings

Tomohiro Manabe and Keishi Tajima
Graduate School of Informatics, Kyoto Univ.
Sakyo, Kyoto 606-8501 Japan
{manabe@dl.kuis, tajima@i}.kyoto-u.ac.jp

slide-2
SLIDE 2

Background

  • Understanding of structure in web pages is important for many applications
      • Web search
      • Automatic summarization of web pages
      • Web information extraction

slide-6
SLIDE 6

Structure in web pages

  • Web pages contain various types of structures
      • Layout structure, list or table structure, …
  • We focus on hierarchical heading structure
      • 78% of pages contain it

[Figure: page mockup labeled Header, Menu, and Content body, containing big headings, small headings, and list items Item 1 to Item 3]

slide-7
SLIDE 7

Hierarchical heading structure

Kyoto Aquarium

is an aquarium in Kyoto, Japan.

Overview

One of the largest inland aquariums.

Information

Holidays Open throughout the year. Opening Hours From 9 a.m. to 5 p.m.

History

2010

  • Jul. Construction started.

2012

  • Feb. Construction finished.
  • Mar. Opened just as planned.
  • Jul. Welcomed the 1Mth visitor.
slide-15
SLIDE 15

Hierarchical heading structure

  • Heading
      • Topic description of a segment
  • Block
      • A segment with its heading
      • Blocks may contain other blocks
  • Hierarchical heading structure
      • Composed of these headings and blocks
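
The heading and block definitions above can be modeled as a small tree. The following is only an illustrative sketch: the `Block` class, its field names, and the nesting chosen for the Kyoto Aquarium page (for example, treating "Holidays" and "Opening Hours" as headings) are assumptions made for this example, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A segment together with its heading (hypothetical model)."""
    heading: str                                   # topic description of the segment
    contents: List[str] = field(default_factory=list)
    children: List["Block"] = field(default_factory=list)  # nested blocks

    def outline(self, depth: int = 0) -> List[str]:
        """Return the headings of this block and its descendants, indented by depth."""
        lines = ["  " * depth + self.heading]
        for child in self.children:
            lines.extend(child.outline(depth + 1))
        return lines


page = Block("Kyoto Aquarium", ["is an aquarium in Kyoto, Japan."], [
    Block("Overview", ["One of the largest inland aquariums."]),
    Block("Information", [], [
        Block("Holidays", ["Open throughout the year."]),
        Block("Opening Hours", ["From 9 a.m. to 5 p.m."]),
    ]),
    Block("History", [], [
        Block("2010", ["Jul. Construction started."]),
        Block("2012", ["Feb. Construction finished.",
                       "Mar. Opened just as planned.",
                       "Jul. Welcomed the 1Mth visitor."]),
    ]),
])

print("\n".join(page.outline()))
```

Printing the outline gives one line per heading, indented by its level in the hierarchy.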
slide-16
SLIDE 16

Importance of heading structure

Example: searching the Kyoto Aquarium page (shown on slide 7) with the query “2010 Mar”

  • Traditional search engines:
      • This page contains both words
      • They return this page incorrectly
  • Heading-aware Boolean retrieval:
      • “Mar.” occurs under “2012”, not “2010”
      • It can reject this page correctly
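
The difference between the two kinds of retrieval can be sketched as follows. This is a minimal illustration under stated assumptions: the matching rule used here (every query term must appear in one content line or in the headings above that line) is an assumed simplification, and the names `PAGE`, `plain_match`, and `heading_aware_match` are hypothetical.

```python
import re

# Each entry: (headings of the enclosing blocks, content line).
# Only the History part of the example page is listed, for brevity.
PAGE = [
    (["Kyoto Aquarium", "History", "2010"], "Jul. Construction started."),
    (["Kyoto Aquarium", "History", "2012"], "Feb. Construction finished."),
    (["Kyoto Aquarium", "History", "2012"], "Mar. Opened just as planned."),
    (["Kyoto Aquarium", "History", "2012"], "Jul. Welcomed the 1Mth visitor."),
]


def words(text):
    """Lower-cased set of word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def plain_match(page, query):
    """Traditional Boolean AND: all terms occur anywhere in the page."""
    page_words = set()
    for headings, line in page:
        page_words |= words(" ".join(headings) + " " + line)
    return words(query) <= page_words


def heading_aware_match(page, query):
    """Assumed rule: all terms occur in one line plus the headings above it."""
    terms = words(query)
    return any(terms <= words(" ".join(headings) + " " + line)
               for headings, line in page)


print(plain_match(PAGE, "2010 Mar"))          # the flat match accepts the page
print(heading_aware_match(PAGE, "2010 Mar"))  # the heading-aware match rejects it
```

With this rule, a query such as “2012 Mar” still matches, because “Mar.” sits under the “2012” heading.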

slide-20
SLIDE 20

Problem to be solved

  • Hierarchical heading structure is useful
  • It seems easy to extract the structure
  • In fact, it’s NOT easy

Our research problem: Extraction of hierarchical heading structure

slide-23
SLIDE 23

Hierarchical heading structure extraction is NOT easy

  • HTML has tags for describing headings
      • H1 to H6 and DT tags
  • These tags are not always used, and are sometimes used incorrectly
      • In our data set:
          • Only 32% of headings were marked up with these tags
          • Only 67% of components marked up with these tags were headings
  • A more sophisticated extraction method is necessary
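
For reference, a tag-based extractor of the kind criticized above might look like the sketch below. It is an illustration only, written with Python's standard `html.parser`; the class name `NaiveHeadingExtractor` and the sample markup are hypothetical, and the sketch ignores heading tags that are never closed.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "dt"}


class NaiveHeadingExtractor(HTMLParser):
    """Collects the text inside H1 to H6 and DT elements, nothing else."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._open = []      # stack of currently open heading tags
        self._buffer = []    # text seen inside the outermost open heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and tag == self._open[-1]:
            self._open.pop()
            if not self._open:
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.headings.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)


def naive_headings(html):
    parser = NaiveHeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = ("<h1>Kyoto Aquarium</h1><p>is an aquarium in Kyoto, Japan.</p>"
          "<b>2010</b><ul><li><b>Jul.</b> Construction started.</li></ul>")
print(naive_headings(sample))
```

In this sample, the heading “2010” is marked up with B rather than a heading tag, so the tag-based extractor misses it. This is the kind of miss that the 32% figure above describes.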

slide-25
SLIDE 25

Humans use visual style

  • How do humans extract hierarchical heading structure?
  • They use visual style
      • Visual style consists of various visual attributes of components
      • e.g. font-size, color

slide-27
SLIDE 27

Visual style can be easily detected

  • Visual style is assigned to each DOM node
      • A DOM node is a pair of tags or a text fragment delimited by tags

[Figure: DOM tree for the markup below, with nodes LI, B, and the text node “Jul.”]

<LI> <B> Jul. </B> Construction started. </LI>
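
To make the notion of a DOM node concrete, the markup above can be parsed and its nodes listed. This is a small sketch using `xml.dom.minidom` from the Python standard library. It is an XML parser, so it only works here because the fragment happens to be well-formed. It does not compute visual style, which requires a rendering engine. The helper name `list_nodes` is hypothetical.

```python
from xml.dom import minidom


def list_nodes(markup):
    """Return (depth, description) for each element and non-blank text node."""
    doc = minidom.parseString(markup)
    found = []

    def walk(node, depth):
        if node.nodeType == node.ELEMENT_NODE:
            found.append((depth, node.tagName))
        elif node.nodeType == node.TEXT_NODE and node.data.strip():
            found.append((depth, "text: " + node.data.strip()))
        for child in node.childNodes:
            walk(child, depth + 1)

    walk(doc.documentElement, 0)
    return found


nodes = list_nodes("<LI> <B> Jul. </B> Construction started. </LI>")
for depth, description in nodes:
    print("  " * depth + description)
```

The listing shows four nodes: the LI element, the B element inside it, and two text nodes (“Jul.” inside B, and “Construction started.” directly inside LI).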

slide-30
SLIDE 30

Visual style can be easily detected

  • Visual style is assigned to each DOM node
      • A DOM node is a pair of tags or a text fragment delimited by tags
  • Visual style can be easily detected by computers
  • We use it to extract hierarchical heading structure

slide-32
SLIDE 32

Disadvantages of existing methods

  • There exist some methods that use the visual style of nodes
  • Existing methods
      • check nodes one-by-one [Okada, Arakawa]
      • compare two nodes and judge which one is more likely to be a heading
  • The information available to them is too limited

[Okada, Arakawa] H. Okada and H. Arakawa. Automated extraction of non <h>-tagged headers in webpages by decision trees. In Proc. of SICE Annual Conf., pages 2117–2120, 2011.

slide-33
SLIDE 33

Disadvantages of existing methods

  • There exist some methods that use the visual style of nodes
  • Existing methods
      • check nodes one-by-one
      • compare two nodes and judge which one is more likely to be a heading [Pembe, Güngör]
  • They do not use global information within the given page

[Pembe, Güngör] F. C. Pembe and T. Güngör. A tree learning approach to web document sectional hierarchy extraction. In Proc. of ICAART, pages 447–450, 2010.
slide-35
SLIDE 35

Our idea

  • To use more information, our method
      • groups nodes by visual style into node sets
      • judges whether each node set is a set of actual headings
  • Each node set is
      • a set of headings of the same level
      • or a set of non-headings
slide-37
SLIDE 37

Example node sets

  • Node sets indicated by color

[Figure: the Kyoto Aquarium example page with each node set shown in its own color; one node set is marked as an example set of actual headings]

slide-38
SLIDE 38

Example node sets

  • Node sets indicated by color

[Figure: the Kyoto Aquarium example page with each node set shown in its own color; one node set is marked as an example set of non-heading components]

slide-39
SLIDE 39

Outline of our method

1. Group candidate headings
2. Sort node sets by significance of their style
3. For each node set in descending order of significance
  • 3.1 Judge if the node set is a set of actual headings
  • 3.2 For actual headings, also extract corresponding blocks

[Figure: the Kyoto Aquarium example page with each node labeled by its node set number, 1 to 7]
slide-46
SLIDE 46

Step 1. Group candidate headings

  • Candidate heading nodes: a single text or image node
  • Group candidates with exactly the same attribute values

Three types of attributes for grouping

1. Visual attribute values computed by web browsers
  • Font-size, font-style, font-weight, text-decoration, and color
2. Tag path
  • Sequence of node names between a node and the root
  • e.g. /HTML/BODY/TABLE/TR/TD/UL/LI/text()
3. Height of images
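
The grouping step can be sketched as a dictionary keyed by the attribute values. This is a minimal illustration, not the authors' implementation: the `Candidate` record, its field names, and the sample values are assumptions, and in practice the visual attributes would come from a browser's computed style.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """A candidate heading node with the attributes used for grouping (hypothetical)."""
    text: str
    font_size: float
    font_style: str
    font_weight: int
    text_decoration: str
    color: str
    tag_path: str
    image_height: Optional[int] = None  # only meaningful for image nodes


def group_candidates(candidates):
    """Group candidates whose attribute values are exactly the same."""
    node_sets = defaultdict(list)
    for candidate in candidates:
        key = candidate[1:]  # every field except the text itself
        node_sets[key].append(candidate)
    return list(node_sets.values())


candidates = [
    Candidate("2010", 16, "normal", 700, "none", "#000", "/HTML/BODY/DIV/B/text()"),
    Candidate("Jul.", 14, "normal", 700, "none", "#000", "/HTML/BODY/DIV/UL/LI/B/text()"),
    Candidate("2012", 16, "normal", 700, "none", "#000", "/HTML/BODY/DIV/B/text()"),
]
for node_set in group_candidates(candidates):
    print([c.text for c in node_set])
```

Here “2010” and “2012” end up in one node set because every attribute matches, while “Jul.” differs in font size and tag path and forms its own set.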

slide-47
SLIDE 47

Step 2. Sort node sets by significance of their style

Four sort keys in this priority order

1. Depth of corresponding blocks in hierarchy
  • because blocks never include blocks at upper levels
2. Font-size
3. Font-weight
4. Document order
  • because the heading of a parent block usually appears earlier than that of a child block
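
The four keys above map naturally onto a tuple sort key. The sketch below assumes a direction for each key that the slide does not state explicitly: shallower blocks first, larger font size first, heavier font weight first, and earlier document position first. The `NodeSet` record and its fields are hypothetical, and the block depth is assumed to be known when sorting.

```python
from typing import NamedTuple


class NodeSet(NamedTuple):
    """Summary of a node set for sorting (hypothetical fields)."""
    name: str
    block_depth: int      # depth of the corresponding blocks in the hierarchy
    font_size: float
    font_weight: int
    first_position: int   # document order of the first node in the set


def sort_by_significance(node_sets):
    """Most significant node set first, using the four keys in priority order."""
    return sorted(
        node_sets,
        key=lambda s: (s.block_depth, -s.font_size, -s.font_weight, s.first_position),
    )


node_sets = [
    NodeSet("months", 2, 12, 700, 5),
    NodeSet("title", 0, 24, 700, 0),
    NodeSet("years", 1, 14, 700, 4),
    NodeSet("sections", 1, 18, 700, 1),
]
print([s.name for s in sort_by_significance(node_sets)])
```

Negating the numeric keys keeps everything in one ascending sort, so ties on an earlier key fall through to the next one.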

slide-48
SLIDE 48

Step 3. Scan node sets in order of significance

  • Our method
      • recursively scans node sets in descending order of their significance
      • when an actual heading set is found, extracts the blocks corresponding to the headings
  • Two sub-steps
      • 3.1 Judge if a node set is an actual heading set
      • 3.2 Detect the corresponding blocks from headings

slide-49
SLIDE 49

Step 3.1 Judging if a node set is an actual heading set

5 heuristic rules

  • e.g. all headings in one parent block are unique
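
The uniqueness rule named above could be checked as in the sketch below. The slides do not give the other four rules or the exact comparison used, so the text normalization (case folding and whitespace collapsing) and the function name are assumptions, and the “Read more” data is invented for illustration.

```python
from collections import Counter


def headings_unique_per_parent(node_set):
    """node_set: list of (parent_block_id, heading_text) pairs.

    Returns True when no parent block contains the same heading text twice.
    """
    counts = Counter(
        (parent, " ".join(text.split()).lower()) for parent, text in node_set
    )
    return all(count == 1 for count in counts.values())


years = [("History", "2010"), ("History", "2012")]
links = [("News", "Read more"), ("News", "Read  more")]  # hypothetical repeated label

print(headings_unique_per_parent(years))
print(headings_unique_per_parent(links))
```

Under this rule the year nodes remain plausible headings, while a label repeated inside one parent block is rejected.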
slide-50
SLIDE 50

Step 3.2 Detecting corresponding blocks from headings

  • When a node set passes all the rules, our method
      • regards it as an actual heading set
      • detects blocks corresponding to the headings
  • To determine blocks from headings, our method uses the correspondence between them and DOM sub-trees
      • A heading corresponds to a single text or image node
      • A block corresponds to a node array, a set of adjoining sibling nodes and their descendants

slide-51
SLIDE 51

[Figure: DOM tree of the “History” part of the example page: a DIV node, B nodes for “2010” and “2012”, and UL, LI, B, and text nodes for the items “Jul.”, “Feb.”, “Mar.”, and “Jul.”]

slide-52
SLIDE 52

[Figure: the same DOM tree, with the “2010” node array and the “2012” node array marked]

slide-54
SLIDE 54

[Figure: the same DOM tree, with the “2010” heading, the “2012” heading, and the lowest common ancestor of the two headings marked]

slide-55
SLIDE 55

[Figure: the same DOM tree, with the first node of the “2010” node array and the first node of the “2012” node array marked]

slide-56
SLIDE 56

[Figure: the same DOM tree, with the last node of the “2010” node array and the last node of the “2012” node array marked]
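
One way to read the figures above is that the children of the headings' lowest common ancestor are cut into consecutive node arrays, each starting at the child that holds a heading and ending just before the next one. The sketch below follows that reading. The boundary rule is an assumption inferred from the figure labels, and the function name and the string labels standing in for DOM nodes are hypothetical.

```python
def split_into_node_arrays(children, heading_indexes):
    """Split the ordered children of the lowest common ancestor into node arrays.

    children: child nodes of the lowest common ancestor, in document order.
    heading_indexes: sorted positions of the children that contain a heading.
    Each array runs from one heading's child up to, but not including, the next.
    """
    arrays = []
    for i, start in enumerate(heading_indexes):
        if i + 1 < len(heading_indexes):
            end = heading_indexes[i + 1]
        else:
            end = len(children)
        arrays.append(children[start:end])
    return arrays


# Children of the DIV node in the figure, written as plain labels.
children = ["B(2010)", "UL(Jul.)", "B(2012)", "UL(Feb., Mar., Jul.)"]
arrays = split_into_node_arrays(children, [0, 2])
for array in arrays:
    print(array)
```

With the headings at positions 0 and 2, the “2010” node array is the first B and UL pair and the “2012” node array is the second pair.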

slide-57
SLIDE 57

Experimental setting

To evaluate our method

  • 803 random pages from ClueWeb09
      • To exclude spam pages, only pages relevant to some intents in the TREC Web track were collected
  • For each page, 1 of 5 annotators hand-annotated the hierarchical heading structure in its content body
      • Fleiss’ Kappa: .693 for headings and .583 for blocks

slide-60
SLIDE 60

Evaluation result (heading extraction)

  • The decision tree learning method did not work well
      • Most test pages did not share visual style with training pages
  • Our method achieved far better recall while retaining about the same precision as the naïve method

Method                                    Precision  Recall  F1-score
Decision tree learning [Okada, Arakawa]   .084       .884    .154
Naïve method based on tag names           .668       .320    .433
Our method                                .638       .569    .602

[Okada, Arakawa] H. Okada and H. Arakawa. Automated extraction of non <h>-tagged headers in webpages by decision trees. In Proc. of SICE Annual Conf., pages 2117–2120, 2011.
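
As a reading aid, the F1-score is the harmonic mean of precision and recall. The snippet below recomputes it from the rounded precision and recall values in the table. How the reported scores were aggregated is not stated on the slides, so small differences in the last digit are to be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the heading extraction table.
results = {
    "Decision tree learning": (0.084, 0.884, 0.154),
    "Naive method based on tag names": (0.668, 0.320, 0.433),
    "Our method": (0.638, 0.569, 0.602),
}
for method, (precision, recall, reported) in results.items():
    print(f"{method}: computed {f1_score(precision, recall):.3f}, reported {reported:.3f}")
```

The recomputed values agree with the reported ones to within rounding of the inputs.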

slide-63
SLIDE 63

Evaluation result (block extraction)

  • VIPS did not work well
      • because its extraction target is layout structure
      • VIPS is complementary to our method
  • Our method: accuracy close to that of heading extraction
      • Extracted blocks from actual headings with a precision of .769

Method        Precision  Recall  F1-score
VIPS [Cai+]   .215       .070    .106
Our method    .586       .563    .574

[Cai+] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma. VIPS: A vision-based page segmentation algorithm. Technical Report MSR–TR–2003–79, Microsoft Research, 2003.
slide-64
SLIDE 64

Conclusion

  • Extraction of hierarchical heading structure is important for various applications of the web
  • We proposed a method based on the idea that headings of the same level share their visual style
  • Our method achieved high recall and satisfactory precision
  • Our code and data sets will be available online
      • https://github.com/tmanabe