SLIDE 1

Content Extraction from Webpages Using Machine Learning

Master's Thesis

Hamza Yunis

Bauhaus Universität

26.01.2017

Supervised by:

  • Prof. Benno Stein
  • Dr. Andreas Jakoby

Advised by:

  • Johannes Kiesel
SLIDE 2

Motivation

Hamza Yunis (Bauhaus Universität) Content Extraction from Webpages Using Machine Learning 2/35

SLIDE 10

What is the Main Content?

SLIDE 17

What is the Main Content?

Definition (i): The main content is what the webpage is supposed to communicate according to the publisher.

  • We cannot always tell what the webpage publisher wants to communicate.
  • A single webpage may have different publishers, each wanting to communicate a different type of information.

Definition (ii): The main content is what makes the webpage interesting to the user.

  • Different users may have different interests in the webpage.

Definition (iii): The main content of a webpage consists of information that cannot be found in other webpages.

  • This definition is usually used in template recognition.

SLIDE 21

What is the Main Content?

The main content is the non-noisy content!

SLIDE 22

What is the Noisy Content?

SLIDE 28

What is the Noisy Content?

  • Advertisements.
  • Navigation links.
  • Links to promoted webpages.
  • Legal information.
  • Irrelevant information.
  • Input elements.

SLIDE 29

Types of HTML Elements

SLIDE 33

Types of HTML Elements

  • Content elements.
  • Inline semantic elements.
  • Sectioning elements.

<ul>
  <li>List item 1.</li>
  <li>List item 2.</li>
</ul>
<div>
  <p>This is the <span class="important">first</span> paragraph.</p>
  <p>This is the <span class="important">second</span> paragraph.</p>
</div>

SLIDE 42

Types of HTML Elements

Elements to Be Classified

  • Paragraph elements: <p>.
  • <div> elements, if they do not have content element descendants.
  • Table cell elements: <th> and <td>, if they do not have content element descendants.
  • List item elements: <li>.
  • Header elements: <h1>, <h2>, <h3>, <h4>, <h5>, and <h6>.
  • Image elements: <img>.
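
The selection rule above can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that "content element" means any tag from the list above, and the names `TreeBuilder`, `Node`, and `elements_to_classify` are hypothetical. Only the standard-library `html.parser` module is used, and the input is assumed to be reasonably well-formed.

```python
from html.parser import HTMLParser

# Tags that are always candidates for classification (from the slide).
ALWAYS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6", "img"}
# Tags that are candidates only when they have no content element descendants.
CONDITIONAL = {"div", "td", "th"}
# Assumption: a "content element" is any tag in either set above.
CONTENT = ALWAYS | CONDITIONAL
# Void elements never receive an end tag, so they must not stay open.
VOID = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree (text nodes are ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def has_content_descendant(node):
    return any(
        child.tag in CONTENT or has_content_descendant(child)
        for child in node.children
    )


def elements_to_classify(html):
    """Return the tag names of the selected elements in document order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    selected = []

    def visit(node):
        if node.tag in ALWAYS:
            selected.append(node.tag)
        elif node.tag in CONDITIONAL and not has_content_descendant(node):
            selected.append(node.tag)
        for child in node.children:
            visit(child)

    visit(builder.root)
    return selected
```

In this sketch a <div> that wraps a <p> is skipped in favour of the <p>, while a <div> that holds only text is itself selected.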

SLIDE 43

Learning Workflow

SLIDE 49

Learning Workflow

HTML Documents
  → Raw Features Extraction → CSV Documents
  → Concatenation → CSV Document (Feature Vectors)
  → Derived Features Extraction → Enhanced Feature Vectors
  → Learning → Classifier
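
The first two stages of this workflow (one CSV document per HTML document, then concatenation into a single CSV document of feature vectors) might look roughly like the following Python sketch. The column names, the `label` column, and the helper names `rows_to_csv` and `concatenate` are assumptions made for illustration. The learning stage is left out because the slides do not name the learning algorithm.

```python
import csv
import io

# Hypothetical column layout for one raw feature vector.
FIELDS = ["ancestor_names", "ancestor_classes", "inner_text", "label"]


def rows_to_csv(rows):
    """Serialize the raw feature rows of one HTML document as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def concatenate(csv_documents):
    """Merge several per-document CSV texts into one list of feature vectors."""
    merged = []
    for text in csv_documents:
        # Each document carries its own header row; DictReader consumes it.
        merged.extend(csv.DictReader(io.StringIO(text)))
    return merged
```

Here `csv.DictWriter` quotes values that contain commas, such as an ancestor list, so they survive the round trip through `csv.DictReader` unchanged.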

SLIDE 50

Feature Engineering

SLIDE 51

Feature Engineering

<div id="comments-section">
  <div class="comment">
    <div class="comment-header">
      <ul>
        <li>Author name.</li>
        <li>Comment title.</li>
      </ul>
    </div>
    <div class="comment-content">
      <p>The body of the comment</p>
    </div>
  </div>
  ...
</div>

SLIDE 54

Feature Engineering

<div id="comments-section">
  <div class="comment">
    <div class="comment-header">
      <ul>
        <li>Author name.</li>
        <li>Comment title.</li>
      </ul>
    </div>
    ...

Raw features:

  ancestor_names="div, div, div, ul"
  ancestor_classes="NO_CLASSES, comment, comment-header, NO_CLASSES"
  inner_text="Author name."

Derived features:

  is_desc_comment="1"
  is_desc_cookies="0"
  is_desc_section="1"
  inner_text_length=2
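
One way the derived features could be computed from the raw features is sketched below in Python. This is an illustration under stated assumptions, not the thesis code. Feature names are written with underscores. The keyword flags are assumed to search the ancestor classes together with a hypothetical `ancestor_ids` raw feature, which would explain why the "section" flag is set by the id "comments-section". The text length is assumed to be counted in words.

```python
def derive_features(raw, keywords=("comment", "cookies", "section")):
    """Compute keyword flags and text length for one raw feature row."""
    # Assumption: keywords are matched against ancestor classes and ids.
    haystack = " ".join(
        raw.get(key, "") for key in ("ancestor_classes", "ancestor_ids")
    ).lower()
    derived = {f"is_desc_{word}": int(word in haystack) for word in keywords}
    # Assumption: length is the number of whitespace-separated words.
    derived["inner_text_length"] = len(raw.get("inner_text", "").split())
    return derived


# Raw features of the first <li> in the example above.
raw_example = {
    "ancestor_names": "div, div, div, ul",
    "ancestor_classes": "NO_CLASSES, comment, comment-header, NO_CLASSES",
    # Hypothetical raw feature; "NO_ID" is a made-up placeholder.
    "ancestor_ids": "comments-section, NO_ID, NO_ID, NO_ID",
    "inner_text": "Author name.",
}
derived_example = derive_features(raw_example)
```

For this row the sketch sets the "comment" and "section" flags to 1, the "cookies" flag to 0, and the text length to 2, matching the slide. It returns the flags as integers, whereas the slide shows them as quoted strings.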

SLIDE 55

Evaluation

SLIDE 58

Evaluation

Evaluation Workflow

HTML Documents
  → Raw Features Extraction → CSV Documents
  → Concatenation → CSV Document (Feature Vectors)
  → Derived Features Extraction → Enhanced Feature Vectors
  → Evaluation (using the Classifier) → Results

SLIDE 59

Evaluation

Confusion matrix:

                    Predicted Class
Actual Class      "Noisy"    "Main"
"Noisy"             tn         fp
"Main"              fn         tp

Evaluation metrics:

$$\text{precision} = \frac{tp}{tp + fp}$$

$$\text{recall} = \frac{tp}{tp + fn}$$

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$
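
The three metrics translate directly into code. The Python sketch below is a plain transcription of the formulas, with "Main" as the positive class. The function names are chosen for illustration, and the functions do not guard against empty denominators.

```python
def precision(tp, fp):
    """Fraction of elements predicted "Main" that are actually "Main"."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual "Main" elements that were predicted "Main"."""
    return tp / (tp + fn)


def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

With the element-based counts for textual content elements (tp = 1018, fp = 211, fn = 277), these functions give values within 0.001 of the reported precision, recall, and F1.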

SLIDE 60

Evaluation

Element-based results for textual content elements:

                    Predicted Class
Actual Class      "Noisy"    "Main"
"Noisy"            4625        211
"Main"              277       1018

precision = 0.828
recall = 0.786
F1 = 0.806

SLIDE 61

Evaluation

Text-based results for textual content elements:

                    Predicted Class
Actual Class      "Noisy"    "Main"
"Noisy"           496921      19618
"Main"             28654     163908

precision = 0.893
recall = 0.851
F1 = 0.871

SLIDE 62

Evaluation

Element-based results for small and medium-size images (≤ 40000px):

                    Predicted Class
Actual Class      "Noisy"    "Main"
"Noisy"             900          5
"Main"               97         25

precision = 0.833
recall = 0.205
F1 = 0.328

SLIDE 63

Evaluation

Element-based results for large images (> 40000px):

                    Predicted Class
Actual Class      "Noisy"    "Main"
"Noisy"             122          6
"Main"               10         29

precision = 0.828
recall = 0.743
F1 = 0.783

SLIDE 64

Thank you for your attention!
