Content Extraction from Webpages Using Machine Learning
Master’s Thesis Hamza Yunis
Bauhaus Universität
26.01.2017
Supervised by: Prof. Benno Stein, Dr. Andreas Jakoby
Advised by: Johannes Kiesel
Hamza Yunis (Bauhaus Universität) Content Extraction from Webpages Using Machine Learning 2 /35
Definition (i): The main content is what the webpage is supposed to communicate according to the publisher.
    We cannot always tell what the webpage publisher wants to communicate.
    A single webpage may have different publishers, each wanting to communicate a different type of information.
Definition (ii): The main content is what makes the webpage interesting to the user.
    Different users may have different interests in the webpage.
Definition (iii): The main content of a webpage consists of information that cannot be found in other webpages.
    This definition is usually used in template recognition.
Advertisements.
Navigation links.
Links to promoted webpages.
Legal information.
Irrelevant information.
Input elements.
Content elements.
Inline semantic elements.
Sectioning elements.
<ul>
  <li>List item 1.</li>
  <li>List item 2.</li>
</ul>
<div>
  <p>This is the <span class="important">first</span> paragraph.</p>
  <p>This is the <span class="important">second</span> paragraph.</p>
</div>
Elements to Be Classified
Paragraph elements: <p>.
<div> elements.
    If they do not have content element descendants.
Table cell elements: <th> and <td>.
    If they do not have content element descendants.
List item elements: <li>.
Header elements: <h1>, <h2>, <h3>, <h4>, <h5>, and <h6>.
Image elements: <img>.
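The selection rule above can be sketched with Python's standard `html.parser`. This is only an illustration, not the thesis implementation: it assumes that the "content elements" are exactly the tag types listed on this slide, and the names `CandidateCollector` and `select_candidates` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: the "content elements" are exactly the tag types listed above.
ALWAYS_CLASSIFIED = {"p", "li", "img", "h1", "h2", "h3", "h4", "h5", "h6"}
LEAF_ONLY = {"div", "th", "td"}  # classified only without content descendants
CONTENT_TAGS = ALWAYS_CLASSIFIED | LEAF_ONLY
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # never have an end tag


class CandidateCollector(HTMLParser):
    """Collects the tag names of elements that would be passed to the classifier."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # entries: [tag, has_content_descendant]
        self.candidates = []     # selected tags, in the order they are closed

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            # Every currently open element now has a content element descendant.
            for entry in self.open_elements:
                entry[1] = True
        if tag in VOID_TAGS:
            if tag in ALWAYS_CLASSIFIED:
                self.candidates.append(tag)
            return
        self.open_elements.append([tag, False])

    def handle_endtag(self, tag):
        open_tags = [entry[0] for entry in self.open_elements]
        if tag not in open_tags:
            return  # stray end tag, or end tag of a void element
        # Close the innermost open element with this tag; unclosed children are dropped.
        index = len(open_tags) - 1 - open_tags[::-1].index(tag)
        has_content_descendant = self.open_elements[index][1]
        del self.open_elements[index:]
        if tag in ALWAYS_CLASSIFIED or (tag in LEAF_ONLY and not has_content_descendant):
            self.candidates.append(tag)


def select_candidates(html):
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    return collector.candidates
```

A `<div>` that only wraps text is selected, while a `<div>` that wraps a `<p>` or `<li>` is skipped in favour of its descendants.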
Workflow diagram (processing steps in brackets): HTML Documents -> [Raw Features Extraction] -> CSV Documents -> [Concatenation] -> CSV Document (Feature Vectors) -> [Derived Features Extraction] -> Enhanced Feature Vectors -> [Learning] -> Classifier
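The concatenation step could look roughly like the following sketch, which merges several per-page CSV texts into one CSV text with a single header row. The function name `concatenate_csv` and the column names in the usage example are hypothetical, and the sketch assumes that every per-page document has the same columns.

```python
import csv
import io


def concatenate_csv(documents):
    """Merge CSV texts that share one header into a single CSV text."""
    out = io.StringIO()
    writer = None
    header = None
    for text in documents:
        reader = csv.DictReader(io.StringIO(text))
        if writer is None:
            # The first document defines the header of the merged document.
            header = reader.fieldnames
            writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
            writer.writeheader()
        elif reader.fieldnames != header:
            raise ValueError("all CSV documents must have the same columns")
        writer.writerows(reader)
    return out.getvalue()


# Usage with two made-up per-page documents (column names are illustrative).
page_a = 'inner_text,label\n"Author name.",main\n'
page_b = 'inner_text,label\nImprint,noisy\n'
merged = concatenate_csv([page_a, page_b])
```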
<div id="comments-section">
  <div class="comment">
    <div class="comment-header">
      <ul>
        <li>Author name.</li>
        <li>Comment title.</li>
      </ul>
    </div>
    <div class="comment-content">
      <p>The body of the comment</p>
    </div>
  </div>
  ...
</div>
<div id="comments-section">
  <div class="comment">
    <div class="comment-header">
      <ul>
        <li>Author name.</li>
        <li>Comment title.</li>
      </ul>
    </div>
    ...
Raw features (for the first <li> element):
ancestor_names="div, div, div, ul"
ancestor_classes="NO CLASSES, comment, comment-header, NO CLASSES"
inner_text="Author name."
Derived features:
is_desc_comment="1"
is_desc_cookies="0"
is_desc_section="1"
inner_text_length=2
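A minimal sketch of how such derived features might be computed from the raw ones. Several points are assumptions rather than facts from the slides: the keyword list is limited to the three keywords shown, the keyword test is a substring match on ancestor classes and ids (an `ancestor_ids` raw feature is assumed, because "section" appears only in the id `comments-section`), and `inner_text_length` is taken to be a word count because "Author name." yields 2.

```python
# Hypothetical keyword list; the slide shows only these three keywords.
KEYWORDS = ("comment", "cookies", "section")
NO_CLASSES = "NO CLASSES"  # placeholder used for ancestors without classes


def derive_features(raw):
    """Derive keyword flags and a text length from one raw feature record.

    raw is a dict with the keys "ancestor_classes", "ancestor_ids"
    (an assumed raw feature) and "inner_text".
    """
    attributes = [
        value.lower()
        for value in raw.get("ancestor_classes", []) + raw.get("ancestor_ids", [])
        if value and value != NO_CLASSES
    ]
    derived = {}
    for keyword in KEYWORDS:
        # 1 if any ancestor class or id contains the keyword, else 0.
        derived[f"is_desc_{keyword}"] = int(any(keyword in a for a in attributes))
    # Assumption: the length is measured in whitespace-separated words.
    derived["inner_text_length"] = len(raw["inner_text"].split())
    return derived


# Raw features of the first <li> element from the example above.
raw_example = {
    "ancestor_names": ["div", "div", "div", "ul"],
    "ancestor_classes": ["NO CLASSES", "comment", "comment-header", "NO CLASSES"],
    "ancestor_ids": ["comments-section", "", "", ""],  # assumed raw feature
    "inner_text": "Author name.",
}
derived_example = derive_features(raw_example)
```

Under these assumptions the sketch reproduces the four derived values shown on the slide.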
Evaluation Workflow
Workflow diagram (processing steps in brackets): HTML Documents -> [Raw Features Extraction] -> CSV Documents -> [Concatenation] -> CSV Document (Feature Vectors) -> [Derived Features Extraction] -> Enhanced Feature Vectors + Classifier -> [Evaluation] -> Results
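Confusion matrix (rows: actual class, columns: predicted class):

                  Predicted "Noisy"   Predicted "Main"
Actual "Noisy"    tn                  fp
Actual "Main"     fn                  tp

Evaluation metrics:
$\text{precision} = \frac{tp}{tp + fp}$
$\text{recall} = \frac{tp}{tp + fn}$
$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$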
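The three metrics can be computed directly from the confusion matrix counts, with "Main" as the positive class. The sketch below follows the formulas above; the function name `f_beta_scores` and the zero-division handling are illustrative choices, not part of the thesis.

```python
def f_beta_scores(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) for the positive class "Main"."""
    # Guard against empty denominators; returning 0.0 is an illustrative choice.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f_beta = (1 + beta_sq) * precision * recall / denominator
    return precision, recall, f_beta


# Usage with the element-based counts reported for textual content elements.
precision, recall, f1 = f_beta_scores(tp=1018, fp=211, fn=277)
```

With beta = 1 the score weights precision and recall equally; beta > 1 favours recall and beta < 1 favours precision.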
Element-based results for textual content elements (rows: actual class, columns: predicted class):

                  Predicted "Noisy"   Predicted "Main"
Actual "Noisy"    4625                211
Actual "Main"     277                 1018

precision = 0.828, recall = 0.786, F1 = 0.806
Text-based results for textual content elements (rows: actual class, columns: predicted class):

                  Predicted "Noisy"   Predicted "Main"
Actual "Noisy"    496921              19618
Actual "Main"     28654               163908

precision = 0.893, recall = 0.851, F1 = 0.871
Element-based results for small and medium-size images (≤ 40000px) (rows: actual class, columns: predicted class):

                  Predicted "Noisy"   Predicted "Main"
Actual "Noisy"    900                 5
Actual "Main"     97                  25

precision = 0.833, recall = 0.205, F1 = 0.328
Element-based results for large images (> 40000px) (rows: actual class, columns: predicted class):

                  Predicted "Noisy"   Predicted "Main"
Actual "Noisy"    122                 6
Actual "Main"     10                  29

precision = 0.828, recall = 0.743, F1 = 0.783