Content Extraction from Webpages Using Machine Learning

  1. Content Extraction from Webpages Using Machine Learning. Master’s Thesis. Hamza Yunis, Bauhaus Universität, 26.01.2017. Supervised by: Prof. Benno Stein, Dr. Andreas Jakoby. Advised by: Johannes Kiesel.

  2. Motivation. Hamza Yunis (Bauhaus Universität), Content Extraction from Webpages Using Machine Learning, 2/35

  17. What is the Main Content?
      Definition (i): The main content is what the webpage is supposed to communicate according to the publisher.
      We cannot always tell what the webpage publisher wants to communicate.
      A single webpage may have different publishers, each wanting to communicate a different type of information.
      Definition (ii): The main content is what makes the webpage interesting to the user.
      Different users may have different interests in the webpage.
      Definition (iii): The main content of a webpage consists of information that cannot be found in other webpages.
      Usually used in template recognition.
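Definition (iii) can be illustrated with a small sketch. It is not the method of the thesis, only a toy version of the idea behind template recognition: text blocks that recur across several pages of a site are treated as template material, and whatever is left counts as main content. The function name `unique_blocks`, the representation of a page as a list of text blocks, and the sample pages are all assumptions made for this example.

```python
from collections import Counter


def unique_blocks(pages):
    """Return, for each page, the text blocks that occur on no other page.

    `pages` is a list of pages; each page is a list of text blocks (strings).
    This is a hypothetical helper for illustration only.
    """
    # Count on how many pages each block occurs (at most once per page).
    page_counts = Counter()
    for blocks in pages:
        page_counts.update(set(blocks))
    # Keep only blocks seen on exactly one page, preserving their order.
    return [[b for b in blocks if page_counts[b] == 1] for blocks in pages]


# Made-up sample data: two pages sharing a navigation bar and a footer.
pages = [
    ["Home | About | Contact", "Article about cats.", "(c) Example Site"],
    ["Home | About | Contact", "Article about dogs.", "(c) Example Site"],
]
result = unique_blocks(pages)
```

Exact string matching is the simplest possible comparison; a real template detector would likely need to tolerate small differences between pages.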

  21. What is the Main Content? The main content is the non-noisy content!

  28. What is the Noisy Content?
      Advertisements.
      Navigation links.
      Links to promoted webpages.
      Legal information.
      Irrelevant information.
      Input elements.

  33. Types of HTML Elements
      Content elements. Inline semantic elements. Sectioning elements.
      <ul> <li>List item 1.</li> <li>List item 2.</li> </ul>
      <div> <p>This is the <span class="important">first</span> paragraph.</p> <p>This is the <span class="important">second</span> paragraph.</p> </div>

  36. Types of HTML Elements: Elements to Be Classified
      Paragraph elements: <p>.
      <div> elements.
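As a rough illustration of what gathering the elements to be classified might look like, the sketch below collects every `<p>` and `<div>` element from an HTML string together with its text. It uses only Python's standard `html.parser` module. The class name `CandidateCollector` is hypothetical, the code assumes that every `<p>` and `<div>` has an explicit closing tag, and it says nothing about the features or the classifier used in the thesis.

```python
from html.parser import HTMLParser


class CandidateCollector(HTMLParser):
    """Collect the text of every <p> and <div> element (illustrative only)."""

    TARGETS = {"p", "div"}

    def __init__(self):
        super().__init__()
        self.open = []        # stack of (tag, text_parts) for open target elements
        self.candidates = []  # (tag, text) pairs, in the order elements close

    def handle_starttag(self, tag, attrs):
        if tag in self.TARGETS:
            self.open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every target element that is currently open,
        # so a <div> also receives the text of the <p> elements inside it.
        for _, parts in self.open:
            parts.append(data)

    def handle_endtag(self, tag):
        # Assumes well-formed nesting with explicit closing tags.
        if tag in self.TARGETS and self.open and self.open[-1][0] == tag:
            tag, parts = self.open.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            self.candidates.append((tag, text))


# The <div> example from the slide on types of HTML elements.
html = (
    '<div> <p>This is the <span class="important">first</span> paragraph.</p> '
    '<p>This is the <span class="important">second</span> paragraph.</p> </div>'
)
collector = CandidateCollector()
collector.feed(html)
collector.close()
candidates = collector.candidates
```

Each `(tag, text)` pair could then be turned into a feature vector and labeled as main or noisy content. Inline elements such as `<span>` are not collected separately; their text is simply folded into the enclosing paragraph.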
