Large Crawls of the Web for Linguistic Purposes Marco Baroni - PowerPoint PPT Presentation

Marco Baroni, SSLMIT, University of Bologna. Birmingham, July 2005.


  1. Large Crawls of the Web for Linguistic Purposes. Marco Baroni, SSLMIT, University of Bologna. Birmingham, July 2005.

  2. Outline
     1. Introduction
     2. Selecting seed urls
     3. Crawling: Basics; Heritrix; My ongoing crawl
     4. Post-processing: Filtering and cleaning; Language identification; Near-duplicate spotting
     5. Conclusion: Annotation; Indexing, etc.; Summing up and open issues

  3. The WaCky approach (http://wacky.sslmit.unibo.it)
     - Current target: 1-billion-token English, German and Italian Web corpora by 2006.
     - Use existing open tools; make developed tools publicly available.
     - Please join us (for other languages as well!)

  4. The basic steps
     - Select “seed” urls.
     - Crawl.
     - Post-processing.
     - Linguistic annotation.
     - Indexing, etc.

  9. Selecting seed urls
     - Query the Google search engine with random word combinations.
     - Start the crawl from the urls discovered in this way.
     - Which random words? Middle-frequency words from a general/newspaper corpus (“public”); a basic vocabulary list (“private”).
     - How random are the urls collected in this way? Ongoing work with Massimiliano Ciaramita (ISTC, Rome).
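The query-building step can be sketched as follows. This is a minimal illustration, not the WaCky tooling: the function name `random_queries`, the two-words-per-query default, and the tiny word list are all assumptions, and the actual submission of queries to a search engine is deliberately left out.

```python
import math
import random


def random_queries(words, n_queries, words_per_query=2, seed=None):
    """Build distinct search queries from random word combinations.

    `words` stands in for a middle-frequency or basic-vocabulary list.
    """
    rng = random.Random(seed)
    vocab = sorted(set(words))
    if len(vocab) < words_per_query:
        raise ValueError("not enough distinct words to build a query")
    # Never ask for more queries than there are distinct combinations.
    target = min(n_queries, math.comb(len(vocab), words_per_query))
    queries = set()
    while len(queries) < target:
        # Sort each combination so "a b" and "b a" count as the same query.
        combo = sorted(rng.sample(vocab, words_per_query))
        queries.add(" ".join(combo))
    return sorted(queries)


if __name__ == "__main__":
    for query in random_queries(["haus", "zeit", "stadt", "jahr"], 3, seed=0):
        print(query)
```

The urls returned for each query would then be pooled, for instance keeping one url per domain, to form the seed list.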

  11. Crawling
      - Fetch pages, extract links.
      - Follow links, fetch pages.
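The fetch/extract/follow loop can be sketched in a few lines. This is a toy breadth-first crawler for illustration only, not how Heritrix works internally; the `fetch` callable is an assumption that lets the loop run against an in-memory "web" instead of the network, and the names `LinkExtractor`, `extract_links` and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free urls linked from a page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)   # urls waiting to be fetched
    seen = set(seeds)         # urls already queued, to avoid refetching
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


if __name__ == "__main__":
    web = {
        "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
        "http://a.example/b": '<a href="/">home</a>',
        "http://c.example/": "no links here",
    }
    print(list(crawl(["http://a.example/"], web.get)))
```

A production crawler adds politeness delays, robots.txt checks, scope rules and a persistent frontier on top of this loop.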

  18. Important in a good crawler
      - Honoring robots.txt, politeness
      - Efficiency, multi-threading, robust “Frontier”
      - Avoid spider traps
      - Control over crawl scope
      - Progress monitoring
      - Intelligent management of downloaded text
      - Works out of the box, reasonable defaults
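The first requirement, honoring robots.txt and being polite, can be illustrated with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the user-agent string `example-crawler`, the sample robots.txt, and the helper names are made up for illustration, and the fixed per-host delay is just one simple politeness policy.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent):
    """Return a function telling whether `user_agent` may fetch a url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


def seconds_to_wait(last_fetch_time, now, min_delay):
    """Politeness: how long to wait before hitting the same host again."""
    return max(0.0, min_delay - (now - last_fetch_time))


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    allowed = make_robots_checker(robots, "example-crawler")
    print(allowed("http://example.org/index.html"))
    print(allowed("http://example.org/private/page.html"))
```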

  22. Heritrix (http://crawler.archive.org/)
      - Free/open crawler of the Internet Archive
      - Very active, supportive community that includes linguists and machine learning experts

  23. The Heritrix WUI

  24. The output of Heritrix
      - Documents distributed across gzipped “arc” files no larger than 100 MB.
      - Info about retrieved docs (fingerprints, size, path) in arc file headers and in log files.

  25. My German crawl
      - On a server running RH Fedora Core 3 with 4 GB RAM, Dual Xeon 4.3 GHz CPUs, about 1.1 TB hard disk space.
      - Seeded from random Google queries for SDZ and basic vocabulary list terms: 8631 urls, all from different domains.
      - SURT scope: http:(at, http:(de,
      - Tom Emerson’s regexp to “focus on HTML”.
      - For most settings, Heritrix defaults.
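A SURT scope restricts the crawl to urls whose host, written with its labels reversed, starts with a given prefix, so a prefix ending in `(de,` covers every `.de` host. The sketch below is a simplified illustration and not Heritrix's implementation: it ignores ports, user info and query strings, and it assumes the `http://(de,` spelling of the prefixes used in Heritrix's SURT notation rather than the abbreviated form on the slide.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT form: scheme, reversed host labels, then the path."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is already lowercased
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{parts.scheme}://({reversed_host},){parts.path}"


def in_scope(url, surt_prefixes):
    """True if the url's SURT form starts with any accepted prefix."""
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)


if __name__ == "__main__":
    scope = ("http://(at,", "http://(de,")
    print(to_surt("http://www.example.de/page.html"))
    print(in_scope("http://www.example.at/", scope))
    print(in_scope("http://www.example.com/", scope))
```

Because the prefix ends in a comma, a host under a longer top-level domain such as `.dev` is not mistaken for a `.de` host.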

  26. Current status of crawl
      - In about a week: retrieved about 265 GB, about 54 GB of arc files.
      - In earlier experiments, 7 GB of arc files yielded about 250M words after cleaning.
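As a back-of-the-envelope reading of these figures, one can extrapolate the earlier yield (7 GB of arc files giving about 250M words) to the current 54 GB. The linear scaling is an assumption of this sketch, not a claim from the slides: the actual yield depends on the language, the crawl settings and the cleaning steps.

```python
def estimate_words(arc_gb, ref_arc_gb=7.0, ref_words=250e6):
    """Naive linear extrapolation from a reference arc size and word count."""
    return arc_gb / ref_arc_gb * ref_words


if __name__ == "__main__":
    estimate = estimate_words(54)
    print(f"{estimate / 1e9:.2f} billion words")  # prints "1.93 billion words"
```

Under that assumption, a week of crawling would already be in the range of the 1-billion-token target.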

  30. Post-processing
      - Various forms of filtering, boilerplate stripping
      - Language identification
      - Near-duplicate identification
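The slides do not say which near-duplicate method is used. One common technique, shown here purely as an illustration, compares documents by the overlap of their word n-grams ("shingles") using Jaccard similarity. The function names, the n-gram size and the 0.8 threshold are assumptions, and the pairwise comparison is quadratic, so a crawl-sized collection would need sampling or hashing on top of it.

```python
def shingles(text, n=5):
    """Set of word n-grams of a text (one short shingle for tiny texts)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, n=5, threshold=0.8):
    """Return id pairs whose shingle sets overlap at least `threshold`."""
    ids = sorted(docs)
    sets = {doc_id: shingles(docs[doc_id], n) for doc_id in ids}
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if jaccard(sets[first], sets[second]) >= threshold:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog near the river bank today",
        "b": "the quick brown fox jumps over the lazy dog near the river bank tonight",
        "c": "web corpora need careful cleaning before linguistic annotation and indexing",
    }
    print(near_duplicates(docs, n=3, threshold=0.8))
```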

  31. Filtering as you crawl...
      - Wouldn’t it be nice to filter as you crawl?
