
Open Source Tools for Mining and Analysing Web Data @ Scale



  1. Open Source Tools for Mining and Analysing Web Data @ Scale. Kris Carpenter Negulescu, Internet Archive. Annual Meeting, Washington DC, July 20, 2011

  2. Key Problems to Address & Primary Benefits…
     • Archived Web Data is often isolated, difficult to link to other related resources by topic, and minimally navigable
     • Benefits of mining and analysis:
       • Mapping relationships between links over time
       • Geo-location maps
       • Tag clouds
       • Classification
       • Facets
       • Rate of change
       • Related information
       • Enhanced keyword search

  3. The Tool Box
     • HDFS
     • MapReduce
     • Pig Latin
     • Web archive code – metadata extraction jar
     • Other extraction layers: Tika, Jhove(2), etc.
     • Google Analytics APIs/Drupal modules, Neo4j, etc.

  4. Web Archive Transformation (WAT) - a structured way of storing metadata generated by Web Crawls
     • ARCs and WARCs are “heavy”
     • WAT – Web Archive Transformation file
       • Uses WARC format as a generic metadata container
       • Extract everything you're likely to want from ARCs/WARCs once
     • Store into HDFS; Part of standard ingest process

  5. Web archive code: metadata extractor
     • The WAT utilities produce structured metadata optimized for data analysis, i.e. JavaScript Object Notation (JSON), from compressed (gzipped) or uncompressed ARC or WARC files.
       • Currently just a bit of glue code around an ARC/WARC reader whose function is HTML metadata extraction
       • JSON data is written to STDOUT in compressed (GZIP) format. The ARC or WARC file can be a local file, an HTTP-accessible file (http://), or a Hadoop File System (HDFS) accessible file (hdfs://).
     • Includes example “UDF” code
     • Will integrate with Jhove(2), Tika, etc.
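
The idea on slides 4 and 5 (extract HTML metadata once, keep it as JSON inside a WARC-style container, and write it out gzipped) can be illustrated with a small sketch. This is not the WAT utilities' code: the names `MetadataParser`, `extract_metadata`, `build_metadata_record` and `write_gzipped`, the JSON field names, and the choice of record headers are all assumptions made for illustration, and the real WAT files may lay out their records and JSON differently.

```python
import gzip
import json
import uuid
from datetime import datetime, timezone
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the page title and outgoing link targets from HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(url, html):
    """Return a small JSON-serialisable dict (hypothetical field names)."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "links": parser.links}


def build_metadata_record(target_uri, metadata):
    """Wrap the JSON payload in a WARC-style 'metadata' record (assumed layout)."""
    payload = json.dumps(metadata, sort_keys=True).encode("utf-8")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: metadata",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/json",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLF pairs after the payload block.
    return head + payload + b"\r\n\r\n"


def write_gzipped(records, out):
    """Write already-serialised records to a binary stream as one gzip member."""
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        for record in records:
            gz.write(record)
```

To mimic the "written to STDOUT in compressed (GZIP) format" behaviour, one could call `write_gzipped([build_metadata_record(url, extract_metadata(url, html))], sys.stdout.buffer)` after `import sys`. Keeping only the title and links is a deliberate simplification; the point is that downstream analysis reads these light records instead of reopening the "heavy" ARC/WARC files.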
