

SLIDE 1

Session 6A - Big data sources: web scraping and smart meters

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies

Giulio Barcaroli (*) (barcarol@istat.it), Monica Scannapieco (*) (scannapi@istat.it), Marco Scarnò (**) (m.scarno@cineca.it), Donato Summa (*) (donato.summa@istat.it)
(*) Istituto Nazionale di Statistica (Istat)
(**) Consorzio Interuniversitario per il Calcolo Automatico (CINECA)

NTTS 2015

SLIDE 2

Web scraping is the process of automatically collecting information from the World Wide Web, using tools (called scrapers, internet robots, crawlers, spiders, etc.) that navigate websites, extract their content and store the scraped data in local databases for subsequent processing. We can distinguish two kinds of web scraping:

  • 1. specific web scraping, when both the structure and the content of the websites to be scraped are perfectly known, and crawlers just have to replicate the behaviour of a human visiting the website and collecting the information of interest. Typical areas of application: data collection for consumer price indices (ONS, CBS, Istat);

  • 2. generic web scraping, when no a priori knowledge of the content is available, and the whole website is scraped and subsequently processed in order to infer the information of interest.
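The "specific" flavour can be illustrated with a small sketch: the page structure is known a priori, so the extractor targets fixed elements. A minimal Python example using only the standard library (the HTML snippet, class names and price are invented for illustration, not taken from the cited applications):

```python
from html.parser import HTMLParser

# Invented example page: in specific scraping the structure is known in
# advance, so we can target the <span class="price"> element directly.
PAGE = ('<html><body><span class="product">Olive oil 1l</span>'
        '<span class="price">4.99</span></body></html>')

class PriceExtractor(HTMLParser):
    """Collects the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(float(data))
            self._in_price = False

parser = PriceExtractor()
parser.feed(PAGE)
print(parser.prices)  # → [4.99]
```

Generic scraping, by contrast, would store the whole page and leave the extraction of meaning to a later mining step.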

Web scraping definition and types

SLIDE 3

An application on «ICT in enterprises» survey

SLIDE 4

Different solutions for web scraping are being investigated, based on the use of:

  • (i) the Apache suite Nutch/Solr (https://nutch.apache.org) for crawling, content extraction, indexing and searching of results. Nutch is a highly extensible and scalable open source web crawler; it facilitates parsing, indexing, building a search engine and customizing search according to needs, and provides scalability, robustness and scoring filters for custom implementations;

  • (ii) HTTrack (http://www.httrack.com/), a free and open source software tool that permits to locally "mirror" a website, by downloading each page that composes its structure. In technical terms it is a web crawler and an offline browser;

  • (iii) JSOUP (http://jsoup.org), which permits to parse and extract the structure of an HTML document. It has been integrated in a specific step of the ADaMSoft system (http://adamsoft.sourceforge.net), the latter selected as it already includes facilities to handle huge data sets and textual information.
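The kind of output the parsing step (iii) makes available can be sketched as follows. JSOUP itself is a Java library; this Python stdlib stand-in only illustrates the sort of result — visible text plus outgoing links — that such a parse produces for later processing (the document is invented):

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Rough stand-in for what a JSOUP-style parse step yields:
    the visible text plus the outgoing links of a page."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

doc = '<p>Contact us</p><a href="/jobs">Open positions</a>'
p = TextAndLinks()
p.feed(doc)
print(p.text, p.links)  # → ['Contact us', 'Open positions'] ['/jobs']
```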

Web scraping different techniques and tools

SLIDE 5

These techniques are evaluated by taking into account:

  • 1. efficiency: number of websites actually scraped out of the total, and execution performance;

  • 2. effectiveness: completeness and richness of the collected text, which can influence the quality of prediction.

Web scraping solutions evaluation

SLIDE 6

Web scraping techniques evaluation: efficiency

Solution | Websites reached   | Avg. webpages per site | Time spent | Type of storage                          | Storage size
Nutch    | 7020/8550 = 82.1%  | 15.2                   | 32.5 hours | Binary files on HDFS                     | 2.3 GB (data), 5.6 GB (index)
HTTrack  | 7710/8550 = 90.2%  | 43.5                   | 6.7 days   | HTML files on file system                | 16.1 GB
JSOUP    | 7835/8550 = 91.6%  | 68                     | 11 hours   | HTML in ADaMSoft compressed binary files | 500 MB
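The reach rates above can be reproduced directly from the raw counts (8,550 websites attempted in total):

```python
# Reach rate = websites actually scraped / websites attempted,
# figures taken from the efficiency evaluation above.
attempted = 8550
reached = {"Nutch": 7020, "HTTrack": 7710, "JSOUP": 7835}

for tool, n in reached.items():
    print(f"{tool}: {100 * n / attempted:.1f}%")
# → Nutch: 82.1%, HTTrack: 90.2%, JSOUP: 91.6%
```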


SLIDE 7

Web scraping techniques evaluation: effectiveness

The evaluation of the effectiveness of the different solutions is based on applying the text and data mining steps to the collected data in order to predict a subset of the target information of the survey. The developed application is available on the ADaMSoft website: http://adamsoft.sourceforge.net/appscripts.html


SLIDE 8

Prediction of survey information by text and data mining

Application of Naïve Bayes to predict all questions in section B8.

Question B8: "indicate if the website has any of the following facilities"

Performance of Naïve Bayes:

Facility                                                                     | Precision | Sensitivity | Specificity | Observed proportion | Predicted proportion
a) Online ordering or reservation or booking (web sales functionality)       | 0.78      | 0.50        | 0.86        | 0.21                | 0.21
b) Tracking or status of orders placed                                       | 0.82      | 0.49        | 0.85        | 0.18                | 0.11
c) Description of goods or services, price lists                             | 0.62      | 0.44        | 0.79        | 0.48                | 0.32
d) Personalized content in the website for regular/repeated visitors         | 0.74      | 0.41        | 0.78        | 0.09                | 0.23
e) Possibility for visitors to customize or design online goods or services  | 0.86      | 0.53        | 0.87        | 0.05                | 0.14
f) A privacy policy statement, a privacy seal or a website safety certificate | 0.59     | 0.57        | 0.64        | 0.68                | 0.51
g) Advertisement of open job positions or online job application             | 0.69      | 0.52        | 0.78        | 0.35                | 0.33
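The three quality measures are standard confusion-matrix statistics. A minimal sketch of their definitions (the counts below are invented, chosen only so the results land near row a of the table):

```python
def precision(tp, fp):
    """Correct positives among all predicted positives."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Correct positives among all actual positives (recall)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Correct negatives among all actual negatives."""
    return tn / (tn + fp)

# Invented counts for illustration: 50 true positives, 14 false positives,
# 50 false negatives, 86 true negatives.
tp, fp, fn, tn = 50, 14, 50, 86
print(precision(tp, fp), sensitivity(tp, fn), specificity(tn, fp))
```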


SLIDE 9

So far, the three different solutions for web scraping have been applied to a limited number of websites (related to the subset of enterprises that responded to the sampling survey and declared to have a website: 8,600). The next step is the scraping of all the websites owned by the enterprises included in the population of interest (212,000). Two problems:

  • 1. URLs retrieval: how to identify all the websites owned by the 212,000 enterprises (between 90,000 and 100,000 are expected to own a website);

  • 2. massive scraping: how to increase efficiency when scaling by a factor of 10: O(10^4) → O(10^5)

Web scraping: from sample to whole population

SLIDE 10

Web scraping: URLs retrieval

General idea: for each enterprise:

  • 1. query search engines with the enterprise denomination;
  • 2. process the first ten URLs retrieved in order to choose the right one for the given enterprise.

Processing: a) matching of the enterprise information (denomination, fiscal code, etc., available from administrative data) against the content of the first ten URLs retrieved; b) use of the subset of enterprises (from survey data) for which the correct URL is known as a training set, in order to maximise the precision of the choice function; c) application of the choice function to the whole set.
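The matching step a) can be sketched as a scoring ("choice") function over candidate pages. Everything below — enterprise name, URLs, fiscal code and weights — is invented for illustration; in the actual procedure the function's parameters would be learned from the training set of enterprises whose URL is known:

```python
def match_score(page_text, enterprise):
    """Toy choice function: score a candidate page by how many known
    enterprise identifiers (from administrative data) it mentions.
    Weights are invented placeholders, not learned values."""
    score = 0
    if enterprise["name"].lower() in page_text.lower():
        score += 2
    if enterprise["fiscal_code"] in page_text:
        score += 3
    return score

# Invented enterprise record and candidate search results.
enterprise = {"name": "Rossi Srl", "fiscal_code": "01234567890"}
candidates = {
    "http://rossi-srl.example.it": "Benvenuti in Rossi Srl - P.IVA 01234567890",
    "http://unrelated.example.it": "Another company entirely",
}

best = max(candidates, key=lambda url: match_score(candidates[url], enterprise))
print(best)  # → http://rossi-srl.example.it
```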


SLIDE 11

Use of Nutch on top of MapReduce / Hadoop to harness parallelism.

Completed tasks:

 enhancement of Nutch by using the following plugins:

  • HTML-Plugin (Nutch custom search) to retrieve HTML tags
  • Metatag plugin (urlmeta) to add custom metatag information

 integration of Nutch with the analysis activities in order to execute the whole process

Future task:

 deployment and execution of ADaMSoft/JSOUP and Nutch on the CINECA PICO platform (1,080 cores, 54 nodes, 6.9 TB RAM); HTTrack has been abandoned due to its scalability problems.

http://www.cineca.it/en/news/pico-cineca-new-platform-data-analytics-applications
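The parallelism that the MapReduce/Hadoop deployment provides over cluster nodes can be sketched locally with a worker pool. The fetch-and-parse step is replaced by a pure stand-in so no network access is involved, and all URLs are invented:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape(url):
    """Stand-in for the real fetch+parse step (no network here):
    returns the URL paired with a simulated page size."""
    return url, len(url) * 100

urls = [f"http://site{i}.example.it" for i in range(100)]

# Distribute the scraping tasks over a pool of workers, analogously to
# how the MapReduce deployment distributes them over cluster nodes.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(scrape, urls))

print(len(results))  # → 100
```

At cluster scale the same partition-and-collect pattern applies, with Nutch segments taking the place of the in-process task list.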

Web scraping: mass scraping

SLIDE 12
  • 1. A first remark is that a scraping task can be carried out for different purposes in an Official Statistics production environment, and the choice of a single tool for all purposes may not always be possible.

  • 2. As for this specific case, the final evaluation of the different solutions will depend on the results of their execution for massive scraping on an adequate platform (PICO).

  • 3. Finally, we highlight that the scraping application presented here is a sort of "generalized" scraping task, as it does not require any specific assumption on the structure of the websites. In this sense it goes a step further with respect to previous experiences.

Conclusions