 
Automatic price collection on the internet (web scraping)
Ingolf Boettcher
Brussels, 10 March 2015
NTTS 2015 - Session 6A - Big data sources: web scraping and smart meters
www.statistik.at | Wir bewegen Informationen (We move information)
Web scraping
There is a huge amount of data on the internet. How can we best collect/scrape/harvest data from there for statistical purposes?
[Slide graphic: <HTML> <HEAD> <TITLE> DATA </TITLE> </HEAD> </HTML>]
Web scraping: Internet data collection
Minimum goal for (price) statistics: turn website content into a spreadsheet.
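The "website content into a spreadsheet" goal can be illustrated with a minimal sketch. It assumes the prices sit in a plain HTML table. The sample markup, the class `TableParser` and the function `html_table_to_csv` are hypothetical names introduced for illustration only. Real shop pages are usually structured differently and need site-specific handling.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment (an assumption, not taken from a real website).
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Milk 1l</td><td>1.09</td></tr>
  <tr><td>Bread 500g</td><td>2.49</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being read, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


print(html_table_to_csv(SAMPLE_HTML))
```

The resulting CSV text can be saved to a file and opened in any spreadsheet program.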
Web scraping: Internet data collection
Options:
1. Manual price collection
2. Develop an API / web scraper
2.1 by writing custom computer code
2.2 by using point-and-click web tools
Web scraping: Reasons for not writing your own web scraper
An IT developer is needed, therefore:
• Expensive
• Inflexible
• Even maintenance cannot be handled by CPI staff
Web scraping: Reasons to use point-and-click web tools for web scraping
No IT developer is needed, therefore:
• Cheap
• Flexible
• No programming skills required
Web scraping: What point-and-click web scraping with import.io looks like
• Web platform that lets users structure and extract data from websites
Web scraping with import.io
Web scraping
Point-and-click web scraping on a web-based platform offers solutions to:
• extract data by point-and-click
• record actions on a website
• crawl all the data of a webpage
More issues to be considered:
• Legality of crawling websites
• Internal IT security
• Training of staff
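On the crawling question, one common first check is a site's robots.txt file. It is not a legal statement, and it does not replace reading the terms of use, but it shows which paths the site operator asks crawlers to avoid. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt content, the user agent name `price-collector`, the domain `shop.example` and the helper `may_fetch` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption). In practice the file
# would be downloaded from the shop's own domain before scraping starts.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Product pages are not disallowed in the sample file, the checkout path is.
print(may_fetch(SAMPLE_ROBOTS_TXT, "price-collector", "https://shop.example/products/milk"))
print(may_fetch(SAMPLE_ROBOTS_TXT, "price-collector", "https://shop.example/checkout/cart"))
```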
Automatic price collection on the internet (web scraping)
Contact: Ingolf Boettcher
Guglgasse 13, 1110 Wien
Tel: +43 (1) 71128-7917
Fax: +43 (1) 7180718
Ingolf.boettcher@statistik.gv.at
www.statistik.at