Web Scraping: Two ways to mine data from the web

Jul 04, 2023

Web Scraping

Two ways to mine data from the web:

◮ The hard way, by web scraping
◮ The easy way, using web service APIs

We’ll see examples of both.

Web scraping, a.k.a. screen scraping, means getting data from a web page. Suppose we want to get the current wind data for a city from Open Weather Map.

What is a Web Page?

A web page is a chunk of text containing HTML code. The browser "renders" the HTML graphically. So web scraping means analyzing text using Python’s text processing features.

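Because a page is just text, ordinary string methods already work on it. The following is a minimal sketch, not part of the original slides: the helper name text_between and the sample markup are made up for illustration.

```python
# A tiny, made-up page held in an ordinary Python str.
SAMPLE_PAGE = "<html><body><h1>Metz, FR</h1><p>Wind: 3.6 m/s</p></body></html>"


def text_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


heading = text_between(SAMPLE_PAGE, "<h1>", "</h1>")
paragraph = text_between(SAMPLE_PAGE, "<p>", "</p>")
```

Plain str.find and slicing are enough here; regular expressions become useful once the surrounding markup varies.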
Finding the Data on the Page

First you need to find the data within the HTML code for a page so you can construct a regex. Your browser’s developer features can help you find the data, for example by right-clicking the value on the page and choosing Inspect.

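As a text-only complement to the browser's developer tools, you can also search the raw HTML for a label you can see on the page and look at the markup around it. This is a rough sketch under assumptions: show_context is a hypothetical helper, and the sample markup is invented to resemble the structure described in these slides.

```python
import re

# Invented markup, loosely modeled on the weather widget described in the slides.
SAMPLE_HTML = (
    '<div class="weather-widget"><table>'
    "<tr><td>Wind</td><td>Gentle Breeze, 3.6 m/s</td></tr>"
    "</table></div>"
)


def show_context(page_text, keyword, width=40):
    """Return each occurrence of keyword with `width` characters on either side."""
    snippets = []
    for match in re.finditer(re.escape(keyword), page_text):
        lo = max(0, match.start() - width)
        hi = min(len(page_text), match.end() + width)
        snippets.append(page_text[lo:hi].replace("\n", " "))
    return snippets


snippets = show_context(SAMPLE_HTML, "Wind", width=10)
```

Each snippet shows the tags immediately before and after the label, which is the part a regex has to describe.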
Getting the Web Page’s HTML Code

To get the HTML code of the web page into a Python string variable that you can play with, use Python’s urllib.request module:

    import urllib.request

    # 2994160 is the city code for Metz, FR
    request = urllib.request.Request("http://www.openweathermap.com/city/2994160")
    response = urllib.request.urlopen(request)
    page_bytes = response.read()
    page_text = page_bytes.decode()  # page_text is a Python str containing the HTML code

or with the requests module, which is what we’ll use:

    import requests

    resp = requests.get("http://www.openweathermap.com/city/2994160")
    resp.text  # the text of the web page

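Calling decode() with no arguments assumes the page is UTF-8. The sketch below is one way to honor the character set the server declares instead. It is an illustration, not the slides' method: the function names, the timeout value, and the User-Agent string are all assumptions.

```python
import urllib.request
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def fetch_html(url, timeout=10.0):
    """Fetch a page and decode it using the charset the server declares."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = urllib.request.Request(url, headers={"User-Agent": "scraping-demo/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        # errors="replace" keeps going if a few bytes do not decode cleanly.
        return response.read().decode(charset, errors="replace")
```

With the requests module this step is handled for you: resp.text is already decoded using the encoding requests infers from the response.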
Extracting the Data

It looks like the wind data is in the second <td> element after the <div class="weather-widget"> tag, following a <td>Wind</td> element. We can play around with the HTML text in the Python REPL. We eventually end up with:

    import re

    wind = re.findall(r'<td>Wind</td><td>(.+?)</td>', page_text.replace("\n", ""))[0]

Notice that we used a capture group to get the element data. The non-greedy (.+?) stops at the first </td> rather than running on to the last one on the page.

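Indexing the findall result with [0] raises IndexError if the page layout changes and nothing matches. Here is a slightly more defensive sketch of the same idea. The sample markup is hypothetical, written to match the structure described above, and extract_wind is an illustrative name.

```python
import re

# Hypothetical markup following the structure described in the slide.
SAMPLE_HTML = """<div class="weather-widget">
<table>
<tr><td>Wind</td>
<td>Gentle Breeze, 3.6 m/s</td></tr>
<tr><td>Cloudiness</td>
<td>Clear sky</td></tr>
</table>
</div>"""

# \s* tolerates whitespace or newlines between the two cells,
# so the newlines do not need to be stripped first.
WIND_PATTERN = re.compile(r"<td>Wind</td>\s*<td>(.+?)</td>")


def extract_wind(page_text):
    """Return the text of the cell after <td>Wind</td>, or None if not found."""
    match = WIND_PATTERN.search(page_text)
    return match.group(1).strip() if match else None


wind = extract_wind(SAMPLE_HTML)
```

Returning None lets the caller decide what to do when the page no longer contains the expected cells.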
Aside: Parsing HTML

HTML is a context-free language, which roughly means that it supports arbitrary nesting of elements. For example, you could have arbitrarily nested div elements with "leaf" elements containing text data, e.g.:

    <div>
      <div>
        <div>some text</div>
      </div>
    </div>

By the rules of HTML, you could nest div tags as deeply as you want. Regular expressions match regular languages, which don’t support arbitrary nesting. So how can we use regexes to "parse" HTML?

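To make the nesting point concrete, here is a small sketch that uses the standard library's html.parser module to track how deeply div elements are nested. Tracking depth requires a counter that grows with the input, which is the kind of bookkeeping a regular language cannot express. The class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class DivDepthCounter(HTMLParser):
    """Tracks the current and maximum nesting depth of <div> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


def max_div_depth(html_text):
    """Return the deepest level of <div> nesting in html_text."""
    counter = DivDepthCounter()
    counter.feed(html_text)
    counter.close()
    return counter.max_depth


def nested_divs(depth, text="some text"):
    """Build `depth` nested divs around a text leaf."""
    return "<div>" * depth + text + "</div>" * depth
```

For example, max_div_depth(nested_divs(3)) counts the three levels in the snippet shown above, and the same code works unchanged for any depth.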
Regex Matching in HTML Code

Parsing means scanning the linear sequence of symbols in a string to determine its structure (usually by putting the symbols in a tree). We don’t need to parse HTML to find data on a web page. While the HTML language supports arbitrary nesting, a particular web page will be nested to a particular depth, resulting in a simple linear sequence of symbols that we can match with a regular expression.
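A quick way to convince yourself of this: wrap the same fragment in more and more div elements and check that a regex describing only the local sequence of symbols still matches. This is a sketch with invented sample markup and helper names.

```python
import re

# Matches the local sequence "<td>Wind</td><td>...</td>" and nothing else.
WIND_PATTERN = re.compile(r"<td>Wind</td>\s*<td>(.+?)</td>")

# Invented fragment modeled on the wind row described earlier.
ROW = "<table><tr><td>Wind</td><td>Gentle Breeze, 3.6 m/s</td></tr></table>"


def wrap_in_divs(fragment, depth):
    """Nest fragment inside `depth` div elements."""
    return "<div>" * depth + fragment + "</div>" * depth


def wind_at_depth(depth):
    """Extract the wind text from ROW nested `depth` levels deep."""
    match = WIND_PATTERN.search(wrap_in_divs(ROW, depth))
    return match.group(1) if match else None


results = [wind_at_depth(depth) for depth in (0, 1, 5, 100)]
```

The pattern never inspects the enclosing tags, so the nesting depth of the page does not affect the match.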
