Web Scraping (Non-Static Websites), Ben Williams, October 9th, 2020



Non-Static Websites
• Dynamic Websites
• APIs
Dynamic Websites
• Drop-downs
• Scrolling
• Pop-ups
• Entering passwords
Examples
Web-Crawling
• Automate movement through websites
• Navigate to the website, then use the techniques Ryan showed us
• Navigation done “remotely” via code
Example: Airbnb Plus
• Airbnb Plus: Airbnb's differentiation program
• Hosts apply to be part of the Plus program
• Variety of benefits once part of the program
• Compare the effect of the Plus program's introduction
• How do we determine which listings are Plus?
• Work with Karen Xie
Airbnb Plus
1) Identify main city page
2) Check if there are multiple listing pages
3) Scrape current page
4) Click on next page if applicable
5) Determine which listings have “plus” in their URL
[Figure: example listing URL, labeled “Listing ID Number” and “Plus Identifier”]
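Step 5 can be sketched in Python as a plain URL check. This is a minimal sketch, not the code used in the project: the URL shapes (`/rooms/plus/<id>` for Plus listings, `/rooms/<id>` otherwise), the function name `parse_listing_url`, and the example URLs are all assumptions for illustration, so inspect real listing URLs before relying on the pattern.

```python
import re
from typing import Optional, Tuple

# Assumed URL shapes (hypothetical, for illustration only):
#   https://www.airbnb.com/rooms/plus/12345  -> Plus listing
#   https://www.airbnb.com/rooms/12345       -> regular listing
LISTING_PATTERN = re.compile(r"/rooms/(?P<plus>plus/)?(?P<listing_id>\d+)")


def parse_listing_url(url: str) -> Optional[Tuple[str, bool]]:
    """Return (listing_id, is_plus), or None if the URL is not a listing."""
    match = LISTING_PATTERN.search(url)
    if match is None:
        return None
    return match.group("listing_id"), match.group("plus") is not None


if __name__ == "__main__":
    example_urls = [
        "https://www.airbnb.com/rooms/plus/12345",
        "https://www.airbnb.com/rooms/67890?adults=2",
        "https://www.airbnb.com/s/Denver/homes",
    ]
    for url in example_urls:
        print(url, parse_listing_url(url))
```

Returning the ID and the Plus flag together gives one row per listing, matching the two pieces labeled on the slide.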
Examples
Take a break! Should we click through pages?
Dynamic Web-Scraping
• Each situation is unique
• Requires trial and error
• Tools:
  • Selenium (Python, R)
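Clicking through result pages with Selenium might look roughly like the sketch below. It assumes Selenium 4 for Python and a local Chrome install. The function names, the example URL, and both CSS selectors are hypothetical and will need trial and error on a real site. The fixed `time.sleep` is a crude wait; Selenium's `WebDriverWait` is usually more robust.

```python
import time
from typing import List

# Same string as selenium.webdriver.common.by.By.CSS_SELECTOR; written out
# here so the loop below does not need Selenium imported to be read or tested.
CSS_SELECTOR = "css selector"


def collect_links(driver, start_url: str, link_selector: str,
                  next_selector: str, max_pages: int = 20,
                  pause_seconds: float = 2.0) -> List[str]:
    """Scrape link URLs from each page, clicking 'next' while it exists."""
    driver.get(start_url)
    links: List[str] = []
    for _ in range(max_pages):
        # Scrape the current page.
        for element in driver.find_elements(CSS_SELECTOR, link_selector):
            href = element.get_attribute("href")
            if href:
                links.append(href)
        # Click on the next page if there is one; otherwise stop.
        next_buttons = driver.find_elements(CSS_SELECTOR, next_selector)
        if not next_buttons:
            break
        next_buttons[0].click()
        time.sleep(pause_seconds)  # crude wait for the next page to load
    return links


def run_with_chrome() -> List[str]:
    """Example driver setup; the URL and selectors are placeholders."""
    from selenium import webdriver  # requires `pip install selenium`

    driver = webdriver.Chrome()
    try:
        return collect_links(driver, "https://example.com/listings",
                             "a.listing-link", "a.next-page")
    finally:
        driver.quit()
```

Call `run_with_chrome()` to try it against a real browser. `max_pages` caps the loop so that a "next" button that never disappears cannot keep the crawl running forever.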
APIs
• Application Programming Interface
• “Easily” facilitated connection to apps, websites, etc.
• Another way to extract data from a website/platform
Some examples
APIs
• Pros:
  • Can make data collection very smooth
  • Popular APIs often have libraries/packages for common software (Python, R)
• Cons:
  • Restricted access (only a certain amount of data given per day)
  • Data not in the format of your choice
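The last con usually means reshaping nested JSON into a flat table. A minimal sketch using only the Python standard library follows. The payload shape and field names in the usage example are made up, and the helper names `flatten` and `json_to_csv` are hypothetical.

```python
import csv
import io
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def json_to_csv(payload: str) -> str:
    """Convert a JSON array of objects into CSV text."""
    records = [flatten(record) for record in json.loads(payload)]
    # Union of all keys in first-seen order; missing fields become blanks.
    columns = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '[{"id": 1, "user": {"name": "a"}}, {"id": 2, "user": {"name": "b"}}]'
    print(json_to_csv(sample))
```

Dotted column names such as `user.name` keep the nesting visible in the flat table.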
Example: Twitter
• What do you need?
  • A Twitter account!
  • `rtweet` R package
  • Could use Python as well
Example: Twitter
• What can I get?
  • Hashtags
  • Followers
  • Friends
  • Locations
  • Source (Android, iPhone, etc.)
• Basic: 18,000 tweets every 15 minutes from the “REST” API
• More advanced: the “streaming” API provides much more data
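The basic limit translates directly into collection time. The sketch below does the arithmetic under the simplifying assumption that each 15-minute window yields a full 18,000 tweets and that the first window is available immediately. The function names are hypothetical.

```python
def windows_needed(total_tweets: int, per_window: int = 18000) -> int:
    """Number of rate-limit windows needed to collect total_tweets."""
    return (total_tweets + per_window - 1) // per_window  # integer ceiling


def minimum_wait_minutes(total_tweets: int, per_window: int = 18000,
                         window_minutes: int = 15) -> int:
    """Waiting time implied by the limit, ignoring request latency."""
    # The first window needs no wait; each additional window adds one wait.
    return max(windows_needed(total_tweets, per_window) - 1, 0) * window_minutes


if __name__ == "__main__":
    for n in (18000, 50000):
        print(n, windows_needed(n), minimum_wait_minutes(n))
```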
Example: #fakenews
• Can we learn about the spread of #fakenews on Twitter?
• Scrape twice daily, look for #fakenews
• October 27th to December 11th, 2019
• Over 170,000 unique tweets
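Scraping twice daily means consecutive pulls can return the same tweet, so a count of unique tweets requires de-duplication. A minimal sketch, assuming each tweet is a dict carrying a unique ID; the field name `status_id` and the function name `merge_scrapes` are assumptions for illustration:

```python
from typing import Any, Dict, Iterable, List

Tweet = Dict[str, Any]


def merge_scrapes(batches: Iterable[Iterable[Tweet]],
                  id_field: str = "status_id") -> List[Tweet]:
    """Combine scrape batches, keeping the first copy of each tweet ID."""
    unique: Dict[Any, Tweet] = {}
    for batch in batches:
        for tweet in batch:
            # setdefault leaves an existing entry untouched, so duplicates
            # from later scrapes are dropped.
            unique.setdefault(tweet[id_field], tweet)
    return list(unique.values())


if __name__ == "__main__":
    morning = [{"status_id": "1", "text": "a"}, {"status_id": "2", "text": "b"}]
    evening = [{"status_id": "2", "text": "b"}, {"status_id": "3", "text": "c"}]
    print(len(merge_scrapes([morning, evening])))
```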
Example: #fakenews
Search for tweets that use the hashtag `#fakenews`
Simple code:
    library(rtweet)
    # Request up to 18,000 English-language tweets, excluding retweets
    search_tweets("#fakenews", n = 18000, include_rts = FALSE, lang = "en")
Example: #fakenews
[Figure: bar chart of counts by location, axis from 0 to 6,000. Locations shown: United States; USA; Florida, USA; California, USA; Texas, USA; Washington, DC; London, England; New York, USA; London; United Kingdom; Los Angeles, CA; UK; New York, NY; Florida; England, United Kingdom]
Example: #fakenews
[Figure: time series from Nov 01 to Dec 01, vertical axis from 100 to 400, with annotations “Impeachment Hearing”, “Epstein”, and “Kwong”]
After Scraping…
• Post-scraping analyses
  • Simple (sentiment analysis)
  • Complicated (machine learning)
• Many options, low-hanging fruit
• Text Mining with R (Silge & Robinson), tidytextmining.com
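The simple end of that range can be as small as a word-list lookup. The sketch below scores text by summing per-word values. The six-word lexicon and the function name `sentiment_score` are made up for illustration, and a real analysis would use a published sentiment lexicon instead.

```python
import re
from typing import Dict

# Toy lexicon, hypothetical and far too small for real use.
TOY_LEXICON: Dict[str, int] = {
    "good": 1, "great": 1, "honest": 1,
    "bad": -1, "fake": -1, "lie": -1,
}


def sentiment_score(text: str, lexicon: Dict[str, int] = TOY_LEXICON) -> int:
    """Sum lexicon values over the lowercase words in text (unknown words = 0)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


if __name__ == "__main__":
    for tweet in ("Great reporting, good work", "This is a bad lie"):
        print(tweet, sentiment_score(tweet))
```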
Take-aways
• Dream big about web-scraping!
• Different types of websites call for different approaches
• You can usually find a way to scrape the data
• Please do not hesitate to contact me for help/collaboration
• benjamin.williams@du.edu
