READING DATA FROM THE WEB
Jeff Goldsmith, PhD
Department of Biostatistics
Nov 05, 2022

Two major paths
• There’s data included as content on a webpage, and you want to “scrape” those data
  – Table from Wikipedia
  – Reviews from Amazon
  – Cast and characters on IMDb
• There’s a dedicated server holding data in a relatively usable form, and you want to ask for those data
  – Open NYC data
  – Data.gov
  – Star Wars API

Scraping web content
• Webpages combine HTML (content) and CSS (styling) to produce what you see
• When you retrieve the HTML for a page with data you want, you’ve retrieved the data
• You’ve also retrieved a lot of other stuff
• The challenge is extracting what you want from the HTML

Garrett Grolemund, “Extracting data from the web”
https://github.com/ropensci/user2016-tutorial

CSS Selectors
• Because CSS controls appearance, CSS identifiers appear throughout HTML code
• HTML elements you care about frequently have unique identifiers
• Extracting what you want from HTML is often a question of specifying an appropriate CSS Selector

Find the CSS Selector
• Selector Gadget is the most common tool for finding the right CSS selector on a page
  – In a browser, go to the page you care about
  – Launch the Selector Gadget
  – Click on things you want
  – Unclick things you don’t
  – Iterate until only what you want is highlighted
  – Copy the CSS Selector

Scraping data into R
• rvest facilitates web scraping
• Workflow is:
  – Download HTML using read_html()
  – Extract nodes using html_nodes() and your CSS Selector
  – Extract content from nodes using html_text(), html_table(), etc.

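The three-step workflow above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the HTML snippet, the `.review` class, and the `#scores` id are all made up, and the page is passed to read_html() as a literal string so the sketch runs without a network connection. With a real page you would pass its URL to read_html() and use a selector found with Selector Gadget.

```r
library(rvest)

# Hypothetical page content (an assumption for illustration only).
# In practice this would be read_html("<URL of the page you care about>").
page_html <- read_html(paste0(
  "<html><body>",
  "<h1>Example page</h1>",
  "<p class='review'>Great product</p>",
  "<p class='review'>Would not buy again</p>",
  "<table id='scores'>",
  "<tr><th>name</th><th>score</th></tr>",
  "<tr><td>A</td><td>1</td></tr>",
  "<tr><td>B</td><td>2</td></tr>",
  "</table>",
  "</body></html>"
))

# Step 2 and 3 for text: select nodes by CSS class, then pull out their text.
review_nodes <- html_nodes(page_html, ".review")
review_text <- html_text(review_nodes)

# Step 2 and 3 for a table: select by id, then convert to a data frame.
# html_table() on a set of nodes returns a list with one entry per table.
score_tables <- html_table(html_nodes(page_html, "#scores"))
scores_df <- score_tables[[1]]

print(review_text)
print(scores_df)
```

Everything else on the page (here, the heading) is ignored because the selectors only match the nodes of interest.
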
APIs
• In contrast to scraping, Application Programming Interfaces provide a way to communicate with software
• Web APIs may give you a way to request specific data from a server
• Web APIs aren’t uniform
  – The Star Wars API is different from the NYC Open Data API
• This means that what is returned by one API will differ from what is returned by another API

Getting data into R
• Web APIs are mostly accessible using HTTP (the same protocol that’s used to serve up web pages)
• httr contains a collection of tools for constructing HTTP requests
• We’ll focus on GET, which retrieves information from a specified URL
  – You can refine your HTTP request with query parameters if the API makes them available

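A rough sketch of a GET request with query parameters is shown below. The endpoint `https://api.example.com/v1/records` and the parameter names `borough` and `limit` are hypothetical placeholders, not part of any API named in these slides; a real API documents its own URL and parameters. The helper `fetch_records()` is likewise an illustrative name and is only defined here, not called, so the sketch runs offline.

```r
library(httr)

# Hypothetical endpoint and query parameters (assumptions for illustration).
base_url <- "https://api.example.com/v1/records"
query_params <- list(borough = "Manhattan", limit = 5)

# modify_url() shows the URL that the query parameters produce,
# without sending anything over the network.
request_url <- modify_url(base_url, query = query_params)
print(request_url)

# Illustrative helper: send the GET request, fail loudly on an HTTP error,
# and return the response body as text for later parsing.
fetch_records <- function(url, query = list()) {
  response <- GET(url, query = query)
  stop_for_status(response)
  content(response, as = "text", encoding = "UTF-8")
}
```

Against a real API, `fetch_records(base_url, query_params)` would return the raw response body, typically CSV or JSON text.
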
API data formats
• In “lucky” cases, you can request a CSV from an API
  – Sometimes you could download this by clicking a link on a webpage, but “I went to <website> and clicked ‘download’” isn’t reproducible
• In more general cases, you’ll get JavaScript Object Notation (JSON)
  – JSON files can be parsed in R using jsonlite

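To make the two formats concrete, here is a small sketch that parses the same records from CSV text and from JSON text. The records themselves are made-up placeholder values standing in for the body of an API response; they do not come from any real API.

```r
library(jsonlite)

# Made-up response bodies (assumptions for illustration only).
csv_text <- "name,value\nA,1\nB,2"
json_text <- '[{"name": "A", "value": 1}, {"name": "B", "value": 2}]'

# CSV text can be read with base R.
records_from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# fromJSON() simplifies an array of same-shaped objects into a data frame.
records_from_json <- fromJSON(json_text)

print(records_from_csv)
print(records_from_json)
```

Real JSON responses are often nested more deeply than this, in which case fromJSON() returns lists that still need tidying.
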
Real talk about web data
• Data from the web is messy
• It will frequently take a lot of work to figure out
  – How to get what you want
  – How to tidy it once you have it
