Web Scraping 101: Working with Web Data in R



SLIDE 1

Web Scraping 101

WORKING WITH WEB DATA IN R

Charlotte Wickham

Instructor

SLIDE 2

Selectors

Little browser extensions

Identify the specific bit(s) you want

Give you a unique ID to grab them with

Not used in this course (but worth grabbing after)

SLIDE 3

rvest

rvest is a dedicated web scraping package

Makes things shockingly easy

Read an HTML page with read_html(x = ___), where x is a URL, a file path, or a string of HTML
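A minimal sketch of reading a page, assuming rvest is installed. The inline HTML string and the names page_source and demo_page are made up for illustration, so the example runs without network access; a real URL would be passed in the same position.

```r
library(rvest)

# Hypothetical inline page (not from the course), used instead of a live URL
page_source <- "<html><head><title>Demo page</title></head>
<body><h1>Hello</h1><p>First paragraph</p></body></html>"

# read_html() accepts a URL, a file path, or a string of literal HTML
demo_page <- read_html(page_source)

# The parsed page is an xml_document, the same kind of object a URL would give
class(demo_page)
```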

SLIDE 4

Parsing HTML

read_html() returns an XML document

Use html_node() to extract contents with XPath expressions

SLIDE 5

Parsing HTML

wiki_r <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
wiki_r

{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; c ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ...

SLIDE 6

wiki_r <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
wiki_r

{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; c ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ...

html_node(wiki_r, xpath = "//ul")

{xml_node}
<ul>
[1] <li><a href="/wiki/Common_Lisp" title="Common Lisp">Common Li ...
[2] <li><a href="/wiki/S_(programming_language)" title="S (progra ...
[3] <li>\n<a href="/wiki/Scheme_(programming_language)" title="Sc ...
[4] <li><a href="/wiki/XLispStat" title="XLispStat">XLispStat</a> ...
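The same XPath lookup can be tried offline. This is a sketch on a made-up inline page, not the Wikipedia article. It also uses html_nodes() and html_text(), which are real rvest functions but do not appear on this slide, to show that html_node() keeps only the first match.

```r
library(rvest)

# Hypothetical inline page standing in for the Wikipedia article
demo_page <- read_html("<html><body>
  <ul><li>Common Lisp</li><li>S</li></ul>
  <ul><li>Scheme</li></ul>
</body></html>")

# html_node() returns only the first element matching the XPath
first_list <- html_node(demo_page, xpath = "//ul")

# html_nodes() (assumption: not covered on the slide) returns every match;
# the leading "." restricts the search to descendants of first_list
items <- html_nodes(first_list, xpath = ".//li")

# Only the two items of the first list are returned
html_text(items)
```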

SLIDE 7

Let's practice!

SLIDE 8

HTML Structure

Oliver Keyes

Instructor

SLIDE 9

Tags

HTML is content within tags

Like XML

<p> this is a test </p>

SLIDE 10

Attributes

<a href = "https://en.wikipedia.org/"> this is a test </a>

SLIDE 11

Extracting information

html_text(x = ___) - get text contents

html_attr(x = ___, name = ___) - get specific attribute

html_name(x = ___) - get tag name
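The three extractors can be tried on the link from the Attributes slide. This is a sketch under the assumption that the link sits in a tiny inline page; the variable names are illustrative only.

```r
library(rvest)

# Hypothetical inline page wrapping the <a> tag from the Attributes slide
demo_page <- read_html(
  '<html><body><a href="https://en.wikipedia.org/">this is a test</a></body></html>'
)

# Grab the link node first, then query it three ways
link <- html_node(demo_page, xpath = "//a")

link_text <- html_text(x = link)                 # text between the tags
link_href <- html_attr(x = link, name = "href")  # value of the href attribute
link_tag  <- html_name(x = link)                 # name of the tag itself
```

Here link_text holds the visible text, link_href the URL, and link_tag the string "a".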

SLIDE 12

Let's practice!

SLIDE 13

Reformatting Data

Charlotte Wickham

Instructor

SLIDE 14

HTML tables

HTML tables are dedicated structures: <table>...</table>

They can be turned into data.frames with html_table()

Use colnames(table) <- c("name", "second_name") to name the columns
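A small sketch of the table workflow, assuming a made-up two-row table; the data, the column names, and the variable names are illustrative and not from the course. Depending on the rvest version, html_table() may return a tibble rather than a plain data.frame, and colnames<- works on either.

```r
library(rvest)

# Hypothetical inline page containing a single <table>
demo_page <- read_html("<html><body><table>
  <tr><th>Language</th><th>Year</th></tr>
  <tr><td>S</td><td>1976</td></tr>
  <tr><td>R</td><td>1993</td></tr>
</table></body></html>")

# Select the table node, then convert it to a data frame
table_node <- html_node(demo_page, xpath = "//table")
languages <- html_table(table_node)

# Replace the scraped headers with names that are easier to work with
colnames(languages) <- c("name", "first_release")
languages
```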

SLIDE 15

Turning things into data.frames

Non-tables can also become data.frames

Use data.frame(), with the vectors of text or names or attributes
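One way this might look, sketched on a made-up page with two links. The page, the column names, and html_nodes() (a real rvest function that returns every match) are assumptions not taken from the slides.

```r
library(rvest)

# Hypothetical inline page with two links and no <table>
demo_page <- read_html('<html><body>
  <a href="/wiki/Common_Lisp">Common Lisp</a>
  <a href="/wiki/XLispStat">XLispStat</a>
</body></html>')

# Collect every <a> node on the page
links <- html_nodes(demo_page, xpath = "//a")

# Each extractor returns one value per node, so the vectors line up as columns
link_table <- data.frame(
  text = html_text(links),
  href = html_attr(links, "href"),
  tag  = html_name(links),
  stringsAsFactors = FALSE
)
link_table
```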

SLIDE 16

Let's practice!
