Web Scraping 101

Web Scraping 101: Working with Web Data in R, Charlotte Wickham - PowerPoint PPT Presentation

  1. Web Scraping 101. Working with Web Data in R. Charlotte Wickham, Instructor

  2. Selectors: little browser extensions; identify the specific bit(s) you want; give you a unique ID to grab them with; not used in this course (but worth grabbing after)

  3. rvest: rvest is a dedicated web scraping package; makes things shockingly easy; read an HTML page with read_html(url = ___)

  4. Parsing HTML: read_html() returns an XML document; use html_node() to extract contents with XPaths

  5. Parsing HTML

         wiki_r <- read_html(
           "https://en.wikipedia.org/wiki/R_(programming_language)"
         )
         wiki_r
         {xml_document}
         <html class="client-nojs" lang="en" dir="ltr">
         [1] <head>\n<meta http-equiv="Content-Type" content="text/html; c ...
         [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ...

  6. (Parsing HTML, continued)

         wiki_r <- read_html(
           "https://en.wikipedia.org/wiki/R_(programming_language)"
         )
         wiki_r
         {xml_document}
         <html class="client-nojs" lang="en" dir="ltr">
         [1] <head>\n<meta http-equiv="Content-Type" content="text/html; c ...
         [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ...

         html_node(wiki_r, xpath = "//ul")
         {xml_node}
         <ul>
         [1] <li><a href="/wiki/Common_Lisp" title="Common Lisp">Common Li ...
         [2] <li><a href="/wiki/S_(programming_language)" title="S (progra ...
         [3] <li>\n<a href="/wiki/Scheme_(programming_language)" title="Sc ...
         [4] <li><a href="/wiki/XLispStat" title="XLispStat">XLispStat</a> ...
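
The slide's call returns a single `<ul>` even though a page usually contains several, because html_node() keeps only the first match. The sketch below illustrates that on a small made-up HTML string (an assumption, so it runs without network access) and contrasts it with html_nodes(), which returns every match. The variable names are illustrative only; recent rvest releases also provide html_element() and html_elements() as newer names for the same operations.

```r
library(rvest)

# Made-up HTML with two lists; a literal string is parsed instead of a URL
# so the example runs offline.
page <- read_html(
  "<html><body><ul><li>Common Lisp</li><li>S</li></ul><ul><li>Scheme</li></ul></body></html>"
)

first_ul <- html_node(page, xpath = "//ul")   # first matching <ul> only
all_ul   <- html_nodes(page, xpath = "//ul")  # every matching <ul>

length(all_ul)

# An XPath starting with "./" is evaluated relative to the node passed in
items <- html_nodes(first_ul, xpath = "./li")
html_text(items)
```

When only one element is expected, html_node() keeps the result simple; html_nodes() is the safer choice when the number of matches is unknown.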

  7. Let's practice!

  8. HTML Structure. Oliver Keyes, Instructor

  9. Tags: HTML is content within tags, like XML. Example: <p> this is a test </p>

  10. Attributes: <a href = "https://en.wikipedia.org/"> this is a test </a>

  11. Extracting information: html_text(x = ___) - get text contents; html_attr(x = ___, name = ___) - get specific attribute; html_name(x = ___) - get tag name
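
As a minimal sketch of the three extraction functions, the code below applies them to the anchor tag from the Attributes slide. The tag is parsed from a literal string rather than a live page, and the variable names are hypothetical; `trim = TRUE` is an added choice that strips the padding spaces around the text.

```r
library(rvest)

# The example tag from the Attributes slide, parsed from a literal string
page <- read_html('<a href="https://en.wikipedia.org/"> this is a test </a>')
link <- html_node(page, xpath = "//a")

link_text <- html_text(link, trim = TRUE)     # text contents, whitespace trimmed
link_href <- html_attr(link, name = "href")   # value of the href attribute
link_tag  <- html_name(link)                  # name of the tag

link_text
link_href
link_tag
```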

  12. Let's practice!

  13. Reformatting Data. Charlotte Wickham, Instructor

  14. HTML tables: HTML tables are dedicated structures: <table>...</table>; they can be turned into data.frames with html_table(); use colnames(table) <- c("name", "second_name") to name the columns
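
The table workflow on this slide can be sketched as follows. The table contents are made up for illustration and parsed from a literal string, and the names `table_node` and `people` are hypothetical. Depending on the rvest version, html_table() on a single node returns a data.frame or a tibble; `colnames<-` works on both.

```r
library(rvest)

# Made-up table used only for illustration; no network access needed
page <- read_html("
  <table>
    <tr><th>First</th><th>Last</th></tr>
    <tr><td>Ada</td><td>Lovelace</td></tr>
    <tr><td>Grace</td><td>Hopper</td></tr>
  </table>")

table_node <- html_node(page, xpath = "//table")  # grab the <table> element
people <- html_table(table_node)                  # convert it to a data frame

# Replace the scraped headers with the names suggested on the slide
colnames(people) <- c("name", "second_name")
people
```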

  15. Turning things into data.frames: non-tables can also become data.frames; use data.frame(), with the vectors of text or names or attributes
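
One way to build a data.frame from non-table content, assuming a made-up snippet with two links: extract one vector per column with html_name(), html_text(), and html_attr(), then pass them to data.frame(). The column names and the `link_df` variable are illustrative choices, not part of the slides.

```r
library(rvest)

# Made-up HTML containing two links, parsed from a literal string
page <- read_html(
  '<p><a href="https://en.wikipedia.org/">Wikipedia</a> and <a href="https://www.r-project.org/">R Project</a></p>'
)

links <- html_nodes(page, xpath = "//a")  # every <a> element

# Each extraction function returns one value per node, so the vectors line up
link_df <- data.frame(
  tag  = html_name(links),
  text = html_text(links),
  href = html_attr(links, name = "href"),
  stringsAsFactors = FALSE
)
link_df
```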

  16. Let's practice!
