
  1. social data science: Data Gathering
     Sebastian Barfort
     August 10, 2016
     University of Copenhagen, Department of Economics

  2. ethics
     On the ethics of web scraping and data journalism:
     · If an institution publishes data on its website, this data should automatically be public
     · If a regular user can't access the data, we shouldn't try to get it (that would be hacking)
     · Always read the user terms and conditions
     · Always check the robots.txt file, which states what is allowed to be scraped
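     To act on the last point, you can pull a site's robots.txt straight from R before scraping. A minimal sketch using base R's readLines; jyllands-posten.dk (the site scraped later in the deck) stands in as the example domain:

     robots = readLines("http://jyllands-posten.dk/robots.txt")
     head(robots)  # look for Disallow rules covering the paths you want to scrape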

  3. rules of web scraping
     1. You should check a site's terms and conditions before you scrape it. It's their data and they likely have some rules to govern it.
     2. Be nice. A computer will send web requests much quicker than a user can. Make sure you space out your requests a bit so that you don't hammer the site's server (see the sketch after this list).
     3. Scrapers break. Sites change their layout all the time. If that happens, be prepared to rewrite your code.
     4. Web pages are inconsistent. There's sometimes some manual clean up that has to happen even after you've gotten your data.
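     Rule 2 in code: pause between successive requests. A minimal sketch, where some.links is a hypothetical vector of URLs (not defined in the deck at this point):

     library("rvest")

     # polite scraping: wait between requests so the server isn't hammered
     pages = lapply(some.links, function(l) {
       Sys.sleep(2)  # pause two seconds before each request
       read_html(l)
     })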

  4. (image-only slide, no text content)

  5. what does a web page look like?
     https://sebastianbarfort.github.io/

  6. motivating example
     https://en.wikipedia.org/wiki/Table_%28information%29

  7. example
     rvest is a nice R package for scraping web pages that don't have an API.
     Selectorgadget is a browser extension for quickly extracting desired parts of an HTML page. With some user feedback, the gadget finds the CSS selector that returns the highlighted page elements.
     To extract something, you start with Selectorgadget to figure out which CSS selector matches the data we want.

  8. library("rvest")

     link = paste0("http://en.wikipedia.org/",
                   "wiki/Table_(information)")

     link.data = link %>%
       read_html() %>%
       html_node(".wikitable") %>%  # extract first node with class wikitable
       html_table()                 # then convert the HTML table into a data frame

     html_table usually only works on 'nicely' formatted HTML tables.
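     A quick way to see what came back (my own check, not from the slides): html_table on a single node returns a data frame.

     class(link.data)  # a data frame
     head(link.data)   # first rows of the scraped table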

  9. First name    Last name     Age
     Tinu          Elejogun      14
     Blaszczyk     Kostrzewski   25
     Lily          McGarrett     16
     Olatunkboh    Chijiaku      22
     Adrienne      Anthoula      22
     Axelia        Athanasios    22
     Jon-Kabat     Zinn          22

 10. This is a nice format? Really?
     Yes, really. It's the format used to render tables on webpages (remember: programming sucks).

     <table class="wikitable">
       <tr>
         <th>First name</th>
         <th>Last name</th>
         <th>Age</th>
       </tr>
       <tr>
         <td>Bielat</td>
         <td>Adamczak</td>
         <td>24</td>
       </tr>
       ...
     </table>
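     read_html also accepts a literal HTML string, so you can experiment with markup like the snippet above without fetching anything. A minimal sketch (the inline table is illustrative only):

     library("rvest")

     html = read_html('
       <table class="wikitable">
         <tr><th>First name</th><th>Last name</th><th>Age</th></tr>
         <tr><td>Bielat</td><td>Adamczak</td><td>24</td></tr>
       </table>')

     html %>%
       html_node(".wikitable") %>%
       html_table()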

 11. scraping jyllands posten
     http://jyllands-posten.dk/

 12. scraping jyllands posten in rvest
     Assume we want to extract the headlines.
     · Fire up Selectorgadget
     · Find the correct selector
     · CSS selector: .artTitle a
     · Want to use XPath? No problem (see the sketch after this list).
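     html_nodes takes an xpath argument as an alternative to css. A minimal sketch; the XPath expression is my hand translation of the .artTitle a selector, not from the original slides:

     library("rvest")

     link = "http://jyllands-posten.dk/"

     # same selection as css = ".artTitle a", expressed in XPath
     jp.data.xpath = link %>%
       read_html() %>%
       html_nodes(xpath = "//*[contains(@class, 'artTitle')]//a") %>%
       html_text()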

 13. scraping headlines

     css.selector = ".artTitle a"
     link = "http://jyllands-posten.dk/"

     jp.data = link %>%
       read_html() %>%
       html_nodes(css = css.selector) %>%
       html_text()

 14. ## [1] "\r\n\t\t\tTruende milliardtab tvinger borgmestre til U-vending: Vil sejle udenlandsk affald til Amager "
     ## [2] "\r\n\t\t\tHvem er mest undertrykt her? "
     ## [3] "\r\n\t\t\tForslag: Slut med burka og dobbelt statsborgerskab "
     ## [4] "\r\n\t\t\tDna-spor har ført til ny teori om ruten til Amerika "
     ## [5] "\r\n\t\t\tLiveblog fra OL: Følg danskerne, stjernerne og de store begivenheder "

 15. garbage
     Notice that there are still some garbage characters in the scraped text, so we need our string processing skills to clean the scraped data. This can be done in many ways.

     library("stringr")

     jp.data1 = jp.data %>%
       str_replace_all(pattern = "\\n|\\t|\\r", replacement = "")
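     One of those other ways (my addition, not from the slides) is base R's gsub, which performs the same replacement without stringr:

     # base R equivalent of the str_replace_all call above
     jp.data1b = gsub("\\n|\\t|\\r", "", jp.data)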

 16. Truende milliardtab tvinger borgmestre til U-vending: Vil sejle udenlandsk affald til Amager
     Hvem er mest undertrykt her?
     Forslag: Slut med burka og dobbelt statsborgerskab
     Dna-spor har ført til ny teori om ruten til Amerika
     Liveblog fra OL: Følg danskerne, stjernerne og de store begivenheder
     Cancellara tog OL-guld i enkeltstart

 17. str_trim
     str_trim: trim whitespace from the start and end of a string.

     library("stringr")

     jp.data2 = jp.data %>%
       str_trim()
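     A toy illustration (my own example) of what str_trim does to one of the scraped headlines:

     str_trim("\r\n\t\t\tHvem er mest undertrykt her? ")
     ## [1] "Hvem er mest undertrykt her?"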

 18. Truende milliardtab tvinger borgmestre til U-vending: Vil sejle udenlandsk affald til Amager
     Hvem er mest undertrykt her?
     Forslag: Slut med burka og dobbelt statsborgerskab
     Dna-spor har ført til ny teori om ruten til Amerika
     Liveblog fra OL: Følg danskerne, stjernerne og de store begivenheder
     Cancellara tog OL-guld i enkeltstart

 19. extracting attributes
     What if we also wanted the links embedded in those headlines?

     jp.links = link %>%
       read_html(encoding = "UTF-8") %>%
       html_nodes(css = css.selector) %>%
       html_attr(name = "href")

 20. http://finans.dk/finans/erhverv/ECE8904112/truende-milliardtab-tvinger-borgmestre-til-uvending-vil-sejle-udenlandsk-affald-til-amager/
     http://jyllands-posten.dk/sport/ol/ECE8909430/hvem-er-mest-undertrykt-her/
     http://www.jyllands-posten.dk/protected/premium/international/ECE8909823/vi-afviser-denne-delte-loyalitet-de-der-vil-gaa-ind-i-udenlandske-regeringers-politik-foreslaar-vi-at-forlade-tyskland/
     http://jyllands-posten.dk/nyviden/ECE8909952/de-foerste-mennesker-kom-til-amerika-via-en-anden-rute-end-hidtil-antaget/
     http://jyllands-posten.dk/sport/ol/ECE8898888/liveblog-fra-ol-foelg-danskerne-stjernerne-og-de-store-begivenheder/
     http://jyllands-posten.dk/sport/ol/cykling/ECE8909671/cancellara-tog-olguld-i-enkeltstart/

 21. looping through collection of links
     We now have jp.links, a vector of all the links to news stories from JP's front page.
     Let's loop through every link and extract some information (a full loop sketch follows slide 23 below).

 22. cleaning the vector
     Assume that we're only interested in domestic and international politics.

     jp.keep = jp.links %>%
       str_detect("politik|indland|international")
     jp.links.clean = jp.links[jp.keep]

     jp.remove.index = jp.links.clean %>%
       str_detect("protected|premium|finans")
     jp.links.clean = jp.links.clean[!jp.remove.index]

 23. grab info from first link

     first.link = jp.links.clean[1]

     first.link.text = first.link %>%
       read_html(encoding = "UTF-8") %>%
       html_nodes("#articleText") %>%
       html_text()
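     Generalizing slide 23 from the first link to every cleaned link gives the loop promised on slide 21. A minimal sketch, assuming the #articleText selector works on every article page (it may not; see rule 4 on slide 3):

     library("rvest")

     # visit each cleaned link politely and pull the article text
     jp.articles = lapply(jp.links.clean, function(l) {
       Sys.sleep(2)  # space out requests (rule 2)
       l %>%
         read_html(encoding = "UTF-8") %>%
         html_nodes("#articleText") %>%
         html_text()
     })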
