Scraping Distributed, Hierarchical Web Data with Programming by Demonstration - PowerPoint PPT Presentation

Scraping Distributed, Hierarchical Web Data with “Programming by Demonstration”!
Sarah E. Chasins (1), Maria Mueller (2), Rastislav Bodik (2)
(1) University of California, Berkeley; (2) University of Washington

The web: a rich source of data!
2008: Google indexed 1 trillion pages. Now: indexes > 60 trillion pages. → Lots of content out there.
Have you written a scraper?
Percentages of Female and Male Speaking Characters, Top 100 Films of 2017. Woman director or writer: 42% female speaking roles. Only male directors, writers: 32% female speaking roles.
Source: Martha M. Lauzen. 2018. It’s a Man’s (Celluloid) World: Portrayals of Female Characters in the 100 Top Films of 2017.

Let’s automate! We’ve got some libraries... Common thread: users must reverse engineer target webpages’ DOM.

Formative Study: What kinds of web data?
Distributed: must navigate between pages, e.g., click, use forms + widgets.
Hierarchical: must traverse and collect tree-structured data.

Formative Study: Can social scientists use...
Traditional programming? Skills: basic programming, web DSL, DOM, JavaScript, server interaction.
Manual collection? Skills: browser use. But: slow, tedious, small-scale data.
Programming by demonstration? Skills: browser use. But: can’t collect distributed, hierarchical datasets.

What’s Programming by Demonstration (PBD)?
Closely related to Programming by Example (PBE) (e.g., FlashFill): (input 1, output 1), ... → program.
But PBD (e.g., SMARTedit) gets to see the input being transformed into the output via the user demo: (input 1, [action i, action j, …] 1, output 1), ... → program.

The Helena Ecosystem
[Figure: a user demo goes to Rousillon (PBD tool), which produces a Helena program; the Helena interpreter and parallelizing runtime produce web data. Ringer (record and replay) interacts with web servers. Labels: language design [Chasins OOPSLA17]; systems; web [Chasins WWW15].]

The Interaction Model
The user demonstrates how to collect one joined row (movie 1, actor 1):
start recording
load www.imdb.com…
collect movie 1
click movie 1
collect actor 1
end recording
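To make the interaction model concrete, the recording above can be pictured as a flat list of events. This is only a sketch: the `(action, target)` tuple encoding and the `joined_row` helper are assumptions made for illustration, not the tool's actual trace format.

```python
from typing import List, Tuple

# One recorded demonstration, mirroring the slide's example.
# The (action, target) encoding is hypothetical.
Event = Tuple[str, str]

demo: List[Event] = [
    ("load", "www.imdb.com…"),
    ("collect", "movie 1"),
    ("click", "movie 1"),
    ("collect", "actor 1"),
]


def joined_row(trace: List[Event]) -> List[str]:
    """Return the cells the user collected, in demonstration order."""
    return [target for action, target in trace if action == "collect"]


print(joined_row(demo))
```

The single demonstrated row is all the user provides; generalizing it to every movie and every actor is the synthesizer's job.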

Can we even offer this interaction model?
Hierarchical Data: Synthesis of nested loops, needed for hierarchical data, is a long-standing open problem.
Relation Ambiguity: A single row is an ambiguous demo. Which relation did the user intend to select?
Readability: For robust automation, must run 100s of low-level, unreadable DOM events.

Problem 1: Hierarchical Data
Hierarchical data → nested loops:
    for movie in movie_list:
        // scrape movie data
        for actor in actor_list:
            // scrape actor data
The issue: Nested loop synthesis is an open problem. The space of possible programs (progs w/ no loops, progs w/ single-level loops, progs w/ nested loops) is just too big. To pick among all these, our spec is ambiguous.
Past solutions: In web automation, none. In other domains, manually marking loop boundaries.
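As a runnable counterpart to the loop skeleton above, here is a minimal sketch of what the target program computes. The `movie_pages` dictionary is a toy stand-in for the pages a real scraper would visit, and its contents are placeholders rather than real data.

```python
from typing import Dict, List, Tuple

# Toy stand-in for scraped pages: each movie maps to its actor list.
# Titles and names are placeholders, not real data.
movie_pages: Dict[str, List[str]] = {
    "movie 1": ["actor 1", "actor 2"],
    "movie 2": ["actor 3"],
}


def scrape(pages: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    rows = []
    for movie, actor_list in pages.items():  # outer loop: movie relation
        for actor in actor_list:             # inner loop: actor relation
            rows.append((movie, actor))      # one joined row per pair
    return rows


for row in scrape(movie_pages):
    print(row)
```

Each output row joins a parent (movie) with one of its children (actor), which is why a single-level loop cannot express the dataset.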

Problem 1: Hierarchical Data
Our solution: Design user interaction to make search tractable.
Contract w/ user: perform one iteration of each loop, ordered from outer to inner.
Label uses of relation cells (movie relation, actor relation).
One loop per relation, start before cell use.
[Figure: the demonstrated statements with each use of a movie cell or actor cell labeled; "for movie in movie_list:" is placed before the movie cell uses and "for actor in actor_list:" before the actor cell uses.]
PBD takeaway: To add loops efficiently, first find objects that should be treated together.
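The "one loop per relation, start before cell use" rule can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the trace is given as `(statement text, relation)` pairs, each loop extends to the end of the program, and the function name `insert_loops` and the statement strings are hypothetical.

```python
from typing import List, Optional, Tuple

# (statement text, name of the relation whose cell it uses, or None)
Statement = Tuple[str, Optional[str]]


def insert_loops(trace: List[Statement]) -> str:
    """Open one loop per relation just before that relation's first cell use."""
    lines = []
    opened: List[str] = []  # relations that already have a loop, outermost first
    for text, relation in trace:
        if relation is not None and relation not in opened:
            indent = "    " * len(opened)
            lines.append(f"{indent}for {relation} in {relation}_list:")
            opened.append(relation)
        lines.append("    " * len(opened) + text)
    return "\n".join(lines)


demo = [
    ("load(start_url)", None),
    ("scrape(movie.title)", "movie"),
    ("click(movie.link)", "movie"),
    ("scrape(actor.name)", "actor"),
]
print(insert_loops(demo))
```

Because the user demonstrates loops from outer to inner, the order of first use fixes the nesting order, so no search over loop structures is needed in this sketch.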

Problem 2: Relation Ambiguity
Given this demo (two scrape actions), what’s the right relation? Is node 1 included? If not, do we want purple or orange cells in rows 2 and 3? Maybe purple + orange + unhighlighted?
The issue: Can extract many relations from one page. Set of interacted nodes → 1 chosen relation?
Past solutions: Have user label multiple rows.

Problem 2: Relation Ambiguity
Our solution:
    S = subsets of interacted nodes of size n...1
    for row1 in S:
        shape = getSubtreeShape(row1)
        row2 = siblingWithShape(row1, shape)
        relation = extractRelation([row1, row2])
        if relation:
            return relation
Example: siblingWithShape([n1, n2], s) → ∅; siblingWithShape([n2], s) → n3; relation → [n2, n3, n4].
PBD takeaway: Take advantage of domain-specific patterns (e.g., web design best practices) to find objects we should treat together.
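The pseudocode above can be tried out on a toy tree. The sketch below is a simplified reading, not the tool's implementation: the `Node` class, the choice to represent a candidate row by the deepest common ancestor of the subset, and the rule that a relation is every sibling with the same tag structure are all assumptions made for illustration.

```python
from itertools import combinations


class Node:
    """Minimal stand-in for a DOM node (hypothetical, for illustration)."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self


def shape(node):
    """Tag structure of a subtree, ignoring text content."""
    return (node.tag, tuple(shape(c) for c in node.children))


def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain  # the node itself first, the root last


def common_ancestor(nodes):
    """Deepest node that contains every node in `nodes`."""
    shared = ancestors(nodes[0])
    for other in nodes[1:]:
        other_chain = ancestors(other)
        shared = [a for a in shared if any(a is b for b in other_chain)]
    return shared[0] if shared else None


def find_relation(interacted):
    for size in range(len(interacted), 0, -1):  # subsets of size n...1
        for subset in combinations(interacted, size):
            row1 = common_ancestor(list(subset))
            if row1 is None or row1.parent is None:
                continue  # no siblings to compare against
            target = shape(row1)
            rows = [s for s in row1.parent.children if shape(s) == target]
            if len(rows) > 1:  # row1 plus at least one matching sibling
                return rows
    return None


# Toy page: a heading (n1) followed by a list of three rows (n2, n3, n4).
n1 = Node("h1")
n2, n3, n4 = (Node("li", [Node("a")]) for _ in range(3))
page = Node("body", [n1, Node("ul", [n2, n3, n4])])

relation = find_relation([n1, n2])
print(len(relation))
```

As in the slide's example, the subset containing both n1 and n2 yields no sibling with a matching shape, while n2 alone matches n3 and n4, so the relation is [n2, n3, n4].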

Problem 3: Readability
Page allowed to react to any DOM event → prog must run low-level events like this to be robust on modern interactive DOM + JS + AJAX pages.
The issue: It’s not readable.
Past solutions: Actually, it’s a new problem.
Our solution: Reverse compilation.
PBD takeaway: It’s ok to record demo at one level, show program at another.
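One way to picture reverse compilation is as pattern matching that folds runs of low-level DOM events into one readable statement. The sketch below is an assumption-laden illustration: the `PATTERNS` table, the `dispatch(...)` fallback, and the statement syntax are hypothetical, and the slides do not specify how the tool actually performs this step.

```python
from typing import List, Tuple

DomEvent = Tuple[str, str]  # (event type, description of the target node)

# Hypothetical mapping from a low-level event sequence to one readable statement.
PATTERNS = [
    (("mousedown", "mouseup", "click"), "click"),
    (("keydown", "keypress", "input", "keyup"), "type"),
]


def reverse_compile(events: List[DomEvent]) -> List[str]:
    program = []
    i = 0
    while i < len(events):
        for low_level, high_level in PATTERNS:
            window = events[i:i + len(low_level)]
            same_target = len({target for _, target in window}) == 1
            if tuple(kind for kind, _ in window) == low_level and same_target:
                program.append(f"{high_level}({window[0][1]})")
                i += len(low_level)
                break
        else:
            # No pattern matched: keep the raw event visible rather than drop it.
            program.append(f"dispatch({events[i][0]}, {events[i][1]})")
            i += 1
    return program


recorded = [
    ("mousedown", "movie link"),
    ("mouseup", "movie link"),
    ("click", "movie link"),
    ("focus", "search box"),
]
print(reverse_compile(recorded))
```

The low-level events are still what gets replayed; only the displayed program is raised to the readable level.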

User Study: PBD vs. traditional programming
[Figure: skills to do PBD scraping (PBD tool) vs. skills to do traditional scraping (basic coding, scraping library, DOM, JS, AJAX).]

User Study: PBD vs. traditional programming
Setup: Within-subject study, 15 CS PhD students. 1 task, 2 tools; Helena then Selenium OR Selenium then Helena. 9/15 had prior scraping experience; 4/15 had prior Selenium experience.
Context: PBD vs. traditional programming eval is rare. To date, solid speedups, but only small tasks (best averaged 12 mins saved time).

Q1: Can users learn PBD faster?
Completion rate with Helena: 100%. Completion rate with Selenium: 26.7%.
Lower bound on time savings is 47 mins for task 1, 52 mins for task 2.
[Figure: Helena vs. Selenium, Task 1 and Task 2.]

Q2: Do users perceive PBD as more usable? PBD: 1.2. Selenium: 4.8. (Scale: 1 = very easy to use, 7 = very hard to use.)
Q3: Do users perceive PBD as more learnable? PBD: 1.1. Selenium: 5.6. (Scale: 1 = very easy to learn, 7 = very hard to learn.)

Q4: Having already learned both tools, which tool would users want for future tasks?

“[It] was very useful how it automatically inferred the nesting that I wanted when going to multiple pages so that I didn’t have to write multiple loops.”
“Super easy to use... It felt like magic and for quick data collection tasks online I’d love to use it in the future.”
“Helena’s way easier to use – point and click at what I wanted and it ‘just worked’ like magic. Selenium is more fully featured, but...pretty clumsy (inserting random sleeps into the script).”

The real test: social scientists and data scientists
15+ collaborations. 6 different scrapers, parallelized, all run 24/7.
Department of Sociology, University of Washington: Can we set housing voucher thresholds based on real-time neighborhood rents?
Department of Economics, University of Washington: How is the minimum wage affecting Seattle restaurants?
Civil & Environmental Engineering, University of Washington: Can we design a better carpool matching algorithm?
Evans School of Public Policy & Governance, University of Washington: How do charitable foundations communicate with supporters?

Contributions
● A demonstration model that users love
● Solutions for key technical challenges: Hierarchical Data, Relation Ambiguity, Readability

Helena Scraper and Automator
Want to use the tool yourself? helena-lang.org/install, github.com/schasins/helena
Use it to write:
● Parallel and distributed scrapers
● Programs for non-scraping web automation tasks
● Voice automation ‘skills’
@sarahchasins, schasins@cs.berkeley.edu. I’m on the academic job market!
