Skip Blocks: Reusing Execution History to Accelerate Web Scripts


SLIDE 1

Skip Blocks: Reusing Execution History to Accelerate Web Scripts

Sarah Chasins, University of California, Berkeley
Rastislav Bodik, University of Washington

OOPSLA, Oct 25, 2017, Vancouver

SLIDE 2

care about data today: non-coders, coders
care about data tomorrow: non-coders, coders

SLIDE 3

What web data collection tools do we have? (spanning non-coders to coders)

  • hire a human to copy & paste
  • hire a coder to use one of these
  • tools that require users to reverse engineer target webpages (AJAX, DOM, JS)
  • Helena (web automation for end users): our tool!

SLIDE 4

This is our prior work: demonstration → Helena PBD tool (in the web browser) → program → Helena editor (skip blocks) → program'

SLIDE 5

Goal: scrape all papers by the top 10,000 CS authors from Google Scholar

Let's PBD a web automation script! (Helena: web automation for end users)

input: a demonstration
output: a script

Author 1 | Paper A | 1998
Author 1 | Paper B | 2007
Author 1 | ...     | ...
Author 2 | Paper C | 2012
Author 2 | Paper D | 2009
Author 2 | ...     | ...
Author 3 | Paper E | 2014
Author 3 | Paper F | 2006
Author 3 | ...     | ...

SLIDE 6

This is our prior work: demonstration → Helena PBD tool → program → Helena editor (skip blocks) → program'

Today we are asking: how can this go wrong, and how can we handle it?

SLIDE 7

How is rent changing across Seattle neighborhoods?

SLIDE 8

page 1 → page 2

New listings have pushed the last three listings from page 1 onto page 2. We kept losing the network connection, wasting 10+ hours scraping duplicates!

SLIDE 9

[Letterheads: University of Washington Department of Economics; Civil & Environmental Engineering; Evans School of Public Policy & Governance]

Can we design a better carpool matching algorithm?
How is the minimum wage affecting Seattle restaurants?
How do charitable foundations communicate with supporters?

SLIDE 10

Problem Statement

(1) Failures: What happens when the network fails, the server fails, or the computer fails? When we lose our session with the server and have to start over?

(2) Data changes: What happens when the server gives the client pages produced from different (potentially conflicting) reads of the underlying data store?

These are not client-side problems → the scraping script can't prevent them; it must handle them.

SLIDE 11

Our solution: demonstration → Helena PBD tool → program → Helena editor (skip blocks) → program'

SLIDE 12

Solution

On the surface, failures and data changes seem like very different problems.

"Just don't redo the same work you've already done!"

But what's the 'same' work? After all, data changes...

Our answer: the skip block! The user can:
  • tell us what makes objects the same
  • associate the code that operates on an object
  • if the object is already committed (memoized), skip the block; else, run the block
  • no reverse engineering! reasoning stays at the level of output data
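The skip-or-run decision above can be sketched in Python. This is a minimal illustration of the idea, not the Helena implementation; the names `skip_block` and `committed` are assumptions made for the sketch.

```python
# Minimal sketch of skip-block semantics (an illustration of the idea,
# not the Helena implementation). The user picks key attributes that
# define object identity; a block runs only if no object with the same
# key has ever been committed.

committed = set()  # in Helena this store is durable across runs

def skip_block(key_attrs, block):
    """Return True if the block ran, False if it was skipped."""
    key = tuple(key_attrs)
    if key in committed:
        return False         # duplicate of a committed object: skip
    block()                  # do the work (scrape, click, output rows)
    committed.add(key)       # commit only after the block completes
    return True

# Same author seen twice (e.g. after a failure and restart): the second
# occurrence is skipped.
ran_first = skip_block(("Author 1", "UC Berkeley"), lambda: None)
ran_again = skip_block(("Author 1", "UC Berkeley"), lambda: None)
```

Because the commit happens only after the block finishes, a failure in the middle of the block leaves the key uncommitted, so the next run redoes (rather than skips) the interrupted work.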

SLIDE 13

Recovering from Failures

A text-ified representation of the block language: scrape stuff about the author; click the author for the author's papers; scrape paper stuff; add a row of output with the author and paper info.

for (aRow in p1.authors){
  scrape aRow.author_name
  scrape aRow.author_institution
  p2 = click aRow.author_name
  for (pRow in p2.papers){
    scrape pRow.title
    scrape pRow.citations
    output([aRow.author_name, pRow.title, pRow.citations])
  }
}

SLIDE 14

Recovering from Failures

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution)){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      scrape pRow.title
      scrape pRow.citations
      output([aRow.author_name, pRow.title, pRow.citations])
    }
  }
}

Key attributes: is the current author the same as another we've already seen? Block: the code that operates on the author object. If ever, in any run, the script has committed an object with the same key attributes, it skips the block.

SLIDE 15

Recovering from Failures

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution)){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      scrape pRow.title
      scrape pRow.citations
      output([aRow.author_name, pRow.title, pRow.citations])
    }
  }
}

Timeline figure: for each author (a1, a2, ...), the run does a sequence of page loads (p1, p2, ..., p42); always at least one page load per author (to load the paper list), but often ≈ 40. Legend: page load, skip block, commit.

SLIDE 16

Recovering from Failures: "fast-forwarding" over prior work

Timeline figure comparing recovery with the author skip block vs. recovery without it, after an external failure point. Annotations: skipping an author skips 40 page loads; 10 authors per page, so just 1 page load by this point, 200 skipped; the failed run didn't reach this commit. (Always at least one page load per author, to load the paper list, but often ≈ 40. Legend: page load, skip block, commit.)

SLIDE 17

Nested Skip Blocks

Hierarchy: City 1, City 2, ... → Restaurant A, B, C, D, ... → Review i, ii, iii, iv, ...

In authors vs. papers, authors is clearly the right level for the skip block. But here?

  • skip block only at city → scraping a whole city takes many hours, so scraping half a city also takes hours
  • skip block only at restaurant → iterating through a city's restaurant list takes a long time, and now we have to go through all of Seattle and San Francisco before we can resume in the middle of Vancouver
  • skip block at city & restaurant → adjustable-granularity skipping

SLIDE 18

Nested Skip Blocks

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution)){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      skipBlock(Paper(pRow.title, pRow.year)){
        scrape pRow.title
        scrape pRow.citations
        output([aRow.author_name, pRow.title, pRow.citations])
      }
    }
  }
}

The inner block may commit even if the outer doesn't, like a nested open transaction.
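The open-nesting behavior can be sketched in Python: an inner block's commit is durable even when the enclosing block is interrupted before its own commit. Again a hedged illustration, not the Helena implementation; `skip_block`, `run_author`, and `Failure` are made-up names.

```python
# Sketch of nested skip blocks as nested open transactions (illustrative;
# not the Helena implementation). Inner (Paper) commits are durable even
# when the outer (Author) block never reaches its commit.

committed = set()

def skip_block(key, block):
    if key in committed:
        return               # already committed in some run: skip
    block()                  # may contain nested skip_block calls
    committed.add(key)       # commit this key only if the block finished

class Failure(Exception):    # stands in for a network/server failure
    pass

def run_author(name, papers, fail=False):
    def body():
        for title in papers:
            skip_block(("Paper", name, title), lambda: None)
        if fail:
            raise Failure()  # interrupt before the outer commit
    try:
        skip_block(("Author", name), body)
    except Failure:
        pass

run_author("Author 1", ["Paper A", "Paper B"], fail=True)
# Both papers are now committed; ("Author", "Author 1") is not, so a
# re-run re-enters the author block but fast-forwards past both papers.
```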

SLIDE 19

Refreshing a Dataset

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution), -∞){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      skipBlock(Paper(pRow.title, pRow.year)){
        scrape pRow.title
        scrape pRow.citations
        output([aRow.author_name, pRow.title, pRow.citations])
      }
    }
  }
}

The -∞ is the default staleness threshold; -∞ means skip any duplicate we've seen, ever.

If we're scraping once a week, we don't want to revisit each author. But after a year, maybe we should see what's new.

SLIDE 20

Refreshing a Dataset

for (aRow in p1.authors){
  skipBlock(Author(aRow.author_name, aRow.author_institution), now - 365*24*60){
    scrape aRow.author_name
    scrape aRow.author_institution
    p2 = click aRow.author_name
    for (pRow in p2.papers){
      skipBlock(Paper(pRow.title, pRow.year)){
        scrape pRow.title
        scrape pRow.citations
        output([aRow.author_name, pRow.title, pRow.citations])
      }
    }
  }
}

Bonus! In addition to failure recovery and data-redundancy handling, we get incremental/longitudinal scraping! We also have logical time (ex: last 3 runs).
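A staleness threshold can be sketched by storing a commit timestamp per key and skipping only commits at or after the threshold. Illustrative only, not Helena's implementation; the names are assumptions, and the threshold here is expressed in seconds (the slide's `now - 365*24*60` expression uses Helena's own time units).

```python
# Sketch of a skip block with a staleness threshold (illustrative; not
# the Helena implementation). A key is skipped only if its commit time
# is at or after the threshold; older commits count as stale and rerun.

import time

committed = {}  # key -> commit timestamp in seconds (assumed units)

def skip_block(key, block, threshold=float("-inf")):
    """Default -inf: skip any duplicate ever committed."""
    if key in committed and committed[key] >= threshold:
        return False             # fresh enough: skip
    block()                      # new or stale: run the block again
    committed[key] = time.time()
    return True

ONE_YEAR = 365 * 24 * 60 * 60
# Weekly scraping: authors committed within the last year are skipped;
# authors last scraped more than a year ago are revisited.
skip_block(("Author 1", "UC Berkeley"), lambda: None,
           threshold=time.time() - ONE_YEAR)
```

The default `-inf` threshold recovers the plain skip block (skip any duplicate ever seen), so failure recovery and dataset refreshing fall out of the same construct.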

SLIDE 21

Demo time!

SLIDE 22

Benchmark Suite

Need web data? Benchmark suite: 7 long-running web scraping tasks.

Ex: scrape all Seattle apartment listings from Craigslist.
Ex: for 50 top foundations, scrape the last 1,000 tweets they tweeted.

SLIDE 23

Speedup: Data Change (within one run)

Measured the full execution time of:
  • the script with skip blocks
  • the script without skip blocks

Chart shows the speedup from using skip blocks (higher is better).

1.7x: skipping one ad skips one page load, and pagination gives us so many duplicate ads!
0.9x: all overhead, no gains; skipping a tweet doesn't skip any page loads!

SLIDE 24

Speedup: Data Re-Scraping (within multiple runs)

Executed the script with skip blocks. One week later, measured the full execution time of:
  • the script with skip blocks
  • the script without skip blocks

Chart shows the speedup from using skip blocks (higher is better).

49x: lots of benefit from last week's data. The Gates Foundation doesn't post that many new tweets in a week!
1.9x: little additional benefit from last week's data; ≈ same speedup as the first run.

SLIDE 25

Failure Recovery (with skip block fast-forwarding)

For each benchmark, for three failure locations, measured the execution time of:
  • a script that recovers by naive restarting
  • a script that recovers by skip block fast-forwarding

Normalized by the execution time of a script that doesn't encounter failures (lower is better).

Annotations: the baseline is the execution time if there's no failure; a failure during high churn → see new data → slower recovery.

Overall, performance close to ideal!
SLIDE 26

User Study

Screenshots: the UI in the Helena tool; the UI in the online survey.

SLIDE 27

User Study

If a participant uses the Helena UI to add a skip block and doesn't adjust the default skip block parameters, how many rows of output data are wrong?

                                                coders   non-coders
% of rows kept that should have been skipped      0%        0%
% of rows skipped that should have been kept     1.3%      2.3%

The coder/non-coder difference is not statistically significant.

Time to write each skip block: coders, 52 seconds; non-coders, 61 seconds.

SLIDE 28

Unified handling of three apparently disparate challenges with a single language construct. By keeping reasoning at the level of the target output data, we made skip blocks usable by non-programmers.

contact: schasins@cs.berkeley.edu
github.com/schasins/helena
helena-lang.org

demonstration → Helena PBD tool (in the web browser) → program → Helena editor (skip blocks) → program'