Towards practical incremental recomputation for scientists
Philip J. Guo and Dawson Engler
Workshop on the Theory and Practice of Provenance, Feb 22, 2010

Talk outline
1. Motivation: ad-hoc data analysis scripts
2. Technique: fully automatic memoization
3. Benefits: faster iteration with simple code

Types of programs
[Figure: programs arranged by size & complexity. Labels: all programs written; research prototypes; data munging and analysis scripts]

Problem
Scientific data processing and analysis scripts often execute for several minutes to hours, which slows down the scientist’s iteration and debugging cycle.

Manually coping
1. Write initial single-file Python prototype
2. Re-write to break computation into multiple stages (files) and selectively comment out code to test
3. Write code to cache intermediate results to disk and manually manage dependencies
[Figure axes: iteration / re-execution time; code size and complexity]

Automated solution
1. Write initial single-file Python prototype
2. Let interpreter cache and manage intermediate results
[Figure axes: iteration / re-execution time; code amount and complexity]

Ideal workflow
1. Write simple first version of script
2. Execute and wait for 1 hour to get results
3. Interpret results and notice a bug
4. Edit script slightly to fix that bug
5. Re-execute and wait for a few seconds
6. Enhance script with new functions
7. Re-execute and wait for a few minutes

Technique
Fully automatic and persistent memoization for a general-purpose imperative language

Traditional memoization
    def Fib(n):
        if n <= 2:
            return 1
        else:
            return Fib(n - 1) + Fib(n - 2)

Traditional memoization
    MemoTable = {}
    def Fib(n):
        if n <= 2:
            return 1
        else:
            if n in MemoTable:
                return MemoTable[n]
            else:
                MemoTable[n] = Fib(n - 1) + Fib(n - 2)
                return MemoTable[n]
Memo table:
    Input (n)   Result
    1           1
    2           1
    3           2
    4           3
    5           5
    6           8
    7           13
    …           …
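The hand-written table above lives only for one run of the script. As a rough sketch of what "persistent" memoization could look like, the following keeps the table in a pickle file so that a later run can reuse it. The names `persistent_memoize`, `fib`, `fib_second_run`, and the cache file location are illustrative assumptions, not taken from the talk, and arguments are assumed to be hashable and picklable.

```python
import os
import pickle
import tempfile

def persistent_memoize(cache_path):
    """Memoize a function and keep its table in a pickle file (hypothetical helper)."""
    def decorator(fn):
        # Reload a table left behind by an earlier run, if there is one.
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                table = pickle.load(f)
        else:
            table = {}

        def wrapper(*args):
            if args not in table:
                table[args] = fn(*args)
                # Rewrite the whole file on every miss: simple, not efficient.
                with open(cache_path, "wb") as f:
                    pickle.dump(table, f)
            return table[args]

        return wrapper
    return decorator

cache_file = os.path.join(tempfile.mkdtemp(), "fib.pickle")
calls = []  # records every time a function body actually runs

@persistent_memoize(cache_file)
def fib(n):
    calls.append(n)
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

first = fib(20)
calls_first_run = len(calls)

# Simulate a later run: a freshly decorated function reloads the table from disk.
calls.clear()

@persistent_memoize(cache_file)
def fib_second_run(n):
    calls.append(n)
    return 1 if n <= 2 else fib_second_run(n - 1) + fib_second_run(n - 2)

second = fib_second_run(20)  # answered from the file; the body never runs
```

Keying the table on the arguments alone is only safe while the function's code and inputs stay fixed, which is the gap the following slides address.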

Auto-memoizing real programs
1. Code changes
2. External dependencies
3. Side-effects

Auto-memoizing real programs: Detecting code changes
    def stageC(datLst):
        res = ...  # run for 10 minutes munging datLst
        return res
Memo table:
    Input (datLst)   Result
    [1,2,3,4]        10
    [5,6,7,8]        20
    [9,10,11,12]     30

Auto-memoizing real programs: Detecting code changes
    def stageC(datLst):
        res = ...  # run for 10 minutes munging datLst
        return res
Memo table:
    Input (datLst)   Code deps.     Result
    [1,2,3,4]        stageC -> C1   10
    [5,6,7,8]        stageC -> C1   20
    [9,10,11,12]     stageC -> C1   30

Auto-memoizing real programs: Detecting code changes
    def stageC(datLst):
        res = ...  # run for 10 minutes munging datLst
        return (res * -1)
Memo table:
    Input (datLst)   Code deps.     Result
    [1,2,3,4]        stageC -> C1   10
    [5,6,7,8]        stageC -> C1   20
    [9,10,11,12]     stageC -> C1   30

Auto-memoizing real programs: Detecting code changes
    def stageC(datLst):
        res = ...  # run for 10 minutes munging datLst
        return (res * -1)
Memo table:
    Input (datLst)   Code deps.     Result
    [1,2,3,4]        stageC -> C1   10
    [5,6,7,8]        stageC -> C1   20
    [9,10,11,12]     stageC -> C1   30
    [1,2,3,4]        stageC -> C2   -10
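One way to picture the "Code deps." column is as a fingerprint of the function's compiled code stored alongside each memo entry, so that an edited function no longer matches its old entries. The sketch below assumes a fingerprint built from the bytecode, constants, and referenced names; the helpers `code_fingerprint` and `call_memoized` are hypothetical, the talk does not say how versions such as C1 and C2 are computed, and nested functions are not handled. Inputs are tuples rather than lists so they can serve as dictionary keys.

```python
import hashlib

memo_table = {}  # (function name, code fingerprint, args) -> result

def code_fingerprint(fn):
    """Hash fn's bytecode, constants and referenced names (one possible 'code version')."""
    code = fn.__code__
    payload = (code.co_code
               + repr(code.co_consts).encode()
               + repr(code.co_names).encode())
    return hashlib.sha256(payload).hexdigest()

def call_memoized(fn, *args):
    """Look up a result by arguments AND code version; run fn only on a miss."""
    key = (fn.__name__, code_fingerprint(fn), args)
    if key not in memo_table:
        memo_table[key] = fn(*args)
    return memo_table[key]

def stageC(datLst):
    return sum(datLst)

v1 = call_memoized(stageC, (1, 2, 3, 4))

def stageC(datLst):  # the edited version of the same function
    return sum(datLst) * -1

v2 = call_memoized(stageC, (1, 2, 3, 4))  # old entry does not match, so this recomputes
```

Both entries stay in the table, mirroring the last row of the slide: the same input now maps to a second result under the new code version.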

Auto-memoizing real programs: Detecting file reads
    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ...  # run for 10 minutes processing q
        return res
Memo table:
    Input (queryStr)     Code deps.     Result
    SELECT * FROM tbl1   stageB -> B1   1
    SELECT * FROM tbl2   stageB -> B1   2

Auto-memoizing real programs: Detecting file reads
    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ...  # run for 10 minutes processing q
        return res
Memo table:
    Input (queryStr)     Code deps.     File deps.       Result
    SELECT * FROM tbl1   stageB -> B1   test.db -> DB1   1
    SELECT * FROM tbl2   stageB -> B1   test.db -> DB1   2
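The "File deps." column can be read as: remember which files a call read and what they contained, and trust the cached result only while those files are unchanged. The following is a minimal sketch under that reading. The class `FileAwareMemo` and its `open_tracked` method are hypothetical stand-ins for an interpreter that intercepts file opens on its own, and a content hash is used here as the file "version" (the talk does not specify how DB1 is derived).

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """Content hash used as the file's version."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

class FileAwareMemo:
    """Memo table whose entries remember which files the call read (hypothetical helper)."""

    def __init__(self, fn):
        self.fn = fn
        self.entries = {}    # args -> (result, {path: digest at the time of the call})
        self._reads = None   # files read by the call currently in progress

    def open_tracked(self, path):
        """Stand-in for an intercepted open(): record the read, then open the file."""
        if self._reads is not None:
            self._reads[path] = file_digest(path)
        return open(path)

    def __call__(self, *args):
        entry = self.entries.get(args)
        if entry is not None:
            result, deps = entry
            # Reuse the result only if every file it depended on is unchanged.
            if all(os.path.exists(p) and file_digest(p) == d for p, d in deps.items()):
                return result
        self._reads = {}
        try:
            result = self.fn(self, *args)
        finally:
            deps, self._reads = self._reads, None
        self.entries[args] = (result, deps)
        return result

runs = []  # records every time the function body actually runs

def count_lines(memo, path):
    runs.append(path)
    with memo.open_tracked(path) as f:
        return sum(1 for _ in f)

count_lines = FileAwareMemo(count_lines)

data_path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(data_path, "w") as f:
    f.write("a\nb\n")

n1 = count_lines(data_path)  # computed
n2 = count_lines(data_path)  # reused: file unchanged
with open(data_path, "a") as f:
    f.write("c\n")
n3 = count_lines(data_path)  # recomputed: file contents changed
```

Hashing whole files on every lookup is the simplest check to write down; a modification time comparison would be cheaper but is not shown here.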

Auto-memoizing real programs: Detecting global variable reads
    MULTIPLIER = 5
    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ...  # run for 10 minutes processing q
        return (res * MULTIPLIER)
Memo table:
    Input (queryStr)     Code deps.     File deps.       Result
    SELECT * FROM tbl1   stageB -> B2   test.db -> DB1   5

Auto-memoizing real programs: Detecting global variable reads
    MULTIPLIER = 5
    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ...  # run for 10 minutes processing q
        return (res * MULTIPLIER)
Memo table:
    Input (queryStr)     Code deps.     File deps.       Global deps.      Result
    SELECT * FROM tbl1   stageB -> B2   test.db -> DB1   MULTIPLIER -> 5   5

Auto-memoizing real programs: Detecting global variable reads
    MULTIPLIER = 10
    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ...  # run for 10 minutes processing q
        return (res * MULTIPLIER)
Memo table:
    Input (queryStr)     Code deps.     File deps.       Global deps.       Result
    SELECT * FROM tbl1   stageB -> B2   test.db -> DB1   MULTIPLIER -> 5    5
    SELECT * FROM tbl1   stageB -> B2   test.db -> DB1   MULTIPLIER -> 10   10

Auto-memoizing real programs: Detecting transitive dependencies
    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)
Memo table:
    Input (filename)   Code deps.     File deps.          Global deps.   Result
    queries.txt        stageA -> A1   queries.txt -> Q1                  50

Auto-memoizing real programs: Detecting transitive dependencies
[Figure labels: queries.txt, test.db]
    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)
    MULTIPLIER = 5
    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ...  # run for 10 minutes processing q
        return (res * MULTIPLIER)
    def stageC(datLst):
        res = ...  # run for 10 minutes munging datLst
        return res

Auto-memoizing real programs: Detecting transitive dependencies
    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)
Memo table:
    Input (filename)   Code deps.     File deps.          Global deps.      Result
    queries.txt        stageA -> A1   queries.txt -> Q1   MULTIPLIER -> 5   50
                       stageB -> B1   test.db -> DB1
                       stageC -> C1
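The table above shows stageA inheriting the dependencies of the functions it calls. One plausible way to get that effect, sketched here as an assumption rather than as the authors' mechanism, is to keep a stack of the calls in progress and attribute each observed dependency to every call on the stack. The helpers `record` and `tracked` are hypothetical, dependencies are recorded by name only (no versions), and explicit `record(...)` calls stand in for reads that an instrumented interpreter would intercept. To stay self-contained, stageA takes a list of queries instead of reading queries.txt.

```python
call_stack = []  # one dependency set per tracked call that is currently running
deps_of = {}     # function name -> dependencies seen during its latest call

def record(dep):
    """Attribute a dependency to every tracked call on the stack, not just the innermost."""
    for frame_deps in call_stack:
        frame_deps.add(dep)

def tracked(fn):
    def wrapper(*args):
        call_stack.append(set())
        record(("code", fn.__name__))  # callers also depend on this function's code
        try:
            return fn(*args)
        finally:
            deps_of[fn.__name__] = call_stack.pop()
    return wrapper

MULTIPLIER = 5

@tracked
def stageB(query):
    record(("file", "test.db"))       # stands in for an intercepted file read
    record(("global", "MULTIPLIER"))  # stands in for an intercepted global read
    return len(query) * MULTIPLIER

@tracked
def stageC(values):
    return [v + 1 for v in values]

@tracked
def stageA(queries):
    record(("file", "queries.txt"))   # stands in for reading the query file
    return sum(stageC([stageB(q) for q in queries]))

total = stageA(["SELECT 1", "SELECT 22"])
```

After the call, `deps_of["stageA"]` holds the code, file, and global dependencies of all three stages, while `deps_of["stageC"]` holds only its own code.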

Auto-memoizing real programs: Detecting impurity
“Before memoizing a given routine, the programmer needs to verify that there is no internal dependency on side effects. This is not always simple; despite attempts to encourage a functional programming style, programmers will occasionally discover that some routine their function depended upon had some deeply buried dependence on a global variable or the slot value of a CLOS [Common Lisp Object System] Instance.” [Hall and Mayfield, 1993]

Auto-memoizing real programs: Detecting impurity
• All functions start out pure
• Mark all functions on stack as impure when:
  – Mutating a non-local value
  – Writing to a file
  – Calling a non-deterministic function
• Data analysis functions mostly pure
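The rule in the bullets above (start pure, mark everything on the stack impure when a side effect happens) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `tracked`, `mark_impure`, and the example functions are hypothetical, and the side effect is reported by an explicit call, whereas the talk has the interpreter detect it.

```python
call_stack = []  # names of tracked functions currently executing
impure = set()   # functions observed to be impure; everything else is still considered pure

def mark_impure():
    """Called when a side effect is observed: every function on the stack becomes impure."""
    impure.update(call_stack)

def tracked(fn):
    def wrapper(*args):
        call_stack.append(fn.__name__)
        try:
            return fn(*args)
        finally:
            call_stack.pop()
    return wrapper

log = []

@tracked
def append_log(msg):
    log.append(msg)  # mutates a non-local value...
    mark_impure()    # ...which an instrumented interpreter would notice by itself

@tracked
def noisy_square(x):
    append_log("squaring")
    return x * x

@tracked
def pure_square(x):
    return x * x

noisy_square(3)
pure_square(3)
```

Here `noisy_square` ends up impure because it was on the stack when `append_log` mutated `log`, while `pure_square` stays eligible for memoization.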

Incremental recomputation
[Figure labels: queries.txt, test.db, and several "SELECT …" query lines]
    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)
    MULTIPLIER = 5
    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ...  # run for 10 minutes processing q
        return (res * MULTIPLIER)
    def stageC(datLst):
        res = ...  # run for 10 minutes munging datLst
        return res

Benefits
1. Less code and bugs
2. Faster iteration cycle
3. Real-time collaboration

Talk review
1. Motivation: ad-hoc data analysis scripts
2. Technique: fully automatic memoization
3. Benefits: faster iteration with simple code

Ongoing and future work
• Provenance browsing
• Database-aware caching
• Network-aware caching
• Lightweight programmer annotations
• Finer-grained tracking within functions

Implementation
• Python as target language
• Plug-and-play with no code changes
• Low run-time overhead
• Compatible with 3rd-party libraries
