Towards practical incremental recomputation for scientists




slide-1
SLIDE 1

Towards practical incremental recomputation for scientists

Philip J. Guo and Dawson Engler
Workshop on the Theory and Practice of Provenance
Feb 22, 2010

slide-2
SLIDE 2

Talk outline

  • 1. Motivation: ad-hoc data analysis scripts
  • 2. Technique: fully automatic memoization
  • 3. Benefits: faster iteration with simple code
slide-3
SLIDE 3

Types of programs

[Diagram: all programs written, arranged by size and complexity; labeled regions: research prototypes, data munging and analysis scripts]

slide-4
SLIDE 4

Problem

Scientific data processing and analysis scripts often execute for several minutes to hours, which slows down the scientist's iteration and debugging cycle.

slide-5
SLIDE 5

Manually coping

[Chart axes: code size and complexity vs. iteration / re-execution time]

  • Write initial single-file Python prototype
  • Rewrite to break computation into multiple stages (files) and selectively comment out code to test
  • Write code to cache intermediate results to disk and manually manage dependencies
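The last manual step, caching intermediate results to disk, might look like the sketch below. It is only an illustration: `cached_stage`, `slow_sum`, and the `stage1.pkl` file name are hypothetical, and `slow_sum` stands in for a long-running stage.

```python
import os
import pickle
import tempfile

def cached_stage(cache_path, compute, *args):
    """Return compute(*args), reusing a pickled result if cache_path exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute(*args)
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []  # records every real execution of the slow stage

def slow_sum(values):
    # Hypothetical stand-in for a stage that runs for minutes.
    calls.append(values)
    return sum(values)

cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "stage1.pkl")
first = cached_stage(path, slow_sum, [1, 2, 3])   # computes and writes the cache
second = cached_stage(path, slow_sum, [1, 2, 3])  # served from disk
```

Note that this cache ignores its arguments and the code of `compute`: if either changes, the scientist has to remember to delete the file by hand. That bookkeeping is the "manually manage dependencies" burden named above.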

slide-6
SLIDE 6

Automated solution

[Chart axes: code amount and complexity vs. iteration / re-execution time]

  • Write initial single-file Python prototype
  • Let interpreter cache and manage intermediate results

slide-7
SLIDE 7

Ideal workflow

  • 1. Write simple first version of script
  • 2. Execute and wait for 1 hour to get results
  • 3. Interpret results and notice a bug
  • 4. Edit script slightly to fix that bug
  • 5. Re-execute and wait for a few seconds
  • 6. Enhance script with new functions
  • 7. Re-execute and wait for a few minutes
slide-8
SLIDE 8

Technique

Fully automatic and persistent memoization for a general-purpose imperative language

slide-9
SLIDE 9

Traditional memoization

    def Fib(n):
        if n <= 2:
            return 1
        else:
            return Fib(n-1) + Fib(n-2)

slide-10
SLIDE 10

Traditional memoization

    MemoTable = {}

    def Fib(n):
        if n <= 2:
            return 1
        else:
            if n in MemoTable:
                return MemoTable[n]
            else:
                MemoTable[n] = Fib(n-1) + Fib(n-2)
                return MemoTable[n]

    Input (n)    Result
    1            1
    2            1
    3            2
    4            3
    5            5
    6            8
    7            13
    …            …
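The table-lookup pattern on this slide can be factored out of the function body. The sketch below does so with a decorator; `memoize` and the `table` attribute are illustrative names, not something the talk defines.

```python
import functools

def memoize(func):
    """Wrap func so each distinct argument tuple is computed only once."""
    table = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in table:
            table[args] = func(*args)
        return table[args]

    wrapper.table = table  # exposed only so the memo table can be inspected
    return wrapper

@memoize
def fib(n):
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)

answer = fib(7)
```

After the call, `fib.table` holds one entry per input from 1 to 7, mirroring the Input/Result table. The table lives in memory only, so it is lost when the script exits.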

slide-11
SLIDE 11

Auto‐memoizing real programs

  • 1. Code changes
  • 2. External dependencies
  • 3. Side‐effects
slide-12
SLIDE 12

Auto-memoizing real programs:

Detecting code changes

    def stageC(datLst):
        res = ... # run for 10 minutes munging datLst
        return res

    Input (datLst)    Result
    [1,2,3,4]         10
    [5,6,7,8]         20
    [9,10,11,12]      30

slide-13
SLIDE 13

Auto-memoizing real programs:

Detecting code changes

    def stageC(datLst):
        res = ... # run for 10 minutes munging datLst
        return res

    Input (datLst)    Code deps.      Result
    [1,2,3,4]         stageC -> C1    10
    [5,6,7,8]         stageC -> C1    20
    [9,10,11,12]      stageC -> C1    30

slide-14
SLIDE 14

Auto-memoizing real programs:

Detecting code changes

    def stageC(datLst):
        res = ... # run for 10 minutes munging datLst
        return (res * -1)

    Input (datLst)    Code deps.      Result
    [1,2,3,4]         stageC -> C1    10
    [5,6,7,8]         stageC -> C1    20
    [9,10,11,12]      stageC -> C1    30

slide-15
SLIDE 15

Auto-memoizing real programs:

Detecting code changes

    def stageC(datLst):
        res = ... # run for 10 minutes munging datLst
        return (res * -1)

    Input (datLst)    Code deps.      Result
    [1,2,3,4]         stageC -> C1    10
    [5,6,7,8]         stageC -> C1    20
    [9,10,11,12]      stageC -> C1    30
    [1,2,3,4]         stageC -> C2    -10
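One way to realize the "Code deps." column is to fingerprint each function's compiled code and make that fingerprint part of the memo key. The talk does not say how code versions are computed, so hashing `__code__` is an assumption here; `code_version` and `call_memoized` are hypothetical names, and `sum(datLst)` stands in for the elided ten-minute computation.

```python
import hashlib

def code_version(func):
    """Fingerprint a function from its bytecode, constants, and referenced names."""
    code = func.__code__
    payload = (code.co_code
               + repr(code.co_consts).encode()
               + repr(code.co_names).encode())
    return hashlib.sha256(payload).hexdigest()

memo_table = {}

def call_memoized(func, arg):
    """Look up (name, code version, argument); compute only on a miss."""
    key = (func.__name__, code_version(func), repr(arg))
    if key not in memo_table:
        memo_table[key] = func(arg)
    return memo_table[key]

def stageC(datLst):
    return sum(datLst)            # stand-in for the long computation

v1 = code_version(stageC)
r1 = call_memoized(stageC, [1, 2, 3, 4])

def stageC(datLst):               # edited version of the same function
    return sum(datLst) * -1

v2 = code_version(stageC)
r2 = call_memoized(stageC, [1, 2, 3, 4])
```

The edit changes the fingerprint, so the old entry for `[1,2,3,4]` is not reused and a second entry is added, as in the last row of the table.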

slide-16
SLIDE 16

Auto-memoizing real programs:

Detecting file reads

    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ... # run for 10 minutes processing q
        return res

    Input (queryStr)      Code deps.      Result
    SELECT * FROM tbl1    stageB -> B1    1
    SELECT * FROM tbl2    stageB -> B1    2

slide-17
SLIDE 17

Auto-memoizing real programs:

Detecting file reads

    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ... # run for 10 minutes processing q
        return res

    Input (queryStr)      Code deps.      File deps.        Result
    SELECT * FROM tbl1    stageB -> B1    test.db -> DB1    1
    SELECT * FROM tbl2    stageB -> B1    test.db -> DB1    2
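The "File deps." column can be pictured as a map from each file read to a version of its contents. The sketch below assumes a content hash as the version and is handed the list of files explicitly, whereas the talk has the interpreter detect reads on its own; `file_version`, `make_entry`, and `entry_is_valid` are hypothetical helpers.

```python
import hashlib
import os
import tempfile

def file_version(path):
    """Fingerprint a file by hashing its contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def make_entry(result, paths_read):
    """Store a result together with the version of every file it read."""
    return {"result": result,
            "file_deps": {p: file_version(p) for p in paths_read}}

def entry_is_valid(entry):
    """A cached result is reusable only if no recorded file has changed."""
    return all(os.path.exists(p) and file_version(p) == v
               for p, v in entry["file_deps"].items())

# Create a throwaway file that plays the role of test.db.
db_path = os.path.join(tempfile.mkdtemp(), "test.db")
with open(db_path, "w") as f:
    f.write("original contents")

entry = make_entry(result=1, paths_read=[db_path])
valid_before = entry_is_valid(entry)

with open(db_path, "w") as f:
    f.write("updated contents")
valid_after = entry_is_valid(entry)
```

Once the file changes, the entry no longer validates and the stage would have to be re-run.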

slide-18
SLIDE 18

Auto-memoizing real programs:

Detecting global variable reads

    MULTIPLIER = 5

    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ... # run for 10 minutes processing q
        return (res * MULTIPLIER)

    Input (queryStr)      Code deps.      File deps.        Result
    SELECT * FROM tbl1    stageB -> B2    test.db -> DB1    5

slide-19
SLIDE 19

Auto-memoizing real programs:

Detecting global variable reads

    MULTIPLIER = 5

    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ... # run for 10 minutes processing q
        return (res * MULTIPLIER)

    Input (queryStr)      Code deps.      File deps.        Global deps.       Result
    SELECT * FROM tbl1    stageB -> B2    test.db -> DB1    MULTIPLIER -> 5    5

slide-20
SLIDE 20

Auto-memoizing real programs:

Detecting global variable reads

    MULTIPLIER = 10

    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ... # run for 10 minutes processing q
        return (res * MULTIPLIER)

    Input (queryStr)      Code deps.      File deps.        Global deps.        Result
    SELECT * FROM tbl1    stageB -> B2    test.db -> DB1    MULTIPLIER -> 5     5
    SELECT * FROM tbl1    stageB -> B2    test.db -> DB1    MULTIPLIER -> 10    10
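A rough approximation of the "Global deps." column is to snapshot the module-level values a function's bytecode refers to and compare that snapshot before reusing a cached result. This static approach is an assumption made for illustration (the slides describe detecting reads, not inspecting bytecode), `global_deps` is a hypothetical helper, and this `stageB` is a simplified stand-in for the one on the slide.

```python
def global_deps(func):
    """Snapshot the non-callable module globals that func's bytecode names."""
    module_globals = func.__globals__
    return {name: module_globals[name]
            for name in func.__code__.co_names
            if name in module_globals and not callable(module_globals[name])}

MULTIPLIER = 5

def stageB(res):
    # Simplified stand-in: only the global read matters here.
    return res * MULTIPLIER

saved_deps = global_deps(stageB)                 # recorded alongside the result
still_valid = (global_deps(stageB) == saved_deps)

MULTIPLIER = 10                                  # the scientist edits the global
valid_after_change = (global_deps(stageB) == saved_deps)
```

With `MULTIPLIER` changed from 5 to 10 the snapshot no longer matches, so the old entry is skipped and a new row is recorded, as in the table.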

slide-21
SLIDE 21

Auto-memoizing real programs:

Detecting transitive dependencies

    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)

    Input (filename)    Code deps.      File deps.            Global deps.    Result
    queries.txt         stageA -> A1    queries.txt -> Q1                     50

slide-22
SLIDE 22

Auto-memoizing real programs:

Detecting transitive dependencies

    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)

    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ... # run for 10 minutes processing q
        return (res * MULTIPLIER)

    def stageC(datLst):
        res = ... # run for 10 minutes munging datLst
        return res

[Diagram labels: queries.txt, MULTIPLIER = 5, test.db]

slide-23
SLIDE 23

Auto-memoizing real programs:

Detecting transitive dependencies

    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)

    Input (filename)    Code deps.      File deps.            Global deps.       Result
    queries.txt         stageA -> A1    queries.txt -> Q1     MULTIPLIER -> 5    50
                        stageB -> B1    test.db -> DB1
                        stageC -> C1
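The way a caller inherits its callees' dependencies can be sketched with a stack of dependency sets: when a call returns, its set is merged into the set of the call beneath it. Everything named here (`tracked`, `note_dependency`, `recorded`) is hypothetical, dependencies are reported by hand rather than detected, and the stage bodies are trivial stand-ins.

```python
dep_stack = []   # one dependency set per active call
recorded = {}    # function name -> dependencies seen during its last call

def tracked(func):
    """Record func's own and inherited dependencies each time it runs."""
    def wrapper(*args):
        dep_stack.append({func.__name__})
        try:
            return func(*args)
        finally:
            deps = dep_stack.pop()
            recorded[func.__name__] = deps
            if dep_stack:
                dep_stack[-1] |= deps  # propagate to the caller
    return wrapper

def note_dependency(name):
    """Hypothetical hook standing in for automatic detection of a read."""
    dep_stack[-1].add(name)

@tracked
def stageB(query):
    note_dependency("test.db")
    note_dependency("MULTIPLIER")
    return len(query)

@tracked
def stageC(lst):
    return lst

@tracked
def stageA(queries):
    note_dependency("queries.txt")
    lst = [stageB(q) for q in queries]
    return sum(stageC(lst))

total = stageA(["SELECT 1", "SELECT 22"])
```

`recorded["stageA"]` ends up containing all three functions, both files, and the global, matching the single wide row in the table.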

slide-24
SLIDE 24

Auto-memoizing real programs:

Detecting impurity

“Before memoizing a given routine, the programmer needs to verify that there is no internal dependency on side effects. This is not always simple; despite attempts to encourage a functional programming style, programmers will occasionally discover that some routine their function depended upon had some deeply buried dependence on a global variable or the slot value of a CLOS [Common Lisp Object System] Instance.” [Hall and Mayfield, 1993]

slide-25
SLIDE 25
Auto-memoizing real programs:

Detecting impurity

  • All functions start out pure
  • Mark all functions on stack as impure when:
      – Mutating a non-local value
      – Writing to a file
      – Calling a non-deterministic function
  • Data analysis functions mostly pure
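The rule "mark all functions on stack as impure" can be sketched with an explicit call stack. The names below (`traced`, `report_impure_action`, and the three example functions) are hypothetical, and the impure action is reported by hand where the talk's interpreter would detect it automatically.

```python
call_stack = []  # names of the functions currently executing
impure = set()   # functions observed to be impure; all others are treated as pure

def traced(func):
    """Keep call_stack in sync with the functions that are currently running."""
    def wrapper(*args):
        call_stack.append(func.__name__)
        try:
            return func(*args)
        finally:
            call_stack.pop()
    return wrapper

def report_impure_action():
    """Hypothetical hook: invoked on a file write, non-local mutation, etc."""
    impure.update(call_stack)

@traced
def write_log(message):
    report_impure_action()  # stands in for writing to a file

@traced
def analyze(values):
    return sum(values)

@traced
def pipeline(values):
    write_log("start")
    return analyze(values)

result = pipeline([1, 2, 3])
```

Here `write_log` and its caller `pipeline` are marked impure, while `analyze` stays pure and remains a candidate for memoization.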

slide-26
SLIDE 26

Incremental recomputation

    def stageA(filename):
        lst = []
        for line in open(filename):
            lst.append(stageB(line))
        transformedLst = stageC(lst)
        return sum(transformedLst)

    def stageB(queryStr):
        db = sql_open_db("test.db")
        q = db.query(queryStr)
        res = ... # run for 10 minutes processing q
        return (res * MULTIPLIER)

    def stageC(datLst):
        res = ... # run for 10 minutes munging datLst
        return res

[Diagram labels: queries.txt (five lines, each SELECT …), MULTIPLIER = 5, test.db]
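A toy illustration of the payoff, under the assumption that per-query results of `stageB` are memoized: when one more query is appended to the input, only that query is actually executed on the second run. The in-memory `memo` dictionary, the `runs` log, and the query strings are all illustrative; the talk's memoization is persistent across runs.

```python
memo = {}  # query -> cached stageB result (in memory only in this sketch)
runs = []  # queries for which stageB really executed

def stageB(query):
    if query not in memo:
        runs.append(query)
        memo[query] = len(query)  # stand-in for the long computation
    return memo[query]

def stageA(queries):
    return sum(stageB(q) for q in queries)

first = stageA(["SELECT a", "SELECT b", "SELECT c"])
runs_after_first = list(runs)

# A fourth query is added; the first three are served from the memo table.
second = stageA(["SELECT a", "SELECT b", "SELECT c", "SELECT d"])
new_runs = runs[len(runs_after_first):]
```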

slide-27
SLIDE 27

Benefits

  • 1. Less code and fewer bugs
  • 2. Faster iteration cycle
  • 3. Real-time collaboration
slide-28
SLIDE 28
Talk review

  • 1. Motivation: ad-hoc data analysis scripts
  • 2. Technique: fully automatic memoization
  • 3. Benefits: faster iteration with simple code

slide-29
SLIDE 29
slide-30
SLIDE 30

Ongoing and future work

  • Provenance browsing
  • Database‐aware caching
  • Network‐aware caching
  • Lightweight programmer annotations
  • Finer-grained tracking within functions
slide-31
SLIDE 31

Implementation

  • Python as target language
  • Plug-and-play with no code changes
  • Low run-time overhead
  • Compatible with 3rd-party libraries