"[An article about computational science in a scientific - - PowerPoint PPT Presentation

an article about computational science in a scientific
SMART_READER_LITE
LIVE PREVIEW

"[An article about computational science in a scientific - - PowerPoint PPT Presentation

P RUNE : A Preserving Run Environment for Reproducible Scientific Computing -Peter Ivie Reproducibility "[An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of


slide-1
SLIDE 1

PRUNE: A Preserving Run Environment for Reproducible Scientific Computing

  • Peter Ivie
slide-2
SLIDE 2

Reproducibility

  • "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions." – Jon Claerbout

slide-3
SLIDE 3

Verify and Extend

  • Don’t re-invent the wheel
  • Stand on the shoulders of giants

slide-4
SLIDE 4

PRUNE features

  • Designed for Big Data
  • Manage storage and compute resources
  • Reproducible workflow specifications
  • Share workflow with others
  • Reshare changes back
  • User defined granularity


slide-5
SLIDE 5

Accepted philosophy

[Diagram: Design → Execute → Observe → Share/Publish → Preserve, with preservation deferred until later]

  • Libraries
  • Hardware
  • Network
  • System Administrators
  • Remote Collaborators
  • Graduated Students

slide-6
SLIDE 6

Proposed philosophy

[Diagram: a preserve-first cycle — Design, Preserve, Execute, Observe, Share/Publish, with an Unpreserve step available when needed]

slide-7
SLIDE 7

Differences

  • Git: User decides when to preserve

slide-8
SLIDE 8

Differences

  • Git: User decides when to preserve
  • Preserve ALL specification changes

slide-9
SLIDE 9

Differences

  • Git: User decides when to preserve
  • Preserve ALL specification changes
  • Git: Code commits separate from code execution

slide-10
SLIDE 10

Differences

  • Git: User decides when to preserve
  • Preserve ALL specification changes
  • Git: Code commits separate from code execution
  • System manages ALL computation

slide-11
SLIDE 11

Differences

  • Git: User decides when to preserve
  • Preserve ALL specification changes
  • Git: Code commits separate from code execution
  • System manages ALL computation
  • Remove unneeded items later on

slide-12
SLIDE 12

What to Preserve

[Diagram: a Prune Task — an Environment (Virtual Machine / Container) wrapping Hardware, Kernel, Operating System, and Software — around the following task record:]

  Command: ‘do < in.txt in.dat > out.txt o2.txt’
  arguments: [ file_id1, file_id2 ]
  parameters: [ ‘in.txt’, ‘in.dat’ ]
  returns: [ ‘out.txt’, ‘o2.txt’ ]
  results: [ file_id3, file_id4 ]
  environment: envi_id1
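The task record above is pure data, which is what makes it preservable and shareable. A minimal sketch of the idea (not PRUNE's actual internals; the field names follow the slide, but the hashing scheme here is an illustrative assumption):

```python
# Sketch: bundle everything a task needs into a plain record, then
# derive a stable identifier from that record. Field names follow the
# slide; the SHA-1-over-JSON scheme is an assumption for illustration.
import hashlib
import json

task = {
    "environment": "envi_id1",                 # VM/container to run in
    "cmd": "do < in.txt in.dat > out.txt o2.txt",
    "arguments": ["file_id1", "file_id2"],     # content ids of the inputs
    "parameters": ["in.txt", "in.dat"],        # names the command sees
    "returns": ["out.txt", "o2.txt"],          # files the command produces
}

# Because the record is pure data, a deterministic hash of it can
# identify the computation itself, independent of where or when it runs.
task_id = hashlib.sha1(
    json.dumps(task, sort_keys=True).encode()
).hexdigest()
print(task_id)
```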

slide-13
SLIDE 13

Overview

User space (Workflow Version #2):

  E1 = envi_add( type=‘EC2’, image=‘hep.beta’ )
  E2 = envi_add( type=‘EC2’, image=‘hep.stable’ )
  F1 = file_add( filename=‘./observed.dat’ )
  T4 = task_add( cmd=‘simulate > output’, returns=[‘output’], environment=E1 )
  T5 = task_add( args=[ F1 ], ... )   (remaining arguments the same as above)
  T6 = task_add( args=[ T4[0] ], params=[‘input_data’], cmd=‘analyze < in_data > out_data’, returns=[‘out_data’], environment=E2 )
  T7 = task_add( cmd=‘plot in1 in2 out1 out2’, args=[ T5[0], T6[0] ], params=[‘in1’,‘in2’], returns=[‘out1’,‘out2’], environment=E2 )
  export( [ T7[1] ], filename=‘./plot.jpg’ )

[Diagram: the user interface feeds PRUNE space, where files F1–F9 and tasks T1–T7 (running on compute resources in environments E1 and E2) form a Simulate → Analyze → Plot pipeline]

slide-14
SLIDE 14

Sample code: Merge sort

  #!/usr/bin/env python
  from prune import client
  prune = client.Connect()  # Use SQLite3

  ###### Import sources stage ######
  E1 = prune.env_add( type='EC2', image='ami-b06a98d8' )
  D1, D2 = prune.file_add( 'nouns.txt', 'verbs.txt' )

slide-15
SLIDE 15

Sample code: Merge sort

  ###### Sort stage ######
  D3, = prune.task_add( returns=['output.txt'], env=E1,
                        cmd='sort input.txt > output.txt',
                        args=[D1], params=['input.txt'] )
  D4, = prune.task_add( returns=['output.txt'], env=E1,
                        cmd='sort input.txt > output.txt',
                        args=[D2], params=['input.txt'] )

  ###### Merge stage ######
  D5, = prune.task_add( returns=['merged_out.txt'], env=E1,
                        cmd='sort -m input*.txt > merged_out.txt',
                        args=[D3,D4], params=['input1.txt','input2.txt'] )
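To see what these three tasks compute, here is a plain-Python equivalent run locally. The word lists are made-up stand-ins for nouns.txt and verbs.txt, and `heapq.merge` plays the role of `sort -m` (merging already-sorted inputs without re-sorting):

```python
# Local sketch of what the three PRUNE tasks compute:
# sort each input independently, then merge the sorted streams.
import heapq

nouns = ["tree", "apple", "river"]   # stand-in for nouns.txt
verbs = ["run", "jump", "climb"]     # stand-in for verbs.txt

d3 = sorted(nouns)                   # task D3: sort input.txt > output.txt
d4 = sorted(verbs)                   # task D4: sort input.txt > output.txt
d5 = list(heapq.merge(d3, d4))       # task D5: sort -m input*.txt

print(d5)  # the merged output is globally sorted
```

The independence of the two sort tasks is what lets PRUNE schedule them in parallel on separate workers.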

slide-16
SLIDE 16

Prune Task

[Diagram: a Prune Task — an Environment (Virtual Machine / Container) wrapping Hardware, Kernel, Operating System, and Software — around the following task record:]

  Command: ‘do < in.txt in.dat > out.txt o2.txt’
  arguments: [ file_id1, file_id2 ]
  parameters: [ ‘in.txt’, ‘in.dat’ ]
  returns: [ ‘out.txt’, ‘o2.txt’ ]
  results: [ file_id3, file_id4 ]
  environment: envi_id1

slide-17
SLIDE 17

Sample code: Merge sort

  ###### Sort stage ######
  D3, = prune.task_add( returns=['output.txt'], env=E1,
                        cmd='sort input.txt > output.txt',
                        args=[D1], params=['input.txt'] )
  D4, = prune.task_add( returns=['output.txt'], env=E1,
                        cmd='sort input.txt > output.txt',
                        args=[D2], params=['input.txt'] )

  ###### Merge stage ######
  D5, = prune.task_add( returns=['merged_out.txt'], env=E1,
                        cmd='sort -m input*.txt > merged_out.txt',
                        args=[D3,D4], params=['input1.txt','input2.txt'] )

slide-18
SLIDE 18

Sample code: Merge sort

  ###### Execute the workflow ######
  prune.execute( worker_type='local', cores=8 )
  #prune.execute( worker_type='wq', name='myapp' )

  ###### Export ######
  prune.export( D5, 'merged.txt' )  # Final data
  prune.export( D5, 'wf.prune', lineage=2 )

slide-19
SLIDE 19

Sample code: Merge sort

  ###### Execute the workflow ######
  prune.execute( worker_type='local', cores=8 )
  #prune.execute( worker_type='wq', name='myapp' )

  ###### Export ######
  prune.export( D5, 'merged.txt' )  # Final data
  prune.export( D5, 'wf.prune', lineage=2 )

slide-20
SLIDE 20

Sharable workflow description file

  {"body": {"args": ["f908ff689b9e57f0055875d927d191ccd2d6deef:0", "319418e43783a78e3cb7e219f9a1211cba4b3b31:0"], "cmd": "sort -m input*.txt > merged_output.txt", "env": "da39a3ee5e6b4b0d3255bfef95601890afd80709", "env_vars": {}, "params": ["input1.txt", "input2.txt"], "precise": true, "returns": ["merged_output.txt"], "types": []}, "cbid": "e82855394e9dcdee03ed8a25c96c79245fd0481a", "size": 322, "type": "call", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.7171359}
  {"body": {"args": ["29ae0a576ab660cb17bf9b14729c7b464fa98cca"], "cmd": "sort input.txt > output.txt", "env": "da39a3ee5e6b4b0d3255bfef95601890afd80709", "env_vars": {}, "params": ["input.txt"], "precise": true, "returns": ["output.txt"], "types": []}, "cbid": "f908ff689b9e57f0055875d927d191ccd2d6deef", "size": 241, "type": "call", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.484422}
  {"body": {"args": ["48044131b31906e6c917d857ddd1539278c455cf"], "cmd": "sort input.txt > output.txt", "env": "da39a3ee5e6b4b0d3255bfef95601890afd80709", "env_vars": {}, "params": ["input.txt"], "precise": true, "returns": ["output.txt"], "types": []}, "cbid": "319418e43783a78e3cb7e219f9a1211cba4b3b31", "size": 241, "type": "call", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.6183109}
  {"cbid": "29ae0a576ab660cb17bf9b14729c7b464fa98cca", "size": 144, "type": "file", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.2482941}
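Since each line of the file is a self-contained JSON object, a collaborator can read a workflow back with line-by-line parsing and relink the calls by their content ids. A sketch with shortened, made-up `cbid` values (the real ones are hex digests like those above; `"bbb:0"` denotes output 0 of call `bbb`, as in the `args` fields shown):

```python
# Sketch: parse a JSON-lines workflow description and recover which
# call depends on which. The cbid values here are made up for brevity.
import json

lines = [
    '{"cbid": "aaa", "type": "file", "size": 144}',
    '{"cbid": "bbb", "type": "call",'
    ' "body": {"args": ["aaa"], "cmd": "sort input.txt > output.txt"}}',
    '{"cbid": "ccc", "type": "call",'
    ' "body": {"args": ["bbb:0"], "cmd": "sort -m input*.txt > merged_output.txt"}}',
]

records = {r["cbid"]: r for r in (json.loads(line) for line in lines)}

def deps(cbid):
    """Content ids this record depends on ("bbb:0" = output 0 of call bbb)."""
    body = records[cbid].get("body", {})
    return [arg.split(":")[0] for arg in body.get("args", [])]

print(deps("ccc"))  # the merge call depends on the sort call
```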

slide-21
SLIDE 21

Workflow evolution (US Censuses)

[Diagram: US Census files for 1850–1940 flowing through a seven-stage workflow]

  Stage 1: Uncompress (year+fragment)
  Stage 2: Normalize (year+fragment)
  Stage 3: Split by key (year+fragment+key)
  Stage 4: Join fragments (year+key)
  Stage 5: Pair by year (year1+year2+key)
  Stage 6: Group matches (year1+year2+key)
  Stage 7: Filter 1-1 matches (year1+year2+key)
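Stages 5 through 7 can be illustrated with toy records. The key function and data below are illustrative assumptions, not the workflow's real matching criteria:

```python
# Sketch of stages 5-7: pair people across two census years by a
# shared key, then keep only the unambiguous 1-to-1 matches.
from collections import defaultdict

# (key, person_id) records after stages 1-4; keys and ids are made up
census = {
    1850: [("smith-john", "p1"), ("doe-jane", "p2"), ("doe-jane", "p3")],
    1860: [("smith-john", "p4"), ("doe-jane", "p5")],
}

# Stage 5: pair by year - collect candidate pairs sharing a key
pairs = defaultdict(list)
for key, pid in census[1850]:
    for key2, pid2 in census[1860]:
        if key == key2:
            pairs[key].append((pid, pid2))

# Stages 6-7: group matches per key, keep only 1-1 matches
matches = {k: v[0] for k, v in pairs.items() if len(v) == 1}
print(matches)  # {'smith-john': ('p1', 'p4')}
```

Splitting by key first (stage 3) is what keeps this pairing tractable: candidates only need to be compared within a key bucket, not across the whole census.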

slide-22
SLIDE 22

Redefine filter criteria

[Diagram: the same seven-stage census workflow, showing which stages re-execute after the filter criteria change]

slide-23
SLIDE 23

Redefine match criteria

[Diagram: the same seven-stage census workflow, showing which stages re-execute after the match criteria change]

slide-24
SLIDE 24

New key function chosen

[Diagram: the same seven-stage census workflow, showing which stages re-execute after a new key function is chosen]

slide-25
SLIDE 25

Re-normalize

[Diagram: the same seven-stage census workflow, showing which stages re-execute after re-normalization]

slide-26
SLIDE 26

New input data

[Diagram: the same seven-stage census workflow, showing which stages re-execute when new input data is added]

slide-27
SLIDE 27

Derivation History = Cacheable Results

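A sketch of why this works (not PRUNE's implementation): when a task's identity is a deterministic hash of its command, inputs, and environment, the preserved derivation history doubles as a result cache, so an unchanged task is never executed twice:

```python
# Sketch: content-addressed task ids turn derivation history into a cache.
import hashlib
import json

store = {}   # task_id -> result; stands in for the preserved store
runs = []    # records which tasks actually executed

def run_task(cmd, args, env):
    task_id = hashlib.sha1(
        json.dumps([cmd, args, env], sort_keys=True).encode()
    ).hexdigest()
    if task_id not in store:
        runs.append(task_id)                  # cache miss: really execute
        store[task_id] = f"result-of({cmd})"  # placeholder for real output
    return store[task_id]

run_task("sort input.txt", ["f1"], "E1")  # run #1: executes
run_task("sort input.txt", ["f1"], "E1")  # run #2: served from history
print(len(runs))  # only one real execution
```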

slide-28
SLIDE 28

Execution time cut in half for run #2


slide-29
SLIDE 29

Quotas


slide-30
SLIDE 30

Scalability

  • ~12,000 parallel cores
  • ~3 million tasks
  • Overhead: ~1% above native wall clock

slide-31
SLIDE 31

Sharing workflow between users

  {"body": {"args": ["f908ff689b9e57f0055875d927d191ccd2d6deef:0", "319418e43783a78e3cb7e219f9a1211cba4b3b31:0"], "cmd": "sort -m input*.txt > merged_output.txt", "env": "da39a3ee5e6b4b0d3255bfef95601890afd80709", "env_vars": {}, "params": ["input1.txt", "input2.txt"], "precise": true, "returns": ["merged_output.txt"], "types": []}, "cbid": "e82855394e9dcdee03ed8a25c96c79245fd0481a", "size": 322, "type": "call", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.7171359}
  {"body": {"args": ["29ae0a576ab660cb17bf9b14729c7b464fa98cca"], "cmd": "sort input.txt > output.txt", "env": "da39a3ee5e6b4b0d3255bfef95601890afd80709", "env_vars": {}, "params": ["input.txt"], "precise": true, "returns": ["output.txt"], "types": []}, "cbid": "f908ff689b9e57f0055875d927d191ccd2d6deef", "size": 241, "type": "call", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.484422}
  {"body": {"args": ["48044131b31906e6c917d857ddd1539278c455cf"], "cmd": "sort input.txt > output.txt", "env": "da39a3ee5e6b4b0d3255bfef95601890afd80709", "env_vars": {}, "params": ["input.txt"], "precise": true, "returns": ["output.txt"], "types": []}, "cbid": "319418e43783a78e3cb7e219f9a1211cba4b3b31", "size": 241, "type": "call", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.6183109}
  {"cbid": "29ae0a576ab660cb17bf9b14729c7b464fa98cca", "size": 144, "type": "file", "wfid": "a0230143-9b3a-4766-809d-5b7172e9b967", "when": 1476886144.2482941}

slide-32
SLIDE 32

http://ccl.cse.nd.edu/research/papers/

  • Sample workflows:
    – Merge sort
    – Pairwise comparisons (US Censuses)
    – High-energy Physics
  • http://ccl.cse.nd.edu/software/prune/prune.html

Thank You!

For more information: pivie@nd.edu