


slide-1
SLIDE 1

LoonyBin: Keeping Language Technologists Sane through Automated Management of (Hyper)Workflows

Jonathan Clark and Alon Lavie
Carnegie Mellon University

LREC 2010

Thursday, May 20, 2010

slide-2
SLIDE 2

Outline

  • Empirical NLP Research
  • Day-to-day issues
  • Current problems
  • LoonyBin’s solutions
  • Workflows
  • HyperWorkflows
slide-3
SLIDE 3

Empirical NLP

  • Plumbing: Gluing (Linux) tools together
  • Recording results
  • Sanity checking
  • Running variations
  • Moving between clusters & schedulers


slide-18
SLIDE 18

Proposed Solution: HyperWorkflow Management

slide-19
SLIDE 19

LoonyBin

  • Define the tools (inputs/outputs/parameters → shell commands)
  • Define the workflow (DAG of steps and dependencies)
  • Generate & run a shell script
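The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not LoonyBin's actual implementation: the `TOOLS` and `WORKFLOW` dictionaries, the tool names, the `.sh` commands, and the `work/` directory layout are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tool definitions: tool name -> shell command template.
TOOLS = {
    "tokenize": "tokenize.sh {input} > {output}",
    "align": "align.sh {input} > {output}",
}

# Hypothetical workflow: step name -> (tool, steps it depends on).
WORKFLOW = {
    "tok": ("tokenize", []),
    "aln": ("align", ["tok"]),
}


def generate_script(tools, workflow, root="work"):
    """Emit a bash script that runs every step after its dependencies."""
    sorter = TopologicalSorter({s: deps for s, (_, deps) in workflow.items()})
    lines = ["#!/usr/bin/env bash", "set -e"]
    for step in sorter.static_order():
        tool, deps = workflow[step]
        # A step reads its first dependency's output, or a raw input file.
        src = f"{root}/{deps[0]}/out" if deps else f"{root}/input"
        lines.append(f"mkdir -p {root}/{step}")
        lines.append(tools[tool].format(input=src, output=f"{root}/{step}/out"))
    return "\n".join(lines) + "\n"


print(generate_script(TOOLS, WORKFLOW))
```

The topological sort guarantees that a step's command appears in the script only after the commands of the steps it depends on.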


slide-29
SLIDE 29

[Figure: screenshot with callouts: Available Tools; Drag and Drop; Tooltips for Params; Machine Assignment]


slide-33
SLIDE 33

Generating a Script for

[Figure: workflow fragment with nodes A, B, W]

Python Tool Descriptor

  • INPUTS: foreignCorpus, nativeCorpus
  • PARAMETERS: fertility
  • OUTPUTS: alignments

Parameters & dependencies from workflow

  • 0.01
  • A’s output “x”
  • B’s output “y”

LoonyBin assigns paths

  • …/inputs/f
  • …/inputs/n
  • …/outputs/wa

java edu.cmu.Tokenizer ../inputs/f ../inputs/n > ../outputs/wa
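To make the slide's data flow concrete, here is a hedged Python sketch of how a descriptor could turn bound inputs, parameters, and outputs into the command shown above. The `ToolDescriptor` class, its `command` method, and the template syntax are assumptions made for illustration; they are not LoonyBin's real descriptor API. As on the slide, the `fertility` parameter is bound but does not appear in the command.

```python
from dataclasses import dataclass


@dataclass
class ToolDescriptor:
    """Hypothetical stand-in for a tool descriptor (not LoonyBin's API)."""
    inputs: list
    parameters: list
    outputs: list
    template: str  # shell command template filled with paths and values

    def command(self, paths, params):
        """Build the shell command once every declared name is bound."""
        missing = [n for n in self.inputs + self.outputs if n not in paths]
        missing += [n for n in self.parameters if n not in params]
        if missing:
            raise ValueError(f"unbound names: {missing}")
        return self.template.format(**paths, **params)


aligner = ToolDescriptor(
    inputs=["foreignCorpus", "nativeCorpus"],
    parameters=["fertility"],
    outputs=["alignments"],
    template="java edu.cmu.Tokenizer {foreignCorpus} {nativeCorpus} > {alignments}",
)

cmd = aligner.command(
    # Paths as they appear in the slide's generated command.
    paths={
        "foreignCorpus": "../inputs/f",
        "nativeCorpus": "../inputs/n",
        "alignments": "../outputs/wa",
    },
    # Parameter value taken from the workflow.
    params={"fertility": 0.01},
)
print(cmd)
```

Failing early on unbound names is a design choice of this sketch: it surfaces a mis-wired workflow before any shell command runs.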

slide-34
SLIDE 34

So far...

  • Complaints about current implementation of empirical NLP experiments
  • Define the tools (inputs/outputs/parameters)
  • Define the workflow (DAG of steps and dependencies)
  • Generate & run a shell script


slide-40
SLIDE 40

HyperWorkflows

  • HyperWorkflows: Shared substructure in experiments
  • Encode small variations in a HyperDAG

[Figure: HyperDAG. Nodes: Parallel Corpus, Target Language Corpus, Filter Corpus, Word Alignment, Stanford Parser, Charniak Parser, Build Syntactic Translation Model, Moses Phrase Table Training, Build Language Model, Minimum Error Rate Training, Decode Sentences. Edge labels: st, ch, {st,ch}, syntax, moses, {syntax-st, syntax-ch, moses}. Callouts: Packing Node; Realizations; Don’t re-run; Organized directory structure & easy-to-parse logs]
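The way nested alternatives expand into named realizations can be sketched in a few lines of Python. The nested-dictionary encoding and the `expand` function are assumptions chosen for illustration, not LoonyBin's internal representation; only the branch labels (syntax, moses, st, ch) come from the slide.

```python
def expand(packing):
    """Yield one realization name per path through nested alternatives.

    `packing` maps a branch label to either None (no further choice)
    or another dictionary of alternatives nested under that branch.
    """
    for label, sub in packing.items():
        if sub is None:
            yield label
        else:
            for rest in expand(sub):
                yield f"{label}-{rest}"


# Translation-model choice from the slide: the syntax branch has a
# further parser choice (st or ch); the moses branch has none.
TRANSLATION_MODEL = {"syntax": {"st": None, "ch": None}, "moses": None}

realizations = list(expand(TRANSLATION_MODEL))
print(realizations)  # ['syntax-st', 'syntax-ch', 'moses']
```

Steps upstream of a choice belong to every realization that passes through them, which is why they need to run only once.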


slide-44
SLIDE 44

Multiple Machines and Schedulers

  • Design Machine (Java)
  • Home Execution Machine (UNIX): Manually Copy Bash Script
  • Remote Execution Machines (UNIX): Passwordless SSH
  • Schedulers: Bash, Sun Grid Engine, Condor
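As a rough illustration of dispatching one generated script to different machines and schedulers, the sketch below only builds the argument vector and does not execute anything. The function name `submit_argv`, the scheduler keys, and the host name are hypothetical; `qsub -cwd` (Sun Grid Engine) and `ssh -o BatchMode=yes` (fail instead of prompting for a password) are real options. Condor is omitted because it is driven by a submit description file rather than a single command line.

```python
def submit_argv(scheduler, script, host=None):
    """Return the argv that would launch `script` under `scheduler`.

    If `host` is given, the command is wrapped in a non-interactive ssh
    call, which relies on passwordless SSH being configured.
    """
    if scheduler == "bash":
        argv = ["bash", script]
    elif scheduler == "sge":
        # qsub submits a job script; -cwd runs it from the current directory.
        argv = ["qsub", "-cwd", script]
    else:
        raise ValueError(f"unsupported scheduler: {scheduler}")
    if host is not None:
        argv = ["ssh", "-o", "BatchMode=yes", host] + argv
    return argv


print(submit_argv("bash", "run.sh"))
print(submit_argv("sge", "run.sh", host="node1"))  # "node1" is a made-up host
```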

slide-45
SLIDE 45

Other Things to Make Life Easier

  • Sanity checking at each step (embedded in Tool Descriptors)
  • Copying of files (including to HDFS)
  • Text-based workflow definition (in SVN)
  • Open-source LGPL License
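A per-step sanity check might look like the following Python sketch. The specific checks (files exist, are non-empty, and a parallel corpus has matching line counts) and the function names are assumptions chosen for illustration; the slides do not say which checks LoonyBin's tool descriptors actually embed.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(foreign_path, native_path):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for p in (foreign_path, native_path):
        if not os.path.isfile(p) or os.path.getsize(p) == 0:
            problems.append(f"missing or empty: {p}")
    # Only compare line counts when both files are present and non-empty.
    if not problems and count_lines(foreign_path) != count_lines(native_path):
        problems.append("corpora have different numbers of lines")
    return problems


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    f_path = os.path.join(tmp, "f.txt")
    n_path = os.path.join(tmp, "n.txt")
    with open(f_path, "w", encoding="utf-8") as out:
        out.write("ein Haus\nein Hund\n")
    with open(n_path, "w", encoding="utf-8") as out:
        out.write("a house\n")
    print(check_parallel(f_path, n_path))
```

Running such a check right after a step finishes catches a truncated or misaligned file before later steps consume it.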

slide-46
SLIDE 46

WANTED

Users & Contributors

  • Machine Translation Toolpack (released)
  • Corpus Processing Toolpack?
  • Parsing Toolpack?
  • Question Answering Toolpack?
  • Resource Directory Toolpack?
  • Speech Recognition Toolpack?

slide-47
SLIDE 47

Conclusion

  • Make your life easier
      • Automation
      • Sanity Checking
      • Logging
  • Make your colleagues’ lives easier
      • Reproducibility
      • Modularity

slide-48
SLIDE 48

Questions?

Tutorial & Software at http://loonybin.sourceforge.net

slide-50
SLIDE 50

Practical Issues

  • How does LoonyBin know when to run/rerun a step?
      • Each vertex × realization has a loon log file. If the file does not exist, the step is (re)run
  • What if I don’t want that many steps?
      • Workflows have many granularities!
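The log-file rule above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `root/vertex/realization/loon.log` layout; the real file name and directory structure used by LoonyBin are not given on the slide.

```python
from pathlib import Path


def log_path(root, vertex, realization):
    """Hypothetical location of the log for one vertex x realization pair."""
    return Path(root) / vertex / realization / "loon.log"


def needs_run(root, vertex, realization):
    """A step must be (re)run exactly when its log file does not exist."""
    return not log_path(root, vertex, realization).exists()


def run_step(root, vertex, realization, action):
    """Run `action` only if needed; write the log after it succeeds."""
    if not needs_run(root, vertex, realization):
        return False
    action()  # if this raises, no log is written, so the step reruns later
    log = log_path(root, vertex, realization)
    log.parent.mkdir(parents=True, exist_ok=True)
    log.write_text("done\n", encoding="utf-8")
    return True
```

Under this scheme, deleting a step's log file is enough to force that step to run again on the next invocation.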
slide-51
SLIDE 51

Recommendations

  • Store your workflow files in SVN
  • Store your log files in SVN -- experimental data is useful long after we get annoyed with the size of data files!
  • Log the SVN revision of frequently changing tools in your Loon logs -- build them from SVN every time to ensure you’re executing that version

slide-52
SLIDE 52

Future Work

  • Default parameters -- Short-term
  • Asynchronous DAG execution (currently all steps are run in serial) -- Mid-term
  • Workflow monitoring and reprioritization during execution -- Long-term
  • Encapsulation of workflows as “tools” (hierarchical tools) -- Long-term
  • Automatic file compression -- Long-term