Computational Notebooks Huq Imdadul, Memmel Marius 29.06.2020


SLIDE 1

29.06.2020 | Fachbereich Informatik | Software Engineering for Artificial Intelligence| Huq Imdadul, Memmel Marius | 1

Huq Imdadul, Memmel Marius

Computational Notebooks

SLIDE 2

Table of contents

1. Definition
2. What are computational notebooks?
3. Why use computational notebooks?
4. Use cases
5. What’s wrong with computational notebooks?
6. Conclusion / discussion

SLIDE 3

Definition

Literate Programming

‘I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature.’

  • Donald Knuth, Literate Programming (1984) [4]

‘[Literate programming] pairs the functionality of word processing software with both the shell and kernel of [a] notebook's programming language.’

  • Wikipedia, Notebook Interface [3]

SLIDE 4

Definition

Computational Notebook

‘A notebook interface (also called a computational notebook) is a virtual notebook environment used for literate programming.’

  • Wikipedia, Notebook Interface [3]

Mixed Notebooks

‘[Mixed notebooks are a] new generation of notebooks that is based on cells, each of which contains rich text or code that can be executed to compute results or generate visualizations.’

  • Exploration and Explanation in Computational Notebooks [12]
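
The cell idea in the definition above can be sketched in a few lines of Python. This is a toy model, not how any real notebook is implemented: the `Cell` and `Notebook` classes and their fields are hypothetical names chosen for illustration. Text cells are stored as-is, while code cells are executed in one shared namespace and their printed output is captured.

```python
from dataclasses import dataclass, field
import contextlib
import io


@dataclass
class Cell:
    kind: str          # "text" or "code" (hypothetical labels)
    source: str
    output: str = ""   # filled in when a code cell is run


@dataclass
class Notebook:
    cells: list = field(default_factory=list)
    namespace: dict = field(default_factory=dict)  # state shared by all code cells

    def run_all(self):
        """Execute code cells top to bottom and capture what they print."""
        for cell in self.cells:
            if cell.kind != "code":
                continue  # text cells are only displayed, never executed
            buffer = io.StringIO()
            with contextlib.redirect_stdout(buffer):
                exec(cell.source, self.namespace)
            cell.output = buffer.getvalue()


nb = Notebook([
    Cell("text", "# Mean of a small sample"),
    Cell("code", "values = [2, 4, 6]"),
    Cell("code", "print(sum(values) / len(values))"),
])
nb.run_all()
print(repr(nb.cells[2].output))
```

The third cell can use `values` only because the second cell ran earlier in the same namespace, which is the property that makes notebooks interactive.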

SLIDE 5

Some Examples

[11]

SLIDE 6

Technology, using Jupyter Notebooks as an example

❏ Frontend: code editor
❏ Kernels: computational engines
❏ Communication via API

  • -> Separation of content and execution
  • -> Multi-language support by swapping kernels

[Figure: multiple UIs and multiple kernels communicating via an API] [1]
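
The separation of content and execution can be seen in the notebook file itself. The sketch below builds a minimal document following the nbformat version 4 layout (a JSON object with `cells` and a `kernelspec` entry in `metadata`). The cell contents and the helper `with_kernel` are illustrative assumptions, and the kernel name `ir` only works where an R kernel is actually installed under that name.

```python
import json

# A notebook is plain JSON: the content (cells) is stored separately from
# the engine that will execute it (the kernelspec in the metadata).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {
            "name": "python3",
            "display_name": "Python 3",
            "language": "python",
        },
    },
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["1 + 1"],
        },
    ],
}


def with_kernel(nb, name, display_name, language):
    """Return a copy of the notebook that requests a different kernel (hypothetical helper)."""
    copy = json.loads(json.dumps(nb))  # deep copy via a JSON round trip
    copy["metadata"]["kernelspec"] = {
        "name": name,
        "display_name": display_name,
        "language": language,
    }
    return copy


# Assumption: an R kernel is registered under the name "ir".
r_notebook = with_kernel(notebook, "ir", "R", "R")
print(json.dumps(r_notebook["metadata"], indent=2))
```

Only the metadata changes when the kernel is swapped; the cells are untouched, although their source would of course have to be valid code in the new language.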

SLIDE 7

Template

[12]

SLIDE 8

A look at a data scientist's work

Assumptions / situations

❏ Small changes can lead to different results

  • -> documentation essential

❏ Iterative and exploratory approach

  • -> difficult documentation

❏ ‘Dead ends’

❏ Process creates many figures, files and scripts with similar names

Data science is an iterative exploratory process of extracting insights from data.

[1]

SLIDE 9

Computational notebooks to the rescue!

Combination of code, text and visualizations in a single document

Easy to share

Easy to iterate quickly and debug code → enables quick prototyping and exploratory data analysis (EDA)

[1]

SLIDE 10

And they can do even more …

Cloud offerings

Platform independence

Computational narrative

Single document

Reproducibility

...

SLIDE 11

Use Cases

Education:

Coding tutorials

Data analysis

Visualization (techniques)

Commercial:

distill.pub

Netflix

SLIDE 12

distill.pub: modern medium for presenting research

[5] [6]

SLIDE 13

Netflix: reimagining notebooks

❏ Unified tool for most common data jobs

❏ Run code, explore data, present results

❏ Use cases

  ❏ Data access
  ❏ Notebook templates (parameterization)
  ❏ Scheduling notebooks

[1]
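
Notebook templates can be illustrated with a small sketch of the parameterization idea: a template marks one cell with the tag `parameters`, and a new cell with overriding values is inserted right after it before the notebook is run. This mirrors the tagging convention used by Papermill, but the function `inject_parameters` and the `region`/`days` parameters are hypothetical and this is not Papermill's implementation.

```python
import copy


def inject_parameters(notebook, parameters):
    """Return a copy of the notebook with an extra cell that overrides the defaults."""
    nb = copy.deepcopy(notebook)
    lines = [f"{name} = {value!r}" for name, value in parameters.items()]
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": "\n".join(lines),
    }
    # Insert after the first cell tagged "parameters", or at the top if none exists.
    position = 0
    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break
    nb["cells"].insert(position, injected)
    return nb


template = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {"tags": ["parameters"]},
            "outputs": [],
            "source": "region = 'US'\ndays = 7",  # defaults of the template
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print(region, days)",
        },
    ]
}

report = inject_parameters(template, {"region": "DE", "days": 30})
print(report["cells"][1]["source"])
```

Because the injected cell runs after the defaults, its assignments win, and the template itself stays unchanged and reusable for the next scheduled run.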

SLIDE 14

Netflix: scheduling notebooks

[2] [13]

SLIDE 15

What's wrong with computational notebooks?

❏ Fundamental idea of a notebook

Quick input for a single step, get fast feedback, share … & iterate

❏ Negative effects

Leads to bad practices -> encourages polluting the global namespace, discourages code reusability …

Like junk food: consumed in excess, it makes the code bloated and harder to maintain

A number of pain points
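
A small sketch of why a shared global namespace is risky: the same three cells give different results depending on the order in which they are run. The cell names and contents are made up for illustration, and `exec` with one shared dictionary stands in for a real kernel.

```python
# Three hypothetical notebook cells that all read and write global state.
cells = {
    "load": "data = [3, 1, 2]",
    "sort": "data.sort()",
    "first": "result = data[0]",
}


def run(order):
    """Run the cells in the given order inside one shared namespace."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["result"]


top_to_bottom = run(["load", "sort", "first"])
out_of_order = run(["load", "first", "sort"])
print(top_to_bottom, out_of_order)
```

Both runs execute exactly the same code, yet only the execution order decides the value of `result`, and that order is not visible from the cell sources alone.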

SLIDE 16

What's wrong with computational notebooks?

9 pain points [7]

❏ Setup
  ❏ Repeating tasks such as loading and cleaning large external data sets.
  ❏ This sometimes also leads to crashes.
❏ Explore and analyze
  ❏ Modeling and visualizing at the same time is frustrating.

SLIDE 17

… 9 pain points

❏ Manage code
  ❏ Not an IDE: missing autocomplete, documentation and package dependency management.
❏ Reliability
  ❏ Occasional crashes -> no feedback -> inconsistent state, which makes notebooks unreliable.
  ❏ The result is restarting the notebook and repeating the process, which is especially costly with big data.
❏ Archival
  ❏ No out-of-the-box version control system.

SLIDE 18

… 9 pain points

❏ Security
  ❏ No masking of sensitive data when sharing a notebook for execution.
  ❏ No read-only or run-only mode.
  ❏ External tools are required to enforce access control.
❏ Share & collaborate
  ❏ Sharing requires the data plus documentation for the setup.
  ❏ Sharing with non-technical people is not supported.
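
One manual workaround for the masking problem is to clear all stored outputs before a notebook is shared, since outputs are saved inside the file next to the code. The sketch below works on the notebook as a plain dictionary; `strip_outputs` is a hypothetical helper, and it does not help if a secret is written in the cell source itself.

```python
import copy


def strip_outputs(notebook):
    """Return a copy of the notebook with outputs and execution counts removed."""
    nb = copy.deepcopy(notebook)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb


# Made-up example: a cell whose saved output leaks a credential.
private = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": "token=abc123\n"}
            ],
            "source": "print(load_token())",
        }
    ]
}

shareable = strip_outputs(private)
print(shareable["cells"][0])
```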

SLIDE 19

… 9 pain points

❏ Reproduce & reuse
  ❏ Dependencies and environment settings make notebooks difficult to reproduce and reuse.
❏ Notebooks as products
  ❏ Deploying to production requires significant cleanup and packaging of libraries, which is outside the core skills of a data scientist.
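
One modest step towards reproducibility is to record the environment a notebook ran in, for example in its first cell. The sketch below uses only the standard library (Python 3.8 or later); the function name `snapshot_environment` and the package names passed to it are placeholders.

```python
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Record interpreter, platform and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }


# Placeholder package names; the second one is deliberately not installed.
env = snapshot_environment(["pip", "surely-not-installed-package-xyz"])
print(env)
```

A snapshot like this does not recreate the environment, but it documents what a reader would need to install before rerunning the notebook.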

SLIDE 20

Good software engineering?

Rigorous software engineering isn't that important, I'm just experimenting!
You mean you're just doing science?
I just want to see if my model works before I put it into production.
Don't you need to write correct code to make sure it works?

Not in the best balance

Source: https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI

SLIDE 21

Tools for reducing pain

  • nbdime: diff and merge tools for Jupyter notebooks [9]
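
To see why notebooks need dedicated diff tools, note that a notebook file stores outputs and execution counts alongside the code, so a line-based diff of the raw file reports changes even when only a cell was rerun. The sketch below compares two notebooks by cell source only, using the standard `difflib` module. It illustrates the idea of a content-aware diff and is not how nbdime works internally; all names are hypothetical.

```python
import difflib


def source_lines(notebook):
    """Flatten a notebook into its cell sources, ignoring outputs and counters."""
    lines = []
    for index, cell in enumerate(notebook["cells"]):
        lines.append(f"# cell {index} ({cell['cell_type']})")
        source = cell["source"]
        if isinstance(source, list):  # source may be a string or a list of strings
            source = "".join(source)
        lines.extend(source.splitlines())
    return lines


def diff_sources(old, new):
    """Unified diff of the code and text only."""
    return list(difflib.unified_diff(
        source_lines(old), source_lines(new), "old", "new", lineterm=""))


def code_cell(source, count, text):
    """Build a made-up executed code cell for the example."""
    outputs = [{"output_type": "stream", "name": "stdout", "text": text}]
    return {"cell_type": "code", "execution_count": count,
            "metadata": {}, "outputs": outputs, "source": source}


old = {"cells": [code_cell("x = 1\nprint(x)", 1, "1\n")]}
rerun = {"cells": [code_cell("x = 1\nprint(x)", 7, "1\n")]}   # same code, rerun later
edited = {"cells": [code_cell("x = 2\nprint(x)", 8, "2\n")]}  # code actually changed

print(diff_sources(old, rerun))
print("\n".join(diff_sources(old, edited)))
```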

SLIDE 22

Tools for reducing pain

  • nbgather [10]
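
The slide only names nbgather. Assuming its core idea is to collect just the code that a chosen result depends on, that idea can be sketched as a backward pass over the cells. This is not nbgather's implementation: the functions `defined_and_used` and `gather` are hypothetical, and the sketch treats every name as global, ignores imports and in-place mutation, and would be confused by comprehension variables.

```python
import ast


def defined_and_used(source):
    """Names a cell assigns and names it reads, found by walking the syntax tree."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return defined, used


def gather(cells, target):
    """Indices of the cells needed to compute `target`, walking backwards."""
    needed = {target}
    keep = []
    for index in range(len(cells) - 1, -1, -1):
        defined, used = defined_and_used(cells[index])
        if defined & needed:
            keep.append(index)
            # This cell satisfies some needs and introduces the names it reads.
            needed = (needed - defined) | used
    return sorted(keep)


# Made-up exploratory session containing a dead end (cells 1 and 3).
cells = [
    "raw = [1, 2, 3, 4]",
    "scratch = max(raw) * 10",
    "evens = raw[1::2]",
    "print(scratch)",
    "total = sum(evens)",
]
print(gather(cells, "total"))
```

For `total`, the dead-end cells are left out, which is the kind of cleanup that is tedious to do by hand after a long exploratory session.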

SLIDE 23

More tools

  • Papermill: a tool for parameterizing and executing Jupyter notebooks. It can store output notebooks in cloud storage.

  • nteract: an open-source, desktop-based interactive computing application.

  • NbExtensions: a collection of unofficial extensions for Jupyter Notebook. Some of the provided extensions convert Python 2 code to Python 3, push to a GitHub gist, format code automatically, etc.

SLIDE 24

Statistics from GitHub on notebook usage

Analysis [14] of publicly available notebooks from GitHub, 2017 & 2019

SLIDE 25

Conclusion

Great for data scientists: quick data analysis and fast iterations

Questionable software engineering practice when it comes to maintainability, reliability & shipping to production

A number of external tools are available that try to address the shortcomings

If discipline is maintained, they are an effective toolbox

SLIDE 26

THANK YOU EVERYONE

SLIDE 27

Discussion

What do you think: are notebooks suitable for production?

SLIDE 28

Discussion

Pro or con computational notebooks?

SLIDE 29

References

[1] https://netflixtechblog.com/notebook-innovation-591ee3221233
[2] https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6
[3] https://en.wikipedia.org/wiki/Notebook_interface
[4] https://www-cs-faculty.stanford.edu/~knuth/lp.html
[5] https://distill.pub/2020/bayesian-optimization/
[6] https://distill.pub/2020/grand-tour/
[7] https://web.eecs.utk.edu/~azh/pubs/Chattopadhyay2020CHI_NotebookPainpoints.pdf
[8] https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI
[9] https://github.com/jupyter/nbdime
[10] https://github.com/microsoft/gather
[11] Google Image Search + Company Name
[12] Rule, A., Tabard, A., & Hollan, J. D. (2018). Exploration and Explanation in Computational Notebooks.
[13] https://youtu.be/MaDXqDUG5dk?t=964
[14] https://arxiv.org/pdf/1912.09536v1.pdf

SLIDE 30

License

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/