Bridging the Gap: From Data Science to Production (Florian Wilhelm)



slide-1
SLIDE 1

Bridging the Gap: From Data Science to Production

Florian Wilhelm EuroPython 2018 @ Edinburgh, 2018-07-25

slide-2
SLIDE 2

Dr. Florian Wilhelm

Principal Data Scientist @ inovex

Special Interests

  • Mathematical Modelling
  • Recommendation Systems
  • Data Science in Production
  • Python Data Stack

@FlorianWilhelm · FlorianWilhelm · florianwilhelm.info

slide-3
SLIDE 3

IT-project house for digital transformation:

  • Agile Development & Management
  • Web · UI/UX · Replatforming · Microservices
  • Mobile · Apps · Smart Devices · Robotics
  • Big Data & Business Intelligence Platforms
  • Data Science · Data Products · Search · Deep Learning
  • Data Center Automation · DevOps · Cloud · Hosting
  • Trainings & Coachings

Using technology to inspire our clients. And ourselves.

inovex offices in Karlsruhe · Cologne · Munich · Pforzheim · Hamburg · Stuttgart. www.inovex.de

slide-4
SLIDE 4

Agenda

Many facets

Data Science to Production

Organisation · Quality Assurance · Deployment · Use-Case · Languages

slide-5
SLIDE 5

Use-Case: High Level Perspective

What does your model pipeline look like?

Data → Model f(...) → Results

slide-6
SLIDE 6

Use-Case: High Level Perspective

What is your Data Source?

Data Variants:

  • Database (PostgreSQL, C*)
  • Distributed Filesystem (HDFS)
  • Stream (Kafka)
  • ...

How is your data accessed? What are the frequency and recency requirements? Batch, Near-Realtime, Realtime, Stream?


slide-7
SLIDE 7

Use-Case: High Level Perspective

What is a model?

A model includes:

  • Preprocessing (cleansing, imputation, scaling)
  • Construction of derived features (EMAs)
  • Machine Learning Algorithm (Random Forest, ANN)
  • ...

Is the input of your model raw data or pregenerated features? Does your model have a state?
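The question of state can be made concrete with a derived feature such as an exponential moving average (EMA). The following is a minimal sketch, not code from the talk; the class name `EmaFeature` and the smoothing factor `alpha` are illustrative assumptions.

```python
class EmaFeature:
    """Exponential moving average computed incrementally (hypothetical helper)."""

    def __init__(self, alpha: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # state carried over between calls

    def update(self, x: float) -> float:
        """Fold one new raw observation into the running average."""
        if self.value is None:
            self.value = x  # the first observation initialises the average
        else:
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


ema = EmaFeature(alpha=0.5)
features = [ema.update(x) for x in [10.0, 20.0, 30.0]]
print(features)
```

Because each EMA value depends on all earlier inputs, a model that consumes raw data has to persist this state (or replay history) in production, whereas a model that consumes pregenerated features moves that responsibility upstream.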


slide-8
SLIDE 8

Use-Case: High Level Perspective

How is your result stored?

Results Variants:

  • Database (PostgreSQL, C*)
  • Distributed Filesystem (HDFS)
  • Stream (Kafka)
  • On demand (REST API)
  • ...

What are the frequency and recency requirements? Batch, Near-Realtime, Realtime, Stream?


slide-9
SLIDE 9

Use-Case: High Level Perspective

Our challenge

Diagram: Data → Interface → Model → Interface → Results, with the model deployed into production.

slide-10
SLIDE 10

Use-Case Evaluation

Characteristics of a Data Use-case

  • Delivery: WebService, Stream, Database
  • Problem Class: Classification, Regression, Recommendation, Explainability?
  • Volume & Velocity: 10 GB weekly, 1 GB daily, 10k events/s
  • Inference / Prediction: Batch, Near-Realtime, Realtime, Stream
  • Technical Conditions: Java-Stack + Python, On-Premise, AWS Cloud

Note down your specific requirements before thinking about an architecture. There is no one size fits all!


slide-11
SLIDE 11

Use-Case: High Level Perspective

Right from the Start

  • State the requirements of your data use-case
  • Identify and check data sources
  • Define interfaces with other teams/departments
  • Test the whole data flow and infrastructure early on with a dummy model
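As an illustration of the last point, a dummy model can be as simple as a majority-class predictor that exposes the same interface the real model will have later. The sketch below assumes a `fit`/`predict` interface and uses plain lists as stand-ins for the data source and the result store; all names are hypothetical.

```python
from collections import Counter


class DummyModel:
    """Always predicts the most frequent training label (placeholder model)."""

    def fit(self, rows, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.majority_ for _ in rows]


def run_pipeline(model, source_rows, sink):
    """Read from the source, predict, and write (row id, prediction) to the sink."""
    predictions = model.predict(source_rows)
    for row, prediction in zip(source_rows, predictions):
        sink.append((row["id"], prediction))
    return len(predictions)


# Lists stand in for the real data source and result store.
training_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
training_labels = ["a", "b", "a"]
result_store = []

model = DummyModel().fit(training_rows, training_labels)
written = run_pipeline(model, [{"id": 4}, {"id": 5}], result_store)
print(written, result_store)
```

Once the data flow works end to end with this placeholder, the real model can replace `DummyModel` without touching the surrounding infrastructure.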


slide-12
SLIDE 12

Many facets

Data Science to Production

Organisation · Quality Assurance · Deployment · Use-Case · Languages

slide-13
SLIDE 13

It's an Iterative Process

Quality Assurance for smooth iterations

CRISP-DM, centred on the data:

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment

slide-14
SLIDE 14

http://clean-code-developer.de/

Clean Code

Clean code is code that is easy to understand and easy to change.

Resources:

  • Software Design Patterns
  • SOLID Principles
  • The Pragmatic Programmer
  • The Software Craftsman


https://www.linkedin.com/pulse/5-pro-tips-data-scientists-write-good-code-jason-byrne/

slide-15
SLIDE 15

https://huddle.eurostarsoftwaretesting.com/4-ways-automation-and-ci-are-changing-testing-and-development/

Continuous Integration

  • Version, package and manage your artefacts
  • Provide tests (unit, system, ...)
  • Automate as much as possible
  • Embrace processes
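A unit test for a preprocessing step is often the cheapest place to start. The following sketch uses the standard library's `unittest`; the function `scale_min_max` is a hypothetical example, not taken from the talk.

```python
import unittest


def scale_min_max(values):
    """Scale values linearly to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class ScaleMinMaxTest(unittest.TestCase):
    def test_range(self):
        self.assertEqual(scale_min_max([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(scale_min_max([3, 3]), [0.0, 0.0])

    def test_empty_input_raises(self):
        # min() of an empty sequence raises ValueError
        with self.assertRaises(ValueError):
            scale_min_max([])


# Run the tests programmatically; a CI job would typically invoke a test runner instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScaleMinMaxTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Edge cases such as constant or empty input are exactly the situations that tend to surface only in production if they are not tested beforehand.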


slide-16
SLIDE 16

Monitoring

KPI and Stats

  • KPIs (CTR, Conversions)
  • Number of requests
  • Timeouts, delays
  • Total number of predictions
  • Runtimes
  • ...

All monitoring needs to be linked to the currently running version of your model!
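One way to keep that link is to attach the model version to every monitoring record at the point where the prediction is made. This is only a sketch under assumptions: the version string, the record fields and the helper name `timed_prediction` are illustrative.

```python
import json
import time

MODEL_VERSION = "1.2.3"  # assumed version string, e.g. taken from the package metadata


def timed_prediction(model_fn, features, version=MODEL_VERSION):
    """Run one prediction and return it together with a monitoring record."""
    start = time.perf_counter()
    prediction = model_fn(features)
    record = {
        "model_version": version,  # ties every metric to the running model
        "runtime_s": time.perf_counter() - start,
        "prediction": prediction,
    }
    return prediction, record


# A trivial stand-in model: the mean of the features.
prediction, record = timed_prediction(lambda f: sum(f) / len(f), [1.0, 2.0, 3.0])
print(json.dumps(record))
```

With the version in every record, runtimes, request counts and KPIs can later be grouped per model version rather than per deployment date.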


slide-17
SLIDE 17

https://landing.google.com/sre/book/chapters/part3.html

Monitoring

Site Reliability Engineering: How Google Runs Production Systems (@Google)

slide-18
SLIDE 18

Lucas Javier Bernardi | Diagnosing Machine Learning Models: https://www.youtube.com/watch?v=ZD8LA3n6YvI

Monitoring

Model Stats

Monitor the results of your model's predictions.

Example: Response Distribution Analysis

  • a) working model
  • b) confused model
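A response distribution analysis can be approximated by binning the predicted scores and comparing the histogram with a reference taken from a known-good period. The sketch below is an assumption-laden illustration: it assumes scores in [0, 1], uses made-up score lists, and picks the total variation distance as the comparison measure, which is not necessarily what the referenced talk uses.

```python
def histogram(scores, bins=5):
    """Relative frequency of scores in [0, 1] per equally wide bin."""
    counts = [0] * bins
    for s in scores:
        index = min(int(s * bins), bins - 1)  # a score of 1.0 falls into the last bin
        counts[index] += 1
    return [c / len(scores) for c in counts]


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up scores: the reference separates classes, the current model hovers around 0.5.
reference = histogram([0.05, 0.1, 0.15, 0.9, 0.95, 0.85, 0.1, 0.9])
current = histogram([0.45, 0.5, 0.55, 0.5, 0.48, 0.52, 0.58, 0.42])

drift = total_variation(reference, current)
print(reference, current, drift)
```

An alert threshold on `drift` would have to be calibrated per use-case; the point is only that the shape of the response distribution is cheap to track.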


slide-19
SLIDE 19

A/B Tests

Feedback for your model

  • Always compare your "improved" model to the current baseline
  • Allows comparing two models not only in offline metrics but also in online metrics and KPIs
  • Also possible to adjust hyperparameters with online feedback, e.g. multi-armed bandit
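The multi-armed bandit idea can be sketched with an epsilon-greedy strategy, one of the simplest bandit algorithms. The arms may stand for models or hyperparameter settings; the click-through rates, arm names and seeds below are simulated assumptions for illustration only.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit: mostly exploit the best arm, sometimes explore."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.mean_reward = {arm: 0.0 for arm in self.arms}
        self._rng = random.Random(seed)

    def select(self):
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.arms)  # explore
        return max(self.arms, key=lambda arm: self.mean_reward[arm])  # exploit

    def update(self, arm, reward):
        """Incrementally update the mean reward observed for this arm."""
        self.counts[arm] += 1
        n = self.counts[arm]
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / n


# Simulated online feedback with assumed click-through rates per variant.
true_ctr = {"variant_a": 0.05, "variant_b": 0.10}
bandit = EpsilonGreedy(true_ctr, epsilon=0.1, seed=42)
environment = random.Random(7)

for _ in range(5000):
    arm = bandit.select()
    reward = 1.0 if environment.random() < true_ctr[arm] else 0.0
    bandit.update(arm, reward)

print(bandit.counts, bandit.mean_reward)
```

Compared with a fixed 50/50 split, a bandit shifts traffic towards the better-performing variant while the experiment is still running, at the cost of a less clean statistical comparison.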


slide-20
SLIDE 20

A/B Tests

Technical requirements

  • Versioning of your models to allow linking them to test groups
  • Deploying and serving several models at the same time independently (needed for fast rollback anyway!)
  • Tracking the results of a given model up to the point of facing the customer
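Linking model versions to test groups can be done with a deterministic hash of the user ID, so that a user always sees the same model and every logged result carries the version. This is a minimal sketch; the salt, the version strings and the helper names are assumptions.

```python
import hashlib


def assign_model_version(user_id, versions, salt="ab-test-1"):
    """Deterministically map a user to one of the deployed model versions."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return versions[int(digest, 16) % len(versions)]


def log_served_prediction(log, user_id, version, prediction):
    """Record which model version produced the result shown to the user."""
    log.append({"user_id": user_id, "model_version": version, "prediction": prediction})


versions = ["1.2.3", "1.3.0"]  # baseline and candidate, served side by side
served = []
for user_id in ["u1", "u2", "u3", "u1"]:
    version = assign_model_version(user_id, versions)
    log_served_prediction(served, user_id, version, prediction=0.5)

print(served)
```

Changing the salt starts a new experiment with a fresh assignment, and removing a version from the list is one way to roll back.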


slide-21
SLIDE 21

Many facets

Data Science to Production

Organisation · Quality Assurance · Deployment · Use-Case · Languages

slide-22
SLIDE 22

Sebastian Neubauer - There should be one obvious way to bring python into production https://www.youtube.com/watch?v=hnQKsxKjCUo

Organisation of Teams

Wall of Confusion

Developers

  • Code
  • Tests
  • Releases
  • Version Control
  • Continuous Integration
  • Features

Operations

  • Packaging
  • Deployment
  • Lifecycle
  • Configuration
  • Security
  • Monitoring

Release v1.2.3

slide-23
SLIDE 23

Different Cultures/Thinking

Wrong Approach!

  • Especially dangerous separation for data products/features
  • Speed and time to market are important, thus "not my job" thinking hurts
  • "I made a great model" vs. "We made a great data product"

slide-24
SLIDE 24

https://en.wikipedia.org/wiki/DevOps

Organisation of Teams

Overcoming the Wall of Confusion

Diagram: Continuous Delivery connects Dev and Ops.

slide-25
SLIDE 25

http://101.datascience.community/2016/11/28/data-scientists-data-engineers-software-engineers-the-difference-according-to-linkedin/

Heterogeneous Teams

How to bring Data Scientists into DevOps?

  • Pure teams of Data Scientists often struggle to get anything into production
  • As a minimum complement, SW and Data Engineers are needed (2-3 Engineers per Data Scientist)
  • Optionally a Data Product Manager as well as a UI/UX expert if necessary

slide-26
SLIDE 26

http://www.full-stackagile.com/2016/02/14/team-organisation-squads-chapters-tribes-and-guilds/

Organisation around Features

Responsibility with vertical teams

  • Fully autonomous teams
  • End-to-end responsibility for a feature
  • Works well with Agile Methods like Scrum
  • Faster delivery and less politics


slide-27
SLIDE 27

Many facets

Data Science to Production

Organisation · Quality Assurance · Deployment · Use-Case · Languages

slide-28
SLIDE 28

Programming Languages

The Two Language Problem

Industry

  • Java stack common, also Scala
  • Statically typed
  • Emphasis on robustness and edge cases
  • Industrial standards for deployment

Science

  • Often Python and R
  • Dynamically typed, since it is easier to get the job done
  • Emphasis on fancy methods and results
  • "Runs on my machine"

slide-29
SLIDE 29

Language Problem

Solution: Select one to rule them all!

  • Having a single language reduces the complexity of deployment
  • Implementation effort is needed, since one ecosystem is abandoned completely

slide-30
SLIDE 30

https://www.analyticsvidhya.com/blog/2017/09/machine-learning-models-as-apis-using-flask/

Language Problem

Solution: Python in production

  • Especially easy for batch prediction use-cases
  • If a web service is needed, Flask is a viable option
  • Scale horizontally during prediction and use a big metal node for training a model
  • Tap into the Hadoop world by using PySpark, PyHive etc.
  • Consider isolated containers using Docker
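For the web service case, a prediction endpoint in Flask can be very small. The sketch below is not from the talk: the route `/predict`, the payload shape, the version string and the stand-in `predict` function are all assumptions, and a production setup would additionally need a proper WSGI server.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features):
    """Stand-in for a real model: returns the mean of the numeric features."""
    return sum(features) / len(features)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    is_valid = (
        isinstance(features, list)
        and len(features) > 0
        and all(isinstance(x, (int, float)) for x in features)
    )
    if not is_valid:
        return jsonify(error="expected a non-empty list of numbers under 'features'"), 400
    # The version is returned so that clients and monitoring can attribute results.
    return jsonify(prediction=predict(features), model_version="0.1.0")


# Exercise the endpoint in-process with Flask's built-in test client.
with app.test_client() as client:
    response = client.post("/predict", json={"features": [1.0, 2.0, 3.0]})
    bad_response = client.post("/predict", json={"features": []})
    print(response.get_json(), bad_response.status_code)
```

Since such a service holds no state between requests, several instances can run behind a load balancer, which matches the advice to scale horizontally during prediction.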


slide-31
SLIDE 31

Language Problem

Solution: PoCs in Python/R, rewrite in Java for production

  • Lots of effort and slow
  • Iterations and new features are hard to implement
  • Reproducibility of bugs is cumbersome
  • Pro: Everyone gets what they want

Worst-case Scenario

slide-32
SLIDE 32

Language Problem

Solution: Exchangeable formats

  • Works great in theory
  • Limited functionality and flexibility
  • No guarantee that the same model description will be interpreted the same way by two different implementations
  • Preprocessing / feature generation not included

slide-33
SLIDE 33

Language Problem

Solution: Frameworks

  • Various language bindings allow developing in Python/R and running on the Java stack
  • Check whether the framework also covers feature generation
  • Ease of use at the cost of flexibility

slide-34
SLIDE 34

Two Language Problem

Three concepts of dealing with it

  • 1: Reimplement
  • 2: Frameworks
  • 3: Single Language

(One option is marked with an x on the slide.)

slide-35
SLIDE 35

Many facets

Data Science to Production

Organisation · Quality Assurance · Deployment · Use-Case · Languages

slide-36
SLIDE 36

Deployment

Deployment heavily depends on the chosen approach!

Still, some software engineering principles apply, such as Continuous Integration or even Continuous Delivery.

slide-37
SLIDE 37

Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems

Technical Debt in ML Pipelines

Deployment


slide-38
SLIDE 38

Deployment

General principles

  • Versioning & packaging, defined processes, quality management
  • Keep the development and production environment as similar as possible
  • Automation is a must, avoid human error!
  • Isolated and controllable environments are a great idea, e.g. Docker

slide-39
SLIDE 39

https://developers.google.com/machine-learning/rules-of-ml/

Google's Best Practices for ML Engineering

Best of Google's rules

  • Rule #1: Don't be afraid to launch a product without machine learning.
  • Rule #2: First, design and implement metrics.
  • Rule #4: Keep the first model simple and get the infrastructure right.
  • Rule #5: Test the infrastructure independently from the machine learning.
  • Rule #9: Detect problems before exporting models.
  • Rule #11: Give feature columns owners and documentation.
  • Rule #13: Choose a simple, observable and attributable metric for your first objective.
  • Rule #14: Starting with an interpretable model makes debugging easier.
  • Rule #16: Plan to launch and iterate.
  • Rule #24: Measure the delta between models.
  • Rule #27: Try to quantify observed undesirable behavior. "Measure first, optimize second."
  • Rule #32: Re-use code between your training pipeline and your serving pipeline whenever possible.

Most of the problems are engineering problems!
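Rule #32 can be illustrated with a single feature function that both the training and the serving path import. The following is a toy sketch under assumptions: the raw fields (`clicks`, `impressions`, `is_mobile`) and the "model" (a weight vector averaged from positive examples) are made up for illustration.

```python
def build_features(raw):
    """Single source of truth for feature generation (hypothetical raw fields)."""
    click_rate = raw["clicks"] / max(raw["impressions"], 1)  # avoid division by zero
    return [click_rate, float(raw["is_mobile"])]


def train(raw_rows, labels):
    """Toy training step: per-feature mean over the positive examples."""
    positives = [build_features(r) for r, y in zip(raw_rows, labels) if y == 1]
    return [sum(column) / len(column) for column in zip(*positives)]


def serve(weights, raw):
    """Toy scoring step: dot product of weights and the very same features."""
    return sum(w * x for w, x in zip(weights, build_features(raw)))


training_rows = [
    {"clicks": 1, "impressions": 10, "is_mobile": True},
    {"clicks": 0, "impressions": 5, "is_mobile": False},
]
weights = train(training_rows, labels=[1, 0])
score = serve(weights, {"clicks": 2, "impressions": 10, "is_mobile": True})
print(weights, score)
```

If `build_features` were reimplemented separately for serving, any small divergence (for example a different zero-impression rule) would silently skew predictions, which is the kind of engineering problem the rules warn about.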

slide-40
SLIDE 40

https://www.inovex.de/blog/data-science-in-production/

Example: Continuous Integration


devpi

slide-41
SLIDE 41

https://pyscaffold.org/

Example: Python Package/Distribution

PyScaffold

  • Easy and sane Python packaging
  • Proper versioning of every commit
  • Git integration, e.g. pre-commit
  • Declarative configuration with setup.cfg
  • Follows community standards
  • Many extensions available

$> pip install pyscaffold
$> putup my_project

slide-42
SLIDE 42

Key Learnings

Data Science to Production

Organisation · Quality Assurance · Deployment · Use-Case · Languages

  • Dependent on your use-case, no one size fits all!
  • Think early on about QA
  • DevOps culture & team responsibility
  • Choose a framework or single language to overcome the Two-Language Problem
  • Embrace processes & automation

Production is NOT an Afterthought!

slide-43
SLIDE 43

Thank you!

Florian Wilhelm
Principal Data Scientist
inovex GmbH
Schanzenstraße 6-20, Kupferhütte 1.13
51063 Cologne, Germany
florian.wilhelm@inovex.de