Machine Learning Track: Data Analytics, Machine Learning and HPC in today’s changing application environment



SLIDE 1

Intel HPC Developer Convention Salt Lake City 2016

Machine Learning Track

Franz J. Király

Data Analytics, Machine Learning and HPC in today’s changing application environment

SLIDE 2

An overview of data analytics

[Flow diagram: DATA → Exploration → Scientific Questions and Statistical Questions → Methods, Quantitative Modelling (Predictive/Inferential, Descriptive/Explanatory) → Statistical Programming (R, python) → The Scientific Method: Scientific and Statistical Validation → (practical) Knowledge]

SLIDE 3

Data analytics and data science in a broader context

[Pipeline: Raw data → Clean data → Data analytics (Data mining, Machine learning, Statistics, Modelling) → Knowledge]

Lots of problems and subtleties arise at the raw/clean data stages already:

  • Often, most of the manpower in a „data“ project needs to go here first before one can attempt reliable modelling.

At the knowledge stage: relevant findings and the underlying arguments need to be explained well and properly.

SLIDE 4

Big Data?

SLIDE 5

What „Big Data“ may mean in practice

[Chart: number of data samples (ca. 1.000 to 10.000.000.000) against number of features (ca. 100 to 10.000), marking strategies that stop working in reasonable time: manual exploratory data analysis; reading in all the data; kernel methods, OLS, L1/LASSO, random forests (around the same order); super-linear algorithms; linear algorithms]

Solution strategies: sub-sampling, on-line models, feature extraction, feature selection, large-scale strategies for super-linear algorithms, distributed computing.
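
One of these strategies is easy to sketch: an on-line model fits batch by batch instead of reading in all the data. A minimal sketch in python using scikit-learn's partial_fit API; the streamed mini-batches are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.arange(10.0)

# On-line model: update the fit batch by batch instead of
# reading in all the data at once.
model = SGDRegressor(random_state=0)
for _ in range(100):                      # pretend each batch streams from disk
    X = rng.normal(size=(1000, 10))       # hypothetical mini-batch
    y = X @ true_coef + rng.normal(size=1000)
    model.partial_fit(X, y)               # incremental update of the linear model

print(model.coef_.round(1))               # should approach true_coef
```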

SLIDE 6

Large-scale motifs in data science = where high-performance computing is helpful/impactful

„Big models“ = the „classic“, beloved by everyone
Not necessarily a lot of data, but computationally intensive models.
Classical example: finite elements and other numerical models
New fancy example: large neural networks aka „deep learning“
Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes

„Big data“ = what it says, a lot of data (ca 1 million samples or more)
The computational challenge arises from processing all of the data.
Example: histogram or linear regression with huge amounts of data
Common HPC motif: divide/conquer training/fitting of model, e.g. batchwise/epoch fitting

Model validation and model selection = this talk‘s focus
Answers the question: which model is best for your data?
Demanding even for simple models and small amounts of data!
Example: is deep learning better than logistic regression, or guessing?

SLIDE 7

Meta-modelling: stylized case studies

Customer: Hospital specializing in treatment of patients with a certain disease.
Patients with this disease are at-risk to experience an adverse event (e.g. death).
Scientific question: depending on patient characteristics, predict the event risk.
Data set: complete clinical records of 1.000 patients, including the event if it occurred.

Customer: Retailer who wants to accurately model the behaviour of customers.
Customers can buy (or not buy) any of a number of products, or churn.
Scientific question: predict future customer behaviour given past behaviour.
Data set: complete customer and purchase records of 100.000 customers.

Customer: Manufacturer who wishes to find the best parameter setting for machines.
Parameters influence the amount/quality of product (or whether the machine breaks).
Scientific question: find parameter settings which optimize the above.
Data set: outcomes for 10.000 parameter settings on those machines.

Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the „real world“.
Not of interest: which algorithm/strategy, out of many, exactly solves the task.

SLIDE 8

Model validation and model selection = data-centric and data-dependent modelling

Machine learning provides algorithms & theory for meta-modelling, and powerful algorithms motivated by meta-modelling optimality. Meta-modelling is a scientific necessity implied by the scientific method and the following:

  • 1. There is no model that is good for all data.
  (otherwise the concept of a model would be unnecessary)

  • 2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one.
  (any such belief is not empirically justified hence pseudoscientific)

  • 3. No model can be trusted unless its validity has been verified by a model-independent argument.
  (otherwise the justification of validity is circular hence faulty)

SLIDE 9

Machine Learning and Meta-Modelling in a Nutshell

SLIDE 10

Leitmotifs of Machine Learning

Machine learning draws from the intersection of engineering, statistics and computer science:

  • Engineering & statistics idea: statistical models are objects in their own right, „learning machines“ / modelling strategies.
  • Engineering & computer science idea: any abstract algorithm, possibly non-explicit, can (and should) be a modelling strategy/learning machine („computational learning“).
  • Computer science & statistics idea: the future performance of an algorithm/learning machine can be estimated („model validation“, „model selection“).

SLIDE 11

Problem types in Machine Learning

Supervised Learning: some data is labelled by an expert/oracle.
Task: predict the label from the covariates.
Statistical models are usually discriminative.
Examples: regression, classification

SLIDE 12

Problem types in Machine Learning

Unsupervised Learning: the training data is not pre-labelled.
Task: find „structure“ or „pattern“ in the data.
Statistical models are usually generative.
Examples: clustering, dimension reduction
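
Both problem types surface directly in the toolbox APIs discussed later. A minimal sketch in python with scikit-learn; the two-dimensional synthetic data and the choice of LogisticRegression/KMeans are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # labels from an „expert/oracle“

# Supervised: predict the label from the covariates (classification).
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: find structure without any labels (clustering).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```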

SLIDE 13

Advanced learning tasks

Complications in the labelling:
  • Semi-supervised learning: some training data are labelled, some are not
  • Anomaly detection: all or most data are „positive examples“; the task is to flag „test negatives“
  • Reinforcement learning: data are not directly labelled, only by indirect gain/loss

Complications through correlated data and/or time:
  • On-line learning: the data is revealed with time, models need to update
  • Forecasting: each data point has a time stamp; predict the temporal future
  • Transfer learning: the data comes in dissimilar batches; train and test may be distinct

SLIDE 14
What is a Learning Machine?

… an algorithm that solves, e.g., the previous tasks.
Examples: generalized linear model, linear regression, support vector machine, neural networks (= „deep learning“), random forests, gradient boosting, …

Illustration: supervised learning machine

[Diagram: observations („training data“) → model fitting (“learning”) → fitted model; new data → prediction → predictions; model tuning parameters (e.g., to base decisions on) feed into the fitting step]

SLIDE 15

Example: Linear Regression

[Diagram: observations („training data“) → model fitting (“learning”) → fitted linear model; new data → prediction → predictions]

Example of a tuning parameter: fit intercept or not?
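
As a sketch of the learning-machine view: linear regression behind scikit-learn's fit/predict interface. fit_intercept is the real scikit-learn name for the „fit intercept or not?“ tuning parameter; the synthetic data is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))            # observations („training data“)
y_train = X_train @ [1.0, 2.0, 3.0] + 0.5 + rng.normal(scale=0.1, size=100)

# Tuning parameter: fit intercept or not?
machine = LinearRegression(fit_intercept=True)
machine.fit(X_train, y_train)                  # model fitting (“learning”)

X_new = rng.normal(size=(5, 3))                # new data
print(machine.predict(X_new))                  # predictions
print(machine.intercept_, machine.coef_)       # fitted model
```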

SLIDE 16

Model validation: does the model make sense?

[Diagram: „the truth“ generates „training data“ and „test data“; model learning (e.g. regression, GLM, advanced methods; the prediction strategy / learning machine) yields a learnt model; prediction on the test data yields predictions, which are compared & quantified against the „test labels“, e.g. by evaluating the regression model; „in-sample“ vs „out-of-sample“/„hold-out“]

Predictive models need to be validated on unseen data!
Which means that the part of the data used for testing has not been seen by the algorithm before!
(note: this includes the case where machine = linear regression, deep learning, etc.)

The only (general) way to test goodness of prediction is actually observing prediction!
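
A minimal out-of-sample validation sketch in python; train_test_split and mean_squared_error are real scikit-learn utilities, and the synthetic data stands in for „the truth“.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                  # „the truth“, synthetically
y = X @ [1.0, 2.0, 3.0] + rng.normal(scale=0.5, size=500)

# Hold out test data the learning machine never sees while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# In-sample error is optimistic; out-of-sample error estimates
# the actual goodness of prediction.
rmse_in = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
rmse_out = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"in-sample RMSE {rmse_in:.2f}, out-of-sample RMSE {rmse_out:.2f}")
```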

SLIDE 17

„Re-sampling“: multiple algorithms are compared on multiple data splits/sub-datasets

[Diagram: all data is split into training data 1/2/3 with test data 1/2/3; Predictors 1, 2 and 3 are each fitted on every training set and evaluated on the matching test set, giving errors 1, 2, 3 per split; the errors are aggregated for comparison]

Types of re-sampling (how to obtain training/test splits), with pros/cons:

k-fold cross-validation
  • 1. divide the data in k (almost) equal parts
  • 2. obtain k train/test splits via: each part is test data exactly once, the rest of the data is the training set
  Often: k=5, a good compromise between runtime and accuracy when k is small compared to the data size.

leave-one-out
  = [number of data points]-fold c.v.: very accurate, high run-time

repeated sub-sampling (parameters: training/test size, # of repetitions)
  • 1. obtain a random sub-sample of training/test data of specified sizes (train/test need not cover all data)
  • 2. repeat 1. a desired number of times
  Can be arbitrarily quick and arbitrarily inaccurate (depending on parameter choice); can be combined with k-fold.

State-of-art principle in model validation, model comparison and meta-modelling.
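
Both re-sampling schemes are one-liners in scikit-learn; KFold, ShuffleSplit and cross_val_score are real APIs, and the synthetic data mirrors the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ [1.0, 2.0, 3.0] + rng.normal(scale=0.5, size=500)

model = LinearRegression()

# k-fold cross-validation, k=5: each point is test data exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# Repeated sub-sampling: random train/test splits of specified sizes,
# which need not cover all of the data.
subsample = ShuffleSplit(n_splits=10, train_size=0.5, test_size=0.2, random_state=0)
subsample_scores = cross_val_score(model, X, y, cv=subsample)

print(kfold_scores.mean(), subsample_scores.mean())
```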

SLIDE 18

Quantitative model comparison

A „benchmarking experiment“ results in a table like this:

model | RMSE       | MAE
?     | 15.3 ± 1.2 | 12.3 ± 1.1
?     |  9.5 ± 0.9 |  7.3 ± 0.8
?     | 13.6 ± 0.7 | 11.4 ± 0.9
?     | 20.1 ± 1.4 | 18.1 ± 1.7

Use confidence regions (or paired tests) to compare models to each other:
A is better than B / B is better than A / A and B are equally good.

An uninformed model (stupid model/random guess) needs to be included:
  • Otherwise the statement „is better than an uninformed guess“ cannot be made.

„useful model“ = (significantly) better than the uninformed baseline
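
A sketch of such a benchmarking experiment with an uninformed baseline included; DummyRegressor and the "neg_root_mean_squared_error" scorer are real scikit-learn APIs, while the competing models and the data are illustrative choices.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ [1.0, 2.0, 3.0] + rng.normal(scale=0.5, size=500)

models = {
    "uninformed (mean)": DummyRegressor(strategy="mean"),   # the baseline
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    # report mean ± std over the folds, as in the benchmarking table above
    print(f"{name:20s} RMSE {rmse.mean():5.2f} ± {rmse.std():.2f}")
```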

SLIDE 19

Meta-model: automated parameter tuning

Re-sampling is used to determine the best parameter setting:

[Diagram: the training data is re-sampled into inner tuning train/test splits; Parameters 1, 2 and 3 are compared via a benchmarking table of model goodness, as on the previous slide; the model with the best parameters is then fitted to the whole training data]

For validation, new unseen data needs to be used: all data is split into training data and test data; within the training data, the re-sampling yields „tuning train“ and „tuning test“ splits, while the „real“ test set stays outside. The model with the best parameter is fitted to all of the training data, then used to predict & quantify on the test data.

Important caveat: the „inner“ training/test splits need to be part of any „outer“ training set.
  • Otherwise the validation is not out-of-sample!

Multi-fold schemes are nested: „splits within splits“.

The meta-model has „new“ tuning parameters of its own: which measure of predictive goodness, and which inner re-sampling scheme. Methods are usually less sensitive to these.
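
A sketch of this nesting in scikit-learn: GridSearchCV runs the inner tuning splits, the outer cross_val_score splits keep the validation out-of-sample; Ridge and its alpha grid are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ [1.0, 2.0, 3.0] + rng.normal(scale=0.5, size=500)

# Inner loop: tuning train/test splits select the best parameter.
tuned_model = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)

# Outer loop: the „real“ test folds never take part in the inner
# tuning, so the validation stays out-of-sample („splits within splits“).
outer = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(tuned_model, X, y, cv=outer)
print(scores.mean(), scores.std())
```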

SLIDE 20

Meta-Strategies in ML

„Model tuning“: a model with tuning parameters; the best tuning parameters are determined using a data-driven tuning algorithm.

„Ensemble learning“: a number of (possibly „weak“) models A, B, C, D are combined into a „strong“ ensemble model.
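
A sketch of the ensembling motif using scikit-learn's VotingRegressor, which averages the predictions of its base models; the choice of base models and the data are illustrative.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ [1.0, 2.0, 3.0] + rng.normal(scale=0.5, size=500)

# A number of (possibly „weak“) base models ...
base_models = [
    ("A", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("B", LinearRegression()),
]

# ... combined into a „strong“ ensemble model by averaging predictions.
ensemble = VotingRegressor(base_models)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```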

SLIDE 21

Object dependencies in the ML workflow

[Diagram: one interesting dataset (all data) is re-sampled into multiple train/test splits; on each of which the strategies 1 … M are compared]

„Typical numbers“:
  • 5-10 outer train/test splits
  • M = 5-20 strategies compared, most of which are parameter-tuned by the same principle
  • 10-10.000 parameter combinations, over 3-5 nested splits
  • Ensembles: further nesting, 10-1.000 base learners
  • N = 100-100.000 data points („small data“), one run usually O(N²) or O(N³)

Runtime = 10 x 10 x 5 x 1.000 (x 100) x one run on N samples

SLIDE 22

Machine Learning Toolboxes

SLIDE 23

An incomplete list of influential toolboxes

[Comparison table: toolboxes in R (e.g. caret), python (scikit-learn), Java, and multi-interface frameworks, compared by language, GUI, 3rd party wrappers, breadth of common models (ranging from „few, mostly classifiers“ to „mostly kernels“), modular API (e.g., methods), model tuning & meta-methods, and model validation and comparison]

scikit-learn is perhaps the most widely used ML toolbox.

SLIDE 24

The object-oriented ML Toolbox API

Learning machines as found in the R/mlr or scikit-learn packages.
Leading principles: encapsulation, modularization, object orientation, modular structure.

Example of the modular structure: a linear regression „learning machine“ object exposes fit(traindata) and predict(testdata), plus metadata & model info.

Abstraction models objects with a unified API:

Concept             | Public interface                    | in R/mlr         | in sklearn
Learning machine    | fitting, predicting, set parameters | Learner          | estimator
Re-sampling schemes | sample, apply & get results         | ResampleDesc     | splitter classes in model_selection
Evaluation metrics  | compute from results, tabulate      | Measure          | metrics classes in metrics
Learning task       | benchmark, list strategies/measures | Task             | implicit, not encapsulated
Meta-modelling      | wrapping machines by strategy       | various wrappers | various wrappers, fused classes, Pipeline
(tuning, ensembling, pipelining)
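
A sketch of the unified estimator interface in scikit-learn: a Pipeline meta-model is itself an estimator with the same fit/predict/get_params/set_params methods as any learning machine; the data and the parameter value are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ [1.0, 2.0, 3.0] + rng.normal(scale=0.5, size=100)

# Meta-modelling by wrapping machines: the pipeline is itself an estimator.
machine = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

machine.set_params(model__alpha=0.5)    # unified parameter interface
machine.fit(X, y)                       # unified fitting interface
print(machine.predict(X[:3]))           # unified prediction interface
print(machine.get_params()["model__alpha"])
```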

SLIDE 25

HPC for benchmarking/validation today

[Diagram: the nested workflow from before: all data → 5-10 outer train/test splits → M = 5-20 strategies → 10-10.000 parameter combinations over 3-5 nested splits → 10-1.000 base learners; N = 100-100.000 data points („small data“)]

At one selected level of the nesting (one of 1-4, mutually exclusive), the work is distributed to clusters/cores:
  • scikit-learn: joblib
  • mlr: parallelMap

Plus algorithm-specific HPC interfaces, e.g. for deep learning.
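
In scikit-learn the joblib-based distribution surfaces as the n_jobs argument; choosing where to pass it corresponds to selecting the parallelization level. A minimal sketch with illustrative data and models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.arange(5.0) + rng.normal(size=500)

tuned = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=3,
    n_jobs=1,       # keep the inner (tuning) level sequential ...
)

# ... and distribute the outer re-sampling level over all cores via joblib.
scores = cross_val_score(tuned, X, y, cv=5, n_jobs=-1)
print(scores.mean())
```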

SLIDE 26

HPC support tomorrow?

Layer 1: full graph of dependencies: re-samples, algorithms, parameters, … (1 … M)
Layer 2: scheduler for algorithms and meta-algorithms; data/task pipeline for DATA (e.g. Hadoop); combining (?) MapReduce, DAAL, dask, joblib -> TBB? (image source: continuum analytics)
Layer 3: optimized primitives: linear systems, convex optimization, stoch. gradient descent; e.g. MKL, CUDA, BLAS (image source: Intel math kernel library)
Layer 4: hardware API, e.g. distributed, multi-core, multi-type/heterogeneous

SLIDE 27

Challenges in ML APIs and HPC

Surprisingly few resources have been invested in ML toolboxes; the most advanced toolboxes are currently open-source & academic.

Features that would be desirable to the practitioner, but are not available without mid-scale software development:

  • Integration of (a) data management, (b) exploration and (c) modelling:
    data heterogeneity, multiple datasets, time series, spatial features, images etc.; especially challenging: integration in large-scale scenarios
  • Full HPC integration on a granular level for distributed ML benchmarking:
    e.g. MapReduce for divide/conquer over data, model parts, and models; making full use of parallelism for nesting and computational redundancies; a complete HPC architecture for the whole model benchmarking workflow
  • Non-standard modelling tasks, structured data (incl. time series):
    forecasting, on-line learning, anomaly detection, change point detection; meta-modelling and re-sampling for these is an order of magnitude more costly