Scikit-learn's Transformers - v0.20 and beyond - Tom Dupré la Tour - PyParis 14/11/2018



slide-1
SLIDE 1

Scikit-learn's Transformers

  • v0.20 and beyond

Tom Dupré la Tour - PyParis 14/11/2018

1 / 30

slide-2
SLIDE 2

Scikit-learn's Transformers

2 / 30

slide-3
SLIDE 3

Transformer

from sklearn.preprocessing import StandardScaler

model = StandardScaler()
X_train_2 = model.fit(X_train).transform(X_train)
X_test_2 = model.transform(X_test)

3 / 30

slide-4
SLIDE 4

Pipeline

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

model = make_pipeline(StandardScaler(), SGDClassifier(loss='log'))
y_pred = model.fit(X_train, y_train).predict(X_test)

Advantages:

  • Clear overview of the pipeline
  • Correct cross-validation
  • Easy parameter grid-search
  • Caching intermediate results

4 / 30
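The grid-search advantage can be sketched with GridSearchCV (a generic illustration, not from the deck): make_pipeline derives step names from the lowercased class names, so nested hyper-parameters are addressed with the `<step>__<param>` convention.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# make_pipeline names each step after its lowercased class, so pipeline
# hyper-parameters are addressed as '<step>__<param>' in the grid.
model = make_pipeline(StandardScaler(),
                      SGDClassifier(random_state=0))
grid = GridSearchCV(model,
                    param_grid={'sgdclassifier__alpha': [1e-4, 1e-2]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

The scaler is re-fitted inside each cross-validation split, which is exactly the "correct cross-validation" point above.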

slide-6
SLIDE 6

Transformers before v0.20

  • Dimensionality reduction: PCA, KernelPCA, FastICA, NMF, etc.
  • Scalers: StandardScaler, MaxAbsScaler, etc.
  • Encoders: OneHotEncoder, LabelEncoder, MultiLabelBinarizer
  • Expansions: PolynomialFeatures
  • Imputation: Imputer
  • Custom 1D transforms: FunctionTransformer
  • Quantiles: QuantileTransformer (v0.19)
  • ... and also: Binarizer, KernelCenterer, RBFSampler, ...

5 / 30

slide-7
SLIDE 7

New in v0.20

6 / 30

slide-8
SLIDE 8

v0.20: Easier data science pipeline

Many new Transformers:

  • ColumnTransformer (new)
  • PowerTransformer (new)
  • KBinsDiscretizer (new)
  • MissingIndicator (new)
  • SimpleImputer (new)
  • OrdinalEncoder (new)
  • TransformedTargetRegressor (new)

Transformers with significant improvements:

  • OneHotEncoder handles categorical features.
  • MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler, PowerTransformer, and QuantileTransformer handle missing values (NaN).

7 / 30
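A minimal sketch of the NaN behaviour (assuming scikit-learn ≥ 0.20): the scalers ignore NaN when fitting their statistics and propagate it through transform, rather than raising.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [np.nan], [3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # NaN is ignored when computing mean/std

print(scaler.mean_)  # computed from the non-NaN entries only
print(X_scaled)      # NaN is propagated, not imputed
```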

slide-9
SLIDE 9

v0.20: Easier data science pipeline

  • SimpleImputer (new) handles categorical features.
  • MissingIndicator (new)
  • OneHotEncoder handles categorical features.
  • OrdinalEncoder (new)
  • MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler, PowerTransformer, and QuantileTransformer handle missing values (NaN).

8 / 30
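Categorical imputation plus encoding can be sketched as follows (an illustration, not from the deck): SimpleImputer works directly on string columns, and OrdinalEncoder maps each category to an integer code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# A string (categorical) column with one missing entry.
X = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)

# 'most_frequent' works on string data and fills with the mode ('red').
imputer = SimpleImputer(strategy='most_frequent')
X_filled = imputer.fit_transform(X)

# OrdinalEncoder assigns integer codes to the sorted categories.
encoder = OrdinalEncoder()
X_codes = encoder.fit_transform(X_filled)
print(X_filled.ravel())
print(X_codes.ravel())
```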

slide-12
SLIDE 12

ColumnTransformer (new)

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

numeric = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())

categorical = make_pipeline(
    # new: 'constant' strategy, handles categorical features
    SimpleImputer(strategy='constant', fill_value='missing'),
    # new: handles categorical features
    OneHotEncoder())

preprocessing = make_column_transformer(
    (['age', 'fare'], numeric),        # continuous features
    (['sex', 'pclass'], categorical),  # categorical features
    remainder='drop')

model = make_pipeline(preprocessing, LogisticRegression())

9 / 30

slide-13
SLIDE 13

PowerTransformer (new)

10 / 30
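The slide above showed PowerTransformer as a figure; a quick sketch of what it does (an illustration, not from the deck): it applies a Box-Cox or Yeo-Johnson transform to map skewed data towards a Gaussian.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # strictly positive, heavily skewed

# Box-Cox requires positive data; Yeo-Johnson (the default) also accepts
# zeros and negatives. standardize=True (default) rescales the result to
# zero mean and unit variance.
pt = PowerTransformer(method='box-cox')
X_gauss = pt.fit_transform(X)
print(X_gauss.mean(), X_gauss.std())
```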

slide-14
SLIDE 14

KBinsDiscretizer (new)

11 / 30

slide-15
SLIDE 15

KBinsDiscretizer (new)

12 / 30
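The KBinsDiscretizer slides above were figures; a minimal usage sketch (an illustration, not from the deck): it bins a continuous feature into intervals and encodes the bin index.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [10.0]])

# Bin the feature into 2 intervals at the median; 'ordinal' returns the
# bin index, while 'onehot' would return a sparse indicator matrix.
disc = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
X_binned = disc.fit_transform(X)
print(X_binned.ravel())
```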

slide-16
SLIDE 16

TransformedTargetRegressor (new)

13 / 30

slide-17
SLIDE 17

TransformedTargetRegressor (new)

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(LinearRegression(),
                                   func=np.log,
                                   inverse_func=np.exp)
y_pred = model.fit(X_train, y_train).predict(X_test)

14 / 30

slide-18
SLIDE 18

Glossary of Common Terms and API Elements (new)

https://scikit-learn.org/stable/glossary.html

15 / 30

slide-19
SLIDE 19

Joblib backend system (new)

  • New pluggable backend system for Joblib
  • New default backend for single-host multiprocessing (loky)
    – Does not break third-party threading runtimes
  • Ability to delegate to dask/distributed for cluster computing

16 / 30
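A minimal sketch of joblib with the loky backend (an illustration, not from the deck; loky is the default process-based backend since joblib 0.12, selecting it explicitly just makes the choice visible):

```python
from math import sqrt
from joblib import Parallel, delayed

# Run 5 tiny tasks across 2 worker processes with the loky backend.
results = Parallel(n_jobs=2, backend='loky')(
    delayed(sqrt)(i ** 2) for i in range(5))
print(results)
```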

slide-20
SLIDE 20

Nearest Neighbors

17 / 30

slide-21
SLIDE 21

Nearest Neighbors Classifier

17 / 30

slide-22
SLIDE 22

Nearest Neighbors in scikit-learn

Used in:

  • KNeighborsClassifier, RadiusNeighborsClassifier
  • KNeighborsRegressor, RadiusNeighborsRegressor, LocalOutlierFactor
  • TSNE, Isomap, SpectralEmbedding
  • DBSCAN, SpectralClustering

18 / 30

slide-23
SLIDE 23

Nearest Neighbors

Computed with brute force, KDTree, or BallTree, ...
... or with approximate methods (random projections):

  • annoy (by Spotify)
  • faiss (by Facebook research)
  • nmslib
  • ...

19 / 30
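An exact tree-based query can be sketched as follows (an illustration, not from the deck; brute force and BallTree give the same answer with different speed trade-offs):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(100, 3)

# Exact 5-nearest-neighbour query on a KDTree.
tree = KDTree(X, leaf_size=30)
dist, ind = tree.query(X[:1], k=5)  # returns (distances, indices)
print(ind[0])  # the query point is its own nearest neighbour
```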

slide-25
SLIDE 25

Nearest Neighbors benchmark

https://github.com/erikbern/ann-benchmarks

20 / 30

slide-26
SLIDE 26

Nearest Neighbors

  • scikit-learn API

21 / 30

slide-27
SLIDE 27

Trees and wrapping estimator

KDTree and BallTree:

Not proper scikit-learn estimators query, query_radius, which return (indices, distances)

22 / 30

slide-28
SLIDE 28

Trees and wrapping estimator

KDTree and BallTree:

Not proper scikit-learn estimators query, query_radius, which return (indices, distances)

NearestNeighbors:

scikit-learn estimator, but without transform or predict kneighbors, radius_neighbors, which return (distances, indices)

22 / 30
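The inconsistent return orders can be seen directly (a small illustration, not from the deck), comparing a tree's query_radius with the estimator's radius_neighbors:

```python
import numpy as np
from sklearn.neighbors import BallTree, NearestNeighbors

X = np.array([[0.0], [1.0], [2.0]])

# BallTree.query_radius returns (indices, distances) ...
tree = BallTree(X)
ind, dist = tree.query_radius(X[:1], r=1.5, return_distance=True)

# ... while NearestNeighbors.radius_neighbors returns (distances, indices).
nn = NearestNeighbors(radius=1.5).fit(X)
nn_dist, nn_ind = nn.radius_neighbors(X[:1])

print(ind[0], dist[0])
print(nn_ind[0], nn_dist[0])
```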

slide-29
SLIDE 29

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

23 / 30

slide-30
SLIDE 30

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor

Inherit fit and kneighbors (weird) from NearestNeighbors

23 / 30

slide-31
SLIDE 31

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor

Inherit fit and kneighbors (weird) from NearestNeighbors

TSNE, DBSCAN, Isomap, LocallyLinearEmbedding:

Create an instance of NearestNeighbors

23 / 30

slide-32
SLIDE 32

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor

Inherit fit and kneighbors (weird) from NearestNeighbors

TSNE, DBSCAN, Isomap, LocallyLinearEmbedding:

Create an instance of NearestNeighbors

SpectralClustering, SpectralEmbedding:

Call kneighbors_graph, which creates an instance of NearestNeighbors

23 / 30
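kneighbors_graph, mentioned above, builds the neighbors relation as a sparse matrix (a small illustration, not from the deck):

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0], [1.0], [2.0], [10.0]])

# Sparse CSR matrix: one row per sample, one stored entry per neighbour
# (the sample itself is excluded by default), holding the distance.
graph = kneighbors_graph(X, n_neighbors=2, mode='distance')
print(graph.shape, graph.nnz)
```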

slide-33
SLIDE 33

Copy of NearestNeighbors parameters in each class

params = [algorithm, leaf_size, metric, p, metric_params, n_jobs]

# sklearn.neighbors
NearestNeighbors(n_neighbors, radius, *params)
KNeighborsClassifier(n_neighbors, *params)
KNeighborsRegressor(n_neighbors, *params)
RadiusNeighborsClassifier(radius, *params)
RadiusNeighborsRegressor(radius, *params)
LocalOutlierFactor(n_neighbors, *params)

# sklearn.manifold
TSNE(metric)
Isomap(n_neighbors, neighbors_algorithm, n_jobs)
LocallyLinearEmbedding(n_neighbors, neighbors_algorithm, n_jobs)
SpectralEmbedding(n_neighbors, n_jobs)

# sklearn.cluster
SpectralClustering(n_neighbors, n_jobs)
DBSCAN(eps, *params)

24 / 30

slide-34
SLIDE 34

Different handling of precomputed neighbors in X

Handle precomputed distance matrices:

  • TSNE, DBSCAN, SpectralEmbedding, SpectralClustering, LocalOutlierFactor, NearestNeighbors
  • KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor (not Isomap)

Handle precomputed sparse neighbors graphs:

  • DBSCAN, SpectralClustering

Handle objects inheriting NearestNeighbors:

  • LocalOutlierFactor, NearestNeighbors

Handle objects inheriting BallTree/KDTree:

  • LocalOutlierFactor, NearestNeighbors
  • KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor

25 / 30
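The "precomputed sparse neighbors graph" path can be sketched with DBSCAN (an illustration, not from the deck): the graph is built once and passed in place of X with metric='precomputed'.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 10.0])  # two blobs

# Precompute the sparse radius-neighbours graph once ...
graph = radius_neighbors_graph(X, radius=3.0, mode='distance')

# ... and let DBSCAN consume it instead of recomputing neighbours.
labels = DBSCAN(eps=3.0, min_samples=3,
                metric='precomputed').fit_predict(graph)
print(np.unique(labels))
```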

slide-38
SLIDE 38

Challenges

  • Consistent API, avoid copying all parameters
  • Changing the API? Difficult without breaking code
  • Use approximate nearest neighbors from other libraries

26 / 30

slide-39
SLIDE 39

Proposed solution

Precompute sparse graphs in a Transformer

[#10482]

27 / 30

slide-40
SLIDE 40

Precomputed sparse nearest neighbors graph

Steps:

  • 1. Make all classes accept precomputed sparse neighbors graph

28 / 30

slide-41
SLIDE 41

Precomputed sparse nearest neighbors graph

Steps:

  • 1. Make all classes accept precomputed sparse neighbors graph
  • 2. Pipeline: Add KNeighborsTransformer and

RadiusNeighborsTransformer

from sklearn.pipeline import make_pipeline from sklearn.neighbors import KNeighborsTransformer from sklearn.manifold import TSNE graph = KNeighborsTransformer(n_neighbors=n_neighbors, mode='distance', metric=metric) tsne = TSNE(metric='precomputed', method="barnes_hut") model_1 = make_pipeline(graph, tsne) model_2 = TSNE(metric=metric, method="barnes_hut")

28 / 30

slide-42
SLIDE 42

Precomputed sparse nearest neighbors graph

Improvements:

  • 1. All parameters are accessible in the transformer

29 / 30

slide-43
SLIDE 43

Precomputed sparse nearest neighbors graph

Improvements:

  • 1. All parameters are accessible in the transformer
  • 2. Caching properties of the pipeline (memory="path/to/cache")

29 / 30

slide-44
SLIDE 44

Precomputed sparse nearest neighbors graph

Improvements:

  • 1. All parameters are accessible in the transformer
  • 2. Caching properties of the pipeline (memory="path/to/cache")
  • 3. Allow custom nearest neighbors estimators

# Example: TSNE with AnnoyTransformer: 46.222 sec TSNE with KNeighborsTransformer: 79.842 sec TSNE with internal NearestNeighbors: 79.984 sec

29 / 30
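The caching improvement relies on the generic memory= parameter of Pipeline; a minimal sketch with ordinary transformers (an illustration, not the proposed KNeighborsTransformer):

```python
import tempfile
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(50, 5)

# memory=... caches fitted transformers on disk, so refitting the same
# pipeline (e.g. inside a grid-search) skips unchanged upstream steps.
cache_dir = tempfile.mkdtemp()
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      memory=cache_dir)
X_2d = model.fit_transform(X)
print(X_2d.shape)
```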

slide-45
SLIDE 45

Thank you for your attention!

tomdlt.github.io/decks/2018_pyparis @tomdlt10

30 / 30