Scikit-learn's Transformers - v0.20 and beyond - Tom Dupré la Tour - PyParis 14/11/2018



slide-1
SLIDE 1

Scikit-learn's Transformers

  • v0.20 and beyond

Tom Dupré la Tour - PyParis 14/11/2018

1 / 30

slide-2
SLIDE 2

Scikit-learn's Transformers

2 / 30

slide-3
SLIDE 3

Transformer

from sklearn.preprocessing import StandardScaler

model = StandardScaler()
X_train_2 = model.fit(X_train).transform(X_train)
X_test_2 = model.transform(X_test)

3 / 30

slide-4
SLIDE 4

Pipeline

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

model = make_pipeline(StandardScaler(), SGDClassifier(loss='log'))
y_pred = model.fit(X_train, y_train).predict(X_test)

Advantages:

  • Clear overview of the pipeline
  • Correct cross-validation
  • Easy parameter grid-search
  • Caching intermediate results

4 / 30
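The grid-search advantage can be sketched with GridSearchCV (a generic illustration, not from the deck): make_pipeline derives step names from the lowercased class names, so nested hyper-parameters are addressed with the `<step>__<param>` convention.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# make_pipeline names each step after its lowercased class, so pipeline
# hyper-parameters are addressed as '<step>__<param>' in the grid.
model = make_pipeline(StandardScaler(),
                      SGDClassifier(random_state=0))
grid = GridSearchCV(model,
                    param_grid={'sgdclassifier__alpha': [1e-4, 1e-2]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

The scaler is re-fitted inside each cross-validation split, which is exactly the "correct cross-validation" point above.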

slide-6
SLIDE 6

Transformers before v0.20

  • Dimensionality reduction: PCA, KernelPCA, FastICA, NMF, etc.
  • Scalers: StandardScaler, MaxAbsScaler, etc.
  • Encoders: OneHotEncoder, LabelEncoder, MultiLabelBinarizer
  • Expansions: PolynomialFeatures
  • Imputation: Imputer
  • Custom 1D transforms: FunctionTransformer
  • Quantiles: QuantileTransformer (v0.19)
  • ... and also: Binarizer, KernelCenterer, RBFSampler, ...

5 / 30

slide-7
SLIDE 7

New in v0.20

6 / 30

slide-8
SLIDE 8

v0.20: Easier data science pipeline

Many new Transformers:

  • ColumnTransformer (new)
  • PowerTransformer (new)
  • KBinsDiscretizer (new)
  • MissingIndicator (new)
  • SimpleImputer (new)
  • OrdinalEncoder (new)
  • TransformedTargetRegressor (new)

Transformers with significant improvements:

  • OneHotEncoder handles categorical features.
  • MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler, PowerTransformer, and QuantileTransformer handle missing values (NaN).

7 / 30
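A minimal sketch of the NaN behaviour (assuming scikit-learn ≥ 0.20): the scalers ignore NaN when fitting their statistics and propagate it through transform, rather than raising.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [np.nan], [3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # NaN is ignored when computing mean/std

print(scaler.mean_)  # computed from the non-NaN entries only
print(X_scaled)      # NaN is propagated, not imputed
```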

slide-9
SLIDE 9

v0.20: Easier data science pipeline

  • SimpleImputer (new) handles categorical features.
  • MissingIndicator (new)
  • OneHotEncoder handles categorical features.
  • OrdinalEncoder (new)
  • MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler, PowerTransformer, and QuantileTransformer handle missing values (NaN).

8 / 30
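Categorical imputation plus encoding can be sketched as follows (an illustration, not from the deck): SimpleImputer works directly on string columns, and OrdinalEncoder maps each category to an integer code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# A string (categorical) column with one missing entry.
X = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)

# 'most_frequent' works on string data and fills with the mode ('red').
imputer = SimpleImputer(strategy='most_frequent')
X_filled = imputer.fit_transform(X)

# OrdinalEncoder assigns integer codes to the sorted categories.
encoder = OrdinalEncoder()
X_codes = encoder.fit_transform(X_filled)
print(X_filled.ravel())
print(X_codes.ravel())
```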

slide-12
SLIDE 12

ColumnTransformer (new)

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

numeric = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler())

categorical = make_pipeline(
    # new: 'constant' strategy, handles categorical features
    SimpleImputer(strategy='constant', fill_value='missing'),
    # new: handles categorical features
    OneHotEncoder())

preprocessing = make_column_transformer(
    (['age', 'fare'], numeric),        # continuous features
    (['sex', 'pclass'], categorical),  # categorical features
    remainder='drop')

model = make_pipeline(preprocessing, LogisticRegression())

9 / 30

slide-13
SLIDE 13

PowerTransformer (new)

10 / 30
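The slide above showed PowerTransformer as a figure; a quick sketch of what it does (an illustration, not from the deck): it applies a Box-Cox or Yeo-Johnson transform to map skewed data towards a Gaussian.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # strictly positive, heavily skewed

# Box-Cox requires positive data; Yeo-Johnson (the default) also accepts
# zeros and negatives. standardize=True (default) rescales the result to
# zero mean and unit variance.
pt = PowerTransformer(method='box-cox')
X_gauss = pt.fit_transform(X)
print(X_gauss.mean(), X_gauss.std())
```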

slide-14
SLIDE 14

KBinsDiscretizer (new)

11 / 30

slide-15
SLIDE 15

KBinsDiscretizer (new)

12 / 30
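The KBinsDiscretizer slides above were figures; a minimal usage sketch (an illustration, not from the deck): it bins a continuous feature into intervals and encodes the bin index.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [10.0]])

# Bin the feature into 2 intervals at the median; 'ordinal' returns the
# bin index, while 'onehot' would return a sparse indicator matrix.
disc = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='quantile')
X_binned = disc.fit_transform(X)
print(X_binned.ravel())
```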

slide-16
SLIDE 16

TransformedTargetRegressor (new)

13 / 30

slide-17
SLIDE 17

TransformedTargetRegressor (new)

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(LinearRegression(),
                                   func=np.log,
                                   inverse_func=np.exp)
y_pred = model.fit(X_train, y_train).predict(X_test)

14 / 30

slide-18
SLIDE 18

Glossary of Common Terms and API Elements (new)

https://scikit-learn.org/stable/glossary.html

15 / 30

slide-19
SLIDE 19

Joblib backend system (new)

  • New pluggable backend system for Joblib
  • New default backend for single-host multiprocessing (loky)
    – Does not break third-party threading runtimes
  • Ability to delegate to dask/distributed for cluster computing

16 / 30
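A minimal sketch of joblib with the loky backend (an illustration, not from the deck; loky is the default process-based backend since joblib 0.12, selecting it explicitly just makes the choice visible):

```python
from math import sqrt
from joblib import Parallel, delayed

# Run 5 tiny tasks across 2 worker processes with the loky backend.
results = Parallel(n_jobs=2, backend='loky')(
    delayed(sqrt)(i ** 2) for i in range(5))
print(results)
```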

slide-20
SLIDE 20

Nearest Neighbors

17 / 30

slide-21
SLIDE 21

Nearest Neighbors Classifier

17 / 30

slide-22
SLIDE 22

Nearest Neighbors in scikit-learn

Used in:

  • KNeighborsClassifier, RadiusNeighborsClassifier
  • KNeighborsRegressor, RadiusNeighborsRegressor, LocalOutlierFactor
  • TSNE, Isomap, SpectralEmbedding
  • DBSCAN, SpectralClustering

18 / 30

slide-23
SLIDE 23

Nearest Neighbors

Computed with brute force, KDTree, or BallTree, ...
... or with approximate methods (random projections):

  • annoy (by Spotify)
  • faiss (by Facebook research)
  • nmslib
  • ...

19 / 30
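An exact tree-based query can be sketched as follows (an illustration, not from the deck; brute force and BallTree give the same answer with different speed trade-offs):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(100, 3)

# Exact 5-nearest-neighbour query on a KDTree.
tree = KDTree(X, leaf_size=30)
dist, ind = tree.query(X[:1], k=5)  # returns (distances, indices)
print(ind[0])  # the query point is its own nearest neighbour
```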

slide-25
SLIDE 25

Nearest Neighbors benchmark

https://github.com/erikbern/ann-benchmarks

20 / 30

slide-26
SLIDE 26

Nearest Neighbors

  • scikit-learn API

21 / 30

slide-27
SLIDE 27

Trees and wrapping estimator

KDTree and BallTree:

Not proper scikit-learn estimators query, query_radius, which return (indices, distances)

22 / 30

slide-28
SLIDE 28

Trees and wrapping estimator

KDTree and BallTree:

Not proper scikit-learn estimators query, query_radius, which return (indices, distances)

NearestNeighbors:

scikit-learn estimator, but without transform or predict kneighbors, radius_neighbors, which return (distances, indices)

22 / 30
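The inconsistent return orders can be seen directly (a small illustration, not from the deck), comparing a tree's query_radius with the estimator's radius_neighbors:

```python
import numpy as np
from sklearn.neighbors import BallTree, NearestNeighbors

X = np.array([[0.0], [1.0], [2.0]])

# BallTree.query_radius returns (indices, distances) ...
tree = BallTree(X)
ind, dist = tree.query_radius(X[:1], r=1.5, return_distance=True)

# ... while NearestNeighbors.radius_neighbors returns (distances, indices).
nn = NearestNeighbors(radius=1.5).fit(X)
nn_dist, nn_ind = nn.radius_neighbors(X[:1])

print(ind[0], dist[0])
print(nn_ind[0], nn_dist[0])
```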

slide-29
SLIDE 29

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

23 / 30

slide-30
SLIDE 30

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor

Inherit fit and kneighbors (weird) from NearestNeighbors

23 / 30

slide-31
SLIDE 31

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor

Inherit fit and kneighbors (weird) from NearestNeighbors

TSNE, DBSCAN, Isomap, LocallyLinearEmbedding:

Create an instance of NearestNeighbors

23 / 30

slide-32
SLIDE 32

Nearest Neighbors call

KernelDensity, NearestNeighbors:

Create an instance of BallTree or KDTree

KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor, LocalOutlierFactor

Inherit fit and kneighbors (weird) from NearestNeighbors

TSNE, DBSCAN, Isomap, LocallyLinearEmbedding:

Create an instance of NearestNeighbors

SpectralClustering, SpectralEmbedding:

Call kneighbors_graph, which creates an instance of NearestNeighbors

23 / 30
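kneighbors_graph, mentioned above, builds the neighbors relation as a sparse matrix (a small illustration, not from the deck):

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.neighbors import kneighbors_graph

X = np.array([[0.0], [1.0], [2.0], [10.0]])

# Sparse CSR matrix: one row per sample, one stored entry per neighbour
# (the sample itself is excluded by default), holding the distance.
graph = kneighbors_graph(X, n_neighbors=2, mode='distance')
print(graph.shape, graph.nnz)
```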

slide-33
SLIDE 33

Copy of NearestNeighbors parameters in each class

params = [algorithm, leaf_size, metric, p, metric_params, n_jobs]

# sklearn.neighbors
NearestNeighbors(n_neighbors, radius, *params)
KNeighborsClassifier(n_neighbors, *params)
KNeighborsRegressor(n_neighbors, *params)
RadiusNeighborsClassifier(radius, *params)
RadiusNeighborsRegressor(radius, *params)
LocalOutlierFactor(n_neighbors, *params)

# sklearn.manifold
TSNE(metric)
Isomap(n_neighbors, neighbors_algorithm, n_jobs)
LocallyLinearEmbedding(n_neighbors, neighbors_algorithm, n_jobs)
SpectralEmbedding(n_neighbors, n_jobs)

# sklearn.cluster
SpectralClustering(n_neighbors, n_jobs)
DBSCAN(eps, *params)

24 / 30

slide-34
SLIDE 34

Different handling of precomputed neighbors in X

Handle precomputed distance matrices:

  • TSNE, DBSCAN, SpectralEmbedding, SpectralClustering, LocalOutlierFactor, NearestNeighbors
  • KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor (not Isomap)

Handle precomputed sparse neighbors graphs:

  • DBSCAN, SpectralClustering

Handle objects inheriting NearestNeighbors:

  • LocalOutlierFactor, NearestNeighbors

Handle objects inheriting BallTree/KDTree:

  • LocalOutlierFactor, NearestNeighbors
  • KNeighborsClassifier, KNeighborsRegressor, RadiusNeighborsClassifier, RadiusNeighborsRegressor

25 / 30
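The "precomputed sparse neighbors graph" path can be sketched with DBSCAN (an illustration, not from the deck): the graph is built once and passed in place of X with metric='precomputed'.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 10.0])  # two blobs

# Precompute the sparse radius-neighbours graph once ...
graph = radius_neighbors_graph(X, radius=3.0, mode='distance')

# ... and let DBSCAN consume it instead of recomputing neighbours.
labels = DBSCAN(eps=3.0, min_samples=3,
                metric='precomputed').fit_predict(graph)
print(np.unique(labels))
```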

slide-38
SLIDE 38

Challenges

  • Consistent API, avoid copying all parameters
  • Changing the API? Difficult without breaking code
  • Use approximate nearest neighbors from other libraries

26 / 30

slide-39
SLIDE 39

Proposed solution

Precompute sparse graphs in a Transformer

[#10482]

27 / 30

slide-40
SLIDE 40

Precomputed sparse nearest neighbors graph

Steps:

  • 1. Make all classes accept precomputed sparse neighbors graph

28 / 30

slide-41
SLIDE 41

Precomputed sparse nearest neighbors graph

Steps:

  • 1. Make all classes accept precomputed sparse neighbors graph
  • 2. Pipeline: Add KNeighborsTransformer and

RadiusNeighborsTransformer

from sklearn.pipeline import make_pipeline from sklearn.neighbors import KNeighborsTransformer from sklearn.manifold import TSNE graph = KNeighborsTransformer(n_neighbors=n_neighbors, mode='distance', metric=metric) tsne = TSNE(metric='precomputed', method="barnes_hut") model_1 = make_pipeline(graph, tsne) model_2 = TSNE(metric=metric, method="barnes_hut")

28 / 30

slide-42
SLIDE 42

Precomputed sparse nearest neighbors graph

Improvements:

  • 1. All parameters are accessible in the transformer

29 / 30

slide-43
SLIDE 43

Precomputed sparse nearest neighbors graph

Improvements:

  • 1. All parameters are accessible in the transformer
  • 2. Caching properties of the pipeline (memory="path/to/cache")

29 / 30

slide-44
SLIDE 44

Precomputed sparse nearest neighbors graph

Improvements:

  • 1. All parameters are accessible in the transformer
  • 2. Caching properties of the pipeline (memory="path/to/cache")
  • 3. Allow custom nearest neighbors estimators

# Example: TSNE with AnnoyTransformer: 46.222 sec TSNE with KNeighborsTransformer: 79.842 sec TSNE with internal NearestNeighbors: 79.984 sec

29 / 30
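The caching improvement relies on the generic memory= parameter of Pipeline; a minimal sketch with ordinary transformers (an illustration, not the proposed KNeighborsTransformer):

```python
import tempfile
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(50, 5)

# memory=... caches fitted transformers on disk, so refitting the same
# pipeline (e.g. inside a grid-search) skips unchanged upstream steps.
cache_dir = tempfile.mkdtemp()
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      memory=cache_dir)
X_2d = model.fit_transform(X)
print(X_2d.shape)
```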

slide-45
SLIDE 45

Thank you for your attention!

tomdlt.github.io/decks/2018_pyparis @tomdlt10

30 / 30