Time-delayed features and auto-regressive models
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Chris Holdgraf
Fellow, Berkeley Institute for Data Science

The past is useful
- Timeseries data almost always have information that is shared between timepoints
- Information in the past can help predict what happens in the future
- Often the features best-suited to predict a timeseries are previous values of the same timeseries.

A note on smoothness and auto-correlation
- A common question to ask of a timeseries: how smooth is the data?
- AKA, how correlated is a timepoint with its neighboring timepoints (called autocorrelation).
- The amount of auto-correlation in data will impact your models.
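
One way to put a number on "smoothness" is the lag-1 autocorrelation. The sketch below is only an illustration under assumed data: the two synthetic signals (white noise as the rough case, its cumulative sum as the smooth case) and all variable names are hypothetical and do not come from the slides.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical signals: white noise is "rough", a random walk is "smooth"
rough = pd.Series(rng.normal(size=500))
smooth = rough.cumsum()

# Series.autocorr(lag=k) is the Pearson correlation between the series
# and a copy of itself shifted by k timepoints
rough_ac = rough.autocorr(lag=1)
smooth_ac = smooth.autocorr(lag=1)

print(f"lag-1 autocorrelation, rough:  {rough_ac:.2f}")
print(f"lag-1 autocorrelation, smooth: {smooth_ac:.2f}")
```

A value near 0 means neighboring timepoints carry little information about each other; a value near 1 means the signal changes slowly.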

Creating time-lagged features
- Let's see how we could build a model that uses values in the past as input features.
- We can use this to assess how auto-correlated our signal is (and lots of other stuff too)

Time-shifting data with Pandas

    print(df)

       df
    0  0.0
    1  1.0
    2  2.0
    3  3.0
    4  4.0

    # Shift a DataFrame/Series by 3 index values, so that each row
    # holds the value from 3 steps in the past
    print(df.shift(3))

       df
    0  NaN
    1  NaN
    2  NaN
    3  0.0
    4  1.0

Creating a time-shifted DataFrame

    import pandas as pd

    # data is a pandas Series containing time series data
    data = pd.Series(...)

    # Shifts
    shifts = [0, 1, 2, 3, 4, 5, 6, 7]

    # Create a dictionary of time-shifted data
    many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}

    # Convert them into a dataframe
    many_shifts = pd.DataFrame(many_shifts)

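Fitting a model with time-shifted features

    from sklearn.linear_model import Ridge

    # Shifting leaves NaNs in the first rows, which Ridge cannot handle;
    # drop those rows and align the target to the rows that remain
    X = many_shifts.dropna()
    y = data.loc[X.index]

    # Fit the model using these input features
    model = Ridge()
    model.fit(X, y)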
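
The slide's `data` is elided, so the following is a self-contained sketch on an assumed signal rather than the course's dataset. It generates a hypothetical AR(1) series (coefficient 0.8), builds lagged features, and fits `Ridge`. As a deliberate difference from the slide, lag 0 is left out here because it is identical to the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical AR(1) signal: each value is 0.8 * previous value + noise
n = 1000
values = np.zeros(n)
for t in range(1, n):
    values[t] = 0.8 * values[t - 1] + rng.normal()
data = pd.Series(values)

# Lagged copies of the signal as input features
lags = [1, 2, 3]
X = pd.DataFrame({f"lag_{k}": data.shift(k) for k in lags})

# Shifting introduces NaNs in the first rows; drop them and align the target
X = X.dropna()
y = data.loc[X.index]

model = Ridge().fit(X, y)
coefs = dict(zip(X.columns, model.coef_))
print(coefs)
```

On a signal like this, most of the weight should land on `lag_1`, with the longer lags close to zero.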

Interpreting the auto-regressive model coefficients

    import matplotlib.pyplot as plt

    # Visualize the fit model coefficients
    fig, ax = plt.subplots()
    ax.bar(many_shifts.columns, model.coef_)
    ax.set(xlabel='Coefficient name', ylabel='Coefficient value')

    # Set formatting so it looks nice
    plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

Visualizing coefficients for a rough signal
[Figure]

Visualizing coefficients for a smooth signal
[Figure]

Let's practice!

Cross-validating timeseries data

Cross validation with scikit-learn

    # Iterating over the "split" method yields train/test indices
    for tr, tt in cv.split(X, y):
        model.fit(X[tr], y[tr])
        model.score(X[tt], y[tt])

Cross validation types: KFold
- KFold cross-validation splits your data into multiple "folds" of equal size
- It is one of the most common cross-validation routines

    from sklearn.model_selection import KFold
    cv = KFold(n_splits=5)
    for tr, tt in cv.split(X, y):
        ...
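
To see what the iterator yields, here is a small sketch on a hypothetical 10-sample array (the data is made up for illustration). With the default `shuffle=False`, each test fold is a contiguous block of indices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

cv = KFold(n_splits=5)

# Collect the test indices of each fold
test_folds = [tt.tolist() for _, tt in cv.split(X, y)]
for ii, tt in enumerate(test_folds):
    print(f"fold {ii}: test indices {tt}")
```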

Visualizing model predictions

    fig, axs = plt.subplots(2, 1)

    # Plot the indices chosen for validation on each loop
    axs[0].scatter(tt, [0] * len(tt), marker='_', s=2, lw=40)
    axs[0].set(ylim=[-.1, .1], title='Test set indices (color=CV loop)',
               xlabel='Index of raw data')

    # Plot the model predictions on each iteration
    axs[1].plot(model.predict(X[tt]))
    axs[1].set(title='Test set predictions on each CV loop',
               xlabel='Prediction index')

Visualizing KFold CV behavior
[Figure]

A note on shuffling your data
- Many CV iterators let you shuffle data as a part of the cross-validation process.
- This only works if the data is i.i.d., which timeseries usually is not.
- You should not shuffle your data when making predictions with timeseries.

    from sklearn.model_selection import ShuffleSplit
    cv = ShuffleSplit(n_splits=3)
    for tr, tt in cv.split(X, y):
        ...
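
The following sketch illustrates why shuffling is a problem for time-ordered data. It uses a hypothetical 100-sample index as a stand-in for time; `test_size` and `random_state` are assumptions added so the example is reproducible. In every split, some training points come later in time than the earliest test point.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical data whose row index stands in for time
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
splits = [(np.sort(tr), np.sort(tt)) for tr, tt in cv.split(X, y)]

for tr, tt in splits:
    # Count training points that lie after the earliest test point
    n_future = int((tr > tt.min()).sum())
    print(f"earliest test index {tt.min()}: "
          f"{n_future} training points come later in time")
```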

Visualizing shuffled CV behavior
[Figure]

Using the time series CV iterator
- Thus far, we've broken the linear passage of time in the cross validation
- However, you generally should not use datapoints in the future to predict data in the past
- One approach: Always use training data from the past to predict the future
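
A minimal sketch of the "train on the past, test on the future" approach using scikit-learn's `TimeSeriesSplit`. The 12-sample array is hypothetical and only serves to make the indices easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 time-ordered samples
X = np.arange(12).reshape(-1, 1)
y = np.arange(12)

cv = TimeSeriesSplit(n_splits=3)
splits = [(tr.tolist(), tt.tolist()) for tr, tt in cv.split(X, y)]

for tr, tt in splits:
    print(f"train {tr} -> test {tt}")
```

Each training set grows over the iterations, and every training index precedes every test index.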

Visualizing time series cross validation iterators

    # Import and initialize the cross-validation iterator
    from sklearn.model_selection import TimeSeriesSplit
    cv = TimeSeriesSplit(n_splits=10)

    fig, ax = plt.subplots(figsize=(10, 5))
    for ii, (tr, tt) in enumerate(cv.split(X, y)):
        # Plot training and test indices
        l1 = ax.scatter(tr, [ii] * len(tr), c=[plt.cm.coolwarm(.1)],
                        marker='_', lw=6)
        l2 = ax.scatter(tt, [ii] * len(tt), c=[plt.cm.coolwarm(.9)],
                        marker='_', lw=6)

    ax.set(ylim=[10, -1], title='TimeSeriesSplit behavior',
           xlabel='data index', ylabel='CV iteration')
    ax.legend([l1, l2], ['Training', 'Validation'])

Visualizing the TimeSeriesSplit cross validation iterator
[Figure]

Custom scoring functions in scikit-learn

    def myfunction(estimator, X, y):
        y_pred = estimator.predict(X)
        my_custom_score = my_custom_function(y_pred, y)
        return my_custom_score

A custom correlation function for scikit-learn

    import numpy as np

    def my_pearsonr(est, X, y):
        # Generate predictions and convert to a vector
        y_pred = est.predict(X).squeeze()

        # Use the numpy "corrcoef" function to calculate a correlation matrix
        my_corrcoef_matrix = np.corrcoef(y_pred, y.squeeze())

        # Return a single correlation value from the matrix
        my_corrcoef = my_corrcoef_matrix[1, 0]
        return my_corrcoef
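
A scorer with the `(estimator, X, y)` signature can be passed as the `scoring` argument of `cross_val_score`. The sketch below shows one possible usage on hypothetical synthetic regression data; the scorer is restated in compact form so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

def my_pearsonr(est, X, y):
    # Correlation between the model's predictions and the true values
    y_pred = est.predict(X).squeeze()
    return np.corrcoef(y_pred, y.squeeze())[1, 0]

# Hypothetical data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# One correlation score per CV split
scores = cross_val_score(Ridge(), X, y,
                         cv=TimeSeriesSplit(n_splits=5),
                         scoring=my_pearsonr)
print(scores)
```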

Let's practice!

Stationarity and stability

Stationarity
- Stationary time series do not change their statistical properties over time
- E.g., mean, standard deviation, trends
- Most time series are non-stationary to some extent
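
One informal way to look for non-stationarity is to track a statistic over a rolling window. This is a rough sketch on assumed data, not a formal stationarity test: both signals, the window length, and the helper `rolling_mean_range` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical signals: white noise, and the same noise with a linear trend
stationary = pd.Series(rng.normal(size=n))
trending = stationary + np.linspace(0, 5, n)

def rolling_mean_range(series, window=200):
    """Spread (max - min) of the rolling mean across the series."""
    means = series.rolling(window).mean().dropna()
    return means.max() - means.min()

stat_range = rolling_mean_range(stationary)
trend_range = rolling_mean_range(trending)

print(f"rolling-mean range, stationary: {stat_range:.2f}")
print(f"rolling-mean range, trending:   {trend_range:.2f}")
```

A rolling mean that drifts far more than sampling noise would explain suggests the mean is changing over time.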

Model stability
- Non-stationary data results in variability in our model
- The statistical properties the model finds may change with the data
- In addition, we will be less certain about the correct values of model parameters
- How can we quantify this?

Cross validation to quantify parameter stability
- One approach: use cross-validation
- Calculate model parameters on each iteration
- Assess parameter stability across all CV splits
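
The steps above can be sketched as follows. The data is hypothetical, and `Ridge` with `TimeSeriesSplit` is just one reasonable pairing; the array name `cv_coefficients` is chosen to match the shape `(n_cv_folds, n_coefficients)` used in the bootstrapping code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 4 features with known weights plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 0.5, 0.0, -1.0]) + rng.normal(size=300)

cv = TimeSeriesSplit(n_splits=10)

# One row of coefficients per CV iteration
cv_coefficients = np.zeros((cv.get_n_splits(), X.shape[1]))
for ii, (tr, tt) in enumerate(cv.split(X, y)):
    model = Ridge().fit(X[tr], y[tr])
    cv_coefficients[ii] = model.coef_

# Spread of each coefficient across splits: smaller means more stable
coef_spread = cv_coefficients.std(axis=0)
print(coef_spread)
```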

Bootstrapping the mean
- Bootstrapping is a common way to assess variability
- The bootstrap:
  1. Take a random sample of data with replacement
  2. Calculate the mean of the sample
  3. Repeat this process many times (1000s)
  4. Calculate the percentiles of the result (usually 2.5, 97.5)
The result is a 95% confidence interval of the mean of each coefficient.
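
For a single vector of values, the four steps can be sketched with NumPy alone. The sample here is hypothetical (200 draws from a normal distribution with mean 2), and `rng.choice` is used as the resampling step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample whose mean we want an interval for
sample = rng.normal(loc=2.0, scale=1.0, size=200)

n_boots = 1000
boot_means = np.empty(n_boots)
for ii in range(n_boots):
    # 1. Random sample with replacement, same size as the data
    resampled = rng.choice(sample, size=sample.size, replace=True)
    # 2. Mean of the resampled data (3. repeated n_boots times)
    boot_means[ii] = resampled.mean()

# 4. Percentiles of the bootstrapped means give a 95% interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```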

Bootstrapping the mean

    import numpy as np
    from sklearn.utils import resample

    # cv_coefficients has shape (n_cv_folds, n_coefficients)
    n_coefficients = cv_coefficients.shape[1]
    n_boots = 100
    bootstrap_means = np.zeros((n_boots, n_coefficients))
    for ii in range(n_boots):
        # Generate random indices for our data with replacement,
        # then take the sample mean
        random_sample = resample(cv_coefficients)
        bootstrap_means[ii] = random_sample.mean(axis=0)

    # Compute the percentiles of choice for the bootstrapped means
    percentiles = np.percentile(bootstrap_means, (2.5, 97.5), axis=0)

Plotting the bootstrapped coefficients

    fig, ax = plt.subplots()
    ax.scatter(many_shifts.columns, percentiles[0], marker='_', s=200)
    ax.scatter(many_shifts.columns, percentiles[1], marker='_', s=200)

Assessing model performance stability
- If using the TimeSeriesSplit, we can plot the model's score over time.
- This is useful in finding certain regions of time that hurt the score
- Also useful to find non-stationary signals
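
A sketch of scoring over time, under assumed data: in this hypothetical signal the weight on the first feature flips sign after sample 400, so the relationship between inputs and output is non-stationary. The scores are printed rather than plotted, and the default `Ridge` score (R^2) is used.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 2))

# Hypothetical non-stationary relationship: the weight on feature 0
# flips from +1 to -1 after sample 400
w0 = np.where(np.arange(n) < 400, 1.0, -1.0)
y = w0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)

# Each split tests on a later 100-sample block than the previous one
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(), X, y, cv=cv)

for ii, score in enumerate(scores):
    print(f"CV iteration {ii}: R^2 = {score:.2f}")
```

The score should stay high while the test block comes from the same regime as the training data and drop sharply once the test block falls after the change.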
