Time-delayed features and auto-regressive models

  1. Time-delayed features and auto-regressive models
     MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
     Chris Holdgraf, Fellow, Berkeley Institute for Data Science

  2. The past is useful
     Timeseries data almost always have information that is shared between timepoints.
     Information in the past can help predict what happens in the future.
     Often the features best-suited to predict a timeseries are previous values of the same timeseries.

  3. A note on smoothness and auto-correlation
     A common question to ask of a timeseries: how smooth is the data? In other words, how correlated is a timepoint with its neighboring timepoints (called autocorrelation)?
     The amount of auto-correlation in data will impact your models.

  4. Creating time-lagged features
     Let's see how we could build a model that uses values in the past as input features.
     We can use this to assess how auto-correlated our signal is (and lots of other stuff too).

  5. Time-shifting data with Pandas
     print(df)
        df
     0  0.0
     1  1.0
     2  2.0
     3  3.0
     4  4.0

     # Shift a DataFrame/Series by 3 index values towards the past
     print(df.shift(3))
        df
     0  NaN
     1  NaN
     2  NaN
     3  0.0
     4  1.0
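The shift behavior on the slide can be reproduced with a minimal, self-contained sketch (the series values here simply mirror the slide's example):

```python
import pandas as pd

# A small series matching the slide's example
df = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

# Shifting by 3 moves each value 3 index positions later,
# leaving NaN where no earlier value exists
shifted = df.shift(3)
print(shifted)
```

Note that `shift(3)` pairs each row with the value from 3 steps earlier, which is exactly what a lagged feature needs.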

  6. Creating a time-shifted DataFrame
     # data is a pandas Series containing time series data
     data = pd.Series(...)

     # Shifts
     shifts = [0, 1, 2, 3, 4, 5, 6, 7]

     # Create a dictionary of time-shifted data
     many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}

     # Convert them into a DataFrame
     many_shifts = pd.DataFrame(many_shifts)
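A runnable version of this pattern, using a hypothetical toy series (the slide elides the actual data); the `dropna()` step is an assumption, since rows whose lags reach before the start of the series contain NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical signal: 10 evenly spaced values
data = pd.Series(np.arange(10, dtype=float))

shifts = [0, 1, 2, 3]
many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}
many_shifts = pd.DataFrame(many_shifts)

# Rows whose lags reach before the start of the series contain NaN;
# dropping them leaves only fully observed rows
many_shifts = many_shifts.dropna()
print(many_shifts.shape)
```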

  7. Fitting a model with time-shifted features
     # Fit the model using these input features
     model = Ridge()
     model.fit(many_shifts, data)
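An end-to-end sketch of the fit, under stated assumptions: the signal here is a hypothetical noiseless sine wave (not the course's data), and the NaN rows produced by shifting are masked out before fitting, which the slide leaves implicit:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Hypothetical smooth signal: a noiseless sine wave
data = pd.Series(np.sin(np.linspace(0, 20, 200)))

shifts = [1, 2, 3, 4, 5]
many_shifts = pd.DataFrame(
    {'lag_{}'.format(ii): data.shift(ii) for ii in shifts})

# Keep only rows where every lag is defined, and align the target
mask = many_shifts.notna().all(axis=1)
model = Ridge()
model.fit(many_shifts[mask], data[mask])
print(model.score(many_shifts[mask], data[mask]))
```

Because a sine wave is strongly auto-correlated, a few lags predict it almost perfectly.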

  8. Interpreting the auto-regressive model coefficients
     # Visualize the fit model coefficients
     fig, ax = plt.subplots()
     ax.bar(many_shifts.columns, model.coef_)
     ax.set(xlabel='Coefficient name', ylabel='Coefficient value')

     # Set formatting so it looks nice
     plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

  9. Visualizing coefficients for a rough signal

  10. Visualizing coefficients for a smooth signal

  11. Let's practice!

  12. Cross-validating timeseries data

  13. Cross validation with scikit-learn
     # Iterating over the "split" method yields train/test indices
     for tr, tt in cv.split(X, y):
         model.fit(X[tr], y[tr])
         model.score(X[tt], y[tt])

  14. Cross validation types: KFold
     KFold cross-validation splits your data into multiple "folds" of equal size.
     It is one of the most common cross-validation routines.

     from sklearn.model_selection import KFold
     cv = KFold(n_splits=5)
     for tr, tt in cv.split(X, y):
         ...
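A small runnable check of this behavior, on hypothetical random data: with 20 samples and 5 splits, each test fold is a contiguous block of 4 indices, and together the folds cover every sample exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)

cv = KFold(n_splits=5)
all_test = []
for tr, tt in cv.split(X, y):
    # Without shuffling, each test fold is a contiguous block of 4 samples
    all_test.extend(tt)
print(sorted(all_test))
```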

  15. Visualizing model predictions
     fig, axs = plt.subplots(2, 1)

     # Plot the indices chosen for validation on each loop
     axs[0].scatter(tt, [0] * len(tt), marker='_', s=2, lw=40)
     axs[0].set(ylim=[-.1, .1], title='Test set indices (color=CV loop)',
                xlabel='Index of raw data')

     # Plot the model predictions on each iteration
     axs[1].plot(model.predict(X[tt]))
     axs[1].set(title='Test set predictions on each CV loop',
                xlabel='Prediction index')

  16. Visualizing KFold CV behavior

  17. A note on shuffling your data
     Many CV iterators let you shuffle data as a part of the cross-validation process.
     This only works if the data is i.i.d., which timeseries usually is not.
     You should not shuffle your data when making predictions with timeseries.

     from sklearn.model_selection import ShuffleSplit
     cv = ShuffleSplit(n_splits=3)
     for tr, tt in cv.split(X, y):
         ...
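The leakage this warns about can be demonstrated directly, on a hypothetical ordered signal (the `random_state` is assumed here for reproducibility): with shuffled splits, training indices routinely come from *after* test indices:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100, dtype=float)

cv = ShuffleSplit(n_splits=3, random_state=0)
future_leaks = []
for tr, tt in cv.split(X, y):
    # With shuffling, some training samples fall *after* test samples,
    # leaking future information into training for time series data
    future_leaks.append(tr.max() > tt.min())
print(future_leaks)
```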

  18. Visualizing shuffled CV behavior

  19. Using the time series CV iterator
     Thus far, we've broken the linear passage of time in the cross validation.
     However, you generally should not use datapoints in the future to predict data in the past.
     One approach: always use training data from the past to predict the future.

  20. Visualizing time series cross validation iterators
     # Import and initialize the cross-validation iterator
     from sklearn.model_selection import TimeSeriesSplit
     cv = TimeSeriesSplit(n_splits=10)

     fig, ax = plt.subplots(figsize=(10, 5))
     for ii, (tr, tt) in enumerate(cv.split(X, y)):
         # Plot training and test indices
         l1 = ax.scatter(tr, [ii] * len(tr), c=[plt.cm.coolwarm(.1)],
                         marker='_', lw=6)
         l2 = ax.scatter(tt, [ii] * len(tt), c=[plt.cm.coolwarm(.9)],
                         marker='_', lw=6)
     ax.set(ylim=[10, -1], title='TimeSeriesSplit behavior',
            xlabel='data index', ylabel='CV iteration')
     ax.legend([l1, l2], ['Training', 'Validation'])
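The key property of `TimeSeriesSplit` can be verified without plotting, on a hypothetical ordered signal: in every split, all training indices precede all validation indices:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(50, dtype=float).reshape(-1, 1)
y = np.arange(50, dtype=float)

cv = TimeSeriesSplit(n_splits=5)
ordered = []
for tr, tt in cv.split(X, y):
    # Every training index precedes every validation index,
    # so the model never sees the future during training
    ordered.append(tr.max() < tt.min())
print(all(ordered))
```

The training window also grows with each iteration, so later splits fit on more history.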

  21. Visualizing the TimeSeriesSplit cross validation iterator

  22. Custom scoring functions in scikit-learn
     def myfunction(estimator, X, y):
         y_pred = estimator.predict(X)
         my_custom_score = my_custom_function(y_pred, y)
         return my_custom_score
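A complete sketch of how such a callable plugs into cross-validation; the scorer here (negative mean absolute error) and the random data are hypothetical stand-ins for `my_custom_function`:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# A hypothetical custom scorer: negative mean absolute error
# (scikit-learn scorers follow the convention "greater is better")
def my_neg_mae(estimator, X, y):
    y_pred = estimator.predict(X)
    return -np.mean(np.abs(y_pred - y))

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, 2.0, 3.0])

# Pass the callable directly as the "scoring" argument
scores = cross_val_score(Ridge(), X, y, cv=5, scoring=my_neg_mae)
print(scores)
```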

  23. A custom correlation function for scikit-learn
     def my_pearsonr(est, X, y):
         # Generate predictions and convert to a vector
         y_pred = est.predict(X).squeeze()

         # Use the numpy "corrcoef" function to calculate a correlation matrix
         my_corrcoef_matrix = np.corrcoef(y_pred, y.squeeze())

         # Return a single correlation value from the matrix
         my_corrcoef = my_corrcoef_matrix[1, 0]
         return my_corrcoef

  24. Let's practice!

  25. Stationarity and stability

  26. Stationarity
     Stationary time series do not change their statistical properties over time
     (e.g., mean, standard deviation, trends).
     Most time series are non-stationary to some extent.

  27. (figure only)

  28. Model stability
     Non-stationary data results in variability in our model.
     The statistical properties the model finds may change with the data.
     In addition, we will be less certain about the correct values of model parameters.
     How can we quantify this?

  29. Cross validation to quantify parameter stability
     One approach: use cross-validation.
     Calculate model parameters on each iteration, then assess parameter stability across all CV splits.

  30. Bootstrapping the mean
     Bootstrapping is a common way to assess variability.
     The bootstrap:
       1. Take a random sample of data with replacement
       2. Calculate the mean of the sample
       3. Repeat this process many times (1000s)
       4. Calculate the percentiles of the result (usually 2.5 and 97.5)
     The result is a 95% confidence interval of the mean of each coefficient.
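The four steps above can be sketched on hypothetical one-dimensional data (normal draws with true mean 5; the seeds are assumed for reproducibility), before applying the same idea to model coefficients:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
sample = rng.randn(500) + 5.0  # hypothetical data with true mean 5

n_boots = 1000
boot_means = np.empty(n_boots)
for ii in range(n_boots):
    # Steps 1-3: resample with replacement and record each sample mean
    boot_means[ii] = resample(sample, random_state=ii).mean()

# Step 4: the 2.5th and 97.5th percentiles bound a 95% CI for the mean
lo, hi = np.percentile(boot_means, (2.5, 97.5))
print(lo, hi)
```

The interval should sit tightly around 5, since 500 draws pin the mean down well.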

  31. Bootstrapping the mean
     from sklearn.utils import resample

     # cv_coefficients has shape (n_cv_folds, n_coefficients)
     n_boots = 100
     bootstrap_means = np.zeros((n_boots, n_coefficients))
     for ii in range(n_boots):
         # Generate random indices for our data with replacement,
         # then take the sample mean
         random_sample = resample(cv_coefficients)
         bootstrap_means[ii] = random_sample.mean(axis=0)

     # Compute the percentiles of choice for the bootstrapped means
     percentiles = np.percentile(bootstrap_means, (2.5, 97.5), axis=0)

  32. Plotting the bootstrapped coefficients
     fig, ax = plt.subplots()
     ax.scatter(many_shifts.columns, percentiles[0], marker='_', s=200)
     ax.scatter(many_shifts.columns, percentiles[1], marker='_', s=200)

  33. Assessing model performance stability
     If using the TimeSeriesSplit, you can plot the model's score over time.
     This is useful for finding regions of time that hurt the score.
     It is also useful for finding non-stationary signals.
