Predicting data over time
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Chris Holdgraf
Fellow, Berkeley Institute for Data Science

Classification vs. Regression

CLASSIFICATION

    classification_model.predict(X_test)
    array([0, 1, 1, 0])

REGRESSION

    regression_model.predict(X_test)
    array([0.2, 1.4, 3.6, 0.6])

Correlation and regression

- Regression is similar to calculating correlation, with some key differences
- Regression: A process that results in a formal model of the data
- Correlation: A statistic that describes the data. Less information than a regression model.
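
To make the distinction concrete, here is a minimal sketch on synthetic data (the data, the seed, and the variable names are assumptions for illustration, not part of the course material). Correlation yields a single number, while regression yields a fitted model that can predict new values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y is a noisy linear function of x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Correlation: one statistic describing the strength of the linear association
r = np.corrcoef(x, y)[0, 1]

# Regression: a fitted model with parameters that can also make predictions
model = LinearRegression().fit(x.reshape(-1, 1), y)
slope, intercept = model.coef_[0], model.intercept_
y_new = model.predict(np.array([[0.0], [1.0]]))

print(f"r = {r:.2f}")
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The correlation says how tightly the two variables move together, but only the regression model recovers the slope and intercept needed to predict `y` for an unseen `x`.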

Correlation between variables often changes over time

- Timeseries often have patterns that change over time
- Two timeseries that seem correlated at one moment may not remain so over time
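
One way to see this effect is a rolling correlation. The sketch below uses synthetic data (an assumption for illustration): `y` tracks `x` during the first half of the series and is independent noise afterwards, so the correlation inside a 50-sample window collapses partway through.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
x = pd.Series(rng.normal(size=n))
noise = rng.normal(size=n)

# y follows x in the first half and is unrelated noise in the second half
y = pd.Series(np.where(np.arange(n) < n // 2, x + 0.1 * noise, noise))

# Correlation between x and y inside each 50-sample window
rolling_r = x.rolling(window=50).corr(y)

# Average over windows that lie entirely in the first or second half
first_half = rolling_r.iloc[49:200].mean()
second_half = rolling_r.iloc[250:].mean()
print(f"first half: {first_half:.2f}, second half: {second_half:.2f}")
```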

Visualizing relationships between timeseries

    import matplotlib.pyplot as plt
    import numpy as np

    fig, axs = plt.subplots(1, 2)

    # Make a line plot for each timeseries
    axs[0].plot(x, c='k', lw=3, alpha=.2)
    axs[0].plot(y)
    axs[0].set(xlabel='time', title='X values = time')

    # Encode time as color in a scatterplot
    axs[1].scatter(x_long, y_long, c=np.arange(len(x_long)), cmap='viridis')
    axs[1].set(xlabel='x', ylabel='y', title='Color = time')

Visualizing two timeseries

Regression models with scikit-learn

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(X, y)
    model.predict(X)

Visualize predictions with scikit-learn

    from sklearn.linear_model import Ridge

    alphas = [.1, 1e2, 1e3]
    ax.plot(y_test, color='k', alpha=.3, lw=3)
    for ii, alpha in enumerate(alphas):
        # Fit one Ridge model per regularization strength and predict the test set
        y_predicted = Ridge(alpha=alpha).fit(X_train, y_train).predict(X_test)
        ax.plot(y_predicted, c=cmap(ii / len(alphas)))
    ax.legend(['True values', 'Model 1', 'Model 2', 'Model 3'])
    ax.set(xlabel="Time")
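
As a rough illustration of what the `alpha` values above control, the sketch below fits the same three Ridge models on synthetic data and reports how the coefficients shrink as regularization grows. The data, `true_coefs`, and the 80/20 split are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_coefs + rng.normal(scale=0.5, size=100)

# Keep the original ordering when splitting, as one would with timeseries
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

alphas = [.1, 1e2, 1e3]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    coef_norms.append(np.linalg.norm(model.coef_))
    score = model.score(X_test, y_test)
    print(f"alpha={alpha:g}: |coef|={coef_norms[-1]:.2f}, test R^2={score:.2f}")
```

Larger `alpha` pulls the coefficients toward zero, which is why the predicted lines in a plot like the one above look progressively flatter.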

Visualize predictions with scikit-learn

Scoring regression models

Two most common methods:

- Correlation ($r$)
- Coefficient of Determination ($R^2$)

Coefficient of Determination ($R^2$)

- The value of $R^2$ is bounded on the top by 1, and can be infinitely low
- Values closer to 1 mean the model does a better job of predicting outputs

$$R^2 = 1 - \frac{\mathrm{error}(\mathrm{model})}{\mathrm{variance}(\mathrm{testdata})}$$
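
The formula can be checked by hand. The sketch below assumes that "error" means the sum of squared residuals and "variance" means the sum of squared deviations of the test data from its mean, which is the definition scikit-learn's `r2_score` uses. The small arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([3.0, -0.5, 2.0, 7.0])
y_predicted = np.array([2.5, 0.0, 2.0, 8.0])

# error(model): sum of squared residuals
ss_res = np.sum((y_test - y_predicted) ** 2)
# variance(testdata): sum of squared deviations from the mean of the test data
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# r2_score takes the true values first, then the predictions
r2_sklearn = r2_score(y_test, y_predicted)

# A model that predicts badly can score far below zero
y_bad = np.full_like(y_test, 10.0)
r2_bad = r2_score(y_test, y_bad)

print(r2_manual, r2_sklearn, r2_bad)
```

Because the denominator is computed from the first argument, `r2_score` is not symmetric: swapping the true values and the predictions generally changes the result.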

$R^2$ in scikit-learn

    from sklearn.metrics import r2_score

    # r2_score expects the true values first, then the predictions
    print(r2_score(y_test, y_predicted))

    0.08

Let's practice!

Cleaning and improving your data

Data is messy

- Real-world data is often messy
- The two most common problems are missing data and outliers
- This often happens because of human error, machine sensor malfunction, database failures, etc.
- Visualizing your raw data makes it easier to spot these problems

What messy data looks like

Interpolation: using time to fill in missing data

- A common way to deal with missing data is to interpolate missing values
- With timeseries data, you can use time to assist in interpolation.
- In this case, interpolation means using the known values on either side of a gap in the data to make assumptions about what's missing.

Interpolation in Pandas

    # Return a boolean that notes where missing values are
    missing = prices.isna()

    # Interpolate linearly within missing windows
    prices_interp = prices.interpolate('linear')

    # Plot the interpolated data in red and the data w/ missing values in black
    ax = prices_interp.plot(c='r')
    prices.plot(c='k', ax=ax, lw=2)
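
A small worked example may help show what linear interpolation fills in. The five-day series below is hypothetical, made up so that the result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical daily price series with a two-day gap
index = pd.date_range("2010-01-04", periods=5, freq="D")
prices = pd.Series([10.0, np.nan, np.nan, 16.0, 17.0], index=index)

# Boolean mask marking where values are missing
missing = prices.isna()

# Fill the gap on a straight line between the known neighbors (10 and 16)
prices_interp = prices.interpolate('linear')

# Show only the values that were filled in
print(prices_interp[missing])
```

Note that `'linear'` treats the samples as equally spaced and ignores the index. For irregularly spaced timestamps, `prices.interpolate('time')` weights by the actual time between observations instead.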

Visualizing the interpolated data

Using a rolling window to transform data

- Another common use of rolling windows is to transform the data
- We've already done this once, in order to smooth the data
- However, we can also use this to do more complex transformations

Transforming data to standardize variance

- A common transformation to apply to data is to standardize its mean and variance over time. There are many ways to do this.
- Here, we'll show how to convert your dataset so that each point represents the % change over a previous window.
- This makes timepoints more comparable to one another if the absolute values of data change a lot

Transforming to percent change with Pandas

    import numpy as np

    def percent_change(values):
        """Calculates the % change between the last value
        and the mean of previous values"""
        # Convert to an array so positional indexing works whether
        # pandas passes in a Series or a NumPy array
        values = np.asarray(values)

        # Separate the last value and all previous values into variables
        previous_values = values[:-1]
        last_value = values[-1]

        # Calculate the % difference between the last value
        # and the mean of earlier values
        percent_change = (last_value - np.mean(previous_values)) \
            / np.mean(previous_values)
        return percent_change

Applying this to our data

    import matplotlib.pyplot as plt

    # Plot the raw data
    fig, axs = plt.subplots(1, 2, figsize=(10, 5))
    ax = prices.plot(ax=axs[0])

    # Calculate % change and plot
    ax = prices.rolling(window=20).aggregate(percent_change).plot(ax=axs[1])
    ax.legend_.set_visible(False)
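
To see the numbers this transformation produces, here is a minimal sketch on a four-point series with a window of 3. The series is hypothetical, and the example uses `rolling(...).apply(..., raw=True)` so the function receives plain NumPy arrays, which is an assumption of this sketch rather than the call used above.

```python
import numpy as np
import pandas as pd

def percent_change(values):
    """Change of the last value relative to the mean of the earlier values."""
    values = np.asarray(values)
    previous_mean = np.mean(values[:-1])
    return (values[-1] - previous_mean) / previous_mean

# Hypothetical prices
prices = pd.Series([100.0, 100.0, 110.0, 121.0])

# Each output compares a point with the mean of the two points before it
perc = prices.rolling(window=3).apply(percent_change, raw=True)
print(perc)
```

The first two entries are NaN because a full window is not yet available. The result is a fraction (0.1 means a 10% increase), so multiply by 100 if a percentage is needed.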

Finding outliers in your data

- Outliers are datapoints that are statistically significantly different from the rest of the dataset.
- They can have negative effects on the predictive power of your model, biasing it away from its "true" value
- One solution is to remove or replace outliers with a more representative value
- Be very careful about doing this - often it is difficult to determine what is a legitimately extreme value vs an aberration

Plotting a threshold on our data

    import matplotlib.pyplot as plt

    fig, axs = plt.subplots(1, 2, figsize=(10, 5))
    for data, ax in zip([prices, prices_perc_change], axs):
        # Calculate the mean / standard deviation for the data
        this_mean = data.mean()
        this_std = data.std()

        # Plot the data, with a window that is 3 standard deviations
        # around the mean
        data.plot(ax=ax)
        ax.axhline(this_mean + this_std * 3, ls='--', c='r')
        ax.axhline(this_mean - this_std * 3, ls='--', c='r')

Visualizing outlier thresholds

Replacing outliers using the threshold

    import numpy as np

    # Center the data so the mean is 0
    prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()

    # Calculate standard deviation
    std = prices_outlier_perc.std()

    # Use the absolute value of each datapoint
    # to make it easier to find outliers
    outliers = np.abs(prices_outlier_centered) > (std * 3)

    # Replace outliers with the median value
    # We'll use np.nanmedian since there may be nans around the outliers
    prices_outlier_fixed = prices_outlier_centered.copy()
    prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)
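
The same steps can be tried on a small synthetic series. In the sketch below, the data, the injected spike at position 50, and the shortened variable names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic series (assumed for illustration) with one injected outlier
rng = np.random.default_rng(0)
values = rng.normal(scale=1.0, size=200)
values[50] = 25.0
series = pd.Series(values)

# Center the data and flag points more than 3 standard deviations from the mean
centered = series - series.mean()
std = series.std()
outliers = np.abs(centered) > (std * 3)

# Replace flagged points with the median of the centered data
fixed = centered.copy()
fixed[outliers] = np.nanmedian(fixed)

print(f"flagged {int(outliers.sum())} point(s)")
print(f"max |value| before: {centered.abs().max():.2f}, after: {fixed.abs().max():.2f}")
```

The outlier itself inflates the standard deviation used for the threshold, so very extreme points can hide milder ones. Repeating the procedure or using a robust spread estimate is a common variation, though it goes beyond what the slide shows.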

Visualize the results

    import matplotlib.pyplot as plt

    fig, axs = plt.subplots(1, 2, figsize=(10, 5))
    prices_outlier_centered.plot(ax=axs[0])
    prices_outlier_fixed.plot(ax=axs[1])

Let's practice!

Creating features over time

Extracting features with windows

Using .aggregate for feature extraction

    import numpy as np

    # Visualize the raw data
    print(prices.head(3))

    symbol            AIG        ABT
    date
    2010-01-04  29.889999  54.459951
    2010-01-05  29.330000  54.019953
    2010-01-06  29.139999  54.319953

    # Calculate a rolling window, then extract two features
    feats = prices.rolling(20).aggregate([np.std, np.max]).dropna()
    print(feats.head(3))

                     AIG                  ABT
                     std       amax       std       amax
    date
    2010-02-01  2.051966  29.889999  0.868830  56.239949
    2010-02-02  2.101032  29.629999  0.869197  56.239949
    2010-02-03  2.157249  29.629999  0.852509  56.239949
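
The shape of the resulting feature table is easier to see on a toy example. The sketch below uses a hypothetical two-column DataFrame and passes the aggregations by name (`"std"`, `"max"`), which is a choice made for this example rather than the call shown above.

```python
import pandas as pd

# Hypothetical two-column dataset
prices = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# For each 3-sample window, compute two features per column,
# then drop the rows where the window is not yet full
feats = prices.rolling(3).aggregate(["std", "max"]).dropna()
print(feats)
```

Each original column expands into one column per feature, giving a two-level column index such as `("A", "std")`. The first `window - 1` rows are dropped because they do not contain a full window.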

Check the properties of your features!
