SLIDE 1
In [1]:
import numpy as np
import pandas as pd
import os
import warnings
import time

warnings.simplefilter("ignore")
In [2]:
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import auc_score
from sklearn.preprocessing import OneHotEncoder
Cause-Effect Pairs
http://www.kaggle.com/c/cause-effect-pairs/
Goals:
Introduction to the challenge
Philosophy behind my solution
Rough overview of my solution
Possible improvements
Understanding The Problem
SLIDE 2 "Given samples from a pair of variables A, B, find whether A is a cause
The objective of the challenge is to rank pairs of variables {A, B} to prioritize experimental verification of the conjecture that A causes B. This challenge is limited to pairs of variables deprived of their context. For the training data, we are told whether A causes B, B causes A, A and B are consequences of a common cause, or A and B are independent.

Error Metric: Bi-directional AUC -> it's a ranking problem.

Simplified: Is it more likely that y = f(x, noise) or x = f(y, noise)?
In [3]:
def bidirectional_auc(true, predictions):
    return (auc_score(true == 1, predictions) +
            auc_score(true == -1, -predictions)) / 2
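Note that `auc_score` comes from the scikit-learn version current at the time of the competition and has since been removed. As a rough sketch (not the original code), the same metric can be written against today's `roc_auc_score`; the labels and scores below are made up purely for illustration.

import numpy as np
from sklearn.metrics import roc_auc_score  # modern replacement for auc_score

def bidirectional_auc_modern(true, predictions):
    # AUC for ranking the A->B pairs highest, plus AUC for ranking the
    # B->A pairs lowest, averaged.
    return (roc_auc_score(true == 1, predictions) +
            roc_auc_score(true == -1, -predictions)) / 2

# Toy example (labels in {1, 0, -1}, scores are arbitrary):
true = np.array([1, -1, 0, 1, -1])
predictions = np.array([0.8, -0.5, 0.1, 0.3, -0.2])
print(bidirectional_auc_modern(true, predictions))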
In [4]:
from IPython.core.display import Image
Image('quiz.png')
Out[4]:
SLIDE 3
Understanding The Data
In [5]:
folder = "SUP2data_text" # Using SUP2 because SUP1 and SUP3 have no ca tegorical data!!! pairs = pd.read_csv(os.path.join(folder, "CEdata_train_pairs.csv")) target = pd.read_csv(os.path.join(folder, "CEdata_train_target.csv")) publicinfo = pd.read_csv(os.path.join(folder, "CEdata_train_publicinfo. csv"))
In [6]:
# utility functions!
def to_array(pair_str):
    return np.array(map(int, pair_str.split()))

def get_pair(idx, df):
    return np.array(df['A'])[idx], np.array(df['B'])[idx]

def plot_pair(idx, subset=(pairs, target)):
    df, target = subset
    x, y = map(to_array, get_pair(idx, df))
    print(target.iloc[idx])
    plot(x, y, 'o')
    xlabel('A')
    ylabel('B')
    pylab.show()
    return x, y

def type_finder(a_type, b_type):
    idxs = (a_type == publicinfo['A type']) * (b_type == publicinfo['B type']) > 0
    return pairs[idxs], target[idxs]
SLIDE 4
def with_fit(clf, x, y):
    x_test = np.linspace(x.min(), x.max(), 1000).reshape(-1, 1)
    clf.fit(x.reshape(-1, 1), y.flatten())
    predictions = clf.predict(x_test)
    return x_test, predictions

def fancy_plot(clf, x, y):
    plot(x, y, 'o')
    xlabel('A')
    ylabel('B')
    X_to_Y = with_fit(clf, x, y)
    Y_to_X = with_fit(clf, y, x)
    p1, = plot(*X_to_Y)
    p2, = plot(*reversed(Y_to_X))
    legend([p1, p2], ["X->Y", "Y->X"])

class PolynomialLinearRegression(object):
    def __init__(self, degree):
        self.degree = degree

    def fit(self, X, y):
        self.clf = LinearRegression()
        new_X = np.hstack([X ** n for n in range(1, self.degree + 1)])
        self.clf.fit(new_X, y)
        return self

    def predict(self, X):
        new_X = np.hstack([X ** n for n in range(1, self.degree + 1)])
        return self.clf.predict(new_X)

numerical, categorical, binary = "Numerical", "Categorical", "Binary"
types = (numerical, categorical, binary)
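As a quick, hedged usage sketch of the helpers above: the synthetic pair below is invented for illustration and is not competition data, and the plotting call assumes the notebook's pylab names (plot, xlabel, ...) are in scope, so it is left commented out.

# Synthetic pair: y is a noisy function of x (illustrative only).
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x ** 2 + rng.normal(scale=0.3, size=300)

# Fit a cubic polynomial regression in the X -> Y direction and get its
# predictions on an evenly spaced grid.
x_test, y_hat = with_fit(PolynomialLinearRegression(degree=3), x, y)

# Overlay both fit directions on the scatter plot (requires the pylab
# plotting names, as in the original notebook):
# fancy_plot(PolynomialLinearRegression(degree=3), x, y)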
In [7]:
warnings.simplefilter("ignore")
pairs.head()
Out[7]:
  SampleID                                                 A                                                 B
0   train1  1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1...   -3374 1389 -20322 -12206 -9211 11501 6180 -71...
1   train2   10105 -3574 -6717 6152 -6488 1383 6710 8727 1...  -12820 122 -1222 -4496 -3502 -1733 -2409 -363...
2   train3   9493 -12079 25841 2191 -7618 4012 -1875 -1526...  -3421 -9969 -6761 -5880 5276 -5847 -4636 -915...
3   train4  -1679 13870 -9157 -1269 -8850 13711 -1438 -92...   3 2 2 3 2 2 3 2 2 2 2 3 2 3 3 3 3 2 3 2 2 3 2...
4   train5   7810 -5188 -10806 11361 2020 7001 5679 4429 -...  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0...
SLIDE 5
In [8]:
target.head()
Out[8]:
  SampleID  Target [1 for A->B; -1 otherwise]  Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]
0   train1                                 -1                                                        2
1   train2                                 -1                                                        2
2   train3                                  0                                                        4
3   train4                                 -1                                                        2
4   train5                                 -1                                                        2
In [9]:
publicinfo.head()
Out[9]:
  SampleID     A type       B type
0   train1     Binary    Numerical
1   train2  Numerical    Numerical
2   train3  Numerical    Numerical
3   train4  Numerical  Categorical
4   train5  Numerical       Binary
In [10]:
publicinfo['A type'].value_counts()
Out[10]:
Numerical      4511
Categorical     743
Binary          735
dtype: int64
In [11]:
publicinfo['B type'].value_counts()
Out[11]:
Numerical      4480
Binary          784
Categorical     725
dtype: int64
In [12]:
CC = type_finder(categorical, categorical)
NC = type_finder(numerical, categorical)
SLIDE 6
CN = type_finder(categorical, numerical)
NN = type_finder(numerical, numerical)
In [13]:
type_matrix = np.array([[len(type_finder(t1, t2)[0]) for t1 in types]
                        for t2 in types])
pd.DataFrame(type_matrix, index=types, columns=types)
In [14]:
for i in range(6):
    plot_pair(i, NN)
Out[13]:
             Numerical  Categorical  Binary
Numerical         3372          558     550
Categorical        550           92      83
Binary             589           93     102

SampleID                                                  train2
Target [1 for A->B; -1 otherwise]                             -1
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         2
Name: 1, dtype: object
SampleID                                                  train3
Target [1 for A->B; -1 otherwise]                              0
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         4
Name: 2, dtype: object
SLIDE 7
SampleID                                                  train6
Target [1 for A->B; -1 otherwise]                              1
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         1
Name: 5, dtype: object
SampleID                                                  train7
Target [1 for A->B; -1 otherwise]                              0
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         4
Name: 6, dtype: object
SampleID                                                  train9
Target [1 for A->B; -1 otherwise]                             -1
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         2
Name: 8, dtype: object
SampleID                                                 train12
Target [1 for A->B; -1 otherwise]                              1
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         1
Name: 11, dtype: object
SLIDE 8
Tangent: My Philosophy
The Ultimate Goal: Automatic Feature Creation
The data scientist is often the bottleneck
Reproducibility
SLIDE 9
Applicability to a wider range of problems
It was also a great fit for the problem (since the variables don't have context). The results:
No visualization.
No hand-tuned features.
No human-driven feedback loop.
No state of the art features.
Doing the above things would probably improve the score.
Feature Creation
Problem: How do we get features?
We have a matrix of pairs of non-uniform length!
We want a 2-D matrix to stand on the shoulders of the existing ML algorithms.
How do we not lose local information?
In [15]:
x, y = plot_pair(0, NN)  # Let's focus on numerical-numerical
fancy_plot(LinearRegression(), x, y)
SampleID                                                  train2
Target [1 for A->B; -1 otherwise]                             -1
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         2
Name: 1, dtype: object
SLIDE 10
Solution: Compute metrics between predicted and true. Because we don't want to tune the metrics, let's use all of them!
  various distance metrics
  various statistical metrics
  various machine learning metrics
Other solutions:
  Analyzing the images of the plots.
  Aggregate statistics: entropy, mean, variance, etc. (I used some of these too!)
  State of the art features
Possible Improvement:
  Finding which metrics were most effective, and using more of those.
  Use cross-validated predictions instead of in-sample predictions.
What we're at:
for metric in METRICS:
    B_to_A = metric(true_A, fit_A)
    A_to_B = metric(true_B, fit_B)
    features.append(B_to_A)
    features.append(A_to_B)
    features.append(A_to_B - B_to_A)  # Why not?

# Insert more of these...
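As a hedged illustration of that loop, here is a small stand-in METRICS list and helper; the actual solution used many more metrics, and these particular choices (and the fit_metric_features name) are mine, not the original code.

from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small illustrative stand-in for the much larger set of metrics actually used.
METRICS = [
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    lambda t, p: pearsonr(t, p)[0],   # linear correlation of true vs. fitted
    lambda t, p: spearmanr(t, p)[0],  # rank correlation of true vs. fitted
]

def fit_metric_features(true_A, fit_A, true_B, fit_B):
    # Score both fit directions with every metric, plus their difference.
    features = []
    for metric in METRICS:
        B_to_A = metric(true_A, fit_A)  # quality of the B -> A fit
        A_to_B = metric(true_B, fit_B)  # quality of the A -> B fit
        features.extend([B_to_A, A_to_B, A_to_B - B_to_A])
    return features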
SLIDE 11
Problem: How do we get a fit?
Which technique would be the most widely applicable?
In [16]:
x, y = plot_pair(4, NN)  # Note to self: show 0, then 4, then 5
fancy_plot(GradientBoostingRegressor(max_depth=3), x, y)
Solution: Key insight: machine learning is (to some extent) curve fitting! Instead of choosing a technique, let's have the machine learning model pick the good ones.

SampleID                                                  train9
Target [1 for A->B; -1 otherwise]                             -1
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         2
Name: 8, dtype: object
SLIDE 12
Other solutions:
  Rolling Averages
  Lots of different polynomials
Possible Improvement:
  Finding which models were most effective, and using more of those.
  Use a less limited subset of models (I didn't use the smoother models because they take much longer...)
What we're at:
for model in MODELS:
    fit_A = model.get_fit(true_B, true_A)
    fit_B = model.get_fit(true_A, true_B)
    for metric in METRICS:
        # get metrics for each feature
        ...

Problem: All of the above (somewhat) breaks down when we take categorical variables into account!
Where do binary variables fall?
If we treat them as separate problems, where do we draw the line? Can we afford to deal with much less data?
Can we treat them all the same without losing too much information?
Above, we used regression models to determine goodness of fit! Should we use classification models for categorical variables?
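Before dealing with the categorical complications, here is a minimal numerical-only sketch of how the model loop could be wired to the metric features; the MODELS list and the get_fit/numerical_pair_features helpers are illustrative stand-ins (not the original code) and reuse the fit_metric_features sketch from earlier.

# Illustrative stand-in for the larger set of regressors used as curve fitters.
MODELS = [
    LinearRegression(),
    PolynomialLinearRegression(degree=3),
    DecisionTreeRegressor(max_depth=4),
    GradientBoostingRegressor(max_depth=3, n_estimators=100),
]

def get_fit(model, x, y):
    # In-sample fit of y as a function of x; cross-validated predictions
    # would be the "possible improvement" mentioned earlier.
    model.fit(x.reshape(-1, 1), y)
    return model.predict(x.reshape(-1, 1))

def numerical_pair_features(A, B):
    # Features for one numerical-numerical pair: every model fits both
    # directions, and every metric scores both fits.
    features = []
    for model in MODELS:
        fit_A = get_fit(model, B, A)  # A = f(B, noise)?
        fit_B = get_fit(model, A, B)  # B = f(A, noise)?
        features.extend(fit_metric_features(A, fit_A, B, fit_B))
    return features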
In [17]:
x, y = plot_pair(1, CN)
fancy_plot(GradientBoostingRegressor(max_depth=2), x, y)  # The model assumes numeric input!
SampleID                                                 train11
Target [1 for A->B; -1 otherwise]                              1
Details [1 for A->B; 2 for A<-B; 3 for A-B; 4 for A|B]         1
Name: 10, dtype: object
SLIDE 13
In [18]:
ohe_x = OneHotEncoder().fit_transform(x.reshape(-1, 1))
sixth_col = ohe_x[:, 5].todense().A.flatten()
fancy_plot(GradientBoostingRegressor(max_depth=2), sixth_col, y)
Solution: Treat binary as either categorical or numerical. For each observation (pair of variables), create separate features for: NN, NC, CN, CC
SLIDE 14
To convert numerical to categorical:
  K-means clustering, using the gap statistic to pick the number of clusters
  Binning
To convert categorical to numerical:
  Randomized PCA down to one dimension
  One-hot encoding
To combine metrics for multiple dimensions (categorical), run the previous algorithm on each column, then aggregate the results (mean, min, max, etc.) so that the features are comparable to ones with different dimensionality. (A rough sketch of the conversion helpers follows the pseudocode below.)

# Assume A, B are defined, and without loss of generality, A is categorical
ohe_A = one_hot_encoding(A)
aggregate_features = []
for col in columns(ohe_A):
    features = []
    for model in MODELS:
        # create features as above
        ...
        features.append(...)
    aggregate_features.append(features)

grouped_by_feature = zip(*aggregate_features)
for aggregator in AGGREGATORS:  # (mean, min, max, etc.)
    for feature in grouped_by_feature:
        final_features.append(aggregator(feature))

Note: I did the looping and aggregation twice when both variables are categorical.
Other solutions:
  Only binning
  Ignoring this problem and treating categorical as numerical (!!!)
Possible Improvement:
  Different feature extraction methods (ICA, more clustering, etc...)
  Increasing the dimensionality instead of reducing it (so that we more accurately compare apples-to-apples)
  Different ways of aggregation in the multidimensional case
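As a hedged sketch of the two conversion directions: the helpers below use plain equal-width binning and one-hot + PCA as simple stand-ins, whereas the actual solution used gap-statistic K-means and RandomizedPCA, so treat these as illustrative only.

from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

def numerical_to_categorical(x, n_bins=10):
    # Equal-width binning as a simple stand-in for gap-statistic K-means.
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])

def categorical_to_numerical(x):
    # One-hot encode the labels, then project down to a single dimension
    # (a stand-in for RandomizedPCA down to one component).
    onehot = OneHotEncoder().fit_transform(x.reshape(-1, 1)).toarray()
    return PCA(n_components=1).fit_transform(onehot).ravel()

Metrics computed on each one-hot column would then be aggregated (mean, min, max, etc.) exactly as in the pseudocode above.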
What we're at:
SLIDE 15
features = []
for (A, B) in pairs:
    pair_features = []
    for A_type in {numerical, categorical}:  # the type to convert TO, not the actual type
        for B_type in {numerical, categorical}:
            for A2 in convert(A, A_type):  # apply different algorithms to convert
                for B2 in convert(B, B_type):
                    for column_A in columns(A2):
                        for column_B in columns(B2):
                            for model in MODELS:
                                # get features here
                                ...
                            ...
                        # these need to be done even for 1-D input so that we can make comparisons
                        # to when the input is not 1-D
                        for aggregator in AGGREGATORS:
                            ...
                    ...
                    # ditto here
                    for aggregator in AGGREGATORS:
                        ...
                    # append to pair_features somewhere in here
    features.append(pair_features)
Feature Selection
Problem: We have a lot of features! They're possibly redundant and poor! They could possibly hurt our accuracy (curse of dimensionality).
(These parts are pretty normal, so I'm skipping the code!)
Solution: Genetic algorithm:
  A feature bitmask maps onto genes well
  Meaningful cross-over
  Can explore large spaces quickly and better than random
  Very easy to code
  Brought features from ~9000 to ~4500
  Turns out this wasn't too helpful
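A minimal, hedged sketch of the bitmask genetic algorithm idea: the population size, mutation rate, and selection scheme here are illustrative placeholders, and the fitness callable would be something like the cross-validated bi-directional AUC of the downstream model trained on the masked feature matrix (not the exact setup used).

def evolve_feature_mask(fitness, n_features, pop_size=50, generations=30,
                        mutation_rate=0.01, random_state=1):
    # fitness: callable taking a boolean mask over features, returning a score.
    rng = np.random.RandomState(random_state)
    population = rng.rand(pop_size, n_features) < 0.5  # random initial bitmasks

    for _ in range(generations):
        scores = np.array([fitness(mask) for mask in population])
        # Keep the better half as parents.
        parents = population[np.argsort(scores)[-(pop_size // 2):]]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.randint(len(parents), size=2)]
            point = rng.randint(n_features)                # single-point cross-over
            child = np.concatenate([a[:point], b[point:]])
            child ^= rng.rand(n_features) < mutation_rate  # random bit flips
            children.append(child)
        population = np.vstack([parents] + children)

    scores = np.array([fitness(mask) for mask in population])
    return population[np.argmax(scores)]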
SLIDE 16
Other solutions: Since others' features were hand-tuned, they were also hand selected (:
Things I considered and reasons they weren't used:
  Gaussian process sequential model-based optimization: too slow
  Information theoretic feature selection: not tuned to the model, performed poorly
  Recursive feature elimination: doesn't take into account complex relationships, computationally intensive
  Model specific (such as L1 regularization or decision tree feature importance): importance is extracted from the training set
Possible Improvement:
  Having feature value automatically feed back into feature creation
  Sequential model-based optimization with random forests
Model Optimization
Problem: What do we do with our features?
Which model(s) do we use? What hyperparameters to use?

clf = GradientBoostingRegressor(
    loss='huber',
    n_estimators=5000,
    random_state=1,  # always seed your random state
    min_samples_split=2,
    min_samples_leaf=1,
    subsample=1.0,
    max_features=686,
    alpha=0.995355212043,
    max_depth=10,
    learning_rate=np.exp(-4.09679792914)
)

Solution: Gradient boosted trees
  Scales well with a lot of features
  Pretty fast
  Generally has amazing performance (especially with non-uniform features)
  Bayesian hyperparameter optimization using Gaussian Processes (see: Spearmint)
Other solutions:
  Ensembled tree-based models (GBMs as well, mostly)
SLIDE 17
Some neural networks
Possible Improvement:
  Trying more models
  Ranking-specific models
  Ensembling
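As a closing hedged sketch of how the tuned GBM could be scored with the competition metric: it assumes a feature matrix X and a target vector y built as above, and the hold-out split (and today's model_selection import) are illustrative rather than the exact validation setup used.

from sklearn.model_selection import train_test_split

# X: one row of features per pair; y: targets in {1, 0, -1}.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=1)

# clf is the GradientBoostingRegressor configured above; its continuous
# predictions are used directly as ranking scores.
clf.fit(X_train, y_train)
predictions = clf.predict(X_valid)
print(bidirectional_auc(y_valid, predictions))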
Questions?
In [19]:
from IPython.core.display import Image
Image('leaderboard.png')
Out[19]: