SLIDE 1

Conditional distribution variability measures for causality detection

José A. R. Fonollosa

December 9, 2013

NIPS 2013 Workshop on Causality

SLIDE 2

Outline

  • Introduction
  • Preprocessing
  • Conditional distribution similarity measures
  • Additional features
  • Model
  • Results
  • Conclusions
SLIDE 3

Introduction

  • Heterogeneous cause-effect pairs
  • Statistical / machine learning approach (3 classes)
  • Standard features
  • Measures of the similarity of the 'shape' of the conditional distributions
  • Robust estimation methods:
    – Limited number of samples
    – Noise
    – Quantization
    – Avoid overfitting
  • Tree-based ensemble learning model (Gradient Boosting)

SLIDE 4

Preprocessing

  • Mean and variance normalization: all the features are scale and mean invariant.
  • Homogeneous set of features from mixed numerical/categorical data:
    – Discretization of numerical variables
    – Relabeling of categorical variables

[Figure: two bar charts of normalized frequencies (0.00 to 0.60), one over categorical labels A, B, C, D and one over numerical labels 1, 2, 3. Caption: arbitrary labels or numbers]
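
As a rough illustration, the normalization, discretization and relabeling steps might look like the following sketch. The bin count and the frequency-based relabeling scheme are my own choices; the slides do not specify them.

```python
import numpy as np

def preprocess(x, n_bins=10):
    """Normalize a numerical variable to zero mean / unit variance,
    then discretize it into equal-width bins (bin count is illustrative)."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)   # mean/variance normalization
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)

def relabel(categories):
    """Map arbitrary categorical labels to integer codes by frequency
    (one possible relabeling; the exact scheme is not given in the slides)."""
    values, counts = np.unique(categories, return_counts=True)
    order = values[np.argsort(-counts)]
    mapping = {v: i for i, v in enumerate(order)}
    return np.array([mapping[c] for c in categories])
```

After these steps, a categorical column such as A/B/C/D and a numerical column live on the same homogeneous integer scale, which is what makes a single feature set possible for mixed data.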

SLIDE 5

Conditional distribution similarity

Rationale: the conditional distribution P(Y|X=x) is expected to be simpler to describe in the causal direction. Similar:
  – Normalized shape/histogram for different values of the given variable x
  – Similar entropy and moments
  – Similar Bayesian error probability

Related to functional causal models y = f(x) + g_x(e), but f(x) is replaced by the conditional mean in an interval, and independence tests are replaced by similarity measures. (Image from a presentation by Kun Zhang on functional causal models)
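
The rationale can be sketched as a single variability feature: bin x, compute a statistic of each conditional slice of y, and measure how much that statistic varies across bins. This corresponds in spirit to the "standard deviation of conditional distributions" features in the results table; the function name and binning scheme below are illustrative, not the paper's code.

```python
import numpy as np

def conditional_std_variability(x, y, n_bins=8):
    """Bin x by quantiles, compute the std of y within each bin, and
    return the spread of those per-bin stds. A small value suggests the
    conditionals P(Y|X=x) have a similar 'shape', as expected in the
    causal direction X -> Y."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, n_bins - 1)
    stds = [y[idx == b].std() for b in range(n_bins) if (idx == b).sum() > 1]
    return float(np.std(stds))
```

Computed in both directions, the pair of values (and its analogues for skewness, kurtosis and entropy) becomes input features for the classifier rather than a hard decision rule.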

SLIDE 6

Additional Features (I)

  • Information-theoretic measures:
    – Discrete entropy and joint entropy
    – Discrete conditional entropy
    – Discrete mutual information (+ 2 normalized versions)
    – Adjusted (discrete) mutual information
    – Gaussian divergence (differential entropy)
    – Uniform divergence
  • Slope-based Information Geometric Causal Inference (IGCI)
  • Hilbert-Schmidt Independence Criterion (HSIC)
  • Pearson R

(Adapted versions)
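
Two of the simpler discrete measures can be sketched with plug-in (empirical) estimates; the normalized and adjusted variants listed above are omitted here.

```python
import numpy as np

def discrete_entropy(labels):
    """Plug-in entropy estimate (in nats) of a discretized variable."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y):
    """Discrete mutual information via I(X;Y) = H(X) + H(Y) - H(X,Y),
    encoding each (x, y) pair as a single joint symbol."""
    h_x = discrete_entropy(x)
    h_y = discrete_entropy(y)
    h_xy = discrete_entropy([f"{a},{b}" for a, b in zip(x, y)])
    return h_x + h_y - h_xy
```

The conditional entropy H(Y|X) follows directly as H(X,Y) - H(X) from the same building blocks.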

SLIDE 7

Additional Features (II)

  • Number of samples and number of unique samples
  • Moments and mixed moments: skewness, kurtosis, and mixed moments (1,2) and (1,3)
  • Polynomial fit (order 2)
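
A minimal sketch of these moment and polynomial-fit features on standardized variables; the feature names and exact definitions are my own reading of the bullet list, not the paper's code.

```python
import numpy as np

def moment_features(x, y):
    """Skewness, excess kurtosis, mixed moments (1,2) and (1,3), and the
    residual error of an order-2 polynomial fit of y on x, all computed
    on standardized variables."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    feats = {
        "skew_y": (y ** 3).mean(),
        "kurt_y": (y ** 4).mean() - 3.0,      # excess kurtosis
        "m12":    (x * y ** 2).mean(),        # mixed moment (1,2)
        "m13":    (x * y ** 3).mean(),        # mixed moment (1,3)
    }
    coeffs = np.polyfit(x, y, 2)              # order-2 polynomial fit
    feats["polyfit_err"] = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return feats
```

Each feature is computed for both orderings of the pair, which is why the results table counts them in twos.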
SLIDE 8

Model schemes

Ternary symmetric problem with a single output: (+1) A is a cause of B; (-1) B is a cause of A; (0) neither. Three schemes over the same input features:

  • A single ternary classification model with outputs Pa(-1), Pa(0), Pa(1); score = Pa(1) - Pa(-1)
  • Two binary models, one for class 1 versus the rest and one for class -1 versus the rest; score = ½ (Ps(1) - Ps(-1))
  • Two binary models, one for 1 versus -1 and one for 0 versus the rest

All schemes show similar performance.
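
The score computations of the first two schemes are simple combinations of predicted class probabilities; a minimal sketch (function names are mine):

```python
import numpy as np

def ternary_score(proba, classes=(-1, 0, 1)):
    """Score from a single ternary model: Pa(+1) - Pa(-1).
    `proba` is an (n, 3) array of class probabilities ordered as `classes`."""
    proba = np.asarray(proba, float)
    return proba[:, classes.index(1)] - proba[:, classes.index(-1)]

def two_binary_score(p_plus, p_minus):
    """Score from two one-vs-rest binary models: ½ (Ps(+1) - Ps(-1))."""
    return 0.5 * (np.asarray(p_plus, float) - np.asarray(p_minus, float))
```

In both cases a positive score points toward "A causes B" and a negative score toward "B causes A", with values near zero left for the "neither" class.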

SLIDE 9

Gradient Boosting Model (GBM)

  • Gradient boosting:
    – Large number of boosting stages = 500
    – Large tree size = 9 (higher-order interactions)
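
A minimal sketch of such a configuration with scikit-learn's GradientBoostingClassifier. Mapping "tree size = 9" to max_leaf_nodes=9 is my assumption (it could equally mean an interaction depth, as in R's gbm package), and the data here is a synthetic stand-in for the real feature matrix.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.randn(300, 5)                      # stand-in feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in labels

model = GradientBoostingClassifier(
    n_estimators=500,     # large number of boosting stages
    max_leaf_nodes=9,     # large tree size -> higher-order interactions
)
model.fit(X, y)
```

Large trees let each boosting stage model interactions between several features at once, which matters here because no single feature separates the classes on its own.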

SLIDE 10

Results

Training time: 45 minutes (4-core server). Test predictions: 12 minutes.

Features                                                                      Score
Baseline (21)                                                                 0.742
Baseline (21) + Moment31 (2)                                                  0.750
Baseline (21) + Moment21 (2)                                                  0.757
Baseline (21) + Error probability (2)                                         0.749
Baseline (21) + Polyfit (2)                                                   0.757
Baseline (21) + Polyfit error (2)                                             0.757
Baseline (21) + Skewness (2)                                                  0.754
Baseline (21) + Kurtosis (2)                                                  0.744
Baseline (21) + the above statistics set (14)                                 0.790
Baseline (21) + Std. dev. of conditional distributions (2)                    0.779
Baseline (21) + Std. dev. of the skewness of conditional distributions (2)    0.765
Baseline (21) + Std. dev. of the kurtosis of conditional distributions (2)    0.759
Baseline (21) + Std. dev. of the entropy of conditional distributions (2)     0.759
Baseline (21) + Measures of variability of the conditional distribution (8)   0.789
Full set (43 features)                                                        0.820

SLIDE 11

Conclusions

  • A statistical machine learning approach to deal with heterogeneous cause-effect pairs
  • Several features must be combined to obtain good results (higher-order interactions).
  • The proposed measures of the similarity of the conditional distributions provide significant additional performance.
  • Competitive results, open-source code, simple and fast
  • Next step: a detailed study of the performance on different types of data pairs