[PPT] - INTRODUCTION TO GENETIC EPIDEMIOLOGY (GBIO0015-1) Prof. Dr. Dr. K. PowerPoint Presentation

SLIDE 1

INTRODUCTION TO GENETIC EPIDEMIOLOGY (GBIO0015-1)

Prof. Dr. Dr. K. Van Steen

SLIDE 2

Introduction to Genetic Epidemiology Chapter 7: A World of Interactions

K Van Steen 521

CHAPTER 7: A WORLD OF INTERACTIONS

1 Beyond main effects
1.a Dealing with multiplicity
1.b A bird’s eye view on roads less travelled by
1.c Multi-locus analysis epistasis analysis
2 Epistasis detection: a challenging task
2.a Variable selection
2.b Multifactor dimensionality reduction
2.c Interpretation
3 Future challenges

SLIDE 3

1 Beyond main effects

1.a Dealing with multiplicity

Multiple testing explosion: ~500,000 SNPs span 80% of common variation in genome (HapMap)

[Figure: n-th order interaction, n = 1, 2, 3, 4, 5]
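To get a feel for the explosion: the number of distinct n-locus subsets among m SNPs is the binomial coefficient C(m, n). The R sketch below assumes m = 500,000 SNPs, as quoted above, and counts one test per subset (an assumption; an actual analysis may carry out several tests per subset). The Bonferroni threshold at alpha = 0.05 is shown only to illustrate how quickly the per-test significance level shrinks.

```r
# Number of distinct n-locus subsets among m SNPs (one test per subset assumed)
m <- 500000
orders <- 1:5
n_tests <- choose(m, orders)

# Per-test significance threshold under a Bonferroni correction at alpha = 0.05
alpha <- 0.05
bonferroni_threshold <- alpha / n_tests

print(data.frame(order = orders, n_tests = n_tests,
                 bonferroni_threshold = bonferroni_threshold))
```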

SLIDE 4

Ways to handle multiplicity

Recall that several strategies can be adopted, including:

clever multiple-testing correction procedures,
pre-screening strategies,
multi-stage designs,
adopting haplotype tests, or
multi-locus tests.

Which of these approaches is more powerful is still under heavy debate…

Does the multiple testing problem become “unmanageable” when looking at multiple loci jointly?

SLIDE 5

1.b A bird’s eye view on roads less travelled by

Multiple disease susceptibility loci (mDSL)

Dichotomy between
improving single-marker strategies to pick up multiple signals at once (PBAT)
testing groups of markers (FBAT multi-locus tests)

SLIDE 6

PBAT screening for mDSL

Little has been done in the context of family-based screening for epistasis
First assess how well a method can detect multiple DSL
Simulation strategy (10,000 replicates):
Genetic data from Affymetrix SNPChip 10K array on 467 subjects from 167 families
Select 5 regions; 1 DSL in each region
Generate traits according to a normal distribution, including up to 5 genetic contributions
For each replicate: generate heritability according to a uniform distribution with mean h = 0.03 for all loci considered (or h = 0.05 for all loci)

(Van Steen et al 2005)

SLIDE 7

General theory on FBAT testing

Test statistic:
works for any phenotype and genetic model
uses the covariance between offspring trait and genotype

Test distribution:
computed assuming H0 is true; the random variable is the offspring genotype
condition on parental genotypes when available, extend to family configurations (avoids specification of the allele distribution)
condition on offspring phenotypes (avoids specification of the trait distribution)

(Horvath et al 1998, 2001; Laird et al 2000)

SLIDE 8

Screen

Use ‘between-family’ information [f(S,Y)]
Calculate conditional power (ab,Y,S)
Select top N SNPs on the basis of power

Test

Use ‘within-family’ information [f(X|S)] while computing the FBAT statistic
This step is independent from the screening step
Adjust for N tests (not 500K!)

(Van Steen et al 2005) (Lange and Laird 2006)

SLIDE 9

Power to detect genes with multiple DSL

top: top 5 SNPs in the ranking
bottom: top 10 SNPs in the ranking

(Van Steen et al 2005)

SLIDE 10

Power to detect genes with multiple DSL

top: Benjamini-Yekutieli FDR control at 5% (general dependencies)
bottom: Benjamini-Hochberg FDR control at 5%

(Van Steen et al 2005)
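The difference between the two FDR procedures can be tried out with base R's `p.adjust`, which offers both as methods `"BH"` and `"BY"`. The p-values below are made up purely for illustration and are not taken from Van Steen et al (2005).

```r
# Hypothetical p-values for 10 SNPs (made up for illustration only)
p <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
       0.0278, 0.0298, 0.0344, 0.0459, 0.3240)

# Benjamini-Hochberg adjustment
p_bh <- p.adjust(p, method = "BH")
# Benjamini-Yekutieli adjustment: valid under general dependencies,
# and therefore never less conservative than Benjamini-Hochberg
p_by <- p.adjust(p, method = "BY")

# SNPs declared significant at an FDR of 5% under each procedure
print(data.frame(p = p, BH = round(p_bh, 4), BY = round(p_by, 4),
                 sig_BH = p_bh <= 0.05, sig_BY = p_by <= 0.05))
```

Because the Benjamini-Yekutieli adjusted values are at least as large as the Benjamini-Hochberg ones, the set of SNPs retained under BY is always a subset of the set retained under BH.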

SLIDE 11

FBAT multi-locus tests

(Rakovski et al 2008)

The new test has an overall performance very similar to that of FBAT-LC
FBAT-SNP-PC attains higher power in candidate genes with lower average pair-wise correlations and moderate to high allele frequencies with large gains (up to 80%).

(FBAT-LC : Xin et al 2008)

SLIDE 12

In contrast: popular multi-locus approaches for unrelateds

Parametric methods:
Regression
Logistic or (Bagged) logic regression
Non-parametric methods:
Combinatorial Partitioning Method (CPM): quantitative phenotypes; interactions
Multifactor-Dimensionality Reduction (MDR): qualitative phenotypes; interactions
Machine learning and data mining

Does the multiple testing problem become “unmanageable” when looking at (genetic) interaction effects? More about this in Chapter 9.

SLIDE 13

1.c Multi-locus analysis epistasis analysis

Epistasis: what’s in a name?

Interaction is a kind of action that occurs as two or more objects have an effect upon one another. The idea of a two-way effect is essential in the concept of interaction, as opposed to a one-way causal effect. (Wikipedia)

(slide : C Amos)

SLIDE 14

Epistasis: what’s in a name?

Distortions of Mendelian segregation ratios due to one gene masking the effects of another (William Bateson 1861-1926).

Deviations from linearity in a statistical model (Ronald Fisher 1890-1962).

“Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans”

(Cordell 2002)

SLIDE 15

Why is there epistasis?

From an evolutionary biology perspective, for a phenotype to be buffered against the effects of mutations, it must have an underlying genetic architecture that is comprised of networks of genes that are redundant and robust.
This creates dependencies among the genes in the network and is realized as epistasis.

(slide: Y Chen, 2007)

SLIDE 16

Different types of interactions

(Fisher, Wright)

[Figure: two panels plotting Trait against genotype (qq, Qq, QQ), with trait values m-a, m+d and m+a]

(slide: C Amos)

SLIDE 17

Interpretation of epistasis

The study of epistasis poses problems of interpretability. Statistically, epistasis is usually defined in terms of deviation from a model of additive multiple effects, but this might be on either a linear or logarithmic scale, which implies different definitions.

(Moore 2004)

Despite the aforementioned concerns, there is evidence that a direct search for epistatic effects can pay dividends.
It is expected to have an increasing role in future analyses…

SLIDE 18

The frequency of epistasis

Not a new idea! (Bateson 1909)
Complexity of gene regulation and biochemical networks (Gibson 1996; Templeton 2000)
Single gene results don’t replicate (Hirschhorn et al. 2002)
Gene-gene interactions are commonly found when properly investigated (Templeton 2000)

Working hypothesis:

Single gene studies don’t replicate because gene-gene interactions are more important (Moore and Williams 2002)

(Moore 2003)

SLIDE 19

Slow shift from main towards epistatic effects

(Motsinger et al 2007)

SLIDE 20

Power of a gene-gene or gene-environment interaction analysis

There is a vast literature on power considerations
Most of this literature supports its claims with extensive simulation studies
There is a need for user-friendly software tools that allow the user to perform hands-on power calculations
Main package targeting interaction analyses is QUANTO (v1.2.1):
Available study designs for a disease (binary) outcome include the unmatched case-control, matched case-control, case-sibling, case-parent, and case-only designs. Study designs for a quantitative trait include independent individuals and case-parent designs.
Reference: Gauderman (2000a), Gauderman (2000b), Gauderman (2003) / http://hydra.usc.edu/GxE

SLIDE 21

A simple example of epistasis

SLIDE 22

A simple disease model

Penetrance
Pr (affected | genotype)
One-locus Dominant Model

Genotype   aa   aA   AA
Status     0    1    1

SLIDE 23

A slightly more complicated two-locus model

Genotype bb bB BB
aa
aA 1 1
AA 1 1

Enumeration of two-locus models

Although there are $2^9 = 512$ possible models, because of symmetries in the data, only 50 of these are unique.
Enumeration allows 0 and 1 only for penetrance values (‘fully penetrant’; i.e., “show” example).

SLIDE 24

Enumeration of two-locus models

(Li and Reich 2000)

Each model represents a group of equivalent models under permutations. The representative model is the one with the smallest model number.
The six models studied in Neuman and Rice [67] (‘RR, RD, DD, T, Mod, XOR’), as well as two single-locus models (‘IL’) – the recessive (R) and the interference (I) model, are marked.

SLIDE 25

Different degrees of epistasis

(slide: Motsinger)

SLIDE 26

Pure epistasis model for dichotomous traits

Suppose
$p(A) = p(B) = p(a) = p(b) = 0.5$
HWE (hence, $p(AA) = 0.5^2 = 0.25$, $p(Aa) = 2 \times 0.5^2 = 0.5$) and no LD
penetrances are given according to the table below

P(affected | genotype)

Penetrance   bb     bB     BB     prob
aa           0      0      1      0.25
aA           0      0.50   0      0.50
AA           1      0      0      0.25
prob         0.25   0.50   0.25   1

Then make multiple use of Bayes' rule to retrieve the genotype distributions in cases and controls

SLIDE 27

Pure epistasis model for dichotomous traits

Then the marginal genotype distributions for cases and controls are the same, and hence one-locus approaches will be powerless!

P(genotypes | affected)

        bb     bB     BB     prob
aa      0      0      0.25   0.25
aA      0      0.50   0      0.50
AA      0.25   0      0      0.25
prob    0.25   0.50   0.25   1

P(genotypes | unaffected)

        bb      bB      BB      prob
aa      0.083   0.167   0       0.25
aA      0.167   0.167   0.167   0.50
AA      0       0.167   0.083   0.25
prob    0.25    0.50    0.25    1

For example:

$$P(aa,BB \mid D) = \frac{p(D \mid aa,BB)\,p(aa,BB)}{p(D)} = \frac{1 \cdot 0.5^2 \cdot 0.5^2}{1 \cdot 0.5^2 \cdot 0.5^2 + 0.5 \cdot (2 \cdot 0.5^2)(2 \cdot 0.5^2) + 1 \cdot 0.5^2 \cdot 0.5^2} = \frac{1}{4} = 0.25$$
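The repeated use of Bayes' rule can be sketched in a few lines of R. The penetrance matrix below assumes a value of 0 for every cell that is blank on the slide, so that only the cells (aa, BB), (aA, bB) and (AA, bb) carry risk. That is an assumption, although it does reproduce the worked value P(aa,BB|D) = 0.25.

```r
# Genotype frequencies at one locus under HWE with allele frequency 0.5
g <- c(0.25, 0.50, 0.25)

# Joint two-locus genotype frequencies under no LD
# (rows: aa, aA, AA; columns: bb, bB, BB)
geno <- outer(g, g)
dimnames(geno) <- list(c("aa", "aA", "AA"), c("bb", "bB", "BB"))

# Penetrances P(affected | genotype); zeros are assumed for blank cells
pen <- matrix(c(0, 0,   1,
                0, 0.5, 0,
                1, 0,   0),
              nrow = 3, byrow = TRUE, dimnames = dimnames(geno))

# Bayes' rule: P(genotype | status) = P(status | genotype) P(genotype) / P(status)
prevalence <- sum(pen * geno)
cases    <- pen * geno / prevalence
controls <- (1 - pen) * geno / (1 - prevalence)

print(round(cases, 3))
print(round(controls, 3))

# Single-locus marginals: identical in cases and controls at both loci
print(rbind(cases = rowSums(cases), controls = rowSums(controls)))
print(rbind(cases = colSums(cases), controls = colSums(controls)))
```

The joint distributions differ strongly between cases and controls, yet every row and column sum agrees, which is what makes the model invisible to single-locus tests.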

SLIDE 28

Purely epistatic 3-locus disease model for quantitative traits

Assume all allele frequencies are 0.5
Heritability is 55% and prevalence is 6.25%

L.3=0 L.3=1 L.3=2 L.2 L.1 1 2 1 2 1 2 1 1 0.25 0 2 1

(Culverhouse et al 2002)

SLIDE 29

Expected genotype patterns for 3-locus model

L.1 L.2 L.3 | p(g)   | E[#affected] | E[#unaff]
2           | 0.0156 | 25           |
2 2         | 0.0156 | 25           |
1 1 1       | 0.1250 | 50           | 10
Other       | 0.8438 |              | 90
Sum         | 1      | 100          | 100

(Culverhouse et al 2002) (slide: J Ott 2004)

SLIDE 30

2 Epistasis detection: a challenging task

Main challenges

Variable selection
Modeling
Interpretation
Making inferences about biological epistasis from statistical epistasis

(slide Chen 2007)

SLIDE 31

2.a Variable selection

Introduction

The aim is to make clever selections of marker combinations to look at in an epistasis analysis
This may not only aid in the interpretation of analysis results, but also reduce the burden of multiple testing and the computational burden

SLIDE 32

Variable selection and multiple testing

Multiple testing is a thorny issue, the bane of statistical genetics.
The problem is not really the number of tests that are carried out: even if a researcher only tests one SNP for one phenotype, if many other researchers do the same and the nominally significant associations are reported, there will be a problem of false positives.

(Balding 2006)

Example
Given 3 disease SNPs (e.g., the Culverhouse 3-locus model before), making inferences is not at all an easy task: Chi-sq = 166.7 (26 df), p = $1.76 \times 10^{-22}$
With 50,000 SNPs, there will be $2.1 \times 10^{13}$ subsets of size 3
Applying Bonferroni correction, p = $3.6 \times 10^{-9}$
A more manageable approach is to test all possible pairs of loci for interaction effects that differ between cases and controls (Hoh and Ott 2003)
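The arithmetic of this example can be retraced in R. The sketch takes the chi-square statistic, the degrees of freedom and the nominal p-value as quoted above; the variable names are illustrative only.

```r
n_snps <- 50000
n_triples <- choose(n_snps, 3)   # number of 3-SNP subsets
n_pairs <- choose(n_snps, 2)     # number of SNP pairs, for comparison

# Nominal p-value of the quoted chi-square statistic on 26 df
p_nominal <- pchisq(166.7, df = 26, lower.tail = FALSE)

# Bonferroni adjustment of the nominal p-value reported above
p_reported <- 1.76e-22
p_bonferroni <- min(1, p_reported * n_triples)

print(c(n_triples = n_triples, n_pairs = n_pairs))
print(c(p_nominal = p_nominal, p_bonferroni = p_bonferroni))
```

Restricting the search to pairs reduces the number of tests by roughly four orders of magnitude, which is the motivation for the pairwise strategy of Hoh and Ott (2003).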

SLIDE 33

Variable selection and multiple testing

Pre-screening for subsequent testing:
Independent screening and testing step (PBAT screening; Van Steen et al 2005)

Dependent screening and testing step

SLIDE 34

Methods to correct for multiple testing

Family-wise error rates (FWER)
… In the presence of too many SNPs, the Bonferroni threshold will be extremely low:
Bonferroni adjustments are conservative when statistical tests are not independent
Bonferroni adjustments control the error rate associated with the omnibus null hypothesis
The interpretation of a finding depends on how many statistical tests were performed

Permutation data sets
It is particularly handy for rare genotypes, small studies, non-normal phenotypes, and tightly linked markers
In case-control data this is relatively straightforward
In family data this is not at all an easy task …

SLIDE 35

Methods to correct for multiple testing

False discovery rate (FDR)
With too many SNPs it starts to break down and the power gain over Bonferroni is minimal (e.g. see Van Steen et al 2005)

False-positive report probability (FPRP)
It is the probability of no true association between a genetic variant and disease given a statistically significant finding. It depends not only on the observed p-value but also on both the prior probability that the association between the genetic variant and the disease is real and the statistical power of the test (Wacholder et al 2004)
In general, Bayesian approaches do not yet have a big role in genetic association analyses, possibly because of computational burden?
Not yet well documented / What are the priors? (Balding 2006; Lucke 2008)

SLIDE 36

Variable selection and computation time

When SNPs do not have independent effects, however, it is impossible for most current computer technologies to analyze the resulting astronomical number of possible combinations.
For instance, if 300000 SNPs have been measured at a density of 1 SNP every 10 kilobases (kb), and if 10 statistical evaluations can be computed each second, then evaluation of each individual SNP would require 30000 seconds (ie, 8.3 hours) of computer time.
Exhaustive evaluation of the approximately $4 \times 10^{10}$ pairwise combinations of SNPs would require 1286 years.
Although it might be possible for a large supercomputer to complete these computations in a reasonable amount of time, an exhaustive search of all combinations of 3 or 4 SNPs would not be possible even if every computer in the world were simultaneously working on the problem.

(Moore and Ritchie 2004)

SLIDE 37

2.b Modeling

Failure of traditional methods

A large number of SNPs are genotyped
“multiple comparisons” problem: very small p-values are required for significance, which is compounded even further in gene-environment interaction analyses.

Genetic loci may interact (epistasis) in their influence on the phenotype
loci with small marginal effects may go undetected
interested in the interaction itself

Curse of dimensionality and sparse “cells”

SLIDE 38

Curse of dimensionality and sparse cells

For 2 SNPs, there are $9 = 3^2$ possible two-locus genotype combinations.
If the alleles are rare (MAF ≤ 10%), then some cells will be empty

(slide: C Amos)

SLIDE 39

Curse of dimensionality and sparse cells

For 4 SNPs, there are 81 possible combinations with more possible empty cells …

(slide: C Amos)
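How sparse the cells become can be gauged from the expected cell counts. The R sketch below assumes HWE, no LD, the same MAF of 10% at every SNP and a hypothetical sample of 400 individuals; none of these values come from the slides except the MAF bound.

```r
# Genotype frequencies (0, 1, 2 copies of the minor allele) under HWE
hwe <- function(maf) c((1 - maf)^2, 2 * maf * (1 - maf), maf^2)

# Expected counts in every multi-locus genotype cell for k independent SNPs
# (no LD) that share the same MAF, in a sample of n individuals
expected_cells <- function(k, maf, n) {
  freq <- Reduce(function(acc, g) as.vector(outer(acc, g)),
                 rep(list(hwe(maf)), k))
  n * freq
}

n <- 400      # hypothetical sample size
maf <- 0.10
for (k in c(2, 4)) {
  e <- expected_cells(k, maf, n)
  cat(k, "SNPs:", length(e), "cells,",
      sum(e < 1), "with expected count below 1,",
      sum(e < 5), "below 5\n")
}
```

The number of cells grows as 3^k while the sample size stays fixed, so the share of cells with an expected count below 1 rises quickly with k.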

SLIDE 40

Modeling: strategy 1

Strategy 1: Set association approach

At each SNP, compute an association statistic T
Build sums over the 1, 2, 3, etc. highest values of T
Evaluate significance of each given sum by permutation test
The sum with the smallest p-value will point towards the markers to select
The smallest p-value is itself a single statistic; find its significance level
Is applicable to many SNPs and has also been used in microarray settings

(Hoh et al 2001)
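The steps of the set association approach can be sketched in R on toy data. Several choices below are assumptions made for illustration and are not taken from Hoh et al (2001): the per-SNP statistic (n times the squared correlation between genotype and status), the maximum set size of 5, the 200 permutations, and the reuse of the same permutations to calibrate the smallest p-value.

```r
set.seed(1)

# Hypothetical toy data: 200 cases, 200 controls, 20 SNPs coded 0/1/2
n <- 400; n_snps <- 20
geno <- matrix(rbinom(n * n_snps, size = 2, prob = 0.3), nrow = n)
status <- rep(c(1, 0), each = n / 2)

# Per-SNP association statistic T (illustrative choice: n * r^2)
snp_stats <- function(geno, status) {
  r <- suppressWarnings(cor(geno, status))
  r[is.na(r)] <- 0               # monomorphic SNPs carry no signal
  nrow(geno) * as.vector(r)^2
}

# Sums over the 1, 2, ..., max_k largest statistics
top_sums <- function(stats, max_k) {
  cumsum(sort(stats, decreasing = TRUE)[1:max_k])
}

max_k <- 5; n_perm <- 200
obs <- top_sums(snp_stats(geno, status), max_k)
perm <- t(replicate(n_perm, top_sums(snp_stats(geno, sample(status)), max_k)))

# Permutation p-value of each sum; the smallest points to the marker set
p_k <- (colSums(sweep(perm, 2, obs, ">=")) + 1) / (n_perm + 1)
best_k <- which.min(p_k)

# The smallest p-value is itself a statistic: compare it with the smallest
# p-value obtained when each permuted data set is treated as the observed one
perm_min_p <- sapply(seq_len(n_perm), function(b) {
  min((colSums(sweep(perm[-b, , drop = FALSE], 2, perm[b, ], ">=")) + 1) / n_perm)
})
p_overall <- (sum(perm_min_p <= min(p_k)) + 1) / (n_perm + 1)

print(p_k); print(best_k); print(p_overall)
```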

SLIDE 41

Strategy 1: Set association approach

(Hoh et al 2001)

SLIDE 42

Modeling: strategy 2

Strategy 2: Multi-locus approaches

Far too often, case-control studies do not take into account the multi-locus nature of complex traits
When the aim is to analyze multiple SNPs or genes jointly, two classes of approaches emerge:

Combine (properties of) single-locus statistics over multiple SNPs to obtain a new multivariate test statistic
Depending on whether SNPs are in high LD or not, different measures need to be taken

Look for patterns of genotypes at SNPs in different genomic locations

SLIDE 43

Two frameworks for multi-locus approaches (Onkamo and Toivonen 2006)

Parametric methods:
Regression
Logistic or (Bagged) logic regression
Non-parametric methods:
Tree-based methods:
Recursive Partitioning (Helix Tree)
Random Forests (R, CART)
Pattern recognition methods:
Mining association rules
Neural networks (NN)
Support vector machines (SVM)
Data reduction methods:
DICE (Detection of Informative Combined Effects)
MDR (Multifactor Dimensionality Reduction)

SLIDE 44

Non-parametric chi-square

The question is how to test for epistatic effects above and beyond (independent) main effects (of single-locus genotype effects)
Use “usual” chi-square for interactions independent of main effects.
Isolate individual df’s.
Assess difference in interactions between cases and controls, since then interactions may be more indicative of underlying pathways

          Locus 2
Locus 1   BB   Bb   bb
AA
Aa
aa

Main effect locus 1   2 df
Main effect locus 2   2 df
Interactions          4 df
Total                 8 df

SLIDE 45

Partitioning chi-squares for one locus

2 df partitioned into 1 df + 1 df

Simple disease model, population frequency K = 0.10
N = 100 cases, 100 controls.
Predicted numbers of cases and controls in given genotype classes, and resulting odds ratios, OR

SLIDE 46

Partitioning chi-squares for two loci

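A 3 × 3 table of genotypes (4 df) may be partitioned into 4 independent components, each with 1 df.
Do such partitioning for cases and controls each (Agresti 2002).

Subtable 1: rows AA, Aa; columns BB, Bb
Subtable 2: rows (AA, Aa), aa; columns BB, Bb
Subtable 3: rows AA, Aa; columns (BB, Bb), bb
Subtable 4: rows (AA, Aa), aa; columns (BB, Bb), bb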

SLIDE 47

Partitioning chi-squares for two loci

Compare each of the four 2 by 2 subtables between cases and controls to see whether their odds ratios are the same

SLIDE 48

Logistic regression

LR is a derivative of linear regression that fits a function to continuous or discrete independent variables based on a dichotomous dependent variable (Hosmer and Lemeshow, 2000).
One of the most common procedures for variable selection in a LR analysis is step-wise logistic regression (step LR) [Hosmer and Lemeshow, 2000].
In the step-wise procedure, each variable is tested for independent effects, and those variables with significant effects are included in the model.
In a second step, interaction terms of those variables with significant main effects are included, and significant effects are included in the model.

(Motsinger-Reif et al 2008)

SLIDE 49

Logistic regression

LR is a de facto standard for traditional association studies. Using independent variables to predict a dichotomous dependent variable, LR by definition lacks the ability to characterize purely interactive effects.
Only variables that contain an independent main effect will be included in the final model.
To properly evaluate non-linear purely interactive effects, combinations of variables must be encoded as a single variable for inclusion in the analysis. Such an encoding scheme can be computationally expensive, depending on the number of variables used.

(Motsinger-Reif et al 2008)
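The contrast between a main-effects model and a model that encodes genotype combinations explicitly can be illustrated with `glm` on simulated data. Everything in the simulation is hypothetical: two SNPs with allele frequency 0.5 and a risk that is raised only when the two genotypes sum to an odd number, which leaves no marginal effect at either locus.

```r
set.seed(42)

# Hypothetical data: purely interactive (XOR-like) risk pattern
n <- 2000
snp1 <- rbinom(n, 2, 0.5)
snp2 <- rbinom(n, 2, 0.5)
risk <- ifelse((snp1 + snp2) %% 2 == 1, 0.35, 0.15)
y <- rbinom(n, 1, risk)

# Treat genotypes as categorical so that no linearity is imposed
d <- data.frame(y = y, snp1 = factor(snp1), snp2 = factor(snp2))

# Main effects only: 2 df per locus
main_only <- glm(y ~ snp1 + snp2, family = binomial, data = d)
# Main effects plus all genotype combinations: 4 additional interaction df
with_int <- glm(y ~ snp1 * snp2, family = binomial, data = d)

# Likelihood-ratio test of the interaction above and beyond main effects
lrt <- anova(main_only, with_int, test = "Chisq")
print(lrt)
```

A step-wise procedure that screens on main effects first would have little reason to retain either SNP here, whereas the 4-df interaction test picks up the signal. The cost is that the number of interaction parameters grows quickly with the number of SNPs considered.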

SLIDE 50

Strategy 2: Look for patterns of genotypes using unrelated individuals

CPM = combinatorial partitioning method (Charlie Sing, U Michigan).

Applicable to small number (~50) of SNPs only.

MDR = multifactor-dimensionality reduction method (Jason Moore, Vanderbilt U)

LAD = logical analysis of data (P. Hammer, Rutgers U)
Mining association rules, Apriori algorithm (R. Agrawal)
Special approaches for microarray data

(Hoh and Ott 2003)

SLIDE 51

The MDR algorithm

What is MDR?

A data mining approach to identify interactions among discrete variables that influence a binary outcome
A nonparametric alternative to traditional statistical methods such as logistic regression
Driven by the need to improve the power to detect gene-gene interactions

(slide: L Mustavich)

SLIDE 52

The 6 steps of MDR

SLIDE 53

MDR Step 1

Divide data (genotypes, discrete environmental factors, and affection status) into 10 distinct subsets

SLIDE 54

MDR Step 2

Select a set of n genetic or environmental factors (which are suspected of epistasis together) from the set of all variables in the training set

SLIDE 55

MDR Step 3

Create a contingency table for these multi-locus genotypes, counting the number of affected and unaffected individuals with each multi-locus genotype

SLIDE 56

MDR Step 4

Calculate the ratio of cases to controls for each multi-locus genotype
Label each multi-locus genotype as “high-risk” or “low-risk”, depending on whether the case-control ratio is above a certain threshold
This is the dimensionality reduction step: it reduces the n-dimensional space to 1 dimension with 2 levels

SLIDE 57

MDR Step 5

To evaluate the model developed in Step 4, use the labels to classify individuals as cases or controls, and calculate the misclassification error
In fact, balanced accuracy is used (the arithmetic mean of sensitivity and specificity), which is mathematically equivalent to classification accuracy when data are balanced
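Steps 3 to 5 can be sketched in R for a single pair of SNPs. The data are simulated and hypothetical, and the threshold used here (the overall case:control ratio in the data) is one possible choice rather than the setting of the MDR software.

```r
set.seed(7)

# Hypothetical training data: two SNPs coded 0/1/2 and a 0/1 case status
n <- 400
snp1 <- sample(0:2, n, replace = TRUE)
snp2 <- sample(0:2, n, replace = TRUE)
status <- rbinom(n, 1, ifelse((snp1 + snp2) %% 2 == 1, 0.7, 0.3))

# Step 3: contingency table of controls and cases per two-locus genotype
cell <- interaction(snp1, snp2, sep = "/")
counts <- table(cell, factor(status, levels = c(0, 1)))
controls <- counts[, "0"]
cases <- counts[, "1"]

# Step 4: a cell is high-risk when its case:control ratio exceeds the
# threshold (assumed here: the overall case:control ratio)
threshold <- sum(cases) / sum(controls)
high_risk <- cases > threshold * controls   # avoids dividing by zero controls

# Step 5: classify everyone by the label of their cell and score the model
predicted <- as.integer(high_risk[as.character(cell)])
sensitivity <- mean(predicted[status == 1] == 1)
specificity <- mean(predicted[status == 0] == 0)
balanced_accuracy <- (sensitivity + specificity) / 2

print(cbind(controls, cases, high_risk))
print(balanced_accuracy)
```

The nine two-locus cells are collapsed into a single binary variable (`high_risk`), which is the dimensionality reduction that gives the method its name.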

SLIDE 58

Repeat Steps 2 to 5

All possible combinations of n factors are evaluated sequentially for their ability to classify affected and unaffected individuals in the training data, and the best n-factor model is selected in terms of minimal misclassification error

SLIDE 59

MDR Step 6

The independent test data from the cross-validation are used to estimate the prediction error (testing accuracy) of the best model selected

SLIDE 60

Towards MDR Final

Steps 1 through 6 are repeated for each possible cross-validation interval
The best model across all 10 training and testing sets is selected on the basis of the criterion:

Maximize the cross-validation consistency = the number of times a particular model was the best model across the cross-validation subsets

The end of a cross-validation procedure also allows computing the
average training accuracy
average testing accuracy
of the best models over all cross-validation sets, and possibly over multiple runs (with different seeds, to reduce the chance of observing spurious results due to chance divisions of the data)
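The cross-validation loop and the cross-validation consistency can be sketched as follows for all 2-factor models. This is an illustration on hypothetical simulated data (5 SNPs, of which the first two interact) with an assumed threshold equal to the training case:control ratio. It is not the implementation of the MDR software.

```r
set.seed(11)

# Hypothetical data: 5 SNPs coded 0/1/2; SNP1 and SNP2 interact
n <- 400; n_snps <- 5
geno <- matrix(sample(0:2, n * n_snps, replace = TRUE), nrow = n,
               dimnames = list(NULL, paste0("SNP", 1:n_snps)))
status <- rbinom(n, 1, ifelse((geno[, 1] + geno[, 2]) %% 2 == 1, 0.7, 0.3))

balanced_accuracy <- function(pred, truth) {
  (mean(pred[truth == 1] == 1) + mean(pred[truth == 0] == 0)) / 2
}

# Fit high/low-risk labels for one SNP pair on the training rows and return
# the balanced accuracy on the training rows and on the testing rows
mdr_pair <- function(pair, train, test) {
  cell <- paste(geno[, pair[1]], geno[, pair[2]], sep = "/")
  lev <- unique(cell)
  cases <- table(factor(cell[train][status[train] == 1], levels = lev))
  controls <- table(factor(cell[train][status[train] == 0], levels = lev))
  high_risk <- cases > (sum(cases) / sum(controls)) * controls
  pred <- as.integer(high_risk[cell])
  c(train = balanced_accuracy(pred[train], status[train]),
    test = balanced_accuracy(pred[test], status[test]))
}

pairs <- combn(n_snps, 2)                   # all 2-factor models
fold <- sample(rep(1:10, length.out = n))   # 10 cross-validation subsets

cv <- lapply(1:10, function(f) {
  train <- which(fold != f); test <- which(fold == f)
  acc <- apply(pairs, 2, mdr_pair, train = train, test = test)
  best <- which.max(acc["train", ])         # best model in the training data
  list(best = best, train = acc["train", best], test = acc["test", best])
})

best_by_fold <- sapply(cv, `[[`, "best")
winner <- as.integer(names(which.max(table(best_by_fold))))
cvc <- max(table(best_by_fold))             # cross-validation consistency

cat("Best pair:", colnames(geno)[pairs[, winner]], "\n")
cat("Cross-validation consistency:", cvc, "out of 10\n")
cat("Average training accuracy:", mean(sapply(cv, `[[`, "train")), "\n")
cat("Average testing accuracy:", mean(sapply(cv, `[[`, "test")), "\n")
```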

SLIDE 61

MDR final

The entire process is repeated for each k=1 to N loci combinations that are computationally feasible and an optimal k-locus model is chosen for each level of k considered.
The final model is based on maximizing two criteria:
maximizing the (average) prediction accuracy
maximizing the (average) cross-validation consistency
Statistical significance is obtained by comparing the average cross-validation consistency from the observed data to the distribution of average consistencies under the null hypothesis of no associations, derived empirically from 1000 permutations

(Ritchie et al 2001, Ritchie et al 2003, Hahn et al 2003)

SLIDE 62

Several measures of fitness to compare models

Balanced accuracy

Balanced accuracy (BA) weighs the classification accuracy of the two classes equally and it is thought to be more powerful than using accuracy alone when data are imbalanced, i.e., when the counts of cases and controls are not equal (Velez et al 2007)
BA is calculated from a 2 × 2 table relating exposure to status by [(sensitivity + specificity)/2].

                Real case   Real control
Model case      TP          FP
Model control   FN          TN

When #cases = #controls, then TP + FN = FP + TN and

$$\mathrm{BA} = \frac{TP + TN}{2 \times \#\text{cases}} = \frac{TP + TN}{\text{total sample size}}$$
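A small numerical illustration in R, with made-up confusion-matrix counts, shows why plain accuracy can mislead on imbalanced data and why the two measures coincide when cases and controls are equally frequent.

```r
# Hypothetical classifier on imbalanced data (100 cases, 900 controls)
# that labels almost everyone as a control
TP <- 10; FN <- 90     # real cases:    predicted case / predicted control
FP <- 9;  TN <- 891    # real controls: predicted case / predicted control

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
accuracy <- (TP + TN) / (TP + FN + FP + TN)
balanced_accuracy <- (sensitivity + specificity) / 2
print(c(accuracy = accuracy, balanced_accuracy = balanced_accuracy))

# With equal numbers of cases and controls the two measures coincide
TP <- 70; FN <- 30; FP <- 20; TN <- 80
acc_balanced <- (TP + TN) / (TP + FN + FP + TN)
ba_balanced <- (TP / (TP + FN) + TN / (TN + FP)) / 2
print(c(accuracy = acc_balanced, balanced_accuracy = ba_balanced))
```

In the imbalanced example the accuracy is high only because controls dominate, while the balanced accuracy stays close to 0.5 and exposes the poor sensitivity.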

SLIDE 63

Several measures of fitness to compare models

Model-adjusted balanced accuracy

Model-adjusted balanced accuracy uses in addition a different threshold in the MDR modeling, one that is based on the actual counts of case and control samples in the data.
When individuals have missing data, it accounts for the precise number of individuals with complete data for that particular multi-locus combination
This makes MDR robust to class imbalances (Velez et al 2007)

SLIDE 64

Hypothesis test of best model

Evaluate magnitude of cross-validation consistency and prediction error estimates by adopting a permutation strategy

In particular:
Randomize disease labels
Repeat MDR analysis several times (1000?) to get distribution of cross-validation consistencies and prediction errors
Use distributions to derive the p-values for the actual cross-validation consistencies and prediction errors
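The permutation strategy can be sketched generically in R. To keep the example short, the function `mdr_stat` below is a hypothetical stand-in for a complete MDR run: it returns the training balanced accuracy of the high/low-risk labelling for one SNP pair, whereas a full analysis would return the cross-validation consistency and prediction error.

```r
set.seed(3)

# Hypothetical data: one SNP pair with an interaction effect
n <- 300
snp1 <- sample(0:2, n, replace = TRUE)
snp2 <- sample(0:2, n, replace = TRUE)
status <- rbinom(n, 1, ifelse((snp1 + snp2) %% 2 == 1, 0.65, 0.35))
cell <- paste(snp1, snp2, sep = "/")

# Stand-in for "run the MDR analysis" (see the assumption stated above)
mdr_stat <- function(status) {
  case_rate <- tapply(status, cell, mean)
  pred <- as.integer(case_rate[cell] > mean(status))
  (mean(pred[status == 1] == 1) + mean(pred[status == 0] == 0)) / 2
}

observed <- mdr_stat(status)

# Randomize disease labels and repeat the analysis to build the null distribution
n_perm <- 1000
null_stats <- replicate(n_perm, mdr_stat(sample(status)))

# Empirical p-value: share of permuted results at least as extreme as observed
p_value <- (sum(null_stats >= observed) + 1) / (n_perm + 1)
print(c(observed = observed, null_mean = mean(null_stats), p_value = p_value))
```

The mean of the null distribution typically lies above 0.5 because the labels are fitted to the same data that are scored, which is why the observed value is compared with permuted results rather than with 0.5.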

SLIDE 65

An Example Empirical Distribution

Sample Quantiles

0%       0.045754
25%      0.168814
50%      0.237763
75%      0.321027
90%      0.423336
95%      0.489813
99%      0.623899
99.99%   0.872345
100%     1

[Figure: frequency plot of the example empirical distribution (y-axis: Frequency)]
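Reading a p-value bracket off tabulated quantiles can be sketched in R with `findInterval`. The quantiles are those tabulated above; the observed value of 0.45 serves as an example.

```r
# Sample quantiles of the example empirical (null) distribution
probs <- c(0, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 0.9999, 1)
quants <- c(0.045754, 0.168814, 0.237763, 0.321027, 0.423336,
            0.489813, 0.623899, 0.872345, 1)

observed <- 0.45   # example observed value

# Index of the largest tabulated quantile that does not exceed the observed value
i <- findInterval(observed, quants)

# The upper-tail probability lies between the tail areas of the two
# neighbouring quantiles
p_bounds <- 1 - probs[c(i + 1, i)]
cat("p-value between", p_bounds[1], "and", p_bounds[2], "\n")
```

With the full vector of permutation results at hand, the empirical p-value can be computed directly as the share of permuted values at least as large as the observed one, and the bracket is only needed when nothing but the quantile table is available.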

SLIDE 66

The probability that we would see results as extreme as, or more extreme than, for instance 0.4500, simply by chance, is between 5% and 10%

(slide: L Mustavich)

The MDR Software

Downloads

Available from www.sourceforge.net
The MDR method is described in further detail by Ritchie et al. (2001) and reviewed by Moore and Williams (2002).
An MDR software package is available from the authors by request, and is described in detail by Hahn et al. (2003). More information can also be found at http://phg.mc.vanderbilt.edu/Software/MDR

SLIDE 67

The authors

Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Hahn, Ritchie, Moore, 2003.

Required operating system software

Linux (Fedora version Core 3): Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_06-b03); Java HotSpot(TM) Client VM (build 1.4.2_06-b03, mixed mode)
Windows (XP Professional and XP Home): Java(TM) 2 Runtime Environment, Standard Edition (build v1.4.2_05)

Minimum system requirements

1 GHz Processor
256 MB RAM
800x600 screen resolution

SLIDE 68

SLIDE 69

Application to simulated data

To show MDR in action, we simulated 200 cases and 200 controls using different multi-locus epistasis models (Evans 2006)
Scenario 1: 10 SNPs, adapted epistasis model M170, minor allele frequencies of disease susceptibility pair 0.5
Scenario 2: 10 SNPs, epistasis model M27, minor allele frequencies of disease susceptibility pair 0.25

M170    0     1     2
0       0     0.1   0
1       0.1   0     0.1
2       0     0.1   0

M27     0     1     2
0       0     0     0
1       0     0.1   0.1
2       0     0.1   0.1

All markers were assumed to be in HWE. No LD between the markers.

SLIDE 70

Application to simulated data

Marginal distributions for the controls

M170    0      1      2
0       0.07   0.12   0.07   0.25
1       0.12   0.26   0.12   0.50
2       0.07   0.12   0.07   0.25
        0.25   0.50   0.25

M27     0      1      2
0       0.15   0.29   0.15   0.58
1       0.10   0.17   0.09   0.36
2       0.02   0.03   0.01   0.06
        0.26   0.49   0.25

Marginal distributions for the cases

M170    0      1      2
0       0.00   0.25   0.00   0.25
1       0.25   0.00   0.25   0.50
2       0.00   0.25   0.00   0.25
        0.25   0.50   0.25

M27     0      1      2
0       0.00   0.00   0.00   0.00
1       0.00   0.57   0.29   0.86
2       0.00   0.10   0.05   0.14
        0.00   0.66   0.33

SLIDE 71

Data format

The definition of the format is as follows:
All fields are tab-delimited.
The first line contains a header row. This row assigns a label to each column of data. Labels should not contain whitespace.
Each following line contains a data row. Data values may be any string value which does not contain whitespace.
The right-most column of data is the class, or status, column. The data values for this column must be 1, to represent "Affected" or "Case" status, or 0, to represent "Unaffected" or "Control" status. No other values are allowed.
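The format rules above can be checked while writing a file. A minimal Python sketch follows; `write_mdr_table` is a hypothetical helper written for illustration, not part of the MDR software.

```python
import csv
import io

def write_mdr_table(rows, labels, handle):
    """Write data rows (class in the last column) as tab-delimited text with a header."""
    if any(any(ch.isspace() for ch in label) for label in labels):
        raise ValueError("labels must not contain whitespace")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(labels)
    for row in rows:
        if len(row) != len(labels):
            raise ValueError("row length does not match the header")
        if str(row[-1]) not in ("0", "1"):
            raise ValueError("class column must be 0 or 1")
        if any(str(v) == "" or any(ch.isspace() for ch in str(v)) for v in row):
            raise ValueError("values must be non-empty and free of whitespace")
        writer.writerow(row)

buffer = io.StringIO()
write_mdr_table([[1, 2, 1], [0, 0, 0]], ["SNP1", "SNP2", "Class"], buffer)
print(buffer.getvalue())
```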

SLIDE 72


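Easy data conversion

> M170data[1,]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,]    1    2    2    2    1    1    1    1    1     1     1     1     1     2     1     1     2     1     2     1

# Each SNP occupies two allele columns (alleles coded 1/2); cases are stacked on top of controls.
# ss (number of subjects, here 400) and nsnps (here 10) must be defined beforehand.
M170data <- rbind(M170.cases, M170.controls)
M170ccdata <- matrix(NA, nrow = ss, ncol = nsnps)
for (i in 1:nsnps) {
  # genotype = allele sum minus 2, i.e. the number of copies of allele 2 (0, 1 or 2)
  M170ccdata[, i] <- apply(M170data[, c(2 * i - 1, 2 * i)], 1, sum) - 2
}
M170ccdata <- cbind(M170ccdata, c(rep(1, 200), rep(0, 200)))  # class: 1 = case, 0 = control
# The MDR format requires a header row, so column labels are written explicitly.
write.table(M170ccdata, "M170ccdata.txt", sep = "\t", quote = FALSE, row.names = FALSE,
            col.names = c(paste0("SNP", 1:nsnps), "Class"))

> M170ccdata[1,]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    1    2    0    0    0    0    1    0    1     1     1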

SLIDE 73


M170 case control data

SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 Class
1 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 … 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 2

SLIDE 74


Loading a data file (MDR 2.0 beta 3)

SLIDE 75


Configuring the analysis

SLIDE 76


Reducing the number of cross-validations

[Figure panels: CV=10 and CV=3] (Motsinger and Ritchie 2006)

SLIDE 77


Reducing the number of cross-validations (CVs)

In general, CV is a useful approach for limiting false positives by assessing the generalizability of models (Coffey et al 2004).

The number of CV intervals in an MDR analysis can be reduced from 10 to 5, but not to 3.

CV seems to be rather important in the MDR algorithm:
Motsinger and Ritchie (2003) showed that, without CV, selection of a final model is difficult. Encouragingly, the false-positive results almost always include at least one correct functional locus.

This indicates that perhaps, in the case of extremely large datasets such as genome-wide scans, where using any type of CV would be computationally infeasible, MDR could still be used (without CV) to identify at least one functional locus…

SLIDE 78


Search method configuration

SLIDE 79


Running the MDR analysis

SLIDE 80


Summary of results

SLIDE 81


Best MDR model

SLIDE 82


MDR best model

SLIDE 83


Values calculated by MDR

Measure: Formula / Interpretation
Balanced Accuracy: (Sensitivity+Specificity)/2; the fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class
Accuracy: (TP+TN)/(TP+TN+FP+FN); proportion of instances correctly classified (skewed in favor of the larger class)
Sensitivity: TP/(TP+FN); proportion of actual positives (cases) that are classified as positive
Specificity: TN/(TN+FP); proportion of actual negatives (controls) that are classified as negative
Odds Ratio: (TP*TN)/(FP*FN); compares whether the probability of a certain event is the same for two groups

SLIDE 84


Values calculated by MDR

Measure: Formula / Interpretation
Precision: TP/(TP+FP); proportion of positive classifications that are correct
Kappa: 2(TP*TN-FP*FN)/[(TP+FN)(FN+TN)+(TP+FP)(FP+TN)]; a function of total accuracy and random accuracy
X2: chi-squared score for the attribute constructed by MDR from this attribute combination
F-Measure: 2*TP/(2*TP+FP+FN); a function of sensitivity and precision

TP: true positive; TN: true negative; FP: false positive; FN: false negative
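The measures listed on the last two slides can all be computed from the four cell counts. A minimal Python sketch follows; the function name and the example counts are illustrative. Kappa is written in the standard Cohen form, with a minus sign in the numerator, and all four counts are assumed to be non-zero.

```python
def mdr_measures(tp, tn, fp, fn):
    """Classification measures from a 2x2 table of true/false positives/negatives."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "odds_ratio": (tp * tn) / (fp * fn),
        "precision": tp / (tp + fp),
        # Cohen's kappa for a 2x2 table
        "kappa": 2 * (tp * tn - fp * fn)
                 / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts: 200 cases and 200 controls
measures = mdr_measures(tp=160, tn=150, fp=50, fn=40)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```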

SLIDE 85


MDR CV results

0.8028 0.8028 0.8056 0.7861 0.7972 0.8000 0.8056 0.7889 0.7944 0.7917
average = 0.79751

SLIDE 86


MDR best model

Graphical display on whole data
If-then rules on whole data

SLIDE 87


The fitness landscape

Gives the fitness landscape across all models as a line chart (the default).
The models produced are on the x-axis of the chart. The models on the x-axis are in the order in which they were generated (e.g., 1, 2, 3, …, 12, 13, 14, …).
Training accuracy is shown on the y-axis.

SLIDE 88


The fitness landscape

SNP1 0.5127778
SNP2 0.5286111
SNP3 0.52527773
SNP4 0.51555556
SNP5 0.5875
SNP6 0.5127778
SNP7 0.5158334
SNP8 0.5141667
SNP9 0.5144445
SNP10 0.5233334
SNP1,SNP2 0.7975
SNP1,SNP4 0.5375
SNP1,SNP5 0.5916667
SNP1,SNP3 0.5372222

…

SLIDE 89


Locus Dendrogram

The dendrogram provides a graphical representation of the interactions between attributes (and the strength of those interactions) from the MDR analysis (max nr of interactions asked for) using an “interaction dendrogram”.

The purpose of the interaction dendrogram is to assist the user with determining the nature of the interactions (redundant, additive, or synergistic).

SLIDE 90


Locus Dendrogram

The dendrogram is constructed using hierarchical cluster analysis with average linkage.

The distance matrix used by the cluster analysis is constructed by calculating the information gained by constructing two attributes using the MDR function (Moore et al 2006, Jakulin and Bratko 2003, Jakulin et al 2003).

SLIDE 91


Raw entropy values

Entropy is a measure of randomness or disorder within a system. More specifically, the lower the entropy, the more predictable the state of the system.

A classic example of this principle is the melting of a glass of ice: the state becomes more disordered as the entropy increases.

A graphical illustration of the relationships between information theoretic measures on the joint distribution of attributes A and B. The surface area of a section corresponds to the labeled quantity (Jakulin 2003) [I(A;B) = mutual information = the amount of information provided by A about B = information gain; H(A) = entropy of A]

SLIDE 92


Raw entropy values

Let us assume an attribute, A. We have observed its probability distribution, P_A(a). Shannon’s entropy, measured in bits, is a measure of the predictability of an attribute and is defined as:

\[ H(A) = -\sum_{a} P_A(a) \log_2 P_A(a) \]

Hence, phrased differently, the higher the entropy, the less reliable are our predictions about A. We can understand H(A) as the amount of uncertainty about A, as estimated from its probability distribution.
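As a small numeric illustration of Shannon's entropy in bits, here is a Python sketch; the function name is illustrative. Terms with zero probability are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))         # two equally likely values
print(shannon_entropy([0.25, 0.5, 0.25]))  # HWE genotype frequencies at MAF 0.5
print(shannon_entropy([1.0]))              # a fully predictable attribute
```

A fully predictable attribute has entropy 0, and spreading probability more evenly over more values increases the entropy.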

SLIDE 93


Raw entropy values

Single Attribute Values:
H(A): This is the entropy of the given attribute (A)
H(A|C): This is the entropy of the given attribute (A) given the class (C)
I(A;C): This is the information gain of the given attribute (A) given the class (C)

Pairwise Values:
H(AB): This is the entropy of the given constructed attribute (AB)
H(AB|C): This is the entropy of the given constructed attribute (AB) given class (C)
I(A;B): This is the information gain of attribute (A) given attribute (B)
I(A;B;C): This is the information gain for attribute (A) or attribute (B) given class (C)
I(AB;C): This is the information gain for the constructed attribute (AB) given class (C)
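Several of these quantities can be estimated from observed data by plugging in relative frequencies. The following Python sketch is an illustration on toy data, not the MDR implementation; the constructed attribute AB is taken to be the pair of values (A, B).

```python
import math
from collections import Counter

def entropy(values):
    """Plug-in entropy (bits) of a list of observed values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(values, given):
    """H(X|Y) = H(X,Y) - H(Y)."""
    return entropy(list(zip(values, given))) - entropy(given)

def information_gain(values, given):
    """I(X;Y) = H(X) - H(X|Y)."""
    return entropy(values) - conditional_entropy(values, given)

# Toy data: A determines the class C, B is unrelated to it
A = [0, 0, 0, 0, 1, 1, 1, 1]
B = [0, 1, 0, 1, 0, 1, 0, 1]
C = [0, 0, 0, 0, 1, 1, 1, 1]
AB = list(zip(A, B))  # constructed attribute

print("H(A)    =", entropy(A))
print("H(A|C)  =", conditional_entropy(A, C))
print("I(A;C)  =", information_gain(A, C))
print("I(B;C)  =", information_gain(B, C))
print("H(AB)   =", entropy(AB))
print("I(AB;C) =", information_gain(AB, C))
```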

SLIDE 94


Raw entropy values

Mutual information I(A;B) as a function of r² (as a measure of LD between markers), for a subset of the Spanish Bladder Cancer data (SBCS); unpublished results

SLIDE 95


SLIDE 96


Locus dendrogram

The colors range from red, representing a high degree of synergy (positive information gain), through orange, a lesser degree, to gold, representing the midway point between synergy and redundancy.

On the redundancy end of the spectrum, the highest degree is represented by blue (negative information gain), with a lesser degree represented by green.

Synergy: the interaction between two attributes provides more information than the sum of the individual attributes.
Redundancy: the interaction between attributes provides redundant information.

SLIDE 97


Positive and negative interactions

Say I(A;B;C) = I(A,B;C) − I(A;C) − I(B;C)
Assume that we are uncertain about the value of C, but we have information about A and B.
Knowledge of A alone eliminates I(A;C) bits of uncertainty from C.
Knowledge of B alone eliminates I(B;C) bits of uncertainty from C.
However, the joint knowledge of A and B eliminates I(A,B;C) bits of uncertainty.

Hence, if interaction information is positive, we benefit from an unexpected synergy. If interaction information is negative, we suffer diminishing marginal returns by introducing attributes that partly contribute redundant information.
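The sign of the interaction information can be seen on two toy data sets. In this Python sketch (illustrative names, plug-in estimates from counts), an XOR pattern gives pure synergy, while a duplicated attribute gives pure redundancy.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def interaction_information(a, b, c):
    """I(A;B;C) = I(A,B;C) - I(A;C) - I(B;C)."""
    ab = list(zip(a, b))
    return mutual_information(ab, c) - mutual_information(a, c) - mutual_information(b, c)

A = [0, 0, 1, 1, 0, 0, 1, 1]
B = [0, 1, 0, 1, 0, 1, 0, 1]

# Synergy: C is the XOR of A and B, so neither attribute alone is informative
C_xor = [a ^ b for a, b in zip(A, B)]
synergy = interaction_information(A, B, C_xor)

# Redundancy: B is a copy of A, and C equals A
redundancy = interaction_information(A, A, A)

print(synergy, redundancy)
```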

SLIDE 98


Significance of the results

We simulated data from a two-locus epistasis model. The remaining SNPs were generated at random…

Hence, what does it mean that the best single effects model SNP5 was chosen?
Answer: Every k-locus setting will give rise to a “best” model. MDR forces an optimal model for every k-locus setting.

SLIDE 99


Significance of results

The best model among all 1-3 locus models is the one with maximal cross-validation consistency and maximum average balanced prediction accuracy.

But how significant is this result?

SLIDE 100


Configuring the permutation analysis (MDR PT Module 0.4.8 alpha)

SLIDE 101


Performing the MDR permutation test

SLIDE 102


Performing the MDR permutation test

SLIDE 103


Performing the MDR permutation test

                        SNP5              SNP1-SNP2          SNP1-SNP2-SNP5
Testing BA (p-value)    0.5875 (0.0540)   0.7975 (<0.0010)   0.7950 (<0.0010)
CVC (p-value)           10 (0.2160)       10 (0.2160)        10 (0.2160)

Obtained from MDR summary table
Obtained from MDR Permutation Testing p-value calculator

SLIDE 104


Performing the MDR permutation test

Permutation null distribution for best k=1-3 models

                        SNP5              SNP1-SNP2          SNP1-SNP2-SNP5
Testing BA (p-value)    0.5875 (0.0540)   0.7975 (<0.0010)   0.7950 (<0.0010)
CVC (p-value)           10 (0.2160)       10 (0.2160)        10 (0.2160)

Permutation null distribution for best k-locus model (hence 3 distributions)

                        SNP5                     SNP1-SNP2                SNP1-SNP2-SNP5
Testing BA (p-value)    0.5875 (0.0060-0.0070)   0.7975 (0.0000-0.0010)   0.7950 (0.0000-0.0010)
CVC (p-value)           10 (0.1720)              10 (0.0570)              10 (0.0440)
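The idea behind a permutation p-value can be sketched in a few lines of Python. This is a generic illustration for one single-locus statistic, not the MDR PT module: the statistic is the balanced accuracy of an MDR-style high-risk/low-risk rule on the whole data (no cross-validation), and the p-value convention (hits + 1)/(permutations + 1) is an assumption made here. Both classes are assumed to be non-empty.

```python
import random
from collections import Counter

def mdr_balanced_accuracy(genotypes, labels):
    """Balanced accuracy of the rule 'a genotype is high-risk when its
    case:control ratio exceeds the overall case:control ratio'."""
    n_case = sum(labels)
    n_ctrl = len(labels) - n_case
    cases = Counter(g for g, y in zip(genotypes, labels) if y == 1)
    ctrls = Counter(g for g, y in zip(genotypes, labels) if y == 0)
    high = {g for g in set(genotypes) if cases[g] * n_ctrl > ctrls[g] * n_case}
    tp = sum(cases[g] for g in high)
    tn = sum(ctrls[g] for g in ctrls if g not in high)
    return (tp / n_case + tn / n_ctrl) / 2

def permutation_p_value(genotypes, labels, n_perm=200, seed=1):
    """Shuffle the class labels to build a null distribution of the statistic."""
    rng = random.Random(seed)
    observed = mdr_balanced_accuracy(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mdr_balanced_accuracy(genotypes, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: one binary marker that separates cases from controls
genotypes = [1] * 20 + [0] * 20
labels = [1] * 20 + [0] * 20
observed, p_value = permutation_p_value(genotypes, labels)
print(observed, p_value)
```

Using one null distribution per k-locus setting, as in the second table, amounts to recomputing the best k-locus statistic inside every permutation.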

SLIDE 105


What is going on?

Permutation null distribution for best k-locus model (hence 3 distributions)

                        SNP5                     SNP1-SNP2                SNP1-SNP2-SNP5
Testing BA (p-value)    0.5875 (0.0060-0.0070)   0.7975 (0.0000-0.0010)   0.7950 (0.0000-0.0010)
CVC (p-value)           10 (0.1720)              10 (0.0570)              10 (0.0440)

Effect of “strong” main effect is carried through in higher order interactions?

What will happen for data simulated under M27 (with main effects by simulation)?

SLIDE 106


Results for M27

SLIDE 107


Results for M27

Permutation null distribution for best k=1-3 models

                        SNP1-SNP2          SNP1-SNP2-SNP4
Testing BA (p-value)    0.8325 (<0.0010)   0.8600 (<0.0010)
CVC (p-value)           10 (0.2310)        5 (0.9110)

What about SNP2? Why is this not highlighted as an important main effect?
Maximizing CVC first and then looking at prediction accuracy highlights SNP1-SNP2. Maximizing prediction accuracy alone would point towards SNP1-SNP2-SNP4.

SLIDE 108


Results for M27

Using permutation null distributions per k-locus setting, the following results are obtained:

Permutation null distribution for best k-locus model (hence 3 distributions)

                        SNP1               SNP1-SNP2          SNP1-SNP2-SNP4
Testing BA (p-value)    0.7875 (<0.0010)   0.8325 (<0.0010)   0.8600 (<0.0010)
CVC (p-value)           10 (0.1790)        10 (0.0620)        5 (0.9110)

Wouldn’t it be natural to correct for SNP1 when looking for interactions?
What if more than one main effect is present in the data?

SLIDE 109


Strengths of MDR

Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multi-locus data

Non-parametric: no parameters are estimated
Assumes no particular genetic model
Minimal false-positive rates

SLIDE 110


Weaknesses of MDR

Computationally intensive (especially with >10 loci)
The original MDR software supports disease models with up to 15 factors at a time from a list of up to 500 total factors and a maximum sample size of 4,000 subjects.

Parallel MDR (Bush et al 2006) is a redesign of the initial MDR algorithm to allow an unlimited number of study subjects, total variables and variable states, and to remove restrictions on the order of interactions being analyzed. The algorithm gives an approximate 150-fold decrease in runtime for equivalent analyses.

The curse of dimensionality: decreased predictive ability with high dimensionality and small sample due to cells with no data

SLIDE 111


Several (other) extensions to the MDR paradigm (CV based)

(Lou et al 2008)

SLIDE 112


Different measure to score model quality

One crucial component of the MDR algorithm measures the percentage of cases and controls incorrectly labelled by the proposed classification: the classification error.

The combination of variables that produces the lowest classification error is selected as the best or most fit model.

The correctly and incorrectly labelled cases and controls can be expressed as a two-way contingency table.

The ability of MDR to detect gene-gene interactions can be improved by replacing classification error with a different measure to score model quality.

Of 10 measures evaluated, Bush et al (2008) found that the likelihood ratio and normalized mutual information (NMI) are measures that consistently improve the detection and power of MDR in simulated data over using classification error. These measures also reduce the inclusion of spurious variables in a multi-locus model.

SLIDE 113


Contingency table measures of classification performance (Bush et al 2008)

SLIDE 114


Towards an easy-to-adapt framework

(Lou et al 2008)

MB-MDR FAM-MDR

SLIDE 115


MB-MDR as a semi-parametric approach for unrelateds

Step 1: New risk cell identification via association test on each genotype cell c_j
Parametric or non-parametric test of association

Step 2: Test one-dimensional “genetic” construct X on Y

Step 3: Assess significance
W = [b/se(b)]^2, b = ln(OR)
Derive correct null distribution for W

(Calle et al 2007, Calle et al 2008)
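The Wald statistic of Step 3 can be illustrated on a 2x2 table of risk group by disease status. This Python sketch assumes that se(b) is the usual large-sample standard error of a log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), and that all four counts are non-zero; the counts are hypothetical.

```python
import math

def wald_statistic(case_high, control_high, case_low, control_low):
    """Return b = ln(OR), se(b) and W = (b / se(b))^2 for a 2x2 table."""
    b = math.log((case_high * control_low) / (control_high * case_low))
    se = math.sqrt(1 / case_high + 1 / control_high + 1 / case_low + 1 / control_low)
    return b, se, (b / se) ** 2

# Hypothetical counts: high-risk group 30 cases / 10 controls, low-risk group 20 / 40
b, se, w = wald_statistic(30, 10, 20, 40)
print(round(math.exp(b), 2), round(se, 4), round(w, 2))
```

Because the risk groups are built from the same data that are then tested, W cannot simply be referred to a chi-square distribution with 1 degree of freedom, which is why the slide asks for the correct null distribution.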

SLIDE 116


Motivation 1 for MB-MDR

Some important interactions could be missed by MDR due to pooling too many cells together

(Calle et al 2008)

SLIDE 117


Motivation 2 for MB-MDR

MDR cannot deal with main effects / confounding factors / non-dichotomous outcomes

SLIDE 118


Motivation 3 for MB-MDR

MDR has low performance in the presence of genetic heterogeneity

(Calle et al 2008)

SLIDE 119


A comparison of analytical methods: GENN, RF, FITF, MDR, logistic regression

GENN

Grammatical evolution neural network (GENN) is a novel pattern recognition method developed to detect main effects or multi-locus models of association without exhaustively searching all possible multi-locus combinations.

Grammatical evolution (GE) is a machine-learning algorithm inspired by the biological process of transcription and translation. GE uses a genetic algorithm in combination with a pre-specified grammar (set of translation rules) to automatically evolve an optimal computer program.

GENN utilizes GE to evolve the inputs (predictor variables), architecture (arrangement of layers and functions), and weights of a neural network (NN) to optimally classify a given dataset.

(Motsinger-Reif et al 2008)

SLIDE 120


A schematic overview of the GENN method

(Motsinger-Reif 2008)

SLIDE 121


Random Forests (RF)

RF is a machine-learning technique that builds a forest of classification trees wherein each component tree is grown from a bootstrap sample of the data, and the variable at each tree node is selected from a random subset of all variables in the data (Breiman, 2001). The final classification of an individual is determined by voting over all trees in the forest.

RF models may uncover interactions among factors that do not exhibit strong marginal effects, without demanding a pre-specified model (McKinney et al., 2006).

Additionally, tree methods are suited to dealing with certain types of genetic heterogeneity, since splits near the root node define separate model subsets in the data.

(Motsinger-Reif et al 2008)

SLIDE 122


Random Forests (RF)

Each tree in the forest is constructed as follows from data having N individuals and M explanatory variables:

Choose a training sample by selecting N individuals, with replacement, from the entire data set.
At each node in the tree, randomly select m variables from the entire set of M variables in the data. The absolute magnitude of m is a function of the number of variables in the data set and remains constant throughout the forest building process.
Choose the best split at the current node from among the subset of m variables selected above.
Iterate the second and third steps until the tree is fully grown (no pruning).

Repetition of this algorithm yields a forest of trees, each of which has been trained on bootstrap samples of individuals.

(Motsinger-Reif et al 2008)
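The tree-growing steps above can be sketched in plain Python. This is a simplified illustration, not Breiman's reference implementation: splits are multiway on discrete genotype values and scored with Gini impurity, a node whose m candidate variables are all constant becomes a majority-vote leaf, and m defaults to the square root of M. All function names are illustrative.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; every node draws a fresh random subset of m variables."""
    if len(set(labels)) == 1:
        return labels[0]                                  # pure leaf
    best = None
    for var in rng.sample(range(len(rows[0])), m):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[var], []).append(y)
        if len(groups) < 2:
            continue                                      # constant here, cannot split
        impurity = sum(len(g) * gini(g) for g in groups.values()) / len(labels)
        if best is None or impurity < best[0]:
            best = (impurity, var)
    majority = Counter(labels).most_common(1)[0][0]
    if best is None:
        return majority                                   # simplification: stop here
    var = best[1]
    children = {}
    for value in set(row[var] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[var] == value]
        children[value] = grow_tree([rows[i] for i in idx],
                                    [labels[i] for i in idx], m, rng)
    return (var, children, majority)

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        var, children, majority = tree
        if row[var] not in children:
            return majority                               # value unseen in training
        tree = children[row[var]]
    return tree

def random_forest(rows, labels, n_trees=25, m=None, seed=0):
    rng = random.Random(seed)
    n, n_vars = len(rows), len(rows[0])
    m = m or max(1, int(n_vars ** 0.5))
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]     # N draws with replacement
        forest.append(grow_tree([rows[i] for i in sample],
                                [labels[i] for i in sample], m, rng))
    return forest

def predict_forest(forest, row):
    """Final classification by majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Hypothetical data: the class equals the first SNP; the other two SNPs are noise
rows = [[g0, g1, g2] for g0 in (0, 1) for g1 in (0, 1, 2) for g2 in (0, 1, 2)] * 2
labels = [row[0] for row in rows]
forest = random_forest(rows, labels)
print(predict_forest(forest, [1, 0, 2]))
```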

SLIDE 123


A schematic overview of the RF method

(Motsinger-Reif et al 2008)

SLIDE 124


Advantages of the Random Forest method

It can handle a large number of input variables.
It estimates the relative importance of variables in determining classification, thus providing a metric for feature selection.
RF produces a highly accurate classifier with an internal unbiased estimate of generalizability during the forest building process.
RF is fairly robust in the presence of etiological heterogeneity and relatively high amounts of missing data (Lunetta et al., 2004).
Finally, and of increasing importance as the number of input variables increases, learning is fast and computation time is modest even for very large data sets (Robnik-Sikonja, 2004).

(Motsinger-Reif et al 2008)

SLIDE 125


Focused Interaction Testing Framework (FITF)

The FITF was recently developed to detect epistatic interactions that predict disease risk. Details of the FITF algorithm and software can be found in Millstein et al. (2006).

FITF is a modification of the interaction testing framework (ITF) method, which pre-screens all possible gene sets to focus on those that potentially are the most informative and reduce the multiple testing problem by reducing the number of statistical tests performed.

FITF has been shown to outperform MDR when interactions involved additive, recessive, or dominant genes (Millstein et al., 2006).

(Motsinger-Reif et al 2008)

SLIDE 126


Focused Interaction Testing Framework (FITF)

The FITF algorithm modifies the ITF approach to reduce the overall number of variants tested with an initial filter process. A chi-square goodness-of-fit statistic that compares the observed with the expected Bayesian distribution of multi-locus genotype combinations in a combined case-control population is used in a prescreening initial stage.

This statistic, referred to as the chi-square subset (CSS), has the form:

\[ \mathrm{CSS} = \sum_{i=1}^{r} \frac{\left(n_i - E(n_i)\right)^2}{E(n_i)} \]

where n_i is the observed number of subjects (regardless of case/control status) in the i-th genotype group and r is the total number of genotype groups. The expected n_i, noted as E(n_i), is estimated based on the sample marginal genotype frequencies of each gene.
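For two loci, a goodness-of-fit statistic of this kind can be computed directly from the pooled table of genotype counts. The Python sketch below assumes the standard chi-square form, sum over groups of (n_i - E(n_i))^2 / E(n_i), with E(n_i) taken as the product of the sample marginal genotype frequencies times the sample size; it illustrates the idea and is not the FITF software.

```python
def chi_square_subset(counts):
    """Goodness-of-fit statistic for a two-locus table of pooled (case + control) counts.

    counts[i][j] is the number of subjects with genotype i at the first locus
    and genotype j at the second locus.
    """
    n = sum(sum(row) for row in counts)
    row_freq = [sum(row) / n for row in counts]
    col_freq = [sum(row[j] for row in counts) / n for j in range(len(counts[0]))]
    css = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = n * row_freq[i] * col_freq[j]
            if expected > 0:                     # skip groups that cannot occur
                css += (observed - expected) ** 2 / expected
    return css

independent = [[10, 20, 10], [20, 40, 20], [10, 20, 10]]
associated = [[30, 10], [10, 30]]
print(chi_square_subset(independent), chi_square_subset(associated))
```

A table whose cells are exactly the product of its margins gives 0, while departures from that product, in the pooled sample, increase the statistic.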

SLIDE 127


Conclusion on comparison

MDR results in the one- and two-locus models were comparable to GENN.
GENN performs poorly with the three-locus models considered in Motsinger-Reif et al (2008).

This highlights a disadvantage of an evolutionary computation approach in exploring purely epistatic models: it is much less likely that three loci will be stochastically assembled into a model to evaluate than two loci.

Both GENN and MDR outperformed FITF.
Because GENN and MDR both utilize permutation distributions for significance testing, correction for multiple testing is unnecessary. While the filter stage of FITF does reduce the number of tests performed with the ITF strategy, there is still a very large number of tests that must be corrected for.

SLIDE 128


Conclusion on comparison

Both RF and stepLR (stepwise logistic regression) were unable to detect purely epistatic models, since both require marginal main effects to perform variable selection tasks.

Future extensions/modifications of these approaches should consider this limitation and modify the variable selection process to capture pure interactions.

Some groups have in fact begun to make modifications in this way (Bureau et al., 2005)

SLIDE 129


2.c Interpretation of multi-locus results

It is always a good idea to use several model selection criteria before “interpreting”

(Ritchie et al 2007)

SLIDE 130


A flexible framework for analysis acknowledging interpretation capability

The framework contains four steps to detect, characterize, and interpret epistasis:

Select interesting combinations of SNPs
Construct new attributes from those selected
Develop and evaluate a classification model using the newly constructed attribute(s)
Interpret the final epistasis model using visual methods

(Moore et al 2005)

SLIDE 131


Flexible framework Step 1

Attribute selection
Use entropy-based measures of information gain (IG) and interaction
Evaluate the gain in information about a class variable (e.g. case-control status) from merging two attributes together
This measure of IG allows us to gauge the benefit of considering two (or more) attributes as one unit

(slide: Chen 2007)

SLIDE 132


Information gain

Recall McGill’s multiple mutual information (Te Sun Han 1980):

\[ I(A;B;C) = I(A;B \mid C) - I(A;B) \quad \text{(information gain)} \]

If I(A;B;C) > 0
Evidence for an attribute interaction that cannot be linearly decomposed
If I(A;B;C) < 0
The information between A and B is redundant
If I(A;B;C) = 0
Evidence of conditional independence or a mixture of synergy and redundancy
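This conditional form agrees with the expression I(A;B;C) = I(A,B;C) − I(A;C) − I(B;C) used earlier for positive and negative interactions. A short derivation, expanding every term into entropies (standard identities, with H(AB) denoting the joint entropy of A and B):

```latex
\begin{align*}
I(A;B \mid C) - I(A;B)
  &= \bigl[H(AC) + H(BC) - H(ABC) - H(C)\bigr] - \bigl[H(A) + H(B) - H(AB)\bigr] \\
  &= H(AB) + H(AC) + H(BC) - H(A) - H(B) - H(C) - H(ABC), \\[4pt]
I(A,B;C) - I(A;C) - I(B;C)
  &= \bigl[H(AB) + H(C) - H(ABC)\bigr] - \bigl[H(A) + H(C) - H(AC)\bigr]
     - \bigl[H(B) + H(C) - H(BC)\bigr] \\
  &= H(AB) + H(AC) + H(BC) - H(A) - H(B) - H(C) - H(ABC).
\end{align*}
```

Both expressions reduce to the same symmetric combination of entropies, so the sign conventions (positive for synergy, negative for redundancy) coincide.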

SLIDE 133


Illustration of entropy-based measures on Model 1 (Ritchie et al 2001)

SLIDE 134


Attribute selection based on entropy

Entropy-based IG is estimated for each individual attribute (i.e. main effects) and each pairwise combination of attributes (i.e. interaction effects).

Pairs of attributes are sorted and those with the highest IG, or percentage of entropy in the class removed, are selected for further consideration

(slide: Chen 2007)
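The ranking of attribute pairs can be sketched in Python as follows. The score used here is I(AB;C)/H(C), the fraction of class entropy removed by the merged attribute; the function names and toy data are illustrative, and the class is assumed to take more than one value so that H(C) > 0.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(values):
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def rank_pairs(rows, labels):
    """Sort attribute pairs by the fraction of class entropy their merged attribute removes."""
    h_class = entropy(labels)
    ranking = []
    for i, j in combinations(range(len(rows[0])), 2):
        merged = [(row[i], row[j]) for row in rows]
        ranking.append(((i, j), mutual_information(merged, labels) / h_class))
    return sorted(ranking, key=lambda item: item[1], reverse=True)

# Toy data: the class is the XOR of attributes 0 and 1; attribute 2 is noise
rows = [[a, b, n] for a in (0, 1) for b in (0, 1) for n in (0, 1)]
labels = [a ^ b for a, b, n in rows]
ranking = rank_pairs(rows, labels)
for pair, fraction in ranking:
    print(pair, round(fraction, 3))
```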

SLIDE 135


Attribute selection based on ReliefF

The Relief statistic was developed by the computer science community as a powerful method for determining the quality or relevance of an attribute (i.e. variable) for predicting a discrete endpoint or class variable (Kira and Rendell 1992, Kononenko 1994, Robnik-Sikonja and Kononenko 2003).

Relief is especially useful when there is an interaction between two or more attributes and the discrete class variable.

It is thus superior to univariate filters such as a chi-square test of independence (see later) when interactions are present.

SLIDE 136


Attribute selection based on ReliefF

In particular, Relief estimates the quality of attributes through a type of nearest neighbor algorithm that selects neighbors (instances) from the same class and from the different class based on the vector of values across attributes.

Weights (W) or quality estimates for each attribute (A) are estimated based on whether the nearest neighbor (nearest hit, H) of a randomly selected instance (R) from the same class and the nearest neighbor from the other class (nearest miss, M) have the same or different values.

This process of adjusting weights is repeated for m instances.
The algorithm produces weights for each attribute ranging from -1 (worst) to +1 (best).
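The weight update can be made concrete with a short Python sketch of the basic Relief algorithm for discrete attributes and two classes. It uses a single nearest hit and a single nearest miss under Hamming distance, so it is simpler than ReliefF, which averages over several neighbors; names and data are illustrative, and each class is assumed to contain at least two instances.

```python
import random

def relief(rows, labels, n_iter=None, seed=0):
    """Basic Relief: reward attributes that differ on the nearest miss,
    penalize attributes that differ on the nearest hit."""
    rng = random.Random(seed)
    n, n_attr = len(rows), len(rows[0])
    m = n_iter or n
    weights = [0.0] * n_attr

    def distance(i, j):
        return sum(a != b for a, b in zip(rows[i], rows[j]))   # Hamming distance

    for _ in range(m):
        r = rng.randrange(n)                                   # random instance R
        hits = [i for i in range(n) if i != r and labels[i] == labels[r]]
        misses = [i for i in range(n) if labels[i] != labels[r]]
        hit = min(hits, key=lambda i: distance(r, i))          # nearest hit H
        miss = min(misses, key=lambda i: distance(r, i))       # nearest miss M
        for a in range(n_attr):
            diff_miss = rows[r][a] != rows[miss][a]
            diff_hit = rows[r][a] != rows[hit][a]
            weights[a] += (diff_miss - diff_hit) / m
    return weights

# Toy data: the class is the XOR of attributes 0 and 1; attribute 2 is noise
rows = [[a, b, n] for a in (0, 1) for b in (0, 1) for n in (0, 1)] * 2
labels = [a ^ b for a, b, n in rows]
weights = relief(rows, labels)
print(weights)
```

Although neither interacting attribute has a marginal effect in this example, the noise attribute receives weight 0 while the two interacting attributes share all of the positive weight.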

SLIDE 137


Attribute selection based on ReliefF (applied to M27)

Only the top 10% of scores will be returned to the filtered data set

SLIDE 138


Attribute selection based on ReliefF (applied to M27)

For the M27 simulated data, this reduction of the overall attribute count does not make sense, of course (# SNPs = 10!)

SLIDE 139


Attribute selection based on TuRF

ReliefF is able to capture attribute interactions because it selects nearest neighbors using the entire vector of values across all attributes.

However, this advantage is also a disadvantage because the presence of many noisy attributes can reduce the signal the algorithm is trying to capture. The “tuned” ReliefF algorithm (TuRF) systematically removes attributes that have low quality estimates so that the ReliefF values of the remaining attributes can be re-estimated.

(Moore and White 2008)

SLIDE 140


Attribute selection based on chi-squared

The MDR software provides a simple chi-square test of independence as a univariate filter.

The manual specifies that this filter should be used to condition your MDR analysis on those attributes that have an independent main effect.

However, the MDR software itself does not give you a lot of options to actually perform this conditioning …

The ReliefF filter will be more useful for capturing those attributes that are likely to be involved in an interaction.

SLIDE 141


Attribute selection based on chi-squared (applied to M27)

SLIDE 142


Attribute selection based on odds ratio

The odds ratio (OR) is a way of comparing whether the probability of a certain event is the same for two groups.

An odds ratio of 1 implies that the event is equally likely in both groups.
An odds ratio that is greater than 1 implies that the event is more likely in the first group, whereas a value less than 1 implies that the event is less likely in the first group.

When an attribute is polytomous (i.e. more than 2 levels), MDR calculates the OR for each possible contrast and then reports the largest OR value.

For 3 levels 0, 1, 2, the following contrasts are considered

0 vs 1 ; 0 vs 2 ; 1 vs 2 ; 0 vs 1&2 ; 1 vs 0&2 ; 2 vs 1&0
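The six contrasts for a three-level attribute can be enumerated programmatically. In the Python sketch below, each contrast is evaluated in the direction listed (first group versus second group) and the maximum is returned; whether the MDR software also considers the reverse direction is not stated on the slide, so that choice is an assumption, as are the function names and the hypothetical counts. All counts are assumed to be non-zero.

```python
def odds_ratio(case_a, ctrl_a, case_b, ctrl_b):
    """Odds of being a case in group a relative to group b."""
    return (case_a * ctrl_b) / (ctrl_a * case_b)

def contrasts(levels):
    """Each level against each other level, then each level against the rest pooled."""
    pairs = [((a,), (b,)) for i, a in enumerate(levels) for b in levels[i + 1:]]
    if len(levels) > 2:
        pairs += [((a,), tuple(l for l in levels if l != a)) for a in levels]
    return pairs

def largest_odds_ratio(cases, controls):
    """cases / controls map each attribute level to its count in that class."""
    best = None
    for first, second in contrasts(sorted(cases)):
        value = odds_ratio(sum(cases[l] for l in first), sum(controls[l] for l in first),
                           sum(cases[l] for l in second), sum(controls[l] for l in second))
        if best is None or value > best[1]:
            best = ((first, second), value)
    return best

# Hypothetical genotype counts for one SNP
cases = {0: 20, 1: 50, 2: 30}
controls = {0: 40, 1: 45, 2: 15}
contrast, value = largest_odds_ratio(cases, controls)
print(contrast, round(value, 3))
```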

SLIDE 143


Flexible framework Step 2

Constructive induction, for instance MDR
A multi-locus genotype combination is considered high-risk if the ratio of cases to controls exceeds a given threshold T, else it is considered low-risk
Genotype combinations considered to be high-risk are labeled G1 while those considered low-risk are labeled G0.
This process constructs a new one-dimensional attribute with levels G0 and G1

(slide: Chen 2007)
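
The labeling rule of Step 2 can be sketched as follows for a pair of loci. This is a minimal illustration under stated assumptions, not the MDR software's code: the function name, the toy data, and the default threshold T = 1 are all hypothetical choices.

```python
from collections import defaultdict


def mdr_construct(genotype_combos, status, threshold=1.0):
    """Map each multi-locus genotype combination to G1 (high-risk) or G0 (low-risk)."""
    cases = defaultdict(int)
    controls = defaultdict(int)
    for combo, y in zip(genotype_combos, status):
        if y == 1:
            cases[combo] += 1
        else:
            controls[combo] += 1
    labels = {}
    for combo in set(cases) | set(controls):
        # Cross-multiplied form of cases/controls > T, so cells without
        # controls do not cause a division by zero.
        high_risk = cases[combo] > threshold * controls[combo]
        labels[combo] = "G1" if high_risk else "G0"
    return labels


# Toy data (hypothetical): two-locus genotypes and status (1 = case, 0 = control)
combos = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (0, 1), (0, 0)]
status = [1, 0, 0, 1, 1, 0, 0, 0]
labels = mdr_construct(combos, status)
new_attribute = [labels[c] for c in combos]  # the one-dimensional G0/G1 attribute
```

A cell whose ratio equals T exactly is labeled G0 here, since the rule on the slide requires the ratio to exceed T.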

SLIDE 144


Flexible framework Step 3

Classification and machine learning
The single attribute obtained in Step 2 can be modeled using machine learning and classification techniques.

Bayes classifiers as one technique
Mitchell (1997) defines the naive Bayes classifier as

$$\arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$

where v_j is one of a set of V classes and a_i is one of n attributes describing an event or data element. The class associated with a specific attribute list is the one that maximizes the product of the class probability and the probabilities of each attribute value given that class.
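
A minimal sketch of this decision rule is given below, with probabilities estimated as plain relative frequencies. The function names and toy data are hypothetical, and no smoothing is applied, so an attribute value never seen within a class gives that class a score of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, classes):
    """Estimate P(v_j) and P(a_i | v_j) by relative frequencies (no smoothing)."""
    n = len(rows)
    class_counts = Counter(classes)
    priors = {v: c / n for v, c in class_counts.items()}
    cond_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, v in zip(rows, classes):
        for i, a in enumerate(row):
            cond_counts[(i, v)][a] += 1

    def likelihood(i, a, v):
        return cond_counts[(i, v)][a] / class_counts[v]

    return priors, likelihood


def classify(row, priors, likelihood):
    """Return the class v_j maximizing P(v_j) * prod_i P(a_i | v_j)."""
    scores = {}
    for v, prior in priors.items():
        score = prior
        for i, a in enumerate(row):
            score *= likelihood(i, a, v)
        scores[v] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): two genotype attributes per subject
rows = [(2, 1), (2, 2), (1, 2), (0, 0), (0, 1), (1, 0)]
classes = ["high", "high", "high", "low", "low", "low"]
priors, likelihood = train_naive_bayes(rows, classes)
prediction_a = classify((2, 2), priors, likelihood)
prediction_b = classify((0, 0), priors, likelihood)
```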

SLIDE 145


Flexible framework Step 3

The standard way to apply the naive Bayes classifier to genotype data would be to use the genotype information for each individual as a list of attributes to distinguish between the two hypotheses “The subject is high-risk” and “The subject is low-risk”.

Alternatively, an odds ratio for the single multilocus attribute can also be estimated using logistic regression to facilitate a traditional epidemiological analysis and interpretation.

Evaluation of the predictor can be carried out using cross-validation (Hastie et al., 2001) and permutation testing (Good, 2000), for example.

(Moore et al 2006)
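
Permutation testing of a predictor can be sketched as below: the case/control labels are shuffled repeatedly, and the observed balanced accuracy is compared with its permutation distribution. The function names, toy data, number of permutations, and seed are assumptions for illustration. For brevity the predictions are held fixed; a full analysis would typically rerun the whole model-building procedure on every permuted data set.

```python
import random


def balanced_accuracy(predicted, status):
    """Mean of sensitivity and specificity for 0/1 predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, status))
    tn = sum(p == 0 and y == 0 for p, y in zip(predicted, status))
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    return 0.5 * (tp / n_cases + tn / n_controls)


def permutation_p_value(statistic, status, n_permutations=1000, seed=0):
    """Share of label permutations whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = statistic(status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Toy data (hypothetical): 1 = high-risk prediction / case, 0 = low-risk / control
predicted = [1] * 10 + [0] * 10
status = [1] * 9 + [0] + [0] * 9 + [1]
observed = balanced_accuracy(predicted, status)
p_value = permutation_p_value(lambda y: balanced_accuracy(predicted, y), status)
```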

SLIDE 146


Flexible framework Step 4

Interpretation – interaction graphs
An interaction graph consists of a node for each attribute, with pairwise connections between them.

Each node is labeled with the percentage of entropy removed (i.e. the information gain, IG) by that attribute.

Each connection is labeled with the percentage of entropy removed by the pairwise Cartesian product of the two attributes it connects.

(slide: Chen 2007)
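
The node and connection labels can be illustrated with the sketch below, which computes the information gain of a single attribute and of the Cartesian product of two attributes. Expressing the gain as a percentage of the class entropy is an assumption of this sketch, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(attribute, status):
    """H(status) - H(status | attribute)."""
    n = len(status)
    conditional = 0.0
    for level, count in Counter(attribute).items():
        subset = [y for a, y in zip(attribute, status) if a == level]
        conditional += (count / n) * entropy(subset)
    return entropy(status) - conditional


def percent_entropy_removed(attribute, status):
    """Information gain as a percentage of the class entropy."""
    return 100.0 * information_gain(attribute, status) / entropy(status)


# Toy data (hypothetical): status depends on a and b only jointly (XOR pattern)
a = [0, 0, 1, 1] * 2
b = [0, 1, 0, 1] * 2
status = [0, 1, 1, 0] * 2
node_a = percent_entropy_removed(a, status)
node_b = percent_entropy_removed(b, status)
edge_ab = percent_entropy_removed(list(zip(a, b)), status)  # Cartesian product
```

In this toy pattern neither attribute removes any entropy on its own, while their Cartesian product removes all of it, which is the signature of a pure interaction.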

SLIDE 147


Flexible framework Step 4

Interpretation – dendrograms
Hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree.

SLIDE 148


Hierarchical clustering with average linkage

Here the distance between two clusters is defined as the average of the distances between all pairs of objects, where each pair is made up of one object from each cluster.
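
The average-linkage definition translates directly into code. The sketch below uses a hypothetical function name and toy one-dimensional points; any pairwise distance function could be passed in.

```python
def average_linkage(cluster_a, cluster_b, distance):
    """Mean of distance(x, y) over all pairs with x in cluster_a and y in cluster_b."""
    total = sum(distance(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Toy example (hypothetical): points on a line with absolute difference as distance
cluster_a = [0.0, 2.0]
cluster_b = [5.0, 7.0]
d_ab = average_linkage(cluster_a, cluster_b, lambda x, y: abs(x - y))
```

The four pairwise distances here are 5, 7, 3 and 5, so the cluster distance is their mean.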

SLIDE 149


Flexible framework

The flexibility of this framework lies in the ability to plug and play:
Different attribute selection methods, other than the entropy-based ones
Different constructive induction algorithms, other than MDR
Different machine learning strategies, other than a naïve Bayes classifier

(slide: Chen 2007)

SLIDE 150


3 Future challenges

Integration of –omics data in GWAs

SLIDE 151


Integration of –omics data in GWAs

(Hirschhorn 2009)

SLIDE 152


Integration of –omics data in GWAs

A few “straightforward” examples:

Post-analysis:
As a validation tool in main effects GWAs

During the analysis:
Epistasis screening (FAM-MDR): use expression values to prioritize multi-locus combinations
Main effects screening (PBAT): construct an overall phenotype for each marker based on the linear combination of expression values (e.g., within 1 Mb of the marker) that maximizes heritability, and perform FBAT-PC screening to prioritize SNPs

SLIDE 153


Extensive boundary-crossing collaborations

Statistical Genetics Research Club (www.statgen.be)

SLIDE 154


In-class discussion document

Moore J 2005. A global view of epistasis. Nature Genetics 37(1): 13-14