

SLIDE 1

4. Applications in Computational Biology

Karsten Borgwardt Department Biosysteme Data Mining 2 Course, Basel Spring Semester 2016 231 / 253

SLIDE 2

4.1 Co-training for Phenotype Prediction

based on: Damian Roqueiro, Menno Witteveen, Verneri Anttila, Gisela Terwindt, Arn van den Maagdenberg, Karsten Borgwardt. In silico phenotyping via co-training for improved phenotype prediction from genotype. ISMB 2015, Bioinformatics (2015) 31 (12): i303-i310.


SLIDES 3–6

Goal

Construction of a genotype classifier h

Important implications for disease diagnosis and therapy
Yet, h relies on a training dataset with labeled examples
Increasingly large availability of genotype data
Not enough disease phenotypes for the genotype samples

Can we boost the performance of a classifier when few labeled examples are available? → Use co-training

SLIDES 7–19

Co-training

Blum & Mitchell, 1998

Dataset D with

two classes of data: labeled (L) & unlabeled (U)
X features; X is the union of subsets of features X1 and X2
A labeled object x = ((x1, x2), y)

Learn classifiers h1 and h2 on each view of L

Iteratively

use h1 to label instances in U and add them to L
use h2 to label instances in U and add them to L
repeat until U = ∅ ... or another stopping condition

Two requirements

x1 and x2 should be conditionally independent of each other given y
X1 or X2 alone is sufficient to train h1 or h2 to classify data points in D
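The iterative loop above can be sketched in a few lines of stdlib Python. This is an illustrative sketch, not the paper's implementation: `ThresholdClassifier` is a toy one-feature stand-in for h1/h2 (any pair of fit/predict learners works), and each round moves one instance per view from U to L.

```python
class ThresholdClassifier:
    """Toy one-feature classifier standing in for h1/h2: predicts 1
    when the feature exceeds the midpoint between the class means."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, x):
        return 1 if x > self.t else 0


def co_train(L, U, max_rounds=100):
    """L: labeled objects ((x1, x2), y); U: unlabeled objects (x1, x2).
    Each round, retrain h1 and h2 on their view of L, let each label
    one instance from U, and add the newly labeled pairs to L."""
    for _ in range(max_rounds):
        if not U:      # repeat until U = ∅ ...
            break      # ... or another stopping condition
        ys = [y for (_, y) in L]
        h1 = ThresholdClassifier().fit([x[0] for (x, _) in L], ys)
        h2 = ThresholdClassifier().fit([x[1] for (x, _) in L], ys)
        x = U.pop(0)
        L.append((x, h1.predict(x[0])))      # h1 labels via view X1
        if U:
            x = U.pop(0)
            L.append((x, h2.predict(x[1])))  # h2 labels via view X2
    return L
```

With two well-separated labeled points and two unlabeled ones, both classifiers recover the correct labels and U empties after one round.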

SLIDES 20–21

Proposed approach

Apply co-training to a migraine dataset

Dutch cohorts, 1,938 patients

Two disease phenotypes:
migraine with aura (820)
migraine without aura (1,118)

Data available for each patient:
disease phenotype (aura vs. no aura)
clinical covariates (e.g. pulsating quality?)
genotype data: single nucleotide polymorphisms (SNPs)

SLIDE 22

Assumption: implicit price tag of data

Disease phenotype (diagnosis)
Clinical covariates (results of tests)
Genotype data (DNA sequencing)

Icon source: http://www.flaticon.com/authors/freepik


SLIDE 23

Declining cost of sequencing/genotyping

Source: National Human Genome Research Institute http://www.genome.gov/

Cost of genotyping (array): ∼$110 per sample
HumanOmniExpress-24 BeadChips, 713,014 markers


SLIDE 24

Clinical covariates

International Classification of Headache Disorders (ICHD) guidelines

Traits recorded as binary values:
attack length
pulsation
unilaterality
aggravation by physical exercise
vomiting
photophobia
and others...

age of onset

Source: The ICHD, 3rd edition (beta version). Cephalalgia 2013

SLIDES 25–26

Genotype data

Patients genotyped with Illumina arrays covering ∼500,000 SNPs

For each patient and SNP, the two alleles were coded as:
0 if the patient is homozygous for the major allele
1 if heterozygous
2 if homozygous for the minor allele

No population stratification present
SNPs were pre-processed and filtered
After filtering → 463,825 SNPs
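The 0/1/2 coding is simply a count of minor alleles; a minimal sketch (the function name and the allele-pair representation are illustrative):

```python
def encode_genotype(alleles, major):
    """Code one SNP for one patient as the number of minor alleles:
    0 = homozygous major, 1 = heterozygous, 2 = homozygous minor."""
    return sum(1 for allele in alleles if allele != major)
```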

SLIDE 27

In silico phenotyping: methodology

Data preparation

Randomly partition the data into 3 sets

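A sketch of the random three-way split; the 10/70/20 default matches the proportions used in the experiments later in the deck, and the function name and rounding are illustrative:

```python
import random

def partition(samples, fracs=(0.10, 0.70, 0.20), seed=0):
    """Randomly partition samples into sets I, II, III."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n1 = round(fracs[0] * len(shuffled))   # size of set I
    n2 = round(fracs[1] * len(shuffled))   # size of set II
    return shuffled[:n1], shuffled[n1:n1 + n2], shuffled[n1 + n2:]
```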

SLIDE 28

In silico phenotyping: methodology

Assumptions:

set II → no labels
set III → no clinical covariates


SLIDE 29

In silico phenotyping: methodology

Step 1

train classifier hc on clinical covariates in set I
impute the phenotype in set II
hc: bagging predictors, Breiman, L. (1996)


SLIDE 30

In silico phenotyping: methodology

Step 2

train classifier hg on true + imputed labels in sets I + II
predict the phenotype in set III
hg: random forest, Breiman, L. (2001)


SLIDE 31

Construction of hc (clinical covariates)

Used bagging predictors, Breiman, L. (1996)

5,000 predictors: logistic regressors
√c of all c features were assigned to each predictor
the predicted disease phenotype was the average of all predictors

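The bagging scheme can be sketched with the standard library alone. Two assumptions to note: a toy nearest-centroid learner stands in for the 5,000 logistic regressors, and the bootstrap is drawn per class so every sample contains both labels (a simplification, not from the slide):

```python
import math
import random

class NearestCentroid:
    """Toy base learner standing in for a logistic regressor."""
    def fit(self, X, y):
        def centroid(label):
            rows = [x for x, t in zip(X, y) if t == label]
            return [sum(col) / len(rows) for col in zip(*rows)]
        self.c0, self.c1 = centroid(0), centroid(1)
        return self

    def predict(self, x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, self.c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, self.c1))
        return 1 if d1 < d0 else 0


def bagging_predict(X, y, x_new, n_predictors=25, seed=0):
    """Bagging after Breiman (1996): each predictor is trained on a
    bootstrap sample restricted to sqrt(c) random features, and the
    predicted phenotype is the average over all predictors."""
    rng = random.Random(seed)
    c = len(X[0])
    m = max(1, math.isqrt(c))            # sqrt(c) features per predictor
    idx0 = [i for i, t in enumerate(y) if t == 0]
    idx1 = [i for i, t in enumerate(y) if t == 1]
    votes = []
    for _ in range(n_predictors):
        feats = rng.sample(range(c), m)
        # bootstrap within each class so both classes are present
        idx = [rng.choice(idx0) for _ in idx0] + [rng.choice(idx1) for _ in idx1]
        clf = NearestCentroid().fit(
            [[X[i][f] for f in feats] for i in idx], [y[i] for i in idx])
        votes.append(clf.predict([x_new[f] for f in feats]))
    return sum(votes) / len(votes)       # average of all predictors
```

On well-separated data the averaged score lands close to 0 or 1, which is the soft phenotype that later gets imputed into set II.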

SLIDE 32

Construction of hg (genotype data)

Used random forest, Breiman, L. (2001)

10,000 trees
√k of the top k features were used at each node
imputed labels in II had to be binarized
k = 2,000

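Before hg can be trained, the imputed phenotypes (averages in [0, 1] from hc) must be turned into hard labels; a one-line sketch, where the 0.5 cutoff is an assumption the slide does not state:

```python
def binarize(imputed_scores, cutoff=0.5):
    """Turn soft imputed phenotypes in [0, 1] into hard 0/1 labels.
    The 0.5 cutoff is an assumed default, not taken from the paper."""
    return [1 if s >= cutoff else 0 for s in imputed_scores]
```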

SLIDE 33

Univariate feature selection for hg

To reduce the dimensionality of the genotype data, each SNPi was ranked according to the Pearson correlation p-value between:

the genotype values of SNPi
the true + imputed disease phenotypes

Selected the top k = 2,000

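For a fixed sample size, ranking SNPs by the Pearson correlation p-value orders them exactly as ranking by |r|, so a sketch can skip the p-value itself (function names and data layout are illustrative):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def top_k_snps(genotypes, phenotypes, k):
    """genotypes: dict of SNP id -> 0/1/2 codes across patients;
    phenotypes: true + imputed 0/1 labels. Keep the k SNPs whose
    genotype column is most correlated (largest |r|) with the
    phenotype, which matches ranking by smallest p-value."""
    ranked = sorted(genotypes,
                    key=lambda s: abs(pearson_r(genotypes[s], phenotypes)),
                    reverse=True)
    return ranked[:k]
```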

SLIDE 34

Are classifiers hc and hg correlated?

Created 100 random partitions

training set = 80%
test set = 20%

Correlation coefficient: µ = 0.082, σ = 0.047
Manhattan distance: µ = 172.2, σ = 8.8


SLIDES 35–36

Classification performance: range

Lower bound: genotype classifier hg was trained only on I and evaluated on III
Upper bound: hg was trained on I + II
Univariate feature selection only on I, training on I (true) + II (imputed)

SLIDE 37

Classification performance

Created 100 random partitions of the data:
set I = 10%, set II = 70%, set III = 20%

AUC scores
Metric                                     µ      σ
Lower bound, training only on I            0.574  0.034
Univ. feat. sel. on I, training on I+II    0.608  0.035
In silico phenotyping (co-training)        0.646  0.029
Upper bound, I+II with true labels         0.689  0.025


SLIDE 38

Domain knowledge in univariate feature selection

Obtained 1,000 SNPs from a migraine meta-analysis, Anttila, V. et al. Nat Genet (2013):
23,285 migraine patients
81,453 migraine-free control individuals

Only 168 SNPs overlapped with our genotype arrays

Comparison of the 168 SNPs vs. the top k in univariate feature selection

AUC scores
Metric                                     µ       σ
Using domain knowledge (168 SNPs)          0.6460  0.0293
Univariate feature selection (2,000 SNPs)  0.6457  0.0289


SLIDE 39

Selecting the top k SNPs

Used k = 2,000; examined the effect of this choice with two analyses

Sets: I = 10%, II = 70%, III = 20% (100 random partitions)

Method 1: build hg with varying values of k; for each k, compute the mean AUC

AUC scores
Number of top k SNPs  µ      σ
200                   0.624  0.031
400                   0.631  0.032
800                   0.638  0.031
1,600                 0.644  0.028
2,000                 0.646  0.029
3,200                 0.648  0.028
6,400                 0.651  0.030
12,800                0.650  0.027
25,600                0.648  0.029
51,200                0.643  0.026

Method 2: internal cross-validation on I + II

AUC scores
Number of top k SNPs  µ      σ
200                   0.541  0.053
400                   0.549  0.054
800                   0.558  0.056
1,600                 0.565  0.062
2,000                 0.569  0.060
3,200                 0.577  0.055
6,400                 0.590  0.057
12,800                0.592  0.054
25,600                0.595  0.058
51,200                0.593  0.061
102,400               0.590  0.054
204,800               0.581  0.058

SLIDES 40–41

Varying the size of one set

Size of set II = 40→70% (fixed: I = 10%, III = 20%)

AUC scores
Number of samples in II     µ      σ
774    (40% of the data)    0.597  0.038
969    (50%)                0.604  0.035
1,162  (60%)                0.611  0.035
1,356  (70%)                0.646  0.029

Size of set I = 10→1% (fixed: II = 70%, III = 20%)

AUC scores
Number of samples in I      µ      σ
193    (10% of the data)    0.646  0.029
96     (5%)                 0.619  0.034
19     (1%)                 0.605  0.035

SLIDE 42

Varying the sizes of I and II simultaneously

Size of sets I & II = 40%; III = 20%

100 random partitions


SLIDE 43

Varying the sizes of I and II simultaneously (contd.)

For each cell in the grid, compute ∆AUC between:

in silico phenotyping (co-training)
the lower bound (training only on I)


SLIDES 44–46

Summary

Presented an approach to in silico phenotyping

impute the disease phenotype using genotype and clinical covariates
augment datasets → train models for phenotype prediction

Factors that affect co-training

original training dataset → small: allows improvement by augmenting
co-training dataset → large: adding few samples will not change the classification performance
clinical covariates → not redundant to the genetic data: predictive for the disease phenotype

What the future may hold

large sequencing projects create more genotypic data
biobanks collect more samples
health record DBs collect more clinical information on patients
