A Combinatorial Approach to the Analysis of Differential Gene Expression Data
The Use of Graph Algorithms for Disease Prediction and Screening
The Goal
- To classify patients based on expression profiles
– Presence of cancer
– Type of cancer
– Response to treatment
- To identify the genes required for accurate classification
– Too many = unnecessary noise
– Too few = insufficient information
Classic Clustering Problem
- Current techniques:
– Hierarchical Clustering
– K-Means Clustering
– Self-Organizing Maps
– Others
- Drawbacks:
– Determining cluster boundaries is difficult with diffuse data
– Objects can only belong to one group
Algorithmic Training
[Flowchart: Raw Data → Eliminate Poorly Discriminating Genes → Eliminate Poorly Covering Genes → Calculate Sample Similarities → Apply Threshold → Verify by Classification → Set of Discriminatory Genes, Gene Scores. Technique labels: Gene Scoring, Dominating Set, Maximal Cliques]
The Gene Scoring Function: Identifying Discriminators
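$$\mathrm{score}(\mathrm{gene}_i) = \left| \mu_{classA} - \mu_{classB} \right| - \left( \sigma_{classA} + \sigma_{classB} \right)$$
where μ and σ are the mean and standard deviation of the expression values of gene_i within each class
[Figure: two example plots shown side by side for comparison]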
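The scoring function above can be sketched in a few lines. This is only an illustration: it assumes the formula reads as the absolute difference of class means minus the sum of class standard deviations, and it assumes population standard deviation (the slide does not say which). The function name `gene_score` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def gene_score(class_a, class_b):
    """Score one gene from its expression values in two classes.

    Assumed reading of the slide's formula:
    |mean_A - mean_B| - (std_A + std_B), using population std.
    """
    separation = abs(mean(class_a) - mean(class_b))
    spread = pstdev(class_a) + pstdev(class_b)
    return separation - spread

# Hypothetical expression values for two genes across two classes.
well_separated = gene_score([8.0, 9.0, 10.0], [1.0, 2.0, 3.0])
overlapping = gene_score([2.0, 6.0, 10.0], [1.0, 5.0, 9.0])
print(well_separated > overlapping)
```

Under this reading, a gene scores high when its class means are far apart and its within-class spread is small, which matches the slide's goal of identifying discriminators.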
Eliminate Poorly Covering Genes
[Figure: samples and genes, with samples grouped into Class 1 and Class 2]
Create Unweighted Graph
- Complete, edge-weighted graph
– Vertices = samples
– Edge weight = similarity metric
- Remove edge weights
– If edge weight < threshold, remove edge from graph
– Otherwise, keep edge, ignore weight
- Result: incomplete unweighted graph
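The thresholding step above can be sketched as follows. The representation is an assumption made for illustration: edge weights are stored in a dict keyed by `frozenset({u, v})`, and the result is an adjacency dict. The function name `threshold_graph` and the sample labels are hypothetical.

```python
def threshold_graph(weights, threshold):
    """Turn a complete edge-weighted graph into an unweighted one.

    weights: dict mapping frozenset({u, v}) -> similarity.
    Returns an adjacency dict {vertex: set of neighbors}.
    """
    vertices = set().union(*weights) if weights else set()
    adjacency = {v: set() for v in vertices}
    for edge, weight in weights.items():
        if weight >= threshold:  # keep the edge, ignore its weight
            u, v = tuple(edge)
            adjacency[u].add(v)
            adjacency[v].add(u)
        # edges below the threshold are simply left out
    return adjacency

# Hypothetical similarities between three samples.
weights = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s1", "s3")): 0.2,
    frozenset(("s2", "s3")): 0.7,
}
graph = threshold_graph(weights, 0.5)
print(sorted(graph["s2"]))
```

Every sample stays in the graph as a vertex even if all of its edges fall below the threshold.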
The Edge Weight Function
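$$\mathrm{weight}(\mathrm{sample}_j, \mathrm{sample}_k) = \sum_i \left[ \mathrm{score}(\mathrm{gene}_i) \cdot \left( 1 - \left| \mathrm{expression\_value}_{ij} - \mathrm{expression\_value}_{ik} \right| \right) \right]$$
where expression_value_{ij} = expression value of gene_i for sample_j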
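As a rough illustration of the edge weight function, the sketch below sums a score-weighted similarity over genes. It assumes the difference of expression values is taken as an absolute difference and that expression values are scaled to [0, 1]; neither is stated on the slide. The name `edge_weight` and the sample data are hypothetical.

```python
def edge_weight(scores, sample_j, sample_k):
    """Similarity between two samples, summed over genes.

    scores[i] is score(gene_i); sample_j[i] and sample_k[i] are the
    expression values of gene_i for the two samples.
    Assumptions: values scaled to [0, 1], absolute difference.
    """
    return sum(
        score * (1.0 - abs(x_j - x_k))
        for score, x_j, x_k in zip(scores, sample_j, sample_k)
    )

# Hypothetical scores for two genes and expression values for sample pairs.
scores = [2.0, 1.0]
identical = edge_weight(scores, [0.2, 0.8], [0.2, 0.8])
different = edge_weight(scores, [0.0, 1.0], [1.0, 0.0])
print(identical, different)
```

With these assumptions, two samples with identical profiles receive the sum of all gene scores, and highly scored genes contribute most to the similarity.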
What is a Clique?
- A completely connected subset of vertices in a graph
- Maximal clique = local optimization
- The clique decision problem is NP-complete
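To make the definition concrete, here is a minimal sketch that enumerates maximal cliques with the basic Bron-Kerbosch recursion (no pivoting). This is a textbook method chosen for illustration, not necessarily the clique algorithm used in this work. The adjacency-dict representation and the example graph are assumptions.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting).

    adjacency: dict {vertex: set of neighbors} of an undirected graph.
    """
    cliques = []

    def expand(current, candidates, excluded):
        # No vertex can extend the clique: it is maximal.
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v},
                   candidates & adjacency[v],
                   excluded & adjacency[v])
            candidates = candidates - {v}  # v is fully explored
            excluded = excluded | {v}

    expand(frozenset(), set(adjacency), set())
    return cliques

# Hypothetical graph: triangle a-b-c plus a pendant edge c-d.
example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cliques = maximal_cliques(example)
print(sorted(sorted(c) for c in cliques))
```

The running time is exponential in the worst case, which is consistent with the hardness noted above.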
Classification Using Clique
[Figure: graph of samples with groups labeled Class 1, Class 2, and Class 3]
A Selection of Discriminators
- electron transport: cytochrome P450 4B1 (CYP4B1)
- cell growth, cell differentiation: four and a half LIM domains 1 (FHL1)
- alcohol dehydrogenase activity: alcohol dehydrogenase IB (ADH1B)
- oxygen transport: hemoglobin, beta (HBB)
- transmembrane receptor protein serine/threonine kinase signaling pathway: transforming growth factor, beta receptor II (TGFBR2)
- plasminogen binding protein: tetranectin (TNA)
The Algorithm - Unsupervised
[Flowchart: Raw Data → Calculate Sample Similarities → Apply Threshold → Classify Unknown Samples. Input: Set of Discriminatory Genes, Scores]
Summary
- Intersection of clique and dominating set techniques improves results
- Combined orthogonal scoring identifies a limited number of discriminatory genes
- Clique offers a means of validating obtained scores and weights
- Our technique identifies a different set of discriminatory genes from the original paper
- Clique-based classification is a viable complement to present clustering methods
Ongoing and Future Research
- Reverse Training
- Train to distinguish among types of cancer
- Experiment with different weight functions (e.g., Pearson's correlation coefficient)
- Investigate using less stringent techniques
– Near-cliques
– Neighborhood search
– K-dense subgraphs
- Port codes to SGI Altix supercomputer
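As an illustration of the alternative weight function mentioned in the list above, the sketch below computes Pearson's correlation coefficient between two samples' expression profiles. How it would be combined with gene scores or thresholds is not specified here, so this only shows the coefficient itself. The function name `pearson` and the data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical expression profiles over three genes.
same_trend = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(same_trend, opposite_trend)
```

Unlike a difference-based similarity, this measure ranges from -1 to 1 and ignores differences in scale between the two profiles.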