SLIDE 1

Novel Computational and Integrative Tools for the Analysis of Gene Co-Expression Data

Michael A. Langston

Department of Computer Science University of Tennessee

currently on leave to

Computer Science and Mathematics Division Oak Ridge National Laboratory USA

Graph Algorithms Research Laboratory – University of Tennessee

UT-ORNL 22 June 2005 DIMACS

SLIDE 2

Technology Mapping

Biological Knowledge

. . Protein Structure . . Gene Regulatory Networks . . Sequence Homology . . Protein function . . Cell Physiology . .

Analysis Tools

. . Ontology . . Cis-Regulatory Elements . . Quantitative Trait Loci . . Combinatorial Algorithms . . Bayesian Networks . .

SLIDE 3

Many Network Actions

  • Cis and trans (direct and indirect) regulation
  • Post-transcriptional regulation (e.g., alternate splicing)
  • µRNA (e.g., functional RNA, RNAi and gene silencing)
  • All are forms of co-regulation.
  • Not to be confused with mere differential expression.
  • Thus the central problem is clique.
  • But it’s NP-complete to decide clique.
  • In fact it’s NP-hard even to approximate clique!
  • Nevertheless, with new mathematical tools (FPT) we can solve clique optimally using vertex cover.

  • Confines “combinatorial explosion” to the parameter.

SLIDE 4

A Little Complexity Theory

  • The Classic View:

P (“easy”) ⊆ NP (“hard”) ⊆ Σ₂ᴾ ⊆ … ⊆ PSPACE (“fuggetaboutit”)

SLIDE 5

Parameter Sensitivity: Instance(n,k)

  • Suppose our problem is, say, NP-complete.
  • Consider an algorithm with a time bound such as O(2^(k+n)).
  • And now one with a time bound more like O(2^k + n).

  • Both are exponential in parameter value(s).
  • But what happens when k is fixed?
  • FPT confines superpolynomial behavior to the parameter.
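
A minimal sketch of the contrast, assuming the two bounds are read as O(2^(k+n)) and O(2^k + n). The function names and the sample values of n and k are illustrative only and do not come from the talk. With k held fixed, the first count still doubles with every extra vertex, while the second grows only linearly in n.

```python
def bound_exponential_in_both(n: int, k: int) -> int:
    """Step count for a bound of the form 2^(k+n)."""
    return 2 ** (k + n)


def bound_fpt(n: int, k: int) -> int:
    """Step count for a bound of the form 2^k + n."""
    return 2 ** k + n


if __name__ == "__main__":
    k = 10  # parameter held fixed (illustrative value)
    for n in (10, 20, 40):
        print(n, bound_exponential_in_both(n, k), bound_fpt(n, k))
```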

SLIDE 6

A Little Complexity Theory

The Parameterized View:

FPT ⊆ W[1] ⊆ W[2] ⊆ … ⊆ XP

“solvable” (even if NP-hard!) … “heuristics only” … “fuggetaboutit”

SLIDE 7

On Solving Clique

  • Clique is a central problem all right, but it’s not FPT (unless the W hierarchy collapses).
  • Fortunately, Vertex Cover is FPT.
  • And Vertex Cover is a complementary dual to Clique:

[Figure: a graph G and its complement Ḡ]
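
The duality can be sketched in a few lines: a vertex set S is a clique in G exactly when the remaining vertices V \ S form a vertex cover of the complement graph, so a minimum cover of the complement yields a maximum clique of G. The helper names below are hypothetical, and the exhaustive cover search is only a stand-in for a real FPT solver on tiny demo graphs.

```python
from itertools import combinations


def complement(vertices, edges):
    """Return the edge set of the complement graph."""
    present = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(vertices, 2)} - present


def is_vertex_cover(cover, edges):
    """True if every edge has at least one endpoint in cover."""
    return all(set(e) & cover for e in edges)


def is_clique(subset, edges):
    """True if every pair of vertices in subset is joined by an edge."""
    present = {frozenset(e) for e in edges}
    return all(frozenset(p) in present for p in combinations(subset, 2))


def min_vertex_cover_bruteforce(vertices, edges):
    """Smallest cover by exhaustive search (tiny demo graphs only)."""
    for size in range(len(vertices) + 1):
        for cand in combinations(vertices, size):
            if is_vertex_cover(set(cand), edges):
                return set(cand)


def max_clique_via_cover(vertices, edges):
    """Maximum clique of G = V minus a minimum vertex cover of the complement."""
    cover = min_vertex_cover_bruteforce(vertices, complement(vertices, edges))
    return set(vertices) - cover
```

For example, on a triangle 1-2-3 with a tail 3-4-5, the minimum cover of the complement is {4, 5}, which leaves the triangle as the maximum clique.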

SLIDE 8

Solving Vertex Cover

Clique

  • COMPLEXITY THEORY: Problem Classification, Algorithm Selection
  • GRAPH ALGORITHMS: Modeling, Optimization
  • PARALLELISM AND GRIDS: Speedup, Collaboration
  • RECONFIGURATION: Hardware Acceleration, Fast Prototyping

Intellectual Property Available Technologies

SLIDE 9

The Vertex Cover Project

  • use preprocessing via degree structures
  • then kernelize to reduce to a computational core
  • employ branching to explore the core
  • finally, interleave all three
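
The four steps above can be sketched as one recursive routine. This is a minimal illustration rather than the project's code: the talk does not name its reduction rules, so the degree-1 rule and the high-degree (Buss-style) kernelization rule used here are textbook choices assumed for the example, and `vertex_cover` is a hypothetical name. Reductions are re-applied at every branch node, which is the sense in which the steps are interleaved.

```python
def vertex_cover(adj, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    adj maps each vertex to the set of its neighbours (undirected graph).
    """
    adj = {v: set(ns) for v, ns in adj.items() if ns}  # drop isolated vertices
    if not adj:
        return set()
    if k <= 0:
        return None

    def take(graph, chosen):
        """Copy of graph with the chosen vertices (and their edges) removed."""
        return {v: ns - chosen for v, ns in graph.items() if v not in chosen}

    # Preprocessing: a degree-1 vertex can always be covered by its neighbour.
    for v, ns in adj.items():
        if len(ns) == 1:
            (u,) = ns
            rest = vertex_cover(take(adj, {u}), k - 1)
            return None if rest is None else rest | {u}

    # Kernelization: a vertex with more than k neighbours must be in the cover.
    for v, ns in adj.items():
        if len(ns) > k:
            rest = vertex_cover(take(adj, {v}), k - 1)
            return None if rest is None else rest | {v}

    # Kernel size check: with max degree <= k, k vertices cover <= k*k edges.
    num_edges = sum(len(ns) for ns in adj.values()) // 2
    if num_edges > k * k:
        return None

    # Branching: either v is in the cover, or all of its neighbours are.
    v = max(adj, key=lambda x: len(adj[x]))
    with_v = vertex_cover(take(adj, {v}), k - 1)
    if with_v is not None:
        return with_v | {v}
    ns = set(adj[v])
    if len(ns) <= k:
        with_ns = vertex_cover(take(adj, ns), k - len(ns))
        if with_ns is not None:
            return with_ns | ns
    return None
```

On a 5-cycle, for instance, the call with k = 2 is rejected by the kernel size check, while k = 3 succeeds.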

SLIDE 10

Sample Grid Architecture

[Diagram: layers Middleware (NetSolve), Foundational Fabric (Switches and Depots), and Compute Resources (Grid Service Clusters); components NetSolve Client, NetSolve Agent, NetSolve Servers, and Distributed Storage]

Key: NetSolve’s program description file facility

SLIDE 11

Hardware Acceleration

  • Reconfigurable devices
  • Very different algorithms
  • VHDL versus C
  • I/O is often the most critical resource
  • With current implementations, we are able to solve sub-instances:
    – of size 512 or less,
    – and with speedups north of about 125

[Figure: FPGA chip]

SLIDE 12

The Clique Compute Engine

[Diagram: Pre-Processed Graph; Parametric Tuning, Decomposition, and Refinement; Highly Parallel Computation (multiple PEs); Recalcitrant Sub-problem; Reconfigurable Technology (three FPGAs); Cliques for Post-Processing]

SLIDE 13

[Diagram: Vertex Cover Driver with Splitter, Job Scheduler, and Job List; Initialize Branching; one Handle Machine step per machine (ssh, Open Socket); Branching on Processor 1, Processor 2, …, Processor N]

A simple mechanism. (Sometimes too simple.)

SLIDE 14

Distributed Subtree Splitting

Pruning is needed at processor 4.

[Figure: search subtrees assigned to processors 1, 2, 3, and 4]


  • Processor 1 is still active.
  • Processor 2 is still active.
  • Processor 3 is still active.
  • Send a subtree to the job queue.

SLIDE 15

Sample Results on Protein Sequence Data

Graph Name   Graph Size   Cover Size   Instance Type
SH2-5        839          399          Yes
SH2-5        839          398          No
SH3-10       2466         2044         Yes
SH3-10       2466         2043         No

Timing columns: Sequential Kernelization, Sequential Branching, Parallel Branching, Dynamic Decomposition

[Timing cells, in extraction order: 7 seconds, Not needed, 82 minutes, ~5 days, 6+ days, 141 minutes, ~5 days, 6+ days, 34 seconds, Not needed, 20 minutes, 140 minutes, 620 minutes, 34 seconds, 203 minutes, 203 minutes]


  • 32 PEs @ 500MHz.
  • Load balancing is critical.
  • “No” is harder than “yes.”
  • The hardest computations.
  • So clique size is 422 (2466 − 2044).

SLIDE 16

A Toolchain for Microarray Analysis

Pipeline:

  • cDNA or mRNA Microarrays
  • Raw Data
  • Normalization
  • Gene Expression Profiles
  • Compute Spearman’s Rank Coefficients
  • Edge-Weighted Graph
  • Filter With Threshold Value
  • Unweighted Graph
  • Clique-Centric Toolkit
  • Genes of Interest
  • Validation (putative and experimental)

Clique-Centric Toolkit:

  • Pre-Processing Tools, e.g., Graph Separators and Partitioning
  • Clique Extraction, e.g., Maximum Clique (NP-complete), All Maximal Cliques
  • Post-Processing Tools, e.g., Neighborhood Search, Subgraph Expansion
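
The step from expression profiles to an unweighted graph can be sketched as follows. This is a pure-Python illustration under stated assumptions: the function names are hypothetical, ties receive average ranks, and an edge is kept when the absolute value of the coefficient meets the threshold (the talk does not say whether the sign is used).

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank coefficient: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def threshold_graph(profiles, threshold):
    """Unweighted edge set: gene pairs whose |rho| meets the threshold."""
    return {
        frozenset((g, h))
        for g, h in combinations(profiles, 2)
        if abs(spearman(profiles[g], profiles[h])) >= threshold
    }
```

Here `profiles` maps a gene (probe set ID) to its list of expression values across samples; the resulting edge set is what the clique tools would consume.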

SLIDE 17

A Sample Study

  • Data acquisition depends on:
    – organism and tissue type
    – independent variable (e.g., time course, life stages)
    – chip technologies/vendors; cDNA vs mRNA
    – normalization methods, coefficient computations
  • In this particular study:
    – 32 Mus musculus RI strains
    – brain tissue
    – Affymetrix U74Av2 mRNA Arrays
    – MAS5.0 package, Spearman rank order

SLIDE 18

Computational Experience

  • 12,422 probe set IDs (genes, vertices)
  • Over 100M edges
  • Employed a variety of thresholds
  • Many days of highly parallel CPU time
  • With the threshold set at 0.5:
    – the maximum clique size is 369
    – density made this a difficult computation
  • But we could do it via FPT:
    – contrast with brute force

SLIDE 19

Zeroing in on Biological Relevance

  • Clique versus clustering
  • Too low a threshold produces large cliques, which can be hard to evaluate
  • Too high a threshold produces small cliques, which can exaggerate noise
  • Iterating, we settle on a threshold of 0.85:
    – maximum clique size is 17
    – there are 5227 maximal cliques
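
Counts such as the maximum clique size and the number of maximal cliques come from enumerating all maximal cliques of the thresholded graph. The talk does not say which enumeration method was used; the sketch below substitutes the classical Bron–Kerbosch algorithm with pivoting, and `maximal_cliques` is a hypothetical name.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of the graph as a frozenset.

    adj maps each vertex to the set of its neighbours (undirected graph).
    """
    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates,
        # so that only non-neighbours of the pivot need to be branched on.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())
```

Given the resulting cliques, the maximum clique size is the largest `len(c)`, and a clique size distribution is a tally of `len(c)` over all of them.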

SLIDE 20

Clique Size Distribution

[Figure: histogram of the number of cliques (axis ticks 100 to 600) against clique size (3 to 17)]

SLIDE 21

Gene Distribution Across Cliques

[Figure: number of cliques containing each gene (axis ticks 500 to 3500) by gene (probe set ID), cut off at 100 cliques. Labeled probe sets: 93806_at, 98455_at, 94201_at, 98553_at, 160683_at, 93990_at, 160686_at, 160930_at, 93667_at, 97394_at, 160251_at, 101058_at]

SLIDE 22

Sample Genes of Interest

  • 93806_at = Veli3 (aka Lin7c)
  • In the human ortholog:
  • structural cytoskeletal protein
  • signal transducer
  • Important interactions with Cask, Mask1

Sample Cliques of Interest

  • Common CREs
  • Enriched Ontologies (Biological Process, Molecular Function, Cellular Component)
  • Quantitative Trait Loci

SLIDE 23

Technology Integration

[Diagram: three groups, Computational Tool Modules, Data Infrastructure, and Visualization & Validation, with the following components]

  • Combinatorial Algorithms, Clique Analysis
  • Gene Expression Data Acquisition and Normalization
  • WebQTL DB
  • GeneKeyDB
  • Sequence Analysis (SNPs, CREs)
  • Data Mining (Gene Set Enrichment, Biological Ontologies)
  • Quantitative Linkage Analysis
  • WebQTL.org
  • GoTreeMachine
  • GeneNetViz

SLIDE 24

Let’s Remember What We’re Trying to Accomplish

  • Every cell in an organism has the same DNA.
  • It’s the regulatory mechanisms that seem to change from one tissue type to another.
  • Clique gives us putative co-regulation, and with it a set of targets for verification.
  • The main goal remains the elucidation of gene regulatory networks.
  • Now let’s look at the bigger picture.

SLIDE 25

How do These Cliques Interact?

  • Motivation:
    – capture the notion of regulatory networks interacting with other regulatory networks
  • The clique intersection graph:
    – each clique is represented by a vertex
    – if the intersection of a pair of cliques is nonempty, then an edge is added to connect their corresponding vertices
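
The construction just described is short enough to sketch directly. The function name is hypothetical, and cliques are identified by their position in the input list.

```python
from itertools import combinations


def clique_intersection_graph(cliques):
    """Return edges (i, j), i < j, for every pair of cliques that share a gene.

    Vertex i of the intersection graph stands for cliques[i].
    """
    cliques = [frozenset(c) for c in cliques]
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(cliques), 2)
        if a & b
    }
```

This pairwise version is quadratic in the number of cliques, which is acceptable for a few thousand cliques; an index from gene to the cliques containing it would avoid comparing disjoint pairs.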

SLIDE 26

The Clique Intersection Graph for this Data Set

[Figure: clique intersection graph; 15-cliques in green, 16-cliques in black, 17-cliques in red]

SLIDE 27

Dealing with Noisy Data

  • Clique is the “gold standard.”
  • But the data is seldom without errors.
  • What we really want are very dense subgraphs.
  • It’s straightforward to use neighborhoods, but on real data:
    – 1-neighborhoods produce edge densities of only around 16%.
    – 2-neighborhoods produce edge densities of only around 6%.
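
A rough sketch of how such densities can be measured, assuming a k-neighborhood means all vertices within k hops of a chosen vertex and that edge density is the fraction of vertex pairs in the subgraph that are joined by an edge. Both function names are hypothetical.

```python
from collections import deque


def neighborhood(adj, source, radius):
    """Vertices within `radius` hops of `source` (source included), via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the requested radius
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def edge_density(adj, vertices):
    """Fraction of vertex pairs inside `vertices` that are joined by an edge."""
    vertices = set(vertices)
    n = len(vertices)
    if n < 2:
        return 1.0
    present = sum(len(adj[v] & vertices) for v in vertices) // 2
    return present / (n * (n - 1) // 2)
```

A star graph shows why neighborhoods tend to be sparse: the 1-neighborhood of the center of a 4-leaf star contains all five vertices but only 4 of the 10 possible edges.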

SLIDE 28

Dealing with Noisy Data

Paraclique:

  • Clique gloms onto highly connected vertices.
  • 280-clique is transformed into a 466-paraclique.
  • Edge density is north of 96%.
  • Lift and separate.

[Figure: a 280-clique grown into a 466-paraclique; the added vertices are each labeled 279]
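
One way to sketch the glom step: start from a clique and repeatedly absorb any outside vertex that is adjacent to all but a few current members. The tolerance parameter `glom` and the function name are assumptions made for this illustration; the slide only states that the clique gloms onto highly connected vertices.

```python
def paraclique(adj, clique, glom=1):
    """Grow a clique into a paraclique.

    adj maps each vertex to the set of its neighbours. A vertex is absorbed
    when it is non-adjacent to at most `glom` current members (hypothetical
    tolerance parameter). Vertices are visited in sorted order, so vertex
    labels must be mutually comparable.
    """
    members = set(clique)
    grew = True
    while grew:
        grew = False
        for v in sorted(set(adj) - members):
            if len(members - adj[v]) <= glom:
                members.add(v)
                grew = True
    return members
```

With `glom=1`, a vertex adjacent to 3 of the 4 members of a 4-clique is absorbed, while a vertex adjacent to only one member is not. Because membership is re-checked against the growing set, the result can depend on visiting order, which is why the order is fixed here.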

SLIDE 29

Paracliques and QTL

Seven Quantitative Trait Loci

Transcript abundance can be the phenotype!

There’s a high probability that somewhere in here is a polymorphism controlling this trait.

SLIDE 30

Paracliques and QTL

Concentrated Parental Alleles

Two Paracliques

SLIDE 31

Currently Working On

  • More and larger M. musculus mRNA arrays
  • New H. sapiens mRNA arrays (>19k genes)
  • The eukaryotic “usual suspects”
  • Prokaryotes
    – Rhodopseudomonas palustris
      • three operons for nitrogen fixation
    – Shewanella oneidensis
      • Metallic precipitates
    – Synechococcus elongatus
      • Heatshock cyanobacteria

SLIDE 32

Ongoing Work, Threshold Setting

  • WAG, Kentucky Windage
  • Maximum Clique Size
  • Scrutinize Key Gene Correlates
  • Use Functional Similarity

[Figure: Functional Similarity via GO plotted against Distance via Spearman Rank; look for inflection point]

SLIDE 33

Ongoing Implementation Efforts

  • Sample codes released:
  • Clustal XP (Phylogeny)
  • CAMDA (Disease Screening)
  • Building out to clique variants
  • Establishing a Portal at ORNL
  • Porting codes to SGI Altix, Cray X1
  • ORNL National Leadership Class Facility
  • Most importantly: professional interaction

SLIDE 34

Collaborators and Acknowledgments

UT: Naima Moustaid-Moussa, Greg Peterson, Arnold Saxton
ORNL: Stefan Kirov, Frank Larimer, Nagiza Samatova, Jay Snoddy, Brynn Voy, Bing Zhang
UTHSC: Elissa Chesler, Ivan Gerling, Rob Williams
International: Mike Fellows, Henning Fernau, Rolf Niedermeier, Fran Rosamond
Postdocs: Faisal Abu-Khzam (CS), Nicole Baldwin (µBio), Henry Suters (Math)
Students: John Eblen, Xinxia Peng, Chris Symons, Yun Zhang, Jon Scharff, Josh Steadmon
