ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY OF TENNESSEE
NZIMA Napier 2008
Analysis of High-Throughput Biological Data Part II: Computational Bottlenecks and Novel Applications
Mike Langston, Professor, Department of Electrical Engineering and Computer Science, University of
Outline of Talk
- Foundations
- Gene Coexpression Analysis
- Data Integration
- Application to Human Health
- Protein Complex Prediction
- Application to Model Organisms
Foundations
Systems Biology
- How do biological entities function in unison and at all levels of scale?
- Linkage, communication and networks (graphs!)
Correlation
Here are five mouse genes with Pearson correlations of at least 0.65. What of
- noise?
- experimental design?
- circadian rhythms?
- other confounds?
- other metrics?
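To make the "noise" and "other metrics" questions concrete, here is a minimal sketch comparing Pearson with a rank-based metric (Spearman) on two hypothetical expression profiles, one of which contains a single outlier. The data and function names are illustrative assumptions, not taken from the talk.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles for two genes across six samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [1.1, 2.3, 2.9, 4.2, 5.1, 60.0]  # the last sample is an outlier

print("Pearson :", round(pearson(gene_a, gene_b), 3))
print("Spearman:", round(spearman(gene_a, gene_b), 3))
```

The two profiles rise together in every sample, so the rank-based value is 1, while the outlier pulls the Pearson value well below that. Which metric is appropriate depends on the noise model and experimental design.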
Coefficient Profiles
Sometimes via
- Pearson
- Spearman
- Mutual Information
- Etc
Other times we need
- p-values
- Bonferroni corrections
- q-values
- false discovery rates...
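As a rough illustration of why the choice of correction matters, the sketch below applies a Bonferroni correction and the Benjamini-Hochberg step-up procedure (a standard false discovery rate method, chosen here as an assumption since the slides do not name one) to a hypothetical list of p-values.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from eight correlation tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

print("Bonferroni rejections:", sum(bonferroni(p)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p)))
```

Bonferroni controls the chance of any false positive and is therefore stricter; the false discovery rate approach tolerates a bounded fraction of false positives and typically keeps more edges.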
Omics: key to deciphering complex systems
- Humans: 10^13+ cells, 200+ cell types
- Genome (blueprint, 20K+ genes, 10M+ polymorphisms)
- Proteome (functional units, unknown # of proteins)
- Transcriptome
- Translation (tRNA) via transcription (mRNA)
- Function and Signaling (siRNA, miRNA, etc)
- Other: metabolome, lipidome, interactome, omeome!
Visualization
- highly dependent on scale
- the only omics often seen is a “ridiculome”
Computational Tools
- focus usually on dense subgraphs
Maximum Clique
- must run often
- time is a limiting factor
- exploit fixed-parameter tractability (FPT)
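One well-known way to bring fixed-parameter tractability to bear on clique is through vertex cover: a graph on n vertices has a clique of size k exactly when its complement has a vertex cover of size n - k. The sketch below uses the simplest bounded search tree for vertex cover. It is an illustrative assumption about how FPT can be exploited, not the tuned codes used in the talk, and the example graph is hypothetical.

```python
from itertools import combinations

def has_vertex_cover(edges, k):
    """Bounded search tree: can at most k vertices touch every edge?"""
    if not edges:
        return True
    if k == 0:
        return False
    u, v = next(iter(edges))  # any uncovered edge; one endpoint must be in the cover
    for chosen in (u, v):
        remaining = {e for e in edges if chosen not in e}
        if has_vertex_cover(remaining, k - 1):
            return True
    return False

def maximum_clique_size(vertices, edges):
    """Maximum clique size via minimum vertex cover of the complement graph."""
    edge_set = {frozenset(e) for e in edges}
    complement = {frozenset(p) for p in combinations(vertices, 2)} - edge_set
    n = len(vertices)
    for cover_size in range(n + 1):
        if has_vertex_cover(complement, cover_size):
            return n - cover_size

# Hypothetical graph: a triangle a-b-c with a tail c-d-e.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
print(maximum_clique_size(vertices, edges))
```

The search tree has depth at most the cover size, so the exponential part of the running time depends on that parameter rather than on the size of the graph.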
Maximal Clique
- huge outputs
- various orderings
- memory is often the limiting factor
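A minimal sketch of maximal clique enumeration follows, using the classical Bron-Kerbosch algorithm with pivoting (an assumption; the slides do not say which enumeration algorithm is used). Writing it as a generator hands cliques to the caller one at a time, which is one simple way to avoid holding a huge output in memory. The example graph is hypothetical.

```python
def maximal_cliques(adj):
    """Yield every maximal clique once (Bron-Kerbosch with pivoting).

    adj maps each vertex to the set of its neighbours.
    """
    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored vertices
        if not p and not x:
            yield sorted(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())

# Hypothetical graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
for clique in maximal_cliques(adj):
    print(clique)
```

The order in which cliques appear depends on the pivot and candidate ordering, which is one place where the "various orderings" mentioned above come into play.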
Biclique
- new algorithms
- bipartite graphs
Paraclique
- noisy data
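The idea behind a paraclique is to tolerate a few missing edges, since noisy data can drop a true correlation below threshold. The sketch below is one simple reading of that idea, stated as an assumption: start from a seed clique and repeatedly add any vertex that is adjacent to all but at most `glom` of the current members. The parameter name `glom`, the growth rule, and the example graph are illustrative, not a definitive statement of the method used in the talk.

```python
def paraclique(adj, seed_clique, glom=1):
    """Grow a seed clique into a dense subgraph that tolerates missing edges.

    A vertex joins if it lacks an edge to at most `glom` current members.
    """
    members = set(seed_clique)
    grew = True
    while grew:
        grew = False
        for v in sorted(set(adj) - members):
            missing = len(members - adj[v])
            if missing <= glom:
                members.add(v)
                grew = True
    return members

# Hypothetical graph: clique {a, b, c, d}; e misses only the edge to d; f touches only a.
adj = {
    "a": {"b", "c", "d", "e", "f"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d", "e"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "c"},
    "f": {"a"},
}
print(sorted(paraclique(adj, {"a", "b", "c", "d"}, glom=1)))
```

Here vertex e is absorbed despite one missing edge, while f, which is connected to only one member, stays out.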
Coexpression Analysis
[Figure: Toolchain]
- Stages shown: cDNA or mRNA Microarrays, Raw Data, Normalization, Gene Expression Profiles (Real-Valued Matrix), Correlation Computation, Edge-Weighted Complete Graph, High-Pass Filtering, Unweighted Incomplete Graph, Graph Transforms
- Unsupervised Methods: Principal Component Analysis, k-Means Clustering
- Clique-Centric Methods: k-Cores, k-Connected Components, HCS Subgraphs, Biclique, Paraclique, Maximal Clique, Maximum Clique
- Axis label: Increasing Edge Density (and Increasing Problem Complexity)
- NP-complete Problems: FPT, VC Codes, HPC & Novel Methods
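The front half of the toolchain (normalization, correlation computation, high-pass filtering) can be sketched in a few lines. This is a minimal illustration under stated assumptions: z-score normalization, Pearson correlation, and filtering on the absolute value of the correlation are choices made here for the example, and the gene names and profiles are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

def zscores(profile):
    """Centre a profile at zero and scale it to unit (population) standard deviation."""
    mu, sigma = mean(profile), pstdev(profile)
    return [(v - mu) / sigma for v in profile]

def coexpression_graph(profiles, threshold):
    """Turn expression profiles into an unweighted graph of strongly correlated genes."""
    z = {gene: zscores(values) for gene, values in profiles.items()}
    adj = {gene: set() for gene in profiles}
    for g, h in combinations(profiles, 2):
        # For z-scored profiles, the mean of the products is the Pearson correlation.
        r = mean(a * b for a, b in zip(z[g], z[h]))
        # High-pass filter: keep an edge only for strong correlations.
        # Using the absolute value is an assumption made for this sketch.
        if abs(r) >= threshold:
            adj[g].add(h)
            adj[h].add(g)
    return adj

# Hypothetical profiles for three genes across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 1.0, 3.0, 2.0],
}
print(coexpression_graph(profiles, threshold=0.85))
```

The resulting adjacency sets are the unweighted incomplete graph on which the clique-centric methods operate.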
Gene (vertex) comparisons:
- differential expression
- does not require multiple conditions
- compare the two lists of gene expression levels
Correlate (edge) comparisons
- differential correlation
- requires multiple conditions in control versus stimulus
- compare two lists of gene-gene correlations
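A minimal sketch of differential correlation, assuming the simplest possible comparison: compute each gene-gene Pearson correlation separately in the control and stimulus conditions and report pairs whose correlation changes by at least a chosen amount. The cutoff `min_change`, the gene names, and the data are hypothetical; a real analysis would also test the significance of each change.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def differential_correlation(control, stimulus, min_change):
    """List gene pairs whose correlation differs between the two conditions."""
    changed = []
    for g, h in combinations(sorted(control), 2):
        r_control = pearson(control[g], control[h])
        r_stimulus = pearson(stimulus[g], stimulus[h])
        if abs(r_control - r_stimulus) >= min_change:
            changed.append((g, h, round(r_control, 3), round(r_stimulus, 3)))
    return changed

# Hypothetical data: g2 reverses its behaviour under the stimulus.
control = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, 3, 2, 4]}
stimulus = {"g1": [1, 2, 3, 4], "g2": [8, 6, 4, 2], "g3": [1, 3, 2, 4]}

for row in differential_correlation(control, stimulus, min_change=1.0):
    print(row)
```

Note that each condition needs several samples of its own, which is why this comparison requires multiple conditions where differential expression does not.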
Putative network (clique) comparisons
- differential topology
- compare cliques, sort by ontology, CREs, etc
- consider granularity, for example, with the clique intersection graph
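The clique intersection graph has one node per clique and joins two nodes when their cliques share vertices. The sketch below builds it from a list of cliques; the `min_shared` parameter, which controls granularity by requiring a minimum overlap, is an assumption added for illustration.

```python
from itertools import combinations

def clique_intersection_graph(cliques, min_shared=1):
    """Map each pair of overlapping cliques (by index) to the vertices they share."""
    nodes = [frozenset(c) for c in cliques]
    edges = {}
    for i, j in combinations(range(len(nodes)), 2):
        shared = nodes[i] & nodes[j]
        if len(shared) >= min_shared:
            edges[(i, j)] = shared
    return edges

# Hypothetical cliques over genes a..e.
cliques = [{"a", "b", "c"}, {"c", "d"}, {"d", "e"}]
for pair, shared in clique_intersection_graph(cliques).items():
    print(pair, sorted(shared))
```

Raising `min_shared` merges fewer cliques into the same neighbourhood, giving a coarser or finer view of the putative networks.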
Seven Quantitative Trait Loci
There’s a high probability that somewhere in here is a polymorphism controlling this trait. Transcript abundance can be the phenotype!
Concentrated Parental Alleles
Two Paracliques
Data Integration
Phenotypic Data (e.g., diseased versus healthy patients)
Proteomic Data (e.g., amino acid peaks from mass spec)
Transcriptomic Data (e.g., gene expression from µarrays)
Genotypic Data: SNPs
- DNA sequence variations, each occurring when a single nucleotide in the genome differs between members of a species
- highly conserved throughout evolution and within population
- almost always just two alleles
- detected with SNP arrays designed to detect polymorphisms
[Figure: Data Integration. Labels: Natural Allelic Perturbations (SNPs), mRNA Co-expression Network, Multi-Locus Genetic Regulatory Network Models, Protein-Gene Relationships, Proteins, Protein Peak Factors, Putative Biomarkers, Diseased versus Healthy]
Application, Human Health
Data Description
- Göteborg, Sweden: 56 patients and 39 controls
- Affymetrix HU133 arrays
- roughly 33,000 genes
- hay fever, eczema
- nasal secretions, lymphocytes, skin
Preprocessing
- MAS5.0
- log transformed
- centered around zero with z scores
- probesets with consistently low expression levels removed
- replicates averaged
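The preprocessing steps after MAS5.0 can be sketched as follows. This is a minimal illustration under stated assumptions: the order of operations (log transform, average replicates, filter, z-score), the base-2 logarithm, the filtering rule, and all names and values are choices made for the example; MAS5.0 itself is Affymetrix's summarization method and is not reproduced here.

```python
from math import log2
from statistics import mean, pstdev

def preprocess(signals, replicate_groups, min_log_signal):
    """Log-transform, average replicates, drop low probesets, then z-score.

    signals: {probeset: [intensity per array]}
    replicate_groups: lists of array indices that are replicates of one sample
    min_log_signal: probesets never reaching this log2 level are removed
    """
    result = {}
    for probeset, values in signals.items():
        logged = [log2(v) for v in values]
        averaged = [mean(logged[i] for i in group) for group in replicate_groups]
        if max(averaged) < min_log_signal:
            continue  # consistently low expression
        mu, sigma = mean(averaged), pstdev(averaged)
        if sigma == 0:
            continue  # a flat profile cannot be correlated with anything
        result[probeset] = [(v - mu) / sigma for v in averaged]
    return result

# Hypothetical intensities: four arrays forming two replicate pairs.
signals = {"p1": [100, 120, 400, 420], "p2": [8, 9, 10, 9]}
cleaned = preprocess(signals, replicate_groups=[[0, 1], [2, 3]], min_log_signal=6.0)
print(cleaned)
```

Probeset p2 never reaches the log2 level of 6 and is removed; p1 survives as a profile centred around zero.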
Threshold Selection
- chosen to balance graph densities
- AFFX spots retained for quality control
[Figure: Correlation Coefficient Distribution. Frequency (up to 2,500,000) versus correlation value (from -1 to 1), plotted for Patient and Control.]
Graph Properties

Control
Threshold  Vertices  Edges    Maximal Cliques  Maximum Size
0.92       4415      45471    51315            59
0.91       5317      75541    243232           66
0.90       6254      118900   1579041          71
0.89       7169      178144   15067064         79
0.88       8009      256346   240146378        84

Patient
Threshold  Vertices  Edges    Maximal Cliques  Maximum Size
0.92       2628      11322    11322            28
0.91       3405      26031    41605            35
0.90       4146      40933    114030           45
0.89       4999      62271    447176           52
0.88       5809      91152    2298595          61

Clique annotations: ribosomal or RNA-related; T-lymphocytes or epithelial cells
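To see what balancing graph densities involves, the vertex and edge counts reported for each threshold can be converted to edge densities, using the standard formula density = 2E / (V(V - 1)) for an undirected graph. The sketch below only does this arithmetic on the reported counts; how the talk's thresholds were actually chosen from such numbers is not stated in the slides.

```python
def density(vertices, edges):
    """Fraction of possible edges present in an undirected graph."""
    return 2 * edges / (vertices * (vertices - 1))

# (vertices, edges) per threshold, copied from the Graph Properties figures above.
control = {0.92: (4415, 45471), 0.91: (5317, 75541), 0.90: (6254, 118900),
           0.89: (7169, 178144), 0.88: (8009, 256346)}
patient = {0.92: (2628, 11322), 0.91: (3405, 26031), 0.90: (4146, 40933),
           0.89: (4999, 62271), 0.88: (5809, 91152)}

for threshold in sorted(control, reverse=True):
    print(threshold,
          "control %.5f" % density(*control[threshold]),
          "patient %.5f" % density(*patient[threshold]))
```

At every listed threshold the control graph has more vertices and more edges than the patient graph, so comparing the two groups at a single shared threshold would compare graphs of quite different size and density.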
Clique profiles using the five most highly represented genes:

Patient                         Control
Gene Symbol  Clique membership  Gene Symbol   Clique membership
CDH3         56%                GTPBP4        21%
FGFR3        64%                SLC25A13      24%
PPL          64%                DKFZP564O123  26%
NFIB         65%                RANBP6        27%
FGFR2        66%                UBE1C         29%
Of course gene representation is only a small part of the story.
We can use traditional algorithmic tools
- extract cores, cliques and other dense subgraphs
- check for scale-freeness, putative TFs, hubs, etc
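Of the traditional tools listed, core extraction is the cheapest. A minimal sketch of k-core extraction by repeated peeling follows: vertices with fewer than k neighbours are removed until none remain. The example graph is hypothetical.

```python
def k_core(adj, k):
    """Return the k-core as an adjacency map: the largest subgraph of minimum degree k."""
    core = {v: set(neighbours) for v, neighbours in adj.items()}
    queue = [v for v in core if len(core[v]) < k]
    while queue:
        v = queue.pop()
        if v not in core:
            continue  # already peeled via an earlier queue entry
        for u in core.pop(v):
            core[u].discard(v)
            if len(core[u]) < k:
                queue.append(u)
    return core

# Hypothetical graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
print(sorted(k_core(adj, 2)))
```

Peeling the tail leaves the triangle, which is also where any clique of size three or more must live, so cores are a common preprocessing step before the more expensive clique computations.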
We can use commercial and other tools
- sort subgraphs by ontological enrichment, CREs, etc
- compare to literature, databases, etc
- match genes and gene products with known interactions
It’s tempting to scan for your favorites...
But our goal is to identify altered interactions
Differential Analysis
Gene (vertex) comparisons:
- differential expression
- does not require multiple conditions
- compare the two lists of gene expression levels
Correlate (edge) comparisons
- differential correlation
- requires multiple conditions in control, in dose
- compare the two lists of gene-gene correlations
Putative network (clique) comparisons
- differential topology
- focus on network aka clique differences
- consider the clique intersection graph
Ongoing Work
- 62 genes pass all three screens, 6 match a known pathway
- ITK (IL2-inducible T-cell kinase), studying in depth...moving to Illumina
For Impact
- concentrate on real data, and working with bench biologists
- strategic publications (e.g., Nature Genetics, PLoS Comp Bio, etc)
Protein Complex Prediction
[Figure: Protein-Protein Interaction Network of yeast proteins. Labels: protein complexes, Peptidase activity complex, Protein binding complex, edge deleted, edge added]
Recognize as Cluster Editing
Computational Experience
- algorithms studied by Guo, Niedermeier, Damaschke, others
- synthetic graphs
- known edit distances
- various sizes, densities and distances
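Cluster Editing asks whether at most k edge insertions and deletions can turn a graph into a disjoint union of cliques. A graph is such a union exactly when it has no conflict triple: a vertex u adjacent to both v and w where v and w are not adjacent. The sketch below is the textbook three-way branching on a conflict triple, offered as an assumption about the simplest form of branching; it includes none of the refinements or interleaving with kernelization evaluated in the experiments, and the example graph is hypothetical.

```python
from itertools import combinations

def find_conflict(vertices, edges):
    """Return (u, v, w) with edges u-v and u-w but no edge v-w, or None."""
    for u in vertices:
        neighbours = [v for v in vertices if frozenset((u, v)) in edges]
        for v, w in combinations(neighbours, 2):
            if frozenset((v, w)) not in edges:
                return u, v, w
    return None

def cluster_editing(vertices, edges, k):
    """True if at most k edge edits yield a disjoint union of cliques."""
    conflict = find_conflict(vertices, edges)
    if conflict is None:
        return True
    if k == 0:
        return False
    u, v, w = conflict
    # Any solution must delete u-v, delete u-w, or insert v-w.
    return (cluster_editing(vertices, edges - {frozenset((u, v))}, k - 1)
            or cluster_editing(vertices, edges - {frozenset((u, w))}, k - 1)
            or cluster_editing(vertices, edges | {frozenset((v, w))}, k - 1))

# Hypothetical graph: the path a-b-c-d.
vertices = ["a", "b", "c", "d"]
edges = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d")]}

for k in range(3):
    print(k, cluster_editing(vertices, edges, k))
```

The search tree has at most 3^k leaves, so running time grows quickly with the edit distance tried even on small graphs.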
[Figure: two panels of runtimes (seconds, log scale) versus edit distance tried (found). One panel covers edit distances 16 to 22, the other 36 to 43. Series: Basic Branching without Interleaving, Basic Branching with Interleaving, Refined Branching without Interleaving, Refined Branching with Interleaving.]
Computational Experience
- non-monotonic behavior
- importance of interleaving
- benefits of refinement
[Figure: log runtime in seconds (1.E-03 to 1.E+06) versus edit distance tried (95 down to 20), with No instances and Yes instances marked. Annotations: 20 vertices, 60 edit distance; 27+ hours.]
Nice application, but best methods still too slow
Application, Model Organisms
Gregor Mendel, 1822-1884, pea experiments
- green vs yellow
- round vs wrinkly
- inheritance, dominant and recessive traits (alleles)
- monogenic phenotypes
- very “lucky”
- but most traits appear to be “complex” (polygenic)
- many allelic combinations convey evolutionary (dis)advantage
- simple rules of Mendelian inheritance do not apply
- need a measure of independence: Linkage Disequilibrium (LD)
LD: a measure of statistical dependence between genetic markers
- non-random association of alleles at two or more loci
- the occurrence in a population of two linked alleles at a frequency higher or lower than expected on the basis of the individual frequencies
- not necessarily on the same chromosome
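The definition above can be made concrete with the standard pairwise statistics: D = p_AB - p_A * p_B measures how far the joint allele frequency departs from what independence predicts, and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)) normalizes it to the range 0 to 1. The sketch below assumes one allele per strain at each locus (as in fully inbred lines); the genotype strings are hypothetical.

```python
def linkage_disequilibrium(locus1, locus2, allele1, allele2):
    """Return (D, r^2) for allele1 at locus1 and allele2 at locus2.

    locus1 and locus2 list one allele per strain, in the same strain order.
    Both loci must be polymorphic, otherwise r^2 is undefined.
    """
    n = len(locus1)
    p_a = sum(a == allele1 for a in locus1) / n
    p_b = sum(b == allele2 for b in locus2) / n
    p_ab = sum(a == allele1 and b == allele2 for a, b in zip(locus1, locus2)) / n
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared

# Hypothetical alleles at two SNPs across eight strains.
snp1 = "AAAATTTT"
snp2 = "CCCGGGGG"
d, r_squared = linkage_disequilibrium(snp1, snp2, "A", "C")
print(d, r_squared)
```

Nothing in the computation refers to chromosomal position, which is why the statistic applies equally to loci on different chromosomes.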
Reflects biologically meaningful association of loci
Generally a result of population history
- population genealogy
- recombination frequency
- co-adaptive allele selection
- natural selection
- other factors
Evaluation of Mus musculus breeding strategies
- Standard Inbred (SI)
- Recombinant Inbred (RI): BXD, LXS, etc
- The Collaborative Cross
Solution: Use SNPs, correlation, paraclique and proximity
[Figure: four panels, each plotting a count (up to roughly 1200 to 1400) against mutual information from 0.9 down to 0.1]
- Number of LD Networks: number of paracliques containing more than 3 SNPs, 67 SI versus 89 BXD
- Number of Non-Syntenic LD Networks: number of paracliques containing more than 3 SNPs that cross multiple chromosomes, 67 SI versus 89 BXD
- Chromosome Coverage, Standard Inbred (67 Inbred Strains): number of paracliques (size > 3) spanning 1 to 17 chromosomes
- Chromosome Coverage, Recombinant Inbred (89 BXD Strains): number of paracliques (size > 3) spanning 1 to 4 chromosomes
[Figure: Example of Contrasting Paraclique Profiles. Standard Inbred panel: SNPs labeled across Chr 1, Chr 4, Chr 7 and Chr 11 (labels include rs13476024, CEL-1_103029662, rs3664950, rs3724175, rs3674958, rs13480968, rs3718552). Recombinant Inbred panel: SNPs labeled on Chr 7 only.]
Collaborators
Research Scientists (Incomplete!): Mikael Benson, Elissa Chesler, Frank Dehne, Mike Fellows, Ivan Gerling, Dan Goldowitz, Malak Kotb, Mark Ragan, Arnold Saxton, Brynn Voy, Rob Williams, Bing Zhang
Current Students: Bhavesh Borate, Patricia Carey, John Eblen, Jeremy Jay, Zuopan Li, Sudhir Naswa, Andy Perkins, Vivek Philip, Charles Phillips, Gary Rogers, Jon Scharff, Yun Zhang