Analysis of High-Throughput Biological Data Part II: Computational Bottlenecks and Novel Applications



SLIDE 1

ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY OF TENNESSEE

NZIMA Napier 2008

Analysis of High-Throughput Biological Data Part II: Computational Bottlenecks and Novel Applications

Mike Langston

Professor Department of Electrical Engineering and Computer Science University of Tennessee

and

Collaborating Scientist Biological Sciences Division Oak Ridge National Laboratory USA

21 February 2008 NZIMA Napier 2008

SLIDE 2

Outline of Talk

  • Foundations
  • Gene Coexpression Analysis
  • Data Integration
  • Application to Human Health
  • Protein Complex Prediction
  • Application to Model Organisms


SLIDE 4

Foundations

Systems Biology

  • How do biological entities function in unison and at all levels of scale?

  • Linkage, communication and networks (graphs!)
SLIDE 5

Foundations

Systems Biology
Correlation

Here are five mouse genes with Pearson correlations of at least 0.65. What of

  • noise?
  • experimental design?
  • circadian rhythms?
  • other confounds?
  • other metrics?
SLIDE 6

Foundations

Systems Biology
Correlation

Coefficient Profiles

Sometimes via

  • Pearson
  • Spearman
  • Mutual Information
  • Etc

Other times we need

  • p-values
  • Bonferroni corrections
  • q-values
  • false discovery rates...
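
The coefficient and significance choices listed on this slide can be illustrated with a minimal sketch. It is not the toolchain used in the talk: the function names are hypothetical, the default `alpha` of 0.05 is an assumption, and the false discovery rate control shown is the Benjamini-Hochberg step-up procedure, chosen here only as one common option.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def bonferroni(pvals, alpha=0.05):
    """Flag p-values that survive a Bonferroni correction."""
    return [p <= alpha / len(pvals) for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Flag p-values accepted by the Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    keep = [False] * m
    for i in order[:cutoff]:
        keep[i] = True
    return keep
```

Spearman rewards any monotone relationship, while Pearson rewards only linear ones, which is one reason the choice of metric matters for noisy expression data.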

SLIDE 12

Foundations

Systems Biology
Correlation
Omics: key to deciphering complex systems

  • Humans: 10^13+ cells, 200+ cell types
  • Genome (blueprint, 20K+ genes, 10M+ polymorphisms)
  • Proteome (functional units, unknown # of proteins)
  • Transcriptome
      • Translation (tRNA) via transcription (mRNA)
      • Function and Signaling (siRNA, miRNA, etc)
  • Other: metabolome, lipidome, interactome, omeome!

SLIDE 14

Foundations

Systems Biology
Correlation
Omics
Visualization

  • highly dependent on scale
  • the only omics often seen is a “ridiculome”

SLIDE 15

Foundations

Systems Biology
Correlation
Omics
Visualization
Computational Tools

  • focus usually on dense subgraphs

SLIDE 16

Foundations

Systems Biology
Correlation
Omics
Visualization
Computational Tools
Maximum Clique

  • must run often
  • time is a limiting factor
  • exploit fixed-parameter tractability (FPT)
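
One textbook way to exploit fixed-parameter tractability for maximum clique is sketched below. A graph on n vertices has a clique of size n - k exactly when its complement has a vertex cover of size k, and vertex cover can be decided with a bounded search tree of depth k. This is an illustration under that assumption, not the production code behind the talk, and the function names are hypothetical.

```python
from itertools import combinations

def vertex_cover(edges, k):
    """Return a vertex cover of size at most k, or None if none exists.

    edges is a set of 2-element frozensets. Every cover must contain an
    endpoint of any chosen edge, so the search branches two ways and
    the recursion depth never exceeds k.
    """
    if not edges:
        return set()
    if k == 0:
        return None
    u, v = next(iter(edges))
    for pick in (u, v):
        remaining = {e for e in edges if pick not in e}
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {pick}
    return None

def maximum_clique(vertices, edges):
    """Return one maximum clique via vertex cover on the complement graph."""
    vertices = list(vertices)
    present = {frozenset(e) for e in edges}
    complement = {frozenset(p) for p in combinations(vertices, 2)} - present
    # Try the smallest covers first: the first hit is a minimum cover,
    # and the vertices left outside it form a maximum clique.
    for k in range(len(vertices) + 1):
        cover = vertex_cover(complement, k)
        if cover is not None:
            return set(vertices) - cover
```

The running time grows exponentially only in k, the number of vertices left out of the clique, which is why the approach pays off when the clique is large relative to the graph.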
SLIDE 17

Foundations

Systems Biology
Correlation
Omics
Visualization
Computational Tools
Maximum Clique
Maximal Clique

  • huge outputs
  • various orderings
  • memory is often the limiting factor
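
Because the output can be huge, it helps to stream maximal cliques rather than store them. The sketch below uses the Bron-Kerbosch algorithm with pivoting, written as a generator. It is a standard stand-in, not necessarily the enumeration method used by the speaker; `adj` is an assumed input format mapping each vertex to its set of neighbors.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of the graph as a frozenset.

    adj maps each vertex to the set of its neighbors. Cliques are
    yielded one at a time, so the caller decides what to keep in memory.
    """
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield frozenset(clique)
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(candidates | excluded,
                    key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            yield from expand(clique | {v},
                              candidates & adj[v],
                              excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adj), set())
```

A caller can, for example, track only the largest clique seen so far or write each clique to disk as it arrives.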
SLIDE 18

Foundations

Systems Biology
Correlation
Omics
Visualization
Computational Tools
Maximum Clique
Maximal Clique
Biclique

  • new algorithms
  • bipartite graphs
SLIDE 19

Foundations

Systems Biology
Correlation
Omics
Visualization
Computational Tools
Maximum Clique
Maximal Clique
Biclique
Paraclique

  • noisy data

SLIDE 21

Coexpression Analysis

[Figure: Toolchain. Stage labels: Raw Data (cDNA or mRNA Microarrays), Normalization, Gene Expression Profiles (Real-Valued Matrix), Correlation Computation, Edge-Weighted Complete Graph, High-Pass Filtering, Unweighted Incomplete Graph, Graph Transforms. Method labels: Unsupervised Methods (Principal Component Analysis, k-Means Clustering); Clique-Centric Methods (Paraclique, Maximal Clique, Maximum Clique, Biclique), k-Cores, k-Connected Components and HCS Subgraphs, arranged by Increasing Edge Density (and Increasing Problem Complexity). NP-complete Problems are addressed with FPT VC Codes, HPC & Novel Methods.]
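
The front half of the toolchain (normalization, correlation computation, high-pass filtering) can be sketched as follows. This is a minimal illustration, not the pipeline from the talk: the function names are hypothetical, and filtering on the absolute value of the correlation is an assumption, since the slides do not say how negative correlations are treated.

```python
import math
from itertools import combinations

def zscores(values):
    """Center a profile at zero and scale it to unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def coexpression_graph(profiles, threshold):
    """Build an unweighted graph from expression profiles.

    profiles maps a gene name to its list of expression values, one per
    sample. Once profiles are z-scored, the mean of their elementwise
    products is the Pearson correlation. Gene pairs whose absolute
    correlation reaches the threshold become edges.
    """
    z = {gene: zscores(values) for gene, values in profiles.items()}
    edges = set()
    for g, h in combinations(sorted(z), 2):
        r = sum(a * b for a, b in zip(z[g], z[h])) / len(z[g])
        if abs(r) >= threshold:
            edges.add((g, h))
    return edges
```

A profile that is constant across samples has zero variance and would need to be removed before this step.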

SLIDE 22

Coexpression Analysis

Gene (vertex) comparisons:

  • differential expression
  • does not require multiple conditions
  • compare the two lists of gene expression levels
SLIDE 23

Coexpression Analysis

Correlate (edge) comparisons

  • differential correlation
  • requires multiple conditions in control versus stimulus
  • compare two lists of gene-gene correlations
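
One way to compare two lists of gene-gene correlations is a Fisher z-transform test on each pair, sketched below. The slides do not name the statistic actually used, so this is an assumed choice, and the function names and the default cutoff of 1.96 are hypothetical.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """Standardized difference between two correlations.

    r1 and r2 are correlations estimated from n1 and n2 samples. The
    Fisher transform atanh(r) is approximately normal with variance
    1 / (n - 3), so the difference can be scaled to a z-score.
    """
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (math.atanh(r1) - math.atanh(r2)) / standard_error

def differential_edges(corr_a, n_a, corr_b, n_b, cutoff=1.96):
    """Return gene pairs whose correlation differs between two conditions.

    corr_a and corr_b map a gene pair to its correlation in each
    condition. Only pairs present in both are compared.
    """
    flagged = {}
    for pair in corr_a.keys() & corr_b.keys():
        z = fisher_z_difference(corr_a[pair], n_a, corr_b[pair], n_b)
        if abs(z) > cutoff:
            flagged[pair] = z
    return flagged
```

Correlations of exactly 1 or -1 fall outside the domain of `math.atanh` and would need to be clipped first. With many gene pairs, the resulting z-scores also call for a multiple-testing correction.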
SLIDE 24

Coexpression Analysis

Putative network (clique) comparisons

  • differential topology
  • compare cliques, sort by ontology, CREs, etc
  • consider granularity, for example, with the clique intersection graph
SLIDE 25

Coexpression Analysis

Seven Quantitative Trait Loci

There’s a high probability that somewhere in here is a polymorphism controlling this trait. Transcript abundance can be the phenotype!

SLIDE 26

Coexpression Analysis

Concentrated Parental Alleles

Two Paracliques



SLIDE 31

Data Integration

Phenotypic Data (e.g., diseased versus healthy patients)
Proteomic Data (e.g., amino acid peaks from mass spec)
Transcriptomic Data (e.g., gene expression from µarrays)
Genotypic Data: SNPs

  • DNA sequence variations, each occurring when a single nucleotide in the genome differs between members of a species
  • highly conserved throughout evolution and within population
  • almost always just two alleles
  • detected with SNP arrays designed to detect polymorphisms
SLIDE 32

Data Integration

[Figure: data integration schematic. Labels: Natural Allelic Perturbations (SNPs), shown as allele pairs T/C, C/G, A/T, G/G, C/T over a DNA sequence; mRNA Co-expression Network; Proteins; Protein Peak Factors; Protein-Gene Relationships; Multi-Locus Genetic Regulatory Network Models; Putative Biomarkers; Diseased; Healthy.]



SLIDE 36

Data Description

  • Göteborg, Sweden, 56 patients and 39 controls
  • Affymetrix HU133 arrays
  • roughly 33,000 genes
  • hay fever, eczema
  • nasal secretions, lymphocytes, skin

Preprocessing

  • MAS5.0
  • log transformed
  • centered around zero with z scores
  • probesets with consistently low expression levels removed
  • replicates averaged
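
The generic steps in this list (averaging replicates, removing consistently low probesets, log transforming, z-scoring) can be sketched as below. MAS5.0 itself is not reproduced. The input layout, the function name, the `floor` parameter and the order of the steps are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def preprocess(raw, floor):
    """Return z-scored log2 expression profiles.

    raw maps a probeset name to a list of samples, where each sample is
    a list of replicate intensities. A probeset is dropped when its
    averaged intensity stays below floor in every sample.
    """
    processed = {}
    for probeset, samples in raw.items():
        averaged = [mean(replicates) for replicates in samples]
        if max(averaged) < floor:
            continue  # consistently low expression
        logged = [math.log2(value) for value in averaged]
        center, spread = mean(logged), pstdev(logged)
        processed[probeset] = [(value - center) / spread for value in logged]
    return processed
```

Averaging replicates before or after the log transform gives slightly different values, so the order should match whatever the downstream analysis assumes.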

Threshold Selection

  • chosen to balance graph densities
  • AFFX spots retained for quality control

Application, Human Health

SLIDE 37

[Figure: Correlation Coefficient Distribution. Frequency (axis ticks from 500,000 to 2,500,000) plotted against Correlation Value (from -1 to about 1), with separate series for Patient and Control.]

Application, Human Health

SLIDE 38

Graph Properties

Control

Threshold   Vertices   Edges     Maximal Cliques   Maximum Size
0.92        4415       45471     51315             59
0.91        5317       75541     243232            66
0.90        6254       118900    1579041           71
0.89        7169       178144    15067064          79
0.88        8009       256346    240146378         84

Patient

Threshold   Vertices   Edges     Maximal Cliques   Maximum Size
0.92        2628       11322     11322             28
0.91        3405       26031     41605             35
0.90        4146       40933     114030            45
0.89        4999       62271     447176            52
0.88        5809       91152     2298595           61

Annotations: ribosomal or RNA-related; T-lymphocytes or epithelial cells

Application, Human Health


SLIDE 40

Clique profiles using the five most highly represented genes:

Patient                              Control
Clique membership   Gene Symbol      Clique membership   Gene Symbol
56%                 CDH3             21%                 GTPBP4
64%                 FGFR3            24%                 SLC25A13
64%                 PPL              26%                 DKFZP564O123
65%                 NFIB             27%                 RANBP6
66%                 FGFR2            29%                 UBE1C

Of course gene representation is only a small part of the story.
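
A clique profile of this kind could be tallied as in the sketch below. It assumes that "clique membership" means the percentage of enumerated cliques containing a gene, which the slides do not state explicitly. The function name is hypothetical.

```python
from collections import Counter

def clique_membership(cliques, top=5):
    """Return the most represented genes with their membership percentage.

    cliques is a list of gene collections. Each gene is counted once per
    clique, and the percentage is taken over the total number of cliques.
    """
    counts = Counter(gene for clique in cliques for gene in set(clique))
    total = len(cliques)
    return [(gene, 100.0 * count / total)
            for gene, count in counts.most_common(top)]
```

For example, a gene present in three of four cliques is reported with a membership of 75.0.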

Application, Human Health


SLIDE 44

We can use traditional algorithmic tools

  • extract cores, cliques and other dense subgraphs
  • check for scale-freeness, putative TFs, hubs, etc
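
Core extraction, the first of these traditional tools, can be sketched with the usual peeling procedure: repeatedly delete vertices of degree below k until none remain. The function name and the adjacency format are assumptions for illustration.

```python
def k_core(adj, k):
    """Return the vertex set of the k-core.

    adj maps each vertex to the set of its neighbors. The k-core is the
    largest subgraph in which every vertex has at least k neighbors.
    """
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    stack = [v for v in adj if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already peeled through another neighbor
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                degree[w] -= 1
                if degree[w] < k:
                    stack.append(w)
    return alive
```

Cores are cheap to compute and every clique of size k + 1 lies inside the k-core, so peeling is a natural filter before the more expensive clique searches.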

We can use commercial and other tools

  • sort subgraphs by ontological enrichment, CREs, etc
  • compare to literature, databases, etc
  • match genes and gene products with known interactions

It’s tempting to scan for your favorites...

But our goal is to identify altered interactions

Application, Human Health


SLIDE 47

Differential Analysis

Gene (vertex) comparisons:

  • differential expression
  • does not require multiple conditions
  • compare the two lists of gene expression levels

Correlate (edge) comparisons

  • differential correlation
  • requires multiple conditions in control, in dose
  • compare the two lists of gene-gene correlations

Putative network (clique) comparisons

  • differential topology
  • focus on network aka clique differences
  • consider the clique intersection graph

Ongoing Work

  • 62 genes pass all three screens, 6 match a known pathway
  • ITK (IL2-inducible T-cell kinase), studying in depth...moving to Illumina

For Impact

  • concentrate on real data, and working with bench biologists
  • strategic publications (e.g., Nature Genetics, PLoS Comp Bio, etc)

Application, Human Health



SLIDE 50

Protein Complex Prediction

[Figure: Protein-Protein Interaction Network of yeast proteins. Two protein complexes are highlighted, a peptidase activity complex and a protein binding complex, with individual edges marked as deleted or added.]

Recognize as Cluster Editing
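
Cluster editing asks whether at most k edge insertions and deletions turn a graph into a disjoint union of cliques. A minimal sketch of the basic bounded search tree follows: a graph is a union of cliques exactly when it has no induced path u - v - w, and any such path must be repaired by one of three edits. This is the plain textbook branching, without the refinements or interleaving discussed in the talk, and the function names are hypothetical.

```python
from itertools import combinations

def find_conflict(vertices, edges):
    """Return (u, v, w) with edges uv and vw but no edge uw, or None."""
    for v in vertices:
        neighbors = [u for u in vertices if frozenset((u, v)) in edges]
        for u, w in combinations(neighbors, 2):
            if frozenset((u, w)) not in edges:
                return u, v, w
    return None

def solve(vertices, edges, k):
    """Return a list of at most k edits, or None if no such list exists."""
    conflict = find_conflict(vertices, edges)
    if conflict is None:
        return []  # already a disjoint union of cliques
    if k == 0:
        return None
    u, v, w = conflict
    branches = [
        ("delete", u, v, edges - {frozenset((u, v))}),
        ("delete", v, w, edges - {frozenset((v, w))}),
        ("add", u, w, edges | {frozenset((u, w))}),
    ]
    for operation, a, b, edited in branches:
        rest = solve(vertices, edited, k - 1)
        if rest is not None:
            return [(operation, a, b)] + rest
    return None

def cluster_editing(vertices, edges, k):
    """Decide cluster editing with budget k and return the edits found."""
    return solve(list(vertices), frozenset(frozenset(e) for e in edges), k)
```

The search tree has at most 3^k leaves. Calling the function with increasing k until it succeeds yields a minimum edit set.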

SLIDE 51

Protein Complex Prediction

Computational Experience

  • algorithms studied by Guo, Niedermeier, Damaschke, others
  • synthetic graphs
  • known edit distances
  • various sizes, densities and distances
slide-52
SLIDE 52

52

ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY OF TENNESSEE

NZIMA Napier 2008

Protein Complex Prediction

[Figure: runtimes in seconds (log scale) against edit distance tried (found). One panel compares Basic Branching and Refined Branching, each with and without Interleaving, for edit distances 16 (No), 17 (No), 18 (No), 19 (No), 20 (20), 21 (21/20), 22 (21/20), on an axis from 1.E-03 to 1.E+04. A second panel compares Refined Branching with and without Interleaving for edit distances 36 (No), 37 (No), 38 (No), 39 (No), 40 (40), 41 (40), 42 (40), 43 (40), on an axis from 1.E+00 to 1.E+06.]

Computational Experience

  • non-monotonic behavior
  • importance of interleaving
  • benefits of refinement
SLIDE 53

Protein Complex Prediction

Nice application, but best methods still too slow

[Figure: Log (Runtime in Seconds), from 1.E-03 to 1.E+06, against Edit Distance Tried, from 20 to 95, with No instances and Yes instances distinguished. Annotations: 27+ hours; 20 vertices, 60 edit distance.]


SLIDE 55

Application, Model Organisms

Gregor Mendel, 1822-1884: pea experiments

  • green vs yellow
  • round vs wrinkly
  • inheritance, dominant and recessive traits (alleles)
  • monogenic phenotypes
  • very “lucky”
SLIDE 56

Application, Model Organisms

Gregor Mendel, pea experiments

  • green vs yellow
  • round vs wrinkly
  • inheritance, dominance, monogenic phenotypes
  • but most traits appear to be “complex” (polygenic)
  • many allelic combinations convey evolutionary (dis)advantage
  • simple rules of Mendelian inheritance do not apply
  • need a measure of independence: Linkage Disequilibrium (LD)

SLIDE 59

Application, Model Organisms

LD: a measure of statistical dependence between genetic markers

  • non-random association of alleles at two or more loci
  • the occurrence in a population of two linked alleles at a frequency higher or lower than expected on the basis of the individual frequencies
  • not necessarily on the same chromosome

Reflects biologically meaningful association of loci

Generally a result of population history

  • population genealogy
  • recombination frequency
  • co-adaptive allele selection
  • natural selection
  • other factors
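
The definition above, a joint frequency higher or lower than expected from the individual frequencies, corresponds to the classical coefficient D = p_AB - p_A p_B and its normalized form r^2. The sketch below computes both for two biallelic loci. The 0/1 allele coding and the function name are assumptions for illustration.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    haplotypes is a list of (allele_at_locus_1, allele_at_locus_2)
    pairs with alleles coded 0 or 1. D is the excess of the observed
    joint frequency of the two 1-alleles over the product of their
    individual frequencies.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared
```

Two loci that always travel together give r^2 = 1, and independent loci give D = 0. A locus with only one allele in the sample makes the denominator zero and has to be excluded.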

SLIDE 60

Application, Model Organisms

Evaluation of Mus musculus breeding strategies

  • Standard Inbred (SI)
  • Recombinant Inbred (RI): BXD, LXS, etc
  • The Collaborative Cross

Solution: Use SNPs, correlation, paraclique and proximity

SLIDE 61

Application, Model Organisms

[Figure: Number of LD Networks and Number of Non-Syntenic LD Networks. Two panels plot, against Mutual Information threshold (0.9 down to 0.1), the number of paracliques containing more than 3 SNPs and the number of such paracliques that cross multiple chromosomes, for 67 standard inbred strains (67SI) and 89 BXD strains (89BXD).]

[Figure: Chromosome Coverage. Number of paracliques (size > 3) against Mutual Information threshold (0.9 down to 0.1), broken down by the number of chromosomes spanned: 1 to 17 chromosomes for the 67 inbred strains (Standard Inbred) and 1 to 4 chromosomes for the 89 BXD strains (Recombinant Inbred).]
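
The plots above threshold SNP pairs by mutual information. A minimal sketch of that measure for two genotype vectors follows. It uses empirical frequencies and reports bits, and the slides do not say whether the values plotted were normalized, so treat the scale as an assumption. The function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length label lists.

    x and y hold one allele label per strain. Frequencies are estimated
    by counting, and I(X;Y) is the sum over observed label pairs of
    p(a,b) * log2(p(a,b) / (p(a) * p(b))).
    """
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    total = 0.0
    for (a, b), c in count_xy.items():
        total += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return total
```

Unlike Pearson correlation, the measure works directly on categorical allele labels and does not depend on how the alleles are coded.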

SLIDE 62

Application, Model Organisms

[Figure: Example of Contrasting Paraclique Profiles.

Standard Inbred: a paraclique spanning Chr 1, Chr 4, Chr 7 and Chr 11, with SNPs rs13476024, CEL-1_103029662, rs3664950, rs3724175, rs3674958, rs13480968, rs3718552.

Recombinant Inbred: a paraclique on Chr 7, with SNPs rs8243991, UT_7_136.88578, rs4226997, rs3714636, rs3694146, rs6334210, rs13479553, mCV22291963, rs13479554, rs13479555, rs6392543, CEL-7_126142971, rs3663988, rs6303477, rs3660122, rs3659292, rs13479566, rs13479567, rs6212186, rs13479569, rs13479570, rs3666160, rs13479571, CEL-7_126301023, rs13479559, CEL-7_126570687, rs6216320.]

SLIDE 63

Collaborators

Research Scientists (Incomplete!): Mikael Benson, Elissa Chesler, Frank Dehne, Mike Fellows, Ivan Gerling, Dan Goldowitz, Malak Kotb, Mark Ragan, Arnold Saxton, Brynn Voy, Rob Williams, Bing Zhang

Current Students: Bhavesh Borate, Patricia Carey, John Eblen, Jeremy Jay, Zuopan Li, Sudhir Naswa, Andy Perkins, Vivek Philip, Charles Phillips, Gary Rogers, Jon Scharff, Yun Zhang

SLIDE 64

Geeks Я Us